CN105405439A - Voice playing method and device - Google Patents

Voice playing method and device

Info

Publication number
CN105405439A
CN105405439A CN201510757786.6A CN201510757786A
Authority
CN
China
Prior art keywords
voice segments
speech segment
key message
message section
speech data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510757786.6A
Other languages
Chinese (zh)
Other versions
CN105405439B (en
Inventor
高建清
王智国
胡国平
胡郁
刘庆峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201510757786.6A priority Critical patent/CN105405439B/en
Publication of CN105405439A publication Critical patent/CN105405439A/en
Application granted granted Critical
Publication of CN105405439B publication Critical patent/CN105405439B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/04 - Segmentation; Word boundary detection
    • G10L15/05 - Word boundary detection
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/04 - Training, enrolment or model building
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0264 - Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 - Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude

Abstract

The invention discloses a voice playing method and device. The method comprises the following steps: receiving voice data to be played; performing endpoint detection on the voice data to obtain individual voice segments; determining whether each voice segment is a key information segment; and, during playback, adjusting the playback speech rate of the voice data according to the key information segments. The method and device help users find the voice segments they care about quickly and accurately.

Description

Speech playing method and device
Technical field
The present invention relates to the field of speech signal processing, and in particular to a speech playing method and device.
Background art
At present, more and more people prefer recording audio over taking written notes: in meetings, the proceedings are recorded for later review; in interviews, the conversation is recorded and the transcript is edited from it; in class, students record the parts they do not understand so they can review them afterwards. However, when the amount of recorded data is large, it is difficult to locate the valuable content quickly and accurately. To shorten playback time, existing speech playing methods generally apply endpoint detection to find pure-noise or silent segments, skip them, and play the remaining speech data at normal speed. In practice, however, unimportant content is often recorded as well, so during playback the user has to switch to fast-forward manually or skip it outright. Moreover, when the recording environment is poor, the recorded speech quality is often low, and the user has to replay passages manually and repeatedly to make out the content, which greatly degrades the user experience.
Summary of the invention
The invention provides a speech playing method and device that help the user find the voice segments of interest quickly and accurately.
To this end, the invention provides the following technical solution:
A speech playing method, comprising:
receiving speech data to be played;
performing endpoint detection on the speech data to be played to obtain individual voice segments;
determining, according to the voice content and/or the voiceprint features of each voice segment, whether the voice segment is a key information segment;
when playing the speech data to be played, adjusting its playback speech rate according to the key information segments.
Preferably, determining according to the voice content of each voice segment whether the voice segment is a key information segment comprises:
performing speech recognition on each voice segment to obtain the recognized text of each voice segment;
determining, according to the recognized text of each voice segment, whether the voice segment is a key information segment.
Preferably, determining according to the recognized text of each voice segment whether the voice segment is a key information segment comprises:
determining whether the recognized text of each voice segment contains a preset keyword;
if so, determining that the voice segment is a key information segment.
Preferably, determining according to the recognized text of each voice segment whether the voice segment is a key information segment comprises:
extracting summary voice segments from all voice segments iteratively, obtaining multiple summary voice segments once a set number of iterations has been reached, and taking these summary voice segments as the key information segments.
Preferably, extracting a summary voice segment from all voice segments comprises:
calculating the similarity between the recognized text of the current voice segment and the recognized text of the whole speech data to be played, to obtain a first value;
calculating the similarity between the recognized text of the current voice segment and the recognized text of the summary voice segments already extracted, to obtain a second value;
calculating the difference between the first value and the second value as the summary score of the current voice segment;
after the summary scores of all voice segments have been obtained, selecting the voice segment with the highest summary score as a summary voice segment.
Preferably, determining according to the voiceprint features of each voice segment whether the voice segment is a key information segment comprises:
if the speech data to be played contains speech from multiple speakers, extracting the voiceprint features of each voice segment;
determining, according to the voiceprint features and the voiceprint model of a specific speaker, whether the voice segment is speech of that specific speaker;
if so, determining that the voice segment is a key information segment.
Preferably, determining according to the voiceprint features of each voice segment whether the voice segment is a key information segment comprises:
if the speech data to be played contains speech from multiple speakers, determining the main speaker by speaker separation;
taking the voice segments of the main speaker as key information segments.
Preferably, adjusting the speech rate of the speech data to be played according to the key information segments comprises:
if the current voice segment is a key information segment, playing it at normal speed, otherwise playing it at fast speed; or
if the current voice segment is a key information segment, playing it at slow speed, otherwise playing it at normal or fast speed.
Preferably, the method further comprises:
obtaining the confidence of each voice segment;
adjusting the speech rate of the speech data to be played then specifically comprises: adjusting the speech rate according to the key information segments and the confidence of each voice segment.
Preferably, adjusting the speech rate of the speech data to be played according to the key information segments and the confidence of each voice segment comprises:
if the current voice segment is a key information segment: playing it at fast speed when its confidence is greater than a second threshold, and at slow speed otherwise;
if the current voice segment is a non-key information segment: skipping it when its confidence is greater than the second threshold, and playing it at slow speed when its confidence is less than or equal to a first threshold, the first threshold being smaller than the second threshold.
Preferably, the method further comprises:
performing signal-level analysis on each voice segment, the signal-level analysis comprising any one or more of: volume variation, reverberation, and noise;
when playing the speech data to be played, optimizing the voice segments according to the analysis results, the optimization comprising any one or more of:
(1) if the amplitude of multiple consecutive frames in the current voice segment exceeds an upper limit, turning the amplitude of the segment down; if the amplitude of multiple consecutive frames is below a lower limit, turning the amplitude up;
(2) if the reverberation time of the current voice segment exceeds a threshold, performing dereverberation on the segment;
(3) if the signal-to-noise ratio of the current voice segment is below an SNR threshold, performing noise reduction on the segment.
A voice playing device, comprising:
a receiving module for receiving speech data to be played;
an endpoint detection module for performing endpoint detection on the speech data to be played to obtain individual voice segments;
a key information segment determination module comprising a first determination module and/or a second determination module, the first determination module determining, according to the voice content of each voice segment, whether the voice segment is a key information segment, and the second determination module determining the same according to the voiceprint features of each voice segment;
a playing module for playing the speech data to be played;
a speech rate adjusting module for adjusting, while the playing module plays the speech data to be played, the speech rate of the speech data according to the key information segments.
Preferably, the first determination module comprises:
a speech recognition unit for performing speech recognition on each voice segment to obtain its recognized text;
a determining unit for determining, according to the recognized text of each voice segment, whether the voice segment is a key information segment.
Preferably, the determining unit is specifically configured to determine whether the recognized text of each voice segment contains a preset keyword and, if so, to determine that the voice segment is a key information segment.
Preferably, the determining unit comprises:
an iteration count setting subunit for setting the number of iterations;
a summary extraction subunit for iteratively extracting summary voice segments from all voice segments;
a judging subunit for judging whether the set number of iterations has been reached and, when it has, triggering the summary extraction subunit to stop iterating;
a key information segment obtaining subunit for obtaining, after the summary extraction subunit stops iterating, all summary voice segments extracted so far and taking them as key information segments.
Preferably, the summary extraction subunit comprises:
a first calculating subunit for calculating the similarity between the recognized text of the current voice segment and the recognized text of the speech data to be played, obtaining a first value;
a second calculating subunit for calculating the similarity between the recognized text of the current voice segment and the recognized text of the summary voice segments already extracted, obtaining a second value;
a difference calculating subunit for calculating the difference between the first value and the second value as the summary score of the current voice segment;
a selecting subunit for selecting, after the summary scores of all voice segments have been obtained, the voice segment with the highest summary score as a summary voice segment.
Preferably, the second determination module comprises:
a voiceprint feature extraction unit for extracting the voiceprint features of each voice segment when the speech data to be played contains speech from multiple speakers;
a voiceprint recognition unit for determining, according to the voiceprint features and the voiceprint model of a specific speaker, whether the voice segment is speech of that specific speaker and, if so, determining that the voice segment is a key information segment.
Preferably, the second determination module comprises:
a voiceprint feature extraction unit for extracting the voiceprint features of each voice segment when the speech data to be played contains speech from multiple speakers;
a speaker separation unit for determining the main speaker from the voiceprint features by speaker separation, and taking the voice segments of the main speaker as key information segments.
Preferably, the speech rate adjusting module is specifically configured to play the current voice segment at normal speed when it is a key information segment and at fast speed otherwise; or to play the current voice segment at slow speed when it is a key information segment and at normal or fast speed otherwise.
Preferably, the device further comprises:
a confidence obtaining module for obtaining the confidence of each voice segment;
the speech rate adjusting module being specifically configured to adjust the speech rate of the speech data to be played according to the key information segments and the confidence of each voice segment.
Preferably, the device further comprises:
a signal analysis and processing module for performing signal-level analysis on each voice segment and optimizing the voice segments according to the analysis results; the signal analysis and processing module comprises any one or more of the following units:
a volume analysis and processing unit for calculating the amplitude of each frame of speech data, turning the amplitude of the current voice segment down when the amplitude of multiple consecutive frames exceeds an upper limit, and turning it up when the amplitude of multiple consecutive frames is below a lower limit;
a reverberation analysis and processing unit for calculating the reverberation time of each voice segment and performing dereverberation on the current voice segment when its reverberation time exceeds a threshold;
a noise analysis and processing unit for estimating the noise of each voice segment and performing noise reduction on the current voice segment when its signal-to-noise ratio is below the SNR threshold.
The speech playing method and device provided by the embodiments of the present invention analyze the speech data to be played, determine the key information segments in it, and adjust the playback speech rate according to those segments during playback. This helps the user find the voice segments of interest, or the valuable ones, quickly and accurately in a large amount of speech data, for example the voice segments relevant to the topic of a long meeting recording.
Brief description of the drawings
To explain the embodiments of the present application or the prior-art solutions more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art could derive other drawings from them.
Fig. 1 is a flowchart of a speech playing method according to an embodiment of the present invention;
Fig. 2 is a structural diagram of a voice playing device according to an embodiment of the present invention;
Fig. 3 is another structural diagram of the voice playing device;
Fig. 4 is a further structural diagram of the voice playing device.
Detailed description of the embodiments
To enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings.
As shown in Fig. 1, the flowchart of the speech playing method of an embodiment of the present invention comprises the following steps:
Step 101: receive the speech data to be played.
The speech data may be, for example, a recording of a TV program, an interview, or a meeting.
Step 102: perform endpoint detection on the speech data to be played to obtain individual voice segments.
Endpoint detection may use existing techniques, for example detection based on short-time energy and short-time average zero-crossing rate, on cepstral features, or on entropy.
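As an illustration of the first of those prior-art techniques, a minimal sketch of short-time-energy endpoint detection is given below; the frame length, hop, and threshold ratio are illustrative assumptions, not values from the patent.

```python
import numpy as np

def endpoint_detect(signal, frame_len=400, hop=160, energy_ratio=0.1):
    """Split a waveform into voice segments by short-time energy.

    A frame is marked as speech when its energy exceeds a threshold
    derived from the mean frame energy; runs of speech frames are
    merged into (start, end) sample ranges.
    """
    n_frames = max(0, 1 + (len(signal) - frame_len) // hop)
    energies = np.array([
        np.sum(signal[i * hop: i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])
    threshold = energy_ratio * energies.mean()
    speech = energies > threshold

    segments, start = [], None
    for i, flag in enumerate(speech):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start * hop, (i - 1) * hop + frame_len))
            start = None
    if start is not None:
        segments.append((start * hop, (len(speech) - 1) * hop + frame_len))
    return segments
```

Applied to a recording with silence around a burst of speech, this yields one (start, end) range roughly covering the speech region; a production detector would add hangover smoothing and an adaptive noise floor.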
Step 103: determine, according to the voice content and/or the voiceprint features of each voice segment, whether the voice segment is a key information segment.
The alternatives are described in detail below.
1. Determining key information segments from the voice content of each voice segment
First, speech recognition is performed on each voice segment to obtain its recognized text; then, according to the recognized text, it is determined whether the voice segment is a key information segment. Specifically, either a preset-keyword method or a summary extraction method may be used.
In the preset-keyword method, the user presets the keywords of the speech data to be played. For each voice segment, it is judged whether the recognized text of the segment contains a keyword; the judgment may use exact or fuzzy matching of the words in the recognized text against the keyword. If the match succeeds, the segment is considered to contain the keyword and is accordingly taken as a key information segment.
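The keyword check can be sketched as follows, assuming exact substring matching on the recognized text (fuzzy matching is omitted); the function and variable names are illustrative, not taken from the patent.

```python
def is_key_segment(recognized_text, keywords):
    """Exact-match check: a segment is a key information segment
    when its recognized text contains any preset keyword."""
    return any(kw in recognized_text for kw in keywords)

# Example: flag segments whose transcript mentions a preset topic word.
transcripts = ["the budget review starts now", "small talk about lunch"]
flags = [is_key_segment(t, ["budget", "deadline"]) for t in transcripts]
```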
In the summary extraction method, summary voice segments are extracted from the recognized text of the speech data to be played and taken as key information segments. Specifically, summary voice segments may be extracted iteratively from all voice segments, yielding multiple summary voice segments after a set number of iterations. In each iteration the voice segment most relevant to the whole recognized text is selected as a summary voice segment, for example by the maximal marginal relevance (MMR) method. To compute the relevance, first the similarity between the recognized text of the current voice segment and the recognized text of the whole speech data is calculated, giving a first value; then the similarity between the recognized text of the current voice segment and that of the summary voice segments already extracted is calculated, giving a second value; finally the difference between the two is taken as the score of the current voice segment, as shown in equation (1).
MMR(S_i) = α * Sim_1(S_i, D) - (1 - α) * Sim_2(S_i, Sum)    (1)
where MMR(S_i) is the score of the i-th voice segment, D is the recognized-text vector of the whole speech data to be played, S_i is the recognized-text vector of the current voice segment, Sum is the recognized-text vector of the summary voice segments already extracted, and α is a weight parameter whose value may be chosen from experience or experiment. Sim_1(S_i, D) is the similarity between the recognized text of the current voice segment and that of the whole speech data to be played; Sim_2(S_i, Sum) is the similarity between the recognized text of the current voice segment and that of the summary voice segments already extracted.
It should be noted that text vectorization may use existing techniques, which are not repeated here. The similarity may be measured by a distance between vectors, such as the cosine distance, again computed with existing methods.
After the scores of all voice segments have been computed, the voice segment with the highest score is selected as a summary voice segment, and the next iteration begins.
After several iterations, multiple summary voice segments are obtained and taken as the key information segments.
It should also be noted that any other prior-art summarization method may be used for the extraction; the embodiments of the present invention are not limited in this respect.
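Assuming a simple bag-of-words representation and cosine similarity for both Sim_1 and Sim_2, the iterative extraction of equation (1) might be sketched as follows; the vectorization, the value of α, and the iteration count are illustrative choices, not prescribed by the patent.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mmr_summary(texts, n_iter=2, alpha=0.7):
    """Pick summary segments iteratively per equation (1): score each
    remaining segment by alpha * Sim(segment, whole document) minus
    (1 - alpha) * Sim(segment, summary so far), and take the
    highest-scoring segment each round."""
    vecs = [Counter(t.split()) for t in texts]
    doc = Counter(w for t in texts for w in t.split())
    summary_idx, summary_vec = [], Counter()
    for _ in range(min(n_iter, len(texts))):
        best, best_score = None, -float("inf")
        for i, v in enumerate(vecs):
            if i in summary_idx:
                continue
            score = alpha * cosine(v, doc) - (1 - alpha) * cosine(v, summary_vec)
            if score > best_score:
                best, best_score = i, score
        summary_idx.append(best)
        summary_vec += vecs[best]
    return summary_idx
```

In a real system the recognized texts would be produced by the recognizer of step 103 and the vectors might come from TF-IDF or embeddings; the iterative structure is the same.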
2. Determining key information segments from the voiceprint features of each voice segment
If the speech data to be played contains speech from multiple speakers, the speech of a specific speaker or of the main speaker may be taken as the key information segments.
The specific speaker may be preset by the user. Voiceprint recognition then determines whether each voice segment is speech of that speaker; if so, the segment is a key information segment. Note that in this embodiment the speech data of the specific speaker must be collected in advance to train his or her voiceprint model. During voiceprint recognition, the voiceprint features of each voice segment are extracted and matched against the model to decide whether the segment is speech of the specific speaker.
For the main speaker, speaker separation applied to the voiceprint features of each voice segment separates the speech of the individual speakers. The speaker with the most speech data may be taken as the main speaker, or else the most active speaker, i.e. the one whose speech usually spans the whole recording, appearing at the beginning, the middle, and the end; the position of each separated segment within the whole recording can be determined from its timestamps. Speaker separation may use existing techniques, for example clustering algorithms.
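As a toy illustration of the clustering route to speaker separation, the sketch below groups per-segment feature vectors with plain k-means and takes the largest cluster as the main speaker. Real systems would use proper voiceprint features (e.g. i-vectors or neural embeddings) rather than the low-dimensional toy vectors assumed here, and a clustering method that estimates the speaker count.

```python
import numpy as np

def main_speaker_segments(features, n_speakers=2, n_iter=20, seed=0):
    """Cluster per-segment voiceprint features with k-means and return
    the indices of the segments in the largest cluster, i.e. the
    presumed main speaker's segments."""
    rng = np.random.default_rng(seed)
    feats = np.asarray(features, dtype=float)
    centers = feats[rng.choice(len(feats), n_speakers, replace=False)]
    for _ in range(n_iter):
        # Assign each segment to its nearest center, then recompute centers.
        dists = ((feats[:, None] - centers[None]) ** 2).sum(-1)
        labels = np.argmin(dists, axis=1)
        for k in range(n_speakers):
            if np.any(labels == k):
                centers[k] = feats[labels == k].mean(axis=0)
    main = np.bincount(labels, minlength=n_speakers).argmax()
    return [i for i, lab in enumerate(labels) if lab == main]
```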
3. Determining key information segments from both the voice content and the voiceprint features
This case combines the two above. A practical example is a meeting recording with several speakers in which the user cares only about part of what one particular person says. Here, voiceprint recognition with that speaker's voiceprint model first identifies that speaker's voice segments, and preset keywords then pick out, among those segments, the ones containing the keywords; these are taken as the key information segments.
Thus the invention can satisfy not only the user's interest in particular content of the speech to be played, but also, when there are multiple speakers, the user's interest in particular speakers.
Step 104: when playing the speech data to be played, adjust its speech rate according to the key information segments.
In the embodiments of the present invention, the speech rate is adjusted segment by segment, and the concrete policy can be chosen to suit the application; the embodiments are not limited in this respect. For example, when a user transcribes an interview from its recording, key information segments may be played at normal speed and non-key segments at fast speed; when a student reviews a classroom recording, key information segments may be played at slow speed and non-key segments at normal or fast speed.
In summary, the speech playing method of the embodiments of the present invention analyzes the speech data to be played at the level of voice content and/or voiceprint features, determines the key information segments, and adjusts the playback speech rate according to them, thereby helping the user find the voice segments of interest, or the valuable ones, quickly and accurately in a large amount of speech data, for example the segments relevant to the topic of a long meeting recording.
To further improve playback and ensure the validity of the key information segments, another embodiment of the method also considers the confidence of each voice segment when adjusting the speech rate according to the key information segments. Specifically, during speech recognition the posterior probability of each word in a voice segment can be obtained; averaging the posterior probabilities of all words in the segment yields the posterior probability of the segment, which serves as its confidence. Of course, the confidence may also be computed by other methods, for example by statistical modeling over text features (mainly semantic features of the recognized text) or acoustic features of the segment. Note that even when the key information segments are determined from voiceprint features alone, the per-word posterior probabilities, and hence the segment confidences, can still be obtained from the decoding pass of a speech recognizer.
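The confidence computation just described, averaging the per-word posterior probabilities, reduces to a small helper; the input is assumed to be the posteriors already emitted by the recognizer for one segment.

```python
def segment_confidence(word_posteriors):
    """Segment confidence = mean of the per-word posterior
    probabilities produced by the speech recognizer; an empty
    segment gets confidence 0."""
    if not word_posteriors:
        return 0.0
    return sum(word_posteriors) / len(word_posteriors)
```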
In this embodiment, the speech rate of the speech data to be played is adjusted according to the key information segments and the confidence of each voice segment.
Specifically, the confidence values may be divided into several grades in advance by setting one or more thresholds. With two thresholds, a first threshold and a larger second threshold, the confidence values fall into three grades: high, medium, and low; the speech rate is then adjusted according to the grade and whether the segment is a key information segment. The rates used are fast, normal, and slow; the exact fast and slow values can be chosen for the application. If the speech data to be played is the recording of an important meeting, for example, the fast rate might be set 5% faster than normal and the slow rate 10% slower.
Taking two confidence thresholds as an example, the speaking-rate adjustment of the speech data to be played proceeds as follows:
Step a) obtain the confidence of the current speech segment;
Step b) compare the confidence against the preset thresholds: if it is greater than the second threshold, go to step c); if it lies between the first and second thresholds, go to step d); if it is less than the first threshold, go to step e);
Step c) judge whether the current speech segment is a key information segment; if so, adjust its speaking rate to the fast rate; otherwise, skip the current segment entirely;
Step d) judge whether the current speech segment is a key information segment; if so, adjust its speaking rate to the slow rate; if not, leave it unadjusted, i.e. play it at the normal rate;
Step e) adjust the speaking rate of the current speech segment to the slow rate.
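The steps a) to e) above can be collapsed into a single decision function. The threshold values and the return labels are illustrative assumptions; the text leaves the concrete values open.

```python
def playback_action(confidence, is_key,
                    first_threshold=0.4, second_threshold=0.7):
    """Map a segment's confidence and key/non-key status to a playback
    decision following steps a)-e): 'fast', 'normal', 'slow', or 'skip'.
    """
    if confidence > second_threshold:      # high confidence -> step c)
        return 'fast' if is_key else 'skip'
    if confidence >= first_threshold:      # medium confidence -> step d)
        return 'slow' if is_key else 'normal'
    return 'slow'                          # low confidence -> step e)
```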
Of course, adjusting the speaking rate of the speech data to be played according to the key information segments and the confidence of each speech segment is not limited to the specific scheme above; other adjustment schemes may be used according to application demands, and the embodiment of the present invention is not limited in this respect.
To further improve the quality of the speech data to be played and the listening experience of the user, in another embodiment of the speech playing method of the present invention, a signal-level analysis may also be performed on the speech data to be played, and the speech segments optimized according to the analysis results. The signal-level analysis may cover any one or more of the following: volume variation of the speech data, the amount of reverberation, and noise conditions. Correspondingly, when the segments are optimized according to the analysis results, one or more of these signal analyses and treatments may be applied. The different treatments are described separately below.
1) Optimizing the speech data according to the volume analysis
In the volume analysis, the amplitude of every frame of speech data is computed frame by frame. If the amplitude of multiple consecutive frames exceeds an upper limit, the current speech segment is considered to be clipping: the playback volume would be too high and impair the listening experience, so the amplitude of the current segment needs to be turned down. If the amplitude of multiple consecutive frames falls below a lower limit, the amplitude of the current segment is considered too small for the user to catch the speech content easily, and it needs to be turned up.
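A minimal sketch of the frame-amplitude check described above; the limits, the required run length, and the return labels are illustrative assumptions.

```python
def volume_flag(frame_amplitudes, upper=30000, lower=500, run=5):
    """Detect clipping or overly quiet audio from per-frame amplitudes.

    Returns 'turn_down' if `run` consecutive frames exceed `upper`
    (likely clipping), 'turn_up' if `run` consecutive frames fall
    below `lower`, and 'ok' otherwise.
    """
    over = under = 0
    for amp in frame_amplitudes:
        over = over + 1 if amp > upper else 0
        under = under + 1 if amp < lower else 0
        if over >= run:
            return 'turn_down'
        if under >= run:
            return 'turn_up'
    return 'ok'
```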
2) Optimizing the speech data according to the reverberation analysis
In the reverberation analysis, a reverberation detection model built in advance may be used to examine the current speech segment. The model can be obtained by collecting a large amount of speech data beforehand and building a statistical model, such as a deep neural network. Spectral features extracted from the current segment are fed to the detection model to compute the segment's reverberation time T60 (the time required for the acoustic energy to decay by 60 dB after the source stops sounding). If T60 exceeds a set threshold, the current segment is considered excessively reverberant, and a dereverberation method, such as one based on inverse filtering, is applied to remove the reverberation and preserve the clarity of the segment.
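The text obtains T60 from a trained detection model, which is not reproduced here. As a stand-in that illustrates the quantity being thresholded, T60 can also be read off an energy decay curve directly; the sampling interval, the linear extrapolation, and the 0.6 s threshold are all assumptions.

```python
def estimate_t60(decay_db, dt_ms=1.0):
    """Estimate T60 (seconds) from an energy decay curve sampled every
    dt_ms milliseconds, in dB relative to an arbitrary reference.

    Returns the time at which the curve first falls 60 dB below its
    starting level; if it never does, extrapolates linearly from the
    overall decay rate.
    """
    start = decay_db[0]
    for i, level in enumerate(decay_db):
        if start - level >= 60.0:
            return i * dt_ms / 1000.0
    rate = (start - decay_db[-1]) / ((len(decay_db) - 1) * dt_ms)  # dB/ms
    return (60.0 / rate) / 1000.0

def too_reverberant(t60_seconds, threshold=0.6):
    """Flag the segment when its T60 exceeds a preset threshold."""
    return t60_seconds > threshold
```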
3) Optimizing the speech data according to the noise analysis
The signal-to-noise ratio (SNR) of the current speech data is computed. If the SNR falls below a preset SNR threshold, noise reduction is applied to the current speech segment, for example by speech enhancement. Speech enhancement extracts speech that is as clean as possible from noisy speech data and eliminates background noise, thereby improving the voice quality so that the user does not tire of listening. Concrete speech enhancement methods belong to the prior art. For example, with a neural-network approach, a speech enhancement model is first trained using the magnitude-spectrum features of a large amount of noisy speech as input and the magnitude-spectrum features of the corresponding clean speech as output. The magnitude-spectrum features of the current noisy speech data are then fed to the trained model to obtain the magnitude-spectrum features of the clean speech, and finally the clean speech data are recovered from these features together with the phase information of the original speech data.
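A sketch of the SNR gate that decides whether enhancement runs at all; the 15 dB threshold and the power-based inputs are assumptions, and the neural enhancement model itself is not reproduced here.

```python
import math

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in dB from mean signal and noise power."""
    return 10.0 * math.log10(signal_power / noise_power)

def needs_denoising(signal_power, noise_power, threshold_db=15.0):
    """Apply speech enhancement only when the segment's SNR falls
    below the preset threshold."""
    return snr_db(signal_power, noise_power) < threshold_db
```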
It should be noted that the optimization above may be applied to all speech segments of the speech data to be played or only to the key information segments, and may take place either before or after the key information segments are determined; the embodiment of the present invention places no limit on this.
To help the user find valuable speech data quickly and efficiently, the speech playing method of the embodiment of the present invention analyzes the speech data at the content and/or voiceprint level and at the signal level before playback. The key information segments of the speech data to be played are obtained from the content and/or voiceprint analysis. The signal-level analysis determines whether the speech data to be played has quality problems; if so, the speech data is optimized for those problems, improving its quality and the listening experience of the user.
Correspondingly, an embodiment of the present invention also provides a speech playing device, whose structure is shown schematically in Fig. 2.
In this embodiment, the device comprises:
a receiving module 201 for receiving speech data to be played;
an endpoint detection module 202 for performing endpoint detection on the speech data to be played to obtain the speech segments;
a key information segment determination module 203, comprising a first determination module 231 and/or a second determination module 232, the first determination module 231 determining from the speech content of each speech segment whether the segment is a key information segment, and the second determination module 232 determining from the voiceprint features of each speech segment whether the segment is a key information segment;
a playing module 204 for playing the speech data to be played;
a speaking-rate adjustment module 205 for adjusting the speaking rate of the speech data to be played according to the key information segments while the playing module 204 plays the speech data.
Fig. 2 illustrates the case in which the key information segment determination module 203 comprises both the first determination module 231 and the second determination module 232.
The first determination module 231 determines from the speech content of each speech segment whether the segment is a key information segment. It comprises a speech recognition unit and a determination unit: the speech recognition unit performs speech recognition on each speech segment to obtain its recognized text, and the determination unit determines from the recognized text of each segment whether the segment is a key information segment.
In a specific application, the determination unit may use preset keywords or summary extraction to decide whether each speech segment is a key information segment.
For example, in one embodiment, the determination unit may check whether the recognized text of each speech segment contains a preset keyword; if so, the segment is determined to be a key information segment, otherwise a non-key information segment.
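The keyword embodiment amounts to a simple containment test; the function and parameter names are illustrative.

```python
def is_key_segment(recognized_text, keywords):
    """A segment is a key information segment when its recognized
    text contains any of the preset keywords."""
    return any(kw in recognized_text for kw in keywords)
```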
As another example, in another embodiment, the determination unit may extract summary speech segments from all segments iteratively and, after a set number of iterations, obtain multiple summary segments, which serve as the key information segments. Correspondingly, its concrete structure may comprise the following sub-units:
an iteration setting sub-unit for setting the number of iterations;
a summary extraction sub-unit for extracting summary speech segments from all segments iteratively;
a judgment sub-unit for judging whether the set number of iterations has been reached, and triggering the summary extraction sub-unit to stop the iteration once it has;
a key information segment acquisition sub-unit for collecting, after the summary extraction sub-unit stops iterating, all summary speech segments extracted so far and taking them as the key information segments.
The summary extraction sub-unit comprises:
a first computation sub-unit for computing the similarity between the recognized text of the current speech segment and the recognized text of the whole speech data to be played, obtaining a first value;
a second computation sub-unit for computing the similarity between the recognized text of the current segment and the recognized text of the summary segments already extracted, obtaining a second value;
a difference computation sub-unit for computing the difference between the first and second values, obtaining the summary score of the current segment;
a selection sub-unit for selecting, once the summary scores of all segments are obtained, the segment with the highest summary score as a summary speech segment.
The concrete computation of the summary score of a speech segment is described with formula (1) above and is not repeated here.
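Putting the four sub-units together, the iterative extraction can be sketched as follows. The two similarity functions are assumed to be supplied by the caller, and taking the maximum similarity to the already-selected segments as the second value is one plausible reading of the text.

```python
def extract_summary_segments(texts, sim_to_whole, sim_between, n_iterations):
    """Iteratively select summary segments by the score
        first value  = similarity(segment, whole transcript)
        second value = similarity(segment, segments already selected)
        score        = first value - second value
    picking the highest-scoring unselected segment each round.

    texts        -- recognized text of each segment
    sim_to_whole -- f(text) -> similarity to the whole transcript
    sim_between  -- f(text_a, text_b) -> similarity between two texts
    """
    selected = []
    for _ in range(n_iterations):
        best_i, best_score = None, float('-inf')
        for i, text in enumerate(texts):
            if i in selected:
                continue
            first = sim_to_whole(text)
            second = max((sim_between(text, texts[j]) for j in selected),
                         default=0.0)
            score = first - second
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return selected
```

Note how the second term pushes each new pick away from segments already chosen, so the summary covers distinct parts of the transcript rather than repeating the single most central segment.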
The second determination module 232 determines from the voiceprint features of each speech segment whether the segment is a key information segment. Specifically, if the speech data to be played contains the speech of multiple speakers, the speech data of a specific speaker or of the main speaker may be taken as the key information segments.
For a specific speaker, a concrete structure of the second determination module 232 may comprise the following units:
a voiceprint feature extraction unit for extracting the voiceprint features of each speech segment when the speech data to be played contains the speech of multiple speakers;
a voiceprint recognition unit for determining, from the voiceprint features and the voiceprint model of the specific speaker, whether a segment is the speech of that speaker; if so, the segment is determined to be a key information segment.
For the main speaker, a concrete structure of the second determination module 232 may comprise the following units:
a voiceprint feature extraction unit for extracting the voiceprint features of each speech segment when the speech data to be played contains the speech of multiple speakers;
a speaker separation unit for separating the speech data of each speaker by applying speaker separation techniques to the voiceprint features, determining the main speaker, and taking the segments of the main speaker as the key information segments. Speaker separation may use the prior art, for example clustering algorithms.
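After separation assigns a speaker label to each segment, the main speaker can be chosen, for example, as the one with the most total speech time; choosing by duration rather than by segment count is an assumption, since the text does not specify the criterion.

```python
from collections import Counter

def main_speaker(segment_labels, segment_durations):
    """Return the speaker label with the largest total speaking time.

    segment_labels    -- speaker label per segment (e.g. from clustering)
    segment_durations -- duration of each segment, in seconds
    """
    totals = Counter()
    for label, duration in zip(segment_labels, segment_durations):
        totals[label] += duration
    return max(totals, key=totals.get)
```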
When the key information segment determination module 203 comprises both the first determination module 231 and the second determination module 232, the segments that meet both requirements may be taken as the key information segments. For example, the second determination module 232 may first perform voiceprint recognition with the voiceprint model of a specific speaker to identify that speaker's segments, and the first determination module 231 may then pick out, by preset keywords, those of the speaker's segments that contain the keywords, taking them as the key information segments. Alternatively, the first determination module 231 may first extract multiple summary segments from the recognized text of the speech data to be played, and the second determination module 232 may then perform voiceprint recognition on these summary segments with the voiceprint model of the specific speaker, identifying that speaker's segments among them and taking them as the key information segments.
In the embodiment of the present invention, the speaking-rate adjustment module 205 adjusts the speaking rate of the speech data to be played segment by segment. Different adjustment schemes may be used for different application demands. For example, a key information segment may be played at the normal rate and all other segments at the fast rate; alternatively, a key information segment may be played at the slow rate and all other segments at the normal or fast rate.
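The two adjustment schemes described for module 205 can be written as one mapping; the numeric rate multipliers and mode names are illustrative, not from the text.

```python
def playback_rate(is_key, mode='normal_key_fast_rest'):
    """Speaking-rate multiplier per segment under the two schemes:

    'normal_key_fast_rest' -- key segments at normal rate, others fast;
    'slow_key_normal_rest' -- key segments slowed, others at normal rate.
    """
    if mode == 'normal_key_fast_rest':
        return 1.0 if is_key else 1.3
    if mode == 'slow_key_normal_rest':
        return 0.9 if is_key else 1.0
    raise ValueError('unknown mode: ' + mode)
```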
The speech playing device of the embodiment of the present invention analyzes the speech data to be played at the content and/or voiceprint level, determines the key information segments, and adjusts the speaking rate of the speech data according to those segments during playback. It thereby helps the user find, quickly and accurately, the segments of interest or value in a large amount of speech data, for example the segments relevant to the topic of a meeting in a large amount of meeting recordings.
To further improve the playback effect and ensure the validity of the key information segments, another embodiment of the speech playing device of the present invention, shown in Fig. 3, further comprises a confidence acquisition module 206 for obtaining the confidence of each speech segment.
It should be noted that the confidence acquisition module 206 may obtain the posterior probability of each speech segment from the speech recognition process and use that posterior probability as the confidence of the segment.
Because the key information segment determination module 203 is implemented differently for different application demands, when it comprises the first determination module 231, which itself comprises the speech recognition unit and the determination unit, the confidence acquisition module 206 may be integrated with the key information segment determination module 203. Of course, when the key information segment determination module 203 comprises only the second determination module 232, the confidence acquisition module 206 may stand alone: it extracts the features of each speech segment, decodes the segment using the extracted features together with pre-trained acoustic and language models, obtains the posterior probability of each word in the segment, and averages the posterior probabilities of all words in the segment to obtain the posterior probability of the segment, which serves as its confidence.
That is, the structure shown in Fig. 3 is only a schematic drawn for ease of understanding the speech playing device of the present invention, not its physical structure in application; some modules may need adaptive adjustment according to the application demands.
Correspondingly, in this embodiment the speaking-rate adjustment module 205 adjusts the speaking rate of the speech data to be played according to both the key information segments and the confidence of each speech segment. Again, different adjustment schemes may be used for different application demands. For example: if the current speech segment is a key information segment, it is played at the fast rate when its confidence is greater than the second threshold and at the slow rate otherwise; if the current segment is a non-key information segment, it is skipped when its confidence is greater than the second threshold, and played at the slow rate when its confidence is less than or equal to the first threshold (the first threshold being smaller than the second).
Compared with the embodiment of Fig. 2, the speaking-rate adjustment of this embodiment is more versatile and flexible, better ensures the validity of the key information segments, further improves the playback effect, and meets varied application demands of users.
To further improve the quality of the speech data to be played and the listening experience of the user, another embodiment of the speech playing device of the present invention, shown in Fig. 4, may further comprise a signal analysis and processing module 207 for performing signal-level analysis on each speech segment and optimizing the segments according to the analysis results.
The signal analysis and processing module 207 comprises any one or more of the following units:
a volume analysis and processing unit for computing the amplitude of every frame of speech data frame by frame, turning down the amplitude of the current speech segment when the amplitudes of multiple consecutive frames exceed an upper limit, and turning it up when the amplitudes of multiple consecutive frames fall below a lower limit;
a reverberation analysis and processing unit for computing the reverberation time of each speech segment and performing dereverberation on the current segment when its reverberation time exceeds a threshold;
a noise analysis and processing unit for measuring the noise of each speech segment and denoising the current segment when its signal-to-noise ratio is below the SNR threshold.
The computation and detection of segment amplitude, reverberation time, and noise follow the description in the method embodiments above and are not repeated here.
It should be noted that the optimization performed by the signal analysis and processing module 207 may be applied to all speech segments of the speech data to be played or only to the key information segments, and may take place before or after the key information segments are determined; the embodiment of the present invention places no limit on this. In addition, practical applications are not limited to the structure of Fig. 4: another embodiment of the device of the present invention may comprise both the confidence acquisition module 206 and the signal analysis and processing module 207, with the concrete structure of each module adapted to the application demands.
Before playing speech data, the speech playing device of this embodiment analyzes the speech data at the content and/or voiceprint level and at the signal level. The key information segments of the speech data to be played are obtained from the content and/or voiceprint analysis; the signal-level analysis determines whether the speech data has quality problems and, if so, the speech data is optimized for those problems, improving its quality and the listening experience of the user.
The embodiments in this specification are described in a progressive manner; for identical or similar parts the embodiments may refer to one another, and each embodiment dwells on its differences from the others. In particular, the device embodiments, being substantially similar to the method embodiments, are described more briefly; for the relevant parts, refer to the description of the method embodiments. The device embodiments described above are only schematic: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the object of the embodiments. Those of ordinary skill in the art can understand and implement them without creative work.
The embodiments of the present invention have been described in detail above, and specific examples have been used herein to elaborate the invention; the description of the embodiments above only serves to help understand the method and device of the present invention. Meanwhile, those of ordinary skill in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the contents of this description should not be construed as limiting the present invention.

Claims (21)

1. A speech playing method, characterized in that it comprises:
receiving speech data to be played;
performing endpoint detection on the speech data to be played to obtain the speech segments;
determining, from the speech content and/or voiceprint features of each speech segment, whether the segment is a key information segment;
when playing the speech data to be played, adjusting the speaking rate of the speech data according to the key information segments.
2. The method according to claim 1, characterized in that determining from the speech content of each speech segment whether the segment is a key information segment comprises:
performing speech recognition on each speech segment to obtain its recognized text;
determining, from the recognized text of each segment, whether the segment is a key information segment.
3. The method according to claim 2, characterized in that determining from the recognized text of each speech segment whether the segment is a key information segment comprises:
determining whether the recognized text of each segment contains a preset keyword;
if so, determining that the segment is a key information segment.
4. The method according to claim 2, characterized in that determining from the recognized text of each speech segment whether the segment is a key information segment comprises:
extracting summary speech segments from all segments iteratively and, after a set number of iterations, obtaining multiple summary segments, which serve as the key information segments.
5. The method according to claim 4, characterized in that extracting a summary speech segment from all segments comprises:
computing the similarity between the recognized text of the current speech segment and the recognized text of the speech data to be played, obtaining a first value;
computing the similarity between the recognized text of the current segment and the recognized text of the summary segments already extracted, obtaining a second value;
computing the difference between the first and second values, obtaining the summary score of the current segment;
once the summary scores of all segments are obtained, selecting the segment with the highest summary score as a summary speech segment.
6. The method according to claim 1, characterized in that determining from the voiceprint features of each speech segment whether the segment is a key information segment comprises:
if the speech data to be played contains the speech of multiple speakers, extracting the voiceprint features of each segment;
determining, from the voiceprint features and the voiceprint model of a specific speaker, whether the segment is the speech of that speaker;
if so, determining that the segment is a key information segment.
7. The method according to claim 1, characterized in that determining from the voiceprint features of each speech segment whether the segment is a key information segment comprises:
if the speech data to be played contains the speech of multiple speakers, determining the main speaker by speaker separation techniques;
taking the segments of the main speaker as the key information segments.
8. The method according to any one of claims 1 to 7, characterized in that adjusting the speaking rate of the speech data to be played according to the key information segments comprises:
if the current speech segment is a key information segment, playing it at the normal speaking rate, and otherwise at the fast speaking rate; or
if the current speech segment is a key information segment, playing it at the slow speaking rate, and otherwise at the normal or fast speaking rate.
9. The method according to any one of claims 1 to 7, characterized in that the method further comprises:
obtaining the confidence of each speech segment;
wherein adjusting the speaking rate of the speech data to be played is specifically: adjusting the speaking rate of the speech data according to the key information segments and the confidence of each speech segment.
10. The method according to claim 9, characterized in that adjusting the speaking rate of the speech data to be played according to the key information segments and the confidence of each speech segment comprises:
if the current speech segment is a key information segment, playing it at the fast speaking rate when its confidence is greater than a second threshold, and otherwise at the slow speaking rate;
if the current speech segment is a non-key information segment, skipping it when its confidence is greater than the second threshold, and playing it at the slow speaking rate when its confidence is less than or equal to a first threshold, the first threshold being smaller than the second threshold.
11. The method according to claim 1, characterized in that the method further comprises:
performing a signal-level analysis on each speech segment, the signal-level analysis comprising any one or more of: volume variation, reverberation, and noise conditions;
when playing the speech data to be played, optimizing the segments according to the analysis results, the optimization comprising any one or more of:
(1) if the amplitudes of multiple consecutive frames in the current speech segment exceed an upper limit, turning down the amplitude of the current segment; if the amplitudes of multiple consecutive frames fall below a lower limit, turning up the amplitude of the current segment;
(2) if the reverberation time of the current speech segment exceeds a threshold, performing dereverberation on the current segment;
(3) if the signal-to-noise ratio of the current speech segment is below the SNR threshold, denoising the current segment.
12. A speech playing device, characterized in that it comprises:
a receiving module for receiving speech data to be played;
an endpoint detection module for performing endpoint detection on the speech data to be played to obtain the speech segments;
a key information segment determination module, comprising a first determination module and/or a second determination module, the first determination module determining from the speech content of each speech segment whether the segment is a key information segment, and the second determination module determining from the voiceprint features of each speech segment whether the segment is a key information segment;
a playing module for playing the speech data to be played;
a speaking-rate adjustment module for adjusting the speaking rate of the speech data to be played according to the key information segments while the playing module plays the speech data.
13. The device according to claim 12, characterized in that the first determination module comprises:
a speech recognition unit for performing speech recognition on each speech segment to obtain its recognized text;
a determination unit for determining, from the recognized text of each segment, whether the segment is a key information segment.
14. The device according to claim 13, characterized in that:
the determination unit is specifically for determining whether the recognized text of each speech segment contains a preset keyword and, if so, determining that the segment is a key information segment.
15. The device according to claim 13, characterized in that the determination unit comprises:
an iteration setting sub-unit for setting the number of iterations;
a summary extraction sub-unit for extracting summary speech segments from all segments iteratively;
a judgment sub-unit for judging whether the set number of iterations has been reached, and triggering the summary extraction sub-unit to stop the iteration once it has;
a key information segment acquisition sub-unit for collecting, after the summary extraction sub-unit stops iterating, all summary speech segments extracted so far and taking them as the key information segments.
16. The device according to claim 15, wherein the summary extraction subunit comprises:
A first calculation subunit, configured to calculate the similarity between the recognized text of the current speech segment and the recognized text of the entire speech data to be played, obtaining a first calculated value;
A second calculation subunit, configured to calculate the similarity between the recognized text of the current speech segment and the recognized text of the already extracted summary speech segments, obtaining a second calculated value;
A difference calculation subunit, configured to calculate the difference between the first calculated value and the second calculated value, obtaining the summary score of the current speech segment;
A selection subunit, configured to select, after the summary scores of all speech segments have been obtained, the speech segment with the highest summary score as a summary speech segment.
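The scoring in claims 15–16 (similarity to the whole document minus similarity to the already chosen summaries) can be sketched as follows. The claims do not fix a similarity measure; the bag-of-words cosine used here is an assumption:

```python
# Sketch of the iterative summary extraction in claims 15-16. Each round
# scores every remaining segment as (similarity to the full recognized text)
# minus (similarity to the summaries chosen so far) and keeps the best one.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity over word-count vectors; 0.0 if either side is empty.
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def extract_summaries(segments, iterations):
    """segments: list of recognized-text strings; returns the chosen texts."""
    doc = Counter(w for s in segments for w in s.split())
    selected, remaining = [], list(segments)
    for _ in range(min(iterations, len(segments))):
        summary_bag = Counter(w for s in selected for w in s.split())
        scores = [cosine(Counter(s.split()), doc)
                  - cosine(Counter(s.split()), summary_bag)
                  for s in remaining]
        best = max(range(len(remaining)), key=scores.__getitem__)
        selected.append(remaining.pop(best))
    return selected
```

Subtracting the similarity to the existing summaries penalizes redundancy, so successive iterations favor segments that cover material not yet summarized.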
17. The device according to claim 12, wherein the second determination module comprises:
A voiceprint feature extraction unit, configured to extract the voiceprint features of each speech segment when the speech data to be played contains speech data from multiple speakers;
A voiceprint recognition unit, configured to determine, according to the voiceprint features and the voiceprint model of a specific speaker, whether the speech segment is speech data of the specific speaker, and if so, to determine that the speech segment is a key information segment.
18. The device according to claim 12, wherein the second determination module comprises:
A voiceprint feature extraction unit, configured to extract the voiceprint features of each speech segment when the speech data to be played contains speech data from multiple speakers;
A speaker separation unit, configured to determine the main speaker by applying speaker separation techniques to the voiceprint features, and to use the speech segments of the main speaker as key information segments.
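Claim 18 leaves the separation method and the definition of "main speaker" open. One plausible reading, sketched here under the assumption that segments already carry a speaker label (e.g. from clustering voiceprint embeddings), is to take the speaker with the largest total speaking time:

```python
# Hypothetical sketch: pick the "main speaker" of claim 18 as the labeled
# speaker with the greatest total duration, then keep only their segments.
from collections import defaultdict

def main_speaker_segments(segments):
    """segments: list of (speaker_id, duration_sec, audio) tuples."""
    total = defaultdict(float)
    for spk, dur, _ in segments:
        total[spk] += dur
    main = max(total, key=total.get)  # speaker with the most accumulated speech
    return [seg for seg in segments if seg[0] == main]
```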
19. The device according to any one of claims 12 to 18, wherein the speech rate adjustment module is specifically configured to: when the current speech segment is a key information segment, set its playback speech rate to normal, and otherwise set it to fast; or, when the current speech segment is a key information segment, set its playback speech rate to slow, and otherwise set it to normal or fast.
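The two alternatives of claim 19 amount to a mapping from segment type to a playback rate. The numeric factors below are illustrative assumptions, not values from the patent:

```python
# One reading of claim 19: key segments play at a slower or normal rate,
# non-key segments at a normal or faster rate. Factors are assumptions.
def playback_rate(is_key: bool, scheme: str = "normal/fast") -> float:
    """Return a time-stretch factor (1.0 = original speed)."""
    if scheme == "normal/fast":
        return 1.0 if is_key else 1.5   # key at normal speed, rest sped up
    if scheme == "slow/normal":
        return 0.8 if is_key else 1.0   # key slowed down, rest at normal speed
    raise ValueError(f"unknown scheme: {scheme}")
```

Either scheme realizes the claim's goal: the listener spends relatively more time on key information segments than on the rest of the recording.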
20. The device according to any one of claims 12 to 18, further comprising:
A confidence acquisition module, configured to obtain the confidence of each speech segment;
wherein the speech rate adjustment module is specifically configured to adjust the speech rate of the speech data to be played according to the key information segments and the confidence of each speech segment.
21. The device according to claim 12, further comprising:
A signal analysis and processing module, configured to analyze each speech segment at the speech-signal level and to optimize the speech segment according to the analysis results, the signal analysis and processing module comprising any one or more of the following units:
A volume analysis and processing unit, configured to calculate the amplitude of each frame of speech data frame by frame, to lower the amplitude of the current speech segment when the amplitudes of multiple consecutive frames within it exceed an upper limit, and to raise the amplitude of the current speech segment when the amplitudes of multiple consecutive frames within it fall below a lower limit;
A reverberation analysis and processing unit, configured to calculate the reverberation time of each speech segment and to perform reverberation cancellation on the current speech segment when its reverberation time exceeds a threshold;
A noise analysis and processing unit, configured to calculate the noise of each speech segment and to perform denoising on the current speech segment when its signal-to-noise ratio is greater than an SNR threshold.
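The per-frame amplitude check of the volume analysis unit can be sketched as below. The frame length, limits, and run length are illustrative assumptions; the claim only requires detecting runs of consecutive frames outside the allowed range:

```python
# Sketch of the volume analysis in claim 21: compute a per-frame amplitude
# and flag a run of consecutive frames below the lower limit or above the
# upper limit, which would trigger raising or lowering the segment volume.
def frames_out_of_range(samples, frame_len, lower, upper, min_run=3):
    """Return 'low', 'high', or None depending on whether min_run
    consecutive frame amplitudes fall below lower / above upper."""
    # Peak amplitude of each non-overlapping frame.
    amps = [max(abs(s) for s in samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]
    run_lo = run_hi = 0
    for a in amps:
        run_lo = run_lo + 1 if a < lower else 0
        run_hi = run_hi + 1 if a > upper else 0
        if run_lo >= min_run:
            return "low"   # segment amplitude should be raised
        if run_hi >= min_run:
            return "high"  # segment amplitude should be lowered
    return None
```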
CN201510757786.6A 2015-11-04 2015-11-04 Speech playing method and device Active CN105405439B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510757786.6A CN105405439B (en) 2015-11-04 2015-11-04 Speech playing method and device

Publications (2)

Publication Number Publication Date
CN105405439A true CN105405439A (en) 2016-03-16
CN105405439B CN105405439B (en) 2019-07-05

Family

ID=55470882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510757786.6A Active CN105405439B (en) 2015-11-04 2015-11-04 Speech playing method and device

Country Status (1)

Country Link
CN (1) CN105405439B (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105869626A (en) * 2016-05-31 2016-08-17 宇龙计算机通信科技(深圳)有限公司 Automatic speech rate adjusting method and terminal
CN105933128A (en) * 2016-04-25 2016-09-07 四川联友电讯技术有限公司 Audio conference minute push method based on noise filtering and identity authentication
CN106098057A (en) * 2016-06-13 2016-11-09 北京云知声信息技术有限公司 Play word speed management method and device
CN106504744A (en) * 2016-10-26 2017-03-15 科大讯飞股份有限公司 A kind of method of speech processing and device
CN106782571A (en) * 2017-01-19 2017-05-31 广东美的厨房电器制造有限公司 The display methods and device of a kind of control interface
CN106910500A (en) * 2016-12-23 2017-06-30 北京第九实验室科技有限公司 The method and apparatus of Voice command is carried out to the equipment with microphone array
CN107369085A (en) * 2017-06-28 2017-11-21 深圳市佰仟金融服务有限公司 A kind of information output method, device and terminal device
CN107464563A (en) * 2017-08-11 2017-12-12 潘金文 A kind of interactive voice toy
WO2017210991A1 (en) * 2016-06-06 2017-12-14 中兴通讯股份有限公司 Method, device and system for voice filtering
CN107610718A (en) * 2017-08-29 2018-01-19 深圳市买买提乐购金融服务有限公司 A kind of method and device that voice document content is marked
CN107689227A (en) * 2017-08-23 2018-02-13 上海爱优威软件开发有限公司 A kind of voice de-noising method and system based on data fusion
WO2018036555A1 (en) * 2016-08-25 2018-03-01 腾讯科技(深圳)有限公司 Session processing method and apparatus
CN108597521A (en) * 2018-05-04 2018-09-28 徐涌 Audio role divides interactive system, method, terminal and the medium with identification word
CN109119093A (en) * 2018-10-30 2019-01-01 Oppo广东移动通信有限公司 Voice de-noising method, device, storage medium and mobile terminal
CN109147802A (en) * 2018-10-22 2019-01-04 珠海格力电器股份有限公司 A kind of broadcasting word speed adjusting method and device
CN109256153A (en) * 2018-08-29 2019-01-22 北京云知声信息技术有限公司 A kind of sound localization method and system
CN109979467A (en) * 2019-01-25 2019-07-05 出门问问信息科技有限公司 Voice filter method, device, equipment and storage medium
CN109979474A (en) * 2019-03-01 2019-07-05 珠海格力电器股份有限公司 Speech ciphering equipment and its user speed modification method, device and storage medium
CN110534127A (en) * 2019-09-24 2019-12-03 华南理工大学 Applied to the microphone array voice enhancement method and device in indoor environment
CN110709921A (en) * 2018-05-28 2020-01-17 深圳市大疆创新科技有限公司 Noise reduction method and device and unmanned aerial vehicle
CN111899742A (en) * 2020-08-06 2020-11-06 广州科天视畅信息科技有限公司 Method and system for improving conference efficiency
WO2020233068A1 (en) * 2019-05-21 2020-11-26 深圳壹账通智能科技有限公司 Conference audio control method, system, device and computer readable storage medium
CN112133279A (en) * 2019-06-06 2020-12-25 Tcl集团股份有限公司 Vehicle-mounted information broadcasting method and device and terminal equipment
CN113364665A (en) * 2021-05-24 2021-09-07 维沃移动通信有限公司 Information broadcasting method and electronic equipment
CN113571054A (en) * 2020-04-28 2021-10-29 中国移动通信集团浙江有限公司 Speech recognition signal preprocessing method, device, equipment and computer storage medium
CN114566164A (en) * 2022-02-23 2022-05-31 成都智元汇信息技术股份有限公司 Manual broadcast audio self-adaption method, display terminal and system based on public transport

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1787070A (en) * 2005-12-09 2006-06-14 北京凌声芯语音科技有限公司 Chip upper system for language learner
KR20080039704A (en) * 2006-11-01 2008-05-07 엘지전자 주식회사 Portable audio player and controlling method thereof
CN103377203A (en) * 2012-04-18 2013-10-30 宇龙计算机通信科技(深圳)有限公司 Terminal and sound record management method
CN103400580A (en) * 2013-07-23 2013-11-20 华南理工大学 Method for estimating importance degree of speaker in multiuser session voice
CN103929539A (en) * 2014-04-10 2014-07-16 惠州Tcl移动通信有限公司 Mobile terminal notepad processing method and system based on voice recognition
CN104125340A (en) * 2014-07-25 2014-10-29 广东欧珀移动通信有限公司 Generating managing method and system for call sound recording files
CN104505108A (en) * 2014-12-04 2015-04-08 广东欧珀移动通信有限公司 Information positioning method and terminal

Similar Documents

Publication Publication Date Title
CN105405439A (en) Voice playing method and device
CN110709924B (en) Audio-visual speech separation
US10013977B2 (en) Smart home control method based on emotion recognition and the system thereof
Reddy et al. An individualized super-Gaussian single microphone speech enhancement for hearing aid users with smartphone as an assistive device
CN105304080B (en) Speech synthetic device and method
CN110211575B (en) Voice noise adding method and system for data enhancement
CN105405448B (en) A kind of sound effect treatment method and device
WO2020181824A1 (en) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
CN102568478B (en) Video play control method and system based on voice recognition
CN107464563B (en) Voice interaction toy
CN112270933B (en) Audio identification method and device
Sadjadi et al. Blind spectral weighting for robust speaker identification under reverberation mismatch
WO2023184942A1 (en) Voice interaction method and apparatus and electric appliance
JP2023548157A (en) Other speaker audio filtering from calls and audio messages
CN112201262A (en) Sound processing method and device
Bhat et al. Formant frequency-based speech enhancement technique to improve intelligibility for hearing aid users with smartphone as an assistive device
CN115862658A (en) System and method for extracting target speaker voice
Zhao et al. Time-Domain Target-Speaker Speech Separation with Waveform-Based Speaker Embedding.
Lin et al. Focus on the sound around you: Monaural target speaker extraction via distance and speaker information
CN111933146A (en) Speech recognition system and method
CN111833869A (en) Voice interaction method and system applied to urban brain
CN106790963A (en) The control method and device of audio signal
CN116631406B (en) Identity feature extraction method, equipment and storage medium based on acoustic feature generation
Hoover et al. The consonant-weighted envelope difference index (cEDI): A proposed technique for quantifying envelope distortion
Jakubec et al. An Overview of Automatic Speaker Recognition in Adverse Acoustic Environment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant