CN105405439A - Voice playing method and device - Google Patents

Voice playing method and device

Info

Publication number
CN105405439A
CN105405439A CN201510757786.6A CN201510757786A
Authority
CN
China
Prior art keywords
voice segments
speech segment
key message
message section
speech data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201510757786.6A
Other languages
Chinese (zh)
Other versions
CN105405439B (en
Inventor
高建清
王智国
胡国平
胡郁
刘庆峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN201510757786.6A priority Critical patent/CN105405439B/en
Publication of CN105405439A publication Critical patent/CN105405439A/en
Application granted granted Critical
Publication of CN105405439B publication Critical patent/CN105405439B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/02 - Feature extraction for speech recognition; Selection of recognition unit
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/04 - Segmentation; Word boundary detection
    • G10L15/05 - Word boundary detection
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/08 - Speech classification or search
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/26 - Speech to text systems
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/04 - Training, enrolment or model building
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 - Noise filtering
    • G10L21/0264 - Noise filtering characterised by the type of parameter measurement, e.g. correlation techniques, zero crossing techniques or predictive techniques
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 - Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 - Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0316 - Speech enhancement, e.g. noise reduction or echo cancellation by changing the amplitude

Abstract

The invention discloses a voice playing method and device. The method comprises the following steps: receiving voice data to be played; performing endpoint detection on the voice data to obtain individual voice segments; determining whether each voice segment is a key information segment; and, during playback, adjusting the playback speech rate of the voice data according to the key information segments. The method and device help users find the voice segments they care about quickly and accurately.

Description

Speech playing method and device
Technical field
The present invention relates to the field of speech signal processing, and in particular to a speech playing method and device.
Background art
At present, more and more people prefer recording audio over taking written notes: in meetings, the proceedings are recorded for later review; in interviews, the conversation is recorded and the transcript is edited from it; in class, students record the parts they do not understand so they can review them afterwards. However, when the amount of recorded data is large, it is difficult to locate the valuable content quickly and accurately. To shorten playback time, existing speech playing methods generally apply endpoint detection to find pure-noise or silent segments, skip them, and play the remaining speech data at normal speed. In practice, however, unimportant content is often recorded as well, so during playback the user has to switch to fast-forward manually or skip it outright. Moreover, when the recording environment is poor, the recorded speech quality is often low, and the user has to replay passages manually and repeatedly to make out the content, which greatly degrades the user experience.
Summary of the invention
The invention provides a speech playing method and device that help the user find the voice segments of interest quickly and accurately.
To this end, the invention provides the following technical solution:
A speech playing method, comprising:
receiving speech data to be played;
performing endpoint detection on the speech data to be played to obtain individual voice segments;
determining, according to the voice content and/or the voiceprint features of each voice segment, whether the voice segment is a key information segment;
when playing the speech data to be played, adjusting its playback speech rate according to the key information segments.
Preferably, determining according to the voice content of each voice segment whether the voice segment is a key information segment comprises:
performing speech recognition on each voice segment to obtain the recognized text of each voice segment;
determining, according to the recognized text of each voice segment, whether the voice segment is a key information segment.
Preferably, determining according to the recognized text of each voice segment whether the voice segment is a key information segment comprises:
determining whether the recognized text of each voice segment contains a preset keyword;
if so, determining that the voice segment is a key information segment.
Preferably, determining according to the recognized text of each voice segment whether the voice segment is a key information segment comprises:
extracting summary voice segments from all voice segments iteratively, obtaining multiple summary voice segments once a set number of iterations has been reached, and taking these summary voice segments as the key information segments.
Preferably, extracting a summary voice segment from all voice segments comprises:
calculating the similarity between the recognized text of the current voice segment and the recognized text of the whole speech data to be played, to obtain a first value;
calculating the similarity between the recognized text of the current voice segment and the recognized text of the summary voice segments already extracted, to obtain a second value;
calculating the difference between the first value and the second value as the summary score of the current voice segment;
after the summary scores of all voice segments have been obtained, selecting the voice segment with the highest summary score as a summary voice segment.
Preferably, determining according to the voiceprint features of each voice segment whether the voice segment is a key information segment comprises:
if the speech data to be played contains speech from multiple speakers, extracting the voiceprint features of each voice segment;
determining, according to the voiceprint features and the voiceprint model of a specific speaker, whether the voice segment is speech of that specific speaker;
if so, determining that the voice segment is a key information segment.
Preferably, determining according to the voiceprint features of each voice segment whether the voice segment is a key information segment comprises:
if the speech data to be played contains speech from multiple speakers, determining the main speaker by speaker separation;
taking the voice segments of the main speaker as key information segments.
Preferably, adjusting the speech rate of the speech data to be played according to the key information segments comprises:
if the current voice segment is a key information segment, playing it at normal speed, otherwise playing it at fast speed; or
if the current voice segment is a key information segment, playing it at slow speed, otherwise playing it at normal or fast speed.
Preferably, the method further comprises:
obtaining the confidence of each voice segment;
adjusting the speech rate of the speech data to be played then specifically comprises: adjusting the speech rate according to the key information segments and the confidence of each voice segment.
Preferably, adjusting the speech rate of the speech data to be played according to the key information segments and the confidence of each voice segment comprises:
if the current voice segment is a key information segment: playing it at fast speed when its confidence is greater than a second threshold, and at slow speed otherwise;
if the current voice segment is a non-key information segment: skipping it when its confidence is greater than the second threshold, and playing it at slow speed when its confidence is less than or equal to a first threshold, the first threshold being smaller than the second threshold.
Preferably, the method further comprises:
performing signal-level analysis on each voice segment, the signal-level analysis comprising any one or more of: volume variation, reverberation, and noise;
when playing the speech data to be played, optimizing the voice segments according to the analysis results, the optimization comprising any one or more of:
(1) if the amplitude of multiple consecutive frames in the current voice segment exceeds an upper limit, turning the amplitude of the segment down; if the amplitude of multiple consecutive frames is below a lower limit, turning the amplitude up;
(2) if the reverberation time of the current voice segment exceeds a threshold, performing dereverberation on the segment;
(3) if the signal-to-noise ratio of the current voice segment is below an SNR threshold, performing noise reduction on the segment.
A voice playing device, comprising:
a receiving module for receiving speech data to be played;
an endpoint detection module for performing endpoint detection on the speech data to be played to obtain individual voice segments;
a key information segment determination module comprising a first determination module and/or a second determination module, the first determination module determining, according to the voice content of each voice segment, whether the voice segment is a key information segment, and the second determination module determining the same according to the voiceprint features of each voice segment;
a playing module for playing the speech data to be played;
a speech rate adjusting module for adjusting, while the playing module plays the speech data to be played, the speech rate of the speech data according to the key information segments.
Preferably, the first determination module comprises:
a speech recognition unit for performing speech recognition on each voice segment to obtain its recognized text;
a determining unit for determining, according to the recognized text of each voice segment, whether the voice segment is a key information segment.
Preferably, the determining unit is specifically configured to determine whether the recognized text of each voice segment contains a preset keyword and, if so, to determine that the voice segment is a key information segment.
Preferably, the determining unit comprises:
an iteration count setting subunit for setting the number of iterations;
a summary extraction subunit for iteratively extracting summary voice segments from all voice segments;
a judging subunit for judging whether the set number of iterations has been reached and, when it has, triggering the summary extraction subunit to stop iterating;
a key information segment obtaining subunit for obtaining, after the summary extraction subunit stops iterating, all summary voice segments extracted so far and taking them as key information segments.
Preferably, the summary extraction subunit comprises:
a first calculating subunit for calculating the similarity between the recognized text of the current voice segment and the recognized text of the speech data to be played, obtaining a first value;
a second calculating subunit for calculating the similarity between the recognized text of the current voice segment and the recognized text of the summary voice segments already extracted, obtaining a second value;
a difference calculating subunit for calculating the difference between the first value and the second value as the summary score of the current voice segment;
a selecting subunit for selecting, after the summary scores of all voice segments have been obtained, the voice segment with the highest summary score as a summary voice segment.
Preferably, the second determination module comprises:
a voiceprint feature extraction unit for extracting the voiceprint features of each voice segment when the speech data to be played contains speech from multiple speakers;
a voiceprint recognition unit for determining, according to the voiceprint features and the voiceprint model of a specific speaker, whether the voice segment is speech of that specific speaker and, if so, determining that the voice segment is a key information segment.
Preferably, the second determination module comprises:
a voiceprint feature extraction unit for extracting the voiceprint features of each voice segment when the speech data to be played contains speech from multiple speakers;
a speaker separation unit for determining the main speaker from the voiceprint features by speaker separation, and taking the voice segments of the main speaker as key information segments.
Preferably, the speech rate adjusting module is specifically configured to play the current voice segment at normal speed when it is a key information segment and at fast speed otherwise; or to play the current voice segment at slow speed when it is a key information segment and at normal or fast speed otherwise.
Preferably, the device further comprises:
a confidence obtaining module for obtaining the confidence of each voice segment;
the speech rate adjusting module being specifically configured to adjust the speech rate of the speech data to be played according to the key information segments and the confidence of each voice segment.
Preferably, the device further comprises:
a signal analysis and processing module for performing signal-level analysis on each voice segment and optimizing the voice segments according to the analysis results; the signal analysis and processing module comprises any one or more of the following units:
a volume analysis and processing unit for calculating the amplitude of each frame of speech data, turning the amplitude of the current voice segment down when the amplitude of multiple consecutive frames exceeds an upper limit, and turning it up when the amplitude of multiple consecutive frames is below a lower limit;
a reverberation analysis and processing unit for calculating the reverberation time of each voice segment and performing dereverberation on the current voice segment when its reverberation time exceeds a threshold;
a noise analysis and processing unit for estimating the noise of each voice segment and performing noise reduction on the current voice segment when its signal-to-noise ratio is below the SNR threshold.
The speech playing method and device provided by the embodiments of the present invention analyze the speech data to be played, determine the key information segments in it, and adjust the playback speech rate according to those segments during playback. This helps the user find the voice segments of interest, or the valuable ones, quickly and accurately in a large amount of speech data, for example the voice segments relevant to the topic of a long meeting recording.
Brief description of the drawings
To explain the embodiments of the present application or the prior-art solutions more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art could derive other drawings from them.
Fig. 1 is a flowchart of a speech playing method according to an embodiment of the present invention;
Fig. 2 is a structural diagram of a voice playing device according to an embodiment of the present invention;
Fig. 3 is another structural diagram of the voice playing device;
Fig. 4 is a further structural diagram of the voice playing device.
Detailed description of the embodiments
To enable those skilled in the art to better understand the solutions of the embodiments of the present invention, the embodiments are described in further detail below with reference to the drawings.
As shown in Fig. 1, the flowchart of the speech playing method of an embodiment of the present invention comprises the following steps:
Step 101: receive the speech data to be played.
The speech data may be, for example, a recording of a TV program, an interview, or a meeting.
Step 102: perform endpoint detection on the speech data to be played to obtain individual voice segments.
Endpoint detection may use existing techniques, for example detection based on short-time energy and short-time average zero-crossing rate, on cepstral features, or on entropy.
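As an illustration of the first of those prior-art techniques, a minimal sketch of short-time-energy endpoint detection is given below; the frame length, hop, and threshold ratio are illustrative assumptions, not values from the patent.

```python
import numpy as np

def endpoint_detect(signal, frame_len=400, hop=160, energy_ratio=0.1):
    """Split a waveform into voice segments by short-time energy.

    A frame is marked as speech when its energy exceeds a threshold
    derived from the mean frame energy; runs of speech frames are
    merged into (start, end) sample ranges.
    """
    n_frames = max(0, 1 + (len(signal) - frame_len) // hop)
    energies = np.array([
        np.sum(signal[i * hop: i * hop + frame_len] ** 2)
        for i in range(n_frames)
    ])
    threshold = energy_ratio * energies.mean()
    speech = energies > threshold

    segments, start = [], None
    for i, flag in enumerate(speech):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            segments.append((start * hop, (i - 1) * hop + frame_len))
            start = None
    if start is not None:
        segments.append((start * hop, (len(speech) - 1) * hop + frame_len))
    return segments
```

Applied to a recording with silence around a burst of speech, this yields one (start, end) range roughly covering the speech region; a production detector would add hangover smoothing and an adaptive noise floor.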
Step 103: determine, according to the voice content and/or the voiceprint features of each voice segment, whether the voice segment is a key information segment.
The alternatives are described in detail below.
1. Determining key information segments from the voice content of each voice segment
First, speech recognition is performed on each voice segment to obtain its recognized text; then, according to the recognized text, it is determined whether the voice segment is a key information segment. Specifically, either a preset-keyword method or a summary extraction method may be used.
In the preset-keyword method, the user presets the keywords of the speech data to be played. For each voice segment, it is judged whether the recognized text of the segment contains a keyword; the judgment may use exact or fuzzy matching of the words in the recognized text against the keyword. If the match succeeds, the segment is considered to contain the keyword and is accordingly taken as a key information segment.
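The keyword check can be sketched as follows, assuming exact substring matching on the recognized text (fuzzy matching is omitted); the function and variable names are illustrative, not taken from the patent.

```python
def is_key_segment(recognized_text, keywords):
    """Exact-match check: a segment is a key information segment
    when its recognized text contains any preset keyword."""
    return any(kw in recognized_text for kw in keywords)

# Example: flag segments whose transcript mentions a preset topic word.
transcripts = ["the budget review starts now", "small talk about lunch"]
flags = [is_key_segment(t, ["budget", "deadline"]) for t in transcripts]
```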
In the summary extraction method, summary voice segments are extracted from the recognized text of the speech data to be played and taken as key information segments. Specifically, summary voice segments may be extracted iteratively from all voice segments, yielding multiple summary voice segments after a set number of iterations. In each iteration the voice segment most relevant to the whole recognized text is selected as a summary voice segment, for example by the maximal marginal relevance (MMR) method. To compute the relevance, first the similarity between the recognized text of the current voice segment and the recognized text of the whole speech data is calculated, giving a first value; then the similarity between the recognized text of the current voice segment and that of the summary voice segments already extracted is calculated, giving a second value; finally the difference between the two is taken as the score of the current voice segment, as shown in equation (1).
MMR(S_i) = α * Sim_1(S_i, D) - (1 - α) * Sim_2(S_i, Sum)    (1)
where MMR(S_i) is the score of the i-th voice segment, D is the recognized-text vector of the whole speech data to be played, S_i is the recognized-text vector of the current voice segment, Sum is the recognized-text vector of the summary voice segments already extracted, and α is a weight parameter whose value may be chosen from experience or experiment. Sim_1(S_i, D) is the similarity between the recognized text of the current voice segment and that of the whole speech data to be played; Sim_2(S_i, Sum) is the similarity between the recognized text of the current voice segment and that of the summary voice segments already extracted.
It should be noted that text vectorization may use existing techniques, which are not repeated here. The similarity may be measured by a distance between vectors, such as the cosine distance, again computed with existing methods.
After the scores of all voice segments have been computed, the voice segment with the highest score is selected as a summary voice segment, and the next iteration begins.
After several iterations, multiple summary voice segments are obtained and taken as the key information segments.
It should also be noted that any other prior-art summarization method may be used for the extraction; the embodiments of the present invention are not limited in this respect.
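Assuming a simple bag-of-words representation and cosine similarity for both Sim_1 and Sim_2, the iterative extraction of equation (1) might be sketched as follows; the vectorization, the value of α, and the iteration count are illustrative choices, not prescribed by the patent.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def mmr_summary(texts, n_iter=2, alpha=0.7):
    """Pick summary segments iteratively per equation (1): score each
    remaining segment by alpha * Sim(segment, whole document) minus
    (1 - alpha) * Sim(segment, summary so far), and take the
    highest-scoring segment each round."""
    vecs = [Counter(t.split()) for t in texts]
    doc = Counter(w for t in texts for w in t.split())
    summary_idx, summary_vec = [], Counter()
    for _ in range(min(n_iter, len(texts))):
        best, best_score = None, -float("inf")
        for i, v in enumerate(vecs):
            if i in summary_idx:
                continue
            score = alpha * cosine(v, doc) - (1 - alpha) * cosine(v, summary_vec)
            if score > best_score:
                best, best_score = i, score
        summary_idx.append(best)
        summary_vec += vecs[best]
    return summary_idx
```

In a real system the recognized texts would be produced by the recognizer of step 103 and the vectors might come from TF-IDF or embeddings; the iterative structure is the same.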
2. Determining key information segments from the voiceprint features of each voice segment
If the speech data to be played contains speech from multiple speakers, the speech of a specific speaker or of the main speaker may be taken as the key information segments.
The specific speaker may be preset by the user. Voiceprint recognition then determines whether each voice segment is speech of that speaker; if so, the segment is a key information segment. Note that in this embodiment the speech data of the specific speaker must be collected in advance to train his or her voiceprint model. During voiceprint recognition, the voiceprint features of each voice segment are extracted and matched against the model to decide whether the segment is speech of the specific speaker.
For the main speaker, speaker separation applied to the voiceprint features of each voice segment separates the speech of the individual speakers. The speaker with the most speech data may be taken as the main speaker, or else the most active speaker, i.e. the one whose speech usually spans the whole recording, appearing at the beginning, the middle, and the end; the position of each separated segment within the whole recording can be determined from its timestamps. Speaker separation may use existing techniques, for example clustering algorithms.
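As a toy illustration of the clustering route to speaker separation, the sketch below groups per-segment feature vectors with plain k-means and takes the largest cluster as the main speaker. Real systems would use proper voiceprint features (e.g. i-vectors or neural embeddings) rather than the low-dimensional toy vectors assumed here, and a clustering method that estimates the speaker count.

```python
import numpy as np

def main_speaker_segments(features, n_speakers=2, n_iter=20, seed=0):
    """Cluster per-segment voiceprint features with k-means and return
    the indices of the segments in the largest cluster, i.e. the
    presumed main speaker's segments."""
    rng = np.random.default_rng(seed)
    feats = np.asarray(features, dtype=float)
    centers = feats[rng.choice(len(feats), n_speakers, replace=False)]
    for _ in range(n_iter):
        # Assign each segment to its nearest center, then recompute centers.
        dists = ((feats[:, None] - centers[None]) ** 2).sum(-1)
        labels = np.argmin(dists, axis=1)
        for k in range(n_speakers):
            if np.any(labels == k):
                centers[k] = feats[labels == k].mean(axis=0)
    main = np.bincount(labels, minlength=n_speakers).argmax()
    return [i for i, lab in enumerate(labels) if lab == main]
```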
3. Determining key information segments from both the voice content and the voiceprint features
This case combines the two above. A practical example is a meeting recording with several speakers in which the user cares only about part of what one particular person says. Here, voiceprint recognition with that speaker's voiceprint model first identifies that speaker's voice segments, and preset keywords then pick out, among those segments, the ones containing the keywords; these are taken as the key information segments.
Thus the invention can satisfy not only the user's interest in particular content of the speech to be played, but also, when there are multiple speakers, the user's interest in particular speakers.
Step 104: when playing the speech data to be played, adjust its speech rate according to the key information segments.
In the embodiments of the present invention, the speech rate is adjusted segment by segment, and the concrete policy can be chosen to suit the application; the embodiments are not limited in this respect. For example, when a user transcribes an interview from its recording, key information segments may be played at normal speed and non-key segments at fast speed; when a student reviews a classroom recording, key information segments may be played at slow speed and non-key segments at normal or fast speed.
In summary, the speech playing method of the embodiments of the present invention analyzes the speech data to be played at the level of voice content and/or voiceprint features, determines the key information segments, and adjusts the playback speech rate according to them, thereby helping the user find the voice segments of interest, or the valuable ones, quickly and accurately in a large amount of speech data, for example the segments relevant to the topic of a long meeting recording.
To further improve playback and ensure the validity of the key information segments, another embodiment of the method also considers the confidence of each voice segment when adjusting the speech rate according to the key information segments. Specifically, during speech recognition the posterior probability of each word in a voice segment can be obtained; averaging the posterior probabilities of all words in the segment yields the posterior probability of the segment, which serves as its confidence. Of course, the confidence may also be computed by other methods, for example by statistical modeling over text features (mainly semantic features of the recognized text) or acoustic features of the segment. Note that even when the key information segments are determined from voiceprint features alone, the per-word posterior probabilities, and hence the segment confidences, can still be obtained from the decoding pass of a speech recognizer.
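The confidence computation just described, averaging the per-word posterior probabilities, reduces to a small helper; the input is assumed to be the posteriors already emitted by the recognizer for one segment.

```python
def segment_confidence(word_posteriors):
    """Segment confidence = mean of the per-word posterior
    probabilities produced by the speech recognizer; an empty
    segment gets confidence 0."""
    if not word_posteriors:
        return 0.0
    return sum(word_posteriors) / len(word_posteriors)
```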
In this embodiment, the speech rate of the speech data to be played is adjusted according to the key information segments and the confidence of each voice segment.
Specifically, the confidence values may be divided into several grades in advance by setting one or more thresholds. With two thresholds, a first threshold and a larger second threshold, the confidence values fall into three grades: high, medium, and low; the speech rate is then adjusted according to the grade and whether the segment is a key information segment. The rates used are fast, normal, and slow; the exact fast and slow values can be chosen for the application. If the speech data to be played is the recording of an important meeting, for example, the fast rate might be set 5% faster than normal and the slow rate 10% slower.
Taking two confidence thresholds as an example, the speaking-rate adjustment of the speech data to be played proceeds as follows:
Step a) obtain the confidence of the current speech segment;
Step b) compare the confidence against the preset thresholds: if it is greater than the second threshold, go to step c); if it lies between the first and second thresholds, go to step d); if it is less than the first threshold, go to step e);
Step c) judge whether the current speech segment is a key information segment; if so, adjust its speaking rate to the fast rate; otherwise, skip the current segment entirely;
Step d) judge whether the current speech segment is a key information segment; if so, adjust its speaking rate to the slow rate; if not, leave it unadjusted, i.e. play it at the normal rate;
Step e) adjust the speaking rate of the current speech segment to the slow rate.
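The steps a) to e) above can be collapsed into a single decision function. The threshold values and the return labels are illustrative assumptions; the text leaves the concrete values open.

```python
def playback_action(confidence, is_key,
                    first_threshold=0.4, second_threshold=0.7):
    """Map a segment's confidence and key/non-key status to a playback
    decision following steps a)-e): 'fast', 'normal', 'slow', or 'skip'.
    """
    if confidence > second_threshold:      # high confidence -> step c)
        return 'fast' if is_key else 'skip'
    if confidence >= first_threshold:      # medium confidence -> step d)
        return 'slow' if is_key else 'normal'
    return 'slow'                          # low confidence -> step e)
```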
Of course, adjusting the speaking rate of the speech data to be played according to the key information segments and the confidence of each speech segment is not limited to the specific scheme above; other adjustment schemes may be used according to application demands, and the embodiment of the present invention is not limited in this respect.
To further improve the quality of the speech data to be played and the listening experience of the user, in another embodiment of the speech playing method of the present invention, a signal-level analysis may also be performed on the speech data to be played, and the speech segments optimized according to the analysis results. The signal-level analysis may cover any one or more of the following: volume variation of the speech data, the amount of reverberation, and noise conditions. Correspondingly, when the segments are optimized according to the analysis results, one or more of these signal analyses and treatments may be applied. The different treatments are described separately below.
1) Optimizing the speech data according to the volume analysis
In the volume analysis, the amplitude of every frame of speech data is computed frame by frame. If the amplitude of multiple consecutive frames exceeds an upper limit, the current speech segment is considered to be clipping: the playback volume would be too high and impair the listening experience, so the amplitude of the current segment needs to be turned down. If the amplitude of multiple consecutive frames falls below a lower limit, the amplitude of the current segment is considered too small for the user to catch the speech content easily, and it needs to be turned up.
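A minimal sketch of the frame-amplitude check described above; the limits, the required run length, and the return labels are illustrative assumptions.

```python
def volume_flag(frame_amplitudes, upper=30000, lower=500, run=5):
    """Detect clipping or overly quiet audio from per-frame amplitudes.

    Returns 'turn_down' if `run` consecutive frames exceed `upper`
    (likely clipping), 'turn_up' if `run` consecutive frames fall
    below `lower`, and 'ok' otherwise.
    """
    over = under = 0
    for amp in frame_amplitudes:
        over = over + 1 if amp > upper else 0
        under = under + 1 if amp < lower else 0
        if over >= run:
            return 'turn_down'
        if under >= run:
            return 'turn_up'
    return 'ok'
```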
2) Optimizing the speech data according to the reverberation analysis
In the reverberation analysis, a reverberation detection model built in advance may be used to examine the current speech segment. The model can be obtained by collecting a large amount of speech data beforehand and building a statistical model, such as a deep neural network. Spectral features extracted from the current segment are fed to the detection model to compute the segment's reverberation time T60 (the time required for the acoustic energy to decay by 60 dB after the source stops sounding). If T60 exceeds a set threshold, the current segment is considered excessively reverberant, and a dereverberation method, such as one based on inverse filtering, is applied to remove the reverberation and preserve the clarity of the segment.
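The text obtains T60 from a trained detection model, which is not reproduced here. As a stand-in that illustrates the quantity being thresholded, T60 can also be read off an energy decay curve directly; the sampling interval, the linear extrapolation, and the 0.6 s threshold are all assumptions.

```python
def estimate_t60(decay_db, dt_ms=1.0):
    """Estimate T60 (seconds) from an energy decay curve sampled every
    dt_ms milliseconds, in dB relative to an arbitrary reference.

    Returns the time at which the curve first falls 60 dB below its
    starting level; if it never does, extrapolates linearly from the
    overall decay rate.
    """
    start = decay_db[0]
    for i, level in enumerate(decay_db):
        if start - level >= 60.0:
            return i * dt_ms / 1000.0
    rate = (start - decay_db[-1]) / ((len(decay_db) - 1) * dt_ms)  # dB/ms
    return (60.0 / rate) / 1000.0

def too_reverberant(t60_seconds, threshold=0.6):
    """Flag the segment when its T60 exceeds a preset threshold."""
    return t60_seconds > threshold
```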
3) Optimizing the speech data according to the noise analysis
The signal-to-noise ratio (SNR) of the current speech data is computed. If the SNR falls below a preset SNR threshold, noise reduction is applied to the current speech segment, for example by speech enhancement. Speech enhancement extracts speech that is as clean as possible from noisy speech data and eliminates background noise, thereby improving the voice quality so that the user does not tire of listening. Concrete speech enhancement methods belong to the prior art. For example, with a neural-network approach, a speech enhancement model is first trained using the magnitude-spectrum features of a large amount of noisy speech as input and the magnitude-spectrum features of the corresponding clean speech as output. The magnitude-spectrum features of the current noisy speech data are then fed to the trained model to obtain the magnitude-spectrum features of the clean speech, and finally the clean speech data are recovered from these features together with the phase information of the original speech data.
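A sketch of the SNR gate that decides whether enhancement runs at all; the 15 dB threshold and the power-based inputs are assumptions, and the neural enhancement model itself is not reproduced here.

```python
import math

def snr_db(signal_power, noise_power):
    """Signal-to-noise ratio in dB from mean signal and noise power."""
    return 10.0 * math.log10(signal_power / noise_power)

def needs_denoising(signal_power, noise_power, threshold_db=15.0):
    """Apply speech enhancement only when the segment's SNR falls
    below the preset threshold."""
    return snr_db(signal_power, noise_power) < threshold_db
```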
It should be noted that the optimization above may be applied to all speech segments of the speech data to be played or only to the key information segments, and may take place either before or after the key information segments are determined; the embodiment of the present invention places no limit on this.
To help the user find valuable speech data quickly and efficiently, the speech playing method of the embodiment of the present invention analyzes the speech data at the content and/or voiceprint level and at the signal level before playback. The key information segments of the speech data to be played are obtained from the content and/or voiceprint analysis. The signal-level analysis determines whether the speech data to be played has quality problems; if so, the speech data is optimized for those problems, improving its quality and the listening experience of the user.
Correspondingly, an embodiment of the present invention also provides a speech playing device, whose structure is shown schematically in Fig. 2.
In this embodiment, the device comprises:
a receiving module 201 for receiving speech data to be played;
an endpoint detection module 202 for performing endpoint detection on the speech data to be played to obtain the speech segments;
a key information segment determination module 203, comprising a first determination module 231 and/or a second determination module 232, the first determination module 231 determining from the speech content of each speech segment whether the segment is a key information segment, and the second determination module 232 determining from the voiceprint features of each speech segment whether the segment is a key information segment;
a playing module 204 for playing the speech data to be played;
a speaking-rate adjustment module 205 for adjusting the speaking rate of the speech data to be played according to the key information segments while the playing module 204 plays the speech data.
Fig. 2 illustrates the case in which the key information segment determination module 203 comprises both the first determination module 231 and the second determination module 232.
The first determination module 231 determines from the speech content of each speech segment whether the segment is a key information segment. It comprises a speech recognition unit and a determination unit: the speech recognition unit performs speech recognition on each speech segment to obtain its recognized text, and the determination unit determines from the recognized text of each segment whether the segment is a key information segment.
In a specific application, the determination unit may use preset keywords or summary extraction to decide whether each speech segment is a key information segment.
For example, in one embodiment, the determination unit may check whether the recognized text of each speech segment contains a preset keyword; if so, the segment is determined to be a key information segment, otherwise a non-key information segment.
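The keyword embodiment amounts to a simple containment test; the function and parameter names are illustrative.

```python
def is_key_segment(recognized_text, keywords):
    """A segment is a key information segment when its recognized
    text contains any of the preset keywords."""
    return any(kw in recognized_text for kw in keywords)
```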
As another example, in another embodiment, the determination unit may extract summary speech segments from all segments iteratively and, after a set number of iterations, obtain multiple summary segments, which serve as the key information segments. Correspondingly, its concrete structure may comprise the following sub-units:
an iteration setting sub-unit for setting the number of iterations;
a summary extraction sub-unit for extracting summary speech segments from all segments iteratively;
a judgment sub-unit for judging whether the set number of iterations has been reached, and triggering the summary extraction sub-unit to stop the iteration once it has;
a key information segment acquisition sub-unit for collecting, after the summary extraction sub-unit stops iterating, all summary speech segments extracted so far and taking them as the key information segments.
The summary extraction sub-unit comprises:
a first computation sub-unit for computing the similarity between the recognized text of the current speech segment and the recognized text of the whole speech data to be played, obtaining a first value;
a second computation sub-unit for computing the similarity between the recognized text of the current segment and the recognized text of the summary segments already extracted, obtaining a second value;
a difference computation sub-unit for computing the difference between the first and second values, obtaining the summary score of the current segment;
a selection sub-unit for selecting, once the summary scores of all segments are obtained, the segment with the highest summary score as a summary speech segment.
The concrete computation of the summary score of a speech segment is described with formula (1) above and is not repeated here.
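Putting the four sub-units together, the iterative extraction can be sketched as follows. The two similarity functions are assumed to be supplied by the caller, and taking the maximum similarity to the already-selected segments as the second value is one plausible reading of the text.

```python
def extract_summary_segments(texts, sim_to_whole, sim_between, n_iterations):
    """Iteratively select summary segments by the score
        first value  = similarity(segment, whole transcript)
        second value = similarity(segment, segments already selected)
        score        = first value - second value
    picking the highest-scoring unselected segment each round.

    texts        -- recognized text of each segment
    sim_to_whole -- f(text) -> similarity to the whole transcript
    sim_between  -- f(text_a, text_b) -> similarity between two texts
    """
    selected = []
    for _ in range(n_iterations):
        best_i, best_score = None, float('-inf')
        for i, text in enumerate(texts):
            if i in selected:
                continue
            first = sim_to_whole(text)
            second = max((sim_between(text, texts[j]) for j in selected),
                         default=0.0)
            score = first - second
            if score > best_score:
                best_i, best_score = i, score
        selected.append(best_i)
    return selected
```

Note how the second term pushes each new pick away from segments already chosen, so the summary covers distinct parts of the transcript rather than repeating the single most central segment.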
The second determination module 232 determines from the voiceprint features of each speech segment whether the segment is a key information segment. Specifically, if the speech data to be played contains the speech of multiple speakers, the speech data of a specific speaker or of the main speaker may be taken as the key information segments.
For a specific speaker, a concrete structure of the second determination module 232 may comprise the following units:
a voiceprint feature extraction unit for extracting the voiceprint features of each speech segment when the speech data to be played contains the speech of multiple speakers;
a voiceprint recognition unit for determining, from the voiceprint features and the voiceprint model of the specific speaker, whether a segment is the speech of that speaker; if so, the segment is determined to be a key information segment.
For the main speaker, a concrete structure of the second determination module 232 may comprise the following units:
a voiceprint feature extraction unit for extracting the voiceprint features of each speech segment when the speech data to be played contains the speech of multiple speakers;
a speaker separation unit for separating the speech data of each speaker by applying speaker separation techniques to the voiceprint features, determining the main speaker, and taking the segments of the main speaker as the key information segments. Speaker separation may use the prior art, for example clustering algorithms.
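After separation assigns a speaker label to each segment, the main speaker can be chosen, for example, as the one with the most total speech time; choosing by duration rather than by segment count is an assumption, since the text does not specify the criterion.

```python
from collections import Counter

def main_speaker(segment_labels, segment_durations):
    """Return the speaker label with the largest total speaking time.

    segment_labels    -- speaker label per segment (e.g. from clustering)
    segment_durations -- duration of each segment, in seconds
    """
    totals = Counter()
    for label, duration in zip(segment_labels, segment_durations):
        totals[label] += duration
    return max(totals, key=totals.get)
```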
When the key information segment determination module 203 comprises both the first determination module 231 and the second determination module 232, the segments that meet both requirements may be taken as the key information segments. For example, the second determination module 232 may first perform voiceprint recognition with the voiceprint model of a specific speaker to identify that speaker's segments, and the first determination module 231 may then pick out, by preset keywords, those of the speaker's segments that contain the keywords, taking them as the key information segments. Alternatively, the first determination module 231 may first extract multiple summary segments from the recognized text of the speech data to be played, and the second determination module 232 may then perform voiceprint recognition on these summary segments with the voiceprint model of the specific speaker, identifying that speaker's segments among them and taking them as the key information segments.
In the embodiment of the present invention, the speaking-rate adjustment module 205 adjusts the speaking rate of the speech data to be played segment by segment. Different adjustment schemes may be used for different application demands. For example, a key information segment may be played at the normal rate and all other segments at the fast rate; alternatively, a key information segment may be played at the slow rate and all other segments at the normal or fast rate.
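The two adjustment schemes described for module 205 can be written as one mapping; the numeric rate multipliers and mode names are illustrative, not from the text.

```python
def playback_rate(is_key, mode='normal_key_fast_rest'):
    """Speaking-rate multiplier per segment under the two schemes:

    'normal_key_fast_rest' -- key segments at normal rate, others fast;
    'slow_key_normal_rest' -- key segments slowed, others at normal rate.
    """
    if mode == 'normal_key_fast_rest':
        return 1.0 if is_key else 1.3
    if mode == 'slow_key_normal_rest':
        return 0.9 if is_key else 1.0
    raise ValueError('unknown mode: ' + mode)
```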
The speech playing device of the embodiment of the present invention analyzes the speech data to be played at the content and/or voiceprint level, determines the key information segments, and adjusts the speaking rate of the speech data according to those segments during playback. It thereby helps the user find, quickly and accurately, the segments of interest or value in a large amount of speech data, for example the segments relevant to the topic of a meeting in a large amount of meeting recordings.
To further improve the playback effect and ensure the validity of the key information segments, another embodiment of the speech playing device of the present invention, shown in Fig. 3, further comprises a confidence acquisition module 206 for obtaining the confidence of each speech segment.
It should be noted that the confidence acquisition module 206 may obtain the posterior probability of each speech segment from the speech recognition process and use that posterior probability as the confidence of the segment.
Because the key information segment determination module 203 is implemented differently for different application demands, when it comprises the first determination module 231, which itself comprises the speech recognition unit and the determination unit, the confidence acquisition module 206 may be integrated with the key information segment determination module 203. Of course, when the key information segment determination module 203 comprises only the second determination module 232, the confidence acquisition module 206 may stand alone: it extracts the features of each speech segment, decodes the segment using the extracted features together with pre-trained acoustic and language models, obtains the posterior probability of each word in the segment, and averages the posterior probabilities of all words in the segment to obtain the posterior probability of the segment, which serves as its confidence.
That is, the structure shown in Fig. 3 is only a schematic drawn for ease of understanding the speech playing device of the present invention, not its physical structure in application; some modules may need adaptive adjustment according to the application demands.
Correspondingly, in this embodiment the speaking-rate adjustment module 205 adjusts the speaking rate of the speech data to be played according to both the key information segments and the confidence of each speech segment. Again, different adjustment schemes may be used for different application demands. For example: if the current speech segment is a key information segment, it is played at the fast rate when its confidence is greater than the second threshold and at the slow rate otherwise; if the current segment is a non-key information segment, it is skipped when its confidence is greater than the second threshold, and played at the slow rate when its confidence is less than or equal to the first threshold (the first threshold being smaller than the second).
Compared with the embodiment of Fig. 2, the speaking-rate adjustment of this embodiment is more versatile and flexible, better ensures the validity of the key information segments, further improves the playback effect, and meets varied application demands of users.
To further improve the quality of the speech data to be played and the listening experience of the user, another embodiment of the speech playing device of the present invention, shown in Fig. 4, may further comprise a signal analysis and processing module 207 for performing signal-level analysis on each speech segment and optimizing the segments according to the analysis results.
The signal analysis and processing module 207 comprises any one or more of the following units:
a volume analysis and processing unit for computing the amplitude of every frame of speech data frame by frame, turning down the amplitude of the current speech segment when the amplitudes of multiple consecutive frames exceed an upper limit, and turning it up when the amplitudes of multiple consecutive frames fall below a lower limit;
a reverberation analysis and processing unit for computing the reverberation time of each speech segment and performing dereverberation on the current segment when its reverberation time exceeds a threshold;
a noise analysis and processing unit for measuring the noise of each speech segment and denoising the current segment when its signal-to-noise ratio is below the SNR threshold.
The computation and detection of segment amplitude, reverberation time, and noise follow the description in the method embodiments above and are not repeated here.
It should be noted that the optimization performed by the signal analysis and processing module 207 may be applied to all speech segments of the speech data to be played or only to the key information segments, and may take place before or after the key information segments are determined; the embodiment of the present invention places no limit on this. In addition, practical applications are not limited to the structure of Fig. 4: another embodiment of the device of the present invention may comprise both the confidence acquisition module 206 and the signal analysis and processing module 207, with the concrete structure of each module adapted to the application demands.
Before playing speech data, the speech playing device of this embodiment analyzes the speech data at the content and/or voiceprint level and at the signal level. The key information segments of the speech data to be played are obtained from the content and/or voiceprint analysis; the signal-level analysis determines whether the speech data has quality problems and, if so, the speech data is optimized for those problems, improving its quality and the listening experience of the user.
The embodiments in this specification are described in a progressive manner; for identical or similar parts the embodiments may refer to one another, and each embodiment dwells on its differences from the others. In particular, the device embodiments, being substantially similar to the method embodiments, are described more briefly; for the relevant parts, refer to the description of the method embodiments. The device embodiments described above are only schematic: units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units, i.e. they may be located in one place or distributed over multiple network elements. Some or all of the modules may be selected according to actual needs to achieve the object of the embodiments. Those of ordinary skill in the art can understand and implement them without creative work.
The embodiments of the present invention have been described in detail above, and specific examples have been used herein to elaborate the invention; the description of the embodiments above only serves to help understand the method and device of the present invention. Meanwhile, those of ordinary skill in the art may, following the idea of the present invention, make changes to the specific embodiments and the scope of application. In summary, the contents of this description should not be construed as limiting the present invention.

Claims (21)

1. A speech playing method, characterized in that it comprises:
receiving speech data to be played;
performing endpoint detection on the speech data to be played to obtain the speech segments;
determining, from the speech content and/or voiceprint features of each speech segment, whether the segment is a key information segment;
when playing the speech data to be played, adjusting the speaking rate of the speech data according to the key information segments.
2. The method according to claim 1, characterized in that determining from the speech content of each speech segment whether the segment is a key information segment comprises:
performing speech recognition on each speech segment to obtain its recognized text;
determining, from the recognized text of each segment, whether the segment is a key information segment.
3. The method according to claim 2, characterized in that determining from the recognized text of each speech segment whether the segment is a key information segment comprises:
determining whether the recognized text of each segment contains a preset keyword;
if so, determining that the segment is a key information segment.
4. The method according to claim 2, characterized in that determining from the recognized text of each speech segment whether the segment is a key information segment comprises:
extracting summary speech segments from all segments iteratively and, after a set number of iterations, obtaining multiple summary segments, which serve as the key information segments.
5. The method according to claim 4, characterized in that extracting a summary speech segment from all segments comprises:
computing the similarity between the recognized text of the current speech segment and the recognized text of the speech data to be played, obtaining a first value;
computing the similarity between the recognized text of the current segment and the recognized text of the summary segments already extracted, obtaining a second value;
computing the difference between the first and second values, obtaining the summary score of the current segment;
once the summary scores of all segments are obtained, selecting the segment with the highest summary score as a summary speech segment.
6. The method according to claim 1, characterized in that determining from the voiceprint features of each speech segment whether the segment is a key information segment comprises:
if the speech data to be played contains the speech of multiple speakers, extracting the voiceprint features of each segment;
determining, from the voiceprint features and the voiceprint model of a specific speaker, whether the segment is the speech of that speaker;
if so, determining that the segment is a key information segment.
7. The method according to claim 1, characterized in that determining from the voiceprint features of each speech segment whether the segment is a key information segment comprises:
if the speech data to be played contains the speech of multiple speakers, determining the main speaker by speaker separation techniques;
taking the segments of the main speaker as the key information segments.
8. The method according to any one of claims 1 to 7, characterized in that adjusting the speaking rate of the speech data to be played according to the key information segments comprises:
if the current speech segment is a key information segment, playing it at the normal speaking rate, and otherwise at the fast speaking rate; or
if the current speech segment is a key information segment, playing it at the slow speaking rate, and otherwise at the normal or fast speaking rate.
9. The method according to any one of claims 1 to 7, characterized in that the method further comprises:
obtaining the confidence of each speech segment;
wherein adjusting the speaking rate of the speech data to be played is specifically: adjusting the speaking rate of the speech data according to the key information segments and the confidence of each speech segment.
10. The method according to claim 9, characterized in that adjusting the speaking rate of the speech data to be played according to the key information segments and the confidence of each speech segment comprises:
if the current speech segment is a key information segment, playing it at the fast speaking rate when its confidence is greater than a second threshold, and otherwise at the slow speaking rate;
if the current speech segment is a non-key information segment, skipping it when its confidence is greater than the second threshold, and playing it at the slow speaking rate when its confidence is less than or equal to a first threshold, the first threshold being smaller than the second threshold.
11. The method according to claim 1, characterized in that the method further comprises:
performing a signal-level analysis on each speech segment, the signal-level analysis comprising any one or more of: volume variation, reverberation, and noise conditions;
when playing the speech data to be played, optimizing the segments according to the analysis results, the optimization comprising any one or more of:
(1) if the amplitudes of multiple consecutive frames in the current speech segment exceed an upper limit, turning down the amplitude of the current segment; if the amplitudes of multiple consecutive frames fall below a lower limit, turning up the amplitude of the current segment;
(2) if the reverberation time of the current speech segment exceeds a threshold, performing dereverberation on the current segment;
(3) if the signal-to-noise ratio of the current speech segment is below the SNR threshold, denoising the current segment.
12. A speech playing device, characterized in that it comprises:
a receiving module for receiving speech data to be played;
an endpoint detection module for performing endpoint detection on the speech data to be played to obtain the speech segments;
a key information segment determination module, comprising a first determination module and/or a second determination module, the first determination module determining from the speech content of each speech segment whether the segment is a key information segment, and the second determination module determining from the voiceprint features of each speech segment whether the segment is a key information segment;
a playing module for playing the speech data to be played;
a speaking-rate adjustment module for adjusting the speaking rate of the speech data to be played according to the key information segments while the playing module plays the speech data.
13. The device according to claim 12, characterized in that the first determination module comprises:
a speech recognition unit for performing speech recognition on each speech segment to obtain its recognized text;
a determination unit for determining, from the recognized text of each segment, whether the segment is a key information segment.
14. The device according to claim 13, characterized in that:
the determination unit is specifically for determining whether the recognized text of each speech segment contains a preset keyword and, if so, determining that the segment is a key information segment.
15. The device according to claim 13, characterized in that the determination unit comprises:
an iteration setting sub-unit for setting the number of iterations;
a summary extraction sub-unit for extracting summary speech segments from all segments iteratively;
a judgment sub-unit for judging whether the set number of iterations has been reached, and triggering the summary extraction sub-unit to stop the iteration once it has;
a key information segment acquisition sub-unit for collecting, after the summary extraction sub-unit stops iterating, all summary speech segments extracted so far and taking them as the key information segments.
16. The device according to claim 15, wherein the summary extraction subunit comprises:
A first calculation subunit, configured to calculate the similarity between the recognized text of the current speech segment and the recognized text of the entire speech data to be played, obtaining a first calculated value;
A second calculation subunit, configured to calculate the similarity between the recognized text of the current speech segment and the recognized text of the already extracted summary speech segments, obtaining a second calculated value;
A difference calculation subunit, configured to calculate the difference between the first calculated value and the second calculated value, obtaining the summary score of the current speech segment;
A selection subunit, configured to select, after the summary scores of all speech segments have been obtained, the speech segment with the highest summary score as a summary speech segment.
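The scoring in claims 15–16 (similarity to the whole document minus similarity to the already chosen summaries) can be sketched as follows. The claims do not fix a similarity measure; the bag-of-words cosine used here is an assumption:

```python
# Sketch of the iterative summary extraction in claims 15-16. Each round
# scores every remaining segment as (similarity to the full recognized text)
# minus (similarity to the summaries chosen so far) and keeps the best one.
from collections import Counter
import math

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity over word-count vectors; 0.0 if either side is empty.
    num = sum(a[w] * b[w] for w in set(a) & set(b))
    den = (math.sqrt(sum(v * v for v in a.values()))
           * math.sqrt(sum(v * v for v in b.values())))
    return num / den if den else 0.0

def extract_summaries(segments, iterations):
    """segments: list of recognized-text strings; returns the chosen texts."""
    doc = Counter(w for s in segments for w in s.split())
    selected, remaining = [], list(segments)
    for _ in range(min(iterations, len(segments))):
        summary_bag = Counter(w for s in selected for w in s.split())
        scores = [cosine(Counter(s.split()), doc)
                  - cosine(Counter(s.split()), summary_bag)
                  for s in remaining]
        best = max(range(len(remaining)), key=scores.__getitem__)
        selected.append(remaining.pop(best))
    return selected
```

Subtracting the similarity to the existing summaries penalizes redundancy, so successive iterations favor segments that cover material not yet summarized.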
17. The device according to claim 12, wherein the second determination module comprises:
A voiceprint feature extraction unit, configured to extract the voiceprint features of each speech segment when the speech data to be played contains speech data from multiple speakers;
A voiceprint recognition unit, configured to determine, according to the voiceprint features and the voiceprint model of a specific speaker, whether the speech segment is speech data of the specific speaker, and if so, to determine that the speech segment is a key information segment.
18. The device according to claim 12, wherein the second determination module comprises:
A voiceprint feature extraction unit, configured to extract the voiceprint features of each speech segment when the speech data to be played contains speech data from multiple speakers;
A speaker separation unit, configured to determine the main speaker by applying speaker separation techniques to the voiceprint features, and to use the speech segments of the main speaker as key information segments.
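Claim 18 leaves the separation method and the definition of "main speaker" open. One plausible reading, sketched here under the assumption that segments already carry a speaker label (e.g. from clustering voiceprint embeddings), is to take the speaker with the largest total speaking time:

```python
# Hypothetical sketch: pick the "main speaker" of claim 18 as the labeled
# speaker with the greatest total duration, then keep only their segments.
from collections import defaultdict

def main_speaker_segments(segments):
    """segments: list of (speaker_id, duration_sec, audio) tuples."""
    total = defaultdict(float)
    for spk, dur, _ in segments:
        total[spk] += dur
    main = max(total, key=total.get)  # speaker with the most accumulated speech
    return [seg for seg in segments if seg[0] == main]
```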
19. The device according to any one of claims 12 to 18, wherein the speech rate adjustment module is specifically configured to: when the current speech segment is a key information segment, set its playback speech rate to normal, and otherwise set it to fast; or, when the current speech segment is a key information segment, set its playback speech rate to slow, and otherwise set it to normal or fast.
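The two alternatives of claim 19 amount to a mapping from segment type to a playback rate. The numeric factors below are illustrative assumptions, not values from the patent:

```python
# One reading of claim 19: key segments play at a slower or normal rate,
# non-key segments at a normal or faster rate. Factors are assumptions.
def playback_rate(is_key: bool, scheme: str = "normal/fast") -> float:
    """Return a time-stretch factor (1.0 = original speed)."""
    if scheme == "normal/fast":
        return 1.0 if is_key else 1.5   # key at normal speed, rest sped up
    if scheme == "slow/normal":
        return 0.8 if is_key else 1.0   # key slowed down, rest at normal speed
    raise ValueError(f"unknown scheme: {scheme}")
```

Either scheme realizes the claim's goal: the listener spends relatively more time on key information segments than on the rest of the recording.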
20. The device according to any one of claims 12 to 18, further comprising:
A confidence acquisition module, configured to obtain the confidence of each speech segment;
wherein the speech rate adjustment module is specifically configured to adjust the speech rate of the speech data to be played according to the key information segments and the confidence of each speech segment.
21. The device according to claim 12, further comprising:
A signal analysis and processing module, configured to analyze each speech segment at the speech-signal level and to optimize the speech segment according to the analysis results, the signal analysis and processing module comprising any one or more of the following units:
A volume analysis and processing unit, configured to calculate the amplitude of each frame of speech data frame by frame, to lower the amplitude of the current speech segment when the amplitudes of multiple consecutive frames within it exceed an upper limit, and to raise the amplitude of the current speech segment when the amplitudes of multiple consecutive frames within it fall below a lower limit;
A reverberation analysis and processing unit, configured to calculate the reverberation time of each speech segment and to perform reverberation cancellation on the current speech segment when its reverberation time exceeds a threshold;
A noise analysis and processing unit, configured to calculate the noise of each speech segment and to perform denoising on the current speech segment when its signal-to-noise ratio is greater than an SNR threshold.
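The per-frame amplitude check of the volume analysis unit can be sketched as below. The frame length, limits, and run length are illustrative assumptions; the claim only requires detecting runs of consecutive frames outside the allowed range:

```python
# Sketch of the volume analysis in claim 21: compute a per-frame amplitude
# and flag a run of consecutive frames below the lower limit or above the
# upper limit, which would trigger raising or lowering the segment volume.
def frames_out_of_range(samples, frame_len, lower, upper, min_run=3):
    """Return 'low', 'high', or None depending on whether min_run
    consecutive frame amplitudes fall below lower / above upper."""
    # Peak amplitude of each non-overlapping frame.
    amps = [max(abs(s) for s in samples[i:i + frame_len])
            for i in range(0, len(samples) - frame_len + 1, frame_len)]
    run_lo = run_hi = 0
    for a in amps:
        run_lo = run_lo + 1 if a < lower else 0
        run_hi = run_hi + 1 if a > upper else 0
        if run_lo >= min_run:
            return "low"   # segment amplitude should be raised
        if run_hi >= min_run:
            return "high"  # segment amplitude should be lowered
    return None
```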
CN201510757786.6A 2015-11-04 2015-11-04 Speech playing method and device Active CN105405439B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510757786.6A CN105405439B (en) 2015-11-04 2015-11-04 Speech playing method and device

Publications (2)

Publication Number Publication Date
CN105405439A true CN105405439A (en) 2016-03-16
CN105405439B CN105405439B (en) 2019-07-05

Family

ID=55470882

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510757786.6A Active CN105405439B (en) 2015-11-04 2015-11-04 Speech playing method and device

Country Status (1)

Country Link
CN (1) CN105405439B (en)

Cited By (26)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105869626A (en) * 2016-05-31 2016-08-17 宇龙计算机通信科技(深圳)有限公司 Automatic speech rate adjusting method and terminal
CN105933128A (en) * 2016-04-25 2016-09-07 四川联友电讯技术有限公司 Audio conference minute push method based on noise filtering and identity authentication
CN106098057A (en) * 2016-06-13 2016-11-09 北京云知声信息技术有限公司 Play word speed management method and device
CN106504744A (en) * 2016-10-26 2017-03-15 科大讯飞股份有限公司 A kind of method of speech processing and device
CN106782571A (en) * 2017-01-19 2017-05-31 广东美的厨房电器制造有限公司 The display methods and device of a kind of control interface
CN106910500A (en) * 2016-12-23 2017-06-30 北京第九实验室科技有限公司 The method and apparatus of Voice command is carried out to the equipment with microphone array
CN107369085A (en) * 2017-06-28 2017-11-21 深圳市佰仟金融服务有限公司 A kind of information output method, device and terminal device
CN107464563A (en) * 2017-08-11 2017-12-12 潘金文 A kind of interactive voice toy
WO2017210991A1 (en) * 2016-06-06 2017-12-14 中兴通讯股份有限公司 Method, device and system for voice filtering
CN107610718A (en) * 2017-08-29 2018-01-19 深圳市买买提乐购金融服务有限公司 A kind of method and device that voice document content is marked
CN107689227A (en) * 2017-08-23 2018-02-13 上海爱优威软件开发有限公司 A kind of voice de-noising method and system based on data fusion
WO2018036555A1 (en) * 2016-08-25 2018-03-01 腾讯科技(深圳)有限公司 Session processing method and apparatus
CN108597521A (en) * 2018-05-04 2018-09-28 徐涌 Audio role divides interactive system, method, terminal and the medium with identification word
CN109119093A (en) * 2018-10-30 2019-01-01 Oppo广东移动通信有限公司 Voice de-noising method, device, storage medium and mobile terminal
CN109147802A (en) * 2018-10-22 2019-01-04 珠海格力电器股份有限公司 A kind of broadcasting word speed adjusting method and device
CN109256153A (en) * 2018-08-29 2019-01-22 北京云知声信息技术有限公司 A kind of sound localization method and system
CN109979467A (en) * 2019-01-25 2019-07-05 出门问问信息科技有限公司 Voice filter method, device, equipment and storage medium
CN109979474A (en) * 2019-03-01 2019-07-05 珠海格力电器股份有限公司 Speech ciphering equipment and its user speed modification method, device and storage medium
CN110534127A (en) * 2019-09-24 2019-12-03 华南理工大学 Applied to the microphone array voice enhancement method and device in indoor environment
CN110709921A (en) * 2018-05-28 2020-01-17 深圳市大疆创新科技有限公司 Noise reduction method and device and unmanned aerial vehicle
CN111899742A (en) * 2020-08-06 2020-11-06 广州科天视畅信息科技有限公司 Method and system for improving conference efficiency
WO2020233068A1 (en) * 2019-05-21 2020-11-26 深圳壹账通智能科技有限公司 Conference audio control method, system, device and computer readable storage medium
CN112133279A (en) * 2019-06-06 2020-12-25 Tcl集团股份有限公司 Vehicle-mounted information broadcasting method and device and terminal equipment
CN113364665A (en) * 2021-05-24 2021-09-07 维沃移动通信有限公司 Information broadcasting method and electronic equipment
CN113571054A (en) * 2020-04-28 2021-10-29 中国移动通信集团浙江有限公司 Speech recognition signal preprocessing method, device, equipment and computer storage medium
CN114566164A (en) * 2022-02-23 2022-05-31 成都智元汇信息技术股份有限公司 Manual broadcast audio self-adaption method, display terminal and system based on public transport

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1787070A (en) * 2005-12-09 2006-06-14 北京凌声芯语音科技有限公司 Chip upper system for language learner
KR20080039704A (en) * 2006-11-01 2008-05-07 엘지전자 주식회사 Portable audio player and controlling method thereof
CN103377203A (en) * 2012-04-18 2013-10-30 宇龙计算机通信科技(深圳)有限公司 Terminal and sound record management method
CN103400580A (en) * 2013-07-23 2013-11-20 华南理工大学 Method for estimating importance degree of speaker in multiuser session voice
CN103929539A (en) * 2014-04-10 2014-07-16 惠州Tcl移动通信有限公司 Mobile terminal notepad processing method and system based on voice recognition
CN104125340A (en) * 2014-07-25 2014-10-29 广东欧珀移动通信有限公司 Generating managing method and system for call sound recording files
CN104505108A (en) * 2014-12-04 2015-04-08 广东欧珀移动通信有限公司 Information positioning method and terminal

Similar Documents

Publication Publication Date Title
CN105405439A (en) Voice playing method and device
CN110709924B (en) Audio-visual speech separation
US10013977B2 (en) Smart home control method based on emotion recognition and the system thereof
Reddy et al. An individualized super-Gaussian single microphone speech enhancement for hearing aid users with smartphone as an assistive device
CN105304080B (en) Speech synthetic device and method
CN110211575B (en) Voice noise adding method and system for data enhancement
CN105405448B (en) A kind of sound effect treatment method and device
WO2020181824A1 (en) Voiceprint recognition method, apparatus and device, and computer-readable storage medium
CN102568478B (en) Video play control method and system based on voice recognition
CN107464563B (en) Voice interaction toy
CN112270933B (en) Audio identification method and device
Sadjadi et al. Blind spectral weighting for robust speaker identification under reverberation mismatch
WO2023184942A1 (en) Voice interaction method and apparatus and electric appliance
JP2023548157A (en) Other speaker audio filtering from calls and audio messages
CN112201262A (en) Sound processing method and device
Bhat et al. Formant frequency-based speech enhancement technique to improve intelligibility for hearing aid users with smartphone as an assistive device
CN115862658A (en) System and method for extracting target speaker voice
Zhao et al. Time-Domain Target-Speaker Speech Separation with Waveform-Based Speaker Embedding.
Lin et al. Focus on the sound around you: Monaural target speaker extraction via distance and speaker information
CN111933146A (en) Speech recognition system and method
CN111833869A (en) Voice interaction method and system applied to urban brain
CN106790963A (en) The control method and device of audio signal
CN116631406B (en) Identity feature extraction method, equipment and storage medium based on acoustic feature generation
Hoover et al. The consonant-weighted envelope difference index (cEDI): A proposed technique for quantifying envelope distortion
Jakubec et al. An Overview of Automatic Speaker Recognition in Adverse Acoustic Environment

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant