CN106356077B - Laugh detection method and device - Google Patents
Laugh detection method and device
- Publication number: CN106356077B
- Application number: CN201610755283.XA
- Authority
- CN
- China
- Prior art keywords
- frame
- speech frame
- speech
- laugh
- voice signal
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/48—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
- G10L25/51—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
- G10L25/63—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
Abstract
The embodiment of the invention discloses a laugh detection method and device. The method is applied to an electronic device and comprises: for a speech signal to be detected, dividing the speech signal to be detected into multiple speech frames, and obtaining the pitch frequency and multidimensional speech feature parameters of each speech frame; predicting whether each speech frame is a laugh frame according to a pre-trained laugh detection model and the obtained pitch frequency and multidimensional speech feature parameters of each speech frame; counting, among a first preset number of speech frames adjacent to the current speech frame, the number of speech frames whose prediction result is a laugh frame; and when that number is greater than a preset quantity threshold, determining the current speech frame to be a laugh frame. Since the detection of each frame in the speech is assisted by a vote over a preset number of adjacent frames, the embodiments of the present invention improve the accuracy of laugh detection and help a user obtain the laugh information in a multimedia file promptly and accurately.
Description
Technical field
The present invention relates to the technical field of audio and video processing, and in particular to a laugh detection method and device.
Background technique
With the rapid development of the economy, the public's expectations for everyday experiences keep rising, and an advanced speech detection system is an effective way to improve those experiences. A laugh detection system is a key part of a speech detection system: it can extract the highlights in speech, reducing the workload of audio editing while improving its accuracy, and it can also detect changes of mood from laughter, so that differentiated experience schemes can be formulated.
At present, when performing laugh detection, the laugh detection field judges whether each speech frame is a laugh frame according to a laugh detection model, and determines the result for each frame from that judgment alone. The laugh detection model is obtained by training and has a relatively high detection accuracy; however, a scheme in which the model alone determines whether each frame is a laugh frame is still largely limited by the accuracy of the laugh detection model.
Summary of the invention
The embodiment of the invention discloses a laugh detection method and device, so as to improve the accuracy of laugh detection.
In order to achieve the above objective, the embodiment of the invention discloses a laugh detection method, applied to an electronic device, the method comprising:
for a speech signal to be detected, dividing the speech signal to be detected into multiple speech frames, and obtaining the pitch frequency and multidimensional speech feature parameters of each speech frame;
predicting whether each speech frame is a laugh frame according to a pre-trained laugh detection model and the obtained pitch frequency and multidimensional speech feature parameters of each speech frame;
counting, among a first preset number of speech frames adjacent to the current speech frame, the number of speech frames whose prediction result is a laugh frame;
when the number is greater than a preset quantity threshold, determining the current speech frame to be a laugh frame.
Further, the training process of the laugh detection model comprises:
for each speech signal in a training set, dividing the speech signal into multiple speech frames;
obtaining the pitch frequency and multidimensional speech feature parameters of each speech frame;
identifying whether each speech frame is a laugh frame; if so, adding a first label to the speech frame, and otherwise adding a second label to the speech frame;
inputting the pitch frequency and multidimensional speech feature parameters of the labelled speech frames into the laugh detection model, and training the laugh detection model.
Further, the training of the laugh detection model comprises:
training the laugh detection model using a support vector machine (SVM) method; or
training the laugh detection model using an extreme learning machine (ELM) method.
Further, for the speech signal to be detected, dividing the speech signal to be detected into multiple speech frames comprises:
performing pre-emphasis processing on the speech signal, and dividing the pre-processed speech signal to be detected into multiple speech frames.
Further, after dividing the speech signal to be detected into multiple speech frames, and before obtaining the pitch frequency and multidimensional speech feature parameters of each speech frame, the method further comprises:
performing endpoint detection on each speech frame, and removing the noise frames and silent frames among the speech frames.
Further, counting, among the first preset number of speech frames adjacent to the current speech frame, the number of speech frames whose prediction result is a laugh frame comprises:
identifying the position of the current speech frame, and judging whether the current speech frame is located at the front end of the speech signal;
if so, counting, among the first preset number of speech frames after the current speech frame, the number of speech frames whose prediction result is a laugh frame;
if not, judging whether the current speech frame is located at the rear end of the speech signal;
if so, counting, among the first preset number of speech frames before the current speech frame, the number of speech frames whose prediction result is a laugh frame; otherwise, counting, among a fourth preset number of speech frames before the current speech frame and a fifth preset number of speech frames after the current speech frame, the number of speech frames whose prediction result is a laugh frame, wherein the sum of the fourth preset number and the fifth preset number is the first preset number.
On the other hand, the embodiment of the invention discloses a laugh detection device, the device comprising:
a dividing and obtaining module, configured to, for a speech signal to be detected, divide the speech signal to be detected into multiple speech frames, and obtain the pitch frequency and multidimensional speech feature parameters of each speech frame;
a prediction module, configured to predict whether each speech frame is a laugh frame according to a pre-trained laugh detection model and the obtained pitch frequency and multidimensional speech feature parameters of each speech frame;
a recognition and detection module, configured to count, among a first preset number of speech frames adjacent to the current speech frame, the number of speech frames whose prediction result is a laugh frame, and when the number is greater than a preset quantity threshold, determine the current speech frame to be a laugh frame.
Further, the device further comprises:
a training module, configured to, for each speech signal in a training set, divide the speech signal into multiple speech frames; obtain the pitch frequency and multidimensional speech feature parameters of each speech frame; identify whether each speech frame is a laugh frame, and if so, add a first label to the speech frame, otherwise add a second label to the speech frame; and input the pitch frequency and multidimensional speech feature parameters of the labelled speech frames into the laugh detection model to train the laugh detection model.
Further, the dividing and obtaining module is specifically configured to perform pre-emphasis processing on the speech signal, and divide the pre-processed speech signal to be detected into multiple speech frames;
the device further comprises:
a filtering module, configured to perform endpoint detection on each speech frame and remove the noise frames and silent frames among the speech frames.
Further, the recognition and detection module is specifically configured to identify the position of the current speech frame and judge whether the current speech frame is located at the front end of the speech signal; if so, count, among the first preset number of speech frames after the current speech frame, the number of speech frames whose prediction result is a laugh frame; if not, judge whether the current speech frame is located at the rear end of the speech signal; if so, count, among the first preset number of speech frames before the current speech frame, the number of speech frames whose prediction result is a laugh frame; otherwise, count, among a fourth preset number of speech frames before the current speech frame and a fifth preset number of speech frames after the current speech frame, the number of speech frames whose prediction result is a laugh frame, wherein the sum of the fourth preset number and the fifth preset number is the first preset number.
The embodiment of the invention provides a laugh detection method and device. The method is applied to an electronic device and comprises: for a speech signal to be detected, dividing the speech signal to be detected into multiple speech frames, and obtaining the pitch frequency and multidimensional speech feature parameters of each speech frame; predicting whether each speech frame is a laugh frame according to a pre-trained laugh detection model and the obtained pitch frequency and multidimensional speech feature parameters of each speech frame; counting, among a first preset number of speech frames adjacent to the current speech frame, the number of speech frames whose prediction result is a laugh frame; and when that number is greater than a preset quantity threshold, determining the current speech frame to be a laugh frame. Since, in the embodiments of the present invention, whether the current frame is a laugh frame is determined jointly by the current speech frame and the first preset number of adjacent speech frames, the error rate introduced by the laugh detection model is weakened to a certain extent and the continuity of laughter is fully taken into account, so that the laugh detection result is more accurate.
Detailed description of the invention
In order to describe the technical solutions in the embodiments of the present invention more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings in the following description are only some embodiments of the invention; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is the detection process of a laugh detection method provided by an embodiment of the present invention;
Fig. 2A and Fig. 2B are schematic diagrams, provided by an embodiment of the present invention, of the position of the frame currently being detected in a speech signal;
Fig. 3 is a schematic structural diagram of a laugh detection device provided by an embodiment of the present invention.
Specific embodiment
In order to improve the accuracy of laugh detection, the embodiments of the invention provide a laugh detection method and device.
The technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments in the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
Fig. 1 shows the detection process of a laugh detection method provided by an embodiment of the present invention. The process comprises the following steps:
S101: for a speech signal to be detected, divide the speech signal to be detected into multiple speech frames, and obtain the pitch frequency and multidimensional speech feature parameters of each speech frame.
The laugh detection method provided by the embodiment of the invention is applied to an electronic device, which can be an audio acquisition device, such as a recording pen or a recorder, or a device such as a mobile phone, tablet computer or PC.
Specifically, frame division is performed on the speech signal to be detected, dividing it into multiple speech frames, and the pitch frequency and multidimensional speech feature parameters of each speech frame are obtained.
Dividing a speech signal into multiple speech frames and obtaining the pitch frequency and multidimensional speech feature (MFCC) parameters of each speech frame belong to the prior art, and the process is not explained in the embodiments of the present invention. The same applies in each of the following embodiments, and such prior-art steps are not repeated one by one.
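The framing and pitch-frequency steps of S101 can be sketched as follows. This is a minimal illustration under assumed parameters that the patent does not specify (16 kHz audio, 25 ms frames with a 10 ms hop, autocorrelation pitch estimation); the multidimensional MFCC features mentioned above would in practice come from a speech-processing library rather than these few lines.

```python
import numpy as np

def frame_signal(x, frame_len=400, hop=160):
    """Split a 1-D signal into overlapping frames (25 ms / 10 ms at 16 kHz)."""
    n = 1 + max(0, (len(x) - frame_len) // hop)
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n)])

def pitch_autocorr(frame, sr=16000, fmin=50, fmax=500):
    """Estimate the pitch (fundamental) frequency by autocorrelation peak picking."""
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo, hi = int(sr / fmax), int(sr / fmin)   # plausible lag range for speech
    lag = lo + int(np.argmax(ac[lo:hi]))
    return sr / lag

sr = 16000
t = np.arange(sr) / sr
x = np.sin(2 * np.pi * 120.0 * t)      # one second of a synthetic 120 Hz tone
frames = frame_signal(x)
f0 = pitch_autocorr(frames[0], sr)
print(frames.shape, round(f0))         # (98, 400) and a value close to 120
```

A real detector would compute the pitch and MFCC vector for every frame and feed them to the model of S102.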
S102: predict whether each speech frame is a laugh frame according to a pre-trained laugh detection model and the obtained pitch frequency and multidimensional speech feature parameters of each speech frame.
In the embodiments of the present invention, the laugh detection model is trained in advance, and the training is completed according to the pitch frequency and multidimensional speech feature parameters of each speech frame of each speech signal in a training set. The trained laugh detection model can predict, from the pitch frequency and multidimensional speech feature parameters of an input speech frame, whether that speech frame is a laugh frame. Specifically, the laugh detection model outputs a corresponding result for the speech frame, namely whether the speech frame is a laugh frame or a non-laugh frame; in the embodiments of the present invention, the output of the laugh detection model for each speech frame is taken as the prediction result corresponding to that speech frame.
S103: among a first preset number of speech frames adjacent to the current speech frame, count the number of speech frames whose prediction result is a laugh frame.
Specifically, in the embodiments of the present invention, when detecting whether the current speech frame is a laugh frame, the determination is made according to the speech frame itself and the number of laugh-frame predictions among the first preset number of speech frames adjacent to it. The first preset number is an integer not less than 1, and can for example be 2, 3, 10, 20, and so on.
When determining the first preset number of speech frames adjacent to the current speech frame in the embodiment of the present invention, the first preset number of speech frames before the current speech frame can be taken as the adjacent speech frames; the first preset number of speech frames after the current speech frame can also be taken as the adjacent speech frames; or speech frames both before and after the current speech frame can be taken as the adjacent speech frames. The split between frames before and after the current speech frame is not limited, as long as the total number of adjacent speech frames is the first preset number.
For example, if the first preset number is 20 and the number of the current speech frame is 060, then the speech frames numbered 040-059 before the current speech frame can be taken as the adjacent speech frames; the speech frames numbered 061-080 after the current speech frame can also be taken as the adjacent speech frames; or the speech frames numbered 055-059 before the current speech frame together with those numbered 061-075 after it can be taken as the adjacent speech frames. Naturally, other splits are also possible, such as 10 frames before and 10 frames after the current speech frame, or 13 frames before and 7 frames after, and so on; the split can be selected arbitrarily when the determination is made.
S104: when the number is greater than a preset quantity threshold, determine the current speech frame to be a laugh frame.
The quantity threshold is set according to the first preset number; when the number of speech frames whose prediction result is a laugh frame is greater than the set quantity threshold, the current speech frame is determined to be a laugh frame. For example, suppose the first preset number is 40 frames and the quantity threshold is 20. Among the 40 frames adjacent to the current speech frame, 25 are predicted to be laugh frames; the prediction result of the current speech frame itself is also a laugh frame, so the count of laugh frames is 26, which is greater than the quantity threshold of 20, and the current speech frame is determined to be a laugh frame.
Since, in the embodiments of the present invention, whether the current frame is a laugh frame is determined jointly by the current speech frame and the first preset number of adjacent speech frames, the error rate introduced by the laugh detection model is weakened to a certain extent and the continuity of laughter is fully taken into account, so that the laugh detection result is more accurate.
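The voting rule of S103-S104 can be sketched with the numbers of the worked example above (a window of 40 adjacent frames and a quantity threshold of 20). The function name and the symmetric split of the window are illustrative choices, not fixed by the patent.

```python
def is_laugh_frame(preds, i, window=40, threshold=20):
    """Decide frame i by voting over `window` adjacent frames.

    `preds` holds the per-frame model outputs (1 = laugh, 0 = not).
    Half the window is taken on each side where possible, one of the
    neighbourhood choices the description allows."""
    half = window // 2
    lo = max(0, i - half)
    hi = min(len(preds), i + half + 1)
    # the current frame's own prediction is included in the count,
    # as in the worked example (25 neighbours + itself = 26)
    count = sum(preds[lo:hi])
    return count > threshold

preds = [1] * 25 + [0] * 15   # 25 of the 40 adjacent frames predicted laugh
preds.insert(20, 1)           # current frame at index 20, itself predicted laugh
print(is_laugh_frame(preds, 20))   # count = 26 > 20 -> True
```

Because laughter is continuous over many frames, a single misprediction by the model is outvoted by its neighbours, which is exactly the accuracy gain the paragraph above claims.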
In the embodiment of the present invention, the laugh detection model is obtained by training according to each speech signal in a training set. Specifically, in one embodiment of the invention, the training process of the laugh detection model comprises:
for each speech signal in the training set, dividing the speech signal into multiple speech frames;
obtaining the pitch frequency and multidimensional speech feature parameters of each speech frame;
identifying whether each speech frame is a laugh frame; if so, adding a first label to the speech frame, and otherwise adding a second label to the speech frame;
inputting the pitch frequency and multidimensional speech feature parameters of the labelled speech frames into the laugh detection model, and training the laugh detection model.
Specifically, the training set contains a large number of speech signals, whose lengths may be identical or different. For each speech signal in the training set, frame division is performed, dividing the speech signal into multiple speech frames.
The pitch frequency and multidimensional speech feature parameters of each speech frame are obtained; according to whether each speech frame is a laugh frame, a first label is added to laugh frames and a second label is added to frames that are not laugh frames. The added label, together with the pitch frequency and multidimensional speech feature parameters of each speech frame, is input into the laugh detection model, and the laugh detection model is trained. The specific process of training the laugh detection model belongs to the prior art and is not repeated in the embodiments of the present invention.
After the training of the laugh detection model is completed, once the pitch frequency and multidimensional speech feature parameters of each speech frame of the speech signal to be detected are input into it, the laugh detection model can identify whether each speech frame is a laugh frame: when a speech frame is a laugh frame, the corresponding output result carries the first label, and when a speech frame is not a laugh frame, the corresponding output result carries the second label.
In the embodiments of the present invention, training the laugh detection model comprises:
training the laugh detection model using a support vector machine (Support Vector Machine, SVM) method; or
training the laugh detection model using an extreme learning machine (Extreme Learning Machine, ELM) method.
The support vector machine and extreme learning machine methods used in the embodiments of the present invention belong to the prior art and are not explained here. In order to improve training efficiency without reducing detection accuracy, the laugh detection model in the embodiments of the present invention can be trained using the ELM method.
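As a hedged illustration of the ELM option, a minimal extreme learning machine fits in a few lines of numpy: a random, untrained hidden layer followed by a closed-form least-squares fit of the output weights, which is why ELM training is fast. This is a generic ELM sketch on toy two-class data, not the patent's actual model, features, or hyperparameters.

```python
import numpy as np

def train_elm(X, y, hidden=50, seed=0):
    """Random hidden layer + least-squares output weights (the ELM recipe)."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], hidden))   # hidden weights, never trained
    b = rng.normal(size=hidden)
    H = np.tanh(X @ W + b)                      # hidden activations
    beta, *_ = np.linalg.lstsq(H, y, rcond=None)  # closed-form output layer
    return W, b, beta

def elm_predict(X, W, b, beta):
    return (np.tanh(X @ W + b) @ beta > 0.5).astype(int)

# toy stand-in for labelled frames: "laugh" frames have shifted feature values
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 13)), rng.normal(3, 1, (100, 13))])
y = np.array([0] * 100 + [1] * 100)             # second label / first label
W, b, beta = train_elm(X, y)
acc = (elm_predict(X, W, b, beta) == y).mean()
print(acc > 0.9)   # separable toy data -> high training accuracy
```

An SVM would be trained on the same labelled feature matrix; ELM simply trades iterative optimisation for one linear solve.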
In order to improve detection efficiency and detection accuracy, on the basis of the above embodiment of the present invention, in another embodiment of the present invention, for the speech signal to be detected, dividing the speech signal to be detected into multiple speech frames comprises:
performing pre-emphasis processing on the speech signal, and dividing the pre-processed speech signal to be detected into multiple speech frames;
and after the speech signal to be detected is divided into multiple speech frames, and before the pitch frequency and multidimensional speech feature parameters of each speech frame are obtained, the method further comprises:
performing endpoint detection on each speech frame, and removing the noise frames and silent frames among the speech frames.
Specifically, in order to facilitate frame division and eliminate the influence of word length on the speech signal, pre-emphasis processing is first performed on the speech signal before frame division: the speech signal is passed through a first-order finite impulse response (FIR) high-pass filter to flatten its spectrum, and frame division is then performed on the processed signal, dividing it into multiple speech frames. The processes of pre-emphasis and frame division belong to the prior art and are not repeated in the embodiments of the present invention.
After the speech signal is divided into multiple speech frames, endpoint detection is performed on each speech frame to find the starting and ending points of speech, so as to remove the noise frames and silent frames. Performing endpoint detection on speech frames and removing noise frames and silent frames belong to the prior art, and the process is not explained in the embodiments of the present invention.
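The pre-emphasis filter and a crude energy-based endpoint detector can be sketched as follows. The filter coefficient 0.97 and the energy-ratio rule are common illustrative choices; the patent leaves both steps to the prior art, so treat this as one plausible realisation rather than the described method.

```python
import numpy as np

def pre_emphasis(x, alpha=0.97):
    """First-order FIR high-pass: y[n] = x[n] - alpha * x[n-1],
    the usual spectral-flattening step before framing."""
    return np.append(x[0], x[1:] - alpha * x[:-1])

def drop_silent_frames(frames, ratio=0.1):
    """Toy endpoint detection: discard frames whose energy is below
    `ratio` of the loudest frame (a proxy for silence/noise frames)."""
    energy = (frames ** 2).sum(axis=1)
    return frames[energy > ratio * energy.max()]

# silence - tone - silence, framed into 6 non-overlapping 400-sample frames
x = np.concatenate([np.zeros(800), np.sin(np.linspace(0, 200, 800)), np.zeros(800)])
frames = pre_emphasis(x).reshape(-1, 400)
kept = drop_silent_frames(frames)
print(len(frames), len(kept))   # 6 frames in, 2 voiced frames kept
```

Real endpoint detection would also use zero-crossing rate and adaptive thresholds, but the effect is the same: only speech-bearing frames reach the feature extraction of S101.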
The embodiments of the present invention fully consider the continuity of laughter: during detection, for the current frame to be detected, whether the current frame is a laugh frame is determined according to the prediction results of the current frame and of the first preset number of speech frames adjacent to it. Specifically, on the basis of the embodiment shown in Fig. 1, in another embodiment of the present invention, counting, among the first preset number of speech frames adjacent to the current speech frame, the number of speech frames whose prediction result is a laugh frame comprises:
identifying the position of the current speech frame, and judging whether the current speech frame is located at the front end of the speech signal;
if so, counting, among the first preset number of speech frames after the current speech frame, the number of speech frames whose prediction result is a laugh frame;
if not, judging whether the current speech frame is located at the rear end of the speech signal;
if so, counting, among the first preset number of speech frames before the current speech frame, the number of speech frames whose prediction result is a laugh frame; otherwise, counting, among a fourth preset number of speech frames before the current speech frame and a fifth preset number of speech frames after the current speech frame, the number of speech frames whose prediction result is a laugh frame, wherein the sum of the fourth preset number and the fifth preset number is the first preset number.
The above embodiment fully considers the continuity of laughter. For each speech frame, if whether the current frame is a laugh frame is determined according to the prediction results of the first preset number of speech frames before and after it, the detection of the current frame can be realized accurately, and the influence of the detection accuracy of the detection model can be reduced. However, if the current speech frame is located near the beginning of the speech signal, there is no corresponding number of speech frames before it; therefore, during detection, the manner of counting the laugh-frame predictions among the first preset number of adjacent speech frames needs to be determined according to the position of the current speech frame within the speech signal.
When identifying the position, because each speech frame carries corresponding identification information after the speech signal is divided into frames, the frames can be identified according to their time order; the identification information can be the number of the speech frame, and the total number of speech frames into which the speech signal is divided is also known. Therefore, according to the identification information of the current speech frame, it can be determined whether the current speech frame is located at the front end or the rear end of the speech signal. When specifically dividing the front end and rear end, the range of identification information belonging to the front end can be set; for example, the speech frames whose identification information lies in the range 000-020 are taken as the frames at the front end of the speech signal, and the speech frames whose identification information lies in the range A-B are taken as the frames at the rear end, where B is the identification information of the last speech frame of the speech signal and A is the identification information of the last speech frame minus 15 or some other value.
In addition, since endpoint detection has removed the noise frames and silent frames, the speech signal may be discontinuous; the above manner is still able to detect whether each speech frame is a laugh frame. However, in order to further improve the detection accuracy, and because silent frames in a speech signal generally occur consecutively and their identification information can be known in advance, when detecting the speech frames, a speech frame located before a run of silent frames can also be treated as a speech frame at the rear end of the speech signal, and a speech frame located after a run of silent frames can also be treated as a speech frame at the front end of the speech signal.
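The position-dependent choice of adjacent frames described above can be sketched as below. The front/rear boundary values mirror the illustrative ranges in the description (the first 20 frames, the last 15 frames) and, like the even split of the window, are assumptions rather than values fixed by the patent.

```python
def neighbour_indices(i, n_frames, window=40, front=20, back=15):
    """Choose which adjacent frames vote for frame i.

    Front end of the signal: look only forward; rear end: look only
    backward; otherwise split the window around the frame (the fourth
    and fifth preset numbers, summing to `window`)."""
    if i < front:                         # front end: frames after i
        return list(range(i + 1, min(i + 1 + window, n_frames)))
    if i >= n_frames - back:              # rear end: frames before i
        return list(range(max(0, i - window), i))
    before, after = window // 2, window - window // 2
    return list(range(i - before, i)) + list(range(i + 1, i + 1 + after))

idx = neighbour_indices(60, 1000, window=40)
print(len(idx), idx[0], idx[-1])   # 40 neighbours, from frame 40 to frame 80
```

The same rule can be applied around runs of silent frames by treating the frame before a run as a rear-end frame and the frame after it as a front-end frame, as the paragraph above suggests.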
Fig. 2A and Fig. 2B are schematic diagrams, provided by an embodiment of the invention, of the position of the frame currently being detected in a speech signal.
The above embodiment of the present invention is illustrated in conjunction with Fig. 2A and Fig. 2B. When identifying the position, because each speech frame carries corresponding identification information after the speech signal is divided into frames, the frames can be identified according to their time order; the identification information can be the number of the speech frame, and the total number of speech frames into which the speech signal is divided is also known. The range of identification information belonging to the front end and the range belonging to the rear end can therefore be set according to the identification information of the speech frames and the total number of speech frames.
As shown in Fig. 2A, the shaded region M can be the speech frames whose identification information lies in the range 000-020, taken as the frames at the front end of the speech signal; alternatively, the range 000-015, or the range from 000 to some other value, can be used. The shaded region N can be the speech frames whose identification information lies in the range A-B, taken as the frames at the rear end of the speech signal, where B is the identification information of the last speech frame and A is the identification information of the last speech frame minus 15 or some other value. The range L is the intermediate range excluding the front-end range and the rear-end range.
As shown in Fig. 2B, silent frames exist in the speech signal. Since silent frames generally occur consecutively and their identification information can be known in advance, when detecting the speech frames, a speech frame located before a run of silent frames can be treated as a speech frame at the rear end of the speech signal, and a speech frame located after a run of silent frames can be treated as a speech frame at the front end. In the figure, O and Q can be regarded as front ends of the speech signal, corresponding to M in Fig. 2A; P and R can be regarded as rear ends, corresponding to N in Fig. 2A; and S and T can be regarded as the intermediate ranges excluding the front-end and rear-end ranges, corresponding to L in Fig. 2A.
Fig. 3 is a schematic structural diagram of a laugh detection device provided by an embodiment of the present invention. The device is applied to an electronic apparatus and includes:
a division and acquisition module 32, configured to divide, for a voice signal to be detected, the voice signal to be detected into multiple speech frames, and obtain the fundamental frequency and multidimensional speech feature parameters of each speech frame;
a prediction module 33, configured to predict, according to a laugh detection model trained in advance and the fundamental frequency and multidimensional speech feature parameters of each speech frame, whether each speech frame is a laugh frame;
a recognition and detection module 34, configured to count, among a first set quantity of speech frames adjacent to the current speech frame, the number of speech frames whose prediction result is a laugh frame, and to determine the current speech frame to be a laugh frame when that number is greater than a set quantity threshold.
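The counting-and-threshold decision of the recognition and detection module can be sketched as follows. This is a minimal Python illustration, not the patented implementation; the window size, threshold, and per-frame predictions are hypothetical placeholders rather than values fixed by the patent:

```python
def smooth_laugh_decision(predictions, idx, window=10, threshold=6):
    """Decide whether frame `idx` is a laugh frame by counting laugh
    predictions (0/1 per frame) among `window` adjacent frames."""
    half_before = window // 2
    half_after = window - half_before
    start = max(0, idx - half_before)
    end = min(len(predictions), idx + half_after + 1)
    # Count neighbouring frames predicted as laugh, excluding the frame itself.
    neighbours = [p for i, p in enumerate(predictions[start:end], start) if i != idx]
    laugh_count = sum(neighbours)
    # The current frame is declared a laugh frame only if the count
    # exceeds the set quantity threshold.
    return laugh_count > threshold

predictions = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 0, 0]
print(smooth_laugh_decision(predictions, 5, window=10, threshold=6))
```

Because the decision pools evidence from neighbouring frames, a single misprediction by the per-frame model does not by itself flip the final result, which is the continuity argument the description makes.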
The device further includes:
a training module 31, configured to: for each voice signal in a training set, divide the voice signal into multiple speech frames; obtain the fundamental frequency and multidimensional speech feature parameters of each speech frame; identify whether each speech frame is a laugh frame and, if so, add a first label to the speech frame, otherwise add a second label to the speech frame; and input the fundamental frequency and multidimensional speech feature parameters of the labelled speech frames into the laugh detection model so as to train the laugh detection model.
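The labelling-and-training loop of the training module can be sketched as below. For self-containment this uses a trivial one-dimensional threshold learner in place of the SVM or ELM the description mentions, and the feature values are synthetic; a real implementation would substitute a proper classifier:

```python
def train_laugh_model(frames):
    """frames: list of (features, is_laugh) pairs, where features[0]
    is the frame's fundamental frequency. Adds labels (1 = first/laugh
    label, 0 = second/non-laugh label) and fits a toy threshold model."""
    labelled = [(feats, 1 if is_laugh else 0) for feats, is_laugh in frames]
    laugh_f0 = [f[0] for f, lab in labelled if lab == 1]
    other_f0 = [f[0] for f, lab in labelled if lab == 0]
    # Toy learner: threshold midway between the class means of the
    # fundamental frequency (stand-in for SVM/ELM training).
    threshold = (sum(laugh_f0) / len(laugh_f0) + sum(other_f0) / len(other_f0)) / 2
    return lambda feats: 1 if feats[0] > threshold else 0

training = [([260.0, 0.4], True), ([240.0, 0.5], True),
            ([110.0, 0.1], False), ([130.0, 0.2], False)]
model = train_laugh_model(training)
print(model([250.0, 0.3]))  # high fundamental frequency -> 1 (laugh)
```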
The division and acquisition module 32 is specifically configured to perform pre-emphasis processing on the voice signal and divide the pre-processed voice signal to be detected into multiple speech frames.
The device further includes:
a filtering module 35, configured to perform endpoint detection on each speech frame and remove the noise frames and silent frames among the speech frames.
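The pre-processing chain of modules 32 and 35 can be sketched as follows. This is an illustrative minimal version: the pre-emphasis coefficient, frame length, and energy floor are hypothetical, and the endpoint detection here is a crude energy test (practical systems typically also use zero-crossing rate and similar features):

```python
def preemphasis(signal, alpha=0.97):
    """y[n] = x[n] - alpha * x[n-1]; boosts high frequencies before framing."""
    return [signal[0]] + [signal[n] - alpha * signal[n - 1]
                          for n in range(1, len(signal))]

def frame_signal(signal, frame_len=4):
    """Split the signal into consecutive fixed-length speech frames."""
    return [signal[i:i + frame_len]
            for i in range(0, len(signal) - frame_len + 1, frame_len)]

def drop_silent_frames(frames, energy_floor=0.01):
    """Crude endpoint detection: discard frames whose mean energy
    falls below a floor, approximating silent-frame removal."""
    return [f for f in frames
            if sum(s * s for s in f) / len(f) >= energy_floor]

x = [0.0, 0.0, 0.0, 0.0, 0.5, -0.4, 0.6, -0.5]
frames = frame_signal(preemphasis(x), frame_len=4)
print(len(frames), len(drop_silent_frames(frames)))
```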
The recognition and detection module 34 is specifically configured to identify the position of the current speech frame and judge whether the current speech frame is located at the front end of the voice signal; if so, count, among the first set quantity of speech frames after the current speech frame, the number of speech frames whose prediction result is a laugh frame; if not, judge whether the current speech frame is located at the rear end of the voice signal; if so, count, among the first set quantity of speech frames before the current speech frame, the number of speech frames whose prediction result is a laugh frame; otherwise, count, among a fourth set quantity of speech frames before the current speech frame and a fifth set quantity of speech frames after the current speech frame, the number of speech frames whose prediction result is a laugh frame, wherein the sum of the fourth set quantity and the fifth set quantity is the first set quantity.
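The position-dependent choice of which neighbouring frames to inspect can be sketched as follows. The concrete ranges (`front_end`, `rear_margin`) and the even split of the window are illustrative assumptions; the patent only requires that the fourth plus the fifth set quantity equal the first set quantity:

```python
def neighbour_window(idx, total, first_qty=10, front_end=20, rear_margin=15):
    """Return the indices of the adjacent frames to inspect for frame
    `idx` in a signal of `total` frames, excluding `idx` itself."""
    if idx < front_end:
        # Front end of the signal: look at frames after the current one.
        return list(range(idx + 1, min(idx + 1 + first_qty, total)))
    if idx >= total - rear_margin:
        # Rear end of the signal: look at frames before the current one.
        return list(range(max(0, idx - first_qty), idx))
    # Middle range: split the window so fourth + fifth = first set quantity.
    fourth_qty = first_qty // 2
    fifth_qty = first_qty - fourth_qty
    return (list(range(idx - fourth_qty, idx))
            + list(range(idx + 1, idx + 1 + fifth_qty)))

print(neighbour_window(5, 200))    # front end: forward-looking window
print(neighbour_window(100, 200))  # middle: frames before and after
```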
Embodiments of the present invention provide a laugh detection method and device. The method is applied to an electronic apparatus and includes: for a voice signal to be detected, dividing the voice signal to be detected into multiple speech frames, and obtaining the fundamental frequency and multidimensional speech feature parameters of each speech frame; predicting, according to a laugh detection model trained in advance and the fundamental frequency and multidimensional speech feature parameters of each speech frame, whether each speech frame is a laugh frame; counting, among a first set quantity of speech frames adjacent to the current speech frame, the number of speech frames whose prediction result is a laugh frame; and determining the current speech frame to be a laugh frame when that number is greater than a set quantity threshold. Because, in embodiments of the present invention, whether the current frame is a laugh frame is determined jointly from the current speech frame and its adjacent first set quantity of speech frames, the error rate of the laugh detection model is weakened to a certain extent and the continuity of laughter is fully taken into account, so the laugh detection result is more accurate.
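Putting the pieces together, the overall method can be summarised in a short end-to-end sketch. The per-frame classifier below is a hypothetical stand-in for the trained laugh detection model, and the window/threshold values are illustrative; unlike the full embodiment, this minimal version does not special-case front-end and rear-end frames:

```python
def detect_laugh_frames(frame_features, classify, window=6, threshold=3):
    """frame_features: one feature vector per speech frame.
    classify: per-frame model returning 1 (laugh) or 0 (non-laugh).
    A frame is finally declared a laugh frame only if more than
    `threshold` of its neighbours are also predicted as laugh."""
    preds = [classify(f) for f in frame_features]
    result = []
    for i in range(len(preds)):
        lo = max(0, i - window // 2)
        hi = min(len(preds), i + window - window // 2 + 1)
        # Neighbour laugh count excludes the current frame's own prediction.
        neighbours = sum(preds[lo:hi]) - preds[i]
        result.append(neighbours > threshold)
    return result

# Hypothetical classifier: "laugh" when the first feature (pitch) is high.
classify = lambda f: 1 if f[0] > 200 else 0
features = [[250], [260], [90], [255], [245], [250], [80], [85], [90], [95]]
print(detect_laugh_frames(features, classify))
```

Note how the smoothing can overrule an isolated per-frame prediction in either direction, which is exactly the continuity effect the summary above describes.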
As for the system/device embodiments, since they are substantially similar to the method embodiments, their description is relatively simple; for relevant details, reference may be made to the corresponding explanation of the method embodiments.
It should be understood by those skilled in the art that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical memory, and the like) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device, the instruction device implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps is executed on the computer or other programmable device to produce computer-implemented processing; the instructions executed on the computer or other programmable device thus provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present application have been described, those skilled in the art, once aware of the basic inventive concept, may make additional changes and modifications to these embodiments. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications falling within the scope of the present application.
Obviously, those skilled in the art can make various modifications and variations to the present application without departing from its spirit and scope. Thus, if these modifications and variations of the present application fall within the scope of the claims of the present application and their technical equivalents, the present application is also intended to include them.
Claims (10)
1. A laugh detection method, characterized in that it is applied to an electronic apparatus and comprises:
for a voice signal to be detected, dividing the voice signal to be detected into multiple speech frames, and obtaining the fundamental frequency and multidimensional speech feature parameters of each speech frame;
predicting, according to a laugh detection model trained in advance and the fundamental frequency and multidimensional speech feature parameters of each speech frame, whether each speech frame is a laugh frame;
counting, among a first set quantity of speech frames adjacent to the current speech frame, the number of speech frames whose prediction result is a laugh frame;
when the number is greater than a set quantity threshold, determining the current speech frame to be a laugh frame.
2. The method according to claim 1, characterized in that the training process of the laugh detection model comprises:
for each voice signal in a training set, dividing the voice signal into multiple speech frames;
obtaining the fundamental frequency and multidimensional speech feature parameters of each speech frame;
identifying whether each speech frame is a laugh frame; if so, adding a first label to the speech frame, otherwise adding a second label to the speech frame;
inputting the fundamental frequency and multidimensional speech feature parameters of the labelled speech frames into the laugh detection model to train the laugh detection model.
3. The method according to claim 2, characterized in that training the laugh detection model comprises:
training the laugh detection model using a support vector machine (SVM) method; or
training the laugh detection model using an extreme learning machine (ELM) method.
4. The method according to claim 1, characterized in that, for the voice signal to be detected, dividing the voice signal to be detected into multiple speech frames comprises:
performing pre-emphasis processing on the voice signal, and dividing the pre-processed voice signal to be detected into multiple speech frames.
5. The method according to claim 1 or 4, characterized in that, after the voice signal to be detected is divided into multiple speech frames and before the fundamental frequency and multidimensional speech feature parameters of each speech frame are obtained, the method further comprises:
performing endpoint detection on each speech frame, and removing the noise frames and silent frames among the speech frames.
6. The method according to claim 1, characterized in that counting, among the first set quantity of speech frames adjacent to the current speech frame, the number of speech frames whose prediction result is a laugh frame comprises:
identifying the position of the current speech frame, and judging whether the current speech frame is located at the front end of the voice signal;
if so, counting, among the first set quantity of speech frames after the current speech frame, the number of speech frames whose prediction result is a laugh frame;
if not, judging whether the current speech frame is located at the rear end of the voice signal;
if so, counting, among the first set quantity of speech frames before the current speech frame, the number of speech frames whose prediction result is a laugh frame; otherwise, counting, among a fourth set quantity of speech frames before the current speech frame and a fifth set quantity of speech frames after the current speech frame, the number of speech frames whose prediction result is a laugh frame, wherein the sum of the fourth set quantity and the fifth set quantity is the first set quantity.
7. A laugh detection device, characterized in that the device comprises:
a division and acquisition module, configured to divide, for a voice signal to be detected, the voice signal to be detected into multiple speech frames, and obtain the fundamental frequency and multidimensional speech feature parameters of each speech frame;
a prediction module, configured to predict, according to a laugh detection model trained in advance and the fundamental frequency and multidimensional speech feature parameters of each speech frame, whether each speech frame is a laugh frame;
a recognition and detection module, configured to count, among a first set quantity of speech frames adjacent to the current speech frame, the number of speech frames whose prediction result is a laugh frame, and determine the current speech frame to be a laugh frame when the number is greater than a set quantity threshold.
8. The device according to claim 7, characterized in that the device further comprises:
a training module, configured to: for each voice signal in a training set, divide the voice signal into multiple speech frames; obtain the fundamental frequency and multidimensional speech feature parameters of each speech frame; identify whether each speech frame is a laugh frame and, if so, add a first label to the speech frame, otherwise add a second label to the speech frame; and input the fundamental frequency and multidimensional speech feature parameters of the labelled speech frames into the laugh detection model to train the laugh detection model.
9. The device according to claim 7, characterized in that the division and acquisition module is specifically configured to perform pre-emphasis processing on the voice signal and divide the pre-processed voice signal to be detected into multiple speech frames;
the device further comprises:
a filtering module, configured to perform endpoint detection on each speech frame and remove the noise frames and silent frames among the speech frames.
10. The device according to claim 7, characterized in that the recognition and detection module is specifically configured to identify the position of the current speech frame and judge whether the current speech frame is located at the front end of the voice signal; if so, count, among the first set quantity of speech frames after the current speech frame, the number of speech frames whose prediction result is a laugh frame; if not, judge whether the current speech frame is located at the rear end of the voice signal; if so, count, among the first set quantity of speech frames before the current speech frame, the number of speech frames whose prediction result is a laugh frame; otherwise, count, among a fourth set quantity of speech frames before the current speech frame and a fifth set quantity of speech frames after the current speech frame, the number of speech frames whose prediction result is a laugh frame, wherein the sum of the fourth set quantity and the fifth set quantity is the first set quantity.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610755283.XA CN106356077B (en) | 2016-08-29 | 2016-08-29 | A kind of laugh detection method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106356077A CN106356077A (en) | 2017-01-25 |
CN106356077B true CN106356077B (en) | 2019-09-27 |
Family
ID=57856963
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610755283.XA Active CN106356077B (en) | 2016-08-29 | 2016-08-29 | A kind of laugh detection method and device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106356077B (en) |
Families Citing this family (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN107393559B (en) * | 2017-07-14 | 2021-05-18 | 深圳永顺智信息科技有限公司 | Method and device for checking voice detection result |
CN107545902B (en) * | 2017-07-14 | 2020-06-02 | 清华大学 | Article material identification method and device based on sound characteristics |
CN111210804A (en) * | 2018-11-01 | 2020-05-29 | 普天信息技术有限公司 | Method and device for identifying social signal |
CN111755029B (en) * | 2020-05-27 | 2023-08-25 | 北京大米科技有限公司 | Voice processing method, device, storage medium and electronic equipment |
CN112632369B (en) * | 2020-12-05 | 2023-03-24 | 武汉风行在线技术有限公司 | Short video recommendation system and method for identifying laughter |
CN113689861B (en) * | 2021-08-10 | 2024-02-27 | 上海淇玥信息技术有限公司 | Intelligent track dividing method, device and system for mono call recording |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1698097A (en) * | 2003-02-19 | 2005-11-16 | 松下电器产业株式会社 | Speech recognition device and speech recognition method |
CN101030384A (en) * | 2007-03-27 | 2007-09-05 | 西安交通大学 | Electronic throat speech reinforcing system and its controlling method |
CN101727900A (en) * | 2009-11-24 | 2010-06-09 | 北京中星微电子有限公司 | Method and equipment for detecting user pronunciation |
CN101944359A (en) * | 2010-07-23 | 2011-01-12 | 杭州网豆数字技术有限公司 | Voice recognition method facing specific crowd |
CN102486920A (en) * | 2010-12-06 | 2012-06-06 | 索尼公司 | Audio event detection method and device |
CN102881284A (en) * | 2012-09-03 | 2013-01-16 | 江苏大学 | Unspecific human voice and emotion recognition method and system |
CN105632486A (en) * | 2015-12-23 | 2016-06-01 | 北京奇虎科技有限公司 | Voice wake-up method and device of intelligent hardware |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6275806B1 (en) * | 1999-08-31 | 2001-08-14 | Andersen Consulting, Llp | System method and article of manufacture for detecting emotion in voice signals by utilizing statistics for voice signal parameters |
US7451084B2 (en) * | 2003-07-29 | 2008-11-11 | Fujifilm Corporation | Cell phone having an information-converting function |
CN102915728B (en) * | 2011-08-01 | 2014-08-27 | 佳能株式会社 | Sound segmentation device and method and speaker recognition system |
Non-Patent Citations (1)
Title |
---|
Emotion Recognition of Speech Signals; Chen Jia; China Master's Theses Full-text Database, Information Science and Technology; 2009-01-15; pp. 1-60 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106356077B (en) | A kind of laugh detection method and device | |
CN109473123B (en) | Voice activity detection method and device | |
CN108630193B (en) | Voice recognition method and device | |
CN110706694A (en) | Voice endpoint detection method and system based on deep learning | |
CN111312218B (en) | Neural network training and voice endpoint detection method and device | |
CN106531195B (en) | A kind of dialogue collision detection method and device | |
CN112786029B (en) | Method and apparatus for training VAD using weakly supervised data | |
CN110852215A (en) | Multi-mode emotion recognition method and system and storage medium | |
CN106887241A (en) | A kind of voice signal detection method and device | |
CN109286848B (en) | Terminal video information interaction method and device and storage medium | |
CN110503944B (en) | Method and device for training and using voice awakening model | |
CN104317392B (en) | A kind of information control method and electronic equipment | |
CN109360551B (en) | Voice recognition method and device | |
CN111918122A (en) | Video processing method and device, electronic equipment and readable storage medium | |
CN107025913A (en) | A kind of way of recording and terminal | |
CN106649253A (en) | Auxiliary control method and system based on post verification | |
CN112331188A (en) | Voice data processing method, system and terminal equipment | |
CN112750461B (en) | Voice communication optimization method and device, electronic equipment and readable storage medium | |
CN112735466B (en) | Audio detection method and device | |
CN112185382A (en) | Method, device, equipment and medium for generating and updating wake-up model | |
CN113223499B (en) | Method and device for generating audio negative sample | |
CN110708619A (en) | Word vector training method and device for intelligent equipment | |
CN112614506B (en) | Voice activation detection method and device | |
US20220101871A1 (en) | Live streaming control method and apparatus, live streaming device, and storage medium | |
CN114842382A (en) | Method, device, equipment and medium for generating semantic vector of video |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||