CN106356077B - Laugh detection method and device - Google Patents

Laugh detection method and device

Info

Publication number
CN106356077B
CN106356077B (application CN201610755283.XA)
Authority
CN
China
Prior art keywords
frame
speech frame
speech
laugh
voice signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610755283.XA
Other languages
Chinese (zh)
Other versions
CN106356077A (en)
Inventor
谢湘
徐利强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Institute of Technology BIT
Original Assignee
Beijing Institute of Technology BIT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Institute of Technology (BIT)
Priority to CN201610755283.XA
Publication of CN106356077A
Application granted
Publication of CN106356077B
Legal status: Active

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination
    • G10L25/63: Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination for estimating an emotional state
    • G10L15/00: Speech recognition
    • G10L15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063: Training

Abstract

An embodiment of the invention discloses a laugh detection method and device. The method is applied to an electronic device and comprises: for a speech signal to be detected, dividing the signal into multiple speech frames, and obtaining the pitch frequency and multidimensional speech feature parameters of each speech frame; predicting, according to a pre-trained laugh detection model and the obtained pitch frequency and multidimensional speech feature parameters of each frame, whether each speech frame is a laugh frame; counting, among a first preset number of speech frames adjacent to the current speech frame, the number of speech frames predicted to be laugh frames; and when that number is greater than a preset threshold, determining the current speech frame to be a laugh frame. Because the detection of each frame is assisted by a vote over a preset number of adjacent frames, the accuracy of laugh detection is improved, helping users obtain the laughter information in a multimedia file promptly and accurately.

Description

Laugh detection method and device
Technical field
The present invention relates to the field of audio and video processing technology, and in particular to a laugh detection method and device.
Background technique
With the rapid development of the economy, people's expectations for everyday experiences keep rising, and advanced speech detection systems are one effective way to improve those experiences. Laugh detection is a key component of speech detection systems: it can extract the highlights from speech, reducing the workload of audio editing while improving accuracy, and it can also detect changes of mood from laughter, enabling differentiated user experiences.
At present, laugh detection judges whether each speech frame is a laugh frame according to a laugh detection model, and decides per frame from that judgment alone. Although a trained laugh detection model can achieve fairly high detection accuracy, deciding whether each frame is a laugh frame solely from the model's output is still limited, to a large extent, by the accuracy of the model itself.
Summary of the invention
Embodiments of the invention disclose a laugh detection method and device to improve the accuracy of laugh detection.
To achieve the above objective, an embodiment of the invention discloses a laugh detection method applied to an electronic device. The method comprises:
for a speech signal to be detected, dividing the signal into multiple speech frames, and obtaining the pitch frequency and multidimensional speech feature parameters of each speech frame;
predicting, according to a pre-trained laugh detection model and the obtained pitch frequency and multidimensional speech feature parameters of each speech frame, whether each speech frame is a laugh frame;
counting, among a first preset number of speech frames adjacent to the current speech frame, the number of speech frames predicted to be laugh frames;
when that number is greater than a preset threshold, determining the current speech frame to be a laugh frame.
Further, the training process of the laugh detection model includes:
for each speech signal in a training set, dividing the signal into multiple speech frames;
obtaining the pitch frequency and multidimensional speech feature parameters of each speech frame;
identifying whether each speech frame is a laugh frame; if so, adding a first label to the frame, otherwise adding a second label to the frame;
inputting the pitch frequency and multidimensional speech feature parameters of the labelled speech frames into the laugh detection model and training the model.
Further, training the laugh detection model comprises:
training the model with a support vector machine (SVM) method; or
training the model with an extreme learning machine (ELM) method.
Further, for the speech signal to be detected, dividing the signal into multiple speech frames comprises:
applying pre-emphasis to the speech signal, and dividing the pre-processed signal to be detected into multiple speech frames.
Further, after the speech signal to be detected is divided into multiple speech frames and before the pitch frequency and multidimensional speech feature parameters of each frame are obtained, the method further comprises:
performing endpoint detection on each speech frame and removing noise frames and silent frames.
Further, counting, among the first preset number of speech frames adjacent to the current speech frame, the number of speech frames predicted to be laugh frames comprises:
identifying the position of the current speech frame and judging whether the current frame is located at the front end of the speech signal;
if so, counting the speech frames predicted to be laugh frames among the first preset number of speech frames after the current frame;
if not, judging whether the current frame is located at the rear end of the speech signal;
if so, counting the speech frames predicted to be laugh frames among the first preset number of speech frames before the current frame; otherwise, counting the speech frames predicted to be laugh frames among a fourth preset number of speech frames before the current frame and a fifth preset number of speech frames after it, where the fourth preset number and the fifth preset number sum to the first preset number.
In another aspect, an embodiment of the invention discloses a laugh detection device, the device comprising:
a dividing-and-obtaining module, configured to, for a speech signal to be detected, divide the signal into multiple speech frames and obtain the pitch frequency and multidimensional speech feature parameters of each frame;
a prediction module, configured to predict, according to a pre-trained laugh detection model and the obtained pitch frequency and multidimensional speech feature parameters of each frame, whether each speech frame is a laugh frame;
a recognition-and-detection module, configured to count, among a first preset number of speech frames adjacent to the current speech frame, the number of frames predicted to be laugh frames, and to determine the current frame to be a laugh frame when that number is greater than a preset threshold.
Further, the device also comprises:
a training module, configured to, for each speech signal in a training set, divide the signal into multiple speech frames; obtain the pitch frequency and multidimensional speech feature parameters of each frame; identify whether each frame is a laugh frame, adding a first label to the frame if so and a second label otherwise; and input the pitch frequency and multidimensional speech feature parameters of the labelled frames into the laugh detection model to train it.
Further, the dividing-and-obtaining module is specifically configured to apply pre-emphasis to the speech signal and divide the pre-processed signal to be detected into multiple speech frames;
the device further comprises:
a filtering module, configured to perform endpoint detection on each speech frame and remove noise frames and silent frames.
Further, the recognition-and-detection module is specifically configured to identify the position of the current speech frame and judge whether it is located at the front end of the speech signal; if so, count the speech frames predicted to be laugh frames among the first preset number of frames after the current frame; if not, judge whether the current frame is located at the rear end of the signal; if so, count the laugh-frame predictions among the first preset number of frames before the current frame; otherwise, count the laugh-frame predictions among a fourth preset number of frames before the current frame and a fifth preset number of frames after it, where the fourth preset number and the fifth preset number sum to the first preset number.
An embodiment of the invention provides a laugh detection method and device. The method is applied to an electronic device and comprises: for a speech signal to be detected, dividing the signal into multiple speech frames, and obtaining the pitch frequency and multidimensional speech feature parameters of each speech frame; predicting, according to a pre-trained laugh detection model and the obtained pitch frequency and multidimensional speech feature parameters of each frame, whether each speech frame is a laugh frame; counting, among a first preset number of speech frames adjacent to the current speech frame, the number of frames predicted to be laugh frames; and when that number is greater than a preset threshold, determining the current speech frame to be a laugh frame. Because the current speech frame and its first preset number of adjacent frames jointly determine whether the current frame is a laugh frame, the error rate of the laugh detection model is attenuated to some extent, and the continuity of laughter is fully taken into account, making the laugh detection result more accurate.
Detailed description of the invention
To describe the technical solutions of the embodiments of the invention more clearly, the drawings needed in the description of the embodiments are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is the detection flow of a laugh detection method provided by an embodiment of the invention;
Fig. 2A and Fig. 2B are schematic diagrams, provided by an embodiment of the invention, of the position of the current detection frame within a speech signal being detected;
Fig. 3 is a schematic structural diagram of a laugh detection device provided by an embodiment of the invention.
Specific embodiment
To improve the accuracy of laugh detection, embodiments of the invention provide a laugh detection method and device.
The technical solutions in the embodiments of the invention will be described clearly and completely below with reference to the drawings of the embodiments. Obviously, the described embodiments are only a part of the embodiments of the invention, not all of them; all other embodiments obtained by those of ordinary skill in the art without creative effort fall within the protection scope of the invention.
Fig. 1 shows the detection flow of a laugh detection method provided by an embodiment of the invention. The flow comprises the following steps:
S101: for a speech signal to be detected, divide the signal into multiple speech frames, and obtain the pitch frequency and multidimensional speech feature parameters of each frame.
The laugh detection method provided by the embodiment of the invention is applied to an electronic device, which may be an audio capture device such as a voice recorder, or a device such as a mobile phone, tablet computer, or PC.
Specifically, the speech signal to be detected is framed, i.e., divided into multiple speech frames, and the pitch frequency and multidimensional speech feature parameters of each frame are obtained.
Dividing a speech signal into frames and obtaining each frame's pitch frequency and multidimensional speech feature (MFCC) parameters belong to the prior art and are not described here; the same situation arises in the embodiments below and is likewise not repeated.
S102: according to a pre-trained laugh detection model and the obtained pitch frequency and multidimensional speech feature parameters of each speech frame, predict whether each speech frame is a laugh frame.
In the embodiment of the invention, the laugh detection model is trained in advance on the pitch frequency and multidimensional speech feature parameters of each frame of every speech signal in a training set. The trained model can predict, from the pitch frequency and multidimensional speech feature parameters of an input frame, whether that frame is a laugh frame; concretely, the model outputs a result for the frame, i.e., that the frame is a laugh frame or an ordinary speech frame. In the embodiment of the invention, the model's output for each frame is taken as that frame's prediction result.
S103: among the first preset number of speech frames adjacent to the current speech frame, count the number of frames predicted to be laugh frames.
Specifically, in the embodiment of the invention, when detecting whether the current frame is a laugh frame, the decision is made from the frame itself together with the number of laugh-frame predictions among the first preset number of adjacent frames. The first preset number is an integer not less than 1, for example 2, 3, 10, or 20.
When determining the first preset number of speech frames adjacent to the current frame, the adjacent frames may be the first preset number of frames before the current frame, the first preset number of frames after it, or a combination of frames before and after it; the split between preceding and following frames is not limited, provided the total number of adjacent frames equals the first preset number.
For example, with a first preset number of 20 and a current frame numbered 060, the adjacent frames may be frames 040-059 before the current frame, frames 061-080 after it, or frames 055-059 before it together with frames 061-075 after it; other splits, such as 10 frames before and 10 after, or 7 before and 13 after, may also be chosen freely.
S104: when the count is greater than a preset threshold, determine the current speech frame to be a laugh frame.
The threshold is set according to the first preset number. For example, with a first preset number of 40 frames and a threshold of 20: if 25 of the 40 frames adjacent to the current frame are predicted to be laugh frames and the current frame itself is also predicted to be a laugh frame, the laugh-frame count is 26, which is greater than the threshold of 20, so the current frame is determined to be a laugh frame.
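The voting decision of steps S103 and S104, including the edge handling for frames near the front or rear of the signal that the description details later, might be sketched as follows. The symmetric split in the middle of the signal is an illustrative choice, since the text leaves the split between preceding and following frames open, and the function names are assumptions.

```python
def vote_frame(preds, idx, half_window, threshold):
    """Decide whether frame idx is a laugh frame from the 0/1 predictions
    of 2 * half_window adjacent frames plus the frame's own prediction."""
    n = len(preds)
    if idx < half_window:                       # front end: look ahead only
        neighbours = preds[idx + 1: idx + 1 + 2 * half_window]
    elif idx >= n - half_window:                # rear end: look back only
        neighbours = preds[max(idx - 2 * half_window, 0): idx]
    else:                                       # middle: frames on both sides
        neighbours = preds[idx - half_window: idx] + preds[idx + 1: idx + 1 + half_window]
    count = sum(neighbours) + preds[idx]        # include the current frame's vote
    return count > threshold
```

Smoothing every frame this way exploits the continuity of laughter: isolated model errors are outvoted by the surrounding frames.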
Because, in the embodiment of the invention, the current speech frame and its first preset number of adjacent frames jointly determine whether the current frame is a laugh frame, the error rate of the laugh detection model is attenuated to some extent, and the continuity of laughter is fully taken into account, making the laugh detection result more accurate.
In the embodiment of the invention, the laugh detection model is trained from the speech signals in a training set. Specifically, in one embodiment of the invention, the training process of the laugh detection model comprises:
for each speech signal in the training set, dividing the signal into multiple speech frames;
obtaining the pitch frequency and multidimensional speech feature parameters of each speech frame;
identifying whether each speech frame is a laugh frame; if so, adding a first label to the frame, otherwise adding a second label to the frame;
inputting the pitch frequency and multidimensional speech feature parameters of the labelled speech frames into the laugh detection model and training the model.
Specifically, the training set contains a large number of speech signals of identical or different lengths. Each speech signal in the training set is framed, dividing it into multiple speech frames.
The pitch frequency and multidimensional speech feature parameters of each frame are obtained. According to whether each frame is a laugh frame, a first label is added to laugh frames and a second label to non-laugh frames; each frame's label, pitch frequency, and multidimensional speech feature parameters are then input into the laugh detection model to train it. The training procedure itself belongs to the prior art and is not repeated in the embodiment of the invention.
After training of the laugh detection model is complete, when the pitch frequency and multidimensional speech feature parameters of each frame of the speech signal to be detected are input into the model, the model can identify whether each frame is a laugh frame: when a frame is a laugh frame, the corresponding output carries the first label; when a frame is not a laugh frame, the output carries the second label.
In the embodiment of the invention, training the laugh detection model comprises:
training the model with a support vector machine (Support Vector Machine, SVM) method; or
training the model with an extreme learning machine (Extreme Learning Machine, ELM) method.
Both SVM and ELM belong to the prior art and are not described in the embodiment of the invention. To improve training efficiency without reducing detection accuracy, the ELM method may be used to train the laugh detection model in the embodiment of the invention.
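A minimal extreme learning machine of the kind named above can be sketched as follows (NumPy-based). The hidden-layer size, tanh activation, and 0.5 decision threshold are assumptions made for the example; the patent only specifies that SVM or ELM may be used.

```python
import numpy as np

def train_elm(X, y, hidden=120, seed=0):
    """Extreme learning machine: a random, fixed hidden layer plus
    output weights solved in closed form by least squares."""
    rng = np.random.default_rng(seed)
    W = rng.standard_normal((X.shape[1], hidden))   # random input weights (never trained)
    b = rng.standard_normal(hidden)                 # random hidden biases
    H = np.tanh(X @ W + b)                          # hidden-layer activations
    beta = np.linalg.pinv(H) @ y                    # least-squares output weights
    return W, b, beta

def elm_predict(X, model):
    """Threshold the real-valued ELM output at 0.5 for a binary laugh/non-laugh label."""
    W, b, beta = model
    return (np.tanh(X @ W + b) @ beta > 0.5).astype(int)
```

The random input weights are never trained; only the output weights are solved in closed form, which is why ELM training is fast compared with iteratively trained models.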
To improve detection efficiency and detection accuracy, on the basis of the above embodiment of the invention, in another embodiment, dividing the speech signal to be detected into multiple speech frames comprises:
applying pre-emphasis to the speech signal, and dividing the pre-processed signal to be detected into multiple speech frames;
and after the signal to be detected is divided into multiple speech frames and before the pitch frequency and multidimensional speech feature parameters of each frame are obtained, the method further comprises:
performing endpoint detection on each speech frame and removing noise frames and silent frames.
Specifically, to facilitate framing and to reduce unwanted spectral bias in the speech signal, pre-emphasis is applied before framing: the signal is passed through a first-order finite impulse response (FIR) high-pass filter, flattening its spectrum, and the filtered signal is then divided into multiple speech frames. Applying pre-emphasis and framing to a speech signal belongs to the prior art and is not repeated in the embodiment of the invention.
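The first-order high-pass pre-emphasis filter is conventionally y[n] = x[n] - alpha * x[n-1] with alpha close to 1; a sketch, using the common alpha = 0.97 as an assumed coefficient:

```python
def preemphasis(samples, alpha=0.97):
    """First-order high-pass pre-emphasis: y[n] = x[n] - alpha * x[n-1]."""
    out = [samples[0]]                       # first sample passes through unchanged
    for i in range(1, len(samples)):
        out.append(samples[i] - alpha * samples[i - 1])
    return out
```

A constant (DC-like) input is attenuated to 1 - alpha, while rapid sample-to-sample changes pass through, which is the spectral flattening the description refers to.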
After the speech signal is divided into multiple speech frames, endpoint detection is performed on each frame to find the start and end points of speech, thereby removing noise frames and silent frames. Performing endpoint detection on speech frames and removing noise frames and silent frames belong to the prior art and are not described in the embodiment of the invention.
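The endpoint detection left to the prior art above can be approximated by a simple short-time-energy gate; the energy ratio used here is an illustrative assumption, and real endpoint detectors are considerably more elaborate:

```python
def energy_endpoint_filter(frames, ratio=0.1):
    """Drop frames whose short-time energy falls below a fraction of the peak
    energy, as a stand-in for silent/noise-frame removal."""
    energies = [sum(s * s for s in f) for f in frames]
    peak = max(energies, default=0.0)
    return [f for f, e in zip(frames, energies) if e > ratio * peak]
```

Silent frames have near-zero energy and are dropped, so only frames carrying actual speech proceed to feature extraction.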
The embodiment of the invention fully considers the continuity of laughter: during detection, whether the current frame is a laugh frame is determined from the prediction results of the current frame and of the first preset number of speech frames adjacent to it. Specifically, on the basis of the embodiment shown in Fig. 1, in another embodiment of the invention, counting, among the first preset number of speech frames adjacent to the current speech frame, the number of frames predicted to be laugh frames comprises:
identifying the position of the current speech frame and judging whether the current frame is located at the front end of the speech signal;
if so, counting the speech frames predicted to be laugh frames among the first preset number of speech frames after the current frame;
if not, judging whether the current frame is located at the rear end of the speech signal;
if so, counting the speech frames predicted to be laugh frames among the first preset number of speech frames before the current frame; otherwise, counting the speech frames predicted to be laugh frames among a fourth preset number of frames before the current frame and a fifth preset number of frames after it, where the fourth preset number and the fifth preset number sum to the first preset number.
The above embodiment fully considers the continuity of laughter. For each speech frame, determining whether the current frame is a laugh frame from the prediction results of the first preset number of frames before and after it enables accurate detection of the current frame and reduces the influence of the detection model's own accuracy. However, if the current frame lies near the beginning of the speech signal, there are not enough frames before it; therefore, during detection, the way of selecting the first preset number of adjacent frames whose laugh-frame predictions are counted must be determined according to the current frame's position in the signal.
For position identification: after a speech signal is framed, each frame carries identification information assigned in the temporal order of the frames, for example the frame's number, and the total number of frames divided from the signal is also known; the identification information of the current frame therefore determines whether it lies at the front end or the rear end of the signal. When dividing front end from rear end, a range of identification information can be set for the front end, for example treating frames numbered 000-020 as the front end of the signal, and frames whose identification lies in the range A-B as the rear end, where B is the identification of the last frame of the signal and A is B minus 15 or some other value.
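The division of frame numbers into front-end, rear-end, and middle ranges described above might be coded as follows; the edge widths (frames 000-020 as the front end, the last 16 frames as the rear end) are the example values from the text, not values fixed by the invention.

```python
def frame_region(idx, total, front_edge=20, rear_edge=15):
    """Classify a frame number as lying at the 'front', 'rear', or 'middle'
    of a signal with `total` frames, per the identification-range scheme."""
    if idx <= front_edge:                 # e.g. frames 000-020
        return 'front'
    if idx >= total - 1 - rear_edge:      # e.g. last-frame id B minus 15, up to B
        return 'rear'
    return 'middle'
```

The region then selects the voting window: look ahead only at the front, look back only at the rear, and use frames on both sides in the middle.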
In addition, after endpoint detection has removed noise frames and silent frames, the speech signal may be discontinuous, but the above approach can still detect whether each frame is a laugh frame. To further improve detection accuracy: because silent frames generally occur in continuous runs and their identification information can be known in advance, during detection a frame immediately before a run of silent frames can be treated as lying at the rear end of the signal, and a frame immediately after the run as lying at the front end.
Fig. 2A -2B be inventive embodiments provide detection voice signal in, the signal of the position where current detection frame Figure.
The above embodiment of the present invention is illustrated in conjunction with Fig. 2A -2B.When carrying out position identification, because to each After voice signal carries out sub-frame processing, the corresponding identification information of each speech frame can be marked according to the time sequencing of each frame Know, the quantity of total speech frame which can be the number of speech frame, and divide in voice signal it is also known that, can To be located at the model of the identification information of the voice signal of front end according to the setting of the quantity of the identification information of speech frame and total speech frame Enclose and positioned at front end voice signal identification information range.
As shown in Fig. 2A, the shaded region M may contain the speech frames whose identification information falls in the range 000-020, taken as the frames at the front end of the voice signal; alternatively, the front end may be the frames whose identification information falls in the range 000-015, or the frames from 000 to some other value. The shaded region N may contain the speech frames whose identification information falls in the range A-B, taken as the frames at the rear end of the voice signal, where B is the identification information of the last speech frame and A is B minus 15 or some other value. The range L is the middle range remaining after the front-end and rear-end ranges are removed.
As shown in Fig. 2B, the voice signal contains mute frames. Because mute frames generally occur consecutively and their identification information is known in advance, when detecting speech frames, a speech frame immediately before a mute segment can also be treated as a frame at the rear end of the voice signal, and a speech frame immediately after a mute segment can also be treated as a frame at the front end. The regions O and Q in the figure can be regarded as front ends of the voice signal, like M in Fig. 2A; the regions P and R can be regarded as rear ends of the voice signal, like N in Fig. 2A; and the regions S and T can be regarded as middle ranges with the front-end and rear-end ranges removed, like L in Fig. 2A.
Fig. 3 is a schematic structural diagram of a laugh detection device provided in an embodiment of the present invention. The device is applied to electronic equipment and includes:
a dividing and obtaining module 32, configured to, for a voice signal to be detected, divide the voice signal to be detected into multiple speech frames, and obtain the pitch frequency and multidimensional speech characteristic parameters of each speech frame;
a prediction module 33, configured to predict whether each speech frame is a laugh frame according to a laugh detection model trained in advance and the obtained pitch frequency and multidimensional speech characteristic parameters of each speech frame;
a recognition detection module 34, configured to count, among a first set quantity of speech frames adjacent to the current speech frame, the number of speech frames whose prediction result is laugh frame, and to determine the current speech frame as a laugh frame when that number is greater than a set quantity threshold.
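The pitch frequency obtained by the dividing and obtaining module can be extracted in several ways; the patent does not fix one. A common autocorrelation-based sketch is shown below, where the function name, sample rate, and pitch search band are illustrative assumptions, not part of the patent.

```python
import numpy as np

def pitch_frequency(frame, sample_rate=16000, fmin=50.0, fmax=500.0):
    """Estimate a frame's pitch (fundamental) frequency by autocorrelation.

    The lag of the autocorrelation peak within the plausible pitch range
    [fmin, fmax] gives the pitch period; its reciprocal is the pitch
    frequency. This is only one common way to obtain the pitch feature
    the patent refers to.
    """
    frame = frame - frame.mean()
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    lo = int(sample_rate / fmax)                  # smallest lag to consider
    hi = min(int(sample_rate / fmin), len(ac) - 1)
    lag = lo + np.argmax(ac[lo:hi])
    return sample_rate / lag
```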
The device further includes:
a training module 31, configured to divide each voice signal in a training set into multiple speech frames; obtain the pitch frequency and multidimensional speech characteristic parameters of each speech frame; identify whether each speech frame is a laugh frame and, if so, add a first label to the speech frame, otherwise add a second label to the speech frame; and input the pitch frequency and multidimensional speech characteristic parameters of the labeled speech frames into the laugh detection model to train the laugh detection model.
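Claim 3 names the extreme learning machine (ELM) as one admissible trainer for the laugh detection model. A minimal numpy sketch of ELM training on labeled per-frame feature vectors (+1 for the first label, -1 for the second) might look like the following; the feature layout, hidden-layer size, and function names are assumptions for illustration.

```python
import numpy as np

def train_elm(features, labels, hidden=50, seed=0):
    """Train an extreme learning machine laugh-frame classifier.

    features: (n_frames, n_dims) per-frame vectors (e.g. pitch frequency
    plus multidimensional speech features); labels: +1 laugh, -1 other.
    Hidden-layer weights are random and fixed; only the output weights
    are solved by least squares, which is what makes ELM training fast.
    """
    rng = np.random.default_rng(seed)
    w = rng.standard_normal((features.shape[1], hidden))
    b = rng.standard_normal(hidden)
    h = np.tanh(features @ w + b)                     # random hidden layer
    beta, *_ = np.linalg.lstsq(h, labels, rcond=None) # output weights
    return w, b, beta

def predict_elm(model, features):
    """Return +1 (laugh) / -1 (non-laugh) per frame."""
    w, b, beta = model
    return np.sign(np.tanh(features @ w + b) @ beta)
```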
The dividing and obtaining module 32 is specifically configured to carry out pre-emphasis processing on the voice signal and divide the preprocessed voice signal to be detected into multiple speech frames.
The device further includes:
a filtering module 35, configured to carry out end-point detection on each speech frame and remove the noise frames and mute frames from the speech frames.
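The work of modules 32 and 35 — pre-emphasis, framing, and end-point filtering — might be sketched as follows. The frame length, hop, pre-emphasis coefficient, and the simple energy threshold standing in for end-point detection are all assumptions for illustration; the patent does not prescribe a specific end-point detection algorithm.

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97, energy_ratio=0.05):
    """Pre-emphasise, frame, and drop low-energy (mute/noise) frames.

    alpha is the usual pre-emphasis coefficient. A frame is kept only
    if its energy exceeds energy_ratio times the maximum frame energy,
    a crude stand-in for the patent's end-point detection. Returns the
    kept frames and their original frame IDs, so that position
    identification can still use the time-ordered identification info.
    """
    emphasised = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    n = 1 + max(0, (len(emphasised) - frame_len) // hop)
    frames = np.stack([emphasised[i * hop : i * hop + frame_len]
                       for i in range(n)])
    energy = (frames ** 2).sum(axis=1)
    keep = energy > energy_ratio * energy.max()
    return frames[keep], np.flatnonzero(keep)
```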
The recognition detection module 34 is specifically configured to: identify the position of the current speech frame and judge whether the current speech frame is located at the front end of the voice signal; if so, count, among the first set quantity of speech frames after the current speech frame, the number of speech frames whose prediction result is laugh frame; if not, judge whether the current speech frame is located at the rear end of the voice signal; if so, count, among the first set quantity of speech frames before the current speech frame, the number of speech frames whose prediction result is laugh frame; otherwise, count, among a fourth set quantity of speech frames before the current speech frame and a fifth set quantity of speech frames after the current speech frame, the number of speech frames whose prediction result is laugh frame, where the sum of the fourth set quantity and the fifth set quantity is the first set quantity.
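The position-dependent voting performed by the recognition detection module can be sketched as follows. The function name and the concrete quantities (first set quantity 10, threshold 5, fourth and fifth set quantities 5 each) are illustrative assumptions; only the structure — window choice by position, then a count-versus-threshold decision — comes from the text.

```python
def is_laugh(frame_id, predictions, position, first_qty=10, threshold=5,
             before_qty=5, after_qty=5):
    """Decide whether frame_id is a laugh frame by voting over neighbours.

    predictions: per-frame model outputs (True = predicted laugh frame).
    Front-end frames vote over the first_qty following frames, rear-end
    frames over the first_qty preceding frames, and middle frames over
    before_qty preceding plus after_qty following frames, where
    before_qty + after_qty == first_qty.
    """
    if position == "front":
        window = predictions[frame_id + 1 : frame_id + 1 + first_qty]
    elif position == "rear":
        window = predictions[max(0, frame_id - first_qty) : frame_id]
    else:
        window = (predictions[max(0, frame_id - before_qty) : frame_id]
                  + predictions[frame_id + 1 : frame_id + 1 + after_qty])
    # laugh iff the count of laugh-predicted neighbours exceeds the threshold
    return sum(window) > threshold
```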
Embodiments of the present invention provide a laugh detection method and device. The method, applied to electronic equipment, includes: for a voice signal to be detected, dividing the voice signal to be detected into multiple speech frames, and obtaining the pitch frequency and multidimensional speech characteristic parameters of each speech frame; predicting whether each speech frame is a laugh frame according to a laugh detection model trained in advance and the obtained pitch frequency and multidimensional speech characteristic parameters of each speech frame; counting, among a first set quantity of speech frames adjacent to the current speech frame, the number of speech frames whose prediction result is laugh frame; and determining the current speech frame as a laugh frame when that number is greater than a set quantity threshold. Because, in the embodiments of the present invention, whether the current frame is a laugh frame is determined jointly from the current speech frame and its adjacent first set quantity of speech frames, the error rate of the laugh detection model is weakened to a certain extent, and the continuity of laughter is fully taken into account, making the laugh detection result more accurate.
As for the system/device embodiments, since they are substantially similar to the method embodiments, their description is relatively brief; for relevant details, refer to the partial explanation of the method embodiments.
It should be understood by those skilled in the art that the embodiments of the present application may be provided as a method, a system, or a computer program product. Therefore, the present application may take the form of a complete hardware embodiment, a complete software embodiment, or an embodiment combining software and hardware aspects. Moreover, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical memory) containing computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of the method, the device (system), and the computer program product according to embodiments of the present application. It should be understood that every flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be realized by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device generate a device for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory able to guide a computer or other programmable data processing device to work in a specific manner, so that the instructions stored in the computer-readable memory generate a manufacture including an instruction device, the instruction device realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are executed on the computer or other programmable device to generate computer-implemented processing; the instructions executed on the computer or other programmable device thereby provide steps for realizing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
Although preferred embodiments of the present application have been described, a person skilled in the art, once aware of the basic creative concept, can make additional changes and modifications to these embodiments. Therefore, the following claims are intended to be interpreted as including the preferred embodiments and all changes and modifications falling within the scope of the present application.
Obviously, those skilled in the art can make various modifications and variations to the present application without departing from its spirit and scope. If these modifications and variations of the present application fall within the scope of the claims of the present application and their equivalent technologies, the present application is also intended to include them.

Claims (10)

1. A laugh detection method, characterized in that it is applied to electronic equipment and comprises:
for a voice signal to be detected, dividing the voice signal to be detected into multiple speech frames, and obtaining the pitch frequency and multidimensional speech characteristic parameters of each speech frame;
predicting whether each speech frame is a laugh frame according to a laugh detection model trained in advance and the obtained pitch frequency and multidimensional speech characteristic parameters of each speech frame;
counting, among a first set quantity of speech frames adjacent to the current speech frame, the number of speech frames whose prediction result is laugh frame;
when said number is greater than a set quantity threshold, determining the current speech frame as a laugh frame.
2. The method according to claim 1, characterized in that the training process of the laugh detection model comprises:
for each voice signal in a training set, dividing the voice signal into multiple speech frames;
obtaining the pitch frequency and multidimensional speech characteristic parameters of each speech frame;
identifying whether each speech frame is a laugh frame; if so, adding a first label to the speech frame, otherwise adding a second label to the speech frame;
inputting the pitch frequency and multidimensional speech characteristic parameters of the labeled speech frames into the laugh detection model, and training the laugh detection model.
3. The method according to claim 2, characterized in that said training the laugh detection model comprises:
training the laugh detection model using a support vector machine method; or,
training the laugh detection model using an extreme learning machine (ELM) method.
4. The method according to claim 1, characterized in that said, for a voice signal to be detected, dividing the voice signal to be detected into multiple speech frames comprises:
carrying out pre-emphasis processing on the voice signal, and dividing the preprocessed voice signal to be detected into multiple speech frames.
5. The method according to claim 1 or 4, characterized in that after said dividing the voice signal to be detected into multiple speech frames and before said obtaining the pitch frequency and multidimensional speech characteristic parameters of each speech frame, the method further comprises:
carrying out end-point detection on each speech frame, and removing the noise frames and mute frames from the speech frames.
6. The method according to claim 1, characterized in that said counting, among the first set quantity of speech frames adjacent to the current speech frame, the number of speech frames whose prediction result is laugh frame comprises:
identifying the position of the current speech frame, and judging whether the current speech frame is located at the front end of the voice signal;
if so, counting, among the first set quantity of speech frames after the current speech frame, the number of speech frames whose prediction result is laugh frame;
if not, judging whether the current speech frame is located at the rear end of the voice signal;
if so, counting, among the first set quantity of speech frames before the current speech frame, the number of speech frames whose prediction result is laugh frame; otherwise, counting, among a fourth set quantity of speech frames before the current speech frame and a fifth set quantity of speech frames after the current speech frame, the number of speech frames whose prediction result is laugh frame, wherein the sum of the fourth set quantity and the fifth set quantity is the first set quantity.
7. A laugh detection device, characterized in that the device comprises:
a dividing and obtaining module, configured to, for a voice signal to be detected, divide the voice signal to be detected into multiple speech frames, and obtain the pitch frequency and multidimensional speech characteristic parameters of each speech frame;
a prediction module, configured to predict whether each speech frame is a laugh frame according to a laugh detection model trained in advance and the obtained pitch frequency and multidimensional speech characteristic parameters of each speech frame;
a recognition detection module, configured to count, among a first set quantity of speech frames adjacent to the current speech frame, the number of speech frames whose prediction result is laugh frame, and to determine the current speech frame as a laugh frame when said number is greater than a set quantity threshold.
8. The device according to claim 7, characterized in that the device further comprises:
a training module, configured to divide each voice signal in a training set into multiple speech frames; obtain the pitch frequency and multidimensional speech characteristic parameters of each speech frame; identify whether each speech frame is a laugh frame and, if so, add a first label to the speech frame, otherwise add a second label to the speech frame; and input the pitch frequency and multidimensional speech characteristic parameters of the labeled speech frames into the laugh detection model to train the laugh detection model.
9. The device according to claim 7, characterized in that the dividing and obtaining module is specifically configured to carry out pre-emphasis processing on the voice signal and divide the preprocessed voice signal to be detected into multiple speech frames;
the device further comprises:
a filtering module, configured to carry out end-point detection on each speech frame and remove the noise frames and mute frames from the speech frames.
10. The device according to claim 7, characterized in that the recognition detection module is specifically configured to: identify the position of the current speech frame and judge whether the current speech frame is located at the front end of the voice signal; if so, count, among the first set quantity of speech frames after the current speech frame, the number of speech frames whose prediction result is laugh frame; if not, judge whether the current speech frame is located at the rear end of the voice signal; if so, count, among the first set quantity of speech frames before the current speech frame, the number of speech frames whose prediction result is laugh frame; otherwise, count, among a fourth set quantity of speech frames before the current speech frame and a fifth set quantity of speech frames after the current speech frame, the number of speech frames whose prediction result is laugh frame, wherein the sum of the fourth set quantity and the fifth set quantity is the first set quantity.
CN201610755283.XA 2016-08-29 2016-08-29 A kind of laugh detection method and device Active CN106356077B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610755283.XA CN106356077B (en) 2016-08-29 2016-08-29 A kind of laugh detection method and device

Publications (2)

Publication Number Publication Date
CN106356077A CN106356077A (en) 2017-01-25
CN106356077B true CN106356077B (en) 2019-09-27

Family

ID=57856963

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610755283.XA Active CN106356077B (en) 2016-08-29 2016-08-29 A kind of laugh detection method and device

Country Status (1)

Country Link
CN (1) CN106356077B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107393559B (en) * 2017-07-14 2021-05-18 深圳永顺智信息科技有限公司 Method and device for checking voice detection result
CN107545902B (en) * 2017-07-14 2020-06-02 清华大学 Article material identification method and device based on sound characteristics
CN111210804A (en) * 2018-11-01 2020-05-29 普天信息技术有限公司 Method and device for identifying social signal
CN111755029B (en) * 2020-05-27 2023-08-25 北京大米科技有限公司 Voice processing method, device, storage medium and electronic equipment
CN112632369B (en) * 2020-12-05 2023-03-24 武汉风行在线技术有限公司 Short video recommendation system and method for identifying laughter
CN113689861B (en) * 2021-08-10 2024-02-27 上海淇玥信息技术有限公司 Intelligent track dividing method, device and system for mono call recording

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1698097A (en) * 2003-02-19 2005-11-16 松下电器产业株式会社 Speech recognition device and speech recognition method
CN101030384A (en) * 2007-03-27 2007-09-05 西安交通大学 Electronic throat speech reinforcing system and its controlling method
CN101727900A (en) * 2009-11-24 2010-06-09 北京中星微电子有限公司 Method and equipment for detecting user pronunciation
CN101944359A (en) * 2010-07-23 2011-01-12 杭州网豆数字技术有限公司 Voice recognition method facing specific crowd
CN102486920A (en) * 2010-12-06 2012-06-06 索尼公司 Audio event detection method and device
CN102881284A (en) * 2012-09-03 2013-01-16 江苏大学 Unspecific human voice and emotion recognition method and system
CN105632486A (en) * 2015-12-23 2016-06-01 北京奇虎科技有限公司 Voice wake-up method and device of intelligent hardware

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6275806B1 (en) * 1999-08-31 2001-08-14 Andersen Consulting, Llp System method and article of manufacture for detecting emotion in voice signals by utilizing statistics for voice signal parameters
US7451084B2 (en) * 2003-07-29 2008-11-11 Fujifilm Corporation Cell phone having an information-converting function
CN102915728B (en) * 2011-08-01 2014-08-27 佳能株式会社 Sound segmentation device and method and speaker recognition system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Emotion Recognition of Speech Signals; Chen Jia; China Master's Theses Full-text Database, Information Science and Technology; 20090115; pp. 1-60 *

Also Published As

Publication number Publication date
CN106356077A (en) 2017-01-25

Similar Documents

Publication Publication Date Title
CN106356077B (en) A kind of laugh detection method and device
CN109473123B (en) Voice activity detection method and device
CN108630193B (en) Voice recognition method and device
CN110706694A (en) Voice endpoint detection method and system based on deep learning
CN111312218B (en) Neural network training and voice endpoint detection method and device
CN106531195B (en) A kind of dialogue collision detection method and device
CN112786029B (en) Method and apparatus for training VAD using weakly supervised data
CN110852215A (en) Multi-mode emotion recognition method and system and storage medium
CN106887241A (en) A kind of voice signal detection method and device
CN109286848B (en) Terminal video information interaction method and device and storage medium
CN110503944B (en) Method and device for training and using voice awakening model
CN104317392B (en) A kind of information control method and electronic equipment
CN109360551B (en) Voice recognition method and device
CN111918122A (en) Video processing method and device, electronic equipment and readable storage medium
CN107025913A (en) A kind of way of recording and terminal
CN106649253A (en) Auxiliary control method and system based on post verification
CN112331188A (en) Voice data processing method, system and terminal equipment
CN112750461B (en) Voice communication optimization method and device, electronic equipment and readable storage medium
CN112735466B (en) Audio detection method and device
CN112185382A (en) Method, device, equipment and medium for generating and updating wake-up model
CN113223499B (en) Method and device for generating audio negative sample
CN110708619A (en) Word vector training method and device for intelligent equipment
CN112614506B (en) Voice activation detection method and device
US20220101871A1 (en) Live streaming control method and apparatus, live streaming device, and storage medium
CN114842382A (en) Method, device, equipment and medium for generating semantic vector of video

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant