CN108831508A - Voice activity detection method, device and equipment - Google Patents

Publication number
CN108831508A
Authority
CN
China
Prior art keywords
signal
frame
voice
audio signal
probability
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810605698.8A
Other languages
Chinese (zh)
Inventor
李超
文铭
朱唯鑫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201810605698.8A
Publication of CN108831508A

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L21/00 Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L21/02 Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L21/0208 Noise filtering
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 Detection of presence or absence of voice signals
    • G10L25/84 Detection of presence or absence of voice signals for discriminating voice from noise

Abstract

Embodiments of the present invention provide a voice activity detection method, device and equipment. The method includes: smoothing an audio signal to be detected; calculating the energy and zero-crossing rate of each frame signal in the smoothed audio signal; determining, according to the energy and zero-crossing rate of each frame signal and a pre-trained detection model, the probability that each frame signal is a voice signal; and determining the noise signal and the voice signal in the audio signal according to the probability that each frame signal is a voice signal. By smoothing an audio signal that contains a noise signal, the method substantially weakens the noise signal in the audio signal and improves the performance of voice activity detection in noisy environments.

Description

Voice activity detection method, device and equipment
Technical field
Embodiments of the present invention relate to voice signal processing technology, and in particular to a voice activity detection method, device and equipment.
Background
Voice activity detection (VAD), also known as speech endpoint detection or speech boundary detection, detects speech and non-speech within a voice signal in order to identify and eliminate long silent periods in the voice signal stream. It is commonly used in speech processing systems such as speech recognition, speech coding and speech enhancement, where it serves to reduce the speech coding bit rate, save communication bandwidth, reduce the energy consumption of mobile devices, and improve the recognition rate.
Because a voice signal is non-stationary and is easily disturbed by noise signals, noise can seriously affect the accuracy of VAD. The existing VAD method based on the G.729 standard calculates the energy of the signal and then sets a threshold to classify each frame of the signal in a simple manner; however, this method cannot achieve satisfactory results in the presence of noise.
With the continuous development of speech processing technology, the requirements placed on voice activity detection keep rising. A voice activity detection method that still maintains good detection performance in noisy environments is therefore needed.
Summary of the invention
Embodiments of the present invention provide a voice activity detection method, device and equipment, to solve the prior-art problem that voice activity detection performs poorly in noisy environments.
In a first aspect, an embodiment of the present invention provides a voice activity detection method, including:
Smoothing an audio signal to be detected;
Calculating the energy and zero-crossing rate of each frame signal in the smoothed audio signal;
Determining, according to the energy and zero-crossing rate of each frame signal and a pre-trained detection model, the probability that each frame signal is a voice signal;
Determining the noise signal and the voice signal in the audio signal according to the probability that each frame signal is a voice signal.
In one possible implementation, smoothing the audio signal to be detected includes:
Calculating one average value for every N sampling points in the audio signal to be detected and using it as the smoothed output value of those N sampling points, where N is a natural number greater than 1.
In one possible implementation, before calculating the energy and zero-crossing rate of each frame signal in the smoothed audio signal, the method further includes:
Framing the smoothed audio signal according to a preset frame length and a preset frame shift, where the preset frame length is greater than the preset frame shift.
In one possible implementation, determining the noise signal and the voice signal in the audio signal according to the probability that each frame signal is a voice signal includes:
If the probability is greater than a preset probability value, determining that the frame signal is a voice signal;
If the probability is less than or equal to the preset probability value, determining that the frame signal is a noise signal.
In one possible implementation, before determining, according to the energy and zero-crossing rate of each frame signal and the pre-trained detection model, the probability that each frame signal is a voice signal, the method further includes:
Smoothing and framing the audio signals in a training corpus to generate multiple training samples;
Training the detection model, using the energy and zero-crossing rate of the multiple training samples as the input features of the detection model and whether each training sample is a voice signal as the desired output feature of the detection model.
In one possible implementation, the detection model is trained based on a deep neural network, a logistic regression model, or a support vector machine model.
In a second aspect, an embodiment of the present invention provides a voice activity detection apparatus, including:
A smoothing module, configured to smooth an audio signal to be detected;
A computing module, configured to calculate the energy and zero-crossing rate of each frame signal in the smoothed audio signal;
A determining module, configured to determine, according to the energy and zero-crossing rate of each frame signal and a pre-trained detection model, the probability that each frame signal is a voice signal;
The determining module is further configured to determine the noise signal and the voice signal in the audio signal according to the probability that each frame signal is a voice signal.
In one possible implementation, the smoothing module is specifically configured to calculate one average value for every N sampling points in the audio signal to be detected and use it as the smoothed output value of those N sampling points, where N is a natural number greater than 1.
In a third aspect, an embodiment of the present invention provides voice activity detection equipment, including:
A memory;
A processor; and
A computer program;
Wherein the computer program is stored in the memory and is configured to be executed by the processor to implement the method of any one of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored, the computer program being executed by a processor to implement the method of any one of the first aspect.
The voice activity detection method, device and equipment provided by the embodiments of the present invention smooth the audio signal to be detected, calculate the energy and zero-crossing rate of each frame signal in the smoothed audio signal, determine the probability that each frame signal is a voice signal according to the energy and zero-crossing rate of each frame signal and a pre-trained detection model, and determine the noise signal and the voice signal in the audio signal according to that probability, thereby achieving high-performance detection of voice activity in noisy environments. Because voice signals and noise signals have different characteristics, smoothing substantially smooths the amplitude of the noise signal in the audio signal, while the amplitude of the voice signal is smoothed far less than that of the noise signal. More discriminative acoustic features can therefore be extracted, which improves the performance of voice activity detection in noisy environments.
Description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present invention and, together with the specification, serve to explain the principles of the present invention.
Fig. 1 is a flowchart of one embodiment of the voice activity detection method provided by the present invention;
Fig. 2 is a flowchart of another embodiment of the voice activity detection method provided by the present invention;
Fig. 3 is a flowchart of training the detection model in one embodiment of the voice activity detection method provided by the present invention;
Fig. 4 is a structural schematic diagram of one embodiment of the voice activity detection apparatus provided by the present invention;
Fig. 5 is a structural schematic diagram of one embodiment of the voice activity detection equipment provided by the present invention.
The above drawings show specific embodiments of the present invention, which are described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concept in any way, but to illustrate the concept of the present invention to those skilled in the art by reference to specific embodiments.
Detailed description
Example embodiments are described in detail here, and examples of them are illustrated in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings indicate the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of devices and methods consistent with some aspects of the present invention as detailed in the appended claims.
The terms "include" and "have" and any variations of them in the description and claims of the present invention are intended to cover non-exclusive inclusion. For example, a process, method, system, product or equipment that contains a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product or equipment.
"First" and "second" in the present invention serve only as labels and are not to be understood as indicating or implying an ordinal relationship or relative importance, or as implicitly indicating the number of the technical features referred to. "Multiple" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean that A exists alone, that A and B exist at the same time, or that B exists alone. The character "/" generally indicates an "or" relationship between the objects before and after it.
"One embodiment" or "an embodiment" mentioned throughout the specification of the present invention means that a particular feature, structure or characteristic related to the embodiment is included in at least one embodiment of the present application. Therefore, "in one embodiment" or "in an embodiment" appearing in various places throughout the specification does not necessarily refer to the same embodiment. It should be noted that, where no conflict arises, the embodiments of the present invention and the features in the embodiments can be combined with each other.
Fig. 1 is a flowchart of one embodiment of the voice activity detection method provided by the present invention. As shown in Fig. 1, the voice activity detection method provided by this embodiment may include:
Step S101: smooth the audio signal to be detected.
With the continuous development of artificial intelligence technology, intelligent applications based on speech recognition keep appearing. Taking the mobile phone as an example, applications such as voice search and voice navigation are gradually changing users' habits. A mobile phone usually collects audio signals through a microphone, so the collection is inevitably affected by ambient noise, and the presence of noise degrades the performance of voice signal processing.
A specific scenario illustrates this in detail. A smartphone provides a driving mode so that the user can control the phone by voice while driving. For example, while driving, the user can say "call Zhang San" to make the phone call the contact named Zhang San in the address book, say "answer the call" to make the phone answer an incoming call, or say "search for the nearest parking lot" to make the phone provide navigation to the nearest parking lot. However, the user may well issue no voice command at all during the entire drive, or may issue only a few voice commands; for example, in a 40-minute drive the voice commands may last only 1 minute. Performing speech recognition on all 40 minutes of collected audio would place a heavy load on the phone's processor, waste a large amount of processing resources, and make the phone's power consumption excessive. In this case, a voice activity detection method is needed to identify the 1 minute of voice signal within the 40 minutes of collected audio, so that recognition is performed only on this 1 minute of voice signal, which improves the efficiency of speech recognition and reduces the phone's power consumption. However, in-vehicle noise is severe in a vehicle environment, which directly affects the performance and stability of voice activity detection, makes the detection result inaccurate, and in turn affects subsequent speech recognition and processing.
Through detailed analysis of the commonalities and differences between noise signals and voice signals, this embodiment exploits the non-stationary nature of the voice signal and the stationary nature of the in-vehicle noise signal: the collected audio signal to be detected is smoothed, which weakens the influence of noise on voice signal detection.
Smoothing substantially smooths the amplitude of the noise signal in the audio signal, while the amplitude of the voice signal in the audio signal is smoothed far less than that of the noise signal. More discriminative acoustic features can therefore be extracted, achieving high-performance voice activity detection in noisy environments.
Step S102: calculate the energy and zero-crossing rate of each frame signal in the smoothed audio signal.
Optionally, before the calculation, the method may further include: framing the smoothed audio signal according to a preset frame length and a preset frame shift, where the preset frame length is greater than the preset frame shift. Specifically, the frame length may be 25 milliseconds and the frame shift 10 milliseconds, in which case an audio signal with a duration of 85 milliseconds is divided into 7 frames.
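The framing step can be sketched as follows. This is a minimal illustration rather than the patent's implementation: the function name `split_into_frames`, the hypothetical rate of 1000 samples per second (so that one sample corresponds to one millisecond), and the choice to drop a trailing segment shorter than one frame are all assumptions.

```python
def split_into_frames(samples, frame_len, frame_shift):
    """Split a sequence into overlapping frames.

    frame_len and frame_shift are given in samples; the frame length must
    be greater than the frame shift, so adjacent frames overlap.
    """
    if frame_shift <= 0 or frame_len <= frame_shift:
        raise ValueError("need 0 < frame_shift < frame_len")
    frames = []
    start = 0
    # Assumption: a trailing segment shorter than frame_len is dropped.
    while start + frame_len <= len(samples):
        frames.append(samples[start:start + frame_len])
        start += frame_shift
    return frames


# Hypothetical rate of 1000 samples per second, so 1 sample = 1 ms.
signal = [0.0] * 85                         # 85 ms of audio
frames = split_into_frames(signal, 25, 10)  # 25 ms frames, 10 ms shift
print(len(frames))                          # 7
```

The frames start at 0, 10, ..., 60 ms; a frame starting at 70 ms would need samples up to 95 ms, so the count is 7, which agrees with the figure given in the text.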
In this embodiment, the energy of each frame signal can be represented by the L2 norm of the frame signal, that is, the energy of a frame signal equals the sum of the squares of the values of the sampling points in the frame. The zero-crossing rate of each frame signal is the number of times the frame signal crosses zero, that is, the number of times the sign of the frame signal changes. Energy and zero-crossing rate are cheap to compute and place little computational load on the equipment, which gives them the advantage of low power consumption.
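A minimal sketch of the two features follows. The function names are hypothetical. The energy is computed as the sum of squares described in the text (strictly speaking, this is the squared L2 norm). How samples exactly equal to zero are treated is not specified in the source; counting a crossing only when adjacent samples have strictly opposite signs is an assumption.

```python
def frame_energy(frame):
    """Sum of the squared sample values in one frame."""
    return sum(x * x for x in frame)


def zero_crossing_count(frame):
    """Number of sign changes between adjacent samples in one frame.

    Assumption: a crossing is counted only when two adjacent samples have
    strictly opposite signs, so a product of zero is not a crossing.
    """
    return sum(1 for a, b in zip(frame, frame[1:]) if a * b < 0)


frame = [1.0, -1.0, 2.0, -2.0]
print(frame_energy(frame))         # 10.0
print(zero_crossing_count(frame))  # 3
```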
Step S103: determine, according to the energy and zero-crossing rate of each frame signal and the pre-trained detection model, the probability that each frame signal is a voice signal.
The energy and zero-crossing rate obtained for each frame signal are fed as input features into the pre-trained detection model, which outputs the probability that each frame signal is a voice signal. Optionally, the probability that each frame signal is a noise signal can also be obtained.
Step S104: determine the noise signal and the voice signal in the audio signal according to the probability that each frame signal is a voice signal.
The noise signal and the voice signal in the audio signal are determined from the obtained probability values. Specifically, if the probability that a frame signal is a voice signal is greater than a preset probability value, the frame signal is determined to be a voice signal; if the probability that a frame signal is a voice signal is less than or equal to the preset probability value, the frame signal is determined to be a noise signal. For example, the preset probability value can be set to 0.5.
Alternatively, if the probability that a frame signal is a voice signal is greater than the probability that the frame signal is a noise signal, the frame signal is determined to be a voice signal; if the probability that a frame signal is a voice signal is less than or equal to the probability that the frame signal is a noise signal, the frame signal is determined to be a noise signal. Normally, the probability that a frame signal is a voice signal and the probability that it is a noise signal sum to 1.
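The two decision rules can be sketched side by side. When the two probabilities sum to 1, the condition p_voice > p_noise is the same as p_voice > 0.5, so the comparison rule coincides with the threshold rule at a preset probability value of 0.5. The function names and the sample probabilities below are hypothetical.

```python
def label_by_threshold(p_voice, threshold=0.5):
    """Voice if the voice probability exceeds the preset value, else noise."""
    return "voice" if p_voice > threshold else "noise"


def label_by_comparison(p_voice, p_noise):
    """Voice if the voice probability exceeds the noise probability, else noise."""
    return "voice" if p_voice > p_noise else "noise"


probs = [0.9, 0.5, 0.2]  # made-up per-frame voice probabilities
by_threshold = [label_by_threshold(p) for p in probs]
by_comparison = [label_by_comparison(p, 1.0 - p) for p in probs]
print(by_threshold)      # ['voice', 'noise', 'noise']
print(by_comparison)     # ['voice', 'noise', 'noise']
```

A frame whose probability equals the preset value exactly is labeled noise under both rules, matching the "less than or equal to" condition in the text.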
The voice activity detection method provided by this embodiment of the present invention smooths the audio signal to be detected, calculates the energy and zero-crossing rate of each frame signal in the smoothed audio signal, determines the probability that each frame signal is a voice signal according to the energy and zero-crossing rate of each frame signal and a pre-trained detection model, and determines the noise signal and the voice signal in the audio signal according to that probability, thereby achieving high-performance detection of voice activity in noisy environments. Because voice signals and noise signals have different characteristics, smoothing substantially smooths the amplitude of the noise signal in the audio signal, while the amplitude of the voice signal is smoothed far less than that of the noise signal. More discriminative acoustic features can therefore be extracted, which improves the performance of voice activity detection in noisy environments.
The smoothing in the technical solution of the method embodiment shown in Fig. 1 is described in detail below through a specific embodiment. As intelligent terminals become widespread, voice activity detection increasingly runs on terminal devices, so it must not only be stable and reliable but also cheap to compute. For this reason, in this embodiment, smoothing the audio signal to be detected may include: calculating one average value for every N sampling points in the audio signal to be detected and using it as the smoothed output value of those N sampling points, where N is a natural number greater than 1.
For example, if N is 4, that is, the averaging window used for smoothing is 4 samples, then for an audio signal containing 160 sampling points, the smoothed output contains 40 sampling points, each of which is the average of 4 input sampling points.
In the voice activity detection method provided by this embodiment, one average value is calculated for every N sampling points in the audio signal to be detected and used as the smoothed output value of those N sampling points. This smoothing method is cheap to compute, and because it merges multiple sampling points into one, it greatly reduces the number of sampling points and hence the amount of data processed during voice activity detection. It therefore both improves prediction efficiency and meets the requirement of low power consumption.
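The block averaging described above can be sketched as follows. The function name is hypothetical, and dropping leftover samples when the signal length is not a multiple of N is an assumption, since the source does not say how a remainder is handled.

```python
def smooth_by_block_average(samples, n):
    """Replace every n consecutive samples with their average (n > 1)."""
    if n <= 1:
        raise ValueError("n must be a natural number greater than 1")
    # Assumption: leftover samples that do not fill a block are dropped.
    usable = len(samples) - len(samples) % n
    return [sum(samples[i:i + n]) / n for i in range(0, usable, n)]


signal = list(range(160))            # 160 made-up sampling points
smoothed = smooth_by_block_average(signal, 4)
print(len(smoothed))                 # 40
print(smoothed[0])                   # 1.5, the average of 0, 1, 2, 3
```

Because the blocks do not overlap, the output has one sample per N input samples, which is where the reduction in data volume mentioned in the text comes from.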
Fig. 2 is a flowchart of another embodiment of the voice activity detection method provided by the present invention. As shown in Fig. 2, the voice activity detection method provided by this embodiment may include:
Step S201: calculate one average value for every N sampling points of the audio signal to be detected and use it as the smoothed output value of those N sampling points.
Step S202: frame the smoothed audio signal according to a preset frame length and a preset frame shift, where the preset frame length is greater than the preset frame shift.
Step S203: calculate the energy and zero-crossing rate of each frame signal in the smoothed audio signal.
Step S204: determine, according to the energy and zero-crossing rate of each frame signal and the pre-trained detection model, the probability that each frame signal is a voice signal.
Step S205: if the probability is greater than the preset probability value, determine that the frame signal is a voice signal; if the probability is less than or equal to the preset probability value, determine that the frame signal is a noise signal.
The voice activity detection method provided by this embodiment merges multiple sampling points into one by averaging, which reduces the number of sampling points, the amount of data processed during voice activity detection, and the power consumption. Smoothing substantially smooths the amplitude of the noise signal in the audio signal, which improves the performance of voice activity detection in noisy environments. Using energy and zero-crossing rate, which are cheap to compute, as input features meets the requirement of low power consumption, so the voice activity detection method provided by this embodiment can run on terminal devices.
On the basis of the above embodiments, this embodiment describes in detail the training process of the detection model used in the above embodiments. Fig. 3 is a flowchart of training the detection model in one embodiment of the voice activity detection method provided by the present invention. As shown in Fig. 3, the training process of the detection model may include:
Step S301: smooth and frame the audio signals in a training corpus to generate multiple training samples.
The smoothing can use the same method as in the detection embodiments above: one average value is calculated for every N sampling points and used as the smoothed output value of those N sampling points. The smoothed audio signal is then framed according to a preset frame length and a preset frame shift, where the preset frame length is greater than the preset frame shift, so that adjacent frames partially overlap. In the training stage, this framing method increases the number of training samples. It should be noted that the frame length must be the same in the training stage and the detection stage, whereas the frame shift may differ; for an audio signal of the same duration, a smaller frame shift yields more training samples. For example, the frame length may be 25 milliseconds and the frame shift 5 milliseconds.
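The effect of the frame shift on the number of training samples can be checked with a short calculation. Assuming a trailing partial frame is dropped, a signal of duration T with frame length L and frame shift S yields floor((T - L) / S) + 1 frames. The function name and the 1000-millisecond duration are hypothetical.

```python
def frame_count(duration_ms, frame_len_ms, frame_shift_ms):
    """Number of full frames, assuming a trailing partial frame is dropped."""
    if duration_ms < frame_len_ms:
        return 0
    return (duration_ms - frame_len_ms) // frame_shift_ms + 1


print(frame_count(85, 25, 10))    # 7, the detection example in the text
print(frame_count(1000, 25, 10))  # 98
print(frame_count(1000, 25, 5))   # 196
```

Halving the frame shift from 10 to 5 milliseconds roughly doubles the number of training samples obtained from the same audio, while the frame length, and therefore the shape of each sample, stays the same.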
The training corpus can be a public audio corpus or can be collected independently. Each training sample is an audio signal whose duration equals the preset frame length and which is labeled as a voice signal or a noise signal. For example, a voice signal sample can be labeled 1, and a noise signal sample can be given a different label.
Step S302: calculate the energy and zero-crossing rate of each training sample.
Here, the energy is the L2 norm of the training sample, and the zero-crossing rate is the number of times the sign of the training sample changes.
Step S303: train the detection model, using the energy and zero-crossing rate of the multiple training samples as the input features of the detection model and whether each training sample is a voice signal as the desired output feature of the detection model.
The detection model in this embodiment can be built on a deep neural network, a logistic regression model, or a support vector machine model. For the multiple training samples obtained, the energy and zero-crossing rate of each training sample are used as the two-dimensional acoustic input features of the detection model, and whether each training sample is a voice signal or a noise signal is used as the desired output feature of the detection model, and the detection model is trained accordingly.
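As one illustration of step S303, the sketch below trains a logistic regression model on two-dimensional (energy, zero-crossing rate) features with plain batch gradient descent. The source names logistic regression only as one of three options and does not describe a training procedure, so the optimizer, the learning rate, the number of epochs, the label 0 for noise samples, and the toy feature values (already scaled to the range 0 to 1) are all assumptions.

```python
import math


def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def train_logistic(samples, labels, lr=0.5, epochs=2000):
    """Fit weights for (energy, zcr) features by batch gradient descent."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for (energy, zcr), y in zip(samples, labels):
            err = sigmoid(w[0] * energy + w[1] * zcr + b) - y
            grad_w[0] += err * energy
            grad_w[1] += err * zcr
            grad_b += err
        w = [w[0] - lr * grad_w[0] / n, w[1] - lr * grad_w[1] / n]
        b -= lr * grad_b / n
    return w, b


def predict(model, energy, zcr):
    """Probability that a frame with these features is a voice signal."""
    w, b = model
    return sigmoid(w[0] * energy + w[1] * zcr + b)


# Made-up, pre-scaled training features: (energy, zero-crossing rate).
train_x = [(0.9, 0.2), (0.8, 0.3), (0.7, 0.1),    # labeled voice
           (0.1, 0.8), (0.2, 0.7), (0.15, 0.9)]   # labeled noise
train_y = [1, 1, 1, 0, 0, 0]                      # 0 for noise is an assumption
model = train_logistic(train_x, train_y)
print(predict(model, 0.85, 0.2) > 0.5)            # True
print(predict(model, 0.1, 0.85) > 0.5)            # False
```

In practice the raw energy and zero-crossing counts would have very different scales, so some form of feature normalization would likely be needed before training; the source does not address this.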
An embodiment of the present invention also provides a voice activity detection apparatus, as shown in Fig. 4. The embodiment is described using Fig. 4 only as an example, which does not mean that the present invention is limited to it. Fig. 4 is a structural schematic diagram of one embodiment of the voice activity detection apparatus provided by the present invention. As shown in Fig. 4, the voice activity detection apparatus 40 provided by this embodiment includes a smoothing module 401, a computing module 402 and a determining module 403.
The smoothing module 401 is configured to smooth an audio signal to be detected.
The computing module 402 is configured to calculate the energy and zero-crossing rate of each frame signal in the smoothed audio signal.
The determining module 403 is configured to determine, according to the energy and zero-crossing rate of each frame signal and a pre-trained detection model, the probability that each frame signal is a voice signal.
The determining module 403 is further configured to determine the noise signal and the voice signal in the audio signal according to the probability that each frame signal is a voice signal.
The apparatus provided by this embodiment can be used to execute the technical solution of the method embodiment shown in Fig. 1. Its implementation principle and technical effect are similar and are not repeated here.
In one possible implementation, the smoothing module 401 is specifically configured to calculate one average value for every N sampling points in the audio signal to be detected and use it as the smoothed output value of those N sampling points, where N is a natural number greater than 1.
In one possible implementation, the voice activity detection apparatus may further include a framing module, configured to frame the smoothed audio signal according to a preset frame length and a preset frame shift before the energy and zero-crossing rate of each frame signal are calculated, where the preset frame length is greater than the preset frame shift.
In one possible implementation, the determining module 403 may further be specifically configured to: if the probability is greater than a preset probability value, determine that the frame signal is a voice signal; if the probability is less than or equal to the preset probability value, determine that the frame signal is a noise signal.
In one possible implementation, before the probability that each frame signal is a voice signal is determined according to the energy and zero-crossing rate of each frame signal and the pre-trained detection model, the following is further included:
Smoothing and framing the audio signals in a training corpus to generate multiple training samples;
Training the detection model, using the energy and zero-crossing rate of the multiple training samples as the input features of the detection model and whether each training sample is a voice signal as the desired output feature of the detection model.
In one possible implementation, the detection model is trained based on a deep neural network, a logistic regression model, or a support vector machine model.
The embodiment of the present invention also provides a kind of voice activity detection apparatus, shown in Figure 5, the embodiment of the present invention only with It is illustrated for Fig. 5, is not offered as that present invention is limited only to this.Fig. 5 is that voice activity detection apparatus one provided by the invention is real Apply the structural schematic diagram of example.The detection device can be mobile phone, computer, digital broadcast terminal, messaging devices, trip Play console, tablet device, Medical Devices, body-building equipment, personal digital assistant etc..As shown in figure 5, inspection provided in this embodiment Measurement equipment may include following one or more components:Processing component 501, memory 502, audio component 503, power supply module 504, communication component 505, multimedia component 506, sensor module 507 and input/output (I/O) interface 508.
Processing component 501 usually control detection device integrated operation, such as with display, telephone call, data communication, phase Machine operation and record operate associated operation.Processing component 501 may include that one or more processors 5011 refer to execute It enables, to perform all or part of the steps of the methods described above.In addition, processing component 501 may include one or more modules, just Interaction between processing component 501 and other assemblies.For example, processing component 501 may include multi-media module, it is more to facilitate Interaction between media component 506 and processing component 501.
The memory 502 is configured to store various types of data to support operation of the detection device. Examples of such data include instructions for any application or method operating on the detection device, contact data, phone book data, messages, pictures, videos, and the like. The memory 502 may be implemented by any type of volatile or non-volatile storage device or a combination thereof, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc. In this embodiment, a computer program is stored in the memory 502, and the computer program can be executed by the processor 5011 to implement the technical solution of any of the above voice activity detection method embodiments.
The power supply component 504 provides power to the various components of the detection device. The power supply component 504 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the detection device.
The multimedia component 506 includes a screen that provides an output interface between the detection device and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, slides, and gestures on the touch panel. A touch sensor may sense not only the boundary of a touch or slide action, but also the duration and pressure associated with the touch or slide operation. In some embodiments, the multimedia component 506 includes a front camera and/or a rear camera. When the detection device is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or may have focusing and optical zoom capabilities.
The audio component 503 is configured to output and/or input audio signals. For example, the audio component 503 includes a microphone (MIC). When the detection device is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode, the microphone is configured to receive external audio signals. The received audio signal may be further stored in the memory 502. In this embodiment, the microphone can collect the voice signal with which the user controls the detection device by voice; the processing component 501 then performs voice activity detection on that signal, followed by a series of subsequent processing steps such as speech recognition. In some embodiments, the audio component 503 further includes a loudspeaker for outputting audio signals. In this embodiment, prompt information for the user can be played through the loudspeaker.
The I/O interface 508 provides an interface between the processing component 501 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. These buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.
The sensor component 507 includes one or more sensors for providing status assessments of various aspects of the detection device. For example, the sensor component 507 can detect the open/closed state of the detection device and the relative positioning of components, such as the display and keypad of the detection device. The sensor component 507 can also detect a change in position of the detection device or of one of its components, the presence or absence of user contact with the detection device, the orientation or acceleration/deceleration of the detection device, and a change in its temperature. The sensor component 507 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 507 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 507 may also include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 505 is configured to facilitate wired or wireless communication between the detection device and other devices. In this embodiment, the communication component 505 is used to implement interaction between the detection device and a cloud server. The detection device can access a wireless network based on a communication standard, such as WiFi, 2G, 3G, or 4G, or a combination thereof. In one exemplary embodiment, the communication component 505 receives broadcast signals or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 505 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the detection device may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for executing the above method.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 502 including instructions; the instructions can be executed by the processor 5011 of the detection device to complete the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
The voice activity detection apparatus provided in this embodiment of the present invention can be used to execute the technical solution of any of the above method embodiments. Its implementation principle and technical effect are similar and are not described again here.
An embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored; the computer program is executed by a processor to implement the technical solution of any of the above method embodiments.
Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.

Claims (10)

1. A voice activity detection method, characterized by comprising:
smoothing an audio signal to be detected;
calculating the energy and zero-crossing rate of each frame signal in the smoothed audio signal;
determining, according to the energy and zero-crossing rate of each frame signal and a pre-trained detection model, the probability that each frame signal is a voice signal; and
determining, according to the probability that each frame signal is a voice signal, the noise signal and the voice signal in the audio signal.
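The two per-frame features named in claim 1 can be sketched as follows. This assumes the common textbook definitions (energy as the sum of squared samples, zero-crossing rate as the fraction of adjacent sample pairs whose signs differ); the claim text in this section does not spell out the formulas, and the function names are hypothetical.

```python
def frame_energy(frame):
    """Short-time energy: sum of squared sample values in the frame (assumed definition)."""
    return sum(s * s for s in frame)

def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ (assumed definition).

    Zero-valued samples are treated as non-negative.
    """
    if len(frame) < 2:
        return 0.0
    crossings = sum(
        1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(frame) - 1)

# A frame that flips sign on every sample has the maximum zero-crossing rate.
frame = [0.5, -0.5, 0.5, -0.5]
print(frame_energy(frame))        # 1.0
print(zero_crossing_rate(frame))  # 1.0
```

The pair `(frame_energy(frame), zero_crossing_rate(frame))` would then serve as the input feature vector of the detection model for that frame.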
2. The method according to claim 1, characterized in that smoothing the audio signal to be detected comprises:
calculating one average value for every N sampling points in the audio signal to be detected, as the smoothed output value of those N sampling points, where N is a natural number greater than 1.
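One way to read the smoothing step of claim 2 is as block averaging, sketched below. The interpretation (non-overlapping groups of N samples, each replaced by a single mean) and the handling of a trailing incomplete group (dropped here) are assumptions; the name `smooth` is hypothetical.

```python
def smooth(samples, n):
    """Replace every group of n consecutive samples with the group's mean.

    Assumption: groups do not overlap, and a trailing group with fewer
    than n samples is dropped.
    """
    if n <= 1:
        raise ValueError("n must be a natural number greater than 1")
    return [
        sum(samples[i:i + n]) / n
        for i in range(0, len(samples) - n + 1, n)
    ]

print(smooth([1, 3, 5, 7, 9], 2))  # [2.0, 6.0]
```

Under this reading the output is N times shorter than the input, which also reduces the amount of data the later feature computation has to process.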
3. The method according to claim 1, characterized in that, before calculating the energy and zero-crossing rate of each frame signal in the smoothed audio signal, the method further comprises:
framing the smoothed audio signal according to a preset frame length and a preset frame shift, the preset frame length being greater than the preset frame shift.
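Because the frame length exceeds the frame shift, consecutive frames overlap. A minimal sketch of such framing follows; the name `split_frames` is hypothetical, and dropping a trailing partial frame is an assumption.

```python
def split_frames(samples, frame_len, frame_shift):
    """Cut samples into overlapping frames.

    Each frame has frame_len samples and starts frame_shift samples after
    the previous one. A trailing partial frame is dropped (assumption).
    """
    if not 0 < frame_shift < frame_len:
        raise ValueError("frame_len must be greater than frame_shift, and frame_shift positive")
    return [
        samples[start:start + frame_len]
        for start in range(0, len(samples) - frame_len + 1, frame_shift)
    ]

frames = split_frames(list(range(10)), frame_len=4, frame_shift=2)
print(frames)  # [[0, 1, 2, 3], [2, 3, 4, 5], [4, 5, 6, 7], [6, 7, 8, 9]]
```

With a frame length of 4 and a shift of 2, each frame shares half of its samples with its neighbor.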
4. The method according to claim 1, characterized in that determining, according to the probability that each frame signal is a voice signal, the noise signal and the voice signal in the audio signal comprises:
if the probability is greater than a preset probability value, determining that the frame signal is a voice signal;
if the probability is less than or equal to the preset probability value, determining that the frame signal is a noise signal.
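The decision rule of claim 4 amounts to a per-frame threshold test. The sketch below applies it and, as an illustrative extra that the claim does not describe, merges consecutive voice frames into index ranges. The function names and the default threshold of 0.5 are hypothetical.

```python
def label_frames(probabilities, threshold=0.5):
    """Label a frame 'voice' if its probability exceeds the threshold, else 'noise'."""
    return ["voice" if p > threshold else "noise" for p in probabilities]

def voice_segments(labels):
    """Merge runs of consecutive 'voice' frames into (first, last) frame-index pairs.

    This grouping step is an illustrative addition, not part of the claim.
    """
    segments, start = [], None
    for i, label in enumerate(labels):
        if label == "voice" and start is None:
            start = i
        elif label != "voice" and start is not None:
            segments.append((start, i - 1))
            start = None
    if start is not None:
        segments.append((start, len(labels) - 1))
    return segments

labels = label_frames([0.9, 0.8, 0.5, 0.2, 0.7], threshold=0.5)
print(labels)                  # ['voice', 'voice', 'noise', 'noise', 'voice']
print(voice_segments(labels))  # [(0, 1), (4, 4)]
```

Note that a probability exactly equal to the threshold is labeled noise, matching the "less than or equal to" branch of the claim.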
5. The method according to claim 1, characterized in that, before determining, according to the energy and zero-crossing rate of each frame signal and the pre-trained detection model, the probability that each frame signal is a voice signal, the method further comprises:
smoothing and framing the audio signals in a training corpus to generate multiple training samples;
using the energy and zero-crossing rate of the multiple training samples as the input features of the detection model, using whether each of the multiple training samples is a voice signal as the desired output of the detection model, and training the detection model.
6. The method according to claim 5, characterized in that the detection model is trained based on a deep neural network, a logistic regression model, or a support vector machine model.
7. A voice activity detection apparatus, characterized by comprising:
a smoothing module, configured to smooth an audio signal to be detected;
a computing module, configured to calculate the energy and zero-crossing rate of each frame signal in the smoothed audio signal;
a determining module, configured to determine, according to the energy and zero-crossing rate of each frame signal and a pre-trained detection model, the probability that each frame signal is a voice signal;
the determining module being further configured to determine, according to the probability that each frame signal is a voice signal, the noise signal and the voice signal in the audio signal.
8. The apparatus according to claim 7, characterized in that the smoothing module is specifically configured to calculate one average value for every N sampling points in the audio signal to be detected, as the smoothed output value of those N sampling points, where N is a natural number greater than 1.
9. A voice activity detection device, characterized by comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and is configured to be executed by the processor to implement the method according to any one of claims 1-6.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon, and the computer program is executed by a processor to implement the method according to any one of claims 1-6.
CN201810605698.8A 2018-06-13 2018-06-13 Voice activity detection method, device and equipment Pending CN108831508A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810605698.8A CN108831508A (en) 2018-06-13 2018-06-13 Voice activity detection method, device and equipment

Publications (1)

Publication Number Publication Date
CN108831508A true CN108831508A (en) 2018-11-16

Family

ID=64145020

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810605698.8A Pending CN108831508A (en) 2018-06-13 2018-06-13 Voice activity detection method, device and equipment

Country Status (1)

Country Link
CN (1) CN108831508A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101197130A (en) * 2006-12-07 2008-06-11 华为技术有限公司 Sound activity detecting method and detector thereof
US20090125304A1 (en) * 2007-11-13 2009-05-14 Samsung Electronics Co., Ltd Method and apparatus to detect voice activity
CN101625857A (en) * 2008-07-10 2010-01-13 新奥特(北京)视频技术有限公司 Self-adaptive voice endpoint detection method
CN101494049A (en) * 2009-03-11 2009-07-29 北京邮电大学 Method for extracting audio characteristic parameter of audio monitoring system
CN103646649A (en) * 2013-12-30 2014-03-19 中国科学院自动化研究所 High-efficiency voice detecting method
CN107134277A (en) * 2017-06-15 2017-09-05 深圳市潮流网络技术有限公司 A kind of voice-activation detecting method based on GMM model

Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112071328A (en) * 2019-06-10 2020-12-11 谷歌有限责任公司 Audio noise reduction
CN112071328B (en) * 2019-06-10 2024-03-26 谷歌有限责任公司 Audio noise reduction
CN110349597A (en) * 2019-07-03 2019-10-18 山东师范大学 A kind of speech detection method and device
CN110349597B (en) * 2019-07-03 2021-06-25 山东师范大学 Voice detection method and device
CN112189232A (en) * 2019-07-31 2021-01-05 深圳市大疆创新科技有限公司 Audio processing method and device
CN110648656A (en) * 2019-08-28 2020-01-03 北京达佳互联信息技术有限公司 Voice endpoint detection method and device, electronic equipment and storage medium
WO2022134781A1 (en) * 2020-12-23 2022-06-30 深圳壹账通智能科技有限公司 Prolonged speech detection method, apparatus and device, and storage medium
CN112969130A (en) * 2020-12-31 2021-06-15 维沃移动通信有限公司 Audio signal processing method and device and electronic equipment
CN113744752A (en) * 2021-08-30 2021-12-03 西安声必捷信息科技有限公司 Voice processing method and device
CN116153341A (en) * 2023-04-20 2023-05-23 深圳锐盟半导体有限公司 Control method and device of voice detection device

Similar Documents

Publication Publication Date Title
CN108831508A (en) Voice activity detection method, device and equipment
CN111179961B (en) Audio signal processing method and device, electronic equipment and storage medium
WO2019214361A1 (en) Method for detecting key term in speech signal, device, terminal, and storage medium
CN110634507A (en) Speech classification of audio for voice wakeup
JP2019117623A (en) Voice dialogue method, apparatus, device and storage medium
CN110808063A (en) Voice processing method and device for processing voice
CN107799126A (en) Sound end detecting method and device based on Supervised machine learning
CN111933112B (en) Awakening voice determination method, device, equipment and medium
CN110992963B (en) Network communication method, device, computer equipment and storage medium
CN103024182B (en) Method and device which enter into photo album interface from shoot interface of mobile terminal
JP7166294B2 (en) Audio processing method, device and storage medium
CN111063342A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN104240278B (en) The determination of equipment body position
CN110648656A (en) Voice endpoint detection method and device, electronic equipment and storage medium
CN109599104A (en) Multi-beam choosing method and device
CN107992813A (en) A kind of lip condition detection method and device
US20220165258A1 (en) Voice processing method, electronic device, and storage medium
CN108665889A (en) The Method of Speech Endpoint Detection, device, equipment and storage medium
CN109388699A (en) Input method, device, equipment and storage medium
CN109256145A (en) Audio-frequency processing method, device, terminal and readable storage medium storing program for executing based on terminal
WO2022147692A1 (en) Voice command recognition method, electronic device and non-transitory computer-readable storage medium
CN104850592B (en) The method and apparatus for generating model file
KR101927050B1 (en) User terminal and computer readable recorindg medium including a user adaptive learning model to be tranined with user customized data without accessing a server
CN112614507A (en) Method and apparatus for detecting noise
CN113744736B (en) Command word recognition method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20181116