CN108831508A - Voice activity detection method, device and equipment - Google Patents
- Publication number
- CN108831508A CN201810605698.8A
- Authority
- CN
- China
- Prior art keywords
- signal
- frame
- voice
- audio signal
- probability
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L21/00—Processing of the speech or voice signal to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
- G10L21/02—Speech enhancement, e.g. noise reduction or echo cancellation
- G10L21/0208—Noise filtering
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/78—Detection of presence or absence of voice signals
- G10L25/84—Detection of presence or absence of voice signals for discriminating voice from noise
Abstract
Embodiments of the present invention provide a voice activity detection method, device, and equipment. The method includes: smoothing an audio signal to be detected; calculating the energy and zero-crossing rate of each frame of the smoothed audio signal; determining, according to the energy and zero-crossing rate of each frame and a pre-trained detection model, the probability that each frame is a speech signal; and determining the noise signal and the speech signal in the audio signal according to the probability that each frame is a speech signal. By smoothing an audio signal that contains noise, the method substantially weakens the noise in the audio signal and improves the performance of voice activity detection in noisy environments.
Description
Technical field
Embodiments of the present invention relate to speech signal processing technology, and in particular to a voice activity detection method, device, and equipment.
Background art
Voice activity detection (VAD), also known as speech endpoint detection or speech boundary detection, detects speech and non-speech in a speech signal in order to identify and eliminate long silent periods in the signal stream. It is commonly used in speech processing systems such as speech recognition, speech coding, and speech enhancement, where it reduces the speech coding bit rate, saves communication bandwidth, reduces the energy consumption of mobile devices, and improves the recognition rate.
A speech signal is non-stationary and easily disturbed by noise, and such disturbance can seriously affect the accuracy of VAD. The existing VAD method based on the G.729 standard calculates the energy of the signal and then sets a threshold to classify each frame of the signal in a simple manner. However, this method cannot achieve satisfactory results in the presence of noise.
With the continuous development of speech processing technology, the requirements on voice activity detection keep rising. A voice activity detection method is therefore needed that still maintains good detection performance in noisy environments.
Summary of the invention
Embodiments of the present invention provide a voice activity detection method, device, and equipment, to solve the prior-art problem that voice activity detection performs poorly in noisy environments.
In a first aspect, an embodiment of the present invention provides a voice activity detection method, including:
smoothing an audio signal to be detected;
calculating the energy and zero-crossing rate of each frame of the smoothed audio signal;
determining, according to the energy and zero-crossing rate of each frame and a pre-trained detection model, the probability that each frame is a speech signal;
determining the noise signal and the speech signal in the audio signal according to the probability that each frame is a speech signal.
In one possible implementation, smoothing the audio signal to be detected includes:
calculating one average value for every N sampling points in the audio signal to be detected and using it as the smoothed output value of those N sampling points, where N is a natural number greater than 1.
In one possible implementation, before calculating the energy and zero-crossing rate of each frame of the smoothed audio signal, the method further includes:
dividing the smoothed audio signal into frames according to a preset frame length and a preset frame shift, where the preset frame length is greater than the preset frame shift.
In one possible implementation, determining the noise signal and the speech signal in the audio signal according to the probability that each frame is a speech signal includes:
if the probability is greater than a preset probability value, determining that the frame is a speech signal;
if the probability is less than or equal to the preset probability value, determining that the frame is a noise signal.
In one possible implementation, before determining, according to the energy and zero-crossing rate of each frame and the pre-trained detection model, the probability that each frame is a speech signal, the method further includes:
smoothing and framing the audio signals in a training corpus to generate multiple training samples;
training the detection model using the energy and zero-crossing rate of the training samples as the input features of the detection model and whether each training sample is a speech signal as the expected output feature of the detection model.
In one possible implementation, the detection model is trained based on a deep neural network, a logistic regression model, or a support vector machine model.
In a second aspect, an embodiment of the present invention provides a voice activity detection device, including:
a smoothing module, configured to smooth an audio signal to be detected;
a calculation module, configured to calculate the energy and zero-crossing rate of each frame of the smoothed audio signal;
a determination module, configured to determine, according to the energy and zero-crossing rate of each frame and a pre-trained detection model, the probability that each frame is a speech signal;
the determination module being further configured to determine the noise signal and the speech signal in the audio signal according to the probability that each frame is a speech signal.
In one possible implementation, the smoothing module is specifically configured to calculate one average value for every N sampling points in the audio signal to be detected and use it as the smoothed output value of those N sampling points, where N is a natural number greater than 1.
In a third aspect, an embodiment of the present invention provides voice activity detection equipment, including:
a memory;
a processor; and
a computer program;
where the computer program is stored in the memory and is configured to be executed by the processor to implement the method according to any one of the first aspect.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium on which a computer program is stored, the computer program being executed by a processor to implement the method according to any one of the first aspect.
The voice activity detection method, device, and equipment provided by the embodiments of the present invention smooth the audio signal to be detected, calculate the energy and zero-crossing rate of each frame of the smoothed audio signal, determine the probability that each frame is a speech signal according to the energy and zero-crossing rate of each frame and a pre-trained detection model, and determine the noise signal and the speech signal in the audio signal according to that probability, thereby achieving high-performance voice activity detection in noisy environments. Because speech signals and noise signals have different characteristics, smoothing substantially reduces the amplitude of the noise signal in the audio signal while reducing the amplitude of the speech signal far less. More discriminative acoustic features can therefore be extracted, which improves the performance of voice activity detection in noisy environments.
Brief description of the drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the present invention and, together with the specification, serve to explain the principles of the present invention.
Fig. 1 is a flowchart of an embodiment of the voice activity detection method provided by the present invention;
Fig. 2 is a flowchart of another embodiment of the voice activity detection method provided by the present invention;
Fig. 3 is a flowchart of training the detection model in an embodiment of the voice activity detection method provided by the present invention;
Fig. 4 is a schematic structural diagram of an embodiment of the voice activity detection device provided by the present invention;
Fig. 5 is a schematic structural diagram of an embodiment of the voice activity detection equipment provided by the present invention.
The above drawings show specific embodiments of the present invention, which are described in more detail below. The drawings and the written description are not intended to limit the scope of the inventive concept in any way, but to explain the concept of the present invention to those skilled in the art by reference to specific embodiments.
Detailed description of the embodiments
Exemplary embodiments are described in detail here, and examples of them are shown in the accompanying drawings. When the following description refers to the drawings, the same numbers in different drawings denote the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with the present invention. Rather, they are merely examples of devices and methods consistent with some aspects of the present invention as detailed in the appended claims.
The terms "include" and "have" and any variations of them in the description and claims of the present invention are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or equipment that contains a series of steps or units is not limited to the listed steps or units, but optionally also includes steps or units that are not listed, or optionally also includes other steps or units inherent to the process, method, product, or equipment.
"First" and "second" in the present invention serve only as labels and are not to be understood as indicating or implying an order, relative importance, or the number of the technical features indicated. "Multiple" means two or more. "And/or" describes an association between associated objects and indicates that three relationships are possible; for example, "A and/or B" can mean: A alone, both A and B, or B alone. The character "/" generally indicates an "or" relationship between the objects before and after it.
References throughout this specification to "one embodiment" or "an embodiment" mean that a particular feature, structure, or characteristic related to the embodiment is included in at least one embodiment of the present application. Therefore, occurrences of "in one embodiment" or "in an embodiment" throughout the specification do not necessarily refer to the same embodiment. It should be noted that, where no conflict arises, the embodiments of the present invention and the features in the embodiments can be combined with one another.
Fig. 1 is a flowchart of an embodiment of the voice activity detection method provided by the present invention. As shown in Fig. 1, the voice activity detection method provided by this embodiment may include:
Step S101: smooth the audio signal to be detected.
With the continuous development of artificial intelligence technology, intelligent applications based on speech recognition keep appearing. Taking mobile phones as an example, applications such as voice search and voice navigation are gradually changing users' habits. A mobile phone usually collects audio signals through a microphone; the collection is inevitably affected by ambient noise, and the presence of noise degrades speech signal processing.
A specific scenario is described in detail below. A smartphone provides a driving mode so that the user can control the smartphone by voice while driving. For example, while driving, the user can say "call Zhang San" to make the phone call the contact named Zhang San in the address book, say "answer the call" to make the phone answer an incoming call, or say "search for the nearest parking lot" to make the phone provide navigation to the nearest parking lot. However, the user may well issue no voice command at all during the whole drive, or only a few; for example, in a 40-minute drive, the voice commands may last only 1 minute. Performing speech recognition on all 40 minutes of collected audio would place a heavy load on the phone's processor, waste a large amount of processing resources, and make the phone's power consumption excessive. In this case, a voice activity detection method is needed to identify the 1 minute of speech in the 40 minutes of collected audio, so that only this 1 minute of speech is recognized, which improves the efficiency of speech recognition and reduces the phone's power consumption. However, in-vehicle noise is severe; it directly affects the performance and stability of voice activity detection, makes the detection results inaccurate, and in turn affects subsequent speech recognition and processing.
Based on a detailed analysis of the commonalities and differences between noise signals and speech signals, this embodiment exploits the non-stationary nature of speech signals and the stationary nature of in-vehicle noise signals: it smooths the collected audio signal to be detected to weaken the influence of noise on speech signal detection.
Smoothing substantially reduces the amplitude of the noise signal in the audio signal, whereas the amplitude of the speech signal in the audio signal is reduced far less than that of the noise signal. More discriminative acoustic features can therefore be extracted, achieving high-performance voice activity detection in noisy environments.
Step S102: calculate the energy and zero-crossing rate of each frame of the smoothed audio signal.
Optionally, before the calculation, the method may further include: dividing the smoothed audio signal into frames according to a preset frame length and a preset frame shift, where the preset frame length is greater than the preset frame shift. Specifically, the frame length can be set to 25 milliseconds and the frame shift to 10 milliseconds, in which case an audio signal 85 milliseconds long is divided into 7 frames.
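The framing arithmetic above can be checked with a minimal Python sketch. The function name split_into_frames and the choice of one sample per millisecond are assumptions made only for illustration; the source does not state a sampling rate or how trailing samples are handled.

```python
def split_into_frames(samples, frame_length, frame_shift):
    """Split a sequence into overlapping frames of frame_length samples.

    Assumption: trailing samples that do not fill a whole frame are dropped.
    """
    if frame_shift <= 0 or frame_length <= frame_shift:
        raise ValueError("need 0 < frame_shift < frame_length")
    frames = []
    start = 0
    while start + frame_length <= len(samples):
        frames.append(samples[start:start + frame_length])
        start += frame_shift
    return frames


# Assumed unit for the demo: 1 sample = 1 ms, so 85 samples = 85 ms of audio.
signal = list(range(85))
frames = split_into_frames(signal, frame_length=25, frame_shift=10)
print(len(frames))      # 7
print(frames[1][0])     # 10 (the second frame starts one frame shift later)
```

Because the frame length exceeds the frame shift, consecutive frames overlap, here by 15 samples.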
In this embodiment, the energy of each frame can be represented by the L2 norm of the frame; that is, the energy of a frame equals the sum of the squares of the values of the sampling points in the frame. The zero-crossing rate of each frame is the number of times the frame passes through zero, that is, the number of times the sign of the signal changes within the frame. Energy and zero-crossing rate are cheap to compute and place little computational load on the equipment, which gives them the advantage of low power consumption.
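A minimal Python sketch of the two features follows. It uses the definition given in the text, the sum of squared sample values (strictly speaking, this is the square of the L2 norm). The function names and the convention that a sample equal to zero counts as non-negative are assumptions, since the source does not specify them.

```python
def frame_energy(frame):
    """Sum of the squared sample values in one frame."""
    return sum(x * x for x in frame)


def zero_crossing_count(frame):
    """Number of sign changes between consecutive samples in one frame."""
    crossings = 0
    for previous, current in zip(frame, frame[1:]):
        # Assumption: a sample equal to zero is treated as non-negative.
        if (previous >= 0) != (current >= 0):
            crossings += 1
    return crossings


frame = [0.5, -0.5, 0.25, 0.25, -1.0]
print(frame_energy(frame))          # 1.625
print(zero_crossing_count(frame))   # 3
```

Both features need only one pass over the frame, which is consistent with the low computational load described above.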
Step S103: determine, according to the energy and zero-crossing rate of each frame and a pre-trained detection model, the probability that each frame is a speech signal.
The energy and zero-crossing rate obtained for each frame are fed as input features into the pre-trained detection model, which outputs the probability that each frame is a speech signal. Optionally, the probability that each frame is a noise signal can also be obtained.
Step S104: determine the noise signal and the speech signal in the audio signal according to the probability that each frame is a speech signal.
The noise signal and the speech signal in the audio signal are determined from the obtained probability values. Specifically, if the probability that a frame is a speech signal is greater than a preset probability value, the frame is determined to be a speech signal; if the probability that a frame is a speech signal is less than or equal to the preset probability value, the frame is determined to be a noise signal. For example, the preset probability value can be set to 0.5.
Alternatively, if the probability that a frame is a speech signal is greater than the probability that the frame is a noise signal, the frame is determined to be a speech signal; if the probability that a frame is a speech signal is less than or equal to the probability that the frame is a noise signal, the frame is determined to be a noise signal. Normally, the probability that a frame is a speech signal and the probability that it is a noise signal sum to 1.
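The decision rule can be sketched in a few lines of Python. The function name classify_frames and the probability values are made up for illustration; only the threshold of 0.5 and the "greater than" comparison come from the text.

```python
def classify_frames(speech_probabilities, threshold=0.5):
    """Label a frame 'speech' if its probability exceeds the threshold, else 'noise'."""
    return ["speech" if p > threshold else "noise" for p in speech_probabilities]


probabilities = [0.9, 0.5, 0.2, 0.51]    # made-up model outputs
print(classify_frames(probabilities))     # ['speech', 'noise', 'noise', 'speech']
```

When the speech and noise probabilities of a frame sum to 1, comparing the two probabilities gives the same result as comparing the speech probability with 0.5, so the two rules described above coincide in that case.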
The voice activity detection method provided by this embodiment of the present invention smooths the audio signal to be detected, calculates the energy and zero-crossing rate of each frame of the smoothed audio signal, determines the probability that each frame is a speech signal according to the energy and zero-crossing rate of each frame and a pre-trained detection model, and determines the noise signal and the speech signal in the audio signal according to that probability, thereby achieving high-performance voice activity detection in noisy environments. Because speech signals and noise signals have different characteristics, smoothing substantially reduces the amplitude of the noise signal in the audio signal while reducing the amplitude of the speech signal far less. More discriminative acoustic features can therefore be extracted, which improves the performance of voice activity detection in noisy environments.
The smoothing in the technical solution of the method embodiment shown in Fig. 1 is described in detail below through a specific embodiment. With the spread of intelligent terminals, voice activity detection increasingly runs on terminal equipment, so it must not only be stable and reliable but also keep its computational cost low. To this end, in this embodiment, smoothing the audio signal to be detected may include: calculating one average value for every N sampling points in the audio signal to be detected and using it as the smoothed output value of those N sampling points, where N is a natural number greater than 1.
For example, if N is 4, that is, the averaging is performed over blocks of 4 sampling points, then for an audio signal containing 160 sampling points the smoothed output contains 40 sampling points, each of which is the average of 4 input sampling points.
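The block averaging described above can be sketched as follows. The function name smooth_by_block_average is hypothetical, and dropping an incomplete final block is an assumption; the source does not say what happens when the signal length is not a multiple of N.

```python
def smooth_by_block_average(samples, n):
    """Replace every n consecutive samples with their average."""
    if n <= 1:
        raise ValueError("n must be a natural number greater than 1")
    # Assumption: samples left over after the last full block are dropped.
    usable = len(samples) - len(samples) % n
    return [sum(samples[i:i + n]) / n for i in range(0, usable, n)]


signal = [float(i) for i in range(160)]
smoothed = smooth_by_block_average(signal, 4)
print(len(smoothed))    # 40
print(smoothed[:3])     # [1.5, 5.5, 9.5]
```

Note that this is averaging with decimation: the output has 1/N as many samples as the input, which is where the reduction in later processing comes from.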
The voice activity detection method provided by this embodiment calculates one average value for every N sampling points in the audio signal to be detected and uses it as the smoothed output value of those N sampling points. This smoothing method is cheap to compute. In addition, because the smoothing merges multiple sampling points into one, it greatly reduces the number of sampling points and hence the amount of data processed during voice activity detection, which both speeds up prediction and meets the requirement of low power consumption.
Fig. 2 is a flowchart of another embodiment of the voice activity detection method provided by the present invention. As shown in Fig. 2, the voice activity detection method provided by this embodiment may include:
Step S201: calculate one average value for every N sampling points of the audio signal to be detected and use it as the smoothed output value of those N sampling points.
Step S202: divide the smoothed audio signal into frames according to a preset frame length and a preset frame shift, where the preset frame length is greater than the preset frame shift.
Step S203: calculate the energy and zero-crossing rate of each frame of the smoothed audio signal.
Step S204: determine, according to the energy and zero-crossing rate of each frame and a pre-trained detection model, the probability that each frame is a speech signal.
Step S205: if the probability is greater than a preset probability value, determine that the frame is a speech signal; if the probability is less than or equal to the preset probability value, determine that the frame is a noise signal.
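Steps S201 to S205 can be chained as in the following Python sketch. It is an illustration under stated assumptions, not the patented implementation: all function names are hypothetical, frame length and shift are given in samples rather than milliseconds, and placeholder_model is a made-up stand-in for the pre-trained detection model, which the source does not specify in this much detail.

```python
import math


def smooth(samples, n):
    """S201: average every n samples (an incomplete final block is dropped)."""
    usable = len(samples) - len(samples) % n
    return [sum(samples[i:i + n]) / n for i in range(0, usable, n)]


def frames_of(samples, frame_length, frame_shift):
    """S202: overlapping frames; lengths are in samples for this demo."""
    last_start = len(samples) - frame_length
    return [samples[s:s + frame_length] for s in range(0, last_start + 1, frame_shift)]


def features(frame):
    """S203: energy (sum of squares) and number of sign changes."""
    energy = sum(x * x for x in frame)
    crossings = sum((a >= 0) != (b >= 0) for a, b in zip(frame, frame[1:]))
    return energy, crossings


def placeholder_model(energy, crossings):
    # Hypothetical stand-in for the pre-trained detection model: it ignores the
    # zero-crossing count and applies a made-up logistic curve to the energy.
    return 1.0 / (1.0 + math.exp(-(energy - 1.0)))


def detect(samples, model, n=4, frame_length=25, frame_shift=10, threshold=0.5):
    smoothed = smooth(samples, n)                                   # S201
    labels = []
    for frame in frames_of(smoothed, frame_length, frame_shift):   # S202
        energy, crossings = features(frame)                        # S203
        probability = model(energy, crossings)                     # S204
        labels.append(probability > threshold)                     # S205: True = speech
    return labels


# Synthetic input: a fast, low-amplitude alternation followed by a slow, loud wave.
quiet = [0.01 if i % 2 == 0 else -0.01 for i in range(200)]
loud = [math.sin(2 * math.pi * i / 40) for i in range(200)]
labels = detect(quiet + loud, placeholder_model)
print(labels)   # [False, False, False, True, True, True, True, True]
```

In this synthetic input the rapidly alternating segment averages out to zero under smoothing while the slowly varying segment keeps most of its amplitude, so the energy feature separates the two segments after smoothing.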
The voice activity detection method provided by this embodiment merges multiple sampling points into one by averaging, which reduces the number of sampling points, the amount of data processed during voice activity detection, and the power consumption. The smoothing substantially reduces the amplitude of the noise signal in the audio signal, which improves the performance of voice activity detection in noisy environments. Using energy and zero-crossing rate, which are cheap to compute, as input features meets the demand for low power consumption, so the voice activity detection method provided by this embodiment can run on terminal equipment.
On the basis of the above embodiments, this embodiment describes in detail the training process of the detection model used in those embodiments. Fig. 3 is a flowchart of training the detection model in an embodiment of the voice activity detection method provided by the present invention. As shown in Fig. 3, the training process of the detection model may include:
Step S301: smooth and frame the audio signals in a training corpus to generate multiple training samples.
The smoothing can use the same method as in the detection embodiments above: one average value is calculated for every N sampling points and used as the smoothed output value of those N sampling points. The smoothed audio signal is divided into frames according to a preset frame length and a preset frame shift, where the preset frame length is greater than the preset frame shift, so that adjacent frames partially overlap. In the training stage, this framing method increases the number of training samples. It should be noted that the training stage and the detection stage must use the same frame length, but the frame shift may differ; for an audio signal of a given duration, a smaller frame shift yields more training samples. For example, the frame length can be set to 25 milliseconds and the frame shift to 5 milliseconds.
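The effect of the frame shift on the number of training samples follows from simple arithmetic: a signal of T samples yields floor((T - L) / S) + 1 complete frames for frame length L and frame shift S. The short Python sketch below evaluates this; the function name frame_count, the one-second recording, and the one-sample-per-millisecond unit are assumptions chosen only for the demonstration.

```python
def frame_count(num_samples, frame_length, frame_shift):
    """Number of complete frames obtainable from num_samples samples."""
    if num_samples < frame_length:
        return 0
    return (num_samples - frame_length) // frame_shift + 1


# Assumed unit for the demo: 1 sample = 1 ms, for a one-second recording.
print(frame_count(1000, 25, 10))   # 98
print(frame_count(1000, 25, 5))    # 196
```

Halving the frame shift from 10 to 5 milliseconds roughly doubles the number of frames obtained from the same audio.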
The training corpus can be a public audio corpus or can be collected independently. Each training sample is an audio signal whose duration equals the preset frame length and is labeled as either a speech signal or a noise signal. For example, speech signal samples can be labeled 1, and noise signal samples can be labeled …
Step S302: calculate the energy and zero-crossing rate of each training sample.
Here, the energy is the L2 norm of the training sample, and the zero-crossing rate is the number of times the sign of the training sample changes.
Step S303: train the detection model using the energy and zero-crossing rate of the training samples as the input features of the detection model and whether each training sample is a speech signal as the expected output feature of the detection model.
The detection model in this embodiment can be built on a deep neural network, a logistic regression model, or a support vector machine model. For the training samples obtained, the energy and zero-crossing rate of each training sample serve as the two-dimensional acoustic input feature of the detection model, and whether each training sample is a speech signal or a noise signal serves as the expected output feature of the detection model; the detection model is trained on these.
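As one illustration of step S303, the sketch below trains a logistic regression model, one of the three model families named in the text, on two-dimensional (energy, zero-crossing rate) features using plain batch gradient descent. Everything specific in it is an assumption: the function names, the learning rate and epoch count, the made-up feature values (pre-scaled to a similar range), and the use of 0 as the noise label.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, learning_rate=0.1, epochs=2000):
    """Fit two weights and a bias by batch gradient descent on the cross-entropy loss."""
    weights = [0.0, 0.0]
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0, 0.0]
        grad_b = 0.0
        for (energy, zcr), label in zip(features, labels):
            error = sigmoid(weights[0] * energy + weights[1] * zcr + bias) - label
            grad_w[0] += error * energy
            grad_w[1] += error * zcr
            grad_b += error
        weights[0] -= learning_rate * grad_w[0] / n
        weights[1] -= learning_rate * grad_w[1] / n
        bias -= learning_rate * grad_b / n
    return weights, bias


def predict_speech_probability(model, energy, zcr):
    """Probability that a frame with the given features is speech."""
    weights, bias = model
    return sigmoid(weights[0] * energy + weights[1] * zcr + bias)


# Made-up, already-scaled (energy, zero-crossing rate) pairs for illustration only.
training_features = [(0.9, 0.2), (0.8, 0.3), (0.7, 0.1),
                     (0.1, 0.8), (0.2, 0.9), (0.15, 0.7)]
training_labels = [1, 1, 1, 0, 0, 0]   # 1 = speech; 0 for noise is an assumed label

model = train_logistic_regression(training_features, training_labels)
print(predict_speech_probability(model, 0.85, 0.2) > 0.5)   # True
print(predict_speech_probability(model, 0.1, 0.9) > 0.5)    # False
```

With real data the raw energy values can span several orders of magnitude, so some form of feature scaling would likely be needed before gradient descent behaves well; the source does not discuss this.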
An embodiment of the present invention also provides a voice activity detection device, as shown in Fig. 4. The embodiment is described using Fig. 4 only as an example, which does not mean that the present invention is limited to it. Fig. 4 is a schematic structural diagram of an embodiment of the voice activity detection device provided by the present invention. As shown in Fig. 4, the voice activity detection device 40 provided by this embodiment includes a smoothing module 401, a calculation module 402, and a determination module 403.
The smoothing module 401 is configured to smooth an audio signal to be detected.
The calculation module 402 is configured to calculate the energy and zero-crossing rate of each frame of the smoothed audio signal.
The determination module 403 is configured to determine, according to the energy and zero-crossing rate of each frame and a pre-trained detection model, the probability that each frame is a speech signal.
The determination module 403 is further configured to determine the noise signal and the speech signal in the audio signal according to the probability that each frame is a speech signal.
The device provided by this embodiment can be used to carry out the technical solution of the method embodiment shown in Fig. 1; its implementation principle and technical effect are similar and are not repeated here.
In one possible implementation, the smoothing module 401 is specifically configured to calculate one average value for every N sampling points in the audio signal to be detected and use it as the smoothed output value of those N sampling points, where N is a natural number greater than 1.
In one possible implementation, the voice activity detection device may further include a framing module, configured to divide the smoothed audio signal into frames according to a preset frame length and a preset frame shift before the energy and zero-crossing rate of each frame are calculated, where the preset frame length is greater than the preset frame shift.
In one possible implementation, the determination module 403 may be further specifically configured to determine that a frame is a speech signal if the probability is greater than a preset probability value, and to determine that the frame is a noise signal if the probability is less than or equal to the preset probability value.
In one possible implementation, before the probability that each frame is a speech signal is determined according to the energy and zero-crossing rate of each frame and the pre-trained detection model, the following is further included:
smoothing and framing the audio signals in a training corpus to generate multiple training samples;
training the detection model using the energy and zero-crossing rate of the training samples as the input features of the detection model and whether each training sample is a speech signal as the expected output feature of the detection model.
In one possible implementation, the detection model is trained based on a deep neural network, a logistic regression model, or a support vector machine model.
An embodiment of the present invention also provides voice activity detection equipment, as shown in Fig. 5. The embodiment is described using Fig. 5 only as an example, which does not mean that the present invention is limited to it. Fig. 5 is a schematic structural diagram of an embodiment of the voice activity detection equipment provided by the present invention. The detection equipment can be a mobile phone, a computer, a digital broadcast terminal, a messaging device, a game console, a tablet, a medical device, a fitness device, a personal digital assistant, and so on. As shown in Fig. 5, the detection equipment provided by this embodiment may include one or more of the following components: a processing component 501, a memory 502, an audio component 503, a power supply component 504, a communication component 505, a multimedia component 506, a sensor component 507, and an input/output (I/O) interface 508.
The processing component 501 generally controls the overall operation of the detection equipment, such as operations associated with display, telephone calls, data communication, camera operation, and recording. The processing component 501 may include one or more processors 5011 that execute instructions to perform all or some of the steps of the methods described above. In addition, the processing component 501 may include one or more modules that facilitate interaction between the processing component 501 and other components. For example, the processing component 501 may include a multimedia module to facilitate interaction between the multimedia component 506 and the processing component 501.
The memory 502 is configured to store various types of data to support operation of the detection equipment. Examples of such data include instructions for any application or method operating on the detection equipment, contact data, phone book data, messages, pictures, videos, and so on. The memory 502 can be implemented by any type of volatile or non-volatile storage device or a combination of them, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc. In this embodiment, a computer program is stored in the memory 502, and the program can be executed by the processor 5011 to implement the technical solution of any of the voice activity detection method embodiments above.
The power supply component 504 supplies power to the various components of the detection device. The power supply component 504 may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the detection device.
The multimedia component 506 includes a screen that provides an output interface between the detection device and the user. In some embodiments, the screen may include a liquid crystal display (LCD) and a touch panel (TP). If the screen includes a touch panel, the screen may be implemented as a touch screen to receive input signals from the user. The touch panel includes one or more touch sensors to sense touches, swipes, and gestures on the touch panel. The touch sensors may sense not only the boundary of a touch or swipe action but also the duration and pressure associated with the touch or swipe operation. In some embodiments, the multimedia component 506 includes a front camera and/or a rear camera. When the detection device is in an operating mode, such as a shooting mode or a video mode, the front camera and/or the rear camera can receive external multimedia data. Each front camera and rear camera may be a fixed optical lens system or may have focusing and optical zoom capability.
The audio component 503 is configured to output and/or input audio signals. For example, the audio component 503 includes a microphone (MIC) that is configured to receive external audio signals when the detection device is in an operating mode, such as a call mode, a recording mode, or a voice recognition mode. The received audio signal may be further stored in the memory 502. In this embodiment, the voice signal with which the user controls the detection device by voice can be captured by the microphone; the processing component 501 then performs voice activity detection on it, followed by subsequent processing such as speech recognition. In some embodiments, the audio component 503 further includes a loudspeaker for outputting audio signals. In this embodiment, prompt information for the user can be played through the loudspeaker.
The I/O interface 508 provides an interface between the processing component 501 and peripheral interface modules, which may be a keyboard, a click wheel, buttons, and the like. The buttons may include, but are not limited to, a home button, a volume button, a start button, and a lock button.
The sensor component 507 includes one or more sensors for providing status assessments of various aspects of the detection device. For example, the sensor component 507 can detect the open/closed state of the detection device and the relative positioning of components, for example the display and the keypad of the detection device. The sensor component 507 can also detect a change in position of the detection device or of one of its components, the presence or absence of user contact with the detection device, the orientation or acceleration/deceleration of the detection device, and a change in its temperature. The sensor component 507 may include a proximity sensor configured to detect the presence of nearby objects without any physical contact. The sensor component 507 may also include an optical sensor, such as a CMOS or CCD image sensor, for use in imaging applications. In some embodiments, the sensor component 507 may further include an acceleration sensor, a gyroscope sensor, a magnetic sensor, a pressure sensor, or a temperature sensor.
The communication component 505 is configured to facilitate wired or wireless communication between the detection device and other devices. In this embodiment, the communication component 505 is used for interaction between the detection device and a cloud server. The detection device can access a wireless network based on a communication standard, such as WiFi, 2G, 3G, or 4G, or a combination thereof. In one exemplary embodiment, the communication component 505 receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In one exemplary embodiment, the communication component 505 further includes a near-field communication (NFC) module to facilitate short-range communication. For example, the NFC module may be implemented based on radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, and other technologies.
In an exemplary embodiment, the detection device may be implemented by one or more application-specific integrated circuits (ASICs), digital signal processors (DSPs), digital signal processing devices (DSPDs), programmable logic devices (PLDs), field-programmable gate arrays (FPGAs), controllers, microcontrollers, microprocessors, or other electronic components, for performing the above method.
In an exemplary embodiment, a non-transitory computer-readable storage medium including instructions is also provided, for example the memory 502 including instructions, which can be executed by the processor 5011 of the detection device to carry out the above method. For example, the non-transitory computer-readable storage medium may be a ROM, a random access memory (RAM), a CD-ROM, a magnetic tape, a floppy disk, an optical data storage device, or the like.
The voice activity detection apparatus provided by this embodiment of the present invention can be used to execute the technical solution of any of the above method embodiments. Its implementation principle and technical effect are similar and are not repeated here.
An embodiment of the present invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the technical solution of any of the above method embodiments.
Finally, it should be noted that the above embodiments are intended only to illustrate the technical solutions of the present invention, not to limit them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments may still be modified, or some or all of their technical features may be replaced by equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present invention.
Claims (10)
1. A voice activity detection method, characterized by comprising:
smoothing an audio signal to be detected;
calculating the energy and the zero-crossing rate of each frame signal in the smoothed audio signal;
determining, according to the energy and the zero-crossing rate of each frame signal and a pre-trained detection model, the probability that each frame signal is a voice signal; and
determining the noise signal and the voice signal in the audio signal according to the probability that each frame signal is a voice signal.
2. The method according to claim 1, characterized in that smoothing the audio signal to be detected comprises:
calculating one average value for every N sampling points in the audio signal to be detected, and using it as the smoothed output value of those N sampling points, where N is a natural number greater than 1.
3. The method according to claim 1, characterized in that, before calculating the energy and the zero-crossing rate of each frame signal in the smoothed audio signal, the method further comprises:
framing the smoothed audio signal according to a preset frame length and a preset frame shift, the preset frame length being greater than the preset frame shift.
4. The method according to claim 1, characterized in that determining the noise signal and the voice signal in the audio signal according to the probability that each frame signal is a voice signal comprises:
if the probability is greater than a preset probability value, determining that the frame signal is a voice signal;
if the probability is less than or equal to the preset probability value, determining that the frame signal is a noise signal.
5. The method according to claim 1, characterized in that, before determining the probability that each frame signal is a voice signal according to the energy and the zero-crossing rate of each frame signal and the pre-trained detection model, the method further comprises:
smoothing and framing the audio signals in a training corpus to generate a plurality of training samples;
training the detection model, using the energy and the zero-crossing rate of the plurality of training samples as the input features of the detection model and using whether each of the plurality of training samples is a voice signal as the desired output of the detection model.
6. The method according to claim 5, characterized in that the detection model is trained based on a deep neural network, a logistic regression model, or a support vector machine model.
7. A voice activity detection apparatus, characterized by comprising:
a smoothing module, configured to smooth an audio signal to be detected;
a calculation module, configured to calculate the energy and the zero-crossing rate of each frame signal in the smoothed audio signal;
a determining module, configured to determine, according to the energy and the zero-crossing rate of each frame signal and a pre-trained detection model, the probability that each frame signal is a voice signal;
the determining module being further configured to determine the noise signal and the voice signal in the audio signal according to the probability that each frame signal is a voice signal.
8. The apparatus according to claim 7, characterized in that the smoothing module is specifically configured to calculate one average value for every N sampling points in the audio signal to be detected and use it as the smoothed output value of those N sampling points, where N is a natural number greater than 1.
9. A voice activity detection apparatus, characterized by comprising:
a memory;
a processor; and
a computer program;
wherein the computer program is stored in the memory and is configured to be executed by the processor to implement the method according to any one of claims 1 to 6.
10. A computer-readable storage medium, characterized in that a computer program is stored thereon, the computer program being executed by a processor to implement the method according to any one of claims 1 to 6.
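The method of claims 1 to 6 can be sketched end to end in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: all function names are hypothetical; averaging non-overlapping blocks of N samples is only one possible reading of claim 2; mean squared amplitude and sign-change counting are assumed definitions of energy and zero-crossing rate; logistic regression is picked from the three model families named in claim 6; and the threshold of 0.5, the frame sizes, and the synthetic sine tone standing in for speech are illustrative values only.

```python
import math

def smooth(samples, n):
    """Average each non-overlapping block of n samples (n > 1) into one value."""
    return [sum(samples[i:i + n]) / len(samples[i:i + n])
            for i in range(0, len(samples), n)]

def split_frames(samples, frame_len, frame_shift):
    """Cut overlapping frames; claim 3 requires frame_len > frame_shift."""
    last_start = len(samples) - frame_len
    return [samples[s:s + frame_len] for s in range(0, last_start + 1, frame_shift)]

def frame_features(frame):
    """Return (mean energy, zero-crossing rate) for one frame."""
    energy = sum(x * x for x in frame) / len(frame)
    crossings = sum(1 for a, b in zip(frame, frame[1:]) if (a >= 0) != (b >= 0))
    return energy, crossings / max(len(frame) - 1, 1)

def speech_probability(features, weights, bias):
    """Logistic model (assumed): probability that a frame is a voice signal."""
    z = bias + sum(w * f for w, f in zip(weights, features))
    z = max(-60.0, min(60.0, z))  # clamp so math.exp cannot overflow
    return 1.0 / (1.0 + math.exp(-z))

def train_logistic(feature_rows, labels, lr=0.5, epochs=2000):
    """Plain stochastic gradient descent on the log loss (labels are 0 or 1)."""
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for feats, label in zip(feature_rows, labels):
            err = speech_probability(feats, weights, bias) - label
            weights = [w - lr * err * f for w, f in zip(weights, feats)]
            bias -= lr * err
    return weights, bias

def detect(samples, n, frame_len, frame_shift, weights, bias, threshold=0.5):
    """Per-frame decision: True marks a voice frame, False a noise frame."""
    frames = split_frames(smooth(samples, n), frame_len, frame_shift)
    return [speech_probability(frame_features(f), weights, bias) > threshold
            for f in frames]

# Synthetic demo: low-level alternating "noise" around a sine tone.
noise = [0.01 if k % 2 == 0 else -0.01 for k in range(200)]
tone = [math.sin(2 * math.pi * k / 40) for k in range(400)]
signal = noise + tone + noise

# Train on one pure-noise frame (label 0) and one pure-tone frame (label 1).
frames = split_frames(smooth(signal, 2), 20, 10)
training_rows = [frame_features(frames[0]), frame_features(frames[15])]
weights, bias = train_logistic(training_rows, [0, 1])
flags = detect(signal, 2, 20, 10, weights, bias)
```

In this toy signal, averaging pairs of samples cancels the alternating noise entirely, which illustrates the stated purpose of the smoothing step; real noise would only be attenuated, and a real detection model would be trained on a labelled corpus rather than on two frames.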
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810605698.8A CN108831508A (en) | 2018-06-13 | 2018-06-13 | Voice activity detection method, device and equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810605698.8A CN108831508A (en) | 2018-06-13 | 2018-06-13 | Voice activity detection method, device and equipment |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108831508A (en) | 2018-11-16 |
Family
ID=64145020
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810605698.8A Pending CN108831508A (en) | 2018-06-13 | 2018-06-13 | Voice activity detection method, device and equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108831508A (en) |
Cited By (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110349597A (en) * | 2019-07-03 | 2019-10-18 | 山东师范大学 | A kind of speech detection method and device |
CN110648656A (en) * | 2019-08-28 | 2020-01-03 | 北京达佳互联信息技术有限公司 | Voice endpoint detection method and device, electronic equipment and storage medium |
CN112071328A (en) * | 2019-06-10 | 2020-12-11 | 谷歌有限责任公司 | Audio noise reduction |
CN112189232A (en) * | 2019-07-31 | 2021-01-05 | 深圳市大疆创新科技有限公司 | Audio processing method and device |
CN112969130A (en) * | 2020-12-31 | 2021-06-15 | 维沃移动通信有限公司 | Audio signal processing method and device and electronic equipment |
CN113744752A (en) * | 2021-08-30 | 2021-12-03 | 西安声必捷信息科技有限公司 | Voice processing method and device |
WO2022134781A1 (en) * | 2020-12-23 | 2022-06-30 | 深圳壹账通智能科技有限公司 | Prolonged speech detection method, apparatus and device, and storage medium |
CN116153341A (en) * | 2023-04-20 | 2023-05-23 | 深圳锐盟半导体有限公司 | Control method and device of voice detection device |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101197130A (en) * | 2006-12-07 | 2008-06-11 | 华为技术有限公司 | Sound activity detecting method and detector thereof |
US20090125304A1 (en) * | 2007-11-13 | 2009-05-14 | Samsung Electronics Co., Ltd | Method and apparatus to detect voice activity |
CN101494049A (en) * | 2009-03-11 | 2009-07-29 | 北京邮电大学 | Method for extracting audio characteristic parameter of audio monitoring system |
CN101625857A (en) * | 2008-07-10 | 2010-01-13 | 新奥特(北京)视频技术有限公司 | Self-adaptive voice endpoint detection method |
CN103646649A (en) * | 2013-12-30 | 2014-03-19 | 中国科学院自动化研究所 | High-efficiency voice detecting method |
CN107134277A (en) * | 2017-06-15 | 2017-09-05 | 深圳市潮流网络技术有限公司 | A kind of voice-activation detecting method based on GMM model |
-
2018
- 2018-06-13 CN CN201810605698.8A patent/CN108831508A/en active Pending
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101197130A (en) * | 2006-12-07 | 2008-06-11 | 华为技术有限公司 | Sound activity detecting method and detector thereof |
US20090125304A1 (en) * | 2007-11-13 | 2009-05-14 | Samsung Electronics Co., Ltd | Method and apparatus to detect voice activity |
CN101625857A (en) * | 2008-07-10 | 2010-01-13 | 新奥特(北京)视频技术有限公司 | Self-adaptive voice endpoint detection method |
CN101494049A (en) * | 2009-03-11 | 2009-07-29 | 北京邮电大学 | Method for extracting audio characteristic parameter of audio monitoring system |
CN103646649A (en) * | 2013-12-30 | 2014-03-19 | 中国科学院自动化研究所 | High-efficiency voice detecting method |
CN107134277A (en) * | 2017-06-15 | 2017-09-05 | 深圳市潮流网络技术有限公司 | A kind of voice-activation detecting method based on GMM model |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112071328A (en) * | 2019-06-10 | 2020-12-11 | 谷歌有限责任公司 | Audio noise reduction |
CN112071328B (en) * | 2019-06-10 | 2024-03-26 | 谷歌有限责任公司 | Audio noise reduction |
CN110349597A (en) * | 2019-07-03 | 2019-10-18 | 山东师范大学 | A kind of speech detection method and device |
CN110349597B (en) * | 2019-07-03 | 2021-06-25 | 山东师范大学 | Voice detection method and device |
CN112189232A (en) * | 2019-07-31 | 2021-01-05 | 深圳市大疆创新科技有限公司 | Audio processing method and device |
CN110648656A (en) * | 2019-08-28 | 2020-01-03 | 北京达佳互联信息技术有限公司 | Voice endpoint detection method and device, electronic equipment and storage medium |
WO2022134781A1 (en) * | 2020-12-23 | 2022-06-30 | 深圳壹账通智能科技有限公司 | Prolonged speech detection method, apparatus and device, and storage medium |
CN112969130A (en) * | 2020-12-31 | 2021-06-15 | 维沃移动通信有限公司 | Audio signal processing method and device and electronic equipment |
CN113744752A (en) * | 2021-08-30 | 2021-12-03 | 西安声必捷信息科技有限公司 | Voice processing method and device |
CN116153341A (en) * | 2023-04-20 | 2023-05-23 | 深圳锐盟半导体有限公司 | Control method and device of voice detection device |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108831508A (en) | Voice activity detection method, device and equipment | |
CN111179961B (en) | Audio signal processing method and device, electronic equipment and storage medium | |
WO2019214361A1 (en) | Method for detecting key term in speech signal, device, terminal, and storage medium | |
CN110634507A (en) | Speech classification of audio for voice wakeup | |
JP2019117623A (en) | Voice dialogue method, apparatus, device and storage medium | |
CN110808063A (en) | Voice processing method and device for processing voice | |
CN107799126A (en) | Sound end detecting method and device based on Supervised machine learning | |
CN111933112B (en) | Awakening voice determination method, device, equipment and medium | |
CN110992963B (en) | Network communication method, device, computer equipment and storage medium | |
CN103024182B (en) | Method and device which enter into photo album interface from shoot interface of mobile terminal | |
JP7166294B2 (en) | Audio processing method, device and storage medium | |
CN111063342A (en) | Speech recognition method, speech recognition device, computer equipment and storage medium | |
CN104240278B (en) | The determination of equipment body position | |
CN110648656A (en) | Voice endpoint detection method and device, electronic equipment and storage medium | |
CN109599104A (en) | Multi-beam choosing method and device | |
CN107992813A (en) | A kind of lip condition detection method and device | |
US20220165258A1 (en) | Voice processing method, electronic device, and storage medium | |
CN108665889A (en) | The Method of Speech Endpoint Detection, device, equipment and storage medium | |
CN109388699A (en) | Input method, device, equipment and storage medium | |
CN109256145A (en) | Audio-frequency processing method, device, terminal and readable storage medium storing program for executing based on terminal | |
WO2022147692A1 (en) | Voice command recognition method, electronic device and non-transitory computer-readable storage medium | |
CN104850592B (en) | The method and apparatus for generating model file | |
KR101927050B1 (en) | User terminal and computer readable recorindg medium including a user adaptive learning model to be tranined with user customized data without accessing a server | |
CN112614507A (en) | Method and apparatus for detecting noise | |
CN113744736B (en) | Command word recognition method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PB01 | Publication | |
| SE01 | Entry into force of request for substantive examination | |
| RJ01 | Rejection of invention patent application after publication | Application publication date: 20181116 |