CN116246653A - Voice endpoint detection method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN116246653A
CN116246653A (application number CN202211643948.XA)
Authority
CN
China
Prior art keywords
signals
time
sensor
abscissa
voice
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211643948.XA
Other languages
Chinese (zh)
Inventor
宋其岩
王林章
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Xiaomi Pinecone Electronic Co Ltd
Xiaomi Automobile Technology Co Ltd
Original Assignee
Beijing Xiaomi Pinecone Electronic Co Ltd
Xiaomi Automobile Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Xiaomi Pinecone Electronic Co Ltd, Xiaomi Automobile Technology Co Ltd filed Critical Beijing Xiaomi Pinecone Electronic Co Ltd
Priority to CN202211643948.XA priority Critical patent/CN116246653A/en
Publication of CN116246653A publication Critical patent/CN116246653A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being spectral information of each sub-band
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Measurement Of Velocity Or Position Using Acoustic Or Ultrasonic Waves (AREA)

Abstract

The disclosure relates to a voice endpoint detection method and device, electronic equipment, and a storage medium, and belongs to the technical field of voice processing. The method includes: acquiring original signals collected by a plurality of sensors in a sensor array; beamforming the plurality of original signals to obtain a beam signal; performing voice endpoint detection on the beam signal to obtain an initial time of the voice signal; and obtaining, based on the initial time, target times of the voice signal corresponding to the plurality of sensors. In this way, the original signals collected by the plurality of sensors are beamformed into a single beam signal, voice endpoint detection is performed on that beam signal to obtain the initial time, and the target times corresponding to the plurality of sensors are then derived from it. Because only the beam signal undergoes voice endpoint detection, the number of detections is greatly reduced and the efficiency of voice endpoint detection for the sensor array is improved; moreover, the beam signal has a high signal-to-noise ratio, which improves the accuracy of voice endpoint detection.

Description

Voice endpoint detection method and device, electronic equipment and storage medium
Technical Field
The disclosure relates to the technical field of voice processing, and in particular relates to a method and a device for detecting a voice endpoint, electronic equipment and a storage medium.
Background
At present, with the development of technologies such as artificial intelligence and natural language processing, voice processing is widely applied in the fields of intelligent home appliances, robot voice interaction, vehicle-mounted terminals and the like. Voice endpoint detection may include identifying the presence of a voice signal from the collected raw signal, as well as the start time, end time, etc. of the voice signal, which is critical to voice processing. However, the voice endpoint detection method in the related art has the problems of low efficiency and low accuracy.
Disclosure of Invention
The disclosure provides a method, an apparatus, an electronic device, a computer readable storage medium and a computer program product for detecting a voice endpoint, so as to at least solve the problems of low efficiency and low accuracy of the voice endpoint detection method in the related art. The technical scheme of the present disclosure is as follows:
according to a first aspect of an embodiment of the present disclosure, there is provided a method for detecting a voice endpoint, including: acquiring original signals acquired by a plurality of sensors in a sensor array; carrying out beam forming on a plurality of original signals to obtain beam signals; performing voice endpoint detection on the beam signals to obtain initial time of voice signals; and obtaining target time of the voice signals corresponding to the plurality of sensors based on the initial time.
In one embodiment of the present disclosure, the beamforming the plurality of original signals to obtain a beam signal includes: estimating the azimuth of a sound source to obtain an azimuth angle between the sound source and the sensor array; and carrying out beam forming on a plurality of original signals based on the azimuth angles to obtain the beam signals.
In one embodiment of the present disclosure, the beamforming the plurality of original signals to obtain a beam signal includes: in response to the original signals being broadband signals, beamforming the plurality of original signals in the frequency domain to obtain a frequency-domain beam signal; or, in response to the original signals being single-frequency signals, beamforming the plurality of original signals in the time domain to obtain a time-domain beam signal.
In one embodiment of the present disclosure, the performing voice endpoint detection on the beam signal to obtain an initial time of the voice signal includes: framing the beam signal to obtain a multi-frame beam signal; acquiring the energy-entropy ratio of each frame of the beam signal; and obtaining the initial time based on the energy-entropy ratio; wherein the initial time includes an initial start time and/or an initial end time.
In one embodiment of the present disclosure, the obtaining the initial time based on the energy entropy ratio includes: based on the energy-entropy ratio of multi-frame beam signals, an energy-entropy ratio curve is obtained, wherein the abscissa of the energy-entropy ratio curve is time, and the ordinate is energy-entropy ratio; acquiring a first intersection point and a second intersection point between the energy-entropy ratio curve and a first reference line, and acquiring a third intersection point and a fourth intersection point between the energy-entropy ratio curve and a second reference line, wherein the abscissa of the first intersection point is smaller than the abscissa of the second intersection point; determining the abscissa of the third intersection point as an initial start time of the voice signal in response to the abscissa of the third intersection point being smaller than the abscissa of the first intersection point; and/or, in response to the abscissa of the fourth intersection being greater than the abscissa of the second intersection, determining the abscissa of the fourth intersection as an initial end time of the speech signal; wherein the initial time comprises the initial start time and/or the initial end time.
In one embodiment of the disclosure, the first reference line and the second reference line are parallel to the horizontal axis, the ordinate of the point on the first reference line is a first threshold, the ordinate of the point on the second reference line is a second threshold, and the second threshold is smaller than the first threshold.
In one embodiment of the disclosure, the obtaining, based on the initial time, target times of the voice signals corresponding to the plurality of sensors includes: acquiring the time difference corresponding to the sensor; and delaying the initial time by the time difference to obtain the target time of the voice signal corresponding to the sensor.
In one embodiment of the disclosure, the acquiring the time difference corresponding to the sensor includes: performing cross-correlation processing on the original signals acquired by the sensor and the beam signals to obtain a cross-correlation curve; and determining the time corresponding to the peak value of the cross-correlation curve as the time difference corresponding to the sensor.
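The cross-correlation step in this embodiment can be sketched as follows. This is a minimal illustration, not the patent's implementation; the function name, NumPy-based lag convention, and sampling-rate parameter are assumptions.

```python
import numpy as np

def lag_by_cross_correlation(sensor_signal, beam_signal, fs):
    """Time difference of a sensor, taken as the lag (in seconds) at the
    peak of the cross-correlation between its raw signal and the beam
    signal; a positive value means the raw signal lags the beam signal."""
    n = len(beam_signal)
    corr = np.correlate(sensor_signal, beam_signal, mode="full")
    lags = np.arange(-n + 1, len(sensor_signal))    # lags in samples
    return lags[int(np.argmax(corr))] / fs

# Assumed setup: the raw signal is the beam signal delayed by 5 samples.
rng = np.random.default_rng(1)
beam = rng.normal(size=1000)
sensor = np.concatenate([np.zeros(5), beam[:-5]])
dt = lag_by_cross_correlation(sensor, beam, fs=8000)
```

Because the cross-correlation of a signal with a delayed copy of itself peaks at the delay, `dt` recovers the 5-sample offset as 5/8000 seconds.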
In one embodiment of the disclosure, the acquiring the time difference corresponding to the sensor includes: estimating the azimuth of a sound source to obtain an azimuth angle between the sound source and the sensor array; and obtaining the time difference corresponding to the sensor based on the azimuth angle and the distance between the sensor and the reference sensor.
According to a second aspect of the embodiments of the present disclosure, there is provided a device for detecting a voice endpoint, including: an acquisition module configured to perform acquisition of raw signals acquired by a plurality of sensors in the sensor array; the processing module is configured to perform beam forming on a plurality of original signals to obtain beam signals; the detection module is configured to perform voice endpoint detection on the beam signals to obtain initial time of voice signals; and the acquisition module is configured to acquire target time of the voice signals corresponding to the plurality of sensors based on the initial time.
In one embodiment of the present disclosure, the processing module is further configured to perform: estimating the azimuth of a sound source to obtain an azimuth angle between the sound source and the sensor array; and carrying out beam forming on a plurality of original signals based on the azimuth angles to obtain the beam signals.
In one embodiment of the present disclosure, the processing module is further configured to perform: in response to the original signals being broadband signals, beamforming the plurality of original signals in the frequency domain to obtain a frequency-domain beam signal; or, in response to the original signals being single-frequency signals, beamforming the plurality of original signals in the time domain to obtain a time-domain beam signal.
In one embodiment of the present disclosure, the detection module is further configured to perform: framing the beam signal to obtain a multi-frame beam signal; acquiring the energy-entropy ratio of each frame of the beam signal; and obtaining the initial time based on the energy-entropy ratio; wherein the initial time includes an initial start time and/or an initial end time.
In one embodiment of the present disclosure, the detection module is further configured to perform: based on the energy-entropy ratio of multi-frame beam signals, an energy-entropy ratio curve is obtained, wherein the abscissa of the energy-entropy ratio curve is time, and the ordinate is energy-entropy ratio; acquiring a first intersection point and a second intersection point between the energy-entropy ratio curve and a first reference line, and acquiring a third intersection point and a fourth intersection point between the energy-entropy ratio curve and a second reference line, wherein the abscissa of the first intersection point is smaller than the abscissa of the second intersection point; determining the abscissa of the third intersection point as an initial start time of the voice signal in response to the abscissa of the third intersection point being smaller than the abscissa of the first intersection point; and/or, in response to the abscissa of the fourth intersection being greater than the abscissa of the second intersection, determining the abscissa of the fourth intersection as an initial end time of the speech signal; wherein the initial time comprises the initial start time and/or the initial end time.
In one embodiment of the disclosure, the first reference line and the second reference line are parallel to the horizontal axis, the ordinate of the point on the first reference line is a first threshold, the ordinate of the point on the second reference line is a second threshold, and the second threshold is smaller than the first threshold.
In one embodiment of the present disclosure, the acquisition module is further configured to perform: acquiring the time difference corresponding to the sensor; and delaying the initial time by the time difference to obtain the target time of the voice signal corresponding to the sensor.
In one embodiment of the present disclosure, the acquisition module is further configured to perform: performing cross-correlation processing on the original signals acquired by the sensor and the beam signals to obtain a cross-correlation curve; and determining the time corresponding to the peak value of the cross-correlation curve as the time difference corresponding to the sensor.
In one embodiment of the present disclosure, the acquisition module is further configured to perform: estimating the azimuth of a sound source to obtain an azimuth angle between the sound source and the sensor array; and obtaining the time difference corresponding to the sensor based on the azimuth angle and the distance between the sensor and the reference sensor.
According to a third aspect of embodiments of the present disclosure, there is provided an electronic device comprising a processor; a memory for storing processor-executable instructions; wherein the processor is configured to implement the steps of the method according to the first aspect of the embodiments of the present disclosure.
According to a fourth aspect of embodiments of the present disclosure, there is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method of the first aspect of embodiments of the present disclosure.
According to a fifth aspect of embodiments of the present disclosure, there is provided a computer program product comprising a computer program, characterized in that the computer program, when executed by a processor of an electronic device, implements the steps of the method according to the first aspect of embodiments of the present disclosure.
The technical scheme provided by the embodiments of the present disclosure brings at least the following beneficial effects: the original signals collected by the plurality of sensors are beamformed to obtain a beam signal, voice endpoint detection is performed on the beam signal to obtain the initial time, and the target times corresponding to the plurality of sensors are derived from it. Compared with the related art, in which voice endpoint detection is mostly performed on the original signal collected by each individual sensor, here only the beam signal undergoes voice endpoint detection, so the number of detections is greatly reduced and the efficiency of voice endpoint detection for the sensor array is improved; moreover, the beam signal has a high signal-to-noise ratio, which improves the accuracy of voice endpoint detection.
It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the disclosure and together with the description, serve to explain the principles of the disclosure and do not constitute an undue limitation on the disclosure.
Fig. 1 is a flow chart illustrating a method of detecting a voice endpoint according to an exemplary embodiment.
Fig. 2 is a schematic diagram illustrating a method of detecting a voice endpoint according to an exemplary embodiment.
Fig. 3 is a flow chart illustrating a method of detecting a voice endpoint according to another exemplary embodiment.
Fig. 4 is a schematic diagram of an energy-entropy ratio curve, a first reference line, and a second reference line in a method for detecting a voice endpoint according to an exemplary embodiment.
Fig. 5 is a flowchart illustrating a method of detecting a voice endpoint according to another exemplary embodiment.
Fig. 6 is a block diagram illustrating a voice endpoint detection system according to an example embodiment.
Fig. 7 is a block diagram illustrating a voice endpoint detection apparatus according to an example embodiment.
Fig. 8 is a block diagram of an electronic device, according to an example embodiment.
Detailed Description
In order to enable those skilled in the art to better understand the technical solutions of the present disclosure, the technical solutions of the embodiments of the present disclosure will be clearly and completely described below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and claims of the present disclosure and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the disclosure described herein may be capable of operation in sequences other than those illustrated or described herein. The implementations described in the following exemplary examples are not representative of all implementations consistent with the present disclosure. Rather, they are merely examples of apparatus and methods consistent with some aspects of the present disclosure as detailed in the accompanying claims.
The data acquisition, storage, use, processing and the like in the technical scheme of the present disclosure all conform to the relevant regulations of the national laws and regulations.
Fig. 1 is a flowchart illustrating a method for detecting a voice endpoint according to an exemplary embodiment, and as shown in fig. 1, the method for detecting a voice endpoint according to an embodiment of the disclosure includes the following steps.
S101, acquiring original signals acquired by a plurality of sensors in a sensor array.
It should be noted that the executing entity of the voice endpoint detection method in the embodiments of the present disclosure is an electronic device, which includes a mobile phone, a notebook computer, a desktop computer, a vehicle-mounted terminal, an intelligent household appliance, and the like. The voice endpoint detection method of the embodiments of the present disclosure may be performed by the voice endpoint detection apparatus of the embodiments of the present disclosure, and that apparatus may be configured in any electronic device to perform the method.
In an embodiment of the present disclosure, the sensor array includes a plurality of sensors. The distribution of the sensors within the array is not strictly limited; for example, the sensors may be arranged along at least one set direction with a set distance between two adjacent sensors. The set direction is likewise not strictly limited; taking an xy two-dimensional coordinate system as an example, sensors may be arranged along the x-axis and y-axis directions respectively.
In some examples, the sensor array includes sensors 1 through 20, with the sensors 1 through 20 disposed in the x-axis direction, with a 20 cm spacing between adjacent two sensors.
The sensors are used to collect sound. The sensor type is not strictly limited and may include, for example, a microphone (MIC), sonar, radar, and the like.
It should be noted that the original signal includes a voice signal and, owing to environmental and vibration factors, also includes a noise signal. The SNR (signal-to-noise ratio) of the original signal is not strictly limited and may be, for example, -3 dB.
It will be appreciated that different sensors will collect different raw signals.
In one embodiment, the sensor can be controlled to collect the raw signal at a set sampling rate. The set sampling rate is not strictly limited; for example, it may be 8 kHz (kilohertz).
In some examples, the sensor array includes M sensors and the raw signal collected by each sensor includes N sampling point signals, so the raw signals collected by the M sensors include M×N sampling point signals in total. The sampling point signals correspond one-to-one with the sampling times, where M and N are positive integers.
It will be appreciated that the sampling times may include t_1, t_2, …, t_N, and the original signal acquired by the i-th sensor includes the sampling point signals S_i1, S_i2, …, S_iN, which correspond in turn to the sampling times t_1, t_2, …, t_N, where 1 ≤ i ≤ M and i is a positive integer.
S102, carrying out beam forming on a plurality of original signals to obtain beam signals.
In the embodiment of the disclosure, the original signals acquired by the plurality of sensors can be subjected to beam forming, so as to obtain beam signals. It should be noted that the specific manner of beamforming is not limited too much, and may be implemented by any beamforming algorithm in the related art, for example.
In some examples, the sensor array includes M sensors and the raw signal collected by each sensor includes N sampling point signals. Beamforming the plurality of raw signals to obtain the beam signal then includes performing a weighted summation on the M sampling point signals at any given sampling time to obtain the beam signal at that sampling time; the beam signal therefore includes values at the N sampling times.
For example, taking M = 5 and N = 5, the sampling times may include t_1, t_2, …, t_5, and the original signal acquired by the i-th sensor includes the sampling point signals S_i1, S_i2, …, S_i5, which correspond in turn to the sampling times t_1, t_2, …, t_5, where 1 ≤ i ≤ 5 and i is a positive integer.
Taking the sampling time t_1 as an example, the sampling point signals S_11, S_21, S_31, S_41, S_51 at time t_1 are weighted and summed to obtain the beam signal at t_1, where S_i1 is the sampling point signal acquired by the i-th sensor at time t_1 (for i = 1 to 5).
The beam signals at sampling times t_2 to t_5 are obtained in the same way as at t_1 and are not described again here.
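The weighted-summation step above can be sketched as follows. The uniform weights, function name, and NumPy representation are illustrative assumptions rather than the patent's implementation.

```python
import numpy as np

def beamform_weighted_sum(raw_signals, weights=None):
    """Weighted summation of the M sampling point signals at each
    sampling time, yielding one beam-signal value per sampling time.

    raw_signals: array of shape (M, N) -- M sensors, N sampling times.
    weights: optional per-sensor weights; uniform (1/M) if omitted.
    """
    raw_signals = np.asarray(raw_signals, dtype=float)
    m = raw_signals.shape[0]
    if weights is None:
        weights = np.full(m, 1.0 / m)   # uniform weighting (an assumption)
    return weights @ raw_signals        # sum over the sensor axis

# Example with M = 5 sensors and N = 5 sampling times.
raw = np.arange(25.0).reshape(5, 5)
beam = beamform_weighted_sum(raw)       # one value per sampling time
```

With uniform weights this reduces to averaging the M sampling point signals at each sampling time, the simplest form of the weighted summation described above.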
In one embodiment, the beamforming is performed on the plurality of original signals to obtain a beam signal, including performing an azimuth estimation on the sound source to obtain an azimuth angle between the sound source and the sensor array, and performing beamforming on the plurality of original signals based on the azimuth angle to obtain the beam signal. Thus, in the method, the azimuth angle between the sound source and the sensor array can be considered to carry out beam forming on a plurality of original signals, so as to obtain beam signals.
The azimuth angle may be any angle or an angle range, and is not limited herein. For example, as shown in fig. 2, the azimuth angle θ is the angle between the direction of the sound source and the vertical direction.
It should be noted that the specific manner of estimating the direction of the sound source is not strictly limited; it may be implemented with a direction-estimation algorithm such as CBF (Conventional Beamforming), MVDR (Minimum Variance Distortionless Response), MUSIC (Multiple Signal Classification), or CS (Compressed Sensing).
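As a rough illustration of one such algorithm, a minimal narrowband CBF (delay-and-sum) scan for a uniform linear array might look as follows. The dominant-frequency-bin shortcut, the one-degree angle grid, and the default sound speed are assumptions made for this sketch, not details fixed by the disclosure.

```python
import numpy as np

def cbf_azimuth(raw_signals, d, fs, c=343.0, n_angles=181):
    """Scan candidate azimuths with a conventional (delay-and-sum)
    beamformer on a uniform linear array with spacing d; return the
    angle (degrees) of maximum output power at the dominant bin."""
    x = np.asarray(raw_signals, dtype=float)        # shape (M, N)
    m, n = x.shape
    spec = np.fft.rfft(x, axis=1)                   # per-sensor spectra
    k = int(np.argmax(np.abs(spec[0, 1:]))) + 1     # dominant bin (skip DC)
    f = k * fs / n
    snap = spec[:, k]                               # one snapshot per sensor
    angles = np.linspace(-90.0, 90.0, n_angles)
    powers = []
    for a in angles:
        tau = np.arange(m) * d * np.sin(np.deg2rad(a)) / c
        steer = np.exp(-2j * np.pi * f * tau)       # steering vector
        powers.append(np.abs(np.conj(steer) @ snap) ** 2)
    return float(angles[int(np.argmax(powers))])

# Simulated plane wave arriving from 30 degrees on a 6-sensor array
# (all numeric values here are assumed for the example).
fs, f0, d, c = 8000.0, 1000.0, 0.05, 343.0
t = np.arange(800) / fs
delays = np.arange(6) * d * np.sin(np.deg2rad(30.0)) / c
x = np.stack([np.sin(2 * np.pi * f0 * (t - tau)) for tau in delays])
est = cbf_azimuth(x, d=d, fs=fs, c=c)
```

The scan steers the array to each candidate angle and keeps the one whose steering vector best aligns the per-sensor phases, which for this noiseless simulation is the true 30-degree azimuth.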
In one embodiment, beam forming is performed on a plurality of original signals to obtain beam signals, including beam forming is performed on a plurality of original signals in a frequency domain in response to the original signals being broadband signals to obtain frequency domain beam signals, or beam forming is performed on a plurality of original signals in a time domain in response to the original signals being single frequency signals to obtain time domain beam signals. Therefore, in the method, the original signals can be considered to be broadband signals or single-frequency signals, and the beam forming on the frequency domain or the time domain is carried out on a plurality of original signals, so that the flexibility of the beam forming is improved.
It should be noted that, the specific modes of beamforming in the frequency domain and the time domain are not limited too much, for example, beamforming in the frequency domain may be implemented by any frequency domain beamforming algorithm in the related art, and beamforming in the time domain may be implemented by any time domain beamforming algorithm in the related art.
In some examples, beamforming in the frequency domain is performed on the plurality of original signals to obtain frequency domain beam signals, including beamforming in the frequency domain is performed on the plurality of original signals based on azimuth angles to obtain frequency domain beam signals.
In some examples, time-domain beamforming is performed on the plurality of original signals to obtain a time-domain beamformed signal, including time-domain beamforming is performed on the plurality of original signals based on azimuth angles to obtain a time-domain beamformed signal.
S103, voice endpoint detection is carried out on the beam signals, and initial time of the voice signals is obtained.
It should be noted that, the specific manner of performing voice endpoint detection on the beam signal is not limited too much, for example, any voice endpoint detection algorithm in the related art may be used.
It is understood that the beam signal includes a voice signal and a noise signal, and the beam signal is subjected to voice endpoint detection to obtain an initial time of the voice signal, where the initial time of the voice signal refers to a time of the voice signal in the beam signal, and may include an initial start time and/or an initial end time of the voice signal in the beam signal.
In one embodiment, performing voice endpoint detection on the beam signal to obtain the initial time of the voice signal includes framing the beam signal to obtain a multi-frame beam signal, acquiring a target parameter of each frame of the beam signal, and obtaining the initial time based on the target parameter, where the initial time includes an initial start time and/or an initial end time. The target parameter is not strictly limited; for example, it may include volume, zero-crossing rate, spectral entropy, energy-entropy ratio, and the like.
It should be noted that, the specific manner of the framing process is not limited too much, and may be implemented by any framing processing algorithm in the related art, for example.
In some examples, obtaining the initial time based on the target parameter includes obtaining a target curve based on the target parameters of the multi-frame beam signal, where the abscissa of the target curve is time and the ordinate is the target parameter; acquiring a fifth intersection point and a sixth intersection point between the target curve and a set reference line, where the abscissa of the fifth intersection point is smaller than that of the sixth intersection point and the ordinates of points on the set reference line are all a set threshold; determining the abscissa of the fifth intersection point as the initial start time of the voice signal; and determining the abscissa of the sixth intersection point as the initial end time of the voice signal. It should be noted that the set reference line is parallel to the horizontal axis and the set threshold is not strictly limited.
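As a hedged sketch of one possible target parameter, the following computes a per-frame energy-entropy ratio (one common formulation; the disclosure does not fix the exact formula) and takes the first and last frames whose ratio exceeds a single set threshold as the initial start and end points. Comparing each frame against the threshold approximates finding the intersections of the target curve with the set reference line; the function names and threshold value are assumptions.

```python
import numpy as np

def energy_entropy_ratio(frame, eps=1e-12):
    """Energy-to-spectral-entropy ratio of one frame (one common
    formulation; the disclosure does not fix the exact formula)."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    p = spectrum / (spectrum.sum() + eps)      # normalized sub-band power
    entropy = -np.sum(p * np.log(p + eps))     # spectral entropy
    energy = np.log1p(frame @ frame)           # log frame energy
    return energy / (entropy + eps)

def detect_endpoints(signal, frame_len, threshold):
    """Frame the beam signal, compute the ratio per frame, and take the
    first/last frames above the threshold as start/end sample indices
    (approximating the intersections with the set reference line)."""
    n_frames = len(signal) // frame_len
    frames = signal[: n_frames * frame_len].reshape(n_frames, frame_len)
    ratios = np.array([energy_entropy_ratio(f) for f in frames])
    active = np.flatnonzero(ratios > threshold)
    if active.size == 0:
        return None
    return int(active[0]) * frame_len, int(active[-1] + 1) * frame_len

# Illustrative beam signal: low-level noise with a louder 1 kHz tone
# occupying samples 3200..4800 (all values here are assumptions).
fs, frame_len = 8000, 200
rng = np.random.default_rng(0)
sig = rng.normal(0.0, 0.01, fs)
tt = np.arange(3200, 4800) / fs
sig[3200:4800] += 0.5 * np.sin(2 * np.pi * 1000.0 * tt)
endpoints = detect_endpoints(sig, frame_len, threshold=0.5)
```

Voiced frames combine high energy with low spectral entropy, so their ratio rises well above the noise floor and the detected span brackets the tone.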
S104, obtaining target time of the voice signals corresponding to the plurality of sensors based on the initial time.
It is understood that the target time of the voice signal corresponding to the sensor refers to the time of the voice signal in the original signal collected by the sensor, and may include the target start time and/or the target end time of the voice signal in the original signal.
It will be appreciated that the target times for the corresponding speech signals may be different for different sensors.
In one embodiment, the obtaining the target time of the voice signal corresponding to the plurality of sensors based on the initial time includes obtaining the target start time of the voice signal corresponding to the plurality of sensors based on the initial start time and/or obtaining the target end time of the voice signal corresponding to the plurality of sensors based on the initial end time.
In one embodiment, obtaining the target time of the voice signals corresponding to the plurality of sensors based on the initial time includes obtaining a time difference corresponding to the sensors, and delaying the initial time by the time difference to obtain the target time of the voice signals corresponding to the sensors. It is understood that different sensors may correspond to different time differences, which refers to the time difference between the target time and the initial time of the speech signal to which the sensor corresponds.
For example, as shown in fig. 2, the sensor array includes sensors 1 to 4, and the time differences corresponding to sensors 1 to 4 are 0, 1, 2, and 3 seconds in order. If the initial time includes an initial start time of t1 seconds and an initial end time of t2 seconds, then the target start time of the voice signal corresponding to sensor 1 is t1 seconds and the target end time is t2 seconds; the target start time corresponding to sensor 2 is (t1+1) seconds and the target end time is (t2+1) seconds; the target start time corresponding to sensor 3 is (t1+2) seconds and the target end time is (t2+2) seconds; and the target start time corresponding to sensor 4 is (t1+3) seconds and the target end time is (t2+3) seconds.
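The delaying rule described above can be sketched as follows; the numeric start/end times and time differences are assumed example values, not values from this disclosure:

```python
def target_times(initial_start, initial_end, time_diffs):
    """Delay the initial start/end times detected on the beam signal by each
    sensor's time difference to get the per-sensor target times (seconds)."""
    return [(initial_start + dt, initial_end + dt) for dt in time_diffs]

# Sensors 1-4 with time differences 0, 1, 2 and 3 seconds.
per_sensor = target_times(5.0, 8.0, [0.0, 1.0, 2.0, 3.0])
```

With these assumed values, sensor 4's voice segment runs from 8.0 s to 11.0 s in its own raw signal.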
In some examples, obtaining the time difference corresponding to the sensor includes estimating a position of the sound source, obtaining an azimuth angle between the sound source and the sensor array, and obtaining the time difference corresponding to the sensor based on the azimuth angle and a distance between the sensor and the reference sensor.
Continuing with the example of fig. 2, the reference sensor may be sensor 1, the sensor closest to the sound source. If the time difference corresponding to sensor 1 is 0 seconds, then the time difference corresponding to sensor 2 is Δt2 = (d·sin θ)/c, the time difference corresponding to sensor 3 is Δt3 = (2d·sin θ)/c, and the time difference corresponding to sensor 4 is Δt4 = (3d·sin θ)/c, where d is the distance between two adjacent sensors, θ is the azimuth angle, and c is the speed of sound in the medium.
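A small sketch of this geometry for a uniform linear array; the spacing, azimuth, and sound speed below are assumed example values:

```python
import math

def uniform_array_delays(num_sensors, spacing, azimuth_deg, sound_speed=343.0):
    """Time difference for each sensor of a uniform linear array relative to
    the reference sensor (index 0): delta_t_m = (m * d * sin(theta)) / c."""
    theta = math.radians(azimuth_deg)
    return [m * spacing * math.sin(theta) / sound_speed
            for m in range(num_sensors)]

delays = uniform_array_delays(num_sensors=4, spacing=0.05, azimuth_deg=30.0)
```

The reference sensor always gets a delay of 0, and the delays grow linearly with the sensor's distance from it, matching the Δt2, Δt3, Δt4 expressions above.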
According to the voice endpoint detection method provided by the embodiments of the present disclosure, original signals collected by a plurality of sensors in a sensor array are acquired, beam forming is performed on the plurality of original signals to obtain a beam signal, voice endpoint detection is performed on the beam signal to obtain the initial time of the voice signal, and the target times of the voice signals corresponding to the plurality of sensors are obtained based on the initial time. In this way, endpoint detection is performed only once, on the beam signal, and the per-sensor target times are then derived from that single initial time.
Fig. 3 is a flowchart illustrating a method for detecting a voice endpoint according to another exemplary embodiment, and as shown in fig. 3, the method for detecting a voice endpoint according to an embodiment of the disclosure includes the following steps.
S301, acquiring original signals acquired by a plurality of sensors in a sensor array.
S302, carrying out beam forming on a plurality of original signals to obtain beam signals.
S303, framing the beam signals to obtain multi-frame beam signals.
For the relevant content of steps S301-S303, refer to the above embodiments, and are not repeated here.
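As a rough illustration of the beam forming in step S302 (not the implementation of this disclosure), the following time-domain delay-and-sum sketch assumes integer-sample steering delays and uses a circular shift for brevity:

```python
import numpy as np

def delay_and_sum(signals, delays_samples):
    """Minimal time-domain delay-and-sum beamformer: advance each sensor
    signal by its steering delay (integer samples, circular shift via
    np.roll for brevity) and average, reinforcing the steered direction."""
    out = np.zeros(signals.shape[1])
    for sig, delay in zip(signals, delays_samples):
        out += np.roll(sig, -delay)
    return out / len(signals)
```

A practical broadband implementation would instead apply fractional delays as phase shifts in the frequency domain, consistent with the frequency-domain branch described in this disclosure.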
S304, the energy-entropy ratio of each frame of the beam signal is obtained.
It should be noted that the specific manner of obtaining the energy-entropy ratio is not limited here; for example, any energy-entropy ratio calculation algorithm in the related art may be used.
In one embodiment, obtaining the energy-to-entropy ratio of each frame of beam signals includes obtaining short-term energy and short-term spectral entropy of any frame of beam signals, and obtaining the energy-to-entropy ratio of any frame of beam signals based on the short-term energy and the short-term spectral entropy.
In some examples, the energy-to-entropy ratio of the beam signal per frame is obtained by the following equation:
E_q = Σ_{k=0}^{N/2} |S_q(k)|²

LE_q = log(1 + E_q/a)

Y_q(k) = |S_q(k)|²

p_q(k) = Y_q(k) / Σ_{l=0}^{N/2} Y_q(l)

H_q = −Σ_{k=0}^{N/2} p_q(k)·log p_q(k)

Q_q = √(1 + |LE_q/H_q|)

wherein S_q(k) is the signal component of the q-th frame beam signal at the k-th frequency point, E_q is the short-term energy of the q-th frame beam signal, Y_q(k) is the energy of the q-th frame beam signal at the k-th frequency point, p_q(k) is the normalized spectral probability density function of the q-th frame beam signal at the k-th frequency point, H_q is the short-time spectral entropy of the q-th frame beam signal, and Q_q is the energy-entropy ratio of the q-th frame beam signal.

Here N/2 indicates that only the positive-frequency part is taken, and a is a set coefficient.
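A minimal Python sketch of step S304, assuming numpy; the coefficient a and the smoothing constant eps are illustrative values, not values fixed by this disclosure:

```python
import numpy as np

def energy_entropy_ratio(frame, a=2.0, eps=1e-10):
    """Energy-entropy ratio of one frame: log-compressed short-term energy
    over the positive-frequency spectrum, divided by the short-time
    spectral entropy (eps guards against division by or log of zero)."""
    spectrum = np.fft.rfft(frame)           # positive frequencies only (N/2)
    power = np.abs(spectrum) ** 2           # Y_q(k)
    energy = power.sum()                    # E_q
    log_energy = np.log(1.0 + energy / a)   # LE_q
    p = power / (energy + eps)              # normalized spectral probability
    entropy = -np.sum(p * np.log(p + eps))  # H_q
    return np.sqrt(1.0 + abs(log_energy / (entropy + eps)))  # Q_q
```

A voiced frame with a concentrated spectrum yields low entropy and hence a high ratio, while a noise frame with a flat spectrum yields a ratio close to 1, which is what makes the ratio usable for endpoint detection.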
S305, based on the energy-entropy ratio of the multi-frame beam signals, an energy-entropy ratio curve is obtained, wherein the abscissa of the energy-entropy ratio curve is time, and the ordinate is energy-entropy ratio.
In some examples, obtaining the energy-entropy ratio curve based on the energy-entropy ratios of the multi-frame beam signals includes performing curve fitting on the energy-entropy ratios of the multi-frame beam signals to obtain the energy-entropy ratio curve. It should be noted that the specific manner of curve fitting is not limited here; for example, linear fitting, the least squares method, or the like may be employed.
S306, a first intersection point and a second intersection point between the energy entropy ratio curve and the first reference line are obtained, and a third intersection point and a fourth intersection point between the energy entropy ratio curve and the second reference line are obtained, wherein the abscissa of the first intersection point is smaller than the abscissa of the second intersection point.
S307, determining the abscissa of the third intersection point as the initial start time of the voice signal in response to the abscissa of the third intersection point being smaller than the abscissa of the first intersection point.
And S308, determining the abscissa of the fourth intersection point as the initial ending time of the voice signal in response to the abscissa of the fourth intersection point being greater than the abscissa of the second intersection point.
The first reference line and the second reference line are not specifically limited here.
In one embodiment, as shown in fig. 4, the first reference line L1 and the second reference line L2 are both parallel to the horizontal axis, the ordinates of points on the first reference line L1 are all the first threshold Th1, the ordinates of points on the second reference line L2 are all the second threshold Th2, and the second threshold Th2 is smaller than the first threshold Th1. That is, the first reference line L1 is located above the second reference line L2. The first threshold Th1 and the second threshold Th2 are not specifically limited here; for example, the first threshold Th1 may be 1.5 times the noise signal energy, and the second threshold Th2 may be the noise signal energy.
In some examples, the first threshold Th1 and the second threshold Th2 are given by:

Th1 = α1·D + σn

Th2 = α2·D + σn

α1 > α2

wherein D is the energy difference between the voice signal and the noise signal, and σn is the pre-acquired noise signal energy or, alternatively, the average energy of silent frames; D and σn may be preset or updated in real time, which is not limited here.

α1 and α2 are set coefficients.
In some examples, as shown in fig. 4, the first intersection point and the second intersection point between the energy-entropy ratio curve L3 and the first reference line L1 are A and B respectively, and the third intersection point and the fourth intersection point between the energy-entropy ratio curve L3 and the second reference line L2 are C and D respectively. Sorting points A, B, C, and D by abscissa from small to large yields the order C, A, B, D, so the abscissa of point C may be determined as the initial start time of the voice signal, and the abscissa of point D may be determined as the initial end time of the voice signal.
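The dual-threshold intersection logic can be sketched on a sampled curve as follows; this simplification works frame-wise on the sampled ratio rather than on a fitted continuous curve, and the threshold values are assumed for illustration:

```python
import numpy as np

def dual_threshold_endpoints(ratio, th1, th2, frame_times):
    """Pick initial start/end times from an energy-entropy ratio curve:
    A and B are the first/last samples at or above the higher threshold
    th1; C and D are found by extending outward from A and B while the
    curve stays at or above the lower threshold th2 (order C, A, B, D)."""
    above = np.where(ratio >= th1)[0]
    if above.size == 0:
        return None                    # no voiced segment found
    a, b = above[0], above[-1]
    c = a
    while c > 0 and ratio[c - 1] >= th2:
        c -= 1
    d = b
    while d < len(ratio) - 1 and ratio[d + 1] >= th2:
        d += 1
    return frame_times[c], frame_times[d]
```

The higher threshold th1 confirms that a voice segment exists at all, while the lower threshold th2 recovers its quieter onset and tail, mirroring the C, A, B, D ordering described above.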
According to the voice endpoint detection method provided by the embodiments of the present disclosure, the beam signal is framed to obtain multi-frame beam signals, the energy-entropy ratio of each frame of the beam signal is obtained, an energy-entropy ratio curve is obtained based on the energy-entropy ratios of the multi-frame beam signals, and the initial time of the voice signal is obtained by comprehensively considering the first threshold and the second threshold.
Fig. 5 is a flowchart illustrating a method for detecting a voice endpoint according to another exemplary embodiment, and as shown in fig. 5, the method for detecting a voice endpoint according to an embodiment of the disclosure includes the following steps.
S501, acquiring original signals acquired by a plurality of sensors in a sensor array.
S502, carrying out beam forming on a plurality of original signals to obtain beam signals.
S503, performing voice endpoint detection on the beam signal to obtain the initial time of the voice signal.
For the relevant content of steps S501-S503, refer to the above embodiment, and are not repeated here.
S504, performing cross-correlation processing on the original signals acquired by the sensor and the beam signals to obtain a cross-correlation curve.
S505, determining the time corresponding to the peak value of the cross-correlation curve as the time difference corresponding to the sensor.
It should be noted that the specific manner of the cross-correlation processing is not limited here; for example, any cross-correlation algorithm in the related art may be used.
It should also be noted that the abscissa of the cross-correlation curve is time and the ordinate is a correlation parameter, which represents the correlation between the original signal and the beam signal at a given moment: a larger correlation parameter indicates a stronger correlation between the original signal and the beam signal at that moment, and conversely, a smaller correlation parameter indicates a weaker correlation.
Continuing with fig. 2 as an example, the original signal collected by sensor 1 and the beam signal may be subjected to cross-correlation processing to obtain a cross-correlation curve E1, and the time corresponding to the peak of the cross-correlation curve E1 is determined as the time difference corresponding to sensor 1.
Similarly, the original signal collected by sensor 2 and the beam signal may be cross-correlated to obtain a cross-correlation curve E2, whose peak time is determined as the time difference corresponding to sensor 2; the original signal collected by sensor 3 and the beam signal may be cross-correlated to obtain a cross-correlation curve E3, whose peak time is determined as the time difference corresponding to sensor 3; and the original signal collected by sensor 4 and the beam signal may be cross-correlated to obtain a cross-correlation curve E4, whose peak time is determined as the time difference corresponding to sensor 4.
The cross-correlation curves E1 to E4 are not shown in fig. 2.
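A minimal numpy sketch of this delay estimate; the sample rate and test signals are assumed for illustration:

```python
import numpy as np

def delay_by_cross_correlation(sensor_signal, beam_signal, sample_rate):
    """Estimate a sensor's time difference as the lag (in seconds) at the
    peak of the cross-correlation between its raw signal and the beam
    signal.  With mode="full", lag k means sensor_signal is beam_signal
    delayed by k samples."""
    corr = np.correlate(sensor_signal, beam_signal, mode="full")
    lags = np.arange(-len(beam_signal) + 1, len(sensor_signal))
    return lags[np.argmax(corr)] / sample_rate
```

If a sensor's raw signal is simply the beam signal delayed by 5 samples at 1 kHz, the estimate comes out at 0.005 s.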
S506, delaying the initial time by the time difference to obtain the target time of the voice signal corresponding to the sensor.
For the relevant content of step S506, refer to the above embodiment, and will not be described herein.
According to the voice endpoint detection method provided by the embodiments of the present disclosure, the original signal collected by a sensor and the beam signal may be subjected to cross-correlation processing to obtain a cross-correlation curve, and the time corresponding to the peak of the cross-correlation curve is determined as the time difference corresponding to that sensor.
On the basis of any of the above embodiments, as shown in fig. 6, the detection system 100 for a voice endpoint includes a sensor array 110, an orientation estimation module 120, a beam forming module 130, an endpoint detection module 140, a delay estimation module 150, and a delay processing module 160.
The sensor array 110 includes a plurality of sensors used for collecting original signals; the azimuth estimation module 120 is used for estimating the position of the sound source to obtain an azimuth angle between the sound source and the sensor array 110; the beam forming module 130 is used for performing beam forming on the original signals collected by the plurality of sensors in the sensor array 110 to obtain a beam signal; the endpoint detection module 140 is used for performing voice endpoint detection on the beam signal to obtain the initial time of the voice signal; the time delay estimation module 150 is used for obtaining the time difference corresponding to each sensor; and the time delay processing module 160 is used for delaying the initial time by the time difference to obtain the target time of the voice signal corresponding to each sensor.
In some examples, the delay estimation module 150 is further configured to perform a cross-correlation process on the original signal collected by the sensor and the beam signal, obtain a cross-correlation curve, and determine a time corresponding to a peak value of the cross-correlation curve as the time difference corresponding to the sensor.
In some examples, the time delay estimation module 150 is further configured to obtain the time difference corresponding to the sensor according to the azimuth angle and the distance between the sensor and the reference sensor.
Fig. 7 is a block diagram illustrating a voice endpoint detection apparatus according to an example embodiment. Referring to fig. 7, the voice endpoint detection apparatus 200 according to an embodiment of the present disclosure includes an acquisition module 210, a processing module 220, a detection module 230, and an acquisition module 240.
The acquisition module 210 is configured to perform acquisition of raw signals acquired by a plurality of sensors in the sensor array;
the processing module 220 is configured to perform beamforming on a plurality of the original signals, so as to obtain beam signals;
the detection module 230 is configured to perform voice endpoint detection on the beam signal, so as to obtain an initial time of the voice signal;
the acquisition module 240 is configured to perform obtaining target times of the voice signals corresponding to the plurality of sensors based on the initial time.
In one embodiment of the present disclosure, the processing module 220 is further configured to perform: estimating the azimuth of a sound source to obtain an azimuth angle between the sound source and the sensor array; and carrying out beam forming on a plurality of original signals based on the azimuth angles to obtain the beam signals.
In one embodiment of the present disclosure, the processing module 220 is further configured to perform: responding to the original signals as broadband signals, and carrying out beam forming on a plurality of original signals in a frequency domain to obtain frequency domain beam signals; or, in response to the original signals being single-frequency signals, performing beam forming on the plurality of original signals in the time domain to obtain time domain beam signals.
In one embodiment of the present disclosure, the detection module 230 is further configured to perform: framing the beam signal to obtain multi-frame beam signals; acquiring the energy-entropy ratio of each frame of the beam signal; and obtaining the initial time based on the energy-entropy ratio, wherein the initial time includes an initial start time and/or an initial end time.
In one embodiment of the present disclosure, the detection module 230 is further configured to perform: based on the energy-entropy ratio of multi-frame beam signals, an energy-entropy ratio curve is obtained, wherein the abscissa of the energy-entropy ratio curve is time, and the ordinate is energy-entropy ratio; acquiring a first intersection point and a second intersection point between the energy-entropy ratio curve and a first reference line, and acquiring a third intersection point and a fourth intersection point between the energy-entropy ratio curve and a second reference line, wherein the abscissa of the first intersection point is smaller than the abscissa of the second intersection point; determining the abscissa of the third intersection point as an initial start time of the voice signal in response to the abscissa of the third intersection point being smaller than the abscissa of the first intersection point; and/or, in response to the abscissa of the fourth intersection being greater than the abscissa of the second intersection, determining the abscissa of the fourth intersection as an initial end time of the speech signal; wherein the initial time comprises the initial start time and/or the initial end time.
In one embodiment of the disclosure, the first reference line and the second reference line are parallel to the horizontal axis, the ordinate of the point on the first reference line is a first threshold, the ordinate of the point on the second reference line is a second threshold, and the second threshold is smaller than the first threshold.
In one embodiment of the present disclosure, the acquisition module 240 is further configured to perform: acquiring the time difference corresponding to the sensor; and delaying the initial time by the time difference to obtain the target time of the voice signal corresponding to the sensor.
In one embodiment of the present disclosure, the acquisition module 240 is further configured to perform: performing cross-correlation processing on the original signals acquired by the sensor and the beam signals to obtain a cross-correlation curve; and determining the time corresponding to the peak value of the cross-correlation curve as the time difference corresponding to the sensor.
In one embodiment of the present disclosure, the acquisition module 240 is further configured to perform: estimating the azimuth of a sound source to obtain an azimuth angle between the sound source and the sensor array; and obtaining the time difference corresponding to the sensor based on the azimuth angle and the distance between the sensor and the reference sensor.
The specific manner in which the various modules perform the operations in the apparatus of the above embodiments have been described in detail in connection with the embodiments of the method, and will not be described in detail herein.
According to the voice endpoint detection apparatus provided by the embodiments of the present disclosure, original signals collected by a plurality of sensors in the sensor array are acquired, beam forming is performed on the plurality of original signals to obtain a beam signal, voice endpoint detection is performed on the beam signal to obtain the initial time of the voice signal, and the target times of the voice signals corresponding to the plurality of sensors are obtained based on the initial time. In this way, endpoint detection is performed only once, on the beam signal, and the per-sensor target times are then derived from that single initial time.
Fig. 8 is a block diagram of an electronic device 300, according to an example embodiment.
As shown in fig. 8, the electronic device 300 includes:
memory 310 and processor 320, bus 330 connecting the different components (including memory 310 and processor 320), memory 310 storing a computer program that when executed by processor 320 implements the method for detecting a voice endpoint according to embodiments of the present disclosure.
Bus 330 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus architectures. By way of example, and not limitation, such architectures include the Industry Standard Architecture (ISA) bus, Micro Channel Architecture (MCA) bus, Enhanced ISA bus, Video Electronics Standards Association (VESA) local bus, and Peripheral Component Interconnect (PCI) bus.
Electronic device 300 typically includes a variety of electronic device readable media. Such media can be any available media that is accessible by electronic device 300 and includes both volatile and non-volatile media, removable and non-removable media.
Memory 310 may also include computer system readable media in the form of volatile memory, such as Random Access Memory (RAM) 340 and/or cache memory 350. Electronic device 300 may further include other removable/non-removable, volatile/nonvolatile computer system storage media. By way of example only, storage system 360 may be used to read from or write to a non-removable, non-volatile magnetic media (not shown in FIG. 8, commonly referred to as a "hard disk drive"). Although not shown in fig. 8, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk"), and an optical disk drive for reading from or writing to a removable non-volatile optical disk (e.g., a CD-ROM, DVD-ROM, or other optical media) may be provided. In such cases, each drive may be coupled to bus 330 through one or more data medium interfaces. Memory 310 may include at least one program product having a set (e.g., at least one) of program modules configured to carry out the functions of the various embodiments of the disclosure.
A program/utility 380 having a set (at least one) of program modules 370 may be stored in, for example, memory 310, such program modules 370 including, but not limited to, an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment. Program modules 370 generally perform the functions and/or methods in the embodiments described in this disclosure.
The electronic device 300 may also communicate with one or more external devices 390 (e.g., keyboard, pointing device, display 391, etc.), one or more devices that enable user interaction with the electronic device 300, and/or any device (e.g., network card, modem, etc.) that enables the electronic device 300 to communicate with one or more other computing devices. Such communication may occur through an input/output (I/O) interface 392. Also, electronic device 300 may communicate with one or more networks such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the Internet, through network adapter 393. As shown in fig. 8, the network adapter 393 communicates with the other modules of the electronic device 300 via the bus 330. It should be appreciated that although not shown, other hardware and/or software modules may be used in connection with electronic device 300, including, but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, data backup storage systems, and the like.
The processor 320 executes various functional applications and data processing by running programs stored in the memory 310.
It should be noted that, the implementation process and the technical principle of the electronic device in this embodiment refer to the foregoing explanation of the method for detecting a voice endpoint in the embodiment of the disclosure, and are not repeated herein.
The electronic device provided by the embodiments of the present disclosure may execute the voice endpoint detection method described above: acquiring the original signals collected by the plurality of sensors in the sensor array, performing beam forming on the plurality of original signals to obtain a beam signal, performing voice endpoint detection on the beam signal to obtain the initial time of the voice signal, and obtaining the target times of the voice signals corresponding to the plurality of sensors based on the initial time. In this way, endpoint detection is performed only once, on the beam signal, and the per-sensor target times are then derived from that single initial time.
To achieve the above embodiments, the present disclosure also proposes a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method for detecting a voice endpoint provided by the present disclosure.
Alternatively, the computer readable storage medium may be ROM, random Access Memory (RAM), CD-ROM, magnetic tape, floppy disk, optical data storage device, etc.
To achieve the above embodiments, the present disclosure further provides a computer program product comprising a computer program, characterized in that the computer program, when executed by a processor of an electronic device, implements a method for detecting a speech endpoint as described above.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following the general principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.
It is to be understood that the present disclosure is not limited to the precise arrangements and instrumentalities shown in the drawings, and that various modifications and changes may be effected without departing from the scope thereof. The scope of the present disclosure is limited only by the appended claims.

Claims (20)

1. A method for detecting a voice endpoint, comprising:
acquiring original signals acquired by a plurality of sensors in a sensor array;
carrying out beam forming on a plurality of original signals to obtain beam signals;
performing voice endpoint detection on the beam signals to obtain initial time of voice signals;
and obtaining target time of the voice signals corresponding to the plurality of sensors based on the initial time.
2. The method of claim 1, wherein said beamforming the plurality of original signals to obtain a beamformed signal comprises:
estimating the azimuth of a sound source to obtain an azimuth angle between the sound source and the sensor array;
and carrying out beam forming on a plurality of original signals based on the azimuth angles to obtain the beam signals.
3. The method of claim 1, wherein said beamforming the plurality of original signals to obtain a beamformed signal comprises:
Responding to the original signals as broadband signals, and carrying out beam forming on a plurality of original signals in a frequency domain to obtain frequency domain beam signals; or,
and responding to the original signals as single-frequency signals, and carrying out beam forming on a plurality of original signals in the time domain to obtain time domain beam signals.
4. The method of claim 1, wherein performing voice endpoint detection on the beam signal to obtain an initial time of the voice signal comprises:
framing the beam signals to obtain multi-frame beam signals;
acquiring the energy-entropy ratio of each frame of the beam signal;
obtaining the initial time based on the energy-entropy ratio; wherein
the initial time includes an initial start time and/or an initial end time.
5. The method of claim 4, wherein the deriving the initial time based on the entropy ratio comprises:
based on the energy-entropy ratio of multi-frame beam signals, an energy-entropy ratio curve is obtained, wherein the abscissa of the energy-entropy ratio curve is time, and the ordinate is energy-entropy ratio;
acquiring a first intersection point and a second intersection point between the energy-entropy ratio curve and a first reference line, and acquiring a third intersection point and a fourth intersection point between the energy-entropy ratio curve and a second reference line, wherein the abscissa of the first intersection point is smaller than the abscissa of the second intersection point;
Determining the abscissa of the third intersection point as an initial start time of the voice signal in response to the abscissa of the third intersection point being smaller than the abscissa of the first intersection point; and/or the number of the groups of groups,
and in response to the abscissa of the fourth intersection being greater than the abscissa of the second intersection, determining the abscissa of the fourth intersection as an initial end time of the speech signal.
6. The method of claim 5, wherein the first reference line and the second reference line are each parallel to the horizontal axis, the ordinates of points on the first reference line are all a first threshold, the ordinates of points on the second reference line are all a second threshold, and the second threshold is less than the first threshold.
7. The method according to any one of claims 1-6, wherein the obtaining, based on the initial time, target times of the speech signals corresponding to the plurality of sensors includes:
acquiring the time difference corresponding to the sensor;
and delaying the initial time by the time difference to obtain the target time of the voice signal corresponding to the sensor.
8. The method of claim 7, wherein the obtaining the time difference corresponding to the sensor comprises:
Performing cross-correlation processing on the original signals acquired by the sensor and the beam signals to obtain a cross-correlation curve;
and determining the time corresponding to the peak value of the cross-correlation curve as the time difference corresponding to the sensor.
9. The method of claim 7, wherein the obtaining the time difference corresponding to the sensor comprises:
estimating the azimuth of a sound source to obtain an azimuth angle between the sound source and the sensor array;
and obtaining the time difference corresponding to the sensor based on the azimuth angle and the distance between the sensor and the reference sensor.
10. A voice endpoint detection apparatus, comprising:
an acquisition module configured to perform acquisition of raw signals acquired by a plurality of sensors in the sensor array;
the processing module is configured to perform beam forming on a plurality of original signals to obtain beam signals;
the detection module is configured to perform voice endpoint detection on the beam signals to obtain initial time of voice signals;
and the acquisition module is configured to acquire target time of the voice signals corresponding to the plurality of sensors based on the initial time.
11. The apparatus of claim 10, wherein the processing module is further configured to perform:
estimating the azimuth of a sound source to obtain an azimuth angle between the sound source and the sensor array;
and carrying out beam forming on a plurality of original signals based on the azimuth angles to obtain the beam signals.
12. The apparatus of claim 10, wherein the processing module is further configured to perform:
in response to the original signals being wideband signals, performing beam forming on the plurality of original signals in the frequency domain to obtain frequency-domain beam signals; or,
in response to the original signals being single-frequency signals, performing beam forming on the plurality of original signals in the time domain to obtain time-domain beam signals.
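The time-domain branch of claim 12 is, in its simplest form, delay-and-sum beamforming. The sketch below is an illustration only: it assumes integer per-sensor delays expressed in samples, and uses a wrap-around shift (`np.roll`) that a real implementation would replace with zero-padding or fractional-delay filters.

```python
import numpy as np

def delay_and_sum(signals, delays_samples):
    """Time-domain delay-and-sum beamforming (an illustrative sketch of
    the time-domain branch of claim 12). `signals` has shape
    (n_sensors, n_samples); each channel is advanced by its delay so
    the channels add coherently, then averaged."""
    n_sensors, n_samples = signals.shape
    out = np.zeros(n_samples)
    for sig, d in zip(signals, delays_samples):
        out += np.roll(sig, -d)  # note: np.roll wraps around the edges
    return out / n_sensors

# Two channels carrying the same pulse, the second delayed by 3 samples
x = np.zeros(16); x[5] = 1.0
y = np.roll(x, 3)
beam = delay_and_sum(np.stack([x, y]), [0, 3])
print(np.argmax(beam))  # 5 -- the two pulses are aligned
```

The wideband branch applies the same idea per frequency bin: an FFT of each channel, a per-bin phase shift `exp(-j*2*pi*f*tau)` in place of the sample shift, summation, then an inverse FFT.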
13. The apparatus of claim 10, wherein the detection module is further configured to perform:
framing the beam signals to obtain multi-frame beam signals;
acquiring the energy-entropy ratio of each frame of the beam signals;
and obtaining the initial time based on the energy-entropy ratio, wherein
the initial time includes an initial start time and/or an initial end time.
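The per-frame feature named in claim 13 can be illustrated as below. The band count, the frame length, and the smoothing constant are assumptions for the sketch; the claim only requires some energy-to-spectral-entropy ratio per frame. Speech concentrates energy in a few bands (low spectral entropy), so its ratio is high; flat noise of equal energy scores low.

```python
import numpy as np

def energy_entropy_ratio(frame, n_bands=16, eps=1e-12):
    """Energy-to-spectral-entropy ratio of one frame (an illustrative
    sketch of the feature in claim 13; n_bands and eps are assumptions)."""
    spectrum = np.abs(np.fft.rfft(frame)) ** 2
    bands = np.array_split(spectrum, n_bands)
    band_energy = np.array([b.sum() for b in bands]) + eps
    p = band_energy / band_energy.sum()        # band probability mass
    entropy = -(p * np.log(p)).sum()           # spectral entropy
    energy = frame.dot(frame)                  # short-time energy
    return energy / (entropy + eps)

# A 440 Hz tone vs. white noise scaled to the same energy
rng = np.random.default_rng(0)
t = np.arange(256) / 8000.0
tone = np.sin(2 * np.pi * 440 * t)
noise = rng.standard_normal(256)
noise *= np.sqrt(tone.dot(tone) / noise.dot(noise))
print(energy_entropy_ratio(tone) > energy_entropy_ratio(noise))  # True
```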
14. The apparatus of claim 13, wherein the detection module is further configured to perform:
obtaining an energy-entropy ratio curve based on the energy-entropy ratios of the multi-frame beam signals, wherein the abscissa of the energy-entropy ratio curve is time and the ordinate is the energy-entropy ratio;
acquiring a first intersection point and a second intersection point between the energy-entropy ratio curve and a first reference line, and acquiring a third intersection point and a fourth intersection point between the energy-entropy ratio curve and a second reference line, wherein the abscissa of the first intersection point is smaller than the abscissa of the second intersection point;
determining the abscissa of the third intersection point as the initial start time of the voice signal in response to the abscissa of the third intersection point being smaller than the abscissa of the first intersection point; and/or,
determining the abscissa of the fourth intersection point as the initial end time of the voice signal in response to the abscissa of the fourth intersection point being greater than the abscissa of the second intersection point.
15. The apparatus of claim 14, wherein the first reference line and the second reference line are each parallel to the horizontal axis, wherein the ordinate of each point on the first reference line is a first threshold, the ordinate of each point on the second reference line is a second threshold, and the second threshold is less than the first threshold.
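Claims 14 and 15 together describe a classic dual-threshold scheme: find where the curve crosses the high threshold, then walk outward to the low-threshold crossings to capture the weaker onset and tail of the utterance. A minimal sketch, assuming one ratio value per frame so abscissae are frame indices:

```python
import numpy as np

def dual_threshold_endpoints(eer, high, low):
    """Dual-threshold endpoint refinement (an illustrative sketch of
    claims 14-15). `eer` is the energy-entropy-ratio curve sampled per
    frame; `high` > `low` are the first and second thresholds."""
    above_high = np.where(eer >= high)[0]
    if len(above_high) == 0:
        return None  # no frame reached the first threshold: no speech
    first_hi, last_hi = above_high[0], above_high[-1]
    # Extend outward from the high-threshold crossings to the low threshold
    start = first_hi
    while start > 0 and eer[start - 1] >= low:
        start -= 1
    end = last_hi
    while end < len(eer) - 1 and eer[end + 1] >= low:
        end += 1
    return int(start), int(end)

curve = np.array([0.1, 0.3, 0.6, 2.0, 3.0, 2.5, 0.6, 0.2])
print(dual_threshold_endpoints(curve, high=1.0, low=0.5))  # (2, 6)
```

In the example, the low-threshold crossing at frame 2 precedes the first high-threshold crossing at frame 3, so frame 2 becomes the initial start time, matching the ordering condition in claim 14.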
16. The apparatus of any of claims 10-15, wherein the acquisition module is further configured to perform:
acquiring the time difference corresponding to the sensor;
and delaying the initial time by the time difference to obtain the target time of the voice signal corresponding to the sensor.
17. The apparatus of claim 16, wherein the acquisition module is further configured to perform:
performing cross-correlation processing on the original signals acquired by the sensor and the beam signals to obtain a cross-correlation curve;
and determining the time corresponding to the peak value of the cross-correlation curve as the time difference corresponding to the sensor.
18. The apparatus of claim 16, wherein the acquisition module is further configured to perform:
estimating the azimuth of a sound source to obtain an azimuth angle between the sound source and the sensor array;
and obtaining the time difference corresponding to the sensor based on the azimuth angle and the distance between the sensor and the reference sensor.
19. An electronic device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor is configured to:
perform the steps of the method of any one of claims 1-9.
20. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the steps of the method of any of claims 1-9.
CN202211643948.XA 2022-12-20 2022-12-20 Voice endpoint detection method and device, electronic equipment and storage medium Pending CN116246653A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211643948.XA CN116246653A (en) 2022-12-20 2022-12-20 Voice endpoint detection method and device, electronic equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211643948.XA CN116246653A (en) 2022-12-20 2022-12-20 Voice endpoint detection method and device, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116246653A true CN116246653A (en) 2023-06-09

Family

ID=86633991

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211643948.XA Pending CN116246653A (en) 2022-12-20 2022-12-20 Voice endpoint detection method and device, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116246653A (en)

Similar Documents

Publication Publication Date Title
CN109597022B (en) Method, device and equipment for calculating azimuth angle of sound source and positioning target audio
CN110992974B (en) Speech recognition method, apparatus, device and computer readable storage medium
CN109272989B (en) Voice wake-up method, apparatus and computer readable storage medium
EP3703052B1 (en) Echo cancellation method and apparatus based on time delay estimation
CN110503971A (en) Time-frequency mask neural network based estimation and Wave beam forming for speech processes
CN109599124A (en) A kind of audio data processing method, device and storage medium
Jensen et al. Nonlinear least squares methods for joint DOA and pitch estimation
WO2021008000A1 (en) Voice wakeup method and apparatus, electronic device and storage medium
CN109817209A (en) A kind of intelligent speech interactive system based on two-microphone array
CN106537501A (en) Reverberation estimator
CN114171041A (en) Voice noise reduction method, device and equipment based on environment detection and storage medium
US9659574B2 (en) Signal noise attenuation
CN113870893A (en) Multi-channel double-speaker separation method and system
Gao et al. An order-aware scheme for robust direction of arrival estimation in the spherical harmonic domain
CN116246653A (en) Voice endpoint detection method and device, electronic equipment and storage medium
Çöteli et al. Multiple sound source localization with rigid spherical microphone arrays via residual energy test
CN113132519B (en) Electronic device, voice recognition method for electronic device, and storage medium
CN114678038A (en) Audio noise detection method, computer device and computer program product
Firoozabadi et al. Combination of nested microphone array and subband processing for multiple simultaneous speaker localization
Jia et al. Two-dimensional detection based LRSS point recognition for multi-source DOA estimation
Chen et al. Multi-channel end-to-end neural network for speech enhancement, source localization, and voice activity detection
Pessentheiner et al. Localization and characterization of multiple harmonic sources
CN112712818A (en) Voice enhancement method, device and equipment
KR102346133B1 (en) Direction-of-arrival estimation method based on deep neural networks
CN111951829B (en) Sound source positioning method, device and system based on time domain unit

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination