CN112634940A - Voice endpoint detection method, device, equipment and computer readable storage medium - Google Patents

Voice endpoint detection method, device, equipment and computer readable storage medium

Info

Publication number
CN112634940A
Authority
CN
China
Prior art keywords
reflected wave
determining
vector
wave signal
audio signal
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202011453437.2A
Other languages
Chinese (zh)
Inventor
罗剑
王健宗
程宁
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd filed Critical Ping An Technology Shenzhen Co Ltd
Priority to CN202011453437.2A priority Critical patent/CN112634940A/en
Priority to PCT/CN2021/084296 priority patent/WO2022121182A1/en
Publication of CN112634940A publication Critical patent/CN112634940A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/18 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being spectral information of each sub-band
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 - Speech recognition
    • G10L15/04 - Segmentation; Word boundary detection
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/78 - Detection of presence or absence of voice signals
    • G10L25/87 - Detection of discrete points within a voice signal

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Image Analysis (AREA)
  • Two-Way Televisions, Distribution Of Moving Picture Or The Like (AREA)

Abstract

The application relates to artificial intelligence and provides a voice endpoint detection method, apparatus, device and computer-readable storage medium. The method includes the following steps: acquiring an audio signal to be detected, together with the video data and the reflected wave signal collected while the audio signal was collected; determining spectral data of the audio signal, and determining a spectral feature vector of the audio signal from the spectral data; extracting the image areas where the lips are located in the video data, and determining a video feature vector of the video data from each image area; determining the phase difference between the reflected wave signal and a preset transmitted wave signal, and determining a reflected wave vector of the reflected wave signal from the phase difference; fusing the spectral feature vector, the video feature vector and the reflected wave vector to obtain a target feature vector; and inputting the target feature vector into a pre-trained voice endpoint detection model to obtain a plurality of voice endpoints of the audio signal. The method and apparatus can improve the accuracy of voice endpoint detection.

Description

Voice endpoint detection method, device, equipment and computer readable storage medium
Technical Field
The present application relates to the field of artificial intelligence technologies, and in particular, to a method, an apparatus, a device, and a computer-readable storage medium for voice endpoint detection.
Background
Voice endpoint detection, also known as Voice Activity Detection (VAD), is a technique for locating the start and end points of speech within an audio signal, so as to distinguish the speech portions from the non-speech portions. Research shows that in noisy environments, or when a speaker's voice is distorted or the speaking rate and pitch change (the Lombard/Loud effect), voice endpoint detection applications are prone to recognition errors. Researchers have also tried extracting acoustic features through machine learning or deep learning for voice endpoint detection; however, background noise in real-life audio signals is complex (for example, the acoustic features of other speakers often appear in the signal), so detection accuracy is hard to guarantee. How to improve the accuracy of voice endpoint detection has therefore become an urgent problem.
Disclosure of Invention
The present application aims to provide a voice endpoint detection method, apparatus, device and computer-readable storage medium that improve the accuracy of voice endpoint detection.
In a first aspect, the present application provides a method for detecting a voice endpoint, including:
acquiring an audio signal to be detected, together with the video data and the reflected wave signal collected while the audio signal was collected, wherein the reflected wave signal is obtained by performing sound wave detection on the user's lips;
determining spectral data of the audio signal, and determining a spectral feature vector of the audio signal according to the spectral data;
extracting the image areas where the lips are located in the video data, and determining a video feature vector of the video data according to each image area;
determining a phase difference between the reflected wave signal and a preset transmitted wave signal, and determining a reflected wave vector of the reflected wave signal according to the phase difference;
fusing the spectral feature vector, the video feature vector and the reflected wave vector to obtain a target feature vector;
and inputting the target feature vector into a pre-trained voice endpoint detection model to obtain a plurality of voice endpoints of the audio signal.
In a second aspect, the present application further provides a voice endpoint detection apparatus, including:
an acquisition module, configured to acquire an audio signal to be detected, together with the video data and the reflected wave signal collected while the audio signal was collected, wherein the reflected wave signal is obtained by performing sound wave detection on the user's lips;
a first determining module, configured to determine spectral data of the audio signal and determine a spectral feature vector of the audio signal according to the spectral data;
a second determining module, configured to extract the image areas where the lips are located in the video data, and determine a video feature vector of the video data according to each image area;
a third determining module, configured to determine a phase difference between the reflected wave signal and a preset transmitted wave signal, and determine a reflected wave vector of the reflected wave signal according to the phase difference;
a fusion module, configured to fuse the spectral feature vector, the video feature vector and the reflected wave vector to obtain a target feature vector;
and a detection module, configured to input the target feature vector into a pre-trained voice endpoint detection model to obtain a plurality of voice endpoints of the audio signal.
In a third aspect, the present application also provides a computer device comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the voice endpoint detection method as described above.
In a fourth aspect, the present application further provides a computer-readable storage medium having a computer program stored thereon, where the computer program, when executed by a processor, implements the steps of the voice endpoint detection method as described above.
The present application provides a voice endpoint detection method, apparatus, device and computer-readable storage medium. The method obtains an audio signal to be detected together with the video data and the reflected wave signal collected while the audio signal was collected; determines the spectral data of the audio signal and derives a spectral feature vector from it; extracts the image areas where the lips are located in the video data and derives a video feature vector from each image area; determines the phase difference between the reflected wave signal and a preset transmitted wave signal and derives a reflected wave vector from the phase difference; fuses the spectral feature vector, the video feature vector and the reflected wave vector into a target feature vector; and inputs the target feature vector into a pre-trained voice endpoint detection model. Because voice endpoint detection is assisted by three different modal data (the spectral data of the audio signal, the synchronously collected video data, and the reflected wave signal), the noise robustness of voice endpoint detection is greatly improved, and its detection accuracy in complex environments can be effectively improved.
Drawings
In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the description of the embodiments are briefly introduced below. The drawings in the following description show some embodiments of the present application; for those of ordinary skill in the art, other drawings can be obtained from these drawings without creative effort.
Fig. 1 is a schematic flowchart illustrating steps of a voice endpoint detection method according to an embodiment of the present application;
FIG. 2 is a flow chart illustrating sub-steps of the voice endpoint detection method of FIG. 1;
fig. 3 is a schematic view of a scene for implementing the voice endpoint detection method provided in this embodiment;
fig. 4 is a schematic block diagram of a voice endpoint detection apparatus according to an embodiment of the present application;
FIG. 5 is a schematic block diagram of sub-modules of the speech endpoint detection apparatus of FIG. 4;
fig. 6 is a schematic block diagram of a structure of a computer device according to an embodiment of the present application.
The implementation, functional features and advantages of the objectives of the present application will be further explained with reference to the accompanying drawings.
Detailed Description
The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The flow diagrams depicted in the figures are merely illustrative; they need not include all of the elements and operations/steps, nor must they be performed in the order depicted. For example, some operations/steps may be decomposed, combined or partially combined, so the actual execution order may change according to the actual situation. In addition, although functional modules are divided in the device diagrams, in some cases they may be divided differently than shown.
The embodiments of the present application provide a voice endpoint detection method, apparatus, device and computer-readable storage medium. The voice endpoint detection method can be applied to a terminal device or a server; the terminal device may be an electronic device such as a mobile phone, tablet computer, notebook computer, desktop computer, personal digital assistant or wearable device, and the server may be a single server or a server cluster comprising a plurality of servers. The following description takes application of the method to a server as an example.
Some embodiments of the present application will be described in detail below with reference to the accompanying drawings. The embodiments described below and the features of the embodiments can be combined with each other without conflict.
Referring to fig. 1, fig. 1 is a schematic flowchart illustrating steps of a voice endpoint detection method according to an embodiment of the present application.
As shown in fig. 1, the voice endpoint detection method includes steps S101 to S106.
Step S101, obtaining an audio signal to be detected, and video data and a reflected wave signal which are collected when the audio signal is collected.
The reflected wave signal is collected by performing sound wave detection on the user's lips, and the video data is collected by video recording of the user's lips. In multi-modal speech recognition research, the accuracy of lip-reading recognition is low, but lip movement can often accurately reflect whether a person is speaking, so it has good application prospects in voice endpoint detection. Therefore, the reflected wave signal obtained by sound wave detection of the user's lips and the video data obtained by video recording of the user's lips jointly assist the voice endpoint detection of the audio signal to be detected. This can greatly improve the accuracy of voice endpoint detection, effectively cope with noise interference of various environments and intensities, and improve the accuracy of voice endpoint detection for audio signals recorded in complex environments.
In one embodiment, when the user triggers a recording instruction through a smart device, a sound recorder is started to record the audio signal of the surrounding environment. At the same time, a camera of the smart device (for example, the front or rear camera of a smartphone) is started to record video data that includes the user's lips or face. Simultaneously, sound waves are emitted through a speaker of the smart device; these are high-frequency sound waves inaudible to the human ear, and the lips reflect them back to the microphone, so the movement of the speaker's lips during speech can be identified from the change in the sound wave phase.
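To make the sound wave probing concrete, the following is a minimal Python sketch of generating the inaudible probe tone described above. The sampling rate, tone frequency and duration are assumptions for illustration, not values from this application; 20 kHz merely falls in the 17-23 kHz band mentioned later in this description.

import numpy as np

FS = 48000        # sampling rate in Hz (assumed; must exceed twice the tone frequency)
F_TONE = 20000    # probe tone frequency in Hz, inaudible to most listeners
DURATION = 5.0    # recording length in seconds (assumed)

t = np.arange(int(FS * DURATION)) / FS
probe = np.cos(2 * np.pi * F_TONE * t)  # transmitted wave A*cos(2*pi*f*t) with A = 1
# 'probe' would be played through the speaker (e.g. sounddevice.play(probe, FS))
# while the microphone records the signal reflected back by the lips.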
In an embodiment, the server obtains the audio signal to be detected, together with the video data and the reflected wave signal collected while the audio signal was collected, by initiating a data acquisition request to the smart device. After receiving the data acquisition request from the server, the smart device returns the audio signal and the corresponding video data and reflected wave signal to the server.
In some embodiments, after collecting the audio signal together with the video data and the reflected wave signal, the smart device stores them in a cloud database, so that the server can obtain the audio signal, video data and reflected wave signal collected by the smart device from the cloud database.
It should be noted that, to further ensure the privacy and security of the synchronously collected audio signal, video data, reflected wave signal and other related information, this information may also be stored in a node of a blockchain. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, where each block contains a batch of network transactions and is used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain may include a blockchain underlying platform, a platform product service layer, an application service layer, and the like.
Step S102, determining the spectral data of the audio signal, and determining the spectral feature vector of the audio signal according to the spectral data.
Short-time Fourier transform (STFT) is performed on the audio signal to determine its spectral data, and feature extraction is performed on the spectral data to obtain the spectral feature vector of the audio signal. The spectral feature vector can assist voice endpoint detection and improve its accuracy.
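As a concrete illustration of this step, the following Python sketch computes spectral data with an STFT; the 16 kHz sampling rate and the 25 ms / 10 ms framing are assumed values, not parameters given in this application.

import numpy as np
from scipy.signal import stft

def spectral_data(audio, fs=16000):
    # returns a (freq_bins, frames) magnitude spectrogram of the audio signal
    _, _, Z = stft(audio, fs=fs, nperseg=400, noverlap=240)  # 25 ms frames, 10 ms hop
    return np.abs(Z)

# Each column is one frame of spectrum data; a "first timestamp" in the
# description below indexes a short run of such columns.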
In one embodiment, as shown in fig. 2, step S102 includes substeps S1021 to S1023.
Substep S1021, determining a feature vector corresponding to each first timestamp according to the plurality of first timestamps of the spectrum data.
It should be noted that the spectrum data includes a plurality of first timestamps, each corresponding to a section of spectrum data, and the feature vector for each first timestamp can be determined from its corresponding section of spectrum data.
In an embodiment, determining the feature vector corresponding to each first timestamp according to the plurality of first timestamps of the spectrum data includes: determining a plurality of first timestamps of the spectrum data and the multi-frame spectrum data corresponding to each first timestamp; and determining the feature vector corresponding to each first timestamp according to the feature parameters of the multi-frame spectrum data corresponding to each first timestamp.
It should be noted that each first timestamp corresponds to a continuous section of multi-frame spectrum data, and the correspondence between a first timestamp and the multi-frame spectrum data can be set flexibly by the user; for example, the first timestamp may be set to correspond to the spectrum data within 2 frames before and after the current time, i.e. the first timestamp at time t corresponds to the spectrum data in the time range [t-2, t+2]. To facilitate time alignment with the subsequent video feature vectors and reflected wave vectors, the correspondence between the timestamps of each modality (e.g., the first timestamps) and its sub-modality data (e.g., the multi-frame spectrum data) can be kept consistent. A feature vector is formed from the feature parameter data of a continuous section of multi-frame spectrum, and this feature vector is one-dimensional.
Substep S1022, performing convolution pooling on the feature vector corresponding to each first timestamp.
The feature vector corresponding to each first timestamp is input into a convolution layer and/or a pooling layer to perform convolution pooling on the feature vectors.
Substep S1023, splicing the feature vectors corresponding to each first timestamp after convolution pooling to obtain the spectral feature vector corresponding to the spectrum data.
The feature vectors after convolution pooling are spliced in the time order of the first timestamps, so the spectral feature vector corresponding to the spectrum data can be accurately obtained and used to assist voice endpoint detection.
Illustratively, for the first timestamp at the current time t, the range [t-2, t+2] of 2 frames before and after is selected, feature parameters of the corresponding section of spectrum data are extracted to obtain a feature vector X_a, and X_a is subjected to a first convolution, pooling and a second convolution to obtain the sub-spectrum feature vector of the first timestamp at time t: a_t = conv_a3(Pool_a2(conv_a1(X_a))). The sub-spectrum feature vectors corresponding to each first timestamp of the spectrum data are then spliced to obtain the spectral feature vector.
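The following PyTorch sketch shows one possible realization of the per-timestamp encoder a_t = conv_a3(Pool_a2(conv_a1(X_a))). The channel sizes and kernel widths are assumptions; the 201 frequency bins match the STFT sketch above (nperseg=400), which this application does not itself specify.

import torch
import torch.nn as nn

class SpectrumEncoder(nn.Module):
    def __init__(self, freq_bins=201):
        super().__init__()
        self.conv_a1 = nn.Conv1d(freq_bins, 64, kernel_size=3, padding=1)
        self.pool_a2 = nn.MaxPool1d(kernel_size=2)
        self.conv_a3 = nn.Conv1d(64, 32, kernel_size=2)

    def forward(self, x_a):
        # x_a: (batch, freq_bins, 5), the 5 frames in [t-2, t+2]
        h = self.conv_a3(self.pool_a2(self.conv_a1(x_a)))
        return h.flatten(start_dim=1)  # sub-spectrum feature vector a_t

encoder = SpectrumEncoder()
a_t = encoder(torch.randn(1, 201, 5))  # one first timestamp's 5-frame window
# Running the encoder at every first timestamp and concatenating the outputs
# in time order yields the spectral feature vector.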
Step S103, extracting image areas where the lips are located in the video data, and determining video feature vectors of the video data according to each image area.
The image area where the lips are located can be located in the video data by a target detection algorithm; the multi-frame lip image areas are then extracted from the video data, and feature extraction is performed on each image area according to the time information of the frame in which the lips were located, which determines the video feature vector of the video data. The video feature vector can assist voice endpoint detection and improve its accuracy.
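As one hedged illustration of locating the lip area (the application only says "a target detection algorithm"), the following Python sketch detects a face with OpenCV's Haar cascade and takes the lower third of the face box as the lip region; that lower-third heuristic is an assumption for illustration, not the application's method.

import cv2

face_detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def lip_region(frame):
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    faces = face_detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(faces) == 0:
        return None  # a segment may contain no image area, as noted below
    x, y, w, h = faces[0]
    return frame[y + 2 * h // 3 : y + h, x : x + w]  # lower third of the face box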
In one embodiment, the feature vector corresponding to each third timestamp is determined according to the plurality of third timestamps of the video data; convolution pooling is performed on the feature vector corresponding to each third timestamp; and the feature vectors corresponding to the third timestamps after convolution pooling are spliced to obtain the video feature vector of the video data. It should be noted that convolution pooling comprises convolution processing and/or pooling processing; the video data includes a plurality of third timestamps, each corresponding to a piece of video data, and a piece of video data may contain a plurality of image areas or may be empty (contain no image area). A feature vector is generated for each third timestamp from its corresponding image areas, and the feature vectors are convolution-pooled and spliced in order, so the video feature vector of the video data can be accurately obtained and used to assist voice endpoint detection.
In one embodiment, a plurality of third timestamps of the video data are determined, together with the plurality of image areas corresponding to each third timestamp, and the feature vector corresponding to each third timestamp is determined from the feature parameters of its image areas. The feature parameters of an image area may include one or more of the number of horizontal-axis pixels, the number of vertical-axis pixels, the number of channels, and the frame rate; for example, the horizontal-axis and vertical-axis pixel counts may be 800 × 1000, the number of RGB channels is 3, and the frame rate is the frequency, in frames, at which the image area appears continuously.
Illustratively, the third timestamp may be set to correspond to the image areas within 2 frames before and after the current time, i.e. the third timestamp at time t corresponds to the image areas extracted in the time range [t-2, t+2]. The feature parameters of the image areas in that range are extracted to obtain the sub-video feature vector v_t = conv_v3(conv_v2(conv_v1(X_v))), and the sub-video feature vectors corresponding to each third timestamp of the video data are spliced to obtain the video feature vector.
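A minimal PyTorch sketch of the stacked convolutions v_t = conv_v3(conv_v2(conv_v1(X_v))) follows. Stacking the 5 RGB lip crops of [t-2, t+2] on the channel axis (15 input channels), the 96 × 96 crop size, the channel widths and the final pooling/flatten are all assumptions added so the sketch produces a fixed-length vector.

import torch
import torch.nn as nn

video_encoder = nn.Sequential(
    nn.Conv2d(15, 32, kernel_size=5, stride=2),  # conv_v1
    nn.Conv2d(32, 64, kernel_size=3, stride=2),  # conv_v2
    nn.Conv2d(64, 64, kernel_size=3, stride=2),  # conv_v3
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
)

x_v = torch.randn(1, 15, 96, 96)  # 5 stacked 96x96 RGB lip crops
v_t = video_encoder(x_v)          # sub-video feature vector for one third timestamp
# Splicing v_t over all third timestamps yields the video feature vector.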
Step S104, determining the phase difference between the reflected wave signal and a preset transmitted wave signal, and determining the reflected wave vector of the reflected wave signal according to the phase difference.
A preset transmitted wave signal, i.e. the sound wave emitted by the speaker, is reflected back by the lips and received by the microphone as the reflected wave signal. The phase difference between the reflected wave signal and the preset transmitted wave signal is determined by signal processing, and the movement of the lips while the speaker produces sound can be identified from this phase difference. Feature extraction on the phase difference yields the reflected wave vector of the reflected wave signal, which can assist voice endpoint detection and improve its accuracy.
In one embodiment, the hearing limit of an ordinary person is about 17 kHz, and most smart devices can emit sound waves of at most 23 kHz, so the preset transmitted wave signal can be located in the 17-23 kHz band, and the phase difference between the reflected wave signal and the preset transmitted wave signal is detected by sound wave coupling detection.
For example, assume the transmitted wave signal is A·cos(2πft) and the reflected wave signal is R_p = A_p·cos(2πft - θ_p), where A_p is the amplitude of the reflected wave, θ_p is the phase difference, and f is the frequency; A and A_p may be equal, although after reflection A_p may be changed slightly by environmental influences.
In one embodiment, the reflected wave signal is coupled with a preset parameter factor; the coupled reflected wave signal is filtered by a low-pass filter; and the phase difference between the reflected wave signal and the preset transmitted wave signal is calculated from the filtered reflected wave signal. For example, if the preset parameter factor is cos(2πft), the coupled reflected wave signal is

R_p·cos(2πft) = A_p·cos(2πft - θ_p)·cos(2πft) = (A_p/2)·cos(θ_p) + (A_p/2)·cos(4πft - θ_p)

Filtering the coupled signal with a low-pass filter that blocks the high-frequency term leaves the filtered reflected wave signal

(A_p/2)·cos(θ_p)

From the formula of the filtered reflected wave signal, the phase difference θ_p between the source wave signal and the reflected wave signal can be calculated directly.
In one embodiment, determining a reflected wave vector of the reflected wave signal according to the phase difference includes: determining a phase difference corresponding to each second time stamp according to the plurality of second time stamps of the reflected wave signals; determining the difference value between the phase differences corresponding to every two adjacent second timestamps according to the phase difference corresponding to every second timestamp to obtain a plurality of phase difference values; determining a phase difference vector corresponding to the reflected wave signal according to the plurality of phase difference values; and inputting the phase difference vector corresponding to the reflected wave signal into a first preset neural network to obtain the reflected wave vector corresponding to the reflected wave signal.
It should be noted that the reflected wave signal includes a plurality of second timestamps, each corresponding to a phase difference. The phase difference value between the phase differences of every two adjacent second timestamps is calculated, the plurality of phase difference values are concatenated to obtain the phase difference vector corresponding to the reflected wave signal, and the phase difference vector is input into a first preset neural network (for example, a convolutional network plus an LSTM network) to obtain the reflected wave vector corresponding to the reflected wave signal, which can assist voice endpoint detection.
Illustratively, the phase difference corresponding to the second timestamp at the current time t is θ_p and the phase difference corresponding to the second timestamp of the previous frame is θ_{p-1}; subtracting the previous frame's θ_{p-1} from the current frame's θ_p gives the phase difference value ΔΦ_p. The k-th order derivatives of ΔΦ_p, ΔΦ_p ∈ R^k, are taken as the phase difference vector corresponding to the second timestamp, and ΔΦ_p ∈ R^k is input into a convolution network and an LSTM network to obtain the reflected wave vector corresponding to the second timestamp at the current time t: r_t = Conv_r2(Conv_r1(ΔΦ_p)). The sub-reflected-wave vectors corresponding to each second timestamp of the reflected wave signal are spliced to obtain the reflected wave vector.
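A minimal PyTorch sketch of r_t = Conv_r2(Conv_r1(ΔΦ_p)) is shown below; k (the length of the phase difference vector) and the channel sizes are assumptions.

import torch
import torch.nn as nn

k = 8
reflect_encoder = nn.Sequential(
    nn.Conv1d(1, 16, kernel_size=3, padding=1),   # Conv_r1
    nn.Conv1d(16, 32, kernel_size=3, padding=1),  # Conv_r2
    nn.Flatten(),
)

theta = torch.randn(100)              # per-frame phase differences theta_p
delta_phi = theta[1:] - theta[:-1]    # frame-to-frame phase difference values
window = delta_phi[:k].view(1, 1, k)  # phase difference vector for one second timestamp
r_t = reflect_encoder(window)         # sub-reflected-wave vector, spliced over timestamps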
Step S105, fusing the spectral feature vector, the video feature vector and the reflected wave vector to obtain a target feature vector.
The spectral feature vector, the video feature vector and the reflected wave vector are fused to obtain a higher-dimensional target feature vector; by fusing the features of the three modal data, the voice endpoints of the audio signal can be detected more accurately.
In one embodiment, the spectral feature vector, the video feature vector and the reflected wave vector are merged to obtain a merged feature vector, and the merged feature vector is input to a second preset neural network to obtain the target feature vector. Illustratively, the spectral feature vector a_i, the video feature vector v_i and the reflected wave vector r_i are merged into the merged feature vector h_i = concat(a_i, v_i, r_i); if the second preset neural network is a recurrent LSTM network followed by one or more fully-connected layers, inputting h_i into it gives the target feature vector Z = FC(LSTM(h_i)). By merging the three modal data and integrating the temporal characteristics through the second preset neural network, the features of the preceding and following time ranges are taken into account, which improves the stability and accuracy of voice endpoint recognition.
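The following PyTorch sketch realizes h_i = concat(a_i, v_i, r_i) followed by Z = FC(LSTM(h_i)); all of the dimensions are assumptions chosen to match the earlier sketches.

import torch
import torch.nn as nn

class Fusion(nn.Module):
    def __init__(self, d_a=32, d_v=64, d_r=32, d_hidden=128, d_out=64):
        super().__init__()
        self.lstm = nn.LSTM(d_a + d_v + d_r, d_hidden, batch_first=True)
        self.fc = nn.Linear(d_hidden, d_out)

    def forward(self, a, v, r):
        # a, v, r: (batch, time, feature) sequences, time-aligned per timestamp
        h = torch.cat([a, v, r], dim=-1)  # h_i = concat(a_i, v_i, r_i)
        out, _ = self.lstm(h)
        return self.fc(out)               # target feature vector Z per timestamp

Z = Fusion()(torch.randn(1, 50, 32), torch.randn(1, 50, 64), torch.randn(1, 50, 32))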
Step S106, inputting the target feature vector into a pre-trained voice endpoint detection model to obtain a plurality of voice endpoints of the audio signal.
Voice endpoint detection is performed on the target feature vector by the pre-trained voice endpoint detection model, which outputs the plurality of voice endpoints of the audio signal; these voice endpoints distinguish the speech segments from the non-speech segments of the audio signal. This greatly improves the noise robustness of voice endpoint detection and effectively improves detection accuracy in complex environments.
In one embodiment, a first voice endpoint detection model is trained on a plurality of labeled audio signals to initialize the parameters of the first voice endpoint detection model; a second voice endpoint detection model is trained on a plurality of labeled video data to initialize the parameters of the second voice endpoint detection model; and a third voice endpoint detection model is trained on a plurality of labeled reflected wave signals to initialize the parameters of the third voice endpoint detection model. The initialized first, second and third voice endpoint detection models are fused to obtain a target voice endpoint detection model, and the target voice endpoint detection model is optimized with a back-propagation algorithm, using categorical cross-entropy as the objective function, to obtain the trained multi-modal model.
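A minimal training sketch for the final optimization step follows: categorical cross-entropy as the objective, optimized by back-propagation. It reuses the Fusion class from the sketch above; the data loader and the per-frame speech/non-speech labels are hypothetical placeholders, not part of this application.

import torch
import torch.nn as nn

class EndpointClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.fusion = Fusion()        # from the fusion sketch above
        self.head = nn.Linear(64, 2)  # speech vs. non-speech per timestamp

    def forward(self, a, v, r):
        return self.head(self.fusion(a, v, r))

model = EndpointClassifier()
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

for a, v, r, labels in loader:  # 'loader' is a hypothetical source of aligned batches
    loss = criterion(model(a, v, r).reshape(-1, 2), labels.reshape(-1))
    optimizer.zero_grad()
    loss.backward()             # back-propagation through all three branches
    optimizer.step()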
The pre-trained voice endpoint detection model may be a regression classifier, a bayesian network, a rule-based classifier, a neural network, or the like, which is not specifically limited in the embodiment of the present application.
In an embodiment, the audio signal is smoothed by a preset smoothing layer to obtain a filtered audio signal. It should be noted that many abrupt changes and discontinuities occur during voice endpoint detection; smoothing the result, for example by adjusting the front offset (FEC) and the tail offset (OVER) of a detected speech segment, eliminating segments of continuous speech wrongly detected as non-speech (MSC), and eliminating segments of non-speech wrongly recognized as speech (NDS), improves the endpoint detection effect of the audio signal and the user experience.
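In the spirit of the MSC/NDS corrections above, the following Python sketch smooths per-frame speech/non-speech decisions by flipping any run shorter than a minimum length; the threshold is an assumed value.

import numpy as np

def smooth_decisions(dec, min_run=5):
    # dec: array of 0/1 per-frame decisions (1 = speech)
    out = dec.copy()
    start = 0
    for i in range(1, len(dec) + 1):
        if i == len(dec) or dec[i] != dec[start]:
            if i - start < min_run:           # run too short: flip it
                out[start:i] = 1 - dec[start]
            start = i
    return out

print(smooth_decisions(np.array([1, 1, 0, 1, 1, 1, 1, 1, 0, 0])))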
Referring to fig. 3, fig. 3 is a schematic view of a scene for implementing the voice endpoint detection method provided in this embodiment.
As shown in fig. 3, the server obtains the audio signal to be detected together with the video data and the reflected wave signal collected while the audio signal was collected; determines the spectral data of the audio signal and derives the spectral feature vector from it; extracts the image areas where the lips are located in the video data and derives the video feature vector from each image area; determines the phase difference between the reflected wave signal and the preset transmitted wave signal and derives the reflected wave vector from it; fuses the spectral feature vector, the video feature vector and the reflected wave vector into the target feature vector; and inputs the target feature vector into the pre-trained voice endpoint detection model to obtain the plurality of voice endpoints of the audio signal.
The voice endpoint detection method provided by the above embodiments obtains the audio signal to be detected together with the video data and the reflected wave signal collected while the audio signal is collected, derives a spectral feature vector, a video feature vector and a reflected wave vector from the three modalities as described above, fuses them into a target feature vector, and inputs the target feature vector into a pre-trained voice endpoint detection model to obtain a plurality of voice endpoints of the audio signal. Assisting voice endpoint detection with the different modal data greatly improves its noise robustness and effectively improves its detection accuracy in complex environments.
Referring to fig. 4, fig. 4 is a schematic block diagram of a voice endpoint detection apparatus according to an embodiment of the present disclosure.
As shown in fig. 4, the apparatus 200 for detecting a voice endpoint includes: an acquisition module 201, a first determination module 202, a second determination module 203, a third determination module 204, a fusion module 205, and a detection module 206.
An obtaining module 201, configured to obtain an audio signal to be detected, together with the video data and the reflected wave signal collected while the audio signal was collected, where the reflected wave signal is obtained by performing sound wave detection on the user's lips;
A first determining module 202, configured to determine spectral data of the audio signal, and determine a spectral feature vector of the audio signal according to the spectral data;
A second determining module 203, configured to extract the image areas where the lips are located in the video data, and determine a video feature vector of the video data according to each image area;
A third determining module 204, configured to determine a phase difference between the reflected wave signal and a preset transmitted wave signal, and determine a reflected wave vector of the reflected wave signal according to the phase difference;
A fusion module 205, configured to fuse the spectral feature vector, the video feature vector and the reflected wave vector to obtain a target feature vector;
the detection module 206 is configured to input the target feature vector into a pre-trained voice endpoint detection model to obtain a plurality of voice endpoints of the audio signal.
In one embodiment, as shown in FIG. 5, the first determining module 202 includes:
a determining submodule 2021, configured to determine, according to a plurality of first time stamps of the spectrum data, a feature vector corresponding to each of the first time stamps;
the processing submodule 2022 is configured to perform convolution pooling on the feature vector corresponding to each of the first timestamps;
the splicing submodule 2023 is configured to splice the feature vector corresponding to each of the first timestamps after the convolution pooling processing, so as to obtain a spectrum feature vector corresponding to the spectrum data.
In one embodiment, the determination sub-module 2021 is further configured to:
determining a plurality of first time stamps of the spectrum data, and determining a plurality of frames of spectrum data corresponding to each first time stamp;
and determining the feature vector corresponding to each first timestamp according to the feature parameters of the multi-frame spectrum data corresponding to each first timestamp.
In one embodiment, the third determination module 204 is further configured to:
coupling the reflected wave signal with a preset parameter factor;
filtering the coupled reflected wave signal through a low-pass filter;
and calculating the phase difference between the reflected wave signal and a preset transmitted wave signal according to the filtered reflected wave signal.
In one embodiment, the third determination module 204 is further configured to:
determining a phase difference corresponding to each second timestamp according to a plurality of second timestamps of the reflected wave signals;
determining a difference value between the phase differences corresponding to each two adjacent second timestamps according to the phase difference corresponding to each second timestamp to obtain a plurality of phase difference values;
determining a phase difference vector corresponding to the reflected wave signal according to the plurality of phase difference values;
and inputting the phase difference vector corresponding to the reflected wave signal into a first preset neural network to obtain the reflected wave vector corresponding to the reflected wave signal.
In one embodiment, the fusion module 205 is further configured to:
merging the spectral feature vector, the video feature vector and the reflected wave vector to obtain a merged feature vector;
and inputting the merged feature vector to a second preset neural network to obtain a target feature vector.
In one embodiment, the voice endpoint detection apparatus 200 is further configured to:
and carrying out smooth filtering processing on the audio signal through a preset smooth layer to obtain a filtered audio signal.
It should be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the apparatus and the modules and units described above may refer to the corresponding processes in the foregoing embodiment of the voice endpoint detection method, and are not described herein again.
The apparatus provided by the above embodiments may be implemented in the form of a computer program, which can be run on a computer device as shown in fig. 6.
Referring to fig. 6, fig. 6 is a schematic block diagram illustrating a structure of a computer device according to an embodiment of the present disclosure. The computer device may be a server or a terminal device.
As shown in fig. 6, the computer device includes a processor, a memory, and a network interface connected by a system bus, wherein the memory may include a nonvolatile storage medium and an internal memory.
The non-volatile storage medium may store an operating system and a computer program. The computer program includes program instructions that, when executed, cause a processor to perform any of the methods of voice endpoint detection.
The processor is used for providing calculation and control capability and supporting the operation of the whole computer equipment.
The internal memory provides an environment for the execution of a computer program on a non-volatile storage medium, which when executed by a processor, causes the processor to perform any of the methods for voice endpoint detection.
The network interface is used for network communication, such as sending assigned tasks. Those skilled in the art will appreciate that the architecture shown in fig. 6 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
It should be understood that the Processor may be a Central Processing Unit (CPU), and the Processor may be other general purpose processors, Digital Signal Processors (DSPs), Application Specific Integrated Circuits (ASICs), Field Programmable Gate Arrays (FPGAs) or other Programmable logic devices, discrete Gate or transistor logic devices, discrete hardware components, etc. Wherein a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
Wherein, in one embodiment, the processor is configured to execute a computer program stored in the memory to implement the steps of:
acquiring an audio signal to be detected, together with the video data and the reflected wave signal collected while the audio signal was collected, wherein the reflected wave signal is obtained by performing sound wave detection on the user's lips;
determining spectral data of the audio signal, and determining a spectral feature vector of the audio signal according to the spectral data;
extracting the image areas where the lips are located in the video data, and determining a video feature vector of the video data according to each image area;
determining a phase difference between the reflected wave signal and a preset transmitted wave signal, and determining a reflected wave vector of the reflected wave signal according to the phase difference;
fusing the spectral feature vector, the video feature vector and the reflected wave vector to obtain a target feature vector;
and inputting the target feature vector into a pre-trained voice endpoint detection model to obtain a plurality of voice endpoints of the audio signal.
In one embodiment, when implementing the determining of the spectral feature vector of the audio signal according to the spectral data, the processor is configured to implement:
determining a feature vector corresponding to each first time stamp according to the plurality of first time stamps of the frequency spectrum data;
performing convolution pooling on the feature vector corresponding to each first timestamp;
and splicing the feature vectors corresponding to the first timestamps after convolution pooling to obtain the spectral feature vector corresponding to the spectrum data.
In one embodiment, the processor, when implementing the determining of the feature vector corresponding to each of the first timestamps according to the plurality of first timestamps of the spectrum data, is configured to implement:
determining a plurality of first time stamps of the spectrum data, and determining a plurality of frames of spectrum data corresponding to each first time stamp;
and determining the feature vector corresponding to each first timestamp according to the feature parameters of the multi-frame spectrum data corresponding to each first timestamp.
In one embodiment, the processor, in performing the determining the phase difference between the reflected wave signal and a preset transmitted wave signal, is configured to perform:
coupling the reflected wave signal with a preset parameter factor;
filtering the coupled reflected wave signal through a low-pass filter;
and calculating the phase difference between the reflected wave signal and a preset transmitted wave signal according to the filtered reflected wave signal.
In one embodiment, the processor, in implementing the determining of the reflected wave vector of the reflected wave signal according to the phase difference, is configured to implement:
determining a phase difference corresponding to each second timestamp according to a plurality of second timestamps of the reflected wave signals;
determining a difference value between the phase differences corresponding to each two adjacent second timestamps according to the phase difference corresponding to each second timestamp to obtain a plurality of phase difference values;
determining a phase difference vector corresponding to the reflected wave signal according to the plurality of phase difference values;
and inputting the phase difference vector corresponding to the reflected wave signal into a first preset neural network to obtain the reflected wave vector corresponding to the reflected wave signal.
In one embodiment, when the processor performs the fusion on the spectral feature vector, the video feature vector, and the reflected wave vector to obtain a target feature vector, the processor is configured to perform:
merging the spectral feature vector, the video feature vector and the reflected wave vector to obtain a merged feature vector;
and inputting the merged feature vector to a second preset neural network to obtain a target feature vector.
In one embodiment, the processor is further configured to implement: performing smoothing filter processing on the audio signal through a preset smoothing layer to obtain a filtered audio signal.
It should be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working process of the computer device described above may refer to the corresponding process in the foregoing embodiment of the voice endpoint detection method, and details are not described herein again.
Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, where the computer program includes program instructions, and a method implemented when the program instructions are executed may refer to the embodiments of the voice endpoint detection method in the present application.
The computer-readable storage medium may be an internal storage unit of the computer device described in the foregoing embodiment, for example, a hard disk or a memory of the computer device. The computer readable storage medium may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like provided on the computer device.
It is to be understood that the terminology used in the description of the present application herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article or system that comprises a list of elements includes not only those elements but may include other elements not expressly listed or inherent to such process, method, article or system. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other identical elements in the process, method, article or system that comprises the element.
The above-mentioned serial numbers of the embodiments of the present application are merely for description and do not represent the merits of the embodiments. While the invention has been described with reference to specific embodiments, the scope of the invention is not limited thereto, and those skilled in the art can easily conceive various equivalent modifications or substitutions within the technical scope of the invention. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A method for voice endpoint detection, comprising:
acquiring an audio signal to be detected, together with the video data and the reflected wave signal collected while the audio signal was collected, wherein the reflected wave signal is obtained by performing sound wave detection on the user's lips;
determining spectral data of the audio signal, and determining a spectral feature vector of the audio signal according to the spectral data;
extracting the image areas where the lips are located in the video data, and determining a video feature vector of the video data according to each image area;
determining a phase difference between the reflected wave signal and a preset transmitted wave signal, and determining a reflected wave vector of the reflected wave signal according to the phase difference;
fusing the spectral feature vector, the video feature vector and the reflected wave vector to obtain a target feature vector;
and inputting the target feature vector into a pre-trained voice endpoint detection model to obtain a plurality of voice endpoints of the audio signal.
2. The method of speech endpoint detection according to claim 1, wherein said determining a spectral feature vector of the audio signal from the spectral data comprises:
determining a feature vector corresponding to each first time stamp according to the plurality of first time stamps of the frequency spectrum data;
performing convolution pooling on the feature vector corresponding to each first timestamp;
and splicing the feature vectors corresponding to the first timestamps after convolution pooling to obtain the spectral feature vector corresponding to the spectrum data.
3. The method of claim 2, wherein the determining a feature vector corresponding to each of the first timestamps according to the plurality of first timestamps of the spectrum data comprises:
determining a plurality of first time stamps of the spectrum data, and determining a plurality of frames of spectrum data corresponding to each first time stamp;
and determining the feature vector corresponding to each first timestamp according to the feature parameters of the multi-frame spectrum data corresponding to each first timestamp.
4. The voice endpoint detection method of claim 1, wherein the determining a phase difference between the reflected wave signal and a predetermined transmitted wave signal comprises:
coupling the reflected wave signal with a preset parameter factor;
filtering the coupled reflected wave signal through a low-pass filter;
and calculating the phase difference between the reflected wave signal and a preset transmitted wave signal according to the filtered reflected wave signal.
5. The voice endpoint detection method according to any one of claims 1 to 4, wherein the determining a reflected wave vector of the reflected wave signal according to the phase difference comprises:
determining a phase difference corresponding to each second timestamp according to a plurality of second timestamps of the reflected wave signals;
determining a difference value between the phase differences corresponding to each two adjacent second timestamps according to the phase difference corresponding to each second timestamp to obtain a plurality of phase difference values;
determining a phase difference vector corresponding to the reflected wave signal according to the plurality of phase difference values;
and inputting the phase difference vector corresponding to the reflected wave signal into a first preset neural network to obtain the reflected wave vector corresponding to the reflected wave signal.
6. The method according to any one of claims 1-4, wherein the fusing the spectral feature vector, the video feature vector and the reflected wave vector to obtain a target feature vector comprises:
merging the spectral feature vector, the video feature vector and the reflected wave vector to obtain a merged feature vector;
and inputting the merged feature vector to a second preset neural network to obtain a target feature vector.
7. The voice endpoint detection method of any one of claims 1-4, further comprising:
performing smoothing filter processing on the audio signal through a preset smoothing layer to obtain a filtered audio signal.
8. A voice endpoint detection apparatus, comprising:
an acquisition module, configured to acquire an audio signal to be detected, and video data and a reflected wave signal collected while the audio signal is collected, wherein the reflected wave signal is obtained by performing sound wave detection on a lip of a user;
a first determining module, configured to determine spectrum data of the audio signal, and determine a spectral feature vector of the audio signal according to the spectrum data;
a second determining module, configured to extract image areas where the lip is located in the video data, and determine a video feature vector of the video data according to each image area;
a third determining module, configured to determine a phase difference between the reflected wave signal and a preset transmitted wave signal, and determine a reflected wave vector of the reflected wave signal according to the phase difference;
a fusion module, configured to fuse the spectral feature vector, the video feature vector and the reflected wave vector to obtain a target feature vector;
and a detection module, configured to input the target feature vector into a pre-trained voice endpoint detection model to obtain a plurality of voice endpoints of the audio signal.
9. A computer device, comprising a processor, a memory, and a computer program stored on the memory and executable by the processor, wherein the computer program, when executed by the processor, implements the steps of the voice endpoint detection method according to any one of claims 1 to 7.
10. A computer-readable storage medium, having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the steps of the voice endpoint detection method according to any one of claims 1 to 7.
CN202011453437.2A 2020-12-11 2020-12-11 Voice endpoint detection method, device, equipment and computer readable storage medium Pending CN112634940A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN202011453437.2A CN112634940A (en) 2020-12-11 2020-12-11 Voice endpoint detection method, device, equipment and computer readable storage medium
PCT/CN2021/084296 WO2022121182A1 (en) 2020-12-11 2021-03-31 Voice activity detection method and apparatus, and device and computer-readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011453437.2A CN112634940A (en) 2020-12-11 2020-12-11 Voice endpoint detection method, device, equipment and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN112634940A (en) 2021-04-09

Family

ID=75309804

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011453437.2A Pending CN112634940A (en) 2020-12-11 2020-12-11 Voice endpoint detection method, device, equipment and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN112634940A (en)
WO (1) WO2022121182A1 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113255556A (en) * 2021-06-07 2021-08-13 斑马网络技术有限公司 Multi-mode voice endpoint detection method and device, vehicle-mounted terminal and storage medium
CN113380236A (en) * 2021-06-07 2021-09-10 斑马网络技术有限公司 Voice endpoint detection method and device based on lip, vehicle-mounted terminal and storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020116197A1 (en) * 2000-10-02 2002-08-22 Gamze Erten Audio visual speech processing
CN106328141B (en) * 2016-09-05 2019-06-14 南京大学 A kind of the ultrasonic wave labiomaney identification device and method of facing moving terminal
CN110875060A (en) * 2018-08-31 2020-03-10 阿里巴巴集团控股有限公司 Voice signal processing method, device, system, equipment and storage medium
CN111768760B (en) * 2020-05-26 2023-04-18 云知声智能科技股份有限公司 Multi-mode voice endpoint detection method and device
CN111916061B (en) * 2020-07-22 2024-05-07 北京地平线机器人技术研发有限公司 Voice endpoint detection method and device, readable storage medium and electronic equipment

Also Published As

Publication number Publication date
WO2022121182A1 (en) 2022-06-16

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination