WO2021139425A1 - Voice endpoint detection method, apparatus, device, and storage medium

语音端点检测方法、装置、设备及存储介质 (Voice endpoint detection method, apparatus, device, and storage medium)

Info

Publication number
WO2021139425A1
WO2021139425A1 (PCT/CN2020/131693)
Authority
WO
WIPO (PCT)
Prior art keywords
voice
information
feature information
target speaker
feature
Prior art date
Application number
PCT/CN2020/131693
Other languages
English (en)
French (fr)
Inventor
张之勇
王健宗
贾雪丽
程宁
Original Assignee
平安科技(深圳)有限公司
Application filed by 平安科技(深圳)有限公司 (Ping An Technology (Shenzhen) Co., Ltd.)
Publication of WO2021139425A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L 25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L 25/78 Detection of presence or absence of voice signals
    • G10L 25/87 Detection of discrete points within a voice signal
    • G10L 25/03 Characterised by the type of extracted parameters
    • G10L 25/18 The extracted parameters being spectral information of each sub-band
    • G10L 25/24 The extracted parameters being the cepstrum
    • G10L 25/27 Characterised by the analysis technique
    • G10L 25/30 Analysis technique using neural networks

Definitions

  • This application relates to the field of speech signal processing, and in particular to a voice endpoint detection method, apparatus, device, and storage medium.
  • Voice activity detection (VAD) is an important part of speech signal processing. Its purpose is to distinguish the speech and non-speech parts of a continuous audio stream: by accurately locating the starting point of each speech segment, non-speech noise fragments can be filtered out effectively, so that the speech stream can be processed more efficiently. VAD has been widely used in speech recognition, speaker separation and recognition, and other auxiliary tasks such as emotion recognition, gender recognition, and language recognition.
  • In general, endpoint detection is relatively easy under low-noise conditions, and traditional detection methods based on energy or spectral entropy can achieve high detection accuracy. Under high-noise conditions, however, the difficulty of endpoint detection increases significantly.
  • Detection methods based on harmonic rules exploit the harmonic characteristics of the human voice to distinguish speech from non-speech segments effectively. They are robust in high-noise scenes and have been widely used in speech signal processing systems, but background noise that also has harmonic characteristics, such as music, coughing, and car horns, inevitably causes harmonic-rule endpoint detection to produce many misidentifications.
  • In recent years, with the great success of deep neural network (DNN) techniques in the signal processing field, DNN-based endpoint detection algorithms have become a research hotspot. Because it is difficult to obtain precise speech recognition alignment information, DNN-based endpoint detection retains some ambiguity, and background noise without harmonic characteristics may still be misidentified as speech. As a result, traditional voice endpoint detection algorithms cannot distinguish the target speaker from non-target speakers, which leads to low voice endpoint detection accuracy.
  • The main purpose of this application is to solve the problem that traditional voice endpoint detection algorithms cannot distinguish between target speakers and non-target speakers, resulting in low accuracy of voice endpoint detection.
  • The first aspect of the present application provides a voice endpoint detection method, including: acquiring voice information to be recognized and preprocessing it to obtain preprocessed voice information; extracting frame-level speech spectral feature information from the preprocessed voice information; performing feature processing on the preprocessed voice information to obtain acoustic feature information of the target speaker; performing feature fusion on the speech spectral feature information and the acoustic feature information to obtain fused voice feature information, where the fused voice feature information is segment-level or sentence-level feature information; and inputting the fused voice feature information into a trained deep neural network model for voice endpoint detection to obtain a detection result, and determining the target speaker's voice type, the non-target speaker's voice type, and the background noise type according to the detection result.
  • The second aspect of the present application provides a voice endpoint detection device, including a memory, a processor, and computer-readable instructions stored in the memory and runnable on the processor. When the processor executes the computer-readable instructions, the following steps are implemented: acquiring voice information to be recognized and preprocessing it to obtain preprocessed voice information; extracting frame-level speech spectral feature information from the preprocessed voice information; performing feature processing on the preprocessed voice information to obtain acoustic feature information of the target speaker; performing feature fusion on the speech spectral feature information and the acoustic feature information to obtain fused voice feature information, where the fused voice feature information is segment-level or sentence-level feature information; and inputting the fused voice feature information into a trained deep neural network model for voice endpoint detection to obtain a detection result, and determining the target speaker's voice type, the non-target speaker's voice type, and the background noise type according to the detection result.
  • The third aspect of the present application provides a computer-readable storage medium storing computer instructions. When the computer instructions run on a computer, the computer executes the following steps: acquiring voice information to be recognized and preprocessing it to obtain preprocessed voice information; extracting frame-level speech spectral feature information from the preprocessed voice information; performing feature processing on the preprocessed voice information to obtain acoustic feature information of the target speaker; performing feature fusion on the speech spectral feature information and the acoustic feature information to obtain fused voice feature information, where the fused voice feature information is segment-level or sentence-level feature information; and inputting the fused voice feature information into a trained deep neural network model for voice endpoint detection to obtain a detection result, and determining the target speaker's voice type, the non-target speaker's voice type, and the background noise type according to the detection result.
  • The fourth aspect of the present application provides a voice endpoint detection apparatus, including: a preprocessing module, configured to obtain voice information to be recognized and preprocess it to obtain preprocessed voice information; an extraction module, configured to extract frame-level speech spectral feature information from the preprocessed voice information; a processing module, configured to perform feature processing on the preprocessed voice information to obtain acoustic feature information of the target speaker; a fusion module, configured to perform feature fusion on the speech spectral feature information and the acoustic feature information to obtain fused voice feature information, where the fused voice feature information is segment-level or sentence-level feature information; and a detection module, configured to input the fused voice feature information into a trained deep neural network model for voice endpoint detection, obtain a detection result, and determine the target speaker's voice type, the non-target speaker's voice type, and the background noise type according to the detection result.
  • In this application, voice endpoint detection is performed on the voice information through a deep neural network model, and the target speaker's acoustic feature information is reinforced with speech spectral feature information based on auditory perception characteristics. This improves the accuracy of detecting the target speaker's voice, reduces interference from other speakers' voices or background noise, and prevents business-logic problems caused by other speakers' voices or non-speech background noise, so that the downstream voice processing system only needs to process the target speaker's voice segments, which reduces computational load and improves its response speed.
  • FIG. 1 is a schematic diagram of an embodiment of the voice endpoint detection method in an embodiment of this application;
  • FIG. 2 is a schematic diagram of another embodiment of the voice endpoint detection method in an embodiment of this application;
  • FIG. 3 is a schematic diagram of an embodiment of the voice endpoint detection apparatus in an embodiment of this application;
  • FIG. 4 is a schematic diagram of another embodiment of the voice endpoint detection apparatus in an embodiment of this application;
  • FIG. 5 is a schematic diagram of an embodiment of the voice endpoint detection device in an embodiment of this application.
  • The embodiments of this application provide a voice endpoint detection method, apparatus, device, and storage medium, which perform voice endpoint detection on voice information through a deep neural network model and reinforce the target speaker's acoustic feature information with speech spectral feature information based on auditory perception characteristics, improving the accuracy of detecting the target speaker's voice.
  • For ease of understanding, an embodiment of the voice endpoint detection method in the embodiments of the present application includes:
  • The voice information to be recognized may be real-time voice information or non-real-time voice information (pre-recorded audio). The server can receive the voice information to be recognized or read it from a preset file path. The server then preprocesses the voice information to be recognized; further, the server improves its signal-to-noise ratio to enhance the voice information. The enhanced voice information is split into frames to obtain multiple voice frames, and the frames are windowed so that the beginning and end of each frame become smoother, yielding the preprocessed voice information and avoiding the high-frequency artifacts caused by abrupt truncation. For example, the server applies a Hamming window or a rectangular window to the voice frames, as sketched below.
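  • The following is a minimal illustrative sketch (not part of the original disclosure) of the framing and Hamming-window step; the sampling rate, frame length, and frame shift are assumed values, and only numpy is assumed to be available.

    import numpy as np

    def frame_and_window(signal, sample_rate=16000, frame_ms=25, shift_ms=10):
        # Split a 1-D signal into overlapping frames and apply a Hamming window
        # so that both ends of every frame taper smoothly toward zero.
        frame_len = int(sample_rate * frame_ms / 1000)     # samples per frame
        frame_shift = int(sample_rate * shift_ms / 1000)   # samples between frame starts
        num_frames = 1 + max(0, (len(signal) - frame_len) // frame_shift)
        window = np.hamming(frame_len)
        frames = np.stack([
            signal[i * frame_shift : i * frame_shift + frame_len] * window
            for i in range(num_frames)
        ])
        return frames                                      # shape: (num_frames, frame_len)

    # One second of audio at 16 kHz yields 98 windowed frames of 400 samples each.
    frames = frame_and_window(np.zeros(16000))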
  • It can be understood that the execution subject of this application may be a voice endpoint detection apparatus, a terminal, or a server, which is not specifically limited here. The embodiments of this application are described with a server as the execution subject.
  • That is, the server extracts discriminative features from the preprocessed voice information and discards other information, such as background noise or emotion. The speech spectral feature information includes Mel-frequency cepstral coefficient (MFCC) features and filter-bank (fbank) features; the server may also collect other spectral features, which are not specifically limited here.
  • Further, the server performs a fast Fourier transform (FFT) on the preprocessed voice information (the windowed voice frames) and filters the result with a Mel filter bank to obtain 40-dimensional fbank features; the server can then apply a discrete cosine transform (DCT) to the 40-dimensional fbank features, that is, map them to a low-dimensional space (reduced from 40 dimensions to 13 dimensions), to obtain the MFCC features.
  • It should be noted that MFCC features are computed on top of the fbank features, so MFCC is more expensive to compute; fbank features are more correlated (adjacent filters overlap), while MFCC features are more discriminative. The server can also add differential features that characterize the dynamic properties of speech, which can improve the recognition performance of the system. For example, the server uses the first-order and second-order difference features of the MFCC features, or of the fbank features, which are not specifically limited here; a small feature-extraction sketch follows.
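  • The following is an illustrative sketch only (not from the original disclosure) of extracting log-fbank, MFCC, and delta features; it assumes the librosa library is available, and the FFT size, hop length, and feature dimensions are assumed values chosen to match the description above.

    import numpy as np
    import librosa

    # Placeholder one-second signal at 16 kHz; in practice this is the preprocessed audio.
    y = np.random.randn(16000).astype(np.float32)
    sr = 16000

    # 40-dimensional log Mel filter-bank (fbank) features, one row per frame.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=512, hop_length=160, n_mels=40)
    fbank = np.log(mel + 1e-10).T                       # shape: (frames, 40)

    # 13-dimensional MFCC features obtained by a DCT on the log Mel spectrum.
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=512, hop_length=160).T

    # First- and second-order differences capturing the dynamics of the speech.
    delta1 = librosa.feature.delta(mfcc.T).T
    delta2 = librosa.feature.delta(mfcc.T, order=2).T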
  • If the voice information to be recognized is pre-recorded, the server can use a preset trained network model for the feature processing; for example, the preset trained network model can be a Gaussian mixture model with a universal background model (GMM-UBM), an i-vector network model, or an x-vector network model, selected according to the business scenario, which is not specifically limited here.
  • Further, the server uses the preset trained network model to perform segment-level speaker feature extraction, obtains the target speaker's acoustic feature information, and stores it in the database. During the model training stage, the server extracts target speaker features from a speech segment of a preset number of frames, compares them with the target speaker's acoustic feature information in the preset database to obtain a similarity score, and uses the similarity score as an input parameter for the subsequent voice endpoint detection, for example as sketched below.
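  • A minimal sketch of such a similarity score, assuming cosine similarity between embeddings (the embedding dimensionality is an assumption; the original disclosure does not fix the scoring function):

    import numpy as np

    def similarity_score(segment_embedding, enrolled_embedding):
        # Cosine similarity between a segment-level speaker embedding and the
        # enrolled target-speaker embedding; higher means more likely the target.
        a = segment_embedding / (np.linalg.norm(segment_embedding) + 1e-10)
        b = enrolled_embedding / (np.linalg.norm(enrolled_embedding) + 1e-10)
        return float(np.dot(a, b))

    # Hypothetical 256-dimensional embeddings.
    score = similarity_score(np.random.randn(256), np.random.randn(256))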
  • If the voice information to be recognized is collected in real time, the server uses a d-vector network model for frame-level speaker feature extraction. Because frame-level features are unstable, the server can work with a sliding window, aggregating the frame-level speaker features inside the window and outputting the target speaker's acoustic feature information.
  • Further, the server splices the speech spectral feature information and the acoustic feature information at the frame level to obtain segment-level or sentence-level speaker feature information and sets it as the fused voice feature information; the fused voice feature information is segment-level or sentence-level feature information. That is, the server concatenates the target speaker's acoustic feature information (for example, i-vector, x-vector, or d-vector feature information) onto each frame of speech spectral feature information to obtain the fused voice feature information, as sketched below. The fused voice feature information is the input parameter of the trained deep neural network model.
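  • An illustrative sketch of the fusion step under the assumption that it is a simple per-frame concatenation (the feature dimensions are assumed values):

    import numpy as np

    def fuse_features(spectral_frames, speaker_embedding):
        # Append the fixed speaker embedding to every frame of spectral features,
        # giving the fused per-frame input for the endpoint-detection network.
        num_frames = spectral_frames.shape[0]
        tiled = np.tile(speaker_embedding, (num_frames, 1))   # repeat once per frame
        return np.concatenate([spectral_frames, tiled], axis=1)

    # 200 frames of 40-dim fbank plus a 128-dim speaker embedding -> (200, 168).
    fused = fuse_features(np.zeros((200, 40)), np.zeros(128))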
  • Voice endpoint detection uses a detection algorithm based on deep neural networks. The input features are the Mel-frequency cepstral coefficient (MFCC) or fbank features, with the target speaker's acoustic feature information embedded; the target speaker's acoustic feature information can be the target speaker's similarity score or the feature vector output by the hidden layers of the d-vector network. The network structure of the trained deep neural network model generally uses a long short-term memory network (LSTM), a recurrent neural network (RNN), a convolutional neural network (CNN), or a time-delay neural network (TDNN); other network structures can also be adopted, which are not specifically limited here. That is, the server inputs the fused voice feature information into the LSTM, RNN, CNN, or TDNN for frame-by-frame voice endpoint detection, and the output detection results cover the target speaker's voice type, the non-target speaker's voice type, and the background noise type; a minimal model sketch follows.
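  • The following is one possible frame-wise classifier sketch, assuming an LSTM backbone and PyTorch; the layer sizes are assumptions for illustration, not the architecture actually claimed:

    import torch
    import torch.nn as nn

    class VadLSTM(nn.Module):
        # Frame-wise 3-class classifier: target speaker / non-target speaker / noise.
        def __init__(self, input_dim=168, hidden_dim=128, num_classes=3):
            super().__init__()
            self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=2, batch_first=True)
            self.out = nn.Linear(hidden_dim, num_classes)

        def forward(self, x):                       # x: (batch, frames, input_dim)
            h, _ = self.lstm(x)
            return torch.softmax(self.out(h), dim=-1)   # per-frame posteriors

    model = VadLSTM()
    posteriors = model(torch.randn(1, 200, 168))    # shape: (1, 200, 3)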
  • The detection result indicates the posterior probability of the endpoint type of each frame of voice information. For example, 0.8, 0.5, and 0.2 may be used to identify the target speaker's voice type, the non-target speaker's voice type, and the background noise type, respectively.
  • Further, the server annotates the voice information according to the detection result, so as to obtain voice segments containing only the target speaker's voice, which facilitates subsequent use by the voice processing system.
  • For example, the server performs voice endpoint detection on a voice segment from a conference scene (as the voice information to be recognized) and then detects, in each frame of the segment, the target speaker's voice type (for example, the speech of the conference presenter), the non-target speaker's voice type (for example, the discussion of the participants), and the background noise type (for example, a ringing cell phone or the noise of doors opening and closing).
  • In the embodiments of this application, voice endpoint detection is performed on the voice information through a deep neural network model, and the target speaker's acoustic feature information is reinforced with speech spectral feature information based on auditory perception characteristics. This improves the accuracy of detecting the target speaker's voice, reduces interference from other speakers' voices or background noise, and prevents business-logic problems caused by other speakers' voices or non-speech background noise, so that the downstream voice processing system only needs to process the target speaker's voice segments, which reduces computational load and improves its response speed.
  • Referring to FIG. 2, another embodiment of the voice endpoint detection method in the embodiments of the present application includes:
  • In general, the human ear can hear sound frequencies between 20 Hz and 20 kHz, so the server sets a sampling frequency (the number of sound samples taken per second) for collecting the voice information to be recognized. The higher the sampling frequency, the better the quality of the captured voice; because the resolution of the human ear is limited, the sampling frequency does not need to be set very high.
  • Optionally, the server receives the voice information to be recognized and samples it to obtain the sampled voice information. Further, the server passes the voice information (the audio signal) through a high-pass filter, for example with a cutoff frequency of about 200 Hz, to remove the DC offset component and some low-frequency noise; although part of the voice information below 200 Hz is filtered out, this does not greatly affect recognition. The server then performs pre-emphasis, framing, and windowing on the sampled voice information in sequence to obtain the preprocessed voice information.
  • It should be noted that pre-emphasis can use a first-order finite impulse response high-pass filter to flatten the spectrum of the sampled voice information.
  • Framing converts the pre-emphasized voice information into frames of 20 to 40 milliseconds (grouping N sampling points into one observation unit), typically with a 10-millisecond shift between frames. For example, if the sampled voice information has a sampling rate of 12 kHz and the window size is 25 milliseconds, each frame contains 0.025 × 12000 = 300 sampling points; with a 10-millisecond frame shift, the first frame starts at sample 0 and the second frame starts at sample 120.
  • Windowing multiplies each frame of voice information by a window function that has non-zero values over a certain interval and is 0 elsewhere (outside the window), so that both ends of each frame attenuate to close to 0. A pre-emphasis and frame-indexing sketch follows.
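  • A minimal sketch of the pre-emphasis filter and of the frame-start arithmetic described above, assuming the 12 kHz / 25 ms / 10 ms values and a commonly used pre-emphasis coefficient of 0.97 (the coefficient is an assumption, not specified in the disclosure):

    import numpy as np

    def pre_emphasis(signal, coeff=0.97):
        # First-order FIR high-pass pre-emphasis: y[n] = x[n] - coeff * x[n-1].
        return np.append(signal[0], signal[1:] - coeff * signal[:-1])

    sample_rate, frame_ms, shift_ms = 12000, 25, 10
    frame_len = int(sample_rate * frame_ms / 1000)     # 300 samples per frame
    frame_shift = int(sample_rate * shift_ms / 1000)   # 120 samples between frame starts
    starts = [i * frame_shift for i in range(3)]       # [0, 120, 240]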
  • The speech spectral feature information is a spectral representation that matches human auditory habits; it includes MFCC and fbank features and may also include other spectral features, which are not specifically limited here.
  • Optionally, the server extracts each frame of the voice signal from the preprocessed voice information; the server performs a Fourier transform on each frame to obtain the corresponding spectrum information, that is, it transforms the time-domain signal into the signal's power spectrum (a frequency-domain signal); the server applies Mel filter bank processing to the spectrum information to obtain the filter-bank (fbank) feature information, where the Mel filter bank converts the linear natural spectrum into a Mel spectrum that reflects the characteristics of human hearing; and the server sets the fbank feature information as the frame-level speech spectral feature information.
  • Further, the server obtains the identity information corresponding to the target speaker (for example, id_001) and queries the preset database with it to obtain a query result; the server then judges whether the query result is null. If the query result is null, the server determines that the target speaker has not pre-registered voice feature information and executes step 204; if the query result is not null, the server determines that the target speaker has pre-registered voice feature information and executes step 205. For example, unique identification information (such as a globally unique identifier) may be used to represent the identity information; other information may also be used, which is not specifically limited here.
  • If the target speaker has not pre-registered voice feature information, a pre-trained d-vector network is used to perform feature processing on the preprocessed voice information to obtain the target speaker's acoustic feature information, which is d-vector feature vector information. Optionally, the server inputs the preprocessed voice information into the pre-trained d-vector network and uses a preset feature extraction network to extract frame-level speaker feature vectors from it; the server uses a preset hidden-layer network in the pre-trained d-vector network to extract activation values from the filter-bank (fbank) feature information; the server then L2-normalizes and accumulates the activation values to obtain the target speaker's acoustic feature information (the d-vector feature vector), as sketched below.
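  • A minimal sketch, assuming the "accumulate" step is an average of the L2-normalized frame activations (the aggregation choice and the 256-dimensional activation size are assumptions for illustration):

    import numpy as np

    def accumulate_dvector(frame_activations):
        # L2-normalize each frame's hidden-layer activations and average them
        # to obtain a single utterance-level d-vector for the target speaker.
        norms = np.linalg.norm(frame_activations, axis=1, keepdims=True) + 1e-10
        normalized = frame_activations / norms
        return normalized.mean(axis=0)

    d_vector = accumulate_dvector(np.random.randn(200, 256))   # shape: (256,)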
  • It should be noted that in some business scenarios the target speaker cannot be known in advance. In that case the server can set the speaker of the first speech segment as the target speaker and, during voice processing, update the target speaker's information based on each speaker's share of speaking time and an analysis of the corresponding text semantics. In addition, the number of speakers in a business scenario is limited, so a small-parameter network structure (the structure corresponding to the d-vector) is used for the speaker feature extraction network, which improves the efficiency of computing and extracting the target speaker's acoustic features.
  • If the target speaker has pre-registered voice feature information, the server queries the target speaker's acoustic feature information from the preset data table. It should be noted that in this case the server obtains the target speaker's acoustic feature information from the preset database, computes a similarity score between the frame-level speaker feature vectors and that acoustic feature information, and sets the similarity score as the target speaker's acoustic feature information. Optionally, the server obtains the target speaker's unique identification information and generates a query statement according to the preset structured query language (SQL) grammar rules, the unique identification information, and the preset data table; the server executes the query statement to obtain the preset d-vector feature information determined by the target speaker during the feature registration stage, and sets that preset d-vector feature information as the target speaker's feature information. A minimal query sketch follows.
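  • An illustrative sketch only, assuming an SQLite database; the table name, column names, and BLOB storage format are hypothetical and not specified in the original disclosure:

    import sqlite3
    import numpy as np

    def query_enrolled_dvector(db_path, speaker_id):
        # Look up the pre-registered d-vector for a speaker by unique identifier,
        # using a parameterized query rather than string concatenation.
        conn = sqlite3.connect(db_path)
        try:
            row = conn.execute(
                "SELECT dvector FROM speaker_features WHERE speaker_id = ?",
                (speaker_id,),
            ).fetchone()
        finally:
            conn.close()
        return None if row is None else np.frombuffer(row[0], dtype=np.float32)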
  • Step 206 is similar to step 104 described above and is not repeated here.
  • The trained deep neural network model may be a preset long short-term memory network plus convolutional neural network (LSTM-CNN) model, or another network model, which is not specifically limited here. Optionally, the server inputs the fused voice feature information into the preset LSTM-CNN model and performs voice endpoint detection on it through that model to obtain the detection result; the preset LSTM-CNN model is the trained deep neural network model. When the detection result is greater than or equal to a first preset threshold, the server determines that the detection result is the target speaker's voice type; when the detection result is less than the first preset threshold and greater than or equal to a second preset threshold, the server determines that it is a non-target speaker's voice type; when the detection result is less than the second preset threshold and greater than or equal to a third preset threshold, the server determines that it is the background noise type.
  • The first, second, and third preset thresholds each take decimal values between 0 and 1, for example 0.90, 0.40, and 0.10 respectively. In that case, when the detection result is greater than or equal to 0.90 (for example, 0.96), the server determines that it is the target speaker's voice type; when it is less than 0.90 and greater than or equal to 0.40 (for example, 0.67), the server determines that it is a non-target speaker's voice type; when it is less than 0.40 and greater than or equal to 0.10 (for example, 0.23), the server determines that it is the background noise type. The detection result can also be 1 or 0, which is not specifically limited here. A thresholding sketch follows.
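  • A minimal sketch of the threshold mapping described above, using the example thresholds 0.90, 0.40, and 0.10; the label strings and the handling of values below the third threshold are assumptions for illustration:

    def classify_frame(posterior, t1=0.90, t2=0.40, t3=0.10):
        # Map a frame-level posterior to an endpoint label using three thresholds.
        if posterior >= t1:
            return "target_speaker"
        if t2 <= posterior < t1:
            return "non_target_speaker"
        if t3 <= posterior < t2:
            return "background_noise"
        return "undetermined"   # below the third threshold

    print(classify_frame(0.96), classify_frame(0.67), classify_frame(0.23))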
  • Before step 201, the server obtains voice sample data and divides it into training sample data and test sample data according to a preset ratio. The server trains an initial deep neural network model on the training sample data, where cross entropy can be used as the objective function for model training. Because the target speaker's voice and non-target speakers' voices are limited by the distinguishability between speakers and account for a relatively small share of the data, the server can apply weighting to the loss function during training to balance the class differences, prevent training bias, and enhance the difference between the target speaker's voice and non-target speakers' voices; the specific weighting scheme is not limited here. The server then uses the test sample data to run predictions with the trained deep neural network model, obtains prediction results, and iteratively optimizes the model based on the prediction results to obtain the final trained deep neural network model. A weighted-loss sketch follows.
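  • An illustrative sketch of a weighted cross-entropy objective, assuming PyTorch; the per-class weights are assumed values chosen only to show how the speech classes could be up-weighted relative to background noise:

    import torch
    import torch.nn as nn

    # Up-weight the two speech classes relative to background noise to counter
    # class imbalance (order assumed: target speaker, non-target speaker, noise).
    class_weights = torch.tensor([2.0, 1.5, 1.0])
    criterion = nn.CrossEntropyLoss(weight=class_weights)

    # Raw per-frame logits (no softmax), as expected by CrossEntropyLoss.
    logits = torch.randn(8 * 200, 3, requires_grad=True)
    labels = torch.randint(0, 3, (8 * 200,))        # per-frame ground-truth classes
    loss = criterion(logits, labels)
    loss.backward()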
  • In the embodiments of this application, voice endpoint detection is performed on the voice information through a deep neural network model, and the target speaker's acoustic feature information is reinforced with speech spectral feature information based on auditory perception characteristics. This improves the accuracy of detecting the target speaker's voice, reduces interference from other speakers' voices or background noise, and prevents business-logic problems caused by other speakers' voices or non-speech background noise, so that the downstream voice processing system only needs to process the target speaker's voice segments, which reduces computational load and improves its response speed.
  • Referring to FIG. 3, an embodiment of the voice endpoint detection apparatus in the embodiments of the present application includes:
  • the preprocessing module 301 is used to obtain the voice information to be recognized, and preprocess the voice information to be recognized to obtain the preprocessed voice information;
  • the extraction module 302 is configured to extract frame-level speech frequency spectrum feature information from the preprocessed speech information
  • the processing module 303 is configured to perform feature processing on the preprocessed speech information to obtain the acoustic feature information of the target speaker;
  • the fusion module 304 is used to perform feature fusion on the voice frequency spectrum feature information and the acoustic feature information to obtain the fused voice feature information, and the fused voice feature information is feature information at the segment level or sentence level;
  • the detection module 305 is used to input the fused voice feature information into the trained deep neural network model for voice endpoint detection, obtain the detection result, and determine the target speaker's voice type, the non-target speaker's voice type, and the background noise type according to the detection result.
  • In this embodiment, voice endpoint detection is performed on the voice information through a deep neural network model, and the target speaker's acoustic feature information is reinforced with speech spectral feature information based on auditory perception characteristics. This improves the accuracy of detecting the target speaker's voice, reduces interference from other speakers' voices or background noise, and prevents business-logic problems caused by other speakers' voices or non-speech background noise, so that the downstream voice processing system only needs to process the target speaker's voice segments, which reduces computational load and improves its response speed.
  • Referring to FIG. 4, another embodiment of the voice endpoint detection apparatus in the embodiments of the present application includes:
  • the preprocessing module 301 is used to obtain the voice information to be recognized, and preprocess the voice information to be recognized to obtain the preprocessed voice information;
  • the extraction module 302 is configured to extract frame-level speech frequency spectrum feature information from the preprocessed speech information
  • the processing module 303 is configured to perform feature processing on the preprocessed speech information to obtain the acoustic feature information of the target speaker;
  • the fusion module 304 is used to perform feature fusion on the voice frequency spectrum feature information and the acoustic feature information to obtain the fused voice feature information, and the fused voice feature information is feature information at the segment level or sentence level;
  • the detection module 305 is used to input the fused voice feature information into the trained deep neural network model for voice endpoint detection, obtain the detection result, and determine the target speaker's voice type, the non-target speaker's voice type, and the background noise type according to the detection result.
  • Optionally, the preprocessing module 301 may be specifically configured to: receive the voice information to be recognized and sample it to obtain sampled voice information; and perform pre-emphasis, framing, and windowing on the sampled voice information in sequence to obtain the preprocessed voice information.
  • Further, the voice information to be recognized may be stored in a blockchain database, which is not specifically limited here.
  • Optionally, the extraction module 302 is further configured to: extract each frame of the voice signal from the preprocessed voice information; perform a Fourier transform on each frame to obtain the corresponding spectrum information; and perform Mel filter bank processing on the spectrum information to obtain the filter-bank (fbank) feature information, which is set as the frame-level speech spectral feature information.
  • The processing module 303 further includes: a judging unit 3031, configured to judge whether the target speaker has pre-registered voice feature information; a processing unit 3032, configured to, if the target speaker has not pre-registered voice feature information, use a pre-trained d-vector network to perform feature processing on the preprocessed voice information to obtain the target speaker's acoustic feature information; and a query unit 3033, configured to, if the target speaker has pre-registered voice feature information, query the target speaker's acoustic feature information from the preset data table.
  • Optionally, the processing unit 3032 may be specifically configured to: if the target speaker has not pre-registered voice feature information, input the preprocessed voice information into the pre-trained d-vector network and use a preset feature extraction network to extract frame-level speaker feature vectors from it; use the preset hidden-layer network in the pre-trained d-vector network to extract activation values from the filter-bank (fbank) feature information; and L2-normalize and accumulate the activation values to obtain the target speaker's acoustic feature information, which is d-vector feature vector information.
  • Optionally, the query unit 3033 may be specifically configured to: if the target speaker has pre-registered voice feature information, obtain the target speaker's unique identification information and generate a query statement according to the preset structured query language grammar rules, the unique identification information, and the preset data table; and execute the query statement to obtain the preset d-vector feature information determined by the target speaker during the feature registration stage, and set that preset d-vector feature information as the target speaker's feature information.
  • Optionally, the detection module 305 may be specifically configured to: input the fused voice feature information into a preset long short-term memory network plus convolutional neural network (LSTM-CNN) model and perform voice endpoint detection on the fused voice feature information through the preset LSTM-CNN model to obtain the detection result, the preset LSTM-CNN model being the trained deep neural network model; when the detection result is greater than or equal to the first preset threshold, determine that the detection result is the target speaker's voice type; when the detection result is less than the first preset threshold and greater than or equal to the second preset threshold, determine that the detection result is a non-target speaker's voice type; and when the detection result is less than the second preset threshold and greater than or equal to the third preset threshold, determine that the detection result is the background noise type.
  • In the embodiments of this application, voice endpoint detection is performed on the voice information through a deep neural network model, and the target speaker's acoustic feature information is reinforced with speech spectral feature information based on auditory perception characteristics. This improves the accuracy of detecting the target speaker's voice, reduces interference from other speakers' voices or background noise, and prevents business-logic problems caused by other speakers' voices or non-speech background noise, so that the downstream voice processing system only needs to process the target speaker's voice segments, which reduces computational load and improves its response speed.
  • FIG. 5 is a schematic structural diagram of a voice endpoint detection device provided by an embodiment of the present application.
  • The voice endpoint detection device 500 may vary considerably in configuration or performance and may include one or more processors (central processing units, CPU) 510 (for example, one or more processors), a memory 520, and one or more storage media 530 (for example, one or more mass storage devices) storing application programs 533 or data 532.
  • the memory 520 and the storage medium 530 may be short-term storage or persistent storage.
  • the program stored in the storage medium 530 may include one or more modules (not shown in the figure), and each module may include a series of instruction operations on the voice endpoint detection device 500.
  • the processor 510 may be configured to communicate with the storage medium 530, and execute a series of instruction operations in the storage medium 530 on the voice endpoint detection device 500.
  • The voice endpoint detection device 500 may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and so on.
  • Those skilled in the art can understand that the device structure shown in FIG. 5 does not constitute a limitation on the voice endpoint detection device, which may include more or fewer components than shown, a combination of certain components, or a different arrangement of components.
  • This application also provides a computer-readable storage medium, which may be a non-volatile computer-readable storage medium or a volatile computer-readable storage medium. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the steps of the voice endpoint detection method.
  • This application also provides a voice endpoint detection device, which includes a memory and a processor. The memory stores instructions that, when executed by the processor, cause the processor to execute the steps of the voice endpoint detection method in the above embodiments.
  • Further, the computer-readable storage medium may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store data created from the use of blockchain nodes, and the like.
  • The blockchain referred to in this application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated and linked using cryptographic methods, where each data block contains a batch of network transaction information used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • The aforementioned storage media include media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

Abstract

A voice endpoint detection method, apparatus, and storage medium, relating to the field of artificial intelligence and used to improve the accuracy of voice endpoint detection. The voice endpoint detection method includes: preprocessing voice information to be recognized to obtain preprocessed voice information (101); extracting frame-level speech spectral feature information from the preprocessed voice information (102); performing feature processing on the preprocessed voice information to obtain acoustic feature information of the target speaker (103); performing feature fusion on the speech spectral feature information and the acoustic feature information to obtain fused voice feature information, the fused voice feature information being segment-level or sentence-level feature information (104); and inputting the fused voice feature information into a trained deep neural network for voice endpoint detection to obtain a detection result, and determining the target speaker's voice type, the non-target speaker's voice type, and the background noise type according to the detection result (105). Blockchain technology is also involved: the voice information to be recognized can be stored in a blockchain node.

Description

语音端点检测方法、装置、设备及存储介质
本申请要求于2020年07月31日提交中国专利局、申请号为202010762893.9、发明名称为“语音端点检测方法、装置、设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在申请中。
技术领域
本申请涉及语音信号处理领域,尤其涉及一种语音端点检测方法、装置、设备及存储介质。
背景技术
语音端点检测(voice activity detection,VAD)是语音信号处理的重要组成部分,目的是区分出连续语音流中的语音和非语音部分,通过对语音部分起始点的准确定位,有效滤除非语音噪声片段,从而更有效的处理语音流信息,其已被广泛应用于语音识别、说话人分离和识别及其他辅助任务,如情感识别、性别识别和语种识别等。
一般情况,在低噪音条件下,端点检测相对容易,传统基于能量或谱熵的检测方法就能得到较高的检测精度。而在高噪音条件下,端点检测的困难显著提高。基于谐波规则的检测方法,通过利用人声的谐波特性,可以有效区分语音和非语音片段,在高噪音场景具有很好的鲁棒性,已广泛应用于语音信号处理系统,但是由于同样具有谐波特性的背景噪声,如音乐声、咳嗽声和汽车喇叭声这类噪声的存在,导致基于谐波规则的端点检测方法不可避免的会引进很多误识别。
近年来,随着深度神经网络技术(deep neural network,DNN)在信号处理领域的巨大成功,发明人意识到,基于DNN的端点检测算法愈来成为研究热点,由于很难获得精确的语音识别对齐信息,使得基于DNN的端点检测具有一定的混淆性,一些无谐波特性的背景噪声也有可能被误识别成语音。因此,采用传统的语音端点检测算法,无法区分出目标说话人和非目标说话人,导致语音端点检测的准确性低。
发明内容
本申请的主要目的在于解决传统的语音端点检测算法,无法区分出目标说话人和非目标说话人,导致语音端点检测的准确性低的问题。
为实现上述目的,本申请第一方面提供了一种语音端点检测方法,包括:获取待识别语音信息,并对所述待识别语音信息进行预处理,得到预处理后的语音信息;从所述预处理后的语音信息中提取帧级别的语音频谱特征信息;对所述预处理后的语音信息进行特征处理,得到目标说话人的声学特征信息;对所述语音频谱特征信息和所述声学特征信息进行特征融合,得到已融合的语音特征信息,所述已融合的语音特征信息为段级或句子级的特征信息;将所述已融合的语音特征信息输入至已训练的深度神经网络模型中进行语音端点检测处理,得到检测结果,并按照所述检测结果确定目标说话人语音类型、非目标说话人语音类型和背景噪声类型。
本申请第二方面提供了一种语音端点检测设备,包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机可读指令,所述处理器执行所述计算机可读指令时实现如下步骤:获取待识别语音信息,并对所述待识别语音信息进行预处理,得到预处理后的语音信息;从所述预处理后的语音信息中提取帧级别的语音频谱特征信息;对所述预处理后的语音信息进行特征处理,得到目标说话人的声学特征信息;对所述语音频谱特征信息和所述声学特征信息进行特征融合,得到已融合的语音特征信息,所述已融合的语音特征信息为段级或句子级的特征信息;将所述已融合的语音特征信息输入至已训练的深度神经网络模型中进行语音端点检测处理,得到检测结果,并按照所述检测结果确定目标说话人语音类型、非目标说话人语音类型和背景噪声类型。
本申请第三方面提供了一种计算机可读存储介质,所述计算机可读存储介质中存储有计算机指令,当所述计算机指令在计算机上运行时,使得计算机执行如下步骤:获取待识别语音信息,并对所述待识别语音信息进行预处理,得到预处理后的语音信息;从所述预处理后的语音信息中提取帧级别的语音频谱特征信息;对所述预处理后的语音信息进行特征处理,得到目标说话人的声学特征信息;对所述语音频谱特征信息和所述声学特征信息进行特征融合,得到已融合的语音特征信息,所述已融合的语音特征信息为段级或句子级的特征信息;将所述已融合的语音特征信息输入至已训练的深度神经网络模型中进行语音端点检测处理,得到检测结果,并按照所述检测结果确定目标说话人语音类型、非目标说话人语音类型和背景噪声类型。
本申请第四方面提供了一种语音端点检测装置,包括:预处理模块,用于获取待识别语音信息,并对所述待识别语音信息进行预处理,得到预处理后的语音信息;提取模块,用于从所述预处理后的语音信息中提取帧级别的语音频谱特征信息;处理模块,用于对所述预处理后的语音信息进行特征处理,得到目标说话人的声学特征信息;融合模块,用于对所述语音频谱特征信息和所述声学特征信息进行特征融合,得到已融合的语音特征信息,所述已融合的语音特征信息为段级或句子级的特征信息;检测模块,用于将所述已融合的语音特征信息输入至已训练的深度神经网络模型中进行语音端点检测处理,得到检测结果,并按照所述检测结果确定目标说话人语音类型、非目标说话人语音类型和背景噪声类型。
本申请中,通过深度神经网络模型对语音信息进行语音端点检测,并基于听觉感知特性的语音频谱特征信息增强目标说话人的语音声学特征信息,提高了目标说话人语音信息检测的准确性,减少其他说话人语音或者背景噪声的干扰,防止出现因其他说话人语音或者非语音的背景噪声导致的业务逻辑问题。以使得后续语音处理系统仅对目标说话人语音片段进行处理,减少了计算压力,提高了后续语音处理系统的响应速度。
附图说明
图1为本申请实施例中语音端点检测方法的一个实施例示意图;
图2为本申请实施例中语音端点检测方法的另一个实施例示意图;
图3为本申请实施例中语音端点检测装置的一个实施例示意图;
图4为本申请实施例中语音端点检测装置的另一个实施例示意图;
图5为本申请实施例中语音端点检测设备的一个实施例示意图。
具体实施方式
本申请实施例提供了一种语音端点检测方法、装置、设备及存储介质,用于通过深度神经网络模型对语音信息进行语音端点检测,并基于听觉感知特性的语音频谱特征信息增强目标说话人的语音声学特征信息,提高了目标说话人语音信息检测的准确性。
本申请的说明书和权利要求书及上述附图中的术语“第一”、“第二”、“第三”、“第四”等(如果存在)是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。应该理解这样使用的数据在适当情况下可以互换,以便这里描述的实施例能够以除了在这里图示或描述的内容以外的顺序实施。此外,术语“包括”或“具有”及其任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元的过程、方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。
为便于理解，下面对本发明实施例的具体流程进行描述，请参阅图1，本发明实施例中语音端点检测方法的一个实施例包括：
101、获取待识别语音信息,并对待识别语音信息进行预处理,得到预处理后的语音信息。
其中,待识别语音信息可以为实时语音信息,也可以为非实时语音信息(预先录制的音频)。服务器可以接收待识别语音信息,或者按照预设文件路径读取待识别语音信息;服务器对待识别语音信息进行预处理,进一步地,服务器对待识别语音信息提高信噪比,以增强语音信息;服务器对增强的待识别语音信息进行分帧处理,得到多个语音帧信息,并对多个语音帧信息进行加窗处理,以使得每个语音帧信息的帧首和帧尾更为平滑,得到预处理后的语音信息,从而避免突发变异产生的高频噪声。例如,服务器对多个语音帧信息添加汉明窗或者矩形窗进行处理。
可以理解的是,本申请的执行主体可以为语音端点检测装置,还可以是终端或者服务器,具体此处不做限定。本申请实施例以服务器为执行主体为例进行说明。
102、从预处理后的语音信息中提取帧级别的语音频谱特征信息。
也就是,服务器将预处理后的语音信息中具有辨识性的特征提取出来,然后将其他信息丢弃,其他信息包括背景噪声或者情绪。其中,语音频谱特征信息包括梅尔频率倒谱系数MCFF特征和过滤器组fbank特征,服务器还可以采集其他频谱特征,具体此处不做限定。
进一步地,服务器对预处理后的语音信息(多个已加窗的语音帧信息)进行快速傅立叶变换(fast fourier transformation,FFT),并采用梅尔滤波器组过滤处理,得到40维fbank;然后服务器可将40维fbank进行离散余弦变换(discrete cosine transformation,DCT),也就是,服务器将40维fbank映射到低维空间(从40维降到13维),得到MCFF特征。
需要说明的是,MFCC特征计算是在fbank的基础上进行的,所以MFCC的计算量更大,而fbank特征相关性较高(相邻滤波器组有重叠),MFCC具有更好的判别度。同时,服务器还可以在语音特征中加入表征语音动态特性的差分特征,能够提高系统的识别性能。例如,服务器采用MFCC特征的一阶差分特征和二阶差分特征,也可以采用fbank特征的一阶差分特征和二阶差分特征,具体此处不做限定。
103、对预处理后的语音信息进行特征处理,得到目标说话人的声学特征信息。
若待识别语音信息为预先录制的语音信息,则服务器可采用预设的已训练网络模型进行特征处理,例如,预设的已训练网络模型可以为高斯混合模型-通用背景模型GMM-UBM、i向量网络模型i-vector以及x向量网络模型x-vector,具体采取何种模式可依据不同的业务场景进行选取,具体此处不做限定。进一步地,服务器采用预设的已训练网络模型进行段级说话人特征提取,得到目标说话人的声学特征信息,然后将目标说话人的声学特征信息存储至数据库。并在模型训练阶段,服务器对预设数量帧数的语音段进行目标说话人特征提取,然后与预置数据库中目标说话人的声学特征信息进行相似性比对,得到相似度分值,并将相似度分值作为后续语音端点检测的输入参数。
若待识别语音信息为实时采集的语音信息,则服务器采用d向量网络模型d-vector进行帧级说话人特征提取,由于帧级特征的不稳定型,服务器可以采取滑窗的形式进行,通过聚合窗内帧级说话人特征信息,输出目标说话人的声学特征信息。
104、对语音频谱特征信息和声学特征信息进行特征融合,得到已融合的语音特征信息,已融合的语音特征信息为段级或句子级的特征信息。
进一步地,服务器将语音频谱特征信息和声学特征信息进行帧级语音特征拼接处理,得到段级或句子级说话人特征信息,并将段级或句子级说话人特征信息设置为已融合的语音特征信息,已融合的语音特征信息为段级或句子级的特征信息。也就是,服务器将目标说话人的声学特征信息(例如,i-vector特征信息、x-vector特征信息或者d-vector特征信息)连接到每帧语音频谱特征信息上,得到已融合的语音特征信息。其中,已融合的语音特征信息为已训练的深度神经网络模型的输入参数。
105、将已融合的语音特征信息输入至已训练的深度神经网络模型中进行语音端点检测处理,得到检测结果,并按照检测结果确定目标说话人语音类型、非目标说话人语音类型和背景噪声类型。
语音端点检测采用基于深度神经网络的语音端点检测算法，输入特征为梅尔频率倒谱系数MFCC或fbank特征，另外嵌入目标说话人的声学特征信息，其中，目标说话人的声学特征信息可以采用目标说话人的相似度得分（相似度分值）或d-vector的隐层网络输出特征向量。已训练的深度神经网络模型的网络结构一般采用长短期记忆网络（long short-term memory，LSTM）、循环神经网络（recurrent neural network，RNN）、卷积神经网络（convolutional neural networks，CNN）或者时延神经网络TDNN，还可以采用其他网络结构，具体此处不做限定。也就是，服务器将已融合的语音特征信息输入至LSTM、RNN、CNN或者TDNN进行逐帧语音端点检测处理，输出的检测结果包括目标说话人语音类型、非目标说话人语音类型和背景噪声类型。其中，检测结果用于指示每帧语音信息端点类型的后验概率，例如，可以采用0.8、0.5、0.2分别标识目标说话人语音类型、非目标说话人语音类型和背景噪声类型。进一步地，服务器根据检测结果对语音信息进行标注处理，以获取仅存在目标说话人语音类型的语音片段，便于后续语音处理系统的使用处理。
例如,服务器对会议场景中的语音片段(作为待识别语音信息)进行语音端点检测,进而检测出语音片段中每帧语音信息中的目标说话人语音类型(例如,会议主讲人的讲话语音)、非目标说话人语音类型(例如,与会人员的讨论语音)和背景噪声类型(例如,手机铃声或者开关门的噪音)。
本申请实施例中,通过深度神经网络模型对语音信息进行语音端点检测,并基于听觉感知特性的语音频谱特征信息增强目标说话人的语音声学特征信息,提高了目标说话人语音信息检测的准确性,减少其他说话人语音或者背景噪声的干扰,防止出现因其他说话人语音或者非语音的背景噪声导致的业务逻辑问题。以使得后续语音处理系统仅对目标说话人语音片段进行处理,减少了计算压力,提高了后续语音处理系统的响应速度。
请参阅图2,本申请实施例中语音端点检测方法的另一个实施例包括:
201、获取待识别语音信息,并对待识别语音信息进行预处理,得到预处理后的语音信息。
一般情况下,人耳可以听到的声音频率在20赫兹至20千赫兹之间的声波。因此,服务器设置取样频率(每秒钟取得声音样本的次数)对待识别语音信息进行采集。而采样频率越高,待识别语音信息中声音的质量也就越好。由于人耳的分辨率很有限,所以取样频率也不能设置太高的频率。可选的,服务器接收待识别语音信息,并对待识别语音信息进行采样,得到已采样的语音信息。进一步地,服务器将待识别语音信息(音频信号)通过一个高通滤波器进行采样,例如,截止频率大约为200赫兹,进而移除待识别语音信息中的直流偏置分量和一些低频噪声,即使在低于200赫兹的部分仍然有部分语音信息被过滤,但是不会对待识别语音信息造成很大的影响;服务器对已采样的语音信息依次进行预加重、分帧和加窗处理,得到预处理后的语音信息。
需要说明的是,预加重可采用一个一阶有限激励响应高通滤波器,使得已采样的语音信息的频谱变得平坦。分帧用于将预加重后的语音信息转换为长度为20毫秒至40毫秒的帧语音信息(将N个采样点集合成一个观测单位),一般帧与帧之间的重叠为10毫秒。例如,若已采样的语音信息的采样率为12千赫兹,取窗口大小为25毫秒,那么,每一帧语音数据的所包含的数据点为:0.025*12000=300个采样点。而以帧之间重叠为10毫秒来计算,第一帧的数据起始点为sample0,第二帧数据的起始点为sample120。加窗是对对每帧语音信息代入窗函数,窗函数在某一区间有非零值,而在其余区间(窗外的值)皆为0, 使得每帧语音信息两端衰减至接近0。
202、从预处理后的语音信息中提取帧级别的语音频谱特征信息。
其中,语音频谱特征信息为符合人耳听觉习惯的声谱,语音频谱特征信息包括MCFF和fbank,也可以包括其他频谱特征,具体此处不做限定。可选的,服务器从预处理后的语音信息中提取每帧语音信号;服务器对每帧语音信号进行傅里叶变换,得到对应的频谱信息,也就是,服务将时域信号变换成为信号的功率谱(频域信号);服务器对对应的频谱信息进行梅尔滤波器组处理,得到过滤器组fbank特征信息,其中,梅尔滤波器组处理是将线形的自然频谱转换为体现人类听觉特性的梅尔频谱;服务器将fbank特征信息设置为帧级别的语音频谱特征信息。
203、判断目标说话人是否已预先注册语音特征信息。
进一步地,服务器获取目标说话人对应的身份标识信息(例如,身份标识信息为id_001),并根据对应的身份标识信息查询预置数据库,得到查询结果;服务器判断查询结果是否为空值;若查询结果为空值,则服务器确定目标说话人未预先注册语音特征信息,进一步地,服务器执行步骤204;若查询结果不为空值,则服务器确定目标说话人已预先注册语音特征信息,进一步地,服务器执行步骤205。例如,可采用唯一标识信息(例如,全球唯一标识符)表示身份标识信息,也可以采用其他信息表示身份标识信息,具体此处不做限定。
204、若目标说话人未预先注册语音特征信息,则采用预训练的d-vector网络对预处理后的语音信息进行特征处理,得到目标说话人的声学特征信息。
其中,目标说话人的声学特征信息为d-vector特征向量信息。可选的,若目标说话人未预先注册语音特征信息,则服务器将预处理后的语音信息输入到预训练的d-vector网络中,采用预置特征提取网络从预处理后的语音信息中提取帧级说话人特征向量;服务器采用预训练的d-vector网络中的预置隐层网络从过滤器组fbank特征信息中抽取激活值;服务器将激活值进行L2正则化并累加处理,得到目标说话人的声学特征信息,声学特征信息为d-vector特征向量信息。
需要说明的是,对于目标说话人判定,还存在一些无法预知目标说话人的业务场景,一般情况下,服务器可设置首段语音说话人为目标说话人,并在语音处理过程中,服务器根据语音信息的时长占比和以及对应的文本语义内容分析,对目标说话人进行信息更新。另外,业务场景中包含说话人数量是有限的,为说话人特征提取网络结构采用小参数网络结构(d-vector对应的结构),提高了目标说话人的声学特征计算效率和提取效率。
205、若目标说话人已预先注册语音特征信息,则从预置数据表中查询目标说话人的声学特征信息。
需要说明的是，当目标说话人已预先注册语音特征信息时，服务器从预设数据库中获取目标说话人的声学特征信息，并按照帧级说话人特征向量与目标说话人的声学特征信息计算相似度得分，得到相似度分值，并将相似度分值设置为目标说话人的声学特征信息。
可选的,若目标说话人已预先注册语音特征信息,则服务器获取目标说话人的唯一标识信息,并按照预置结构化查询语言语法规则、唯一标识信息和预置数据表生成查询语句;服务器执行查询语句,得到目标说话人在特征注册阶段中确定的预置d-vector特征信息,并将预置d-vector特征信息设置为目标说话人特征信息。
206、对语音频谱特征信息和声学特征信息进行特征融合,得到已融合的语音特征信息,已融合的语音特征信息为段级或句子级的特征信息。
该步骤206与步骤104的描述相似,具体此处不再赘述。
207、将已融合的语音特征信息输入至已训练的深度神经网络模型中进行语音端点检测 处理,得到检测结果,并按照检测结果确定目标说话人语音类型、非目标说话人语音类型和背景噪声类型。
其中,已训练的深度神经网络模型可为预设的长短期记忆网络-卷积神经网络LSTM-CNN模型,也可为其他网络模型,具体此处不做限定。可选的,服务器将已融合的语音特征信息输入至预设的长短期记忆网络-卷积神经网络LSTM-CNN模型中,并通过预设的LSTM-CNN模型对已融合的语音特征信息进行语音端点检测处理,得到检测结果,预设的LSTM-CNN模型为已训练的深度神经网络模型;当检测结果大于或者等于第一预置阈值时,服务器确定检测结果为目标说话人语音类型;当检测结果小于第一预置阈值,并且大于或者等于第二预置阈值时,服务器确定检测结果为非目标说话人语音类型;当检测结果小于第二预置阈值,并且大于或者等于第三预置阈值时,服务器确定检测结果为背景噪声类型。
其中,第一预置阈值、第二预置阈值和第三预置阈值分别对应的取值范围为0到1之间的小数,例如,第一预置阈值、第二预置阈值和第三预置阈值分别为0.90、0.40和0.10,那么,当检测结果大于或者等于0.90时,服务器确定检测结果为目标说话人语音类型,例如,检测结果为0.96;当检测结果小于0.90,并且大于或者等于0.40时,服务器确定检测结果为非目标说话人语音类型,例如,检测结果为0.67;当检测结果小于0.40,并且大于或者等于0.10时,服务器确定检测结果为背景噪声类型,例如,检测结果为0.23。检测结果也可以为1或者0,具体此处不做限定。
进一步地,在步骤201之前,服务器获取语音样本数据,并按照预设比例将语音样本数据划分为训练样本数据和测试样本数据,服务器基于训练样本数据对初始深度神经网络模型进行训练,其中,服务器可以采用交叉熵作为目标函数进行模型训练,同时,由于目标说话人语音与非目标说话人语音受限于说话人之间的区分度,而且数量占比较小。为平衡类型差异,防止网络训练出现偏差,服务器可采用加权对损失函数进行模型训练,以增强目标说话人语音与非目标说话人语音之间的差别,具体出处不做限定,得到训练好的深度神经网络模型。服务器采用测试样本数据对训练好的深度神经网络模型进行预测,得到预测结果,并基于预测结果对训练好的深度神经网络模型进行迭代优化,得到已训练的深度神经网络模型。
本申请实施例中,通过深度神经网络模型对语音信息进行语音端点检测,并基于听觉感知特性的语音频谱特征信息增强目标说话人的语音声学特征信息,提高了目标说话人语音信息检测的准确性,减少其他说话人语音或者背景噪声的干扰,防止出现因其他说话人语音或者非语音的背景噪声导致的业务逻辑问题。以使得后续语音处理系统仅对目标说话人语音片段进行处理,减少了计算压力,提高了后续语音处理系统的响应速度。
上面对本申请实施例中语音端点检测方法进行了描述,下面对本申请实施例中语音端点检测装置进行描述,请参阅图3,本申请实施例中语音端点检测装置的一个实施例包括:
预处理模块301,用于获取待识别语音信息,并对待识别语音信息进行预处理,得到预处理后的语音信息;
提取模块302,用于从预处理后的语音信息中提取帧级别的语音频谱特征信息;
处理模块303,用于对预处理后的语音信息进行特征处理,得到目标说话人的声学特征信息;
融合模块304,用于对语音频谱特征信息和声学特征信息进行特征融合,得到已融合的语音特征信息,已融合的语音特征信息为段级或句子级的特征信息;
检测模块305,用于将已融合的语音特征信息输入至已训练的深度神经网络模型中进行语音端点检测处理,得到检测结果,并按照检测结果确定目标说话人语音类型、非目标 说话人语音类型和背景噪声类型。
本申请实施例中通过深度神经网络模型对语音信息进行语音端点检测,并基于听觉感知特性的语音频谱特征信息增强目标说话人的语音声学特征信息,提高了目标说话人语音信息检测的准确性,减少其他说话人语音或者背景噪声的干扰,防止出现因其他说话人语音或者非语音的背景噪声导致的业务逻辑问题。以使得后续语音处理系统仅对目标说话人语音片段进行处理,减少了计算压力,提高了后续语音处理系统的响应速度。
请参阅图4,本申请实施例中语音端点检测装置的另一个实施例包括:
预处理模块301,用于获取待识别语音信息,并对待识别语音信息进行预处理,得到预处理后的语音信息;
提取模块302,用于从预处理后的语音信息中提取帧级别的语音频谱特征信息;
处理模块303,用于对预处理后的语音信息进行特征处理,得到目标说话人的声学特征信息;
融合模块304,用于对语音频谱特征信息和声学特征信息进行特征融合,得到已融合的语音特征信息,已融合的语音特征信息为段级或句子级的特征信息;
检测模块305,用于将已融合的语音特征信息输入至已训练的深度神经网络模型中进行语音端点检测处理,得到检测结果,并按照检测结果确定目标说话人语音类型、非目标说话人语音类型和背景噪声类型。
可选的,预处理模块301还可以具体用于:
接收待识别语音信息,并对待识别语音信息进行采样,得到已采样的语音信息;
对已采样的语音信息依次进行预加重、分帧和加窗处理,得到预处理后的语音信息。
进一步地,将待识别语音信息存储于区块链数据库中,具体此处不做限定。
可选的,提取模块302还包括:
从预处理后的语音信息中提取每帧语音信号;
对每帧语音信号进行梅尔滤波器组处理,得到过滤器组fbank特征信息,并将fbank特征信息设置为帧级别的语音频谱特征信息。
可选的,处理模块303还包括:
判断单元3031,用于判断目标说话人是否已预先注册语音特征信息;
处理单元3032,若目标说话人未预先注册语音特征信息,则用于采用预训练的d-vector网络对预处理后的语音信息进行特征处理,得到目标说话人的声学特征信息;
查询单元3033,若目标说话人已预先注册语音特征信息,则用于从预置数据表中查询目标说话人的声学特征信息。
可选的,处理单元3032还可以具体用于:
若目标说话人未预先注册语音特征信息,则将预处理后的语音信息输入到预训练的d-vector网络中,采用预置特征提取网络从预处理后的语音信息中提取帧级说话人特征向量;
采用预训练的d-vector网络中的预置隐层网络从过滤器组fbank特征信息中抽取激活值;
将激活值进行L2正则化并累加处理,得到目标说话人的声学特征信息,声学特征信息为d-vector特征向量信息。
可选的,查询单元3033还可以具体用于:
若目标说话人已预先注册语音特征信息,则获取目标说话人的唯一标识信息,并按照预置结构化查询语言语法规则、唯一标识信息和预置数据表生成查询语句;
执行查询语句,得到目标说话人在特征注册阶段中确定的预置d-vector特征信息,并 将预置d-vector特征信息设置为目标说话人特征信息。
可选的,检测模块305还可以具体用于:
将已融合的语音特征信息输入至预设的长短期记忆网络-卷积神经网络LSTM-CNN模型中,并通过预设的LSTM-CNN模型对已融合的语音特征信息进行语音端点检测处理,得到检测结果,预设的LSTM-CNN模型为已训练的深度神经网络模型;
当检测结果大于或者等于第一预置阈值时,确定检测结果为目标说话人语音类型;
当检测结果小于第一预置阈值,并且大于或者等于第二预置阈值时,确定检测结果为非目标说话人语音类型;
当检测结果小于第二预置阈值,并且大于或者等于第三预置阈值时,确定检测结果为背景噪声类型。
本申请实施例中,通过深度神经网络模型对语音信息进行语音端点检测,并基于听觉感知特性的语音频谱特征信息增强目标说话人的语音声学特征信息,提高了目标说话人语音信息检测的准确性,减少其他说话人语音或者背景噪声的干扰,防止出现因其他说话人语音或者非语音的背景噪声导致的业务逻辑问题。以使得后续语音处理系统仅对目标说话人语音片段进行处理,减少了计算压力,提高了后续语音处理系统的响应速度。
上面图3和图4从模块化的角度对本申请实施例中的语音端点检测装置进行详细描述,下面从硬件处理的角度对本申请实施例中语音端点检测设备进行详细描述。
图5是本申请实施例提供的一种语音端点检测设备的结构示意图,该语音端点检测设备500可因配置或性能不同而产生比较大的差异,可以包括一个或一个以上处理器(central processing units,CPU)510(例如,一个或一个以上处理器)和存储器520,一个或一个以上存储应用程序533或数据532的存储介质530(例如一个或一个以上海量存储设备)。其中,存储器520和存储介质530可以是短暂存储或持久存储。存储在存储介质530的程序可以包括一个或一个以上模块(图示没标出),每个模块可以包括对语音端点检测设备500中的一系列指令操作。更进一步地,处理器510可以设置为与存储介质530通信,在语音端点检测设备500上执行存储介质530中的一系列指令操作。
语音端点检测设备500还可以包括一个或一个以上电源540,一个或一个以上有线或无线网络接口550,一个或一个以上输入输出接口560,和/或,一个或一个以上操作系统531,例如Windows Serve,Mac OS X,Unix,Linux,FreeBSD等等。本领域技术人员可以理解,图5示出的语音端点检测设备结构并不构成对语音端点检测设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
本申请还提供一种计算机可读存储介质,该计算机可读存储介质可以为非易失性计算机可读存储介质,该计算机可读存储介质也可以为易失性计算机可读存储介质,所述计算机可读存储介质中存储有指令,当所述指令在计算机上运行时,使得计算机执行所述语音端点检测方法的步骤。
本申请还提供一种语音端点检测设备,所述语音端点检测设备包括存储器和处理器,存储器中存储有指令,所述指令被处理器执行时,使得处理器执行上述各实施例中的所述语音端点检测方法的步骤。
进一步地,所述计算机可读存储介质可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序等;存储数据区可存储根据区块链节点的使用所创建的数据等。
本申请所指区块链是分布式数据存储、点对点传输、共识机制、加密算法等计算机技术的新型应用模式。区块链(Blockchain),本质上是一个去中心化的数据库,是一串使用密码学方法相关联产生的数据块,每一个数据块中包含了一批次网络交易的信息,用于验证 其信息的有效性(防伪)和生成下一个区块。区块链可以包括区块链底层平台、平台产品服务层以及应用服务层等。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统,装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的全部或部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,以上实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围。

Claims (20)

  1. 一种语音端点检测方法,其中,包括:
    获取待识别语音信息,并对所述待识别语音信息进行预处理,得到预处理后的语音信息;
    从所述预处理后的语音信息中提取帧级别的语音频谱特征信息;
    对所述预处理后的语音信息进行特征处理,得到目标说话人的声学特征信息;
    对所述语音频谱特征信息和所述声学特征信息进行特征融合,得到已融合的语音特征信息,所述已融合的语音特征信息为段级或句子级的特征信息;
    将所述已融合的语音特征信息输入至已训练的深度神经网络模型中进行语音端点检测处理,得到检测结果,并按照所述检测结果确定目标说话人语音类型、非目标说话人语音类型和背景噪声类型。
  2. 根据权利要求1所述的语音端点检测方法,其中,所述获取待识别语音信息,并对所述待识别语音信息进行预处理,得到预处理后的语音信息,包括:
    接收待识别语音信息,并对所述待识别语音信息进行采样,得到已采样的语音信息;
    对所述已采样的语音信息依次进行预加重、分帧和加窗处理,得到预处理后的语音信息。
  3. 根据权利要求1所述的语音端点检测方法,其中,所述从所述预处理后的语音信息中提取帧级别的语音频谱特征信息,包括:
    从所述预处理后的语音信息中提取每帧语音信号;
    对所述每帧语音信号进行傅里叶变换,得到对应的频谱信息;
    对所述对应的频谱信息进行梅尔滤波器组处理,得到过滤器组fbank特征信息,并将所述fbank特征信息设置为帧级别的语音频谱特征信息。
  4. 根据权利要求1所述的语音端点检测方法,其中,所述对所述预处理后的语音信息进行特征处理,得到目标说话人的声学特征信息,包括:
    判断目标说话人是否已预先注册语音特征信息;
    若目标说话人未预先注册语音特征信息,则采用预训练的d-vector网络对所述预处理后的语音信息进行特征处理,得到目标说话人的声学特征信息;
    若目标说话人已预先注册语音特征信息,则从所述预置数据表中查询目标说话人的声学特征信息。
  5. 根据权利要求4所述的语音端点检测方法,其中,所述若目标说话人未预先注册语音特征信息,则采用预训练的d-vector网络对所述预处理后的语音信息进行特征处理,得到目标说话人的声学特征信息,包括:
    若目标说话人未预先注册语音特征信息,则将所述预处理后的语音信息输入到预训练的d-vector网络中,采用预置特征提取网络从所述预处理后的语音信息中提取帧级说话人特征向量;
    采用所述预训练的d-vector网络中的预置隐层网络从所述过滤器组fbank特征信息中抽取激活值;
    将所述激活值进行L2正则化并累加处理,得到目标说话人的声学特征信息,所述声学特征信息为d-vector特征向量信息。
  6. 根据权利要求4所述的语音端点检测方法，其中，所述若目标说话人已预先注册语音特征信息，则从所述预置数据表中查询目标说话人的声学特征信息，包括：
    若目标说话人已预先注册语音特征信息,则获取目标说话人的唯一标识信息,并按照预置结构化查询语言语法规则、所述唯一标识信息和所述预置数据表生成查询语句;
    执行所述查询语句,得到所述目标说话人在特征注册阶段中确定的预置d-vector特征信息,并将所述预置d-vector特征信息设置为目标说话人特征信息。
  7. 根据权利要求1-6中任意一项所述的语音端点检测方法,其中,所述将所述已融合的语音特征信息输入至已训练的深度神经网络模型中进行语音端点检测处理,得到检测结果,并按照所述检测结果确定目标说话人语音类型、非目标说话人语音类型和背景噪声类型,包括:
    将所述已融合的语音特征信息输入至预设的长短期记忆网络-卷积神经网络LSTM-CNN模型中,并通过所述预设的LSTM-CNN模型对所述已融合的语音特征信息进行语音端点检测处理,得到检测结果,所述预设的LSTM-CNN模型为已训练的深度神经网络模型;
    当所述检测结果大于或者等于第一预置阈值时,确定所述检测结果为目标说话人语音类型;
    当所述检测结果小于第一预置阈值,并且大于或者等于第二预置阈值时,确定所述检测结果为非目标说话人语音类型;
    当所述检测结果小于第二预置阈值,并且大于或者等于第三预置阈值时,确定所述检测结果为背景噪声类型。
  8. 一种语音端点检测设备,包括存储器、处理器及存储在所述存储器上并可在所述处理器上运行的计算机可读指令,所述处理器执行所述计算机可读指令时实现如下步骤:
    获取待识别语音信息,并对所述待识别语音信息进行预处理,得到预处理后的语音信息;
    从所述预处理后的语音信息中提取帧级别的语音频谱特征信息;
    对所述预处理后的语音信息进行特征处理,得到目标说话人的声学特征信息;
    对所述语音频谱特征信息和所述声学特征信息进行特征融合,得到已融合的语音特征信息,所述已融合的语音特征信息为段级或句子级的特征信息;
    将所述已融合的语音特征信息输入至已训练的深度神经网络模型中进行语音端点检测处理,得到检测结果,并按照所述检测结果确定目标说话人语音类型、非目标说话人语音类型和背景噪声类型。
  9. 根据权利要求8所述的语音端点检测设备,所述处理器执行所述计算机程序时还实现以下步骤:
    接收待识别语音信息,并对所述待识别语音信息进行采样,得到已采样的语音信息;
    对所述已采样的语音信息依次进行预加重、分帧和加窗处理,得到预处理后的语音信息。
  10. 根据权利要求8所述的语音端点检测设备,所述处理器执行所述计算机程序时还实现以下步骤:
    从所述预处理后的语音信息中提取每帧语音信号;
    对所述每帧语音信号进行傅里叶变换,得到对应的频谱信息;
    对所述对应的频谱信息进行梅尔滤波器组处理,得到过滤器组fbank特征信息,并将所述fbank特征信息设置为帧级别的语音频谱特征信息。
  11. The voice endpoint detection device according to claim 8, wherein the processor, when executing the computer-readable instructions, further implements the following steps:
    determining whether the target speaker has pre-registered voice feature information;
    if the target speaker has not pre-registered voice feature information, performing feature processing on the preprocessed voice information by using a pre-trained d-vector network to obtain the acoustic feature information of the target speaker;
    if the target speaker has pre-registered voice feature information, querying the acoustic feature information of the target speaker from a preset data table.
  12. The voice endpoint detection device according to claim 11, wherein the processor, when executing the computer-readable instructions, further implements the following steps:
    if the target speaker has not pre-registered voice feature information, inputting the preprocessed voice information into the pre-trained d-vector network, and extracting frame-level speaker feature vectors from the preprocessed voice information by using a preset feature extraction network;
    extracting activation values from the filterbank (fbank) feature information by using a preset hidden-layer network in the pre-trained d-vector network;
    performing L2 normalization on the activation values and accumulating them to obtain the acoustic feature information of the target speaker, the acoustic feature information being d-vector feature vector information.
  13. The voice endpoint detection device according to claim 11, wherein the processor, when executing the computer-readable instructions, further implements the following steps:
    if the target speaker has pre-registered voice feature information, acquiring unique identification information of the target speaker, and generating a query statement according to preset structured query language (SQL) syntax rules, the unique identification information, and the preset data table;
    executing the query statement to obtain preset d-vector feature information of the target speaker determined in a feature registration phase, and setting the preset d-vector feature information as target speaker feature information.
  14. The voice endpoint detection device according to any one of claims 8 to 13, wherein the processor, when executing the computer-readable instructions, further implements the following steps:
    inputting the fused voice feature information into a preset long short-term memory-convolutional neural network (LSTM-CNN) model, and performing voice endpoint detection processing on the fused voice feature information through the preset LSTM-CNN model to obtain the detection result, the preset LSTM-CNN model being the trained deep neural network model;
    when the detection result is greater than or equal to a first preset threshold, determining that the detection result is the target speaker voice type;
    when the detection result is less than the first preset threshold and greater than or equal to a second preset threshold, determining that the detection result is the non-target speaker voice type;
    when the detection result is less than the second preset threshold and greater than or equal to a third preset threshold, determining that the detection result is the background noise type.
  15. A computer-readable storage medium storing computer instructions, wherein the computer instructions, when run on a computer, cause the computer to perform the following steps:
    acquiring voice information to be recognized, and preprocessing the voice information to be recognized to obtain preprocessed voice information;
    extracting frame-level speech spectral feature information from the preprocessed voice information;
    performing feature processing on the preprocessed voice information to obtain acoustic feature information of a target speaker;
    performing feature fusion on the speech spectral feature information and the acoustic feature information to obtain fused voice feature information, the fused voice feature information being segment-level or sentence-level feature information;
    inputting the fused voice feature information into a trained deep neural network model for voice endpoint detection processing to obtain a detection result, and determining a target speaker voice type, a non-target speaker voice type, and a background noise type according to the detection result.
  16. The computer-readable storage medium according to claim 15, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    receiving the voice information to be recognized, and sampling the voice information to be recognized to obtain sampled voice information;
    sequentially performing pre-emphasis, framing, and windowing on the sampled voice information to obtain the preprocessed voice information.
  17. The computer-readable storage medium according to claim 15, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    extracting each frame of speech signal from the preprocessed voice information;
    performing a Fourier transform on each frame of speech signal to obtain corresponding spectral information;
    performing Mel filterbank processing on the corresponding spectral information to obtain filterbank (fbank) feature information, and setting the fbank feature information as the frame-level speech spectral feature information.
  18. The computer-readable storage medium according to claim 15, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    determining whether the target speaker has pre-registered voice feature information;
    if the target speaker has not pre-registered voice feature information, performing feature processing on the preprocessed voice information by using a pre-trained d-vector network to obtain the acoustic feature information of the target speaker;
    if the target speaker has pre-registered voice feature information, querying the acoustic feature information of the target speaker from a preset data table.
  19. The computer-readable storage medium according to claim 18, wherein the computer instructions, when run on a computer, further cause the computer to perform the following steps:
    if the target speaker has not pre-registered voice feature information, inputting the preprocessed voice information into the pre-trained d-vector network, and extracting frame-level speaker feature vectors from the preprocessed voice information by using a preset feature extraction network;
    extracting activation values from the filterbank (fbank) feature information by using a preset hidden-layer network in the pre-trained d-vector network;
    performing L2 normalization on the activation values and accumulating them to obtain the acoustic feature information of the target speaker, the acoustic feature information being d-vector feature vector information.
  20. A voice endpoint detection apparatus, comprising:
    a preprocessing module, configured to acquire voice information to be recognized and preprocess the voice information to be recognized to obtain preprocessed voice information;
    an extraction module, configured to extract frame-level speech spectral feature information from the preprocessed voice information;
    a processing module, configured to perform feature processing on the preprocessed voice information to obtain acoustic feature information of a target speaker;
    a fusion module, configured to perform feature fusion on the speech spectral feature information and the acoustic feature information to obtain fused voice feature information, the fused voice feature information being segment-level or sentence-level feature information;
    a detection module, configured to input the fused voice feature information into a trained deep neural network model for voice endpoint detection processing to obtain a detection result, and determine a target speaker voice type, a non-target speaker voice type, and a background noise type according to the detection result.
PCT/CN2020/131693 2020-07-31 2020-11-26 Voice endpoint detection method, apparatus, device and storage medium WO2021139425A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010762893.9A CN111816218A (zh) 2020-07-31 2020-07-31 语音端点检测方法、装置、设备及存储介质
CN202010762893.9 2020-07-31

Publications (1)

Publication Number Publication Date
WO2021139425A1 true WO2021139425A1 (zh) 2021-07-15

Family

ID=72864477

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/131693 WO2021139425A1 (zh) 2020-07-31 2020-11-26 语音端点检测方法、装置、设备及存储介质

Country Status (2)

Country Link
CN (1) CN111816218A (zh)
WO (1) WO2021139425A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20230005488A1 (en) * 2019-12-17 2023-01-05 Sony Group Corporation Signal processing device, signal processing method, program, and signal processing system

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111816218A (zh) * 2020-07-31 2020-10-23 平安科技(深圳)有限公司 语音端点检测方法、装置、设备及存储介质
CN112420069A (zh) * 2020-11-18 2021-02-26 北京云从科技有限公司 一种语音处理方法、装置、机器可读介质及设备
CN112562649B (zh) * 2020-12-07 2024-01-30 北京大米科技有限公司 一种音频处理的方法、装置、可读存储介质和电子设备
CN112599151B (zh) * 2020-12-07 2023-07-21 携程旅游信息技术(上海)有限公司 语速评估方法、系统、设备及存储介质
CN112712820A (zh) * 2020-12-25 2021-04-27 广州欢城文化传媒有限公司 一种音色分类方法、装置、设备和介质
CN112735385A (zh) * 2020-12-30 2021-04-30 科大讯飞股份有限公司 语音端点检测方法、装置、计算机设备及存储介质
CN112767952A (zh) * 2020-12-31 2021-05-07 苏州思必驰信息科技有限公司 语音唤醒方法和装置
CN112634882B (zh) * 2021-03-11 2021-06-04 南京硅基智能科技有限公司 端到端实时语音端点检测神经网络模型、训练方法
CN113113001A (zh) * 2021-04-20 2021-07-13 深圳市友杰智新科技有限公司 人声激活检测方法、装置、计算机设备和存储介质
CN113327630B (zh) * 2021-05-27 2023-05-09 平安科技(深圳)有限公司 语音情绪识别方法、装置、设备及存储介质
CN113470698B (zh) * 2021-06-30 2023-08-08 北京有竹居网络技术有限公司 一种说话人转换点检测方法、装置、设备及存储介质
CN113724720B (zh) * 2021-07-19 2023-07-11 电信科学技术第五研究所有限公司 一种基于神经网络和mfcc的嘈杂环境下非人声语音过滤方法
CN113421595B (zh) * 2021-08-25 2021-11-09 成都启英泰伦科技有限公司 一种利用神经网络的语音活性检测方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107039035A (zh) * 2017-01-10 2017-08-11 上海优同科技有限公司 一种语音起始点和终止点的检测方法
US20190156832A1 (en) * 2017-11-21 2019-05-23 International Business Machines Corporation Diarization Driven by the ASR Based Segmentation
CN109801646A (zh) * 2019-01-31 2019-05-24 北京嘉楠捷思信息技术有限公司 一种基于融合特征的语音端点检测方法和装置
CN111354378A (zh) * 2020-02-12 2020-06-30 北京声智科技有限公司 语音端点检测方法、装置、设备及计算机存储介质
CN111816218A (zh) * 2020-07-31 2020-10-23 平安科技(深圳)有限公司 语音端点检测方法、装置、设备及存储介质

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107610707B (zh) * 2016-12-15 2018-08-31 平安科技(深圳)有限公司 一种声纹识别方法及装置
CN109801635A (zh) * 2019-01-31 2019-05-24 北京声智科技有限公司 一种基于注意力机制的声纹特征提取方法及装置
CN109801634B (zh) * 2019-01-31 2021-05-18 北京声智科技有限公司 一种声纹特征的融合方法及装置
CN110136749B (zh) * 2019-06-14 2022-08-16 思必驰科技股份有限公司 说话人相关的端到端语音端点检测方法和装置
CN111161713A (zh) * 2019-12-20 2020-05-15 北京皮尔布莱尼软件有限公司 一种语音性别识别方法、装置及计算设备

Also Published As

Publication number Publication date
CN111816218A (zh) 2020-10-23

Similar Documents

Publication Publication Date Title
WO2021139425A1 (zh) Voice endpoint detection method, apparatus, device and storage medium
Tan et al. rVAD: An unsupervised segment-based robust voice activity detection method
EP3955246B1 (en) Voiceprint recognition method and device based on memory bottleneck feature
US11631404B2 (en) Robust audio identification with interference cancellation
Dinkel et al. End-to-end spoofing detection with raw waveform CLDNNS
US8160877B1 (en) Hierarchical real-time speaker recognition for biometric VoIP verification and targeting
KR100636317B1 (ko) 분산 음성 인식 시스템 및 그 방법
WO2020181824A1 (zh) 声纹识别方法、装置、设备以及计算机可读存储介质
Deshwal et al. Feature extraction methods in language identification: a survey
WO2014153800A1 (zh) 语音识别系统
CN112102850B (zh) 情绪识别的处理方法、装置、介质及电子设备
CN112927694B (zh) 一种基于融合声纹特征的语音指令合法性判别方法
CN108091340B (zh) 声纹识别方法、声纹识别系统和计算机可读存储介质
Pao et al. Combining acoustic features for improved emotion recognition in mandarin speech
Nawas et al. Speaker recognition using random forest
CN112992153B (zh) Audio processing method, voiceprint recognition method, apparatus, and computer device
Zhang et al. Depthwise separable convolutions for short utterance speaker identification
Singh et al. Novel feature extraction algorithm using DWT and temporal statistical techniques for word dependent speaker’s recognition
Kaminski et al. Automatic speaker recognition using a unique personal feature vector and Gaussian Mixture Models
WO2021217979A1 (zh) Voiceprint recognition method, apparatus, device and storage medium
CN114512133A (zh) Sound-emitting object recognition method, apparatus, server and storage medium
Sas et al. Gender recognition using neural networks and ASR techniques
JPH01255000A (ja) Apparatus and method for selectively adding noise to templates used in a speech recognition system
Chaudhari et al. Effect of varying MFCC filters for speaker recognition
Chen et al. End-to-end speaker-dependent voice activity detection

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20911657

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20911657

Country of ref document: EP

Kind code of ref document: A1