WO2020020375A1 - Speech processing method, apparatus, electronic device, and readable storage medium - Google Patents

Speech processing method, apparatus, electronic device, and readable storage medium

Info

Publication number
WO2020020375A1
Authority
WO
WIPO (PCT)
Prior art keywords
frame
voice
speech
center vector
feature center
Prior art date
Application number
PCT/CN2019/098023
Other languages
English (en)
French (fr)
Inventor
辛颖
Original Assignee
北京三快在线科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京三快在线科技有限公司
Publication of WO2020020375A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/02 - Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G10L17/20 - Pattern transformations or operations aimed at increasing system robustness, e.g. against channel noise or different working conditions

Definitions

  • the embodiments of the present application relate to the technical field of speech recognition, and in particular, to a speech processing method, device, electronic device, and readable storage medium.
  • patent application CN107610707A proposes a voiceprint recognition method, apparatus, electronic device, and readable storage medium: first, the voice data is denoised through preprocessing to obtain valid voice data; then, MFCC (Mel-frequency cepstrum coefficient) acoustic features are extracted from the valid voice data to obtain a feature matrix whose dimensions are the MFCC dimension and the number of speech frames; finally, the speaker of the voice data is determined from a preset feature matrix set according to the feature matrix.
  • the denoising operation is computationally complex, which slows down speech recognition; moreover, the denoising is targeted at specific noise and cannot guarantee that all noise is removed, resulting in low speech recognition accuracy.
  • the present application provides a speech processing method, device, electronic device, and readable storage medium to solve the foregoing problems of speech processing in related technologies.
  • a speech processing method includes: obtaining a plurality of voice frames into which a voice file corresponding to a target object is divided according to a preset frame length; for each voice frame, generating a feature vector of the voice frame; clustering the feature vectors of the voice frames to generate feature center vectors; and, according to the feature center vector of a reference noise frame, determining the feature center vectors containing voice information from the feature center vectors of the voice frames, and generating a target voice feature center vector; the reference noise frame is a noise frame among the plurality of voice frames.
  • the target speech feature center vector is used to determine identity information of the target object.
  • a voice processing apparatus includes:
  • a voice frame division module, configured to obtain a plurality of voice frames into which a voice file corresponding to a target object is divided according to a preset frame length;
  • a feature vector generation module, configured to generate, for each voice frame, a feature vector of the voice frame;
  • a feature center vector generation module, configured to cluster the feature vectors of the voice frames to generate feature center vectors; and
  • a target speech feature center vector generation module, configured to determine, according to the feature center vector of a reference noise frame, a feature center vector containing voice information from the feature center vectors of the voice frames, and generate a target voice feature center vector;
  • the reference noise frame is a noise frame among the plurality of voice frames;
  • the target speech feature center vector is used to determine identity information of the target object.
  • an electronic device including:
  • a processor, a memory, and a computer program stored on the memory and executable on the processor, where the processor implements the foregoing voice processing method when executing the program.
  • a readable storage medium is provided, and when an instruction in the storage medium is executed by a processor of an electronic device, the electronic device is capable of performing the foregoing voice processing method.
  • An embodiment of the present application provides a voice processing method, apparatus, electronic device, and readable storage medium.
  • the method includes: obtaining a plurality of voice frames into which a voice file corresponding to a target object is divided according to a preset frame length; for each voice frame, generating a feature vector of the voice frame; clustering the feature vectors of the voice frames to generate feature center vectors; and determining, according to the feature center vector of a reference noise frame, the feature center vectors containing voice information from the feature center vectors of the voice frames, and generating a target voice feature center vector.
  • the reference noise frame is a noise frame among the plurality of voice frames
  • the target voice feature center vector is used to determine identity information of the target object.
  • the feature center vectors of the voice frames containing voice information are determined according to the feature center vector of the reference noise frame, and the target voice feature center vector is then generated from those feature center vectors; the target voice feature center vector is used for target recognition. Because features are extracted directly from the voice file and recognition is based on these features, there is no need to denoise the voice file; this avoids the slow processing and residual noise of denoising in the related art, weakens the influence of noise, and improves both the speed and accuracy of recognition.
  • FIG. 1 is a flowchart of specific steps of a voice processing method under a system architecture provided by an embodiment of the present application
  • FIG. 2 is a flowchart of specific steps of another voice processing method under a system architecture provided by an embodiment of the present application
  • FIG. 3 is a structural diagram of a voice processing device provided by an embodiment of the present application.
  • FIG. 4 is a structural diagram of another speech processing apparatus according to an embodiment of the present application.
  • FIG. 5 is a structural diagram of an electronic device according to an embodiment of the present application.
  • the voice processing method provided in the embodiment of the present application may be applied to an electronic device, and the electronic device may be a mobile phone, a tablet computer, a home device, or the like.
  • Household equipment can be TVs, speakers, air conditioners, refrigerators, microwave ovens, etc.
  • the voice processing method can be widely applied in various practical application scenarios; for example, it can be applied in scenarios such as the following:
  • when the electronic device is a home device and the target object wants the home device to be turned off, the target object sends a voice file to the home device, for example, "Please turn off"; when the home device, using the voice processing method, recognizes that the target object is the holder of the household equipment, it performs the shutdown operation.
  • Referring to FIG. 1, a flowchart of steps of a voice processing method is shown, including:
  • Step 101 Obtain a plurality of voice frames into which a voice file corresponding to the target object is divided according to a preset frame length.
  • the target object may be a target person or a target animal, wherein the target person is a person who needs to be identified, and the target animal is an animal who needs to be identified.
  • the target object is taken as an example for description.
  • the voice file can be a real-time recorded voice file or a pre-recorded voice file.
  • the preset frame length can be set according to the actual application scenario and experience values, which are not limited in the embodiment of the present application. According to the short-term smoothness of the voice, the preset frame length is usually set to 10 ms to 32 ms. In this application, a preset frame length of 30 milliseconds is taken as an example for description.
  • the embodiment of the present application implements framing through a window function.
  • the window functions include, but are not limited to, rectangular windows, triangular windows, Hamming windows, and Hanning windows.
  • a Hamming window is preferred. It can be understood that the preset frame length is the width of the window function.
  • adjacent voice frames overlap, and the length of the overlapping portion is 50% to 80% of the preset frame length.
  • the length of the overlapping portion is preferably 50% of the preset frame length.
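  • As an illustration of the framing described above, the following Python sketch divides a waveform into frames of a preset length and applies a Hamming window, with 50% overlap between adjacent frames; the sample rate, function name, and array layout are illustrative assumptions rather than part of this application.

```python
import numpy as np

def frame_signal(samples, sample_rate=16000, frame_ms=30, overlap=0.5):
    """Split a 1-D waveform into overlapping frames and apply a Hamming window.

    frame_ms=30 and overlap=0.5 mirror the preferred values in the text;
    both can be tuned for the actual application scenario.
    """
    samples = np.asarray(samples, dtype=float)
    frame_len = int(sample_rate * frame_ms / 1000)   # samples per frame (window width)
    hop = int(frame_len * (1.0 - overlap))           # frame shift between adjacent frames
    window = np.hamming(frame_len)                   # Hamming window is preferred
    frames = [samples[start:start + frame_len] * window
              for start in range(0, len(samples) - frame_len + 1, hop)]
    return np.array(frames)                          # shape: (num_frames, frame_len)
```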
  • a voice frame including only noise is referred to as a noise frame.
  • Step 102 For each speech frame, a feature vector of the speech frame is generated.
  • the feature vector of the speech frame represents the energy feature of the speech frame.
  • the feature vector of the speech frame may be generated based on the Mel spectrum and / or the discrete cosine coefficient and / or the Mel frequency cepstrum coefficient of the speech frame.
  • the Mel spectrum is obtained by performing a log-domain conversion on the power spectrum of the speech frame. It can be understood that the power spectrum is the relationship between frequency and power, and power is the energy expression of sound.
  • the discrete cosine coefficient and Mel frequency cepstrum coefficient can be obtained by performing discrete cosine transform on the Mel spectrum.
  • the embodiment of the present application directly extracts feature information from a voice frame and generates a feature vector, thereby eliminating the need to perform denoising processing on the voice file; this solves the problems in the related art that the denoising processing slows down recognition and that the inability to remove all noise lowers the accuracy.
  • the feature vector of each speech frame is composed of one or more of the Mel spectrum, the discrete cosine coefficient, and the Mel frequency cepstrum coefficients of the speech frame.
  • the Mel spectrum is the Mel-domain energy of the target object, which is used to distinguish the target object's sound from noise; the discrete cosine coefficient and the Mel frequency cepstrum coefficients distinguish the characteristics of the target object.
  • when the target object is the target person,
  • the Mel spectrum is the Mel-domain energy of the human voice, which is used to distinguish the human voice from noise,
  • and the discrete cosine coefficient and the Mel frequency cepstrum coefficients distinguish the characteristics of the human voice.
  • Step 103 Cluster feature vectors of each speech frame to generate a feature center vector.
  • the feature vector of each voice frame may be used as an initial value for clustering, so that the noise feature or voice feature of each voice frame is clustered to obtain the noise feature center vector or the voice feature center vector of the voice frame.
  • the clustering algorithm can use algorithms such as k-means (k-means clustering algorithm), fuzzy-c-means (fuzzy c-means clustering algorithm), and EM (Expectation-Maximization algorithm).
  • the k-means algorithm performs clustering with k points in space as the center, and classifies the objects closest to the k points. Through the iterative method, the values of each cluster center are updated one by one until the optimal clustering result is obtained.
  • the fuzzy-c-means algorithm obtains the membership degree of each sample point to all cluster centers by optimizing the objective function, so as to determine the category of the sample points to achieve the purpose of automatically clustering the sample data.
  • the EM algorithm looks for the parameter maximum likelihood estimation or maximum posterior estimation in the probability model.
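  • One way to realize the clustering in step 103 is a plain k-means loop, as in the hedged sketch below; the number of clusters, the random initialisation, and the helper name are illustrative assumptions, since the embodiment leaves the exact clustering setup open.

```python
import numpy as np

def kmeans_centers(feature_vectors, k=2, iters=100, seed=0):
    """Cluster per-frame feature vectors and return k feature center vectors.

    feature_vectors: array of shape (num_frames, dim); k=2 is an illustrative
    choice (e.g. one noise-like and one speech-like cluster).
    """
    feature_vectors = np.asarray(feature_vectors, dtype=float)
    rng = np.random.default_rng(seed)
    # initialise the centers from randomly chosen frames
    centers = feature_vectors[rng.choice(len(feature_vectors), k, replace=False)]
    labels = np.zeros(len(feature_vectors), dtype=int)
    for _ in range(iters):
        # assign each frame to its nearest center
        dists = np.linalg.norm(feature_vectors[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # move each center to the mean of its assigned frames
        new_centers = np.array([feature_vectors[labels == j].mean(axis=0)
                                if np.any(labels == j) else centers[j]
                                for j in range(k)])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels
```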
  • Step 104 According to the feature center vector of the reference noise frame, determine the feature center vector containing the voice information from the feature center vector of each voice frame, and generate a target voice feature center vector.
  • the reference noise frame is a noise frame among a plurality of voice frames.
  • the reference noise frame is a pure noise frame or a voice frame whose noise power exceeds a certain threshold among a plurality of voice frames.
  • the feature center vector of the reference noise frame is compared with the feature center vectors of the other speech frames to find the feature center vectors that differ significantly from it, and those feature center vectors are determined to be the feature center vectors containing voice information.
  • the feature center vector containing the voice information is spliced to generate the target voice feature center vector.
  • the target speech feature center vector is used to determine the identity information of the target object. The device that generates the target speech feature center vector may determine the identity information of the target object based on it, or another device may do so; that is, the device that generates the target speech feature center vector and the device that uses it may be the same or different, which is not specifically limited in the embodiment of the present application.
  • when the device that generates the target speech feature center vector and the device that uses it are the same, the identity information of the target object is determined according to the target speech feature center vector after step 104 is performed; when they are different, the target speech feature center vector is sent to the other device, and the other device determines the identity information of the target object according to the target speech feature center vector.
  • the target voice feature center vector of the target object can be compared with the target voice feature center vector of the reference object that determines the identity to determine whether the target object is a reference object. If the two target speech feature center vectors are close, it means that the target object is a reference object; otherwise, the target object is not a reference object.
  • a large number of target speech feature center vectors of reference objects can be saved in a database, so that it can be judged whether the target object is an object in the database. It can be understood that, in the extreme case, when the target speech feature center vectors of all target objects are saved in the database, the identity information of any person can be confirmed.
  • the embodiment of the present application provides a voice processing method.
  • the method includes: obtaining a plurality of voice frames into which a voice file corresponding to a target object is divided according to a preset frame length; for each voice frame, generating a feature vector of the voice frame; clustering the feature vectors of the voice frames to generate feature center vectors; and determining, according to the feature center vector of a reference noise frame, the feature center vectors containing voice information from the feature center vectors of the voice frames, and generating a target voice feature center vector, where the reference noise frame is a noise frame among the plurality of voice frames, and the target voice feature center vector is used to determine the identity information of the target object.
  • the feature center vectors of the voice frames containing voice information are determined according to the feature center vector of the reference noise frame, and the target voice feature center vector is then generated from those feature center vectors; the target voice feature center vector is used for target recognition. Because features are extracted directly from the voice file and recognition is based on these features, there is no need to denoise the voice file; this avoids the slow processing and residual noise of denoising in the related art, weakens the influence of noise, and improves both the speed and accuracy of recognition.
  • the embodiment of the present application describes the optional speech processing method from the level of the system architecture.
  • Referring to FIG. 2, a flowchart of specific steps of another speech processing method is shown.
  • Step 201 Acquire a plurality of voice frames into which a voice file corresponding to the target object is divided according to a preset frame length.
  • step 101 For this step, refer to the detailed description of step 101, and details are not described herein again.
  • Step 202 For each voice frame, determine a Mel spectrum of the voice frame.
  • the Mel spectrum can distinguish between speech frames and noise frames.
  • the above-mentioned step 202 includes a sub-step 2021:
  • Sub-step 2021 Determine the power spectrum of the speech frame.
  • the power spectrum can be calculated based on the frequency spectrum.
  • the above-mentioned sub-step 2021 includes sub-steps 20211 to 20212:
  • Sub-step 20211 a Fourier transform is performed on the speech frame to obtain a frequency spectrum of the speech frame.
  • the speech frame includes a plurality of discrete signals. Specifically, for the n-th discrete signal x_i(n) of the i-th voice frame, the frequency spectrum F_i(k) of the i-th voice frame is calculated by Formula 1 below, the discrete Fourier transform:
  • F_i(k) = Σ_{n=0}^{N-1} x_i(n) · e^(-j2πnk/N),  k = 0, 1, ..., N-1    (Formula 1)
  • where N is the number of points of the Fourier transform, which can be set according to the actual application scenario; in practical applications, N is usually 256.
  • Sub-step 20212 calculating the square of the spectrum of the voice frame to obtain the power spectrum of the voice frame.
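  • Sub-steps 20211 and 20212 can be sketched as follows; the 256-point transform size follows the example above, while the helper name and the reading of "square of the spectrum" as the squared magnitude are assumptions.

```python
import numpy as np

def power_spectrum(frame, n_fft=256):
    """Sub-steps 20211-20212: DFT of a windowed frame, then the squared magnitude."""
    spectrum = np.fft.fft(frame, n=n_fft)    # F_i(k), k = 0..N-1 (Formula 1)
    return np.abs(spectrum) ** 2             # P_i(k), read here as |F_i(k)|^2
```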
  • Sub-step 2022 Calculate the Mel spectrum of the speech frame according to the power spectrum of the speech frame.
  • the Mel spectrum is obtained by filtering the power spectrum with a Mel frequency filter.
  • the above-mentioned sub-step 2022 includes a sub-step 20221:
  • the power spectrum of the speech frame is filtered through a preset triangular band-pass filter to obtain a Mel spectrum of the speech frame.
  • the Mel frequency filter is implemented by a set of triangular bandpass filters, so that the obtained Mel spectrum of the speech frame can meet the masking effect of the human ear, so that the low-frequency component is strengthened and the influence of noise is shielded.
  • the Mel frequency filter is preferably 24 triangular bandpass filters.
  • the triangular band-pass filter H (k) is represented by the following formula 3:
  • the calculation formula of the Mel spectrum M_i(k) is shown in Equation 4 below:
  • M_i(k) is the Mel spectrum of the i-th voice frame;
  • H(k) is the triangular band-pass filter;
  • P_i(k) is the power spectrum of the i-th voice frame.
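  • A minimal sketch of sub-step 20221 / Equation 4 follows: the power spectrum is weighted by a bank of triangular band-pass filters (24 filters are preferred above). Constructing the filter bank itself is standard but verbose, so it is assumed here to be supplied as an array; the function name and shapes are assumptions.

```python
import numpy as np

def mel_spectrum(power_spec, filter_bank):
    """Equation 4: weight the power spectrum P_i(k) by each triangular filter H(k).

    power_spec:  P_i(k), shape (n_bins,)
    filter_bank: shape (n_filters, n_bins), e.g. 24 triangular band-pass filters
                 laid out on the Mel scale (construction omitted here).
    """
    return filter_bank @ power_spec          # one Mel energy per filter
```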
  • a sub-step 2023 is further included:
  • Sub-step 2023 calculating the discrete cosine coefficient and the Mel frequency cepstrum coefficient of the speech frame according to the power spectrum of the speech frame.
  • the discrete cosine coefficient and the Mel frequency cepstrum coefficient can be obtained by performing a discrete cosine transform on a log-domain power spectrum.
  • Sub-step 20231 the Mel spectrum of the speech frame is converted to the logarithmic domain to obtain the log-domain power spectrum of the speech frame.
  • a logarithmic domain power spectrum is obtained by taking a logarithm of the Mel spectrum of the speech frame, so as to be as close as possible to the hearing characteristics of the human ear, that is, logarithmic sensing.
  • the above-mentioned sub-step 20231 includes sub-steps 202311 to 202316:
  • Sub-step 202311 For each power point on the power spectrum of the voice frame, obtain the frequency and power of the power point.
  • the embodiments of the present application implement logarithmic domain conversion of the entire power spectrum by converting each power point on the power spectrum.
  • Sub-step 202312 dividing the frequency corresponding to the power point by a preset first conversion parameter to obtain a first intermediate value.
  • the calculation formula of the first intermediate value MV_1 is shown in Equation 5 below:
  • MV_1 = m / P_1    (Equation 5)
  • where P_1 is the first conversion parameter, preferably 700, and m is the frequency corresponding to the power point.
  • Sub-step 202313 the first intermediate value is added to a preset second conversion parameter to obtain a second intermediate value.
  • the calculation formula of the second intermediate value MV_2 is shown in Equation 6 below:
  • MV_2 = MV_1 + P_2    (Equation 6)
  • where P_2 is the second conversion parameter, preferably 1.
  • Sub-step 202314 taking a logarithm of the second intermediate value to obtain a third intermediate value.
  • the calculation formula of the third intermediate value MV_3 is shown in Equation 7 below:
  • MV_3 = log(MV_2)    (Equation 7)
  • Sub-step 202315 calculating a product of the third intermediate value and a preset third conversion parameter to obtain a logarithmic conversion value:
  • M = P_3 · MV_3
  • where P_3 is the third conversion parameter, preferably 2595. Taken together, sub-steps 202312 to 202315 give the standard Mel-scale conversion M = 2595 · log10(1 + m / 700).
  • P_1, P_2, and P_3 can be appropriately adjusted according to the actual application scenario, which is not limited in the embodiments of the present application.
  • Sub-step 202316 the logarithmic conversion value of each power point and the corresponding power are combined into a log-domain power spectrum.
  • that is, each frequency k is converted into M(k), and the power corresponding to frequency k, together with M(k), constitutes the log-domain power spectrum.
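  • Sub-steps 202311 to 202316 amount to converting the frequency of each power point to the log (Mel) domain while keeping its power; the sketch below uses the preferred parameter values P_1 = 700, P_2 = 1, and P_3 = 2595, and assumes the base-10 logarithm, which matches the standard Mel conversion.

```python
import numpy as np

def log_domain_power_spectrum(freqs, powers, p1=700.0, p2=1.0, p3=2595.0):
    """Sub-steps 202311-202316: map each power point's frequency to the log (Mel) domain.

    freqs, powers: arrays of the same length describing the power spectrum.
    Returns (converted_freqs, powers), which together form the log-domain power spectrum.
    """
    freqs = np.asarray(freqs, dtype=float)
    mv1 = freqs / p1          # Equation 5: first intermediate value
    mv2 = mv1 + p2            # Equation 6: second intermediate value
    mv3 = np.log10(mv2)       # Equation 7: third intermediate value (base-10 log assumed)
    m = p3 * mv3              # logarithmic conversion value M(k)
    return m, powers
```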
  • Sub-step 20232 performing a discrete cosine transform on the log-domain power spectrum of the speech frame to obtain a discrete cosine coefficient and a Mel frequency cepstrum coefficient of the speech frame.
  • the Mel frequency cepstrum coefficients are determined from the discrete cosine transform coefficients.
  • the log-domain power spectrum of the speech frame is subjected to a discrete cosine transform; the first coefficient after the discrete cosine transform is determined as the discrete cosine coefficient of the speech frame, and the other coefficients after the discrete cosine transform are determined as the Mel frequency cepstrum coefficients.
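  • Sub-step 20232 can be sketched with a type-II discrete cosine transform; treating the first coefficient as the discrete cosine coefficient and the remaining coefficients as the Mel frequency cepstrum coefficients follows the text above, while the use of scipy and the number of retained coefficients are assumptions.

```python
import numpy as np
from scipy.fftpack import dct

def cepstral_features(log_mel_energies, num_ceps=13):
    """Sub-step 20232: discrete cosine transform of the log-domain Mel energies.

    Returns (discrete_cosine_coefficient, mfccs): the first DCT coefficient is
    treated as the discrete cosine coefficient, the following ones as the
    Mel frequency cepstrum coefficients (num_ceps is an illustrative cut-off).
    """
    coeffs = dct(np.asarray(log_mel_energies, dtype=float), type=2, norm='ortho')
    return coeffs[0], coeffs[1:num_ceps]
```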
  • Step 203 Generate a feature vector of the speech frame according to the Mel spectrum of the speech frame.
  • the Mel spectrum can be used as the feature vector alone, and the Mel spectrum can be linearly or non-linearly converted to obtain the feature vector.
  • step 203 includes sub-step 2031:
  • Sub-step 2031 Splice the Mel spectrum, discrete cosine coefficient, and Mel frequency cepstrum coefficients of the speech frame into the feature vector of the speech frame. It can be understood that the embodiment of the present application does not limit the splicing order of the Mel spectrum, the discrete cosine coefficient, and the Mel frequency cepstrum coefficients. For example, the discrete cosine coefficient may be spliced after the Mel spectrum, followed by the Mel frequency cepstrum coefficients; the Mel spectrum may be spliced after the discrete cosine coefficient, followed by the Mel frequency cepstrum coefficients; the Mel spectrum may be spliced after the Mel frequency cepstrum coefficients, followed by the discrete cosine coefficient; or the discrete cosine coefficient may be spliced after the Mel frequency cepstrum coefficients, followed by the Mel spectrum.
  • the Mel spectrum, discrete cosine coefficient, and Mel frequency cepstrum coefficient of the speech frame are spliced into a feature vector of the speech frame, so that the obtained feature vector of the speech frame carries more information. It is easier to help distinguish noise from speech, and improves the accuracy of subsequent speech processing.
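  • The splicing in sub-step 2031 is a simple concatenation; the sketch below uses one of the permitted orders (Mel spectrum, then discrete cosine coefficient, then Mel frequency cepstrum coefficients), which is an illustrative choice.

```python
import numpy as np

def splice_feature_vector(mel_spec, dc_coefficient, mfccs):
    """Sub-step 2031: concatenate Mel spectrum, discrete cosine coefficient and MFCCs.

    The splicing order is not limited by the text; this is just one permitted order.
    """
    return np.concatenate([np.atleast_1d(mel_spec),
                           np.atleast_1d(dc_coefficient),
                           np.atleast_1d(mfccs)])
```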
  • Step 204 Cluster feature vectors of each speech frame to generate a feature center vector.
  • step 103 For this step, refer to the detailed description of step 103, and details are not described herein again.
  • Step 205 Determine a feature center vector of a reference noise frame from the feature center vectors of each voice frame, where the reference noise frame is a noise frame among a plurality of voice frames.
  • the reference noise frame is usually the first frame among the voice frames, and the reference noise frame either does not contain voice information or has a noise power greater than or equal to a preset threshold. Accordingly, this step may be: selecting the first frame from the voice frames; if the first frame does not contain voice information or its noise power is greater than or equal to the preset threshold, the first frame is used as the reference noise frame.
  • if the first frame contains voice information or its noise power is less than the preset threshold, it is determined that the first frame is not the reference noise frame, and another frame is selected. If that frame does not contain voice information or its noise power is greater than or equal to the preset threshold, it is used as the reference noise frame; otherwise, another voice frame is selected, and the process repeats until a voice frame that does not contain voice information or whose noise power is greater than or equal to the preset threshold is selected as the reference noise frame.
  • the selection can be made in frame order; for example, if the first frame is not a reference noise frame, the second frame is selected; if the second frame is not a reference noise frame, the third frame is selected, and so on, until a reference noise frame is selected.
  • the reference noise frame is generally the first frame or one of the first few frames of the voice file; therefore, selecting candidates in frame order when determining the reference noise frame allows the reference noise frame to be found relatively quickly, which improves efficiency. A sketch of this frame-order search is given below.
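  • The frame-order search described above can be sketched as follows; contains_voice and noise_power are hypothetical helpers standing in for whatever voice-activity detection and noise-power estimate are actually used.

```python
def select_reference_noise_frame(frames, contains_voice, noise_power, power_threshold):
    """Step 205: scan the voice frames in order; the first frame that contains no
    voice information, or whose noise power is >= the preset threshold, is taken
    as the reference noise frame. Returns its index, or None if no frame qualifies."""
    for idx, frame in enumerate(frames):
        if (not contains_voice(frame)) or noise_power(frame) >= power_threshold:
            return idx
    return None
```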
  • Step 206 Calculate the distance between the feature center vector corresponding to the reference noise frame and the feature center vector of each speech frame.
  • the distance between the feature center vector corresponding to the reference noise frame and the feature center vector of each speech frame can be calculated.
  • alternatively, instead of calculating the distance between the feature center vector corresponding to the reference noise frame and the feature center vector of every voice frame, feature speech frames may first be selected from the voice frames, and the distance between the feature center vector corresponding to the reference noise frame and the feature center vector of each selected feature speech frame is then calculated.
  • Feature speech frames can be selected randomly, for example using a seeded random method. In practical applications, if the selection falls into a local optimum, the speech frames are re-selected; this avoids poor randomness in the selected feature speech frames preventing a good solution from being found.
  • the number of feature speech frames selected from the voice frames can be set and changed according to the size of the speech file, which is not limited in the embodiments of the present application; for example, 10 randomly selected speech frames are preferred in the embodiments of the present application.
  • the reference noise frame is used for comparison with each voice frame, and pure noise frames are eliminated, and only voice frames containing voice information are retained.
  • the distance can be calculated by using Euclidean distance or by other methods, which is not limited in the embodiment of the present application.
  • Step 207 For each voice frame, if the distance between the feature center vector of the reference noise frame and the feature center vector of the voice frame is greater than or equal to a preset second distance threshold, determine the feature center vector of the voice frame as a feature center vector containing voice information, and splice the feature center vectors containing voice information into the target voice feature center vector.
  • the second distance threshold may be set according to an actual application scenario, which is not limited in the embodiment of the present application.
  • if the distance between the feature center vector of the reference noise frame and the feature center vector of the voice frame is greater than or equal to the second distance threshold, it indicates that the voice frame includes not only noise but also voice information; that is, the feature center vector of the voice frame is determined to be a feature center vector containing voice information. If the distance between the feature center vector of the reference noise frame and the feature center vector of the voice frame is less than the second distance threshold, it indicates that the voice frame includes only noise information; that is, the feature center vector of the voice frame is determined not to be a feature center vector containing voice information.
  • after at least one feature center vector containing voice information has been determined, the at least one feature center vector containing voice information may be spliced into the target voice feature center vector.
  • the at least one feature center vector containing voice information may be spliced into the target voice feature center vector according to the order, in the voice file, of the voice frames whose feature center vectors contain the voice information.
  • the feature center vector of the voice frame containing the voice information can be spliced into the target voice feature center vector, so that the target voice feature center vector is a center vector that is noiseless and can reflect the sound characteristics of the target object.
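  • Steps 206 and 207 together can be sketched as below, using the Euclidean distance (one of the permitted distance measures) and a hypothetical second distance threshold; the function name and the assumption that frame center vectors are supplied in frame order are illustrative.

```python
import numpy as np

def build_target_voice_center_vector(noise_center, frame_centers, second_threshold):
    """Steps 206-207: keep the feature center vectors that lie far enough from the
    reference noise frame's feature center vector and splice them, in frame order,
    into the target voice feature center vector."""
    kept = []
    for center in frame_centers:                          # frame_centers is in frame order
        distance = np.linalg.norm(center - noise_center)  # Euclidean distance
        if distance >= second_threshold:                  # far from noise => contains voice
            kept.append(center)
    if not kept:
        return None                                       # no voice-bearing frames found
    return np.concatenate(kept)                           # spliced target voice feature center vector
```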
  • the target speech feature center vector is used to determine the identity information of the target object. The device that generates the target speech feature center vector may determine the identity information of the target object based on it, or another device may do so;
  • that is, the device that generates the target speech feature center vector and the device that uses it may be the same or different.
  • when they are the same device, step 208 is performed after step 207; when they are different devices, the target voice feature center vector is sent to the other device, and the other device determines the identity information of the target object through step 208.
  • Step 208 Determine the identity information of the target object according to the target speech feature center vector.
  • the step of determining the identity information of the target object according to the target speech feature center vector includes sub-steps A1 to A4:
  • Sub-step A1 a reference speech feature center vector is obtained, and the reference speech feature center vector corresponds to a preset reference object.
  • the reference speech feature center vector corresponds to a preset reference object, that is, the reference speech feature center vector is a target speech feature center vector of the preset reference object.
  • the preset reference object is an object in which a center vector of a speech feature is determined in advance.
  • the preset reference object is a person whose voice feature center vector is determined in advance.
  • the reference speech feature center vector of the preset reference object can be obtained through steps 201 to 207, and saved to the database. Therefore, when the identity information of the target object to be identified is confirmed, the reference speech feature center vector of the preset reference object is directly obtained from the database, and the target speech feature center vector of the target object is compared with the reference speech feature center vector. To identify the identity of the target object.
  • Sub-step A2 calculating a distance between the reference speech feature center vector and the target speech feature center vector.
  • the distance may be calculated by using Euclidean distance or by other methods, which are not limited in the embodiment of the present application.
  • the distance between two vectors can be calculated as the Euclidean distance, and the specific formula is shown in Equation 9 below:
  • D = sqrt( Σ_{j=1}^{J} (A(j) - B(j))² )    (Equation 9)
  • where A(j) and B(j) are the j-th components of the reference speech feature center vector and the target speech feature center vector, respectively, and J is the size of the reference speech feature center vector (equal to the size of the target speech feature center vector).
  • Sub-step A3 if the distance between the reference speech feature center vector and the target speech feature center vector is less than a preset first distance threshold, determine the target object as a reference object.
  • the first distance threshold may be set according to an actual application scenario, which is not limited in the embodiment of the present application.
  • in this case, the speech feature of the target object is similar to the speech feature of the reference object, and the two can thus be confirmed to be the same object.
  • Sub-step A4 if the distance between the reference speech feature center vector and the target speech feature center vector is greater than or equal to the preset first distance threshold, it is determined that the target object is not the reference object.
  • when the distance between the reference speech feature center vector and the target speech feature center vector is greater than or equal to the first distance threshold, the speech features of the target object and the reference object differ significantly, and the two can thus be confirmed to be different objects.
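  • Sub-steps A1 to A4 reduce to a thresholded Euclidean distance (Equation 9); the sketch below assumes the reference speech feature center vectors are kept in a simple in-memory dictionary, which is an illustrative stand-in for the database mentioned above.

```python
import numpy as np

def identify(target_vector, reference_vectors, first_threshold):
    """Sub-steps A1-A4: compare the target voice feature center vector with each
    reference object's vector; a distance below the first threshold confirms identity.

    reference_vectors: dict mapping object id -> reference speech feature center vector.
    Returns the matching object id, or None if the target matches no reference object.
    """
    for obj_id, ref_vector in reference_vectors.items():
        distance = np.linalg.norm(np.asarray(ref_vector) - np.asarray(target_vector))  # Equation 9
        if distance < first_threshold:
            return obj_id     # distance below threshold: target is this reference object
    return None
```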
  • step 208 is not a necessary step. After executing step 207, it may be directly ended, or step 208 may be performed to determine the identity information of the target object according to the target speech feature center vector.
  • an embodiment of the present application provides a voice processing method.
  • the method includes: obtaining a plurality of voice frames into which a voice file corresponding to a target object is divided according to a preset frame length; for each voice frame, generating a feature vector of the voice frame; clustering the feature vectors of the voice frames to generate feature center vectors; and determining, according to the feature center vector of a reference noise frame, the feature center vectors containing voice information from the feature center vectors of the voice frames, and generating a target voice feature center vector, where the reference noise frame is a noise frame among the plurality of voice frames, and the target voice feature center vector is used to determine the identity information of the target object.
  • the feature center vectors of the voice frames containing voice information are determined according to the feature center vector of the reference noise frame, and the target voice feature center vector is then generated from those feature center vectors; the target voice feature center vector is used for target recognition. Because features are extracted directly from the voice file and recognition is based on these features, there is no need to denoise the voice file; this avoids the slow processing and residual noise of denoising in the related art, weakens the influence of noise, and improves both the speed and accuracy of recognition.
  • Referring to FIG. 3, a structural diagram of a speech processing device is shown, as follows.
  • the voice frame dividing module 301 is configured to obtain a plurality of voice frames into which a voice file corresponding to a target object is divided according to a preset frame length.
  • a feature vector generating module 302 is configured to generate, for each voice frame, a feature vector of the voice frame.
  • a feature center vector generation module 303 is configured to cluster feature vectors of each speech frame to generate a feature center vector.
  • the target voice feature center vector generation module 304 is configured to determine a feature center vector containing voice information from the feature center vectors of each voice frame according to the feature center vector of a reference noise frame, and generate a target voice feature center vector.
  • the reference noise frame is a noise frame among the plurality of voice frames;
  • the target speech feature center vector is used to determine the identity information of the target object.
  • an embodiment of the present application provides a voice processing device.
  • the device includes: a voice frame division module, configured to obtain a plurality of voice frames into which a voice file corresponding to a target object is divided according to a preset frame length; a feature vector generation module, configured to generate, for each voice frame, a feature vector of the voice frame; a feature center vector generation module, configured to cluster the feature vectors of the voice frames to generate feature center vectors; and a target voice feature center vector generation module, configured to determine, according to the feature center vector of a reference noise frame, the feature center vectors containing voice information from the feature center vectors of the voice frames, and generate a target voice feature center vector.
  • the reference noise frame is a noise frame among a plurality of voice frames.
  • the target speech feature center vector is used to determine the identity information of the target object.
  • the feature center vectors of the voice frames containing voice information are determined according to the feature center vector of the reference noise frame, and the target voice feature center vector is then generated from those feature center vectors; the target voice feature center vector is used for target recognition. Because features are extracted directly from the voice file and recognition is based on these features, there is no need to denoise the voice file; this avoids the slow processing and residual noise of denoising in the related art, weakens the influence of noise, and improves both the speed and accuracy of recognition.
  • Referring to FIG. 4, a structural diagram of another speech processing apparatus is shown, as follows.
  • the voice frame dividing module 301 is configured to obtain a plurality of voice frames into which a voice file corresponding to a target object is divided according to a preset frame length.
  • a feature vector generating module 302 is configured to generate, for each voice frame, a feature vector of the voice frame.
  • the above-mentioned feature vector generation module 302 includes:
  • the Mel spectrum determining sub-module 3021 is configured to determine, for each voice frame, a Mel spectrum of the voice frame.
  • a feature vector generation sub-module 3022 is configured to generate a feature vector of the speech frame according to the Mel spectrum of the speech frame.
  • a feature center vector generation module 303 is configured to cluster feature vectors of each speech frame to generate a feature center vector.
  • the target voice feature center vector generation module 304 is configured to determine a feature center vector containing voice information from the feature center vectors of each voice frame according to the feature center vector of the reference noise frame, and generate a target voice feature center vector; the reference noise frame is a noise frame among the plurality of voice frames;
  • the target speech feature center vector is used to determine the identity information of the target object.
  • the target voice feature center vector generation module 304 includes:
  • the noise feature center vector determination submodule 3041 is configured to determine a feature center vector of a reference noise frame from the feature center vectors of each speech frame.
  • the first distance calculation submodule 3042 is configured to calculate a distance between a feature center vector corresponding to the reference noise frame and a feature center vector of each voice frame.
  • the target speech feature center vector generation sub-module 3043 is configured to, for each voice frame, if the distance between the feature center vector of the reference noise frame and the feature center vector of the voice frame is greater than or equal to a preset second distance threshold, determine the feature center vector of the voice frame as a feature center vector containing voice information, and splice the feature center vectors containing voice information into the target voice feature center vector.
  • the above-mentioned Mel spectrum determination submodule 3021 includes:
  • the power spectrum determining unit is configured to determine a power spectrum of the speech frame.
  • the Mel spectrum calculation unit is configured to calculate a Mel spectrum of the speech frame according to a power spectrum of the speech frame.
  • the above-mentioned Mel spectrum determination submodule 3021 further includes:
  • the Mel spectral coefficient calculation unit is configured to calculate a discrete cosine coefficient and a Mel frequency cepstrum coefficient of the speech frame according to a power spectrum of the speech frame.
  • the power spectrum determining unit includes:
  • the spectrum calculation subunit is configured to perform Fourier transform on the speech frame to obtain a frequency spectrum of the speech frame.
  • the power spectrum calculation subunit is configured to calculate a square of a frequency spectrum of the speech frame to obtain a power spectrum of the speech frame.
  • the above-mentioned Mel spectrum calculation unit includes:
  • the Mel spectrum calculation subunit is configured to filter the power spectrum of the speech frame by using a preset triangular band-pass filter to obtain a Mel spectrum of the speech frame.
  • the above-mentioned Mel spectral coefficient calculation unit includes:
  • the log domain conversion subunit is used to convert the Mel spectrum of the voice frame to the log domain to obtain the log domain power spectrum of the voice frame.
  • the Mel spectral coefficient calculation subunit is configured to perform a discrete cosine transform on the log-domain power spectrum of the speech frame to obtain the discrete cosine coefficient and the Mel frequency cepstrum coefficients of the speech frame; the Mel frequency cepstrum coefficients are determined from the discrete cosine transform coefficients.
  • the above-mentioned feature vector generation sub-module 3022 includes:
  • a feature vector splicing unit is configured to stitch the Mel spectrum, discrete cosine coefficient, and Mel frequency cepstrum coefficient of the speech frame into a feature vector of the speech frame.
  • the above-mentioned device further includes an identity information determining module, configured to determine the identity information of the target object according to the target speech feature center vector;
  • the identity information determination module includes:
  • the reference speech feature center vector acquisition submodule is configured to obtain a reference speech feature center vector, where the reference speech feature center vector corresponds to a preset reference object.
  • a second distance calculation submodule is configured to calculate a distance between the reference speech feature center vector and the target speech feature center vector.
  • the first identity confirmation sub-module is configured to determine that the target object is the reference object if the distance is less than a preset first distance threshold.
  • the second identity confirmation submodule is configured to determine that the target object is not the reference object if the distance is greater than or equal to a preset first distance threshold.
  • the above-mentioned log-domain conversion subunit includes:
  • the power point acquisition subunit is configured to acquire, for each power point on the power spectrum of the voice frame, the frequency and power of the power point.
  • the first intermediate value calculation subunit is configured to divide the frequency corresponding to the power point by a preset first conversion parameter to obtain a first intermediate value.
  • the second intermediate value calculation subunit is configured to add the first intermediate value to a preset second conversion parameter to obtain a second intermediate value.
  • the third intermediate value calculation subunit is configured to take a logarithm of the second intermediate value to obtain a third intermediate value.
  • a logarithmic conversion value calculation subunit is configured to calculate a product of the third intermediate value and a preset third conversion parameter to obtain a logarithmic conversion value.
  • the log-domain power spectrum generation subunit is configured to combine the logarithmic conversion value of each power point and the corresponding power into the log-domain power spectrum of the speech frame.
  • an embodiment of the present application provides a voice processing device.
  • the device includes: a voice frame division module, configured to obtain a plurality of voice frames into which a voice file corresponding to a target object is divided according to a preset frame length; a feature vector generation module, configured to generate, for each voice frame, a feature vector of the voice frame; a feature center vector generation module, configured to cluster the feature vectors of the voice frames to generate feature center vectors; and a target voice feature center vector generation module, configured to determine, according to the feature center vector of a reference noise frame, the feature center vectors containing voice information from the feature center vectors of the voice frames, and generate a target voice feature center vector.
  • the reference noise frame is a noise frame among the plurality of voice frames, and the target voice feature center vector is used to determine the identity information of the target object.
  • the feature center vectors of the voice frames containing voice information are determined according to the feature center vector of the reference noise frame, and the target voice feature center vector is then generated from those feature center vectors; the target voice feature center vector is used for target recognition. Because features are extracted directly from the voice file and recognition is based on these features, there is no need to denoise the voice file; this avoids the slow processing and residual noise of denoising in the related art, weakens the influence of noise, and improves both the speed and accuracy of recognition.
  • An embodiment of the present application further provides an electronic device including a processor, a memory, and a computer program stored on the memory and executable on the processor.
  • the processor executes the program to implement the voice processing method of the foregoing embodiment.
  • FIG. 5 is a schematic structural diagram of an electronic device according to an embodiment of the present application.
  • the electronic device 500 may be a smartphone, a tablet computer, an MP3 (Moving Picture Experts Group Audio Layer III) player, an MP4 (Moving Picture Experts Group Audio Layer IV) player, a laptop computer, or a desktop computer.
  • the electronic device 500 may also be referred to as a user device, a portable electronic device, a laptop electronic device, a desktop electronic device, or other names.
  • the electronic device 500 includes a processor 501 and a memory 502.
  • the processor 501 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and the like.
  • the processor 501 may be implemented in at least one hardware form among DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array).
  • the processor 501 may also include a main processor and a co-processor.
  • the main processor is a processor for processing data in the awake state, also called a CPU (Central Processing Unit).
  • the co-processor is a low-power processor for processing data in the standby state.
  • the processor 501 may be integrated with a GPU (Graphics Processing Unit), and the GPU is responsible for rendering and drawing content required to be displayed on the display screen.
  • the processor 501 may further include an AI (Artificial Intelligence) processor, and the AI processor is configured to process computing operations related to machine learning.
  • the memory 502 may include one or more computer-readable storage media, which may be non-transitory.
  • the memory 502 may also include high-speed random access memory, and non-volatile memory, such as one or more disk storage devices, flash storage devices.
  • the non-transitory computer-readable storage medium in the memory 502 is used to store at least one instruction, and the at least one instruction is executed by the processor 501 to implement the speech processing method provided by the method embodiments of this application.
  • the electronic device 500 may further include a peripheral device interface 503 and at least one peripheral device.
  • the processor 501, the memory 502, and the peripheral device interface 503 may be connected through a bus or a signal line.
  • Each peripheral device can be connected to the peripheral device interface 503 through a bus, a signal line, or a circuit board.
  • the peripheral device includes at least one of a radio frequency circuit 504, a display screen 505, a camera component 506, an audio circuit 507, a positioning component 508, and a power source 509.
  • the peripheral device interface 503 may be used to connect at least one peripheral device related to I / O (Input / Output) to the processor 501 and the memory 502.
  • the processor 501, the memory 502, and the peripheral device interface 503 are integrated on the same chip or circuit board; in some other embodiments, any one of the processor 501, the memory 502, and the peripheral device interface 503 or Two can be implemented on separate chips or circuit boards, which is not limited in this embodiment.
  • the radio frequency circuit 504 is used for receiving and transmitting an RF (Radio Frequency) signal, also called an electromagnetic signal.
  • the radio frequency circuit 504 communicates with a communication network and other communication devices through electromagnetic signals.
  • the radio frequency circuit 504 converts electrical signals into electromagnetic signals for transmission, or converts received electromagnetic signals into electrical signals.
  • the radio frequency circuit 504 includes: an antenna system, an RF transceiver, one or more amplifiers, a tuner, an oscillator, a digital signal processor, a codec chipset, a subscriber identity module card, and the like.
  • the radio frequency circuit 504 can communicate with other electronic devices through at least one wireless communication protocol.
  • the wireless communication protocol includes, but is not limited to: a metropolitan area network, various generations of mobile communication networks (2G, 3G, 4G, and 5G), a wireless local area network, and / or a WiFi (Wireless Fidelity) network.
  • the radio frequency circuit 504 may further include circuits related to Near Field Communication (NFC), which is not limited in this application.
  • the display screen 505 is used to display a UI (User Interface).
  • the UI may include graphics, text, icons, videos, and any combination thereof.
  • the display screen 505 also has the ability to collect touch signals on or above the surface of the display screen 505.
  • the touch signal can be input to the processor 501 as a control signal for processing.
  • the display screen 505 may also be used to provide a virtual button and / or a virtual keyboard, which is also called a soft button and / or a soft keyboard.
  • in some embodiments, one display screen 505 may be provided, disposed on the front panel of the electronic device 500.
  • in other embodiments, at least two display screens 505 may be provided on different surfaces of the electronic device 500 or adopt a folded design.
  • the display screen 505 may be a flexible display screen disposed on a curved surface or a folded surface of the electronic device 500.
  • the display screen 505 can also be set as a non-rectangular irregular figure, that is, a special-shaped screen.
  • the display screen 505 can be made of materials such as LCD (Liquid Crystal Display) and OLED (Organic Light-Emitting Diode).
  • the camera component 506 is used for capturing images or videos.
  • the camera component 506 includes a front camera and a rear camera.
  • a front camera is disposed on a front panel of an electronic device
  • a rear camera is disposed on a back of the electronic device.
  • in some embodiments, there are at least two rear cameras, each of which is one of a main camera, a depth-of-field camera, a wide-angle camera, and a telephoto camera, so that the main camera and the depth-of-field camera can be fused to realize a background blur function, and the main camera and the wide-angle camera can be fused to realize panoramic shooting, VR (Virtual Reality) shooting, or other fusion shooting functions.
  • the camera assembly 506 may further include a flash.
  • the flash can be a single color temperature flash or a dual color temperature flash.
  • a dual color temperature flash is a combination of a warm light flash and a cold light flash, which can be used for light compensation at different color temperatures.
  • the audio circuit 507 may include a microphone and a speaker.
  • the microphone is used to collect sound waves of the user and the environment, and convert the sound waves into electrical signals and input them to the processor 501 for processing, or input them to the radio frequency circuit 504 to implement voice communication.
  • the microphone can also be an array microphone or an omnidirectional acquisition microphone.
  • the speaker is used to convert electrical signals from the processor 501 or the radio frequency circuit 504 into sound waves.
  • the speaker can be a traditional film speaker or a piezoelectric ceramic speaker.
  • when the speaker is a piezoelectric ceramic speaker, it can not only convert electrical signals into sound waves audible to humans, but also convert electrical signals into sound waves inaudible to humans for ranging purposes.
  • the audio circuit 507 may further include a headphone jack.
  • the positioning component 508 is used to locate the current geographic position of the electronic device 500 to implement navigation or LBS (Location Based Service).
  • the positioning component 508 may be a positioning component based on the GPS (Global Positioning System) of the United States, the BeiDou system of China, the GLONASS system of Russia, or the Galileo system of the European Union.
  • the power supply 509 is used to supply power to various components in the electronic device 500.
  • the power source 509 may be an alternating current, a direct current, a disposable battery, or a rechargeable battery.
  • the rechargeable battery may support wired charging or wireless charging.
  • the rechargeable battery can also be used to support fast charging technology.
  • the electronic device 500 further includes one or more sensors 510.
  • the one or more sensors 510 include, but are not limited to, an acceleration sensor, a gyroscope sensor, a pressure sensor, a fingerprint sensor, an optical sensor, and a proximity sensor.
  • FIG. 5 does not constitute a limitation on the electronic device 500, and may include more or fewer components than shown, or combine certain components, or adopt different component arrangements.
  • the embodiment of the present application further provides a readable storage medium, and when the instructions in the storage medium are executed by the processor of the electronic device, the electronic device can execute the voice processing method of the foregoing embodiment.
  • as for the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple; for related parts, reference may be made to the description of the method embodiment.
  • the modules in the device of the embodiment can be adaptively changed and arranged in one or more devices different from this embodiment.
  • the modules or units or components in the embodiment may be combined into one module or unit or component, and they may furthermore be divided into a plurality of sub-modules or sub-units or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract and drawings) and all processes or units of any method or device so disclosed may be combined in any combination.
  • the various component embodiments of the present application may be implemented in hardware, or in software modules running on one or more processors, or in a combination thereof.
  • a microprocessor or a digital signal processor (DSP) may be used to implement some or all functions of some or all components in the voice processing device according to the embodiments of the present application.
  • the application may also be implemented as a device or device program for performing part or all of the method described herein.
  • Such a program that implements the present application may be stored on a computer-readable medium or may have the form of one or more signals. Such signals can be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.

Landscapes

  • Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A speech processing method and apparatus, an electronic device, and a readable storage medium. The speech processing method includes: obtaining a plurality of speech frames into which a speech file corresponding to a target object has been divided according to a preset frame length (101); for each speech frame, generating a feature vector of the speech frame (102); clustering the feature vectors of the speech frames to generate feature center vectors (103); and determining, according to the feature center vector of a reference noise frame, feature center vectors containing speech information from among the feature center vectors of the speech frames, and generating a target speech feature center vector, the reference noise frame being a noise frame among the plurality of speech frames (104), the target speech feature center vector being used to determine identity information of the target object. This solves the problems in the related art that denoising makes processing slow and that, because not all noise can be removed, speech-processing accuracy is low; features can be extracted directly and noise is attenuated, which improves both the speed and the accuracy of processing.

Description

语音处理方法、装置、电子设备及可读存储介质
本申请要求于2018年7月27日提交、申请号为201810842328.6、发明名称为“语音识别方法、装置、电子设备及可读存储介质”的中国专利申请的优先权,其全部内容都通过引用结合在本申请中。
技术领域
本申请实施例涉及语音识别技术领域,尤其涉及一种语音处理方法、装置、电子设备及可读存储介质。
背景技术
随着语音识别技术的迅速发展,语音作为身份识别的有效手段逐渐成熟。
相关技术中,专利申请CN107610707A提出了一种声纹识别方法、装置、电子设备及可读存储介质:首先,通过预处理对语音数据进行去噪,得到有效语音数据;然后,从有效语音数据中提取MFCC(Mel-frequency cepstral coefficients,梅尔频率倒谱系数)声学特征,得到MFCC维度及语音分帧数的特征矩阵;最后,根据特征矩阵从预设特征矩阵集中确定语音数据的说话人。
然而,去噪处理运算复杂度较大,导致语音识别速度较慢,且去噪处理具有针对性,无法保证去掉所有噪声,导致语音识别准确率较低。
发明内容
本申请提供一种语音处理方法、装置、电子设备及可读存储介质,以解决相关技术语音处理的上述问题。
根据本申请的一方面,提供了一种语音处理方法,所述方法包括:
获取目标对象对应的语音文件按照预设帧长划分的多个语音帧;
对于各语音帧,生成所述语音帧的特征向量;
对所述各语音帧的特征向量进行聚类,生成特征中心向量;
根据参考噪声帧的特征中心向量,从所述各语音帧的特征中心向量中确定包含语音信息的特征中心向量,生成目标语音特征中心向量,所述参考噪声帧为所述多个语音帧中的噪声帧;
所述目标语音特征中心向量用于确定所述目标对象的身份信息。
根据本申请的另一方面,提供了一种语音处理装置,所述装置包括:
语音帧划分模块,用于获取目标对象对应的语音文件按照预设帧长划分的多个语音帧;
特征向量生成模块,用于对于各语音帧,生成所述语音帧的特征向量;
特征中心向量生成模块,用于对所述各语音帧的特征向量进行聚类,生成特征中心向量;
目标语音特征中心向量生成模块,用于根据参考噪声帧的特征中心向量,从所述各语音帧的特征中心向量中确定包含语音信息的特征中心向量,生成目标语音特征中心向量,所述参考噪声帧为所述多个语音帧中的噪声帧;
所述目标语音特征中心向量用于确定所述目标对象的身份信息。
根据本申请的另一方面,提供了一种电子设备,包括:
处理器、存储器以及存储在所述存储器上并可在所述处理器上运行的计算机程序,所述处理器执行所述程序时实现前述语音处理方法。
根据本申请的另一方面,提供了一种可读存储介质,当所述存储介质中的指令由电子设备的处理器执行时,使得电子设备能够执行前述语音处理方法。
本申请实施例提供了一种语音处理方法、装置、电子设备及可读存储介质,所述方法包括:获取目标对象对应的语音文件按照预设帧长划分的多个语音帧;对于各语音帧,生成所述语音帧的特征向量;对所述各语音帧的特征向量进行聚类,生成特征中心向量;根据参考噪声帧的特征中心向量,从所述各语音帧的特征中心向量中确定包含语音信息的特征中心向量,生成目标语音特征中心向量,所述参考噪声帧为所述多个语音帧中的噪声帧,所述目标语音特征中心向量用于确定所述目标对象的身份信息。由于本申请实施例中对语音处理时,根据参考噪声帧的特征中心向量,确定出包含语音信息的语音帧的特征中心向量,然后基于包含语音信息的语音帧的特征中心向量,生成目标语音特征中心向量,目标语音特征中心向量用于进行目标对象的识别。由于直接从语音文件中提取特征,基于特征进行识别,从而不需要对语音文件进行去噪处理,解决了相关技术中去噪导致的识别较慢、无法去掉所有噪声导致语音处理准确度较低的问题,并且,能够直接从语音文件中提取特征,并将噪声弱化,提高了识别的速度和准确度。
附图说明
为了更清楚地说明本发明本申请实施例的技术方案,下面将对本发明本申请实施例的描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本发明本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动性的前提下,还可以根据这些附图获得其他的附图。
图1是本申请实施例提供的系统架构下的一种语音处理方法具体步骤流程图;
图2是本申请实施例提供的系统架构下的另一种语音处理方法具体步骤流程图;
图3是本申请实施例提供的一种语音处理装置的结构图;
图4是本申请实施例提供的另一种语音处理装置的结构图;
图5是本申请实施例提供的一种电子设备的结构图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例是本申请一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有作出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
本申请实施例提供的语音处理方法可以应用在电子设备中,该电子设备可以为手机、平板电脑、家居设备等。家居设备可以为电视机、音箱、空调、冰箱、微波炉等。并且,该语 音处理方法可以广泛地应用在各种实际应用场景中例如,可以应用在以下三个场景中:
(1)可以应用在对电子设备进行解锁的场景:当目标对象在对电子设备进行解锁时,目标对象向电子设备输入语音文件;电子设备基于该语音文件,通过该语音处理方法,识别出该目标对象是否为电子设备的持有者;当目标对象为电子设备的持有者时,对电子设备进行解锁。
(2)可以应用在电子设备进行资源转移的场景中:当目标对象在通过电子设备从第一账户向第二账户转移资源时,向电子设备输入语音文件;电子设备基于该语音文件,通过该语音处理方法,识别出该目标对象是否为第一账户的持有者;当目标对象为第一账户的持有者时,电子设备进行资源转移操作。
(3)可以应用在控制电子设备的场景中;当目标对象在控制电子设备时,向电子设备输入语音文件;电子设备基于该语音文件,通过该语音处理方法,识别出该目标对象是否为电子设备的持有者;当该目标对象为电子设备的持有者时,执行该语音文件对应的控制指令。
例如，当电子设备为家居设备时，当目标对象控制家居设备关机时，目标对象向家居设备输入语音文件；例如，该语音文件为“请关机”；家居设备基于该语音文件，通过该语音处理方法识别出该目标对象为该家居设备的持有者时，进行关机操作。
参照图1,其示出了一种语音处理方法的步骤流程图,包括:
步骤101,获取目标对象对应的语音文件按照预设帧长划分的多个语音帧。
其中，目标对象可以为目标人物或者目标动物，其中，目标人物为需要识别身份的人物，目标动物为需要识别身份的动物。在本申请实施例中，以目标对象为目标人物为例进行说明。语音文件可以为实时录入的语音文件，也可以为预先录入的语音文件。
预设帧长可以根据实际应用场景和经验值设定,本申请实施例对其不加以限制。依据语音短时平稳的特性,预设帧长通常设置为10毫秒至32毫秒。本申请优选以预设帧长为30毫秒为例进行说明。
具体地,本申请实施例通过窗函数实现分帧。其中,窗函数包括但不限于:矩形窗、三角窗、汉明窗、汉宁窗。本申请实施例优选汉明窗。可以理解,预设帧长为窗函数的宽度。
在实际应用中,为了防止频谱泄露,在分帧时连续两帧通常重叠一部分。根据经验值,重叠部分的长度为预设帧长的50%至80%。本申请实施例优选重叠部分的长度为预设帧长的50%。从而每次窗函数向前移动时,仅移动帧长的50%的长度。
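The framing just described can be sketched in a few lines of NumPy; the 16 kHz sampling rate, the function name and the array layout below are assumptions of this illustration rather than details fixed by the embodiment.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, frame_ms=30, overlap=0.5):
    """Split a 1-D waveform into 30 ms frames with 50% overlap and apply a Hamming window."""
    frame_len = int(sample_rate * frame_ms / 1000)   # window width = preset frame length
    hop = int(frame_len * (1 - overlap))             # the window advances by half a frame
    window = np.hamming(frame_len)
    assert len(signal) >= frame_len, "signal shorter than one frame"
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([signal[i * hop : i * hop + frame_len] * window
                       for i in range(n_frames)])
    return frames                                    # shape: (n_frames, frame_len)
```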
可以理解,对于各语音帧,有的语音帧只包括噪声,有的语音帧包括噪声和语音信息。在本申请实施例中,将只包括噪声的语音帧称为噪声帧。
步骤102,对于各语音帧,生成语音帧的特征向量。
其中,语音帧的特征向量代表了语音帧的能量特征。具体地,语音帧的特征向量可以基于语音帧的梅尔频谱和/或离散余弦系数和/或梅尔频率倒谱系数生成。
其中,梅尔频谱通过对语音帧的功率谱进行对数域转换得到。可以理解,功率谱是频率与功率的关系,功率为声音的能量表述。
离散余弦系数和梅尔频率倒谱系数可以通过对梅尔频谱进行离散余弦变换得到。
本申请实施例直接从语音帧中提取特征信息，生成特征向量，从而不需要对语音文件进行去噪处理，解决了相关技术中去噪处理导致的识别较慢、无法去掉所有噪声导致的语音处理准确度较低的问题。
在本申请实施例中,各语音帧的特征向量由该语音帧的梅尔频谱、离散余弦系数以及梅尔频谱倒谱系数中的一个或者多个组成。其中,梅尔频谱为目标对象的梅尔域能量,用于区别目标对象的声音和噪声;离散余弦系数和梅尔频谱倒谱系数可以区分目标对象的特征。例如,当目标对象为目标人物时,梅尔频谱为人声的梅尔域能量,用于区别人声的声音和噪声;离散余弦系数和梅尔频谱倒谱系数可以区分人声的特征。
步骤103,对各语音帧的特征向量进行聚类,生成特征中心向量。
本申请实施例可以将各语音帧的特征向量作为初始值进行聚类,从而将各语音帧的噪声特征或语音特征聚类得到该语音帧的噪声特征中心向量或语音特征中心向量。聚类算法可以采用k-means(k均值聚类算法)、fuzzy-c-means(模糊c均值聚类算法)、EM(Expectation-Maximization algorithm,最大期望算法)等算法。本申请实施例对聚类算法不加以限制。
其中,k-means算法以空间中k个点为中心进行聚类,对最靠近该k个点的对象进行归类。通过迭代的方法,逐次更新各聚类中心的值,直至得到最优的聚类结果。
fuzzy-c-means算法通过优化目标函数得到每个样本点对所有聚类中心的隶属度,从而决定样本点的类属以达到自动对样本数据进行聚类的目的。
EM算法在概率模型中寻找参数最大似然估计或最大后验估计。
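A minimal sketch of this clustering step, using k-means from scikit-learn as the representative algorithm (fuzzy c-means or EM over a mixture model would be drop-in alternatives); the number of clusters k and the function name are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_feature_vectors(feature_vectors, k=2, seed=0):
    """Cluster a (n_vectors, dim) matrix of frame feature vectors and return
    the (k, dim) feature center vectors found by k-means."""
    feature_vectors = np.asarray(feature_vectors, dtype=float)
    km = KMeans(n_clusters=k, n_init=10, random_state=seed)
    km.fit(feature_vectors)
    return km.cluster_centers_
```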
步骤104,根据参考噪声帧的特征中心向量,从各语音帧的特征中心向量中确定包含语音信息的特征中心向量,生成目标语音特征中心向量,参考噪声帧为多个语音帧中的噪声帧。
其中,在一个实施例中,参考噪声帧为多个语音帧中的纯噪声帧或噪声功率超过一定阈值的语音帧。具体地,将参考噪声帧的特征中心向量与其他语音帧的特征中心向量进行对比,从而确定差距较大的特征中心向量,将差距较大的特征中心向量确定为包含语音信息的特征中心向量,将包含语音信息的特征中心向量拼接生成目标语音特征中心向量。
其中,目标语音特征中心向量用于确定目标对象的身份信息;并且,可以由生成目标语音特征中心向量的设备根据该目标语音特征中心向量确定目标对象的身份信息;也可以由其他设备根据该目标语音特征中心向量确定目标对象的身份信息,也即生成目标语音特征中心向量的设备和根据目标语音特征中心向量的设备可以相同,也可以不同,在本申请实施例中,对此不作具体限定;当生成目标语音特征中心向量的设备和根据目标语音特征中心向量的设备相同时,执行完步骤104,根据目标语音特征中心向量,确定目标对象的身份信息。当生成目标语音特征中心向量的设备和根据目标语音特征中心向量的设备不同时,向其他设备发送目标语音特征中心向量;其他设备根据目标语音特征中心向量,确定目标对象的身份信息。
在实际应用中,可以将目标对象的目标语音特征中心向量与确定身份的参考对象的目标语音特征中心向量,进行对比,确定目标对象是否是参考对象。若两个目标语音特征中心向量接近,则代表目标对象是参考对象;否则,目标对象不是参考对象。
在实际应用中,可以将大量参考对象的目标语音特征中心向量保存至数据库中,从而可以从判断该目标对象是否为该数据库中的对象。可以理解,在极限情况下,当该数据库中保存了所有目标对象的目标语音特征中心向量时,即可以确认任何一个人的身份信息。
综上所述,本申请实施例提供了一种语音别方法,该方法包括:获取目标对象对应的语音文件按照预设帧长划分的多个语音帧;对于各语音帧,生成语音帧的特征向量;对各语音 帧的特征向量进行聚类,生成特征中心向量;根据参考噪声帧的特征中心向量,从各语音帧的特征中心向量中确定包含语音信息的特征中心向量,生成目标语音特征中心向量,参考噪声帧为多个语音帧中的噪声帧,目标语音特征中心向量用于确定目标对象的身份信息。由于本申请实施例中对语音处理时,根据参考噪声帧的特征中心向量,确定出包含语音信息的语音帧的特征中心向量,然后基于包含语音信息的语音帧的特征中心向量,生成目标语音特征中心向量,目标语音特征中心向量用于进行目标对象的识别。由于直接从语音文件中提取特征,基于特征进行识别,从而不需要对语音文件进行去噪处理,解决了相关技术中去噪导致的识别较慢、无法去掉所有噪声导致语音处理准确度较低的问题,并且,直接从语音文件中提取特征,并将噪声弱化,提高了识别的速度和准确度。
本申请实施例从系统架构的层级对可选地语音处理方法进行了描述。
参照图2,其示出了另一种语音处理方法的具体步骤流程图。
步骤201,获取目标对象对应的语音文件按照预设帧长划分的多个语音帧。
该步骤可以参照步骤101的详细说明,在此不再赘述。
步骤202,对于各语音帧,确定语音帧的梅尔频谱。
其中,梅尔频谱可以区分语音帧和噪声帧。
可选地,在本申请的另一种实施例中,上述步骤202包括子步骤2021:
子步骤2021,确定该语音帧的功率谱。
具体地,功率谱可以基于频谱进行计算。
可选地,在本申请的另一种实施例中,上述子步骤2021包括子步骤20211至20212:
子步骤20211,对该语音帧分别进行傅里叶变换,得到该语音帧的频谱。
该语音帧中包括多个离散信号。具体地，对于第i帧语音帧的第n个离散信号 $x_i(n)$，该第i帧语音帧的频谱 $F_i(k)$ 的计算公式如下公式一所示：
公式一：
$$F_i(k) = \sum_{n=0}^{N-1} x_i(n)\, e^{-j 2\pi k n / N}$$
其中，k=0、1、…、N-1，N为傅里叶变换的点数，可以根据实际应用场景设定；在实际应用中，通常取256。
子步骤20212,计算该语音帧的频谱的平方得到该语音帧的功率谱。
具体地，对于第i帧语音帧的第n个离散信号 $x_i(n)$，该第i帧语音帧的功率谱 $P_i(k)$ 的计算公式如下公式二所示：
公式二：$P_i(k) = |F_i(k)|^2$
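Formulas 1 and 2 can be sketched with NumPy's FFT as follows; keeping only the non-redundant half of the spectrum is a common practical choice for real-valued frames and is an assumption of the example, not a requirement of the embodiment.

```python
import numpy as np

def power_spectrum(frames, n_fft=256):
    """Formula 1: N-point Fourier transform of each windowed frame
    (n_fft = 256 follows the embodiment's typical value; frames are
    cropped or zero-padded to this length).
    Formula 2: power spectrum as the squared magnitude of the spectrum."""
    spectrum = np.fft.rfft(frames, n=n_fft, axis=-1)   # F_i(k), one-sided
    return np.abs(spectrum) ** 2                        # P_i(k) = |F_i(k)|^2
```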
子步骤2022,根据该语音帧的功率谱,计算该语音帧的梅尔频谱。
其中,梅尔频谱通过梅尔频率滤波器对功率谱进行滤波得到。
可选地,在本申请的另一种实施例中,上述子步骤2022包括子步骤20221:
子步骤20221,通过预设三角带通滤波器对该语音帧的功率谱进行滤波,得到该语音帧的梅尔频谱。
在本申请实施例中,梅尔频率滤波器采用一组三角带通滤波器实现,从而得到的该语音帧的梅尔频谱可以符合人耳的掩蔽效应,使得加强低频分量,屏蔽噪声影响。在本申请实施 例中,梅尔频率滤波器优选24个三角带通滤波器。
其中，第l个三角带通滤波器 $H_l(k)$ 的表示如下公式三所示：
公式三：
$$H_l(k) = \begin{cases} 0, & k < f(l-1) \\ \dfrac{k - f(l-1)}{f(l) - f(l-1)}, & f(l-1) \le k \le f(l) \\ \dfrac{f(l+1) - k}{f(l+1) - f(l)}, & f(l) < k \le f(l+1) \\ 0, & k > f(l+1) \end{cases}$$
其中,f(l)、f(l-1)、f(l+1)分别为第l、l-1、l+1个三角带通滤波器的中心频率,k=0、1、…、N-1,N为傅里叶变换的点数。
具体地，梅尔频谱 $M_i(k)$ 的计算公式如下公式四所示：
公式四：$M_i(k) = H(k) \cdot P_i(k) = H(k) \cdot |F_i(k)|^2$
其中，$M_i(k)$ 为该第i帧语音帧的梅尔频谱，$H(k)$ 为三角带通滤波器，$P_i(k)$ 为第i帧语音帧的功率谱。
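A sketch of Formulas 3 and 4, building 24 triangular band-pass filters on an FFT-bin grid and applying them to the power spectrum; placing the center frequencies evenly on the Mel scale and using a 16 kHz sampling rate are assumptions in line with common MFCC practice rather than details fixed here.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def triangular_filterbank(n_filters=24, n_fft=256, sample_rate=16000):
    """Formula 3: triangular band-pass filters H_l(k) with center frequencies f(l)."""
    mel_pts = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for l in range(1, n_filters + 1):
        left, center, right = bins[l - 1], bins[l], bins[l + 1]
        for k in range(left, center):
            fbank[l - 1, k] = (k - left) / max(center - left, 1)
        for k in range(center, right + 1):
            fbank[l - 1, k] = (right - k) / max(right - center, 1)
    return fbank

def mel_spectrum(power_spec, fbank):
    """Formula 4: filter the power spectrum with the triangular filters."""
    return power_spec @ fbank.T          # shape: (n_frames, n_filters)
```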
可选地,在本申请的另一种实施例中,在上述子步骤2022之后还包括子步骤2023:
子步骤2023,根据该语音帧的功率谱,计算该语音帧的离散余弦系数及梅尔频率倒谱系数。
其中,离散余弦系数和梅尔频率倒谱系数可以通过对对数域功率谱进行离散余弦变换得到。
可选地，在本申请的另一种实施例中，上述子步骤2023包括子步骤20231至20232：
子步骤20231,对该语音帧的梅尔频谱转换至对数域,得到该语音帧的对数域功率谱。
具体地,对该语音帧的梅尔频谱取对数得到对数域功率谱,从而可以尽可能的符合人耳的听觉特性,即:对数式感知。
可选地,在本申请的另一种实施例中,上述子步骤20231包括子步骤202311至202316:
子步骤202311,对于该语音帧的功率谱上的每个功率点,获取该功率点的频率和功率。
本申请实施例通过对功率谱上的每个功率点进行转换,实现整个功率谱的对数域转换。
子步骤202312,将该功率点对应的频率除以预设第一转换参数,得到第一中间值。
具体地，第一中间值 $MV_1$ 的计算公式如下公式五所示：
公式五：$MV_1 = f / P_1$
其中，$P_1$ 为第一转换参数，在本申请实施例中，$P_1$ 优选700；$f$ 为该功率点对应的频率。
子步骤202313,将第一中间值加上预设第二转换参数,得到第二中间值。
具体地，第二中间值 $MV_2$ 的计算公式如下公式六所示：
公式六：$MV_2 = P_2 + MV_1 = P_2 + f / P_1$
其中，$P_2$ 为第二转换参数，在本申请实施例中，$P_2$ 优选1。
子步骤202314,对第二中间值取对数,得到第三中间值。
具体地，第三中间值 $MV_3$ 的计算公式如下公式七所示：
公式七：$MV_3 = \log(MV_2) = \log(P_2 + f / P_1)$
子步骤202315,计算第三中间值与预设第三转换参数的乘积,得到对数转换值。
具体地，对数转换值 $M(f)$ 的计算公式如下公式八所示：
公式八：$M(f) = P_3 \cdot MV_3 = P_3 \cdot \log(P_2 + f / P_1)$
其中，$P_3$ 为第三转换参数，在本申请实施例中 $P_3$ 优选2595。
可以理解,P 1、P 2、P 3均可以根据实际应用场景进行适当调整,本申请实施例对其不加以限制。
子步骤202316,对于该语音帧,将各功率点的对数转换值和该功率组成对数功率谱。
根据子步骤202312至202315的计算，将频率 $f$ 转换为 $M(f)$，从而 $M(f)$ 和频率 $f$ 对应的功率组成对数域功率谱。
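Sub-steps 202312 to 202316 (Formulas 5 to 8) collapse into the sketch below; the base-10 logarithm is assumed, as is usual alongside the constant 2595, and the frequency axis is taken from NumPy's rfftfreq as an assumption of the example.

```python
import numpy as np

P1, P2, P3 = 700.0, 1.0, 2595.0   # preset first/second/third conversion parameters

def log_domain_power_spectrum(powers, n_fft=256, sample_rate=16000):
    """Map each power point's frequency f to M(f) = P3 * log(P2 + f / P1) and
    pair it with the original power, giving the log-domain power spectrum."""
    freqs = np.fft.rfftfreq(n_fft, d=1.0 / sample_rate)   # frequency of each power point
    m = P3 * np.log10(P2 + freqs / P1)                     # Formulas 5-8 collapsed
    return np.stack([np.broadcast_to(m, powers.shape), powers], axis=-1)
```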
子步骤20232,对该语音帧的对数域功率谱进行离散余弦变换,得到该语音帧的离散余弦系数和梅尔频率倒谱系数,该梅尔频率倒谱系数为从该离散余弦系数中确定的。
具体地,对该语音帧的对数域功率谱进行离散余弦变换,将离散余弦变换之后的第一个系数确定为该语音帧的离散余弦系数,将离散余弦变换之后的其他系数确定为梅尔频率倒谱系数。
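A sketch of this sub-step, assuming SciPy's DCT-II; feeding it the logarithm of the Mel filterbank energies, as in the conventional MFCC pipeline, stands in for the log-domain spectrum of the embodiment, and the number of retained coefficients is illustrative.

```python
import numpy as np
from scipy.fftpack import dct

def dct_coeff_and_mfcc(log_energies, n_keep=13):
    """DCT of the log-domain values; the first coefficient is taken as the discrete
    cosine coefficient and the remaining ones as the Mel-frequency cepstral coefficients."""
    coeffs = dct(np.asarray(log_energies), type=2, axis=-1, norm='ortho')[..., :n_keep]
    return coeffs[..., :1], coeffs[..., 1:]     # (discrete cosine coefficient, MFCCs)
```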
步骤203,根据该语音帧的梅尔频谱生成该语音帧的特征向量。
在实际应用中,可以单独将梅尔频谱作为特征向量,也可以对梅尔频谱进行线性或非线性转换得到特征向量。
可选地,针对子步骤2023,上述步骤203包括子步骤2031:
子步骤2031,将该语音帧的梅尔频谱、离散余弦系数及梅尔频率倒谱系数拼接成为该语音帧的特征向量。可以理解,本申请实施例对梅尔频谱、离散余弦系数及梅尔频率倒谱系数的拼接顺序不加以限制。例如,可以将离散余弦系数拼接在梅尔频谱之后,再拼接上梅尔频率倒谱系数,也可以将梅尔频谱拼接在离散余弦系数之后,再拼接上梅尔频谱倒谱系数,也可以将梅尔频谱拼接在梅尔频率倒谱系数之后,再拼接上离散余弦系数,也可以将离散余弦系数拼接在梅尔频率倒谱系数之后,再拼接上梅尔频谱。
在本申请实施例中,将该语音帧的梅尔频谱、离散余弦系数及梅尔频率倒谱系数拼接成为该语音帧的特征向量,从而得到的该语音帧的特征向量携带的信息更多,更容易帮助区别噪声和语音,提高了后续语音处理的准确性。
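A one-line sketch of this splicing; the order shown (Mel spectrum, then discrete cosine coefficient, then MFCCs) is only one of the splicing orders the embodiment allows.

```python
import numpy as np

def frame_feature_vector(mel_spec, dct_coeff, mfcc):
    """Splice the per-frame Mel spectrum, discrete cosine coefficient and MFCCs
    into a single feature vector."""
    return np.concatenate([np.ravel(mel_spec), np.ravel(dct_coeff), np.ravel(mfcc)])
```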
步骤204,对各语音帧的特征向量进行聚类,生成特征中心向量。
该步骤可以参照步骤103的详细说明,在此不再赘述。
步骤205,从各语音帧的特征中心向量中确定参考噪声帧的特征中心向量,该参考噪声帧为多个语音帧中的噪声帧。
其中,参考噪声帧通常为语音帧中的第一帧,并且,参考噪声帧中不包含语音信息或者噪声功率大于或者等于预设阈值。相应的,本步骤可以为:从各语音帧中选择第一语音帧;若各语音帧中的第一帧不包含语音信息或者噪声功率大于或等于预设阈值时,将各语音帧中的第一帧作为参考噪声帧。
若各语音帧中的第一帧包含语音信息或噪声功率小于预设阈值,确定各语音帧中的第一帧不为参考噪声帧,则选取其他帧,若其他帧中不包含语音信息或噪声功率大于或等于预设 阈值时,将其他帧作为参考噪声帧。若其他帧包含语音信息或噪声功率小于预设阈值时,重新选择语音帧,直到选择出不包含语音信号或者噪声功率大于或等于预设阈值的语音帧,将不包含语音信号或者噪声功率大于或等于预设阈值的语音帧作为参考噪声帧。
需要说明的一点是,在选择其他帧时,可以按照帧顺序进行选择;例如,若各语音帧中的第一帧不是参考噪声帧时,选择各语音帧中的第二帧;若各语音帧中的第一帧不是参考噪声帧时,选择各语音帧中第三帧,直到选择到参考噪声帧为止。
在本申请实施例中,由于参考噪声帧一般是各语音帧中的第一帧或者前几帧;因此,在确定参考噪声帧时,按照帧顺序进行选择,从而能够较快的确定出参考噪声帧,提高了效率。
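The frame-order scan of step 205 can be sketched as below; the per-frame noise decision itself (no speech information, or noise power at or above the preset threshold) is left to the caller, since the embodiment does not fix how it is computed, so the boolean flags are an assumption of the example.

```python
def pick_reference_noise_frame(is_noise_only):
    """Scan frames in order and return the index of the first frame flagged as
    noise-only; raise if no such frame exists."""
    for idx, flag in enumerate(is_noise_only):
        if flag:
            return idx
    raise ValueError("no noise-only frame found in the speech file")
```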
步骤206,计算该参考噪声帧对应的特征中心向量与每个语音帧的特征中心向量之间的距离。
在本步骤中,可以计算该参考噪声帧对应的特征中心向量与每个语音帧的特征中心向量之间的距离。此外,在实际应用中,为了进一步提高识别速度,也可以不计算该参考噪声帧对应的特征中心向量与每个语音帧的特征中心向量之间的距离,而是可以从各语音帧中首先选取特征语音帧,然后计算参考噪声帧对应的特征中心向量与选择的每个特征语音帧的特征中心向量之间的距离。
特征语音帧可以随机选取,例如可以采用种子随机方法随机选取。在实际应用中,若陷入局部最优,则重新选择一次语音帧。从而可以避免特征语音帧的随机性差,导致局部最优无解。并且,从各语音帧中选择特征语音帧的数量可以根据该语音文件的大小进行设置并更改,在本申请实施例中对此不加以限定;例如,在本申请实施例中优选10个随机语音帧。
在本申请实施例中,参考噪声帧用于与各语音帧进行比较,并剔除纯噪声帧,仅保留含有语音信息的语音帧。
可以理解,距离可以采用欧氏距离也可以采用其他方式计算,本申请实施例对其不加以限制。
步骤207，对于每个语音帧，若该参考噪声帧的特征中心向量与该语音帧的特征中心向量之间的距离大于或等于预设第二距离阈值，则确定该语音帧的特征中心向量为包含语音信息的特征中心向量，将包含语音信息的特征中心向量拼接成目标语音特征中心向量。
其中,第二距离阈值可以根据实际应用场景设定,本申请实施例对其不加以限制。
具体地,对于每个语音帧,若该参考噪声帧的特征中心向量与该语音帧的特征中心向量之间的距离大于或等于第二距离阈值,则表明该语音帧不仅包括噪声还包括语音信息,也即确定该语音帧的特征中心向量为包含语音信息的特征中心向量;若该参考噪声帧的特征中心向量与该语音帧的特征中心向量之间的距离小于第二距离阈值,则表明该语音帧仅包括噪声信息,也即确定该语音帧的特征中心向量不为包含语音信息的特征中心向量。确定出至少一个包含语音信息的特征中心向量时,可以将至少一个包含语音信息的特征中心向量拼接成目标语音特征中心向量。
其中,可以根据至少一个包含语音信息的特征中心向量的语音帧在语音文件中的顺序,将该至少一个包含语音信息的特征中心向量拼接成目标语音特征中心向量。
本申请实施例可以将包含语音信息的语音帧的特征中心向量拼接成为目标语音特征中心向量,从而目标语音特征中心向量就是无噪声,且能够体现目标对象的声音特征的中心向量,后续不需要对语音文件进行去噪处理,即可获取到能够体现目标对象的声音特征的中心向量, 提高了效率。
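A sketch of steps 206 and 207, assuming Euclidean distance and a caller-supplied second distance threshold; the kept center vectors are spliced in frame order as described above.

```python
import numpy as np

def build_target_center_vector(frame_centers, noise_center, second_threshold):
    """Keep every frame whose feature center vector lies at a distance >= the second
    threshold from the reference noise frame's center vector, then splice the kept
    vectors, in frame order, into the target speech feature center vector."""
    kept = [np.asarray(c) for c in frame_centers
            if np.linalg.norm(np.asarray(c) - noise_center) >= second_threshold]
    if not kept:
        return None                     # no frame carried speech information
    return np.concatenate(kept)
```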
其中,目标语音特征中心向量用于确定目标对象的身份信息;并且,可以由生成目标语音特征中心向量的设备根据该目标语音特征中心向量确定目标对象的身份信息;也可以由其他设备根据该目标语音特征中心向量确定目标对象的身份信息,也即生成目标语音特征中心向量的设备和根据目标语音特征中心向量的设备可以相同,也可以不同,在本申请实施例中,对此不作具体限定;当生成目标语音特征中心向量的设备和根据目标语音特征中心向量的设备相同时,执行完步骤207,还包括步骤208;当生成目标语音特征中心向量的设备和根据目标语音特征中心向量的设备不同时,向其他设备发送目标语音特征中心向量;其他设备通过步骤208,确定目标对象的身份信息。
步骤208,根据该目标语音特征中心向量,确定该目标对象的身份信息。
可选地,在本申请的另一种实施例中,根据该目标语音特征中心向量确定目标对象的身份信息的步骤,包括子步骤A1至A4:
子步骤A1,获取参考语音特征中心向量,该参考语音特征中心向量对应预设参考对象。
该参考语音特征中心向量对应预设参考对象,也即该参考语音特征中心向量为该预设参考对象的目标语音特征中心向量。其中,预设参考对象为预先确定了语音特征中心向量的对象。例如,预设参考对象为预先确定了语音特征中心向量的人物。在实际应用中,可以通过步骤201至207获取预设参考对象的参考语音特征中心向量,并保存至数据库中。从而在对待确认身份的目标对象的身份信息进行确认时,直接从数据库中获取预设参考对象的参考语音特征中心向量,从而将目标对象的目标语音特征中心向量与参考语音特征中心向量进行对比,以确目标对象的身份信息。
子步骤A2,计算该参考语音特征中心向量与该目标语音特征中心向量之间的距离。
距离可以采用欧氏距离也可以采用其他方式计算,本申请实施例对其不加以限制。例如,可以通过欧氏距离计算两个向量的距离,具体公式如下公式九所示:
公式九：
$$D = \sqrt{\sum_{j=1}^{J} \left(A(j) - B(j)\right)^2}$$
其中,A(j)和B(j)分别为该参考语音特征中心向量的第j个分量和该目标语音特征中心向量的第j个分量,J为该参考语音特征中心向量的大小或者该目标语音特征中心向量的大小。
可以理解,在实际应用中,还可以采用其他计算距离的公式,本申请实施例对其不加以限制。
子步骤A3,若该参考语音特征中心向量与该目标语音特征中心向量之间的距离小于预设第一距离阈值,则确定目标对象为参考对象。
其中,第一距离阈值可以根据实际应用场景设定,本申请实施例对其不加以限制。
可以理解,该参考语音特征中心向量与该目标语音特征中心向量之间的距离小于第一距离阈值,则代表目标对象的语音特征与参考对象的语音特征近似,从而可以确认为同一对象。
子步骤A4,若该参考语音特征中心向量与该目标语音特征中心向量之间的距离大于或等于预设第一距离阈值,则确定目标对象不为参考对象。
可以理解,该参考语音特征中心向量与该目标语音特征中心向量之间的距离大于等于距离阈值,则代表目标对象的语音特征与参考对象的语音特征相差较大,从而可以确认为非同 一对象。
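Sub-steps A2 to A4 with Formula 9 can be sketched as follows; the first distance threshold is an input, and returning a boolean is an illustrative simplification.

```python
import numpy as np

def is_same_object(target_center, reference_center, first_threshold):
    """Euclidean distance between the target and reference speech feature center
    vectors; below the first threshold the target object is taken to be the
    reference object, otherwise it is not."""
    a = np.asarray(reference_center, dtype=float)
    b = np.asarray(target_center, dtype=float)
    return float(np.sqrt(np.sum((a - b) ** 2))) < first_threshold
```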
需要说明的一点是,步骤208不是必须的步骤,执行完步骤207可以直接结束,也可以执行步骤208根据目标语音特征中心向量,确定目标对象的身份信息。
综上所述,本申请实施例提供了一种语音处理方法,该方法包括:获取目标对象对应的语音文件按照预设帧长划分的多个语音帧;对于各语音帧,生成该语音帧的特征向量;对该各语音帧的特征向量进行聚类,生成特征中心向量;根据参考噪声帧的特征中心向量,从各语音帧的特征中心向量中确定包含语音信息的特征中心向量,生成目标语音特征中心向量,该参考噪声帧为多个语音帧中的噪声帧,目标语音特征中心向量用于确定目标对象的身份信息。由于本申请实施例中对语音处理时,根据参考噪声帧的特征中心向量,确定出包含语音信息的语音帧的特征中心向量,然后基于包含语音信息的语音帧的特征中心向量,生成目标语音特征中心向量,语音特征中心向量用于进行目标对象的识别。由于直接从语音文件中提取特征,基于特征进行识别,从而不需要对语音文件进行去噪处理,解决了相关技术中去噪导致的识别较慢、无法去掉所有噪声导致语音处理准确度较低的问题。并且,直接从语音文件中提取特征,并将噪声弱化,提高了识别的速度和准确度。
参照图3,其示出了一种语音处理装置的结构图,具体如下。
语音帧划分模块301,用于获取目标对象对应的语音文件按照预设帧长划分的多个语音帧。
特征向量生成模块302,用于对于各语音帧,生成该语音帧的特征向量。
特征中心向量生成模块303,用于对各语音帧的特征向量进行聚类,生成特征中心向量。
目标语音特征中心向量生成模块304,用于根据参考噪声帧的特征中心向量,从各语音帧的特征中心向量中确定包含语音信息的特征中心向量,生成目标语音特征中心向量,该参考噪声帧为多个语音帧中的噪声帧;
目标语音特征中心向量用于确定目标对象的身份信息。
综上所述,本申请实施例提供了一种语音处理装置,该装置包括:语音帧划分模块,用于获取目标对象对应的语音文件按照预设帧长划分的多个语音帧;特征向量生成模块,用于对于各语音帧,生成该语音帧的特征向量;特征中心向量生成模块,用于对各语音帧的特征向量进行聚类,生成特征中心向量;目标语音特征中心向量生成模块,用于根据参考噪声帧的特征中心向量,从各语音帧的特征中心向量中确定包含语音信息的特征中心向量,生成目标语音特征中心向量,该参考噪声帧为多个语音帧中的噪声帧,该目标语音特征中心向量用于确定目标对象的身份信息。由于本申请实施例中对语音处理时,根据参考噪声帧的特征中心向量,确定出包含语音信息的语音帧的特征中心向量,然后基于包含语音信息的语音帧的特征中心向量,生成目标语音特征中心向量,目标语音特征中心向量用于进行目标对象的识别。由于直接从语音文件中提取特征,基于特征进行识别,从而不需要对语音文件进行去噪处理,解决了相关技术中去噪导致的识别较慢、无法去掉所有噪声导致语音处理准确度较低的问题。并且,直接从语音文件中提取特征,并将噪声弱化,提高了识别的速度和准确度。
参照图4,其示出了另一种语音处理装置的结构图,具体如下。
语音帧划分模块301,用于获取目标对象对应的语音文件按照预设帧长划分的多个语音 帧。
特征向量生成模块302,用于对于各语音帧,生成该语音帧的特征向量。可选地,在本申请实施例中,上述特征向量生成模块302包括:
梅尔频谱确定子模块3021,用于对于各语音帧,确定该语音帧的梅尔频谱。
特征向量生成子模块3022,用于根据该语音帧的梅尔频谱生成该语音帧的特征向量。
特征中心向量生成模块303,用于对各语音帧的特征向量进行聚类,生成特征中心向量。
目标语音特征中心向量生成模块304,用于根据参考噪声帧的特征中心向量,从各语音帧的特征中心向量中确定包含语音信息的特征中心向量,生成目标语音特征中心向量,参考噪声帧为多个语音帧中的噪声帧;
目标语音特征中心向量用于确定目标对象的身份信息。
可选地,在本申请实施例中,上述目标语音特征中心向量生成模块304包括:
噪声特征中心向量确定子模块3041,用于从各语音帧的特征中心向量中确定参考噪声帧的特征中心向量。
第一距离计算子模块3042,用于计算该参考噪声帧对应的特征中心向量与每个语音帧的特征中心向量之间的距离。
目标语音特征中心向量生成子模块3043,用于对于每个语音帧,若该参考噪声帧的特征中心向量与该语音帧的特征中心向量之间的距离大于或等于预设第二距离阈值,则确定该语音帧的特征向量为包含语音信息的特征中心向量,将包含语音信息的特征中心向量拼接成目标语音特征中心向量。
可选地,在本申请的另一种实施例中,上述梅尔频谱确定子模块3021包括:
功率谱确定单元,确定该语音帧的功率谱。
梅尔频谱计算单元,用于根据该语音帧的功率谱,计算该语音帧的梅尔频谱。
可选地,在本申请的另一种实施例中,上述梅尔频谱确定子模块3021还包括:
梅尔频谱系数计算单元,用于根据该语音帧的功率谱,计算该语音帧的离散余弦系数及梅尔频率倒谱系数。
可选地,在本申请的另一种实施例中,上述功率谱确定单元包括:
频谱计算子单元,用于对该语音帧分别进行傅里叶变换,得到该语音帧的频谱。
功率谱计算子单元,用于计算该语音帧的频谱的平方得到该语音帧的功率谱。
可选地,在本申请的另一种实施例中,上述梅尔频谱计算单元,包括:
梅尔频谱计算子单元,用于通过预设三角带通滤波器对该语音帧的功率谱进行滤波,得到该语音帧的梅尔频谱。
可选地,在本申请的另一种实施例中,上述梅尔频谱系数计算单元,包括:
对数域转换子单元,用于对该语音帧的梅尔频谱转换至对数域,得到该语音帧的对数域功率谱。
梅尔频谱系数计算子单元,用于对该语音帧的对数域功率谱进行离散余弦变换,得到该语音帧的离散余弦系数和梅尔频率倒谱系数,该梅尔频率倒谱系数从该离散余弦系数中确定的。
可选地,在本申请的另一种实施例中,上述特征向量生成子模块3022,包括:
特征向量拼接单元,用于将该语音帧的梅尔频谱、离散余弦系数及梅尔频率倒谱系数拼 接成为该语音帧的特征向量。
可选地,在本申请的另一种实施例中,上述装置还包括身份信息确定模块,用于根据目标语音特征中心向量,确定目标对象的身份信息;
其中,身份信息确定模块包括:
参考语音特征中心向量获取子模块,用于获取参考语音特征中心向量,该参考语音特征中心向量对应预设的参考对象。
第二距离计算子模块,用于计算该参考语音特征中心向量与该目标语音特征中心向量的距离。
第一身份确认子模块,用于若该距离小于预设第一距离阈值,则确定目标对象为该参考对象。
第二身份确认子模块,用于若该距离大于或等于预设第一距离阈值,则确定目标对象不为该参考对象。
可选地,在本申请的另一种实施例中,上述对数域转换子单元,包括:
功率点获取子单元,用于对于该语音帧的功率谱上的每个功率点,获取该功率点的频率和功率。
第一中间值计算子单元,用于将该功率点对应的频率除以预设第一转换参数,得到第一中间值。
第二中间值计算子单元,用于将该第一中间值加上预设第二转换参数,得到第二中间值。
第三中间值计算子单元,用于对该第二中间值取对数,得到第三中间值。
对数转换值计算子单元,用于计算该第三中间值与预设第三转换参数的乘积,得到对数转换值。
对数域功率谱生成子单元,用于对于该语音帧,各功率点的对数转换值和该功率组成对数功率谱。
综上所述,本申请实施例提供了一种语音处理装置,该装置包括:语音帧划分模块,用于获取目标对象对应的语音文件按照预设帧长划分的多个语音帧;特征向量生成模块,用于对于各语音帧,生成该语音帧的特征向量;特征中心向量生成模块,用于对各语音帧的特征向量进行聚类,生成特征中心向量;目标语音特征中心向量生成模块,用于根据参考噪声帧的特征中心向量,从该特征中心向量中确定包含语音信息的特征中心向量,生成目标语音特征中心向量,该参考噪声帧为多个语音帧中的噪声帧,目标语音特征中心向量用于目标对象的身份信息。由于本申请实施例中对语音处理时,根据参考噪声帧的特征中心向量,确定出包含语音信息的语音帧的特征中心向量,然后基于包含语音信息的语音帧的特征中心向量,生成目标语音特征中心向量,目标语音特征中心向量用于进行目标对象的识别。由于直接从语音文件中提取特征,基于特征进行识别,从而不需要对语音文件进行去噪处理,解决了相关技术中去噪导致的识别较慢、无法去掉所有噪声导致语音处理准确度较低的问题。并且,直接从语音文件中提取特征,并将噪声弱化,提高了识别的速度和准确度。
本申请实施例还提供了一种电子设备,包括:处理器、存储器以及存储在存储器上并可在处理器上运行的计算机程序,处理器执行程序时实现前述实施例的语音处理方法。
图5是本申请实施例提供的一种电子设备的结构示意图。该电子设备500可以是:智能 手机、平板电脑、MP3播放器(Moving Picture Experts Group Audio Layer III,动态影像专家压缩标准音频层面3)、MP4(Moving Picture Experts Group Audio Layer IV,动态影像专家压缩标准音频层面4)播放器、笔记本电脑或台式电脑。电子设备500还可能被称为用户设备、便携式电子设备、膝上型电子设备、台式电子设备等其他名称。
通常,电子设备500包括有:处理器501和存储器502。
处理器501可以包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器501可以采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。处理器501也可以包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central Processing Unit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。在一些实施例中,处理器501可以在集成有GPU(Graphics Processing Unit,图像处理器),GPU用于负责显示屏所需要显示的内容的渲染和绘制。一些实施例中,处理器501还可以包括AI(Artificial Intelligence,人工智能)处理器,该AI处理器用于处理有关机器学习的计算操作。
存储器502可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器502还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在一些实施例中,存储器502中的非暂态的计算机可读存储介质用于存储至少一个指令,该至少一个指令用于被处理器501所执行以实现本申请中方法实施例提供的语音处理方法。
在一些实施例中,电子设备500还可选包括有:外围设备接口503和至少一个外围设备。处理器501、存储器502和外围设备接口503之间可以通过总线或信号线相连。各个外围设备可以通过总线、信号线或电路板与外围设备接口503相连。具体地,外围设备包括:射频电路504、显示屏505、摄像头组件506、音频电路507、定位组件508和电源509中的至少一种。
外围设备接口503可被用于将I/O(Input/Output,输入/输出)相关的至少一个外围设备连接到处理器501和存储器502。在一些实施例中,处理器501、存储器502和外围设备接口503被集成在同一芯片或电路板上;在一些其他实施例中,处理器501、存储器502和外围设备接口503中的任意一个或两个可以在单独的芯片或电路板上实现,本实施例对此不加以限定。
射频电路504用于接收和发射RF(Radio Frequency,射频)信号,也称电磁信号。射频电路504通过电磁信号与通信网络以及其他通信设备进行通信。射频电路504将电信号转换为电磁信号进行发送,或者,将接收到的电磁信号转换为电信号。可选地,射频电路504包括:天线系统、RF收发器、一个或多个放大器、调谐器、振荡器、数字信号处理器、编解码芯片组、用户身份模块卡等等。射频电路504可以通过至少一种无线通信协议来与其它电子设备进行通信。该无线通信协议包括但不限于:城域网、各代移动通信网络(2G、3G、4G及5G)、无线局域网和/或WiFi(Wireless Fidelity,无线保真)网络。在一些实施例中,射频电路504还可以包括NFC(Near Field Communication,近距离无线通信)有关的电路,本申请对此不加以限定。
显示屏505用于显示UI(User Interface,用户界面)。该UI可以包括图形、文本、图标、 视频及其它们的任意组合。当显示屏505是触摸显示屏时,显示屏505还具有采集在显示屏505的表面或表面上方的触摸信号的能力。该触摸信号可以作为控制信号输入至处理器501进行处理。此时,显示屏505还可以用于提供虚拟按钮和/或虚拟键盘,也称软按钮和/或软键盘。在一些实施例中,显示屏505可以为一个,设置电子设备500的前面板;在另一些实施例中,显示屏505可以为至少两个,分别设置在电子设备500的不同表面或呈折叠设计;在再一些实施例中,显示屏505可以是柔性显示屏,设置在电子设备500的弯曲表面上或折叠面上。甚至,显示屏505还可以设置成非矩形的不规则图形,也即异形屏。显示屏505可以采用LCD(Liquid Crystal Display,液晶显示屏)、OLED(Organic Light-Emitting Diode,有机发光二极管)等材质制备。
摄像头组件506用于采集图像或视频。可选地,摄像头组件506包括前置摄像头和后置摄像头。通常,前置摄像头设置在电子设备的前面板,后置摄像头设置在电子设备的背面。在一些实施例中,后置摄像头为至少两个,分别为主摄像头、景深摄像头、广角摄像头、长焦摄像头中的任意一种,以实现主摄像头和景深摄像头融合实现背景虚化功能、主摄像头和广角摄像头融合实现全景拍摄以及VR(Virtual Reality,虚拟现实)拍摄功能或者其它融合拍摄功能。在一些实施例中,摄像头组件506还可以包括闪光灯。闪光灯可以是单色温闪光灯,也可以是双色温闪光灯。双色温闪光灯是指暖光闪光灯和冷光闪光灯的组合,可以用于不同色温下的光线补偿。
音频电路507可以包括麦克风和扬声器。麦克风用于采集用户及环境的声波,并将声波转换为电信号输入至处理器501进行处理,或者输入至射频电路504以实现语音通信。出于立体声采集或降噪的目的,麦克风可以为多个,分别设置在电子设备500的不同部位。麦克风还可以是阵列麦克风或全向采集型麦克风。扬声器则用于将来自处理器501或射频电路504的电信号转换为声波。扬声器可以是传统的薄膜扬声器,也可以是压电陶瓷扬声器。当扬声器是压电陶瓷扬声器时,不仅可以将电信号转换为人类可听见的声波,也可以将电信号转换为人类听不见的声波以进行测距等用途。在一些实施例中,音频电路507还可以包括耳机插孔。
定位组件508用于定位电子设备500的当前地理位置,以实现导航或LBS(Location Based Service,基于位置的服务)。定位组件508可以是基于美国的GPS(Global Positioning System,全球定位系统)、中国的北斗系统、俄罗斯的格雷纳斯系统或欧盟的伽利略系统的定位组件。
电源509用于为电子设备500中的各个组件进行供电。电源509可以是交流电、直流电、一次性电池或可充电电池。当电源509包括可充电电池时,该可充电电池可以支持有线充电或无线充电。该可充电电池还可以用于支持快充技术。
在一些实施例中,电子设备500还包括有一个或多个传感器510。该一个或多个传感器510包括但不限于:加速度传感器、陀螺仪传感器、压力传感器、指纹传感器、光学传感器以及接近传感器。
本领域技术人员可以理解,图5中示出的结构并不构成对电子设备500的限定,可以包括比图示更多或更少的组件,或者组合某些组件,或者采用不同的组件布置。
本申请实施例还提供了一种可读存储介质,当存储介质中的指令由电子设备的处理器执行时,使得电子设备能够执行前述实施例的语音处理方法。
对于装置实施例而言,由于其与方法实施例基本相似,所以描述的比较简单,相关之处参见方法实施例的部分说明即可。
在此提供的算法和显示不与任何特定计算机、虚拟系统或者其它设备固有相关。各种通用系统也可以与基于在此的示教一起使用。根据上面的描述,构造这类系统所要求的结构是显而易见的。此外,本申请也不针对任何特定编程语言。应当明白,可以利用各种编程语言实现在此描述的本申请的内容,并且上面对特定语言所做的描述是为了披露本申请的最佳实施方式。
在此处所提供的说明书中,说明了大量具体细节。然而,能够理解,本申请的实施例可以在没有这些具体细节的情况下实践。在一些实例中,并未详细示出公知的方法、结构和技术,以便不模糊对本说明书的理解。
类似地,应当理解,为了精简本申请并帮助理解各个发明方面中的一个或多个,在上面对本申请的示例性实施例的描述中,本申请的各个特征有时被一起分组到单个实施例、图、或者对其的描述中。然而,并不应将该公开的方法解释成反映如下意图:即所要求保护的本申请要求比在每个权利要求中所明确记载的特征更多的特征。更确切地说,如下面的权利要求书所反映的那样,发明方面在于少于前面公开的单个实施例的所有特征。因此,遵循具体实施方式的权利要求书由此明确地并入该具体实施方式,其中每个权利要求本身都作为本申请的单独实施例。
本领域那些技术人员可以理解,可以对实施例中的设备中的模块进行自适应性地改变并且把它们设置在与该实施例不同的一个或多个设备中。可以把实施例中的模块或单元或组件组合成一个模块或单元或组件,以及此外可以把它们分成多个子模块或子单元或子组件。除了这样的特征和/或过程或者单元中的至少一些是相互排斥之外,可以采用任何组合对本说明书(包括伴随的权利要求、摘要和附图)中公开的所有特征以及如此公开的任何方法或者设备的所有过程或单元进行组合。除非另外明确陈述,本说明书(包括伴随的权利要求、摘要和附图)中公开的每个特征可以由提供相同、等同或相似目的的替代特征来代替。
本申请的各个部件实施例可以以硬件实现,或者以在一个或者多个处理器上运行的软件模块实现,或者以它们的组合实现。本领域的技术人员应当理解,可以在实践中使用微处理器或者数字信号处理器(DSP)来实现根据本申请实施例的语音处理设备中的一些或者全部部件的一些或者全部功能。本申请还可以实现为用于执行这里所描述的方法的一部分或者全部的设备或者装置程序。这样的实现本申请的程序可以存储在计算机可读介质上,或者可以具有一个或者多个信号的形式。这样的信号可以从因特网网站上下载得到,或者在载体信号上提供,或者以任何其他形式提供。
应该注意的是上述实施例对本申请进行说明而不是对本申请进行限制,并且本领域技术人员在不脱离所附权利要求的范围的情况下可设计出替换实施例。在权利要求中,不应将位于括号之间的任何参考符号构造成对权利要求的限制。单词“包含”不排除存在未列在权利要求中的元件或步骤。位于元件之前的单词“一”或“一个”不排除存在多个这样的元件。本申请可以借助于包括有若干不同元件的硬件以及借助于适当编程的计算机来实现。在列举了若干装置的单元权利要求中,这些装置中的若干个可以是通过同一个硬件项来具体体现。单词第一、第二、以及第三等的使用不表示任何顺序。可将这些单词解释为名称。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置 和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
以上所述仅为本申请的较佳实施例而已,并不用以限制本申请,凡在本申请的精神和原则之内所作的任何修改、等同替换和改进等,均应包含在本申请的保护范围之内。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以权利要求的保护范围为准。应当理解的是,本申请并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本申请的范围仅由所附的权利要求来限。应当理解的是,本申请并不局限于上面已经描述并在附图中示出的精确结构,并且可以在不脱离其范围进行各种修改和改变。本申请的范围仅由所附的权利要求来限制。

Claims (20)

  1. 一种语音处理方法,所述方法包括:
    获取目标对象对应的语音文件按照预设帧长划分的多个语音帧;
    对于各语音帧,生成所述语音帧的特征向量;
    对所述各语音帧的特征向量进行聚类,生成特征中心向量;
    根据参考噪声帧的特征中心向量,从所述各语音帧的特征中心向量中确定包含语音信息的特征中心向量,生成目标语音特征中心向量,所述参考噪声帧为所述多个语音帧中的噪声帧;
    所述目标语音特征中心向量用于确定所述目标对象的身份信息。
  2. 根据权利要求1所述的方法，所述方法还包括：根据所述目标语音特征中心向量确定所述目标对象的身份信息；
    其中,所述根据所述目标语音特征中心向量,确定所述目标对象的身份信息的步骤,包括:
    获取参考语音特征中心向量,所述参考语音特征中心向量对应预设的参考对象;
    计算所述参考语音特征中心向量与所述目标语音特征中心向量之间的距离;
    若所述距离小于预设第一距离阈值,则确定所述目标对象为所述参考对象;
    若所述距离大于或等于预设第一距离阈值,则确定所述目标对象不为所述参考对象。
  3. 根据权利要求1所述的方法,所述根据参考噪声帧的特征中心向量,从所述各语音帧的特征中心向量中确定包含语音信息的特征中心向量,生成目标语音特征中心向量的步骤,包括:
    从所述各语音帧的特征中心向量中确定参考噪声帧的特征中心向量;
    计算所述参考噪声帧的特征中心向量与每个语音帧的特征中心向量之间的距离;
    对于每个语音帧,若所述参考噪声帧的特征中心向量与所述语音帧的特征中心向量之间的距离大于或等于预设第二距离阈值,确定所述语音帧的特征中心向量为包含语音信息的特征中心向量;
    将包含语音信息的特征中心向量拼接成目标语音特征中心向量。
  4. 根据权利要求1所述的方法,所述生成所述语音帧的特征向量的步骤,包括:
    确定所述语音帧的梅尔频谱;
    根据所述语音帧的梅尔频谱生成所述语音帧的特征向量。
  5. 根据权利要求4所述的方法,所述确定所述语音帧的梅尔频谱的步骤,包括:
    确定所述语音帧的功率谱;
    根据所述语音帧的功率谱,计算所述语音帧的梅尔频谱。
  6. 根据权利要求5所述的方法,在根据所述语音帧的功率谱,计算所述语音帧的梅尔频谱的步骤之后,还包括:
    根据所述语音帧的功率谱,计算所述语音帧的离散余弦系数及梅尔频率倒谱系数;
    所述根据所述语音帧的梅尔频谱生成所述语音帧的特征向量的步骤,包括:
    将所述语音帧的梅尔频谱、离散余弦系数及梅尔频率倒谱系数拼接成为所述语音帧的特征向量。
  7. 根据权利要求5所述的方法,所述确定所述语音帧的功率谱的步骤,包括:
    对所述语音帧分别进行傅里叶变换,得到所述语音帧的频谱;
    计算所述语音帧的频谱的平方得到所述语音帧的功率谱。
  8. 根据权利要求6所述的方法,所述根据所述语音帧的功率谱,计算所述语音帧的梅尔频谱的步骤,包括:
    通过预设三角带通滤波器对所述语音帧的功率谱进行滤波,得到所述语音帧的梅尔频谱;
    所述根据所述语音帧的功率谱,计算所述语音帧的离散余弦系数及梅尔频率倒谱系数的步骤,包括:
    对所述语音帧的梅尔频谱转换至对数域,得到所述语音帧的对数域功率谱;
    对所述语音帧的对数域功率谱进行离散余弦变换,得到所述语音帧的离散余弦系数和梅尔频率倒谱系数,所述梅尔频率倒谱系数为从所述离散余弦系数中确定的。
  9. 根据权利要求8所述的方法,所述对所述语音帧的梅尔频谱转换至对数域,得到所述语音帧的对数域功率谱的步骤,包括:
    对于所述语音帧的功率谱上的每个功率点,获取所述功率点的频率和功率;
    将所述功率点对应的频率除以预设第一转换参数,得到第一中间值;
    将所述第一中间值加上预设第二转换参数,得到第二中间值;
    对所述第二中间值取对数,得到第三中间值;
    计算所述第三中间值与预设第三转换参数的乘积,得到对数转换值;
    对于所述语音帧,将各功率点的对数转换值和所述功率组成对数功率谱。
  10. 一种语音处理装置,所述装置包括:
    语音帧划分模块,用于获取目标对象对应的语音文件按照预设帧长划分的多个语音帧;
    特征向量生成模块,用于对于各语音帧,生成所述语音帧的特征向量;
    特征中心向量生成模块,用于对所述各语音帧的特征向量进行聚类,生成特征中心向量;
    目标语音特征中心向量生成模块,用于根据参考噪声帧的特征中心向量,从所述各语音帧的特征中心向量中确定包含语音信息的特征中心向量,生成目标语音特征中心向量,所述参考噪声帧为所述多个语音帧中的噪声帧;
    所述目标语音特征中心向量用于确定所述目标对象的身份信息。
  11. 一种电子设备,包括:
    处理器、存储器以及存储在所述存储器上并可在所述处理器上运行的计算机程序,所述处理器执行所述程序时实现如下操作:
    获取目标对象对应的语音文件按照预设帧长划分的多个语音帧;
    对于各语音帧,生成所述语音帧的特征向量;
    对所述各语音帧的特征向量进行聚类,生成特征中心向量;
    根据参考噪声帧的特征中心向量,从所述各语音帧的特征中心向量中确定包含语音信息的特征中心向量,生成目标语音特征中心向量,所述参考噪声帧为所述多个语音帧中的噪声帧;
    所述目标语音特征中心向量用于确定所述目标对象的身份信息。
  12. 根据权利要求11所述的电子设备，所述处理器执行所述程序时还实现如下操作：根据所述目标语音特征中心向量确定所述目标对象的身份信息；
    其中,所述处理器执行所述程序时还实现如下操作:
    获取参考语音特征中心向量,所述参考语音特征中心向量对应预设的参考对象;
    计算所述参考语音特征中心向量与所述目标语音特征中心向量之间的距离;
    若所述距离小于预设第一距离阈值,则确定所述目标对象为所述参考对象;
    若所述距离大于或等于预设第一距离阈值,则确定所述目标对象不为所述参考对象。
  13. 根据权利要求11所述的电子设备,所述处理器执行所述程序时还实现如下操作:
    从所述各语音帧的特征中心向量中确定参考噪声帧的特征中心向量;
    计算所述参考噪声帧的特征中心向量与每个语音帧的特征中心向量之间的距离;
    对于每个语音帧,若所述参考噪声帧的特征中心向量与所述语音帧的特征中心向量之间的距离大于或等于预设第二距离阈值,确定所述语音帧的特征中心向量为包含语音信息的特征中心向量;
    将包含语音信息的特征中心向量拼接成目标语音特征中心向量。
  14. 根据权利要求11所述的电子设备,所述处理器执行所述程序时还实现如下操作:
    确定所述语音帧的梅尔频谱;
    根据所述语音帧的梅尔频谱生成所述语音帧的特征向量。
  15. 根据权利要求14所述的电子设备,所述处理器执行所述程序时还实现如下操作:
    确定所述语音帧的功率谱;
    根据所述语音帧的功率谱,计算所述语音帧的梅尔频谱。
  16. 根据权利要求15所述的电子设备,所述处理器执行所述程序时还实现如下操作:
    根据所述语音帧的功率谱,计算所述语音帧的离散余弦系数及梅尔频率倒谱系数;
    将所述语音帧的梅尔频谱、离散余弦系数及梅尔频率倒谱系数拼接成为所述语音帧的特征向量。
  17. 根据权利要求15所述的电子设备,所述处理器执行所述程序时还实现如下操作:
    对所述语音帧分别进行傅里叶变换,得到所述语音帧的频谱;
    计算所述语音帧的频谱的平方得到所述语音帧的功率谱。
  18. 根据权利要求16所述的电子设备,所述处理器执行所述程序时还实现如下操作:
    通过预设三角带通滤波器对所述语音帧的功率谱进行滤波,得到所述语音帧的梅尔频谱;
    对所述语音帧的梅尔频谱转换至对数域,得到所述语音帧的对数域功率谱;
    对所述语音帧的对数域功率谱进行离散余弦变换,得到所述语音帧的离散余弦系数和梅尔频率倒谱系数,所述梅尔频率倒谱系数为从所述离散余弦系数中确定的。
  19. 根据权利要求18所述的电子设备,所述处理器执行所述程序时还实现如下操作:
    对于所述语音帧的功率谱上的每个功率点,获取所述功率点的频率和功率;
    将所述功率点对应的频率除以预设第一转换参数,得到第一中间值;
    将所述第一中间值加上预设第二转换参数,得到第二中间值;
    对所述第二中间值取对数,得到第三中间值;
    计算所述第三中间值与预设第三转换参数的乘积,得到对数转换值;
    对于所述语音帧,将各功率点的对数转换值和所述功率组成对数功率谱。
  20. 一种可读存储介质,当所述存储介质中的指令由电子设备的处理器执行时,使得电子设备能够执行如方法权利要求1-9中任一项所述的语音处理方法。
PCT/CN2019/098023 2018-07-27 2019-07-26 语音处理方法、装置、电子设备及可读存储介质 WO2020020375A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201810842328.6A CN109147798B (zh) 2018-07-27 2018-07-27 语音识别方法、装置、电子设备及可读存储介质
CN201810842328.6 2018-07-27

Publications (1)

Publication Number Publication Date
WO2020020375A1 true WO2020020375A1 (zh) 2020-01-30

Family

ID=64798325

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/098023 WO2020020375A1 (zh) 2018-07-27 2019-07-26 语音处理方法、装置、电子设备及可读存储介质

Country Status (2)

Country Link
CN (1) CN109147798B (zh)
WO (1) WO2020020375A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112967730A (zh) * 2021-01-29 2021-06-15 北京达佳互联信息技术有限公司 语音信号的处理方法、装置、电子设备及存储介质
CN113707182A (zh) * 2021-09-17 2021-11-26 北京声智科技有限公司 声纹识别方法、装置、电子设备及存储介质

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109147798B (zh) * 2018-07-27 2023-06-09 北京三快在线科技有限公司 语音识别方法、装置、电子设备及可读存储介质
CN111128131B (zh) * 2019-12-17 2022-07-01 北京声智科技有限公司 语音识别方法、装置、电子设备及计算机可读存储介质
CN111754982A (zh) * 2020-06-19 2020-10-09 平安科技(深圳)有限公司 语音通话的噪声消除方法、装置、电子设备及存储介质

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1540623A (zh) * 2003-11-04 2004-10-27 清华大学 一种门限自适应的语音检测系统
CN1543641A (zh) * 2001-06-19 2004-11-03 说话者识别系统
CN102024455A (zh) * 2009-09-10 2011-04-20 索尼株式会社 说话人识别系统及其方法
CN102201236A (zh) * 2011-04-06 2011-09-28 中国人民解放军理工大学 一种高斯混合模型和量子神经网络联合的说话人识别方法
CN102509547A (zh) * 2011-12-29 2012-06-20 辽宁工业大学 基于矢量量化的声纹识别方法及系统
US9368116B2 (en) * 2012-09-07 2016-06-14 Verint Systems Ltd. Speaker separation in diarization
CN109147798A (zh) * 2018-07-27 2019-01-04 北京三快在线科技有限公司 语音识别方法、装置、电子设备及可读存储介质

Family Cites Families (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPS61100878A (ja) * 1984-10-23 1986-05-19 Nec Corp パタン認識装置
JP2776848B2 (ja) * 1988-12-14 1998-07-16 株式会社日立製作所 雑音除去方法、それに用いるニューラルネットワークの学習方法
JPH1091186A (ja) * 1997-10-28 1998-04-10 Matsushita Electric Ind Co Ltd 音声認識方法
RU2385272C1 (ru) * 2009-04-30 2010-03-27 Общество с ограниченной ответственностью "Стэл-Компьютерные Системы" Система голосовой идентификации диктора
CN102723081B (zh) * 2012-05-30 2014-05-21 无锡百互科技有限公司 语音信号处理方法、语音和声纹识别方法及其装置
US9892731B2 (en) * 2015-09-28 2018-02-13 Trausti Thor Kristjansson Methods for speech enhancement and speech recognition using neural networks
CN106971714A (zh) * 2016-01-14 2017-07-21 芋头科技(杭州)有限公司 一种应用于机器人的语音去噪识别方法及装置
CN106485781A (zh) * 2016-09-30 2017-03-08 广州博进信息技术有限公司 基于实时视频流的三维场景构建方法及其系统
KR101893789B1 (ko) * 2016-10-27 2018-10-04 에스케이텔레콤 주식회사 정규화를 이용한 음성 구간 판단 방법 및 이를 위한 음성 구간 판단 장치
CN106531195B (zh) * 2016-11-08 2019-09-27 北京理工大学 一种对话冲突检测方法及装置
CN107845389B (zh) * 2017-12-21 2020-07-17 北京工业大学 一种基于多分辨率听觉倒谱系数和深度卷积神经网络的语音增强方法
CN108281146B (zh) * 2017-12-29 2020-11-13 歌尔科技有限公司 一种短语音说话人识别方法和装置
CN108257606A (zh) * 2018-01-15 2018-07-06 江南大学 一种基于自适应并行模型组合的鲁棒语音身份识别方法

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1543641A (zh) * 2001-06-19 2004-11-03 说话者识别系统
CN1540623A (zh) * 2003-11-04 2004-10-27 清华大学 一种门限自适应的语音检测系统
CN102024455A (zh) * 2009-09-10 2011-04-20 索尼株式会社 说话人识别系统及其方法
CN102201236A (zh) * 2011-04-06 2011-09-28 中国人民解放军理工大学 一种高斯混合模型和量子神经网络联合的说话人识别方法
CN102509547A (zh) * 2011-12-29 2012-06-20 辽宁工业大学 基于矢量量化的声纹识别方法及系统
US9368116B2 (en) * 2012-09-07 2016-06-14 Verint Systems Ltd. Speaker separation in diarization
CN109147798A (zh) * 2018-07-27 2019-01-04 北京三快在线科技有限公司 语音识别方法、装置、电子设备及可读存储介质

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112967730A (zh) * 2021-01-29 2021-06-15 北京达佳互联信息技术有限公司 语音信号的处理方法、装置、电子设备及存储介质
CN113707182A (zh) * 2021-09-17 2021-11-26 北京声智科技有限公司 声纹识别方法、装置、电子设备及存储介质

Also Published As

Publication number Publication date
CN109147798B (zh) 2023-06-09
CN109147798A (zh) 2019-01-04

Similar Documents

Publication Publication Date Title
WO2020020375A1 (zh) 语音处理方法、装置、电子设备及可读存储介质
CN107464564B (zh) 语音交互方法、装置及设备
WO2019214361A1 (zh) 语音信号中关键词的检测方法、装置、终端及存储介质
WO2021135628A1 (zh) 语音信号的处理方法、语音分离方法
CN111933112B (zh) 唤醒语音确定方法、装置、设备及介质
CN111696570B (zh) 语音信号处理方法、装置、设备及存储介质
EP3360137B1 (en) Identifying sound from a source of interest based on multiple audio feeds
KR20170097519A (ko) 음성 처리 방법 및 장치
CN110807325B (zh) 谓词识别方法、装置及存储介质
CN110047468B (zh) 语音识别方法、装置及存储介质
KR102653450B1 (ko) 전자 장치의 입력 음성에 대한 응답 방법 및 그 전자 장치
WO2021052306A1 (zh) 声纹特征注册
CN111863020B (zh) 语音信号处理方法、装置、设备及存储介质
CN113763933B (zh) 语音识别方法、语音识别模型的训练方法、装置和设备
WO2023071519A1 (zh) 音频信息的处理方法、电子设备、系统、产品及介质
US11915700B2 (en) Device for processing user voice input
WO2022199500A1 (zh) 一种模型训练方法、场景识别方法及相关设备
CN112233689B (zh) 音频降噪方法、装置、设备及介质
CN113539290A (zh) 语音降噪方法和装置
CN111933167A (zh) 电子设备的降噪方法、装置、存储介质及电子设备
CN114333774A (zh) 语音识别方法、装置、计算机设备及存储介质
CN115620728B (zh) 音频处理方法、装置、存储介质及智能眼镜
CN111341307A (zh) 语音识别方法、装置、电子设备及存储介质
US10803870B2 (en) Electronic device performing operation using voice command and method of operating electronic device
US20230048330A1 (en) In-Vehicle Speech Interaction Method and Device

Legal Events

121 EP: the EPO has been informed by WIPO that EP was designated in this application (Ref document number: 19841135; Country of ref document: EP; Kind code of ref document: A1)

NENP: Non-entry into the national phase (Ref country code: DE)

122 EP: PCT application non-entry in European phase (Ref document number: 19841135; Country of ref document: EP; Kind code of ref document: A1)