WO2022134587A1 - Voiceprint recognition method and apparatus, storage medium, and computer device - Google Patents

Voiceprint recognition method and apparatus, storage medium, and computer device (声纹识别方法、装置、存储介质及计算机设备)

Info

Publication number: WO2022134587A1
Application number: PCT/CN2021/109597
Authority: WO (WIPO, PCT)
Prior art keywords: voiceprint, data, recognition model, preset, sample
Other languages: English (en), French (fr)
Inventors: 王德勋, 徐国强
Original Assignee: 深圳壹账通智能科技有限公司
Priority date: 2020-12-22 (the priority date is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed)
Filing date: 2021-07-30
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2022134587A1

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/04 - Training, enrolment or model building
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification techniques
    • G10L17/18 - Artificial neural networks; Connectionist approaches
    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Definitions

  • the present application belongs to the technical field of artificial intelligence, and in particular relates to a voiceprint recognition method, device, storage medium and computer equipment.
  • Voiceprint recognition technology has gradually been applied in many fields, such as voiceprint locks, financial anti-fraud and intelligent customer service, where it continues to support effective decision-making.
  • to obtain a voiceprint recognition model with high recognition accuracy, in addition to the necessary data support during training, the hyperparameters of the model also need to be fine-tuned.
  • the hyperparameters of the voiceprint recognition model are usually adjusted manually, and then the voiceprint recognition is performed according to the adjusted model.
  • this hyperparameter setting method relies too much on the human experience of business personnel, which may lead to inaccurate setting of hyperparameters, thereby affecting the recognition accuracy of the voiceprint recognition model.
  • the present application provides a voiceprint recognition method, device, storage medium and computer equipment, which can improve the recognition accuracy of the voiceprint recognition model.
  • a voiceprint recognition method comprising:
  • the hyperparameters in the preset voiceprint recognition model are determined from the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the model is in its optimal convergence state.
  • a voiceprint recognition device comprising:
  • an acquisition unit used to acquire the voiceprint data of the user to be identified
  • an extraction unit configured to extract the voiceprint feature corresponding to the voiceprint data
  • the identification unit is configured to input the voiceprint feature into a preset voiceprint identification model for voiceprint identification, and obtain a voiceprint identification result corresponding to the user to be identified, wherein the hyperparameters in the preset voiceprint identification model are determined from the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the model is in its optimal convergence state.
  • a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the following steps are implemented:
  • the hyperparameters in the preset voiceprint recognition model are determined from the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the model is in its optimal convergence state.
  • a computer device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, the processor implementing the following steps when executing the program:
  • the hyperparameters in the preset voiceprint recognition model are determined from the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the model is in its optimal convergence state.
  • the voiceprint recognition method, device, storage medium and computer equipment provided by the present application can ensure the accuracy of hyperparameter setting in the voiceprint recognition model and improve the recognition accuracy of the voiceprint recognition model.
  • FIG. 1 shows a flowchart of a voiceprint recognition method provided by an embodiment of the present application
  • FIG. 2 shows a flowchart of another voiceprint recognition method provided by an embodiment of the present application
  • FIG. 3 shows a relationship graph provided by an embodiment of the present application
  • FIG. 4 shows a schematic structural diagram of a voiceprint recognition device provided by an embodiment of the present application
  • FIG. 5 shows a schematic structural diagram of another voiceprint recognition device provided by an embodiment of the present application.
  • FIG. 6 shows a schematic diagram of an entity structure of a computer device provided by an embodiment of the present application.
  • the hyperparameters of the voiceprint recognition model are usually adjusted manually, and then the voiceprint recognition is performed according to the adjusted model.
  • this hyperparameter setting method relies too much on the human experience of business personnel, which may lead to inaccurate hyperparameter setting, which in turn affects the recognition accuracy of the voiceprint recognition model.
  • an embodiment of the present application provides a voiceprint recognition method; as shown in FIG. 1, the method includes:
  • the user to be identified is a user whose identity needs to be confirmed through voiceprint recognition.
  • to solve the prior-art problem that manually adjusting the hyperparameters of the voiceprint recognition model results in low recognition accuracy, the embodiments of this application construct a cosine edge loss function; according to this loss function, the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability, are determined for the voiceprint recognition model in its optimal convergence state, and the hyperparameters in the voiceprint recognition model are then adjusted automatically according to that vector angle and its corresponding classification probability.
  • the voiceprint recognition technology can be applied in different scenarios.
  • the voiceprint lock identifies the voiceprint data of the user to be identified and judges, according to the voiceprint identification result, whether the user has unlocking authority: if the user has unlocking authority, the voiceprint lock issues the unlocking instruction; if the user to be identified does not have unlocking authority, the voiceprint lock does not issue the unlocking instruction.
  • a standard voiceprint collection device or terminal collects the voiceprint data of the user to be identified, so as to confirm the identity of the user to be identified according to the collected voiceprint data.
  • the Mel cepstral coefficient corresponding to the voiceprint data may be used as the voiceprint feature corresponding to the voiceprint data.
  • the preprocessing specifically includes pre-emphasis, framing and windowing, which flattens the voiceprint data of the user to be identified; that is, every N sampling points of the voiceprint data are combined into one observation unit (a frame), with continuity at the two ends of each frame.
  • after the voiceprint data of the user to be identified are preprocessed, a fast Fourier transform is applied to the preprocessed voiceprint data to obtain transformed voiceprint data, which are then fed into the Mel filter bank; the voiceprint energy after the Mel filters is computed, the Mel cepstral coefficients are calculated from the voiceprint energy corresponding to the voiceprint data, and the Mel cepstral coefficients are determined as the voiceprint feature corresponding to the voiceprint data of the user to be identified, so that voiceprint recognition can be performed according to that feature.
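  • As a concrete illustration of the preprocessing step above, the following sketch implements pre-emphasis, framing and Hamming windowing in Python with NumPy. The frame length, hop size and pre-emphasis coefficient are common defaults assumed for the example, not values prescribed by this application:

```python
import numpy as np

def preprocess(signal, frame_len=400, hop=160, alpha=0.97):
    """Pre-emphasis, framing and windowing, as described above.

    frame_len, hop and alpha are common defaults, not values from the
    application; "every N sampling points per frame" maps to frame_len.
    """
    # Pre-emphasis flattens the spectrum of the raw voiceprint data.
    emphasized = np.append(signal[0], signal[1:] - alpha * signal[:-1])
    # Combine every frame_len points into one observation unit (frame).
    n_frames = 1 + max(0, (len(emphasized) - frame_len) // hop)
    idx = (np.arange(frame_len)[None, :] +
           hop * np.arange(n_frames)[:, None])
    frames = emphasized[idx]
    # Windowing keeps the two ends of each frame continuous.
    return frames * np.hamming(frame_len)

# Example: 1 s of 16 kHz audio -> (98, 400) windowed frames.
frames = preprocess(np.random.default_rng(0).normal(size=16000))
print(frames.shape)
```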
  • the hyperparameters in the preset voiceprint recognition model are determined from the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in its optimal convergence state.
  • the preset voiceprint recognition model may be a neural network model; the extracted voiceprint features are input into the voiceprint recognition model for voiceprint recognition, with the specific formula y = softmax(Wx + b), where:
  • x is the voiceprint feature corresponding to the user to be identified
  • W and b are parameters in the neural network model
  • y gives the probabilities that the user to be identified is each of the users with different permissions.
  • for example, suppose the users with voiceprint-lock unlocking authority are A, B and C. If the neural network output gives the user to be identified a probability of 0.2 of being user A, 0.1 of being user B, 0.5 of being user C and 0.2 of being another user, the user to be identified can be taken to be user C, who has unlocking authority, and the voiceprint lock issues the unlocking instruction; if the output gives a probability of 0.2 of being user A, 0.1 of being user B, 0.2 of being user C and 0.5 of being another user, the user to be identified can be taken to be an unauthorized user, and the voiceprint lock does not issue the unlock command.
  • before the voiceprint recognition model is used for voiceprint recognition, it must be trained. Specifically, a large amount of sample voiceprint data is obtained from the sample voiceprint database and labelled according to the user each sample corresponds to, and the labelled sample voiceprint data are used to train the initial neural network model, building the preset voiceprint recognition model. During model training, the hyperparameters of the preset voiceprint recognition model must also be continuously optimized and adjusted to improve its recognition accuracy.
  • specifically, in the process of setting the hyperparameters, a cosine edge loss function can be constructed; according to this loss function, the vector angle between the sample voiceprint data and the category weights, and its corresponding classification probability value, are determined for the preset voiceprint recognition model in its optimal convergence state, and the hyperparameters of the model are then adjusted automatically from that angle and probability value, ensuring the accuracy of the hyperparameter settings and improving the recognition accuracy of the preset voiceprint recognition model.
  • the present application can obtain the voiceprint data of the user to be identified and extract the voiceprint features corresponding to the voiceprint data;
  • the voiceprint features are input into a preset voiceprint recognition model for voiceprint recognition to obtain a voiceprint recognition result corresponding to the user to be recognized, wherein the hyperparameters in the preset voiceprint recognition model are determined from the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, in the model's optimal convergence state.
  • determining this vector angle and its corresponding classification probability in the optimal convergence state makes it possible to adjust the hyperparameters in the voiceprint recognition model automatically while guaranteeing the accuracy of the hyperparameter settings, improving the recognition accuracy of the voiceprint recognition model.
  • the embodiment of the present application provides another voiceprint recognition method, as shown in FIG. 2 ,
  • the method includes:
  • a large amount of sample voiceprint data is stored in the preset sample voiceprint database.
  • to optimize the hyperparameters of the preset voiceprint recognition model, a cosine edge loss function needs to be constructed, so that the values of the hyperparameters can be set automatically from the cosine edge loss function and the sample voiceprint data; the specific formula of the cosine edge loss function in the embodiments of the present application is given below.
  • the cosine edge loss function is L_lmc = -(1/N) Σ_i log P_{i,y_i}, with P_{i,y_i} = e^{s(cos θ_{y_i} - m)} / ( e^{s(cos θ_{y_i} - m)} + Σ_{j≠y_i} e^{s·cos θ_j} ), where L_lmc is the cosine edge loss function, y_i is the true label of the i-th sample, P_{i,y_i} is the probability that the i-th sample is correctly predicted as y_i, θ_j is the vector angle between the input sample voiceprint data and the parameter weights of class j, θ_{y_i} is the vector angle between the input sample voiceprint data and the parameter weights of the true label y_i, and s and m are the hyperparameters that need to be set, the main optimization targets in the embodiments of the application.
  • step 202 specifically includes: according to the cosine edge loss function, drawing the relationship curves between the vector angle and the classification probability value for different values of the hyperparameters; and, based on the relationship curves, determining the vector angle between the sample voiceprint data and the category weights, and its corresponding classification probability value, when the preset voiceprint recognition model is in the optimal convergence state.
  • determining, based on the relationship curves, the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in an optimal convergence state includes: calculating the average value of the vector angle between the sample voiceprint data and the category weights; determining, from the curves, the classification probability values respectively corresponding to the sample voiceprint data when the vector angle tends to 0° and 90° in the optimal convergence state; and determining, from the curves, the classification probability value corresponding to the sample voiceprint data when the vector angle tends to the average value in the optimal convergence state.
  • the relationship curves of the hyperparameter s under different values are drawn as shown in Figure 3; the abscissa of each curve is θ_{y_i} and the ordinate is P_{i,y_i}.
  • from these curves, the relationship between the vector angle and the classification probability value can be read off: Figure 3 shows that when j ≠ y_i, θ_j basically stays around 90°, and when θ_{y_i} basically stays at 0, P_{i,y_i} stays around 1.
  • step 203 specifically includes: substituting the classification probability values respectively corresponding to the sample voiceprint data when the vector angle between the sample voiceprint data and the category weights tends to 0° and 90° into the cosine edge loss function, to estimate the first hyperparameter corresponding to the preset voiceprint recognition model; and substituting the classification probability value corresponding to the sample voiceprint data when that angle tends to the average value into the cosine edge loss function, to estimate the second hyperparameter corresponding to the preset voiceprint recognition model.
  • substituting θ_j ≈ 90° for all j ≠ y_i into the loss gives P_{i,y_i} = e^{s(cos θ_{y_i} - m)} / ( e^{s(cos θ_{y_i} - m)} + C - 1 ), where C is the total number of categories and C - 1 is recorded as B_i; when θ_{y_i} is close to 0, P_{i,y_i} is close to 1.
  • from the condition P_{i,y_i} = p (a floating-point number close to 1, also the upper bound of the curve, generally set to 0.999) as θ_{y_i} tends to 0, together with the condition P_{i,y_i} = 0.5 at θ_{y_i} = θ_med, an automatic assignment algorithm for the hyperparameters s and m is derived; solving the two conditions jointly gives s = ln( p / (1 - p) ) / ( 1 - cos θ_med ) and m = cos θ_med - ln(B_i) / s (a reconstruction consistent with the stated conditions, since the original equation images are not reproduced here).
  • both B_i and θ_med are related to the current batch of training samples and can be obtained directly from statistics. It should be noted that if the amount of sample voiceprint data is large, training can proceed in batches, gradually adjusting the values of the hyperparameters s and m to achieve the optimal effect.
  • a standard voiceprint collection device or terminal may be used to collect the voiceprint data of the user to be identified, so as to confirm the identity of the user to be identified according to the collected voiceprint data.
  • step 205 specifically includes: performing a fast Fourier transform on the voiceprint data to obtain transformed voiceprint data, and filtering the transformed voiceprint data to obtain the voiceprint energy corresponding to the voiceprint data; then calculating the Mel cepstral coefficients corresponding to the voiceprint data according to the voiceprint energy, and determining the Mel cepstral coefficients as the voiceprint feature corresponding to the voiceprint data.
  • before feature extraction, the voiceprint data need to be preprocessed; the preprocessing specifically includes pre-emphasis, framing and windowing, which flattens the voiceprint data of the user to be identified, that is, every N sampling points of the voiceprint data are combined into one observation unit (a frame), with continuity at the two ends of each frame. The preprocessed voiceprint data then undergo a fast Fourier transform to obtain transformed voiceprint data, which are input into the Mel filter bank; the speech energy after the Mel filters is computed, the Mel cepstral coefficients corresponding to the voiceprint data are calculated from that speech energy, and the Mel cepstral coefficients are determined as the voiceprint feature of the user to be identified.
  • the Mel cepstral coefficients are specifically calculated as C(n) = Σ_{m=1}^{M} log s(m) · cos( πn(m - 0.5)/M ), n = 1, 2, …, L, where s(m) represents the speech energy output by the voiceprint data after the m-th filter, M is the total number of filters, C(n) is the Mel cepstral coefficient, n represents the order of the Mel cepstral coefficient, and L can usually take 12 to 16. The speech energy s(m) is specifically calculated as s(m) = Σ_{k=0}^{K-1} |X(k)|² · H_m(k), where |X(k)|², the squared modulus of the spectrum of the voiceprint data, is the power spectrum, H_m(k) is the frequency response of the m-th filter, and K is the number of Fourier transform points. Therefore, according to the above formulas, the Mel cepstral coefficients corresponding to the voiceprint data of the user to be identified can be calculated and determined as the voiceprint feature corresponding to the voiceprint data, so that voiceprint recognition can be performed according to that feature.
  • step 206 specifically includes: inputting the voiceprint feature into a preset voiceprint recognition model for voiceprint recognition, to obtain the probability values that the user to be recognized is each of the users with different permissions; and determining the voiceprint recognition result corresponding to the user to be identified according to those probability values.
  • for example, suppose the users who have the right to unlock the voiceprint lock are a, b and c. If the output of the preset voiceprint recognition model gives the user to be identified a probability of 0.5 of being user a, 0.1 of being user b, 0.2 of being user c and 0.2 of being another user, the user to be identified can be taken to be user a, who has unlocking authority, and the voiceprint lock issues the unlocking command; if the output gives a probability of 0.2 of being user a, 0.1 of being user b, 0.2 of being user c and 0.5 of being another user, the user to be identified can be taken to be a user without permission, and the voiceprint lock does not issue the unlock command.
  • the present application can obtain the voiceprint data of the user to be identified and extract the voiceprint features corresponding to the voiceprint data; at the same time, the voiceprint features are input into a preset voiceprint recognition model for voiceprint recognition, and a voiceprint recognition result corresponding to the user to be identified is obtained, wherein the hyperparameters in the preset voiceprint recognition model are determined from the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, in the optimal convergence state of the preset voiceprint recognition model.
  • determining this vector angle and its corresponding classification probability makes it possible to adjust the hyperparameters in the voiceprint recognition model automatically while guaranteeing the accuracy of the hyperparameter settings, improving the recognition accuracy of the voiceprint recognition model.
  • an embodiment of the present application provides a voiceprint recognition device.
  • the device includes: an acquisition unit 31 , an extraction unit 32 , and an identification unit 33 .
  • the obtaining unit 31 may be used to obtain the voiceprint data of the user to be identified.
  • the obtaining unit 31 is the main functional module in the device for obtaining the voiceprint data of the user to be identified.
  • the extraction unit 32 may be configured to extract the voiceprint feature corresponding to the voiceprint data.
  • the extraction unit 32 is a main functional module in the device for extracting the voiceprint feature corresponding to the voiceprint data, and is also a core module.
  • the identification unit 33 can be configured to input the voiceprint features into a preset voiceprint identification model for voiceprint identification, and obtain a voiceprint identification result corresponding to the user to be identified, wherein the preset voiceprint identification
  • the hyperparameters in the model are determined by the vector angle between the sample voiceprint data and the category weight and the corresponding classification probability value of the preset voiceprint recognition model in the optimal convergence state.
  • the recognition unit 33 is the main functional module, and also a core module, in the device for inputting the voiceprint features into a preset voiceprint recognition model for voiceprint recognition and obtaining a voiceprint recognition result corresponding to the user to be recognized.
  • the extraction unit 32 includes: a filtering module 321 and a computing module 322 .
  • the filtering module 321 can be used to perform fast Fourier transformation on the voiceprint data to obtain transformed voiceprint data, and perform filtering processing on the transformed voiceprint data to obtain the voiceprint data Corresponding voiceprint energy.
  • the calculation module 322 can be configured to calculate the Mel cepstral coefficient corresponding to the voiceprint data according to the voiceprint energy, and determine the Mel cepstral coefficient as the voiceprint corresponding to the voiceprint data feature.
  • in order to determine the voiceprint recognition result corresponding to the user to be recognized, the recognition unit 33 includes: a recognition module 331 and a determination module 332.
  • the identification module 331 may be configured to input the voiceprint feature into a preset voiceprint identification model for voiceprint identification, and obtain a probability value that the user to be identified is a user with different permissions.
  • the determining module 332 may be configured to determine a voiceprint recognition result corresponding to the user to be identified according to the probability value that the user to be identified is a user with different rights.
  • the apparatus in order to automatically adjust the hyperparameters in the voiceprint recognition model, the apparatus further includes: a determining unit 34 .
  • the obtaining unit 31 may also be configured to obtain sample voiceprint data, and construct a cosine edge loss function corresponding to the preset voiceprint recognition model according to the sample voiceprint data.
  • the determining unit 34 may be configured to determine, based on the cosine edge loss function, the vector angle between the sample voiceprint data and the class weight and the corresponding classification when the preset voiceprint recognition model is in the best convergence state. probability value.
  • the determining unit 34 may also be configured to determine hyperparameters corresponding to the preset voiceprint recognition model according to the vector angle and the classification probability value.
  • the determining unit 34 includes: a drawing module 341 and a determination module 342.
  • the drawing module 341 may be configured to draw the relationship curve between the vector angle and the classification probability value of the hyperparameter under different values according to the cosine edge loss function.
  • the determining module 342 may be configured to determine, based on the relationship curve, the vector angle between the sample voiceprint data and the category weight and the corresponding classification probability value when the preset voiceprint recognition model is in the best convergence state. .
  • the determining module 342 includes: a calculation submodule and a determination submodule.
  • the calculation sub-module can be used to calculate the average value of the vector angle between the sample voiceprint data and the category weight.
  • the determining sub-module can be used to determine, according to the relationship curve, that the vector angle between the sample voiceprint data and the category weight tends to be 0° and 90° in the optimal convergence state of the preset voiceprint recognition model. , the classification probability values corresponding to the sample voiceprint data respectively.
  • the determining sub-module can also be used to determine, according to the relationship curve, that the vector angle between the sample voiceprint data and the category weights tends to the average value in the optimal convergence state of the preset voiceprint recognition model. , the classification probability value corresponding to the sample voiceprint data.
  • the hyperparameters include a first hyperparameter and a second hyperparameter.
  • the determining unit 34 further includes: a first estimation module 343 and a second estimation module 344 .
  • the first estimation module 343 can be configured to substitute the classification probability values respectively corresponding to the sample voiceprint data when the vector angle between the sample voiceprint data and the category weights tends to 0° and 90° into the cosine edge loss function, to estimate the first hyperparameter corresponding to the preset voiceprint recognition model.
  • the second estimation module 344 can be configured to substitute the classification probability value corresponding to the sample voiceprint data when that vector angle tends to the average value into the cosine edge loss function, to estimate the second hyperparameter corresponding to the preset voiceprint recognition model.
  • an embodiment of the present application further provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the following steps are implemented: obtaining the voiceprint data of the user to be identified; extracting the voiceprint feature corresponding to the voiceprint data; and inputting the voiceprint feature into a preset voiceprint recognition model for voiceprint recognition to obtain a voiceprint recognition result corresponding to the user to be identified, wherein,
  • the hyperparameters in the preset voiceprint recognition model are determined by the vector angle between the sample voiceprint data and the category weight and the corresponding classification probability value of the preset voiceprint recognition model in the optimal convergence state.
  • the computer-readable storage medium may be non-volatile or volatile.
  • an embodiment of the present application further provides a physical structure diagram of a computer device.
  • the computer device includes: a processor 41, a memory 42, and a computer program stored on the memory 42 and executable on the processor, the memory 42 and the processor 41 both being arranged on a bus 43. When the processor 41 executes the program, the following steps are implemented: obtaining the voiceprint data of the user to be identified; extracting the voiceprint feature corresponding to the voiceprint data; and inputting the voiceprint feature into a preset voiceprint recognition model for voiceprint recognition to obtain a voiceprint recognition result corresponding to the user to be identified, wherein the hyperparameters in the preset voiceprint recognition model are determined from the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, in the model's optimal convergence state.
  • through the technical solution of the present application, the voiceprint data of the user to be identified can be obtained and the voiceprint features corresponding to the voiceprint data extracted; at the same time, the voiceprint features are input into the preset voiceprint recognition model for voiceprint recognition to obtain a voiceprint recognition result corresponding to the user to be identified, wherein the hyperparameters in the preset voiceprint recognition model are determined from the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability values, in the model's optimal convergence state.
  • the modules or steps of the present application can be implemented by a general-purpose computing device; they can be centralized on a single computing device or distributed over a network composed of multiple computing devices. Alternatively, they may be implemented in program code executable by a computing device, so that they may be stored in a storage device and executed by the computing device; in some cases, the steps shown or described may be performed in a different order than here, or they may be fabricated separately into individual integrated-circuit modules, or multiple modules or steps among them may be fabricated into a single integrated-circuit module.
  • the present application is not limited to any particular combination of hardware and software.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Collating Specific Patterns (AREA)
  • Image Analysis (AREA)

Abstract

A voiceprint recognition method and apparatus, a computer device, and a storage medium are provided. The method includes: acquiring voiceprint data of a user to be identified (101); extracting a voiceprint feature corresponding to the voiceprint data (102); and inputting the voiceprint feature into a preset voiceprint recognition model for voiceprint recognition to obtain a voiceprint recognition result corresponding to the user to be identified, wherein hyperparameters in the preset voiceprint recognition model are determined from the vector angle between sample voiceprint data and category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in an optimal convergence state (103). By constructing a cosine edge loss function, the method automatically adjusts the hyperparameters of the voiceprint recognition model, ensuring the accuracy of the hyperparameter settings and improving the recognition accuracy of the voiceprint recognition model.

Description

Voiceprint recognition method and apparatus, storage medium, and computer device
This application claims priority to the Chinese patent application filed with the China Patent Office on December 22, 2020, with application number 202011526763.1 and the title "Voiceprint recognition method and apparatus, storage medium, and computer device", the entire contents of which are incorporated into this application by reference.
Technical Field
This application belongs to the technical field of artificial intelligence, and in particular relates to a voiceprint recognition method and apparatus, a storage medium, and computer equipment.
Background
Voiceprint recognition technology has gradually been applied in many fields, such as voiceprint locks, financial anti-fraud and intelligent customer service, where it continues to support effective decision-making. To obtain a voiceprint recognition model with high recognition accuracy, in addition to the necessary data support during training, the hyperparameters of the model also need to be finely adjusted.
At present, in the voiceprint recognition process, the hyperparameters of the voiceprint recognition model are usually adjusted manually, and voiceprint recognition is then performed with the adjusted model. However, the applicant has realized that this way of setting hyperparameters relies too heavily on the personal experience of business personnel, which is likely to make the hyperparameter settings insufficiently accurate and in turn affect the recognition accuracy of the voiceprint recognition model.
Technical Problem
This application provides a voiceprint recognition method and apparatus, a storage medium, and computer equipment, which can improve the recognition accuracy of the voiceprint recognition model.
Technical Solution
According to a first aspect of this application, a voiceprint recognition method is provided, including:
acquiring voiceprint data of a user to be identified;
extracting a voiceprint feature corresponding to the voiceprint data;
inputting the voiceprint feature into a preset voiceprint recognition model for voiceprint recognition to obtain a voiceprint recognition result corresponding to the user to be identified, wherein the hyperparameters in the preset voiceprint recognition model are determined from the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in its optimal convergence state.
According to a second aspect of this application, a voiceprint recognition apparatus is provided, including:
an acquisition unit, configured to acquire voiceprint data of a user to be identified;
an extraction unit, configured to extract a voiceprint feature corresponding to the voiceprint data;
a recognition unit, configured to input the voiceprint feature into a preset voiceprint recognition model for voiceprint recognition to obtain a voiceprint recognition result corresponding to the user to be identified, wherein the hyperparameters in the preset voiceprint recognition model are determined from the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in its optimal convergence state.
According to a third aspect of this application, a computer-readable storage medium is provided, on which a computer program is stored; when the program is executed by a processor, the following steps are implemented:
acquiring voiceprint data of a user to be identified;
extracting a voiceprint feature corresponding to the voiceprint data;
inputting the voiceprint feature into a preset voiceprint recognition model for voiceprint recognition to obtain a voiceprint recognition result corresponding to the user to be identified, wherein the hyperparameters in the preset voiceprint recognition model are determined from the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in its optimal convergence state.
According to a fourth aspect of this application, a computer device is provided, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the following steps are implemented:
acquiring voiceprint data of a user to be identified;
extracting a voiceprint feature corresponding to the voiceprint data;
inputting the voiceprint feature into a preset voiceprint recognition model for voiceprint recognition to obtain a voiceprint recognition result corresponding to the user to be identified, wherein the hyperparameters in the preset voiceprint recognition model are determined from the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in its optimal convergence state.
Beneficial Effects
The voiceprint recognition method and apparatus, storage medium, and computer device provided by this application can ensure the accuracy of the hyperparameter settings in the voiceprint recognition model and improve the recognition accuracy of the voiceprint recognition model.
Brief Description of the Drawings
FIG. 1 shows a flowchart of a voiceprint recognition method provided by an embodiment of this application;
FIG. 2 shows a flowchart of another voiceprint recognition method provided by an embodiment of this application;
FIG. 3 shows a relationship graph provided by an embodiment of this application;
FIG. 4 shows a schematic structural diagram of a voiceprint recognition apparatus provided by an embodiment of this application;
FIG. 5 shows a schematic structural diagram of another voiceprint recognition apparatus provided by an embodiment of this application;
FIG. 6 shows a schematic diagram of the physical structure of a computer device provided by an embodiment of this application.
Embodiments of This Application
This application is described in detail below with reference to the accompanying drawings and in combination with the embodiments. It should be noted that, where no conflict arises, the embodiments of this application and the features in those embodiments can be combined with one another.
At present, in the voiceprint recognition process, the hyperparameters of the voiceprint recognition model are usually adjusted manually, and voiceprint recognition is then performed with the adjusted model. However, this way of setting hyperparameters relies too heavily on the personal experience of business personnel, which is likely to make the hyperparameter settings insufficiently accurate and in turn affect the recognition accuracy of the voiceprint recognition model.
To solve the above problem, an embodiment of this application provides a voiceprint recognition method; as shown in FIG. 1, the method includes:
101. Acquire voiceprint data of a user to be identified.
Here, the user to be identified is a user whose identity needs to be confirmed through voiceprint recognition. To solve the prior-art problem that manually adjusting the hyperparameters of the voiceprint recognition model results in low recognition accuracy, the embodiments of this application construct a cosine edge loss function; according to this loss function, the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability, are determined for the voiceprint recognition model in its optimal convergence state, and the hyperparameters in the voiceprint recognition model are then automatically adjusted according to that vector angle and its corresponding classification probability.
For the embodiments of this application, voiceprint recognition technology can be applied in different scenarios. For example, a voiceprint lock recognizes the voiceprint data of a user to be identified and judges, according to the voiceprint recognition result, whether that user has unlocking authority: if the user to be identified has unlocking authority, the voiceprint lock issues the unlocking instruction; if the user to be identified does not have unlocking authority, the voiceprint lock does not issue the unlocking instruction. Specifically, before voiceprint recognition is performed, a standard voiceprint collection device or terminal can be used to collect the voiceprint data of the user to be identified, so that the user's identity can be confirmed from the collected voiceprint data.
102. Extract the voiceprint feature corresponding to the voiceprint data.
For the embodiments of this application, the Mel cepstral coefficients corresponding to the voiceprint data can be used as the voiceprint feature corresponding to the voiceprint data. Specifically, before feature extraction, the voiceprint data must be preprocessed; the preprocessing specifically includes pre-emphasis, framing and windowing, which flattens the voiceprint data of the user to be identified, that is, every N sampling points of the voiceprint data are combined into one observation unit (a frame), with continuity at the two ends of each frame. After the voiceprint data of the user to be identified are preprocessed, a fast Fourier transform is applied to the preprocessed data to obtain transformed voiceprint data, which are then fed into the Mel filter bank; the voiceprint energy of the transformed voiceprint data after the Mel filters is computed, the Mel cepstral coefficients corresponding to the voiceprint data are calculated from that voiceprint energy, and these Mel cepstral coefficients are determined as the voiceprint feature corresponding to the voiceprint data of the user to be identified, so that voiceprint recognition can be performed according to that feature.
103. Input the voiceprint feature into a preset voiceprint recognition model for voiceprint recognition to obtain the voiceprint recognition result corresponding to the user to be identified.
Here, the hyperparameters in the preset voiceprint recognition model are determined from the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in its optimal convergence state. The preset voiceprint recognition model may specifically be a neural network model; the extracted voiceprint feature is input into the voiceprint recognition model for voiceprint recognition, with the specific formula:
y = softmax(Wx + b)
where x is the voiceprint feature corresponding to the user to be identified, W and b are parameters of the neural network model, and y gives the probabilities that the user to be identified is each of the users with different authorities. For example, suppose the users with voiceprint-lock unlocking authority are A, B and C. If the neural network output gives the user to be identified a probability of 0.2 of being user A, 0.1 of being user B, 0.5 of being user C and 0.2 of being another user, the user to be identified can be taken to be user C, who has unlocking authority, and the voiceprint lock issues the unlocking instruction; if the output gives a probability of 0.2 of being user A, 0.1 of being user B, 0.2 of being user C and 0.5 of being another user, the user to be identified can be taken to be a user without authority, and the voiceprint lock does not issue the unlocking instruction.
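To make this scoring step concrete, the minimal sketch below applies y = softmax(Wx + b) and takes the most probable enrolled user; the user list, the shapes and the random parameters are assumptions made for illustration only, and the argmax rule simply mirrors the probability comparison described above:

```python
import numpy as np

def softmax(z):
    z = z - z.max()                 # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical enrolled classes: users A, B, C plus an "other" class.
USERS = ["A", "B", "C", "other"]

def identify(x, W, b):
    """x: voiceprint feature vector; W, b: trained model parameters."""
    y = softmax(W @ x + b)          # y = softmax(Wx + b)
    return USERS[int(np.argmax(y))], y

# Toy usage with random parameters, for illustration only.
rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(4, 8)), rng.normal(size=4), rng.normal(size=8)
user, probs = identify(x, W, b)
print(user, np.round(probs, 3))
```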
In addition, before the voiceprint recognition model is used for voiceprint recognition, it must be trained. Specifically, a large amount of sample voiceprint data is obtained from a sample voiceprint database and labelled according to the user each sample corresponds to; the labelled sample voiceprint data are used to train an initial neural network model, building the preset voiceprint recognition model. During model training, the hyperparameters of the preset voiceprint recognition model must also be continuously optimized and adjusted to improve its recognition accuracy. Specifically, in the hyperparameter-setting process, a cosine edge loss function can be constructed; according to this loss function, the vector angle between the sample voiceprint data and the category weights, and its corresponding classification probability value, are determined for the preset voiceprint recognition model in its optimal convergence state, and the hyperparameters of the model are then automatically adjusted from that angle and probability value, ensuring the accuracy of the hyperparameter settings and improving the recognition accuracy of the preset voiceprint recognition model.
Compared with the current practice of manually adjusting the hyperparameters of the voiceprint recognition model, the voiceprint recognition method provided by this embodiment acquires the voiceprint data of the user to be identified and extracts the corresponding voiceprint features; at the same time, the voiceprint features are input into a preset voiceprint recognition model for voiceprint recognition, and the voiceprint recognition result corresponding to the user to be identified is obtained, where the hyperparameters in the preset model are determined from the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, in the model's optimal convergence state. By determining this vector angle and its corresponding classification probability in the optimal convergence state, the hyperparameters in the voiceprint recognition model can be adjusted automatically, the accuracy of the hyperparameter settings in the model is guaranteed, and the recognition accuracy of the voiceprint recognition model is improved.
Further, to better explain the process of setting the hyperparameters in the above voiceprint recognition model, and as a refinement and extension of the above embodiment, an embodiment of this application provides another voiceprint recognition method; as shown in FIG. 2, the method includes:
201. Acquire sample voiceprint data, and construct the cosine edge loss function corresponding to the preset voiceprint recognition model from the sample voiceprint data.
For the embodiments of this application, a large amount of sample voiceprint data is stored in the preset sample voiceprint database. To optimize and adjust the hyperparameters in the preset voiceprint recognition model, a cosine edge loss function must be constructed, so that the values of the hyperparameters can be set automatically from the cosine edge loss function and the sample voiceprint data. In the embodiments of this application, the specific formulas of the cosine edge loss function are:
L_lmc = -(1/N) Σ_{i=1}^{N} log P_{i,y_i}
P_{i,y_i} = e^{s(cos θ_{y_i} - m)} / ( e^{s(cos θ_{y_i} - m)} + Σ_{j≠y_i} e^{s·cos θ_j} )
where L_lmc is the cosine edge loss function, y_i is the true label of the i-th sample, P_{i,y_i} is the probability that the i-th sample is correctly predicted as y_i, θ_j is the vector angle between the input sample voiceprint data and the parameter weights of class j, θ_{y_i} is the vector angle between the input sample voiceprint data and the parameter weights of the true label y_i, and s and m are the two hyperparameters to be set, the main optimization targets in the embodiments of this application.
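A minimal NumPy rendering of this cosine edge (large-margin cosine) loss is sketched below; the pure-NumPy formulation and the shape conventions are assumptions made for illustration:

```python
import numpy as np

def lmc_loss(features, weights, labels, s, m):
    """Cosine edge loss L_lmc as defined above.

    features: (N, D) sample embeddings; weights: (C, D) class weights;
    labels: (N,) true class indices y_i; s, m: the two hyperparameters.
    """
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    w = weights / np.linalg.norm(weights, axis=1, keepdims=True)
    cos = f @ w.T                                 # cos(theta_j) for every class j
    n = np.arange(len(labels))
    logits = s * cos
    logits[n, labels] = s * (cos[n, labels] - m)  # margin on the true class only
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    p = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return -np.log(p[n, labels]).mean()           # L_lmc = -(1/N) sum log P_{i,y_i}

# Toy usage: 4 samples, 16-dim embeddings, 10 classes.
rng = np.random.default_rng(1)
loss = lmc_loss(rng.normal(size=(4, 16)), rng.normal(size=(10, 16)),
                np.array([0, 3, 7, 7]), s=24.0, m=0.4)
print(round(float(loss), 4))
```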
202. Based on the cosine edge loss function, determine the vector angle between the sample voiceprint data and the category weights, and its corresponding classification probability value, when the preset voiceprint recognition model is in its optimal convergence state.
For the embodiments of this application, to determine this vector angle and its corresponding classification probability value, step 202 specifically includes: according to the cosine edge loss function, drawing the relationship curves between the vector angle and the classification probability value for different values of the hyperparameters; and, based on the relationship curves, determining the vector angle between the sample voiceprint data and the category weights, and its corresponding classification probability value, when the preset voiceprint recognition model is in its optimal convergence state. Further, determining the angle and probability value based on the relationship curves includes: calculating the average value of the vector angle between the sample voiceprint data and the category weights; determining, from the curves, the classification probability values respectively corresponding to the sample voiceprint data when the angle tends to 0° and 90° in the optimal convergence state; and determining, from the curves, the classification probability value corresponding to the sample voiceprint data when the angle tends to the average value in the optimal convergence state.
Specifically, from the constructed cosine edge loss function, the relationship curves for different values of the hyperparameter s are drawn as shown in FIG. 3; the abscissa of the graph is θ_{y_i} and the ordinate is P_{i,y_i}. From this graph, the relationship between the vector angle and the classification probability value can be read off: as FIG. 3 shows, when j ≠ y_i, θ_j remains essentially around 90°, and when θ_{y_i} remains essentially at 0, P_{i,y_i} stays around 1. At the same time, when the convergence state of the preset voiceprint recognition model is at its best and θ_{y_i} equals the median or mean value θ_med of all current sample voiceprint data, P_{i,y_i} has its largest gradient at θ_{y_i} = θ_med; FIG. 3 shows that the value of P_{i,y_i} there is 0.5. The conclusions read off from the relationship graph can then be substituted into the cosine edge loss function to estimate the values of the hyperparameters s and m.
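The shape of the curves in FIG. 3 can be reproduced directly from the formula by holding the C - 1 non-target angles near 90° (cos θ_j ≈ 0), as observed above; the scale and margin values swept below are illustrative assumptions:

```python
import numpy as np

def p_correct(theta_deg, s, m, C):
    """P_{i,y_i} as a function of the target angle theta_{y_i}, with the
    C - 1 non-target angles held near 90 degrees (cos ~ 0)."""
    num = np.exp(s * (np.cos(np.deg2rad(theta_deg)) - m))
    return num / (num + (C - 1))

# Sweep theta for a few scale values to see the transition from ~1 to ~0.
thetas = np.linspace(0.0, 90.0, 10)
for s in (8.0, 16.0, 32.0):                  # illustrative values of s
    print(s, np.round(p_correct(thetas, s=s, m=0.2, C=1000), 3))
```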
203. Determine the hyperparameters corresponding to the preset voiceprint recognition model from the vector angle and the classification probability value.
For the embodiments of this application, to estimate the first hyperparameter s and the second hyperparameter m of the preset voiceprint recognition model from the vector angle and the classification probability value, step 203 specifically includes: substituting the classification probability values respectively corresponding to the sample voiceprint data when the vector angle between the sample voiceprint data and the category weights tends to 0° and 90° into the cosine edge loss function, to estimate the first hyperparameter; and substituting the classification probability value corresponding to the sample voiceprint data when that angle tends to the average value into the cosine edge loss function, to estimate the second hyperparameter.
Specifically, as shown above, when j ≠ y_i, θ_j remains essentially around 90°, so each of the C - 1 non-target terms satisfies e^{s·cos θ_j} ≈ e^0 = 1; substituting this into the formula above gives:
P_{i,y_i} = e^{s(cos θ_{y_i} - m)} / ( e^{s(cos θ_{y_i} - m)} + C - 1 )
where C is the total number of categories and C - 1 is recorded as B_i. In addition, when θ_{y_i} approaches 0, cos θ_{y_i} approaches 1 and P_{i,y_i} approaches 1; substituting gives:
P_{i,y_i} = e^{s(1 - m)} / ( e^{s(1 - m)} + B_i )
Assuming P_{i,y_i} is a floating-point number p close to 1, for example 0.999 or 0.99, this simplifies to the condition s(1 - m) = ln( p·B_i / (1 - p) ). At the same time, when the convergence state of the voiceprint recognition model is at its best and θ_{y_i} equals the median or mean value θ_med of all current sample voiceprint data, P_{i,y_i} has its largest gradient there, and, as can be seen from the graph, its value is then 0.5, which gives the condition s(cos θ_med - m) = ln B_i. Solving the two conditions jointly yields the simplified expressions for s and m (a reconstruction consistent with the conditions stated above):
s = ln( p / (1 - p) ) / ( 1 - cos θ_med )
m = cos θ_med - ln( B_i ) / s
In summary, an automatic assignment algorithm for the hyperparameters s and m is derived, where p is a floating-point number close to 1, which also represents the upper bound of the curve and is generally set to 0.999; B_i and θ_med are both related to the current batch of training samples and can be obtained directly from statistics. It should be noted that if the amount of sample voiceprint data is large, training can proceed in batches, gradually adjusting the values of the hyperparameters s and m to achieve the optimal effect.
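The automatic assignment just derived can be sketched as follows; the closed forms for s and m are the reconstruction obtained above from the two conditions, and the example angles, the class count and the choice of median over mean are illustrative assumptions:

```python
import numpy as np

def auto_set_hyperparams(target_angles_deg, num_classes, p=0.999):
    """Estimate s and m from current-batch statistics by jointly solving
    the two conditions stated above:
        P(theta_{y_i} -> 0)        = p    (upper bound of the curve)
        P(theta_{y_i} = theta_med) = 0.5  (steepest point of the curve)
    """
    b_i = num_classes - 1                          # B_i = C - 1
    theta_med = np.median(np.deg2rad(target_angles_deg))
    s = np.log(p / (1.0 - p)) / (1.0 - np.cos(theta_med))
    m = np.cos(theta_med) - np.log(b_i) / s
    return s, m

# Example: target angles (degrees) from one batch, 1000 enrolled classes.
angles = np.array([35.0, 42.0, 50.0, 47.0, 39.0, 55.0])
s, m = auto_set_hyperparams(angles, num_classes=1000)
print(round(float(s), 2), round(float(m), 3))    # e.g. s ~ 24, m ~ 0.43
```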
204. Acquire the voiceprint data of the user to be identified.
For the embodiments of this application, before voiceprint recognition is performed, a standard voiceprint collection device or terminal can be used to collect the voiceprint data of the user to be identified, so that the user's identity can be confirmed from the collected voiceprint data.
205. Extract the voiceprint feature corresponding to the voiceprint data.
For the embodiments of this application, to obtain the voiceprint feature of the user to be identified, step 205 specifically includes: performing a fast Fourier transform on the voiceprint data to obtain transformed voiceprint data, and filtering the transformed voiceprint data to obtain the voiceprint energy corresponding to the voiceprint data; then calculating the Mel cepstral coefficients corresponding to the voiceprint data from that voiceprint energy, and determining the Mel cepstral coefficients as the voiceprint feature corresponding to the voiceprint data.
Specifically, before feature extraction, the voiceprint data must be preprocessed; the preprocessing specifically includes pre-emphasis, framing and windowing, which flattens the voiceprint data of the user to be identified, that is, every N sampling points of the voiceprint data are combined into one observation unit (a frame), with continuity at the two ends of each frame. After the voiceprint data are preprocessed, a fast Fourier transform is applied to the preprocessed data to obtain transformed voiceprint data, which are then fed into the Mel filter bank; the speech energy of the transformed voiceprint data after the Mel filters is computed, the Mel cepstral coefficients corresponding to the voiceprint data are calculated from that speech energy, and the Mel cepstral coefficients are determined as the voiceprint feature of the user to be identified. The Mel cepstral coefficients are specifically calculated as:
C(n) = Σ_{m=1}^{M} log s(m) · cos( πn(m - 0.5) / M ), n = 1, 2, …, L
where s(m) is the speech energy output by the voiceprint data after the m-th filter, M is the total number of filters, C(n) is the Mel cepstral coefficient, n is the order of the Mel cepstral coefficient, and L can usually be taken as 12 to 16. The speech energy s(m) is specifically calculated as:
s(m) = Σ_{k=0}^{K-1} |X(k)|² · H_m(k)
where |X(k)|², the squared modulus of the spectrum of the voiceprint data, is the power spectrum of the speech data, H_m(k) is the frequency response of the m-th filter, and K is the number of Fourier transform points. With the above formulas, the Mel cepstral coefficients corresponding to the voiceprint data of the user to be identified can be calculated and determined as the corresponding voiceprint feature, so that voiceprint recognition can be performed according to that feature.
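The two formulas combine into a short NumPy sketch for a single preprocessed frame; building the triangular Mel filter bank is omitted, and the toy filter bank in the usage example is a stand-in rather than a real Mel filter bank:

```python
import numpy as np

def mfcc_from_frame(frame, mel_fbank, L=13):
    """Mel cepstral coefficients C(1..L) for one preprocessed frame.

    mel_fbank: (M, K) matrix of filter responses H_m(k); frame: length-K
    windowed samples. The log on s(m) follows the standard MFCC recipe.
    """
    K = len(frame)
    power = np.abs(np.fft.fft(frame, n=K)) ** 2       # |X(k)|^2 power spectrum
    s_m = mel_fbank @ power                           # s(m) = sum_k |X(k)|^2 H_m(k)
    M = mel_fbank.shape[0]
    ms = np.arange(1, M + 1)
    # C(n) = sum_m log s(m) * cos(pi * n * (m - 0.5) / M), n = 1..L
    return np.array([np.sum(np.log(s_m + 1e-10) *
                            np.cos(np.pi * n * (ms - 0.5) / M))
                     for n in range(1, L + 1)])

# Toy usage: a random frame and a random non-negative stand-in filter bank.
rng = np.random.default_rng(0)
coeffs = mfcc_from_frame(rng.normal(size=512), np.abs(rng.normal(size=(26, 512))))
print(coeffs.shape)                                   # (13,)
```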
206. Input the voiceprint feature into the preset voiceprint recognition model for voiceprint recognition to obtain the voiceprint recognition result corresponding to the user to be identified.
Here, the hyperparameters in the preset voiceprint recognition model are determined from the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, in the model's optimal convergence state. For the embodiments of this application, to determine the voiceprint recognition result of the user to be identified, step 206 specifically includes: inputting the voiceprint feature into the preset voiceprint recognition model for voiceprint recognition, to obtain the probability values that the user to be identified is each of the users with different authorities; and determining the voiceprint recognition result corresponding to the user to be identified from those probability values.
For example, suppose the users with voiceprint-lock unlocking authority are a, b and c. If the output of the preset voiceprint recognition model gives the user to be identified a probability of 0.5 of being user a, 0.1 of being user b, 0.2 of being user c and 0.2 of being another user, the user to be identified can be taken to be user a, who has unlocking authority, and the voiceprint lock issues the unlocking instruction; if the output of the voiceprint recognition model gives a probability of 0.2 of being user a, 0.1 of being user b, 0.2 of being user c and 0.5 of being another user, the user to be identified can be taken to be a user without authority, and the voiceprint lock does not issue the unlocking instruction.
Compared with the current practice of manually adjusting the hyperparameters of the voiceprint recognition model, the other voiceprint recognition method provided by this embodiment likewise acquires the voiceprint data of the user to be identified and extracts the corresponding voiceprint features; at the same time, the voiceprint features are input into a preset voiceprint recognition model for voiceprint recognition, and the voiceprint recognition result corresponding to the user to be identified is obtained, where the hyperparameters in the preset model are determined from the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, in the model's optimal convergence state. By determining this vector angle and its corresponding classification probability in the optimal convergence state, the hyperparameters in the voiceprint recognition model can be adjusted automatically, the accuracy of the hyperparameter settings in the model is guaranteed, and the recognition accuracy of the voiceprint recognition model is improved.
Further, as a specific implementation of FIG. 1, an embodiment of this application provides a voiceprint recognition apparatus; as shown in FIG. 4, the apparatus includes: an acquisition unit 31, an extraction unit 32 and a recognition unit 33.
The acquisition unit 31 can be used to acquire the voiceprint data of the user to be identified. The acquisition unit 31 is the main functional module in this apparatus for acquiring the voiceprint data of the user to be identified.
The extraction unit 32 can be used to extract the voiceprint feature corresponding to the voiceprint data. The extraction unit 32 is the main functional module, and also a core module, in this apparatus for extracting the voiceprint feature corresponding to the voiceprint data.
The recognition unit 33 can be used to input the voiceprint feature into a preset voiceprint recognition model for voiceprint recognition to obtain the voiceprint recognition result corresponding to the user to be identified, wherein the hyperparameters in the preset voiceprint recognition model are determined from the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, in the model's optimal convergence state. The recognition unit 33 is the main functional module, and also a core module, in this apparatus for inputting the voiceprint feature into the preset voiceprint recognition model for voiceprint recognition and obtaining the voiceprint recognition result corresponding to the user to be identified.
In a specific application scenario, to extract the voiceprint feature corresponding to the voiceprint data, as shown in FIG. 5, the extraction unit 32 includes: a filtering module 321 and a calculation module 322.
The filtering module 321 can be used to perform a fast Fourier transform on the voiceprint data to obtain transformed voiceprint data, and to filter the transformed voiceprint data to obtain the voiceprint energy corresponding to the voiceprint data.
The calculation module 322 can be used to calculate the Mel cepstral coefficients corresponding to the voiceprint data from the voiceprint energy, and to determine the Mel cepstral coefficients as the voiceprint feature corresponding to the voiceprint data.
In a specific application scenario, to determine the voiceprint recognition result corresponding to the user to be identified, the recognition unit 33 includes: a recognition module 331 and a determination module 332.
The recognition module 331 can be used to input the voiceprint feature into the preset voiceprint recognition model for voiceprint recognition, to obtain the probability values that the user to be identified is each of the users with different authorities.
The determination module 332 can be used to determine the voiceprint recognition result corresponding to the user to be identified according to those probability values.
In a specific application scenario, to automatically adjust the hyperparameters in the voiceprint recognition model, the apparatus further includes: a determining unit 34.
The acquisition unit 31 can also be used to acquire sample voiceprint data and to construct, from the sample voiceprint data, the cosine edge loss function corresponding to the preset voiceprint recognition model.
The determining unit 34 can be used to determine, based on the cosine edge loss function, the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in its optimal convergence state.
The determining unit 34 can also be used to determine the hyperparameters corresponding to the preset voiceprint recognition model from the vector angle and the classification probability value.
Further, to determine the vector angle between the sample voiceprint data and the category weights, and its corresponding classification probability value, in the optimal convergence state, the determining unit 34 includes: a drawing module 341 and a determination module 342.
The drawing module 341 can be used to draw, according to the cosine edge loss function, the relationship curves between the vector angle and the classification probability value for different values of the hyperparameters.
The determination module 342 can be used to determine, based on the relationship curves, the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in the optimal convergence state.
Further, to determine that vector angle and its corresponding classification probability value, the determination module 342 includes: a calculation submodule and a determination submodule.
The calculation submodule can be used to calculate the average value of the vector angle between the sample voiceprint data and the category weights.
The determination submodule can be used to determine, according to the relationship curves, the classification probability values respectively corresponding to the sample voiceprint data when the vector angle between the sample voiceprint data and the category weights tends to 0° and 90° in the optimal convergence state of the preset voiceprint recognition model.
The determination submodule can also be used to determine, according to the relationship curves, the classification probability value corresponding to the sample voiceprint data when that vector angle tends to the average value in the optimal convergence state of the preset voiceprint recognition model.
Further, the hyperparameters include a first hyperparameter and a second hyperparameter; to set the first and second hyperparameters in the preset voiceprint recognition model automatically, the determining unit 34 further includes: a first estimation module 343 and a second estimation module 344.
The first estimation module 343 can be used to substitute the classification probability values respectively corresponding to the sample voiceprint data when the vector angle between the sample voiceprint data and the category weights tends to 0° and 90° into the cosine edge loss function, to estimate the first hyperparameter corresponding to the preset voiceprint recognition model.
The second estimation module 344 can be used to substitute the classification probability value corresponding to the sample voiceprint data when that vector angle tends to the average value into the cosine edge loss function, to estimate the second hyperparameter corresponding to the preset voiceprint recognition model.
It should be noted that, for other corresponding descriptions of the functional modules involved in the voiceprint recognition apparatus provided by this embodiment of this application, reference may be made to the corresponding description of the method shown in FIG. 1, which is not repeated here.
Based on the method shown in FIG. 1 above, correspondingly, an embodiment of this application further provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the following steps are implemented: acquiring the voiceprint data of the user to be identified; extracting the voiceprint feature corresponding to the voiceprint data; and inputting the voiceprint feature into a preset voiceprint recognition model for voiceprint recognition to obtain the voiceprint recognition result corresponding to the user to be identified, wherein the hyperparameters in the preset voiceprint recognition model are determined from the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, in the model's optimal convergence state. In addition, the computer-readable storage medium may be non-volatile or volatile.
Based on the embodiments of the method shown in FIG. 1 and the apparatus shown in FIG. 4 above, an embodiment of this application further provides a physical structure diagram of a computer device; as shown in FIG. 6, the computer device includes: a processor 41, a memory 42, and a computer program stored on the memory 42 and executable on the processor, the memory 42 and the processor 41 both being arranged on a bus 43. When the processor 41 executes the program, the following steps are implemented: acquiring the voiceprint data of the user to be identified; extracting the voiceprint feature corresponding to the voiceprint data; and inputting the voiceprint feature into a preset voiceprint recognition model for voiceprint recognition to obtain the voiceprint recognition result corresponding to the user to be identified, wherein the hyperparameters in the preset voiceprint recognition model are determined from the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, in the model's optimal convergence state.
Through the technical solution of this application, the voiceprint data of the user to be identified can be acquired and the voiceprint features corresponding to the voiceprint data extracted; at the same time, the voiceprint features are input into the preset voiceprint recognition model for voiceprint recognition, and the voiceprint recognition result corresponding to the user to be identified is obtained, where the hyperparameters in the preset voiceprint recognition model are determined from the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, in the model's optimal convergence state. By determining this vector angle and its corresponding classification probability in the optimal convergence state, the hyperparameters in the voiceprint recognition model can be adjusted automatically, the accuracy of the hyperparameter settings in the model is guaranteed, and the recognition accuracy of the voiceprint recognition model is improved.
Obviously, those skilled in the art should understand that the modules or steps of this application described above can be implemented by a general-purpose computing device; they can be centralized on a single computing device or distributed over a network formed by multiple computing devices, and, optionally, they can be implemented with program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device. In some cases, the steps shown or described can be performed in an order different from the one here, or they can be made separately into individual integrated-circuit modules, or multiple modules or steps among them can be made into a single integrated-circuit module. In this way, this application is not limited to any specific combination of hardware and software.
The above is only the preferred embodiment of this application and is not intended to limit this application; for those skilled in the art, this application may have various modifications and changes. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of this application shall be included within the scope of protection of this application.

Claims (20)

  1. A voiceprint recognition method, comprising:
    acquiring voiceprint data of a user to be identified;
    extracting a voiceprint feature corresponding to the voiceprint data;
    inputting the voiceprint feature into a preset voiceprint recognition model for voiceprint recognition to obtain a voiceprint recognition result corresponding to the user to be identified, wherein hyperparameters in the preset voiceprint recognition model are determined from the vector angle between sample voiceprint data and category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in an optimal convergence state.
  2. The method according to claim 1, wherein extracting the voiceprint feature corresponding to the voiceprint data comprises:
    performing a fast Fourier transform on the voiceprint data to obtain transformed voiceprint data, and filtering the transformed voiceprint data to obtain voiceprint energy corresponding to the voiceprint data;
    calculating Mel cepstral coefficients corresponding to the voiceprint data from the voiceprint energy, and determining the Mel cepstral coefficients as the voiceprint feature corresponding to the voiceprint data.
  3. The method according to claim 1, wherein inputting the voiceprint feature into the preset voiceprint recognition model for voiceprint recognition to obtain the voiceprint recognition result corresponding to the user to be identified comprises:
    inputting the voiceprint feature into the preset voiceprint recognition model for voiceprint recognition to obtain probability values that the user to be identified is each of the users with different authorities;
    determining the voiceprint recognition result corresponding to the user to be identified according to those probability values.
  4. The method according to claim 1, wherein before acquiring the voiceprint data of the user to be identified, the method further comprises:
    acquiring sample voiceprint data, and constructing, from the sample voiceprint data, the cosine edge loss function corresponding to the preset voiceprint recognition model;
    determining, based on the cosine edge loss function, the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in the optimal convergence state;
    determining the hyperparameters corresponding to the preset voiceprint recognition model from the vector angle and the classification probability value.
  5. The method according to claim 4, wherein determining, based on the cosine edge loss function, the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in the optimal convergence state comprises:
    drawing, according to the cosine edge loss function, the relationship curves between the vector angle and the classification probability value for different values of the hyperparameters;
    determining, based on the relationship curves, the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in the optimal convergence state.
  6. The method according to claim 5, wherein determining, based on the relationship curves, the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in the optimal convergence state comprises:
    calculating the average value of the vector angle between the sample voiceprint data and the category weights;
    determining, according to the relationship curves, the classification probability values respectively corresponding to the sample voiceprint data when the vector angle between the sample voiceprint data and the category weights tends to 0° and 90° in the optimal convergence state of the preset voiceprint recognition model;
    determining, according to the relationship curves, the classification probability value corresponding to the sample voiceprint data when the vector angle between the sample voiceprint data and the category weights tends to the average value in the optimal convergence state of the preset voiceprint recognition model.
  7. The method according to claim 6, wherein the hyperparameters comprise a first hyperparameter and a second hyperparameter, and determining the hyperparameters corresponding to the preset voiceprint recognition model from the vector angle and the classification probability value comprises:
    substituting the classification probability values respectively corresponding to the sample voiceprint data when the vector angle between the sample voiceprint data and the category weights tends to 0° and 90° into the cosine edge loss function, to estimate the first hyperparameter corresponding to the preset voiceprint recognition model;
    substituting the classification probability value corresponding to the sample voiceprint data when the vector angle between the sample voiceprint data and the category weights tends to the average value into the cosine edge loss function, to estimate the second hyperparameter corresponding to the preset voiceprint recognition model.
  8. A voiceprint recognition apparatus, comprising:
    an acquisition unit, configured to acquire voiceprint data of a user to be identified;
    an extraction unit, configured to extract a voiceprint feature corresponding to the voiceprint data;
    a recognition unit, configured to input the voiceprint feature into a preset voiceprint recognition model for voiceprint recognition to obtain a voiceprint recognition result corresponding to the user to be identified, wherein hyperparameters in the preset voiceprint recognition model are determined from the vector angle between sample voiceprint data and category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in an optimal convergence state.
  9. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein when the computer program is executed by the processor, the following steps are implemented:
    acquiring voiceprint data of a user to be identified;
    extracting a voiceprint feature corresponding to the voiceprint data;
    inputting the voiceprint feature into a preset voiceprint recognition model for voiceprint recognition to obtain a voiceprint recognition result corresponding to the user to be identified, wherein hyperparameters in the preset voiceprint recognition model are determined from the vector angle between sample voiceprint data and category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in an optimal convergence state.
  10. The computer device according to claim 9, wherein extracting the voiceprint feature corresponding to the voiceprint data comprises:
    performing a fast Fourier transform on the voiceprint data to obtain transformed voiceprint data, and filtering the transformed voiceprint data to obtain voiceprint energy corresponding to the voiceprint data;
    calculating Mel cepstral coefficients corresponding to the voiceprint data from the voiceprint energy, and determining the Mel cepstral coefficients as the voiceprint feature corresponding to the voiceprint data.
  11. The computer device according to claim 9, wherein inputting the voiceprint feature into the preset voiceprint recognition model for voiceprint recognition to obtain the voiceprint recognition result corresponding to the user to be identified comprises:
    inputting the voiceprint feature into the preset voiceprint recognition model for voiceprint recognition to obtain probability values that the user to be identified is each of the users with different authorities;
    determining the voiceprint recognition result corresponding to the user to be identified according to those probability values.
  12. The computer device according to claim 9, wherein before acquiring the voiceprint data of the user to be identified, the following steps are further implemented when the computer program is executed by the processor:
    acquiring sample voiceprint data, and constructing, from the sample voiceprint data, the cosine edge loss function corresponding to the preset voiceprint recognition model;
    determining, based on the cosine edge loss function, the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in the optimal convergence state;
    determining the hyperparameters corresponding to the preset voiceprint recognition model from the vector angle and the classification probability value.
  13. The computer device according to claim 12, wherein determining, based on the cosine edge loss function, the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in the optimal convergence state comprises:
    drawing, according to the cosine edge loss function, the relationship curves between the vector angle and the classification probability value for different values of the hyperparameters;
    determining, based on the relationship curves, the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in the optimal convergence state.
  14. The computer device according to claim 13, wherein determining, based on the relationship curves, the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in the optimal convergence state comprises:
    calculating the average value of the vector angle between the sample voiceprint data and the category weights;
    determining, according to the relationship curves, the classification probability values respectively corresponding to the sample voiceprint data when the vector angle between the sample voiceprint data and the category weights tends to 0° and 90° in the optimal convergence state of the preset voiceprint recognition model;
    determining, according to the relationship curves, the classification probability value corresponding to the sample voiceprint data when the vector angle between the sample voiceprint data and the category weights tends to the average value in the optimal convergence state of the preset voiceprint recognition model.
  15. A computer-readable storage medium storing a computer program, wherein when the computer program is executed by a processor, the following steps are implemented:
    acquiring voiceprint data of a user to be identified;
    extracting a voiceprint feature corresponding to the voiceprint data;
    inputting the voiceprint feature into a preset voiceprint recognition model for voiceprint recognition to obtain a voiceprint recognition result corresponding to the user to be identified, wherein hyperparameters in the preset voiceprint recognition model are determined from the vector angle between sample voiceprint data and category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in an optimal convergence state.
  16. The computer-readable storage medium according to claim 15, wherein extracting the voiceprint feature corresponding to the voiceprint data comprises:
    performing a fast Fourier transform on the voiceprint data to obtain transformed voiceprint data, and filtering the transformed voiceprint data to obtain voiceprint energy corresponding to the voiceprint data;
    calculating Mel cepstral coefficients corresponding to the voiceprint data from the voiceprint energy, and determining the Mel cepstral coefficients as the voiceprint feature corresponding to the voiceprint data.
  17. The computer-readable storage medium according to claim 15, wherein inputting the voiceprint feature into the preset voiceprint recognition model for voiceprint recognition to obtain the voiceprint recognition result corresponding to the user to be identified comprises:
    inputting the voiceprint feature into the preset voiceprint recognition model for voiceprint recognition to obtain probability values that the user to be identified is each of the users with different authorities;
    determining the voiceprint recognition result corresponding to the user to be identified according to those probability values.
  18. The computer-readable storage medium according to claim 15, wherein before acquiring the voiceprint data of the user to be identified, the following steps are further implemented when the computer program is executed by the processor:
    acquiring sample voiceprint data, and constructing, from the sample voiceprint data, the cosine edge loss function corresponding to the preset voiceprint recognition model;
    determining, based on the cosine edge loss function, the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in the optimal convergence state;
    determining the hyperparameters corresponding to the preset voiceprint recognition model from the vector angle and the classification probability value.
  19. The computer-readable storage medium according to claim 18, wherein determining, based on the cosine edge loss function, the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in the optimal convergence state comprises:
    drawing, according to the cosine edge loss function, the relationship curves between the vector angle and the classification probability value for different values of the hyperparameters;
    determining, based on the relationship curves, the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in the optimal convergence state.
  20. The computer-readable storage medium according to claim 19, wherein determining, based on the relationship curves, the vector angle between the sample voiceprint data and the category weights, and the corresponding classification probability value, when the preset voiceprint recognition model is in the optimal convergence state comprises:
    calculating the average value of the vector angle between the sample voiceprint data and the category weights;
    determining, according to the relationship curves, the classification probability values respectively corresponding to the sample voiceprint data when the vector angle between the sample voiceprint data and the category weights tends to 0° and 90° in the optimal convergence state of the preset voiceprint recognition model;
    determining, according to the relationship curves, the classification probability value corresponding to the sample voiceprint data when the vector angle between the sample voiceprint data and the category weights tends to the average value in the optimal convergence state of the preset voiceprint recognition model.
PCT/CN2021/109597 2020-12-22 2021-07-30 Voiceprint recognition method and apparatus, storage medium, and computer device WO2022134587A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011526763.1A CN112466311B (zh) 2020-12-22 2020-12-22 Voiceprint recognition method and apparatus, storage medium, and computer device
CN202011526763.1 2020-12-22

Publications (1)

Publication Number Publication Date
WO2022134587A1 (zh)

Family

ID=74804644

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/109597 WO2022134587A1 (zh) 2020-12-22 2021-07-30 Voiceprint recognition method and apparatus, storage medium, and computer device

Country Status (2)

Country Link
CN (1) CN112466311B (zh)
WO (1) WO2022134587A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112466311B (zh) * 2020-12-22 2022-08-19 深圳壹账通智能科技有限公司 Voiceprint recognition method and apparatus, storage medium, and computer device

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108766445A * 2018-05-30 2018-11-06 苏州思必驰信息科技有限公司 Voiceprint recognition method and system
CN109801636A * 2019-01-29 2019-05-24 北京猎户星空科技有限公司 Training method and apparatus for a voiceprint recognition model, electronic device, and storage medium
CN109903774A * 2019-04-12 2019-06-18 南京大学 Voiceprint recognition method based on an angular margin loss function
US20190392842A1 * 2016-09-12 2019-12-26 Pindrop Security, Inc. End-to-end speaker recognition using deep neural network
CN111524521A * 2020-04-22 2020-08-11 北京小米松果电子有限公司 Voiceprint extraction model training method, voiceprint recognition method, and apparatus and medium therefor
US20200294509A1 * 2018-05-08 2020-09-17 Ping An Technology (Shenzhen) Co., Ltd. Method and apparatus for establishing voiceprint model, computer device, and storage medium
CN112466311A (zh) * 2020-12-22 2021-03-09 深圳壹账通智能科技有限公司 Voiceprint recognition method and apparatus, storage medium, and computer device

Also Published As

Publication number Publication date
CN112466311B (zh) 2022-08-19
CN112466311A (zh) 2021-03-09

Similar Documents

Publication Publication Date Title
CN109712628B Speech noise reduction method and speech recognition method using a DRNN noise-reduction model built on an RNN
WO2019100606A1 Electronic device, voiceprint-based identity verification method and system, and storage medium
CN104700018B Recognition method for intelligent robots
CN110084149B Face verification method based on a hard-sample quadruplet dynamic boundary loss function
CN105608450A Heterogeneous face recognition method based on a deep convolutional neural network
CN106250821A Face recognition method using clustering followed by reclassification
WO2022134587A1 Voiceprint recognition method and apparatus, storage medium, and computer device
CN113221086B Offline face authentication method and apparatus, electronic device, and storage medium
CN108932535A Machine-learning-based clone node identification method for edge computing
WO2020073519A1 Voiceprint verification method and apparatus, computer device, and storage medium
CN108520752A Voiceprint recognition method and apparatus
CN105868693A Identity authentication method and system
CN110119746A Feature recognition method and apparatus, and computer-readable storage medium
WO2022268183A1 Video-based random gesture authentication method and system
CN111382601A Illumination face image recognition preprocessing system and method based on a generative adversarial network model
CN110991554B Deep network image classification method based on improved PCA
CN109377601A Fingerprint-recognition-based intelligent office access control system
CN105184236A Robot face recognition system
CN113241081B Far-field speaker authentication method and system based on a gradient reversal layer
CN110737728A Project-domain topic analysis system based on big-data analysis technology
Zhong et al. Text-independent speaker recognition based on adaptive course learning loss and deep residual network
Wei On feature extraction of ship radiated noise using 11/2 d spectrum and principal components analysis
CN108537206A Face verification method based on a convolutional neural network
CN111524524B Voiceprint recognition method, apparatus, device, and storage medium
CN111476145A 1:N face recognition method based on a convolutional neural network

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 21908592; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established (Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 02.11.2023))