CN116386647A - Audio verification method, related device, storage medium and program product - Google Patents


Publication number
CN116386647A
Authority
CN
China
Prior art keywords
features
feature
target
voiceprint
impersonation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310606668.XA
Other languages
Chinese (zh)
Other versions
CN116386647B (en)
Inventor
郭军军
程晓娟
萧子豪
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Real AI Technology Co Ltd
Original Assignee
Beijing Real AI Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Real AI Technology Co Ltd
Priority claimed from CN202310606668.XA
Publication of CN116386647A
Application granted
Publication of CN116386647B
Legal status: Active

Classifications

    • G - PHYSICS
    • G10 - MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L - SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00 - Speaker identification or verification
    • G10L17/06 - Decision making techniques; Pattern matching strategies
    • G10L25/00 - Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 - Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L25/24 - Speech or voice analysis techniques in which the extracted parameters are the cepstrum

Abstract

The embodiment of the application discloses an audio verification method, a related device, a storage medium and a program product. The method comprises the following steps: acquiring a plurality of types of voiceprint features based on audio data to be verified, wherein the voiceprint features comprise at least one type of feature among frequency domain features and vector features; performing feature fusion processing on the voiceprint features to determine a target test voiceprint vector feature of the audio data to be verified; and determining a target similarity score between the target test voiceprint vector feature and the registered voiceprint vector feature of registered audio data, wherein the target similarity score is used for determining a verification result of the audio data to be verified. By fusing the plurality of types of voiceprint features of the audio data to be verified to determine the target test voiceprint vector feature, the feature dimension of the target test voiceprint vector feature is made higher, and performing audio verification with this higher-dimensional feature can improve the accuracy of the audio verification result.

Description

Audio verification method, related device, storage medium and program product
Technical Field
Embodiments of the present disclosure relate to the field of speech recognition technologies, and in particular, to an audio verification method, a related device, a storage medium, and a program product.
Background
With the development of technology, biometric identification has replaced traditional password-based identification as an important means of identity verification, and audio verification (voiceprint verification) is often used as a "gatekeeper" to prevent intrusion into security systems.
To solve the problem of verifying whether test audio belongs to registered audio, the prior art proposes extracting an x-vector from Mel-Frequency Cepstral Coefficients (MFCC) for audio verification. Specifically, the MFCC features of the audio data are extracted, the x-vector of the test audio is obtained from these single-type MFCC features, and the cosine similarity between the x-vectors of the test audio and the registered audio is calculated; when the cosine similarity is greater than or equal to a set threshold, the test audio and the registered audio are judged to come from the same user; otherwise, they are judged to come from different users.
However, the MFCC extraction process uses nonlinear transforms such as the logarithm and the discrete cosine transform, which may introduce nonlinear distortion, reducing the accuracy of the MFCC-based characterization and therefore the accuracy of the x-vector features obtained from it. Moreover, the x-vector may perform differently under changes in voice quality, data type (e.g., singing, reading, etc.) and environment. Therefore, the prior art's use of an x-vector based on a single MFCC feature for audio verification results in lower accuracy of the audio verification result.
Disclosure of Invention
The embodiment of the application provides an audio verification method, a related device, a storage medium and a program product, which can improve the accuracy of an audio verification result.
In a first aspect, an embodiment of the present application provides an audio verification method, including:
acquiring a plurality of types of voiceprint features based on audio data to be verified, wherein the voiceprint features comprise at least one type of feature among frequency domain features and vector features;
performing feature fusion processing on the voiceprint features to determine target test voiceprint vector features of the audio data to be verified;
and determining a target similarity score of the target test voiceprint vector feature and the registered voiceprint vector feature of the registered audio data, wherein the target similarity score is used for determining a verification result of the audio data to be verified.
In a second aspect, embodiments of the present application further provide an audio verification apparatus, including:
the transceiver module is used for acquiring a plurality of types of voiceprint features based on the audio data to be verified, wherein the voiceprint features comprise at least one type of feature among frequency domain features and vector features;
the processing module is used for carrying out feature fusion processing on the voiceprint features so as to determine target test voiceprint vector features of the audio data to be verified; and determining a target similarity score of the target test voiceprint vector feature and the registered voiceprint vector feature of the registered audio data, wherein the target similarity score is used for determining a verification result of the audio data to be verified.
In some embodiments, the voiceprint features include a first frequency domain feature and a second frequency domain feature; the processing module is specifically configured to, when executing the step of performing feature fusion processing on the plurality of voiceprint features to determine the target test voiceprint vector feature of the audio data to be verified:
performing feature fusion processing on the first frequency domain features and the second frequency domain features to obtain target frequency domain features; and inputting the target frequency domain characteristic into a preset first acoustic model to obtain the target test voiceprint vector characteristic.
In some embodiments, the processing module is specifically configured to, when executing the step of performing feature fusion processing on the first frequency domain feature and the second frequency domain feature to obtain the target frequency domain feature:
inputting the first frequency domain features into a preset second acoustic model to obtain first voiceprint vector features; and carrying out feature fusion processing on the first voiceprint vector features and the second frequency domain features to obtain the target frequency domain features.
In some embodiments, the voiceprint features include a third frequency domain feature and a second voiceprint vector feature; the processing module is specifically configured to, when executing the step of performing feature fusion processing on the plurality of voiceprint features to determine the target test voiceprint vector feature of the audio data to be verified:
And carrying out feature fusion processing on the third frequency domain feature and the second voiceprint vector feature to obtain the target test voiceprint vector feature.
In some embodiments, the second voiceprint vector feature is derived based on the steps of:
and inputting a fourth frequency domain feature into a preset second acoustic model to obtain the second voiceprint vector feature, wherein the fourth frequency domain feature is extracted from the audio data to be verified.
In some embodiments, the processing module is specifically configured to, when performing the step of determining the target similarity score for the target test voiceprint vector feature and the registered voiceprint vector feature of the registered audio data:
determining an initial similarity score of the target test voiceprint vector feature and the registered voiceprint vector feature; respectively determining a target impersonation voiceprint vector feature of each impersonation user according to a first weight sequence and the impersonation voiceprint vector feature group of each impersonation user in a preset impersonation data set, wherein the impersonation data set comprises the impersonation voiceprint vector feature groups of a plurality of impersonation users, each impersonation voiceprint vector feature group comprises the impersonation voiceprint vector features corresponding to each of a plurality of specific states of the impersonation user, the first weight sequence is a preset weight sequence corresponding to a target service scene, and the first weight sequence comprises a weight value corresponding to each specific state; respectively determining similarity scores of the target test voiceprint vector feature and each target impersonation voiceprint vector feature to obtain a first similarity score set, and respectively determining similarity scores of the registered voiceprint vector feature and each target impersonation voiceprint vector feature to obtain a second similarity score set; and determining the target similarity score according to the first similarity score set, the second similarity score set and the initial similarity score.
In some embodiments, the first similarity score set and the second similarity score set are ordered sequences, the first similarity score set comprising a first target score and the second similarity score set comprising a second target score; the processing module is specifically configured to, when executing the step of determining the target similarity score according to the first similarity score set, the second similarity score set and the initial similarity score:
determining a preset number of first similarity scores from the first similarity score set, wherein the sorting positions of these first similarity scores are before or after the first target score; determining the preset number of second similarity scores from the second similarity score set, wherein the sorting positions of these second similarity scores are before or after the second target score; calculating a first mean and a first variance of the preset number of first similarity scores; calculating a second mean and a second variance of the preset number of second similarity scores; and obtaining the target similarity score according to the first mean, the first variance, the second mean, the second variance and the initial similarity score.
In some embodiments, before executing the step of determining the target impersonation voiceprint vector feature of each impersonation user according to the first weight sequence and the impersonation voiceprint vector feature set of each impersonation user in the preset impersonation data set, the processing module is further configured to:
determining the first weight sequence corresponding to the target service scene from a preset weight sequence set, wherein the weight sequence set comprises weight sequences respectively corresponding to each service scene in a plurality of service scenes.
In some embodiments, before executing the step of determining the target impersonation voiceprint vector feature of each impersonation user according to the first weight sequence and the impersonation voiceprint vector feature set of each impersonation user in the preset impersonation data set, the processing module is further configured to:
the method comprises the steps that a training set of a target service scene is obtained by utilizing a receiving-transmitting module, the training set comprises a plurality of test pairs of two labels, each test pair comprises a test voiceprint vector sample feature and a registered voiceprint vector sample feature, and the labels are used for indicating whether the test voiceprint vector sample feature and the registered voiceprint vector sample feature in the test pairs belong to the same user or not; acquiring a second weight sequence; determining a third similarity score for the test voiceprint vector sample feature and the registered voiceprint vector sample feature in each of the test pairs, respectively, based on the second weight sequence and the impersonation data set; determining loss values corresponding to the test pairs according to the third similarity scores corresponding to the test pairs and the labels corresponding to the test pairs; if the second weight sequence is determined to be not in accordance with the preset condition according to each loss value, updating the second weight sequence to obtain a candidate weight sequence, taking the candidate weight sequence as the second weight sequence until the second weight sequence is in accordance with the preset condition, and taking the second weight sequence as the first weight sequence.
In a third aspect, embodiments of the present application further provide a computer device, including a memory and a processor, where the memory stores a computer program, and the processor implements the method when executing the computer program.
In a fourth aspect, embodiments of the present application also provide a computer readable storage medium storing a computer program comprising program instructions which, when executed by a processor, implement the above-described method.
In a fifth aspect, an embodiment of the present application provides a chip, where the chip includes a transceiver coupled to a terminal device, for performing the technical solution provided in the first aspect of the embodiment of the present application.
In a sixth aspect, an embodiment of the present application provides a chip system, where the chip system includes a processor, configured to support a terminal device to implement the functions involved in the first aspect, for example, to generate or process information involved in the audio verification method provided in the first aspect. In one possible design, the above chip system further includes a memory for holding program instructions and data necessary for the terminal. The chip system may be formed of a chip or may include a chip and other discrete devices.
In a seventh aspect, embodiments of the present application provide a computer program product containing instructions, which when executed on a computer, cause the computer to perform the audio verification method provided in the first aspect, and also achieve the beneficial effects provided by the audio verification method provided in the first aspect.
Compared with the prior art, in the scheme provided by the embodiment of the application, on one hand, the target test voiceprint vector feature used for audio verification fuses a plurality of types of voiceprint features, so that the obtained target voiceprint vector feature has a higher dimension, and performing audio verification with this higher-dimensional feature can improve the accuracy of the audio verification result; on the other hand, the target test voiceprint vector feature fuses frequency domain features and vector features, so that it retains the frequency domain characteristics of the original speaker while using the vector features to represent deep information; performing audio verification with the target voiceprint vector feature provided by this embodiment can therefore effectively improve the robustness of the audio verification system.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings needed in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is an application scenario schematic diagram of an audio verification method provided in an embodiment of the present application;
fig. 2 is a schematic flow chart of an audio verification method according to an embodiment of the present application;
FIG. 3a is a schematic illustration of a sub-flowchart of an audio verification method according to an embodiment of the present application;
FIG. 3b is a schematic illustration of another sub-flowchart of the audio verification method according to the embodiment of the present application;
FIG. 4a is a schematic illustration of another sub-flowchart of the audio verification method according to the embodiment of the present application;
FIG. 4b is a schematic illustration of another sub-flowchart of the audio verification method according to the embodiment of the present application;
fig. 5 is a flowchart of an audio verification method according to another embodiment of the present application;
fig. 6 is a schematic flow chart of determining a weight sequence in the audio verification method according to the embodiment of the present application;
FIG. 7 is a schematic diagram of a test pair in an audio verification method according to an embodiment of the present disclosure;
FIG. 8 is a schematic block diagram of an audio verification device provided by an embodiment of the present application;
FIG. 9 is a schematic diagram of a hardware configuration in an embodiment of the present application;
fig. 10 is a schematic structural diagram of a terminal device in an embodiment of the present application;
fig. 11 is a schematic structural diagram of a server in an embodiment of the present application.
Detailed Description
The terms "first", "second" and the like in the description, claims and drawings of the embodiments are used to distinguish similar objects and do not necessarily describe a particular order or sequence. It is to be understood that data so used may be interchanged where appropriate, so that the embodiments described herein can be implemented in orders other than those illustrated or described. Furthermore, the terms "comprises", "comprising" and any variations thereof are intended to cover a non-exclusive inclusion, so that a process, method, system, article or apparatus comprising a list of steps or modules is not necessarily limited to those explicitly listed, but may include other steps or modules not expressly listed or inherent to it. The division of modules in the embodiments of the application is only one logical division; in an actual implementation, a plurality of modules may be combined or integrated into another system, some features may be omitted or not implemented, and the couplings, direct couplings or communication connections between modules that are shown or discussed may be realized through interfaces, with indirect couplings or communication connections between modules being electrical or of other similar forms, none of which limits the embodiments of the application. The modules or sub-modules described as separate components may or may not be physically separate, may or may not be physical modules, and may be distributed over a plurality of circuit modules; some or all of them may be selected according to actual needs to achieve the purposes of the embodiments of the present application.
The embodiment of the application provides an audio verification method, a related device, a storage medium and a program product, wherein an execution subject of the audio verification method can be the audio verification device provided by the embodiment of the application or a computer device integrated with the audio verification device, wherein the audio verification device can be realized in a hardware or software mode, and the computer device can be a terminal or a server.
When the computer device is a server, the server may be an independent physical server, a server cluster or distributed system formed by a plurality of physical servers, or a cloud server providing basic cloud computing services such as cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDN, big data and artificial intelligence platforms.
When the computer device is a terminal, the terminal may include: smart phones, tablet computers, notebook computers, desktop computers, smart televisions, smart speakers, personal digital assistants (PDA), smart watches, access-control all-in-one machines, and other devices that carry multimedia data processing functions (e.g., video data playing functions, music data playing functions), but is not limited thereto.
The scheme of the embodiment of the application can be realized based on an artificial intelligence technology, and particularly relates to the technical field of computer vision in the artificial intelligence technology and the fields of cloud computing, cloud storage, databases and the like in the cloud technology, and the technical fields are respectively described below.
Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence is thus the study of the design principles and implementation methods of various intelligent machines, enabling machines to perceive, reason and make decisions.
Artificial intelligence technology is a comprehensive discipline covering a wide range of fields, at both the hardware level and the software level. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics and the like. Artificial intelligence software technology mainly includes computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is a science that studies how to make machines "see"; more specifically, it replaces human eyes with cameras and computers to perform machine vision tasks such as recognition, tracking and measurement on a target, and further performs graphic processing so that the computer produces an image more suitable for human eyes to observe or for transmission to an instrument for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, model robustness detection, image semantic understanding, image retrieval, OCR, video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, as well as biometric techniques such as model robustness detection and fingerprint recognition.
With the research and advancement of artificial intelligence technology, it has been researched and applied in many fields, such as smart homes, smart wearable devices, virtual assistants, smart speakers, smart marketing, unmanned driving, automatic driving, unmanned aerial vehicles, robots, smart medical care and smart customer service. It is believed that with the development of technology, artificial intelligence will be applied in more fields and realize increasingly important value.
The solution of the embodiment of the present application may be implemented based on cloud technology, and in particular, relates to the technical fields of cloud computing, cloud storage, database, and the like in the cloud technology, and will be described below.
Cloud technology refers to a hosting technology that integrates hardware, software, network and other resources in a wide area network or a local area network to realize computation, storage, processing and sharing of data. It is the general term for the network technology, information technology, integration technology, management platform technology and application technology applied on the basis of the cloud computing business model, and can form a resource pool that is used on demand, flexibly and conveniently. Cloud computing technology will become an important support. Background services of technical networking systems, such as video websites, image websites and other portals, require a large amount of computing and storage resources. With the development of the internet industry, each article may in the future carry its own identification mark, which needs to be transmitted to a background system for logic processing; data of different levels will be processed separately, and all kinds of industry data need strong back-end system support, which can only be realized through cloud computing. In the embodiment of the application, the identification result can be stored through cloud technology.
Cloud storage is a new concept extended and developed from the concept of cloud computing. A distributed cloud storage system (hereinafter referred to as a storage system) is a storage system that, through functions such as cluster application, grid technology and distributed storage file systems, integrates a large number of storage devices of various types in a network (storage devices are also referred to as storage nodes) to work cooperatively through application software or application interfaces, providing data storage and service access functions externally. In the embodiment of the application, information such as network configuration can be stored in the storage system, so that the server can conveniently retrieve it.
At present, the storage method of the storage system is as follows: when logical volumes are created, each logical volume is allocated physical storage space, which may be composed of the disks of one or several storage devices. A client stores data on a logical volume, that is, the data is stored on a file system; the file system divides the data into a plurality of parts, each part being an object, and an object contains not only the data but also additional information such as a data identification (ID). The file system writes each object into the physical storage space of the logical volume and records the storage location information of each object, so that when the client requests access to the data, the file system can let the client access the data according to the storage location information of each object.
The process by which the storage system allocates physical storage space for a logical volume is specifically as follows: physical storage space is divided in advance into stripes according to estimates of the capacity of the objects to be stored on the logical volume (these estimates tend to leave a large margin relative to the capacity of the objects actually stored) and the configuration of the Redundant Array of Independent Disks (RAID); a logical volume can then be understood as a stripe, whereby physical storage space is allocated to the logical volume.
A database can be considered an electronic filing cabinet, that is, a place for storing electronic files, in which users can add, query, update and delete the data in the files. A "database" is a collection of data that is stored together in a way that can be shared by multiple users, has as little redundancy as possible, and is independent of applications.
The Database Management System (DBMS) is computer software designed for managing databases, and generally provides basic functions such as storage, retrieval, security and backup. Database management systems can be classified by the database model they support, e.g., relational or XML (Extensible Markup Language); by the type of computer supported, e.g., server cluster or mobile phone; by the query language used, e.g., SQL (Structured Query Language) or XQuery; by performance emphasis, e.g., maximum scale or maximum operating speed; or by other classification schemes. Regardless of the classification used, some DBMSs can span categories, for example supporting multiple query languages simultaneously. In the embodiment of the application, the identification result can be stored in the database management system for convenient retrieval by the server.
It should be specifically noted that the terminal according to the embodiments of the present application may be a device that provides voice and/or data connectivity to a user, a handheld device with a wireless connection function, or another processing device connected to a wireless modem, such as a mobile telephone (or "cellular" telephone) or a computer with a mobile terminal, for example a portable, pocket, handheld, computer-built-in or vehicle-mounted mobile device that exchanges voice and/or data with a radio access network; for example, a Personal Communication Service (PCS) telephone, a cordless telephone, a Session Initiation Protocol (SIP) phone, a Wireless Local Loop (WLL) station, a Personal Digital Assistant (PDA), and the like.
Referring to fig. 1, fig. 1 is an application scenario schematic diagram of an audio verification method according to an embodiment of the present application. The audio verification method is applied to the audio verification system in fig. 1, and in some embodiments, the audio verification system includes a user terminal 10 and a server 20, the user terminal 10 is configured to obtain audio data to be verified and send the audio data to be verified to the server 20, and the server 20 is configured to perform audio verification processing on the received audio data to be verified and return a verification result to the user terminal 10.
Specifically, the server 20 receives audio data to be verified sent by the user terminal 10, and then obtains a plurality of types of voiceprint features of the audio data to be verified, wherein the voiceprint features include at least one type of features of frequency domain features and vector features; then fusing a plurality of types of voiceprint features to obtain target test voiceprint vector features of the audio data to be verified; and then calculates the target similarity score between the target test voiceprint vector feature and the registered voiceprint vector feature, determines a verification result according to the target similarity score, and finally returns the verification result to the user terminal 10.
In other embodiments, the audio verification system may include only the user terminal 10 or only the server 20; in that case, the audio verification method provided in this embodiment is completed by the user terminal 10 or the server 20 alone.
In this embodiment, a server is taken as the execution body, the server being integrated with the audio verification device; when the execution body is a terminal, reference may be made to the embodiment of the server, which will not be described in detail.
Fig. 2 is a flow chart of an audio verification method according to an embodiment of the present application. As shown in fig. 2, the method includes the following steps 101-105.
101. The server acquires the audio data to be verified.
In this embodiment, the audio data to be verified may be external to the server or may be local data of the server, which is not limited in this embodiment.
When the audio data to be verified comes from outside the server, it may be sent by the user terminal; alternatively, the server may be provided with a microphone, through which the server obtains the audio data to be verified.
102. The server obtains a plurality of types of voiceprint features based on the audio data to be verified.
The voiceprint features include at least one type of feature among frequency domain features and vector features.
The frequency domain features include at least one of MFCC features, Constant Q Transform Cepstral Coefficient (CQCC) features, Linear Frequency Cepstral Coefficient (LFCC) features, Bark-Frequency Cepstral Coefficient (BFCC) features and Gammatone Frequency Cepstral Coefficient (GFCC) features; the vector features include vector features extracted from any one of the frequency domain features of the audio data to be verified, such as its MFCC, LFCC, CQCC, BFCC or GFCC features.
It should be noted that, in some embodiments, in order to improve the purity of the acquired audio, before step 101 the method further includes: acquiring initial audio data and preprocessing it to obtain the audio data to be verified, where the preprocessing includes operations such as data cleaning and silent-segment removal.
In some embodiments, the audio data to be verified satisfies at least one of the following preset conditions: it contains significant noise, is acquired in a specific service scene, has poor voice quality, or has a specific voice type; the specific service scene may be a financial payment scene, a security access-control scene, a package-signing scene, etc., and the specific voice type may be singing, reading, etc.
103. And the server performs feature fusion processing on the voiceprint features to determine target test voiceprint vector features of the audio data to be verified.
In some embodiments, in order to improve the accuracy of the frequency domain features and the performance of the vector features, the voiceprint features include a first frequency domain feature and a second frequency domain feature, where the first frequency domain feature and the second frequency domain feature are frequency domain features of different types. In this embodiment, the different frequency domain features are fused, and step 103 includes:
(1) And carrying out feature fusion processing on the first frequency domain features and the second frequency domain features to obtain target frequency domain features.
In a specific embodiment, take the first frequency domain feature as an MFCC feature and the second frequency domain feature as an LFCC feature. Define the MFCC features as a matrix M ∈ ℝ^(T×D), representing the MFCC features of T time frames in D dimensions, and define the LFCC features as a matrix L ∈ ℝ^(T×D), representing the LFCC features of T time frames in D dimensions. Specifically:

M = [m_1, m_2, …, m_T]ᵀ, where m_t ∈ ℝ^D is the MFCC vector of frame t

L = [l_1, l_2, …, l_T]ᵀ, where l_t ∈ ℝ^D is the LFCC vector of frame t

At this time, the feature fusing the MFCC and LFCC features is obtained by frame-level concatenation:

C_{L,M} = [M, L] ∈ ℝ^(T×2D)
(2) And inputting the target frequency domain characteristic into a preset first acoustic model to obtain the target test voiceprint vector characteristic.
In some embodiments, the above C_{L,M} is input into the first acoustic model, and the target test voiceprint vector feature is output.
The first acoustic model may specifically be a TDNN acoustic model, or may be another type of acoustic model, for example, a DNN acoustic model, where a specific model type is not limited herein.
Specifically, as shown in fig. 3a, after the server obtains the audio data to be verified, the server extracts the first frequency domain feature and the second frequency domain feature of the audio data to be verified, fuses the first frequency domain feature and the second frequency domain feature to obtain the target frequency domain feature, inputs the target frequency domain feature into the first acoustic model, and outputs the target test voiceprint vector feature.
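To make the flow of fig. 3a concrete, the following sketch fuses the matrices M and L from the extraction sketch above and feeds the result to a stand-in first acoustic model; the FirstAcousticModel class below is a hypothetical placeholder for the preset TDNN/DNN acoustic model, and its layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class FirstAcousticModel(nn.Module):
    """Hypothetical stand-in for the preset first acoustic model (e.g. a TDNN
    x-vector network): maps a T x F feature sequence to a fixed embedding."""
    def __init__(self, in_dim: int, emb_dim: int = 192):
        super().__init__()
        self.frame_layers = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU())
        self.embed = nn.Linear(256, emb_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        h = self.frame_layers(feats)       # T x 256 frame-level representations
        return self.embed(h.mean(dim=0))   # mean pooling -> voiceprint embedding

def fuse_frequency_features(M: torch.Tensor, L: torch.Tensor) -> torch.Tensor:
    """Frame-level concatenation of two T x D frequency domain features."""
    return torch.cat([M, L], dim=1)        # T x 2D target frequency domain feature

target_freq_feature = fuse_frequency_features(M, L)  # M, L from the sketch above
model = FirstAcousticModel(in_dim=target_freq_feature.shape[1])
target_test_vector = model(target_freq_feature)      # target test voiceprint vector
```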
Further, in some embodiments, to extract the deeper features, the feature fusion process may be further performed on the first frequency domain feature and the second frequency domain feature based on the following steps:
specifically, as shown in fig. 3b, after obtaining the audio data to be verified, the server extracts a first frequency domain feature and a second frequency domain feature of the audio data to be verified, then inputs the first frequency domain feature into a second acoustic model to obtain a first voiceprint vector feature, then performs feature fusion processing on the first voiceprint vector feature and the second frequency domain feature to obtain a target frequency domain feature, and inputs the target frequency domain feature into the first acoustic model to obtain a target test voiceprint vector feature.
The second acoustic model may specifically be a Time Delay Neural Network (TDNN) acoustic model, or another type of acoustic model, for example a Deep Neural Network (DNN) acoustic model; the specific model type is not limited herein. The second acoustic model is obtained by training on first frequency domain feature samples, and this embodiment does not limit the specific type of each frequency domain feature.
In other embodiments, to preserve the frequency domain characteristics of the original speaker while obtaining deeper vector features, the voiceprint features include a third frequency domain feature and a second voiceprint vector feature; in this case, step 103 includes: performing feature fusion processing on the third frequency domain feature and the second voiceprint vector feature to obtain the target test voiceprint vector feature.
Specifically, as shown in fig. 4a, a third frequency domain feature and a second voiceprint vector feature of the audio data to be verified are extracted, and then the third frequency domain feature and the second voiceprint vector feature are fused to obtain a target test voiceprint vector feature.
The second voiceprint vector feature is obtained based on the following steps: inputting a fourth frequency domain feature into a preset second acoustic model to obtain the second voiceprint vector feature, wherein the fourth frequency domain feature is extracted from the audio data to be verified.
That is, the fourth frequency domain feature and the third frequency domain feature of the audio data to be verified are first extracted, the fourth frequency domain feature is input into the second acoustic model to obtain the second voiceprint vector feature, and the second voiceprint vector feature and the third frequency domain feature are then fused to obtain the target test voiceprint vector feature.
Specifically, referring to fig. 4b, after obtaining the audio data to be verified, the server extracts the third and fourth frequency domain features of the audio data to be verified, inputs the fourth frequency domain feature into the second acoustic model, which outputs the second voiceprint vector feature corresponding to the fourth frequency domain feature, and performs feature fusion processing on the output second voiceprint vector feature and the third frequency domain feature to obtain the target test voiceprint vector feature.
In a specific embodiment, take the third frequency domain feature as an LFCC feature and the fourth frequency domain feature as an MFCC feature, where the third frequency domain feature is a matrix L, the fourth frequency domain feature is a matrix M, and the second acoustic model is denoted f(·). Inputting the fourth frequency domain feature into the second acoustic model, i.e., inputting the matrix M into f(·), the second voiceprint vector feature output at this time is

V = f(M) ∈ ℝ^(N×D),

a vector sequence representing N segments in D dimensions, specifically V = [v_1, v_2, …, v_N]ᵀ.

When the vector is calculated according to the parameter frame length of the LFCC feature, T = N; when T ≠ N, the frames can be aligned by a dimension-conversion matrix W ∈ ℝ^(N×T) such that W·L ∈ ℝ^(N×D). The target test voiceprint vector feature obtained by fusing the second voiceprint vector feature with the third frequency domain feature at the frame level is then:

C = [W·L, V] ∈ ℝ^(N×2D)
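The following sketch illustrates this fusion under the reconstruction above. The uniform segment-averaging construction of W is an illustrative assumption (the application only requires some dimension-conversion matrix aligning T frames to N segments; N ≤ T is assumed here), and the random tensors are placeholders for real features.

```python
import torch

def align_frames(L: torch.Tensor, n_segments: int) -> torch.Tensor:
    """Map a T x D feature to N x D via a dimension-conversion matrix W (N x T).
    Uniform segment averaging is one illustrative choice of W; assumes N <= T."""
    T = L.shape[0]
    if T == n_segments:
        return L
    W = torch.zeros(n_segments, T)
    bounds = torch.linspace(0, T, n_segments + 1).long()
    for i in range(n_segments):  # average the frames belonging to segment i
        W[i, bounds[i]:bounds[i + 1]] = 1.0 / float(bounds[i + 1] - bounds[i])
    return W @ L

V = torch.randn(50, 40)    # placeholder: second voiceprint vector feature, N x D
L3 = torch.randn(200, 40)  # placeholder: third frequency domain feature, T x D

L_aligned = align_frames(L3, V.shape[0])                  # N x D
target_test_feature = torch.cat([L_aligned, V], dim=1)    # N x 2D fused feature
```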
104. The server determines a target similarity score for the target test voiceprint vector feature and a registered voiceprint vector feature of the registered audio data.
In this embodiment, the server is preset with the registered audio data, or with the registered voiceprint vector feature of the registered audio data; if only the registered audio data is preset, the registered voiceprint vector feature needs to be acquired in the same manner as the target test voiceprint vector feature.
Specifically, the target similarity score may be a cosine similarity or a Euclidean distance between the target test voiceprint vector feature and the registered voiceprint vector feature, or may be determined based on that cosine similarity or Euclidean distance.
105. And the server determines the verification result of the audio data to be verified according to the target similarity score.
The target similarity score is used to determine the verification result of the audio data to be verified: if the target similarity score is greater than or equal to a preset threshold, the verification result is that verification passes; if it is less than the preset threshold, the verification result is that verification fails.
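A minimal sketch of this decision rule follows, using the cosine-similarity option of step 104; the 0.7 threshold is purely an illustrative assumption.

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two voiceprint embeddings."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def verify(test_vec: np.ndarray, enroll_vec: np.ndarray,
           threshold: float = 0.7) -> bool:
    """True = verification passes (the score reaches the preset threshold)."""
    return cosine_similarity(test_vec, enroll_vec) >= threshold
```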
In some embodiments, after the server determines the verification result, the verification result is returned to the corresponding user terminal.
In summary, in the scheme provided by the embodiment of the present application, on one hand, the target test voiceprint vector feature used for audio verification fuses a plurality of types of voiceprint features, so that the obtained target voiceprint vector feature has a higher dimension, and performing audio verification with this higher-dimensional feature can improve the accuracy of the audio verification result; on the other hand, the target test voiceprint vector feature fuses frequency domain features and vector features, so that it retains the frequency domain characteristics of the original speaker while using the vector features to represent deep information; performing audio verification with the target voiceprint vector feature provided by this embodiment can therefore effectively improve the robustness of the audio verification system.
In order to further improve the accuracy of the audio verification result and the robustness of the audio verification system, this embodiment uses an impersonation data set to normalize the similarity between the registered audio and the test audio, obtaining an AS-norm (adaptive score normalization) score and reducing the influence of other interference factors (at least one of noise, speech duration, speech environment, voice quality and voice type).
It should be noted that, in this embodiment, the impersonation data set includes the impersonation voiceprint vector features corresponding to each of a plurality of specific states for each of a plurality of impersonation users, and the impersonation voiceprint vector features of different specific states are given different weights according to the service scene; that is, different weight sequences are set for different service scenes, and each weight sequence includes the weight value corresponding to each specific state.
The specific states include quiet environment, noise environment, different periods, different physical states, etc., for example, for each impersonation user in the impersonation data set, the voiceprint vector feature QA of the audio in the quiet environment of each impersonation user, the voiceprint vector feature NA of the audio in the different noise environments, the voiceprint vector feature DA of the audio in different periods (for example, different age groups), and the voiceprint vector feature HA of the audio in different physical states (for example, illness state, normal state, post-exercise state, etc.) are collected respectively.
Different business scenarios include financial payments, security access, package signing, smart home, and the like.
Because different service scenes have different requirements on the audio verification system, the embodiment needs to give different weights to the impersonation voiceprint vector features of impersonation audio data in the impersonation data set in combination with the service scene requirements, so as to achieve the purpose of fitting the service scenes. Referring to fig. 5, fig. 5 is a flowchart of an audio verification method according to another embodiment of the present application.
In this embodiment, a server is taken as the execution body, the server being integrated with the audio verification device; when the execution body is a terminal, reference may be made to the embodiment of the server, which will not be described in detail. As shown in fig. 5, the audio verification method of this embodiment includes steps 201 to 208.
201. The server acquires the audio data to be verified.
202. The server obtains a plurality of types of voiceprint features based on the audio data to be verified.
203. And the server performs feature fusion processing on the voiceprint features to determine target test voiceprint vector features of the audio data to be verified.
Steps 201 to 203 are similar to steps 101 to 103 in the above embodiments, and detailed descriptions thereof are omitted herein.
204. The server determines an initial similarity score for the target test voiceprint vector feature and the registered voiceprint vector feature.
Specifically, after the target test voiceprint vector feature of the audio data to be verified is obtained, the cosine similarity or euclidean distance between the target test voiceprint vector feature and the registered voiceprint vector feature is calculated, and the cosine similarity or euclidean distance is determined as the initial similarity score.
205. And the server respectively determines target impersonation voiceprint vector characteristics of each impersonation user according to the first weight sequence and a impersonation voiceprint vector characteristic set of each impersonation user in a preset impersonation data set.
In some embodiments, the impersonation data set includes the impersonation voiceprint vector feature sets of a plurality of impersonation users, each impersonation voiceprint vector feature set including the impersonation voiceprint vector features corresponding to each of a plurality of specific states of the impersonation user; the first weight sequence is a preset weight sequence corresponding to the target service scene and includes a weight value corresponding to each specific state.
The impersonation user in this embodiment is a user different from both the test user and the registered user, where the test user is the user corresponding to the audio data to be verified and the registered user is the user corresponding to the registered audio data.
Specifically, suppose for example that the impersonation data set includes the impersonation voiceprint vector feature sets of impersonation user 1 and impersonation user 2. The set corresponding to impersonation user 1 includes the voiceprint vector feature QA1 of audio in a quiet environment, the voiceprint vector feature NA1 of audio in different noise environments, the voiceprint vector feature DA1 of audio in different periods, and the voiceprint vector feature HA1 of audio in different physical states; the set corresponding to impersonation user 2 likewise includes QA2, NA2, DA2 and HA2. Let the first weight sequence be (α1, α2, α3, α4).
At this time, the target impersonation voiceprint vector feature of impersonation user 1 is: em_o1 = α1·QA1 + α2·NA1 + α3·DA1 + α4·HA1; the target impersonation voiceprint vector feature of impersonation user 2 is: em_o2 = α1·QA2 + α2·NA2 + α3·DA2 + α4·HA2.
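The weighted sum above can be sketched as follows; the weight values and the 192-dimensional embeddings are illustrative assumptions.

```python
import numpy as np

def target_impersonation_vector(state_vectors, weights):
    """Weighted sum of an impersonation user's per-state voiceprint vectors
    (e.g., QA, NA, DA, HA) under a weight sequence (alpha1..alpha4)."""
    return sum(w * v for w, v in zip(weights, state_vectors))

alpha = [0.4, 0.3, 0.2, 0.1]  # assumed first weight sequence, for illustration
QA1, NA1, DA1, HA1 = (np.random.randn(192) for _ in range(4))  # placeholders
em_o1 = target_impersonation_vector([QA1, NA1, DA1, HA1], alpha)
```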
In other embodiments, the impersonation data set includes impersonation audio data in a plurality of specific states for each impersonation user; in this case the impersonation voiceprint vector feature of each piece of impersonation audio data needs to be acquired first, in a manner similar to acquiring the target test voiceprint vector feature of the audio data to be verified in the previous embodiment, which is not repeated here.
In some embodiments, in order to improve the efficiency of acquiring the weight sequence, the system presets a weight sequence set containing the weight sequences corresponding to each of a plurality of service scenes. Before executing step 205, this embodiment may directly determine, from the weight sequence set, the first weight sequence corresponding to the target service scene, where the target service scene is the service scene corresponding to the audio data to be verified and is determined according to the source of the audio data to be verified.
206. The server respectively determines similarity scores of the target test voiceprint vector features and the target impersonation voiceprint vector features to obtain a first similarity score set; and respectively determining similarity scores of the registered voiceprint vector features and the target impersonation voiceprint vector features to obtain a second similarity score set.
In this embodiment, similarity scores of the target test voiceprint vector feature and the target impersonation voiceprint vector feature in the impersonation dataset are required to be calculated respectively, so as to obtain a first similarity score set containing similarity scores of the target test voiceprint vector feature and each target impersonation voiceprint vector feature; and calculating the similarity scores of the registered voiceprint vector features and the target impersonation voiceprint vector features in the impersonation data set to obtain a second similarity score set containing the similarity scores of the registered voiceprint vector features and the target impersonation voiceprint vector features.
207. The server determines the target similarity score according to the first similarity score set, the second similarity score set and the initial similarity score.
In some embodiments, in order to improve the calculation efficiency, only the top-k similarity scores in the first similarity score set and the top-k similarity scores in the second similarity score set participate in the calculation of the target similarity score, where k is a preset number whose value is smaller than the total number of impersonation users in the impersonation data set.
Specifically, in some embodiments, the first similarity score set and the second similarity score set are ordered sequences, the first similarity score set including a first target score and the second similarity score set including a second target score; the target similarity score is then determined by:
determining a preset number of first similarity scores from the first similarity score set, where the sorting positions of these first similarity scores are before or after the first target score; determining the preset number of second similarity scores from the second similarity score set, where the sorting positions of these second similarity scores are before or after the second target score; calculating a first mean and a first variance of the preset number of first similarity scores; calculating a second mean and a second variance of the preset number of second similarity scores; and obtaining the target similarity score according to the first mean, the first variance, the second mean, the second variance and the initial similarity score.
When the first similarity set is sorted in descending order, the first target score is located at the (k+1)-th position of the set, where k is the preset number; in this case the first similarity scores are the similarity scores whose sorting positions are before the first target score. When the first similarity set is sorted in ascending order, the first target score is located at the (M−k)-th position of the set, where M is the total number of similarity scores in the set; in this case the first similarity scores are the similarity scores whose sorting positions are after the first target score.
Alternatively, the similarity scores in the first similarity score set and the second similarity score set may be kept in the order of acquisition, in which case the top k similarity scores are simply extracted from each set according to a preset score comparison rule.
In one embodiment, the initial similarity score of the target test voiceprint vector feature (test audio data $t$) and the registered voiceprint vector feature (registered audio data $e$) is denoted $s(e,t)$.

The impersonation data set is defined as

$$\mathcal{E} = \{E_1, E_2, \dots, E_N\}$$

wherein $N$ is the total number of impersonation users in the impersonation data set, $i$ represents the $i$-th impersonation user, and $E_i = \{\varepsilon_{i,1}, \varepsilon_{i,2}, \dots, \varepsilon_{i,m}\}$ includes the impersonation voiceprint vector features of impersonation user $i$ in $m$ specific states. The first weight sequence is defined as $w = (w_1, w_2, \dots, w_m)$, so that the target impersonation voiceprint vector feature of impersonation user $i$ is the weighted combination

$$\hat{\varepsilon}_i = \sum_{j=1}^{m} w_j\,\varepsilon_{i,j}.$$

The first similarity score is expressed as $s(t, \hat{\varepsilon}_i)$, and the second similarity score is expressed as $s(e, \hat{\varepsilon}_i)$, for $i = 1, \dots, N$.

Further, in some embodiments, the target similarity score (i.e., the final AS-norm score) is determined specifically by the following equation:

$$s_{\text{norm}}(e,t) = \frac{1}{2}\left(\frac{s(e,t) - \mu_t}{\sigma_t} + \frac{s(e,t) - \mu_e}{\sigma_e}\right)$$

wherein $s_{\text{norm}}(e,t)$ is the target similarity score, $s(e,t)$ is the initial similarity score, $\mu_t$ is the first mean, $\sigma_t$ is the first variance, $\mu_e$ is the second mean, and $\sigma_e$ is the second variance.
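Under the definitions above, the weighted combination and the AS-norm score can be sketched as follows. This is an illustrative reading of the equations, assuming the target impersonation feature is a weighted sum over the per-state features and that the top-k statistics use the standard deviation, as is usual for AS-norm; neither assumption is mandated by the text beyond what the equations state.

```python
import numpy as np

def target_impersonation_feature(state_feats, weights):
    # Weighted combination of one impersonation user's per-state features.
    return sum(w * f for w, f in zip(weights, state_feats))

def as_norm(initial_score, first_set, second_set, k):
    # Keep only the top-k cohort scores from each set (steps 206-207).
    top_t = np.sort(np.asarray(first_set))[::-1][:k]
    top_e = np.sort(np.asarray(second_set))[::-1][:k]
    mu_t, sigma_t = top_t.mean(), top_t.std()   # first mean / first "variance"
    mu_e, sigma_e = top_e.mean(), top_e.std()   # second mean / second "variance"
    return 0.5 * ((initial_score - mu_t) / sigma_t
                  + (initial_score - mu_e) / sigma_e)
```

The step-208 decision then reduces to comparing the returned value against the preset threshold.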
208. And the server determines the verification result of the audio data to be verified according to the target similarity score.
Specifically, if the target similarity score is greater than or equal to a preset threshold, determining that the verification result is verification passing, and if the target similarity score is less than the preset threshold, determining that the verification result is verification failing.
In some embodiments, after the verification result is obtained, it is returned to the corresponding user terminal; when the verification result is that verification passed, the user terminal is allowed to perform the next operation, and otherwise a verification-failure reminder is issued. For example, if the target service scenario is a financial payment scenario, the next operation is a payment operation; if the target service scenario is a security access scenario, the next operation is a door-opening operation.
Compared with the prior art, in which the similarity score is optimized using only a single impersonation voiceprint vector feature per impersonation user, so that the optimization is limited by the small amount of information a single feature carries, the present application determines the target impersonation voiceprint vector feature of each impersonation user by combining that user's impersonation voiceprint vector features in a plurality of specific states with the corresponding weight sequence. The resulting target impersonation voiceprint vector feature covers more information dimensions, and adjusting the similarity score with such features can enhance the robustness of the audio verification system.
Because different service scenes place different requirements on the audio verification system, the audio data (impersonation voiceprint vector features) in the impersonation data set need to be given different weights according to the requirements of the specific service scene, so that the weights fit that scene. To this end, the embodiment of the present application may use an adaptive weighting AS-norm scheme to determine the weight sequence corresponding to each service scene. Taking the determination of the weight sequence of the target service scene as an example, the adaptive weighting AS-norm scheme of this embodiment is described in detail below (steps 303 to 306-1 are executed in a loop).
Referring specifically to fig. 6, fig. 6 is a schematic flow chart of determining a weight sequence in the audio verification method according to the embodiment of the present application:
301. and acquiring a training set of the target service scene.
The training set includes a plurality of test pairs carrying one of two kinds of labels. Each test pair includes a test voiceprint vector sample feature and a registered voiceprint vector sample feature, and the label indicates whether the two features in the pair belong to the same user; for example, the label of a same-user test pair is set to 1 and the label of a different-user test pair is set to 0.
Specifically, a training audio sample set is obtained from the target service scene (the training audio sample set includes voiceprint vector sample features of a plurality of pieces of training audio). As shown in fig. 7, the training audio sample set includes a test sample set (containing a plurality of test voiceprint vector sample features) and a registration sample set (containing a plurality of registered voiceprint vector sample features); one feature is then randomly extracted from each of the two sets to form a test pair, and this is repeated to obtain n test pairs, where n is an integer greater than 1.
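A minimal sketch of the pair construction in step 301 follows, assuming each sample set is a dict mapping a user identifier to that user's voiceprint vector sample feature (an organizational assumption for illustration, not stated in the application):

```python
import random

def build_test_pairs(test_set: dict, enroll_set: dict, n: int):
    pairs = []
    for _ in range(n):
        test_user, test_feat = random.choice(list(test_set.items()))
        enroll_user, enroll_feat = random.choice(list(enroll_set.items()))
        label = 1 if test_user == enroll_user else 0  # 1: same user, 0: different users
        pairs.append((test_feat, enroll_feat, label))
    return pairs
```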
302. A second weight sequence is acquired.
In the first iteration, the second weight sequence corresponding to the target service scene needs to be initialized; in subsequent iteration rounds, the second weight sequence participating in the calculation is obtained through updating, without re-initialization.
In this embodiment, the second weight sequence includes weight values corresponding to the audio data in each specific state among the plurality of specific states.
For example, the plurality of specific states may include a quiet environment, a noisy environment, different time periods, and different physical states; the second weight sequence then includes four weight values, one for each of these 4 specific states.
303. And respectively determining third similarity scores of the test voiceprint vector sample features and the registered voiceprint vector sample features in each test pair based on the second weight sequence and the impersonation data set.
In this embodiment, the similarity scores of the test voiceprint vector sample features and the registered voiceprint vector sample features in each test pair are determined by combining the second weight sequence and the preset impersonation data set.
The specific process of determining the third similarity score of the test voiceprint vector sample feature and the registered voiceprint vector sample feature based on the second weight sequence and the impersonation data set for each test pair is similar to the process of determining the target similarity score of the target test voiceprint vector feature and the registered voiceprint vector feature according to the first weight sequence and the impersonation data set in the corresponding embodiment of fig. 5, and is not described herein in detail.
304. And determining loss values corresponding to the test pairs according to the third similarity scores corresponding to the test pairs and the labels corresponding to the test pairs.
In this embodiment, after determining the third similarity score of each test pair, the loss value corresponding to each test pair is determined in combination with the label of the corresponding test pair.
If the label indicates that the data in the corresponding test pair belong to the same user, the larger the third similarity score is, the smaller the loss value should be, and vice versa; if the label indicates that the data in the test pair belong to different users, the larger the third similarity score is, the larger the loss value should be, and vice versa.
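The application does not name a specific loss function; any loss with the monotonic behaviour just described will do. One common choice with exactly that behaviour, shown here purely as an assumption, is binary cross-entropy on a sigmoid-mapped score:

```python
import math

def pair_loss(third_score: float, label: int) -> float:
    p = 1.0 / (1.0 + math.exp(-third_score))  # map the similarity score into (0, 1)
    # label 1 (same user): loss shrinks as the score grows;
    # label 0 (different users): loss grows as the score grows.
    return -math.log(p) if label == 1 else -math.log(1.0 - p)
```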
305. And determining whether the second weight sequence meets a preset condition according to each loss value, if not, executing a step 306, and if so, executing a step 307.
The preset condition is that the false accept rate (False Accept Rate, FAR) is smaller than a first preset ratio, or that the true accept rate (True Accept Rate, TAR) is larger than a second preset ratio.
If the preset condition is that the FAR is smaller than the first preset ratio, step 305 specifically includes: determining the FAR according to the loss values and judging whether it is smaller than the first preset ratio; if so, the current second weight sequence meets the preset condition, and otherwise it does not.

If the preset condition is that the TAR is larger than the second preset ratio, step 305 specifically includes: determining the TAR according to the loss values and judging whether it is larger than the second preset ratio; if so, the current second weight sequence meets the preset condition, and otherwise it does not.
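FAR and TAR can be evaluated directly from the third similarity scores and the labels of the test pairs. The sketch below assumes a pair is "accepted" when its score reaches a decision threshold; the threshold is a hypothetical evaluation parameter, since this step only fixes the preset ratios.

```python
def far_tar(scores, labels, threshold):
    impostor = [s for s, l in zip(scores, labels) if l == 0]
    genuine  = [s for s, l in zip(scores, labels) if l == 1]
    far = sum(s >= threshold for s in impostor) / len(impostor)  # falsely accepted
    tar = sum(s >= threshold for s in genuine) / len(genuine)    # correctly accepted
    return far, tar
```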
In other embodiments, the preset condition may instead be that the current iteration count reaches a preset iteration count; when it does, the current second weight sequence is output as the first weight sequence.
306-1, updating the second weight sequence to obtain a candidate weight sequence, and returning the candidate weight sequence as the second weight sequence to execute step 303.
In this embodiment, if it is determined that the second weight sequence does not meet the preset condition, the second weight sequence is updated to obtain the candidate weight sequence, and specifically, the weight value in the second weight sequence may be randomly adjusted, or the weight value in the second weight sequence may be adaptively adjusted according to the magnitude of the current loss value.
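The random-adjustment strategy mentioned above can be sketched as a small perturbation of each weight value; re-normalizing the candidate sequence to sum to one is an added assumption for stability, not a requirement stated in the application:

```python
import random

def update_weight_sequence(weights, scale=0.05):
    # Randomly perturb each weight, clamping at zero.
    candidate = [max(w + random.uniform(-scale, scale), 0.0) for w in weights]
    total = sum(candidate) or 1.0
    return [w / total for w in candidate]  # candidate weight sequence
```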
306-2, taking the second weight sequence as the first weight sequence.
In this embodiment, when the second weight sequence meets a preset condition, the second weight sequence is used as the first weight sequence.
As can be seen, in this embodiment, corresponding weight sequences are respectively constructed for different service scenarios, and the target impersonation voiceprint vector features of each impersonation user are determined according to the weight sequences corresponding to the service scenarios and the impersonation data set, so that the target impersonation voiceprint vector features are more attached to the current service scenario, and the similarity score is adjusted by using the target impersonation voiceprint vector features, so that the robustness of the audio verification system can be further improved.
Any technical features mentioned in the embodiments corresponding to any one of fig. 1 to fig. 7 are also applicable to the embodiments corresponding to fig. 8 to fig. 11 in the embodiments of the present application, and the following similar parts will not be repeated.
An audio verification method in the embodiments of the present application is described above, and an audio verification apparatus (e.g., a server, a user terminal) that performs the audio verification method described above is described below.
Referring to fig. 8, fig. 8 is a schematic diagram of an audio verification apparatus 800 that can be applied to an audio verification scene. The audio verification apparatus 800 in the embodiment of the present application can implement the steps of the audio verification method performed in the embodiment corresponding to any one of fig. 1 to 7. The functions implemented by the audio verification apparatus 800 may be implemented by hardware, or by hardware executing corresponding software. The hardware or software includes one or more modules corresponding to the functions described above, and the modules may be software and/or hardware. The audio verification apparatus 800 may include a transceiver module 801 and a processing module 802, wherein:
a transceiver module 801, configured to obtain a plurality of types of voiceprint features based on audio data to be verified, where the voiceprint features include at least one type of features of frequency domain features and vector features;
A processing module 802, configured to perform feature fusion processing on the plurality of voiceprint features to determine a target test voiceprint vector feature of the audio data to be verified; and determining a target similarity score of the target test voiceprint vector feature and the registered voiceprint vector feature of the registered audio data, wherein the target similarity score is used for determining a verification result of the audio data to be verified.
In some embodiments, the voiceprint features include a first frequency domain feature and a second frequency domain feature; the processing module 802 is specifically configured to, when executing the step of performing feature fusion processing on the plurality of voiceprint features to determine the target test voiceprint vector feature of the audio data to be verified:
performing feature fusion processing on the first frequency domain features and the second frequency domain features to obtain target frequency domain features; and inputting the target frequency domain characteristic into a preset first acoustic model to obtain the target test voiceprint vector characteristic.
In some embodiments, the processing module 802 is specifically configured to, when performing the step of performing feature fusion processing on the first frequency domain feature and the second frequency domain feature to obtain the target frequency domain feature:
Inputting the first frequency domain features into a preset second acoustic model to obtain first voiceprint vector features; and carrying out feature fusion processing on the first voiceprint vector features and the second frequency domain features to obtain the target frequency domain features.
In some embodiments, the voiceprint features include a third frequency domain feature and a second voiceprint vector feature; the processing module 802 is specifically configured to, when executing the step of performing feature fusion processing on the plurality of voiceprint features to determine the target test voiceprint vector feature of the audio data to be verified:
and carrying out feature fusion processing on the third frequency domain feature and the second voiceprint vector feature to obtain the target test voiceprint vector feature.
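The application leaves the concrete fusion operator open. For illustration, concatenation along the feature axis is one simple realization; the shapes below (a frame-level frequency-domain feature plus a fixed-length voiceprint vector) are assumptions made only for this sketch:

```python
import numpy as np

def fuse_features(freq_feat: np.ndarray, vec_feat: np.ndarray) -> np.ndarray:
    # freq_feat: (frames, d1) frequency-domain feature;
    # vec_feat:  (d2,) voiceprint vector feature, broadcast to every frame.
    if freq_feat.ndim == 2 and vec_feat.ndim == 1:
        vec_feat = np.broadcast_to(vec_feat, (freq_feat.shape[0], vec_feat.shape[0]))
    return np.concatenate([freq_feat, vec_feat], axis=-1)
```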
In some embodiments, the second voiceprint vector feature is derived based on the steps of:
and inputting a fourth frequency domain feature into a preset second acoustic model to obtain the second voiceprint vector feature, wherein the fourth frequency domain feature is extracted from the audio data to be verified.
In some embodiments, the processing module 802 is specifically configured to, when performing the step of determining the target similarity score for the target test voiceprint vector feature and the registered voiceprint vector feature of the registered audio data:
Determining an initial similarity score for the target test voiceprint vector feature and the registered voiceprint vector feature; according to a first weight sequence and a preset impersonation voiceprint vector feature group of each impersonation user in a impersonation data set, respectively determining target impersonation voiceprint vector features of each impersonation user, wherein the impersonation data set comprises a plurality of impersonation voiceprint vector feature groups of the impersonation users, each impersonation voiceprint vector feature group comprises impersonation voiceprint vector features corresponding to each specific state of the impersonation user in a plurality of specific states, the first weight sequence is a preset weight sequence corresponding to a target service scene, and the first weight sequence comprises weight values corresponding to each specific state; respectively determining similarity scores of the target test voiceprint vector features and the target impersonation voiceprint vector features to obtain a first similarity score set; and respectively determining similarity scores of the registered voiceprint vector features and the target impersonation voiceprint vector features to obtain a second similarity score set; and determining the target similarity score according to the first similarity score set, the second similarity score set and the initial similarity score.
In some embodiments, the first similarity score set and the second similarity score set are ordered sequences, the first similarity score set including a first target score and the second similarity score set including a second target score; the processing module 802 is specifically configured to, when executing the step of determining the target similarity score according to the first similarity score set, the second similarity score set, and the initial similarity score:
determining a preset number of first similarity scores from the first similarity score set, wherein the sorting positions of the first similarity scores are before or after the first target score; determining the preset number of second similarity scores from the second similarity score set, wherein the sorting positions of the second similarity scores are before or after the second target score; calculating a first mean and a first variance of the preset number of first similarity scores; calculating a second mean and a second variance of the preset number of second similarity scores; and obtaining the target similarity score according to the first mean, the first variance, the second mean, the second variance and the initial similarity score.
In some embodiments, before executing the step of determining the target impersonation voiceprint vector feature of each impersonation user according to the first weight sequence and the impersonation voiceprint vector feature set of each impersonation user in the preset impersonation data set, the processing module 802 is further configured to:
determining the first weight sequence corresponding to the target service scene from a preset weight sequence set, wherein the weight sequence set comprises weight sequences respectively corresponding to each service scene in a plurality of service scenes.
In some embodiments, before executing the step of determining the target impersonation voiceprint vector feature of each impersonation user according to the first weight sequence and the impersonation voiceprint vector feature set of each impersonation user in the preset impersonation data set, the processing module 802 is further configured to:
acquiring a training set of a target service scene by using the transceiver module 801, wherein the training set comprises a plurality of test pairs carrying one of two kinds of labels, each test pair comprises a test voiceprint vector sample feature and a registered voiceprint vector sample feature, and the labels are used for indicating whether the test voiceprint vector sample feature and the registered voiceprint vector sample feature in the test pair belong to the same user; acquiring a second weight sequence; determining a third similarity score for the test voiceprint vector sample feature and the registered voiceprint vector sample feature in each of the test pairs, respectively, based on the second weight sequence and the impersonation data set; determining loss values corresponding to the test pairs according to the third similarity scores corresponding to the test pairs and the labels corresponding to the test pairs; and if the second weight sequence is determined not to meet the preset condition according to each loss value, updating the second weight sequence to obtain a candidate weight sequence, and taking the candidate weight sequence as the second weight sequence until the second weight sequence meets the preset condition, and then taking the second weight sequence as the first weight sequence.
In the embodiment of the present application, on one hand, the processing module 802 fuses multiple types of voiceprint features to obtain the target test voiceprint vector feature used for audio verification, which mitigates the precision degradation of a single x-vector feature and improves the precision of the features used for audio verification; on the other hand, because audio verification is performed with the fused feature rather than with the x-vector alone, the scheme avoids the problem in the prior art that the x-vector is affected by voice quality, data type, environmental changes and the like, which makes its performance unstable and the audio verification result less accurate, and thereby improves the accuracy of the audio verification result. In addition, the audio verification method of this embodiment can be applied to an audio verification system, thereby improving the robustness of the audio verification system.
The audio verification apparatus in the embodiment of the present application is described above from the viewpoint of the modularized functional entity; the devices that perform the audio verification method in the embodiment of the present application are described below from the viewpoint of hardware processing.
It should be noted that, in each embodiment of the present application (including the embodiment shown in fig. 8), the entity device corresponding to any transceiver module may be a transceiver, and the entity device corresponding to any processing module may be a processor. When such a device has the structure shown in fig. 9, the processor, the transceiver and the memory in fig. 9 implement the same or similar functions as the transceiver module and the processing module provided in the corresponding apparatus embodiment, and the memory in fig. 9 stores a computer program to be invoked when the processor executes the above audio verification method.
Specifically, the apparatus shown in fig. 8 may have the structure shown in fig. 9; in that case, the processor in fig. 9 implements the same or similar functions as the processing module provided by the apparatus embodiment, and the transceiver in fig. 9 implements the same or similar functions as the transceiver module. Alternatively, in the embodiment shown in fig. 8, the entity device corresponding to the transceiver module may be an input/output interface, and the entity device corresponding to the processing module may be a processor.
The embodiment of the present application further provides a terminal device. As shown in fig. 10, for convenience of explanation, only the portion relevant to the embodiment of the present application is shown; for specific technical details that are not disclosed, please refer to the method portion of the embodiments of the present application. The terminal device may be any terminal device including a mobile phone, a tablet computer, a personal digital assistant (PDA), a point of sales (POS) terminal, a vehicle-mounted computer, and the like. Taking a mobile phone as an example of the terminal:
Fig. 10 is a block diagram showing a part of the structure of a mobile phone related to the terminal device provided in an embodiment of the present application. Referring to fig. 10, the mobile phone includes: a radio frequency (RF) circuit 55, a memory 520, an input unit 530, a display unit 540, a sensor 550, an audio circuit 560, a wireless fidelity (Wi-Fi) module 570, a processor 580, and a power supply 590. It will be appreciated by those skilled in the art that the handset structure shown in fig. 10 does not limit the handset, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The following describes the components of the mobile phone in detail with reference to fig. 10:
The RF circuit 55 may be used for receiving and transmitting signals during information transmission and reception or during a call. In particular, after downlink information of a base station is received, it is passed to the processor 580 for processing; uplink data is likewise sent to the base station. Generally, the RF circuit 55 includes, but is not limited to, an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 55 may also communicate with networks and other devices via wireless communication. The wireless communication may use any communication standard or protocol, including but not limited to Global System of Mobile communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), email, Short Messaging Service (SMS), and the like.
The memory 520 may be used to store software programs and modules, and the processor 580 performs various functional applications and data processing of the cellular phone by executing the software programs and modules stored in the memory 520. The memory 520 may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, application programs required for at least one function (such as a sound playing function, an image playing function, etc.), and the like; the storage data area may store data (such as audio data, phonebook, etc.) created according to the use of the handset, etc. In addition, memory 520 may include high-speed random access memory, and may also include non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage device.
The input unit 530 may be used to receive input numeric or character information and generate key signal inputs related to user settings and function control of the handset. In particular, the input unit 530 may include a touch panel 531 and other input devices 532. The touch panel 531, also referred to as a touch screen, may collect touch operations by a user on or near it (e.g., operations performed on or near the touch panel 531 with a finger, a stylus, or any other suitable object or accessory) and drive the corresponding connection device according to a preset program. Optionally, the touch panel 531 may include two parts: a touch detection device and a touch controller. The touch detection device detects the touch position of the user, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives touch information from the touch detection device, converts it into touch point coordinates, and sends them to the processor 580, and it can also receive and execute commands from the processor 580. In addition, the touch panel 531 may be implemented in various types, such as resistive, capacitive, infrared, and surface acoustic wave. Besides the touch panel 531, the input unit 530 may include other input devices 532; in particular, other input devices 532 may include, but are not limited to, one or more of a physical keyboard, function keys (e.g., volume control keys, switch keys), a trackball, a mouse, and a joystick.
The display unit 540 may be used to display information input by the user or provided to the user, as well as the various menus of the mobile phone. The display unit 540 may include a display panel 541; optionally, the display panel 541 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), or the like. Further, the touch panel 531 may cover the display panel 541; when the touch panel 531 detects a touch operation on or near it, the operation is transferred to the processor 580 to determine the type of touch event, after which the processor 580 provides a corresponding visual output on the display panel 541 according to that type. Although in fig. 10 the touch panel 531 and the display panel 541 implement the input and output functions of the mobile phone as two independent components, in some embodiments the touch panel 531 and the display panel 541 may be integrated to implement the input and output functions.
The handset may also include at least one sensor 550, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor may include an ambient light sensor and a proximity sensor; the ambient light sensor may adjust the brightness of the display panel 541 according to the brightness of ambient light, and the proximity sensor may turn off the display panel 541 and/or the backlight when the mobile phone is moved to the ear. As one kind of motion sensor, an accelerometer sensor can detect the magnitude of acceleration in all directions (generally three axes), can detect the magnitude and direction of gravity when stationary, and can be used for applications that identify the posture of the mobile phone (such as landscape/portrait switching, related games, and magnetometer posture calibration) and for vibration-identification functions (such as a pedometer and tap detection). Other sensors that may also be configured on the mobile phone, such as a gyroscope, a barometer, a hygrometer, a thermometer, and an infrared sensor, are not described here.
The audio circuit 560, a speaker 561, and a microphone 562 may provide an audio interface between the user and the mobile phone. On one hand, the audio circuit 560 may transmit the electrical signal converted from received audio data to the speaker 561, which converts it into a sound signal for output; on the other hand, the microphone 562 converts a collected sound signal into an electrical signal, which is received by the audio circuit 560 and converted into audio data; after being processed by the audio data output processor 580, the audio data is transmitted, for example, to another mobile phone via the RF circuit 55, or output to the memory 520 for further processing.
Wi-Fi is a short-distance wireless transmission technology; through the Wi-Fi module 570, the mobile phone can help the user send and receive e-mails, browse web pages, access streaming media, and the like, providing wireless broadband Internet access. Although fig. 10 shows the Wi-Fi module 570, it is understood that it is not an essential part of the mobile phone and may be omitted as needed without changing the essence of the application.
Processor 580 is the control center of the handset, connects the various parts of the entire handset using various interfaces and lines, and performs various functions and processes of the handset by running or executing software programs and/or modules stored in memory 520, and invoking data stored in memory 520, thereby performing overall monitoring of the handset. Optionally, processor 580 may include one or more processing modules; preferably, processor 580 may integrate an application processor that primarily handles operating systems, user interfaces, applications, etc., with a modem processor that primarily handles wireless communications. It will be appreciated that the modem processor described above may not be integrated into the processor 580.
The handset further includes a power supply 590 (e.g., a battery) for powering the various components, which can be logically connected to the processor 580 by a power management system so as to perform functions such as managing charging, discharging, and power consumption by the power management system.
Although not shown, the mobile phone may further include a camera, a bluetooth module, etc., which will not be described herein.
In the embodiment of the present application, the processor 580 included in the mobile phone also has the function of controlling execution of the flow of the audio verification method shown in fig. 2 above.
Fig. 11 is a schematic diagram of a server structure provided in an embodiment of the present application. The server 620 may vary considerably in configuration or performance, and may include one or more central processing units (CPU) 622 (e.g., one or more processors), a memory 632, and one or more storage media 630 (e.g., one or more mass storage devices) storing application programs 642 or data 644. The memory 632 and the storage medium 630 may be transitory or persistent storage. The program stored on the storage medium 630 may include one or more modules (not shown), each of which may include a series of instruction operations on the server. Still further, the central processing unit 622 may be configured to communicate with the storage medium 630 and execute, on the server 620, the series of instruction operations in the storage medium 630.
The server 620 may also include one or more power supplies 626, one or more wired or wireless network interfaces 650, one or more input/output interfaces 658, and/or one or more operating systems 641, such as Windows Server, Mac OS X, Unix, Linux, FreeBSD, and the like.
The steps performed by the server in the above embodiments may be based on the structure of the server 620 shown in fig. 11. For example, the steps of the embodiment shown in fig. 2 may be implemented based on the server shown in fig. 11, with the processor 622 performing the following operations by invoking instructions in the memory 632:
acquiring a plurality of types of voiceprint features based on audio data to be verified, wherein the voiceprint features comprise at least one type of features of frequency domain features and vector features;
performing feature fusion processing on the voiceprint features to determine target test voiceprint vector features of the audio data to be verified;
and determining a target similarity score of the target test voiceprint vector feature and the registered voiceprint vector feature of the registered audio data, wherein the target similarity score is used for determining a verification result of the audio data to be verified.
In the foregoing embodiments, the descriptions of the embodiments are emphasized, and for parts of one embodiment that are not described in detail, reference may be made to related descriptions of other embodiments.
It will be clearly understood by those skilled in the art that, for convenience and brevity of description, the specific working processes of the systems, apparatuses and modules described above may refer to the corresponding processes in the foregoing method embodiments, which are not repeated herein.
In the several embodiments provided in the embodiments of the present application, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the modules is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or modules, which may be in electrical, mechanical, or other forms.
The modules described as separate components may or may not be physically separate, and components shown as modules may or may not be physical modules, i.e., may be located in one place, or may be distributed over a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional module in each embodiment of the present application may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module. The integrated modules may be implemented in hardware or in software functional modules. The integrated modules, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium.
In the above embodiments, it may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product.
The computer program product includes one or more computer instructions. When the computer program is loaded and executed on a computer, the flow or functions described in accordance with embodiments of the present application are fully or partially produced. The computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium, for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by a wired (e.g., coaxial cable, fiber optic, digital Subscriber Line (DSL)) or wireless (e.g., infrared, wireless, microwave, etc.). The computer readable storage medium may be any available medium that can be stored by a computer or a data storage device such as a server, data center, etc. that contains an integration of one or more available media. The usable medium may be a magnetic medium (e.g., floppy Disk, hard Disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., solid State Disk (SSD)), etc.
The foregoing describes in detail the technical solution provided by the embodiments of the present application, in which specific examples are applied to illustrate the principles and implementations of the embodiments of the present application, where the foregoing description of the embodiments is only used to help understand the methods and core ideas of the embodiments of the present application; meanwhile, as those skilled in the art will have variations in the specific embodiments and application scope according to the ideas of the embodiments of the present application, the present disclosure should not be construed as limiting the embodiments of the present application in view of the above.

Claims (14)

1. An audio verification method, comprising:
acquiring a plurality of types of voiceprint features based on audio data to be verified, wherein the voiceprint features comprise at least one type of features of frequency domain features and vector features;
performing feature fusion processing on the voiceprint features to determine target test voiceprint vector features of the audio data to be verified;
and determining a target similarity score of the target test voiceprint vector feature and the registered voiceprint vector feature of the registered audio data, wherein the target similarity score is used for determining a verification result of the audio data to be verified.
2. The method of claim 1, wherein the voiceprint features comprise a first frequency domain feature and a second frequency domain feature; and performing feature fusion processing on the voiceprint features to determine target test voiceprint vector features of the audio data to be verified, including:
performing feature fusion processing on the first frequency domain features and the second frequency domain features to obtain target frequency domain features;
and inputting the target frequency domain characteristic into a preset first acoustic model to obtain the target test voiceprint vector characteristic.
3. The method of claim 2, wherein the performing feature fusion processing on the first frequency domain feature and the second frequency domain feature to obtain a target frequency domain feature includes:
inputting the first frequency domain features into a preset second acoustic model to obtain first voiceprint vector features;
and carrying out feature fusion processing on the first voiceprint vector features and the second frequency domain features to obtain the target frequency domain features.
4. The method of claim 1, wherein the voiceprint features comprise a third frequency domain feature and a second voiceprint vector feature; and performing feature fusion processing on the voiceprint features to determine target test voiceprint vector features of the audio data to be verified, including:
And carrying out feature fusion processing on the third frequency domain feature and the second voiceprint vector feature to obtain the target test voiceprint vector feature.
5. The method of claim 4, wherein the second voiceprint vector feature is derived based on the following steps:

inputting a fourth frequency domain feature into a preset second acoustic model to obtain the second voiceprint vector feature, wherein the fourth frequency domain feature is extracted from the audio data to be verified.
6. The method of claim 1, wherein the determining a target similarity score for the target test voiceprint vector feature and a registered voiceprint vector feature of registered audio data comprises:
determining an initial similarity score for the target test voiceprint vector feature and the registered voiceprint vector feature;
according to a first weight sequence and a preset impersonation voiceprint vector feature group of each impersonation user in a impersonation data set, respectively determining target impersonation voiceprint vector features of each impersonation user, wherein the impersonation data set comprises a plurality of impersonation voiceprint vector feature groups of the impersonation users, each impersonation voiceprint vector feature group comprises impersonation voiceprint vector features corresponding to each specific state of the impersonation user in a plurality of specific states, the first weight sequence is a preset weight sequence corresponding to a target service scene, and the first weight sequence comprises weight values corresponding to each specific state;
Respectively determining similarity scores of the target test voiceprint vector features and the target impersonation voiceprint vector features to obtain a first similarity score set; and respectively determining similarity scores of the registered voiceprint vector features and the target impersonation voiceprint vector features to obtain a second similarity score set;
and determining the target similarity score according to the first similarity score set, the second similarity score set and the initial similarity score.
7. The method of claim 6, wherein the first set of similarity scores and the second set of similarity scores are ordered sequences, the first set of similarity scores comprising a first target score, the second set of similarity scores comprising a second target score; the determining the target similarity score according to the first set of similarity scores, the second set of similarity scores, and the initial similarity score includes:
determining a preset number of first similarity scores from the first similarity score set, wherein the sorting positions of the first similarity scores are before or after the first target score; determining the preset number of second similarity scores from the second similarity score set, wherein the sorting positions of the second similarity scores are before or after the second target score;

calculating a first mean and a first variance of the preset number of first similarity scores; calculating a second mean and a second variance of the preset number of second similarity scores;
and obtaining the target similarity score according to the first mean value, the first variance, the second mean value, the second variance and the initial similarity score.
8. The method of claim 6, wherein before determining the target impersonation voiceprint vector feature of each impersonation user separately from the first weight sequence and the set of impersonation voiceprint vector features of each impersonation user in a preset impersonation data set, the method further comprises:
determining the first weight sequence corresponding to the target service scene from a preset weight sequence set, wherein the weight sequence set comprises weight sequences respectively corresponding to each service scene in a plurality of service scenes.
9. The method according to any one of claims 6 to 8, wherein the first weight sequence is derived based on the steps of:
acquiring a training set of a target service scene, wherein the training set comprises a plurality of test pairs carrying one of two kinds of labels, each test pair comprises a test voiceprint vector sample feature and a registered voiceprint vector sample feature, and the labels are used for indicating whether the test voiceprint vector sample feature and the registered voiceprint vector sample feature in the test pair belong to the same user or not;
Acquiring a second weight sequence;
determining a third similarity score for the test voiceprint vector sample feature and the registered voiceprint vector sample feature in each of the test pairs, respectively, based on the second weight sequence and the impersonation data set;
determining loss values corresponding to the test pairs according to the third similarity scores corresponding to the test pairs and the labels corresponding to the test pairs;
if the second weight sequence is determined not to meet the preset condition according to each loss value, updating the second weight sequence to obtain a candidate weight sequence, and taking the candidate weight sequence as the second weight sequence until the second weight sequence meets the preset condition; and taking the second weight sequence as the first weight sequence.
10. An audio verification device, comprising:
the receiving and transmitting module is used for acquiring a plurality of types of voiceprint features based on the audio data to be verified, wherein the voiceprint features comprise at least one type of features of frequency domain features and vector features;
the processing module is used for carrying out feature fusion processing on the voiceprint features so as to determine target test voiceprint vector features of the audio data to be verified; and determining a target similarity score of the target test voiceprint vector feature and the registered voiceprint vector feature of the registered audio data, wherein the target similarity score is used for determining a verification result of the audio data to be verified.
11. A computer device, characterized in that it comprises a memory on which a computer program is stored and a processor which, when executing the computer program, implements the method according to any of claims 1-9.
12. A computer readable storage medium, characterized in that the storage medium stores a computer program comprising program instructions which, when executed by a processor, can implement the method of any of claims 1-9.
13. A computer program product comprising instructions which, when run on a computer or processor, cause the computer or processor to perform the method of any of claims 1 to 9.
14. A chip system, comprising:
a communication interface for inputting and/or outputting information;
a processor for executing a computer executable program to cause a device on which the chip system is installed to perform the method of any one of claims 1 to 9.
CN202310606668.XA 2023-05-26 2023-05-26 Audio verification method, related device, storage medium and program product Active CN116386647B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310606668.XA CN116386647B (en) 2023-05-26 2023-05-26 Audio verification method, related device, storage medium and program product

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310606668.XA CN116386647B (en) 2023-05-26 2023-05-26 Audio verification method, related device, storage medium and program product

Publications (2)

Publication Number Publication Date
CN116386647A true CN116386647A (en) 2023-07-04
CN116386647B CN116386647B (en) 2023-08-22

Family

ID=86963653

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310606668.XA Active CN116386647B (en) 2023-05-26 2023-05-26 Audio verification method, related device, storage medium and program product

Country Status (1)

Country Link
CN (1) CN116386647B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110047490A (en) * 2019-03-12 2019-07-23 平安科技(深圳)有限公司 Method for recognizing sound-groove, device, equipment and computer readable storage medium
US10573312B1 (en) * 2018-12-04 2020-02-25 Sorenson Ip Holdings, Llc Transcription generation from multiple speech recognition systems
CN111785285A (en) * 2020-05-22 2020-10-16 南京邮电大学 Voiceprint recognition method for home multi-feature parameter fusion
CN111949965A (en) * 2020-08-12 2020-11-17 腾讯科技(深圳)有限公司 Artificial intelligence-based identity verification method, device, medium and electronic equipment
CN113129908A (en) * 2021-03-24 2021-07-16 中国科学院声学研究所南海研究站 End-to-end macaque voiceprint verification method and system based on cycle frame level feature fusion
WO2023070874A1 (en) * 2021-10-28 2023-05-04 中国科学院深圳先进技术研究院 Voiceprint recognition method



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant