WO2021082420A1 - 声纹认证方法、装置、介质及电子设备 - Google Patents

声纹认证方法、装置、介质及电子设备 Download PDF

Info

Publication number
WO2021082420A1
WO2021082420A1 PCT/CN2020/092943 CN2020092943W WO2021082420A1 WO 2021082420 A1 WO2021082420 A1 WO 2021082420A1 CN 2020092943 W CN2020092943 W CN 2020092943W WO 2021082420 A1 WO2021082420 A1 WO 2021082420A1
Authority
WO
WIPO (PCT)
Prior art keywords
voiceprint information
target user
feature
voiceprint
predicted
Prior art date
Application number
PCT/CN2020/092943
Other languages
English (en)
French (fr)
Inventor
冯晨
王健宗
彭俊清
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司 filed Critical 平安科技(深圳)有限公司
Publication of WO2021082420A1 publication Critical patent/WO2021082420A1/zh

Classifications

    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/18Artificial neural networks; Connectionist approaches
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/04Training, enrolment or model building
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/06Decision making techniques; Pattern matching strategies
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00Speaker identification or verification techniques
    • G10L17/22Interactive procedures; Man-machine interfaces
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/24Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
    • GPHYSICS
    • G10MUSICAL INSTRUMENTS; ACOUSTICS
    • G10LSPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/48Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • This application relates to the field of communication technology, and in particular to a voiceprint authentication method, device, medium and electronic equipment.
  • Since voiceprint recognition is a kind of biometric technology, a voice can be processed to generate an identity vector indicating the identity of the speaker, and whether the speakers of two voice segments are the same user can be determined by calculating the similarity between the identity vectors of the two voices.
  • During research on voiceprint technology, the inventors realized that a person's voiceprint changes over time, and the longer the interval, the greater the change. If the voiceprint information is collected long after the voiceprint model was registered, authentication may fail.
  • This application aims to provide a voiceprint authentication method, device, medium and electronic equipment, which can improve the accuracy of voiceprint authentication.
  • a voiceprint authentication method, including: acquiring the voiceprint information, age, gender, and environment of a target user before a preset time period; inputting the target user's voiceprint information, age, gender, and environment before the preset time period into a first prediction model to obtain predicted voiceprint information; collecting the to-be-authenticated voiceprint information of the current user; matching the predicted voiceprint information with the to-be-authenticated voiceprint information to obtain a first matching degree; and, if the first matching degree exceeds a first preset threshold, determining the current user as the target user.
  • a voiceprint authentication device, including: an acquisition module for acquiring the voiceprint information, age, gender, and environment of a target user before a preset time period; a first prediction module for inputting the target user's voiceprint information, age, gender, and environment before the preset time period into the first prediction model to obtain predicted voiceprint information; a collection module for collecting the to-be-authenticated voiceprint information of the current user; a matching module for matching the predicted voiceprint information with the to-be-authenticated voiceprint information to obtain a first matching degree; and a determining module for determining the current user as the target user if the first matching degree exceeds a first preset threshold.
  • an electronic device, including: one or more processors; and a storage device configured to store one or more programs which, when executed by the one or more processors, cause the one or more processors to implement the voiceprint authentication method described above.
  • a computer-readable program medium on which computer-readable instructions are stored; when the computer-readable instructions are executed by a processor of a computer, the computer is caused to execute the voiceprint authentication method described above.
  • In the technical solutions provided by some embodiments of this application, the voiceprint information, age, gender, and environment of the target user before a preset time period are obtained and input into the first prediction model to obtain predicted voiceprint information, so that the predicted voiceprint information takes into account how the target user's voiceprint changes with increasing age for the target user's gender, as well as how it changes across different environments.
  • The predicted voiceprint information is then matched with the to-be-authenticated voiceprint information to obtain a first matching degree; if the first matching degree exceeds a first preset threshold, the current user is determined to be the target user.
  • Because the target user's age, gender, and environment are considered when predicting the voiceprint information, the target user identified against the predicted voiceprint information is not affected by the passage of time, which solves the prior-art problem that authentication fails when the voiceprint information is collected long after the voiceprint model was registered.
  • FIG. 1 shows a schematic diagram of an exemplary system architecture to which the technical solutions of the embodiments of the present application can be applied;
  • Fig. 2 schematically shows a flowchart of a voiceprint authentication method according to an embodiment of the present application
  • FIG. 3 schematically shows a flowchart of a voiceprint authentication method according to an embodiment of the present application
  • Fig. 4 schematically shows a block diagram of a voiceprint authentication device according to an embodiment of the present application
  • Fig. 5 is a schematic diagram showing the hardware of an electronic device according to an exemplary embodiment
  • Fig. 6 shows a computer-readable storage medium for realizing the above voiceprint authentication method according to an exemplary embodiment.
  • FIG. 1 shows a schematic diagram of an exemplary system architecture 100 to which the technical solutions of the embodiments of the present application can be applied.
  • the system architecture 100 may include terminal devices (one or more of the smartphone 101, tablet computer 102, and portable computer 103 shown in FIG. 1, or a desktop computer, etc.), a network 104, and a server 105.
  • the network 104 is used as a medium for providing a communication link between the terminal device and the server 105.
  • the network 104 may include various connection types, such as wired communication links, wireless communication links, and so on.
  • terminal devices, networks 104, and servers 105 in FIG. 1 are merely illustrative. According to implementation needs, there may be any number of terminal devices, networks 104, and servers 105.
  • the server 105 may be a server cluster composed of multiple servers.
  • the server 105 may obtain the voiceprint information input by the target user from the terminal device.
  • the target user can input voiceprint information, age, gender, and environment through the client or web page in the terminal device.
  • the server 105 inputs the target user's voiceprint information, age, gender, and environment before the preset time period into the first prediction model to obtain predicted voiceprint information, so that the predicted voiceprint information takes into account the changes that occur as the target user, of a given gender, grows older, as well as the target user's changes across different environments.
  • the predicted voiceprint information is matched with the voiceprint information to be authenticated to obtain the first matching degree; if the first matching degree exceeds the first preset threshold, the current user is determined as the target user.
  • Because the target user's age, gender, and environment are considered when predicting the voiceprint information, the target user identified against the predicted voiceprint information is not affected by the passage of time, which solves the prior-art problem that authentication fails because the voiceprint information is collected long after the voiceprint model was registered.
  • the voiceprint authentication method provided by the embodiment of the present application is generally executed by the server 105, and correspondingly, the voiceprint authentication device is generally set in the server 105.
  • the terminal device may also have a similar function to the server 105, so as to execute the voiceprint authentication method provided by the embodiment of the present application.
  • FIG. 2 schematically shows a flowchart of a voiceprint authentication method according to an embodiment of the present application.
  • the execution subject of the voiceprint authentication method may be a server, for example, the server 105 shown in FIG. 1.
  • the voiceprint authentication method includes at least step S210 to step S250, which are described in detail as follows:
  • step S210 the voiceprint information, age, gender, and environment of the target user before a preset period of time are acquired.
  • the voiceprint information may be a piece of recording, or it may be voiceprint information with a certain characteristic extracted from a piece of recording.
  • the environment may include the target user's work environment, living environment, language environment, and so on.
  • The voiceprint information, age, and environment of the target user before each of a plurality of preset time periods can be acquired; from these, the trend of the target user's voiceprint change can be obtained, and the target user's predicted voiceprint information can be predicted more accurately.
  • For example, the target user's gender can be obtained together with the target user's voiceprint information, age, and environment from 1 year ago, 2 years ago, and 3 years ago; based on the voiceprint change trend over those three years, the prediction model can predict the voiceprint information more accurately.
  • When the target user is in the adolescent voice-change stage, the target user's voiceprint information before and during each stage of the voice change can be obtained; because the target user's gender and age are considered when predicting the voiceprint information, the solution in this embodiment can accurately predict the target user's voiceprint information during the voice-change period and after it ends.
  • step S220 the voiceprint information, age, gender, and environment of the target user before a preset time period are input into the first prediction model to obtain predicted voiceprint information.
  • the predicted voiceprint information may be the voiceprint information of the target user at the current time, or may be the voiceprint information of the target user at a certain time in the future.
  • The first prediction model is pre-trained using the following method: obtain a sample data set for training the first prediction model, where each piece of sample data includes the same user's voiceprint information, age, gender, and environment before the preset time period, together with that user's voiceprint information at the current time; use the user's voiceprint information, age, gender, and environment before the preset time period contained in each piece of sample data as the input of the first prediction model, use that user's voiceprint information at the current time as the predicted voiceprint information to be output by the first prediction model, and train the first prediction model; compare the predicted voiceprint information output by the first prediction model with the user's actual voiceprint information at the current time, and, if they are inconsistent, adjust the first prediction model so that the output voiceprint information is consistent with the user's actual voiceprint information at the current time.
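  • The following is a minimal PyTorch sketch of how such a prediction model could be trained; the network architecture, the 39-dimensional voiceprint feature, the one-hot environment encoding, and the mean-squared-error objective are illustrative assumptions, since the patent does not fix a specific model structure.

```python
# Illustrative sketch (assumptions noted above) of training the first prediction model.
import torch
import torch.nn as nn

EMB_DIM, ENV_DIM = 39, 8  # assumed: 39-dim voiceprint feature, 8 environment categories

class VoiceprintPredictor(nn.Module):
    def __init__(self):
        super().__init__()
        # input: past voiceprint feature + age (1) + gender (1) + environment one-hot
        self.net = nn.Sequential(
            nn.Linear(EMB_DIM + 2 + ENV_DIM, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, EMB_DIM),  # predicted current voiceprint feature
        )

    def forward(self, past_vp, age, gender, env_onehot):
        x = torch.cat([past_vp, age, gender, env_onehot], dim=-1)
        return self.net(x)

def train(model, loader, epochs=10, lr=1e-3):
    """loader yields (past_vp, age, gender, env_onehot, current_vp) batches."""
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()  # drive the predicted voiceprint toward the actual current one
    for _ in range(epochs):
        for past_vp, age, gender, env, cur_vp in loader:
            pred = model(past_vp, age, gender, env)
            loss = loss_fn(pred, cur_vp)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model
```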
  • step S230 the voiceprint information of the current user to be authenticated is collected.
  • the voice to be authenticated of the current user is recorded by a recording device, and then feature extraction is performed on the voice to be authenticated to obtain voiceprint information to be authenticated.
  • the MFCC feature of the voice to be authenticated can be extracted as the voiceprint information of the current user to be authenticated. It is also possible to extract the current user's auditory cepstral coefficient feature based on the Gammatone filter bank as the current user's voiceprint information to be authenticated.
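  • As a concrete illustration, the sketch below extracts MFCC features as the to-be-authenticated voiceprint using librosa; the 16 kHz sample rate and 13 coefficients are assumptions, and a Gammatone-filterbank cepstral feature could be substituted at the marked step.

```python
# Minimal sketch of MFCC-based voiceprint extraction (assumed parameters noted above).
import librosa
import numpy as np

def extract_voiceprint(wav_path, sr=16000, n_mfcc=13):
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)  # shape: (n_mfcc, n_frames)
    # Averaging over frames gives one fixed-length vector for matching;
    # a Gammatone-based auditory cepstral feature could be used here instead.
    return np.mean(mfcc, axis=1)
```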
  • step S240 the predicted voiceprint information is matched with the voiceprint information to be authenticated to obtain a first degree of matching.
  • The predicted voiceprint information obtained from the first prediction model can be matched with the to-be-authenticated voiceprint information to obtain the first matching degree.
  • the predicted voiceprint information and the voiceprint information to be authenticated can be scored by a linear discriminant model, and the obtained score can be used as the first degree of matching.
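  • The details of the linear discriminant scoring are not specified here; as an illustrative stand-in only, cosine similarity between the two feature vectors can serve as the first matching degree, as in the sketch below.

```python
# Assumed stand-in for the scoring step: cosine similarity as the first matching degree.
import numpy as np

def first_matching_degree(predicted_vp, to_authenticate_vp):
    a = np.asarray(predicted_vp, dtype=float)
    b = np.asarray(to_authenticate_vp, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
```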
  • The predicted voiceprint information and the to-be-authenticated voiceprint information can be matched locally in the application performing target-user authentication, or they can be uploaded to a relevant server and matched on that server.
  • the predicted voiceprint information can be matched with the voiceprint information to be authenticated through the following steps to obtain the first degree of matching.
  • First, the auditory cepstral coefficient features of the target user's voiceprint information before the preset time period are extracted based on the human cochlear auditory model, and these features are input into a first deep neural network model to obtain a deep bottleneck feature; the auditory cepstral coefficient feature and the deep bottleneck feature are then combined according to the formula Y = aG + bB to obtain the fusion feature Y of the target user's voiceprint information before the preset time period, where G is the auditory cepstral coefficient feature and B is the deep bottleneck feature. The coefficients a and b are obtained in advance by taking the target user's voiceprint information sample set from before the preset time period and finding the values of a and b that minimize the voice discrimination R, with 0 ≤ a ≤ 1, 0 ≤ b ≤ 1, and a + b = 1, where R is computed from the fusion features Y_i and Y_j of the i-th and j-th voice samples among the N voiceprints in the sample set. Finally, the fusion feature of the to-be-authenticated voiceprint information is compared with the fusion feature of the predicted voiceprint information to obtain the first matching degree.
  • The smaller the target user's voice discrimination R, the more uniform the target user's voiceprint features are across the sample set; minimizing R therefore makes it easier to recognize whether the to-be-authenticated voiceprint information comes from the target user.
  • In addition, fusing the two kinds of features extracted from the target user's voiceprint information yields a fusion feature that better represents the target user's voiceprint information.
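  • The sketch below illustrates the fusion Y = aG + bB and a simple search for the coefficients a and b under a + b = 1. The exact expression for the voice discrimination R appears as a formula image in the original filing and is not reproduced here, so the mean pairwise distance between fused samples is used purely as an assumed proxy; it also assumes G and B have already been projected to the same dimensionality.

```python
# Fusion feature Y = aG + bB and a grid search for (a, b); R below is an assumed proxy.
import numpy as np

def fuse(G, B, a, b):
    return a * np.asarray(G, dtype=float) + b * np.asarray(B, dtype=float)

def discrimination_R(G_samples, B_samples, a, b):
    Y = [fuse(g, bb, a, b) for g, bb in zip(G_samples, B_samples)]
    n = len(Y)
    dists = [np.linalg.norm(Y[i] - Y[j]) for i in range(n) for j in range(i + 1, n)]
    return float(np.mean(dists))  # assumed proxy for the voice discrimination R

def find_ab(G_samples, B_samples, steps=100):
    # minimize the proxy R over a in [0, 1] with b = 1 - a
    best = min((discrimination_R(G_samples, B_samples, a, 1 - a), a)
               for a in np.linspace(0.0, 1.0, steps + 1))
    return best[1], 1 - best[1]
```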
  • step S250 if the first matching degree exceeds the first preset threshold, the current user is determined as the target user.
  • If the first matching degree is greater than or equal to the first preset threshold, the similarity between the predicted voiceprint information and the to-be-authenticated voiceprint information meets the requirement; it can be determined that the current user and the target user are the same person, and the current user can be identified as the target user.
  • If the first matching degree is less than the first preset threshold, the current user is identified as a non-target user.
  • the non-target user's voiceprint information to be authenticated can be collected and stored, so that the target user can know who is trying to unlock his device.
  • In an embodiment of this application, the auditory cepstral coefficient features of the target user's voiceprint information before the preset time period can also be input into a stacked denoising autoencoder network model to obtain transfer features of that voiceprint information; the transfer features are input into a second deep neural network model to obtain a transfer deep bottleneck feature; the auditory cepstral coefficient feature and the transfer deep bottleneck feature are then combined according to the formula Y_1 = aG + bB_1 to obtain the transfer fusion feature Y_1 of the target user's voiceprint information before the preset time period, where G is the auditory cepstral coefficient feature and B_1 is the transfer deep bottleneck feature; the transfer fusion feature of the to-be-authenticated voiceprint information is compared with the transfer fusion feature of the predicted voiceprint information to obtain a third matching degree; and whether the current user is the target user is determined based on the first matching degree and the third matching degree.
  • In the above embodiment, a weighted sum of the first matching degree and the third matching degree can be calculated; if the weighted sum exceeds a third preset threshold, the current user is determined to be the target user.
  • Because the stacked denoising autoencoder network model has robust feature-extraction capability, the transfer features obtained by processing the auditory cepstral coefficient features with it can represent the voiceprint information more accurately.
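  • The sketch below shows one denoising autoencoder layer in PyTorch whose hidden code plays the role of the transfer feature; layer sizes, noise level, and training details are assumptions, and a stacked version would train several such layers one after another before feeding the resulting code to the second deep neural network to obtain the transfer bottleneck feature.

```python
# Minimal denoising-autoencoder sketch (assumed sizes and noise level).
import torch
import torch.nn as nn

class DenoisingAE(nn.Module):
    def __init__(self, in_dim=39, code_dim=16, noise_std=0.1):
        super().__init__()
        self.noise_std = noise_std
        self.encoder = nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, code_dim))
        self.decoder = nn.Sequential(nn.Linear(code_dim, 64), nn.ReLU(), nn.Linear(64, in_dim))

    def forward(self, x):
        noisy = x + self.noise_std * torch.randn_like(x)  # corrupt the input, then reconstruct it
        code = self.encoder(noisy)
        return self.decoder(code), code  # code acts as the transfer feature

def reconstruction_loss(model, x):
    # training minimizes reconstruction error of the clean input from its corrupted version
    recon, _ = model(x)
    return nn.functional.mse_loss(recon, x)
```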
  • FIG. 3 schematically shows a flowchart of a voiceprint authentication method according to an embodiment of the present application.
  • the execution subject of the voiceprint authentication method may be a server, for example, the server 105 shown in FIG. 1.
  • the voiceprint authentication method includes at least step S310 to step S390, which are described in detail as follows:
  • step S310 the voiceprint information, age, gender, and environment of the target user before a preset period of time are acquired.
  • step S320 the voiceprint information, age, gender, and environment of the target user before a preset time period are input into the first prediction model to obtain predicted voiceprint information.
  • step S330 the voiceprint information of the current user to be authenticated is collected.
  • step S340 the predicted voiceprint information is matched with the voiceprint information to be authenticated to obtain a first degree of matching.
  • step S350 if the first matching degree exceeds the first preset threshold, the current user is determined as the target user.
  • step S360 the face image information of the target user before a preset period of time is acquired.
  • the facial image information may be facial feature information extracted from the facial image of the target user.
  • Multiple feature points can be established on the edges of the facial features and the outer contour of the face in the face image of the target user, and the lines between the multiple feature points and the connecting feature points can be used as the face image information of the target user.
  • step S370 the face image, age, and gender of the target user before a preset time period are input into the second prediction model to obtain predicted face image information.
  • The second prediction model is pre-trained using the following method: obtain an image sample data set for training the second prediction model, where each piece of image sample data includes the same user's face image, age, and gender before the preset time period, together with that user's face image at the current time; use the user's face image, age, and gender before the preset time period contained in each piece of image sample data as the input of the second prediction model, use that user's face image at the current time as the predicted face image information to be output by the second prediction model, and train the second prediction model; compare the face image at the current time output by the second prediction model with the user's actual face image at the current time, and, if they are inconsistent, adjust the second prediction model so that the output face image at the current time is consistent with the actual face image.
  • step S380 the face image information of the current user to be authenticated is collected.
  • the face image of the current user to be authenticated may be captured by a camera, and then feature extraction is performed on the face image to be authenticated to obtain the image information to be authenticated.
  • Multiple feature points can be established on the edges of the facial features and the outer contour of the face in the face image to be authenticated, and the lines between the multiple feature points and the connecting feature points are used as the image information to be authenticated.
  • step S390 the predicted face image information is matched with the face image information to be authenticated to obtain a second degree of matching.
  • The target user's predicted face image information can be obtained from the second prediction model and matched with the current user's to-be-authenticated face image information to obtain the second matching degree; the match is scored, and the similarity between the target user's predicted face image information and the current user's to-be-authenticated face image information is determined from the score.
  • the predicted face image information can be matched with the feature points of the facial features and face shape in the face image information to be authenticated, and the percentage of the number of matched feature points to the total number of points can be used as the second matching degree.
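  • A minimal sketch of that feature-point comparison follows; the pixel tolerance and the assumption of one-to-one point correspondence are illustrative choices.

```python
# Second matching degree sketch: fraction of predicted facial feature points that fall
# within an assumed tolerance of the corresponding to-be-authenticated points.
import numpy as np

def second_matching_degree(pred_points, auth_points, tol=3.0):
    pred = np.asarray(pred_points, dtype=float)   # shape: (n_points, 2), same ordering
    auth = np.asarray(auth_points, dtype=float)
    matched = np.linalg.norm(pred - auth, axis=1) <= tol  # per-point distance test
    return float(matched.mean())  # percentage of matched points
```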
  • step S3100 the first matching degree and the second matching degree are weighted and calculated to obtain the total matching degree; if the total matching degree is greater than the second preset threshold, it is determined that the current user is the target user.
  • the first matching degree and the second matching degree may each be assigned a weight of 50%, and a weighted sum is performed to obtain a weighted total matching degree.
  • The first matching degree, the second matching degree, and the third matching degree may also be combined by a weighted sum and compared with a fourth preset threshold; if the fourth preset threshold is reached, the current user is determined to be the target user. The weights of the first, second, and third matching degrees are set as required.
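  • A short sketch of the weighted combination and threshold decision is given below; the 50/50 (or three-way) weights and the threshold value are configuration choices rather than values fixed by the patent.

```python
# Weighted combination of matching degrees and threshold decision (assumed weights/threshold).
def total_matching_degree(m1, m2, m3=None, weights=(0.5, 0.5, 0.0)):
    scores = (m1, m2, 0.0 if m3 is None else m3)
    return sum(w * s for w, s in zip(weights, scores))

def authenticate(m1, m2, m3=None, weights=(0.5, 0.5, 0.0), threshold=0.8):
    # True means the current user is determined to be the target user
    return total_matching_degree(m1, m2, m3, weights) >= threshold
```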
  • Since a person's appearance also changes with age, and the trend of that change differs between genders, the foregoing embodiment predicts both the voiceprint information and the face image of the target user from before the set time period and then combines the predicted voiceprint information and face image to identify the current user, giving higher recognition accuracy.
  • In an embodiment of this application, after the voiceprint information of the target user before the preset time period is acquired, the voiceprint information can be denoised to obtain clean speech data; a speech enhancement algorithm based on spectral subtraction can be used to remove the noise introduced by the recording equipment. The clean speech data is then divided into frames, and the Mel-frequency cepstral coefficient features of each frame of speech data are extracted based on the human cochlear auditory model.
  • Specifically, the clean speech data is divided into frames with a frame length of 25 ms and a frame shift of 10 ms, and short-time analysis is performed on each frame to obtain its MFCC (Mel-frequency cepstral coefficient) features; the first- and second-order differences of the MFCCs are then computed, and the first 13 dimensions of the MFCCs, their first-order differences, and their second-order differences are concatenated into a 39-dimensional feature vector per frame. The 39-dimensional feature vector of each frame of voiceprint information, together with age, gender, and environment, is input into the first prediction model to obtain the predicted voiceprint information of each frame, and the overall predicted voiceprint information is then obtained from the per-frame predictions.
  • the predicted voiceprint information of each frame may be spliced and combined to obtain predicted voiceprint information.
  • In the above embodiment, the feature vector of each frame is predicted by the first prediction model, and the per-frame predictions are combined into the predicted voiceprint information, which makes the resulting voiceprint prediction more accurate.
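  • The per-frame 39-dimensional feature described above can be computed as in the sketch below (13 MFCCs plus first- and second-order differences, 25 ms frames, 10 ms shift); the 16 kHz sample rate is an assumption.

```python
# Per-frame 39-dim feature: 13 MFCCs + delta + delta-delta over 25 ms / 10 ms frames.
import librosa
import numpy as np

def frame_features(wav_path, sr=16000):
    y, sr = librosa.load(wav_path, sr=sr)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13,
                                n_fft=int(0.025 * sr),       # 25 ms frame length
                                hop_length=int(0.010 * sr))  # 10 ms frame shift
    d1 = librosa.feature.delta(mfcc)           # first-order difference
    d2 = librosa.feature.delta(mfcc, order=2)  # second-order difference
    return np.vstack([mfcc, d1, d2]).T         # shape: (n_frames, 39)
```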
  • Fig. 4 schematically shows a block diagram of a voiceprint authentication device according to an embodiment of the present application.
  • a voiceprint authentication device 400 includes an acquisition module 401, a first prediction module 402, an acquisition module 403, a matching module 404, and a determination module 405.
  • The obtaining module 401 is used to obtain the voiceprint information, age, gender, and environment of the target user before a preset time period; the first prediction module 402 is used to input the target user's voiceprint information, age, gender, and environment before the preset time period into the first prediction model to obtain the predicted voiceprint information;
  • the collection module 403 is used to collect the to-be-authenticated voiceprint information of the current user;
  • the matching module 404 is used to match the predicted voiceprint information with the to-be-authenticated voiceprint information to obtain the first matching degree;
  • the determining module 405 is configured to determine the current user as the target user if the first matching degree exceeds the first preset threshold.
  • The first prediction module 402 is configured to: perform noise-reduction processing on the voiceprint information to obtain clean speech data; and divide the clean speech data into frames and extract the auditory cepstral coefficient features of each frame of speech data based on the human cochlear auditory model. Inputting the target user's voiceprint information, age, gender, and environment before the preset time period into the first prediction model to obtain predicted voiceprint information then includes: inputting the age, gender, environment, and the auditory cepstral coefficient features of each frame into the first prediction model to obtain the predicted voiceprint information of each frame, and obtaining the predicted voiceprint information from the per-frame predictions.
  • The voiceprint authentication device further includes a second prediction module, configured to: obtain the face image information of the target user before the preset time period; input the target user's face image, age, and gender before the preset time period into the second prediction model to obtain predicted face image information; collect the to-be-authenticated face image information of the current user; and match the predicted face image information with the to-be-authenticated face image information to obtain the second matching degree.
  • the matching module 404 is configured to: perform a weighted sum calculation on the first matching degree and the second matching degree to obtain a total matching degree; if the total matching degree is greater than a second preset threshold, determine that the current user is the target user .
  • The matching module 404 is configured to: extract, based on the human cochlear auditory model, the auditory cepstral coefficient features of the target user's voiceprint information before the preset time period, and input the auditory cepstral coefficient features into the first deep neural network model to obtain the deep bottleneck feature; combine the auditory cepstral coefficient feature and the deep bottleneck feature according to the formula Y = aG + bB to obtain the fusion feature Y of the target user's voiceprint information before the preset time period, where G is the auditory cepstral coefficient feature, B is the deep bottleneck feature, and the coefficients a and b are obtained in advance by taking the target user's voiceprint information sample set from before the preset time period and finding the values of a and b that minimize the voice discrimination R (0 ≤ a ≤ 1, 0 ≤ b ≤ 1, a + b = 1), R being computed from the fusion features Y_i and Y_j of the i-th and j-th voice samples among the N voiceprints in the sample set; and compare the fusion feature of the to-be-authenticated voiceprint information with the fusion feature of the predicted voiceprint information to obtain the first matching degree.
  • The matching module 404 is further configured to: input the auditory cepstral coefficient features of the target user's voiceprint information before the preset time period into the stacked denoising autoencoder network model to obtain the transfer features of the target user's voiceprint information before the preset time period; input the transfer features into the second deep neural network model to obtain the transfer deep bottleneck feature; combine the auditory cepstral coefficient feature and the transfer deep bottleneck feature according to the formula Y_1 = aG + bB_1 to obtain the transfer fusion feature Y_1 of the target user's voiceprint information before the preset time period, where G is the auditory cepstral coefficient feature and B_1 is the transfer deep bottleneck feature; compare the transfer fusion feature of the to-be-authenticated voiceprint information with the transfer fusion feature of the predicted voiceprint information to obtain the third matching degree; and determine whether the current user is the target user based on the first matching degree and the third matching degree.
  • the electronic device 50 according to this embodiment of the present application will be described below with reference to FIG. 5.
  • the electronic device 50 shown in FIG. 5 is only an example, and should not bring any limitation to the function and scope of use of the embodiments of the present application.
  • the electronic device 50 is in the form of a general-purpose computing device.
  • the components of the electronic device 50 may include, but are not limited to: the aforementioned at least one processing unit 51, the aforementioned at least one storage unit 52, a bus 53 connecting different system components (including the storage unit 52 and the processing unit 51), and a display unit 54.
  • the storage unit stores program codes, and the program codes can be executed by the processing unit 51, so that the processing unit 51 executes the steps according to various exemplary implementations of the present application described in the above-mentioned "Embodiment Method" section of this specification.
  • the storage unit 52 may include a readable medium in the form of a volatile storage unit, such as a random access storage unit (RAM) 521 and/or a cache storage unit 522, and may further include a read-only storage unit (ROM) 523.
  • The storage unit 52 may also include a program/utility 524 having a set of (at least one) program modules 525; such program modules 525 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data, and each or some combination of these examples may include an implementation of a network environment.
  • the bus 53 may represent one or more of several types of bus structures, including a storage-unit bus or storage-unit controller, a peripheral bus, a graphics acceleration port, a processing unit, or a local bus using any of a variety of bus structures.
  • The electronic device 50 may also communicate with one or more external devices (such as a keyboard, pointing device, Bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 50, and/or with any device (such as a router, modem, etc.) that enables the electronic device 50 to communicate with one or more other computing devices. Such communication can be performed through an input/output (I/O) interface 55.
  • the electronic device 50 may also communicate with one or more networks (for example, a local area network (LAN), a wide area network (WAN), and/or a public network, such as the Internet) through the network adapter 56. As shown in the figure, the network adapter 56 communicates with other modules of the electronic device 50 through the bus 53.
  • The example embodiments described here can be implemented in software, or in software combined with the necessary hardware. Therefore, the technical solution according to the embodiments of the present application can be embodied in the form of a software product, which can be stored in a non-volatile storage medium (which may be a CD-ROM, a USB flash drive, a removable hard disk, etc.) or on a network, and includes several instructions to cause a computing device (which may be a personal computer, a server, a terminal device, a network device, etc.) to execute the method according to the embodiments of the present application.
  • a computer-readable storage medium is also provided.
  • the computer-readable storage medium may be nonvolatile or volatile.
  • Stored on it are program products that can implement the above-mentioned methods of this specification.
  • Various aspects of the present application can also be implemented in the form of a program product, which includes program code; when the program product runs on a terminal device, the program code causes the terminal device to execute the steps according to various exemplary embodiments of the present application described in the "Exemplary Methods" section above.
  • Referring to FIG. 6, a program product 60 for implementing the above method according to an embodiment of the present application is described; it may take the form of a portable compact disc read-only memory (CD-ROM), include program code, and run on a terminal device such as a personal computer.
  • the program product of this application is not limited to this.
  • the readable storage medium can be any tangible medium that contains or stores a program, and the program can be used by or in combination with an instruction execution system, device, or device.
  • the program product can adopt any combination of one or more readable media.
  • the readable medium may be a readable signal medium or a readable storage medium.
  • The readable storage medium may be, for example, but is not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the above. More specific examples (a non-exhaustive list) of readable storage media include: an electrical connection with one or more wires, a portable disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the above.
  • the computer-readable signal medium may include a data signal propagated in baseband or as a part of a carrier wave, and readable program code is carried therein. This propagated data signal can take many forms, including but not limited to electromagnetic signals, optical signals, or any suitable combination of the foregoing.
  • the readable signal medium may also be any readable medium other than a readable storage medium, and the readable medium may send, propagate, or transmit a program for use by or in combination with the instruction execution system, apparatus, or device.
  • the program code contained on the readable medium can be transmitted by any suitable medium, including but not limited to wireless, wired, optical cable, RF, etc., or any suitable combination of the above.
  • the program code for performing the operations of this application can be written in any combination of one or more programming languages.
  • Programming languages include object-oriented programming languages, such as Java and C++, as well as conventional procedural programming languages, such as the "C" language or similar programming languages.
  • The program code can be executed entirely on the user's computing device, partly on the user's device, as an independent software package, partly on the user's computing device and partly on a remote computing device, or entirely on a remote computing device or server.
  • In the case of a remote computing device, the remote computing device can be connected to the user's computing device through any kind of network, including a local area network (LAN) or a wide area network (WAN), or it can be connected to an external computing device (for example, over the Internet using an Internet service provider).

Landscapes

  • Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Acoustics & Sound (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Game Theory and Decision Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Business, Economics & Management (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Collating Specific Patterns (AREA)

Abstract

This application provides a voiceprint authentication method, apparatus, medium, and electronic device, which can be implemented with deep learning in artificial intelligence. The method includes: acquiring the voiceprint information, age, gender, and environment of a target user before a preset time period; inputting the target user's voiceprint information, age, gender, and environment before the preset time period into a first prediction model to obtain predicted voiceprint information; collecting the to-be-authenticated voiceprint information of the current user; matching the predicted voiceprint information with the to-be-authenticated voiceprint information to obtain a first matching degree; and, if the first matching degree exceeds a first preset threshold, determining the current user as the target user. Because the target user's age, gender, and environment are considered when predicting the voiceprint information, the target user identified against the predicted voiceprint information is not affected by the passage of time. This application enables voiceprint authentication.

Description

声纹认证方法、装置、介质及电子设备
本申请要求于2019年11月01日提交中国专利局、申请号为2019110598438,发明名称为“声纹认证方法、装置、介质及电子设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及通信技术领域,特别涉及一种声纹认证方法、装置、介质及电子设备。
背景技术
由于声纹识别是生物识别技术的一种,通过对语音进行处理可生成用于指示该语音输入者身份信息的身份向量,通过计算两段语音的身份向量之间的相似度来确定这两段语音的输入者是否为同一用户。
在声纹技术的研究过程中,发明人意识到,随着时间的变化,人的声纹也会发生变化,而且时间越长,人的声纹变化也越大,若采集到的声纹信息的时间与预先注册声纹模型的时间相隔较远,则有可能会导致认证失败。
发明内容
本申请旨在提供一种声纹认证方法、装置、介质及电子设备,能够提高声纹认证的准确性。
根据本申请实施例的一个方面,提供了一种声纹认证方法,包括:获取目标用户在预设时间段前的声纹信息、年龄、性别和所处环境;将所述目标用户在预设时间段前的声纹信息、年龄、性别和所处环境输入第一预测模型得到预测声纹信息;采集当前用户的待认证声纹信息;将所述预测声纹信息与所述待认证声纹信息进行匹配,以获得第一匹配度;若所述第一匹配度超过第一预设阈值,则将所述当前用户确定为所述目标用户。
根据本申请实施例的一个方面,提供了一种声纹认证装置,包括:获取模块,用于获取目标用户在预设时间段前的声纹信息、年龄、性别和所处环境;第一预测模块,用于将所述目标用户在预设时间段前的声纹信息、年龄、性别和所处环境输入第一预测模型得到预测声纹信息;采集模块,用于采集当前用户的待认证声纹信息;匹配模块,用于将所述预测声纹信息与所述待认证声纹信息进行匹配,以获得第一匹配度;确定模块,若所述第一匹配度超过第一预设阈值,则将所述当前用户确定为所述目标用户。
根据本申请实施例的一个方面,提供了一种电子装置,包括:一个或多个处理器;存储装置,用于存储一个或多个程序,当所述一个或多个程序被所述一个或多个处理器执行时,使得所述一个或多个处理器实现如上所述的声纹认证方法。
根据本申请实施例的一个方面,提供了一种计算机可读程序介质,其上存储有计算机可读指令,当所述计算机可读指令被计算机的处理器执行时,使计算机执行如上所述的声纹认证方法。
本申请的实施例提供的技术方案可以包括以下有益效果:
在本申请的一些实施例所提供的技术方案中,通过获取目标用户在预设时间段前的声纹信息、年龄、性别和所处环境;将目标用户在预设时间段前的声纹信息、年龄、性别和 所处环境输入第一预测模型得到预测声纹信息,使得到的预测声纹信息能够考虑到目标用户的性别随着年龄的增大时声纹信息产生的变化,使得到的预测声纹信息能够考虑到目标用户在不同环境中声纹信息的变化。再将预测声纹信息与待认证声纹信息进行匹配,以获得第一匹配度;若第一匹配度超过第一预设阈值,则将当前用户确定为目标用户。由于在预测声纹信息时考虑了目标用户的年龄、性别和所处环境,使以该预测声纹信息为标准识别出的目标用户不受时间的干扰,解决了现有技术中采集到的声纹信息的时间与预先注册声纹模型的时间相隔较远导致认证失败的问题。
应当理解的是,以上的一般描述和后文的细节描述仅是示例性的,并不能限制本申请。
附图说明
图1示出了可以应用本申请实施例的技术方案的示例性系统架构的示意图;
图2示意性示出了根据本申请的一个实施例的声纹认证方法的流程图;
图3示意性示出了根据本申请的一个实施例的声纹认证方法的流程图;
图4示意性示出了根据本申请的一个实施例的声纹认证装置的框图;
图5是根据一示例性实施例示出的一种电子设备的硬件示意图;
图6是根据一示例性实施例示出的一种用于实现上述声纹认证方法的计算机可读存储介质。
具体实施方式
现在将参考附图更全面地描述示例实施方式。然而,示例实施方式能够以多种形式实施,且不应被理解为限于在此阐述的范例;相反,提供这些实施方式使得本申请将更加全面和完整,并将示例实施方式的构思全面地传达给本领域的技术人员。
此外,所描述的特征、结构或特性可以以任何合适的方式结合在一个或更多实施例中。在下面的描述中,提供许多具体细节从而给出对本申请的实施例的充分理解。然而,本领域技术人员将意识到,可以实践本申请的技术方案而没有特定细节中的一个或更多,或者可以采用其它的方法、组元、装置、步骤等。在其它情况下,不详细示出或描述公知方法、装置、实现或者操作以避免模糊本申请的各方面。
附图中所示的方框图仅仅是功能实体,不一定必须与物理上独立的实体相对应。即,可以采用软件形式来实现这些功能实体,或在一个或多个硬件模块或集成电路中实现这些功能实体,或在不同网络和/或处理器装置和/或微控制器装置中实现这些功能实体。
附图中所示的流程图仅是示例性说明,不是必须包括所有的内容和操作/步骤,也不是必须按所描述的顺序执行。例如,有的操作/步骤还可以分解,而有的操作/步骤可以合并或部分合并,因此实际执行的顺序有可能根据实际情况改变。
图1示出了可以应用本申请实施例的技术方案的示例性系统架构100的示意图。
如图1所示,系统架构100可以包括终端设备(如图1中所示智能手机101、平板电脑102和便携式计算机103中的一种或多种,当然也可以是台式计算机等等)、网络104和服务器105。网络104用以在终端设备和服务器105之间提供通信链路的介质。网络104可以包括各种连接类型,例如有线通信链路、无线通信链路等等。
应该理解,图1中的终端设备、网络104和服务器105的数目仅仅是示意性的。根据实现需要,可以具有任意数目的终端设备、网络104、和服务器105。比如服务器105可以是多个服务器组成的服务器集群等。
在本申请的一个实施例中,服务器105可以获取目标用户从终端设备输入的声纹信息。目标用户可以通过终端设备中的客户端或网页输入声纹信息、年龄、性别和所处环境。服务器105将目标用户在预设时间段前的声纹信息、年龄、性别和所处环境输入第一预测模型得到预测声纹信息,使得到的预测声纹信息能够考虑到目标用户的性别随着年龄的增大时产生的变化,使得到的预测声纹信息能够考虑到目标用户在不同环境中的变化。再将预测声纹信息与待认证声纹信息进行匹配,以获得第一匹配度;若第一匹配度超过第一预设阈值,则将当前用户确定为目标用户。由于在预测声纹信息时考虑了目标用户的年龄、性别和所处环境,使以该预测声纹信息为标准识别出的目标用户不受时间的干扰,解决了现有技术中因为采集到的声纹信息的时间与预先注册声纹模型的时间相隔较远导致认证失败的问题。
需要说明的是,本申请实施例所提供的声纹认证方法一般由服务器105执行,相应地,声纹认证装置一般设置于服务器105中。但是,在本申请的其它实施例中,终端设备也可以与服务器105具有相似的功能,从而执行本申请实施例所提供的声纹认证方法。
以下对本申请实施例的技术方案的实现细节进行详细阐述:
图2示意性示出了根据本申请的一个实施例的声纹认证方法的流程图,该声纹认证方法的执行主体可以是服务器,比如可以是图1中所示的服务器105。
参照图2所示,该声纹认证方法至少包括步骤S210至步骤S250,详细介绍如下:
在步骤S210中,获取目标用户在预设时间段前的声纹信息、年龄、性别和所处环境。
在本申请的一个实施例中,声纹信息可以是一段录音,也可以是从一段录音中提取出的具有某种特征的声纹信息。所处环境可以包括目标用户工作环境、生活环境、语言环境等。
在本申请的一个实施例中,可以获取目标用户在多个预设时间段前的声纹信息、年龄和所处环境。通过获取目标用户在多个预设时间段前的声纹信息、年龄和所处环境,能够得到目标用户的声纹变化趋势,更加准确的预测出目标用户的预测声纹信息。
具体例如,可以获取目标用户的性别,并获取1年前、2年前、3年前目标用户的声纹信息、年龄及所处环境,预测模型根据目标用户在1年前、2年前、3年前的声纹变化趋势,能够更加准确的预测声纹信息。
在该实施例中,当目标用户处于青春期变声阶段时,可以获取目标用户变声前和变声中各个阶段的声纹信息,由于该实施例中在预测声纹信息时考虑到了目标用户的性别、年龄,使该实施例中的方案能够准确预测出目标用户在变声期中和变声期结束后的预测声纹信息。
在步骤S220中,将目标用户在预设时间段前的声纹信息、年龄、性别和所处环境输入第一预测模型得到预测声纹信息。
在本申请的一个实施例中,预测声纹信息可以是目标用户在当前时间的声纹信息,也可以是目标用户在未来某个时间的声纹信息。
在本申请的一个实施例中,第一预测模型采用以下方法预先训练:获取用于对第一预测模型进行训练的样本数据集合,其中,样本数据集合中的每条样本数据均包括同一用户在预设时间段前的声纹信息、年龄、性别和所处环境以及该用户在当前时间的声纹信息;将样本数据集合中的每条样本数据包含的该用户在预设时间段前的声纹信息、年龄、性别和所处环境作为第一预测模型的输入,将样本数据集合中的每条样本数据包含的该用户在当前时间的声纹信息作为第一预测模型输出的预测声纹信息,对第一预测模型进行训练;将第一预测模型输出的预测声纹信息与该用户在当前时间实际的声纹信息进行比较,如果不一致,调整第一预测模型,使得输出的当前的声纹信息与该用户在当前时间实际的声纹信息一致。
在步骤S230中,采集当前用户的待认证声纹信息。
在本申请的一个实施例中,通过录音设备记录当前用户的待认证的语音,然后对该待认证的语音进行特征提取以获得待认证声纹信息。可以提取该待认证的语音的MFCC特征作为当前用户的待认证声纹信息。也可以基于Gammatone滤波器组提取当前用户的听觉倒谱系数特征作为当前用户的待认证声纹信息。
在步骤S240中,将预测声纹信息与待认证声纹信息进行匹配,以获得第一匹配度。
在本申请的一个实施例中,可以获取通过预测模型得到的预测声纹信息,再将该预测声纹信息与待认证声纹信息进行匹配,以获得第一匹配度。可以通过线性判别模型对该预测声纹信息与待认证声纹信息进行打分,将得到的分数作为第一匹配度。
在上述实施例中,可在目标用户认证的相关应用本地,将预测声纹信息与待认证声纹信息进行匹配,也可通用将预测声纹信息与待认证声纹信息上传至相关服务器,在相关服务器中将预测声纹信息与待认证声纹信息进行匹配。
在本申请的一个实施例中,可以通过以下步骤将预测声纹信息与待认证声纹信息进行匹配,获得第一匹配度。
首先,基于人耳耳蜗听觉模型提取目标用户在预设时间段前的声纹信息的听觉倒谱系数特征,将听觉倒谱系数特征输入第一深度神经网络模型得到深度瓶颈特征;再将听觉倒谱系数特征和深度瓶颈特征按照公式Y=aG+bB计算,得到目标用户在预设时间段前的声纹信息的融合特征Y,其中,G为听觉倒谱系数特征,B为深度瓶颈特征,系数a和b预先通过以下过程获得:获取目标用户在预设时间段前的声纹信息样本集合,求使语音区分度R取最小值时a与b的值,0≤a≤1,0≤b≤1,a+b=1,
Figure PCTCN2020092943-appb-000001
其中,N为目标用户在预设时间段前的声纹信息样本集合中的声纹数,Y i与Y j分别为基于在声纹信息样本集合中目标用户的第i条语音和第j条语音的听觉倒谱系数特征G和深度瓶颈特征B按照 Y=aG+bB得到的融合特征;将待认证声纹信息的融合特征与预测声纹信息的融合特征进行比较,以获得第一匹配度。
在本实施例中,目标用户的声纹语音区分度越小,在声纹信息样本集合中目标用户的声纹信息特征越统一,使目标用户的语音区分度达到最小值,能够更加容易识别出待认证声纹信息是否来自于目标用户。此外,将从目标用户的声纹信息中提取出的两种特征进行融合得到融合特征,得到的融合特征也更加能代表目标用户的声纹信息。
在步骤S250中,若第一匹配度超过第一预设阈值,则将当前用户确定为目标用户。
当第一匹配度大于或等于第一预设阈值时,说明当前预测声纹信息与待认证声纹信息的相似度达到要求,可以确定当前用户与目标用户为同一人,能够将当前用户识别为目标用户。
在本申请的一个实施例中,若第一匹配度小于第一预设阈值,则将当前用户识别为非目标用户。可以收集该非目标用户的待认证声纹信息进行存储,使目标用户能够知道有哪些人试图对其设备进行解锁。
在本申请的一个实施例中,还可以将目标用户在预设时间段前的声纹信息的听觉倒谱系数特征输入堆叠降噪自编码网络模型得到目标用户在预设时间段前的声纹信息的迁移特征;将迁移特征输入第二深度神经网络模型得到迁移深度瓶颈特征;再将听觉倒谱系数特征和迁移深度瓶颈特征按照公式Y 1=aG+bB 1计算,得到目标用户在预设时间段前的声纹信息的迁移融合特征Y 1,其中,G为听觉倒谱系数特征,B 1为迁移深度瓶颈特征;再将待认证声纹信息的迁移融合特征与预测声纹信息的迁移融合特征进行比较,以获得第三匹配度;再基于第一匹配度和第三匹配度,判断当前用户是否为目标用户。
在上述施例中,可以计算第一匹配度和第三匹配度的加权和,若当前用户的待认证声纹信息超过第三设定阈值,则确定当前用户为目标用户。
在上述实施例中,由于堆叠降噪自编码网络模型具有鲁棒的特征提取能力,使用堆叠降噪自编码网络模型处理听觉倒谱系数特征得到的迁移特征,能够更加准确的表示声纹信息。
图3示意性示出了根据本申请的一个实施例的声纹认证方法的流程图,该声纹认证方法的执行主体可以是服务器,比如可以是图1中所示的服务器105。
参照图3所示,该声纹认证方法至少包括步骤S310至步骤S390,详细介绍如下:
在步骤S310中,获取目标用户在预设时间段前的声纹信息、年龄、性别和所处环境。
在步骤S320中,将目标用户在预设时间段前的声纹信息、年龄、性别和所处环境输入第一预测模型得到预测声纹信息。
在步骤S330中,采集当前用户的待认证声纹信息。
在步骤S340中,将预测声纹信息与待认证声纹信息进行匹配,以获得第一匹配度。
在步骤S350中,若第一匹配度超过第一预设阈值,则将当前用户确定为目标用户。
在步骤S360中,获取目标用户在预设时间段前的人脸图像信息。
在本申请的一个实施例中,人脸图像信息可以为从目标用户的人脸图像中提取出来的 人脸特征信息。可以在目标用户的人脸图像中的五官边缘和脸的外轮廓建立多个特征点,将多个特征点和连接特征点之间的连线作为目标用户的人脸图像信息。
在步骤S370中,将目标用户在预设时间段前的人脸图像、年龄、性别输入第二预测模型获得预测人脸图像信息。
在本申请的一个实施例中,第二预测模型采用以下方法预先训练:获取用于对第二预测模型进行训练的图像样本数据集合,其中,图像样本数据集合中的每条图像样本数据均包括同一用户在预设时间段前的人脸图像、年龄和性别以及该用户在当前时间的人脸图像;将图像样本数据中的每条图像样本数据包含的该用户预设时间段前的人脸图像、年龄和性别作为第二预测模型的输入,将图像样本数据中的每条图像样本数据包含的该用户的当前时间的人脸图像作为第二预测模型的输出作为预测人脸图像信息,对第二预测进行训练;将第二预测模型输出的该用户在当前时间的人脸图像与该用户当前时间实际的人脸图像进行比较,如果不一致,调整第二预测模型,使得输出的同一用户在当前时间的人脸图像与实际的人脸图像一致。
在步骤S380中,采集当前用户的待认证人脸图像信息。
在本申请的一个实施例中,可以通过相机拍摄得到当前用户的待认证人脸图像,然后对该待认证人脸图像进行特征提取以获得待认证图像信息。可以在待认证人脸图像中的五官边缘和脸的外轮廓建立多个特征点,将多个特征点和连接特征点之间的连线作为待认证图像信息。
在步骤S390中,将预测人脸图像信息与待认证人脸图像信息进行匹配,以获得第二匹配度。
在本申请的一个实施例中,可以通过预测模型预测得到目标用户的预测人脸图像信息,并将该当前预测人脸图像信息与当前用户的待认证人脸图像信息进行匹配,以获得第二匹配度,通过对该第二匹配度进行打分,然后根据打分结果确定目标用户的预测人脸图像信息与当前用户的待认证人脸图像信息的相似度。可以将预测人脸图像信息与待认证人脸图像信息中的五官和脸型的特征点进行匹配,将匹配的特征点数占总点数的百分比作为第二匹配度。
在步骤S3100中,将第一匹配度和第二匹配度进行加权和计算以获得总匹配度;若总匹配度大于第二预设阈值,则确定当前用户为目标用户。
在本申请的一个实施例中,可以对第一匹配度和第二匹配度各赋予50%权重,进行加权求和以得到加权后的总匹配度。
在本申请的一个实施例中,可以将第一匹配度、第二匹配度和第三匹配度进行加权和计算后和第四预设阈值比较,若达到第四预设阈值,则确定当前用户为目标用户。其中,第一匹配度、第二匹配度和第三匹配度的权重根据需要设定。
由于随着时间的推移,目标用户的相貌也会随着年龄的增长而改变,而且不同性别的目标用户,其相貌的变化趋势也存在区别。上述实施例通过对目标用户在设定时间段前的声纹信息和人脸图像同时进行预测,然后将预测后的声纹信息和人脸图像结合在一起对当 前用户进行识别,识别的准确度更高。
在本申请的一个实施例中,在获取目标用户在预设时间段前的声纹信息之后,可以对声纹信息进行降噪处理得到纯语音数据,可以采用基于谱相减的语音增强算法对声纹信息进行去噪处理,以消除录音设备造成的噪声,得到纯语音数据。再对纯语音数据进行分帧,基于人耳耳蜗听觉模型提取每帧语音数据中的梅尔倒谱系数特征。具体地,对得到的纯语音数据按照帧长25ms,帧移10ms进行分帧,并通过MFCC(Mel Frequency Cepstrum Coefficient,梅尔频率倒谱系数)特征,对每帧语音数据做短时分析得到MFCC特征并继续计算其一阶和二阶差分,分别提取MFCC特征、MFCC特征的一阶差分、MFCC特征的二阶差分的前13维特征向量拼接成为一个39维的特征向量,再将每帧声纹信息的39维特征向量、年龄、性别和所处环境输入第一预测模型,以获得每帧的预测声纹信息,再根据每帧的预测声纹信息得到预测声纹信息。可以是将每帧的预测声纹信息拼接组合以得到预测声纹信息。
在上述实施例中通过第一预测模型对每帧的特征向量进行预测,并将预测后的预测值组合成预测声纹信息,以使得到的声纹预测的结果更加准确。
以下介绍本申请的装置实施例,可以用于执行本申请上述实施例中的任务处理时间方法。对于本申请装置实施例中未披露的细节,请参照本申请上述的任务处理时间方法的实施例。
图4示意性示出了根据本申请的一个实施例的任务处理时间装置的框图。
参照图4所示,根据本申请的一种声纹认证装置400,包括获取模块401、第一预测模块402、采集模块403、匹配模块404和确定模块405。
在本申请的一些实施例中,基于前述方案,获取模块401用于获取目标用户在预设时间段前的声纹信息、年龄、性别和所处环境;第一预测模块402用于将目标用户在预设时间段前的声纹信息、年龄、性别和所处环境输入第一预测模型得到预测声纹信息;采集模块403用于采集当前用户的待认证声纹信息;匹配模块404用于将预测声纹信息与待认证声纹信息进行匹配,以获得第一匹配度;确定模块405用于若第一匹配度超过第一预设阈值,则将当前用户确定为目标用户。
在本申请的一些实施例中,基于前述方案,第一预测模块402配置为:对声纹信息进行降噪处理得到纯语音数据;对纯语音数据进行分帧,基于人耳耳蜗听觉模型提取每帧语音数据中的听觉倒谱系数特征;将目标用户在预设时间段前的声纹信息、年龄、性别和所处环境输入第一预测模型得到预测声纹信息包括:将年龄、性别、所处环境及每帧的听觉倒谱系数特征输入第一预测模型,以获得每帧的预测声纹信息;根据每帧的预测声纹信息得到预测声纹信息。
在本申请的一些实施例中,基于前述方案,声纹认证装置还包括:第二预测模块,用于获取目标用户在预设时间段前的人脸图像信息;将目标用户预设时间段前的人脸图像、年龄、性别输入第二预测模型获得预测人脸图像信息;采集当前用户的待认证人脸图像信息;将预测人脸图像信息与待认证人脸图像信息进行匹配,以获得第二匹配度;所述匹配 模块404配置为:将第一匹配度和第二匹配度进行加权和计算以获得总匹配度;若总匹配度大于第二预设阈值,则确定当前用户为目标用户。
在本申请的一些实施例中,基于前述方案,匹配模块404配置为:基于人耳耳蜗听觉模型提取目标用户在预设时间段前的声纹信息的听觉倒谱系数特征,将听觉倒谱系数特征输入第一深度神经网络模型得到深度瓶颈特征;将听觉倒谱系数特征和深度瓶颈特征按照公式Y=aG+bB计算,得到目标用户在预设时间段前的声纹信息的融合特征Y,其中,G为听觉倒谱系数特征,B为深度瓶颈特征,系数a和b预先通过以下过程获得:获取目标用户在预设时间段前的声纹信息样本集合,求使语音区分度R取最小值时a与b的值,0≤a≤1,0≤b≤1,a+b=1,
Figure PCTCN2020092943-appb-000002
其中,N为目标用户在预设时间段前的声纹信息样本集合中的声纹数,Y i与Y j分别为基于在声纹信息样本集合中目标用户的第i条语音和第j条语音的听觉倒谱系数特征G和深度瓶颈特征B按照Y=aG+bB得到的融合特征;将待认证声纹信息的融合特征与预测声纹信息的融合特征进行比较,以获得第一匹配度。
在本申请的一些实施例中,基于前述方案,匹配模块404还配置为:将目标用户在预设时间段前的声纹信息的听觉倒谱系数特征输入堆叠降噪自编码网络模型得到目标用户在预设时间段前的声纹信息的迁移特征;将迁移特征输入第二深度神经网络模型得到迁移深度瓶颈特征;将听觉倒谱系数特征和迁移深度瓶颈特征按照公式Y 1=aG+bB 1计算,得到目标用户在预设时间段前的声纹信息的迁移融合特征Y 1,其中,G为听觉倒谱系数特征,B 1为迁移深度瓶颈特征;将待认证声纹信息的迁移融合特征与预测声纹信息的迁移融合特征进行比较,以获得第三匹配度;基于第一匹配度和第三匹配度,判断当前用户是否为目标用户。
所属技术领域的技术人员能够理解,本申请的各个方面可以实现为系统、方法或程序产品。因此,本申请的各个方面可以具体实现为以下形式,即:完全的硬件实施方式、完全的软件实施方式(包括固件、微代码等),或硬件和软件方面结合的实施方式,这里可以统称为“电路”、“模块”或“系统”。
下面参照图5来描述根据本申请的这种实施方式的电子设备50。图5显示的电子设备50仅仅是一个示例,不应对本申请实施例的功能和使用范围带来任何限制。
如图5所示,电子设备50以通用计算设备的形式表现。电子设备50的组件可以包括但不限于:上述至少一个处理单元51、上述至少一个存储单元52、连接不同系统组件(包括存储单元52和处理单元51)的总线53、显示单元54。
其中,存储单元存储有程序代码,程序代码可以被处理单元51执行,使得处理单元51执行本说明书上述“实施例方法”部分中描述的根据本申请各种示例性实施方式的步骤。
存储单元52可以包括易失性存储单元形式的可读介质,例如随机存取存储单元(RAM)521和/或高速缓存存储单元522,还可以进一步包括只读存储单元(ROM)523。
存储单元52还可以包括具有一组(至少一个)程序模块525的程序/实用工具524,这样的程序模块525包括但不限于:操作系统、一个或者多个应用程序、其它程序模块以及程序数据,这些示例中的每一个或某种组合中可能包括网络环境的实现。
总线53可以为表示几类总线结构中的一种或多种,包括存储单元总线或者存储单元控制器、外围总线、图形加速端口、处理单元或者使用多种总线结构中的任意总线结构的局域总线。
电子设备50也可以与一个或多个外部设备(例如键盘、指向设备、蓝牙设备等)通信,还可与一个或者多个使得用户能与该电子设备50交互的设备通信,和/或与使得该电子设备50能与一个或多个其它计算设备进行通信的任何设备(例如路由器、调制解调器等等)通信。这种通信可以通过输入/输出(I/O)接口55进行。并且,电子设备50还可以通过网络适配器56与一个或者多个网络(例如局域网(LAN),广域网(WAN)和/或公共网络,例如因特网)通信。如图所示,网络适配器56通过总线53与电子设备50的其它模块通信。应当明白,尽管图中未示出,可以结合电子设备50使用其它硬件和/或软件模块,包括但不限于:微代码、设备驱动器、冗余处理单元、外部磁盘驱动阵列、RAID系统、磁带驱动器以及数据备份存储系统等。
通过以上的实施方式的描述,本领域的技术人员易于理解,这里描述的示例实施方式可以通过软件实现,也可以通过软件结合必要的硬件的方式来实现。因此,根据本申请实施方式的技术方案可以以软件产品的形式体现出来,该软件产品可以存储在一个非易失性存储介质(可以是CD-ROM,U盘,移动硬盘等)中或网络上,包括若干指令以使得一台计算设备(可以是个人计算机、服务器、终端装置、或者网络设备等)执行根据本申请实施方式的方法。
根据本申请一个实施例,还提供了一种计算机可读存储介质,所述计算机可读存储介质可以是非易失性,也可以是易失性。其上存储有能够实现本说明书上述方法的程序产品。在一些可能的实施方式中,本申请的各个方面还可以实现为一种程序产品的形式,其包括程序代码,当程序产品在终端设备上运行时,程序代码用于使终端设备执行本说明书上述“示例性方法”部分中描述的根据本申请各种示例性实施方式的步骤。
参考图6所示,描述了根据本申请的实施方式的用于实现上述方法的程序产品60,其可以采用便携式紧凑盘只读存储器(CD-ROM)并包括程序代码,并可以在终端设备,例如个人电脑上运行。然而,本申请的程序产品不限于此,在本文件中,可读存储介质可以是任何包含或存储程序的有形介质,该程序可以被指令执行系统、装置或者器件使用或者与其结合使用。
程序产品可以采用一个或多个可读介质的任意组合。可读介质可以是可读信号介质或者可读存储介质。可读存储介质例如可以为但不限于电、磁、光、电磁、红外线、或半导体的系统、装置或器件,或者任意以上的组合。可读存储介质的更具体的例子(非穷举的列表)包括:具有一个或多个导线的电连接、便携式盘、硬盘、随机存取存储器(RAM)、只读存储器(ROM)、可擦式可编程只读存储器(EPROM或闪存)、光纤、便携式紧凑盘只读 存储器(CD-ROM)、光存储器件、磁存储器件、或者上述的任意合适的组合。
计算机可读信号介质可以包括在基带中或者作为载波一部分传播的数据信号,其中承载了可读程序代码。这种传播的数据信号可以采用多种形式,包括但不限于电磁信号、光信号或上述的任意合适的组合。可读信号介质还可以是可读存储介质以外的任何可读介质,该可读介质可以发送、传播或者传输用于由指令执行系统、装置或者器件使用或者与其结合使用的程序。
可读介质上包含的程序代码可以用任何适当的介质传输,包括但不限于无线、有线、光缆、RF等等,或者上述的任意合适的组合。
可以以一种或多种程序设计语言的任意组合来编写用于执行本申请操作的程序代码,程序设计语言包括面向对象的程序设计语言—诸如Java、C++等,还包括常规的过程式程序设计语言—诸如“C”语言或类似的程序设计语言。程序代码可以完全地在用户计算设备上执行、部分地在用户设备上执行、作为一个独立的软件包执行、部分在用户计算设备上部分在远程计算设备上执行、或者完全在远程计算设备或服务器上执行。在涉及远程计算设备的情形中,远程计算设备可以通过任意种类的网络,包括局域网(LAN)或广域网(WAN),连接到用户计算设备,或者,可以连接到外部计算设备(例如利用因特网服务提供商来通过因特网连接)。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (20)

  1. 一种声纹认证方法,其中,包括:
    获取目标用户在预设时间段前的声纹信息、年龄、性别和所处环境;
    将所述目标用户在预设时间段前的声纹信息、年龄、性别和所处环境输入第一预测模型得到预测声纹信息;
    采集当前用户的待认证声纹信息;
    将所述预测声纹信息与所述待认证声纹信息进行匹配,以获得第一匹配度;
    若所述第一匹配度超过第一预设阈值,则将所述当前用户确定为所述目标用户。
  2. 根据权利要求1所述的声纹认证方法,其中,在所述获取目标用户预设时间段前的声纹信息之后,所述方法包括:
    对所述声纹信息进行降噪处理得到纯语音数据;
    对所述纯语音数据进行分帧,基于人耳耳蜗听觉模型提取每帧语音数据中的听觉倒谱系数特征;
    所述将所述目标用户在预设时间段前的声纹信息、年龄、性别和所处环境输入第一预测模型得到预测声纹信息包括:将年龄、性别、所处环境及每帧的听觉倒谱系数特征输入第一预测模型,以获得每帧的预测声纹信息;
    根据所述每帧的预测声纹信息得到所述预测声纹信息。
  3. 根据权利要求1所述的声纹认证方法,其中,所述方法还包括:
    获取所述目标用户在预设时间段前的人脸图像信息;
    将所述目标用户预设时间段前的人脸图像、年龄、性别输入第二预测模型获得预测人脸图像信息;
    采集所述当前用户的待认证人脸图像信息;
    将所述预测人脸图像信息与所述待认证人脸图像信息进行匹配,以获得第二匹配度;
    在所述获得第一匹配度之后,所述方法还包括:
    将所述第一匹配度和所述第二匹配度进行加权和计算以获得总匹配度;若所述总匹配度大于第二预设阈值,则确定所述当前用户为所述目标用户。
  4. 根据权利要求1所述的声纹认证方法,其中,所述将所述预测声纹信息与所述待认证声纹信息进行匹配,以获得第一匹配度,包括:
    基于人耳耳蜗听觉模型提取所述目标用户在预设时间段前的声纹信息的听觉倒谱系数特征,将所述听觉倒谱系数特征输入第一深度神经网络模型得到深度瓶颈特征;
    将所述听觉倒谱系数特征和所述深度瓶颈特征按照公式Y=aG+bB计算,得到所述目标用户在预设时间段前的声纹信息的融合特征Y,其中,G为所述听觉倒谱系数特征,B为所述深度瓶颈特征,系数a和b预先通过以下过程获得:获取所述目标用户在预设时间段前的声纹信息样本集合,求使语音区分度R取最小值时a与b的值,
    0≤a≤1,0≤b≤1,a+b=1,
    Figure PCTCN2020092943-appb-100001
    其中,N为所述目标用户在所述预设时间段前的声纹信息样本集合中的声纹数,Y i与Y j分别为基于在声纹信息样本集合中所述目标用户的第i条语音和第j条语音的听觉倒谱系数特征G和深度瓶颈特征B按照Y=aG+bB得到的所述融合特征;
    将所述待认证声纹信息的融合特征与所述预测声纹信息的融合特征进行比较,以获得第一匹配度。
  5. 根据权利要求4所述的声纹认证方法,其中,所述基于人耳耳蜗听觉模型提取所述预设时间段前的声纹信息的听觉倒谱系数特征之后,所述方法还包括:
    将所述目标用户在预设时间段前的声纹信息的听觉倒谱系数特征输入堆叠降噪自编码网络模型得到所述目标用户在预设时间段前的声纹信息的迁移特征;
    将所述迁移特征输入第二深度神经网络模型得到迁移深度瓶颈特征;
    将所述听觉倒谱系数特征和所述迁移深度瓶颈特征按照公式Y 1=aG+bB 1计算,得到所述目标用户在预设时间段前的声纹信息的迁移融合特征Y1,其中,G为所述听觉倒谱系数特征,B 1为所述迁移深度瓶颈特征;
    将所述待认证声纹信息的迁移融合特征与所述预测声纹信息的迁移融合特征进行比较,以获得第三匹配度;
    基于所述第一匹配度和所述第三匹配度,判断所述当前用户是否为所述目标用户。
  6. 根据权利要求1-5任一项所述的声纹认证方法,其中,所述采集当前用户的待认证声纹信息,包括:
    提取当前用户的待认证的语音的MFCC特征作为所述当前用户的待认证声纹信息。
  7. 根据权利要求1-5任一项所述的声纹认证方法,其中,所述采集当前用户的待认证声纹信息,包括:
    基于Gammatone滤波器组提取当前用户的待认证的语音的听觉倒谱系数特征作为所述当前用户的待认证声纹信息。
  8. 一种声纹认证装置,其中,包括:
    获取模块,用于获取目标用户在预设时间段前的声纹信息、年龄、性别和所处环境;
    第一预测模块,用于将所述目标用户在预设时间段前的声纹信息、年龄、性别和所处环境输入第一预测模型得到预测声纹信息;
    采集模块,用于采集当前用户的待认证声纹信息;
    匹配模块,用于将所述预测声纹信息与所述待认证声纹信息进行匹配,以获得第一匹配度;
    确定模块,若所述第一匹配度超过第一预设阈值,则将所述当前用户确定为所述目标用户。
  9. 一种电子设备,其中,包括存储器和处理器,所述处理器、和所述存储器相互连接,其中,所述存储器用于存储计算机程序,所述计算机程序包括程序指令,所述处理器用于执行所述存储器的所述程序指令,其中:
    获取目标用户在预设时间段前的声纹信息、年龄、性别和所处环境;
    将所述目标用户在预设时间段前的声纹信息、年龄、性别和所处环境输入第一预测模型得到预测声纹信息;
    采集当前用户的待认证声纹信息;
    将所述预测声纹信息与所述待认证声纹信息进行匹配,以获得第一匹配度;
    若所述第一匹配度超过第一预设阈值,则将所述当前用户确定为所述目标用户。
  10. 根据权利要求9所述的电子设备,其中,所述处理器,还用于:
    对所述声纹信息进行降噪处理得到纯语音数据;
    对所述纯语音数据进行分帧,基于人耳耳蜗听觉模型提取每帧语音数据中的听觉倒谱系数特征;
    所述将所述目标用户在预设时间段前的声纹信息、年龄、性别和所处环境输入第一预测模型得到预测声纹信息包括:将年龄、性别、所处环境及每帧的听觉倒谱系数特征输入第一预测模型,以获得每帧的预测声纹信息;
    根据所述每帧的预测声纹信息得到所述预测声纹信息。
  11. 根据权利要求9所述的电子设备,其中,所述处理器,还用于:
    获取所述目标用户在预设时间段前的人脸图像信息;
    将所述目标用户预设时间段前的人脸图像、年龄、性别输入第二预测模型获得预测人脸图像信息;
    采集所述当前用户的待认证人脸图像信息;
    将所述预测人脸图像信息与所述待认证人脸图像信息进行匹配,以获得第二匹配度;
    在所述获得第一匹配度之后,所述方法还包括:
    将所述第一匹配度和所述第二匹配度进行加权和计算以获得总匹配度;若所述总匹配度大于第二预设阈值,则确定所述当前用户为所述目标用户。
  12. 根据权利要求9所述的电子设备,其中,所述处理器,还用于:
    基于人耳耳蜗听觉模型提取所述目标用户在预设时间段前的声纹信息的听觉倒谱系数特征,将所述听觉倒谱系数特征输入第一深度神经网络模型得到深度瓶颈特征;
    将所述听觉倒谱系数特征和所述深度瓶颈特征按照公式Y=aG+bB计算,得到所述目标用户在预设时间段前的声纹信息的融合特征Y,其中,G为所述听觉倒谱系数特征,B为所述深度瓶颈特征,系数a和b预先通过以下过程获得:获取所述目标用户在预设时间段前的声纹信息样本集合,求使语音区分度R取最小值时a与b的值,
    0≤a≤1,0≤b≤1,a+b=1,
    Figure PCTCN2020092943-appb-100002
    其中,N为所述目标用户在所述预设时间
    段前的声纹信息样本集合中的声纹数,Y i与Y j分别为基于在声纹信息样本集合中所述目标用户的第i条语音和第j条语音的听觉倒谱系数特征G和深度瓶颈特征B按照Y=aG+bB得到的所述融合特征;
    将所述待认证声纹信息的融合特征与所述预测声纹信息的融合特征进行比较,以获得第一匹配度。
  13. 根据权利要求12所述的电子设备,其中,所述处理器,还用于:
    将所述目标用户在预设时间段前的声纹信息的听觉倒谱系数特征输入堆叠降噪自编码网络模型得到所述目标用户在预设时间段前的声纹信息的迁移特征;
    将所述迁移特征输入第二深度神经网络模型得到迁移深度瓶颈特征;
    将所述听觉倒谱系数特征和所述迁移深度瓶颈特征按照公式Y 1=aG+bB 1计算,得到所述目标用户在预设时间段前的声纹信息的迁移融合特征Y 1,其中,G为所述听觉倒谱系数特征,B1为所述迁移深度瓶颈特征;
    将所述待认证声纹信息的迁移融合特征与所述预测声纹信息的迁移融合特征进行比较,以获得第三匹配度;
    基于所述第一匹配度和所述第三匹配度,判断所述当前用户是否为所述目标用户。
  14. 根据权利要求9-13任一项所述的电子设备,其中,所述处理器,还用于:
    提取当前用户的待认证的语音的MFCC特征作为所述当前用户的待认证声纹信息。
  15. 根据权利要求9-13任一项所述的电子设备,其中,所述处理器,还用于:
    基于Gammatone滤波器组提取当前用户的待认证的语音的听觉倒谱系数特征作为所述当前用户的待认证声纹信息。
  16. 一种计算机可读存储介质,其中,所述计算机可读存储介质存储有计算机程序,所述计算机程序包括程序指令,所述程序指令被处理器执行时,用于实现以下步骤:
    获取目标用户在预设时间段前的声纹信息、年龄、性别和所处环境;
    将所述目标用户在预设时间段前的声纹信息、年龄、性别和所处环境输入第一预测模型得到预测声纹信息;
    采集当前用户的待认证声纹信息;
    将所述预测声纹信息与所述待认证声纹信息进行匹配,以获得第一匹配度;
    若所述第一匹配度超过第一预设阈值,则将所述当前用户确定为所述目标用户。
  17. 根据权利要求16所述的计算机可读存储介质,其中,所述程序指令被处理器执行时,还用于实现以下步骤:
    对所述声纹信息进行降噪处理得到纯语音数据;
    对所述纯语音数据进行分帧,基于人耳耳蜗听觉模型提取每帧语音数据中的听觉倒谱系数特征;
    所述将所述目标用户在预设时间段前的声纹信息、年龄、性别和所处环境输入第一预测模型得到预测声纹信息包括:将年龄、性别、所处环境及每帧的听觉倒谱系数特征输入第一预测模型,以获得每帧的预测声纹信息;
    根据所述每帧的预测声纹信息得到所述预测声纹信息。
  18. 根据权利要求16所述的计算机可读存储介质,其中,所述程序指令被处理器执行时,还用于实现以下步骤:
    获取所述目标用户在预设时间段前的人脸图像信息;
    将所述目标用户预设时间段前的人脸图像、年龄、性别输入第二预测模型获得预测人脸图像信息;
    采集所述当前用户的待认证人脸图像信息;
    将所述预测人脸图像信息与所述待认证人脸图像信息进行匹配,以获得第二匹配度;
    在所述获得第一匹配度之后,所述方法还包括:
    将所述第一匹配度和所述第二匹配度进行加权和计算以获得总匹配度;若所述总匹配度大于第二预设阈值,则确定所述当前用户为所述目标用户。
  19. 根据权利要求16所述的计算机可读存储介质,其中,所述程序指令被处理器执行时,还用于实现以下步骤:
    基于人耳耳蜗听觉模型提取所述目标用户在预设时间段前的声纹信息的听觉倒谱系数特征,将所述听觉倒谱系数特征输入第一深度神经网络模型得到深度瓶颈特征;
    将所述听觉倒谱系数特征和所述深度瓶颈特征按照公式Y=aG+bB计算,得到所述目标用户在预设时间段前的声纹信息的融合特征Y,其中,G为所述听觉倒谱系数特征,B为所述深度瓶颈特征,系数a和b预先通过以下过程获得:获取所述目标用户在预设时间段前的声纹信息样本集合,求使语音区分度R取最小值时a与b的值,
    0≤a≤1,0≤b≤1,a+b=1,
    Figure PCTCN2020092943-appb-100003
    其中,N为所述目标用户在所述预设时间段前的声纹信息样本集合中的声纹数,Y i与Y j分别为基于在声纹信息样本集合中所述目标用户的第i条语音和第j条语音的听觉倒谱系数特征G和深度瓶颈特征B按照Y=aG+bB得到的所述融合特征;
    将所述待认证声纹信息的融合特征与所述预测声纹信息的融合特征进行比较,以获得第一匹配度。
  20. 根据权利要求19所述的计算机可读存储介质,其中,所述程序指令被处理器执行时,还用于实现以下步骤:
    将所述目标用户在预设时间段前的声纹信息的听觉倒谱系数特征输入堆叠降噪自编码网络模型得到所述目标用户在预设时间段前的声纹信息的迁移特征;
    将所述迁移特征输入第二深度神经网络模型得到迁移深度瓶颈特征;
    将所述听觉倒谱系数特征和所述迁移深度瓶颈特征按照公式Y 1=aG+bB 1计算,得到所述目标用户在预设时间段前的声纹信息的迁移融合特征Y 1,其中,G为所述听觉倒谱系数特征,B 1为所述迁移深度瓶颈特征;
    将所述待认证声纹信息的迁移融合特征与所述预测声纹信息的迁移融合特征进行比较,以获得第三匹配度;
    基于所述第一匹配度和所述第三匹配度,判断所述当前用户是否为所述目标用户。
PCT/CN2020/092943 2019-11-01 2020-05-28 声纹认证方法、装置、介质及电子设备 WO2021082420A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911059843.8A CN110956966B (zh) 2019-11-01 2019-11-01 声纹认证方法、装置、介质及电子设备
CN201911059843.8 2019-11-01

Publications (1)

Publication Number Publication Date
WO2021082420A1 true WO2021082420A1 (zh) 2021-05-06

Family

ID=69976610

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/092943 WO2021082420A1 (zh) 2019-11-01 2020-05-28 声纹认证方法、装置、介质及电子设备

Country Status (2)

Country Link
CN (1) CN110956966B (zh)
WO (1) WO2021082420A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565814A (zh) * 2022-02-25 2022-05-31 平安国际智慧城市科技股份有限公司 一种特征检测方法、装置及终端设备

Families Citing this family (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110956966B (zh) * 2019-11-01 2023-09-19 平安科技(深圳)有限公司 声纹认证方法、装置、介质及电子设备
CN111444377A (zh) * 2020-04-15 2020-07-24 厦门快商通科技股份有限公司 一种声纹识别的认证方法和装置以及设备
CN111444375A (zh) * 2020-04-15 2020-07-24 厦门快商通科技股份有限公司 一种声纹识别的验证方法和装置以及设备
CN111444376A (zh) * 2020-04-15 2020-07-24 厦门快商通科技股份有限公司 一种音频指纹的识别方法和装置以及设备
CN111326163B (zh) * 2020-04-15 2023-02-14 厦门快商通科技股份有限公司 一种声纹识别方法和装置以及设备
CN111581426A (zh) * 2020-04-30 2020-08-25 厦门快商通科技股份有限公司 一种音频指纹匹配方法和装置以及设备
CN112330897B (zh) * 2020-08-19 2023-07-25 深圳Tcl新技术有限公司 用户语音对应性别改变方法、装置、智能门铃及存储介质
CN112002346A (zh) * 2020-08-20 2020-11-27 深圳市卡牛科技有限公司 基于语音的性别年龄识别方法、装置、设备和存储介质
CN112562691B (zh) * 2020-11-27 2024-07-02 平安科技(深圳)有限公司 一种声纹识别的方法、装置、计算机设备及存储介质
US11735158B1 (en) * 2021-08-11 2023-08-22 Electronic Arts Inc. Voice aging using machine learning

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106782564A (zh) * 2016-11-18 2017-05-31 百度在线网络技术(北京)有限公司 用于处理语音数据的方法和装置
CN107665295A (zh) * 2016-07-29 2018-02-06 长城汽车股份有限公司 车辆的身份认证方法、系统及车辆
CN108288470A (zh) * 2017-01-10 2018-07-17 富士通株式会社 基于声纹的身份验证方法和装置
US10074089B1 (en) * 2012-03-01 2018-09-11 Citigroup Technology, Inc. Smart authentication and identification via voiceprints
CN110956966A (zh) * 2019-11-01 2020-04-03 平安科技(深圳)有限公司 声纹认证方法、装置、介质及电子设备

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105656887A (zh) * 2015-12-30 2016-06-08 百度在线网络技术(北京)有限公司 基于人工智能的声纹认证方法以及装置
CN105513597B (zh) * 2015-12-30 2018-07-10 百度在线网络技术(北京)有限公司 声纹认证处理方法及装置
CN109473105A (zh) * 2018-10-26 2019-03-15 平安科技(深圳)有限公司 与文本无关的声纹验证方法、装置和计算机设备
CN110265040B (zh) * 2019-06-20 2022-05-17 Oppo广东移动通信有限公司 声纹模型的训练方法、装置、存储介质及电子设备

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10074089B1 (en) * 2012-03-01 2018-09-11 Citigroup Technology, Inc. Smart authentication and identification via voiceprints
CN107665295A (zh) * 2016-07-29 2018-02-06 长城汽车股份有限公司 车辆的身份认证方法、系统及车辆
CN106782564A (zh) * 2016-11-18 2017-05-31 百度在线网络技术(北京)有限公司 用于处理语音数据的方法和装置
CN108288470A (zh) * 2017-01-10 2018-07-17 富士通株式会社 基于声纹的身份验证方法和装置
CN110956966A (zh) * 2019-11-01 2020-04-03 平安科技(深圳)有限公司 声纹认证方法、装置、介质及电子设备

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114565814A (zh) * 2022-02-25 2022-05-31 平安国际智慧城市科技股份有限公司 一种特征检测方法、装置及终端设备

Also Published As

Publication number Publication date
CN110956966B (zh) 2023-09-19
CN110956966A (zh) 2020-04-03

Similar Documents

Publication Publication Date Title
WO2021082420A1 (zh) 声纹认证方法、装置、介质及电子设备
EP3806089B1 (en) Mixed speech recognition method and apparatus, and computer readable storage medium
CN112562691B (zh) 一种声纹识别的方法、装置、计算机设备及存储介质
WO2021208287A1 (zh) 用于情绪识别的语音端点检测方法、装置、电子设备及存储介质
JP6429945B2 (ja) 音声データを処理するための方法及び装置
CN112259106B (zh) 声纹识别方法、装置、存储介质及计算机设备
WO2021135438A1 (zh) 多语种语音识别模型训练方法、装置、设备及存储介质
WO2018107810A1 (zh) 声纹识别方法、装置、电子设备及介质
JP2021527840A (ja) 声紋識別方法、モデルトレーニング方法、サーバ、及びコンピュータプログラム
CN110826466A (zh) 基于lstm音像融合的情感识别方法、装置及存储介质
JP2021500616A (ja) オブジェクト識別の方法及びその、コンピュータ装置並びにコンピュータ装置可読記憶媒体
CN107180628A (zh) 建立声学特征提取模型的方法、提取声学特征的方法、装置
WO2022178942A1 (zh) 情绪识别方法、装置、计算机设备和存储介质
WO2021051608A1 (zh) 一种基于深度学习的声纹识别方法、装置及设备
Tao et al. End-to-end audiovisual speech activity detection with bimodal recurrent neural models
US9947323B2 (en) Synthetic oversampling to enhance speaker identification or verification
JP7268711B2 (ja) 信号処理システム、信号処理装置、信号処理方法、およびプログラム
WO2020140609A1 (zh) 一种语音识别方法、设备及计算机可读存储介质
TW202213326A (zh) 用於說話者驗證的廣義化負對數似然損失
Ding et al. Enhancing GMM speaker identification by incorporating SVM speaker verification for intelligent web-based speech applications
CN109688271A (zh) 联系人信息输入的方法、装置及终端设备
CN118173094A (zh) 结合动态时间规整的唤醒词识别方法、装置、设备及介质
JP2020173381A (ja) 話者認識方法、話者認識装置、話者認識プログラム、データベース作成方法、データベース作成装置、及びデータベース作成プログラム
CN112466284B (zh) 一种口罩语音鉴别方法
GB2576960A (en) Speaker recognition

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20883305

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20883305

Country of ref document: EP

Kind code of ref document: A1