WO2021103913A1 - Voice anti-counterfeiting method, device, terminal device and storage medium - Google Patents

Voice anti-counterfeiting method, device, terminal device and storage medium

Info

Publication number
WO2021103913A1
Authority
WO
WIPO (PCT)
Prior art keywords
voice
recognition model
preset
templates
illegal
Prior art date
Application number
PCT/CN2020/124766
Other languages
English (en)
French (fr)
Inventor
周皓隽
谢妍辉
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2021103913A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/22: Interactive procedures; Man-machine interfaces
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00: Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/30: Authentication, i.e. establishing the identity or authorisation of security principals
    • G06F21/31: User authentication
    • G06F21/32: User authentication using biometric data, e.g. fingerprints, iris scans or voiceprints
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/02: Preprocessing operations, e.g. segment selection; Pattern representation or modelling, e.g. based on linear discriminant analysis [LDA] or principal components; Feature selection or extraction
    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L17/00: Speaker identification or verification techniques
    • G10L17/04: Training, enrolment or model building

Definitions

  • This application relates to the field of voice processing technology, and in particular to a voice anti-counterfeiting method, device, terminal device, and storage medium based on artificial intelligence (AI).
  • a replay attack means that an attacker uses a recording device to secretly record a user's voice signal for voice recognition, and plays the recorded voice through the playback device, thereby passing the verification of voiceprint recognition.
  • the popularity of high-fidelity recording equipment makes it very easy for attackers to successfully secretly record user voices.
  • the embodiments of the present application provide a voice anti-counterfeiting method, device, terminal device, and storage medium, which can solve the problem of recognizing legal voice as illegal voice in the existing voice anti-counterfeiting.
  • an embodiment of the present application provides a voice anti-counterfeiting method, including:
  • each voice template is generated based on the illegal voice data recognized each time by the preset voice recognition model;
  • the recognition parameters of the preset speech recognition model are adjusted or the current speech recognition model is switched to another speech recognition model.
  • The voice anti-counterfeiting method provided by the embodiments of the application generates and stores a voice template from each piece of illegal voice data recognized by the preset voice recognition model, and calculates the similarity between the stored voice templates. If the similarity calculation result meets the preset condition, the recognition parameters of the preset speech recognition model are adjusted or the current speech recognition model is switched to another speech recognition model. This avoids speech recognition errors caused by an inaccurate current speech recognition model, reduces the probability of misjudging legal speech as illegal speech, improves the accuracy of speech recognition, and improves the user experience.
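  • As an illustration of the flow just summarized, the following Python sketch shows one possible control loop. Every name, threshold value, and helper callable here is a hypothetical placeholder rather than something prescribed by the application.

```python
# Minimal sketch of the anti-counterfeiting flow described above. The helper
# callables and the concrete default values are assumptions, not patent text.

def anti_spoof_step(voice_data, model, templates, recognize, make_template,
                    similarity, adapt_model, illegal_streak=2, first_threshold=0.6):
    """Process one utterance and return (result, possibly adapted model)."""
    if recognize(model, voice_data) == "legal":
        templates.clear()                              # reset the illegal-voice streak
        return "legal", model

    templates.append(make_template(voice_data))        # one template per illegal result

    if len(templates) >= illegal_streak:
        # Low similarity between consecutive "illegal" templates suggests the
        # utterances may be genuine speech, so adjust parameters or switch
        # models and recognize the latest utterance again.
        if similarity(templates[-2], templates[-1]) < first_threshold:
            model = adapt_model(model)
            return recognize(model, voice_data), model

    return "illegal", model
```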
  • the performing similarity calculation on the stored voice templates includes:
  • if the voice recognition model recognizes illegal voice data twice in a row, similarity calculation is performed on the two stored voice templates, where the two voice templates are generated based on the illegal voice data recognized twice in a row, and each voice template corresponds to one piece of illegal voice data;
  • adjusting the recognition parameters of the speech recognition model or switching the current speech recognition model to another speech recognition model includes:
  • if the similarity of the two voice templates is less than a first threshold, the recognition parameters of the speech recognition model are adjusted or the current speech recognition model is switched to another speech recognition model.
  • The voice anti-counterfeiting method provided by the embodiments of the application performs similarity calculation on the two stored voice templates when voice data is recognized as illegal voice data twice in a row. If the similarity of the two voice templates is less than the first threshold, the recognition parameters of the speech recognition model are adjusted or the current speech recognition model is switched to another speech recognition model. Voice data being recognized as illegal twice in a row indicates that legitimate voice may have been recognized as illegal voice; adjusting the recognition parameters or switching the voice recognition model at this point yields a more suitable recognition model and reduces the probability of recognizing legitimate voice as illegal voice.
  • the performing similarity calculation on the stored voice templates includes:
  • pairwise similarity calculation is performed on the stored voice templates;
  • adjusting the recognition parameters of the speech recognition model or switching the current speech recognition model to another speech recognition model includes:
  • if the number of similar speech templates, determined according to the similarity between every two speech templates, is less than the third threshold and/or the percentage of similar speech templates in all similarity calculations is less than the fourth threshold, the recognition parameters of the speech recognition model are adjusted or the current speech recognition model is switched to another speech recognition model.
  • the number of stored voice templates is counted.
  • If the number of voice templates in the preset period reaches a certain number, it indicates that the frequency of recognizing voice data as illegal voice data is high, and similarity calculation is performed on the voice templates. If the number of similar voice templates is less than a certain number, it indicates that the similarity between the voice templates is not high and the input voice data may be legitimate voice data; the recognition parameters of the voice recognition model are adjusted or the current voice recognition model is switched to another voice recognition model, preventing legitimate voice data from being recognized as illegal voice data multiple times within a period of time.
  • before the similarity calculation is performed on the stored voice templates, the method further includes:
  • if the recognition result is illegal voice data,
  • a voice template corresponding to the illegal voice data is generated, and the voice template is stored.
  • The voice template is obtained by feature extraction from the voice data, so the similarity between the voice templates can accurately reflect whether the voice data is replayed voice.
  • the generating a voice template corresponding to the illegal voice data includes:
  • converting the voice data into a frequency spectrum, generating a two-dimensional matrix according to the frequency spectrum, and generating the voice template according to the two-dimensional matrix. Compared with calculating the similarity between voice data directly from the voice frequency spectrum, calculating the similarity through a two-dimensional matrix increases the calculation speed.
  • the generating the voice template according to the two-dimensional matrix includes:
  • the elements in the normalized two-dimensional matrix that are greater than the energy threshold are set to the first preset value, the elements that are less than or equal to the energy threshold are set to the second preset value, and the two-dimensional matrix after this setting is used as the voice template.
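  • A small, hedged example of this normalize-and-threshold step follows; the energy threshold and the two preset values (1 and 0) are illustrative choices, not values taken from the application.

```python
import numpy as np

def make_binary_template(energy, energy_threshold=0.5):
    """Normalize a frames-by-bands energy matrix and binarize it.

    Elements above the threshold become the first preset value (1),
    all other elements become the second preset value (0).
    """
    peak = np.max(np.abs(energy))
    normalized = energy / peak if peak > 0 else energy
    return (normalized > energy_threshold).astype(np.uint8)
```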
  • the performing similarity calculation on the stored voice templates includes:
  • the similarity between every two speech templates is determined according to the number of matches of the first preset value.
  • the corresponding elements in the two speech templates are compared one by one, and the number of matches of the first preset value is calculated. The greater the number of matches, the higher the similarity.
  • the method further includes:
  • the speech data is re-recognized by adopting the speech recognition model after the adjustment of the recognition parameters or the new speech recognition model after switching.
  • When speech data is recognized as illegal speech, the speech data can be re-recognized after adjusting the recognition parameters or switching models, so that an appropriate speech recognition model or appropriate parameters are used to recognize the voice data, reducing the probability of misjudging legal voice as illegal voice.
  • the adjusting the recognition parameters of the preset speech recognition model includes:
  • the method further includes:
  • an embodiment of the present application provides a voice anti-counterfeiting device, including:
  • the calculation module is used to calculate the similarity of the stored voice templates; wherein, each voice template is generated according to the illegal voice data recognized each time by the preset voice recognition model;
  • the adjustment module is configured to adjust the recognition parameters of the preset speech recognition model or switch the current speech recognition model to another speech recognition model if the similarity calculation result meets the preset condition.
  • the calculation module is specifically configured to:
  • if the voice recognition model recognizes illegal voice data twice in a row, similarity calculation is performed on the two stored voice templates, where the two voice templates are generated based on the illegal voice data recognized twice in a row, and each voice template corresponds to one piece of illegal voice data;
  • the adjustment module is specifically used for:
  • if the similarity of the two voice templates is less than a first threshold, the recognition parameters of the speech recognition model are adjusted or the current speech recognition model is switched to another speech recognition model.
  • the calculation module is specifically configured to:
  • pairwise similarity calculation is performed on the stored voice templates;
  • the adjustment module is specifically used for:
  • if the number of similar speech templates, determined according to the similarity between every two speech templates, is less than the third threshold and/or the percentage of similar speech templates in all similarity calculations is less than the fourth threshold, the recognition parameters of the speech recognition model are adjusted or the current speech recognition model is switched to another speech recognition model.
  • the voice anti-counterfeiting device further includes:
  • the acquisition module is used to acquire voice data
  • a recognition module configured to recognize the voice data using the preset voice recognition model
  • the template generating module is configured to generate a voice template corresponding to the illegal voice data if the recognition result is illegal voice data, and store the voice template.
  • the template generation module is specifically configured to:
  • the voice template is generated according to the two-dimensional matrix.
  • the template generation module is further configured to:
  • the elements in the normalized two-dimensional matrix that are greater than the energy threshold are set to the first preset value, the elements that are less than or equal to the energy threshold are set to the second preset value, and the two-dimensional matrix after this setting is used as the voice template.
  • the calculation module is further configured to:
  • the similarity between every two speech templates is determined according to the number of matches of the first preset value.
  • the adjustment module is further configured to:
  • the speech data is re-recognized by adopting the speech recognition model after the adjustment of the recognition parameters or the new speech recognition model after switching.
  • the adjustment module is further configured to:
  • the voice anti-counterfeiting device further includes:
  • an embodiment of the present application provides a terminal device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the computer program, the voice anti-counterfeiting method described in any one of the first aspects is implemented.
  • an embodiment of the present application provides a computer-readable storage medium that stores a computer program which, when executed by a processor, implements the voice anti-counterfeiting method described in any one of the above-mentioned first aspects.
  • the embodiments of the present application provide a computer program product, which when the computer program product runs on a terminal device, causes the terminal device to execute the voice anti-counterfeiting method described in any one of the above-mentioned first aspects.
  • FIG. 1 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • Figure 2 is a schematic diagram of an application scenario provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of another application scenario provided by an embodiment of this application.
  • Figure 4 is a schematic diagram of an application scenario provided by an embodiment of the application.
  • FIG. 5 is a schematic diagram of an application scenario provided by another embodiment of this application.
  • FIG. 6 is a schematic diagram of an application scenario provided by another embodiment of this application.
  • FIG. 7 is a schematic diagram of an application scenario provided by another embodiment of this application.
  • FIG. 8 is a schematic diagram of an application scenario provided by another embodiment of this application.
  • FIG. 9 is a schematic flowchart of a voice anti-counterfeiting method provided by the first embodiment of this application.
  • FIG. 10 is a schematic flowchart of a voice anti-counterfeiting method provided by the second embodiment of this application.
  • FIG. 11 is a schematic flowchart of a voice anti-counterfeiting method provided by the third embodiment of this application.
  • FIG. 12 is a schematic flowchart of a voice anti-counterfeiting method provided by the fourth embodiment of this application.
  • FIG. 13 is a schematic flowchart of a voice anti-counterfeiting method provided by the fifth embodiment of this application.
  • FIG. 14 is a schematic structural diagram of a voice anti-counterfeiting device provided by an embodiment of the present application.
  • the term “if” can be construed as “when”, “once”, “in response to determining”, or “in response to detecting”.
  • the phrase “if determined” or “if [the described condition or event] is detected” can be interpreted, depending on the context, as “once determined”, “in response to determining”, “once [the described condition or event] is detected”, or “in response to detecting [the described condition or event]”.
  • the voice anti-counterfeiting method provided by the embodiment of the application is applied to terminal equipment.
  • The terminal device can be any device with a voice interaction function, including but not limited to smartphones, smart speakers, smart home appliances, tablets, wearable devices, vehicle-mounted devices, augmented reality (AR)/virtual reality (VR) devices, laptops, ultra-mobile personal computers (UMPC), netbooks, personal digital assistants (PDA), etc. The embodiments of this application do not impose any restrictions on the specific type of the terminal device.
  • Figure 1 shows a schematic structural diagram of a terminal device.
  • the terminal device includes: a processor 110, a memory 120, an input unit 130, a display unit 140, a sensor 150, an audio circuit 160, a wireless fidelity (WiFi) module 170, and a power supply 180.
  • The processor 110 may be a central processing unit (CPU), and may also be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • the memory 120 may be used to store software programs and modules.
  • the processor 110 executes various functional applications and data processing of the terminal device by running the software programs and modules stored in the memory 120.
  • the memory 120 may mainly include a program storage area and a data storage area.
  • the program storage area may store an operating system, an application program required by at least one function (such as a sound playback function, an image playback function, etc.), etc.;
  • the data storage area may store data created by the use of the terminal device (such as audio data, a phone book, etc.), and the like.
  • the memory 120 may include a high-speed random access memory, and may also include a non-volatile memory, such as at least one magnetic disk storage device, a flash memory device, or another non-volatile solid-state storage device.
  • the input unit 130 may be used to receive inputted number or character information, and generate key signal input related to user settings and function control of the terminal device.
  • the input unit 130 may include a touch panel 131 and other input devices 132.
  • The touch panel 131, also called a touch screen, can collect the user's touch operations on or near it (for example, operations performed by the user on or near the touch panel 131 with a finger, a stylus, or any other suitable object or accessory), and drive the corresponding connection device according to a preset program.
  • the touch panel 131 may include two parts: a touch detection device and a touch controller.
  • The touch detection device detects the user's touch position and the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, sends them to the processor 110, and can receive and execute commands sent by the processor 110.
  • the touch panel 131 can be implemented in multiple types such as resistive, capacitive, infrared, and surface acoustic wave.
  • the input unit 130 may also include other input devices 132.
  • the other input device 132 may include, but is not limited to, one or more of a physical keyboard, function keys (such as volume control buttons, switch buttons, etc.), trackball, mouse, and joystick.
  • the display unit 140 may be used to display information input by the user or information provided to the user and various menus of the mobile phone.
  • the display unit 140 may include a display panel 141.
  • the display panel 141 may be configured in the form of a liquid crystal display (LCD), an organic light-emitting diode (OLED), etc.
  • The touch panel 131 can cover the display panel 141. When the touch panel 131 detects a touch operation on or near it, it transmits the operation to the processor 110 to determine the type of the touch event, and the processor 110 then provides the corresponding visual output on the display panel 141 according to the type of the touch event.
  • In FIG. 1, the touch panel 131 and the display panel 141 are used as two independent components to implement the input and output functions of the mobile phone, but in some embodiments, the touch panel 131 and the display panel 141 can be integrated to realize the input and output functions of the mobile phone.
  • the terminal device may also include at least one sensor 150, such as a light sensor, a motion sensor, and other sensors.
  • the light sensor can include an ambient light sensor and a proximity sensor.
  • the ambient light sensor can adjust the brightness of the display panel 141 according to the brightness of the ambient light.
  • the proximity sensor can turn off the display panel 141 and/or the backlight when the mobile phone is moved to the ear.
  • As a kind of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in various directions (usually three axes), can detect the magnitude and direction of gravity when stationary, and can be used to identify the posture of the terminal device (such as horizontal/vertical screen switching, related games, magnetometer posture calibration) and for vibration-recognition-related functions (such as pedometer, tapping); other sensors that can also be configured in mobile phones, such as gyroscopes, barometers, hygrometers, thermometers, and infrared sensors, are not described here.
  • the audio circuit 160, the speaker 161, and the microphone 162 can provide an audio interface between the user and the terminal device.
  • The audio circuit 160 can transmit the electrical signal converted from the received audio data to the speaker 161, which converts it into a sound signal for output; on the other hand, the microphone 162 converts the collected sound signal into an electrical signal, which the audio circuit 160 receives and converts into audio data; the audio data is then processed by the processor 110 and, for example, output to the memory 120 for further processing.
  • WiFi is a short-distance wireless transmission technology.
  • Through the WiFi module 170, the terminal device can help users send and receive e-mails, browse web pages, and access streaming media; it provides users with wireless broadband Internet access.
  • Although FIG. 1 shows the WiFi module 170, it is understandable that it is not an essential component of the terminal device and can be omitted as needed without changing the essence of the invention.
  • the terminal device also includes a power supply 180 (such as a battery) for supplying power to various components.
  • the power supply may be logically connected to the processor 110 through a power management system, so that functions such as charging, discharging, and power management are realized through the power management system.
  • the terminal device may also include a camera.
  • the position of the camera on the terminal device may be front-mounted or rear-mounted, which is not limited in the embodiment of the present application.
  • the terminal device may also include a Bluetooth module, etc., which will not be repeated here.
  • a terminal device with a voice interaction function obtains the voice uttered by the user, processes the voice uttered by the user, and determines whether the voice uttered by the user is a legitimate voice, that is, whether it is a real voice. If the voice uttered by the user is a legitimate voice, the corresponding operation is performed according to the voice input by the user, and if the voice uttered by the user is an illegal voice, a prompt of the illegal voice is issued.
  • the terminal device is a smart phone
  • the user sends a voice "WeChat payment” to the smart phone
  • the microphone of the smart phone collects the voice of the user.
  • the processor processes the voice uttered by the user and judges whether the voice uttered by the user is legal.
  • If the voice uttered by the user is a legal voice, the corresponding operation is performed, for example, the WeChat payment interface is opened and the corresponding payment QR code is displayed on the display panel.
  • If the voice uttered by the user is an illegal voice, a voice prompt of “illegal voice” is issued through the speaker, thereby realizing a human-machine dialogue between the user and the smart phone.
  • the terminal device is a smart speaker.
  • the user sends a voice "Call Xiao Ming" to the smart speaker.
  • The smart speaker’s microphone collects the user’s voice, and the processor processes the voice uttered by the user and determines whether the voice uttered by the user is legal.
  • If the voice uttered by the user is a legal voice, the smart speaker performs the corresponding operation, for example, instructing, through the WiFi module or a Bluetooth module, a mobile phone paired with the smart speaker to dial the corresponding number.
  • If the voice uttered by the user is an illegal voice, a voice prompt of “illegal voice” is issued through the speaker, thereby realizing a human-machine dialogue between the user and the smart speaker.
  • the terminal device can also be a server.
  • the user sends a voice to the smart phone
  • the smart phone transmits the voice to the server
  • The server processes the voice sent by the user and determines whether the voice uttered by the user is legal. If the voice uttered by the user is a legal voice, the server instructs the smart phone to perform the corresponding operation; if the user's voice is an illegal voice, the server instructs the smart phone to issue a voice prompt of “illegal voice”, thereby realizing a human-machine dialogue between the user and the server.
  • the embodiments of the present application provide a voice anti-counterfeiting method.
  • When the terminal device determines that the user's voice is illegal, it calculates the similarity of the stored voice templates. If the result of the similarity calculation meets the preset condition, the recognition parameters of the preset speech recognition model are adjusted or the current speech recognition model is switched to another speech recognition model, and the user's voice that was recognized as illegal is re-recognized, so as to reduce the probability of recognizing legal speech as illegal speech.
  • the terminal device may continuously recognize the user's legitimate voice as illegal voice.
  • the smart phone shown in Figure 4 recognizes the user's legitimate voice as an illegal voice multiple times in succession.
  • the following uses this scenario as an example to describe the voice anti-counterfeiting method provided in the embodiment of the present application.
  • the voice anti-counterfeiting method provided by the first embodiment of the present application includes:
  • the terminal device first obtains the authentication information input by the user, for example, obtains face information through a camera, or obtains fingerprint information, digital password, pattern password, etc. through the input unit, and determines the authentication information input by the user Whether the information matches the unlocking information stored on the terminal device, if it matches, the authentication is successful. If the authentication is successful, the terminal device starts a voice recognition application, such as a voice assistant or voice dialogue software. As shown in Figure 2, if the user utters a voice, the microphone collects the voice uttered by the user.
  • Voice data can also be directly used as authentication information. After the terminal device obtains the voice data, it first determines whether the feature information of the voice data input by the user matches the feature information of the voice data stored on the terminal device; if it matches, the authentication is successful. If the authentication is successful, the corresponding operation is further performed according to the voice data.
  • S102 Recognize the voice data using a preset voice recognition model, and determine whether the recognition result is a legal voice.
  • the preset voice recognition model is obtained by training the classification model by using machine learning or deep learning algorithms based on the collected user voice and the replayed voice.
  • the replayed voice can be one or more of recording, synthesized voice, and imitated voice.
  • the terminal device obtains the real voices and corresponding replayed voices of the same group of people as training samples, marks the real voices as legitimate voices, and marks the replayed voices as illegal voices.
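  • The application does not specify the classification model or the acoustic features. Purely to illustrate the labelling scheme just described (real voice marked legal, replayed voice marked illegal), the hedged sketch below trains a generic scikit-learn logistic regression on placeholder feature vectors; it is a stand-in, not the patent's model.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
real_features = rng.random((200, 40))     # placeholder features of real voices    -> legal (1)
replay_features = rng.random((200, 40))   # placeholder features of replayed voices -> illegal (0)

X = np.vstack([real_features, replay_features])
y = np.concatenate([np.ones(200), np.zeros(200)])

# Stand-in for the unspecified machine-learning / deep-learning classification model.
model = LogisticRegression(max_iter=1000).fit(X, y)

# The trained model outputs the probability that an utterance is legal voice.
prob_legal = model.predict_proba(X[:1])[0, 1]
```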
  • The terminal device inputs the voice data into the preset voice recognition model, which outputs the probability that the voice data is a legal voice, and determines whether the voice data is a legal voice based on that probability.
  • The confidence threshold of the preset speech recognition model is first set as a basis for judging whether the speech data is legal or illegal speech. For example, collect the real voices and corresponding replayed voices of a group of people as test samples, mark the real voices as legitimate voices, and mark the replayed voices as illegal voices.
  • the test sample is input into the preset speech recognition model, the output probability of the preset speech recognition model is compared with the corresponding mark, the comparison result is counted, and the confidence threshold is generated according to the statistical result.
  • For example, the confidence threshold is set to 0.5.
  • The voice data is input into the preset voice recognition model. If the output probability of the preset voice recognition model is greater than or equal to the confidence threshold, the corresponding voice data is a legal voice; if the output probability is less than the confidence threshold, the corresponding voice data is an illegal voice.
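  • In code, the decision rule just described reduces to a single comparison; the function name and the default value of 0.5 (taken from the example above) are assumptions.

```python
def classify_voice(prob_legal, confidence_threshold=0.5):
    """Return 'legal' if the model's output probability reaches the confidence threshold."""
    return "legal" if prob_legal >= confidence_threshold else "illegal"
```

  • For instance, classify_voice(0.62) returns 'legal' with the default threshold, while classify_voice(0.31) returns 'illegal'.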
  • the processor parses out the text information corresponding to the voice data according to the preset voice analysis model, and executes the corresponding operation according to the instructions corresponding to the text information. For example, as shown in Fig. 3, if the parsed voice data is WeChat payment, the WeChat payment interface is displayed.
  • the voice template is the data after feature extraction of the voice data input by the user, which can reflect the main feature information of the input voice data.
  • the audio circuit converts the voice data collected by the microphone into a voice signal, and sends it to the processor, and the processor converts the voice signal into a frequency spectrum.
  • the voice data is divided into several voice frames, and there is no overlap between every two voice frames.
  • Each voice frame includes several frequency bands, and the amplitude value of each frequency band, namely energy, is extracted from the frequency spectrum.
  • a two-dimensional array E(n, m) is used to represent the amplitude value of the nth speech frame in the frequency band m, so that a two-dimensional matrix corresponding to the frequency spectrum can be generated.
  • a spectrum map corresponding to the voice data is generated.
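  • A hedged sketch of building the two-dimensional matrix E(n, m): the frame length is an illustrative choice, and taking FFT magnitudes is one common way to obtain a per-band amplitude; the description above only requires non-overlapping frames and an amplitude value per frequency band.

```python
import numpy as np

def energy_matrix(samples, frame_len=512):
    """Split audio samples into non-overlapping frames and return E[n, m],
    the amplitude of frequency band m in speech frame n (FFT magnitude)."""
    n_frames = len(samples) // frame_len
    frames = np.reshape(samples[:n_frames * frame_len], (n_frames, frame_len))
    return np.abs(np.fft.rfft(frames, axis=1))   # shape: (n_frames, frame_len // 2 + 1)
```

  • The binary template of the earlier sketch can then be built from this matrix before the similarity comparison.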
  • S105 Determine whether the number of times of continuously recognizing voice data as illegal speech reaches a preset number of illegal recognition times.
  • the initial value of the counter is set to 0, if the output result is an illegal voice, the counter is increased by 1, and if the output result is a legal voice, the counter is reset to 0.
  • After the processor generates the voice template corresponding to the voice data, it determines whether the value of the counter reaches the preset number of illegal recognitions.
  • S107 If the preset number of illegal recognitions is reached, perform similarity calculation on the stored voice template, and determine whether the similarity is less than the first threshold.
  • If the counter reaches the preset number of illegal recognitions, the counter is reset to 0 and the similarity between the stored voice templates is calculated. The similarity between the voice templates reflects the similarity between the input voice data.
  • the preset number of illegal recognition times is 2.
  • If the recognition result is an illegal voice and the value of the counter is 2, it means that the voice recognition model has recognized illegal voice twice in a row, and the similarity between the two voice templates is calculated.
  • If the preset number of illegal recognitions is greater than 2, the recognition result is an illegal speech, and the value of the counter reaches the preset number of illegal recognitions, the similarity between every two speech templates is calculated.
  • The similarity between two speech templates is calculated by multiplying the corresponding elements of the two spectrograms. It can be seen from the generation process of the speech template that the spectrogram is a matrix in which each element is 0 or 1, and 1 represents a peak point. After the corresponding elements of the two spectrograms are multiplied, the products are summed to obtain the similarity, that is, the number of matching peak points. The greater the number of matching peak points, the closer the spectrograms and the higher the similarity.
  • If the number of voice templates is two, then when the similarity is greater than or equal to the first threshold, the two speech templates are similar, and when the similarity is less than the first threshold, the two speech templates are not similar. If the number of voice templates is greater than two, the similarity of the voice templates is calculated pairwise; if all the similarities are less than the first threshold, the voice templates are not similar, otherwise the voice templates are similar.
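  • The matching step described above could look like the sketch below; cropping both templates to a common shape is an assumption that the description leaves open.

```python
import numpy as np

def template_similarity(t1, t2):
    """Count matching peak points (elements equal to 1 in both binary templates)."""
    rows = min(t1.shape[0], t2.shape[0])
    cols = min(t1.shape[1], t2.shape[1])
    return int(np.sum(t1[:rows, :cols] * t2[:rows, :cols]))
```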
  • If the similarity is greater than or equal to the first threshold, that is, the voice templates are similar, it indicates that the input voice data is replayed voice; an illegal voice prompt is still output, and the device waits for the user to re-input voice data.
  • If the similarity is less than the first threshold, the input voice may be real voice. Therefore, the recognition parameters of the preset voice recognition model are adjusted, and voice recognition needs to be performed again to further determine whether the voice data is illegal voice data.
  • Adjusting the recognition parameters of the preset speech recognition model can mean adjusting the feature extraction parameters in the preset speech recognition model, for example, the parameters of the neural network model in the speech recognition model, or adjusting the confidence threshold of the preset speech recognition model.
  • Specifically, the confidence threshold is reduced according to a set step size or percentage. For example, if the initial value of the confidence threshold is 0.5, reducing it in steps of 0.01 or by 10% gives a confidence threshold of 0.49 or 0.45, and the reduced confidence threshold is used as the basis for judging whether the voice data is legal or illegal. The most recently received voice data is input into the preset voice recognition model and voice recognition is performed again. If the output probability is greater than or equal to the reduced confidence threshold, the input voice data is determined to be legitimate voice, and the corresponding operation is performed according to the voice data; if the output probability is less than the reduced confidence threshold, the input speech is determined to be illegal speech.
  • In that case, the confidence threshold is lowered again according to the set step size, and speech recognition is performed again. If the confidence threshold has been reduced to the set minimum value and the output probability is still less than that minimum confidence threshold, the input voice is determined to be an illegal voice and a prompt for illegal voice is output; if, before the confidence threshold is reduced to the minimum, the output probability becomes greater than or equal to the reduced confidence threshold, the input voice is determined to be real voice and the corresponding operation is performed according to the voice data; otherwise, an illegal voice prompt is output.
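  • A hedged sketch of this step-wise threshold reduction: the step size and minimum threshold are assumed values, and the model's output probability for the latest utterance is passed in rather than recomputed.

```python
def retry_with_lower_threshold(prob_legal, threshold=0.5, step=0.01, minimum=0.4):
    """Lower the confidence threshold step by step and re-check the utterance.

    Returns ('legal', new_threshold) as soon as the output probability reaches
    the reduced threshold, or ('illegal', minimum) once the minimum is reached.
    """
    while threshold > minimum:
        threshold = max(minimum, round(threshold - step, 4))
        if prob_legal >= threshold:
            return "legal", threshold
    return "illegal", minimum
```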
  • In the voice anti-counterfeiting method provided by this embodiment, the acquired voice data is input into a preset voice recognition model to determine whether the voice data is a legitimate voice. If it is a legitimate voice, the verification is passed; if it is an illegal voice, a voice template corresponding to the voice data is generated, and it is judged whether the number of consecutive recognitions of voice data as illegal speech reaches the preset number of illegal recognitions. If the preset number is reached, the similarity between the voice templates is calculated. Since the similarity between replayed voices is high while the similarity between the real voices input by the user each time is low, a similarity less than the first threshold means that the input voice data may be real voice and the current confidence threshold may be too high. In that case, the confidence threshold in the preset speech recognition model is lowered and speech recognition is performed again, preventing real speech from being misrecognized as illegal speech and improving the user experience.
  • the voice anti-counterfeiting method provided by the second embodiment of the present application differs from the first embodiment in that if the similarity is less than the first threshold, the following steps are executed.
  • S209 Switch the current speech recognition model to another speech recognition model, and perform speech recognition again.
  • A plurality of speech recognition models are stored in the memory. These speech recognition models are all neural network models, trained with different training samples and classification models of different structures, so their speech recognition accuracy differs for different users' speech data.
  • If the similarity is less than the first threshold, that is, the voice templates are not similar, it indicates that the input voice data may be real voice and the current voice recognition model may not be suitable for the current user.
  • The current speech recognition model is switched to another speech recognition model, the most recently received speech data is input into the switched speech recognition model, and whether the input speech data is legal voice is judged according to the confidence threshold corresponding to the switched speech recognition model; at the same time, the switched voice recognition model is used as the voice recognition model for the next input voice data.
  • the voice recognition method of the second voice recognition model is the same as the voice recognition method of the first voice recognition model.
  • The voice data corresponding to all the voice templates is input into the switched voice recognition model, and the number of illegal voices output by the switched voice recognition model is counted. If the number of illegal voices output by the switched voice recognition model is less than the preset number, the switched voice recognition model is used as the current voice recognition model. Among the output results obtained by inputting the voice data corresponding to all voice templates into the switched voice recognition model, the output result for the most recent voice data is taken, and whether the voice data input by the user is legitimate voice is determined based on that output result; at the same time, the switched voice recognition model is used as the voice recognition model for the voice data the user inputs next time. If the number of illegal voices output by the switched voice recognition model is the same as the number of illegal voices output by the preset voice recognition model, the preset voice recognition model is still used as the current voice recognition model, and the voice data input by the user is judged to be illegal voice.
  • If the number of illegal voices output by the switched voice recognition model is the same as the number of illegal voices output by the preset voice recognition model, then, according to the voice recognition results of each voice recognition model in the memory, the model is switched again to a new speech recognition model.
  • Specifically, the voice data corresponding to all voice templates is input into each voice recognition model in the memory in turn, the number of illegal voices output by each voice recognition model is counted, the voice recognition model with the fewest illegal voices is used as the switched voice recognition model, and speech recognition is performed again. For example, if the current number of speech templates is 3, there are 5 speech recognition models stored in the memory, namely speech recognition model A, speech recognition model B, speech recognition model C, speech recognition model D, and speech recognition model E.
  • the voice recognition model A is a preset voice recognition model, that is, the voice recognition model A recognizes voice data as illegal voice three times in a row.
  • Each voice recognition model processes the voice data corresponding to the three voice templates, and the number of illegal voices in the output results of each voice recognition model is counted. If the output of speech recognition model B contains 2 illegal voices, the output of speech recognition model C contains 1 illegal voice, the output of speech recognition model D contains 1 illegal voice, and the output of speech recognition model E contains 0 illegal voices, that is, speech recognition model E recognizes the most recent voice data as legitimate voice, then the current voice recognition model is switched to speech recognition model E, and the next time the user inputs voice data, speech recognition model E is used to determine whether the voice data is legitimate voice.
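  • A hedged sketch of this model-selection step: the `recognize` callable and the tie-breaking rule (keep the current model unless a candidate strictly does better) are assumptions beyond the prose above.

```python
def pick_best_model(current_model, candidate_models, rejected_voice_data, recognize):
    """Re-run the recently rejected utterances through each candidate model and
    switch to the one producing the fewest 'illegal' results; otherwise keep
    the current model. `recognize(model, voice)` returns 'legal' or 'illegal'."""
    def illegal_count(model):
        return sum(recognize(model, voice) == "illegal" for voice in rejected_voice_data)

    best = min(candidate_models, key=illegal_count, default=current_model)
    return best if illegal_count(best) < illegal_count(current_model) else current_model
```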
  • In the voice anti-counterfeiting method provided by this embodiment, the acquired voice data is input into a preset voice recognition model to determine whether the voice data is a legitimate voice. If it is a legitimate voice, the verification is passed; if it is an illegal voice, a voice template corresponding to the voice data is generated, and it is judged whether the number of consecutive recognitions of voice data as illegal speech reaches the preset number of illegal recognitions. If the preset number is reached, the similarity between the voice templates is calculated. Since the similarity between replayed voices is high, the similarity between the real voices input by the user each time is low.
  • If the similarity is less than the first threshold, it means that the input voice data may be real voice and the preset voice recognition model may not be applicable to the current user's voice data; in that case, the current voice recognition model is switched to another voice recognition model and voice recognition is performed again, preventing real voice from being mistakenly recognized as illegal voice and improving the user experience.
  • the voice anti-counterfeiting method provided by the third embodiment of the present application includes:
  • S302 Recognize the voice data using a preset voice recognition model, and determine whether the recognition result is a legal voice.
  • S301-S303 are the same as S101-S103 in the first embodiment, and will not be repeated here.
  • the initial value of the counter is set to 0, if the output result is an illegal voice, the counter is increased by 1, and if the preset number of illegal recognition times is reached or the output result is a legal voice, the counter is reset to 0 .
  • If the output result is an illegal voice, it is determined whether the value of the counter reaches the preset number of illegal recognition times.
  • If the preset number is not reached, an illegal voice prompt is output, and the device waits for the user to input voice data again.
  • If the preset number of illegal recognitions is reached, it means that the terminal device has recognized the voice data as illegal voice multiple times in succession, and voice templates corresponding to the multiple consecutive voice data recognized as illegal voice are generated. For example, if the preset number of illegal recognition times is 2, the output result of the preset speech recognition model is illegal speech, and the preset number of illegal recognitions is reached, it means that the terminal device has recognized the voice data as illegal speech twice in a row, and two voice templates corresponding one-to-one to the two pieces of voice data are generated.
  • the method for generating a voice template for each voice data is the same as that of the first embodiment, and will not be repeated here.
  • S307 Perform similarity calculation on the voice template, and determine whether the similarity is less than a first threshold.
  • S307-S309 are the same as S107-S109 in the first embodiment, and will not be repeated here.
  • In the voice anti-counterfeiting method provided by this embodiment, the acquired voice data is input into a preset voice recognition model to determine whether the voice data is a legitimate voice. If it is a legitimate voice, the verification is passed; if it is an illegal voice, it is determined whether the number of consecutive recognitions of voice data as illegal voice reaches the preset number of illegal recognitions. If the preset number is reached, voice templates corresponding to the multiple consecutive voice data recognized as illegal voice data are generated, and similarity calculation is performed on the voice templates. Since the similarity between replayed voices is high, the similarity between the real voices input by the user each time is low.
  • If the similarity is less than the first threshold, it means that the input voice data may be real voice and the current recognition parameters may be inappropriate; in that case, the recognition parameters of the current speech recognition model are adjusted and speech recognition is performed again, preventing real speech from being misrecognized as illegal speech and improving the user experience.
  • the terminal device may frequently recognize the user's real voice as an illegal voice.
  • the smart speaker shown in FIG. 7 often recognizes the user's real voice as an illegal voice.
  • the following uses this scenario as an example to describe the voice anti-counterfeiting method provided in the embodiment of the present application.
  • the voice anti-counterfeiting method provided by the fourth embodiment of the present application includes:
  • S402 Recognize the voice data using a preset voice recognition model, and determine whether the recognition result is a legal voice.
  • S401-S404 are the same as S101-S104 in the first embodiment, and will not be repeated here.
  • S405 Determine whether the preset period is reached.
  • the frequency at which the existing smart speakers recognize legal voices as illegal voices is counted, and an appropriate adjustment period is set, for example, the adjustment period is set to 3 days.
  • The initial value of the timer is set to 0 and timing starts. After a voice template is generated, it is determined whether the current timing reaches the adjustment period.
  • S407 If the preset period is reached, count the number of stored voice templates, and determine whether the number of voice templates in the preset period meets the first preset condition.
  • the initial value of the first counter is set to 0, and the first counter is used to count the number of input voice data. Each time voice data is input, the first counter is incremented by one.
  • the initial value of the second counter is set to 0, and the second counter is used to count the number of stored voice templates. Each time a voice template is generated, the second counter is increased by 1. When the preset period is reached, both the first counter and the second counter are reset to zero.
  • the first preset condition includes the following three situations, that is, any one of the following situations is satisfied, that is, the first preset condition is satisfied.
  • the number of voice templates in the preset period is greater than the second threshold
  • the percentage of the number of voice templates in the preset period in the number of all input voice data is greater than the third threshold
  • the number of voice templates in the preset period is greater than the second threshold, and the percentage of the number of voice templates in the preset period in the total number of input voice data is greater than the third threshold.
  • For example, the second threshold is set to 5; if, according to the value of the second counter, the number of voice templates generated in the preset period is greater than 5, the first preset condition is met.
  • As another example, the third threshold is set to 1/10; if, according to the first counter, the number of voice data inputs in the preset period is 30 and, according to the second counter, the number of voice templates generated is 5, then the number of times voice data has been recognized as illegal voice is 5, and the percentage of the number of voice templates in the number of all input voice data is 1/6, which is greater than the third threshold, so the first preset condition is met.
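  • The first preset condition can be written as a simple predicate; the default thresholds reuse the example values above (second threshold 5, third threshold 1/10), and the "and/or" is implemented here as "either sub-condition suffices".

```python
def first_condition_met(num_templates, num_inputs,
                        second_threshold=5, third_threshold=1 / 10):
    """True if illegal-voice templates are frequent within the preset period."""
    ratio = num_templates / num_inputs if num_inputs else 0.0
    return num_templates > second_threshold or ratio > third_threshold
```

  • With the numbers from the example, first_condition_met(5, 30) returns True because 5/30 (about 1/6) exceeds 1/10.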
  • the terminal device if the number of voice templates in the preset period does not meet the first preset condition, it means that the terminal device has a lower probability of recognizing voice data as illegal voice, and further indicates that the terminal device has a lower probability of recognizing real voice as illegal voice.
  • In that case, the input voice data is judged as illegal voice, an illegal voice prompt is output, and the device waits for the user to input voice data again.
  • If the number of voice templates in the preset period meets the first preset condition, it indicates that the terminal device has a greater probability of recognizing voice data as illegal voice.
  • According to the spectrograms corresponding to the voice templates, the similarity between any two voice templates is calculated. If the similarity is greater than the first threshold, the two voice templates are similar. Every two voice templates are compared in this way to determine whether they are similar, and the number of similar voice templates is counted.
  • S410 Determine whether the second preset condition is satisfied according to the number of similar voice templates.
  • the second preset condition includes the following three situations, that is, any one of the following situations is satisfied, that is, the second preset condition is satisfied.
  • the number of similar speech templates is less than the third threshold
  • the percentage of the number of similar voice templates in all similarity calculation times is less than the fourth threshold
  • the number of similar speech templates is less than the third threshold, and the percentage of the number of similar speech templates in all similarity calculation times is less than the fourth threshold.
  • For example, the third threshold is set to 3; if the number of speech templates generated in the preset period is 10 and, after calculating the similarity between every two speech templates, the number of similar speech templates is 2, this is less than the third threshold and the second preset condition is met.
  • As another example, the fourth threshold is set to 1/5; if the number of voice templates generated in the preset period is 10, calculating the similarity between every two voice templates requires 45 calculations. If the number of similar voice templates is 5, the percentage of similar voice templates in all similarity calculations is 1/9, which is less than the fourth threshold, so the second preset condition is met.
  • If the number of similar speech templates does not meet the second preset condition, that is, the number of similar speech templates is greater than the third threshold and/or the percentage of similar speech templates in all similarity calculations is greater than the fourth threshold, it indicates that the similarity between the voice templates is high and the input voice data is replayed voice; an illegal voice prompt is still output, and the device waits for the user to re-input voice data.
  • If the second preset condition is met, the recognition parameters of the voice recognition model are adjusted and the voice data is recognized again.
  • the method of adjusting the recognition parameters of the speech recognition model and re-recognizing speech is the same as S109 in the first embodiment of the present application, and will not be repeated here.
  • In the voice anti-counterfeiting method provided by this embodiment, the acquired voice data is input into a preset voice recognition model to determine whether the voice data is a legitimate voice. If it is a legitimate voice, the verification is passed; if it is an illegal voice, a voice template corresponding to the voice data is generated.
  • If the number of similar voice templates meets the second preset condition, it means that the similarity between the input voice data is not high and the input voice data may be real voice; the recognition parameters of the voice recognition model are adjusted and voice recognition is performed again, preventing real voice from being mistakenly recognized as illegal voice and improving the user experience.
  • The voice anti-counterfeiting method provided by the fifth embodiment of the present application differs from the fourth embodiment in that, if the number of similar voice templates meets the second preset condition, the following step is executed:
  • S512 Switch the current speech recognition model to another speech recognition model.
  • S512 is the same as S209 in the second embodiment of the present application, and will not be repeated here.
  • In the voice anti-counterfeiting method provided by this embodiment, the acquired voice data is input into a preset voice recognition model to determine whether the voice data is a legitimate voice. If it is a legitimate voice, the verification is passed; if it is an illegal voice, a voice template corresponding to the voice data is generated.
  • If the number of similar voice templates meets the second preset condition, the input voice data may be real voice and the preset voice recognition model may not be suitable for the current user's voice data, so the current speech recognition model is switched to another speech recognition model, preventing real speech from being misrecognized as illegal speech and improving the user experience.
  • FIG. 14 shows a structural block diagram of a voice anti-counterfeiting device provided in an embodiment of the present application. For ease of description, only the parts related to the embodiment of the present application are shown.
  • the voice anti-counterfeiting device includes:
  • the calculation module 10 is used to calculate the similarity of the stored voice templates; wherein, each voice template is generated according to the illegal voice data recognized each time by the preset voice recognition model;
  • the adjustment module 20 is configured to adjust the recognition parameters of the preset voice recognition model or switch the current voice recognition model to another voice recognition model if the similarity calculation result meets a preset condition.
  • the calculation module 10 is specifically configured to:
  • if the voice recognition model recognizes illegal voice data twice in a row, similarity calculation is performed on the two stored voice templates, where the two voice templates are generated based on the illegal voice data recognized twice in a row, and each voice template corresponds to one piece of illegal voice data;
  • the adjustment module 20 is specifically configured to:
  • if the similarity of the two voice templates is less than a first threshold, the recognition parameters of the speech recognition model are adjusted or the current speech recognition model is switched to another speech recognition model.
  • the calculation module 10 is specifically configured to:
  • pairwise similarity calculation is performed on the stored voice templates;
  • the adjustment module 20 is specifically configured to:
  • the number of similar speech templates calculated according to the similarity between every two speech templates is less than the third threshold and/or the percentage of the number of similar speech templates in all similarity calculation times is less than the fourth threshold, then Adjust the recognition parameters of the speech recognition model or switch the current speech recognition model to another speech recognition model.
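The periodic check could be sketched as follows. The description reuses the name "third threshold" both for the percentage in the first preset condition and for the similar-template count in the second, so the two are given distinct parameter names here; all numeric defaults are assumptions, and the and/or combinations are implemented as plain disjunctions.

```python
from itertools import combinations

def periodic_decision(templates, num_inputs, similarity, first_threshold,
                      second_threshold=5, third_threshold_ratio=0.1,
                      third_threshold_count=3, fourth_threshold_ratio=0.2):
    """Evaluate the first and second preset conditions at the end of a period.

    templates: voice templates stored during the period (one per utterance
    judged illegal); num_inputs: total number of utterances in the period.
    """
    # First preset condition: illegal verdicts were frequent in this period.
    frequent = (len(templates) > second_threshold
                or (num_inputs and len(templates) / num_inputs > third_threshold_ratio))
    if not frequent:
        return "prompt_illegal"

    # Pairwise similarity: count how many template pairs look alike.
    pairs = list(combinations(templates, 2))
    similar = sum(1 for a, b in pairs if similarity(a, b) >= first_threshold)

    # Second preset condition: few similar pairs suggests the inputs are real voice.
    if similar < third_threshold_count or (pairs and similar / len(pairs) < fourth_threshold_ratio):
        return "adjust_or_switch"
    return "prompt_illegal"
```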
In a possible implementation, the voice anti-counterfeiting device further includes:

an acquisition module, configured to acquire voice data;

a recognition module, configured to recognize the voice data using the preset voice recognition model; and

a template generation module, configured to generate a voice template corresponding to the illegal voice data if the recognition result is illegal voice data, and to store the voice template.
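Taken together, the acquisition, recognition and template-generation modules suggest a simple per-utterance flow, sketched below; `model.is_legitimate` and `make_template` are assumed interfaces standing in for the recognition model and the spectrum-based template generation described next.

```python
def verify_utterance(voice_data, model, template_store, make_template):
    """Recognize one utterance; pass it if legitimate, otherwise store a template.

    template_store is any list-like container of previously stored templates.
    """
    if model.is_legitimate(voice_data):
        return "verified"                          # legitimate voice: verification passes
    template_store.append(make_template(voice_data))
    return "illegal"                               # illegal voice: keep template for later checks
```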
In a possible implementation, the template generation module is specifically configured to: convert the voice signal corresponding to the illegal voice data into a voice spectrum; generate a two-dimensional matrix corresponding to the voice spectrum, where the elements in the two-dimensional matrix represent the energy of a preset frame of voice in a preset frequency band; and generate the voice template according to the two-dimensional matrix.
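A rough Python/NumPy sketch of building such a two-dimensional matrix from a raw waveform; the non-overlapping framing follows the description, while the frame length and the number of bands are illustrative choices.

```python
import numpy as np

def band_energy_matrix(signal, frame_len=400, n_bands=32):
    """Return E[n, m]: energy of voice frame n in frequency band m."""
    n_frames = len(signal) // frame_len
    frames = np.reshape(signal[:n_frames * frame_len], (n_frames, frame_len))
    spectrum = np.abs(np.fft.rfft(frames, axis=1)) ** 2      # per-frame power spectrum
    # Split each frame's spectrum into contiguous bands and sum the energy per band.
    bins_per_band = spectrum.shape[1] // n_bands
    trimmed = spectrum[:, :bins_per_band * n_bands]
    return trimmed.reshape(n_frames, n_bands, bins_per_band).sum(axis=2)
```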
In a possible implementation, the template generation module is further configured to: normalize the two-dimensional matrix; and set the elements in the normalized two-dimensional matrix that are greater than an energy threshold to a first preset value, set the elements that are less than or equal to the energy threshold to a second preset value, and use the resulting matrix as the voice template.

In a possible implementation, the calculation module 10 is further configured to: calculate, according to the normalized two-dimensional matrices, the number of matched first preset values in every two voice templates; and determine the similarity between every two voice templates according to the number of matched first preset values.
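Continuing the sketch above, normalization, binarization and the matched-peak similarity might be implemented as follows. Max-normalization is an assumption (the description does not fix the normalization method); the preset values 1 and 0 follow the examples in the description, and counting the positions where both templates hold the first preset value is equivalent to the element-wise multiply-and-sum it describes.

```python
import numpy as np

def to_template(energy, energy_threshold=0.5, first_value=1, second_value=0):
    """Normalize the band-energy matrix and binarize it into a bitmap template."""
    peak = energy.max()
    normalized = energy / peak if peak > 0 else energy
    return np.where(normalized > energy_threshold, first_value, second_value)

def template_similarity(t1, t2):
    """Similarity = number of matched peak points (both templates equal to 1),
    computed over the frames the two templates have in common."""
    n = min(len(t1), len(t2))                     # both assumed to use the same bands
    return int(np.sum((t1[:n] == 1) & (t2[:n] == 1)))
```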
In a possible implementation, the adjustment module 20 is further configured to: re-recognize the voice data using the voice recognition model whose recognition parameters have been adjusted or the new voice recognition model obtained after switching.

In a possible implementation, the adjustment module 20 is further configured to: lower the confidence threshold of the preset voice recognition model according to a preset rule.
In a possible implementation, the voice anti-counterfeiting device is further configured to: output an illegal voice prompt if the similarity calculation result does not meet the preset condition.
An embodiment of this application further provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the steps in each of the above method embodiments can be implemented.

An embodiment of this application further provides a computer program product. When the computer program product is run on a mobile terminal, the mobile terminal is caused to implement the steps in each of the above method embodiments.
In the embodiments provided in this application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the apparatus/network device embodiments described above are merely illustrative: the division of the modules or units is only a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented. In addition, the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
The functional units in the embodiments of this application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.

The units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the objectives of the solutions of the embodiments.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the above method embodiments of this application can be implemented by a computer program instructing relevant hardware; the computer program can be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of each of the above method embodiments can be implemented. The computer program includes computer program code, and the computer program code may be in the form of source code, object code, an executable file, or some intermediate form. The computer-readable medium may at least include: any entity or device capable of carrying the computer program code to the photographing device/terminal device, a recording medium, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electric carrier signal, a telecommunications signal, and a software distribution medium, for example a USB flash drive, a removable hard disk, a magnetic disk, or an optical disc.

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Multimedia (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Software Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Telephone Function (AREA)
  • Telephonic Communication Services (AREA)

Abstract

Provided are an artificial intelligence (AI) based voice anti-counterfeiting method, apparatus, terminal device and storage medium, applicable to the field of voice processing technology. The voice anti-counterfeiting method includes: performing similarity calculation on stored voice templates (S107), where each voice template is generated from the illegal voice data recognized each time by a preset voice recognition model; and, if the similarity calculation result meets a preset condition, adjusting the recognition parameters of the preset voice recognition model or switching the current voice recognition model to another voice recognition model (S109). The method can reduce the probability of misjudging legitimate voice as illegal voice, improve the accuracy of voice recognition, and improve user experience.

Description

语音防伪方法、装置、终端设备及存储介质
本申请要求于2019年11月27日提交国家知识产权局、申请号为201911183043.7、申请名称为“语音防伪方法、装置、终端设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及属于语音处理技术领域,尤其涉及基于人工智能(Artificial Intelligence,AI)的语音防伪方法、装置、终端设备及存储介质。
背景技术
近年来,由于移动技术的蓬勃发展,语音识别已经广泛应用在移动终端的语音系统,包括移动银行语音认证、智能手机语音登录、和电子商务语音支付。目前的语音识别系统存在很多伪装攻击,主要是录音重放、语音合成、语音转换和语音模仿,而其中最简单的攻击就是重放攻击。重放攻击是指攻击者使用录音设备偷偷录制用户用于语音识别的语音信号,并通过播放设备播放录制的语音,从而通过声纹识别的验证。高保真录音设备的普及使得用户语音极易被攻击者偷录成功。
现有的语音防伪方法在声纹防伪识别方面取得了一定成就,加强了重放攻击的拦截功能,但是同时忽略了真实语音一定概率上被误判的情况,即用户的合法语音有时被识别为非法语音,无法通过语音验证,大大影响用户的体验。
发明内容
本申请实施例提供了语音防伪方法、装置、终端设备及存储介质,可以解决现有的语音防伪中出现将合法语音识别为非法语音的问题。
第一方面,本申请实施例提供了一种语音防伪方法,包括:
对存储的语音模板进行相似度计算;其中,每一个语音模板是根据预设的语音识别模型每次识别出的非法语音数据生成的;
若相似度计算结果满足预设条件,则调整所述预设的语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型。
本申请实施例提供的语音防伪方法,将预设的语音识别模型识别出的非法语音数据生成语音模板并存储,对语音模板进行相似度计算,若相似度计算满足预设条件,调整预设的语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型。从而避免由于当前的语音识别模型识别不准确造成的语音识别错误的问题,降低将合法语音误判为非法语音的概率,提高语音识别的准确度,提升用户体验。
在第一方面的一种可能的实现方式中,所述对存储的语音模板进行相似度计算包括:
当所述语音识别模型连续两次识别出非法语音数据时,对存储的两个语音模板进行相似度计算,其中,所述两个语音模板是根据所述连续两次识别出的非法语音数据生成的,每个语音模板对应一个非法语音数据;
相应的,若相似度计算结果满足预设条件,则调整所述语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型包括:
若两个语音模板的相似度小于第一阈值,则调整所述语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型。
本申请实施例提供的语音防伪方法,在连续两次将语音数据识别为非法语音数据时,对存储的两个语音模板进行相似度计算,若两个语音模板的相似度小于第一阈值,则调整所述语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型。由于连续两次将语音数据识别为非法语音数据,说明有可能存在将合法语音识别为非法语音的情况,此时调整识别参数或切换语音识别模型,得到更准确的语音识别模型,从而在下一次语音识别时,降低将合法语音识别为非法语音数据的概率。
在第一方面的一种可能的实现方式中,所述对存储的语音模板进行相似度计算包括:
当达到预设周期时,统计存储的语音模板的数量;
当所述预设周期内的语音模板的数量大于第二阈值和/或所述预设周期内的语音模板的数量在所有输入语音数据的数量中的百分比大于第三阈值时,对存储的语音模板进行两两相似度计算;
相应的,若相似度计算结果满足预设条件,则调整所述语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型包括:
若根据每两个语音模板之间的相似度计算出的相似语音模板的数量小于第三阈值和/或所述相似语音模板的数量在所有相似度计算的次数中的百分比小于第四阈值,则调整所述语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型。
本申请实施例提供的语音防伪方法,当达到预设周期时,统计存储的语音模板的数量。当所述预设周期内的语音模板的数量达到一定数量时,说明将语音数据识别为非法语音数据的频率较高,对语音模板进行相似度计算。若相似语音模板的数量小于一定数量时,说明语音模板之间的相似度不高,输入的语音数据可能为合法语音数据。调整所述语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型,防止一段时间内多次将合法语音数据识别为非法语音数据。
在第一方面的一种可能的实现方式中,所述对存储的语音模板进行相似度计算之前,所述方法还包括:
获取语音数据;
采用所述预设的语音识别模型对所述语音数据进行识别;
若识别结果为非法语音数据,则生成与所述非法语音数据对应的语音模板,并存储所述语音模板。
由于语音模板是对语音数据进行特征提取后得到的,计算语音模板之间的相似度,语音模板之间的相似度可以准确反映出语音数据是否是重放语音。
在第一方面的一种可能的实现方式中,所述生成与所述非法语音数据对应的语音模板,包括:
将所述非法语音数据对应的语音信号转换为语音频谱;
生成与所述语音频谱对应的二维矩阵,所述二维矩阵中的元素表示预设帧的语音在预设频带的能量;
根据所述二维矩阵生成所述语音模板。相对于通过语音频谱计算语音数据之间的相似度,通过二维矩阵计算语音数据之间的相似度可以提高计算速度。
在第一方面的一种可能的实现方式中,所述根据所述二维矩阵生成所述语音模板,包括:
对所述二维矩阵进行归一化处理;
将归一化处理后的二维矩阵中大于能量阈值的元素设置为第一预设值,将所述归一化处理后的二维矩阵中小于或者等于所述能量阈值的元素设置为第二预设值,将设置后的二维矩阵作为所述语音模板。
在第一方面的一种可能的实现方式中,所述对存储的语音模板进行相似度计算,包括:
计算每两个语音模板中所述第一预设值的匹配数量;
根据所述第一预设值的匹配数量确定每两个语音模板之间的相似度。
示例性地,将两个语音模板中对应的元素一一比较,计算第一预设值的匹配数量,匹配数量越多,相似度越高。
在第一方面的一种可能的实现方式中,调整所述预设的语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型之后,所述方法还包括:
采用识别参数调整后的语音识别模型或者切换后的新的语音识别模型对所述语音数据进行重新识别。
可以理解,由于不同的语音识别模型对不同个体的识别准确度不同,当语音数据被识别为非法语音时,通过调整参数或切换模型的方法对语音数据进行重新识别,从而可以采用合适的语音识别模型或者合适的参数对语音数据进行识别,降低将合法语音误判为非法语音的概率。
在第一方面的一种可能的实现方式中,所述调整所述预设的语音识别模型的识别参数,包括:
按照预设规则降低所述预设的语音识别模型的置信度阈值。通过降低置信度阈值的方式,防止由于置信度阈值设置过高造成的语音识别错误。
在第一方面的一种可能的实现方式中,所述方法还包括:
若相似度计算结果不满足预设条件,输出非法语音提示,等待用户再次输入语音数据。
第二方面,本申请实施例提供了一种语音防伪装置,包括:
计算模块,用于对存储的语音模板进行相似度计算;其中,每一个语音模板是根据预设的语音识别模型每次识别出的非法语音数据生成的;
调整模块,用于若相似度计算结果满足预设条件,则调整所述预设的语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型。
在第二方面的一种可能的实现方式中,所述计算模块具体用于:
当所述语音识别模型连续两次识别出非法语音数据时,对存储的两个语音模板进行相似度计算,其中,所述两个语音模板是根据所述连续两次识别出的非法语音数据 生成的,每个语音模板对应一个非法语音数据;
相应的,所述调整模块具体用于:
若两个语音模板的相似度小于第一阈值,则调整所述语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型。
在第二方面的一种可能的实现方式中,所述计算模块具体用于:
当达到预设周期时,统计存储的语音模板的数量;
当所述预设周期内的语音模板的数量大于第二阈值和/或所述预设周期内的语音模板的数量在所有输入语音数据的数量中的百分比大于第三阈值时,对存储的语音模板进行两两相似度计算;
相应的,所述调整模块具体用于:
若根据每两个语音模板之间的相似度计算出的相似语音模板的数量小于第三阈值和/或所述相似语音模板的数量在所有相似度计算的次数中的百分比小于第四阈值,则调整所述语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型。
在第二方面的一种可能的实现方式中,所述语音防伪装置还包括:
获取模块,用于获取语音数据;
识别模块,用于采用所述预设的语音识别模型对所述语音数据进行识别;
模板生成模块,用于若识别结果为非法语音数据,则生成与所述非法语音数据对应的语音模板,并存储所述语音模板。
在第二方面的一种可能的实现方式中,所述模板生成模块具体用于:
将所述非法语音数据对应的语音信号转换为语音频谱;
生成与所述语音频谱对应的二维矩阵,所述二维矩阵中的元素表示预设帧的语音在预设频带的能量;
根据所述二维矩阵生成所述语音模板。
在第二方面的一种可能的实现方式中,所述模板生成模块还用于:
对所述二维矩阵进行归一化处理;
将归一化处理后的二维矩阵中大于能量阈值的元素设置为第一预设值,将所述归一化处理后的二维矩阵中小于或者等于所述能量阈值的元素设置为第二预设值,将设置后的二维矩阵作为所述语音模板。
在第二方面的一种可能的实现方式中,所述计算模块还用于:
计算每两个语音模板中所述第一预设值的匹配数量;
根据所述第一预设值的匹配数量确定每两个语音模板之间的相似度。
在第二方面的一种可能的实现方式中,所述调整模块还用于:
采用识别参数调整后的语音识别模型或者切换后的新的语音识别模型对所述语音数据进行重新识别。
在第二方面的一种可能的实现方式中,所述调整模块还用于:
按照预设规则降低所述预设的语音识别模型的置信度阈值。
在第二方面的一种可能的实现方式中,所述语音防伪装置还包括:
若相似度计算结果不满足预设条件,输出非法语音提示。
第三方面,本申请实施例提供了一种终端设备,包括:存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,所述处理器执行所述计算机程序时实现上述第一方面中任一项所述的语音防伪方法。
第四方面,本申请实施例提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时实现上述第一方面中任一项所述的语音防伪方法。
第五方面,本申请实施例提供了一种计算机程序产品,当计算机程序产品在终端设备上运行时,使得终端设备执行上述第一方面中任一项所述的语音防伪方法。
可以理解的是,上述第二方面至第五方面的有益效果可以参见上述第一方面中的相关描述,在此不再赘述。
附图说明
图1是本申请实施例提供的终端设备的结构示意图;
图2是本申请一实施例提供的应用场景示意图;
图3为本申请一实施例提供的另一应用场景示意图;
图4为本申请一实施例提供的应用场景示意图;
图5为本申请另一实施例提供的应用场景示意图;
图6为本申请另一实施例提供的应用场景示意图;
图7为本申请另一实施例提供的应用场景示意图;
图8为本申请又一实施例提供的应用场景示意图;
图9为本申请第一实施例提供的语音防伪方法的流程示意图;
图10为本申请第二实施例提供的语音防伪方法的流程示意图;
图11为本申请第三实施例提供的语音防伪方法的流程示意图;
图12为本申请第四实施例提供的语音防伪方法的流程示意图;
图13为本申请第五实施例提供的语音防伪方法的流程示意图;
图14是本申请实施例提供的语音防伪装置的结构示意图。
具体实施方式
以下描述中,为了说明而不是为了限定,提出了诸如特定系统结构、技术之类的具体细节,以便透彻理解本申请实施例。然而,本领域的技术人员应当清楚,在没有这些具体细节的其它实施例中也可以实现本申请。在其它情况中,省略对众所周知的系统、装置、电路以及方法的详细说明,以免不必要的细节妨碍本申请的描述。
应当理解,当在本申请说明书和所附权利要求书中使用时,术语“包括”指示所描述特征、整体、操作、元素和/或组件的存在,但并不排除一个或多个其它特征、整体、操作、元素、组件和/或其集合的存在或添加。
还应当理解,在本申请说明书和所附权利要求书中使用的术语“和/或”是指相关联列出的项中的一个或多个的任何组合以及所有可能组合,并且包括这些组合。
如在本申请说明书和所附权利要求书中所使用的那样,术语“如果”可以依据上下文被解释为“当...时”或“一旦”或“响应于确定”或“响应于检测到”。类似地,短语“如果确定”或“如果检测到[所描述条件或事件]”可以依据上下文被解释为意指“一旦确定”或“响应于确定”或“一旦检测到[所描述条件或事件]”或“响应于 检测到[所描述条件或事件]”。
另外,在本申请说明书和所附权利要求书的描述中,术语“第一”、“第二”等仅用于区分描述,而不能理解为指示或暗示相对重要性。
在本申请说明书中描述的参考“一个实施例”或“一些实施例”等意味着在本申请的一个或多个实施例中包括结合该实施例描述的特定特征、结构或特点。由此,在本说明书中的不同之处出现的语句“在一个实施例中”、“在一些实施例中”、“在其他一些实施例中”、“在另外一些实施例中”等不是必然都参考相同的实施例,而是意味着“一个或多个但不是所有的实施例”,除非是以其他方式另外特别强调。术语“包括”、“包含”、“具有”及它们的变形都意味着“包括但不限于”,除非是以其他方式另外特别强调。
本申请实施例提供的语音防伪方法应用于终端设备。该终端设备可以是任意具有语音交互功能的设备。包括但不限于具有语音交互功能的智能手机、智能音箱、智能家电、平板电脑、可穿戴设备、车载设备、增强现实(augmented reality,AR)/虚拟现实(virtual reality,VR)设备、笔记本电脑、超级移动个人计算机(ultra-mobile personal computer,UMPC)、上网本、个人数字助理(personal digital assistant,PDA)等,本申请实施例对终端设备的具体类型不作任何限制。
图1示出了终端设备的结构示意图。参考图1,终端设备包括:处理器110、存储器120、输入单元130、显示单元140、传感器150、音频电路160、无线保真(wireless fidelity,WiFi)模块170、以及电源180等部件。本领域技术人员可以理解,图1中示出的终端设备结构并不构成对终端设备的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。
下面结合图1对终端设备的各个构成部件进行具体的介绍:
处理器110可以是中央处理单元(Central Processing Unit,CPU),该处理器110还可以是其他通用处理器、数字信号处理器(Digital Signal Processor,DSP)、专用集成电路(Application Specific Integrated Circuit,ASIC)、现成可编程门阵列(Field-Programmable Gate Array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
存储器120可用于存储软件程序以及模块,处理器110通过运行存储在存储器120的软件程序以及模块,从而执行终端设备的各种功能应用以及数据处理。存储器120可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能等)等;存储数据区可存储根据终端设备的使用所创建的数据(比如音频数据、电话本等)等。此外,存储器120可以包括高速随机存取存储器,还可以包括非易失性存储器,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件。
输入单元130可用于接收输入的数字或字符信息,以及产生与终端设备的用户设置以及功能控制有关的键信号输入。具体地,输入单元130可包括触控面板131以及其他输入设备132。触控面板131,也称为触摸屏,可收集用户在其上或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在触控面板131上或在触控面 板131附近的操作),并根据预先设定的程式驱动相应的连接装置。可选的,触控面板131可包括触摸检测装置和触摸控制器两个部分。其中,触摸检测装置检测用户的触摸方位,并检测触摸操作带来的信号,将信号传送给触摸控制器;触摸控制器从触摸检测装置上接收触摸信息,并将它转换成触点坐标,再送给处理器110,并能接收处理器110发来的命令并加以执行。此外,可以采用电阻式、电容式、红外线以及表面声波等多种类型实现触控面板131。除了触控面板131,输入单元130还可以包括其他输入设备132。具体地,其他输入设备132可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆等中的一种或多种。
显示单元140可用于显示由用户输入的信息或提供给用户的信息以及手机的各种菜单。显示单元140可包括显示面板141,可选的,可以采用液晶显示器(Liquid Crystal Display,LCD)、有机发光二极管(Organic Light-Emitting Diode,OLED)等形式来配置显示面板141。进一步的,触控面板131可覆盖显示面板141,当触控面板131检测到在其上或附近的触摸操作后,传送给处理器110以确定触摸事件的类型,随后处理器110根据触摸事件的类型在显示面板141上提供相应的视觉输出。虽然在图1中,触控面板131与显示面板141是作为两个独立的部件来实现手机的输入和输入功能,但是在某些实施例中,可以将触控面板131与显示面板141集成而实现手机的输入和输出功能。
终端设备还可包括至少一种传感器150,比如光传感器、运动传感器以及其他传感器。具体地,光传感器可包括环境光传感器及接近传感器,其中,环境光传感器可根据环境光线的明暗来调节显示面板141的亮度,接近传感器可在手机移动到耳边时,关闭显示面板141和/或背光。作为运动传感器的一种,加速计传感器可检测各个方向上(一般为三轴)加速度的大小,静止时可检测出重力的大小及方向,可用于识别终端设备的姿态的应用(比如横竖屏切换、相关游戏、磁力计姿态校准)、振动识别相关功能(比如计步器、敲击)等;至于手机还可配置的陀螺仪、气压计、湿度计、温度计、红外线传感器等其他传感器,在此不再赘述。
音频电路160、扬声器161,传声器162可提供用户与终端设备之间的音频接口。音频电路160可将接收到的音频数据转换后的电信号,传输到扬声器161,由扬声器161转换为声音信号输出;另一方面,传声器162将收集的声音信号转换为电信号,由音频电路160接收后转换为音频数据,再将音频数据输出处理器110处理后,将音频数据输出至存储器120以便进一步处理。
WiFi属于短距离无线传输技术,终端设备通过WiFi模块170可以帮助用户收发电子邮件、浏览网页和访问流式媒体等,它为用户提供了无线的宽带互联网访问。虽然图1示出了WiFi模块170,但是可以理解的是,其并不属于终端设备的必须构成,完全可以根据需要在不改变发明的本质的范围内而省略。
终端设备还包括给各个部件供电的电源180(比如电池),优选的,电源可以通过电源管理系统与处理器110逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。
尽管未示出,终端设备还可以包括摄像头。可选地,摄像头在终端设备的上的位置可以为前置的,也可以为后置的,本申请实施例对此不作限定。
另外,尽管未示出,终端设备还可以包括蓝牙模块等,在此不再赘述。
目前,具有语音交互功能的终端设备获取用户发出的语音,对用户发出的语音进行处理,判断用户发出的语音是否是合法语音,即是否为真实语音。若用户发出的语音是合法语音,则根据用户输入的语音执行对应的操作,若用户发出的语音是非法语音,发出非法语音的提示。
举例来说,在一种应用场景中,参见图2-4,终端设备为智能手机,如图2所示,用户向智能手机发出语音“微信付款”,智能手机的传声器采集用户发出的语音,处理器对用户发出的语音进行处理,判断用户发出的语音是否是合法语音。如图3所示,若用户发出的语音为合法语音,则执行对应的操作,例如打开微信付款界面,在显示面板显示对应的付款二维码。如图4所示,若用户发出的语音为非法语音,则通过扬声器发出“非法语音”的语音提示,从而实现用户与智能手机的人机对话。
在又一种应用场景中,参见图5-7,终端设备为智能音箱,如图5所示,用户向智能音箱发出语音“打电话给小明”,智能音箱的传声器采集用户发出的语音,处理器对用户发出的语音进行处理,判断用户发出的语音是否是合法语音。如图6所示,若用户发出的语音是合法语音,则智能音箱执行对应的操作,例如通过WiFi模块或蓝牙模块指示与智能音箱配对的手机拨打对应的号码。如图7所示,若用户发出的语音为非法语音,则通过扬声器发出非法语音”的语音提示,从而实现用户与智能音箱的人机对话。
需要说明的是,终端设备也可以是服务器,例如在又一应用场景中,如图8所示,用户向智能手机发出语音,智能手机将语音传输至服务器,服务器对用户发出的语音进行处理,判断用户发出的语音是否是合法语音。若用户发出的语音是合法语音,则服务器指示智能手机执行对应的操作。若用户发出的语音是非法语音,则服务器指示智能手机发出“非法语音”的语音提示,从而实现用户与服务器的人机对话。
上述方案中,可以对录音重放、语音合成、语音转换和语音模仿等非法语音进行有效的语音防伪,但是也会造成对真实语音误判的情况,例如,用户的真实语音被经常性地或者连续性地被识别为非法语音,影响用户体验。
基于上述技术问题,本申请实施例提供了语音防伪方法,终端设备在判断用户发出语音为非法语音时,对存储的语音模板进行相似度计算,若相似度计算结果满足预设条件,则调整预设的语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型,对被识别为非法语音的用户语音进行重新识别,以降低将合法语音识别为非法语音的概率。
下面结合图1所示的终端设备对本申请实施例提供的语音防伪方法进行详细描述。
在一些应用场景下,终端设备会出现将用户的合法语音连续识别为非法语音的情况。例如,图4所示的智能手机连续多次将用户的合法语音识别为非法语音。下面以该场景为例对本申请实施例提供的语音防伪方法进行描述。
如图9所示,本申请第一实施例提供的语音防伪方法包括:
S101:获取语音数据。
在一种可能的实现方式中,终端设备先获取用户输入的鉴权信息,例如,通过摄像头获取人脸信息,或者通过输入单元获取指纹信息、数字密码、图案密码等,判断 用户输入的鉴权信息与终端设备上存储的解锁信息是否匹配,若匹配,则鉴权成功。若鉴权成功,终端设备开启语音识别应用,例如语音助手或者语音对话软件。如图2所示,若用户发出语音,传声器采集用户发出的语音。
需要说明的是,语音数据也可以直接作为鉴权信息,终端设备获取语音数据后,首先判断用户输入的语音数据的特征信息与终端设备上存储的语音数据的特征信息是否匹配,若匹配,则鉴权成功,若鉴权成功,进一步根据语音数据执行对应的操作。
S102:采用预设的语音识别模型对语音数据进行识别,判断识别结果是否为合法语音。
其中,预设的语音识别模型是根据采集的用户语音和重放语音,采用机器学习或深度学习算法,对分类模型进行训练得到的。其中,重放语音可以是录音、合成语音、模仿语音中的一种或几种。具体地,终端设备获取同一批人的真实语音和对应的重放语音作为训练样本,将真实语音标记为合法语音,将重放语音标记为非法语音。将训练样本输入构建好的分类模型进行训练,根据分类模型的输出结果与对应的训练样本的标记优化分类模型的参数;当分类模型的输出结果与对应的训练样本的标记的差异在预设范围内时,得到分类模型的最优参数,根据最优参数生成预设的语音识别模型。对应地,本实施例中,终端设备获取语音数据后,输入预设的语音识别模型,输出语音数据为合法语音的概率,根据语音数据为合法语音的概率确定出语音数据是否为合法语音。
本申请实施例中,首先设定预设的语音识别模型的置信度阈值,以作为判定语音数据为合法语音和非法语音的依据。例如,采集一批人的真实语音和对应的重放语音作为测试样本,将真实语音标记为合法语音,将重放语音标记为非法语音。将测试样本输入预设的语音识别模型,将预设的语音识别模型的输出概率与对应的标记进行比较,对比较结果进行统计,根据统计结果生成置信度阈值。例如,若输出概率大于0.5的测试样本中,大部分测试样本对应的标记为合法语音;输出概率小于或等于0.5的测试样本中,大部分测试样本对应的标记为非法语音,则设定置信度阈值为0.5。对应地,将语音数据输入预设的语音识别模型,若预设的语音识别模型输出的概率大于或者等于置信度阈值,则对应的语音数据为合法语音,若预设的语音识别模型输出的概率小于置信度阈值,则对应的语音数据为非法语音。
S103:若识别结果为合法语音,验证通过。
具体地,若根据预设的语音识别模型识别出语音数据为合法语音,处理器根据预设的语音解析模型解析出语音数据对应的文本信息,根据文本信息对应的指令,执行对应的操作。例如,如图3所示,若解析出语音数据为微信付款,则显示微信付款界面。
S104:若识别结果为非法语音,生成与所述语音数据对应的语音模板,并存储语音模板。
其中,语音模板是对用户输入的语音数据进行特征提取后的数据,可以反映输入的语音数据的主要特征信息。在一种可能的实现方式中,音频电路将传声器采集的语音数据转化为语音信号,并发送至处理器,处理器将语音信号变换为频谱。将语音数据划分为若干个语音帧,每两个语音帧之间均不重叠,每个语音帧包括若干个频段, 在频谱中提取每个频段的幅度值,即能量。用二维数组E(n,m)表示第n个语音帧在频段m的幅度值,从而可以生成与频谱对应的二维矩阵。对二维矩阵的幅度值做归一化,将归一化后的值与幅度阈值做比较,若某一归一化后的幅度值大于幅度阈值,则将该幅度值置第一预设值,例如置为1,若某一归一化后的幅度值小于或者等于幅度阈值,则将该幅度值置为第二预设值,例如置为0,当对所有的幅度值分别进行处理(置为1或置为0)后,生成与语音数据对应的谱位图,即语音模板。
在另一种可能的实现方式中,生成与频谱对应的二维矩阵后,若当前帧相邻两个频段的幅值差值大于前一帧对应的相邻两个频段的幅值差值,则将当前帧的当前频段的幅值置为第一预设值,例如置为1,否则,置为第二预设值,例如置为0,当对所有的幅度值分别进行处理(置为1或置为0)后,生成与语音数据对应的谱位图,即语音模板。
S105:判断连续将语音数据识别为非法语音的次数是否达到预设非法识别次数。
在一种可能的实现方式中,设定计数器的初始值为0,若输出结果为非法语音,将计数器加1,若输出结果为合法语音,将计数器重新置为0。处理器生成与语音数据对应的语音模板后,判断计数器的值是否达到预设非法识别次数。
S106:若未达到预设非法识别次数,输出非法语音提示,执行S101。
继续上述可能的实现方式,如图4所示,若计数器未达到预设非法识别次数,输出非法语音提示,等待用户再次输入语音数据。
S107:若达到预设非法识别次数,对存储的语音模板进行相似度计算,判断相似度是否小于第一阈值。
继续上述可能的实现方式,若计数器达到预设非法识别次数,将计数器重新置为0,计算存储的语音模板之间的相似度,语音模板之间的相似度即为输入的语音数据之间的相似度。
在一种可能的实现方式中,预设非法识别次数为2,当识别结果为非法语音时,若计数器的值为2,说明语音识别模型连续两次识别出非法语音,计算两个语音模板之间的相似度。
在另一种可能的实现方式中,预设非法识别次数大于2,当识别结果为非法语音时,若计数器的值达到预设非法识别次数,计算每两个语音模板之间的相似度。
在一种可能的实现方式中,通过对两个谱位图的对应元素进行相乘的方法计算两个语音模板之间的相似性。由语音模板的生成过程可知,谱位图为矩阵,矩阵中的每个元素为0或者1,1代表峰值点,对两个谱位图的对应元素进行相乘后,对乘积求和,得到相似度,即谱位图中峰值点匹配的数量。峰值点匹配的数量越多,则说明谱位图距离越近,相似度越高。
若语音模板的数量为两个,当相似度大于或者等于第一阈值时,表明这两个语音模板相似,当相似度小于第一阈值时,表明这两个语音模板不相似。若语音模板的数量大于两个,两两计算语音模板的相似度,若所有的相似度均小于第一阈值,表明多个语音模板之间不相似,否则语音模板之间相似。
S108:若相似度大于或者等于第一阈值,输出非法语音提示,执行S101。
具体地,若相似度大于或者等于第一阈值,即语音模板之间相似,说明输入的语 音数据为重复语音,仍然输出非法语音,等待用户重新输入语音数据。
S109:若相似度小于第一阈值,调整所述预设的语音识别模型的识别参数,重新进行语音识别。
在一种可能的实现方式中,若相似度小于第一阈值,即语音模板不相似,输入语音有可能为真实语音,因此,调整预设的语音识别模型的识别参数。需要重新进行语音识别,以进一步确定语音数据是否是非法语音数据。
其中,调整预设的语音识别模型的识别参数可以是调整预设的语音识别模型中的特征提取参数,例如,语音识别模型中的神经网络模型的参数,也可以是调整预设的语音识别模型的置信度阈值。
可选的,若相似度小于第一阈值,按照设定的步长或百分比降低置信度阈值,例如,若置信度阈值的初始值为0.5,按照0.01的步长或者10%的幅度降低置信度阈值至0.49或0.45,将降低后的置信度阈值作为判定语音数据为合法语音或为非法语音的依据。将最近一次接收到的语音数据输入预设的语音识别模型,重新进行语音识别。若输出的概率大于或者等于降低后的置信度阈值,则判定输入的语音数据为合法语音,根据语音数据执行对应的操作。若输出的概率小于降低后的置信度阈值,则判定输入语音为非法语音。可选的,在输出概率小于降低后的置信度阈值后,按照设定步长再次降低置信度阈值,重新进行语音识别,若置信度阈值降低至设定的最小值时,输出的概率仍小于置信度阈值的最小值,则判定输入语音为非法语音,输出非法语音的提示;若在置信度阈值降低至最小值之前,输出的概率大于或者等于降低后的置信度阈值,则判定输入语音为真实语音,根据语音数据执行对应的操作,否则,输出非法语音的提示。
上述实施例中,将获取的语音数据输入预设的语音识别模型,以判断该语音数据是否为合法语音,若为合法语音,则验证通过,若为非法语音,生成与语音数据对应的语音模板,同时判断连续将语音数据识别为非法语音的次数是否达到预设非法识别次数,若达到预设非法识别次数,计算语音模板之间的相似度。由于重放语音之间的相似度较高,用户每次输入的真实语音的相似度较低,若相似度小于第一阈值,说明输入的语音数据可能为真实语音,可能当前的置信度阈值较高,则降低预设的语音识别模型中的置信度阈值,重新进行语音识别,以防止将真实语音误识别为非法语音,提高用户体验。
如图10所示,本申请第二实施例提供的语音防伪方法,其与第一实施例的区别在于,若相似度小于第一阈值,则执行下面的步骤。
S209:将当前的语音识别模型切换为其他的语音识别模型,重新进行语音识别。
具体地,存储器中存储多个语音识别模型,这些语音识别模型均为神经网络模型,由不同的训练样本和不同结构的分类模型训练得到,对不同用户的语音数据进行语音识别的准确度不同。
本申请实施例中,若相似度小于第一阈值,即语音模板之间不相似,说明输入的语音数据有可能为真实语音,当前的语音识别模型可能不适用于当前用户。将当前的语音识别模型切换为另一个语音识别模型,将最近一次接收到的语音数据输入切换后的语音识别模型,根据切换后的语音识别模型对应的置信度阈值判断输入的语音数据是 否是合法语音,同时将切换后的语音识别模型作为下一次输入的语音数据的语音识别模型。第二语音识别模型的语音识别方法与第一语音识别模型的语音识别方法相同。
在一种可能的实现方式中,对语音模型进行切换后,将与所有语音模板对应的语音数据输入切换后的语音识别模型,计算切换后的语音识别模型输出非法语音的个数。若切换后的语音识别模型输出非法语音的个数小于预设数量,则将切换后的语音识别模型作为当前语音识别模型。在所有语音模板对应的语音数据输入切换后的语音识别模型的输出结果中,获取最近一次语音数据输入切换后的语音识别模型的输出结果,根据该输出结果判断用户输入的语音数据是否是合法语音。同时将切换后的语音识别模型作为用户下一次输入的语音数据的语音识别模型。若切换后的语音识别模型输出非法语音的数量与预设的语音识别模型输出非法语音的数量相同,则仍然将预设的语音识别模型作为当前语音识别模型,将用户输入的语音数据判定为非法语音。
在一种可能的实现方式中,若切换后的语音识别模型输出非法语音的个数与预设的语音识别模型输出非法语音的数量相同,则根据存储器中每个语音识别模型的语音识别结果,重新切换新的语音识别模型。可选的,将所有语音模板对应的语音数据依次输入存储器中的每个语音识别模型,计算每个语音识别模型输出非法语音的个数,将输出非法语音个数最少的语音识别模型作为切换后的语音识别模型,重新进行语音识别。例如,若当前的语音模板数量为3个,存储器中存储有5个语音识别模型,分别为:语音识别模型A、语音识别模型B、语音识别模型C、语音识别模型D和语音识别模型E,语音识别模型A为预设的语音识别模型,即语音识别模型A连续3次将语音数据识别为非法语音。当判定至少两个语音模板之间不相似时,每个语音识别模型均对3个语音模板对应的语音数据进行处理,统计每个语音识别模型的输出结果中非法语音的数量。若语音识别模型B的输出结果中有2个非法语音,语音识别模型C的输出结果中有1个非法语音,语音识别模型D的输出结果中有1个非法语音,语音识别模型E的输出结果中有0个非法语音,即语音识别模型E将最近一次的语音数据识别为合法语音,则将当前的语音识别模型切换为语音识别模型E,用户下一次输入语音数据时,采用语音识别模型E判定语音数据是否是合法语音。
上述实施例中,将获取的语音数据输入预设的语音识别模型,以判断该语音数据是否为合法语音,若为合法语音,则验证通过,若为非法语音,生成与语音数据对应的语音模板,同时判断连续将语音数据识别为非法语音的次数是否达到预设非法识别次数,若达到预设非法识别次数,计算语音模板之间的相似度。由于重放语音之间的相似度较高,用户每次输入的真实语音的相似度较低,若相似度小于第一阈值,说明输入的语音数据可能为真实语音,预设的语音识别模型可能不适用于当前用户的语音数据,则将当前的语音识别模型切换为其他的语音识别模型,重新进行语音识别,防止将真实语音误识别为非法语音,提高用户体验。
如图11所示,本申请第三实施例提供的语音防伪方法包括:
S301:获取语音数据。
S302:采用预设的语音识别模型对语音数据进行识别,判断识别结果是否为合法语音。
S303:若识别结果为合法语音,验证通过。
S301-S303与第一实施例中的S101-S103相同,在此不再赘述。
S304:若识别结果为非法语音,判断连续将语音数据识别为非法语音的次数是否达到预设非法识别次数。
在一种可能的实现方式中,设定计数器的初始值为0,若输出结果为非法语音,将计数器加1,若达到预设非法识别次数或者输出结果为合法语音,将计数器重新置为0。当输出结果为非法语音时,判断计数器的值是否达到预设非法识别次数。
S305:若未达到预设非法识别次数,输出非法语音提示,执行S301。
具体地,若计数器未达到预设非法识别次数,输出非法语音提示,等待用户再次输入语音数据。
S306:若达到预设非法识别次数,生成与被识别为非法语音数据的连续多个语音数据对应的语音模板。
具体地,若输出结果为非法语音,且连续将语音数据识别为非法语音的次数达到预设非法识别次数,则说明终端设备连续多次将语音数据识别为非法语音,生成与被识别为非法语音的连续多个语音数据对应的语音模板。例如,若预设非法识别次数为2,预设的语音识别模型的输出结果为非法语音,且达到预设非法识别次数,说明终端设备连续两次将语音数据识别为非法语音,生成与两次语音数据一一对应的两个语音模板。每个语音数据生成语音模板的方法与第一实施例相同,在此不再赘述。
S307:对语音模板进行相似度计算,判断相似度是否小于第一阈值。
S308:若相似度大于或者等于第一阈值,输出非法语音提示,执行S301。
S309:若相似度是否小于第一阈值,调整所述预设的语音识别模型的识别参数,重新进行语音识别。
S307-S309与第一实施例中的S107-S109相同,在此不再赘述。
上述实施例中,将获取的语音数据输入预设的语音识别模型,以判断该语音数据是否为合法语音,若为合法语音,则验证通过,若为非法语音,判断连续将语音数据识别为非法语音的次数是否达到预设非法识别次数,若达到预设非法识别次数,生成与被识别为非法语音数据的连续多个语音数据对应的语音模板,对语音模板进行相似度计算。由于重放语音之间的相似度较高,用户每次输入的真实语音的相似度较低,若相似度是否小于第一阈值,说明输入的语音数据可能为真实语音,可能当前的语音识别参数不合适,调整当前的语音识别模型的识别参数,重新进行语音识别,以防止将真实语音误识别为非法语音,提高用户体验。
在一些应用场景下,终端设备会出现经常性的将用户的真实语音识别为非法语音的情况,例如,图7所示的智能音箱经常性的将用户的真实语音识别为非法语音。下面以该场景为例对本申请实施例提供的语音防伪方法进行描述。
如图12所示,本申请第四实施例提供的语音防伪方法包括:
S401:获取语音数据。
S402:采用预设的语音识别模型对语音数据进行识别,判断识别结果是否为合法语音。
S403:若识别结果为合法语音,验证通过。
S404:若识别结果为非法语音,生成与所述语音数据对应的语音模板,并存储语 音模板。
S401-S404与第一实施例中的S101-S104相同,在此不再赘述。
S405:判断是否达到预设周期。
具体地,如图7所示,统计现有的智能音箱将合法语音识别为非法语音的频率,设定合适的调整周期,例如,设定调整周期为3天。设定计时器的初始值为0,并开始计时,在生成语音模板后,判断当前计时是否达到调整周期。
S406:若未达到预设周期,输出非法语音提示,执行S401。
S407:若达到预设周期,统计存储的语音模板的数量,判断预设周期内的语音模板的数量是否满足第一预设条件。
具体地,设定第一计数器的初始值为0,第一计数器用于统计输入的语音数据的数量,每输入一次语音数据,将第一计数器加1。设定第二计数器的初始值为0,第二计数器用于统计存储的语音模板的数量,每生成一个语音模板,将第二计数器加1。当达到预设周期时,第一计数器和第二计数器均重新置为0。
在一种可能的实现方式中,第一预设条件包括下列三种情形,即满足下列任一种情形,即满足第一预设条件。
预设周期内的语音模板的数量大于第二阈值;
所述预设周期内的语音模板的数量在所有输入语音数据的数量中的百分比大于第三阈值;
预设周期内的语音模板的数量大于第二阈值,且所述预设周期内的语音模板的数量在所有输入语音数据的数量中的百分比大于第三阈值。
例如,设定第二阈值为5,根据第二计数器的值得到在预设周期内内生成的语音模板数量大于5个,则满足第一预设条件。
又例如,设定第三阈值为1/10,根据第一计数器得到在预设周期内输入的语音数据的数量为30,根据第二计数器得到生成的语音模板的数量为5个,输入的语音数据被识别为非法语音的次数为5,则语音模板的数量在所有输入语音数据的数量中的百分比为1/6,大于第三阈值,满足第一预设条件。同时,在达到调整周期时,重新开始计时。
S408:若预设周期内的语音模板的数量不满足第一预设条件,输出非法语音提示,执行S401。
具体地,若预设周期内的语音模板的数量不满足第一预设条件,说明终端设备将语音数据识别为非法语音的概率较小,进一步说明终端设备将真实语音识别为非法语音的概率较小,说明预设的语音识别模型的语音识别准确度较高,将输入的语音数据判定为非法语音,并输出非法语音提示,等待用户再次输入语音数据。
S409:若预设周期内的语音模板的数量满足第一预设条件,计算相似语音模板的数量。
具体地,若预设周期内的语音模板的数量满足第一预设条件,说明终端设备将语音数据识别为非法语音的概率较大,根据语音模板对应的谱位图计算任意两个语音模板之间的相似度,相似度大于第一阈值的两个语音模板相似,两两比较语音模板是否相似,计算相似语音模板的数量。
S410:根据相似语音模板的数量判断是否满足第二预设条件。
在一种可能的实现方式中,第二预设条件包括下列三种情形,即满足下列任一种情形,即满足第二预设条件。
相似语音模板的数量小于第三阈值;
所述相似语音模板的数量在所有相似度计算的次数中的百分比小于第四阈值;
相似语音模板的数量小于第三阈值,且所述相似语音模板的数量在所有相似度计算的次数中的百分比小于第四阈值。
例如,设定第三阈值为3,在预设周期内生成的语音模板的数量为10个,计算每两个语音模板之间的相似度,若相似语音模板的数量为2个,则小于第三阈值,满足第二预设条件。
又例如,设定第四阈值为1/5,在预设周期内生成的语音模板的数量为10个,计算每两个语音模板之间的相似度,则需要计算45次,若相似语音模板的数量为5个,则相似语音模板的数量在所有相似度计算的次数中的百分比为1/9,小于第四阈值,满足第二预设条件。
S411:若相似语音模板的数量不满足第二预设条件,输出非法语音提示,执行S401。
具体地,若相似语音模板的数量不满足第二预设条件,即相似语音模板的数量大于第三阈值和/或所述相似语音模板的数量在所有相似度计算的次数中的百分比大于第四阈值,说明语音模板之间相似度较高,说明输入的语音数据为重复语音,仍然输出非法语音,等待用户重新输入语音数据。
S412:若相似语音模板的数量满足第二预设条件,调整所述预设的语音识别模型的识别参数,重新进行语音识别。
具体地,若相似语音模板的数量满足第二预设条件,说明语音模板之间相似度不高,输入的语音数据可能为合法语音,调整语音识别模型的识别参数,重新识别语音数据。
其中调整语音识别模型的识别参数,重新进行语音识别的方法与本申请第一实施例中的S109相同,在此不再赘述。
上述实施例中,将获取的语音数据输入预设的语音识别模型,以判断该语音数据是否为合法语音,若为合法语音,则验证通过,若为非法语音,生成与语音数据对应的语音模板,同时判断是否达到预设周期,若达到预设周期,统计存储的语音模板的数量,判断预设周期内的语音模板的数量是否满足第一预设条件。若满足第一预设条件,说明语音识别过程中识别为非法语音的概率较高,计算相似语音模板的数量,根据相似语音模板的数量判断是否满足第二预设条件。若相似语音模板的数量满足第二预设条件,说明输入的语音数据之间的相似度不高,输入的语音数据可能为真实语音,调整语音识别模型的识别参数,重新进行语音识别,防止将真实语音误识别为非法语音,提高用户体验。
如图13所示,本申请第五实施例提供的语音防伪方法,其与第三实施例的区别在于,若相似语音模板的数量不满足预设条件,则执行:
S512:将当前的语音识别模型切换为其他的语音识别模型。
其中,S512与本申请第二实施例中S209相同,在此不再赘述。
上述实施例中,将获取的语音数据输入预设的语音识别模型,以判断该语音数据是否为合法语音,若为合法语音,则验证通过,若为非法语音,生成与语音数据对应的语音模板,同时判断是否达到预设周期,若达到预设周期,统计存储的语音模板的数量,判断预设周期内的语音模板的数量是否满足第一预设条件。若满足第一预设条件,说明语音识别过程中识别为非法语音的概率较高,计算相似语音模板的数量,根据相似语音模板的数量判断是否满足第二预设条件。若相似语音模板的数量满足第二预设条件,说明输入的语音数据之间的相似度不高,输入的语音数据可能为真实语音,预设的语音识别模型可能不适用于当前用户的语音数据,则将当前的语音识别模型切换为其他的语音识别模型,防止将真实语音误识别为非法语音,提高用户体验。
应理解,上述实施例中各的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
对应于上文实施例所述的语音防伪方法,图14示出了本申请实施例提供的语音防伪装置的结构框图,为了便于说明,仅示出了与本申请实施例相关的部分。
参照图14,该语音防伪装置包括:
计算模块10,用于对存储的语音模板进行相似度计算;其中,每一个语音模板是根据预设的语音识别模型每次识别出的非法语音数据生成的;
调整模块20,用于若相似度计算结果满足预设条件,则调整所述预设的语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型。
在一种可能的实现方式中,所述计算模块10具体用于:
当所述语音识别模型连续两次识别出非法语音数据时,对存储的两个语音模板进行相似度计算,其中,所述两个语音模板是根据所述连续两次识别出的非法语音数据生成的,每个语音模板对应一个非法语音数据;
相应的,所述调整模块20具体用于:
若两个语音模板的相似度小于第一阈值,则调整所述语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型。
在一种可能的实现方式中,所述计算模块10具体用于:
当达到预设周期时,统计存储的语音模板的数量;
当所述预设周期内的语音模板的数量大于第二阈值和/或所述预设周期内的语音模板的数量在所有输入语音数据的数量中的百分比大于第三阈值时,对存储的语音模板进行两两相似度计算;
相应的,所述调整模块20具体用于:
若根据每两个语音模板之间的相似度计算出的相似语音模板的数量小于第三阈值和/或所述相似语音模板的数量在所有相似度计算的次数中的百分比小于第四阈值,则调整所述语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型。
在一种可能的实现方式中,所述语音防伪装置还包括:
获取模块,用于获取语音数据;
识别模块,用于采用所述预设的语音识别模型对所述语音数据进行识别;
模板生成模块,用于若识别结果为非法语音数据,则生成与所述非法语音数据对 应的语音模板,并存储所述语音模板。
在一种可能的实现方式中,所述模板生成模块具体用于:
将所述非法语音数据对应的语音信号转换为语音频谱;
生成与所述语音频谱对应的二维矩阵,所述二维矩阵中的元素表示预设帧的语音在预设频带的能量;
根据所述二维矩阵生成所述语音模板。
在一种可能的实现方式中,所述模板生成模块还用于:
对所述二维矩阵进行归一化;
将归一化后的二维矩阵中大于能量阈值的元素置为第一预设值,将所述归一化后的二维矩阵中小于或者等于所述能量阈值的元素置为第二预设值,得到所述语音模板。
在一种可能的实现方式中,所述计算模块10还用于:
根据所述归一化后的二维矩阵计算每两个语音模板中第一预设值的匹配数量;
根据所述第一预设值的匹配数量计算每两个语音模板之间的相似度。
在一种可能的实现方式中,所述调整模块20还用于:
采用识别参数调整后的语音识别模型或者切换后的新的语音识别模型对所述语音数据进行重新识别。
在一种可能的实现方式中,所述调整模块20还用于:
按照预设规则降低所述预设的语音识别模型的置信度阈值。
在一种可能的实现方式中,所述语音防伪装置还包括:
若相似度计算结果满足预设条件,输出非法语音提示。
需要说明的是,上述装置/单元之间的信息交互、执行过程等内容,由于与本申请方法实施例基于同一构思,其具体功能及带来的技术效果,具体可参见方法实施例部分,此处不再赘述。
上述系统中单元、模块的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
本申请实施例还提供了一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,所述计算机程序被处理器执行时实现可实现上述各个方法实施例中的。
本申请实施例提供了一种计算机程序产品,当计算机程序产品在移动终端上运行时,使得移动终端执行时实现可实现上述各个方法实施例中的。
在上述实施例中,对各个实施例的描述都各有侧重,某个实施例中没有详述或记载的部分,可以参见其它实施例的相关描述。
本领域普通技术人员可以清楚地了解到,为了描述的方便和简洁,仅以上述各功能单元、模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能单元、模块完成,即将所述装置的内部结构划分成不同的功能单元或模块,以完成以上描述的全部或者部分功能。实施例中的各功能单元、模块可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中,上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。另外,各功能单元、模块的具体名称也只是为了便于相互区分,并不用于限制本申请的保护范围。
本领域普通技术人员可以意识到,结合本文中所公开的实施例描述的各示例的单元及算法,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
在本申请所提供的实施例中,应该理解到,所揭露的装置/网络设备和方法,可以通过其它的方式实现。例如,以上所描述的装置/网络设备实施例仅仅是示意性的,例如,所述模块或单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通讯连接可以是通过一些接口,装置或单元的间接耦合或通讯连接,可以是电性,机械或其它的形式。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
所述集成的单元如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请实现上述实施例方法中的全部或部分流程,可以通过计算机程序来指令相关的硬件来完成,所述的计算机程序可存储于一计算机可读存储介质中,该计算机程序在被处理器执行时,可实现上述各个方法实施例的。其中,所述计算机程序包括计算机程序代码,所述计算机程序代码可以为源代码形式、对象代码形式、可执行文件或某些中间形式等。所述计算机可读介质至少可以包括:能够将计算机程序代码携带到拍照装置/终端设备的任何实体或装置、记录介质、计算机存储器、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、电载波信号、电信信号以及软件分发介质。例如U盘、移动硬盘、磁碟或者光盘等。
以上所述实施例仅用以说明本申请的技术方案,而非对其限制;尽管参照前述实施例对本申请进行了详细的说明,本领域的普通技术人员应当理解:其依然可以对前述各实施例所记载的技术方案进行修改,或者对其中部分技术特征进行等同替换;而这些修改或者替换,并不使相应技术方案的本质脱离本申请各实施例技术方案的精神和范围,均应包含在本申请的保护范围之内。

Claims (13)

  1. 一种语音防伪方法,其特征在于,包括:
    对存储的语音模板进行相似度计算;其中,每一个语音模板是根据预设的语音识别模型每次识别出的非法语音数据生成的;
    若相似度计算结果满足预设条件,则调整所述预设的语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型。
  2. 如权利要求1所述的语音防伪方法,其特征在于,所述对存储的语音模板进行相似度计算包括:
    当所述语音识别模型连续两次识别出非法语音数据时,对存储的两个语音模板进行相似度计算,其中,所述两个语音模板是根据所述连续两次识别出的非法语音数据生成的,每个语音模板对应一个非法语音数据;
    相应的,若相似度计算结果满足预设条件,则调整所述语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型包括:
    若两个语音模板的相似度小于第一阈值,则调整所述语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型。
  3. 如权利要求1或2所述的语音防伪方法,其特征在于,所述对存储的语音模板进行相似度计算包括:
    当达到预设周期时,统计存储的语音模板的数量;
    当所述预设周期内的语音模板的数量大于第二阈值和/或所述预设周期内的语音模板的数量在所有输入语音数据的数量中的百分比大于第三阈值时,对存储的语音模板进行两两相似度计算;
    相应的,若相似度计算结果满足预设条件,则调整所述语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型包括:
    若根据每两个语音模板之间的相似度计算出的相似语音模板的数量小于第三阈值和/或所述相似语音模板的数量在所有相似度计算的次数中的百分比小于第四阈值,则调整所述语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型。
  4. 如权利要求1至3任一项所述的语音防伪方法,其特征在于,所述对存储的语音模板进行相似度计算之前,所述方法还包括:
    获取语音数据;
    采用所述预设的语音识别模型对所述语音数据进行识别;
    若识别结果为非法语音数据,则生成与所述非法语音数据对应的语音模板,并存储所述语音模板。
  5. 如权利要求4所述的语音防伪方法,其特征在于,所述生成与所述非法语音数据对应的语音模板,包括:
    将所述非法语音数据对应的语音信号转换为语音频谱;
    生成与所述语音频谱对应的二维矩阵,所述二维矩阵中的元素表示预设帧的语音在预设频带的能量;
    根据所述二维矩阵生成所述语音模板。
  6. 如权利要求5所述的语音防伪方法,其特征在于,所述根据所述二维矩阵生成所述语音模板,包括:
    对所述二维矩阵进行归一化处理;
    将归一化处理后的二维矩阵中大于能量阈值的元素设置为第一预设值,将所述归一化处理后的二维矩阵中小于或者等于所述能量阈值的元素设置为第二预设值,将设置后的二维矩阵作为所述语音模板。
  7. 如权利要求6所述的语音防伪方法,其特征在于,所述对存储的语音模板进行相似度计算,包括:
    计算每两个语音模板中所述第一预设值的匹配数量;
    根据所述第一预设值的匹配数量确定每两个语音模板之间的相似度。
  8. 如权利要求1至7任一项所述的语音防伪方法,其特征在于,调整所述预设的语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型之后,所述方法还包括:
    采用识别参数调整后的语音识别模型或者切换后的新的语音识别模型对所述语音数据进行重新识别。
  9. 如权利要求1至8任一项所述的语音防伪方法,其特征在于,所述调整所述预设的语音识别模型的识别参数,包括:
    按照预设规则降低所述预设的语音识别模型的置信度阈值。
  10. 如权利要求1至9任一项所述的语音防伪方法,其特征在于,所述方法还包括:
    若相似度计算结果不满足预设条件,输出非法语音提示。
  11. 一种语音防伪装置,其特征在于,包括:
    计算模块,用于对存储的语音模板进行相似度计算;其中,每一个语音模板是根据预设的语音识别模型每次识别出的非法语音数据生成的;
    调整模块,用于若相似度计算结果满足预设条件,则调整所述预设的语音识别模型的识别参数或者将当前的语音识别模型切换为其他的语音识别模型。
  12. 一种终端设备,包括存储器、处理器以及存储在所述存储器中并可在所述处理器上运行的计算机程序,其特征在于,所述处理器执行所述计算机程序时实现如权利要求1至10任一项所述的方法。
  13. 一种计算机可读存储介质,所述计算机可读存储介质存储有计算机程序,其特征在于,所述计算机程序被处理器执行时实现如权利要求1至10任一项所述的方法。
PCT/CN2020/124766 2019-11-27 2020-10-29 语音防伪方法、装置、终端设备及存储介质 WO2021103913A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201911183043.7A CN112863523B (zh) 2019-11-27 2019-11-27 语音防伪方法、装置、终端设备及存储介质
CN201911183043.7 2019-11-27

Publications (1)

Publication Number Publication Date
WO2021103913A1 true WO2021103913A1 (zh) 2021-06-03

Family

ID=75985702

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/124766 WO2021103913A1 (zh) 2019-11-27 2020-10-29 语音防伪方法、装置、终端设备及存储介质

Country Status (2)

Country Link
CN (1) CN112863523B (zh)
WO (1) WO2021103913A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116011460A (zh) * 2023-02-13 2023-04-25 安徽龙鼎信息科技有限公司 一种基于自然语言处理的物流运力匹配方法和系统

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1588538A (zh) * 2004-09-29 2005-03-02 上海交通大学 用于嵌入式自动语音识别系统的训练方法
US20050125226A1 (en) * 2003-10-29 2005-06-09 Paul Magee Voice recognition system and method
GB2541466A (en) * 2015-08-21 2017-02-22 Validsoft Uk Ltd Replay attack detection
CN108039176A (zh) * 2018-01-11 2018-05-15 广州势必可赢网络科技有限公司 一种防录音攻击的声纹认证方法、装置及门禁系统
CN108806695A (zh) * 2018-04-17 2018-11-13 平安科技(深圳)有限公司 自更新的反欺诈方法、装置、计算机设备和存储介质
CN108882242A (zh) * 2018-06-08 2018-11-23 国家计算机网络与信息安全管理中心 基于声纹识别和意图理解技术的反诈骗系统的自学习方法
CN109547466A (zh) * 2018-12-17 2019-03-29 北京车和家信息技术有限公司 基于机器学习提高风险感知能力的方法及装置、计算机设备和存储介质
CN109934114A (zh) * 2019-02-15 2019-06-25 重庆工商大学 一种手指静脉模板生成与更新算法及系统
CN110148425A (zh) * 2019-05-14 2019-08-20 杭州电子科技大学 一种基于完整局部二进制模式的伪装语音检测方法
CN110491391A (zh) * 2019-07-02 2019-11-22 厦门大学 一种基于深度神经网络的欺骗语音检测方法


Also Published As

Publication number Publication date
CN112863523A (zh) 2021-05-28
CN112863523B (zh) 2023-05-16

Similar Documents

Publication Publication Date Title
Wang et al. User authentication on mobile devices: Approaches, threats and trends
US10789343B2 (en) Identity authentication method and apparatus
CN108702354B (zh) 基于传感器信号的活跃度确定
US10042995B1 (en) Detecting authority for voice-driven devices
CN109558512B (zh) 一种基于音频的个性化推荐方法、装置和移动终端
Wang et al. Secure your voice: An oral airflow-based continuous liveness detection for voice assistants
CN110647730A (zh) 经由单独的处理路径进行单通道输入多因素认证
TW201907330A (zh) 身份認證的方法、裝置、設備及資料處理方法
CN105429969B (zh) 一种用户身份验证方法与设备
Thomas et al. A broad review on non-intrusive active user authentication in biometrics
US20150278574A1 (en) Processing a Fingerprint for Fingerprint Matching
CN113129867B (zh) 语音识别模型的训练方法、语音识别方法、装置和设备
TW202029062A (zh) 網路優化方法及裝置、圖像處理方法及裝置、儲存媒體
WO2021213490A1 (zh) 一种身份验证方法、装置和电子设备
Jiang et al. Securing liveness detection for voice authentication via pop noises
WO2021103913A1 (zh) 语音防伪方法、装置、终端设备及存储介质
CN111835522A (zh) 一种音频处理方法及装置
CN110728993A (zh) 一种变声识别方法及电子设备
CN117077099A (zh) 一种基于声学传感数据的可信合法用户认证系统
Yu et al. Mobile devices based eavesdropping of handwriting
Rathore et al. Scanning the voice of your fingerprint with everyday surfaces
US11893098B2 (en) Authenticating a user subvocalizing a displayed text
Telo ANALYZING THE EFFECTIVENESS OF BEHAVIORAL BIOMETRICS IN AUTHENTICATION: A COMPREHENSIVE REVIEW
KR102622350B1 (ko) 전자 장치 및 그 제어 방법
CN109102810B (zh) 声纹识别方法和装置

Legal Events

Date Code Title Description

121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20891843; Country of ref document: EP; Kind code of ref document: A1)

NENP Non-entry into the national phase (Ref country code: DE)

122 Ep: pct application non-entry in european phase (Ref document number: 20891843; Country of ref document: EP; Kind code of ref document: A1)