WO2021179714A1 - 人工合成语音检测方法、装置、计算机设备及存储介质 - Google Patents

人工合成语音检测方法、装置、计算机设备及存储介质

Info

Publication number
WO2021179714A1
WO2021179714A1 PCT/CN2020/135177 CN2020135177W
Authority
WO
WIPO (PCT)
Prior art keywords
voice data
voice
deep convolutional
generation network
credibility
Prior art date
Application number
PCT/CN2020/135177
Other languages
English (en)
French (fr)
Inventor
曾振
王健宗
程宁
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021179714A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • This application relates to the field of artificial intelligence technology, specifically to the field of speech recognition technology, and in particular to an artificially synthesized speech detection method, apparatus, computer device, and storage medium.
  • Speech recognition is an important direction in the field of artificial intelligence speech. In recent years, with the improvement of computer hardware capabilities and the continuous refinement of deep learning models, speech synthesis technology has developed very well: its synthesis speed keeps getting faster and its ability to imitate human voices keeps getting stronger. Therefore, fake speech recognition technology has gradually become a research hotspot in recent years.
  • This application provides a method, device, computer equipment, and storage medium for artificially synthesized speech detection, which can recognize the authenticity of the speech data received by the user based on the confrontation generation network, and help the user better improve their awareness of preventing speech fraud.
  • To solve the above technical problem, a technical solution adopted in this application is to provide an artificially synthesized speech detection method, including: collecting the voice data received by a user; inputting the voice data into a pre-trained deep convolutional confrontation generation network, performing framing and windowing processing on the voice data, and extracting audio features of the voice data; recognizing and analyzing the audio features to obtain the credibility of the voice data; and judging the authenticity of the voice data according to the credibility.
  • To solve the above technical problem, another technical solution adopted in this application is to provide an artificially synthesized speech detection device, including:
  • a collection module, used to collect the voice data received by a user;
  • a feature extraction module, configured to input the voice data into a pre-trained deep convolutional confrontation generation network, perform framing and windowing processing on the voice data, and extract audio features of the voice data;
  • a detection module, used to recognize and analyze the audio features and obtain the credibility of the voice data; and
  • a discrimination module, used to judge the authenticity of the voice data according to the credibility.
  • To solve the above technical problem, yet another technical solution adopted in this application is to provide a computer device, including a processor and a memory coupled to the processor, where the memory stores program instructions for implementing the following steps: collecting the voice data received by a user; inputting the voice data into a pre-trained deep convolutional confrontation generation network, performing framing and windowing processing on the voice data, and extracting audio features of the voice data; recognizing and analyzing the audio features to obtain the credibility of the voice data; and judging the authenticity of the voice data according to the credibility; and the processor is configured to execute the program instructions stored in the memory.
  • To solve the above technical problem, still another technical solution adopted in this application is to provide a storage device storing a program file capable of implementing the following steps: collecting the voice data received by a user; inputting the voice data into a pre-trained deep convolutional confrontation generation network, performing framing and windowing processing on the voice data, and extracting audio features of the voice data; recognizing and analyzing the audio features to obtain the credibility of the voice data; and judging the authenticity of the voice data according to the credibility.
  • the beneficial effects of this application are: the authenticity of the voice data received by the user is recognized through the confrontation generation network, which helps users better improve their awareness of preventing voice fraud; and the confrontation generation network is subsequently and continuously optimized based on user feedback data, so that the authenticity of the voice data received by the user can be judged more accurately.
  • at the same time, the voice data is used to optimize the confrontation generation network only when the user agrees to the feedback, so that the user's privacy is protected on the basis of security precautions.
  • FIG. 1 is a schematic flowchart of a method for detecting artificially synthesized speech according to a first embodiment of the present application
  • FIG. 2 is a schematic flowchart of a method for detecting artificially synthesized speech according to a second embodiment of the present application
  • FIG. 3 is a schematic flowchart of a method for detecting artificially synthesized speech according to a third embodiment of the present application.
  • FIG. 4 is a schematic flowchart of a method for detecting artificially synthesized speech according to a fourth embodiment of the present application.
  • FIG. 5 is a schematic flowchart of a method for detecting artificially synthesized speech according to a fifth embodiment of the present application.
  • FIG. 6 is a schematic diagram of the structure of the artificially synthesized speech detection device according to the first embodiment of the present application.
  • FIG. 7 is a schematic diagram of the structure of the artificially synthesized speech detection device according to the second embodiment of the present application.
  • FIG. 8 is a schematic diagram of the structure of the artificially synthesized speech detection device according to the third embodiment of the present application.
  • FIG. 9 is a schematic diagram of the structure of the artificially synthesized speech detection device according to the fourth embodiment of the present application.
  • FIG. 10 is a schematic diagram of the architecture of the artificially synthesized speech detection device according to the fifth embodiment of the present application.
  • FIG. 11 is a schematic structural diagram of a terminal device according to an embodiment of the present application.
  • FIG. 12 is a schematic structural diagram of a storage medium according to an embodiment of the present application.
  • Fig. 1 is a schematic flowchart of a method for detecting artificially synthesized speech according to a first embodiment of the present application. It should be noted that if there is substantially the same result, the method of the present application is not limited to the sequence of the process shown in FIG. 1. As shown in Figure 1, the method includes steps:
  • Step S101 Collect the voice data received by the user.
  • In step S101, a local synthesized speech detection model is installed for users who have activated the anti-fraud function, and the local synthesized speech detection model first collects all voice data received by the user.
  • Step S102 Input the voice data into the pre-trained deep convolutional confrontation generation network, perform framing and windowing processing on the voice data, and extract audio features of the voice data.
  • In step S102, this embodiment processes the voice data by framing and windowing, divides the voice data into several voice frames, and then extracts the audio features of each voice frame. Later voice data processing requires a stationary voice signal; a segment of voice signal is not stationary as a whole, but it is locally stationary, so a segment of voice data is divided into frames. In addition, discontinuities appear at the beginning and end of each voice frame, so the more frames the data is divided into, the larger the error relative to the original signal; windowing makes the framed voice signal continuous again, as illustrated in the sketch below.
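  • The following minimal sketch illustrates the framing and windowing step; the sampling rate, frame length, hop size, Hamming window, and the log-spectrum feature used here are illustrative assumptions rather than parameters fixed by this application:

```python
import numpy as np

def frame_and_window(signal: np.ndarray, frame_len: int = 400, hop: int = 160) -> np.ndarray:
    """Split a 1-D speech signal into overlapping frames and apply a Hamming window.

    For 16 kHz audio, frame_len=400 and hop=160 give 25 ms frames with a 10 ms hop
    (illustrative values only).
    """
    if len(signal) < frame_len:
        signal = np.pad(signal, (0, frame_len - len(signal)))
    n_frames = 1 + (len(signal) - frame_len) // hop
    window = np.hamming(frame_len)
    return np.stack([signal[i * hop : i * hop + frame_len] * window
                     for i in range(n_frames)])          # (n_frames, frame_len)

def log_spectral_features(frames: np.ndarray, n_fft: int = 512) -> np.ndarray:
    """One possible audio feature: log magnitude spectrum of each windowed frame."""
    spectrum = np.abs(np.fft.rfft(frames, n=n_fft, axis=1))
    return np.log(spectrum + 1e-8)                        # (n_frames, n_fft // 2 + 1)
```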
  • Step S103 Recognizing and analyzing the audio features and obtaining the credibility of the voice data.
  • In step S103, the discrimination network in the pre-trained deep convolutional confrontation generation network is used to recognize the audio features and obtain the credibility of the voice data.
  • Step S104 Judging the authenticity of the voice data according to the credibility.
  • In step S104, the credibility is compared with a preset threshold; when the credibility is lower than the preset threshold, it is determined that the voice data is a fake voice; when the credibility is higher than the preset threshold, it is determined that the voice data is a real voice. A small sketch of this decision step follows.
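  • A minimal sketch of steps S103 and S104, assuming the per-frame features are scored by an already-trained discrimination network (abstracted here as any callable returning per-frame probabilities of being real) and that the clip-level credibility is the mean of those scores; the mean aggregation and the 0.5 default threshold are choices of this sketch, not values fixed by this application:

```python
import numpy as np
from typing import Callable, Tuple

def judge_authenticity(frame_features: np.ndarray,
                       discriminator: Callable[[np.ndarray], np.ndarray],
                       threshold: float = 0.5) -> Tuple[float, str]:
    """Return (credibility, verdict) for one utterance.

    frame_features: (n_frames, feature_dim) array from the framing step.
    discriminator:  maps frame features to per-frame probabilities of being real.
    threshold:      preset threshold; at or below it the voice data is judged fake.
    """
    per_frame_scores = discriminator(frame_features)    # shape: (n_frames,)
    credibility = float(np.mean(per_frame_scores))      # clip-level credibility
    verdict = "real" if credibility > threshold else "fake"
    return credibility, verdict
```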
  • the artificially synthesized speech detection method of the first embodiment of the present application recognizes the authenticity of the speech data received by the user through the pre-trained deep convolutional confrontation generation network, and helps the user to better improve the awareness of preventing speech fraud.
  • Fig. 2 is a schematic flowchart of a method for detecting artificially synthesized speech according to a second embodiment of the present application. It should be noted that if there is substantially the same result, the method of the present application is not limited to the sequence of the process shown in FIG. 2. As shown in Figure 2, the method includes steps:
  • Step S201 Collect the voice data received by the user.
  • In this embodiment, step S201 in FIG. 2 is similar to step S101 in FIG. 1 and, for brevity, is not repeated here.
  • Step S202 Receive random noise and generate synthesized speech through random noise.
  • Step S203 Use the synthesized speech and the preset real speech to train the deep convolutional confrontation generation network to obtain a pre-trained deep convolutional confrontation generation network.
  • In step S203, the structure of the deep convolutional confrontation generation network includes a generation network and a discrimination network: the generation network is used to generate synthesized speech, and the discrimination network is used to judge the authenticity of speech data. During training, the goal of the generation network is to generate synthesized speech that is close to real speech, while the goal of the discrimination network is to distinguish the synthesized speech from real speech, so that the generation network and the discrimination network form a dynamic "game process".
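  • The application does not specify the layer configuration of the two networks, so the following PyTorch sketch is only one plausible instantiation of a deep convolutional generation/discrimination pair operating on fixed-length waveform segments; the layer sizes, kernel widths, and the 16000-sample segment length are assumptions:

```python
import torch
import torch.nn as nn

class Generator(nn.Module):
    """Maps a random noise vector to a fixed-length synthetic waveform segment."""
    def __init__(self, noise_dim: int = 100):
        super().__init__()
        self.fc = nn.Linear(noise_dim, 64 * 250)   # seed feature map of shape (64, 250)
        self.net = nn.Sequential(
            nn.ConvTranspose1d(64, 32, kernel_size=8, stride=4, padding=2), nn.ReLU(),
            nn.ConvTranspose1d(32, 16, kernel_size=8, stride=4, padding=2), nn.ReLU(),
            nn.ConvTranspose1d(16, 1, kernel_size=8, stride=4, padding=2), nn.Tanh(),
        )

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        x = self.fc(z).view(-1, 64, 250)
        return self.net(x)                          # (batch, 1, 16000), waveform in [-1, 1]

class Discriminator(nn.Module):
    """Outputs the probability that an input waveform segment is real speech."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=25, stride=4), nn.LeakyReLU(0.2),
            nn.Conv1d(16, 32, kernel_size=25, stride=4), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 64, kernel_size=25, stride=4), nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool1d(1), nn.Flatten(),
            nn.Linear(64, 1), nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x).squeeze(1)               # (batch,) probabilities of "real"
```

  • A 1-D convolutional pair is used here because this sketch operates directly on waveform segments; a spectrogram-based 2-D variant would be an equally valid reading of "deep convolutional".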
  • This embodiment first calculates the expected value of the synthesized speech being predicted as real and the expected value of the preset real speech being predicted as fake; the sum of the two is then used as the loss function of the deep convolutional confrontation generation network, and the network is optimized accordingly.
  • Specifically, the loss function of the deep convolutional confrontation generation network is calculated according to the following formula (the standard adversarial objective, consistent with the variable definitions below):
  • V(D, G) = E_{X~P_data}[log D(X)] + E_z[log(1 - D(G(z)))]
  • where E(*) represents the expected value, X represents the preset real speech, P_data represents the distribution of the real speech, D(X) represents the output of the discrimination network, z represents the noise used to generate the synthesized speech, G(z) represents the output of the generation network, and D(G(z)) represents the probability that the discrimination network D judges the synthesized speech generated by the generation network G to be real.
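  • A hedged sketch of one training iteration implementing the above objective in its equivalent binary cross-entropy form, paired with the illustrative Generator and Discriminator above; batch construction and optimizer settings are assumptions:

```python
import torch
import torch.nn.functional as F

def gan_training_step(generator, discriminator, real_batch, g_opt, d_opt,
                      noise_dim: int = 100):
    """One adversarial update: D learns to separate real from synthesized speech,
    and G learns to produce synthesized speech that D scores as real."""
    batch_size = real_batch.size(0)
    ones = torch.ones(batch_size)
    zeros = torch.zeros(batch_size)

    # Discriminator step: maximize E[log D(X)] + E[log(1 - D(G(z)))].
    z = torch.randn(batch_size, noise_dim)
    fake_batch = generator(z).detach()
    d_loss = (F.binary_cross_entropy(discriminator(real_batch), ones)
              + F.binary_cross_entropy(discriminator(fake_batch), zeros))
    d_opt.zero_grad()
    d_loss.backward()
    d_opt.step()

    # Generator step: make the discriminator judge synthesized speech as real.
    z = torch.randn(batch_size, noise_dim)
    g_loss = F.binary_cross_entropy(discriminator(generator(z)), ones)
    g_opt.zero_grad()
    g_loss.backward()
    g_opt.step()

    return d_loss.item(), g_loss.item()
```

  • For example, the two optimizers could be created as torch.optim.Adam(generator.parameters(), lr=2e-4) and torch.optim.Adam(discriminator.parameters(), lr=2e-4); these hyperparameters are illustrative only.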
  • Step S204 Input the voice data into the pre-trained deep convolutional confrontation generation network, perform framing and windowing processing on the voice data, and extract audio features of the voice data.
  • In this embodiment, step S204 in FIG. 2 is similar to step S102 in FIG. 1 and, for brevity, is not repeated here.
  • Step S205 Recognizing and analyzing the audio features and obtaining the credibility of the voice data.
  • In this embodiment, step S205 in FIG. 2 is similar to step S103 in FIG. 1 and, for brevity, is not repeated here.
  • Step S206 Judging the authenticity of the voice data according to the credibility.
  • In this embodiment, step S206 in FIG. 2 is similar to step S104 in FIG. 1 and, for brevity, is not repeated here.
  • On the basis of the first embodiment, the artificially synthesized speech detection method of the second embodiment of the present application uses the sum of the expected value of the synthesized speech being predicted as real and the expected value of the preset real speech being predicted as fake as the loss function of the deep convolutional confrontation generation network and optimizes the network accordingly, which improves the accuracy and reliability of recognition by the deep convolutional confrontation generation network.
  • Fig. 3 is a schematic flowchart of a method for detecting artificially synthesized speech according to a third embodiment of the present application. It should be noted that if there are substantially the same results, the method of the present application is not limited to the sequence of the process shown in FIG. 3. As shown in Figure 3, the method includes steps:
  • Step S301 Collect the voice data received by the user.
  • In this embodiment, step S301 in FIG. 3 is similar to step S101 in FIG. 1 and, for brevity, is not repeated here.
  • Step S302 Input the voice data into the pre-trained deep convolutional confrontation generation network, perform framing and windowing processing on the voice data, and extract audio features of the voice data.
  • In this embodiment, step S302 in FIG. 3 is similar to step S102 in FIG. 1 and, for brevity, is not repeated here.
  • Step S303 Recognizing and analyzing the audio features and obtaining the credibility of the voice data.
  • In this embodiment, step S303 in FIG. 3 is similar to step S103 in FIG. 1 and, for brevity, is not repeated here.
  • Step S304 Judging the authenticity of the voice data according to the credibility.
  • In this embodiment, step S304 in FIG. 3 is similar to step S104 in FIG. 1 and, for brevity, is not repeated here. When it is determined that the voice data is a fake voice, step S305 is performed; when it is determined that the voice data is a real voice, step S306 is performed.
  • Step S305 Send an early warning signal to the user by means of text messages or short messages.
  • In step S305, the user is reminded by text message or SMS that the voice data is a fake voice and that, if account transactions are involved, the user should operate with caution and beware of fraud.
  • Step S306 Delete the voice data.
  • On the basis of the first embodiment, the artificially synthesized speech detection method of the third embodiment of the present application sends an early warning signal to the user by text message or SMS when the voice data is determined to be a fake voice, which further improves the user's awareness of preventing voice fraud.
  • FIG. 4 is a schematic flowchart of a method for detecting artificially synthesized speech according to a fourth embodiment of the present application. It should be noted that if there is substantially the same result, the method of the present application is not limited to the sequence of the process shown in FIG. 4. As shown in Figure 4, the method includes the steps:
  • Step S401 Collect the voice data received by the user.
  • In this embodiment, step S401 in FIG. 4 is similar to step S101 in FIG. 1 and, for brevity, is not repeated here.
  • Step S402 input the voice data into the pre-trained deep convolutional confrontation generation network, perform framing and windowing processing on the voice data, and extract audio features of the voice data.
  • In this embodiment, step S402 in FIG. 4 is similar to step S102 in FIG. 1 and, for brevity, is not repeated here.
  • Step S403 Recognizing and analyzing the audio features and obtaining the credibility of the voice data.
  • In this embodiment, step S403 in FIG. 4 is similar to step S103 in FIG. 1 and, for brevity, is not repeated here.
  • Step S404 Judging the authenticity of the voice data according to the credibility.
  • In this embodiment, step S404 in FIG. 4 is similar to step S104 in FIG. 1 and, for brevity, is not repeated here. When it is determined that the voice data is a fake voice, step S405 is executed; when it is determined that the voice data is a real voice, step S408 is executed.
  • Step S405 Send an early warning signal to the user by means of text messages or short messages.
  • In step S405, the user is reminded by text message or SMS that the voice data is a fake voice and that, if account transactions are involved, the user should operate with caution and beware of fraud. Step S406 is executed after step S405.
  • Step S406 Obtain the user's opinion on the judgment result of the feedback voice data.
  • In step S406, if the user agrees to feed back the judgment result, step S407 is executed.
  • Step S407 Send the voice data to the server, and use the voice data to optimize the deep convolutional confrontation generation network within a preset interval.
  • In step S407, specifically, the expected value of the preset real speech being predicted as fake and the expected value of the speech data determined to be fake being predicted as real are first calculated; the sum of the two is then used as the loss function of the deep convolutional confrontation generation network, and the network is optimized accordingly.
  • In this embodiment, the deep convolutional confrontation generation network is further trained using the speech data determined to be fake speech, and this training does not depend on the generation network. The loss function of the deep convolutional confrontation generation network is calculated according to the following formula (reconstructed to match the variable definitions below, since the original formula images are not reproduced in this text):
  • L(D) = E_{X~P_data}[log(1 - D(X))] + E_{X̃}[log D(X̃)]
  • where E(*) represents the expected value, X represents the preset real speech, P_data represents the distribution of the preset real speech, D(X) represents the output of the discrimination network, X̃ represents the speech data determined to be fake speech, and D(X̃) represents the probability that the discrimination network judges the speech data determined to be fake speech to be real.
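  • A minimal sketch of this feedback-driven optimization, assuming the fake-labeled clips that users agreed to feed back are stored on the server as waveform tensors and reusing the illustrative Discriminator above; as stated, only the discrimination network is updated and the generation network is not involved:

```python
import torch
import torch.nn.functional as F

def finetune_discriminator(discriminator, real_batch, reported_fake_batch, d_opt):
    """Periodic discriminator-only update from user-feedback data.

    real_batch:          preset real speech segments.
    reported_fake_batch: speech segments judged fake whose feedback the user approved.
    The two binary cross-entropy terms penalize real speech being predicted fake and
    fake-labeled speech being predicted real, matching the loss described above.
    """
    loss = (F.binary_cross_entropy(discriminator(real_batch),
                                   torch.ones(real_batch.size(0)))
            + F.binary_cross_entropy(discriminator(reported_fake_batch),
                                     torch.zeros(reported_fake_batch.size(0))))
    d_opt.zero_grad()
    loss.backward()
    d_opt.step()
    return loss.item()
```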
  • If the user does not agree to feed back the judgment result, step S408 is executed.
  • Step S408 Delete the voice data.
  • On the basis of the third embodiment, the artificially synthesized speech detection method of the fourth embodiment of the present application continuously optimizes the confrontation generation network through user feedback data, so as to judge the authenticity of the voice data received by the user more accurately.
  • At the same time, the voice data is used to optimize the confrontation generation network only when the user agrees to the feedback, which protects the privacy of users on the basis of security precautions.
  • Fig. 5 is a schematic flowchart of a method for detecting artificially synthesized speech according to a fifth embodiment of the present application. It should be noted that if there is substantially the same result, the method of the present application is not limited to the sequence of the process shown in FIG. 5. As shown in Figure 5, the method includes steps:
  • Step S501 Collect the voice data received by the user.
  • In this embodiment, step S501 in FIG. 5 is similar to step S101 in FIG. 1 and, for brevity, is not repeated here.
  • Step S502 Sampling and preprocessing the voice data.
  • In step S502, the collected voice data is sampled at a specific sampling rate and bit depth, and preprocessing such as noise reduction and removal of leading and trailing silence is performed to improve the quality of the voice data while retaining the complete voice data.
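  • One possible realization of step S502, assuming the librosa library is available; the 16 kHz sampling rate, the silence-trimming threshold, and the simple pre-emphasis used here in place of a full noise-reduction stage are all illustrative choices:

```python
import librosa
import numpy as np

def sample_and_preprocess(path: str, sr: int = 16000) -> np.ndarray:
    """Load audio at a fixed sampling rate, trim leading/trailing silence,
    and apply light pre-emphasis as a simple stand-in for noise reduction."""
    signal, _ = librosa.load(path, sr=sr, mono=True)       # resample to 16 kHz mono
    signal, _ = librosa.effects.trim(signal, top_db=30)    # drop head/tail silence
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])  # pre-emphasis
    return signal
```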
  • Step S503 Input the voice data into the pre-trained deep convolutional confrontation generation network, perform framing and windowing processing on the voice data, and extract audio features of the voice data.
  • In this embodiment, step S503 in FIG. 5 is similar to step S102 in FIG. 1 and, for brevity, is not repeated here.
  • Step S504 Recognizing and analyzing the audio features and obtaining the credibility of the voice data.
  • In this embodiment, step S504 in FIG. 5 is similar to step S103 in FIG. 1 and, for brevity, is not repeated here.
  • Step S505 Judging the authenticity of the voice data according to the credibility.
  • In this embodiment, step S505 in FIG. 5 is similar to step S104 in FIG. 1 and, for brevity, is not repeated here.
  • the artificially synthesized speech detection method of the fifth embodiment of the present application improves the quality of the speech data and retains the complete speech data by sampling and preprocessing the speech data.
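  • Putting the illustrative helpers above together, a minimal end-to-end sketch of the detection flow (preprocess, frame and window, extract features, score and threshold) might look as follows; every function name and parameter here is an assumption carried over from the preceding sketches, not the concrete implementation of this application:

```python
def detect_synthesized_speech(path: str, discriminator, threshold: float = 0.5):
    """Run the full illustrative pipeline on one received audio clip."""
    signal = sample_and_preprocess(path)                    # sampling and preprocessing
    frames = frame_and_window(signal)                       # framing and windowing
    features = log_spectral_features(frames)                # audio feature extraction
    return judge_authenticity(features, discriminator, threshold)  # credibility + verdict
```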
  • FIG. 6 is a schematic diagram of the structure of the artificially synthesized speech detection device according to the first embodiment of the present application.
  • the device 60 includes a collection module 61, a feature extraction module 62, a detection module 63 and a discrimination module 64.
  • the collection module 61 is used to collect voice data received by the user.
  • the feature extraction module 62 is configured to input the voice data into the pre-trained deep convolutional confrontation generation network, perform framing and windowing processing on the voice data, and extract audio features of the voice data.
  • the detection module 63 is used to identify and analyze audio features and obtain the credibility of the voice data.
  • the discrimination module 64 is used to discriminate the authenticity of the voice data according to the credibility.
  • the discrimination module 64 includes a comparison unit, a first discrimination unit, and a second discrimination unit.
  • the comparison unit is used to compare the credibility with a preset threshold; the first discrimination unit is used to determine that the voice data is a fake voice when the credibility is lower than the preset threshold; the second discrimination unit is used to determine that the voice data is a real voice when the credibility is higher than the preset threshold.
  • FIG. 7 is a schematic diagram of the structure of the artificially synthesized speech detection device according to the second embodiment of the present application.
  • the device 70 includes a collection module 71, a generation module 72, a training module 73, a feature extraction module 74, a detection module 75 and a discrimination module 76.
  • the collection module 71 is used to collect voice data received by the user.
  • the generating module 72 is configured to receive random noise and generate synthesized speech through the random noise.
  • the training module 73 is used to train the deep convolutional confrontation generation network by using synthetic speech and preset real speech to obtain a pre-trained deep convolutional confrontation generation network.
  • the feature extraction module 74 is configured to input the voice data into the pre-trained deep convolutional confrontation generation network, perform framing and windowing processing on the voice data, and extract audio features of the voice data.
  • the detection module 75 is used to identify and analyze the audio features and obtain the credibility of the voice data.
  • the discrimination module 76 is used to discriminate the authenticity of the voice data according to the credibility.
  • FIG. 8 is a schematic diagram of the structure of the artificially synthesized speech detection device according to the third embodiment of the present application.
  • the device 80 includes a collection module 81, a feature extraction module 82, a detection module 83, a discrimination module 84, a sending module 85 and a deletion module 86.
  • the collection module 81 is used to collect voice data received by the user.
  • the feature extraction module 82 is used for inputting the voice data into the pre-trained deep convolutional confrontation generation network, framing and windowing the voice data, and extracting audio features of the voice data.
  • the detection module 83 is used to identify and analyze audio features and obtain the credibility of the voice data.
  • the discrimination module 84 is used to discriminate the authenticity of the voice data according to the credibility.
  • the sending module 85 is used to send an early warning signal to the user in the form of text message or short message when the discrimination module 84 determines that the voice data is a false voice.
  • the deleting module 86 is used to delete the voice data when the discrimination module 84 determines that the voice data is a real voice.
  • FIG. 9 is a schematic diagram of the structure of the artificially synthesized speech detection device according to the fourth embodiment of the present application.
  • the device 90 includes a collection module 91, a feature extraction module 92, a detection module 93, a discrimination module 94, a sending module 95, a deletion module 96, an obtaining module 97, and a transmission and optimization module 98.
  • the collection module 91 is used to collect voice data received by the user.
  • the feature extraction module 92 is configured to input the voice data into the pre-trained deep convolutional confrontation generation network, perform framing and windowing processing on the voice data, and extract audio features of the voice data.
  • the detection module 93 is used to identify and analyze audio features and obtain the credibility of the voice data.
  • the discrimination module 94 is used to discriminate the authenticity of the voice data according to the credibility.
  • the sending module 95 is configured to send an early warning signal to the user in the form of text information or short message when the discrimination module 94 determines that the voice data is a false voice.
  • the deleting module 96 is used for deleting the voice data when the discrimination module 94 determines that the voice data is a real voice.
  • the obtaining module 97 is used to obtain the user's opinion on the judgment result of the feedback voice data.
  • the transmission and optimization module 98 is configured to send voice data to the server if the user agrees to feed back the judgment result, and use the voice data to optimize the deep convolutional confrontation generation network within a preset interval.
  • FIG. 10 is a schematic diagram of the structure of the artificially synthesized speech detection device according to the fifth embodiment of the present application.
  • the device 10 includes a collection module 11, a sampling and preprocessing module 12, a feature extraction module 13, a detection module 14 and a discrimination module 15.
  • the collection module 11 is used to collect voice data received by the user.
  • the sampling and preprocessing module 12 is used for sampling and preprocessing the voice data.
  • the feature extraction module 13 is used to input the voice data into the pre-trained deep convolutional confrontation generation network, perform framing and windowing processing on the voice data, and extract audio features of the voice data.
  • the detection module 14 is used to identify and analyze audio features and obtain the credibility of the voice data.
  • the discrimination module 15 is used to discriminate the authenticity of the voice data according to the credibility.
  • FIG. 11 is a schematic structural diagram of a computer device according to an embodiment of the application.
  • the computer device 11 includes a processor 111 and a memory 112 coupled to the processor 111.
  • the memory 112 stores program instructions for implementing the artificially synthesized speech detection method described in any of the foregoing embodiments.
  • the processor 111 is configured to execute program instructions stored in the memory 112 to realize artificially synthesized speech detection.
  • the processor 111 may also be referred to as a CPU (Central Processing Unit).
  • the processor 111 may be an integrated circuit chip with signal processing capabilities.
  • the processor 111 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • the general-purpose processor may be a microprocessor or the processor may also be any conventional processor or the like.
  • FIG. 12 is a schematic structural diagram of a storage medium according to an embodiment of the application.
  • the storage medium of this embodiment of the application stores a program file 121 that can implement all the above methods.
  • the program file 121 can be stored in the above storage medium in the form of a software product, and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) or a processor to execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media that can store program code, or a terminal device such as a computer, server, mobile phone, or tablet.
  • the storage medium may be non-volatile or volatile.
  • the disclosed system, device, and method can be implemented in other ways.
  • the device embodiments described above are merely illustrative; for example, the division of units is only a logical function division, and there may be other divisions in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the displayed or discussed mutual coupling or direct coupling or communication connection may be indirect coupling or communication connection through some interfaces, devices or units, and may be in electrical, mechanical or other forms.
  • the functional units in the various embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units may be integrated into one unit.
  • the above-mentioned integrated unit can be implemented in the form of hardware or software functional unit.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A synthesized speech detection method, an apparatus (10, 60, 70, 80, 90), a computer device (11), and a storage medium, relating to the field of artificial intelligence technology. The synthesized speech detection method includes: collecting voice data received by a user (S101); inputting the voice data into a pre-trained deep convolutional confrontation generation network, performing framing and windowing processing on the voice data, and extracting audio features of the voice data (S102); recognizing and analyzing the audio features to obtain the credibility of the voice data (S103); and judging the authenticity of the voice data according to the credibility (S104). The authenticity of the voice data received by the user is recognized through the confrontation generation network, helping the user better improve their awareness of preventing voice fraud.

Description

人工合成语音检测方法、装置、计算机设备及存储介质
本申请要求于2020年10月21日提交中国专利局、申请号为202011134504.4、申请名称为“人工合成语音检测方法、装置、计算机设备及存储介质”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。
技术领域
本申请涉及人工智能技术领域,具体涉及语音识别技术领域,特别是涉及人工合成语音检测方法、装置、计算机设备及存储介质。
背景技术
语音识别是人工智能语音领域中的一个重要方向,近年来,随着计算机硬件能力提升以及深度学习模型的不断完善,语音合成技术已经有了非常完善的发展。其合成速度越来越快,模拟人声的能力也越来越强。因此,虚假语音识别技术近年来也逐渐成为研究的热点。
发明人意识到目前对于虚假语音识别的论文和产品依然很少,还没有十分具有突破性的技术与进展。所以,急需一种用于预防聊天语音诈骗的、基于语音合成及声音转换技术产生的数字语音与真实语音的判别技术设计合成语音检测系统。
发明内容
本申请提供人工合成语音检测方法、装置、计算机设备及存储介质,能够基于对抗生成网络对用户接收到的语音数据的真实性进行识别,帮助用户更好地提高对语音诈骗的防范意识。
为解决上述技术问题,本申请采用的一个技术方案是:提供一种人工合成语音检测方法,包括:
采集用户接收到的语音数据;
将所述语音数据输入预训练深度卷积对抗生成网络中,对所述语音数据进行分帧、加窗处理并提取所述语音数据的音频特征;
对所述音频特征进行识别分析并获得所述语音数据的可信度;
根据所述可信度判别所述语音数据的真实性。
为解决上述技术问题,本申请采用的另一个技术方案是:提供一种人工合成语音检测装置,包括:
采集模块,用于采集用户接收到的语音数据;
特征提取模块,用于将所述语音数据输入预训练深度卷积对抗生成网络中,对所述语音数据进行分帧、加窗处理并提取所述语音数据的音频特征;
检测模块,用于对所述音频特征进行识别分析并获得所述语音数据的可信度;
判别模块,用于根据所述可信度判别所述语音数据的真实性。
为解决上述技术问题,本申请采用的再一个技术方案是:提供一种计算机设备,包括处理器、与所述处理器耦接的存储器,所述存储器存储有用于实现以下步骤的程序指令,所述步骤包括:
采集用户接收到的语音数据;
将所述语音数据输入预训练深度卷积对抗生成网络中,对所述语音数据进行分帧、加窗处理并提取所述语音数据的音频特征;
对所述音频特征进行识别分析并获得所述语音数据的可信度;
根据所述可信度判别所述语音数据的真实性;
所述处理器用于执行所述存储器存储的程序指令。为解决上述技术问题,本申请采用的再一个技术方案是:提供一种存储装置,存储有能够实现如下步骤的程序文件,所述步骤包括:
采集用户接收到的语音数据;
将所述语音数据输入预训练深度卷积对抗生成网络中,对所述语音数据进行分帧、加窗处理并提取所述语音数据的音频特征;
对所述音频特征进行识别分析并获得所述语音数据的可信度;
根据所述可信度判别所述语音数据的真实性。
本申请的有益效果是:通过对抗生成网络对用户接收到的语音数据的真实性进行识别,帮助用户更好地提高对语音诈骗的防范意识;并在后续根据用户反馈数据不断优化对抗生成网络,从而更加准确的判别用户接收到的语音数据的准确性,同时仅在用户同意反馈的情况下才将语音数据用于优化对抗生成网络,在安全防范的基础上保护了用户的隐私安全。
附图说明
图1是本申请第一实施例的人工合成语音检测方法的流程示意图;
图2是本申请第二实施例的人工合成语音检测方法的流程示意图;
图3是本申请第三实施例的人工合成语音检测方法的流程示意图;
图4是本申请第四实施例的人工合成语音检测方法的流程示意图;
图5是本申请第五实施例的人工合成语音检测方法的流程示意图;
图6是本申请第一实施例的人工合成语音检测装置的架构示意图;
图7是本申请第二实施例的人工合成语音检测装置的架构示意图;
图8是本申请第三实施例的人工合成语音检测装置的架构示意图;
图9是本申请第四实施例的人工合成语音检测装置的架构示意图;
图10是本申请第五实施例的人工合成语音检测装置的架构示意图;
图11是本申请实施例的终端设备的结构示意图;
图12是本申请实施例的存储介质的结构示意图。
具体实施方式
下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅是本申请的一部分实施例,而不是全部的实施例。基于本申请中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都属于本申请保护的范围。
在本文中提及“实施例”意味着,结合实施例描述的特定特征、结构或特性可以包含在本申请的至少一个实施例中。在说明书中的各个位置出现该短语并不一定均是指相同的实施例,也不是与其它实施例互斥的独立的或备选的实施例。本领域技术人员显式地和隐式地理解的是,本文所描述的实施例可以与其它实施例相结合。
图1是本申请第一实施例的人工合成语音检测方法的流程示意图。需注意的是,若有实质上相同的结果,本申请的方法并不以图1所示的流程顺序为限。如图1所示,该方法包括步骤:
步骤S101:采集用户接收到的语音数据。
在步骤S101中,对开通防诈骗功能的用户安装本地合成语音检测模型,本地合成语音检测模型首先采集用户接收到的所有语音数据。
步骤S102:将语音数据输入预训练深度卷积对抗生成网络中,对语音数据进行分帧、加窗处理并提取语音数据的音频特征。
在步骤S102中,本实施例采用分帧、加窗方法对语音数据进行处理,将语音数据分为若干语音帧,再提取每一个语音帧的音频特征。因为后期语音数据处理需要平稳的语音信号,而一端语音信号整体看是不平稳的,但是局部信号是平稳的,所以将一段语音数据进行分帧处理,另外,由于每一语音帧的起始端和末尾端会出现不连续的地方,所以分帧越多,与原始信号的误差也就越大,用加窗的方法能够使分帧后的语音信号变得连续。
步骤S103:对音频特征进行识别分析并获得语音数据的可信度。
在步骤S103中,采用预训练深度卷积对抗生成网络中的判别网络对音频特征进行识别并获得语音数据的可信度。
步骤S104:根据可信度判别语音数据的真实性。
在步骤S104中,将可信度与预设阈值作比对;当可信度低于预设阈值时,确定语音数据为虚假语音;当可信度高于预设阈值时,确定语音数据为真实语音。
本申请第一实施例的人工合成语音检测方法通过预训练深度卷积对抗生成网络对用户接收到的语音数据的真实性进行识别,帮助用户更好地提高对语音诈骗的防范意识。
图2是本申请第二实施例的人工合成语音检测方法的流程示意图。需注意的是,若有实质上相同的结果,本申请的方法并不以图2所示的流程顺序为限。如图2所示,该方法包括步骤:
步骤S201:采集用户接收到的语音数据。
在本实施例中,图2中的步骤S201和图1中的步骤S101类似,为简约起见,在此不再赘述。
步骤S202:接收随机噪声并通过随机噪声生成合成语音。
步骤S203:利用合成语音和预设真实语音对深度卷积对抗生成网络进行训练,获得预训练深度卷积对抗生成网络。
在步骤S203中,深度卷积对抗生成网络的结构包括生成网络和判别网络,生成网络用于生成合成语音,判别网络用于判别语音数据的真实性;在训练的过程中,生成网络的目标是生成接近真实的合成语音,判别网络的目标是把合成语音和真实语音区别开来,以使生成网络和判别网络形成一个动态的“博弈过程”。本实施例首先计算合成语音预测为真实的期望值以及预设真实语音预测为虚假的期望值;然后将合成语音预测为真实的期望值以及预设真实语音预测为虚假的期望值之和作为深度卷积对抗生成网络的损失函数并对深度卷积对抗生成网络进行优化。
具体地,本实施例采用合成语音和预设真实语音对深度卷积对抗生成网络进行训练,深度卷积对抗生成网络的损失函数按照如下公式进行计算:
V(D, G) = E_{X~P_data}[log D(X)] + E_z[log(1 - D(G(z)))]
其中，E(*)表示期望值，X表示预设真实语音，P_data表示真实语音的分布，D(X)表示判别网络的输出，z表示用于生成合成语音的噪声，G(z)表示生成网络的输出，D(G(z))表示判别网络D判断生成网络G生成的合成语音为真实的概率。
步骤S204:将语音数据输入预训练深度卷积对抗生成网络中,对语音数据进行分帧、加窗处理并提取语音数据的音频特征。
在本实施例中,图2中的步骤S204和图1中的步骤S102类似,为简约起见, 在此不再赘述。
步骤S205:对音频特征进行识别分析并获得语音数据的可信度。
在本实施例中,图2中的步骤S205和图1中的步骤S103类似,为简约起见,在此不再赘述。
步骤S206:根据可信度判别语音数据的真实性。
在本实施例中,图2中的步骤S206和图1中的步骤S104类似,为简约起见,在此不再赘述。
本申请第二实施例的人工合成语音检测方法在第一实施例的基础上,采用合成语音预测为真实的期望值以及预设真实语音预测为虚假的期望值之和作为深度卷积对抗生成网络的损失函数并对深度卷积对抗生成网络进行优化,提高深度卷积对抗生成网络识别的准确性和可靠性。
图3是本申请第三实施例的人工合成语音检测方法的流程示意图。需注意的是,若有实质上相同的结果,本申请的方法并不以图3所示的流程顺序为限。如图3所示,该方法包括步骤:
步骤S301:采集用户接收到的语音数据。
在本实施例中,图3中的步骤S301和图1中的步骤S101类似,为简约起见,在此不再赘述。
步骤S302:将语音数据输入预训练深度卷积对抗生成网络中,对语音数据进行分帧、加窗处理并提取语音数据的音频特征。
在本实施例中,图3中的步骤S302和图1中的步骤S102类似,为简约起见,在此不再赘述。
步骤S303:对音频特征进行识别分析并获得语音数据的可信度。
在本实施例中,图3中的步骤S303和图1中的步骤S103类似,为简约起见,在此不再赘述。
步骤S304:根据可信度判别语音数据的真实性。
在本实施例中,图3中的步骤S304和图1中的步骤S104类似,为简约起见,在此不再赘述。当确定语音数据为虚假语音时,执行步骤S305,当确定语音数据为真实语音时,执行步骤S306。
步骤S305:通过文本信息或短信的方式向用户发送预警信号。
在步骤S305中,通过文本信息或短信的方式提醒用户该语音数据为虚假语音,若涉及账户交易内容,请谨慎操作,谨防诈骗。
步骤S306:删除语音数据。
本申请第三实施例的人工合成语音检测方法在第一实施例的基础上,通过在确定语音数据为虚假数据时,通过文本信息或短信的方式向用户发送预警信号,进一步提高用户对语音诈骗的防范意识。
图4是本申请第四实施例的人工合成语音检测方法的流程示意图。需注意的是,若有实质上相同的结果,本申请的方法并不以图4所示的流程顺序为限。如图4所示,该方法包括步骤:
步骤S401:采集用户接收到的语音数据。
在本实施例中,图4中的步骤S401和图1中的步骤S101类似,为简约起见,在此不再赘述。
步骤S402:将语音数据输入预训练深度卷积对抗生成网络中,对语音数据进行分帧、加窗处理并提取语音数据的音频特征。
在本实施例中,图4中的步骤S402和图1中的步骤S102类似,为简约起见, 在此不再赘述。
步骤S403:对音频特征进行识别分析并获得语音数据的可信度。
在本实施例中,图4中的步骤S403和图1中的步骤S103类似,为简约起见,在此不再赘述。
步骤S404:根据可信度判别语音数据的真实性。
在本实施例中,图4中的步骤S404和图1中的步骤S104类似,为简约起见,在此不再赘述。当确定语音数据为虚假语音时,执行步骤S405,当确定语音数据为真实语音时,执行步骤S408。
步骤S405:通过文本信息或短信的方式向用户发送预警信号。
在步骤S405中,通过文本信息或短信的方式提醒用户该语音数据为虚假语音,若涉及账户交易内容,请谨慎操作,谨防诈骗。在步骤S405之后执行步骤S406。
步骤S406:获取用户对反馈语音数据的判别结果的意见。
在步骤S406中,若用户同意反馈判别结果,执行步骤S407。
步骤S407:将语音数据发送至服务器,在预设间隔时间内采用语音数据优化深度卷积对抗生成网络。
在步骤S407中,具体地,首先计算预设真实语音预测为虚假的期望值以及确定为虚假语音的语音数据预测为真实的期望值;然后将预设真实语音预测为虚假的期望值以及确定为虚假语音的语音数据预测为真实的期望值之和作为深度卷积对抗生成网络的损失函数并对深度卷积对抗生成网络进行优化。本实施例采用确定为虚假语音的语音数据对深度卷积对抗生成网络进一步训练,该训练不依赖于生成网络。深度卷积对抗生成网络的损失函数按照如下公式进行计算:
L(D) = E_{X~P_data}[log(1 - D(X))] + E_{X̃}[log D(X̃)]
其中，E(*)表示期望值，X表示预设真实语音，P_data表示预设真实语音的分布，D(X)表示判别网络的输出，X̃表示确定为虚假语音的语音数据，D(X̃)表示判别网络判别确定为虚假语音的语音数据为真实的概率。
若用户不同意反馈判别结果,执行步骤S408。
步骤S408:删除语音数据。
本申请第四实施例的人工合成语音检测方法在第三实施例的基础上,通过用户反馈数据不断优化对抗生成网络,从而更加准确的判别用户接收到的语音数据的准确性,同时仅在用户同意反馈的情况下才将语音数据用于优化对抗生成网络,在安全防范的基础上保护了用户的隐私安全。
图5是本申请第五实施例的人工合成语音检测方法的流程示意图。需注意的是,若有实质上相同的结果,本申请的方法并不以图5所示的流程顺序为限。如图5所示,该方法包括步骤:
步骤S501:采集用户接收到的语音数据。
在本实施例中,图5中的步骤S501和图1中的步骤S101类似,为简约起见,在此不再赘述。
步骤S502:对语音数据进行采样及预处理。
在步骤S502中,通过特定采样率和采样位数对采集到的语音数据进行收集,并进行降噪、过滤首尾静音等预处理,提高语音数据的质量并保留完整的语音数据。
步骤S503:将语音数据输入预训练深度卷积对抗生成网络中,对语音数据进行分帧、加窗处理并提取语音数据的音频特征。
在本实施例中,图5中的步骤S503和图1中的步骤S102类似,为简约起见,在此不再赘述。
步骤S504:对音频特征进行识别分析并获得语音数据的可信度。
在本实施例中,图5中的步骤S504和图1中的步骤S103类似,为简约起见,在此不再赘述。
步骤S505:根据可信度判别语音数据的真实性。
在本实施例中,图5中的步骤S505和图1中的步骤S104类似,为简约起见,在此不再赘述。
本申请第五实施例的人工合成语音检测方法在第一实施例的基础上,通过对语音数据进行采样及预处理,提高语音数据的质量并保留完整的语音数据。
图6是本申请第一实施例的人工合成语音检测装置的结构示意图。如图6所示,该装置60包括采集模块61、特征提取模块62、检测模块63以及判别模块64。
采集模块61用于采集用户接收到的语音数据。
特征提取模块62用于将语音数据输入预训练深度卷积对抗生成网络中,对语音数据进行分帧、加窗处理并提取语音数据的音频特征。
检测模块63用于对音频特征进行识别分析并获得语音数据的可信度。
判别模块64用于根据可信度判别语音数据的真实性。
可选地,判别模块64包括比对单元、第一判别单元和第二判别单元。比对单元用于将可信度与预设阈值作比对;第一判别单元用于当可信度低于预设阈值时,确定语音数据为虚假语音;第二判别单元用于当可信度高于预设阈值时,确定语音数据为真实语音。
图7是本申请第二实施例的人工合成语音检测装置的结构示意图。如图7所示,该装置70包括采集模块71、生成模块72、训练模块73、特征提取模块74、检测模块75以及判别模块76。
采集模块71用于采集用户接收到的语音数据。
生成模块72用于接收随机噪声并通过随机噪声生成合成语音。
训练模块73用于利用合成语音和预设真实语音对深度卷积对抗生成网络进行训练,获得预训练深度卷积对抗生成网络。
特征提取模块74用于将语音数据输入预训练深度卷积对抗生成网络中,对语音数据进行分帧、加窗处理并提取语音数据的音频特征。
检测模块75用于对音频特征进行识别分析并获得语音数据的可信度。
判别模块76用于根据可信度判别语音数据的真实性。
图8是本申请第三实施例的人工合成语音检测装置的结构示意图。如图8所示,该装置80包括采集模块81、特征提取模块82、检测模块83、判别模块84、发送模块85以及删除模块86。
采集模块81用于采集用户接收到的语音数据。
特征提取模块82用于将语音数据输入预训练深度卷积对抗生成网络中,对语音数据进行分帧、加窗处理并提取语音数据的音频特征。
检测模块83用于对音频特征进行识别分析并获得语音数据的可信度。
判别模块84用于根据可信度判别语音数据的真实性。
发送模块85用于当判别模块84确定语音数据为虚假语音时,通过文本信息或短信的方式向用户发送预警信号。
删除模块86用于当判别模块84确定语音数据为真实语音时,删除语音数据。
图9是本申请第四实施例的人工合成语音检测装置的结构示意图。如图9所示,该装置90包括采集模块91、特征提取模块92、检测模块93、判别模块94、发送模块95、删除模块96、获取模块97以及传输及优化模块98。
采集模块91用于采集用户接收到的语音数据。
特征提取模块92用于将语音数据输入预训练深度卷积对抗生成网络中,对语音数据进行分帧、加窗处理并提取语音数据的音频特征。
检测模块93用于对音频特征进行识别分析并获得语音数据的可信度。
判别模块94用于根据可信度判别语音数据的真实性。
发送模块95用于当判别模块94确定语音数据为虚假语音时,通过文本信息或短信的方式向用户发送预警信号。
删除模块96用于当判别模块94确定语音数据为真实语音时,删除语音数据。
获取模块97用于获取用户对反馈语音数据的判别结果的意见。
传输及优化模块98用于若用户同意反馈判别结果,将语音数据发送至服务器,在预设间隔时间内采用语音数据优化深度卷积对抗生成网络。
图10是本申请第五实施例的人工合成语音检测装置的结构示意图。如图10所示,该装置10包括采集模块11、采样及预处理模块12、特征提取模块13、检测模块14以及判别模块15。
采集模块11用于采集用户接收到的语音数据。
采样及预处理模块12用于对语音数据进行采样及预处理。
特征提取模块13用于将语音数据输入预训练深度卷积对抗生成网络中,对语音数据进行分帧、加窗处理并提取语音数据的音频特征。
检测模块14用于对音频特征进行识别分析并获得语音数据的可信度。
判别模块15用于根据可信度判别语音数据的真实性。
请参阅图11,图11为本申请实施例的计算机设备的结构示意图。如图11所示,该计算机设备11包括处理器111及和处理器111耦接的存储器112。
存储器112存储有用于实现上述任一实施例所述的人工合成语音检测方法的程序指令。
处理器111用于执行存储器112存储的程序指令以实现人工合成语音检测。
其中,处理器111还可以称为CPU(Central Processing Unit,中央处理单元)。处理器111可能是一种集成电路芯片,具有信号的处理能力。处理器111还可以是通用处理器、数字信号处理器(DSP)、专用集成电路(ASIC)、现成可编程门阵列(FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
参阅图12,图12为本申请实施例的存储介质的结构示意图。本申请实施例的存储介质存储有能够实现上述所有方法的程序文件121,其中,该程序文件121可以以软件产品的形式存储在上述存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)或处理器(processor)执行本申请各个实施方式所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(ROM,Read-Only Memory)、随机存取存储器(RAM,Random Access Memory)、磁碟或者光盘等各种可以存储程序代码的介质,或者是计算机、服务器、手机、平板等终端设备。所述存储介质可以是非易失性,也可以是易失性。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统,装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。上述集成的单元既可以采用硬件的形式实现,也可以采用软件功能单元的形式实现。
以上仅为本申请的实施方式,并非因此限制本申请的专利范围,凡是利用本申请说明书及附图内容所作的等效结构或等效流程变换,或直接或间接运用在其他相关的技术领域,均同理包括在本申请的专利保护范围内。

Claims (20)

  1. 一种人工合成语音检测方法,其中,包括:
    采集用户接收到的语音数据;
    将所述语音数据输入预训练深度卷积对抗生成网络中,对所述语音数据进行分帧、加窗处理并提取所述语音数据的音频特征;
    对所述音频特征进行识别分析并获得所述语音数据的可信度;
    根据所述可信度判别所述语音数据的真实性。
  2. 根据权利要求1所述的方法,其中,所述将所述语音数据输入预训练深度卷积对抗生成网络中的步骤之前,还包括:
    接收随机噪声并通过所述随机噪声生成合成语音;
    利用所述合成语音和预设真实语音对深度卷积对抗生成网络进行训练,获得预训练深度卷积对抗生成网络。
  3. 根据权利要求2所述的方法,其中,所述利用所述合成语音和预设真实语音对深度卷积对抗生成网络进行训练,获得预训练深度卷积对抗生成网络的步骤还包括:
    计算所述合成语音预测为真实的期望值以及所述预设真实语音预测为虚假的期望值;
    将所述合成语音预测为真实的期望值以及所述预设真实语音预测为虚假的期望值之和作为所述深度卷积对抗生成网络的损失函数并对所述深度卷积对抗生成网络进行优化。
  4. 根据权利要求1所述的方法,其中,所述根据所述可信度判别所述语音数据的真实性的步骤包括:
    将所述可信度与预设阈值作比对;
    当可信度低于预设阈值时,确定所述语音数据为虚假语音;
    当可信度高于预设阈值时,确定所述语音数据为真实语音。
  5. 根据权利要求4所述的方法,其中,在所述根据所述可信度判别所述语音数据的真实性的步骤之后,还包括:
    当确定所述语音数据为虚假语音时,通过文本信息或短信的方式向用户发送预警信号;
    当确定所述语音数据为真实语音时,删除所述语音数据。
  6. 根据权利要求5所述的方法,其中,在所述通过文本信息或短信的方式向用户发送预警信号的步骤之后,还包括:
    获取用户对反馈所述语音数据的判别结果的意见;
    若用户同意反馈所述判别结果,将所述语音数据发送至服务器,在预设间隔时间内采用所述语音数据优化所述深度卷积对抗生成网络;
    若用户不同意反馈所述判别结果,删除所述语音数据。
  7. 根据权利要求6所述的方法,其中,所述在预设间隔时间内采用所述语音数据优化所述深度卷积对抗生成网络的步骤还包括:
    计算所述预设真实语音预测为虚假的期望值以及确定为虚假语音的所述语音数据预测为真实的期望值;
    将所述预设真实语音预测为虚假的期望值以及确定为虚假语音的所述语音数据预测为真实的期望值之和作为所述深度卷积对抗生成网络的损失函数并对所述深 度卷积对抗生成网络进行优化。
  8. 一种人工合成语音检测装置,其中,包括:
    采集模块,用于采集用户接收到的语音数据;
    特征提取模块,用于将所述语音数据输入预训练深度卷积对抗生成网络中,对所述语音数据进行分帧、加窗处理并提取所述语音数据的音频特征;
    检测模块,用于对所述音频特征进行识别分析并获得所述语音数据的可信度;
    判别模块,用于根据所述可信度判别所述语音数据的真实性。
  9. 一种计算机设备,包括:处理器、与所述处理器耦接的存储器,其中,
    所述存储器存储有用于实现以下步骤的程序指令,所述步骤包括:
    采集用户接收到的语音数据;
    将所述语音数据输入预训练深度卷积对抗生成网络中,对所述语音数据进行分帧、加窗处理并提取所述语音数据的音频特征;
    对所述音频特征进行识别分析并获得所述语音数据的可信度;
    根据所述可信度判别所述语音数据的真实性;
    所述处理器用于执行所述存储器存储的程序指令。
  10. 根据权利要求9所述的计算机设备,其中,所述将所述语音数据输入预训练深度卷积对抗生成网络中的步骤之前,还包括:
    接收随机噪声并通过所述随机噪声生成合成语音;
    利用所述合成语音和预设真实语音对深度卷积对抗生成网络进行训练,获得预训练深度卷积对抗生成网络。
  11. 根据权利要求10所述的计算机设备,其中,所述利用所述合成语音和预设真实语音对深度卷积对抗生成网络进行训练,获得预训练深度卷积对抗生成网络的步骤还包括:
    计算所述合成语音预测为真实的期望值以及所述预设真实语音预测为虚假的期望值;
    将所述合成语音预测为真实的期望值以及所述预设真实语音预测为虚假的期望值之和作为所述深度卷积对抗生成网络的损失函数并对所述深度卷积对抗生成网络进行优化。
  12. 根据权利要求9所述的计算机设备,其中,所述根据所述可信度判别所述语音数据的真实性的步骤包括:
    将所述可信度与预设阈值作比对;
    当可信度低于预设阈值时,确定所述语音数据为虚假语音;
    当可信度高于预设阈值时,确定所述语音数据为真实语音。
  13. 根据权利要求12所述的计算机设备,其中,在所述根据所述可信度判别所述语音数据的真实性的步骤之后,还包括:
    当确定所述语音数据为虚假语音时,通过文本信息或短信的方式向用户发送预警信号;
    当确定所述语音数据为真实语音时,删除所述语音数据。
  14. 根据权利要求13所述的计算机设备,其中,在所述通过文本信息或短信的方式向用户发送预警信号的步骤之后,还包括:
    获取用户对反馈所述语音数据的判别结果的意见;
    若用户同意反馈所述判别结果,将所述语音数据发送至服务器,在预设间隔时间内采用所述语音数据优化所述深度卷积对抗生成网络;
    若用户不同意反馈所述判别结果,删除所述语音数据。
  15. 根据权利要求14所述的计算机设备,其中,所述在预设间隔时间内采用所述语音数据优化所述深度卷积对抗生成网络的步骤还包括:
    计算所述预设真实语音预测为虚假的期望值以及确定为虚假语音的所述语音数据预测为真实的期望值;
    将所述预设真实语音预测为虚假的期望值以及确定为虚假语音的所述语音数据预测为真实的期望值之和作为所述深度卷积对抗生成网络的损失函数并对所述深度卷积对抗生成网络进行优化。
  16. 一种存储装置,其中,存储有能够实现如下步骤的程序文件,所述步骤包括:
    采集用户接收到的语音数据;
    将所述语音数据输入预训练深度卷积对抗生成网络中,对所述语音数据进行分帧、加窗处理并提取所述语音数据的音频特征;
    对所述音频特征进行识别分析并获得所述语音数据的可信度;
    根据所述可信度判别所述语音数据的真实性。
  17. 根据权利要求16所述的存储装置,其中,所述将所述语音数据输入预训练深度卷积对抗生成网络中的步骤之前,还包括:
    接收随机噪声并通过所述随机噪声生成合成语音;
    利用所述合成语音和预设真实语音对深度卷积对抗生成网络进行训练,获得预训练深度卷积对抗生成网络。
  18. 根据权利要求17所述的存储装置,其中,所述利用所述合成语音和预设真实语音对深度卷积对抗生成网络进行训练,获得预训练深度卷积对抗生成网络的步骤还包括:
    计算所述合成语音预测为真实的期望值以及所述预设真实语音预测为虚假的期望值;
    将所述合成语音预测为真实的期望值以及所述预设真实语音预测为虚假的期望值之和作为所述深度卷积对抗生成网络的损失函数并对所述深度卷积对抗生成网络进行优化。
  19. 根据权利要求16所述的存储装置,其中,所述根据所述可信度判别所述语音数据的真实性的步骤包括:
    将所述可信度与预设阈值作比对;
    当可信度低于预设阈值时,确定所述语音数据为虚假语音;
    当可信度高于预设阈值时,确定所述语音数据为真实语音。
  20. 根据权利要求19所述的存储装置,其中,在所述根据所述可信度判别所述语音数据的真实性的步骤之后,还包括:
    当确定所述语音数据为虚假语音时,通过文本信息或短信的方式向用户发送预警信号;
    当确定所述语音数据为真实语音时,删除所述语音数据。
PCT/CN2020/135177 2020-10-21 2020-12-10 人工合成语音检测方法、装置、计算机设备及存储介质 WO2021179714A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011134504.4A CN112185417B (zh) 2020-10-21 2020-10-21 人工合成语音检测方法、装置、计算机设备及存储介质
CN202011134504.4 2020-10-21

Publications (1)

Publication Number Publication Date
WO2021179714A1 true WO2021179714A1 (zh) 2021-09-16

Family

ID=73923733

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/135177 WO2021179714A1 (zh) 2020-10-21 2020-12-10 人工合成语音检测方法、装置、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN112185417B (zh)
WO (1) WO2021179714A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113870899A (zh) * 2021-09-28 2021-12-31 平安科技(深圳)有限公司 语音质量评价模型的训练方法、装置与存储介质
CN118280389A (zh) * 2024-03-28 2024-07-02 南京龙垣信息科技有限公司 多重对抗判别伪造音频检测系统

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107368724A (zh) * 2017-06-14 2017-11-21 广东数相智能科技有限公司 基于声纹识别的防作弊网络调研方法、电子设备及存储介质
CN109559736A (zh) * 2018-12-05 2019-04-02 中国计量大学 一种基于对抗网络的电影演员自动配音方法
CN109801638A (zh) * 2019-01-24 2019-05-24 平安科技(深圳)有限公司 语音验证方法、装置、计算机设备及存储介质
CN110930976A (zh) * 2019-12-02 2020-03-27 北京声智科技有限公司 一种语音生成方法及装置
US20200322377A1 (en) * 2019-04-08 2020-10-08 Pindrop Security, Inc. Systems and methods for end-to-end architectures for voice spoofing detection

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP4964695B2 (ja) * 2007-07-11 2012-07-04 日立オートモティブシステムズ株式会社 音声合成装置及び音声合成方法並びにプログラム
CN107293289B (zh) * 2017-06-13 2020-05-29 南京医科大学 一种基于深度卷积生成对抗网络的语音生成方法
CN111383641B (zh) * 2018-12-29 2022-10-18 华为技术有限公司 语音识别方法、装置和控制器
CN110619886B (zh) * 2019-10-11 2022-03-22 北京工商大学 一种针对低资源土家语的端到端语音增强方法
CN111243621A (zh) * 2020-01-14 2020-06-05 四川大学 一种用于合成语音检测的gru-svm深度学习模型的构造方法
CN111798828B (zh) * 2020-05-29 2023-02-14 厦门快商通科技股份有限公司 合成音频检测方法、系统、移动终端及存储介质

Also Published As

Publication number Publication date
CN112185417A (zh) 2021-01-05
CN112185417B (zh) 2024-05-10

Similar Documents

Publication Publication Date Title
Zhang et al. Comparing acoustic analyses of speech data collected remotely
CN107112006B (zh) 基于神经网络的语音处理
WO2021073116A1 (zh) 生成法律文书的方法、装置、设备和存储介质
US9047868B1 (en) Language model data collection
CN112949708B (zh) 情绪识别方法、装置、计算机设备和存储介质
WO2021179714A1 (zh) 人工合成语音检测方法、装置、计算机设备及存储介质
KR102081495B1 (ko) 계정 추가 방법, 단말, 서버, 및 컴퓨터 저장 매체
WO2021159902A1 (zh) 年龄识别方法、装置、设备及计算机可读存储介质
CN110853646A (zh) 会议发言角色的区分方法、装置、设备及可读存储介质
CN110309799B (zh) 基于摄像头的说话判断方法
CN104598644A (zh) 用户喜好标签挖掘方法和装置
CN112786052B (zh) 语音识别方法、电子设备和存储装置
CN109785846B (zh) 单声道的语音数据的角色识别方法及装置
WO2022116487A1 (zh) 基于生成对抗网络的语音处理方法、装置、设备及介质
US11887623B2 (en) End-to-end speech diarization via iterative speaker embedding
CN113191787A (zh) 电信数据的处理方法、装置电子设备及存储介质
WO2019228306A1 (zh) 对齐语音的方法和装置
US20240013772A1 (en) Multi-Channel Voice Activity Detection
CN114138960A (zh) 用户意图识别方法、装置、设备及介质
CN113409771B (zh) 一种伪造音频的检测方法及其检测系统和存储介质
CN109634554B (zh) 用于输出信息的方法和装置
CN111149153B (zh) 信息处理装置以及说话解析方法
CN116232644A (zh) 基于ai的网络诈骗行为分析方法和系统
CN116386664A (zh) 一种语音伪造检测方法、装置、系统及存储介质
CN107133644A (zh) 数字化图书馆内容分析系统及方法

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20924645

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20924645

Country of ref document: EP

Kind code of ref document: A1