WO2023173966A1 - Speech recognition method, terminal device and computer-readable storage medium - Google Patents

Speech recognition method, terminal device and computer-readable storage medium Download PDF

Info

Publication number
WO2023173966A1
WO2023173966A1 (PCT/CN2023/075238)
Authority
WO
WIPO (PCT)
Prior art keywords
pinyin
spectrogram
audio
pinyin sequence
text
Prior art date
Application number
PCT/CN2023/075238
Other languages
English (en)
French (fr)
Inventor
房鹏
周波
郑明钊
李瑶
康志文
韩琮师
房钦国
Original Assignee
中国移动通信集团设计院有限公司
中国移动通信集团有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国移动通信集团设计院有限公司 and 中国移动通信集团有限公司
Publication of WO2023173966A1

Classifications

    • G PHYSICS
    • G10 MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/00 Speech recognition
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/06 Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/063 Training
    • G10L15/08 Speech classification or search
    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L25/00 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
    • G10L25/27 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique
    • G10L25/30 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the analysis technique using neural networks
    • G10L25/48 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use
    • G10L25/51 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 specially adapted for particular use for comparison or discrimination

Definitions

  • the present application relates to the field of speech recognition, and in particular to speech recognition methods, terminal equipment and computer-readable storage media.
  • Telecom fraudsters often exploit people's instinct to seek gain and avoid loss, fabricating phone calls and blanketing the public with false information.
  • Such information spreads very widely in a very short time, so the scope of harm, and therefore of losses, is very broad.
  • At the same time, fraud tactics evolve quickly, from the most primitive lottery-winning scams to extortion, phone-bill arrears, car tax refunds, and so on.
  • the embodiments of this application solve the technical problem that telecom fraud identification in the related art lags behind the fraud, by providing a speech recognition method, a terminal device and a computer-readable storage medium, and achieve the effect of identifying telecom fraud in advance, before it is completed.
  • the speech recognition method includes the following steps:
  • the voice content corresponding to the pre-stored text pinyin sequence that matches the audio pinyin sequence is identified as the voice content contained in the audio file to be recognized.
  • the method before the step of identifying the voice content corresponding to the pre-stored text pinyin sequence that matches the audio pinyin sequence as the voice content contained in the audio file to be recognized, the method further includes:
  • the pinyin sequence of the pre-stored text is generated according to the pre-set text.
  • the method further includes:
  • the key of the pinyin data dictionary is a text pinyin, and the value is the index of the texts in the preset text that contain that pinyin;
  • each syllable corresponds to at least one audio pinyin
  • based on the spectrogram, the convolutional neural network model determines, for each candidate pinyin in the pinyin library, the probability that it is the audio pinyin corresponding to the syllable, and according to that probability selects at least one candidate pinyin as the audio pinyin for the syllable.
  • before the step of using the spectrogram as an input to a pre-trained convolutional neural network model and determining the audio pinyin sequence corresponding to the spectrogram based on the convolutional neural network model, the method further includes:
  • a loss function is constructed using the CTC algorithm, and the convolutional neural network model is trained based on the loss function, the sample spectrogram and the sample pinyin sequence.
  • the step of obtaining the spectrogram feature data of the audio file to be identified, and determining the spectrogram corresponding to the audio file to be identified based on the spectrogram feature data includes:
  • the spectrogram is generated from the classified component frequencies and the timing information corresponding to each frame of data.
  • the audio file is a call recording
  • the pre-stored text pinyin sequence is the pinyin sequence corresponding to the fraud phrase text
  • the voice content is the fraud phrase.
  • embodiments of the present application also provide a terminal device, including:
  • the acquisition part is configured to obtain the sound spectrum feature data of the audio file to be recognized, and determine the spectrogram corresponding to the audio file to be recognized based on the sound spectrum feature data;
  • the analysis part is configured to use the spectrogram as an input to a pre-trained convolutional neural network model, and determine the audio pinyin sequence corresponding to the spectrogram based on the convolutional neural network model;
  • the identification part is configured to identify the voice content corresponding to the pre-stored text pinyin sequence that matches the audio pinyin sequence as the voice content contained in the audio file to be recognized.
  • embodiments of the present application also provide a terminal device, including a memory, a processor, and a speech recognition program stored in the memory and executable on the processor; the speech recognition method described above is implemented when the processor executes the speech recognition program.
  • embodiments of the present application also provide a computer-readable storage medium on which a speech recognition program is stored.
  • the speech recognition program is executed by a processor, the speech recognition method as described above is implemented.
  • embodiments of the present application also provide a computer program product, including computer-readable code; when the computer-readable code runs in an electronic device, the processor in the electronic device executes the steps of the speech recognition method described above.
  • the identification process can be completed in the terminal device, the call recording does not need to flow out of the terminal device, thus improving the effect of user privacy protection during the fraud identification process.
  • Figure 1 is a schematic flow chart of an embodiment of the speech recognition method of the present application
  • Figure 2 is an audio frequency domain signal diagram related to the embodiment of the present application.
  • Figure 3 is a single frame data diagram after the audio frequency domain signal shown in Figure 2 is divided into frames;
  • Figure 4 is an effect diagram after windowing the single frame data shown in Figure 3;
  • Figure 5 is a composition frequency diagram based on the signal shown in Figure 4.
  • Figure 6 is a spectrogram related to the embodiment of the present application.
  • Figure 7 is a schematic diagram of the Pinyin data dictionary involved in the embodiment of the present application.
  • Figure 8 is a partial schematic diagram of a terminal device involved in an embodiment of the present application.
  • Figure 9 is a schematic structural diagram of a terminal device related to an embodiment of the present application.
  • telecom fraud has developed increasingly rapidly in recent years. Fraudsters often exploit people's instinct to seek gain and avoid loss, fabricating phone calls to blanket the public with false information. Such information spreads very widely in a very short time, so the scope of harm, and therefore of losses, is very broad. At the same time, fraud tactics evolve quickly, from the most primitive lottery-winning scams to extortion, phone-bill arrears, car tax refunds, and so on.
  • telecommunications fraud is generally identified through the following two methods.
  • the accessed data includes signaling data, International Mobile Equipment Identity (IMEI) data, call voice data, etc.
  • the analysis methods include type matching, natural language analysis, etc.
  • user data is accessed on the bank side to identify fraud during the withdrawal process.
  • the accessed data includes bank card information, ATM cash withdrawal information, etc.
  • telecom fraud identification based on bank data can only be identified after the user's economic interests are damaged, resulting in a lag in the identification of telecom fraud.
  • the embodiment of this application proposes a speech recognition method.
  • by deploying the fraud identification function in the user's personal terminal, the data used for identification does not leave the user's personal terminal, thus ensuring user privacy.
  • telecom fraud can be identified quickly through voice recognition of the user's call data, so that it can be identified in advance, before the fraud succeeds, to remind users to take precautions.
  • the speech recognition method includes the following steps:
  • Step S10: Obtain the sound-spectrum feature data of the audio file to be recognized, and determine the spectrogram corresponding to the audio file based on the sound-spectrum feature data;
  • Step S20: Use the spectrogram as an input to a pre-trained convolutional neural network model, and determine the audio pinyin sequence corresponding to the spectrogram based on the convolutional neural network model;
  • Step S30: Identify the voice content corresponding to the pre-stored text pinyin sequence that matches the audio pinyin sequence as the voice content contained in the audio file to be recognized.
  • the speech recognition method is used to identify whether the audio corresponding to an audio file contains voice content from a preset text library or database.
  • the audio file to be recognized may be a call recording, so that it can be identified whether the call recording contains preset information based on the speech recognition method proposed in this embodiment.
  • that is, whether the call recording contains speech matching a preset fraud-script text. If speech corresponding to the fraud-script text is present in the call recording, it is determined that the call carries a risk of telecom fraud, and an anti-fraud reminder is given to the user.
  • the audio file to be recognized can also be an audio file corresponding to the video file.
  • the speech recognition method proposed in this embodiment can be used to identify whether the target utterance appears in the monitoring video.
  • the target utterance can be a customized setting, for example, set to "xx, are you ready to do it tonight?"
  • the audio file to be recognized can be an audio file corresponding to a film and television work, a short video work or a music file.
  • the speech recognition method proposed in the embodiments of this application can be used to identify whether film and television works, short video works, or music files contain target lines or lyrics.
  • the device executing the voice recognition method of this embodiment may be a mobile terminal, such as a mobile phone or a tablet computer.
  • the mobile terminal can establish a call connection with other terminals based on the mobile network. For example, telephone communication can be established based on the call network, or network voice calls can be established based on WeChat, QQ, DingTalk, Feishu, etc.
  • when the terminal detects that it has entered a call state, it can start the recording device, record the call audio through the recording device, and use the call recording as the audio file to be recognized. After the call ends, the speech recognition method is executed on the audio file to be recognized.
  • the sound spectrum feature data of the audio file to be recognized can be obtained, and the spectrogram corresponding to the audio file to be recognized can be determined based on the sound spectrum feature data.
  • the terminal can read the audio file to be recognized and determine the frequency domain signal corresponding to the audio file to be recognized. That is, the audio file to be recognized is read, the sampling frequency and sampling data of the audio file to be recognized are extracted, and the original audio data information is obtained.
  • as shown in Figure 2, this is a representation of speech in the time domain: the amplitude represents the intensity of the sound, and an amplitude of zero represents silence. These amplitudes alone cannot represent the content of speech, so they need to be converted into a frequency-domain representation.
  • the amplitudes are converted into a frequency-domain representation, that is, the frequency-domain signal corresponding to the audio file to be recognized is obtained; the data is then divided into frames based on the frequency-domain signal, and a Hamming window is applied to each frame of data after framing.
  • people produce sound through their vocal cords, and different vibration frequencies of the vocal cords produce sounds with different meanings. Over short intervals, typically 10 ms to 30 ms, the vibration frequency of the human voice remains stable, so the original voice data is split into frames of 20 ms each.
  • overlapping framing may be used for framing. For example, the previous frame overlaps the next frame by 10ms.
  • a frame of intercepted data is shown in Figure 3.
  • the window function w(t) may be:
  • the frame data approximately behaves as periodic data.
  • the component frequency corresponding to each windowed frame data can be separated based on the fast Fourier transform. It can be understood that sound signals are composed of sound waves of different frequencies. Fast Fourier transform is used to separate the sound waves of different frequencies and obtain the frequency magnitude. Figure 5 shows the component frequencies of the sound signal separated by Fourier transform.
  • the spectrogram is generated from the classified component frequencies and the timing information corresponding to each frame of data, yielding the spectrogram shown in Figure 6.
  • the above method of determining the spectrogram is an optional implementation that can be adopted by the speech recognition method of the present application.
  • the terminal can determine the spectrogram of the audio file to be identified based on the above solution without the recording file flowing out of the terminal, thus ensuring user privacy.
  • the execution terminal can call the cloud service and then determine the spectrogram corresponding to the audio file to be recognized based on the cloud service. The determination of the spectrogram through cloud services can effectively reduce the computing overhead of the terminal device.
  • the spectrogram can be used as input to a pre-trained convolutional neural network model, and the audio pinyin sequence corresponding to the spectrogram is determined based on the convolutional neural network model.
  • using an acoustic model for speech analysis eliminates the speech-to-text step and identifies harassment and fraud features directly at the pronunciation level. The computation is light, recognition is fast, and the model is small, so it can be deployed on a mobile terminal and run smoothly; it also supports recognition of new words and fuzzy pronunciation, which greatly improves the recognition rate. Therefore, in this embodiment, a pre-trained convolutional neural network model can be used to determine, from the spectrogram, the pinyin sequence corresponding to the speech file to be recognized.
  • the network structure of the convolutional neural network can be as follows:
  • Layer 1, convolution: 32 kernels of size 3×3, ReLU activation.
  • Layer 2, convolution: 32 kernels of size 3×3, ReLU activation.
  • Layer 3, pooling: 2×2 kernel, max pooling.
  • Layer 4, convolution: 64 kernels of size 3×3, ReLU activation.
  • Layer 5, convolution: 64 kernels of size 3×3, ReLU activation.
  • Layer 6, pooling: 2×2 kernel, max pooling.
  • Layer 7, convolution: 128 kernels of size 3×3, ReLU activation.
  • Layer 8, convolution: 128 kernels of size 3×3, ReLU activation.
  • Layer 9, pooling: 2×2 kernel, max pooling.
  • Layer 10, convolution: 128 kernels of size 3×3, ReLU activation.
  • Layer 11, convolution: 128 kernels of size 3×3, ReLU activation.
  • Layer 12, convolution: 128 kernels of size 3×3, ReLU activation.
  • Layer 13, convolution: 128 kernels of size 3×3, ReLU activation.
  • Layer 14, fully connected: 256 neurons.
  • Layer 15, fully connected: the number of neurons equals the size of the pinyin dictionary, softmax activation; this is the final output layer.
  • the Connectionist Temporal Classification (CTC) algorithm can be used to construct a loss function and perform model training. Sample spectrograms for model training and their corresponding sample pinyin sequences are then obtained, and the convolutional neural network model is trained based on the loss function, the sample spectrograms and the sample pinyin sequences. The trained model can then determine the corresponding pinyin sequence directly from an input spectrogram. In the pinyin sequence, each syllable corresponds to at least one audio pinyin.
  • based on the spectrogram, the convolutional neural network model determines, for each candidate pinyin in the pinyin library, the probability that it is the audio pinyin corresponding to the syllable, and selects at least one candidate pinyin accordingly. For example, the five pinyins with the highest probabilities can be taken as the audio pinyins for that syllable. In this way, the final recognition result achieves the effect of a fuzzy query, i.e., fuzzy pronunciation can be recognized.
  • the speech content corresponding to the pre-stored text pinyin sequence matching the audio pinyin sequence is identified as the speech content contained in the audio file to be recognized.
  • the preset text can first be obtained in the terminal, and the pre-stored text pinyin sequence is generated from it. Then a pinyin data dictionary corresponding to the pre-stored text pinyin sequence is generated; the key of the dictionary is a text pinyin, and the value is the index of the texts in the preset text that contain that pinyin. Based on the pinyin data dictionary, the pre-stored text pinyin sequence that matches the audio pinyin sequence is queried.
  • the call voice (the audio file to be recognized) can be converted into an audio pinyin sequence and matched against the fraud phrases in the fraud script (the preset text) to calculate the probability that the call is fraudulent and determine whether it is a fraud call.
  • each Chinese phrase in the fraud script can be converted into its corresponding pinyin sequence for comparison with the audio pinyin sequence of the call voice. Then the average length of all phrases in the fraud script after conversion to pinyin is calculated.
  • the calculation formula is as follows:
  • a pinyin data dictionary is generated based on the audio pinyin sequence corresponding to the fraud phrase and the average length of the pinyin of the phrase, in which the key is the text pinyin corresponding to the text, and the value is the phrase index in the phrase book that contains the pinyin of the text.
  • An example is shown in Figure 7.
  • to keep matching as complete as possible, N is set to its maximum value, namely the number of phrases.
  • a loop traversal is performed based on the obtained set of candidate words.
  • the candidate phrases are taken out one by one and compared with the audio pinyin sequence. Because the lengths of the phrase pinyin sequence (i.e., the pre-stored text pinyin sequence) and the audio pinyin sequence are neither fixed nor equal, a sliding window can be set: its length is the length of the phrase pinyin sequence, its step size is 1, and successive segments of the audio pinyin sequence are intercepted and compared with the phrase pinyin sequence.
  • the comparison algorithm uses the longest-common-subsequence algorithm based on dynamic programming. If the matching rate is greater than 50%, the phrase is hit, the sliding-window loop is exited, and the next candidate phrase is compared.
  • the identification result of this fraudulent call and the hit words can be fed back to the user through the terminal.
  • the model training process does not involve the pre-stored text pinyin sequences to be recognized. Therefore, when the speech content to be recognized is updated, that is, when a new pre-stored text pinyin sequence is added, there is no need to retrain the convolutional neural network model. This allows the method to support new-word recognition.
  • the spectrogram feature data of the audio file to be recognized is first obtained and the corresponding spectrogram is determined from it; the spectrogram is then used as the input of a pre-trained convolutional neural network model to determine the corresponding audio pinyin sequence; finally, the speech content corresponding to the pre-stored text pinyin sequence that matches the audio pinyin sequence is identified as the voice content contained in the audio file to be recognized.
  • the speech recognition method provided by this embodiment makes it possible to directly identify whether fraud scripts appear in a call recording, and thus to identify telecom fraud before it is completed. Moreover, since the identification process can be completed in the terminal device, the call recording does not need to leave the terminal device, which improves the protection of user privacy during fraud identification.
  • an embodiment of the present application also proposes a terminal device.
  • the terminal device includes: a memory, a processor, and a speech recognition program stored in the memory and executable on the processor; when the speech recognition program is executed by the processor, the steps of the speech recognition method described in each of the above embodiments are implemented.
  • embodiments of the present application also provide a computer-readable storage medium on which a speech recognition program is stored; when the speech recognition program is executed by a processor, the steps of the speech recognition method described in the above embodiments are implemented.
  • embodiments of the present application also provide a computer program product, which includes computer-readable code; when the computer-readable code runs in an electronic device, the processor in the electronic device executes the steps of the speech recognition method described in the above embodiments.
  • This embodiment of the present application also provides a terminal device 800, including:
  • the acquisition part 810 is configured to obtain the sound spectrum feature data of the audio file to be recognized, and determine the spectrogram corresponding to the audio file to be recognized based on the sound spectrum feature data;
  • the analysis part 820 is configured to use the spectrogram as an input to a pre-trained convolutional neural network model, and determine the audio pinyin sequence corresponding to the spectrogram based on the convolutional neural network model;
  • the identification part 830 is configured to identify the voice content corresponding to the pre-stored text pinyin sequence that matches the audio pinyin sequence as the voice content contained in the audio file to be recognized.
  • the acquisition part 810 is further configured to: acquire a preset text; and generate the pinyin sequence of the prestored text according to the preset text.
  • the analysis part 820 is further configured to: generate a pinyin data dictionary corresponding to the pre-stored text pinyin sequence, where the key of the pinyin data dictionary is a text pinyin and the value is the index of the texts in the preset text that contain that pinyin; and query, based on the pinyin data dictionary, the pre-stored text pinyin sequence that matches the audio pinyin sequence.
  • each syllable corresponds to at least one audio pinyin
  • based on the spectrogram, the convolutional neural network model determines, for each candidate pinyin in the pinyin library, the probability that it is the audio pinyin corresponding to the syllable, and according to that probability selects at least one candidate pinyin as the audio pinyin for the syllable.
  • the terminal device 800 further includes a training module configured to: obtain a sample spectrogram for model training and the sample pinyin sequence corresponding to it; construct a loss function using the Connectionist Temporal Classification (CTC) algorithm; and train the convolutional neural network model based on the loss function, the sample spectrogram and the sample pinyin sequence.
  • the acquisition part 810 is further configured to: read the audio file to be recognized and determine the frequency-domain signal corresponding to it; divide the data into frames based on the frequency-domain signal; apply a Hamming window to each frame of data after framing; separate, based on the fast Fourier transform, the component frequencies corresponding to each windowed frame of data; and generate the spectrogram from the classified component frequencies and the timing information corresponding to each frame of data.
  • the audio file is a call recording
  • the pre-stored text pinyin sequence is the pinyin sequence corresponding to the fraud phrase text
  • the voice content is the fraud phrase.
  • Figure 9 is a schematic diagram of the terminal structure of the hardware operating environment involved in the embodiment of the present application.
  • the control terminal may include: a processor 901, such as a central processing unit (CPU), a network interface 903, a memory 904, and a communication bus 902.
  • the communication bus 902 is used to realize connection communication between these components.
  • the network interface 903 may optionally include a standard wired interface or a wireless interface (such as a wireless fidelity WI-FI interface).
  • the memory 904 may be high-speed RAM or non-volatile memory.
  • the memory 904 may optionally be a storage device independent of the aforementioned processor 901.
  • the terminal structure shown in Figure 9 does not limit the terminal, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
  • memory 904 which is a computer storage medium, may include an operating system, a network communication part, and a voice recognition program.
  • the processor 901 can be used to call the speech recognition program stored in the memory 904 and perform the following operations:
  • the processor 901 can call the speech recognition program stored in the memory 904 and also perform the following operations:
  • the pinyin sequence of the pre-stored text is generated according to the pre-set text.
  • the processor 901 can call the speech recognition program stored in the memory 904 and also perform the following operations:
  • the key of the pinyin data dictionary is a text pinyin, and the value is the index of the texts in the preset text that contain that pinyin;
  • the processor 901 can call the speech recognition program stored in the memory 904 and also perform the following operations:
  • a loss function is constructed using the CTC algorithm, and the convolutional neural network model is trained based on the loss function, the sample spectrogram and the sample pinyin sequence.
  • the processor 901 can call the speech recognition program stored in the memory 904 and also perform the following operations:
  • the spectrogram is generated from the classified component frequencies and the timing information corresponding to each frame of data.
  • the methods of the above embodiments can be implemented by means of software plus the necessary general-purpose hardware platform; they can of course also be implemented by hardware, but in many cases the former is the better implementation.
  • the technical solution of the present application, in essence or in the part contributing beyond the existing technology, can be embodied in the form of a software product.
  • the computer software product is stored in a storage medium as mentioned above (such as ROM/RAM, a magnetic disk, or an optical disc), and includes several instructions to cause a terminal device (such as a mobile phone or tablet computer) to execute the methods described in the various embodiments of the present application.
  • This application discloses a speech recognition method, a terminal device and a computer-readable storage medium.
  • the method includes: obtaining the sound-spectrum feature data of the audio file to be recognized, and determining the spectrogram corresponding to the audio file based on that data; using the spectrogram as the input of a pre-trained convolutional neural network model, and determining the audio pinyin sequence corresponding to the spectrogram based on the model; and identifying the speech content corresponding to the pre-stored text pinyin sequence that matches the audio pinyin sequence as the speech content contained in the audio file to be recognized. This achieves the effect of identifying telecom fraud before it is completed.

Abstract

This application discloses a speech recognition method, a terminal device and a computer-readable storage medium. The method includes: obtaining sound-spectrum feature data of an audio file to be recognized, and determining the spectrogram corresponding to the audio file based on the sound-spectrum feature data; using the spectrogram as the input of a pre-trained convolutional neural network model, and determining the audio pinyin sequence corresponding to the spectrogram based on the convolutional neural network model; and identifying the voice content corresponding to a pre-stored text pinyin sequence that matches the audio pinyin sequence as the voice content contained in the audio file to be recognized.

Description

Speech recognition method, terminal device and computer-readable storage medium
Cross-reference to related applications
This application is based on, and claims priority to, Chinese patent application No. 202210248254.X, filed on March 14, 2022 and entitled "Speech recognition method, terminal device and computer-readable storage medium", the entire contents of which are incorporated herein by reference.
Technical field
This application relates to the field of speech recognition, and in particular to a speech recognition method, a terminal device and a computer-readable storage medium.
Background
In recent years, telecom fraud has grown increasingly rapidly. Fraudsters often exploit people's instinct to seek gain and avoid loss, fabricating phone calls and blanketing the public with false information. Such information spreads very widely in a very short time, so the scope of harm, and therefore of losses, is very broad. At the same time, fraud tactics evolve quickly, from the most primitive lottery-winning scams to extortion, phone-bill arrears, car tax refunds, and so on.
In the related art, identifying telecom fraud requires accessing user data on the bank side, such as the user's bank card information and Automated Teller Machine (ATM) withdrawal records, and identifying fraudulent behavior during the withdrawal process. As a result, this method can only identify telecom fraud after the user has already been successfully defrauded.
It should be noted that the above content is only intended to assist understanding of the technical problem solved by this application, and does not constitute an admission that it is prior art.
Summary
By providing a speech recognition method, a terminal device and a computer-readable storage medium, the embodiments of this application solve the technical problem that telecom fraud identification in the related art lags behind the fraud, and achieve the effect of identifying telecom fraud in advance, before it is completed.
An embodiment of this application provides a speech recognition method, which includes the following steps:
obtaining sound-spectrum feature data of an audio file to be recognized, and determining the spectrogram corresponding to the audio file based on the sound-spectrum feature data;
using the spectrogram as the input of a pre-trained convolutional neural network model, and determining the audio pinyin sequence corresponding to the spectrogram based on the convolutional neural network model;
identifying the voice content corresponding to a pre-stored text pinyin sequence that matches the audio pinyin sequence as the voice content contained in the audio file to be recognized.
In some embodiments, before the step of identifying the voice content corresponding to the pre-stored text pinyin sequence that matches the audio pinyin sequence as the voice content contained in the audio file to be recognized, the method further includes:
obtaining a preset text;
generating the pre-stored text pinyin sequence according to the preset text.
In some embodiments, after the step of generating the pre-stored text pinyin sequence according to the preset text, the method further includes:
generating a pinyin data dictionary corresponding to the pre-stored text pinyin sequence, where the key of the pinyin data dictionary is a text pinyin and the value is the index of the texts in the preset text that contain that pinyin;
querying, based on the pinyin data dictionary, the pre-stored text pinyin sequence that matches the audio pinyin sequence.
In some embodiments, in the pinyin sequence each syllable corresponds to at least one audio pinyin; based on the spectrogram, the convolutional neural network model determines, for each candidate pinyin in the pinyin library, the probability that it is the audio pinyin corresponding to the syllable, and according to that probability selects at least one candidate pinyin as the audio pinyin for the syllable.
In some embodiments, before the step of using the spectrogram as the input of the pre-trained convolutional neural network model and determining the audio pinyin sequence corresponding to the spectrogram based on the model, the method further includes:
obtaining a sample spectrogram for model training and the sample pinyin sequence corresponding to the sample spectrogram;
constructing a loss function using the CTC algorithm, and training the convolutional neural network model based on the loss function, the sample spectrogram and the sample pinyin sequence.
In some embodiments, the step of obtaining the sound-spectrum feature data of the audio file to be recognized and determining the spectrogram corresponding to the audio file based on the sound-spectrum feature data includes:
reading the audio file to be recognized, and determining the frequency-domain signal corresponding to the audio file;
dividing the data into frames based on the frequency-domain signal;
applying a Hamming window to each frame of data after framing;
separating, based on the fast Fourier transform, the component frequencies corresponding to each windowed frame of data;
generating the spectrogram from the classified component frequencies and the timing information corresponding to each frame of data.
In some embodiments, the audio file is a call recording, the pre-stored text pinyin sequence is the pinyin sequence corresponding to a fraud-script text, and the voice content is the fraud script.
In addition, to achieve the above effects, an embodiment of this application further provides a terminal device, including:
an acquisition part, configured to obtain sound-spectrum feature data of an audio file to be recognized, and determine the spectrogram corresponding to the audio file based on the sound-spectrum feature data;
an analysis part, configured to use the spectrogram as the input of a pre-trained convolutional neural network model, and determine the audio pinyin sequence corresponding to the spectrogram based on the model;
an identification part, configured to identify the voice content corresponding to a pre-stored text pinyin sequence that matches the audio pinyin sequence as the voice content contained in the audio file to be recognized.
In addition, to achieve the above effects, an embodiment of this application further provides a terminal device, including a memory, a processor, and a speech recognition program stored in the memory and executable on the processor; the speech recognition method described above is implemented when the processor executes the speech recognition program.
In addition, to achieve the above effects, an embodiment of this application further provides a computer-readable storage medium on which a speech recognition program is stored; the speech recognition method described above is implemented when the speech recognition program is executed by a processor.
In addition, to achieve the above effects, an embodiment of this application further provides a computer program product, including computer-readable code; when the computer-readable code runs in an electronic device, the processor in the electronic device executes the speech recognition method described above.
The one or more technical solutions provided in the embodiments of this application have at least the following technical effects or advantages:
1. The speech recognition method provided by these embodiments can directly identify whether a call recording contains fraud scripts, and thus identify telecom fraud before the fraud is completed.
2. Since the identification process can be completed within the terminal device, the call recording does not need to leave the terminal device, which improves the protection of user privacy during fraud identification.
Brief description of the drawings
Figure 1 is a schematic flowchart of an embodiment of the speech recognition method of this application;
Figure 2 is an audio frequency-domain signal diagram involved in an embodiment of this application;
Figure 3 is a single-frame data diagram after the audio signal shown in Figure 2 is divided into frames;
Figure 4 is an effect diagram of the single-frame data shown in Figure 3 after windowing;
Figure 5 is a schematic diagram of the component frequencies of the signal shown in Figure 4;
Figure 6 is a spectrogram involved in an embodiment of this application;
Figure 7 is a schematic diagram of the pinyin data dictionary involved in an embodiment of this application;
Figure 8 is a schematic diagram of the parts of a terminal device involved in an embodiment of this application;
Figure 9 is a schematic structural diagram of a terminal device involved in an embodiment of this application.
Detailed description
In recent years, telecom fraud has grown increasingly rapidly. Fraudsters often exploit people's instinct to seek gain and avoid loss, fabricating phone calls and blanketing the public with false information. Such information spreads very widely in a very short time, so the scope of harm, and therefore of losses, is very broad. At the same time, fraud tactics evolve quickly, from the most primitive lottery-winning scams to extortion, phone-bill arrears, car tax refunds, and so on.
In the related art, telecom fraud is generally identified in the following two ways.
First, user data is accessed on the operator side to analyze and identify fraudulent calls. The accessed data includes signaling data, International Mobile Equipment Identity (IMEI) data, call voice data, and so on; the analysis methods include type matching, natural language analysis, and so on.
Second, user data is accessed on the bank side to identify fraudulent behavior during the withdrawal process. The accessed data includes bank card information, ATM withdrawal records, and so on.
Both of the above approaches require obtaining a large amount of private user data through operators and banks, for example the user's call records, contacts, call content, IMEI, bank card information, and withdrawal photos. As a result, implementing telecom fraud identification causes leakage of private user data.
Moreover, fraud identification based on operator data can, once telecom fraud is identified, only send the user a warning text message through the operator or directly hang up the user's call, which directly interferes with the user's normal calls. Fraud identification based on bank data can only identify telecom fraud after the user's economic interests have been damaged, so the identification lags behind the fraud.
To remedy the defects that existing telecom fraud identification leaks user privacy and lags behind the fraud, an embodiment of this application proposes a speech recognition method. By deploying the fraud identification function in the user's personal terminal, the data used for identification does not leave the user's personal terminal, thereby protecting user privacy. At the same time, telecom fraud is identified quickly through speech recognition of the user's call data, so that it can be identified in advance, before the fraud succeeds, and the user can be reminded to take precautions.
Below, the speech recognition method proposed in this embodiment is explained with reference to the accompanying drawings.
In one embodiment, the speech recognition method includes the following steps:
Step S10: obtain the sound-spectrum feature data of the audio file to be recognized, and determine the spectrogram corresponding to the audio file based on the sound-spectrum feature data;
Step S20: use the spectrogram as the input of a pre-trained convolutional neural network model, and determine the audio pinyin sequence corresponding to the spectrogram based on the convolutional neural network model;
Step S30: identify the voice content corresponding to a pre-stored text pinyin sequence that matches the audio pinyin sequence as the voice content contained in the audio file to be recognized.
In this embodiment, the speech recognition method is used to identify whether the audio corresponding to an audio file contains voice content from a preset text library or database.
For example, in one application scenario of the speech recognition method proposed in the embodiments of this application, the audio file to be recognized may be a call recording, so that the method can identify whether the call recording contains speech matching a preset fraud-script text. If speech corresponding to the fraud-script text is present in the call recording, it is determined that the call carries a risk of telecom fraud, and an anti-fraud reminder is given to the user.
In another application scenario, the audio file to be recognized may be the audio file corresponding to a video file. For example, during surveillance screening, the speech recognition method proposed in this embodiment can be used to identify whether a target utterance appears in the surveillance video. The target utterance can be user-defined, for example set to "xx, are you ready to act tonight?".
In yet another application scenario, the audio file to be recognized may be the audio file corresponding to a film or television work, a short video work, or a music file. In this scenario, the speech recognition method proposed in the embodiments of this application can be used to identify whether the film or television work, short video work, or music file contains target lines or lyrics.
Below, the speech recognition method proposed in the embodiments of this application is further explained in the context of telecom fraud early warning. It should be understood that the following content is intended to help those skilled in the art understand the scope of the speech recognition method of this application, not to limit this application.
In this embodiment, the device executing the speech recognition method may be a mobile terminal, such as a mobile phone or a tablet computer. The mobile terminal can establish a call connection with other terminals over a mobile network; for example, a telephone call can be established over the call network, or a network voice call can be established through WeChat, QQ, DingTalk, Feishu, and the like.
When the terminal detects that it has entered a call state, it can start the recording device, record the call audio through the recording device, and use the call recording as the audio file to be recognized. After the call ends, the speech recognition method is executed on the audio file to be recognized.
Once the audio file to be recognized is determined, its sound-spectrum feature data can be obtained, and the spectrogram corresponding to the audio file can be determined based on the sound-spectrum feature data.
For example, the terminal can read the audio file to be recognized and determine the frequency-domain signal corresponding to it. That is, the audio file is read, its sampling frequency and sample data are extracted, and the original audio data is obtained, as shown in Figure 2. This is a representation of speech in the time domain: the amplitude represents the intensity of the sound, and an amplitude of zero represents silence. These amplitudes alone cannot represent the content of speech, so they need to be converted into a frequency-domain representation.
After the amplitudes are converted into a frequency-domain representation, that is, after the frequency-domain signal corresponding to the audio file is obtained, the data is divided into frames based on that signal, and a Hamming window is then applied to each frame. It should be understood that people produce sound through their vocal cords, and different vibration frequencies produce sounds with different meanings; over short intervals, typically 10 ms to 30 ms, the vibration frequency of the human voice remains stable, so the original voice data is split into 20 ms frames. In some embodiments, to smooth the transition between frames, overlapping framing may be used; for example, each frame overlaps the next by 10 ms. One intercepted frame of data is shown in Figure 3.
After framing, because each frame is cut out of the original voice data, a frame does not span an integer number of periods, which causes spectral leakage. A Hamming window is therefore applied to the frame data to mitigate the spectral leakage.
In one implementation, the window function w(t) may be:
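The formula image from the original filing is not reproduced in this text. A standard Hamming window, consistent with the window the surrounding description names, would be (a reconstruction, not the patent's verbatim formula):

w(t) = 0.54 - 0.46·cos(2πt / (T - 1)), for 0 ≤ t ≤ T - 1,

where T is the number of samples in one frame.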
As shown in Figure 4, after a frame of data is windowed, the frame data behaves approximately as periodic data.
After the frame data is windowed, the component frequencies corresponding to each windowed frame can be separated using the fast Fourier transform. It should be understood that a sound signal is composed of sound waves of different frequencies; the fast Fourier transform separates the waves of different frequencies and obtains their magnitudes. Figure 5 shows the component frequencies of the sound signal separated by the Fourier transform.
Finally, the spectrogram is generated from the classified component frequencies and the timing information corresponding to each frame of data; that is, the spectrogram shown in Figure 6 is generated from these classified component frequencies and the per-frame timing information.
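As a rough illustration of the pipeline just described (framing into 20 ms frames with 10 ms overlap, Hamming windowing, per-frame FFT, and stacking the results over time), the following Python sketch uses NumPy and SciPy. The frame and hop lengths come from the description above; the use of scipy.io.wavfile, the mono-audio assumption, and the file name are illustrative assumptions, not details specified by the patent:

```python
import numpy as np
from scipy.io import wavfile

def spectrogram(path, frame_ms=20, hop_ms=10):
    """Magnitude spectrogram: one column per frame, one row per frequency bin."""
    rate, samples = wavfile.read(path)            # sampling frequency and sample data
    samples = samples.astype(np.float64)          # assumes a mono recording
    frame_len = int(rate * frame_ms / 1000)       # 20 ms frames ...
    hop = int(rate * hop_ms / 1000)               # ... overlapping by 10 ms
    window = np.hamming(frame_len)                # Hamming window curbs spectral leakage
    frames = [samples[i:i + frame_len] * window
              for i in range(0, len(samples) - frame_len + 1, hop)]
    # FFT of each windowed frame; keep magnitudes of the non-negative frequencies
    return np.abs(np.fft.rfft(np.stack(frames), axis=1)).T

spec = spectrogram("call_recording.wav")          # hypothetical file name
```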
It should be noted that the above way of determining the spectrogram is one optional implementation that the speech recognition method of this application can adopt. In the telecom fraud early-warning scenario, the terminal can determine the spectrogram of the audio file to be recognized using the above scheme without the recording file leaving the terminal, thereby protecting user privacy. Of course, in scenarios where user privacy is not a concern, such as audio recognition of film and television works or of songs, the executing terminal can call a cloud service and determine the spectrogram of the audio file through that service; determining the spectrogram through a cloud service can effectively reduce the computing overhead of the terminal device.
Further, after the spectrogram is determined, it can be used as the input of a pre-trained convolutional neural network model, and the audio pinyin sequence corresponding to the spectrogram can be determined based on that model.
It should be understood that traditional speech recognition methods must convert the call recording into text before recognizing the text; they are slow and computationally demanding, and unsuitable for terminals with limited computing power such as mobile phones and tablets. Using an acoustic model for speech analysis eliminates the speech-to-text step and identifies harassment and fraud script features directly at the pronunciation level: the computation is light, recognition is fast, and the model is small, so it can be deployed on a mobile phone and run smoothly; it also supports recognition of new words and fuzzy pronunciation, which greatly improves the recognition rate. Therefore, in this embodiment, a pre-trained convolutional neural network model can be used to determine, from the spectrogram, the pinyin sequence corresponding to the speech file to be recognized.
For example, once the spectrogram of the audio file to be recognized is obtained, speech recognition is turned into image recognition. The network structure of the convolutional neural network can be as follows (a code sketch follows the list):
Layer 1 (convolution): 32 kernels of size 3×3, ReLU activation.
Layer 2 (convolution): 32 kernels of size 3×3, ReLU activation.
Layer 3 (pooling): 2×2 kernel, max pooling.
Layer 4 (convolution): 64 kernels of size 3×3, ReLU activation.
Layer 5 (convolution): 64 kernels of size 3×3, ReLU activation.
Layer 6 (pooling): 2×2 kernel, max pooling.
Layer 7 (convolution): 128 kernels of size 3×3, ReLU activation.
Layer 8 (convolution): 128 kernels of size 3×3, ReLU activation.
Layer 9 (pooling): 2×2 kernel, max pooling.
Layer 10 (convolution): 128 kernels of size 3×3, ReLU activation.
Layer 11 (convolution): 128 kernels of size 3×3, ReLU activation.
Layer 12 (convolution): 128 kernels of size 3×3, ReLU activation.
Layer 13 (convolution): 128 kernels of size 3×3, ReLU activation.
Layer 14 (fully connected): 256 neurons.
Layer 15 (fully connected): the number of neurons equals the size of the pinyin dictionary, softmax activation; this is the final output layer.
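A minimal Keras sketch of this layer stack is given below. The kernel counts, sizes, and activations follow the list above; the input shape, the reshape that keeps one feature vector per time step (needed later for CTC), and the extra blank class in the output are assumptions for illustration — the patent's pinyin-dictionary count may already account for the blank:

```python
from tensorflow.keras import layers, models

def build_model(input_shape=(128, 1024, 1), num_pinyin=1400):
    """15-layer CNN mapping a spectrogram to per-time-step pinyin probabilities."""
    m = models.Sequential()
    m.add(layers.InputLayer(input_shape=input_shape))
    for filters in (32, 64, 128):                  # layers 1-9: conv, conv, max-pool
        m.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        m.add(layers.Conv2D(filters, 3, padding="same", activation="relu"))
        m.add(layers.MaxPooling2D(2))
    for _ in range(4):                             # layers 10-13: 128-kernel convolutions
        m.add(layers.Conv2D(128, 3, padding="same", activation="relu"))
    t, f, c = m.output_shape[1:]                   # collapse frequency and channel axes,
    m.add(layers.Reshape((t, f * c)))              # keeping one vector per time step
    m.add(layers.Dense(256, activation="relu"))    # layer 14
    m.add(layers.Dense(num_pinyin + 1, activation="softmax"))  # layer 15; +1 = CTC blank
    return m
```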
It should be noted that during model training, the Connectionist Temporal Classification (CTC) algorithm can be used to construct the loss function and carry out the training. Sample spectrograms for model training and the sample pinyin sequences corresponding to them are obtained, and the convolutional neural network model is trained based on the loss function, the sample spectrograms and the sample pinyin sequences, so that the trained model can determine the corresponding pinyin sequence directly from an input spectrogram. In the pinyin sequence, each syllable corresponds to at least one audio pinyin: based on the spectrogram, the model determines, for each candidate pinyin in the pinyin library, the probability that it is the audio pinyin for the syllable, and according to that probability selects at least one candidate pinyin as the audio pinyin for the syllable. For example, the five pinyins with the highest probabilities can be taken as the audio pinyins for that syllable. In this way, the final recognition result achieves the effect of a fuzzy query, i.e., fuzzy pronunciation can be recognized.
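One way to wire up such training, sketched with TensorFlow's stock CTC loss; the batching, sequence-length handling, and the use of tf.nn.ctc_loss are implementation assumptions, not choices stated in the patent:

```python
import tensorflow as tf

def train_step(model, optimizer, spectrograms, pinyin_labels, label_lengths):
    """One training step: spectrograms -> per-step pinyin probabilities -> CTC loss."""
    with tf.GradientTape() as tape:
        y_pred = model(spectrograms, training=True)   # (batch, time, num_pinyin + 1)
        logit_len = tf.fill([tf.shape(y_pred)[0]], tf.shape(y_pred)[1])
        loss = tf.reduce_mean(tf.nn.ctc_loss(
            labels=pinyin_labels,                     # sample pinyin sequences (dense ints)
            logits=tf.math.log(y_pred + 1e-8),        # log-probabilities stand in for logits
            label_length=label_lengths,
            logit_length=logit_len,
            logits_time_major=False,
            blank_index=-1))                          # last class is the CTC blank
    grads = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss
```

At inference time, keeping the top five pinyin candidates per output step, as described above, gives the decoded sequence the fuzziness that the matching stage relies on.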
In some embodiments, after the audio pinyin sequence is obtained, the voice content corresponding to the pre-stored text pinyin sequence that matches the audio pinyin sequence is identified as the voice content contained in the audio file to be recognized.
In some embodiments, as one implementation, the terminal can first obtain the preset text and generate the pre-stored text pinyin sequence from it. Then a pinyin data dictionary corresponding to the pre-stored text pinyin sequence is generated, where the key of the dictionary is a text pinyin and the value is the index of the texts in the preset text that contain that pinyin; based on the pinyin data dictionary, the pre-stored text pinyin sequence matching the audio pinyin sequence is queried.
For example, in the telecom fraud identification scenario, the call voice (the audio file to be recognized) can be converted into an audio pinyin sequence and then matched against the fraud phrases in the fraud script book (the preset text) to calculate the probability that the call is fraudulent and decide whether it is a fraud call.
In this example, each Chinese phrase in the fraud script book can first be converted into its pinyin sequence for comparison with the audio pinyin sequence of the call voice. Then the average length of all phrases in the script book after conversion to pinyin is calculated; the calculation formula is as follows:
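The formula image is likewise not reproduced in this text. From the surrounding description, the quantity is presumably the mean pinyin length over the n phrases in the script book (a reconstruction):

L̄ = (1/n) · Σᵢ₌₁..ₙ len(pᵢ),

where pᵢ is the pinyin sequence of the i-th phrase.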
Then, based on the audio pinyin sequences corresponding to the fraud phrases and the average pinyin length of the phrases, a pinyin data dictionary is generated, in which the key is a text pinyin and the value is the index of the phrases in the script book that contain that pinyin. An example is shown in Figure 7.
After the audio pinyin sequence and the pinyin data dictionary are determined, the number of times each pinyin of the current call appears in each phrase can be counted from the dictionary and sorted in descending order. Based on the ranking, the top N phrases can be selected as candidate phrases for this round of matching: the larger N is, the more complete the matching; the smaller N is, the faster the matching. N can be set flexibly according to the actual situation. In this application scenario, to keep matching as complete as possible, N is set to its maximum value, namely the number of phrases.
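A sketch of the dictionary construction and candidate selection might look as follows. pypinyin's lazy_pinyin is used here as one common text-to-pinyin converter; the patent does not name a conversion tool, so treat that choice, and the helper names, as assumptions:

```python
from collections import Counter, defaultdict
from pypinyin import lazy_pinyin  # assumed text-to-pinyin converter

def build_pinyin_dictionary(phrases):
    """Key: a pinyin syllable; value: indices of phrases containing it (cf. Figure 7)."""
    phrase_pinyins = [lazy_pinyin(p) for p in phrases]
    index = defaultdict(set)
    for i, pinyins in enumerate(phrase_pinyins):
        for syllable in pinyins:
            index[syllable].add(i)
    avg_len = sum(len(p) for p in phrase_pinyins) / len(phrase_pinyins)
    return phrase_pinyins, index, avg_len

def select_candidates(audio_pinyin, index, top_n):
    """Count, per phrase, how many call pinyins hit it; return the top-N phrase indices."""
    hits = Counter()
    for syllable in audio_pinyin:
        for phrase_idx in index.get(syllable, ()):
            hits[phrase_idx] += 1
    return [idx for idx, _ in hits.most_common(top_n)]
```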
Further, the candidate phrase set is traversed in a loop: candidate phrases are taken out one by one and compared with the audio pinyin sequence. Because the lengths of the phrase pinyin sequence (i.e., the pre-stored text pinyin sequence) and the audio pinyin sequence are neither fixed nor equal, a sliding window can be set: the window length is the length of the phrase pinyin sequence, the step size is 1, and successive segments of the audio pinyin sequence are intercepted and compared with the phrase pinyin sequence. The comparison uses the longest-common-subsequence algorithm based on dynamic programming; if the matching rate is greater than 50%, the phrase is hit, the sliding-window loop is exited, and the next candidate phrase is compared.
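The matching step could then be sketched as below. The 50% threshold and the step size of 1 come from the description; reading "matching rate" as LCS length divided by phrase length is an interpretation:

```python
def lcs_length(a, b):
    """Classic dynamic-programming longest common subsequence over pinyin syllables."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def phrase_hit(phrase_pinyin, audio_pinyin, threshold=0.5):
    """Slide a window of the phrase's length over the call pinyins with step 1;
    the first window whose matching rate exceeds the threshold hits the phrase."""
    w = len(phrase_pinyin)
    for start in range(max(len(audio_pinyin) - w, 0) + 1):
        window = audio_pinyin[start:start + w]
        if lcs_length(phrase_pinyin, window) / w > threshold:
            return True   # hit: break out of the sliding-window loop
    return False
```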
After all candidate phrases have been compared, the probability that this call is a fraud call is calculated; the calculation formula is as follows:
In some embodiments, after the probability is determined, the identification result of the fraud call and the phrases that were hit can be fed back to the user through the terminal.
It should be understood that in the speech recognition method provided by this embodiment, the model training process does not involve the pre-stored text pinyin sequences to be recognized. Therefore, when the voice content to be recognized is updated, that is, when new pre-stored text pinyin sequences are added, the convolutional neural network model does not need to be retrained. This enables the method to support new-word recognition.
In the technical solution disclosed in this embodiment, the sound-spectrum feature data of the audio file to be recognized is obtained first, and the spectrogram corresponding to the audio file is determined from it; the spectrogram is then used as the input of a pre-trained convolutional neural network model to determine the corresponding audio pinyin sequence; finally, the voice content corresponding to the pre-stored text pinyin sequence that matches the audio pinyin sequence is identified as the voice content contained in the audio file. In the telecom fraud prevention scenario, the speech recognition method provided by this embodiment can directly identify whether a call recording contains fraud scripts and thus identify telecom fraud before it is completed. Moreover, since the identification process can be completed within the terminal device, the call recording does not need to leave the terminal, which improves the protection of user privacy during fraud identification.
In addition, an embodiment of this application further proposes a terminal device, including a memory, a processor, and a speech recognition program stored in the memory and executable on the processor; when the speech recognition program is executed by the processor, the steps of the speech recognition method described in the above embodiments are implemented.
In addition, an embodiment of this application further proposes a computer-readable storage medium on which a speech recognition program is stored; when the speech recognition program is executed by a processor, the steps of the speech recognition method described in the above embodiments are implemented.
In addition, an embodiment of this application further proposes a computer program product including computer-readable code; when the computer-readable code runs in an electronic device, the processor in the electronic device executes the steps of the speech recognition method described in the above embodiments.
In addition, referring to Figure 8, an embodiment of this application further proposes a terminal device 800, including:
an acquisition part 810, configured to obtain sound-spectrum feature data of an audio file to be recognized, and determine the spectrogram corresponding to the audio file based on the sound-spectrum feature data;
an analysis part 820, configured to use the spectrogram as the input of a pre-trained convolutional neural network model, and determine the audio pinyin sequence corresponding to the spectrogram based on the model;
an identification part 830, configured to identify the voice content corresponding to a pre-stored text pinyin sequence that matches the audio pinyin sequence as the voice content contained in the audio file to be recognized.
In some embodiments, the acquisition part 810 is further configured to: obtain a preset text; and generate the pre-stored text pinyin sequence according to the preset text.
In some embodiments, the analysis part 820 is further configured to: generate a pinyin data dictionary corresponding to the pre-stored text pinyin sequence, where the key of the dictionary is a text pinyin and the value is the index of the texts in the preset text that contain that pinyin; and query, based on the pinyin data dictionary, the pre-stored text pinyin sequence that matches the audio pinyin sequence.
In some embodiments, in the pinyin sequence each syllable corresponds to at least one audio pinyin; based on the spectrogram, the convolutional neural network model determines, for each candidate pinyin in the pinyin library, the probability that it is the audio pinyin corresponding to the syllable, and according to that probability selects at least one candidate pinyin as the audio pinyin for the syllable.
In some embodiments, the terminal device 800 further includes a training module configured to: obtain a sample spectrogram for model training and the sample pinyin sequence corresponding to it; construct a loss function using the Connectionist Temporal Classification (CTC) algorithm; and train the convolutional neural network model based on the loss function, the sample spectrogram and the sample pinyin sequence.
In some embodiments, the acquisition part 810 is further configured to: read the audio file to be recognized and determine the frequency-domain signal corresponding to it; divide the data into frames based on the frequency-domain signal; apply a Hamming window to each frame after framing; separate, based on the fast Fourier transform, the component frequencies corresponding to each windowed frame; and generate the spectrogram from the classified component frequencies and the timing information of each frame.
In some embodiments, the audio file is a call recording, the pre-stored text pinyin sequence is the pinyin sequence corresponding to a fraud-script text, and the voice content is the fraud script.
As shown in Figure 9, Figure 9 is a schematic structural diagram of the terminal in the hardware operating environment involved in the embodiments of this application.
As shown in Figure 9, the control terminal may include: a processor 901, such as a central processing unit (CPU); a network interface 903; a memory 904; and a communication bus 902. The communication bus 902 is used to implement connection and communication between these components. The network interface 903 may optionally include a standard wired interface or a wireless interface (such as a Wi-Fi interface). The memory 904 may be high-speed RAM or non-volatile memory; it may optionally also be a storage device independent of the aforementioned processor 901.
Those skilled in the art will understand that the terminal structure shown in Figure 9 does not limit the terminal, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.
As shown in Figure 9, the memory 904, as a computer storage medium, may include an operating system, a network communication part, and a speech recognition program.
In the terminal shown in Figure 9, the processor 901 can be used to call the speech recognition program stored in the memory 904 and perform the following operations:
obtaining the sound-spectrum feature data of the audio file to be recognized, and determining the spectrogram corresponding to the audio file based on the sound-spectrum feature data;
using the spectrogram as the input of a pre-trained convolutional neural network model, and determining the audio pinyin sequence corresponding to the spectrogram based on the convolutional neural network model;
identifying the voice content corresponding to a pre-stored text pinyin sequence that matches the audio pinyin sequence as the voice content contained in the audio file to be recognized.
In some embodiments, the processor 901 can call the speech recognition program stored in the memory 904 and further perform the following operations:
obtaining a preset text;
generating the pre-stored text pinyin sequence according to the preset text.
In some embodiments, the processor 901 can call the speech recognition program stored in the memory 904 and further perform the following operations:
generating a pinyin data dictionary corresponding to the pre-stored text pinyin sequence, where the key of the pinyin data dictionary is a text pinyin and the value is the index of the texts in the preset text that contain that pinyin;
querying, based on the pinyin data dictionary, the pre-stored text pinyin sequence that matches the audio pinyin sequence.
In some embodiments, the processor 901 can call the speech recognition program stored in the memory 904 and further perform the following operations:
obtaining a sample spectrogram for model training and the sample pinyin sequence corresponding to the sample spectrogram;
constructing a loss function using the CTC algorithm, and training the convolutional neural network model based on the loss function, the sample spectrogram and the sample pinyin sequence.
In some embodiments, the processor 901 can call the speech recognition program stored in the memory 904 and further perform the following operations:
reading the audio file to be recognized, and determining the frequency-domain signal corresponding to the audio file;
dividing the data into frames based on the frequency-domain signal;
applying a Hamming window to each frame of data after framing;
separating, based on the fast Fourier transform, the component frequencies corresponding to each windowed frame of data;
generating the spectrogram from the classified component frequencies and the timing information corresponding to each frame of data.
It should be noted that, as used herein, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article or system. Without further limitation, an element defined by the phrase "including a ..." does not exclude the existence of other identical elements in the process, method, article or system that includes it.
The above numbering of the embodiments of this application is for description only and does not indicate the relative merits of the embodiments.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, and of course also by hardware, but in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing beyond the existing technology, can be embodied in the form of a software product; the computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to cause a terminal device (such as a mobile phone or tablet computer) to execute the methods described in the various embodiments of this application.
The above are only preferred embodiments of this application and do not thereby limit its patent scope; any equivalent structural or process transformation made using the contents of the specification and drawings of this application, applied directly or indirectly in other related technical fields, is likewise included within the scope of patent protection of this application.
Industrial applicability
This application discloses a speech recognition method, a terminal device and a computer-readable storage medium. The method includes: obtaining sound-spectrum feature data of an audio file to be recognized, and determining the spectrogram corresponding to the audio file based on the sound-spectrum feature data; using the spectrogram as the input of a pre-trained convolutional neural network model, and determining the audio pinyin sequence corresponding to the spectrogram based on the model; and identifying the voice content corresponding to a pre-stored text pinyin sequence that matches the audio pinyin sequence as the voice content contained in the audio file to be recognized. This achieves the effect of identifying telecom fraud in advance, before it is completed.

Claims (11)

  1. A speech recognition method, comprising the following steps:
    obtaining sound-spectrum feature data of an audio file to be recognized, and determining a spectrogram corresponding to the audio file to be recognized based on the sound-spectrum feature data;
    using the spectrogram as an input of a pre-trained convolutional neural network model, and determining an audio pinyin sequence corresponding to the spectrogram based on the convolutional neural network model;
    identifying voice content corresponding to a pre-stored text pinyin sequence that matches the audio pinyin sequence as voice content contained in the audio file to be recognized.
  2. The speech recognition method according to claim 1, wherein before the step of identifying the voice content corresponding to the pre-stored text pinyin sequence that matches the audio pinyin sequence as the voice content contained in the audio file to be recognized, the method further comprises:
    obtaining a preset text;
    generating the pre-stored text pinyin sequence according to the preset text.
  3. The speech recognition method according to claim 2, wherein after the step of generating the pre-stored text pinyin sequence according to the preset text, the method further comprises:
    generating a pinyin data dictionary corresponding to the pre-stored text pinyin sequence, wherein a key of the pinyin data dictionary is a text pinyin and a value is an index of texts in the preset text that contain the text pinyin;
    querying, based on the pinyin data dictionary, the pre-stored text pinyin sequence that matches the audio pinyin sequence.
  4. The speech recognition method according to any one of claims 1 to 3, wherein in the pinyin sequence each syllable corresponds to at least one audio pinyin; based on the spectrogram, the convolutional neural network model determines, for each candidate pinyin in a pinyin library, a probability that the candidate pinyin is the audio pinyin corresponding to the syllable, and selects, according to the probability, at least one candidate pinyin as the audio pinyin corresponding to the syllable.
  5. The speech recognition method according to any one of claims 1 to 4, wherein before the step of using the spectrogram as the input of the pre-trained convolutional neural network model and determining the audio pinyin sequence corresponding to the spectrogram based on the convolutional neural network model, the method further comprises:
    obtaining a sample spectrogram for model training and a sample pinyin sequence corresponding to the sample spectrogram;
    constructing a loss function using the Connectionist Temporal Classification (CTC) algorithm, and training the convolutional neural network model based on the loss function, the sample spectrogram and the sample pinyin sequence.
  6. The speech recognition method according to any one of claims 1 to 5, wherein obtaining the sound-spectrum feature data of the audio file to be recognized and determining the spectrogram corresponding to the audio file based on the sound-spectrum feature data comprises:
    reading the audio file to be recognized, and determining a frequency-domain signal corresponding to the audio file;
    dividing the data into frames based on the frequency-domain signal;
    applying a Hamming window to each frame of data after framing;
    separating, based on a fast Fourier transform, component frequencies corresponding to each windowed frame of data;
    generating the spectrogram from the classified component frequencies and timing information corresponding to each frame of data.
  7. The speech recognition method according to any one of claims 1 to 6, wherein the audio file is a call recording, the pre-stored text pinyin sequence is a pinyin sequence corresponding to a fraud-script text, and the voice content is the fraud script.
  8. A terminal device, comprising:
    an acquisition part, configured to obtain sound-spectrum feature data of an audio file to be recognized, and determine a spectrogram corresponding to the audio file based on the sound-spectrum feature data;
    an analysis part, configured to use the spectrogram as an input of a pre-trained convolutional neural network model, and determine an audio pinyin sequence corresponding to the spectrogram based on the convolutional neural network model;
    an identification part, configured to identify voice content corresponding to a pre-stored text pinyin sequence that matches the audio pinyin sequence as voice content contained in the audio file to be recognized.
  9. A terminal device, comprising a memory, a processor, and a speech recognition program stored in the memory and executable on the processor, wherein the processor implements the method according to any one of claims 1 to 7 when executing the speech recognition program.
  10. A computer-readable storage medium on which a speech recognition program is stored, wherein the speech recognition program, when executed by a processor, implements the speech recognition method according to any one of claims 1 to 7.
  11. A computer program product, comprising computer-readable code, wherein when the computer-readable code runs in an electronic device, a processor in the electronic device executes steps for implementing the speech recognition method according to any one of claims 1 to 7.
PCT/CN2023/075238 2022-03-14 2023-02-09 Speech recognition method, terminal device and computer-readable storage medium WO2023173966A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202210248254.XA CN116798408A (zh) 2022-03-14 2022-03-14 Speech recognition method, terminal device and computer-readable storage medium
CN202210248254.X 2022-03-14

Publications (1)

Publication Number Publication Date
WO2023173966A1 (zh)

Family

ID=88022260

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/075238 WO2023173966A1 (zh) Speech recognition method, terminal device and computer-readable storage medium (priority 2022-03-14, filed 2023-02-09)

Country Status (2)

Country Link
CN (1) CN116798408A (zh)
WO (1) WO2023173966A1 (zh)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103456297A (zh) * 2012-05-29 2013-12-18 中国移动通信集团公司 Method and device for speech recognition matching
CN108040185A (zh) * 2017-12-06 2018-05-15 福建天晴数码有限公司 Method and device for identifying harassing calls
CN110570858A (zh) * 2019-09-19 2019-12-13 芋头科技(杭州)有限公司 Voice wake-up method and apparatus, smart speaker, and computer-readable storage medium
CN111681669A (zh) * 2020-05-14 2020-09-18 上海眼控科技股份有限公司 Neural-network-based speech data recognition method and device
CN112397051A (zh) * 2019-08-16 2021-02-23 武汉Tcl集团工业研究院有限公司 Speech recognition method and apparatus, and terminal device
CN113539247A (zh) * 2020-04-14 2021-10-22 京东数字科技控股有限公司 Voice data processing method, apparatus, device and computer-readable storage medium
CN113744722A (zh) * 2021-09-13 2021-12-03 上海交通大学宁波人工智能研究院 Offline speech recognition matching apparatus and method for a limited sentence library
US20220044675A1 (en) * 2020-08-06 2022-02-10 National Chiao Tung University Method for generating caption file through url of an av platform


Also Published As

Publication number Publication date
CN116798408A (zh) 2023-09-22

Similar Documents

Publication Publication Date Title
US20180301145A1 (en) System and Method for Using Prosody for Voice-Enabled Search
CN111311327A (zh) Artificial-intelligence-based service evaluation method, apparatus, device and storage medium
CN112259106A (zh) Voiceprint recognition method and apparatus, storage medium, and computer device
US11810546B2 (en) Sample generation method and apparatus
EP3989217A1 (en) Method for detecting an audio adversarial attack with respect to a voice input processed by an automatic speech recognition system, corresponding device, computer program product and computer-readable carrier medium
CN110738998A (zh) 基于语音的个人信用评估方法、装置、终端及存储介质
Kopparapu Non-linguistic analysis of call center conversations
CN113314150A (zh) Emotion recognition method and apparatus based on voice data, and storage medium
CN116631412A (zh) Method for identifying a voice robot through voiceprint matching
Pao et al. A study on the search of the most discriminative speech features in the speaker dependent speech emotion recognition
CN109545226A (zh) Speech recognition method, device, and computer-readable storage medium
US10446138B2 (en) System and method for assessing audio files for transcription services
Shah et al. Speech emotion recognition based on SVM using MATLAB
WO2021128847A1 (zh) Terminal interaction method and apparatus, computer device, and storage medium
US20180342235A1 (en) System and method for segmenting audio files for transcription
US10803853B2 (en) Audio transcription sentence tokenization system and method
WO2023173966A1 (zh) Speech recognition method, terminal device and computer-readable storage medium
Reimao Synthetic speech detection using deep neural networks
CN110933236A (zh) Machine-learning-based vacant number identification method
CN113012680B (zh) Script synthesis method and apparatus for a voice robot
CN113053409B (zh) Audio evaluation method and apparatus
KR102415519B1 (ko) Computing apparatus for detecting artificial-intelligence-generated voice
Liu et al. Supra-Segmental Feature Based Speaker Trait Detection.
RU2790946C1 (ru) Method and system for analyzing voice calls to detect and prevent social engineering
CN116959421B (zh) Method and apparatus for processing audio data, audio data processing device, and medium

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 23769477

Country of ref document: EP

Kind code of ref document: A1