WO2022205400A1 - Speech-recognition-based safety early-warning method, apparatus, and terminal device - Google Patents

Speech-recognition-based safety early-warning method, apparatus, and terminal device

Info

Publication number
WO2022205400A1
Authority
WO
WIPO (PCT)
Prior art keywords: voice, feature information, data, recognized, speech
Application number
PCT/CN2021/085180
Other languages
English (en)
French (fr)
Inventor
龙柏君
黄凯明
Original Assignee
深圳市锐明技术股份有限公司
Application filed by 深圳市锐明技术股份有限公司
Priority to PCT/CN2021/085180 (WO2022205400A1)
Priority to CN202180000722.5A (CN113228164A)
Publication of WO2022205400A1

Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L 15/00: Speech recognition
    • G10L 15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L 15/04: Segmentation; Word boundary detection
    • G10L 15/06: Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L 15/063: Training
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/26: Speech to text systems
    • G10L 21/00: Speech or voice signal processing techniques to produce another audible or non-audible signal, e.g. visual or tactile, in order to modify its quality or its intelligibility
    • G10L 21/02: Speech enhancement, e.g. noise reduction or echo cancellation
    • G10L 21/0208: Noise filtering
    • G10L 21/0216: Noise filtering characterised by the method used for estimating noise
    • G10L 21/0232: Processing in the frequency domain

Definitions

  • the present application relates to the technical field of voice data processing, and in particular to a speech-recognition-based security early warning method, apparatus, terminal device, and readable storage medium.
  • surveillance and monitoring equipment is installed in many public places to protect the safety of users' lives and property.
  • in the related supervision methods, the video and audio data are mainly saved in the monitoring equipment and later reviewed by management personnel to find the corresponding dangerous data or evidence; this consumes a lot of manpower, and both the recognition efficiency and the recognition effect are low.
  • one of the purposes of the embodiments of the present application is to provide a speech-recognition-based security early warning method, apparatus, terminal device, and readable storage medium, aiming to solve the problems that the related security management methods consume a lot of manpower and have low recognition efficiency and poor recognition effect.
  • a voice recognition-based security early warning method, including: acquiring voice data; preprocessing the voice data to obtain voice data to be recognized; extracting voice feature information of the voice data to be recognized; processing the voice feature information to obtain a probability value that the voice feature information includes a preset keyword; and
  • determining and executing the corresponding security warning mode according to the probability value.
  • the preprocessing of the voice data to obtain the voice data to be recognized includes:
  • Delete the voice segment that does not contain human voice in the voice data and use the voice data segment containing human voice as the to-be-recognized voice data.
  • the extracting voice feature information of the voice data to be recognized includes:
  • the processing of the voice feature information to obtain a probability value that the voice feature information includes a preset keyword includes:
  • the spectrogram is processed through a pre-trained speech recognition network model to obtain a probability value that the speech feature information contains preset keywords.
  • the pre-trained speech recognition network model is used to process the spectrogram to obtain a probability value that the speech feature information contains preset keywords, including:
  • the spectrogram corresponding to the to-be-recognized speech data is processed through a pre-trained speech recognition network model to obtain a probability value of each preset keyword contained in the to-be-recognized speech data.
  • the determining and executing the corresponding security warning mode according to the probability value includes:
  • the generating and sending an alarm notification to a preset terminal device includes:
  • An alarm notification is generated, and the voice recognition data and the alarm notification are sent to a preset security management terminal.
  • a voice recognition-based security early warning device including:
  • a preprocessing module for preprocessing the voice data to obtain the voice data to be recognized
  • an extraction module used for extracting the speech feature information of the speech data to be recognized
  • a voice processing module configured to process the voice feature information to obtain a probability value that the voice feature information includes preset keywords
  • a determination module configured to determine and execute a corresponding security warning mode according to the probability value.
  • the preprocessing module includes:
  • a frame-by-frame processing unit configured to perform frame-by-frame windowing processing on the voice data to obtain corresponding voice segments
  • a screening unit configured to delete voice segments that do not contain human voices in the voice data, and use the voice data segments containing human voices as the to-be-recognized voice data.
  • the extraction module includes:
  • a first processing unit configured to perform fast Fourier transform processing on the to-be-recognized speech data to obtain Fourier transform feature information
  • a second processing unit configured to perform filtering processing on the Fourier transform feature information to obtain filtering feature information
  • the third processing unit is configured to perform noise reduction processing on the filtering feature information to obtain speech feature information.
  • the speech processing module includes:
  • the second unit is used to extract the spectrogram of the voice feature information
  • the recognition unit is configured to process the spectrogram through a pre-trained speech recognition network model to obtain a probability value that the speech feature information contains preset keywords.
  • the identifying unit includes:
  • the recognition subunit is configured to process the spectrogram corresponding to the speech data to be recognized through a pre-trained speech recognition network model, and obtain a probability value of each preset keyword contained in the speech data to be recognized.
  • the determining module includes:
  • an accumulating unit used for accumulating the total probability value of the same preset keyword contained in the to-be-recognized speech data
  • a detection unit configured to determine that the to-be-recognized voice data contains dangerous information when it is detected that the total probability value is greater than or equal to a preset threshold
  • the generating unit is configured to generate an alarm notification and send it to the preset security management terminal.
  • the generating unit includes:
  • a cutting subunit for cutting the voice data to obtain voice recognition data of preset length
  • the generating subunit is configured to generate an alarm notification, and send the voice recognition data and the alarm notification to a preset security management terminal.
  • a terminal device including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the voice recognition-based security early warning method according to any one of the above first aspects.
  • a computer-readable storage medium storing a computer program which, when executed by a processor, implements the voice recognition-based security early warning method according to any one of the above-mentioned first aspects.
  • a fifth aspect provides a computer program product that, when the computer program product runs on a terminal device, enables the terminal device to execute the voice recognition-based security early warning method according to any one of the above-mentioned first aspects.
  • the beneficial effect of the voice recognition-based security early warning method is that the obtained voice data is preprocessed to obtain the voice data to be recognized, which reduces the amount of computation; the voice feature information of the voice data to be recognized is extracted and processed to identify the probability that it includes preset keywords, and the corresponding security management method is determined according to the probability value; this reduces resource consumption, improves the efficiency and effect of speech recognition, and further improves the efficiency of security management.
  • FIG. 1 is a schematic flowchart of a voice recognition-based security early warning method provided by an embodiment of the present application
  • FIG. 2 is a schematic flowchart of a voice recognition-based security early warning method provided by an embodiment of the present application
  • FIG. 3 is a schematic flowchart of a voice recognition-based security early warning method provided by an embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a pre-trained speech recognition network model provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a voice recognition-based security early warning device provided by an embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • Some embodiments of the present application provide a voice recognition-based security early warning method, which can be applied to terminal devices such as mobile phones, tablet computers, wearable devices, vehicle-mounted devices, and notebook computers.
  • the embodiments of the present application do not limit the specific types of terminal devices.
  • FIG. 1 shows a schematic flow chart of a voice recognition-based security early warning method provided by the present application.
  • the method can be applied to the above-mentioned notebook computer.
  • the voice data obtained by recording at the target location by the recording device is acquired.
  • the recording device includes but is not limited to a voice microphone.
  • Target locations include, but are not limited to, inside public transportation, eg, inside a taxi, inside a bus, inside a subway car, and so on.
  • the speech data is preprocessed: segments that do not contain human voice are identified and deleted, and the segments that do contain human voice are retained as the speech data to be recognized.
  • the preprocessing method includes but is not limited to frame-by-frame windowing processing.
  • the speech feature information in the speech data to be recognized is extracted by a preset method, wherein the preset method includes but is not limited to at least one of a fast Fourier transform algorithm and a noise reduction processing method.
  • the fast Fourier transform (FFT) is the general term for efficient algorithms that compute the discrete Fourier transform (DFT) on a computer.
  • the FFT has the significant advantage of low computational cost, which makes it widely used in the field of signal processing; combined with high-speed hardware, real-time signal processing can be achieved.
  • the noise reduction processing method specifically adopts spectral subtraction.
  • the processing includes: obtaining the spectrum of each frame, detecting human-voice/noise segments with a VAD algorithm, averaging the spectra of the noise segments to estimate the noise spectrum, and subtracting the noise spectrum from the original spectrum to obtain a spectrum that does not contain noise.
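  • The spectral-subtraction step above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name and the boolean noise mask (assumed to come from the VAD stage) are illustrative.

```python
import numpy as np

def spectral_subtraction(frame_spectra, is_noise):
    """Subtract an averaged noise spectrum from each frame's magnitude spectrum.

    frame_spectra: (n_frames, n_bins) magnitude spectrum of each frame
    is_noise:      (n_frames,) boolean mask from a VAD (True = noise-only frame)
    """
    noise_spectrum = frame_spectra[is_noise].mean(axis=0)  # averaged noise spectrum
    cleaned = frame_spectra - noise_spectrum               # subtract it from every frame
    return np.maximum(cleaned, 0.0)                        # clip negative magnitudes to zero

# toy example: 4 frames, 3 bins; frames 0 and 3 are noise-only
spectra = np.array([[1.0, 1.0, 1.0],
                    [5.0, 4.0, 3.0],
                    [6.0, 5.0, 2.0],
                    [1.0, 1.0, 1.0]])
mask = np.array([True, False, False, True])
clean = spectral_subtraction(spectra, mask)
```

  • With this toy input the noise frames subtract to zero and the speech frames keep only the energy above the noise floor.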
  • S104 Process the voice feature information to obtain a probability value that the voice feature information includes a preset keyword.
  • the voice feature information is processed through a pre-trained voice recognition network model to obtain a probability value that the voice feature information contains preset keywords.
  • the preset keywords are multiple keywords preset by the user and used to detect whether the user in the voice data has personal and property safety problems or whether it poses a threat to others.
  • the preset keywords include "harassment words", "words that threaten personal and property safety", "abusive words", or "words for help".
  • a corresponding security warning method is determined and executed, so as to protect the personal and property safety of the user corresponding to the voice data to be recognized, or to eliminate the threat the user poses to others.
  • the preset threshold ranges from 0.5 to 1.0; the total probability value of a given preset keyword is the sum of its accumulated probability values, mapped through the sigmoid function into the range 0 to 1.
  • for example, with the preset threshold set to 0.8, when the total probability value of a preset keyword contained in the speech data to be recognized is detected to be 0.9, it is determined that the total probability value of that keyword is greater than the preset threshold.
  • the step S102 includes:
  • the voice data is converted from an analog signal into a one-dimensional discrete digital signal, and the one-dimensional discrete digital signal is subjected to sliding-window, framing, and windowing processing to obtain speech segments;
  • the start and end points of the speech segments are then detected by a voice activity detection (VAD) algorithm.
  • the size of the sliding window and the moving step in the sliding window processing can be set according to user requirements. For example, the window size is set to 25 ms and the moving step is 10 ms.
  • in the sliding-window processing, the window size is set to 25 ms of audio and the window shift to 10 ms; the continuous input audio signal is processed window by window to obtain the data of each frame.
  • framing divides the audio signal into frames through the sliding-window processing; for example, a 1000 ms audio signal is divided into about 100 frames of equal data length.
  • windowing refers to multiplying each 25 ms of window data obtained by the sliding-window processing by the Hamming window function to obtain the windowed data. Sliding-window, framing, and windowing processing are the most basic processing methods in the speech field and are necessary steps before applying the FFT.
  • each speech segment obtained after the frame-by-frame windowing corresponds to one time frame.
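  • The sliding-window, framing, and Hamming-windowing steps can be sketched as follows. This is a minimal NumPy sketch; the 16 kHz sample rate and function name are illustrative assumptions (at 16 kHz a 25 ms window is the 400 samples mentioned later in the text).

```python
import numpy as np

def frame_and_window(signal, sample_rate=16000, win_ms=25, hop_ms=10):
    """Slide a 25 ms window with a 10 ms hop over the signal and apply a Hamming window."""
    win = int(sample_rate * win_ms / 1000)   # samples per window (400 at 16 kHz)
    hop = int(sample_rate * hop_ms / 1000)   # samples per hop (160 at 16 kHz)
    n_frames = 1 + (len(signal) - win) // hop
    hamming = np.hamming(win)
    # stack each windowed frame: frame i covers samples [i*hop, i*hop + win)
    return np.stack([signal[i * hop: i * hop + win] * hamming for i in range(n_frames)])

# 1 second of audio at 16 kHz -> 98 full 25 ms frames with a 10 ms hop
audio = np.random.default_rng(0).standard_normal(16000)
frames = frame_and_window(audio)
```

  • Counting only full windows, 1 s of audio yields 98 frames here, slightly fewer than the roughly 100 frames obtained when partial end frames are padded.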
  • determining the speech segments that contain human voice includes:
  • calculating the square of the amplitude of the speech segments over each 101 time frames as the short-term energy value, counting the number of times the signal crosses the zero axis within the data length of each 101 time frames, and applying the double-threshold method;
  • the voice activity detection (Voice Activity Detection, VAD) algorithm then detects the start and end points of the speech segment to determine the segments containing human voice.
  • only the speech segments determined to contain human voice are processed as the to-be-recognized speech data, which reduces the amount of data computation and improves speech recognition efficiency.
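  • A toy sketch of the double-threshold idea (short-term energy plus zero-crossing count) follows. Real VAD implementations use adaptive thresholds and boundary-extension logic, so the simple per-frame AND rule below is an illustrative assumption, not the patent's algorithm.

```python
import numpy as np

def double_threshold_vad(frames, energy_thr, zcr_thr):
    """Mark a frame as speech when both its short-term energy and its
    zero-crossing count reach their thresholds (simplified double-threshold rule)."""
    energy = (frames ** 2).sum(axis=1)                           # short-term energy per frame
    zcr = (np.diff(np.sign(frames), axis=1) != 0).sum(axis=1)    # zero crossings per frame
    return (energy >= energy_thr) & (zcr >= zcr_thr)

# toy frames: one silent frame, one alternating-sign "voiced" frame
frames = np.array([[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],
                   [1.0, -1.0, 1.0, -1.0, 1.0, -1.0, 1.0, -1.0]])
mask = double_threshold_vad(frames, energy_thr=1.0, zcr_thr=2)
```

  • Frames where the mask is False would be deleted, keeping only voiced segments as the to-be-recognized speech data.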
  • the step S103 includes:
  • a fast Fourier transform is performed on each windowed speech segment containing human voice (that is, the to-be-recognized speech data of each time frame) to obtain Fourier transform feature information; the Fourier transform feature information is input into a mel filter bank for filtering to obtain the filtered mel eigenvalues, and the logarithm of the filtered mel eigenvalues is taken to obtain the corresponding filtering feature information;
  • spectral subtraction is then applied to the spectrum of the filtering feature information for noise reduction, yielding the noise-reduced speech feature information of each time frame.
  • the number of sampling points can be set as required; for example, if each frame of the speech segment collects 400 sampling points, 200 Fourier transform feature values and 48 filtered mel feature values are obtained.
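  • The FFT-to-log-mel pipeline above can be sketched as follows. This is a minimal sketch under assumptions: a 16 kHz sample rate, the standard HTK-style mel formula, and illustrative function names; note that the one-sided spectrum of a 400-point FFT has 201 bins including DC, which roughly corresponds to the 200 Fourier feature values stated in the text.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_features(frame, sample_rate=16000, n_mels=48):
    """One windowed frame -> FFT magnitude -> triangular mel filter bank -> log."""
    n_fft = len(frame)                        # 400 samples per frame in the example
    spectrum = np.abs(np.fft.rfft(frame))     # n_fft//2 + 1 = 201 magnitude bins
    # place n_mels + 2 points evenly on the mel scale between 0 Hz and Nyquist
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_mels + 2)
    bin_pts = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sample_rate).astype(int)
    fbank = np.zeros((n_mels, len(spectrum)))
    for i in range(n_mels):                   # build triangular filters
        l, c, r = bin_pts[i], bin_pts[i + 1], bin_pts[i + 2]
        if c > l:
            fbank[i, l:c] = (np.arange(l, c) - l) / (c - l)
        if r > c:
            fbank[i, c:r] = (r - np.arange(c, r)) / (r - c)
    mel_energies = fbank @ spectrum
    return np.log(mel_energies + 1e-8)        # log compression

frame = np.hamming(400) * np.sin(2 * np.pi * 1000 * np.arange(400) / 16000)
feats = log_mel_features(frame)               # 48 log-mel feature values
```

  • Stacking these 48-value vectors over 200 time frames gives the 200x48 input that the network section below describes.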
  • the step S104 includes:
  • S1042. Process the spectrogram by using a pre-trained speech recognition network model to obtain a probability value that the speech feature information includes a preset keyword.
  • the spectrogram of the speech feature information is extracted as input data, and a pre-trained speech recognition network model (Convolutional, Long Short-Term Memory, fully connected Deep Neural Network, CLDNN) processes this input data to output the probability value of each preset keyword contained in the speech feature information.
  • FIG. 4 a schematic structural diagram of a pre-trained speech recognition network model is provided.
  • the pre-trained speech recognition network model consists of 3 convolutional neural network layers, a long short-term memory (LSTM, Long Short-Term Memory) network layer, and 1 fully connected layer fc.
  • the long short-term memory network layer is a network composed of five LSTM structures connected in series.
  • the input size of the pre-trained speech recognition network model is set to 200x48; the convolution kernel sizes of the three convolutional layers are [41x21], [21x11], and [21x11], the numbers of convolution kernels are [32, 64, 96], and the convolution strides are [2x1, 2x2, 2x2], giving a convolutional-layer output of 25x12x96.
  • the reshape function is then used to convert the 3-dimensional output of the convolutional layers into 2-dimensional data;
  • the size of the reshaped output is [25x1152] (where 25 is the number of frames on the time axis and 1152 is the feature dimension on the frequency axis).
  • the reshaped data is passed into the long short-term memory (LSTM, Long Short-Term Memory) network layer as input (the LSTM layer is composed of 5 LSTM structures in series, each with 256 hidden units); the output of the LSTM layer has size 25x256.
  • this output is processed by a logistic regression (softmax) function and finally passed through the linear fully connected layer fc (whose hidden-unit dimension is the number of keywords to be recognized, NumCls), giving a corresponding output of 25xNumCls, i.e. the probability value of each keyword at each time frame.
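  • The convolutional-stack shapes quoted above can be checked arithmetically. Assuming "same" padding, so that only the strides affect the spatial size (an assumption, but it is the one that reproduces the 25x12x96 figure in the text):

```python
import math

def conv_out(size, stride):
    """Output length of a 'same'-padded strided convolution: ceil(input / stride)."""
    return math.ceil(size / stride)

def cldnn_conv_shapes(time=200, freq=48):
    """Trace the 200x48 input through three conv layers with strides
    [2x1, 2x2, 2x2] and channel counts [32, 64, 96]."""
    strides = [(2, 1), (2, 2), (2, 2)]
    channels = [32, 64, 96]
    shapes = []
    for (st, sf), ch in zip(strides, channels):
        time, freq = conv_out(time, st), conv_out(freq, sf)
        shapes.append((time, freq, ch))
    return shapes

shapes = cldnn_conv_shapes()
```

  • The final entry is (25, 12, 96), and flattening the last two axes gives the 25x1152 tensor fed to the LSTM layer.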
  • the step S1042 includes:
  • the spectrogram corresponding to the to-be-recognized speech data is processed through a pre-trained speech recognition network model to obtain a probability value of each preset keyword contained in the to-be-recognized speech data.
  • the pre-trained speech recognition network model is used to process the spectrograms corresponding to the speech data to be recognized in all time frames to obtain an output result.
  • the output result can be viewed as a plot whose horizontal axis is the time frame and whose vertical axis is the probability value of each preset keyword at that time frame; that is, the output represents the probability that the to-be-recognized speech data of each time frame contains each preset keyword.
  • the step S105 includes:
  • the probability values of the same preset keyword over a preset number of time frames of the to-be-recognized speech data are accumulated into a total probability value; when the total probability value is detected to be greater than or equal to the preset threshold,
  • an alarm notification is generated immediately and sent to the preset security management terminal.
  • the preset security management terminal is the terminal device of the security management personnel preset by the user, which includes but is not limited to the terminal device of the security management platform and the terminal device used by the police to receive police.
  • the preset threshold can be specifically set according to actual needs.
  • for example, with the preset threshold set to 0.7, when the accumulated total probability that the to-be-recognized speech data contains "harassing words" is detected to be 9.8 (exceeding the threshold after the sigmoid mapping), it is determined that the user may pose a threat to others.
  • the preset number can be set according to the actual situation; in general, the preset number of time frames is set to all time frames, i.e. the total probability value of the same preset keyword contained in the to-be-recognized speech data of all time frames is accumulated.
  • alternatively, the preset number is set as a ratio; for example, setting the preset number to 70% corresponds to accumulating the total probability value of the same preset keyword over 70% of the time frames.
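  • The accumulate-sigmoid-threshold decision described above can be sketched as follows. The function name, the toy probabilities, and the simple "first fraction of frames" selection are illustrative assumptions.

```python
import numpy as np

def keyword_alarm(frame_probs, keywords, threshold=0.8, frame_ratio=1.0):
    """Accumulate per-frame keyword probabilities over a chosen fraction of the
    time frames, squash each total with a sigmoid, and flag keywords over threshold.

    frame_probs: (n_frames, n_keywords) per-frame output of the recognition model
    """
    n_frames = frame_probs.shape[0]
    n_used = max(1, int(n_frames * frame_ratio))   # all frames, or e.g. 70% of them
    totals = frame_probs[:n_used].sum(axis=0)      # per-keyword accumulated total
    mapped = 1.0 / (1.0 + np.exp(-totals))         # sigmoid maps totals into (0, 1)
    return [kw for kw, p in zip(keywords, mapped) if p >= threshold]

# toy per-frame probabilities for two keywords over three time frames
probs = np.array([[0.9, 0.1],
                  [0.8, 0.0],
                  [0.7, 0.1]])
alarms = keyword_alarm(probs, ["help", "harassment"], threshold=0.8)
```

  • Here "help" accumulates to 2.4, whose sigmoid (about 0.92) exceeds the 0.8 threshold, so only that keyword would trigger an alarm.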
  • the generating and sending an alarm notification to a preset terminal device includes:
  • An alarm notification is generated, and the voice recognition data and the alarm notification are sent to a preset security management terminal.
  • the voice data is cut to obtain speech recognition data of a preset duration containing the preset keyword as evidence, and after the alarm notification is generated, the speech recognition data and the alarm notification are sent together to the preset safety management terminal, so that safety management personnel can quickly classify the danger in the voice data and take the corresponding safety management measures.
  • the preset duration can be specifically set according to actual needs. For example, setting the preset duration as 30s corresponds to acquiring speech recognition data with a duration of 30s centered on the above-mentioned preset keyword.
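  • The 30 s evidence-clip extraction can be sketched as below. The function name and keyword timestamp are assumptions; the text only specifies a clip of preset duration centered on the preset keyword.

```python
def extract_evidence_clip(audio, sample_rate, keyword_time_s, clip_s=30):
    """Cut a clip of clip_s seconds centered on the keyword's position in the audio,
    clamped to the audio boundaries."""
    half = int(clip_s * sample_rate) // 2
    center = int(keyword_time_s * sample_rate)
    start = max(0, center - half)
    end = min(len(audio), center + half)
    return audio[start:end]

# toy example: 60 s of "audio" at 100 samples/s; keyword detected at t = 40 s
audio = list(range(60 * 100))
clip = extract_evidence_clip(audio, sample_rate=100, keyword_time_s=40, clip_s=30)
```

  • The clamping at the boundaries means a keyword near the start or end of the recording yields a shorter clip rather than an index error.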
  • the alarm notification may include, but is not limited to, preset keywords whose total probability value is greater than a preset threshold.
  • the form of the alarm notification can be set according to actual needs: for example, a voice segment containing the preset keyword may be intercepted as the alarm notification, or the voice segment may be converted into text form as the alarm notification.
  • for example, when the total probability value of "harassment words" in the to-be-recognized speech data is detected to be greater than the preset threshold, the speech recognition data of the preset duration containing the "harassment words" is used as evidence and, together with the text-form alarm notification containing the "harassment words", is sent to the preset security management terminal.
  • the corresponding preset security management terminals are different, and the corresponding security management personnel are also different.
  • the corresponding preset safety management terminals include but are not limited to the public security management platform and the taxi safety management platform, and the corresponding safety management personnel include but are not limited to the people's police and the safety management personnel of the taxi company.
  • the to-be-recognized voice data is obtained by preprocessing the acquired voice data, which reduces the amount of computation; the voice feature information of the to-be-recognized voice data is extracted and processed to obtain the probability that it contains preset keywords, and
  • the corresponding security management method is determined according to the probability value, which reduces resource consumption, improves the efficiency and effect of speech recognition, and further improves the efficiency of security management.
  • FIG. 5 shows a structural block diagram of the voice recognition-based security early warning device provided by an embodiment of the present application; for ease of description, only the parts related to the embodiment are shown.
  • the voice recognition-based security early warning device includes: a processor, wherein the processor is configured to execute the following program modules stored in the memory: an acquisition module, configured to acquire voice data;
  • a preprocessing module is used to preprocess the voice data to obtain the voice data to be recognized; an extraction module is used to extract the voice feature information of the voice data to be recognized; a voice processing module is used to process the voice feature information to obtain a probability value that the voice feature information includes preset keywords; and a determining module is configured to determine and execute a corresponding security early warning mode according to the probability value.
  • the voice recognition-based safety warning device 100 includes:
  • a preprocessing module 102 configured to preprocess the voice data to obtain voice data to be recognized
  • Extraction module 103 for extracting the voice feature information of the voice data to be recognized
  • a voice processing module 104 configured to process the voice feature information to obtain a probability value that the voice feature information includes preset keywords
  • the determining module 105 is configured to determine and execute a corresponding security warning mode according to the probability value.
  • the preprocessing module 102 includes:
  • a frame-by-frame processing unit configured to perform frame-by-frame windowing processing on the voice data to obtain corresponding voice segments
  • a screening unit configured to delete voice segments that do not contain human voices in the voice data, and use the voice data segments containing human voices as the to-be-recognized voice data.
  • the extraction module 103 includes:
  • a first processing unit configured to perform fast Fourier transform processing on the to-be-recognized speech data to obtain Fourier transform feature information
  • a second processing unit configured to perform filtering processing on the Fourier transform feature information to obtain filtering feature information
  • the third processing unit is configured to perform noise reduction processing on the filtering feature information to obtain speech feature information.
  • the speech processing module 104 includes:
  • the second unit is used to extract the spectrogram of the voice feature information
  • the recognition unit is configured to process the spectrogram through a pre-trained speech recognition network model to obtain a probability value that the speech feature information contains preset keywords.
  • the identifying unit includes:
  • the recognition subunit is configured to process the spectrogram corresponding to the speech data to be recognized through a pre-trained speech recognition network model, and obtain a probability value of each preset keyword contained in the speech data to be recognized.
  • the determining module 105 includes:
  • an accumulating unit used for accumulating the total probability value of the same preset keyword contained in the to-be-recognized speech data
  • a detection unit configured to determine that the to-be-recognized voice data contains dangerous information when it is detected that the total probability value is greater than or equal to a preset threshold
  • the generating unit is configured to generate an alarm notification and send it to the preset security management terminal.
  • the generating unit includes:
  • a cutting subunit for cutting the voice data to obtain voice recognition data of preset length
  • the generating subunit is configured to generate an alarm notification, and send the voice recognition data and the alarm notification to a preset security management terminal.
  • the to-be-recognized voice data is obtained by preprocessing the acquired voice data, which reduces the amount of computation; the voice feature information of the to-be-recognized voice data is extracted and processed to obtain the probability that it contains preset keywords, and
  • the corresponding security management method is determined according to the probability value, which reduces resource consumption, improves the efficiency and effect of speech recognition, and further improves the efficiency of security management.
  • FIG. 6 is a schematic structural diagram of a terminal device provided by an embodiment of the present application.
  • the terminal device 6 in this embodiment includes: at least one processor 60 (only one is shown in FIG. 6), a memory 61, and a computer program 62 stored in the memory 61 and executable on the at least one processor 60; when the processor 60 executes the computer program 62, the steps in any of the foregoing speech-recognition-based security early warning method embodiments are implemented.
  • the terminal device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the terminal device may include, but is not limited to, a processor 60 and a memory 61 .
  • FIG. 6 is only an example of the terminal device 6 and does not constitute a limitation on the terminal device 6; it may include more or fewer components than shown, combine certain components, or use different components, and may, for example, also include input/output devices, network access devices, and the like.
  • the so-called processor 60 may be a central processing unit (Central Processing Unit, CPU); the processor 60 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc.
  • a general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
  • the memory 61 may be an internal storage unit of the terminal device 6 in some embodiments, such as a hard disk or a memory of the terminal device 6 .
  • the memory 61 may also be an external storage device of the terminal device 6 in other embodiments, such as a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital card (Secure Digital, SD), a flash memory card (Flash Card), etc.
  • the memory 61 may also include both an internal storage unit of the terminal device 6 and an external storage device.
  • the memory 61 is used to store an operating system, an application program, a boot loader (BootLoader), data, and other programs, such as program codes of the computer program.
  • the memory 61 can also be used to temporarily store data that has been output or will be output.
  • Embodiments of the present application further provide a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the steps in the foregoing method embodiments can be implemented.
  • the embodiments of the present application provide a computer program product which, when run on a mobile terminal, causes the mobile terminal to implement the steps in the foregoing method embodiments.
  • the integrated unit if implemented in the form of a software functional unit and sold or used as an independent product, may be stored in a computer-readable storage medium.
  • all or part of the processes in the methods of the above embodiments can be implemented by a computer program to instruct the relevant hardware.
  • the computer program can be stored in a computer-readable storage medium, and when the computer program is executed by a processor, the steps of each of the above method embodiments can be implemented.
  • the computer program includes computer program code, and the computer program code may be in the form of source code, object code, executable file or some intermediate form, and the like.
  • the computer-readable medium may include at least: any entity or device capable of carrying computer program codes to the photographing device/terminal device, recording medium, computer memory, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), electrical carrier signals, telecommunication signals, and software distribution media.
  • examples of software distribution media include a USB drive, a removable hard disk, a magnetic disk or an optical disk.
  • in some jurisdictions, according to legislation and patent practice, computer-readable media may not be electrical carrier signals and telecommunication signals.
  • the disclosed apparatus/network device and method may be implemented in other manners.
  • the apparatus/network device embodiments described above are only illustrative.
  • the division of the modules or units is only a logical functional division; in actual implementation there may be other division methods, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.

Landscapes

  • Engineering & Computer Science (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Multimedia (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Quality & Reliability (AREA)
  • Signal Processing (AREA)
  • Alarm Systems (AREA)
  • Telephonic Communication Services (AREA)

Abstract

A speech-recognition-based security early warning method, apparatus and terminal device. The method includes: acquiring voice data (S101); preprocessing the voice data to obtain voice data to be recognized (S102); extracting voice feature information from the voice data to be recognized (S103); processing the voice feature information to obtain the probability that the voice feature information contains preset keywords (S104); and determining and executing the corresponding security early warning measure according to the probability value (S105). By recognizing the probability that the voice data to be recognized contains preset keywords and determining the corresponding security management measure accordingly, the method reduces the amount of computation, improves the efficiency and accuracy of speech recognition, and thereby improves the efficiency of security management.

Description

Speech-Recognition-Based Security Early Warning Method, Apparatus and Terminal Device — Technical Field
This application relates to the technical field of voice data processing, and in particular to a speech-recognition-based security early warning method, apparatus, terminal device and readable storage medium.
Background
With the development of technology and the improvement of living standards, surveillance and audio-monitoring equipment has been installed in many kinds of public places to protect users' lives and property.
However, related supervision methods mainly store the video and audio data captured by such equipment, and administrators later review the recordings to locate the relevant dangerous data or evidence. This consumes a great deal of manpower, and both the efficiency and the accuracy of recognition are low.
Technical Problem
One objective of the embodiments of this application is to provide a speech-recognition-based security early warning method, apparatus, terminal device and readable storage medium, aiming to solve the problem that related security management methods consume a great deal of manpower and suffer from low recognition efficiency and poor recognition accuracy.
Technical Solution
To solve the above technical problem, the embodiments of this application adopt the following technical solution:
In a first aspect, a speech-recognition-based security early warning method is provided, including:
acquiring voice data;
preprocessing the voice data to obtain voice data to be recognized;
extracting voice feature information from the voice data to be recognized;
processing the voice feature information to obtain the probability that the voice feature information contains preset keywords;
determining and executing the corresponding security early warning measure according to the probability value.
In one embodiment, preprocessing the voice data to obtain the voice data to be recognized includes:
performing framing and windowing on the voice data to obtain corresponding voice segments;
deleting the voice segments that do not contain human voice, and taking the segments of the voice data that do contain human voice as the voice data to be recognized.
In one embodiment, extracting the voice feature information of the voice data to be recognized includes:
performing a fast Fourier transform on the voice data to be recognized to obtain Fourier transform feature information;
filtering the Fourier transform feature information to obtain filtered feature information;
denoising the filtered feature information to obtain the voice feature information.
In one embodiment, processing the voice feature information to obtain the probability that the voice feature information contains preset keywords includes:
extracting the spectrogram of the voice feature information;
processing the spectrogram with a pre-trained speech recognition network model to obtain the probability that the voice feature information contains preset keywords.
In one embodiment, processing the spectrogram with a pre-trained speech recognition network model to obtain the probability that the voice feature information contains preset keywords includes:
processing the spectrogram corresponding to the voice data to be recognized with the pre-trained speech recognition network model, to obtain the probability that the voice data to be recognized contains each preset keyword.
In one embodiment, determining and executing the corresponding security early warning measure according to the probability value includes:
accumulating the total probability that the voice data to be recognized contains the same preset keyword;
when the total probability is detected to be greater than or equal to a preset threshold, determining that the voice data to be recognized contains dangerous information;
generating an alarm notification and sending it to a preset security management terminal.
In one embodiment, generating an alarm notification and sending it to the preset terminal device includes:
cutting the voice data to obtain speech recognition data of a preset length;
generating an alarm notification, and sending the speech recognition data together with the alarm notification to the preset security management terminal.
In a second aspect, a speech-recognition-based security early warning apparatus is provided, including:
an acquisition module for acquiring voice data;
a preprocessing module for preprocessing the voice data to obtain voice data to be recognized;
an extraction module for extracting the voice feature information of the voice data to be recognized;
a voice processing module for processing the voice feature information to obtain the probability that the voice feature information contains preset keywords;
a determination module for determining and executing the corresponding security early warning measure according to the probability value.
In one embodiment, the preprocessing module includes:
a framing unit for performing framing and windowing on the voice data to obtain corresponding voice segments;
a filtering unit for deleting the voice segments that do not contain human voice, and taking the segments of the voice data that do contain human voice as the voice data to be recognized.
In one embodiment, the extraction module includes:
a first processing unit for performing a fast Fourier transform on the voice data to be recognized to obtain Fourier transform feature information;
a second processing unit for filtering the Fourier transform feature information to obtain filtered feature information;
a third processing unit for denoising the filtered feature information to obtain the voice feature information.
In one embodiment, the voice processing module includes:
a second unit for extracting the spectrogram of the voice feature information;
a recognition unit for processing the spectrogram with a pre-trained speech recognition network model to obtain the probability that the voice feature information contains preset keywords.
In one embodiment, the recognition unit includes:
a recognition subunit for processing the spectrogram corresponding to the voice data to be recognized with the pre-trained speech recognition network model, to obtain the probability that the voice data to be recognized contains each preset keyword.
In one embodiment, the determination module includes:
an accumulation unit for accumulating the total probability that the voice data to be recognized contains the same preset keyword;
a detection unit for determining, when the total probability is detected to be greater than or equal to a preset threshold, that the voice data to be recognized contains dangerous information;
a generation unit for generating an alarm notification and sending it to a preset security management terminal.
In one embodiment, the generation unit includes:
a cutting subunit for cutting the voice data to obtain speech recognition data of a preset length;
a generation subunit for generating an alarm notification, and sending the speech recognition data together with the alarm notification to the preset security management terminal.
In a third aspect, a terminal device is provided, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, where the processor, when executing the computer program, implements the speech-recognition-based security early warning method of any item of the above first aspect.
In a fourth aspect, a computer-readable storage medium is provided, storing a computer program which, when executed by a processor, implements the speech-recognition-based security early warning method of any item of the above first aspect.
In a fifth aspect, a computer program product is provided which, when run on a terminal device, causes the terminal device to execute the speech-recognition-based security early warning method of any item of the above first aspect.
Beneficial Effects
The benefit of the speech-recognition-based security early warning method provided by the embodiments of this application is: the acquired voice data is preprocessed to obtain voice data to be recognized, reducing the amount of computation; voice feature information is extracted from the voice data to be recognized and processed to recognize the probability that it contains preset keywords, so that the corresponding security management measure is determined according to the probability value. This reduces resource consumption, improves the efficiency and accuracy of speech recognition, and thereby improves the efficiency of security management.
Brief Description of the Drawings
To explain the technical solutions in the embodiments of this application more clearly, the drawings needed in the description of the embodiments or exemplary techniques are briefly introduced below. Obviously, the drawings described below are only some embodiments of this application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
FIG. 1 is a schematic flowchart of the speech-recognition-based security early warning method provided by an embodiment of this application;
FIG. 2 is a schematic flowchart of the speech-recognition-based security early warning method provided by an embodiment of this application;
FIG. 3 is a schematic flowchart of the speech-recognition-based security early warning method provided by an embodiment of this application;
FIG. 4 is a schematic structural diagram of the pre-trained speech recognition network model provided by an embodiment of this application;
FIG. 5 is a schematic structural diagram of the speech-recognition-based security early warning apparatus provided by an embodiment of this application;
FIG. 6 is a schematic structural diagram of the terminal device provided by an embodiment of this application.
Embodiments of the Invention
To make the objectives, technical solutions and advantages of this application clearer, this application is further described in detail below with reference to the drawings and embodiments. It should be understood that the specific embodiments described here are only intended to explain the invention and are not intended to limit this application.
It should be noted that when a component is described as being "fixed to" or "arranged on" another component, it may be directly or indirectly on that other component. When a component is described as being "connected to" another component, it may be directly or indirectly connected to it. Orientation terms such as "upper", "lower", "left" and "right" are based on the orientations shown in the drawings, are used only for ease of description, and do not indicate or imply that the referenced device or element must have a particular orientation or be constructed and operated in a particular orientation; they therefore cannot be understood as limiting this application, and a person of ordinary skill in the art can understand the specific meanings of these terms according to the specific situation. The terms "first" and "second" are used only for ease of description and cannot be understood as indicating or implying relative importance or implicitly specifying the number of technical features. "Multiple" means two or more, unless explicitly and specifically defined otherwise.
To explain the technical solution provided by this application, a detailed description follows with reference to the specific drawings and embodiments.
The speech-recognition-based security early warning method provided by some embodiments of this application can be applied to terminal devices such as mobile phones, tablet computers, wearable devices, in-vehicle devices and notebook computers; the embodiments of this application place no restriction on the specific type of terminal device.
FIG. 1 shows a schematic flowchart of the speech-recognition-based security early warning method provided by this application. By way of example and not limitation, the method can be applied to the above notebook computer.
S101: acquire voice data.
In a specific application, the voice data recorded at a target site by a recording device is acquired. The recording device includes but is not limited to a voice microphone. The target site includes but is not limited to the interior of public transport, for example the interior of a taxi, a bus or a subway car.
S102: preprocess the voice data to obtain voice data to be recognized.
In a specific application, the voice data is preprocessed: the segments of the voice data that do not contain human voice are identified and deleted, and the segments that do contain human voice are taken as the voice data to be recognized. The preprocessing method includes but is not limited to framing and windowing.
S103: extract the voice feature information of the voice data to be recognized.
In a specific application, the voice feature information is extracted from the voice data to be recognized by a preset method, where the preset method includes but is not limited to at least one of a fast Fourier transform algorithm and a denoising method. The fast Fourier transform (FFT) is the general name for efficient, fast computer algorithms for computing the discrete Fourier transform (DFT). The FFT's notably low computational cost has made it widely used in signal processing; combined with high-speed hardware, it enables real-time processing of signals, for example the analysis and synthesis of speech signals, TDM/FDM multiplexing conversion in fully digital communication systems, frequency-domain filtering and correlation analysis, and spectral analysis of radar, sonar and vibration signals to improve the resolution of target search and tracking. It can be said that the advent of the FFT played an important role in the development of digital signal processing. The denoising method used here is spectral subtraction. The process is: obtain the spectrum of each frame; detect voice/noise segments with a VAD algorithm; average the spectra of the noise segments; and subtract the noise spectrum from the original spectrum to obtain a spectrum free of noise.
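The spectral-subtraction step described above can be sketched as follows. This is a minimal illustration under assumptions of the sketch, not the patent's implementation: the toy magnitude spectra, the boolean VAD mask and the clamping of negative magnitudes to zero are all choices made here for demonstration.

```python
import numpy as np

def spectral_subtraction(frame_spectra, noise_mask):
    """Subtract the average noise magnitude spectrum from every frame.

    frame_spectra: (n_frames, n_bins) magnitude spectra of all frames
    noise_mask:    boolean array marking the frames a VAD labelled as noise
    """
    noise_profile = frame_spectra[noise_mask].mean(axis=0)  # average noise spectrum
    cleaned = frame_spectra - noise_profile                 # subtract per frequency bin
    return np.maximum(cleaned, 0.0)                         # clamp negative magnitudes

# toy example: 4 frames, 3 bins; frames 0 and 1 are pure "noise"
spectra = np.array([[1.0, 1.0, 1.0],
                    [1.0, 1.0, 1.0],
                    [3.0, 2.0, 1.0],
                    [2.0, 1.0, 0.5]])
mask = np.array([True, True, False, False])
out = spectral_subtraction(spectra, mask)
```

After subtraction, the noise-only frames collapse to zero while the voiced frames keep only the energy above the noise floor.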
S104: process the voice feature information to obtain the probability that the voice feature information contains preset keywords.
In a specific application, the voice feature information is processed by a pre-trained speech recognition network model to obtain the probability that it contains preset keywords. The preset keywords are multiple keywords set in advance by the user for detecting whether a person in the voice data faces a threat to personal or property safety, or poses a threat to others. For example, the preset keywords include "harassment terms", "terms threatening personal or property safety", "abusive terms" or "calls for help".
S105: determine and execute the corresponding security early warning measure according to the probability value.
In a specific application, when the total probability that the voice data to be recognized contains the same preset keyword is detected to exceed a preset threshold, the corresponding security early warning measure is determined and executed, so as to protect the personal and property safety of the user corresponding to the voice data to be recognized, or to eliminate the threat that user poses to others. The preset threshold ranges from 0.5 to 1.0. The total probability of the same preset keyword refers to the accumulated sum of the probabilities of that keyword, mapped through a sigmoid function into the probability range 0 to 1. For example, with the preset threshold set to 0.8, when the total probability that the voice data to be recognized contains the same preset keyword is detected to be 0.9, that keyword's total probability is judged to exceed the preset threshold.
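The sigmoid mapping of the accumulated keyword probabilities can be sketched as below. This is a minimal illustration; the per-frame probability values and the 0.8 threshold are taken from the example in the text, while the helper names are assumptions of the sketch.

```python
import math

def sigmoid(x):
    """Map any real value into the (0, 1) range."""
    return 1.0 / (1.0 + math.exp(-x))

def keyword_alert(frame_probs, threshold=0.8):
    """Accumulate the per-frame probabilities of one keyword and map the
    sum through a sigmoid before comparing with the preset threshold."""
    total = sigmoid(sum(frame_probs))
    return total >= threshold, total

alert, score = keyword_alert([0.9, 0.8, 0.7])   # sum = 2.4, sigmoid(2.4) ~ 0.92
quiet, low = keyword_alert([0.1, 0.1])          # sum = 0.2, sigmoid(0.2) ~ 0.55
```

A large accumulated sum saturates near 1, so repeated detections of the same keyword push the score past the threshold even if no single frame is fully confident.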
As shown in FIG. 2, in one embodiment, step S102 includes:
S1021: perform framing and windowing on the voice data to obtain corresponding voice segments;
S1022: delete the voice segments that do not contain human voice, and take the segments of the voice data that do contain human voice as the voice data to be recognized.
In a specific application, the voice data is converted from an analog signal into a one-dimensional discrete digital signal by a preset analog-to-digital conversion module; sliding-window, framing and windowing processing is applied to the one-dimensional discrete digital signal to obtain framed and windowed voice segments; the start and end points of the voice segments are detected by a voice activity detection (VAD) algorithm; the segments containing human voice are taken as the voice data to be recognized, and the segments not containing human voice are deleted. The window size and hop size of the sliding window can be set according to user needs, for example a window size of 25 ms and a hop of 10 ms. Sliding-window processing: with the window set to 25 ms of audio and a hop of 10 ms, the continuous input audio signal is traversed to obtain the data of each frame. Framing splits the audio signal into frames via the sliding window; for example, a 1000 ms audio signal is split by the sliding window into about 100 frames, the exact number being determined by the actual data length of the audio signal. Windowing means multiplying the 25 ms of window data obtained by the sliding window by a Hamming window function to obtain the windowed data. Sliding-window, framing and windowing are the most basic processing methods in the speech field and are necessary steps for obtaining the FFT.
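The 25 ms / 10 ms framing and Hamming windowing described above can be sketched as follows. A 16 kHz sample rate is an assumption of the sketch (consistent with the 400 samples per 25 ms frame mentioned later in the text); the function name is also assumed.

```python
import numpy as np

def frame_signal(signal, sample_rate=16000, win_ms=25, hop_ms=10):
    """Slice a 1-D signal into overlapping frames and apply a Hamming window.

    Window and hop sizes follow the 25 ms / 10 ms figures in the text.
    """
    win = int(sample_rate * win_ms / 1000)   # 400 samples at 16 kHz
    hop = int(sample_rate * hop_ms / 1000)   # 160 samples at 16 kHz
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack([signal[i * hop : i * hop + win] for i in range(n_frames)])
    return frames * np.hamming(win)          # taper each frame's edges

sig = np.ones(16000)            # 1 s of dummy audio at 16 kHz
frames = frame_signal(sig)
```

With a 1 s input this yields 98 full frames of 400 samples each, close to the "about 100 frames per 1000 ms" figure in the text (the exact count depends on how partial trailing windows are handled).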
It can be understood that the size of a voice segment after framing and windowing is one time frame.
In a specific application, determining the voice segments that contain human voice includes:
taking voice segments of 101 time frames as the target, computing the sum of squared amplitudes of each 101-time-frame voice segment as the short-time energy, and counting the number of threshold crossings within each 101-time-frame data length. A double-threshold method is used to improve robustness and reduce interference from jitter near zero: with the short-time energy and the short-time zero-crossing rate as parameters, a high and a low threshold are set for each, the start and end points of the voice segments are detected by the voice activity detection (VAD) algorithm, and the voice segments containing human voice are determined.
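A double-threshold decision on short-time energy and zero-crossing rate can be sketched as below. This is a simplified frame-level illustration, not the patent's 101-frame endpoint detector; the decision rule (high energy alone, or low energy plus high zero-crossing rate) and all threshold values are assumptions of the sketch.

```python
import numpy as np

def simple_vad(frames, energy_hi, energy_lo, zcr_lo):
    """Double-threshold VAD sketch: a frame is speech if its short-time
    energy clears the high threshold, or clears the low threshold while
    its zero-crossing rate also clears the ZCR threshold."""
    energy = (frames ** 2).sum(axis=1)                       # short-time energy
    signs = np.sign(frames)
    zcr = (np.abs(np.diff(signs, axis=1)) > 0).sum(axis=1)   # zero crossings
    return (energy >= energy_hi) | ((energy >= energy_lo) & (zcr >= zcr_lo))

frames = np.array([
    np.zeros(8),                  # silence: no energy, no crossings
    np.tile([1.0, -1.0], 4),      # loud voiced frame: high energy
    np.full(8, 0.5),              # low hum: some energy, no crossings
    np.tile([0.5, -0.5], 4),      # soft voiced frame: low energy, many crossings
])
mask = simple_vad(frames, energy_hi=5.0, energy_lo=1.0, zcr_lo=3)
```

The soft voiced frame is kept only because the zero-crossing rate backs up its low energy; this is exactly the jitter-robustness argument the text makes for using two thresholds.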
By identifying and deleting the voice segments that do not contain human voice, and processing only the segments with human voice as the voice data to be recognized, the amount of data computation is reduced and speech recognition efficiency is improved.
In one embodiment, step S103 includes:
performing a fast Fourier transform on the voice data to be recognized to obtain Fourier transform feature information;
filtering the Fourier transform feature information to obtain filtered feature information;
denoising the filtered feature information to obtain the voice feature information.
In a specific application, each windowed voice segment containing human voice (that is, each time frame of the voice data to be recognized) is processed with a fast Fourier transform to obtain Fourier transform feature information; the Fourier transform feature information is fed into a Mel filterbank to obtain filtered Mel feature values; the logarithm of the filtered Mel feature values is taken to obtain the corresponding filtered feature information; and spectral subtraction is applied to the spectrum of the filtered feature information to obtain the denoised voice feature information of each time frame. When the fast Fourier transform is applied to a voice segment, the number of sampling points can be set as needed; for example, with 400 sampling points collected per frame, 200 Fourier transform features and 48 filtered Mel feature values are correspondingly obtained.
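The FFT-to-log-Mel pipeline above (400 samples per frame, 200 FFT bins, 48 Mel values) can be sketched as follows. The triangular filterbank construction and the 16 kHz sample rate are standard assumptions of the sketch; the patent does not specify its exact filterbank.

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_filters=48, n_fft=400, sample_rate=16000):
    """Triangular Mel filters over the n_fft // 2 = 200 usable FFT bins."""
    n_bins = n_fft // 2
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sample_rate / 2), n_filters + 2)
    bins = np.floor(mel_to_hz(mel_pts) / (sample_rate / 2) * (n_bins - 1)).astype(int)
    fb = np.zeros((n_filters, n_bins))
    for i in range(n_filters):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        for b in range(l, c):                      # rising slope of the triangle
            fb[i, b] = (b - l) / max(c - l, 1)
        for b in range(c, r):                      # falling slope of the triangle
            fb[i, b] = (r - b) / max(r - c, 1)
    return fb

def log_mel_features(frame, fb):
    spectrum = np.abs(np.fft.rfft(frame))[: fb.shape[1]]   # 200 magnitude bins
    return np.log(fb @ spectrum + 1e-8)                     # 48 log-Mel values

fb = mel_filterbank()
feats = log_mel_features(np.random.default_rng(0).standard_normal(400), fb)
```

Each 400-sample frame thus yields exactly the 48-dimensional feature vector the text describes, which stacked over time gives the 200x48 model input mentioned below.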
As shown in FIG. 3, in one embodiment, step S104 includes:
S1041: extract the spectrogram of the voice feature information;
S1042: process the spectrogram with a pre-trained speech recognition network model to obtain the probability that the voice feature information contains preset keywords.
In a specific application, the spectrogram of the voice feature information is extracted as input data and processed by a pre-trained speech recognition network model (CLDNN: Convolutional, Long Short-Term Memory, fully connected Deep Neural Networks) to obtain, as the model's output, the probability that the voice feature information contains each preset keyword.
As shown in FIG. 4, a schematic structural diagram of the pre-trained speech recognition network model is provided.
In FIG. 4, the pre-trained speech recognition network model consists of 3 convolutional neural network layers, a long short-term memory (LSTM) layer, and 1 fully connected layer fc, where the LSTM layer is a network formed by 5 LSTM structures connected in series.
In this embodiment, the input size of the pre-trained speech recognition network model is set to 200x48. The kernel sizes of the 3 convolutional layers are [41x21], [21x11] and [21x11], the kernel counts are [32, 64, 96], and the convolution strides are [2x1, 2x2, 2x2], giving a convolutional output of 25x12x96. After the output of the convolutional layers is obtained, a reshape function is used to convert the 3-dimensional convolutional output into 2-dimensional data; after reshaping, the output size is [25x1152] (where 25 is the number of frames on the time axis and 1152 is the feature dimension on the frequency axis). The reshaped data is then passed as input to the LSTM layer (a series connection of 5 LSTM structures, each with 256 units), giving an LSTM output of size 25x256. That output is processed by a softmax function and finally by the linear fully connected layer fc (whose hidden dimension is NumCls, the number of keywords to recognize), giving an output of size 25xNumCls, representing for each time frame the probability of each keyword.
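The stated shapes can be checked with convolution arithmetic. The 200x48 input reaching 25x12x96 with strides [2x1, 2x2, 2x2] is consistent with "same"-style padding, where the output size is ceil(input / stride); that padding choice is an inference of this sketch, not stated in the text.

```python
import math

def same_pad_out(size, stride):
    """Output size of a 'same'-padded convolution: ceil(size / stride)."""
    return math.ceil(size / stride)

t, f = 200, 48                             # input: 200 time frames x 48 Mel bins
for st, sf in [(2, 1), (2, 2), (2, 2)]:    # strides of the three conv layers
    t, f = same_pad_out(t, st), same_pad_out(f, sf)

# t, f are now 25 and 12; with 96 kernels in the last layer the conv stack
# outputs 25 x 12 x 96, reshaped to [25, 12 * 96] = [25, 1152] for the LSTMs.
flat = f * 96
```

Only the strides matter for these dimensions under "same" padding, which is why the large 41x21 and 21x11 kernels still produce the quoted 25x12 grid.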
In one embodiment, step S1042 includes:
processing the spectrogram corresponding to the voice data to be recognized with the pre-trained speech recognition network model, to obtain the probability that the voice data to be recognized contains each preset keyword.
In a specific application, the spectrograms corresponding to all time frames of the voice data to be recognized are processed by the pre-trained speech recognition network model to obtain an output. The output can be read as a coordinate plot whose horizontal axis is the time frame and whose vertical axis is the probability of each preset keyword in the current time frame; that is, the output gives, for each time frame of the voice data to be recognized, the probability of each preset keyword.
In one embodiment, step S105 includes:
accumulating the total probability that the voice data to be recognized contains the same preset keyword;
when the total probability is detected to be greater than or equal to a preset threshold, determining that the voice data to be recognized contains dangerous information;
generating an alarm notification and sending it to a preset security management terminal.
In a specific application, the total probability that a preset number of time frames of the voice data to be recognized contain the same preset keyword is accumulated. When the total probability of the same preset keyword is detected to be greater than or equal to the preset threshold, the voice data to be recognized is judged to contain dangerous information, i.e., the user corresponding to the voice data to be recognized may face a threat to personal or property safety or may pose a threat to others, and an alarm notification should immediately be generated and sent to the preset security management terminal. The preset security management terminal is the terminal device of security managers preset by the user, including but not limited to the terminal device of a security management platform or a police terminal for receiving alarms. The preset threshold can be set according to actual needs. For example, with the preset threshold set to 0.7, when the accumulated total probability that all the voice data to be recognized contains "harassment terms" is detected to be 9.8, the user is judged as possibly posing a threat to others. The preset number can be set according to the actual situation; normally, the preset number of time frames is all time frames, i.e., the total probability of the same preset keyword is accumulated over all time frames of the voice data to be recognized.
In one embodiment, if the voice data to be recognized is large, the preset number is set to a proportion; for example, with the preset number set to 70%, the total probability of the same preset keyword is accumulated over 70% of the time frames.
In one embodiment, generating an alarm notification and sending it to the preset terminal device includes:
cutting the voice data to obtain speech recognition data of a preset length;
generating an alarm notification, and sending the speech recognition data together with the alarm notification to the preset security management terminal.
In a specific application, when the total probability of the same preset keyword is detected to be greater than or equal to the preset threshold, the voice data is cut to obtain speech recognition data of a preset duration containing that preset keyword as evidence; after the alarm notification is generated, the speech recognition data and the alarm notification are sent together to the preset security management terminal, so that security managers can quickly classify the danger in the voice data and take the corresponding security management measures. The preset duration can be set according to actual needs; for example, with the preset duration set to 30 s, speech recognition data lasting 30 s and centred on the preset keyword is obtained. The alarm notification may include, but is not limited to, the preset keywords whose total probability exceeds the preset threshold. The form of the alarm notification can be set according to actual needs; for example, the voice segment containing the preset keyword is cut out as the corresponding alarm notification, or that segment is converted into text form as the corresponding alarm notification. For example, when the total probability of "harassment terms" in the voice data to be recognized is detected to exceed the preset threshold, speech recognition data of the preset duration containing the "harassment terms" is sent as evidence, together with an alarm notification containing the text of the "harassment terms", to the preset security management terminal.
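Cutting a fixed-length evidence clip centred on the detected keyword can be sketched as below. The shifting behaviour at the edges of the recording is an assumption of the sketch; the text only specifies a 30 s clip centred on the keyword.

```python
def clip_bounds(keyword_time_s, clip_len_s, total_len_s):
    """Start/end times (seconds) of a clip of clip_len_s centred on the
    keyword, shifted as needed so the clip stays inside the recording."""
    start = keyword_time_s - clip_len_s / 2
    start = min(max(start, 0.0), max(total_len_s - clip_len_s, 0.0))
    end = min(start + clip_len_s, total_len_s)
    return start, end

mid = clip_bounds(60.0, 30.0, 300.0)    # keyword well inside the recording
head = clip_bounds(5.0, 30.0, 300.0)    # keyword near the start
tail = clip_bounds(295.0, 30.0, 300.0)  # keyword near the end
```

Converting the returned second offsets to sample indices (multiplying by the sample rate) gives the slice of raw audio to attach to the alarm notification as evidence.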
By way of example and not limitation, different target sites correspond to different preset security management terminals and different security managers. For example, when the target site is a taxi, the corresponding preset security management terminals include but are not limited to a public security management platform and a taxi security management platform, and the corresponding security managers include but are not limited to police officers and the security staff of the taxi company.
In this embodiment, the acquired voice data is preprocessed to obtain voice data to be recognized, reducing the amount of computation; voice feature information is extracted from the voice data to be recognized and processed to recognize the probability that it contains preset keywords, so that the corresponding security management measure is determined according to the probability value. This reduces resource consumption, improves the efficiency and accuracy of speech recognition, and thereby improves the efficiency of security management.
It should be understood that the numbering of the steps in the above embodiments does not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
Corresponding to the speech-recognition-based security early warning method described in the above embodiments, FIG. 5 shows a structural block diagram of the speech-recognition-based security early warning apparatus provided by an embodiment of this application. For convenience of description, only the parts relevant to the embodiments of this application are shown.
In this embodiment, the speech-recognition-based security early warning apparatus includes a processor, where the processor is configured to execute the following program modules stored in a memory: an acquisition module for acquiring voice data;
a preprocessing module for preprocessing the voice data to obtain voice data to be recognized; an extraction module for extracting the voice feature information of the voice data to be recognized; a voice processing module for processing the voice feature information to obtain the probability that it contains preset keywords; and a determination module for determining and executing the corresponding security early warning measure according to the probability value.
Referring to FIG. 5, the speech-recognition-based security early warning apparatus 100 includes:
an acquisition module 101 for acquiring voice data;
a preprocessing module 102 for preprocessing the voice data to obtain voice data to be recognized;
an extraction module 103 for extracting the voice feature information of the voice data to be recognized;
a voice processing module 104 for processing the voice feature information to obtain the probability that the voice feature information contains preset keywords;
a determination module 105 for determining and executing the corresponding security early warning measure according to the probability value.
In one embodiment, the preprocessing module 102 includes:
a framing unit for performing framing and windowing on the voice data to obtain corresponding voice segments;
a filtering unit for deleting the voice segments that do not contain human voice, and taking the segments of the voice data that do contain human voice as the voice data to be recognized.
In one embodiment, the extraction module 103 includes:
a first processing unit for performing a fast Fourier transform on the voice data to be recognized to obtain Fourier transform feature information;
a second processing unit for filtering the Fourier transform feature information to obtain filtered feature information;
a third processing unit for denoising the filtered feature information to obtain the voice feature information.
In one embodiment, the voice processing module 104 includes:
a second unit for extracting the spectrogram of the voice feature information;
a recognition unit for processing the spectrogram with a pre-trained speech recognition network model to obtain the probability that the voice feature information contains preset keywords.
In one embodiment, the recognition unit includes:
a recognition subunit for processing the spectrogram corresponding to the voice data to be recognized with the pre-trained speech recognition network model, to obtain the probability that the voice data to be recognized contains each preset keyword.
In one embodiment, the determination module 105 includes:
an accumulation unit for accumulating the total probability that the voice data to be recognized contains the same preset keyword;
a detection unit for determining, when the total probability is detected to be greater than or equal to a preset threshold, that the voice data to be recognized contains dangerous information;
a generation unit for generating an alarm notification and sending it to a preset security management terminal.
In one embodiment, the generation unit includes:
a cutting subunit for cutting the voice data to obtain speech recognition data of a preset length;
a generation subunit for generating an alarm notification, and sending the speech recognition data together with the alarm notification to the preset security management terminal.
In this embodiment, the acquired voice data is preprocessed to obtain voice data to be recognized, reducing the amount of computation; voice feature information is extracted from the voice data to be recognized and processed to recognize the probability that it contains preset keywords, so that the corresponding security management measure is determined according to the probability value. This reduces resource consumption, improves the efficiency and accuracy of speech recognition, and thereby improves the efficiency of security management.
It should be noted that, since the information exchange and execution processes between the above apparatus/units are based on the same concept as the method embodiments of this application, their specific functions and technical effects can be found in the method embodiment section and are not repeated here.
FIG. 6 is a schematic structural diagram of the terminal device provided by an embodiment of this application. As shown in FIG. 6, the terminal device 6 of this embodiment includes: at least one processor 60 (only one is shown in FIG. 6), a memory 61, and a computer program 62 stored in the memory 61 and runnable on the at least one processor 60, where the processor 60, when executing the computer program 62, implements the steps in any of the above speech-recognition-based security early warning method embodiments.
The terminal device 6 may be a computing device such as a desktop computer, a notebook, a palmtop computer or a cloud server. The terminal device may include, but is not limited to, the processor 60 and the memory 61. Those skilled in the art can understand that FIG. 6 is only an example of the terminal device 6 and does not constitute a limitation on the terminal device 6; it may include more or fewer components than shown, combine certain components, or use different components, and may for example also include input/output devices, network access devices, etc.
The processor 60 may be a central processing unit (Central Processing Unit, CPU); the processor 60 may also be another general-purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor.
In some embodiments the memory 61 may be an internal storage unit of the terminal device 6, for example a hard disk or memory of the terminal device 6. In other embodiments the memory 61 may also be an external storage device of the terminal device 6, for example a plug-in hard disk, a smart media card (Smart Media Card, SMC), a secure digital card (Secure Digital, SD) or a flash memory card (Flash Card) provided on the terminal device 6. Further, the memory 61 may include both an internal storage unit and an external storage device of the terminal device 6. The memory 61 is used to store the operating system, application programs, the boot loader (BootLoader), data and other programs, for example the program code of the computer program. The memory 61 may also be used to temporarily store data that has been output or is to be output.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the division into the above functional units and modules is only an example; in practical applications, the above functions can be assigned to different functional units and modules as needed, i.e., the internal structure of the apparatus can be divided into different functional units or modules to complete all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules are only for convenience of distinguishing them from one another and are not used to limit the scope of protection of this application. For the specific working processes of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
The embodiments of this application also provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of each of the above method embodiments.
The embodiments of this application provide a computer program product which, when run on a mobile terminal, causes the mobile terminal to implement the steps of each of the above method embodiments when executed.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, this application implements all or part of the processes of the above method embodiments by instructing the relevant hardware through a computer program; the computer program may be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of each of the above method embodiments. The computer program includes computer program code, which may be in source code form, object code form, an executable file, some intermediate form, etc. The computer-readable medium may include at least: any entity or device capable of carrying the computer program code to the photographing apparatus/terminal device, a recording medium, computer memory, read-only memory (ROM, Read-Only Memory), random access memory (RAM, Random Access Memory), an electrical carrier signal, a telecommunication signal, and a software distribution medium, for example a USB drive, a removable hard disk, a magnetic disk or an optical disk. In some jurisdictions, according to legislation and patent practice, the computer-readable medium may not be an electrical carrier signal or a telecommunication signal.
In the above embodiments, the description of each embodiment has its own emphasis; for parts not detailed or recorded in one embodiment, reference may be made to the relevant descriptions of other embodiments.
Those of ordinary skill in the art can realize that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and design constraints of the technical solution. Skilled professionals may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
In the embodiments provided by this application, it should be understood that the disclosed apparatus/network device and method may be implemented in other ways. For example, the apparatus/network device embodiments described above are only illustrative; for example, the division of the modules or units is only a logical functional division, and there may be other divisions in actual implementation; multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. Furthermore, the mutual coupling, direct coupling or communication connections shown or discussed may be indirect coupling or communication connections through some interfaces, apparatuses or units, and may be electrical, mechanical or other forms.
The units described as separate components may or may not be physically separate, and components shown as units may or may not be physical units; they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
The above are only optional embodiments of this application and are not intended to limit this application. For those skilled in the art, this application may have various modifications and changes. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of this application shall be included within the scope of the claims of this application.

Claims (15)

  1. A speech-recognition-based security early warning method, characterized by including:
    acquiring voice data;
    preprocessing the voice data to obtain voice data to be recognized;
    extracting voice feature information from the voice data to be recognized;
    processing the voice feature information to obtain the probability that the voice feature information contains preset keywords;
    determining and executing the corresponding security early warning measure according to the probability value.
  2. The speech-recognition-based security early warning method according to claim 1, characterized in that preprocessing the voice data to obtain the voice data to be recognized includes:
    performing framing and windowing on the voice data to obtain corresponding voice segments;
    deleting the voice segments that do not contain human voice, and taking the segments of the voice data that do contain human voice as the voice data to be recognized.
  3. The speech-recognition-based security early warning method according to claim 1, characterized in that extracting the voice feature information of the voice data to be recognized includes:
    performing a fast Fourier transform on the voice data to be recognized to obtain Fourier transform feature information;
    filtering the Fourier transform feature information to obtain filtered feature information;
    denoising the filtered feature information to obtain the voice feature information.
  4. The speech-recognition-based security early warning method according to claim 1, characterized in that processing the voice feature information to obtain the probability that the voice feature information contains preset keywords includes:
    extracting the spectrogram of the voice feature information;
    processing the spectrogram with a pre-trained speech recognition network model to obtain the probability that the voice feature information contains preset keywords.
  5. The speech-recognition-based security early warning method according to claim 4, characterized in that processing the spectrogram with a pre-trained speech recognition network model to obtain the probability that the voice feature information contains preset keywords includes:
    processing the spectrogram corresponding to the voice data to be recognized with the pre-trained speech recognition network model, to obtain the probability that the voice data to be recognized contains each preset keyword.
  6. The speech-recognition-based security early warning method according to claim 5, characterized in that determining and executing the corresponding security early warning measure according to the probability value includes:
    accumulating the total probability that the voice data to be recognized contains the same preset keyword;
    when the total probability is detected to be greater than or equal to a preset threshold, determining that the voice data to be recognized contains dangerous information;
    generating an alarm notification and sending it to a preset security management terminal.
  7. The speech-recognition-based security early warning method according to claim 6, characterized in that generating an alarm notification and sending it to the preset terminal device includes:
    cutting the voice data to obtain speech recognition data of a preset length;
    generating an alarm notification, and sending the speech recognition data together with the alarm notification to the preset security management terminal.
  8. A speech-recognition-based security early warning apparatus, characterized by including:
    an acquisition module for acquiring voice data;
    a preprocessing module for preprocessing the voice data to obtain voice data to be recognized;
    an extraction module for extracting the voice feature information of the voice data to be recognized;
    a voice processing module for processing the voice feature information to obtain the probability that the voice feature information contains preset keywords;
    a determination module for determining and executing the corresponding security early warning measure according to the probability value.
  9. The speech-recognition-based security early warning apparatus according to claim 8, characterized in that the preprocessing module includes:
    a framing unit for performing framing and windowing on the voice data to obtain corresponding voice segments;
    a filtering unit for deleting the voice segments that do not contain human voice, and taking the segments of the voice data that do contain human voice as the voice data to be recognized.
  10. The speech-recognition-based security early warning apparatus according to claim 8, characterized in that the extraction module includes:
    a first processing unit for performing a fast Fourier transform on the voice data to be recognized to obtain Fourier transform feature information;
    a second processing unit for filtering the Fourier transform feature information to obtain filtered feature information;
    a third processing unit for denoising the filtered feature information to obtain the voice feature information.
  11. The speech-recognition-based security early warning apparatus according to claim 8, characterized in that the voice processing module includes:
    a second unit for extracting the spectrogram of the voice feature information;
    a recognition unit for processing the spectrogram with a pre-trained speech recognition network model to obtain the probability that the voice feature information contains preset keywords.
  12. The speech-recognition-based security early warning apparatus according to claim 11, characterized in that the recognition unit includes:
    a recognition subunit for processing the spectrogram corresponding to the voice data to be recognized with the pre-trained speech recognition network model, to obtain the probability that the voice data to be recognized contains each preset keyword.
  13. The speech-recognition-based security early warning apparatus according to claim 8, characterized in that the determination module includes:
    an accumulation unit for accumulating the total probability that the voice data to be recognized contains the same preset keyword;
    a detection unit for determining, when the total probability is detected to be greater than or equal to a preset threshold, that the voice data to be recognized contains dangerous information;
    a generation unit for generating an alarm notification and sending it to a preset security management terminal.
  14. A terminal device, including a memory, a processor, and a computer program stored in the memory and runnable on the processor, characterized in that the processor, when executing the computer program, implements the method according to any one of claims 1 to 7.
  15. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the method according to any one of claims 1 to 7.
PCT/CN2021/085180 2021-04-02 2021-04-02 一种基于语音识别的安全预警方法、装置及终端设备 WO2022205400A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/CN2021/085180 WO2022205400A1 (zh) 2021-04-02 2021-04-02 一种基于语音识别的安全预警方法、装置及终端设备
CN202180000722.5A CN113228164A (zh) 2021-04-02 2021-04-02 一种基于语音识别的安全预警方法、装置及终端设备

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2021/085180 WO2022205400A1 (zh) 2021-04-02 2021-04-02 一种基于语音识别的安全预警方法、装置及终端设备

Publications (1)

Publication Number Publication Date
WO2022205400A1 true WO2022205400A1 (zh) 2022-10-06

Family

ID=77081332

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/085180 WO2022205400A1 (zh) 2021-04-02 2021-04-02 一种基于语音识别的安全预警方法、装置及终端设备

Country Status (2)

Country Link
CN (1) CN113228164A (zh)
WO (1) WO2022205400A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115132188A (zh) * 2022-09-02 2022-09-30 珠海翔翼航空技术有限公司 基于语音识别的预警方法、装置、终端设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150095027A1 (en) * 2013-09-30 2015-04-02 Google Inc. Key phrase detection
CN106453882A (zh) * 2016-09-29 2017-02-22 珠海格力电器股份有限公司 信息处理方法、装置及电子设备
CN109671433A (zh) * 2019-01-10 2019-04-23 腾讯科技(深圳)有限公司 一种关键词的检测方法以及相关装置
CN109817202A (zh) * 2019-01-22 2019-05-28 珠海格力电器股份有限公司 一种语音控制方法、装置、存储介质及语音设备
US20190371326A1 (en) * 2015-11-24 2019-12-05 Intel IP Corporation Low resource key phrase detection for wake on voice
CN110706700A (zh) * 2019-09-29 2020-01-17 深圳市元征科技股份有限公司 一种车内骚扰预防报警方法及装置、服务器、存储介质

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111329494B (zh) * 2020-02-28 2022-10-28 首都医科大学 抑郁症参考数据的获取方法及装置

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150095027A1 (en) * 2013-09-30 2015-04-02 Google Inc. Key phrase detection
US20190371326A1 (en) * 2015-11-24 2019-12-05 Intel IP Corporation Low resource key phrase detection for wake on voice
CN106453882A (zh) * 2016-09-29 2017-02-22 珠海格力电器股份有限公司 信息处理方法、装置及电子设备
CN109671433A (zh) * 2019-01-10 2019-04-23 腾讯科技(深圳)有限公司 一种关键词的检测方法以及相关装置
CN109817202A (zh) * 2019-01-22 2019-05-28 珠海格力电器股份有限公司 一种语音控制方法、装置、存储介质及语音设备
CN110706700A (zh) * 2019-09-29 2020-01-17 深圳市元征科技股份有限公司 一种车内骚扰预防报警方法及装置、服务器、存储介质

Also Published As

Publication number Publication date
CN113228164A (zh) 2021-08-06

Similar Documents

Publication Publication Date Title
WO2018149077A1 (zh) 声纹识别方法、装置、存储介质和后台服务器
US20150286464A1 (en) Method, system and storage medium for monitoring audio streaming media
US9424743B2 (en) Real-time traffic detection
Pillos et al. A Real-Time Environmental Sound Recognition System for the Android OS.
WO2021000498A1 (zh) 复合语音识别方法、装置、设备及计算机可读存储介质
CN111754982A (zh) 语音通话的噪声消除方法、装置、电子设备及存储介质
CN110880329A (zh) 一种音频识别方法及设备、存储介质
WO2020238046A1 (zh) 人声智能检测方法、装置及计算机可读存储介质
CN111739542A (zh) 一种特征声音检测的方法、装置及设备
CN113053410B (zh) 声音识别方法、装置、计算机设备和存储介质
CN106548786A (zh) 一种音频数据的检测方法及系统
CN110136726A (zh) 一种语音性别的估计方法、装置、系统及存储介质
CN112382302A (zh) 婴儿哭声识别方法及终端设备
WO2022205400A1 (zh) 一种基于语音识别的安全预警方法、装置及终端设备
WO2022121182A1 (zh) 语音端点检测方法、装置、设备及计算机可读存储介质
CN111210817B (zh) 数据处理方法及装置
CN113421590B (zh) 异常行为检测方法、装置、设备及存储介质
Mu et al. MFCC as features for speaker classification using machine learning
Hajihashemi et al. Novel time-frequency based scheme for detecting sound events from sound background in audio segments
CN115762551A (zh) 鼾声检测方法、装置、计算机设备及存储介质
JP2018109739A (ja) 音声フレーム処理用の装置及び方法
CN112216285B (zh) 多人会话检测方法、系统、移动终端及存储介质
CN112863548A (zh) 训练音频检测模型的方法、音频检测方法及其装置
Gasenzer et al. Towards generalizing deep-audio fake detection networks
CN111552832A (zh) 基于声纹特征与关联图谱数据的风险用户识别方法、装置

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21934057

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21934057

Country of ref document: EP

Kind code of ref document: A1