CN116229987B - Campus voice recognition method, device and storage medium - Google Patents
- Publication number: CN116229987B (application CN202211592939.2A)
- Authority: CN (China)
- Prior art keywords: voice, information, campus, violent, keyword
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L17/04—Speaker identification or verification: training, enrolment or model building
- G10L17/06—Speaker identification or verification: decision-making techniques; pattern-matching strategies
- G10L17/18—Speaker identification or verification: artificial neural networks; connectionist approaches
- G10L17/22—Speaker identification or verification: interactive procedures; man-machine interfaces
- Y02D30/70—Reducing energy consumption in wireless communication networks
Abstract
The invention discloses a campus voice recognition method, device and storage medium. The method comprises the following steps: acquiring first audio signal data from a first campus voice device and filtering it to obtain human-voice information; inputting the voice information into a speech recognition model so that the model judges whether the voice information contains preset violent keywords; if so, inputting the voice information into a voiceprint recognition model so that the model calculates the energy value of the voice information and determines the sound source information according to a voiceprint scale factor, where the sound source information includes the number of people producing the voice and their position and direction; and sending the first audio signal data, the position information of the first campus voice device, and the sound source information to a management system, thereby recognizing and locating violent speech on the campus.
Description
Technical Field
The present invention relates to the field of speech recognition technologies, and in particular, to a method, an apparatus, and a storage medium for campus speech recognition.
Background
Speech recognition converts the lexical content of input speech into corresponding text. An existing speech recognition model first preprocesses the speech, decodes it with an acoustic model, matches syllables against a word list to obtain word sequences, and finally forms sentences with a language model.
Natural spoken dialogue conveys not only sound but also the speaker's emotional state, attitude, and intention. The voice recognition functions of current smart-campus equipment lack keyword retrieval for violent vocabulary and emotional speech recognition, and cannot localize the sound source of the acquired speech; as a result, recognition performance is poor and the safety of campus students cannot be comprehensively protected through recognition of their speech.
Disclosure of Invention
The invention provides a campus voice recognition method, a device and a storage medium, which are used for realizing recognition and positioning of violent voices in a campus.
To recognize and locate violent speech on a campus, an embodiment of the invention provides a campus voice recognition method, device and storage medium, comprising the following steps: acquiring first audio signal data from a first campus voice device, and filtering the first audio signal data to obtain human-voice information;

inputting the voice information into a speech recognition model so that the model judges whether the voice information contains preset violent keywords;

if so, inputting the voice information into a voiceprint recognition model so that the model calculates the energy value of the voice information and determines the sound source information according to a voiceprint scale factor and the energy distribution of the voice information, where the sound source information includes the number of people producing the voice and their distance and direction;

and sending the first audio signal data, the position information of the first campus voice device, and the sound source information to a management system.
As a preferred scheme, the method extracts features from the first audio signal data of any voice device on the campus, inputs them into a speech recognition model for analysis, and judges whether violent speech is present. If violent speech is detected, voiceprint analysis is then performed on it to obtain its sound source information: the number of people producing the voice and their distance and direction. In this way the speech of students is recorded on campus in real time, violent speech is detected, and the number, distance, and direction of the speakers are determined, thereby localizing the sound source.
As a preferred scheme, acquiring first audio signal data from the first voice device and filtering it to obtain the human-voice information specifically comprises:

dividing the first audio signal data into a voice region and a silent region, removing noise from the voice region, and taking the denoised voice region as the human-voice information.
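The split into a voice region and a silent region can be sketched with a simple short-time-energy detector. This is a minimal illustration, not the patent's actual segmentation; the frame length and energy threshold are assumed values:

```python
# Minimal energy-based voice/silence segmentation (illustrative only).
# Frame length and threshold are assumed values, not from the patent.

def segment_voice(samples, frame_len=160, threshold=0.01):
    """Return (voice_frames, silent_frames) split by short-time energy."""
    voice, silent = [], []
    for i in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[i:i + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        (voice if energy > threshold else silent).append(frame)
    return voice, silent

# A loud burst followed by near-silence:
signal = [0.5, -0.5] * 80 + [0.001] * 160
voice, silent = segment_voice(signal)
```

In a real system the threshold would be adapted to the ambient noise level rather than fixed.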
As a preferred scheme, the voice information is first segmented and its features are extracted from the voice region, which reduces computation on environmental sounds and improves the precision of voice analysis. Keywords and voiceprints are then extracted, so that student speech is recorded on campus in real time, violent speech is detected, and the number, distance, and direction of the speakers are judged from the voiceprint features, thereby localizing the sound source.
As a preferred scheme, detecting whether the voice information contains a preset violent keyword specifically comprises:

calling a unified API interface to acquire the channel information of a first keyword in the voice information;

matching the channel information of the first keyword against the channel information of a second keyword in the training voice information, where the second keyword is a preset violent keyword;

if the channel information of the first keyword matches that of the second keyword, the speech recognition model judges that the voice information contains a preset violent keyword.
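The keyword check itself can be illustrated as a simple match of extracted keywords against a preset violent-vocabulary list. The word lists below are placeholder examples, not the patent's vocabulary:

```python
# Illustrative keyword screening: flag voice text whose keywords match
# a preset violent-vocabulary list. Word lists are placeholder examples.

VIOLENT_KEYWORDS = {"hit", "fight", "threaten"}  # assumed examples

def contains_violent_keyword(recognized_words):
    """Return the matched violent keywords (empty set if none)."""
    return set(recognized_words) & VIOLENT_KEYWORDS

matched = contains_violent_keyword(["let", "us", "fight", "now"])
clean = contains_violent_keyword(["good", "morning"])
```

The patent matches richer "channel information" per keyword rather than literal strings; this sketch only shows the gating logic.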
As a preferred scheme, the invention judges whether a keyword in the voice information is a violent word or a word with negative emotion by matching its keyword feature information against that of the training voice information, thereby recording student speech on campus in real time and detecting whether it is violent speech.
As a preferred scheme, calculating the energy value of the voice information and determining the sound source information according to the voiceprint scale factor and the energy distribution specifically comprises:

inputting the multiple pieces of voice information into corresponding matrix units, and calculating the energy value and frequency-domain energy distribution of the voice information acquired by each audio acquisition terminal, where the first campus voice device is provided with a plurality of audio acquisition terminals and each piece of voice information is obtained by filtering the first audio signal data acquired by a different terminal;

extracting voiceprint scale factors from the energy value and frequency-domain energy distribution of each matrix unit, equalizing the human-voice information, and outputting the matrix energy distribution;

and determining the number of people and the direction of the sound from the matrix energy distribution and the positions of the audio acquisition terminals.
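As a rough sketch of the last step, the direction of the dominant source can be estimated from the energy measured at each terminal, for example as the energy-weighted circular mean of the terminal bearings. This is a simplification; the patent does not specify its matrix processing at this level of detail, and the bearings and energies below are invented examples:

```python
import math

# Energy-weighted direction estimate from several acquisition terminals.
# Terminal bearings (degrees around the device) and energies are examples.

def estimate_direction(bearings_deg, energies):
    """Energy-weighted circular mean of terminal bearings, in degrees."""
    x = sum(e * math.cos(math.radians(b)) for b, e in zip(bearings_deg, energies))
    y = sum(e * math.sin(math.radians(b)) for b, e in zip(bearings_deg, energies))
    return math.degrees(math.atan2(y, x)) % 360

# Four terminals at the compass points; the terminal facing 90 degrees
# receives by far the most energy, so the source lies that way.
direction = estimate_direction([0, 90, 180, 270], [0.1, 2.0, 0.1, 0.1])
```

Production systems typically use inter-microphone time differences as well as energy, which this sketch omits.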
As a preferred scheme, the first campus voice device is provided with a plurality of audio acquisition terminals. The energy value and frequency-domain energy distribution of each piece of voice information, obtained by filtering the first audio signal data acquired by the terminals, are calculated; voiceprint scale factors are extracted, the voice information is equalized, and the matrix energy distribution is output. The number of people and the direction of the sound are then determined from the matrix energy distribution and the terminal positions, thereby localizing the sound source.
Preferably, before inputting the voice information into the voice recognition model, the method further comprises:
acquiring a plurality of training audio data, and extracting characteristic information of the training audio data; wherein the training audio data comprises voice sounds containing violent vocabulary or emotion keywords and voice sounds without violent vocabulary or emotion keywords;
Dividing the training audio data into a voice area and a mute area according to the characteristic information; according to the characteristic types of the voice area and the mute area, carrying out fusion calculation on the characteristic information to obtain characteristic parameters of the training audio data;
and respectively modeling channels of a voice area and a mute area of the training audio data according to the characteristic parameters to obtain a voice recognition model.
Before the voice information is input into the speech recognition model, the model is trained. Speech containing violent vocabulary or emotion keywords and speech without them are used as training audio data, so the model learns to distinguish the feature values of each class. These features are fused, and the model built from the fused feature parameters can detect whether voice information is violent speech and the emotion value it expresses.
Preferably, before inputting the voiceprint parameters into the voiceprint recognition model, the method further comprises:
acquiring a plurality of training audio data, and extracting first energy characteristic information of the training audio data; fusion calculation is carried out on the first energy characteristic information, and voiceprint characteristic parameters of the training audio data are obtained; and modeling the training audio data according to the voiceprint characteristic parameters to obtain a voiceprint recognition model.
Before the voiceprint parameters are input into the voiceprint recognition model, the model is trained: the first energy feature information of the training audio data is extracted, the voiceprint feature parameters are obtained, and the model is trained on them, so that it can judge the number, distance, and direction of people producing violent speech, thereby localizing the sound source.
Preferably, before the first audio signal data, the location information of the first campus voice device, and the sound source information are sent to a management system, the method further includes:
playing the alarm information through broadcasting equipment; and if the violent voice is detected again within the preset time after the alarm information is played, transmitting the first audio signal data, the position information of the first campus voice equipment and the sound source information to a management system.
When violent speech is detected a second time within the preset time, the voice information, the position of the voice device that acquired it, and the speaker information are sent to the management system to inform an administrator of the violent speech content, the number of people, and their position. Student speech is thus recorded on campus in real time, violent speech is detected and localized, and the administrator is notified promptly with the relevant content, comprehensively protecting the safety of students on campus.
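The two-stage alarm described above can be sketched as a simple time-window check: broadcast a warning on the first detection, and escalate to the management system only if a second detection arrives within the preset window. The window length and return values here are placeholders:

```python
# Sketch of the two-stage alarm: play a warning on the first detection,
# escalate to the management system only if violence recurs within a
# preset window. Window length (seconds) is an assumed value.

class ViolenceAlarm:
    def __init__(self, window_s=60):
        self.window_s = window_s
        self.last_detection = None  # time of the previous detection

    def on_violent_speech(self, now_s):
        """Return the action to take for a detection at time now_s."""
        if self.last_detection is not None and now_s - self.last_detection <= self.window_s:
            self.last_detection = now_s
            return "notify_management"
        self.last_detection = now_s
        return "play_warning"

alarm = ViolenceAlarm(window_s=60)
first = alarm.on_violent_speech(0)    # first detection: warning broadcast
second = alarm.on_violent_speech(30)  # recurred within the window: escalate
late = ViolenceAlarm(60).on_violent_speech(0)
```

The "notify_management" branch is where the audio data, device position, and sound source information from the text would be transmitted.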
Correspondingly, the invention also provides a device for campus voice recognition, which comprises: the device comprises an acquisition module, a violence detection module, a voiceprint positioning module and an information sending module;
the acquisition module is used for acquiring audio signal data in campus voice equipment, and extracting characteristics of the audio signal data to acquire voice information;
the violence detection module is used for inputting the voice information into a voice recognition model so that the voice recognition model can judge whether the voice information contains preset violence keywords or not;
the voiceprint positioning module is used for inputting the voice information into a voiceprint recognition model if the voice information contains a preset violent keyword, so that the voiceprint recognition model calculates the energy value of the voice information, and determines sound source information in the voice information according to a voiceprint scale factor and the energy distribution of the voice information; wherein the sound source information includes: the number of people sending out the voice information of the voice and the position distance and direction of the people;
the information sending module is used for sending the first audio signal data, the position information of the first campus voice equipment and the sound source information to a management system.
As a preferred scheme, the acquisition module of the campus voice recognition device acquires the first audio signal data of any voice device on the campus and extracts features from it to obtain the voice information; the violence detection module inputs the voice information into the speech recognition model for analysis and judges whether violent speech is present; if so, the voiceprint positioning module performs voiceprint analysis on the violent speech and obtains its sound source information: the number of people producing the voice and their distance and direction. Student speech is thus recorded on campus in real time, violent speech is detected, and the number, distance, and direction of the speakers are judged, thereby localizing the sound source. The information sending module then feeds the sound source information of the violent speech back to the administrator in a timely manner.
As a preferred solution, the acquisition module includes a segmentation unit and a feature extraction unit;
the segmentation unit is used for segmenting the first audio signal data into a voice area and a mute area, and acquiring the voice area;
The characteristic extraction unit is used for extracting voice information of the voice area; the voice information comprises keyword characteristic information and voiceprint characteristic information.
As a preferred scheme, before detection the segmentation unit segments the voice region out of the audio, and the feature extraction unit extracts the feature information of the voice region, reducing computation on environmental sounds and improving the precision of voice analysis. Keywords and voiceprints are extracted, student speech is recorded on campus in real time, violent speech is detected, and the number, distance, and direction of the speakers are judged from the voiceprint features, thereby localizing the sound source.
As a preferred solution, the violence detection module comprises a training unit and a detection unit;
the training unit is used for acquiring a plurality of training audio data and extracting characteristic information of the training audio data; wherein the training audio data comprises voice sounds containing violent vocabulary or emotion keywords and voice sounds without violent vocabulary or emotion keywords;
dividing the training audio data into a voice area and a mute area according to the characteristic information; according to the characteristic types of the voice area and the mute area, carrying out fusion calculation on the characteristic information to obtain characteristic parameters of the training audio data;
Modeling channels of a voice area and a mute area of the training audio data according to the characteristic parameters to obtain a voice recognition model;
the detection unit is used for extracting the feature information of a first keyword in the voice information: it calls a unified API interface to acquire the feature information of the first keyword, matches it against the feature information of a second keyword in the training voice information, and judges whether the first keyword is a violent word; if the second keyword is a violent word and the feature information of the first keyword matches it, the first keyword is judged to be a violent word.
Before the voice information is input into the speech recognition model, the training unit first trains the model. Speech containing violent vocabulary or emotion keywords and speech without them are used as training audio data, so the model learns to distinguish the feature values of each class; these are fused, and the model built from the fused feature parameters can detect whether voice information is violent speech and the emotion value it expresses. The detection unit judges whether a keyword in the voice information is a violent word or a word with negative emotion by matching its keyword feature information against that of the training voice information, thereby recording student speech on campus in real time and detecting violent speech.
Accordingly, the present invention also provides a computer-readable storage medium comprising a stored computer program; when run, the computer program controls the device in which the storage medium is located to execute the campus voice recognition method of the present invention.
Drawings
FIG. 1 is a flow chart of one embodiment of a method of campus voice recognition provided by the present invention;
fig. 2 is a schematic structural diagram of an embodiment of a campus voice recognition device provided by the present invention.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
Referring to fig. 1, a method for campus voice recognition according to an embodiment of the present invention includes steps S101 to S104:
Step S101: acquiring first audio signal data in first campus voice equipment, and filtering the first audio signal data to acquire voice information;
in this embodiment, first audio signal data in a first voice device is obtained, and filtering processing is performed on the first audio signal data to obtain voice information, specifically:
dividing the first audio signal data into a voice area and a mute area, removing noise of the voice area, and taking the voice area after noise removal as the voice information of the human voice.
In this embodiment, dividing the first audio signal data into a voice region and a silent region, removing the noise, and taking the denoised voice region as the human-voice information specifically comprises:

applying a Hanning window and a short-time fast Fourier transform to the first audio signal data to segment it from the time domain into the frequency domain;

inputting the segmented data into an IIR filter, attenuating the frequency bands that contain noise, enhancing the bands that contain human voice, and finally transforming back to the time domain by inverse Fourier transform to obtain the human-voice information.
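The window-and-transform step can be illustrated with a Hanning window and a naive DFT on one frame: bands flagged as noise are attenuated in the frequency domain and the frame is transformed back. This is a toy sketch; real implementations use an FFT, and the patent's IIR filter design is not reproduced here (the "noisy" bin indices are invented):

```python
import cmath, math

# Toy frequency-domain denoising of one frame: Hanning window, naive DFT,
# attenuate assumed noise bins, inverse DFT. Illustration only.

def hanning(n):
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]

def dft(frame):
    n = len(frame)
    return [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def idft(spec):
    n = len(spec)
    return [(sum(spec[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n).real
            for t in range(n)]

def denoise_frame(frame, noise_bins, atten=0.1):
    windowed = [s * w for s, w in zip(frame, hanning(len(frame)))]
    spec = dft(windowed)
    spec = [c * atten if k in noise_bins else c for k, c in enumerate(spec)]
    return idft(spec)

frame = [math.sin(2 * math.pi * 4 * t / 32) for t in range(32)]  # tone in bin 4
out = denoise_frame(frame, noise_bins={10, 22})  # bins 10/22 assumed noisy
```

Overlap-add across successive windowed frames, omitted here, is needed to reconstruct a continuous signal.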
In this embodiment, the division and filtering of the first audio signal data specifically comprise:

randomly adjusting the gain of the first audio signal data between 0.01 and 10 and the noise gain between 0.1 and 10, computing the gain increase frame by frame to obtain the gained audio signal data, and processing it with a random second-order filter to obtain a speech signal and a noise signal;

calculating the energy value of the speech signal and deriving 1 voice-activity (VAD) feature point from it; calculating the energy spectrum of the noise signal to obtain 22 voiceprint feature points; mixing the gained speech and noise signals to obtain a noisy speech signal and computing 44 mixed feature points;

and calculating the ratio of the speech-signal energy to the noisy-speech energy, together with the VAD feature points and the silent speech signal, to obtain 22 gain feature points.
In this embodiment, a deep neural network model is trained on this data: the 44 mixed feature points, 22 gain feature points, and 1 VAD feature point of the training data are extracted and input into the network, which outputs the speech signal. 10% of the training data is held out as a validation set, and the remaining data is divided into batches of 32 and trained for 120 epochs.
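The data handling described here can be sketched as follows; the network training step itself is stubbed out, and only the 10% validation hold-out, batch division, and epoch loop from the text are shown:

```python
import random

# Sketch of the training-data handling described above: hold out 10% for
# validation, split the rest into batches of 32, loop for 120 epochs.
# The actual gradient step on each batch is stubbed out.

def split_and_batch(samples, val_frac=0.1, batch_size=32, seed=0):
    rng = random.Random(seed)
    data = samples[:]
    rng.shuffle(data)
    n_val = int(len(data) * val_frac)
    val, train = data[:n_val], data[n_val:]
    batches = [train[i:i + batch_size] for i in range(0, len(train), batch_size)]
    return train, val, batches

train, val, batches = split_and_batch(list(range(1000)))
for epoch in range(120):
    for batch in batches:
        pass  # one training step on `batch` would go here
```

With 1000 samples this yields 100 validation samples and 29 batches (the last one partial).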
Step S102: inputting the voice information into a voice recognition model so that the voice recognition model judges whether the voice information contains preset violent keywords or not;
in this embodiment, detecting whether the voice information contains a preset violent keyword specifically comprises:

calling a unified API interface to acquire the channel information of a first keyword in the voice information;

matching the channel information of the first keyword against the channel information of a second keyword in the training voice information, where the second keyword is a preset violent keyword;

if the channel information of the first keyword matches that of the second keyword, the speech recognition model judges that the voice information contains a preset violent keyword.
In this embodiment, before inputting the voice information into the voice recognition model, the method further includes:
acquiring a plurality of training audio data, and extracting characteristic information of the training audio data; wherein the training audio data comprises voice sounds containing violent vocabulary or emotion keywords and voice sounds without violent vocabulary or emotion keywords;
Dividing the training audio data into a voice area and a mute area according to the characteristic information; according to the characteristic types of the voice area and the mute area, carrying out fusion calculation on the characteristic information to obtain characteristic parameters of the training audio data;
and respectively modeling channels of a voice area and a mute area of the training audio data according to the characteristic parameters to obtain a voice recognition model.
In this embodiment, the feature information is fused according to the feature types of the voice and silent regions, an initial speech recognition model is built with a DenseNet-LSTM network structure and trained on the training audio data, and the model is accepted once its accuracy on the test set exceeds 99.5%.
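The acceptance criterion (keep training until test-set accuracy exceeds 99.5%) can be expressed as a simple gate. The model and evaluation data below are stand-ins, not a DenseNet-LSTM:

```python
# Accuracy gate from the text: accept the model only once its accuracy on
# a held-out test set exceeds 99.5%. Model and evaluation are stand-ins.

def accuracy(predictions, labels):
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)

def train_until_accepted(eval_rounds, threshold=0.995):
    """eval_rounds yields (predictions, labels) after each training round;
    return the 1-based round at which the model is accepted, else None."""
    for round_no, (preds, labels) in enumerate(eval_rounds, start=1):
        if accuracy(preds, labels) > threshold:
            return round_no
    return None

labels = [1] * 1000
rounds = [([1] * 900 + [0] * 100, labels),  # 90.0% accurate: keep training
          ([1] * 996 + [0] * 4, labels)]    # 99.6% accurate: accepted
accepted_at = train_until_accepted(iter(rounds))
```

In practice a maximum round count would also be enforced so training cannot loop forever below the threshold.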
In this embodiment, each time the speech recognition model obtains a piece of voice information, an SDK package is generated; a unified API interface from the alsa-lib library is provided to the application program to collect the channel information of the keywords, which is then matched against the channel information of the second keywords in the training voice information. The second keywords include, but are not limited to, preset violent words and preset emotion keywords, where emotion keywords are the pieces of key information, intelligently identified from the text of the voice information, that most influence the overall emotion of the text.
In this embodiment, after a piece of voice information is obtained each time, the obtained voice information and the calculation result are used as training data, and the learning experience of the voice recognition model is accumulated.
In this embodiment, an emotion analysis engine performs a full analysis of the emotion and emotional extremes expressed by the voice information; the network models and parameters for violent-text analysis are updated by training on the server, and the trained models are loaded into the recognizer during idle time.
Step S103: if so, inputting the voice information into a voiceprint recognition model so that the model calculates the energy value of the voice information and determines the sound source information according to a voiceprint scale factor and the energy distribution, where the sound source information includes the number of people producing the voice and their distance and direction;
in this embodiment, energy value calculation is performed on the voice information, and sound source information in the voice information is determined according to a voiceprint scale factor and energy distribution of the voice information, which specifically includes:
Respectively inputting a plurality of pieces of voice information into a plurality of corresponding matrix units, and respectively calculating the energy value and the frequency-domain energy distribution of the voice information acquired by each audio acquisition terminal; the first campus voice equipment is provided with a plurality of audio acquisition terminals, and the human-voice information is obtained by respectively filtering the first audio signal data acquired by the different audio acquisition terminals;
extracting voiceprint scale factors according to the energy value and the frequency domain energy distribution of each matrix unit, performing equalization processing on the voice information of the human voice, and outputting matrix energy distribution;
and determining the number of people and the direction of sound according to the matrix energy distribution and the positions of the plurality of audio acquisition terminals.
In this embodiment, the number of people is determined by accumulating the voiceprints of the collected human-voice information and comparing the differences between voiceprints and between their frequencies.
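A minimal sketch of the accumulate-and-compare idea, assuming voiceprints are available as fixed-length embedding vectors and using cosine similarity with a hypothetical threshold:

```python
import math

def cosine(a, b):
    # Cosine similarity between two voiceprint embeddings.
    num = sum(x * y for x, y in zip(a, b))
    den = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return num / den if den else 0.0

def count_speakers(embeddings, threshold=0.85):
    # Greedy clustering: a new speaker is counted whenever an embedding
    # matches no previously accumulated voiceprint above the threshold.
    seen = []
    for e in embeddings:
        if not any(cosine(e, s) >= threshold for s in seen):
            seen.append(e)
    return len(seen)
```

The embedding extractor and the 0.85 threshold are assumptions; the patent only states that voiceprint and frequency differences are compared.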
In this embodiment, before inputting the voiceprint parameters into the voiceprint recognition model, the method further includes:
acquiring a plurality of training audio data, and extracting first energy characteristic information of the training audio data; fusion calculation is carried out on the first energy characteristic information, and voiceprint characteristic parameters of the training audio data are obtained; and modeling the training audio data according to the voiceprint characteristic parameters to obtain a voiceprint recognition model.
In this embodiment, a plurality of pieces of voice information are respectively input into 4 corresponding matrix units, and the energy value and the frequency-domain energy distribution of the voice information acquired by each audio acquisition terminal are respectively calculated; the first campus voice device is provided with 4 unidirectional audio acquisition terminals facing different directions, where the audio acquisition terminal includes, but is not limited to, a microphone device; the human-voice information is obtained by respectively filtering the first audio signal data acquired by the different audio acquisition terminals;
in this embodiment, when the first campus voice device operates, the 4 audio acquisition terminals work simultaneously to acquire audio and feed it into the 4 corresponding matrix units; the audio acquired by each unit has a different energy value, and the energy distribution across frequency bands in the frequency domain is inconsistent. The ratio of total energy between the different matrix units, together with the energy-distribution proportion across frequency bands, constitutes the voiceprint scale factor: signals with a large proportion are enhanced, and signals with a small proportion are attenuated.
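A simplified sketch of the scale-factor step, assuming the factor is each unit's share of the total energy (the per-band shares would be computed analogously from a frequency-domain split); the function names are illustrative:

```python
def unit_energies(unit_frames):
    # Total energy of the audio frame captured by each matrix unit.
    return [sum(s * s for s in frame) for frame in unit_frames]

def scale_factors(unit_frames):
    # Voiceprint scale factor, taken here as each unit's share of the
    # total energy across all units.
    totals = unit_energies(unit_frames)
    grand = sum(totals) or 1.0
    return [t / grand for t in totals]

def equalize(unit_frames, factors):
    # Enhance units whose share is above the mean, attenuate the rest.
    mean = sum(factors) / len(factors)
    return [[s * (1.0 + f - mean) for s in frame]
            for frame, f in zip(unit_frames, factors)]
```

The linear boost/cut rule in `equalize` is an assumption; the patent states only that large-proportion signals are enhanced and small-proportion signals attenuated.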
In this embodiment, a 31-band equalizer enhances and attenuates signals using a difference equation and a transfer function, applying the corresponding gain adjustment at the center frequency of each band of the signal; the magnitude of the gain adjustment is controlled by a matrix calculation factor. The equalizer employs a biquad filter.
The difference equation is:
y[n]=(b0/a0)*x[n]+(b1/a0)*x[n-1]+(b2/a0)*x[n-2]-(a1/a0)*y[n-1]-(a2/a0)*y[n-2];
wherein a0, a1, a2, b0, b1, b2 are the coefficients of the biquad filter; y[n] is the current audio output, x[n] is the current audio input, x[n-1] is the audio input at the previous moment, x[n-2] is the audio input two moments earlier, y[n-1] is the audio output at the previous moment, and y[n-2] is the audio output two moments earlier; y[n-1] and y[n-2] are the feedback values of the system.
The transfer function is:
H(z) = (b0 + b1*z^-1 + b2*z^-2) / (1 + a1*z^-1 + a2*z^-2);
wherein a1, a2, b0, b1, b2 are the (a0-normalized) coefficients of the biquad filter; H(z) is the ratio of the Z-transform of y[n] to the Z-transform of x[n] in the difference equation; the z^-1 and z^-2 terms in the numerator correspond to the Z-transforms of x[n-1] and x[n-2], and the z^-1 and z^-2 terms in the denominator correspond to the Z-transforms of y[n-1] and y[n-2].
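The difference equation above can be implemented directly as a direct-form I biquad; this sketch keeps the a0 normalization from the equation:

```python
class Biquad:
    """Direct-form I biquad implementing the difference equation above:
    y[n] = (b0/a0)*x[n] + (b1/a0)*x[n-1] + (b2/a0)*x[n-2]
           - (a1/a0)*y[n-1] - (a2/a0)*y[n-2]
    """

    def __init__(self, b0, b1, b2, a0, a1, a2):
        # Normalize all coefficients by a0, as in the equation above.
        self.b0, self.b1, self.b2 = b0 / a0, b1 / a0, b2 / a0
        self.a1, self.a2 = a1 / a0, a2 / a0
        self.x1 = self.x2 = 0.0  # x[n-1], x[n-2]
        self.y1 = self.y2 = 0.0  # y[n-1], y[n-2]: the feedback values

    def step(self, x):
        # One sample in, one sample out.
        y = (self.b0 * x + self.b1 * self.x1 + self.b2 * self.x2
             - self.a1 * self.y1 - self.a2 * self.y2)
        self.x2, self.x1 = self.x1, x
        self.y2, self.y1 = self.y1, y
        return y
```

Peaking-EQ coefficients for each of the 31 bands would typically come from a cookbook formula parameterized by center frequency, Q, and gain; those formulas are not given in the text, so they are omitted here.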
The current voiceprint scale factors of the different matrix units are stored and fed back into the next matrix calculation, so that the matrix calculation factors are adjusted dynamically.
In this embodiment, based on multi-feature fusion, the voiceprint feature information of the human-voice features is processed in the DenseNet-LSTM network structure; the more distant the sound, the smaller its energy. From the energy distribution of the matrix, the direction and distance of the sound can be determined. The first campus voice device is provided with 4 unidirectional microphones facing different directions; the distance between the sound and the microphones is determined from the matrix-unit energies detected by the four microphones, by multiplying the ratio of the maximum matrix energy value to the minimum matrix energy value by a coefficient.
The direction of the sound is determined by the mutual ratios of the energy values of the four matrix units and, after processing, is mapped to 0-3: 0 is the direction between microphone 0 and microphone 1, 1 is the direction between microphone 1 and microphone 2, 2 is the direction between microphone 2 and microphone 3, and 3 is the direction between microphone 3 and microphone 0.
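A hedged sketch of the distance and direction rules just described, with `k` standing in for the unspecified distance coefficient and the 0-3 sector mapping taken from the text; the tie-breaking between neighbouring microphones is an assumption:

```python
def locate(unit_energies, k=1.0):
    """Hypothetical sketch: unit_energies holds the matrix-unit energies
    for microphones 0-3; k is the unspecified distance coefficient."""
    e = list(unit_energies)
    loudest = max(range(4), key=lambda i: e[i])
    nxt, prev = (loudest + 1) % 4, (loudest - 1) % 4
    # Sector i lies between microphone i and microphone i+1 (mod 4):
    # pick the sector between the loudest mic and its louder neighbour.
    direction = loudest if e[nxt] >= e[prev] else prev
    # Distance: ratio of max to min matrix energy, times the coefficient.
    distance = k * (max(e) / max(min(e), 1e-9))
    return direction, distance
```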
Step S104: and transmitting the first audio signal data, the position information of the first campus voice device and the sound source information to a management system.
In this embodiment, the alarm information is played through the broadcasting device; if the violent voice is detected again within the preset time after the alarm information is played, the first audio signal data, the position information of the first campus voice device, and the sound source information are transmitted to a management system.
In this embodiment, if it is determined that the voice information contains a preset violent keyword, the violent keywords detected in the first audio signal data are transmitted to a backend MySQL database through an Ethernet module, and alarm information is played through the broadcasting equipment; if violent voice is detected again within the preset time after the alarm information is played, the violent keyword, the first audio signal data, the position information of the first campus voice equipment, and the sound source information are transmitted to the management system.
In this embodiment, the campus voice system includes a plurality of campus voice devices and a management system, where the campus voice devices are used to collect voice and play alert information.
The campus voice system further comprises: a network interface and a control terminal, wherein the network interface is used for connecting the control terminal. The control terminal can directly modify the IP address, subnet mask, gateway, and DHCP service through the netplan configuration tool with script-assisted configuration, and issues instructions over the TCP network protocol by sending broadcast packets; each campus voice device responds in turn and sends a heartbeat packet every five seconds to confirm that it is always online, and each campus voice device can, through a device-configuration tool, modify its IP address, subnet mask, gateway, and DHCP service, play music, and adjust the volume.
In this embodiment, from ten in the evening to one at night, the campus voice equipment in the dormitory area detects voice volume in real time; if a campus voice device detects a voice volume in the 40-70 decibel range, it automatically plays the alarm information. If voice in the 40-70 decibel range is detected again within the preset time after the alarm information is played, the campus voice device's position information and the collected voice information are sent to the backend MySQL database through the Ethernet module, and an administrator is notified.
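A rough sketch of the 40-70 dB volume check; the calibration offset converting the digital (dBFS) level to a sound-pressure-like figure is microphone-dependent and purely hypothetical here:

```python
import math

def level_db(samples):
    # RMS level of the frame in dB relative to full scale (dBFS).
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    return 20.0 * math.log10(max(rms, 1e-12))

def in_alarm_band(samples, calibration_db=90.0, low=40.0, high=70.0):
    # True when the estimated sound level falls in the 40-70 dB band
    # described above; calibration_db is a hypothetical mic-dependent
    # offset from dBFS to an SPL-like figure.
    return low <= level_db(samples) + calibration_db <= high
```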
The implementation of this embodiment of the present invention has the following effects:
feature extraction is performed on the first audio signal data of any voice device on the campus, and the result is input into a voice recognition model for voice analysis to judge whether violent voice exists in the first audio signal data. If the obtained first audio signal data contains violent voice, voiceprint analysis is performed on it to obtain the sound source information of the violent voice: the number of people producing the human-voice information and the distance and direction of those people. This realizes real-time recording of students' voice information on campus, detection of whether it is violent voice, and judgment of the number, distance, and direction of the people producing the violent voice, thereby performing sound-source localization.
Example two
Referring to fig. 2, a device for campus voice recognition according to an embodiment of the present invention includes: the device comprises an acquisition module 201, a violence detection module 202, a voiceprint positioning module 203 and an information sending module 204;
the acquiring module 201 is configured to acquire audio signal data in campus voice equipment, perform feature extraction on the audio signal data, and obtain voice information;
The violence detection module 202 is configured to input the voice information into a voice recognition model, so that the voice recognition model determines whether the voice information contains a preset violence keyword;
the voiceprint positioning module 203 is configured to input the voice information into a voiceprint recognition model if the voice information contains a preset violent keyword, so that the voiceprint recognition model performs energy value calculation on the voice information and determines the sound source information in the voice information according to a voiceprint scale factor and the energy distribution of the voice information; wherein the sound source information includes: the number of people producing the human-voice information, and the distance and direction of those people;
the information sending module 204 is configured to send the first audio signal data, the location information of the first campus voice device, and the sound source information to a management system.
The acquisition module 201 includes a segmentation unit and a feature extraction unit;
the segmentation unit is used for segmenting the first audio signal data into a voice area and a mute area, and acquiring the voice area;
the characteristic extraction unit is used for extracting voice information of the voice area; the voice information comprises keyword characteristic information and voiceprint characteristic information.
The violence detection module 202 comprises a training unit and a detection unit;
the training unit is used for acquiring a plurality of training audio data and extracting characteristic information of the training audio data; wherein the training audio data comprises voice sounds containing violent vocabulary or emotion keywords and voice sounds without violent vocabulary or emotion keywords;
dividing the training audio data into a voice area and a mute area according to the characteristic information; according to the characteristic types of the voice area and the mute area, carrying out fusion calculation on the characteristic information to obtain characteristic parameters of the training audio data;
modeling channels of a voice area and a mute area of the training audio data according to the characteristic parameters to obtain a voice recognition model;
the detection unit is used for extracting characteristic information of a first keyword of the voice information; calling a unified API interface to acquire the characteristic information of the first keyword; matching the characteristic information of the first keyword against the characteristic information of the second keyword in the training voice information to judge whether the first keyword is a violent word; and if the second keyword is a violent word and the characteristic information of the first keyword matches the characteristic information of the second keyword, judging that the first keyword is a violent word.
The campus voice recognition device can implement the campus voice recognition method of the foregoing method embodiment. The options in the method embodiment above also apply to this embodiment and are not described in detail here; for the remainder, reference may be made to the content of the method embodiment above.
The implementation of the embodiment of the application has the following effects:
the acquisition module of the campus voice recognition device acquires the first audio signal data of any voice device on the campus and performs feature extraction on it to obtain human-voice information; the violence detection module inputs the voice information into a voice recognition model for voice analysis to judge whether violent voice exists in the first audio signal data. If it does, the voiceprint positioning module performs voiceprint analysis on the detected violent voice to obtain its sound source information: the number of people producing the human-voice information and the distance and direction of those people. This realizes real-time recording of students' voice information on campus, detection of whether it is violent voice, and judgment of the number, distance, and direction of the people producing the violent voice, thereby performing sound-source localization. The information sending module promptly feeds the sound source information of the violent language back to the administrator.
Example III
Correspondingly, the invention further provides a computer readable storage medium, which comprises a stored computer program, wherein the computer program is used for controlling equipment where the computer readable storage medium is located to execute the campus voice recognition method according to any embodiment.
The computer program may be divided into one or more modules/units, which are stored in the memory and executed by the processor to accomplish the present invention, for example. The one or more modules/units may be a series of computer program instruction segments capable of performing the specified functions, which instruction segments are used for describing the execution of the computer program in the terminal device.
The terminal equipment can be computing equipment such as a desktop computer, a notebook computer, a palm computer, a cloud server and the like. The terminal device may include, but is not limited to, a processor, a memory.
The processor may be a central processing unit (Central Processing Unit, CPU), other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), field programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. The general purpose processor may be a microprocessor or the processor may be any conventional processor or the like, which is a control center of the terminal device, and which connects various parts of the entire terminal device using various interfaces and lines.
The memory may be used to store the computer program and/or the module, and the processor may implement various functions of the terminal device by running or executing the computer program and/or the module stored in the memory and invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system, an application program required for at least one function, and the like; the data storage area may store data created according to the use of the mobile terminal, etc. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a Flash Card, at least one disk storage device, a flash memory device, or other non-volatile solid-state storage device.
Wherein, the terminal device's integrated modules/units may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as stand-alone products. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiment, which may also be completed by a computer program instructing related hardware; the computer program may be stored in a computer readable storage medium, and when executed by a processor, the computer program may implement the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file, or some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and so forth.
The foregoing embodiments have been provided for the purpose of illustrating the general principles of the present invention, and are not to be construed as limiting the scope of the invention. It should be noted that any modifications, equivalent substitutions, improvements, etc. made by those skilled in the art without departing from the spirit and principles of the present invention are intended to be included in the scope of the present invention.
Claims (9)
1. A method for campus voice recognition, comprising:
acquiring first audio signal data in first campus voice equipment, and filtering the first audio signal data to acquire voice information; inputting the voice information into a voice recognition model so that the voice recognition model judges whether the voice information contains preset violent keywords or not;
if yes, inputting the voice information into a voiceprint recognition model, so that the voiceprint recognition model calculates the energy value of the voice information and determines sound source information in the voice information according to a voiceprint scale factor and the energy distribution of the voice information; wherein the sound source information includes: the number of people producing the human-voice information, and the distance and direction of those people;
The step of calculating the energy value of the voice information, and determining the sound source information in the voice information according to the voiceprint scale factor and the energy distribution of the voice information, specifically comprises the following steps:
respectively inputting a plurality of pieces of voice information into a plurality of corresponding matrix units, and respectively calculating the energy value and the frequency-domain energy distribution of the voice information acquired by each audio acquisition terminal; the first campus voice equipment is provided with a plurality of audio acquisition terminals, and the human-voice information is obtained by respectively filtering the first audio signal data acquired by the different audio acquisition terminals;
extracting voiceprint scale factors according to the energy value and the frequency domain energy distribution of each matrix unit, performing equalization processing on the voice information of the human voice, and outputting matrix energy distribution;
determining the number of people and the direction of sound according to the matrix energy distribution and the positions of a plurality of audio acquisition terminals;
and transmitting the first audio signal data, the position information of the first campus voice device and the sound source information to a management system.
2. The method of campus voice recognition according to claim 1, wherein the obtaining first audio signal data in the first voice device, and performing filtering processing on the first audio signal data, obtains voice information, specifically includes:
Dividing the first audio signal data into a voice area and a mute area, removing noise of the voice area, and taking the voice area after noise removal as the voice information of the human voice.
3. The method of campus voice recognition according to claim 2, wherein the determining whether the voice information includes a preset violence keyword is specifically:
calling a unified API interface to acquire channel information of a first keyword of voice information;
matching and calculating the channel information of the first keyword and the channel information of the second keyword in the training voice information; wherein the second keywords are preset violent keywords;
if the channel information of the first keyword is identical to the channel information of the second keyword in a matching manner, the voice recognition model judges that the voice information of the voice contains a preset violent keyword.
4. The method of campus voice recognition of claim 1 wherein prior to entering the voice information into a voice recognition model, further comprising:
acquiring a plurality of training audio data, and extracting characteristic information of the training audio data; wherein the training audio data comprises voice sounds containing violent vocabulary or emotion keywords and voice sounds without violent vocabulary or emotion keywords;
Dividing the training audio data into a voice area and a mute area according to the characteristic information; according to the characteristic types of the voice area and the mute area, carrying out fusion calculation on the characteristic information to obtain characteristic parameters of the training audio data;
and respectively modeling channels of a voice area and a mute area of the training audio data according to the characteristic parameters to obtain a voice recognition model.
5. The method of campus voice recognition of claim 1, wherein before inputting the voiceprint parameters into the voiceprint recognition model, further comprising:
acquiring a plurality of training audio data, and extracting first energy characteristic information of the training audio data; fusion calculation is carried out on the first energy characteristic information, and voiceprint characteristic parameters of the training audio data are obtained; and modeling the training audio data according to the voiceprint characteristic parameters to obtain a voiceprint recognition model.
6. The method of campus voice recognition of claim 1, wherein before the transmitting the first audio signal data, the location information of the first campus voice device, and the sound source information to the management system, further comprising:
Playing the alarm information through broadcasting equipment; and if the violent voice is detected again within the preset time after the alarm information is played, transmitting the first audio signal data, the position information of the first campus voice equipment and the sound source information to a management system.
7. An apparatus for campus speech recognition, comprising: the device comprises an acquisition module, a violence detection module, a voiceprint positioning module and an information sending module;
the acquisition module is used for acquiring first audio signal data in first campus voice equipment, and filtering the first audio signal data to acquire voice information;
the violence detection module is used for inputting the voice information into a voice recognition model so that the voice recognition model can judge whether the voice information contains preset violence keywords or not;
the voiceprint positioning module is used for inputting the voice information into a voiceprint recognition model if the voice information contains a preset violent keyword, so that the voiceprint recognition model calculates the energy value of the voice information and determines sound source information in the voice information according to a voiceprint scale factor and the energy distribution of the voice information; wherein the sound source information includes: the number of people producing the human-voice information, and the distance and direction of those people;
The step of calculating the energy value of the voice information, and determining the sound source information in the voice information according to the voiceprint scale factor and the energy distribution of the voice information, specifically comprises the following steps:
respectively inputting a plurality of pieces of voice information into a plurality of corresponding matrix units, and respectively calculating the energy value and the frequency-domain energy distribution of the voice information acquired by each audio acquisition terminal; the first campus voice equipment is provided with a plurality of audio acquisition terminals, and the human-voice information is obtained by respectively filtering the first audio signal data acquired by the different audio acquisition terminals;
extracting voiceprint scale factors according to the energy value and the frequency domain energy distribution of each matrix unit, performing equalization processing on the voice information of the human voice, and outputting matrix energy distribution;
determining the number of people and the direction of sound according to the matrix energy distribution and the positions of a plurality of audio acquisition terminals;
the information sending module is used for sending the first audio signal data, the position information of the first campus voice equipment and the sound source information to a management system.
8. The apparatus for campus voice recognition of claim 7, wherein the acquisition module comprises a segmentation unit and a feature extraction unit;
The segmentation unit is used for segmenting the first audio signal data into a voice area and a mute area, and acquiring the voice area;
the characteristic extraction unit is used for extracting voice information of the voice area; the voice information comprises keyword characteristic information and voiceprint characteristic information.
9. The apparatus of claim 7, wherein the violence detection module comprises a training unit and a detection unit;
the training unit is used for acquiring a plurality of training audio data and extracting characteristic information of the training audio data; wherein the training audio data comprises voice sounds containing violent vocabulary or emotion keywords and voice sounds without violent vocabulary or emotion keywords;
dividing the training audio data into a voice area and a mute area according to the characteristic information; according to the characteristic types of the voice area and the mute area, carrying out fusion calculation on the characteristic information to obtain characteristic parameters of the training audio data;
modeling channels of a voice area and a mute area of the training audio data according to the characteristic parameters to obtain a voice recognition model;
The detection unit is used for extracting characteristic information of a first keyword of the voice information; calling a unified API interface to acquire the characteristic information of the first keyword; matching the characteristic information of the first keyword against the characteristic information of the second keyword in the training voice information to judge whether the first keyword is a violent word; and if the second keyword is a violent word and the characteristic information of the first keyword matches the characteristic information of the second keyword, judging that the first keyword is a violent word.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211592939.2A CN116229987B (en) | 2022-12-13 | 2022-12-13 | Campus voice recognition method, device and storage medium |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211592939.2A CN116229987B (en) | 2022-12-13 | 2022-12-13 | Campus voice recognition method, device and storage medium |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116229987A CN116229987A (en) | 2023-06-06 |
CN116229987B true CN116229987B (en) | 2023-11-21 |
Family
ID=86588111
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211592939.2A Active CN116229987B (en) | 2022-12-13 | 2022-12-13 | Campus voice recognition method, device and storage medium |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116229987B (en) |
Citations (26)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US5831936A (en) * | 1995-02-21 | 1998-11-03 | State Of Israel/Ministry Of Defense Armament Development Authority - Rafael | System and method of noise detection |
KR19990042393A (en) * | 1997-11-26 | 1999-06-15 | 전주범 | Character Substitution Method on TV |
JPH11202890A (en) * | 1998-01-20 | 1999-07-30 | Ricoh Co Ltd | Speech retrieval device |
RU2008141557A (en) * | 2008-10-20 | 2010-04-27 | Федеральное государственное образовательное учреждение высшего профессионального образования "Чувашский государственный университе | METHOD FOR RECOGNIZING KEY WORDS IN CONNECTED SPEECH |
WO2011041977A1 (en) * | 2009-10-10 | 2011-04-14 | Xiong Dianyuan | Cross monitoring method and system based on voiceprint identification and location tracking |
KR101184012B1 (en) * | 2011-03-31 | 2012-09-21 | 경남대학교 산학협력단 | Intelligent robot for prevention of school violence and protection of children |
CN104821882A (en) * | 2015-05-08 | 2015-08-05 | 南京财经大学 | Network security verification method based on voice biometric features |
WO2015180447A1 (en) * | 2014-05-28 | 2015-12-03 | 西安中兴新软件有限责任公司 | Alarming method, terminal, and storage medium |
CN105280183A (en) * | 2015-09-10 | 2016-01-27 | 百度在线网络技术(北京)有限公司 | Voice interaction method and system |
CN106100777A (en) * | 2016-05-27 | 2016-11-09 | 西华大学 | Broadcast support method based on speech recognition technology |
WO2017012496A1 (en) * | 2015-07-23 | 2017-01-26 | 阿里巴巴集团控股有限公司 | User voiceprint model construction method, apparatus, and system |
CN107564530A (en) * | 2017-08-18 | 2018-01-09 | 浙江大学 | A kind of unmanned plane detection method based on vocal print energy feature |
WO2018018906A1 (en) * | 2016-07-27 | 2018-02-01 | 深圳市鹰硕音频科技有限公司 | Voice access control and quiet environment monitoring method and system |
CN109410521A (en) * | 2018-12-28 | 2019-03-01 | 苏州思必驰信息科技有限公司 | Voice monitoring alarm method and system |
CN109635872A (en) * | 2018-12-17 | 2019-04-16 | 上海观安信息技术股份有限公司 | Personal identification method, electronic equipment and computer program product |
CN110970049A (en) * | 2019-12-06 | 2020-04-07 | 广州国音智能科技有限公司 | Multi-person voice recognition method, device, equipment and readable storage medium |
CN111508475A (en) * | 2020-04-16 | 2020-08-07 | 五邑大学 | Robot awakening voice keyword recognition method and device and storage medium |
CN111540342A (en) * | 2020-04-16 | 2020-08-14 | 浙江大华技术股份有限公司 | Energy threshold adjusting method, device, equipment and medium |
CN111971647A (en) * | 2018-04-09 | 2020-11-20 | 麦克赛尔株式会社 | Speech recognition apparatus, cooperation system of speech recognition apparatus, and cooperation method of speech recognition apparatus |
WO2021093380A1 (en) * | 2019-11-13 | 2021-05-20 | 苏宁云计算有限公司 | Noise processing method and apparatus, and system |
CN112887872A (en) * | 2021-01-04 | 2021-06-01 | 深圳千岸科技股份有限公司 | Playing method of earphone voice instruction, earphone and storage medium |
CN113556313A (en) * | 2021-01-27 | 2021-10-26 | 福建环宇通信息科技股份公司 | Real-time talkback intervention and alarm platform based on AI technology |
CN114492196A (en) * | 2022-02-14 | 2022-05-13 | 瑶声科技(苏州)有限责任公司 | Fault rapid detection method and system based on normal wave energy ratio theory |
CN114694344A (en) * | 2020-12-28 | 2022-07-01 | 深圳云天励飞技术股份有限公司 | Campus violence monitoring method and device and electronic equipment |
CN114743562A (en) * | 2022-06-09 | 2022-07-12 | 成都凯天电子股份有限公司 | Method and system for recognizing airplane voiceprint, electronic equipment and storage medium |
CN115116437A (en) * | 2022-04-07 | 2022-09-27 | 腾讯科技(深圳)有限公司 | Speech recognition method, apparatus, computer device, storage medium and product |
- 2022-12-13 CN CN202211592939.2A patent/CN116229987B/en active Active
Non-Patent Citations (2)
Title |
---|
An improved feature extraction method for speech keywords; Wang Yaoming; Journal of Shanghai Dianji University (Issue 04); full text *
Indoor sound source localization method based on acoustic location fingerprints; Wang Shuopeng; Yang Peng; Sun Hao; Journal of Beijing University of Technology (Issue 02); full text *
Also Published As
Publication number | Publication date |
---|---|
CN116229987A (en) | 2023-06-06 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111161752B (en) | Echo cancellation method and device | |
US8438026B2 (en) | Method and system for generating training data for an automatic speech recognizer | |
Kingsbury et al. | Recognizing reverberant speech with RASTA-PLP | |
US10614827B1 (en) | System and method for speech enhancement using dynamic noise profile estimation | |
CN111816218A (en) | Voice endpoint detection method, device, equipment and storage medium | |
CN108091323B (en) | Method and apparatus for emotion recognition from speech | |
CN108899047A (en) | The masking threshold estimation method, apparatus and storage medium of audio signal | |
CN112382300A (en) | Voiceprint identification method, model training method, device, equipment and storage medium | |
TWI523006B (en) | Method for using voiceprint identification to operate voice recoginition and electronic device thereof | |
JPH11296192A (en) | Speech feature value compensating method for speech recognition, speech recognizing method, device therefor, and recording medium recorded with speech recognision program | |
CN116229987B (en) | Campus voice recognition method, device and storage medium | |
CN113658596A (en) | Semantic identification method and semantic identification device | |
WO2021152566A1 (en) | System and method for shielding speaker voice print in audio signals | |
CN105355206A (en) | Voiceprint feature extraction method and electronic equipment | |
CN110661923A (en) | Method and device for recording speech information in conference | |
Upadhyay et al. | Robust recognition of English speech in noisy environments using frequency warped signal processing | |
CN110767238B (en) | Blacklist identification method, device, equipment and storage medium based on address information | |
Dai et al. | 2D Psychoacoustic modeling of equivalent masking for automatic speech recognition | |
Kim et al. | Spectral distortion model for training phase-sensitive deep-neural networks for far-field speech recognition | |
Singh et al. | A novel algorithm using MFCC and ERB gammatone filters in speech recognition | |
Prasanna Kumar et al. | Supervised and unsupervised separation of convolutive speech mixtures using f 0 and formant frequencies | |
Wang et al. | An ideal Wiener filter correction-based cIRM speech enhancement method using deep neural networks with skip connections | |
CN111833897B (en) | Voice enhancement method for interactive education | |
CN117153185B (en) | Call processing method, device, computer equipment and storage medium | |
Fan et al. | Power-normalized PLP (PNPLP) feature for robust speech recognition |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information |
Address after: No. 56 Nanli East Road, Shiqi Town, Panyu District, Guangzhou City, Guangdong Province, 510000
Applicant after: Guangdong Baolun Electronics Co.,Ltd.
Address before: No.19 Chuangyuan Road, Zhongcun street, Panyu District, Guangzhou, Guangdong 510000
Applicant before: GUANGZHOU ITC ELECTRONIC TECHNOLOGY Co.,Ltd.
GR01 | Patent grant | ||