CN116168699A - Security platform control method and device based on voice recognition, storage medium and equipment


Info

Publication number
CN116168699A
Authority
CN
China
Prior art keywords: voice, client, control instruction, recognition, server
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310156081.3A
Other languages
Chinese (zh)
Inventor
袁丛琳
王磊
方磊
徐林楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi Huarui Digital Technology Co ltd
Jiangxi Yiyuan Multi Media Technology Co ltd
Original Assignee
Jiangxi Huarui Digital Technology Co ltd
Jiangxi Yiyuan Multi Media Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Jiangxi Huarui Digital Technology Co ltd and Jiangxi Yiyuan Multi Media Technology Co ltd
Priority to CN202310156081.3A
Publication of CN116168699A
Legal status: Pending

Classifications

    • G10L 15/22: Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G10L 15/08: Speech classification or search
    • G10L 15/16: Speech classification or search using artificial neural networks
    • G10L 15/1815: Semantic context, e.g. disambiguation of the recognition hypotheses based on word meaning
    • G10L 25/87: Detection of discrete points within a voice signal (under G10L 25/78: Detection of presence or absence of voice signals)
    • G10L 2015/088: Word spotting
    • G10L 2015/223: Execution procedure of a spoken command
    • Y02D 30/70: Reducing energy consumption in wireless communication networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Physics & Mathematics (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Signal Processing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a security platform control method, device, storage medium and equipment based on voice recognition. The method comprises the following steps: the voice control mode of the client is woken up by a wake-up word; the client pre-processes the collected voice, encodes it into an AAC format file, and sends the file to the server based on a preset communication protocol; the server starts, receives the AAC voice file sent by the client over the preset communication protocol, and decodes it into a pulse code modulation (PCM) format file; the server runs inference on the PCM voice data through voice recognition and semantic parsing models, obtains a text control instruction message and sends it to the client; the client executes the corresponding action after receiving the text control instruction message and broadcasts a prompt voice. The security platform control method, device, storage medium and equipment based on voice recognition can improve voice recognition accuracy in specific scenes, and by improving the pickup effect of a single microphone they enhance adaptability and compatibility with existing equipment.

Description

Security platform control method and device based on voice recognition, storage medium and equipment
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a security platform control method, device, storage medium and equipment based on voice recognition.
Background
In the field of security monitoring, users can command and dispatch through voice instructions, for example by exercising intelligent voice control over a security platform, which makes human-machine interaction far more convenient and intelligent.
At present, voice recognition services in systems such as security platforms and large-screen equipment usually call interfaces provided by third-party voice recognition companies to achieve voice control. This brings obvious problems. On the one hand, general-purpose voice recognition services have low recognition accuracy for specific fields and scenes, so the user experience is poor. On the other hand, third-party service providers often require additional hardware support, ranging from an entire display screen down to a dedicated array microphone, so the service cannot be updated on existing equipment.
Disclosure of Invention
In view of the defects of the prior art, the invention aims to provide a security platform control method, device, storage medium and equipment based on voice recognition. The recognition system is custom-configured for hot words in the security field, which improves voice recognition accuracy in specific scenes; voice preprocessing such as voice endpoint detection (VAD), voice enhancement and echo cancellation improves the pickup effect of a single microphone and enhances adaptability and compatibility with existing equipment.
To solve the above technical problems, one aspect of the present invention provides a security platform control method based on voice recognition, the method comprising:
1) waking up the voice control mode of the client with a wake-up word;
2) the client pre-processing the collected voice through voice endpoint detection (VAD), voice enhancement, echo cancellation and the like;
3) encoding the pre-processed voice into an AAC format file and sending it to the server based on a preset communication protocol;
4) the server starting, receiving the AAC format voice file sent by the client over the preset communication protocol, and decoding it into a pulse code modulation (Pulse Code Modulation, PCM) format file;
5) the server running inference on the PCM voice data through voice recognition and semantic parsing models, obtaining a text control instruction message and sending it to the client;
6) the client executing the corresponding action after receiving the text control instruction message and broadcasting a prompt voice;
7) the user exiting the voice control mode with a custom close instruction; the closed voice control mode waits to be woken again by the wake-up word.
Preferably, before the wake-up step, the method further comprises:
collecting voice data samples of a preset or user-selected wake-up word;
feeding the sample data into an established voiceprint recognition neural network for training;
obtaining, after training, a 1:1 speaker verification neural network model based on voiceprint recognition;
building the voice wake-up function on the 1:1 speaker verification neural network model and using it to wake up the voice control mode of the client.
Preferably, the collecting of voice data samples of a preset or user-selected wake-up word includes:
the user speaks a preset or user-defined wake-up word and repeats it three times;
the microphone collects the voice data samples of the wake-up word, and after preprocessing such as voice enhancement and echo cancellation, the samples are stored in a local database.
Preferably, the client pre-processing the collected voice through voice endpoint detection (VAD), voice enhancement, echo cancellation and the like includes:
first, automatically determining the start point and end point of the voice with a VAD algorithm, ensuring the quality of the collected voice and the recognition efficiency;
then, performing real-time noise reduction on the collected voice with a constrained low-rank sparse decomposition (CLSMD) method;
finally, retaining the noise-reduced microphone near-end speech and cancelling the far-end echo with an adaptive echo cancellation (Adaptive Echo Cancellation, AEC) algorithm.
Preferably, the encoding of the pre-processed voice into an AAC format file and sending it to the server based on a preset communication protocol includes:
encoding the pre-processed voice into an AAC format file to speed up end-cloud transmission;
sending the AAC voice file to the server based on the preset communication protocol.
Preferably, the server starting, receiving the AAC format voice file sent by the client over the preset communication protocol, and decoding it into a pulse code modulation (Pulse Code Modulation, PCM) format file includes:
starting the server to receive the AAC format voice file sent by the client;
decoding the AAC format voice file, based on a format conversion protocol, into a PCM format file with a sampling rate of 16 kHz and a bit depth of 16 bits.
Preferably, the server running inference on the PCM voice data through voice recognition and semantic parsing models, obtaining a text control instruction message and sending it to the client includes:
the server decodes the voice through the trained voice recognition model to obtain text data;
the text data is parsed into a control instruction message through preset instruction rules and the semantic parsing model, and the control instruction message is sent to the client.
Preferably, before the server decodes the voice through the trained voice recognition model to obtain text data, the method further includes:
establishing a voice control instruction data set for a specific field such as the security platform;
establishing a voice control instruction recognition neural network;
training the voice recognition neural network model on the field-specific voice control instruction data set;
building a mapping table between voice control instructions and control instruction messages for the specific field.
Preferably, the establishing of the voice control instruction recognition neural network includes:
building a neural network model structure from a shared encoder, a CTC decoder and an attention decoder;
training against a loss function in which the CTC decoder and the attention decoder each score the output of the shared encoder.
Preferably, the training of the voice recognition neural network model on the field-specific voice control instruction data set includes:
extracting Fbank features from the input WAV format voice data set through pre-emphasis, framing, windowing, short-time Fourier transform, Mel filtering and similar steps (see the sketch after this list);
downsampling the Fbank feature sequence with a 2-dimensional convolutional layer;
feeding the downsampled Fbank feature sequence into the Conformer encoding layers of the shared encoder;
feeding the output of the shared encoder into the CTC decoder and the attention decoder simultaneously, the two decoders computing loss terms against the label sequence and the speech frame sequence respectively;
scoring the output in streaming mode with the CTC decoder and, after streaming recognition ends, re-scoring the recognition result jointly with the attention decoder and the CTC decoder to further refine it;
deploying the iteratively trained neural network model to the server, and deploying it to the client after int8 quantization.
Preferably, the output of the shared encoder is fed simultaneously into the CTC decoder and the attention decoder, and the two decoders compute loss terms against the speech frame sequence and the label sequence respectively. The loss function consists of two parts, the CTC loss of the frame-level CTC decoder and the autoregressive likelihood loss of the label-sequence attention decoder, and the joint loss function is expressed as:
$L_{joint}(x,y) = \alpha L_{CTC}(x,y) + (1-\alpha) L_{ARL}(x,y)$
where the first term is the CTC loss, the second term is the autoregressive likelihood (ARL) loss, x is the acoustic feature of the input voice, y is the label sequence, and α is a parameter that balances the CTC loss against the autoregressive likelihood loss.
Preferably, the client executing the corresponding action after receiving the text control instruction message and broadcasting the prompt voice further includes:
after receiving the text control instruction message, the client first checks the device state and then executes the corresponding action according to that state;
after the action is executed, a prompt voice is synthesized through the voice synthesis model and broadcast.
The user exiting the voice control mode with a custom close instruction, the closed voice control mode waiting to be woken again by the wake-up word, includes:
the user defines an instruction for exiting the voice control mode, and the closed voice control mode waits to be woken again by the wake-up word;
preferably and/or, after the client executes a control instruction it waits 10 seconds, and if no new control instruction is issued it automatically exits the voice control mode.
Preferably, both the custom wake-up word and the close instruction support personalized modification by the user.
The invention also provides a security platform control method based on voice recognition, applied in the client's offline mode or ultra-fast recognition mode, comprising the following steps:
1) waking up the voice control mode of the client with a wake-up word;
2) the client pre-processing the collected voice through voice endpoint detection (VAD), voice enhancement, echo cancellation and the like;
3) running inference on the pre-processed voice through local voice recognition and semantic parsing models to obtain a text control instruction;
4) the client executing the corresponding action according to the text control instruction and broadcasting a prompt voice;
5) the user exiting the voice control mode with a custom close instruction; the closed voice control mode waits to be woken again by the wake-up word.
Waking up the voice control mode of the client with a wake-up word includes:
the user can wake up the voice control mode of the client with a preset wake-up word;
in particular, network communication is required when customizing the wake-up word; after the wake-up word has been customized for the first time, subsequent wake-ups no longer depend on the network.
Preferably, the client pre-processing the collected voice through voice endpoint detection (VAD), voice enhancement, echo cancellation and the like includes:
first, automatically determining the start point and end point of the voice with a VAD algorithm, ensuring the quality of the collected voice and the recognition efficiency;
then, performing real-time noise reduction on the collected voice with a constrained low-rank sparse decomposition (CLSMD) method;
finally, retaining the noise-reduced microphone near-end speech and cancelling the far-end echo with an adaptive echo cancellation (Adaptive Echo Cancellation, AEC) algorithm.
Preferably, running inference on the pre-processed voice through the local voice recognition and semantic parsing models to obtain a text control instruction includes:
the client decodes the voice locally through the trained voice recognition model to obtain text data;
the text data is parsed into a control instruction message through preset instruction rules and the semantic parsing model.
Preferably, before the voice is decoded through the trained voice recognition model to obtain text data, the method further includes:
establishing a voice control instruction data set for a specific field such as the security platform;
establishing a voice control instruction recognition neural network;
training the voice recognition neural network model on the field-specific voice control instruction data set;
performing int8 quantization on the voice recognition neural network model to obtain a quantized model whose parameter count and model size are greatly reduced;
building a mapping table between voice control instructions and control instruction messages for the specific field.
Preferably, the client executing the corresponding action according to the text control instruction and broadcasting the prompt voice includes:
the client checks the device state against the text control instruction and then executes the corresponding action according to that state;
after the action is executed, a prompt voice is synthesized through the voice synthesis model and broadcast.
Preferably, the user exiting the voice control mode with a custom close instruction, the closed voice control mode waiting to be woken again by the wake-up word, includes:
the user can customize the instruction for exiting the voice control mode, and the closed voice control mode waits to be woken again by the wake-up word; preferably and/or, the client waits 10 seconds after executing a control instruction, and if no new control instruction is issued it automatically exits the voice control mode.
The invention further provides an online voice recognition device for security platform control, applied to a voice recognition client and comprising a pickup module, a voice preprocessing module, a voice encoding module, a text control instruction message receiving module and a voice synthesis module. The functions of each module are as follows:
the sound pickup module is used for collecting voice instruction data;
the voice preprocessing module is used for detecting a voice starting point and an end point, and reducing noise and eliminating echo of the voice;
the voice coding module is used for coding the preprocessed voice into a format capable of being transmitted efficiently;
the text control instruction message receiving module is used for acquiring the text control instruction message sent from the server;
And the voice synthesis module is used for synthesizing and broadcasting corresponding prompt voice after the control instruction message is acquired.
The invention also provides an online voice recognition device for security platform control, applied to a voice recognition server and comprising a voice decoding module, a voice recognition and semantic parsing module and a text control instruction message sending module. The functions of each module are as follows:
the voice decoding module is used for decoding the efficiently transmitted voice format file into a PCM format file with a sampling rate of 16 kHz and a bit depth of 16 bits;
the voice recognition and semantic parsing module is used for running inference on the PCM format voice data to obtain a control instruction message;
and the text control instruction message sending module is used for sending the text control instruction message.
Another aspect of the present invention provides an offline voice recognition device for security platform control, applied to a voice recognition client and comprising:
the sound pickup module is used for collecting voice instruction data;
the voice preprocessing module is used for detecting a voice starting point and an end point, and reducing noise and eliminating echo of the voice;
the voice recognition and semantic parsing module is used for running inference on the pre-processed voice data to obtain a text control instruction message;
And the voice synthesis module is used for synthesizing and broadcasting corresponding prompt voice after the control instruction message is acquired.
Another aspect of the present invention provides a computer-readable storage medium for the voice-recognition-based security platform. The computer-readable storage medium stores a computer program and deep neural network model parameters; when executed by a processor, the computer program implements the security platform voice recognition control method of any embodiment of the present invention, and the deep neural network model parameters enable real-time voice recognition inference.
Another aspect of the present invention provides a computer device for the voice-recognition-based security platform, the computer device comprising a processor, a memory, a system bus, an I/O interface and a nonvolatile storage medium. The processor and the memory are connected through the system bus, and the memory stores a computer program which, when executed by the processor, causes the processor to perform any implementation of the security platform voice recognition control method according to any embodiment of the present invention.
It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.
Compared with the prior art, the invention has the beneficial effects that:
the client is awakened by the voice of the awakening word, the client sends the acquired voice to the server after preprocessing, the server obtains text data through voice recognition and reasoning of a semantic analysis model, the text data are analyzed into an instruction message and then sent to the client, the client receives the control instruction, executes corresponding actions and reports the prompt voice. The invention aims at the hot word customization configuration recognition system in the security field, can improve the voice recognition accuracy of specific scenes, improves the pickup effect of a single microphone through preprocessing such as VAD, noise reduction and echo cancellation, and enhances the adaptability and compatibility of original equipment.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings required for describing the embodiments or the prior art are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without inventive effort.
FIG. 1 is a flowchart of a security platform control method based on voice recognition according to a first embodiment of the present invention;
FIG. 2 is a schematic diagram of voice control of a security platform according to a first embodiment of the present invention;
FIG. 3 is a schematic diagram of a speech recognition model network according to a first embodiment of the present invention;
FIG. 4 is a schematic diagram of a voice recognition model network training and reasoning data flow provided in accordance with a first embodiment of the present invention;
FIG. 5 is a schematic diagram of a client device applied to a security platform according to a first embodiment of the present invention;
FIG. 6 is a schematic diagram of a server device applied to a security platform according to a first embodiment of the present invention;
FIG. 7 is a flowchart of a security platform control method based on voice recognition according to a second embodiment of the present invention;
FIG. 8 is a schematic diagram of a client device applied to a security platform according to a second embodiment of the present invention;
FIG. 9 is a schematic diagram of a computer device according to a first embodiment of the present invention.
Detailed Description
To make the technical means, creative features, objectives and effects of the present invention easy to understand, the technical solutions in the embodiments of the present invention are described clearly and completely below. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those skilled in the art based on these embodiments without inventive effort fall within the scope of the present invention.
Example 1
The flow of this embodiment is shown in FIG. 1 and includes:
s101: the voice control mode of the client is awakened by the wake-up word voice.
In this embodiment, the user may use a wake-up word preset by the client or a custom wake-up word, for example, but not limited to, Chinese wake-up words such as "Hello Xiaoyi" or "Xiaoyi Xiaoyi", or English wake-up words such as "Hi, Xiaoyi" and "Hey, Xiaoyi Xiaoyi". When using the client for the first time, the user needs to customize a Chinese phrase of three or more characters or an English phrase of three or more syllables and, guided by the client, repeat it three times so the samples can be stored for training the local voice wake-up model. In particular, the voice wake-up model described in this embodiment is a neural network model that incorporates voiceprint recognition techniques. In actual use, voice wake-up acts as the key that opens the voice control mode: only a user authorized by the administrator can wake up the voice control mode with the wake-up word.
The microphone collects the voice data samples of the custom wake-up word, and after preprocessing such as voice enhancement and echo cancellation, the data is saved to the local database. Voiceprint information is a personal biometric feature and part of the user's privacy, so for information security it must be stored on the client.
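The patent does not disclose the internals of the voiceprint network, so the following Python sketch only illustrates the surrounding 1:1 speaker verification logic: an assumed embedding extractor `embed()` maps an utterance to a vector, the three enrollment repetitions are averaged into a voiceprint, and wake-up is accepted only when the cosine similarity clears a threshold (the 0.7 value is likewise an assumption).

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity between two embedding vectors.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def enroll(wake_word_samples, embed):
    # Average the embeddings of the three enrollment repetitions.
    return np.mean([embed(s) for s in wake_word_samples], axis=0)

def verify(utterance, enrolled_print, embed, threshold=0.7):
    # 1:1 speaker verification: wake up only if the utterance matches the
    # enrolled voiceprint closely enough; other speakers stay rejected.
    return cosine(embed(utterance), enrolled_print) >= threshold
```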
S102: the client terminal pre-processes the collected voice through voice endpoint detection (VAD), voice enhancement, echo cancellation and the like.
In this embodiment, the voice preprocessing module comprises three steps: VAD, voice enhancement and echo cancellation. VAD detects the short-time energy characteristics of speech to decide whether voice input is present. According to the spectral range of the human voice, the input spectrum is divided into six sub-bands (80 Hz-250 Hz, 250 Hz-500 Hz, 500 Hz-1 kHz, 1 kHz-2 kHz, 2 kHz-3 kHz, 3 kHz-4 kHz) and the short-time energy of each sub-band is computed, with the frame length usually set to 30 ms. A Gaussian model is then built, and a log-likelihood ratio function is obtained from the probability density function. The log-likelihood ratio has a global part and a local part: the weighted sum over the six sub-bands gives the global likelihood ratio, and each individual sub-band gives a local one. When deciding, the sub-bands are judged first; if no sub-band indicates speech, the global ratio is judged; if either test indicates speech, voice input is declared.
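A minimal sketch of that sub-band decision follows; it replaces the Gaussian log-likelihood-ratio test described above with simple energy ratios against a noise-floor estimate, so the thresholds and noise-floor handling are assumptions.

```python
import numpy as np

BANDS = [(80, 250), (250, 500), (500, 1000),
         (1000, 2000), (2000, 3000), (3000, 4000)]  # Hz, the six sub-bands

def subband_energies(frame, sr=16000):
    # Short-time energy per sub-band for one 30 ms frame (480 samples at 16 kHz).
    spec = np.abs(np.fft.rfft(frame)) ** 2
    freqs = np.fft.rfftfreq(len(frame), 1.0 / sr)
    return np.array([spec[(freqs >= lo) & (freqs < hi)].sum() for lo, hi in BANDS])

def is_speech(frame, noise_floor, sr=16000, local_ratio=4.0, global_ratio=2.0):
    e = subband_energies(frame, sr)
    # Local decision: any single sub-band well above its noise floor.
    if np.any(e > local_ratio * noise_floor):
        return True
    # Global decision: (here unweighted) sum across all six sub-bands.
    return bool(e.sum() > global_ratio * noise_floor.sum())
```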
Voice enhancement performs real-time noise reduction on the collected voice with the constrained low-rank sparse decomposition (CLSMD) method. In the frequency domain, noise signals generally have high redundancy, i.e. high correlation, which shows up mainly as a low-rank structure of the background noise; Gaussian white noise, for example, has a flat power spectrum and a time-frequency matrix of rank 1. Compared with noise, speech signals have a certain sparsity and are active only at a few frequency points. On this basis, the speech and the noise in noisy speech can be separated by exploiting the sparsity of speech and the low rank of noise. This embodiment adopts the CLSMD voice enhancement method: assuming the speech time-frequency matrix is sparse and the noise time-frequency matrix is low-rank, prior knowledge of speech and noise guides the low-rank and sparse (LS) component decomposition to obtain an effective sparse matrix, and the enhanced speech is recovered through the inverse short-time Fourier transform (ISTFT).
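The sketch below renders the low-rank-plus-sparse split with a plain alternating decomposition on the STFT magnitude; it is not CLSMD itself (whose constraints the patent does not spell out), and the rank-1 noise model, iteration count and threshold are assumptions.

```python
import numpy as np
from scipy.signal import stft, istft

def enhance(noisy, sr=16000, n_iter=30, lam=0.15):
    f, t, Z = stft(noisy, fs=sr, nperseg=512)
    M = np.abs(Z)                  # magnitude spectrogram of the noisy speech
    L = np.zeros_like(M)           # low-rank estimate (background noise)
    S = np.zeros_like(M)           # sparse estimate (speech)
    for _ in range(n_iter):
        # Low-rank step: keep only the leading singular component of M - S,
        # matching the rank-1 model of stationary noise such as white noise.
        U, sig, Vt = np.linalg.svd(M - S, full_matrices=False)
        sig[1:] = 0.0
        L = (U * sig) @ Vt
        # Sparse step: soft-threshold the residual, keeping the few active
        # time-frequency bins where speech lives.
        R = M - L
        S = np.sign(R) * np.maximum(np.abs(R) - lam * np.abs(R).mean(), 0.0)
    # Re-attach the noisy phase and invert (ISTFT) to get the enhanced signal.
    _, x = istft(S * np.exp(1j * np.angle(Z)), fs=sr, nperseg=512)
    return x
```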
In a practical scenario the microphone picks up the real human voice, the sound emitted by the loudspeaker and indoor echo, and the echo interference cannot be removed simply by running a noise reduction algorithm on the collected voice. This embodiment performs echo cancellation with an adaptive echo cancellation (Adaptive Echo Cancellation, AEC) algorithm: an adaptive filter identifies the parameters of the unknown echo channel, a far-end signal model is built from the correlation between the loudspeaker signal and the multiple echoes it generates, and the adaptive algorithm simulates and adjusts the echo path so that its impulse response approximates the real echo path. The estimated echo is then subtracted from the noise-reduced microphone signal, retaining the microphone near-end speech and cancelling the far-end echo.
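As one standard instance of such an adaptive filter, a normalized LMS (NLMS) echo canceller is sketched below; the patent names only "adaptive echo cancellation", so the NLMS update rule, tap count and step size are assumptions.

```python
import numpy as np

def nlms_aec(mic, far, taps=256, mu=0.5, eps=1e-6):
    # mic: noise-reduced microphone signal (near-end speech + echo);
    # far: loudspeaker (far-end) signal that generates the echo.
    w = np.zeros(taps)                 # adaptive estimate of the echo path
    buf = np.zeros(taps)               # recent far-end samples
    out = np.zeros_like(mic)
    for n in range(len(mic)):
        buf = np.roll(buf, 1)
        buf[0] = far[n]
        echo_hat = w @ buf             # echo simulated through the estimated path
        e = mic[n] - echo_hat          # residual keeps the near-end speech
        w += mu * e * buf / (buf @ buf + eps)  # normalized LMS update
        out[n] = e
    return out
```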
S103: and encoding the preprocessed voice into an AAC format file, and transmitting the AAC format file to a server based on a preset communication protocol.
In this embodiment, the voice control service of the security platform is provided online by the server, so end-cloud transmission must be completed efficiently. The AAC format is an efficient audio coding format that achieves high sound quality with a small file size. The voice processed in S102 is encoded into AAC in 30 ms chunks; the AAC voice is then packaged with a preset communication protocol such as the WebSocket protocol, the Transmission Control Protocol (TCP) or the Hypertext Transfer Protocol (HTTP) to form a voice control instruction message, which is sent to the server over the preset communication link between the client and the server.
The voice control instruction message may further include the device code, authentication information, video channel number, audio channel number, environment variables of the security platform client, format parameters of the collected voice data, an end-of-speech tag, and the domain name or URL (Uniform Resource Locator) of the server.
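A hedged sketch of the client-side send path follows, using the third-party `websockets` package; the JSON field names, the URL and the idea of base64-encoding the AAC payload are illustrative assumptions, since the patent lists the message fields but not the wire layout.

```python
import base64
import json
import websockets  # third-party package: pip install websockets

async def send_voice(aac_chunks, url="ws://server.example/voice"):  # assumed URL
    async with websockets.connect(url) as ws:
        for i, chunk in enumerate(aac_chunks):        # one chunk per 30 ms of audio
            await ws.send(json.dumps({
                "device_code": "client-001",          # assumed field name
                "auth_token": "<token>",              # authentication information
                "video_channel": 0,
                "audio_channel": 1,
                "format": {"codec": "aac", "rate": 16000, "bits": 16},
                "end_of_speech": i == len(aac_chunks) - 1,  # voice end tag
                "payload": base64.b64encode(chunk).decode("ascii"),
            }))
        # The reply is the text control instruction message from the server.
        return json.loads(await ws.recv())

# Driven with asyncio.run(send_voice(chunks)) once the AAC chunks exist.
```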
S104: the server starts, acquires an AAC format voice file sent by the client based on a preset communication protocol and decodes the AAC format voice file into a PCM format file.
When the server first receives a voice control instruction message from the client over a preset communication protocol such as WebSocket, TCP or HTTP, it immediately starts the voice recognition server and converts the AAC format file carried by the message into a PCM format file.
The WAV audio format file is a typical PCM coding format file and is widely applied to intelligent voice fields such as voice recognition, voice synthesis, voiceprint recognition and the like. In this embodiment, the input of the speech recognition model is a WAV format single channel audio file, with a sampling rate of 16kHz and a bit depth of 16 bits.
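One way to realize this conversion step is to shell out to ffmpeg, as sketched below; the patent does not name a decoder, so the tool choice and file names are assumptions.

```python
import subprocess

def aac_to_wav(src="command.aac", dst="command.wav"):
    # Decode AAC to the 16 kHz, 16-bit, mono PCM WAV the recognizer expects.
    subprocess.run(
        ["ffmpeg", "-y", "-i", src,
         "-ar", "16000",          # sampling rate 16 kHz
         "-sample_fmt", "s16",    # bit depth 16 bits
         "-ac", "1",              # single channel
         dst],
        check=True,
    )
```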
S105: the server side infers PCM voice data through voice recognition and a semantic analysis model, and sends the text control instruction message to the client side.
After converting the AAC file carried by the voice control instruction message into a WAV format file, the server runs inference through the voice recognition model to obtain the text of the voice control instruction. FIG. 3 shows the network structure of the end-to-end (E2E) speech recognition model of this embodiment. The model has three parts: a shared encoder, a CTC (Connectionist Temporal Classification) decoder and an attention decoder. The shared encoder comprises multiple Conformer layers, an improvement on the Transformer: applying a convolutional neural network (Convolutional Neural Network, CNN) inside the Transformer encoder layer improves the model's handling of both long sequences and local features. Each Conformer encoding layer includes a multi-head attention module, a convolution module and a feed-forward module, each wrapped in a residual connection together with a normalization layer (LayerNorm) and Dropout. In the convolution module, causal convolution is used with zero padding on the left, so that the model does not depend on right-hand context and the sequence length stays constant after convolution. The CTC decoder comprises a fully connected layer and a softmax layer, while the attention decoder comprises multiple Transformer layers.
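The PyTorch skeleton below mirrors that three-part structure; for brevity it stands in a plain TransformerEncoder for the Conformer blocks, and the layer sizes, head counts and vocabulary size are illustrative assumptions rather than values from the patent.

```python
import torch
import torch.nn as nn

class E2EASR(nn.Module):
    def __init__(self, n_mels=80, d_model=256, vocab=4233):
        super().__init__()
        # 2-D convolutional front end: downsamples the Fbank feature sequence.
        self.subsample = nn.Sequential(
            nn.Conv2d(1, d_model, 3, stride=2), nn.ReLU(),
            nn.Conv2d(d_model, d_model, 3, stride=2), nn.ReLU(),
        )
        self.proj = nn.Linear(d_model * (((n_mels - 1) // 2 - 1) // 2), d_model)
        # Shared encoder: stand-in for the Conformer layers (multi-head
        # attention + convolution + feed-forward with residual connections).
        enc = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=12)
        # CTC decoder: a fully connected layer followed by (log-)softmax.
        self.ctc_head = nn.Linear(d_model, vocab)
        # Attention decoder: multiple Transformer layers.
        dec = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.att_decoder = nn.TransformerDecoder(dec, num_layers=6)
        self.embed = nn.Embedding(vocab, d_model)
        self.att_head = nn.Linear(d_model, vocab)

    def forward(self, fbank, tokens):
        # fbank: (B, T, n_mels); tokens: (B, U) padded label sequence.
        x = self.subsample(fbank.unsqueeze(1))            # (B, C, T', F')
        x = self.proj(x.permute(0, 2, 1, 3).flatten(2))   # (B, T', d_model)
        h = self.encoder(x)                               # shared encoder output
        ctc_logp = self.ctc_head(h).log_softmax(-1)       # for CTC / streaming
        d = self.att_decoder(self.embed(tokens), h)       # causal mask omitted
        return ctc_logp, self.att_head(d)                 # for joint re-scoring
```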
During model training, the output of the shared encoder is scored by the two decoders separately: the CTC decoder scores the output in streaming mode, and after streaming recognition ends the attention decoder and the CTC decoder jointly re-score the recognition result to refine it further. The outputs of both decoders can be displayed as recognition results. On large-screen equipment such as the security platform, for example, the CTC decoder's streaming result is shown as a real-time caption, so users such as dispatch commanders see the recognition result immediately; when a sentence ends, the joint re-scoring of the attention decoder and the CTC decoder corrects the result and the re-scored text is displayed, letting users watch the recognition result update dynamically, which improves the user experience.
FIG. 4 shows the training and inference data flow of the speech recognition model network of this embodiment. During training and inference, filter-bank (Fbank) features are first extracted from the input WAV format voice file through pre-emphasis, framing, windowing, short-time Fourier transform (STFT) and Mel filtering; the Fbank feature sequence is then downsampled by a 2-dimensional convolutional layer to reduce computation. The downsampled Fbank feature sequence is fed into the Conformer encoding layers of the shared encoder, and the output of the shared encoder is fed into the CTC decoder and the attention decoder simultaneously, the two decoders computing loss terms against the label sequence and the speech frame sequence respectively. In this embodiment, the training loss consists of two parts, the CTC loss of the frame-level CTC decoder and the autoregressive likelihood loss of the label-sequence attention decoder, and the joint loss function is expressed as:
$L_{joint}(x,y) = \alpha L_{CTC}(x,y) + (1-\alpha) L_{ARL}(x,y)$
where x is the acoustic feature of the input speech, y is the label sequence, α is the parameter balancing the two terms, the first term is the CTC loss, and the second term is the autoregressive likelihood (ARL) loss. The joint loss function simplifies the training process and improves training speed.
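In PyTorch terms the joint objective can be sketched as below; treating the ARL term as token-level cross-entropy over padded targets and the α = 0.3 default are assumptions, since the patent gives only the formula.

```python
import torch.nn.functional as F

def joint_loss(ctc_log_probs, att_logits, targets, in_lens, tgt_lens, alpha=0.3):
    # ctc_log_probs: (B, T, V) log-probabilities from the CTC head.
    # att_logits:    (B, U, V) attention-decoder outputs over padded targets.
    l_ctc = F.ctc_loss(ctc_log_probs.transpose(0, 1),  # CTC expects (T, B, V)
                       targets, in_lens, tgt_lens, blank=0)
    # Autoregressive likelihood (ARL) term as token cross-entropy.
    l_arl = F.cross_entropy(att_logits.flatten(0, 1), targets.flatten())
    return alpha * l_ctc + (1 - alpha) * l_arl  # L_joint = αL_CTC + (1-α)L_ARL
```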
Because the duration of a voice control instruction is not fixed, dynamic chunk training is used, dynamically adjusting the chunk size across batches. Streaming recognition uses a chunk size of 16; non-streaming recognition dynamically varies the chunk size between 1 and 25. One chunk corresponds to a 30 ms speech frame.
For large-screen equipment such as the security platform, the system setting parameters, the operation instruction set, and all sub-modules, sub-columns and function definitions contained in the device's platform system are collected and compiled into a hot word set. On top of the trained baseline speech recognition model, the hot words in this set are given higher weight so that, during inference, voice control instructions are preferentially recognized as hot-word content. Working from the data set itself, this fundamentally improves recognition accuracy for the specific scene and field and further improves the user experience.
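A toy rendering of that preference follows: hot-word hits add a bonus during n-best re-scoring. The boost value, the containment test and the scoring interface are all assumptions; a production system would usually bias the decoder's search directly rather than post-hoc.

```python
HOTWORDS = {"camera", "preview", "alarm", "patrol"}   # sampled from the hot word set

def pick_hypothesis(n_best, boost=3.0):
    # n_best: list of (text, log_score) pairs from the recognizer.
    def boosted_score(item):
        text, score = item
        hits = sum(1 for w in HOTWORDS if w in text)
        return score + boost * hits                    # prefer hot-word readings
    return max(n_best, key=boosted_score)[0]

print(pick_hypothesis([("open the camel", -1.0), ("open the camera", -1.4)]))
# -> "open the camera"
```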
In this embodiment, the server runs inference with the trained speech recognition model, as shown in FIG. 2. In an actual deployment the server often has to handle voice recognition requests from many clients at once, so it must be stress-tested before deployment to cope with highly concurrent recognition tasks.
After the text of the voice control instruction is obtained with the voice recognition model, semantic parsing is needed to turn the text content into a control instruction the client can understand. As described above, the hot word set normalizes the recognition output, which lays a good foundation for semantic parsing. The hot word set is classified as required, a template corpus is set up, and the user's intent is abstracted by defining key slots that extract the key information from the text.
For example, in the voice control instruction "turn on the camera of the second floor" there are two key slots, "second floor" and "camera"; these are called entities and can be replaced by braces in the corpus, i.e. "turn on {number} {camera}". "Turn on" is the action performed and is classified as the action slot. In addition, square brackets [ ] denote optional items, parentheses ( ) denote mandatory items, and | denotes alternatives. With these regular expressions the transition from intent to semantics can be completed, yielding control instructions the client can understand. For example, the user can say "help me (turn on | start | enable | switch to | cut to | tune to) the second-floor camera"; the intent is abstracted to the semantics "open second-floor camera", resolving the key slots and the action slot.
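The template idea can be rendered as a small regular-expression parser, as in the toy below; the slot vocabulary and the shape of the output message are illustrative assumptions rather than the patent's actual tables.

```python
import re

ACTIONS = r"(turn on|open|start|enable|switch to|cut to|tune to)"  # action slot
PATTERN = re.compile(
    ACTIONS + r".*?(?P<number>first|second|third)[- ]floor\s+(?P<device>camera|screen)"
)

def parse(text):
    m = PATTERN.search(text.lower())
    if not m:
        return None
    # Action slot plus key (entity) slots -> control instruction message.
    return {"action": "open",
            "location": m.group("number") + " floor",
            "device": m.group("device")}

print(parse("help me switch to the second-floor camera"))
# -> {'action': 'open', 'location': 'second floor', 'device': 'camera'}
```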
In this embodiment, after the semantic result is obtained through the semantic parsing model, it is encapsulated into a text control instruction message and sent to the client over a preset communication protocol such as WebSocket, TCP or HTTP.
S106: and the client executes corresponding actions after receiving the text control instruction message and broadcasts prompt voice.
After receiving the text control instruction message, the client first checks its device state and parameter settings and then executes the control instruction accordingly; if, for example, a setting is already enabled or a monitoring view is already open, it does not perform the instruction action and instead broadcasts a relevant prompt voice. After the action is executed, a prompt voice is synthesized through the voice synthesis model and broadcast. In this embodiment the voice synthesis model is deployed on the local client, which helps quick response.
S107: the user exits the voice control mode by using a user-defined closing instruction, and the closed voice control mode waits for the next awakening by the awakening word.
The user can customize the instruction for exiting the voice control mode, and the closed voice control mode waits to be woken again by the wake-up word. Preferably, the client waits 10 seconds after executing a control instruction and, if no new control instruction is issued, automatically exits the voice control mode.
Example 2
This embodiment suits cases where the client's communication link is poor or offline, or where the user has turned on the client's ultra-fast recognition mode, in which the local voice recognition and semantic parsing models are preferred for inference even when the network is available. The flow of this embodiment is shown in FIG. 7 and includes:
s701: the voice control mode of the client is awakened by the wake-up word voice.
In this embodiment, the user may use a wake-up word preset by the client or a custom wake-up word, including but not limited to Chinese wake-up words such as "Hello Xiaoyi" or "Xiaoyi Xiaoyi", or English wake-up words such as "Hi, Xiaoyi" and "Hey, Xiaoyi Xiaoyi". When using the client for the first time, the user needs to customize a Chinese phrase of three or more characters or an English phrase of three or more syllables and, guided by the client, repeat it three times so the samples can be stored for training the local voice wake-up model. In particular, customizing the wake-up word requires network communication; once the wake-up word has been customized successfully for the first time, subsequent wake-ups no longer depend on the network.
S702: the client terminal pre-processes the collected voice through voice endpoint detection (VAD), voice enhancement, echo cancellation and the like.
In this embodiment, the voice preprocessing module comprises three steps: VAD, voice enhancement and echo cancellation. VAD detects the short-time energy characteristics of speech to decide whether voice input is present, and so automatically determines the start and end points of the voice. Voice enhancement removes the low-rank background-noise component of the collected voice with the constrained low-rank sparse decomposition algorithm and retains the sparse speech component, achieving noise reduction. Echo cancellation uses the adaptive echo cancellation algorithm: an adaptive filter identifies the parameters of the unknown echo channel, a far-end signal model is built from the correlation between the loudspeaker signal and the multiple echoes it generates, the echo path is simulated, and the adaptive algorithm adjusts it so the impulse response approximates the real echo path. The estimated echo is then subtracted from the noise-reduced microphone signal, retaining the near-end speech collected by the microphone and cancelling the far-end echo.
S703: and reasoning the preprocessed voice through a local voice recognition and semantic analysis model to obtain a text control instruction.
In this embodiment, to suit the client's limited computing power and storage space, the voice recognition and semantic parsing models of S105 are int8-quantized, yielding quantized models whose parameter count and model size are greatly reduced. The pre-processed voice is run through the local quantized models to obtain the text control instruction.
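As one readily available route to that int8 step, PyTorch dynamic quantization is sketched below; the patent does not name a toolchain, so the API choice, the layer set and the file names are assumptions.

```python
import torch

# Load the trained float32 server model (path is an assumed placeholder).
model = torch.load("asr_server_model.pt")
# Convert the weights of Linear layers to int8; activations stay float and are
# quantized dynamically at run time, shrinking the model for the client side.
qmodel = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)
torch.save(qmodel, "asr_client_int8.pt")  # much smaller artifact for deployment
```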
S704: and the client executes corresponding actions according to the text control instruction and reports the prompt voice.
The client first checks its device state and parameter settings and then executes the control instruction accordingly; if, for example, a setting is already enabled or a monitoring view is already open, it does not perform the instruction action and instead broadcasts a relevant prompt voice. After the action is executed, a prompt voice is synthesized through the voice synthesis model and broadcast. In this embodiment the voice synthesis model is deployed on the local client, which facilitates offline use and quick response.
S705: the user exits the voice control mode by using a user-defined closing instruction, and the closed voice control mode waits for the next awakening by the awakening word.
The user can customize the instruction for exiting the voice control mode, and the closed voice control mode waits to be woken again by the wake-up word. Preferably, the client is also set to wait 10 seconds after executing a control instruction and, if no new control instruction is issued, to exit the voice control mode automatically, avoiding misrecognized operations caused by forgetting to exit the voice control mode.
Example 3
This embodiment provides an online voice recognition device for security platform control, applied to a voice recognition client and a voice recognition server, as shown in FIG. 5 and FIG. 6 respectively.
The client device includes: the voice pickup module 501, the voice preprocessing module 502, the voice encoding module 503, the text control instruction message receiving module 504 and the voice synthesizing module 505. The functions of each module are described in detail as follows:
the pickup module 501 is configured to collect voice control instruction data;
the voice preprocessing module 502 is used for detecting a voice starting point and an end point, and reducing noise and eliminating echo of the voice;
a speech encoding module 503, configured to encode the preprocessed speech into a format capable of efficient transmission;
a text control instruction message receiving module 504, configured to obtain a text control instruction message sent from a server;
the voice synthesis module 505 is configured to synthesize and broadcast a corresponding prompt voice after obtaining the control instruction message.
The server device comprises: a speech decoding module 601, a speech recognition and semantic parsing module 602 and a text control instruction message sending module 603. The functions of each module are described in detail as follows:
a speech decoding module 601, configured to decode the efficiently transmitted voice format file into a PCM format file with a sampling rate of 16 kHz and a bit depth of 16 bits;
a speech recognition and semantic parsing module 602, configured to run inference on the PCM format voice data to obtain a text control instruction message;
The text control instruction message sending module 603 is configured to send a text control instruction message.
Example 4
This embodiment provides an offline voice recognition device for security platform control, applied to a voice recognition client, as shown in FIG. 8. The client device includes: a pickup module 801, a voice preprocessing module 802, a voice recognition and semantic parsing module 803, and a voice synthesis module 804. The functions of each module are described in detail as follows:
the pickup module 801 is configured to collect voice command data;
a voice preprocessing module 802, configured to detect a voice start point and an end point, and reduce noise and echo of the voice;
the voice recognition and semantic parsing module 803 is configured to run inference on the pre-processed voice data to obtain a text control instruction message;
the voice synthesis module 804 is configured to synthesize and broadcast a corresponding prompt voice after obtaining the control instruction message.
Example 5
The present embodiment provides a computer device, as shown in FIG. 9, comprising a processor 901, a system bus 902, a memory 903, an I/O interface 904 and a nonvolatile storage medium 905, the nonvolatile storage medium 905 containing an operating system, a computer program, a data set and a neural network model. The processor 901 and the memory 903 are connected through the system bus 902; the memory 903 is used to store a computer program from the nonvolatile storage medium 905 which, when executed by the processor 901, causes the processor 901 to perform any implementation of the security platform voice recognition control method according to any embodiment of the present invention.
It should be understood that the above embodiments are only for illustrating the technical solution of the present invention, and are not limiting thereof. It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional modules is illustrated, and in practical application, the above-described functional allocation may be performed by different functional modules according to needs, i.e. the internal structure of the apparatus is divided into different functional modules for performing all or part of the above-described functions.

Claims (18)

1. A security platform control method based on voice recognition, applied to a client and a server, characterized by comprising the following steps:
waking up the voice control mode of the client with a wake-up word;
the client pre-processing the collected voice through voice endpoint detection, voice enhancement, echo cancellation and the like;
encoding the pre-processed voice into an AAC format file and sending it to the server based on a preset communication protocol;
the server starting, receiving the AAC format voice file sent by the client over the preset communication protocol, and decoding it into a pulse code modulation (PCM) format file;
the server running inference on the PCM voice data through voice recognition and semantic parsing models, obtaining a text control instruction message and sending it to the client;
the client executing the corresponding action after receiving the text control instruction message and broadcasting a prompt voice;
the user exiting the voice control mode with a custom close instruction, the closed voice control mode waiting to be woken again by the wake-up word.
2. The method of claim 1, wherein, before waking up the voice control mode of the client with the wake-up word, the method further comprises:
collecting voice data samples of a preset or user-selected wake-up word;
feeding the sample data into an established voiceprint recognition neural network for training;
obtaining, after training, a 1:1 speaker verification neural network model based on voiceprint recognition;
building the voice wake-up function on the 1:1 speaker verification neural network model and using it to wake up the voice control mode of the client.
3. The method of claim 2, wherein the collecting of voice data samples of a preset or user-selected wake-up word comprises:
the user speaks a preset or user-defined wake-up word and repeats it three times;
the microphone collects the voice data samples of the wake-up word, and after preprocessing such as voice enhancement and echo cancellation, the samples are stored in a local database.
4. The method of claim 1, wherein the client pre-processing the collected voice through voice endpoint detection, voice enhancement, echo cancellation and the like comprises:
first, automatically determining the start point and end point of the voice with a VAD algorithm, ensuring the quality of the collected voice and the recognition efficiency;
then, performing real-time noise reduction on the collected voice with a constrained low-rank sparse decomposition method;
finally, retaining the noise-reduced microphone near-end speech and cancelling the far-end echo with an adaptive echo cancellation algorithm.
5. The method of claim 1, wherein encoding the pre-processed voice into an AAC format file and sending it to the server based on a preset communication protocol comprises:
encoding the pre-processed voice into an AAC format file to speed up end-cloud transmission;
sending the AAC voice file to the server based on the preset communication protocol.
6. The method of claim 1, wherein the server starting, receiving the AAC format voice file sent by the client over the preset communication protocol, and decoding it into a pulse code modulation (PCM) format file comprises:
starting the server to receive the AAC format voice file sent by the client;
decoding the AAC format voice file, based on a format conversion protocol, into a PCM format file with a sampling rate of 16 kHz and a bit depth of 16 bits.
7. The method of claim 1, wherein the server performing inference on the PCM voice data through the voice recognition and semantic analysis models, obtaining the text control instruction message, and sending it to the client comprises:
the server decoding the voice through a trained voice recognition model to obtain text data;
and parsing the text data into a control instruction message through preset instruction rules and a semantic analysis model, and sending the control instruction message to the client (a rule-mapping sketch follows this claim).
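A minimal sketch of the rule-based portion of this parsing step; the phrases, actions, and JSON message shape below are hypothetical examples of a mapping table, not the patent's actual instruction set:

```python
# Minimal rule-based text -> control instruction message sketch.
import json
from typing import Optional

INSTRUCTION_RULES = {          # hypothetical domain-specific mapping table
    "打开监控": {"action": "camera_on"},
    "关闭监控": {"action": "camera_off"},
    "布防": {"action": "arm"},
    "撤防": {"action": "disarm"},
}

def parse_instruction(text: str) -> Optional[str]:
    """Match recognized text against preset rules; return a JSON message."""
    for phrase, message in INSTRUCTION_RULES.items():
        if phrase in text:
            return json.dumps(message, ensure_ascii=False)
    return None  # no rule matched; not a valid control instruction
```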
8. The method of claim 7, wherein before the server decodes the voice through the trained voice recognition model to obtain the text data, the method further comprises:
establishing a voice control instruction data set for a specific domain such as the security platform;
establishing a voice control instruction recognition neural network;
training the voice recognition neural network model on the domain-specific voice control instruction data set;
and constructing a domain-specific mapping table between voice control instructions and control instruction messages.
9. The method of claim 8, wherein the establishing a voice control instruction recognition neural network comprises:
constructing a neural network model composed of a shared encoder, a CTC decoder, and an attention decoder;
and training with a loss function in which the CTC decoder and the attention decoder each score the output of the shared encoder (a model-skeleton sketch follows this claim).
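A minimal PyTorch sketch of this shared-encoder/dual-decoder structure, with a plain Transformer encoder standing in for the Conformer blocks (an assumed simplification) and all dimensions chosen for illustration only:

```python
# Minimal hybrid CTC/attention model skeleton: one shared encoder, two heads.
import torch
import torch.nn as nn

class HybridASR(nn.Module):
    def __init__(self, n_mels: int = 80, d_model: int = 256, vocab: int = 4000):
        super().__init__()
        # 2-D convolutional downsampling of the Fbank feature sequence
        self.subsample = nn.Conv2d(1, d_model, kernel_size=3, stride=2)
        self.proj = nn.Linear(d_model * ((n_mels - 1) // 2), d_model)
        # Shared encoder (Transformer stand-in for the Conformer layers)
        enc_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc_layer, num_layers=6)
        # CTC decoder: a linear projection to the vocabulary
        self.ctc_head = nn.Linear(d_model, vocab)
        # Attention decoder: autoregressive Transformer decoder
        dec_layer = nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True)
        self.att_decoder = nn.TransformerDecoder(dec_layer, num_layers=3)
        self.att_head = nn.Linear(d_model, vocab)
        self.embed = nn.Embedding(vocab, d_model)

    def forward(self, feats: torch.Tensor, labels: torch.Tensor):
        # feats: (B, T, n_mels) Fbank features; labels: (B, U) token ids
        x = self.subsample(feats.unsqueeze(1))   # (B, d_model, T', F')
        x = x.permute(0, 2, 1, 3).flatten(2)     # (B, T', d_model * F')
        enc = self.encoder(self.proj(x))         # shared encoder output
        ctc_logits = self.ctc_head(enc)          # scored by the CTC decoder
        att = self.att_decoder(self.embed(labels), enc)  # attention decoder
        return ctc_logits, self.att_head(att)
```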
10. The method of claim 8, wherein training the voice recognition neural network model on the domain-specific voice control instruction data set comprises:
extracting Fbank features from the input WAV format voice data set through pre-emphasis, framing, windowing, short-time Fourier transform, and Mel filtering;
downsampling the Fbank feature sequence through a 2-dimensional convolutional layer;
inputting the downsampled Fbank feature sequence into the Conformer encoding layers of the shared encoder;
inputting the output of the shared encoder into the CTC decoder and the attention decoder simultaneously, the two decoders computing loss functions against the speech label sequence and the speech frame sequence, respectively;
scoring the output in streaming recognition mode through the CTC decoder and, after streaming recognition finishes, rescoring the recognition result jointly with the attention decoder and the CTC decoder to further optimize it;
and deploying the iteratively trained neural network model to the server, and deploying it to the client after int8 quantization (a feature-extraction sketch follows this claim).
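A minimal sketch of the Fbank extraction step, assuming torchaudio's Kaldi-compatible frontend, which internally performs the pre-emphasis, framing, windowing, STFT, and Mel filtering named in the claim; the 80-bin dimension is an illustrative assumption:

```python
# Minimal Fbank extraction sketch using torchaudio's Kaldi-style frontend.
import torchaudio

def extract_fbank(wav_path: str):
    waveform, sr = torchaudio.load(wav_path)   # expects a 16 kHz WAV input
    return torchaudio.compliance.kaldi.fbank(
        waveform,
        num_mel_bins=80,            # 80-dimensional Fbank features (assumed)
        frame_length=25.0,          # 25 ms analysis window
        frame_shift=10.0,           # 10 ms hop
        preemphasis_coefficient=0.97,
        sample_frequency=float(sr),
    )                               # -> (num_frames, 80) tensor
```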
11. The method of claim 10, wherein the output of the shared encoder is input into the CTC decoder and the attention decoder simultaneously, the two decoders computing loss functions against the speech label sequence and the speech frame sequence, respectively; the loss function consists of two parts, the CTC loss of the CTC decoder over the speech frames and the autoregressive likelihood loss of the attention decoder over the speech label sequence, and the joint loss function is expressed as:
$$L_{\mathrm{joint}}(x, y) = \alpha\, L_{\mathrm{CTC}}(x, y) + (1 - \alpha)\, L_{\mathrm{ARL}}(x, y)$$
wherein the first term is the CTC loss and the second term is the autoregressive likelihood (ARL) loss, $x$ is the acoustic feature sequence of the input voice, $y$ is the speech label sequence, and $\alpha$ is a parameter balancing the CTC loss against the autoregressive likelihood loss (a loss-computation sketch follows this claim).
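A minimal sketch of computing this joint loss in PyTorch, with cross-entropy standing in for the autoregressive likelihood term (an assumed simplification that ignores label padding):

```python
# Minimal joint CTC + ARL loss sketch matching the formula above.
import torch
import torch.nn.functional as F

def joint_loss(ctc_logits, att_logits, labels,
               input_lengths, label_lengths, alpha: float = 0.3):
    # CTC loss over the speech frame sequence; CTCLoss wants (T, B, V)
    log_probs = ctc_logits.log_softmax(-1).transpose(0, 1)
    l_ctc = F.ctc_loss(log_probs, labels, input_lengths, label_lengths, blank=0)
    # Autoregressive likelihood loss over the speech label sequence
    l_arl = F.cross_entropy(att_logits.transpose(1, 2), labels)  # (B,V,U) vs (B,U)
    return alpha * l_ctc + (1 - alpha) * l_arl
```

In practice $\alpha$ would be tuned on held-out data; the claim treats it simply as a free balancing parameter.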
12. The method of claim 1, wherein the client receiving the text control instruction message, executing the corresponding action, and broadcasting the prompt voice further comprises:
after receiving the text control instruction message, the client first checking the device state and then executing the corresponding action according to that state;
and after the action is executed, synthesizing a prompt voice through a voice synthesis model and broadcasting it.
13. The method of claim 1, wherein the user exiting the voice control mode through a user-defined closing instruction, the closed voice control mode waiting for the next wake-up by the wake-up word, comprises:
the user defining an instruction for exiting the voice control mode, the closed voice control mode waiting to be awakened again by the wake-up word;
and/or the client waiting 10 seconds after executing a control instruction and, if no new control instruction is received, automatically exiting the voice control mode;
wherein the wake-up word and the closing instruction both support personalized modification by the user (a timeout-loop sketch follows this claim).
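A minimal sketch of the 10-second auto-exit behaviour; wait_for_instruction and execute are hypothetical callables supplied by the surrounding client code, and the Chinese close phrase is an illustrative example of a user-defined closing instruction:

```python
# Minimal voice-control session loop with 10 s idle timeout and a
# user-defined closing instruction, per claim 13.
import time

VOICE_MODE_TIMEOUT = 10.0  # seconds, per the claim

def voice_control_loop(wait_for_instruction, execute,
                       close_words=("退出语音控制",)):
    deadline = time.monotonic() + VOICE_MODE_TIMEOUT
    while time.monotonic() < deadline:
        instruction = wait_for_instruction(timeout=deadline - time.monotonic())
        if instruction is None:
            continue                    # nothing heard yet, keep waiting
        if instruction in close_words:  # user-defined closing instruction
            break
        execute(instruction)
        deadline = time.monotonic() + VOICE_MODE_TIMEOUT  # reset after each command
    # fall through: voice control mode closed, awaiting the next wake-up word
```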
14. A security platform control method based on voice recognition, applied in a client offline mode or an extreme-speed recognition mode, comprising:
waking up a voice control mode of the client by a wake-up word voice;
the client preprocessing the collected voice through voice endpoint detection, voice enhancement, echo cancellation, and the like;
performing inference on the preprocessed voice through local voice recognition and semantic analysis models to obtain a text control instruction;
the client executing the corresponding action according to the text control instruction and broadcasting a prompt voice;
and the user exiting the voice control mode through a user-defined closing instruction, the closed voice control mode waiting for the next wake-up by the wake-up word;
wherein the user wakes up the voice control mode of the client with a preset wake-up word;
and network communication is required only when presetting the wake-up word: after the wake-up word has been preset successfully for the first time, subsequent wake-ups do not depend on the network.
15. A security platform control device based on voice recognition, applied to a client, comprising:
a sound pickup module for collecting voice instruction data;
a voice preprocessing module for detecting the voice start point and end point, and for noise reduction and echo cancellation;
a voice encoding module for encoding the preprocessed voice into a format suitable for efficient transmission;
a text control instruction message receiving module for acquiring the text control instruction message sent from the server;
and a voice synthesis module for synthesizing and broadcasting the corresponding prompt voice after the control instruction message is acquired.
16. A security platform control device based on voice recognition, applied to a server, comprising:
a voice decoding module for decoding the efficiently transmitted voice format file into a PCM format file with a sampling rate of 16 kHz and a bit depth of 16 bits;
a voice recognition and semantic analysis module for performing inference on the PCM format voice data to obtain a text control instruction message;
and a text control instruction message sending module for sending the text control instruction message.
17. A computer readable storage medium for a security platform based on voice recognition, wherein the computer readable storage medium stores a computer program and deep neural network model parameters; the computer program, when executed by a processor, implements the voice-recognition-based security platform control method according to any one of claims 1 to 14, and the deep neural network model parameters, when loaded by the processor, enable real-time voice recognition inference.
18. A computer device for a security platform based on voice recognition, comprising: a processor, a memory, a system bus, an I/O interface, and a non-volatile storage medium, wherein the processor and the memory are connected through the system bus, and the memory stores a computer program which, when executed by the processor, causes the processor to perform any implementation of the voice-recognition-based security platform control method according to any one of claims 1 to 14.
CN202310156081.3A 2023-02-23 2023-02-23 Security platform control method and device based on voice recognition, storage medium and equipment Pending CN116168699A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310156081.3A CN116168699A (en) 2023-02-23 2023-02-23 Security platform control method and device based on voice recognition, storage medium and equipment

Publications (1)

Publication Number Publication Date
CN116168699A true CN116168699A (en) 2023-05-26

Family

ID=86411081

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination