CN117238294A - Automatic local fire control voice recognition method and device based on artificial intelligence - Google Patents

Automatic local fire control voice recognition method and device based on artificial intelligence Download PDF

Info

Publication number
CN117238294A
Authority
CN
China
Prior art keywords
layer
voice
information
text
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311501842.0A
Other languages
Chinese (zh)
Inventor
钟波
蓝聪
曹冰兵
郑建波
李成富
周育玺
薛俊
张良
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Dacheng Juntu Technology Co ltd
Original Assignee
Chengdu Dacheng Juntu Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Dacheng Juntu Technology Co ltd filed Critical Chengdu Dacheng Juntu Technology Co ltd
Priority to CN202311501842.0A priority Critical patent/CN117238294A/en
Publication of CN117238294A publication Critical patent/CN117238294A/en
Pending legal-status Critical Current

Links

Landscapes

  • Telephonic Communication Services (AREA)

Abstract

The application relates to the technical field of artificial intelligence, in particular to an automatic local fire control voice recognition method and device based on artificial intelligence. Text information is obtained by performing end-to-end voice conversion on the real-time alarm voice information stream, and alarm condition elements in the text information are extracted and combined to obtain the final alarm information.

Description

Automatic local fire control voice recognition method and device based on artificial intelligence
Technical Field
The application relates to the technical field of artificial intelligence, in particular to an automatic local fire control voice recognition method and device based on artificial intelligence.
Background
In order to meet the alarm receiving and processing requirements of the current informatization and intelligence era, intelligent fire alarm receiving and dispatching systems apply natural language processing technology to extract key information from alarm receiving and dispatching data, including elements such as the alarm source, alarm address, alarm type, alarm category, alarm description, burning object, building structure, number of floors, burning floor, number of trapped persons, alarm location, and caller telephone number. Combined with information retrieval, big data, and other technologies, this enables firefighters to quickly locate useful cases and improves the rapid response and operational capability of fire departments. Improving basic emergency management capability and level to enhance urban resilience is a major demand of urban safety in the new era.
Existing automated alarm receiving systems have the following deficiencies:
(1) Alarm information entry in existing alarm receiving and dispatching systems is manual: dispatch resource allocation is unreasonable, rescue resources are under-utilized, and force dispatch efficiency is low. At present, alarm information can only be entered manually and response forces selected manually; under high-intensity call-taking pressure, a fire alarm operator can easily overlook some alarm elements, leading to dispatch errors;
(2) Alarm location information is often missing. Constrained by the technical conditions and supporting environment of the time, positioning means are limited; some location information remains the system default value or deviates significantly from the actual position. The validity and accuracy of location information cannot be guaranteed, which impairs handling efficiency;
(3) Under high-intensity operation, alarm operators may fail to ask for key information, omit information that could assist decision-making, or dispatch only a single unit, resulting in delayed rescue and wasted rescue resources.
Disclosure of Invention
In order to solve the above problems, the application provides an automatic local fire control voice recognition method and device based on artificial intelligence, which applies natural language processing, machine learning, and related techniques to a fire alarm receiving and dispatching system to extract alarm information, so as to improve the working efficiency of fire alarm operators and reduce their working pressure.
In order to achieve the above purpose, the technical scheme adopted by the embodiment of the application is as follows:
in a first aspect, an automated local fire control voice recognition method based on artificial intelligence is provided, and is applied to a server, and the method comprises the following steps: inputting alarm voice information acquired in real time into a voice recognition model, performing voice feature processing on the alarm voice information by the voice recognition model to obtain a voice sequence taking a frame as a unit, and performing encoding and decoding processing on the voice sequence to obtain a text sequence in a mapping relation with the voice sequence, wherein the text sequence is text information to be processed; inputting the text information into an alarm condition recognition model, extracting alarm condition elements in the text information to obtain a plurality of entity information of the alarm condition, and combining the entity information to obtain target alarm information.
Further, the voice recognition model comprises a pre-processing module, a codec, and a mixed attention module; the pre-processing module comprises an acoustic pre-processing sub-module and a text pre-processing sub-module, wherein the acoustic pre-processing sub-module is used for performing voice feature processing on the alarm voice information to obtain a voice sequence in units of frames, the voice sequence consisting of a plurality of voice features, and the text pre-processing sub-module is used for performing text conversion on the voice sequence to obtain an initial text sequence on the scale of the voice sequence; the codec comprises an encoder and a decoder, wherein the encoder is used for mapping the plurality of voice features in the voice sequence, and the decoder is used for decoding the initial text sequence in combination with the attention-adjusted hidden states to generate target text features, the target text sequence being generated on the scale of the sequence.
Further, the acoustic pre-processing sub-module comprises a plurality of two-dimensional convolution modules and a position encoder module, wherein each convolution module comprises a two-dimensional convolution layer and a ReLU activation layer; the two-dimensional convolution modules are used for extracting acoustic features, and the position encoder is used for acquiring absolute position information of the acoustic features.
Further, the text pre-processing sub-module comprises a plurality of time convolution networks, a position encoder module, and an embedding layer, wherein each time convolution network comprises a one-dimensional convolution layer and a ReLU activation layer.
Further, the encoder and the decoder are each composed of a stack of identical modules, each module comprising two sub-layer structures, namely a multi-head attention layer and a feed-forward network layer, each of which is followed by a residual network connection and layer normalization.
Further, the multi-head attention layer in the decoder includes a first multi-head attention layer constructed based on a multi-head attention mechanism and a second multi-head attention layer constructed based on a cross-attention mechanism.
Further, the alarm condition recognition model comprises a feature extraction layer, a semantic coding layer, and a decoding layer, wherein the feature extraction layer processes the text sequence to obtain corresponding word representation vectors fused with general-domain semantic information, the semantic coding layer encodes the word representation vectors to obtain sentence semantic codes, and the decoding layer is used for decoding the sentence semantic codes and obtaining the optimal tag sequence according to the validity relationships between labels.
Further, the feature extraction layer is an ALBERT model.
Further, the semantic coding layer comprises a plurality of interconnected Bi-LSTM modules, each comprising a forward LSTM layer and a backward LSTM layer whose hidden layers are connected to the same output layer; spliced word vectors are obtained through the forward and backward LSTM layers. The decoding layer is a CRF layer used to obtain the optimal tag sequence for the word vectors.
In a second aspect, an artificial intelligence based automated local fire voice recognition device is provided, the device comprising: the text information acquisition module is used for inputting the alarm voice information acquired in real time into the voice recognition model to perform text conversion to obtain text information; the alarm information acquisition module is used for inputting the text information into the alarm condition recognition model to extract alarm condition elements in the text information, obtaining a plurality of entity information of the alarm condition, and combining the entity information to obtain target alarm information.
In a third aspect, a computer readable storage medium is provided, the computer readable storage medium storing a computer program which, when executed by a processor, implements the method of any one of the above.
In the technical scheme provided by the embodiment of the application, text information is obtained by performing end-to-end voice conversion on the real-time alarm voice information flow, and the alarm condition elements in the text information are extracted and combined to obtain final alarm information.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
The methods, systems, and/or programs in the accompanying drawings will be further described in terms of exemplary embodiments. These exemplary embodiments will be described in detail with reference to the drawings. They are non-limiting exemplary embodiments, wherein like reference numerals represent like structures throughout the several views of the drawings.
FIG. 1 is a schematic flow chart of an artificial intelligence based automatic local fire control voice recognition method according to an embodiment of the application.
FIG. 2 is a block diagram of an artificial intelligence based automated local fire voice recognition device according to an embodiment of the present application.
Fig. 3 is a schematic structural diagram of an artificial intelligence-based automatic local fire control voice recognition device according to an embodiment of the present application.
Detailed Description
In order to better understand the above technical solutions, the following detailed description of the technical solutions of the present application is made by using the accompanying drawings and specific embodiments, and it should be understood that the specific features of the embodiments and the embodiments of the present application are detailed descriptions of the technical solutions of the present application, and not limiting the technical solutions of the present application, and the technical features of the embodiments and the embodiments of the present application may be combined with each other without conflict.
In the following detailed description, numerous specific details are set forth by way of examples in order to provide a thorough understanding of the relevant teachings. It will be apparent, however, to one skilled in the art that the application can be practiced without these details. In other instances, well known methods, procedures, systems, components, and/or circuits have been described at a relatively high-level, without detail, in order to avoid unnecessarily obscuring aspects of the present application.
The present application uses flowcharts to illustrate the execution of a system according to an embodiment of the present application. It should be clearly understood that the operations of the flowcharts need not be performed in the stated order; rather, they may be performed in reverse order or concurrently. Additionally, at least one other operation may be added to, and one or more operations may be deleted from, a flowchart.
Before describing embodiments of the present application in further detail, the terms involved in the embodiments of the present application are explained; the following explanations apply to these terms as used herein.
(1) "In response to": used to indicate the condition or state on which a performed operation depends. When the condition or state relied upon is satisfied, the operation or operations performed may be executed in real time or with a set delay; unless otherwise specified, there is no restriction on the order in which multiple such operations are executed.
(2) "Based on": used to indicate the condition or state on which a performed operation depends. When the condition or state relied upon is satisfied, the operation or operations performed may be executed in real time or with a set delay; unless otherwise specified, there is no restriction on the order in which multiple such operations are executed.
With the continuous acceleration of urban industrialization and modernization, eliminating potential safety hazards in old urban communities is difficult, rural firefighting infrastructure is weak, large numbers of people are flooding into cities, and motor vehicles and new-energy vehicles are developing rapidly; as a result, the building fire situation in both urban and rural areas is increasingly severe. In order to adapt to the requirements of "all-hazard, large-scale emergency response" and to meet the alarm receiving and processing demands of the current informatization and intelligence era, intelligent fire alarm receiving systems adopt natural language processing technology to obtain key information from alarm receiving and dispatching data. In the prior art, applications of natural language processing technology mainly include the following:
Monitoring event alarm: a hybrid model combining long short-term memory (LSTM) and convolutional neural network (CNN) identifies grid monitoring alarm events and is applied in a management system; a Word2vec model provides semantic representation of monitoring alarm text, replacing semantic expression based on character retrieval matching or word-frequency statistics. The method identifies monitoring alarm events of text-based network systems, analyzes large amounts of historical early-warning information, and summarizes its differences from ordinary Chinese text. The hybrid deep learning model combines the strength of LSTM at processing time-series problems with the strength of CNN at mining local features of short text.
Accident classification alarm: text preprocessing and natural language processing (NLP) are applied to classifying aviation accident factors. Semi-supervised label spreading (LS) and supervised support vector machine (SVM) techniques are considered for modeling aviation incident reports. Random search and Bayesian optimization are applied to hyperparameter analysis to improve model performance, measured by the Micro-F1 score. Human-factor categories are identified and classified from aviation accident reports; using a TF-IDF + LS model, the Micro-F1 scores of the best prediction model are 0.900, 0.779, and 0.875 for the respective levels of the classification framework, showing that human-factor classification based on text data can achieve good predictive performance.
Chat robot: a model combining a knowledge graph with text similarity; an online question-answering (QA) Healthcare Helper system is built on a chatbot framework to answer complex questions. Data collected from the Internet is used to build a domain-specific knowledge graph, and a novel deep learning model for text representation and similarity is implemented.
Compared with the above applications, natural language processing technology has rarely been applied to the firefighting field in the prior art. Against this background, an embodiment of the application provides an artificial intelligence-based automatic local fire control voice recognition method, which specifically comprises the following steps:
s110, performing voice-text conversion on the alarm information acquired in real time to obtain text information ordered by the time sequence.
S120, extracting alarm condition elements in the obtained text information through an alarm condition recognition model, obtaining a plurality of entity information of the alarm condition based on the alarm condition elements, and combining the entity information to obtain target alarm information.
The embodiment of the application thus realizes the voice → text → information extraction → information combination processing chain through the above two steps. Because the chain is end-to-end, voice information can be processed and recognized in real time, enabling automatic recognition and extraction of voice information.
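For intuition, a minimal sketch of this two-stage pipeline follows. The class and method names here are illustrative assumptions, not the patent's API; both models are assumed to be already trained and to expose `transcribe` and `extract` calls corresponding to steps S110 and S120.

```python
from dataclasses import dataclass

@dataclass
class AlarmRecord:
    elements: dict  # e.g. {"alarm_type": "...", "address": "...", "trapped": "..."}

def process_alarm_call(audio_frames, asr_model, ner_model) -> AlarmRecord:
    # Step S110: end-to-end speech-to-text on the real-time audio stream.
    text = asr_model.transcribe(audio_frames)
    # Step S120: extract alarm condition elements (entities) from the transcript.
    entities = ner_model.extract(text)  # assumed to return (label, span) pairs
    # Combine the entity information into the final target alarm information.
    merged = {}
    for label, span in entities:
        merged.setdefault(label, []).append(span)
    return AlarmRecord(elements={k: " ".join(v) for k, v in merged.items()})
```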
In the embodiment of the present application, voice recognition is performed in step S110 by a voice recognition model. The voice recognition model performs voice feature processing on the alarm voice information acquired in real time to obtain a voice sequence in units of frames, and then encodes and decodes the voice sequence to obtain a text sequence in a mapping relationship with the voice sequence; in the embodiment of the application, this text sequence is the text information to be processed.
The voice recognition model comprises a pre-processing module, a codec, and a mixed attention module. The pre-processing module comprises an acoustic pre-processing sub-module and a text pre-processing sub-module: the acoustic pre-processing sub-module performs voice feature processing on the alarm voice information to obtain a voice sequence in units of frames, the voice sequence consisting of a plurality of voice features, and the text pre-processing sub-module performs text conversion on the voice sequence to obtain an initial text sequence on the scale of the voice sequence. The codec comprises an encoder and a decoder: the encoder maps the plurality of voice features in the voice sequence, and the decoder decodes the initial text sequence in combination with the attention-adjusted hidden states to generate target text features, the target text sequence being generated on the scale of the sequence.
In an embodiment of the application, the speech recognition model treats the alarm speech as a sequence-to-sequence task: the encoder maps the input frame-level acoustic features $x = (x_1, \ldots, x_T)$ to a high-level sequence representation $h = (h_1, h_2, \ldots, h_N)$; the decoder combines the already generated text $(y_1, y_2, \ldots, y_{l-1})$ with the attention-adjusted hidden states $h$ to decode and generate $y_l$; finally, the target transcription sequence $y$ is generated. Here $x_1$ is the first acoustic feature, $x_T$ the $T$-th acoustic feature, $h_2$ the second hidden state, $h_N$ the $N$-th hidden state, and $y_1$, $y_2$, $y_{l-1}$ the first, second, and $(l-1)$-th text features, respectively.
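A minimal sketch of this autoregressive formulation follows. The `encoder` and `decoder` callables and their interfaces are assumptions standing in for the modules described below; greedy decoding is shown for simplicity, although beam search is common in practice.

```python
import torch

def greedy_transcribe(encoder, decoder, acoustic_feats, sos_id, eos_id, max_len=200):
    """Sequence-to-sequence ASR decoding sketch.

    acoustic_feats: (T, feat_dim) frame-level features x_1..x_T.
    encoder(...) returns hidden states h_1..h_N; decoder(...) returns
    next-token logits given h and the tokens generated so far.
    """
    h = encoder(acoustic_feats.unsqueeze(0))          # (1, N, d_model)
    tokens = [sos_id]
    for _ in range(max_len):
        y_prev = torch.tensor([tokens])               # (1, l) generated text so far
        logits = decoder(y_prev, h)                   # (1, l, vocab)
        next_id = int(logits[0, -1].argmax())         # greedy choice of y_l
        if next_id == eos_id:
            break
        tokens.append(next_id)
    return tokens[1:]                                 # target transcription y
```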
In the embodiment of the application, the acoustic front-end submodule comprises a plurality of two-dimensional convolution modules and a position encoder module, wherein the convolution modules comprise a two-dimensional convolution layer and a ReLU activation layer; the two-dimensional convolution module is used for extracting acoustic features, and the position encoder is used for acquiring absolute position information of the acoustic features.
In the embodiment of the application, the two-dimensional convolution layer in the acoustic pre-processing sub-module is a CNN convolution layer with 256 filters, each with a 3x3 kernel and a stride of 1; downsampling is performed to reduce redundant information in the voice features. For the time convolution network in the text pre-processing sub-module, the number of input filters is 256, the convolution kernel size is 3, the stride is 1, the padding is 2, and the dilation factor is 1.
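A minimal PyTorch sketch of these front-ends, using the hyperparameters stated above. The exact module wiring, the single-channel spectrogram input, the 2-D padding, and the sinusoidal form of the position encoder are assumptions; the patent does not specify them.

```python
import math
import torch
import torch.nn as nn

class SinusoidalPositionEncoder(nn.Module):
    """Adds absolute position information, as the position encoder modules do."""
    def __init__(self, d_model: int, max_len: int = 5000):
        super().__init__()
        pos = torch.arange(max_len).unsqueeze(1)
        div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)
        pe[:, 1::2] = torch.cos(pos * div)
        self.register_buffer("pe", pe)

    def forward(self, x):                      # x: (batch, seq, d_model)
        return x + self.pe[: x.size(1)]

# Acoustic front-end: 2-D convolution (256 filters, 3x3 kernel, stride 1) + ReLU.
acoustic_conv = nn.Sequential(
    nn.Conv2d(in_channels=1, out_channels=256, kernel_size=3, stride=1, padding=1),
    nn.ReLU(),
)

# Text front-end: temporal convolution (256 filters, kernel 3, stride 1,
# padding 2, dilation 1) + ReLU, applied to embedded token sequences.
text_tcn = nn.Sequential(
    nn.Conv1d(in_channels=256, out_channels=256, kernel_size=3,
              stride=1, padding=2, dilation=1),
    nn.ReLU(),
)
```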
The text pre-processing sub-module comprises a plurality of time convolution networks, a position encoder module, and an embedding layer, where each time convolution network comprises a one-dimensional convolution layer and a ReLU activation layer. The encoder and the decoder are each composed of a stack of identical modules; each module comprises two sub-layer structures, namely a multi-head attention layer and a feed-forward network layer, each of which is followed by a residual network connection and layer normalization.
In particular, the multi-head attention layers used in the decoder include a first multi-head attention layer constructed on a multi-head self-attention mechanism and a second multi-head attention layer constructed on a cross-attention mechanism.
The specific structure of the encoder is, connected in sequence from the input end: a multi-head attention layer, a connection layer, a position-wise feed-forward layer, and a residual connection with layer normalization, where the connection layer comprises a splicing (concatenation) layer, a residual connection, and a layer normalization layer. The specific structure of the decoder is, connected in sequence from the input end: a first multi-head attention layer, a first connection layer, a codec (encoder-decoder) attention layer, a second connection layer, a position-wise feed-forward layer, and a residual connection with layer normalization; the codec attention layer has a receiving end connected to the output of the encoder's residual connection and layer normalization layer. Each of the first connection layer and the second connection layer comprises a splicing layer, a residual connection, and a layer normalization layer. In the embodiment of the application, a parallel structure is used in the encoder part; its function is to fuse the features processed by the multi-head attention layer, extract more features, and slow the loss of position information. In addition, the encoder output is also fed to the codec attention layer in the decoder, which speeds up convergence of model training and improves robustness.
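A minimal sketch of one decoder module with this layout. A standard Transformer decoder block is assumed; the patent's "connection layers" are modeled here simply as residual additions plus layer normalization, and the head count and feed-forward width are illustrative.

```python
import torch
import torch.nn as nn

class DecoderBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4, d_ff=1024, dropout=0.1):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                               batch_first=True)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, dropout=dropout,
                                                batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d_model, d_ff), nn.ReLU(),
                                 nn.Linear(d_ff, d_model))
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.norm3 = nn.LayerNorm(d_model)

    def forward(self, y, enc_out, causal_mask=None):
        # First multi-head attention layer: masked self-attention over generated text.
        a, _ = self.self_attn(y, y, y, attn_mask=causal_mask)
        y = self.norm1(y + a)                     # residual connection + layer norm
        # Codec attention layer: cross-attention over the encoder output.
        a, _ = self.cross_attn(y, enc_out, enc_out)
        y = self.norm2(y + a)
        # Position-wise feed-forward layer, again with residual + layer norm.
        return self.norm3(y + self.ffn(y))
```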
In the embodiment of the application, for the training of the speech recognition model, samples are sorted in ascending order of audio length, the batch size is 26, an Adam optimizer is adopted, and the learning rate is dynamically adjusted throughout training according to the following formula:
$$lr = k \cdot d_{model}^{-0.5} \cdot \min\left(n^{-0.5},\; n \cdot n_{warmup}^{-1.5}\right)$$
where $n$ is the number of training steps, $k$ is the scaling factor, $n_{warmup}$ is the number of warm-up steps, and $d_{model}$ is the dimension of the matrices in the attention layers. In the embodiment of the application, $k$ is 10, $d_{model}$ is 256, and $n_{warmup}$ is 25000 steps; training runs for 240 epochs. In addition, to prevent overfitting, a Dropout ratio of 0.1 is used in each sub-layer.
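A small sketch of this schedule. The formula above is reconstructed as the standard Transformer warm-up schedule, which the listed variables match; the constants are those of the embodiment.

```python
def transformer_lr(step: int, k: float = 10.0, d_model: int = 256,
                   warmup: int = 25000) -> float:
    """Warm-up schedule: linear ramp-up, then inverse-square-root decay."""
    step = max(step, 1)                       # avoid division by zero at step 0
    return k * d_model ** -0.5 * min(step ** -0.5, step * warmup ** -1.5)

# Example: the rate rises during the first 25000 steps, then decays.
print(transformer_lr(1000), transformer_lr(25000), transformer_lr(100000))
```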
Step S110 obtains text information through voice conversion. In step S120, the obtained text information is input into the alarm condition recognition model, alarm condition elements in the text information are extracted to obtain a plurality of entity information of the alarm condition, and the plurality of entity information are combined to obtain the target alarm information.
In the embodiment of the application, the alarm condition recognition model in step S120 includes a feature extraction layer, a semantic coding layer, and a decoding layer. The feature extraction layer processes the text sequence to obtain corresponding word representation vectors fused with general-domain semantic information; the semantic coding layer encodes the word representation vectors to obtain sentence semantic codes; and the decoding layer decodes the sentence semantic codes and obtains the optimal tag sequence according to the validity relationships between labels.
Specifically, the feature extraction layer in the embodiment of the application is an ALBERT pre-trained model, which produces, for each word in each text sequence, a word representation vector fused with general-domain semantic information. In the embodiment of the application, the ALBERT pre-trained model is an improvement on the BERT model that reduces the number of model parameters, so the recognition performance of the model can be improved while using less memory.
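A minimal sketch of extracting such word representations with the HuggingFace transformers library. The checkpoint name is an assumption (the patent names no checkpoint); Chinese ALBERT checkpoints are commonly paired with a BERT tokenizer.

```python
import torch
from transformers import AlbertModel, BertTokenizerFast

# Checkpoint is illustrative; any Chinese ALBERT checkpoint would serve.
tokenizer = BertTokenizerFast.from_pretrained("voidful/albert_chinese_base")
albert = AlbertModel.from_pretrained("voidful/albert_chinese_base")

text = "..."  # one transcribed alarm sentence from step S110
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    # last_hidden_state: (1, seq_len, hidden) — per-token representation vectors
    reps = albert(**inputs).last_hidden_state
```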
In the embodiment of the application, the semantic coding layer comprises a plurality of interconnected Bi-LSTM modules. Each Bi-LSTM module comprises a forward LSTM layer and a backward LSTM layer, the hidden layers of which are connected to the same output layer; the spliced word vector is obtained through the forward and backward LSTM layers. The decoding layer is a CRF layer used to obtain the optimal tag sequence for the word vectors.
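A compact sketch of this Bi-LSTM encoder producing per-token tag scores. The layer sizes are illustrative assumptions (768 matching an ALBERT-base hidden size); the CRF decoding layer it feeds is sketched after the CRF formulas below.

```python
import torch
import torch.nn as nn

class BiLSTMEncoder(nn.Module):
    """Forward + backward LSTM whose outputs are spliced into one vector per token."""
    def __init__(self, input_dim=768, hidden_dim=256, num_tags=9):
        super().__init__()
        self.bilstm = nn.LSTM(input_dim, hidden_dim, batch_first=True,
                              bidirectional=True)   # forward and backward LSTM layers
        # Both directions feed the same output layer, giving per-token tag scores
        # (the emissions consumed by the CRF decoding layer).
        self.emit = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, word_reps):                   # (batch, seq, input_dim)
        spliced, _ = self.bilstm(word_reps)         # (batch, seq, 2*hidden_dim)
        return self.emit(spliced)                   # (batch, seq, num_tags)
```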
The LSTM computes the forget gate, the memory gate, the current cell state, the output result, and the current hidden state as follows.

The forget gate selects the information to discard. Its inputs are the hidden layer state at the previous moment, $h_{t-1}$, and the current input sequence $x_t$ (i.e., the sequence composed of the extracted word representation vectors); $W_f$ and $b_f$ are the weight matrix and bias term, respectively, and $\sigma$ is the activation function:
$$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$$

The memory gate and the current cell state select the information to retain. With $\tilde{C}_t$ the temporary cell state and $C_{t-1}$ the cell state at the previous moment, the output is the cell state at the current moment, $C_t$:
$$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$$
$$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$$
$$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$$

The output result and the current hidden state take as inputs the word at the current moment, $x_t$, and the current cell state $C_t$; the outputs are the value of the output gate $o_t$ and the current hidden state $h_t$:
$$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$$
$$h_t = o_t \odot \tanh(C_t)$$
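A direct NumPy transcription of these gate equations — a single step of one LSTM cell, with weight shapes left to the caller as illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step. W maps gate name -> weight matrix over [h_{t-1}; x_t]."""
    z = np.concatenate([h_prev, x_t])             # [h_{t-1}, x_t]
    f_t = sigmoid(W["f"] @ z + b["f"])            # forget gate
    i_t = sigmoid(W["i"] @ z + b["i"])            # memory (input) gate
    c_tilde = np.tanh(W["c"] @ z + b["c"])        # temporary cell state
    c_t = f_t * c_prev + i_t * c_tilde            # current cell state
    o_t = sigmoid(W["o"] @ z + b["o"])            # output gate
    h_t = o_t * np.tanh(c_t)                      # current hidden state
    return h_t, c_t
```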
in the embodiment of the application, the past history information of the sentence sequence is acquired aiming at the LSTM neural network model, and the single word or word in the sequence label task is related to the context information of the sentence where the single word or word is located, and the Bi-LSTM neural network model connects the forward LSTM hidden layer and the backward LSTM hidden layer to the same output layer to obtain the spliced word vector.
In the embodiment of the application, the decoding layer is a CRF layer used to obtain the optimal tag sequence for the word vectors. The output probabilities of the BiLSTM model for each word's labels are mutually independent, so BiLSTM alone cannot learn the transition characteristics between labels of the message text sequence, even though related labels within a sentence are connected; combining a CRF with the BiLSTM takes the order between labels into account to obtain the optimal tag sequence. The output of the BiLSTM layer is a score for each tag. For example, for words W1 and W2, if the BiLSTM scores for W1 are 0.32, 0.24, 0.54, 0.26, 0.19 and the scores for W2 are 0.45, 0.58, 0.72, 0.15, 0.63, then the tag corresponding to 0.54 would be taken as the tag of W1 and the tag corresponding to 0.72 as the tag of W2; in such a case, independently chosen tags may not form a valid label sequence. Adding a CRF after the BiLSTM layer lets the model learn the sequential constraints of labels from the training data and ensures the validity of label prediction: the first word of a sentence should be tagged "B" or "O" rather than "I", and the same holds for the first tag of each named entity; with continued training, the CRF layer learns such constraints by itself. For each sequence, i.e., each sentence, let $X = (x_1, \ldots, x_n)$ denote the character sequence and $y = (y_1, \ldots, y_n)$ the tag sequence of the sentence. The score of a tag sequence is computed by
$$s(X, y) = \sum_{i=0}^{n} A_{y_i, y_{i+1}} + \sum_{i=1}^{n} P_{i, y_i}$$
where $A_{y_i, y_{i+1}}$ is the transition probability from tag $y_i$ to tag $y_{i+1}$ and $P_{i, y_i}$ is the output score of the $i$-th position for tag $y_i$. For each $X$, the probability over all its possible tag sequences $y$ is computed by
$$p(y \mid X) = \frac{e^{s(X, y)}}{\sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}}$$
where $Y_X$ denotes the set of all possible tag sequences for the input sequence $X$ and $y$ is the actual true tag sequence. The log-likelihood of the marked sequence is
$$\log p(y \mid X) = s(X, y) - \log \sum_{\tilde{y} \in Y_X} e^{s(X, \tilde{y})}$$
and decoding maximizes the probability of the output tag sequence:
$$y^{*} = \arg\max_{\tilde{y} \in Y_X} s(X, \tilde{y})$$
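A minimal NumPy sketch of the CRF path score and Viterbi decoding implied by these formulas. The emissions P are the BiLSTM output scores; start/stop transitions are omitted for brevity, which is a simplification of the formulation above.

```python
import numpy as np

def sequence_score(P, A, y):
    """s(X, y): emission scores P (n, K) plus transitions A (K, K) along path y."""
    emit = sum(P[i, y[i]] for i in range(len(y)))
    trans = sum(A[y[i], y[i + 1]] for i in range(len(y) - 1))
    return emit + trans

def viterbi_decode(P, A):
    """argmax over all tag sequences of s(X, y), computed in O(n * K^2)."""
    n, K = P.shape
    score = P[0].copy()                       # best score ending in each tag
    back = np.zeros((n, K), dtype=int)
    for i in range(1, n):
        cand = score[:, None] + A + P[i]      # (prev_tag, next_tag) scores
        back[i] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    # Walk the backpointers from the last position to the first.
    y = [int(score.argmax())]
    for i in range(n - 1, 0, -1):
        y.append(int(back[i, y[-1]]))
    return y[::-1]
```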
in the embodiment of the application, the processing is performed through the processes in the step S110 and the step S120, so that the test alarm voice information can be subjected to end-to-end voice-to-word conversion, and the converted words are extracted through alarm elements to obtain the final alarm information.
Referring to fig. 2, an artificial intelligence based automated local fire voice recognition device 200 is provided, comprising:
the text information obtaining module 210 is configured to input the alarm voice information obtained in real time into a voice recognition model for text conversion to obtain text information;
the alarm information obtaining module 220 is configured to input the text information to an alarm condition recognition model, extract alarm condition elements in the text information, obtain a plurality of entity information of the alarm condition, and combine the entity information to obtain target alarm information.
Referring to fig. 3, an artificial intelligence based automated local fire voice recognition device 300 may vary widely in configuration or performance, may include one or more processors 301 and memory 302, and may have one or more stored applications or data stored in memory 302. Wherein the memory 302 may be transient storage or persistent storage. The application program stored in memory 302 may include one or more modules (not shown in the figures), each of which may include a series of computer-executable instructions in an artificial intelligence-based automated local fire voice recognition device. Still further, the processor 301 may be configured to communicate with the memory 302 and execute a series of computer executable instructions in the memory 302 on an artificial intelligence based automated local fire voice recognition device. The artificial intelligence based automated local fire voice recognition device may also include one or more power supplies 303, one or more wired or wireless network interfaces 304, one or more input/output interfaces 305, one or more keyboards 306, and the like.
In one particular embodiment, an artificial intelligence based automated local fire voice recognition device includes a memory, and one or more programs, wherein the one or more programs are stored in the memory, and the one or more programs may include one or more modules, and each module may include a series of computer executable instructions for use in the artificial intelligence based automated local fire voice recognition device, and configured to be executed by one or more processors, the one or more programs comprising computer executable instructions for:
performing voice-text conversion on alarm information acquired in real time to obtain text information ordered by time sequence;
extracting alarm condition elements in the obtained text information through an alarm condition recognition model, obtaining a plurality of entity information of the alarm condition based on the alarm condition elements, and combining the entity information to obtain target alarm information.
The following describes each component of the processor in detail:
In this embodiment, the processor is an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present application, for example: one or more digital signal processors (DSPs), or one or more field-programmable gate arrays (FPGAs).
Alternatively, the processor may perform various functions, such as performing the method shown in fig. 1 described above, by running or executing a software program stored in memory, and invoking data stored in memory.
In a particular implementation, the processor may include one or more microprocessors, as one embodiment.
The memory is configured to store a software program for executing the scheme of the present application, and the processor is used to control the execution of the software program, and the specific implementation manner may refer to the above method embodiment, which is not described herein again.
Alternatively, the memory may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact disc, laser disc, optical disc, digital versatile disc, Blu-ray disc, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory may be integrated with the processor, or may exist separately and be coupled to the processing unit through an interface circuit of the processor, which is not specifically limited by the embodiment of the present application.
It should be noted that the structure shown in this embodiment does not constitute a limitation on the apparatus; an actual apparatus may include more or fewer components than shown, combine certain components, or arrange components differently.
In addition, the technical effects of the processor may refer to the technical effects of the method described in the foregoing method embodiments, which are not described herein.
It should be appreciated that the processor in embodiments of the application may be other general purpose processors, digital signal processors (digital signal processor, DSP), application specific integrated circuits (application specific integrated circuit, ASIC), off-the-shelf programmable gate arrays (field programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
It should also be appreciated that the memory in embodiments of the present application may be volatile memory or nonvolatile memory, or may include both. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be random access memory (RAM), which acts as an external cache. By way of example but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct rambus RAM (DR RAM).
The above embodiments may be implemented in whole or in part by software, hardware (e.g., circuitry), firmware, or any combination thereof. When implemented in software, the above embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer program are loaded or executed on a computer, the processes or functions described in accordance with embodiments of the present application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired (e.g., coaxial cable, optical fiber, digital subscriber line) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or data center containing one or more sets of available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium. The semiconductor medium may be a solid-state disk.
In the present application, "at least one" means one or more, and "a plurality" means two or more. "at least one of" or the like means any combination of these items, including any combination of single item(s) or plural items(s). For example, at least one (one) of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, wherein a, b, c may be single or plural.
It should be understood that, in various embodiments of the present application, the sequence numbers of the foregoing processes do not mean the order of execution, and the order of execution of the processes should be determined by the functions and internal logic thereof, and should not constitute any limitation on the implementation process of the embodiments of the present application.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a read-only memory (ROM), a random access memory (random access memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. An artificial intelligence based automated local fire voice recognition method, characterized by being applied to a server, the method comprising:
inputting alarm voice information acquired in real time into a voice recognition model, performing voice feature processing on the alarm voice information by the voice recognition model to obtain a voice sequence taking a frame as a unit, and performing encoding and decoding processing on the voice sequence to obtain a text sequence in a mapping relation with the voice sequence, wherein the text sequence is text information to be processed;
inputting the text information into an alarm condition recognition model, extracting alarm condition elements in the text information to obtain a plurality of entity information of the alarm condition, and combining the entity information to obtain target alarm information.
2. The artificial intelligence based automated local fire voice recognition method of claim 1, wherein the voice recognition model comprises a pre-processing module, a codec, and a mixed attention module; the pre-processing module comprises an acoustic pre-processing sub-module and a text pre-processing sub-module, wherein the acoustic pre-processing sub-module is used for performing voice feature processing on the alarm voice information to obtain a voice sequence in units of frames, the voice sequence consisting of a plurality of voice features, and the text pre-processing sub-module is used for performing text conversion on the voice sequence to obtain an initial text sequence on the scale of the voice sequence; the codec comprises an encoder and a decoder, wherein the encoder is used for mapping the plurality of voice features in the voice sequence, and the decoder is used for decoding the initial text sequence in combination with the attention-adjusted hidden states to generate target text features, the target text sequence being generated on the scale of the sequence.
3. The automated local fire voice recognition method based on artificial intelligence according to claim 2, wherein the acoustic pre-submodule comprises a plurality of two-dimensional convolution modules and a position encoder module, wherein the convolution modules comprise a two-dimensional convolution layer and a ReLU activation layer; the two-dimensional convolution module is used for extracting acoustic features, and the position encoder is used for acquiring absolute position information of the acoustic features.
4. The automated local fire voice recognition method based on artificial intelligence of claim 3, wherein the text pre-submodule comprises a plurality of time convolution networks, a position encoder module and an embedded layer, wherein the time convolution networks comprise a one-dimensional convolution layer and a ReLU activation layer.
5. The automated local fire voice recognition method based on artificial intelligence of claim 2, wherein the encoder and decoder are composed of a stack of identical modules, each module comprising two sub-layer structures, a multi-headed attention layer and a feed-forward network layer, respectively, each of which is followed by a residual network connection and layer normalization.
6. The artificial intelligence based automated local fire voice recognition method of claim 5, wherein the multi-headed attention layer in the decoder comprises a first multi-headed attention layer constructed based on a multi-headed attention mechanism and a second multi-headed attention layer constructed based on a cross-over attention mechanism.
7. The automatic local fire control voice recognition method based on artificial intelligence according to claim 1, wherein the alarm condition recognition model comprises a feature extraction layer, a semantic coding layer and a decoding layer; the feature extraction layer processes the text sequence to obtain corresponding word representation vectors fused with general-domain semantic information, the semantic coding layer encodes the word representation vectors to obtain sentence semantic codes, and the decoding layer is used for decoding the sentence semantic codes and obtaining the optimal tag sequence according to the validity relationships between labels.
8. The automated local fire voice recognition method based on artificial intelligence of claim 7, wherein the feature extraction layer is an ALBERT model.
9. The automated local fire voice recognition method based on artificial intelligence according to claim 7, wherein the semantic coding layer comprises a plurality of Bi-LSTM modules connected with each other, the Bi-LSTM modules comprise a forward LSTM layer and a backward LSTM layer, both the forward LSTM layer and the backward LSTM layer are connected to the same output layer, and the spliced word vector is obtained through the forward LSTM layer and the backward LSTM layer; the decoding layer is a CRF layer and is used for obtaining the optimal label sequence of the word vector.
10. An artificial intelligence based automated local fire voice recognition device, the device comprising:
the text information acquisition module is used for inputting the alarm voice information acquired in real time into the voice recognition model to perform text conversion to obtain text information;
the alarm information acquisition module is used for inputting the text information into the alarm condition recognition model to extract alarm condition elements in the text information, obtaining a plurality of entity information of the alarm condition, and combining the entity information to obtain target alarm information.
CN202311501842.0A 2023-11-13 2023-11-13 Automatic local fire control voice recognition method and device based on artificial intelligence Pending CN117238294A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311501842.0A CN117238294A (en) 2023-11-13 2023-11-13 Automatic local fire control voice recognition method and device based on artificial intelligence

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311501842.0A CN117238294A (en) 2023-11-13 2023-11-13 Automatic local fire control voice recognition method and device based on artificial intelligence

Publications (1)

Publication Number Publication Date
CN117238294A true CN117238294A (en) 2023-12-15

Family

ID=89091580

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311501842.0A Pending CN117238294A (en) 2023-11-13 2023-11-13 Automatic local fire control voice recognition method and device based on artificial intelligence

Country Status (1)

Country Link
CN (1) CN117238294A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112967715A (en) * 2020-12-29 2021-06-15 杭州拓深科技有限公司 Fire fighting alarm handling method and device based on intelligent AI voice interaction algorithm
CN114023316A (en) * 2021-11-04 2022-02-08 匀熵科技(无锡)有限公司 TCN-Transformer-CTC-based end-to-end Chinese voice recognition method
CN116343447A (en) * 2023-03-02 2023-06-27 武汉理工光科股份有限公司 Dynamic alarm receiving and quick alarm outputting method, device, equipment and storage medium
CN116701610A (en) * 2023-08-03 2023-09-05 成都大成均图科技有限公司 Effective alarm condition identification method and device based on emergency multisource alarm

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112967715A (en) * 2020-12-29 2021-06-15 杭州拓深科技有限公司 Fire fighting alarm handling method and device based on intelligent AI voice interaction algorithm
CN114023316A (en) * 2021-11-04 2022-02-08 匀熵科技(无锡)有限公司 TCN-Transformer-CTC-based end-to-end Chinese voice recognition method
CN116343447A (en) * 2023-03-02 2023-06-27 武汉理工光科股份有限公司 Dynamic alarm receiving and quick alarm outputting method, device, equipment and storage medium
CN116701610A (en) * 2023-08-03 2023-09-05 成都大成均图科技有限公司 Effective alarm condition identification method and device based on emergency multisource alarm

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
谢旭康 (Xie Xukang): "Research on End-to-End Speech Recognition Models and System Construction", China Excellent Master's Theses Full-text Database (Information Science and Technology Series) *

Similar Documents

Publication Publication Date Title
CN109299273B (en) Multi-source multi-label text classification method and system based on improved seq2seq model
Subramani et al. Deep learning for multi-class identification from domestic violence online posts
CN111143576A (en) Event-oriented dynamic knowledge graph construction method and device
CN112183747A (en) Neural network training method, neural network compression method and related equipment
CN114004210A (en) Emergency plan generating method, system, equipment and medium based on neural network
Powers et al. Using artificial intelligence to identify emergency messages on social media during a natural disaster: A deep learning approach
CN113553412A (en) Question and answer processing method and device, electronic equipment and storage medium
CN111653275A (en) Method and device for constructing voice recognition model based on LSTM-CTC tail convolution and voice recognition method
CN115292568B (en) Civil news event extraction method based on joint model
CN113239174A (en) Hierarchical multi-round conversation generation method and device based on double-layer decoding
CN115796182A (en) Multi-modal named entity recognition method based on entity-level cross-modal interaction
CN109446326A (en) Biomedical event based on replicanism combines abstracting method
CN110782002B (en) LSTM neural network training method and device
CN115687934A (en) Intention recognition method and device, computer equipment and storage medium
CN116701610A (en) Effective alarm condition identification method and device based on emergency multisource alarm
CN114610866A (en) Sequence-to-sequence combined event extraction method and system based on global event type
Kounte et al. Analysis of Intelligent Machines using Deep learning and Natural Language Processing
CN117235261A (en) Multi-modal aspect-level emotion analysis method, device, equipment and storage medium
CN117114707A (en) Training method, prediction method and device for risk-escape anti-fraud prediction model
CN116682436A (en) Emergency alert acceptance information identification method and device
CN117238294A (en) Automatic local fire control voice recognition method and device based on artificial intelligence
US20230419042A1 (en) Machine-learning based irrelevant sentence classifier
CN110888944A (en) Attention convolution neural network entity relation extraction method based on multiple convolution window sizes
CN116432705A (en) Text generation model construction method, text generation device, equipment and medium
CN114861626A (en) Traffic warning condition processing method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination