CN116561632A - Method, system, equipment and medium for recognizing speech steps based on pre-training and gating neural network - Google Patents

Method, system, equipment and medium for recognizing speech steps based on pre-training and gating neural network Download PDF

Info

Publication number
CN116561632A
CN116561632A (application CN202310533340.XA)
Authority
CN
China
Prior art keywords
training
model
data
text
speech
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310533340.XA
Other languages
Chinese (zh)
Inventor
温浩
王杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xian University of Architecture and Technology
Original Assignee
Xian University of Architecture and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.): 2023-05-11
Filing date: 2023-05-11
Publication date: 2023-08-08
Application filed by Xian University of Architecture and Technology filed Critical Xian University of Architecture and Technology
Priority to CN202310533340.XA priority Critical patent/CN116561632A/en
Publication of CN116561632A publication Critical patent/CN116561632A/en
Pending legal-status Critical Current

Classifications

    • G06F18/24 Pattern recognition; Classification techniques
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/25 Fusion techniques
    • G06F40/289 Natural language analysis; Phrasal analysis, e.g. finite state techniques or chunking
    • G06N20/00 Machine learning
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a speech step recognition method, system, device and medium based on pre-training and a gated neural network, characterized in that data in a target text segment are collected, preprocessed and labeled according to preset speech steps; long, difficult complex sentences in the target text segment are screened and split; an automatic speech step recognition model based on ERNIE_AT-GRU is built; and the split data are input into the ERNIE_AT-GRU model for training, with speech step recognition tests performed on the test data over the training rounds to obtain the recognition result. In operation, the method uses the ERNIE pre-trained model, which combines large-scale text content with a knowledge graph, to learn the deep semantics of the text, overcoming the shortcoming of traditional machine learning that the internal relations and features between words are not fully mined and exploited, and it effectively extracts the parts of the text that are important for classification, so that the model is more compact and efficient.

Description

Method, system, equipment and medium for recognizing speech steps based on pre-training and gating neural network
Technical Field
The invention belongs to the technical field of text processing, and particularly relates to a speech step recognition method, system, device and medium based on pre-training and a gated neural network.
Background
Speech step (rhetorical move) recognition in the abstracts of academic papers uses concise, clear speech-step words to summarize the sentences of an abstract and helps readers quickly locate specific information in the paper. It has extended applications in artificial-intelligence recommendation, subject indexing of bibliographic information, information mining, knowledge discovery, knowledge-graph construction and the like, and as a fundamental task of text-processing research it is an important research topic. In existing speech step recognition algorithms, however, the internal relations and features between words are not fully mined and exploited, so the performance of these algorithms still needs to be improved.
One difficulty of speech step recognition is that natural language is diverse and complex in its expression: Chinese expression is diverse, words are ambiguous, and long, difficult complex sentences formed by nested sentence structures are hard to split, all of which are difficult for a machine to understand. Another difficulty is that there is no complete mathematical model that can accurately describe what natural language expresses, so the semantic understanding of natural language remains a great challenge for machines.
Abstract speech step recognition is mainly based on traditional machine-learning and deep-learning methods. In recent years, following the excellent performance of the BERT pre-trained model on a variety of natural-language-processing tasks, researchers have carried out tuning and reconstruction of pre-trained models, and most abstract speech step recognition based on traditional machine learning and deep learning targets structured abstracts or English data sets. For rule-based methods, rule formulation is critical to the recognition effect and the rules cannot cover all speech-step situations, so the recognition effect needs further improvement; the traditional machine-learning and deep-learning speech step recognition methods rely on features such as the vocabulary and lexical form of the text, but machine learning cannot learn the deep semantics of the text, so the effect is not optimal.
Disclosure of Invention
Aiming at the problems in the prior art, the invention provides a speech step recognition method, system, device and medium based on pre-training and a gated neural network, which learn the deep semantics of the text with a pre-trained model and combine it with a gated network having an attention mechanism to perform feature recognition focused on the key points, thereby improving the accuracy of semantic recognition.
The invention is realized by the following technical scheme:
the speech step recognition method based on the pre-training and gating neural network comprises the following steps:
S1: collecting the data in a target text segment, preprocessing the data, and labeling it according to preset speech steps;
S2: screening and splitting the long, difficult complex sentences in the target text segment;
S3: building an automatic speech step recognition model based on ERNIE_AT-GRU;
S4: inputting the split data into the ERNIE_AT-GRU model for training, and performing speech step recognition tests on the test data over the training rounds to obtain the speech step recognition result.
Preferably, in the step S1, the collected target text is converted to a unified text format, formatting symbols such as "\n", "\t" and blank spaces in the text are cleaned away, and the complete text content of the original data is retained.
Preferably, in the step S2, dependency syntax analysis is performed on the data of the target text by using the LTP tool, and whether a coordination (COO) relation exists in a sentence is identified; discrimination and splitting of the long, difficult complex sentences are realized according to the obtained coordination-relation marks to obtain single-semantic data, and the single-semantic data are divided into training data and test data in a ratio of 8:2.
Preferably, the LTP tool performs dependency syntax analysis and discriminates whether a coordination relation exists in a sentence according to the COO mark, comprising the following steps:
A1: the LTP tool performs word segmentation, part-of-speech tagging and dependency syntax analysis on the data of the target text;
A2: the obtained data are integrated into a format that is convenient to process, S = (word, part-of-speech tag, (word node, parent node, dependency-relation mark));
A3: the integrated data are traversed, sentences in which the parent node of a word is the root node and the dependency relation is marked COO are obtained, and the semantically complex sentences meeting this condition are stored;
A4: the semantically complex sentences are traversed, and each qualifying complex sentence is split at the comma preceding the coordination relation to obtain single-semantic clauses.
Preferably, building the automatic speech step recognition model in the step S3 comprises the following steps:
B1: constructing the ERNIE pre-trained model, whose Transformer-XL feature extractor fuses text semantics with a multi-head self-attention mechanism to obtain a word-vector feature matrix that incorporates the multi-head attention mechanism;
B2: building the gated-network AT-GRU module with an attention mechanism, feeding the word-vector matrix obtained from the pre-trained model into a bidirectional gated network to learn contextual text features, and connecting the attention mechanism to focus on the information that is important for text classification;
B3: combining the ERNIE pre-trained model with the AT-GRU module to obtain the ERNIE_AT-GRU model.
Preferably, the step B1 of constructing the ERNIE pre-trained model comprises the following steps:
C1: a calling interface for the pre-trained model is written, and the information required by the pre-trained model, such as its pre-trained parameters, is loaded; the ERNIE pre-trained model uses a three-stage masking scheme at the single-word, phrase and entity levels to obtain the complete semantics of words, phrases and entities;
C2: the word sequence X_i = {w_i1, w_i2, ..., w_iN} produced by the three-stage mask is input into the Transformer-XL encoder; through the word-embedding step x_it = W_e·w_it, t ∈ [1, N], where W_e is the weight matrix of the Embedding layer, the high-dimensional sparse word vectors are converted into a low-dimensional dense word-vector matrix, i.e. the word-embedding vectors of each sentence;
C3: for a single self-attention head, three weight matrices W_q, W_k and W_v are learned; multiplying the word-embedding vectors by each of them yields the matrices Q, K and V, which express the correlation between the current word and the other words in the sentence; to prevent the result from becoming too large, the dot products are divided by √d_k, where d_k is the dimension of the Q (or K) vectors and the relative-distance term within a window is a learnable variable; the result is then normalized by a Softmax function to obtain each word's normalized correlation with the other words, and this is multiplied by the V matrix, i.e. a weighted sum, to obtain the new vector encoding of each word, according to the formula:
Attention(Q, K, V) = Softmax(Q·K^T / √d_k)·V;
C4: the Q, K, V matrices computed by each single self-attention head are combined head by head, and the concatenation is multiplied by the weight matrix W_0, which linearly maps the concatenated heads back to the original dimension to give the multi-head matrix; the process can be expressed as:
head_i = Attention(Q_i, K_i, V_i), i = 1, ..., h;
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)·W_0;
C5: the matrix obtained from the multi-head computation is input into an Add & Norm layer for addition and normalization of the self-attention input and output, and then passes through the feed-forward neural network of the fully connected layer and another Add & Norm layer to obtain the word-vector feature matrix fused with the multi-head attention mechanism; this matrix contains the text features learned by the model and the contextual semantic information contained in the text.
Preferably, the step B2 of building the gated-network AT-GRU module with an attention mechanism comprises the following steps:
D1: the attention-layer code is written according to the single self-attention formula given above;
D2: the bidirectional gating mechanism BiGRU comprises a reset gate and an update gate; the reset gate screens out part of the state information of the previous moment for the candidate state, and the update gate decides how much of the historical state is retained in the current state, as in the formula:
r_t = δ(W_r·x_t + U_r·h_(t-1));
where δ is the activation function, x_t is the current input, h_(t-1) is the hidden-layer output at the previous moment, i.e. the historical state, and W_r and U_r are weight matrices;
D3: the bidirectional gated network trains a forward GRU and a backward GRU, respectively, over the text-context features; for the i-th sentence the forward and backward operations give the forward hidden state h→_i and the backward hidden state h←_i:
h→_i = GRU_fw(x_i, h→_(i-1)), h←_i = GRU_bw(x_i, h←_(i+1));
the sentence encoding representation is obtained from the forward and backward hidden states:
h_i = [h→_i ; h←_i];
linking the forward and backward training, the update of the current gate state is determined by a new candidate u_i created by the tanh layer:
u_i = tanh(W_s·h_i + b_s);
the attention mechanism that computes the word weights is added to the gated network to form the AT-GRU module:
α_i = exp(u_i^T·u_s) / Σ_j exp(u_j^T·u_s), s = Σ_i α_i·h_i,
where u_s is a trainable context vector; the clause hidden information acting on the semantic representation is obtained through the attention mechanism, and the clause information is aggregated to obtain the representation information of all sentences.
Preferably, in the step B3, combining the ERNIE pre-trained model with the AT-GRU module to obtain the ERNIE_AT-GRU model comprises the following steps:
E1: the three-dimensional word-vector feature matrix fused with the multi-head attention mechanism, obtained from the pre-trained model, is reshaped into a dimension that can be input into the gated network, and the data are sent into the gated network with the attention mechanism;
E2: a Dropout layer is added after the gated-network layer to randomly ignore a preset number of neurons and prevent the model from over-fitting;
E3: a fully connected layer is attached, speech step recognition is performed with Softmax, and the classification label is output.
Preferably, the step S4 comprises the following steps:
F1: the single-semantic data are input into the model, the pre-trained model interface is called to perform text word segmentation, and the segmented words are vectorized according to the mapping of the pre-trained model dictionary;
F2: the training data and test data are processed in batches according to the batch value; each sentence is padded or truncated to the length pad_size and a 0/1 mask is applied to the text; the label numbers and the mask results are merged and saved into a pkl file for convenient program reading, and at loading time they are loaded together into a DataFrame tabular data structure;
F3: the vectorized DataFrame data are input into the ERNIE_AT-GRU model in batches for training; label prediction is performed on the test data of each round through forward computation and back propagation, and the cross-entropy used in back propagation serves as the loss function to optimize the model:
L = -Σ_(i=1..D) Σ_(c=1..C) y_(ic)·log(p_(ic));
where D is the size of the training data, C is the number of categories, y_(ic) is the text-data label and p_(ic) is the model's predicted probability; the label of the best prediction result is determined through repeated parameter tuning to obtain a well-classified speech step recognition result and the test label data, i.e. the speech-step classification result, and the model is analysed through the loss changes during its operation.
A speech step recognition system based on pre-training and a gated neural network, comprising:
an acquisition module, used for collecting the data in the target text segment, preprocessing it, and labeling it according to preset speech steps;
a processing module, used for screening and splitting the long, difficult complex sentences in the target text segment;
a model-building module, used for building the automatic speech step recognition model based on ERNIE_AT-GRU;
and an output module, used for inputting the split data into the ERNIE_AT-GRU model for training and performing speech step recognition tests on the test data over the training rounds to obtain the speech step recognition result.
A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the pre-training and gated neural network based speech step recognition method when the computer program is executed.
A computer readable storage medium storing a computer program which, when executed by a processor, implements the steps of the pre-training and gated neural network based speech step recognition method.
Compared with the prior art, the invention has the following beneficial technical effects:
the invention provides a speech step recognition method, a system, equipment and a medium based on a pre-training and gating neural network, which are characterized in that data in a target text segment are collected for preprocessing, and labeling is carried out according to a preset speech step; screening and splitting long and difficult complex sentences in the target text segment; setting up an ERNIE_AT-GRU-based automatic speech step recognition model; inputting the split data into an ERNIE_AT-GRU model for training, and performing a step recognition test on the test data through round training to obtain a step recognition result; according to the word-step recognition method based on the pre-training model and the gate-controlled neural network, when the word-step recognition method is specifically operated, the ERNIE pre-training model combined with large-scale text content and a knowledge graph is utilized to learn text deep semantics, the defects that the traditional machine learning is insufficient in excavation and utilizes the internal relation and characteristics between words are overcome, the gate-controlled network AT-GRU model with the attention mechanism AT the downstream is used for carrying out focused characteristic learning, the word vectors which are more beneficial to text classification are focused, the problem that the classification effect is poor due to the forgetting problem of long text input in the machine learning is solved, and compared with the prior art, the method effectively extracts important parts which are beneficial to classification in the text, so that the model is more simplified and has higher efficiency. In addition, it should be noted that, the word vector matrix is generated through dictionary mapping of the pre-training model, the machine-readable digital matrix is used for representing the text, and the text is subjected to deep semantic learning of the pre-training model, so that a better effect can be obtained without a large number of test data samples, the pre-training model learns large-scale text data and knowledge graph knowledge, has better mobility, and performs semantic learning training by combining own training data, so that the method has more robustness to test data.
Drawings
FIG. 1 is a flow chart of an implementation of the speech step recognition method based on a pre-training and gating neural network of the present invention;
FIG. 2 is a diagram of the automatic speech step recognition model for Chinese unstructured abstracts according to the invention;
FIG. 3 is a schematic representation of an ERNIE pre-training model of the present invention;
FIG. 4 is a diagram illustrating an exemplary MASK mode of an ERNIE pre-training model according to the invention;
FIG. 5 is a calculation flow of the attention neural network layer of the present invention;
FIG. 6 is a block diagram of a BIGRU neural network with attention mechanism according to the present invention.
Detailed Description
The invention will now be described in further detail with reference to specific examples, which are intended to illustrate, but not to limit, the invention.
In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present invention and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
The invention provides a speech step recognition method based on pre-training and a gated neural network which, as shown in fig. 1, comprises the following steps:
S1: collecting the data in a target text segment, preprocessing the data, and labeling it according to preset speech steps;
S2: screening and splitting the long, difficult complex sentences in the target text segment;
S3: building an automatic speech step recognition model based on ERNIE_AT-GRU;
S4: inputting the split data into the ERNIE_AT-GRU model for training, and performing speech step recognition tests on the test data over the training rounds to obtain the speech step recognition result.
Preferably, in the step S1, the collected target text is converted to a unified text format, formatting symbols such as "\n", "\t" and blank spaces in the text are cleaned away, and the complete text content of the original data is retained. Further, in the step S2, dependency syntax analysis is performed on the data of the target text by using the LTP tool, and whether a coordination (COO) relation exists in a sentence is identified; discrimination and splitting of the long, difficult complex sentences are realized according to the obtained coordination-relation marks to obtain single-semantic data, and the single-semantic data are divided into training data and test data in a ratio of 8:2. A small illustrative preprocessing sketch follows.
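As a minimal, non-limiting sketch of the preprocessing and 8:2 split described above, the snippet below removes the formatting symbols and divides labeled, single-semantic samples into training and test sets; all function and variable names here are hypothetical and not taken from the original implementation.

```python
import random
import re

def clean_text(text: str) -> str:
    """Remove formatting symbols such as '\n', '\t' and redundant blanks
    while keeping the complete textual content of the original data."""
    text = text.replace("\n", "").replace("\t", "")
    return re.sub(r"\s+", " ", text).strip()   # collapse leftover whitespace

def split_train_test(samples, ratio=0.8, seed=42):
    """Shuffle (sentence, speech-step label) pairs and split them 8:2."""
    random.seed(seed)
    shuffled = samples[:]
    random.shuffle(shuffled)
    cut = int(len(shuffled) * ratio)
    return shuffled[:cut], shuffled[cut:]

# usage: samples is a list of (abstract_sentence, speech_step_label) tuples
# train_set, test_set = split_train_test([(clean_text(s), y) for s, y in samples])
```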
Preferably, the LTP tool performs dependency syntax analysis and discriminates whether a coordination relation exists in a sentence according to the COO mark, comprising the following steps:
A1: the LTP tool performs word segmentation, part-of-speech tagging and dependency syntax analysis on the data of the target text;
A2: the obtained data are integrated into a format that is convenient to process, S = (word, part-of-speech tag, (word node, parent node, dependency-relation mark));
A3: the integrated data are traversed, sentences in which the parent node of a word is the root node and the dependency relation is marked COO are obtained, and the semantically complex sentences meeting this condition are stored;
A4: the semantically complex sentences are traversed, and each qualifying complex sentence is split at the comma preceding the coordination relation to obtain single-semantic clauses (a minimal sketch of this splitting is given after these steps).
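The following sketch illustrates steps A2 to A4 on data already parsed by LTP; it assumes the parsed tokens follow the S = (word, part-of-speech tag, (node, parent, relation)) format described above, uses LTP's HED and COO relation labels, and simplifies the splitting to the first coordinated clause only. The helper names are illustrative, not the authors' code.

```python
from typing import List, Tuple

# one parsed token: (word, pos, (token_index, head_index, dependency_relation))
Token = Tuple[str, str, Tuple[int, int, str]]

def is_semantically_complex(sentence: List[Token]) -> bool:
    """True when a token whose head is the sentence root (the HED word in
    LTP's labelling) carries the COO (coordination) dependency relation."""
    root_ids = {idx for _, _, (idx, _, rel) in sentence if rel == "HED"}
    return any(rel == "COO" and head in root_ids
               for _, _, (_, head, rel) in sentence)

def split_at_coo(sentence: List[Token]) -> List[str]:
    """Split a complex sentence at the comma in front of the coordinated
    clause, yielding single-semantic clauses (simplified: first COO only)."""
    words = [w for w, _, _ in sentence]
    rels = [rel for _, _, (_, _, rel) in sentence]
    if "COO" not in rels:
        return ["".join(words)]
    coo_pos = rels.index("COO")                       # coordinated clause head
    commas = [i for i, w in enumerate(words[:coo_pos]) if w in ("，", ",")]
    if not commas:
        return ["".join(words)]
    cut = commas[-1]                                  # comma before the COO clause
    return ["".join(words[:cut]), "".join(words[cut + 1:])]
```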
Preferably, building the automatic speech step recognition model in the step S3 comprises the following steps:
B1: constructing the ERNIE pre-trained model, whose Transformer-XL feature extractor fuses text semantics with a multi-head self-attention mechanism to obtain a word-vector feature matrix that incorporates the multi-head attention mechanism, as shown in fig. 3;
B2: building the gated-network AT-GRU module with an attention mechanism, feeding the word-vector matrix obtained from the pre-trained model into a bidirectional gated network to learn contextual text features, and connecting the attention mechanism to focus on the information that is important for text classification;
B3: combining the ERNIE pre-trained model with the AT-GRU module to obtain the ERNIE_AT-GRU model, as shown in fig. 2.
Further, the step B1 of constructing the ERNIE pre-trained model comprises the following steps:
C1: a calling interface for the pre-trained model is written, and the information required by the pre-trained model, such as its pre-trained parameters, is loaded; the ERNIE pre-trained model uses a three-stage masking scheme at the single-word, phrase and entity levels to obtain the complete semantics of words, phrases and entities;
C2: the word sequence X_i = {w_i1, w_i2, ..., w_iN} produced by the three-stage mask is input into the Transformer-XL encoder; through the word-embedding step x_it = W_e·w_it, t ∈ [1, N], where W_e is the weight matrix of the Embedding layer, the high-dimensional sparse word vectors are converted into a low-dimensional dense word-vector matrix, i.e. the word-embedding vectors of each sentence. As shown in fig. 4, for an input text sentence, a basic-level MASK first randomly masks basic language units such as words and Chinese characters to obtain the basic-level information of the sentence; a second, entity-level MASK then masks entities such as proper nouns in the sentence and predicts the gaps inside these entities; finally, a phrase-level MASK masks and predicts all the basic units of the same phrase in the sentence, so that the phrase information is encoded into the word embeddings. Through the three stages of masking, text information of different semantic units is obtained, giving a rich expression of the semantic information of the sentence;
C3: for a single self-attention head, three weight matrices W_q, W_k and W_v are learned; multiplying the word-embedding vectors by each of them yields the matrices Q, K and V, which express the correlation between the current word and the other words in the sentence; to prevent the result from becoming too large, the dot products are divided by √d_k, where d_k is the dimension of the Q (or K) vectors and the relative-distance term within a window is a learnable variable; the result is then normalized by a Softmax function to obtain each word's normalized correlation with the other words, and this is multiplied by the V matrix, i.e. a weighted sum, to obtain the new vector encoding of each word, according to the formula:
Attention(Q, K, V) = Softmax(Q·K^T / √d_k)·V;
C4: the Q, K, V matrices computed by each single self-attention head are combined head by head, and the concatenation is multiplied by the weight matrix W_0, which linearly maps the concatenated heads back to the original dimension to give the multi-head matrix; the process can be expressed as:
head_i = Attention(Q_i, K_i, V_i), i = 1, ..., h;
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)·W_0;
C5: the matrix obtained from the multi-head computation is input into an Add & Norm layer for addition and normalization of the self-attention input and output, and then passes through the feed-forward neural network of the fully connected layer and another Add & Norm layer to obtain the word-vector feature matrix fused with the multi-head attention mechanism; this matrix contains the text features learned by the model and the contextual semantic information contained in the text. A short illustrative sketch of the multi-head attention computation is given after these steps.
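For clarity, a minimal PyTorch-style sketch of the scaled dot-product and multi-head attention of steps C3 and C4 is given below; the dimensions, class name and omission of the Add & Norm and feed-forward layers (step C5) are illustrative assumptions, not the ERNIE implementation itself.

```python
import math
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    """Steps C3-C4: per-head Q/K/V projections, scaled dot-product
    attention, concatenation of the heads and the final W_0 projection."""
    def __init__(self, d_model: int = 768, n_heads: int = 12):
        super().__init__()
        assert d_model % n_heads == 0
        self.d_k = d_model // n_heads
        self.n_heads = n_heads
        self.w_q = nn.Linear(d_model, d_model)
        self.w_k = nn.Linear(d_model, d_model)
        self.w_v = nn.Linear(d_model, d_model)
        self.w_0 = nn.Linear(d_model, d_model)   # maps the heads back to d_model

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model) word embeddings x_it = W_e * w_it
        b, n, _ = x.shape
        def split(t):                             # (batch, heads, seq_len, d_k)
            return t.view(b, n, self.n_heads, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(x)), split(self.w_k(x)), split(self.w_v(x))
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.d_k)  # QK^T / sqrt(d_k)
        attn = torch.softmax(scores, dim=-1)
        heads = attn @ v                                          # weighted sum with V
        heads = heads.transpose(1, 2).reshape(b, n, -1)           # Concat(head_1..head_h)
        return self.w_0(heads)
```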
Preferably, the step B2 of building the gated-network AT-GRU module with an attention mechanism comprises the following steps:
D1: the attention-layer code is written according to the single self-attention formula given above;
D2: the bidirectional gating mechanism BiGRU comprises a reset gate and an update gate; the reset gate screens out part of the state information of the previous moment for the candidate state, and the update gate decides how much of the historical state is retained in the current state, as in the formula:
r_t = δ(W_r·x_t + U_r·h_(t-1));
where δ is the activation function, x_t is the current input, h_(t-1) is the hidden-layer output at the previous moment, i.e. the historical state, and W_r and U_r are weight matrices;
D3: the bidirectional gated network trains a forward GRU and a backward GRU, respectively, over the text-context features; for the i-th sentence the forward and backward operations give the forward hidden state h→_i and the backward hidden state h←_i:
h→_i = GRU_fw(x_i, h→_(i-1)), h←_i = GRU_bw(x_i, h←_(i+1));
the sentence encoding representation is obtained from the forward and backward hidden states:
h_i = [h→_i ; h←_i];
linking the forward and backward training, the update of the current gate state is determined by a new candidate u_i created by the tanh layer:
u_i = tanh(W_s·h_i + b_s);
the attention mechanism that computes the word weights is added to the gated network, as shown in fig. 5, to form the AT-GRU module, as shown in fig. 6:
α_i = exp(u_i^T·u_s) / Σ_j exp(u_j^T·u_s), s = Σ_i α_i·h_i,
where u_s is a trainable context vector; the clause hidden information acting on the semantic representation is obtained through the attention mechanism, and the clause information is aggregated to obtain the representation information of all sentences. An illustrative module sketch follows.
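A compact PyTorch sketch of the AT-GRU module of steps D1 to D3 (a bidirectional GRU whose hidden states are pooled by an attention layer with a trainable context vector) might look as follows; the hidden sizes and names are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class ATGRU(nn.Module):
    """Bidirectional GRU with attention pooling:
    u_i = tanh(W_s h_i + b_s), weights alpha_i from a context vector u_s."""
    def __init__(self, input_size: int = 768, hidden_size: int = 256):
        super().__init__()
        self.bigru = nn.GRU(input_size, hidden_size, batch_first=True,
                            bidirectional=True)
        self.w_s = nn.Linear(2 * hidden_size, 2 * hidden_size)
        self.u_s = nn.Parameter(torch.randn(2 * hidden_size))  # context vector

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, input_size) word-vector features from the encoder
        h, _ = self.bigru(x)                          # h_i = [h_fw ; h_bw]
        u = torch.tanh(self.w_s(h))                   # u_i = tanh(W_s h_i + b_s)
        alpha = torch.softmax(u @ self.u_s, dim=1)    # attention weights alpha_i
        return (alpha.unsqueeze(-1) * h).sum(dim=1)   # weighted sum over the sequence
```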
Preferably, in the step B3, combining the ERNIE pre-trained model with the AT-GRU module to obtain the ERNIE_AT-GRU model comprises the following steps:
E1: the three-dimensional word-vector feature matrix fused with the multi-head attention mechanism, obtained from the pre-trained model, is reshaped into a dimension that can be input into the gated network, and the data are sent into the gated network with the attention mechanism;
E2: a Dropout layer is added after the gated-network layer to randomly ignore a preset number of neurons and prevent the model from over-fitting;
E3: a fully connected layer is attached, speech step recognition is performed with Softmax, and the classification label is output (an illustrative sketch of the combined model follows).
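Putting steps E1 to E3 together, one possible, purely illustrative wiring of an ERNIE-style encoder, the AT-GRU module sketched earlier, Dropout and the final classifier is given below; the number of speech-step categories, the hidden sizes, and the assumption that the encoder returns token-level features of width 768 are placeholders rather than the actual implementation.

```python
import torch
import torch.nn as nn

class ErnieATGRU(nn.Module):
    """Steps E1-E3: encoder features -> AT-GRU -> Dropout -> fully connected
    layer; softmax over the logits yields the speech-step label."""
    def __init__(self, encoder: nn.Module, hidden_size: int = 256,
                 num_moves: int = 5, dropout: float = 0.3):
        super().__init__()
        self.encoder = encoder                  # any module returning (B, T, 768)
        self.at_gru = ATGRU(input_size=768, hidden_size=hidden_size)  # sketched above
        self.dropout = nn.Dropout(dropout)      # randomly ignores neurons (E2)
        self.classifier = nn.Linear(2 * hidden_size, num_moves)

    def forward(self, input_ids, attention_mask):
        features = self.encoder(input_ids, attention_mask)   # word vectors (E1)
        pooled = self.at_gru(features)
        logits = self.classifier(self.dropout(pooled))
        # softmax(logits).argmax(-1) gives the predicted speech-step label (E3)
        return logits
```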
Preferably, the step S4 comprises the following steps:
F1: the single-semantic data are input into the model, the pre-trained model interface is called to perform text word segmentation, and the segmented words are vectorized according to the mapping of the pre-trained model dictionary;
F2: the training data and test data are processed in batches according to the batch value; each sentence is padded or truncated to the length pad_size and a 0/1 mask is applied to the text; the label numbers and the mask results are merged and saved into a pkl file for convenient program reading, and at loading time they are loaded together into a DataFrame tabular data structure;
F3: the vectorized DataFrame data are input into the ERNIE_AT-GRU model in batches for training; label prediction is performed on the test data of each round through forward computation and back propagation, and the cross-entropy used in back propagation serves as the loss function to optimize the model:
L = -Σ_(i=1..D) Σ_(c=1..C) y_(ic)·log(p_(ic));
where D is the size of the training data, C is the number of categories, y_(ic) is the text-data label and p_(ic) is the model's predicted probability; the label of the best prediction result is determined through repeated parameter tuning to obtain a well-classified speech step recognition result and the test label data, i.e. the speech-step classification result, and the model is analysed through the loss changes during its operation. A schematic training loop is sketched below.
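A schematic training loop corresponding to step F3 (mini-batch training with cross-entropy loss and a speech-step recognition test after every round) is sketched below; the data loading details (pad_size, 0/1 masks, pkl files) are reduced to pre-built tensors in the loaders, and all names and hyper-parameters are placeholders rather than the authors' code.

```python
import torch
import torch.nn as nn

def train(model, train_loader, test_loader, epochs: int = 10, lr: float = 2e-5):
    """Mini-batch training with cross-entropy, evaluating after every round."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.CrossEntropyLoss()              # the loss L given above

    for epoch in range(epochs):
        model.train()
        for input_ids, mask, labels in train_loader:
            input_ids, mask, labels = (t.to(device) for t in (input_ids, mask, labels))
            optimizer.zero_grad()
            loss = criterion(model(input_ids, mask), labels)
            loss.backward()                        # back propagation
            optimizer.step()

        model.eval()
        correct = total = 0
        with torch.no_grad():                      # per-round speech-step test (S4)
            for input_ids, mask, labels in test_loader:
                input_ids, mask, labels = (t.to(device) for t in (input_ids, mask, labels))
                preds = model(input_ids, mask).argmax(dim=-1)
                correct += (preds == labels).sum().item()
                total += labels.numel()
        print(f"round {epoch + 1}: test accuracy {correct / total:.4f}")
```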
The invention further provides a speech step recognition system based on pre-training and a gated neural network, which comprises:
an acquisition module, used for collecting the data in the target text segment, preprocessing it, and labeling it according to preset speech steps;
a processing module, used for screening and splitting the long, difficult complex sentences in the target text segment;
a model-building module, used for building the automatic speech step recognition model based on ERNIE_AT-GRU;
and an output module, used for inputting the split data into the ERNIE_AT-GRU model for training and performing speech step recognition tests on the test data over the training rounds to obtain the speech step recognition result.
In yet another embodiment of the present invention, a computer device is provided that comprises a processor and a memory for storing a computer program comprising program instructions, the processor being configured to execute the program instructions stored in the computer storage medium. The processor may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on; it is the computing and control core of the terminal and is adapted to load and execute one or more instructions, in particular to load and execute one or more instructions in a computer storage medium so as to implement the corresponding method flow or function. The processor of the embodiment of the invention can be used to run the speech step recognition method based on pre-training and a gated neural network.
In yet another embodiment of the present invention, a storage medium, specifically a computer readable storage medium (Memory), is a Memory device in a computer device, for storing a program and data. It is understood that the computer readable storage medium herein may include both built-in storage media in a computer device and extended storage media supported by the computer device. The computer-readable storage medium provides a storage space storing an operating system of the terminal. Also stored in the memory space are one or more instructions, which may be one or more computer programs (including program code), adapted to be loaded and executed by the processor. The computer readable storage medium herein may be a high-speed RAM memory or a non-volatile memory (non-volatile memory), such as at least one magnetic disk memory. One or more instructions stored in a computer-readable storage medium may be loaded and executed by a processor to implement the corresponding steps in the embodiments described above with respect to a pre-training and gated neural network-based speech step recognition method.
It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced with equivalents; such modifications and substitutions do not depart from the spirit of the technical solutions according to the embodiments of the present invention.

Claims (10)

1. A speech step recognition method based on pre-training and a gated neural network, characterized by comprising the following steps:
S1: collecting the data in a target text segment, preprocessing the data, and labeling it according to preset speech steps;
S2: screening and splitting the long, difficult complex sentences in the target text segment;
S3: building an automatic speech step recognition model based on ERNIE_AT-GRU;
S4: inputting the split data into the ERNIE_AT-GRU model for training, and performing speech step recognition tests on the test data over the training rounds to obtain the speech step recognition result.
2. The speech step recognition method based on pre-training and a gated neural network according to claim 1, wherein in the step S1 the collected target text is converted to a unified text format, formatting symbols such as "\n", "\t" and blank spaces in the text are cleaned away, and the complete text content of the original data is retained.
3. The speech step recognition method based on pre-training and a gated neural network according to claim 1, wherein in the step S2 dependency syntax analysis is performed on the data of the target text by using the LTP tool, and whether a coordination (COO) relation exists in a sentence is identified; discrimination and splitting of the long, difficult complex sentences are realized according to the obtained coordination-relation marks to obtain single-semantic data, and the single-semantic data are divided into training data and test data in a ratio of 8:2.
4. The speech step recognition method based on pre-training and a gated neural network according to claim 3, wherein the LTP tool performs dependency syntax analysis and discriminates whether a coordination relation exists in a sentence according to the COO mark, comprising the following steps:
A1: the LTP tool performs word segmentation, part-of-speech tagging and dependency syntax analysis on the data of the target text;
A2: the obtained data are integrated into a format that is convenient to process, S = (word, part-of-speech tag, (word node, parent node, dependency-relation mark));
A3: the integrated data are traversed, sentences in which the parent node of a word is the root node and the dependency relation is marked COO are obtained, and the semantically complex sentences meeting this condition are stored;
A4: the semantically complex sentences are traversed, and each qualifying complex sentence is split at the comma preceding the coordination relation to obtain single-semantic clauses;
building the automatic speech step recognition model in the step S3 comprises the following steps:
B1: constructing the ERNIE pre-trained model, whose Transformer-XL feature extractor fuses text semantics with a multi-head self-attention mechanism to obtain a word-vector feature matrix that incorporates the multi-head attention mechanism;
B2: building the gated-network AT-GRU module with an attention mechanism, feeding the word-vector matrix obtained from the pre-trained model into a bidirectional gated network to learn contextual text features, and connecting the attention mechanism to focus on the information that is important for text classification;
B3: combining the ERNIE pre-trained model with the AT-GRU module to obtain the ERNIE_AT-GRU model;
constructing the ERNIE pre-trained model in the step B1 comprises the following steps:
C1: a calling interface for the pre-trained model is written, and the information required by the pre-trained model, such as its pre-trained parameters, is loaded; the ERNIE pre-trained model uses a three-stage masking scheme at the single-word, phrase and entity levels to obtain the complete semantics of words, phrases and entities;
C2: the word sequence X_i = {w_i1, w_i2, ..., w_iN} produced by the three-stage mask is input into the Transformer-XL encoder; through the word-embedding step x_it = W_e·w_it, t ∈ [1, N], where W_e is the weight matrix of the Embedding layer, the high-dimensional sparse word vectors are converted into a low-dimensional dense word-vector matrix, i.e. the word-embedding vectors of each sentence;
C3: for a single self-attention head, three weight matrices W_q, W_k and W_v are learned; multiplying the word-embedding vectors by each of them yields the matrices Q, K and V, which express the correlation between the current word and the other words in the sentence; to prevent the result from becoming too large, the dot products are divided by √d_k, where d_k is the dimension of the Q (or K) vectors and the relative-distance term within a window is a learnable variable; the result is then normalized by a Softmax function to obtain each word's normalized correlation with the other words, and this is multiplied by the V matrix, i.e. a weighted sum, to obtain the new vector encoding of each word, according to the formula:
Attention(Q, K, V) = Softmax(Q·K^T / √d_k)·V;
C4: the Q, K, V matrices computed by each single self-attention head are combined head by head, and the concatenation is multiplied by the weight matrix W_0, which linearly maps the concatenated heads back to the original dimension to give the multi-head matrix; the process can be expressed as:
head_i = Attention(Q_i, K_i, V_i), i = 1, ..., h;
MultiHead(Q, K, V) = Concat(head_1, ..., head_h)·W_0;
C5: the matrix obtained from the multi-head computation is input into an Add & Norm layer for addition and normalization of the self-attention input and output, and then passes through the feed-forward neural network of the fully connected layer and another Add & Norm layer to obtain the word-vector feature matrix fused with the multi-head attention mechanism, this matrix containing the text features learned by the model and the contextual semantic information contained in the text.
5. The speech step recognition method based on pre-training and a gated neural network according to claim 4, wherein building the gated-network AT-GRU module with an attention mechanism in the step B2 comprises the following steps:
D1: the attention-layer code is written according to the single self-attention formula;
D2: the bidirectional gating mechanism BiGRU comprises a reset gate and an update gate; the reset gate screens out part of the state information of the previous moment for the candidate state, and the update gate decides how much of the historical state is retained in the current state, as in the formula:
r_t = δ(W_r·x_t + U_r·h_(t-1));
where δ is the activation function, x_t is the current input, h_(t-1) is the hidden-layer output at the previous moment, i.e. the historical state, and W_r and U_r are weight matrices;
D3: the bidirectional gated network trains a forward GRU and a backward GRU, respectively, over the text-context features; for the i-th sentence the forward and backward operations give the forward hidden state h→_i and the backward hidden state h←_i:
h→_i = GRU_fw(x_i, h→_(i-1)), h←_i = GRU_bw(x_i, h←_(i+1));
the sentence encoding representation is obtained from the forward and backward hidden states:
h_i = [h→_i ; h←_i];
linking the forward and backward training, the update of the current gate state is determined by a new candidate u_i created by the tanh layer:
u_i = tanh(W_s·h_i + b_s);
the attention mechanism that computes the word weights is added to the gated network to form the AT-GRU module:
α_i = exp(u_i^T·u_s) / Σ_j exp(u_j^T·u_s), s = Σ_i α_i·h_i,
where u_s is a trainable context vector; the clause hidden information acting on the semantic representation is obtained through the attention mechanism, and the clause information is aggregated to obtain the representation information of all sentences.
6. The speech step recognition method based on pre-training and a gated neural network according to claim 4, wherein combining the ERNIE pre-trained model with the AT-GRU module in the step B3 to obtain the ERNIE_AT-GRU model comprises the following steps:
E1: the three-dimensional word-vector feature matrix fused with the multi-head attention mechanism, obtained from the pre-trained model, is reshaped into a dimension that can be input into the gated network, and the data are sent into the gated network with the attention mechanism;
E2: a Dropout layer is added after the gated-network layer to randomly ignore a preset number of neurons and prevent the model from over-fitting;
E3: a fully connected layer is attached, speech step recognition is performed with Softmax, and the classification label is output.
7. The speech step recognition method based on pre-training and a gated neural network according to claim 1, wherein the step S4 comprises the following steps:
F1: the single-semantic data are input into the model, the pre-trained model interface is called to perform text word segmentation, and the segmented words are vectorized according to the mapping of the pre-trained model dictionary;
F2: the training data and test data are processed in batches according to the batch value; each sentence is padded or truncated to the length pad_size and a 0/1 mask is applied to the text; the label numbers and the mask results are merged and saved into a pkl file for convenient program reading, and at loading time they are loaded together into a DataFrame tabular data structure;
F3: the vectorized DataFrame data are input into the ERNIE_AT-GRU model in batches for training; label prediction is performed on the test data of each round through forward computation and back propagation, and the cross-entropy used in back propagation serves as the loss function to optimize the model:
L = -Σ_(i=1..D) Σ_(c=1..C) y_(ic)·log(p_(ic));
where D is the size of the training data, C is the number of categories, y_(ic) is the text-data label and p_(ic) is the model's predicted probability; the label of the best prediction result is determined through repeated parameter tuning to obtain a well-classified speech step recognition result and the test label data, i.e. the speech-step classification result, and the model is analysed through the loss changes during its operation.
8. A speech step recognition system based on pre-training and a gated neural network, characterized in that it implements the speech step recognition method based on pre-training and a gated neural network according to any one of claims 1-7 and comprises:
an acquisition module, used for collecting the data in the target text segment, preprocessing it, and labeling it according to preset speech steps;
a processing module, used for screening and splitting the long, difficult complex sentences in the target text segment;
a model-building module, used for building the automatic speech step recognition model based on ERNIE_AT-GRU;
and an output module, used for inputting the split data into the ERNIE_AT-GRU model for training and performing speech step recognition tests on the test data over the training rounds to obtain the speech step recognition result.
9. A computer device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the pre-training and gated neural network based speech step recognition method according to any of claims 1-7 when the computer program is executed.
10. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the pre-training and gated neural network based speech step recognition method according to any of claims 1-7.
CN202310533340.XA 2023-05-11 2023-05-11 Method, system, equipment and medium for recognizing speech steps based on pre-training and gating neural network Pending CN116561632A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310533340.XA CN116561632A (en) 2023-05-11 2023-05-11 Method, system, equipment and medium for recognizing speech steps based on pre-training and gating neural network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310533340.XA CN116561632A (en) 2023-05-11 2023-05-11 Method, system, equipment and medium for recognizing speech steps based on pre-training and gating neural network

Publications (1)

Publication Number Publication Date
CN116561632A 2023-08-08

Family

ID=87496049

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310533340.XA Pending CN116561632A (en) 2023-05-11 2023-05-11 Method, system, equipment and medium for recognizing speech steps based on pre-training and gating neural network

Country Status (1)

Country Link
CN (1) CN116561632A (en)


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination