CN116543768A - Model training method, voice recognition method and device, equipment and storage medium - Google Patents
- Publication number: CN116543768A (application CN202310633239.1A)
- Authority: CN (China)
- Prior art keywords: sample, voice, vector, data, model
- Legal status: Pending (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G10L15/26 — Speech to text systems
- G10L15/02 — Feature extraction for speech recognition; selection of recognition unit
- G10L15/063 — Training of speech recognition systems
- G10L15/16 — Speech classification or search using artificial neural networks
- G10L15/1822 — Parsing for meaning understanding
- G10L15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
- G10L25/24 — Extracted parameters being the cepstrum
- G10L2015/0631 — Creating reference templates; clustering
- G06N3/0464 — Convolutional networks [CNN, ConvNet]
- G06N3/0495 — Quantised networks; sparse networks; compressed networks
- G06N3/084 — Backpropagation, e.g. using gradient descent
- Y02T10/40 — Engine management systems
Abstract
The application provides a model training method, a voice recognition method and device, equipment and a storage medium, belonging to the field of financial science and technology. The method comprises the following steps: inputting sample voice data and sample text data into a neural network model comprising an encoding network and a decoding network; performing feature extraction on the sample voice data through a convolution layer of the coding network to obtain a sample voice feature vector; performing attention calculation on the sample voice feature vector through a sparse attention layer of the coding network to obtain a sample voice characterization vector; encoding the sample text data through the decoding network to obtain a sample text encoding vector; reconstructing the sample text encoding vector and the sample voice characterization vector through the attention layer of the decoding network to obtain a sample relation feature vector; and updating parameters of the neural network model through a loss function and the sample relation feature vector to obtain a voice recognition model, which can improve the voice recognition accuracy of the model in financial scenarios.
Description
Technical Field
The present disclosure relates to the field of financial science and technology, and in particular, to a model training method, a speech recognition method, a device, equipment and a storage medium.
Background
With the development of network, communication and computer technologies, enterprises are becoming increasingly electronic, remote, virtual and networked, and ever more online enterprises are emerging. Communication between clients and enterprises has likewise evolved from face-to-face consultation and interaction to exchanges over remote channels such as the network and the telephone. Against this background, intelligent voice interaction is widely applied in fields such as finance, logistics and customer service.
Currently, a financial transaction platform based on voice interaction handles a large volume of telephone voice services every day, covering diversified client demands including pre-sale consultation, purchase, after-sale support, complaints and the like. During telephone service, the intelligent customer service robot needs to deal with different service objects and respond appropriately. If the intelligent customer service cannot accurately identify the demands that a service object expresses in the voice data during a dialogue, the service response fed back on the basis of that voice data cannot meet the object's demands, which affects service quality and object satisfaction.
The speech recognition task mainly converts speech audio into text form. Most current voice recognition methods rely on a neural network model for recognition, but an ordinary neural network model cannot focus on the useful voice information in the audio and suffers from a poor training effect, so how to improve the training effect of the model has become an urgent technical problem.
Disclosure of Invention
The main purpose of the embodiments of the present application is to provide a model training method, a voice recognition method, a device, equipment and a storage medium, which aim to improve the training effect of the model.
To achieve the above object, a first aspect of an embodiment of the present application proposes a training method for a model, the training method including:
acquiring sample voice data and sample text data corresponding to the sample voice data;
inputting the sample voice data and the sample text data into a preset neural network model, wherein the neural network model comprises an encoding network and a decoding network;
performing feature extraction on the sample voice data through a convolution layer of the coding network to obtain a sample voice feature vector;
performing attention calculation on the sample voice feature vector through a sparse attention layer of the coding network to obtain a sample voice characterization vector;
encoding the sample text data through the decoding network to obtain a sample text encoding vector;
reconstructing the sample text coding vector and the sample voice characterization vector through an attention layer of the decoding network to obtain a sample relation feature vector, wherein the sample relation feature vector is used for characterizing the correlation degree between the sample text coding vector and the sample voice characterization vector;
and updating parameters of the neural network model through a preset loss function and the sample relation feature vector so as to train the neural network model and obtain a voice recognition model.
In some embodiments, the feature extraction of the sample voice data by the convolution layer of the coding network to obtain a sample voice feature vector includes:
carrying out convolution processing on the sample voice data through the convolution layer to obtain a sample voice convolution vector;
performing downsampling processing on the sample voice convolution vector through the convolution layer to obtain sample voice sampling characteristics;
And carrying out position coding on the sample voice sampling characteristics to obtain the sample voice characteristic vector.
In some embodiments, the performing attention calculation on the sample speech feature vector by the sparse attention layer of the coding network to obtain a sample voice characterization vector includes:
normalizing the sample voice feature vector through a preset function of the sparse attention layer to obtain sample probability features;
screening the sample voice feature vectors according to the sample probability features and a preset probability threshold to obtain candidate voice feature vectors, wherein the candidate voice feature vectors are sample voice feature vectors with the sample probability features larger than the probability threshold;
and performing attention calculation on the candidate voice feature vectors according to preset weight parameters to obtain the sample voice characterization vectors.
In some embodiments, the reconstructing, by the attention layer of the decoding network, the sample text encoding vector and the sample speech characterization vector to obtain a sample relationship feature vector includes:
performing self-attention calculation on the sample text coding vector through the attention layer to obtain a sample text characterization vector;
Performing attention calculation on the sample voice characterization vector and the sample text characterization vector through the attention layer to obtain an initial relation feature vector;
and mapping the initial relation feature vector to a preset vector space according to a preset vector feature dimension to obtain the sample relation feature vector.
In some embodiments, the parameter updating of the neural network model by the preset loss function and the sample relation feature vector is performed to train the neural network model to obtain a speech recognition model, which includes:
carrying out loss calculation on the sample relation feature vector through the loss function to obtain a target loss value;
and carrying out back propagation on the target loss value to update the model parameters of the neural network model so as to obtain the voice recognition model.
To achieve the above object, a second aspect of the embodiments of the present application proposes a speech recognition method, the method including:
acquiring target voice data to be processed;
inputting the target voice data into a voice recognition model for recognition processing to obtain target text data, wherein the target text data is used for representing voice content of the target voice data; the speech recognition model is trained according to the training method of the first aspect.
To achieve the above object, a third aspect of the embodiments of the present application proposes a training device for a model, the training device including:
the sample data acquisition module is used for acquiring sample voice data and sample text data corresponding to the sample voice data;
the data input module is used for inputting the sample voice data and the sample text data into a preset neural network model, wherein the neural network model comprises an encoding network and a decoding network;
the characteristic extraction module is used for extracting the characteristics of the sample voice data through a convolution layer of the coding network to obtain a sample voice characteristic vector;
the computing module is used for carrying out attention computation on the sample voice feature vector through a sparse attention layer of the coding network to obtain a sample voice characterization vector;
the encoding module is used for encoding the sample text data through the decoding network to obtain sample text encoding vectors;
the reconstruction module is used for reconstructing the sample text coding vector and the sample voice characterization vector through the attention layer of the decoding network to obtain a sample relation feature vector, wherein the sample relation feature vector is used for characterizing the correlation degree between the sample text coding vector and the sample voice characterization vector;
And the training module is used for carrying out parameter updating on the neural network model through a preset loss function and the sample relation feature vector so as to train the neural network model and obtain a voice recognition model.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a speech recognition apparatus, the apparatus comprising:
the target voice data acquisition module is used for acquiring target voice data to be processed;
the voice recognition module is used for inputting the target voice data into a voice recognition model for recognition processing to obtain target text data, wherein the target text data is used for representing voice content of the target voice data; the speech recognition model is trained according to the training device of the third aspect.
To achieve the above object, a fifth aspect of the embodiments of the present application proposes an electronic device, which includes a memory, a processor, where the memory stores a computer program, and the processor implements the training method described in the first aspect or the method described in the second aspect when executing the computer program.
To achieve the above object, a sixth aspect of the embodiments of the present application proposes a computer readable storage medium storing a computer program which, when executed by a processor, implements the training method according to the first aspect or the method according to the second aspect.
According to the model training method, the voice recognition method, the training device, the voice recognition device, the electronic equipment and the computer-readable storage medium provided by the application, sample voice data and corresponding sample text data are acquired; the sample voice data and the sample text data are input into a preset neural network model comprising an encoding network and a decoding network; feature extraction is performed on the sample voice data through a convolution layer of the coding network to obtain a sample voice feature vector, and attention calculation is performed on the sample voice feature vector through a sparse attention layer of the coding network to obtain a sample voice characterization vector. Further, the sample text data is encoded through the decoding network to obtain a sample text encoding vector, and the sample text encoding vector and the sample voice characterization vector are reconstructed through the attention layer of the decoding network to obtain a sample relation feature vector that characterizes the degree of correlation between them; this correlation reflects the recognition performance of the model and improves training efficiency. Finally, the parameters of the neural network model are updated through a preset loss function and the sample relation feature vector; the internal parameters are continuously adjusted until the recognition performance meets the training requirement, yielding a voice recognition model that can recognize the text content of target voice data. This improves the voice recognition accuracy of the model, so that the intelligent customer service robot can more accurately recognize the demands expressed in a service object's voice data during a conversation, provide targeted responses and service feedback, and effectively improve the quality and effectiveness of dialogues in financial transactions, thereby realizing intelligent voice dialogue services and improving service quality, customer satisfaction and business yield.
Drawings
Fig. 1 is a flowchart of a model training method provided by an embodiment of the present application;
Fig. 2 is a flowchart of step S103 in Fig. 1;
Fig. 3 is a flowchart of step S104 in Fig. 1;
Fig. 4 is a flowchart of step S106 in Fig. 1;
Fig. 5 is a flowchart of step S107 in Fig. 1;
Fig. 6 is a flowchart of a speech recognition method provided by an embodiment of the present application;
Fig. 7 is a schematic structural diagram of a model training device provided in an embodiment of the present application;
Fig. 8 is a schematic structural diagram of a voice recognition device provided in an embodiment of the present application;
Fig. 9 is a schematic diagram of the hardware structure of an electronic device provided in an embodiment of the present application.
Detailed Description
In order to make the objects, technical solutions and advantages of the present application more apparent, the present application will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the present application.
It should be noted that although functional blocks are divided in the device diagrams and a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed with a different block division, or in a different order, than in the diagrams or flowcharts. The terms "first", "second" and the like in the description, the claims and the above-described figures are used to distinguish between similar elements and not necessarily to describe a particular sequential or chronological order.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the present application only and is not intended to be limiting of the present application.
First, several terms referred to in this application are explained:
artificial intelligence (artificial intelligence, AI): is a new technical science for researching and developing theories, methods, technologies and application systems for simulating, extending and expanding the intelligence of people; artificial intelligence is a branch of computer science that attempts to understand the nature of intelligence and to produce a new intelligent machine that can react in a manner similar to human intelligence, research in this field including robotics, language recognition, image recognition, natural language processing, and expert systems. Artificial intelligence can simulate the information process of consciousness and thinking of people. Artificial intelligence is also a theory, method, technique, and application system that utilizes a digital computer or digital computer-controlled machine to simulate, extend, and expand human intelligence, sense the environment, acquire knowledge, and use knowledge to obtain optimal results.
Natural language processing (NLP): a branch of artificial intelligence at the intersection of computer science and linguistics, often referred to as computational linguistics, concerned with processing, understanding and applying human languages (e.g., Chinese, English). Natural language processing includes parsing, semantic analysis, discourse understanding and the like. It is commonly used in machine translation, recognition of handwritten and printed characters, speech recognition and text-to-speech conversion, intent recognition, information extraction and filtering, text classification and clustering, and public opinion analysis and opinion mining, and it involves data mining, machine learning, knowledge acquisition, knowledge engineering, artificial intelligence research, linguistic research related to language computation, and the like.
Information extraction (Information Extraction): a text processing technology that extracts factual information of specified types, such as entities, relations and events, from natural language text and outputs it as structured data. Information extraction is a technique for extracting specific information from text data. Text data is made up of specific units such as sentences, paragraphs and chapters, and text information is made up of smaller units such as words, phrases, sentences and paragraphs, or combinations of these units. Extracting noun phrases, person names, place names and the like from text data is text information extraction; of course, the information extracted by this technology can be of various types.
Mel-frequency cepstral coefficients (MFCC): a set of key coefficients used to build the mel-frequency cepstrum. From segments of an audio signal, a set of cepstral coefficients sufficient to represent the signal is obtained; mel-frequency cepstral coefficients are the coefficients of the cepstrum derived from the mel-scale spectrum. Unlike the ordinary cepstrum, the defining feature of the mel-frequency cepstrum is that its frequency bands are evenly spaced on the mel scale, which approximates the nonlinear human auditory system more closely than the linearly spaced bands of the ordinary cepstrum. For example, the mel-frequency cepstrum is often used in audio compression techniques.
Fourier transform: a representation that expresses a function satisfying certain conditions as a linear combination or integral of trigonometric functions (sine and/or cosine functions). In different research areas, the Fourier transform has many variants, such as the continuous Fourier transform and the discrete Fourier transform.
Transformer layer: a neural network containing an embedding layer (also called an input embedding layer) and at least one Transformer layer (N Transformer layers, N being an integer greater than 0). The embedding layer comprises an input embedding layer and a positional encoding layer: in the input embedding layer, word embedding is performed on each word in the current input to obtain its word embedding vector; the positional encoding layer obtains the position of each word in the current input and generates a position vector for that position. Each Transformer layer comprises, in sequence, an attention layer, an add & norm (sum and normalization) layer, a feed-forward layer, and another add & norm layer. In the embedding layer, the current input is embedded to obtain a plurality of feature vectors. In the attention layer, P input vectors are obtained from the layer above the Transformer layer; taking any first input vector among the P input vectors as a center, the intermediate vector corresponding to that first input vector is obtained based on the degree of association between it and each input vector within a preset attention window, producing the P intermediate vectors corresponding to the P input vectors. In the pooling layer, the P intermediate vectors are merged into Q output vectors, and the output vectors obtained at the last of the at least one Transformer layer serve as the feature representation of the current input. The current input may be text, such as a passage or a sentence, in Chinese, English or another language; after the embedding layer receives the current input, it embeds each word in the current input to obtain that word's feature vector.
Encoding (Encoder): converts the input sequence into a fixed-length vector.
Decoding (Decoder): converts the previously generated fixed-length vector into an output sequence; the input sequence may be text, speech, images or video, and the output sequence may be text or images.
Back propagation: the general principle is to feed the training data into the input layer of a neural network, pass it through the hidden layers to the output layer, and output a result. Because the network's output differs from the actual result, the error between the estimate and the actual value is calculated and propagated backwards from the output layer through the hidden layers to the input layer; during back propagation, the values of the parameters are adjusted according to the error, and the process is iterated until convergence.
Softmax function: the Softmax function is a normalized exponential function that "compresses" a K-dimensional vector z of arbitrary real numbers into another K-dimensional real vector σ(z) such that each element lies in the interval (0, 1) and all elements sum to 1; it is commonly used in multi-class classification problems.
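Written out explicitly, the standard formula behind this definition is

$$\sigma(z)_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \qquad i = 1, \ldots, K,$$

so every component lies in $(0, 1)$ and the components sum to 1.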
As noted in the background above, a financial transaction platform based on voice interaction handles a large volume of telephone voice services every day, and the intelligent customer service robot must accurately identify the demands a service object expresses in voice data, or service quality and object satisfaction suffer. For example, when products are recommended through a virtual character and the service object raises a query that needs consultation and communication, the current virtual character can only search for answers among preset options; it cannot accurately identify the appeal the service object expresses in the voice data, producing an "irrelevant answer" phenomenon, so the voice interaction accuracy of the virtual character is low. The speech recognition task mainly converts speech audio into text form; most current voice recognition methods rely on a neural network model for recognition, but an ordinary neural network model cannot focus on the useful voice information in the audio and trains poorly, so how to improve the training effect of the model has become an urgent technical problem.
Based on this, the embodiments of the present application provide a model training method, a voice recognition method, a model training device, a voice recognition device, an electronic device and a storage medium, aiming to improve the training effect of the model.
The training method, the voice recognition method, the training device, the voice recognition device, the electronic device and the storage medium for the model provided in the embodiments of the present application are specifically described through the following embodiments, and the training method for the model in the embodiments of the present application is described first.
The embodiments of the present application can acquire and process related data based on artificial intelligence technology. Artificial intelligence (AI) is the theory, method, technology and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
Artificial intelligence infrastructure technologies generally include technologies such as sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, mechatronics, and the like. The artificial intelligence software technology mainly comprises a computer vision technology, a robot technology, a biological recognition technology, a voice processing technology, a natural language processing technology, machine learning/deep learning and other directions.
The embodiment of the application provides a training method of a model, and relates to the technical field of artificial intelligence. The training method of the model provided by the embodiment of the application can be applied to the terminal, can also be applied to the server side, and can also be software running in the terminal or the server side. In some embodiments, the terminal may be a smart phone, tablet, notebook, desktop, etc.; the server side can be configured as an independent physical server, a server cluster or a distributed system formed by a plurality of physical servers, and a cloud server for providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, CDNs, basic cloud computing services such as big data and artificial intelligent platforms and the like; the software may be an application or the like of a training method for realizing the model, but is not limited to the above form.
The subject application is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
It should be noted that, in each specific embodiment of the present application, when related processing is required according to user information, user behavior data, user history data, user location information, and other data related to user identity or characteristics, permission or consent of the user is obtained first, and the collection, use, processing, and the like of these data comply with related laws and regulations and standards. In addition, when the embodiment of the application needs to acquire the sensitive personal information of the user, the independent permission or independent consent of the user is acquired through a popup window or a jump to a confirmation page or the like, and after the independent permission or independent consent of the user is explicitly acquired, necessary user related data for enabling the embodiment of the application to normally operate is acquired.
Fig. 1 is an optional flowchart of a training method of a model provided in an embodiment of the present application, where the method in fig. 1 may include, but is not limited to, steps S101 to S107.
Step S101, sample voice data and sample text data corresponding to the sample voice data are obtained;
step S102, inputting sample voice data and sample text data into a preset neural network model, wherein the neural network model comprises an encoding network and a decoding network;
Step S103, extracting characteristics of the sample voice data through a convolution layer of the coding network to obtain sample voice characteristic vectors;
step S104, performing attention calculation on the sample voice feature vector through a sparse attention layer of the coding network to obtain a sample voice characterization vector;
step S105, encoding the sample text data through a decoding network to obtain sample text encoding vectors;
step S106, reconstructing the sample text coding vector and the sample voice characterization vector through the attention layer of the decoding network to obtain a sample relation feature vector, wherein the sample relation feature vector is used for characterizing the correlation degree between the sample text coding vector and the sample voice characterization vector;
and step S107, updating parameters of the neural network model through a preset loss function and a sample relation feature vector so as to train the neural network model and obtain a voice recognition model.
Steps S101 to S107 illustrated in the embodiments of the present application acquire sample voice data and sample text data corresponding to the sample voice data; input the sample voice data and the sample text data into a preset neural network model comprising an encoding network and a decoding network; perform feature extraction on the sample voice data through a convolution layer of the coding network to obtain a sample voice feature vector; and perform attention calculation on the sample voice feature vector through a sparse attention layer of the coding network to obtain a sample voice characterization vector. Further, the sample text data is encoded through the decoding network to obtain a sample text encoding vector, and the sample text encoding vector and the sample voice characterization vector are reconstructed through the attention layer of the decoding network to obtain a sample relation feature vector characterizing the degree of correlation between them; this correlation reflects the recognition performance of the model and improves training efficiency. Finally, the parameters of the neural network model are updated through a preset loss function and the sample relation feature vector so as to train the neural network model into a voice recognition model; the internal parameters can be continuously adjusted until the recognition performance meets the training requirement, yielding a voice recognition model that can recognize the text content of target voice data and improving the training effect of the model.
In step S101 of some embodiments, a web crawler may be written and a data source set so that data can be crawled in a targeted manner to obtain sample voice data of a sample speaking object and corresponding sample text data. The data source may be various types of network platforms or social media, or specific audio databases; the sample voice data may be speech material of the sample speaking object, such as lecture recordings or chat conversations, and the text of the sample text data can characterize the voice content of the sample voice data. In this way, the sample voice data and sample text data can be obtained conveniently, improving data acquisition efficiency.
The above-mentioned sample speech data is mainly spectral data: the acquired raw audio data is subjected to a short-time Fourier transform and filtered through a mel-frequency filter bank to obtain mel-frequency cepstral feature data in spectral form, and this mel-frequency cepstral feature data is used as the voice data input to the model.
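As a rough illustration of this preprocessing, the sketch below computes mel-spectral and mel-cepstral features with the librosa library. The library choice and all parameter values (sampling rate, FFT size, hop length, filter counts) are illustrative assumptions; the patent does not name a specific toolkit.

```python
import librosa
import numpy as np

def extract_mel_features(wav_path: str, n_mels: int = 80) -> np.ndarray:
    """Raw audio -> short-time Fourier transform -> mel filter bank -> MFCC."""
    audio, sr = librosa.load(wav_path, sr=16000)           # raw audio data
    stft = librosa.stft(audio, n_fft=400, hop_length=160)  # short-time Fourier transform
    power = np.abs(stft) ** 2                              # power spectrum
    mel_fb = librosa.filters.mel(sr=sr, n_fft=400, n_mels=n_mels)
    log_mel = librosa.power_to_db(mel_fb @ power)          # mel-filtered spectral features
    return librosa.feature.mfcc(S=log_mel, n_mfcc=13)      # mel-frequency cepstral features
```

Either the log-mel spectrogram or the MFCC matrix can then serve as the spectral-form voice data described above.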
For example, in a financial transaction scenario, the sample voice data is audio data containing dialogues commonly used in the financial field, and in an insurance promotion scenario, the sample voice data is audio data describing the risks, costs, applicable population and the like of a certain insurance product. The sample text data may be text data containing proper nouns of the financial field, financial business template phrases, product descriptions of insurance products or financial products, common dialogues in the financial field, and the like.
In step S102 of some embodiments, the sample voice data and sample text data are input into a preset neural network model through a preset script or other computer program. The neural network model may be constructed based on a Transformer model and comprises an encoding network and a decoding network. The encoding network encodes the input voice data and extracts the voice content information within it, passing the extracted voice content information in encoded form to the decoding network; the decoding network decodes the encoded voice content information, extracts the text features corresponding to the voice data, and generates the corresponding text data, thereby recognizing the input voice data and obtaining its voice text data.
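To make the encoder-decoder structure concrete, here is a minimal PyTorch skeleton. It uses PyTorch's standard dense attention as a stand-in for the patent's sparse attention layer, and every size (mel channels, vocabulary size, model width, head and layer counts) is an illustrative assumption rather than the patent's exact architecture.

```python
import torch
import torch.nn as nn

class SpeechRecognitionModel(nn.Module):
    """Encoding network extracts voice content; decoding network consumes the
    text tokens plus the encoded voice features and emits per-token logits."""
    def __init__(self, n_mels: int = 80, vocab_size: int = 5000, d_model: int = 256):
        super().__init__()
        self.subsample = nn.Conv1d(n_mels, d_model, kernel_size=3, stride=2)
        self.encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True), num_layers=4)
        self.embed = nn.Embedding(vocab_size, d_model)
        self.decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, nhead=4, batch_first=True), num_layers=4)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, speech: torch.Tensor, text: torch.Tensor) -> torch.Tensor:
        x = self.subsample(speech).transpose(1, 2)   # (batch, frames, d_model)
        memory = self.encoder(x)                     # voice characterization vectors
        y = self.decoder(self.embed(text), memory)   # relation features per text token
        return self.out(y)                           # logits over the vocabulary
```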
Referring to fig. 2, in some embodiments, step S103 may include, but is not limited to, steps S201 to S203:
step S201, carrying out convolution processing on sample voice data through a convolution layer to obtain a sample voice convolution vector;
step S202, carrying out downsampling treatment on a sample voice convolution vector through a convolution layer to obtain sample voice sampling characteristics;
Step S203, position coding is carried out on the sample voice sampling characteristics to obtain sample voice characteristic vectors.
In step S201 of some embodiments, the sample voice data is convolved by the convolution layer to extract the voice content features in the sample voice data and obtain a sample voice convolution vector, which may be represented as a feature map. The convolution kernels of the convolution layer and their sizes may be set according to the actual situation; for example, the convolution layer contains at least one 3×3 convolution kernel, and the number of channels of the convolution layer may be 256 or 64, and so on.
In step S202 of some embodiments, the sample voice convolution vector is downsampled by the convolution layer so as to compress it. During downsampling, the maximum value or the average value at each feature position of the sample voice convolution vector can be selected as the feature value for that position, thereby reducing the feature parameters of the sample voice convolution vector and obtaining the sample voice sampling features.
In step S203 of some embodiments, the position encoding of the sample voice sampling features may be absolute or relative. Specifically, for absolute encoding, an absolute position code for each word vector of the sample voice sampling features is generated through sine and cosine functions, each word vector is position-marked according to its absolute position code, and the absolute position code serves as the position label of the word vector, yielding the sample voice feature vector. For relative encoding, the distance between every pair of word vectors of the sample voice sampling features is calculated (for example the Euclidean or Manhattan distance), and each pair of word vectors is assigned a relation number according to the magnitude of the distance; these relation numbers can represent the semantic order of the word vectors, yielding the sample voice feature vector.
Through the above steps S201 to S203, the voice content features in the sample voice data can be extracted more conveniently and compressed, and sample voice feature vectors conforming to basic grammatical structure are generated for subsequent model training, which improves the quality of the training samples and thus the training effect of the model.
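A minimal PyTorch sketch of steps S201 to S203 might look as follows; the kernel size, the max-pooling choice for downsampling, and the sinusoidal absolute-encoding scheme are assumptions consistent with, but not dictated by, the description above.

```python
import math
import torch
import torch.nn as nn

class ConvFrontEnd(nn.Module):
    def __init__(self, n_mels: int = 80, d_model: int = 256):
        super().__init__()
        self.conv = nn.Conv1d(n_mels, d_model, kernel_size=3, padding=1)  # S201: convolution
        self.pool = nn.MaxPool1d(kernel_size=2)  # S202: downsample, keeping the maximum value

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        x = self.pool(torch.relu(self.conv(mel))).transpose(1, 2)  # (batch, frames, d_model)
        return x + self.position_encoding(x.size(1), x.size(2))    # S203: absolute position codes

    @staticmethod
    def position_encoding(seq_len: int, d_model: int) -> torch.Tensor:
        """Absolute position codes generated from sine and cosine functions."""
        pos = torch.arange(seq_len).unsqueeze(1).float()
        div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
        pe = torch.zeros(seq_len, d_model)
        pe[:, 0::2] = torch.sin(pos * div)  # even dimensions: sine
        pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions: cosine
        return pe
```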
Referring to fig. 3, in some embodiments, step S104 may include, but is not limited to, steps S301 to S303:
step S301, carrying out normalization processing on the sample voice feature vector through a preset function of the sparse attention layer to obtain sample probability features;
step S302, screening the sample voice feature vectors according to the sample probability features and a preset probability threshold value to obtain candidate voice feature vectors, wherein the candidate voice feature vectors are sample voice feature vectors with the sample probability features larger than the probability threshold value;
step S303, attention calculation is carried out on the candidate speech feature vectors according to preset weight parameters, and sample speech characterization vectors are obtained.
In step S301 of some embodiments, the preset function may be the sparsemax function, which maps the sample speech feature vector to its Euclidean projection on the probability simplex, thereby normalizing the sample speech feature vector; the resulting point on the simplex reflects the sparsity of the vector and is used as its probability feature, giving the sample probability feature.
For example, the process of normalizing the sample speech feature vector by the sparsemax function of the sparse attention layer may be expressed as shown in equation (1):

$$\mathrm{sparsemax}(z) = \underset{p \in \Delta^J}{\arg\min} \; \lVert p - z \rVert^2 \tag{1}$$

where $z$ is the sample speech feature vector; $p$ is the point on the probability simplex, i.e. the sample probability feature; $\Delta^J = \{\, p \in \mathbb{R}^J \mid \mathbf{1}^\top p = 1,\ p \ge 0 \,\}$ is the $J$-dimensional probability simplex; $\mathbb{R}$ denotes the real numbers; and $J$ is the number of word vectors of the sample speech feature vector.
In step S302 of some embodiments, when the sample voice feature vectors are screened according to the sample probability features and the preset probability threshold, the sparsity of the sample voice feature vector is calculated from the probability simplex and the probability threshold.
The sparsified distribution may be expressed as $p^* = [p - \tau]_+$, where $[\cdot]_+$ denotes taking the larger of a number and 0 and $\tau$ is the probability threshold; in this way, sample probability features smaller than the probability threshold are forced to zero, while $\tau$ is chosen so that $\sum_j p^*_j = 1$ is still guaranteed.
Therefore, by comparing the sample probability features against the preset probability threshold, the sample voice feature vectors whose sample probability features are greater than the probability threshold can be conveniently screened out and taken as candidate voice feature vectors, while the sample voice feature vectors whose sample probability features are smaller than or equal to the probability threshold are discarded. The candidate voice feature vectors contain information highly correlated with voice recognition, whereas the discarded vectors contain redundant information irrelevant to voice recognition. Screening the sample voice feature vectors in this way retains the high-probability element information, so that in subsequent training the neural network model can pay more attention to this high-probability element information (i.e., the content of the candidate voice feature vectors), improving the learning performance and training effect of the model.
In step S303 of some embodiments, the preset weight parameters may be set according to the actual service requirements, without limitation. When attention calculation is performed on the candidate voice feature vectors through the attention mechanism and the weight parameters of the coding network, different attention weights are assigned to different candidate voice feature vectors, so that the attention distribution concentrates on the candidate voice feature vectors of higher importance. The candidate voice feature vectors are attention-weighted through the attention mechanism and the weight parameters, yielding the sample voice characterization vector, which contains rich semantic information characterizing the main content of the sample voice data.
Through the above steps S301 to S303, redundant information irrelevant to voice recognition in the sample voice data can be filtered out conveniently, effectively reducing the total amount of information, speeding up recognition, and improving the training effect of the model. Meanwhile, compared with the network structure of a common Transformer model, the neural network model of the embodiments of the present application adopts a sparse attention layer, which simplifies the model structure and makes the model lighter, so that the trained voice recognition model can be conveniently deployed to mobile terminals and more network devices, enabling the voice recognition model of the embodiments of the present application to perform voice recognition in various application scenarios and improving its applicability.
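For reference, the sparsemax projection used in steps S301 and S302 can be computed in closed form by sorting, following Martins & Astudillo (2016); the sketch below is an assumed PyTorch implementation, not the patent's own code.

```python
import torch

def sparsemax(z: torch.Tensor) -> torch.Tensor:
    """Euclidean projection of scores z onto the probability simplex. Entries
    below the threshold tau receive exactly zero probability, which is what
    lets the sparse attention layer discard redundant speech features."""
    z_sorted, _ = torch.sort(z, dim=-1, descending=True)
    k = torch.arange(1, z.size(-1) + 1, device=z.device, dtype=z.dtype)
    cumsum = z_sorted.cumsum(dim=-1)
    support = 1 + k * z_sorted > cumsum           # entries kept in the support
    k_z = support.sum(dim=-1, keepdim=True)       # support size k(z)
    tau = (cumsum.gather(-1, k_z - 1) - 1) / k_z  # threshold, so p* = [z - tau]+
    return torch.clamp(z - tau, min=0.0)          # zero out sub-threshold entries

# For example, sparsemax(torch.tensor([1.2, 0.8, -1.0])) gives [0.7, 0.3, 0.0]:
# the low-scoring entry gets exactly zero mass, whereas softmax keeps it positive.
```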
In step S105 of some embodiments, the sample text data may be position-encoded by the decoding network, using either absolute or relative encoding. Specifically, for absolute encoding, an absolute position code for each word vector of the sample text data is generated through sine and cosine functions, each word vector is position-marked according to its absolute position code, and the absolute position code serves as the position label of the word vector, yielding the sample text encoding vector. For relative encoding, the distance between every pair of word vectors of the sample text data is calculated (for example the Euclidean or Manhattan distance), and each pair of word vectors is assigned a relation number according to the magnitude of the distance; these relation numbers can represent the semantic order of the word vectors, yielding the sample text encoding vector.
Referring to fig. 4, in some embodiments, step S106 may include, but is not limited to, steps S401 to S403:
step S401, performing self-attention calculation on the sample text coding vector through an attention layer to obtain a sample text characterization vector;
step S402, performing attention calculation on the sample voice characterization vector and the sample text characterization vector through an attention layer to obtain an initial relation feature vector;
step S403, according to the preset vector feature dimension, mapping the initial relation feature vector to a preset vector space to obtain a sample relation feature vector.
In step S401 of some embodiments, when self-attention calculation is performed on the sample text encoding vector through the attention layer, each attention head may be computed using scaled dot-product attention, where each attention head $\mathrm{head}_i$ may be expressed as

$$\mathrm{head}_i = \mathrm{softmax}\!\left( \frac{(y W_i^Q)(y W_i^K)^\top}{\sqrt{d}} \right) y W_i^V$$

where $h$ is the number of attention heads, $d$ is the vector dimension of the sample text encoding vector $y_i$, $W_i^Q$, $W_i^K$ and $W_i^V$ are trainable projection matrices, and softmax is the normalized probability function. The attention heads are concatenated to obtain the sample text characterization vector $O$, which may be expressed as $O = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h) W^O$, where Concat is the concatenation function and $W^O$ is a projection matrix.
In step S402 of some embodiments, when the attention layer performs attention calculation on the sample voice characterization vector and the sample text characterization vector, the relation between the sample text encoding vector and the sample voice characterization vector may be constructed through a multi-head attention mechanism. First, attention calculation is performed on the sample voice characterization vector $H = \{h_1, h_2, \dots, h_s\}$ through the multi-head attention mechanism to obtain each attention head used for the multi-head attention calculation, where each attention head $\mathrm{head}_j$ can be expressed as

$$\mathrm{head}_j = \mathrm{softmax}\!\left(\frac{(OW_j^Q)(HW_j^K)^{\top}}{\sqrt{d}}\right)HW_j^V, \qquad j = 1, \dots, l,$$

where $l$ is the number of attention heads, $d$ is the vector dimension of the sample voice characterization vector $h_s$, $W_j^Q$, $W_j^K$, $W_j^V$ are trainable projection matrices, and $\mathrm{softmax}$ is a normalized probability function. Further, the attention heads are spliced to obtain the attention result corresponding to the sample voice characterization vector, where the attention result $M$ may be expressed as $M = \mathrm{Concat}(\mathrm{head}_1, \dots, \mathrm{head}_l)W^{M}$, with $\mathrm{Concat}$ a splicing function and $W^{M}$ a projection matrix. Finally, vector splicing is performed on the sample text characterization vector $O$ and the attention result $M$ of the sample voice characterization vector to obtain the initial relation feature vector. A sketch of this step is given below.
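The cross-attention and splicing of step S402 can be sketched the same way; queries are drawn from the text characterization O and keys/values from the speech characterization H, and the result M is spliced onto O. All parameter shapes and inputs are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_relation(O, H, Wq, Wk, Wv, Wm):
    """head_j = softmax((O Wq_j)(H Wk_j)^T / sqrt(d)) (H Wv_j);
    M = Concat(head_1, ..., head_l) Wm; output = [O ; M]."""
    d = H.shape[-1]
    heads = [softmax((O @ q) @ (H @ k).T / np.sqrt(d)) @ (H @ v)
             for q, k, v in zip(Wq, Wk, Wv)]
    M = np.concatenate(heads, axis=-1) @ Wm
    return np.concatenate([O, M], axis=-1)     # initial relation feature vector

rng = np.random.default_rng(3)
n, s, d, l = 6, 9, 32, 4
make = lambda: [rng.normal(size=(d, d // l)) for _ in range(l)]
rel = cross_attention_relation(rng.normal(size=(n, d)), rng.normal(size=(s, d)),
                               make(), make(), make(), rng.normal(size=(d, d)))
print(rel.shape)                               # (6, 64): [O ; M]
```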
In step S403 of some embodiments, the preset vector feature dimension may be set according to the actual service requirement; for example, the vector feature dimension may be 64, 512, or 256, and so on. Based on the preset vector feature dimension, dimension-changing processing is performed on the initial relation feature vector to map it into a preset vector space whose spatial dimension is consistent with the preset vector feature dimension. This conveniently realizes the transformation of the initial relation feature vector from a high-dimensional space to a low-dimensional space, yielding the sample relation feature vector, which is used to characterize the degree of correlation between the sample text encoding vector and the sample voice characterization vector. A minimal sketch is shown below.
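The dimension-changing step itself is just a learned linear map into the preset vector space; a minimal sketch, with the 512-to-64 dimensions chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(4)
n, d_init, d_preset = 6, 512, 64               # preset feature dimension: 64 (example)
relation_init = rng.normal(size=(n, d_init))   # initial relation feature vectors
W_proj = rng.normal(size=(d_init, d_preset)) / np.sqrt(d_init)  # assumed trainable
sample_relation = relation_init @ W_proj       # mapped into the preset vector space
print(sample_relation.shape)                   # (6, 64)
```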
Through the steps S401 to S403, the association relation between the sample voice characterization vector and the sample text characterization vector can be conveniently obtained, so that the recognition performance of the model is reflected according to the association relation, and the training efficiency of the model is improved.
Referring to fig. 5, in some embodiments, step S107 may include, but is not limited to, steps S501 to S502:
step S501, carrying out loss calculation on the sample relation feature vector through a loss function to obtain a target loss value;
step S502, back propagation is carried out on the target loss value so as to update the model parameters of the neural network model and obtain a voice recognition model.
In step S501 of some embodiments, the loss function may be a cross-entropy loss function. When loss calculation is performed on the sample relation feature vector through the loss function, the predicted text content inferred from the sample voice data in the sample relation feature vector is compared with the sample text content, and the degree of similarity between the two is represented by the target loss value. The smaller the target loss value, the higher the similarity between the predicted text content and the sample text content, i.e., the better the training effect of the model; the larger the target loss value, the lower the similarity, i.e., the worse the training effect of the model.
In step S502 of some embodiments, back propagation is performed according to the target loss value, and the internal parameters of the neural network model are updated by optimizing the target loss value, thereby obtaining the voice recognition model. It can be understood that a conventional back-propagation procedure may be used, and the embodiments of the present application are not limited in this respect. A toy sketch of the loss computation and the parameter update follows.
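A toy sketch of steps S501 and S502, assuming token-level cross entropy and a plain gradient-descent update; a real training loop would back-propagate through, and update, every parameter of the neural network model.

```python
import numpy as np

def cross_entropy_loss(logits, target_ids):
    """Target loss value: mean cross entropy between the predicted text
    distribution (logits) and the sample text content (token ids)."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(target_ids)), target_ids].mean()

def sgd_step(param, grad, lr=1e-3):
    """One gradient-descent update standing in for back propagation."""
    return param - lr * grad

logits = np.random.default_rng(5).normal(size=(4, 10))  # 4 tokens, vocab of 10
loss = cross_entropy_loss(logits, np.array([1, 3, 3, 7]))
print(float(loss))                                       # smaller is better
```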
Through the steps S501 to S502, the training degree of the model and the model performance of the model can be conveniently determined, and the internal parameters of the neural network model can be continuously adjusted according to the target loss value, so that the recognition performance of the neural network model meets the training requirement, and a voice recognition model capable of being used for recognizing the text content of the target voice data is obtained.
According to the training method of the model of the embodiments of the present application, sample voice data and sample text data corresponding to the sample voice data are acquired; the sample voice data and the sample text data are input into a preset neural network model, where the neural network model includes an encoding network and a decoding network; feature extraction is performed on the sample voice data through a convolution layer of the encoding network to obtain sample voice feature vectors, and attention calculation is performed on the sample voice feature vectors through a sparse attention layer of the encoding network to obtain sample voice characterization vectors. Meanwhile, compared with the network structure of a common Transformer model, the neural network model of the embodiments of the present application adopts a sparse attention layer, which better simplifies the model structure and makes the model more lightweight, so that the trained speech recognition model can be conveniently deployed on mobile terminals and more network devices, improving the applicability of the model. Further, the sample text data is encoded through the decoding network to obtain sample text encoding vectors, and the sample text encoding vectors and the sample voice characterization vectors are reconstructed through the attention layer of the decoding network to obtain sample relation feature vectors, which are used to characterize the degree of correlation between the sample text encoding vectors and the sample voice characterization vectors; the recognition performance of the model can be reflected according to this correlation, improving the training efficiency of the model. Finally, parameter updating is performed on the neural network model through a preset loss function and the sample relation feature vectors, and the neural network model is trained to obtain a voice recognition model; the internal parameters of the neural network model can be continuously adjusted so that its recognition performance meets the training requirement, yielding a voice recognition model that can be used to recognize the text content of target voice data and improving the training effect of the model. Furthermore, in the course of a dialogue with a service object, an intelligent customer service robot can more accurately recognize the appeal characterized in the voice data of the service object, thereby improving targeted response and service feedback, effectively improving the dialogue quality and dialogue effectiveness in the financial transaction process, realizing intelligent voice dialogue service, and improving service quality and customer satisfaction, thereby increasing the service yield.
Referring to fig. 6, the embodiment of the present application further provides a voice recognition method, which may specifically include, but is not limited to, steps S601 to S602:
step S601, obtaining target voice data to be processed;
step S602, inputting target voice data into a voice recognition model for recognition processing to obtain target text data, wherein the target text data is used for representing voice content of the target voice data; the speech recognition model is trained according to the training method of the first aspect.
In step S601 of some embodiments, after a web crawler is written and a data source is set, data may be crawled in a targeted manner to obtain the target voice data to be processed. The data source may be various types of network platforms or social media, and may also be some specific audio databases, etc.; the target voice data may be musical material, lectures, chat dialogues, and the like of a target speaking object, which is not limited here.
In step S602 of some embodiments, the target voice data may be input into the voice recognition model through a preset computer program or another script program for recognition processing. The target voice data is encoded through the encoding network of the voice recognition model, and the voice content information in the target voice data is extracted; the voice content information is filtered through the sparse attention layer of the encoding network to remove redundant, irrelevant information, and a target voice characterization vector is generated from the useful voice content information. The target voice characterization vector is then input into the decoding network, which decodes it, extracts the text features corresponding to the target voice data, and generates the target text data corresponding to the target voice data. Recognition processing of the target voice data is thereby realized, the target text data is obtained, and the voice content of the target voice data is characterized. A hypothetical wrapper illustrating this stage order is given below.
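A hypothetical wrapper showing the stage order of step S602; the four callables are stand-ins for the trained model's components and do not come from the application itself.

```python
from typing import Callable

class SpeechRecognitionPipeline:
    """Encode -> sparse-filter -> characterize -> decode, per step S602."""

    def __init__(self, encode: Callable, sparse_filter: Callable,
                 characterize: Callable, decode: Callable):
        self.encode = encode                # encoding network: extract voice content
        self.sparse_filter = sparse_filter  # sparse attention: drop redundant info
        self.characterize = characterize    # build target voice characterization vector
        self.decode = decode                # decoding network: generate target text

    def recognize(self, target_voice_data) -> str:
        features = self.encode(target_voice_data)
        kept = self.sparse_filter(features)
        characterization = self.characterize(kept)
        return self.decode(characterization)
```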
For example, in a financial business handling scenario, when the target dialogue type is a query, the corresponding object intention is a business query or a user data query; for instance, when the target voice data input by the object is "detail query", the target dialogue type is determined to be a data query, that is, the object intention is a user data query.
According to the voice recognition method of the embodiments of the present application, the target voice data is input into the voice recognition model for recognition processing; the target voice data is encoded through the encoding network of the voice recognition model, the voice content information in the target voice data is extracted, and the voice content information is filtered through the sparse attention layer of the encoding network to remove redundant, irrelevant information. In this way, redundant information irrelevant to voice recognition in the target voice data can be filtered conveniently, the total amount of information is effectively reduced, and the voice recognition speed is increased. The target text characterizing the voice content of the target voice data can be conveniently generated by decoding the target voice characterization vector through the decoding network, improving the accuracy and efficiency of voice recognition.
Referring to fig. 7, the embodiment of the present application further provides a training device for a model, which may implement the training method for a model, where the training device includes:
a sample data obtaining module 701, configured to obtain sample speech data and sample text data corresponding to the sample speech data;
the data input module 702 is configured to input the sample voice data and the sample text data into a preset neural network model, where the neural network model includes an encoding network and a decoding network;
the feature extraction module 703 is configured to perform feature extraction on the sample speech data through a convolutional layer of the coding network, so as to obtain a sample speech feature vector;
the computing module 704 is configured to perform attention computation on the sample speech feature vector through a sparse attention layer of the coding network to obtain a sample speech characterization vector;
the encoding module 705 is configured to encode the sample text data through a decoding network to obtain a sample text encoding vector;
a reconstruction module 706, configured to reconstruct the sample text encoding vector and the sample speech characterization vector through an attention layer of the decoding network to obtain a sample relationship feature vector, where the sample relationship feature vector is used to characterize a degree of correlation between the sample text encoding vector and the sample speech characterization vector;
the training module 707 is configured to update parameters of the neural network model through a preset loss function and a sample relation feature vector, so as to train the neural network model to obtain a speech recognition model.
In some embodiments, the feature extraction module 703 includes:
the convolution unit is used for carrying out convolution processing on the sample voice data through the convolution layer to obtain a sample voice convolution vector;
the sampling unit is used for carrying out downsampling treatment on the sample voice convolution vector through the convolution layer to obtain sample voice sampling characteristics;
and the coding unit is used for carrying out position coding on the sample voice sampling characteristics to obtain sample voice characteristic vectors.
In some embodiments, the computing module 704 includes:
the normalization unit is used for carrying out normalization processing on the sample voice feature vector through a preset function of the sparse attention layer to obtain sample probability features;
the screening unit is used for screening the sample voice feature vectors according to the sample probability features and a preset probability threshold value to obtain candidate voice feature vectors, wherein the candidate voice feature vectors are sample voice feature vectors with the sample probability features larger than the probability threshold value;
the first calculation unit is used for carrying out attention calculation on the candidate voice feature vectors according to preset weight parameters to obtain the sample voice characterization vector.
In some embodiments, the reconstruction module 706 includes:
the second calculation unit is used for carrying out self-attention calculation on the sample text coding vector through the attention layer to obtain a sample text characterization vector;
the third calculation unit is used for carrying out attention calculation on the sample voice characterization vector and the sample text characterization vector through the attention layer to obtain an initial relation feature vector;
and the mapping unit is used for mapping the initial relation feature vector to a preset vector space according to the preset vector feature dimension to obtain a sample relation feature vector.
In some embodiments, training module 707 includes:
the loss calculation unit is used for carrying out loss calculation on the sample relation feature vector through a loss function to obtain a target loss value;
and the parameter updating unit is used for carrying out back propagation on the target loss value so as to update the model parameters of the neural network model and obtain a voice recognition model.
The specific implementation manner of the training device of the model is basically the same as that of the specific embodiment of the training method of the model, and is not repeated here.
Referring to fig. 8, an embodiment of the present application further provides a voice recognition device, which may implement the foregoing voice recognition method, where the device includes:
A target voice data acquisition module 801, configured to acquire target voice data to be processed;
the voice recognition module 802 is configured to input target voice data into the voice recognition model for recognition processing, so as to obtain target text data, where the target text data is used to characterize voice content of the target voice data; the voice recognition model is obtained through training according to the training device.
The specific implementation of the voice recognition device is basically the same as the specific embodiment of the voice recognition method, and will not be described herein.
The embodiment of the application also provides electronic equipment, which comprises a memory and a processor, wherein the memory stores a computer program, and the processor realizes the voice recognition method when executing the computer program. The electronic equipment can be any intelligent terminal including a tablet personal computer, a vehicle-mounted computer and the like.
Referring to fig. 9, fig. 9 illustrates a hardware structure of an electronic device according to another embodiment, the electronic device includes:
the processor 901 may be implemented by a general-purpose CPU (Central Processing Unit), a microprocessor, an Application Specific Integrated Circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solutions provided by the embodiments of the present application;
the memory 902 may be implemented in the form of a Read-Only Memory (ROM), a static storage device, a dynamic storage device, or a Random Access Memory (RAM). The memory 902 may store an operating system and other application programs; when the technical solutions provided by the embodiments of the present application are implemented by software or firmware, the relevant program codes are stored in the memory 902 and invoked by the processor 901 to execute the voice recognition method of the embodiments of the present application;
an input/output interface 903 for inputting and outputting information;
the communication interface 904 is configured to implement communication interaction between the device and other devices, where communication may be implemented in a wired manner (e.g., USB, network cable) or in a wireless manner (e.g., mobile network, Wi-Fi, Bluetooth);
a bus 905 that transfers information between the various components of the device (e.g., the processor 901, the memory 902, the input/output interface 903, and the communication interface 904);
wherein the processor 901, the memory 902, the input/output interface 903 and the communication interface 904 are communicatively coupled to each other within the device via a bus 905.
The embodiment of the application also provides a computer readable storage medium, wherein the computer readable storage medium stores a computer program, and the computer program realizes the voice recognition method when being executed by a processor.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The embodiments of the present application provide a training method for a model, a voice recognition method, a training device, a voice recognition device, an electronic device, and a computer-readable storage medium. Sample voice data and sample text data corresponding to the sample voice data are acquired, and the sample voice data and the sample text data are input into a preset neural network model, where the neural network model includes an encoding network and a decoding network. Feature extraction is performed on the sample voice data through a convolution layer of the encoding network to obtain sample voice feature vectors, and attention calculation is performed on the sample voice feature vectors through a sparse attention layer of the encoding network to obtain sample voice characterization vectors; in this way, redundant information irrelevant to voice recognition in the sample voice data can be filtered more conveniently, the total amount of information is effectively reduced, and the recognition speed of the model is increased. Meanwhile, compared with the network structure of a common Transformer model, the neural network model of the embodiments of the present application adopts a sparse attention layer, which better simplifies the model structure and makes the model more lightweight, so that the trained speech recognition model can be conveniently deployed on mobile terminals and more network devices, improving the applicability of the model. Further, the sample text data is encoded through the decoding network to obtain sample text encoding vectors, and the sample text encoding vectors and the sample voice characterization vectors are reconstructed through the attention layer of the decoding network to obtain sample relation feature vectors, which are used to characterize the degree of correlation between the sample text encoding vectors and the sample voice characterization vectors; the recognition performance of the model can be reflected according to this correlation, improving the training efficiency of the model. Finally, parameter updating is performed on the neural network model through a preset loss function and the sample relation feature vectors, and the neural network model is trained to obtain a voice recognition model; the internal parameters of the neural network model can be continuously adjusted so that its recognition performance meets the training requirement, yielding a voice recognition model that can be used to recognize the text content of target voice data and improving the training effect of the model. Furthermore, in the course of a dialogue with a service object, an intelligent customer service robot can more accurately recognize the appeal characterized in the voice data of the service object, thereby improving targeted response and service feedback, effectively improving the dialogue quality and dialogue effectiveness in the financial transaction process, realizing intelligent voice dialogue service, and improving service quality and customer satisfaction, thereby increasing the service yield.
The embodiments described in the embodiments of the present application are for more clearly describing the technical solutions of the embodiments of the present application, and do not constitute a limitation on the technical solutions provided by the embodiments of the present application, and as those skilled in the art can know that, with the evolution of technology and the appearance of new application scenarios, the technical solutions provided by the embodiments of the present application are equally applicable to similar technical problems.
It will be appreciated by those skilled in the art that the technical solutions shown in the figures do not constitute limitations of the embodiments of the present application, and may include more or fewer steps than shown, or may combine certain steps, or different steps.
The above described apparatus embodiments are merely illustrative, wherein the units illustrated as separate components may or may not be physically separate, i.e. may be located in one place, or may be distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes the association relationship of associated objects and indicates that three relationships may exist; for example, "A and/or B" may represent: only A exists, only B exists, or both A and B exist, where A and B may be singular or plural. The character "/" generally indicates that the associated objects are in an "or" relationship. "At least one of" and similar expressions mean any combination of the listed items, including any combination of a single item or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may be singular or plural.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present application may be embodied essentially or in part or all of the technical solution or in part in the form of a software product stored in a storage medium, including multiple instructions to cause a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the various embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing a program.
Preferred embodiments of the present application are described above with reference to the accompanying drawings, and thus do not limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.
Claims (10)
1. A method of training a model, the method comprising:
acquiring sample voice data and sample text data corresponding to the sample voice data;
inputting the sample voice data and the sample text data into a preset neural network model, wherein the neural network model comprises an encoding network and a decoding network;
performing feature extraction on the sample voice data through a convolution layer of the coding network to obtain a sample voice feature vector;
performing attention calculation on the sample voice feature vector through a sparse attention layer of the coding network to obtain a sample voice characterization vector;
encoding the sample text data through the decoding network to obtain a sample text encoding vector;
reconstructing the sample text coding vector and the sample voice characterization vector through an attention layer of the decoding network to obtain a sample relation feature vector, wherein the sample relation feature vector is used for characterizing the correlation degree between the sample text coding vector and the sample voice characterization vector;
and updating parameters of the neural network model through a preset loss function and the sample relation feature vector so as to train the neural network model and obtain a voice recognition model.
2. The training method of claim 1, wherein the feature extraction of the sample speech data by the convolutional layer of the coding network to obtain a sample speech feature vector comprises:
carrying out convolution processing on the sample voice data through the convolution layer to obtain a sample voice convolution vector;
performing downsampling processing on the sample voice convolution vector through the convolution layer to obtain sample voice sampling characteristics;
and carrying out position coding on the sample voice sampling characteristics to obtain the sample voice characteristic vector.
3. The training method of claim 1, wherein the performing attention computation on the sample speech feature vector by the sparse attention layer of the coding network to obtain a sample speech characterization vector comprises:
normalizing the sample voice feature vector through a preset function of the sparse attention layer to obtain sample probability features;
screening the sample voice feature vectors according to the sample probability features and a preset probability threshold to obtain candidate voice feature vectors, wherein the candidate voice feature vectors are sample voice feature vectors with sample probability features larger than the probability threshold;
and performing attention calculation on the candidate voice feature vectors according to preset weight parameters to obtain the sample voice characterization vector.
4. The training method of claim 1, wherein the reconstructing the sample text encoded vector and the sample speech characterization vector by the attention layer of the decoding network to obtain a sample relationship feature vector comprises:
performing self-attention calculation on the sample text coding vector through the attention layer to obtain a sample text characterization vector;
performing attention calculation on the sample voice characterization vector and the sample text characterization vector through the attention layer to obtain an initial relation feature vector;
and mapping the initial relation feature vector to a preset vector space according to a preset vector feature dimension to obtain the sample relation feature vector.
5. The training method according to any one of claims 1 to 4, wherein the parameter updating the neural network model by using a preset loss function and the sample relation feature vector to train the neural network model to obtain a speech recognition model includes:
Carrying out loss calculation on the sample relation feature vector through the loss function to obtain a target loss value;
and carrying out back propagation on the target loss value to update the model parameters of the neural network model so as to obtain the voice recognition model.
6. A method of speech recognition, the method comprising:
acquiring target voice data to be processed;
inputting the target voice data into a voice recognition model for recognition processing to obtain target text data, wherein the target text data is used for representing voice content of the target voice data; the speech recognition model is trained according to the training method of any one of claims 1 to 5.
7. A training device for a model, the training device comprising:
the sample data acquisition module is used for acquiring sample voice data and sample text data corresponding to the sample voice data;
the data input module is used for inputting the sample voice data and the sample text data into a preset neural network model, wherein the neural network model comprises an encoding network and a decoding network;
the characteristic extraction module is used for extracting the characteristics of the sample voice data through a convolution layer of the coding network to obtain a sample voice characteristic vector;
the computing module is used for carrying out attention computation on the sample voice feature vector through a sparse attention layer of the coding network to obtain a sample voice characterization vector;
the encoding module is used for encoding the sample text data through the decoding network to obtain sample text encoding vectors;
the reconstruction module is used for reconstructing the sample text coding vector and the sample voice characterization vector through the attention layer of the decoding network to obtain a sample relation feature vector, wherein the sample relation feature vector is used for characterizing the correlation degree between the sample text coding vector and the sample voice characterization vector;
and the training module is used for carrying out parameter updating on the neural network model through a preset loss function and the sample relation feature vector so as to train the neural network model and obtain a voice recognition model.
8. A speech recognition device, the device comprising:
the target voice data acquisition module is used for acquiring target voice data to be processed;
the voice recognition module is used for inputting the target voice data into a voice recognition model for recognition processing to obtain target text data, wherein the target text data is used for representing voice content of the target voice data; the speech recognition model is trained by the training device of claim 7.
9. An electronic device comprising a memory storing a computer program and a processor implementing the training method of any one of claims 1 to 5 or the speech recognition method of claim 6 when the computer program is executed by the processor.
10. A computer readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the training method of any one of claims 1 to 5 or the speech recognition method of claim 6.