CN110930985B - Telephone voice recognition model, method, system, equipment and medium - Google Patents

Telephone voice recognition model, method, system, equipment and medium

Info

Publication number
CN110930985B
Authority
CN
China
Prior art keywords
layer
dimensional convolution
voice
spectrogram
gru
Prior art date
Legal status
Active
Application number
CN201911234303.9A
Other languages
Chinese (zh)
Other versions
CN110930985A (en)
Inventor
郝竹林
罗超
胡泓
王俊彬
任君
Current Assignee
Ctrip Computer Technology Shanghai Co Ltd
Original Assignee
Ctrip Computer Technology Shanghai Co Ltd
Priority date
Filing date
Publication date
Application filed by Ctrip Computer Technology Shanghai Co Ltd filed Critical Ctrip Computer Technology Shanghai Co Ltd
Priority to CN201911234303.9A priority Critical patent/CN110930985B/en
Publication of CN110930985A publication Critical patent/CN110930985A/en
Application granted granted Critical
Publication of CN110930985B publication Critical patent/CN110930985B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G: PHYSICS
    • G10: MUSICAL INSTRUMENTS; ACOUSTICS
    • G10L: SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
    • G10L15/02: Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/063: Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
    • G10L15/083: Recognition networks (speech classification or search)
    • G10L15/26: Speech-to-text systems
    • G10L19/02: Speech or audio analysis-synthesis for redundancy reduction using spectral analysis, e.g. transform vocoders or subband vocoders
    • H: ELECTRICITY
    • H04M: TELEPHONIC COMMUNICATION
    • H04M3/5166: Centralised call answering arrangements requiring operator intervention, e.g. call or contact centers for telemarketing, in combination with interactive voice response systems or voice portals

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Telephonic Communication Services (AREA)

Abstract

The invention discloses a telephone voice recognition model, method, system, equipment and medium. The multi-path multi-layer two-dimensional convolution layer of the telephone voice recognition model comprises several multi-layer two-dimensional convolutions arranged in parallel. The input layer receives spectrogram features and adjusts them to features of a preset dimension; each path of the multi-path multi-layer two-dimensional convolution layer performs layer-by-layer convolution on the features of the preset dimension through its multi-layer two-dimensional convolution to obtain first feature data; the GRU layer processes the first feature data to obtain second feature data; the output layer processes the second feature data into a probability matrix of the voice characters corresponding to the spectrogram features. Because the network input size and the multi-path multi-layer two-dimensional convolution are designed specifically for low-sampling-rate telephone speech, the model largely overcomes the problem of acoustic-model recognition in low-sampling-rate environments.

Description

Telephone voice recognition model, method, system, equipment and medium
Technical Field
The present invention relates to the field of speech recognition, and in particular, to a telephone speech recognition model, method, system, device, and medium.
Background
Speech recognition technology enables a machine to convert speech signals into the corresponding text or commands through a process of recognition and understanding. It mainly comprises three aspects: feature extraction, pattern-matching criteria, and model training.
According to the capture device and channel, speech recognition can be classified into desktop (PC) recognition, telephone recognition, and embedded-device (mobile phone, PDA, etc.) recognition. Different acquisition channels distort the acoustic properties of human speech differently, so separate recognition systems must be built for each.
In the telephone voice transmission relay system of the OTA (online travel agency) industry, when the OTA calls a hotel or a guest, the speech uttered by the other party over the telephone must be recognized: an acoustic model converts the voice information accurately into text, and customer service then performs further processing based on that text.
Telephone speech is usually sampled at 8 kilohertz (kHz). The more mature acoustic-model modeling technologies for this low-sampling-rate 8 kHz scenario are GMM-HMM, DNN-HMM and DeepSpeech:
1) Traditional acoustic-model modeling represented by GMM-HMM can fit the emission probabilities between speech frames and phoneme states well, but it must satisfy a strict Gaussian-distribution assumption. In the telephone scenario, speech is sampled at 8 kHz, far below the 16 kHz or 44.1 kHz of normal mobile-phone recordings, so each speech frame carries less information and the GMM's requirements become harder to meet; the approach cannot be applied directly to this scenario.
2) Acoustic-model modeling represented by DNN-HMM, i.e. modeling with first-generation deep learning, expresses the emission probabilities between speech frames and phoneme states more fully than the GMM. Because the DNN is a discriminative model, it naturally learns feature representations of speech frames and is very stable on low-sampling-rate data. However, training a DNN-HMM first requires frame-level labels from a GMM-HMM model, and the method depends heavily on a pronunciation dictionary.
3) Acoustic-model modeling represented by DeepSpeech2, i.e. modeling with second-generation deep learning, is fully end-to-end: it needs no frame-level labeling, only audio and the corresponding transcripts. However, it is built on 16 kHz high-sampling-rate data, so its ability to extract features from 8 kHz speech is limited; moreover, modeling the roughly 10,000 common Chinese characters makes the model heavy and slow to load.
The main problems of low-sampling-rate telephone-speech acoustic-model recognition in the OTA voice transmission relay scenario are as follows:
1) The OTA industry lacks sufficient labeled speech samples;
2) The acoustic environment in the OTA voice transmission relay system is very noisy (reverberation, background speakers, ambient noise, etc.), which makes recognition difficult;
3) OTA telephone recordings are sampled at a low 8 kHz, so the raw input carries less information than normal recordings.
Therefore, in the low-sampling-rate 8 kHz telephone environment of the OTA industry, existing acoustic models are easily degraded by harsh conditions on the calling side, such as reverberation, background speakers and noise, and their recognition accuracy is consequently low.
Disclosure of Invention
The invention aims to overcome the low recognition accuracy of acoustic models in the prior art, and provides a telephone voice recognition model, method, system, equipment and medium.
The invention solves the technical problems by the following technical scheme:
A telephone voice recognition model comprising an input layer, a multi-path multi-layer two-dimensional convolution layer, a GRU layer and an output layer;
the multi-path multi-layer two-dimensional convolution layer comprises several multi-layer two-dimensional convolutions, arranged in parallel to form the multi-path multi-layer two-dimensional convolution layer;
the input layer is used for receiving the spectrogram features and adjusting them to features of a preset dimension;
each path of the multi-path multi-layer two-dimensional convolution layer is used for performing layer-by-layer convolution on the features of the preset dimension through its multi-layer two-dimensional convolution to obtain first feature data;
the GRU layer is used for processing the first feature data to obtain second feature data;
the output layer is used for processing the second feature data to obtain a probability matrix of the voice characters corresponding to the spectrogram features.
Preferably,
the input layer comprises a Reshape layer, and the Reshape layer is used for receiving the spectrogram characteristics and adjusting the spectrogram characteristics to the characteristics of the preset dimension;
and/or,
the multi-path multi-layer two-dimensional convolution layer further comprises a feature operation layer, which sums or concatenates the preset-dimension features processed by the individual multi-layer two-dimensional convolutions to obtain the first feature data;
and/or,
the multipath multi-layer two-dimensional convolution layer further comprises a first Batch Normalization layer, and the first Batch Normalization layer is arranged before and after each layer of two-dimensional convolution in the multi-layer two-dimensional convolution;
and/or,
the output layer comprises a first TimeDistributed layer, and the first TimeDistributed layer is used for processing the second characteristic data to obtain the probability matrix.
Preferably, the GRU layers comprise bidirectional GRU layers and/or unidirectional GRU layers.
Preferably, the GRU layer further comprises at least one of a second Batch Normalization layer, a first Dropout layer, a second Dropout layer and a second TimeDistributed layer.
Preferably, when the GRU layers include a bidirectional GRU layer and a unidirectional GRU layer, and when the GRU layers include a second Batch Normalization layer, a first Dropout layer, a second Dropout layer, and a second TimeDistributed layer, the second Batch Normalization layer is disposed between the bidirectional GRU layer and the unidirectional GRU layer, the first Dropout layer is disposed at the tail of the second Batch Normalization layer, and the second TimeDistributed layer is disposed between the first Dropout layer and the second Dropout layer.
A telephone speech recognition method implemented using a telephone speech recognition model as described above, the telephone speech recognition method comprising:
performing silence cutting on the input telephone voice to obtain an audio fragment;
extracting features from the audio fragment to obtain spectrogram features;
inputting the spectrogram features into the telephone voice recognition model to obtain a probability matrix of the voice text corresponding to the spectrogram features;
and decoding the probability matrix to obtain the voice text corresponding to the spectrogram features.
Preferably, the step of decoding the probability matrix to obtain the voice text corresponding to the spectrogram feature includes:
establishing a word stock according to the historical telephone voice data;
merging homophones and synonyms in the word stock, and labeling the merged word stock to obtain word labels;
decoding the probability matrix according to the word label by using a CTC decoding algorithm to obtain the voice text;
and/or,
before the step of inputting the spectrogram features into the telephone voice recognition model to obtain the probability matrix of the voice text corresponding to the spectrogram features, the method further comprises a step of training the telephone voice recognition model, which comprises:
setting parameters of the input layer, the multi-path multi-layer two-dimensional convolution layer, the GRU layer and the output layer respectively;
setting a loss function and an optimization method;
and performing iterative training on the telephone voice recognition model according to the loss function and the optimization method by using a sample audio fragment and a voice text corresponding to the sample audio fragment.
Preferably, when the telephone speech recognition method comprises the step of training the telephone speech recognition model,
the step of setting parameters of the input layer, the multipath multi-layer two-dimensional convolution layer, the GRU layer and the output layer respectively comprises the following steps:
setting a network size adjustment parameter of the input layer according to the sampling rate of voice data;
and respectively setting the convolution parameters of each path of the multi-path multi-layer two-dimensional convolution layer according to the network-size adjustment parameter, the convolution parameters comprising at least one of the number of filters, stride, kernel size, padding and output dimension.
A telephone voice recognition system implemented using a telephone voice recognition model as described above, the voice recognition system comprising a cutting module, a feature extraction module, a prediction module and a decoding module;
the cutting module is used for performing silence cutting on the input telephone voice to obtain an audio fragment;
the feature extraction module is used for extracting features from the audio fragment to obtain spectrogram features;
the prediction module is used for inputting the spectrogram characteristics into the telephone voice recognition model to obtain a probability matrix of a voice text corresponding to the spectrogram characteristics;
the decoding module is used for decoding the probability matrix to obtain the voice text corresponding to the spectrogram characteristic.
Preferably, the decoding module is further used for establishing a word stock according to the historical telephone voice data, merging homophones in the word stock, and labeling the merged word stock to obtain word labels;
the decoding module is also used for decoding the probability matrix according to the word label by utilizing a CTC decoding algorithm to obtain the voice text;
and/or,
the voice recognition system further comprises a training module, which is used for respectively setting parameters of the input layer, the multi-path multi-layer two-dimensional convolution layer, the GRU layer and the output layer; for setting a loss function and an optimization method; and for iteratively training the telephone voice recognition model according to the loss function and the optimization method, using a sample audio fragment and the voice text corresponding to the sample audio fragment.
Preferably, the training module is further configured to set the network-size adjustment parameter of the input layer according to the sampling rate of the voice data, and to respectively set the convolution parameters of each path of the multi-path multi-layer two-dimensional convolution layer according to the network-size adjustment parameter, the convolution parameters comprising at least one of the number of filters, stride, kernel size, padding and output dimension.
An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, the processor implementing a telephone speech recognition method as described above when executing the computer program.
A computer readable storage medium having stored thereon a computer program which when executed by a processor implements the steps of a telephony speech recognition method as described above.
The positive effects of the invention are as follows:
the telephone voice recognition model can be designed aiming at the network input size and the multipath multi-layer two-dimensional convolution of the telephone voice design with low sampling rate and the network design, and can greatly overcome the problem of acoustic model recognition in the low sampling rate environment. Aiming at the problem that the telephone voice information is recognized as characters when a call is made in the low sampling rate 8kHz environment of the OTA at present, the invention can accurately recognize the text information expressed by the opposite side in real time when a customer service dialogues with the opposite side based on the telephone voice recognition model established in the telephone voice information transmission system. The telephone voice recognition model aims at the network input size designed by the low sampling rate, and the multipath multi-layer two-dimensional convolution and network design, so that the problem of acoustic model recognition in the low sampling rate environment can be greatly solved. The homophone synonymous method applied in the voice recognition method and the system greatly improves the training speed of the network model and accelerates the data iteration and optimization of the network model. According to the telephone voice recognition method and system, the training efficiency of telephone voice recognition in the scene is improved, meanwhile, the evaluation performance of telephone voice recognition is not reduced, the accuracy of customer service processing telephone operation can be greatly improved, the OTA customer service operation flow is saved, meanwhile, the error of telephone processing by OTA customer service personnel is reduced, help is provided for follow-up offline customer service quality inspection, and the speed response of OTA platform customer service is improved.
Drawings
Fig. 1 is a schematic block diagram of a telephone speech recognition model according to embodiment 1 of the present invention.
Fig. 2 is another block diagram of a telephone speech recognition model according to embodiment 1 of the present invention.
Fig. 3 is a flow chart of a telephone voice recognition method according to embodiment 2 of the present invention.
Fig. 4 is a flowchart of step 104 of the phone voice recognition method according to embodiment 2 of the present invention.
Fig. 5 is a flowchart of step 103' of the phone voice recognition method according to embodiment 2 of the present invention.
Fig. 6 is a flowchart of step 31' of the phone voice recognition method according to embodiment 2 of the present invention.
Fig. 7 is a block diagram of a telephone voice recognition system according to embodiment 3 of the present invention.
Fig. 8 is a schematic structural diagram of an electronic device according to embodiment 4 of the present invention.
Detailed Description
The invention is further illustrated by means of the following examples, which are not intended to limit the scope of the invention.
Example 1
This embodiment provides a telephone voice recognition model which, as shown in fig. 1, comprises an input layer 1, a multi-path multi-layer two-dimensional convolution layer 2, a GRU layer 3 and an output layer 4.
The input layer 1 is used for receiving the spectrogram characteristics and adjusting the spectrogram characteristics to the characteristics of preset dimensions.
As shown in fig. 2, the input layer in this embodiment includes a Reshape layer 11, where the Reshape layer 11 is configured to receive the spectrogram feature and adjust the spectrogram feature to a feature of a preset dimension.
A third Batch Normalization layer 12 may also be provided before the Reshape layer 11.
The third Batch Normalization layer 12 normalizes the network weights and accelerates network training.
The input features of this embodiment are the raw spectrogram features of the speech signal. The input network size of the input layer can be designed according to the raw spectrogram features and the sampling rate of the speech signal; the Reshape layer 11 then readjusts the network size, and the multi-path multi-layer two-dimensional convolution layer 2 is connected afterwards to perform feature extraction.
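As a concrete illustration of the input features described above, the sketch below computes log-magnitude spectrogram features with the 25 ms frame length and 10 ms frame shift given later in embodiment 2. The Hamming window and the FFT size of 398 (chosen only so that the output has 200 frequency bins, matching the (1600, 200) input size) are assumptions; the patent does not specify these parameters.

```python
import numpy as np

def spectrogram_features(signal, sample_rate=8000, frame_ms=25,
                         shift_ms=10, n_fft=398):
    """Log-magnitude spectrogram. frame_ms/shift_ms follow embodiment 2;
    the Hamming window and n_fft=398 (-> 398 // 2 + 1 = 200 bins) are assumptions."""
    frame_len = sample_rate * frame_ms // 1000    # 200 samples at 8 kHz
    shift = sample_rate * shift_ms // 1000        # 80 samples
    n_frames = 1 + (len(signal) - frame_len) // shift
    window = np.hamming(frame_len)
    frames = np.stack([signal[i * shift:i * shift + frame_len] * window
                       for i in range(n_frames)])
    # rfft zero-pads each 200-sample frame to n_fft before transforming
    return np.log1p(np.abs(np.fft.rfft(frames, n=n_fft, axis=1)))

# One second of audio at 8 kHz -> (1 + (8000 - 200) // 80, 200) = (98, 200)
feats = spectrogram_features(np.random.default_rng(0).standard_normal(8000))
```

A full 16.015 s utterance would yield 1600 such frames, giving the (1600, 200) network input described in embodiment 2.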
The multi-path multi-layer two-dimensional convolution layer 2 comprises several multi-layer two-dimensional convolutions 21, arranged in parallel to form the multi-path multi-layer two-dimensional convolution layer 2.
As shown in fig. 2, the multi-path multi-layer two-dimensional convolution layer 2 comprises two-dimensional convolutions 211, first Batch Normalization layers 212 and a feature operation layer 213. A first Batch Normalization layer 212 is disposed before and after each two-dimensional convolution 211 in the multi-layer stack.
Each multi-layer two-dimensional convolution 21 of the multi-path multi-layer two-dimensional convolution layer 2 performs layer-by-layer convolution on the features of the preset dimension through its two-dimensional convolutions 211 to obtain first feature data.
The feature operation layer 213 sums or concatenates the outputs of the multi-layer two-dimensional convolutions 21 to obtain the first feature data.
The multi-path multi-layer two-dimensional convolution layer 2 enables the frequency-domain features of the speech signal to be fully extracted and represented by the two-dimensional convolutions 211, and largely makes up for the weaker frequency-domain feature extraction of a single-path two-dimensional convolution.
The GRU layer 3 is configured to process the first feature data and obtain processed second feature data.
The GRU layer 3 comprises a bidirectional GRU layer 31 and/or a unidirectional GRU layer 32, and further comprises at least one of a second Batch Normalization layer 33, a first Dropout layer 34, a second Dropout layer 35 and a second TimeDistributed layer 36.
In this embodiment, the GRU layer 3 comprises a bidirectional GRU layer 31, a unidirectional GRU layer 32, a second Batch Normalization layer 33, a first Dropout layer 34, a second Dropout layer 35 and a second TimeDistributed layer 36. The unidirectional GRU layer 32 follows the bidirectional GRU layer 31; the second Batch Normalization layer 33 is disposed between the bidirectional GRU layer 31 and the unidirectional GRU layer 32 and before the first Dropout layer 34; the first Dropout layer 34 follows the second Batch Normalization layer 33; and the second TimeDistributed layer 36 is disposed between the first Dropout layer 34 and the second Dropout layer 35.
By placing the bidirectional GRU layer 31 first, then the unidirectional GRU layer 32, and then the second TimeDistributed layer 36, the GRU layer 3 further compensates for the time-domain information lost in the multi-path multi-layer two-dimensional convolution layer 2 and enriches the temporal information.
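The GRU computation itself can be illustrated with a minimal NumPy sketch of a single gated-recurrent-unit time step in the standard Cho et al. formulation (biases omitted for brevity); this illustrates the mechanism only, not the patent's exact layer configuration or weight convention:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x, h, Wz, Uz, Wr, Ur, Wh, Uh):
    """One GRU time step (Cho et al. formulation; biases omitted)."""
    z = sigmoid(x @ Wz + h @ Uz)                # update gate
    r = sigmoid(x @ Wr + h @ Ur)                # reset gate
    h_tilde = np.tanh(x @ Wh + (r * h) @ Uh)    # candidate state
    return (1 - z) * h + z * h_tilde            # mix old state and candidate

rng = np.random.default_rng(0)
d_in, d_h = 8, 16                               # toy dimensions (assumptions)
weights = [0.1 * rng.standard_normal(shape)
           for shape in [(d_in, d_h), (d_h, d_h)] * 3]
h = np.zeros(d_h)
for t in range(5):                              # run a short input sequence
    h = gru_step(rng.standard_normal(d_in), h, *weights)
```

A bidirectional layer runs one such recurrence forward and one backward over the sequence and combines the two hidden-state streams, which is what lets layer 31 exploit both past and future context.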
The output layer 4 is used for processing the second characteristic data to obtain a probability matrix of the voice characters corresponding to the spectrogram characteristics.
The output layer 4 comprises a first TimeDistributed layer 41, which processes the second feature data to obtain the probability matrix.
With its network input size and multi-path multi-layer two-dimensional convolution designed for low-sampling-rate telephone speech, the telephone voice recognition model of this embodiment largely overcomes the problem of acoustic-model recognition in low-sampling-rate environments.
Example 2
This embodiment provides a telephone voice recognition method, implemented with the telephone voice recognition model of embodiment 1; as shown in fig. 3, it comprises:
and 101, performing mute cutting on input telephone voice to obtain an audio fragment.
And 102, extracting features of the audio fragment to obtain spectrogram features.
Step 103, inputting the spectrogram features into a telephone voice recognition model to obtain a probability matrix of the voice text corresponding to the spectrogram features.
And 104, decoding the probability matrix to obtain the voice text corresponding to the spectrogram characteristic.
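Step 101's silence cutting can be sketched as simple frame-energy thresholding. The patent does not specify the voice-activity-detection method, so the non-overlapping 25 ms frames and the relative energy threshold below are assumptions:

```python
import numpy as np

def cut_silence(signal, sample_rate=8000, frame_ms=25, rel_threshold=0.1):
    """Drop frames whose energy is below rel_threshold * (peak frame energy)."""
    frame_len = sample_rate * frame_ms // 1000     # 200 samples at 8 kHz
    n = len(signal) // frame_len
    frames = signal[:n * frame_len].reshape(n, frame_len)
    energies = (frames ** 2).sum(axis=1)
    return frames[energies > rel_threshold * energies.max()].reshape(-1)

# Silence surrounding a burst of speech-like noise.
rng = np.random.default_rng(0)
sig = np.concatenate([np.zeros(4000), rng.standard_normal(8000), np.zeros(4000)])
trimmed = cut_silence(sig)                         # leading/trailing silence removed
```

In a production system a more robust VAD (e.g. one that also uses spectral cues) would typically replace this energy rule, but the interface is the same: telephone audio in, speech-only fragment out.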
As shown in fig. 4, step 104 includes:
step 1041, establishing a word stock according to the historical telephone voice data.
Step 1042, merging homophones in the word stock, and labeling the merged word stock to obtain word labels.
Step 1043, decoding the probability matrix by using a CTC decoding algorithm according to the word label to obtain a voice text.
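Step 1043's decoding can be illustrated with the simplest CTC variant, greedy best-path decoding: take the argmax label per frame, collapse consecutive repeats, and drop blanks. Beam-search CTC decoders follow the same collapsing rule; the label set below is a made-up example, not the patent's:

```python
import numpy as np

def ctc_greedy_decode(prob_matrix, labels, blank=0):
    """Best-path CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    best = prob_matrix.argmax(axis=1)
    out, prev = [], blank
    for idx in best:
        if idx != prev and idx != blank:
            out.append(labels[idx])
        prev = idx
    return "".join(out)

labels = ["<blank>", "好", "的"]          # toy label set (an assumption)
probs = np.array([
    [0.1, 0.8, 0.1],   # frame 1: 好
    [0.1, 0.8, 0.1],   # frame 2: repeat, collapsed
    [0.8, 0.1, 0.1],   # frame 3: blank, dropped
    [0.1, 0.1, 0.8],   # frame 4: 的
])
text = ctc_greedy_decode(probs, labels)   # -> "好的"
```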
Step 103 is preceded by a step 103' of training the telephone voice recognition model.
As shown in fig. 5, step 103' includes:
and step 31', setting parameters of an input layer, a multi-path two-dimensional convolution layer, a GRU layer and an output layer respectively.
As shown in fig. 6, step 31' specifically includes:
step 311', setting the network size adjustment parameters of the input layer according to the sampling rate of the voice data.
The sampling rate of the telephone recording data is 8 kHz. In this embodiment the maximum network input length is designed to be 1600, i.e. the maximum number of speech frames is 1600; with a frame shift of 10 ms (milliseconds) and a frame length of 25 ms, the maximum audio duration of a network input is 16.015 seconds. The input features are the raw spectrogram features of the speech signal, so the input network size is designed as (1600, 200); a Reshape layer then readjusts the network size to (1600, 200, 1), and the multi-path multi-layer two-dimensional convolution layer 2 is connected to perform feature extraction.
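The sizing arithmetic above can be checked directly: 1600 frames span 1599 frame shifts of 10 ms plus one full 25 ms frame length.

```python
# Network-input sizing for 8 kHz telephone speech, as in this embodiment.
max_frames = 1600        # maximum number of speech frames the network accepts
frame_shift_s = 0.010    # 10 ms frame shift
frame_len_s = 0.025      # 25 ms frame length

# 1600 frames span (1600 - 1) frame shifts plus one full frame length.
max_audio_s = (max_frames - 1) * frame_shift_s + frame_len_s
print(round(max_audio_s, 3))          # 16.015 seconds, matching the text

input_size = (max_frames, 200)        # raw spectrogram input to the network
reshaped = (max_frames, 200, 1)       # after the Reshape layer
```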
Step 312', respectively setting the convolution parameters of each path of the multi-path multi-layer two-dimensional convolution layer according to the network-size adjustment parameters, the convolution parameters comprising at least one of the number of filters, stride, kernel size, padding and output dimension.
This embodiment takes a three-path configuration as an example, with three multi-layer two-dimensional convolutions on routes A, B and C. In the feature-extraction stage the parameters of the three paths are set, and features are extracted by the three routes of two-dimensional convolutions; by design, the kernel sizes of routes A, B and C differ from one another and are usually chosen to be odd.
The three-layer two-dimensional convolution of route A is configured as follows: the number of convolution filters is 32, the stride is (2, 2), the kernel sizes are (11, 41) and (11, 21), and the padding is 'same'; a Reshape(200, -1) operation then gives an output dimension of (200, 800).
The three-layer two-dimensional convolution of route B is configured as follows: the number of convolution filters is 32, the stride is (2, 2), the kernel sizes are (11, 21) and (11, 11), and the padding is 'same'; a Reshape(200, -1) operation then gives an output dimension of (200, 800).
The three-layer two-dimensional convolution of route C is configured as follows: the number of convolution filters is 32, the stride is (2, 2), the kernel sizes are (11, 11) and (11, 7), and the padding is 'same'; a Reshape(200, -1) operation then gives an output dimension of (200, 800).
The outputs of routes A, B and C are then combined along the feature dimension. The network offers a choice of combination operation: with a summation (add) operation the resulting matrix dimension remains (200, 800); with a concat operation the output dimension becomes (200, 2400).
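The per-route output (200, 800) follows from 'same' padding with stride (2, 2): each of the three convolution layers halves both axes (rounding up), independent of kernel size, so 1600 frames become 200 and 200 features become 25, which the Reshape flattens with the 32 filters into 800. A small sketch of this shape arithmetic (the function name is illustrative only):

```python
from math import ceil

def conv_out(frames, feats, filters=32, stride=(2, 2), layers=3):
    """Output shape of `layers` stacked 2-D convolutions with 'same'
    padding: each layer divides both spatial axes by the stride,
    rounding up, regardless of kernel size."""
    for _ in range(layers):
        frames, feats = ceil(frames / stride[0]), ceil(feats / stride[1])
    return frames, feats, filters

f, d, c = conv_out(1600, 200)          # per route: 200 frames, 25 feats, 32 filters
reshaped = (f, d * c)                  # Reshape(200, -1): (200, 800)
summed = reshaped                      # add over A, B, C: dims unchanged
concatenated = (f, reshaped[1] * 3)    # concat of A, B, C: (200, 2400)
```

This is why the three routes can use different (odd) kernel sizes yet still be summed elementwise: their output shapes coincide by construction.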
Step 32', set loss function and optimization method.
In the training stage, since deep learning requires feedback learning, a CTC (Connectionist Temporal Classification) loss function is used as the loss function, and Adam (an optimization algorithm) is adopted as the learning optimization method. Batch Normalization layers on each two-dimensional convolution layer and on the GRU layer enable the model to train faster and more stably. After the bidirectional GRU and the unidirectional GRU, the first Dropout layer is set to Dropout(0.25), the second TimeDistributed layer is set to TimeDistributed(512), and the second Dropout layer at the tail is set to Dropout(0.25).
In the output layer, a character-based CTC end-to-end network model is designed. For the telephone voice transmission relay system of the OTA industry, using the CTC decoding algorithm, the character frequencies of all transcribed texts in the existing annotated telephone recording data are counted. Owing to the label design, some homophonic character labels can be merged: several characters sharing the same pronunciation are given one shared label and combined into one. Through this homophone processing, the original 12000 Chinese character dimensions are reduced to 3881 dimensions, so the labels can be set to the 3881 characters under the telephone voice information transmission center system, with an additional blank (blank mark). The computational complexity of the model label layer is greatly reduced, and the first TimeDistributed layer is set to TimeDistributed(3882).
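The homophone merging can be sketched as grouping characters by pronunciation and keeping one representative label per group. The snippet below is a hypothetical illustration; the pronunciation keys and the choice of the most frequent character as representative are assumptions, not details stated in the patent:

```python
from collections import defaultdict

def merge_homophones(char_freq, pron_of):
    """Map every character to a shared representative label per
    pronunciation group; the most frequent character in each group
    is chosen as the representative (an assumed tie-break rule)."""
    groups = defaultdict(list)
    for ch, freq in char_freq.items():
        groups[pron_of[ch]].append((freq, ch))
    label_of = {}
    for members in groups.values():
        rep = max(members)[1]   # highest-frequency character wins
        for _, ch in members:
            label_of[ch] = rep
    return label_of

# Toy stand-in data (not the patent's 12000-character inventory):
freq = {"a": 10, "b": 3, "c": 7}
pron = {"a": "yi1", "b": "yi1", "c": "er4"}
labels = merge_homophones(freq, pron)   # "a" and "b" share one label
```

Applied to the full character inventory, such a mapping shrinks the label set (12000 to 3881 in the text), after which one extra index is reserved for the CTC blank, giving the 3882-way output.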
Step 33', iteratively training the telephone speech recognition model according to the loss function and the optimization method by using the sample audio fragment and the speech text corresponding to the sample audio fragment.
To reduce the speech annotation workload, historical telephone recording data can be used as sample audio data. The sample audio data are channel-separated and split by silence cutting into several sample audio fragments; the fragments are then transcribed with a publicly available acoustic model to obtain the corresponding speech texts. With the sample audio fragments as input and the corresponding speech texts as output, the telephone speech recognition model is trained according to the loss function and the optimization method.
During model training, some parameters are sensitive to model iteration: the batch size is set to 64, the learning rate should not be too large initially and can be initialized to 0.0001, and the learning-rate decay coefficient is designed as 0.000001. With this initialization, the trained CTC loss and the transcription accuracy on the test set serve as the model convergence criteria. For the low-sampling-rate characteristic of the voice information transmission relay system under the OTA, when the outputs of the three routes A, B and C are dimension-processed, performing the summation operation greatly reduces the number of weights, keeping it essentially the same as that of a single route, while the multi-route multi-layer two-dimensional convolution overcomes the shortcoming of single-route two-dimensional convolution in extracting the frequency-domain features of the speech signal.
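The patent does not name the exact decay schedule; assuming the classic Keras-style time-based decay, the stated hyperparameters (initial learning rate 0.0001, decay coefficient 0.000001) would behave as sketched below:

```python
# Assumed time-based decay: lr_t = lr0 / (1 + decay * t),
# with the hyperparameters stated in the text.
LR0, DECAY = 1e-4, 1e-6

def lr_at(iteration):
    """Learning rate after `iteration` update steps under the
    assumed schedule; decays very slowly with decay = 1e-6."""
    return LR0 / (1.0 + DECAY * iteration)
```

With so small a decay coefficient the rate only halves after a million steps, which matches the text's intent of a modest initial rate that decays gently.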
Because the data for training the model are predicted by an existing public model, training proceeds by data iteration to further improve the accuracy of the telephone speech recognition model. The data iteration method is as follows: after a better model has been trained, it is used to predict text for the existing training and test data, and the transcription accuracy is then calculated;
samples whose transcription accuracy falls below a preset value, for example below 30%, are manually corrected and added back into the training set to train the model again.
The lower limit of the transcription accuracy is tightened with each iteration; for example, after one iteration only samples with transcription accuracy below 20% are manually corrected. Moreover, once a piece of voice data has been corrected more than a preset number of times, further manual correction can be skipped.
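The selection rule of this data-iteration loop can be sketched as follows; the sample structure and the correction-count cap are illustrative assumptions, not fields defined by the patent:

```python
def select_for_correction(samples, threshold, max_corrections=2):
    """Pick samples whose transcription accuracy falls below the
    current threshold, skipping those already corrected more than
    `max_corrections` times (the cap value is an assumption)."""
    return [s for s in samples
            if s["accuracy"] < threshold
            and s["times_corrected"] <= max_corrections]

# Threshold tightened from 30% to 20% across iterations, per the text.
thresholds = [0.30, 0.20]
```

Each round, the selected samples go to manual correction and rejoin the training set, so the threshold sequence directly controls how much annotation effort each iteration demands.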
In the prediction stage, for a segment of audio byte stream sent by the MRCP (Media Resource Control Protocol) party on the production line, first check, after VAD (voice activity detection) silence cutting, whether the byte-stream information is consistent with the length of the sent byte stream. If so, the duration is computed from the sampling rate; if the duration exceeds the 16.015-second network input length of the telephone speech recognition model, the audio is divided by VAD silence cutting into several audio fragments each within 16.015 seconds, and for each fragment a matrix whose per-frame spectrogram feature dimension is 200 is extracted.
Secondly, according to the telephone speech recognition model generated in the training stage, if the 200-dimensional spectrogram feature matrix from the first step is smaller than (1600, 200), it is zero-padded in the row direction to meet the model's input dimension requirement; after being fed to the model, a probability matrix of dimension (200, 3882) is output.
Thirdly, a beam-search-based CTC decoding algorithm performs beam-search decoding on the output probability matrix (200, 3882) and outputs the text corresponding to each cut audio fragment.
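As a simplified stand-in for the beam-search CTC decoding described above, the sketch below performs greedy best-path decoding over a (frames x vocabulary) probability matrix: per-frame argmax, collapse of repeated labels, then removal of blanks. A real implementation would keep several beams per step rather than only the best path:

```python
def ctc_greedy_decode(prob_matrix, blank):
    """Best-path CTC decoding: argmax per frame, collapse repeats,
    drop the blank label (index 3881 in the model described above)."""
    best = [max(range(len(row)), key=row.__getitem__) for row in prob_matrix]
    out, prev = [], None
    for idx in best:
        if idx != prev and idx != blank:
            out.append(idx)
        prev = idx
    return out
```

The decoded label indices map back to characters through the 3881-entry label table, and the fragment texts are then joined with commas in order, as the final step below describes.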
Finally, since prediction operates on the cut audio fragments and the union of all fragments is the complete audio, the texts transcribed from the several cut fragments are joined with commas in order, finally yielding the transcription of the complete audio.
Aiming at the problem of recognizing telephone speech as text in the current low-sampling-rate 8 kHz OTA telephone environment, this embodiment provides a telephone speech recognition method: based on a telephone speech recognition model established within the telephone voice information transmission system, the text expressed by the other party can be accurately recognized in real time while customer service converses with them. The network input size designed for the low sampling rate, together with the multi-route multi-layer two-dimensional convolution and network design, largely solves the acoustic-model recognition problem in the low-sampling-rate environment; meanwhile, the homophone merging method greatly increases the training speed of the network model and accelerates its data iteration and optimization. The telephone speech recognition method improves the training efficiency of telephone speech recognition in this scenario without reducing its evaluation performance; it can greatly improve the accuracy of customer service processing, save labor in OTA customer service operation, reduce errors in OTA customer service processing, help subsequent offline customer service quality inspection, and improve the response speed of OTA platform customer service.
Example 3
The present embodiment provides a telephone voice recognition system implemented by applying the telephone voice recognition model as in embodiment 1, as shown in fig. 7, the voice recognition system including: a cutting module 201, a feature extraction module 202, a prediction module 203, a decoding module 204 and a training module 205;
the cutting module 201 is configured to mute-cut an input phone voice to obtain an audio clip;
the feature extraction module 202 is configured to perform feature extraction on the audio segment to obtain a spectrogram feature;
the prediction module 203 is configured to input the spectrogram feature to a phone voice recognition model to obtain a probability matrix of a voice text corresponding to the spectrogram feature;
the decoding module 204 is configured to decode the probability matrix to obtain a voice text corresponding to the spectrogram feature.
The decoding module 204 is further configured to establish a word stock according to the historical telephone voice data, perform homophone merging on the word stock, and label the merged word stock to obtain character labels;
the decoding module 204 is further configured to decode the probability matrix according to the word label by using a CTC decoding algorithm to obtain a voice text;
the training module 205 is configured to set parameters of an input layer, a multi-layer two-dimensional convolution layer, a GRU layer and an output layer respectively; the method is also used for setting a loss function and an optimization method; and iteratively training the telephone voice recognition model according to the loss function and the optimization method by utilizing the sample audio fragment and the voice text corresponding to the sample audio fragment.
The training module 205 is further configured to set a network size adjustment parameter of the input layer according to a sampling rate of the voice data; and respectively setting convolution parameters of each routing multilayer two-dimensional convolution layer according to the network size adjustment parameters, wherein the convolution parameters comprise at least one of the number of filters, step sizes, convolution kernel sizes, filling and output dimensions.
The sampling rate of the telephone recording data is 8 kHz. In this embodiment, the maximum network input length is designed to be 1600, that is, the maximum number of speech frames is 1600, with a frame shift of 10 ms (milliseconds) and a frame length of 25 ms, giving a maximum input audio duration of 16.015 seconds. The input features are the raw spectrogram features of the speech signal, so the network input size is designed as (1600, 200); the Reshape layer then readjusts the size to (1600, 200, 1), and the multi-route multi-layer two-dimensional convolution layer 2 connected behind it performs feature extraction.
This embodiment takes three-route two-dimensional convolution as an example, comprising three routes A, B and C. In the feature extraction stage, the parameters of the three routes are set, and features are extracted through the three routes of multi-layer two-dimensional convolution; by design, the convolution kernel sizes of routes A, B and C differ from one another and are usually designed to be odd.
For the three-layer two-dimensional convolution of route A: the number of filters is set to 32 and the stride to (2, 2); the kernel sizes are set to (11, 41) and (11, 21), with padding set to 'same'; a Reshape(200, -1) operation follows, and the output dimension is (200, 800).
For the three-layer two-dimensional convolution of route B: the number of filters is set to 32 and the stride to (2, 2); the kernel sizes are set to (11, 21) and (11, 11), with padding set to 'same'; a Reshape(200, -1) operation follows, and the output dimension is (200, 800).
For the three-layer two-dimensional convolution of route C: the number of filters is set to 32 and the stride to (2, 2); the kernel sizes are set to (11, 11) and (11, 7), with padding set to 'same'; a Reshape(200, -1) operation follows, and the output dimension is (200, 800).
The outputs of routes A, B and C are then combined along the feature dimension. The network offers a choice of combination operation: with a summation (add) operation the resulting matrix dimension remains (200, 800); with a concat operation the output dimension becomes (200, 2400).
In the training stage, since deep learning requires feedback learning, the loss function is CTC (Connectionist Temporal Classification) and the learning optimization method adopts Adam (an optimization algorithm). Batch Normalization layers on each two-dimensional convolution layer and on the GRU layer enable the model to train faster and more stably. After the bidirectional GRU and the unidirectional GRU, the first Dropout layer is set to Dropout(0.25), the second TimeDistributed layer is set to TimeDistributed(512), and the second Dropout layer at the tail is set to Dropout(0.25).
In the output layer, a character-based CTC end-to-end network model is designed. For the telephone voice transmission relay system of the OTA industry, using the CTC decoding algorithm, the character frequencies of all transcribed texts in the existing annotated telephone recording data are counted. Owing to the label design, some homophonic character labels can be merged: several characters sharing the same pronunciation are given one shared label and combined into one. Through this homophone processing, the original 12000 Chinese character dimensions are reduced to 3881 dimensions, so the labels can be set to the 3881 characters under the telephone voice information transmission center system, with an additional blank. The computational complexity of the model label layer is greatly reduced, and the first TimeDistributed layer is set to TimeDistributed(3882).
To reduce the speech annotation workload, historical telephone recording data can be used as sample audio data. The sample audio data are channel-separated and split by silence cutting into several sample audio fragments; the fragments are then transcribed with a publicly available acoustic model to obtain the corresponding speech texts. With the sample audio fragments as input and the corresponding speech texts as output, the telephone speech recognition model is trained according to the loss function and the optimization method.
During model training, some parameters are sensitive to model iteration: the batch size is set to 64, the learning rate should not be too large initially and is initialized to 0.0001, and the learning-rate decay coefficient is designed as 0.000001. With this initialization, the trained CTC loss and the transcription accuracy on the test set serve as the model convergence criteria. For the low-sampling-rate characteristic of the voice information transmission relay system under the OTA, when the outputs of the three routes A, B and C are dimension-processed, performing the summation operation greatly reduces the number of weights, keeping it essentially the same as that of a single route, while the multi-route multi-layer two-dimensional convolution overcomes the shortcoming of single-route two-dimensional convolution in extracting the frequency-domain features of the speech signal.
Because the data for training the model are predicted by an existing public model, training proceeds by data iteration to further improve the accuracy of the telephone speech recognition model. The data iteration method is as follows: after a better model has been trained, it is used to predict text for the existing training and test data, and the transcription accuracy is then calculated;
samples whose transcription accuracy falls below a preset value, for example below 30%, are manually corrected and added back into the training set to train the model again.
The lower limit of the transcription accuracy is tightened with each iteration; for example, after one iteration only samples with transcription accuracy below 20% are manually corrected. Moreover, once a piece of voice data has been corrected more than a preset number of times, further manual correction can be skipped.
In the prediction stage, for a segment of audio byte stream sent by the MRCP party on the production line, first check, after VAD silence cutting, whether the byte-stream information is consistent with the length of the sent byte stream. If so, the duration is computed from the sampling rate; if the duration exceeds the 16.015-second network input length of the telephone speech recognition model, the audio is divided by VAD silence cutting into several audio fragments each within 16.015 seconds, and for each fragment a matrix whose per-frame spectrogram feature dimension is 200 is extracted.
Secondly, according to the telephone speech recognition model generated in the training stage, if the 200-dimensional spectrogram feature matrix from the first step is smaller than (1600, 200), it is zero-padded in the row direction to meet the model's input dimension requirement; after being fed to the model, a probability matrix of dimension (200, 3882) is output.
Thirdly, a beam-search-based CTC decoding algorithm performs beam-search decoding on the output probability matrix (200, 3882) and outputs the text corresponding to each cut audio fragment.
Finally, since prediction operates on the cut audio fragments and the union of all fragments is the complete audio, the texts transcribed from the several cut fragments are joined with commas in order, finally yielding the transcription of the complete audio.
Aiming at the problem of recognizing telephone speech as text in the current low-sampling-rate 8 kHz OTA telephone environment, this embodiment provides a telephone speech recognition system: based on a telephone speech recognition model established within the telephone voice information transmission system, the text expressed by the other party can be accurately recognized in real time while customer service converses with them. The network input size designed for the low sampling rate, together with the multi-route multi-layer two-dimensional convolution and network design, largely solves the model's recognition problem in the low-sampling-rate environment; meanwhile, the homophone merging method greatly increases the training speed of the network model and accelerates its data iteration and optimization. The telephone speech recognition system improves the training efficiency of telephone speech recognition in this scenario without reducing its evaluation performance; it can greatly improve the accuracy of customer service processing, save OTA customer service operation time, reduce errors in OTA customer service processing, help subsequent offline customer service quality inspection, and improve the response speed of OTA platform customer service.
Example 4
Fig. 8 is a schematic structural diagram of an electronic device according to embodiment 4 of the present invention. The electronic device includes a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the telephone speech recognition method of embodiment 2 when executing the program. The electronic device 50 shown in fig. 8 is merely an example and should not be construed as limiting the functionality and scope of use of embodiments of the present invention.
As shown in fig. 8, the electronic device 50 may be embodied in the form of a general purpose computing device, which may be a server device, for example. Components of electronic device 50 may include, but are not limited to: the at least one processor 51, the at least one memory 52, a bus 53 connecting the different system components, including the memory 52 and the processor 51.
The bus 53 includes a data bus, an address bus, and a control bus.
Memory 52 may include volatile memory such as Random Access Memory (RAM) 521 and/or cache memory 522, and may further include Read Only Memory (ROM) 523.
Memory 52 may also include a program/utility 525 having a set (at least one) of program modules 524, such program modules 524 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each or some combination of which may include an implementation of a network environment.
The processor 51 executes various functional applications and data processing, such as the telephone voice recognition method provided in embodiment 2 of the present invention, by running a computer program stored in the memory 52.
The electronic device 50 may also communicate with one or more external devices 54 (e.g., keyboard, pointing device, etc.). Such communication may occur through an input/output (I/O) interface 55. The electronic device 50 may also communicate with one or more networks, such as a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the Internet, via the network adapter 56. As shown, the network adapter 56 communicates with the other modules of the electronic device 50 via the bus 53. It should be appreciated that, although not shown in the figures, other hardware and/or software modules may be used in connection with the electronic device 50, including, but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID (disk array) systems, tape drives, data backup storage systems, and the like.
It should be noted that although several units/modules or sub-units/modules of the electronic device are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the present invention, the features and functions of two or more units/modules described above may be embodied in a single unit/module; conversely, the features and functions of one unit/module described above may be further divided into and embodied by a plurality of units/modules.
Example 5
The present embodiment provides a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the telephone speech recognition method provided by embodiment 2.
More specifically, the readable storage medium may include, but is not limited to: a portable disk, a hard disk, random access memory, read-only memory, erasable programmable read-only memory, an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In a possible embodiment, the invention may also be implemented in the form of a program product comprising program code for causing a terminal device to carry out the steps of implementing the telephone speech recognition method of embodiment 2, when the program product is run on the terminal device.
The program code for carrying out the invention may be written in any combination of one or more programming languages, and may execute entirely on the user device, partly on the user device as a stand-alone software package, partly on the user device and partly on a remote device, or entirely on the remote device.
While specific embodiments of the invention have been described above, it will be appreciated by those skilled in the art that this is by way of example only, and the scope of the invention is defined by the appended claims. Various changes and modifications to these embodiments may be made by those skilled in the art without departing from the principles and spirit of the invention, but such changes and modifications fall within the scope of the invention.

Claims (12)

1. A telephone speech recognition method, the telephone speech recognition method comprising:
performing mute cutting on input telephone voice to obtain an audio fragment;
extracting features of the audio fragment to obtain spectrogram features;
inputting the spectrogram characteristics into a telephone voice recognition model to obtain a probability matrix of a voice text corresponding to the spectrogram characteristics;
decoding the probability matrix to obtain a voice text corresponding to the spectrogram characteristic;
the telephone voice recognition model comprises an input layer, a multi-path multi-layer two-dimensional convolution layer, a GRU layer and an output layer;
the multi-path multi-layer two-dimensional convolution layer comprises a plurality of multi-layer two-dimensional convolutions; the plurality of multi-layer two-dimensional convolutions are arranged in parallel to form the multi-path multi-layer two-dimensional convolution layer, and the convolution kernel sizes of the multi-layer two-dimensional convolutions are different;
The input layer is used for receiving the spectrogram characteristics and adjusting the spectrogram characteristics to the characteristics of preset dimensions;
each path of the multipath multi-layer two-dimensional convolution layer is used for performing layer-by-layer convolution calculation on the characteristics of the preset dimension through multi-layer two-dimensional convolution respectively, and first characteristic data are obtained;
the GRU layer is used for processing the first characteristic data and obtaining processed second characteristic data;
the output layer is used for processing the second characteristic data to obtain a probability matrix of the voice characters corresponding to the spectrogram characteristics;
the step of decoding the probability matrix to obtain the voice text corresponding to the spectrogram features comprises the following steps:
establishing a word stock according to the historical telephone voice data;
homonymy and synonym combination are carried out on the word stock, and label marking is carried out on the combined word stock to obtain a word label;
and decoding the probability matrix by utilizing a CTC decoding algorithm according to the word label to obtain the voice text.
2. A method for telephony speech recognition according to claim 1, wherein,
before the step of inputting the spectrogram features into the telephone voice recognition model to obtain the probability matrix of the voice text corresponding to the spectrogram features, the method further comprises a step of training the telephone voice recognition model, the step of training the telephone voice recognition model comprising:
Parameters of the input layer, the multipath multi-layer two-dimensional convolution layer, the GRU layer and the output layer are respectively set;
setting a loss function and an optimization method;
and performing iterative training on the telephone voice recognition model according to the loss function and the optimization method by using a sample audio fragment and a voice text corresponding to the sample audio fragment.
3. A method for telephony speech recognition according to claim 2, wherein, when the method for telephony speech recognition includes the step of training the telephony speech recognition model,
the step of setting parameters of the input layer, the multipath multi-layer two-dimensional convolution layer, the GRU layer and the output layer respectively comprises the following steps:
setting a network size adjustment parameter of the input layer according to the sampling rate of voice data;
and respectively setting convolution parameters of each path of multi-path multi-layer two-dimensional convolution layer according to the network size adjustment parameters, wherein the convolution parameters comprise at least one of the number of filters, step sizes, convolution kernel sizes, filling and output dimensions.
4. The telephony speech recognition method of claim 1, wherein the input layer comprises a Reshape layer for receiving the spectrogram features and adjusting the spectrogram features to the features of the preset dimension;
And/or the number of the groups of groups,
the multi-path multi-layer two-dimensional convolution layer further comprises a characteristic operation layer, wherein the characteristic operation layer is used for carrying out summation or connection operation on the characteristics of preset dimensions of each path of multi-layer two-dimensional convolution processing to obtain the first characteristic data;
and/or the number of the groups of groups,
the multipath multi-layer two-dimensional convolution layer further comprises a first Batch Normalization layer, and the first Batch Normalization layer is arranged before and after each layer of two-dimensional convolution in the multi-layer two-dimensional convolution;
and/or the number of the groups of groups,
the output layer comprises a first TimeDistributed layer, and the first TimeDistributed layer is used for processing the second characteristic data to obtain the probability matrix.
5. The telephony speech recognition method of claim 1, wherein the GRU layers comprise bi-directional GRU layers and/or unidirectional GRU layers.
6. The telephony speech recognition method of claim 5, wherein the GRU layer further comprises at least one of a second Batch Normalization layer, a first Dropout layer, a second Dropout layer, and a second TimeDistributed layer.
7. The telephony speech recognition method of claim 6, wherein when the GRU layers include a bi-directional GRU layer and a uni-directional GRU layer, and when the GRU layers include a second Batch Normalization layer, a first Dropout layer, a second Dropout layer, and a second TimeDistributed layer, the second Batch Normalization layer is disposed between the bi-directional GRU layer and the uni-directional GRU layer, the first Dropout layer is disposed at the tail of the second Batch Normalization layer, and the second TimeDistributed layer is disposed between the first Dropout layer and the second Dropout layer.
8. A telephone speech recognition system, the speech recognition system comprising: the device comprises a cutting module, a characteristic extraction module, a prediction module and a decoding module;
the cutting module is used for carrying out mute cutting on the input telephone voice to obtain an audio fragment;
the feature extraction module is used for carrying out feature extraction on the audio fragment to obtain spectrogram features;
the prediction module is used for inputting the spectrogram characteristics into a telephone voice recognition model to obtain a probability matrix of a voice text corresponding to the spectrogram characteristics;
the decoding module is used for decoding the probability matrix to obtain a voice text corresponding to the spectrogram characteristic;
the telephone voice recognition model comprises an input layer, a plurality of layers of two-dimensional convolution layers, a GRU layer and an output layer;
the multi-path multi-layer two-dimensional convolution layer comprises a plurality of multi-path multi-layer two-dimensional convolution, a plurality of multi-layer two-dimensional convolution phases are parallel to form the multi-path multi-layer two-dimensional convolution layer, and the convolution kernel sizes of the multi-layer two-dimensional convolution are different;
the input layer is used for receiving the spectrogram characteristics and adjusting the spectrogram characteristics to the characteristics of preset dimensions;
each path of the multipath multi-layer two-dimensional convolution layer is used for performing layer-by-layer convolution calculation on the characteristics of the preset dimension through multi-layer two-dimensional convolution respectively, and first characteristic data are obtained;
The GRU layer is used for processing the first characteristic data and obtaining processed second characteristic data;
the output layer is used for processing the second characteristic data to obtain a probability matrix of the voice characters corresponding to the spectrogram characteristics;
the decoding module is also used for establishing a word stock according to the historical telephone voice data, carrying out homonymy combination on the word stock, and labeling the combined word stock to obtain a word label;
the decoding module is also used for decoding the probability matrix by utilizing a CTC decoding algorithm according to the word label to obtain the voice text.
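Claim 8's decoding module turns the probability matrix into text with a CTC decoding algorithm over homophone-merged word labels. Below is a greedy (best-path) CTC sketch in NumPy; the three-label vocabulary is hypothetical, standing in for the merged word stock, and the patent's decoder may well use beam search rather than this simplest variant:

```python
import numpy as np

BLANK = 0  # index of the CTC blank class

# Hypothetical word labels after homophone merging: characters that share
# one pronunciation share one label. Entries are illustrative only.
labels = {1: "ni", 2: "hao", 3: "ma"}

def ctc_greedy_decode(prob_matrix):
    """Best-path CTC decoding: argmax per frame, collapse repeats, drop blanks."""
    best = np.argmax(prob_matrix, axis=1)
    out, prev = [], BLANK
    for idx in best:
        if idx != prev and idx != BLANK:
            out.append(labels[int(idx)])
        prev = idx
    return out

# Toy probability matrix: 4 frames x 4 classes (blank + 3 word labels).
probs = np.array([
    [0.10, 0.80, 0.05, 0.05],  # frame 1 -> label 1 ("ni")
    [0.10, 0.80, 0.05, 0.05],  # frame 2 -> repeated "ni", collapsed
    [0.90, 0.05, 0.03, 0.02],  # frame 3 -> blank, separates tokens
    [0.10, 0.05, 0.80, 0.05],  # frame 4 -> label 2 ("hao")
])
# ctc_greedy_decode(probs) -> ["ni", "hao"]
```

The collapse-repeats-then-drop-blanks rule is shared by greedy and beam-search CTC decoders; only the path search differs.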
9. The telephone speech recognition system of claim 8, wherein the speech recognition system further comprises a training module, the training module being used for setting parameters of the input layer, the multi-path multi-layer two-dimensional convolution layer, the GRU layer and the output layer respectively; for setting a loss function and an optimization method; and for iteratively training the telephone voice recognition model according to the loss function and the optimization method, using sample audio segments and the voice texts corresponding to the sample audio segments.
10. The telephone speech recognition system of claim 9, wherein the training module is further used for setting a network size adjustment parameter of the input layer according to the sampling rate of the voice data, and for setting convolution parameters of each path of the multi-path multi-layer two-dimensional convolution layer according to the network size adjustment parameter, the convolution parameters comprising at least one of the number of filters, stride, convolution kernel size, padding and output dimension.
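Claim 10 ties the input layer's network size adjustment parameter to the sampling rate of the voice data. One plausible reading is that the spectrogram's frequency and time dimensions follow from the sampling rate and the framing parameters; the sketch below uses illustrative 25 ms frames with a 10 ms hop, which are common defaults rather than values disclosed by the patent:

```python
def input_dims(sample_rate_hz, frame_ms=25, hop_ms=10):
    """Derive spectrogram input dimensions from the sampling rate.

    frame_ms and hop_ms are common framing defaults, not values from the
    patent; the claimed network size adjustment parameter would be derived
    from dimensions like these.
    """
    n_fft = int(sample_rate_hz * frame_ms / 1000)  # samples per analysis frame
    freq_bins = n_fft // 2 + 1                     # one-sided spectrum width
    frames_per_second = 1000 // hop_ms             # time resolution
    return freq_bins, frames_per_second

# Telephone audio is commonly sampled at 8 kHz:
# input_dims(8000)  -> (101, 100)
# input_dims(16000) -> (201, 100)  # wideband audio needs a wider input
```

Doubling the sampling rate roughly doubles the frequency dimension, which is why the convolution parameters of each path would need rescaling by the same adjustment parameter.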
11. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the telephone speech recognition method of any one of claims 1-7 when executing the computer program.
12. A computer-readable storage medium on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the telephone speech recognition method of any one of claims 1-7.
CN201911234303.9A 2019-12-05 2019-12-05 Telephone voice recognition model, method, system, equipment and medium Active CN110930985B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911234303.9A CN110930985B (en) 2019-12-05 2019-12-05 Telephone voice recognition model, method, system, equipment and medium


Publications (2)

Publication Number Publication Date
CN110930985A CN110930985A (en) 2020-03-27
CN110930985B true CN110930985B (en) 2024-02-06

Family

ID=69857021

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911234303.9A Active CN110930985B (en) 2019-12-05 2019-12-05 Telephone voice recognition model, method, system, equipment and medium

Country Status (1)

Country Link
CN (1) CN110930985B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112562724B (en) * 2020-11-30 2024-05-17 携程计算机技术(上海)有限公司 Speech quality assessment model, training assessment method, training assessment system, training assessment equipment and medium

Citations (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4672667A (en) * 1983-06-02 1987-06-09 Scott Instruments Company Method for signal processing
EP1098466A1 (en) * 1999-11-08 2001-05-09 THOMSON multimedia Transmission and reception methods and device in a transmission system comprising convolutional interleaving/deinterleaving
CN103262156A (en) * 2010-08-27 2013-08-21 思科技术公司 Speech recognition language model
CN103810998A (en) * 2013-12-05 2014-05-21 中国农业大学 Method for off-line speech recognition based on mobile terminal device and achieving method
CN105654955A (en) * 2016-03-18 2016-06-08 华为技术有限公司 Voice recognition method and device
CN106409289A (en) * 2016-09-23 2017-02-15 合肥华凌股份有限公司 Environment self-adaptive method of speech recognition, speech recognition device and household appliance
CN106710589A (en) * 2016-12-28 2017-05-24 百度在线网络技术(北京)有限公司 Artificial intelligence-based speech feature extraction method and device
KR20180084464A (en) * 2017-01-17 2018-07-25 한국전자통신연구원 Sound modeling and voice recognition method based on multi-input DNN for long distance voice recognition
CN109272990A (en) * 2018-09-25 2019-01-25 江南大学 Audio recognition method based on convolutional neural networks
CN109272988A (en) * 2018-09-30 2019-01-25 江南大学 Audio recognition method based on multichannel convolutional neural networks
CN109545190A (en) * 2018-12-29 2019-03-29 联动优势科技有限公司 A kind of audio recognition method based on keyword
CN110299140A (en) * 2019-06-18 2019-10-01 浙江百应科技有限公司 A kind of key content extraction algorithm based on Intelligent dialogue
CN110415683A (en) * 2019-07-10 2019-11-05 上海麦图信息科技有限公司 A kind of air control voice instruction recognition method based on deep learning
CN110457684A (en) * 2019-07-15 2019-11-15 广州九四智能科技有限公司 The semantic analysis of smart phone customer service
KR20190129580A (en) * 2018-05-11 2019-11-20 삼성전자주식회사 Device and method to personlize voice recognition model

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7043431B2 (en) * 2001-08-31 2006-05-09 Nokia Corporation Multilingual speech recognition system using text derived recognition models
US9190053B2 * 2013-03-25 2015-11-17 The Governing Council Of The University Of Toronto System and method for applying a convolutional neural network to speech recognition
WO2019212267A1 (en) * 2018-05-02 2019-11-07 Samsung Electronics Co., Ltd. Contextual recommendation


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Speaker recognition algorithm based on convolutional neural networks; Hu Qing et al.; Journal of Computer Applications (计算机应用); 2016-06-30; Vol. 36; pp. 79-81, 200 *
Xin Yang et al. Principles and Practice of Big Data Technology. Beijing: Beijing University of Posts and Telecommunications Press, 2018, pp. 141-144. *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant