CN115602165A - Digital staff intelligent system based on financial system - Google Patents
- Publication number
- CN115602165A (application number CN202211090442.0A)
- Authority
- CN
- China
- Prior art keywords
- feature map
- channel
- spectrogram
- classification
- map
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/22—Procedures used during a speech recognition process, e.g. man-machine dialogue
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/02—Feature extraction for speech recognition; Selection of recognition unit
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/06—Creation of reference templates; Training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice
- G10L15/063—Training
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/16—Speech classification or search using artificial neural networks
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/08—Speech classification or search
- G10L15/18—Speech classification or search using natural language modelling
- G10L15/1822—Parsing for meaning understanding
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/20—Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS OR SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING; SPEECH OR AUDIO CODING OR DECODING
- G10L25/00—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00
- G10L25/03—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters
- G10L25/24—Speech or voice analysis techniques not restricted to a single one of groups G10L15/00 - G10L21/00 characterised by the type of extracted parameters the extracted parameters being the cepstrum
Abstract
The present application relates to the field of financial technology, and particularly discloses a digital employee intelligent system based on a financial system, which converts the digital employee's understanding of a client's consultation intention into a speech topic labeling problem. Specifically, multiple spectrograms are extracted from the speech signal and encoded and decoded by a deep neural network model to obtain a recognition result representing a predetermined intention topic label. In this way, on the premise of more accurately understanding the user's consultation demands, the system can respond to the client more reasonably and adaptively, improving the user's voice consultation experience.
Description
Technical Field
The present application relates to the field of financial technology, and more specifically to a digital employee intelligent system based on a financial system.
Background
With the development of computer technology, more and more technologies (such as big data and cloud computing) are applied in the financial field, and the traditional financial industry is gradually shifting to fintech. At present, digital employees (e.g., voice robots) are widely used in the financial field, for example for financial product promotion and payment collection. Digital employee implementations have benefited from the development of related technologies such as speech recognition and natural language understanding.
In actual operation, however, digital employees often draw customer complaints, largely because they fail to accurately understand the customer's intention and give irrelevant answers.
Therefore, an optimized digital employee intelligent solution for financial systems is desired.
Disclosure of Invention
The present application is proposed to solve the above technical problems. The embodiments of the present application provide a digital employee intelligent system based on a financial system, which converts the digital employee's understanding of a client's consultation intention into a speech topic labeling problem. Specifically, multiple spectrograms are extracted from the speech signal and encoded and decoded by a deep neural network model to obtain a recognition result representing a predetermined intention topic label. In this way, on the premise of more accurately understanding the user's consultation demands, the system can respond to the client more reasonably and adaptively, improving the user's voice consultation experience.
According to one aspect of the present application, there is provided a digital employee intelligent system based on a financial system, comprising:
a voice signal acquisition module for acquiring a consultation voice signal of a client;
a noise reduction module for passing the consultation voice signal through a signal noise reduction module based on an auto-encoder to obtain a noise-reduced consultation voice signal;
a voice spectrogram extraction module for extracting a log-Mel spectrogram, a cochleagram, and a constant-Q transform spectrogram from the noise-reduced consultation voice signal;
a multi-channel speech spectrogram construction module for arranging the log-Mel spectrogram, the cochleagram, and the constant-Q transform spectrogram into a multi-channel speech spectrogram;
a dual-stream encoding module for passing the multi-channel speech spectrogram through a dual-stream network model to obtain a classification feature map;
an adaptive correction module for correcting the feature value of each position in the classification feature map based on the statistical features of the set of feature values at all positions in the classification feature map to obtain a corrected classification feature map; and
a consultation intention recognition module for passing the corrected classification feature map through a classifier to obtain a classification result, the classification result being used to represent an intention topic label of the consultation voice signal.
In the above digital employee intelligent system based on a financial system, the noise reduction module includes: a speech signal encoding unit for inputting the consultation voice signal into an encoder of the signal noise reduction module, wherein the encoder performs explicit spatial encoding on the consultation voice signal using convolutional layers to obtain speech features; and a speech feature decoding unit for inputting the speech features into a decoder of the signal noise reduction module, wherein the decoder performs deconvolution processing on the speech features using deconvolution layers to obtain the noise-reduced consultation voice signal.
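As a rough illustration of the convolution/deconvolution structure of such a signal noise reduction module, the following NumPy sketch downsamples a 1-D signal with a strided convolution and restores its length with a transposed convolution. The kernels `w_enc` and `w_dec` are random stand-ins for weights that a real auto-encoder would learn from data; this is an assumed minimal sketch, not the patented implementation.

```python
import numpy as np

def conv1d(x, w, stride=2):
    """Valid 1-D convolution with stride; x: (T,), w: (k,)."""
    k = len(w)
    return np.array([x[i:i + k] @ w for i in range(0, len(x) - k + 1, stride)])

def deconv1d(h, w, stride=2):
    """Transposed 1-D convolution: spreads each input sample over k output samples."""
    k = len(w)
    out = np.zeros(stride * (len(h) - 1) + k)
    for i, v in enumerate(h):
        out[i * stride:i * stride + k] += v * w
    return out

rng = np.random.default_rng(0)
signal = rng.standard_normal(1024)      # stand-in for a consultation voice frame
w_enc = rng.standard_normal(4) * 0.1    # encoder kernel (would be learned)
w_dec = rng.standard_normal(4) * 0.1    # decoder kernel (would be learned)

features = conv1d(signal, w_enc)        # explicit spatial encoding (downsampled)
denoised = deconv1d(features, w_dec)    # deconvolution back to the signal length
```

Note that the transposed convolution exactly inverts the length arithmetic of the strided convolution, so the reconstructed signal matches the input length.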
In the above digital employee intelligent system based on a financial system, the dual-stream encoding module includes: a first convolutional encoding unit for inputting the multi-channel speech spectrogram into a first convolutional neural network of the dual-stream network model that uses a spatial attention mechanism to obtain a spatially enhanced feature map; a second convolutional encoding unit for inputting the multi-channel speech spectrogram into a second convolutional neural network of the dual-stream network model that uses a channel attention mechanism to obtain a channel-enhanced feature map; and an aggregation unit for fusing the spatially enhanced feature map and the channel-enhanced feature map to obtain the classification feature map.
In the above digital employee intelligent system based on a financial system, the first convolutional encoding unit includes: a deep convolutional encoding subunit for inputting the multi-channel speech spectrogram into the multi-layer convolutional layers of the first convolutional neural network to obtain a first convolutional feature map; a spatial attention subunit for inputting the first convolutional feature map into a spatial attention module of the first convolutional neural network to obtain a spatial attention map; and an attention applying subunit for computing the position-wise multiplication of the spatial attention map and the first convolutional feature map to obtain the spatially enhanced feature map.
In the above digital employee intelligent system based on a financial system, the spatial attention subunit is further configured to: perform convolutional encoding on the first convolutional feature map using the convolutional layers of the spatial attention module to obtain a spatially perceptive feature map; compute the position-wise multiplication between the spatially perceptive feature map and the first convolutional feature map to obtain a spatial attention score map; and input the spatial attention score map into a Sigmoid activation function to obtain the spatial attention map.
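The spatial attention path can be sketched in NumPy as below. A 1x1 convolution stands in for the module's convolutional encoding, and the weight matrix `w` is a random stand-in for learned kernels; this is an illustrative interpretation of the description, not the patented implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat, w):
    """feat: (C, H, W) first convolutional feature map; w: (C, C) 1x1-conv weights."""
    perceptive = np.einsum('oc,chw->ohw', w, feat)  # spatially perceptive feature map
    score = perceptive * feat                       # position-wise multiplication
    attn = sigmoid(score)                           # spatial attention map in (0, 1)
    return attn * feat                              # spatially enhanced feature map

rng = np.random.default_rng(1)
feat = rng.standard_normal((8, 16, 16))
w = rng.standard_normal((8, 8)) * 0.1
enhanced = spatial_attention(feat, w)
```

Because the attention map lies in (0, 1), the enhancement rescales each position of the feature map without changing its shape.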
In the above digital employee intelligent system based on a financial system, the second convolutional encoding unit includes: a deep convolutional encoding subunit for inputting the multi-channel speech spectrogram into the multi-layer convolutional layers of the second convolutional neural network to obtain a second convolutional feature map; a global average pooling subunit for computing the global mean of each feature matrix of the second convolutional feature map along the channel dimension to obtain a channel feature vector; a channel attention weight computation subunit for inputting the channel feature vector into a Sigmoid activation function to obtain a channel attention weight vector; and a channel attention applying subunit for weighting each feature matrix of the second convolutional feature map along the channel dimension by the feature value of the corresponding position in the channel attention weight vector to obtain the channel-enhanced feature map.
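The channel attention path can be sketched in NumPy as follows. This assumes the Sigmoid is applied directly to the pooled channel vector, as the description suggests; a random tensor stands in for the second convolutional feature map.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    """feat: (C, H, W) second convolutional feature map."""
    channel_vec = feat.mean(axis=(1, 2))      # global average pooling per channel
    weights = sigmoid(channel_vec)            # channel attention weight vector
    return feat * weights[:, None, None]      # weight each channel's feature matrix

rng = np.random.default_rng(2)
feat = rng.standard_normal((8, 16, 16))
enhanced = channel_attention(feat)
```

Each channel is scaled by a single weight in (0, 1), so channels with larger average activation are emphasized relative to the rest.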
In the above digital employee intelligent system based on a financial system, the aggregation unit is further configured to fuse the spatially enhanced feature map and the channel-enhanced feature map to obtain the classification feature map with the following formula:

F = λ · F1 ⊕ (1 − λ) · F2

where F is the classification feature map, F1 is the spatially enhanced feature map, F2 is the channel-enhanced feature map, "⊕" denotes the addition of elements at corresponding positions of the spatially enhanced feature map and the channel-enhanced feature map, and λ is a weighting parameter for controlling the balance between the spatially enhanced feature map and the channel-enhanced feature map in the classification feature map.
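Assuming the fusion takes the form of a weighted position-wise addition F = λ·F1 + (1 − λ)·F2 (an assumption consistent with the description of the position-wise addition and the balance weight λ), the aggregation step reduces to a one-line NumPy operation:

```python
import numpy as np

def fuse(f_spatial, f_channel, lam=0.5):
    """Weighted position-wise addition of the two stream outputs (assumed form)."""
    return lam * f_spatial + (1.0 - lam) * f_channel

rng = np.random.default_rng(3)
f1 = rng.standard_normal((8, 16, 16))   # spatially enhanced feature map
f2 = rng.standard_normal((8, 16, 16))   # channel-enhanced feature map
fused = fuse(f1, f2, lam=0.7)           # classification feature map
```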
In the above digital employee intelligent system based on a financial system, the adaptive correction module is further configured to correct the feature value of each position in the classification feature map with the following formula to obtain the corrected classification feature map:

f̂_i = (f_i − μ) / σ + α · log2(W × H × C)

where f_i is the feature value of each position of the classification feature map F, μ and σ² are the mean and variance of the feature value set {f_i}, W × H × C is the size of the classification feature map F, log2 is the logarithm with base 2, α is a weight hyperparameter, and f̂_i is the feature value of each position of the corrected classification feature map.
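Assuming the correction standardizes each feature value by the mean and standard deviation of the whole feature-value set and then adds a bias α·log2(W × H × C) (a form consistent with the variables the description lists, but an assumption nonetheless), the module can be sketched as:

```python
import numpy as np

def adaptive_correct(feat, alpha=0.1):
    """feat: (C, H, W) classification feature map; returns the corrected map
    under the assumed normalize-plus-log-bias form."""
    mu = feat.mean()                  # mean of the feature value set
    sigma = feat.std()                # standard deviation (sqrt of the variance)
    size = feat.size                  # W x H x C, the size of the map
    return (feat - mu) / sigma + alpha * np.log2(size)

rng = np.random.default_rng(4)
feat = rng.standard_normal((8, 16, 16))
corrected = adaptive_correct(feat)
```

The standardized part has zero mean, so the corrected map's mean equals the log-scaled bias, acting as the set-level invariance term the description mentions.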
In the above digital employee intelligent system based on a financial system, the consultation intention recognition module is further configured to process the corrected classification feature map using the classifier to generate the classification result with the following formula:

O = softmax{(W_n · ... · (W_1 · Project(F) + B_1) ... + B_n)}

where Project(F) denotes projecting the corrected classification feature map as a vector, W_1 to W_n are the weight matrices of the fully connected layers, and B_1 to B_n are the bias matrices of the fully connected layers.
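The classifier stage (projection of the feature map to a vector, a stack of fully connected layers, then Softmax over the intention topic labels) can be sketched as follows. The layer sizes and the six-label output are illustrative assumptions, and the weights are random stand-ins for trained parameters.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(feat, layers):
    """feat: corrected classification feature map; layers: list of (W, b) FC pairs."""
    v = feat.reshape(-1)              # Project(F): flatten the map to a vector
    for W, b in layers:
        v = W @ v + b                 # fully connected layer
    return softmax(v)                 # probabilities over intention topic labels

rng = np.random.default_rng(5)
feat = rng.standard_normal((4, 8, 8))                               # 256 values
layers = [(rng.standard_normal((32, 256)) * 0.05, np.zeros(32)),
          (rng.standard_normal((6, 32)) * 0.05, np.zeros(6))]       # 6 labels (illustrative)
probs = classify(feat, layers)
```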
According to another aspect of the present application, there is also provided a digital employee intelligent method based on a financial system, comprising:
acquiring a consultation voice signal of a client;
passing the consultation voice signal through a signal noise reduction module based on an auto-encoder to obtain a noise-reduced consultation voice signal;
extracting a log-Mel spectrogram, a cochleagram, and a constant-Q transform spectrogram from the noise-reduced consultation voice signal;
arranging the log-Mel spectrogram, the cochleagram, and the constant-Q transform spectrogram into a multi-channel speech spectrogram;
passing the multi-channel speech spectrogram through a dual-stream network model to obtain a classification feature map;
correcting the feature value of each position in the classification feature map based on the statistical features of the set of feature values at all positions in the classification feature map to obtain a corrected classification feature map; and
passing the corrected classification feature map through a classifier to obtain a classification result, the classification result being used to represent an intention topic label of the consultation voice signal.
In the above digital employee intelligent method based on a financial system, passing the consultation voice signal through the signal noise reduction module based on an auto-encoder to obtain the noise-reduced consultation voice signal includes: inputting the consultation voice signal into an encoder of the signal noise reduction module, wherein the encoder performs explicit spatial encoding on the consultation voice signal using convolutional layers to obtain speech features; and inputting the speech features into a decoder of the signal noise reduction module, wherein the decoder performs deconvolution processing on the speech features using deconvolution layers to obtain the noise-reduced consultation voice signal.
In the above digital employee intelligent method based on a financial system, passing the multi-channel speech spectrogram through the dual-stream network model to obtain the classification feature map includes: inputting the multi-channel speech spectrogram into a first convolutional neural network of the dual-stream network model that uses a spatial attention mechanism to obtain a spatially enhanced feature map; inputting the multi-channel speech spectrogram into a second convolutional neural network of the dual-stream network model that uses a channel attention mechanism to obtain a channel-enhanced feature map; and fusing the spatially enhanced feature map and the channel-enhanced feature map to obtain the classification feature map.
In the above digital employee intelligent method based on a financial system, inputting the multi-channel speech spectrogram into the first convolutional neural network of the dual-stream network model that uses a spatial attention mechanism to obtain the spatially enhanced feature map includes: inputting the multi-channel speech spectrogram into the multi-layer convolutional layers of the first convolutional neural network to obtain a first convolutional feature map; inputting the first convolutional feature map into a spatial attention module of the first convolutional neural network to obtain a spatial attention map; and computing the position-wise multiplication of the spatial attention map and the first convolutional feature map to obtain the spatially enhanced feature map.
In the above digital employee intelligent method based on a financial system, inputting the multi-channel speech spectrogram into the first convolutional neural network of the dual-stream network model that uses a spatial attention mechanism to obtain the spatially enhanced feature map further includes: performing convolutional encoding on the first convolutional feature map using the convolutional layers of the spatial attention module to obtain a spatially perceptive feature map; computing the position-wise multiplication between the spatially perceptive feature map and the first convolutional feature map to obtain a spatial attention score map; and inputting the spatial attention score map into a Sigmoid activation function to obtain the spatial attention map.
In the above digital employee intelligent method based on a financial system, inputting the multi-channel speech spectrogram into the second convolutional neural network of the dual-stream network model that uses a channel attention mechanism to obtain the channel-enhanced feature map includes: inputting the multi-channel speech spectrogram into the multi-layer convolutional layers of the second convolutional neural network to obtain a second convolutional feature map; computing the global mean of each feature matrix of the second convolutional feature map along the channel dimension to obtain a channel feature vector; inputting the channel feature vector into a Sigmoid activation function to obtain a channel attention weight vector; and weighting each feature matrix of the second convolutional feature map along the channel dimension by the feature value of the corresponding position in the channel attention weight vector to obtain the channel-enhanced feature map.
In the above digital employee intelligent method based on a financial system, fusing the spatially enhanced feature map and the channel-enhanced feature map to obtain the classification feature map further includes fusing them with the following formula:

F = λ · F1 ⊕ (1 − λ) · F2

where F is the classification feature map, F1 is the spatially enhanced feature map, F2 is the channel-enhanced feature map, "⊕" denotes the addition of elements at corresponding positions of the spatially enhanced feature map and the channel-enhanced feature map, and λ is a weighting parameter for controlling the balance between the spatially enhanced feature map and the channel-enhanced feature map in the classification feature map.
In the above digital employee intelligent method based on a financial system, correcting the feature value of each position in the classification feature map based on the statistical features of the set of feature values at all positions in the classification feature map to obtain the corrected classification feature map further includes correcting with the following formula:

f̂_i = (f_i − μ) / σ + α · log2(W × H × C)

where f_i is the feature value of each position of the classification feature map F, μ and σ² are the mean and variance of the feature value set {f_i}, W × H × C is the size of the classification feature map F, log2 is the logarithm with base 2, α is a weight hyperparameter, and f̂_i is the feature value of each position of the corrected classification feature map.
In the above digital employee intelligent method based on a financial system, passing the corrected classification feature map through a classifier to obtain a classification result further includes processing the corrected classification feature map using the classifier to generate the classification result with the following formula:

O = softmax{(W_n · ... · (W_1 · Project(F) + B_1) ... + B_n)}

where Project(F) denotes projecting the corrected classification feature map as a vector, W_1 to W_n are the weight matrices of the fully connected layers, and B_1 to B_n are the bias matrices of the fully connected layers.
Compared with the prior art, the digital employee intelligent system based on a financial system provided by the present application converts the digital employee's understanding of a client's consultation intention into a speech topic labeling problem. Specifically, multiple spectrograms are extracted from the speech signal and encoded and decoded by a deep neural network model to obtain a recognition result representing a predetermined intention topic label. In this way, on the premise of more accurately understanding the user's consultation demands, the system can respond to the client more reasonably and adaptively, improving the user's voice consultation experience.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 illustrates a block diagram of a digital employee intelligence system based on a financial system in accordance with an embodiment of the present application.
FIG. 2 illustrates a system architecture diagram of a digital employee intelligence system based on a financial system in accordance with an embodiment of the present application.
FIG. 3 illustrates a block diagram of a dual stream encoding module in a digital employee intelligence system based on a financial system in accordance with an embodiment of the present application.
FIG. 4 illustrates a block diagram of a first convolution encoding unit in a digital employee intelligence system based on a financial system according to an embodiment of the present application.
FIG. 5 illustrates a block diagram of a second convolutional encoding unit in a digital employee intelligence system based on a financial system, according to an embodiment of the present application.
FIG. 6 illustrates a flow chart of a digital employee intelligence method based on a financial system in accordance with an embodiment of the application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Summary of the application
In the technical solution of the present application, the digital employee's understanding of the client's consultation intention can be converted into a speech topic labeling problem, that is, the speech signal is encoded and understood in an appropriate manner and a predetermined intention topic label is assigned to the speech signal, which can be realized by a two-stage model of a feature extractor plus a classifier.
However, different clients have different language expression habits when consulting, which poses great challenges to the understanding of speech semantics. Meanwhile, users may exhibit phenomena such as slurred speech, accents, and varying intonation when expressing their problems, and because users often express their consultation demands through a communication system, considerable noise is introduced during the collection and transmission of the speech signal. These technical problems lead to low accuracy in understanding the consultation intention.
Correspondingly, in the technical solution of the present application, after the client's consultation voice signal is obtained, it is first passed through a signal noise reduction module based on an auto-encoder to obtain a noise-reduced consultation voice signal. In particular, the auto-encoder based signal noise reduction module includes an encoder that uses convolutional layers and a decoder that uses deconvolution layers. Accordingly, the denoising process of the signal noise reduction module first uses the encoder to perform explicit spatial encoding on the consultation voice signal with the convolutional layers to extract speech features (with the noise filtered out), and then uses the decoder to perform deconvolution processing on the speech features with the deconvolution layers to obtain the noise-reduced consultation voice signal.
To improve the accuracy of semantic understanding of the noise-reduced consultation voice signal, it is converted into spectrograms. A spectrogram is a perceptual map composed of time, frequency, and energy; it is a visible language of the speech signal that provides rich visual information, combines time-domain and frequency-domain analysis, and reflects both the frequency content of the signal and how that content changes over time.
In particular, in the technical solution of the present application, in order to capture richer acoustic spectrum information, a log-Mel spectrogram, a cochleagram, and a constant-Q transform spectrogram are extracted from the noise-reduced consultation voice signal, respectively. It should be understood that the log-Mel spectrogram is the most widely used feature; its design mimics the characteristics of the human ear, which has different acoustic sensitivities to sounds of different frequencies. The extraction flow of the log-Mel spectrogram is similar to that of MFCC, but omits the final linear transformation, i.e., the discrete cosine transform; removing this step preserves more high-order and non-linear information of the sound signal. The cochleagram is obtained by a Gammatone filter bank that simulates the frequency selectivity of the human cochlea, and its frequency response is more consistent with the auditory characteristics of the human ear. The constant-Q transform spectrogram provides better frequency resolution at low frequencies and better time resolution at high frequencies, thereby better mimicking the behavior of the human auditory system.
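As an illustration of one of the three features, the following is a minimal log-Mel extraction in NumPy: STFT magnitude, a triangular mel filterbank, then a logarithm (the same pipeline as MFCC but without the final DCT). The frame, FFT, and filterbank parameters are illustrative assumptions; cochleagram and CQT extraction would follow analogous filterbank designs.

```python
import numpy as np

def log_mel_spectrogram(signal, sr=16000, n_fft=512, hop=256, n_mels=40):
    """Minimal log-Mel extraction: STFT magnitude -> mel filterbank -> log."""
    # frame, window, and FFT magnitude
    frames = [signal[i:i + n_fft] * np.hanning(n_fft)
              for i in range(0, len(signal) - n_fft + 1, hop)]
    mag = np.abs(np.fft.rfft(frames, axis=1))              # (T, n_fft//2 + 1)
    # triangular mel filterbank
    hz2mel = lambda f: 2595 * np.log10(1 + f / 700)
    mel2hz = lambda m: 700 * (10 ** (m / 2595) - 1)
    mel_pts = mel2hz(np.linspace(0, hz2mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * mel_pts / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        fbank[m - 1, l:c] = (np.arange(l, c) - l) / max(c - l, 1)
        fbank[m - 1, c:r] = (r - np.arange(c, r)) / max(r - c, 1)
    return np.log(mag @ fbank.T + 1e-8)                    # (T, n_mels)

t = np.arange(16000) / 16000
sig = np.sin(2 * np.pi * 440 * t)                          # 1-second test tone
mel = log_mel_spectrogram(sig)
```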
Then, the log-Mel spectrogram, the cochleagram, and the constant-Q transform spectrogram are arranged into a multi-channel speech spectrogram. Arranging the three spectrograms along the channel dimension gives the data input to the neural network model a relatively larger width. On the one hand, this provides richer material for the speech feature extraction of the neural network model; on the other hand, correlations exist among the spectrograms in the multi-channel speech spectrogram, and exploiting this internal correlation can improve the accuracy and richness of the speech feature extraction.
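Arranging the three spectrograms along the channel dimension is a simple stack, assuming they have been brought to a common time-frequency grid. The random arrays below are stand-ins for the three extracted spectrograms.

```python
import numpy as np

rng = np.random.default_rng(6)
# stand-ins for the three extracted spectrograms on a common (T, F) grid
log_mel = rng.standard_normal((64, 40))
cochlea = rng.standard_normal((64, 40))
cqt = rng.standard_normal((64, 40))

# arrange along a new channel dimension -> (3, T, F), the multi-channel speech spectrogram
multi_channel = np.stack([log_mel, cochlea, cqt], axis=0)
```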
Specifically, in the technical solution of the present application, a dual-stream network model comprising a first convolutional neural network that uses a spatial attention mechanism and a second convolutional neural network that uses a channel attention mechanism is used to process the multi-channel speech spectrogram to obtain a classification feature map. Here, the two convolutional neural networks of the dual-stream network model perform, in a parallel structure, spatial-attention-enhanced explicit spatial encoding and channel-attention-enhanced explicit spatial encoding on the multi-channel speech spectrogram, respectively, and the spatially enhanced feature map and the channel-enhanced feature map obtained from the two encodings are aggregated to obtain the classification feature map.
However, when the dual-stream network structure fuses the spatially enhanced feature map output by the first convolutional neural network with the channel-enhanced feature map output by the second convolutional neural network, the misalignment between the spatial-dimension distribution of the spatially enhanced feature map and the channel-dimension distribution of the channel-enhanced feature map may produce out-of-distribution feature values in the classification feature map, which degrades the classification effect of the classification feature map.
Therefore, information statistical normalization of the adaptive instance is performed on the classification feature map, specifically:

f̂_i = (f_i − μ) / σ + α · log2(W × H × C)

where f_i is the feature value of each position of the classification feature map F, μ and σ² are the mean and variance of the feature value set {f_i}, W × H × C is the size of the classification feature map F, log2 is the logarithm with base 2, and α is a weight hyperparameter.

The information statistical normalization of the adaptive instance takes the feature value set of the classification feature map as an adaptive instance, uses the intrinsic prior information of its statistical features to perform dynamic, generative information normalization on each individual feature value, and uses the normalized modulus-length information of the feature set as a bias serving as an invariance description within the set's distribution domain. This realizes a feature optimization that shields the distribution disturbance of particular instances as much as possible and improves the classification effect of the classification feature map. In this way, the accuracy of intention understanding and recognition for the consultation voice signal is improved.
Based on this, the present application proposes a digital staff intelligent system based on a financial system, which includes: a voice signal acquisition module for acquiring a consultation voice signal of a client; a noise reduction module for passing the consultation voice signal through a signal noise reduction module based on an automatic encoder to obtain a noise-reduced consultation voice signal; a voice spectrogram extraction module for extracting a log Mel spectrogram, a cochlear spectrogram and a constant Q-transform spectrogram from the noise-reduced consultation voice signal; a multi-channel voice spectrogram construction module for arranging the log Mel spectrogram, the cochlear spectrogram and the constant Q-transform spectrogram into a multi-channel voice spectrogram; a dual-flow encoding module for passing the multi-channel voice spectrogram through a dual-flow network model to obtain a classification feature map; an adaptive correction module for correcting the feature values at each position in the classification feature map based on the statistical features of the feature value set of all positions in the classification feature map to obtain a corrected classification feature map; and a consultation intention recognition module for passing the corrected classification feature map through a classifier to obtain a classification result, the classification result being used to represent an intention topic label of the consultation voice signal.
Having described the general principles of the present application, various non-limiting embodiments of the present application will now be described with reference to the accompanying drawings.
Exemplary System
FIG. 1 illustrates a block diagram of a digital staff intelligent system based on a financial system in accordance with an embodiment of the present application. As shown in fig. 1, a digital staff intelligent system 100 based on a financial system according to an embodiment of the present application includes: a voice signal acquisition module 110 for acquiring a consultation voice signal of the client; a noise reduction module 120, configured to pass the consultation voice signal through a signal noise reduction module based on an automatic encoder to obtain a noise-reduced consultation voice signal; a voice spectrogram extraction module 130 for extracting a log Mel spectrogram, a cochlear spectrogram and a constant Q-transform spectrogram from the noise-reduced consultation voice signal; a multi-channel voice spectrogram construction module 140, configured to arrange the log Mel spectrogram, the cochlear spectrogram and the constant Q-transform spectrogram into a multi-channel voice spectrogram; a dual-flow encoding module 150, configured to pass the multi-channel voice spectrogram through a dual-flow network model to obtain a classification feature map; an adaptive correction module 160, configured to correct the feature values at each position in the classification feature map based on the statistical features of the feature value set of all positions in the classification feature map to obtain a corrected classification feature map; and a consultation intention recognition module 170, configured to pass the corrected classification feature map through a classifier to obtain a classification result, where the classification result is used to represent an intention topic label of the consultation voice signal.
FIG. 2 illustrates a system architecture diagram of the digital staff intelligent system 100 based on a financial system in accordance with an embodiment of the present application. As shown in fig. 2, in the system architecture of the digital staff intelligent system 100 based on a financial system, a consultation voice signal of a client is first acquired. The consultation voice signal is then passed through the signal noise reduction module based on an automatic encoder to obtain a noise-reduced consultation voice signal. Next, a log Mel spectrogram, a cochlear spectrogram and a constant Q-transform spectrogram are extracted from the noise-reduced consultation voice signal, and the three are arranged into a multi-channel voice spectrogram. The multi-channel voice spectrogram is then passed through the dual-flow network model to obtain a classification feature map. Then, based on the statistical features of the feature value set of all positions in the classification feature map, the feature values at each position in the classification feature map are corrected to obtain a corrected classification feature map. Finally, the corrected classification feature map is passed through a classifier to obtain a classification result, which is used to represent an intention topic label of the consultation voice signal.
In the above-mentioned digital staff intelligent system 100 based on a financial system, the voice signal acquisition module 110 is used to acquire a consultation voice signal of the client. In the technical solution of the present application, the digital staff's understanding of the client's consultation intention can be converted into a voice topic labeling problem, that is, the voice signal is encoded and understood in an appropriate manner, and a predetermined intention topic label is assigned to the voice signal, which can be realized by a two-stage model of a feature extractor plus a classifier.
However, when clients consult, different clients have different language expression habits, which poses a great challenge to the understanding of speech semantics; meanwhile, phenomena such as pauses, accents and intonation variations occur when users express their problems. Moreover, because users often express their consultation needs through a communication system, considerable noise is introduced during the acquisition and transmission of the voice signal. These technical problems lead to low accuracy of consultation intention understanding. Therefore, in the technical solution of the present application, a consultation voice signal of the client is first acquired.
In the above-mentioned digital staff intelligent system 100 based on a financial system, the noise reduction module 120 is configured to pass the consultation voice signal through the signal noise reduction module based on an automatic encoder to obtain a noise-reduced consultation voice signal. That is, after the consultation voice signal of the client is obtained, it is passed through the signal noise reduction module based on an automatic encoder to obtain the noise-reduced consultation voice signal. In particular, the automatic encoder based signal noise reduction module includes an encoder using convolutional layers and a decoder using deconvolution layers. Accordingly, the denoising process of the signal noise reduction module includes first performing explicit spatial encoding on the consultation voice signal with the convolutional layers of the encoder to extract speech features from the consultation voice signal (filtering out the noise in the process), and then performing deconvolution processing on the speech features with the deconvolution layers of the decoder to obtain the noise-reduced consultation voice signal.
In one example, in the above-mentioned digital staff intelligent system 100 based on a financial system, the noise reduction module 120 includes: a speech signal encoding unit for inputting the consultation voice signal into an encoder of the signal noise reduction module, wherein the encoder performs explicit spatial encoding on the consultation voice signal using a convolutional layer to obtain speech features; and a speech feature decoding unit for inputting the speech features into a decoder of the signal noise reduction module, wherein the decoder performs deconvolution processing on the speech features using a deconvolution layer to obtain the noise-reduced consultation voice signal.
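The encoder-decoder round trip described above can be sketched in miniature. The following NumPy example is an illustrative stand-in, not the patent's trained auto-encoder: the smoothing kernel, stride, and test signal are all assumptions. It shows a strided 1-D convolution compressing a noisy signal into features and a transposed (de-)convolution mapping them back to the original length:

```python
import numpy as np

def conv1d(x, w, stride=2):
    # valid strided 1-D convolution (cross-correlation): the "encoder"
    k = len(w)
    out_len = (len(x) - k) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + k], w) for i in range(out_len)])

def deconv1d(h, w, stride=2):
    # transposed convolution: scatter each latent value through the kernel
    k = len(w)
    out = np.zeros((len(h) - 1) * stride + k)
    for i, v in enumerate(h):
        out[i * stride:i * stride + k] += v * w
    return out

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 256, endpoint=False)
clean = np.sin(2 * np.pi * 5 * t)
noisy = clean + 0.3 * rng.standard_normal(t.size)

w = np.ones(4) / 4.0        # averaging kernel stands in for learned weights
latent = conv1d(noisy, w)   # explicit spatial (temporal) encoding of the signal
recon = deconv1d(latent, w) # deconvolution back to the original signal length
```

With stride 2 and kernel length 4, a 256-sample signal yields 127 latent values, and the transposed convolution restores the 256-sample length.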
In the above-mentioned digital staff intelligent system 100 based on a financial system, the voice spectrogram extraction module 130 is configured to extract a log Mel spectrogram, a cochlear spectrogram and a constant Q-transform spectrogram from the noise-reduced consultation voice signal. In order to improve the accuracy of semantic understanding of the noise-reduced consultation voice signal, the noise-reduced consultation voice signal is converted into spectrograms. A spectrogram is a perceptual representation formed by the three components of time, frequency and energy; it is a visible form of the voice signal that provides rich visual information, combines time-domain and frequency-domain analysis, and reflects both the frequency content of the signal and how that content changes over time.
Particularly, in the technical solution of the present application, in order to capture richer acoustic spectrum information, a log Mel spectrogram, a cochlear spectrogram and a constant Q-transform spectrogram are extracted from the noise-reduced consultation voice signal, respectively. It will be appreciated that the log Mel spectrogram is the most widely used feature; its design mimics the characteristics of the human ear, which has different acoustic sensitivities to sounds of different frequencies. The extraction flow of the log Mel spectrogram is similar to that of MFCCs, but omits the final linear transformation, namely the discrete cosine transform; removing this step preserves more high-order and nonlinear information of the sound signal. The cochlear spectrogram is obtained by a Gammatone filter bank that simulates the frequency selectivity of the human cochlea, whose frequency response is more consistent with the auditory characteristics of the human ear. The constant Q-transform spectrogram provides better frequency resolution at low frequencies and better time resolution at high frequencies, thereby better mimicking the behavior of the human auditory system.
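Of the three representations, the log Mel spectrogram can be sketched compactly from first principles. The following NumPy example is illustrative only (FFT size, hop, filter count and the test tone are assumptions); in practice a library such as librosa would typically be used, and the cochleagram and constant-Q spectrogram would come from Gammatone and CQT implementations respectively:

```python
import numpy as np

def stft_power(x, n_fft=256, hop=128):
    # power spectrogram via a Hann-windowed short-time Fourier transform
    win = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * win for i in range(0, len(x) - n_fft + 1, hop)]
    spec = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return spec.T                                  # (n_fft//2+1, n_frames)

def mel_filterbank(n_mels, n_fft, sr):
    # triangular filters on the mel scale: mel(f) = 2595*log10(1 + f/700)
    mel = lambda f: 2595 * np.log10(1 + f / 700)
    inv = lambda m: 700 * (10 ** (m / 2595) - 1)
    pts = inv(np.linspace(0, mel(sr / 2), n_mels + 2))
    bins = np.floor((n_fft + 1) * pts / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        l, c, r = bins[i], bins[i + 1], bins[i + 2]
        fb[i, l:c] = (np.arange(l, c) - l) / max(c - l, 1)   # rising edge
        fb[i, c:r] = (r - np.arange(c, r)) / max(r - c, 1)   # falling edge
    return fb

sr = 8000
t = np.arange(sr) / sr
signal = np.sin(2 * np.pi * 440 * t)               # 1 s of a 440 Hz tone
power = stft_power(signal)
# log of mel-filtered energies; no DCT, unlike the MFCC pipeline
log_mel = np.log(mel_filterbank(40, 256, sr) @ power + 1e-10)
```

Note the absence of the discrete cosine transform at the end, which is precisely the difference from MFCC extraction described above.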
In the above digital staff intelligent system 100 based on a financial system, the multi-channel voice spectrogram construction module 140 is configured to arrange the log Mel spectrogram, the cochlear spectrogram and the constant Q-transform spectrogram into a multi-channel voice spectrogram. That is, the log Mel spectrogram, the cochlear spectrogram and the constant Q-transform spectrogram are arranged along the channel dimension to obtain the multi-channel voice spectrogram, so that the data input into the neural network model has a relatively larger width. On the one hand, this provides richer material for the speech feature extraction of the neural network model; on the other hand, correlations exist among the spectrograms in the multi-channel voice spectrogram, and exploiting these internal correlations can improve the accuracy and richness of speech feature extraction.
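The arrangement along the channel dimension amounts to stacking the three time-frequency maps into one image-like tensor. A minimal sketch (the spectrogram contents and sizes here are random stand-ins for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
# stand-ins for the three spectrograms, each of shape (n_bins, n_frames)
log_mel = rng.standard_normal((40, 61))
cochleagram = rng.standard_normal((40, 61))
cqt = rng.standard_normal((40, 61))

# arrange along a new channel dimension -> (3, 40, 61), like a 3-channel image
multi_channel = np.stack([log_mel, cochleagram, cqt], axis=0)
```

The resulting (channels, height, width) layout is the conventional input format for convolutional networks.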
In the above digital staff intelligent system 100 based on a financial system, the dual-flow encoding module 150 is configured to pass the multi-channel voice spectrogram through a dual-flow network model to obtain the classification feature map. Specifically, in the technical solution of the present application, a dual-flow network model including a first convolutional neural network using a spatial attention mechanism and a second convolutional neural network using a channel attention mechanism is used to process the multi-channel speech spectrogram to obtain the classification feature map. Here, the first convolutional neural network using the spatial attention mechanism and the second convolutional neural network using the channel attention mechanism of the dual-flow network model respectively perform spatially-enhanced explicit spatial encoding and channel-enhanced explicit spatial encoding on the multi-channel speech spectrogram in a parallel structure, and the spatial enhancement feature map and the channel enhancement feature map obtained by the two branches are aggregated to obtain the classification feature map.
FIG. 3 illustrates a block diagram of a dual stream encoding module in a digital employee intelligence system based on a financial system in accordance with an embodiment of the present application. As shown in fig. 3, in the above-mentioned digital employee intelligent system 100 based on a financial system, the dual-stream encoding module 150 includes: a first convolution encoding unit 151, configured to input the multichannel speech spectrogram into a first convolution neural network of the dual-flow network model using a spatial attention mechanism to obtain a spatial enhancement feature map; a second convolution coding unit 152, configured to input the multi-channel speech spectrogram into a second convolution neural network of the dual-flow network model, where the second convolution neural network uses a channel attention mechanism, so as to obtain a channel enhancement feature map; and an aggregation unit 153, configured to fuse the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map.
Fig. 4 illustrates a block diagram of a first convolutional encoding unit in a digital staff intelligent system based on a financial system in accordance with an embodiment of the present application. As shown in fig. 4, in the above-mentioned digital staff intelligent system 100 based on a financial system, the first convolutional encoding unit 151 includes: a deep convolutional encoding subunit 1511, configured to input the multi-channel voice spectrogram into the multilayer convolutional layers of the first convolutional neural network to obtain a first convolution feature map; a spatial attention subunit 1512, configured to input the first convolution feature map into a spatial attention module of the first convolutional neural network to obtain a spatial attention map; and an attention applying subunit 1513, configured to calculate the position-wise point multiplication of the spatial attention map and the first convolution feature map to obtain the spatial enhancement feature map.
In one example, in the above-mentioned digital staff intelligent system 100 based on a financial system, the spatial attention subunit 1512 is further configured to: perform convolutional encoding on the first convolution feature map using the convolutional layers of the spatial attention module to obtain a spatial perception feature map; calculate the position-wise multiplication between the spatial perception feature map and the first convolution feature map to obtain a spatial attention score map; and input the spatial attention score map into a Sigmoid activation function to obtain the spatial attention map.
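The spatial attention pipeline described above can be sketched in NumPy. This is a simplified illustration, not the patent's implementation: a 1x1 convolution stands in for the attention module's convolutional layers, and the score map is averaged over channels before the Sigmoid, an assumption made here to obtain a single spatial map:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def spatial_attention(feat, w):
    # feat: (C, H, W) first convolution feature map; w: (C,) 1x1-conv weights
    spatial = np.tensordot(w, feat, axes=1)            # (H, W) spatial perception map
    score = spatial[None, :, :] * feat                 # position-wise multiplication
    attn = sigmoid(score.mean(axis=0, keepdims=True))  # Sigmoid -> (1, H, W) attention map
    return attn * feat                                 # apply attention point-wise

rng = np.random.default_rng(1)
feat = rng.standard_normal((8, 4, 4))
enhanced = spatial_attention(feat, rng.standard_normal(8))
```

Because the Sigmoid output lies in (0, 1), the enhanced map re-weights each spatial position without flipping signs or amplifying magnitudes.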
FIG. 5 illustrates a block diagram of a second convolutional encoding unit in a digital staff intelligent system based on a financial system, according to an embodiment of the present application. As shown in fig. 5, in the above-mentioned digital staff intelligent system 100 based on a financial system, the second convolutional encoding unit 152 includes: a deep convolutional encoding subunit 1521, configured to input the multi-channel voice spectrogram into the multilayer convolutional layers of the second convolutional neural network to obtain a second convolution feature map; a global average pooling subunit 1522, configured to calculate the global mean of each feature matrix of the second convolution feature map along the channel dimension to obtain a channel feature vector; a channel attention weight calculation subunit 1523, configured to input the channel feature vector into a Sigmoid activation function to obtain a channel attention weight vector; and a channel attention applying subunit 1524, configured to take the feature value at each position in the channel attention weight vector as a weight to respectively weight each feature matrix of the second convolution feature map along the channel dimension, so as to obtain the channel enhancement feature map.
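The channel attention branch maps directly onto a few NumPy operations: global average pooling per channel, a Sigmoid to produce the weight vector, and a broadcast multiply to re-weight each channel's feature matrix. A minimal sketch (feature map sizes are illustrative):

```python
import numpy as np

def channel_attention(feat):
    # feat: (C, H, W) second convolution feature map
    pooled = feat.mean(axis=(1, 2))             # global average pooling per channel
    weights = 1.0 / (1.0 + np.exp(-pooled))     # Sigmoid -> channel attention weights
    return weights[:, None, None] * feat        # weight each channel's feature matrix

rng = np.random.default_rng(2)
feat = rng.standard_normal((8, 4, 4))
enhanced = channel_attention(feat)
```

Since every channel weight is strictly positive, the operation rescales channels without changing the sign of any feature value.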
In one example, in the above digital staff intelligent system 100 based on a financial system, the aggregation unit 153 is further configured to: fusing the spatial enhancement feature map and the channel enhancement feature map using the aggregation unit to obtain the classification feature map with the following formula:
F = α · F₁ ⊕ (1 − α) · F₂

wherein F is the classification feature map, F₁ is the spatial enhancement feature map, F₂ is the channel enhancement feature map, "⊕" represents addition of the elements at the corresponding positions of the spatial enhancement feature map and the channel enhancement feature map, and α is a weighting parameter for controlling the balance between the spatial enhancement feature map and the channel enhancement feature map in the classification feature map.
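The aggregation of the two branch outputs is an element-wise weighted sum, balanced by a single weighting parameter. A minimal sketch (the value of the balance weight and the feature map sizes are illustrative assumptions):

```python
import numpy as np

def fuse(f_spatial, f_channel, alpha=0.5):
    # element-wise weighted addition of the two enhancement feature maps;
    # alpha controls the balance between the spatial and channel branches
    return alpha * f_spatial + (1.0 - alpha) * f_channel

rng = np.random.default_rng(3)
fs = rng.standard_normal((8, 4, 4))   # spatial enhancement feature map
fc = rng.standard_normal((8, 4, 4))   # channel enhancement feature map
fused = fuse(fs, fc, alpha=0.7)
```

Setting alpha to 1 recovers the spatial branch alone and 0 the channel branch alone, which makes the role of the weighting parameter concrete.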
In the above-mentioned digital staff intelligent system 100 based on a financial system, the adaptive correction module 160 is configured to correct the feature values at each position in the classification feature map based on the statistical features of the feature value set of all positions in the classification feature map to obtain a corrected classification feature map. For the classification feature map, when the dual-flow network structure fuses the spatial enhancement feature map output by the first convolutional neural network and the channel enhancement feature map output by the second convolutional neural network, a misalignment between the spatial-dimension distribution of the spatial enhancement feature map and the channel-dimension distribution of the channel enhancement feature map may generate out-of-distribution feature values in the classification feature map, thereby degrading the classification effect of the classification feature map. Therefore, adaptive-instance information statistical normalization is performed on the classification feature map.
In one example, in the above-mentioned financial system-based digital employee intelligence system 100, the adaptive correction module 160 is further configured to: correcting the characteristic value of each position in the classification characteristic diagram according to the following formula to obtain a corrected classification characteristic diagram; wherein the formula is:
f_i' = (f_i − μ)/σ + α · log₂(‖F‖₂ / (W × H × C))

wherein f_i is the feature value at each position of the classification feature map F, μ and σ² are the mean and variance of the feature value set {f_i}, W × H × C is the size of the classification feature map F, log₂ is the logarithm with base 2, α is a weight hyperparameter, and f_i' is the feature value at each position of the corrected classification feature map.
The adaptive-instance information statistical normalization treats the feature value set of the classification feature map as an adaptive instance, uses the intrinsic prior information of its statistical features to perform a dynamic, generative information normalization of each individual feature value, and simultaneously uses the norm length information of the feature set as a bias serving as an invariance description within the set's distribution domain, thereby achieving a feature optimization that shields, as far as possible, the disturbance distribution of special instances and improving the classification effect of the classification feature map. In this way, the accuracy of intention understanding and recognition for the consultation voice signal is improved.
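An adaptive-instance correction of the kind described — standardizing each feature value by the set's mean and standard deviation, then adding a bias derived from the set's norm length — can be sketched as follows. The precise functional form and the weight hyperparameter here are assumptions for illustration, not the patent's exact formula:

```python
import numpy as np

def adaptive_instance_correct(feat, alpha=0.1):
    # feat: classification feature map of size W x H x C
    mu, sigma = feat.mean(), feat.std()
    # set-level norm-length information used as a constant bias
    bias = alpha * np.log2(np.linalg.norm(feat) / feat.size)
    return (feat - mu) / sigma + bias

rng = np.random.default_rng(4)
fmap = rng.standard_normal((4, 4, 8)) * 3 + 5   # shifted, scaled stand-in features
corrected = adaptive_instance_correct(fmap)
```

Because the bias is constant across positions, the corrected map has unit standard deviation regardless of the input's scale, which is the sense in which out-of-distribution feature values are pulled back toward the set's distribution.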
In the above-mentioned financial system-based digital employee intelligent system 100, the consultation intention identifying module 170 is configured to pass the corrected classification feature map through a classifier to obtain a classification result, where the classification result is used to represent an intention topic label of a consultation voice signal.
In one example, in the above-mentioned financial system-based digital employee intelligence system 100, the consulting intent identification module 170 is further configured to: processing the corrected classification feature map using the classifier to generate a classification result according to the following formula:
O = softmax{(W_n, B_n) : ⋯ : (W_1, B_1) | Project(F)}

wherein Project(F) represents projecting the corrected classification feature map as a vector, W_1 to W_n are the weight matrices of the fully connected layers of each layer, and B_1 to B_n represent the bias matrices of the fully connected layers of each layer.
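A classifier of the described shape — projecting the feature map to a vector, passing it through stacked fully connected layers, and applying a softmax — can be sketched as follows. The layer sizes, the ReLU between layers, and the choice of five intention topic labels are illustrative assumptions, not the patent's trained classifier:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())   # shift for numerical stability
    return e / e.sum()

def classify(feat, layers):
    # project the corrected classification feature map as a vector, then
    # apply stacked fully connected layers (W, b) and a final softmax
    vec = feat.ravel()
    for i, (W, b) in enumerate(layers):
        vec = W @ vec + b
        if i < len(layers) - 1:
            vec = np.maximum(vec, 0.0)   # ReLU between layers (assumed)
    return softmax(vec)

rng = np.random.default_rng(5)
feat = rng.standard_normal((4, 4, 8))            # corrected classification feature map
layers = [(rng.standard_normal((32, 128)) * 0.1, np.zeros(32)),
          (rng.standard_normal((5, 32)) * 0.1, np.zeros(5))]  # 5 intent topic labels
probs = classify(feat, layers)
```

The output is a probability distribution over the candidate intention topic labels; the argmax would be reported as the classification result.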
In summary, the digital staff intelligent system 100 based on a financial system according to the embodiments of the present application has been illustrated, which converts the digital staff's understanding of the client's consultation intention into a voice topic labeling problem. Specifically, multiple spectrograms are extracted from the voice signal and encoded and decoded by a deep neural network model to obtain a recognition result representing a predetermined intention topic label. In this way, on the premise of more accurately understanding the user's consultation intention, the system can respond to the client more reasonably and adaptively, improving the user's voice consultation experience.
As described above, the digital employee intelligence system 100 based on a financial system according to the embodiment of the present application may be implemented in various terminal devices, such as a server having digital employee intelligence based on a financial system. In one example, the financial system based digital employee intelligence system 100 according to embodiments of the subject application may be integrated into a terminal device as a software module and/or a hardware module. For example, the financial system-based digital employee intelligence system 100 may be a software module in the operating system of the terminal device, or may be an application developed for the terminal device; of course, the financial system based digital staff intelligent system 100 may also be one of many hardware modules of the terminal device.
Alternatively, in another example, the financial system based digital employee intelligence system 100 and the terminal device may also be separate devices, and the financial system based digital employee intelligence system 100 may be connected to the terminal device through a wired and/or wireless network and transmit the interaction information in an agreed data format.
Exemplary method
According to another aspect of the present application, a digital staff intelligent method based on a financial system is also provided. As shown in fig. 6, the digital staff intelligent method based on a financial system according to an embodiment of the present application includes the steps of: S110, acquiring a consultation voice signal of a client; S120, passing the consultation voice signal through a signal noise reduction module based on an automatic encoder to obtain a noise-reduced consultation voice signal; S130, extracting a log Mel spectrogram, a cochlear spectrogram and a constant Q-transform spectrogram from the noise-reduced consultation voice signal; S140, arranging the log Mel spectrogram, the cochlear spectrogram and the constant Q-transform spectrogram into a multi-channel voice spectrogram; S150, passing the multi-channel voice spectrogram through a dual-flow network model to obtain a classification feature map; S160, correcting the feature values at each position in the classification feature map based on the statistical features of the feature value set of all positions in the classification feature map to obtain a corrected classification feature map; and S170, passing the corrected classification feature map through a classifier to obtain a classification result, the classification result being used to represent an intention topic label of the consultation voice signal.
In one example, in the above digital staff intelligent method based on a financial system, passing the consultation voice signal through the signal noise reduction module based on an automatic encoder to obtain a noise-reduced consultation voice signal includes: inputting the consultation voice signal into an encoder of the signal noise reduction module, wherein the encoder performs explicit spatial encoding on the consultation voice signal using a convolutional layer to obtain speech features; and inputting the speech features into a decoder of the signal noise reduction module, wherein the decoder performs deconvolution processing on the speech features using a deconvolution layer to obtain the noise-reduced consultation voice signal.
In one example, in the above financial system-based digital employee intelligence method, the passing the multi-channel voice spectrogram through a dual-flow network model to obtain a classification feature map includes: inputting the multi-channel voice spectrogram into a first convolution neural network of the double-current network model using a space attention mechanism to obtain a space enhancement feature map; inputting the multi-channel voice spectrogram into a second convolution neural network of the double-current network model using a channel attention mechanism to obtain a channel enhancement feature map; and fusing the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map.
In one example, in the above digital staff intelligent method based on a financial system, inputting the multi-channel voice spectrogram into the first convolutional neural network of the dual-flow network model using a spatial attention mechanism to obtain a spatial enhancement feature map includes: inputting the multi-channel voice spectrogram into the multilayer convolutional layers of the first convolutional neural network to obtain a first convolution feature map; inputting the first convolution feature map into a spatial attention module of the first convolutional neural network to obtain a spatial attention map; and calculating the position-wise point multiplication of the spatial attention map and the first convolution feature map to obtain the spatial enhancement feature map.
In one example, in the above digital staff intelligent method based on a financial system, inputting the multi-channel voice spectrogram into the first convolutional neural network of the dual-flow network model using a spatial attention mechanism to obtain a spatial enhancement feature map further includes: performing convolutional encoding on the first convolution feature map using the convolutional layers of the spatial attention module to obtain a spatial perception feature map; calculating the position-wise multiplication between the spatial perception feature map and the first convolution feature map to obtain a spatial attention score map; and inputting the spatial attention score map into a Sigmoid activation function to obtain the spatial attention map.
In one example, in the above digital staff intelligent method based on a financial system, inputting the multi-channel voice spectrogram into the second convolutional neural network of the dual-flow network model using a channel attention mechanism to obtain a channel enhancement feature map includes: inputting the multi-channel voice spectrogram into the multilayer convolutional layers of the second convolutional neural network to obtain a second convolution feature map; calculating the global mean of each feature matrix of the second convolution feature map along the channel dimension to obtain a channel feature vector; inputting the channel feature vector into a Sigmoid activation function to obtain a channel attention weight vector; and weighting each feature matrix of the second convolution feature map along the channel dimension by taking the feature value at each position in the channel attention weight vector as a weight, so as to obtain the channel enhancement feature map.
In one example, in the above digital employee intelligence method based on a financial system, the fusing the spatial enhanced feature map and the channel enhanced feature map to obtain the classification feature map further includes: fusing the spatial enhanced feature map and the channel enhanced feature map to obtain the classification feature map using the following formula:
F = α · F₁ ⊕ (1 − α) · F₂

wherein F is the classification feature map, F₁ is the spatial enhancement feature map, F₂ is the channel enhancement feature map, "⊕" represents addition of the elements at the corresponding positions of the spatial enhancement feature map and the channel enhancement feature map, and α is a weighting parameter for controlling the balance between the spatial enhancement feature map and the channel enhancement feature map in the classification feature map.
In one example, in the above digital employee intelligence method based on a financial system, the correcting the feature values of the respective locations in the classification feature map based on the statistical features of the feature value sets of all the locations in the classification feature map to obtain a corrected classification feature map further includes: correcting the characteristic value of each position in the classification characteristic diagram according to the following formula to obtain a corrected classification characteristic diagram; wherein the formula is:
f_i' = (f_i − μ)/σ + α · log₂(‖F‖₂ / (W × H × C))

wherein f_i is the feature value at each position of the classification feature map F, μ and σ² are the mean and variance of the feature value set {f_i}, W × H × C is the size of the classification feature map F, log₂ is the logarithm with base 2, α is a weight hyperparameter, and f_i' is the feature value at each position of the corrected classification feature map.
In one example, in the above digital employee intelligence method based on a financial system, the step of passing the corrected classification feature map through a classifier to obtain a classification result further includes: processing the corrected classification feature map using the classifier to generate a classification result according to the following formula:
O = softmax{(W_n, B_n) : ⋯ : (W_1, B_1) | Project(F)}

wherein Project(F) represents projecting the corrected classification feature map as a vector, W_1 to W_n are the weight matrices of the fully connected layers of each layer, and B_1 to B_n represent the bias matrices of the fully connected layers of each layer.
In summary, the digital staff intelligent method based on a financial system according to the embodiments of the present application has been illustrated, which converts the digital staff's understanding of the client's consultation intention into a voice topic labeling problem, that is, a two-stage model of a feature extractor plus a classifier encodes and understands the voice signal and assigns a predetermined intention topic label to it. In particular, the voice signal is denoised by the noise reduction module to improve the accuracy of consultation intention understanding. In this way, an optimized digital staff intelligent scheme for financial systems is constructed.
The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is provided for purposes of illustration and understanding only, and is not intended to limit the application to the details which are set forth in order to provide a thorough understanding of the present application.
The block diagrams of devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that connections, arrangements, or configurations must be made in the manner shown. As will be appreciated by one skilled in the art, these devices, apparatuses, and systems may be connected, arranged, or configured in any manner. Words such as "including," "comprising," and "having" are open-ended terms that mean "including, but not limited to," and are used interchangeably therewith. The word "or," as used herein, means, and is used interchangeably with, "and/or," unless the context clearly dictates otherwise. The phrase "such as," as used herein, means, and is used interchangeably with, "such as, but not limited to."
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.
Claims (9)
1. A digital employee intelligence system based on a financial system, comprising:
a voice signal acquisition module for acquiring a consultation voice signal of a client;
a noise reduction module for passing the consultation voice signal through a signal noise reduction module based on an auto-encoder to obtain a noise-reduced consultation voice signal;
a voice spectrogram extraction module for extracting a log-Mel spectrogram, a cochlear spectrogram and a constant-Q transform spectrogram from the noise-reduced consultation voice signal;
a multi-channel semantic spectrogram construction module for arranging the log-Mel spectrogram, the cochlear spectrogram and the constant-Q transform spectrogram into a multi-channel voice spectrogram;
a dual-stream encoding module for passing the multi-channel voice spectrogram through a dual-stream network model to obtain a classification feature map;
an adaptive correction module for correcting the feature value of each position in the classification feature map, based on the statistical characteristics of the set of feature values of all positions in the classification feature map, to obtain a corrected classification feature map; and
a consultation intention recognition module for passing the corrected classification feature map through a classifier to obtain a classification result, the classification result representing an intention topic label of the consultation voice signal.
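The multi-channel construction step in claim 1 amounts to stacking the three time-frequency representations as channels of a single "image". A minimal NumPy sketch, under the assumption (not stated in the claim) that the three spectrograms have been resampled or cropped to a common bin/frame grid before stacking:

```python
import numpy as np

# Hypothetical common shape: each spectrogram is (n_bins, n_frames).
n_bins, n_frames = 64, 100
log_mel = np.random.rand(n_bins, n_frames)      # log-Mel spectrogram
cochleagram = np.random.rand(n_bins, n_frames)  # cochlear spectrogram
cqt = np.random.rand(n_bins, n_frames)          # constant-Q transform spectrogram

# Arrange the three spectrograms as channels of one multi-channel
# voice spectrogram, analogous to the R/G/B channels of a picture.
multi_channel = np.stack([log_mel, cochleagram, cqt], axis=0)
print(multi_channel.shape)  # (3, 64, 100)
```

The resulting (3, H, W) array is what the dual-stream network model would consume as input.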
2. The financial system-based digital employee intelligence system of claim 1, wherein the noise reduction module comprises:
a speech signal encoding unit for inputting the consultation voice signal into an encoder of the signal noise reduction module, wherein the encoder performs explicit spatial encoding on the consultation voice signal using convolutional layers to obtain speech features; and
a speech feature decoding unit for inputting the speech features into a decoder of the signal noise reduction module, wherein the decoder performs deconvolution processing on the speech features using deconvolution layers to obtain the noise-reduced consultation voice signal.
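The encoder/decoder pair in claim 2 is a convolutional auto-encoder: strided convolution compresses the waveform into features, transposed ("de-") convolution maps the features back to a signal. A minimal 1-D NumPy sketch with hand-picked kernels (the real kernels would be learned, and the stride, kernel length, and single-channel setup here are assumptions):

```python
import numpy as np

def conv1d(x, k, stride=2):
    """Valid 1-D convolution with stride: the encoder's downsampling step."""
    n = (len(x) - len(k)) // stride + 1
    return np.array([np.dot(x[i*stride : i*stride + len(k)], k) for i in range(n)])

def deconv1d(h, k, stride=2):
    """Transposed 1-D convolution: each code value is spread back over
    len(k) output samples, upsampling the code toward signal length."""
    out = np.zeros((len(h) - 1) * stride + len(k))
    for i, v in enumerate(h):
        out[i*stride : i*stride + len(k)] += v * k
    return out

signal = np.sin(np.linspace(0, 8 * np.pi, 64)) + 0.1 * np.random.randn(64)
k_enc = np.array([0.25, 0.5, 0.25])  # hypothetical encoder kernel (smoothing)
k_dec = np.array([0.25, 0.5, 0.25])  # hypothetical decoder kernel

code = conv1d(signal, k_enc)   # explicit spatial encoding -> speech features
recon = deconv1d(code, k_dec)  # deconvolution -> denoised signal (approx. length)
print(len(signal), len(code), len(recon))
```

With stride 2 and kernel length 3, a 64-sample input yields a 31-sample code and a 63-sample reconstruction; a trained system would pad so input and output lengths match exactly.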
3. The financial system-based digital employee intelligence system of claim 2, wherein the dual-stream encoding module comprises:
a first convolutional encoding unit for inputting the multi-channel voice spectrogram into a first convolutional neural network of the dual-stream network model that uses a spatial attention mechanism to obtain a spatial enhancement feature map;
a second convolutional encoding unit for inputting the multi-channel voice spectrogram into a second convolutional neural network of the dual-stream network model that uses a channel attention mechanism to obtain a channel enhancement feature map; and
an aggregation unit for fusing the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map.
4. The financial system-based digital employee intelligence system of claim 3, wherein the first convolutional encoding unit comprises:
a depth convolutional encoding subunit for inputting the multi-channel voice spectrogram into the multi-layer convolutional layers of the first convolutional neural network to obtain a first convolution feature map;
a spatial attention subunit for inputting the first convolution feature map into a spatial attention module of the first convolutional neural network to obtain a spatial attention map; and
an attention applying subunit for calculating a position-wise multiplication between the spatial attention map and the first convolution feature map to obtain the spatial enhancement feature map.
5. The financial system-based digital employee intelligence system of claim 4 wherein the spatial attention subunit is further configured to:
performing convolutional encoding on the first convolutional feature map by using convolutional layers of the spatial attention module to obtain a spatial perceptual feature map;
calculating a position-wise multiplication between the spatial perception feature map and the first convolution feature map to obtain a spatial attention score map; and
inputting the spatial attention score map into a Sigmoid activation function to obtain the spatial attention map.
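The spatial attention steps of claims 4 and 5 can be sketched in NumPy. The convolutional encoding is reduced here to a 1×1 convolution (a per-position channel mix), which is an assumption made so the example stays self-contained; all shapes and weights are hypothetical:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(fmap, w):
    """Sketch of the claimed spatial attention on a (C, H, W) feature map:
    1. convolutional encoding (here a 1x1 conv) -> spatial perception map
    2. position-wise multiplication with the input -> attention score map
    3. Sigmoid -> spatial attention map
    4. position-wise multiplication with the input -> spatial enhancement map
    """
    perception = np.einsum('oc,chw->ohw', w, fmap)  # 1x1 conv as channel mixing
    scores = perception * fmap                      # position-wise multiplication
    attn = sigmoid(scores)                          # spatial attention map
    return attn * fmap                              # apply attention to the input

rng = np.random.default_rng(1)
F1 = rng.standard_normal((4, 8, 8))   # first convolution feature map
W = rng.standard_normal((4, 4)) * 0.1 # hypothetical 1x1-conv weights
enhanced = spatial_attention(F1, W)
print(enhanced.shape)
```

Because the Sigmoid keeps every attention weight in (0, 1), the enhancement map rescales each position of the input rather than replacing it.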
6. The financial system-based digital employee intelligence system of claim 5, wherein the second convolutional encoding unit comprises:
a depth convolutional encoding subunit for inputting the multi-channel voice spectrogram into the multi-layer convolutional layers of the second convolutional neural network to obtain a second convolution feature map;
a global mean pooling subunit for calculating the global mean of each feature matrix of the second convolution feature map along the channel dimension to obtain a channel feature vector;
a channel attention weight calculation subunit for inputting the channel feature vector into a Sigmoid activation function to obtain a channel attention weight vector; and
a channel attention applying subunit for taking the feature value of each position in the channel attention weight vector as a weight to respectively weight each feature matrix of the second convolution feature map along the channel dimension to obtain the channel enhancement feature map.
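The channel attention path of claim 6 is simpler: pool each channel to a scalar, squash with a Sigmoid, and rescale that channel. A minimal NumPy sketch (shapes are hypothetical):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(fmap):
    """Sketch of the claimed channel attention on a (C, H, W) feature map:
    global mean of each channel's feature matrix -> channel feature vector,
    Sigmoid -> channel attention weight vector, then weight each channel."""
    channel_vec = fmap.mean(axis=(1, 2))  # global mean pooling per channel
    weights = sigmoid(channel_vec)        # channel attention weight vector
    return fmap * weights[:, None, None]  # weight each feature matrix

rng = np.random.default_rng(2)
F2 = rng.standard_normal((4, 8, 8))       # second convolution feature map
enhanced = channel_attention(F2)
print(enhanced.shape)
```

Each of the C channels is multiplied by a single scalar in (0, 1), so channel attention reweights whole feature matrices, whereas the spatial attention path reweights individual positions.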
7. The financial system-based digital employee intelligence system of claim 6, wherein the aggregation unit is further configured to:
fusing the spatial enhancement feature map and the channel enhancement feature map using the aggregation unit to obtain the classification feature map with the following formula:
where F is the classification feature map, F₁ is the spatial enhancement feature map, F₂ is the channel enhancement feature map, "⊕" denotes addition of the elements at the corresponding positions of the spatial enhancement feature map and the channel enhancement feature map, and λ is a weighting parameter that controls the balance between the spatial enhancement feature map and the channel enhancement feature map in the classification feature map.
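The fusion formula itself is rendered as an image in the published text and is not reproduced here; based on the variable definitions (element-wise addition ⊕ and a balance weight λ), a plausible form is a weighted position-wise sum, which this sketch assumes:

```python
import numpy as np

rng = np.random.default_rng(3)
F1 = rng.standard_normal((4, 8, 8))  # spatial enhancement feature map
F2 = rng.standard_normal((4, 8, 8))  # channel enhancement feature map
lam = 0.5                            # weighting parameter (assumed value)

# Assumed fusion: F = lam*F1 (+) (1-lam)*F2, with "(+)" the addition of
# elements at corresponding positions of the two enhancement maps.
F = lam * F1 + (1.0 - lam) * F2
print(F.shape)
```

Larger λ lets the spatial stream dominate the classification feature map; smaller λ favors the channel stream.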
8. The financial system-based digital employee intelligence system of claim 7, wherein the adaptive correction module is further configured to:
correcting the characteristic value of each position in the classification characteristic diagram according to the following formula to obtain a corrected classification characteristic diagram;
wherein the formula is:
where f_i is the feature value of each position of the classification feature map F, μ and σ² are the mean and the variance of the feature value set {f_i}, S is the size of the classification feature map F, log denotes the logarithm with base 2, λ is a weighting hyperparameter, and f_i′ denotes the corrected feature value of each position in the classification feature map.
9. The financial system-based digital employee intelligence system of claim 8 wherein said consultation intent identification module is further configured to:
processing the corrected classification feature map using the classifier to generate the classification result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202211090442.0A CN115602165B (en) | 2022-09-07 | 2022-09-07 | Digital employee intelligent system based on financial system |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115602165A true CN115602165A (en) | 2023-01-13 |
CN115602165B CN115602165B (en) | 2023-05-05 |
Family
ID=84843343
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202211090442.0A Active CN115602165B (en) | 2022-09-07 | 2022-09-07 | Digital employee intelligent system based on financial system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115602165B (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110085218A (en) * | 2019-03-26 | 2019-08-02 | 天津大学 | A kind of audio scene recognition method based on feature pyramid network |
CN110718234A (en) * | 2019-09-02 | 2020-01-21 | 江苏师范大学 | Acoustic scene classification method based on semantic segmentation coding and decoding network |
CN111145726A (en) * | 2019-10-31 | 2020-05-12 | 南京励智心理大数据产业研究院有限公司 | Deep learning-based sound scene classification method, system, device and storage medium |
CN111754988A (en) * | 2020-06-23 | 2020-10-09 | 南京工程学院 | Sound scene classification method based on attention mechanism and double-path depth residual error network |
CN113808573A (en) * | 2021-08-06 | 2021-12-17 | 华南理工大学 | Dialect classification method and system based on mixed domain attention and time sequence self-attention |
CN114333804A (en) * | 2021-12-27 | 2022-04-12 | 北京达佳互联信息技术有限公司 | Audio classification identification method and device, electronic equipment and storage medium |
CN114420097A (en) * | 2022-01-24 | 2022-04-29 | 腾讯科技(深圳)有限公司 | Voice positioning method and device, computer readable medium and electronic equipment |
CN114565041A (en) * | 2022-02-28 | 2022-05-31 | 上海嘉甲茂技术有限公司 | Payment big data analysis system based on internet finance and analysis method thereof |
US11355122B1 (en) * | 2021-02-24 | 2022-06-07 | Conversenowai | Using machine learning to correct the output of an automatic speech recognition system |
CN114974219A (en) * | 2022-05-30 | 2022-08-30 | 平安科技(深圳)有限公司 | Speech recognition method, speech recognition device, electronic apparatus, and storage medium |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN116258504A (en) * | 2023-03-16 | 2023-06-13 | 广州信瑞泰信息科技有限公司 | Bank customer relationship management system and method thereof |
CN116308754A (en) * | 2023-03-22 | 2023-06-23 | 广州信瑞泰信息科技有限公司 | Bank credit risk early warning system and method thereof |
CN116308754B (en) * | 2023-03-22 | 2024-02-13 | 广州信瑞泰信息科技有限公司 | Bank credit risk early warning system and method thereof |
CN117173294A (en) * | 2023-11-03 | 2023-12-05 | 之江实验室科技控股有限公司 | Method and system for automatically generating digital person |
CN117173294B (en) * | 2023-11-03 | 2024-02-13 | 之江实验室科技控股有限公司 | Method and system for automatically generating digital person |
Also Published As
Publication number | Publication date |
---|---|
CN115602165B (en) | 2023-05-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||