CN115602165A - Digital employee intelligent system based on financial system

Publication number: CN115602165A
Authority: CN (China)
Prior art keywords: feature map, channel, spectrogram, classification
Legal status: Granted (Active)
Application number: CN202211090442.0A
Other languages: Chinese (zh)
Other versions: CN115602165B (en)
Inventors: 黄术, 黄琪敏, 裘浩祺, 魏祥
Current and original assignee: Hangzhou Youhang Information Technology Co ltd
Application filed by Hangzhou Youhang Information Technology Co ltd
Priority to CN202211090442.0A
Publication of CN115602165A
Application granted; publication of CN115602165B

Classifications

    • G10L 15/22 — Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G06N 3/08 — Learning methods (computing arrangements based on biological models; neural networks)
    • G10L 15/02 — Feature extraction for speech recognition; selection of recognition unit
    • G10L 15/063 — Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L 15/16 — Speech classification or search using artificial neural networks
    • G10L 15/1822 — Parsing for meaning understanding
    • G10L 15/20 — Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise
    • G10L 25/03 — Speech or voice analysis techniques characterised by the type of extracted parameters
    • G10L 25/24 — the extracted parameters being the cepstrum

Abstract

The application relates to the field of financial technology, and particularly discloses a financial-system-based digital employee intelligent system, which converts the digital employee's understanding of a client's consultation intention into a speech topic labeling problem. Specifically, a plurality of spectrograms are extracted from the voice signal, and the spectrograms are encoded and decoded by a deep neural network model to obtain a recognition result representing a predetermined intention topic label. In this way, on the premise of understanding the user's consultation demands more accurately, the system can respond to the client more reasonably and adaptively, improving the user's voice consultation experience.

Description

Digital employee intelligent system based on financial system
Technical Field
The application relates to the field of financial technology, and more specifically to a digital employee intelligent system based on a financial system.
Background
With the development of computer technology, more and more technologies (such as big data and cloud computing) are applied in the financial field, and the traditional financial industry is gradually shifting to fintech. At present, digital employees (e.g., voice robots) are used quite extensively in the financial field; for example, financial product promotion, debt collection and the like can be carried out by digital employees. Digital employee implementations have benefited from the development of speech recognition, natural language understanding and other related technologies.
However, in actual operation, customers often complain about digital employees. The root cause is that the digital employee fails to accurately understand the customer's intention and therefore gives irrelevant answers.
Therefore, an optimized digital employee intelligence solution for financial systems is desired.
Disclosure of Invention
The present application is proposed to solve the above technical problems. The embodiments of the application provide a digital employee intelligent system based on a financial system, which converts the digital employee's understanding of a client's consultation intention into a speech topic labeling problem. Specifically, a plurality of spectrograms are extracted from the voice signal, and the spectrograms are encoded and decoded by a deep neural network model to obtain a recognition result representing a predetermined intention topic label. In this way, on the premise of understanding the user's consultation demands more accurately, the system can respond to the client more reasonably and adaptively, improving the user's voice consultation experience.
According to one aspect of the application, there is provided a financial-system-based digital employee intelligent system comprising:
a voice signal acquisition module for acquiring a consultation voice signal of a client;
a noise reduction module for passing the consultation voice signal through an autoencoder-based signal noise reduction module to obtain a noise-reduced consultation voice signal;
a voice spectrogram extraction module for extracting a log-Mel spectrogram, a cochleagram and a constant-Q transform spectrogram from the noise-reduced consultation voice signal;
a multi-channel voice spectrogram construction module for arranging the log-Mel spectrogram, the cochleagram and the constant-Q transform spectrogram into a multi-channel voice spectrogram;
a dual-stream encoding module for passing the multi-channel voice spectrogram through a dual-stream network model to obtain a classification feature map;
an adaptive correction module for correcting the feature values of the respective positions in the classification feature map based on statistical features of the set of feature values of all positions in the classification feature map to obtain a corrected classification feature map; and
a consultation intention recognition module for passing the corrected classification feature map through a classifier to obtain a classification result, the classification result being used for representing an intention topic label of the consultation voice signal.
In the above financial-system-based digital employee intelligent system, the noise reduction module includes: a voice signal encoding unit for inputting the consultation voice signal into an encoder of the signal noise reduction module, wherein the encoder performs explicit spatial encoding on the consultation voice signal using convolutional layers to obtain speech features; and a speech feature decoding unit for inputting the speech features into a decoder of the signal noise reduction module, wherein the decoder performs deconvolution processing on the speech features using deconvolution layers to obtain the noise-reduced consultation voice signal.
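The encoder-decoder noise-reduction flow just described can be sketched in plain NumPy. This is a minimal illustrative sketch only: the kernel values, stride, and signal length are assumptions for demonstration, not parameters from the patent.

```python
import numpy as np

def conv1d(x, k, stride=2):
    """Strided valid 1-D convolution: the encoder's explicit spatial encoding."""
    return np.array([np.dot(x[i:i + len(k)], k)
                     for i in range(0, len(x) - len(k) + 1, stride)])

def deconv1d(z, k, stride=2):
    """Transposed (de)convolution: the decoder upsamples back to signal length."""
    out = np.zeros(stride * (len(z) - 1) + len(k))
    for i, v in enumerate(z):
        out[i * stride:i * stride + len(k)] += v * k
    return out

signal = np.random.randn(64)            # noisy consultation voice signal (toy)
kernel = np.array([0.25, 0.5, 0.5, 0.25])
features = conv1d(signal, kernel)       # speech features (smoothing kernel filters noise)
denoised = deconv1d(features, kernel)   # noise-reduced consultation voice signal
assert denoised.shape == signal.shape
```

With a length-64 input, a length-4 kernel, and stride 2, the encoder produces 31 feature values and the decoder maps them back to the original length.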
In the above financial-system-based digital employee intelligent system, the dual-stream encoding module includes: a first convolutional encoding unit for inputting the multi-channel voice spectrogram into a first convolutional neural network of the dual-stream network model that uses a spatial attention mechanism to obtain a spatial enhancement feature map; a second convolutional encoding unit for inputting the multi-channel voice spectrogram into a second convolutional neural network of the dual-stream network model that uses a channel attention mechanism to obtain a channel enhancement feature map; and an aggregation unit for fusing the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map.
In the above financial-system-based digital employee intelligent system, the first convolutional encoding unit includes: a deep convolutional encoding subunit for inputting the multi-channel voice spectrogram into the multilayer convolutional layers of the first convolutional neural network to obtain a first convolution feature map; a spatial attention subunit for inputting the first convolution feature map into a spatial attention module of the first convolutional neural network to obtain a spatial attention map; and an attention applying subunit for multiplying, position by position, the spatial attention map with the first convolution feature map to obtain the spatial enhancement feature map.
In the above financial-system-based digital employee intelligent system, the spatial attention subunit is further configured to: perform convolutional encoding on the first convolution feature map using the convolutional layers of the spatial attention module to obtain a spatial perception feature map; calculate a position-wise multiplication between the spatial perception feature map and the first convolution feature map to obtain a spatial attention score map; and input the spatial attention score map into a Sigmoid activation function to obtain the spatial attention map.
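The spatial-attention computation described above can be rendered as a toy NumPy sketch; the 1×1-convolution weight vector `w` stands in for the attention module's convolutional encoding and is a hypothetical parameter, not one specified in the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(f, w):
    # f: (C, H, W) first convolution feature map; w: (C,) stand-in 1x1-conv weights
    perception = np.tensordot(w, f, axes=1)   # (H, W) spatial perception feature map
    score = perception[None, :, :] * f        # position-wise multiplication -> score map
    attn = sigmoid(score)                     # spatial attention map, values in (0, 1)
    return f * attn                           # spatial enhancement feature map

f = np.random.randn(3, 8, 8)
enhanced = spatial_attention(f, np.array([0.2, 0.5, 0.3]))
assert enhanced.shape == f.shape
```

Because the Sigmoid squashes scores into (0, 1), each position of the enhancement map is a damped copy of the original feature, emphasizing high-scoring positions.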
In the above financial-system-based digital employee intelligent system, the second convolutional encoding unit includes: a deep convolutional encoding subunit for inputting the multi-channel voice spectrogram into the multilayer convolutional layers of the second convolutional neural network to obtain a second convolution feature map; a global mean pooling subunit for calculating the global mean of each feature matrix of the second convolution feature map along the channel dimension to obtain a channel feature vector; a channel attention weight calculation subunit for inputting the channel feature vector into the Sigmoid activation function to obtain a channel attention weight vector; and a channel attention applying subunit for taking the feature value of each position in the channel attention weight vector as a weight to respectively weight each feature matrix of the second convolution feature map along the channel dimension to obtain the channel enhancement feature map.
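The channel-attention branch amounts to squeeze-and-excitation-style weighting; a minimal NumPy sketch of the steps listed above (the feature map contents are random stand-ins):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(f):
    # f: (C, H, W) second convolution feature map
    v = f.mean(axis=(1, 2))        # global mean of each feature matrix -> channel feature vector
    w = sigmoid(v)                 # channel attention weight vector
    return f * w[:, None, None]    # weight each feature matrix along the channel dimension

f = np.random.randn(4, 8, 8)
enhanced = channel_attention(f)
assert enhanced.shape == f.shape
```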
In the above financial-system-based digital employee intelligent system, the aggregation unit is further configured to fuse the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map according to the following formula:

F = λ·F_s ⊕ (1 − λ)·F_c

wherein F is the classification feature map, F_s is the spatial enhancement feature map, F_c is the channel enhancement feature map, "⊕" represents addition of the elements at the corresponding positions of the spatial enhancement feature map and the channel enhancement feature map, and λ is a weighting parameter for controlling the balance between the spatial enhancement feature map and the channel enhancement feature map in the classification feature map.
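The aggregation described above reduces to a few lines of NumPy. This sketch assumes the single balance parameter λ is applied as a convex combination of the two maps — an assumption, since the formula image is not preserved in this text.

```python
import numpy as np

def fuse(f_s, f_c, lam=0.5):
    # element-wise weighted addition of the spatial and channel enhancement maps;
    # lam balances the two contributions in the classification feature map
    return lam * f_s + (1.0 - lam) * f_c

f_s = np.ones((2, 4, 4))
f_c = np.zeros((2, 4, 4))
fused = fuse(f_s, f_c, 0.5)
assert fused.shape == (2, 4, 4)
```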
In the above financial-system-based digital employee intelligent system, the adaptive correction module is further configured to correct the feature value of each position in the classification feature map according to the following formula to obtain the corrected classification feature map:

f'_{i,j} = α · (f_{i,j} − μ) / σ · log2(W × H)

wherein f_{i,j} is the feature value of each position of the classification feature map F, μ and σ² are the mean and variance of the set of feature values of all positions, W × H is the size of the classification feature map F, log2 is the logarithm with base 2, α is a weight hyperparameter, and f'_{i,j} is the feature value of each position in the corrected classification feature map.
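A toy NumPy rendering of one plausible reading of the adaptive correction (mean/variance normalization scaled by a weighted base-2 logarithm of the map size — an assumed reconstruction, since the formula image is lost, not the patent's verbatim formula):

```python
import numpy as np

def adaptive_correction(f, alpha=1.0):
    # f: (W, H) classification feature map (single channel for illustration)
    mu, sigma = f.mean(), f.std()       # statistics over the set of all feature values
    w, h = f.shape
    # normalize each position, then scale by a weighted log of the map size
    return alpha * (f - mu) / sigma * np.log2(w * h)

f = np.random.randn(8, 8)
corrected = adaptive_correction(f)
assert corrected.shape == f.shape
```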
In the above financial-system-based digital employee intelligent system, the consultation intention recognition module is further configured to process the corrected classification feature map using the classifier to generate the classification result according to the following formula:

softmax{(W_n, B_n) : ... : (W_1, B_1) | Project(F)}

wherein Project(F) represents projecting the corrected classification feature map as a vector, W_1 to W_n are the weight matrices of the fully connected layers of each layer, and B_1 to B_n are the bias matrices of the fully connected layers of each layer.
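The classifier described above is a stack of fully connected layers followed by a softmax; a minimal NumPy sketch with illustrative (assumed) layer sizes:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def classify(feature_map, layers):
    x = feature_map.reshape(-1)     # Project(F): flatten the corrected map to a vector
    for W, b in layers:             # apply each fully connected layer in turn
        x = W @ x + b
    return softmax(x)               # probabilities over intention topic labels

rng = np.random.default_rng(0)
f = rng.normal(size=(4, 4))
layers = [(rng.normal(size=(8, 16)), np.zeros(8)),   # 16 -> 8
          (rng.normal(size=(3, 8)), np.zeros(3))]    # 8 -> 3 topic labels
probs = classify(f, layers)
assert np.isclose(probs.sum(), 1.0)
```

The label with the highest probability would be taken as the intention topic label of the consultation voice signal.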
According to another aspect of the present application, there is also provided a digital employee intelligence method based on a financial system, comprising:
acquiring a consultation voice signal of a client;
passing the consultation voice signal through an autoencoder-based signal noise reduction module to obtain a noise-reduced consultation voice signal;
extracting a log-Mel spectrogram, a cochleagram and a constant-Q transform spectrogram from the noise-reduced consultation voice signal;
arranging the log-Mel spectrogram, the cochleagram and the constant-Q transform spectrogram into a multi-channel voice spectrogram;
passing the multi-channel voice spectrogram through a dual-stream network model to obtain a classification feature map;
correcting the feature values of the respective positions in the classification feature map based on statistical features of the set of feature values of all positions in the classification feature map to obtain a corrected classification feature map; and
passing the corrected classification feature map through a classifier to obtain a classification result, the classification result being used for representing an intention topic label of the consultation voice signal.
In the above financial-system-based digital employee intelligent method, passing the consultation voice signal through the autoencoder-based signal noise reduction module to obtain the noise-reduced consultation voice signal includes: inputting the consultation voice signal into an encoder of the signal noise reduction module, wherein the encoder performs explicit spatial encoding on the consultation voice signal using convolutional layers to obtain speech features; and inputting the speech features into a decoder of the signal noise reduction module, wherein the decoder performs deconvolution processing on the speech features using deconvolution layers to obtain the noise-reduced consultation voice signal.
In the above financial-system-based digital employee intelligent method, passing the multi-channel voice spectrogram through the dual-stream network model to obtain the classification feature map includes: inputting the multi-channel voice spectrogram into a first convolutional neural network of the dual-stream network model that uses a spatial attention mechanism to obtain a spatial enhancement feature map; inputting the multi-channel voice spectrogram into a second convolutional neural network of the dual-stream network model that uses a channel attention mechanism to obtain a channel enhancement feature map; and fusing the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map.
In the above financial-system-based digital employee intelligent method, inputting the multi-channel voice spectrogram into the first convolutional neural network of the dual-stream network model that uses a spatial attention mechanism to obtain the spatial enhancement feature map includes: inputting the multi-channel voice spectrogram into the multilayer convolutional layers of the first convolutional neural network to obtain a first convolution feature map; inputting the first convolution feature map into a spatial attention module of the first convolutional neural network to obtain a spatial attention map; and multiplying, position by position, the spatial attention map with the first convolution feature map to obtain the spatial enhancement feature map.
In the above financial-system-based digital employee intelligent method, inputting the multi-channel voice spectrogram into the first convolutional neural network of the dual-stream network model that uses a spatial attention mechanism to obtain the spatial enhancement feature map further includes: performing convolutional encoding on the first convolution feature map using the convolutional layers of the spatial attention module to obtain a spatial perception feature map; calculating a position-wise multiplication between the spatial perception feature map and the first convolution feature map to obtain a spatial attention score map; and inputting the spatial attention score map into a Sigmoid activation function to obtain the spatial attention map.
In the above financial-system-based digital employee intelligent method, inputting the multi-channel voice spectrogram into the second convolutional neural network of the dual-stream network model that uses a channel attention mechanism to obtain the channel enhancement feature map includes: inputting the multi-channel voice spectrogram into the multilayer convolutional layers of the second convolutional neural network to obtain a second convolution feature map; calculating the global mean of each feature matrix of the second convolution feature map along the channel dimension to obtain a channel feature vector; inputting the channel feature vector into the Sigmoid activation function to obtain a channel attention weight vector; and weighting each feature matrix of the second convolution feature map along the channel dimension by taking the feature value of each position in the channel attention weight vector as a weight to obtain the channel enhancement feature map.
In the above financial-system-based digital employee intelligent method, fusing the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map further includes: fusing the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map using the following formula:

F = λ·F_s ⊕ (1 − λ)·F_c

wherein F is the classification feature map, F_s is the spatial enhancement feature map, F_c is the channel enhancement feature map, "⊕" represents addition of the elements at the corresponding positions of the spatial enhancement feature map and the channel enhancement feature map, and λ is a weighting parameter for controlling the balance between the spatial enhancement feature map and the channel enhancement feature map in the classification feature map.
In the above financial-system-based digital employee intelligent method, correcting the feature values of the respective positions in the classification feature map based on statistical features of the set of feature values of all positions in the classification feature map to obtain the corrected classification feature map further includes: correcting the feature value of each position in the classification feature map by the following formula to obtain the corrected classification feature map:

f'_{i,j} = α · (f_{i,j} − μ) / σ · log2(W × H)

wherein f_{i,j} is the feature value of each position of the classification feature map F, μ and σ² are the mean and variance of the set of feature values of all positions, W × H is the size of the classification feature map F, log2 is the logarithm with base 2, α is a weight hyperparameter, and f'_{i,j} is the feature value of each position in the corrected classification feature map.
In the above financial-system-based digital employee intelligent method, passing the corrected classification feature map through the classifier to obtain the classification result further includes: processing the corrected classification feature map using the classifier to generate the classification result according to the following formula:

softmax{(W_n, B_n) : ... : (W_1, B_1) | Project(F)}

wherein Project(F) represents projecting the corrected classification feature map as a vector, W_1 to W_n are the weight matrices of the fully connected layers of each layer, and B_1 to B_n are the bias matrices of the fully connected layers of each layer.
Compared with the prior art, the financial-system-based digital employee intelligent system converts the digital employee's understanding of a client's consultation intention into a speech topic labeling problem. Specifically, a plurality of spectrograms are extracted from the voice signal, and the spectrograms are encoded and decoded by a deep neural network model to obtain a recognition result representing a predetermined intention topic label. In this way, on the premise of understanding the user's consultation demands more accurately, the system can respond to the client more reasonably and adaptively, improving the user's voice consultation experience.
Drawings
The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.
FIG. 1 illustrates a block diagram of a digital employee intelligence system based on a financial system in accordance with an embodiment of the present application.
FIG. 2 illustrates a system architecture diagram of a digital employee intelligence system based on a financial system in accordance with an embodiment of the present application.
FIG. 3 illustrates a block diagram of a dual stream encoding module in a digital employee intelligence system based on a financial system in accordance with an embodiment of the present application.
FIG. 4 illustrates a block diagram of a first convolution encoding unit in a digital employee intelligence system based on a financial system according to an embodiment of the present application.
FIG. 5 illustrates a block diagram of a second convolutional encoding unit in a digital employee intelligence system based on a financial system, according to an embodiment of the present application.
FIG. 6 illustrates a flow chart of a digital employee intelligence method based on a financial system in accordance with an embodiment of the application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.
Summary of the application
In the technical solution of the present application, the digital employee's understanding of the client's consultation intention can be converted into a speech topic labeling problem, that is, the voice signal is encoded and understood in an appropriate manner and a predetermined intention topic label is assigned to the voice signal, which can be realized by a two-stage model of a feature extractor plus a classifier.
However, when clients consult, different clients have different habits of language expression, which poses a great challenge to the understanding of speech semantics. Meanwhile, pauses, accents, intonation and the like occur when users express their problems, and because users often express their consultation demands through a communication system, a great deal of noise is introduced during the collection and transmission of the voice signal. These technical problems lead to low accuracy in understanding the consultation intention.
Accordingly, in the technical solution of the application, after the client's consultation voice signal is obtained, the consultation voice signal is first passed through an autoencoder-based signal noise reduction module to obtain the noise-reduced consultation voice signal. In particular, the autoencoder-based signal noise reduction module includes an encoder that uses convolutional layers and a decoder that uses deconvolution layers. Accordingly, the noise reduction process of the signal noise reduction module includes first explicitly spatially encoding the consultation voice signal with the convolutional layers of the encoder to extract speech features (with noise filtered out) from the consultation voice signal, and then deconvolving the speech features with the deconvolution layers of the decoder to obtain the noise-reduced consultation voice signal.
In order to improve the accuracy of semantic understanding of the noise-reduced consultation voice signal, it is converted into a spectrogram. A spectrogram is a perceptual graph formed by three components, time, frequency and energy; it is a visible representation of the voice signal that provides rich visual information, combines time-domain and frequency-domain analysis, and reflects both the frequency content of the signal and how that content changes over time.
In particular, in the technical solution of the present application, in order to capture richer acoustic spectrum information, a log-Mel spectrogram, a cochleagram and a constant-Q transform spectrogram are respectively extracted from the noise-reduced consultation voice signal. It will be appreciated that the log-Mel spectrogram is the most widely used feature; its design mimics the characteristics of the human ear, which has different acoustic sensitivities to sounds of different frequencies. The extraction flow of the log-Mel spectrogram is similar to that of MFCCs but omits the final linear transformation, namely the discrete cosine transform; removing this step retains more high-order and nonlinear information of the sound signal. The cochleagram is obtained by a Gammatone filter bank that simulates the frequency selectivity of the human cochlea, and its frequency response is more consistent with the auditory characteristics of the human ear. The constant-Q transform spectrogram provides better frequency resolution at low frequencies and better time resolution at high frequencies, thereby better mimicking the behavior of the human auditory system.
Then, the log-Mel spectrogram, the cochleagram and the constant-Q transform spectrogram are arranged into a multi-channel voice spectrogram. Arranging the three spectrograms along the channel dimension gives the data input to the neural network model a relatively larger width: on the one hand, this provides richer material for the model's voice feature extraction; on the other hand, correlations exist among the spectrograms in the multi-channel voice spectrogram, and exploiting these internal correlations can improve the accuracy and richness of voice feature extraction.
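Arranging the three spectrograms along the channel dimension is a simple stack; a NumPy sketch with random stand-ins for the actual log-Mel, cochleagram, and CQT computations (in practice these would come from an audio analysis library, and the bin/frame counts here are illustrative assumptions):

```python
import numpy as np

# stand-ins for the three spectrograms, each shaped (n_bins, n_frames)
log_mel     = np.random.rand(128, 64)
cochleagram = np.random.rand(128, 64)
cqt         = np.random.rand(128, 64)

# multi-channel voice spectrogram: channels first, ready for a 2-D CNN
multi_channel = np.stack([log_mel, cochleagram, cqt], axis=0)
assert multi_channel.shape == (3, 128, 64)
```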
Specifically, in the technical solution of the present application, a dual-flow network model comprising a first convolutional neural network using a spatial attention mechanism and a second convolutional neural network using a channel attention mechanism is used to process the multi-channel voice spectrogram to obtain a classification feature map. Here, the two convolutional neural networks of the dual-flow network model respectively perform spatially enhanced explicit spatial encoding and channel-enhanced explicit spatial encoding on the multi-channel voice spectrogram in a parallel structure, and the spatial enhancement feature map and the channel enhancement feature map obtained by the two encodings are aggregated to obtain the classification feature map.
For the classification feature map, when the dual-flow network structure fuses the spatial enhancement feature map output by the first convolutional neural network and the channel enhancement feature map output by the second convolutional neural network, out-of-distribution feature values may arise in the classification feature map because the spatial-dimension distribution of the spatial enhancement feature map and the channel-dimension distribution of the channel enhancement feature map are not aligned, which degrades the classification effect of the classification feature map.
Therefore, adaptive-instance information statistical normalization is performed on the classification feature map, specifically:
f_i' = (f_i − μ) / σ + α · log₂(‖F‖₂ / S)

wherein f_i is the feature value of each position in the classification feature map F, μ and σ² are the mean and variance of the feature value set {f_i}, S is the size (number of feature values) of the classification feature map F, log₂ is the logarithm with base 2, and α is a weight hyperparameter.
The adaptive-instance information statistical normalization takes the feature value set of the classification feature map as an adaptive instance, uses the intrinsic prior information of its statistical characteristics to perform a dynamic, generative information normalization on each individual feature value, and meanwhile uses the norm length information of the feature set as a bias, serving as an invariance description within the set's distribution domain. In this way, feature optimization that masks the perturbation distribution of special instances as much as possible is achieved, the classification effect of the classification feature map is improved, and the accuracy of intention understanding and recognition of the consultation voice signal is improved accordingly.
Based on this, the present application proposes a digital staff intelligent system based on a financial system, which comprises: a voice signal acquisition module for acquiring a consultation voice signal of a client; a noise reduction module for passing the consultation voice signal through a signal noise reduction module based on an auto-encoder to obtain a noise-reduced consultation voice signal; a voice spectrogram extraction module for extracting a logarithmic Mel spectrogram, a cochlear spectrogram and a constant-Q transform spectrogram from the noise-reduced consultation voice signal; a multi-channel voice spectrogram construction module for arranging the logarithmic Mel spectrogram, the cochlear spectrogram and the constant-Q transform spectrogram into a multi-channel voice spectrogram; a dual-flow encoding module for passing the multi-channel voice spectrogram through a dual-flow network model to obtain a classification feature map; an adaptive correction module for correcting the feature values of each position in the classification feature map based on the statistical characteristics of the feature value set of all positions in the classification feature map to obtain a corrected classification feature map; and a consultation intention recognition module for passing the corrected classification feature map through a classifier to obtain a classification result, the classification result being used for representing an intention topic label of the consultation voice signal.
Having described the general principles of the present application, various non-limiting embodiments of the present application will now be described with reference to the accompanying drawings.
Exemplary System
FIG. 1 illustrates a block diagram of a digital employee intelligence system based on a financial system in accordance with an embodiment of the present application. As shown in fig. 1, the digital employee intelligence system 100 based on a financial system according to an embodiment of the present application includes: a voice signal acquisition module 110 for acquiring a consultation voice signal of a client; a noise reduction module 120 for passing the consultation voice signal through a signal noise reduction module based on an auto-encoder to obtain a noise-reduced consultation voice signal; a voice spectrogram extraction module 130 for extracting a logarithmic Mel spectrogram, a cochlear spectrogram and a constant-Q transform spectrogram from the noise-reduced consultation voice signal; a multi-channel voice spectrogram construction module 140 for arranging the logarithmic Mel spectrogram, the cochlear spectrogram and the constant-Q transform spectrogram into a multi-channel voice spectrogram; a dual-flow encoding module 150 for passing the multi-channel voice spectrogram through a dual-flow network model to obtain a classification feature map; an adaptive correction module 160 for correcting the feature values of each position in the classification feature map based on the statistical characteristics of the feature value set of all positions in the classification feature map to obtain a corrected classification feature map; and a consultation intention recognition module 170 for passing the corrected classification feature map through a classifier to obtain a classification result, the classification result being used for representing an intention topic label of the consultation voice signal.
FIG. 2 illustrates a system architecture diagram of the digital employee intelligence system 100 based on a financial system in accordance with an embodiment of the present application. As shown in fig. 2, in the system architecture of the digital employee intelligence system 100 based on a financial system, a consultation voice signal of a client is first acquired. The consultation voice signal is then passed through a signal noise reduction module based on an auto-encoder to obtain a noise-reduced consultation voice signal. Next, a logarithmic Mel spectrogram, a cochlear spectrogram and a constant-Q transform spectrogram are extracted from the noise-reduced consultation voice signal and arranged into a multi-channel voice spectrogram. The multi-channel voice spectrogram is then passed through a dual-flow network model to obtain a classification feature map. Subsequently, based on the statistical characteristics of the feature value set of all positions in the classification feature map, the feature values of each position in the classification feature map are corrected to obtain a corrected classification feature map. Finally, the corrected classification feature map is passed through a classifier to obtain a classification result, and the classification result is used to represent an intention topic label of the consultation voice signal.
In the above digital staff intelligent system 100 based on a financial system, the voice signal acquisition module 110 is configured to acquire a consultation voice signal of a client. In the technical solution of the present application, the digital staff's understanding of the client's consultation intention can be converted into a voice topic labeling problem, that is, the voice signal is encoded and understood in an appropriate manner and a predetermined intention topic label is assigned to the voice signal, which can be realized by a two-stage model of a feature extractor plus a classifier.
However, when clients consult, different clients have different habits of language expression, which poses a great challenge to the understanding of speech semantics. Meanwhile, phenomena such as pauses, accents and intonation still occur when users express their problems, and because users often express their consultation needs through a communication system, considerable noise is introduced during the collection and transmission of the voice signal. These technical problems lead to low accuracy in consultation intention understanding. Therefore, in the technical solution of the present application, a consultation voice signal of the client is first acquired.
In the above digital staff intelligent system 100 based on a financial system, the noise reduction module 120 is configured to pass the consultation voice signal through a signal noise reduction module based on an auto-encoder to obtain a noise-reduced consultation voice signal. That is, after the consultation voice signal of the client is obtained, it is passed through the auto-encoder based signal noise reduction module. In particular, the signal noise reduction module comprises an encoder using convolutional layers and a decoder using deconvolution layers. Accordingly, the denoising process first performs explicit spatial encoding on the consultation voice signal with the convolutional layers of the encoder to extract speech features from the consultation voice signal while filtering out the responses to noise, and then performs deconvolution processing on the speech features with the deconvolution layers of the decoder to obtain the noise-reduced consultation voice signal.
In one example, in the above digital employee intelligence system 100 based on a financial system, the noise reduction module 120 comprises: a speech signal encoding unit for inputting the consultation voice signal into the encoder of the signal noise reduction module, wherein the encoder performs explicit spatial encoding on the consultation voice signal using convolutional layers to obtain speech features; and a speech feature decoding unit for inputting the speech features into the decoder of the signal noise reduction module, wherein the decoder performs deconvolution processing on the speech features using deconvolution layers to obtain the noise-reduced consultation voice signal.
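As an illustration only, the encoder-decoder denoising flow can be sketched in plain NumPy, with a fixed smoothing kernel standing in for the learned convolution weights (in the actual system the kernels would be trained end to end; the function names and kernel choice here are assumptions for demonstration):

```python
import numpy as np

def encode(signal, kernel, stride=2):
    """Strided 1-D convolution: downsample the signal while extracting features."""
    pad = len(kernel) // 2
    x = np.pad(signal, pad, mode="edge")
    return np.array([np.dot(x[i:i + len(kernel)], kernel)
                     for i in range(0, len(signal), stride)])

def decode(features, kernel, stride=2, length=None):
    """Transposed 1-D convolution: zero-stuff, then convolve to upsample."""
    up = np.zeros(len(features) * stride)
    up[::stride] = features
    pad = len(kernel) // 2
    x = np.pad(up, pad, mode="edge")
    out = np.array([np.dot(x[i:i + len(kernel)], kernel)
                    for i in range(len(up))])
    return out[:length] if length is not None else out

rng = np.random.default_rng(0)
t = np.linspace(0, 1, 512, endpoint=False)
clean = np.sin(2 * np.pi * 5 * t)                 # toy "clean" consultation signal
noisy = clean + 0.3 * rng.standard_normal(512)    # signal with transmission noise

k = np.ones(9) / 9.0          # fixed smoothing kernel stands in for learned weights
denoised = decode(encode(noisy, k), 2.0 * k, length=512)
```

Because the stand-in kernels are low-pass, the round trip suppresses noise; a trained auto-encoder would learn this behavior from data instead.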
In the above digital staff intelligent system 100 based on a financial system, the voice spectrogram extraction module 130 is configured to extract a logarithmic Mel spectrogram, a cochlear spectrogram and a constant-Q transform spectrogram from the noise-reduced consultation voice signal. In order to improve the accuracy of semantic understanding of the noise-reduced consultation voice signal, it is converted into spectrograms. A spectrogram is a perceptual representation composed of three components: time, frequency and energy. As a "visible language" of the voice signal, it provides rich visual information, combines time-domain analysis with frequency-domain analysis, and reflects both the frequency content of the signal and how that content varies over time.
Particularly, in the technical solution of the present application, in order to capture richer acoustic spectrum information, a logarithmic Mel spectrogram, a cochlear spectrogram and a constant-Q transform spectrogram are respectively extracted from the noise-reduced consultation voice signal. It will be appreciated that the logarithmic Mel spectrogram is the most widely used feature; its design mimics the human ear, which has different sensitivities to sounds of different frequencies. Its extraction flow is similar to that of MFCC, except that the final linear transformation, namely the discrete cosine transform, is omitted; removing this step preserves more high-order and nonlinear information of the sound signal. The cochlear spectrogram is obtained by a Gammatone filter bank that simulates the frequency selectivity of the human cochlea, and its frequency response is more consistent with the auditory characteristics of the human ear. The constant-Q transform spectrogram provides better frequency resolution at low frequencies and better time resolution at high frequencies, thereby better mimicking the behavior of the human auditory system.
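As a rough illustration of the first of these features, a minimal NumPy computation of a logarithmic Mel spectrogram is sketched below: framing, power spectrum, triangular Mel filterbank, then log, with no final discrete cosine transform (unlike MFCC). The cochlear and constant-Q transform spectrograms would be obtained analogously with a Gammatone filterbank and a constant-Q filterbank; all parameter values here are illustrative assumptions:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=8000, n_fft=256, hop=128, n_mels=20):
    # 1. Frame the signal and apply a Hann window
    frames = np.array([signal[i:i + n_fft] * np.hanning(n_fft)
                       for i in range(0, len(signal) - n_fft + 1, hop)])
    # 2. Power spectrum of each frame
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2        # (T, n_fft//2 + 1)
    # 3. Triangular Mel filterbank (equally spaced on the Mel scale)
    mel_pts = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mel_pts) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        l, c, r = bins[m - 1], bins[m], bins[m + 1]
        for k in range(l, c):
            fb[m - 1, k] = (k - l) / max(c - l, 1)
        for k in range(c, r):
            fb[m - 1, k] = (r - k) / max(r - c, 1)
    # 4. Apply the filterbank and take the log; no DCT, unlike MFCC
    return np.log(power @ fb.T + 1e-10).T                   # (n_mels, T)

sr = 8000
t = np.arange(sr) / sr
sig = np.sin(2 * np.pi * 440 * t)    # one second of a 440 Hz tone
mel = log_mel_spectrogram(sig, sr=sr)
```

In practice libraries such as librosa provide optimized equivalents; this sketch only makes the "MFCC minus the DCT step" structure explicit.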
In the above digital staff intelligent system 100 based on a financial system, the multi-channel voice spectrogram construction module 140 is configured to arrange the logarithmic Mel spectrogram, the cochlear spectrogram and the constant-Q transform spectrogram into a multi-channel voice spectrogram. That is, the three spectrograms are arranged along the channel dimension, so that the data input to the neural network model has a relatively larger width. On the one hand, this provides richer material for voice feature extraction by the neural network model; on the other hand, correlations exist among the spectrograms in the multi-channel voice spectrogram, and exploiting these internal correlations can improve the accuracy and richness of voice feature extraction.
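Arranging the three spectrograms along the channel dimension is a simple stacking operation, sketched below with random placeholders; in practice the three spectrograms would first be resampled to a common frequency-time grid, which is an assumption of this sketch:

```python
import numpy as np

rng = np.random.default_rng(0)
F, T = 64, 100                             # common frequency-time grid (illustrative)
log_mel = rng.standard_normal((F, T))      # placeholder logarithmic Mel spectrogram
cochlea = rng.standard_normal((F, T))      # placeholder cochlear spectrogram
cqt     = rng.standard_normal((F, T))      # placeholder constant-Q transform spectrogram

# Arrange along a new leading channel dimension, like the channels of an RGB image
multi_channel = np.stack([log_mel, cochlea, cqt], axis=0)   # shape (3, F, T)
```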
In the above digital staff intelligent system 100 based on a financial system, the dual-flow encoding module 150 is configured to pass the multi-channel voice spectrogram through a dual-flow network model to obtain a classification feature map. Specifically, in the technical solution of the present application, the dual-flow network model comprises a first convolutional neural network using a spatial attention mechanism and a second convolutional neural network using a channel attention mechanism. The two convolutional neural networks respectively perform spatially enhanced explicit spatial encoding and channel-enhanced explicit spatial encoding on the multi-channel voice spectrogram in a parallel structure, and the spatial enhancement feature map and the channel enhancement feature map obtained by the two encodings are aggregated to obtain the classification feature map.
FIG. 3 illustrates a block diagram of the dual-flow encoding module in the digital employee intelligence system based on a financial system in accordance with an embodiment of the present application. As shown in fig. 3, in the above digital employee intelligence system 100 based on a financial system, the dual-flow encoding module 150 includes: a first convolutional encoding unit 151 configured to input the multi-channel voice spectrogram into the first convolutional neural network of the dual-flow network model using a spatial attention mechanism to obtain a spatial enhancement feature map; a second convolutional encoding unit 152 configured to input the multi-channel voice spectrogram into the second convolutional neural network of the dual-flow network model using a channel attention mechanism to obtain a channel enhancement feature map; and an aggregation unit 153 configured to fuse the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map.
Fig. 4 illustrates a block diagram of the first convolutional encoding unit in the digital employee intelligence system based on a financial system in accordance with an embodiment of the present application. As shown in fig. 4, in the above digital staff intelligent system 100 based on a financial system, the first convolutional encoding unit 151 includes: a deep convolutional encoding subunit 1511 configured to input the multi-channel voice spectrogram into the multi-layer convolutional layers of the first convolutional neural network to obtain a first convolution feature map; a spatial attention subunit 1512 configured to input the first convolution feature map into the spatial attention module of the first convolutional neural network to obtain a spatial attention map; and an attention applying subunit 1513 configured to calculate the position-wise point multiplication of the spatial attention map and the first convolution feature map to obtain the spatial enhancement feature map.
In one example, in the above digital employee intelligence system 100 based on a financial system, the spatial attention subunit 1512 is further configured to: perform convolutional encoding on the first convolution feature map using the convolutional layers of the spatial attention module to obtain a spatial perception feature map; calculate the position-wise multiplication between the spatial perception feature map and the first convolution feature map to obtain a spatial attention score map; and input the spatial attention score map into a Sigmoid activation function to obtain the spatial attention map.
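A minimal NumPy sketch of this spatial attention flow is given below. A 1x1 convolution (a channel-mixing matrix) stands in for the module's convolutional layers, and the score map is averaged over channels to form a single spatial map; both are simplifying assumptions of this sketch:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def spatial_attention(feat, w):
    """feat: (C, H, W) first convolution feature map;
       w:    (C, C) weights of an illustrative 1x1 convolution."""
    # 1. Convolutional encoding -> spatial perception feature map
    perception = np.einsum("oc,chw->ohw", w, feat)
    # 2. Position-wise multiplication with the input -> spatial attention score map
    scores = perception * feat
    # 3. Sigmoid activation -> spatial attention map with values in (0, 1)
    attn = sigmoid(scores.mean(axis=0, keepdims=True))   # collapse to one spatial map
    # 4. Apply the attention map position-wise to the feature map
    return attn * feat, attn

rng = np.random.default_rng(1)
feat = rng.standard_normal((8, 16, 16))
w = 0.1 * rng.standard_normal((8, 8))
enhanced, attn = spatial_attention(feat, w)
```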
FIG. 5 illustrates a block diagram of the second convolutional encoding unit in the digital employee intelligence system based on a financial system in accordance with an embodiment of the present application. As shown in fig. 5, in the above digital staff intelligent system 100 based on a financial system, the second convolutional encoding unit 152 includes: a deep convolutional encoding subunit 1521 configured to input the multi-channel voice spectrogram into the multi-layer convolutional layers of the second convolutional neural network to obtain a second convolution feature map; a global average pooling subunit 1522 configured to calculate the global mean of each feature matrix of the second convolution feature map along the channel dimension to obtain a channel feature vector; a channel attention weight calculation subunit 1523 configured to input the channel feature vector into a Sigmoid activation function to obtain a channel attention weight vector; and a channel attention applying subunit 1524 configured to weight each feature matrix of the second convolution feature map along the channel dimension with the feature value of each position in the channel attention weight vector as a weight to obtain the channel enhancement feature map.
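The channel attention flow described above can be sketched in a few lines of NumPy (global average pooling per channel, Sigmoid, then per-channel weighting); the shapes are illustrative:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_attention(feat):
    """feat: (C, H, W) second convolution feature map."""
    # 1. Global average pooling of each channel's feature matrix
    channel_vec = feat.mean(axis=(1, 2))            # (C,)
    # 2. Sigmoid -> channel attention weight vector, each weight in (0, 1)
    weights = sigmoid(channel_vec)
    # 3. Weight each channel's feature matrix by its attention weight
    return weights[:, None, None] * feat, weights

rng = np.random.default_rng(2)
feat = rng.standard_normal((8, 16, 16))
enhanced, w = channel_attention(feat)
```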
In one example, in the above digital staff intelligent system 100 based on a financial system, the aggregation unit 153 is further configured to: fuse the spatial enhancement feature map and the channel enhancement feature map according to the following formula to obtain the classification feature map:
F = α · F_s ⊕ (1 − α) · F_c

wherein F is the classification feature map, F_s is the spatial enhancement feature map, F_c is the channel enhancement feature map, "⊕" represents the addition of elements at the corresponding positions of the spatial enhancement feature map and the channel enhancement feature map, and α is a weighting parameter for controlling the balance between the spatial enhancement feature map and the channel enhancement feature map in the classification feature map.
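Reading α as a convex weighting between the two branches (one possible interpretation of the balance parameter), the fusion can be sketched as:

```python
import numpy as np

def fuse(f_spatial, f_channel, alpha=0.5):
    """Element-wise weighted sum of the two enhanced feature maps;
    alpha balances the spatial and channel branches."""
    return alpha * f_spatial + (1.0 - alpha) * f_channel

rng = np.random.default_rng(3)
fs = rng.standard_normal((8, 16, 16))   # spatial enhancement feature map
fc = rng.standard_normal((8, 16, 16))   # channel enhancement feature map
f = fuse(fs, fc, alpha=0.7)             # classification feature map
```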
In the above digital staff intelligent system 100 based on a financial system, the adaptive correction module 160 is configured to correct the feature values of each position in the classification feature map based on the statistical characteristics of the feature value set of all positions in the classification feature map to obtain a corrected classification feature map. For the classification feature map, when the dual-flow network structure fuses the spatial enhancement feature map output by the first convolutional neural network and the channel enhancement feature map output by the second convolutional neural network, out-of-distribution feature values may arise in the classification feature map because the spatial-dimension distribution of the spatial enhancement feature map and the channel-dimension distribution of the channel enhancement feature map are not aligned, which degrades the classification effect of the classification feature map. Therefore, adaptive-instance information statistical normalization is performed on the classification feature map.
In one example, in the above digital employee intelligence system 100 based on a financial system, the adaptive correction module 160 is further configured to: correct the feature value of each position in the classification feature map according to the following formula to obtain the corrected classification feature map, wherein the formula is:
f_i' = (f_i − μ) / σ + α · log₂(‖F‖₂ / S)

wherein f_i is the feature value of each position in the classification feature map F, μ and σ² are the mean and variance of the feature value set {f_i}, S is the size (number of feature values) of the classification feature map F, log₂ is the logarithm with base 2, α is a weight hyperparameter, and f_i' is the feature value of each position in the corrected classification feature map.
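Under the reading of the formula above, where the bias is α times the base-2 logarithm of the feature set's norm divided by its size S, the correction can be sketched in NumPy (the ε term is an added numerical-stability assumption):

```python
import numpy as np

def adaptive_instance_correction(feat, alpha=1.0, eps=1e-6):
    """feat: classification feature map of any shape.
    Standardize every feature value by the mean/std of the whole value set,
    then add a shared bias derived from the set's norm and its size S."""
    values = feat.ravel()
    mu, sigma = values.mean(), values.std()
    S = values.size
    bias = alpha * np.log2(np.linalg.norm(values) / S + eps)
    return (feat - mu) / (sigma + eps) + bias

rng = np.random.default_rng(4)
feat = rng.standard_normal((8, 4, 4))
corrected = adaptive_instance_correction(feat)
```

The standardization suppresses out-of-distribution feature values, while the shared log-norm bias acts as the invariance description of the set's distribution domain described above.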
The adaptive-instance information statistical normalization takes the feature value set of the classification feature map as an adaptive instance, uses the intrinsic prior information of its statistical characteristics to perform a dynamic, generative information normalization on each individual feature value, and meanwhile uses the norm length information of the feature set as a bias, serving as an invariance description within the set's distribution domain. In this way, feature optimization that masks the perturbation distribution of special instances as much as possible is achieved, the classification effect of the classification feature map is improved, and the accuracy of intention understanding and recognition of the consultation voice signal is improved accordingly.
In the above digital employee intelligent system 100 based on a financial system, the consultation intention recognition module 170 is configured to pass the corrected classification feature map through a classifier to obtain a classification result, where the classification result is used to represent an intention topic label of the consultation voice signal.
In one example, in the above digital employee intelligence system 100 based on a financial system, the consultation intention recognition module 170 is further configured to: process the corrected classification feature map using the classifier according to the following formula to generate the classification result:
softmax{(W_n, B_n) : ⋯ : (W_1, B_1) | Project(F)}

wherein Project(F) represents projecting the corrected classification feature map as a vector, W_1 to W_n are the weight matrices of the fully connected layers of each layer, and B_1 to B_n are the bias matrices of the fully connected layers of each layer.
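A minimal NumPy stand-in for this classifier is sketched below: the corrected classification feature map is flattened ("projected as a vector"), passed through fully connected layers, and normalized with softmax. The layer sizes, the ReLU on hidden layers, and the six intent topic labels are illustrative assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def classify(feature_map, layers):
    """Project the feature map as a vector, then apply fully connected (W, B) layers."""
    x = feature_map.ravel()                 # Project(F): flatten to a vector
    for i, (W, B) in enumerate(layers):
        x = W @ x + B
        if i < len(layers) - 1:
            x = np.maximum(x, 0.0)          # ReLU on hidden layers (illustrative)
    return softmax(x)                       # probabilities over intent topic labels

rng = np.random.default_rng(5)
feat = rng.standard_normal((4, 8, 8))       # corrected classification feature map
layers = [(0.05 * rng.standard_normal((32, feat.size)), np.zeros(32)),
          (0.05 * rng.standard_normal((6, 32)), np.zeros(6))]   # 6 topic labels
probs = classify(feat, layers)
```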
In summary, the digital employee intelligence system 100 based on a financial system according to embodiments of the present application has been illustrated, which converts the digital employee's understanding of a client's consultation intention into a voice topic labeling problem. Specifically, a plurality of spectrograms are extracted from the voice signal, and are encoded and decoded by a deep neural network model to obtain a recognition result representing a predetermined intention topic label. In this way, on the premise of more accurately understanding the user's consultation intention, the system can respond to the client more reasonably and adaptively, improving the user's voice consultation experience.
As described above, the digital employee intelligence system 100 based on a financial system according to the embodiments of the present application may be implemented in various terminal devices, such as a server deployed with the financial system based digital employee intelligence. In one example, the digital employee intelligence system 100 based on a financial system may be integrated into the terminal device as a software module and/or a hardware module. For example, it may be a software module in the operating system of the terminal device, or may be an application developed for the terminal device; of course, it may also be one of the many hardware modules of the terminal device.
Alternatively, in another example, the financial system based digital employee intelligence system 100 and the terminal device may also be separate devices, and the financial system based digital employee intelligence system 100 may be connected to the terminal device through a wired and/or wireless network and transmit the interaction information in an agreed data format.
Exemplary Method
According to another aspect of the present application, a digital employee intelligent method based on a financial system is also provided. As shown in fig. 6, the digital employee intelligent method based on a financial system according to an embodiment of the present application includes the steps of: S110, acquiring a consultation voice signal of a client; S120, passing the consultation voice signal through a signal noise reduction module based on an auto-encoder to obtain a noise-reduced consultation voice signal; S130, extracting a logarithmic Mel spectrogram, a cochlear spectrogram and a constant-Q transform spectrogram from the noise-reduced consultation voice signal; S140, arranging the logarithmic Mel spectrogram, the cochlear spectrogram and the constant-Q transform spectrogram into a multi-channel voice spectrogram; S150, passing the multi-channel voice spectrogram through a dual-flow network model to obtain a classification feature map; S160, correcting the feature values of each position in the classification feature map based on the statistical characteristics of the feature value set of all positions in the classification feature map to obtain a corrected classification feature map; and S170, passing the corrected classification feature map through a classifier to obtain a classification result, the classification result being used for representing an intention topic label of the consultation voice signal.
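For orientation only, steps S110 to S170 can be wired together with toy stand-ins for each module (moving-average denoising, a single FFT-based spectrogram reused three times, and an untrained linear classifier; every component here is a placeholder, not the trained system):

```python
import numpy as np

rng = np.random.default_rng(6)
sr = 8000

# S110: a toy "consultation voice signal" (300 Hz tone plus noise)
signal = np.sin(2 * np.pi * 300 * np.arange(4000) / sr) \
         + 0.2 * rng.standard_normal(4000)

# S120: stand-in denoiser (moving average instead of a trained auto-encoder)
denoised = np.convolve(signal, np.ones(5) / 5, mode="same")

# S130: stand-in spectrogram (log-magnitude of windowed FFT frames)
def spect(x, n=256, hop=128):
    frames = np.array([x[i:i + n] * np.hanning(n)
                       for i in range(0, len(x) - n + 1, hop)])
    return np.log(np.abs(np.fft.rfft(frames, axis=1)).T + 1e-10)

s = spect(denoised)

# S140: stack three (placeholder) spectrograms along the channel dimension
multi = np.stack([s, 0.9 * s, 1.1 * s], axis=0)           # (3, F, T)

# S150: stand-in "dual-flow" encoding -> classification feature map
feat = multi - multi.mean(axis=0, keepdims=True) + multi.mean()

# S160: adaptive correction (standardize, plus a log2-norm bias)
v = feat.ravel()
feat = (feat - v.mean()) / (v.std() + 1e-6) \
       + np.log2(np.linalg.norm(v) / v.size + 1e-6)

# S170: stand-in classifier over 4 intent topic labels
W = 0.01 * rng.standard_normal((4, feat.size))
logits = W @ feat.ravel()
probs = np.exp(logits - logits.max())
probs /= probs.sum()
```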
In one example, in the above digital employee intelligent method based on a financial system, passing the consultation voice signal through the auto-encoder based signal noise reduction module to obtain the noise-reduced consultation voice signal includes: inputting the consultation voice signal into the encoder of the signal noise reduction module, wherein the encoder performs explicit spatial encoding on the consultation voice signal using convolutional layers to obtain speech features; and inputting the speech features into the decoder of the signal noise reduction module, wherein the decoder performs deconvolution processing on the speech features using deconvolution layers to obtain the noise-reduced consultation voice signal.
In one example, in the above digital employee intelligent method based on a financial system, passing the multi-channel voice spectrogram through the dual-flow network model to obtain the classification feature map includes: inputting the multi-channel voice spectrogram into the first convolutional neural network of the dual-flow network model using a spatial attention mechanism to obtain a spatial enhancement feature map; inputting the multi-channel voice spectrogram into the second convolutional neural network of the dual-flow network model using a channel attention mechanism to obtain a channel enhancement feature map; and fusing the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map.
In one example, in the above digital employee intelligent method based on a financial system, inputting the multi-channel voice spectrogram into the first convolutional neural network of the dual-flow network model using a spatial attention mechanism to obtain the spatial enhancement feature map includes: inputting the multi-channel voice spectrogram into the multi-layer convolutional layers of the first convolutional neural network to obtain a first convolution feature map; inputting the first convolution feature map into the spatial attention module of the first convolutional neural network to obtain a spatial attention map; and calculating the position-wise point multiplication of the spatial attention map and the first convolution feature map to obtain the spatial enhancement feature map.
In one example, in the above digital employee intelligent method based on a financial system, inputting the multi-channel voice spectrogram into the first convolutional neural network of the dual-flow network model using a spatial attention mechanism to obtain the spatial enhancement feature map further includes: performing convolutional encoding on the first convolution feature map using the convolutional layers of the spatial attention module to obtain a spatial perception feature map; calculating the position-wise multiplication between the spatial perception feature map and the first convolution feature map to obtain a spatial attention score map; and inputting the spatial attention score map into a Sigmoid activation function to obtain the spatial attention map.
In one example, in the above digital employee intelligent method based on a financial system, inputting the multi-channel voice spectrogram into the second convolutional neural network of the dual-flow network model using a channel attention mechanism to obtain the channel enhancement feature map includes: inputting the multi-channel voice spectrogram into the multi-layer convolutional layers of the second convolutional neural network to obtain a second convolution feature map; calculating the global mean of each feature matrix of the second convolution feature map along the channel dimension to obtain a channel feature vector; inputting the channel feature vector into a Sigmoid activation function to obtain a channel attention weight vector; and weighting each feature matrix of the second convolution feature map along the channel dimension with the feature value of each position in the channel attention weight vector as a weight to obtain the channel enhancement feature map.
In one example, in the above financial-system-based digital employee intelligence method, fusing the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map further includes: fusing the spatial enhancement feature map and the channel enhancement feature map using the following formula to obtain the classification feature map:
F = λ·F_1 ⊕ (1 − λ)·F_2

wherein F is the classification feature map, F_1 is the spatial enhancement feature map, F_2 is the channel enhancement feature map, "⊕" represents the addition of the elements at the corresponding positions of the spatial enhancement feature map and the channel enhancement feature map, and λ is a weighting parameter for controlling the balance between the spatial enhancement feature map and the channel enhancement feature map in the classification feature map.
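The fusion described above is a position-wise weighted addition governed by a single balance parameter. A minimal sketch follows; pairing λ with (1 − λ) is an assumption consistent with one balance parameter controlling both maps.

```python
import numpy as np

def fuse(spatial_map, channel_map, lam=0.5):
    """Position-wise weighted addition of the two enhancement maps.
    The (1 - lam) pairing is an assumption, not stated in the source."""
    return lam * spatial_map + (1.0 - lam) * channel_map

f1 = np.ones((4, 4))    # spatial enhancement feature map (toy)
f2 = np.zeros((4, 4))   # channel enhancement feature map (toy)
fused = fuse(f1, f2, lam=0.3)
```

With lam=0.3 the fused map here is uniformly 0.3, i.e. 30% of the spatial branch plus 70% of the channel branch.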
In one example, in the above financial-system-based digital employee intelligence method, correcting the feature values of the respective positions in the classification feature map based on the statistical features of the set of feature values of all positions in the classification feature map to obtain a corrected classification feature map further includes: correcting the feature value of each position in the classification feature map according to the following formula to obtain the corrected classification feature map; wherein the formula is:
[correction formula reproduced in the original publication only as an image]

wherein f(i,j) is the feature value of each position of the classification feature map F, μ and σ² are the mean and the variance of the set of feature values of all positions in F, S is the size of the classification feature map F, log denotes the logarithm base 2, α is a weighting hyperparameter, and f′(i,j) is the feature value of each position of the corrected classification feature map.
In one example, in the above financial-system-based digital employee intelligence method, passing the corrected classification feature map through a classifier to obtain a classification result further includes: processing the corrected classification feature map using the classifier to generate the classification result according to the following formula:
O = softmax{(W_N, B_N) : ⋯ : (W_1, B_1) | Project(F)}

wherein O is the classification result, Project(F) represents projecting the corrected classification feature map into a vector, W_1 to W_N are the weight matrices of the fully connected layers of each layer, and B_1 to B_N are the bias matrices of the fully connected layers of each layer.
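The classifier stage can be sketched as: flatten the corrected classification feature map to a vector, pass it through fully connected layers, and normalize to class probabilities. The layer count, dimensions, and the softmax normalization at the end are illustrative assumptions.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())          # shift for numerical stability
    return e / e.sum()

def classify(feature_map, layers):
    """layers: list of (W, b) pairs for the fully connected layers
    (shapes are assumptions for illustration)."""
    x = feature_map.reshape(-1)      # Project(F): flatten to a vector
    for W, b in layers:              # W_1..W_N, B_1..B_N in order
        x = W @ x + b
    return softmax(x)                # probabilities over intention topics

rng = np.random.default_rng(0)
fmap = rng.standard_normal((2, 4, 4))                   # corrected map (toy)
layers = [(rng.standard_normal((8, 32)), np.zeros(8)),  # 32 = 2*4*4 inputs
          (rng.standard_normal((5, 8)), np.zeros(5))]   # 5 topic labels
probs = classify(fmap, layers)
```

The argmax of `probs` would then select the intention topic label assigned to the consultation voice signal.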
In summary, the financial-system-based digital employee intelligence method according to an embodiment of the present application has been illustrated. It converts the digital employee's understanding of a client's consultation intention into a voice topic-labeling problem: a feature extractor encodes and understands the voice signal and assigns a predetermined intention topic label to it. In particular, the voice signal is first denoised by a noise reduction module so as to improve the accuracy of consultation intention understanding. An optimized intelligent scheme for digital employees in a financial system is thus constructed.
The foregoing describes the general principles of the present application in conjunction with specific embodiments. However, the advantages and effects mentioned in the present application are merely examples, not limitations, and should not be considered essential to the various embodiments of the present application. The foregoing disclosure of specific details is provided for purposes of illustration and understanding only, and is not intended to limit the application to those details.
The block diagrams of devices, apparatuses, and systems referred to in this application are given only as illustrative examples and are not intended to require or imply that the connections, arrangements, and configurations must be made in the manner shown in the block diagrams. These devices, apparatuses, and systems may be connected, arranged, and configured in any manner, as will be appreciated by one skilled in the art. Words such as "including," "comprising," and "having" are open-ended words that mean "including, but not limited to," and are used interchangeably therewith. The words "or" and "and" as used herein mean, and are used interchangeably with, the word "and/or," unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".
It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims (9)

1. A digital staff intelligence system based on a financial system, comprising:
the voice signal acquisition module is used for acquiring a consultation voice signal of a client;
the noise reduction module is used for passing the consultation voice signal through an auto-encoder-based signal noise reduction module to obtain a noise-reduced consultation voice signal;
the voice spectrogram extraction module is used for extracting a log-Mel spectrogram, a cochleagram and a constant-Q-transform spectrogram from the noise-reduced consultation voice signal;
the multi-channel voice spectrogram construction module is used for arranging the log-Mel spectrogram, the cochleagram and the constant-Q-transform spectrogram into a multi-channel voice spectrogram;
the dual-stream encoding module is used for passing the multi-channel voice spectrogram through a dual-stream network model to obtain a classification feature map;
the adaptive correction module is used for correcting the feature values of the respective positions in the classification feature map based on the statistical features of the set of feature values of all positions in the classification feature map to obtain a corrected classification feature map; and
the consultation intention recognition module is used for passing the corrected classification feature map through a classifier to obtain a classification result, where the classification result represents an intention topic label of the consultation voice signal.
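The spectrogram-arrangement step of claim 1 amounts to stacking three time-frequency maps as channels of one "image". The sketch below assumes the three maps have already been extracted and resampled to a common grid (in practice, log-Mel and constant-Q spectrograms can come from librosa's `melspectrogram` and `cqt`, and a cochleagram from a gammatone filterbank); the per-channel standardization is an added assumption, not a step from the claim.

```python
import numpy as np

def stack_spectrograms(log_mel, cochleagram, cqt):
    """Arrange three (F, T) time-frequency maps into a (3, F, T)
    multi-channel voice spectrogram."""
    maps = [log_mel, cochleagram, cqt]
    shape = maps[0].shape
    assert all(m.shape == shape for m in maps), "resample to a common (F, T) grid first"
    # per-channel standardization (an assumption) so that no single
    # representation dominates the downstream convolutional encoders
    maps = [(m - m.mean()) / (m.std() + 1e-8) for m in maps]
    return np.stack(maps, axis=0)

rng = np.random.default_rng(0)
multi = stack_spectrograms(rng.standard_normal((64, 100)),
                           rng.standard_normal((64, 100)),
                           rng.standard_normal((64, 100)))
```

The resulting (3, F, T) array plays the same role as an RGB image for the two convolutional branches of the dual-stream network.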
2. The financial system based digital employee intelligence system of claim 1 wherein said noise reduction module comprises:
a speech signal encoding unit for inputting the consultation voice signal into an encoder of the signal noise reduction module, wherein the encoder performs explicit spatial encoding on the consultation voice signal using convolutional layers to obtain speech features; and
a speech feature decoding unit for inputting the speech features into a decoder of the signal noise reduction module, wherein the decoder performs deconvolution processing on the speech features using deconvolution layers to obtain the noise-reduced consultation voice signal.
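The encoder/decoder pair of claim 2 (strided convolutional encoding followed by deconvolution back to the signal domain) can be sketched in one dimension. The fixed smoothing kernel below stands in for learned weights and is purely illustrative; a trained denoising auto-encoder would learn these filters from noisy/clean signal pairs.

```python
import numpy as np

def conv1d(x, k, stride=2):
    """Strided 1-D convolution: the encoder's explicit spatial encoding."""
    n = (len(x) - len(k)) // stride + 1
    return np.array([np.dot(x[i * stride:i * stride + len(k)], k)
                     for i in range(n)])

def deconv1d(h, k, stride=2):
    """Transposed 1-D convolution: the decoder's deconvolution step."""
    out = np.zeros(stride * (len(h) - 1) + len(k))
    for i, v in enumerate(h):
        out[i * stride:i * stride + len(k)] += v * k
    return out

rng = np.random.default_rng(1)
noisy = rng.standard_normal(64)          # consultation voice signal (toy)
kernel = np.array([0.25, 0.5, 0.25])     # stand-in for a learned filter
features = conv1d(noisy, kernel)         # speech features (downsampled)
restored = deconv1d(features, kernel)    # noise-reduced signal estimate
```

Note the round trip: the stride-2 encoder halves the temporal resolution, and the transposed convolution expands the features back to approximately the original signal length.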
3. The financial system based digital employee intelligence system of claim 2 wherein said dual stream encoding module comprises:
a first convolutional encoding unit for inputting the multi-channel voice spectrogram into a first convolutional neural network of the dual-stream network model that uses a spatial attention mechanism to obtain a spatial enhancement feature map;
a second convolutional encoding unit for inputting the multi-channel voice spectrogram into a second convolutional neural network of the dual-stream network model that uses a channel attention mechanism to obtain a channel enhancement feature map; and
an aggregation unit for fusing the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map.
4. The financial-system-based digital employee intelligence system of claim 3, wherein the first convolutional encoding unit comprises:
a depth convolutional encoding subunit for inputting the multi-channel voice spectrogram into the multilayer convolution layers of the first convolutional neural network to obtain a first convolution feature map;
a spatial attention subunit for inputting the first convolution feature map into a spatial attention module of the first convolutional neural network to obtain a spatial attention map; and
an attention applying subunit for computing the position-wise point multiplication between the spatial attention map and the first convolution feature map to obtain the spatial enhancement feature map.
5. The financial system-based digital employee intelligence system of claim 4 wherein the spatial attention subunit is further configured to:
performing convolutional encoding on the first convolution feature map using the convolutional layer of the spatial attention module to obtain a spatial perception feature map;
computing the position-wise multiplication between the spatial perception feature map and the first convolution feature map to obtain a spatial attention score map; and
inputting the spatial attention score map into a Sigmoid activation function to obtain the spatial attention map.
6. The financial-system-based digital employee intelligence system of claim 5, wherein the second convolutional encoding unit comprises:
a depth convolutional encoding subunit for inputting the multi-channel voice spectrogram into the multilayer convolution layers of the second convolutional neural network to obtain a second convolution feature map;
a global mean pooling subunit for calculating the global mean of each feature matrix of the second convolution feature map along the channel dimension to obtain a channel feature vector;
a channel attention weight calculation subunit for inputting the channel feature vector into the Sigmoid activation function to obtain a channel attention weight vector; and
a channel attention applying subunit for weighting each feature matrix of the second convolution feature map along the channel dimension by the feature value of the corresponding position in the channel attention weight vector to obtain the channel enhancement feature map.
7. The financial system-based digital employee intelligence system of claim 6, wherein the aggregation unit is further configured to:
fusing the spatial enhancement feature map and the channel enhancement feature map using the aggregation unit to obtain the classification feature map with the following formula:
F = λ·F_1 ⊕ (1 − λ)·F_2

wherein F is the classification feature map, F_1 is the spatial enhancement feature map, F_2 is the channel enhancement feature map, "⊕" represents the addition of the elements at the corresponding positions of the spatial enhancement feature map and the channel enhancement feature map, and λ is a weighting parameter for controlling the balance between the spatial enhancement feature map and the channel enhancement feature map in the classification feature map.
8. The financial system-based digital employee intelligence system of claim 7, wherein the adaptive correction module is further configured to:
correcting the characteristic value of each position in the classification characteristic diagram according to the following formula to obtain a corrected classification characteristic diagram;
wherein the formula is:
[correction formula reproduced in the original publication only as an image]

wherein f(i,j) is the feature value of each position of the classification feature map F, μ and σ² are the mean and the variance of the set of feature values of all positions in F, S is the size of the classification feature map F, log denotes the logarithm base 2, α is a weighting hyperparameter, and f′(i,j) is the feature value of each position of the corrected classification feature map.
9. The financial system-based digital employee intelligence system of claim 8 wherein said consultation intent identification module is further configured to:
processing the corrected classification feature map using the classifier to generate a classification result according to the following formula:
O = softmax{(W_N, B_N) : ⋯ : (W_1, B_1) | Project(F)}

wherein O is the classification result, Project(F) represents projecting the corrected classification feature map into a vector, W_1 to W_N are the weight matrices of the fully connected layers of each layer, and B_1 to B_N are the bias matrices of the fully connected layers of each layer.
CN202211090442.0A 2022-09-07 2022-09-07 Digital employee intelligent system based on financial system Active CN115602165B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211090442.0A CN115602165B (en) 2022-09-07 2022-09-07 Digital employee intelligent system based on financial system


Publications (2)

Publication Number Publication Date
CN115602165A true CN115602165A (en) 2023-01-13
CN115602165B CN115602165B (en) 2023-05-05

Family

ID=84843343

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211090442.0A Active CN115602165B (en) 2022-09-07 2022-09-07 Digital employee intelligent system based on financial system

Country Status (1)

Country Link
CN (1) CN115602165B (en)


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110085218A (en) * 2019-03-26 2019-08-02 天津大学 A kind of audio scene recognition method based on feature pyramid network
CN110718234A (en) * 2019-09-02 2020-01-21 江苏师范大学 Acoustic scene classification method based on semantic segmentation coding and decoding network
CN111145726A (en) * 2019-10-31 2020-05-12 南京励智心理大数据产业研究院有限公司 Deep learning-based sound scene classification method, system, device and storage medium
CN111754988A (en) * 2020-06-23 2020-10-09 南京工程学院 Sound scene classification method based on attention mechanism and double-path depth residual error network
CN113808573A (en) * 2021-08-06 2021-12-17 华南理工大学 Dialect classification method and system based on mixed domain attention and time sequence self-attention
CN114333804A (en) * 2021-12-27 2022-04-12 北京达佳互联信息技术有限公司 Audio classification identification method and device, electronic equipment and storage medium
CN114420097A (en) * 2022-01-24 2022-04-29 腾讯科技(深圳)有限公司 Voice positioning method and device, computer readable medium and electronic equipment
CN114565041A (en) * 2022-02-28 2022-05-31 上海嘉甲茂技术有限公司 Payment big data analysis system based on internet finance and analysis method thereof
US11355122B1 (en) * 2021-02-24 2022-06-07 Conversenowai Using machine learning to correct the output of an automatic speech recognition system
CN114974219A (en) * 2022-05-30 2022-08-30 平安科技(深圳)有限公司 Speech recognition method, speech recognition device, electronic apparatus, and storage medium


Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116258504A (en) * 2023-03-16 2023-06-13 广州信瑞泰信息科技有限公司 Bank customer relationship management system and method thereof
CN116308754A (en) * 2023-03-22 2023-06-23 广州信瑞泰信息科技有限公司 Bank credit risk early warning system and method thereof
CN116308754B (en) * 2023-03-22 2024-02-13 广州信瑞泰信息科技有限公司 Bank credit risk early warning system and method thereof
CN117173294A (en) * 2023-11-03 2023-12-05 之江实验室科技控股有限公司 Method and system for automatically generating digital person
CN117173294B (en) * 2023-11-03 2024-02-13 之江实验室科技控股有限公司 Method and system for automatically generating digital person

Also Published As

Publication number Publication date
CN115602165B (en) 2023-05-05

Similar Documents

Publication Publication Date Title
CN110600017B (en) Training method of voice processing model, voice recognition method, system and device
JP7337953B2 (en) Speech recognition method and device, neural network training method and device, and computer program
WO2021208287A1 (en) Voice activity detection method and apparatus for emotion recognition, electronic device, and storage medium
WO2020232860A1 (en) Speech synthesis method and apparatus, and computer readable storage medium
US9818431B2 (en) Multi-speaker speech separation
CN115602165A (en) Digital staff intelligent system based on financial system
CN112199548A (en) Music audio classification method based on convolution cyclic neural network
CN112712813B (en) Voice processing method, device, equipment and storage medium
CN112071330B (en) Audio data processing method and device and computer readable storage medium
WO2023222088A1 (en) Voice recognition and classification method and apparatus
CN114203163A (en) Audio signal processing method and device
Huang et al. Novel sub-band spectral centroid weighted wavelet packet features with importance-weighted support vector machines for robust speech emotion recognition
CN114141237A (en) Speech recognition method, speech recognition device, computer equipment and storage medium
CN114399995A (en) Method, device and equipment for training voice model and computer readable storage medium
Sekkate et al. Speaker identification for OFDM-based aeronautical communication system
CN113823303A (en) Audio noise reduction method and device and computer readable storage medium
Rituerto-González et al. End-to-end recurrent denoising autoencoder embeddings for speaker identification
CN113782005B (en) Speech recognition method and device, storage medium and electronic equipment
Li RETRACTED ARTICLE: Speech-assisted intelligent software architecture based on deep game neural network
Konduru et al. Multidimensional feature diversity based speech signal acquisition
CN112951270A (en) Voice fluency detection method and device and electronic equipment
Zhou et al. MetaRL-SE: a few-shot speech enhancement method based on meta-reinforcement learning
WO2024055752A1 (en) Speech synthesis model training method, speech synthesis method, and related apparatuses
Samanta et al. An energy-efficient voice activity detector using reconfigurable Gaussian base normalization deep neural network
CN117649846B (en) Speech recognition model generation method, speech recognition method, device and medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant