CN115602165A

CN115602165A - Digital staff intelligent system based on financial system

Info

Publication number: CN115602165A
Application number: CN202211090442.0A
Authority: CN
Inventors: 黄术; 黄琪敏; 裘浩祺; 魏祥
Original assignee: Hangzhou Youhang Information Technology Co ltd
Current assignee: Hangzhou Youhang Information Technology Co ltd
Priority date: 2022-09-07
Filing date: 2022-09-07
Publication date: 2023-01-13
Anticipated expiration: 2042-09-07
Also published as: CN115602165B

Abstract

The application relates to the field of financial science and technology, and particularly discloses a financial system-based digital staff intelligent system which converts consultation intention understanding of digital staff to clients into a voice theme marking problem. Specifically, a plurality of spectrogram are extracted from the voice signal, and the spectrogram is coded and decoded by a deep neural network model to obtain a recognition result representing a predetermined intention topic label. Therefore, on the premise of more accurately understanding user consultation extraction, the user can be responded to the client more reasonably and adaptively, and the voice consultation experience of the user is improved.

Description

Digital staff intelligent system based on financial system

Technical Field

The application relates to the field of financial science and technology, and more specifically relates to a digital staff intelligent system based on financial system.

Background

With the development of computer technology, more and more technologies (such as big data or cloud computing) are applied in the financial field, and the traditional financial industry is gradually shifting to the financial technology. At present, the use of digital staff (e.g., voice robot) in the financial field is quite extensive, for example, the purposes of financial product promotion, collection of money and the like can be realized through the digital staff. Digital employee implementations have benefited from the development of speech recognition and natural speech understanding, among other related technologies.

However, in actual operation, the digital staff often complains of the customer, and the reason is that the digital staff does not accurately understand the intention of the customer, and answers questions and the like occur.

Therefore, an optimized financial system digital staff intelligence solution is desired.

Disclosure of Invention

The present application is proposed to solve the above-mentioned technical problems. The embodiment of the application provides a digital employee intelligent system based on a financial system, which converts consultation intention understanding of a digital employee to a client into a voice theme marking problem. Specifically, a plurality of spectrogram are extracted from the voice signal, and the spectrogram is coded and decoded by a deep neural network model to obtain a recognition result representing a predetermined intention topic label. Therefore, on the premise of understanding user consultation and extraction more accurately, the user can respond to the client more reasonably and adaptively so as to improve the voice consultation experience of the user.

According to one aspect of the application, there is provided a financial system based digital employee intelligence system comprising:

the voice signal acquisition module is used for acquiring a consultation voice signal of a client;

the noise reduction module is used for enabling the consultation voice signal to pass through the signal noise reduction module based on the automatic encoder so as to obtain a noise-reduced consultation voice signal;

a voice spectrogram extraction module for extracting a logarithmic Mel-map, a cochlear spectrogram and a constant Q transform spectrogram from the denoised consulted voice signal;

the multi-channel semantic spectrogram construction module is used for arranging the logarithmic Mel map, the cochlear spectrogram and the constant Q transformation spectrogram into a multi-channel voice spectrogram;

the double-flow coding module is used for enabling the multi-channel voice spectrogram to pass through a double-flow network model so as to obtain a classification characteristic graph;

the self-adaptive correction module is used for correcting the characteristic values of all positions in the classification characteristic diagram based on the statistical characteristics of the characteristic value sets of all the positions in the classification characteristic diagram to obtain a corrected classification characteristic diagram; and

and the consultation intention recognition module is used for enabling the corrected classification characteristic diagram to pass through a classifier to obtain a classification result, and the classification result is used for representing an intention subject label of the consultation voice signal.

In the above digital staff intelligent system based on financial system, the noise reduction module includes: a speech signal encoding unit for inputting the advisory speech signal into an encoder of the signal noise reduction module, wherein the encoder performs explicit spatial encoding on the advisory speech signal using a convolutional layer to obtain a speech feature; and the semantic feature decoding unit is used for inputting the semantic features into a decoder of the signal noise reduction module, wherein the decoder uses a deconvolution layer to perform deconvolution processing on the voice features to obtain the noise-reduced consultation voice signals.

In the above digital employee intelligent system based on the financial system, the double-stream encoding module includes: the first convolution coding unit is used for inputting the multichannel voice spectrogram into a first convolution neural network of the double-current network model using a space attention mechanism so as to obtain a space enhancement feature map; the second convolution coding unit is used for inputting the multichannel voice spectrogram into a second convolution neural network of the double-current network model using a channel attention mechanism to obtain a channel enhancement feature map; and the aggregation unit is used for fusing the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map.

In the above digital staff intelligent system based on financial system, the first volume coding unit includes: the deep convolutional coding subunit is used for inputting the multichannel voice spectrogram into a multilayer convolutional layer of the first convolutional neural network to obtain a first convolutional characteristic map; a spatial attention subunit, configured to input the first convolution feature map into a spatial attention module of the first convolution neural network to obtain a spatial attention map; and the attention applying subunit is used for calculating the spatial attention diagram and multiplying the spatial attention diagram by the position points of the first convolution feature diagram to obtain the spatial enhancement feature diagram.

In the above digital staff intelligent system based on a financial system, the spatial attention subunit is further configured to: performing convolutional encoding on the first convolutional feature map by using convolutional layers of the spatial attention module to obtain a spatial perceptual feature map; calculating a position-based multiplication between the spatial perception feature map and the first volume feature map to obtain a spatial attention score map; and inputting the spatial attention score map into a Sigmoid activation function to obtain the spatial attention map.

In the above digital staff intelligent system based on the financial system, the second convolutional encoding unit includes: the deep convolutional coding subunit is used for inputting the multichannel voice spectrogram into the multilayer convolutional layer of the second convolutional neural network to obtain a second convolutional characteristic diagram; a global mean pooling subunit, configured to calculate a global mean of each feature matrix of the second convolution feature map along the channel dimension to obtain a channel feature vector; the channel attention weight calculation subunit is used for inputting the channel feature vector into the Sigmoid activation function to obtain a channel attention weight vector; and the channel attention applying subunit is configured to take the eigenvalue of each position in the channel attention weight vector as a weight to respectively weight each eigen matrix of the second convolution eigen map along the channel dimension to obtain the channel enhanced eigen map.

In the above digital staff intelligent system based on financial system, the aggregation unit is further configured to: fusing the spatial enhancement feature map and the channel enhancement feature map using the aggregation unit to obtain the classification feature map with the following formula:

wherein the content of the first and second substances,

in order to be able to classify the feature map,

in order to reinforce the characteristic diagram in the space,

for strengthening the characteristic diagram of the channel "

"represents the addition of elements at the corresponding positions of the spatial enhancement feature map and the channel enhancement feature map,

is a weighting parameter for controlling a balance between the spatial enhanced feature map and the channel enhanced feature map in the classification feature map.

In the above digital staff intelligent system based on a financial system, the adaptive correction module is further configured to: correcting the characteristic value of each position in the classification characteristic diagram according to the following formula to obtain a corrected classification characteristic diagram; wherein the formula is:

wherein the content of the first and second substances,

is the classification feature map

The characteristic value of each of the positions of (a),

and

is a set of eigenvalues

Mean and variance of, and

is the classification feature map

The size of (a) is greater than (b),

is logarithmic with base 2, and

is a weight-over-parameter that is,

and classifying the feature values of all positions in the feature map after correction.

In the above digital employee intelligent system based on a financial system, the consultation intention identifying module is further configured to: processing the corrected classification feature map using the classifier to generate a classification result according to the following formula:

wherein

Representing projecting the corrected classification feature map as a vector,

to

Is a weight matrix of the fully connected layers of each layer,

to

A bias matrix representing the layers of the fully connected layer.

According to another aspect of the present application, there is also provided a digital employee intelligence method based on a financial system, comprising:

acquiring a consultation voice signal of a client;

the consultation voice signal passes through a signal noise reduction module based on an automatic encoder to obtain a noise-reduced consultation voice signal;

extracting a logarithmic Mel's map, a cochlear spectrogram and a constant Q-transformed spectrogram from the denoised consultative speech signal;

arranging the logarithmic Mel map, the cochlea spectrogram and the constant Q transformation spectrogram into a multi-channel voice spectrogram;

enabling the multi-channel voice spectrogram to pass through a double-flow network model to obtain a classification characteristic map;

correcting the characteristic values of all positions in the classification characteristic diagram based on the statistical characteristics of the characteristic value sets of all the positions in the classification characteristic diagram to obtain a corrected classification characteristic diagram; and

and passing the corrected classification feature map through a classifier to obtain a classification result, wherein the classification result is used for representing an intention topic label of the consultation voice signal.

In the above digital employee intelligent method based on a financial system, the passing the consulting voice signal through a signal noise reduction module based on an automatic encoder to obtain a noise-reduced consulting voice signal includes: inputting the advisory speech signal into an encoder of the signal noise reduction module, wherein the encoder explicitly spatially encodes the advisory speech signal using a convolutional layer to obtain speech features; and inputting the semantic features into a decoder of the signal noise reduction module, wherein the decoder performs deconvolution processing on the voice features by using a deconvolution layer to obtain the noise-reduced consultation voice signal.

In the above digital employee intelligent method based on a financial system, the obtaining a classification feature map by passing the multi-channel voice spectrogram through a double-flow network model includes: inputting the multi-channel voice spectrogram into a first convolution neural network of the double-current network model using a space attention mechanism to obtain a space enhancement feature map; inputting the multi-channel voice spectrogram into a second convolution neural network of the double-current network model using a channel attention mechanism to obtain a channel enhancement feature map; and fusing the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map.

In the above digital employee intelligence method based on a financial system, the inputting the multi-channel voice spectrogram into a first convolution neural network of the dual-flow network model using a spatial attention mechanism to obtain a spatial enhancement feature map includes: inputting the multichannel voice spectrogram into a multilayer convolution layer of the first convolution neural network to obtain a first convolution feature map; inputting the first convolution feature map into a spatial attention module of the first convolution neural network to obtain a spatial attention map; and calculating the spatial attention diagram and the position-by-position point multiplication of the first volume feature diagram to obtain the spatial enhancement feature diagram.

In the above digital employee intelligence method based on a financial system, the inputting the multi-channel voice spectrogram into a first convolution neural network of the dual-flow network model using a spatial attention mechanism to obtain a spatial enhancement feature map further includes: performing convolutional encoding on the first convolutional feature map by using convolutional layers of the spatial attention module to obtain a spatial perceptual feature map; calculating a position-based multiplication between the spatial perception feature map and the first volume feature map to obtain a spatial attention score map; and inputting the spatial attention score map into a Sigmoid activation function to obtain the spatial attention map.

In the above digital employee intelligence method based on a financial system, the inputting the multi-channel voice spectrogram into a second convolutional neural network of the dual-flow network model using a channel attention mechanism to obtain a channel enhancement feature map includes: inputting the multi-channel voice spectrogram into a multilayer convolution layer of the second convolution neural network to obtain a second convolution characteristic diagram; calculating the global mean value of each feature matrix of the second convolution feature map along the channel dimension to obtain a channel feature vector; inputting the channel feature vector into the Sigmoid activation function to obtain a channel attention weight vector; and weighting each feature matrix of the second convolution feature map along the channel dimension by taking the feature value of each position in the channel attention weight vector as a weight to obtain the channel reinforced feature map.

In the above digital employee intelligent method based on a financial system, the fusing the spatial enhanced feature map and the channel enhanced feature map to obtain the classification feature map further includes: fusing the spatial enhanced feature map and the channel enhanced feature map to obtain the classification feature map using the following formula:

wherein the content of the first and second substances,

in order to provide the said classification feature map,

in order to reinforce the characteristic diagram in the space,

a feature map is enhanced for the channel, 'A method for manufacturing a thin film transistor'

In the above digital employee intelligent method based on a financial system, the correcting the feature values of the respective positions in the classification feature map based on the statistical features of the feature value sets of all the positions in the classification feature map to obtain a corrected classification feature map further includes: correcting the characteristic value of each position in the classification characteristic diagram by the following formula to obtain a corrected classification characteristic diagram; wherein the formula is:

wherein, the first and the second end of the pipe are connected with each other,

is the classification feature map

The characteristic value of each of the positions of (a),

and

is a set of eigenvalues

Mean and variance of, and

is the classification feature map

The size of (a) is greater than (b),

is logarithmic with base 2, and

is a weight-over-parameter that is,

and classifying the feature value of each position in the feature map after correction.

In the above digital employee intelligent method based on a financial system, the step of passing the corrected classification feature map through a classifier to obtain a classification result further includes: processing the corrected classification feature map using the classifier to generate a classification result according to the following formula:

wherein

Representing projecting the corrected classification feature map as a vector,

to is that

Is a weight matrix of the fully connected layers of each layer,

to

A bias matrix representing the layers of the fully connected layer.

Compared with the prior art, the financial system-based digital employee intelligent system converts consultation intention understanding of digital employees to clients into voice theme marking problems. Specifically, a plurality of spectrogram are extracted from the voice signal, and the spectrogram is coded and decoded by a deep neural network model to obtain a recognition result representing a predetermined intention topic label. Therefore, on the premise of more accurately understanding user consultation extraction, the user can be responded to the client more reasonably and adaptively, and the voice consultation experience of the user is improved.

Drawings

The above and other objects, features and advantages of the present application will become more apparent by describing in more detail embodiments of the present application with reference to the attached drawings. The accompanying drawings are included to provide a further understanding of the embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the principles of the application. In the drawings, like reference numbers generally represent like parts or steps.

FIG. 1 illustrates a block diagram of a digital employee intelligence system based on a financial system in accordance with an embodiment of the present application.

FIG. 2 illustrates a system architecture diagram of a digital employee intelligence system based on a financial system in accordance with an embodiment of the present application.

FIG. 3 illustrates a block diagram of a dual stream encoding module in a digital employee intelligence system based on a financial system in accordance with an embodiment of the present application.

FIG. 4 illustrates a block diagram of a first convolution encoding unit in a digital employee intelligence system based on a financial system according to an embodiment of the present application.

FIG. 5 illustrates a block diagram of a second convolutional encoding unit in a digital employee intelligence system based on a financial system, according to an embodiment of the present application.

FIG. 6 illustrates a flow chart of a digital employee intelligence method based on a financial system in accordance with an embodiment of the application.

Detailed Description

Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be understood that the described embodiments are only some embodiments of the present application and not all embodiments of the present application, and that the present application is not limited by the example embodiments described herein.

Summary of the application

In the technical solution of the present application, the understanding of the consulting intention of the digital staff to the client can be converted into a voice topic labeling problem, that is, a voice signal is encoded and understood in a proper manner, and a predetermined intention topic label is not specified to the voice signal, which can be realized by a two-stage model of a feature extractor + a classifier.

However, when a client consults, different clients have different language expression habits, which brings great challenges to the understanding of speech semantics, and meanwhile, the user still has phenomena of ending, accent, intonation and the like when expressing problems, and because the user often expresses the consultation requirements through a communication system, a lot of noise is introduced in the collection and transmission processes of speech signals, and the technical problem causes that the precision of the understanding of consultation intentions is low.

Correspondingly, in the technical scheme of the application, after the consultation voice signal of the client is obtained, the consultation voice signal is firstly passed through a signal noise reduction module based on an automatic encoder to obtain the noise-reduced consultation voice signal. In particular, the auto-encoder based signal noise reduction module includes an encoder that uses convolutional layers and a decoder that uses anti-convolutional layers. Accordingly, the denoising process of the signal denoising module includes first explicitly spatially encoding the advisory speech signal with a convolutional layer using the encoder to extract speech features from the advisory speech signal (filtered in response to noise), and then deconvolving the speech features with a deconvolution layer using the decoder to obtain the denoised advisory speech signal.

In order to improve the accuracy of semantic understanding of the noise-reduced consulting voice signal, the noise-reduced consulting voice signal is converted into a spectrogram, the spectrogram is a perception graph formed by three parts of time, frequency and energy, is a visible language of the voice signal, can provide rich visual information, combines time domain analysis and frequency domain analysis, and can reflect the frequency content of the signal and the change rule of the frequency content along with the time.

Particularly, in the technical solution of the present application, in order to capture richer acoustic spectrum information, a log mellop diagram, a cochlear spectrogram, and a constant Q-transformed spectrogram are extracted from the denoised consulting speech signal, respectively. It will be appreciated that the log mel-frequency spectrum is the most widely used feature, which in the design process mimics the characteristics of the human ear, which have different acoustic sensitivities to sounds of different frequencies. The extraction flow of the logarithmic Mel-map is similar to MFCC, but linear transformation, namely discrete cosine transformation, of the last step is reduced, and after the step is removed, more high-order information and nonlinear information of the sound signal can be reserved. The cochlea spectrogram is obtained by a Gamma atom filter bank simulating the frequency selectivity of a human cochlea, and the frequency of the Gamma atom filter bank is more consistent with the auditory characteristics of human ears. The constant Q-transform map provides better frequency resolution for low frequencies and better time resolution for high frequencies, thereby better mimicking the behavior of the human auditory system.

Then, the Log-Melpe diagram, the cochlea spectrogram and the constant Q-transform spectrogram are arranged into a multi-channel voice spectrogram. The logarithm Mel-cepstrum, the cochlea spectrogram and the constant Q transformation spectrogram are arranged along the channel dimension to obtain a multi-channel semantic spectrogram, so that data input into the neural network model has a relatively larger width, on one hand, richer materials are provided for voice feature extraction of the neural network model, on the other hand, correlation exists among all the spectrograms in the multi-channel semantic spectrogram, and the accuracy and the richness of the voice feature extraction can be improved by utilizing internal correlation.

Specifically, in the technical solution of the present application, a dual-flow network model including a first convolutional neural network using a spatial attention mechanism and a second convolutional neural network using a channel attention mechanism is used to process the multi-channel speech spectrogram to obtain a classification feature map. Here, the first convolutional neural network using a spatial attention mechanism and the second convolutional neural network using a channel attention mechanism of the dual-flow network model respectively perform spatially enhanced display spatial coding and channel enhanced display spatial coding on the multi-channel speech spectrogram in a parallel structure, and aggregate a spatial enhancement feature map and a channel enhancement feature map obtained by coding the two to obtain the classification feature map.

For the classification feature map, when the dual-flow network structure fuses the spatial enhancement feature map output by the first convolutional neural network and the channel enhancement feature map output by the second convolutional neural network, it may be that an out-of-distribution feature value in the classification feature map is generated due to a non-alignment distribution between a spatial dimension distribution of the spatial enhancement feature map and a channel dimension distribution of the channel enhancement feature map, so as to affect a classification effect of the classification feature map.

Therefore, the information statistics normalization of the adaptive example is performed on the classification feature map, specifically:

is the classification feature map

The characteristic value of each of the positions of (b),

and

is a set of eigenvalues

Mean and variance of, and

is the classification feature map

The size of (a) is greater than (b),

is logarithmic with base 2, and

is a weight hyperparameter.

The information statistical normalization of the self-adaptive example takes the characteristic value set of the classification characteristic diagram as the self-adaptive example, uses intrinsic prior information of the statistical characteristics to carry out dynamic generation type information normalization on a single characteristic value, and simultaneously uses the normalization mode length information of the characteristic set as bias to be used as invariance description in a set distribution domain, thereby realizing the characteristic optimization of shielding the disturbance distribution of a special example as much as possible and improving the classification effect of the classification characteristic diagram. In this way, the accuracy of the understanding and recognition of the intent of the consulting voice signal is improved.

Based on this, this application has proposed a digital staff intelligence system based on financial system, it includes: the voice signal acquisition module is used for acquiring a consultation voice signal of a client; the noise reduction module is used for enabling the consultation voice signal to pass through the signal noise reduction module based on the automatic encoder so as to obtain a noise-reduced consultation voice signal; a voice spectrogram extraction module for extracting a logarithmic Mel's map, a cochlear spectrogram and a constant Q-transform spectrogram from the denoised consulted voice signal; the multi-channel semantic spectrogram construction module is used for arranging the logarithmic Mel map, the cochlear spectrogram and the constant Q transformation spectrogram into a multi-channel voice spectrogram; the double-flow coding module is used for enabling the multi-channel voice spectrogram to pass through a double-flow network model so as to obtain a classification characteristic graph; the self-adaptive correction module is used for correcting the characteristic values of all positions in the classification characteristic diagram based on the statistical characteristics of the characteristic value sets of all the positions in the classification characteristic diagram to obtain a corrected classification characteristic diagram; and the consultation intention recognition module is used for enabling the corrected classification characteristic diagram to pass through a classifier to obtain a classification result, and the classification result is used for representing an intention subject label of the consultation voice signal.

Having described the general principles of the present application, various non-limiting embodiments of the present application will now be described with reference to the accompanying drawings.

Exemplary System

FIG. 1 illustrates a block diagram of a digital employee intelligence system based on a financial system in accordance with an embodiment of the present application. As shown in fig. 1, a digital employee intelligence system 100 based on a financial system according to an embodiment of the present application includes: a voice signal collecting module 110 for obtaining a consultation voice signal of the client; a noise reduction module 120, configured to pass the consulting speech signal through a signal noise reduction module based on an automatic encoder to obtain a noise-reduced consulting speech signal; a voice spectrogram extracting module 130 for extracting a logarithmic melpp map, a cochlear spectrogram and a constant Q-transform spectrogram from the denoised counsel voice signal; a multi-channel semantic spectrogram constructing module 140, configured to arrange the log mel-frequency map, the cochlear spectrogram and the constant Q-transform spectrogram into a multi-channel speech spectrogram; the double-flow coding module 150 is configured to pass the multi-channel speech spectrogram through a double-flow network model to obtain a classification feature map; the adaptive correction module 160 is configured to correct the feature values at each position in the classification feature map based on statistical features of the feature value sets at all positions in the classification feature map to obtain a corrected classification feature map; and a consultation intention recognition module 170, configured to pass the corrected classification feature map through a classifier to obtain a classification result, where the classification result is used to represent an intention topic label of the consultation voice signal.

FIG. 2 illustrates a system architecture diagram of a financial system based digital employee intelligence system 100 in accordance with an embodiment of the present application. As shown in fig. 2, in the system architecture of the digital employee intelligence system based on a financial system 100, a consultation voice signal of a client is first acquired. And then, the consultation voice signal passes through a signal noise reduction module based on an automatic encoder to obtain a noise-reduced consultation voice signal. Then, a logarithmic mellop map, a cochlear spectrogram, and a constant Q-transformed spectrogram are extracted from the noise-reduced consulting voice signal. And arranging the logarithmic Mel map, the cochlea spectrogram and the constant Q transformation spectrogram into a multi-channel voice spectrogram. And then, passing the multichannel voice spectrogram through a double-flow network model to obtain a classification characteristic diagram. And then, based on the statistical features of the feature value sets of all the positions in the classification feature map, correcting the feature values of all the positions in the classification feature map to obtain a corrected classification feature map. And then, the corrected classification feature map is passed through a classifier to obtain a classification result, and the classification result is used for representing an intention topic label of the consultation voice signal.

In the above-mentioned financial system-based digital staff intelligent system 100, the voice signal collecting module 110 is used for obtaining a consultation voice signal of the client. In the technical solution of the present application, the understanding of the consulting intention of the digital staff to the client can be converted into a voice topic labeling problem, that is, a voice signal is encoded and understood in a proper manner, and a predetermined intention topic label is not specified to the voice signal, which can be realized by a two-stage model of a feature extractor + a classifier.

However, when a client consults, different clients have different language expression habits, which brings great challenges to the understanding of speech semantics, and meanwhile, the user still has phenomena of ending, accent, intonation and the like when expressing problems, and because the user often expresses the consultation requirements through a communication system, a lot of noise is introduced in the collection and transmission processes of speech signals, and the technical problem causes that the precision of the understanding of consultation intentions is low. Therefore, in the technical solution of the present application, a consultation voice signal of the client is first acquired.

In the above-mentioned financial system-based digital staff intelligent system 100, the noise reduction module 120 is configured to pass the consulting voice signal through an automatic encoder-based signal noise reduction module to obtain a noise-reduced consulting voice signal. That is, after obtaining the consulting voice signal of the client, the consulting voice signal passes through the signal noise reduction module based on the automatic encoder to obtain the noise-reduced consulting voice signal. In particular, the auto-encoder based signal noise reduction module includes an encoder that uses convolutional layers and a decoder that uses anti-convolutional layers. Accordingly, the denoising process of the signal denoising module includes first explicitly spatially encoding the advisory speech signal with a convolutional layer using the encoder to extract speech features from the advisory speech signal (filtered in response to noise), and then deconvolving the speech features with a deconvolution layer using the decoder to obtain the denoised advisory speech signal.

In one example, in the above-mentioned financial system based digital employee intelligence system 100, the noise reduction module 120 comprises: a speech signal encoding unit for inputting the advisory speech signal into an encoder of the signal noise reduction module, wherein the encoder performs explicit spatial encoding on the advisory speech signal using a convolutional layer to obtain a speech feature; and the semantic feature decoding unit is used for inputting the semantic features into a decoder of the signal noise reduction module, wherein the decoder uses a deconvolution layer to perform deconvolution processing on the voice features to obtain the noise-reduced consultation voice signals.

In the above-mentioned financial system-based digital staff intelligent system 100, the voice spectrogram extracting module 130 is configured to extract a log melpp chart, a cochlear spectrogram and a constant Q-transformed spectrogram from the denoised consulting voice signal. In order to improve the accuracy of semantic understanding of the noise-reduced consulting voice signal, the noise-reduced consulting voice signal is converted into a spectrogram, the spectrogram is a perception graph formed by three parts of time, frequency and energy, is a visible language of the voice signal, can provide rich visual information, combines time domain analysis and frequency domain analysis, and can reflect the frequency content of the signal and the change rule of the frequency content along with the time.

In the above digital staff intelligent system 100 based on a financial system, the multi-channel semantic spectrogram constructing module 140 is configured to arrange the log melpp chart, the cochlear spectrogram and the constant Q-transform spectrogram into a multi-channel voice spectrogram. Namely, the logarithmic Mel's map, the cochlear spectrogram and the constant Q-conversion spectrogram are arranged along the channel dimension to obtain a multi-channel semantic spectrogram, so that data input into the neural network model has a relatively larger width, on one hand, richer materials are provided for voice feature extraction of the neural network model, on the other hand, correlation exists among all spectrograms in the multi-channel semantic spectrogram, and the accuracy and richness of the voice feature extraction can be improved by utilizing internal correlation.

In the above digital staff intelligent system 100 based on the financial system, the double-flow encoding module 150 is configured to obtain the classification feature map by passing the multi-channel voice spectrogram through a double-flow network model. Specifically, in the technical solution of the present application, a dual-flow network model including a first convolutional neural network using a spatial attention mechanism and a second convolutional neural network using a channel attention mechanism is used to process the multi-channel speech spectrogram to obtain a classification feature map. Here, the first convolutional neural network using a spatial attention mechanism and the second convolutional neural network using a channel attention mechanism of the dual-flow network model respectively perform spatially enhanced display spatial coding and channel enhanced display spatial coding on the multi-channel speech spectrogram in a parallel structure, and aggregate a spatial enhancement feature map and a channel enhancement feature map obtained by coding the two to obtain the classification feature map.

FIG. 3 illustrates a block diagram of a dual stream encoding module in a digital employee intelligence system based on a financial system in accordance with an embodiment of the present application. As shown in fig. 3, in the above-mentioned digital employee intelligent system 100 based on a financial system, the dual-stream encoding module 150 includes: a first convolution encoding unit 151, configured to input the multichannel speech spectrogram into a first convolution neural network of the dual-flow network model using a spatial attention mechanism to obtain a spatial enhancement feature map; a second convolution coding unit 152, configured to input the multi-channel speech spectrogram into a second convolution neural network of the dual-flow network model, where the second convolution neural network uses a channel attention mechanism, so as to obtain a channel enhancement feature map; and an aggregation unit 153, configured to fuse the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map.

Fig. 4 illustrates a block diagram of a first convolution encoding unit in a digital employee intelligence system based on a financial system in accordance with an embodiment of the present application. As shown in fig. 4, in the above-mentioned digital staff intelligent system 100 based on financial system, the first volume coding unit 151 includes: a deep convolutional coding subunit 1511, configured to input the multichannel voice spectrogram into the multilayer convolutional layer of the first convolutional neural network to obtain a first convolutional feature map; a spatial attention subunit 1512, configured to input the first convolution feature map into a spatial attention module of the first convolution neural network to obtain a spatial attention map; and an attention applying subunit 1513, configured to calculate a product of the spatial attention map and the location-based point of the first convolution feature map to obtain the spatial enhancement feature map.

In one example, in the above-mentioned financial system-based digital employee intelligence system 100, the spatial attention subunit 1512 is further configured to: performing convolutional encoding on the first convolutional feature map by using convolutional layers of the spatial attention module to obtain a spatial perceptual feature map; calculating a position-based multiplication between the spatial perception feature map and the first volume feature map to obtain a spatial attention score map; and inputting the spatial attention score map into a Sigmoid activation function to obtain the spatial attention map.

FIG. 5 illustrates a block diagram of a second convolutional encoding unit in a digital employee intelligence system based on a financial system, according to an embodiment of the present application. As shown in fig. 5, in the above-mentioned digital staff intelligent system 100 based on a financial system, the second convolutional encoding unit 152 includes: a deep convolutional coding subunit 1521, configured to input the multichannel speech spectrogram into a multilayer convolutional layer of the second convolutional neural network to obtain a second convolutional feature map; a global mean Chi Huazi unit 1522 configured to calculate a global mean of each feature matrix of the second convolution feature map along a channel dimension to obtain a channel feature vector; a channel attention weight calculation subunit 1523, configured to input the channel feature vector into the Sigmoid activation function to obtain a channel attention weight vector; and a channel attention applying subunit 1524, configured to take the eigenvalue of each position in the channel attention weight vector as a weight to respectively weight each eigen matrix of the second convolution eigen map along the channel dimension, so as to obtain the channel enhanced eigen map.

In one example, in the above digital staff intelligent system 100 based on a financial system, the aggregation unit 153 is further configured to: fusing the spatial enhancement feature map and the channel enhancement feature map using the aggregation unit to obtain the classification feature map with the following formula:

in order to provide the said classification feature map,

in order to reinforce the characteristic diagram in the space,

for strengthening the characteristic diagram of the channel "

In the above-mentioned digital staff intelligent system 100 based on a financial system, the adaptive correction module 160 is configured to correct the feature values of each position in the classification feature map based on the statistical features of the feature value sets of all positions in the classification feature map to obtain a corrected classification feature map. For the classification feature map, when the dual-flow network structure fuses the spatial enhancement feature map output by the first convolutional neural network and the channel enhancement feature map output by the second convolutional neural network, it may be that an out-of-distribution feature value in the classification feature map is generated due to a misalignment between a spatial dimension distribution of the spatial enhancement feature map and a channel dimension distribution of the channel enhancement feature map, so as to affect a classification effect of the classification feature map. Therefore, the information statistics of the adaptive instance is normalized to the classification feature map.

In one example, in the above-mentioned financial system-based digital employee intelligence system 100, the adaptive correction module 160 is further configured to: correcting the characteristic value of each position in the classification characteristic diagram according to the following formula to obtain a corrected classification characteristic diagram; wherein the formula is:

wherein the content of the first and second substances,

is the classification feature map

The characteristic value of each of the positions of (a),

and

is a set of eigenvalues

Mean and variance of, and

is the classification feature map

The size of (a) is greater than (b),

is logarithmic with base 2, and

is a weight-over-parameter that is,

In the above-mentioned financial system-based digital employee intelligent system 100, the consultation intention identifying module 170 is configured to pass the corrected classification feature map through a classifier to obtain a classification result, where the classification result is used to represent an intention topic label of a consultation voice signal.

In one example, in the above-mentioned financial system-based digital employee intelligence system 100, the consulting intent identification module 170 is further configured to: processing the corrected classification feature map using the classifier to generate a classification result according to the following formula:

wherein

Representing projecting the corrected classification feature map as a vector,

to

Is a weight matrix of the fully connected layers of each layer,

to

A bias matrix representing the layers of the fully connected layer.

In summary, a financial system based digital employee intelligence system 100 is illustrated that translates digital employee consultation intent understanding of a client into voice topic tagging issues in accordance with embodiments of the present application. Specifically, a plurality of spectrogram are extracted from the voice signal, and the spectrogram is coded and decoded by a deep neural network model to obtain a recognition result representing a predetermined intention topic label. Therefore, on the premise of more accurately understanding user consultation extraction, the user can be responded to the client more reasonably and adaptively, and the voice consultation experience of the user is improved.

As described above, the digital employee intelligence system 100 based on a financial system according to the embodiment of the present application may be implemented in various terminal devices, such as a server having digital employee intelligence based on a financial system. In one example, the financial system based digital employee intelligence system 100 according to embodiments of the subject application may be integrated into a terminal device as a software module and/or a hardware module. For example, the financial system-based digital employee intelligence system 100 may be a software module in the operating system of the terminal device, or may be an application developed for the terminal device; of course, the financial system based digital staff intelligent system 100 may also be one of many hardware modules of the terminal device.

Alternatively, in another example, the financial system based digital employee intelligence system 100 and the terminal device may also be separate devices, and the financial system based digital employee intelligence system 100 may be connected to the terminal device through a wired and/or wireless network and transmit the interaction information in an agreed data format.

Exemplary method

According to another aspect of the application, a digital employee intelligent method based on a financial system is also provided. As shown in fig. 6, the digital employee intelligent method based on the financial system according to the embodiment of the present application includes the steps of: s110, acquiring a consultation voice signal of a client; s120, the consulting voice signal passes through a signal noise reduction module based on an automatic encoder to obtain a noise-reduced consulting voice signal; s130, extracting a logarithmic Mel map, a cochlear spectrogram and a constant Q transformation spectrogram from the denoised consulting voice signal; s140, arranging the logarithmic Mel map, the cochlear spectrogram and the constant Q transformation spectrogram into a multi-channel voice spectrogram; s150, enabling the multi-channel voice spectrogram to pass through a double-flow network model to obtain a classification characteristic map; s160, correcting the characteristic values of all positions in the classification characteristic diagram based on the statistical characteristics of the characteristic value sets of all the positions in the classification characteristic diagram to obtain a corrected classification characteristic diagram; and S170, passing the corrected classification characteristic diagram through a classifier to obtain a classification result, wherein the classification result is used for representing an intention theme label of the consultation voice signal.

In one example, in the above digital employee intelligence method based on a financial system, the passing the consulting voice signal through an automatic encoder based signal noise reduction module to obtain a noise-reduced consulting voice signal includes: inputting the advisory speech signal into an encoder of the signal noise reduction module, wherein the encoder explicitly spatially encodes the advisory speech signal using a convolutional layer to obtain speech features; and inputting the semantic features into a decoder of the signal noise reduction module, wherein the decoder performs deconvolution processing on the voice features by using a deconvolution layer to obtain the noise-reduced consultation voice signal.

In one example, in the above financial system-based digital employee intelligence method, the passing the multi-channel voice spectrogram through a dual-flow network model to obtain a classification feature map includes: inputting the multi-channel voice spectrogram into a first convolution neural network of the double-current network model using a space attention mechanism to obtain a space enhancement feature map; inputting the multi-channel voice spectrogram into a second convolution neural network of the double-current network model using a channel attention mechanism to obtain a channel enhancement feature map; and fusing the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map.

In one example, in the above digital employee intelligence method based on a financial system, the inputting the multi-channel voice spectrogram into a first convolution neural network of the dual-flow network model using a spatial attention mechanism to obtain a spatial enhancement feature map includes: inputting the multichannel voice spectrogram into a multilayer convolution layer of the first convolution neural network to obtain a first convolution feature map; inputting the first convolution feature map into a spatial attention module of the first convolution neural network to obtain a spatial attention map; and calculating the spatial attention diagram and the position-by-position point multiplication of the first volume feature diagram to obtain the spatial enhancement feature diagram.

In one example, in the above digital employee intelligence method based on a financial system, the inputting the multi-channel voice spectrogram into a first convolution neural network of the dual-flow network model using a spatial attention mechanism to obtain a spatial enhanced feature map further includes: performing convolutional coding on the first convolutional feature map by using the convolutional layer of the spatial attention module to obtain a spatial perceptual feature map; calculating a position-based multiplication between the spatial perception feature map and the first volume feature map to obtain a spatial attention score map; and inputting the spatial attention score map into a Sigmoid activation function to obtain the spatial attention map.

In one example, in the above financial system-based digital employee intelligence method, the inputting the multi-channel voice spectrogram into a second convolutional neural network of the dual-flow network model using a channel attention mechanism to obtain a channel enhancement feature map includes: inputting the multi-channel voice spectrogram into a multilayer convolution layer of the second convolution neural network to obtain a second convolution characteristic diagram; calculating the global mean value of each feature matrix of the second convolution feature map along the channel dimension to obtain a channel feature vector; inputting the channel feature vector into the Sigmoid activation function to obtain a channel attention weight vector; and weighting each feature matrix of the second convolution feature map along the channel dimension by taking the feature value of each position in the channel attention weight vector as a weight to obtain the channel reinforced feature map.

In one example, in the above digital employee intelligence method based on a financial system, the fusing the spatial enhanced feature map and the channel enhanced feature map to obtain the classification feature map further includes: fusing the spatial enhanced feature map and the channel enhanced feature map to obtain the classification feature map using the following formula:

wherein the content of the first and second substances,

in order to be able to classify the feature map,

in order to reinforce the characteristic diagram in the space,

for strengthening the characteristic diagram of the channel "

In one example, in the above digital employee intelligence method based on a financial system, the correcting the feature values of the respective locations in the classification feature map based on the statistical features of the feature value sets of all the locations in the classification feature map to obtain a corrected classification feature map further includes: correcting the characteristic value of each position in the classification characteristic diagram according to the following formula to obtain a corrected classification characteristic diagram; wherein the formula is:

wherein the content of the first and second substances,

is the classification feature map

The characteristic value of each of the positions of (a),

and

is a set of characteristic values

A mean and a variance of (2), and

is the classification feature map

The size of (a) is greater than (b),

is logarithmic with base 2, and

is a weight-over-parameter that is,

In one example, in the above digital employee intelligence method based on a financial system, the step of passing the corrected classification feature map through a classifier to obtain a classification result further includes: processing the corrected classification feature map using the classifier to generate a classification result according to the following formula:

wherein

Representing projecting the corrected classification feature map as a vector,

to

Is a weight matrix of the fully connected layers of each layer,

to

A bias matrix representing the layers of the fully connected layer.

In summary, the financial system-based digital employee intelligence method is illustrated according to an embodiment of the present application, which translates the digital employee's consultation intention understanding of the client into a voice topic tagging problem, i.e., a feature extractor is implemented to encode and understand a voice signal and assign a predetermined intention topic tag to the voice signal. Particularly, the voice signal is subjected to noise reduction processing by using a noise reduction module so as to improve the precision of consultation intention understanding. Thus, an optimized financial system digital staff intelligent scheme is constructed.

The foregoing describes the general principles of the present application in conjunction with specific embodiments, however, it is noted that the advantages, effects, etc. mentioned in the present application are merely examples and are not limiting, and they should not be considered essential to the various embodiments of the present application. Furthermore, the foregoing disclosure of specific details is provided for purposes of illustration and understanding only, and is not intended to limit the application to the details which are set forth in order to provide a thorough understanding of the present application.

The block diagrams of devices, apparatuses, systems referred to in this application are only given as illustrative examples and are not intended to require or imply that the connections, arrangements, configurations, etc. must be made in the manner shown in the block diagrams. These devices, apparatuses, devices, systems may be connected, arranged, configured in any manner, as will be appreciated by one skilled in the art. Words such as "including," "comprising," "having," and the like are open-ended words that mean "including, but not limited to," and are used interchangeably herein. The words "or" and "as used herein mean, and are used interchangeably with, the word" and/or, "unless the context clearly dictates otherwise. The word "such as" is used herein to mean, and is used interchangeably with, the phrase "such as but not limited to".

It should also be noted that in the devices, apparatuses, and methods of the present application, the components or steps may be decomposed and/or recombined. These decompositions and/or recombinations are to be considered as equivalents of the present application.

The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

The foregoing description has been presented for purposes of illustration and description. Furthermore, the description is not intended to limit embodiments of the application to the form disclosed herein. While a number of example aspects and embodiments have been discussed above, those of skill in the art will recognize certain variations, modifications, alterations, additions and sub-combinations thereof.

Claims

1. A digital staff intelligence system based on a financial system, comprising:

a voice spectrogram extraction module for extracting a logarithmic Mel's map, a cochlear spectrogram and a constant Q-transform spectrogram from the denoised consulted voice signal;

2. The financial system based digital employee intelligence system of claim 1 wherein said noise reduction module comprises:

a speech signal encoding unit for inputting the advisory speech signal into an encoder of the signal noise reduction module, wherein the encoder performs explicit spatial encoding on the advisory speech signal using a convolutional layer to obtain a speech feature; and

and the semantic feature decoding unit is used for inputting the semantic features into a decoder of the signal noise reduction module, wherein the decoder uses a deconvolution layer to perform deconvolution processing on the voice features so as to obtain the noise-reduced consultation voice signals.

3. The financial system based digital employee intelligence system of claim 2 wherein said dual stream encoding module comprises:

the first convolution coding unit is used for inputting the multichannel voice spectrogram into a first convolution neural network of the double-current network model using a space attention mechanism so as to obtain a space enhancement feature map;

the second convolution coding unit is used for inputting the multichannel voice spectrogram into a second convolution neural network of the double-current network model using a channel attention mechanism to obtain a channel enhancement feature map; and

and the aggregation unit is used for fusing the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map.

4. The financial system-based digital employee intelligence system of claim 3 wherein said first volume coding unit comprises:

the deep convolutional coding subunit is used for inputting the multichannel voice spectrogram into a multilayer convolutional layer of the first convolutional neural network to obtain a first convolutional characteristic diagram;

a spatial attention subunit, configured to input the first convolution feature map into a spatial attention module of the first convolution neural network to obtain a spatial attention map; and

an attention applying subunit, configured to calculate a dot-by-dot multiplication between the spatial attention map and the first convolution feature map to obtain the spatial enhancement feature map.

5. The financial system-based digital employee intelligence system of claim 4 wherein the spatial attention subunit is further configured to:

performing convolutional encoding on the first convolutional feature map by using convolutional layers of the spatial attention module to obtain a spatial perceptual feature map;

calculating a position-based multiplication between the spatial perception feature map and the first volume feature map to obtain a spatial attention score map; and

inputting the spatial attention score map into a Sigmoid activation function to obtain the spatial attention map.

6. The financial system-based digital staff intelligence system of claim 5 wherein said second convolutional encoding unit comprises:

the depth convolution coding subunit is used for inputting the multichannel voice spectrogram into a multilayer convolution layer of the second convolution neural network to obtain a second convolution characteristic map;

a global mean pooling subunit, configured to calculate a global mean of each feature matrix of the second convolution feature map along the channel dimension to obtain a channel feature vector;

the channel attention weight calculation subunit is used for inputting the channel feature vector into the Sigmoid activation function to obtain a channel attention weight vector; and

and the channel attention applying subunit is configured to take the eigenvalue of each position in the channel attention weight vector as a weight to respectively weight each eigen matrix of the second convolution eigen map along the channel dimension to obtain the channel enhanced eigen map.

7. The financial system-based digital employee intelligence system of claim 6, wherein the aggregation unit is further configured to:

fusing the spatial enhancement feature map and the channel enhancement feature map using the aggregation unit to obtain the classification feature map with the following formula:

wherein the content of the first and second substances,

in order to be able to classify the feature map,

in order to reinforce the characteristic diagram in the space,

for strengthening the characteristic diagram of the channel "

8. The financial system-based digital employee intelligence system of claim 7, wherein the adaptive correction module is further configured to:

correcting the characteristic value of each position in the classification characteristic diagram according to the following formula to obtain a corrected classification characteristic diagram;

wherein the formula is:

wherein the content of the first and second substances,

is the classification feature map

The characteristic value of each of the positions of (a),

and

is a set of eigenvalues

Mean and variance of, and

is the classification feature map

The size of (a) is greater than (b),

is logarithmic with base 2, and

is a weight-over-parameter that is,

9. The financial system-based digital employee intelligence system of claim 8 wherein said consultation intent identification module is further configured to:

processing the corrected classification feature map using the classifier to generate a classification result according to the following formula:

wherein

Representing the projection of the corrected classification feature map as a vector,

to

Is a weight matrix of the fully connected layers of each layer,

to

A bias matrix representing the layers of the fully connected layer.