CN115602165B - Digital employee intelligent system based on financial system - Google Patents


Info

Publication number
CN115602165B
CN115602165B
Authority
CN
China
Prior art keywords
feature map
channel
spectrogram
classification
convolution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211090442.0A
Other languages
Chinese (zh)
Other versions
CN115602165A (en)
Inventor
黄术
黄琪敏
裘浩祺
魏祥
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Youhang Information Technology Co ltd
Original Assignee
Hangzhou Youhang Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Youhang Information Technology Co ltd filed Critical Hangzhou Youhang Information Technology Co ltd
Priority to CN202211090442.0A
Publication of CN115602165A
Application granted
Publication of CN115602165B


Classifications

    • G10L15/22 Procedures used during a speech recognition process, e.g. man-machine dialogue
    • G06N3/08 Learning methods (computing arrangements based on biological models; neural networks)
    • G10L15/02 Feature extraction for speech recognition; Selection of recognition unit
    • G10L15/063 Training (creation of reference templates; training of speech recognition systems, e.g. adaptation to the characteristics of the speaker's voice)
    • G10L15/16 Speech classification or search using artificial neural networks
    • G10L15/1822 Parsing for meaning understanding (speech classification or search using natural language modelling)
    • G10L15/20 Speech recognition techniques specially adapted for robustness in adverse environments, e.g. in noise, of stress induced speech
    • G10L25/03 Speech or voice analysis techniques not restricted to a single one of groups G10L15/00-G10L21/00, characterised by the type of extracted parameters
    • G10L25/24 Speech or voice analysis techniques characterised by the type of extracted parameters, the extracted parameters being the cepstrum

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Human Computer Interaction (AREA)
  • Acoustics & Sound (AREA)
  • Multimedia (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)

Abstract

The application relates to the field of financial technology, and particularly discloses a digital employee intelligent system based on a financial system, which converts a digital employee's understanding of a customer's consultation intention into a voice topic labeling problem. Specifically, several kinds of spectrograms are extracted from the voice signal, and these spectrograms are encoded and decoded in a deep neural network model to obtain a recognition result representing a predetermined intention topic label. In this way, on the premise that the user's consultation intention can be understood more accurately, customers can be responded to more reasonably and adaptively, improving the user's voice consultation experience.

Description

Digital employee intelligent system based on financial system
Technical Field
The present application relates to the field of financial technology, and more particularly to a digital employee intelligence system based on a financial system.
Background
With the development of computer technology, more and more technologies (such as big data and cloud computing) are applied in the financial field, and the traditional financial industry is gradually transforming into financial technology. Currently, digital employees (e.g., voice robots) are widely used in the financial field, for example, for the promotion of financial products, debt collection and the like. The implementation of digital employees benefits from the development of related technologies such as speech recognition and natural language understanding.
In actual operation, however, customers often complain about digital employees, the reason being that the digital employee fails to accurately understand the customer's intention and gives answers that do not address the question.
Thus, an optimized digital employee intelligence scheme for a financial system is desired.
Disclosure of Invention
The present application has been made in order to solve the above technical problems. The embodiment of the application provides a digital employee intelligent system based on a financial system, which converts a digital employee's understanding of a customer's consultation intention into a voice topic labeling problem. Specifically, several kinds of spectrograms are extracted from the voice signal, and these spectrograms are encoded and decoded in a deep neural network model to obtain a recognition result representing a predetermined intention topic label. In this way, on the premise that the user's consultation intention can be understood more accurately, customers can be responded to more reasonably and adaptively, improving the user's voice consultation experience.
According to one aspect of the present application, there is provided a digital employee intelligence system based on a financial system, comprising:
the voice signal acquisition module is used for acquiring a consultation voice signal of a client;
the noise reduction module is used for enabling the consultation voice signal to pass through the signal noise reduction module based on the automatic encoder so as to obtain a noise-reduced consultation voice signal;
The voice spectrogram extraction module is used for extracting a logarithmic mel spectrogram, a cochlear spectrogram and a constant Q transformation spectrogram from the noise-reduced consultation voice signal;
the multi-channel semantic spectrogram construction module is used for arranging the logarithmic mel spectrogram, the cochlear spectrogram and the constant Q transformation spectrogram into a multi-channel voice spectrogram;
the double-flow coding module is used for enabling the multi-channel voice spectrogram to pass through a double-flow network model to obtain a classification characteristic diagram;
the self-adaptive correction module is used for correcting the characteristic values of all positions in the classification characteristic diagram based on the statistical characteristics of the characteristic value sets of all positions in the classification characteristic diagram to obtain a corrected classification characteristic diagram; and
and the consultation intention recognition module is used for passing the corrected classification characteristic diagram through a classifier to obtain a classification result, wherein the classification result is used for representing an intention theme label of the consultation voice signal.
In the above-mentioned digital employee intelligent system based on a financial system, the noise reduction module includes: a voice signal encoding unit, configured to input the consultation voice signal into an encoder of the signal noise reduction module, where the encoder uses convolution layers to perform explicit spatial encoding on the consultation voice signal to obtain voice features; and a voice feature decoding unit, configured to input the voice features into a decoder of the signal noise reduction module, where the decoder uses deconvolution layers to deconvolve the voice features to obtain the noise-reduced consultation voice signal.
In the above-mentioned digital employee intelligent system based on a financial system, the dual-stream encoding module includes: the first convolution coding unit is used for inputting the multichannel voice spectrogram into a first convolution neural network of the double-flow network model by using a spatial attention mechanism so as to obtain a spatial enhancement characteristic diagram; the second convolution coding unit is used for inputting the multi-channel voice spectrogram into a second convolution neural network of the double-flow network model, which uses a channel attention mechanism, so as to obtain a channel enhancement feature map; and an aggregation unit, configured to fuse the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map.
In the above-mentioned digital employee intelligent system based on a financial system, the first convolution encoding unit includes: the depth convolution coding subunit is used for inputting the multichannel voice spectrogram into a plurality of convolution layers of the first convolution neural network to obtain a first convolution characteristic map; a spatial attention subunit, configured to input the first convolution feature map into a spatial attention module of the first convolution neural network to obtain a spatial attention map; and an attention applying subunit for calculating a point-by-point multiplication of the spatial attention map and the first convolution feature map to obtain the spatial enhancement feature map.
In the above-described financial system-based digital employee intelligence system, the spatial attention subunit is further configured to: performing convolutional encoding on the first convolutional feature map by using a convolutional layer of the spatial attention module to obtain a spatial perception feature map; calculating the point-by-point multiplication between the spatial perception feature map and the first convolution feature map to obtain a spatial attention score map; and inputting the spatial attention score map into a Sigmoid activation function to obtain the spatial attention map.
In the above-mentioned financial system-based digital employee intelligence system, the second convolution encoding unit includes: a depth convolution encoding subunit, configured to input the multi-channel voice spectrogram into the multiple convolution layers of the second convolutional neural network to obtain a second convolution feature map; a global average pooling subunit, configured to calculate a global average of each feature matrix of the second convolution feature map along the channel dimension to obtain a channel feature vector; a channel attention weight calculation subunit, configured to input the channel feature vector into the Sigmoid activation function to obtain a channel attention weight vector; and a channel attention applying subunit, configured to weight each feature matrix of the second convolution feature map along the channel dimension with the feature value of each position in the channel attention weight vector as a weight, so as to obtain the channel enhancement feature map.
In the above-mentioned digital employee intelligent system based on a financial system, the aggregation unit is further configured to: fusing the spatial enhancement feature map and the channel enhancement feature map using the aggregation unit to obtain the classification feature map with the following formula:
F_s = α·F_1 + β·F_2

wherein F_s is the classification feature map, F_1 is the spatial enhancement feature map, F_2 is the channel enhancement feature map, "+" indicates that the elements at corresponding positions of the spatial enhancement feature map and the channel enhancement feature map are added, and α and β are weighting parameters for controlling the balance between the spatial enhancement feature map and the channel enhancement feature map in the classification feature map.
In the above-mentioned digital employee intelligent system based on a financial system, the adaptive correction module is further configured to: correcting the characteristic values of each position in the classification characteristic map by the following formula to obtain a corrected classification characteristic map; wherein, the formula is:
[Equation image GDA0004159487860000031: the correction formula itself is not recoverable from the text]

wherein f_{i,j,k} is the feature value at each position (i, j, k) of the classification feature map F, μ and σ are the mean and variance of the feature value set {f_{i,j,k} ∈ F}, W×H×C is the scale of the classification feature map F, log is the base-2 logarithm, α is a weight hyper-parameter, and f'_{i,j,k} are the feature values of each position in the corrected classification feature map.
In the above-mentioned digital employee intelligent system based on a financial system, the consultation intention recognition module is further configured to: processing the corrected classification feature map using the classifier to generate a classification result with the following formula:
O = softmax{(W_n, B_n) : ⋯ : (W_1, B_1) | Project(F)}

wherein Project(F) represents projecting the corrected classification feature map as a vector, W_1 to W_n are the weight matrices of each fully connected layer, and B_1 to B_n are the bias matrices of each fully connected layer.
According to another aspect of the present application, there is also provided a digital employee intelligence method based on a financial system, comprising:
acquiring a consultation voice signal of a client;
the consultation voice signal passes through a signal noise reduction module based on an automatic encoder to obtain a noise-reduced consultation voice signal;
extracting a logarithmic mel spectrogram, a cochlear spectrogram and a constant Q transformation spectrogram from the noise-reduced consultation voice signal;
arranging the logarithmic mel-spectrogram, the cochlear spectrogram and the constant Q transform spectrogram into a multi-channel speech spectrogram;
the multi-channel voice spectrogram passes through a double-flow network model to obtain a classification characteristic diagram;
Correcting the characteristic values of all positions in the classification characteristic map based on the statistical characteristics of the characteristic value sets of all positions in the classification characteristic map to obtain a corrected classification characteristic map; and
and passing the corrected classification characteristic diagram through a classifier to obtain a classification result, wherein the classification result is used for representing the intention topic label of the consultation voice signal.
In the above intelligent method for digital staff based on a financial system, the step of obtaining the noise-reduced consultation voice signal through the signal noise reduction module based on the automatic encoder includes: inputting the consultation voice signal into an encoder of the signal noise reduction module, wherein the encoder uses convolution layers to perform explicit spatial encoding on the consultation voice signal to obtain voice features; and inputting the voice features into a decoder of the signal noise reduction module, wherein the decoder uses deconvolution layers to deconvolve the voice features to obtain the noise-reduced consultation voice signal.
In the above-mentioned digital employee intelligent method based on financial system, the step of obtaining the classification feature map by passing the multi-channel voice spectrogram through a dual-flow network model includes: inputting the multichannel voice spectrogram into a first convolution neural network of the double-flow network model using a spatial attention mechanism to obtain a spatial enhancement feature map; inputting the multi-channel voice spectrogram into a second convolution neural network of the double-flow network model, which uses a channel attention mechanism, so as to obtain a channel enhancement feature map; and fusing the space enhancement feature map and the channel enhancement feature map to obtain the classification feature map.
In the above-mentioned digital employee intelligence method based on a financial system, the inputting the multi-channel speech spectrogram into the first convolutional neural network of the dual-flow network model using a spatial attention mechanism to obtain a spatial enhancement feature map includes: inputting the multichannel voice spectrogram into a plurality of convolution layers of the first convolution neural network to obtain a first convolution feature map; inputting the first convolution feature map into a spatial attention module of the first convolution neural network to obtain a spatial attention map; and calculating a per-position point multiplication of the spatial attention map and the first convolution feature map to obtain the spatial enhancement feature map.
In the above-mentioned digital employee intelligence method based on a financial system, the inputting the multi-channel speech spectrogram into the first convolutional neural network of the dual-flow network model using a spatial attention mechanism to obtain a spatial enhancement feature map further includes: performing convolutional encoding on the first convolutional feature map by using a convolutional layer of the spatial attention module to obtain a spatial perception feature map; calculating the point-by-point multiplication between the spatial perception feature map and the first convolution feature map to obtain a spatial attention score map; and inputting the spatial attention score map into a Sigmoid activation function to obtain the spatial attention map.
In the above-mentioned digital employee intelligence method based on a financial system, the inputting the multi-channel voice spectrogram into the second convolutional neural network of the dual-flow network model using a channel attention mechanism to obtain a channel enhancement feature map includes: inputting the multichannel voice spectrogram into a plurality of convolution layers of the second convolution neural network to obtain a second convolution feature map; calculating the global average value of each feature matrix of the second convolution feature diagram along the channel dimension to obtain a channel feature vector; inputting the channel feature vector into the Sigmoid activation function to obtain a channel attention weight vector; and weighting each feature matrix of the second convolution feature map along the channel dimension by taking the feature value of each position in the channel attention weight vector as a weight to obtain the channel enhancement feature map.
In the above-mentioned digital employee intelligence method based on a financial system, the fusing the space enhancement feature map and the channel enhancement feature map to obtain the classification feature map further includes: the spatial enhancement feature map and the channel enhancement feature map are fused to obtain the classification feature map using the following formula:
F_s = α·F_1 + β·F_2

wherein F_s is the classification feature map, F_1 is the spatial enhancement feature map, F_2 is the channel enhancement feature map, "+" indicates that the elements at corresponding positions of the spatial enhancement feature map and the channel enhancement feature map are added, and α and β are weighting parameters for controlling the balance between the spatial enhancement feature map and the channel enhancement feature map in the classification feature map.
In the above-mentioned intelligent method for digital staff based on a financial system, the correcting the feature values of each position in the classification feature map based on the statistical features of the feature value sets of all positions in the classification feature map to obtain a corrected classification feature map further includes: correcting the characteristic values of each position in the classification characteristic map by the following formula to obtain a corrected classification characteristic map; wherein, the formula is:
[Equation image GDA0004159487860000051: the correction formula itself is not recoverable from the text]

wherein f_{i,j,k} is the feature value at each position (i, j, k) of the classification feature map F, μ and σ are the mean and variance of the feature value set {f_{i,j,k} ∈ F}, W×H×C is the scale of the classification feature map F, log is the base-2 logarithm, α is a weight hyper-parameter, and f'_{i,j,k} are the feature values of each position in the corrected classification feature map.
In the above-mentioned intelligent method for digital staff based on financial system, the classifying feature map after correction is passed through a classifier to obtain a classifying result, further comprising: processing the corrected classification feature map using the classifier to generate a classification result with the following formula:
O = softmax{(W_n, B_n) : ⋯ : (W_1, B_1) | Project(F)}

wherein Project(F) represents projecting the corrected classification feature map as a vector, W_1 to W_n are the weight matrices of each fully connected layer, and B_1 to B_n are the bias matrices of each fully connected layer.
Compared with the prior art, the digital employee intelligent system based on the financial system converts a digital employee's understanding of a customer's consultation intention into a voice topic labeling problem. Specifically, several kinds of spectrograms are extracted from the voice signal, and these spectrograms are encoded and decoded in a deep neural network model to obtain a recognition result representing a predetermined intention topic label. In this way, on the premise that the user's consultation intention can be understood more accurately, customers can be responded to more reasonably and adaptively, improving the user's voice consultation experience.
Drawings
The foregoing and other objects, features and advantages of the present application will become more apparent from the following more detailed description of embodiments of the present application, as illustrated in the accompanying drawings. The accompanying drawings are included to provide a further understanding of embodiments of the application and are incorporated in and constitute a part of this specification; they illustrate the application and do not constitute a limitation of the application. In the drawings, like reference numerals generally refer to like parts or steps.
FIG. 1 illustrates a block diagram of a financial system-based digital employee intelligence system in accordance with an embodiment of the present application.
Fig. 2 illustrates a system architecture diagram of a financial system-based digital employee intelligence system in accordance with an embodiment of the present application.
FIG. 3 illustrates a block diagram of a dual stream encoding module in a financial system based digital employee intelligence system in accordance with an embodiment of the present application.
Fig. 4 illustrates a block diagram of a first convolutional encoding unit in a financial system-based digital employee intelligence system in accordance with an embodiment of the present application.
Fig. 5 illustrates a block diagram of a second convolutional encoding unit in a financial system-based digital employee intelligence system in accordance with an embodiment of the present application.
FIG. 6 illustrates a flow chart of a digital employee intelligence method based on a financial system in accordance with an embodiment of the present application.
Detailed Description
Hereinafter, example embodiments according to the present application will be described in detail with reference to the accompanying drawings. It should be apparent that the described embodiments are only some of the embodiments of the present application and not all of the embodiments of the present application, and it should be understood that the present application is not limited by the example embodiments described herein.
Summary of the application
In the technical solution of the present application, the digital employee's understanding of the client's consultation intention can be converted into a voice topic labeling problem, that is, the voice signal is encoded and understood in an appropriate manner and assigned a predetermined intention topic label, which can be achieved through a two-stage model of feature extractor + classifier.
However, clients have different language-expression habits during consultation, which poses great challenges to speech semantic understanding; meanwhile, users may exhibit stuttering, accents, unusual intonation and the like when expressing their questions, and because users express their consultation needs through a communication system, considerable noise can be introduced during the collection and transmission of the speech signal. These technical problems lead to low accuracy in consultation intention understanding.
Correspondingly, in the technical scheme of the application, after the consultation voice signal of the client is obtained, the consultation voice signal is first passed through a signal noise reduction module based on an automatic encoder to obtain the noise-reduced consultation voice signal. Specifically, the automatic encoder-based signal noise reduction module includes an encoder that uses convolution layers and a decoder that uses deconvolution layers. Accordingly, the noise reduction process of the signal noise reduction module includes first explicitly spatially encoding the consultation voice signal with the convolution layers of the encoder to extract voice features (with the noise filtered out) from the consultation voice signal, and then deconvolving the voice features with the deconvolution layers of the decoder to obtain the noise-reduced consultation voice signal.
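As an illustration, such a convolutional autoencoder denoiser could be sketched in PyTorch as follows. This is a minimal sketch only: the patent does not specify layer counts, kernel sizes, widths, or the sampling rate, so all of those are assumptions.

```python
import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    """Convolutional encoder / deconvolutional decoder over a 1-D speech waveform."""
    def __init__(self, channels: int = 16):
        super().__init__()
        # Encoder: convolution layers perform explicit spatial encoding of the signal.
        self.encoder = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=15, stride=2, padding=7),
            nn.ReLU(),
            nn.Conv1d(channels, channels * 2, kernel_size=15, stride=2, padding=7),
            nn.ReLU(),
        )
        # Decoder: deconvolution (transposed convolution) reconstructs the clean signal.
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(channels * 2, channels, kernel_size=15, stride=2,
                               padding=7, output_padding=1),
            nn.ReLU(),
            nn.ConvTranspose1d(channels, 1, kernel_size=15, stride=2,
                               padding=7, output_padding=1),
        )

    def forward(self, noisy: torch.Tensor) -> torch.Tensor:
        return self.decoder(self.encoder(noisy))

# Usage: a batch of 1-second, 16 kHz waveforms, shape (batch, 1, samples).
denoiser = DenoisingAutoencoder()
clean = denoiser(torch.randn(4, 1, 16000))  # -> (4, 1, 16000)
```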
In order to improve the accuracy of semantic understanding of the noise-reduced consultation voice signal, the noise-reduced consultation voice signal is converted into spectrograms. It should be understood that a spectrogram is a perceptual chart composed of time, frequency and energy: it is a visual language of the voice signal that provides rich visual information, combines time-domain analysis with frequency-domain analysis, and reflects both the frequency content of the signal and the way that content changes over time.
In particular, in the technical scheme of the application, in order to capture richer sound-spectrum information, a logarithmic mel spectrogram, a cochlear spectrogram and a constant-Q transform spectrogram are respectively extracted from the noise-reduced consultation voice signal. It will be appreciated that the log mel spectrogram is the most widely used feature; its design mimics the human ear, which has different acoustic sensitivities to sounds of different frequencies. The extraction flow of the log mel spectrogram is similar to that of MFCCs, but omits the final linear transformation, namely the discrete cosine transform; removing this last step preserves more high-order and nonlinear information of the sound signal. The cochlear spectrogram is obtained by simulating the frequency-selective Gammatone filter bank of the human cochlea, which better matches the auditory characteristics of the human ear. The constant-Q transform spectrogram provides better frequency resolution at low frequencies and better time resolution at high frequencies, thus better mimicking the behavior of the human auditory system.
Next, the log mel spectrogram, the cochlear spectrogram, and the constant-Q transform spectrogram are arranged into a multi-channel speech spectrogram. That is, the three spectrograms are arranged along the channel dimension to obtain a multi-channel speech spectrogram, so that the data input to the neural network model has a relatively larger width. On the one hand, this provides richer material for the voice feature extraction of the neural network model; on the other hand, correlations exist among the spectrograms in the multi-channel spectrogram, and exploiting this internal correlation can improve the accuracy and richness of voice feature extraction.
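A sketch of this extraction-and-stacking step using librosa follows. The patent names no libraries, so the choice of librosa, the sampling rate, and the bin count are assumptions; librosa has no built-in Gammatone/cochlear transform, so that channel is stubbed out with a placeholder and would in practice come from a dedicated Gammatone filterbank package.

```python
import numpy as np
import librosa

def multichannel_spectrogram(wav_path: str, sr: int = 16000, n_bins: int = 64) -> np.ndarray:
    y, sr = librosa.load(wav_path, sr=sr)

    # Channel 1: log-mel spectrogram (mel filterbank, then log amplitude).
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bins)
    log_mel = librosa.power_to_db(mel)

    # Channel 2: cochleagram via a Gammatone filterbank. librosa has no built-in
    # gammatone transform; a third-party package would be used here. As a
    # stand-in, the mel energies are reused (placeholder only).
    cochlea = librosa.power_to_db(mel)

    # Channel 3: constant-Q transform magnitude in dB.
    cqt = librosa.amplitude_to_db(np.abs(librosa.cqt(y=y, sr=sr, n_bins=n_bins)))

    # Crop to a common time length and stack along a new channel axis.
    t = min(log_mel.shape[1], cochlea.shape[1], cqt.shape[1])
    return np.stack([log_mel[:, :t], cochlea[:, :t], cqt[:, :t]], axis=0)  # (3, n_bins, t)
```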
Specifically, in the technical solution of the present application, the multi-channel speech spectrogram is processed with a dual-flow network model comprising a first convolutional neural network using a spatial attention mechanism and a second convolutional neural network using a channel attention mechanism to obtain a classification feature map. Here, the two networks of the dual-flow network model perform spatially-enhanced explicit spatial encoding and channel-enhanced explicit spatial encoding on the multi-channel speech spectrogram, respectively, in a parallel structure, and the spatial enhancement feature map and the channel enhancement feature map obtained from the two encodings are aggregated to obtain the classification feature map.
For the classification feature map, because the dual-flow network structure fuses the spatial enhancement feature map output by the first convolutional neural network and the channel enhancement feature map output by the second convolutional neural network, distribution misalignment between the spatial-dimension distribution of the former and the channel-dimension distribution of the latter may produce out-of-distribution feature values in the classification feature map, thereby affecting its classification effect.
Therefore, adaptive-instance information statistical normalization is carried out on the classification feature map, specifically:

[Equation image GDA0004159487860000081: the correction formula itself is not recoverable from the text]

wherein f_{i,j,k} is the feature value at each position (i, j, k) of the classification feature map F, μ and σ are the mean and variance of the feature value set {f_{i,j,k} ∈ F}, W×H×C is the scale of the classification feature map F, log is the base-2 logarithm, and α is a weight hyper-parameter.
This adaptive-instance information statistical normalization takes the feature value set of the classification feature map as the adaptive instance and uses the intrinsic prior information of its statistical features to normalize each individual feature value in a dynamically generated manner, with the normalized module length of the feature set serving as a bias, i.e. as an invariance description within the set's distribution domain. This shields feature optimization from the disturbance distribution of particular examples as much as possible and improves the classification effect of the classification feature map. In this way, the accuracy of intention understanding and recognition of the consultation voice signal is improved.
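Because the correction formula survives only as an image in the source, the sketch below is a speculative reconstruction, not the patent's actual expression. It only mirrors the ingredients stated in the text: per-value standardization by the set statistics μ and σ, a base-2 logarithm over the feature-map scale W×H×C, a weight hyper-parameter α, and the normalized module length of the feature set used as a bias; the way these parts are combined here is an assumption.

```python
import torch

def adaptive_instance_normalize(feat: torch.Tensor, alpha: float = 0.1) -> torch.Tensor:
    """Speculative reconstruction of the correction step (exact formula unrecoverable)."""
    mu = feat.mean()
    sigma = feat.std()
    z = (feat - mu) / (sigma + 1e-6)                       # standardize by set statistics
    scale = torch.log2(torch.tensor(float(feat.numel())))  # base-2 log of the scale W*H*C
    bias = torch.linalg.vector_norm(z) / feat.numel()      # normalized module length as bias
    return alpha * z * scale + bias                        # assumed combination of the parts

# Usage on a (W, H, C)-shaped classification feature map:
corrected = adaptive_instance_normalize(torch.randn(8, 8, 32))
```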
Based on this, the application proposes a digital employee intelligence system based on a financial system, comprising: the voice signal acquisition module is used for acquiring a consultation voice signal of a client; the noise reduction module is used for enabling the consultation voice signal to pass through the signal noise reduction module based on the automatic encoder so as to obtain a noise-reduced consultation voice signal; the voice spectrogram extraction module is used for extracting a logarithmic mel spectrogram, a cochlear spectrogram and a constant Q transformation spectrogram from the noise-reduced consultation voice signal; the multi-channel semantic spectrogram construction module is used for arranging the logarithmic mel spectrogram, the cochlear spectrogram and the constant Q transformation spectrogram into a multi-channel voice spectrogram; the double-flow coding module is used for enabling the multi-channel voice spectrogram to pass through a double-flow network model to obtain a classification characteristic diagram; the self-adaptive correction module is used for correcting the characteristic values of all positions in the classification characteristic diagram based on the statistical characteristics of the characteristic value sets of all positions in the classification characteristic diagram to obtain a corrected classification characteristic diagram; and the consultation intention recognition module is used for enabling the corrected classification characteristic diagram to pass through a classifier to obtain a classification result, wherein the classification result is used for representing an intention theme label of the consultation voice signal.
Having described the basic principles of the present application, various non-limiting embodiments of the present application will now be described in detail with reference to the accompanying drawings.
Exemplary System
FIG. 1 illustrates a block diagram of a financial system-based digital employee intelligence system in accordance with an embodiment of the present application. As shown in fig. 1, a digital employee intelligence system 100 based on a financial system according to an embodiment of the present application includes: the voice signal acquisition module 110 is used for acquiring a consultation voice signal of a client; the noise reduction module 120 is configured to pass the advisory voice signal through the signal noise reduction module based on the automatic encoder to obtain a noise-reduced advisory voice signal; a voice spectrogram extraction module 130, configured to extract a log mel spectrogram, a cochlear spectrogram and a constant Q transform spectrogram from the noise-reduced consultation voice signal; a multi-channel semantic spectrogram construction module 140 for arranging the log mel spectrogram, the cochlear spectrogram, and the constant Q transform spectrogram into a multi-channel speech spectrogram; the dual-stream encoding module 150 is configured to pass the multi-channel speech spectrogram through a dual-stream network model to obtain a classification feature map; an adaptive correction module 160, configured to correct the feature values of each position in the classification feature map based on the statistical features of the feature value sets of all positions in the classification feature map to obtain a corrected classification feature map; and a counseling intention recognition module 170 for passing the corrected classification characteristic map through a classifier to obtain a classification result, wherein the classification result is used for representing an intention topic label of the counseling voice signal.
Fig. 2 illustrates a system architecture diagram of a financial system based digital employee intelligence system 100 in accordance with an embodiment of the present application. As shown in fig. 2, in the system architecture of the digital employee intelligent system 100 based on the financial system, the consultation voice signal of the customer is first acquired. And then, the consultation voice signal passes through a signal noise reduction module based on an automatic encoder to obtain a noise-reduced consultation voice signal. Then, a logarithmic mel spectrogram, a cochlear spectrogram and a constant Q transform spectrogram are extracted from the noise reduced advisory voice signal. Further, the logarithmic mel-spectrum, the cochlear-spectrum, and the constant Q-transform spectrum are arranged into a multi-channel speech spectrum. And then, the multi-channel voice spectrogram passes through a double-flow network model to obtain a classification characteristic map. And correcting the characteristic values of all the positions in the classification characteristic map based on the statistical characteristics of the characteristic value sets of all the positions in the classification characteristic map to obtain a corrected classification characteristic map. And then, the corrected classification characteristic diagram is passed through a classifier to obtain a classification result, wherein the classification result is used for representing the intention topic label of the consultation voice signal.
In the above-mentioned digital employee intelligent system 100 based on a financial system, the voice signal acquisition module 110 is configured to acquire a consultation voice signal of a customer. In the technical solution of the present application, the digital employee's understanding of the client's consultation intention can be converted into a voice topic labeling problem, that is, the voice signal is encoded and understood in an appropriate manner and assigned a predetermined intention topic label, which can be achieved through a two-stage model of feature extractor + classifier.
However, clients have different language-expression habits during consultation, which poses great challenges to speech semantic understanding; meanwhile, users may exhibit stuttering, accents, unusual intonation and the like when expressing their questions, and because users express their consultation needs through a communication system, considerable noise can be introduced during the collection and transmission of the speech signal. These technical problems lead to low accuracy in consultation intention understanding. Therefore, in the technical scheme of the application, the consultation voice signal of the client is first obtained.
In the digital employee intelligent system 100 based on the financial system, the noise reduction module 120 is configured to pass the consultation voice signal through the signal noise reduction module based on the automatic encoder to obtain the noise-reduced consultation voice signal. That is, after the consultation voice signal of the customer is obtained, the consultation voice signal is passed through a signal noise reduction module based on an automatic encoder to obtain the noise-reduced consultation voice signal. Specifically, the automatic encoder-based signal noise reduction module includes an encoder that uses convolution layers and a decoder that uses deconvolution layers. Accordingly, the noise reduction process of the signal noise reduction module includes first explicitly spatially encoding the consultation voice signal with the convolution layers of the encoder to extract voice features (with the noise filtered out) from the consultation voice signal, and then deconvolving the voice features with the deconvolution layers of the decoder to obtain the noise-reduced consultation voice signal.
In one example, in the digital employee intelligence system 100 based on a financial system, the noise reduction module 120 includes: a voice signal encoding unit, configured to input the consultation voice signal into an encoder of the signal noise reduction module, where the encoder uses convolution layers to perform explicit spatial encoding on the consultation voice signal to obtain voice features; and a voice feature decoding unit, configured to input the voice features into a decoder of the signal noise reduction module, where the decoder uses deconvolution layers to deconvolve the voice features to obtain the noise-reduced consultation voice signal.
In the above-mentioned digital employee intelligence system 100 based on a financial system, the voice spectrogram extraction module 130 is configured to extract a log mel spectrogram, a cochlear spectrogram and a constant-Q transform spectrogram from the noise-reduced consultation voice signal. In order to improve the accuracy of semantic understanding of the noise-reduced consultation voice signal, it is converted into spectrograms. It should be understood that a spectrogram is a perceptual chart composed of time, frequency and energy: it is a visual language of the voice signal that provides rich visual information, combines time-domain analysis with frequency-domain analysis, and reflects both the frequency content of the signal and the way that content changes over time.
In particular, in the technical scheme of the application, in order to capture richer sound-spectrum information, a logarithmic mel spectrogram, a cochlear spectrogram and a constant-Q transform spectrogram are respectively extracted from the noise-reduced consultation voice signal. It will be appreciated that the log mel spectrogram is the most widely used feature; its design mimics the human ear, which has different acoustic sensitivities to sounds of different frequencies. The extraction flow of the log mel spectrogram is similar to that of MFCCs, but omits the final linear transformation, namely the discrete cosine transform; removing this last step preserves more high-order and nonlinear information of the sound signal. The cochlear spectrogram is obtained by simulating the frequency-selective Gammatone filter bank of the human cochlea, which better matches the auditory characteristics of the human ear. The constant-Q transform spectrogram provides better frequency resolution at low frequencies and better time resolution at high frequencies, thus better mimicking the behavior of the human auditory system.
In the above-mentioned financial system-based digital employee intelligence system 100, the multi-channel semantic spectrogram construction module 140 is configured to arrange the log mel spectrogram, the cochlear spectrogram, and the constant-Q transform spectrogram into a multi-channel speech spectrogram. That is, the three spectrograms are arranged along the channel dimension to obtain a multi-channel speech spectrogram, so that the data input to the neural network model has a relatively larger width. On the one hand, this provides richer material for the voice feature extraction of the neural network model; on the other hand, correlations exist among the spectrograms in the multi-channel spectrogram, and exploiting this internal correlation can improve the accuracy and richness of voice feature extraction.
In the above-mentioned digital employee intelligent system 100 based on a financial system, the dual-stream encoding module 150 is configured to pass the multi-channel speech spectrogram through a dual-flow network model to obtain a classification feature map. Specifically, in the technical solution of the present application, the multi-channel speech spectrogram is processed with a dual-flow network model comprising a first convolutional neural network using a spatial attention mechanism and a second convolutional neural network using a channel attention mechanism. Here, the two networks of the dual-flow network model perform spatially-enhanced explicit spatial encoding and channel-enhanced explicit spatial encoding on the multi-channel speech spectrogram, respectively, in a parallel structure, and the spatial enhancement feature map and the channel enhancement feature map obtained from the two encodings are aggregated to obtain the classification feature map.
FIG. 3 illustrates a block diagram of a dual stream encoding module in a financial system based digital employee intelligence system in accordance with an embodiment of the present application. As shown in fig. 3, in the above-mentioned digital employee intelligent system 100 based on a financial system, the dual stream encoding module 150 includes: a first convolutional encoding unit 151, configured to input the multi-channel speech spectrogram into a first convolutional neural network of the dual-stream network model using a spatial attention mechanism to obtain a spatial enhancement feature map; a second convolutional encoding unit 152, configured to input the multi-channel speech spectrogram into a second convolutional neural network of the dual-stream network model using a channel attention mechanism to obtain a channel enhancement feature map; and an aggregation unit 153, configured to fuse the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map.
Fig. 4 illustrates a block diagram of a first convolutional encoding unit in a financial system-based digital employee intelligence system in accordance with an embodiment of the present application. As shown in fig. 4, in the above-mentioned financial system-based digital employee intelligent system 100, the first convolution encoding unit 151 includes: a deep convolution coding subunit 1511, configured to input the multi-channel speech spectrogram into a multi-layer convolution layer of the first convolution neural network to obtain a first convolution feature map; a spatial attention subunit 1512 for inputting the first convolution feature map into a spatial attention module of the first convolution neural network to obtain a spatial attention map; and an attention applying subunit 1513 configured to calculate a point-by-point multiplication of the spatial attention map and the first convolution feature map to obtain the spatial enhancement feature map.
In one example, in the above-described financial system-based digital employee intelligence system 100, the spatial attention subunit 1512 is further configured to: perform convolutional encoding on the first convolution feature map using a convolution layer of the spatial attention module to obtain a spatial perception feature map; calculate the position-wise point multiplication between the spatial perception feature map and the first convolution feature map to obtain a spatial attention score map; and input the spatial attention score map into a Sigmoid activation function to obtain the spatial attention map.
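A minimal PyTorch sketch of this spatial-attention branch is given below; the backbone depth, channel width, and kernel sizes are illustrative assumptions, as the patent does not specify them.

```python
import torch
import torch.nn as nn

class SpatialAttentionBranch(nn.Module):
    """First stream: CNN backbone followed by a spatial attention module."""
    def __init__(self, in_channels: int = 3, width: int = 32):
        super().__init__()
        # Backbone: several convolution layers produce the first convolution feature map.
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
        )
        # Spatial attention: one convolution produces the spatial perception feature map.
        self.spatial_conv = nn.Conv2d(width, width, 3, padding=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(x)               # first convolution feature map
        perception = self.spatial_conv(feat)  # spatial perception feature map
        score = perception * feat             # position-wise product -> spatial attention score map
        attn = torch.sigmoid(score)           # Sigmoid -> spatial attention map
        return attn * feat                    # apply attention -> spatial enhancement feature map
```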
Fig. 5 illustrates a block diagram of a second convolutional encoding unit in a financial system-based digital employee intelligence system in accordance with an embodiment of the present application. As shown in fig. 5, in the above-mentioned digital employee intelligent system 100 based on a financial system, the second convolution encoding unit 152 includes: a deep convolution encoding subunit 1521, configured to input the multi-channel speech spectrogram into the multiple convolution layers of the second convolutional neural network to obtain a second convolution feature map; a global average pooling subunit 1522, configured to calculate the global average of each feature matrix of the second convolution feature map along the channel dimension to obtain a channel feature vector; a channel attention weight calculation subunit 1523, configured to input the channel feature vector into the Sigmoid activation function to obtain a channel attention weight vector; and a channel attention applying subunit 1524, configured to weight each feature matrix of the second convolution feature map along the channel dimension with the feature value of each position in the channel attention weight vector as a weight, so as to obtain the channel enhancement feature map.
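The channel-attention branch can be sketched analogously (again, backbone depth and width are assumptions); each channel's feature matrix is pooled to a scalar, squashed through a Sigmoid, and used as that channel's weight.

```python
import torch
import torch.nn as nn

class ChannelAttentionBranch(nn.Module):
    """Second stream: CNN backbone followed by channel attention."""
    def __init__(self, in_channels: int = 3, width: int = 32):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(in_channels, width, 3, padding=1), nn.ReLU(),
            nn.Conv2d(width, width, 3, padding=1), nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        feat = self.backbone(x)                  # second convolution feature map
        pooled = feat.mean(dim=(2, 3))           # global average of each channel's feature matrix
        weights = torch.sigmoid(pooled)          # channel attention weight vector
        return feat * weights[:, :, None, None]  # weight each channel's feature matrix
```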
In one example, in the above-mentioned digital employee intelligent system 100 based on a financial system, the aggregation unit 153 is further configured to: fusing the spatial enhancement feature map and the channel enhancement feature map using the aggregation unit to obtain the classification feature map with the following formula:
F_s = α·F_1 + β·F_2

wherein F_s is the classification feature map, F_1 is the spatial enhancement feature map, F_2 is the channel enhancement feature map, "+" indicates that the elements at corresponding positions of the spatial enhancement feature map and the channel enhancement feature map are added, and α and β are weighting parameters for controlling the balance between the spatial enhancement feature map and the channel enhancement feature map in the classification feature map.
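Putting the pieces together, the dual-flow encoder reduces to the two attention branches run in parallel and fused by this weighted sum. The sketch below reuses the hypothetical SpatialAttentionBranch and ChannelAttentionBranch classes sketched above; fixed α = β = 0.5 is an assumption (the patent leaves them as tunable weighting parameters).

```python
import torch
import torch.nn as nn

class DualStreamEncoder(nn.Module):
    """Fuses the two attention branches via F_s = alpha * F_1 + beta * F_2."""
    def __init__(self, alpha: float = 0.5, beta: float = 0.5):
        super().__init__()
        self.spatial_branch = SpatialAttentionBranch()  # sketched above
        self.channel_branch = ChannelAttentionBranch()  # sketched above
        self.alpha, self.beta = alpha, beta

    def forward(self, spectrogram: torch.Tensor) -> torch.Tensor:
        f1 = self.spatial_branch(spectrogram)     # spatial enhancement feature map F_1
        f2 = self.channel_branch(spectrogram)     # channel enhancement feature map F_2
        return self.alpha * f1 + self.beta * f2   # classification feature map F_s

# Usage: a batch of 3-channel speech spectrograms, e.g. (batch, 3, 64, frames).
encoder = DualStreamEncoder()
fs = encoder(torch.randn(4, 3, 64, 100))  # -> (4, 32, 64, 100)
```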
In the above-mentioned digital employee intelligent system 100 based on a financial system, the adaptive correction module 160 is configured to correct the feature values of each position in the classification feature map based on the statistical features of the feature value set of all positions in the classification feature map to obtain a corrected classification feature map. For the classification feature map, because the dual-flow network structure fuses the spatial enhancement feature map output by the first convolutional neural network and the channel enhancement feature map output by the second convolutional neural network, distribution misalignment between the spatial-dimension distribution of the former and the channel-dimension distribution of the latter may produce out-of-distribution feature values in the classification feature map, thereby affecting its classification effect. Thus, the classification feature map is subjected to adaptive-instance information statistical normalization.
In one example, in the above-described financial system-based digital employee intelligence system 100, the adaptive correction module 160 is further configured to: correcting the characteristic values of each position in the classification characteristic map by the following formula to obtain a corrected classification characteristic map; wherein, the formula is:
[Equation image GDA0004159487860000131: the correction formula itself is not recoverable from the text]

wherein f_{i,j,k} is the feature value at each position (i, j, k) of the classification feature map F, μ and σ are the mean and variance of the feature value set {f_{i,j,k} ∈ F}, W×H×C is the scale of the classification feature map F, log is the base-2 logarithm, α is a weight hyper-parameter, and f'_{i,j,k} are the feature values of each position in the corrected classification feature map.
This adaptive-instance information statistical normalization takes the feature value set of the classification feature map as the adaptive instance and uses the intrinsic prior information of its statistical features to normalize each individual feature value in a dynamically generated manner, with the normalized module length of the feature set serving as a bias, i.e. as an invariance description within the set's distribution domain. This shields feature optimization from the disturbance distribution of particular examples as much as possible and improves the classification effect of the classification feature map. In this way, the accuracy of intention understanding and recognition of the consultation voice signal is improved.
In the above-mentioned digital employee intelligence system 100 based on a financial system, the consultation intention recognition module 170 is configured to pass the corrected classification feature map through a classifier to obtain a classification result, where the classification result is used to represent an intention topic label of the consultation voice signal.
In one example, in the above-described financial system-based digital employee intelligence system 100, the consultation intent identification module 170 is further configured to: processing the corrected classification feature map using the classifier to generate a classification result with the following formula:
O = softmax{(W_n, B_n) : ⋯ : (W_1, B_1) | Project(F)}

wherein Project(F) represents projecting the corrected classification feature map as a vector, W_1 to W_n are the weight matrices of each fully connected layer, and B_1 to B_n are the bias matrices of each fully connected layer.
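A hedged sketch of this classifier head follows: the corrected feature map is projected (flattened) to a vector, passed through stacked fully connected layers, and normalized with softmax. The number of layers, hidden width, input dimension, and count of intention topic labels are all assumptions, as the patent leaves them unspecified.

```python
import torch
import torch.nn as nn

class IntentClassifier(nn.Module):
    """Project(F) -> stacked fully connected layers (W_i, B_i) -> softmax over topic labels."""
    def __init__(self, feat_dim: int = 32 * 64 * 100, hidden: int = 256, n_labels: int = 10):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Flatten(),                 # Project(F): feature map -> vector
            nn.Linear(feat_dim, hidden),  # (W_1, B_1)
            nn.ReLU(),
            nn.Linear(hidden, n_labels),  # (W_n, B_n)
        )

    def forward(self, corrected_map: torch.Tensor) -> torch.Tensor:
        return torch.softmax(self.fc(corrected_map), dim=-1)  # probability per intention topic
```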
In summary, the financial system-based digital employee intelligence system 100 according to embodiments of the present application has been illustrated; it converts a digital employee's understanding of a customer's consultation intention into a voice topic labeling problem. Specifically, several kinds of spectrograms are extracted from the voice signal, and these spectrograms are encoded and decoded in a deep neural network model to obtain a recognition result representing a predetermined intention topic label. In this way, on the premise that the user's consultation intention can be understood more accurately, customers can be responded to more reasonably and adaptively, improving the user's voice consultation experience.
As described above, the digital employee intelligence system 100 based on a financial system according to the embodiment of the present application may be implemented in various terminal devices, such as a server deploying the financial-system-based digital employee intelligence. In one example, the financial system-based digital employee intelligence system 100 according to embodiments of the present application may be integrated into the terminal device as a software module and/or hardware module. For example, the financial system based digital employee intelligence system 100 may be a software module in the operating system of the terminal device, or may be an application developed for the terminal device; of course, the financial system based digital employee intelligence system 100 could equally be one of the many hardware modules of the terminal device.
Alternatively, in another example, the financial system-based digital employee intelligence system 100 and the terminal device may be separate devices, in which case the financial system-based digital employee intelligence system 100 may be connected to the terminal device via a wired and/or wireless network and exchange interactive information in an agreed data format.
Exemplary method
According to another aspect of the application, a digital employee intelligence method based on a financial system is also provided. As shown in fig. 6, the digital employee intelligence method based on the financial system according to the embodiment of the application includes the steps of: S110, acquiring a consultation voice signal of a client; S120, passing the consultation voice signal through a signal noise reduction module based on an automatic encoder to obtain a noise-reduced consultation voice signal; S130, extracting a logarithmic Mel spectrogram, a cochlear spectrogram and a constant Q transformation spectrogram from the noise-reduced consultation voice signal; S140, arranging the logarithmic Mel spectrogram, the cochlear spectrogram and the constant Q transformation spectrogram into a multi-channel voice spectrogram; S150, passing the multi-channel voice spectrogram through a dual-flow network model to obtain a classification feature map; S160, correcting the feature values of each position in the classification feature map based on the statistical characteristics of the feature value set of all positions in the classification feature map to obtain a corrected classification feature map; and S170, passing the corrected classification feature map through a classifier to obtain a classification result, wherein the classification result is used for representing an intent topic label of the consultation voice signal.
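As a concrete point of reference for steps S130 and S140, the following minimal Python sketch extracts the three spectrograms and stacks them into one multi-channel array. It assumes librosa is available; because librosa ships no gammatone filterbank, an HTK-scaled mel spectrogram stands in for the cochlear spectrogram purely as a placeholder, and the sampling rate, bin count, and frame count are illustrative assumptions rather than values from the text:

import numpy as np
import librosa

def multichannel_spectrogram(path: str, sr: int = 16000, n_bins: int = 64, n_frames: int = 256):
    y, sr = librosa.load(path, sr=sr)
    # channel 1: logarithmic mel spectrogram
    log_mel = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bins))
    # channel 2: cochlear spectrogram stand-in (HTK mel scale; a gammatone filterbank
    # from an external package would be the faithful choice)
    coch = librosa.power_to_db(librosa.feature.melspectrogram(y=y, sr=sr, n_mels=n_bins, htk=True))
    # channel 3: constant Q transform magnitude spectrogram
    cqt = np.abs(librosa.cqt(y, sr=sr, n_bins=n_bins))

    def fit(S):  # crop/pad each map to a common (n_bins, n_frames) grid
        S = S[:n_bins, :n_frames]
        return np.pad(S, ((0, n_bins - S.shape[0]), (0, n_frames - S.shape[1])))

    return np.stack([fit(log_mel), fit(coch), fit(cqt)], axis=0)  # shape: (3, n_bins, n_frames)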
In one example, in the above-mentioned financial system-based digital employee intelligence method, passing the consultation voice signal through the automatic encoder-based signal noise reduction module to obtain a noise-reduced consultation voice signal includes: inputting the consultation voice signal into an encoder of the signal noise reduction module, wherein the encoder uses convolution layers to perform explicit spatial encoding on the consultation voice signal to obtain voice features; and inputting the voice features into a decoder of the signal noise reduction module, wherein the decoder uses deconvolution layers to deconvolve the voice features to obtain the noise-reduced consultation voice signal.
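A minimal PyTorch sketch of such a denoising autoencoder is given below; the two-layer depth, channel counts, and kernel sizes are illustrative assumptions, as the text only prescribes convolutional encoding and deconvolutional decoding of the waveform:

import torch
import torch.nn as nn

class DenoisingAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        # encoder: explicit spatial encoding of the raw waveform with convolution layers
        self.encoder = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=15, stride=2, padding=7), nn.ReLU(),
            nn.Conv1d(16, 32, kernel_size=15, stride=2, padding=7), nn.ReLU(),
        )
        # decoder: deconvolution layers restore a noise-reduced waveform of the same length
        self.decoder = nn.Sequential(
            nn.ConvTranspose1d(32, 16, kernel_size=15, stride=2, padding=7, output_padding=1), nn.ReLU(),
            nn.ConvTranspose1d(16, 1, kernel_size=15, stride=2, padding=7, output_padding=1),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, 1, samples), a consultation voice signal
        return self.decoder(self.encoder(x))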
In one example, in the above-mentioned digital employee intelligence method based on a financial system, the step of passing the multi-channel voice spectrogram through a dual-flow network model to obtain a classification feature map includes: inputting the multi-channel voice spectrogram into a first convolution neural network of the dual-flow network model that uses a spatial attention mechanism to obtain a spatial enhancement feature map; inputting the multi-channel voice spectrogram into a second convolution neural network of the dual-flow network model that uses a channel attention mechanism to obtain a channel enhancement feature map; and fusing the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map.
In one example, in the above-mentioned digital employee intelligence method based on a financial system, the inputting the multi-channel speech spectrogram into the first convolutional neural network of the dual-flow network model using a spatial attention mechanism to obtain a spatial enhancement feature map includes: inputting the multichannel voice spectrogram into a plurality of convolution layers of the first convolution neural network to obtain a first convolution feature map; inputting the first convolution feature map into a spatial attention module of the first convolution neural network to obtain a spatial attention map; and calculating a per-position point multiplication of the spatial attention map and the first convolution feature map to obtain the spatial enhancement feature map.
In one example, in the above-mentioned digital employee intelligence method based on a financial system, the inputting the multi-channel speech spectrogram into the first convolutional neural network of the dual-flow network model using a spatial attention mechanism to obtain a spatial enhancement feature map further includes: performing convolutional encoding on the first convolutional feature map by using a convolutional layer of the spatial attention module to obtain a spatial perception feature map; calculating the point-by-point multiplication between the spatial perception feature map and the first convolution feature map to obtain a spatial attention score map; and inputting the spatial attention score map into a Sigmoid activation function to obtain the spatial attention map.
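These sub-steps of the spatial attention stream map directly onto a small PyTorch module. The sketch below is one possible reading; the backbone depth and channel width are assumptions:

import torch
import torch.nn as nn

class SpatialAttentionStream(nn.Module):
    def __init__(self, in_ch: int = 3, feat_ch: int = 64):
        super().__init__()
        # "a plurality of convolution layers" of the first convolutional neural network
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )
        self.spatial_conv = nn.Conv2d(feat_ch, feat_ch, 3, padding=1)  # spatial attention module

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f1 = self.backbone(x)             # first convolution feature map
        percept = self.spatial_conv(f1)   # convolutional encoding -> spatial perception feature map
        score = percept * f1              # point-by-point multiplication -> spatial attention score map
        attn = torch.sigmoid(score)       # Sigmoid activation -> spatial attention map
        return attn * f1                  # per-position point multiplication -> spatial enhancement map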
In one example, in the above-mentioned digital employee intelligence method based on a financial system, the inputting the multi-channel speech spectrogram into the second convolutional neural network of the dual-flow network model using a channel attention mechanism to obtain a channel enhancement feature map includes: inputting the multichannel voice spectrogram into a plurality of convolution layers of the second convolution neural network to obtain a second convolution feature map; calculating the global average value of each feature matrix of the second convolution feature diagram along the channel dimension to obtain a channel feature vector; inputting the channel feature vector into the Sigmoid activation function to obtain a channel attention weight vector; and weighting each feature matrix of the second convolution feature map along the channel dimension by taking the feature value of each position in the channel attention weight vector as a weight to obtain the channel enhancement feature map.
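The channel attention stream can be sketched the same way; again the backbone sizes are assumptions, while the global pooling, Sigmoid weighting, and channel-wise re-weighting follow the steps above:

import torch
import torch.nn as nn

class ChannelAttentionStream(nn.Module):
    def __init__(self, in_ch: int = 3, feat_ch: int = 64):
        super().__init__()
        # "a plurality of convolution layers" of the second convolutional neural network
        self.backbone = nn.Sequential(
            nn.Conv2d(in_ch, feat_ch, 3, padding=1), nn.ReLU(),
            nn.Conv2d(feat_ch, feat_ch, 3, padding=1), nn.ReLU(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        f2 = self.backbone(x)              # second convolution feature map
        vec = f2.mean(dim=(2, 3))          # global average of each feature matrix -> channel feature vector
        w = torch.sigmoid(vec)             # channel attention weight vector
        return f2 * w[:, :, None, None]    # weight each feature matrix along the channel dimension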
In one example, in the above-mentioned digital employee intelligence method based on a financial system, fusing the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map further includes: fusing the spatial enhancement feature map and the channel enhancement feature map with the following formula to obtain the classification feature map:
F_s = α·F_1 + β·F_2
wherein F_s is the classification feature map, F_1 is the spatial enhancement feature map, F_2 is the channel enhancement feature map, "+" indicates that elements at corresponding positions of the spatial enhancement feature map and the channel enhancement feature map are added, and α and β are weighting parameters for controlling the balance between the spatial enhancement feature map and the channel enhancement feature map in the classification feature map.
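Reusing the two stream sketches above, the weighted fusion is a single element-wise expression; the values of α and β here are arbitrary examples, since the text leaves them as tunable weighting parameters:

import torch
# SpatialAttentionStream and ChannelAttentionStream are the sketch modules defined above

alpha, beta = 0.6, 0.4                          # illustrative weighting parameters
spatial = SpatialAttentionStream()
channel = ChannelAttentionStream()
x = torch.randn(1, 3, 64, 256)                  # one multi-channel voice spectrogram
F_s = alpha * spatial(x) + beta * channel(x)    # element-wise weighted sum of the two streams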
In one example, in the above-mentioned digital employee intelligence method based on a financial system, correcting the feature values of each position in the classification feature map based on the statistical characteristics of the feature value set of all positions in the classification feature map to obtain a corrected classification feature map further includes: correcting the feature values of each position in the classification feature map with the following formula to obtain the corrected classification feature map, wherein the formula is:
[Correction formula omitted: published as image GDA0004159487860000171 in the original document and not recoverable from the text.]
wherein f_{i,j,k} is the feature value at each position of the classification feature map F, μ and σ are the mean and variance of the feature value set {f_{i,j,k}} ∈ F, W×H×C is the scale of the classification feature map F, log denotes the base-2 logarithm, α is a weight hyperparameter, and f'_{i,j,k} is the feature value at each position of the corrected classification feature map.
In one example, in the above-mentioned digital employee intelligence method based on a financial system, passing the corrected classification feature map through a classifier to obtain a classification result further includes: processing the corrected classification feature map using the classifier with the following formula to generate the classification result:
O = softmax{(W_n, B_n) : ⋯ : (W_1, B_1) | Project(F)}
wherein Project(F) denotes projecting the corrected classification feature map as a vector, W_1 to W_n are the weight matrices of each fully connected layer, and B_1 to B_n are the bias matrices of each fully connected layer.
In summary, the financial system-based digital employee intelligence method according to embodiments of the present application has been illustrated. It converts a digital employee's understanding of a customer's consultation intent into a voice topic labeling problem, i.e., encoding and understanding the voice signal with a feature extractor and assigning a predetermined intent topic label to the voice signal. In particular, a noise reduction module is further used to perform noise reduction processing on the voice signal so as to improve the accuracy of consultation intent understanding. In this way, an optimized digital employee intelligence scheme for financial systems is constructed.
The basic principles of the present application have been described above in connection with specific embodiments, however, it should be noted that the advantages, benefits, effects, etc. mentioned in the present application are merely examples and not limiting, and these advantages, benefits, effects, etc. are not to be considered as necessarily possessed by the various embodiments of the present application. Furthermore, the specific details disclosed herein are for purposes of illustration and understanding only, and are not intended to be limiting, as the application is not intended to be limited to the details disclosed herein as such.
The block diagrams of the devices, apparatuses, equipment, and systems referred to in this application are merely illustrative examples and are not intended to require or imply that connections, arrangements, or configurations must be made in the manner shown in the block diagrams. As will be appreciated by one of skill in the art, these devices, apparatuses, equipment, and systems may be connected, arranged, or configured in any manner. Words such as "including", "comprising", "having", and the like are open words meaning "including but not limited to" and are used interchangeably therewith. The term "or" as used herein refers to, and is used interchangeably with, the term "and/or", unless the context clearly indicates otherwise. The term "such as" as used herein refers to, and is used interchangeably with, the phrase "such as but not limited to".
It is also noted that, in the apparatuses, devices, and methods of the present application, the components or steps may be decomposed and/or recombined. Such decomposition and/or recombination should be regarded as equivalent solutions of the present application.
The previous description of the disclosed aspects is provided to enable any person skilled in the art to make or use the present application. Various modifications to these aspects will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other aspects without departing from the scope of the application. Thus, the present application is not intended to be limited to the aspects shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
The foregoing description has been presented for purposes of illustration and description. Furthermore, this description is not intended to limit the embodiments of the application to the form disclosed herein. Although a number of example aspects and embodiments have been discussed above, a person of ordinary skill in the art will recognize certain variations, modifications, alterations, additions, and subcombinations thereof.

Claims (4)

1. A digital employee intelligence system based on a financial system, comprising:
the voice signal acquisition module is used for acquiring a consultation voice signal of a client;
the noise reduction module is used for enabling the consultation voice signal to pass through the signal noise reduction module based on the automatic encoder so as to obtain a noise-reduced consultation voice signal;
the voice spectrogram extraction module is used for extracting a logarithmic mel spectrogram, a cochlear spectrogram and a constant Q transformation spectrogram from the noise-reduced consultation voice signal;
the multi-channel voice spectrogram construction module is used for arranging the logarithmic mel spectrogram, the cochlear spectrogram and the constant Q transformation spectrogram into a multi-channel voice spectrogram;
the double-flow coding module is used for enabling the multi-channel voice spectrogram to pass through a double-flow network model to obtain a classification characteristic diagram;
the self-adaptive correction module is used for correcting the characteristic values of all positions in the classification characteristic diagram based on the statistical characteristics of the characteristic value sets of all positions in the classification characteristic diagram to obtain a corrected classification characteristic diagram; and
The consultation intention recognition module is used for passing the corrected classification characteristic diagram through a classifier to obtain a classification result, wherein the classification result is used for representing an intention topic label of a consultation voice signal;
wherein the dual stream encoding module comprises:
the first convolution coding unit is used for inputting the multichannel voice spectrogram into a first convolution neural network of the double-flow network model by using a spatial attention mechanism so as to obtain a spatial enhancement characteristic diagram;
the second convolution coding unit is used for inputting the multi-channel voice spectrogram into a second convolution neural network of the double-flow network model, which uses a channel attention mechanism, so as to obtain a channel enhancement feature map; and
an aggregation unit, configured to fuse the spatial enhancement feature map and the channel enhancement feature map to obtain the classification feature map;
wherein the first convolutional encoding unit comprises:
the depth convolution coding subunit is used for inputting the multichannel voice spectrogram into a plurality of convolution layers of the first convolution neural network to obtain a first convolution characteristic map;
a spatial attention subunit, configured to input the first convolution feature map into a spatial attention module of the first convolution neural network to obtain a spatial attention map; and
An attention applying subunit, configured to calculate a point-by-point multiplication of the spatial attention map and the first convolution feature map to obtain the spatial enhancement feature map;
wherein the spatial attention subunit is configured to:
performing convolutional encoding on the first convolutional feature map by using a convolutional layer of the spatial attention module to obtain a spatial perception feature map;
calculating the point-by-point multiplication between the spatial perception feature map and the first convolution feature map to obtain a spatial attention score map; and
inputting the spatial attention score map into a Sigmoid activation function to obtain the spatial attention map;
wherein the second convolution encoding unit includes:
the depth convolution coding subunit is used for inputting the multichannel voice spectrogram into a plurality of convolution layers of the second convolution neural network to obtain a second convolution characteristic diagram;
a global average pooling subunit, configured to calculate a global average of each feature matrix of the second convolution feature map along a channel dimension to obtain a channel feature vector;
a channel attention weight calculation subunit, configured to input the channel feature vector into the Sigmoid activation function to obtain a channel attention weight vector; and
A channel attention applying subunit, configured to weight each feature matrix of the second convolution feature map along a channel dimension with a feature value of each position in the channel attention weight vector as a weight, so as to obtain the channel enhancement feature map;
wherein the aggregation unit is configured to:
fusing the spatial enhancement feature map and the channel enhancement feature map using the aggregation unit to obtain the classification feature map with the following formula:
F_s = α·F_1 + β·F_2
wherein F_s is the classification feature map, F_1 is the spatial enhancement feature map, F_2 is the channel enhancement feature map, "+" indicates that elements at corresponding positions of the spatial enhancement feature map and the channel enhancement feature map are added, and α and β are weighting parameters for controlling the balance between the spatial enhancement feature map and the channel enhancement feature map in the classification feature map.
2. The financial system-based digital employee intelligence system of claim 1 wherein the noise reduction module includes:
a voice signal coding unit, configured to input the advisory voice signal into an encoder of the signal noise reduction module, where the encoder uses a convolution layer to perform explicit spatial coding on the advisory voice signal to obtain a voice feature; and
the voice feature decoding unit is used for inputting the voice features into a decoder of the signal noise reduction module, wherein the decoder uses a deconvolution layer to deconvolve the voice features so as to obtain the noise-reduced consultation voice signal.
3. The financial system-based digital employee intelligence system of claim 2 wherein the adaptive correction module is further configured to:
correcting the characteristic values of each position in the classification characteristic map by the following formula to obtain a corrected classification characteristic map;
wherein, the formula is:
[Correction formula omitted: published as image FDA0004159487840000031 in the original document and not recoverable from the text.]
wherein f_{i,j,k} is the feature value at each position of the classification feature map F, μ and σ are the mean and variance of the feature value set {f_{i,j,k}} ∈ F, W×H×C is the scale of the classification feature map F, log denotes the base-2 logarithm, α is a weight hyperparameter, and f'_{i,j,k} is the feature value at each position of the corrected classification feature map.
4. The financial system-based digital employee intelligence system of claim 3 wherein the advisory intent recognition module is further configured to:
processing the corrected classification feature map using the classifier to generate a classification result with the following formula:
O = softmax{(W_n, B_n) : ⋯ : (W_1, B_1) | Project(F)}
wherein Project(F) denotes projecting the corrected classification feature map as a vector, W_1 to W_n are the weight matrices of each fully connected layer, and B_1 to B_n are the bias matrices of each fully connected layer.