CN117149977A - Intelligent collecting robot based on robot flow automation - Google Patents


Info

Publication number
CN117149977A
Authority
CN
China
Prior art keywords
module
voice
intelligent
training
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311150380.2A
Other languages
Chinese (zh)
Inventor
魏永强
杨兴海
李建州
杨兴荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shijihengtong Technology Co ltd
Original Assignee
Shijihengtong Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shijihengtong Technology Co ltd filed Critical Shijihengtong Technology Co ltd
Priority to CN202311150380.2A priority Critical patent/CN117149977A/en
Publication of CN117149977A publication Critical patent/CN117149977A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/0464 Convolutional networks [CNN, ConvNet]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes
    • G06Q40/03 Credit; Loans; Processing thereof

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Databases & Information Systems (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Human Computer Interaction (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • Technology Law (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an intelligent collection robot based on robotic process automation, comprising a collection robot setting module, a collection strategy customization module, an intelligent customer service system module, a data protocol interaction module and a collection business management system module. Compared with traditional intelligent customer service and virtual digital humans, the invention provides smoother, more customer-friendly service. It can answer common questions from preset templates and supply background information and suggestions to human agents, helping them understand problems and offer solutions more quickly, which improves overall working efficiency. Through continuous training and fine-tuning of its dialect library, it can accurately recognize regional dialects, convert speech to text, and answer users' questions intelligently in Mandarin. The system can be improved and optimized according to demand and feedback, and works cooperatively with the human customer service team: when it cannot answer a complex question or understand the user's intention, the question is transferred to a human agent.

Description

Intelligent collection robot based on robotic process automation
Technical Field
The invention relates to the technical field of information processing, and in particular to an intelligent collection robot based on robotic process automation and a method of applying it.
Background
In the prior art, intelligent collection robots are widely used to replace human agents in debt collection, improving the efficiency of the traditional collection process, yet problems and shortcomings remain. Some collection robots use pre-written fixed answers or template-based replies. This approach is simple and efficient, but because it cannot generate customized answers for a specific situation, it lacks personalization and flexibility; such replies may not meet a user's specific needs and degrade the user experience. Conventional speech recognition techniques also have limitations in dialect recognition and speech-to-text conversion. They are typically based on rules and pattern matching, whose accuracy and adaptability can be low when dealing with complex dialects and particular accents, so the robot may fail to understand and recognize the user's voice input accurately, affecting the quality and accuracy of its answers. Some voice question-answering systems can convert the user's speech into text, correct errors, and answer questions through keyword matching or a predefined question-answer library. However, such systems often lack contextual understanding and consistency: they cannot follow the context and semantic relationships of a dialogue, and find it difficult to provide coherent, natural answers. The robot's replies may then fall short of the user's expectations, limiting the effectiveness of the communication and the user experience. Given these reply limitations of existing collection robots, a collection robot that can communicate with debtors in a more intelligent and humanized way is needed.
Currently, many technology companies and financial enterprises are actively developing and deploying intelligent robots for their customer service to improve information processing efficiency. These robots achieve efficient natural language processing and text generation by combining process automation, automatic data acquisition and analysis, machine learning, and deep learning techniques such as neural networks. As an example, ChatGPT, proposed by the American artificial intelligence research laboratory OpenAI, adopts the Transformer neural network architecture and has further advanced the field of artificial intelligence.
Intelligent robots are being applied ever more widely, particularly in intelligent customer service. Through automatic data collection and analysis, they can extract useful information from massive data and provide users with accurate answers and solutions. Process automation lets the robot execute various tasks automatically, reducing manual workload and improving efficiency. Meanwhile, using machine learning, the robot can continuously optimize its own performance and adapt to constantly changing user demands.
Disclosure of Invention
The technical problem to be solved by the invention is to provide an intelligent collection robot based on robotic process automation that can communicate with debtors more intelligently and humanely, with functions such as multilingual recognition, human-like voice interaction and integrated application.
In order to solve the technical problems, the technical scheme of the invention is as follows:
an intelligent collection robot based on robotic process automation comprises a collection robot setting module, a collection strategy customization module, an intelligent customer service system module, a data protocol interaction module and a collection business management system module;
the collection robot setting module is used for configuring collection robots;
the collection strategy customization module is used for evaluating collection task risks and customizing collection tasks;
the intelligent customer service system module is used for dialing calls or sending short messages, generating collection scripts for customers, recognizing customers' voice information, and communicating with customers;
the data protocol interaction module is used for receiving a collection scheme from the collection business management system and sending the collection decision information produced from that scheme back to the collection business management system;
the collection business management system is used for storing the customized collection schemes and collection decision information.
Further, the intelligent customer service system module comprises a collection script generation module, a collection speech synthesis module, a collection text synthesis module, a collection interactive speech recognition module, a collection semantic analysis module and an intelligent outbound-call system;
the collection script generation module is used for automatically generating a collection script for the customer according to the collection task and sending it to the collection speech synthesis module;
the collection speech synthesis module converts the script generated by the collection script generation module into speech, which is delivered to the customer through the intelligent outbound-call system;
the collection text synthesis module converts the script generated by the collection script generation module into text, which is delivered to the customer through the intelligent outbound-call system;
the collection interactive speech recognition module outputs the recognized text information to the collection semantic analysis module;
the collection semantic analysis module is used for preprocessing the customer's dialogue text, training and identifying keywords based on automation technology, and analyzing the customer's intention;
the intelligent outbound-call system is used for making calls or sending short messages.
Furthermore, the collection interactive speech recognition module uses a neural network structure that combines a long short-term memory (LSTM) network and a convolutional neural network (CNN) for corpus training and optimization, and realizes dialect and language speech recognition through an optimized CNN-BiLSTM-MFA-FL model.
Furthermore, for recognition of specific dialects, the interactive speech recognition module is additionally equipped with a GPT and adopts an end-to-end speech recognition system that uses a Conformer model as the encoder.
Further, the speech recognition system is trained as follows:
1) Data collection: collect large-scale datasets covering a broad audio distribution from many different environments, recording settings, speakers and languages, including voice samples and corresponding labels, and ensure the samples and labels in the datasets match;
2) Data preprocessing: studies have shown that training on sets mixing human- and machine-generated data can significantly affect system performance, and many existing ASR systems output only a limited subset of written language; such output can be detected, machine-generated transcriptions removed from the training dataset, and operations such as noise reduction and volume normalization applied to the remaining audio to ensure quality and consistency;
3) Signal feature extraction: transform the audio frames in the database to obtain log-Mel spectrograms;
4) Sequence alignment: because the length of a speech signal does not match the length of its text transcription, sequence alignment is needed; its goal is to align the speech features at each time step of the spectrogram with the corresponding labels so that the model can be trained;
5) Prediction tasks: first predict the language being spoken, each language being represented by a unique token in the training set, with the next token designating the task; thereafter, whether timestamps are to be predicted is indicated by a dedicated token, predicted times are given relative to the current audio segment, and additional vocabulary entries are added for them; the predictions are interleaved with caption tokens, a final token is appended, the training loss on past context text is masked, and the model is trained to predict all other tokens;
6) Model evaluation and fine-tuning: the model is fine-tuned with the LoRA method, and its performance is gradually improved through iterative training, evaluation and optimization so that it suits downstream applications.
Furthermore, the collection speech synthesis module uses an automated process to manage the training of a Tacotron2 model, takes text generated by GPT as input, and adjusts the Tacotron2 model's parameters through an automated optimization algorithm so that emotion information is taken into account when generating speech.
Further, the emotion speech synthesis training of the collection speech synthesis module proceeds as follows:
1) Automatic data acquisition and preprocessing: collect speech datasets with emotion labels using automation technology, e.g. web crawlers or API calls, and automate the preprocessing step of converting speech files into feature representations;
2) Emotion labeling: an emotion analysis algorithm can automatically label part of the data with emotions, which are then manually verified and corrected;
3) Model training and automated optimization: during model training, an automated hyperparameter optimization tool is used to search for the best hyperparameter combination;
4) Model evaluation and automatic tuning: the trained models are evaluated by an automated process and tuned automatically according to metric feedback, and the performance of different model configurations is compared rapidly through the automated evaluation flow;
5) Process automation: the whole training pipeline is managed by a process automation tool, which helps automatically trigger, monitor and manage each stage of the training process;
6) Deep learning techniques: in model training, deep learning techniques such as recurrent or convolutional neural networks are used for the emotion speech synthesis task, or models such as Transformer or Conformer are used to extract richer features.
The invention has the following beneficial effects:
1. Improved humanization: the automation-based intelligent collection robot can quickly generate human-understandable replies to customers' questions; compared with traditional intelligent customer service and virtual digital humans, its service is smoother and more customer-friendly.
2. Improved efficiency: the robot can answer common questions from preset templates and provide background information and suggestions to human agents, helping them understand problems and offer solutions more quickly, improving overall working efficiency.
3. Multilingual recognition: through a long short-term memory (LSTM) network and a convolutional neural network (CNN), with continuous dialect-library training and fine-tuning, the robot can accurately recognize regional dialects, convert speech to text, and answer users' questions intelligently in Mandarin.
4. Scalability: the robot has good extensibility; once the initial version of the system is built, it can be improved and optimized according to demand and feedback.
5. Human cooperation: the robot can work cooperatively with the human customer service team; when the system cannot answer a complex question or understand the user's intention, the question can be transferred to a human agent.
Drawings
FIG. 1 is a schematic diagram of a system for dialect recognition based on CNN-BiLSTM;
FIG. 2 is a schematic diagram of an end-to-end speech recognition system;
FIG. 3 is a schematic diagram of Tacotron2 model;
FIG. 4 is a flow chart of an application of the intelligent collection robot based on automation technology;
FIG. 5 is a flow chart of the processing performed by the voice module of the intelligent collection robot based on automation technology.
Detailed Description
The following further describes embodiments of the present invention with reference to the drawings. The description of these embodiments is provided to assist understanding of the invention, but is not intended to limit it. In addition, the technical features of the embodiments described below may be combined with each other as long as they do not conflict.
As shown in fig. 1-5:
the intelligent collecting robot based on the robot process automation comprises a collecting robot setting module, a collecting strategy customizing module, an intelligent client service system module, a data protocol interaction module and a collecting service management system module:
the collecting robot setting module is used for setting collecting robots;
the collection strategy customizing module is used for evaluating collection task risks and customizing collection tasks;
the intelligent customer service system module is used for dialing telephone or sending short messages, generating a telephone operation for a customer, identifying information of the customer and communicating with the customer;
the data protocol interaction module is used for receiving a prompting scheme of the prompting service management system and sending prompting decision information of the prompting scheme after processing to the prompting service management system;
the collect service management system is used for storing the customized collect scheme and collect decision information.
The intelligent customer service system module comprises a collection script generation module, a collection speech synthesis module, a collection text synthesis module, a collection interactive speech recognition module, a collection semantic analysis module and an intelligent outbound-call system;
the collection script generation module is used for automatically generating a collection script for the customer according to the collection task and sending it to the collection speech synthesis module;
the collection speech synthesis module converts the script generated by the collection script generation module into speech, which is delivered to the customer through the intelligent outbound-call system;
the collection text synthesis module converts the script generated by the collection script generation module into text, which is delivered to the customer through the intelligent outbound-call system;
the collection interactive speech recognition module outputs the recognized text information to the collection semantic analysis module;
the collection semantic analysis module is used for preprocessing the customer's dialogue text, training and identifying keywords based on automation technology, and analyzing the customer's intention;
the intelligent outbound-call system can make calls or send short messages.
The core technologies of the invention are as follows:
1. Speech recognition algorithm
One of the core technologies of the invention is the collection interactive speech recognition module's dialect recognition architecture based on CNN-BiLSTM (shown in FIG. 1), with automation technology selected for speech recognition training. The advantage of training with automation technology is the ability to learn feature and structural information during training, which improves the model's generalization. Automated training lets the robot adapt better to unseen data samples and reason and decide in different situations; training with a large model also reduces the risk of overfitting and strengthens the model's robustness and reliability. Specifically, a neural network structure combining a long short-term memory (LSTM) network and a convolutional neural network (CNN) is used for corpus training and optimization. Dialect recognition is realized through an optimized CNN-BiLSTM-MFA-FL model, and the accuracy has been verified:
CNN: the structure of CNN is mainly composed of 3 parts: part 1 is a convolution layer that filters the input image and calculates and extracts meaningful values by sliding window methods; part 2 is a pooling layer that reduces the size of features extracted by convolution operations; part 3 is a fully connected layer, connecting all neurons as in classical neural network structures.
Bi-LSTM: the LSTM has three control gates, namely a forget gate is responsible for forgetting some information in the past, an output gate is responsible for outputting information, and an update gate is responsible for memorizing some information in the present, which enables the LSTM to correlate the language context. The Bi-LSTM main structure is composed of two unidirectional LSTM neural networks. At each time t, the input is simultaneously provided to the two opposite-direction recurrent neural networks, and the output is jointly determined by the two unidirectional recurrent neural networks, so that the Bi-LSTM can acquire memory from the past time point and can acquire information from the future time point.
FL: an extension of the focus loss function, cross entropy loss, essentially uses a suitable function to measure the contribution of the difficulty sample to the total loss function and thus optimize the model. The method has good performance in the corpus which is easy to be confused and has different sample numbers in training dialects, and improves the recognition rate by 0.99 percent compared with the cross entropy loss function.
For recognition of specific dialects, the automation-based intelligent collection robot is equipped with GPT; the Transformer neural network used by GPT performs well on natural language processing (NLP) tasks, with a trained word error rate of only 2.8% on a specific dialect. However, although the Transformer network captures context information effectively, it has difficulty learning rich local information. Conformer adds a convolution module to the Transformer encoder, strengthening its ability to capture fine-grained local information and further improving model performance; Conformer's WER is 2%-6% lower than the Transformer's across multiple test sets.
An end-to-end speech recognition system (see FIG. 2) that uses a Conformer model as the encoder is structured as follows:
Conformer model structure: it is formed by stacking several Conformer blocks. A Conformer block consists of Layer Norm modules, feed-forward layers, a convolution layer and a multi-head attention layer, with residual connections around the feed-forward, convolution and multi-head attention layers; the residual connections are introduced to ease training of the convolutional network.
At the decoder, in addition to the traditional CTC decoder construction, pairing a LAS decoder with the Conformer has been shown to improve recognition accuracy. An encoder RNN serves as the listener and a decoder RNN as the speller. When a Conformer is used as the listener, the word error rate drops by a relative 46.74% compared with a Transformer-encoder model, and drops further after training with an auxiliary CTC loss.
Speech recognition training proceeds as follows:
1) Data collection: large-scale datasets are collected covering a wide range of audio distributions from many different environments, recording settings, speakers and languages. These datasets include speech samples and corresponding labels (text transcriptions), and the samples and labels must match.
2) Data preprocessing: studies have shown that training on sets mixing human- and machine-generated data can significantly impact system performance. Many existing ASR systems output only a limited subset of written language; such output can be detected and the machine-generated transcriptions removed from the training dataset. Noise reduction, volume normalization and similar operations are then applied to the remaining audio to ensure its quality and consistency.
3) Signal feature extraction: the audio frames in the database are transformed to obtain a log-Mel spectrogram.
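A minimal NumPy sketch of this pipeline (framing, windowed FFT, triangular mel filterbank, log compression); the frame and filterbank parameters below are common defaults, not values taken from the patent:

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def log_mel_spectrogram(signal, sr=16000, n_fft=400, hop=160, n_mels=40):
    """Frame the waveform, take a windowed FFT, project power onto mel filters."""
    # 1. Overlapping frames with a Hann window
    frames = np.array([signal[i:i+n_fft] * np.hanning(n_fft)
                       for i in range(0, len(signal) - n_fft + 1, hop)])
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2       # (T, n_fft//2+1)
    # 2. Triangular mel filterbank, equally spaced on the mel scale
    mels = np.linspace(hz_to_mel(0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fbank = np.zeros((n_mels, n_fft // 2 + 1))
    for m in range(1, n_mels + 1):
        lo, c, hi = bins[m - 1], bins[m], bins[m + 1]
        for k in range(lo, c):
            fbank[m - 1, k] = (k - lo) / max(c - lo, 1)
        for k in range(c, hi):
            fbank[m - 1, k] = (hi - k) / max(hi - c, 1)
    # 3. Log compression (epsilon avoids log(0))
    return np.log(power @ fbank.T + 1e-10)                 # (T, n_mels)

sr = 16000
t = np.arange(sr) / sr                  # 1 s of a 440 Hz tone as test input
spec = log_mel_spectrogram(np.sin(2 * np.pi * 440 * t))
```

In practice a library such as librosa or torchaudio would be used, but the steps are the same.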
4) Sequence alignment: sequence alignment is required because the length of the speech signal and the length of the text transcription differ. Its goal is to align the speech features at each time step of the spectrogram with the corresponding labels so that the model can be trained.
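One common way to handle this length mismatch is CTC, which the decoder description above also mentions; CTC's many-to-one collapse rule, mapping a per-frame label path to a transcription, can be sketched as:

```python
def ctc_collapse(frame_labels, blank="-"):
    """Map a per-frame label path to a transcription: merge repeats, drop blanks."""
    out = []
    prev = None
    for lab in frame_labels:
        if lab != prev and lab != blank:
            out.append(lab)
        prev = lab
    return "".join(out)

# Many frame-level paths map to the same text, which is how CTC sidesteps
# the mismatch between the number of speech frames and transcription length.
decoded = ctc_collapse("hh-ee-ll-ll-oo")
```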
5) Prediction tasks: first the language being spoken is predicted, each language being represented by a unique token in the training set. The next token specifies the task (transcription or translation). After this, a dedicated token indicates whether timestamps are to be predicted. For timestamp prediction, times are given relative to the current audio segment, and additional vocabulary entries are added for them, interleaving the predictions with caption tokens. Finally, an end token is appended. The training loss is masked on past context text, and the model is trained to predict all other tokens.
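The token layout this step describes can be illustrated schematically; the token names below are hypothetical, loosely modelled on Whisper-style multitask formatting rather than taken from the patent:

```python
def build_task_tokens(language, task, with_timestamps=False):
    """Assemble the special-token prefix that tells the decoder what to predict.

    Order mirrors the step above: language token, then task token, then a
    timestamp flag. Token spellings here are illustrative assumptions.
    """
    tokens = ["<|startoftranscript|>", f"<|{language}|>", f"<|{task}|>"]
    if not with_timestamps:
        tokens.append("<|notimestamps|>")
    return tokens

prefix = build_task_tokens("zh", "transcribe")
```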
6) Model evaluation and fine-tuning: the text is normalized with a text normalizer, and systems are evaluated and compared on the word error rate (WER) metric. Based on the evaluation results, trainable rank-decomposition matrices are injected into each layer of the architecture, reducing the number of trainable parameters for the downstream task and therefore the computation and storage cost; this is the LoRA model fine-tuning method. The model's performance is gradually improved through iterative training, evaluation and optimization, making it suitable for downstream applications.
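The rank-decomposition idea behind LoRA can be sketched in NumPy: the pretrained weight stays frozen while only two small matrices are trained (dimensions and scaling here are arbitrary illustrations):

```python
import numpy as np

def lora_effective_weight(W, A, B, alpha=16):
    """LoRA: keep the pretrained weight W frozen, learn a low-rank update B @ A."""
    r = A.shape[0]
    return W + (alpha / r) * (B @ A)

d_out, d_in, r = 512, 512, 8
rng = np.random.default_rng(0)
W = rng.standard_normal((d_out, d_in))     # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01  # trainable
B = np.zeros((d_out, r))                   # trainable; zero-init so W_eff == W at start

W_eff = lora_effective_weight(W, A, B)
trainable = A.size + B.size                # 8192 parameters vs 262144 in full W
```

Only A and B receive gradients, which is how the parameter count and storage cost of the downstream task are reduced.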
2. Emotion speech synthesis
The other core technology of the invention performs emotion speech synthesis, creating customer service with warmth. A robot built with automation technology provides smoother, more customer-friendly service than traditional intelligent customer service and virtual digital humans; its dialogue keeps contextual continuity, improving the customer experience, and in A/B evaluations its synthesized voice is judged closer to reality, approaching real human answers.
Most current emotion speech synthesis is built by optimizing and improving the Tacotron2 model (see FIG. 3); its mechanism is as follows:
Tacotron2 is a neural network model that maps the text end to the speech end, i.e. an end-to-end synthesis model. The framework consists of two parts: one is a recurrent sequence-to-sequence feature prediction network with a location-sensitive attention mechanism, which predicts the corresponding sequence of spectrogram frames from the input character sequence; the other is a modified WaveNet, which generates the corresponding time-domain waveform from the mel-spectrogram frames predicted by the model. The main workflow encodes the input text sequence into a character-embedding sequence with an encoder, which a decoder with location-sensitive attention then decodes into a predicted spectrogram.
On top of the Tacotron2 structure, optimization is generally applied at the preprocessing end and to the characteristic parameters so that the robot's spoken replies carry emotion and feel human. The invention uses an automated process to manage the training of the Tacotron2 model, taking text generated by GPT as input. The Tacotron2 parameters are adjusted by an automated optimization algorithm so that emotion information is taken into account and emotional speech is generated.
The robot is trained for emotion speech synthesis in the following steps:
1) Automatic data acquisition and preprocessing: automation techniques such as web crawlers and API calls collect speech datasets with emotion labels. Preprocessing steps are automated, for example converting speech files into feature representations such as MFCC or LPCC.
2) Emotion labeling: when the dataset is large, manual emotion labeling is time-consuming. An emotion-analysis algorithm can automatically label part of the data, which is then manually verified and corrected; this process can also be partially automated.
3) Model training and automated optimization: during model training, automated techniques can search for the optimal combination of hyper-parameters such as learning rate and batch size. Automated hyper-parameter optimization tools (e.g., Hyperopt, Optuna) help speed up finding the best settings.
4) Model evaluation and automatic tuning: the trained model is evaluated by an automated process and tuned automatically according to metric feedback. An automated evaluation flow makes it possible to compare the performance of different model configurations quickly.
5) Process automation: the whole training pipeline can be managed by a process-automation tool such as Apache Airflow, which supports automatic triggering, monitoring, and management of each stage of the training process.
6) Deep learning techniques: in model training, deep learning techniques such as recurrent neural networks (RNNs) or convolutional neural networks (CNNs) can be used for the emotion speech synthesis task; if richer features are needed, a Transformer or Conformer model can be used.
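The automated hyper-parameter search of step 3 can be sketched as a simple loop; real pipelines would use Hyperopt or Optuna as noted above, and the validation-loss function below is an invented stand-in for an actual Tacotron2 training run.

```python
import itertools
import math

def mock_validation_loss(learning_rate, batch_size):
    """Stand-in for 'train Tacotron2 with these settings, return the
    validation loss'. Invented for illustration: it pretends that a learning
    rate of 1e-3 with batch size 32 is the sweet spot."""
    return (abs(math.log10(learning_rate) + 3.0)       # distance from 1e-3
            + abs(math.log2(batch_size) - 5.0) * 0.1)  # distance from 32

def grid_search():
    """Exhaustive search over a small grid of hyper-parameter candidates,
    keeping the configuration with the lowest validation loss."""
    grid = itertools.product([1e-4, 1e-3, 1e-2], [16, 32, 64])
    lr, bs = min(grid, key=lambda p: mock_validation_loss(*p))
    return {"learning_rate": lr, "batch_size": bs}

best = grid_search()
print(best)
```

Tools like Optuna replace the exhaustive grid with smarter samplers (e.g., Bayesian optimization) and early pruning of bad trials, but the objective-function shape is the same.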
The application flow of the collection robot is as follows (see figures 4-5):
1. Case intake: debt information is imported, including the debtor's name, contact details, debt amount, due date, etc. The data are unified and standardized in format so that they fit the robot system's processing and analysis requirements.
2. Strategy analysis: based on risk-analysis training results produced with automation technology, the system identifies the debtor's repayment willingness and risk, predicts the likelihood of repayment, and analyzes the debt risk; the intelligent robot automatically learns and optimizes the collection strategy and selects a strategy suited to the case.
3. Automated communication: the trained robot routes higher-risk cases to a human agent for resolution; for lower-risk cases it automatically sends a short message, an email, or a voice call to remind the debtor to repay or to discuss the case.
4. Language interaction: an intelligent robot pre-trained with a long short-term memory network (LSTM) and a convolutional neural network (CNN) through automation technology performs semantic analysis and recognition of the user's responses and behavior, and interacts with the user via speech- and text-synthesis technology. When keywords in the user's feedback cannot be found in the knowledge base, the conversation is automatically switched to manual handling.
5. Real-time monitoring: for each case, execution of the collection task is monitored in real time, and corresponding reports and metrics are generated. The system can monitor the delivery of collection notifications, the debtor's response rate, repayment status, and so on, and produce collection-effect reports and performance evaluations so that the enterprise understands the overall state of its collection business.
6. Feedback and tuning: the performance of the pre-trained intelligent collection robot is continuously optimized and adjusted according to feedback from practical application; continuous optimization improves system performance and efficiency.
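Steps 3 and 5 of the flow above (risk-based routing plus real-time metrics) can be sketched as follows; the risk threshold, the field names, and the fixed scoring rule are all illustrative assumptions standing in for the trained risk model.

```python
from dataclasses import dataclass

RISK_THRESHOLD = 0.7  # assumed cut-off between auto-reminder and human handling

@dataclass
class Case:
    debtor: str
    amount: float
    days_overdue: int
    responded: bool = False
    repaid: bool = False

def risk_score(case):
    """Toy stand-in for the trained risk model: the longer overdue, the riskier."""
    return min(1.0, case.days_overdue / 90.0)

def route(cases):
    """Step 3: high-risk cases go to a human agent, the rest to auto-reminders."""
    manual = [c for c in cases if risk_score(c) >= RISK_THRESHOLD]
    auto = [c for c in cases if risk_score(c) < RISK_THRESHOLD]
    return manual, auto

def effect_report(cases):
    """Step 5: response and repayment rates for the collection-effect report."""
    total = len(cases)
    return {
        "response_rate": sum(c.responded for c in cases) / total,
        "repayment_rate": sum(c.repaid for c in cases) / total,
    }

cases = [
    Case("A", 5000.0, 10, responded=True, repaid=True),
    Case("B", 12000.0, 95),
    Case("C", 800.0, 30, responded=True),
    Case("D", 20000.0, 70),
]
manual, auto = route(cases)
report = effect_report(cases)
```

In a production system the scoring function would be the learned model from step 2 and the report would feed the step-6 feedback loop; the routing and aggregation logic stays this simple.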
The innovation points of the invention are:
1. A trained AIGC/GPT model is integrated into the intelligent collection robot using process-automation technology. Through automated data acquisition, analysis, and process automation together with machine learning, neural networks, and other deep learning techniques, the CNN-BiLSTM-MFA-FL and Conformer-LAS neural networks are fused, with CTC as an auxiliary loss, to realize multilingual language identification and semantic recognition. Compared with prior art that fuses a single neural network model with a Conformer network, or multiple neural networks with a Transformer network, the process-automation approach improves accuracy.
2. GPT is combined with the Tacotron2 framework through process automation to realize emotional voice replies, and an automated CNN extracts and recognizes emotion features. Process automation replaces traditional emotional-reply approaches such as preset text or simple keyword recognition. GPT-generated replies retrieve the knowledge base through natural language processing (NLP) and generate more human-like text answers. Text emotion is then analyzed with a Tacotron2 deep neural network inference model, and text trained with different emotion embeddings is used to synthesize, for each emotion, an emotional voice reply that is closer to a human.
3. A business application flow for the intelligent collection robot based on automation technology. The flow is built on process automation and includes automated process management, data-driven decisions, automatic voice notification and reply, and real-time monitoring and feedback. By embedding automation into the entire business application flow through automated data collection, analysis, and process automation together with machine learning, neural networks, and other deep learning techniques, an automated and efficient collection process is realized.
The embodiments of the present invention have been described in detail above with reference to the accompanying drawings, but the present invention is not limited to the described embodiments. It will be apparent to those skilled in the art that various changes, modifications, substitutions and alterations can be made to these embodiments without departing from the principles and spirit of the invention, and yet fall within the scope of the invention.

Claims (7)

1. An intelligent collection robot based on robotic process automation, characterized in that it comprises a collection-robot setup module, a collection-strategy customization module, an intelligent customer-service system module, a data-protocol interaction module, and a collection-business management system module;
the collection-robot setup module is used for configuring collection robots;
the collection-strategy customization module is used for evaluating the risk of collection tasks and customizing collection tasks;
the intelligent customer-service system module is used for dialing calls or sending short messages, generating a collection script for the customer, recognizing the customer's voice information, and communicating with the customer;
the data-protocol interaction module is used for receiving a collection scheme from the collection-business management system and sending the processed collection-decision information of the scheme back to the collection-business management system;
the collection-business management system is used for storing the customized collection scheme and the collection-decision information.
2. The intelligent collection robot based on robotic process automation of claim 1, wherein the intelligent customer-service system module comprises a collection-script generation module, a collection speech-synthesis module, a collection text-synthesis module, a collection interactive speech-recognition module, a collection semantic-analysis module, and an intelligent outbound-call system;
the collection-script generation module is used for automatically generating a collection script for the customer according to the collection task and sending it to the collection speech-synthesis module;
the collection speech-synthesis module converts the script generated by the collection-script generation module into speech, which is output to the customer through the intelligent outbound-call system;
the collection text-synthesis module converts the script generated by the collection-script generation module into text, which is output to the customer through the intelligent outbound-call system;
the collection interactive speech-recognition module recognizes the customer's speech and outputs the recognized text information to the collection semantic-analysis module;
the collection semantic-analysis module is used for preprocessing the customer's dialogue text, training and recognizing keywords based on automation technology, and analyzing the customer's intention;
the intelligent outbound-call system is used for making calls or sending short messages.
3. The intelligent collection robot based on robotic process automation of claim 2, wherein the collection interactive speech-recognition module uses a neural network structure that combines a long short-term memory network with a convolutional neural network for corpus training and optimization, and realizes dialect and multilingual speech recognition through an optimized CNN-BiLSTM-MFA-FL model.
4. The intelligent collection robot based on robotic process automation of claim 3, wherein, for recognition of specific dialects and languages, the interactive speech-recognition module is additionally equipped with a GPT and adopts an end-to-end speech-recognition system that uses a Conformer model as the encoder.
5. The intelligent collection robot based on robotic process automation of claim 4, wherein the speech-recognition system is trained as follows:
1) Data collection: collect large-scale datasets covering a broad distribution of audio from many different environments, recording settings, speakers, and languages; the datasets include voice samples and corresponding labels, and each voice sample is verified to match its label;
2) Data preprocessing: research shows that training on a set mixing human- and machine-generated transcripts can significantly degrade the performance of the resulting system; many existing ASR systems output only a limited subset of written language, which makes such output detectable, so machine-generated transcripts are detected and removed from the training dataset, and operations such as noise reduction and volume normalization are performed on the remaining audio to ensure quality and consistency;
3) Signal feature extraction: the audio frames in the database are transformed to obtain a log-Mel spectrogram;
4) Sequence alignment: because the length of the speech signal does not match the length of the text transcription, sequence alignment is needed; its goal is to align the speech features at each time step of the spectrogram with the corresponding labels so that the model can be trained;
5) Prediction tasks: the system first predicts the language being spoken, each language being represented by a unique token in the training set, and the next token specifies the task; a further token then specifies whether timestamps are to be predicted; for timestamp prediction, the predicted times are relative to the current audio segment, and additional tokens for them are added to the vocabulary; these predictions are interleaved with caption tokens, a final token is appended, the training loss on past context text is masked, and the model is trained to predict all other tokens;
6) Model evaluation and fine-tuning: using the LoRA fine-tuning method, the model's performance is gradually improved through iterative training, evaluation, and optimization, making it suitable for downstream applications.
6. The intelligent collection robot based on robotic process automation of claim 4, wherein the collection speech-synthesis module uses an automated process to manage training of a Tacotron2 model, takes text generated by GPT as input, and adjusts the parameters of the Tacotron2 model through an automated optimization algorithm so that emotion information is taken into account when speech is generated.
7. The intelligent collection robot based on robotic process automation of claim 6, wherein the emotion speech-synthesis training of the collection speech-synthesis module proceeds as follows:
1) Automatic data acquisition and preprocessing: speech datasets with emotion labels are collected using automation techniques such as web crawlers or API calls, and the preprocessing steps that convert speech files into feature representations are automated;
2) Emotion labeling: an emotion-analysis algorithm automatically labels part of the data with emotions, after which the labels are manually verified and corrected;
3) Model training and automated optimization: during model training, an automated hyper-parameter optimization tool searches for the optimal combination of hyper-parameters;
4) Model evaluation and automatic tuning: the trained model is evaluated by an automated process and tuned automatically according to metric feedback, and the performance of different model configurations is compared quickly through the automated evaluation flow;
5) Process automation: the whole training pipeline is managed by a process-automation tool, which helps automatically trigger, monitor, and manage each stage of the training process;
6) Deep learning techniques: in model training, deep learning techniques such as recurrent or convolutional neural networks are used for the emotion speech-synthesis task, or models such as Transformer or Conformer are used to extract richer features.
CN202311150380.2A 2023-09-07 2023-09-07 Intelligent collecting robot based on robot flow automation Pending CN117149977A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311150380.2A CN117149977A (en) 2023-09-07 2023-09-07 Intelligent collecting robot based on robot flow automation

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311150380.2A CN117149977A (en) 2023-09-07 2023-09-07 Intelligent collecting robot based on robot flow automation

Publications (1)

Publication Number Publication Date
CN117149977A true CN117149977A (en) 2023-12-01

Family

ID=88909687

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311150380.2A Pending CN117149977A (en) 2023-09-07 2023-09-07 Intelligent collecting robot based on robot flow automation

Country Status (1)

Country Link
CN (1) CN117149977A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117421414A (en) * 2023-12-18 2024-01-19 珠海金智维信息科技有限公司 Design method of RPA intelligent interactive system based on AIGC
CN117421414B (en) * 2023-12-18 2024-03-26 珠海金智维信息科技有限公司 Design method of RPA intelligent interactive system based on AIGC
CN117978919A (en) * 2024-02-01 2024-05-03 国能宁夏供热有限公司 Intelligent outbound system based on heat supply industry and application method thereof

Similar Documents

Publication Publication Date Title
CN111489748A (en) Intelligent voice scheduling auxiliary system
CN112101045B (en) Multi-mode semantic integrity recognition method and device and electronic equipment
CN112599124A (en) Voice scheduling method and system for power grid scheduling
CN117149977A (en) Intelligent collecting robot based on robot flow automation
CN111353029B (en) Semantic matching-based multi-turn spoken language understanding method
CN112037773B (en) N-optimal spoken language semantic recognition method and device and electronic equipment
CN115019776A (en) Voice recognition model, training method thereof, voice recognition method and device
CN116431806A (en) Natural language understanding method and refrigerator
CN115910066A (en) Intelligent dispatching command and operation system for regional power distribution network
CN111583965A (en) Voice emotion recognition method, device, equipment and storage medium
Iosifov et al. Natural Language Technology to Ensure the Safety of Speech Information
CN117037789B (en) Customer service voice recognition method and device, computer equipment and storage medium
Menon et al. Deep learning based transcribing and summarizing clinical conversations
CN117713377A (en) Intelligent voice joint debugging system of dispatching automation master station
CN116450799B (en) Intelligent dialogue method and equipment applied to traffic management service
CN113555133A (en) Medical inquiry data processing method and device
CN116759077A (en) Medical dialogue intention recognition method based on intelligent agent
CN112257432A (en) Self-adaptive intention identification method and device and electronic equipment
CN114238605B (en) Automatic conversation method and device for intelligent voice customer service robot
CN112150103B (en) Schedule setting method, schedule setting device and storage medium
CN112506405B (en) Artificial intelligent voice large screen command method based on Internet supervision field
CN113920992A (en) Distribution network scheduling voice interaction method based on artificial intelligence
CN116991982B (en) Interactive dialogue method, device, equipment and storage medium based on artificial intelligence
Abdildayeva et al. Voice Recognition Methods and Modules for the Development of an Intelligent Virtual Consultant Integrated with WEB-ERP
CN114265920B (en) Intelligent robot conversation method and system based on signals and scenes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination