CN111489743B - Operation management analysis system based on intelligent voice technology - Google Patents

Operation management analysis system based on intelligent voice technology

Info

Publication number
CN111489743B
CN111489743B (application CN201910082514.9A)
Authority
CN
China
Prior art keywords
voice
analysis
text
unit
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910082514.9A
Other languages
Chinese (zh)
Other versions
CN111489743A (en)
Inventor
张劭韡
吴佐平
王颖
邓艳丽
陈敏耀
邓志东
张晓慧
杜小瑾
姜冬
徐景龙
乔晅
徐强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Co ltd Customer Service Center
Beijing China Power Information Technology Co Ltd
Original Assignee
State Grid Co ltd Customer Service Center
Beijing China Power Information Technology Co Ltd
Filing date
Publication date
Application filed by State Grid Co ltd Customer Service Center, Beijing China Power Information Technology Co Ltd filed Critical State Grid Co ltd Customer Service Center
Priority to CN201910082514.9A priority Critical patent/CN111489743B/en
Publication of CN111489743A publication Critical patent/CN111489743A/en
Application granted granted Critical
Publication of CN111489743B publication Critical patent/CN111489743B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Links

Abstract

The invention provides an operation management analysis system based on intelligent voice technology, which comprises: a recording acquisition unit for downloading recording data files from a telephone recording platform, splicing and converting the files, and generating complete voice files; a scene segmentation unit for performing scene segmentation or speaker segmentation on the voice files; a voice transcription unit for recognizing the voice files with an intelligent voice recognition engine and transcribing them into text content; a data analysis unit for analyzing the text content and voice files with a neural network model and outputting analysis reports; a database unit for storing the voice files, text content and analysis reports; and a content indexing unit for retrieving the data stored in the database according to index commands. By intelligently analyzing the call center's large volume of recording data files, the invention discovers problems and defects in the service process and promptly grasps user demands, thereby improving user satisfaction.

Description

Operation management analysis system based on intelligent voice technology
Technical Field
The invention relates to the technical field of information analysis, in particular to an operation management analysis system based on an intelligent voice technology.
Background
With the development of mobile communication technology, the customer service call center plays a vital role as a bridge between an operation platform and its users. In recent years, driven by factors such as global user demand, national strategic guidance and enterprise competition, the industrial scale of intelligent voice technology has grown rapidly and continuously, and the technology has been applied ever more deeply in fields such as the mobile internet, smart homes, automotive electronics, financial payment, online education and medical treatment. Propelled by massive data and deep learning, intelligent voice technologies such as voice recognition, voice synthesis and voiceprint recognition have gradually matured and entered the stage of practical application.
The State Grid 95598 call center serves as an important bridge between the State Grid company and its users. By adopting Chinese voice recognition technology trained with the current international mainstream DNN (deep neural network) + HMM (hidden Markov model) method, the system can adapt to application environments spanning different ages, regions, crowds, channels, terminals and noise conditions. At the same time, the massive voice and text corpora accumulated by the 95598 call center are used for customized model training, establishing a voice transcription and analysis platform with high availability and a high recognition rate, greatly remedying the unclear and inaccurate voice recognition and transcription of the prior art and reducing the voice recognition error rate.
Disclosure of Invention
In view of the above, the main objective of the present invention is to provide an operation management analysis system based on intelligent voice technology. The system establishes an intelligent voice recognition model with high availability and a high recognition rate using the massive voice and text corpora accumulated by the State Grid 95598 call center, and continuously trains and optimizes the model through deep learning, improving recognition accuracy and applicability to the power service industry. Based on this model, the large volume of recording data generated by the call center is transcribed and analyzed in a timely manner, so that defects in the service process are promptly found, user demands are grasped, and service quality is further improved.
The technical scheme adopted by the invention is that an operation management analysis system based on intelligent voice technology comprises:
the recording acquisition unit is used for downloading recording data files from the telephone recording platform, splicing and converting the files and generating complete voice files;
The scene segmentation unit is used for carrying out scene segmentation or speaker segmentation on the voice file;
The voice transfer unit is used for recognizing the voice file based on the intelligent voice recognition engine and transferring the voice file into text content;
The data analysis unit is used for analyzing the text content and the voice file based on the neural network model and outputting an analysis report;
a database unit for storing the voice file, text content and analysis report;
And the content indexing unit is used for retrieving the data stored in the database according to the index command.
Wherein, the recording acquisition unit includes:
The recording downloading module is connected with the telephone recording platform and used for receiving recording data file fragments transmitted at regular time;
The splicing transcoding module splices the recording data file fragments, decompresses the spliced recording data file and converts the spliced recording data file into a complete voice file which can be identified;
and the transfer scheduling module is used for scheduling the corresponding voice file to be sent to the voice transfer unit according to the transfer command.
Wherein the data analysis unit includes:
The audio analysis module comprises a silence interval detection module, a speech speed detection module and an emotion detection module, and is used for performing silence interval analysis, speech speed analysis and emotion analysis on the voice file, respectively;
And the text analysis module is used for analyzing the text content.
In a further improvement, the system also comprises a recording distribution module for transmitting the voice files to the voice transcription unit, the data analysis unit and the database unit, respectively.
The intelligent voice recognition engine comprises an acoustic model and a language model, which respectively correspond to the calculation of feature-to-syllable probabilities and syllable-to-word probabilities for the audio features extracted from a voice file;
Modeling the language model by adopting an N-Gram model;
the acoustic model is modeled using a deep neural network and a hidden markov model.
Wherein the content index unit includes:
The data storage module is used for storing the text content and the accompanying data generated in the transcription process;
The data query module is used for querying and aggregating the voice files and accompanying data according to preset query conditions and displaying the results;
the word library is internally provided with a custom word segmentation word library, so that the word segmentation of the power related sentences can be accurately carried out;
the text word segmentation processing module is used for marking the word segmentation of the text content based on the word segmentation result and extracting each word segmentation to generate structured text data;
And the text clustering module is used for carrying out text clustering processing and deep cross analysis on the structured text data generated in a time period to obtain text related clustering information in the time period.
In a further improvement, the system also comprises a standardized sentence vector model library, which contains a plurality of standardized sentence vector models;
The standardized sentence vector model is obtained by carrying out sentence vector similarity calculation on sentence samples in a corpus based on a neural network model and carrying out standardized training on sentence vectors meeting similarity thresholds.
In a further improvement, the voice transcription unit further comprises a standardized processing module for segmenting the text content into sentences, calculating the sentence vector of each sentence, selecting the standardized sentence vector model corresponding to each sentence from the standardized sentence vector model library, and performing standardized training on each sentence so as to output the corresponding standardized sentence.
Drawings
Fig. 1 is a block diagram of an operation management analysis system based on intelligent voice technology according to the present invention.
Detailed Description
The invention mainly aims to provide an operation management analysis system based on an intelligent voice technology, which establishes an intelligent voice recognition model with high availability and high recognition rate by utilizing massive voice corpus and text corpus accumulated by a national power grid 95598 call center, continuously trains and optimizes the model by utilizing a self-deep learning technology, improves recognition accuracy, continuously improves recognition accuracy and applicability to the power service industry, and timely transcribes and analyzes a large amount of recording data generated by the call center based on the intelligent voice recognition model, so that defects in a service process are timely found, user demands are mastered, and service quality is improved.
By adopting Chinese voice recognition technology trained with the current international mainstream DNN (deep neural network) + HMM (hidden Markov model) method, the State Grid 95598 call center can adapt to application environments spanning different ages, regions, crowds, channels, terminals and noise conditions; meanwhile, customized model training is performed with the massive voice and text corpora accumulated by the 95598 call center, yielding a voice transcription platform with high availability and a high recognition rate;
The core technology of the voice transcription platform is an intelligent voice recognition technology, and the intelligent voice recognition technology adopts a latest generation recognition algorithm, a decoder core and an advanced acoustic model and language model training method and mainly comprises three important components: training a voice recognition model, performing front-end voice processing and performing back-end recognition processing;
1. speech recognition model training
The speech recognition model is typically composed of two parts, an acoustic model and a language model, corresponding respectively to the calculation of feature-to-syllable probabilities and syllable-to-word probabilities for the features extracted from the speech signal.
The acoustic model is generally modeled with DNN (deep neural network) + HMM (hidden Markov model); compared with the GMM (Gaussian mixture model) + HMM method of the previous generation, this reduces the speech recognition error rate by 30%, the fastest progress in speech recognition technology in the past 20 years. For the language model, a statistical language model is generally adopted at present, typically an N-Gram model. The N-Gram model, an (N-1)-order Markov chain, works as follows: a sliding window of size N is moved over the characters of the text to form a sequence of fragments of length N, each fragment being called a gram; the occurrence frequencies of all grams are counted and filtered against a preset threshold to form a key-gram list, namely the vector feature space of the text, with each gram in the list being one feature dimension;
The algorithm is fault-tolerant and language-independent, applying equally to Chinese and English without requiring linguistic processing. It is a simple, effective and widely used language model for large-vocabulary continuous speech recognition.
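The sliding-window gram extraction and threshold filtering described above can be sketched in a few lines. This is an illustrative toy, not the patent's implementation: the function names and the ASCII example corpus are assumptions for demonstration, and a production N-Gram language model would additionally estimate conditional probabilities with smoothing.

```python
from collections import Counter

def extract_ngrams(text, n):
    """Slide a window of size n over the text, one character at a time,
    yielding every fragment ("gram") of length n."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]

def gram_features(corpus, n=2, min_count=2):
    """Count all grams over a corpus and keep those at or above a
    frequency threshold, forming the key-gram list -- the text's
    vector feature space, one dimension per surviving gram."""
    counts = Counter()
    for sentence in corpus:
        counts.update(extract_ngrams(sentence, n))
    return {g: c for g, c in counts.items() if c >= min_count}
```

For example, over the toy corpus `["abab", "abba"]` with bigrams and a threshold of 2, only the frequent grams survive as feature dimensions.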
In order to adapt to application environments of different ages, different regions, different crowds, different channels, different terminals and different noise environments, a large amount of voice corpus and text corpus are needed for training, so that the recognition rate can be effectively improved. With the rapid development of the internet and the popularization and application of mobile terminals such as mobile phones, a large amount of text or speech corpus can be obtained from a plurality of channels, so that abundant resources are provided for training of language models and acoustic models in a speech recognition model, and the construction of a universal large-scale language model and acoustic model is possible.
2. Front-end speech processing
Front-end speech processing refers to preprocessing such as detection and noise reduction of speaker speech by using a signal processing method so as to obtain speech most suitable for processing by a recognition engine. The main functions include:
(1) Endpoint detection
Endpoint detection analyzes the input audio stream to distinguish speech from non-speech segments in the signal and accurately determine the starting point of the speech signal. After endpoint detection, subsequent processing can be applied to the speech signal only, which plays an important role in improving recognition accuracy and processing efficiency.
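A minimal energy-based sketch of endpoint detection could look like the following. The fixed frame length and hand-picked energy threshold are assumptions for illustration; a production voice activity detector is considerably more sophisticated (adaptive thresholds, hangover schemes, model-based classification).

```python
def frame_energies(samples, frame_len):
    """Average energy of each non-overlapping frame."""
    frames = [samples[i:i + frame_len]
              for i in range(0, len(samples) - frame_len + 1, frame_len)]
    return [sum(s * s for s in f) / frame_len for f in frames]

def detect_endpoints(samples, frame_len=160, threshold=0.01):
    """Return (start, end) sample indices of the voiced region, or None
    if every frame is below the energy threshold."""
    energies = frame_energies(samples, frame_len)
    voiced = [i for i, e in enumerate(energies) if e > threshold]
    if not voiced:
        return None
    return voiced[0] * frame_len, (voiced[-1] + 1) * frame_len
```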
(2) Noise cancellation
Background noise is a real challenge for speech recognition applications in practice, and it is difficult to avoid certain noise during a telephone voice call even if the speaker is in a quiet office environment. A good speech recognition engine needs to have efficient noise cancellation capabilities to accommodate the user's demands for application in a wide variety of environments.
(3) Feature extraction
The extraction of acoustic features is a process of greatly compressing information and plays an important role in better mode division of a later mode classifier, so that the extraction and selection of acoustic features are an important link of speech recognition. Features commonly used at present include MFCC (Mel Frequency Cepstrum Coefficient, mel-frequency cepstral coefficients) and PLP (Perceptual Linear Prediction, perceptual linear prediction cepstral coefficients), among others.
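Real front ends compute MFCC or PLP vectors with Mel filterbanks and cepstral transforms; the simplified sketch below (pre-emphasis plus per-frame log energy) is only a stand-in illustrating the framing-and-compression idea, with the frame and hop sizes chosen arbitrarily for demonstration.

```python
import math

def pre_emphasis(samples, alpha=0.97):
    """Boost high frequencies: y[t] = x[t] - alpha * x[t-1]."""
    return [samples[0]] + [samples[t] - alpha * samples[t - 1]
                           for t in range(1, len(samples))]

def log_energy_features(samples, frame_len=160, hop=80):
    """One log-energy value per overlapping frame -- a toy stand-in for
    the per-frame MFCC/PLP vectors a real front end would compute."""
    emphasized = pre_emphasis(samples)
    feats = []
    for start in range(0, len(emphasized) - frame_len + 1, hop):
        frame = emphasized[start:start + frame_len]
        energy = sum(s * s for s in frame) / frame_len
        feats.append(math.log(energy + 1e-10))  # floor avoids log(0)
    return feats
```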
3. Backend recognition processing
Back-end recognition processing recognizes (also called "decodes") the extracted feature vectors with the trained acoustic model and language model to obtain text information. The acoustic model corresponds to the calculation of feature-to-syllable (or phoneme) probabilities, and the language model to the calculation of syllable-to-word probabilities. The decoder, the central component, scores the original voice features against the acoustic and language models and on this basis finds the optimal path through the word lattice; the text along that path is the final recognition result.
Early decoders based on a grammar-tree structure were complex in design, and under current technical conditions their speed improvements have hit a bottleneck. Most current mainstream speech recognition decoders instead adopt decoding networks based on weighted finite-state transducers (WFST), which can integrate the language model, the dictionary and the acoustic model's shared phone sets into one large decoding network, greatly improving decoding speed while separating the decoding process from its knowledge sources.
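Production decoders search a composed WFST network, but the "optimal path" idea can be illustrated with a toy Viterbi search over an HMM. The log-probability dictionaries below are hypothetical inputs for demonstration, not the patent's actual models.

```python
def viterbi(obs_logprobs, trans_logprobs, init_logprobs):
    """Find the highest-scoring state path through an HMM.
    obs_logprobs:  list over time of {state: log P(observation | state)}
    trans_logprobs: {(prev, cur): log P(cur | prev)}
    init_logprobs:  {state: log P(state at t=0)}"""
    states = list(init_logprobs)
    best = {s: init_logprobs[s] + obs_logprobs[0][s] for s in states}
    back = []
    for t in range(1, len(obs_logprobs)):
        new_best, ptr = {}, {}
        for cur in states:
            prev = max(states, key=lambda p: best[p] + trans_logprobs[(p, cur)])
            new_best[cur] = (best[prev] + trans_logprobs[(prev, cur)]
                             + obs_logprobs[t][cur])
            ptr[cur] = prev
        back.append(ptr)
        best = new_best
    # trace back along the stored pointers to recover the optimal path
    last = max(best, key=best.get)
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return list(reversed(path))
```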
Based on the above-mentioned massive voice corpus and text corpus accumulated in the call center of the national power grid 95598 and the intelligent voice recognition model formed by custom training by adopting DNN (deep neural network) +HMM (hidden Markov model), in a preferred embodiment provided by the invention, an operation management analysis system is involved, and the integrated aspect of the system needs to be directly in butt joint with a 95598 telephone recording platform, a 95598 service management system and a quality inspection management module, and is indirectly associated with the 95598 service support system through the quality inspection management module.
The system aims at clear hierarchical division on the whole framework, and the technology adopted by the core component achieves harmony and unification of advancement and maturation stability. The high availability requirements of various interfaces, services and engines of the whole system are fully considered, the interfaces are in principle in a main-standby mode, the services and the engines are in a load balancing mechanism, single-point faults are avoided, and service interruption or flow blockage caused by failure of a few nodes can be avoided.
The system needs to deploy corresponding parts in the southern and northern sub-centers. The southern sub-center interfaces with its local telephone recording platform to obtain recording data nearby; binary voice streams must be transmitted between the southern and northern networks when recordings are played back.
The northern sub-center likewise interfaces with its local recording platform for recording acquisition and transcription. In addition, because the users of voice analysis are mainly located near the northern sub-center, and because data must be summarized, the quality inspection text processing and voice content analysis of the whole customer service center are concentrated in the northern sub-center, where the content retrieval service and the database are also deployed.
For data management, text data is stored centrally according to actual conditions, and all types of storage are planned according to the current reasonable traffic so as to meet the processing requirements of the maximum voice traffic during the summer peak period.
The present system will be described in detail with reference to fig. 1, and the present operation management analysis system includes:
The recording acquisition unit 100 specifically comprises a recording downloading module, is connected with the telephone recording platform and receives recording data file fragments transmitted by the telephone recording platform at regular time; the splicing transcoding module splices the recording data file fragments, decompresses the spliced recording data file and converts the spliced recording data file into a complete voice file which can be identified; the transfer scheduling module is used for scheduling the corresponding voice file to be sent according to the transfer command;
the recording distribution module 200 is used for respectively transmitting the voice files;
The voice transcription unit 300 is used for recognizing the voice file based on the intelligent voice recognition engine and transcribing the voice file into text content;
The intelligent voice recognition engine comprises an acoustic model and a language model, which respectively correspond to the calculation of feature-to-syllable probabilities and syllable-to-word probabilities for the audio features extracted from the voice file;
The language model is modeled with an N-Gram model;
the acoustic model is modeled with a deep neural network and a hidden Markov model;
A scene segmentation unit 400, configured to segment the speech file into a scene or a speaker;
To save cost, current call centers usually use single-channel recording: the user and the customer service agent are recorded at the same time and stored in the same channel. In general, however, the agent's speech and the user's speech need to be analyzed separately. The agent's speech mainly serves to evaluate service capability, while the user's speech contains potential demand information or competitor information of obvious commercial value, in which case the user's voice and the agent's voice must be separated out of the single-channel recording. This function is commonly referred to as "speaker separation", also known as "scene segmentation";
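Real speaker separation uses techniques such as speaker embeddings and clustering; as a crude, hedged illustration only, conversational turns can be sketched from per-frame speech/non-speech labels by splitting at sufficiently long silence gaps. The label format and `min_gap` value are assumptions for demonstration.

```python
def split_turns(is_speech, min_gap=3):
    """Split a per-frame True/False speech mask into (start, end) frame
    spans separated by silences of at least min_gap frames -- a crude
    proxy for turn-taking in a single-channel recording."""
    turns, start, gap = [], None, 0
    for i, voiced in enumerate(is_speech):
        if voiced:
            if start is None:
                start = i
            gap = 0
        elif start is not None:
            gap += 1
            if gap >= min_gap:          # silence long enough: close the turn
                turns.append((start, i - gap + 1))
                start, gap = None, 0
    if start is not None:               # close a turn still open at the end
        turns.append((start, len(is_speech) - gap))
    return turns
```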
The data analysis unit 500 analyzes the text content and the voice file based on the neural network model and outputs an analysis report, including an audio analysis module and a text analysis module;
The audio analysis module comprises a silence interval detection module, a speech speed detection module and an emotion detection module, and is used for performing silence interval analysis, speech speed analysis and emotion analysis on the voice file, respectively;
the text analysis module is used for analyzing the text content;
A database unit 600 for storing the voice file, text content and analysis report;
the content indexing unit 700 retrieves data stored in the database according to an index command, and specifically includes:
The data storage module is used for storing the text content and the accompanying data generated in the transcription process;
The data query module is used for querying and aggregating the voice files and accompanying data according to preset query conditions and displaying the results;
the word library is internally provided with a custom word segmentation word library, so that the word segmentation of the power related sentences can be accurately carried out;
the text word segmentation processing module is used for marking the word segmentation of the text content based on the word segmentation result and extracting each word segmentation to generate structured text data;
And the text clustering module is used for carrying out text clustering processing and deep cross analysis on the structured text data generated in a time period to obtain text related clustering information in the time period.
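The custom-lexicon word segmentation underlying the word library and text word segmentation modules can be illustrated with forward maximum matching, a classic Chinese segmentation baseline. The toy ASCII lexicon below stands in for the power-domain word library and is purely hypothetical.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy forward maximum matching: at each position, take the
    longest lexicon word starting there (falling back to one symbol)."""
    words, i = [], 0
    while i < len(text):
        for j in range(min(max_len, len(text) - i), 0, -1):
            cand = text[i:i + j]
            if j == 1 or cand in lexicon:   # single symbols always match
                words.append(cand)
                i += j
                break
    return words
```

A domain lexicon matters here: with "abc" in the lexicon, the match "abc"+"d" wins over "ab"+"cd", which is exactly why a custom power-industry word list improves segmentation of power-related sentences.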
In addition, the system also comprises a standardized sentence vector model library which contains a plurality of standardized sentence vector models;
The standardized sentence vector model is obtained by carrying out sentence vector similarity calculation on sentence samples in a corpus based on a neural network model and carrying out standardized training on sentence vectors meeting similarity thresholds.
Based on a standardized sentence vector model library, the voice transcription unit further comprises a standardized processing module, which is used for carrying out sentence segmentation on the text content, calculating the sentence vector of each sentence, selecting a standardized sentence vector model corresponding to each sentence from the standardized sentence vector model library, and carrying out standardized training on each sentence so as to output a corresponding standardized sentence;
The standardized sentences are recombined to generate standardized text content for subsequent more accurate text analysis.
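Selecting the standardized sentence vector model corresponding to a sentence can be sketched as a cosine-similarity nearest-neighbour lookup over the model library. The 2-dimensional vectors, model identifiers and threshold below are illustrative assumptions; the patent's sentence vectors would come from the neural network model.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def nearest_standard(sentence_vec, model_vecs, threshold=0.8):
    """Return the id of the standardized sentence-vector model most
    similar to the input, or None if nothing clears the threshold."""
    best_id, best_sim = None, threshold
    for model_id, vec in model_vecs.items():
        sim = cosine(sentence_vec, vec)
        if sim >= best_sim:
            best_id, best_sim = model_id, sim
    return best_id
```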
Based on the operation management analysis system provided by the invention, the refined operation application on the service can be realized, for example:
customer service voice quality inspection: by applying the voice analysis technology, various retrieval functions are flexibly combined and applied, different application parameter thresholds are set, and the customer service call quality problem can be effectively and comprehensively analyzed and evaluated. The voice analysis can also locate the specific position where the problem occurs, so that management personnel can further track and determine the problem.
Operation management analysis: and carrying out service efficiency analysis, incoming call cause mining, user demand analysis, call duration analysis and hot spot and change trend monitoring by combining the recording and recording identification results, mining a service short board, and providing auxiliary support for standardizing the service process and optimizing the service flow.
The voice analysis technology supports: the voice analysis system detects the variation amplitude of fundamental frequency, pitch and the like in the audio in the telephone recording, provides prediction of possible emotion fluctuation in the recording, and positions the position information of the audio with emotion fluctuation in the whole voice; detecting and analyzing the average speech speed in the whole telephone recording and the change of the speech speed in a certain recording section; and detecting mute time and the like when a user and a hot-line service person in the record file do not speak, generating an index file in a standard XML format, and performing keyword retrieval, abnormal condition detection, abnormal voice detection and abnormal dialogue detection.
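The speech-speed analysis above can be illustrated as a words-per-minute calculation over transcript timestamps. The `(word, start_s, end_s)` tuple format is an assumption for demonstration, not the system's actual data layout.

```python
def speech_rate(word_timestamps):
    """Average speech rate in words per minute, given a list of
    (word, start_s, end_s) transcript entries."""
    if not word_timestamps:
        return 0.0
    duration = word_timestamps[-1][2] - word_timestamps[0][1]
    return len(word_timestamps) * 60.0 / duration if duration > 0 else 0.0
```

Sliding the same calculation over fixed windows of entries would give the per-segment speed changes the system is described as detecting.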
The foregoing description of the preferred embodiments of the invention is not intended to be limiting, but rather is intended to cover all modifications, equivalents, alternatives, and improvements that fall within the spirit and scope of the invention.

Claims (2)

1. An operation management analysis system based on intelligent voice technology, comprising:
The recording acquisition unit is used for downloading recording data files from the telephone recording platform, splicing and converting the files and generating complete voice files; the recording acquisition unit includes: the recording downloading module is connected with the telephone recording platform and used for receiving recording data file fragments transmitted at regular time; the splicing transcoding module splices the recording data file fragments, decompresses the spliced recording data file and converts the spliced recording data file into a complete voice file which can be identified; the transfer scheduling module is used for scheduling the corresponding voice file to be sent to the voice transfer unit according to the transfer command;
The scene segmentation unit is used for carrying out scene segmentation or speaker segmentation on the voice file;
The standardized sentence vector model library comprises a plurality of standardized sentence vector models; the standardized sentence vector model is obtained by carrying out sentence vector similarity calculation on sentence samples in a corpus based on a neural network model and carrying out standardized training on sentence vectors meeting similarity thresholds;
The voice transcription unit is used for recognizing the voice file based on the intelligent voice recognition engine and transcribing the voice file into text content; the voice transcription unit further comprises a standardized processing module, wherein the standardized processing module is used for carrying out sentence segmentation on the text content, calculating sentence vectors of each sentence, selecting a standardized sentence vector model corresponding to each sentence from the standardized sentence vector model library, and carrying out standardized training on each sentence so as to output a corresponding standardized sentence; the intelligent voice recognition engine comprises an acoustic model and a language model, which respectively correspond to the calculation of feature-to-syllable probabilities and syllable-to-word probabilities for the audio features extracted from the voice file; the language model is modeled with an N-Gram model; the acoustic model is modeled with a deep neural network and a hidden Markov model;
The data analysis unit is used for analyzing the text content and the voice file based on the neural network model and outputting an analysis report; the data analysis unit includes: the audio analysis module, which comprises a silence interval detection module, a speech speed detection module and an emotion detection module and is used for performing silence interval analysis, speech speed analysis and emotion analysis on the voice file, respectively; and the text analysis module, which is used for analyzing the text content;
a database unit for storing the voice file, text content and analysis report;
A content index unit for retrieving data stored in a database according to an index command, the content index unit comprising: the data storage module is used for storing the text content and the accompanying data generated in the transcription process; the data query module is used for querying and aggregating the voice files and accompanying data according to preset query conditions and displaying the results; the word library is internally provided with a custom word segmentation word library, so that the word segmentation of the power related sentences can be accurately carried out; the text word segmentation processing module is used for marking the word segmentation of the text content based on the word segmentation result and extracting each word segmentation to generate structured text data; and the text clustering module is used for carrying out text clustering processing and deep cross analysis on the structured text data generated in a time period to obtain text related clustering information in the time period.
2. The system of claim 1, further comprising a sound recording distribution module for transmitting the voice file to a voice transcription unit, a data analysis unit, and a database unit, respectively.
CN201910082514.9A 2019-01-28 Operation management analysis system based on intelligent voice technology Active CN111489743B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910082514.9A CN111489743B (en) 2019-01-28 Operation management analysis system based on intelligent voice technology

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910082514.9A CN111489743B (en) 2019-01-28 Operation management analysis system based on intelligent voice technology

Publications (2)

Publication Number Publication Date
CN111489743A CN111489743A (en) 2020-08-04
CN111489743B true CN111489743B (en) 2024-06-25


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Customer service intelligent operation management system solution based on intelligent voice analysis; Huang Yi; Science & Technology Communication (《科技传播》); pp. 121-123 *

Similar Documents

Publication Publication Date Title
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
Bain et al. Whisperx: Time-accurate speech transcription of long-form audio
CN110853649A (en) Label extraction method, system, device and medium based on intelligent voice technology
Juang et al. Automatic recognition and understanding of spoken language-a first step toward natural human-machine communication
CN101930735B (en) Speech emotion recognition equipment and speech emotion recognition method
US7231019B2 (en) Automatic identification of telephone callers based on voice characteristics
US8831947B2 (en) Method and apparatus for large vocabulary continuous speech recognition using a hybrid phoneme-word lattice
Mao et al. Speech recognition and multi-speaker diarization of long conversations
CN111489765A (en) Telephone traffic service quality inspection method based on intelligent voice technology
Bai et al. A Hybrid Approach to Acoustic Scene Classification Based on Universal Acoustic Models.
Kumar et al. A comprehensive review of recent automatic speech summarization and keyword identification techniques
WO2014203328A1 (en) Voice data search system, voice data search method, and computer-readable storage medium
Nouza et al. Making czech historical radio archive accessible and searchable for wide public
Kumar et al. Automatic spontaneous speech recognition for Punjabi language interview speech corpus
CN111414748A (en) Traffic data processing method and device
CN108597497B (en) Subtitle voice accurate synchronization system and method and information data processing terminal
CN111489743B (en) Operation management analysis system based on intelligent voice technology
Martinčić-Ipšić et al. Croatian large vocabulary automatic speech recognition
Rebai et al. Linto platform: A smart open voice assistant for business environments
Burileanu et al. Spontaneous speech recognition for Romanian in spoken dialogue systems
Mansikkaniemi Acoustic model and language model adaptation for a mobile dictation service
CN111489754B (en) Telephone traffic data analysis method based on intelligent voice technology
Hansen et al. Audio stream phrase recognition for a national gallery of the spoken word:" one small step".
CN111489743A (en) Operation management analysis system based on intelligent voice technology
Abad et al. Automatic classification and transcription of telephone speech in radio broadcast data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant