CN115910066A - Intelligent dispatching command and operation system for regional power distribution network


Info

Publication number
CN115910066A
Authority
CN
China
Prior art keywords
voice
intelligent
unit
recognition
model
Prior art date
Legal status
Pending
Application number
CN202211118157.5A
Other languages
Chinese (zh)
Inventor
龚利武
徐操宇
施维佩
王吉宁
吕妤宸
阮晨捷
李嘉宾
潘白浪
Current Assignee
Pinghu General Electrical Installation Co., Ltd.
Original Assignee
Pinghu General Electrical Installation Co., Ltd.
Priority date
Filing date
Publication date
Application filed by Pinghu General Electrical Installation Co., Ltd.
Priority to CN202211118157.5A
Publication of CN115910066A


Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04: INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S: SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S 10/00: Systems supporting electrical power generation, transmission or distribution
    • Y04S 10/50: Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention discloses an intelligent dispatching command and operation system for a regional power distribution network. The system comprises an intelligent voice navigation module, an automatic generation module for planned dispatching instruction tickets, an information monitoring and diagnosis module, and an intelligent shift-handover identification module. The intelligent voice navigation module comprises a voice recognition engine unit and a semantic analysis engine unit; the voice recognition engine unit converts human speech directly into the corresponding text so that the processing module can understand it and generate the corresponding operations, enabling natural voice interaction between human and machine and converting audio stream data into text stream recognition results in real time. The disclosed system integrates artificial intelligence with the production operation system, realizes intelligent sensing, analysis, control and interaction of production operation services, and achieves the transformation toward intelligent production operation.

Description

Intelligent dispatching command and operation system for regional power distribution network
Technical Field
The invention belongs to the technical field of production and operation of regional power distribution networks, and particularly relates to an intelligent dispatching command and operation system for a regional power distribution network.
Background
As the power distribution network grows in scale and its grid structure becomes increasingly complex, the demands on distribution network operation management keep rising. Whether the distribution network operates normally directly affects an enterprise's service level and power supply quality, and is a main factor in evaluating a company's ability to meet its service obligations.
As the service hub of the distribution network, the distribution network production operation center faces several problems. First, the number of related information systems keeps growing, and the information-island phenomenon between these systems is serious. Second, production operation sites are numerous and widely dispersed, and production operation must repeatedly interface with many sites at the same time, creating a "hub congestion" effect in distribution network production operation business. Third, the massive data held by the production operation system remains to be mined. Fourth, the existing production operation auxiliary systems are built as traditional expert systems whose knowledge is acquired and maintained by manual editing; they scale poorly, and their knowledge reasoning is insufficiently automated and intelligent. Fifth, the production operation system offers only a single, low-efficiency interaction mode, lacking voice, touch and similar interaction modes. There is therefore an urgent need to improve the efficiency of distribution network production operation business and to re-engineer its processes.
Disclosure of Invention
The main purpose of the invention is to provide an intelligent dispatching command and operation system for a regional power distribution network that integrates artificial intelligence with the production operation system, realizes intelligent sensing, analysis, control and interaction of production operation services, and achieves the transformation toward intelligent production operation. Intelligent control means intelligently controlling the grid's operation mode and equipment state according to the analysis results, comprehensively improving grid controllability. Intelligent interaction is realized through technical means such as comprehensive display, intelligent information push and intelligent voice interaction, enabling information exchange with production operators, emergency-repair crews, operation and maintenance personnel, and users.
In order to achieve the above purposes, the invention provides an intelligent dispatching command and operation system for a regional power distribution network, used to guarantee the safe operation of the distribution network's production operation system. The system comprises an intelligent voice navigation module, an automatic generation module for planned dispatching instruction tickets, an information monitoring and diagnosis module, and an intelligent shift-handover identification module, wherein:
the intelligent voice navigation module (based on hybrid intelligence and human-machine coupling technology) comprises a voice recognition engine unit and a semantic analysis engine unit; the voice recognition engine unit converts human speech directly into the corresponding text so that the processing module can understand it and generate the corresponding operations, realizing natural voice interaction between human and machine, and converts audio stream data into text stream recognition results in real time based on a deep fully-sequential convolutional neural network framework; the semantic analysis engine unit performs semantic analysis on the recognition result obtained by the voice recognition engine unit (making the human-computer interaction process smoother);
the automatic generation module for planned dispatching instruction tickets extracts information and automatically generates the planned dispatching instruction ticket from it, and comprises a learning-phase unit and an extraction-phase unit;
the information monitoring and diagnosis module performs intelligent monitoring, diagnosis and early warning on production operation system information based on a text recognition unit and a speech synthesis unit;
the intelligent shift-handover identification module identifies the user's voiceprint and fuses it into the production operation system; voiceprint-based identity authentication is achieved through voiceprint registration and analysis, and the users handing over a shift are then identified by their voiceprints.
As a further preferred technical solution, the speech recognition engine unit comprises a Chinese punctuation intelligent prediction unit, a file format intelligent conversion unit, a front-end speech processing unit and a back-end recognition processing unit, wherein:
the Chinese punctuation intelligent prediction unit applies a language model to the recognition result produced by the speech recognition engine unit, intelligently predicting sentence breaks and punctuation for dialogue speech;
the file format intelligent conversion unit formats parameters appearing in the recognition result, including numbers, dates and times, and generates regularized text;
the front-end speech processing unit detects the input speech and applies noise-reduction preprocessing through signal processing methods, so as to obtain speech best matched to what the speech recognition engine unit processes; it comprises an endpoint detection subunit and a noise elimination subunit, where the endpoint detection subunit analyzes the input speech as an audio stream and determines where the user's speech starts and ends: once the start of speech is detected, the audio flows to the speech recognition engine unit until the end of speech is detected;
the back-end recognition processing unit comprises a personalized speech recognition subunit, a confidence output subunit, a multi-result recognition subunit, a speaker adaptation subunit and a semantic context self-correction subunit; the personalized speech recognition subunit collects and uploads frequently used words (hot words) based on the user's speech characteristics and builds a personalized entry language model from the business perspective, so as to adjust recognition parameters, modify the weights of the uploaded hot words and continuously optimize recognition; the confidence output subunit attaches a confidence score to each returned recognition result so that analysis and subsequent processing can proceed according to that score; the multi-result recognition subunit uses the confidence output subunit's results to return to the application not a unique result but a list of candidate recognition results that satisfy the conditions, ordered from high to low confidence; the speaker adaptation subunit extracts the speech characteristics of the call online and automatically adjusts recognition parameters over the user's repeated conversations with the speech recognition engine unit, continuously optimizing the recognition effect; the semantic context self-correction subunit dynamically corrects the speech recognition result according to its context, so that the result better fits the current context;
the semantic analysis engine unit comprises a rule understanding unit and a model understanding unit; the rule understanding unit performs rule matching; the model understanding unit comprises a semantic model training subunit, a semantic feature extraction subunit and a semantic similarity evaluation subunit; the semantic model training subunit models texts, mapping texts with similar semantics to nearby vectors in a semantic space; the semantic feature extraction subunit extracts features from text information, obtained after the semantic model training subunit finishes training by extracting a designated hidden vector of the model; and the semantic similarity evaluation subunit evaluates the similarity between the user sentence vector extracted by deep learning and the offline sentence vectors in the library.
As a further preferable embodiment, for the learning-phase unit:
in the learning phase, a labeled data set is given in which each sample comprises a token (character-unit) sequence and a label sequence, as follows:
\( (x^{(i)}, y^{(i)}) \), with \( x^{(i)} = (x^{(i)}_1, \ldots, x^{(i)}_{n_i}) \) and \( y^{(i)} = (y^{(i)}_1, \ldots, y^{(i)}_{n_i}) \)
where the superscript \( (i) \) denotes the i-th sample, the first component \( x^{(i)} \) is its token sequence, and the second component \( y^{(i)} \) is its label sequence;
a learning model is constructed from the existing labels and expressed as a conditional probability distribution \( P(y \mid x) \);
the extraction-phase unit uses different models as classifiers, thereby obtaining different ways of extracting the text information, and the extraction results obtained by the individual classifiers correct one another.
As a further preferable technical solution, the text recognition unit introduces contextual sequence information into text recognition and performs recognition through a neural network that models temporal relationships.
As a further preferable aspect, for the speech synthesis unit:
a whole-language semantic model extracts text semantic information based on an unsupervised, multi-language-fusion text pre-training model and builds language-independent text prediction branches for each language on top of the unified pre-training model;
the pronunciation system model automatically constructs a pronunciation dictionary of unified units through semi-supervised clustering and, by modularizing the language-dependent techniques, achieves rapid customization of the synthesis system under limited language resources;
a multi-language mixed model based on auditory quantization coding applies auditory quantization coding to the different attribute information in speech and predicts the speech's acoustic parameters;
a high-quality speech generation model based on a generative adversarial network performs high-quality multilingual speech synthesis by modeling both the overall structure and the local details of the spectral envelope.
As a further preferable technical solution, the intelligent shift-handover identification module comprises a model training-phase unit and a model prediction-phase unit:
the model training-phase unit trains the model using a data set of synchronized audio and video with phonetic text labels;
the model prediction-phase unit takes mixed multi-speaker speech together with the lip-movement video of a specific target person as input, recognizes the text sequence and simultaneously separates out the speech of the speaker matching the lip-movement video.
To achieve the above object, the invention further provides an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the program, the processor implements the intelligent dispatching command and operation system for a regional power distribution network.
To achieve the above object, the invention further provides a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the intelligent dispatching command and operation system for a regional power distribution network.
The beneficial effects of the invention are as follows:
firstly, the cognitive level of the operation characteristics of the power grid is further improved. Based on technologies such as video acquisition and OCR text recognition, intelligent monitoring, diagnosis and early warning of SOE information are achieved, and self-diagnosis of four states such as failure in coincidence, success in coincidence, no coincidence in tripping and grounding fault is achieved. And important information is broadcasted in a production operation command center through a TTS (text to transfer) broadcasting technology, so that the important information is ensured to be omitted.
And secondly, the production operation control efficiency and the task completion efficiency are further improved. Based on technologies such as simulated clicking, semantic understanding and text parsing, automatic generation of the power failure planning and scheduling instruction ticket is achieved, generation efficiency of the scheduling ticket and completion efficiency of scheduling tasks are improved, planning power failure and fault power failure time is further reduced, and self-healing level of a power grid is improved. Meanwhile, intelligent voice control and navigation of the production operation system are achieved based on voice, and control efficiency of the production operation system is improved.
Thirdly, the safety level of the production operation system is further promoted. The voiceprint authentication function of the production operation system based on the voiceprint recognition technology is realized, the operation authority of production operators is determined, the smooth shift is ensured, and the safe and efficient intelligent production operation system is built.
Drawings
Fig. 1 is a schematic diagram of an intelligent dispatching command and operation system for a regional power distribution network according to the present invention.
Detailed Description
The following description is presented to disclose the invention so as to enable any person skilled in the art to practice the invention. The preferred embodiments described below are by way of example only, and other obvious variations will occur to those skilled in the art. The basic principles of the invention, as defined in the following description, may be applied to other embodiments, variations, modifications, equivalents, and other technical solutions without departing from the spirit and scope of the invention.
In the preferred embodiment of the present invention, those skilled in the art should note that the power distribution network and the like to which the present invention relates may be regarded as the prior art.
Preferred embodiments.
The invention discloses an intelligent dispatching command and operation system for a regional power distribution network, used to guarantee the safe operation of the distribution network's production operation system. The system comprises an intelligent voice navigation module, an automatic generation module for planned dispatching instruction tickets, an information monitoring and diagnosis module, and an intelligent shift-handover identification module, wherein:
the intelligent voice navigation module (based on hybrid intelligence and human-machine coupling technology) comprises a voice recognition engine unit and a semantic analysis engine unit; the voice recognition engine unit converts human speech directly into the corresponding text so that the processing module can understand it and generate the corresponding operations, realizing natural voice interaction between human and machine, and converts audio stream data into text stream recognition results in real time based on a deep fully-sequential convolutional neural network framework; the semantic analysis engine unit performs semantic analysis on the recognition result obtained by the voice recognition engine unit (making the human-computer interaction process smoother);
the automatic generation module for planned dispatching instruction tickets extracts information and automatically generates the ticket from it, and comprises a learning-phase unit and an extraction-phase unit (based on information extraction technology, the planned dispatching instruction ticket is generated automatically: the contents of five key fields on the outage scheduling information detail page are parsed, and the ticket is generated from the parsed contents and a template, as sketched after this list);
the information monitoring and diagnosis module performs intelligent monitoring, diagnosis and early warning on production operation system information based on a text recognition unit and a speech synthesis unit (based on OCR text recognition, NLP and outbound-voice technologies, intelligent monitoring, diagnosis and early warning of SOE information are realized, and fault self-diagnosis is realized from the SOE text information and its relations);
the intelligent shift-handover identification module identifies users' voiceprints and fuses them into the production operation system; voiceprint-based identity authentication is achieved through voiceprint registration and analysis, and the users handing over a shift are then identified by voiceprint (a production operator must speak an agreed password when logging into the system, which determines from the password's voiceprint whether the user is authorized to operate; at shift handover, the system judges the identities of the two parties by voiceprint recognition and prompts that the handover has succeeded).
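By way of illustration only (this sketch is not part of the claimed system), template filling of the kind the ticket generation module performs once the key fields have been parsed can look as follows in Python; the field names and ticket layout are hypothetical, since the patent does not disclose the exact five key fields.

```python
# Hypothetical sketch: fill a dispatch-ticket template from parsed fields.
# Field names below are illustrative assumptions, not the patent's actual fields.
from string import Template

TICKET_TEMPLATE = Template(
    "Planned-outage dispatching instruction ticket\n"
    "Feeder: $feeder\n"
    "Device: $device\n"
    "Outage window: $start to $end\n"
    "Work content: $work"
)

def generate_ticket(fields: dict) -> str:
    """Fill the ticket template with fields parsed from the outage detail page."""
    return TICKET_TEMPLATE.substitute(fields)

print(generate_ticket({
    "feeder": "10 kV feeder F12",
    "device": "switch K3",
    "start": "2022-09-20 08:00",
    "end": "2022-09-20 17:00",
    "work": "replace pole-mounted transformer",
}))
```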
Specifically, the speech recognition engine unit comprises a Chinese punctuation intelligent prediction unit, a file format intelligent conversion unit, a front-end speech processing unit and a back-end recognition processing unit, wherein:
ASR (the speech recognition engine unit) is a technology that converts "voice" into "text": it turns human speech directly into the corresponding text for a computer to understand and act on, ultimately enabling natural voice interaction between human and machine. Audio stream data are converted into text stream results in real time based on a deep fully-sequential convolutional neural network framework.
To address dialect accents, background noise and similar problems in speech recognition applications, speech modeling is carried out with advanced discriminative training methods on massive speech data collected from the actual service system and covering different dialects and types of background noise, so that the speech recognizer performs well in complex application environments.
The Chinese punctuation intelligent prediction unit applies a language model to the recognition result produced by the speech recognition engine unit, intelligently predicting sentence breaks and punctuation for dialogue speech;
the file format intelligent conversion unit formats parameters appearing in the recognition result, including numbers, dates and times, and generates regularized text;
the front-end speech processing unit detects the input speech and applies noise-reduction preprocessing through signal processing methods, so as to obtain speech best matched to what the speech recognition engine unit processes; it comprises an endpoint detection subunit and a noise elimination subunit, where the endpoint detection subunit analyzes the input speech as an audio stream and determines where the user's speech starts and ends: once the start of speech is detected, the audio flows to the speech recognition engine unit until the end of speech is detected;
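For illustration, a minimal energy-based endpoint detector of the kind the endpoint detection subunit describes might be sketched as follows; the frame sizes and energy threshold are assumed values, and a production front end would add noise suppression and smoothing.

```python
# Minimal energy-based endpoint detection sketch (assumed thresholds).
import numpy as np

def detect_endpoints(samples: np.ndarray, rate: int = 16000,
                     frame_ms: int = 25, hop_ms: int = 10,
                     threshold: float = 0.02):
    """Return (start_sample, end_sample) of detected speech, or None for silence."""
    frame = rate * frame_ms // 1000
    hop = rate * hop_ms // 1000
    # Root-mean-square energy per frame.
    energies = [float(np.sqrt(np.mean(samples[i:i + frame] ** 2)))
                for i in range(0, len(samples) - frame, hop)]
    voiced = [i for i, e in enumerate(energies) if e > threshold]
    if not voiced:
        return None
    return voiced[0] * hop, voiced[-1] * hop + frame
```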
the back-end recognition processing unit comprises a personalized speech recognition subunit, a confidence output subunit, a multi-result recognition subunit, a speaker adaptation subunit and a semantic context self-correction subunit; the personalized speech recognition subunit collects and uploads frequently used words (hot words) based on the user's speech characteristics and builds a personalized entry language model from the business perspective, so as to adjust recognition parameters, modify the weights of the uploaded hot words and continuously optimize recognition; the confidence output subunit attaches a confidence score to each returned recognition result so that analysis and subsequent processing can proceed according to that score; the multi-result recognition subunit uses the confidence output subunit's results to return to the application not a unique result but a list of candidate recognition results that satisfy the conditions, ordered from high to low confidence; the speaker adaptation subunit extracts the speech characteristics of the call online and automatically adjusts recognition parameters over the user's repeated conversations with the speech recognition engine unit, continuously optimizing the recognition effect; the semantic context self-correction subunit dynamically corrects the speech recognition result according to its context, so that the result better fits the current context;
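The multi-result behaviour just described amounts to filtering an n-best list by confidence and sorting it from high to low; a minimal sketch with hypothetical candidates:

```python
# Sketch: keep candidate results above a confidence floor, best first.
def n_best(results: list[tuple[str, float]], floor: float = 0.6,
           limit: int = 5) -> list[tuple[str, float]]:
    kept = [(text, conf) for text, conf in results if conf >= floor]
    return sorted(kept, key=lambda r: r[1], reverse=True)[:limit]

# Hypothetical recognition candidates with confidence scores.
print(n_best([("close switch K3", 0.91),
              ("close switch K13", 0.72),
              ("chose switch K3", 0.41)]))
```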
the semantic analysis engine unit comprises a rule understanding unit and a model understanding unit; the rule understanding unit performs rule matching; the model understanding unit comprises a semantic model training subunit, a semantic feature extraction subunit and a semantic similarity evaluation subunit; the semantic model training subunit models texts, mapping texts with similar semantics to nearby vectors in a semantic space; the semantic feature extraction subunit extracts features from text information, obtained after the semantic model training subunit finishes training by extracting a designated hidden vector of the model; and the semantic similarity evaluation subunit evaluates the similarity between the user sentence vector extracted by deep learning and the offline sentence vectors in the library.
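The similarity evaluation step can be illustrated with plain cosine similarity between a user sentence vector and the library's offline sentence vectors; the vectors themselves are assumed to come from the trained semantic model.

```python
# Sketch: cosine similarity between sentence vectors (vectors assumed given).
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def best_match(query_vec: np.ndarray, library: dict) -> str:
    """Return the library sentence whose offline vector is closest to the query."""
    return max(library, key=lambda key: cosine(query_vec, library[key]))
```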
Further, for the learning-phase unit:
in the learning phase, a labeled data set is given in which each sample comprises a token (character-unit) sequence and a label sequence, as follows:
\( (x^{(i)}, y^{(i)}) \), with \( x^{(i)} = (x^{(i)}_1, \ldots, x^{(i)}_{n_i}) \) and \( y^{(i)} = (y^{(i)}_1, \ldots, y^{(i)}_{n_i}) \)
where the superscript \( (i) \) denotes the i-th sample, the first component \( x^{(i)} \) is its token sequence, and the second component \( y^{(i)} \) is its label sequence;
a learning model is constructed from the existing labels and expressed as a conditional probability distribution \( P(y \mid x) \);
the extraction-phase unit uses different models as classifiers, thereby obtaining different ways of extracting the text information, and the extraction results obtained by the individual classifiers correct one another.
More specifically, the text recognition unit introduces contextual sequence information into text recognition and performs recognition through a neural network that models temporal relationships.
The common models for extracting text information include the hidden Markov model, the maximum entropy Markov model, the conditional random field and the voted perceptron model. These models are specific implementations of the general text information extraction described above.
First, treat each label as independent of the sequence and of the other labels; this gives a conditional probability distribution for each token sequence and its label;
such a conditional probability distribution is in fact a classifier model, which yields the best label for each unit of the sample sequence.
On this basis, using different models as the classifier yields different text information extraction methods.
For example, take a maximum entropy model as the classifier and assume a first-order Markov property between the labels;
each conditional probability distribution is then a new classifier model, but one conditioned on the previous label, and the model becomes the maximum entropy Markov model.
It should be noted that the maximum entropy Markov model is a local model: because the labels have a first-order Markov property, it is trained on local data, so it may perform poorly when extracting global information and can suffer from the label bias problem. That problem can be addressed with more complex global models, the most common of which is the conditional random field. In a conditional random field each label depends on all the other labels, so the model can accurately describe the global labeling situation; it is more accurate than the maximum entropy Markov model but takes longer to train.
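For concreteness, decoding under a first-order label dependency (the maximum entropy Markov / conditional random field setting above) is ordinarily done with the Viterbi algorithm; the sketch below assumes the per-position scores and label-transition scores come from an already trained model.

```python
# Viterbi decoding sketch for a first-order label chain.
# emit[t][y]: local score of label y at position t (from a trained model).
# trans[y_prev][y]: transition score between consecutive labels.
import numpy as np

def viterbi(emit: np.ndarray, trans: np.ndarray) -> list:
    T, L = emit.shape
    score = emit[0].copy()                 # best score ending in each label
    back = np.zeros((T, L), dtype=int)     # backpointers
    for t in range(1, T):
        cand = score[:, None] + trans + emit[t][None, :]   # (prev label, label)
        back[t] = cand.argmax(axis=0)
        score = cand.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):          # follow backpointers
        path.append(int(back[t][path[-1]]))
    return path[::-1]
```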
Further, for the speech synthesis unit:
the method comprises the steps of extracting text semantic information based on an unsupervised text pre-training model with multinational language fusion, and constructing text prediction information branches of various language independent situations on the basis of a unified pre-training model (extracting the text semantic information and reducing the difficulty and quantity of text manual labeling based on a BERT unsupervised text pre-training model with multinational language fusion);
the pronunciation system model automatically constructs a pronunciation dictionary of a unified unit through a semi-supervised clustering method, and realizes the quick customization of a synthesis system by a language correlation technique modularization method under the condition of limited language resources;
the construction process of Global Phone is based on International telephonic alpha beta (IPA), and covers all pronunciation systems according to the physical pronunciation rules, including isolated language, sticky language, inflected language, and pronunciation types in various main language systems such as syllable explicit language (tone, non-tone), syllable non-explicit language (tone, non-tone), etc. The callouts are predefined by Global Phone and manually validated in data rich language,
in order to realize multi-language voice synthesis in a limited time, a multi-language synthesis unified framework based on Global Phone is provided, a pronunciation dictionary of a unified unit is automatically constructed by a VAE semi-supervised clustering method, a language correlation technique modularization method is adopted, the synthesis system is quickly customized under the condition of limited language resources, and the requirement of a voice system of resource limited languages such as a Chinese language or a dialect is met.
Firstly, a system module is divided into two modules of language correlation and language independence, wherein the language independence module mainly comprises a speech synthesis engine and a universal module of a speech recognition engine, in the research and development process of a multilingual speech system, the similarity between different languages in functional sub-modules of the synthesis system is concerned, and the speech phenomenon is observed in a uniform visual angle.
The language related module is mainly related to text processing and sound library processing, and comprises all modules required by speech synthesis, such as speech unit definition and classification, text normalization, word segmentation, prosodic structure prediction, speech database labeling and the like, wherein speech recognition can share the speech unit definition and classification, the text normalization, the word segmentation, the speech design, the speech data processing and the like. In the construction of language related information such as a dictionary, a part of speech, a prosodic hierarchy and the like, a Global Phone unified framework is used, and a same system is used for uniformly modeling a small language, so that the construction difficulty of a single language is reduced.
The construction process of Global Phone is based on International telephonic Alphabet (IPA), and consonants are mainly distinguished according to pronunciation parts and pronunciation methods; the vowels use three points of high and low, front and back, and round lip/non-round lip of the pronunciation position as main distinguishing dimensions, and use duration, nasalization and tightness as secondary distinguishing dimensions to construct a Global Phone system with uniform languages. And for the tonal language, carrying out uniform Global tone marking according to the class, the value and the domain.
And automatically learning a pronunciation sequence and a pronunciation dictionary of a sound library under a unified system facing to the languages with insufficient resources by semi-supervised VAE clustering under the support of labeled data on the languages with insufficient data resources and expert resources, and finally realizing a set of universal Global Phone multi-language pronunciation unified construction method.
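As a simplified stand-in for the semi-supervised VAE clustering step, the following sketch substitutes plain k-means purely for illustration: frame-level acoustic features pooled over the low-resource corpus are clustered into a unit inventory, and each utterance is then mapped to a unit sequence.

```python
# Simplified illustration only: k-means stands in for the VAE-based
# semi-supervised clustering used in the actual method.
import numpy as np
from sklearn.cluster import KMeans

def discover_units(features: np.ndarray, n_units: int = 64) -> KMeans:
    """features: (n_frames, feat_dim) acoustic features pooled over the corpus."""
    return KMeans(n_clusters=n_units, n_init=10, random_state=0).fit(features)

def to_unit_sequence(km: KMeans, utterance: np.ndarray) -> list:
    labels = km.predict(utterance)            # frame-level unit ids
    # Collapse consecutive repeats into a pronunciation-like unit sequence.
    return [int(u) for i, u in enumerate(labels)
            if i == 0 or u != labels[i - 1]]
```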
A multi-language mixed model based on auditory quantization coding applies auditory quantization coding to the different attribute information in speech and predicts the speech's acoustic parameters.
Since recording data for such languages is difficult to collect, the mixed use of multi-language, multi-speaker and multi-format data must be considered from the outset. A speech synthesis system is built with an acoustic model based on auditory quantization coding: the different attribute information in the speech is first given auditory quantization codes, and the speech's acoustic parameters are then predicted. Specifically, attribute codes such as speaker, language and emotional style are defined manually for the multi-speaker mixed speech; residual coding is introduced to describe the speaker's variation across states such as emotion, environment and time differences when the speech data were recorded; and the speech acoustic parameters are predicted by a fully-connected feed-forward (FF) network and a long short-term memory network (LSTM-RNN) combined with the text information. For new speakers with little corpus data, a transfer learning method is planned to improve the synthesis effect with small data volumes.
Assume a multi-speaker, multilingual mixed data set. The traditional acoustic modeling method for speech synthesis models it directly with a neural network. To control the synthesis speaker, language type, speaking style and so on, the information in the speech is decomposed into four kinds of auditory quantization codes: speaker coding, language coding, emotional style coding and residual coding. The speakers, languages and emotional styles in the speech are then explicitly given auditory quantization codes, and the joint distribution of these codes and the acoustic parameters is modeled directly;
the whole model is realized with neural networks and consists of two main parts: (1) the main network predicts the acoustic parameters given the text and the auditory quantization codes, and, by having all data share the same main network, lets the model learn how different speaker, language and style codes affect the synthesized speech; (2) the side branch network predicts the emotion codes given the text, language and speaker, and is likewise realized with a neural network.
During model training the speaker and language are given, and the emotion quantization codes are defined by manual labeling. Residual coding describes the variation that a speaker cannot manually label across states such as emotion, environment and time differences when recording the speech data, so each sentence is represented by its own residual code, which is randomly initialized and updated through model training. The whole model is trained with the minimum mean square error criterion, and gradients are updated with a stochastic gradient descent algorithm.
For new languages and new speakers with little corpus data, the project proposes a transfer learning method to improve the synthesis effect with small data volumes: the weights of the trained multi-speaker mixed model are used to initialize the new speaker's model, which is then fine-tuned with the new speaker's small amount of data, achieving rapid modeling of a new speaker from little data.
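A minimal PyTorch sketch of the acoustic model described above: speaker, language and style codes are embedded, broadcast over time, concatenated with the text features, and passed through feed-forward layers and an LSTM to predict acoustic frames. All dimensions are illustrative assumptions, not values from the patent, and the residual code is omitted for brevity.

```python
# Sketch (assumed dimensions) of an auditory-quantization-coded acoustic model.
import torch
import torch.nn as nn

class QuantCodedAcousticModel(nn.Module):
    def __init__(self, text_dim=256, n_spk=100, n_lang=8, n_style=5,
                 emb=32, hidden=512, acoustic_dim=80):
        super().__init__()
        self.spk = nn.Embedding(n_spk, emb)
        self.lang = nn.Embedding(n_lang, emb)
        self.style = nn.Embedding(n_style, emb)
        self.ff = nn.Sequential(
            nn.Linear(text_dim + 3 * emb, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, acoustic_dim)

    def forward(self, text, spk_id, lang_id, style_id):
        # text: (B, T, text_dim); ids: (B,)
        codes = torch.cat([self.spk(spk_id), self.lang(lang_id),
                           self.style(style_id)], dim=-1)    # (B, 3*emb)
        codes = codes.unsqueeze(1).expand(-1, text.size(1), -1)
        h, _ = self.lstm(self.ff(torch.cat([text, codes], dim=-1)))
        return self.out(h)  # (B, T, acoustic_dim); trained with an MSE criterion
```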
A high-quality speech generation model based on a generative adversarial network performs high-quality multilingual speech synthesis by modeling both the overall structure and the local details of the spectral envelope.
To build a high-quality multilingual speech synthesis system, the spectral modeling of speech is planned in two stages: 1) modeling the overall structure of the spectral envelope; 2) modeling the local details of the spectral envelope. A GAN model is used to model the local details of the spectral envelope in a language-independent way, given the overall frame of the envelope predicted by the acoustic model. Specifically, a convolutional neural network builds the mapping from the acoustic model's predicted spectrum to the natural speech spectral envelope, and the model is trained with the GAN criterion. The generator network in the GAN takes noise, text features and the low-dimensional Mel cepstrum as input to predict the spectral envelope features; the discriminator network, conditioned on the text features, discriminates between the spectral envelope predicted by the generator and the natural spectral envelope. Finally, when the data for a speaker in a specific language are limited, the fine structure within the spectral envelope is restored through a cross-language GAN model, improving the subjective quality of the speech.
The overall frame of the spectral envelope determines the pronunciation correctness of the speech, while the fine details of the envelope determine its voice quality. The Mel cepstrum is a decorrelated feature extracted from the spectral envelope: its low dimensions represent the envelope's overall structure and its high dimensions describe the envelope's local details. In speech synthesis the high dimensions of the Mel cepstrum strongly affect voice quality; comparing the global variance of the Mel cepstra of natural and synthesized speech, the low-dimensional global variances have similar distributions while the high-dimensional part differs much more, and the high-dimensional Mel cepstrum of synthesized speech is usually over-smoothed, which impairs voice quality. This shows that current statistical parametric speech synthesis models the overall structure of the spectral envelope accurately but models its local details poorly. Considering that the samples generated by a GAN model are closer to natural samples than those of traditional generative models trained under the maximum likelihood criterion, it is proposed to model the local details of the spectral envelope with a GAN model, given the overall frame of the envelope. Because CNNs model local structure well and the spectral envelope features have important local structure along the time and frequency axes, CNNs are chosen as the discriminator and generator networks of the GAN model, with the spectral envelope features as the prediction samples. Given the spectral envelope feature s of the speech, the corresponding text feature c and the corresponding low-dimensional Mel cepstrum feature m, the model does not directly model the conditional distribution P(s|c) of the spectral envelope given the text, but instead models the conditional distribution P(s|c, m) of the envelope given both the text feature and the low-dimensional Mel cepstrum;
the generator network in the GAN takes noise, text features and the low-dimensional Mel cepstrum as input to predict the spectral envelope features; the discriminator network, conditioned on the text features, discriminates between the spectral envelope predicted by the generator and the natural spectral envelope.
In the training stage, the natural low-dimensional Mel cepstrum and the text serve as conditional inputs. At test time there are no Mel cepstrum parameters, so another model must predict them, using either traditional statistical modeling or a GAN-based method; the project plans to model the low-dimensional Mel cepstrum parameters with an RNN-LSTM model. The spectral envelope parameters are then predicted by the GAN model conditioned on the text features and the low-dimensional Mel cepstrum, and the final speech is synthesized by combining the duration and fundamental frequency predicted by a traditional statistical parametric speech synthesis system. The spectrum generation can thus be regarded as a two-stage process: a model trained under the MSE (mean square error) criterion first predicts the low-dimensional Mel cepstrum parameters, fixing the basic frame of the spectral envelope, and a GAN model then predicts the fine structure within the envelope. This two-stage generation model is expected to markedly improve the quality of the generated speech and the listening experience.
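The two-stage inference flow might be sketched as follows, assuming the stage-1 RNN-LSTM predictor and the GAN generator have been trained elsewhere; the noise width and call signatures here are assumptions, not patent details.

```python
# Sketch of the two-stage spectral generation at inference time.
import torch

def synthesize_envelope(text_feats, stage1_model, gan_generator, noise_dim=64):
    # Stage 1: MSE-trained model predicts the low-dimensional Mel cepstrum,
    # fixing the overall frame of the spectral envelope.
    low_mcep = stage1_model(text_feats)
    # Stage 2: GAN generator fills in envelope detail, conditioned on the
    # text features, the low-dimensional cepstrum and a noise vector.
    noise = torch.randn(text_feats.size(0), text_feats.size(1), noise_dim)
    return gan_generator(torch.cat([text_feats, low_mcep, noise], dim=-1))
```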
Preferably, the intelligent shift-handover recognition module comprises a model training-phase unit and a model prediction-phase unit:
the model training-phase unit trains the model using a data set of synchronized audio and video with phonetic text labels;
the model prediction-phase unit takes mixed multi-speaker speech together with the lip-movement video of a specific target person as input, recognizes the text sequence and simultaneously separates out the speech of the speaker matching the lip-movement video.
The model training and testing steps are implemented as follows:
Step 1: data preprocessing
Data preprocessing covers image preprocessing, speech preprocessing and text-label preprocessing. Image preprocessing detects the facial feature points of the speakers in the video, scales the faces to the same size according to those points, and crops a fixed-size mouth-region image centered on the mouth midpoint, yielding a mouth-image video sequence; here the mouth-region image is 80 × 80 with three RGB channels, and the video frame rate is 25 fps. Speech preprocessing is of two kinds: (1) for recognition, 40-dimensional fbank features are extracted from the speech with a sliding window, the window length being 25 ms and the frame shift 10 ms in this example; (2) for separation, a short-time Fourier transform (STFT) turns the speech signal into a 2-channel spectrogram. Text-label preprocessing force-aligns the text's pronunciation phonemes onto the speech signal, one phoneme label for every 4 frames of the speech signal, so the text labels are in effect converted into phoneme labels at a 25 fps frame rate, synchronized with the video frame rate and one quarter of the audio frame rate.
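The frame-rate bookkeeping above (100 fps audio frames against 25 fps video frames) reduces to simple integer arithmetic, sketched here:

```python
# Sketch of the 4:1 audio-to-video frame alignment described above.
def audio_frame_to_video_frame(audio_frame_idx: int) -> int:
    return audio_frame_idx // 4          # 100 fps audio -> 25 fps video

def labels_at_video_rate(phoneme_per_audio_frame: list) -> list:
    # One phoneme label per 4 speech frames lands at the video frame rate.
    return phoneme_per_audio_frame[::4]
```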
Step two: joint training of speech recognition models and separation models
After enough training data are obtained, a speech recognition neural network model and a lip speech recognition neural network model are respectively built, and multi-mode speech recognition and separation combined training is carried out.
The input of the adopted multi-modal speech recognition and separation neural network model is audio and video with fixed duration of 3 s.
The audio characteristic input is a 40-dimensional voice frame fbank characteristic vector sequence of 100fps, the 3s duration is 300 × 40 fbank characteristic graphs in total, and after passing through a characteristic extraction module, the audio characteristic input is sampled by 4 times in the time dimension to obtain a 75 × 512-dimensional voice characteristic vector sequence; the video image input is an image sequence of 25fps, the image size is RGB three-channel images of 80X 80, the 3s duration is 75X 3X 80, and a 75X 512-dimensional video feature vector sequence is obtained after the image sequence passes through a feature extraction module; the multi-modal information fusion module fuses the 512-dimensional voice feature vectors extracted by the two modules with the 512-dimensional image feature vectors, the fusion can adopt feature splicing, and a new 75 x 512-dimensional fusion feature vector is generated through a small fusion neural network.
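A minimal sketch of the concatenation-based fusion just described; the two-layer fusion network is an assumed design, since the text specifies only feature splicing followed by a small fusion neural network.

```python
# Sketch (PyTorch): fuse 75 x 512 audio and video feature sequences to 512 dims.
import torch
import torch.nn as nn

class FusionModule(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(2 * dim, dim), nn.ReLU(),
                                 nn.Linear(dim, dim))

    def forward(self, audio_feats, video_feats):
        # audio_feats, video_feats: (B, 75, 512) -> fused: (B, 75, 512)
        return self.net(torch.cat([audio_feats, video_feats], dim=-1))
```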
The multimodal recognition task feeds the 75 × 512-dimensional fused features directly to the recognition module; after softmax classification they correspond one-to-one with the labeled triphone phonemes, and the loss function used is the cross-entropy loss.
The multimodal speech separation task uses a U-Net network structure whose original input is the 3 s speech signal, STFT-transformed into a 2 × 298 × 257 spectrogram. The U-Net structure downsamples and then upsamples the feature map, with a feature concatenation operation between the same-sized intermediate feature maps of the downsampling and upsampling modules; on the smallest intermediate feature vector, the 75 × 512-dimensional fused feature vectors are concatenated in as an intermediate condition that guides the U-Net's spectrogram reconstruction. An L2 loss is taken between the reconstructed spectrogram and the label spectrogram, and its back-propagated gradients update not only the U-Net module but also the feature fusion module and the audio and video feature extraction modules, forming a joint training effect with the multimodal recognition cross-entropy loss.
The training is multi-task-like: speech separation and recognition proceed synchronously, and because the two tasks share the same fused features, the feature extraction and fusion modules learn more compact multimodal joint information. Finally, the L2 loss and the cross-entropy loss are weighted and summed into the final loss, and the whole network is trained by a gradient back-propagation optimization algorithm.
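The final loss can be sketched directly as a weighted sum of the separation L2 loss and the recognition cross-entropy; the weight alpha is a tuning choice, not a value given in the patent.

```python
# Sketch: joint multi-task loss over shared fused features.
import torch.nn.functional as F

def joint_loss(recon_spec, target_spec, logits, phoneme_ids, alpha=0.5):
    sep = F.mse_loss(recon_spec, target_spec)                   # L2 separation loss
    rec = F.cross_entropy(logits.transpose(1, 2), phoneme_ids)  # (B,C,T) vs (B,T)
    return alpha * sep + (1 - alpha) * rec
```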
Step three: model testing and use
After the model training is finished, the multi-mode voice separation and recognition test can be carried out. The pre-processing of the voice and the image in the using process is consistent with the training process, and the details are not repeated here. Firstly, the audio and video characteristics form fusion characteristics through a fusion module, and then after the audio and video characteristics are identified into a triphone phoneme state through an identification module, the phoneme state is decoded into a character sequence through a Viterbi algorithm, so that the final identification effect is achieved. The fusion features can be spliced to the intermediate features of the separated and reconstructed U _ Net network to reconstruct a separated spectrogram, and the spectrogram can be restored into a separated voice signal after being subjected to ISTFT conversion.
The invention also discloses an electronic device comprising a memory, a processor, and a computer program stored in the memory and executable on the processor; when executing the program, the processor implements the intelligent dispatching command and operation system for a regional power distribution network.
The invention also discloses a non-transitory computer-readable storage medium on which a computer program is stored; when executed by a processor, the program implements the intelligent dispatching command and operation system for a regional power distribution network.
It should be noted that technical features such as the power distribution network involved in this application should be regarded as prior art; their specific structure, operating principle, control modes and spatial arrangement may follow conventional choices in the field, should not be regarded as the inventive points of this patent, and are not described in further detail here.
It will be apparent to those skilled in the art that modifications and equivalents can be made to the embodiments described above, or some features of the embodiments described above, and any modifications, equivalents, improvements, and the like, which fall within the spirit and principle of the present invention, are intended to be included within the scope of the present invention.

Claims (8)

1. An intelligent dispatching command and operation system for a regional power distribution network, used to guarantee the safe operation of the production operation system of the distribution network, characterized by comprising an intelligent voice navigation module, an automatic generation module for planned dispatching instruction tickets, an information monitoring and diagnosis module and an intelligent shift-handover identification module, wherein:
the intelligent voice navigation module comprises a voice recognition engine unit and a semantic analysis engine unit; the voice recognition engine unit converts human speech directly into the corresponding text so that the processing module can understand it and generate the corresponding operations, realizing natural voice interaction between human and machine, and converts audio stream data into text stream recognition results in real time based on a deep fully-sequential convolutional neural network framework; the semantic analysis engine unit performs semantic analysis on the recognition result obtained by the voice recognition engine unit;
the automatic generation module for planned dispatching instruction tickets extracts information and automatically generates the planned dispatching instruction ticket from it, and comprises a learning-phase unit and an extraction-phase unit;
the information monitoring and diagnosis module performs intelligent monitoring, diagnosis and early warning on production operation system information based on a text recognition unit and a speech synthesis unit;
the intelligent shift-handover identification module identifies the user's voiceprint and fuses it into the production operation system; voiceprint-based identity authentication is achieved through voiceprint registration and analysis, and the users handing over a shift are then identified by their voiceprints.
2. The intelligent dispatching command and operation system for the regional power distribution network according to claim 1, wherein the voice recognition engine unit comprises a Chinese punctuation intelligent prediction unit, a file format intelligent conversion unit, a front-end voice processing unit and a back-end recognition processing unit, wherein;
the Chinese punctuation intelligent prediction unit is used for intelligently predicting dialogue voice for the recognition result obtained by the recognition of the voice recognition engine unit through a language model so as to provide prediction of intelligent punctuation and punctuation;
the file format intelligent conversion unit formats parameters including numbers, dates and times appearing in the recognition result and generates a regular text;
the front-end voice processing unit detects and performs noise reduction pretreatment on the input voice through a signal processing method so as to obtain the voice which is most matched with the voice processed by the voice recognition engine unit, the front-end voice processing unit comprises an endpoint detection subunit and a noise elimination subunit, the endpoint detection subunit is used for analyzing the input voice in the form of audio stream and determining the start and the end of the user speaking, and when the user starts speaking is detected, the voice starts to flow to the voice recognition engine unit until the end of the user speaking is detected;
the rear-end recognition processing unit comprises an individualized voice recognition subunit, a confidence coefficient output subunit, a multi-result recognition subunit, a speaker self-adaption subunit and a semantic context self-correction subunit, wherein the individualized voice recognition subunit collects and uploads words with high utilization rate based on the voice characteristics of a user, and establishes an individualized entry language model from a business perspective so as to adjust recognition parameters, modify the weight of uploaded words and continuously optimize recognition; the confidence coefficient output subunit is used for carrying the confidence coefficient of the recognition result when the recognition result is returned, so that analysis and subsequent processing are carried out according to the confidence coefficient result; in the identification process of the multi-result identification subunit, returning a plurality of identification results meeting the conditions to the application program through the results of the confidence coefficient output subunit instead of unique results, providing a possible identification result list, and arranging according to the confidence coefficient results from high to low; the speaker self-adaptive subunit is used for extracting the voice characteristics of the call on line and automatically adjusting the recognition parameters in the process of multiple conversations between the user and the voice recognition engine unit so as to continuously optimize the recognition effect; the semantic context self-correction subunit is used for dynamically correcting according to the speech recognition result and the context pair in combination with the context dynamic correction, so that the result is more in line with the current context;
the semantic analysis engine unit comprises a rule understanding unit and a model understanding unit; the rule understanding unit is used for rule matching; the model understanding unit comprises a semantic model training subunit, a semantic feature extraction subunit and a semantic similarity evaluation subunit, wherein the semantic model training subunit is used for modeling texts and mapping texts with similar semantic expressions to nearby vectors in a semantic space; the semantic feature extraction subunit extracts features of the text information, obtained from a specific hidden vector of the model after the semantic model training subunit completes training; and the semantic similarity evaluation subunit evaluates the similarity between the user sentence vector extracted by deep learning and the offline sentence vectors stored in the library.
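The endpoint detection subunit in claim 2 is described only functionally; a minimal energy-based sketch of such detection is given below. The frame length, energy threshold and hangover length are illustrative assumptions; production front-ends typically adapt the threshold to the measured noise floor.

```python
import numpy as np

def detect_endpoints(audio: np.ndarray, sample_rate: int,
                     frame_ms: int = 20, energy_threshold: float = 0.02,
                     hangover_frames: int = 10):
    """Energy-based endpoint detection: return (start, end) sample indices
    of the detected speech region, or None if no speech is found."""
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(audio) // frame_len
    # Root-mean-square energy per frame.
    energies = np.array([
        np.sqrt(np.mean(audio[i * frame_len:(i + 1) * frame_len] ** 2))
        for i in range(n_frames)
    ])
    voiced = energies > energy_threshold
    if not voiced.any():
        return None
    start_frame = int(np.argmax(voiced))                      # first voiced frame
    last_voiced = int(len(voiced) - 1 - np.argmax(voiced[::-1]))  # last voiced frame
    end_frame = min(last_voiced + hangover_frames, n_frames - 1)
    return start_frame * frame_len, (end_frame + 1) * frame_len
```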
3. The intelligent dispatching command and operation system for the regional power distribution network according to claim 2, wherein, for the learning stage unit:
in the learning stage, a labeled data set is provided, in which each sample comprises a character unit sequence and a labeling sequence, as follows:

$$D = \{(x^{(i)}, y^{(i)})\}_{i=1}^{N}, \qquad x^{(i)} = (x^{(i)}_1, \dots, x^{(i)}_{n_i}), \qquad y^{(i)} = (y^{(i)}_1, \dots, y^{(i)}_{n_i})$$

where $(x^{(i)}, y^{(i)})$ denotes the $i$-th sample, the front part $x^{(i)}$ is the character unit sequence of the $i$-th sample, and the rear part $y^{(i)}$ is the labeling sequence of the $i$-th sample;
a learning model is constructed according to the existing labels and expressed by a conditional probability distribution (one standard instance is sketched after this claim);
the extraction stage unit takes different models as classifiers, so as to obtain extraction modes for different kinds of text information, and the extraction data obtained by the classifiers are used to correct one another.
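Claim 3 leaves the form of the conditional probability distribution open; a linear-chain conditional random field is one standard instance for labeled sequences of this kind (an assumption for illustration, not a limitation of the claim):

$$P(y \mid x) = \frac{1}{Z(x)} \exp\!\Big( \sum_{t=1}^{n} \sum_{k} \lambda_k\, f_k(y_{t-1},\, y_t,\, x,\, t) \Big), \qquad Z(x) = \sum_{y'} \exp\!\Big( \sum_{t=1}^{n} \sum_{k} \lambda_k\, f_k(y'_{t-1},\, y'_t,\, x,\, t) \Big)$$

where the $f_k$ are feature functions over adjacent labels and the input sequence, the $\lambda_k$ are learned weights, and $Z(x)$ normalizes over all candidate label sequences; training maximizes $P(y^{(i)} \mid x^{(i)})$ over the labeled data set.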
4. The intelligent dispatching command and operation system for the regional power distribution network according to claim 3, wherein the text recognition unit introduces contextual sequence information into text recognition and performs the recognition through a neural network that models temporal relationships (a minimal bidirectional-LSTM sketch follows this claim).
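Claim 4 names only "a neural network that models temporal relationships"; a bidirectional LSTM tagger is one common choice. The sketch below is written under that assumption, with illustrative vocabulary size, dimensions and tag count.

```python
import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """A bidirectional LSTM over character-unit embeddings: one common way
    to introduce contextual sequence information into recognition."""
    def __init__(self, vocab_size=4000, embed_dim=128, hidden_dim=256, n_tags=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.out = nn.Linear(2 * hidden_dim, n_tags)  # both directions concatenated

    def forward(self, token_ids):                      # (batch, seq_len)
        h, _ = self.lstm(self.embed(token_ids))        # (batch, seq_len, 2*hidden)
        return self.out(h)                             # per-position tag scores
```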
5. The intelligent dispatching command and operation system for the regional power distribution network according to claim 4, wherein, for the speech synthesis unit:
the full-language semantic model is an unsupervised text pre-training model based on multi-language fusion; it extracts text semantic information and, on the basis of the unified pre-training model, constructs a text prediction information branch independently for each language;
the pronunciation system model automatically constructs a pronunciation dictionary of unified units through a semi-supervised clustering method, and achieves rapid customization of the synthesis system under limited language resources by modularizing the language-dependent techniques;
the multi-language mixed model based on perceptual quantization coding performs quantization coding on different attribute information in the voice and predicts the acoustic parameters of the voice;
the high-quality speech generation model based on a generative adversarial network performs high-quality multilingual speech synthesis by modeling the overall structure of the spectral envelope and the local details of the spectral envelope (the standard adversarial objective is sketched after this claim).
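For the speech generation model of claim 5, the standard generative-adversarial training objective is as follows; the conditioning of the generator on acoustic features $c$ is an illustrative assumption, since the claim does not fix the conditioning scheme:

$$\min_G \max_D \; \mathbb{E}_{s \sim p_{\text{data}}}\big[\log D(s)\big] + \mathbb{E}_{z \sim p_z}\big[\log\big(1 - D(G(z \mid c))\big)\big]$$

where $G$ generates speech conditioned on $c$ and $D$ discriminates generated from real speech; the split between modeling the global envelope structure and its local details can then be carried by the conditioning features and the generator respectively.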
6. The intelligent dispatching command and operation system for the regional power distribution network according to claim 5, wherein the intelligent shift-handover recognition module comprises a model training stage unit and a model prediction stage unit:
the model training stage unit is used for training a model on a data set in which audio and video are synchronized and which carries voice-to-character labels;
and the model prediction stage unit applies the trained model: given aliased multi-speaker voice and the lip movement video of a specific target person as input, it recognizes the character sequence and separates out the voice of the speaker corresponding to the lip movement video content (a minimal audio-visual fusion sketch follows this claim).
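Claim 6 describes the prediction stage only at the interface level; the sketch below shows one minimal way to fuse a mixed-speech spectrogram with target-speaker lip features and predict a soft separation mask. All module choices and dimensions are illustrative assumptions, and audio frames and lip-feature frames are assumed time-aligned.

```python
import torch
import torch.nn as nn

class AVSeparator(nn.Module):
    """Minimal audio-visual separation sketch: fuse a mixed-speech
    spectrogram with lip-movement features of the target speaker and
    predict a soft mask selecting that speaker's speech."""
    def __init__(self, n_freq=257, lip_dim=512, hidden=256):
        super().__init__()
        self.audio_rnn = nn.LSTM(n_freq, hidden, batch_first=True,
                                 bidirectional=True)
        self.lip_proj = nn.Linear(lip_dim, 2 * hidden)
        self.mask_out = nn.Linear(2 * hidden, n_freq)

    def forward(self, mix_spec, lip_feats):
        # mix_spec: (batch, frames, n_freq); lip_feats: (batch, frames, lip_dim)
        a, _ = self.audio_rnn(mix_spec)
        fused = a + self.lip_proj(lip_feats)      # simple additive fusion
        mask = torch.sigmoid(self.mask_out(fused))
        return mask * mix_spec                    # masked spectrogram of target
```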
7. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, wherein the processor, when executing the program, implements the intelligent dispatching command and operation system for the regional power distribution network according to any one of claims 1 to 6.
8. A non-transitory computer-readable storage medium having a computer program stored thereon, wherein the computer program, when executed by a processor, implements the intelligent dispatching command and operation system for the regional power distribution network according to any one of claims 1 to 6.
CN202211118157.5A 2022-09-15 2022-09-15 Intelligent dispatching command and operation system for regional power distribution network Pending CN115910066A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211118157.5A CN115910066A (en) 2022-09-15 2022-09-15 Intelligent dispatching command and operation system for regional power distribution network

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211118157.5A CN115910066A (en) 2022-09-15 2022-09-15 Intelligent dispatching command and operation system for regional power distribution network

Publications (1)

Publication Number Publication Date
CN115910066A true CN115910066A (en) 2023-04-04

Family

ID=86475022

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211118157.5A Pending CN115910066A (en) 2022-09-15 2022-09-15 Intelligent dispatching command and operation system for regional power distribution network

Country Status (1)

Country Link
CN (1) CN115910066A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116229377A (en) * 2023-05-06 2023-06-06 成都三合力通科技有限公司 Personnel control alarm system and method
CN116229377B (en) * 2023-05-06 2023-08-04 成都三合力通科技有限公司 Personnel control alarm system and method
CN117194648A (en) * 2023-11-07 2023-12-08 福建神威系统集成有限责任公司 Intelligent charging pile management platform software method and system
CN117194648B (en) * 2023-11-07 2024-03-26 福建神威系统集成有限责任公司 Intelligent charging pile management platform software method and system
CN117370818A (en) * 2023-12-05 2024-01-09 四川发展环境科学技术研究院有限公司 Intelligent diagnosis method and intelligent environment-friendly system for water supply and drainage pipe network based on artificial intelligence
CN117370818B (en) * 2023-12-05 2024-02-09 四川发展环境科学技术研究院有限公司 Intelligent diagnosis method and intelligent environment-friendly system for water supply and drainage pipe network based on artificial intelligence

Similar Documents

Publication Publication Date Title
CN110491382B (en) Speech recognition method and device based on artificial intelligence and speech interaction equipment
CN111933129B (en) Audio processing method, language model training method and device and computer equipment
CN111739508B (en) End-to-end speech synthesis method and system based on DNN-HMM bimodal alignment network
Ghai et al. Literature review on automatic speech recognition
Vashisht et al. Speech recognition using machine learning
CN115910066A (en) Intelligent dispatching command and operation system for regional power distribution network
CN113470662A (en) Generating and using text-to-speech data for keyword spotting systems and speaker adaptation in speech recognition systems
CN113408385A (en) Audio and video multi-mode emotion classification method and system
CN112002308A (en) Voice recognition method and device
KR20200105589A (en) Voice emotion recognition method and system
CN112397054A (en) Power dispatching voice recognition method
Kumar et al. A comprehensive review of recent automatic speech summarization and keyword identification techniques
CN114420169B (en) Emotion recognition method and device and robot
Mohanty et al. Isolated Odia digit recognition using HTK: an implementation view
Sawakare et al. Speech recognition techniques: a review
Hore et al. Code-switched end-to-end Marathi speech recognition for especially abled people
Mendiratta et al. A robust isolated automatic speech recognition system using machine learning techniques
CN114512121A (en) Speech synthesis method, model training method and device
Brahme et al. Effect of various visual speech units on language identification using visual speech recognition
Feng et al. The effectiveness of unsupervised subword modeling with autoregressive and cross-lingual phone-aware networks
Dharmani et al. Performance evaluation of ASR for isolated words in Sindhi Language
Alashban et al. Language effect on speaker gender classification using deep learning
Oladipo et al. Accent identification of ethnically diverse Nigerian English speakers
CN113990288B (en) Method for automatically generating and deploying voice synthesis model by voice customer service
AU2020103587A4 (en) A system and a method for cross-linguistic automatic speech recognition

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination