CN114595324A - Method, device, terminal and non-transitory storage medium for power grid service data domain division - Google Patents

Method, device, terminal and non-transitory storage medium for power grid service data domain division

Info

Publication number
CN114595324A
CN114595324A
Authority
CN
China
Prior art keywords
data
feature
vector
word
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111159160.7A
Other languages
Chinese (zh)
Inventor
沈亮
欧阳红
何鑫
高士杰
朱广新
陈翔
廖小琦
张鹏宇
李杏
占震滨
陈小明
张伟
颜克礼
刘玉玺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Big Data Center Of State Grid Corp Of China
State Grid Zhejiang Electric Power Co Ltd
Beijing China Power Information Technology Co Ltd
Original Assignee
Big Data Center Of State Grid Corp Of China
State Grid Zhejiang Electric Power Co Ltd
Beijing China Power Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Big Data Center Of State Grid Corp Of China, State Grid Zhejiang Electric Power Co Ltd, Beijing China Power Information Technology Co Ltd filed Critical Big Data Center Of State Grid Corp Of China
Priority to CN202111159160.7A priority Critical patent/CN114595324A/en
Publication of CN114595324A publication Critical patent/CN114595324A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/3332Query translation
    • G06F16/3335Syntactic pre-processing, e.g. stopword elimination, stemming
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/044Recurrent networks, e.g. Hopfield networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/047Probabilistic or stochastic networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04SSYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Biophysics (AREA)
  • Databases & Information Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present disclosure provides a method, an apparatus, a terminal and a non-transitory storage medium for power grid service data domain division, including: judging, for power grid service data, the features of the power grid service data, wherein the features include descriptive features and discrete features; determining, by a first model, the category of first data whose features are descriptive features; determining, by a second model, the category of second data whose features are discrete features; and performing power grid service data domain division by using the category of the first data and the category of the second data. The power grid service data domain division method can handle the classification of different types of data.

Description

Method, device, terminal and non-transitory storage medium for power grid service data domain division
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a method, a device, a terminal and a non-transitory storage medium for power grid service data domain division.
Background
In a data model of a power grid service, related features such as tables need data processing, and semantic source data must be intelligently extracted as input for intelligent analysis. Current data models are huge in number and cover extremely many professional categories; many terms are difficult to determine through grammatical rules and need to be described jointly by constructing multi-feature semantic information. Moreover, because the feature space is huge, the feature representation of a domain is unclear and the entity connection relations of the related tables are unclear, which easily causes domain name confusion.
Disclosure of Invention
In order to solve the problems in the prior art, the present disclosure provides a method, an apparatus, a terminal and a non-transitory storage medium for power grid service data domain division.
The present disclosure provides a method for power grid service data domain division, which includes:
judging, for power grid service data, the features of the power grid service data; wherein the features include descriptive features and discrete features;
determining, by a first model, a category of first data for which the feature is a descriptive feature;
determining a category of second data for which the feature is a discrete feature by a second model; and
performing power grid service data domain division by using the category of the first data and the category of the second data.
In an embodiment of the present disclosure, the determining the characteristic of the data includes:
preprocessing the data; wherein the pre-processing comprises: at least one of splitting matching, counting, word segmentation and removal of stop words; and
extracting the features of the preprocessed data through a bag of words model and/or a feature extraction model based on word vectors.
In an embodiment of the present disclosure, the determining, by the first model, the category of the first data of which the feature is the descriptive feature includes:
extracting domain name information, label information and word information from the first data;
vectorizing the domain name information and the label information to obtain vectorized data;
vectorizing the word information to obtain a word vector;
obtaining an attention value according to the word vector;
converting the word vector into a fixed-length vector; and
classifying the vectorization data and the fixed-length vector according to the attention value to obtain a classification result.
In an embodiment of the disclosure, the first model comprises: a text embedding ALE layer, an ATT layer, a TextRNN layer and a softmax perception layer; the ATT layer comprises a single layer or multiple layers;
wherein, when the ATT layer is a single layer, the method further comprises:
obtaining corresponding weights according to the word vectors;
weighting the word vectors according to the corresponding weights to obtain weighted values; and
summing the weighted word vectors to obtain the attention value;
wherein, when the ATT layer is a multilayer, the method further comprises:
converting the word vector to a sentence vector on a first layer; and
converting the sentence vector to a paragraph vector on a second layer.
In an embodiment of the present disclosure, determining, by the first model, the category of the first data for which the feature is a descriptive feature further includes:
acquiring a thesaurus of keywords;
extracting, from the thesaurus, the category label with the highest occurrence frequency for each keyword; and
re-encoding the first data such that each data group in the first data corresponds to a multi-dimensional vector; wherein each dimension of the multi-dimensional vector characterizes the occurrence statistics of one category label in the data group.
In an embodiment of the present disclosure, the determining, by the second model, the category of the second data of which the feature is a discrete feature includes:
extracting domain continuous features from the first data;
extracting discretization features from the second data;
performing second-order cross calculation on the domain continuous features and the discretization features to obtain a calculation result; and
obtaining a classification result according to the calculation result and the high-dimensional embedding of the domain continuous features on the deep side.
In embodiments of the present disclosure, the discretization feature comprises a domain discretization feature and a source system discretization feature;
wherein the method further comprises:
representing the domain discretization features with discretized one-hot vectors; and
encoding the source system discretization features with discrete value coding.
The present disclosure provides an apparatus for power grid service data domain division, including:
the judging module is used for judging the characteristics of the power grid service data;
the determining module is used for determining the category of the first data of which the characteristic is a descriptive characteristic through a first model and determining the category of the second data of which the characteristic is a discrete characteristic through a second model; and
the domain division module is used for performing domain division on the power grid service data by using the category of the first data and the category of the second data.
The present disclosure proposes a terminal, comprising:
at least one memory and at least one processor;
wherein the at least one memory is configured to store program code, and the at least one processor is configured to call the program code stored in the at least one memory to perform any of the methods described above.
The present disclosure proposes a non-transitory storage medium for storing program code which, when executed by a computer device, causes the computer device to perform the method of any of the above.
The technical scheme of the present disclosure has the following positive effects:
(1) For the problem of identifying and classifying domain-name-related semantic attributes of the power grid data model, a first model, the ALE-TextRNN model, is proposed to classify table description features according to text polarity; it captures the importance of different context information for a given class tendency, helps resolve fine class differences in multi-classification tasks, and reduces class confusion.
(2) For the field and source-system features of the table, the improved DeepFM model is used to mine the semantic dependencies among discrete, categorical features; it increases the diversity of the feature distribution without manually designed input features, the embedding layer can be supervised from the FM side, feature dimensionality and redundancy are effectively reduced, and combined calculation performance is improved.
(3) The semantics of multi-form features are mined and identified, the evaluation mode is redefined, and the side with the higher regression prediction score of the two parts determines the final result, yielding more accurate semantics and a more ideal domain division effect.
Drawings
The features and advantages of the invention, as well as the technical and industrial significance of exemplary embodiments, will be described in detail below with reference to the accompanying drawings, wherein like reference numerals indicate like elements, and wherein:
fig. 1 is a flowchart of a power grid service data domain division method according to an embodiment of the present disclosure.
Fig. 2 is a schematic diagram of a fastText in accordance with an embodiment of the present disclosure.
FIGS. 3a-3b are schematic diagrams of a TextCNN according to an embodiment of the present disclosure.
FIG. 4 is a schematic diagram of a TextRNN according to an embodiment of the present disclosure.
FIG. 5 is a frame diagram of a TextRNN + attention according to an embodiment of the present disclosure.
FIG. 6 is a schematic diagram of Hierarchical Attention according to an embodiment of the present disclosure.
Fig. 7 is a schematic diagram of a Wide & Deep framework according to an embodiment of the present disclosure.
Fig. 8 is an architecture diagram of a first model according to an embodiment of the disclosure.
Fig. 9 is a schematic diagram of RNN principles of an embodiment of the present disclosure.
Fig. 10 is a schematic diagram of softmax of an embodiment of the present disclosure.
Fig. 11 is a schematic structural diagram of a second model of an embodiment of the disclosure.
Fig. 12 is a schematic structural diagram of a power grid service data domain dividing device according to an embodiment of the present disclosure.
Detailed Description
The present invention will now be described in further detail with reference to the accompanying drawings. These drawings are simplified schematic views illustrating only the basic structure of the present invention in a schematic manner, and thus show only the constitution related to the present invention.
In a data model of the power grid service, domain name features serve as the most essential boundary among data sources; they represent the common characteristics and application scope of all data under a table and are the basis for all subsequent semantic identification and knowledge mining. Therefore, the domain name of an entity table is used as an important attribute value for constructing the map: semantics related to the domain name are extracted from the description features or other features of the table, the different domains and the other attributes within them are classified according to these semantic attribute values, and the semantic attribute information is completed.
With the rapid development of electronic office work, mobile terminals and social media, a large number of short texts urgently awaiting processing have accumulated in Internet databases; the continuity and consistency of the original corpus data give them certain local regularities, so text representation and classification based on deep learning has become an important field in natural language processing. In application, text classification models mainly cover sentiment polarity analysis and subject category classification. Sentiment polarity analysis discovers user preferences by mining user feedback and can greatly help companies and manufacturers further popularize their products, while subject category classification helps mine public expectations, implement public opinion monitoring and identify sensitive topics.
Referring to fig. 1, fig. 1 is a flowchart of a power grid service data domain dividing method according to an embodiment of the present disclosure, including the following steps.
S100, judging, for power grid service data, the features of the power grid service data; wherein the features include descriptive features and discrete features.
Specifically, embodiments of the present disclosure may include preprocessing the data; wherein the pre-processing comprises: at least one of splitting matching, counting, word segmentation and removal of stop words; and extracting the features of the preprocessed data through a bag of words model and/or a feature extraction model based on word vectors.
More specifically, the embodiment of the present disclosure may first perform original sentence splitting and matching, or word statistics, on a text, followed by a series of preprocessing operations such as Chinese word segmentation and stop-word removal; text feature extraction is then performed, mainly in the following ways:
(1) Bag-of-words model
This embodiment establishes a dictionary containing all words in the training corpus and represents each word's unique identifier with a one-hot vector, whose dimension equals the number of words in the dictionary.
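A minimal sketch of this one-hot dictionary encoding; the toy corpus and tokenization are illustrative assumptions, not from the patent:

```python
corpus = [["power", "grid", "data"], ["grid", "domain", "division"]]

# Build a dictionary over all words in the training corpus.
vocab = sorted({w for doc in corpus for w in doc})
word_to_id = {w: i for i, w in enumerate(vocab)}

def one_hot(word):
    """Word vector dimension equals the dictionary size."""
    vec = [0] * len(vocab)
    vec[word_to_id[word]] = 1
    return vec

print(one_hot("grid"))
```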
(2) Text feature extraction: term frequency-inverse document frequency (TF-IDF)
This embodiment measures the importance of a word in a text with two quantities: the term frequency (TF) of the word within a document and the inverse document frequency (IDF) of the word across documents, which respectively reflect the importance of a word inside a document and between documents. The feature value is expressed as:

TF_IDF(i,j) = TF_{i,j} * IDF_i (1)

where TF_IDF(i,j) is the importance index of word i in document j, obtained by multiplying the term frequency by the inverse document frequency; TF_{i,j} denotes the normalized frequency of word i in document j,

TF_{i,j} = n_{i,j} / Σ_k n_{k,j},

with n_{i,j} the number of times word i appears in document j; and IDF_i denotes the inverse document frequency of word i,

IDF_i = log( |D| / (1 + |{j : word i appears in document j}|) ),

with |D| the total number of documents.
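A minimal sketch of formula (1) on a toy corpus; the documents below are illustrative assumptions:

```python
import math
from collections import Counter

docs = [["grid", "data", "data"], ["grid", "domain"], ["data", "model"]]

def tf(word, doc):
    counts = Counter(doc)
    return counts[word] / sum(counts.values())   # n_{i,j} / sum_k n_{k,j}

def idf(word, docs):
    df = sum(1 for d in docs if word in d)       # documents containing the word
    return math.log(len(docs) / (1 + df))        # smoothed inverse document frequency

def tf_idf(word, doc, docs):
    return tf(word, doc) * idf(word, docs)       # formula (1)

print(tf_idf("data", docs[0], docs))
```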
(3) Feature extraction model based on word vectors
In this embodiment, a neural network such as a word2vec model is trained on a large text corpus so that each word is mapped into the vector space; each word is then represented by a vector learned from massive corpora, improving both efficiency and the expression of syntactic and semantic regularities.
Deep learning text classification models are used for feature expression, mainly in the following ways:
(1) fastText: as shown in fig. 2, fig. 2 is a schematic diagram of fastText according to an embodiment of the present disclosure. In this embodiment, all word vectors in a sentence are averaged and then normalized, and local order information is obtained by means of the n-gram trick.
(2) TextCNN: as shown in FIGS. 3a-3b, FIGS. 3a-3b are schematic diagrams of a TextCNN according to an embodiment of the disclosure. Building on the fastText n-gram trick, this method attends to more local order information for classification and can achieve better prediction accuracy. However, because one-dimensional convolution is introduced, several convolution kernels of different sizes must be specified to obtain receptive fields of different widths, and a dynamic pooling (k-max pooling) method must be introduced to retain the k largest responses related to the global sequence.
(3) TextRNN: as shown in FIG. 4, FIG. 4 is a schematic diagram of a TextRNN according to an embodiment of the present disclosure. In this embodiment, TextRNN can more flexibly model longer text sequences and better express the contextual information of the text (a minimal code sketch follows this list).
(4) TextRNN + attention: as shown in FIG. 5, FIG. 5 is a framework diagram of TextRNN + attention according to an embodiment of the present disclosure. This embodiment introduces an attention mechanism to identify semantics related to sentiment polarity: by appending the aspect embedding N times to the hidden layer of a basic LSTM, important information responding to a given polarity can be captured. Sentiment classification means recognizing whether the sentiment tendency contained in a corpus is positive or derogatory and extracting the attitudes and viewpoints toward the subject the text describes; because such expression is often hidden and ambiguous, it is regarded as a special text classification problem. It can be learned by building supervised, semi-supervised and unsupervised tasks: a supervised sentiment classification task labels a large number of samples with fine-grained polarity words and learns a feature space with a multi-class algorithm; semi-supervised learning alleviates insufficient corpus labeling through co-training; and unsupervised learning obtains sentiment tendency by computing the pointwise mutual information between seed sentiment words and text words.
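As a rough illustration of the TextRNN structure in item (3) above, here is a minimal sketch (a bidirectional LSTM classifier; the vocabulary size, dimensions and use of the last time step are assumptions, not the patent's exact network):

```python
import torch
import torch.nn as nn

class TextRNN(nn.Module):
    """Minimal TextRNN sketch: embedding -> bidirectional LSTM -> linear classifier."""
    def __init__(self, vocab_size=10000, embed_dim=128, hidden=64, num_classes=10):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.rnn = nn.LSTM(embed_dim, hidden, batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden, num_classes)

    def forward(self, token_ids):                 # (batch, seq_len)
        h, _ = self.rnn(self.embed(token_ids))    # (batch, seq_len, 2*hidden)
        return self.fc(h[:, -1, :])               # last time step as the text vector

logits = TextRNN()(torch.randint(0, 10000, (2, 16)))
print(logits.shape)  # torch.Size([2, 10])
```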
S200, determining, through the first model, the category of the first data whose features are descriptive features.
Specifically, the embodiment of the present disclosure may include: extracting domain name information, label information and word information from the first data; vectorizing the domain name information and the label information to obtain vectorized data; vectorizing the word information to obtain word vectors; obtaining an attention value according to the word vectors; converting the word vectors into a fixed-length vector; and classifying the vectorized data and the fixed-length vector according to the attention value to obtain a classification result. The first model (ATT-ALE-TextRNN model) includes: a text-embedding ALE layer, an ATT layer, a TextRNN layer and a softmax perception layer, where the ATT layer may be a single layer or multiple layers. More specifically, when the ATT layer is a single layer, the embodiment may include: obtaining corresponding weights according to the word vectors; weighting the word vectors according to the corresponding weights to obtain weighted values; and summing the weighted word vectors to obtain the attention value. When the ATT layer is multi-layer, the embodiment may further include: converting the word vectors into a sentence vector on the first layer; and converting the sentence vectors into a paragraph vector on the second layer. In addition, the embodiment may further include: acquiring a thesaurus of keywords; extracting, from the thesaurus, the category label with the highest occurrence frequency for each keyword; and re-encoding the first data so that each data group in the first data corresponds to a multi-dimensional vector, where each dimension of the multi-dimensional vector characterizes the occurrence statistics of one category label in the data group.
In order to prevent confusion of domain names, the embodiment of the present disclosure feeds the fused features, obtained by concatenating the text information with the domain-name-related information, into a deep attention network to identify the most important information of the text under different domains. Referring to fig. 5, this embodiment learns hierarchical attention weights over words and sentences to obtain the core information in a sentence, and then performs the text classification task with the TextRNN structure. The scheme introduces a keyword weight learning model and adds a Hierarchical Attention Network to the TextRNN structure. Referring to fig. 6, fig. 6 is a schematic diagram of Hierarchical Attention according to an embodiment of the disclosure. The Hierarchical Attention Network targets longer text classification problems: the first layer represents the word vectors as a sentence vector through TextRNN + Attention, and the second layer represents the sentence vectors as a vector of the whole text passage with the same structure. A network containing only a single Attention layer, which in image learning applies different external weights to different pixels of the same feature map, can be transferred to a single-layer feature map here, weighting word by word according to the similarity between words within the corpus. See the following equations:
u_t = tanh(W_w h_t + b_w) (2)

α_t = exp(u_t^T u_w) / Σ_t exp(u_t^T u_w) (3)

s = Σ_t α_t h_t (4)

where t indexes the t-th word and h_t denotes the word vector of the t-th word; u_t is a hidden-layer representation of h_t, obtained from h_t through the single-layer neural network in formula (2); u_w is a randomly initialized weight vector trained as a parameter of the model; formula (3) applies softmax normalization to the hidden representation u_t of h_t to obtain α_t; and the sentence vector s is obtained in formula (4) by weighting each word vector h_t with its attention weight α_t.
Based on the above steps, the trained model achieves better text representation and higher classification accuracy, and expresses the importance of words and sentences for text classification in an intuitive way, improving model interpretability.
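A minimal sketch of equations (2)-(4); the hidden dimension and initialization are assumptions:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WordAttention(nn.Module):
    """u_t = tanh(W_w h_t + b_w); alpha_t = softmax(u_t^T u_w); s = sum_t alpha_t h_t."""
    def __init__(self, hidden_dim=64):
        super().__init__()
        self.proj = nn.Linear(hidden_dim, hidden_dim)          # W_w, b_w
        self.context = nn.Parameter(torch.randn(hidden_dim))   # u_w, randomly initialized

    def forward(self, h):                            # h: (batch, seq_len, hidden_dim)
        u = torch.tanh(self.proj(h))                 # equation (2)
        alpha = F.softmax(u @ self.context, dim=1)   # equation (3), shape (batch, seq_len)
        s = (alpha.unsqueeze(-1) * h).sum(dim=1)     # equation (4), sentence vector
        return s, alpha

s, alpha = WordAttention()(torch.randn(2, 12, 64))
print(s.shape, alpha.shape)  # torch.Size([2, 64]) torch.Size([2, 12])
```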
For descriptive continuous text features, semantic information related to the labels can be obtained with a text classification algorithm. The TextCNN algorithm extracts local order information of the text, while the TextRNN algorithm learns the contextual relations of the sequence without being limited by sequence length, making it better suited to corpora whose length cannot be fixed. In practical verification, TextCNN reaches an accuracy of 69%, while TextRNN, although slower to train, is slightly more accurate, reaching 73%.
For example, the feature words under two domain names may have a certain similarity, and the corpora are short, so category confusion easily occurs when representative features are missing. Therefore, in order to strengthen the semantic information relating corpora to categories, the embodiment of the present disclosure adds an attention mechanism from natural language processing (NLP) to focus on the core features of the original corpus. In addition, the word-frequency statistics of the dictionary are mapped to category label information, and the original corpus is re-encoded and added to the last hidden layer, jointly influencing the category prediction result.
Please refer to fig. 7; fig. 7 is a schematic diagram of the Wide & Deep framework according to an embodiment of the present disclosure. The embodiment of the present disclosure uses the wide & deep network architecture to generate a thesaurus of high-frequency keywords from the training corpus; the category label with the highest occurrence frequency is extracted for each keyword, and the original sentence is re-encoded so that each sentence corresponds to a 10-dimensional vector, each dimension of which counts the words of one category label in the sentence. These statistics can serve as category weights at the softmax output of the fully connected layer, strengthening the mapping from keywords to their corresponding labels and the category semantics in the original sentence.
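The re-encoding step can be illustrated with a minimal sketch; the keyword-to-label thesaurus below is a hypothetical stand-in for the one generated from the training corpus:

```python
from collections import Counter

NUM_LABELS = 10
keyword_to_label = {"voltage": 0, "meter": 3, "outage": 7}  # hypothetical thesaurus

def encode(sentence_tokens):
    """Each sentence becomes a 10-dimensional vector of per-label keyword counts."""
    counts = Counter(keyword_to_label[w] for w in sentence_tokens if w in keyword_to_label)
    return [counts.get(label, 0) for label in range(NUM_LABELS)]

print(encode(["voltage", "meter", "voltage", "reading"]))
# [2, 0, 0, 1, 0, 0, 0, 0, 0, 0]
```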
Referring to fig. 8, fig. 8 is a schematic diagram of a first model according to an embodiment of the disclosure. The embodiment of the present disclosure concatenates the topic to be strengthened, such as the label-aspect embedding, with the text information as input, and computes the attention score by appending the topic's polarity information N times to the hidden layer; meanwhile, the long short-term memory network facilitates serialized feature extraction, so polarity information and key text information can be better mined and used. The attention mechanism dynamically changes its weights according to the input data; for a neural network without an attention mechanism, the weights are independent of the input word vectors and are not dynamically adjusted with the input. The attention mechanism assigns different weights to different input word vectors, and these weights are obtained by comparing the word vectors with one another. Specifically, a word vector is input, attention weights the input word vectors, and the weighted sum of the word vectors and the weight values gives the corresponding attention value, i.e., the attention score.
In this method, the second-level domain name information and the text (table description) information are taken as input; a bidirectional RNN is then trained with shared weight information, the obtained domain name features and text features are fused, and the fused features are finally processed by a deep attention mechanism. This effectively recognizes the polarity tendencies of different domain names in the text and improves the whole model's ability to recognize domain-name-related semantics.
The layers included in the first model (ATT-ALE-TextRNN model) will be described below, respectively: text embedding ALE layer, ATT layer, TextRNN layer, and softmax perception layer.
(1) Text embedding ALE layer
The embodiment of the present disclosure introduces an ALE module/layer (aspect-level embedding) for fine-grained polarity classification. In this embodiment, the original corpus encoding and the second-level domain name encoding are fed together at the input, which has the following advantages. First, by training the aspect-level embedding into another vector space, the aspect information can be used more fully, further strengthening the domain-name-related semantics in the original corpus. Second, it resolves the mismatch between the word vectors and the aspect-level embedding and captures the most important information responding to the given aspect. Given different second-level domain names, i.e., different aspect-levels, the model can capture the currently most important and most discriminative parts of the sentence.
(2) ATT layer
The embodiment of the present disclosure uses an attention mechanism to compute the weights between the second-level domain name and the output vectors of the original features after deep network processing, thereby measuring how much the content of the original corpus attends to domain-name-related semantics, letting the model notice different parts of the sentence and capture the latent relevance between content and domain name. For example, let H be the matrix composed of the hidden layer vectors [h_1, h_2, h_3, ..., h_N], where the hidden layer size is d and the length of the given sentence is N, and let v_la be defined as the label-aspect embedding; the attention mechanism then produces an attention weight vector α and a weighted hidden vector r that characterize the weighting of the sentence under a given classification polarity. See in particular the following formulas:

M = tanh([W_h H ; W_v (v_la ⊗ e_N)]) (5)

α = softmax(w^T M) (6)

r = H α^T (7)

where v_la ⊗ e_N denotes repeating the label-aspect embedding N times, i.e., attaching one copy of it to each hidden vector, N being the sentence length. The final sentence is characterized as follows:

h* = tanh(W_p r + W_x h_N) (8)

y = softmax(W_s h* + b_s) (9)

where h* may be considered the new feature representation of the original sentence after the given second-level domain name is added. A linear layer is then added to convert the sentence vector into a vector e whose length equals the number of categories, and finally a softmax layer converts e into a conditional probability distribution. The loss function is defined as the cross-entropy function.
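A minimal sketch of equations (5)-(9) as reconstructed above (dimensions are assumptions; H stacks the hidden vectors h_1..h_N):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AspectAttention(nn.Module):
    """Aspect-aware attention in the style the text describes: equations (5)-(9)."""
    def __init__(self, d=64, aspect_dim=64, num_classes=10):
        super().__init__()
        self.W_h = nn.Linear(d, d, bias=False)
        self.W_v = nn.Linear(aspect_dim, aspect_dim, bias=False)
        self.w = nn.Linear(d + aspect_dim, 1, bias=False)       # w^T in (6)
        self.W_p = nn.Linear(d, d, bias=False)
        self.W_x = nn.Linear(d, d, bias=False)
        self.out = nn.Linear(d, num_classes)                    # W_s, b_s in (9)

    def forward(self, H, v_la):                 # H: (N, d), v_la: (aspect_dim,)
        N = H.size(0)
        v_rep = self.W_v(v_la).expand(N, -1)    # v_la (x) e_N: repeat the aspect N times
        M = torch.tanh(torch.cat([self.W_h(H), v_rep], dim=1))  # (5)
        alpha = F.softmax(self.w(M).squeeze(-1), dim=0)         # (6)
        r = alpha @ H                                           # (7)
        h_star = torch.tanh(self.W_p(r) + self.W_x(H[-1]))      # (8)
        return F.softmax(self.out(h_star), dim=-1)              # (9)

y = AspectAttention()(torch.randn(12, 64), torch.randn(64))
print(y.shape)  # torch.Size([10])
```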
(3) TextRNN layer
A neural network model can represent variable-length text as a fixed-length vector. Such models typically consist of a projection layer that maps words, sub-word units or n-grams to vector representations (often trained with unsupervised methods), combined with different neural network architectures to model the text, such as neural bag-of-words models, convolutional neural networks, recursive neural networks, and so on. The Recurrent Neural Network (RNN), owing to its recurrent structure, is very suitable for processing variable-length text and natural language problems: at each time step, it recursively updates its internal hidden state according to the activation of the input sequence and the previous hidden state vector. Referring to fig. 9, fig. 9 is a schematic diagram of the RNN principle according to an embodiment of the disclosure. The embodiment of the present disclosure completes a sequence mapping from input vectors to a fixed-length vector through the RNN, then feeds a softmax layer to predict the probability distribution over categories for classification or other tasks; meanwhile, the network parameters are trained by minimizing the cross entropy between the predicted and true distributions.
To avoid gradient explosion or vanishing during training caused by the failure to learn long-range dependencies in a sequence, e.g., when a gradient vector grows or decays exponentially over time, the embodiment of the present disclosure further introduces an LSTM network. LSTMs with internal independent memory cells come in many variants. In this embodiment, the TextRNN model defines the LSTM units at each time step t as a set of vectors of size d, where each LSTM unit contains an input gate, a forget gate, an output gate, a hidden state and a memory cell, and d denotes the number of LSTM units. The formulas are as follows:
i_t = σ(W_i x_t + U_i h_{t-1} + V_i c_{t-1}) (10)

f_t = σ(W_f x_t + U_f h_{t-1} + V_f c_{t-1}) (11)

o_t = σ(W_o x_t + U_o h_{t-1} + V_o c_{t-1}) (12)

c~_t = tanh(W_c x_t + U_c h_{t-1}) (13)

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c~_t (14)

h_t = o_t ⊙ tanh(c_t) (15)

The hidden layer of an RNN has only one state h, which is very sensitive to short-term inputs; the state c added by LSTM preserves the long-term state. In the formulas above, at time t the inputs of the LSTM include the current input value x_t of the network, the LSTM output value h_{t-1} of the previous time, and the cell state c_{t-1} of the previous time; the outputs of the LSTM include the output value h_t of the current time and the cell state c_t of the current time t. In formula (10), W_i denotes the weight matrix of the input gate; in formula (11), W_f denotes the weight matrix of the forget gate.
(4) Softmax perception layer
Referring to fig. 10, fig. 10 is a schematic diagram of softmax according to an embodiment of the present disclosure. The embodiment of the disclosure can use a softmax regression model as an output layer of a deep learning network to output the probability of a certain sample on all possible categories in the form of predicted probability values. The formulas and principles are as follows:
S_i = e^{V_i} / Σ_j e^{V_j} (16)

where V_i denotes the output value of the i-th node; through the Softmax function, the multi-class output values are converted into a probability distribution in the range [0, 1] that sums to 1.
S300, determining, through the second model, the category of the second data whose features are discrete features.
Specifically, embodiments of the present disclosure may include: extracting domain continuous features from the first data; extracting discretization features from the second data; performing second-order cross calculation on the domain continuous features and the discretization features to obtain a calculation result; and obtaining a classification result according to the calculation result and the high-dimensional embedding of the domain continuous features on the deep side. The discretization features include domain discretization features and source system discretization features. More specifically, embodiments of the present disclosure may also include representing the domain discretization features with discretized one-hot vectors, and encoding the source system discretization features with discrete value coding.
In the CTR prediction of a recommendation system, whether a commodity can be recommended is decided according to the predicted click-through rate (CTR). Whether a user clicks an advertisement on an interface determines the advertisement's conversion rate, and there are many features describing the user, the advertisement and their crosses, such as age, gender, region, mobile phone model, the position, size, industry and real-time feedback information of the advertisement, the cross of advertisement CTR with gender, and so on. Since advertisement clicks are sparse events, many combined features appear very few times in the training data set, which directly leads to insufficient weight learning for such features and thus to overfitting. Therefore, when estimating CTR and judging whether an advertisement will be clicked, features are usually combined in addition to being used individually. Algorithms such as LR and GBDT treat all cross features as mutually independent even when two cross features are related from a business perspective; as a result, the parameters are also optimized independently, the correlations between feature services cannot be fully exploited, and overfitting results. Preferably, compared with LR and GBDT, the Factorization Machine (FM) algorithm extracts features through inner products of latent vectors and performs cross combination when features are sparse, so it can learn feature combinations that occur rarely or never; this automatically handles feature crossing, i.e., it learns cross features efficiently even when the combined features do not co-occur sufficiently. For example, suppose feature a and feature b never appear together in the training data, but feature b often co-occurs with feature c, and feature a also often co-occurs with feature c; the FM model can then still infer a certain correlation between feature a and feature b. The comparison between the LR method and FM is as follows:
y_LR(x) = w_0 + Σ_{i=1}^{n} w_i x_i + Σ_{i=1}^{n} Σ_{j=i+1}^{n} w_{ij} x_i x_j

y_FM(x) = w_0 + Σ_{i=1}^{n} w_i x_i + Σ_{i=1}^{n} Σ_{j=i+1}^{n} <v_i, v_j> x_i x_j

where x is an n-dimensional vector, x_i denotes the value of its i-th dimension and x_j the value of its j-th dimension, and w_{ij} is the corresponding cross weight; <v_i, v_j> denotes the dot product of vectors v_i and v_j, where v_i denotes the i-th dimension vector of the coefficient matrix V and v_j denotes the j-th dimension vector of the coefficient matrix V.
In addition to the linear part, the embodiment of the present disclosure introduces second-order cross terms while keeping linear training complexity. When classifying the domain name of a table, multiple field features can be treated like advertisement features: the high-dimensional discrete features are given low-dimensional dense embeddings and combined, so as to mine effective feature information. In this embodiment, field (domain) information is fused into the feature processing, features of the same nature are grouped into the same field, and the improved second-order cross term is:
y(x) = w_0 + Σ_{i=1}^{n} w_i x_i + Σ_{i=1}^{n} Σ_{j=i+1}^{n} <v_{i,f_j}, v_{j,f_i}> x_i x_j

where sample x is an n-dimensional vector, x_i denotes the value of its i-th dimension and x_j the value of its j-th dimension, f_j denotes the field (domain) to which the j-th feature belongs, and v_{i,f_j} denotes the hidden vector of x_i corresponding to that field.
Computing second-order cross features in this way overcomes the computational complexity limitation, and the classification accuracy in practical verification reaches 62%.
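A minimal sketch of the field-aware second-order cross term above; the feature count, field assignment and latent dimension are assumptions:

```python
import numpy as np

n, num_fields, k = 6, 2, 4              # features, fields, latent dimension
rng = np.random.default_rng(0)
V = rng.normal(size=(n, num_fields, k)) # v_{i,f}: one latent vector per (feature, field)
field_of = np.array([0, 0, 0, 1, 1, 1]) # f_j: the field each feature belongs to

def second_order_cross(x):
    """sum over pairs of <v_{i,f_j}, v_{j,f_i}> * x_i * x_j."""
    total = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            v_ij = V[i, field_of[j]]    # v_{i, f_j}
            v_ji = V[j, field_of[i]]    # v_{j, f_i}
            total += v_ij @ v_ji * x[i] * x[j]
    return total

print(second_order_cross(rng.normal(size=n)))
```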
A DNN, also called a multi-layer perceptron, can be regarded as a neural network with many hidden layers, since locally it works like a perceptron: a linear mapping and an activation function together form a non-linear relationship. Its internal network structure can be roughly divided into three parts: an input layer, hidden layers and an output layer. The layers are fully connected, i.e., any neuron in one layer is connected to every neuron in the next layer, and this fully connected structure gives the capacity for high-order feature representation.
Referring to fig. 11, fig. 11 is a schematic structural diagram of a second model according to an embodiment of the disclosure. The second model (DeepFM model) in the embodiment of the present disclosure balances high-order feature representation with parallel computation: the domain discretization features and the source-system discretization features are respectively fed into the FM layer, and the domain continuous features are also added to the FM for second-order cross calculation with the discretization features. Combined with the high-dimensional embedding of the continuous features on the deep side, the logistic score and the softmax-normalized output are finally computed jointly, together influencing the prediction result.
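A minimal sketch of this combined FM-plus-deep scoring (not the patent's exact DeepFM variant; the feature layout and dimensions are assumptions):

```python
import torch
import torch.nn as nn

class TinyDeepFM(nn.Module):
    """FM side over shared embeddings plus a deep side, summed before the softmax."""
    def __init__(self, num_feats=20, k=8, num_classes=10):
        super().__init__()
        self.embed = nn.Embedding(num_feats, k)    # shared embeddings, supervised by the FM side
        self.linear = nn.Embedding(num_feats, 1)   # first-order weights
        self.deep = nn.Sequential(nn.Linear(3 * k, 32), nn.ReLU(), nn.Linear(32, num_classes))
        self.fm_out = nn.Linear(1, num_classes)

    def forward(self, feat_ids):                   # (batch, 3) discrete feature ids
        e = self.embed(feat_ids)                   # (batch, 3, k)
        # Second-order term via the sum-square trick.
        fm2 = 0.5 * ((e.sum(1) ** 2) - (e ** 2).sum(1)).sum(1, keepdim=True)
        fm = self.linear(feat_ids).sum(1) + fm2    # (batch, 1)
        deep = self.deep(e.flatten(1))             # (batch, num_classes)
        return torch.softmax(self.fm_out(fm) + deep, dim=-1)

p = TinyDeepFM()(torch.randint(0, 20, (4, 3)))
print(p.shape)  # torch.Size([4, 10])
```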
S400, performing power grid service data domain division by using the category of the first data and the category of the second data.
In the embodiment of the present disclosure, if the category of the first data agrees with the category of the second data, the agreed result can be taken as the final result with high confidence; otherwise, as noted in the advantages above, the model with the higher regression prediction score determines the final result.
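The fusion rule can be sketched as follows; the labels and scores here are illustrative:

```python
def fuse(label1, score1, label2, score2):
    """Agreement gives a high-confidence result; otherwise the higher score decides."""
    if label1 == label2:
        return label1
    return label1 if score1 >= score2 else label2

print(fuse("marketing", 0.73, "marketing", 0.62))  # marketing
print(fuse("marketing", 0.58, "dispatch", 0.81))   # dispatch
```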
The present disclosure provides a power grid data model domain division method based on natural language processing technology, which integrates the descriptive features and the discrete features of a table: an ATT-ALE-TextRNN text classification model is designed based on the descriptive features, and a DeepFM classification model is designed based on the discrete features.
The present disclosure provides a data model domain division method based on natural language processing technology. In the domain name classification task for data tables, the many features of a table can be divided, according to their form, into two kinds of input: continuous text features and other discretized categorical features. The recommendation algorithm field offers model architectures that combine continuous and discrete inputs; however, because the text features and discrete features of most tables are weakly related, overlapping words are lacking, and the field features have many categories, a traditional single model trains poorly. The present disclosure therefore predicts the descriptive features and the discrete features of the table with a suitable model each, and finally integrates the two models to achieve a better training effect.
Referring to fig. 12, an embodiment of the present disclosure further provides an apparatus 10 for power grid service data domain division, including a judging module 11, a determining module 13 and a domain division module 15. The judging module 11 is configured to judge the features of the power grid service data; the determining module 13 is configured to determine, by a first model, the category of first data whose features are descriptive features and to determine, by a second model, the category of second data whose features are discrete features; and the domain division module 15 is configured to perform domain division on the power grid service data according to the category of the first data and the category of the second data.
As the apparatus embodiments substantially correspond to the method embodiments, reference may be made to the corresponding descriptions of the method embodiments for the relevant points. The apparatus embodiments described above are merely illustrative; the modules described as separate modules may or may not be physically separate. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment, and those of ordinary skill in the art can understand and implement them without inventive effort.
By means of deep learning and natural language processing technologies, the designed and trained model achieves better text representation and higher classification accuracy; it can be applied to the power grid service data model to realize automatic semantic classification and association of the information under the domains of the power grid data model, and finally automatic mapping of the service data model.
It should be understood that the above-described specific embodiments are merely illustrative of the present invention and are not intended to limit the present invention. Obvious variations or modifications which are within the spirit of the invention are also within the scope of the invention.
In the present specification, whenever reference is made to "an exemplary embodiment", "a preferred embodiment", "one embodiment", or the like, it is intended that a particular feature, structure, or characteristic described in connection with the embodiment is included in at least one embodiment of the present invention. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment. Further, when a particular feature, structure, or characteristic is described in connection with any embodiment, it is submitted that it is within the purview of one skilled in the art to effect such feature, structure, or characteristic in other ones of all the embodiments described.
The embodiments of the present invention have been described above in detail. However, aspects of the present invention are not limited to the above embodiments. Various modifications and substitutions may be made to the above-described embodiments without departing from the scope of the present invention.

Claims (10)

1. A method for power grid service data domain division comprises the following steps:
judging, for power grid service data, the features of the power grid service data; wherein the features include descriptive features and discrete features;
determining, by a first model, a category of first data for which the feature is a descriptive feature;
determining a category of second data for which the feature is a discrete feature by a second model; and
performing power grid service data domain division by using the category of the first data and the category of the second data.
2. The method of claim 1, wherein the determining the characteristic of the data comprises:
preprocessing the data; wherein the pre-processing comprises: at least one of splitting matching, counting, word segmentation and removal of stop words; and
extracting the features of the preprocessed data through a bag of words model and/or a feature extraction model based on word vectors.
3. The method of claim 1, wherein determining the category of the first data for which the feature is a descriptive feature by the first model comprises:
extracting domain name information, label information and word information from the first data;
vectorizing the domain name information and the label information to obtain vectorized data;
vectorizing the word information to obtain a word vector;
obtaining an attention value according to the word vector;
converting the word vector into a fixed-length vector; and
classifying the vectorization data and the fixed-length vector according to the attention value to obtain a classification result.
4. The method of claim 3, wherein the first model comprises: a text embedding ALE layer, an ATT layer, a TextRNN layer and a softmax perception layer; the ATT layer comprises a single layer or multiple layers;
wherein, when the ATT layer is a single layer, the method further comprises:
obtaining corresponding weights according to the word vectors;
weighting the word vectors according to the corresponding weights to obtain weighted values; and
summing the weighted word vectors to obtain the attention value;
wherein, when the ATT layer is a multilayer, the method further comprises:
converting the word vector to a sentence vector on a first layer; and
converting the sentence vector to a paragraph vector on a second layer.
5. The method of claim 1, wherein determining the category of the first data for which the feature is a descriptive feature by the first model further comprises:
acquiring a thesaurus of keywords;
extracting, from the thesaurus, the category label with the highest occurrence frequency for each keyword; and
re-encoding the first data such that each data group in the first data corresponds to a multi-dimensional vector; wherein each dimension of the multi-dimensional vector characterizes the occurrence statistics of one category label in the data group.
6. The method of claim 1, wherein determining the category of the second data for which the feature is a discrete feature by the second model comprises:
extracting domain continuous features from the first data;
extracting discretization features from the second data;
performing second-order cross calculation on the domain continuous features and the discretization features to obtain a calculation result; and
obtaining a classification result according to the calculation result and the high-dimensional embedding of the domain continuous features on the deep side.
7. The method of claim 6, wherein the discretized features comprise a domain discretized feature and a source system discretized feature;
wherein the method further comprises:
representing the domain discretization features with discretized one-hot vectors; and
encoding the source system discretization features with discrete value coding.
8. An apparatus for power grid service data domain division, comprising:
the judging module is used for judging the characteristics of the power grid service data;
the determining module is used for determining the category of the first data of which the characteristic is a descriptive characteristic through a first model and determining the category of the second data of which the characteristic is a discrete characteristic through a second model; and
the domain division module is used for performing domain division on the power grid service data by using the category of the first data and the category of the second data.
9. A terminal, comprising:
at least one memory and at least one processor;
wherein the at least one memory is configured to store program code and the at least one processor is configured to invoke the program code stored in the at least one memory to perform the method of any of claims 1 to 7.
10. A non-transitory storage medium storing program code which, when executed by a computer device, causes the computer device to perform the method of any one of claims 1 to 7.
CN202111159160.7A 2021-09-30 2021-09-30 Method, device, terminal and non-transitory storage medium for power grid service data domain division Pending CN114595324A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111159160.7A CN114595324A (en) 2021-09-30 2021-09-30 Method, device, terminal and non-transitory storage medium for power grid service data domain division

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111159160.7A CN114595324A (en) 2021-09-30 2021-09-30 Method, device, terminal and non-transitory storage medium for power grid service data domain division

Publications (1)

Publication Number Publication Date
CN114595324A true CN114595324A (en) 2022-06-07

Family

ID=81813725

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111159160.7A Pending CN114595324A (en) 2021-09-30 2021-09-30 Method, device, terminal and non-transitory storage medium for power grid service data domain division

Country Status (1)

Country Link
CN (1) CN114595324A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115438128A (en) * 2022-09-16 2022-12-06 中国建设银行股份有限公司 Data processing method, device, equipment, storage medium and program product


Similar Documents

Publication Publication Date Title
CN111783474B (en) Comment text viewpoint information processing method and device and storage medium
CN111401077B (en) Language model processing method and device and computer equipment
CN113591483A (en) Document-level event argument extraction method based on sequence labeling
CN113255320A (en) Entity relation extraction method and device based on syntax tree and graph attention machine mechanism
CN107688870B (en) Text stream input-based hierarchical factor visualization analysis method and device for deep neural network
CN109086265B (en) Semantic training method and multi-semantic word disambiguation method in short text
Kaur Incorporating sentimental analysis into development of a hybrid classification model: A comprehensive study
CN112052684A (en) Named entity identification method, device, equipment and storage medium for power metering
CN110888980A (en) Implicit discourse relation identification method based on knowledge-enhanced attention neural network
CN112667782A (en) Text classification method, device, equipment and storage medium
CN113157859A (en) Event detection method based on upper concept information
Suyanto Synonyms-based augmentation to improve fake news detection using bidirectional LSTM
CN113705238A (en) Method and model for analyzing aspect level emotion based on BERT and aspect feature positioning model
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN115169349A (en) Chinese electronic resume named entity recognition method based on ALBERT
Pandey et al. Natural language generation using sequential models: a survey
CN113535928A (en) Service discovery method and system of long-term and short-term memory network based on attention mechanism
CN113516094A (en) System and method for matching document with review experts
CN114595324A (en) Method, device, terminal and non-transitory storage medium for power grid service data domain division
CN117271701A (en) Method and system for extracting system operation abnormal event relation based on TGGAT and CNN
CN116804998A (en) Medical term retrieval method and system based on medical semantic understanding
CN116956228A (en) Text mining method for technical transaction platform
CN113342964B (en) Recommendation type determination method and system based on mobile service
CN117216617A (en) Text classification model training method, device, computer equipment and storage medium
Viji et al. A hybrid approach of Poisson distribution LDA with deep Siamese Bi-LSTM and GRU model for semantic similarity prediction for text data

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination