CN112257425A - Power data analysis method and system based on data classification model

Info

Publication number: CN112257425A
Application number: CN202011051534.9A
Authority: CN
Inventors: 董阳; 张倩宜; 郑阳; 张驰; 赵迪
Original assignee: State Grid Corp of China SGCC; State Grid Tianjin Electric Power Co Ltd
Current assignee: State Grid Corp of China SGCC; State Grid Tianjin Electric Power Co Ltd
Priority date: 2020-09-29
Filing date: 2020-09-29
Publication date: 2021-01-22

Abstract

The invention discloses a power data analysis method based on a data classification model, comprising the following steps: S1, establishing a root database; S2, preprocessing a power document to obtain its target sentences, a target sentence being sentence information that requires word segmentation; S3, identifying the target sentence, calling the root database to match the target sentence, judging whether the target sentence contains keywords from an ambiguous word bank, generating a feature recognition result, and obtaining a multi-level label; S4, performing word segmentation according to the feature recognition result to form characters, and converting the characters into a feature vector matrix; and S5, inputting the feature vector matrix into a text classifier and outputting a grading result for the power document. The power document is segmented according to the relevant laws and regulations of the power system, the corresponding target sentences are extracted, the root database is matched against the target sentences to generate recognition results, and a grading result for the power document is output, which greatly improves the efficiency and speed of grading power data.

Description

Power data analysis method and system based on data classification model

Technical Field

The invention belongs to the technical field of data processing, and particularly relates to a power data analysis method and system based on a data classification model.

Background

China currently implements a graded protection system for information security, and its protection principle of 'zoning and key point division' is an effective means of addressing current information security problems. To advance the digital transformation of power companies and the centralized, unified management of power data, the problem of grading power data needs to be solved urgently. For the security grading of a power company in particular, it must be made clear which data can be shared and opened unconditionally, and which data, because of core business secrets or relevant laws and regulations, may be shared and opened only conditionally or not at all, so that data authorization, sharing, and opening can be carried out for different application scenarios.

At present, data management in the data sharing and exchange processes of power companies is disordered: the same or similar protection measures are applied to different data, and the protection granularity is coarse. This poses a serious hidden danger to the security of data sharing and exchange; if sensitive data is left unprotected, the interests of the enterprise and even national security can be seriously affected. Fine-grained protection of data is therefore an important part of information security.

Natural language processing, an important branch of artificial intelligence, is increasingly used in scenarios such as machine translation and intelligent question answering, and plays an increasingly important role. Word segmentation is the most basic step in natural language processing; text can be properly analyzed and recognized only after it has been accurately segmented.

At present, grading is done mainly by hand, relying on the background knowledge of professionals and on the relevant regulations. Manual grading depends on the ability of the staff, involves a heavy workload, and is inefficient and error-prone. The common mechanical word segmentation method is based on string matching; it is simple and efficient and works for simple language, but it handles complex ambiguous sentences poorly and cannot deal with ambiguity or new words. Word segmentation based on machine learning improves segmentation precision by building a statistical model and can learn new words, but it is more complex, requires training on a very large corpus at high cost, does not recognize dictionary words well, and leaves room for improvement in classification accuracy.

Therefore, to solve the above technical problems, feature words must be extracted not only from the description of the data being graded but also from the relevant legal provisions, with the weight of these feature words increased appropriately. A data analysis method capable of grading power data on the basis of a data classification model therefore needs to be developed.

Disclosure of Invention

The invention aims to provide a power data analysis method based on a data classification model that performs word segmentation on text data, extracts feature words accurately, and analyzes power data accurately.

Another object of the present invention is to provide a power data analysis system based on a data classification model.

The technical scheme of the invention is as follows:

A power data analysis method based on a data classification model comprises the following steps:

S1, establishing a root database;

S2, importing a power document, preprocessing the power document, and acquiring a target sentence of the power document, wherein the target sentence is sentence information requiring word segmentation;

S3, identifying the target sentence, calling the root database to match the target sentence, and judging whether the target sentence contains keywords from an ambiguous word bank; if so, generating a feature recognition result according to a word segmentation rule and obtaining a multi-level label;

S4, performing word segmentation according to the feature recognition result to form characters, and converting the characters into a feature vector matrix through a TF-IDF algorithm;

and S5, inputting the feature vector matrix into a text classifier, generating a data classification model, and outputting a grading result of the power document.

In the above technical solution, establishing the root database in S1 includes:

S10, manually collecting a large amount of text data as a corpus according to the relevant laws and regulations of the power system, to form initial training samples;

S11, importing the training samples into a training model to gradually form a root classification model;

S12, once classification has been formed, further training the root classification model through simulated real-world classification exercises, increasing the decision data and improving the ability of the root classification model to handle anomalies;

S13, after manual decision-making and relearning, feeding the result data back into the root classification model as training samples to train it again;

and S14, collecting the result data and establishing the root database.
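As a non-limiting illustration, steps S10 to S14 can be read as a human-in-the-loop retraining loop. The sketch below is only one possible reading: `RootModel` is a toy word-frequency model standing in for the unspecified training model, and `build_root_database`, `manual_decision`, and the label names used with them are hypothetical names introduced for this example.

```python
from collections import Counter, defaultdict


class RootModel:
    """Toy stand-in for the root classification model: counts how often each word carries each label."""

    def __init__(self):
        self.counts = defaultdict(Counter)

    def train(self, samples):
        # Each sample is a (text, label) pair.
        for text, label in samples:
            for word in text.lower().split():
                self.counts[word][label] += 1

    def predict(self, text):
        # Sum the label counts of every known word; return None if no word is known.
        votes = Counter()
        for word in text.lower().split():
            votes.update(self.counts.get(word, {}))
        return votes.most_common(1)[0][0] if votes else None


def build_root_database(initial_samples, simulated_texts, manual_decision):
    """Train, simulate, let a person decide, retrain, and collect the results."""
    model = RootModel()
    model.train(initial_samples)                  # S10, S11: initial samples form the model
    results = list(initial_samples)
    for text in simulated_texts:                  # S12: simulated classification exercises
        guess = model.predict(text)
        label = manual_decision(text, guess)      # S13: manual decision on the model's guess
        model.train([(text, label)])              # S13: result fed back as a training sample
        results.append((text, label))
    return model, results                         # S14: collected result data
```

Here the returned `results` list plays the role of the root database; a real system would presumably persist it rather than keep it in memory.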

In the above technical solution, in S10, the large amount of text data used as the corpus is obtained from the relevant laws and regulations of the power system, and a preset value N is used to remove homogeneous data from the corpus.

In the above technical solution, in S2, preprocessing the power document includes removing sensitive words, garbled characters, and punctuation marks, so as to remove redundant parts of the power document and thereby further filter it.

In the above technical solution, the matching of the target sentence in S3 includes fuzzy matching and regular-expression matching.

In the above technical solution, the ambiguous word bank in S3 includes a preset keyword set with ambiguous properties.

In the above technical solution, the word segmentation method in S4 fully segments sentences in the power document, reads characters in each line in the power document by establishing a TF-IDF structure, calculates the frequency of occurrence of each character, and establishes a feature vector matrix.

In the above technical solution, in S5, the feature vector matrix is converted into one input vector of the text classifier, the multi-level label is converted into another input vector of the text classifier, a data classification model is generated by invoking a text classifier training algorithm, and a grading result of the power document is output.

A power data analysis system based on a data classification model, comprising:

a preprocessing module, configured to receive the power document and acquire a target sentence of the power document;

a word segmentation module, configured to generate a feature recognition result by matching the root database against the target sentence and to obtain a multi-level label;

a character segmentation module, configured to perform character segmentation according to the feature recognition result to form characters and to generate a feature vector matrix;

and an output module, configured to generate a grading result of the power document after the feature vector matrix is input to the text classifier.

Further, the word segmentation module further comprises:

a judging module, configured to judge, according to the keywords in the ambiguous word bank, whether the target sentence contains ambiguous keywords;

and an identification module, configured to perform feature recognition on the target sentence after the ambiguous keywords have been judged, so as to generate a feature recognition result.
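As a non-limiting illustration, the four modules of the system can be chained as a simple pipeline. The sketch below assumes each module is supplied as a callable; the class name `PowerDataAnalysisSystem`, its field names, and the intermediate data types are hypothetical and are not taken from the source.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class PowerDataAnalysisSystem:
    """Chains the four modules; each field is a callable supplied by the caller."""

    preprocess: Callable[[str], list]          # power document -> target sentences
    segment_words: Callable[[list], dict]      # target sentences -> feature recognition result
    split_characters: Callable[[dict], list]   # recognition result -> feature vector matrix
    classify: Callable[[list], str]            # feature vector matrix -> grading result

    def run(self, document: str) -> str:
        # Pass the output of each module to the next one in order.
        sentences = self.preprocess(document)
        recognition = self.segment_words(sentences)
        matrix = self.split_characters(recognition)
        return self.classify(matrix)
```

Keeping the modules as injected callables lets each one be replaced or tested on its own, which matches the modular description above without fixing any particular implementation.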

The invention has the advantages and positive effects that:

1. Word segmentation is performed on the power document according to the relevant laws and regulations of the power system, the corresponding target sentences are extracted, the root database is matched against the target sentences to generate recognition results, and a grading result of the power document is output, which greatly improves the efficiency, speed, and accuracy of grading power data.

2. With data value at its core, a data analysis system is built from the combined perspective of security management and data management, so that power data is analyzed comprehensively, objectively, and accurately.

3. Data security management within the system is strengthened and its management strategy and granularity are refined, which better meets the data security management requirements of the big data era and provides a security guarantee for the dynamic service data of a big data platform.

Detailed Description

The present invention will be described in further detail with reference to specific examples. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the scope of the invention in any way.

As used herein, the singular forms "a", "an", and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or wirelessly coupled. As used herein, the term "and/or" includes all or any element and all combinations of one or more of the associated listed items.

It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.

Example 1

The power data analysis method based on a data classification model of the invention comprises the following steps:

S1, establishing a root database;

Further, establishing the root database in S1 includes:

and S14, collecting the result data and establishing the root database.

Further, in S10, the large amount of text data used as the corpus is obtained from the relevant laws and regulations of the power system, and a preset value N is used to remove homogeneous data from the corpus.
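The source does not define N or say how homogeneity is measured. Purely as an assumption for illustration, the sketch below takes N to be a character n-gram length and treats two texts as homogeneous when the Jaccard similarity of their n-gram sets reaches a threshold; the function names and the `threshold` parameter are hypothetical.

```python
def char_ngrams(text: str, n: int) -> set:
    """Return the set of character n-grams of `text` (the whole text if it is no longer than n)."""
    if len(text) <= n:
        return {text}
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def remove_homogeneous(corpus: list, n: int = 3, threshold: float = 0.8) -> list:
    """Keep a text only if its n-gram Jaccard similarity to every kept text is below `threshold`."""
    kept = []
    kept_grams = []
    for text in corpus:
        grams = char_ngrams(text, n)
        is_duplicate = any(
            len(grams & other) / len(grams | other) >= threshold
            for other in kept_grams
        )
        if not is_duplicate:
            kept.append(text)
            kept_grams.append(grams)
    return kept
```

The comparison is quadratic in the number of kept texts, which is acceptable for a sketch but would likely need an index or hashing scheme for a large legal corpus.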

Further, in S2, preprocessing the power document includes removing sensitive words, garbled characters, and punctuation marks, so as to remove redundant parts of the power document and thereby further filter it.
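As a non-limiting illustration, the preprocessing of S2 might look like the sketch below. It rests on several assumptions not stated in the source: the sensitive-word list `SENSITIVE_WORDS` is hypothetical, garbled text is taken to mean the Unicode replacement character and control or unassigned code points, punctuation is identified by Unicode category, and each non-empty line is treated as one target sentence.

```python
import unicodedata

# Hypothetical sensitive-word list; the source does not enumerate one.
SENSITIVE_WORDS = ("password", "id-number")


def preprocess(document: str, sensitive_words=SENSITIVE_WORDS) -> list:
    """Split a document into lines and strip sensitive words, garbled characters and punctuation."""
    sentences = []
    for line in document.splitlines():
        for word in sensitive_words:
            line = line.replace(word, "")
        cleaned = []
        for ch in line:
            if ch.isspace():
                cleaned.append(" ")  # normalise every kind of whitespace to a space
                continue
            category = unicodedata.category(ch)
            if ch == "\ufffd" or category.startswith("C"):
                continue  # garbled: replacement character or control/unassigned code point
            if category.startswith("P"):
                continue  # punctuation, ASCII and full-width alike
            cleaned.append(ch)
        sentence = " ".join("".join(cleaned).split())  # collapse runs of spaces
        if sentence:
            sentences.append(sentence)
    return sentences
```

Sensitive words are removed before punctuation so that list entries containing punctuation, such as the hypothetical `id-number`, still match.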

Further, the matching of the target sentence in S3 includes fuzzy matching and regular-expression matching.
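As a non-limiting illustration, the two matching modes can be sketched with the Python standard library, using a similarity ratio for fuzzy matching and a compiled regular expression for the other mode. The root entries in `ROOTS`, the voltage pattern `VOLTAGE_PATTERN`, and the `cutoff` value are hypothetical; the source specifies none of them.

```python
import re
from difflib import SequenceMatcher

# Hypothetical root entries; in the source these come from the root database built in S1.
ROOTS = ["dispatch plan", "customer account", "grid topology"]
# Hypothetical pattern for voltage levels such as "110kV".
VOLTAGE_PATTERN = re.compile(r"\b\d+(?:\.\d+)?\s*kV\b")


def fuzzy_matches(sentence: str, roots=ROOTS, cutoff: float = 0.8) -> list:
    """Return roots that approximately occur in `sentence` (sliding word window, similarity ratio)."""
    words = sentence.lower().split()
    hits = []
    for root in roots:
        size = len(root.split())
        # Compare the root against every window of the same number of words.
        windows = (" ".join(words[i:i + size]) for i in range(max(len(words) - size + 1, 1)))
        if any(SequenceMatcher(None, root, w).ratio() >= cutoff for w in windows):
            hits.append(root)
    return hits


def regex_matches(sentence: str, pattern=VOLTAGE_PATTERN) -> list:
    """Return every substring of `sentence` matched by the regular expression."""
    return pattern.findall(sentence)
```

Fuzzy matching tolerates misspellings such as "dispach plan", while the regular expression captures structured tokens that a fixed root list cannot enumerate.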

Further, the ambiguous word library in S3 includes a preset keyword set with ambiguous properties.
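As a non-limiting illustration, the check against the ambiguous word bank might be sketched as follows. The bank contents and the idea of storing a preferred split for each ambiguous keyword are assumptions made for this example; the source states only that the bank is a preset set of ambiguous keywords and that a word segmentation rule is applied when one is found.

```python
# Hypothetical ambiguous word bank: keyword -> how the segmentation rule should split it.
AMBIGUOUS_BANK = {
    "load shedding plan": ["load shedding", "plan"],
    "line loss rate": ["line loss", "rate"],
}


def recognize(sentence: str, bank=AMBIGUOUS_BANK) -> dict:
    """Report which ambiguous keywords occur in `sentence` and the segments the rule assigns to them."""
    lowered = sentence.lower()
    found = [kw for kw in bank if kw in lowered]
    segments = [seg for kw in found for seg in bank[kw]]
    return {"ambiguous": bool(found), "keywords": found, "segments": segments}
```

The returned dictionary stands in for the feature recognition result; how the multi-level label is derived from it is not described in the source and is left out here.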

Further, the word segmentation method in S4 fully segments the sentences in the power document, reads the characters in each line of the power document by establishing a TF-IDF structure, calculates the frequency of occurrence of each character, and establishes a feature vector matrix.
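As a non-limiting illustration, a character-level TF-IDF matrix with one row per line of the document can be computed as below. The source does not give the weighting formula, so the common form tf = count / line length and idf = ln(number of lines / number of lines containing the character) is assumed; the function name `tfidf_matrix` is hypothetical.

```python
import math
from collections import Counter


def tfidf_matrix(lines: list) -> tuple:
    """Return (vocabulary, matrix): one row per line, one column per distinct character."""
    docs = [[ch for ch in line if not ch.isspace()] for line in lines]
    vocab = sorted({ch for doc in docs for ch in doc})
    # Document frequency: in how many lines each character appears at least once.
    df = Counter(ch for doc in docs for ch in set(doc))
    n_docs = len(docs)
    matrix = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1  # avoid division by zero on an empty line
        row = [(counts[ch] / total) * math.log(n_docs / df[ch]) for ch in vocab]
        matrix.append(row)
    return vocab, matrix
```

With this unsmoothed formula, a character that appears in every line receives weight zero, so only characters that distinguish one line from another contribute to the feature vectors.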

Further, in S5, the feature vector matrix is converted into one input vector of the text classifier, the multi-level label is converted into another input vector of the text classifier, a data classification model is generated by calling a text classifier training algorithm, and a grading result of the power document is output.
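The source does not name the text classifier or its training algorithm. As a non-limiting illustration of training on a feature matrix plus labels and then grading a new row, the sketch below substitutes a nearest-centroid classifier, which is a stand-in chosen for brevity and not the method of the invention; the function names and the grade names used with them are hypothetical.

```python
import math
from collections import defaultdict


def train_centroids(features: list, labels: list) -> dict:
    """Average the feature rows of each label into one centroid per grade."""
    grouped = defaultdict(list)
    for row, label in zip(features, labels):
        grouped[label].append(row)
    return {
        label: [sum(col) / len(rows) for col in zip(*rows)]
        for label, rows in grouped.items()
    }


def predict(model: dict, row: list) -> str:
    """Return the grade whose centroid is closest to `row` in Euclidean distance."""
    return min(model, key=lambda label: math.dist(model[label], row))
```

For example, training on rows labeled with two hypothetical grades such as "open" and "restricted" yields one centroid per grade, and a new document's feature row is assigned the grade of the nearer centroid.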

Example 2

On the basis of embodiment 1, the power data analysis system based on the data classification model of the present invention includes:

Further, the word segmentation module further comprises:

Example 3

On the basis of embodiment 1, the computer device of the present invention includes a processor, a non-volatile storage medium, a memory, and a network interface connected through a system bus. The non-volatile storage medium of the computer device stores an operating system, a database, and computer-readable instructions. The memory of the computer device may store computer-readable instructions that, when executed by the processor, cause the processor to perform the power data analysis method based on the data classification model of embodiment 1.

The network interface of the computer device is used for communication connection with the terminal.

The invention has been described in an illustrative manner, and it is to be understood that any simple variations, modifications or other equivalent changes which can be made by one skilled in the art without departing from the spirit of the invention fall within the scope of the invention.

Claims

1. A power data analysis method based on a data classification model is characterized by comprising the following steps:

S1, establishing a root database;

2. The power data analysis method according to claim 1, wherein establishing the root database in S1 includes:

and S14, collecting the result data and establishing the root database.

3. The power data analysis method according to claim 2, characterized in that: in S10, the large amount of text data used as the corpus is obtained from the relevant laws and regulations of the power system, and a preset value N is used to remove homogeneous data from the corpus.

4. The power data analysis method according to claim 3, characterized in that: in S2, preprocessing the power document includes removing sensitive words, garbled characters, and punctuation marks, so as to remove redundant parts of the power document and thereby further filter it.

5. The power data analysis method according to claim 4, characterized in that: the matching of the target sentence in S3 includes fuzzy matching and regular-expression matching.

6. The power data analysis method according to claim 5, characterized in that: the ambiguous word bank in S3 includes a preset set of keywords with ambiguous properties.

7. The power data analysis method according to claim 6, characterized in that: the word segmentation method in the step S4 is used to fully segment sentences in the power document, read characters in each line of the power document by establishing a TF-IDF structure, calculate the frequency of occurrence of each character, and establish a feature vector matrix.

8. The power data analysis method according to claim 7, characterized in that: in S5, the feature vector matrix is converted into one input vector of the text classifier, the multi-level label is converted into another input vector of the text classifier, a data classification model is generated by invoking a text classifier training algorithm, and a grading result of the power document is output.

9. A power data analysis system based on a data classification model, comprising:

10. The power data analysis system according to claim 9, wherein: the word segmentation module further comprises: