CN115563298A

CN115563298A - Basic data acquisition method based on electric power material knowledge

Info

Publication number: CN115563298A
Application number: CN202211146095.9A
Authority: CN
Inventors: 杨洁; 郑佳妮; 田行健; 岳凡与; 邓楚杭
Original assignee: Guizhou Power Grid Materials Co ltd
Current assignee: Guizhou Power Grid Materials Co ltd
Priority date: 2022-09-20
Filing date: 2022-09-20
Publication date: 2023-01-03

Abstract

The invention discloses a power material knowledge graph construction method, comprising: collecting power system data, including obtaining existing data within the power system, obtaining missing product and service information data, and obtaining external network information resources; dividing data types by classifying the power system data according to data attributes; if the data is structured data or is imported from a third-party database, performing knowledge fusion after data integration, and if the data is semi-structured or unstructured data, performing knowledge extraction and then performing knowledge fusion on the extracted data; and constructing a knowledge base from the fused data. The power material encyclopedia knowledge base can be updated, and a query catalogue framework for the knowledge base is formed by combining power material subdivision standards with an industrial intelligence algorithm model. The public power material encyclopedia knowledge network is further analyzed, and convenient querying and public display of application content are realized through material classification, keywords, and other means.

Description

Basic data acquisition method based on electric power material knowledge

Technical Field

The invention relates to the technical field of material knowledge acquisition and arrangement, in particular to a basic data acquisition method based on electric power material knowledge.

Background

In the middle of the 20th century, Price et al. proposed using citation networks to study the development of contemporary science, introducing the concept of the knowledge map for the first time. In 1977, the concept of knowledge engineering was proposed at the Fifth International Joint Conference on Artificial Intelligence, and knowledge base systems represented by expert systems were widely researched and applied. In the 1990s the concept of the organizational knowledge base was proposed, and research on knowledge representation and knowledge organization has been pursued in depth ever since. Organizational knowledge base systems are widely used for data integration and external publicity in departments and institutions of all kinds. In November 2012, Google first proposed the concept of the Knowledge Graph (KG) and announced that it would add knowledge graph functionality to its search results, with the aim of improving the capability of the search engine and enhancing search quality and the user's search experience. According to statistics from January 2015, the KG built by Google already contained 500 million entities and about 3.5 billion entity relationships, and it has been widely applied to improving search quality. Although the term "knowledge graph" is relatively new, it is not an entirely new field of research. As early as 2006, Berners-Lee proposed the idea of linked data, calling for the promotion and refinement of related technical standards such as URI, RDF, and OWL in preparation for the arrival of the Semantic Web. A wave of Semantic Web research followed, and knowledge graph technology is built on those research results: it both discards parts of existing Semantic Web technology and elevates the rest.

As urbanization and industrialization continue to accelerate, the demand for electric power keeps growing, and a guaranteed supply of power materials is key to power construction. At the present stage, power material information is mostly managed and applied manually, which falls far short of the needs of power departments. Establishing a power material encyclopedia knowledge network according to the future application needs of the power material industry is an important means of solving the current problems and reflects the direction of development. It also lays a solid foundation for the future construction of advanced applications such as a power material "smart brain".

Disclosure of Invention

This section is for the purpose of summarizing some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. In this section, as well as in the abstract and title of the application, simplifications or omissions may be made to avoid obscuring the purpose of the section, the abstract and the title, and such simplifications or omissions are not intended to limit the scope of the invention.

The present invention has been made keeping in mind the above problems occurring in the prior art.

Therefore, the technical problem to be solved by the invention is as follows: urbanization and industrialization continue to accelerate, the demand for electric power is increasing, and a guaranteed supply of power materials is key to power construction, yet power material information is currently managed and applied manually, which falls far short of the needs of power departments.

To solve the above technical problem, the invention provides the following technical scheme: a basic data acquisition method based on power material knowledge, comprising: acquiring power system data, including acquiring existing data within the power system, acquiring missing product and service information data, and acquiring external network information resources; dividing data types by classifying the power system data according to data attributes; performing noise-reduction preprocessing on the data according to the divided data categories; and processing the different types of data and uploading them to a material encyclopedia database for system management and analysis.

As a preferable scheme of the basic data acquisition method based on electric power material knowledge of the present invention, wherein: classified by data attribute, the power system material data comprises structured data, unstructured data, and text data, and each of the three is processed separately.

As a preferable scheme of the basic data acquisition method based on electric power material knowledge of the present invention, wherein: the data processing comprises processing the structured data by data integration; processing the unstructured data by entity recognition and relationship extraction; and processing the text data by NLP.

As a preferable scheme of the basic data acquisition method based on electric power material knowledge of the present invention, wherein: the preprocessing includes cleaning and standardizing the text data prior to data processing.

As a preferable scheme of the basic data acquisition method based on electric power material knowledge of the present invention, wherein: the pre-processing process includes noise removal, lexical normalization, and object normalization.

As a preferable scheme of the basic data acquisition method based on electric power material knowledge of the present invention, wherein: the noise removal includes iterating over the object text using a dictionary of noise entities and removing the symbols present in the noise dictionary.

As a preferable scheme of the basic data acquisition method based on electric power material knowledge of the present invention, wherein: the lexical normalization is a text-processing step in feature engineering that converts high-dimensional features to a low-dimensional space, and comprises:

stemming: rule-based suffix removal;

lemmatization: obtaining the root word by means of vocabulary and morphological analysis.

As a preferable scheme of the basic data acquisition method based on electric power material knowledge of the present invention, wherein: the text data is systematically analyzed, understood, and information extracted and managed using NLP.

As a preferable scheme of the basic data acquisition method based on electric power material knowledge of the present invention, wherein: the system operation inspection content mainly comprises the file integrity accuracy rate, the reading rate and the reading accuracy rate, and the following table shows the system data test content and the test result.

As a preferable scheme of the basic data acquisition method based on electric power material knowledge of the present invention, wherein: the data processing flow comprises:

reading a configuration file and setting program operation parameters;

checking the parameters for errors;

when the parameters are wrong, ending the process directly;

when the parameters are correct, judging whether an incremental import condition exists; when it exists, extracting only the incremental import data; when it does not exist, performing a full import extraction; and inputting the extracted data into the system.
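The import flow described above can be sketched roughly as follows. This is a minimal illustration rather than the patented implementation: the JSON configuration format, the parameter names (`source`, `target`, `batch_size`), and the use of a `last_imported_id` watermark as the incremental import condition are all assumptions introduced for the example.

```python
import json

# Hypothetical parameter names; the source does not list the real ones.
REQUIRED_KEYS = ("source", "target", "batch_size")


def validate_params(params):
    """Return a list of error messages; an empty list means the parameters are valid."""
    errors = [f"missing parameter: {key}" for key in REQUIRED_KEYS if key not in params]
    if "batch_size" in params:
        size = params["batch_size"]
        if not isinstance(size, int) or size <= 0:
            errors.append("batch_size must be a positive integer")
    return errors


def run_import(config_text, records, last_imported_id=None):
    """Read the configuration, check it, then extract incrementally or in full."""
    params = json.loads(config_text)      # read the configuration, set run parameters
    errors = validate_params(params)      # error judgment on the parameters
    if errors:                            # wrong parameters: end the process directly
        return {"status": "aborted", "errors": errors, "imported": []}
    if last_imported_id is not None:      # assumed incremental import condition
        selected = [r for r in records if r["id"] > last_imported_id]
        mode = "incremental"
    else:                                 # no incremental condition: full import
        selected = list(records)
        mode = "full"
    return {"status": "ok", "mode": mode, "imported": selected}
```

Calling `run_import(config, records)` without a watermark performs a full extraction, while passing `last_imported_id=2` returns only records with a larger `id`.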

The invention has the beneficial effects that: according to the invention, a complete power material knowledge information database is constructed to meet the data requirements of project research, so that the final power material encyclopedia knowledge network knowledge information is more complete and comprehensive, and different application requirements can be met; the information is more accurate and simplified, the error rate of the knowledge information of the electric power materials is reduced, the availability of the information is improved, and potential safety hazards caused by the knowledge information of the electric power materials are avoided.

Drawings

In order to illustrate the technical solutions of the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort. Wherein:

Fig. 1 is a flowchart of the data acquisition and processing algorithm in the first embodiment.

Fig. 2 is a flow chart of data collection and processing in the second embodiment.

Detailed Description

In order to make the aforementioned objects, features and advantages of the present invention more comprehensible, embodiments of the present invention are described in detail below with reference to the accompanying figures.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, however, the present invention may be practiced otherwise than as specifically described herein, and it will be appreciated by those skilled in the art that the present invention may be practiced without departing from the spirit and scope of the present invention and that the present invention is not limited by the specific embodiments disclosed below.

Furthermore, the references herein to "one embodiment" or "an embodiment" refer to a particular feature, structure, or characteristic that may be included in at least one implementation of the present invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.

Example 1

The data processing comprises processing the structured data by data integration; processing the unstructured data by entity recognition and relationship extraction; and processing the text data by NLP.

The preprocessing comprises cleaning and standardizing the text data before data processing.

On the big data side, this embodiment closely examines the principle by which the HDFS distributed file system solves the problem of mass data storage, the mechanism by which MapReduce runs efficiently in parallel, the Hive data warehouse, and Sqoop data migration. On the data mining side, it compares the prediction algorithm models commonly used in the field of forecasting and implements a prediction algorithm in the R language.

The preprocessing process includes noise removal, lexical normalization, and object normalization.

MapReduce is a software framework for parallel computation that abstracts the complex parallel computing process into two functions, Map and Reduce. It automatically parallelizes computing tasks, partitions the data and the tasks, distributes and executes the tasks on cluster nodes, and collects the results. The complex low-level details of parallel computation, such as data communication and fault tolerance, are handed over to the framework.
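To make the Map and Reduce abstraction concrete, the following is a minimal single-process sketch of the programming model using a word count. It does not use Hadoop itself; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are illustrative, and a real cluster would run the map calls in parallel on separate nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input split."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between Map and Reduce."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values emitted for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce sequentially over a list of input splits."""
    intermediate = []
    for doc in documents:  # on a real cluster each split is mapped on a different node
        intermediate.extend(map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(intermediate).items())
```

The user writes only the two small functions; partitioning, scheduling, and collection are the parts the framework takes over.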

Stemming: rule-based suffix removal;

Hive borrows many concepts from relational databases, such as schemas, tables, rows, and columns, and provides an SQL-like syntax, HiveQL (HQL for short), for querying and processing data. It is a strong complement to the MapReduce framework: users unfamiliar with MapReduce can write simple SQL-like statements, which Hive converts into MapReduce programs for execution. The conversion is transparent to the user, and the results of the query or analysis are returned at the end, which is welcome news for database administrators (DBAs) and development engineers. Hive lets a great deal of work be completed with relatively little effort. In actual development, about 80% of operations are completed through Hive rather than directly by MapReduce programs; only a small share of complex tasks remains, for which developers must write custom Mapper and Reducer programs because HQL cannot express them.

The text data is systematically analyzed, understood, and information extracted and managed using NLP.

The system operation inspection content mainly comprises the file integrity accuracy rate, the reading rate and the reading accuracy rate, and the following table shows the system data test content and the test result.

The data processing flow comprises:

Reading a configuration file and setting program operation parameters;

carrying out error judgment on the parameters;

when the parameters are wrong, the process is directly ended;

Example 2

Referring to Fig. 1, the second embodiment of the present invention builds on the previous embodiment and addresses text data, which contains many different types of noise. Text data is not suitable for direct analysis until it has been preprocessed; text preprocessing mainly cleans and standardizes the text data. Referring to Fig. 1, the invention provides a basic data acquisition method based on power material knowledge, which includes the following steps:

S1: Collect power system data, including obtaining existing data within the power system, obtaining missing product and service information data, and obtaining external network information resources. Existing data within the power system is obtained by communicating and coordinating with the relevant power departments to gather the data the project requires. For the roughly 3000 categories of supplier product and service information missing from the existing power system, a standard data acquisition template is established in cooperation with suppliers, and the data is collected by having suppliers fill in and upload it themselves. External network information resources, such as public knowledge about materials and futures and spot market information, are collected by web crawlers and other means. Data sources are labeled throughout data acquisition and use.

S2: Divide the data types by classifying the power system data according to data attributes. Specifically, the power system material data includes structured data, unstructured data, and text data.

S3: Perform noise-reduction preprocessing on the text data.

S4: Process the different types of data (structured data, unstructured data, and text data) and upload them to a material encyclopedia database for system management and analysis. Specifically, the structured data is processed by data integration; the unstructured data is processed by entity recognition and relationship extraction; and the text data is processed by NLP.
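Steps S2 to S4 amount to classifying each item and routing it to a category-specific processing mode. The sketch below shows one possible shape of that routing. The classification rule (dictionaries as structured data, strings as text, everything else as unstructured), the placeholder handlers, and the in-memory list standing in for the material encyclopedia database are all assumptions made for illustration; the source does not specify them.

```python
def classify(item):
    """Hypothetical S2 rule: dicts are structured, strings are text, the rest unstructured."""
    if isinstance(item, dict):
        return "structured"
    if isinstance(item, str):
        return "text"
    return "unstructured"


def denoise(text):
    """Placeholder for S3: collapse repeated whitespace in text data."""
    return " ".join(text.split())


# Placeholder S4 handlers: each returns the name of the processing mode and the payload.
HANDLERS = {
    "structured": lambda item: ("data_integration", item),
    "unstructured": lambda item: ("entity_and_relation_extraction", item),
    "text": lambda item: ("nlp", denoise(item)),
}


def process(items):
    """Classify each item, route it to its handler, and store the result."""
    database = []  # stands in for the material encyclopedia database
    for item in items:
        category = classify(item)
        mode, payload = HANDLERS[category](item)
        database.append({"category": category, "mode": mode, "payload": payload})
    return database
```

Keeping the handlers in a lookup table means a new data category can be added without touching the routing loop.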

The preprocessing includes cleaning and standardizing the text data prior to data processing. Specifically, the preprocessing process includes noise removal, lexical normalization, and object normalization.

Any piece of text that is not relevant to the context of the data or to the final output can be regarded as noise. Examples include language stop words, URLs or links, social media entities such as @ mentions and # tags, punctuation, and industry-specific vocabulary.

The noise removal includes iterating over the object text using a dictionary of noise entities and removing the symbols present in the noise dictionary.
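Dictionary-based noise removal can be sketched in a few lines. The example below is an illustration under simple assumptions: the text is split on whitespace, matching is case-insensitive, and the function name `remove_noise` and the sample dictionary are invented for the example.

```python
def remove_noise(text, noise_dictionary):
    """Iterate over the tokens of the text and drop those found in the noise dictionary."""
    kept = [token for token in text.split() if token.lower() not in noise_dictionary]
    return " ".join(kept)


# A tiny, made-up noise dictionary: stop words plus a social media symbol.
SAMPLE_NOISE = {"the", "is", "#"}
```

With these definitions, `remove_noise("The 10kV cable is # rated", SAMPLE_NOISE)` returns `"10kV cable rated"`.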

Further, all the different forms of a word are converted into its canonical form. Normalization is a key step of text processing in feature engineering.

The lexical normalization is a text-processing step in feature engineering that converts N different features in a high-dimensional space into 1 feature in a low-dimensional space, and comprises:

Stemming: a rudimentary, rule-based process of stripping suffixes;

Lemmatization: an organized, step-by-step procedure for obtaining the root word, which uses the vocabulary (the dictionary importance of words) and morphological analysis (word structure and grammatical relations).
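The difference between the two normalization techniques can be shown with a toy example. The suffix list and the lemma dictionary below are invented for illustration and are far smaller than anything a production stemmer or lemmatizer would use; the dictionary lookup merely stands in for real vocabulary and morphological analysis.

```python
# Illustrative suffix rules, checked in order; not a complete stemming algorithm.
SUFFIXES = ("ing", "ers", "er", "es", "ed", "s")


def stem(word):
    """Rule-based suffix stripping: remove the first matching suffix, keeping 3+ letters."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Made-up dictionary standing in for vocabulary and morphological analysis.
LEMMA_DICTIONARY = {"ran": "run", "running": "run", "cables": "cable", "better": "good"}


def lemmatize(word):
    """Look the word up in the dictionary; unknown words are returned unchanged."""
    return LEMMA_DICTIONARY.get(word, word)
```

Stemming maps "switch", "switched", "switching", and "switches" to the single feature "switch", which is the N-to-1 reduction described above, but it also turns "cables" into the non-word "cabl". Lemmatization returns "cable" for "cables" and can handle irregular forms such as "ran", which no suffix rule reaches.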

Further, NLP is used to systematically analyze and understand the text data and to extract and manage the information in it. With NLP and its components, very large bodies of text can be managed, many tasks can be automated, and a wide variety of problems can be solved, such as automatic summarization, machine translation, named entity recognition, relationship extraction, sentiment analysis, speech recognition, and topic segmentation.

Entity recognition means identifying entities with specific meanings in the text, mainly names of people, places, and organizations, proper nouns, and the like. It generally comprises two parts: 1. identifying the entity boundary; 2. determining the entity class (name, manufacturer name, device name, or other). The main technical methods for entity recognition at present are rule-based and dictionary-based methods, statistical methods, hybrids of the two, and neural network methods. The present embodiment preferably uses a neural network method.

The NLP task is modeled with a neural network method. The main models for this task include NN/CNN-CRF, RNN-CRF, and LSTM-CRF. The neural network method comprises the following steps: mapping each token from its discrete one-hot representation to a dense embedding in a low-dimensional space; feeding the embedding sequence of a sentence into an RNN so that the network extracts features automatically; and using softmax to predict the label of each token. Its advantages include:

1. Training the neural network model becomes an end-to-end process rather than a traditional pipeline.

2. It does not depend on feature engineering and is a data-driven method.
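The three steps of the neural network method (embedding lookup, recurrent feature extraction, per-token softmax) can be traced in a small forward pass written with only the standard library. This is a sketch, not the model used in the patent: the weights are random and untrained, the label set and class name `TinyRnnTagger` are hypothetical, a plain Elman-style RNN is assumed, and the CRF layer of the RNN-CRF and LSTM-CRF variants is omitted.

```python
import math
import random

LABELS = ["O", "MANUFACTURER", "DEVICE"]  # hypothetical tag set


def make_matrix(rows, cols, rng):
    """Create a rows x cols matrix of small random weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vector):
    """Multiply a matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Turn raw scores into a probability distribution (numerically stable form)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TinyRnnTagger:
    def __init__(self, vocabulary, embed_dim=4, hidden_dim=5, seed=0):
        rng = random.Random(seed)
        self.index = {word: i for i, word in enumerate(vocabulary)}
        self.embeddings = make_matrix(len(vocabulary), embed_dim, rng)  # dense lookup table
        self.w_in = make_matrix(hidden_dim, embed_dim, rng)
        self.w_rec = make_matrix(hidden_dim, hidden_dim, rng)
        self.w_out = make_matrix(len(LABELS), hidden_dim, rng)
        self.hidden_dim = hidden_dim

    def forward(self, tokens):
        """Return one probability distribution over LABELS per token."""
        hidden = [0.0] * self.hidden_dim
        outputs = []
        for token in tokens:
            embedded = self.embeddings[self.index[token]]  # one-hot index -> dense vector
            pre = [a + b for a, b in zip(matvec(self.w_in, embedded),
                                         matvec(self.w_rec, hidden))]
            hidden = [math.tanh(v) for v in pre]           # recurrent feature extraction
            outputs.append(softmax(matvec(self.w_out, hidden)))
        return outputs

    def predict(self, tokens):
        """Pick the most probable label for each token."""
        return [LABELS[max(range(len(LABELS)), key=probs.__getitem__)]
                for probs in self.forward(tokens)]
```

Because the weights are untrained, the predicted labels are arbitrary; the point is the data flow. Tokens outside the vocabulary raise a `KeyError` in this sketch, whereas a practical tagger would reserve an unknown-word embedding.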

The relationship extraction comprises a rule and template based method, a statistical machine learning based method and an open domain oriented extraction method.

It is important to note that the construction and arrangement of the present application as shown in the various exemplary embodiments is illustrative only. Although only a few embodiments have been described in detail in this disclosure, those skilled in the art who review this disclosure will readily appreciate that many modifications are possible, such as variations in sizes, dimensions, structures, shapes and proportions of the various elements, values of parameters such as temperature, pressure, etc., mounting arrangements, use of materials, colors, orientations, etc., without materially departing from the novel teachings and advantages of the subject matter recited in this application. For example, elements shown as integrally formed may be constructed of multiple parts or elements, the position of elements may be reversed or otherwise varied, and the nature or number of discrete elements or positions may be altered or varied. Accordingly, all such modifications are intended to be included within the scope of this invention. The order or sequence of any process or method steps may be varied or re-sequenced according to alternative embodiments. In the claims, any means-plus-function clause is intended to cover the structures described herein as performing the recited function and not only structural equivalents but also equivalent structures. Other substitutions, modifications, changes and omissions may be made in the design, operating conditions and arrangement of the exemplary embodiments without departing from the scope of the present inventions. Therefore, the present invention is not limited to a particular embodiment, but extends to various modifications that nevertheless fall within the scope of the appended claims.

Furthermore, in an effort to provide a concise description of the exemplary embodiments, all features of an actual implementation may not be described, i.e., those unrelated to the presently contemplated best mode of carrying out the invention, or those unrelated to enabling the invention.

It should be appreciated that in the development of any such actual implementation, as in any engineering or design project, numerous implementation-specific decisions may be made. Such a development effort might be complex and time consuming, but would nevertheless be a routine undertaking of design, fabrication, and manufacture for those of ordinary skill having the benefit of this disclosure, without undue experimentation.

It should be noted that the above-mentioned embodiments are only for illustrating the technical solutions of the present invention and not for limiting, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that modifications or equivalent substitutions may be made on the technical solutions of the present invention without departing from the spirit and scope of the technical solutions of the present invention, which should be covered by the claims of the present invention.

Example 3

The third embodiment of the invention differs from the first two embodiments as follows:

The test environment is deployed so as to simulate the city-level layout: separate servers are installed, in simulation, for the provincial company and for each of the nineteen city companies, and the provincial company and the nineteen city companies manage the test environment in a unified manner.

The system is deployed in a "19+1" arrangement. Information is exchanged vertically between each city company system and the provincial company system; there is no horizontal information exchange between city company systems.

The system operating environment mainly includes database server hardware and software configuration and application server hardware and software configuration, and the detailed configuration of the database server hardware and software configuration is given in the following table:

Test server hardware and software configuration

The neural network model mainly comprises data backup, a network system, a database system and application system user permission setting.

System platform test content

The system performance test covers the key functions of the system's basic applications, advanced applications, operation management, statistical queries, and system management. The test items include station-area profile data, concentrator parameters, task dispatch, real-time energy reading on demand, viewing and exporting meter energy data, acquisition rate statistics, viewing real-time distribution transformer data, line loss calculation, the marketing interface, and so on.

The system operation inspection content mainly comprises the complete accuracy rate, the reading rate and the reading accuracy rate of the file, and the following table shows the system data test content and the test result:

Claims

1. A basic data acquisition method based on power material knowledge, characterized by comprising:

acquiring data of the power system, including acquiring the existing data in the power system, acquiring missing product and service information data and external network information resources;

dividing data categories, and classifying the power system data according to data attributes;

performing noise-reduction preprocessing on the data according to the divided data categories; and

processing the different types of data and uploading them to a material encyclopedia database for system management and analysis.

2. The basic data acquisition method based on electric power material knowledge as claimed in claim 1, wherein: classified by data attribute, the power system material data comprises structured data, unstructured data, and text data, and each of the three is processed separately.

3. The basic data acquisition method based on electric power material knowledge as claimed in claim 2, characterized in that: the data processing comprises:

processing the structured data by data integration;

processing the unstructured data by entity recognition and relationship extraction;

and processing the text data by NLP.

4. The basic data acquisition method based on electric power material knowledge as claimed in claim 2 or 3, characterized in that: the preprocessing comprises cleaning and standardizing the text data before data processing.

5. The basic data acquisition method based on electric power material knowledge as claimed in claim 4, wherein: the pre-processing process includes noise removal, lexical normalization, and object normalization.

6. The basic data acquisition method based on electric power material knowledge as claimed in claim 5, wherein: the noise removal includes iterating the object text using the dictionary of noise entities to remove symbols present in the noise dictionary.

7. The basic data acquisition method based on electric power material knowledge as claimed in claim 5 or 6, characterized in that: the lexical normalization converts high-dimensional features to low-dimensional space based on processing text in feature engineering, which includes:

stem extraction: suffix removal based on rules;

8. The basic data acquisition method based on electric power material knowledge as claimed in claim 3, characterized in that: the text data is systematically analyzed, understood, and information extracted and managed using NLP.

9. The basic data acquisition method based on electric power material knowledge as claimed in claim 3 or 8, characterized in that: the system operation inspection content mainly comprises the file integrity accuracy, the reading rate and the reading accuracy, and the following table shows the system data test content and the test result.

10. The basic data acquisition method based on electric power material knowledge as claimed in claim 9, wherein: the data processing flow comprises

Reading a configuration file and setting program operation parameters;

carrying out error judgment on the parameters;

when the parameters are wrong, the process is directly ended;