Audit text classification method, system and equipment based on a knowledge graph
Technical Field
The invention relates to the field of electric power audit, in particular to an audit text classification method, an audit text classification system and audit text classification equipment based on a knowledge graph.
Background
Internal audit is a strong safeguard for the internal management of a power company: through internal audit, problems within the company can be found and solved, the management level improved, and the safety and stability of the power supply maintained.
With social development, the audit workload of power companies is gradually increasing, and the number of audit texts they generate is rising as well. Because audit texts are of many types, classifying them has become an important part of audit work in the power system. How to improve the efficiency and accuracy of audit text classification, improve the efficiency of audit work, and automate audit text classification is a problem that urgently needs to be solved in the current power system.
Existing methods for audit text classification have the following disadvantages:
(1) Because audit texts are specialized and complex, text classification methods based on pre-trained language models require professional auditors to label the data, and labeling a large amount of data is costly and inefficient;
(2) Audit text classification algorithms based on deep learning need a large number of training samples to improve the classification accuracy of the model;
(3) Because audit texts are specialized and complex, the various items of information they contain are difficult to represent visually.
Disclosure of Invention
In view of the above problems, the invention provides an audit text classification method, an audit text classification system and audit text classification equipment based on a knowledge graph, with the aim of improving the accuracy and appropriateness of audit text classification, improving the efficiency of audit work in the power system, and automating audit text classification.
In one aspect, the invention provides an audit text classification method based on a knowledge graph, which comprises the following steps:
S1, acquiring a first audit text data set, and constructing an audit text knowledge graph based on the first audit text data set;
S2, acquiring a second audit text data set, acquiring word vectors of audit problem categories in audit texts to be classified, calculating a matching index based on the audit text knowledge graph and the word vectors, supplementing the audit text knowledge graph with the word vectors when the matching index is smaller than a first preset value, and creating a text database corresponding to the audit problem categories; and when the matching index is larger than a second preset value, acquiring the next audit text to be classified in the second audit text data set and executing S2 again, until every audit text in the second audit text data set has been processed.
Further, the first audit text data set is obtained from existing power audit text data sets and from power audit texts collected from web pages.
Further, the second audit text data set is obtained from the power audit texts in the daily audit records of each power company.
Further, the construction process of the audit text knowledge graph comprises the following steps:
acquiring entity data in the first audit text data set; the entity data comprises audited units, item types, audit problem titles, audit problem categories, system basis and audit opinions;
acquiring relation data in the first audit text data set; the relation data comprises belonging, occurrence, reason and basis;
acquiring attribute data in the first audit text data set; the attribute data comprises unqualified contractual service, nonstandard subcontract management, nonstandard subcontract and insufficient research depth;
performing data fusion on the entity data, the relation data and the attribute data, and storing the result in a graph database in the form entity-relation-entity-attribute;
And obtaining the audit text knowledge graph by adopting a visualization tool.
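The entity-relation-entity-attribute storage form can be illustrated with a small in-memory sketch. This is not the graph database of the invention: the class names `Entity`, `Record` and `GraphStore`, the method `tails_of`, and the sample unit "Company A" are hypothetical and serve only to show how one record ties two entities together through a relation and carries attribute data.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    kind: str  # entity type, e.g. "audited unit" or "audit problem category"
    name: str


@dataclass
class Record:
    """One entity-relation-entity-attribute record."""
    head: Entity
    relation: str  # e.g. "belonging", "occurrence", "reason", "basis"
    tail: Entity
    attributes: tuple[str, ...] = ()


class GraphStore:
    """Hypothetical in-memory stand-in for a graph database."""

    def __init__(self) -> None:
        self.records: list[Record] = []

    def add(self, head: Entity, relation: str, tail: Entity, attributes=()) -> Record:
        record = Record(head, relation, tail, tuple(attributes))
        self.records.append(record)
        return record

    def tails_of(self, head: Entity, relation: str) -> list[Entity]:
        """Return every entity linked from `head` through `relation`."""
        return [r.tail for r in self.records if r.head == head and r.relation == relation]


store = GraphStore()
unit = Entity("audited unit", "Company A")  # hypothetical sample data
category = Entity("audit problem category", "subcontract management")
store.add(unit, "occurrence", category, ["nonstandard subcontract management"])
```

A real deployment would replace `GraphStore` with queries against the chosen graph database; the record shape is the only part taken from the text.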
Further, the calculation formula of the word vector is as follows:
;
Where m is the number of words contained in the audit problem category, n is the length of the word vector of each word contained in the audit problem category, U_{m×n} is the vector when a word contained in the audit problem category is the center word, V_{m×n} is the vector when a word contained in the audit problem category is a non-center word, and Y is the word vector.
Further, the calculation formula of the matching index is as follows:
;
Where E is the matching index, Y_i is the i-th field of the word vector, Q_j is the feature vector of an audit problem category entity contained in the audit text knowledge graph, n is the number of audit problem category entities contained in the audit text knowledge graph, and max denotes taking the maximum value.
Further, the first preset value is 40%.
Further, the second preset value is 95%.
Further, the creating process of the text database includes:
based on the word vector newly added to the audit text knowledge graph, automatically capturing the data information in the background and exporting the data;
during data export, automatically generating, by means of an online export technology, a file database named after the audit problem category corresponding to the word vector.
In another aspect, the invention provides an audit text classification system based on a knowledge graph, comprising: an audit text knowledge graph construction module and an audit text classification module;
the audit text knowledge graph construction module is used for acquiring a first audit text data set and constructing an audit text knowledge graph based on the first audit text data set;
the audit text classification module is used for acquiring a second audit text data set, acquiring word vectors of audit problem categories in audit texts to be classified, calculating a matching index based on the audit text knowledge graph and the word vectors, supplementing the audit text knowledge graph with the word vectors when the matching index is smaller than a first preset value, and creating a text database corresponding to the audit problem categories; and when the matching index is larger than a second preset value, acquiring the next audit text to be classified in the second audit text data set and re-executing the audit text classification module, until every audit text in the second audit text data set has been processed.
Further, the first audit text data set is obtained from existing power audit text data sets and from power audit texts collected from web pages.
Further, the second audit text data set is obtained from the power audit texts in the daily audit records of each power company.
Further, the audit text knowledge graph construction module includes: an entity data acquisition unit, a relation data acquisition unit, an attribute data acquisition unit, a fusion storage unit and a visualization unit;
the entity data acquisition unit is used for acquiring entity data in the first audit text data set; the entity data comprises audited units, item types, audit problem titles, audit problem categories, system basis and audit opinions;
the relation data acquisition unit is used for acquiring relation data in the first audit text data set; the relation data comprises belonging, occurrence, reason and basis;
the attribute data acquisition unit is used for acquiring attribute data in the first audit text data set; the attribute data comprises unqualified contractual service, nonstandard subcontract management, nonstandard subcontract and insufficient research depth;
the fusion storage unit is used for performing data fusion on the entity data, the relation data and the attribute data, and storing the result in a graph database in the form entity-relation-entity-attribute;
the visualization unit is used for obtaining the audit text knowledge graph by means of a visualization tool.
Further, the calculation formula of the word vector is as follows:
;
Where m is the number of words contained in the audit problem category, n is the length of the word vector of each word contained in the audit problem category, U_{m×n} is the vector when a word contained in the audit problem category is the center word, V_{m×n} is the vector when a word contained in the audit problem category is a non-center word, and Y is the word vector.
Further, the calculation formula of the matching index is as follows:
;
Where E is the matching index, Y_i is the i-th field of the word vector, Q_j is the feature vector of an audit problem category entity contained in the audit text knowledge graph, n is the number of audit problem category entities contained in the audit text knowledge graph, and max denotes taking the maximum value.
Further, the first preset value is 40%.
Further, the second preset value is 95%.
Further, the creating process of the text database includes:
based on the word vector newly added to the audit text knowledge graph, automatically capturing the data information in the background and exporting the data;
during data export, automatically generating, by means of an online export technology, a file database named after the audit problem category corresponding to the word vector.
In yet another aspect, the invention provides audit text classification equipment based on a knowledge graph, comprising: a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
the memory is used for storing a computer program;
the processor is used for implementing any one of the above audit text classification methods based on a knowledge graph when executing the program stored in the memory.
The invention has at least the following beneficial effects:
According to the invention, the entity data, the relation data and the attribute data in the audit texts are obtained to construct the audit text knowledge graph, so that the associations among different entity data in the audit texts are visualized, which helps power company staff quickly obtain the various items of information in an audit text and improves working efficiency. The audit problem category is converted into a word vector, the matching index is calculated, and the entity data of any audit problem category whose matching index is lower than the first preset value is added to the audit text knowledge graph, which improves the accuracy and appropriateness of audit text classification in subsequent work. The text database of audit problem categories is created automatically in the background, which improves the classification efficiency of audit texts and the working efficiency of power company staff, and automates audit text classification.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention may be realized and attained by the structure particularly pointed out in the written description and drawings.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings used in the description of the embodiments or of the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person skilled in the art can obtain other drawings from them without inventive effort.
Fig. 1 is a flowchart of a method according to an embodiment of the invention;
Fig. 2 is a schematic diagram of an audit text knowledge graph according to an embodiment of the invention.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
With social development, the audit workload of power companies is gradually increasing, and the number of audit texts they generate is rising as well. Because audit texts are of many types, classifying them has become an important part of audit work in the power system. How to improve the efficiency and accuracy of audit text classification, improve the efficiency of audit work, and automate audit text classification is a problem that urgently needs to be solved in the current power system.
Therefore, the invention provides an audit text classification method, an audit text classification system and audit text classification equipment based on a knowledge graph.
Fig. 1 is a flowchart of an audit text classification method based on a knowledge graph according to an embodiment of the present invention. Referring to Fig. 1, this embodiment provides an audit text classification method based on a knowledge graph, comprising:
S1, acquiring a first audit text data set, and constructing an audit text knowledge graph based on the first audit text data set;
Specifically, the first audit text data set is obtained from existing power audit text data sets and from power audit texts collected from web pages, so that it covers as many types of power audit text information as possible;
Fig. 2 is a schematic diagram of an audit text knowledge graph according to an embodiment of the present invention. Referring to Fig. 2, the process of constructing the audit text knowledge graph includes:
Because the audit texts are unstructured data, an LSTM-CNNs-CRF model is used to acquire the entity data in the first audit text data set; the entity data comprises audited units, item types, audit problem titles, audit problem categories, system basis and audit opinions; the LSTM-CNNs-CRF model comprises a forward LSTM layer, a backward LSTM layer and a CRF layer, takes the characters of the audit text as input, and outputs the entity data after passing through the forward LSTM layer, the backward LSTM layer and the CRF layer in sequence;
acquiring relationship data in a first audit text data set by adopting a BERT model; the relationship data comprises belonging, occurrence, reason and basis;
Acquiring attribute data in a first audit text data set; the attribute data comprise unqualified contractual service, non-standardization of subcontract management, non-standardization of subcontract and insufficient research depth;
Performing data fusion on the entity data, the relation data and the attribute data and storing the result in a graph database, wherein the process of data fusion comprises:
performing data preprocessing on the entity data, the relation data and the attribute data, the data preprocessing including grammar normalization, grammar matching, removal of irrelevant symbols, abbreviation replacement and the like;
calculating the pairwise similarity between items of entity data by means of K-Means clustering;
performing entity matching on pairs of entity data with high similarity by means of the Limes method, which completes the data fusion;
The storage form is entity-relation-entity-attribute;
And obtaining an audit text knowledge graph by adopting a visualization tool.
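The preprocessing and pairwise-similarity steps of data fusion can be sketched as follows. This is a simplified stand-in and not the method of the embodiment: instead of K-Means clustering and the Limes method, it scores name pairs with the standard-library `difflib.SequenceMatcher`. The abbreviation table `ABBREVIATIONS`, the function names and the 0.9 threshold are all assumptions made for illustration.

```python
import re
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical abbreviation table; a real one would come from audit domain experts.
ABBREVIATIONS = {"mgmt": "management", "subcontr": "subcontract"}


def preprocess(text: str) -> str:
    """Lower-case, strip irrelevant symbols, and expand known abbreviations."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(ABBREVIATIONS.get(word, word) for word in text.split())


def candidate_matches(names, threshold=0.9):
    """Return (name_a, name_b, score) for every pair whose cleaned forms are similar.

    SequenceMatcher.ratio() is used here only as a simple string-similarity
    stand-in for the clustering-based similarity described in the text.
    """
    cleaned = {name: preprocess(name) for name in names}
    pairs = []
    for a, b in combinations(names, 2):
        score = SequenceMatcher(None, cleaned[a], cleaned[b]).ratio()
        if score >= threshold:
            pairs.append((a, b, score))
    return pairs


matches = candidate_matches(
    ["Subcontract management", "subcontr. mgmt", "Research depth"]
)
```

Pairs returned by `candidate_matches` are only candidates for entity matching; deciding which of the two records survives the merge is left to the fusion step proper.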
S2, acquiring a second audit text data set, acquiring word vectors of audit problem categories in the audit texts to be classified, calculating a matching index based on the audit text knowledge graph and the word vectors, supplementing the audit text knowledge graph with the word vectors when the matching index is smaller than a first preset value, and creating a text database corresponding to the audit problem categories; when the matching index is larger than a second preset value, acquiring the next audit text to be classified in the second audit text data set and executing S2 again, until every audit text in the second audit text data set has been processed;
Specifically, the second audit text data set is obtained from the power audit texts in the daily audit records of each power company, and the audit texts to be classified are the audit texts contained in the second audit text data set;
specifically, the calculation formula of the word vector of the audit problem category in the audit text to be classified is as follows:
;
Where m is the number of words contained in the audit problem category, n is the length of the word vector of each word contained in the audit problem category, U_{m×n} is the vector when a word contained in the audit problem category is used as the center word, V_{m×n} is the vector when a word contained in the audit problem category is used as a non-center word, and Y is the word vector, namely the final word vector of the audit problem category.
Specifically, building on the visualized audit text knowledge graph, the matching index is calculated based on the audit text knowledge graph and the word vector, and the calculation formula of the matching index is as follows:
;
Where E is the matching index, Y_i is the i-th field of the word vector, Q_j is the feature vector of an audit problem category entity contained in the audit text knowledge graph, n is the number of audit problem category entities contained in the audit text knowledge graph, and max denotes taking the maximum value.
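The role of the maximum in the matching index can be illustrated with a short sketch. The formula of the embodiment is not reproduced in this text, so the sketch assumes cosine similarity as the per-entity score and takes the maximum over all audit problem category entities in the graph; the function names `cosine` and `matching_index` are hypothetical, and the actual formula may differ.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (assumed stand-in score)."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def matching_index(word_vector, category_vectors):
    """Best score of the word vector Y against the feature vectors Q_j in the graph."""
    if not category_vectors:
        return 0.0  # an empty graph matches nothing
    return max(cosine(word_vector, q) for q in category_vectors)
```

Under this assumption a value near 1 means that some category already in the graph closely matches the new category, and a low value means that no existing category does.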
Specifically, when the matching index is between 40% and 95%, the audit problem category corresponding to the audit text to be classified is considered to already exist in the audit text knowledge graph; S2 is not executed further for that audit text and the text is deleted (filtered), which increases the classification speed of audit texts and reduces the volume of background data.
Specifically, the process of adding the word vector to the audit text knowledge graph comprises the following steps:
Because the "audit problem category" corresponding to the word vector to be added is one kind of entity data, in this embodiment the entity data to be added is called the first audit problem category entity, and the "audit problem category" entity data already in the audit text knowledge graph is called the second audit problem category entity. The degree of matching between the first audit problem category entity and the second audit problem category entity is calculated with the evaluation index w(a, b), whose calculation formula is as follows:
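The three-way decision on the matching index can be summarized in a minimal sketch. The thresholds 40% and 95% come from the text; the names `Action` and `decide`, and the choice to treat the two boundary values as part of the middle band, are assumptions.

```python
from enum import Enum


class Action(Enum):
    COMPLETE_GRAPH = "complete_graph"  # add the category to the graph, create its text database
    FILTER = "filter"                  # category already in the graph: drop the text
    NEXT_TEXT = "next_text"            # move on to the next audit text to be classified


def decide(matching_index: float,
           first_preset: float = 0.40,
           second_preset: float = 0.95) -> Action:
    """Map a matching index in [0, 1] to the action described for step S2."""
    if not 0.0 <= matching_index <= 1.0:
        raise ValueError("matching index must lie in [0, 1]")
    if matching_index < first_preset:
        return Action.COMPLETE_GRAPH
    if matching_index > second_preset:
        return Action.NEXT_TEXT
    # Assumption: values exactly equal to a threshold fall in the middle band.
    return Action.FILTER
```

For example, `decide(0.2)` yields `Action.COMPLETE_GRAPH`, while `decide(0.6)` yields `Action.FILTER`.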
;
Where a is the first audit problem category entity, b is the second audit problem category entity, and the operator applied to a and b in the formula is the circular correlation operation.
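The circular correlation operation itself can be sketched in a few lines. The sketch uses the usual definition, in which component k of the result is the sum over i of a[i] * b[(i + k) mod d] for vectors of length d. It shows only this operation and not the full evaluation index w(a, b), whose formula is not reproduced in this text; the function name `circular_correlation` is hypothetical.

```python
def circular_correlation(a, b):
    """Circular correlation of two equal-length vectors.

    result[k] = sum over i of a[i] * b[(i + k) % d], where d = len(a).
    This direct form costs O(d^2); an FFT-based form would be faster for large d.
    """
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    d = len(a)
    return [sum(a[i] * b[(i + k) % d] for i in range(d)) for k in range(d)]


result = circular_correlation([1, 2, 3], [4, 5, 6])
# result == [32, 29, 29]
```

Component 0 of the result is the ordinary dot product of the two vectors, so it is largest when the embeddings of the two entities point in the same direction.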
Specifically, the text database creation process includes:
Based on the word vector newly added to the audit text knowledge graph, the data information is captured automatically in the background and the data is exported; during data export, a BLOB online export technology is used to automatically generate a file database named after the audit problem category corresponding to the word vector.
It should be noted that, after S1 and S2 have been executed in sequence, the method further includes: merging the first audit text data set and the second audit text data set obtained in this embodiment into a third audit text data set; dividing the third audit text data set into a training set and a test set; and building a deep learning model to classify audit texts. Because part of the second audit text data set is filtered out in S2, the training set that must be labeled to train the deep learning model is greatly reduced without lowering the classification accuracy of the model, which further reduces the workload and improves classification efficiency. The deep learning model includes, but is not limited to, the high-accuracy models commonly used in the prior art, which are not described further in this embodiment.
This embodiment provides an audit text classification system based on a knowledge graph, comprising: an audit text knowledge graph construction module and an audit text classification module;
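The idea of a database file named after the audit problem category can be sketched as below. The BLOB online export mechanism of the embodiment is not reproduced; as an assumed stand-in, the sketch writes a local JSON file, and the function name `export_category_database`, the record layout and the file-name sanitizing rule are all hypothetical.

```python
import json
import re
from pathlib import Path


def export_category_database(category, records, out_dir):
    """Write the records of one audit problem category to '<category>.json'.

    Returns the path of the file that was created.
    """
    # Replace characters that are unsafe in file names with underscores.
    safe_name = re.sub(r"[^\w\-]+", "_", category).strip("_")
    if not safe_name:
        raise ValueError("category name yields an empty file name")
    out_path = Path(out_dir) / f"{safe_name}.json"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"audit_problem_category": category, "records": list(records)}
    with out_path.open("w", encoding="utf-8") as handle:
        json.dump(payload, handle, ensure_ascii=False, indent=2)
    return out_path
```

Calling this once for each newly added category produces one file per category, which mirrors the one-database-per-category behavior described above.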
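The merge-and-split step can be sketched as follows, assuming each audit text is represented as a string and that only the texts of the second data set that survived filtering in S2 are passed in. The function name `merge_and_split`, the 20% test ratio and the fixed random seed are assumptions and are not taken from the embodiment.

```python
import random


def merge_and_split(first_set, second_kept, test_ratio=0.2, seed=0):
    """Merge two collections of audit texts and split them into train and test lists."""
    # Remove exact duplicates while preserving first-seen order.
    merged = list(dict.fromkeys(list(first_set) + list(second_kept)))
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    rng.shuffle(merged)
    n_test = int(len(merged) * test_ratio)
    return merged[n_test:], merged[:n_test]


train, test = merge_and_split(["a", "b", "c"], ["c", "d", "e"])
```

Only the training portion then needs labeling, which is where the reduction in labeling workload described above would appear.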
the audit text knowledge graph construction module is used for acquiring a first audit text data set and constructing an audit text knowledge graph based on the first audit text data set;
the audit text classification module is used for acquiring a second audit text data set, acquiring word vectors of audit problem categories in the audit texts to be classified, calculating a matching index based on the audit text knowledge graph and the word vectors, supplementing the audit text knowledge graph with the word vectors when the matching index is smaller than a first preset value, and creating a text database corresponding to the audit problem categories; and when the matching index is larger than a second preset value, acquiring the next audit text to be classified in the second audit text data set and re-executing the audit text classification module, until every audit text in the second audit text data set has been processed.
In a specific implementation, the implementation processes of the audit text classification system based on a knowledge graph correspond one-to-one to those of the audit text classification method based on a knowledge graph, and are not repeated here.
The invention provides audit text classification equipment based on a knowledge graph, comprising: a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;
the memory is used for storing a computer program;
the processor is used for implementing any one of the above audit text classification methods based on a knowledge graph when executing the program stored in the memory.
According to the invention, the entity data, the relation data and the attribute data in the audit texts are obtained to construct the audit text knowledge graph, so that the associations among different entity data in the audit texts are visualized, which helps power company staff quickly obtain the various items of information in an audit text and improves working efficiency. The audit problem category is converted into a word vector, the matching index is calculated, and the entity data of any audit problem category whose matching index is lower than the first preset value is added to the audit text knowledge graph, which improves the accuracy and appropriateness of audit text classification in subsequent work. The text database of audit problem categories is created automatically in the background, which improves the classification efficiency of audit texts and the working efficiency of power company staff, and automates audit text classification.
Although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention.