CN113963357A - Knowledge graph-based sensitive text detection method and system

Info

Publication number: CN113963357A
Application number: CN202111535596.1A
Authority: CN
Inventors: 张静磊; 叶蔚; 张世琨; 谢睿; 温国昌
Original assignee: Peking University
Current assignee: Peking University
Priority date: 2021-12-16
Filing date: 2021-12-16
Publication date: 2022-01-21
Anticipated expiration: 2041-12-16
Also published as: CN113963357B

Abstract

The invention discloses a method and a system for detecting sensitive texts based on a knowledge graph. The method comprises the following steps: crawling existing knowledge from the network and preprocessing it to obtain a knowledge graph network; acquiring sensitive texts from the network and preprocessing them to obtain a training corpus; obtaining coding information of a text detection model according to the training corpus and the knowledge graph network, and converting the coding information into a vector representation to obtain a final text detection model; and preprocessing the text to be tested and obtaining a detection result according to the text detection model. According to the invention, external knowledge is introduced through the knowledge graph to establish the text detection model, and the external knowledge is further fused through a multi-view reasoning network, so that the external knowledge can be fully utilized.

Description

Knowledge graph-based sensitive text detection method and system

Technical Field

The invention relates to the field of sensitive text detection, in particular to a knowledge graph-based sensitive text detection method and system.

Background

With the development of the internet, the amount of information online is growing explosively, and unhealthy and illegal information is growing with it, so reasonable screening of information is particularly important. NLP technology plays an increasingly important role in everyday language processing tasks such as text classification, language translation, part-of-speech tagging and named entity recognition, and has achieved remarkable results. Sensitive text analysis, as part of the NLP field, is likewise increasingly important on the internet. However, existing techniques have weaknesses that are exploited by evasion methods such as pinyin replacement, sequence perturbation and reference replacement, which make sensitive text detection more difficult. Knowledge graphs offer a reasonable way to solve this problem.

Disclosure of Invention

The invention provides a method and a system for detecting sensitive texts based on a knowledge graph, which introduce external knowledge through the knowledge graph to give the model the necessary basis for detection, and further fuse the external knowledge through a multi-view reasoning network, so that the external knowledge can be fully utilized.

In order to achieve the above object, the present invention provides a method for detecting sensitive text based on knowledge-graph, comprising:

crawling the existing knowledge in the network, and preprocessing the existing knowledge to obtain a knowledge graph network;

acquiring a sensitive text in a network, and preprocessing the sensitive text to obtain a training corpus;

obtaining coding information of a text detection model according to the training corpus and the knowledge graph network, and converting the coding information into vector representation to obtain a final text detection model;

and preprocessing the text to be tested, and obtaining a detection result according to the text detection model.

According to one aspect of the invention, the method for obtaining the knowledge graph network comprises the following steps:

the existing knowledge in the open source community and the information disclosure website is obtained through a web crawler technology, a data set is obtained through collection, the data set is processed through an entity recognition and relation extraction technology, structured data of the data set are obtained, and the knowledge graph network is formed.

According to one aspect of the present invention, the method for obtaining the corpus comprises:

the sensitive texts in the open source community and the information disclosure website are acquired through the web crawler technology, stop words and special symbols in the sensitive texts are deleted, and the sensitive texts are segmented by length to obtain the training corpus.

According to one aspect of the invention, the corpus comprises entities and instances corresponding to the entities, custom identifiers are inserted into front and rear positions of the instances, different entities correspond to different custom identifiers, different instances of the same entity correspond to the same custom identifier, anchors are set for the entities, and position information of the corpus is obtained through language model coding.

According to one aspect of the invention, the related concepts of each entity and the confidence degrees corresponding to the related concepts are extracted from the knowledge graph network, and if an entity has fewer than 10 related concepts, the confidence degree of each unfilled position is set to 0.

According to one aspect of the invention, the entities and related concepts are preprocessed and supplemented by crawling Wikipedia text, and if an entity is absent from the knowledge graph network, the entity is represented by its Wikipedia information instead, which is encoded by the language model and max pooling.

According to one aspect of the invention, a weight value of the related concept is obtained through softmax operation according to the confidence degree, a vector set is obtained according to the weight value and the vector representation, a vector representation of the entity is obtained according to the vector set, and data information interaction between the training corpus and the knowledge graph network is achieved.

To achieve the above object, the present invention provides a knowledge-graph-based sensitive text detection system, comprising:

a knowledge graph network establishment module: crawling the existing knowledge in the network, and preprocessing the existing knowledge to obtain a knowledge graph network;

the training corpus building module: acquiring a sensitive text in a network, and preprocessing the sensitive text to obtain a training corpus;

the text detection model construction module: obtaining coding information of a text detection model according to the training corpus and the knowledge graph network, and converting the coding information into vector representation to obtain a final text detection model;

a prediction result module: and preprocessing the text to be tested, and obtaining a detection result according to the text detection model.

To achieve the above object, the present invention provides an electronic device, which includes a processor, a memory, and a computer program stored in the memory and running on the processor, wherein the computer program, when executed by the processor, implements the above method for detecting sensitive text based on a knowledge graph.

To achieve the above object, the present invention provides a computer-readable storage medium, on which a computer program is stored, which when executed by a processor implements the above-mentioned method for detecting sensitive texts based on knowledge-graph.

Based on this, the beneficial effects of the invention are:

(1) sensitive texts are detected through a knowledge graph network, so that the weaknesses of traditional technologies against evasion methods such as pinyin replacement, sequence perturbation and reference replacement are avoided;

(2) the training corpus and the knowledge graph network are converted into vector representation, so that the interactivity between the training corpus and the knowledge graph network is enhanced, and the accuracy of the text detection model is improved.

Drawings

FIG. 1 schematically represents a flow diagram of a knowledge-graph based sensitive text detection method according to the present invention;

FIG. 2 schematically represents a diagram of a sensitive text three-tier inference mechanism in accordance with the present invention;

FIG. 3 schematically represents an architecture diagram of a sensitive text detection model according to the present invention;

FIG. 4 schematically represents a flow diagram of a knowledge-graph based sensitive text detection system according to the present invention.

Detailed Description

The present invention will now be discussed with reference to exemplary embodiments, it being understood that the embodiments discussed are only for the purpose of enabling a person of ordinary skill in the art to better understand and thus implement the contents of the present invention, and do not imply any limitation on the scope of the present invention.

As used herein, the term "include" and its variants are to be read as open-ended terms meaning "including, but not limited to". The term "based on" is to be read as "based, at least in part, on", and the terms "one embodiment" and "an embodiment" are to be read as "at least one embodiment".

Fig. 1 schematically shows a flow chart of a method for detecting sensitive text based on a knowledge graph according to the present invention. As shown in fig. 1, the method comprises the following steps:

101: crawling the existing knowledge in the network, and preprocessing the existing knowledge to obtain a knowledge graph network;

102: acquiring a sensitive text in a network, and preprocessing the sensitive text to obtain a training corpus;

103: obtaining coding information of the text detection model according to the training corpus and the knowledge graph network, and converting the coding information into vector representation to obtain a final text detection model;

104: and preprocessing the text to be tested, and obtaining a detection result according to the text detection model.

According to one embodiment of the invention, the method for obtaining the knowledge graph network comprises the following steps:

the method comprises the steps of obtaining existing knowledge in open source communities and information disclosure websites through a web crawler technology, summarizing to obtain data sets, processing the data sets through an entity recognition and relation extraction technology to obtain structured data of the data sets, and forming a knowledge graph network.
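As a rough illustration of how the structured data could form a graph, the sketch below groups extracted triples by head entity. The triple format, the example entities, and the function name `build_graph` are all hypothetical; the text above does not specify which extraction tools or storage format are used.

```python
from collections import defaultdict

# Hypothetical output of an entity-recognition + relation-extraction step:
# (head entity, relation, tail entity, confidence) tuples.
triples = [
    ("apple", "is_a", "fruit", 0.9),
    ("apple", "is_a", "company", 0.7),
    ("banana", "is_a", "fruit", 0.95),
]

def build_graph(triples):
    """Group extracted triples by head entity into an adjacency map."""
    graph = defaultdict(list)
    for head, relation, tail, confidence in triples:
        graph[head].append((relation, tail, confidence))
    return dict(graph)

graph = build_graph(triples)
```

Keeping the confidence on each edge is a design choice made here so that later steps, which rely on per-concept confidence degrees, can read it directly from the graph.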

According to one embodiment of the present invention, the method for obtaining the corpus comprises:

Fig. 2 schematically shows a diagram of the sensitive text three-layer inference mechanism according to the present invention. As shown in fig. 2, sensitive texts in the open source community and the information disclosure website are obtained through a web crawler technology, stop words and special symbols in the sensitive texts are deleted, and the sensitive texts are segmented by length to obtain the training corpus.
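The cleaning and segmentation steps can be sketched as follows. This is a minimal illustration only: the stop-word list, the whitespace tokenization, and the segment length `MAX_LEN` are assumptions, since the text does not specify them (Chinese text would in practice need a word segmenter rather than `split()`).

```python
import re

# Hypothetical stop-word list and segment length; neither is given in the text.
STOP_WORDS = {"the", "a", "an", "of"}
MAX_LEN = 5  # tokens per segment, kept small for illustration

def preprocess(text, stop_words=STOP_WORDS, max_len=MAX_LEN):
    """Strip special symbols, drop stop words, and split into fixed-length segments."""
    # Replace everything except word characters and whitespace with a space.
    cleaned = re.sub(r"[^\w\s]", " ", text)
    tokens = [t for t in cleaned.lower().split() if t not in stop_words]
    return [tokens[i:i + max_len] for i in range(0, len(tokens), max_len)]

segments = preprocess("The price of an apple!!! is #rising# in the market, they say.")
# segments == [["price", "apple", "is", "rising", "in"], ["market", "they", "say"]]
```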

According to an embodiment of the present invention, fig. 3 schematically shows an architecture diagram of the sensitive text detection model according to the present invention. As shown in fig. 3, the training corpus includes entities and instances corresponding to the entities. Custom identifiers are inserted immediately before and after each instance; different entities correspond to different custom identifiers, and different instances of the same entity correspond to the same custom identifier. Anchors are set for the entities, and position information of the training corpus is obtained through language model coding.
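The identifier insertion can be illustrated with a short sketch. The marker format `[E0] ... [/E0]`, the function name `mark_entities`, and the restriction to single-token instances are assumptions made for brevity; the text only requires that instances of the same entity share an identifier and that different entities use different ones.

```python
def mark_entities(tokens, entity_instances):
    """Wrap each instance token in a marker tied to its entity.

    entity_instances maps an entity name to the set of surface forms
    (instances) that refer to it. The [E{k}] marker format is hypothetical.
    """
    # Assign one marker index per entity, in a stable (sorted) order.
    marker_of = {}
    for k, (entity, instances) in enumerate(sorted(entity_instances.items())):
        for inst in instances:
            marker_of[inst] = k
    marked = []
    for tok in tokens:
        if tok in marker_of:
            k = marker_of[tok]
            marked += [f"[E{k}]", tok, f"[/E{k}]"]
        else:
            marked.append(tok)
    return marked

tokens = ["apple", "sells", "phones", "and", "aapl", "rose"]
entities = {"Apple Inc.": {"apple", "aapl"}, "phone": {"phones"}}
marked = mark_entities(tokens, entities)
```

Here both "apple" and "aapl" receive the `[E0]` marker because they are instances of the same entity, while "phones" receives `[E1]`. In a setup like this, the position of an opening marker could serve as the anchor whose encoded state is read out after language model coding, though the text does not state how the anchors are chosen.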

According to one embodiment of the invention, the related concepts of each entity and the confidence degrees corresponding to the related concepts are extracted from the knowledge graph network, and if an entity has fewer than 10 related concepts, the confidence degree of each unfilled position is set to 0.
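The padding rule can be sketched as below. The placeholder name `PAD`, the dictionary layout of the graph, and the choice to keep the 10 highest-confidence concepts when more than 10 exist are assumptions; the text only states what happens when there are fewer than 10.

```python
K = 10          # number of related-concept slots per entity, per the text above
PAD = "<pad>"   # hypothetical placeholder for an empty slot

def top_concepts(graph, entity, k=K):
    """Return exactly k (concept, confidence) pairs, padding with confidence 0."""
    concepts = sorted(graph.get(entity, []), key=lambda c: c[1], reverse=True)[:k]
    return concepts + [(PAD, 0.0)] * (k - len(concepts))

graph = {"apple": [("fruit", 0.9), ("company", 0.7), ("tree", 0.2)]}
slots = top_concepts(graph, "apple")
```

A fixed number of slots keeps every entity's concept list the same shape, which simplifies batching in the later weighting step.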

According to one embodiment of the invention, the entities and related concepts are preprocessed and supplemented by crawling Wikipedia text, and if an entity is absent from the knowledge graph network, the entity is represented by its Wikipedia information instead, which is encoded by a language model and max pooling.
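Max pooling over encoded tokens can be illustrated as follows. The token vectors here are made-up numbers standing in for language model outputs, and the function name `max_pool` is hypothetical; the text does not specify the encoder or the vector dimension.

```python
def max_pool(token_vectors):
    """Element-wise maximum over a list of equal-length token vectors."""
    return [max(column) for column in zip(*token_vectors)]

# Hypothetical 3-dimensional encodings of four tokens from a Wikipedia snippet.
encoded = [
    [0.1, -0.4, 0.3],
    [0.5, 0.2, -0.1],
    [-0.2, 0.0, 0.9],
    [0.0, -0.3, 0.2],
]
pooled = max_pool(encoded)
# pooled == [0.5, 0.2, 0.9]
```

Pooling reduces a variable-length snippet to a single fixed-size vector, so an entity described only by Wikipedia text can be handled in the same way as one found in the graph.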

According to one embodiment of the invention, the weight value of the related concept is obtained through softmax operation according to the confidence degree, the vector set is obtained according to the weight value and the vector representation, the vector representation of the entity is obtained according to the vector set, and the interaction of data information between the training corpus and the knowledge graph network is realized.
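A minimal sketch of the confidence weighting is given below. Combining the weighted vectors by summation is an assumption, as the text does not say how the vector set is reduced to a single entity vector; the example vectors and the zero vector used for the padded slot are likewise illustrative.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def entity_vector(confidences, concept_vectors):
    """Confidence-weighted sum of concept vectors (summation is an assumption)."""
    weights = softmax(confidences)
    dim = len(concept_vectors[0])
    return [
        sum(w * vec[d] for w, vec in zip(weights, concept_vectors))
        for d in range(dim)
    ]

confidences = [0.9, 0.7, 0.0]                       # last slot is a padded concept
vectors = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]      # padded slot as a zero vector
weights = softmax(confidences)
vec = entity_vector(confidences, vectors)
```

Note that with a plain softmax a slot with confidence 0 still receives a non-zero weight. In this sketch it contributes nothing because its vector is zero; whether the original method masks such slots is not stated.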

According to an embodiment of the invention, in order to test the effect of the invention, 150,000 sensitive texts were collected, of which 95% were used as the training set and 5% as the test set. The model was trained on the training set according to the scheme of the invention and, after training, evaluated on the test set. To verify the detection effect, precision, recall and the F1 value were selected as the evaluation indexes.

Precision = (number of texts correctly classified as sensitive / total number of texts classified as sensitive) x 100%.

Recall = (number of texts correctly classified as sensitive / total number of sensitive texts) x 100%.

F1 value: in order to compare the advantages and disadvantages of different algorithms, the F1 value is defined on the basis of precision and recall to evaluate the two together: F1 value = 2 x precision x recall / (precision + recall).

The existing models CNN, GRU, LSTM and BERT were selected as the reference models. The CNN model had a precision of 70.1%, a recall of 61.2% and an F1 value of 65.3%; the GRU model had a precision of 69.7%, a recall of 59.5% and an F1 value of 64.2%; the LSTM model had a precision of 66.5%, a recall of 71.8% and an F1 value of 68.9%; the BERT model had a precision of 70.1%, a recall of 74.5% and an F1 value of 72.0%. The text detection model of the invention had a precision of 84.7%, a recall of 86.9% and an F1 value of 85.7%. These data show that the text detection model provided by the invention identifies sensitive texts better than the reference models.
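For reference, the three evaluation indexes can be computed as in the sketch below, using the standard definitions of precision, recall and F1 for a binary task. The label encoding (1 for sensitive, 0 otherwise), the toy labels, and the function names are illustrative assumptions, not part of the described experiment.

```python
def precision_recall_f1(y_true, y_pred):
    """Precision, recall and F1 for the positive (sensitive = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Toy labels: 3 true positives, 1 false positive, 1 false negative.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 0, 1, 1, 0, 1]
p, r, f1 = precision_recall_f1(y_true, y_pred)   # 0.75, 0.75, 0.75

# Recomputing F1 from the rounded figures quoted above for the proposed model.
check = f1_from(84.7, 86.9)
```

Recomputing from the rounded precision and recall gives roughly 85.8, close to the quoted 85.7; the small difference is consistent with the original figure having been computed from unrounded values.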

Furthermore, to achieve the above objects, the present invention provides a system for detecting sensitive texts based on a knowledge graph. Fig. 4 schematically shows a flow chart of the system according to the present invention. As shown in fig. 4, the system comprises:

the training corpus building module: acquiring a sensitive text in a network, and preprocessing the sensitive text to obtain a training corpus;

the text detection model construction module: obtaining coding information of the text detection model according to the training corpus and the knowledge graph network, and converting the coding information into vector representation to obtain a final text detection model;

as shown in fig. 2, sensitive texts in the open-source community and the information disclosure website are obtained through a web crawler technology, stop words and special symbols in the sensitive texts are deleted, and the length of the sensitive texts is segmented to obtain the corpus.

According to an embodiment of the present invention, as shown in fig. 3, the corpus includes entities and instances corresponding to the entities, custom identifiers are inserted into front and rear positions of the instances, different entities correspond to different custom identifiers, different instances of the same entity correspond to the same custom identifier, anchors are set for the entities, and the position information of the corpus is obtained through language model coding.

To achieve the above object, the present invention also provides an electronic device, including: the system comprises a processor, a memory and a computer program stored on the memory and capable of running on the processor, wherein the computer program realizes the above-mentioned sensitive text detection method based on the knowledge graph when being executed by the processor.

To achieve the above object, the present invention further provides a computer-readable storage medium, on which a computer program is stored, and the computer program, when executed by a processor, implements the above method for detecting sensitive texts based on knowledge-graph.

Those of ordinary skill in the art will appreciate that the modules and algorithm steps described in connection with the embodiments disclosed herein can be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described apparatuses and devices may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is merely a logical division, and in actual implementation, there may be other divisions, for example, multiple modules or components may be combined or integrated into another system, or some features may be omitted, or not implemented. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or modules, and may be in an electrical, mechanical or other form.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical modules, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.

In addition, each functional module in the embodiments of the present invention may be integrated into one processing module, or each module may exist alone physically, or two or more modules may be integrated into one module.

The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program codes, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

The above description is only a preferred embodiment of the application and is illustrative of the principles of the technology employed. It will be appreciated by a person skilled in the art that the scope of the invention as referred to in the present application is not limited to the embodiments with a specific combination of the above-mentioned features, but also covers other embodiments with any combination of the above-mentioned features or their equivalents without departing from the inventive concept. For example, the above features may be replaced with (but not limited to) features having similar functions disclosed in the present application.

It should be understood that the order of execution of the steps in the summary of the invention and the embodiments of the present invention does not absolutely imply any order of execution, and the order of execution of the steps should be determined by their functions and inherent logic, and should not be construed as limiting the process of the embodiments of the present invention.

Claims

1. The method for detecting the sensitive text based on the knowledge graph is characterized by comprising the following steps:

2. The method for detecting sensitive texts based on knowledge-graph according to claim 1, wherein the method for obtaining knowledge-graph network is as follows:

3. The knowledge-graph-based sensitive text detection method according to claim 2, wherein the method for obtaining the training corpus comprises:

4. The knowledge-graph-based sensitive text detection method according to claim 1, wherein the corpus comprises entities and instances corresponding to the entities, custom identifiers are inserted at front and rear positions of the instances, different entities correspond to different custom identifiers, different instances of the same entity correspond to the same custom identifier, anchors are set for the entities, and position information of the corpus is obtained through language model coding.

5. The method of claim 4, wherein related concepts of each of the entities and confidence levels corresponding to the related concepts are extracted from the knowledge-graph network, and if an entity has fewer than 10 related concepts, the confidence level of each unfilled position is set to 0.

6. The knowledge-graph-based sensitive text detection method according to claim 5, characterized in that the entities and related concepts are preprocessed and supplemented by crawling Wikipedia text, and if an entity is absent from the knowledge-graph network, the entity is represented by its Wikipedia information instead, which is encoded by the language model and max pooling.

7. The method of claim 6, wherein the weight values of the related concepts are obtained through softmax operation according to the confidence level, a vector set is obtained according to the weight values and the vector representation, a vector representation of the entity is obtained according to the vector set, and the training corpus is enabled to interact with data information of the knowledge-graph network.

8. A sensitive text detection system based on knowledge-graph, comprising:

9. An electronic device comprising a processor, a memory, and a computer program stored on the memory and executable on the processor, the computer program, when executed by the processor, implementing a method for knowledge-graph-based sensitive text detection according to any one of claims 1 to 7.

10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, implements a method for knowledge-graph based sensitive text detection according to any one of claims 1 to 7.