CN115759071A - Government affair sensitive information identification system and method based on big data

Info

Publication number: CN115759071A
Application number: CN202211424814.9A
Authority: CN
Inventors: 李先美; 雷海峰
Original assignee: Shenzhen Zhongke Baotai Technology Co ltd
Current assignee: Shenzhen Zhongke Baotai Technology Co ltd
Priority date: 2022-11-14
Filing date: 2022-11-14
Publication date: 2023-03-07

Abstract

The invention provides a government affair sensitive information identification system and method based on big data, belonging to the technical field of big data analysis. The method comprises the following steps: step 1, acquiring text data to be analyzed; step 2, converting the text data into vector form; step 3, constructing a sensitive information identification analysis model and receiving the text data in vector form; step 4, identifying sensitive information in the text data by using the sensitive information identification analysis model; and step 5, outputting the identification analysis result. By effectively identifying sensitive words, the method reduces the possibility that malicious text spreads widely in practical applications. It further analyzes different text tampering modes and expands the sensitive word database accordingly, which improves the accuracy of comparison-based detection.

Description

Government affair sensitive information identification system and method based on big data

Technical Field

The invention belongs to the technical field of big data analysis, and particularly relates to a government affair sensitive information identification system and method based on big data.

Background

With the development of internet technology, electronic data has gradually come to dominate data management. Government affair information is an important category of information: it is a general term for the information, conditions, data, charts, text materials, audio-video materials and the like that reflect government work and related matters in government activities. Government affair information should meet three conditions at the same time: first, it is information held by a government agency, meaning information that the agency lawfully generates, collects and integrates; second, it is information related to economic and social management and public services; and third, it is content carried by a specific medium. Because government affair information touches all aspects of society, sensitive words in the government affairs field, more than in other application fields, often skew public understanding and the direction in which public opinion develops. How to deeply mine and analyze sensitive information in massive texts and improve recognition results is therefore a problem that urgently needs to be solved.

In the prior art, there are technical schemes for identifying and screening government affair sensitive information:

Prior art 1 (CN 114386408 A) discloses a government affair sensitive information identification method, apparatus, device, medium, and program product. Specifically, it discloses obtaining at least one government affair sentence that includes text content associated with government affair data; generating a first sentence vector based on semantic information of the at least one government affair sentence; and taking the first sentence vector as the input of an identification model to obtain a classification result output by the model, and determining the sensitive information related to the at least one government affair sentence according to the classification result.

Prior art 2 (CN 113792308 A) discloses a security behavior risk analysis method for government affair sensitive data. Specifically, it discloses studying and judging the use of sensitive data, identifying and automatically sorting sensitive assets, and assisting in judging the nature and motivation of sensitive data circulation; risk identification and analysis are then performed by using the associated risk strategies and risk rules.

Prior art 3 (CN 111782811 A) discloses an e-government sensitive text detection method based on a convolutional neural network and a support vector machine. Specifically, a text vector is constructed by using the TF-IDF weighting technique, a sensitive-field text classification model is built with a support vector machine algorithm through continuous machine learning training, and the model is used to judge whether a text belongs to the sensitive field.

When the prior art processes government affair sensitive information, the following problems remain:

1. Prior art 1 discloses extracting semantics into a vector and training a recognition model, but it does not disclose the type of recognition model or the recognition method, which are the key technology. This is a common problem of such techniques: they remain at the level of screening sensitive information by semantic vectors. Conventional semantic-vector techniques from the natural language (NL) field are accurate on a standard lexicon, but they cannot quickly and accurately recognize a non-standard lexicon, especially emotionally charged language, and they frequently overlook sensitive information that is carried by strongly emotional words rather than by the main ideographic vocabulary.

2. Prior art 2 discloses a label-based information detection technique. When the labels are inaccurate, detection precision drops rapidly, and when large-scale labeled data is lacking, both detection efficiency and precision are low.

3. Prior art 3 identifies sensitive information with a convolution-based algorithm, which is a high-complexity model with high computing power requirements. When the computing power of the operating system or platform is low, it can handle a small amount of data, but when identifying large data it suffers from identification delay, memory occupation and similar problems caused by computing power congestion.

Disclosure of Invention

Object of the invention: to provide a government affair sensitive information identification system and method based on big data, so as to solve the above problems in the prior art. By effectively identifying sensitive words, the possibility that malicious text spreads widely in practical applications is reduced.

Technical scheme: in a first aspect, a method for identifying government affair sensitive information based on big data is provided, which specifically includes the following steps:

step 1, acquiring text data to be analyzed;

step 1.1, preprocessing the text data, and extracting the subject, predicate, object, attributive, adverbial, complement and punctuation information of the text;

step 1.2, extracting keywords after preprocessing; the keyword extraction expression is as follows:

In the formula, one quantity represents each preprocessed value carrying an emotion degree, and the subscript c represents the serial number of each such value; a further parameter is determined based on a prior sensitive word frequency library, and the subscript t represents the serial number of the parameter carrying the emotion degree; z represents a criticality parameter, which refers to the frequency with which the keyword occurs in the current network popularity ranking;

step 2, converting the text data into a vector form;

step 3, constructing a sensitive information identification analysis model and receiving text data in a vector form;

step 4, identifying the sensitive information of the text data by using the sensitive information identification analysis model:

when the sensitive word is of the similar-pronunciation type, first parsing the acquired text into phonetic codes, and then calculating the edit distance between the phonetic codes to obtain the semantic similarity between the sensitive word and the word to be detected;

when the sensitive word is of the abbreviated type, first extracting and combining the initials of the word to be analyzed, and then using the combined initials as the target string and template string for matching;

when the sensitive word is of the split type, first converting the split word into region codes, and then matching the obtained region codes, thereby matching the word to be analyzed;

step 5, outputting a recognition analysis result.

In the process of recognizing sensitive information through the sensitive information recognition analysis model, the sensitive word database that stores sensitive words is further expanded to improve recognition accuracy; sensitive words are then detected by classifying them by type and applying the corresponding processing mode to each type. The expansion of the sensitive word database comprises: expanding sensitive words with similar pronunciations, expanding sensitive words in abbreviated form, and expanding sensitive words in split form.

When the type of the sensitive words is the sensitive words with similar pronunciations, the acquired text is firstly analyzed into phonetic codes, and then the semantic similarity between the sensitive words and the words to be detected is obtained through the edit distance calculation of the phonetic codes.

When the type of the sensitive word is the sensitive word in the form of short name, firstly, the initials of the word to be analyzed are extracted and combined, and then the initials are used as the matched target string and template string.

When the type of the sensitive word is the sensitive word in the split form, firstly, the split word is converted into the region code, and then, the obtained region code is matched, so that the matching of the word to be analyzed is realized.

When the sensitive words are analyzed by using the sensitive information recognition and analysis model, the semantic tendency of the current text is further mined, extreme viewpoints are recognized by analyzing viewpoint tendency, and the extreme viewpoints are forwarded to the person responsible, which provides a basis for subsequent manual viewpoint monitoring and control.

First, a sensitive word fingerprint library is established. The semantic similarity distance between the extracted sensitive words and the entries in the sensitive word fingerprint library is then calculated by semantic similarity detection. Finally, by comparing the calculated distance against a threshold, the viewpoint tendency corresponding to the text data is found in the sensitive word fingerprint library.

Based on the obtained semantic fingerprints, when texts with high similarity in the text data undergo rapid viewpoint tendency identification, the corresponding similarity calculation expression is as follows:

In the formula, ⊕ represents an exclusive-OR operation; numful() represents a function that counts the bits whose value is 1; and F_i and F_j are the operands, that is, the two values between which the distance is calculated.

In a second aspect, a government affair sensitive information identification system based on big data is provided for realizing an identification method of government affair sensitive information, and the system specifically comprises the following modules:

the data acquisition module is used for reading government affair text data to be analyzed;

a data conversion module configured to convert the read text data into a desired form;

the model construction module is arranged for constructing a sensitive information identification analysis model;

the data analysis module is used for analyzing the read text data by utilizing the sensitive information identification analysis model;

and the data output module is used for outputting the analysis result of the data analysis module.

In some implementations of the second aspect, when massive sensitive words are analyzed, the data acquisition module first reads the text data to be analyzed; second, the data conversion module converts the read text data into the required form; third, the model construction module constructs the sensitive information identification analysis model; then, the data analysis module uses the model to analyze the sensitive information in the read text data; and finally, the data output module outputs the analysis result of the data analysis module.

In a third aspect, a big data-based government affairs sensitive information identification device is provided, which includes: a processor and a memory storing computer program instructions.

The processor reads and executes computer program instructions to realize the government affair sensitive information identification method.

In a fourth aspect, a computer-readable storage medium having computer program instructions stored thereon is presented. The computer program instructions, when executed by the processor, implement a government-sensitive information identification method.

Advantageous effects:

1. The invention provides a government affair sensitive information identification system and method based on big data, which reduces the possibility that malicious text spreads widely in practical applications by effectively identifying sensitive words.

2. The method further analyzes different text tampering modes and expands the sensitive word database accordingly, effectively improving the accuracy of comparison-based detection.

3. The sensitive information identification analysis model provided by the invention further mines the semantic tendency of the current text and identifies extreme viewpoints by analyzing viewpoint tendency, thereby strengthening the monitoring of malicious text.

4. The invention introduces an emotion analysis method that extracts keywords based on emotion degree, improving the extraction precision of sensitive information when sudden public opinion events occur. In particular, because emotion keywords are extracted based on network popularity, sensitive information related to a sudden public opinion event can be determined more accurately and efficiently.

5. Identification is based on the sensitive word fingerprint library and semantic fingerprints. When a user attempts to evade monitoring, for example by using homophones, pinyin, visually similar characters or symbol segmentation, the traditional approach of lexicon comparison can be evaded, but the semantic fingerprint cannot be completely hidden.

Drawings

FIG. 1 is a flow chart of data processing according to the present invention.

Detailed Description

In the following description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. It will be apparent, however, to one skilled in the art, that the present invention may be practiced without one or more of these specific details. In other instances, well-known features have not been described in order to avoid obscuring the present invention.

The applicant considers that a great amount of text data is an important component for constructing the electronic information age, and the sensitivity analysis of the web text is one of important factors for monitoring current public opinions. In order to avoid social security threats caused by propagation of certain sensitive words, a government affair sensitive information recognition system and method based on big data are provided, and emotion trends represented in texts are obtained through mining and analyzing mass text data, so that the security and stability of a network environment in the aspect of government affair application are maintained.

Example one

In one embodiment, a government affair sensitive information identification method based on big data is provided. By mining and analyzing a large amount of government affair data, the method uncovers the potential relationship between sensitive words and the direction of public opinion, supervises network data at the source, and improves the stability of network security. As shown in FIG. 1, the method specifically includes the following steps:

step 1, acquiring mass text data to be analyzed;

Specifically, the obtained text data is first preprocessed. Preprocessing refers to extracting the subject, predicate, object, attributive, adverbial, complement and punctuation information of the text, which reduces interference from noisy data and improves subsequent text recognition accuracy. Keywords are then extracted from the text, and the sensitive words are analyzed through the extracted keywords.

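step 2, converting the text data into a vector form;

step 3, constructing a sensitive information identification analysis model and receiving text data in a vector form;

step 4, identifying the sensitive information of the text data by using the sensitive information identification analysis model;

step 5, outputting a recognition analysis result.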

The embodiment provides a government affair sensitive information identification method based on big data, and the possibility that malicious texts are widely spread in practical application is reduced through effective identification of sensitive words.

Example two

In a further embodiment based on the first embodiment, the meaning of a sentence is usually carried by its keywords, so the data processing further includes a keyword extraction step. Specifically, the keyword extraction expression is as follows:

In the formula, one quantity represents each preprocessed value carrying an emotion degree, and the subscript c represents the serial number of each such value; a further parameter is determined based on a prior sensitive word frequency library, and the subscript t represents the serial number of the parameter carrying the emotion degree; z represents a criticality parameter, which refers to the frequency with which the keyword occurs in the current network popularity ranking.

During preprocessing, the text data is first divided by word segmentation, and interfering words such as modal particles are then removed, yielding a cleaner text for analysis.

For the sensitive word analysis process, this embodiment further provides keyword extraction from the text data. Extracting the keywords that best represent what the text expresses effectively speeds up sensitive word identification and shortens the running time.
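The preprocessing described here (segment first, then drop modal particles) can be sketched as follows. This is a minimal illustration, not the patented implementation: the forward maximum matching segmenter, the toy dictionary and the particle list are all assumptions, since the source names no segmentation algorithm or word list.

```python
# Hypothetical toy dictionary and particle list; the source specifies neither.
TOY_DICTIONARY = {"政务", "信息", "公开"}
MODAL_PARTICLES = {"吗", "呢", "吧", "啊", "呀"}


def segment(text, dictionary, max_len=4):
    """Forward maximum matching: take the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def preprocess(text):
    """Segment the text, then remove modal particles."""
    return [t for t in segment(text, TOY_DICTIONARY) if t not in MODAL_PARTICLES]


print(preprocess("政务信息公开吗"))
```

A production system would likely rely on a full segmentation library and a curated stop-word list rather than these stand-ins.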

Example three

In a further embodiment based on the first embodiment, in practical applications, users with improper intentions often deform sensitive words to change how the text is presented, so that maliciously tampered text escapes detection while still conveying semantic information similar to the sensitive words to other users. This embodiment therefore classifies sensitive words by type and applies a corresponding processing mode to each type. Specifically, according to the data analysis requirements, sensitive words are divided into: sensitive words with similar pronunciations, sensitive words in abbreviated form, and sensitive words in split form.

In a further embodiment, since the semantic propagation of Chinese text is mainly determined by pinyin, users with improper intentions often substitute text with a similar pronunciation to evade sensitive word detection. Therefore, when the sensitive word is of the similar-pronunciation type, the acquired text is first parsed into phonetic codes, and the semantic similarity between the sensitive word and the word to be detected is then obtained by calculating the edit distance between the phonetic codes.
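A minimal sketch of this idea, assuming the phonetic code is toneless pinyin and the edit distance is the Levenshtein distance. The source fixes neither choice, and the tiny pinyin table and the normalization into a 0 to 1 similarity score are hypothetical stand-ins for a real pronunciation dictionary and scoring rule.

```python
# Hypothetical toneless pinyin table covering only the characters used below.
TOY_PINYIN = {"政": "zheng", "府": "fu", "正": "zheng", "负": "fu", "真": "zhen", "服": "fu"}


def to_pinyin(word):
    """Concatenate the pinyin of each character; unknown characters pass through."""
    return "".join(TOY_PINYIN.get(ch, ch) for ch in word)


def edit_distance(a, b):
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


def pinyin_similarity(word_a, word_b):
    """Similarity in [0, 1]; 1.0 means the phonetic codes are identical."""
    pa, pb = to_pinyin(word_a), to_pinyin(word_b)
    longest = max(len(pa), len(pb)) or 1
    return 1 - edit_distance(pa, pb) / longest


print(pinyin_similarity("政府", "正负"))  # identical toneless pinyin
print(pinyin_similarity("政府", "真服"))  # one letter apart
```

A word would then be flagged when its similarity to a database entry exceeds a chosen threshold, which the source leaves unspecified.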

In a further embodiment, everyday language habits often involve communicating with omitted words, and when too much is omitted, the expression becomes obscure, which reduces the chance of detecting sensitive words. Therefore, when the sensitive word is of the abbreviated type, the initials of the word to be analyzed are first extracted and combined, and the combined initials are then used as the target string and template string for matching.
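The initial-letter matching can be sketched as below. It assumes the initials are the first letters of pinyin syllables and that matching is exact string equality; the lexicon entries are neutral stand-ins, and every name in the block is hypothetical.

```python
# Hypothetical lexicon: database word -> its pinyin syllables.
TOY_LEXICON = {"政府": ["zheng", "fu"], "信息": ["xin", "xi"]}


def initials(syllables):
    """Combine the first letter of each syllable, e.g. ['zheng', 'fu'] -> 'zf'."""
    return "".join(s[0].lower() for s in syllables if s)


def build_templates(lexicon):
    """Map each combined-initials template string to the words that produce it."""
    templates = {}
    for word, syllables in lexicon.items():
        templates.setdefault(initials(syllables), []).append(word)
    return templates


def match_abbreviation(candidate, templates):
    """Return the database words whose template equals the candidate target string."""
    return templates.get(candidate.lower(), [])


templates = build_templates(TOY_LEXICON)
print(match_abbreviation("ZF", templates))
```

Because many unrelated words share the same initials, a match of this kind is probably best treated as a candidate for further analysis rather than a final verdict.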

In a further embodiment, as network slang grows and caters to popular entertainment, words are sometimes split into their character components; for example, the word for "research" is rewritten with one character broken into its component parts (rendered in translation as "stone-breaking study"). Users with improper intentions can likewise use this form, which is hard to detect automatically, to spread malicious meaning. When the sensitive word is of the split type because of malicious splitting, the split word is first converted into region codes, and the obtained region codes are then matched, thereby matching the word to be analyzed.
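The source does not say which region code is meant. The sketch below assumes the GB 2312 row-cell code, which Python can derive by encoding a character as `gb2312` and subtracting 0xA0 from each byte. The component table is hypothetical and holds a single illustrative entry; how the patent actually matches region codes is not disclosed.

```python
def region_code(ch):
    """Return the (row, cell) GB 2312 code of one Chinese character."""
    raw = ch.encode("gb2312")
    if len(raw) != 2:
        raise ValueError("not a two-byte GB 2312 character")
    return raw[0] - 0xA0, raw[1] - 0xA0


def from_region_code(code):
    """Inverse of region_code."""
    row, cell = code
    return bytes([row + 0xA0, cell + 0xA0]).decode("gb2312")


# Hypothetical table: region codes of two components -> code of the whole character.
SPLIT_TABLE = {(region_code("石"), region_code("开")): region_code("研")}


def merge_split_characters(text):
    """Replace known component pairs with the character they were split from."""
    out = []
    i = 0
    while i < len(text):
        key = None
        if i + 1 < len(text):
            try:
                key = (region_code(text[i]), region_code(text[i + 1]))
            except (ValueError, UnicodeEncodeError):
                key = None  # ASCII or characters outside GB 2312
        if key in SPLIT_TABLE:
            out.append(from_region_code(SPLIT_TABLE[key]))
            i += 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)


print(region_code("啊"))
print(merge_split_characters("石开究"))
```

After normalization the text can be compared against the sensitive word database in the ordinary way.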

In a further embodiment, to improve detection accuracy during sensitive word detection, the sensitive word database used for comparison is enriched according to the classified types of sensitive words.

By analyzing the text data and expanding the sensitive word database based on the analysis result, this embodiment lays a solid data foundation for subsequent sensitive word detection.

Example four

In a further embodiment based on the first embodiment, when the sensitive words are analyzed by using the sensitive information identification and analysis model, the semantic tendency of the current text is further mined, extreme viewpoints are identified by analyzing viewpoint tendency, and the extreme viewpoints are forwarded to a supervisor, which provides a basis for subsequent manual viewpoint monitoring and control.

Specifically, a sensitive word fingerprint library is first established, and the semantic similarity distance between the extracted sensitive words and the entries in the library is then calculated by semantic similarity detection. Finally, by comparing the calculated distance against a threshold, the viewpoint tendency corresponding to the text data is found in the sensitive word fingerprint library.

In a further embodiment, when the text data to be analyzed contains many identical passages, repeatedly calling the sensitive information identification analysis model to analyze them one by one consumes a large amount of system resources and wastes computing resources. To address this, semantic fingerprinting is used to quickly identify highly similar text data. In a preferred embodiment, the process of computing semantic fingerprints specifically comprises the following steps:

step 1, performing word segmentation on received text data to obtain a word segmentation set;

step 2, identifying the sensitive words and obtaining fingerprint values corresponding to the sensitive words from the existing fingerprint database;

step 3, transforming the word set in the step 1 by adopting Hash processing to obtain a corresponding binary Hash value;

step 4, carrying out bitwise summation on the obtained hash values to obtain sequence values;

step 5, assigning 0 or 1 according to the sequence value and the positive and negative conditions of the numerical value to further obtain a final semantic fingerprint value of the text;

step 6, repeating the semantic fingerprint computation in a loop until all the text data has been analyzed.
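The steps above resemble the well-known SimHash scheme, and the following minimal sketch is written under that assumption. The MD5-based token hash, the 64-bit width, the +1/-1 voting rule and the optional lookup of stored fingerprints are illustrative choices that the source does not specify.

```python
import hashlib


def token_hash(token, bits=64):
    """Hash a token to a fixed-width integer (MD5 is an assumed choice)."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[: bits // 8], "big")


def semantic_fingerprint(tokens, known_fingerprints=None, bits=64):
    """SimHash-style fingerprint of a segmented text."""
    known_fingerprints = known_fingerprints or {}
    totals = [0] * bits
    for token in tokens:
        # Step 2: reuse a stored fingerprint when the token is already in the library.
        # Step 3: otherwise hash the token to a binary value.
        value = known_fingerprints.get(token)
        if value is None:
            value = token_hash(token, bits)
        # Step 4: bitwise accumulation, +1 for a 1 bit and -1 for a 0 bit.
        for i in range(bits):
            totals[i] += 1 if (value >> i) & 1 else -1
    # Step 5: positive totals become 1, all others become 0.
    fingerprint = 0
    for i, total in enumerate(totals):
        if total > 0:
            fingerprint |= 1 << i
    return fingerprint


# Step 6: loop over every text in the batch.
batch = [["alpha", "beta"], ["alpha", "beta"], ["gamma"]]
fingerprints = [semantic_fingerprint(tokens) for tokens in batch]
print(fingerprints[0] == fingerprints[1])
```

Identical texts always receive identical fingerprints, so repeated passages can be recognized without calling the analysis model again.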

Based on the obtained semantic fingerprints, the viewpoint tendency of highly similar texts in the text data to be processed is quickly identified.

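The similarity is calculated by the following expression:

In the formula, ⊕ represents an exclusive-OR operation; numful() represents a function that counts the bits whose value is 1; and F_i and F_j are the operands, that is, the two values between which the distance is calculated.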
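The expression itself is not reproduced in this text. Read together with the symbol definitions given for it (an exclusive-OR followed by a count of 1 bits), it is consistent with a Hamming distance between two fingerprints. A minimal sketch under that assumption follows; the function names and the 3-bit threshold are hypothetical.

```python
def hamming_distance(f_i, f_j):
    """Count the bit positions in which two fingerprints differ."""
    return bin(f_i ^ f_j).count("1")  # XOR, then count the bits equal to 1


def is_near_duplicate(f_i, f_j, threshold=3):
    """Treat two texts as highly similar when few bits differ (threshold is assumed)."""
    return hamming_distance(f_i, f_j) <= threshold


print(hamming_distance(0b1011, 0b0010))
print(is_near_duplicate(0x00, 0xFF))
```

A smaller distance means more similar fingerprints, so the threshold trades recall against false matches.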

In a further embodiment, text data that merely contains sensitive words does not necessarily mean that the current text expresses a sensitive viewpoint. The proposed sensitive information recognition and analysis model therefore uses deep learning to overcome this semantic shortcoming.

Because each word contributes differently to the text classification result, this embodiment introduces a self-attention mechanism that learns a weight for each word in the sentence. Words of high importance receive higher weights, which highlights their influence on the classification result and further improves the identification accuracy of the model. The main purpose of the self-attention layer is to learn the weight of the word at each position, so that during training the attention of a task shifts to the words that play an important role in the sentence. Because the tasks in multi-task learning share the same input while the importance of each word differs between the two tasks, the word weights are adjusted in the self-attention layer, and the words that play an important role in this embodiment are given larger weights.
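The weighting idea can be illustrated with a simplified attention-pooling sketch. It is not the full query/key/value self-attention layer, and the source does not disclose its architecture: here a single query vector, assumed to be learned elsewhere, scores each word vector, and a softmax turns the scores into weights.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, query):
    """Weight each word vector by its relevance to the query, then sum them."""
    scores = [sum(v_k * q_k for v_k, q_k in zip(v, query)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[k] for w, v in zip(weights, vectors)) for k in range(dim)]
    return weights, pooled


# Hypothetical 2-dimensional word vectors; the third word aligns best with the query.
word_vectors = [[1.0, 0.0], [0.0, 1.0], [3.0, 0.0]]
weights, sentence_vector = attention_pool(word_vectors, query=[1.0, 0.0])
print(weights)
```

The pooled sentence vector would then feed a classifier, with the query vector trained jointly so that important words receive larger weights.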

To effectively improve the performance of the sensitive information identification analysis model, a loss function is used to optimize it. The corresponding loss function expression is as follows:

where y represents the actual value of the analyzed text, the model's predicted output value is denoted by a separate symbol, and s represents the likelihood probability distribution over each class.
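The loss expression is not reproduced in this text, so its exact form is unknown. Purely as an illustration, a commonly used loss for classification over a class probability distribution is the cross-entropy, sketched below; treating the patent's loss as cross-entropy is an assumption, and the names are hypothetical.

```python
import math


def cross_entropy(y_true, y_prob, eps=1e-12):
    """Cross-entropy between a one-hot label and predicted class probabilities."""
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_prob))


# One-hot label for class 1 against an uninformed and a confident prediction.
print(cross_entropy([0, 1], [0.5, 0.5]))
print(cross_entropy([0, 1], [0.1, 0.9]))
```

Under this choice, the loss shrinks as the probability assigned to the true class approaches 1.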

Example five

In one embodiment, a big data-based government affair sensitive information identification system is provided, which is used for implementing a big data-based government affair sensitive information identification method, and specifically includes the following modules: the device comprises a data acquisition module, a data conversion module, a model construction module, a data analysis module and a data output module.

Specifically, the data acquisition module is used for reading mass government affair text data to be analyzed according to analysis requirements; the data conversion module is used for performing form conversion on the read text data according to the file format requirement; the model construction module is used for constructing a sensitive information identification analysis model; the data analysis module is used for analyzing the sensitive information of the converted text data; and the data output module is used for outputting the analysis result obtained by the data analysis module.

In a further embodiment, when massive sensitive words are analyzed, the data acquisition module first reads the text data to be analyzed; second, the data conversion module converts the read text data into the required form; third, the model construction module constructs the sensitive information identification analysis model; then, the data analysis module uses the model to analyze the sensitive information in the read text data; and finally, the data output module outputs the analysis result of the data analysis module.
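The order in which the five modules cooperate can be sketched as a simple function pipeline. The callables passed in are toy stand-ins and every name is hypothetical; the source defines the modules only by their roles.

```python
def run_pipeline(acquire, convert, build_model, output):
    """Chain the five modules in the order described above."""
    texts = acquire()                          # data acquisition module
    vectors = [convert(t) for t in texts]      # data conversion module
    model = build_model()                      # model construction module
    results = [model(v) for v in vectors]      # data analysis module
    output(list(zip(texts, results)))          # data output module
    return results


# Toy stand-ins: "vectorize" by text length and flag texts longer than six characters.
collected = []
results = run_pipeline(
    acquire=lambda: ["alpha beta", "gamma"],
    convert=lambda text: [len(text)],
    build_model=lambda: (lambda vec: vec[0] > 6),
    output=collected.extend,
)
print(collected)
```

Passing the modules in as callables keeps each one replaceable, which mirrors the modular split the system description draws.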

Example six

In one embodiment, a big data-based government affairs sensitive information identification device is provided, which comprises: a processor and a memory storing computer program instructions.

Wherein, the processor reads and executes the computer program instructions to realize the government affair sensitive information identification method.

Example seven

In one embodiment, a computer-readable storage medium having computer program instructions stored thereon is presented.

Wherein the computer program instructions, when executed by the processor, implement a government affairs sensitive information identification method.

As noted above, while the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limited thereto. Various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A government affair sensitive information identification method based on big data is characterized by comprising the following steps:

step 1, acquiring text data to be analyzed;

in the formula (I), the compound is shown in the specification,

step 2, converting the text data into a vector form;

step 5, outputting a recognition analysis result.

2. The government affair sensitive information recognition method based on big data according to claim 1, wherein in the process of recognizing the sensitive information through the sensitive information recognition analysis model, in order to improve the recognition accuracy of the sensitive words, the sensitive word database storing the sensitive words is further expanded.

3. The method for identifying government affairs sensitive information based on big data according to claim 2, wherein the expansion mode of the sensitive word database comprises the following steps:

expanding sensitive words with similar pronunciations;

expanding sensitive words in a short form;

and expanding the sensitive words in a split form.

4. The government affair sensitive information identification method based on big data as claimed in claim 3, wherein when the sensitive words are analyzed by using the sensitive information identification and analysis model, the semantic tendency of the current text is further mined, extreme viewpoints are identified by analyzing viewpoint tendency, and the extreme viewpoints are forwarded to the responsible persons, providing a basis for subsequent manual viewpoint monitoring and control.

5. A government affairs sensitive information identification system based on big data, which is used for realizing the government affairs sensitive information identification method according to any one of claims 1-4, and is characterized by comprising the following modules:

6. A big data-based government affairs sensitive information identification device, comprising:

a processor and a memory storing computer program instructions;

the processor reads and executes the computer program instructions to implement the government affairs sensitive information identification method according to any one of claims 1-4.

7. A computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the government-sensitive information identifying method according to any one of claims 1-4.