CN115759071A - Government affair sensitive information identification system and method based on big data - Google Patents

Publication number
CN115759071A
Authority
CN
China
Prior art keywords
sensitive
sensitive information
data
words
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
CN202211424814.9A
Other languages
Chinese (zh)
Inventor
李先美
雷海峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Zhongke Baotai Technology Co ltd
Original Assignee
Shenzhen Zhongke Baotai Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Zhongke Baotai Technology Co ltd
Priority to CN202211424814.9A
Publication of CN115759071A
Legal status: Withdrawn

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention provides a big-data-based system and method for identifying sensitive government affairs information, and belongs to the technical field of big data analysis. The method comprises the following steps: step 1, acquiring text data to be analyzed; step 2, converting the text data into vector form; step 3, constructing a sensitive information identification analysis model that receives the text data in vector form; step 4, identifying sensitive information in the text data using the sensitive information identification analysis model; and step 5, outputting the identification analysis result. By identifying sensitive words effectively, the method reduces the likelihood that malicious text spreads widely in practical applications. It also analyzes different ways of tampering with text and expands the sensitive word database accordingly, which improves the accuracy of comparison-based detection.

Description

Government affair sensitive information identification system and method based on big data
Technical Field
The invention belongs to the technical field of big data analysis, and particularly relates to a government affair sensitive information identification system and method based on big data.
Background
With the development of internet technology, electronic data has gradually come to dominate data management. Government affairs information is an important category of information: it is a general term for the information, circumstances, data, charts, text materials, audio-video materials and the like that reflect government work and related matters in government activities. Government affairs information should meet three conditions at the same time: first, it is information held by a government agency, meaning information that the agency lawfully generates, collects and integrates; second, it is information related to economic and social management and public services; third, it is content carried by a specific medium. Because government affairs information touches every aspect of society, sensitive words in the field of government affairs are more likely than those in other application fields to distort how public opinion is understood and how it develops. How to mine and analyze sensitive information in depth across massive amounts of text, and how to improve the identification results, is a problem that urgently needs to be solved.
In the prior art, there is a technical scheme for identifying and screening government affair sensitive information:
Prior art 1 (CN 114386408 A) discloses a government affairs sensitive information identification method, apparatus, device, medium and program product. It specifically discloses obtaining at least one government affairs sentence that includes text content associated with government affairs data; generating a first sentence vector based on the semantic information of the at least one government affairs sentence; and using the first sentence vector as the input of an identification model to obtain a classification result output by the model, then determining the sensitive information related to the at least one government affairs sentence according to the classification result.
Prior art 2 (CN 113792308 A) discloses a security behavior risk analysis method for sensitive government affairs data. It specifically discloses studying and judging how sensitive data is used, identifying and automatically sorting sensitive assets, and assisting in judging the nature and motive of sensitive data circulation; risk identification and analysis are then performed using the associated risk strategies and risk rules.
Prior art 3 (CN 111782811 A) discloses an e-government sensitive text detection method based on a convolutional neural network and a support vector machine. It specifically discloses constructing text vectors with the TFIDF weighting technique, building a sensitive-field text classification model with a support vector machine algorithm through continuous machine learning training, and using the model to judge whether a text belongs to the sensitive field.
When the prior art processes government affair sensitive information, the following problems still exist:
1. Prior art 1 discloses semantic extraction on vectors and the training of a recognition model, but it does not disclose the type of recognition model or the recognition method, which are the key technology. This is a common problem with such techniques: they remain at the level of screening sensitive information by semantic vector. Conventional semantic-vector techniques from the natural language (NL) field are accurate on a standard lexicon, but they cannot quickly and accurately recognize a non-standard lexicon, especially emotionally charged language, and they frequently overlook sensitive information that has strong emotional words alongside its main ideographic vocabulary.
2. Prior art 2 discloses a label-based information detection technique and belongs to the technical sub-field of label-based detection. When the labels are inaccurate, detection precision drops rapidly, and when large-scale labelled data is lacking, detection efficiency and precision are low.
3. Prior art 3 discloses identifying sensitive information with a convolution algorithm. This is a high-complexity model with high computing power requirements: when the computing power of the operating system or platform is low it can handle small amounts of data, but when identifying big data it suffers from identification delay, memory occupation and similar problems caused by computing power congestion.
Disclosure of Invention
Object of the invention: a big-data-based system and method for identifying sensitive government affairs information are provided to solve the above problems in the prior art. By identifying sensitive words effectively, the invention reduces the likelihood that malicious text spreads widely in practical applications.
Technical scheme: in a first aspect, a big-data-based method for identifying sensitive government affairs information is provided, which specifically includes the following steps:
step 1, acquiring text data to be analyzed;
step 1.1, preprocessing the text data and extracting the subject, predicate, object, attributive, adverbial, complement and punctuation information of the text;
step 1.2, extracting keywords after preprocessing; the keyword extraction expression is as follows:
[Formula image BDA0003942071010000021: keyword extraction expression, not reproduced in this text]
where the symbol shown in image BDA0003942071010000022 represents each preprocessed numerical value carrying a degree of emotion, and the subscript c is the serial number of that value; the symbol shown in image BDA0003942071010000023 represents a parameter carrying a degree of emotion, determined from a previous sensitive word frequency library, and the subscript t is the serial number of that parameter; Z represents a criticality parameter, namely the frequency with which the keyword occurs in the current network popularity ranking;
step 2, converting the text data into a vector form;
step 3, constructing a sensitive information identification analysis model and receiving text data in a vector form;
step 4, identifying sensitive information in the text data using the sensitive information identification analysis model:
when the sensitive word is of the similar-pronunciation type, first parsing the acquired text into phonetic codes, and then calculating the edit distance between the phonetic codes to obtain the semantic similarity between the sensitive word and the word to be detected;
when the sensitive word is of the abbreviation type, first extracting and combining the initial letters of the words to be analyzed, and then using the combined initials as the target string and template string for matching;
when the sensitive word is of the split type, first converting the split word into region codes, and then matching the obtained region codes, thereby matching the word to be analyzed;
step 5, outputting the identification analysis result.
To improve the accuracy with which the sensitive information identification analysis model recognizes sensitive words, the sensitive word database that stores the sensitive words is further expanded; sensitive words are then detected by classifying them by type and applying the processing mode that corresponds to each type. The sensitive word database is expanded in three ways: with similar-pronunciation sensitive words, with abbreviated sensitive words, and with split-form sensitive words.
When the sensitive word is of the similar-pronunciation type, the acquired text is first parsed into phonetic codes, and the semantic similarity between the sensitive word and the word to be detected is then obtained by calculating the edit distance between the phonetic codes.
When the sensitive word is of the abbreviation type, the initial letters of the words to be analyzed are first extracted and combined, and the combined initials are then used as the target string and template string for matching.
When the sensitive word is of the split type, the split word is first converted into region codes, and the obtained region codes are then matched, thereby matching the word to be analyzed.
When sensitive words are analyzed with the sensitive information identification analysis model, the semantic tendency of the current text is further mined. Extreme viewpoints are identified by analyzing viewpoint tendency and are forwarded to the person in charge, which provides a basis for subsequent manual monitoring and control of viewpoints.
First, a sensitive word fingerprint library is established. The semantic similarity distance between the extracted sensitive words and the entries in the fingerprint library is then calculated by semantic similarity detection. Finally, the calculated distance is compared against a threshold to find, in the sensitive word fingerprint library, the viewpoint tendency that corresponds to the text data.
Based on the obtained semantic fingerprints, the viewpoint tendency of highly similar texts in the text data is identified quickly, using the following similarity expression:
[Formula image BDA0003942071010000031: similarity expression, not reproduced in this text]
where the symbol shown in image BDA0003942071010000032 represents an exclusive-or operation; numful() represents a function that counts the values equal to 1; and F_i and F_j are the operands between which the distance is calculated.
In a second aspect, a government affair sensitive information identification system based on big data is provided for realizing an identification method of government affair sensitive information, and the system specifically comprises the following modules:
the data acquisition module is used for reading government affair text data to be analyzed;
a data conversion module configured to convert the read text data into a desired form;
the model construction module is arranged for constructing a sensitive information identification analysis model;
the data analysis module is used for analyzing the read text data by utilizing the sensitive information identification analysis model;
and the data output module is used for outputting the analysis result of the data analysis module.
In some implementations of the second aspect, when massive numbers of sensitive words are analyzed, the data acquisition module first reads the text data to be analyzed; the data conversion module then converts the read text data into the required form; next, the model construction module builds the sensitive information identification analysis model; the data analysis module then uses that model to analyze the read text data for sensitive information; and finally, the data output module outputs the analysis result of the data analysis module.
In a third aspect, a big data-based government affairs sensitive information identification device is provided, which includes: a processor and a memory storing computer program instructions.
The processor reads and executes the computer program instructions to implement the government affairs sensitive information identification method.
In a fourth aspect, a computer-readable storage medium having computer program instructions stored thereon is provided. The computer program instructions, when executed by a processor, implement the government affairs sensitive information identification method.
Advantageous effects:
1. The invention provides a big-data-based system and method for identifying sensitive government affairs information, which reduce the likelihood that malicious text spreads widely in practical applications by identifying sensitive words effectively.
2. The method further analyzes different ways of tampering with text and expands the sensitive word database accordingly, which improves the accuracy of comparison-based detection.
3. The sensitive information identification analysis model provided by the invention further mines the semantic tendency of the current text and identifies extreme viewpoints by analyzing viewpoint tendency, which strengthens the monitoring of malicious text.
4. The invention introduces an emotion-based approach and extracts keywords according to their degree of emotion, which improves the precision of sensitive information extraction when sudden public opinion events occur. In particular, because emotion keywords are extracted with reference to network popularity, sensitive information related to a sudden public opinion event can be determined more accurately and efficiently.
5. Identification is based on the sensitive word fingerprint library and semantic fingerprints. When a user tries to evade monitoring, for example by using homophones, pinyin, similar-looking characters or symbol segmentation, traditional lexicon-comparison monitoring can be evaded, but the semantic fingerprint cannot be completely hidden.
Drawings
FIG. 1 is a flow chart of data processing according to the present invention.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a more thorough understanding of the present invention. It will be apparent, however, to one skilled in the art, that the present invention may be practiced without one or more of these specific details. In other instances, well-known features have not been described in order to avoid obscuring the present invention.
The applicant considers that massive text data is an important building block of the electronic information age, and that sensitivity analysis of web text is an important factor in monitoring current public opinion. To avoid the threats to social security caused by the spread of certain sensitive words, a big-data-based system and method for identifying sensitive government affairs information are provided. By mining and analyzing massive text data, the emotional tendency expressed in the text is obtained, which helps keep the network environment for government affairs applications secure and stable.
Example one
In one embodiment, a big-data-based method for identifying sensitive government affairs information is provided. By mining and analyzing a large amount of government affairs data, the method uncovers potential relations between sensitive words and the direction of public opinion, so that network data can be supervised proactively and network security made more stable. As shown in FIG. 1, the method specifically includes the following steps:
Step 1, acquiring massive text data to be analyzed;
Specifically, the acquired text data is first preprocessed. Preprocessing means extracting the subject, predicate, object, attributive, adverbial, complement and punctuation information of the text, which reduces interference from noisy data and improves the accuracy of subsequent text recognition. Keywords are then extracted from the text, and the sensitive words are analyzed on the basis of the extracted keywords.
Step 2, converting the text data into a vector form;
step 3, constructing a sensitive information identification analysis model and receiving text data in a vector form;
step 4, identifying the sensitive information of the text data by using a sensitive information identification analysis model;
Step 5, outputting the identification analysis result.
The embodiment provides a government affair sensitive information identification method based on big data, and the possibility that malicious texts are widely spread in practical application is reduced through effective identification of sensitive words.
Example two
In a further embodiment based on the first embodiment, because the meaning of a sentence is usually carried by its keywords, the data processing further includes a keyword extraction step. Specifically, the keyword extraction expression is as follows:
[Formula image BDA0003942071010000061: keyword extraction expression, not reproduced in this text]
where the symbol shown in image BDA0003942071010000062 represents each preprocessed numerical value carrying a degree of emotion, and the subscript c is the serial number of that value; the symbol shown in image BDA0003942071010000063 represents a parameter carrying a degree of emotion, determined from a previous sensitive word frequency library, and the subscript t is the serial number of that parameter; Z represents a criticality parameter, namely the frequency with which the keyword occurs in the current network popularity ranking.
During preprocessing, the text data is first divided by word segmentation, and interfering words such as modal particles are then removed, which yields a more accurate text for analysis.
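The filtering part of this preprocessing can be sketched as follows. This is a minimal illustration that assumes the text has already been segmented into tokens (a real system would call a Chinese word segmenter first); the particle list `MODAL_PARTICLES` and the function name `remove_interfering_words` are hypothetical and do not come from the source.

```python
# Hypothetical list of modal particles treated as interfering words.
MODAL_PARTICLES = {"啊", "吧", "呢", "吗", "嘛", "呀"}

def remove_interfering_words(tokens):
    """Drop modal particles from an already segmented text."""
    return [t for t in tokens if t not in MODAL_PARTICLES]

tokens = ["这个", "政策", "好", "吗"]
cleaned = remove_interfering_words(tokens)
```

Punctuation is deliberately left in place here, since the method states that punctuation information is extracted during preprocessing.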
For the analysis of sensitive words, this embodiment adds keyword extraction from the text data. Extracting the keywords that best represent what the text expresses speeds up sensitive word identification and shortens the running time.
Example three
In a further embodiment based on the first embodiment, users with improper intentions in practical applications often deform sensitive words to change how the text is presented, so that their maliciously altered text escapes detection while still conveying semantic information similar to the sensitive words to other users. This embodiment therefore classifies sensitive words by type and applies a corresponding processing mode to each. Specifically, according to the data analysis requirements, sensitive words are divided into similar-pronunciation sensitive words, abbreviated sensitive words and split-form sensitive words.
In a further embodiment, because the meaning of Chinese text is conveyed largely through pronunciation (pinyin), users with improper intentions often use text with similar pronunciation to evade sensitive word detection. Therefore, when the sensitive word is of the similar-pronunciation type, the acquired text is first parsed into phonetic codes, and the semantic similarity between the sensitive word and the word to be detected is then obtained by calculating the edit distance between the phonetic codes.
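The phonetic comparison can be sketched as below. The sketch assumes the phonetic code is plain pinyin and uses the standard Levenshtein edit distance; the tiny `PINYIN` table, the function names, and the normalisation of the distance into a similarity score are all assumptions made for illustration, not details given in the source.

```python
# Hypothetical character-to-pinyin table; a real system would use a full dictionary.
PINYIN = {"政": "zheng", "正": "zheng", "府": "fu", "腐": "fu", "务": "wu"}

def to_phonetic_code(word):
    """Map each character to its pinyin; unknown characters are kept as-is."""
    return " ".join(PINYIN.get(ch, ch) for ch in word)

def edit_distance(a, b):
    """Levenshtein distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]

def phonetic_similarity(word, sensitive_word):
    """1.0 means identical phonetic codes (the normalisation is an assumption)."""
    a, b = to_phonetic_code(word), to_phonetic_code(sensitive_word)
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest
```

Two words written with different characters but the same pinyin receive a similarity of 1.0, which is what lets a homophone substitution be caught even though the written form differs.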
In a further embodiment, everyday language habits often lead people to communicate by omitting parts of words, and when too much is omitted the expression becomes obscure, which reduces the chance of detecting sensitive words. Therefore, when the sensitive word is of the abbreviation type, the initial letters of the words to be analyzed are first extracted and combined, and the combined initials are then used as the target string and template string for matching.
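One way to read the abbreviation matching is sketched below, assuming the initials are the first pinyin letters of each character in a sensitive word and that the text under analysis is searched for that string. The `PINYIN` table and both function names are hypothetical.

```python
# Hypothetical character-to-pinyin table used to derive initials.
PINYIN = {"政": "zheng", "府": "fu", "务": "wu"}

def initials(word):
    """Combine the first pinyin letter of each known character, e.g. 政府 -> 'zf'."""
    return "".join(PINYIN[ch][0] for ch in word if ch in PINYIN)

def find_abbreviation(text, sensitive_word):
    """Return the index where the word's initials occur in the text, or -1."""
    template = initials(sensitive_word)
    return text.lower().find(template) if template else -1
```

Short initial strings match many unrelated words, so a hit from this kind of matcher would normally be treated as a candidate for further analysis rather than a final decision.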
In a further embodiment, as internet slang grows, words are sometimes split into their character components for entertainment; for example, the word for "research" is rewritten in a split form (rendered in translation as "stone-breaking study"). Users with improper intentions can use the same trick, which is hard to detect automatically, to spread malicious meaning. When a sensitive word has been maliciously split in this way, the split word is first converted into region codes, and the obtained region codes are then matched, thereby matching the word to be analyzed.
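A sketch of the region-code step follows. It assumes that "region code" means the GB2312 region-position code (two digits for the row, two for the column), that split forms are stored in the expanded database as character sequences, and that the example above refers to writing 研 as 石 followed by 开. None of these assumptions is confirmed by the source, and the function names are hypothetical.

```python
def region_code(ch):
    """Four-digit GB2312 region-position code; other characters are returned unchanged."""
    try:
        raw = ch.encode("gb2312")
    except UnicodeEncodeError:
        return ch
    if len(raw) != 2:
        return ch
    return f"{raw[0] - 0xA0:02d}{raw[1] - 0xA0:02d}"

def to_region_codes(text):
    return [region_code(ch) for ch in text]

def find_split_form(text, split_form):
    """Return the index of the split form within the text, or -1 if absent."""
    target, template = to_region_codes(text), to_region_codes(split_form)
    for i in range(len(target) - len(template) + 1):
        if target[i:i + len(template)] == template:
            return i
    return -1
```

For instance, `find_split_form("他在石开究问题", "石开究")` locates the split form at index 2 by comparing code sequences rather than the characters themselves.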
In a further embodiment, to improve the accuracy of text detection, the sensitive word database used for comparison is enriched according to the sensitive word types defined above.
In this embodiment, the text data is analyzed and the sensitive word database is expanded on the basis of the analysis results, which lays a solid data foundation for subsequent sensitive word detection.
Example four
In a further embodiment based on the first embodiment, when sensitive words are analyzed with the sensitive information identification analysis model, the semantic tendency of the current text is further mined. Extreme viewpoints are identified by analyzing viewpoint tendency and are forwarded to the supervising user, which provides a basis for subsequent manual monitoring and control of viewpoints.
Specifically, a sensitive word fingerprint library is established first. The semantic similarity distance between the extracted sensitive words and the entries in the fingerprint library is then calculated by semantic similarity detection. Finally, the calculated distance is compared against a threshold to find, in the sensitive word fingerprint library, the viewpoint tendency that corresponds to the text data.
In a further embodiment, when the text data to be analyzed contains many identical passages, repeatedly calling the sensitive information identification analysis model to analyze them one by one consumes a large amount of system resources and wastes computing capacity. To address this problem, semantic fingerprinting is used to quickly identify text data with high similarity. In a preferred embodiment, computing the semantic fingerprint specifically comprises the following steps:
step 1, performing word segmentation on the received text data to obtain a set of words;
step 2, identifying the sensitive words and obtaining the fingerprint values corresponding to them from the existing fingerprint library;
step 3, hashing the word set from step 1 to obtain the corresponding binary hash values;
step 4, summing the obtained hash values bit by bit to obtain a sequence of values;
step 5, assigning 0 or 1 to each position according to whether the corresponding value in the sequence is positive or negative, which yields the final semantic fingerprint of the text;
step 6, repeating the semantic fingerprint computation until all of the text data has been analyzed.
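Steps 3 to 5 resemble the well-known SimHash construction, and the sketch below follows that reading. It is an illustration under assumptions: MD5 truncated to 64 bits stands in for the unspecified hash function, each 1 bit contributes +weight and each 0 bit contributes -weight to the bitwise sum, and the optional `weights` argument is a hypothetical stand-in for the fingerprint values looked up in step 2.

```python
import hashlib

def word_hash(word, bits=64):
    """Stable hash of a word truncated to `bits` bits (MD5 is an assumption)."""
    digest = hashlib.md5(word.encode("utf-8")).digest()
    return int.from_bytes(digest[:bits // 8], "big")

def semantic_fingerprint(words, weights=None, bits=64):
    """SimHash-style fingerprint: signed bitwise sum, then sign mapped to 0 or 1."""
    weights = weights or {}
    sums = [0] * bits
    for w in words:
        h, weight = word_hash(w, bits), weights.get(w, 1)
        for i in range(bits):
            sums[i] += weight if (h >> i) & 1 else -weight
    fingerprint = 0
    for i, s in enumerate(sums):
        if s > 0:  # positive sum -> bit 1, otherwise bit 0
            fingerprint |= 1 << i
    return fingerprint
```

Because the fingerprint depends only on the multiset of words, identical passages always produce the same value, so the full model needs to run only once per distinct fingerprint.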
Based on the obtained semantic fingerprints, the viewpoint tendency of highly similar texts in the text data to be processed is identified quickly.
The similarity is calculated as follows:
[Formula image BDA0003942071010000071: similarity expression, not reproduced in this text]
where the symbol shown in image BDA0003942071010000081 represents an exclusive-or operation; numful() represents a function that counts the values equal to 1; and F_i and F_j are the operands between which the distance is calculated.
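The symbol definitions are consistent with a Hamming distance between two fingerprints: XOR the two values and count the bits equal to 1. The sketch below implements that reading as an assumption, since the expression itself is not reproduced in the text. The name `numful` is taken from the definitions above, while `fingerprint_distance`, `is_similar` and the 3-bit threshold are hypothetical.

```python
def numful(value):
    """Count the bits equal to 1 in a non-negative integer."""
    return bin(value).count("1")

def fingerprint_distance(f_i, f_j):
    """Hamming distance: the number of bit positions where the fingerprints differ."""
    return numful(f_i ^ f_j)

def is_similar(f_i, f_j, threshold=3):
    """The 3-bit threshold is a hypothetical value, not taken from the source."""
    return fingerprint_distance(f_i, f_j) <= threshold
```

A smaller distance means more similar texts, so the threshold decides how many differing bits are tolerated before two texts are treated as distinct.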
In a further embodiment, text that merely contains sensitive words does not necessarily express a sensitive viewpoint, so the proposed sensitive information identification analysis model uses deep learning to overcome this semantic deficiency.
Because each word contributes differently to the text classification result, this embodiment introduces a self-attention mechanism that learns a weight for each word in the sentence. Words of high importance in the sentence receive higher weights, which emphasizes their influence on the classification result and further improves the identification accuracy of the model. The main purpose of the self-attention layer is to learn the weight of the word at each position, so that during task learning attention shifts to the words that play an important role in the sentence. In multi-task learning the tasks share the same input, but the importance of each word differs between the two tasks, so the word weights are adjusted in the self-attention layer and the words that play an important role in this embodiment are given larger weights.
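The weighting idea can be illustrated with a simplified attention-pooling layer, shown below in plain Python. This is a stand-in and not the model described in the source: it scores each word vector against a single query vector, turns the scores into weights with a softmax, and returns the weighted sum. In a trained model the query would be learned; here `query`, `softmax` and `attention_pool` are hypothetical names and the values are supplied by hand.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention_pool(word_vectors, query):
    """Weight each word vector by softmax(query . vector) and sum the results."""
    scores = [sum(q * x for q, x in zip(query, vec)) for vec in word_vectors]
    weights = softmax(scores)
    dim = len(word_vectors[0])
    pooled = [sum(w * vec[d] for w, vec in zip(weights, word_vectors))
              for d in range(dim)]
    return weights, pooled

# Two toy word vectors; the query favours the first dimension.
weights, pooled = attention_pool([[1.0, 0.0], [0.0, 1.0]], [2.0, 0.0])
```

The weights always sum to 1, and the word whose vector aligns best with the query receives the largest share, which is the effect the paragraph above describes.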
In order to effectively improve the performance of the sensitive information identification analysis model, a loss function is adopted to carry out performance optimization on the sensitive information identification analysis model, wherein the corresponding loss function expression is as follows:
[Formula image BDA0003942071010000082: loss function expression, not reproduced in this text]
where y represents the actual value of the parsed text; the symbol shown in image BDA0003942071010000083 represents the model's predicted output value; and s represents the likelihood probability distribution over each class.
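The loss expression itself is not reproduced in the text. If it is the standard cross-entropy between the true label distribution and the predicted class probabilities, which is an assumption and not something the source states, it could be computed as in the sketch below. The function name `cross_entropy` and the `eps` clamp are hypothetical.

```python
import math

def cross_entropy(y_true, y_pred, eps=1e-12):
    """Cross-entropy between a one-hot (or soft) label and predicted probabilities.

    Predicted probabilities are clamped at eps to avoid log(0).
    """
    return -sum(t * math.log(max(p, eps)) for t, p in zip(y_true, y_pred))

# A confident correct prediction gives a smaller loss than an uncertain one.
confident = cross_entropy([1, 0], [0.9, 0.1])
uncertain = cross_entropy([1, 0], [0.5, 0.5])
```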
Example five
In one embodiment, a big data-based government affair sensitive information identification system is provided, which is used for implementing a big data-based government affair sensitive information identification method, and specifically includes the following modules: the device comprises a data acquisition module, a data conversion module, a model construction module, a data analysis module and a data output module.
Specifically, the data acquisition module is used for reading mass government affair text data to be analyzed according to analysis requirements; the data conversion module is used for performing form conversion on the read text data according to the file format requirement; the model construction module is used for constructing a sensitive information identification analysis model; the data analysis module is used for analyzing the sensitive information of the converted text data; and the data output module is used for outputting the analysis result obtained by the data analysis module.
In a further embodiment, when massive numbers of sensitive words are analyzed, the data acquisition module first reads the text data to be analyzed; the data conversion module then converts the read text data into the required form; next, the model construction module builds the sensitive information identification analysis model; the data analysis module then uses that model to analyze the read text data for sensitive information; and finally, the data output module outputs the analysis result of the data analysis module.
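The order in which the five modules cooperate can be sketched as a small pipeline. Every function below is a hypothetical placeholder: the conversion step simply splits pre-segmented text on whitespace instead of producing vectors, and the "model" is a trivial lexicon lookup standing in for the sensitive information identification analysis model.

```python
def acquire(source):
    """Data acquisition module: read the non-empty texts to analyze."""
    return [line.strip() for line in source if line.strip()]

def convert(texts):
    """Data conversion module: split pre-segmented text into tokens (stand-in for vectorization)."""
    return [text.split() for text in texts]

def build_model(sensitive_words):
    """Model construction module: a lexicon lookup standing in for the real model."""
    lexicon = set(sensitive_words)
    return lambda tokens: [t for t in tokens if t in lexicon]

def analyse(model, converted):
    """Data analysis module: apply the model to every converted text."""
    return [model(tokens) for tokens in converted]

def output(texts, hits):
    """Data output module: pair each text with the sensitive words found in it."""
    return [{"text": t, "sensitive": h} for t, h in zip(texts, hits)]

def run_pipeline(source, sensitive_words):
    texts = acquire(source)
    model = build_model(sensitive_words)
    return output(texts, analyse(model, convert(texts)))
```

Keeping each module behind its own function mirrors the module boundaries of the system, so any one stage (for example the model) can be replaced without touching the others.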
Example six
In one embodiment, a big data-based government affairs sensitive information identification device is provided, which comprises: a processor and a memory storing computer program instructions.
The processor reads and executes the computer program instructions to implement the government affairs sensitive information identification method.
Example seven
In one embodiment, a computer-readable storage medium having computer program instructions stored thereon is presented.
The computer program instructions, when executed by a processor, implement the government affairs sensitive information identification method.
As noted above, while the present invention has been shown and described with reference to certain preferred embodiments, it is not to be construed as limited thereto. Various changes in form and detail may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A government affair sensitive information identification method based on big data is characterized by comprising the following steps:
step 1, acquiring text data to be analyzed;
step 1.1, preprocessing the text data, and extracting the subject, predicate, object, attributive, adverbial, complement and punctuation information of the text;
step 1.2, extracting keywords after the preprocessing; the expression for extracting the keywords is as follows:
[Formula image FDA0003942070000000011; the expression is not reproduced in the text]
where the symbol shown in image FDA0003942070000000012 represents each value with an emotion degree after the preprocessing, and its subscript c represents the serial number of that value; the symbol shown in image FDA0003942070000000013 represents a parameter with an emotion degree that is determined based on a previous sensitive word frequency library, and its subscript t represents the serial number of that parameter; z represents a criticality parameter, wherein the criticality parameter refers to the frequency with which the keyword occurs in the current network popularity ranking;
step 2, converting the text data into a vector form;
step 3, constructing a sensitive information identification analysis model and receiving text data in a vector form;
step 4, identifying sensitive information in the text data by using the sensitive information identification analysis model:
when the sensitive word is of the similar-pronunciation type, first converting the acquired text into phonetic codes, and then calculating the edit distance between the phonetic codes to obtain the semantic similarity between the sensitive word and the word to be detected;
when the sensitive word is of the abbreviation type, first extracting the initial letters of the word to be analyzed and combining them, and then using the combined initial letters as the target string and the template string for matching;
when the sensitive word is of the split type, first converting the split word into region codes, and then matching the obtained region codes, thereby matching the word to be analyzed;
step 5, outputting the recognition analysis result.
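The similar-pronunciation and abbreviation cases of step 4 can be illustrated with a short sketch. It makes several assumptions: the phonetic code is taken to be pinyin without tones, the `PINYIN` table is a tiny hand-written stand-in for a full pronunciation dictionary, the edit distance is the standard Levenshtein distance, and the normalization of distance into a similarity score in [0, 1] is one possible choice that the claim does not specify. All function names are hypothetical.

```python
from typing import Dict, List

# Hypothetical, hand-written pinyin table covering only the sample characters.
PINYIN: Dict[str, str] = {
    "机": "ji", "鸡": "ji", "密": "mi", "秘": "mi", "蜜": "mi",
    "文": "wen", "件": "jian",
}


def to_pinyin(word: str) -> List[str]:
    """Convert each character to its phonetic code (KeyError if unknown)."""
    return [PINYIN[ch] for ch in word]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def phonetic_similarity(word: str, candidate: str) -> float:
    """Similarity in [0, 1] derived from the edit distance of the pinyin
    strings (assumed normalization: 1 - distance / longer length)."""
    a = "".join(to_pinyin(word))
    b = "".join(to_pinyin(candidate))
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest


def initials(word: str) -> str:
    """Concatenate the first letter of each character's phonetic code."""
    return "".join(p[0] for p in to_pinyin(word))


def matches_abbreviation(candidate: str, word: str) -> bool:
    """Compare a candidate abbreviation with the initials of a known word."""
    return candidate.lower() == initials(word)
```

For example, `phonetic_similarity("机密", "鸡蜜")` compares two words whose pinyin strings are identical, and `matches_abbreviation("JMWJ", "机密文件")` checks an initial-letter abbreviation against a stored word. A threshold on the similarity score would then decide whether a candidate counts as a match; the source does not state such a threshold.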
2. The government affair sensitive information recognition method based on big data according to claim 1, wherein in the process of recognizing the sensitive information through the sensitive information recognition analysis model, in order to improve the recognition accuracy of the sensitive words, the sensitive word database storing the sensitive words is further expanded.
3. The method for identifying government affairs sensitive information based on big data according to claim 2, wherein the expansion mode of the sensitive word database comprises the following steps:
expanding sensitive words with similar pronunciations;
expanding sensitive words in a short form;
and expanding the sensitive words in a split form.
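The split-form expansion, together with the region-code matching named in claim 1, might look roughly like the sketch below. It assumes that "region code" means the four-digit GB2312 region-position code, which can be derived from the GB2312 byte encoding by subtracting 0xA0 from each byte. The `SPLITS` decomposition table, the function names, and the idea of indexing every split spelling by its region-code sequence are illustrative assumptions rather than details given in the claims.

```python
from typing import Dict, List, Optional, Set, Tuple


def region_code(ch: str) -> str:
    """Four-digit GB2312 region-position code of one double-byte character.
    Raises an error for characters outside the GB2312 double-byte range."""
    hi, lo = ch.encode("gb2312")
    return f"{hi - 0xA0:02d}{lo - 0xA0:02d}"


def region_codes(text: str) -> List[str]:
    """Region codes of every character in the text, in order."""
    return [region_code(ch) for ch in text]


# Hypothetical decomposition table: whole character -> its split components.
SPLITS: Dict[str, str] = {"好": "女子", "林": "木木"}


def expand_with_splits(word: str) -> Set[str]:
    """All spellings of `word` in which any splittable character may be
    written as its components."""
    variants = {""}
    for ch in word:
        options = [ch] + ([SPLITS[ch]] if ch in SPLITS else [])
        variants = {v + o for v in variants for o in options}
    return variants


def build_code_index(words: List[str]) -> Dict[Tuple[str, ...], str]:
    """Expanded database: region-code sequence of each variant -> base word."""
    index: Dict[Tuple[str, ...], str] = {}
    for w in words:
        for v in expand_with_splits(w):
            index[tuple(region_codes(v))] = w
    return index


def lookup(index: Dict[Tuple[str, ...], str], candidate: str) -> Optional[str]:
    """Return the base word whose variant matches the candidate, if any."""
    return index.get(tuple(region_codes(candidate)))
```

With an index built from `["好人"]`, both the original spelling and the split spelling `"女子人"` resolve to the same base word, while unrelated text returns `None`. Matching on code sequences rather than raw strings is only one way to read the claim; the source does not describe the matching procedure in more detail.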
4. The government affair sensitive information identification method based on big data according to claim 3, wherein when the sensitive words are analyzed by using the sensitive information identification analysis model, the semantic tendency of the current text is further mined, the presence of extreme viewpoints is identified by analyzing the tendency of the viewpoints, and the extreme viewpoints are transmitted to the responsible persons, thereby providing a basis for subsequent manual viewpoint monitoring and control.
5. A government affairs sensitive information identification system based on big data, which is used for realizing the government affairs sensitive information identification method according to any one of claims 1-4, and is characterized by comprising the following modules:
the data acquisition module is used for reading government affair text data to be analyzed;
a data conversion module configured to convert the read text data into a desired form;
the model construction module is arranged for constructing a sensitive information identification analysis model;
the data analysis module is used for analyzing the read text data by utilizing the sensitive information identification analysis model;
and the data output module is used for outputting the analysis result of the data analysis module.
6. A big data-based government affairs sensitive information identification device, comprising:
a processor and a memory storing computer program instructions;
the processor reads and executes the computer program instructions to implement the government affairs sensitive information identification method according to any one of claims 1-4.
7. A computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the government-sensitive information identifying method according to any one of claims 1-4.
CN202211424814.9A 2022-11-14 2022-11-14 Government affair sensitive information identification system and method based on big data Withdrawn CN115759071A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211424814.9A CN115759071A (en) 2022-11-14 2022-11-14 Government affair sensitive information identification system and method based on big data

Publications (1)

Publication Number Publication Date
CN115759071A true CN115759071A (en) 2023-03-07

Family

ID=85370785

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211424814.9A Withdrawn CN115759071A (en) 2022-11-14 2022-11-14 Government affair sensitive information identification system and method based on big data

Country Status (1)

Country Link
CN (1) CN115759071A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116628211A (en) * 2023-07-25 2023-08-22 中国电信股份有限公司 Data classification method and device, storage medium and electronic equipment
CN116628211B (en) * 2023-07-25 2023-11-07 中国电信股份有限公司 Data classification method and device, storage medium and electronic equipment
CN116939292A (en) * 2023-09-15 2023-10-24 天津市北海通信技术有限公司 Video text content monitoring method and system in rail transit environment
CN116939292B (en) * 2023-09-15 2023-11-24 天津市北海通信技术有限公司 Video text content monitoring method and system in rail transit environment

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WW01 Invention patent application withdrawn after publication

Application publication date: 20230307