CN111522950B - Rapid identification system for unstructured massive text sensitive data


Info

Publication number
CN111522950B
CN111522950B CN202010338431.4A CN202010338431A CN111522950B CN 111522950 B CN111522950 B CN 111522950B CN 202010338431 A CN202010338431 A CN 202010338431A CN 111522950 B CN111522950 B CN 111522950B
Authority
CN
China
Prior art keywords
data
unit
sensitive
module
identification
Prior art date
Legal status
Active
Application number
CN202010338431.4A
Other languages
Chinese (zh)
Other versions
CN111522950A (en)
Inventor
章明珠
刘超
Current Assignee
Chengdu Siwei Century Technology Co ltd
Original Assignee
Chengdu Siwei Century Technology Co ltd
Priority date
2020-04-26
Filing date
2020-04-26
Publication date
2023-06-27
Application filed by Chengdu Siwei Century Technology Co ltd
Priority to CN202010338431.4A
Publication of CN111522950A
Application granted
Publication of CN111522950B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a rapid identification system for unstructured massive text sensitive data. The system comprises a modeling unit, an identification layer unit, a storage unit, a support layer unit and a serialization unit. The modeling unit comprises an information acquisition module and a modeling calculation module and is electrically connected with the identification layer unit. The storage unit provides persistent storage for the metadata of the modeling unit and is electrically connected with the modeling unit. The support layer unit comprises a business monitoring module, a man-machine interaction module, a service hosting module and a log tracking module. For rapid classification of unstructured data, a learning engine autonomously selects a suitable algorithm from common classification algorithms, which improves identification efficiency. For efficient identification of unstructured data, a query method corresponding to the sensitive type is autonomously selected for scanning, which improves scanning efficiency.

Description

Rapid identification system for unstructured massive text sensitive data
Technical Field
The invention belongs to the fields of data security, data classification algorithms and data modeling, and particularly relates to a rapid identification system for unstructured massive text sensitive data.
Background
For massive unstructured text data, products currently on the market model the text of the unstructured data and compare text similarity in order to refine and optimize large-scale classification algorithms, and then classify the unstructured data and extract sensitive content from it. The related technical solutions mainly use a neural network data analysis engine to classify and summarize the text data and then extract and identify the data; the core technology is a sensitive identification engine that classifies and organizes text data rapidly. With the development and popularization of Internet technology, a large amount of unstructured electronic text exists on the Internet, and in the face of ever-growing web page data, sensitive data constantly threatens enterprises and individuals in their daily activities. There is therefore a growing market demand for ways to help enterprises identify sensitive data efficiently, classify it quickly out of massive unstructured text, and express unstructured text data in a form that a computer can process, so that identification cost is reduced while the data is mined and stored efficiently.
The main disadvantage of the prior art for sensitive identification of unstructured data is that identification efficiency is very low when massive data is processed. The main causes are the low efficiency of data classification and the low efficiency of scanning for key information.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a rapid identification system for unstructured massive text sensitive data.
In order to achieve the above purpose, the present invention provides the following technical solution:
A rapid identification system for unstructured massive text sensitive data comprises a modeling unit, an identification layer unit, a storage unit, a support layer unit and a serialization unit. The modeling unit comprises an information acquisition module and a modeling calculation module and is electrically connected with the identification layer unit. The storage unit is used for providing persistent storage for metadata of the modeling unit, is provided with a read-write interface, and is electrically connected with the modeling unit. The support layer unit comprises a business monitoring module, a man-machine interaction module, a service hosting module and a log tracking module, and is used for optimizing the algorithm of the serialization unit and for providing a strategy and basis for the acquisition focus of the information acquisition module.
Preferably, the information acquisition module comprises a manual acquisition module and a machine acquisition module. The manual acquisition module is used for manually sorting sample data into the storage unit and labeling each sample with a sensitive type and grade; it is provided with an interface connected with the identification layer unit and supports manual batch import of keywords. The manual acquisition module acquires at least 100 pieces of information for each sample, and the samples are stored in the storage unit.
Preferably, the modeling calculation module is used for providing corresponding human acquisition calculation and computer acquisition calculation for the manual acquisition module and the machine acquisition module. The computer acquisition calculation adopts algorithms open-sourced by enterprises at the technical forefront of the industry in the fields of neural networks and artificial intelligence. The human acquisition calculation is used for correcting the business relevance of the computer acquisition calculation and for applying smooth transition processing to the sensitive classification and rating scoring system; it introduces similarity calculation and an extensible Hamming distance algorithm, and adds natural language processing for approximation and lexical association.
Preferably, the identification layer unit takes the output of the modeling unit as input to perform the initial loading of the models. The identification layer unit dynamically adds and removes model items according to service requirements and supports hot-plug operation. The hit scores returned by the identification layer unit for each sensitive model are combined by a summarizing algorithm: the weight of each classification is multiplied by the accumulated matching-degree value and the logarithm is taken, and the result, a floating point number between zero and one, is used as a correction value in the final sensitivity evaluation.
Preferably, the identification layer unit is used for grouping massive unstructured text data by means of a classification algorithm or a clustering algorithm, then detecting and processing the character set and language, converting the text into the character set used for internal storage as required, segmenting the metadata with a word segmentation system, and extracting the keywords of the current text after deleting stop words.
Preferably, the storage unit adopts a semi-structured distributed storage solution, chosen on the basis of the characteristics of network data, to store web page content with high scalability; the message queue in the storage unit is first-in first-out and can be freely subscribed to.
Preferably, the read-write interface is used for periodically reading incremental data and transmitting it to the storage unit to form a message queue, which is pushed to each service node of the identification layer unit; each service node consumes queue data according to its equipment load and writes the result back into the message queue immediately after processing is finished.
Preferably, the serialization unit comprises a sensitive information module, which is used for encapsulating sensitive words and sensitive fields. Serialization is performed at the production end and deserialization at the consumption end. The information to be serialized by the serialization unit comprises the version number, information type, operation type, encryption identifier and key, data length, data information and identification result.
Preferably, the specific workflow of the rapid identification system for unstructured massive text sensitive data is as follows:
S1: data acquisition and storage: the data provided by the institution or enterprise to be identified is stored in HBase, ES or another non-relational database;
S2: identification: part or all of the identification models are loaded according to the configuration items, and data identification is carried out according to the identification models using relation extraction on the data; a thread pool reads records from the message queue one by one, executes the deserialization operation and follows different processing flows according to the data type; after model matching is completed, a summarization calculation is carried out, the result is serialized and written back to the system bus message queue topic, and a log of the current execution is recorded for offline effect analysis;
S3: bus queue: when the system bus starts, a production worker thread and a consumption worker thread are created; the production worker thread periodically tracks changes in the incremental data of the underlying storage and, when data arrives, extracts the data to be consumed from the storage unit and puts it into the consumption topic; the consumption worker thread is suspended waiting at the entry and, when a new message arrives, automatically triggers the write-back operation and updates the data at the original address in the underlying storage;
S4: log analysis: the serialization unit analyzes log data in batch mode on an hourly basis and generates a report, with statistics on data scale, proportion of sensitive information, intensity of sensitive information, propagation frequency and popularity, and identification accuracy;
S5: support system: the support layer unit provides support capability for the whole framework of the rapid identification system, mainly for the database component and the learning engine; the database needs to be cleaned and optimized regularly, and the learning engine needs to update the identification algorithms and the identification library in a timely manner;
S6: external interface: an interface for importing sensitive identification data and a request interface for sensitive data identification are provided.
The technical effects and advantages of the invention are as follows: compared with the prior art, the rapid identification system for unstructured massive text sensitive data provided by the invention uses a learning engine to autonomously select a suitable algorithm from common classification algorithms for rapid classification of unstructured data, which improves identification efficiency; and, for efficient identification of unstructured data, it autonomously selects a query method corresponding to the sensitive type for scanning, which improves scanning efficiency.
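By way of non-limiting illustration, selecting a query method according to the sensitive type can be sketched in Python as a dispatch table. The type names, the regular expression (an 11-digit number starting with 1) and the keyword list are hypothetical examples and are not taken from the invention.

```python
import re

# Hypothetical keyword list used by the keyword-based query method.
SENSITIVE_KEYWORDS = ("password", "secret")


def scan_phone_numbers(text):
    """Pattern-based query: 11-digit numbers starting with 1 (illustrative rule)."""
    return re.findall(r"\b1\d{10}\b", text)


def scan_keywords(text):
    """Keyword-based query: case-insensitive lookup of listed words."""
    lowered = text.lower()
    return [word for word in SENSITIVE_KEYWORDS if word in lowered]


# One query method per sensitive type; the type names are assumptions.
SCANNERS = {
    "phone": scan_phone_numbers,
    "keyword": scan_keywords,
}


def scan(text, sensitive_type):
    """Pick the query method registered for the sensitive type and run it."""
    return SCANNERS[sensitive_type](text)
```

A pattern-shaped type such as a phone number is scanned with one regular-expression pass, whereas a vocabulary-shaped type is scanned by keyword lookup, so each type pays only for the query it needs.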
Drawings
FIG. 1 is a block diagram of the rapid identification system for unstructured massive text sensitive data of the present invention;
FIG. 2 is a flow chart of the rapid identification system for unstructured massive text sensitive data of the present invention.
Detailed Description
The present invention will be described in further detail with reference to specific embodiments in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
Example 1
The rapid identification system for unstructured massive text sensitive data comprises a modeling unit, an identification layer unit, a storage unit, a support layer unit and a serialization unit. The modeling unit comprises an information acquisition module and a modeling calculation module and is electrically connected with the identification layer unit. The storage unit is used for providing persistent storage for metadata of the modeling unit, is provided with a read-write interface, and is electrically connected with the modeling unit. The support layer unit comprises a business monitoring module, a man-machine interaction module, a service hosting module and a log tracking module, and is used for optimizing the algorithm of the serialization unit and for providing a strategy and basis for the acquisition focus of the information acquisition module.
The information acquisition module comprises a manual acquisition module and a machine acquisition module. The manual acquisition module is used for manually sorting sample data into the storage unit and labeling each sample with a sensitive type and grade; it is provided with an interface connected with the identification layer unit and supports manual batch import of keywords. The manual acquisition module acquires at least 100 pieces of information for each sample, and the samples are stored in the storage unit.
The modeling calculation module is used for providing corresponding human acquisition calculation and computer acquisition calculation for the manual acquisition module and the machine acquisition module. The computer acquisition calculation adopts algorithms open-sourced by enterprises at the technical forefront of the industry in the fields of neural networks and artificial intelligence. The human acquisition calculation is used for correcting the business relevance of the computer acquisition calculation and for applying smooth transition processing to the sensitive classification and rating scoring system; it introduces similarity calculation and an extensible Hamming distance algorithm, and adds natural language processing for approximation and lexical association.
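By way of non-limiting illustration, a similarity calculation based on Hamming distance can be sketched as follows. The embodiment does not specify how texts are turned into bit strings; the sketch assumes a SimHash-style 64-bit fingerprint built from MD5 token hashes, and the function names are hypothetical.

```python
import hashlib
import re

BITS = 64  # fingerprint width; an assumption of this sketch


def simhash(text):
    """Build a SimHash-style fingerprint from the word tokens of a text."""
    weights = [0] * BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        value = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(BITS):
            weights[i] += 1 if (value >> i) & 1 else -1
    fingerprint = 0
    for i, weight in enumerate(weights):
        if weight > 0:
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(text_a, text_b):
    """Map the Hamming distance of two fingerprints to a score in [0, 1]."""
    return 1.0 - hamming_distance(simhash(text_a), simhash(text_b)) / BITS
```

Identical texts give a similarity of 1.0, and the score falls as more fingerprint bits differ, so a threshold on the score can serve as a near-duplicate test for labeled samples.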
The identification layer unit takes the output of the modeling unit as input to perform the initial loading of the models. The identification layer unit dynamically adds and removes model items according to service requirements and supports hot-plug operation. The hit scores returned by the identification layer unit for each sensitive model are combined by a summarizing algorithm: the weight of each classification is multiplied by the accumulated matching-degree value and the logarithm is taken, and the result, a floating point number between zero and one, is used as a correction value in the final sensitivity evaluation.
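By way of non-limiting illustration, the summarizing algorithm can be sketched as below. The embodiment states only that the weight is multiplied by the accumulated matching degree, that a logarithm is taken, and that the result lies between zero and one. The sketch therefore assumes that the products are summed over all classifications, that log(1 + x) is used, and that the result is normalized by a hypothetical upper bound `cap`.

```python
import math


def correction_value(hits, cap=100.0):
    """Combine per-classification hits into a correction value in [0, 1].

    hits: iterable of (weight, accumulated_match) pairs, one per classification.
    cap:  assumed upper bound of the weighted sum, used only for normalization.
    """
    weighted_sum = sum(weight * match for weight, match in hits)
    clipped = min(max(weighted_sum, 0.0), cap)   # keep the sum inside [0, cap]
    return math.log1p(clipped) / math.log1p(cap)  # log scale, normalized to [0, 1]
```

The logarithm compresses large accumulated values, so a model that matches many times raises the correction value more and more slowly instead of dominating the final evaluation.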
The identification layer unit is used for grouping massive unstructured text data by means of a classification algorithm or a clustering algorithm, then detecting and processing the character set and language, converting the text into the character set used for internal storage as required, segmenting the metadata with a word segmentation system, deleting stop words, and extracting the keywords of the current text.
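By way of non-limiting illustration, the character set handling, word segmentation, stop-word deletion and keyword extraction can be sketched as follows. The candidate encodings, the stop-word list and the frequency-based keyword ranking are assumptions of the sketch; the regular expression only segments ASCII words, and a dedicated word segmentation system would be needed for Chinese text.

```python
import re
from collections import Counter

CANDIDATE_ENCODINGS = ("utf-8", "gb18030")  # assumed input character sets
STOP_WORDS = {"the", "a", "an", "of", "and", "is", "to", "in"}  # illustrative


def decode_text(raw):
    """Try each candidate character set and return the text as a Python str."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what can be decoded and mark the rest.
    return raw.decode("utf-8", errors="replace")


def extract_keywords(raw, top_n=5):
    """Segment the text, delete stop words, and return the most frequent words."""
    tokens = re.findall(r"[a-z0-9]+", decode_text(raw).lower())
    kept = [token for token in tokens if token not in STOP_WORDS]
    return [word for word, _ in Counter(kept).most_common(top_n)]
```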
The storage unit adopts a semi-structured distributed storage solution, chosen on the basis of the characteristics of network data, to store web page content with high scalability; the message queue in the storage unit is first-in first-out and can be freely subscribed to.
The read-write interface is used for periodically reading incremental data and transmitting it to the storage unit to form a message queue, which is pushed to each service node of the identification layer unit; each service node consumes queue data according to its equipment load and writes the result back into the message queue immediately after processing is finished.
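By way of non-limiting illustration, the flow of incremental data through a first-in first-out queue, its consumption by service nodes and the write-back of results can be sketched with in-process threads. A distributed message queue and load-aware consumption are not modeled; `run_pipeline`, `identify` and the sentinel convention are hypothetical.

```python
import queue
import threading


def run_pipeline(increments, identify, workers=2):
    """Push records into a FIFO queue, let worker threads consume them,
    and collect the written-back (record, result) pairs."""
    pending = queue.Queue()       # FIFO queue of records to be identified
    written_back = queue.Queue()  # results written back after processing

    def service_node():
        while True:
            record = pending.get()
            if record is None:            # sentinel: no more work for this node
                pending.task_done()
                return
            written_back.put((record, identify(record)))
            pending.task_done()

    threads = [threading.Thread(target=service_node) for _ in range(workers)]
    for thread in threads:
        thread.start()
    for record in increments:             # producer side: enqueue incremental data
        pending.put(record)
    for _ in threads:                     # one sentinel per service node
        pending.put(None)
    pending.join()
    for thread in threads:
        thread.join()

    results = []
    while not written_back.empty():
        results.append(written_back.get())
    return results
```

Results may come back in a different order from the input because the service nodes run concurrently, so callers that need a stable order should sort or key the results.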
The serialization unit comprises a sensitive information module, which is used for encapsulating sensitive words and sensitive fields. Serialization is performed at the production end and deserialization at the consumption end. The information to be serialized by the serialization unit comprises the version number, information type, operation type, encryption identifier and key, data length, data information and identification result.
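By way of non-limiting illustration, a message carrying the listed fields can be serialized at the production end and deserialized at the consumption end as sketched below. The binary layout (one byte each for version, information type, operation type and encryption flag, a 16-byte key field, a 4-byte data length) is an assumption of the sketch and is not specified by the embodiment.

```python
import struct
from dataclasses import dataclass

# Assumed header layout, network byte order:
# version, info type, operation type, encryption flag, 16-byte key, data length.
_HEADER = struct.Struct("!BBBB16sI")


@dataclass
class Message:
    version: int
    info_type: int
    op_type: int
    encrypted: bool
    key: bytes      # at most 16 bytes; shorter keys are zero-padded on the wire
    data: bytes
    result: bytes   # identification result, stored after the data


def serialize(msg):
    """Production end: pack the header, then append data and result."""
    header = _HEADER.pack(msg.version, msg.info_type, msg.op_type,
                          int(msg.encrypted), msg.key, len(msg.data))
    return header + msg.data + msg.result


def deserialize(blob):
    """Consumption end: unpack the header and split data from result."""
    version, info_type, op_type, flag, key, length = _HEADER.unpack_from(blob)
    body = blob[_HEADER.size:]
    return Message(version, info_type, op_type, bool(flag),
                   key.rstrip(b"\0"), body[:length], body[length:])
```

Storing the data length in the header lets the consumer separate the data from the identification result without a delimiter, and the leading version number lets the layout change later without breaking older consumers.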
Example 2
The specific workflow of the rapid identification system for unstructured massive text sensitive data is as follows:
S1: data acquisition and storage: the data provided by the institution or enterprise to be identified is stored in HBase, ES or another non-relational database;
S2: identification: part or all of the identification models are loaded according to the configuration items, and data identification is carried out according to the identification models using relation extraction on the data; a thread pool reads records from the message queue one by one, executes the deserialization operation and follows different processing flows according to the data type; after model matching is completed, a summarization calculation is carried out, the result is serialized and written back to the system bus message queue topic, and a log of the current execution is recorded for offline effect analysis;
S3: bus queue: when the system bus starts, a production worker thread and a consumption worker thread are created; the production worker thread periodically tracks changes in the incremental data of the underlying storage and, when data arrives, extracts the data to be consumed from the storage unit and puts it into the consumption topic; the consumption worker thread is suspended waiting at the entry and, when a new message arrives, automatically triggers the write-back operation and updates the data at the original address in the underlying storage;
S4: log analysis: the serialization unit analyzes log data in batch mode on an hourly basis and generates a report, with statistics on data scale, proportion of sensitive information, intensity of sensitive information, propagation frequency and popularity, and identification accuracy;
S5: support system: the support layer unit provides support capability for the whole framework of the rapid identification system, mainly for the database component and the learning engine; the database needs to be cleaned and optimized regularly, and the learning engine needs to update the identification algorithms and the identification library in a timely manner;
S6: external interface: an interface for importing sensitive identification data and a request interface for sensitive data identification are provided.
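By way of non-limiting illustration, the hourly batch log analysis of step S4 can be sketched as follows. The log entry fields (`time`, `sensitive`, `correct`) and the report keys are hypothetical; only data scale, proportion of sensitive information and identification accuracy are computed, and the other statistics named in S4 are omitted.

```python
from collections import defaultdict
from datetime import datetime


def hourly_report(log_entries):
    """Group log entries by hour and compute per-hour statistics.

    Each entry is a dict with 'time' (ISO 8601 string), 'sensitive' (bool)
    and 'correct' (bool, whether the identification was right).
    """
    buckets = defaultdict(lambda: {"total": 0, "sensitive": 0, "correct": 0})
    for entry in log_entries:
        hour = datetime.fromisoformat(entry["time"]).strftime("%Y-%m-%d %H:00")
        bucket = buckets[hour]
        bucket["total"] += 1
        bucket["sensitive"] += int(entry["sensitive"])
        bucket["correct"] += int(entry["correct"])
    return {
        hour: {
            "data_scale": b["total"],
            "sensitive_ratio": b["sensitive"] / b["total"],
            "accuracy": b["correct"] / b["total"],
        }
        for hour, b in sorted(buckets.items())
    }
```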
To sum up: compared with the prior art, the rapid identification system for unstructured massive text sensitive data provided by the invention uses a learning engine to autonomously select a suitable algorithm from common classification algorithms for rapid classification of unstructured data, which improves identification efficiency; and, for efficient identification of unstructured data, it autonomously selects a query method corresponding to the sensitive type for scanning, which improves scanning efficiency.
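By way of non-limiting illustration, autonomous selection of a classification algorithm can be sketched as choosing, from a set of candidates, the one with the highest accuracy on labeled samples. The selection criterion and the function name are assumptions; the embodiments do not state how the learning engine compares algorithms, and in the usage below simple rules stand in for real classification algorithms.

```python
def select_classifier(candidates, samples):
    """Return the name and accuracy of the best candidate on labeled samples.

    candidates: dict mapping a name to a callable text -> label.
    samples:    non-empty list of (text, label) pairs.
    """
    def accuracy(classify):
        hits = sum(classify(text) == label for text, label in samples)
        return hits / len(samples)

    best = max(candidates, key=lambda name: accuracy(candidates[name]))
    return best, accuracy(candidates[best])
```

For example, with the hypothetical candidates `{"keyword": lambda t: "password" in t, "digits": lambda t: any(c.isdigit() for c in t)}` and samples in which only texts mentioning a password are labeled sensitive, the keyword rule is selected.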
Finally, it should be noted that: the foregoing description is only illustrative of the preferred embodiments of the present invention, and although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments described, or equivalents may be substituted for elements thereof, and any modifications, equivalents, improvements or changes may be made without departing from the spirit and principles of the present invention.

Claims (7)

1. A rapid identification system for unstructured massive text sensitive data, comprising a modeling unit, an identification layer unit, a storage unit, a support layer unit and a serialization unit, characterized in that: the modeling unit comprises an information acquisition module and a modeling calculation module; the modeling unit is electrically connected with the identification layer unit; the storage unit is used for providing persistent storage for metadata of the modeling unit; a read-write interface is arranged on the storage unit; the storage unit is electrically connected with the modeling unit; the support layer unit comprises a business monitoring module, a man-machine interaction module, a service hosting module and a log tracking module; and the support layer unit is used for optimizing the algorithm of the serialization unit and for providing a strategy and basis for the acquisition focus of the information acquisition module;
the information acquisition module comprises a manual acquisition module and a machine acquisition module, wherein the manual acquisition module is used for manually sorting sample data into the storage unit and labeling each sample with a sensitive type and grade, an interface connected with the identification layer unit is arranged on the manual acquisition module, the manual acquisition module supports manual batch import of keywords, the manual acquisition module acquires at least 100 pieces of information for each sample, and the samples are stored in the storage unit;
the identification layer unit takes the output of the modeling unit as input to perform the initial loading of the models, the identification layer unit dynamically adds and removes model items according to service requirements and supports hot-plug operation, and the hit scores returned by the identification layer unit for each sensitive model are combined by a summarizing algorithm, namely, the weight of each classification is multiplied by the accumulated matching-degree value and the logarithm is taken, and the result, a floating point number between zero and one, is used as a correction value in the final sensitivity evaluation.
2. A rapid identification system for unstructured massive text-sensitive data according to claim 1, wherein: the modeling calculation module is used for providing corresponding human acquisition calculation and computer acquisition calculation for the manual acquisition module and the machine acquisition module; the computer acquisition calculation adopts algorithms open-sourced by enterprises at the technical forefront of the industry in the fields of neural networks and artificial intelligence; the human acquisition calculation is used for correcting the business relevance of the computer acquisition calculation and for applying smooth transition processing to the sensitive classification and rating scoring system; the human acquisition calculation introduces similarity calculation and an extensible Hamming distance algorithm, and adds natural language processing for approximation and lexical association.
3. A rapid identification system for unstructured massive text-sensitive data according to claim 1, wherein: the identification layer unit is used for grouping massive unstructured text data by means of a classification algorithm or a clustering algorithm, then detecting and processing the character set and language, converting the text into the character set used for internal storage as required, segmenting the metadata with a word segmentation system, and extracting the keywords of the current text after deleting stop words.
4. A rapid identification system for unstructured massive text-sensitive data according to claim 1, wherein: the storage unit adopts a semi-structured distributed storage solution, chosen on the basis of the characteristics of network data, to store web page content with high scalability, and the message queue in the storage unit is first-in first-out and can be freely subscribed to.
5. The rapid identification system for unstructured massive text-sensitive data of claim 4, wherein: the read-write interface is used for periodically reading incremental data and transmitting it to the storage unit to form a message queue, which is pushed to each service node of the identification layer unit; each service node consumes queue data according to its equipment load and writes the result back into the message queue immediately after processing is finished.
6. A rapid identification system for unstructured massive text-sensitive data according to claim 1, wherein: the serialization unit comprises a sensitive information module, which is used for encapsulating sensitive words and sensitive fields; serialization is performed at the production end and deserialization at the consumption end; and the information to be serialized by the serialization unit comprises the version number, information type, operation type, encryption identifier and key, data length, data information and identification result.
7. A rapid identification system for unstructured massive text-sensitive data according to claim 1, wherein: the specific workflow of the rapid identification system for unstructured massive text sensitive data is as follows:
S1: data acquisition and storage: the data provided by the institution or enterprise to be identified is stored in HBase, ES or another non-relational database;
S2: identification: part or all of the identification models are loaded according to the configuration items, and data identification is carried out according to the identification models using relation extraction on the data; a thread pool reads records from the message queue one by one, executes the deserialization operation and follows different processing flows according to the data type; after model matching is completed, a summarization calculation is carried out, the result is serialized and written back to the system bus message queue topic, and a log of the current execution is recorded for offline effect analysis;
S3: bus queue: when the system bus starts, a production worker thread and a consumption worker thread are created; the production worker thread periodically tracks changes in the incremental data of the underlying storage and, when data arrives, extracts the data to be consumed from the storage unit and puts it into the consumption topic; the consumption worker thread is suspended waiting at the entry and, when a new message arrives, automatically triggers the write-back operation and updates the data at the original address in the underlying storage;
S4: log analysis: the serialization unit analyzes log data in batch mode on an hourly basis and generates a report, with statistics on data scale, proportion of sensitive information, intensity of sensitive information, propagation frequency and popularity, and identification accuracy;
S5: support system: the support layer unit provides support capability for the whole framework of the rapid identification system, mainly for the database component and the learning engine; the database needs to be cleaned and optimized regularly, and the learning engine needs to update the identification algorithms and the identification library in a timely manner;
S6: external interface: an interface for importing sensitive identification data and a request interface for sensitive data identification are provided.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010338431.4A CN111522950B (en) 2020-04-26 2020-04-26 Rapid identification system for unstructured massive text sensitive data

Publications (2)

Publication Number Publication Date
CN111522950A (en) 2020-08-11
CN111522950B (en) 2023-06-27

Family

ID=71903482

Family Applications (1)

Application Number Status Publication Number Priority Date Filing Date Title
CN202010338431.4A Active CN111522950B (en) 2020-04-26 2020-04-26 Rapid identification system for unstructured massive text sensitive data

Country Status (1)

Country Link
CN (1) CN111522950B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580092B (en) * 2020-12-07 2023-03-24 北京明朝万达科技股份有限公司 Sensitive file identification method and device
CN112698676B (en) * 2020-12-09 2021-10-01 泽恩科技有限公司 AI-based intelligent power distribution room operation method
CN113343108B (en) * 2021-06-30 2023-05-26 中国平安人寿保险股份有限公司 Recommended information processing method, device, equipment and storage medium

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105205104A (en) * 2015-08-26 2015-12-30 成都布林特信息技术有限公司 Cloud platform data acquisition method
CN106446232A (en) * 2016-10-08 2017-02-22 深圳市彬讯科技有限公司 Sensitive texts filtering method based on rules
CN107463666A (en) * 2017-08-02 2017-12-12 成都德尔塔信息科技有限公司 A kind of filtering sensitive words method based on content of text
CN107480549A (en) * 2017-06-28 2017-12-15 银江股份有限公司 A kind of shared sensitive information desensitization method of data-oriented and system
CN109284631A (en) * 2018-10-26 2019-01-29 中国电子科技网络信息安全有限公司 A kind of document desensitization system and method based on big data
CN109299865A (en) * 2018-09-06 2019-02-01 西南大学 Psychological assessment system and method, information data processing terminal based on semantic analysis
CN109716345A (en) * 2016-04-29 2019-05-03 普威达有限公司 Computer implemented privacy engineering system and method
CN110377731A (en) * 2019-06-18 2019-10-25 深圳壹账通智能科技有限公司 Complain text handling method, device, computer equipment and storage medium
CN110415053A (en) * 2019-08-12 2019-11-05 秦宇亮 A kind of user experience monitoring system and method based on big data

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8171412B2 (en) * 2006-06-01 2012-05-01 International Business Machines Corporation Context sensitive text recognition and marking from speech
US8752181B2 (en) * 2006-11-09 2014-06-10 Touchnet Information Systems, Inc. System and method for providing identity theft security

Also Published As

Publication number Publication date
CN111522950A (en) 2020-08-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant