CN111522950A - Rapid identification system for unstructured massive text sensitive data - Google Patents

Rapid identification system for unstructured massive text sensitive data

Info

Publication number
CN111522950A
Authority
CN
China
Prior art keywords
data
unit
identification
sensitive
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010338431.4A
Other languages
Chinese (zh)
Other versions
CN111522950B (en)
Inventor
章明珠
刘超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Siwei Century Technology Co ltd
Original Assignee
Chengdu Siwei Century Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Siwei Century Technology Co ltd
Priority to CN202010338431.4A
Publication of CN111522950A
Application granted
Publication of CN111522950B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a rapid identification system for unstructured massive text sensitive data. The system comprises a modeling unit, an identification layer unit, a storage unit, a supporting layer unit and a serialization unit. The modeling unit comprises an information acquisition module and a modeling calculation module and is electrically connected with the identification layer unit. The storage unit provides persistent storage for the metadata of the modeling unit and is electrically connected with the modeling unit. The supporting layer unit comprises a business monitoring module, a human-computer interaction module, a service hosting module and a log tracking module. To classify unstructured data rapidly, a learning engine autonomously selects a suitable algorithm from common classification algorithms, which improves identification efficiency and enables efficient identification of unstructured data. A query method matching the sensitive type can likewise be selected autonomously for scanning, which improves scanning efficiency.

Description

Rapid identification system for unstructured massive text sensitive data
Technical Field
The invention belongs to the fields of data security, data classification algorithms and data modeling, and particularly relates to a rapid identification system for unstructured massive text sensitive data.
Background
For massive unstructured text data, the products currently on the market model the text of the unstructured data, compare text similarity, and refine and optimize classification algorithms for large-scale unstructured data in order to classify the data and extract sensitive content. The mainstream technical schemes use a neural network data analysis engine to classify and summarize text data and then extract and identify the data; the core technology is a sensitive identification engine that rapidly classifies and organizes text data. With the development and popularization of Internet technology, a large amount of unstructured electronic text exists on the Internet, and as the volume of web page data grows, sensitive data poses a constant threat to the daily life of enterprises and individuals. How to help enterprises identify sensitive data efficiently, how to classify sensitive data rapidly from massive unstructured text, how to express unstructured text data in a form that a computer can understand, how to reduce the identification cost, and how to mine and store the data efficiently are questions for which market demand keeps growing.
The main disadvantage of the prior art for sensitive identification of unstructured data is as follows: when sensitive identification is performed on massive data, the identification efficiency is very low, mainly because of the low classification efficiency of the data and the low scanning efficiency for key information.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a rapid identification system for unstructured massive text sensitive data.
In order to achieve the purpose, the invention provides the following technical scheme:
A rapid identification system for unstructured massive text sensitive data comprises a modeling unit, an identification layer unit, a storage unit, a supporting layer unit and a serialization unit. The modeling unit comprises an information acquisition module and a modeling calculation module and is electrically connected with the identification layer unit. The storage unit provides persistent storage for the metadata of the modeling unit, is provided with a read-write interface, and is electrically connected with the modeling unit. The supporting layer unit comprises a business monitoring module, a human-computer interaction module, a service hosting module and a log tracking module; it is used for optimizing the algorithm of the serialization unit and for providing strategies and a basis for the collection focus of the information acquisition module.
Preferably, the information acquisition module comprises a manual acquisition module and a machine acquisition module. The manual acquisition module is used for manually sorting sample data into the storage unit and labeling the samples with their sensitive types and grades. It is provided with an interface connected with the identification layer unit and supports the manual batch import of keywords. The manual acquisition module collects no fewer than 100 pieces of information for each sample, and the samples are stored in the storage unit.
Preferably, the modeling calculation module provides a corresponding manual mining calculation and computer mining calculation for the manual acquisition module and the machine acquisition module. The computer mining calculation adopts algorithms open-sourced by leading technology companies in the fields of neural networks and artificial intelligence. The manual mining calculation corrects the business relevance of the computer mining calculation and smooths the transitions of the sensitive classification and rating scoring system; it introduces similarity calculation and an extended Hamming distance algorithm, and adds natural language processing of similarity and lexical association.
Preferably, the identification layer unit takes the output of the modeling unit as input to perform the initial loading of the models, dynamically adds and removes model items according to business needs, and supports hot-plug operation. The hit scores returned by the identification layer unit for each sensitive model are combined by a summarizing algorithm: the weight of each classification is multiplied by the logarithm of the accumulated matching degree of that classification, and the result, a floating-point number between zero and one, serves as a correction value for the final sensitivity evaluation.
Preferably, the identification layer unit summarizes the massive unstructured text data with a classification algorithm or a clustering algorithm, then detects the character set and the language, converts the text to the character set used for internal storage as needed, segments the metadata into words with a word segmentation system, and extracts the keywords of the current text after deleting stop words.
Preferably, the storage unit stores highly scalable web page content using a semi-structured distributed storage solution based on the characteristics of network data; the message queue in the storage unit is first-in first-out and can be freely subscribed to.
Preferably, the read-write interface periodically reads incremental data and transmits it to the storage unit to form a message queue, which is pushed to each service node of the identification layer unit. The service nodes are configured according to the device load, consume the queue data, and write the results back to the message queue immediately after processing is completed.
Preferably, the serialization unit comprises a sensitive information module used for encapsulating sensitive words and sensitive fields. Serialization is performed at the production end and deserialization at the consumption end. The information to be serialized comprises a version number, an information type, an operation type, an encryption identifier and key, a data length, the data information and an identification result.
Preferably, the specific workflow of the rapid identification system for unstructured massive text sensitive data is as follows:
S1: Data acquisition and storage. Data provided by the organization or enterprise that requires identification is stored in HBase, ES or another non-relational data store;
S2: Identification operation. Some or all identification models are loaded according to the configuration items, and relation extraction is used to identify data according to the identification models. A thread pool reads records from the message queue one by one and deserializes them, and different processing flows are executed according to the data type. After model matching is finished, the summary calculation is performed, the result is serialized and written back to the system bus message queue topic, and the current execution process is logged for offline effect analysis;
S3: Bus queue. When the system bus starts, it creates a production worker thread and a consumption worker thread. The production worker thread periodically tracks changes to the incremental data in the underlying storage and, when data arrives, extracts the data to be consumed from the storage unit and puts it into the consumption topic. The consumption worker thread is suspended and waits at the entry; a write-back operation is triggered automatically when a new message arrives, updating the data at its original address in the underlying storage;
S4: Log analysis. The serialization unit analyzes the log data in hourly units in full batch mode and generates a report, from which statistics such as data scale, proportion of sensitive information, sensitive information intensity, propagation frequency and popularity, and identification accuracy are obtained;
S5: Support system. The supporting layer unit provides support capability for the whole framework of the rapid identification system, mainly for the database component and the learning engine. The database needs to be cleaned and optimized regularly, and the learning engine needs to update the identification algorithm and the identification library in time;
S6: External interface. An interface is provided externally for importing sensitive identification data and for requesting the identification of sensitive data.
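The hourly batch log analysis of step S4 can be illustrated with a minimal Python sketch. The patent does not define a log format, so the record fields `time`, `sensitive` and `label` (a ground-truth flag used to estimate accuracy) and the function name `hourly_report` are hypothetical, and only three of the listed statistics are computed.

```python
from collections import defaultdict
from datetime import datetime


def hourly_report(records):
    """Group log records by hour and compute simple statistics per hour.

    Each record is a dict with hypothetical fields:
      "time"      - ISO 8601 timestamp of the identification
      "sensitive" - True if the system flagged the text as sensitive
      "label"     - True if the text really is sensitive (ground truth)
    """
    buckets = defaultdict(lambda: {"total": 0, "sensitive": 0, "correct": 0})
    for rec in records:
        hour = datetime.fromisoformat(rec["time"]).strftime("%Y-%m-%d %H:00")
        bucket = buckets[hour]
        bucket["total"] += 1
        if rec["sensitive"]:
            bucket["sensitive"] += 1
        if rec["sensitive"] == rec["label"]:
            bucket["correct"] += 1
    # One report row per hour, in chronological order.
    return {
        hour: {
            "data_scale": b["total"],
            "sensitive_ratio": b["sensitive"] / b["total"],
            "accuracy": b["correct"] / b["total"],
        }
        for hour, b in sorted(buckets.items())
    }


logs = [
    {"time": "2020-04-26T10:05:00", "sensitive": True, "label": True},
    {"time": "2020-04-26T10:40:00", "sensitive": False, "label": False},
    {"time": "2020-04-26T11:15:00", "sensitive": True, "label": False},
]
report = hourly_report(logs)
```

The remaining statistics named in S4 (sensitive information intensity, propagation frequency and popularity) would need additional log fields that the source does not describe.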
The technical effects and advantages of the invention are as follows: compared with the prior art, the rapid identification system for unstructured massive text sensitive data provided by the invention uses a learning engine to autonomously select a suitable algorithm from common classification algorithms for the rapid classification of unstructured data, which improves identification efficiency and enables efficient identification of unstructured data; a query method matching the sensitive type can likewise be selected autonomously for scanning, which improves scanning efficiency.
Drawings
FIG. 1 is a block diagram of the rapid identification system for unstructured massive text sensitive data according to the present invention;
FIG. 2 is a flowchart of the rapid identification system for unstructured massive text sensitive data according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
This embodiment provides a rapid identification system for unstructured massive text sensitive data, comprising a modeling unit, an identification layer unit, a storage unit, a supporting layer unit and a serialization unit. The modeling unit comprises an information acquisition module and a modeling calculation module and is electrically connected with the identification layer unit. The storage unit provides persistent storage for the metadata of the modeling unit, is provided with a read-write interface, and is electrically connected with the modeling unit. The supporting layer unit comprises a business monitoring module, a human-computer interaction module, a service hosting module and a log tracking module; it is used for optimizing the algorithm of the serialization unit and for providing strategies and a basis for the collection focus of the information acquisition module.
The information acquisition module comprises a manual acquisition module and a machine acquisition module. The manual acquisition module is used for manually sorting sample data into the storage unit and labeling the samples with their sensitive types and grades. It is provided with an interface connected with the identification layer unit and supports the manual batch import of keywords. The manual acquisition module collects no fewer than 100 pieces of information for each sample, and the samples are stored in the storage unit.
The modeling calculation module provides a corresponding manual mining calculation and computer mining calculation for the manual acquisition module and the machine acquisition module. The computer mining calculation adopts algorithms open-sourced by leading technology companies in the fields of neural networks and artificial intelligence. The manual mining calculation corrects the business relevance of the computer mining calculation and smooths the transitions of the sensitive classification and rating scoring system; it introduces similarity calculation and an extended Hamming distance algorithm, and adds natural language processing of similarity and lexical association.
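As an illustration of a similarity calculation based on Hamming distance, the following minimal Python sketch compares two token lists through 64-bit fingerprints. The patent does not say how fingerprints are built or how the Hamming distance algorithm is extended; the SimHash-style construction, the use of MD5 as the token hash, and the names `simhash`, `hamming_distance` and `similarity` are assumptions made for this example only.

```python
import hashlib

BITS = 64  # fingerprint width; an assumption for this sketch


def simhash(tokens):
    """Build a SimHash-style fingerprint from a list of string tokens."""
    counts = [0] * BITS
    for token in tokens:
        digest = hashlib.md5(token.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")  # first 64 bits of the hash
        for i in range(BITS):
            counts[i] += 1 if (h >> i) & 1 else -1
    fingerprint = 0
    for i, c in enumerate(counts):
        if c > 0:  # bit is set when more tokens voted 1 than 0
            fingerprint |= 1 << i
    return fingerprint


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a, b):
    """Map the Hamming distance to a similarity score in [0, 1]."""
    return 1.0 - hamming_distance(a, b) / BITS


sample = simhash(["bank", "card", "number"])
candidate = simhash(["bank", "card", "password"])
score = similarity(sample, candidate)
```

Texts that share most of their tokens tend to produce fingerprints that differ in few bits, so a threshold on `score` could be used to decide whether a text resembles a labeled sensitive sample.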
The identification layer unit takes the output of the modeling unit as input to perform the initial loading of the models, dynamically adds and removes model items according to business needs, and supports hot-plug operation. The hit scores returned by the identification layer unit for each sensitive model are combined by a summarizing algorithm: the weight of each classification is multiplied by the logarithm of the accumulated matching degree of that classification, and the result, a floating-point number between zero and one, serves as a correction value for the final sensitivity evaluation.
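The summarizing algorithm can be sketched as follows. The source states only that each classification weight is multiplied by the logarithm of the accumulated matching degree and that the result lies between zero and one; it does not say how that range is enforced. In this sketch the use of `log1p`, the normalization of the weights, and the saturation value `cap` are assumptions, and the name `summarize_hits` is hypothetical.

```python
import math


def summarize_hits(weights, match_sums, cap=100.0):
    """Combine per-classification hit scores into one value in [0, 1].

    weights    - dict: classification name -> non-negative weight
    match_sums - dict: classification name -> accumulated matching degree
    cap        - assumed saturation point for the accumulated matching degree
    """
    total_weight = sum(weights.values())
    if total_weight <= 0:
        return 0.0
    score = 0.0
    for category, weight in weights.items():
        accumulated = min(max(match_sums.get(category, 0.0), 0.0), cap)
        # weight x log(accumulated matching degree), scaled so that each
        # logarithmic term stays within [0, 1]
        score += (weight / total_weight) * (math.log1p(accumulated) / math.log1p(cap))
    return min(max(score, 0.0), 1.0)


correction = summarize_hits(
    {"id_number": 0.7, "phone": 0.3},
    {"id_number": 12.0, "phone": 3.0},
)
```

Because the logarithm grows slowly, repeated hits in one classification raise the score with diminishing returns, which keeps a single heavily matched classification from dominating the final evaluation.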
The identification layer unit summarizes the massive unstructured text data with a classification algorithm or a clustering algorithm, then detects the character set and the language, converts the text to the character set used for internal storage as needed, segments the metadata into words with a word segmentation system, and extracts the keywords of the current text after deleting stop words.
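The preprocessing chain described above (character set handling, word segmentation, stop-word removal, keyword extraction) can be sketched with the Python standard library. Everything specific here is an assumption: the candidate encodings, the tiny stop-word list, the regular-expression tokenizer standing in for a real word segmentation system, and frequency counting as the keyword criterion.

```python
import re
from collections import Counter

CANDIDATE_ENCODINGS = ("utf-8", "gb18030")  # assumed input character sets
STOP_WORDS = {"the", "a", "of", "is", "and", "to", "in"}  # illustrative only


def to_internal_text(raw: bytes) -> str:
    """Decode raw bytes by trying each candidate character set in turn."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep what can be decoded and mark the rest.
    return raw.decode("utf-8", errors="replace")


def segment(text: str):
    """Naive segmentation: Latin words and digits, or single CJK characters."""
    return re.findall(r"[a-z0-9]+|[\u4e00-\u9fff]", text.lower())


def extract_keywords(raw: bytes, top_n: int = 5):
    """Return the most frequent tokens after stop-word removal."""
    tokens = [t for t in segment(to_internal_text(raw)) if t not in STOP_WORDS]
    return [word for word, _ in Counter(tokens).most_common(top_n)]


keywords = extract_keywords(b"The password of the account is the password", 2)
```

A production system would replace `segment` with a dictionary-based or statistical segmenter, since splitting Chinese text into single characters loses word boundaries.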
The storage unit stores highly scalable web page content using a semi-structured distributed storage solution based on the characteristics of network data; the message queue in the storage unit is first-in first-out and can be freely subscribed to.
The read-write interface periodically reads incremental data and transmits it to the storage unit to form a message queue, which is pushed to each service node of the identification layer unit. The service nodes are configured according to the device load, consume the queue data, and write the results back to the message queue immediately after processing is completed.
The serialization unit comprises a sensitive information module used for encapsulating sensitive words and sensitive fields. Serialization is performed at the production end and deserialization at the consumption end. The information to be serialized comprises a version number, an information type, an operation type, an encryption identifier and key, a data length, the data information and an identification result.
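A message carrying the listed fields could be laid out as a fixed header followed by variable-length sections. The sketch below uses Python's `struct` module; the field widths, their order, the network byte order, and the names `Message`, `serialize` and `deserialize` are assumptions, since the source lists the fields but gives no wire format. No encryption is performed here; the key is carried only because the source lists it among the serialized fields.

```python
import struct
from dataclasses import dataclass

# Assumed header: version, info type, operation type, encryption flag
# (1 byte each), key length (2 bytes), data length (4 bytes),
# result length (2 bytes), all in network byte order.
HEADER = struct.Struct("!BBBBHIH")


@dataclass
class Message:
    version: int
    info_type: int
    op_type: int
    encrypted: bool
    key: bytes
    data: bytes
    result: str


def serialize(msg: Message) -> bytes:
    """Production end: pack the header, then key, data and result."""
    result = msg.result.encode("utf-8")
    header = HEADER.pack(
        msg.version, msg.info_type, msg.op_type, int(msg.encrypted),
        len(msg.key), len(msg.data), len(result),
    )
    return header + msg.key + msg.data + result


def deserialize(buf: bytes) -> Message:
    """Consumption end: read the header, then slice out each section."""
    version, info_type, op_type, encrypted, key_len, data_len, result_len = (
        HEADER.unpack_from(buf, 0)
    )
    offset = HEADER.size
    key = buf[offset:offset + key_len]
    offset += key_len
    data = buf[offset:offset + data_len]
    offset += data_len
    result = buf[offset:offset + result_len].decode("utf-8")
    return Message(version, info_type, op_type, bool(encrypted), key, data, result)


original = Message(1, 2, 3, True, b"k3y", b"card 6222 0000", "bank_card")
restored = deserialize(serialize(original))
```

Placing the version number first lets a consumer reject or adapt to messages written with a newer layout before reading the rest of the header.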
Example 2
The specific workflow of the rapid identification system for unstructured massive text sensitive data is as follows:
S1: Data acquisition and storage. Data provided by the organization or enterprise that requires identification is stored in HBase, ES or another non-relational data store;
S2: Identification operation. Some or all identification models are loaded according to the configuration items, and relation extraction is used to identify data according to the identification models. A thread pool reads records from the message queue one by one and deserializes them, and different processing flows are executed according to the data type. After model matching is finished, the summary calculation is performed, the result is serialized and written back to the system bus message queue topic, and the current execution process is logged for offline effect analysis;
S3: Bus queue. When the system bus starts, it creates a production worker thread and a consumption worker thread. The production worker thread periodically tracks changes to the incremental data in the underlying storage and, when data arrives, extracts the data to be consumed from the storage unit and puts it into the consumption topic. The consumption worker thread is suspended and waits at the entry; a write-back operation is triggered automatically when a new message arrives, updating the data at its original address in the underlying storage;
S4: Log analysis. The serialization unit analyzes the log data in hourly units in full batch mode and generates a report, from which statistics such as data scale, proportion of sensitive information, sensitive information intensity, propagation frequency and popularity, and identification accuracy are obtained;
S5: Support system. The supporting layer unit provides support capability for the whole framework of the rapid identification system, mainly for the database component and the learning engine. The database needs to be cleaned and optimized regularly, and the learning engine needs to update the identification algorithm and the identification library in time;
S6: External interface. An interface is provided externally for importing sensitive identification data and for requesting the identification of sensitive data.
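The production and consumption worker threads of step S3 can be illustrated with a minimal in-process sketch built on Python's `queue` and `threading` modules. It is a simplification under stated assumptions: a dict stands in for the underlying storage, the producer makes a single pass over a list of pending keys instead of polling periodically, the identification step and the write-back are merged into one consumer, and the names `run_bus`, `pending_keys` and `identify` are hypothetical.

```python
import queue
import threading


def run_bus(store, pending_keys, identify):
    """Move incremental records through a FIFO topic and write results back.

    store        - dict standing in for the underlying storage
    pending_keys - keys of records that changed since the last pass
    identify     - callable that returns the identification result for a text
    """
    topic = queue.Queue()  # first-in first-out consumption topic
    stop = object()        # sentinel that tells the consumer to finish

    def producer():
        # Extract the data to be consumed and put it into the topic.
        for key in pending_keys:
            topic.put((key, store[key]["text"]))
        topic.put(stop)

    def consumer():
        while True:
            item = topic.get()  # blocks (waits) until a message arrives
            if item is stop:
                break
            key, text = item
            # Write-back: update the record at its original location.
            store[key]["result"] = identify(text)

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store


storage = {
    "doc1": {"text": "my password is 123"},
    "doc2": {"text": "hello world"},
}
run_bus(storage, ["doc1", "doc2"], lambda text: "password" in text)
```

The blocking `get` call plays the role of the consumption thread that is suspended and waits at the entry: it uses no CPU until the producer publishes a message.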
In summary: compared with the prior art, the rapid identification system for unstructured massive text sensitive data provided by the invention uses a learning engine to autonomously select a suitable algorithm from common classification algorithms for the rapid classification of unstructured data, which improves identification efficiency and enables efficient identification of unstructured data; a query method matching the sensitive type can likewise be selected autonomously for scanning, which improves scanning efficiency.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments or portions thereof without departing from the spirit and scope of the invention.

Claims (9)

1. A rapid identification system for unstructured massive text sensitive data, comprising a modeling unit, an identification layer unit, a storage unit, a supporting layer unit and a serialization unit, characterized in that: the modeling unit comprises an information acquisition module and a modeling calculation module, and the modeling unit is electrically connected with the identification layer unit; the storage unit is used for providing persistent storage for the metadata of the modeling unit, a read-write interface is arranged on the storage unit, and the storage unit is electrically connected with the modeling unit; the supporting layer unit comprises a business monitoring module, a human-computer interaction module, a service hosting module and a log tracking module, and the supporting layer unit is used for optimizing the algorithm of the serialization unit and for providing strategies and a basis for the collection focus of the information acquisition module.
2. The rapid identification system for unstructured massive text sensitive data according to claim 1, characterized in that: the information acquisition module comprises a manual acquisition module and a machine acquisition module; the manual acquisition module is used for manually sorting sample data into the storage unit and labeling the samples with their sensitive types and grades; an interface connected with the identification layer unit is arranged on the manual acquisition module; the manual acquisition module supports the manual batch import of keywords; and the manual acquisition module collects no fewer than 100 pieces of information for each sample and stores the samples in the storage unit.
3. The rapid identification system for unstructured massive text sensitive data according to claim 2, characterized in that: the modeling calculation module is used for providing a corresponding manual mining calculation and computer mining calculation for the manual acquisition module and the machine acquisition module; the computer mining calculation adopts algorithms open-sourced by leading technology companies in the fields of neural networks and artificial intelligence; the manual mining calculation is used for correcting the business relevance of the computer mining calculation and for smoothing the transitions of the sensitive classification and rating scoring system; the manual mining calculation introduces similarity calculation and an extended Hamming distance algorithm; and the manual mining calculation adds natural language processing of similarity and lexical association.
4. The rapid identification system for unstructured massive text sensitive data according to claim 1, characterized in that: the identification layer unit takes the output of the modeling unit as input to perform the initial loading of the models, dynamically adds and removes model items according to business needs, and supports hot-plug operation; and the hit scores returned by the identification layer unit for each sensitive model are combined by a summarizing algorithm, namely, the weight of each classification is multiplied by the logarithm of the accumulated matching degree of that classification, and the result, a floating-point number between zero and one, serves as a correction value for the final sensitivity evaluation.
5. The rapid identification system for unstructured massive text sensitive data according to claim 4, characterized in that: the identification layer unit is used for summarizing the massive unstructured text data with a classification algorithm or a clustering algorithm, then detecting the character set and the language, converting the text to the character set used for internal storage as needed, segmenting the metadata into words with a word segmentation system, and extracting the keywords of the current text after deleting stop words.
6. The rapid identification system for unstructured massive text sensitive data according to claim 1, characterized in that: the storage unit stores highly scalable web page content using a semi-structured distributed storage solution based on the characteristics of network data, and the message queue in the storage unit is first-in first-out and can be freely subscribed to.
7. The rapid identification system for unstructured massive text sensitive data according to claim 6, characterized in that: the read-write interface is used for periodically reading incremental data and transmitting it to the storage unit to form a message queue, and the message queue is pushed to each service node of the identification layer unit; the service nodes are configured according to the device load, consume the queue data, and write the results back to the message queue immediately after processing is completed.
8. The rapid identification system for unstructured massive text sensitive data according to claim 1, characterized in that: the serialization unit comprises a sensitive information module used for encapsulating sensitive words and sensitive fields; serialization is performed at the production end and deserialization at the consumption end; and the information to be serialized by the serialization unit comprises a version number, an information type, an operation type, an encryption identifier and key, a data length, the data information and an identification result.
9. The system for rapidly recognizing the unstructured massive text sensitive data according to claim 1, is characterized in that: the specific work flow of the rapid identification system for the unstructured massive text sensitive data is as follows:
s1: data acquisition and storage, wherein data provided by an organization or an enterprise needing to be identified is stored in hbase, ES or other non-relational data;
s2: identification operation: some or all of the identification models are loaded according to the configuration items, and relation extraction is used to identify sensitive data in the input according to those models; a thread pool reads records from the message queue one at a time and deserializes them, runs a different processing flow for each data type, performs a summary calculation once model matching is finished, serializes the result and writes it back to the system bus message queue topic, and logs the current execution for offline effect analysis;
s3: system bus queue: the system comprises a bus queue, a memory unit and a data processing unit; when the system bus starts, the bus queue creates a production worker thread and a consumption worker thread; the production worker thread periodically tracks changes in the incremental data of the underlying storage and, when data arrives, extracts the data to be consumed from the memory unit and puts it into the consumption topic; the consumption worker thread is suspended and waits at the entry point, and a write-back operation is triggered automatically when a new message arrives, so that the data at its original address in the underlying storage is updated;
s4: log analysis: the serialization unit analyzes the log data in whole-batch mode at hourly granularity and generates a report; the statistics obtained include the data scale, the proportion of sensitive information, the sensitive information intensity, the propagation frequency heat, the identification accuracy, and the like;
s5: support system: the support layer unit provides supporting capability for the overall framework of the rapid identification system, chiefly that of the database component and the learning engine; the database needs to be cleaned and optimized regularly, and the learning engine needs to update its recognition algorithm and recognition library in a timely manner;
s6: external interface: provides external interfaces for importing sensitive identification data and for requesting identification of sensitive data.
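
For illustration only, the flow of steps s2 to s4 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the implementation claimed in the patent: in-process `queue.Queue` objects stand in for the system bus topics, JSON stands in for the serialization format, two regular expressions stand in for the identification models (the claims describe relation extraction, which is not reproduced here), and every name (`MODELS`, `identify`, `run_pipeline`, `hourly_report`) is hypothetical.

```python
import json
import queue
import re
import threading
from concurrent.futures import ThreadPoolExecutor
from datetime import datetime

# Hypothetical identification models: name -> compiled pattern (assumption).
MODELS = {
    "phone": re.compile(r"\b1\d{10}\b"),
    "email": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
}

def identify(payload: bytes) -> dict:
    """Deserialize one record, pick a flow by data type, match, summarize."""
    record = json.loads(payload)
    if record["type"] == "text":
        text = record["text"]
    else:  # structured records are flattened to text before matching
        text = json.dumps(record["fields"])
    hits = {name: len(p.findall(text)) for name, p in MODELS.items()}
    return {"id": record["id"], "hits": hits, "sensitive": any(hits.values())}

def run_pipeline(records, workers=4):
    inbound = queue.Queue()    # stands in for the consumption topic
    outbound = queue.Queue()   # stands in for the write-back topic
    log = []                   # execution log kept for offline analysis
    log_lock = threading.Lock()

    def producer():
        for rec in records:
            inbound.put(json.dumps(rec).encode("utf-8"))
        for _ in range(workers):
            inbound.put(None)  # one stop marker per worker

    def worker():
        # Blocks on get() until a message arrives, like the waiting consumer.
        while (payload := inbound.get()) is not None:
            result = identify(payload)
            outbound.put(json.dumps(result).encode("utf-8"))
            with log_lock:
                log.append({"ts": datetime.now(), **result})

    producer_thread = threading.Thread(target=producer)
    producer_thread.start()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for _ in range(workers):
            pool.submit(worker)
    producer_thread.join()

    results = []
    while not outbound.empty():
        results.append(json.loads(outbound.get()))
    return results, log

def hourly_report(log):
    """Batch the log by hour and compute record counts and sensitive ratio."""
    report = {}
    for entry in log:
        hour = entry["ts"].strftime("%Y-%m-%d %H:00")
        stats = report.setdefault(hour, {"records": 0, "sensitive": 0})
        stats["records"] += 1
        stats["sensitive"] += entry["sensitive"]
    for stats in report.values():
        stats["sensitive_ratio"] = stats["sensitive"] / stats["records"]
    return report
```

Calling `run_pipeline(records)` with a list of dictionaries that carry `id`, `type`, and either `text` or `fields` returns the written-back results together with the log, and `hourly_report(log)` groups that log by hour. In this sketch the producer simply iterates over a list and the write-back topic is drained at the end; a real deployment would poll the underlying storage for incremental changes and update the original records as results arrive.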
CN202010338431.4A 2020-04-26 2020-04-26 Rapid identification system for unstructured massive text sensitive data Active CN111522950B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010338431.4A CN111522950B (en) 2020-04-26 2020-04-26 Rapid identification system for unstructured massive text sensitive data

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010338431.4A CN111522950B (en) 2020-04-26 2020-04-26 Rapid identification system for unstructured massive text sensitive data

Publications (2)

Publication Number Publication Date
CN111522950A (en) 2020-08-11
CN111522950B (en) 2023-06-27

Family

ID=71903482

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010338431.4A Active CN111522950B (en) 2020-04-26 2020-04-26 Rapid identification system for unstructured massive text sensitive data

Country Status (1)

Country Link
CN (1) CN111522950B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112580092A (en) * 2020-12-07 2021-03-30 北京明朝万达科技股份有限公司 Sensitive file identification method and device
CN112698676A (en) * 2020-12-09 2021-04-23 泽恩科技有限公司 Intelligent power distribution room operation method based on AI and digital twin technology
CN113343108A (en) * 2021-06-30 2021-09-03 中国平安人寿保险股份有限公司 Recommendation information processing method, device, equipment and storage medium

Citations (11)

Publication number Priority date Publication date Assignee Title
US20070283270A1 (en) * 2006-06-01 2007-12-06 Sand Anne R Context sensitive text recognition and marking from speech
US20110040983A1 (en) * 2006-11-09 2011-02-17 Grzymala-Busse Withold J System and method for providing identity theft security
CN105205104A (en) * 2015-08-26 2015-12-30 成都布林特信息技术有限公司 Cloud platform data acquisition method
CN106446232A (en) * 2016-10-08 2017-02-22 深圳市彬讯科技有限公司 Sensitive texts filtering method based on rules
CN107463666A (en) * 2017-08-02 2017-12-12 成都德尔塔信息科技有限公司 A kind of filtering sensitive words method based on content of text
CN107480549A (en) * 2017-06-28 2017-12-15 银江股份有限公司 A kind of shared sensitive information desensitization method of data-oriented and system
CN109284631A (en) * 2018-10-26 2019-01-29 中国电子科技网络信息安全有限公司 A kind of document desensitization system and method based on big data
CN109299865A (en) * 2018-09-06 2019-02-01 西南大学 Psychological assessment system and method, information data processing terminal based on semantic analysis
CN109716345A (en) * 2016-04-29 2019-05-03 普威达有限公司 Computer implemented privacy engineering system and method
CN110377731A (en) * 2019-06-18 2019-10-25 深圳壹账通智能科技有限公司 Complain text handling method, device, computer equipment and storage medium
CN110415053A (en) * 2019-08-12 2019-11-05 秦宇亮 A kind of user experience monitoring system and method based on big data

Patent Citations (11)

Publication number Priority date Publication date Assignee Title
US20070283270A1 (en) * 2006-06-01 2007-12-06 Sand Anne R Context sensitive text recognition and marking from speech
US20110040983A1 (en) * 2006-11-09 2011-02-17 Grzymala-Busse Withold J System and method for providing identity theft security
CN105205104A (en) * 2015-08-26 2015-12-30 成都布林特信息技术有限公司 Cloud platform data acquisition method
CN109716345A (en) * 2016-04-29 2019-05-03 普威达有限公司 Computer implemented privacy engineering system and method
CN106446232A (en) * 2016-10-08 2017-02-22 深圳市彬讯科技有限公司 Sensitive texts filtering method based on rules
CN107480549A (en) * 2017-06-28 2017-12-15 银江股份有限公司 A kind of shared sensitive information desensitization method of data-oriented and system
CN107463666A (en) * 2017-08-02 2017-12-12 成都德尔塔信息科技有限公司 A kind of filtering sensitive words method based on content of text
CN109299865A (en) * 2018-09-06 2019-02-01 西南大学 Psychological assessment system and method, information data processing terminal based on semantic analysis
CN109284631A (en) * 2018-10-26 2019-01-29 中国电子科技网络信息安全有限公司 A kind of document desensitization system and method based on big data
CN110377731A (en) * 2019-06-18 2019-10-25 深圳壹账通智能科技有限公司 Complain text handling method, device, computer equipment and storage medium
CN110415053A (en) * 2019-08-12 2019-11-05 秦宇亮 A kind of user experience monitoring system and method based on big data

Cited By (5)

Publication number Priority date Publication date Assignee Title
CN112580092A (en) * 2020-12-07 2021-03-30 北京明朝万达科技股份有限公司 Sensitive file identification method and device
CN112580092B (en) * 2020-12-07 2023-03-24 北京明朝万达科技股份有限公司 Sensitive file identification method and device
CN112698676A (en) * 2020-12-09 2021-04-23 泽恩科技有限公司 Intelligent power distribution room operation method based on AI and digital twin technology
CN113343108A (en) * 2021-06-30 2021-09-03 中国平安人寿保险股份有限公司 Recommendation information processing method, device, equipment and storage medium
CN113343108B (en) * 2021-06-30 2023-05-26 中国平安人寿保险股份有限公司 Recommended information processing method, device, equipment and storage medium

Also Published As

Publication number Publication date
CN111522950B (en) 2023-06-27

Similar Documents

Publication Publication Date Title
CN109992645B (en) Data management system and method based on text data
CN101593200B (en) Method for classifying Chinese webpages based on keyword frequency analysis
Bisandu et al. Clustering news articles using efficient similarity measure and N-grams
CN111522950B (en) Rapid identification system for unstructured massive text sensitive data
Yao et al. Bursty event detection from collaborative tags
CN113962293B (en) LightGBM classification and representation learning-based name disambiguation method and system
CN104199857A (en) Tax document hierarchical classification method based on multi-tag classification
CN108647322B (en) Method for identifying similarity of mass Web text information based on word network
US20170109358A1 (en) Method and system of determining enterprise content specific taxonomies and surrogate tags
CN112100149B (en) Automatic log analysis system
CN112395539A (en) Public opinion risk monitoring method and system based on natural language processing
CN110163688A (en) Commodity network public sentiment detection system
CN112148881A (en) Method and apparatus for outputting information
CN112487161A (en) Enterprise demand oriented expert recommendation method, device, medium and equipment
CN111782806A (en) Artificial intelligence algorithm-based similar marketing enterprise retrieval classification method and system
CN115827862A (en) Associated acquisition method for multivariate expense voucher data
Hossari et al. TEST: A terminology extraction system for technology related terms
CN114328800A (en) Text processing method and device, electronic equipment and computer readable storage medium
Wang et al. Topic discovery method based on topic model combined with hierarchical clustering
Benny et al. Hadoop framework for entity resolution within high velocity streams
CN112417082A (en) Scientific research achievement data disambiguation filing storage method
Sun et al. A scenario model aggregation approach for mobile app requirements evolution based on user comments
Awad et al. Analyzing customer reviews on social media via applying association rule
CN109871429A (en) Merge the short text search method of Wikipedia classification and explicit semantic feature
Wang et al. A Method of Hot Topic Detection in Blogs Using N-gram Model.

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Fast Recognition System for Unstructured Massive Text Sensitive Data

Granted publication date: 20230627

Pledgee: Chengdu SME financing Company Limited by Guarantee

Pledgor: CHENGDU SIWEI CENTURY TECHNOLOGY Co.,Ltd.

Registration number: Y2024980015966