CN111522950A - Rapid identification system for unstructured massive text sensitive data - Google Patents
- Publication number: CN111522950A
- Application number: CN202010338431.4A
- Authority: CN (China)
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/953—Querying, e.g. by the use of web search engines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Evolutionary Computation (AREA)
- Medical Informatics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Artificial Intelligence (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a rapid identification system for unstructured massive text sensitive data, which comprises a modeling unit, an identification layer unit, a storage unit, a support layer unit and a serialization unit. The modeling unit comprises an information acquisition module and a modeling calculation module and is electrically connected with the identification layer unit. The storage unit provides persistent storage for the metadata of the modeling unit and is electrically connected with the modeling unit. The support layer unit comprises a business monitoring module, a human-computer interaction module, a service hosting module and a log tracking module. For the rapid classification of unstructured data, a learning engine independently selects a suitable algorithm from common classification algorithms, which improves identification efficiency and enables efficient identification of unstructured data; the corresponding query method can be selected independently according to the sensitive type for scanning, which improves scanning efficiency.
Description
Technical Field
The invention belongs to the fields of data security, data classification algorithms and data modeling, and particularly relates to a rapid identification system for unstructured massive text sensitive data.
Background
For massive unstructured text data, products currently on the market model the text of the unstructured data and compare text similarity in order to refine and optimize classification algorithms for large-scale unstructured data, and then classify the unstructured data and extract the sensitive content. The mainstream technical scheme at present is to use a neural network data analysis engine to classify and summarize text data and then extract and identify the data; its core technology is a sensitive identification engine that quickly classifies and systematizes text data. With the development and popularization of internet technology, a large amount of unstructured electronic text exists on the internet, and as web page data keeps growing, sensitive data constantly threatens the daily life of enterprises and individuals. How to help enterprises identify sensitive data efficiently, how to quickly classify sensitive data out of massive unstructured text, how to express unstructured text data in a form that a computer can understand, how to reduce the identification cost, and how to mine and store the data efficiently are questions for which market demand is increasingly broad.
The main disadvantage of the prior art for sensitive identification of unstructured data is as follows: when sensitive identification is performed on massive data, the identification efficiency is very low, mainly because of the classification efficiency of the data and the scanning efficiency of key information.
Disclosure of Invention
The invention aims to overcome the defects of the prior art, and provides a rapid identification system for unstructured massive text sensitive data.
To achieve this purpose, the invention provides the following technical scheme:
A rapid identification system for unstructured massive text sensitive data comprises a modeling unit, an identification layer unit, a storage unit, a support layer unit and a serialization unit. The modeling unit comprises an information acquisition module and a modeling calculation module, and is electrically connected with the identification layer unit. The storage unit provides persistent storage for the metadata of the modeling unit, is provided with a read-write interface, and is electrically connected with the modeling unit. The support layer unit comprises a business monitoring module, a human-computer interaction module, a service hosting module and a log tracking module; it is used for optimizing the algorithm of the serialization unit and for providing strategies and a basis for the collection focus of the information acquisition module.
Preferably, the information acquisition module comprises a manual acquisition module and a machine acquisition module. The manual acquisition module is used for manually sorting sample data into the storage unit and marking each sample with its sensitive type and grade; it is provided with an interface connected with the identification layer unit and allows keywords to be supplied manually for batch import. The manual acquisition module collects no fewer than 100 pieces of information for each sample, and the samples are stored in the storage unit.
Preferably, the modeling calculation module provides the corresponding manual mining calculation and computer mining calculation for the manual acquisition module and the machine acquisition module. The computer mining calculation adopts algorithms open-sourced by leading technology enterprises in the industry in the fields of neural networks and artificial intelligence. The manual mining calculation corrects the business relevance of the computer mining calculation and smooths the transitions of the sensitive classification and grading score system; it introduces similarity calculation and an extensible Hamming-distance algorithm, and adds natural language processing of similarity and lexical association.
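The similarity calculation based on Hamming distance mentioned above can be illustrated with a minimal sketch. The patent does not say how texts are turned into bit strings; the SimHash-style fingerprint below, the 64-bit width and the function names `simhash`, `hamming_distance` and `similarity` are all assumptions chosen for illustration.

```python
import hashlib


def simhash(tokens, bits=64):
    """Fold per-token hashes into one fingerprint (assumed SimHash-style scheme)."""
    counts = [0] * bits
    for tok in tokens:
        # Take the first 8 bytes of the MD5 digest as a 64-bit token hash.
        h = int.from_bytes(hashlib.md5(tok.encode("utf-8")).digest()[:8], "big")
        for i in range(bits):
            counts[i] += 1 if (h >> i) & 1 else -1
    # A fingerprint bit is set when more token hashes had that bit set than not.
    return sum(1 << i for i in range(bits) if counts[i] > 0)


def hamming_distance(a, b):
    """Number of bit positions in which two fingerprints differ."""
    return bin(a ^ b).count("1")


def similarity(a_tokens, b_tokens, bits=64):
    """Map the Hamming distance to a similarity in [0, 1]; 1.0 means identical fingerprints."""
    distance = hamming_distance(simhash(a_tokens, bits), simhash(b_tokens, bits))
    return 1.0 - distance / bits
```

A sample could then be compared with a labelled sensitive sample by calling `similarity(sample_tokens, labelled_tokens)` and applying a threshold, which would itself be a tuning choice not specified in the source.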
Preferably, the identification layer unit takes the output of the modeling unit as input to perform the initial loading of the models, dynamically adds and removes model items according to business needs, and supports hot-plug operation. The hit scores returned by the identification layer unit for each sensitive model are combined by a summarizing algorithm: the weight of each classification is multiplied by the logarithm of the accumulated matching degree of that classification, and the result, a floating-point number between zero and one, serves as a correction value for the final sensitivity evaluation.
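The summarizing algorithm above (weight times the logarithm of the accumulated matching degree) can be sketched as follows. The source does not state how the result is kept between zero and one, so two details here are assumptions: `log1p` is used so that an accumulated value of zero contributes zero, and the sum is squashed with `x / (1 + x)`. The name `summarize_hits` and the input shapes are hypothetical.

```python
import math


def summarize_hits(weights, match_degrees):
    """Combine per-classification hits into one correction value in [0, 1).

    weights:       {classification: non-negative weight}
    match_degrees: {classification: [non-negative matching degrees of each hit]}
    """
    total = 0.0
    for label, weight in weights.items():
        accumulated = sum(match_degrees.get(label, []))
        # weight x log(accumulated matching degree); log1p keeps the term >= 0
        total += weight * math.log1p(accumulated)
    # Assumed normalization: squash the non-negative sum into [0, 1).
    return total / (1.0 + total)
```

With this normalization the value grows monotonically with more or stronger hits but never reaches one, which is one way to satisfy the stated range; clamping or dividing by a known maximum would be equally plausible readings.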
Preferably, the identification layer unit first summarizes the massive unstructured text data by using a classification algorithm or a clustering algorithm, then determines the character set and the language, converts the text into the character set used for internal storage as needed, segments the metadata into words by using a word segmentation system, and extracts the keywords of the current text after deleting stop words.
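The preprocessing chain described above (character-set handling, word segmentation, stop-word removal, keyword extraction) can be sketched in a few lines. Everything concrete here is an assumption: the candidate encodings, the tiny English stop-word list, the regular-expression tokenizer standing in for a real word segmentation system, and frequency counting as the keyword criterion.

```python
import re
from collections import Counter

CANDIDATE_ENCODINGS = ("utf-8", "gb18030")  # assumed list of character sets to try
STOP_WORDS = {"the", "a", "of", "and", "is", "to"}  # assumed, illustrative only


def to_internal_text(raw: bytes) -> str:
    """Decode bytes with the first candidate encoding that fits (internal form: str)."""
    for encoding in CANDIDATE_ENCODINGS:
        try:
            return raw.decode(encoding)
        except UnicodeDecodeError:
            continue
    # Last resort: keep going with replacement characters instead of failing.
    return raw.decode("utf-8", errors="replace")


def segment(text: str):
    """Stand-in word segmentation: lowercase alphanumeric runs only."""
    return re.findall(r"[a-z0-9]+", text.lower())


def extract_keywords(raw: bytes, top_n: int = 5):
    """Return the most frequent non-stop-word tokens of one document."""
    tokens = [t for t in segment(to_internal_text(raw)) if t not in STOP_WORDS]
    return [word for word, _ in Counter(tokens).most_common(top_n)]
```

For Chinese text the `segment` function would have to be replaced by a dictionary-based or statistical segmenter; the source names no specific one.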
Preferably, based on the characteristics of network data, the storage unit adopts a semi-structured distributed storage solution to store web page content with high scalability; the message queue in the storage unit is first-in first-out and can be freely subscribed to.
Preferably, the read-write interface periodically reads incremental data and transmits it to the storage unit to form a message queue, which is pushed to each service node of the identification layer unit. The service nodes are configured according to the device load and the data in the consumption queue, and write results back to the message queue immediately after processing is completed.
Preferably, the serialization unit comprises a sensitive information module, which is used for encapsulating sensitive words and sensitive fields. The serialization unit serializes at the production end and deserializes at the consumption end, and the information it serializes comprises a version number, an information type, an operation type, an encryption identifier and key, a data length, the data information, and an identification result.
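The serialized fields listed above can be pictured as a fixed header followed by variable-length parts. The source gives only the field list; the byte layout, field widths, big-endian order and the names `Message`, `serialize` and `deserialize` below are assumptions for illustration.

```python
import struct
from dataclasses import dataclass

# Assumed header: version, information type, operation type, encryption flag
# (1 byte each), key length (2 bytes), data length (4 bytes), result length (2 bytes).
HEADER = struct.Struct(">BBBBHIH")


@dataclass
class Message:
    version: int
    info_type: int
    op_type: int
    encrypted: bool
    key: bytes
    data: bytes
    result: bytes


def serialize(msg: Message) -> bytes:
    """Production end: pack the header, then append key, data and result."""
    header = HEADER.pack(msg.version, msg.info_type, msg.op_type, int(msg.encrypted),
                         len(msg.key), len(msg.data), len(msg.result))
    return header + msg.key + msg.data + msg.result


def deserialize(buf: bytes) -> Message:
    """Consumption end: read the header, then slice the variable-length parts."""
    version, info_type, op_type, enc, key_len, data_len, result_len = HEADER.unpack_from(buf, 0)
    pos = HEADER.size
    key = buf[pos:pos + key_len]
    pos += key_len
    data = buf[pos:pos + data_len]
    pos += data_len
    result = buf[pos:pos + result_len]
    return Message(version, info_type, op_type, bool(enc), key, data, result)
```

Putting the version number first lets a consumer reject or adapt to a layout it does not understand before reading any other field, which is a common reason for carrying a version in such records.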
Preferably, the specific workflow of the rapid identification system for unstructured massive text sensitive data is as follows:
S1: data acquisition and storage: the data to be identified, provided by an organization or an enterprise, is stored in HBase, ES or another non-relational data store;
S2: identification operation: some or all of the identification models are loaded according to the configuration items, and relation extraction is used to identify sensitive data in the data according to the identification models; a thread pool reads records from the message queue one by one and deserializes them, and different processing flows are executed according to the data type; after model matching is finished, the summary calculation is performed, the result is serialized and written back to the message queue topic of the system bus, and the current execution process is logged for offline effect analysis;
S3: bus queue: when the system bus starts, it creates a production worker thread and a consumption worker thread; the production worker thread periodically tracks changes in the incremental data of the underlying storage and, when data arrives, extracts the data to be consumed from the storage unit and puts it into the consumption topic; the consumption worker thread is suspended and waits at the entrance, and when a new message arrives the write-back operation is triggered automatically, so that the data at the original address in the underlying storage is updated;
S4: log analysis: the serialization unit analyzes the log data in whole batches on an hourly basis and generates a report; the statistics include the data scale, the proportion of sensitive information, the intensity of sensitive information, the propagation frequency and popularity, the identification accuracy, and the like;
S5: support system: the support layer unit provides support capability for the whole framework of the rapid identification system, mainly for the database components and the learning engine; the database needs to be cleaned and optimized regularly, and the learning engine needs to update the identification algorithm and the identification library in a timely manner;
S6: external interface: provides, to external systems, an interface for importing sensitive identification data and a request interface for identifying sensitive data.
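The bus queue of step 3 above, with a production thread that picks up unprocessed records and a consumption thread that blocks until a message arrives and then writes the result back, can be sketched with the standard library. The in-memory dictionary standing in for the underlying storage, the single polling pass, the stop sentinel and the name `run_bus` are assumptions made to keep the sketch short and terminating.

```python
import queue
import threading


def run_bus(store, identify):
    """Run one producer and one consumer over `store`.

    store:    {key: {"text": str, "result": None or str}}  (stand-in for bottom-layer storage)
    identify: callable mapping text to an identification result (hypothetical)
    """
    topic = queue.Queue()  # FIFO consumption topic
    stop = object()        # sentinel telling the consumer to exit

    def producer():
        # One polling pass; a real producer would repeat this on a timer.
        for key, record in list(store.items()):
            if record["result"] is None:  # incremental, not yet processed
                topic.put(key)
        topic.put(stop)

    def consumer():
        while True:
            key = topic.get()  # blocks (thread is suspended) until a message arrives
            if key is stop:
                break
            # Write-back: update the record at its original location.
            store[key]["result"] = identify(store[key]["text"])

    threads = [threading.Thread(target=producer), threading.Thread(target=consumer)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return store
```

A call such as `run_bus(records, my_identifier)` leaves already processed records untouched and fills in the rest, which mirrors the incremental behaviour described in the step.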
The technical effects and advantages of the invention are as follows: compared with conventional technology, the rapid identification system for unstructured massive text sensitive data provided by the invention uses a learning engine to independently select a suitable algorithm from common classification algorithms for the rapid classification of unstructured data, which improves identification efficiency and enables efficient identification of unstructured data; the corresponding query method can be selected automatically according to the sensitive type for scanning, which improves scanning efficiency.
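The source does not describe how the learning engine chooses among classification algorithms. One plausible reading, sketched below purely as an assumption, is to train each candidate on labelled samples and keep the one with the best accuracy on held-out samples. The two toy candidates (`fit_majority`, `fit_keyword`) and the function `select_algorithm` are hypothetical stand-ins, not algorithms named in the patent.

```python
from collections import Counter


def fit_majority(train):
    """Baseline candidate: always predict the most common training label."""
    label = Counter(lbl for _, lbl in train).most_common(1)[0][0]
    return lambda text: label


def fit_keyword(train):
    """Toy candidate: flag a text if it contains a word seen only in sensitive samples."""
    sensitive = {w for text, lbl in train if lbl for w in text.split()}
    normal = {w for text, lbl in train if not lbl for w in text.split()}
    keywords = sensitive - normal
    return lambda text: any(w in keywords for w in text.split())


def select_algorithm(candidates, train, validation):
    """Train every candidate and return (name, model, accuracy) of the best one."""
    best = (None, None, -1.0)
    for name, fit in candidates.items():
        model = fit(train)
        accuracy = sum(model(text) == lbl for text, lbl in validation) / len(validation)
        if accuracy > best[2]:
            best = (name, model, accuracy)
    return best
```

Calling `select_algorithm({"majority": fit_majority, "keyword": fit_keyword}, train, validation)` returns the winning name together with the trained model, so the caller can use the model directly for classification.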
Drawings
FIG. 1 is a block diagram of the rapid identification system for unstructured massive text sensitive data according to the present invention;
FIG. 2 is a flowchart of the rapid identification system for unstructured massive text sensitive data according to the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is further described in detail with reference to the following embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
A rapid identification system for unstructured massive text sensitive data comprises a modeling unit, an identification layer unit, a storage unit, a support layer unit and a serialization unit. The modeling unit comprises an information acquisition module and a modeling calculation module, and is electrically connected with the identification layer unit. The storage unit provides persistent storage for the metadata of the modeling unit, is provided with a read-write interface, and is electrically connected with the modeling unit. The support layer unit comprises a business monitoring module, a human-computer interaction module, a service hosting module and a log tracking module; it is used for optimizing the algorithm of the serialization unit and for providing strategies and a basis for the collection focus of the information acquisition module.
The information acquisition module comprises a manual acquisition module and a machine acquisition module. The manual acquisition module is used for manually sorting sample data into the storage unit and marking each sample with its sensitive type and grade; it is provided with an interface connected with the identification layer unit and allows keywords to be supplied manually for batch import. The manual acquisition module collects no fewer than 100 pieces of information for each sample, and the samples are stored in the storage unit.
The modeling calculation module provides the corresponding manual mining calculation and computer mining calculation for the manual acquisition module and the machine acquisition module. The computer mining calculation adopts algorithms open-sourced by leading technology enterprises in the industry in the fields of neural networks and artificial intelligence. The manual mining calculation corrects the business relevance of the computer mining calculation and smooths the transitions of the sensitive classification and grading score system; it introduces similarity calculation and an extensible Hamming-distance algorithm, and adds natural language processing of similarity and lexical association.
The identification layer unit takes the output of the modeling unit as input to perform the initial loading of the models, dynamically adds and removes model items according to business needs, and supports hot-plug operation. The hit scores returned by the identification layer unit for each sensitive model are combined by a summarizing algorithm: the weight of each classification is multiplied by the logarithm of the accumulated matching degree of that classification, and the result, a floating-point number between zero and one, serves as a correction value for the final sensitivity evaluation.
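The hot-plug behaviour described above, where model items are added and removed while identification keeps running, can be sketched as a small registry. Representing a model as a callable that returns a matching degree, the copy-on-write update strategy and the class name `ModelRegistry` are assumptions; the source does not specify a mechanism.

```python
import threading


class ModelRegistry:
    """Holds the loaded identification models and allows changes at run time."""

    def __init__(self, initial_models):
        # Initial loading: models produced by the modeling stage.
        self._lock = threading.Lock()
        self._models = dict(initial_models)

    def plug(self, name, model):
        """Add or replace a model without stopping ongoing matching."""
        with self._lock:
            # Copy-on-write: readers holding the old dict are not disturbed.
            self._models = {**self._models, name: model}

    def unplug(self, name):
        """Remove a model; unknown names are ignored."""
        with self._lock:
            self._models = {k: v for k, v in self._models.items() if k != name}

    def match(self, text):
        """Run every currently loaded model and return its matching degree."""
        models = self._models  # snapshot of the current model set
        return {name: model(text) for name, model in models.items()}
```

Because `match` works on a snapshot, a call that is already in progress finishes with the model set it started with, and the next call sees the updated set.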
The identification layer unit first summarizes the massive unstructured text data by using a classification algorithm or a clustering algorithm, then determines the character set and the language, converts the text into the character set used for internal storage as needed, segments the metadata into words by using a word segmentation system, and extracts the keywords of the current text after deleting stop words.
Based on the characteristics of network data, the storage unit adopts a semi-structured distributed storage solution to store web page content with high scalability; the message queue in the storage unit is first-in first-out and can be freely subscribed to.
The read-write interface periodically reads incremental data and transmits it to the storage unit to form a message queue, which is pushed to each service node of the identification layer unit. The service nodes are configured according to the device load and the data in the consumption queue, and write results back to the message queue immediately after processing is completed.
The serialization unit comprises a sensitive information module, which is used for encapsulating sensitive words and sensitive fields. The serialization unit serializes at the production end and deserializes at the consumption end, and the information it serializes comprises a version number, an information type, an operation type, an encryption identifier and key, a data length, the data information, and an identification result.
Example 2
The specific workflow of the rapid identification system for unstructured massive text sensitive data is as follows:
S1: data acquisition and storage: the data to be identified, provided by an organization or an enterprise, is stored in HBase, ES or another non-relational data store;
S2: identification operation: some or all of the identification models are loaded according to the configuration items, and relation extraction is used to identify sensitive data in the data according to the identification models; a thread pool reads records from the message queue one by one and deserializes them, and different processing flows are executed according to the data type; after model matching is finished, the summary calculation is performed, the result is serialized and written back to the message queue topic of the system bus, and the current execution process is logged for offline effect analysis;
S3: bus queue: when the system bus starts, it creates a production worker thread and a consumption worker thread; the production worker thread periodically tracks changes in the incremental data of the underlying storage and, when data arrives, extracts the data to be consumed from the storage unit and puts it into the consumption topic; the consumption worker thread is suspended and waits at the entrance, and when a new message arrives the write-back operation is triggered automatically, so that the data at the original address in the underlying storage is updated;
S4: log analysis: the serialization unit analyzes the log data in whole batches on an hourly basis and generates a report; the statistics include the data scale, the proportion of sensitive information, the intensity of sensitive information, the propagation frequency and popularity, the identification accuracy, and the like;
S5: support system: the support layer unit provides support capability for the whole framework of the rapid identification system, mainly for the database components and the learning engine; the database needs to be cleaned and optimized regularly, and the learning engine needs to update the identification algorithm and the identification library in a timely manner;
S6: external interface: provides, to external systems, an interface for importing sensitive identification data and a request interface for identifying sensitive data.
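The hourly batch log analysis of step 4 above can be sketched as a grouping pass over log entries. The log entry shape (an ISO timestamp plus predicted and actual labels), the three reported metrics and the name `hourly_report` are assumptions; the other statistics the step mentions, such as intensity and propagation popularity, are not defined in the source and are left out.

```python
from collections import defaultdict
from datetime import datetime


def hourly_report(log_entries):
    """Group log entries by hour and compute per-hour statistics.

    Each entry is assumed to look like:
    {"time": "2020-04-26T10:05:00", "predicted_sensitive": bool, "actually_sensitive": bool}
    """
    buckets = defaultdict(lambda: {"total": 0, "sensitive": 0, "correct": 0})
    for entry in log_entries:
        hour = datetime.fromisoformat(entry["time"]).strftime("%Y-%m-%d %H:00")
        bucket = buckets[hour]
        bucket["total"] += 1
        bucket["sensitive"] += entry["predicted_sensitive"]
        bucket["correct"] += entry["predicted_sensitive"] == entry["actually_sensitive"]
    return {
        hour: {
            "data_scale": b["total"],
            "sensitive_ratio": b["sensitive"] / b["total"],
            "accuracy": b["correct"] / b["total"],
        }
        for hour, b in sorted(buckets.items())
    }
```

Accuracy can only be computed where a ground-truth label exists, for example from manually reviewed samples, so in practice that column would cover a labelled subset of the log.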
In summary: compared with conventional technology, the rapid identification system for unstructured massive text sensitive data provided by the invention uses a learning engine to independently select a suitable algorithm from common classification algorithms for the rapid classification of unstructured data, which improves identification efficiency and enables efficient identification of unstructured data; the corresponding query method can be selected automatically according to the sensitive type for scanning, which improves scanning efficiency.
Finally, it should be noted that: although the present invention has been described in detail with reference to the foregoing embodiments, it will be apparent to those skilled in the art that modifications may be made to the embodiments or portions thereof without departing from the spirit and scope of the invention.
Claims (9)
1. A rapid identification system for unstructured massive text sensitive data, comprising a modeling unit, an identification layer unit, a storage unit, a support layer unit and a serialization unit, characterized in that: the modeling unit comprises an information acquisition module and a modeling calculation module, and is electrically connected with the identification layer unit; the storage unit is used for providing persistent storage for the metadata of the modeling unit, is provided with a read-write interface, and is electrically connected with the modeling unit; the support layer unit comprises a business monitoring module, a human-computer interaction module, a service hosting module and a log tracking module, and is used for optimizing the algorithm of the serialization unit and for providing strategies and a basis for the collection focus of the information acquisition module.
2. The rapid identification system for unstructured massive text sensitive data according to claim 1, characterized in that: the information acquisition module comprises a manual acquisition module and a machine acquisition module; the manual acquisition module is used for manually sorting sample data into the storage unit and marking each sample with its sensitive type and grade; the manual acquisition module is provided with an interface connected with the identification layer unit and allows keywords to be supplied manually for batch import; the manual acquisition module collects no fewer than 100 pieces of information for each sample, and the samples are stored in the storage unit.
3. The rapid identification system for unstructured massive text sensitive data according to claim 2, characterized in that: the modeling calculation module is used for providing the corresponding manual mining calculation and computer mining calculation for the manual acquisition module and the machine acquisition module; the computer mining calculation adopts algorithms open-sourced by leading technology enterprises in the industry in the fields of neural networks and artificial intelligence; the manual mining calculation is used for correcting the business relevance of the computer mining calculation and for smoothing the transitions of the sensitive classification and grading score system; the manual mining calculation introduces similarity calculation and an extensible Hamming-distance algorithm, and adds natural language processing of similarity and lexical association.
4. The rapid identification system for unstructured massive text sensitive data according to claim 1, characterized in that: the identification layer unit takes the output of the modeling unit as input to perform the initial loading of the models, dynamically adds and removes model items according to business needs, and supports hot-plug operation; the hit scores returned by the identification layer unit for each sensitive model are combined by a summarizing algorithm, namely, the weight of each classification is multiplied by the logarithm of the accumulated matching degree of that classification, and the result, a floating-point number between zero and one, serves as a correction value for the final sensitivity evaluation.
5. The rapid identification system for unstructured massive text sensitive data according to claim 4, characterized in that: the identification layer unit is used for first summarizing the massive unstructured text data by using a classification algorithm or a clustering algorithm, then determining the character set and the language, converting the text into the character set used for internal storage as needed, segmenting the metadata into words by using a word segmentation system, and extracting the keywords of the current text after deleting stop words.
6. The rapid identification system for unstructured massive text sensitive data according to claim 1, characterized in that: based on the characteristics of network data, the storage unit adopts a semi-structured distributed storage solution to store web page content with high scalability, and the message queue in the storage unit is first-in first-out and can be freely subscribed to.
7. The rapid identification system for unstructured massive text sensitive data according to claim 6, characterized in that: the read-write interface is used for periodically reading incremental data and transmitting it to the storage unit to form a message queue, which is pushed to each service node of the identification layer unit; the service nodes are configured according to the device load and the data in the consumption queue, and write results back to the message queue immediately after processing is completed.
8. The rapid identification system for unstructured massive text sensitive data according to claim 1, characterized in that: the serialization unit comprises a sensitive information module used for encapsulating sensitive words and sensitive fields; the serialization unit serializes at the production end and deserializes at the consumption end, and the information it serializes comprises a version number, an information type, an operation type, an encryption identifier and key, a data length, the data information, and an identification result.
9. The rapid identification system for unstructured massive text sensitive data according to claim 1, characterized in that the specific workflow of the rapid identification system for unstructured massive text sensitive data is as follows:
S1: data acquisition and storage, wherein the data to be identified, provided by an organization or enterprise, is stored in HBase, ES, or another non-relational database;
S2: identification operation, wherein some or all of the identification models are loaded according to configuration items; relation extraction is used to identify sensitive data in the data according to the identification models; a thread pool reads records from the message queue one by one and deserializes them; different processing flows are executed according to data type; after model matching is finished, a summary calculation is performed and the serialized result is written back to the system bus message queue topic; and the current execution process is logged for offline effect analysis;
S3: bus queue, wherein a production worker thread and a consumption worker thread are created when the system bus starts; the production worker thread periodically tracks changes in the incremental data of the underlying storage and, when data arrives, extracts the data to be consumed from the storage unit and puts it into the consumption topic; the consumption worker thread is suspended and waits at the entry, and a write-back operation is triggered automatically when a new message arrives, so that the data at its original address in the underlying storage is updated;
S4: log analysis, wherein the serialization unit analyzes log data in hourly batches and generates a report, obtaining by statistics the data scale, the proportion of sensitive information, the intensity of sensitive information, the propagation frequency (heat), the identification accuracy, and the like;
S5: support system, wherein the support layer unit provides support capability for the whole framework of the rapid identification system, mainly the support capability of the database component and the learning engine; the database needs to be cleaned and optimized regularly, and the learning engine needs to update the recognition algorithm and the recognition library in time;
S6: external interface, which provides externally an interface for importing data for sensitive identification and a request interface for identifying sensitive data.
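A minimal sketch of steps S2 to S4, under stated assumptions: an in-process `queue.Queue` stands in for the system bus message queue, JSON stands in for the serialization format, and two illustrative regular expressions stand in for the loaded identification models. None of these choices is specified by the claim.

```python
import json
import queue
import re
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for loaded identification models: one regex per label.
MODELS = {
    "phone": re.compile(r"\b1\d{10}\b"),
    "id_card": re.compile(r"\b\d{17}[\dXx]\b"),
}


def identify(text):
    """Return the sorted labels of every model that matches the text."""
    return sorted(label for label, pattern in MODELS.items() if pattern.search(text))


def consume_one(in_q, out_q):
    """Read one record, deserialize it, match the models, serialize and write back."""
    record = json.loads(in_q.get())
    record["result"] = identify(record["data"])
    out_q.put(json.dumps(record))


def run_batch(messages, workers=4):
    """Push messages through the queue and a thread pool; return results ordered by id."""
    in_q, out_q = queue.Queue(), queue.Queue()
    for msg in messages:                      # production side
        in_q.put(json.dumps(msg))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(consume_one, in_q, out_q) for _ in messages]
        for future in futures:
            future.result()                   # surface any worker exception
    results = [json.loads(out_q.get()) for _ in messages]
    return sorted(results, key=lambda r: r["id"])


def summarize(results):
    """Summary statistics in the spirit of S4: data scale and sensitive proportion."""
    total = len(results)
    sensitive = sum(1 for r in results if r["result"])
    return {"total": total, "sensitive": sensitive,
            "sensitive_ratio": sensitive / total if total else 0.0}
```

In a deployment along the lines of the claim, the production side would be a worker thread tracking incremental changes in the underlying store, and the summary would be computed from hourly log batches rather than from an in-memory list.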
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010338431.4A CN111522950B (en) | 2020-04-26 | 2020-04-26 | Rapid identification system for unstructured massive text sensitive data |
Publications (2)
Publication Number | Publication Date |
---|---|
CN111522950A (en) | 2020-08-11 |
CN111522950B (en) | 2023-06-27 |
Family
ID=71903482
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010338431.4A Active CN111522950B (en) | 2020-04-26 | 2020-04-26 | Rapid identification system for unstructured massive text sensitive data |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111522950B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20070283270A1 (en) * | 2006-06-01 | 2007-12-06 | Sand Anne R | Context sensitive text recognition and marking from speech |
US20110040983A1 (en) * | 2006-11-09 | 2011-02-17 | Grzymala-Busse Withold J | System and method for providing identity theft security |
CN105205104A (en) * | 2015-08-26 | 2015-12-30 | 成都布林特信息技术有限公司 | Cloud platform data acquisition method |
CN106446232A (en) * | 2016-10-08 | 2017-02-22 | 深圳市彬讯科技有限公司 | Sensitive texts filtering method based on rules |
CN107463666A (en) * | 2017-08-02 | 2017-12-12 | 成都德尔塔信息科技有限公司 | A kind of filtering sensitive words method based on content of text |
CN107480549A (en) * | 2017-06-28 | 2017-12-15 | 银江股份有限公司 | A kind of shared sensitive information desensitization method of data-oriented and system |
CN109284631A (en) * | 2018-10-26 | 2019-01-29 | 中国电子科技网络信息安全有限公司 | A kind of document desensitization system and method based on big data |
CN109299865A (en) * | 2018-09-06 | 2019-02-01 | 西南大学 | Psychological assessment system and method, information data processing terminal based on semantic analysis |
CN109716345A (en) * | 2016-04-29 | 2019-05-03 | 普威达有限公司 | Computer implemented privacy engineering system and method |
CN110377731A (en) * | 2019-06-18 | 2019-10-25 | 深圳壹账通智能科技有限公司 | Complain text handling method, device, computer equipment and storage medium |
CN110415053A (en) * | 2019-08-12 | 2019-11-05 | 秦宇亮 | A kind of user experience monitoring system and method based on big data |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112580092A (en) * | 2020-12-07 | 2021-03-30 | 北京明朝万达科技股份有限公司 | Sensitive file identification method and device |
CN112580092B (en) * | 2020-12-07 | 2023-03-24 | 北京明朝万达科技股份有限公司 | Sensitive file identification method and device |
CN112698676A (en) * | 2020-12-09 | 2021-04-23 | 泽恩科技有限公司 | Intelligent power distribution room operation method based on AI and digital twin technology |
CN113343108A (en) * | 2021-06-30 | 2021-09-03 | 中国平安人寿保险股份有限公司 | Recommendation information processing method, device, equipment and storage medium |
CN113343108B (en) * | 2021-06-30 | 2023-05-26 | 中国平安人寿保险股份有限公司 | Recommended information processing method, device, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN111522950B (en) | 2023-06-27 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109992645B (en) | Data management system and method based on text data | |
CN101593200B (en) | Method for classifying Chinese webpages based on keyword frequency analysis | |
Bisandu et al. | Clustering news articles using efficient similarity measure and N-grams | |
CN111522950B (en) | Rapid identification system for unstructured massive text sensitive data | |
Yao et al. | Bursty event detection from collaborative tags | |
CN113962293B (en) | LightGBM classification and representation learning-based name disambiguation method and system | |
CN104199857A (en) | Tax document hierarchical classification method based on multi-tag classification | |
CN108647322B (en) | Method for identifying similarity of mass Web text information based on word network | |
US20170109358A1 (en) | Method and system of determining enterprise content specific taxonomies and surrogate tags | |
CN112100149B (en) | Automatic log analysis system | |
CN112395539A (en) | Public opinion risk monitoring method and system based on natural language processing | |
CN110163688A (en) | Commodity network public sentiment detection system | |
CN112148881A (en) | Method and apparatus for outputting information | |
CN112487161A (en) | Enterprise demand oriented expert recommendation method, device, medium and equipment | |
CN111782806A (en) | Artificial intelligence algorithm-based similar marketing enterprise retrieval classification method and system | |
CN115827862A (en) | Associated acquisition method for multivariate expense voucher data | |
Hossari et al. | TEST: A terminology extraction system for technology related terms | |
CN114328800A (en) | Text processing method and device, electronic equipment and computer readable storage medium | |
Wang et al. | Topic discovery method based on topic model combined with hierarchical clustering | |
Benny et al. | Hadoop framework for entity resolution within high velocity streams | |
CN112417082A (en) | Scientific research achievement data disambiguation filing storage method | |
Sun et al. | A scenario model aggregation approach for mobile app requirements evolution based on user comments | |
Awad et al. | Analyzing customer reviews on social media via applying association rule | |
CN109871429A (en) | Merge the short text search method of Wikipedia classification and explicit semantic feature | |
Wang et al. | A Method of Hot Topic Detection in Blogs Using N-gram Model. |
Legal Events
Code | Title | Description |
---|---|---|
PB01 | Publication | |
SE01 | Entry into force of request for substantive examination | |
GR01 | Patent grant | |
PE01 | Entry into force of the registration of the contract for pledge of patent right | Denomination of invention: A Fast Recognition System for Unstructured Massive Text Sensitive Data; Granted publication date: 20230627; Pledgee: Chengdu SME financing Company Limited by Guarantee; Pledgor: CHENGDU SIWEI CENTURY TECHNOLOGY Co.,Ltd.; Registration number: Y2024980015966 |