CN113672915A - Machine learning-based data leakage prevention system - Google Patents

Machine learning-based data leakage prevention system

Info

Publication number
CN113672915A
CN113672915A (application CN202111221497.6A)
Authority
CN
China
Prior art keywords
semantic
machine learning
algorithm
similarity
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111221497.6A
Other languages
Chinese (zh)
Inventor
韩旭东
王鹤
吴明
蒋荣
邹建宇
郑海树
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Zhongfu Information Technology Co Ltd
Original Assignee
Nanjing Zhongfu Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Zhongfu Information Technology Co Ltd filed Critical Nanjing Zhongfu Information Technology Co Ltd
Priority to CN202111221497.6A
Publication of CN113672915A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/556 Detecting local intrusion or implementing counter-measures involving covert channels, i.e. data leakage between processes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G06N20/20 Ensemble learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition

Abstract

The invention discloses a machine learning-based data leakage prevention system in the technical field of network data leakage prevention, comprising a preprocessing module; a semantic secret-related similarity calculation module; and a secret-relatedness judgment module. The invention combines semantic analysis, sememe fuzzy matching, and pattern recognition algorithms with confidential document detection, replacing the keyword-hit method of traditional confidential document detection, and achieves confidential document detection with a wider scope and more detection methods.

Description

Machine learning-based data leakage prevention system
Technical Field
The invention relates to the technical field of network data leakage prevention systems, and in particular to a machine learning-based data leakage prevention system.
Background
The operation process of the existing data leakage prevention system (see the framework diagram in FIG. 1) is as follows: a user's client sends a request to a target server through a proxy server; the proxy server caches the data transmitted over the network to a local file and passes it to a file detection service for policy matching; the data is then handled according to whether a policy is hit and the configured post-hit behavior (block, alarm, release). The file detection service is the core of the data leakage detection method. It adopts keyword regular-expression matching: according to the keywords set by the administrator at the management end, it performs protocol restoration and exact keyword matching on the files transmitted to the local machine, and if a keyword is hit it raises an alarm and reports the leakage behavior.
The data detection technology used by the existing data leakage prevention system follows an exact-matching approach: whether a file is leaked is decided by keyword-hit matching. The administrator issues a policy of relevant keywords at the management end; when those keywords appear in a confidential file, an alarm is raised and the data is blocked. The file detection service module of the existing system therefore only detects leakage when the relevant keywords appear explicitly in the confidential file; if the confidential file rephrases its sentences semantically so that the keywords in the policy are never mentioned, no alarm is raised and a data leakage risk appears.
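For illustration only (the patent contains no code), the prior-art keyword-hit policy amounts to little more than exact regular-expression matching over the recovered file text; the policy structure and all names below are hypothetical:

```python
import re

def keyword_policy_hit(text: str, keywords: list[str]) -> list[str]:
    """Return the policy keywords that appear verbatim in the text (prior-art exact matching)."""
    hits = []
    for kw in keywords:
        # Exact keyword matching: a rephrased sentence that omits the keyword
        # is never detected, which is the weakness the invention targets.
        if re.search(re.escape(kw), text):
            hits.append(kw)
    return hits

# Hypothetical usage: alarm and block when any policy keyword is hit.
policy = ["机密", "内部文件", "confidential"]
if keyword_policy_hit("This internal memo discusses the confidential merger.", policy):
    print("alarm: leakage behavior reported, transfer blocked")
```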
Based on the above, the invention designs a machine learning-based data leakage prevention system to solve the above problems.
Disclosure of Invention
The invention aims to provide a machine learning-based data leakage prevention system to solve the problems raised in the background art.
To achieve this purpose, the invention provides the following technical scheme: a machine learning-based data leakage prevention system comprising
a preprocessing module, which preprocesses sentences, performs word segmentation and vectorization on them with the word2vec algorithm, and splits them into short word-vector clauses;
a semantic secret-related similarity calculation module, which judges and calculates the secret-related similarity of the semantics by combining a fuzzy matching algorithm with a K-nearest neighbor algorithm;
and a secret-relatedness judgment module, which compares the secret-related similarity calculated by the semantic secret-related similarity calculation module with a specified threshold, and alarms on and blocks files whose calculated secret-related similarity exceeds that threshold.
As a further scheme of the present invention, the method by which the semantic secret-related similarity calculation module judges and calculates the secret-related similarity is as follows:
S2-1: construct a word vector space for the sentence, and construct a vector space for the sememe set;
S2-2: match the word vector space from S2-1 using a neural network algorithm, converting it into the corresponding sememe sequence;
S2-3: combining a non-parametric machine learning method, construct a sememe fuzzy matching set from the sememe sequence obtained in S2-2, and apply the fuzzy matching algorithm to the converted sememe sequence to perform secrecy detection and judge the degree of confidentiality.
As a further scheme of the invention, the mapping from word vectors to sememes in S2-2 is realized with a neural network algorithm. The detection module builds a training corpus with reference to the People's Daily annotated corpus and combines it with the HowNet knowledge base: the words in the training corpus serve as input and the HowNet sememe annotations serve as labels for training the neural network, which avoids the storage and traversal problems of a large-scale knowledge base in traditional methods. The neural network maps an input word vector to a sememe vector in the sememe space, completing the conversion from word sequence to sememe sequence. After the text to be detected is input, its word sequence is preprocessed, fed into the neural network, and converted into a sememe sequence, completing the sememe mapping work.
As a further scheme of the present invention, S2-2 integrates multiple long short-term memory (LSTM) network models using the Bagging algorithm, thereby expanding the prediction range of the models, improving their prediction capability, and obtaining more predicted sememes that satisfy the context information.
As a further scheme of the present invention, the LSTM network model integrated by Bagging in S2-2 is specifically a double-layer LSTM network model that shares hidden state information.
As a further scheme of the present invention, S2-3 measures the similarity of two sememe vectors with the Euclidean distance and performs fuzzy matching on the predicted sememes with the K-nearest neighbor algorithm.
Compared with the prior art, the invention has the following beneficial effects:
1. the invention combines semantic analysis, sememe fuzzy matching, and pattern recognition algorithms with confidential document detection, replacing the keyword-hit method of traditional confidential document detection and achieving confidential document detection with a wider scope and more detection methods;
2. the invention combines a training corpus with the HowNet knowledge base, uses the words in the training corpus as input and the HowNet sememe annotations as labels to train the neural network, and samples the training corpus multiple times to obtain sub-datasets with different distributions;
3. the invention trains on these sub-datasets to obtain differentiated double-layer LSTM networks; the different LSTM networks are integrated through the Bagging algorithm to obtain a deep neural network ensemble model capable of making time-series predictions on the sememe matching sequence.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings used in the description of the embodiments will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present invention, and it is obvious for those skilled in the art that other drawings can be obtained according to the drawings without creative efforts.
FIG. 1 is a block diagram of a prior-art data leakage prevention system;
FIG. 2 is a flow chart of the method by which the semantic secret-related similarity calculation module judges and calculates the secret-related similarity according to the present invention;
FIG. 3 is a flow chart of the machine learning-based data leakage prevention system according to the present invention;
FIG. 4 is the double-layer long short-term memory network model sharing hidden state information in the present invention;
FIG. 5 is a block diagram of the implementation flow of the machine learning file detection system of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Referring to FIGS. 1-5, the present invention provides the following technical solution: the machine learning-based data leakage prevention system comprises a preprocessing module, which preprocesses sentences, performs word segmentation and vectorization on them with the word2vec algorithm, and splits them into short word-vector clauses;
a semantic secret-related similarity calculation module, which judges and calculates the secret-related similarity of the semantics by combining a fuzzy matching algorithm with a K-nearest neighbor algorithm;
and a secret-relatedness judgment module, which compares the secret-related similarity calculated by the semantic secret-related similarity calculation module with a specified threshold, and alarms on and blocks files whose calculated secret-related similarity exceeds that threshold.
Aiming at the defect that the existing system needs explicit keywords to judge whether a file is leaked, the novel machine learning-based data leakage prevention system provides a semantic leakage judgment method based on machine learning, neural network, and fuzzy matching technologies. By training a neural network model, confidential data files are judged according to their semantics: even if a file contains no explicit keywords, the system can recognize the semantics of the file, and as long as the semantics of the file's sentences match the defined meanings of the leakage keywords, the file is alarmed on and blocked.
The machine learning file detection system, the key component of the invention, is implemented by the following steps (a minimal end-to-end sketch is given after the list):
preprocess the sentence: perform word segmentation and vectorization with the word2vec algorithm, splitting the sentence into short word-vector clauses;
construct a word vector space for the sentence and a vector space for the sememe set; match the word vector space to the sememe-set vector space with a neural network, integrate multiple LSTM network models with the Bagging algorithm, and use the ensemble model to expand the range of sememe judgment;
construct a fuzzy matching set of sememes by combining the K-nearest neighbor algorithm, improving the precision of sememe judgment;
compare the calculated secret-related similarity with a specified threshold; files exceeding the threshold are alarmed on and blocked.
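A minimal end-to-end sketch of the four steps above follows; the callables, module names, and similarity scale are hypothetical placeholders, since the patent describes the flow but gives no code:

```python
import numpy as np

def detect_leak(sentence: str, preprocess, word_to_sememe_ensemble,
                knn_fuzzy_match, secret_sememe_space: np.ndarray,
                threshold: float) -> bool:
    """Sketch of the detection flow: preprocess -> sememe mapping -> fuzzy match -> threshold."""
    # Step 1: word2vec-based preprocessing into a sequence of word vectors.
    word_vectors = preprocess(sentence)
    # Step 2: the Bagging ensemble of LSTM sub-models maps word vectors to a sememe sequence.
    sememe_vectors = word_to_sememe_ensemble(word_vectors)
    # Step 3: KNN fuzzy matching against the secret-related sememe space,
    # yielding a secret-related similarity score (assumed here to lie in [0, 1]).
    similarity = knn_fuzzy_match(sememe_vectors, secret_sememe_space)
    # Step 4: alarm on and block the file when the similarity exceeds the threshold.
    return similarity > threshold
```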
Key technology analysis:
(1) Preprocessing stage:
In a vector space model for text analysis, a word is described by a vector, the most common form being the one-hot representation.
An obvious disadvantage of the one-hot representation is that it establishes no association between words.
In deep learning, a word is generally described by a distributed representation, often called "word representation" or "word embedding", i.e., what is colloquially called a "word vector".
In the sentence preprocessing stage, the machine learning file detection system segments sentences into word vectors using the word2vec algorithm.
The word2vec algorithm is adopted because it builds the word segmentation and embedding model more simply; compared with a feed-forward neural network language model (FFNNLM), the main difference is that the model is simpler: the hidden layer of the FFNNLM is removed, and the input layer connects directly to the output layer.
When training a language model, the m-th word is predicted from the n words preceding it; when training word vectors, the m-th word is predicted from the n words before and after it, so the context is genuinely used for prediction.
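A minimal preprocessing sketch, assuming the gensim implementation of word2vec and jieba for Chinese word segmentation (neither library is named in the patent, and the toy corpus is invented):

```python
import jieba
from gensim.models import Word2Vec

# Toy corpus; in the described system the corpus would come from the documents
# to be protected and the annotated training corpus.
corpus = ["涉密文件禁止外发", "本文件包含内部研发数据"]
tokenized = [list(jieba.cut(s)) for s in corpus]

# CBOW-style word2vec: the m-th word is predicted from the words around it,
# so the resulting vectors encode context rather than surface form.
w2v = Word2Vec(sentences=tokenized, vector_size=100, window=5, min_count=1)

def preprocess(sentence: str):
    """Split a sentence into word vectors, skipping out-of-vocabulary words."""
    return [w2v.wv[w] for w in jieba.cut(sentence) if w in w2v.wv]
```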
(2) Sememe matching stage:
The basic flow of the sememe matching stage is as follows:
construct a word vector space for the sentence, and construct a vector space for the sememe set;
match the word vector space with a neural network algorithm, converting it into the corresponding sememe sequence;
construct a fuzzy matching set of sememes by combining the K-nearest neighbor algorithm, and apply the fuzzy matching algorithm to the converted sememe sequence to perform secrecy detection and judge the degree of confidentiality.
The mapping from word vectors to sememes is realized with a neural network algorithm. The detection module builds a training corpus with reference to the People's Daily annotated corpus and combines it with the HowNet knowledge base: the words in the training corpus serve as input and the HowNet sememe annotations serve as labels for training the neural network, which avoids the storage and traversal problems of a large-scale knowledge base in traditional methods. The neural network maps an input word vector to a sememe vector in the sememe space, completing the conversion from word sequence to sememe sequence. After the text to be detected is input, its word sequence is preprocessed, fed into the neural network, and converted into a sememe sequence, completing the sememe mapping work.
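A hedged PyTorch sketch of the word-vector-to-sememe mapping network described above; the layer sizes, the size of the sememe label set, and the random stand-in tensors are assumptions, not values from the patent:

```python
import torch
import torch.nn as nn

class WordToSememe(nn.Module):
    """Maps a sequence of word vectors to per-position sememe labels (HowNet-style)."""
    def __init__(self, word_dim: int = 100, hidden: int = 128, n_sememes: int = 2000):
        super().__init__()
        self.lstm = nn.LSTM(word_dim, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_sememes)

    def forward(self, word_vecs: torch.Tensor) -> torch.Tensor:
        # word_vecs: (batch, seq_len, word_dim) -> sememe logits per position.
        h, _ = self.lstm(word_vecs)
        return self.out(h)

# Hypothetical training step: corpus words as input, HowNet sememe annotations as labels.
model = WordToSememe()
loss_fn = nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

word_vecs = torch.randn(8, 20, 100)              # stand-in for preprocessed sentences
sememe_labels = torch.randint(0, 2000, (8, 20))  # stand-in for HowNet sememe annotations
opt.zero_grad()
loss = loss_fn(model(word_vecs).reshape(-1, 2000), sememe_labels.reshape(-1))
loss.backward()
opt.step()
```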
Because classified-text detection is oriented to sememe sequences of complete sentences, the sequence length is not fixed and long sequences occur with high probability, so a model capable of storing long time-series information is needed, and the long short-term memory (LSTM) network meets this requirement relatively well. The LSTM model stores time-series information more completely and learns long sequences more strongly; compared with a traditional knowledge-base-based language model, it learns the temporal evolution of the information rather than the information itself, so the requirement on the knowledge base is greatly reduced. By continuously accumulating time-series information, the LSTM model extracts more contextual semantic features, and by learning these features it can analyze and judge the input sememe sequence according to the context accumulated over the time series. The LSTM model is therefore better suited to building a natural language model over sememe feature sequences. Accordingly, the system integrates multiple LSTM network models with the Bagging algorithm, enlarging the prediction range of the models, improving the prediction capability, and obtaining more predicted sememes that satisfy the context information.
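The Bagging integration can be sketched as follows: each sub-model is trained on a bootstrap sample of the corpus, and their next-sememe predictions are pooled to widen the candidate set; the class and function names are illustrative only:

```python
import random
import torch

def bootstrap(dataset: list, k: int) -> list:
    """Draw k sub-datasets by sampling with replacement (the Bagging resampling step)."""
    n = len(dataset)
    return [[dataset[random.randrange(n)] for _ in range(n)] for _ in range(k)]

class SememeEnsemble:
    """Pools the next-sememe predictions of several LSTM sub-models."""
    def __init__(self, sub_models):
        self.sub_models = sub_models  # e.g. models trained on different bootstrap sub-datasets

    @torch.no_grad()
    def predict_next(self, sememe_seq: torch.Tensor, top_k: int = 3) -> set:
        candidates = set()
        for m in self.sub_models:
            logits = m(sememe_seq)[:, -1, :]  # each sub-model's prediction for the next time step
            candidates |= set(torch.topk(logits, top_k, dim=-1).indices.flatten().tolist())
        return candidates  # a wider set of context-compatible sememes than any single model gives
```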
After the sememe sequence is input into the ensemble model, the ensemble model feeds the sequence to each sub-model, and each sub-model predicts the collocated sememe for the next time step of the input sememes according to its own learning experience and the current context information.
During training, the sub-models learn from different data and fit the data with different tendencies, so this differentiation is exploited to find more collocated sememes that satisfy the current context. Although the LSTM network can reduce the impact of vanishing and exploding gradients to some extent, when processing long-range memory information it cannot avoid an explosive growth of memory caused by excessive redundant information; to cope with this, the system proposes a double-layer LSTM network model that shares hidden state information.
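One possible reading of the "double-layer LSTM sharing hidden state information" is sketched below: the second layer is initialised with the first layer's final hidden and cell states instead of zeros, so both layers work from the same summary of the sequence. This interpretation is an assumption; the patent does not detail the wiring:

```python
import torch
import torch.nn as nn

class SharedStateTwoLayerLSTM(nn.Module):
    """Two stacked LSTM layers in which layer 2 starts from layer 1's final (h, c) state."""
    def __init__(self, dim: int = 128):
        super().__init__()
        self.layer1 = nn.LSTM(dim, dim, batch_first=True)
        self.layer2 = nn.LSTM(dim, dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        out1, (h1, c1) = self.layer1(x)
        # Share the hidden state: layer 2 is initialised with layer 1's (h, c),
        # so redundant long-range information is not accumulated twice.
        out2, _ = self.layer2(out1, (h1, c1))
        return out2

x = torch.randn(4, 30, 128)
print(SharedStateTwoLayerLSTM()(x).shape)  # torch.Size([4, 30, 128])
```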
(3) Similarity calculation by fuzzy matching algorithm
In stage (2), the system uses a deep learning method to complete the word-to-sememe mapping; this method sacrifices some precision in exchange for better time and space complexity. The precision loss may strongly affect the sememe collocation prediction, so to improve prediction precision, in addition to the multi-model ensemble prediction model, a fuzzy matching method is needed to widen the range of sememe prediction. Fuzzy matching is usually performed through similarity. In the preceding work a sememe vector space was constructed from the HowNet knowledge base; this vector space is an embedded feature space over a high-dimensional sememe space, and every word in the corpus can find its corresponding point in it. In this vector space, the closer two word vectors are, or the higher their similarity, the closer the semantics of the two words. On this basis, the system measures the similarity of two sememe vectors with the Euclidean distance and performs fuzzy matching on the predicted sememes with a non-parametric machine learning method.
the machine learning method comprises a parameter-free method and a parameter-free method, wherein the parameter-free method comprises a K nearest neighbor algorithm (KNN) K-MEANS clustering method, a DBSCAN density clustering method and the like, the parameter-free method has the advantages that pre-training is not needed, raw data are directly processed, the parameter-free method is an important learning method in machine learning, the K nearest neighbor algorithm is adopted for fuzzy matching, and Euclidean distance is used for measuring the similarity between primitive vectors.
The technical innovation of the invention lies in updating the technology for judging whether a file is leaked; applying machine learning and neural network technology to the field of data leakage prevention is a leading implementation in the industry. The new system no longer judges leakage only by simple keyword matching: it adopts a Chinese semantic matching model, trains an automatic sentence-to-sememe conversion and matching model through a neural network, judges the secret-relatedness of the semantics and calculates the secret-related similarity by combining a fuzzy matching algorithm with a K-nearest neighbor algorithm, and finally judges the file to be leaked when the similarity exceeds a threshold.
the novel data leakage-proof system adopts machine learning and neural network technologies, establishes a Chinese semantic processing model with self-learning, self-decomposition semantics and self-judgment, avoids the defects existing in detection only by keywords, enlarges the detection range and the detection precision, can realize secret-related judgment on fuzzy articles without keywords through semantic analysis, and has better effect on secret leakage prevention.
In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise embodiments disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.

Claims (6)

1. A machine learning-based data leakage prevention system, characterized in that it comprises
a preprocessing module, which preprocesses sentences, performs word segmentation and vectorization on them with the word2vec algorithm, and splits them into short word-vector clauses;
a semantic secret-related similarity calculation module, which judges and calculates the secret-related similarity of the semantics by combining a fuzzy matching algorithm with a K-nearest neighbor algorithm;
and a secret-relatedness judgment module, which compares the secret-related similarity calculated by the semantic secret-related similarity calculation module with a specified threshold, and alarms on and blocks files whose calculated secret-related similarity exceeds that threshold.
2. The machine learning-based data leakage prevention system of claim 1, wherein the method by which the semantic secret-related similarity calculation module judges and calculates the secret-related similarity is as follows:
S2-1: construct a word vector space for the sentence, and construct a vector space for the sememe set;
S2-2: match the word vector space from S2-1 using a neural network algorithm, converting it into the corresponding sememe sequence;
S2-3: combining a non-parametric machine learning method, construct a sememe fuzzy matching set from the sememe sequence obtained in S2-2, and apply the fuzzy matching algorithm to the converted sememe sequence to perform secrecy detection and judge the degree of confidentiality.
3. The machine learning-based data leakage prevention system of claim 2, wherein the mapping from word vectors to sememes in S2-2 is realized with a neural network algorithm; the detection module builds a training corpus with reference to the People's Daily annotated corpus and combines it with the HowNet knowledge base, using the words in the training corpus as input and the HowNet sememe annotations as labels to train the neural network, which avoids the storage and traversal problems of a large-scale knowledge base in traditional methods; the neural network maps an input word vector to a sememe vector in the sememe space, completing the conversion from word sequence to sememe sequence; after the text to be detected is input, its word sequence is preprocessed, fed into the neural network, and converted into a sememe sequence, completing the sememe mapping work.
4. The machine learning-based data leakage prevention system of claim 3, wherein S2-2 integrates multiple long short-term memory (LSTM) network models using the Bagging algorithm, thereby expanding the prediction range of the models, improving their prediction capability, and obtaining more predicted sememes that satisfy the context information.
5. The machine learning-based data leakage prevention system of claim 4, wherein the LSTM network model integrated by Bagging in S2-2 is specifically a double-layer LSTM network model that shares hidden state information.
6. The machine learning-based data leakage prevention system of claim 2, wherein S2-3 measures the similarity of two sememe vectors with the Euclidean distance and performs fuzzy matching on the predicted sememes with the K-nearest neighbor algorithm.
CN202111221497.6A 2021-10-20 2021-10-20 Machine learning-based data leakage prevention system Pending CN113672915A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111221497.6A CN113672915A (en) 2021-10-20 2021-10-20 Machine learning-based data leakage prevention system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111221497.6A CN113672915A (en) 2021-10-20 2021-10-20 Machine learning-based data leakage prevention system

Publications (1)

Publication Number Publication Date
CN113672915A true CN113672915A (en) 2021-11-19

Family

ID=78550659

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111221497.6A Pending CN113672915A (en) 2021-10-20 2021-10-20 Machine learning-based data leakage prevention system

Country Status (1)

Country Link
CN (1) CN113672915A (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102819604A (en) * 2012-08-20 2012-12-12 徐亮 Method for retrieving confidential information of file and judging and marking security classification based on content correlation
CN104915386A (en) * 2015-05-25 2015-09-16 中国科学院自动化研究所 Short text clustering method based on deep semantic feature learning
CN110019640A (en) * 2017-07-25 2019-07-16 杭州盈高科技有限公司 Confidential document inspection method and device
CN110298024A (en) * 2018-03-21 2019-10-01 西北工业大学 Detection method, device and the storage medium of security files
CN110287333A (en) * 2019-06-12 2019-09-27 北京语言大学 A kind of knowledge based library carries out the method and system of paraphrase generation

Similar Documents

Publication Publication Date Title
CN111897970A (en) Text comparison method, device and equipment based on knowledge graph and storage medium
CN107798033B (en) Case text classification method in public security field
CN111694958A (en) Microblog topic clustering method based on word vector and single-pass fusion
CN110728151B (en) Information depth processing method and system based on visual characteristics
CN115081437B (en) Machine-generated text detection method and system based on linguistic feature contrast learning
Gao et al. Automatic image annotation through multi-topic text categorization
CN102426585A (en) Webpage automatic classification method based on Bayesian network
CN115203421A (en) Method, device and equipment for generating label of long text and storage medium
CN108280357A (en) Data leakage prevention method, system based on semantic feature extraction
CN114201583A (en) Chinese financial event automatic extraction method and system based on graph attention network
CN113672915A (en) Machine learning-based data leakage prevention system
Xu et al. A block-level RNN model for resume block classification
CN116842934A (en) Multi-document fusion deep learning title generation method based on continuous learning
CN116628377A (en) Webpage theme relevance judging method
Cai et al. Semantic entity detection by integrating CRF and SVM
CN115546496A (en) Internet of things equipment identification method and device under active detection scene
CN114842301A (en) Semi-supervised training method of image annotation model
CN113987536A (en) Method and device for determining security level of field in data table, electronic equipment and medium
CN114548104A (en) Few-sample entity identification method and model based on feature and category intervention
Yang et al. Web service clustering method based on word vector and biterm topic model
CN112270185A (en) Text representation method based on topic model
CN112784227A (en) Dictionary generating system and method based on password semantic structure
Gao et al. A supervised named entity recognition method based on pattern matching and semantic verification
Ren et al. Learning refined features for open-world text classification with class description and commonsense knowledge
Pan et al. Attentive Feature Focusing for Person Search by Natural Language

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20211119