CN109543764A - Warning information legitimacy detection method and detection system based on intelligent semantic perception - Google Patents
Warning information legitimacy detection method and detection system based on intelligent semantic perception
- Publication number
- CN109543764A (application CN201811438885.8A)
- Authority
- CN
- China
- Prior art keywords
- character
- standard
- semantic
- warning information
- vector
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
- G06F18/2411—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/30—Semantic analysis
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Abstract
The present invention provides a warning information legitimacy detection method and detection system based on intelligent semantic perception, comprising: S1: a deep-learning-based multi-standard word segmentation algorithm for vertical-domain early-warning text; S2: a coupled white list construction and real-time update method; S3: an online illegal character matching algorithm, in which the multi-standard word segmentation algorithm is applied to the warning information to be released to obtain a candidate character set, a hierarchical search and comparison algorithm for large-scale text data is designed by combining an inverted index with a tree data structure, and illegal characters in the warning information text are quickly located and judged through semantic comparison with the white list. The advantages are as follows: the traditional reverse search for illegal characters (words) is replaced with a forward intelligent perception algorithm for legal characters (words), which can achieve 100% detection of illegal characters (words). With coupled white list construction and real-time update, reliance on manual work can be gradually reduced as the early-warning release system continues to be used.
Description
Technical field
The present invention relates to the field of information technology, and specifically to a warning information legitimacy detection method and detection system based on intelligent semantic perception.
Background art
To meet the practical needs of public safety and national security, emergency early-warning release platforms have now been established from the national level down to each province and city. Emergency events in areas such as meteorology and land resources are pushed to the public as text. Before pushing, the warning information usually has to be filtered with a legitimacy detection technique to prevent errors, terror-related content, and other illegal characters that may appear. The prior art usually relies on blacklist filtering: a blacklist is first built from manually collected known illegal characters, and each piece of warning information to be released is then matched against every character in the blacklist; if a match succeeds, the warning text is considered to contain illegal words. This technique has two drawbacks: (1) building the blacklist manually consumes a great deal of manpower, material, and financial resources; (2) only illegal characters entered in advance can be filtered and intercepted, while undefined or unforeseen words, such as "corpse", cannot be effectively detected and intercepted.
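The second drawback can be illustrated with a minimal sketch that contrasts blacklist filtering with the white-list approach the invention adopts. The word lists and the pre-tokenized input are hypothetical and chosen only for illustration; they do not come from the patent.

```python
# Hypothetical word lists, for illustration only.
BLACKLIST = {"bomb", "attack"}
WHITELIST = {"heavy", "rain", "warning", "signal", "issued"}


def blacklist_flags(tokens):
    """Flag only tokens that were entered into the blacklist in advance."""
    return [t for t in tokens if t in BLACKLIST]


def whitelist_flags(tokens):
    """Flag every token that is not known to be legal."""
    return [t for t in tokens if t not in WHITELIST]


tokens = ["heavy", "rain", "corpse", "warning", "issued"]
print(blacklist_flags(tokens))  # the unforeseen word slips through
print(whitelist_flags(tokens))  # the unforeseen word is caught
```

The trade-off is that a white list flags every legal word it has not seen yet, which is why the later steps add semantic matching and auditor updates.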
Summary of the invention
The technical problem to be solved by the present invention is how to improve early-warning efficiency while saving cost.
The present invention solves the above technical problem through the following technical solution:
A warning information legitimacy detection method based on intelligent semantic perception, comprising:
Step S1: a deep-learning-based multi-standard word segmentation algorithm for vertical-domain early-warning text: using public data sets and vertical-domain data sets, a multi-standard word segmentation algorithm based on sequence deep learning is designed;
Step S2: a coupled white list construction and real-time update method: taking the existing library of legal warning information as the data basis, a white list of legal characters is built with the multi-standard word segmentation algorithm, auditors update the white list in real time according to actual detection results, and each character or word in the white list is given a semantic vector representation using word-vector embedding;
Step S3: an online illegal character matching algorithm: the multi-standard word segmentation algorithm is applied to the warning information to be released to obtain a candidate character set; by combining an inverted index with a tree data structure, a hierarchical search and comparison algorithm for large-scale text data is designed; illegal characters in the warning information text are quickly located and judged through semantic comparison with the white list.
Preferably, step S1 specifically comprises:
Step S11: character embedding representation: all characters are first encoded as vectors, and each character is then mapped to a semantic vector, so that the character embedding stage already has the ability to model the semantics of multi-character words spanning long distances;
Step S12: context modeling: the semantic vector representation of each character is obtained from step S11, and the forward and backward semantics are then modeled; a conditional random field is then used for probabilistic labeling to obtain the optimal segmentation sequence;
Step S13: joint modeling of data sets with different segmentation standards: the annotation scheme is treated as implicit supervision information and modeled jointly with step S12, i.e. the following processing is added on top of step S12: (1) all annotation types are numbered 0 to N, and when a training text is input, the annotation category of the current sentence is added; (2) the hidden vector representation of the input sentence obtained in step S12, besides serving as the input of the conditional random field, also serves as the input of a single classification neural network, whose supervision signal is the category of the segmentation annotation type to which the current input sentence belongs;
Step S14: unified end-to-end training: steps S11, S12, and S13 are unified in one multi-standard word segmentation model and trained end to end with the error backpropagation algorithm; after training, when the model is used, the warning information is fed directly into the multi-standard word segmentation model.
Preferably, in step S11, all characters are first one-hot encoded as vectors, and a stacked dilated convolutional neural network is then used to map each character to a semantic vector;
In step S12, stacked bidirectional long short-term memory (LSTM) units are used to model the forward and backward semantics simultaneously;
In step S13, item (2) specifically uses the hidden vector representation of the input sentence obtained by the stacked bidirectional LSTM units in step S12.
Preferably, step S3 specifically comprises:
Step S31: the trained multi-standard word segmentation model is applied to all existing training text sentences, and the segmentation results are collected into different text files by segmentation standard; for all files, a word-vector method is used to produce a compressed high-dimensional vector representation of the characters, i.e. each character in the white list is represented as a high-dimensional vector;
Step S32: for warning information to be released, the multi-standard word segmentation model is first used to segment it into a candidate character set, and each character or word in the set is then compared with the white list in two ways. The comparison method is: first, each character in the set is represented as a binary high-dimensional vector in the same way as in step S31; an inverted index and a tree data structure are then used for fast semantic matching; if the Hamming distances to all characters in the white list are greater than a preset threshold, the corresponding character in the current input warning information is considered illegal information.
The present invention also provides a warning information legitimacy detection system based on intelligent semantic perception, applied to the above method, comprising a multi-standard word segmentation algorithm module, a white list construction and real-time update module, and an online illegal character matching module;
The multi-standard word segmentation algorithm module uses public data sets and vertical-domain data sets to design a multi-standard word segmentation algorithm based on sequence deep learning;
The white list construction and real-time update module takes the existing library of legal warning information as the data basis and builds a white list of legal characters with the multi-standard word segmentation algorithm; auditors update the white list in real time according to actual detection results, and each character or word in the white list is given a semantic vector representation using word-vector embedding;
The online illegal character matching module applies the multi-standard word segmentation algorithm to the warning information to be released to obtain a candidate character set; by combining an inverted index with a tree data structure, a hierarchical search and comparison algorithm for large-scale text data is designed; illegal characters in the warning information text are quickly located and judged through semantic comparison with the white list.
Preferably,
The multi-standard word segmentation algorithm module first completes the character embedding representation: all characters are first encoded as vectors, and each character is then mapped to a semantic vector, so that the character embedding stage already has the ability to model the semantics of multi-character words spanning long distances;
Context modeling follows: based on the semantic vector representation of each character, the forward and backward semantics of the whole sentence are modeled; a conditional random field is then used for probabilistic labeling to obtain the optimal segmentation sequence;
Data sets with different segmentation standards are then modeled jointly: the annotation scheme is treated as implicit supervision information and modeled jointly with the model, with the following processing: (1) all annotation types are numbered 0 to N, and when a training text is input, the annotation category of the current sentence is added; (2) the hidden vector representation of the input sentence obtained in context modeling, besides serving as the input of the conditional random field, also serves as the input of a single classification neural network, whose supervision signal is the category of the segmentation annotation type to which the current input sentence belongs;
Finally, unified end-to-end training is carried out: the multi-standard word segmentation algorithm module, the white list construction and real-time update module, and the online illegal character matching module are unified in one multi-standard word segmentation model and trained end to end with the error backpropagation algorithm; after training, when the model is used, the warning information is fed directly into the multi-standard word segmentation model.
Preferably, the multi-standard word segmentation algorithm module uses stacked bidirectional long short-term memory (LSTM) units to model the forward and backward semantics simultaneously, and uses the hidden vector representation of the input sentence obtained by the stacked bidirectional LSTM units.
Preferably, the online illegal character matching module applies the trained multi-standard word segmentation model to all existing training text sentences and collects the segmentation results into different text files by segmentation standard; for all files, a word-vector method is used to produce a compressed high-dimensional vector representation of the characters, i.e. each character in the white list is represented as a high-dimensional vector;
For warning information to be released, the multi-standard word segmentation model is first used to segment it into a candidate character set, and each character or word in the set is then compared with the white list in two ways. The comparison method is: first, each character in the set is represented as a binary high-dimensional vector in the same way as in step S31; an inverted index and a tree data structure are then used for fast semantic matching; if the Hamming distances to all characters in the white list are greater than a preset threshold, the corresponding character in the current input warning information is considered illegal information.
The advantages of the present invention are as follows: the traditional reverse search for illegal characters (words) is replaced with a forward intelligent perception algorithm for legal characters (words), which can achieve 100% detection of illegal characters (words). With coupled white list construction and real-time update, reliance on manual work can be gradually reduced as the early-warning release system continues to be used, so that emergency warning information truly plays its important role as a "line of defense".
Brief description of the drawings
Fig. 1 is a flow diagram of the specific implementation of Embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of the multi-standard word segmentation model for warning information based on sequence deep learning in Embodiment 1 of the present invention;
Fig. 3 is a schematic diagram of multi-standard word segmentation of warning information in Embodiment 1 of the present invention.
Detailed description of the embodiments
To give a better understanding of the structural features of the present invention and the effects it achieves, preferred embodiments are described in detail below with reference to the drawings:
Embodiment 1
As shown in Fig. 1, Fig. 2, and Fig. 3, a warning information legitimacy detection method based on intelligent semantic perception is provided. The flow of the specific implementation is shown in Fig. 1, in which three major modules correspond to the three steps described above: step S1: a deep-learning-based multi-standard word segmentation algorithm for vertical-domain early-warning text; step S2: coupled white list construction and real-time update; step S3: an online fast matching algorithm for illegal characters.
I. Step S1 is implemented as follows:
In natural language processing, and especially in Chinese language processing, corpora are often scarce and valuable. This holds for Chinese word segmentation in particular. Building a practical system requires not only an efficient algorithm but also a large-scale corpus. The existing legal early-warning text data sets contain a limited number of characters and words, whereas the illegal sensitive words that may appear are vast in variety and cannot be predicted. At the same time, a large number of open-source Chinese word segmentation data sets from outside the emergency-event domain exist and can make up for this shortage. In addition, because different data sources follow different segmentation annotation standards, the present invention simultaneously uses the existing legal early-warning text data sets and public open-domain Chinese word segmentation data sets, which have different annotation standards, to train a multi-standard Chinese word segmentation model for emergency-event early-warning text. The details are described below with reference to Fig. 2:
Step S11: character embedding representation. The purpose is to produce a semantic vector representation of each character in the training-set text, which is drawn from corpora of different sources. The present invention first one-hot encodes all characters (i.e. dictionary encoding) as vectors, and then uses a stacked dilated convolutional neural network to map each character to a semantic vector. Compared with the prior art, which embeds characters directly with a single hidden layer of neurons, the method used by the present invention gives the character embedding stage the ability to model the semantics of multi-character words spanning long distances, and can also address the problems of ambiguous characters and characters with multiple meanings.
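The following is a minimal pure-Python sketch of the idea in step S11: one-hot encoding followed by a stack of dilated 1-D convolutions. The kernel size (3), dilation rates (1, 2, 4), output width (4), ReLU activation, and random weights are all assumptions made for illustration; the patent does not specify them, and a real model would learn the weights by training.

```python
import random


def one_hot(index, size):
    """Return a one-hot vector of length `size` with a 1.0 at `index`."""
    return [1.0 if i == index else 0.0 for i in range(size)]


def dilated_conv1d(seq, kernel, dilation):
    """Apply a 1-D dilated convolution with zero padding ('same' length).

    seq:    list of T input vectors, each of length d_in
    kernel: list of K taps, each a d_out x d_in weight matrix
    Tap k reads position t + (k - K // 2) * dilation.
    """
    taps, d_out = len(kernel), len(kernel[0])
    out = []
    for t in range(len(seq)):
        y = [0.0] * d_out
        for k in range(taps):
            pos = t + (k - taps // 2) * dilation
            if 0 <= pos < len(seq):
                for o in range(d_out):
                    y[o] += sum(w * x for w, x in zip(kernel[k][o], seq[pos]))
        out.append(y)
    return out


def relu(seq):
    return [[max(0.0, v) for v in vec] for vec in seq]


def make_kernel(rng, taps, d_out, d_in):
    """Random stand-in weights; a trained model would learn these."""
    return [[[rng.uniform(-0.5, 0.5) for _ in range(d_in)]
             for _ in range(d_out)] for _ in range(taps)]


sentence = "暴雨预警信号"
vocab = {ch: i for i, ch in enumerate(sorted(set(sentence)))}
inputs = [one_hot(vocab[ch], len(vocab)) for ch in sentence]

rng = random.Random(0)
hidden, d_in = inputs, len(vocab)
for dilation in (1, 2, 4):  # growing dilation widens the context each layer sees
    hidden = relu(dilated_conv1d(hidden, make_kernel(rng, 3, 4, d_in), dilation))
    d_in = 4
embeddings = hidden  # one 4-dimensional vector per input character
```

With kernel size 3 and dilations 1, 2, and 4, each output vector depends on up to 15 neighboring characters, which is one way to read the "long-distance span" property the text describes.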
Step S12: context modeling. Step S11 yields a semantic vector representation of each character. So that the algorithm can fully understand the overall semantic context of the input text, the present invention uses stacked bidirectional long short-term memory (LSTM) units to model the forward and backward semantics simultaneously. A conditional random field is then used for probabilistic labeling to obtain the optimal segmentation sequence.
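The decoding half of step S12 can be sketched as Viterbi search over per-character tag scores. The sketch below assumes the common B/M/E/S tagging scheme (begin, middle, end, single), hard transition constraints in place of learned transition scores, and hand-written emission scores standing in for the bidirectional LSTM output. None of these details are stated in the patent.

```python
TAGS = ["B", "M", "E", "S"]
NEG_INF = float("-inf")
# Tag pairs that can legally follow each other in a B/M/E/S segmentation.
ALLOWED = {("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
           ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")}


def transition(prev, cur):
    """Hard constraint; a trained CRF would use learned scores instead."""
    return 0.0 if (prev, cur) in ALLOWED else NEG_INF


def viterbi(emissions):
    """Return the highest-scoring tag path for a list of {tag: score} dicts."""
    # A sentence can only start with B or S.
    score = {t: (emissions[0][t] if t in ("B", "S") else NEG_INF) for t in TAGS}
    backpointers = []
    for em in emissions[1:]:
        new_score, pointers = {}, {}
        for cur in TAGS:
            best_prev = max(TAGS, key=lambda p: score[p] + transition(p, cur))
            new_score[cur] = score[best_prev] + transition(best_prev, cur) + em[cur]
            pointers[cur] = best_prev
        score = new_score
        backpointers.append(pointers)
    # A sentence can only end with E or S.
    path = [max(("E", "S"), key=lambda t: score[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


def tags_to_words(chars, tags):
    """Cut the character sequence after every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


def favor(tag):
    """Hypothetical emission scores that prefer one tag."""
    return {t: (2.0 if t == tag else 0.0) for t in TAGS}


chars = "暴雨预警信号"
emissions = [favor(t) for t in "BEBEBE"]
best_tags = viterbi(emissions)
words = tags_to_words(chars, best_tags)
```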
Step S13: joint modeling of data sets with different segmentation standards. Data sets from different sources annotate the same sentence according to different segmentation standards (for example, "heavy rain warning signal" may be segmented as "heavy rain, warning, signal", "heavy rain, warning signal", or "heavy rain warning, signal" under different standards), so if step S12 is applied directly, the data sets cannot all be used together. To solve this problem, the present invention treats the annotation scheme as implicit supervision information and models it jointly with step S12, i.e. the following processing is added on top of step S12: (1) all annotation types are numbered 0 to N, and when a training text is input, the annotation category of the current sentence is additionally provided; (2) the hidden vector representation of the input sentence obtained by the stacked bidirectional LSTM units in step S12, besides serving as the input of the conditional random field, also serves as the input of a single classification neural network, whose supervision signal is the category of the segmentation annotation type to which the current input sentence belongs.
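A minimal sketch of the data preparation and joint objective described in step S13 follows. The standard names, the B/M/E/S tag scheme, and the form of the joint loss (segmentation loss plus a weighted cross-entropy on the annotation-standard classifier) are assumptions for illustration; the patent only states that the two signals are modeled jointly.

```python
import math


def words_to_tags(words):
    """Convert a segmented sentence into per-character B/M/E/S tags."""
    tags = []
    for w in words:
        tags.extend(["S"] if len(w) == 1 else ["B"] + ["M"] * (len(w) - 2) + ["E"])
    return tags


# The same sentence segmented under three hypothetical annotation standards.
corpora = {
    "standard_a": [["暴雨", "预警", "信号"]],
    "standard_b": [["暴雨", "预警信号"]],
    "standard_c": [["暴雨预警", "信号"]],
}
# (1) Number the annotation standards 0..N.
standard_ids = {name: i for i, name in enumerate(sorted(corpora))}


def build_examples(corpora):
    """Attach the annotation-standard id to every training sentence."""
    examples = []
    for name, sentences in corpora.items():
        for words in sentences:
            examples.append({
                "chars": list("".join(words)),
                "tags": words_to_tags(words),
                "standard_id": standard_ids[name],
            })
    return examples


def cross_entropy(logits, target):
    """Softmax cross-entropy for one example (numerically stable)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target]


def joint_loss(crf_nll, standard_logits, standard_id, weight=1.0):
    """(2) Segmentation loss plus the standard-classification loss."""
    return crf_nll + weight * cross_entropy(standard_logits, standard_id)


examples = build_examples(corpora)
```

Because the classifier shares the sentence representation with the conditional random field, minimizing the combined loss pushes that representation to encode which standard a sentence follows.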
Step S14: unified end-to-end training. To obtain the optimal model parameters, the present invention unifies steps S11, S12, and S13 in one model (see Fig. 2) and trains it end to end with the error backpropagation algorithm. After training, when the model is used, the warning information can be fed directly into the multi-standard word segmentation model. Fig. 3 is a schematic diagram of multi-standard word segmentation of warning information in this embodiment of the present invention.
II. Step S2 is implemented as follows:
Because the present invention detects the legitimacy of warning information through forward intelligent perception of legal characters, the white list is critical to the judgment of illegal information, and as the volume of warning text keeps growing, the white list must be updated quickly and accurately. To solve this problem, first, the present invention uses the trained segmentation model to segment the existing legal early-warning text data set and build a basic white list of legal characters and words; second, when the invention is in use, the user can modify and update the white list in real time according to the output of the information legitimacy detection system built in this project.
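The build-then-update cycle of step S2 can be sketched as a small class. The class name, method names, and the whitespace tokenizer standing in for the trained segmentation model are all hypothetical and used only for illustration.

```python
class WhiteList:
    """Legal-token white list built from known-good text and edited by auditors."""

    def __init__(self, segment):
        self.segment = segment  # callable: text -> list of tokens
        self.entries = set()

    def build(self, legal_texts):
        """Seed the list by segmenting the existing legal warning texts."""
        for text in legal_texts:
            self.entries.update(self.segment(text))

    def check(self, text):
        """Return the tokens of `text` that are not on the white list."""
        return [tok for tok in self.segment(text) if tok not in self.entries]

    def approve(self, token):
        """Auditor decision: a flagged token is in fact legal."""
        self.entries.add(token)

    def revoke(self, token):
        """Auditor decision: a token should no longer be treated as legal."""
        self.entries.discard(token)


# str.split stands in for the multi-standard segmentation model.
white_list = WhiteList(segment=str.split)
white_list.build(["heavy rain warning signal"])

flagged_before = white_list.check("heavy snow warning signal")
white_list.approve("snow")  # the auditor confirms the flagged word is legal
flagged_after = white_list.check("heavy snow warning signal")
```

Each auditor decision shrinks the set of future false alarms, which is how reliance on manual review can fall over time.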
III. Step S3 is implemented as follows:
After steps S1 and S2 are completed, the segmentation algorithm of the invention is applied to the warning information to be released to obtain a candidate character set. Intuitively, the candidate character set can simply be compared with the white list to judge the legitimacy of the warning information. However, direct brute-force search and matching cannot reflect the semantic information of characters and words, and the amount of legal warning information collected in advance is extremely limited and cannot cover all legal characters. As a result, the false alarm rate of the whole legitimacy detection system would be too high in initial use, seriously wasting manpower, material, and financial resources. To solve this problem, the present invention designs a fast semantic matching algorithm for the warning information domain, described as follows:
Step S31: the trained multi-standard word segmentation model is applied to all existing training text sentences, and the segmentation results are collected into different text files by segmentation standard. For all files, a word-vector method is used to produce a compressed high-dimensional vector representation of the characters, i.e. each character in the white list is represented as a high-dimensional vector. Unlike existing word-vector representations that produce real-valued features, the present invention adds a Sigmoid function on top of the traditional word vector so that the high-dimensional vector can be made binary, which facilitates large-scale fast matching and retrieval.
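A minimal sketch of the binarization in step S31 follows. The patent states only that a Sigmoid function is added so that the vector can be made binary; thresholding the sigmoid output at 0.5 and packing the bits into an integer are assumptions made here for illustration.

```python
import math


def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def binarize(vector, threshold=0.5):
    """Squash each real-valued feature with a sigmoid, then threshold to 0/1."""
    return [1 if sigmoid(v) >= threshold else 0 for v in vector]


def pack_bits(bits):
    """Pack a 0/1 list into one integer so comparisons become bit operations."""
    code = 0
    for b in bits:
        code = (code << 1) | b
    return code


real_vector = [2.0, -1.5, 0.3, -0.2]  # hypothetical word-vector features
bits = binarize(real_vector)
code = pack_bits(bits)
```

Once every white-list entry is a packed integer, the Hamming distance between two entries is the number of set bits in their XOR, which is what makes large-scale matching cheap.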
Step S32: for warning information to be released, the multi-standard word segmentation model is first used to segment it into a candidate character set, and each character or word in the set is then compared with the white list in two ways. Mode one: direct brute-force matching. Mode two: each character in the set is first represented as a binary high-dimensional vector in the same way as in step S31, and an inverted index and a tree data structure are then used for fast semantic matching; if the Hamming distances to all characters in the white list are greater than a preset threshold, the corresponding character in the current input warning information is considered illegal information (in actual use, the auditor can be reminded automatically).
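The two comparison modes of step S32 can be sketched as follows. The inverted index here is built over fixed-width blocks of the binary code (a pigeonhole argument guarantees that any code within Hamming distance r of the query shares at least one identical block when r is smaller than the number of blocks). This particular indexing scheme, the class and function names, and the 8-bit example codes are assumptions for illustration; the tree data structure mentioned in the text is not modeled.

```python
from collections import defaultdict


def hamming(a, b):
    """Number of differing bits between two integer codes."""
    return bin(a ^ b).count("1")


class BinaryIndex:
    """Inverted index from (block position, block value) to white-list tokens."""

    def __init__(self, n_bits, n_blocks):
        assert n_bits % n_blocks == 0
        self.block_bits = n_bits // n_blocks
        self.n_blocks = n_blocks
        self.postings = defaultdict(set)
        self.codes = {}

    def _blocks(self, code):
        mask = (1 << self.block_bits) - 1
        return [(i, (code >> (i * self.block_bits)) & mask)
                for i in range(self.n_blocks)]

    def add(self, token, code):
        self.codes[token] = code
        for key in self._blocks(code):
            self.postings[key].add(token)

    def nearest_within(self, code, threshold):
        """Return (distance, token) of the closest entry within threshold, or None."""
        # The block lookup is exhaustive only when threshold < n_blocks.
        assert threshold < self.n_blocks
        candidates = set()
        for key in self._blocks(code):
            candidates |= self.postings.get(key, set())
        hits = [(hamming(code, self.codes[t]), t) for t in candidates]
        hits = [h for h in hits if h[0] <= threshold]
        return min(hits) if hits else None


def is_illegal(token, code, index, threshold):
    if token in index.codes:  # mode one: direct exact match
        return False
    # Mode two: illegal only if every white-list code is farther than threshold.
    return index.nearest_within(code, threshold) is None


index = BinaryIndex(n_bits=8, n_blocks=4)
index.add("rain", 0b10110010)
index.add("snow", 0b01001101)

near_result = is_illegal("storm", 0b10110011, index, threshold=1)  # 1 bit from "rain"
far_result = is_illegal("corpse", 0b11111111, index, threshold=1)  # 4 bits from both
```

Only entries that share a block with the query are ever scored, so the cost per query depends on the posting-list sizes rather than on the full white-list size.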
Embodiment 2
As shown in Fig. 1, a warning information legitimacy detection system based on intelligent semantic perception is provided, applied to the method described in Embodiment 1 and comprising a multi-standard word segmentation algorithm module, a white list construction and real-time update module, and an online illegal character matching module;
The multi-standard word segmentation algorithm module first completes the character embedding representation: all characters are first encoded as vectors, and dilated convolution is then used to map each character to a semantic vector, so that the character embedding stage already has the ability to model the local semantics of multi-character words spanning long distances. The complete semantic context of the input sentence is then modeled: based on the semantic vector representation of each character, stacked bidirectional long short-term memory (LSTM) units model the forward and backward semantics simultaneously; a conditional random field is then used for probabilistic labeling to obtain the optimal segmentation sequence;
Data sets with different segmentation standards are then modeled jointly: the annotation scheme is treated as implicit supervision information and modeled jointly with the model, with the following processing: (1) all annotation types are numbered 0 to N, and when a training text is input, the annotation category of the current sentence is added; (2) the hidden vector representation of the input sentence obtained by the stacked bidirectional LSTM units in context modeling, besides serving as the input of the conditional random field, also serves as the input of a multi-class classification neural network, whose supervision signal is the category of the segmentation annotation type to which the current input sentence belongs;
Finally, unified end-to-end training is carried out: the character embedding representation module, the bidirectional recurrent neural network module, the conditional random field module, and the module for modeling data sets with different annotation standards are unified in one multi-standard word segmentation model and trained end to end with the error backpropagation algorithm; after training, when the model is used, the warning information is fed directly into the multi-standard word segmentation model.
The white list construction and real-time update module takes the existing library of legal warning information as the data basis and builds a white list of legal characters with the multi-standard word segmentation algorithm; auditors update the white list in real time according to actual detection results, and each character or word in the white list is given a semantic vector representation using word-vector embedding;
The online illegal character matching module applies the multi-standard word segmentation algorithm to the warning information to be released to obtain a candidate character set; by combining an inverted index with a tree data structure, a hierarchical search and comparison algorithm for large-scale text data is designed; illegal characters in the warning information text are quickly located and judged through semantic comparison with the white list.
The online illegal character matching module applies the trained multi-standard word segmentation model to all existing training text sentences and collects the segmentation results into different text files by segmentation standard; for all files, a word-vector method is used to produce a compressed high-dimensional vector representation of the characters, i.e. each character in the white list is represented as a high-dimensional vector;
For warning information to be released, the multi-standard word segmentation model is first used to segment it into a candidate character set, and each character or word in the set is then compared with the white list in two ways. The comparison method is: first, each character in the set is represented as a binary high-dimensional vector using the word-vector method; an inverted index and a tree data structure are then used for fast semantic matching; if the Hamming distances to all characters in the white list are greater than a preset threshold, the corresponding character in the current input warning information is considered illegal information.
The basic principles, main features, and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited to the above embodiments; the embodiments and the description merely illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention. The scope of protection claimed is defined by the appended claims and their equivalents.
Claims (8)
1. A warning information legitimacy detection method based on intelligent semantic perception, characterized by comprising:
Step S1: a deep-learning-based multi-standard word segmentation algorithm for vertical-domain early-warning text: using public data sets and vertical-domain data sets, a multi-standard word segmentation algorithm based on sequence deep learning is designed;
Step S2: a coupled white list construction and real-time update method: taking the existing library of legal warning information as the data basis, a white list of legal characters is built with the multi-standard word segmentation algorithm, auditors update the white list in real time according to actual detection results, and each character or word in the white list is given a semantic vector representation using word-vector embedding;
Step S3: an online illegal character matching algorithm: the multi-standard word segmentation algorithm is applied to the warning information to be released to obtain a candidate character set; by combining an inverted index with a tree data structure, a hierarchical search and comparison algorithm for large-scale text data is designed; illegal characters in the warning information text are quickly located and judged through semantic comparison with the white list.
2. The warning information legitimacy detection method based on intelligent semantic perception according to claim 1, characterized in that step S1 specifically comprises:
Step S11: word embedding representation; all characters are first encoded as high-dimensional binary vectors, and dilated convolution is then applied to each character to perform character semantic vectorization based on the local semantic context, mapping each high-dimensional binary character vector to a low-dimensional real-valued vector;
Step S12: whole-sentence semantic context modeling; starting from the semantic vector representation of each character obtained in step S11, the forward and backward semantics of the complete Chinese sentence are modeled; a conditional random field is then used for probabilistic labeling to obtain the optimal segmentation sequence;
Step S13: joint modeling of datasets with different segmentation standards; the annotation standard is treated as implicit supervision information and modeled jointly with step S12, i.e., the following is carried out on the basis of step S12: (1) all annotation types are numbered from 0 to N, and when a training text is input, the annotation category to which the current sentence belongs is added; (2) the implicit vector representation of the input sentence obtained in step S12 is fed to the conditional random field and, at the same time, to a single classification neural network, whose supervision signal is the category of the segmentation annotation type of the current input sentence;
Step S14: unified end-to-end training; steps S11, S12 and S13 are unified in one multi-standard word segmentation model, which is trained end to end with the error back-propagation algorithm; once training is complete, the warning information is fed directly to the multi-standard word segmentation model as its input.
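The conditional random field labeling of step S12 can be illustrated with a minimal Viterbi decoding sketch. It assumes a B/M/E/S (begin, middle, end, single) tag set and hand-written emission scores; the tag set, the transition table and all function names are illustrative assumptions and are not specified in the claims. In a trained model the emission scores would come from the sentence encoder.

```python
TAGS = ("B", "M", "E", "S")
NEG = -1e9  # score used to forbid a transition (assumption of this sketch)

# Tag pairs that form a valid segmentation; every other pair is forbidden.
ALLOWED = {("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
           ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")}
TRANSITIONS = {(a, b): (0.0 if (a, b) in ALLOWED else NEG)
               for a in TAGS for b in TAGS}


def viterbi(emissions, transitions=TRANSITIONS):
    """Return the highest-scoring tag sequence for per-character emission scores."""
    # A sentence can only start with B or S.
    score = {t: (emissions[0][t] if t in ("B", "S") else NEG) for t in TAGS}
    back = []
    for em in emissions[1:]:
        new_score, pointers = {}, {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: score[p] + transitions[(p, cur)])
            new_score[cur] = score[prev] + transitions[(prev, cur)] + em[cur]
            pointers[cur] = prev
        score = new_score
        back.append(pointers)
    # A sentence can only end with E or S.
    last = max(("E", "S"), key=lambda t: score[t])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def tags_to_words(text, tags):
    """Cut the text after every E or S tag."""
    words, current = [], ""
    for ch, tag in zip(text, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    if current:
        words.append(current)
    return words


text = "暴雨预警"
emissions = [
    {"B": 2.0, "M": 0.0, "E": 0.0, "S": 1.0},
    {"B": 0.0, "M": 0.5, "E": 2.0, "S": 0.0},
    {"B": 2.0, "M": 0.0, "E": 0.0, "S": 1.0},
    {"B": 0.0, "M": 0.0, "E": 2.0, "S": 0.5},
]
tags = viterbi(emissions)
words = tags_to_words(text, tags)
```

With these hand-picked scores the decoder selects the tags B, E, B, E, which splits the sample text into two two-character words.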
3. The warning information legitimacy detection method based on intelligent semantic perception according to claim 2, characterized in that: in step S11, all characters are first one-hot encoded into vector form, and a stacked dilated convolutional neural network then performs the semantic vector mapping for each character;
In step S12, stacked bidirectional long short-term memory (LSTM) units model the forward and backward semantics simultaneously;
In step S13, item (2) specifically uses the implicit vector representation of the input sentence obtained with the stacked bidirectional LSTM units in step S12.
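The one-hot encoding and dilated convolution of step S11 can be sketched in plain Python as follows. This is a single untrained layer with a kernel of three taps and random weights, intended only to show how a dilation widens the local context that feeds each character vector; the kernel size, the output dimension, the sample text and every name below are assumptions, and a real model would stack several such layers and learn the weights.

```python
import random


def one_hot(index, size):
    """Return a high-dimensional binary vector with a single 1 at `index`."""
    vec = [0] * size
    vec[index] = 1
    return vec


def dilated_conv_embed(text, vocab, weights, dilation=1):
    """Map each character to a low-dimensional real vector with one dilated 1-D convolution.

    weights[k][j][v] is the weight of tap k (left, centre, right) for output
    dimension j and vocabulary entry v. Positions outside the text are zero-padded.
    """
    size = len(vocab)
    encoded = [one_hot(vocab[ch], size) for ch in text]
    dim = len(weights[0])
    outputs = []
    for t in range(len(text)):
        out = [0.0] * dim
        for k, offset in enumerate((-dilation, 0, dilation)):
            pos = t + offset
            if 0 <= pos < len(text):
                x = encoded[pos]
                for j in range(dim):
                    out[j] += sum(w * xi for w, xi in zip(weights[k][j], x))
        outputs.append(out)
    return outputs


def random_weights(vocab_size, dim, seed=0):
    """Untrained placeholder weights for the three taps of the kernel."""
    rng = random.Random(seed)
    return [[[rng.uniform(-1, 1) for _ in range(vocab_size)]
             for _ in range(dim)] for _ in range(3)]


text = "暴雨红色预警"
vocab = {ch: i for i, ch in enumerate(sorted(set(text)))}
weights = random_weights(len(vocab), dim=4)
vectors = dilated_conv_embed(text, vocab, weights, dilation=2)
```

With `dilation=2`, the vector of the first character depends on that character and on the character two positions to its right, so stacking layers with growing dilations lets each vector cover a longer span without enlarging the kernel.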
4. The warning information legitimacy detection method based on intelligent semantic perception according to any one of claims 1 to 3, characterized in that step S3 specifically comprises:
Step S31: the trained multi-standard word segmentation model is applied to all existing training text sentences, and the segmentation results are collected into separate text files according to segmentation standard; for all files, a word-vector method is used to obtain a compressed high-dimensional vector representation of the characters, i.e., each character in the white list is represented as a high-dimensional vector;
Step S32: for warning information to be released, the multi-standard word segmentation model is first used to segment the text and obtain a candidate character set, and each word in the set is then compared with the white list in two steps: first, each character in the character set is represented as a binary high-dimensional vector in the same manner as in step S31; then, an inverted index and a tree data structure are used to realize fast semantic matching; if the Hamming distances to all characters in the white list are greater than a preset threshold, the corresponding character in the currently input warning information is considered illegal information.
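The threshold test at the end of step S32 can be sketched as below. The 8-bit codes, the sample words and the threshold value are made up for illustration; the claims do not state the code length or the threshold, and the function names are assumptions.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length binary codes differ."""
    if len(a) != len(b):
        raise ValueError("binary codes must have the same length")
    return sum(x != y for x, y in zip(a, b))


def is_illegal(code, whitelist_codes, threshold):
    """A candidate is illegal when it is farther than `threshold` from every white list entry."""
    return all(hamming_distance(code, w) > threshold
               for w in whitelist_codes.values())


# Hypothetical binary codes for two white list words.
whitelist_codes = {
    "暴雨": (1, 0, 1, 1, 0, 0, 1, 0),
    "预警": (0, 1, 1, 0, 1, 0, 0, 1),
}
near = (1, 0, 1, 1, 0, 1, 1, 0)  # differs from the first entry in one bit
far = (0, 1, 0, 0, 1, 1, 0, 0)   # differs from both entries in at least three bits

near_is_illegal = is_illegal(near, whitelist_codes, threshold=2)
far_is_illegal = is_illegal(far, whitelist_codes, threshold=2)
```

Note that with an empty white list this sketch flags every candidate as illegal, which matches a strict allow-list policy but is a design choice of the sketch rather than something the claims state.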
5. A warning information legitimacy detection system based on intelligent semantic perception, characterized in that it is applied to the method of any one of claims 1 to 4 and comprises a multi-standard word segmentation algorithm module, a white list construction and real-time update module, and an online illegal character matching module;
The multi-standard word segmentation algorithm module uses public datasets and vertical-domain datasets to design a multi-standard word segmentation algorithm based on sequence deep learning;
The white list construction and real-time update module takes the existing library of legal warning information as its data basis and uses the multi-standard word segmentation algorithm to build a white list of legal characters, while auditors update the white list in real time according to actual detection results; each word in the white list is given a semantic vector representation using word embedding technology;
The online illegal character matching module segments the warning information to be released with the multi-standard word segmentation algorithm to obtain a candidate character set; combining an inverted index with a tree data structure, a hierarchical search and comparison algorithm for large-scale text data is designed, and illegal characters in the warning information text are quickly located and judged through semantic comparison with the white list.
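One possible reading of the "inverted index plus tree data structure" lookup used by the online matching module is sketched below: a trie answers exact white list hits, and a character-level inverted index narrows the white list to entries that share at least one character with the candidate before any vector comparison is made. The claims do not say how the two structures are combined, so this two-level arrangement and all names are assumptions.

```python
from collections import defaultdict


class Trie:
    """Prefix tree holding the white list words."""

    def __init__(self):
        self.children = {}
        self.is_word = False

    def insert(self, word):
        node = self
        for ch in word:
            node = node.children.setdefault(ch, Trie())
        node.is_word = True

    def contains(self, word):
        node = self
        for ch in word:
            node = node.children.get(ch)
            if node is None:
                return False
        return node.is_word


def build_inverted_index(words):
    """Map each character to the set of white list words that contain it."""
    index = defaultdict(set)
    for word in words:
        for ch in word:
            index[ch].add(word)
    return index


def shortlist(candidate, trie, index):
    """Return the white list entries worth comparing against the candidate."""
    if trie.contains(candidate):
        return {candidate}  # exact hit, no further comparison needed
    related = set()
    for ch in candidate:
        related |= index.get(ch, set())
    return related


whitelist = ["暴雨", "预警", "红色预警"]
trie = Trie()
for word in whitelist:
    trie.insert(word)
index = build_inverted_index(whitelist)

exact = shortlist("预警", trie, index)
partial = shortlist("预报", trie, index)
unknown = shortlist("abc", trie, index)
```

An empty shortlist means that no white list entry shares a character with the candidate, so it can be passed straight to the distance test without scanning the whole list.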
6. The warning information legitimacy detection system based on intelligent semantic perception according to claim 5, characterized in that:
The multi-standard word segmentation algorithm module first completes the word embedding representation: all characters are first encoded into vector form, and semantic vector mapping is then performed for each character, so that the character embedding stage already has a local semantic modeling ability that spans multiple characters over a long range, and the problems of polysemy and word ambiguity are resolved according to the local context;
Whole-sentence semantic context modeling follows; starting from the semantic vector representation of each character, the forward and backward semantics are modeled; a conditional random field is then used for probabilistic labeling to obtain the optimal segmentation sequence;
The datasets with different segmentation standards are then modeled jointly; the annotation standard is treated as implicit supervision information and modeled jointly with the model as follows: (1) all annotation types are numbered from 0 to N, and when a training text is input, the annotation category to which the current sentence belongs is added; (2) the implicit vector representation of the input sentence obtained in context modeling is fed to the conditional random field and, at the same time, to a single classification neural network, whose supervision signal is the category of the segmentation annotation type of the current input sentence;
Finally, unified end-to-end training is performed; the multi-standard word segmentation algorithm module, the white list construction and real-time update module, and the online illegal character matching module are unified in one multi-standard word segmentation model, which is trained end to end with the error back-propagation algorithm; once training is complete, the warning information is fed directly to the multi-standard word segmentation model as its input.
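The joint modeling of different segmentation standards can be pictured as adding an auxiliary classification loss to the segmentation loss. The sketch below numbers the annotation standards from 0 and combines a given segmentation loss with the cross-entropy of a classifier that predicts the standard of the current sentence. The additive form, the `weight` parameter, the placeholder standard names and all function names are assumptions; the claims only state that the two objectives are trained together end to end.

```python
import math


def number_criteria(criteria_names):
    """Assign each annotation standard an integer id starting from 0."""
    return {name: i for i, name in enumerate(sorted(set(criteria_names)))}


def softmax(logits):
    """Numerically stable softmax over a list of scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def criterion_loss(logits, criterion_id):
    """Cross-entropy of the auxiliary classifier that predicts the annotation standard."""
    return -math.log(softmax(logits)[criterion_id])


def joint_loss(segmentation_loss, logits, criterion_id, weight=1.0):
    """Segmentation loss plus the weighted classifier loss (the weighting is an assumption)."""
    return segmentation_loss + weight * criterion_loss(logits, criterion_id)


# Hypothetical standard names attached to three training sentences.
ids = number_criteria(["standard_b", "standard_a", "standard_b"])
loss = joint_loss(segmentation_loss=1.0, logits=[0.0, 0.0],
                  criterion_id=ids["standard_b"])
```

Because both terms are differentiable with respect to the shared sentence representation, a single back-propagation pass updates the encoder for both tasks.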
7. The warning information legitimacy detection system based on intelligent semantic perception according to claim 5, characterized in that: the multi-standard word segmentation algorithm module uses stacked bidirectional long short-term memory (LSTM) units to model the forward and backward semantics simultaneously, and uses the implicit vector representation of the input sentence obtained with the stacked bidirectional LSTM units.
8. The warning information legitimacy detection system based on intelligent semantic perception according to claim 5, characterized in that: the online illegal character matching module applies the trained multi-standard word segmentation model to all existing training text sentences and collects the segmentation results into separate text files according to segmentation standard; for all files, a word-vector method is used to obtain a compressed high-dimensional vector representation of the characters, i.e., each character in the white list is represented as a high-dimensional vector;
For warning information to be released, the multi-standard word segmentation model is first used to segment the text and obtain a candidate character set, and each word in the set is then compared with the white list in two steps: first, each character in the character set is represented as a binary high-dimensional vector using the word-vector method; then, an inverted index and a tree data structure are used to realize fast semantic matching; if the Hamming distances to all characters in the white list are greater than a preset threshold, the corresponding character in the currently input warning information is considered illegal information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811438885.8A CN109543764B (en) | 2018-11-28 | 2018-11-28 | Early warning information validity detection method and detection system based on intelligent semantic perception |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109543764A true CN109543764A (en) | 2019-03-29 |
CN109543764B CN109543764B (en) | 2023-06-16 |
Family
ID=65850988
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811438885.8A Active CN109543764B (en) | 2018-11-28 | 2018-11-28 | Early warning information validity detection method and detection system based on intelligent semantic perception |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109543764B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101650768A (en) * | 2009-07-10 | 2010-02-17 | 深圳市永达电子股份有限公司 | Security guarantee method and system for Windows terminals based on auto white list |
US20130185797A1 (en) * | 2010-08-18 | 2013-07-18 | Qizhi Software (Beijing) Company Limited | Whitelist-based inspection method for malicious process |
CN104008169A (en) * | 2014-05-30 | 2014-08-27 | 中国测绘科学研究院 | Semanteme based geographical label content safe checking method and device |
CN104965818A (en) * | 2015-05-25 | 2015-10-07 | 中国科学院信息工程研究所 | Project name entity identification method and system based on self-learning rules |
CN106506486A (en) * | 2016-11-03 | 2017-03-15 | 上海三零卫士信息安全有限公司 | A kind of intelligent industrial-control network information security monitoring method based on white list matrix |
CN106569998A (en) * | 2016-10-27 | 2017-04-19 | 浙江大学 | Text named entity recognition method based on Bi-LSTM, CNN and CRF |
CN107241352A (en) * | 2017-07-17 | 2017-10-10 | 浙江鹏信信息科技股份有限公司 | A kind of net security accident classificaiton and Forecasting Methodology and system |
CN107832634A (en) * | 2017-11-29 | 2018-03-23 | 江苏方天电力技术有限公司 | A kind of Dblink monitoring method and monitoring system |
CN109086267A (en) * | 2018-07-11 | 2018-12-25 | 南京邮电大学 | A kind of Chinese word cutting method based on deep learning |
Non-Patent Citations (6)
Title |
---|
NAIMA ZERARI ET AL.: "Bi-directional recurrent end-to-end neural network classifier for spoken Arab digit recognition", 《IEEE XPLORE》 * |
张仰森等: "基于双重注意力模型的微博情感分析方法", 《清华大学学报(自然科学版)》 * |
张淑静等: "基于Bi-LSTM-CRF算法的气象预警信息质控系统的实现", 《计算机与现代化》 * |
朱丹浩等: "基于深度学习的中文机构名识别研究――一种汉字级别的循环神经网络方法", 《现代图书情报技术》 * |
白静等: "基于注意力的BiLSTM-CNN中文微博立场检测模型", 《计算机应用与软件》 * |
郭璇等: "基于深度学习和公开来源信息的反恐情报挖掘", 《情报理论与实践》 * |
Cited By (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111061874A (en) * | 2019-12-10 | 2020-04-24 | 苏州思必驰信息科技有限公司 | Sensitive information detection method and device |
CN113297879A (en) * | 2020-02-23 | 2021-08-24 | 深圳中科飞测科技股份有限公司 | Acquisition method of measurement model group, measurement method and related equipment |
CN112115933A (en) * | 2020-08-25 | 2020-12-22 | 上海微亿智造科技有限公司 | Character recognition method, device and storage medium |
CN113095045A (en) * | 2021-04-20 | 2021-07-09 | 河海大学 | Chinese mathematics application problem data enhancement method based on reverse operation |
CN113095045B (en) * | 2021-04-20 | 2023-11-10 | 河海大学 | Chinese mathematic application question data enhancement method based on reverse operation |
CN113673219A (en) * | 2021-08-20 | 2021-11-19 | 合肥中科类脑智能技术有限公司 | Power failure plan text analysis method |
CN113590767A (en) * | 2021-09-28 | 2021-11-02 | 西安热工研究院有限公司 | Multilingual alarm information category judgment method, system, equipment and storage medium |
Also Published As
Publication number | Publication date |
---|---|
CN109543764B (en) | 2023-06-16 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109543764A (en) | A kind of warning information legitimacy detection method and detection system based on intelligent semantic perception | |
CN106055541B (en) | A kind of news content filtering sensitive words method and system | |
CN108897857B (en) | Chinese text subject sentence generating method facing field | |
CN108874878B (en) | Knowledge graph construction system and method | |
CN110674840B (en) | Multi-party evidence association model construction method and evidence chain extraction method and device | |
CN112733533B (en) | Multi-modal named entity recognition method based on BERT model and text-image relation propagation | |
CN108304372A (en) | Entity extraction method and apparatus, computer equipment and storage medium | |
CN104820629A (en) | Intelligent system and method for emergently processing public sentiment emergency | |
CN103179122A (en) | Telcom phone phishing-resistant method and system based on discrimination and identification content analysis | |
CN110008699B (en) | Software vulnerability detection method and device based on neural network | |
CN113449111B (en) | Social governance hot topic automatic identification method based on time-space semantic knowledge migration | |
CN108052504A (en) | Mathematics subjective item answers the structure analysis method and system of result | |
CN114282534A (en) | Meteorological disaster event aggregation method based on element information extraction | |
CN108763211A (en) | The automaticabstracting and system of knowledge are contained in fusion | |
Meng et al. | A deep learning approach for a source code detection model using self-attention | |
CN108280357A (en) | Data leakage prevention method, system based on semantic feature extraction | |
CN110750981A (en) | High-accuracy website sensitive word detection method based on machine learning | |
CN112818668B (en) | Meteorological disaster data semantic recognition analysis method and system | |
CN117275202A (en) | Omnibearing real-time intelligent early warning method and system for dangerous sources in important areas of cultural relics | |
CN114091462B (en) | Case fact mixed coding based criminal case risk mutual learning assessment method | |
CN115511280A (en) | Urban flood toughness evaluation method based on multi-mode data fusion | |
AU2020101024A4 (en) | Multi-language oriented general method for calculating place name semanteme similarity and use thereof | |
CN115048929A (en) | Sensitive text monitoring method and device | |
CN115270774A (en) | Big data keyword dictionary construction method for semi-supervised learning | |
CN114238738A (en) | Rumor detection method based on attention mechanism and bidirectional GRU |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||