CN109543764A - Warning information legitimacy detection method and detection system based on intelligent semantic perception

Publication number
CN109543764A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201811438885.8A
Other languages
Chinese (zh)
Other versions
CN109543764B (en
Inventor
苗开超
杨彬
年福东
张淑静
汪翔
李腾
吴丹娃
张亚力
程天奇
刘宜轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Public Meteorological Service Center
Anhui University
Original Assignee
Anhui Public Meteorological Service Center
Anhui University
Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/24: Classification techniques
    • G06F18/241: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411: Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00: Pattern recognition
    • G06F18/20: Analysing
    • G06F18/21: Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214: Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The present invention provides a warning information legitimacy detection method and detection system based on intelligent semantic perception, comprising: S1: a deep-learning-based multi-standard word segmentation algorithm for vertical-domain warning text; S2: a whitelist construction and real-time update method based on a coupled approach; S3: an online illegal character matching algorithm, in which the multi-standard word segmentation algorithm segments the warning information to be released to obtain a candidate character set, a hierarchical search and comparison algorithm for large-scale text data is designed by combining an inverted index with a tree data structure, and illegal characters in the warning text are quickly located and judged through semantic comparison with the whitelist. Advantages: the traditional reverse search for illegal characters (words) is replaced by forward intelligent perception of legal characters (words), so that a 100% detection rate for illegal characters (words) can be achieved. With whitelist construction and real-time updating based on the coupled approach, dependence on manual work can be gradually reduced as the warning release system continues to be used.

Description

A warning information legitimacy detection method and detection system based on intelligent semantic perception
Technical field
The present invention relates to the field of information technology, and specifically to a warning information legitimacy detection method and detection system based on intelligent semantic perception.
Background art
To meet the practical needs of public safety and national security, emergency warning release platforms have been established at the national, provincial, and municipal levels. Warnings of meteorological, land-related, and other emergencies are pushed to the public as text. Before a push, legitimacy detection is usually applied to filter the warning information and to keep out illegal characters, such as errors or terror-related content, that might otherwise appear. The prior art typically relies on blacklist filtering: a blacklist is first built from manually collected known illegal characters, each piece of warning information to be released is then matched against every entry in the blacklist, and if a match succeeds the warning text is considered to contain illegal words. This technique has two drawbacks: (1) building the blacklist manually consumes considerable manpower, material, and financial resources; (2) only illegal characters entered in advance can be filtered and intercepted, so undefined or unforeseen words, such as "corpse", cannot be effectively detected and intercepted.
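The second drawback can be illustrated with a minimal Python sketch of blacklist filtering. The blacklist entries, the sample messages, and the function name blacklist_hits are hypothetical and chosen only for illustration; plain substring matching is assumed.

```python
# Hypothetical blacklist; a real one would be collected manually.
BLACKLIST = {"terror", "massacre"}

def blacklist_hits(message, blacklist=BLACKLIST):
    """Return the blacklist entries that occur in the message (substring match)."""
    return sorted(term for term in blacklist if term in message)

# A listed word is caught, but an unforeseen word passes silently.
flagged = blacklist_hits("rainstorm terror warning")
missed = blacklist_hits("rainstorm corpse warning")
```

The second call returns no hits because "corpse" was never entered, which is the gap the whitelist approach below is meant to close.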
Summary of the invention
The technical problem to be solved by the present invention is how to improve early-warning efficiency while reducing cost.
The present invention solves the above technical problem through the following technical solution:
A warning information legitimacy detection method based on intelligent semantic perception, comprising:
Step S1: a deep-learning-based multi-standard word segmentation algorithm for vertical-domain warning text: using public datasets and a vertical-domain dataset, a multi-standard word segmentation algorithm based on sequence deep learning is designed;
Step S2: a whitelist construction and real-time update method based on a coupled approach: taking the existing library of legal warning information as the data basis, a whitelist of legal characters is constructed with the multi-standard word segmentation algorithm, while auditors update the whitelist in real time according to actual detection results; each word in the whitelist is represented as a semantic vector using word embedding;
Step S3: an online illegal character matching algorithm: the multi-standard word segmentation algorithm performs multi-standard segmentation of the warning information to be released to obtain a candidate character set; a hierarchical search and comparison algorithm for large-scale text data is designed by combining an inverted index with a tree data structure; and illegal characters in the warning text are quickly located and judged through semantic comparison with the whitelist.
Preferably, step S1 specifically comprises:
Step S11: character embedding representation; all characters are first encoded as vectors, and semantic vector mapping is then applied to each character, so that the character embedding stage already has the ability to model the semantics of multi-character words spanning long distances;
Step S12: context modeling; step S11 yields a semantic vector representation of each character, after which the forward and backward semantics are modeled; a conditional random field then performs probabilistic labeling to obtain the optimal segmentation sequence;
Step S13: joint modeling of datasets with different segmentation standards; the annotation scheme is treated as an implicit supervision signal and modeled jointly with step S12, that is, the following processing is added on top of step S12: (1) all annotation types are numbered 0 to N, and when training text is input, the annotation category of the current sentence is supplied as well; (2) the implicit vector representation of the input sentence obtained in step S12 is fed to the conditional random field and, at the same time, to a single classification neural network, whose supervision signal is the segmentation annotation type of the current input sentence;
Step S14: unified end-to-end training; steps S11, S12, and S13 are unified in one multi-standard word segmentation model, which is trained end to end with the error backpropagation algorithm; once training is complete, warning information is fed directly to the multi-standard word segmentation model as input.
Preferably, in step S11, all characters are first one-hot encoded as vectors, and stacked dilated convolutional neural networks are then used to map each character to a semantic vector;
In step S12, stacked bidirectional long short-term memory (LSTM) units model the forward and backward semantics simultaneously;
In step S13, item (2) specifically uses the implicit vector representation of the input sentence obtained by the stacked bidirectional LSTM units in step S12.
Preferably, step S3 specifically comprises:
Step S31: the trained multi-standard word segmentation model is applied to all existing training sentences, and the segmentation results are collected into separate text files by segmentation standard; for all files, a word vector method is used to produce a compressed high-dimensional vector representation of the characters, that is, each character in the whitelist is represented as a high-dimensional vector;
Step S32: for warning information to be released, the multi-standard word segmentation model first segments the text to obtain a candidate character set, and each word in the set is then compared with the whitelist in two ways. The comparison method is as follows: each character in the set is first represented as a binary high-dimensional vector in the same manner as in step S31; fast semantic matching is then performed using an inverted index and a tree data structure; if the Hamming distances to all characters in the whitelist are greater than a preset threshold, the corresponding character in the current input warning information is considered illegal.
The present invention also provides a warning information legitimacy detection system based on intelligent semantic perception, applied to the above method, comprising a multi-standard word segmentation algorithm module, a whitelist construction and real-time update module, and an online illegal character matching module;
The multi-standard word segmentation algorithm module uses public datasets and a vertical-domain dataset to design a multi-standard word segmentation algorithm based on sequence deep learning;
The whitelist construction and real-time update module takes the existing library of legal warning information as the data basis and constructs a whitelist of legal characters with the multi-standard word segmentation algorithm, while auditors update the whitelist in real time according to actual detection results; each word in the whitelist is represented as a semantic vector using word embedding;
The online illegal character matching module uses the multi-standard word segmentation algorithm to segment the warning information to be released and obtain a candidate character set; a hierarchical search and comparison algorithm for large-scale text data is designed by combining an inverted index with a tree data structure; and illegal characters in the warning text are quickly located and judged through semantic comparison with the whitelist.
Preferably,
The multi-standard word segmentation algorithm module first performs character embedding representation: all characters are encoded as vectors, and semantic vector mapping is then applied to each character, so that the character embedding stage already has the ability to model the semantics of multi-character words spanning long distances;
Context modeling follows: based on the semantic vector representation of each character, the forward and backward semantics of the whole sentence are modeled; a conditional random field then performs probabilistic labeling to obtain the optimal segmentation sequence;
Datasets with different segmentation standards are then modeled jointly: the annotation scheme is treated as an implicit supervision signal and modeled jointly with the model, with the following processing: (1) all annotation types are numbered 0 to N, and when training text is input, the annotation category of the current sentence is supplied as well; (2) the implicit vector representation of the input sentence obtained in context modeling is fed to the conditional random field and, at the same time, to a single classification neural network, whose supervision signal is the segmentation annotation type of the current input sentence;
Finally, unified end-to-end training is performed: the multi-standard word segmentation algorithm module, the whitelist construction and real-time update module, and the online illegal character matching module are unified in one multi-standard word segmentation model, which is trained end to end with the error backpropagation algorithm; once training is complete, warning information is fed directly to the multi-standard word segmentation model as input.
Preferably, the multi-standard word segmentation algorithm module uses stacked bidirectional long short-term memory (LSTM) units to model the forward and backward semantics simultaneously, and uses the implicit vector representation of the input sentence obtained by those stacked bidirectional LSTM units.
Preferably, the online illegal character matching module applies the trained multi-standard word segmentation model to all existing training sentences and collects the segmentation results into separate text files by segmentation standard; for all files, a word vector method is used to produce a compressed high-dimensional vector representation of the characters, that is, each character in the whitelist is represented as a high-dimensional vector;
For warning information to be released, the multi-standard word segmentation model first segments the text to obtain a candidate character set, and each word in the set is then compared with the whitelist in two ways. The comparison method is as follows: each character in the set is first represented as a binary high-dimensional vector in the same manner as in step S31; fast semantic matching is then performed using an inverted index and a tree data structure; if the Hamming distances to all characters in the whitelist are greater than a preset threshold, the corresponding character in the current input warning information is considered illegal.
The advantages of the present invention are as follows: the traditional reverse search for illegal characters (words) is replaced by forward intelligent perception of legal characters (words), so that a 100% detection rate for illegal characters (words) can be achieved. With whitelist construction and real-time updating based on the coupled approach, dependence on manual work can be gradually reduced as the warning release system continues to be used, so that emergency warning information can effectively play its important role as the "first line of defense".
Brief description of the drawings
Fig. 1 is a flow diagram of the specific implementation of Embodiment 1 of the present invention;
Fig. 2 is a schematic diagram of the multi-standard word segmentation model for warning information, based on sequence deep learning, in Embodiment 1 of the present invention;
Fig. 3 is a schematic diagram of multi-standard word segmentation of warning information in Embodiment 1 of the present invention.
Detailed description of the embodiments
For a better understanding of the structural features of the present invention and the effects it achieves, preferred embodiments are described in detail below with reference to the drawings:
Embodiment 1
As shown in Fig. 1, Fig. 2, and Fig. 3, in a warning information legitimacy detection method based on intelligent semantic perception, the flow of the specific implementation is shown in Fig. 1, and the three main modules correspond to the three steps above: step S1: a deep-learning-based multi-standard word segmentation algorithm for vertical-domain warning text; step S2: whitelist construction and real-time updating based on a coupled approach; step S3: an online fast matching algorithm for illegal characters.
1. Step S1 is implemented as follows:
In natural language processing, and in Chinese language processing in particular, corpora are often scarce and valuable, and Chinese word segmentation is no exception. A practical system needs not only an efficient algorithm but also a large-scale corpus. Existing legal warning text datasets contain a limited number of characters and words, whereas the illegal sensitive words that may appear are vast in variety and unpredictable. At the same time, a large number of open-source Chinese word segmentation datasets from outside the emergency domain can make up for this deficiency. Furthermore, because different data sources follow different segmentation annotation standards, the present invention uses both the existing legal warning text dataset and public open-domain Chinese word segmentation datasets, which follow different annotation standards, to train a multi-standard Chinese word segmentation model for emergency warning text. Details are given below with reference to Fig. 2:
Step S11: character embedding representation. The purpose is to produce a semantic vector representation for each character of the training corpora from different sources. The present invention first applies one-hot encoding (i.e. dictionary encoding) to represent all characters as vectors, and then uses stacked dilated convolutional neural networks to map each character to a semantic vector. Whereas the prior art embeds characters directly with a single hidden layer of neurons, the method adopted here gives the character embedding stage the ability to model the semantics of multi-character words spanning long distances, and can also address the problem of one character being ambiguous or having multiple meanings.
Step S12: context modeling.Can get by step S11 indicates the semantic vectorization of each word, is then Make algorithm that can completely understand input text entirety semantic context, the present invention using stacking two-way length memory unit in short-term Positive semantic and reversed semanteme is modeled simultaneously.Then probability mark is carried out using condition random field, acquires optimal participle Sequence results.
Step S13: difference participle standard data set joint modeling.Due to point of the data set to same a word separate sources Word labeled standards are different, and (such as " heavy rain pre-warning signal ", different participle standards can be segmented as " heavy rain, early warning, signal ", " heavy rain, pre- Alert signal " or " heavy rain early warning, signal "), if directly implementation steps S12 can not use all data aggregates.This is not solved to ask Topic, notation methods are combined modeling as a kind of implicit supervision message with step S12 by the present invention, i.e., on the basis of step S12 It is handled as follows: (1) 0-N number being carried out to all marking types, it is additional to increase current statement institute when inputting training text The mark classification information of category;(2) hidden for the two-way length read statement that memory unit obtains in short-term in step S12 using stacking It is indicated containing vector, as the input of a single Classification Neural, this point while it is inputted as condition random field The supervisory signals of class network are the classification that marking types are segmented locating for current read statement.
Step S14: unified end-to-end training.In order to obtain optimum model parameter, the present invention unites step S11, S12, S13 One (refers to Fig. 2) in a model, is trained end to end using error backpropagation algorithm.In model after the completion of training In use, can be directly using warning information as the input of multi-standard participle model, Fig. 3 is that the warning information of the embodiment of the present invention is more Standard segments schematic diagram.
2. Step S2 is implemented as follows:
Because the present invention detects the legitimacy of warning information through forward intelligent perception of legal characters, the whitelist is crucial for identifying illegal information, and as the body of warning text keeps growing, the whitelist must be updatable quickly and accurately. To this end, first, the present invention uses the trained word segmentation model to segment the existing legal warning text dataset and build a basic whitelist of legal words; second, when the invention is in use, users can modify and update the whitelist in real time according to the output of the information legitimacy detection system built in this project.
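The two phases of step S2 can be sketched as a small class. The class name, its methods, and the use of whitespace splitting as a stand-in for the trained segmentation model are all hypothetical.

```python
class Whitelist:
    """Set of legal words, built automatically and then maintained by auditors."""

    def __init__(self):
        self._terms = set()

    def build(self, legal_messages, segment):
        """Phase 1: segment every known-legal message and collect its words."""
        for message in legal_messages:
            self._terms.update(segment(message))

    def approve(self, term):
        """Phase 2: an auditor marks a flagged word as legal."""
        self._terms.add(term)

    def revoke(self, term):
        """Phase 2: an auditor removes a word that should not be legal."""
        self._terms.discard(term)

    def __contains__(self, term):
        return term in self._terms

    def unknown_terms(self, message, segment):
        """Words of the message that are not yet in the whitelist."""
        return [t for t in segment(message) if t not in self._terms]

# str.split stands in for the trained multi-standard segmentation model.
whitelist = Whitelist()
whitelist.build(["rainstorm warning signal", "typhoon warning"], str.split)
before = whitelist.unknown_terms("blizzard warning", str.split)
whitelist.approve("blizzard")  # auditor confirms a false alarm
after = whitelist.unknown_terms("blizzard warning", str.split)
```

Each approval removes a future false alarm, which is how manual effort can shrink as the system stays in use.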
3. Step S3 is implemented as follows:
After steps S1 and S2 are complete, the invented segmentation algorithm performs multi-standard word segmentation on the warning information to be released to obtain a candidate character set. Intuitively, the legitimacy of the warning information can be judged by comparing the candidate character set with the whitelist. However, direct brute-force search and matching cannot reflect the semantic information of characters and words, and the amount of legal warning information collected in advance is very limited and cannot cover all legal characters. As a result, the false alarm rate of the whole legitimacy detection system would be too high in initial use, seriously wasting manpower, material, and financial resources. To solve this problem, the present invention designs a fast semantic matching algorithm for the warning information domain, described as follows:
Step S31: the trained multi-standard word segmentation model is applied to all existing training sentences, and the segmentation results are collected into separate text files by segmentation standard. For all files, a word vector method is used to produce a compressed high-dimensional vector representation of the characters, that is, each character in the whitelist is represented as a high-dimensional vector. Unlike existing word vector representations, which produce real-valued features, the present invention adds a Sigmoid function on top of the conventional word vector so that the high-dimensional vector becomes a binary vector, which facilitates large-scale fast matching and retrieval.
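The binarization in step S31 can be sketched as follows. The text only says that a Sigmoid function is added to obtain binary vectors; thresholding the Sigmoid output at 0.5 and packing the bits into an integer are assumptions made for this illustration, as are the function names.

```python
import math

def sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    z = math.exp(x)
    return z / (1.0 + z)

def binarize(vector, threshold=0.5):
    """Squash each real feature with a sigmoid, threshold it, and pack the bits into an int."""
    code = 0
    for x in vector:
        code = (code << 1) | (1 if sigmoid(x) >= threshold else 0)
    return code

def hamming(a, b):
    """Number of bit positions in which two packed codes differ."""
    return bin(a ^ b).count("1")

code = binarize([2.0, -1.0, 0.3, -0.2])  # bits 1, 0, 1, 0
```

Packed codes make comparison cheap: one XOR and a bit count replace a floating-point distance over the whole vector.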
Step S32: for warning information to be released, the multi-standard word segmentation model first segments the text to obtain a candidate character set, and each word in the set is then compared with the whitelist in two ways. Way 1: direct brute-force matching. Way 2: each character in the set is first represented as a binary high-dimensional vector in the same manner as in step S31; fast semantic matching is then performed using an inverted index and a tree data structure; if the Hamming distances to all characters in the whitelist are greater than a preset threshold, the corresponding character in the current input warning information is considered illegal (in actual use, the auditor can be notified automatically).
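The two comparison ways of step S32 can be sketched as below. The text does not describe how the inverted index is organized, so this sketch swaps in one known technique as an assumption: the binary code is cut into equal chunks and each (position, chunk value) pair is indexed, so that any code within a small Hamming radius must share at least one chunk with the query. The tree data structure is omitted. The class name, code width, and sample codes are hypothetical.

```python
from collections import defaultdict

def hamming(a, b):
    return bin(a ^ b).count("1")

class HammingIndex:
    """Inverted index over fixed-width chunks of binary codes."""

    def __init__(self, bits=8, chunks=4):
        assert bits % chunks == 0
        self.chunks = chunks
        self.width = bits // chunks
        self.postings = defaultdict(set)  # (chunk position, chunk value) -> terms
        self.codes = {}                   # term -> full binary code

    def _split(self, code):
        mask = (1 << self.width) - 1
        return [(i, (code >> (i * self.width)) & mask) for i in range(self.chunks)]

    def add(self, term, code):
        self.codes[term] = code
        for key in self._split(code):
            self.postings[key].add(term)

    def within(self, code, radius):
        """Whitelist terms whose code differs from `code` in at most `radius` bits."""
        if radius >= self.chunks:
            candidates = self.codes.keys()  # pigeonhole bound fails: scan everything
        else:
            candidates = set()
            for key in self._split(code):
                candidates |= self.postings.get(key, set())
        return sorted(t for t in candidates if hamming(self.codes[t], code) <= radius)

def is_legal(term, code, index, radius):
    """Way 1: exact lookup. Way 2: semantic match on binary codes."""
    return term in index.codes or bool(index.within(code, radius))

index = HammingIndex(bits=8, chunks=4)
index.add("rainstorm", 0b10110010)
index.add("warning", 0b01001101)
```

The candidate lookup is exact whenever the radius is smaller than the number of chunks, because fewer differing bits than chunks leaves at least one chunk untouched; larger radii fall back to a full scan.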
Embodiment 2
As shown in Fig. 1, a warning information legitimacy detection system based on intelligent semantic perception, applied to the method described in Embodiment 1, comprises a multi-standard word segmentation algorithm module, a whitelist construction and real-time update module, and an online illegal character matching module;
The multi-standard word segmentation algorithm module first performs character embedding representation: all characters are encoded as vectors, and semantic vector mapping is then applied to each character using dilated convolution, so that the character embedding stage already has the ability to model the local semantics of multi-character words spanning long distances; the complete semantic context of the input sentence is then modeled: based on the semantic vector representation of each character, stacked bidirectional long short-term memory (LSTM) units model the forward and backward semantics simultaneously; a conditional random field then performs probabilistic labeling to obtain the optimal segmentation sequence;
Datasets with different segmentation standards are then modeled jointly: the annotation scheme is treated as an implicit supervision signal and modeled jointly with the model, with the following processing: (1) all annotation types are numbered 0 to N, and when training text is input, the annotation category of the current sentence is supplied as well; (2) the implicit vector representation of the input sentence obtained by the stacked bidirectional LSTM units in context modeling is fed to the conditional random field and, at the same time, to a multi-class classification neural network, whose supervision signal is the segmentation annotation type of the current input sentence; finally, unified end-to-end training is performed: the character embedding representation module, the bidirectional recurrent neural network module, the conditional random field module, and the module for modeling datasets with different annotation standards are unified in one multi-standard word segmentation model, which is trained end to end with the error backpropagation algorithm; once training is complete, warning information is fed directly to the multi-standard word segmentation model as input.
The whitelist construction and real-time update module takes the existing library of legal warning information as the data basis and constructs a whitelist of legal characters with the multi-standard word segmentation algorithm, while auditors update the whitelist in real time according to actual detection results; each word in the whitelist is represented as a semantic vector using word embedding;
The online illegal character matching module uses the multi-standard word segmentation algorithm to segment the warning information to be released and obtain a candidate character set; a hierarchical search and comparison algorithm for large-scale text data is designed by combining an inverted index with a tree data structure; and illegal characters in the warning text are quickly located and judged through semantic comparison with the whitelist.
The online illegal character matching module applies the trained multi-standard word segmentation model to all existing training sentences and collects the segmentation results into separate text files by segmentation standard; for all files, a word vector method is used to produce a compressed high-dimensional vector representation of the characters, that is, each character in the whitelist is represented as a high-dimensional vector;
For warning information to be released, the multi-standard word segmentation model first segments the text to obtain a candidate character set, and each word in the set is then compared with the whitelist in two ways. The comparison method is as follows: each character in the set is first represented as a binary high-dimensional vector using the word vector method; fast semantic matching is then performed using an inverted index and a tree data structure; if the Hamming distances to all characters in the whitelist are greater than a preset threshold, the corresponding character in the current input warning information is considered illegal.
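How the three modules of Embodiment 2 might be wired together can be sketched in a single function. Every interface here is hypothetical: segmenters stands for one callable per segmentation standard, whitelist maps each legal word to its binary code, encode maps a candidate word to a binary code, and the Hamming comparison is a plain linear scan rather than the indexed search described above. Flagging a word as soon as any one standard produces it is also an assumption.

```python
def detect_illegal_terms(message, segmenters, whitelist, encode, radius):
    """Return the candidate words of `message` that match no whitelist entry."""
    # Module 1: multi-standard segmentation yields the candidate set.
    candidates = []
    for segment in segmenters:
        for term in segment(message):
            if term not in candidates:
                candidates.append(term)

    illegal = []
    for term in candidates:
        # Module 2 supplies the whitelist; an exact hit is legal.
        if term in whitelist:
            continue
        # Module 3: a word is illegal only if every whitelist code is farther than `radius`.
        code = encode(term)
        if all(bin(code ^ w).count("1") > radius for w in whitelist.values()):
            illegal.append(term)
    return illegal

# Hypothetical 4-bit codes; str.split stands in for the trained segmenter.
WHITELIST = {"rainstorm": 0b1010, "warning": 0b0101}
CODES = {"downpour": 0b1011, "corpse": 0b0000}
result = detect_illegal_terms("downpour corpse warning", [str.split],
                              WHITELIST, CODES.get, radius=1)
```

In this toy run "downpour" is accepted because its code is one bit away from "rainstorm", while "corpse" is reported for the auditor to review.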
The basic principles, main features, and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited to the above embodiments; the embodiments and the description merely illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention. The scope of protection claimed is defined by the appended claims and their equivalents.

Claims (8)

1. A warning information legitimacy detection method based on intelligent semantic perception, characterized by comprising:
Step S1: a deep-learning-based multi-standard word segmentation algorithm for vertical-domain warning text: using public datasets and a vertical-domain dataset, a multi-standard word segmentation algorithm based on sequence deep learning is designed;
Step S2: a whitelist construction and real-time update method based on a coupled approach: taking the existing library of legal warning information as the data basis, a whitelist of legal characters is constructed with the multi-standard word segmentation algorithm, while auditors update the whitelist in real time according to actual detection results; each word in the whitelist is represented as a semantic vector using word embedding;
Step S3: an online illegal character matching algorithm: the multi-standard word segmentation algorithm performs multi-standard segmentation of the warning information to be released to obtain a candidate character set; a hierarchical search and comparison algorithm for large-scale text data is designed by combining an inverted index with a tree data structure; and illegal characters in the warning text are quickly located and judged through semantic comparison with the whitelist.
2. The warning information legitimacy detection method based on intelligent semantic perception according to claim 1, characterized in that step S1 specifically comprises:
Step S11: word embedding representation: first encode all characters as high-dimensional binary vectors, then apply dilated convolution to each character to perform a character semantic vectorization mapping based on local semantic context, mapping each character's high-dimensional binary vector to a low-dimensional real-valued vector;
Step S12: whole-sentence semantic context modeling: take the semantic vector representation of each character obtained in step S11, then model the forward semantics and the backward semantics of the complete Chinese sentence; then perform probabilistic labeling with a conditional random field (CRF) to obtain the optimal segmentation sequence;
Step S13: joint modeling of datasets with different segmentation criteria: treat the annotation scheme as implicit supervision information and model it jointly with step S12, i.e., on the basis of step S12: (1) number all annotation types from 0 to N and, when a training text is input, add the annotation category to which the current sentence belongs; (2) the hidden vector representation of the input sentence obtained in step S12 is fed to the CRF and, at the same time, used as the input of a separate classification neural network, whose supervisory signal is the category of the segmentation annotation type of the current input sentence;
Step S14: unified end-to-end training: steps S11, S12, and S13 are unified in a single multi-criteria segmentation model, which is trained end to end with the error backpropagation algorithm; after training, the warning information is fed directly to the multi-criteria segmentation model as its input.
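The "optimal segmentation sequence" of step S12 is commonly obtained from a CRF by Viterbi decoding. The following is a minimal sketch under assumptions that the claims do not state: a four-tag B/M/E/S labeling scheme, hand-written emission scores standing in for the output of the sentence model, and a hypothetical transition table that only forbids illegal tag pairs. The names `viterbi`, `tags_to_words`, `TRANS`, and `peak` are illustrative only.

```python
from typing import Dict, List

TAGS = ["B", "M", "E", "S"]  # assumed scheme: Begin, Middle, End, Single
NEG = -1e9                   # score given to transitions the scheme forbids

# Hypothetical transition table: legal B/M/E/S transitions score 0.
ALLOWED = {("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
           ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")}
TRANS = {(a, b): (0.0 if (a, b) in ALLOWED else NEG)
         for a in TAGS for b in TAGS}


def viterbi(emissions: List[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring tag sequence for per-character scores."""
    if not emissions:
        return []
    # A sentence can only start with B or S.
    best = {t: (emissions[0][t] if t in ("B", "S") else NEG) for t in TAGS}
    back: List[Dict[str, str]] = []
    for em in emissions[1:]:
        new_best, pointers = {}, {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: best[p] + TRANS[(p, cur)])
            new_best[cur] = best[prev] + TRANS[(prev, cur)] + em[cur]
            pointers[cur] = prev
        best = new_best
        back.append(pointers)
    # A sentence can only end with E or S.
    path = [max(("E", "S"), key=lambda t: best[t])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


def tags_to_words(text: str, tags: List[str]) -> List[str]:
    """Cut the text after every character tagged E or S."""
    words, buf = [], ""
    for ch, tag in zip(text, tags):
        buf += ch
        if tag in ("E", "S"):
            words.append(buf)
            buf = ""
    if buf:
        words.append(buf)
    return words


def peak(tag: str) -> Dict[str, float]:
    """Toy emission scores that favor a single tag."""
    return {t: (2.0 if t == tag else 0.0) for t in TAGS}


scores = [peak("B"), peak("E"), peak("B"), peak("E")]
print(tags_to_words("暴雨预警", viterbi(scores)))  # ['暴雨', '预警']
```

In a trained model the emission scores would come from the sentence encoder and the transition scores would be learned; the decoding loop itself would stay the same.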
3. The warning information legitimacy detection method based on intelligent semantic perception according to claim 2, characterized in that: in step S11, all characters are first one-hot encoded into vector form, and stacked dilated convolutional neural networks are then used to perform the semantic vector mapping for each character;
In step S12, stacked bidirectional long short-term memory (LSTM) units are used to model the forward semantics and the backward semantics simultaneously;
In step S13, item (2) specifically uses the hidden vector representation of the input sentence obtained by the stacked bidirectional LSTM units in step S12.
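As an illustration of the one-hot encoding and stacked dilated convolution named in this claim, the sketch below uses plain Python lists. It is a simplification under stated assumptions: the embedding table and kernel weights are fixed toy values rather than learned parameters, the same odd-length kernel is applied to every feature dimension independently, and the function names (`one_hot`, `embed`, `dilated_conv1d`, `receptive_field`) are hypothetical.

```python
from typing import List


def one_hot(index: int, vocab_size: int) -> List[int]:
    """High-dimensional binary code: a single 1 at the character's index."""
    vec = [0] * vocab_size
    vec[index] = 1
    return vec


def embed(code: List[int], table: List[List[float]]) -> List[float]:
    """Multiplying a one-hot row vector by `table` selects one of its rows."""
    return list(table[code.index(1)])


def dilated_conv1d(seq: List[List[float]], kernel: List[float],
                   dilation: int) -> List[List[float]]:
    """Zero-padded 1-D dilated convolution; output length equals input length.

    Assumes an odd-length kernel, applied to each feature dimension separately.
    """
    n, half = len(seq), len(kernel) // 2
    dim = len(seq[0]) if seq else 0
    out = []
    for t in range(n):
        acc = [0.0] * dim
        for j, w in enumerate(kernel):
            src = t + (j - half) * dilation  # taps are `dilation` steps apart
            if 0 <= src < n:
                for d in range(dim):
                    acc[d] += w * seq[src][d]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: List[int]) -> int:
    """Number of input positions seen by one output of the stacked layers."""
    return 1 + sum((kernel_size - 1) * d for d in dilations)
```

With a kernel of size 3 and dilations 1, 2, and 4, `receptive_field(3, [1, 2, 4])` is 15, which shows how a few stacked layers let each character vector take a wide local context into account.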
4. The warning information legitimacy detection method based on intelligent semantic perception according to any one of claims 1 to 3, characterized in that step S3 specifically comprises:
Step S31: apply the trained multi-criteria segmentation model to all existing training text sentences, and collect the segmentation results into separate text files according to segmentation criterion; for all files, use a word-vector method to produce compressed high-dimensional vector representations of the characters, so that each character in the whitelist is represented as a high-dimensional vector;
Step S32: for warning information to be released, first segment it with the multi-criteria segmentation model to obtain a candidate character set, then compare each word in the set with the whitelist in two steps: first, each character in the character set is represented as a binary high-dimensional vector in the same way as in step S31; then fast semantic matching is performed using an inverted index and a tree data structure; if the Hamming distance between the character and every character in the whitelist is greater than a preset threshold, the corresponding character of the input warning information is deemed illegal.
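The decision rule of step S32 can be sketched as follows. This is a simplified illustration, not the claimed implementation: binary vectors are stored as Python integers, the inverted index and tree structure are replaced by a plain linear scan, and the function names, codes, and threshold are made-up values for the example.

```python
from typing import Dict, List


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which two binary codes differ."""
    return bin(a ^ b).count("1")


def find_illegal(candidates: Dict[str, int], whitelist: Dict[str, int],
                 threshold: int) -> List[str]:
    """Return candidates whose distance to every whitelist code exceeds threshold."""
    illegal = []
    for token, code in candidates.items():
        nearest = min((hamming(code, w) for w in whitelist.values()),
                      default=None)
        # With an empty whitelist nothing can be matched, so the token is flagged.
        if nearest is None or nearest > threshold:
            illegal.append(token)
    return illegal


# Toy 4-bit codes, assumed for illustration only.
whitelist = {"暴雨": 0b1011, "预警": 0b1100}
candidates = {"暴雨": 0b1011, "暴风": 0b1010, "中奖": 0b0101}
print(find_illegal(candidates, whitelist, threshold=1))  # ['中奖']
```

A token is accepted as soon as one whitelist entry lies within the threshold, so the threshold controls how much variation from known legal vocabulary is tolerated.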
5. A warning information legitimacy detection system based on intelligent semantic perception, characterized in that: it is applied to the method of any one of claims 1 to 4 and comprises a multi-criteria segmentation algorithm module, a whitelist construction and real-time update module, and an online illegal-character matching module;
The multi-criteria segmentation algorithm module uses public datasets together with vertical-domain datasets to design a multi-criteria segmentation algorithm based on sequence deep learning;
The whitelist construction and real-time update module takes the existing library of legitimate warning information as the data basis and builds a whitelist of legal characters with the multi-criteria segmentation algorithm, while auditors update the whitelist in real time according to actual detection results; each word in the whitelist is given a semantic vector representation using word-embedding technology;
The online illegal-character matching module segments the warning information to be released under multiple criteria with the multi-criteria segmentation algorithm to obtain a candidate character set and, combining an inverted index with a tree data structure, designs a hierarchical search and comparison algorithm for large-scale text data, so that illegal characters in the warning text are quickly located and judged through semantic comparison with the whitelist.
6. The warning information legitimacy detection system based on intelligent semantic perception according to claim 5, characterized in that:
The multi-criteria segmentation algorithm module first completes the word embedding representation: all characters are first encoded into vector form, and a semantic vector mapping is then performed for each character, so that the character embedding stage already has the ability to model local semantics across long-range, multi-character spans and resolves polysemy and word ambiguity from the local context;
It then performs whole-sentence semantic context modeling: using the semantic vector representation of each character, the forward semantics and the backward semantics are modeled; probabilistic labeling is then performed with a conditional random field (CRF) to obtain the optimal segmentation sequence;
It then jointly models datasets with different segmentation criteria: the annotation scheme is treated as implicit supervision information and modeled jointly with the model, as follows: (1) number all annotation types from 0 to N and, when a training text is input, add the annotation category to which the current sentence belongs; (2) the hidden vector representation of the input sentence obtained in context modeling is fed to the CRF and, at the same time, used as the input of a separate classification neural network, whose supervisory signal is the category of the segmentation annotation type of the current input sentence;
Finally, it performs unified end-to-end training: the multi-criteria segmentation algorithm module, the whitelist construction and real-time update module, and the online illegal-character matching module are unified in a single multi-criteria segmentation model, which is trained end to end with the error backpropagation algorithm; after training, the warning information is fed directly to the multi-criteria segmentation model as its input.
7. The warning information legitimacy detection system based on intelligent semantic perception according to claim 5, characterized in that: the multi-criteria segmentation algorithm module uses stacked bidirectional long short-term memory (LSTM) units to model the forward semantics and the backward semantics simultaneously, and uses the hidden vector representation of the input sentence obtained by the stacked bidirectional LSTM units.
8. The warning information legitimacy detection system based on intelligent semantic perception according to claim 5, characterized in that: the online illegal-character matching module applies the trained multi-criteria segmentation model to all existing training text sentences and collects the segmentation results into separate text files according to segmentation criterion; for all files, a word-vector method is used to produce compressed high-dimensional vector representations of the characters, i.e., each character in the whitelist is represented as a high-dimensional vector;
For warning information to be released, it is first segmented with the multi-criteria segmentation model to obtain a candidate character set, and each word in the set is then compared with the whitelist in two steps: first, each character in the character set is represented as a binary high-dimensional vector using the word-vector method; then fast semantic matching is performed using an inverted index and a tree data structure; if the Hamming distance between the character and every character in the whitelist is greater than a preset threshold, the corresponding character of the input warning information is deemed illegal.
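The claims do not specify how the inverted index is laid out. One possible arrangement, shown here purely as an assumption, splits each binary code into `threshold + 1` chunks and indexes whitelist entries by chunk value: if two codes differ in at most `threshold` bits, at least one chunk must be identical, so only entries sharing a chunk need a full Hamming comparison. The tree data structure mentioned in the claims is not modeled, and all names below are hypothetical.

```python
from collections import defaultdict
from typing import DefaultDict, Dict, List, Set, Tuple

Index = DefaultDict[Tuple[int, int], Set[str]]


def split_chunks(code: int, n_bits: int, n_chunks: int) -> List[int]:
    """Split an n_bits-wide code into n_chunks contiguous bit fields."""
    width = -(-n_bits // n_chunks)  # ceiling division so all bits are covered
    mask = (1 << width) - 1
    return [(code >> (i * width)) & mask for i in range(n_chunks)]


def build_index(whitelist: Dict[str, int], n_bits: int, threshold: int) -> Index:
    """Inverted index: (chunk position, chunk value) -> whitelist tokens."""
    index: Index = defaultdict(set)
    for token, code in whitelist.items():
        for pos, chunk in enumerate(split_chunks(code, n_bits, threshold + 1)):
            index[(pos, chunk)].add(token)
    return index


def is_legal(code: int, whitelist: Dict[str, int], index: Index,
             n_bits: int, threshold: int) -> bool:
    """True if some whitelist code lies within `threshold` bits of `code`."""
    seen: Set[str] = set()
    for pos, chunk in enumerate(split_chunks(code, n_bits, threshold + 1)):
        for token in index.get((pos, chunk), ()):
            if token in seen:
                continue
            seen.add(token)
            # Verify the shortlisted entry with the full Hamming distance.
            if bin(code ^ whitelist[token]).count("1") <= threshold:
                return True
    return False
```

The index must be built with the same `n_bits` and `threshold` that are later passed to `is_legal`; changing the threshold changes the chunk layout and requires rebuilding the index.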
CN201811438885.8A 2018-11-28 2018-11-28 Early warning information validity detection method and detection system based on intelligent semantic perception Active CN109543764B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811438885.8A CN109543764B (en) 2018-11-28 2018-11-28 Early warning information validity detection method and detection system based on intelligent semantic perception

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811438885.8A CN109543764B (en) 2018-11-28 2018-11-28 Early warning information validity detection method and detection system based on intelligent semantic perception

Publications (2)

Publication Number Publication Date
CN109543764A true CN109543764A (en) 2019-03-29
CN109543764B CN109543764B (en) 2023-06-16

Family

ID=65850988

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811438885.8A Active CN109543764B (en) 2018-11-28 2018-11-28 Early warning information validity detection method and detection system based on intelligent semantic perception

Country Status (1)

Country Link
CN (1) CN109543764B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101650768A (en) * 2009-07-10 2010-02-17 深圳市永达电子股份有限公司 Security guarantee method and system for Windows terminals based on auto white list
US20130185797A1 (en) * 2010-08-18 2013-07-18 Qizhi Software (Beijing) Company Limited Whitelist-based inspection method for malicious process
CN104008169A (en) * 2014-05-30 2014-08-27 中国测绘科学研究院 Semanteme based geographical label content safe checking method and device
CN104965818A (en) * 2015-05-25 2015-10-07 中国科学院信息工程研究所 Project name entity identification method and system based on self-learning rules
CN106506486A (en) * 2016-11-03 2017-03-15 上海三零卫士信息安全有限公司 A kind of intelligent industrial-control network information security monitoring method based on white list matrix
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN107241352A (en) * 2017-07-17 2017-10-10 浙江鹏信信息科技股份有限公司 A kind of net security accident classificaiton and Forecasting Methodology and system
CN107832634A (en) * 2017-11-29 2018-03-23 江苏方天电力技术有限公司 A kind of Dblink monitoring method and monitoring system
CN109086267A (en) * 2018-07-11 2018-12-25 南京邮电大学 A kind of Chinese word cutting method based on deep learning

Non-Patent Citations (6)

* Cited by examiner, † Cited by third party
Title
NAIMA ZERARI ET AL.: "Bi-directional recurrent end-to-end neural network classifier for spoken Arab digit recognition", 《IEEE XPLORE》 *
张仰森等: "基于双重注意力模型的微博情感分析方法", 《清华大学学报(自然科学版)》 *
张淑静等: "基于Bi-LSTM-CRF算法的气象预警信息质控系统的实现", 《计算机与现代化》 *
朱丹浩等: "基于深度学习的中文机构名识别研究――一种汉字级别的循环神经网络方法", 《现代图书情报技术》 *
白静等: "基于注意力的BiLSTM-CNN中文微博立场检测模型", 《计算机应用与软件》 *
郭璇等: "基于深度学习和公开来源信息的反恐情报挖掘", 《情报理论与实践》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111061874A (en) * 2019-12-10 2020-04-24 苏州思必驰信息科技有限公司 Sensitive information detection method and device
CN113297879A (en) * 2020-02-23 2021-08-24 深圳中科飞测科技股份有限公司 Acquisition method of measurement model group, measurement method and related equipment
CN112115933A (en) * 2020-08-25 2020-12-22 上海微亿智造科技有限公司 Character recognition method, device and storage medium
CN113095045A (en) * 2021-04-20 2021-07-09 河海大学 Chinese mathematics application problem data enhancement method based on reverse operation
CN113095045B (en) * 2021-04-20 2023-11-10 河海大学 Chinese mathematic application question data enhancement method based on reverse operation
CN113673219A (en) * 2021-08-20 2021-11-19 合肥中科类脑智能技术有限公司 Power failure plan text analysis method
CN113590767A (en) * 2021-09-28 2021-11-02 西安热工研究院有限公司 Multilingual alarm information category judgment method, system, equipment and storage medium

Also Published As

Publication number Publication date
CN109543764B (en) 2023-06-16

Similar Documents

Publication Publication Date Title
CN109543764A (en) A kind of warning information legitimacy detection method and detection system based on intelligent semantic perception
CN106055541B (en) A kind of news content filtering sensitive words method and system
CN108897857B (en) Chinese text subject sentence generating method facing field
CN108874878B (en) Knowledge graph construction system and method
CN110674840B (en) Multi-party evidence association model construction method and evidence chain extraction method and device
CN112733533B (en) Multi-modal named entity recognition method based on BERT model and text-image relation propagation
CN108304372A (en) Entity extraction method and apparatus, computer equipment and storage medium
CN104820629A (en) Intelligent system and method for emergently processing public sentiment emergency
CN103179122A (en) Telcom phone phishing-resistant method and system based on discrimination and identification content analysis
CN110008699B (en) Software vulnerability detection method and device based on neural network
CN113449111B (en) Social governance hot topic automatic identification method based on time-space semantic knowledge migration
CN108052504A (en) Mathematics subjective item answers the structure analysis method and system of result
CN114282534A (en) Meteorological disaster event aggregation method based on element information extraction
CN108763211A (en) The automaticabstracting and system of knowledge are contained in fusion
Meng et al. A deep learning approach for a source code detection model using self-attention
CN108280357A (en) Data leakage prevention method, system based on semantic feature extraction
CN110750981A (en) High-accuracy website sensitive word detection method based on machine learning
CN112818668B (en) Meteorological disaster data semantic recognition analysis method and system
CN117275202A (en) Omnibearing real-time intelligent early warning method and system for dangerous sources in important areas of cultural relics
CN114091462B (en) Case fact mixed coding based criminal case risk mutual learning assessment method
CN115511280A (en) Urban flood toughness evaluation method based on multi-mode data fusion
AU2020101024A4 (en) Multi-language oriented general method for calculating place name semanteme similarity and use thereof
CN115048929A (en) Sensitive text monitoring method and device
CN115270774A (en) Big data keyword dictionary construction method for semi-supervised learning
CN114238738A (en) Rumor detection method based on attention mechanism and bidirectional GRU

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant