CN109543764A - Early-warning information legitimacy detection method and detection system based on intelligent semantic perception

Info

Publication number: CN109543764A
Application number: CN201811438885.8A
Authority: CN
Inventors: 苗开超; 杨彬; 年福东; 张淑静; 汪翔; 李腾; 吴丹娃; 张亚力; 程天奇; 刘宜轩
Original assignee: Anhui Public Meteorological Service Center; Anhui University
Current assignee: Anhui Public Meteorological Service Center; Anhui University
Priority date: 2018-11-28
Filing date: 2018-11-28
Publication date: 2019-03-29
Anticipated expiration: 2038-11-28
Also published as: CN109543764B

Abstract

The present invention provides an early-warning information legitimacy detection method and detection system based on intelligent semantic perception, comprising: S1: a deep-learning-based multi-criteria word segmentation algorithm for early-warning text in the vertical domain; S2: a coupling-based method for constructing the white list and updating it in real time; S3: an online illegal character matching algorithm, in which the multi-criteria word segmentation algorithm is applied to the early-warning information to be released to obtain a candidate character set, a hierarchical search and comparison algorithm for large-scale text data is designed by combining an inverted index with a tree data structure, and illegal characters in the early-warning text are quickly located and judged through semantic comparison with the white list. The advantages are as follows: the traditional reverse search algorithm for illegal characters (words) is replaced with a forward intelligent perception algorithm for legal characters (words), which can achieve a 100% detection rate for illegal characters (words). With coupling-based white list construction and real-time update, dependence on manual work can be gradually reduced as the early-warning release system continues to be used.

Description

An early-warning information legitimacy detection method and detection system based on intelligent semantic perception

Technical field

The present invention relates to the field of information technology, and specifically to an early-warning information legitimacy detection method and detection system based on intelligent semantic perception.

Background art

In view of the practical needs of public safety and national security, emergency early-warning release platforms have now been established at every level from the national government down to provinces and cities. Early warnings for meteorological, land-resource and other emergencies are pushed to the public as text. Before release, legitimacy detection technology is usually needed to filter the early-warning information and prevent illegal characters, such as errors or terror-related content, from appearing. The prior art typically relies on a blacklist filtering algorithm: a blacklist is first built from manually collected known illegal characters, each piece of early-warning information to be released is then matched against every entry in the blacklist, and if a match succeeds the text is considered to contain illegal words. This technique has two disadvantages: (1) building the blacklist manually consumes a great deal of manpower, material and financial resources; (2) only illegal characters entered in advance can be filtered and intercepted, while undefined or unforeseen words, such as "corpse", cannot be effectively detected and intercepted.
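The second disadvantage can be illustrated with a minimal sketch. The word lists and function names below are hypothetical and chosen only for illustration, and the tokens are assumed to be segmented already; the sketch contrasts a blacklist filter with the forward (white list) check that the invention adopts.

```python
# Hypothetical word lists, for illustration only.
BLACKLIST = {"terror", "bomb"}
WHITELIST = {"heavy", "rain", "warning", "signal", "issued"}

def blacklist_flags(tokens):
    """Flag only tokens that were entered into the blacklist in advance."""
    return [t for t in tokens if t in BLACKLIST]

def whitelist_flags(tokens):
    """Flag every token that is not known to be legal."""
    return [t for t in tokens if t not in WHITELIST]

tokens = ["heavy", "rain", "corpse", "warning", "issued"]
print(blacklist_flags(tokens))  # [] : the unforeseen word slips through
print(whitelist_flags(tokens))  # ['corpse']
```

The blacklist misses "corpse" because nobody anticipated it, whereas the white list check flags it simply because it has never appeared in legal text.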

Summary of the invention

The technical problem to be solved by the present invention is how to improve early-warning efficiency while reducing cost.

The present invention solves the above technical problem through the following technical solution:

An early-warning information legitimacy detection method based on intelligent semantic perception, comprising:

Step S1: a deep-learning-based multi-criteria word segmentation algorithm for early-warning text in the vertical domain: a multi-criteria word segmentation algorithm based on sequence deep learning is designed using public datasets and a vertical-domain dataset;

Step S2: a coupling-based method for white list construction and real-time update: taking the existing library of legal early-warning information as the data basis, a white list of legal characters is constructed with the multi-criteria word segmentation algorithm, while auditors update the white list in real time according to actual detection results; each character or word in the white list is given a semantic vector representation using word-vector embedding;

Step S3: an online illegal character matching algorithm: the multi-criteria word segmentation algorithm is applied to the early-warning information to be released to obtain a candidate character set; a hierarchical search and comparison algorithm for large-scale text data is designed by combining an inverted index with a tree data structure; and illegal characters in the early-warning text are quickly located and judged through semantic comparison with the white list.

Preferably, step S1 specifically comprises:

Step S11: character embedding representation: all characters are first encoded as vectors, and each character is then mapped to a semantic vector, so that the model already has the ability to model the semantics of long-range, multi-character words at the character embedding stage;

Step S12: context modeling: the semantic vector representation of each character is obtained from step S11, and the forward and backward semantics are then modeled; a conditional random field is then used for probabilistic labeling to obtain the optimal segmentation sequence;

Step S13: joint modeling of datasets with different segmentation criteria: the annotation scheme is treated as an implicit supervision signal and modeled jointly with step S12, i.e. the following is done on the basis of step S12: (1) all annotation types are numbered 0 to N, and when a training text is input, the annotation category of the current sentence is added; (2) the hidden vector representation of the input sentence obtained in step S12 is fed to the conditional random field and, at the same time, to a single classification neural network whose supervision signal is the segmentation annotation category of the current input sentence;

Step S14: unified end-to-end training: steps S11, S12 and S13 are unified in one multi-criteria word segmentation model, which is trained end to end with the error backpropagation algorithm; once training is complete, the early-warning information is fed directly to the multi-criteria word segmentation model as input.

Preferably, in step S11, all characters are first one-hot encoded as vectors, and each character is then mapped to a semantic vector using stacked dilated convolutional neural networks;

In step S12, stacked bidirectional long short-term memory (LSTM) units are used to model the forward and backward semantics simultaneously;

Item (2) of step S13 specifically uses the hidden vector representation of the input sentence obtained by the stacked bidirectional LSTM units in step S12.

Preferably, step S3 specifically comprises:

Step S31: the trained multi-criteria word segmentation model is applied to all existing training sentences, and the segmentation results are collected into different text files according to segmentation criterion; for all files, a word-vector method is used to produce a compressed high-dimensional vector representation of each character, i.e. each character in the white list is represented as a high-dimensional vector;

Step S32: for early-warning information to be released, the multi-criteria word segmentation model is first used to segment it into a candidate character set, and each character or word in the set is then compared with the white list in two ways. The comparison method is: each character in the character set is first represented as a binary high-dimensional vector in the same way as in step S31; fast semantic matching is then performed using the inverted index and tree data structure; if the Hamming distances to all characters in the white list are greater than a preset threshold, the corresponding character of the current input early-warning information is considered illegal.

The present invention also provides an early-warning information legitimacy detection system based on intelligent semantic perception, applied to the above method, comprising a multi-criteria word segmentation algorithm module, a white list construction and real-time update module, and an online illegal character matching module;

The multi-criteria word segmentation algorithm module designs a multi-criteria word segmentation algorithm based on sequence deep learning using public datasets and a vertical-domain dataset;

The white list construction and real-time update module takes the existing library of legal early-warning information as the data basis and constructs a white list of legal characters with the multi-criteria word segmentation algorithm, while auditors update the white list in real time according to actual detection results; each character or word in the white list is given a semantic vector representation using word-vector embedding;

The online illegal character matching module applies the multi-criteria word segmentation algorithm to the early-warning information to be released to obtain a candidate character set, designs a hierarchical search and comparison algorithm for large-scale text data by combining an inverted index with a tree data structure, and quickly locates and judges illegal characters in the early-warning text through semantic comparison with the white list.

Preferably,

The multi-criteria word segmentation algorithm module first completes the character embedding representation: all characters are first encoded as vectors, and each character is then mapped to a semantic vector, so that the model already has the ability to model the semantics of long-range, multi-character words at the character embedding stage;

It then performs context modeling: starting from the semantic vector representation of each character, the forward and backward semantics of the whole sentence are modeled; a conditional random field is then used for probabilistic labeling to obtain the optimal segmentation sequence;

It then jointly models datasets with different segmentation criteria: the annotation scheme is treated as an implicit supervision signal and modeled jointly with the model, as follows: (1) all annotation types are numbered 0 to N, and when a training text is input, the annotation category of the current sentence is added; (2) the hidden vector representation of the input sentence obtained in context modeling is fed to the conditional random field and, at the same time, to a single classification neural network whose supervision signal is the segmentation annotation category of the current input sentence;

Finally, unified end-to-end training is performed: the multi-criteria word segmentation algorithm module, the white list construction and real-time update module, and the online illegal character matching module are unified in one multi-criteria word segmentation model, which is trained end to end with the error backpropagation algorithm; once training is complete, the early-warning information is fed directly to the multi-criteria word segmentation model as input.

Preferably, the multi-criteria word segmentation algorithm module uses stacked bidirectional long short-term memory (LSTM) units to model the forward and backward semantics simultaneously, and uses the hidden vector representation of the input sentence obtained by the stacked bidirectional LSTM units.

Preferably, the online illegal character matching module applies the trained multi-criteria word segmentation model to all existing training sentences and collects the segmentation results into different text files according to segmentation criterion; for all files, a word-vector method is used to produce a compressed high-dimensional vector representation of each character, i.e. each character in the white list is represented as a high-dimensional vector;

For early-warning information to be released, the multi-criteria word segmentation model is first used to segment it into a candidate character set, and each character or word in the set is then compared with the white list in two ways. The comparison method is: each character in the character set is first represented as a binary high-dimensional vector in the same way as in step S31; fast semantic matching is then performed using the inverted index and tree data structure; if the Hamming distances to all characters in the white list are greater than a preset threshold, the corresponding character of the current input early-warning information is considered illegal.

The advantages of the present invention are as follows: the traditional reverse search algorithm for illegal characters (words) is replaced with a forward intelligent perception algorithm for legal characters (words), which can achieve a 100% detection rate for illegal characters (words). With coupling-based white list construction and real-time update, dependence on manual work can be gradually reduced as the early-warning release system continues to be used, so that early-warning information can effectively play its important role as the "first line of defense" against emergencies.

Brief description of the drawings

Fig. 1 is a flow diagram of the specific implementation of Embodiment 1 of the present invention;

Fig. 2 is a schematic diagram of the sequence-deep-learning-based multi-criteria word segmentation model for early-warning information in Embodiment 1 of the present invention;

Fig. 3 is a schematic diagram of multi-criteria word segmentation of early-warning information in Embodiment 1 of the present invention.

Detailed description of the embodiments

To provide a better understanding of the structural features of the present invention and the effects achieved, the invention is described in detail below with reference to preferred embodiments and the accompanying drawings:

Embodiment 1

As shown in Fig. 1, Fig. 2 and Fig. 3, an early-warning information legitimacy detection method based on intelligent semantic perception is provided. The flow of the specific implementation is shown in Fig. 1, in which the three major modules correspond to the three aforementioned steps: step S1: a deep-learning-based multi-criteria word segmentation algorithm for early-warning text in the vertical domain; step S2: coupling-based white list construction and real-time update; step S3: an online fast matching algorithm for illegal characters.

One, step S1 is implemented as follows:

In natural language processing, and in Chinese language processing in particular, corpora are often scarce and valuable, and this holds for Chinese word segmentation specifically. Building a practical system requires not only an efficient algorithm but also a large-scale corpus. Existing legal early-warning text datasets contain a limited number of characters and words, whereas the illegal sensitive words that may appear are vast in variety and cannot be predicted. At the same time, a large number of open-source Chinese word segmentation datasets from outside the emergency domain can make up for this shortfall. In addition, because different data sources follow different segmentation annotation criteria, the present invention simultaneously uses the existing legal early-warning text dataset and public open-domain Chinese word segmentation datasets, annotated under different criteria, to train a multi-criteria Chinese word segmentation model for emergency early-warning text. It is described in detail below with reference to Fig. 2:

Step S11: character embedding representation. The purpose is to produce a semantic vector representation for each character in the training text, which is drawn from corpora of different sources. The present invention first one-hot encodes all characters (i.e. dictionary encoding) as vectors, and then maps each character to a semantic vector using stacked dilated convolutional neural networks. Compared with the prior art, which embeds characters directly with a single hidden layer of neurons, the method used by the present invention can already model the semantics of long-range, multi-character words at the character embedding stage, and can address character ambiguity and polysemy at the same time.
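Step S11 can be sketched in plain Python as follows. This is only an illustrative sketch, not the model of Fig. 2: the kernel size of 3, the dilation rates (1, 2, 4), the output dimension of 4, the ReLU activation and the random weights are all assumptions, and the string "暴雨预警信号" is an assumed Chinese rendering of a "rainstorm warning signal" example.

```python
import random

def one_hot(index, vocab_size):
    """Dictionary (one-hot) encoding of a single character."""
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def dilated_conv1d(seq, kernel, dilation):
    """seq: T vectors of size d_in; kernel: K matrices of shape d_in x d_out.
    Zero padding keeps the output length equal to the input length."""
    T, K = len(seq), len(kernel)
    d_out = len(kernel[0][0])
    center = K // 2
    out = []
    for t in range(T):
        acc = [0.0] * d_out
        for k in range(K):
            src = t + (k - center) * dilation  # position read by kernel tap k
            if 0 <= src < T:
                for i, x in enumerate(seq[src]):
                    for j in range(d_out):
                        acc[j] += x * kernel[k][i][j]
        out.append([max(0.0, a) for a in acc])  # ReLU (assumed activation)
    return out

def random_kernel(k, d_in, d_out, rng):
    """Untrained placeholder weights; a real model would learn these."""
    return [[[rng.uniform(-0.5, 0.5) for _ in range(d_out)]
             for _ in range(d_in)] for _ in range(k)]

text = "暴雨预警信号"  # assumed example sentence
vocab = {ch: i for i, ch in enumerate(dict.fromkeys(text))}
rng = random.Random(0)

x = [one_hot(vocab[ch], len(vocab)) for ch in text]
d_in, dim = len(vocab), 4
for dilation in (1, 2, 4):  # stacked layers with growing dilation
    x = dilated_conv1d(x, random_kernel(3, d_in, dim, rng), dilation)
    d_in = dim

print(len(x), len(x[0]))  # one low-dimensional vector per character
```

With kernel size 3 and dilations 1, 2 and 4, each output vector can draw on up to 15 neighbouring characters, which is how stacking dilated layers gives the embedding a long-range view without a large kernel.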

Step S12: context modeling. Step S11 yields the semantic vector representation of each character. To let the algorithm fully understand the overall semantic context of the input text, the present invention uses stacked bidirectional long short-term memory (LSTM) units to model the forward and backward semantics simultaneously. A conditional random field is then used for probabilistic labeling to obtain the optimal segmentation sequence.
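The final decoding stage of step S12 can be sketched as Viterbi decoding over per-character tag scores. The sketch below makes several assumptions that the text does not state: the B/M/E/S tagging scheme (begin, middle, end, single), hand-written emission scores standing in for the output of the bidirectional LSTM, and hard transition constraints instead of learned transition weights.

```python
TAGS = ["B", "M", "E", "S"]
NEG = -1e9  # score used to forbid a transition

# Transitions that keep the tag sequence well formed (assumed, not learned).
ALLOWED = {("B", "M"), ("B", "E"), ("M", "M"), ("M", "E"),
           ("E", "B"), ("E", "S"), ("S", "B"), ("S", "S")}
TRANS = {(a, b): (0.0 if (a, b) in ALLOWED else NEG)
         for a in TAGS for b in TAGS}

def viterbi(emissions):
    """emissions: one dict {tag: score} per character. Returns the best tags."""
    # A sentence can only start with B or S.
    best = [{t: emissions[0][t] + (0.0 if t in ("B", "S") else NEG)
             for t in TAGS}]
    back = []
    for em in emissions[1:]:
        scores, ptr = {}, {}
        for cur in TAGS:
            prev = max(TAGS, key=lambda p: best[-1][p] + TRANS[(p, cur)])
            scores[cur] = best[-1][prev] + TRANS[(prev, cur)] + em[cur]
            ptr[cur] = prev
        best.append(scores)
        back.append(ptr)
    last = max(("E", "S"), key=lambda t: best[-1][t])  # must end a word
    path = [last]
    for ptr in reversed(back):
        path.append(ptr[path[-1]])
    return path[::-1]

def tags_to_words(text, tags):
    """Cut the text after every E or S tag."""
    words, cur = [], ""
    for ch, tag in zip(text, tags):
        cur += ch
        if tag in ("E", "S"):
            words.append(cur)
            cur = ""
    if cur:
        words.append(cur)
    return words

def favour(tag):
    """Placeholder emission scores that prefer one tag."""
    return {t: (2.0 if t == tag else 0.0) for t in TAGS}

text = "暴雨预警信号"  # assumed example sentence
emissions = [favour(t) for t in "BEBEBE"]
tags = viterbi(emissions)
print(tags)
print(tags_to_words(text, tags))
```

In a trained model the emission scores would come from the network and the transition scores would be learned jointly; the dynamic programme itself stays the same.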

Step S13: joint modeling of datasets with different segmentation criteria. Datasets of different origins annotate the segmentation of the same sentence differently (for example, "rainstorm warning signal" may be segmented as "rainstorm / warning / signal", "rainstorm / warning signal" or "rainstorm warning / signal" under different criteria), so if step S12 were applied directly, the datasets could not all be used together. To solve this problem, the present invention treats the annotation scheme as an implicit supervision signal and models it jointly with step S12, i.e. the following is done on the basis of step S12: (1) all annotation types are numbered 0 to N, and when a training text is input, the annotation category of the current sentence is additionally supplied; (2) the hidden vector representation of the input sentence obtained by the stacked bidirectional LSTM units in step S12 is fed to the conditional random field and, at the same time, to a single classification neural network whose supervision signal is the segmentation annotation category of the current input sentence.
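The two items of step S13 can be sketched as follows. The text does not say how the segmentation objective and the criterion-classification objective are combined, so the weighted sum in `joint_loss`, the `weight` parameter and the sample data are assumptions made purely for illustration; "暴雨预警信号" is an assumed Chinese rendering of the "rainstorm warning signal" example.

```python
import math

# Item (1): each annotation criterion gets a number 0..N that travels
# with every training sentence. Three hypothetical criteria are shown.
SAMPLES = [
    {"criterion": 0, "words": ["暴雨", "预警", "信号"]},
    {"criterion": 1, "words": ["暴雨", "预警信号"]},
    {"criterion": 2, "words": ["暴雨预警", "信号"]},
]

def words_to_tags(words):
    """Convert a segmentation into B/M/E/S character tags (assumed scheme)."""
    tags = []
    for w in words:
        if len(w) == 1:
            tags.append("S")
        else:
            tags += ["B"] + ["M"] * (len(w) - 2) + ["E"]
    return tags

def cross_entropy(logits, target):
    """Softmax cross-entropy for the criterion classifier of item (2)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[target]

def joint_loss(seg_loss, criterion_logits, criterion_id, weight=1.0):
    """Assumed combination: segmentation loss plus weighted classifier loss."""
    return seg_loss + weight * cross_entropy(criterion_logits, criterion_id)

for s in SAMPLES:
    print(s["criterion"], "".join(words_to_tags(s["words"])))

# An untrained classifier (all logits equal) adds log(3) to the loss.
print(joint_loss(1.5, [0.0, 0.0, 0.0], criterion_id=1))
```

The same six characters yield three different tag sequences, which is why the criterion number has to accompany each sentence: without it, the three annotations would contradict one another during training.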

Step S14: unified end-to-end training. To obtain optimal model parameters, the present invention unifies steps S11, S12 and S13 in one model (see Fig. 2) and trains it end to end with the error backpropagation algorithm. Once training is complete, the early-warning information can be fed directly to the multi-criteria word segmentation model as input; Fig. 3 is a schematic diagram of multi-criteria word segmentation of early-warning information in this embodiment of the present invention.

Two, step S2 is implemented as follows:

Because the present invention detects the legitimacy of early-warning information through forward intelligent perception of legal characters, the white list is crucial to judging illegal information, and as the volume of early-warning text keeps growing, the white list must be updatable quickly and accurately. To this end, first, the present invention uses the trained word segmentation model to segment the existing legal early-warning text dataset and build a basic white list of legal characters and words; second, when the invention is in use, the user can modify and update the white list in real time according to the output of the information legitimacy detection system constructed in this project.
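The build-then-update cycle described above can be sketched as a small class. The class name, method names and example sentences are hypothetical, and `str.split` stands in for the trained multi-criteria word segmentation model.

```python
class WhiteList:
    """Hypothetical white list holding legal characters and words."""

    def __init__(self):
        self.words = set()

    def build(self, legal_texts, segment):
        """Initial construction from the existing legal early-warning texts."""
        for text in legal_texts:
            self.words.update(segment(text))

    def check(self, tokens):
        """Return the tokens that are not yet known to be legal."""
        return [t for t in tokens if t not in self.words]

    def apply_review(self, word, is_legal):
        """Auditor feedback: add a confirmed legal word, drop an illegal one."""
        if is_legal:
            self.words.add(word)
        else:
            self.words.discard(word)

wl = WhiteList()
wl.build(["heavy rain warning signal issued",
          "strong wind warning lifted"], str.split)

msg = "heavy snow warning corpse issued".split()
print(wl.check(msg))                    # ['snow', 'corpse']
wl.apply_review("snow", is_legal=True)  # auditor confirms 'snow' is fine
print(wl.check(msg))                    # ['corpse']
```

Each confirmed word is flagged only once, which is the sense in which reliance on manual review shrinks as the system stays in use.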

Three, step S3 is implemented as follows:

After steps S1 and S2 are complete, the word segmentation algorithm of the invention is applied to the early-warning information to be released to obtain a candidate character set. Intuitively, the legitimacy of the early-warning information could then be judged simply by comparing the candidate character set with the white list. However, direct brute-force search and matching cannot reflect the semantic information of characters and words, and the legal early-warning data collected in advance is extremely limited and cannot cover all legal characters. As a result, the false alarm rate of the whole legitimacy detection system would be too high when it is first put into use, seriously wasting manpower, material and financial resources. To solve this problem, the present invention designs a fast semantic matching algorithm for the early-warning information domain, described as follows:

Step S31: the trained multi-criteria word segmentation model is applied to all existing training sentences, and the segmentation results are collected into different text files according to segmentation criterion. For all files, a word-vector method is used to produce a compressed high-dimensional vector representation of each character, i.e. each character in the white list is represented as a high-dimensional vector. Unlike existing word-vector representations, which produce real-valued features, the present invention adds a Sigmoid function on top of the traditional word vector so that the high-dimensional vector becomes a binary vector, which facilitates large-scale fast matching and retrieval.
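One way to read the binarization in step S31 is sketched below. The text only states that a Sigmoid function is added so that the vector becomes binary; thresholding the Sigmoid output at 0.5 and packing the bits into an integer are assumptions of this sketch, and the example vector is made up.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def binarize(vector, threshold=0.5):
    """Squash each real-valued feature with a sigmoid, then threshold it.
    The 0.5 threshold is an assumption, not taken from the text."""
    return [1 if sigmoid(v) >= threshold else 0 for v in vector]

def pack_bits(bits):
    """Pack a bit list into one integer so codes can be compared with XOR."""
    code = 0
    for b in bits:
        code = (code << 1) | b
    return code

vec = [0.8, -1.2, 0.1, -0.3, 2.0, -0.7, 0.4, 1.5]  # made-up word vector
bits = binarize(vec)
print(bits)                  # [1, 0, 1, 0, 1, 0, 1, 1]
print(bin(pack_bits(bits)))  # 0b10101011
```

Once every white list entry is a fixed-length bit string, comparing two entries reduces to an XOR and a bit count, which is what makes large-scale matching cheap.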

Step S32: for early-warning information to be released, the multi-criteria word segmentation model is first used to segment it into a candidate character set, and each character or word in the set is then compared with the white list in two ways. Mode one: direct brute-force matching. Mode two: each character in the set is first represented as a binary high-dimensional vector in the same way as in step S31; fast semantic matching is then performed using the inverted index and tree data structure; if the Hamming distances to all characters in the white list are greater than a preset threshold, the corresponding character of the current input early-warning information is considered illegal (in actual use, the auditor can be reminded automatically).
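Mode two can be sketched as follows. The sketch covers only an inverted index over chunks of the binary code and omits the tree data structure; the chunking scheme, the class and method names, the 16-bit codes and the threshold are all assumptions. The lookup is exact whenever the threshold is smaller than the number of chunks, because two codes that differ in at most that many bits must agree on at least one whole chunk.

```python
def hamming(a, b):
    """Number of differing bits between two integer-packed binary codes."""
    return bin(a ^ b).count("1")

class BinaryIndex:
    """Hypothetical inverted index over fixed-width chunks of binary codes."""

    def __init__(self, n_bits=16, n_chunks=4):
        assert n_bits % n_chunks == 0
        self.n_chunks = n_chunks
        self.chunk_bits = n_bits // n_chunks
        self.mask = (1 << self.chunk_bits) - 1
        self.codes = {}     # white list word -> binary code
        self.postings = {}  # (chunk position, chunk value) -> set of words

    def _chunks(self, code):
        for pos in range(self.n_chunks):
            yield pos, (code >> (pos * self.chunk_bits)) & self.mask

    def add(self, word, code):
        self.codes[word] = code
        for key in self._chunks(code):
            self.postings.setdefault(key, set()).add(word)

    def is_legal(self, code, threshold):
        """True if some white list code lies within the Hamming threshold."""
        assert threshold < self.n_chunks  # keeps candidate lookup exact
        candidates = set()
        for key in self._chunks(code):
            candidates |= self.postings.get(key, set())
        return any(hamming(code, self.codes[w]) <= threshold
                   for w in candidates)

index = BinaryIndex()
index.add("暴雨", 0b1010101111001100)  # made-up codes
index.add("预警", 0b0001111000110101)

near = 0b1010101111001101  # one bit away from the first entry
far = 0b0101010000110011   # far from both entries
print(index.is_legal(near, threshold=2))  # True
print(index.is_legal(far, threshold=2))   # False -> flag for the auditor
```

Only white list entries that share at least one chunk with the query are compared in full, so the cost of a lookup depends on the size of the candidate set rather than on the size of the white list.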

Embodiment 2

As shown in Fig. 1, an early-warning information legitimacy detection system based on intelligent semantic perception is applied to the method described in Embodiment 1 and comprises a multi-criteria word segmentation algorithm module, a white list construction and real-time update module, and an online illegal character matching module;

The multi-criteria word segmentation algorithm module first completes the character embedding representation: all characters are first encoded as vectors, and each character is then mapped to a semantic vector using dilated convolution, so that the model already has the ability to model the local semantics of long-range, multi-character words at the character embedding stage; it then models the complete semantic context of the input sentence: starting from the semantic vector representation of each character, stacked bidirectional long short-term memory (LSTM) units model the forward and backward semantics simultaneously; a conditional random field is then used for probabilistic labeling to obtain the optimal segmentation sequence;

It then jointly models datasets with different segmentation criteria: the annotation scheme is treated as an implicit supervision signal and modeled jointly with the model, as follows: (1) all annotation types are numbered 0 to N, and when a training text is input, the annotation category of the current sentence is added; (2) the hidden vector representation of the input sentence obtained by the stacked bidirectional LSTM units in context modeling is fed to the conditional random field and, at the same time, to a multi-class classification neural network whose supervision signal is the segmentation annotation category of the current input sentence; finally, unified end-to-end training is performed: the character embedding representation module, the bidirectional recurrent neural network module, the conditional random field module and the module for modeling datasets with different annotation criteria are unified in one multi-criteria word segmentation model, which is trained end to end with the error backpropagation algorithm; once training is complete, the early-warning information is fed directly to the multi-criteria word segmentation model as input.

The white list construction and real-time update module takes the existing library of legal early-warning information as the data basis and constructs a white list of legal characters with the multi-criteria word segmentation algorithm, while auditors update the white list in real time according to actual detection results; each character or word in the white list is given a semantic vector representation using word-vector embedding;

The online illegal character matching module applies the multi-criteria word segmentation algorithm to the early-warning information to be released to obtain a candidate character set, designs a hierarchical search and comparison algorithm for large-scale text data by combining an inverted index with a tree data structure, and quickly locates and judges illegal characters in the early-warning text through semantic comparison with the white list.

The online illegal character matching module applies the trained multi-criteria word segmentation model to all existing training sentences and collects the segmentation results into different text files according to segmentation criterion; for all files, a word-vector method is used to produce a compressed high-dimensional vector representation of each character, i.e. each character in the white list is represented as a high-dimensional vector;

For early-warning information to be released, the multi-criteria word segmentation model is first used to segment it into a candidate character set, and each character or word in the set is then compared with the white list in two ways. The comparison method is: each character in the character set is first represented as a binary high-dimensional vector using the word-vector method; fast semantic matching is then performed using the inverted index and tree data structure; if the Hamming distances to all characters in the white list are greater than a preset threshold, the corresponding character of the current input early-warning information is considered illegal.

The basic principles, main features and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited to the above embodiments; the embodiments and the description merely illustrate the principles of the invention. Various changes and improvements may be made without departing from the spirit and scope of the invention, and all such changes and improvements fall within the scope of the claimed invention. The scope of protection claimed is defined by the appended claims and their equivalents.

Claims

1. An early-warning information legitimacy detection method based on intelligent semantic perception, characterized by comprising:

Step S1: a deep-learning-based multi-criteria word segmentation algorithm for early-warning text in the vertical domain: a multi-criteria word segmentation algorithm based on sequence deep learning is designed using public datasets and a vertical-domain dataset;

Step S2: a coupling-based method for white list construction and real-time update: taking the existing library of legal early-warning information as the data basis, a white list of legal characters is constructed with the multi-criteria word segmentation algorithm, while auditors update the white list in real time according to actual detection results; each character or word in the white list is given a semantic vector representation using word-vector embedding;

Step S3: an online illegal character matching algorithm: the multi-criteria word segmentation algorithm is applied to the early-warning information to be released to obtain a candidate character set; a hierarchical search and comparison algorithm for large-scale text data is designed by combining an inverted index with a tree data structure; and illegal characters in the early-warning text are quickly located and judged through semantic comparison with the white list.

2. The early-warning information legitimacy detection method based on intelligent semantic perception according to claim 1, characterized in that step S1 specifically comprises:

Step S11: character embedding representation: all characters are first encoded as high-dimensional binary vectors; dilated convolution is then applied to each character to perform character semantic vector mapping based on the local semantic context, mapping each character's high-dimensional binary vector to a low-dimensional real-valued vector;

Step S12: whole-sentence semantic context modeling: the semantic vector representation of each character is obtained from step S11, and the forward and backward semantics of the complete Chinese sentence are then modeled; a conditional random field is then used for probabilistic labeling to obtain the optimal segmentation sequence;

Step S13: joint modeling of datasets with different segmentation criteria: the annotation scheme is treated as an implicit supervision signal and modeled jointly with step S12, i.e. the following is done on the basis of step S12: (1) all annotation types are numbered 0 to N, and when a training text is input, the annotation category of the current sentence is added; (2) the hidden vector representation of the input sentence obtained in step S12 is fed to the conditional random field and, at the same time, to a single classification neural network whose supervision signal is the segmentation annotation category of the current input sentence;

Step S14: unified end-to-end training: steps S11, S12 and S13 are unified in one multi-criteria word segmentation model, which is trained end to end with the error backpropagation algorithm; once training is complete, the early-warning information is fed directly to the multi-criteria word segmentation model as input.

3. The early-warning information legitimacy detection method based on intelligent semantic perception according to claim 2, characterized in that: in step S11, all characters are first one-hot encoded as vectors, and each character is then mapped to a semantic vector using stacked dilated convolutional neural networks;

In step S12, stacked bidirectional long short-term memory (LSTM) units are used to model the forward and backward semantics simultaneously;

Item (2) of step S13 specifically uses the hidden vector representation of the input sentence obtained by the stacked bidirectional LSTM units in step S12.

4. The early-warning information legitimacy detection method based on intelligent semantic perception according to any one of claims 1 to 3, characterized in that step S3 specifically comprises:

Step S31: the trained multi-criteria word segmentation model is applied to all existing training sentences, and the segmentation results are collected into different text files according to segmentation criterion; for all files, a word-vector method is used to produce a compressed high-dimensional vector representation of each character, so that each character in the white list is represented as a high-dimensional vector;

Step S32: for early-warning information to be released, the multi-criteria word segmentation model is first used to segment it into a candidate character set, and each character or word in the set is then compared with the white list in two ways. The comparison method is: each character in the character set is first represented as a binary high-dimensional vector in the same way as in step S31; fast semantic matching is then performed using the inverted index and tree data structure; if the Hamming distances to all characters in the white list are greater than a preset threshold, the corresponding character of the current input early-warning information is considered illegal.

5. A warning information legitimacy detection system based on intelligent semantic perception, characterized in that: the system is applied to the method of any one of claims 1 to 4 and comprises a multi-standard word segmentation algorithm module, a white list construction and real-time update module, and an online illegal character matching module;

The multi-standard word segmentation algorithm module uses public datasets and vertical-domain datasets to design a multi-standard word segmentation algorithm based on sequence deep learning;

The white list construction and real-time update module takes the existing library of legal warning information as its data basis and uses the multi-standard word segmentation algorithm to construct a white list of legal characters, while reviewers update the white list in real time according to actual detection results; each word in the white list is given a semantic vector representation using word-vector embedding technology;

The online illegal character matching module uses the multi-standard word segmentation algorithm to perform multi-standard segmentation on the warning information to be released and obtain a candidate character set; combining an inverted index with a tree data structure, it implements a hierarchical search and comparison algorithm for large-scale text data, and locates and judges illegal characters in the warning information text quickly through semantic comparison with the white list.
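The interaction between the white list module and the matching module can be sketched in a few lines. Everything named here is hypothetical: `WhiteList`, `find_illegal` and the `segment` callback (standing in for the trained segmentation model) are illustrative, and exact set membership replaces the semantic vector comparison for brevity.

```python
class WhiteList:
    """Set of legal tokens built from known-legal messages (illustrative only)."""

    def __init__(self):
        self.tokens = set()

    def build(self, legal_messages, segment):
        """Segment every known-legal message and record its tokens."""
        for message in legal_messages:
            self.tokens.update(segment(message))

    def approve(self, token):
        """Reviewer marks a flagged token as legal (real-time update)."""
        self.tokens.add(token)

    def revoke(self, token):
        """Reviewer removes a token that should no longer be accepted."""
        self.tokens.discard(token)

    def __contains__(self, token):
        return token in self.tokens


def find_illegal(message, segment, white_list):
    """Return the tokens of `message` that are absent from the white list."""
    return [tok for tok in segment(message) if tok not in white_list]
```

For example, after building the list from past legal warnings with `segment=str.split`, a new message is passed to `find_illegal`; any tokens returned go to a reviewer, who either rejects the message or calls `approve` so that later messages containing the same token pass without manual review.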

6. The warning information legitimacy detection system based on intelligent semantic perception according to claim 5, characterized in that:

The multi-standard word segmentation algorithm module first completes the character embedding representation: all characters are first encoded into vector form, and each character is then mapped to a semantic vector, so that the character embedding stage already has the capability to model the local semantics of multi-character words over long spans, and polysemy and word ambiguity are resolved according to local context;

Whole-sentence semantic context modeling is then performed: on the basis of the semantic vector representation of each character, forward semantics and backward semantics are modeled; probabilistic labeling is then carried out using a conditional random field to obtain the optimal segmentation sequence;

Datasets with different word segmentation standards are then modeled jointly: the annotation scheme is treated as an implicit supervision signal and modeled jointly with the model, as follows: (1) all annotation types are numbered from 0 to N, and when training text is input, the annotation category to which the current sentence belongs is added; (2) the hidden vector representation of the input sentence obtained in context modeling, while serving as the input to the conditional random field, also serves as the input to a separate classification neural network, whose supervision signal is the category of the segmentation annotation type of the current input sentence;

Finally, unified end-to-end training is performed: the multi-standard word segmentation algorithm module, the white list construction and real-time update module, and the online illegal character matching module are unified in one multi-standard word segmentation model, which is trained end to end using the error backpropagation algorithm; once training is complete, the warning information is used directly as the input of the multi-standard word segmentation model.
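The conditional random field step, which picks the optimal segmentation sequence from per-character scores, is commonly decoded with the Viterbi algorithm. The sketch below assumes a B/M/E/S tagging scheme (begin, middle, end, single) and takes hand-set emission and transition scores as arguments; the claims specify neither the tag set nor the decoder, and in the described system the emission scores would come from the bidirectional LSTM.

```python
TAGS = ("B", "M", "E", "S")
NEG_INF = float("-inf")
# Tag transitions that keep a B/M/E/S sequence well formed.
ALLOWED = {"B": ("M", "E"), "M": ("M", "E"), "E": ("B", "S"), "S": ("B", "S")}


def viterbi(emissions, transitions):
    """Return the highest-scoring valid tag sequence.

    emissions:   list of dicts, one per character, mapping tag -> score
    transitions: dict mapping (prev_tag, cur_tag) -> score (missing pairs count as 0)
    """
    if not emissions:
        return []
    # A sentence can only start with B or S.
    score = {t: (emissions[0][t] if t in ("B", "S") else NEG_INF) for t in TAGS}
    back = []
    for em in emissions[1:]:
        new_score, pointers = {}, {}
        for cur in TAGS:
            best_prev, best = None, NEG_INF
            for prev in TAGS:
                if cur in ALLOWED[prev]:
                    s = score[prev] + transitions.get((prev, cur), 0.0)
                    if s > best:
                        best_prev, best = prev, s
            new_score[cur] = best + em[cur] if best_prev is not None else NEG_INF
            pointers[cur] = best_prev
        score = new_score
        back.append(pointers)
    # A sentence can only end with E or S.
    last = max(("E", "S"), key=lambda t: score[t])
    path = [last]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def tags_to_words(text, tags):
    """Cut `text` after every character tagged E or S."""
    words, current = [], ""
    for ch, tag in zip(text, tags):
        current += ch
        if tag in ("E", "S"):
            words.append(current)
            current = ""
    return words
```

Restricting the first and last tags and the allowed transitions guarantees that the decoded sequence always corresponds to a complete segmentation, so `tags_to_words` never leaves a dangling partial word.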

7. The warning information legitimacy detection system based on intelligent semantic perception according to claim 5, characterized in that: the multi-standard word segmentation algorithm module uses stacked bidirectional long short-term memory units to model forward semantics and backward semantics simultaneously, and the hidden vector representation of the input sentence is the one obtained using the stacked bidirectional long short-term memory units.

8. The warning information legitimacy detection system based on intelligent semantic perception according to claim 5, characterized in that: the online illegal character matching module applies the trained multi-standard word segmentation model to all existing training text sentences and consolidates the segmentation results into different text files according to segmentation standard; for all files, a word-vector method is used to produce a compressed high-dimensional vector representation of the characters, i.e., each character in the white list is represented as a high-dimensional vector;

For warning information to be released, segmentation is first performed with the multi-standard word segmentation model to obtain a candidate character set, and each word in the set is then compared with the white list in two stages: first, each character in the character set is represented as a binary high-dimensional vector using the word-vector method; then, fast semantic matching is performed using an inverted index and a tree data structure. If the Hamming distances between a character and all characters in the white list are greater than a preset threshold, the corresponding character of the currently input warning information is deemed illegal information.