CN107679036A

CN107679036A - Wrong word monitoring method and system

Info

Publication number: CN107679036A
Application number: CN201710946362.3A
Authority: CN
Inventors: 周金娟; 王治平
Original assignee: Nanjing Net Number Mdt Infotech Ltd
Current assignee: Nanjing Net Number Mdt Infotech Ltd
Priority date: 2017-10-12
Filing date: 2017-10-12
Publication date: 2018-02-09

Abstract

The invention provides a wrong word monitoring method. The method includes: building a wrong word dictionary; collecting data from a target website to obtain website data; preprocessing, parsing, and denoising the collected website data to obtain text content; segmenting the text content into individual words; building an AC automaton dictionary tree from the wrong word dictionary and generating a cache; building a context analysis model; and performing wrong word recognition according to the AC dictionary tree cache and the context analysis model and outputting the wrong word recognition result. The invention can effectively improve the accuracy of wrong word monitoring.

Description

Wrong word monitoring method and system

Technical field

The present invention relates to the technical field of wrong word recognition, and in particular to a wrong word monitoring method and system.

Background technology

At present, some government websites contain serious wrong words that appear repeatedly on the same page or across multiple pages. This attracts public and media attention and seriously damages the government's image. In response, the first national survey of government websites included serious wrong words as a check item under the "serious error" indicator.

A wrong word monitoring method generally consists of a wrong word dictionary, a word segmentation technique, and a wrong word recognition model.

Word segmentation is the prerequisite for, and the key to, wrong word recognition. Many segmentation methods exist in the prior art, among which string-based methods are common because they are relatively simple. String-based segmentation broadly comprises the forward maximum matching method and the reverse maximum matching method. For example, one string-based method applies forward or reverse maximum matching to mechanically segment the input string, then recognizes place names and street names among the single characters left unrecognized; its purpose is to identify place names, street names, and the like, and to expand a place-name lexicon. At least the following technical problems have been found in existing segmentation techniques:

1. Existing segmentation systems use only one segmentation method (forward maximum matching or reverse maximum matching). The segmentation process is coarse, so the results are not accurate enough, which lowers segmentation accuracy;

2. Existing segmentation methods usually address only a particular domain and still cannot effectively recognize character strings that span multiple domains.

Therefore, how to effectively improve the accuracy of wrong word monitoring is a problem that urgently needs to be solved.

Summary of the invention

In view of this, the invention provides a wrong word monitoring method that can effectively improve the accuracy of wrong word monitoring.

The invention provides a wrong word monitoring method, the method including:

Building a wrong word dictionary;

Collecting data from a target website to obtain website data;

Preprocessing, parsing, and denoising the collected website data to obtain text content;

Segmenting the text content to obtain individual words;

Building an AC automaton dictionary tree from the wrong word dictionary and generating a cache;

Building a context analysis model;

Performing wrong word recognition according to the AC dictionary tree cache and the context analysis model, and outputting the wrong word recognition result.

Preferably, building the wrong word dictionary includes:

Obtaining the basic information of Chinese characters;

Based on the basic information of the Chinese characters, constructing similar characters for each character using a similar character algorithm, generating the erroneous words corresponding to each word from the similar characters, and obtaining the wrong word dictionary.

Preferably, the method further includes:

Inputting the wrong word recognition result into a wrong word review and annotation platform for manual annotation;

Feeding the annotation results obtained from manual annotation back into the wrong word dictionary and the core word bank;

For the obtained wrong word recognition results, displaying on the system page those results whose words are contained in the core word bank.

A wrong word monitoring system, characterized by including:

A first building module, for building a wrong word dictionary;

An acquisition module, for collecting data from a target website to obtain website data;

A processing module, for preprocessing, parsing, and denoising the collected website data to obtain text content;

A word segmentation module, for segmenting the text content to obtain individual words;

A second building module, for building an AC automaton dictionary tree from the wrong word dictionary and generating a cache;

A third building module, for building a context analysis model;

An output module, for performing wrong word recognition according to the AC dictionary tree cache and the context analysis model and outputting the wrong word recognition result.

Preferably, the first building module includes:

An acquiring unit, for obtaining the basic information of Chinese characters;

A generation unit, for constructing similar characters for each character using a similar character algorithm based on the basic information of the Chinese characters, generating the erroneous words corresponding to each word from the similar characters, and obtaining the wrong word dictionary.

Preferably, the system further includes:

An annotation module, for inputting the wrong word recognition result into a wrong word review and annotation platform for manual annotation;

A feedback update module, for feeding the annotation results obtained from manual annotation back into the wrong word dictionary and the core word bank;

A display module, for displaying on the system page, among the obtained wrong word recognition results, those results whose words are contained in the core word bank.

As can be seen from the above technical solution, the invention provides a wrong word monitoring method. When wrong words on a website need to be monitored, a wrong word dictionary is built first; data is collected from the target website to obtain website data; the collected website data is then preprocessed, parsed, and denoised to obtain text content; the text content is segmented into individual words; an AC automaton dictionary tree is built from the wrong word dictionary and a cache is generated; a context analysis model is built; and finally wrong word recognition is performed according to the AC dictionary tree cache and the context analysis model, and the wrong word recognition result is output. With the AC dictionary tree cache generated from the wrong word dictionary and combined with the context analysis model, government websites are monitored automatically and continuously every day, wrong words on a website can be located more comprehensively and efficiently, and an error level is given, which effectively improves the accuracy of wrong word monitoring.

Brief description of the drawings

To describe the technical solutions in the embodiments of the invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. The drawings described below are merely some embodiments of the invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flow chart of embodiment 1 of the wrong word monitoring method disclosed by the invention;

Fig. 2 is a flow chart of embodiment 2 of the wrong word monitoring method disclosed by the invention;

Fig. 3 is a structural diagram of embodiment 1 of the wrong word monitoring system disclosed by the invention;

Fig. 4 is a structural diagram of embodiment 2 of the wrong word monitoring system disclosed by the invention.

Embodiment

The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art from the embodiments of the invention without creative effort fall within the scope of protection of the invention.

As shown in Fig. 1, which is a flow chart of embodiment 1 of the wrong word monitoring method disclosed by the invention, the method may include the following steps:

S101: build a wrong word dictionary;

When wrong words on a website need to be monitored, the basic data is prepared first and the wrong word dictionary is built. Building the wrong word dictionary mainly involves collecting the basic information of Chinese characters, as well as everyday vocabulary and professional-domain vocabulary, and then constructing the wrong word dictionary from the basic information of the Chinese characters.

S102: collect data from the target website to obtain website data;

After the wrong word dictionary is built, data is collected from the target website to obtain website data. The target website is the website on which wrong word monitoring is to be performed.

S103: preprocess, parse, and denoise the collected website data to obtain text content;

After the data of the target website is obtained, it is preprocessed, parsed, and denoised to obtain the text content. For the original web page text, the tag content of the page is removed first; this step is done mainly with regular expressions. Regular expressions are used to match the HTML tags, the matched tags are deleted, and the remaining content is the web page text.
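As a non-limiting illustration, the tag removal of step S103 might be sketched as follows in Python. The pattern names and the function `extract_text` are hypothetical, and the extra handling of script blocks, style blocks, comments, and character entities is an assumption added for the sketch; the description itself only states that HTML tags are matched with regular expressions and deleted.

```python
import html
import re

# Hypothetical patterns (assumptions for this sketch, not taken from the patent).
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
COMMENT_RE = re.compile(r"<!--.*?-->", re.DOTALL)
TAG_RE = re.compile(r"<[^>]+>")
WHITESPACE_RE = re.compile(r"\s+")


def extract_text(page: str) -> str:
    """Delete markup with regular expressions and return the remaining text."""
    page = SCRIPT_STYLE_RE.sub(" ", page)  # drop script/style blocks with their bodies
    page = COMMENT_RE.sub(" ", page)       # drop HTML comments
    page = TAG_RE.sub(" ", page)           # drop every remaining tag
    page = html.unescape(page)             # turn entities such as &amp; into characters
    return WHITESPACE_RE.sub(" ", page).strip()
```

Regular expressions are not a full HTML parser, so a production system might prefer a real parser for malformed pages; the sketch only mirrors the regular-expression approach named in the description.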

S104: segment the text content to obtain individual words;

The segmentation algorithm performs word segmentation on the web page text, cutting the sequence of Chinese characters into individual words so that a computer can recognize them automatically. The main steps of the algorithm are:

Atom segmentation: to segment a sentence, atom segmentation is performed first, cutting the whole sentence into all possible atomic units.

Generating the full segmentation word graph: the sentence is fully segmented according to the segmentation dictionary, and a word graph represented by a corresponding adjacency list is generated.

Computing the optimal segmentation path: on the basis of this word graph, the optimal segmentation path is generated with a dynamic programming algorithm.

Unknown word recognition: a hidden Markov model is used to recognize out-of-vocabulary nouns such as personal names, place names, and organization names.

The optimal segmentation path is then computed again with the dynamic programming algorithm to obtain the optimal segmentation result.
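The word graph and dynamic programming steps above can be sketched in Python as follows. This is a simplified illustration under stated assumptions: the lexicon is a hypothetical, non-empty dictionary mapping words to frequency counts, path quality is scored by summed log frequency (the description does not specify the scoring function), and the hidden Markov model step for unknown words is omitted.

```python
import math


def build_word_graph(sentence, lexicon):
    """Adjacency list: graph[i] lists every end index j such that
    sentence[i:j] is a candidate word starting at position i."""
    max_len = max(map(len, lexicon))
    graph = {}
    for i in range(len(sentence)):
        ends = [i + 1]  # atom segmentation: a single character is always a candidate
        for j in range(i + 2, min(len(sentence), i + max_len) + 1):
            if sentence[i:j] in lexicon:
                ends.append(j)
        graph[i] = ends
    return graph


def best_segmentation(sentence, lexicon):
    """Pick the path through the word graph with the highest summed log frequency."""
    total = sum(lexicon.values())
    graph = build_word_graph(sentence, lexicon)
    n = len(sentence)
    # best[i] = (score of the best path from i to the end, next cut position)
    best = {n: (0.0, n)}
    for i in range(n - 1, -1, -1):
        best[i] = max(
            # words missing from the lexicon get an assumed count of 1
            (math.log(lexicon.get(sentence[i:j], 1) / total) + best[j][0], j)
            for j in graph[i]
        )
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words
```

Working backwards from the end of the sentence lets each position reuse the best score of every later cut point, which is the dynamic programming idea referred to in the description.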

S105: build an AC automaton dictionary tree from the wrong word dictionary and generate a cache;

A dictionary tree (trie) is a data structure that trades space for time. It supports insertion and lookup in O(len) time, where len is the length of the key, regardless of how many words are stored, and it is widely used in word frequency statistics and input statistics. Its main weakness is multi-pattern matching, which an AC automaton handles efficiently. To combine the advantages of both, the dictionary tree is used here as the search data structure of the AC automaton.

The wrong word dictionary is persisted offline as a dictionary tree cache file. When the service starts, it only needs to load this cache file into memory to retrieve from the wrong word dictionary efficiently, which greatly improves overall system performance.
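For illustration only, an AC automaton built on a dictionary tree, together with a serialized cache, might look like the Python sketch below. The class name `ACAutomaton` and its methods are hypothetical, and the use of `pickle` for the cache is an assumption; the description does not state the cache format.

```python
import pickle
from collections import deque


class ACAutomaton:
    """Aho-Corasick automaton whose states are the nodes of a dictionary tree."""

    def __init__(self):
        self.goto = [{}]  # goto[node] maps a character to the child node
        self.fail = [0]   # failure link of each node
        self.out = [[]]   # dictionary words that end at each node

    def add(self, word):
        node = 0
        for ch in word:
            nxt = self.goto[node].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[node][ch] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            node = nxt
        self.out[node].append(word)

    def build(self):
        """Compute failure links breadth-first; call once after all add() calls."""
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # inherit the words recognized at the failure target
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def search(self, text):
        """Return (start_index, word) for every dictionary word found in text."""
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for word in self.out[node]:
                hits.append((i - len(word) + 1, word))
        return hits

    def to_bytes(self):
        """Serialize the built automaton so it can be written to a cache file."""
        return pickle.dumps((self.goto, self.fail, self.out))

    @classmethod
    def from_bytes(cls, data):
        ac = cls()
        ac.goto, ac.fail, ac.out = pickle.loads(data)
        return ac
```

Because the failure links are stored in the cache, a service could skip `build()` at start-up and call `from_bytes` on the file contents directly. Data passed to `pickle.loads` should come only from a trusted source.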

S106: build a context analysis model;

The text content is represented in a vector space. The words surrounding the target word (its context), together with the order in which the words occur in the text, are taken as input; after the mapping layer applies weighting, the target word is output. The objective is to maximize the output probability of the target wrong word.

Here, context denotes the contextual information.

The text content first needs to be reduced to K-grams. Each input word is mapped to a vector; this mapping is denoted C, so $C(w_{t-1})$ is the word vector of $w_{t-1}$. The n (n>1) words before and after position t are used to predict the t-th word. Suppose the wrong word mapping is denoted E; then $E(w_{t-1})$ is the word vector of the wrong word.

$$P(C(w_t)) = P(E(w_t)) = P(w_t \mid w_{t-k}, w_{t-k-1}, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_{t+k})$$

A window of suitable size is taken as the context. When the word $w_t$ is predicted, its corresponding node necessarily has a binary code, such as "010011". The goal is to predict each bit of this word's binary code so that the probability of the predicted word's binary code is maximized.
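A minimal Python sketch of the two ideas in this step, taking a context window around position t and scoring a binary code bit by bit, is given below. It assumes a hierarchical-softmax-style formulation in which each bit is an independent logistic decision on the context vector; the description does not name the exact model, and all function names and vectors here are hypothetical.

```python
import math


def context_window(words, t, k):
    """Return up to k words before and up to k words after position t."""
    return words[max(0, t - k):t] + words[t + 1:t + 1 + k]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def code_probability(context_vec, code, node_vecs):
    """Probability of a binary code such as "010011", computed as the product
    of one logistic decision per bit (an assumed formulation)."""
    prob = 1.0
    for bit, node_vec in zip(code, node_vecs):
        p_one = sigmoid(sum(c * n for c, n in zip(context_vec, node_vec)))
        prob *= p_one if bit == "1" else 1.0 - p_one
    return prob
```

Under this assumption, maximizing the probability of the whole code amounts to making each per-bit decision as likely as possible for the target word.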

S107: perform wrong word recognition according to the AC dictionary tree cache and the context analysis model, and output the wrong word recognition result.

On the basis of the segmentation result, the context analysis model searches the AC dictionary tree cache and computes the most probable potential erroneous word. It provides the corrected word, performs fuzzy matching on the difference between the correct word and the erroneous word to make the error-correction judgment, and provides other attribute information of the erroneous word such as its error type. Finally, an erroneous-word entity object is generated and returned. The entity class contains the erroneous word, the correct word, the error type, and the context of the erroneous word in the text.
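The entity class returned by this step might be represented as in the following Python sketch. The names `WrongWordRecord` and `make_record`, and the `radius` parameter that controls how much surrounding text is kept as context, are hypothetical; only the four fields come from the description.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WrongWordRecord:
    wrong_word: str    # the erroneous word found in the text
    correct_word: str  # the suggested correction
    error_type: str    # error type attribute
    context: str       # surrounding text in which the erroneous word occurred


def make_record(text, start, wrong_word, correct_word, error_type, radius=10):
    """Build a record, keeping `radius` characters on each side as context."""
    left = max(0, start - radius)
    right = min(len(text), start + len(wrong_word) + radius)
    return WrongWordRecord(wrong_word, correct_word, error_type, text[left:right])
```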

In summary, in the above embodiment, when wrong words on a website need to be monitored, a wrong word dictionary is built first; data is collected from the target website to obtain website data; the collected website data is then preprocessed, parsed, and denoised to obtain text content; the text content is segmented into individual words; an AC automaton dictionary tree is built from the wrong word dictionary and a cache is generated; a context analysis model is built; and finally wrong word recognition is performed according to the AC dictionary tree cache and the context analysis model, and the wrong word recognition result is output. With the AC dictionary tree cache generated from the wrong word dictionary and combined with the context analysis model, government websites are monitored automatically and continuously every day, wrong words on a website can be located more comprehensively and efficiently, and an error level is given, which effectively improves the accuracy of wrong word monitoring.

As shown in Fig. 2, which is a flow chart of embodiment 2 of the wrong word monitoring method disclosed by the invention, the method may include the following steps:

S201: obtain the basic information of Chinese characters;

9,891 Chinese characters are collected and organized, and a crawler is used to obtain their basic information, including pinyin, initial, final, consonant, tone, character structure, Wubi (five-stroke) code, four-corner code, stroke order, radical, stroke count, and so on. For polyphonic characters, all pronunciations are retained. In addition, common Chinese words and phrases are collected from large Chinese dictionaries, segmentation dictionaries, and similar sources, and country names, government-affairs vocabulary, medical-industry vocabulary, and the like are collected and organized manually. Together these form the correct word dictionary;

S202: based on the basic information of the Chinese characters, construct similar characters for each character using a similar character algorithm, generate the erroneous words corresponding to each word from the similar characters, and obtain the wrong word dictionary;

The similar character algorithm covers two cases: similar characters with the same character structure, and similar characters with different character structures. For example, the similar characters "值" (value) and "植" (plant) are both left-right structures, so their structures are the same; of "值" (value) and "直" (straight), the former is a left-right structure and the latter a top-bottom structure. The algorithm used differs between the two cases.

For the first case, where the structures are the same, the similar character algorithm proceeds as follows:

First, the preconditions for two characters to be similar are that their character structures are identical, their stroke counts differ by no more than 3, and at least 2 digits of their four-corner codes are identical;

Then, the remaining features, namely pinyin, four-corner code, and Wubi code, are assigned weights of 30%, 25%, and 45% respectively. The detailed rules are as follows:

If the pinyin is identical (tones may differ), or the pinyin of the two characters differs only as a front versus back nasal final (such as yin and ying), or only as a flat-tongue versus retroflex initial (such as sa and sha), the score is 0.3; otherwise it is 0;

For the four-corner code, if N digits are identical, the score is N*0.2*0.25;

For the Wubi code, when a 3-code character is matched against a 3-code character and M codes are identical, the score is M*1/3*0.45; when a 4-code character is matched against a 4-code character and N codes are identical, the score is N*1/4*0.45; when a 3-code character is matched against a 4-code character, the score is 2*2/3*0.45 if 2 consecutive codes are identical, and 0.45 if 3 codes are identical.

Finally, the scores obtained from pinyin, four-corner code, and Wubi code are summed, and only characters whose similarity is greater than or equal to 0.5 are kept as similar characters;
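The scoring rules for the same-structure case can be sketched in Python as follows. The sketch rests on several assumptions that are not stated in the description: identical digits are counted position by position, pinyin is supplied without tone marks, and Wubi codes are non-empty. The mixed case of a 3-code character against a 4-code character is deliberately left out because its rule is ambiguous as written. `CharInfo` and the function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CharInfo:
    char: str
    pinyin: str       # toneless pinyin, e.g. "zhi" (assumed input format)
    structure: str    # e.g. "left-right"
    strokes: int      # stroke count
    four_corner: str  # four-corner code as a digit string
    wubi: str         # Wubi code, 3 or 4 letters


def same_positions(a, b):
    """Count positions at which two codes carry the same symbol (assumed rule)."""
    return sum(x == y for x, y in zip(a, b))


def pinyin_similar(p, q):
    def denasal(s):   # back nasal to front nasal: "ying" -> "yin"
        return s[:-1] if s.endswith("ng") else s

    def flatten(s):   # retroflex to flat-tongue initial: "sha" -> "sa"
        return s[0] + s[2:] if s[:2] in ("zh", "ch", "sh") else s

    return p == q or denasal(p) == denasal(q) or flatten(p) == flatten(q)


def similarity(a, b):
    # Preconditions: same structure, stroke counts within 3, >= 2 equal corner digits.
    if a.structure != b.structure or abs(a.strokes - b.strokes) > 3:
        return 0.0
    corner_hits = same_positions(a.four_corner, b.four_corner)
    if corner_hits < 2:
        return 0.0
    if len(a.wubi) != len(b.wubi):
        raise NotImplementedError("mixed 3-code/4-code Wubi case is not sketched")
    score = 0.3 if pinyin_similar(a.pinyin, b.pinyin) else 0.0
    score += corner_hits * 0.2 * 0.25
    score += same_positions(a.wubi, b.wubi) / len(a.wubi) * 0.45
    return score


def is_similar(a, b, threshold=0.5):
    return similarity(a, b) >= threshold
```

For example, two characters with similar pinyin (0.3), three equal four-corner digits (3*0.2*0.25 = 0.15), and three of four equal Wubi codes (3/4*0.45 = 0.3375) would score 0.7875 and be kept as similar characters.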

For the second case, because the structure changes, the four-corner code and the Wubi code are less useful. In fact, "值" (value) is composed of "亻" and "直" (straight), so "直" happens to be a subset of "值". Such cases can be handled entirely by stroke order. The detailed algorithm is as follows:

First, filter out the characters that are subsets or supersets of the target character and whose stroke count differs from it by no more than 5;

Judge whether the pinyin is similar (that is, the pinyin is identical, where tones may differ, or the pinyin of the two characters differs only as a front versus back nasal final or as a flat-tongue versus retroflex initial). If it is similar, the character is directly judged to be a similar character; if not, processing continues;

Filter out the characters whose four-corner codes have at least 4 identical digits and whose Wubi codes have at least N identical positions (if the Wubi codes of both characters have 3 codes, N=2; if both have 4 codes, N=3; if one has 3 codes and the other has 4, N=2).

Exclude the characters already covered by the first case (for example, "bar" and "stalk" have the same structure, and the stroke order of "bar" is a subset of that of "stalk"); the remaining characters are the similar characters of the second case.

Here, an erroneous word is a word formed with a similar character. Take the word "植物" (plant): one similar character of "植" is "值", so replacing "植" with "值" forms "值物", which is what is called an erroneous word. In practice, such words potentially contain wrong words.

The main flow for generating erroneous words is as follows:

Basic data preparation: all words are extracted from the correct word dictionary, and each word is split into individual characters;

Potential erroneous word generation: for each individual character in a word, its similar characters are looked up, and each similar character is substituted for the original character to generate a potential erroneous word;

Erroneous word generation: an erroneous word obtained by substitution may itself be a correct word. For example, a similar character of "far or indistinct" is "vast", one word containing "far or indistinct" is "dimly discernible", and the generated erroneous word is "ethereal"; in practice, however, "ethereal" is also a correct word. Such cases therefore need to be filtered out.

After the erroneous words corresponding to each correct word have been formed, the correct word, its erroneous words, and a manually assigned error type together make up the wrong word dictionary.
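The generation flow above might be sketched in Python as follows. The function name `generate_wrong_words` and the shape of its inputs (a list of correct words and a mapping from each character to its similar characters) are assumptions for illustration; the manually assigned error type is not modeled.

```python
def generate_wrong_words(correct_words, similar_chars):
    """Map each generated erroneous word to the set of correct words it came from.

    correct_words: iterable of words from the correct word dictionary
    similar_chars: dict mapping a character to its similar characters
    """
    correct = set(correct_words)
    wrong = {}
    for word in correct:
        for i, ch in enumerate(word):
            for sim in similar_chars.get(ch, ()):
                candidate = word[:i] + sim + word[i + 1:]
                # filter out substitutions that happen to form another correct word
                if candidate not in correct:
                    wrong.setdefault(candidate, set()).add(word)
    return wrong
```

Keeping the originating correct word alongside each erroneous word means the later recognition step can suggest a correction as soon as an erroneous word is matched.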

S203: collect data from the target website to obtain website data;

S204: preprocess, parse, and denoise the collected website data to obtain text content;

S205: segment the text content to obtain individual words;

S206: build an AC automaton dictionary tree from the wrong word dictionary and generate a cache;

S207: build a context analysis model;

Here, context denotes the contextual information.

$$P(C(w_t)) = P(E(w_t)) = P(w_t \mid w_{t-k}, w_{t-k-1}, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_{t+k})$$

S208: perform wrong word recognition according to the AC dictionary tree cache and the context analysis model, and output the wrong word recognition result;

S209: input the wrong word recognition result into the wrong word review and annotation platform for manual annotation;

Of the wrong word recognition records produced by daily monitoring, all records not yet in the annotation platform are appended to it. After logging in to the annotation platform, a user can manually fetch wrong word recognition records and mark each one as right or wrong. Each record is assigned to at most N users (N is configurable), and each user judges independently the records assigned to them. Based on the wrong word context provided, the user clicks the "correct" button if they consider the recognition correct, or the "wrong" button if they consider it wrong. Assigning each record to several users pools the judgment of multiple people and reduces, to some extent, the risk of misjudgment caused by one person's mistake.

S210: feed the annotation results obtained from manual annotation back into the wrong word dictionary and the core word bank;

Every night the annotation platform starts a scheduled script. The script first computes the final result of each wrong word record from all of its annotation results. For example, for one wrong word recognition record, if the number of users who consider it correct is greater than the number who consider it wrong, the final result of the record is "correct"; conversely, if the number of users who consider it wrong is greater than the number who consider it correct, the final result is "wrong". If the two numbers are equal, the result of the record is provisionally "pending", and the record continues to be assigned to other users for annotation until a definite result is obtained.
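The voting rule described above can be expressed as a short Python sketch. The function name `final_result` and the result labels are hypothetical; each vote is assumed to be a boolean, with True meaning that the annotator judged the recognition correct.

```python
def final_result(votes):
    """Aggregate annotator votes for one wrong word recognition record."""
    votes = list(votes)
    yes = sum(1 for v in votes if v)
    no = len(votes) - yes
    if yes > no:
        return "correct"
    if no > yes:
        return "wrong"
    return "pending"  # tie (or no votes yet): keep assigning to other users
```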

After the final results of the annotated wrong word records have been computed, the program automatically selects the records whose recognition was correct and updates them into the core word bank. It also selects the records whose recognition was wrong and deletes them from the wrong word dictionary, so that such words are not recognized again.

In this way, on top of comprehensive automatic recognition of wrong words by the machine, manual annotation is added, and the results of manual judgment are fed back into the wrong word dictionary and the core word bank, making wrong word recognition more comprehensive and accurate.

S211: for the obtained wrong word recognition results, display on the system page those results whose words are contained in the core word bank.

Because the wrong word dictionary is constructed by an algorithm according to fixed rules, it is larger and more comprehensive than a manually constructed one. However, the large number of wrong word records produced by daily monitoring inevitably includes some inaccurate recognitions. If all recognition records were displayed on the page directly without screening, this would cause some confusion for customers. Therefore a core word bank is established through the annotation platform: accurately recognized words are placed in the core word bank, which accumulates and grows over time, and only wrong word recognition records found in the core word bank are displayed on the page, ensuring that the wrong word recognitions shown on the system page are highly accurate.
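A minimal Python sketch of the feedback update and the display filter is given below. It assumes that the wrong word dictionary is a dict keyed by erroneous word, that the core word bank is a set, and that each record is a dict with a "wrong_word" key; these shapes and the function names are hypothetical.

```python
def apply_feedback(final_results, wrong_word_dict, core_word_bank):
    """Update both stores in place from the final annotation results.

    final_results: dict mapping an erroneous word to "correct", "wrong", or "pending"
    """
    for wrong_word, verdict in final_results.items():
        if verdict == "correct":
            core_word_bank.add(wrong_word)         # confirmed: promote to the core word bank
        elif verdict == "wrong":
            wrong_word_dict.pop(wrong_word, None)  # rejected: stop recognizing this word
        # "pending" records are left untouched until a definite result exists


def records_to_display(records, core_word_bank):
    """Keep only the recognition records whose word is in the core word bank."""
    return [r for r in records if r["wrong_word"] in core_word_bank]
```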

In summary, in the above embodiment, similar characters are constructed from the basic information of Chinese characters, and the wrong word dictionary is generated. With the AC dictionary tree cache generated from the wrong word dictionary and combined with the context analysis model, government websites are monitored automatically and continuously every day, and wrong words on a website can be located more comprehensively and efficiently, with an error level (such as serious wrong word or ordinary wrong word) and a suggested correction provided. In addition, the monitoring method incorporates manual annotation, which compensates well for the shortcomings of machine recognition, refines the core word bank, and promotes the conversion of the wrong word dictionary from edge to core, so that website wrong words are monitored more accurately.

As shown in Fig. 3, which is a structural diagram of embodiment 1 of the wrong word monitoring system disclosed by the invention, the system may include:

A first building module 301, for building a wrong word dictionary;

An acquisition module 302, for collecting data from a target website to obtain website data;

A processing module 303, for preprocessing, parsing, and denoising the collected website data to obtain text content;

A word segmentation module 304, for segmenting the text content to obtain individual words;

A second building module 305, for building an AC automaton dictionary tree from the wrong word dictionary and generating a cache;

A third building module 306, for building a context analysis model;

Here, context denotes the contextual information.

$$P(C(w_t)) = P(E(w_t)) = P(w_t \mid w_{t-k}, w_{t-k-1}, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_{t+k})$$

An output module 307, for performing wrong word recognition according to the AC dictionary tree cache and the context analysis model and outputting the wrong word recognition result.

As shown in Fig. 4, which is a structural diagram of embodiment 2 of the wrong word monitoring system disclosed by the invention, the system may include:

An acquiring unit 401, for obtaining the basic information of Chinese characters;

A generation unit 402, for constructing similar characters for each character using a similar character algorithm based on the basic information of the Chinese characters, generating the erroneous words corresponding to each word from the similar characters, and obtaining the wrong word dictionary;

The main flow for generating erroneous words is as described for step S202 of the method embodiment.

An acquisition module 403, for collecting data from a target website to obtain website data;

A processing module 404, for preprocessing, parsing, and denoising the collected website data to obtain text content;

A word segmentation module 405, for segmenting the text content to obtain individual words;

A second building module 406, for building an AC automaton dictionary tree from the wrong word dictionary and generating a cache;

A third building module 407, for building a context analysis model;

Here, context denotes the contextual information.

$$P(C(w_t)) = P(E(w_t)) = P(w_t \mid w_{t-k}, w_{t-k-1}, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_{t+k})$$

An output module 408, for performing wrong word recognition according to the AC dictionary tree cache and the context analysis model and outputting the wrong word recognition result;

An annotation module 409, for inputting the wrong word recognition result into the wrong word review and annotation platform for manual annotation;

A feedback update module 410, for feeding the annotation results obtained from manual annotation back into the wrong word dictionary and the core word bank;

A display module 411, for displaying on the system page, among the obtained wrong word recognition results, those results whose words are contained in the core word bank.

Each embodiment is described by the way of progressive in this specification, what each embodiment stressed be and other The difference of embodiment, between each embodiment identical similar portion mutually referring to.For device disclosed in embodiment For, because it is corresponded to the method disclosed in Example, so description is fairly simple, related part is said referring to method part It is bright.

Professional further appreciates that, with reference to the unit of each example of the embodiments described herein description And algorithm steps, can be realized with electronic hardware, computer software or the combination of the two, in order to clearly demonstrate hardware and The interchangeability of software, the composition and step of each example are generally described according to function in the above description.These Function is performed with hardware or software mode actually, application-specific and design constraint depending on technical scheme.Specialty Technical staff can realize described function using distinct methods to each specific application, but this realization should not Think beyond the scope of this invention.

The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.

The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

1. A wrong word monitoring method, characterized in that the method comprises:

building a wrong word dictionary;

performing data acquisition on a target website to obtain website data;

building a context analysis model;

performing wrong word identification according to an AC dictionary tree cache and the context analysis model, and outputting a wrong word recognition result.

2. The method according to claim 1, characterized in that building the wrong word dictionary comprises:

obtaining basic information of Chinese characters;

based on the basic information of the Chinese characters, constructing similar characters for each character using a similar-character algorithm, and generating, from the similar characters, the wrong words corresponding to each word, to obtain the wrong word dictionary.

3. The method according to claim 1, characterized by further comprising:

for the obtained wrong word recognition results, displaying on a system page the wrong word recognition results that correspond to words contained in a core word bank.

4. A wrong word monitoring system, characterized by comprising:

a first building module, configured to build a wrong word dictionary;

a processing module, configured to perform preprocessing, web page parsing and denoising on the acquired website data to obtain text content;

a third building module, configured to build a context analysis model;

an output module, configured to perform wrong word identification according to an AC dictionary tree cache and the context analysis model, and to output a wrong word recognition result.

5. The system according to claim 4, characterized in that the first building module comprises:

an acquiring unit, configured to obtain basic information of Chinese characters;

a generation unit, configured to construct, based on the basic information of the Chinese characters, similar characters for each character using a similar-character algorithm, and to generate, from the similar characters, the wrong words corresponding to each word, to obtain the wrong word dictionary.

6. The system according to claim 4, characterized by further comprising:

a labeling module, configured to input the wrong word recognition result into a wrong word review and verification annotation platform for manual annotation;

a feedback update module, configured to feed the annotation results obtained by manual annotation back into the wrong word dictionary and a core word bank as updates;

a display module, configured to display on a system page, from among the obtained wrong word recognition results, those results that correspond to words contained in the core word bank.