Embodiment
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention, without creative effort, fall within the scope of protection of the present invention.
As shown in Fig. 1, which is a flow chart of Embodiment 1 of the wrong word monitoring method disclosed by the present invention, the method may comprise the following steps:
S101: build a wrong word dictionary;
When wrong words on a website need to be monitored, basic data preparation is carried out first to build the wrong word dictionary. This mainly involves collecting the basic information of Chinese characters, together with everyday vocabulary and professional-domain vocabulary, and then constructing the wrong word dictionary from the basic information of the Chinese characters.
S102: carry out data acquisition on the target website to obtain website data;
After the wrong word dictionary has been built, data is collected from the target website to obtain website data, where the target website is the website on which wrong word monitoring is to be performed.
S103: pre-process, parse, and denoise the acquired website data to obtain the text content;
After the data of the target website has been obtained, it is pre-processed, and the web pages are parsed and denoised to obtain the text content. For the original web page content, the tag content of the page is removed first. This step is done mainly with regular expressions: the HTML tags are matched with regular expressions and the matched tags are deleted, so that what remains is the text content of the web page.
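The tag-removal step can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiment: the function name `extract_text` is hypothetical, and dropping `<script>`/`<style>` blocks and HTML comments together with their contents is an added assumption, since the text only mentions deleting the matched tags.

```python
import html
import re

# Assumption: script/style blocks and comments are noise and are dropped with their contents.
_SCRIPT_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
_COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
_TAG = re.compile(r"<[^>]+>")      # any remaining HTML tag
_SPACE = re.compile(r"\s+")


def extract_text(page: str) -> str:
    """Delete HTML tags with regular expressions and return the remaining text."""
    page = _SCRIPT_STYLE.sub(" ", page)
    page = _COMMENT.sub(" ", page)
    page = _TAG.sub(" ", page)
    page = html.unescape(page)     # turn entities such as &amp; back into characters
    return _SPACE.sub(" ", page).strip()
```

Regular expressions are adequate for stripping tags from reasonably well-formed pages; heavily malformed markup would likely need a real HTML parser.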
S104: perform word segmentation on the text content to obtain individual words;
The segmentation algorithm operates on the web page text content and cuts the Chinese character sequence into individual words so that the computer can recognize them automatically. The main steps of the algorithm are:
Atomic segmentation: to segment a sentence, atomic segmentation is performed first, cutting the whole sentence into all possible atomic units.
Full-segmentation word graph generation: the sentence is fully segmented according to the segmentation dictionary, and a corresponding adjacency list is generated to represent the word graph.
Optimal segmentation path computation: on the basis of this word graph, the optimal segmentation path is generated with a dynamic programming algorithm.
Unknown word recognition: a hidden Markov model is used to recognize out-of-vocabulary nouns such as person names, place names, and organization names.
The optimal segmentation path is then recomputed with the dynamic programming algorithm to obtain the final segmentation result.
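The word-graph and dynamic-programming steps above can be sketched as follows. This is an illustrative sketch under stated assumptions: the lexicon is a hypothetical mapping from word to frequency, the path score is the sum of log relative frequencies, and the hidden-Markov-model step for unknown words is not included.

```python
import math


def build_word_graph(sentence, lexicon):
    """Adjacency list: graph[i] holds every end index j such that sentence[i:j] is a candidate word."""
    max_len = max(map(len, lexicon), default=1)
    graph = {}
    for i in range(len(sentence)):
        ends = {i + 1}  # atomic unit: each single character is always a candidate
        for j in range(i + 2, min(len(sentence), i + max_len) + 1):
            if sentence[i:j] in lexicon:
                ends.add(j)
        graph[i] = sorted(ends)
    return graph


def segment(sentence, lexicon):
    """Pick the path through the word graph with the highest total log probability."""
    total = sum(lexicon.values()) or 1
    n = len(sentence)
    graph = build_word_graph(sentence, lexicon)
    best = {n: (0.0, n)}  # best[i] = (score of the best path from i to the end, next cut index)
    for i in range(n - 1, -1, -1):
        best[i] = max(
            (math.log(lexicon.get(sentence[i:j], 1) / total) + best[j][0], j)
            for j in graph[i]
        )
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words
```

Filling `best` from right to left means each position is solved once, so the cost is proportional to the number of edges in the word graph.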
S105: build an AC automaton dictionary tree based on the wrong word dictionary, and generate a cache;
A dictionary tree (trie) is a data structure that trades space for time. Insertion and lookup take O(len) time, where len is the length of the key, regardless of how many entries the tree holds. Dictionary trees are widely used in word frequency statistics and input statistics. Their main weakness is multi-pattern matching, which the Aho-Corasick (AC) automaton handles efficiently. To combine the advantages of both, the dictionary tree is used here as the search data structure of the AC automaton.
The wrong word dictionary is persisted offline as a dictionary tree cache file. At service startup, this cache file only needs to be loaded into memory to enable efficient retrieval of the wrong word dictionary, which greatly improves overall system performance.
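A minimal sketch of an AC automaton built on a dictionary tree, with the structure kept as plain lists and dictionaries so that it can be written to a cache file. The function names and the use of `pickle` as the cache format are assumptions made for illustration; the text does not say how the cache file is encoded.

```python
import pickle
from collections import deque


def build_automaton(patterns):
    """Build the trie (goto), then add failure links and merged outputs breadth-first."""
    goto, fail, out = [{}], [0], [[]]
    for pattern in patterns:
        node = 0
        for ch in pattern:
            if ch not in goto[node]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[node][ch] = len(goto) - 1
            node = goto[node][ch]
        out[node].append(pattern)
    queue = deque(goto[0].values())  # depth-1 nodes keep failure link 0 (the root)
    while queue:
        node = queue.popleft()
        for ch, child in goto[node].items():
            f = fail[node]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[child] = goto[f].get(ch, 0)
            out[child] = out[child] + out[fail[child]]  # inherit matches ending at the suffix
            queue.append(child)
    return {"goto": goto, "fail": fail, "out": out}


def search(automaton, text):
    """Return (start_index, pattern) for every dictionary entry found in one pass over text."""
    goto, fail, out = automaton["goto"], automaton["fail"], automaton["out"]
    node, hits = 0, []
    for i, ch in enumerate(text):
        while node and ch not in goto[node]:
            node = fail[node]
        node = goto[node].get(ch, 0)
        hits.extend((i - len(p) + 1, p) for p in out[node])
    return hits


def save_cache(automaton, path):
    """Persist the automaton offline (hypothetical cache format: pickle)."""
    with open(path, "wb") as fh:
        pickle.dump(automaton, fh)


def load_cache(path):
    """Load the cached automaton at service startup."""
    with open(path, "rb") as fh:
        return pickle.load(fh)
```

Because the failure links are computed once when the cache is built, the service only pays for the single linear scan in `search` at monitoring time.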
S106: build a context analysis model;
The text content is represented in a vector space. The other words surrounding the target word (the context), together with the order in which the words appear in the text, serve as input; the target word is output after the mapping layer applies weighting, with the objective of maximizing the output probability of the target wrong word.
Here, context denotes the contextual information.
The text content first needs to be reduced to K-tuples. Each input word is mapped to a vector; this mapping is denoted C, so C(w_{t-1}) is the word vector of w_{t-1}. The n (n > 1) words before and after position t are used to predict the t-th word. Assuming the wrong word mapping is denoted E, E(w_{t-1}) is the word vector of the wrong word.
$$P(C(w_t)) = P(E(w_t)) = P(w_t \mid w_{t-k}, w_{t-k+1}, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_{t+k})$$
A window of appropriate size is taken as the context. When the word w_t is predicted, its corresponding node necessarily has a binary code, such as "010011". The goal is to predict each bit of this word's binary code so that the probability of the predicted word's binary code is maximized.
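The bit-by-bit prediction can be illustrated with a small sketch. It assumes a formulation in the style of hierarchical softmax, in which the context vectors are averaged and each bit of the code is decided by a logistic unit with its own parameter vector; the text does not name this technique, so the function and parameter names here are hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def code_probability(context_vectors, node_vectors, code):
    """Probability of a word's binary code given its context window.

    context_vectors: word vectors C(w) of the words in the window
    node_vectors:    one parameter vector per bit of the code (assumed)
    code:            the binary code of the target word, e.g. "010011"
    """
    dim = len(context_vectors[0])
    # Mapping layer: average the context word vectors.
    hidden = [sum(v[d] for v in context_vectors) / len(context_vectors) for d in range(dim)]
    prob = 1.0
    for bit, theta in zip(code, node_vectors):
        p_one = sigmoid(sum(h * t for h, t in zip(hidden, theta)))
        prob *= p_one if bit == "1" else 1.0 - p_one
    return prob
```

Training would then adjust the word vectors and the per-bit parameter vectors so that this product is as large as possible for the observed words.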
S107: perform wrong word recognition according to the AC dictionary tree cache and the context analysis model, and output the wrong word recognition result.
On the basis of the segmentation result, the context analysis model searches the AC dictionary tree cache and computes the potential erroneous word with the highest probability. The corrected word is given, fuzzy matching is performed on the difference set between the correct word and the erroneous word to make the correction decision, and other attributes of the erroneous word, such as the error type, are provided. Finally, an erroneous-word entity class is generated and returned. The entity class contains the erroneous word, the correct word, the error type, and the context of the erroneous word in the text.
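The returned entity can be pictured as a small record type. The class name, field names, and the fixed-width context window below are illustrative assumptions; the text only lists which pieces of information the entity carries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErroneousWord:
    erroneous: str   # the erroneous word found in the text
    correct: str     # the suggested correct word
    error_type: str  # error type taken from the wrong word dictionary
    context: str     # surrounding text of the erroneous word


def make_entity(text, start, erroneous, correct, error_type, window=10):
    """Build the entity, taking `window` characters on each side as context (assumed size)."""
    left = max(0, start - window)
    right = min(len(text), start + len(erroneous) + window)
    return ErroneousWord(erroneous, correct, error_type, text[left:right])
```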
In summary, in the above embodiment, when wrong words on a website need to be monitored, a wrong word dictionary is built first; data is collected from the target website to obtain website data; the acquired website data is pre-processed, parsed, and denoised to obtain the text content; the text content is segmented into individual words; an AC automaton dictionary tree is built from the wrong word dictionary and a cache is generated; a context analysis model is built; and finally wrong word recognition is performed according to the AC dictionary tree cache and the context analysis model, and the wrong word recognition result is output. Because the AC dictionary tree cache is generated from the wrong word dictionary and combined with the context analysis model to monitor government websites automatically and continuously every day, wrong words on a website can be located more comprehensively and efficiently and assigned an error level, which effectively improves the accuracy of wrong word monitoring.
As shown in Fig. 2, which is a flow chart of Embodiment 2 of the wrong word monitoring method disclosed by the present invention, the method may comprise the following steps:
S201: obtain the basic information of Chinese characters;
9,891 Chinese characters are collected and organized, and a web crawler is used to obtain their basic information, including pinyin, initial, final, consonant, tone, character structure, Wubi code, four-corner code, stroke order, radical, and stroke count. For polyphonic characters, all pinyin readings are retained. In addition, commonly used Chinese words and phrases are collected and organized from comprehensive Chinese dictionaries, segmentation dictionaries, and similar sources, and national titles, government-affairs vocabulary, medical-industry vocabulary, and the like are specifically collected and organized by hand, thereby forming the correct word dictionary;
S202: based on the basic information of the Chinese characters, construct similar characters for each character with a similar-character algorithm, generate the erroneous words corresponding to each word from the similar characters, and obtain the wrong word dictionary;
The similar-character algorithm covers two cases: similar characters with the same character structure, and similar characters with different character structures. For example, the similar characters 值 ("value") and 植 ("plant") both have a left-right structure, so their structure is the same; for 值 ("value") and 直 ("straight"), the former has a left-right structure and the latter a top-bottom structure. Different algorithms are used for the two cases.
For the first case, in which the structure is the same, the similar-character algorithm mainly proceeds as follows:
First, as preconditions for two characters to be similar, their character structures must be the same, their stroke counts must differ by no more than 3, and at least 2 digits of their four-corner codes must be identical;
Then, pinyin, four-corner code, and Wubi code are assigned weights of 30%, 25%, and 45% respectively. The detailed rules are as follows:
If the pinyin is identical (tones may differ), or the pinyin of the two characters differs only in front versus back nasal finals (such as yin and ying), or differs only in flat-tongue versus retroflex initials (such as sa and sha), the score is 0.3; otherwise it is 0;
For the four-corner code, if N digits are identical, the score is N*0.2*0.25;
For the Wubi code: when a 3-letter code is matched against a 3-letter code and M letters are identical, the score is M*1/3*0.45; when a 4-letter code is matched against a 4-letter code and N letters are identical, the score is N*1/4*0.45; when a 3-letter code is matched against a 4-letter code, the score is 2*2/3*0.45 if 2 consecutive letters are identical, and 0.45 if 3 letters are identical.
Finally, the similarity scores obtained from pinyin, four-corner code, and Wubi code are summed, and only characters whose total similarity is greater than or equal to 0.5 are kept as similar characters;
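The scoring rules for the first case can be sketched as follows. Several details are assumptions made for illustration: codes are compared position by position, pinyin is stored without tone marks, Wubi codes of lengths other than 3 or 4 are scored proportionally, and the `CharInfo` record and function names are hypothetical. The rule for matching a 3-letter code against a 4-letter code is transcribed exactly as the text states it.

```python
from dataclasses import dataclass

W_PINYIN, W_FOUR_CORNER, W_WUBI = 0.30, 0.25, 0.45


@dataclass
class CharInfo:
    char: str
    pinyin: str       # without tone, e.g. "zhi"
    structure: str    # e.g. "left-right"
    strokes: int
    four_corner: str  # 5 digits
    wubi: str


def same_positions(a, b):
    """Number of positions at which two codes agree (assumed comparison rule)."""
    return sum(x == y for x, y in zip(a, b))


def _flat(p):   # zh/ch/sh -> z/c/s
    return p[0] + p[2:] if p[:2] in ("zh", "ch", "sh") else p


def _front(p):  # back nasal -> front nasal, e.g. ying -> yin
    return p[:-1] if p.endswith("ng") else p


def pinyin_similar(a, b):
    return a == b or _flat(a) == _flat(b) or _front(a) == _front(b)


def wubi_score(a, b):
    if len(a) == len(b):
        return same_positions(a, b) / len(a) * W_WUBI if a else 0.0
    short, long_ = sorted((a, b), key=len)
    if (len(short), len(long_)) == (3, 4):
        flags = [x == y for x, y in zip(short, long_)]
        if all(flags):
            return W_WUBI
        if (flags[0] and flags[1]) or (flags[1] and flags[2]):
            return 2 * 2 / 3 * W_WUBI  # expression as written in the text
    return 0.0


def meets_preconditions(a, b):
    return (a.structure == b.structure
            and abs(a.strokes - b.strokes) <= 3
            and same_positions(a.four_corner, b.four_corner) >= 2)


def similarity(a, b):
    score = W_PINYIN if pinyin_similar(a.pinyin, b.pinyin) else 0.0
    score += same_positions(a.four_corner, b.four_corner) * 0.2 * W_FOUR_CORNER
    return score + wubi_score(a.wubi, b.wubi)


def is_similar(a, b, threshold=0.5):
    return meets_preconditions(a, b) and similarity(a, b) >= threshold
```

As a worked example with made-up feature values: identical pinyin (0.3), four matching four-corner digits (4*0.2*0.25 = 0.2), and three of four matching Wubi letters (3*1/4*0.45 = 0.3375) give a total of 0.8375, which passes the 0.5 threshold.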
For the second case, because the structure has changed, the four-corner code and the Wubi code are of limited use. In fact, it can be seen that 值 ("value") is composed of 亻 ("Ren") and 直 ("straight"), so 直 happens to be a subset of 值. This case can be handled entirely with stroke order. The detailed algorithm is as follows:
First, the subset or superset characters related to the target character are selected, with a stroke count difference of no more than 5;
Next, whether the pinyin is similar is judged (that is, the pinyin is identical, where tones may differ, or the pinyin of the two characters differs only in front versus back nasal finals or in flat-tongue versus retroflex initials). If it is similar, the character is directly judged to be a similar character; if it is not, processing continues;
Then, characters are selected whose four-corner codes have at least 4 identical digits and whose Wubi codes have at least N identical letters (N = 2 if both Wubi codes have 3 letters; N = 3 if both have 4 letters; N = 2 if one has 3 letters and the other has 4).
Finally, characters that already appeared in the first case are excluded (for example, "bar" and "stalk" have the same structure, and the stroke order of "bar" is a subset of that of "stalk"); the remainder are the similar characters of the second case.
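The second case can be sketched in the same spirit. The sketch assumes that stroke order is stored as a string with one symbol per stroke (here the common digit coding 1 to 5 for stroke types), that "subset" means one stroke sequence occurs contiguously inside the other, and that codes are compared position by position. The function names, the dictionary layout, and the feature values in the usage note are illustrative.

```python
def same_positions(a, b):
    return sum(x == y for x, y in zip(a, b))


def _flat(p):   # zh/ch/sh -> z/c/s
    return p[0] + p[2:] if p[:2] in ("zh", "ch", "sh") else p


def _front(p):  # back nasal -> front nasal
    return p[:-1] if p.endswith("ng") else p


def pinyin_similar(a, b):
    return a == b or _flat(a) == _flat(b) or _front(a) == _front(b)


def is_stroke_subset(order_a, order_b):
    """True if the shorter stroke sequence appears contiguously in the longer one (assumed meaning)."""
    short, long_ = sorted((order_a, order_b), key=len)
    return short != long_ and short in long_


def required_wubi_matches(wubi_a, wubi_b):
    # Text: 3 vs 3 -> 2, 4 vs 4 -> 3, 3 vs 4 -> 2.
    return 3 if len(wubi_a) == len(wubi_b) == 4 else 2


def is_similar_case_two(a, b, case_one_pairs=frozenset()):
    """a and b are dicts with keys: char, pinyin, order, four_corner, wubi."""
    if frozenset((a["char"], b["char"])) in case_one_pairs:
        return False  # already handled by the first case
    if not is_stroke_subset(a["order"], b["order"]):
        return False
    if abs(len(a["order"]) - len(b["order"])) > 5:
        return False
    if pinyin_similar(a["pinyin"], b["pinyin"]):
        return True
    return (same_positions(a["four_corner"], b["four_corner"]) >= 4
            and same_positions(a["wubi"], b["wubi"])
            >= required_wubi_matches(a["wubi"], b["wubi"]))
```

With stroke strings such as "12251111" for 直 and "3212251111" for 值, the first is contained in the second and the pinyin is identical, so the pair would be accepted at the pinyin step.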
An erroneous word here means a word formed with a similar character. Take the word 植物 ("plant"): one similar character of 植 is 值, so replacing 植 with 值 yields 值物 ("value thing"), which is such an erroneous word. In practice, these words potentially contain wrong characters.
The main flow for generating erroneous words is as follows:
Basic data preparation: all words are extracted from the correct word dictionary, and each word is split into individual characters;
Potential erroneous word generation: for each individual character in a word, its similar characters are found, and each similar character is substituted for the original character to generate a potential erroneous word;
Erroneous word generation: a word produced by substitution may itself be a correct word. For example, one similar character of "far or indistinct" is "vast", and one word containing "far or indistinct" is "dimly discernible"; the generated erroneous word is "ethereal", but in practice "ethereal" is also a correct word. Such cases therefore need to be filtered out.
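The generation flow above can be sketched as follows. The function name and the data layout (a list of correct words and a mapping from each character to its similar characters) are assumptions made for illustration.

```python
def generate_erroneous_words(correct_words, similar_chars):
    """Map each generated erroneous word to the set of correct words it was derived from."""
    correct = set(correct_words)
    erroneous = {}
    for word in correct:
        for i, ch in enumerate(word):                 # split the word into single characters
            for sim in similar_chars.get(ch, ()):
                candidate = word[:i] + sim + word[i + 1:]
                if candidate in correct:              # the substitution produced a real word
                    continue
                erroneous.setdefault(candidate, set()).add(word)
    return erroneous
```

For instance, with the correct words 植物, 作法, 做法, and 作业 and the similar-character pairs 植→值 and 作→做, the sketch yields 值物 and 做业 but drops 做法, because 做法 is itself a correct word.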
After the erroneous words corresponding to the correct words have been formed, the correct words, their corresponding erroneous words, and the manually set error types together make up the wrong word dictionary.
S203: carry out data acquisition on the target website to obtain website data;
After the wrong word dictionary has been built, data is collected from the target website to obtain website data, where the target website is the website on which wrong word monitoring is to be performed.
S204: pre-process, parse, and denoise the acquired website data to obtain the text content;
After the data of the target website has been obtained, it is pre-processed, and the web pages are parsed and denoised to obtain the text content. For the original web page content, the tag content of the page is removed first. This step is done mainly with regular expressions: the HTML tags are matched with regular expressions and the matched tags are deleted, so that what remains is the text content of the web page.
S205: perform word segmentation on the text content to obtain individual words;
The segmentation algorithm operates on the web page text content and cuts the Chinese character sequence into individual words so that the computer can recognize them automatically. The main steps of the algorithm are:
Atomic segmentation: to segment a sentence, atomic segmentation is performed first, cutting the whole sentence into all possible atomic units.
Full-segmentation word graph generation: the sentence is fully segmented according to the segmentation dictionary, and a corresponding adjacency list is generated to represent the word graph.
Optimal segmentation path computation: on the basis of this word graph, the optimal segmentation path is generated with a dynamic programming algorithm.
Unknown word recognition: a hidden Markov model is used to recognize out-of-vocabulary nouns such as person names, place names, and organization names.
The optimal segmentation path is then recomputed with the dynamic programming algorithm to obtain the final segmentation result.
S206: build an AC automaton dictionary tree based on the wrong word dictionary, and generate a cache;
A dictionary tree (trie) is a data structure that trades space for time. Insertion and lookup take O(len) time, where len is the length of the key, regardless of how many entries the tree holds. Dictionary trees are widely used in word frequency statistics and input statistics. Their main weakness is multi-pattern matching, which the Aho-Corasick (AC) automaton handles efficiently. To combine the advantages of both, the dictionary tree is used here as the search data structure of the AC automaton.
The wrong word dictionary is persisted offline as a dictionary tree cache file. At service startup, this cache file only needs to be loaded into memory to enable efficient retrieval of the wrong word dictionary, which greatly improves overall system performance.
S207: build a context analysis model;
The text content is represented in a vector space. The other words surrounding the target word (the context), together with the order in which the words appear in the text, serve as input; the target word is output after the mapping layer applies weighting, with the objective of maximizing the output probability of the target wrong word.
Here, context denotes the contextual information.
The text content first needs to be reduced to K-tuples. Each input word is mapped to a vector; this mapping is denoted C, so C(w_{t-1}) is the word vector of w_{t-1}. The n (n > 1) words before and after position t are used to predict the t-th word. Assuming the wrong word mapping is denoted E, E(w_{t-1}) is the word vector of the wrong word.
$$P(C(w_t)) = P(E(w_t)) = P(w_t \mid w_{t-k}, w_{t-k+1}, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_{t+k})$$
A window of appropriate size is taken as the context. When the word w_t is predicted, its corresponding node necessarily has a binary code, such as "010011". The goal is to predict each bit of this word's binary code so that the probability of the predicted word's binary code is maximized.
S208: perform wrong word recognition according to the AC dictionary tree cache and the context analysis model, and output the wrong word recognition result;
On the basis of the segmentation result, the context analysis model searches the AC dictionary tree cache and computes the potential erroneous word with the highest probability. The corrected word is given, fuzzy matching is performed on the difference set between the correct word and the erroneous word to make the correction decision, and other attributes of the erroneous word, such as the error type, are provided. Finally, an erroneous-word entity class is generated and returned. The entity class contains the erroneous word, the correct word, the error type, and the context of the erroneous word in the text.
S209: input the wrong word recognition result into the wrong word review and annotation platform for manual annotation;
Of the wrong word recognition records produced by daily monitoring, all records not yet in the annotation platform are appended to it. After logging in to the annotation platform, a user can fetch wrong word recognition records and manually mark each one as right or wrong. Each record is assigned to at most N users (N is configurable), and each user judges the records assigned to them independently: based on the context provided, the user clicks the "correct" button if they consider the recognition correct, or the "wrong" button if they consider it mistaken. Because the platform assigns each record to multiple users, it pools the judgment of several people and to some extent reduces the risk of misjudgment caused by a single person's error.
S210: feed the annotation results obtained from manual annotation back to update the wrong word dictionary and the core word bank;
Every night the annotation platform starts a script on a schedule. The script first computes the final result of each wrong word record from all annotation results for that record. For example, for a given wrong word recognition record, if more users consider it correct than consider it wrong, the final result of the record is "correct"; conversely, if more users consider it wrong than consider it correct, the final result is "wrong". If the two counts are equal, the result is provisionally "undetermined", and the record continues to be assigned to other users for annotation until a definite result is obtained.
After the final results of the annotated wrong word records have been computed, the program automatically selects the correctly recognized records and updates them into the core word bank; it also selects the incorrectly recognized records and deletes them from the wrong word dictionary, ensuring that such words will not be recognized again.
In this way, on top of comprehensive machine recognition of wrong words, manual annotation is added, and the results of manual judgment are automatically fed back into the wrong word dictionary and the core word bank, making wrong word recognition more comprehensive and accurate.
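The nightly script's vote counting and feedback step might look like the following sketch. The label strings, function names, and data layout (a mapping from erroneous word to its list of votes, a dictionary for the wrong word dictionary, and a set for the core word bank) are assumptions made for illustration.

```python
from collections import Counter


def final_result(votes):
    """Majority decision over the annotators' votes for one record."""
    counts = Counter(votes)
    if counts["correct"] > counts["wrong"]:
        return "correct"
    if counts["wrong"] > counts["correct"]:
        return "wrong"
    return "undetermined"


def apply_feedback(annotations, wrong_word_dict, core_word_bank):
    """Update both stores in place and return the records that need more annotators."""
    pending = []
    for erroneous, votes in annotations.items():
        result = final_result(votes)
        if result == "correct":
            core_word_bank.add(erroneous)         # confirmed recognition joins the core word bank
        elif result == "wrong":
            wrong_word_dict.pop(erroneous, None)  # misrecognition is removed from the dictionary
        else:
            pending.append(erroneous)             # tie: keep assigning to other users
    return pending
```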
S211: of the wrong word recognition results obtained, display on the system page those results whose words are contained in the core word bank.
Because the wrong word dictionary is constructed algorithmically according to certain rules, it is larger and more comprehensive than a manually constructed one. However, among the large number of wrong word records produced by daily monitoring, some inaccurate recognitions are inevitable. If all recognition records were displayed on the page without screening, this would inevitably cause some confusion for clients. Therefore, a core word bank is established through the annotation platform: accurately recognized words are placed in the core word bank, which is continuously accumulated and expanded, and only the wrong word recognition records present in the core word bank are displayed on the page, ensuring that the wrong word recognition shown on the system page is highly accurate.
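The display filter amounts to a membership test against the core word bank, as in this short sketch (the record layout and function name are hypothetical):

```python
def records_to_display(records, core_word_bank):
    """Keep only the recognition records whose erroneous word is in the core word bank."""
    return [record for record in records if record["erroneous"] in core_word_bank]
```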
In summary, in the above embodiment, similar characters are constructed from the basic information of Chinese characters, and the wrong word dictionary is generated. The AC dictionary tree cache is generated from the wrong word dictionary and combined with the context analysis model to monitor government websites automatically and continuously every day, so that wrong words on a website can be located more comprehensively and efficiently, assigned an error level (such as serious wrong word or ordinary wrong word), and accompanied by a suggested correction. In addition, the monitoring method incorporates manual annotation, which compensates well for the shortcomings of machine recognition, refines the core word bank, and promotes the migration of entries in the wrong word dictionary from the periphery to the core, so that wrong words on websites are monitored more accurately.
As shown in Fig. 3, which is a structural diagram of Embodiment 1 of the wrong word monitoring system disclosed by the present invention, the system may include:
A first building module 301, configured to build a wrong word dictionary;
When wrong words on a website need to be monitored, basic data preparation is carried out first to build the wrong word dictionary. This mainly involves collecting the basic information of Chinese characters, together with everyday vocabulary and professional-domain vocabulary, and then constructing the wrong word dictionary from the basic information of the Chinese characters.
An acquisition module 302, configured to carry out data acquisition on the target website to obtain website data;
After the wrong word dictionary has been built, data is collected from the target website to obtain website data, where the target website is the website on which wrong word monitoring is to be performed.
A processing module 303, configured to pre-process, parse, and denoise the acquired website data to obtain the text content;
After the data of the target website has been obtained, it is pre-processed, and the web pages are parsed and denoised to obtain the text content. For the original web page content, the tag content of the page is removed first. This step is done mainly with regular expressions: the HTML tags are matched with regular expressions and the matched tags are deleted, so that what remains is the text content of the web page.
A word segmentation module 304, configured to perform word segmentation on the text content to obtain individual words;
The segmentation algorithm operates on the web page text content and cuts the Chinese character sequence into individual words so that the computer can recognize them automatically. The main steps of the algorithm are:
Atomic segmentation: to segment a sentence, atomic segmentation is performed first, cutting the whole sentence into all possible atomic units.
Full-segmentation word graph generation: the sentence is fully segmented according to the segmentation dictionary, and a corresponding adjacency list is generated to represent the word graph.
Optimal segmentation path computation: on the basis of this word graph, the optimal segmentation path is generated with a dynamic programming algorithm.
Unknown word recognition: a hidden Markov model is used to recognize out-of-vocabulary nouns such as person names, place names, and organization names.
The optimal segmentation path is then recomputed with the dynamic programming algorithm to obtain the final segmentation result.
A second building module 305, configured to build an AC automaton dictionary tree based on the wrong word dictionary and generate a cache;
A dictionary tree (trie) is a data structure that trades space for time. Insertion and lookup take O(len) time, where len is the length of the key, regardless of how many entries the tree holds. Dictionary trees are widely used in word frequency statistics and input statistics. Their main weakness is multi-pattern matching, which the Aho-Corasick (AC) automaton handles efficiently. To combine the advantages of both, the dictionary tree is used here as the search data structure of the AC automaton.
The wrong word dictionary is persisted offline as a dictionary tree cache file. At service startup, this cache file only needs to be loaded into memory to enable efficient retrieval of the wrong word dictionary, which greatly improves overall system performance.
A third building module 306, configured to build a context analysis model;
The text content is represented in a vector space. The other words surrounding the target word (the context), together with the order in which the words appear in the text, serve as input; the target word is output after the mapping layer applies weighting, with the objective of maximizing the output probability of the target wrong word.
Here, context denotes the contextual information.
The text content first needs to be reduced to K-tuples. Each input word is mapped to a vector; this mapping is denoted C, so C(w_{t-1}) is the word vector of w_{t-1}. The n (n > 1) words before and after position t are used to predict the t-th word. Assuming the wrong word mapping is denoted E, E(w_{t-1}) is the word vector of the wrong word.
$$P(C(w_t)) = P(E(w_t)) = P(w_t \mid w_{t-k}, w_{t-k+1}, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_{t+k})$$
A window of appropriate size is taken as the context. When the word w_t is predicted, its corresponding node necessarily has a binary code, such as "010011". The goal is to predict each bit of this word's binary code so that the probability of the predicted word's binary code is maximized.
An output module 307, configured to perform wrong word recognition according to the AC dictionary tree cache and the context analysis model, and to output the wrong word recognition result.
On the basis of the segmentation result, the context analysis model searches the AC dictionary tree cache and computes the potential erroneous word with the highest probability. The corrected word is given, fuzzy matching is performed on the difference set between the correct word and the erroneous word to make the correction decision, and other attributes of the erroneous word, such as the error type, are provided. Finally, an erroneous-word entity class is generated and returned. The entity class contains the erroneous word, the correct word, the error type, and the context of the erroneous word in the text.
In summary, in the above embodiment, when wrong words on a website need to be monitored, a wrong word dictionary is built first; data is collected from the target website to obtain website data; the acquired website data is pre-processed, parsed, and denoised to obtain the text content; the text content is segmented into individual words; an AC automaton dictionary tree is built from the wrong word dictionary and a cache is generated; a context analysis model is built; and finally wrong word recognition is performed according to the AC dictionary tree cache and the context analysis model, and the wrong word recognition result is output. Because the AC dictionary tree cache is generated from the wrong word dictionary and combined with the context analysis model to monitor government websites automatically and continuously every day, wrong words on a website can be located more comprehensively and efficiently and assigned an error level, which effectively improves the accuracy of wrong word monitoring.
As shown in Fig. 4, which is a structural diagram of Embodiment 2 of the wrong word monitoring system disclosed by the present invention, the system may include:
An acquiring unit 401, configured to obtain the basic information of Chinese characters;
9,891 Chinese characters are collected and organized, and a web crawler is used to obtain their basic information, including pinyin, initial, final, consonant, tone, character structure, Wubi code, four-corner code, stroke order, radical, and stroke count. For polyphonic characters, all pinyin readings are retained. In addition, commonly used Chinese words and phrases are collected and organized from comprehensive Chinese dictionaries, segmentation dictionaries, and similar sources, and national titles, government-affairs vocabulary, medical-industry vocabulary, and the like are specifically collected and organized by hand, thereby forming the correct word dictionary;
A generation unit 402, configured to construct, based on the basic information of the Chinese characters, similar characters for each character with a similar-character algorithm, to generate the erroneous words corresponding to each word from the similar characters, and to obtain the wrong word dictionary;
The similar-character algorithm covers two cases: similar characters with the same character structure, and similar characters with different character structures. For example, the similar characters 值 ("value") and 植 ("plant") both have a left-right structure, so their structure is the same; for 值 ("value") and 直 ("straight"), the former has a left-right structure and the latter a top-bottom structure. Different algorithms are used for the two cases.
For the first case, in which the structures are the same, the similar-character algorithm proceeds mainly as follows:
First, the precondition for two characters to be similar is that their character structures are identical, their stroke counts differ by no more than 3, and at least 2 digits of their four-corner codes are identical;
Then, the remaining features (pinyin, four-corner code, and Wubi code) are assigned weights of 30%, 25%, and 45%, respectively. The detailed rules are as follows:
If the two characters have the same pinyin (tones may differ), or their pinyin differ only as front and back nasal finals (such as yin and ying), or only as flat-tongue and retroflex initials (such as sa and sha), the pinyin score is 0.3; otherwise it is 0;
For the four-corner code, if N digits are identical, the score is N*0.2*0.25;
For the Wubi code, when a 3-letter code is matched against a 3-letter code and M letters are identical, the score is M*1/3*0.45; when a 4-letter code is matched against a 4-letter code and N letters are identical, the score is N*1/4*0.45; when a 3-letter code is matched against a 4-letter code, the score is 2*2/3*0.45 if 2 consecutive letters are identical, and 0.45 if 3 letters are identical.
Finally, the scores obtained from the pinyin, the four-corner code, and the Wubi code are summed, and only characters whose similarity is greater than or equal to 0.5 are kept as similar characters;
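The scoring rules above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the `CharInfo` record, the function names, and the sample code values are hypothetical, "identical digits" is assumed to mean identical at the same position, and the mixed 3-letter versus 4-letter Wubi case is left out of the sketch.

```python
from dataclasses import dataclass

# Weights stated in the text: pinyin 30%, four-corner code 25%, Wubi code 45%.
W_PINYIN, W_FOUR_CORNER, W_WUBI = 0.30, 0.25, 0.45


@dataclass(frozen=True)
class CharInfo:
    """Hypothetical record holding the basic information of one character."""
    pinyin: str        # toneless pinyin, e.g. "zhi"
    four_corner: str   # 5-digit four-corner code
    wubi: str          # Wubi code, 3 or 4 letters in this sketch
    strokes: int       # stroke count
    structure: str     # e.g. "left-right", "up-down"


def _fold_nasal(p):
    # Treat back nasal finals (-ng) like front nasal finals (-n).
    return p[:-1] if p.endswith("ng") else p


def _fold_retroflex(p):
    # Treat retroflex initials zh/ch/sh like flat-tongue initials z/c/s.
    for retro in ("zh", "ch", "sh"):
        if p.startswith(retro):
            return retro[0] + p[2:]
    return p


def pinyin_similar(a, b):
    return (a == b
            or _fold_nasal(a) == _fold_nasal(b)
            or _fold_retroflex(a) == _fold_retroflex(b))


def same_position_matches(x, y):
    # Assumption: codes are compared position by position.
    return sum(1 for p, q in zip(x, y) if p == q)


def similarity(a, b):
    fc = same_position_matches(a.four_corner, b.four_corner)
    # Precondition: same structure, stroke difference <= 3, >= 2 equal digits.
    if a.structure != b.structure or abs(a.strokes - b.strokes) > 3 or fc < 2:
        return 0.0
    score = W_PINYIN if pinyin_similar(a.pinyin, b.pinyin) else 0.0
    score += fc * 0.2 * W_FOUR_CORNER
    # Only equal-length Wubi codes are scored here (3 vs 3 or 4 vs 4).
    if len(a.wubi) == len(b.wubi) and len(a.wubi) in (3, 4):
        score += same_position_matches(a.wubi, b.wubi) / len(a.wubi) * W_WUBI
    return score


def is_similar(a, b):
    return similarity(a, b) >= 0.5
```

With illustrative (unverified) code values, two left-right characters that share a pinyin, three four-corner digits, and three of four Wubi letters score 0.3 + 0.15 + 0.3375 and pass the 0.5 threshold.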
For the second case, because the structure changes, the four-corner code and the Wubi code are of limited use. In fact, it can be seen that "value" (值) is composed of the radical "Ren" (亻) and "straight" (直), so "straight" happens to be a subset of "value". This case can be handled entirely with the stroke order. The detailed algorithm is as follows:
First, the characters that are subsets or supersets of the target character are screened out, with a stroke count difference of no more than 5;
Whether the pinyin is similar is then judged (that is, the pinyin is the same, with tones allowed to differ, or the pinyin of the two characters differ as front and back nasal finals or as flat-tongue and retroflex initials). If the pinyin is similar, the character is directly judged to be a similar character; if not, processing continues;
Characters are then screened out whose four-corner code has at least 4 identical digits and whose Wubi code has at least N identical positions (if the Wubi codes of both characters have 3 letters, N=2; if both have 4 letters, N=3; if one has 3 letters and the other has 4 letters, N=2).
Characters already covered by the first case are excluded (for example, "bar" and "stalk" have the same structure, and the stroke order of "bar" is a subset of that of "stalk"), and the remaining characters are the similar characters of the second case.
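The second case can be sketched as below. Several points are assumptions made only for illustration: stroke orders are encoded as strings of stroke-class digits, "subset" is read as an order-preserving subsequence, code positions are compared one by one, and the function names are hypothetical. Excluding pairs already covered by the first case is left to the caller.

```python
def is_subsequence(short, long):
    """True if every stroke of `short` appears in `long` in the same order."""
    remaining = iter(long)
    return all(stroke in remaining for stroke in short)


def required_wubi_matches(len_a, len_b):
    # N from the text: 3 vs 3 -> 2, 4 vs 4 -> 3, 3 vs 4 -> 2.
    return 3 if (len_a == 4 and len_b == 4) else 2


def same_position_matches(x, y):
    return sum(1 for p, q in zip(x, y) if p == q)


def cross_structure_similar(strokes_a, strokes_b, pinyin_alike,
                            four_corner_a, four_corner_b, wubi_a, wubi_b):
    short, long = sorted((strokes_a, strokes_b), key=len)
    # Step 1: one stroke order must contain the other, stroke difference <= 5.
    if len(long) - len(short) > 5 or not is_subsequence(short, long):
        return False
    # Step 2: similar pinyin is enough on its own.
    if pinyin_alike:
        return True
    # Step 3: otherwise fall back to the four-corner and Wubi thresholds.
    needed = required_wubi_matches(len(wubi_a), len(wubi_b))
    return (same_position_matches(four_corner_a, four_corner_b) >= 4
            and same_position_matches(wubi_a, wubi_b) >= needed)
```

For instance, if "straight" is encoded as the stroke string "12251111" and "value" as "3212251111" (illustrative encodings), the first is a subsequence of the second and the stroke difference is 2, so the pair is accepted as soon as the pinyin is judged similar.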
Here, an erroneous word (wrong other word) is a word formed by substituting a similar character. Take the word "plant" (植物) as an example: one similar character of "plant" (植) is "value" (值), so replacing 植 with 值 yields "value thing" (值物), which is a so-called erroneous word. In practice, such words are potential wrong words.
The main flow for generating erroneous words is as follows:
Basic data preparation: all words are extracted from the correct word dictionary, and each word is split into individual characters;
Potential erroneous word generation: for each character in a word, its similar characters are looked up and substituted for the original character, generating potential erroneous words;
Erroneous word generation: an erroneous word produced by substitution may itself be a correct word. For example, one similar character of "far or indistinct" is "vast", and one word containing "far or indistinct" is "dimly discernible"; the erroneous word generated from it is "ethereal", but in practice "ethereal" is also a correct word. Such cases therefore need to be filtered out.
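The generation flow above can be sketched as follows. The function name and the shape of the two inputs (a collection of correct words and a mapping from a character to its similar characters) are assumptions for illustration only.

```python
def generate_erroneous_words(correct_words, similar_chars):
    """Map each correct word to the erroneous words derived from it.

    correct_words: iterable of correct words (hypothetical input format).
    similar_chars: dict mapping a character to a set of similar characters.
    """
    correct = set(correct_words)
    result = {}
    for word in correct:
        candidates = set()
        # Split the word into single characters and substitute each in turn.
        for i, ch in enumerate(word):
            for substitute in similar_chars.get(ch, ()):
                candidate = word[:i] + substitute + word[i + 1:]
                # Filter: a substitution that yields another correct word
                # is not an erroneous word.
                if candidate not in correct:
                    candidates.add(candidate)
        if candidates:
            result[word] = candidates
    return result
```

The final membership check is the filtering step: any candidate that already exists in the correct word dictionary is dropped.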
After the erroneous words corresponding to each correct word are generated, the correct words, their corresponding erroneous words, and the manually assigned error types together make up the wrong word dictionary.
Acquisition module 403, configured to perform data acquisition on the target website to obtain website data;
After the wrong word dictionary is constructed, data acquisition is performed on the target website to obtain website data, where the target website is the website on which wrong word monitoring is to be carried out.
Processing module 404, configured to perform preprocessing, web page parsing, and denoising on the acquired website data to obtain the text content;
After the data of the target website is obtained, the acquired website data is preprocessed, parsed, and denoised to obtain the text content. For the raw web page text, the tag content in the page is removed first. This step is carried out mainly with regular expressions: HTML tags are matched with regular expressions and the matched tags are deleted, and the content that remains is the text content of the web page.
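A minimal sketch of this tag-removal step, using Python's standard `re` and `html` modules, is shown below. The patterns and the function name `extract_text` are illustrative assumptions; the source does not give the actual regular expressions, and removing script, style, and comment blocks before the tags is one plausible reading of "denoising".

```python
import html
import re

# Script and style blocks carry no readable text, so drop them whole.
_BLOCKS = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>",
                     re.IGNORECASE | re.DOTALL)
_COMMENTS = re.compile(r"<!--.*?-->", re.DOTALL)
_TAGS = re.compile(r"<[^>]+>")
_SPACES = re.compile(r"\s+")


def extract_text(page):
    """Delete HTML markup from `page` and return the remaining text."""
    page = _BLOCKS.sub("", page)
    page = _COMMENTS.sub("", page)
    page = _TAGS.sub("", page)      # matched tags are deleted outright
    page = html.unescape(page)      # turn entities such as &amp; into text
    return _SPACES.sub(" ", page).strip()
```

Tags are replaced with an empty string rather than a space so that Chinese words split by inline tags are rejoined before word segmentation.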
Word segmentation module 405, configured to perform word segmentation on the text content to obtain individual words;
The segmentation algorithm segments the web page text content, cutting the sequence of Chinese characters into individual words so that a computer can recognize them automatically. The main steps of the algorithm are as follows:
Atomic segmentation: a sentence is first cut into all possible atomic units.
Full-segmentation word graph generation: the sentence is fully segmented according to the segmentation dictionary, and a corresponding adjacency list is generated to represent the word graph.
Optimal segmentation path calculation: on the basis of this word graph, the optimal segmentation path is generated with a dynamic programming algorithm.
Unknown word recognition: a hidden Markov model is used to recognize out-of-vocabulary nouns such as person names, place names, and organization names.
The optimal segmentation path is then recalculated with the dynamic programming algorithm to obtain the final segmentation result.
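The word graph and dynamic programming steps can be sketched as follows. This is an assumption-laden illustration: the toy frequency dictionary, the function names, and the choice of maximum log-probability as the "optimal" criterion are not taken from the source, and the hidden Markov model step for unknown words is omitted.

```python
import math


def build_word_graph(sentence, freq):
    """Adjacency list: graph[i] lists every j where sentence[i:j] is a word."""
    graph = {}
    for i in range(len(sentence)):
        ends = [j for j in range(i + 1, len(sentence) + 1)
                if sentence[i:j] in freq]
        # Fall back to the single character (atomic unit) if nothing matches.
        graph[i] = ends or [i + 1]
    return graph


def best_segmentation(sentence, freq):
    """Pick the path through the word graph with the highest log-probability."""
    total = sum(freq.values())
    graph = build_word_graph(sentence, freq)
    n = len(sentence)
    best = {n: (0.0, n)}  # position -> (best score to the end, next position)
    for i in range(n - 1, -1, -1):
        best[i] = max(
            (math.log(freq.get(sentence[i:j], 1) / total) + best[j][0], j)
            for j in graph[i]
        )
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words
```

With a toy dictionary in which "研究" and "生命" are more frequent than "研究生", the sentence "研究生命" is segmented as "研究" followed by "生命", because that path has the highest product of word probabilities.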
Second construction module 406, configured to build an AC automaton (Aho-Corasick) dictionary tree based on the wrong word dictionary and to generate a cache;
A dictionary tree (trie) is a data structure that trades space for time. Insertion and lookup each take O(len) time, where len is the length of the word, regardless of how many words the tree holds, and the structure is widely used in word frequency statistics and input statistics. Its main shortcoming is multi-pattern matching, which an AC automaton handles efficiently. To combine the advantages of both, the dictionary tree is used here as the search data structure of the AC automaton.
The wrong word dictionary is persisted offline as a dictionary tree cache file. When the service starts, the cache file only needs to be loaded into memory to enable efficient retrieval of the wrong word dictionary, which greatly improves the overall performance of the system.
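As an illustration of how a dictionary tree can serve as the search structure of an AC automaton, the following is a compact Aho-Corasick sketch. The class name and its list-based node layout are assumptions for this example and say nothing about how the described system stores its cache.

```python
from collections import deque


class AhoCorasick:
    """Trie with failure links for matching many patterns in one pass."""

    def __init__(self, patterns):
        self.goto = [{}]   # goto[node][char] -> child node
        self.fail = [0]    # failure link of each node
        self.out = [[]]    # patterns that end at each node
        for pattern in patterns:
            self._insert(pattern)
        self._build_failure_links()

    def _insert(self, word):
        node = 0
        for ch in word:
            nxt = self.goto[node].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = nxt
            node = nxt
        self.out[node].append(word)

    def _build_failure_links(self):
        # Breadth-first, so a node's failure target is finished before it.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                self.out[child] += self.out[self.fail[child]]
                queue.append(child)

    def search(self, text):
        """Return (start_index, pattern) for every match in `text`."""
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for word in self.out[node]:
                hits.append((i - len(word) + 1, word))
        return hits
```

An instance built from the erroneous words of the wrong word dictionary scans a page once and reports every erroneous word it contains. Such an object could be serialized to a file and reloaded at service start, for example with Python's `pickle` module, although the source does not state which cache format is used.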
Third construction module 407, configured to build a context analysis model;
The text content is represented in a vector space. The other words surrounding the target word (the context), together with the order in which the words appear in the text, serve as the input; the target word is output after weighting in the mapping layer, with the objective of maximizing the output probability of the target wrong word.
Here, context denotes the contextual information.
First, the text content needs to be processed and reduced to K-tuples, and each input word is mapped to a vector. This mapping is denoted C, so C(w_{t-1}) is the word vector of w_{t-1}, and the n (n > 1) words before and after the t-th word are used to predict it. Assuming the wrong word mapping is denoted E, E(w_{t-1}) is the word vector of the wrong word.
P(C(w_t)) = P(E(w_t)) = P(w_t | w_{t-k}, w_{t-k+1}, ..., w_{t-1}, w_{t+1}, ..., w_{t+k})
An appropriately sized window is taken as the context. When the word w_t is predicted, its corresponding node necessarily has a binary code, such as "010011"; the goal is to predict each bit of this word's binary code so that the probability of the predicted word's binary code is maximized.
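One common way to realize "predicting each bit of the binary code" is a hierarchical softmax over a binary tree, as in continuous bag-of-words models. The sketch below follows that reading and is an assumption, not the model specified in the source: the weighting scheme, the node vectors, and the convention that bit "0" corresponds to the sigmoid output are all hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def context_vector(vectors, weights):
    """Weighted average of the context word vectors (the mapping layer)."""
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[d] for v, w in zip(vectors, weights)) / total
            for d in range(dim)]


def code_probability(x, code, node_vectors):
    """Probability of a word's binary code given context vector x.

    Each bit is one binary decision at an inner tree node; the word
    probability is the product of the per-bit probabilities.
    """
    p = 1.0
    for bit, theta in zip(code, node_vectors):
        s = sigmoid(sum(a * b for a, b in zip(x, theta)))
        p *= s if bit == "0" else 1.0 - s
    return p
```

Because each inner node splits its probability between bit "0" and bit "1", the probabilities of all codes in the tree sum to 1, and training would adjust the vectors so that the code of the observed target word receives the highest probability.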
Output module 408, configured to perform wrong word recognition according to the AC dictionary tree cache and the context analysis model, and to output the wrong word recognition result;
On the basis of the segmentation result, the context analysis model searches the AC dictionary tree cache and computes the potential erroneous word with the highest probability. The corrected word is provided, fuzzy matching is performed on the difference between the correct word and the erroneous word to make the correction judgment, and other attributes of the erroneous word, such as its error type, are supplied. Finally, an erroneous-word entity object is generated and returned. The entity class contains the erroneous word, the correct word, the error type, and the context of the erroneous word in the text.
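The returned entity can be pictured as a small record type. In the sketch below, the class name `ErrorRecord`, its field names, the `window` parameter, and the dictionary layout are assumptions; a plain substring search stands in for the AC dictionary tree lookup, and the context model and fuzzy matching are not modeled.

```python
from dataclasses import dataclass


@dataclass
class ErrorRecord:
    erroneous_word: str
    correct_word: str
    error_type: str
    context: str  # text surrounding the erroneous word


def find_errors(text, wrong_dictionary, window=5):
    """Build one record per occurrence of a known erroneous word.

    wrong_dictionary: dict mapping an erroneous word to a
    (correct_word, error_type) pair (hypothetical layout).
    """
    records = []
    for wrong, (correct, error_type) in wrong_dictionary.items():
        start = text.find(wrong)
        while start != -1:
            lo = max(0, start - window)
            hi = start + len(wrong) + window
            records.append(ErrorRecord(wrong, correct, error_type, text[lo:hi]))
            start = text.find(wrong, start + 1)
    return records
```

Each record carries the four fields listed above, so a reviewer sees the erroneous word together with its suggested correction and a short excerpt of the surrounding text.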
Labeling module 409, configured to input the wrong word recognition results into a wrong word review and labeling platform for manual labeling;
Of the wrong word identification records produced by daily monitoring, all records not yet in the labeling platform are appended to it. After logging in to the platform, a user can obtain wrong word identification records and manually label each as correct or wrong. Each record is assigned to at most N users (N is configurable), and each user judges the records assigned to them independently: based on the context provided, the user clicks the "correct" button if they consider the identification correct, and the "wrong" button if they consider it mistaken. Because the platform assigns every record to multiple users, it pools the judgment of several people and reduces, to some extent, the risk of misjudgment caused by a single person's error.
Feedback update module 410, configured to feed the labeling results obtained from manual labeling back into the wrong word dictionary and the core word bank;
The labeling platform starts a shell script on a schedule every night. The script first computes the final result of each wrong word record from all of its labeling results. For example, for a given wrong word identification record, if more users consider it correct than consider it wrong, the final result of the record is correct; conversely, if more users consider it wrong than consider it correct, the final result is wrong. If the two counts are equal, the result is provisionally marked as pending, and the record continues to be assigned to other users for labeling until a definite result is obtained.
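The majority rule described above fits in a few lines. The function name and the representation of a vote as a boolean (True for "correct", False for "wrong") are assumptions for illustration.

```python
def final_result(votes):
    """Aggregate user votes on one wrong word identification record."""
    correct = sum(1 for vote in votes if vote)
    wrong = len(votes) - correct
    if correct > wrong:
        return "correct"
    if wrong > correct:
        return "wrong"
    return "pending"  # tie: keep assigning the record to more users
```

A record with no votes yet also comes out as "pending" under this sketch, which matches the idea that it still awaits labeling.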
After the final results of the labeled wrong word records are computed, the program automatically selects the records that were identified correctly and updates them into the core word bank. It also selects the records that were identified wrongly and deletes them from the wrong word dictionary, ensuring that such words will not be identified again.
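The feedback step can be sketched as follows, assuming (for illustration only) that the wrong word dictionary is a dict keyed by erroneous word, that the core word bank is a set, and that final results use the labels "correct", "wrong", and "pending".

```python
def apply_feedback(final_results, wrong_dictionary, core_word_bank):
    """Update both stores in place from the final labeling results."""
    for erroneous_word, result in final_results.items():
        if result == "correct":
            # Confirmed identification: promote it to the core word bank.
            core_word_bank.add(erroneous_word)
        elif result == "wrong":
            # Rejected identification: remove it so it is not flagged again.
            wrong_dictionary.pop(erroneous_word, None)
        # "pending" records are left untouched until a definite result exists.
```

Running this after the nightly aggregation keeps the core word bank limited to confirmed entries while pruning false positives from the wrong word dictionary.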
In this way, comprehensive machine identification of wrong words is supplemented by manual labeling, and the results of manual judgment are automatically fed back into the wrong word dictionary and the core word bank, making wrong word identification more comprehensive and accurate.
Display module 411, configured to display on the system page those obtained wrong word recognition results whose words are contained in the core word bank.
Because the wrong word dictionary is constructed by an algorithm according to fixed rules, it is larger and more comprehensive than a manually built one. However, the large number of wrong word records obtained from daily monitoring inevitably includes some inaccurate identifications. If all identification records were displayed on the page directly without screening, this would inevitably cause some confusion for customers. Therefore, a core word bank is established through the labeling platform: accurately identified words are put into the core word bank, which accumulates and grows over time, and only the wrong word identification records present in the core word bank are displayed on the page, ensuring that the identifications shown on the system page are highly accurate.
In summary, in the above embodiments, similar characters are constructed from the basic information of Chinese characters, and the wrong word dictionary is generated from them. The AC dictionary tree cache is generated from the wrong word dictionary and, combined with the context analysis model, is used to monitor government websites automatically, daily, and continuously. Wrong words on a website can thus be located more comprehensively and efficiently, an error level (such as serious wrong word or ordinary wrong word) is given, and a suggested correction is provided. Meanwhile, the monitoring method incorporates manual labeling, which compensates well for the shortcomings of machine recognition, refines the core word bank, and promotes the migration of entries in the wrong word dictionary from the periphery to the core, so that wrong words on websites are monitored more accurately.
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for related details, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To clearly illustrate the interchangeability of hardware and software, the composition and steps of each example have been described above generally in terms of function. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Those skilled in the art may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present invention.
The steps of the method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The foregoing description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not intended to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.