CN107679036A - A kind of wrong word monitoring method and system - Google Patents

A kind of wrong word monitoring method and system Download PDF

Info

Publication number
CN107679036A
Authority
CN
China
Prior art keywords
word
wrong
dictionary
wrong word
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710946362.3A
Other languages
Chinese (zh)
Inventor
周金娟
王治平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Net Number Mdt Infotech Ltd
Original Assignee
Nanjing Net Number Mdt Infotech Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Net Number Mdt Infotech Ltd
Priority to CN201710946362.3A
Publication of CN107679036A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing

Abstract

The invention provides a wrong word monitoring method comprising: building a wrong word dictionary; collecting data from a target website to obtain website data; preprocessing, parsing and denoising the collected website data to obtain text content; performing word segmentation on the text content to obtain individual words; building an Aho-Corasick (AC) automaton dictionary tree based on the wrong word dictionary, and generating a cache; building a context analysis model; and performing wrong word identification according to the AC dictionary tree cache and the context analysis model, and outputting the wrong word recognition results. The invention can effectively improve the accuracy of wrong word monitoring.

Description

A kind of wrong word monitoring method and system
Technical field
The present invention relates to the technical field of wrong word identification, and in particular to a wrong word monitoring method and system.
Background
At present, some government websites have serious wrong words appearing repeatedly on the same page or across multiple pages, which has drawn public and media attention and seriously damaged the government's image. In response, the first national census of government websites listed serious wrong words as an inspection item under the "serious error" indicator.
A wrong word monitoring method typically consists of a wrong word dictionary, a word segmentation technique, and a wrong word identification model.
Word segmentation is the prerequisite and key to wrong word identification. Many segmentation methods exist in the prior art; string-based methods are the most common because they are relatively simple. String-based segmentation methods broadly include forward maximum matching and reverse maximum matching. For example, one string-based method applies forward or reverse maximum matching to mechanically segment the input string, and then performs place-name and street-name recognition on the single characters left unrecognized; its purpose is to identify place names, street names and the like, thereby expanding a place-name lexicon. Existing segmentation techniques have at least the following technical problems:
1. Existing segmentation systems use only one segmentation method (forward maximum matching or reverse maximum matching). The segmentation process is coarse, so the results are not accurate enough;
2. Existing segmentation methods usually cover only a specific domain and cannot effectively handle strings from multiple domains.
Therefore, how to effectively improve the accuracy of wrong word monitoring is a problem that urgently needs to be solved.
Summary of the invention
In view of this, the invention provides a wrong word monitoring method that can effectively improve the accuracy of wrong word monitoring.
The invention provides a wrong word monitoring method, the method comprising:
building a wrong word dictionary;
collecting data from a target website to obtain website data;
preprocessing, parsing and denoising the collected website data to obtain text content;
performing word segmentation on the text content to obtain individual words;
building an Aho-Corasick (AC) automaton dictionary tree based on the wrong word dictionary, and generating a cache;
building a context analysis model;
performing wrong word identification according to the AC dictionary tree cache and the context analysis model, and outputting the wrong word recognition results.
Preferably, building the wrong word dictionary comprises:
obtaining the basic information of Chinese characters;
based on the basic information of the Chinese characters, constructing similar characters for each character with a similar-character algorithm, and generating the erroneous words corresponding to each word from the similar characters, to obtain the wrong word dictionary.
Preferably, the method further comprises:
inputting the wrong word recognition results into a wrong word review and annotation platform for manual annotation;
feeding the results of the manual annotation back to update the wrong word dictionary and the core word bank;
for the wrong word recognition results obtained, displaying on the system page those whose words are contained in the core word bank.
A wrong word monitoring system, comprising:
a first building module, for building a wrong word dictionary;
an acquisition module, for collecting data from a target website to obtain website data;
a processing module, for preprocessing, parsing and denoising the collected website data to obtain text content;
a word segmentation module, for performing word segmentation on the text content to obtain individual words;
a second building module, for building an AC automaton dictionary tree based on the wrong word dictionary, and generating a cache;
a third building module, for building a context analysis model;
an output module, for performing wrong word identification according to the AC dictionary tree cache and the context analysis model, and outputting the wrong word recognition results.
Preferably, the first building module comprises:
an acquiring unit, for obtaining the basic information of Chinese characters;
a generation unit, for constructing similar characters for each character with a similar-character algorithm based on the basic information of the Chinese characters, and generating the erroneous words corresponding to each word from the similar characters, to obtain the wrong word dictionary.
Preferably, the system further comprises:
an annotation module, for inputting the wrong word recognition results into a wrong word review and annotation platform for manual annotation;
a feedback update module, for feeding the results of the manual annotation back to update the wrong word dictionary and the core word bank;
a display module, for displaying on the system page, among the wrong word recognition results obtained, those whose words are contained in the core word bank.
As can be seen from the above technical solution, the invention provides a wrong word monitoring method. When wrong words on a website need to be monitored, a wrong word dictionary is built first; data is collected from the target website to obtain website data; the collected data is preprocessed, parsed and denoised to obtain text content; the text content is segmented into individual words; an AC automaton dictionary tree is built from the wrong word dictionary and a cache is generated; a context analysis model is built; finally, wrong words are identified according to the AC dictionary tree cache and the context analysis model, and the recognition results are output. With the AC dictionary tree cache generated from the wrong word dictionary, combined with the context analysis model, government websites are monitored automatically and continuously every day. Wrong words on a website can be located more comprehensively and efficiently and assigned an error level, which effectively improves the accuracy of wrong word monitoring.
Brief description of the drawings
To explain the technical solutions of the embodiments of the invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of embodiment 1 of the wrong word monitoring method disclosed by the invention;
Fig. 2 is a flowchart of embodiment 2 of the wrong word monitoring method disclosed by the invention;
Fig. 3 is a structural diagram of embodiment 1 of the wrong word monitoring system disclosed by the invention;
Fig. 4 is a structural diagram of embodiment 2 of the wrong word monitoring system disclosed by the invention.
Detailed description
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings in the embodiments. Evidently, the described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of these embodiments without creative effort fall within the scope of protection of the invention.
As shown in Fig. 1, which is a flowchart of embodiment 1 of the wrong word monitoring method disclosed by the invention, the method may comprise the following steps:
S101, building a wrong word dictionary;
When wrong words on a website need to be monitored, basic data is prepared first and the wrong word dictionary is built. Building it mainly involves collecting the basic information of Chinese characters together with common vocabulary, professional-domain vocabulary and the like, and then constructing the wrong word dictionary from the basic information of the Chinese characters.
S102, collecting data from the target website to obtain website data;
After the wrong word dictionary is built, data is collected from the target website to obtain website data, the target website being the website on which wrong word monitoring is to be performed.
S103, preprocessing, parsing and denoising the collected website data to obtain text content;
After the target website's data is obtained, it is preprocessed, parsed and denoised to obtain the text content. For the raw page text, the tag content of the page is removed first. This step is done mainly with regular expressions, which match the HTML tags so that the matched tags can be deleted. What remains is the text content of the page.
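The tag-removal step of S103 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the patent only states that regular expressions match and delete HTML tags, so the specific patterns and the function name extract_text are assumptions.

```python
import html
import re

# Hypothetical patterns; the source does not specify which expressions are used.
SCRIPT_STYLE = re.compile(r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
COMMENT = re.compile(r"<!--.*?-->", re.DOTALL)
TAG = re.compile(r"<[^>]+>")
WHITESPACE = re.compile(r"\s+")


def extract_text(page: str) -> str:
    """Delete script/style blocks, comments and tags, then tidy whitespace."""
    page = SCRIPT_STYLE.sub(" ", page)   # drop non-visible code blocks first
    page = COMMENT.sub(" ", page)        # drop HTML comments
    page = TAG.sub(" ", page)            # drop every remaining tag
    page = html.unescape(page)           # turn entities such as &amp; into text
    return WHITESPACE.sub(" ", page).strip()
```

Regular expressions are adequate for simple denoising of this kind; a production crawler would more likely rely on a real HTML parser for malformed pages.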
S104, performing word segmentation on the text content to obtain individual words;
The segmentation algorithm processes the page text content, cutting the sequence of Chinese characters into individual words so that a computer can recognize them automatically. The main steps of the algorithm are:
Atomic segmentation: to segment a sentence, atomic segmentation is applied first, cutting the whole sentence into all possible atomic units.
Full segmentation word graph generation: the sentence is fully segmented according to the segmentation dictionary, and a word graph represented as an adjacency list is generated.
Optimal path computation: on the basis of this word graph, the optimal segmentation path is generated with a dynamic programming algorithm.
Unregistered word recognition: a hidden Markov model is used to recognize unregistered nouns such as person names, place names and organization names.
The optimal segmentation path is then recomputed with the dynamic programming algorithm to obtain the optimal segmentation result.
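The word graph and dynamic programming steps above can be sketched as follows. The dictionary FREQ and its frequencies are invented for illustration, scoring paths by summed log frequency is an assumption (the patent does not state the path cost), and the hidden Markov model step for unregistered words is left out.

```python
import math

# Toy segmentation dictionary with made-up frequencies (assumption, not from the patent).
FREQ = {"政府": 50, "网站": 40, "政府网站": 20, "发布": 30, "公告": 30, "布": 5, "发": 5}
TOTAL = sum(FREQ.values())


def build_word_graph(sentence):
    """Adjacency list: graph[i] lists every end index j where sentence[i:j] is an edge."""
    graph = {}
    for i in range(len(sentence)):
        ends = [i + 1]  # atomic unit: a single character is always an edge
        for j in range(i + 2, len(sentence) + 1):
            if sentence[i:j] in FREQ:
                ends.append(j)
        graph[i] = ends
    return graph


def best_path(sentence, graph):
    """Right-to-left dynamic programming over the word graph."""
    n = len(sentence)
    best = {n: (0.0, n)}  # position -> (best log score from here, next cut)
    for i in range(n - 1, -1, -1):
        best[i] = max(
            (math.log(FREQ.get(sentence[i:j], 1) / TOTAL) + best[j][0], j)
            for j in graph[i]
        )
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words


def segment(sentence):
    return best_path(sentence, build_word_graph(sentence))
```

With this toy dictionary, segment("政府网站发布公告") keeps the longer entry "政府网站" because its single edge scores higher than the two-edge path through "政府" and "网站".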
S105, building an AC automaton dictionary tree based on the wrong word dictionary, and generating a cache;
A dictionary tree (trie) is a data structure that trades space for time: insertion and lookup take O(len) time, independent of the number of entries, and it is widely used in word frequency statistics and input statistics. Its main weakness is multi-pattern matching, which an AC automaton handles efficiently. To combine the strengths of both, the dictionary tree is used here as the search data structure of the AC automaton.
The wrong word dictionary is persisted offline as a dictionary tree cache file. When the service starts, it only needs to load this cache file into memory to enable efficient retrieval of the wrong word dictionary, which greatly improves overall system performance.
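A minimal sketch of a trie-backed Aho-Corasick automaton with an on-disk cache is given below. The class name ACAutomaton, the helper names save_cache and load_cache, and the use of pickle as the cache format are all assumptions; the patent does not describe the file format.

```python
import pickle
from collections import deque


class ACAutomaton:
    def __init__(self, patterns):
        self.goto = [{}]   # trie edges: goto[node][char] -> child node
        self.fail = [0]    # failure link of each node
        self.out = [[]]    # patterns that end at each node
        for pattern in patterns:
            self._insert(pattern)
        self._build_failure_links()

    def _insert(self, pattern):
        node = 0
        for ch in pattern:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(pattern)

    def _build_failure_links(self):
        queue = deque(self.goto[0].values())  # depth-1 nodes fail to the root
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # inherit patterns that end at the failure target
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def search(self, text):
        """Return (start index, pattern) for every dictionary hit in text."""
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for pattern in self.out[node]:
                hits.append((i - len(pattern) + 1, pattern))
        return hits


def save_cache(automaton, path):
    with open(path, "wb") as fh:
        pickle.dump((automaton.goto, automaton.fail, automaton.out), fh)


def load_cache(path):
    automaton = ACAutomaton([])
    with open(path, "rb") as fh:
        automaton.goto, automaton.fail, automaton.out = pickle.load(fh)
    return automaton
```

Storing only the three plain lists keeps the cache independent of the class definition, so loading at service start-up does not rebuild the failure links.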
S106, building a context analysis model;
The text content is represented in a vector space. The words surrounding the target word (its context), together with the order in which they occur in the text, serve as input; the mapping layer applies a weighting step and outputs the target word. The training objective is to maximize the output probability of the target wrong word.
Here, context denotes the contextual information.
First, the text content needs to be processed into k-tuples. Each input word is mapped to a vector; this mapping is denoted C, so C(w_{t-1}) is the word vector of w_{t-1}. The n (n>1) words before and after position t are used to predict the t-th word. Assuming the wrong word mapping is denoted E, E(w_{t-1}) denotes the word vector of the wrong word.
$$P(C(w_t)) = P(E(w_t)) = P(w_t \mid w_{t-k}, w_{t-k+1}, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_{t+k})$$
A window of appropriate size is taken as the context. When the word w_t is predicted, its corresponding node necessarily has a binary code, for example "010011". The goal is to predict each bit of this word's binary code so that the probability of the predicted word's binary code is maximized.
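The bit-by-bit prediction described above resembles the hierarchical-softmax form of a continuous bag-of-words model, although the patent does not name that technique. Under that assumption, with x the weighted combination of the context vectors, d_j the j-th bit of the target word's binary code of length L, and θ_j a parameter vector attached to the j-th inner node on the path (all three symbols are introduced here for illustration), the probability being maximized could be written as:

```latex
P\bigl(w_t \mid \mathrm{context}(w_t)\bigr)
  = \prod_{j=1}^{L}
    \sigma\!\left(x^{\top}\theta_j\right)^{1-d_j}
    \bigl(1-\sigma\!\left(x^{\top}\theta_j\right)\bigr)^{d_j},
\qquad
\sigma(z) = \frac{1}{1+e^{-z}}
```

For the code "010011", L = 6 and the product has six factors, one per bit. Which bit value is paired with σ and which with 1 − σ is a convention; the pairing shown is one common choice.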
S107, performing wrong word identification according to the AC dictionary tree cache and the context analysis model, and outputting the wrong word recognition results.
On the basis of the segmentation results, the context analysis model searches the AC dictionary tree cache and computes the most probable potential erroneous word. It gives the corrected word, performs fuzzy matching on the difference set between the correct word and the erroneous word to make the error-correction decision, supplies other attributes such as the error type of the erroneous word, and finally generates and returns an erroneous-word entity class. The entity class contains the erroneous word, the correct word, the error type, and the context of the erroneous word in the text.
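The returned entity class could look like the sketch below. The four fields come from the description above; the names WrongWordRecord and make_record, the lexicon layout, and the fixed-width context window are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class WrongWordRecord:
    wrong_word: str    # the erroneous word found in the text
    correct_word: str  # the suggested correction
    error_type: str    # manually assigned error type from the dictionary
    context: str       # surrounding text of the erroneous word


def make_record(text, start, wrong_word, lexicon, window=5):
    """Build a record for a hit at text[start:], keeping `window` characters of context.

    lexicon maps an erroneous word to (correct word, error type); hypothetical layout.
    """
    correct_word, error_type = lexicon[wrong_word]
    left = max(0, start - window)
    right = min(len(text), start + len(wrong_word) + window)
    return WrongWordRecord(wrong_word, correct_word, error_type, text[left:right])
```

In a full pipeline, start and wrong_word would come from the dictionary tree search, and the context model would decide whether the hit is kept.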
In summary, in the above embodiment, when wrong words on a website need to be monitored, a wrong word dictionary is built first; data is collected from the target website to obtain website data; the collected data is preprocessed, parsed and denoised to obtain text content; the text content is segmented into individual words; an AC automaton dictionary tree is built from the wrong word dictionary and a cache is generated; a context analysis model is built; finally, wrong words are identified according to the AC dictionary tree cache and the context analysis model, and the recognition results are output. With the AC dictionary tree cache generated from the wrong word dictionary, combined with the context analysis model, government websites are monitored automatically and continuously every day. Wrong words on a website can be located more comprehensively and efficiently and assigned an error level, which effectively improves the accuracy of wrong word monitoring.
As shown in Fig. 2, which is a flowchart of embodiment 2 of the wrong word monitoring method disclosed by the invention, the method may comprise the following steps:
S201, obtaining the basic information of Chinese characters;
9,891 Chinese characters are collected and organized, and a crawler is used to obtain their basic information, including pinyin, initial, final, consonant, tone, character structure, Wubi code, four-corner code, stroke order, radical, and stroke count. For polyphonic characters all pronunciations are retained. In addition, common Chinese words and phrases are collected from sources such as large Chinese dictionaries and segmentation dictionaries, and country names, government-affairs vocabulary, medical vocabulary and the like are collected manually; together these form the correct word dictionary;
S202, based on the basic information of the Chinese characters, constructing similar characters for each character with a similar-character algorithm, and generating the erroneous words corresponding to each word from the similar characters, to obtain the wrong word dictionary;
The similar-character algorithm covers two cases: similar characters with the same structure and similar characters with different structures. For example, "value" (值) and "plant" (植) are both left-right structures, so their structures are the same; for "value" (值) and "straight" (直), the former is a left-right structure and the latter a top-bottom structure. Different algorithms are used for the two cases.
For the first case (same structure), the similar-character algorithm proceeds as follows:
First, the preconditions for two characters to be similar are: the same structure, a stroke-count difference within ±3, and at least 2 identical digits in the four-corner code;
Then weights of 30%, 25% and 45% are assigned to pinyin, four-corner code and Wubi code respectively. The detailed rules are:
If the pinyin is identical (tones may differ), or the two pinyin differ only as front and back nasal finals (such as yin and ying), or as flat-tongue and retroflex initials (such as sa and sha), the score is 0.3; otherwise it is 0;
For the four-corner code, if N digits are identical, the score is N*0.2*0.25;
For the Wubi code: when a 3-letter code is matched against a 3-letter code and M codes are identical, the score is M*1/3*0.45; when a 4-letter code is matched against a 4-letter code and N codes are identical, the score is N*1/4*0.45; when a 3-letter code is matched against a 4-letter code, the score is 2*2/3*0.45 if 2 consecutive codes are identical and 0.45 if 3 are identical.
Finally, the pinyin, four-corner and Wubi scores are summed, and only characters whose similarity is greater than or equal to 0.5 are kept as similar characters;
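The same-structure scoring rules can be sketched as below. The record type CharInfo, the position-by-position comparison of codes, and every function name are assumptions. The mixed case of a 3-letter against a 4-letter Wubi code is deliberately left unimplemented, because the formula given for it in the text is ambiguous.

```python
from dataclasses import dataclass


@dataclass
class CharInfo:
    char: str
    structure: str    # e.g. "left-right"
    strokes: int      # stroke count
    pinyin: str       # toneless pinyin
    four_corner: str  # five-digit four-corner code
    wubi: str         # three- or four-letter Wubi code


def same_positions(a, b):
    """Count positions at which two codes agree (assumed meaning of 'identical')."""
    return sum(x == y for x, y in zip(a, b))


def pinyin_similar(p, q):
    if p == q:
        return True
    # front/back nasal finals, e.g. yin / ying
    if (p + "g" == q or q + "g" == p) and min(p, q, key=len).endswith("n"):
        return True
    # flat-tongue vs retroflex initials, e.g. sa / sha
    for flat in "zcs":
        for x, y in ((p, q), (q, p)):
            if x.startswith(flat) and y == flat + "h" + x[1:]:
                return True
    return False


def similarity(a, b):
    """Return the weighted score, or None when the preconditions fail."""
    if a.structure != b.structure or abs(a.strokes - b.strokes) > 3:
        return None
    corners = same_positions(a.four_corner, b.four_corner)
    if corners < 2:
        return None
    if len(a.wubi) != len(b.wubi):
        raise NotImplementedError("mixed 3/4-letter Wubi codes are left out of this sketch")
    score = 0.3 if pinyin_similar(a.pinyin, b.pinyin) else 0.0   # pinyin, weight 30%
    score += corners * 0.2 * 0.25                                  # four-corner, weight 25%
    score += same_positions(a.wubi, b.wubi) / len(a.wubi) * 0.45   # Wubi, weight 45%
    return score


def is_similar(a, b, threshold=0.5):
    score = similarity(a, b)
    return score is not None and score >= threshold
```

The feature values passed in would come from the character information collected in S201; any values used to exercise the sketch are placeholders rather than looked-up codes.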
For the second case, because the structure changes, the four-corner code and Wubi code are less useful. It can be observed, however, that "value" (值) is composed of "Ren" (亻) and "straight" (直), so "straight" (直) is exactly a subset of "value" (值). This case can be handled entirely by stroke order. The detailed algorithm is:
First, filter out the Chinese characters that are subsets or supersets of the target character, with a stroke-count difference of no more than 5;
Determine whether the pinyin is similar (identical pinyin, where tones may differ, or pinyin differing only as front and back nasal finals or as flat-tongue and retroflex initials). If it is similar, the character is directly judged to be a similar character; if not, processing continues;
Filter for characters whose four-corner codes have at least 4 identical digits and whose Wubi codes have at least N identical codes (N=2 if both Wubi codes have 3 letters; N=3 if both have 4 letters; N=2 if one has 3 letters and the other has 4).
Exclude the characters already found in the first case (for example, "bar" and "stalk" have the same structure, and the stroke order of "bar" is a subset of that of "stalk"); the remainder are the similar characters of the second case.
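A sketch of the different-structure test follows. It assumes stroke order is stored as a string with one symbol per stroke and reads "subset" as a contiguous run inside the other character's stroke order; both are interpretations, and the function names are hypothetical. Whether the pinyin is similar is passed in as a flag to keep the block short.

```python
def required_wubi_matches(len_a, len_b):
    """N from the rules above: 3 when both codes have 4 letters, otherwise 2."""
    return 3 if len_a == 4 and len_b == 4 else 2


def is_cross_structure_similar(a, b, pinyin_is_similar):
    """a, b: dicts with keys "order" (stroke order), "four_corner" and "wubi"."""
    short, long_ = sorted((a["order"], b["order"]), key=len)
    # one stroke order must sit inside the other, at most 5 strokes apart
    if short not in long_ or len(long_) - len(short) > 5:
        return False
    if pinyin_is_similar:
        return True  # similar pinyin settles the question directly
    corners = sum(x == y for x, y in zip(a["four_corner"], b["four_corner"]))
    wubi = sum(x == y for x, y in zip(a["wubi"], b["wubi"]))
    return corners >= 4 and wubi >= required_wubi_matches(len(a["wubi"]), len(b["wubi"]))
```

Pairs that already qualified under the same-structure rules would be removed from the result afterwards, as the last step above describes.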
Here an erroneous word is a word formed with a similar character. Take the word "plant" (植物): a similar character of "植" is "值", and replacing "植" with "值" gives "值物", which is an erroneous word. In practice, such words are where wrong words are likely to be found.
The main procedure for generating erroneous words is as follows:
Basic data preparation: all words are extracted from the correct word dictionary, and each word is split into individual characters;
Potential erroneous word generation: for each character in a word, its similar characters are looked up and substituted for the original character, producing potential erroneous words;
Erroneous word generation: a word produced by substitution may itself be a correct word. For example, a similar character of "far or indistinct" is "vast", and one word containing "far or indistinct" is "dimly discernible"; the generated erroneous word is "ethereal", but in practice "ethereal" is also a correct word. Such cases therefore need to be filtered out.
Once the erroneous words for each correct word have been formed, the correct words, their corresponding erroneous words, and the manually assigned error types together make up the wrong word dictionary.
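The generation and filtering steps can be sketched as below. The function name generate_wrong_words and the dictionary layout are assumptions, and error types are omitted because the patent says they are assigned manually.

```python
def generate_wrong_words(correct_words, similar_chars):
    """Map each generated erroneous word to the correct word it came from.

    similar_chars maps a character to its similar characters (hypothetical layout).
    """
    correct = set(correct_words)
    wrong_to_correct = {}
    for word in correct_words:
        for i, ch in enumerate(word):
            for sim in similar_chars.get(ch, ()):
                candidate = word[:i] + sim + word[i + 1:]
                if candidate in correct:
                    continue  # the substitution produced another correct word
                wrong_to_correct.setdefault(candidate, word)
    return wrong_to_correct
```

For instance, with "植物" in the correct word dictionary and "值" listed as similar to "植", the result maps "值物" to "植物"; a substitution that lands on another entry of the correct word dictionary is dropped.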
S203, collecting data from the target website to obtain website data;
After the wrong word dictionary is built, data is collected from the target website to obtain website data, the target website being the website on which wrong word monitoring is to be performed.
S204, preprocessing, parsing and denoising the collected website data to obtain text content;
After the target website's data is obtained, it is preprocessed, parsed and denoised to obtain the text content. For the raw page text, the tag content of the page is removed first. This step is done mainly with regular expressions, which match the HTML tags so that the matched tags can be deleted. What remains is the text content of the page.
S205, performing word segmentation on the text content to obtain individual words;
The segmentation algorithm processes the page text content, cutting the sequence of Chinese characters into individual words so that a computer can recognize them automatically. The main steps of the algorithm are:
Atomic segmentation: to segment a sentence, atomic segmentation is applied first, cutting the whole sentence into all possible atomic units.
Full segmentation word graph generation: the sentence is fully segmented according to the segmentation dictionary, and a word graph represented as an adjacency list is generated.
Optimal path computation: on the basis of this word graph, the optimal segmentation path is generated with a dynamic programming algorithm.
Unregistered word recognition: a hidden Markov model is used to recognize unregistered nouns such as person names, place names and organization names.
The optimal segmentation path is then recomputed with the dynamic programming algorithm to obtain the optimal segmentation result.
S206, building an AC automaton dictionary tree based on the wrong word dictionary, and generating a cache;
A dictionary tree (trie) is a data structure that trades space for time: insertion and lookup take O(len) time, independent of the number of entries, and it is widely used in word frequency statistics and input statistics. Its main weakness is multi-pattern matching, which an AC automaton handles efficiently. To combine the strengths of both, the dictionary tree is used here as the search data structure of the AC automaton.
The wrong word dictionary is persisted offline as a dictionary tree cache file. When the service starts, it only needs to load this cache file into memory to enable efficient retrieval of the wrong word dictionary, which greatly improves overall system performance.
S207, building a context analysis model;
The text content is represented in a vector space. The words surrounding the target word (its context), together with the order in which they occur in the text, serve as input; the mapping layer applies a weighting step and outputs the target word. The training objective is to maximize the output probability of the target wrong word.
Here, context denotes the contextual information.
First, the text content needs to be processed into k-tuples. Each input word is mapped to a vector; this mapping is denoted C, so C(w_{t-1}) is the word vector of w_{t-1}. The n (n>1) words before and after position t are used to predict the t-th word. Assuming the wrong word mapping is denoted E, E(w_{t-1}) denotes the word vector of the wrong word.
$$P(C(w_t)) = P(E(w_t)) = P(w_t \mid w_{t-k}, w_{t-k+1}, \ldots, w_{t-1}, w_t, w_{t+1}, \ldots, w_{t+k})$$
A window of appropriate size is taken as the context. When the word w_t is predicted, its corresponding node necessarily has a binary code, for example "010011". The goal is to predict each bit of this word's binary code so that the probability of the predicted word's binary code is maximized.
S208, performing wrong word identification according to the AC dictionary tree cache and the context analysis model, and outputting the wrong word recognition results;
On the basis of the segmentation results, the context analysis model searches the AC dictionary tree cache and computes the most probable potential erroneous word. It gives the corrected word, performs fuzzy matching on the difference set between the correct word and the erroneous word to make the error-correction decision, supplies other attributes such as the error type of the erroneous word, and finally generates and returns an erroneous-word entity class. The entity class contains the erroneous word, the correct word, the error type, and the context of the erroneous word in the text.
S209, inputting the wrong word recognition results into a wrong word review and annotation platform for manual annotation;
The wrong word identification records produced by daily monitoring that are not yet in the annotation platform are all appended to it. After logging in to the platform, a user can fetch wrong word identification records and mark each one as right or wrong. Each record is assigned to at most N users (N is configurable). Each user judges the records assigned to them on the basis of the context provided: a user who considers the identification correct clicks the "correct" button, and a user who considers it mistaken clicks the "wrong" button. The platform assigns every record to multiple users so as to pool several people's judgment and reduce, to some extent, the risk of misjudgment caused by one person's mistake.
S210, feeding the results of the manual annotation back to update the wrong word dictionary and the core word bank;
Every night the annotation platform starts a shell script on a schedule. The script first computes the final result of each wrong word record from all of its annotations. For a given record, if more users consider it correct than consider it wrong, its final result is correct; conversely, if more users consider it wrong than consider it correct, its final result is wrong. If the two counts are equal, the result is provisionally pending, and the record continues to be assigned to other users until a definite result is obtained.
After the final results of the annotated records have been computed, the program automatically selects the correctly identified records and updates them into the core word bank. It also selects the wrongly identified records and deletes them from the wrong word dictionary, so that such words are no longer flagged afterwards.
Thus, on top of comprehensive machine identification of wrong words, manual annotation is added, and the manual judgments are automatically fed back into the wrong word dictionary and the core word bank, making wrong word identification more comprehensive and accurate.
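The nightly aggregation and feedback logic can be sketched in a few lines. The patent describes a scheduled shell script; Python is used here only for illustration, and the function names and data layout (votes as booleans, the core word bank as a set) are assumptions.

```python
from collections import Counter


def final_result(votes):
    """votes: iterable of booleans, True meaning the annotator judged the detection correct."""
    counts = Counter(votes)
    if counts[True] > counts[False]:
        return "correct"
    if counts[True] < counts[False]:
        return "wrong"
    return "pending"  # tie: the record goes back out to further annotators


def apply_feedback(record_votes, wrong_word_dict, core_word_bank):
    """Update both stores in place from the per-record votes."""
    for wrong_word, votes in record_votes.items():
        outcome = final_result(votes)
        if outcome == "correct":
            core_word_bank.add(wrong_word)         # confirmed detection
        elif outcome == "wrong":
            wrong_word_dict.pop(wrong_word, None)  # stop flagging this word
        # pending records leave both stores unchanged
```

Only records whose word is in the core word bank would then be shown on the system page, which is the filtering described in the next step.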
S211, for the wrong word recognition results obtained, displaying on the system page those whose words are contained in the core word bank.
Because the wrong word dictionary is constructed algorithmically according to fixed rules, it is larger and more comprehensive than a manually built one. However, among the many wrong word records produced by daily monitoring, some inaccurate identifications are unavoidable. If all records were displayed on the page without screening, customers would inevitably be confused. Therefore a core word bank is established through the annotation platform: accurately identified words are put into the core word bank, which accumulates and grows over time, and only the wrong word identification records present in the core word bank are displayed, ensuring higher accuracy of the wrong words shown on the system page.
In summary, in the above embodiment, similar characters are constructed from the basic information of Chinese characters and the wrong word dictionary is generated. An AC dictionary tree cache is generated from the wrong word dictionary and, combined with the context analysis model, government websites are monitored automatically and continuously every day. Wrong words on a website can be located more comprehensively and efficiently, with an error level (such as serious wrong word or ordinary wrong word) and a suggested correction. In addition, the monitoring method incorporates manual annotation, which compensates for the shortcomings of machine identification, refines the core word bank, and promotes the migration of wrong word dictionary entries from the periphery to the core, so that wrong words on websites are monitored more accurately.
As shown in Fig. 3, which is a structural diagram of embodiment 1 of a wrong word monitoring system disclosed by the invention, the system may include:
A first construction module 301, configured to build the wrong word dictionary;
When wrong words on a website need to be monitored, the basic data is prepared first by building the wrong word dictionary. Building it mainly involves collecting the basic information of Chinese characters, together with common words and professional-domain vocabulary, and then constructing the wrong word dictionary from that basic information.
An acquisition module 302, configured to collect data from the target website to obtain website data;
After the wrong word dictionary has been built, data is collected from the target website to obtain website data, where the target website is the website on which wrong word monitoring is to be performed.
A processing module 303, configured to pre-process the acquired website data, parse the web pages and remove noise to obtain the text content;
After the data of the target website has been obtained, it is pre-processed, the web pages are parsed and noise is removed to obtain the text content. For the raw web page text, the tag content in the page is removed first. This step is done mainly with regular expressions, which match the HTML tags so that the matched tags can be deleted. The content that remains is the text content of the web page.
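The tag-removal step can be illustrated with a short regular-expression sketch. The text only states that regular expressions match and delete HTML tags; dropping whole script and style blocks, unescaping entities and collapsing whitespace are assumptions added here as plausible denoising, and the function name is hypothetical.

```python
import html
import re

# Assumption: script/style bodies count as noise and are dropped entirely.
SCRIPT_STYLE_RE = re.compile(
    r"<(script|style)\b[^>]*>.*?</\1\s*>", re.IGNORECASE | re.DOTALL
)
TAG_RE = re.compile(r"<[^>]+>")      # any remaining HTML tag
WHITESPACE_RE = re.compile(r"\s+")

def extract_text(page):
    """Return the visible text of an HTML page using regular expressions only."""
    page = SCRIPT_STYLE_RE.sub(" ", page)   # remove script and style blocks
    page = TAG_RE.sub(" ", page)            # delete the matched tags
    page = html.unescape(page)              # turn &amp; etc. back into characters
    return WHITESPACE_RE.sub(" ", page).strip()

sample = (
    "<html><head><style>p{color:red}</style></head>"
    "<body><p>Hello <b>world</b></p><script>var a = 1;</script></body></html>"
)
text = extract_text(sample)
```

Regular expressions are not a full HTML parser, so malformed markup or a ">" inside an attribute value may need a real parser in practice.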
A word segmentation module 304, configured to perform word segmentation on the text content to obtain individual words;
The segmentation algorithm performs word segmentation on the web page text, cutting the Chinese character sequence into individual words so that a computer can recognize them automatically. The main steps of the algorithm are:
Atom cutting: to segment a sentence, atom cutting is applied first, cutting the whole sentence into all possible atomic units.
Full segmentation word graph generation: the sentence is fully segmented against the segmentation dictionary, and a corresponding word graph, represented as an adjacency list, is generated.
Optimal cutting path calculation: on the basis of this word graph, the optimal cutting path is generated with a dynamic programming algorithm.
Unknown word identification: a hidden Markov model is used to identify unregistered nouns such as person names, place names and organization names.
The optimal cutting path is then calculated again with the dynamic programming algorithm to obtain the optimal segmentation result.
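The word graph and dynamic programming steps above can be sketched as follows. This is an illustrative simplification under stated assumptions: the lexicon is a hypothetical mapping from word to frequency, the path score is the sum of log relative frequencies, and the hidden Markov model step for unknown words is omitted.

```python
import math

def build_word_graph(sentence, lexicon):
    """Adjacency list: graph[i] lists every end index j with an edge sentence[i:j]."""
    max_len = max(map(len, lexicon), default=1)
    graph = []
    for i in range(len(sentence)):
        ends = [i + 1]  # atom cutting: a single character is always an edge
        for j in range(i + 2, min(len(sentence), i + max_len) + 1):
            if sentence[i:j] in lexicon:
                ends.append(j)
        graph.append(ends)
    return graph

def segment(sentence, lexicon):
    """Pick the highest-scoring path through the word graph."""
    total = sum(lexicon.values()) or 1

    def score(word):
        # Assumption: words missing from the lexicon get a frequency of 1.
        return math.log(lexicon.get(word, 1) / total)

    graph = build_word_graph(sentence, lexicon)
    n = len(sentence)
    # best[i] = (score of the best path from i to the end, next cut position)
    best = [(0.0, n)] * (n + 1)
    for i in range(n - 1, -1, -1):
        best[i] = max((score(sentence[i:j]) + best[j][0], j) for j in graph[i])
    words, i = [], 0
    while i < n:
        j = best[i][1]
        words.append(sentence[i:j])
        i = j
    return words

# Hypothetical lexicon with made-up frequencies.
lexicon = {"网站": 50, "错别字": 30, "监测": 40, "错": 5, "别": 5, "字": 5, "网": 5, "站": 5}
words = segment("网站错别字监测", lexicon)
```

Filling the table from the end of the sentence backwards means each position is scored once, so the cost is proportional to the number of edges in the word graph.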
A second construction module 305, configured to build an AC automaton trie based on the wrong word dictionary and generate a cache;
A dictionary tree (trie) is a data structure that trades space for time. It supports insertion and lookup in O(len) time, where len is the length of the key and is independent of the number of entries, and it is widely used in word frequency statistics and input statistics. Its main shortcoming is multi-pattern matching, which an AC (Aho-Corasick) automaton handles efficiently. To combine the advantages of both, the trie is used here as the search data structure of the AC automaton.
The wrong word dictionary is persisted offline as a trie cache file. When the service starts, it only needs to load the trie cache file into memory to enable efficient retrieval of the wrong word dictionary, which greatly improves overall system performance.
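A compact sketch of an AC (Aho-Corasick) automaton built on a dictionary tree (trie) is given below to show how all wrong word dictionary entries can be matched in one pass over the text. The class and method names are hypothetical, and the cache-file persistence described above is not shown.

```python
from collections import deque

class ACAutomaton:
    """Trie with failure links for multi-pattern matching."""

    def __init__(self, patterns):
        self.goto = [{}]   # goto[node][char] -> child node
        self.fail = [0]    # failure link of each node
        self.out = [[]]    # patterns that end at each node
        for pattern in patterns:
            self._insert(pattern)
        self._build_failure_links()

    def _insert(self, pattern):
        node = 0
        for ch in pattern:
            nxt = self.goto[node].get(ch)
            if nxt is None:
                nxt = len(self.goto)
                self.goto[node][ch] = nxt
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
            node = nxt
        self.out[node].append(pattern)

    def _build_failure_links(self):
        # Breadth-first traversal: shallower nodes are finished first.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit patterns that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def search(self, text):
        """Return (start_index, pattern) for every occurrence in text."""
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for pattern in self.out[node]:
                hits.append((i - len(pattern) + 1, pattern))
        return hits

automaton = ACAutomaton(["he", "she", "his", "hers"])
hits = automaton.search("ushers")
```

In the setting described here, the patterns would be the erroneous words of the wrong word dictionary, and each hit would be a candidate passed on to the context analysis model.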
A third construction module 306, configured to build the context analysis model;
The text content is represented in a vector space. The other words surrounding the target word (the context), together with the order in which the words appear in the text, serve as the input. The target word is output after the mapping layer applies a weighting, and the objective is to maximize the output probability of the target wrong word.
Here, context denotes the contextual information.
The text content first needs to be reduced to k-tuples. Each input word is mapped to a vector; this mapping is denoted C, so C(w_{t-1}) is the word vector of w_{t-1}, and the n (n>1) words before and after position t are used to predict the t-th word. Assuming the wrong word mapping is denoted E, E(w_{t-1}) is the word vector of the wrong word.
$$P(C(w_t)) = P(E(w_t)) = P(w_t \mid w_{t-k}, w_{t-k+1}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+k})$$
A window of appropriate size is taken as the context. When the word w_t is predicted, its corresponding node necessarily has a binary code, for example "010011". The goal is to predict each bit of this binary code so that the probability of the predicted word's binary code is maximized.
An output module 307, configured to perform wrong word identification according to the AC trie cache and the context analysis model and to output the wrong word recognition results.
On the basis of the segmentation result, the context analysis model searches the AC trie cache and computes the most probable potential erroneous word, gives the corrected word, performs fuzzy matching on the difference set between the correct word and the erroneous word to make the error-correction judgment, supplies other attributes of the erroneous word such as its error type, and finally generates and returns an erroneous-word entity class. The entity class contains the erroneous word, the correct word, the error type, and the context of the erroneous word in the text.
In summary, in the above embodiment, when wrong words on a website need to be monitored, the wrong word dictionary is built first and data is collected from the target website to obtain website data. The acquired website data is then pre-processed, the web pages are parsed and noise is removed to obtain the text content, and word segmentation is performed on the text content to obtain individual words. An AC automaton trie is built based on the wrong word dictionary and a cache is generated, and the context analysis model is built. Finally, wrong word identification is performed according to the AC trie cache and the context analysis model, and the wrong word recognition results are output. Generating the AC trie cache from the wrong word dictionary and combining it with the context analysis model allows government websites to be monitored automatically and continuously every day, so that wrong words on a website can be located more comprehensively and efficiently and given an error level, which effectively improves the accuracy of wrong word monitoring.
As shown in Fig. 4, which is a structural diagram of embodiment 2 of a wrong word monitoring system disclosed by the invention, the system may include:
An acquiring unit 401, configured to obtain the basic information of Chinese characters;
9,891 Chinese characters are collected and organized, and a crawler is used to obtain their basic information, including pinyin, initial, final, consonant, tone, character structure, Wubi (five-stroke) code, four-corner code, stroke order, radical, stroke count, etc. For polyphonic characters, all pinyin readings are retained. In addition, common Chinese words and phrases are collected from large Chinese dictionaries, segmentation dictionaries and the like, and country names, government-affairs vocabulary, medical-industry vocabulary, etc. are collected manually. Together these form the correct word dictionary;
A generation unit 402, configured to construct similar characters for each character with a similar character algorithm based on the basic information of Chinese characters, and to generate the erroneous words corresponding to each word from the similar characters, obtaining the wrong word dictionary;
The similar character algorithm covers two types: similar characters with identical character structure and similar characters with different character structure. For example, the similar characters "value" (值) and "plant" (植) are both left-right structures, so their structure is identical. For "value" (值) and "straight" (直), the former is a left-right structure and the latter a top-bottom structure. Different algorithms are used for the two cases.
For the first case, identical structure, the main flow of the similar character algorithm is as follows:
First, the preconditions for similar characters are that the character structure is identical, the stroke counts differ by no more than 3, and at least 2 digits of the four-corner codes are identical;
Then pinyin, four-corner code and Wubi code are given weights of 30%, 25% and 45% respectively, under the following detailed rules:
If the pinyin is identical (the tones may differ), or the pinyin of the two characters differs only as front versus back nasal finals (e.g. yin and ying), or only as flat-tongue versus retroflex initials (e.g. sa and sha), the score is 0.3; otherwise it is 0;
For the four-corner code, if N digits are identical, the score is N*0.2*0.25;
For the Wubi code: when a 3-letter code is matched against a 3-letter code and M letters are identical, the score is M*1/3*0.45; when a 4-letter code is matched against a 4-letter code and N letters are identical, the score is N*1/4*0.45; when a 3-letter code is matched against a 4-letter code, the score is 2*2/3*0.45 if 2 consecutive letters are identical and 0.45 if 3 letters are identical.
Finally, the scores obtained from pinyin, four-corner code and Wubi code are summed, and only characters whose similarity is greater than or equal to 0.5 are kept as similar characters;
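The same-structure scoring rules above can be sketched as follows. Several points are assumptions made for illustration and are not stated in the text: the record type and function names are hypothetical, pinyin carries an optional tone digit, a combined nasal and retroflex difference is also treated as similar, and, because the formula given for matching a 3-letter Wubi code against a 4-letter one is ambiguous, 2 consecutive matching letters are scored as 2/3*0.45 here.

```python
import re
from dataclasses import dataclass

@dataclass(frozen=True)
class CharInfo:
    pinyin: str        # e.g. "zhi2" (a single reading, tone digit optional)
    structure: str     # e.g. "left-right"
    strokes: int
    four_corner: str   # 5 digits
    wubi: str          # 3 or 4 letters

def normalize_pinyin(p):
    p = re.sub(r"\d", "", p.lower())      # ignore the tone
    p = re.sub(r"^([zcs])h", r"\1", p)    # retroflex zh/ch/sh -> flat z/c/s
    return re.sub(r"ng$", "n", p)         # back nasal -ng -> front nasal -n

def same_positions(a, b):
    return sum(x == y for x, y in zip(a, b))

def longest_common_run(a, b):
    best = run = 0
    for x, y in zip(a, b):
        run = run + 1 if x == y else 0
        best = max(best, run)
    return best

def pinyin_score(a, b):
    return 0.3 if normalize_pinyin(a) == normalize_pinyin(b) else 0.0

def four_corner_score(a, b):
    return same_positions(a, b) * 0.2 * 0.25

def wubi_score(a, b):
    if not a or not b:
        return 0.0
    if len(a) == len(b):                  # 3 vs 3 or 4 vs 4
        return same_positions(a, b) / len(a) * 0.45
    run = longest_common_run(a, b)        # 3 vs 4
    if run >= 3:
        return 0.45
    if run == 2:
        return 2 / 3 * 0.45               # ASSUMPTION: source formula is ambiguous
    return 0.0

def similarity(a, b):
    # Preconditions: same structure, stroke difference <= 3,
    # at least 2 identical four-corner digits.
    if a.structure != b.structure or abs(a.strokes - b.strokes) > 3:
        return 0.0
    if same_positions(a.four_corner, b.four_corner) < 2:
        return 0.0
    return (pinyin_score(a.pinyin, b.pinyin)
            + four_corner_score(a.four_corner, b.four_corner)
            + wubi_score(a.wubi, b.wubi))

def is_similar(a, b, threshold=0.5):
    return similarity(a, b) >= threshold

# Illustrative attribute values, not checked against real code tables.
first = CharInfo("zhi2", "left-right", 10, "24216", "wfhg")
second = CharInfo("zhi2", "left-right", 12, "44916", "sfhg")
score = similarity(first, second)
```

For these illustrative values the score is 0.3 + 3*0.2*0.25 + 3/4*0.45 = 0.7875, which passes the 0.5 threshold.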
For the second case, because the structure has changed, the four-corner code and the Wubi code are of less use. In fact, "value" (值) is composed of "Ren" (亻) and "straight" (直), so "straight" is exactly a subset of "value". This case can be handled entirely with the stroke order, and the detailed algorithm is as follows:
First, the subset or superset characters related to the target character are selected, with a stroke count difference of no more than 5;
Whether the pinyin is similar is then judged (that is, the pinyin is identical, with tones allowed to differ, or the pinyin of the two characters differs only as front versus back nasal or as flat-tongue versus retroflex). If it is similar, the character is directly judged to be a similar character; if not, processing continues;
Characters are then selected whose four-corner codes have at least 4 identical digits and whose Wubi codes have at least N identical letters (if the Wubi codes of both characters have 3 letters, N=2; if both have 4 letters, N=3; if one has 3 letters and the other has 4, N=2).
Characters already produced by the first case are excluded (for example, "bar" and "stalk" have the same structure, and the stroke order of "bar" is a subset of that of "stalk"); the remainder are the similar characters of the second case.
Here an erroneous word is a word formed with a similar character. Take the word "plant" (植物): a similar character of "plant" (植) is "value" (值), so replacing 植 with 值 yields "value thing" (值物). This is what is meant by an erroneous word. In practice, such words are potential wrong words.
The main flow for generating erroneous words is as follows:
Basic data preparation: all words are extracted from the correct word dictionary, and each word is split into individual characters;
Potential erroneous word generation: for each character in a word, its similar characters are looked up, and each similar character is substituted for the original character to generate a potential erroneous word;
Erroneous word generation: a word obtained by substitution may itself be a correct word. For example, a similar character of "far or indistinct" is "vast", and a word containing "far or indistinct" is "dimly discernible"; the generated erroneous word is "ethereal", but in practice "ethereal" is also a correct word. Such cases therefore need to be filtered out.
Once the erroneous words corresponding to each correct word have been formed, the correct words, their corresponding erroneous words and the manually assigned error types together make up the wrong word dictionary.
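The generation flow above (split each correct word into characters, substitute similar characters, discard results that are themselves correct words) can be sketched as follows. The function name and data shapes are assumptions: the similar-character table is a hypothetical mapping from a character to its similar characters, and the manually assigned error type is left out.

```python
def generate_erroneous_words(correct_words, similar_chars):
    """Map each generated erroneous word to the set of correct words it came from."""
    correct = set(correct_words)
    erroneous = {}
    for word in correct:
        for i, ch in enumerate(word):
            for sim in similar_chars.get(ch, ()):
                candidate = word[:i] + sim + word[i + 1:]
                if candidate in correct:
                    continue  # the substitution produced another correct word
                erroneous.setdefault(candidate, set()).add(word)
    return erroneous

# Hypothetical similar-character table for illustration.
similar_chars = {"植": ["值"], "值": ["植", "直"]}
result = generate_erroneous_words(["植物", "价值"], similar_chars)
```

Keeping the originating correct word next to each erroneous word is what later allows a suggested correction to be reported together with a hit.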
An acquisition module 403, configured to collect data from the target website to obtain website data;
After the wrong word dictionary has been built, data is collected from the target website to obtain website data, where the target website is the website on which wrong word monitoring is to be performed.
A processing module 404, configured to pre-process the acquired website data, parse the web pages and remove noise to obtain the text content;
After the data of the target website has been obtained, it is pre-processed, the web pages are parsed and noise is removed to obtain the text content. For the raw web page text, the tag content in the page is removed first. This step is done mainly with regular expressions, which match the HTML tags so that the matched tags can be deleted. The content that remains is the text content of the web page.
A word segmentation module 405, configured to perform word segmentation on the text content to obtain individual words;
The segmentation algorithm performs word segmentation on the web page text, cutting the Chinese character sequence into individual words so that a computer can recognize them automatically. The main steps of the algorithm are:
Atom cutting: to segment a sentence, atom cutting is applied first, cutting the whole sentence into all possible atomic units.
Full segmentation word graph generation: the sentence is fully segmented against the segmentation dictionary, and a corresponding word graph, represented as an adjacency list, is generated.
Optimal cutting path calculation: on the basis of this word graph, the optimal cutting path is generated with a dynamic programming algorithm.
Unknown word identification: a hidden Markov model is used to identify unregistered nouns such as person names, place names and organization names.
The optimal cutting path is then calculated again with the dynamic programming algorithm to obtain the optimal segmentation result.
A second construction module 406, configured to build an AC automaton trie based on the wrong word dictionary and generate a cache;
A dictionary tree (trie) is a data structure that trades space for time. It supports insertion and lookup in O(len) time, where len is the length of the key and is independent of the number of entries, and it is widely used in word frequency statistics and input statistics. Its main shortcoming is multi-pattern matching, which an AC (Aho-Corasick) automaton handles efficiently. To combine the advantages of both, the trie is used here as the search data structure of the AC automaton.
The wrong word dictionary is persisted offline as a trie cache file. When the service starts, it only needs to load the trie cache file into memory to enable efficient retrieval of the wrong word dictionary, which greatly improves overall system performance.
A third construction module 407, configured to build the context analysis model;
The text content is represented in a vector space. The other words surrounding the target word (the context), together with the order in which the words appear in the text, serve as the input. The target word is output after the mapping layer applies a weighting, and the objective is to maximize the output probability of the target wrong word.
Here, context denotes the contextual information.
The text content first needs to be reduced to k-tuples. Each input word is mapped to a vector; this mapping is denoted C, so C(w_{t-1}) is the word vector of w_{t-1}, and the n (n>1) words before and after position t are used to predict the t-th word. Assuming the wrong word mapping is denoted E, E(w_{t-1}) is the word vector of the wrong word.
$$P(C(w_t)) = P(E(w_t)) = P(w_t \mid w_{t-k}, w_{t-k+1}, \ldots, w_{t-1}, w_{t+1}, \ldots, w_{t+k})$$
A window of appropriate size is taken as the context. When the word w_t is predicted, its corresponding node necessarily has a binary code, for example "010011". The goal is to predict each bit of this binary code so that the probability of the predicted word's binary code is maximized.
An output module 408, configured to perform wrong word identification according to the AC trie cache and the context analysis model and to output the wrong word recognition results;
On the basis of the segmentation result, the context analysis model searches the AC trie cache and computes the most probable potential erroneous word, gives the corrected word, performs fuzzy matching on the difference set between the correct word and the erroneous word to make the error-correction judgment, supplies other attributes of the erroneous word such as its error type, and finally generates and returns an erroneous-word entity class. The entity class contains the erroneous word, the correct word, the error type, and the context of the erroneous word in the text.
An annotation module 409, configured to input the wrong word recognition results into the wrong word review and annotation platform for manual annotation;
All wrong word identification records produced by the daily monitoring that are not yet on the annotation platform are appended to it. After logging in to the annotation platform, a user obtains wrong word identification records and manually marks each one as right or wrong. Each record is assigned to at most N users (N is configurable), and each user judges the records assigned to him or her independently, based on the context provided for the wrong word: if the user considers the identification correct, he or she clicks the "correct" button; if the user considers it mistaken, he or she clicks the "wrong" button. Because the platform assigns each record to multiple users, it pools the judgment of several people and reduces, to some extent, the risk of misjudgment caused by one person's error.
A feedback update module 410, configured to feed the annotation results obtained from manual annotation back into the wrong word dictionary and the core word bank;
Every night the annotation platform starts a scheduled script. The script first computes the final result of each wrong word record from all of its annotation results. For example, for a given wrong word identification record, if the number of users who consider it correct is greater than the number who consider it wrong, the final result of the record is "correct"; conversely, if more users consider it wrong than correct, the final result is "wrong". If the two numbers are equal, the result is provisionally "pending", and the record continues to be assigned to other users for annotation until a clear result is obtained.
After the final results of the annotated wrong word records have been computed, the program automatically selects the records whose identification was correct and adds them to the core word bank. It also selects the records whose identification was mistaken and deletes them from the wrong word dictionary, so that such words are not identified as wrong words again.
In this way, comprehensive machine identification of wrong words is supplemented by manual annotation, and the manual judgments are automatically fed back into the wrong word dictionary and the core word bank, making wrong word identification more comprehensive and accurate.
A display module 411, configured to display on the system page, from the wrong word recognition results obtained, those recognition results whose words are contained in the core word bank.
Because the wrong word dictionary is built by an algorithm according to fixed rules, it is larger and more comprehensive than one built by hand. However, the large number of wrong word records produced by daily monitoring inevitably contains some inaccurate identifications. Displaying every identification record on the page without screening would confuse customers. Therefore, a core word bank is established through the annotation platform: accurately identified words are put into the core word bank, which accumulates and grows over time, and only the wrong word identification records present in the core word bank are displayed, ensuring that the identifications shown on the system page are highly accurate.
In summary, in the above embodiment, similar characters are constructed from the basic information of Chinese characters and used to generate the wrong word dictionary. An AC trie cache is generated from the wrong word dictionary and, combined with the context analysis model, is used to monitor government websites automatically and continuously every day. Wrong words on a website can thus be located more comprehensively and efficiently, with an error level (such as serious wrong word or ordinary wrong word) and a suggested correction. In addition, the monitoring method incorporates manual annotation, which compensates for the shortcomings of machine recognition, refines the core word bank, and promotes the migration of wrong word dictionary entries from the periphery to the core, so that wrong words on websites are monitored more accurately.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the others, and for the parts that are identical or similar the embodiments may be referred to one another. Because the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief; for the relevant details, refer to the description of the method.
Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate clearly the interchangeability of hardware and software, the composition and steps of each example have been described above in general terms according to function. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of the present invention.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be implemented directly in hardware, in a software module executed by a processor, or in a combination of the two. The software module may reside in random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, a register, a hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the technical field.
The foregoing description of the disclosed embodiments enables those skilled in the art to make or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (6)

1. A wrong word monitoring method, characterized in that the method comprises:
building a wrong word dictionary;
collecting data from a target website to obtain website data;
pre-processing the acquired website data, parsing the web pages and removing noise to obtain text content;
performing word segmentation on the text content to obtain individual words;
building an AC automaton trie based on the wrong word dictionary and generating a cache;
building a context analysis model;
performing wrong word identification according to the AC trie cache and the context analysis model, and outputting wrong word recognition results.
2. The method according to claim 1, characterized in that building the wrong word dictionary comprises:
obtaining basic information of Chinese characters;
constructing similar characters for each character with a similar character algorithm based on the basic information of the Chinese characters, and generating the erroneous words corresponding to each word from the similar characters to obtain the wrong word dictionary.
3. The method according to claim 1, characterized by further comprising:
inputting the wrong word recognition results into a wrong word review and annotation platform for manual annotation;
feeding the annotation results obtained from manual annotation back into the wrong word dictionary and a core word bank;
from the wrong word recognition results obtained, displaying on the system page those recognition results whose words are contained in the core word bank.
4. A wrong word monitoring system, characterized by comprising:
a first construction module, configured to build a wrong word dictionary;
an acquisition module, configured to collect data from a target website to obtain website data;
a processing module, configured to pre-process the acquired website data, parse the web pages and remove noise to obtain text content;
a word segmentation module, configured to perform word segmentation on the text content to obtain individual words;
a second construction module, configured to build an AC automaton trie based on the wrong word dictionary and generate a cache;
a third construction module, configured to build a context analysis model;
an output module, configured to perform wrong word identification according to the AC trie cache and the context analysis model and to output wrong word recognition results.
5. The system according to claim 4, characterized in that the first construction module comprises:
an acquiring unit, configured to obtain basic information of Chinese characters;
a generation unit, configured to construct similar characters for each character with a similar character algorithm based on the basic information of the Chinese characters, and to generate the erroneous words corresponding to each word from the similar characters to obtain the wrong word dictionary.
6. The system according to claim 4, characterized by further comprising:
an annotation module, configured to input the wrong word recognition results into a wrong word review and annotation platform for manual annotation;
a feedback update module, configured to feed the annotation results obtained from manual annotation back into the wrong word dictionary and a core word bank;
a display module, configured to display on the system page, from the wrong word recognition results obtained, those recognition results whose words are contained in the core word bank.
CN201710946362.3A 2017-10-12 2017-10-12 A kind of wrong word monitoring method and system Pending CN107679036A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710946362.3A CN107679036A (en) 2017-10-12 2017-10-12 A kind of wrong word monitoring method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710946362.3A CN107679036A (en) 2017-10-12 2017-10-12 A kind of wrong word monitoring method and system

Publications (1)

Publication Number Publication Date
CN107679036A true CN107679036A (en) 2018-02-09

Family

ID=61139865

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710946362.3A Pending CN107679036A (en) 2017-10-12 2017-10-12 A kind of wrong word monitoring method and system

Country Status (1)

Country Link
CN (1) CN107679036A (en)

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145287A (en) * 2018-07-05 2019-01-04 广东外语外贸大学 Indonesian word error-detection error-correction method and system
CN109522558A (en) * 2018-11-21 2019-03-26 金现代信息产业股份有限公司 A kind of Chinese wrongly written character bearing calibration based on deep learning
CN110389875A (en) * 2019-03-29 2019-10-29 中国银联股份有限公司 Method, apparatus and storage medium for supervisory computer system operating status
CN110852091A (en) * 2019-11-11 2020-02-28 杭州安恒信息技术股份有限公司 Method and device for monitoring wrongly written characters, electronic equipment and computer readable medium
CN110928915A (en) * 2018-08-31 2020-03-27 北京京东金融科技控股有限公司 Method, device and equipment for fuzzy matching of Chinese names and readable storage medium
CN111737948A (en) * 2020-05-06 2020-10-02 福建天晴数码有限公司 Wrongly written character generation method and terminal
CN111984845A (en) * 2020-08-17 2020-11-24 江苏百达智慧网络科技有限公司 Website wrongly-written character recognition method and system
CN112084746A (en) * 2020-09-11 2020-12-15 广东电网有限责任公司 Entity identification method, system, storage medium and equipment
CN112613522A (en) * 2021-01-04 2021-04-06 重庆邮电大学 Method for correcting recognition result of medicine taking order based on fusion font information
CN112632213A (en) * 2020-12-03 2021-04-09 大箴(杭州)科技有限公司 Address information standardization method and device, electronic equipment and storage medium
CN113051925A (en) * 2019-12-26 2021-06-29 中国移动通信集团有限公司 Time identification method, device, equipment and computer storage medium
CN113672587A (en) * 2021-07-15 2021-11-19 福建拓尔通软件有限公司 New media update monitoring method, system, device and medium
CN113673227A (en) * 2021-07-15 2021-11-19 福建拓尔通软件有限公司 Method, system, equipment and medium for correcting wrongly written characters of WEB editor
CN113704233A (en) * 2021-10-29 2021-11-26 飞狐信息技术(天津)有限公司 Keyword detection method and system
CN115065671A (en) * 2022-03-04 2022-09-16 山谷网安科技股份有限公司 Method and system for realizing dynamically expandable wrong word detection service

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140298168A1 (en) * 2013-03-28 2014-10-02 Est Soft Corp. System and method for spelling correction of misspelled keyword
CN105159871A (en) * 2015-08-21 2015-12-16 小米科技有限责任公司 Text information detection method and apparatus
CN105260354A (en) * 2015-08-20 2016-01-20 及时标讯网络信息技术(北京)有限公司 Chinese AC (Aho-Corasick) automaton working method based on keyword dictionary tree structure
CN106777073A (en) * 2016-12-13 2017-05-31 深圳爱拼信息科技有限公司 The automatic method for correcting of wrong word and server in a kind of search engine
CN107193843A (en) * 2016-03-15 2017-09-22 阿里巴巴集团控股有限公司 A kind of character string selection method and device based on AC automatic machines and postfix expression

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140298168A1 (en) * 2013-03-28 2014-10-02 Est Soft Corp. System and method for spelling correction of misspelled keyword
CN105260354A (en) * 2015-08-20 2016-01-20 及时标讯网络信息技术(北京)有限公司 Chinese AC (Aho-Corasick) automaton working method based on keyword dictionary tree structure
CN105159871A (en) * 2015-08-21 2015-12-16 小米科技有限责任公司 Text information detection method and apparatus
CN107193843A (en) * 2016-03-15 2017-09-22 阿里巴巴集团控股有限公司 A kind of character string selection method and device based on AC automatic machines and postfix expression
CN106777073A (en) * 2016-12-13 2017-05-31 深圳爱拼信息科技有限公司 The automatic method for correcting of wrong word and server in a kind of search engine

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109145287A (en) * 2018-07-05 2019-01-04 广东外语外贸大学 Indonesian word error-detection error-correction method and system
CN109145287B (en) * 2018-07-05 2022-11-29 广东外语外贸大学 Indonesia word error detection and correction method and system
CN110928915A (en) * 2018-08-31 2020-03-27 北京京东金融科技控股有限公司 Method, device and equipment for fuzzy matching of Chinese names and readable storage medium
CN109522558A (en) * 2018-11-21 2019-03-26 金现代信息产业股份有限公司 A kind of Chinese wrongly written character bearing calibration based on deep learning
CN109522558B (en) * 2018-11-21 2024-01-12 金现代信息产业股份有限公司 Deep learning-based Chinese character-staggering correction method
CN110389875A (en) * 2019-03-29 2019-10-29 中国银联股份有限公司 Method, apparatus and storage medium for supervisory computer system operating status
CN110389875B (en) * 2019-03-29 2023-06-06 中国银联股份有限公司 Method, apparatus and storage medium for monitoring the operating state of a computer system
CN110852091A (en) * 2019-11-11 2020-02-28 杭州安恒信息技术股份有限公司 Method and device for monitoring wrongly written characters, electronic equipment and computer readable medium
CN110852091B (en) * 2019-11-11 2023-08-15 杭州安恒信息技术股份有限公司 Method, device, electronic equipment and computer readable medium for monitoring wrongly written characters
CN113051925A (en) * 2019-12-26 2021-06-29 中国移动通信集团有限公司 Time identification method, device, equipment and computer storage medium
CN111737948A (en) * 2020-05-06 2020-10-02 福建天晴数码有限公司 Wrongly written character generation method and terminal
CN111737948B (en) * 2020-05-06 2022-10-21 福建天晴数码有限公司 Wrongly-written character generation method and terminal
CN111984845B (en) * 2020-08-17 2023-10-31 江苏百达智慧网络科技有限公司 Website wrongly written word recognition method and system
CN111984845A (en) * 2020-08-17 2020-11-24 江苏百达智慧网络科技有限公司 Website wrongly-written character recognition method and system
CN112084746A (en) * 2020-09-11 2020-12-15 广东电网有限责任公司 Entity identification method, system, storage medium and equipment
CN112632213A (en) * 2020-12-03 2021-04-09 大箴(杭州)科技有限公司 Address information standardization method and device, electronic equipment and storage medium
CN112613522A (en) * 2021-01-04 2021-04-06 重庆邮电大学 Method for correcting recognition result of medicine taking order based on fusion font information
CN113672587A (en) * 2021-07-15 2021-11-19 福建拓尔通软件有限公司 New media update monitoring method, system, device and medium
CN113673227A (en) * 2021-07-15 2021-11-19 福建拓尔通软件有限公司 Method, system, equipment and medium for correcting wrongly written characters of WEB editor
CN113704233A (en) * 2021-10-29 2021-11-26 飞狐信息技术(天津)有限公司 Keyword detection method and system
CN113704233B (en) * 2021-10-29 2022-03-01 飞狐信息技术(天津)有限公司 Keyword detection method and system
CN115065671A (en) * 2022-03-04 2022-09-16 山谷网安科技股份有限公司 Method and system for realizing dynamically expandable wrong word detection service
CN115065671B (en) * 2022-03-04 2024-04-02 山谷网安科技股份有限公司 Method and system for realizing dynamically-extensible word-dislocation detection service

Similar Documents

Publication Publication Date Title
CN107679036A (en) A kind of wrong word monitoring method and system
CN105243129B (en) Item property Feature words clustering method
US7577963B2 (en) Event data translation system
CN106897559B (en) A kind of symptom and sign class entity recognition method and device towards multi-data source
CN103914494B (en) Method and system for identifying identity of microblog user
CN109753660B (en) LSTM-based winning bid web page named entity extraction method
JP6403382B2 (en) Phrase pair collection device and computer program therefor
CN106650943A (en) Auxiliary writing method and apparatus based on artificial intelligence
CN111190900B (en) JSON data visualization optimization method in cloud computing mode
JP2015121897A (en) Scenario generation device, and computer program for the same
JP5907393B2 (en) Complex predicate template collection device and computer program therefor
CN110008463B (en) Method, apparatus and computer readable medium for event extraction
CN106934069A (en) Data retrieval method and system
CN110297880B (en) Corpus product recommendation method, apparatus, device and storage medium
CN108920447B (en) Chinese event extraction method for specific field
CN112527981B (en) Open type information extraction method and device, electronic equipment and storage medium
CN107247613A (en) Sentence analytic method and sentence resolver
CN112784589B (en) Training sample generation method and device and electronic equipment
CN108062351A (en) Text snippet extracting method, readable storage medium storing program for executing on particular topic classification
CN106681981B (en) The mask method and device of Chinese part of speech
JP2020106880A (en) Information processing apparatus, model generation method and program
KR101429621B1 (en) Duplication news detection system and method for detecting duplication news
JP2003099442A (en) Key concept extraction rule preparing method, key concept extraction method, key concept extraction rule preparing device, key concept extraction device, and program and recording medium for them
CN110705285B (en) Government affair text subject word library construction method, device, server and readable storage medium
Torget et al. Mapping texts: Combining text-mining and geo-visualization to unlock the research potential of historical newspapers

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 2020-05-20

Address after: 410208 Hunan, Changsha, Yuelu science and Technology Industrial Park, No. 8, Changsha science and Technology Park, including science and Technology Industrial Park Development and Construction Co., Ltd., general services building, room 6018

Applicant after: Hunan Network Technology Co.,Ltd.

Address before: Park Road in Jiangning District of Nanjing city and Jiangsu province 211100 No. 18

Applicant before: NANJING WANGSHU INFORMATION TECHNOLOGY Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 2018-02-09