CN103576882B - Improper text recognition method and its system - Google Patents

Publication number
CN103576882B
Authority
CN
China
Prior art keywords
text
keyboard
identified
button
distance
Prior art date
Application number
CN201210264218.9A
Other languages
Chinese (zh)
Other versions
CN103576882A (en
Inventor
何小晨
张国强
郝志新
许春林
王长伟
Original Assignee
深圳市世纪光速信息技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳市世纪光速信息技术有限公司 filed Critical 深圳市世纪光速信息技术有限公司
Priority to CN201210264218.9A priority Critical patent/CN103576882B/en
Publication of CN103576882A publication Critical patent/CN103576882A/en
Application granted granted Critical
Publication of CN103576882B publication Critical patent/CN103576882B/en

Abstract

The present invention provides an improper text recognition method and system. The method comprises the following steps: for each character in a text to be identified, obtaining the key corresponding to the first letter typed when the character was input; and, according to the distribution on the keyboard of the keys corresponding to those first letters, classifying the text to be identified as normal text or improper text. By examining how these keys are distributed on the keyboard, the method and system can effectively identify most of the meaningless, improper text produced by random input. Moreover, because the recognition does not depend on word meaning or on the calculation of a text quality score, the recognition result is more objective and accurate.

Description

Improper text recognition method and its system

Technical field

The present invention relates to the technical field of text recognition, and in particular to an improper text recognition method and an improper text recognition system.

Background

Junk text filtering has long been an important technical component of search engines. Junk text usually refers to meaningless, improper text. Traditional junk text filtering relies on keyword lookup and the calculation of a text quality score, and can filter out some advertising content, pornographic content, politically sensitive content, and poorly formatted text with repeated content or too many non-standard characters.

However, in short-text search over microblogs, social-space status updates and the like, a certain amount of improper text (also called junk text) produced by random input has been found, for example: " rubbish such as flighty big real road Ka Sa armies of Lhasa science and technology, which is wantonly searched for, to be beaten Sui Liaojun Dallas adds Dallas to add Sutton water etc. ". This kind of improper text has the following characteristics: its content is fairly random, with few repeated entries; it contains many content words, so the text quality score calculated by conventional text filtering techniques is generally not low; and, because of the predictive (association) input feature of input methods, adjacent characters often have some degree of correlation, which makes the text hard to filter by semantic analysis. For these reasons, this kind of improper text is difficult to distinguish using traditional text filtering methods.

Summary of the invention

In view of the problems described in the background above, the object of the present invention is to provide an improper text recognition method, and an improper text recognition system, that can effectively identify improper text produced by random input.

An improper text recognition method comprises the following steps:

For each character in a text to be identified, obtaining the key corresponding to the first letter typed when the character was input;

According to the distribution on the keyboard of the keys corresponding to the first letters of the characters, classifying the text to be identified as normal text or improper text in one of the following ways:

According to a plurality of preset keyboard subregions, determining the distribution proportion of the obtained keys over each keyboard subregion; comparing the distribution proportion with a preset distribution proportion threshold; if the distribution proportion is greater than the distribution proportion threshold, classifying the text to be identified as improper text; otherwise, classifying the text to be identified as normal text;

Or,

Calculating the distance on the keyboard between the keys corresponding to every two adjacent characters in the text to be identified, and calculating the average of those distances; comparing the average distance with a preset average distance threshold; if the average distance is less than the average distance threshold, classifying the text to be identified as improper text; otherwise, classifying the text to be identified as normal text;

Or,

According to a plurality of preset keyboard subregions, determining the distribution proportion of the obtained keys over each keyboard subregion; calculating the distance on the keyboard between the keys corresponding to every two adjacent characters in the text to be identified, and calculating the average of those distances; and, according to the distribution proportion and the average distance, classifying the text to be identified as normal text or improper text in accordance with the preset classification criterion corresponding to each keyboard subregion, wherein the preset classification criterion corresponding to each keyboard subregion comprises a preset distribution proportion threshold and a preset average distance threshold.

An improper text recognition system comprises:

A key acquisition module, configured to obtain, for each character in a text to be identified, the key corresponding to the first letter typed when the character was input;

An identification module, configured to classify the text to be identified as normal text or improper text according to the distribution on the keyboard of the keys corresponding to the first letters of the characters;

Wherein the identification module comprises:

A distribution proportion calculation module, configured to determine, according to a plurality of preset keyboard subregions, the distribution proportion of the obtained keys over each keyboard subregion; a first comparison module, configured to compare the distribution proportion with a preset distribution proportion threshold; and a first classification module, configured to classify the text to be identified as improper text when the distribution proportion is greater than the distribution proportion threshold, and otherwise to classify the text to be identified as normal text;

Or,

A keyboard distance calculation module, configured to calculate the distance on the keyboard between the keys corresponding to every two adjacent characters in the text to be identified, and to calculate the average of those distances; a second comparison module, configured to compare the average distance with a preset average distance threshold; and a second classification module, configured to classify the text to be identified as improper text when the average distance is less than the average distance threshold, and otherwise to classify the text to be identified as normal text;

Or,

A distribution proportion calculation module, configured to determine, according to a plurality of preset keyboard subregions, the distribution proportion of the obtained keys over each keyboard subregion; a keyboard distance calculation module, configured to calculate the distance on the keyboard between the keys corresponding to every two adjacent characters in the text to be identified, and to calculate the average of those distances; and a third classification module, configured to classify the text to be identified as normal text or improper text according to the distribution proportion and the average distance, in accordance with the preset classification criterion corresponding to each keyboard subregion, wherein the preset classification criterion corresponding to each keyboard subregion comprises a preset distribution proportion threshold and a preset average distance threshold.

The improper text recognition method and system of the present invention obtain, under the corresponding input method, the key corresponding to the first letter typed for each character in the text to be identified, and examine how those keys are distributed on the keyboard. Because meaningless, randomly typed improper text is usually produced by pressing a handful of keys in a relatively concentrated region of the keyboard, the distribution on the keyboard of the keys corresponding to the first letters of the characters makes it possible to effectively identify most meaningless, improper text produced by random input. Moreover, because the recognition does not depend on word meaning or on the calculation of a text quality score, the recognition result is more objective and accurate.

Brief description of the drawings

Fig. 1 is a flowchart of the first embodiment of the improper text recognition method of the present invention;

Fig. 2 is a partial flowchart of step S102 in the first embodiment of the improper text recognition method of the present invention;

Fig. 3 is a schematic diagram of one way of setting the keyboard subregions in the improper text recognition method of the present invention;

Fig. 4 is a partial flowchart of step S102 in the second embodiment of the improper text recognition method of the present invention;

Fig. 5 is a partial flowchart of step S102 in the third embodiment of the improper text recognition method of the present invention;

Fig. 6 is a structural diagram of the first embodiment of the improper text recognition system of the present invention;

Fig. 7 is a structural diagram of the identification module in the first embodiment of the improper text recognition system of the present invention;

Fig. 8 is a structural diagram of the identification module in the second embodiment of the improper text recognition system of the present invention;

Fig. 9 is a structural diagram of the identification module in the third embodiment of the improper text recognition system of the present invention.

Detailed description of the embodiments

Referring to Fig. 1, Fig. 1 is a flowchart of the first embodiment of the improper text recognition method of the present invention.

The improper text recognition method comprises the following steps S101 and S102:

S101: for each character in the text to be identified, obtain the key corresponding to the first letter typed when the character was input.

The text to be identified is the text on which recognition is to be performed. Its characters may include Chinese characters, English letters, digits, symbols and the like, or any combination of one or more of these. In particular, the present invention works well on text composed of Chinese characters.

The text to be identified is preferably longer than a preset value, that is, it preferably contains more than a certain number of characters: the shorter the text and the fewer characters it contains, the blurrier the boundary between normal and improper text and the harder the recognition. Therefore, when the improper text recognition method of the present invention is actually performed, a minimum text length may be preset; the method is performed only on text longer than this minimum length, and is otherwise not performed.

The key corresponding to the first letter of a character's input is the first key pressed when typing that character. For example, when an English letter is typed into a computer, the key corresponding to its first letter is the key bearing that letter.

The key corresponding to the first letter of a character's input may be obtained by means of a lookup table. Preferably, in step S101, a pre-established mapping table is looked up for each character in the text to be identified to obtain the corresponding key; the mapping table records each character together with the key corresponding to the first letter of its input.

That is, the preset mapping table establishes the correspondence between input characters and keys. It suffices to look up the mapping table for the text in order to obtain the key corresponding to each character in the text to be identified, which is direct and convenient.

Alternatively, the key may be derived from the rules of the input method used when the character was typed. For example, for English letters in the text to be identified, the key for each English letter is obtained directly and taken as the key corresponding to the first letter of that letter's input.

When the characters in the text to be identified are Chinese characters typed with a pinyin input method, the key corresponding to the first pinyin letter of each Chinese character is obtained and taken as the key corresponding to the first letter of that character's input.

Because the first key pressed when typing a Chinese character with a pinyin input method is necessarily the key of the first letter of its pinyin, the first key pressed when the Chinese character was typed into the computer can be obtained in this way.

Deriving the keys from the rules of the different input methods avoids having to build a large mapping table. Following the approach described in the present invention, those skilled in the art can obtain the keys for characters typed with other input methods according to the rules of those input methods.

Preferably, in step S101, the keys a-z on the computer keyboard are labelled with 26 distinct identifiers, for example the numbers 1-26, and punctuation marks and digits are all labelled with one shared identifier, for example the number 0. Once a key has been obtained, it can be recorded by its identifier to facilitate statistical processing by the computer.
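
Step S101 can be sketched as follows. This is only an illustration under stated assumptions: the table `PINYIN_FIRST_LETTER` is a hypothetical four-entry stand-in for a full character-to-pinyin mapping table, the function names are invented for this sketch, and characters missing from the table are given the shared identifier 0 (a simplification that the source does not specify).

```python
import string

# Hypothetical, tiny mapping table: Chinese character -> first pinyin letter.
# A real table would cover the whole character set in use.
PINYIN_FIRST_LETTER = {"垃": "l", "圾": "j", "文": "w", "本": "b"}


def first_key(ch, table=PINYIN_FIRST_LETTER):
    """Return the first key pressed to type `ch`, or None if it is not a letter key."""
    lower = ch.lower()
    if len(lower) == 1 and lower in string.ascii_lowercase:
        return lower          # English letter: the key is the letter itself
    return table.get(ch)      # Chinese character: first letter of its pinyin


def key_id(ch, table=PINYIN_FIRST_LETTER):
    """Encode keys a-z as 1..26; punctuation, digits and unknown characters as 0."""
    key = first_key(ch, table)
    if key is None:
        return 0
    return ord(key) - ord("a") + 1


def encode(text, table=PINYIN_FIRST_LETTER):
    """Encode every character of `text` as a key identifier."""
    return [key_id(ch, table) for ch in text]
```

For instance, `encode("垃圾A1")` encodes the two Chinese characters by the keys L and J, the letter A by its own key, and the digit by the shared identifier 0.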

S102: according to the distribution on the keyboard of the keys corresponding to the first letters of the characters, classify the text to be identified as normal text or improper text.

Meaningless, randomly typed improper text is usually produced by pressing a handful of keys in a relatively concentrated region of the keyboard; that is, random typing does not normally hit keys evenly across the whole keyboard. The distribution on the keyboard of the keys corresponding to the characters therefore makes it possible to effectively identify most meaningless, improper text produced by random input. For example, if the keys corresponding to the characters are relatively concentrated, the text to be identified is judged to be improper text; if they are relatively scattered, it is judged to be normal text. The criterion for deciding whether the distribution is concentrated or scattered can be obtained from statistics, or from training samples by machine learning.

Referring to Fig. 2, Fig. 2 is a partial flowchart of step S102 in the first embodiment of the improper text recognition method of the present invention.

In this embodiment, the text to be identified may be classified as normal text or improper text in the following way, that is, step S102 comprises:

S201: according to a plurality of preset keyboard subregions, determine the distribution proportion of the obtained keys over each keyboard subregion;

S202: compare the distribution proportion with a preset distribution proportion threshold;

If the distribution proportion is greater than the distribution proportion threshold, perform step S203 and classify the text to be identified as improper text; otherwise, perform step S204 and classify the text to be identified as normal text.

Here, "a plurality of" as used in the present invention means two or more. The keyboard subregions are preset; each keyboard subregion comprises several keys that are adjacent to one another, and the subregions can be set according to the layout of the keys on the keyboard.

Referring to Fig. 3, Fig. 3 is a schematic diagram of one way of setting the keyboard subregions in the improper text recognition method of the present invention. In this setting there are 7 keyboard subregions: the first keyboard subregion comprises keys Q, W, E, R, T, Y, U, I, O, P; the second keyboard subregion comprises keys A, S, D, F, G, H, J, K, L; the third keyboard subregion comprises keys Z, X, C, V, B, N, M; the fourth keyboard subregion comprises keys W, E, R, T, S, D, F, G; the fifth keyboard subregion comprises keys Y, U, I, O, H, J, K, L; the sixth keyboard subregion comprises keys S, D, F, G, X, C, V, B; the seventh keyboard subregion comprises keys H, J, K, L, N, M.

With these subregions, the distribution proportion of the obtained keys over each keyboard subregion can be determined. Take for example the text to be identified: " rubbish such as flighty big real road Ka Sa armies of Lhasa science and technology, which is wantonly searched for, smashes Liao Jun Dallas Dallas is added to add Sutton water etc. ". The first pinyin letters of its characters are " sjdslkjdsjdlksjdljodsdsljdlsjdlsjsdsd ", and the keys corresponding to these first pinyin letters are the keys corresponding to the first letters of the characters' input.

That is, 37 keys are obtained in total. Of these, 36 fall in the second keyboard subregion, a distribution proportion of 97.3%; 1 falls in the first keyboard subregion, a distribution proportion of 2.7%; 20 fall in the fourth or sixth keyboard subregion, or 54%; and 17 fall in the fifth or seventh keyboard subregion, or 46%.

The distribution proportion of the keys over each keyboard subregion can therefore be compared with the preset distribution proportion threshold, and the text to be identified classified as improper text or normal text according to the result of the comparison.

The distribution proportion thresholds of the keyboard subregions may be identical, or may be set to different values for different subregions. Preferably, multi-level distribution proportion thresholds may be set for each keyboard subregion. For example, the first-level distribution proportion threshold is set to 90%, the second-level threshold to 70% and the third-level threshold to 40%. The text to be identified is then classified as improper text when the distribution proportion of any one keyboard subregion is higher than 90%, or when the distribution proportions of two keyboard subregions are each higher than 70%, or when the distribution proportions of three keyboard subregions are each higher than 40%.
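
The subregion proportions and the multi-level thresholds described above can be sketched as follows. The seven subregions and the 90% / 70% / 40% levels are taken from the text; the names `ZONES`, `zone_ratios` and `is_improper_by_zone` are invented for this sketch, and the input is assumed to be the already-extracted string of first letters.

```python
import string

# The seven keyboard subregions described with reference to Fig. 3.
ZONES = {
    1: set("qwertyuiop"),
    2: set("asdfghjkl"),
    3: set("zxcvbnm"),
    4: set("wertsdfg"),
    5: set("yuiohjkl"),
    6: set("sdfgxcvb"),
    7: set("hjklnm"),
}


def zone_ratios(keys):
    """Fraction of the letter keys in `keys` that fall in each subregion."""
    letters = [k for k in keys.lower() if k in string.ascii_lowercase]
    if not letters:
        return {zone: 0.0 for zone in ZONES}
    return {
        zone: sum(k in members for k in letters) / len(letters)
        for zone, members in ZONES.items()
    }


def is_improper_by_zone(keys, levels=((0.9, 1), (0.7, 2), (0.4, 3))):
    """Apply multi-level thresholds: each (threshold, needed) pair fires when at
    least `needed` subregions have a proportion above `threshold`."""
    ratios = zone_ratios(keys)
    return any(
        sum(r > threshold for r in ratios.values()) >= needed
        for threshold, needed in levels
    )
```

Applied to the first-letter string of the example, `zone_ratios` gives 36 of 37 keys in the second subregion and 1 of 37 in the first, so the first-level threshold alone is enough to classify the text as improper. Because the subregions overlap, the lower levels can fire easily; the levels here are only the example values from the text, not tuned settings.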

Referring to Fig. 4, Fig. 4 is a partial flowchart of step S102 in the second embodiment of the improper text recognition method of the present invention.

In this embodiment, the text to be identified may also be classified as normal text or improper text in the following way, that is, step S102 comprises:

S211: calculate the distance on the keyboard between the keys corresponding to every two adjacent characters in the text to be identified, and calculate the average of those distances;

S212: compare the average distance with a preset average distance threshold;

If the average distance is less than the average distance threshold, perform step S213 and classify the text to be identified as improper text; otherwise, perform step S214 and classify the text to be identified as normal text.

This embodiment judges whether the text to be identified is improper text from the distance on the keyboard between the keys corresponding to adjacent characters. Randomly typed improper text may also be produced by sweeping continuously across the keys of the keyboard. For example, sweeping across the keys QWERTYUIOPLKJHGFDSA in turn produces the randomly typed improper text: " go to play physical culture i Aupres card slows down spreading for public expense ". When this text is identified with this embodiment, the keyboard distance between every two adjacent keys in QWERTYUIOPLKJHGFDSA is determined, giving an average distance of 1.0, which is less than the preset average distance threshold (for example, 2.0); the text to be identified is therefore classified as improper text.

Preferably, when calculating the distance on the keyboard between the keys corresponding to every two adjacent characters in the text to be identified, the horizontal distance and the vertical distance on the keyboard may be given different weights. That is, from the horizontal distance and the vertical distance on the keyboard between the keys corresponding to every two adjacent characters in the text to be identified, a weighted keyboard distance is calculated according to the following formula:

Dist = x + α·y

where Dist is the calculated weighted keyboard distance, x is the horizontal distance, y is the vertical distance, and α is the weight of the vertical distance relative to the horizontal distance, with α > 1.

Because it is generally considered that, when a user taps the keyboard at random, moving vertically costs more than moving horizontally, the weight α is usually set greater than 1. For example, with α set to 2, the horizontal distance on the keyboard between the letters S and T is 2.5 and the vertical distance is 1, so their weighted keyboard distance is 2.5 + 2 × 1 = 4.5. Suppose the text contains N characters (counting only Chinese characters and English letters, and excluding digits, punctuation and non-standard characters). The N-1 keyboard distances between every two adjacent characters are then calculated and averaged, and the text to be identified is classified by comparing this average distance with the average distance threshold.
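
The weighted distance can be sketched as follows. The key coordinates are an assumption: the source gives no coordinate table, so the 0.5-key stagger of the middle row is inferred from the S to T example (horizontal distance 2.5), and the 1.0-key offset of the bottom row is a guess made for this sketch. The names `KEY_POS`, `key_distance` and `mean_key_distance` are likewise invented.

```python
ROWS = ("qwertyuiop", "asdfghjkl", "zxcvbnm")
# Horizontal offset of each row, in key widths. 0.5 for the middle row matches
# the S-T example in the text; 1.0 for the bottom row is an assumption.
ROW_OFFSET = (0.0, 0.5, 1.0)

KEY_POS = {
    key: (col + ROW_OFFSET[row], row)
    for row, keys in enumerate(ROWS)
    for col, key in enumerate(keys)
}


def key_distance(a, b, alpha=2.0):
    """Weighted keyboard distance Dist = x + alpha * y between two letter keys."""
    (xa, ya), (xb, yb) = KEY_POS[a.lower()], KEY_POS[b.lower()]
    return abs(xa - xb) + alpha * abs(ya - yb)


def mean_key_distance(keys, alpha=2.0):
    """Average of the N-1 distances between adjacent letter keys.

    Non-letter keys are skipped; returns None when fewer than two letters remain.
    """
    letters = [k for k in keys.lower() if k in KEY_POS]
    if len(letters) < 2:
        return None
    dists = [key_distance(a, b, alpha) for a, b in zip(letters, letters[1:])]
    return sum(dists) / len(dists)
```

With these coordinates and α = 2, `key_distance("S", "T")` reproduces the 4.5 of the worked example. For the swept sequence QWERTYUIOPLKJHGFDSA the average comes out slightly above 1, because the single step from P to L changes rows, and it remains well below the example threshold of 2.0.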

Referring to Fig. 5, Fig. 5 is a partial flowchart of step S102 in the third embodiment of the improper text recognition method of the present invention.

In this embodiment, the two criteria above, namely the distribution proportion of the keys and the average distance between the keys, are used together as the basis for judging whether the text to be identified is improper text. That is, step S102 comprises:

S221: according to a plurality of preset keyboard subregions, determine the distribution proportion of the obtained keys over each keyboard subregion;

S222: calculate the distance on the keyboard between the keys corresponding to every two adjacent characters in the text to be identified, and calculate the average of those distances;

S223: according to the distribution proportion and the average distance, classify the text to be identified as normal text or improper text in accordance with the preset classification criterion corresponding to each keyboard subregion, wherein the preset classification criterion corresponding to each keyboard subregion comprises a preset distribution proportion threshold and a preset average distance threshold.

Using the distribution proportion of the keys and the average distance between the keys together as the basis for judging whether the text to be identified is improper text makes the recognition result more accurate.

Preferably, to further improve the accuracy of the recognition result, the preset classification criterion corresponding to each keyboard subregion comprises a plurality of preset distribution proportion thresholds, together with a plurality of average distance thresholds each corresponding to one of the distribution proportion thresholds. Multiple thresholds are set in this way, making the recognition result more accurate.

Further, since punctuation marks and digits are generally unlikely to appear in randomly typed improper text, step S102 may additionally take into account the number of punctuation marks or digits in the text to be identified.

That is, the proportion of digits or symbols in the text to be identified is further obtained in step S102;

And, according to the distribution proportion of the keys, the average distance and the proportion of digits or symbols, the text to be identified is classified as normal text or improper text in accordance with the preset classification criterion corresponding to each keyboard subregion, wherein the preset classification criterion corresponding to each keyboard subregion comprises a preset distribution proportion threshold, a preset average distance threshold and a preset proportion of digits or symbols.

Using the number of symbols or digits as an additional criterion further improves the ability to identify improper text.

In particular, the case in which the preset classification criterion sets multiple thresholds simultaneously on the distribution proportion of the keys, the average distance between the keys, and the proportion of digits or symbols in the text to be identified is illustrated as follows:

The preset classification criterion for the first keyboard subregion described above may be implemented by program code such as the following:

if (letterCounter >= 15 &&
    ((UpLetterRatio > 0.4 && LetterRepeatTimes[0] == 0 && meanKeyDist < 1.1) ||
     (UpLetterRatio > 0.5 && ((LetterRepeatTimes[0] == 0 && meanKeyDist < 2.2) || meanKeyDist < 1.1)) ||
     (UpLetterRatio > 0.75 && meanKeyDist < 2.2) ||
     UpLetterRatio > 0.9))
    return true;

Here, letterCounter is the number of characters in the text to be identified; UpLetterRatio is the distribution proportion of the keys over the first keyboard subregion; meanKeyDist is the average distance between the keys; and LetterRepeatTimes[0] is the number of times punctuation marks and digits occur.

The statement return true in the program code denotes the operation of classifying the text to be identified as improper text.
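
For readers who prefer a runnable form, the condition above can be ported to Python as follows. The thresholds are copied from the code fragment; the snake_case parameter names are renamings made for this sketch, and returning False when no branch matches is an assumption, since the original fragment shows only the `return true` branch.

```python
def is_improper(letter_counter, up_letter_ratio, mean_key_dist, non_letter_count):
    """Python port of the example criterion for the first keyboard subregion.

    letter_counter   -- number of characters in the text (letterCounter)
    up_letter_ratio  -- share of keys in the first subregion (UpLetterRatio)
    mean_key_dist    -- average distance between adjacent keys (meanKeyDist)
    non_letter_count -- count of punctuation marks and digits (LetterRepeatTimes[0])
    """
    if letter_counter < 15:
        return False  # text too short to judge (assumed fall-through result)
    if up_letter_ratio > 0.9:
        return True
    if up_letter_ratio > 0.75 and mean_key_dist < 2.2:
        return True
    if up_letter_ratio > 0.5 and (
        (non_letter_count == 0 and mean_key_dist < 2.2) or mean_key_dist < 1.1
    ):
        return True
    if up_letter_ratio > 0.4 and non_letter_count == 0 and mean_key_dist < 1.1:
        return True
    return False
```

The branches show the multi-threshold idea: the higher the share of keys in the subregion, the weaker the additional evidence required from the average distance and from the absence of punctuation and digits.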

The preset thresholds in the preset classification criterion can be set from statistics over a large amount of sample data; alternatively, a classifier can be generated by applying machine learning techniques to a large number of training samples for text recognition. The advantage of machine learning is that the thresholds need not be set by hand: it is only necessary to learn from a large number of correctly identified training samples. However, producing a large number of training samples is laborious, and because the classification criterion generated by machine learning is relatively complex, it imposes a larger computational burden during online recognition. When the preset classification criterion is instead set from manually compiled statistics over a large amount of sample data, the thresholds in the criterion can be modified at an early stage according to feedback from the online recognition results, quickly reducing the number of recognition errors.
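
As one possible reading of the statistical approach, the sketch below picks an average distance threshold by an exhaustive sweep over hand-labelled samples, keeping the candidate with the highest accuracy. This is an illustrative assumption, not a procedure given in the source; the function name and the sample format are invented for this sketch.

```python
def best_distance_threshold(samples):
    """Pick the threshold t that maximises accuracy of 'improper if distance < t'.

    samples -- non-empty list of (mean_key_distance, is_improper) pairs that
               have been labelled by hand.
    """
    # Candidate thresholds: every observed distance, plus one value above them
    # all so that "everything is improper" is also considered.
    candidates = sorted({distance for distance, _ in samples})
    candidates.append(candidates[-1] + 1.0)

    def accuracy(threshold):
        hits = sum((distance < threshold) == label for distance, label in samples)
        return hits / len(samples)

    # max() keeps the first (smallest) candidate among ties.
    return max(candidates, key=accuracy)
```

A threshold chosen this way can be adjusted by rerunning the sweep whenever feedback from online recognition adds newly labelled samples.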

Referring to Fig. 6, Fig. 6 is a structural diagram of the first embodiment of the improper text recognition system of the present invention.

The improper text recognition system comprises a key acquisition module 11 and an identification module 12. The key acquisition module 11 is configured to obtain, for each character in the text to be identified, the key corresponding to the first letter typed when the character was input; the identification module 12 is configured to classify the text to be identified as normal text or improper text according to the distribution on the keyboard of the keys corresponding to the first letters of the characters.

Meaningless, randomly typed improper text is usually produced by pressing a handful of keys in a relatively concentrated region of the keyboard; that is, random typing does not normally hit keys evenly across the whole keyboard. The distribution on the keyboard of the keys corresponding to the characters therefore makes it possible to effectively identify most meaningless, improper text produced by random input.

Wherein, the text to be identified is the text for needing to be identified.Word in the text to be identified includes Middle word, English alphabet, numeral, symbol etc. or one or more kinds of combination therein.Especially, the present invention is right There is preferable recognition effect in the text being made up of Chinese character.

And the text to be identified is preferably the text that length is more than certain preset value, i.e., its word number included is preferably super Cross certain quantity because text is shorter, comprising word it is fewer, the boundary between normal and improper text is fuzzyyer, identification Difficulty it is bigger.Therefore the present invention can preset the minimum to text requirement when actually performing improper text identification Length, the minimum length is more than to text size, just performs improper text identification, otherwise do not performed improper text and know Not.

Button corresponding to the initial of the word input, it is first click when inputting each described word Button, such as English alphabet input computer when, button corresponding to initial is the button where the English alphabet.

The button acquisition module 11 is obtained by way of establishing look-up table corresponding to the initial of the word input Button.Preferably, each word of the button acquisition module 11 in the text to be identified, searches what is pre-established Mapping table, the button corresponding to acquisition;Wherein, the word and word input are recorded in the mapping table Initial corresponding to button.

That is, in the default mapping table, establish the word of input and the corresponding pass of the corresponding button System.Only need the mapping table according to text search, you can obtain institute corresponding to each word in the text to be identified Button is stated, this mode compares direct convenience.

The key acquisition module 11 may also derive the key corresponding to the initial of each character's input from the rules of the input method used when the character was input. For example, for English letters in the text to be identified, the key corresponding to each English letter is obtained directly and taken as the key corresponding to the initial of that letter's input.

When the characters in the text to be identified are Chinese characters entered with a pinyin input method, the key acquisition module 11 obtains the key corresponding to the first pinyin letter of each Chinese character and takes it as the key corresponding to the initial of that Chinese character's input.

Because the first key pressed when entering a Chinese character with a pinyin input method is necessarily the key of the first letter of that character's pinyin, the first key pressed when the character was input into the computer can be obtained in this way.

By deriving the key corresponding to the initial of each character's input from the rules of the different input methods, the key acquisition module 11 need not establish a mapping table with a large data volume. Following the method described above, those skilled in the art can likewise obtain the key corresponding to the initial of other characters' input according to the rules of the corresponding input method.

Preferably, after obtaining the first key, the key acquisition module 11 marks the keys a-z with 26 different identifiers, such as the numbers 1-26, and marks punctuation marks and digits with one shared identifier, such as the number 0. Once a key is obtained, it can be recorded by its identifier, which eases statistical processing by computer.
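A minimal sketch of this identifier scheme, assuming the example values given above (1-26 for a-z, 0 for punctuation and digits). The function name `key_id` is hypothetical, and mapping every other unrecognized character to 0 as well is an assumption of this sketch rather than something the description states.

```python
import string


def key_id(key):
    """Map a key to its identifier: a-z -> 1..26, everything else -> 0.

    Digits and punctuation share identifier 0, as in the description;
    treating any other non-letter input as 0 is an assumption.
    """
    k = key.upper()
    if len(k) == 1 and k in string.ascii_uppercase:
        return ord(k) - ord("A") + 1
    return 0


if __name__ == "__main__":
    print([key_id(k) for k in "Az7,"])
```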

Referring to Fig. 7, Fig. 7 is a structural diagram of the identification module in the first embodiment of the improper text recognition system of the present invention.

In this embodiment, the identification module 12 includes:

Distribution proportion calculation module 201, configured to determine, according to multiple preset keyboard partitions, the distribution proportion of the obtained keys over each keyboard partition;

First comparison module 202, configured to compare the distribution proportion with a preset distribution proportion threshold;

First division module 203, configured to classify the text to be identified as improper text when the distribution proportion exceeds the distribution proportion threshold, and as normal text otherwise.

Here, "multiple" as used in the present invention means two or more. The keyboard partitions are preset; each contains several keys adjacent to one another, and can be set according to the layout of the keys on the keyboard.

One way of setting the keyboard partitions is to define seven partitions: the first keyboard partition includes keys Q, W, E, R, T, Y, U, I, O, P; the second includes keys A, S, D, F, G, H, J, K, L; the third includes keys Z, X, C, V, B, N, M; the fourth includes keys W, E, R, T, S, D, F, G; the fifth includes keys Y, U, I, O, H, J, K, L; the sixth includes keys S, D, F, G, X, C, V, B; and the seventh includes keys H, J, K, L, N, M.

The distribution proportion thresholds of the keyboard partitions may be identical or may be set to different values. Preferably, multi-level distribution proportion thresholds can be set for each keyboard partition, for example a first-level threshold of 90%, a second-level threshold of 70%, and a third-level threshold of 40%. The text to be identified is then classified as improper text when the distribution proportion of any one keyboard partition exceeds 90%, or when the proportions of two keyboard partitions each exceed 70%, or when the proportions of three keyboard partitions each exceed 40%.
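The partition-based test can be sketched as below, using the seven example partitions and the example 90% / 70% / 40% levels from the description. The names `PARTITIONS`, `LEVELS`, `partition_proportions`, and `is_improper_by_distribution` are hypothetical, and the sketch assumes the input is already a sequence of first keys (letters A-Z); anything else is ignored.

```python
import string

# The seven example partitions from the description (they overlap on purpose).
PARTITIONS = [
    set("QWERTYUIOP"),
    set("ASDFGHJKL"),
    set("ZXCVBNM"),
    set("WERTSDFG"),
    set("YUIOHJKL"),
    set("SDFGXCVB"),
    set("HJKLNM"),
]

# (threshold, minimum number of partitions that must exceed it)
LEVELS = [(0.90, 1), (0.70, 2), (0.40, 3)]


def partition_proportions(keys):
    """Share of the letter keys that falls into each partition."""
    letters = [k.upper() for k in keys if k.upper() in string.ascii_uppercase]
    if not letters:
        return [0.0] * len(PARTITIONS)
    return [sum(k in part for k in letters) / len(letters) for part in PARTITIONS]


def is_improper_by_distribution(keys):
    """True if any level is met: 1 partition > 90%, 2 > 70%, or 3 > 40%."""
    props = partition_proportions(keys)
    return any(sum(p > thr for p in props) >= need for thr, need in LEVELS)


if __name__ == "__main__":
    print(is_improper_by_distribution("ASDFJKLASDFJKL"))
    print(is_improper_by_distribution("THEQUICKBROWNFOXJUMPS"))
```

Because the partitions overlap, a single key can count toward several partitions, which is why the lower levels require more than one partition to exceed the threshold.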

Referring to Fig. 8, Fig. 8 is a structural diagram of the identification module in the second embodiment of the improper text recognition system of the present invention.

In this embodiment, the identification module 12 includes:

Keyboard distance calculation module 211, configured to calculate the keyboard distance between the keys corresponding to each two adjacent characters in the text to be identified, and to calculate the average of these distances;

Second comparison module 212, configured to compare the average distance with a preset average distance threshold;

Second division module 213, configured to classify the text to be identified as improper text when the average distance is below the average distance threshold, and as normal text otherwise.

This embodiment judges whether the text to be identified is improper text according to the keyboard distance between the keys corresponding to each two adjacent characters. Randomly input improper text may also be produced by sweeping continuously across the keys of the keyboard, for example across the keys QWERTYUIOPLKJHGFDSA in turn, producing random text such as: "go to play physical culture i Aupres card slows down spreading for public expense". When this embodiment identifies such text, it determines the keyboard distance between each two adjacent keys in QWERTYUIOPLKJHGFDSA and obtains an average distance of 1.0, which is below the preset average distance threshold (for example, 2.0); the text to be identified is therefore classified as improper text.

Preferably, when the keyboard distance calculation module 211 calculates the keyboard distance between the keys corresponding to each two adjacent characters in the text to be identified, the lateral distance and the longitudinal distance between the keys can be given different weights. That is, the keyboard distance calculation module 211 takes the lateral distance and the longitudinal distance on the keyboard between the keys corresponding to each two adjacent characters, and calculates the weighted keyboard distance according to the following formula:

Dist = x + αy

where Dist is the calculated weighted keyboard distance, x is the lateral distance, y is the longitudinal distance, and α is the weighting factor between lateral and longitudinal distance, with α > 1.

It is generally considered that, when a user strikes the keyboard at random, moving vertically costs more than moving horizontally, so the weighting factor is usually set to α > 1. For example, with α = 2, the letters S and T have a lateral distance of 2.5 and a longitudinal distance of 1 on the keyboard, so their weighted keyboard distance is 2.5 + 2 × 1 = 4.5. Assuming the text contains N characters (counting only Chinese characters and English letters, and excluding digits, punctuation, and non-standard characters), the N-1 keyboard distances between each pair of adjacent characters are calculated and averaged, and the text to be identified is classified by comparing this average with the average distance threshold.
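The weighted distance and its average can be sketched as follows. The key coordinates are an assumption: a half-key stagger for the A row reproduces the S-T example (lateral distance 2.5), while the one-key stagger for the Z row is a guess not given in the description. All names (`KEY_POS`, `weighted_distance`, `average_distance`, `is_improper_by_distance`) are hypothetical, and the default threshold of 2.0 simply reuses the example value from the second embodiment.

```python
ROWS = ["QWERTYUIOP", "ASDFGHJKL", "ZXCVBNM"]
# Horizontal stagger per row, in key widths. 0.5 for the A row matches the
# S-T example in the description; 1.0 for the Z row is an assumption.
ROW_OFFSETS = [0.0, 0.5, 1.0]

# key -> (x, y) position on the keyboard grid
KEY_POS = {
    key: (col + offset, row)
    for row, (keys, offset) in enumerate(zip(ROWS, ROW_OFFSETS))
    for col, key in enumerate(keys)
}


def weighted_distance(a, b, alpha=2.0):
    """Dist = x + alpha * y for two letter keys."""
    (xa, ya), (xb, yb) = KEY_POS[a.upper()], KEY_POS[b.upper()]
    return abs(xa - xb) + alpha * abs(ya - yb)


def average_distance(keys, alpha=2.0):
    """Mean of the N-1 distances between adjacent keys, or None if N < 2."""
    pairs = list(zip(keys, keys[1:]))
    if not pairs:
        return None
    return sum(weighted_distance(a, b, alpha) for a, b in pairs) / len(pairs)


def is_improper_by_distance(keys, threshold=2.0, alpha=2.0):
    """True when the average adjacent-key distance is below the threshold."""
    avg = average_distance(keys, alpha)
    return avg is not None and avg < threshold


if __name__ == "__main__":
    print(weighted_distance("S", "T"))
    print(is_improper_by_distance("QWERTYUIOPLKJHGFDSA"))
```

Under these assumptions the swept sequence QWERTYUIOPLKJHGFDSA averages slightly above 1 because of the single row change from P to L, which is still well below the example threshold of 2.0.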

Referring to Fig. 9, Fig. 9 is a structural diagram of the identification module in the third embodiment of the improper text recognition system of the present invention.

In this embodiment, the identification module 12 includes:

Distribution proportion calculation module 201, configured to determine, according to multiple preset keyboard partitions, the distribution proportion of the obtained keys over each keyboard partition;

Keyboard distance calculation module 211, configured to calculate the keyboard distance between the keys corresponding to each two adjacent characters in the text to be identified, and to calculate the average of these distances;

Third division module 221, configured to classify the text to be identified as normal text or improper text from the distribution proportion and the average distance, according to the preset division criteria corresponding to each keyboard partition; the preset division criteria corresponding to each keyboard partition include a preset distribution proportion threshold and a preset average distance threshold.

This approach uses both the distribution proportion of the keys and the average distance between the keys as the basis for judging whether the text to be identified is improper text, which makes the recognition result more accurate.

Preferably, to further improve the accuracy of the recognition result, the preset division criteria corresponding to each keyboard partition include multiple preset distribution proportion thresholds and multiple average distance thresholds, each corresponding to one of the distribution proportion thresholds. Multiple threshold levels are set in this way, making the recognition result more accurate.

In addition, since punctuation marks and digits generally occur with low probability in randomly input improper text, the identification module 12 may further identify the text according to the number of punctuation marks or digits in it.
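The description does not spell out how the paired thresholds are combined, so the following is only one plausible reading: each keyboard partition carries a list of (distribution proportion threshold, average distance threshold) pairs, and the text is treated as improper if, for some partition and some pair, the proportion exceeds the first value while the average distance stays below the second. The function name `is_improper_combined` and all threshold values are hypothetical.

```python
def is_improper_combined(proportions, avg_distance, criteria):
    """Combine per-partition proportions with the average key distance.

    proportions  -- one distribution proportion per keyboard partition
    avg_distance -- average keyboard distance between adjacent keys
    criteria     -- per partition, a list of (proportion_threshold,
                    distance_threshold) pairs; values are illustrative
    """
    for proportion, pairs in zip(proportions, criteria):
        for proportion_threshold, distance_threshold in pairs:
            if proportion > proportion_threshold and avg_distance < distance_threshold:
                return True
    return False


if __name__ == "__main__":
    # Hypothetical criteria: the same two threshold pairs for all seven partitions.
    criteria = [[(0.9, 3.0), (0.7, 2.0)] for _ in range(7)]
    print(is_improper_combined([1.0, 0, 0, 0, 0, 0, 0], 1.1, criteria))
```

Pairing a stricter proportion threshold with a looser distance threshold (and vice versa) lets strong evidence on one measure compensate for weaker evidence on the other.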

That is, the identification module 12 further comprises:

Symbol distribution acquisition module (not shown), configured to obtain the distribution proportion of digits or symbols in the text to be identified;

Fourth division module (not shown), configured to classify the text to be identified as normal text or improper text from the distribution proportion, the average distance, and the distribution proportion of digits or symbols, according to the preset division criteria corresponding to each keyboard partition; the preset division criteria corresponding to each keyboard partition include a preset distribution proportion threshold, a preset average distance threshold, and a preset digit or symbol distribution proportion.

Using the number of symbols or digits as a further criterion can further improve the ability to recognize improper text.
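The digit or symbol proportion itself is straightforward to compute; a minimal sketch follows. The function name `digit_symbol_ratio` is hypothetical, and ignoring whitespace and treating Unicode punctuation and symbol categories as "symbols" are assumptions of this sketch. How the ratio enters the division criteria is left open here, since the description only names it as one of the preset criteria.

```python
import unicodedata


def digit_symbol_ratio(text):
    """Share of non-space characters that are digits, punctuation, or symbols."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    hits = sum(
        c.isdigit() or unicodedata.category(c).startswith(("P", "S"))
        for c in chars
    )
    return hits / len(chars)


if __name__ == "__main__":
    print(digit_symbol_ratio("abc,1"))
```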

Those of ordinary skill in the art will understand that all or part of the flows in the above embodiments, and the corresponding systems, can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, may include the flows of the embodiments described above. The storage medium may be a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or the like.

The embodiments described above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not therefore be construed as limiting the scope of the patent. It should be pointed out that those of ordinary skill in the art can make various modifications and improvements without departing from the concept of the present invention, and these all fall within the protection scope of the present invention. The protection scope of this patent shall therefore be determined by the appended claims.

Claims (12)

1. An improper text recognition method, characterized by comprising the steps of:
for each character in a text to be identified, obtaining the key corresponding to the initial of that character's input; and classifying the text to be identified as normal text or improper text according to the distribution on the keyboard of the keys corresponding to the initials of the characters' input, in one of the following ways:
determining, according to multiple preset keyboard partitions, the distribution proportion of the obtained keys over each keyboard partition; comparing the distribution proportion with a preset distribution proportion threshold; if the distribution proportion is greater than the distribution proportion threshold, classifying the text to be identified as improper text; otherwise, classifying the text to be identified as normal text;
or,
calculating the keyboard distance between the keys corresponding to each two adjacent characters in the text to be identified, and calculating the average of the distances; comparing the average distance with a preset average distance threshold; if the average distance is less than the average distance threshold, classifying the text to be identified as improper text; otherwise, classifying the text to be identified as normal text;
or,
determining, according to multiple preset keyboard partitions, the distribution proportion of the obtained keys over each keyboard partition; calculating the keyboard distance between the keys corresponding to each two adjacent characters in the text to be identified, and calculating the average of the distances; and classifying the text to be identified as normal text or improper text from the distribution proportion and the average distance, according to the preset division criteria corresponding to each keyboard partition; wherein the preset division criteria corresponding to each keyboard partition include a preset distribution proportion threshold and a preset average distance threshold.
2. The improper text recognition method as claimed in claim 1, characterized in that the step of obtaining, for each character in the text to be identified, the key corresponding to the initial of that character's input comprises:
looking up a pre-established mapping table for each character in the text to be identified to obtain the corresponding key; wherein the mapping table records each character together with the key corresponding to the initial of that character's input.
3. The improper text recognition method as claimed in claim 1, characterized in that the step of obtaining, for each character in the text to be identified, the key corresponding to the initial of that character's input comprises:
obtaining the key corresponding to the first pinyin letter of each Chinese character in the text to be identified, and taking it as the key corresponding to the initial of that Chinese character's input;
or
obtaining the key corresponding to each English letter in the text to be identified, and taking it as the key corresponding to the initial of that English letter's input.
4. The improper text recognition method as claimed in claim 1, characterized in that the preset division criteria corresponding to each keyboard partition include multiple preset distribution proportion thresholds and multiple average distance thresholds, each corresponding to one of the distribution proportion thresholds.
5. The improper text recognition method as claimed in claim 1, characterized by further obtaining the distribution proportion of digits or symbols in the text to be identified;
and classifying the text to be identified as normal text or improper text from the distribution proportion, the average distance, and the distribution proportion of digits or symbols, according to the preset division criteria corresponding to each keyboard partition; wherein the preset division criteria corresponding to each keyboard partition include a preset distribution proportion threshold, a preset average distance threshold, and a preset digit or symbol distribution proportion.
6. The improper text recognition method as claimed in any one of claims 1-5, characterized in that the step of calculating the keyboard distance between the keys corresponding to each two adjacent characters in the text to be identified comprises:
calculating a weighted keyboard distance from the lateral distance and the longitudinal distance on the keyboard between the keys corresponding to each two adjacent characters in the text to be identified, according to the following formula:
Dist = x + αy
where Dist is the calculated weighted keyboard distance, x is the lateral distance, y is the longitudinal distance, and α is the weighting factor between lateral and longitudinal distance, with α > 1.
7. An improper text recognition system, characterized by comprising:
a key acquisition module, configured to obtain, for each character in a text to be identified, the key corresponding to the initial of that character's input;
an identification module, configured to classify the text to be identified as normal text or improper text according to the distribution on the keyboard of the keys corresponding to the initials of the characters' input;
wherein the identification module includes:
a distribution proportion calculation module, configured to determine, according to multiple preset keyboard partitions, the distribution proportion of the obtained keys over each keyboard partition; a first comparison module, configured to compare the distribution proportion with a preset distribution proportion threshold; and a first division module, configured to classify the text to be identified as improper text when the distribution proportion is greater than the distribution proportion threshold, and as normal text otherwise;
or,
a keyboard distance calculation module, configured to calculate the keyboard distance between the keys corresponding to each two adjacent characters in the text to be identified, and to calculate the average of the distances; a second comparison module, configured to compare the average distance with a preset average distance threshold; and a second division module, configured to classify the text to be identified as improper text when the average distance is less than the average distance threshold, and as normal text otherwise;
or,
a distribution proportion calculation module, configured to determine, according to multiple preset keyboard partitions, the distribution proportion of the obtained keys over each keyboard partition; a keyboard distance calculation module, configured to calculate the keyboard distance between the keys corresponding to each two adjacent characters in the text to be identified, and to calculate the average of the distances; and a third division module, configured to classify the text to be identified as normal text or improper text from the distribution proportion and the average distance, according to the preset division criteria corresponding to each keyboard partition; wherein the preset division criteria corresponding to each keyboard partition include a preset distribution proportion threshold and a preset average distance threshold.
8. The improper text recognition system as claimed in claim 7, characterized in that the key acquisition module looks up a pre-established mapping table for each character in the text to be identified to obtain the corresponding key; wherein the mapping table records each character together with the key corresponding to the initial of that character's input.
9. The improper text recognition system as claimed in claim 7, characterized in that the key acquisition module obtains the key corresponding to the first pinyin letter of each Chinese character in the text to be identified and takes it as the key corresponding to the initial of that Chinese character's input; or obtains the key corresponding to each English letter in the text to be identified and takes it as the key corresponding to the initial of that English letter's input.
10. The improper text recognition system as claimed in claim 7, characterized in that the preset division criteria corresponding to each keyboard partition include multiple preset distribution proportion thresholds and multiple average distance thresholds, each corresponding to one of the distribution proportion thresholds.
11. The improper text recognition system as claimed in claim 7, characterized in that the identification module further comprises:
a symbol distribution acquisition module, configured to obtain the distribution proportion of digits or symbols in the text to be identified;
a fourth division module, configured to classify the text to be identified as normal text or improper text from the distribution proportion, the average distance, and the distribution proportion of digits or symbols, according to the preset division criteria corresponding to each keyboard partition; wherein the preset division criteria corresponding to each keyboard partition include a preset distribution proportion threshold, a preset average distance threshold, and a preset digit or symbol distribution proportion.
12. The improper text recognition system as claimed in any one of claims 7-11, characterized in that, when calculating the keyboard distance between the keys corresponding to each two adjacent characters in the text to be identified, the keyboard distance calculation module calculates a weighted keyboard distance from the lateral distance and the longitudinal distance on the keyboard between those keys, according to the following formula:
Dist = x + αy
where Dist is the calculated weighted keyboard distance, x is the lateral distance, y is the longitudinal distance, and α is the weighting factor between lateral and longitudinal distance, with α > 1.
CN201210264218.9A 2012-07-27 2012-07-27 Improper text recognition method and its system CN103576882B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210264218.9A CN103576882B (en) 2012-07-27 2012-07-27 Improper text recognition method and its system

Publications (2)

Publication Number Publication Date
CN103576882A CN103576882A (en) 2014-02-12
CN103576882B true CN103576882B (en) 2018-03-09

Family

ID=50048833

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210264218.9A CN103576882B (en) 2012-07-27 2012-07-27 Improper text recognition method and its system

Country Status (1)

Country Link
CN (1) CN103576882B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106445908B (en) * 2015-08-07 2019-11-15 阿里巴巴集团控股有限公司 Text recognition method and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1928860A (en) * 2005-09-05 2007-03-14 日电(中国)有限公司 Method, search engine and search system for correcting key errors
CN101266520A (en) * 2008-04-18 2008-09-17 黄晓凤;赵艳姣;戴静芬 System for accomplishing live keyboard layout
CN101304589A (en) * 2008-04-14 2008-11-12 中国联合通信有限公司 Method and system for monitoring and filtering garbage short message transmitted by short message gateway
CN101510879A (en) * 2009-03-26 2009-08-19 腾讯科技(深圳)有限公司 Method and apparatus for filtering rubbish contents
CN101710262A (en) * 2009-12-11 2010-05-19 北京搜狗科技发展有限公司 Error correction method and error correction device of characters
EP2264563A1 (en) * 2009-06-19 2010-12-22 Tegic Communications, Inc. Virtual keyboard system with automatic correction

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2545426A4 (en) * 2010-03-12 2017-05-17 Nuance Communications, Inc. Multimodal text input system, such as for use with touch screens on mobile phones

Legal Events

Date Code Title Description
PB01 Publication
C06 Publication
SE01 Entry into force of request for substantive examination
EXSB Decision made by sipo to initiate substantive examination
GR01 Patent grant