CN103576882A - Off-normal text recognition method and system - Google Patents

Off-normal text recognition method and system

Info

Publication number
CN103576882A
Authority
CN
China
Prior art keywords
text
keyboard
identified
button
distance
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201210264218.9A
Other languages
Chinese (zh)
Other versions
CN103576882B (en)
Inventor
何小晨
张国强
郝志新
许春林
王长伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Shiji Guangsu Information Technology Co Ltd
Original Assignee
Shenzhen Shiji Guangsu Information Technology Co Ltd
Application filed by Shenzhen Shiji Guangsu Information Technology Co Ltd
Priority to CN201210264218.9A
Publication of CN103576882A
Application granted
Publication of CN103576882B
Legal status: Active

Abstract

The invention provides a method and system for recognizing abnormal (off-normal) text. The method includes: for each character in a text to be recognized, obtaining the key that corresponds to the first letter typed when that character is input; and classifying the text as normal or abnormal according to how those keys are distributed on the keyboard. Because the classification relies on the keyboard distribution of these first keys, most meaningless, randomly typed abnormal text can be recognized effectively. Moreover, because the method does not depend on word meaning or on a computed text quality score, the recognition results are more objective and accurate.

Description

Abnormal text recognition method and system
Technical field
The present invention relates to the technical field of text recognition, and in particular to an abnormal text recognition method and an abnormal text recognition system.
Background
Spam text filtering has always been an important technical component of search engines. Spam text usually refers to meaningless abnormal text. Traditional spam filtering relies on keyword lookup and on computing a text quality score, and it can filter out some advertising content, pornographic content, politically sensitive content, poorly formatted text with repeated content, poorly formatted text with too many non-standard characters, and the like.
However, in short-text search over content such as microblog posts and personal-space status updates, we find a certain amount of abnormal text (also called spam text) produced by random typing, for example: " rubbish such as the flighty Lhasa large real road Ka Sa of science and technology army wantonly search for smash Liao Jun Dallas add the Jia Sadun of Dallas water etc. ". Text of this kind has the following characteristics: it is internally random, with few repeated terms; it contains many content words, so the text quality score computed by earlier filtering techniques is usually not low; and, because of the associative input feature of input methods, adjacent words often have some degree of correlation, which makes the text hard to filter by semantic analysis. For these reasons, this kind of abnormal text is difficult to distinguish with traditional text filtering methods.
Summary of the invention
In view of the problems described in the background above, the object of the present invention is to provide an abnormal text recognition method that can effectively recognize abnormal text produced by random typing, and a corresponding abnormal text recognition system.
An abnormal text recognition method comprises the following steps:
for each character in a text to be recognized, obtaining the key corresponding to the first letter typed when the character is input;
classifying the text to be recognized as normal text or abnormal text according to the distribution on the keyboard of the keys corresponding to the first letters typed for the characters.
An abnormal text recognition system comprises:
a key acquisition module, configured to obtain, for each character in a text to be recognized, the key corresponding to the first letter typed when the character is input;
a recognition module, configured to classify the text to be recognized as normal text or abnormal text according to the distribution on the keyboard of the keys corresponding to the first letters typed for the characters.
The abnormal text recognition method and system of the present invention obtain, under the corresponding input method, the key corresponding to the first letter typed for each character in the text to be recognized, and evaluate how those keys are distributed on the keyboard. Meaningless abnormal text is usually typed by randomly hitting keys in a fairly concentrated region of the keyboard, so the keyboard distribution of these first keys makes it possible to recognize most meaningless, randomly typed abnormal text effectively. And because the method does not depend on word meaning or on a computed text quality score, the recognition results are more objective and accurate.
Brief description of the drawings
Fig. 1 is a flowchart of the first embodiment of the abnormal text recognition method of the present invention;
Fig. 2 is a partial flowchart of step S102 in the first embodiment of the abnormal text recognition method of the present invention;
Fig. 3 is a schematic diagram of one way of defining keyboard zones in the abnormal text recognition method of the present invention;
Fig. 4 is a partial flowchart of step S102 in the second embodiment of the abnormal text recognition method of the present invention;
Fig. 5 is a partial flowchart of step S102 in the third embodiment of the abnormal text recognition method of the present invention;
Fig. 6 is a structural diagram of the first embodiment of the abnormal text recognition system of the present invention;
Fig. 7 is a structural diagram of the recognition module in the first embodiment of the abnormal text recognition system of the present invention;
Fig. 8 is a structural diagram of the recognition module in the second embodiment of the abnormal text recognition system of the present invention;
Fig. 9 is a structural diagram of the recognition module in the third embodiment of the abnormal text recognition system of the present invention.
Detailed description of embodiments
Referring to Fig. 1, Fig. 1 is a flowchart of the first embodiment of the abnormal text recognition method of the present invention.
The abnormal text recognition method comprises the following steps S101 and S102:
S101: for each character in the text to be recognized, obtain the key corresponding to the first letter typed when the character is input.
Here, the text to be recognized is the text that needs to be classified. Its characters may be Chinese characters, English letters, digits, symbols and so on, or any combination of these. In particular, the present invention recognizes text composed of Chinese characters well.
The text to be recognized is preferably longer than a preset value, that is, it preferably contains more than a certain number of characters: the shorter the text and the fewer characters it contains, the blurrier the boundary between normal and abnormal text and the harder the recognition. In practice, a minimum text length can therefore be preset; the method is applied only to text longer than that minimum and is skipped otherwise.
The key corresponding to the first letter typed for a character is the first key pressed when that character is input. For example, when an English letter is typed into a computer, that key is the key bearing the letter.
The key corresponding to the first letter typed for a character can be obtained by means of a lookup table. Preferably, in step S101, a pre-built mapping table is searched for each character in the text to be recognized to obtain the corresponding key; the mapping table records each character together with the key corresponding to the first letter typed for it.
That is, the preset mapping table establishes the correspondence between input characters and their keys. The key for each character in the text to be recognized is obtained simply by looking the character up in the mapping table, which is direct and convenient.
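The lookup described above can be sketched as a small dictionary. This is only an illustration: the table name `FIRST_KEY_TABLE`, the function `first_keys` and the four sample entries are assumptions, and a real table would need to cover every character the system expects to see.

```python
# Hypothetical mapping table: character -> first key pressed when typing it.
# The four sample entries follow standard Pinyin (ni, hao, zhong, guo).
FIRST_KEY_TABLE = {"你": "n", "好": "h", "中": "z", "国": "g"}

def first_keys(text, table=FIRST_KEY_TABLE):
    """Return the first key for each character; None marks unmapped characters."""
    keys = []
    for ch in text:
        if ch.isascii() and ch.isalpha():
            keys.append(ch.lower())      # an English letter is its own key
        else:
            keys.append(table.get(ch))   # table lookup for everything else
    return keys

print(first_keys("你好A中国"))
```

Unmapped characters are reported as `None` here so that the caller can decide whether to skip them or count them separately.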
Alternatively, the key can be derived from the rules of the input method used to type the character. For example, for an English letter in the text to be recognized, the key bearing that letter is taken directly as the key corresponding to the first letter typed.
When the characters in the text to be recognized are Chinese characters typed with a Pinyin input method, the key corresponding to the first Pinyin letter of each Chinese character is taken as the key corresponding to the first letter typed for that character.
For a Chinese character typed with a Pinyin input method, the first key pressed must be the key of the first letter of the character's Pinyin, so the first key pressed when the character was typed can be obtained in this way.
Deriving the keys from the rules of the respective input methods avoids building a mapping table with a large amount of data. Following the method described here, those skilled in the art can derive the first keys for other kinds of characters from the rules of the corresponding input method.
Preferably, in step S101, the keys a to z on the computer keyboard are labeled with 26 distinct identifiers, for example the numbers 1 to 26, and punctuation marks and digits are all labeled with one shared identifier, for example the number 0. After the corresponding keys are obtained, they can be recorded by identifier so that a computer can process them statistically.
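A minimal sketch of this labeling scheme follows, using the example identifiers from the text (1 to 26 for a to z, 0 for punctuation and digits). The function name `key_id` is an assumption made for illustration.

```python
import string

def key_id(ch):
    """Map a-z (case-insensitive) to 1-26 and any other character to 0."""
    lowered = ch.lower()
    if len(lowered) == 1 and lowered in string.ascii_lowercase:
        return string.ascii_lowercase.index(lowered) + 1
    return 0  # shared identifier for digits and punctuation

print([key_id(c) for c in "az7,"])
```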
S102: classify the text to be recognized as normal text or abnormal text according to the distribution on the keyboard of the keys corresponding to the first letters typed for the characters.
Randomly typed, meaningless abnormal text is usually produced by hitting a few keys in a fairly concentrated region of the keyboard; in other words, random typing does not usually hit every key across the whole keyboard evenly. The keyboard distribution of the keys corresponding to the characters therefore makes it possible to recognize most meaningless, randomly typed abnormal text effectively. For example, if the keys are concentrated, the text to be recognized is judged to be abnormal text; if they are spread out, it is judged to be normal text. The criterion for deciding whether the distribution is concentrated or spread out can be obtained from statistics, or from training samples and machine learning.
Referring to Fig. 2, Fig. 2 is a partial flowchart of step S102 in the first embodiment of the abnormal text recognition method of the present invention.
In this embodiment, the text to be recognized can be classified as normal text or abnormal text in the following way. Step S102 comprises:
S201: according to a plurality of preset keyboard zones, determine the distribution ratio of the obtained keys in each keyboard zone;
S202: compare the distribution ratio with a preset distribution ratio threshold;
if the distribution ratio is greater than the threshold, perform step S203 and classify the text to be recognized as abnormal text; otherwise, perform step S204 and classify the text to be recognized as normal text.
Here, "a plurality of" in the present invention means two or more. The keyboard zones are predefined; each zone comprises several mutually adjacent keys and can be set according to the layout of the keys on the keyboard.
Referring to Fig. 3, Fig. 3 is a schematic diagram of one way of defining keyboard zones in the abnormal text recognition method of the present invention. This arrangement defines 7 keyboard zones: the first zone comprises keys Q, W, E, R, T, Y, U, I, O, P; the second zone comprises keys A, S, D, F, G, H, J, K, L; the third zone comprises keys Z, X, C, V, B, N, M; the fourth zone comprises keys W, E, R, T, S, D, F, G; the fifth zone comprises keys Y, U, I, O, H, J, K, L; the sixth zone comprises keys S, D, F, G, X, C, V, B; the seventh zone comprises keys H, J, K, L, N, M.
With these zones, the distribution ratio of the obtained keys in each zone can be determined. Take, for example, the text to be recognized " the flighty rubbish such as the Lhasa large real road Ka Sa of science and technology army wantonly search for smash Liao Jun Dallas add the Jia Sadun of Dallas water etc. ": the first Pinyin letters of its characters are "sjdslkjdsjdlksjdljodsdsljdlsjdlsjsdsd", and the keys for those letters are the keys corresponding to the first letters typed for the characters.
That is, 37 keys are obtained in total. Of these, 36 fall in the second keyboard zone, a distribution ratio of 97.3%, and 1 falls in the first zone, a ratio of 2.7%. In addition, 20 keys fall in the fourth or sixth zone, about 54%, and 17 fall in the fifth or seventh zone, about 46%.
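The counts in this worked example can be reproduced with a short sketch. The zone membership is taken from the seven zones listed for Fig. 3; the names `ZONES` and `zone_ratios` are assumptions made for illustration.

```python
# The seven keyboard zones from the Fig. 3 description.
ZONES = {
    1: set("qwertyuiop"),
    2: set("asdfghjkl"),
    3: set("zxcvbnm"),
    4: set("wertsdfg"),
    5: set("yuiohjkl"),
    6: set("sdfgxcvb"),
    7: set("hjklnm"),
}

def zone_ratios(keys):
    """Return zone -> share of the keys that fall in that zone."""
    total = len(keys)
    if total == 0:
        return {zone: 0.0 for zone in ZONES}
    return {zone: sum(k in members for k in keys) / total
            for zone, members in ZONES.items()}

keys = "sjdslkjdsjdlksjdljodsdsljdlsjdlsjsdsd"
ratios = zone_ratios(keys)
for zone in sorted(ratios):
    print(zone, f"{ratios[zone]:.1%}")
```

Because the zones overlap, the ratios do not sum to 100%. Note that the single O key belongs to the fifth zone but not the seventh, so the fifth zone holds 17 keys while the seventh holds 16.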
The distribution ratio of the keys in each keyboard zone can therefore be compared with the preset distribution ratio threshold, and the text to be recognized is classified as abnormal text or normal text according to the result.
The distribution ratio thresholds of the keyboard zones may be identical or may be set to different values. Preferably, multi-level distribution ratio thresholds can be set for the keyboard zones. For example, the first-level threshold is set to 90%, the second-level threshold to 70% and the third-level threshold to 40%. The text to be recognized can then be classified as abnormal text when the distribution ratio of any one keyboard zone exceeds 90%, or when the distribution ratios of two keyboard zones each exceed 70%, or when the distribution ratios of three keyboard zones each exceed 40%.
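One way to express this multi-level rule is sketched below. It assumes the reading that a level is satisfied when the required number of zones each exceed that level's threshold; the function name and the sample ratio dictionaries are hypothetical.

```python
def is_abnormal_by_levels(ratios, levels=((1, 0.90), (2, 0.70), (3, 0.40))):
    """ratios: zone -> share of keys in that zone.

    levels: (minimum number of zones, threshold) pairs, using the example
    values 90% / 70% / 40% from the text.
    """
    for min_zones, threshold in levels:
        zones_over = sum(ratio > threshold for ratio in ratios.values())
        if zones_over >= min_zones:
            return True
    return False

# Hypothetical inputs for illustration.
print(is_abnormal_by_levels({1: 0.027, 2: 0.973, 3: 0.0}))
print(is_abnormal_by_levels({1: 0.35, 2: 0.35, 3: 0.30}))
```

The first call trips the 90% level through a single zone; the second stays below every level and is treated as normal.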
Referring to Fig. 4, Fig. 4 is a partial flowchart of step S102 in the second embodiment of the abnormal text recognition method of the present invention.
In this embodiment, the text to be recognized can also be classified as normal text or abnormal text in the following way. Step S102 comprises:
S211: calculate the distance on the keyboard between the keys corresponding to every two adjacent characters in the text to be recognized, and calculate the mean of those distances;
S212: compare the mean distance with a preset mean distance threshold;
if the mean distance is less than the threshold, perform step S213 and classify the text to be recognized as abnormal text; otherwise, perform step S214 and classify the text to be recognized as normal text.
This embodiment judges whether the text to be recognized is abnormal text from the keyboard distance between the keys of adjacent characters. Randomly typed abnormal text may also be produced by sweeping continuously across the keys of the keyboard, for example by sweeping across the keys QWERTYUIOPLKJHGFDSA in order, which yields the random abnormal text " remove to play physical culture i Aupres card and slow down spreading of public expense ". When this text is recognized according to this embodiment, the keyboard distance between every two adjacent keys in QWERTYUIOPLKJHGFDSA is determined, giving a mean distance of 1.0. This is less than the preset mean distance threshold (for example 2.0), so the text to be recognized is classified as abnormal text.
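As a rough check of the sweep example, the sketch below computes an unweighted mean distance for QWERTYUIOPLKJHGFDSA. The key coordinates (one unit per key, middle row shifted by 0.5, bottom row by 1.0) and the use of horizontal plus vertical distance are assumptions made for illustration; the 2.0 threshold is the example value from the text.

```python
# Assumed layout: one unit per key; the row offsets are illustrative only.
ROWS = [("qwertyuiop", 0.0), ("asdfghjkl", 0.5), ("zxcvbnm", 1.0)]
POS = {key: (col + offset, row)
       for row, (keys, offset) in enumerate(ROWS)
       for col, key in enumerate(keys)}

def mean_key_distance(keys):
    """Mean horizontal-plus-vertical distance between consecutive keys."""
    pairs = list(zip(keys, keys[1:]))
    if not pairs:
        return None  # fewer than two keys: no distance to average
    total = 0.0
    for a, b in pairs:
        (xa, ya), (xb, yb) = POS[a], POS[b]
        total += abs(xa - xb) + abs(ya - yb)
    return total / len(pairs)

def is_abnormal_by_distance(keys, threshold=2.0):
    """Steps S211 to S214: abnormal when the mean distance is below the threshold."""
    mean = mean_key_distance(keys)
    return mean is not None and mean < threshold

sweep = "qwertyuioplkjhgfdsa"
print(round(mean_key_distance(sweep), 2), is_abnormal_by_distance(sweep))
```

Under these assumptions the mean comes out slightly above 1.0, because the step from P to L crosses a row, and it remains well below the example threshold of 2.0.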
Preferably, when calculating the keyboard distance between the keys of every two adjacent characters in the text to be recognized, the horizontal and vertical distances between the keys can be given different weights. That is, from the horizontal distance and the vertical distance on the keyboard between the keys of two adjacent characters, the weighted keyboard distance is calculated by the following formula:
Dist = x + α·y
where Dist is the weighted keyboard distance, x is the horizontal distance, y is the vertical distance, and α is the weight of the vertical distance relative to the horizontal distance, with α > 1.
The cost of moving vertically while randomly hitting the keyboard is generally considered to be higher than the cost of moving horizontally, so α is usually set greater than 1. For example, with α = 2, the letters S and T are 2.5 apart horizontally and 1 apart vertically on the keyboard, so their weighted keyboard distance is 2.5 + 2 × 1 = 4.5. Suppose the text contains N characters (counting only Chinese characters and English letters, not digits, punctuation or non-standard characters); then the N - 1 keyboard distances between adjacent characters are calculated and averaged, and the text to be recognized is classified according to the mean distance and the mean distance threshold.
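The weighted distance and the S to T example can be sketched as follows. The key coordinates are an assumption: one unit per key with the middle row shifted by 0.5, which reproduces the 2.5 horizontal distance quoted for S and T; the 1.0 shift of the bottom row is a guess made for illustration.

```python
# Assumed layout: one unit per key; the bottom-row offset is illustrative only.
ROWS = [("qwertyuiop", 0.0), ("asdfghjkl", 0.5), ("zxcvbnm", 1.0)]
POS = {key: (col + offset, row)
       for row, (keys, offset) in enumerate(ROWS)
       for col, key in enumerate(keys)}

def weighted_distance(a, b, alpha=2.0):
    """Dist = x + alpha * y, with x the horizontal and y the vertical distance."""
    (xa, ya), (xb, yb) = POS[a.lower()], POS[b.lower()]
    return abs(xa - xb) + alpha * abs(ya - yb)

def mean_weighted_distance(keys, alpha=2.0):
    """Average of the N - 1 weighted distances between consecutive letter keys."""
    letters = [k.lower() for k in keys if k.lower() in POS]  # skip digits, punctuation
    pairs = list(zip(letters, letters[1:]))
    if not pairs:
        return None
    return sum(weighted_distance(a, b, alpha) for a, b in pairs) / len(pairs)

print(weighted_distance("S", "T"))  # 2.5 + 2 * 1 = 4.5
```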
Referring to Fig. 5, Fig. 5 is a partial flowchart of step S102 in the third embodiment of the abnormal text recognition method of the present invention.
In this embodiment, the two criteria above, the distribution ratio of the keys and the mean distance between keys, are used together as the basis for judging whether the text to be recognized is abnormal text. That is, step S102 comprises:
S221: according to a plurality of preset keyboard zones, determine the distribution ratio of the obtained keys in each keyboard zone;
S222: calculate the distance on the keyboard between the keys corresponding to every two adjacent characters in the text to be recognized, and calculate the mean of those distances;
S223: classify the text to be recognized as normal text or abnormal text according to the distribution ratio and the mean distance, using the preset classification criteria corresponding to each keyboard zone; the preset classification criteria for each keyboard zone comprise a preset distribution ratio threshold and a preset mean distance threshold.
Using both the distribution ratio of the keys and the mean distance between keys as the basis for judging whether the text to be recognized is abnormal text makes the recognition result more accurate.
Preferably, to further improve the accuracy of the recognition result, the preset classification criteria for each keyboard zone comprise a plurality of preset distribution ratio thresholds and a mean distance threshold corresponding to each of those distribution ratio thresholds. This provides multiple threshold levels and makes the recognition result more accurate.
In addition, punctuation marks and digits generally appear with low probability in randomly typed abnormal text, so step S102 can further take into account the number of punctuation marks or digits in the text to be recognized.
That is, step S102 further obtains the distribution ratio of digits or symbols in the text to be recognized;
and classifies the text to be recognized as normal text or abnormal text according to the key distribution ratio, the mean distance, and the distribution ratio of digits or symbols, using the preset classification criteria corresponding to each keyboard zone; the preset classification criteria for each keyboard zone comprise a preset distribution ratio threshold, a preset mean distance threshold and a preset digit or symbol distribution ratio.
Using the number of symbols or digits as an additional criterion can further improve the accuracy of abnormal text recognition.
In particular, the following example illustrates the case in which the key distribution ratio, the mean key distance and the distribution ratio of digits or symbols in the text to be recognized are all used in the preset classification criteria, with multiple thresholds set:
Example program code implementing the preset classification criteria for the first keyboard zone described above:
[Figure BDA00001943136800081: program code listing, provided as an image in the original publication]
Here, letterCounter is the number of characters in the text to be recognized; UpLetterRatio is the distribution ratio of keys in the first keyboard zone; MeanKeyDist is the mean key distance; and LetterRepeatTimes[0] is the number of times punctuation marks and digits occur.
"return true" in the program code denotes classifying the text to be recognized as abnormal text.
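The original listing is only available as an image, so the sketch below is not a reproduction of it. It merely illustrates how the four named quantities could be combined for the first keyboard zone. Every numeric threshold, the function name and the ordering of the checks are assumptions; only the variable names come from the description above.

```python
def is_abnormal_first_zone(letterCounter, UpLetterRatio, MeanKeyDist, LetterRepeatTimes):
    """Return True to classify the text as abnormal.

    All thresholds are illustrative assumptions, not values from the patent.
    """
    if letterCounter < 10:                        # assumed minimum text length
        return False
    other_ratio = LetterRepeatTimes[0] / letterCounter
    if other_ratio > 0.2:                         # assumed cap on digits and punctuation
        return False
    # Assumed multi-level pairs of (ratio threshold, mean distance threshold).
    if UpLetterRatio > 0.9 and MeanKeyDist < 3.0:
        return True
    if UpLetterRatio > 0.7 and MeanKeyDist < 2.0:
        return True
    return False

# Hypothetical inputs for illustration.
print(is_abnormal_first_zone(37, 0.95, 2.5, [0]))
print(is_abnormal_first_zone(37, 0.95, 2.5, [20]))
```

In this sketch a more concentrated distribution is allowed a larger mean distance, which is one plausible way to pair each distribution ratio threshold with its own mean distance threshold.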
The thresholds in the preset classification criteria can be set from statistics over a large amount of sample data. Alternatively, a classifier can be generated by applying machine learning to a large number of text recognition training samples. The benefit of machine learning is that the thresholds need not be set by hand; the machine simply learns from a large number of correctly recognized training samples. However, producing that many training samples takes considerable work, and because the classification criteria generated by machine learning are relatively complex, online recognition carries a heavier computational load. When the criteria are instead set manually from statistics over a large amount of sample data, the thresholds can be adjusted early on according to feedback from online recognition results, which quickly reduces recognition errors.
Referring to Fig. 6, Fig. 6 is a structural diagram of the first embodiment of the abnormal text recognition system of the present invention.
The abnormal text recognition system comprises a key acquisition module 11 and a recognition module 12. The key acquisition module 11 is configured to obtain, for each character in the text to be recognized, the key corresponding to the first letter typed when the character is input. The recognition module 12 is configured to classify the text to be recognized as normal text or abnormal text according to the distribution on the keyboard of the keys corresponding to the first letters typed for the characters.
Randomly typed, meaningless abnormal text is usually produced by hitting a few keys in a fairly concentrated region of the keyboard; in other words, random typing does not usually hit every key across the whole keyboard evenly. The keyboard distribution of the keys corresponding to the characters therefore makes it possible to recognize most meaningless, randomly typed abnormal text effectively.
Here, the text to be recognized is the text that needs to be classified. Its characters may be Chinese characters, English letters, digits, symbols and so on, or any combination of these. In particular, the present invention recognizes text composed of Chinese characters well.
The text to be recognized is preferably longer than a preset value, that is, it preferably contains more than a certain number of characters: the shorter the text and the fewer characters it contains, the blurrier the boundary between normal and abnormal text and the harder the recognition. In practice, a minimum text length can therefore be preset; abnormal text recognition is performed only on text longer than that minimum and is skipped otherwise.
The key corresponding to the first letter typed for a character is the first key pressed when that character is input. For example, when an English letter is typed into a computer, that key is the key bearing the letter.
The key acquisition module 11 obtains the key corresponding to the first letter typed for a character by means of a lookup table. Preferably, the key acquisition module 11 searches a pre-built mapping table for each character in the text to be recognized to obtain the corresponding key; the mapping table records each character together with the key corresponding to the first letter typed for it.
That is, the preset mapping table establishes the correspondence between input characters and their keys. The key for each character in the text to be recognized is obtained simply by looking the character up in the mapping table, which is direct and convenient.
The key acquisition module 11 can also derive the key corresponding to the first letter typed for a character from the rules of the input method used to type the character. For example, for an English letter in the text to be recognized, the key bearing that letter is taken directly as the key corresponding to the first letter typed.
When the characters in the text to be recognized are Chinese characters typed with a Pinyin input method, the key acquisition module 11 takes the key corresponding to the first Pinyin letter of each Chinese character as the key corresponding to the first letter typed for that character.
For a Chinese character typed with a Pinyin input method, the first key pressed must be the key of the first letter of the character's Pinyin, so the first key pressed when the character was typed can be obtained in this way.
By deriving the keys from the rules of the respective input methods, the key acquisition module 11 avoids building a mapping table with a large amount of data. Following the method described here, those skilled in the art can derive the first keys for other kinds of characters from the rules of the corresponding input method.
Preferably, after obtaining the first keys, the key acquisition module 11 labels the keys a to z with 26 distinct identifiers, for example the numbers 1 to 26, and labels punctuation marks and digits with one shared identifier, for example the number 0. After the corresponding keys are obtained, they can be recorded by identifier so that a computer can process them statistically.
Refer to Fig. 7, Fig. 7 is the structural representation of identification module in the first embodiment of the improper text recognition system of the present invention.
In the present embodiment, described identification module 12 comprises:
Distribution proportion computing module 201, for according to default a plurality of keyboard subregions, judges the described button distribution proportion on keyboard subregion described in each obtaining;
The first comparison module 202, for by described distribution proportion and default distribution proportion threshold value comparison;
First divides module 203, for when described distribution proportion is greater than described distribution proportion threshold value, described detection text is divided into improper text; Otherwise, described detection text is divided into normal text.
Wherein, " a plurality of " of in the present invention, occurring refer to two or more.Described a plurality of keyboard subregion is predefined, and described in each, keyboard subregion comprises several adjacent buttons successively, specifically can set according to the distribution of each button on keyboard.
One way of setting the keyboard subregions is to define 7 keyboard subregions: the first keyboard subregion comprises keys Q, W, E, R, T, Y, U, I, O, P; the second comprises keys A, S, D, F, G, H, J, K, L; the third comprises keys Z, X, C, V, B, N, M; the fourth comprises keys W, E, R, T, S, D, F, G; the fifth comprises keys Y, U, I, O, H, J, K, L; the sixth comprises keys S, D, F, G, X, C, V, B; the seventh comprises keys H, J, K, L, N, M.
The distribution proportion thresholds of the keyboard subregions may be identical, or may be set to different values. Preferably, multi-level distribution proportion thresholds can be set for each keyboard subregion. For example, the first-level distribution proportion threshold is set to 90%, the second level to 70%, and the third level to 40%. The text to be identified can then be classified as improper text when the distribution proportion of any one keyboard subregion is higher than 90%, or when the distribution proportions of two keyboard subregions are higher than 70%, or when the distribution proportions of three keyboard subregions are higher than 40%.
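The subregion check of this embodiment might be sketched as below, using the seven example subregions and the example thresholds of 90%, 70% and 40%. The names `SUBREGIONS`, `LEVELS`, `subregion_proportions` and `is_improper_by_distribution` are hypothetical, and the reading of the multi-level rule (at least one subregion above 90%, or at least two above 70%, or at least three above 40%) is an interpretation of the text rather than a confirmed implementation.

```python
# The seven example keyboard subregions listed in the text.
SUBREGIONS = [
    set("QWERTYUIOP"),
    set("ASDFGHJKL"),
    set("ZXCVBNM"),
    set("WERTSDFG"),
    set("YUIOHJKL"),
    set("SDFGXCVB"),
    set("HJKLNM"),
]

# (threshold, minimum number of subregions that must exceed it);
# the values follow the example in the text, the pairing is an assumed reading.
LEVELS = [(0.90, 1), (0.70, 2), (0.40, 3)]


def subregion_proportions(keys):
    """Return, for each subregion, the share of letter keys that fall inside it."""
    letters = [k.upper() for k in keys if k.isascii() and k.isalpha()]
    if not letters:
        return [0.0] * len(SUBREGIONS)
    return [sum(k in region for k in letters) / len(letters) for region in SUBREGIONS]


def is_improper_by_distribution(keys):
    """Flag the key sequence when any threshold level is met."""
    props = subregion_proportions(keys)
    return any(sum(p > threshold for p in props) >= needed for threshold, needed in LEVELS)
```

Because the example subregions overlap (W, E, R, T belong to both the first and the fourth), the proportions do not sum to 1, which is what allows two or three subregions to exceed a threshold at the same time.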
Refer to Fig. 8, which is a structural diagram of the identification module in the second embodiment of the improper text recognition system of the present invention.
In this embodiment, the identification module 12 comprises:
A keyboard distance calculation module 211, configured to calculate the distance on the keyboard between the keys corresponding to every two adjacent characters in the text to be identified, and to calculate the mean value of these distances;
A second comparison module 212, configured to compare the mean value of the distances with a preset mean distance threshold;
A second dividing module 213, configured to classify the text to be identified as improper text when the mean value of the distances is less than the mean distance threshold, and otherwise to classify the text to be identified as normal text.
This embodiment judges whether the text to be identified is improper text according to the distance on the keyboard between the keys corresponding to every two adjacent characters. Randomly input improper text may also be produced by sweeping continuously across the keys of the keyboard. For example, sweeping in turn across the keys QWERTYUIOPLKJHGFDSA yields the randomly input improper text: "remove to play physical culture i Aupres card and slow down spreading of public expense". When this text is identified according to this embodiment, the keyboard distance between every two adjacent keys in QWERTYUIOPLKJHGFDSA is calculated, giving a mean distance of 1.0, which is less than the preset mean distance threshold (for example, 2.0). The text to be identified is therefore classified as improper text.
Preferably, when the keyboard distance calculation module 211 calculates the distance on the keyboard between the keys corresponding to every two adjacent characters in the text to be identified, it may apply different weights to the lateral distance and the longitudinal distance between the keys. That is, the keyboard distance calculation module 211 calculates a weighted keyboard distance from the lateral distance and the longitudinal distance on the keyboard between the keys corresponding to every two adjacent characters in the text to be identified, according to the following formula:
Dist=x+α·y
Here, Dist is the calculated weighted keyboard distance, x is the lateral distance, y is the longitudinal distance, and α is the weight of the longitudinal distance relative to the lateral distance, with α > 1.
It is generally assumed that, when a user hits the keyboard at random, moving vertically costs more than moving horizontally, so the weight is usually set to α > 1. For example, with α set to 2, the lateral distance between the letters S and T on the keyboard is 2.5 and the longitudinal distance is 1, so their weighted keyboard distance is 2.5 + 2 × 1 = 4.5. Suppose the text contains N characters (counting only Chinese characters and English letters, and excluding digits, punctuation and non-standard characters). Then the N-1 keyboard distances between every two adjacent characters are calculated, their mean value is computed, and the text to be identified is classified according to this mean value and the mean distance threshold.
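The weighted distance and the mean-distance check can be sketched as follows. The key coordinates are an assumption: the middle row is shifted by 0.5 key widths, which reproduces the lateral distance of 2.5 between S and T given in the text, while the shift of 1.0 for the bottom row is a guess. All names (`KEY_POS`, `weighted_distance`, `mean_weighted_distance`, `is_improper_by_distance`) are hypothetical.

```python
ROWS = ["QWERTYUIOP", "ASDFGHJKL", "ZXCVBNM"]
ROW_OFFSETS = [0.0, 0.5, 1.0]  # assumed horizontal shift of each row, in key widths

# (x, y) position of every letter key under the assumed layout.
KEY_POS = {
    key: (col + ROW_OFFSETS[row], float(row))
    for row, keys in enumerate(ROWS)
    for col, key in enumerate(keys)
}


def weighted_distance(a, b, alpha=2.0):
    """Dist = x + alpha * y for two letter keys."""
    (xa, ya), (xb, yb) = KEY_POS[a.upper()], KEY_POS[b.upper()]
    return abs(xa - xb) + alpha * abs(ya - yb)


def mean_weighted_distance(keys, alpha=2.0):
    """Mean of the N-1 distances between adjacent letter keys; None if fewer than two."""
    letters = [k.upper() for k in keys if k.upper() in KEY_POS]  # digits, punctuation skipped
    pairs = list(zip(letters, letters[1:]))
    if not pairs:
        return None
    return sum(weighted_distance(a, b, alpha) for a, b in pairs) / len(pairs)


def is_improper_by_distance(keys, threshold=2.0, alpha=2.0):
    """Flag the key sequence when its mean adjacent-key distance is below the threshold."""
    mean = mean_weighted_distance(keys, alpha)
    return mean is not None and mean < threshold
```

With α = 2, `weighted_distance("S", "T")` gives 2.5 + 2 × 1 = 4.5, matching the worked example. For the sweep QWERTYUIOPLKJHGFDSA with α = 1, these assumed coordinates give a mean of roughly 1.03, close to the 1.0 quoted in the text and below the example threshold of 2.0.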
Refer to Fig. 9, which is a structural diagram of the identification module in the third embodiment of the improper text recognition system of the present invention.
In this embodiment, the identification module 12 comprises:
A distribution proportion computing module 201, configured to determine, according to a plurality of preset keyboard subregions, the distribution proportion of the obtained keys in each keyboard subregion;
A keyboard distance calculation module 211, configured to calculate the distance on the keyboard between the keys corresponding to every two adjacent characters in the text to be identified, and to calculate the mean value of these distances;
A third dividing module 221, configured to classify the text to be identified as normal text or improper text based on the distribution proportion and the mean value of the distances, according to the preset classification criterion corresponding to each keyboard subregion; the preset classification criterion corresponding to each keyboard subregion comprises a preset distribution proportion threshold and a preset mean distance threshold.
In this way, the distribution proportion of the keys and the mean distance between the keys are used together as the basis for judging whether the text to be identified is improper text, which makes the recognition result more accurate.
Preferably, to further improve the accuracy of the recognition result, the preset classification criterion corresponding to each keyboard subregion comprises a plurality of preset distribution proportion thresholds and a plurality of mean distance thresholds, each corresponding to one of the distribution proportion thresholds. This realizes multi-level threshold judgment and makes the recognition result more accurate.
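One possible reading of the combined, multi-level criterion is sketched below. The text does not spell out how the paired thresholds are combined, so the rule used here (a subregion triggers when its proportion exceeds a proportion threshold and the mean distance is below the paired distance threshold) is an assumption, as are the names `Criterion` and `is_improper` and every numeric value in `EXAMPLE_LEVELS`.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Criterion:
    min_proportion: float     # distribution proportion threshold
    max_mean_distance: float  # mean distance threshold paired with it


def is_improper(
    proportions: Sequence[float],
    mean_distance: float,
    criteria_per_subregion: Sequence[Sequence[Criterion]],
) -> bool:
    """Return True when any subregion satisfies any of its paired thresholds."""
    for proportion, criteria in zip(proportions, criteria_per_subregion):
        for c in criteria:
            if proportion > c.min_proportion and mean_distance < c.max_mean_distance:
                return True
    return False


# Hypothetical levels: the more concentrated the keys, the larger the
# mean distance that is still treated as suspicious.
EXAMPLE_LEVELS = [Criterion(0.9, 4.0), Criterion(0.7, 3.0), Criterion(0.4, 2.0)]
EXAMPLE_CRITERIA = [EXAMPLE_LEVELS] * 7  # same levels for all seven subregions
```

Keeping a separate list of criteria per subregion mirrors the statement that each keyboard subregion has its own preset classification criterion, even though this example reuses one list for all seven.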
In addition, because punctuation marks and digits are generally unlikely to appear in randomly input improper text, the identification module 12 can further base its identification on the quantity of punctuation marks or digits in the text to be identified.
That is, the identification module 12 further comprises:
A symbol distribution acquisition module (not shown), configured to obtain the distribution proportion of digits or symbols in the text to be identified;
A fourth dividing module (not shown), configured to classify the text to be identified as normal text or improper text based on the distribution proportion, the mean value of the distances, and the distribution proportion of the digits or symbols, according to the preset classification criterion corresponding to each keyboard subregion; the preset classification criterion corresponding to each keyboard subregion comprises a preset distribution proportion threshold, a preset mean distance threshold, and a preset digit or symbol distribution proportion.
Using the quantity of symbols or digits as an additional criterion can further improve the recognition of improper text.
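The digit and symbol proportion could be computed as in the sketch below. It is an illustration only: using Unicode general categories to decide what counts as a digit or symbol, ignoring whitespace, and the cut-off of 5% in `low_symbol_share` are all assumptions, and the function names are hypothetical.

```python
import unicodedata


def digit_symbol_proportion(text: str) -> float:
    """Share of non-whitespace characters that are digits, punctuation or symbols."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    # Unicode general categories: N* = numbers, P* = punctuation, S* = symbols.
    hits = sum(unicodedata.category(c)[0] in ("N", "P", "S") for c in chars)
    return hits / len(chars)


def low_symbol_share(text: str, max_proportion: float = 0.05) -> bool:
    """True when digits and symbols are rare, which is consistent with random input."""
    return digit_symbol_proportion(text) <= max_proportion
```

Since ordinary text also often has few digits or symbols, a check like this is only meaningful as one condition alongside the distribution proportion and mean distance criteria, as the fourth dividing module describes.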
A person of ordinary skill in the art will appreciate that all or part of the processes and corresponding systems in the above embodiments can be implemented by a computer program instructing the relevant hardware. The program can be stored in a computer-readable storage medium and, when executed, can include the processes of the embodiments described above. The storage medium can be a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or the like.
The embodiments above express only several implementations of the present invention. Their description is relatively specific and detailed, but it should not be construed as limiting the scope of the patent. It should be pointed out that a person of ordinary skill in the art can make a number of variations and improvements without departing from the concept of the invention, and these all fall within the protection scope of the present invention. The protection scope of this patent shall therefore be determined by the appended claims.

Claims (18)

1. An improper text recognition method, characterized by comprising the steps of:
obtaining, for each character in a text to be identified, the key corresponding to the first letter of that character's input;
classifying the text to be identified as normal text or improper text according to the distribution on the keyboard of the keys corresponding to the first letters of the characters' input.
2. The improper text recognition method of claim 1, characterized in that the step of obtaining, for each character in the text to be identified, the key corresponding to the first letter of that character's input comprises:
looking up, for each character in the text to be identified, a mapping table established in advance to obtain the corresponding key; wherein the mapping table records each character and the key corresponding to the first letter of that character's input.
3. The improper text recognition method of claim 1, characterized in that the step of obtaining, for each character in the text to be identified, the key corresponding to the first letter of that character's input comprises:
obtaining the key corresponding to the first pinyin letter of each Chinese character in the text to be identified, and taking it as the key corresponding to the first letter of that Chinese character's input;
or,
obtaining the key corresponding to each English letter in the text to be identified, and taking it as the key corresponding to the first letter of that English letter's input.
4. The improper text recognition method of claim 1, characterized in that the step of classifying the text to be identified as normal text or improper text according to the distribution on the keyboard of the keys corresponding to the first letters of the characters' input comprises:
determining, according to a plurality of preset keyboard subregions, the distribution proportion of the obtained keys in each keyboard subregion;
comparing the distribution proportion with a preset distribution proportion threshold;
if the distribution proportion is greater than the distribution proportion threshold, classifying the text to be identified as improper text; otherwise, classifying the text to be identified as normal text.
5. The improper text recognition method of claim 1, characterized in that the step of classifying the text to be identified as normal text or improper text according to the distribution on the keyboard of the keys corresponding to the first letters of the characters' input comprises:
calculating the distance on the keyboard between the keys corresponding to every two adjacent characters in the text to be identified, and calculating the mean value of these distances;
comparing the mean value of the distances with a preset mean distance threshold;
if the mean value is less than the mean distance threshold, classifying the text to be identified as improper text; otherwise, classifying the text to be identified as normal text.
6. The improper text recognition method of claim 1, characterized in that the step of classifying the text to be identified as normal text or improper text according to the distribution on the keyboard of the keys corresponding to the first letters of the characters' input comprises:
determining, according to a plurality of preset keyboard subregions, the distribution proportion of the obtained keys in each keyboard subregion;
calculating the distance on the keyboard between the keys corresponding to every two adjacent characters in the text to be identified, and calculating the mean value of these distances;
classifying the text to be identified as normal text or improper text based on the distribution proportion and the mean value of the distances, according to the preset classification criterion corresponding to each keyboard subregion; wherein the preset classification criterion corresponding to each keyboard subregion comprises a preset distribution proportion threshold and a preset mean distance threshold.
7. The improper text recognition method of claim 6, characterized in that the preset classification criterion corresponding to each keyboard subregion comprises a plurality of preset distribution proportion thresholds and a plurality of mean distance thresholds, each corresponding to one of the distribution proportion thresholds.
8. The improper text recognition method of claim 6, characterized by further obtaining the distribution proportion of digits or symbols in the text to be identified;
and classifying the text to be identified as normal text or improper text based on the distribution proportion, the mean value of the distances, and the distribution proportion of the digits or symbols, according to the preset classification criterion corresponding to each keyboard subregion; wherein the preset classification criterion corresponding to each keyboard subregion comprises a preset distribution proportion threshold, a preset mean distance threshold, and a preset digit or symbol distribution proportion.
9. The improper text recognition method of any one of claims 5 to 8, characterized in that the step of calculating the distance on the keyboard between the keys corresponding to every two adjacent characters in the text to be identified comprises:
calculating a weighted keyboard distance from the lateral distance and the longitudinal distance on the keyboard between the keys corresponding to every two adjacent characters in the text to be identified, according to the following formula:
Dist=x+α·y
wherein Dist is the calculated weighted keyboard distance, x is the lateral distance, y is the longitudinal distance, and α is the weight of the longitudinal distance relative to the lateral distance, with α > 1.
10. An improper text recognition system, characterized by comprising:
a key acquisition module, configured to obtain, for each character in a text to be identified, the key corresponding to the first letter of that character's input;
an identification module, configured to classify the text to be identified as normal text or improper text according to the distribution on the keyboard of the keys corresponding to the first letters of the characters' input.
11. The improper text recognition system of claim 10, characterized in that the key acquisition module looks up, for each character in the text to be identified, a mapping table established in advance to obtain the corresponding key; wherein the mapping table records each character and the key corresponding to the first letter of that character's input.
12. The improper text recognition system of claim 10, characterized in that the key acquisition module obtains the key corresponding to the first pinyin letter of each Chinese character in the text to be identified, and takes it as the key corresponding to the first letter of that Chinese character's input; or obtains the key corresponding to each English letter in the text to be identified, and takes it as the key corresponding to the first letter of that English letter's input.
13. The improper text recognition system of claim 10, characterized in that the identification module comprises:
a distribution proportion computing module, configured to determine, according to a plurality of preset keyboard subregions, the distribution proportion of the obtained keys in each keyboard subregion;
a first comparison module, configured to compare the distribution proportion with a preset distribution proportion threshold;
a first dividing module, configured to classify the text to be identified as improper text when the distribution proportion is greater than the distribution proportion threshold, and otherwise to classify the text to be identified as normal text.
14. The improper text recognition system of claim 10, characterized in that the identification module comprises:
a keyboard distance calculation module, configured to calculate the distance on the keyboard between the keys corresponding to every two adjacent characters in the text to be identified, and to calculate the mean value of these distances;
a second comparison module, configured to compare the mean value of the distances with a preset mean distance threshold;
a second dividing module, configured to classify the text to be identified as improper text when the mean value of the distances is less than the mean distance threshold, and otherwise to classify the text to be identified as normal text.
15. The improper text recognition system of claim 10, characterized in that the identification module comprises:
a distribution proportion computing module, configured to determine, according to a plurality of preset keyboard subregions, the distribution proportion of the obtained keys in each keyboard subregion;
a keyboard distance calculation module, configured to calculate the distance on the keyboard between the keys corresponding to every two adjacent characters in the text to be identified, and to calculate the mean value of these distances;
a third dividing module, configured to classify the text to be identified as normal text or improper text based on the distribution proportion and the mean value of the distances, according to the preset classification criterion corresponding to each keyboard subregion; wherein the preset classification criterion corresponding to each keyboard subregion comprises a preset distribution proportion threshold and a preset mean distance threshold.
16. The improper text recognition system of claim 15, characterized in that the preset classification criterion corresponding to each keyboard subregion comprises a plurality of preset distribution proportion thresholds and a plurality of mean distance thresholds, each corresponding to one of the distribution proportion thresholds.
17. The improper text recognition system of claim 15, characterized in that the identification module further comprises:
a symbol distribution acquisition module, configured to obtain the distribution proportion of digits or symbols in the text to be identified;
a fourth dividing module, configured to classify the text to be identified as normal text or improper text based on the distribution proportion, the mean value of the distances, and the distribution proportion of the digits or symbols, according to the preset classification criterion corresponding to each keyboard subregion; wherein the preset classification criterion corresponding to each keyboard subregion comprises a preset distribution proportion threshold, a preset mean distance threshold, and a preset digit or symbol distribution proportion.
18. The improper text recognition system of any one of claims 14 to 17, characterized in that, when calculating the distance on the keyboard between the keys corresponding to every two adjacent characters in the text to be identified, the keyboard distance calculation module calculates a weighted keyboard distance from the lateral distance and the longitudinal distance on the keyboard between those keys, according to the following formula:
Dist=x+α·y
wherein Dist is the calculated weighted keyboard distance, x is the lateral distance, y is the longitudinal distance, and α is the weight of the longitudinal distance relative to the lateral distance, with α > 1.
CN201210264218.9A 2012-07-27 2012-07-27 Improper text recognition method and its system Active CN103576882B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210264218.9A CN103576882B (en) 2012-07-27 2012-07-27 Improper text recognition method and its system

Publications (2)

Publication Number Publication Date
CN103576882A true CN103576882A (en) 2014-02-12
CN103576882B CN103576882B (en) 2018-03-09

Family

ID=50048833

Country Status (1)

Country Link
CN (1) CN103576882B (en)

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1928860A (en) * 2005-09-05 2007-03-14 日电(中国)有限公司 Method, search engine and search system for correcting key errors
CN101266520A (en) * 2008-04-18 2008-09-17 黄晓凤 System for accomplishing live keyboard layout
CN101304589A (en) * 2008-04-14 2008-11-12 中国联合通信有限公司 Method and system for monitoring and filtering garbage short message transmitted by short message gateway
CN101510879A (en) * 2009-03-26 2009-08-19 腾讯科技(深圳)有限公司 Method and apparatus for filtering rubbish contents
CN101710262A (en) * 2009-12-11 2010-05-19 北京搜狗科技发展有限公司 Error correction method and error correction device of characters
EP2264563A1 (en) * 2009-06-19 2010-12-22 Tegic Communications, Inc. Virtual keyboard system with automatic correction
WO2011113057A1 (en) * 2010-03-12 2011-09-15 Nuance Communications, Inc. Multimodal text input system, such as for use with touch screens on mobile phones

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106445908A (en) * 2015-08-07 2017-02-22 阿里巴巴集团控股有限公司 Text identification method and apparatus
CN106445908B (en) * 2015-08-07 2019-11-15 阿里巴巴集团控股有限公司 Text recognition method and device
CN111597310A (en) * 2020-05-26 2020-08-28 成都卫士通信息产业股份有限公司 Sensitive content detection method, device, equipment and medium
CN111597310B (en) * 2020-05-26 2023-10-20 成都卫士通信息产业股份有限公司 Sensitive content detection method, device, equipment and medium

Also Published As

Publication number Publication date
CN103576882B (en) 2018-03-09

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
EXSB Decision made by sipo to initiate substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant