CN111144107B - Messy code identification method based on slicing algorithm - Google Patents

Publication number
CN111144107B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911357125.9A
Other languages
Chinese (zh)
Other versions
CN111144107A (en)
Inventor
刘德建
张伟泽
陈宏展
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujian Tianqing Online Interactive Technology Co Ltd
Original Assignee
Fujian Tianqing Online Interactive Technology Co Ltd
Priority date
2019-12-25
Filing date
2019-12-25
Publication date
2022-08-09

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques

Abstract

The invention provides a messy code (garbled text) identification method based on a slicing algorithm. A set of character strings to be judged is taken out and combined with a set of non-messy-code character strings in the encoding of the corresponding language to form a new character string set. The original length of each character string is obtained, short character strings are expanded with supplementary characters, the number of words in each character string is calculated by a slicing algorithm, and the ratio of the number of words to the length is computed to form a set. Interval estimation is performed on the variables in the set using the normal distribution, a probabilistic judgment is made according to whether each variable falls within the interval, and messy code text is identified at the probability associated with the horizontal-axis interval of a chosen confidence level. The invention can identify messy codes automatically and efficiently with reliable accuracy.

Description

Messy code identification method based on slicing algorithm
Technical Field
The invention relates to the field of character string filtering and processing, in particular to a messy code identification method based on a slicing algorithm.
Background
At present, one method for recognizing messy codes extracts the text to be recognized and identifies its encoding format. After the first encoding format of the first text to be recognized in a page is obtained, the first text is converted into a second text having a second encoding format, according to the correspondence between characters of the second encoding format and characters of other encoding formats. The second text is then converted into a third text according to the correspondence between characters of the second encoding format and characters of the first encoding format, so that whether the first text contains messy codes can be determined according to the third text and the first text (from: "A method and a device for identifying messy codes of texts in web pages"). Alternatively, an operator checks the texts one by one to judge whether they contain messy codes.
The existing method converts the initial text inefficiently. If the character set of the specified second encoding format has no corresponding byte codes, the conversion goes wrong, and the subsequent conversion back goes wrong as well; the method therefore places high demands on the character set. Identification by an operator is prone to errors when the operator is tired or the workload is large.
Disclosure of Invention
To overcome these problems, the invention aims to provide a messy code identification method based on a slicing algorithm that improves identification efficiency and reliability.
The invention is realized by the following scheme: a messy code identification method based on a slicing algorithm, comprising the following steps: step S1, taking out the character string set to be judged, and combining it with a non-messy-code character string set in the encoding of the corresponding language to form a new character string set; performing the operations of steps S2 through S5 on each character string in the new character string set;
step S2, obtaining the original length of the character string and classifying the character string as long or short according to the original length; if the character string is short, performing single-character supplementary expansion on it and judging whether the supplemented character string has reached the shortest length of the long class; if not, continuing the single-character supplement; if so, taking the length of the expanded character string as the processed length and entering step S3; if the character string is long, directly taking the original length as the processed length and entering step S4;
step S3, performing slicing word segmentation on the expanded character string to obtain the number of words, and entering step S5;
step S4, performing slicing word segmentation on the character string to obtain the number of words, and entering step S5;
step S5, calculating a variable x = number of words / processed length to obtain the set of variables x corresponding to the character string set, performing interval estimation on the set of variables x using the normal distribution, and making a probabilistic judgment according to whether each variable falls within the interval, thereby determining whether the character string to be judged is a messy code.
Further, the single characters used in the single-character supplementary expansion of step S2 are randomly selected characters; the added single characters cannot form new words with each other, and must form words with the character string.
Further, the slicing word segmentation is as follows: a dictionary of words is predefined; when the program runs, the character string to be recognized is scanned to check whether words from the dictionary appear in it. Matching gives priority to length, proceeding from long to short: the candidate length is reduced by one character at a time, and each candidate is checked against the dictionary in turn, to obtain the final number of words.
Further, step S5 is specifically: calculating the variable x = number of words / processed length, and computing the mean u and the standard deviation a of the set of variables x; with reference to the normal distribution with mean u and standard deviation a, the area under the curve within the horizontal-axis interval (u - a, u + a) is 68.268949%, within (u - 2a, u + 2a) is 95.449974%, and within (u - 3a, u + 3a) is 99.730020% (the multipliers 1.96 and 2.58 correspond to areas of about 95% and 99%, respectively); that is, the probabilities within the horizontal-axis intervals of the three confidence levels are 68.27%, 95.45%, and 99.73%, respectively. Taking these as the probabilities that a normal character string falls within the interval, a character string whose x falls outside the interval can be judged to be abnormal, i.e. a messy code, at the same probability.
Furthermore, in the slicing word segmentation after single-character supplementary expansion, a single character that cannot be combined with its neighboring characters into a word is itself counted as one word.
The invention has the following beneficial effects: 1. By using the slicing algorithm together with the properties of the normal distribution for estimation, messy codes can be identified automatically and efficiently, with reliable accuracy.
2. Compared with methods that identify messy codes by converting the text into another text, the method reduces memory I/O, hands more of the work to the CPU as arithmetic operations, and improves running speed.
Drawings
FIG. 1 is a schematic flow diagram of the process of the present invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings.
Referring to FIG. 1, the messy code identification method based on a slicing algorithm of the present invention includes the following steps: step S1, taking out the character string set to be judged, and combining it with a non-messy-code character string set in the encoding of the corresponding language to form a new character string set; performing the operations of steps S2 through S5 on each character string in the new character string set;
step S2, obtaining the original length of the character string and classifying the character string as long or short according to the original length; if the character string is short, performing single-character supplementary expansion on it and judging whether the supplemented character string has reached the shortest length of the long class; if not, continuing the single-character supplement; if so, taking the length of the expanded character string as the processed length and entering step S3; if the character string is long, directly taking the original length as the processed length and entering step S4; the single characters used in the single-character supplementary expansion of step S2 are randomly selected characters, and the added single characters cannot form new words with each other, and must form words with the character string;
step S3, performing slicing word segmentation on the expanded character string (a single character that cannot be combined with its neighboring characters into a word is also counted as one word) to obtain the number of words, and entering step S5;
step S4, performing slicing word segmentation on the character string to obtain the number of words, and entering step S5;
step S5, calculating a variable x = number of words / processed length to obtain the set of variables x corresponding to the character string set, performing interval estimation on the set of variables x using the normal distribution, and making a probabilistic judgment according to whether each variable falls within the interval, thereby determining whether the character string to be judged is a messy code.
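Steps S2 and S5 can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the names `pad_short_string` and `ratio_x`, the `min_long_len` threshold (assumed to be at least 1), and the `count_words` callable are hypothetical; each candidate is assumed to be exactly one character; padding characters are appended at the end of the string; and only the condition that added characters must not form words with each other is modeled.

```python
import random
from typing import Callable, Optional, Sequence, Set


def pad_short_string(s: str, min_long_len: int, candidates: Sequence[str],
                     dictionary: Set[str],
                     rng: Optional[random.Random] = None) -> str:
    """Append random single characters until s reaches min_long_len (step S2)."""
    rng = rng or random.Random()
    padded = s
    while len(padded) < min_long_len:
        # The most recently added padding character, if any has been added.
        prev = padded[-1] if len(padded) > len(s) else None
        # Keep only candidates that do not form a dictionary word with it.
        allowed = [c for c in candidates
                   if prev is None or (prev + c) not in dictionary]
        if not allowed:
            raise ValueError("no candidate character keeps the padding word-free")
        padded += rng.choice(allowed)
    return padded


def ratio_x(s: str, min_long_len: int, candidates: Sequence[str],
            dictionary: Set[str],
            count_words: Callable[[str, Set[str]], int],
            rng: Optional[random.Random] = None) -> float:
    """Return x = number of words / processed length (step S5)."""
    processed = s
    if len(s) < min_long_len:  # short class: expand before segmenting
        processed = pad_short_string(s, min_long_len, candidates, dictionary, rng)
    return count_words(processed, dictionary) / len(processed)
```

Passing the word counter in as a callable keeps the padding logic independent of whichever segmentation routine is used.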
The slicing word segmentation is as follows: a dictionary of words is predefined; when the program runs, the character string to be recognized is scanned to check whether words from the dictionary appear in it. Matching gives priority to length, proceeding from long to short: the candidate length is reduced by one character at a time, and each candidate is checked against the dictionary in turn, to obtain the final number of words.
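One way to read the length-priority matching described above is sketched below. The function name `count_words` and the recursive split around the first longest match are assumptions made for illustration; the source does not specify how ties between equally long candidates are resolved, so this sketch simply takes the leftmost one.

```python
from typing import Set


def count_words(s: str, dictionary: Set[str]) -> int:
    """Count words in s, trying the longest dictionary candidates first."""
    n = len(s)
    # Try candidate lengths from the whole string down to two characters.
    for length in range(n, 1, -1):
        for start in range(0, n - length + 1):
            if s[start:start + length] in dictionary:
                # One word matched; segment what lies on either side of it.
                left, right = s[:start], s[start + length:]
                return (count_words(left, dictionary) + 1
                        + count_words(right, dictionary))
    # No multi-character match: every remaining character counts as a word.
    return n
```

For example, `count_words("SJMY", {"SJ", "MY"})` yields 2 and `count_words("SJMY", {"JM"})` yields 3. The scan is quadratic in the string length per level of recursion, which is acceptable for short strings but would need a trie or similar structure for long inputs.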
In the present invention, step S5 is specifically: calculating the variable x = number of words / processed length, and computing the mean u and the standard deviation a of the set of variables x; with reference to the normal distribution with mean u and standard deviation a, the area under the curve within the horizontal-axis interval (u - a, u + a) is 68.268949%, within (u - 2a, u + 2a) is 95.449974%, and within (u - 3a, u + 3a) is 99.730020% (the multipliers 1.96 and 2.58 correspond to areas of about 95% and 99%, respectively); that is, the probabilities within the horizontal-axis intervals of the three confidence levels are 68.27%, 95.45%, and 99.73%, respectively. Taking these as the probabilities that a normal character string falls within the interval, a character string whose x falls outside the interval can be judged to be abnormal, i.e. a messy code, at the same probability.
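The interval judgment of step S5 can be sketched as follows. The names `coverage` and `flag_outliers` are hypothetical, and the use of the population standard deviation is an assumption, since the source does not say which form is meant. `coverage(k)` computes the area of a normal distribution within k standard deviations of the mean, which makes it easy to check which multiplier goes with which percentage.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def coverage(k: float) -> float:
    """Area of a normal distribution within (u - k*a, u + k*a)."""
    return math.erf(k / math.sqrt(2))


def flag_outliers(xs: Sequence[float], k: float = 2.0) -> List[bool]:
    """Flag each x that falls outside the interval (u - k*a, u + k*a)."""
    u = mean(xs)
    a = pstdev(xs)  # population standard deviation (an assumption)
    if a == 0:
        # All ratios are identical, so nothing stands out.
        return [False] * len(xs)
    return [not (u - k * a < x < u + k * a) for x in xs]
```

A larger `k` flags fewer strings but with higher confidence that a flagged string is abnormal; because the judged strings are mixed with known-normal strings in step S1, the mean and standard deviation are dominated by normal text.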
The invention is further illustrated below with a specific embodiment. The invention identifies messy code character strings using the statistical normal distribution, and achieves an approximately normal distribution by supplementing normal samples. Here, supplementing normal samples means adding normal (non-messy-code) character strings in proportion to the number of character strings in the set to be recognized, namely 100. The character string set to be judged is taken out and combined with a non-messy-code character string set in the encoding of the corresponding language to form a new character string set, and each character string in the set is processed. The original length of each character string is obtained, and short character strings are expanded as follows: first, the number of candidate characters needed is determined from the shortest length of the long class; common characters such as "the" and "and" are then selected at random, subject to the condition that the selected characters cannot form new words with each other. That is, if A and B are randomly output candidate characters, neither AB nor BA is a word; for example, "the and" and "and the" are not words. The number of words in the character string is then calculated by the slicing algorithm. The slicing method predefines a dictionary of words; when the program runs, it scans the character string to be recognized and checks whether dictionary words appear in it, matching with priority given to length. Take the character string "SJMY" as an example:
1. First check whether the whole string "SJMY" is in the dictionary; if so, matching ends and 1 word is obtained.
2. If not, try "SJM" or "JMY"; a match gives two words, either "SJM", "Y" or "S", "JMY".
3. If not, try "SJ" and "MY", which gives 2 words, or "JM", which gives the 3 words "S", "JM", "Y".
4. If there is still no match, the worst case gives the four words "S", "J", "M", "Y".
The ratio of the number of words to the length is then computed for each character string: ratio = number of words / length. The ratios together form a set.
Set representation: { x | x = number of words / length }. Interval estimation based on the normal distribution is performed on the variables in the set. The mean u and the standard deviation a of the set are calculated, and the normal distribution is referenced. The area of the normal distribution within the horizontal-axis interval (u - a, u + a) is 68.268949%, within (u - 2a, u + 2a) is 95.449974%, and within (u - 3a, u + 3a) is 99.730020% (the multipliers 1.96 and 2.58 correspond to areas of about 95% and 99%, respectively). That is, the probabilities within the three confidence intervals are 68.27%, 95.45%, and 99.73%, respectively. If these are taken as the probabilities that a normal character string falls within the interval, then a character string outside the interval can be judged to be abnormal, i.e. a messy code, at the same probability; the two judgments are events of equal probability. A probabilistic judgment is made according to whether each variable falls within the interval, so messy code text can be identified above a chosen confidence probability.
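The ratio and the summary statistics described above can be written out as follows. The symbols w_i (number of words), \ell_i (processed length), and n (number of character strings in the combined set) are notation introduced here for illustration, and the population form of the standard deviation is an assumption.

```latex
x_i = \frac{w_i}{\ell_i}, \qquad
u = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad
a = \sqrt{\frac{1}{n}\sum_{i=1}^{n} \left(x_i - u\right)^2}
```

For the four-character string "SJMY", the four matching outcomes listed above give x = 1/4, 2/4, 3/4, and 4/4 respectively, so a lower x means more of the string is covered by dictionary words.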
The invention can be applied in the following ways: 1. It can be provided to developers as an API. Before use, the developer prepares the list of character strings to be recognized, then passes the list and the character string encoding format to the API recognition program as parameters. The flags in the output indicate whether each character string is a messy code.
2. It can be packaged as a visual tool. The character strings to be recognized are first stored in a text file, one per line; the text encoding format is specified through the tool's input, the tool is started, and the absolute path of the text file is provided. After the tool finishes its calculation, it outputs the flagged results to the user line by line.
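The file-based tool mode might be wrapped as in the sketch below. Everything here is hypothetical: the function name `run_tool`, the `Recognizer` callable type, the `OK`/`MESSY` labels, and the choice to decode undecodable bytes with `errors="replace"` are illustrative assumptions, not details given in the source.

```python
from pathlib import Path
from typing import Callable, List, Sequence

# A recognizer takes the strings and returns one flag per string
# (True meaning "judged to be a messy code").
Recognizer = Callable[[Sequence[str]], List[bool]]


def run_tool(text_path: str, encoding: str, recognize: Recognizer) -> List[str]:
    """Read one string per line, run the recognizer, and label each line."""
    raw = Path(text_path).read_text(encoding=encoding, errors="replace")
    lines = raw.splitlines()
    flags = recognize(lines)
    # Emit the result line by line, in the same order as the input file.
    return [f"{'MESSY' if flag else 'OK'}\t{line}"
            for flag, line in zip(flags, lines)]
```

The API mode corresponds to calling the recognizer directly on an in-memory list; the tool mode only adds file reading and per-line output around it.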
The above description is only a preferred embodiment of the present invention, and all equivalent changes and modifications made in accordance with the claims of the present invention should be covered by the present invention.

Claims (4)

1. A messy code identification method based on a slicing algorithm, characterized in that the method comprises the following steps: step S1, taking out the character string set to be judged, and combining it with a non-messy-code character string set in the encoding of the corresponding language to form a new character string set; performing the operations of steps S2 through S5 on each character string in the new character string set;
step S2, obtaining the original length of the character string and classifying the character string as long or short according to the original length; if the character string is short, performing single-character supplementary expansion on it and judging whether the supplemented character string has reached the shortest length of the long class; if not, continuing the single-character supplement; if so, taking the length of the expanded character string as the processed length and entering step S3; if the character string is long, directly taking the original length as the processed length and entering step S4; the single characters used in the single-character supplementary expansion of step S2 are randomly selected characters, and the added single characters cannot form new words with each other and can form words with the character string;
step S3, performing slicing word segmentation on the expanded character string to obtain the number of words, and entering step S5;
step S4, performing slicing word segmentation on the character string to obtain the number of words, and entering step S5;
step S5, calculating a variable x = number of words / processed length to obtain the set of variables x corresponding to the character string set, performing interval estimation on the set of variables x using the normal distribution, and making a probabilistic judgment according to whether each variable falls within the interval, to determine whether the character string to be judged is a messy code.
2. The messy code identification method based on a slicing algorithm as claimed in claim 1, wherein the slicing word segmentation comprises: predefining a dictionary of words; scanning the character string to be recognized when the program runs and checking whether words from the dictionary appear in it, wherein matching gives priority to length, proceeding from long to short; and reducing the candidate length by one character at a time and checking each candidate against the dictionary in turn, to obtain the final number of words.
3. The messy code identification method based on a slicing algorithm as claimed in claim 1, wherein step S5 further comprises: calculating the variable x = number of words / processed length, and computing the mean u and the standard deviation a of the set of variables x; with reference to the normal distribution with mean u and standard deviation a, the area under the curve within the horizontal-axis interval (u - a, u + a) is 68.268949%, within (u - 2a, u + 2a) is 95.449974%, and within (u - 3a, u + 3a) is 99.730020% (the multipliers 1.96 and 2.58 correspond to areas of about 95% and 99%, respectively); that is, the probabilities within the horizontal-axis intervals of the three confidence levels are 68.27%, 95.45%, and 99.73%, respectively; taking these as the probabilities for normal character strings, it is determined accordingly whether a character string is abnormal, that is, a messy code.
4. The messy code identification method based on a slicing algorithm as claimed in claim 2, wherein, in the slicing word segmentation after single-character supplementary expansion, a single character that cannot be combined with its neighboring characters into a word is itself counted as one word.
CN201911357125.9A 2019-12-25 2019-12-25 Messy code identification method based on slicing algorithm Active CN111144107B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911357125.9A CN111144107B (en) 2019-12-25 2019-12-25 Messy code identification method based on slicing algorithm

Publications (2)

Publication Number Publication Date
CN111144107A CN111144107A (en) 2020-05-12
CN111144107B CN111144107B (en) 2022-08-09

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114363061A (en) * 2021-12-31 2022-04-15 深信服科技股份有限公司 Abnormal flow detection method, system, storage medium and terminal
CN114629707A (en) * 2022-03-16 2022-06-14 深信服科技股份有限公司 Method and device for detecting messy codes, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104732228A (en) * 2015-04-16 2015-06-24 同方知网数字出版技术股份有限公司 Detection and correction method for messy codes of PDF (portable document format) document
CN107608968A (en) * 2017-09-22 2018-01-19 深圳市易图资讯股份有限公司 Chinese word cutting method, the device of text-oriented big data

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8041883B2 (en) * 2007-05-09 2011-10-18 Stmicroelectronics S.R.L. Restoring storage devices based on flash memories and related circuit, system, and method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Chongmu Chen等.Scene Character Recognition Using PCANet.《ICIMCS"15》.2015,第1-4页. *
祝佳 等.邮件内容过滤的中文编码盲识别算法.《计算机工程与应用》.2005,第131-133页. *

Also Published As

Publication number Publication date
CN111144107A (en) 2020-05-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant