CN111078979A

CN111078979A - Method and system for identifying network credit website based on OCR and text processing technology

Info

Publication number: CN111078979A
Application number: CN201911209962.7A
Authority: CN
Inventors: 陶景龙; 梁淑云; 刘胜; 马影; 王启凡; 魏国富; 徐�明; 殷钱安; 余贤喆; 周晓勇
Original assignee: Information and Data Security Solutions Co Ltd
Current assignee: Information and Data Security Solutions Co Ltd
Priority date: 2019-11-29
Filing date: 2019-11-29
Publication date: 2020-04-28

Abstract

The invention provides a method for identifying a network credit website based on OCR and text processing technologies, comprising the following steps: S101, acquiring the URL of a website to be detected; S102, crawling the pictures of the website to be detected using crawler technology, and outputting the picture set for the URL; S103, extracting text from the crawled picture set using OCR technology; S104, filtering the extracted text and segmenting it into words using the jieba word segmentation tool, and converting the segmented words into pinyin to obtain pinyin word segmentation content; and S105, matching the pinyin word segmentation content against network credit keywords, and outputting whether the corresponding URL is a network credit website. The method is efficient and accurate, and fills a technical gap in the field.

Description

Method and system for identifying network credit website based on OCR and text processing technology

Technical Field

The invention relates to the technical field of network credit website identification, in particular to a method and a system for identifying a network credit website based on OCR and text processing technologies.

Background

With the rapid development of the internet finance industry, websites have become easier to set up and the barrier to entry has dropped, which has given rise to many harmful and illegal websites, such as illegal online loan websites, phishing websites and gambling websites. In recent years, incidents such as P2P companies absconding, phishing and telecom fraud have occurred frequently, causing serious property losses, in some cases endangering personal safety, and producing adverse social effects. Identifying network credit websites accurately and promptly makes it possible to remind users to act cautiously, helps prevent the loss of their property, and supports an enterprise's social responsibility and corporate image.

As the threshold for online lending keeps falling, many organizations and enterprises whose main business is online lending have appeared. These enterprises generally operate their own online loan application platforms and rely on the timeliness and interactivity of the internet to carry out their loan business. The access link of a network credit website does not differ noticeably from that of an ordinary website. One way to distinguish a network credit website is to visit the link manually and judge from the content the website displays whether it is a network credit website. This approach consumes a great deal of manpower and time and is inefficient. The website information identification method, device and electronic equipment disclosed in application number 201910565890.3 work mainly as follows: the content of a target website is obtained from the address of the target website, the content comprising text content, picture files and screenshots of the displayed page; exact matching and/or natural language analysis is performed on the text content against a preset library of sensitive and violating terms to determine a text recognition result for the target website; and deep-learning-based image classification is performed on the picture files and the screenshots, using preset sample pictures carrying different type labels, to determine a picture recognition result for the target website. In this way it can be judged effectively whether the website has bad content, and the misjudgment rate is reduced. However, this method processes text and pictures separately and cannot recognize and judge the text inside pictures; in particular, because the pictures are classified by a learned model, the amount of computation is large and the efficiency is low. In addition, the method matches on characters only, and character-level matching has a large error. For example, compared as Chinese characters, the similarity between 贷款 (dai kuan, "loan") and 下款 (xia kuan, "loan disbursement") is only 50 percent, whereas the similarity between the pinyin strings "dai kuan" and "xia kuan" is 75 percent, so the matching result has a large error.

Disclosure of Invention

The invention aims to solve the technical problems of low efficiency and large error of the identification technology of the network credit website.

The invention solves the technical problems through the following technical means:

A method for identifying a network credit website based on OCR and text processing technology comprises the following steps:

S101, acquiring the URL of a website to be detected;

S102, crawling the pictures of the website to be detected using crawler technology, and outputting the picture set for the URL;

S103, extracting text from the crawled picture set using OCR technology;

S104, using the jieba word segmentation tool to filter the extracted text and segment it into words, and converting the segmented words into pinyin to obtain pinyin word segmentation content M;

S105, constructing a pinyin keyword library K, matching the pinyin word segmentation content M against the network credit keywords in the pinyin keyword library, and outputting whether the corresponding URL is a network credit website.

The advantage of the method is that a network credit information keyword library is built by network credit service experts from the text content of website pictures, and both the keyword library and the text to be matched are converted into pinyin using OCR and text processing technology, which improves the recognition rate, reduces the error, and yields a systematic method for identifying network credit websites.

Preferably, the step S102 specifically includes:

S1021, building a crawler system for website pictures using crawler technology, and denoting the crawler system as R;

S1022, inputting the URL to be detected into the crawler system, and outputting the picture set {P} corresponding to the URL.

Preferably, the step S103 specifically includes:

S1031, preprocessing each picture in the picture set {P} by image binarization, noise removal and skew correction;

S1032, dividing the picture into paragraphs and lines using a layout analysis algorithm;

S1033, using a character segmentation algorithm to handle characters that are hard to separate by simple cutting because of touching characters and broken strokes;

S1034, extracting multi-dimensional features from each character image using a character feature extraction algorithm;

S1035, using a character recognition algorithm to perform coarse template classification and fine template matching of the feature vector extracted from the current character against a feature template library, thereby recognizing the character;

S1036, correcting the recognized characters according to semantics, and organizing and outputting them in text format.

Preferably, the step S104 specifically includes:

S1041, using the jieba word segmentation tool to filter special characters out of the text content output by S1036 and segment it into words, obtaining word segmentation content.

Preferably, the step S105 specifically includes:

S1051, having network credit service experts compile network credit information keywords from a large number of network credit websites and materials;

S1052, converting the network credit information keyword library of S1051 into pinyin format using the jieba word segmentation tool and denoting it K, and converting the word segmentation content of S1041 into pinyin format in the same way and denoting it M;

S1053, for each keyword in K, performing fuzzy matching in M using the FuzzyWuzzy tool; if the matching result contains an item whose similarity is greater than a preset value, the text contains network credit information and the URL corresponding to the text is a network credit website; otherwise, the URL corresponding to the text is not a network credit website.

Correspondingly, the invention also provides a system for identifying a network credit website based on OCR and text processing technology, comprising:

a URL acquisition module, used for acquiring the URL of the website to be detected;

a picture crawling module, used for crawling the pictures of the website to be detected using crawler technology and outputting the picture set for the URL;

a character extraction module, used for extracting text from the crawled picture set using OCR technology;

a word segmentation module, used for filtering the extracted text and segmenting it into words using the jieba word segmentation tool;

and a matching module, used for matching the word segmentation content against network credit keywords and outputting whether the corresponding URL is a network credit website.

Preferably, the specific execution process in the picture crawling module is as follows:

First, a crawler system for website pictures is built using crawler technology and denoted R; then the URL to be detected is input into the crawler system, and the picture set {P} corresponding to the URL is output.

Preferably, the specific execution process of the character extraction module is as follows:

First, each picture in the picture set {P} is preprocessed by image binarization, noise removal and skew correction. Then the picture is divided into paragraphs and lines using a layout analysis algorithm; a character segmentation algorithm handles characters that are hard to separate by simple cutting because of touching characters and broken strokes; multi-dimensional features are extracted from each character image using a character feature extraction algorithm; and a character recognition algorithm performs coarse template classification and fine template matching of the feature vector extracted from the current character against a feature template library to recognize the character. Finally, the recognized characters are corrected according to semantics and organized and output in text format.

Preferably, the word segmentation module specifically executes the following steps:

The jieba word segmentation tool is used to filter special characters out of the text content and segment it into words, obtaining word segmentation content.

Preferably, the matching module specifically executes the following steps:

First, network credit service experts compile network credit information keywords from a large number of network credit websites and materials. Next, the jieba word segmentation tool is used to convert the network credit information keyword library into pinyin format, denoted K, and the word segmentation content of S1041 is converted into pinyin format in the same way, denoted M. Finally, for each keyword in K, fuzzy matching is performed in M using the FuzzyWuzzy tool; if the matching result contains an item whose similarity is greater than a preset value, the text contains network credit information and the URL corresponding to the text is a network credit website; otherwise, the URL corresponding to the text is not a network credit website.

The invention has the advantages that:

Drawings

FIG. 1 is a block flow diagram of a method in embodiment 1 of the present invention;

FIG. 2 is a flowchart of the picture crawler system in step S102 of embodiment 1 of the present invention;

FIG. 3 is a flowchart of the OCR character extraction system in step S103 of embodiment 1 of the present invention;

FIG. 4 is an example of sampled text information from a network credit picture in step S104 of embodiment 1 of the present invention;

FIG. 5 is an example of the fuzzy matching process and result in step S105 of embodiment 1 of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Example 1

As shown in fig. 1, the present embodiment provides a method for identifying a network credit website based on OCR and text processing technologies, so as to identify whether the website is a network credit-type website. The method comprises the following steps:

S101, acquiring the URL of a website to be detected;

S102, crawling the pictures of the website to be detected using crawler technology, and outputting the picture set for the URL;

S103, extracting text from the crawled picture set using OCR technology;

S104, using the jieba word segmentation tool to filter the extracted text and segment it into words, and converting the segmented words into pinyin to obtain pinyin word segmentation content;

S105, constructing a pinyin keyword library, matching the pinyin word segmentation content against the network credit keywords in the pinyin keyword library, and outputting whether the corresponding URL is a network credit website. Each step is described in detail below.

The method in S101 is as follows:

acquiring the URL of the website to be detected.

The method in S102, as shown in FIG. 2, is as follows:

S1021, building a crawler system for website pictures using crawler technology, and denoting the crawler system as R. Crawler technology refers to web crawler technology: a program or script that automatically captures web information according to certain rules. Less common names include ants, automatic indexers, emulators, and worms.

S1022, inputting the URL to be detected into the crawler system, and outputting the picture set {P} corresponding to the URL.
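The picture-collection part of S1022 can be sketched as follows, assuming the page HTML has already been downloaded. The sketch uses only Python's standard library; `ImgCollector` and `extract_image_urls` are illustrative names that do not appear in the patent, and a real crawler system R would also need page fetching, retries, and handling of pictures loaded by scripts or CSS.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImgCollector(HTMLParser):
    """Collects the absolute URL of every <img src=...> in a page (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            url = urljoin(self.base_url, src)  # resolve relative paths
            if url not in self.urls:           # keep each picture once
                self.urls.append(url)


def extract_image_urls(html, base_url):
    """Return the picture set {P} of a page as a list of absolute URLs."""
    collector = ImgCollector(base_url)
    collector.feed(html)
    return collector.urls


page = '<img src="/a.png"><img src="b/c.jpg"/><img src="/a.png">'
print(extract_image_urls(page, "http://example.com/page/"))
```

Downloading each returned URL would then give the picture files passed to the OCR step.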

the method in S103 is, as shown in fig. 3:

S1031, preprocessing each picture in the picture set {P}, the preprocessing mainly comprising image binarization, noise removal, skew correction and the like;

S1032, dividing the picture into paragraphs and lines using a layout analysis algorithm;

S1033, using a character segmentation algorithm to handle characters that are hard to separate by simple cutting because of touching characters and broken strokes;

S1034, extracting multi-dimensional features from each character image using a character feature extraction algorithm, for subsequent feature matching;

S1035, using a character recognition algorithm to perform coarse template classification and fine template matching of the feature vector extracted from the current character against a feature template library, thereby recognizing the character;

S1036, correcting the recognized characters according to semantics, and organizing and outputting them in text format.
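The binarization part of the preprocessing in S1031 can be illustrated with a small sketch. The patent does not say which binarization method is used; the example below assumes Otsu's global threshold, a common choice, and works on a flat list of 8-bit gray levels. The function names are illustrative only.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximizes the between-class variance."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels at or below the candidate threshold
    sum0 = 0  # sum of the gray levels of those pixels
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue
        w1 = total - w0
        if w1 == 0:
            break
        m0 = sum0 / w0              # mean of the dark class
        m1 = (sum_all - sum0) / w1  # mean of the bright class
        var = w0 * w1 * (m0 - m1) ** 2
        if var > best_var:
            best_var, best_t = var, t
    return best_t


def binarize(pixels):
    """Map each gray level to 0 (at or below the threshold) or 1 (above it)."""
    t = otsu_threshold(pixels)
    return [1 if p > t else 0 for p in pixels]


print(binarize([10, 11, 12, 200, 205, 210]))  # [0, 0, 0, 1, 1, 1]
```

Noise removal and skew correction would follow as separate steps and are not shown here.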

the specific method in S104 is as follows:

S1041, as shown in FIG. 4, using the jieba word segmentation tool to filter special characters out of the text content output by S1036, segment it into words, and then convert the words into pinyin, obtaining pinyin word segmentation content denoted as the set M.
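A rough sketch of the filtering and pinyin conversion in S1041 follows. It makes several simplifying assumptions: special characters are removed with a regular expression, splitting on spaces stands in for jieba segmentation, and a tiny hypothetical lookup table `PINYIN` stands in for a real pinyin converter. None of these names come from the patent.

```python
import re

# Hypothetical toy table; a real system would use a full pinyin dictionary.
PINYIN = {"贷": "dai", "款": "kuan", "下": "xia", "秒": "miao", "批": "pi"}


def filter_special_chars(text):
    """Replace each run of characters that are not CJK, letters or digits with one space."""
    return re.sub(r"[^\u4e00-\u9fffA-Za-z0-9]+", " ", text).strip()


def to_pinyin(word):
    """Convert one segmented word to space-separated pinyin; unknown characters pass through."""
    return " ".join(PINYIN.get(ch, ch) for ch in word)


cleaned = filter_special_chars("★贷款!!秒批~~")
M = [to_pinyin(w) for w in cleaned.split()]  # whitespace split stands in for jieba
print(M)  # ['dai kuan', 'miao pi']
```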

Converting Chinese into pinyin can improve the matching rate. For example, compared as Chinese characters, the similarity between 贷款 (dai kuan, "loan") and 下款 (xia kuan, "loan disbursement") is only 50 percent, whereas the similarity between the pinyin strings "dai kuan" and "xia kuan" is as high as 75 percent.
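The two percentages can be reproduced with a short check, assuming the two terms are 贷款 and 下款 (as their pinyin suggests) and using the standard library's `difflib.SequenceMatcher` as a stand-in for the similarity score; the patent itself names the FuzzyWuzzy tool for matching.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity on a 0-100 scale: 2 * matched characters / total characters."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


print(similarity("贷款", "下款"))          # 50: one of two characters matches
print(similarity("dai kuan", "xia kuan"))  # 75: six of eight characters match
```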

jieba is an open-source Chinese word segmentation tool. Its algorithms include: efficient word-graph scanning based on a Trie structure, which generates a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in a sentence; dynamic programming to search for the maximum-probability path and find the best segmentation based on word frequency; and, for out-of-vocabulary words, an HMM model based on the word-forming capability of Chinese characters, decoded with the Viterbi algorithm.
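The DAG and dynamic-programming steps described above can be sketched in a few lines. This is a toy illustration of the idea, not jieba's actual code: the frequency table `FREQ` is hypothetical, a plain dictionary replaces the Trie, and the HMM step for out-of-vocabulary words is omitted.

```python
import math

# Hypothetical toy dictionary: word -> frequency (not jieba's real data).
FREQ = {"无": 5, "抵": 1, "押": 1, "抵押": 20, "无抵押": 8,
        "贷": 2, "款": 2, "贷款": 50}
TOTAL = sum(FREQ.values())


def build_dag(sentence):
    """For each start index i, list every end index j where sentence[i:j+1] is a known word."""
    n = len(sentence)
    dag = {}
    for i in range(n):
        ends = [j for j in range(i, n) if sentence[i:j + 1] in FREQ]
        dag[i] = ends or [i]  # an unknown character forms a one-character word
    return dag


def segment(sentence):
    """Pick the maximum-probability path through the DAG by dynamic programming."""
    n = len(sentence)
    dag = build_dag(sentence)
    route = {n: (0.0, 0)}  # index -> (best log-probability from here, end of first word)
    for i in range(n - 1, -1, -1):
        route[i] = max(
            (math.log(FREQ.get(sentence[i:j + 1], 1)) - math.log(TOTAL)
             + route[j + 1][0], j)
            for j in dag[i]
        )
    words, i = [], 0
    while i < n:
        j = route[i][1]
        words.append(sentence[i:j + 1])
        i = j + 1
    return words


print(segment("无抵押贷款"))  # ['无抵押', '贷款']
```

The path is built from the end of the sentence backwards so that each position reuses the best score already computed for the remainder.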

the specific method in S105 is as follows:

S1051, having network credit service experts compile network credit information keywords from a large number of network credit websites and materials; the form of the content is shown in Table 1:

TABLE 1 network loan information keyword list

Serial number	Keyword
1	Loan
2	Money to be paid
3	Paying
4	Interest information
5	Lower message
6	Rest-free
7	Mortgage
8	Credit investigation
9	High amount of
10	Can be credited with a vehicle
...	...

S1052, using the jieba word segmentation tool to convert the network loan information keyword library of S1051 into pinyin format, denoted as the set K, whose content takes the form shown in Table 2:

TABLE 2 network loan information keyword pinyin list

S1053, as shown in FIG. 5, for each keyword in K, fuzzy matching is performed in M using the FuzzyWuzzy tool. Fuzzy matching is a technique for finding strings that match a pattern approximately rather than exactly. It works by using the edit distance (Levenshtein distance) to measure the difference between sequences. The Levenshtein distance is the minimum number of editing operations required to convert one string into another, where the permitted operations are replacing one character with another, inserting a character, and deleting a character. Generally, the smaller the edit distance, the greater the similarity of the two strings. The method uses the partial (non-complete) matching mode of the FuzzyWuzzy tool: if the matching result contains an item whose similarity is greater than 80, the text contains network credit information and the corresponding URL is judged to be a network credit website; otherwise, the corresponding URL is judged not to be a network credit website.
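The edit distance described above can be computed with the classic dynamic-programming recurrence. The sketch below is a plain textbook implementation for illustration; it is not taken from the patent or from FuzzyWuzzy's internals.

```python
def levenshtein(a, b):
    """Minimum number of substitutions, insertions and deletions turning a into b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep) ca
        prev = curr
    return prev[-1]


print(levenshtein("dai kuan", "xia kuan"))  # 3
print(levenshtein("贷款", "下款"))            # 1
```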

The similarity of two compared items is calculated as follows. Input: string A = "dai kuan"; string B = "xia kuan". Process: the function partial_ratio(A, B). Output: the similarity of string A and string B.
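The partial matching and the threshold decision of S1053 can be approximated with the standard library alone. The sketch below is a simplified stand-in for FuzzyWuzzy's `partial_ratio`, not its actual algorithm: it scores the shorter string against every same-length window of the longer one and keeps the best score. The names `partial_similarity` and `is_network_credit` are illustrative.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity of two strings on a 0-100 scale."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def partial_similarity(a, b):
    """Best score of the shorter string against each same-length window of the longer one."""
    short, long_ = (a, b) if len(a) <= len(b) else (b, a)
    if not short:
        return 0
    n = len(short)
    return max(ratio(short, long_[i:i + n]) for i in range(len(long_) - n + 1))


def is_network_credit(K, M, threshold=80):
    """True if any keyword in K matches any item in M with similarity above the threshold."""
    return any(partial_similarity(k, m) > threshold for k in K for m in M)


print(partial_similarity("dai kuan", "xia kuan"))            # 75
print(partial_similarity("dai kuan", "wo yao dai kuan ma"))  # 100
```

With the threshold of 80 used in this embodiment, a score of 75 does not count as a match, while a keyword that appears inside a longer phrase does.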

In the method for identifying a network credit website based on OCR and text processing technology provided by the invention, a network credit information keyword library is built by network credit service experts from the text content of website pictures, and both the keyword library and the text to be matched are converted into pinyin using OCR and text processing technology, which improves the recognition rate, reduces the error, and yields a systematic method for identifying network credit websites.

Example 2

In accordance with embodiment 1, the present embodiment provides a system for identifying a loan website based on OCR and text processing techniques, comprising

the image crawling module is used for crawling the image of the website to be detected by using a crawler technology and outputting a URL image set; the specific execution process comprises the following steps:

The character extraction module is used for extracting characters from the crawled picture set { P } by using an OCR technology; the specific execution process comprises the following steps:

And the word segmentation module is used for filtering special characters of the text content output by the S1036 by using a jieba word segmentation tool, segmenting words and then performing pinyin translation to obtain pinyin word segmentation content which is recorded as a set M.

And the matching module is used for constructing a pinyin keyword library K, matching the pinyin word segmentation content against the network credit keywords in the pinyin keyword library, and outputting whether the corresponding URL is a network credit website. The specific execution process comprises the following steps:

First, network credit service experts compile network credit information keywords from a large number of network credit websites and materials. Next, the jieba word segmentation tool is used to convert the network credit information keyword library into pinyin format, denoted K, and the word segmentation content is converted into pinyin format in the same way, denoted M. Finally, for each keyword in K, fuzzy matching is performed in M using the FuzzyWuzzy tool; if the matching result contains an item whose similarity is greater than a preset value, the text contains network credit information and the URL corresponding to the text is a network credit website; otherwise, the URL corresponding to the text is not a network credit website.
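One way to picture how the modules of this embodiment fit together is to treat each module as a function passed into a single driver. The sketch below is an assumption about the wiring, not the patent's implementation: `identify` and its parameters are hypothetical names, the stubs in the usage example replace real crawling, OCR, segmentation and pinyin conversion, and `difflib` stands in for the FuzzyWuzzy score.

```python
from difflib import SequenceMatcher


def identify(url, crawl, ocr, segment, to_pinyin, keywords, threshold=80):
    """Run the crawl -> OCR -> segmentation -> pinyin -> matching chain for one URL."""
    pictures = crawl(url)                            # picture crawling module
    text = " ".join(ocr(p) for p in pictures)        # character extraction module
    M = [to_pinyin(w) for w in segment(text)]        # word segmentation module
    K = [to_pinyin(k) for k in keywords]             # pinyin keyword library

    def score(a, b):
        return round(100 * SequenceMatcher(None, a, b).ratio())

    # matching module: any keyword similar enough to any segmented word
    return any(score(k, m) > threshold for k in K for m in M)


# Usage with stub modules (all hypothetical).
TABLE = {"贷款": "dai kuan", "天气": "tian qi"}
result = identify(
    "http://example.com/",
    crawl=lambda url: ["banner.png"],
    ocr=lambda picture: "贷款",
    segment=lambda text: text.split(),
    to_pinyin=lambda word: TABLE.get(word, word),
    keywords=["贷款"],
)
print(result)  # True
```

Passing the modules as arguments keeps each one replaceable, for example to swap the OCR engine without touching the matching logic.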

The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims

1. A method for identifying a network credit website based on OCR and text processing technology is characterized in that: the method comprises the following steps:

S101, acquiring a URL of a website to be detected;

2. A method for identifying a lending website based on OCR and text processing techniques according to claim 1, wherein: the step S102 specifically includes:

3. A method for identifying a lending website based on OCR and text processing techniques according to claim 1, wherein: the step S103 is specifically:

4. A method for identifying a lending website based on OCR and text processing techniques according to claim 3 wherein: the step S104 specifically includes:

S1041, using a jieba word segmentation tool to filter special characters of the text content output by S1036, then segmenting words, and finally translating the segmented words into pinyin to obtain pinyin word segmentation content M.

5. A method for identifying a lending website based on OCR and text processing techniques according to claim 4 wherein: the step S105 specifically includes:

S1052, translating the network loan information keyword library in the S1051 into a pinyin format and recording the keyword library as K by using a jieba word segmentation tool;

6. A system for identifying a network credit website based on OCR and text processing technologies is characterized in that: comprises that

the word segmentation module is used for filtering special characters of the text content output by the S1036 by using a jieba word segmentation tool, segmenting words and then performing pinyin translation to obtain pinyin word segmentation content which is recorded as a set M;

and the matching module is used for constructing a pinyin keyword library K, matching the pinyin word segmentation content against the network credit keywords in the pinyin keyword library, and outputting whether the corresponding URL is a network credit website.

7. A system for identifying a network crediting website based on OCR and text processing techniques as recited in claim 6, wherein: the specific execution process in the picture crawling module is as follows:

firstly, building a crawler system for website pictures by using a crawler technology; and inputting the URL to be detected into the crawler system, and outputting a picture set { P } corresponding to the URL.

8. A method for identifying a lending website based on OCR and text processing techniques according to claim 7 wherein: the specific execution process of the character extraction module is as follows:

9. A system for identifying a network crediting website based on OCR and text processing techniques as recited in claim 8, wherein: the word segmentation module specifically executes the following steps:

and (3) filtering the special characters of the text content by using a jieba word segmentation tool, segmenting words, translating the segmented words into pinyin to obtain pinyin word segmentation content, and recording the pinyin word segmentation content as M.

10. A system for identifying a network crediting website based on OCR and text processing techniques as recited in claim 9, wherein: the matching module specifically executes the following steps: