CN111078979A - Method and system for identifying network credit website based on OCR and text processing technology - Google Patents
- Publication number
- CN111078979A CN201911209962.7A
- Authority
- CN
- China
- Prior art keywords
- website
- pinyin
- url
- text
- network credit
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/413—Classification of content, e.g. text, photographs or tables
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/10—Character recognition
Abstract
The invention provides a method for identifying a network credit (online lending) website based on OCR and text processing technologies, comprising the following steps: S101, acquiring the URL of a website to be detected; S102, crawling the pictures of the website to be detected using crawler technology, and outputting the picture set for the URL; S103, extracting text from the crawled picture set using OCR technology; S104, filtering and segmenting the extracted text using the jieba word segmentation tool, and translating the segmented words into pinyin to obtain pinyin word segmentation content; and S105, matching the pinyin word segmentation content against network credit keywords, and outputting whether the corresponding URL is a network credit website. The method is efficient and accurate, and fills a technical gap in the field.
Description
Technical Field
The invention relates to the technical field of network credit website identification, in particular to a method and a system for identifying a network credit website based on OCR and text processing technologies.
Background
With the rapid development of the internet finance industry, websites have become easier to set up and the barrier to entry has fallen, which has given rise to many harmful and illegal websites, such as illegal online loan websites, phishing websites, and gambling websites. In recent years, incidents such as P2P companies absconding with funds, phishing, and telecom fraud have occurred frequently, causing serious property loss, sometimes endangering personal safety, and producing adverse social effects. Identifying network credit websites accurately and promptly allows users to be warned in time to operate with caution, helps prevent loss of their property, and improves an enterprise's social responsibility and corporate image.
As the threshold for online lending has fallen, many organizations and enterprises whose main business is online lending have emerged. These enterprises generally operate their own online loan application platforms and conduct their loan business by exploiting the timeliness and interactivity of the internet. The access link of a network credit website does not differ noticeably from that of an ordinary website. One way to identify a network credit website is to visit the link manually and judge from the displayed content whether the website is a network credit website. This approach consumes a great deal of manpower and time and is inefficient. The website information identification method, device, and electronic equipment disclosed in application No. 201910565890.3 work mainly as follows: the content of a target website is obtained according to its address, the content comprising text content, picture files, and screenshots of the displayed page; exact matching and/or natural language analysis is performed on the text content against a preset library of sensitive and prohibited terms to determine a text recognition result for the target website; and deep-learning-based image classification is performed on the picture files and the screenshots, using preset sample pictures with different type labels, to determine a picture recognition result for the target website. In short, after obtaining the content of the target website, that technique derives a text recognition result from the text content and a picture recognition result from the picture files and screenshots.
That technique can effectively judge whether a website has harmful content and reduces the misjudgment rate. However, it has several problems: text and pictures must be processed separately; text inside pictures cannot be recognized and judged; and, in particular, the pictures are classified by deep learning, so the computation is heavy and the efficiency is low. In addition, the method matches text only at the character level, where matching errors are large. For example, as Chinese characters the words romanized as "dai kuan" (loan) and "xia kuan" have a similarity of only 50 percent, whereas their pinyin forms "dai kuan" and "xia kuan" have a similarity of 75 percent, so character-level matching produces results with a large error.
Disclosure of Invention
The technical problem to be solved by the invention is the low efficiency and large error of existing techniques for identifying network credit websites.
The invention solves the above technical problem through the following technical means:
A method for identifying a network credit website based on OCR and text processing technologies comprises the following steps:
S101, acquiring the URL of a website to be detected;
S102, crawling the pictures of the website to be detected using crawler technology, and outputting the picture set for the URL;
S103, extracting text from the crawled picture set using OCR technology;
S104, using the jieba word segmentation tool to filter the extracted text, segment it into words, and translate the segmented words into pinyin to obtain pinyin word segmentation content M;
S105, constructing a pinyin keyword library K, matching the pinyin word segmentation content M against network credit keywords using the pinyin keyword library, and outputting whether the corresponding URL is a network credit website.
The advantages of the method are as follows: a network credit information keyword library is built by network credit business experts from the text content of website pictures; the keyword library and the text to be matched are both translated into pinyin using OCR and text processing technologies, which raises the recognition rate and reduces errors; and together these steps form a systematic method for identifying network credit websites.
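Steps S101 to S105 can be read as a pipeline in which each stage feeds the next. The following is a minimal Python sketch of that structure only. The function name `identify_site`, the type aliases, and the idea of passing each stage in as a callable are illustrative assumptions, not part of the disclosed method.

```python
from typing import Callable, Iterable, List

# Hypothetical stage signatures; the patent does not prescribe these names.
Crawler = Callable[[str], List[bytes]]                 # S102: URL -> picture set {P}
Ocr = Callable[[bytes], str]                           # S103: picture -> extracted text
Segmenter = Callable[[str], List[str]]                 # S104: text -> pinyin tokens (M)
Matcher = Callable[[Iterable[str], List[str]], bool]   # S105: (K, M) -> decision


def identify_site(url: str, crawl: Crawler, ocr: Ocr,
                  segment: Segmenter, match: Matcher,
                  pinyin_keywords: Iterable[str]) -> bool:
    """Run S101-S105 in order and return True for a network credit website."""
    pictures = crawl(url)                        # S102
    text = " ".join(ocr(p) for p in pictures)    # S103
    tokens = segment(text)                       # S104
    return match(pinyin_keywords, tokens)        # S105
```

Keeping each stage behind a callable means a stub can replace the crawler or OCR engine when the matching logic is tested on its own.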
Preferably, step S102 specifically comprises:
S1021, building a crawler system for website pictures using crawler technology, denoted R;
S1022, inputting the URL to be detected into the crawler system R, and outputting the picture set {P} corresponding to the URL.
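One part of such a crawler is locating the picture references in a fetched page. The sketch below shows that part only, using Python's standard `html.parser`; it performs no network access, and the class name `ImageCollector` and the function `extract_image_urls` are hypothetical names chosen for illustration.

```python
from html.parser import HTMLParser
from typing import List
from urllib.parse import urljoin


class ImageCollector(HTMLParser):
    """Collect absolute URLs of <img> tags found in one HTML page."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.image_urls: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page URL.
            self.image_urls.append(urljoin(self.base_url, src))


def extract_image_urls(page_url: str, html: str) -> List[str]:
    """Return the picture URLs referenced by the page, in document order."""
    collector = ImageCollector(page_url)
    collector.feed(html)
    return collector.image_urls
```

Downloading each returned URL would then yield the picture set {P}. Pictures referenced from CSS or loaded by scripts are not covered by this sketch.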
Preferably, step S103 specifically comprises:
S1031, preprocessing each picture in the picture set {P} by image binarization, noise removal, and skew correction;
S1032, segmenting the picture into paragraphs and lines using a layout analysis algorithm;
S1033, using a character segmentation algorithm to handle cases where touching characters and broken strokes make characters hard to separate by simple cutting;
S1034, extracting multi-dimensional features from each character image using a character feature extraction algorithm;
S1035, using a character recognition algorithm to match the feature vector extracted from the current character against a feature template library, first by coarse template classification and then by fine template matching, to recognize the character;
S1036, correcting the recognized characters according to semantics, and organizing and outputting them in text format.
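To make the binarization named in S1031 concrete, the sketch below applies Otsu's threshold to a grayscale picture held as nested lists. The source does not say which binarization method is used, so the choice of Otsu's method is an assumption, as are the function names.

```python
from typing import List

Gray = List[List[int]]  # rows of 0-255 grayscale values


def otsu_threshold(image: Gray) -> int:
    """Return the threshold that maximizes between-class variance."""
    hist = [0] * 256
    for row in image:
        for value in row:
            hist[value] += 1
    total = sum(hist)
    sum_all = sum(i * count for i, count in enumerate(hist))

    best_threshold, best_variance = 0, -1.0
    weight_bg, sum_bg = 0, 0
    for t in range(256):
        weight_bg += hist[t]            # pixels with value <= t
        if weight_bg == 0:
            continue
        weight_fg = total - weight_bg   # pixels with value > t
        if weight_fg == 0:
            break
        sum_bg += t * hist[t]
        mean_bg = sum_bg / weight_bg
        mean_fg = (sum_all - sum_bg) / weight_fg
        variance = weight_bg * weight_fg * (mean_bg - mean_fg) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, t
    return best_threshold


def binarize(image: Gray) -> List[List[int]]:
    """Map each pixel to 0 (value <= threshold) or 1 (value > threshold)."""
    t = otsu_threshold(image)
    return [[1 if value > t else 0 for value in row] for row in image]
```

Noise removal and skew correction would follow as separate passes over the binarized picture.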
Preferably, step S104 specifically comprises:
S1041, using the jieba word segmentation tool to filter special characters from the text content output by S1036 and segment it into words, obtaining word segmentation content.
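The special-character filtering in S1041 can be sketched with a regular expression. The source does not define which characters count as special, so the rule below (keep CJK ideographs, ASCII letters, and ASCII digits; turn everything else into a space) is an assumption, and the function name is hypothetical.

```python
import re

# Assumption: "special characters" means anything other than CJK ideographs,
# ASCII letters, and ASCII digits. Runs of such characters become one space.
_SPECIAL = re.compile(r"[^0-9A-Za-z\u4e00-\u9fff]+")


def filter_special_characters(text: str) -> str:
    """Replace punctuation, symbols, and whitespace runs with single spaces."""
    return _SPECIAL.sub(" ", text).strip()
```

The cleaned text would then be passed to the word segmentation step.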
Preferably, step S105 specifically comprises:
S1051, having network credit business experts compile network credit information keywords from a large number of network credit websites and materials;
S1052, translating the network credit information keyword library of S1051 into pinyin format using the jieba word segmentation tool, denoted K, and translating the word segmentation content of S1041 into pinyin format by the same method, denoted M;
S1053, for each keyword in K, performing fuzzy matching in M using the FuzzyWuzzy tool; if the matching result contains an item whose similarity is greater than a preset value, the text contains network credit information and the URL corresponding to the text is a network credit website; otherwise, the URL corresponding to the text is a non-network credit website.
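The decision rule of S1053 can be sketched as a loop over K and M. To stay self-contained, the sketch uses the standard library's `difflib.SequenceMatcher` as a stand-in for the FuzzyWuzzy score; the function names and the explicit `threshold` parameter are assumptions for illustration.

```python
from difflib import SequenceMatcher
from typing import Iterable, List


def similarity(a: str, b: str) -> float:
    """Similarity on a 0-100 scale (stand-in for a FuzzyWuzzy score)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def contains_credit_keyword(keywords_k: Iterable[str],
                            content_m: List[str],
                            threshold: float) -> bool:
    """True if any keyword in K matches any item in M above the threshold."""
    for keyword in keywords_k:
        for item in content_m:
            if similarity(keyword, item) > threshold:
                return True
    return False
```

A return value of `True` corresponds to judging the URL a network credit website; `False` corresponds to a non-network credit website.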
Correspondingly, the invention also provides a system for identifying a network credit website based on OCR and text processing technologies, comprising:
a URL acquisition module, used for acquiring the URL of the website to be detected;
a picture crawling module, used for crawling the pictures of the website to be detected using crawler technology and outputting the picture set for the URL;
a character extraction module, used for extracting text from the crawled picture set using OCR technology;
a word segmentation module, used for filtering and segmenting the extracted text using the jieba word segmentation tool;
and a matching module, used for matching the word segmentation content against network credit keywords and outputting whether the corresponding URL is a network credit website.
Preferably, the picture crawling module executes as follows:
first, a crawler system for website pictures is built using crawler technology, denoted R; then the URL to be detected is input into the crawler system R, and the picture set {P} corresponding to the URL is output.
Preferably, the character extraction module executes as follows:
first, each picture in the picture set {P} is preprocessed by image binarization, noise removal, and skew correction; then the picture is segmented into paragraphs and lines using a layout analysis algorithm, a character segmentation algorithm handles cases where touching characters and broken strokes make characters hard to separate by simple cutting, multi-dimensional features are extracted from each character image using a character feature extraction algorithm, and a character recognition algorithm matches the feature vector extracted from the current character against a feature template library, first by coarse template classification and then by fine template matching, to recognize the character; finally, the recognized characters are corrected according to semantics and are organized and output in text format.
Preferably, the word segmentation module executes as follows:
the jieba word segmentation tool is used to filter special characters from the text content and segment it into words, obtaining word segmentation content.
Preferably, the matching module executes as follows:
first, network credit business experts compile network credit information keywords from a large number of network credit websites and materials; then the network credit information keyword library is translated into pinyin format using the jieba word segmentation tool, denoted K, and the word segmentation content of S1041 is translated into pinyin format by the same method, denoted M; finally, for each keyword in K, fuzzy matching is performed in M using the FuzzyWuzzy tool; if the matching result contains an item whose similarity is greater than a preset value, the text contains network credit information and the URL corresponding to the text is a network credit website; otherwise, the URL corresponding to the text is a non-network credit website.
The invention has the following advantages:
a network credit information keyword library is built by network credit business experts from the text content of website pictures; the keyword library and the text to be matched are both translated into pinyin using OCR and text processing technologies, which raises the recognition rate and reduces errors; and together these steps form a systematic method for identifying network credit websites.
Drawings
FIG. 1 is a flow block diagram of the method in embodiment 1 of the present invention;
FIG. 2 is a flowchart of the picture crawler system in step S102 of embodiment 1 of the present invention;
FIG. 3 is a flowchart of the OCR character extraction system in step S103 of embodiment 1 of the present invention;
FIG. 4 is a sample of the text information of a network credit picture in step S104 of embodiment 1 of the present invention;
FIG. 5 is an example of the fuzzy matching process and result in step S105 of embodiment 1 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
As shown in FIG. 1, this embodiment provides a method for identifying a network credit website based on OCR and text processing technologies, so as to determine whether a website is a network credit website. The method comprises the following steps:
S101, acquiring the URL of a website to be detected;
S102, crawling the pictures of the website to be detected using crawler technology, and outputting the picture set for the URL;
S103, extracting text from the crawled picture set using OCR technology;
S104, using the jieba word segmentation tool to filter the extracted text, segment it into words, and translate the segmented words into pinyin to obtain pinyin word segmentation content;
S105, constructing a pinyin keyword library, matching the pinyin word segmentation content against network credit keywords using the pinyin keyword library, and outputting whether the corresponding URL is a network credit website. Each step is described in detail as follows:
The method in S101 is as follows:
acquiring the URL of the website to be detected.
The method in S102 is as follows, as shown in FIG. 2:
S1021, building a crawler system for website pictures using crawler technology, denoted R;
Crawler technology refers to web crawler technology: a program or script that automatically captures web information according to certain rules. Other, less commonly used names are ants, automatic indexers, simulation programs, or worms.
S1022, inputting the URL to be detected into the crawler system R, and outputting the picture set {P} corresponding to the URL.
The method in S103 is as follows, as shown in FIG. 3:
S1031, preprocessing each picture in the picture set {P}, the preprocessing mainly comprising image binarization, noise removal, and skew correction;
S1032, segmenting the picture into paragraphs and lines using a layout analysis algorithm;
S1033, using a character segmentation algorithm to handle cases where touching characters and broken strokes make characters hard to separate by simple cutting;
S1034, extracting multi-dimensional features from each character image using a character feature extraction algorithm, for use in subsequent feature matching;
S1035, using a character recognition algorithm to match the feature vector extracted from the current character against a feature template library, first by coarse template classification and then by fine template matching, to recognize the character;
S1036, correcting the recognized characters according to semantics, and organizing and outputting them in text format.
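The two-stage recognition in S1035 (coarse template classification followed by fine template matching) can be illustrated with a toy example. The source does not specify the features or distance measures, so the sketch assumes a flattened binary glyph as the feature vector, ink-pixel count as the coarse criterion, and Hamming distance as the fine criterion. All names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Vector = Tuple[int, ...]  # flattened binary glyph, 1 = ink


def coarse_candidates(vector: Vector, templates: Dict[str, Vector],
                      tolerance: int) -> List[str]:
    """Coarse step: keep templates whose ink count is close to the input's."""
    ink = sum(vector)
    return [label for label, tpl in templates.items()
            if abs(sum(tpl) - ink) <= tolerance]


def fine_match(vector: Vector, templates: Dict[str, Vector],
               candidates: List[str]) -> Optional[str]:
    """Fine step: pick the candidate with the smallest Hamming distance."""
    best_label, best_distance = None, None
    for label in candidates:
        distance = sum(a != b for a, b in zip(vector, templates[label]))
        if best_distance is None or distance < best_distance:
            best_label, best_distance = label, distance
    return best_label


def recognize(vector: Vector, templates: Dict[str, Vector],
              tolerance: int = 2) -> Optional[str]:
    """Run the coarse filter, then the fine match, on one character."""
    candidates = coarse_candidates(vector, templates, tolerance)
    return fine_match(vector, templates, candidates)
```

The coarse step exists to shrink the template set cheaply, so that the more expensive fine comparison runs on only a few candidates.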
The specific method in S104 is as follows:
S1041, as shown in FIG. 4, using the jieba word segmentation tool to filter special characters from the text content output by S1036, segment it into words, and then translate the words into pinyin to obtain pinyin word segmentation content, denoted as set M.
Converting Chinese into pinyin can improve the matching rate. For example, as Chinese characters the words romanized as "dai kuan" (loan) and "xia kuan" have a similarity of only 50 percent, whereas their pinyin forms "dai kuan" and "xia kuan" have a similarity of 75 percent.
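The two percentages can be reproduced with a short Python sketch. The Chinese words 贷款 and 下款 are assumed here to be the originals behind "dai kuan" and "xia kuan"; the three-entry pinyin mapping is a hypothetical stand-in for a real Chinese-to-pinyin converter, and `difflib.SequenceMatcher` stands in for the similarity measure.

```python
from difflib import SequenceMatcher

# Hypothetical three-entry mapping for illustration only; a real system
# would use a full Chinese-to-pinyin converter.
TOY_PINYIN = {"贷": "dai", "款": "kuan", "下": "xia"}


def to_pinyin(word: str) -> str:
    """Join the pinyin of each character with spaces."""
    return " ".join(TOY_PINYIN[ch] for ch in word)


def ratio(a: str, b: str) -> float:
    """Similarity in percent: 2 * matched characters / total characters."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


word_a, word_b = "贷款", "下款"  # assumed originals of "dai kuan" and "xia kuan"
print(ratio(word_a, word_b))                        # 50.0
print(ratio(to_pinyin(word_a), to_pinyin(word_b)))  # 75.0
```

At the character level only one of the two characters matches, while in pinyin six of the eight letters and spaces line up, which is why the pinyin score is higher.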
The jieba word segmentation tool is an open-source natural language processing tool. Its algorithms include: efficient word-graph scanning based on a Trie structure, which generates a directed acyclic graph (DAG) of all possible word formations of the Chinese characters in a sentence; dynamic programming to search for the maximum-probability path and find the best segmentation based on word frequency; and, for unknown words, an HMM model based on the word-forming capability of Chinese characters, decoded with the Viterbi algorithm.
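The DAG and dynamic-programming steps described above can be sketched in a few lines. This is not jieba's code: the dictionary and its frequencies are made up, a plain `dict` lookup replaces the Trie, and the HMM/Viterbi handling of unknown words is omitted (an unknown character is simply kept as a single-character word).

```python
import math
from typing import Dict, List

# Toy dictionary with made-up word frequencies (assumption, not jieba's data).
FREQ: Dict[str, int] = {"网": 5, "贷": 5, "网贷": 50, "平": 3, "台": 3,
                        "平台": 60, "网贷平台": 10}
TOTAL = sum(FREQ.values())


def build_dag(sentence: str) -> Dict[int, List[int]]:
    """For each start index, list every end index that forms a known word."""
    dag = {}
    for start in range(len(sentence)):
        ends = [end for end in range(start, len(sentence))
                if sentence[start:end + 1] in FREQ]
        dag[start] = ends or [start]  # unknown single character as fallback
    return dag


def segment(sentence: str) -> List[str]:
    """Pick the maximum-probability path through the DAG, right to left."""
    n = len(sentence)
    dag = build_dag(sentence)
    best = {n: (0.0, n)}  # index -> (best log probability, chosen end index)
    for start in range(n - 1, -1, -1):
        best[start] = max(
            (math.log(FREQ.get(sentence[start:end + 1], 1) / TOTAL)
             + best[end + 1][0], end)
            for end in dag[start])
    words, i = [], 0
    while i < n:
        end = best[i][1]
        words.append(sentence[i:end + 1])
        i = end + 1
    return words
```

With these toy frequencies, the two-word path 网贷 + 平台 has a higher total log probability than the single dictionary entry 网贷平台, so `segment("网贷平台")` splits the sentence into those two words.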
The specific method in S105 is as follows:
S1051, having network credit business experts compile network credit information keywords from a large number of network credit websites and materials, the content form of which is shown in Table 1:
TABLE 1 Network credit information keyword list
Serial number | Keyword |
1 | Loan |
2 | Money to be paid |
3 | Paying |
4 | Interest information |
5 | Lower message |
6 | Interest-free |
7 | Mortgage |
8 | Credit investigation |
9 | High amount |
10 | Can be credited with a vehicle |
... | ... |
S1052, using the jieba word segmentation tool to translate the network credit information keyword library of S1051 into pinyin format, denoted as set K, the content form of which is shown in Table 2:
TABLE 2 Network credit information keyword pinyin list
S1053, as shown in FIG. 5, for each keyword in K, fuzzy matching is performed in M using the FuzzyWuzzy tool. Fuzzy matching is a technique for finding strings that approximately, rather than exactly, match a pattern. It works by computing the difference between sequences using the edit distance (Levenshtein distance), which is the minimum number of editing operations required to convert one string into another. The permitted editing operations are replacing one character with another, inserting a character, and deleting a character. Generally, the smaller the edit distance, the greater the similarity of the two strings. In use, the method adopts the partial (non-complete) matching mode of the FuzzyWuzzy tool: if the matching result contains an item whose similarity is greater than 80, the text contains network credit information and the URL corresponding to the text is judged to be a network credit website; otherwise, the URL corresponding to the text is judged to be a non-network credit website.
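The edit distance described in this step can be computed with the standard dynamic-programming recurrence. The sketch below is a plain Python illustration of that recurrence, not the internals of the FuzzyWuzzy tool; the function name is chosen for illustration.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of single-character edits turning source into target."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            insertion = current[j - 1] + 1
            deletion = previous[j] + 1
            current.append(min(substitution, insertion, deletion))
        previous = current
    return previous[-1]


print(levenshtein("dai kuan", "xia kuan"))  # 3
```

The three edits are the substitutions that turn "dai" into "xia"; the shared " kuan" suffix costs nothing.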
The similarity of two compared items is calculated as follows. Input: string A = "dai kuan"; string B = "xia kuan". Process: call the function partial_ratio(A, B). Output: the similarity of string A and string B.
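The idea of a partial (non-complete) match can be sketched as follows: the shorter string is compared against every equally long window of the longer string, and the best score is kept. This is a simplified approximation written with the standard library, not the FuzzyWuzzy implementation, and its scores may differ from that tool's in some cases.

```python
from difflib import SequenceMatcher


def partial_ratio(a: str, b: str) -> int:
    """Best similarity (0-100) of the shorter string against any
    equally long window of the longer string."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0
    window = len(shorter)
    best = 0.0
    for start in range(len(longer) - window + 1):
        candidate = longer[start:start + window]
        best = max(best, SequenceMatcher(None, shorter, candidate).ratio())
    return round(100 * best)


print(partial_ratio("dai kuan", "xia kuan"))           # 75
print(partial_ratio("dai kuan", "wo yao dai kuan la"))  # 100
```

Under the threshold of 80 used in this embodiment, the first pair would not count as a match, while a keyword embedded verbatim in a longer text would.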
In the method for identifying a network credit website based on OCR and text processing technologies provided by the invention, a network credit information keyword library is built by network credit business experts from the text content of website pictures; the keyword library and the text to be matched are both translated into pinyin using OCR and text processing technologies, which raises the recognition rate and reduces errors; and together these steps form a systematic method for identifying network credit websites.
Example 2
Corresponding to embodiment 1, this embodiment provides a system for identifying a network credit website based on OCR and text processing technologies, comprising:
a URL acquisition module, used for acquiring the URL of the website to be detected;
a picture crawling module, used for crawling the pictures of the website to be detected using crawler technology and outputting the picture set for the URL; the module executes as follows:
first, a crawler system for website pictures is built using crawler technology, denoted R; then the URL to be detected is input into the crawler system R, and the picture set {P} corresponding to the URL is output;
a character extraction module, used for extracting text from the crawled picture set {P} using OCR technology; the module executes as follows:
first, each picture in the picture set {P} is preprocessed by image binarization, noise removal, and skew correction; then the picture is segmented into paragraphs and lines using a layout analysis algorithm, a character segmentation algorithm handles cases where touching characters and broken strokes make characters hard to separate by simple cutting, multi-dimensional features are extracted from each character image using a character feature extraction algorithm, and a character recognition algorithm matches the feature vector extracted from the current character against a feature template library, first by coarse template classification and then by fine template matching, to recognize the character; finally, the recognized characters are corrected according to semantics and are organized and output in text format;
a word segmentation module, used for filtering special characters from the text content output by S1036 using the jieba word segmentation tool, segmenting it into words, and then translating the words into pinyin to obtain pinyin word segmentation content, denoted as set M;
and a matching module, used for constructing a pinyin keyword library K, matching the pinyin word segmentation content against network credit keywords using the pinyin keyword library, and outputting whether the corresponding URL is a network credit website. The module executes as follows:
first, network credit business experts compile network credit information keywords from a large number of network credit websites and materials; then the network credit information keyword library is translated into pinyin format using the jieba word segmentation tool, denoted K, and the word segmentation content is translated into pinyin format by the same method, denoted M; finally, for each keyword in K, fuzzy matching is performed in M using the FuzzyWuzzy tool; if the matching result contains an item whose similarity is greater than a preset value, the text contains network credit information and the URL corresponding to the text is a network credit website; otherwise, the URL corresponding to the text is a non-network credit website.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.
Claims (10)
1. A method for identifying a network credit website based on OCR and text processing technologies, characterized in that the method comprises the following steps:
S101, acquiring the URL of a website to be detected;
S102, crawling the pictures of the website to be detected using crawler technology, and outputting the picture set for the URL;
S103, extracting text from the crawled picture set using OCR technology;
S104, using the jieba word segmentation tool to filter the extracted text, segment it into words, and translate the segmented words into pinyin to obtain pinyin word segmentation content M;
S105, constructing a pinyin keyword library K, matching the pinyin word segmentation content M against network credit keywords using the pinyin keyword library, and outputting whether the corresponding URL is a network credit website.
2. The method for identifying a network credit website based on OCR and text processing technologies according to claim 1, characterized in that step S102 specifically comprises:
S1021, building a crawler system for website pictures using crawler technology, denoted R;
S1022, inputting the URL to be detected into the crawler system R, and outputting the picture set {P} corresponding to the URL.
3. The method for identifying a network credit website based on OCR and text processing technology according to claim 1, characterized in that step S103 specifically comprises:
S1031, performing image binarization, noise removal and skew correction preprocessing on each picture in the picture set {P};
S1032, segmenting the picture into paragraphs and lines by using a layout analysis algorithm;
S1033, using a character cutting algorithm to handle characters that cannot be cut simply because of touching characters or broken strokes;
S1034, extracting multi-dimensional features from each character image by using a character feature extraction algorithm;
S1035, using a character recognition algorithm to perform coarse template classification and fine template matching between the feature vector extracted from the current character and a feature template library, thereby recognizing the character;
S1036, correcting the recognized characters according to semantics, and collating and outputting them in text format.
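The first two steps above (binarization in S1031 and line segmentation in S1032) can be sketched as follows. The claim does not name the algorithms used, so Otsu thresholding and a horizontal projection profile are assumed choices, and the function names `otsu_threshold`, `binarize` and `split_lines` are hypothetical. The picture is represented as a list of rows of 0-255 gray values, with dark text on a light background assumed.

```python
def otsu_threshold(pixels):
    """Return the gray level that maximizes between-class variance (Otsu)."""
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(hist))
    best_t, best_var = 0, -1.0
    w_bg, sum_bg = 0, 0
    for t in range(256):
        w_bg += hist[t]
        if w_bg == 0:
            continue  # no pixels at or below this level yet
        w_fg = total - w_bg
        if w_fg == 0:
            break  # every pixel is already in the dark class
        sum_bg += t * hist[t]
        mean_bg = sum_bg / w_bg
        mean_fg = (sum_all - sum_bg) / w_fg
        var_between = w_bg * w_fg * (mean_bg - mean_fg) ** 2
        if var_between > best_var:
            best_var, best_t = var_between, t
    return best_t


def binarize(image):
    """Map dark pixels (assumed to be ink) to 1 and light pixels to 0."""
    threshold = otsu_threshold([p for row in image for p in row])
    return [[1 if p <= threshold else 0 for p in row] for row in image]


def split_lines(binary):
    """Return (first_row, last_row) for each run of rows that contain ink."""
    lines, start = [], None
    for y, row in enumerate(binary):
        has_ink = any(row)
        if has_ink and start is None:
            start = y
        elif not has_ink and start is not None:
            lines.append((start, y - 1))
            start = None
    if start is not None:
        lines.append((start, len(binary) - 1))
    return lines
```

Noise removal, skew correction and the later character cutting and template matching steps are not covered; in practice an existing OCR engine would usually perform all of them.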
4. The method for identifying a network credit website based on OCR and text processing technology according to claim 3, characterized in that step S104 specifically comprises:
S1041, using the jieba word segmentation tool to filter special characters out of the text content output by S1036, then segmenting the text into words, and finally translating the segmented words into pinyin to obtain the pinyin word segmentation content M.
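Step S1041 can be sketched as below. The claim attributes filtering, segmentation and pinyin translation to the jieba tool; jieba itself provides segmentation (`jieba.lcut`), so this sketch assumes a regular expression for the filtering and the separate `pypinyin` package (`lazy_pinyin`) for the pinyin conversion. The function names and the definition of a "special character" are assumptions, and both third-party functions can be replaced through the `segment` and `to_pinyin` parameters.

```python
import re

# Assumption: anything other than CJK ideographs, ASCII letters and digits
# counts as a "special character" and is replaced by a space.
_SPECIAL = re.compile(r"[^\u4e00-\u9fffA-Za-z0-9]+")


def filter_special_characters(text):
    """Replace each run of special characters with a single space."""
    return _SPECIAL.sub(" ", text)


def to_pinyin_tokens(text, segment=None, to_pinyin=None):
    """Filter, segment and convert text into pinyin tokens (the content M)."""
    if segment is None:
        import jieba  # third-party segmentation library named in the claim

        segment = jieba.lcut
    if to_pinyin is None:
        from pypinyin import lazy_pinyin  # third-party; an assumed choice

        def to_pinyin(word):
            # Join the toneless syllables of one word, e.g. two syllables -> one token.
            return "".join(lazy_pinyin(word))

    cleaned = filter_special_characters(text)
    # Segmenters may return whitespace tokens; drop them before conversion.
    return [to_pinyin(word) for word in segment(cleaned) if word.strip()]
```

Passing the segmenter and the converter as parameters keeps the sketch usable without the third-party packages installed, for example in unit tests.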
5. The method for identifying a network credit website based on OCR and text processing technology according to claim 4, characterized in that step S105 specifically comprises:
S1051, having network credit business experts compile network credit information keywords from a large number of network credit websites and materials to form a network credit information keyword library;
S1052, translating the network credit information keyword library of S1051 into pinyin format by using the jieba word segmentation tool, and recording the result as the keyword library K;
S1053, for each keyword in K, performing fuzzy matching in M by using the FuzzyWuzzy tool; if the matching result contains an item whose similarity is greater than a preset value, the text contains network credit information and the URL corresponding to the text is a network credit website; otherwise, the URL corresponding to the text is a non-network credit website.
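The decision rule of S1053 can be sketched as follows. The claim names the FuzzyWuzzy tool; to keep the example free of third-party dependencies, this sketch substitutes `difflib.SequenceMatcher` from the Python standard library, scaled to the same 0-100 range. The function names, the example keywords and the default threshold of 80 are assumptions; the claim only requires "a preset value".

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity of two strings on a 0-100 scale (stand-in for FuzzyWuzzy)."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def is_network_credit_text(pinyin_tokens, keyword_library, threshold=80.0):
    """Return True if any keyword in K matches any token in M above the threshold."""
    for keyword in keyword_library:
        for token in pinyin_tokens:
            if similarity(keyword, token) > threshold:
                return True
    return False


# Hypothetical pinyin keyword library K and OCR-derived token list M.
K = ["wangdai", "jiekuan"]
M = ["wangdal", "pingtai"]  # "wangdal" imitates an OCR misread of "wangdai"
print(is_network_credit_text(M, K))
```

Matching in pinyin rather than in Chinese characters lets homophone substitutions map to the same keyword, and the fuzzy threshold tolerates small OCR errors such as the one in the example.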
6. A system for identifying a network credit website based on OCR and text processing technology, characterized in that the system comprises:
a URL acquisition module, used for acquiring the URL of a website to be detected;
a picture crawling module, used for crawling the pictures of the website to be detected by using a crawler technology and outputting the picture set of the URL;
a character extraction module, used for extracting text from the crawled picture set by using an OCR technology;
a word segmentation module, used for filtering special characters out of the text content output by S1036 by using the jieba word segmentation tool, segmenting the text into words, and then translating the words into pinyin to obtain pinyin word segmentation content, recorded as the set M;
and a matching module, used for constructing a pinyin keyword library K, matching the pinyin word segmentation content against the network credit keywords in the pinyin keyword library, and outputting whether the corresponding URL is a network credit website.
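The module structure of the system can be sketched as a small composition of callables, one per module. This is only an illustration of how the modules hand data to one another; the class name `NetworkCreditDetector`, its field names and the type signatures are assumptions, and the URL acquisition module is represented simply by the `url` argument.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class NetworkCreditDetector:
    """Wires the modules of the system together (hypothetical structure)."""

    crawl_pictures: Callable[[str], List[bytes]]      # picture crawling module
    extract_text: Callable[[bytes], str]              # character extraction module (OCR)
    segment_to_pinyin: Callable[[str], List[str]]     # word segmentation module
    match_keywords: Callable[[List[str]], bool]       # matching module

    def detect(self, url: str) -> bool:
        """Return True if the URL is judged to be a network credit website."""
        pictures = self.crawl_pictures(url)
        text = "\n".join(self.extract_text(picture) for picture in pictures)
        tokens = self.segment_to_pinyin(text)
        return self.match_keywords(tokens)
```

Because each module is injected as a function, any stage can be swapped (for example a different OCR engine) without changing the others.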
7. The system for identifying a network credit website based on OCR and text processing technology according to claim 6, characterized in that the picture crawling module specifically executes the following process:
firstly, building a crawler system for website pictures by using a crawler technology; then inputting the URL to be detected into the crawler system, and outputting the picture set {P} corresponding to the URL.
8. The system for identifying a network credit website based on OCR and text processing technology according to claim 7, characterized in that the character extraction module specifically executes the following process:
firstly, performing image binarization, noise removal and skew correction preprocessing on each picture in the picture set {P}; then segmenting the picture into paragraphs and lines by using a layout analysis algorithm, using a character cutting algorithm to handle characters that cannot be cut simply because of touching characters or broken strokes, extracting multi-dimensional features from each character image by using a character feature extraction algorithm, and using a character recognition algorithm to perform coarse template classification and fine template matching between the feature vector extracted from the current character and a feature template library, thereby recognizing the character; and finally, correcting the recognized characters according to semantics, and collating and outputting them in text format.
9. The system for identifying a network credit website based on OCR and text processing technology according to claim 8, characterized in that the word segmentation module specifically executes the following process:
filtering special characters out of the text content by using the jieba word segmentation tool, segmenting the text into words, and translating the segmented words into pinyin to obtain pinyin word segmentation content, recorded as M.
10. The system for identifying a network credit website based on OCR and text processing technology according to claim 9, characterized in that the matching module specifically executes the following process:
firstly, network credit business experts compile network credit information keywords from a large number of network credit websites and materials to form a network credit information keyword library; then the network credit information keyword library is translated into pinyin format by using the jieba word segmentation tool and recorded as K, and the word segmentation content of S1041 is translated into pinyin format by the same method and recorded as M; finally, for each keyword in K, fuzzy matching is performed in M by using the FuzzyWuzzy tool; if the matching result contains an item whose similarity is greater than a preset value, the text contains network credit information and the URL corresponding to the text is a network credit website; otherwise, the URL corresponding to the text is a non-network credit website.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911209962.7A CN111078979A (en) | 2019-11-29 | 2019-11-29 | Method and system for identifying network credit website based on OCR and text processing technology |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911209962.7A CN111078979A (en) | 2019-11-29 | 2019-11-29 | Method and system for identifying network credit website based on OCR and text processing technology |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111078979A (en) | 2020-04-28 |
Family
ID=70312365
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911209962.7A (Pending) CN111078979A (en) | Method and system for identifying network credit website based on OCR and text processing technology | 2019-11-29 | 2019-11-29 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111078979A (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111291738A (en) * | 2020-05-09 | 2020-06-16 | 支付宝(杭州)信息技术有限公司 | Element extraction method and device in front-end page image and electronic equipment |
CN111782772A (en) * | 2020-07-24 | 2020-10-16 | 平安银行股份有限公司 | Text automatic generation method, device, equipment and medium based on OCR technology |
CN112667943A (en) * | 2020-11-10 | 2021-04-16 | 中科金审(北京)科技有限公司 | Illegal website identification and locking method |
CN113127715A (en) * | 2021-03-04 | 2021-07-16 | 微梦创科网络科技(中国)有限公司 | Method and system for identifying gambling-related information |
CN114663903A (en) * | 2022-05-25 | 2022-06-24 | 深圳大道云科技有限公司 | Text data classification method, device, equipment and storage medium |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20100316300A1 (en) * | 2009-06-13 | 2010-12-16 | Microsoft Corporation | Detection of objectionable videos |
CN103179095A (en) * | 2011-12-22 | 2013-06-26 | 阿里巴巴集团控股有限公司 | Method and client device for detecting phishing websites |
CN106776946A (en) * | 2016-12-02 | 2017-05-31 | 重庆大学 | A kind of detection method of fraudulent website |
CN108256104A (en) * | 2018-02-05 | 2018-07-06 | 恒安嘉新(北京)科技股份公司 | Internet site compressive classification method based on multidimensional characteristic |
CN108874777A (en) * | 2018-06-11 | 2018-11-23 | 北京奇艺世纪科技有限公司 | A kind of method and device of text anti-spam |
CN109255113A (en) * | 2018-09-04 | 2019-01-22 | 郑州信大壹密科技有限公司 | Intelligent critique system |
CN110209795A (en) * | 2018-06-11 | 2019-09-06 | 腾讯科技(深圳)有限公司 | Comment on recognition methods, device, computer readable storage medium and computer equipment |
CN110210028A (en) * | 2019-05-30 | 2019-09-06 | 杭州远传新业科技有限公司 | For domain feature words extracting method, device, equipment and the medium of speech translation text |
CN110275958A (en) * | 2019-06-26 | 2019-09-24 | 北京市博汇科技股份有限公司 | Site information recognition methods, device and electronic equipment |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN111291738A (en) * | 2020-05-09 | 2020-06-16 | 支付宝(杭州)信息技术有限公司 | Element extraction method and device in front-end page image and electronic equipment |
CN111782772A (en) * | 2020-07-24 | 2020-10-16 | 平安银行股份有限公司 | Text automatic generation method, device, equipment and medium based on OCR technology |
CN112667943A (en) * | 2020-11-10 | 2021-04-16 | 中科金审(北京)科技有限公司 | Illegal website identification and locking method |
CN113127715A (en) * | 2021-03-04 | 2021-07-16 | 微梦创科网络科技(中国)有限公司 | Method and system for identifying gambling-related information |
CN114663903A (en) * | 2022-05-25 | 2022-06-24 | 深圳大道云科技有限公司 | Text data classification method, device, equipment and storage medium |
CN114663903B (en) * | 2022-05-25 | 2022-08-19 | 深圳大道云科技有限公司 | Text data classification method, device, equipment and storage medium |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111078979A (en) | Method and system for identifying network credit website based on OCR and text processing technology | |
CN109697162B (en) | Software defect automatic detection method based on open source code library | |
CN112347244B (en) | Pornography-related and gambling-related website detection method based on mixed feature analysis | |
CN109831460B (en) | Web attack detection method based on collaborative training | |
CN112541476B (en) | Malicious webpage identification method based on semantic feature extraction | |
CN111078978A (en) | Web credit website entity identification method and system based on website text content | |
CN111462752B (en) | Customer intention recognition method based on attention mechanism, feature embedding and Bi-LSTM | |
CN103605691A (en) | Device and method used for processing issued contents in social network | |
WO2020071558A1 (en) | Business form layout analysis device, and analysis program and analysis method therefor | |
CN113986864A (en) | Log data processing method and device, electronic equipment and storage medium | |
CN103605690A (en) | Device and method for recognizing advertising messages in instant messaging | |
CN103246655A (en) | Text categorizing method, device and system | |
CN111460803B (en) | Equipment identification method based on Web management page of industrial Internet of things equipment | |
CN106815253B (en) | Mining method based on mixed data type data | |
CN110413998B (en) | Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof | |
CN117520561A (en) | Entity relation extraction method and system for knowledge graph construction in helicopter assembly field | |
CN111597423A (en) | Performance evaluation method and device of interpretable method of text classification model | |
CN112380412A (en) | Optimization method for screening matching information based on big data | |
CN110888977B (en) | Text classification method, apparatus, computer device and storage medium | |
CN107291952B (en) | Method and device for extracting meaningful strings | |
CN112699949B (en) | Potential user identification method and device based on social platform data | |
CN115294593A (en) | Image information extraction method and device, computer equipment and storage medium | |
CN111695117B (en) | Webshell script detection method and device | |
Kumar et al. | Line based robust script identification for Indian languages | |
CN103605692A (en) | Device and method used for shielding advertisement contents in ask-and-answer community |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | WD01 | Invention patent application deemed withdrawn after publication | Application publication date: 20200428 |