CN111078979A - Method and system for identifying network credit website based on OCR and text processing technology - Google Patents

Method and system for identifying network credit website based on OCR and text processing technology

Info

Publication number
CN111078979A
Authority
CN
China
Prior art keywords
website
pinyin
url
text
network credit
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911209962.7A
Other languages
Chinese (zh)
Inventor
陶景龙
梁淑云
刘胜
马影
王启凡
魏国富
徐�明
殷钱安
余贤喆
周晓勇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information and Data Security Solutions Co Ltd
Original Assignee
Information and Data Security Solutions Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information and Data Security Solutions Co Ltd
Priority to CN201911209962.7A
Publication of CN111078979A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/40 Document-oriented image-based pattern recognition
    • G06V30/41 Analysis of document content
    • G06V30/413 Classification of content, e.g. text, photographs or tables
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06V IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
    • G06V30/00 Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
    • G06V30/10 Character recognition

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Tourism & Hospitality (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Multimedia (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Economics (AREA)
  • General Health & Medical Sciences (AREA)
  • Human Resources & Organizations (AREA)
  • Marketing (AREA)
  • Primary Health Care (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a method for identifying a network credit website based on OCR and text processing technologies, which comprises the following steps: S101, acquiring the URL of a website to be detected; S102, crawling the pictures of the website to be detected by using a crawler technology, and outputting the picture set corresponding to the URL; S103, extracting characters from the crawled picture set by using an OCR technology; S104, filtering the content of the extracted characters and segmenting it into words by using the jieba word segmentation technology, and translating the segmented words into pinyin to obtain pinyin word segmentation content; and S105, matching the pinyin word segmentation content against network credit keywords, and outputting whether the corresponding URL is a network credit website. The method is efficient and accurate, and fills a technical gap in the field.

Description

Method and system for identifying network credit website based on OCR and text processing technology
Technical Field
The invention relates to the technical field of network credit website identification, in particular to a method and a system for identifying a network credit website based on OCR and text processing technologies.
Background
With the rapid development of the internet finance industry, websites have become easier to set up and the barrier to entry has fallen, which has led to many harmful and illegal websites, such as illegal online loan websites, phishing websites and gambling websites. In recent years, incidents such as P2P companies absconding, phishing and telecom fraud have occurred frequently, causing serious property losses, in some cases endangering personal safety, and producing adverse social effects. Identifying network credit websites accurately and promptly makes it possible to warn users in time to operate cautiously, helps prevent the loss of their property, and improves an enterprise's social responsibility and corporate image.
The threshold for online lending keeps falling, which has given rise to many organizations and enterprises whose main business is online lending. These enterprises generally run their own online loan application platforms and develop their loan business by relying on the timeliness and interactivity of the internet. The access link of a network credit website does not differ obviously from that of an ordinary website. One way to distinguish a network credit website is to visit the link manually and judge from the displayed content whether the website is a network credit website. This approach consumes a lot of manpower and time and is inefficient.
Application No. 201910565890.3 discloses a website information identification method, a website information identification device and electronic equipment. Its main technique is to obtain the content of a target website from the website's address, where the content comprises text content, picture files and display-effect screenshots; to perform exact matching and/or natural language analysis on the text content against a preset library of sensitive and violating terms to determine a text recognition result for the target website; and to perform deep-learning-based image classification on the picture files and display-effect screenshots, using preset sample pictures with different type labels, to determine a picture recognition result for the target website. In this way it can effectively judge whether a website has harmful content and reduce the misjudgment rate.
However, this method has the following problems: text and pictures must be processed separately; the characters inside pictures cannot be recognized and judged; and, because the pictures are classified by a learning-based model, the amount of computation is large and the efficiency is low. In addition, the method matches only on characters, and character matching has a large error. For example, as Chinese characters, the similarity between 贷款 (dai kuan, "loan") and 下款 (xia kuan, "loan disbursement") is only 50 percent, whereas the similarity between the pinyin strings "dai kuan" and "xia kuan" is 75 percent, so matching on characters alone gives a result with a large error.
Disclosure of Invention
The invention aims to solve the technical problems of low efficiency and large error in existing techniques for identifying network credit websites.
The invention solves these technical problems through the following technical means:
A method for identifying a network credit website based on OCR and text processing technology comprises the following steps:
S101, acquiring the URL of a website to be detected;
S102, crawling the pictures of the website to be detected by using a crawler technology, and outputting the picture set corresponding to the URL;
S103, extracting characters from the crawled picture set by using an OCR technology;
S104, using the jieba word segmentation technology to filter the content of the extracted characters and segment it into words, and translating the segmented words into pinyin to obtain pinyin word segmentation content M;
S105, constructing a pinyin keyword library K, matching the pinyin word segmentation content M against the network credit keywords in the pinyin keyword library, and outputting whether the corresponding URL is a network credit website.
The advantages of the method are as follows: a network credit information keyword library is built by network credit service experts from the text content of website pictures; OCR and text processing technologies are used to translate both the keyword library and the text to be matched into pinyin, which improves the recognition rate and reduces the error; and together these steps form a systematic method for identifying network credit websites.
Preferably, step S102 specifically includes:
S1021, building a crawler system for website pictures by using a crawler technology, and recording the crawler system as R;
S1022, inputting the URL to be detected into the crawler system, and outputting the picture set {P} corresponding to the URL.
Preferably, step S103 specifically includes:
S1031, preprocessing each picture in the picture set {P} by image binarization, noise removal and skew correction;
S1032, segmenting the picture into paragraphs and lines by using a layout analysis algorithm;
S1033, using a character segmentation algorithm to handle characters that are hard to separate by simple cutting because of touching characters and broken strokes;
S1034, extracting multi-dimensional features from each character image by using a character feature extraction algorithm;
S1035, using a character recognition algorithm to match the feature vector extracted from the current character against a feature template library, first by coarse template classification and then by fine template matching, thereby recognizing the character;
S1036, correcting the recognized characters according to semantics, and arranging and outputting them in text format.
Preferably, step S104 specifically includes:
S1041, using the jieba word segmentation tool to filter special characters out of the text content output by S1036 and to segment it into words, obtaining word segmentation content.
Preferably, step S105 specifically includes:
S1051, network credit service experts compile network credit information keywords from a large number of network credit websites and materials;
S1052, using the jieba word segmentation tool, translating the network credit information keyword library of S1051 into pinyin format, recorded as K, and translating the word segmentation content of S1041 into pinyin format by the same method, recorded as M;
S1053, for each keyword in K, performing fuzzy matching in M by using the FuzzyWuzzy tool; if the matching result contains a match whose similarity is greater than a preset value, the text contains network credit information and the URL corresponding to the text is a network credit website; otherwise, the URL corresponding to the text is not a network credit website.
Correspondingly, the invention also provides a system for identifying a network credit website based on OCR and text processing technology, which comprises:
a URL acquisition module, used for acquiring the URL of the website to be detected;
a picture crawling module, used for crawling the pictures of the website to be detected by using a crawler technology and outputting the picture set corresponding to the URL;
a character extraction module, used for extracting characters from the crawled picture set by using an OCR technology;
a word segmentation module, used for filtering the content of the extracted characters and segmenting it into words by using the jieba word segmentation technology;
and a matching module, used for matching the word segmentation content against the network credit keywords and outputting whether the corresponding URL is a network credit website.
Preferably, the picture crawling module specifically executes the following process:
first, a crawler system for website pictures is built by using a crawler technology and recorded as R; then the URL to be detected is input into the crawler system, and the picture set {P} corresponding to the URL is output.
Preferably, the character extraction module specifically executes the following process:
first, each picture in the picture set {P} is preprocessed by image binarization, noise removal and skew correction; then the picture is segmented into paragraphs and lines by using a layout analysis algorithm, a character segmentation algorithm handles characters that are hard to separate by simple cutting because of touching characters and broken strokes, multi-dimensional features are extracted from each character image by using a character feature extraction algorithm, and a character recognition algorithm matches the feature vector extracted from the current character against a feature template library, first by coarse template classification and then by fine template matching, to recognize the character; finally, the recognized characters are corrected according to semantics and arranged and output in text format.
Preferably, the word segmentation module specifically executes the following process:
the jieba word segmentation tool is used to filter special characters out of the text content and to segment it into words, obtaining word segmentation content.
Preferably, the matching module specifically executes the following process:
first, network credit service experts compile network credit information keywords from a large number of network credit websites and materials; then, using the jieba word segmentation tool, the network credit information keyword library is translated into pinyin format, recorded as K, and the word segmentation content of S1041 is translated into pinyin format by the same method, recorded as M; finally, for each keyword in K, fuzzy matching is performed in M by using the FuzzyWuzzy tool; if the matching result contains a match whose similarity is greater than a preset value, the text contains network credit information and the URL corresponding to the text is a network credit website; otherwise, the URL corresponding to the text is not a network credit website.
The advantages of the invention are as follows:
a network credit information keyword library is built by network credit service experts from the text content of website pictures; OCR and text processing technologies are used to translate both the keyword library and the text to be matched into pinyin, which improves the recognition rate and reduces the error; and together these steps form a systematic method for identifying network credit websites.
Drawings
FIG. 1 is a flow diagram of the method in embodiment 1 of the present invention;
FIG. 2 is a flowchart of the picture crawler system in step S102 of embodiment 1 of the present invention;
FIG. 3 is a flowchart of the OCR character extraction system in step S103 of embodiment 1 of the present invention;
FIG. 4 is a sample of the text information of a network credit picture in step S104 of embodiment 1 of the present invention;
FIG. 5 is an example of the fuzzy matching result in step S105 of embodiment 1 of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the embodiments of the present invention, and it is obvious that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
As shown in FIG. 1, this embodiment provides a method for identifying a network credit website based on OCR and text processing technologies, so as to determine whether a website is a network credit website. The method comprises the following steps:
S101, acquiring the URL of a website to be detected;
S102, crawling the pictures of the website to be detected by using a crawler technology, and outputting the picture set corresponding to the URL;
S103, extracting characters from the crawled picture set by using an OCR technology;
S104, using the jieba word segmentation technology to filter the content of the extracted characters and segment it into words, and translating the segmented words into pinyin to obtain pinyin word segmentation content;
S105, constructing a pinyin keyword library, matching the pinyin word segmentation content against the network credit keywords in the pinyin keyword library, and outputting whether the corresponding URL is a network credit website. Each step is described in detail as follows:
The method in S101 is:
acquiring the URL of the website to be detected;
The method in S102, as shown in FIG. 2, is:
S1021, building a crawler system for website pictures by using a crawler technology, and recording the crawler system as R;
Crawler technology refers to web crawler technology: a program or script that automatically captures web information according to certain rules. Other, less commonly used names are ants, automatic indexers, simulation programs, or worms.
S1022, inputting the URL to be detected into the crawler system R, and outputting the picture set {P} corresponding to the URL;
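A minimal sketch of the picture-collection part of S1021 and S1022, assuming the page HTML has already been downloaded. It uses only the Python standard library; the names ImageLinkParser and extract_image_urls, the sample HTML and the example.com addresses are illustrative assumptions and are not part of the original disclosure. A real crawler system R would also fetch the page, download each picture, and handle pictures referenced from style sheets or scripts.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class ImageLinkParser(HTMLParser):
    """Collects the absolute URL of every <img src=...> found in a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.image_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        src = dict(attrs).get("src")
        if src:
            # Resolve relative paths against the page URL and skip duplicates.
            absolute = urljoin(self.base_url, src)
            if absolute not in self.image_urls:
                self.image_urls.append(absolute)


def extract_image_urls(page_url, html_text):
    """Return the list of picture URLs referenced by the page (hypothetical helper)."""
    parser = ImageLinkParser(page_url)
    parser.feed(html_text)
    return parser.image_urls


# Hypothetical page content standing in for a downloaded website.
SAMPLE_HTML = """
<html><body>
  <img src="/banner/loan.png">
  <img src="https://cdn.example.com/ad.jpg">
  <img src="/banner/loan.png">
  <p>no picture here</p>
</body></html>
"""

picture_urls = extract_image_urls("https://example.com/index.html", SAMPLE_HTML)
```

Here picture_urls holds the distinct absolute picture addresses of the page, which stand in for the picture set {P}.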
The method in S103, as shown in FIG. 3, is:
S1031, preprocessing each picture in the picture set {P}; the preprocessing mainly comprises image binarization, noise removal, skew correction and the like;
S1032, segmenting the picture into paragraphs and lines by using a layout analysis algorithm;
S1033, using a character segmentation algorithm to handle characters that are hard to separate by simple cutting because of touching characters and broken strokes;
S1034, extracting multi-dimensional features from each character image by using a character feature extraction algorithm, for use in subsequent feature matching;
S1035, using a character recognition algorithm to match the feature vector extracted from the current character against a feature template library, first by coarse template classification and then by fine template matching, thereby recognizing the character;
S1036, correcting the recognized characters according to semantics, and arranging and outputting them in text format;
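The binarization part of the preprocessing in S1031 can be illustrated with a small sketch. The disclosure does not say which binarization method is used; Otsu's global threshold is assumed here purely for illustration, and the function names and the tiny sample image are hypothetical. Noise removal, skew correction and the later segmentation and recognition steps are not shown.

```python
def otsu_threshold(pixels):
    """Return the gray level (0-255) that maximises the between-class variance."""
    histogram = [0] * 256
    for value in pixels:
        histogram[value] += 1
    total = len(pixels)
    sum_all = sum(level * count for level, count in enumerate(histogram))

    best_threshold, best_variance = 0, -1.0
    weight_dark, sum_dark = 0, 0
    for level in range(256):
        weight_dark += histogram[level]
        if weight_dark == 0:
            continue  # no pixels at or below this level yet
        weight_light = total - weight_dark
        if weight_light == 0:
            break  # every pixel is already in the dark class
        sum_dark += level * histogram[level]
        mean_dark = sum_dark / weight_dark
        mean_light = (sum_all - sum_dark) / weight_light
        variance = weight_dark * weight_light * (mean_dark - mean_light) ** 2
        if variance > best_variance:
            best_variance, best_threshold = variance, level
    return best_threshold


def binarize(image):
    """Map a grayscale image (rows of 0-255 values) to 1 for dark ink, 0 for background."""
    flat = [value for row in image for value in row]
    threshold = otsu_threshold(flat)
    return [[1 if value <= threshold else 0 for value in row] for row in image]


# Hypothetical 2x3 grayscale patch: dark strokes (10-20) on a light background (200-220).
sample = [[200, 220, 10],
          [20, 210, 15]]
threshold = otsu_threshold([v for row in sample for v in row])
binary = binarize(sample)
```

For this patch the threshold falls between the dark strokes and the light background, so only the three stroke pixels are marked as ink.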
The specific method in S104 is as follows:
S1041, as shown in FIG. 4, using the jieba word segmentation tool to filter special characters out of the text content output by S1036 and to segment it into words, and then translating the words into pinyin to obtain pinyin word segmentation content, recorded as the set M.
Converting Chinese into pinyin can improve the matching rate. For example, as Chinese characters, the similarity between 贷款 (dai kuan, "loan") and 下款 (xia kuan, "loan disbursement") is only 50 percent, whereas the similarity between the pinyin strings "dai kuan" and "xia kuan" is as high as 75 percent.
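The two similarity figures above can be reproduced with a short sketch. It assumes that the Chinese words behind "dai kuan" and "xia kuan" are 贷款 and 下款, and it uses the standard-library difflib.SequenceMatcher ratio (twice the number of matching characters divided by the total length) as the similarity measure; the helper name similarity is hypothetical, and the disclosure itself relies on a fuzzy matching tool rather than this exact function.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity in percent: 2 * matching characters / total characters."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


# Two characters, one shared: 2 * 1 / 4.
hanzi_score = similarity("贷款", "下款")

# Eight characters each, six matched (" kuan" plus one letter): 2 * 6 / 16.
pinyin_score = similarity("dai kuan", "xia kuan")
```

In this sketch the character-level comparison gives 50 and the pinyin comparison gives 75, because the longer pinyin strings expose more shared letters to the matcher.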
The jieba word segmentation tool is an open-source natural language processing tool. Its algorithms include: efficient word-graph scanning based on a Trie (prefix tree) structure, which generates a directed acyclic graph (DAG) of all the words that the Chinese characters in a sentence can possibly form; dynamic programming to search for the maximum-probability path, which finds the most probable segmentation based on word frequency; and, for unknown (out-of-vocabulary) words, an HMM model based on the word-forming capability of Chinese characters, decoded with the Viterbi algorithm;
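The word-graph and maximum-probability-path idea described above can be sketched in a few lines. This is a simplified illustration, not jieba's code: the dictionary WORD_FREQ and its frequencies are invented for the example, a plain dict lookup replaces the prefix-tree scan, and the HMM/Viterbi handling of unknown words is left out (an unknown character is simply emitted on its own).

```python
import math

# Hypothetical toy dictionary: word -> frequency (not jieba's real dictionary).
WORD_FREQ = {
    "无": 5, "抵": 2, "押": 2, "抵押": 50, "无抵押": 20,
    "贷": 5, "款": 5, "贷款": 80,
}
TOTAL = sum(WORD_FREQ.values())


def build_dag(sentence):
    """For each start index, list every end index that closes a dictionary word."""
    dag = {}
    for start in range(len(sentence)):
        ends = [end for end in range(start + 1, len(sentence) + 1)
                if sentence[start:end] in WORD_FREQ]
        dag[start] = ends or [start + 1]  # unknown character stands alone
    return dag


def segment(sentence):
    """Pick the split with the highest product of word probabilities."""
    dag = build_dag(sentence)
    n = len(sentence)
    # best[i] = (log-probability of the best split of sentence[i:], end of its first word)
    best = {n: (0.0, n)}
    for start in range(n - 1, -1, -1):
        best[start] = max(
            (math.log(WORD_FREQ.get(sentence[start:end], 1) / TOTAL) + best[end][0], end)
            for end in dag[start]
        )
    words, i = [], 0
    while i < n:
        end = best[i][1]
        words.append(sentence[i:end])
        i = end
    return words


tokens = segment("无抵押贷款")
```

With these invented frequencies the sentence is split into 无抵押 and 贷款, because that path has a higher probability than any split into shorter words.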
The specific method in S105 is as follows:
S1051, network credit service experts compile network credit information keywords from a large number of network credit websites and materials; the form of the content is shown in Table 1:
TABLE 1 Network credit information keyword list
Serial number   Keyword
1               Loan
2               Money to be paid
3               Paying
4               Interest information
5               Lower message
6               Rest-free
7               Mortgage
8               Credit investigation
9               High amount of
10              Can be credited with a vehicle
...             ...
S1052, using the jieba word segmentation tool to translate the network credit information keyword library of S1051 into pinyin format, recorded as the set K; the form of the content is shown in Table 2:
TABLE 2 Network credit information keyword pinyin list
[Table 2 appears as an image in the original publication: Figure BDA0002295716120000071]
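The special-character filtering of S1041 and the pinyin translation of S1052 can be sketched as follows. Everything here is an assumption made for illustration: the Chinese keywords are guesses at the originals behind the translated list, the small PINYIN table is hand-written and covers only the characters used, and the helper names are hypothetical. A practical system would use a full pinyin conversion resource instead of a lookup table of this size.

```python
import re

# Hypothetical character-to-pinyin table covering only the characters used below.
PINYIN = {
    "贷": "dai", "款": "kuan", "下": "xia",
    "抵": "di", "押": "ya", "征": "zheng", "信": "xin",
}


def strip_special_characters(text):
    """Keep only CJK unified ideographs, dropping punctuation, digits and symbols."""
    return re.sub(r"[^\u4e00-\u9fff]", "", text)


def to_pinyin(word):
    """Join the pinyin of each character with spaces; unknown characters are kept as-is."""
    return " ".join(PINYIN.get(ch, ch) for ch in word)


# Assumed Chinese keywords (loan, loan disbursement, mortgage, credit investigation).
keywords = ["贷款", "下款", "抵押", "征信"]
K = [to_pinyin(word) for word in keywords]

# Hypothetical OCR output with decorative brackets and punctuation.
cleaned = strip_special_characters("【贷款】秒下款!!")
```

K then contains one space-separated pinyin string per keyword, in the same form as the strings compared in the matching step.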
S1053, as shown in FIG. 5, for each keyword in K, fuzzy matching is performed in M by using the FuzzyWuzzy tool. Fuzzy matching is a technique for finding strings that match a pattern approximately rather than exactly. It works by using the edit distance (Levenshtein distance) to measure the difference between sequences. The Levenshtein distance is the minimum number of editing operations required to convert one string into another; the permitted editing operations are replacing one character with another, inserting one character, and deleting one character. Generally, the smaller the edit distance, the greater the similarity of the two strings. The method uses the partial (non-complete) matching mode of the FuzzyWuzzy tool: if the matching result contains a match whose similarity is greater than 80, the text contains network credit information, so the URL corresponding to the text is judged to be a network credit website; otherwise, the URL corresponding to the text is judged not to be a network credit website.
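The edit distance described above can be computed with the classic dynamic-programming recurrence. The sketch below is a plain standard-library illustration with unit cost for each of the three operations; the function name levenshtein is hypothetical, and it is not taken from the fuzzy matching tool itself.

```python
def levenshtein(source, target):
    """Minimum number of single-character replacements, insertions and deletions."""
    # previous[j] = distance between the processed prefix of source and target[:j]
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]  # turning source[:i] into "" takes i deletions
        for j, t_char in enumerate(target, start=1):
            substitution = previous[j - 1] + (s_char != t_char)
            deletion = previous[j] + 1
            insertion = current[j - 1] + 1
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


pinyin_distance = levenshtein("dai kuan", "xia kuan")
hanzi_distance = levenshtein("贷款", "下款")
```

Only the first three letters differ between "dai kuan" and "xia kuan", so three edits out of eight characters are needed, while the two-character Chinese strings differ by one edit out of two.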
The similarity of two compared terms is calculated as follows. Input: string A = "dai kuan"; string B = "xia kuan". Process: call the function partial_ratio(A, B). Output: the similarity of string A to string B.
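A simplified stand-in for the partial matching and the threshold decision is sketched below, using only the standard library. It slides a window the length of the shorter string over the longer one and keeps the best difflib ratio; the real library may choose its candidate windows differently and reports integer scores. The function names, the treatment of M as one space-joined pinyin string, and the sample texts are assumptions made for illustration; the threshold of 80 is the value given in this embodiment.

```python
from difflib import SequenceMatcher


def ratio(a, b):
    """Similarity in percent between two strings of comparable length."""
    return 100 * SequenceMatcher(None, a, b).ratio()


def partial_ratio(a, b):
    """Best similarity between the shorter string and any equally long window of the longer."""
    shorter, longer = (a, b) if len(a) <= len(b) else (b, a)
    if not shorter:
        return 0.0
    window = len(shorter)
    return max(ratio(shorter, longer[i:i + window])
               for i in range(len(longer) - window + 1))


def is_network_credit(keywords_pinyin, text_pinyin, threshold=80):
    """True if any keyword matches somewhere in the text with similarity above the threshold."""
    return any(partial_ratio(keyword, text_pinyin) > threshold
               for keyword in keywords_pinyin)


K = ["dai kuan", "xia kuan"]                       # pinyin keyword library
loan_text = "wu di ya dai kuan miao xia kuan"      # hypothetical pinyin content M
other_text = "jin ri tian qi qing lang"            # hypothetical unrelated content

score_ab = partial_ratio("dai kuan", "xia kuan")
loan_flag = is_network_credit(K, loan_text)
other_flag = is_network_credit(K, other_text)
```

The first text contains a keyword verbatim and is flagged, while the second shares too few letters with any keyword to exceed the threshold.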
In the method for identifying a network credit website based on OCR and text processing technology provided by the invention, a network credit information keyword library is built by network credit service experts from the text content of website pictures, and OCR and text processing technologies are used to translate both the keyword library and the text to be matched into pinyin, which improves the recognition rate, reduces the error, and forms a systematic method for identifying network credit websites.
Example 2
Corresponding to embodiment 1, this embodiment provides a system for identifying a network credit website based on OCR and text processing technologies, comprising:
a URL acquisition module, used for acquiring the URL of the website to be detected;
a picture crawling module, used for crawling the pictures of the website to be detected by using a crawler technology and outputting the picture set corresponding to the URL; the specific execution process is as follows:
first, a crawler system for website pictures is built by using a crawler technology and recorded as R; then the URL to be detected is input into the crawler system, and the picture set {P} corresponding to the URL is output.
a character extraction module, used for extracting characters from the crawled picture set {P} by using an OCR technology; the specific execution process is as follows:
first, each picture in the picture set {P} is preprocessed by image binarization, noise removal and skew correction; then the picture is segmented into paragraphs and lines by using a layout analysis algorithm, a character segmentation algorithm handles characters that are hard to separate by simple cutting because of touching characters and broken strokes, multi-dimensional features are extracted from each character image by using a character feature extraction algorithm, and a character recognition algorithm matches the feature vector extracted from the current character against a feature template library, first by coarse template classification and then by fine template matching, to recognize the character; finally, the recognized characters are corrected according to semantics and arranged and output in text format.
a word segmentation module, used for filtering special characters out of the text content output by S1036 by using the jieba word segmentation tool, segmenting it into words, and then translating the words into pinyin to obtain pinyin word segmentation content, recorded as the set M.
and a matching module, used for constructing a pinyin keyword library K, matching the pinyin word segmentation content against the network credit keywords in the pinyin keyword library, and outputting whether the corresponding URL is a network credit website. The specific execution process is as follows:
first, network credit service experts compile network credit information keywords from a large number of network credit websites and materials; then, using the jieba word segmentation tool, the network credit information keyword library is translated into pinyin format, recorded as K, and the word segmentation content is translated into pinyin format by the same method, recorded as M; finally, for each keyword in K, fuzzy matching is performed in M by using the FuzzyWuzzy tool; if the matching result contains a match whose similarity is greater than a preset value, the text contains network credit information and the URL corresponding to the text is a network credit website; otherwise, the URL corresponding to the text is not a network credit website.
The above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (10)

1. A method for identifying a network credit website based on OCR and text processing technology is characterized in that: the method comprises the following steps:
s101, acquiring a URL of a website to be detected;
s102, crawling pictures of a website to be detected by using a crawler technology, and outputting a URL picture set;
s103, extracting characters from the crawled picture set by using an OCR technology;
s104, using a jieba word segmentation technology to filter the content of the extracted characters, segmenting words and translating the segmented words into pinyin to obtain pinyin segmented word content M;
s105, constructing a pinyin keyword library K, performing network credit keyword matching on the pinyin word segmentation content M by using the pinyin keyword library, and outputting whether the corresponding URL is a network credit website.
2. A method for identifying a lending website based on OCR and text processing techniques according to claim 1, wherein: the step S102 specifically includes:
s1021, building a crawler system for the website pictures by using a crawler technology, and recording the crawler system as R;
s1022, inputting the URL to be detected into the crawler system, and outputting the picture set { P } corresponding to the URL.
3. A method for identifying a lending website based on OCR and text processing techniques according to claim 1, wherein: the step S103 is specifically:
s1031, carrying out image binarization, noise removal and inclination correction pretreatment on each picture in the picture set { P };
s1032, segmenting the picture into sections and lines by using a layout analysis algorithm;
s1033, processing the problem that characters are difficult to cut simply due to character adhesion and broken strokes by using a character cutting algorithm;
s1034, extracting multi-dimensional features from the character image by using a character feature extraction algorithm;
s1035, carrying out template rough classification and template fine matching on the feature vector extracted from the current character and a feature template library by using a character recognition algorithm, and recognizing the character;
and S1036, correcting the recognized characters according to semantics, and sorting and outputting the characters into a text format.
4. The method for identifying a network credit website based on OCR and text processing technology according to claim 3, wherein the step S104 specifically includes:
S1041, by using a jieba word segmentation tool, filtering special characters from the text content output by S1036, then segmenting the text into words, and finally converting the segmented words into pinyin to obtain the pinyin word segmentation content M.
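The three operations of step S1041 (filtering, segmentation, pinyin conversion) can be sketched with the standard library alone. This is a simplified stand-in, not the jieba tool named in the claim: segmentation uses forward maximum matching over a hypothetical toy vocabulary, and the pinyin table `TOY_PINYIN` is likewise a made-up sample covering only a few characters.

```python
import re

# Hypothetical toy resources, for illustration only.
TOY_VOCABULARY = {"贷款", "利息", "借钱"}
TOY_PINYIN = {"贷": "dai", "款": "kuan", "利": "li", "息": "xi",
              "借": "jie", "钱": "qian", "低": "di"}


def filter_special_characters(text):
    """Keep only CJK ideographs, ASCII letters, and digits."""
    return re.sub(r"[^\u4e00-\u9fffA-Za-z0-9]", "", text)


def segment(text, vocabulary=TOY_VOCABULARY, max_len=4):
    """Forward maximum matching: take the longest vocabulary word at each position."""
    words, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted so the scan keeps advancing.
            if size == 1 or candidate in vocabulary:
                words.append(candidate)
                i += size
                break
    return words


def to_pinyin(word, table=TOY_PINYIN):
    """Map each character to toneless pinyin; unknown characters pass through."""
    return "".join(table.get(ch, ch) for ch in word)


def pinyin_segments(text):
    """Filter, segment, and convert `text`, giving content in the role of M."""
    return [to_pinyin(word) for word in segment(filter_special_characters(text))]
```

Comparing pinyin rather than the characters themselves may let the later matching step tolerate OCR errors that confuse characters with the same pronunciation, which appears to be the motivation for this conversion.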
5. The method for identifying a network credit website based on OCR and text processing technology according to claim 4, wherein the step S105 specifically includes:
S1051, obtaining network credit information keywords, compiled by network credit service experts from a large number of network credit websites and materials;
S1052, converting the network credit information keyword library of S1051 into pinyin format by using a jieba word segmentation tool, and recording the resulting keyword library as K;
S1053, for each keyword in K, performing fuzzy matching in M by using the FuzzyWuzzy tool; if the matching result contains a match whose similarity is greater than a preset value, the text contains network credit information and the URL corresponding to the text is a network credit website; otherwise, the URL corresponding to the text is a non-network-credit website.
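The decision rule of step S1053 can be sketched as follows. The example uses `difflib.SequenceMatcher` from the Python standard library as a stand-in for the FuzzyWuzzy tool named in the claim, and the preset value of 80 is an arbitrary assumption; the claim does not state the threshold.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Similarity score between two strings, scaled to the range 0-100."""
    return round(100 * SequenceMatcher(None, a, b).ratio())


def is_network_credit_text(segments_m, keywords_k, preset_value=80):
    """Return True if any keyword in K fuzzily matches any segment in M."""
    for keyword in keywords_k:
        for segment in segments_m:
            if similarity(keyword, segment) > preset_value:
                return True
    return False
```

With K = ["daikuan", "jieqian"], a segment list containing the slightly misspelled "daikuann" is classified as network credit text, while an unrelated list such as ["tianqi", "yubao"] is not.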
6. A system for identifying a network credit website based on OCR and text processing technology, characterized by comprising:
a URL acquisition module, used for acquiring the URL of the website to be detected;
a picture crawling module, used for crawling the pictures of the website to be detected by using a crawler technology and outputting the picture set of the URL;
a character extraction module, used for extracting characters from the crawled picture set by using an OCR technology;
a word segmentation module, used for filtering special characters from the text content output by S1036 by using a jieba word segmentation tool, segmenting the text into words, and then converting the words into pinyin to obtain the pinyin word segmentation content, recorded as a set M;
and a matching module, used for constructing a pinyin keyword library K, matching the pinyin word segmentation content against the network credit keywords by using the pinyin keyword library, and outputting whether the corresponding URL is a network credit website.
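One way to picture how the modules of this claim fit together is the sketch below, in which each module is passed in as a plain function. The class name `NetworkCreditDetector` and its field names are hypothetical, and the composition shown is only an assumption about the data flow described in the claim.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class NetworkCreditDetector:
    """Chains the picture crawling, character extraction, word segmentation,
    and matching modules; every field is a caller-supplied function."""
    crawl_pictures: Callable[[str], List[bytes]]
    extract_text: Callable[[bytes], str]
    to_pinyin_segments: Callable[[str], List[str]]
    matches_keywords: Callable[[List[str]], bool]

    def is_network_credit_website(self, url: str) -> bool:
        segments: List[str] = []
        for picture in self.crawl_pictures(url):
            text = self.extract_text(picture)
            segments.extend(self.to_pinyin_segments(text))
        # The combined segments from all pictures play the role of the set M.
        return self.matches_keywords(segments)
```

Keeping the modules as injected functions allows each stage to be replaced or tested in isolation, for example with stub functions in place of a real crawler or OCR engine.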
7. The system for identifying a network credit website based on OCR and text processing technology according to claim 6, wherein the picture crawling module specifically executes the following process:
first, building a crawler system for website pictures by using a crawler technology; then inputting the URL to be detected into the crawler system, and outputting the picture set {P} corresponding to the URL.
8. The system for identifying a network credit website based on OCR and text processing technology according to claim 7, wherein the character extraction module specifically executes the following process:
first, performing image binarization, noise removal, and skew correction as preprocessing on each picture in the picture set {P}; then, segmenting each picture into paragraphs and lines by using a layout analysis algorithm, using a character cutting algorithm to separate characters that cannot be cut simply because of touching characters and broken strokes, extracting multi-dimensional features from each character image by using a character feature extraction algorithm, and using a character recognition algorithm to perform coarse template classification and fine template matching of the feature vector extracted from the current character against a feature template library, thereby recognizing the character; and finally, correcting the recognized characters according to semantics, and organizing and outputting them in text format.
9. The system for identifying a network credit website based on OCR and text processing technology according to claim 8, wherein the word segmentation module specifically executes the following process:
by using a jieba word segmentation tool, filtering special characters from the text content, segmenting the text into words, and converting the segmented words into pinyin to obtain the pinyin word segmentation content, recorded as M.
10. The system for identifying a network credit website based on OCR and text processing technology according to claim 9, wherein the matching module specifically executes the following process:
first, obtaining network credit information keywords, compiled by network credit service experts from a large number of network credit websites and materials; then, by using a jieba word segmentation tool, converting the network credit information keyword library into pinyin format and recording it as K, and converting the word segmentation content of S1041 into pinyin format by the same method and recording it as M; and finally, for each keyword in K, performing fuzzy matching in M by using the FuzzyWuzzy tool; if the matching result contains a match whose similarity is greater than a preset value, the text contains network credit information and the URL corresponding to the text is a network credit website; otherwise, the URL corresponding to the text is not a network credit website.
CN201911209962.7A 2019-11-29 2019-11-29 Method and system for identifying network credit website based on OCR and text processing technology Pending CN111078979A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911209962.7A CN111078979A (en) 2019-11-29 2019-11-29 Method and system for identifying network credit website based on OCR and text processing technology

Publications (1)

Publication Number Publication Date
CN111078979A (en) 2020-04-28

Family

ID=70312365

Family Applications (1)

Application Number Priority Date Filing Date Title
CN201911209962.7A Pending CN111078979A (en) 2019-11-29 2019-11-29 Method and system for identifying network credit website based on OCR and text processing technology

Country Status (1)

Country Link
CN (1) CN111078979A (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100316300A1 (en) * 2009-06-13 2010-12-16 Microsoft Corporation Detection of objectionable videos
CN103179095A (en) * 2011-12-22 2013-06-26 阿里巴巴集团控股有限公司 Method and client device for detecting phishing websites
CN106776946A (en) * 2016-12-02 2017-05-31 重庆大学 A kind of detection method of fraudulent website
CN108256104A (en) * 2018-02-05 2018-07-06 恒安嘉新(北京)科技股份公司 Internet site comprehensive classification method based on multidimensional characteristic
CN108874777A (en) * 2018-06-11 2018-11-23 北京奇艺世纪科技有限公司 A kind of method and device of text anti-spam
CN109255113A (en) * 2018-09-04 2019-01-22 郑州信大壹密科技有限公司 Intelligent critique system
CN110209795A (en) * 2018-06-11 2019-09-06 腾讯科技(深圳)有限公司 Comment on recognition methods, device, computer readable storage medium and computer equipment
CN110210028A (en) * 2019-05-30 2019-09-06 杭州远传新业科技有限公司 For domain feature words extracting method, device, equipment and the medium of speech translation text
CN110275958A (en) * 2019-06-26 2019-09-24 北京市博汇科技股份有限公司 Site information recognition methods, device and electronic equipment

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111291738A (en) * 2020-05-09 2020-06-16 支付宝(杭州)信息技术有限公司 Element extraction method and device in front-end page image and electronic equipment
CN111782772A (en) * 2020-07-24 2020-10-16 平安银行股份有限公司 Text automatic generation method, device, equipment and medium based on OCR technology
CN112667943A (en) * 2020-11-10 2021-04-16 中科金审(北京)科技有限公司 Illegal website identification and locking method
CN113127715A (en) * 2021-03-04 2021-07-16 微梦创科网络科技(中国)有限公司 Method and system for identifying gambling-related information
CN114663903A (en) * 2022-05-25 2022-06-24 深圳大道云科技有限公司 Text data classification method, device, equipment and storage medium
CN114663903B (en) * 2022-05-25 2022-08-19 深圳大道云科技有限公司 Text data classification method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
CN111078979A (en) Method and system for identifying network credit website based on OCR and text processing technology
CN109697162B (en) Software defect automatic detection method based on open source code library
CN112347244B (en) Pornography-related and gambling-related website detection method based on mixed feature analysis
CN109831460B (en) Web attack detection method based on collaborative training
CN112541476B (en) Malicious webpage identification method based on semantic feature extraction
CN111078978A (en) Web credit website entity identification method and system based on website text content
CN111462752B (en) Attention mechanism, feature embedding and BI-LSTM based customer intention recognition method
CN103605691A (en) Device and method used for processing issued contents in social network
WO2020071558A1 (en) Business form layout analysis device, and analysis program and analysis method therefor
CN113986864A (en) Log data processing method and device, electronic equipment and storage medium
CN103605690A (en) Device and method for recognizing advertising messages in instant messaging
CN103246655A (en) Text categorizing method, device and system
CN111460803B (en) Equipment identification method based on Web management page of industrial Internet of things equipment
CN106815253B (en) Mining method based on mixed data type data
CN110413998B (en) Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof
CN117520561A (en) Entity relation extraction method and system for knowledge graph construction in helicopter assembly field
CN111597423A (en) Performance evaluation method and device of interpretable method of text classification model
CN112380412A (en) Optimization method for screening matching information based on big data
CN110888977B (en) Text classification method, apparatus, computer device and storage medium
CN107291952B (en) Method and device for extracting meaningful strings
CN112699949B (en) Potential user identification method and device based on social platform data
CN115294593A (en) Image information extraction method and device, computer equipment and storage medium
CN111695117B (en) Webshell script detection method and device
Kumar et al. Line based robust script identification for Indian languages
CN103605692A (en) Device and method used for shielding advertisement contents in ask-and-answer community

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20200428