CN106453689B - Method for extracting and verifying URLs - Google Patents

Method for extracting and verifying URLs

Info

Publication number
CN106453689B
Authority
CN
China
Prior art keywords
url
url data
data
domain name
content
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611042612.2A
Other languages
Chinese (zh)
Other versions
CN106453689A (en)
Inventor
李强
王凤琴
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Changhong Electric Co Ltd
Original Assignee
Sichuan Changhong Electric Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Changhong Electric Co Ltd
Priority to CN201611042612.2A
Publication of CN106453689A
Application granted
Publication of CN106453689B
Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L61/00 Network arrangements, protocols or services for addressing or naming
    • H04L61/30 Managing network names, e.g. use of aliases or nicknames
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2101/00 Indexing scheme associated with group H04L61/00
    • H04L2101/30 Types of network names
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2101/00 Indexing scheme associated with group H04L61/00
    • H04L2101/30 Types of network names
    • H04L2101/33 Types of network names containing protocol addresses or telephone numbers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L2101/00 Indexing scheme associated with group H04L61/00
    • H04L2101/60 Types of network addresses
    • H04L2101/618 Details of network addresses
    • H04L2101/659 Internet protocol version 6 [IPv6] addresses

Abstract

The invention discloses a method for extracting and verifying URLs. The method comprises: building a matching template library; reading content from massive content sources; matching the input content stream against protocol templates, domain name templates, and IP address templates; storing the matched content by category; reading the stored URL data; checking the content read against the protocol templates; and judging whether the URL content is correct. If the URL data is correct, the next stored URL record is read; if it is incorrect, the URL content is completed and then checked against the URL definition. If the completed URL data meets the definition, it is written back to the classified storage and the next stored URL record is read; if it does not, the record is deleted from the stored URLs. The method provided by the invention is a basic method for big data analysis in certain business scenarios and has strong practical value.

Description

Method for extracting and verifying URLs
Technical field
The present invention relates to the field of communications, and in particular to a method for extracting and verifying URLs.
Background
A URL (Uniform Resource Locator) is what is commonly called a web address. A URL is a concise representation of the location of a resource available on the internet and of the method for accessing it; it is the address of a standard resource on the internet. A URL is the network-wide extension of a file name. A URL is therefore a pointer to any accessible object on a machine connected to the internet.
The syntax of a URL is usually of the form "protocol://username:password@subdomain.domain.top-level-domain:port/directory/filename.suffix?parameter=value#fragment". In practice, only the components that are needed are used. Because internet content is so diverse, extracting URLs from massive amounts of content raises two problems: first, how to extract the URLs correctly; second, how to correct errors in the extracted URLs.
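The component layout of a URL can be inspected with Python's standard `urllib.parse` module. The sketch below is illustrative only: the helper name `describe_url` and the example URL are hypothetical and do not come from the patent.

```python
from urllib.parse import urlsplit


def describe_url(url: str) -> dict:
    """Split a URL into the components named in the URL syntax."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,   # subdomain.domain.top-level-domain
        "port": parts.port,
        "path": parts.path,       # /directory/filename.suffix
        "query": parts.query,     # parameter=value
        "fragment": parts.fragment,
    }


# Hypothetical example URL that uses every optional component.
info = describe_url("http://user:secret@news.example.com:8080/dir/page.html?key=value#top")
```

Any component that is omitted from the URL comes back as `None` or an empty string, which is what makes later completion of partial URLs necessary.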
The traditional solution filters out URLs by searching for the "http://" marker and then corrects them manually. This approach is time-consuming, laborious, and impractical.
Summary of the invention
The present invention overcomes the deficiencies of the prior art and provides a method for extracting and verifying URLs.
To solve the above technical problem, the invention adopts the following technical solution:
A method for extracting and verifying URLs, comprising the following steps:
Step 1, build a template library for extracting and verifying URLs; the template library includes a protocol template library, a domain name template library, and an IP address template library;
Step 2, read content from massive content sources and convert it into an input stream for reading; the content sources include web page content from the internet, user behavior data collected from social tools, or log data recorded by sensors;
Step 3, according to the protocol template library, match the input content stream by protocol category and filter out the URL data that satisfies the protocol template library;
Step 4, according to the domain name template library, match the input content stream by domain name level and domain name type and filter out the URL data that satisfies the domain name template library;
Step 5, according to the IP address template library, match the input content stream against IPv4 and IPv6 and filter out the URL data that satisfies the IP address template library;
Step 6, according to the matching results of steps 3 to 5, store the matched URL data by category;
Step 7, read URL data in sequence from the classified storage;
Step 8, according to the protocol template library, precisely check the URL data read against the protocol specifications and protocol features;
Step 9, according to the result of the precise check in step 8, determine whether the URL data is correct; if it is correct, return to step 7 and read the next stored URL record; if it is incorrect, go to step 10;
Step 10, complete the URL data;
Step 11, check the completed URL data again to determine whether it meets the URL definition; if it does, write the completed URL data to the classified storage, return to step 7, and read the next stored URL record; if it does not, the completed URL data is invalid, so go to step 12;
Step 12, delete the URL data.
In a further technical solution, the protocol template library includes an HTTP protocol template, an HTTPS protocol template, an SMTP protocol template, an FTP protocol template, a UDP protocol template, a Telnet protocol template, or an NFS protocol template.
In a further technical solution, the protocol template library includes protocol specifications and protocol features.
In a further technical solution, the domain name template library is a set of domain name templates designed according to domain name specifications and constructed by domain name level and domain name type.
In a further technical solution, the IP address template library is a set of domain name templates designed according to IPv4 and IPv6.
In a further technical solution, the matching in steps 3 to 5 uses approximate matching and fuzzy matching.
In a further technical solution, reading in sequence in step 7 means reading the URL data one record at a time, either by category or by record number.
Compared with the prior art, the present invention has the following beneficial effects:
The method provided by the invention is a basic method for big data analysis in certain business scenarios and has strong practical value.
Brief description of the drawings
Fig. 1 is a flowchart of the method for extracting and verifying URLs according to one embodiment of the present invention.
Fig. 2 is a flowchart of the method for extracting and verifying URLs according to another embodiment of the present invention.
Detailed description
The present invention is further described below with reference to the accompanying drawings.
The method for extracting and verifying URLs shown in Fig. 1 and Fig. 2 comprises the following steps:
Step 1, build a template library for extracting and verifying URLs; the template library includes a protocol template library, a domain name template library, and an IP address template library;
The protocol template library includes an HTTP protocol template, an HTTPS protocol template, an SMTP protocol template, an FTP protocol template, a UDP protocol template, a Telnet protocol template, or an NFS protocol template.
The protocol template library includes protocol specifications and protocol features.
The domain name template library is a set of domain name templates designed according to domain name specifications and constructed by domain name level and domain name type.
The IP address template library is a set of domain name templates designed according to IPv4 and IPv6.
Once built, the template library can also be extended dynamically as needed, so that more protocols and more domain name types can be supported.
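One possible in-memory representation of such a template library, with dynamic extension, is sketched below. The class name `TemplateLibrary`, the use of regular expressions as templates, and the labels are assumptions made for illustration; the patent does not specify a data structure.

```python
import re
from dataclasses import dataclass, field


@dataclass
class TemplateLibrary:
    """A named collection of regex templates that can grow at run time."""
    name: str
    templates: dict = field(default_factory=dict)

    def add(self, label: str, pattern: str) -> None:
        # Compile once so that later matching passes are cheap.
        self.templates[label] = re.compile(pattern, re.IGNORECASE)

    def match_labels(self, text: str) -> list:
        # Return the labels of all templates found anywhere in the text.
        return [label for label, rx in self.templates.items() if rx.search(text)]


# Hypothetical protocol template library seeded with a few schemes.
protocols = TemplateLibrary("protocol")
for scheme in ("http", "https", "ftp", "telnet"):
    protocols.add(scheme, rf"\b{scheme}://")

# Dynamic extension: support for another protocol is added on demand.
protocols.add("nfs", r"\bnfs://")
```

The domain name and IP address libraries could reuse the same class with different patterns, which keeps the three libraries uniform.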
Step 2, read content from massive content sources and convert it into an input stream for reading; the content sources include web page content from the internet, user behavior data collected from social tools, or log data recorded by sensors.
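Converting heterogeneous sources into a single input stream could look like the generator below. The function name `content_stream` and the sample data are hypothetical; `io.StringIO` objects stand in for real web pages, behavior data, and sensor logs.

```python
import io
from typing import Iterable, Iterator, TextIO


def content_stream(sources: Iterable[TextIO]) -> Iterator[str]:
    """Yield non-empty lines from every source as one continuous stream."""
    for source in sources:
        for line in source:
            line = line.strip()
            if line:
                yield line


# Stand-ins for a web page and a sensor log.
web_page = io.StringIO("Visit http://news.qq.com/ today\n\n")
sensor_log = io.StringIO("2016-11-11 device=7 upload=ftp://10.0.0.5/data\n")
lines = list(content_stream([web_page, sensor_log]))
```

Because the function is a generator, content is pulled lazily, which matters when the sources are too large to hold in memory.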
Step 3, according to the protocol template library, match the input content stream by protocol category and filter out the URL data that satisfies the protocol template library; the matching uses approximate matching and fuzzy matching.
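Matching by protocol category can be sketched with regular expressions. The patent does not define its approximate or fuzzy matching, so case-insensitive patterns are used here as a simple stand-in; the scheme list and the name `match_protocols` are assumptions.

```python
import re

# Hypothetical protocol template library: scheme name -> compiled pattern.
PROTOCOL_TEMPLATES = {
    scheme: re.compile(rf"\b{scheme}://[^\s<>]+", re.IGNORECASE)
    for scheme in ("http", "https", "ftp", "telnet", "nfs")
}


def match_protocols(text: str) -> dict:
    """Return candidate URLs grouped by the protocol template they satisfy."""
    found = {}
    for scheme, pattern in PROTOCOL_TEMPLATES.items():
        hits = pattern.findall(text)
        if hits:
            found[scheme] = hits
    return found


result = match_protocols("Get ftp://files.example.com/a.zip or HTTPS://example.com/x")
```

The leading `\b` keeps `ftp://` from matching inside a longer scheme such as `sftp://`.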
Step 4, according to the domain name template library, match the input content stream by domain name level and domain name type and filter out the URL data that satisfies the domain name template library; the matching uses approximate matching and fuzzy matching.
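A minimal sketch of matching by domain name level and type follows. It interprets "level" as the number of dot-separated labels and "type" as the category of the top-level domain; both interpretations, the tiny `DOMAIN_TYPES` table, and the function name are assumptions, not details from the patent.

```python
import re

# Hypothetical domain name templates: a tiny sample of top-level domain types.
DOMAIN_TYPES = {"com": "commercial", "org": "organization", "cn": "country-code"}

# One or more labels followed by a dot, then an alphabetic top-level domain.
DOMAIN_RE = re.compile(
    r"\b((?:[a-z0-9](?:[a-z0-9-]{0,61}[a-z0-9])?\.)+)([a-z]{2,})\b",
    re.IGNORECASE,
)


def match_domains(text: str) -> list:
    """Return (domain, level, type) for each domain whose TLD is in the library."""
    results = []
    for match in DOMAIN_RE.finditer(text):
        tld = match.group(2).lower()
        if tld in DOMAIN_TYPES:
            domain = match.group(0).lower()
            level = domain.count(".") + 1  # news.qq.com has 3 labels
            results.append((domain, level, DOMAIN_TYPES[tld]))
    return results


domains = match_domains("Read news.qq.com and Example.ORG, not file.txt")
```

Restricting matches to known top-level domains is what rejects look-alikes such as file names.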
Step 5, according to the IP address template library, match the input content stream against IPv4 and IPv6 and filter out the URL data that satisfies the IP address template library; the matching uses approximate matching and fuzzy matching.
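For IP-based URLs, one option is to find loose candidates with regular expressions and leave strict validation to Python's standard `ipaddress` module. The candidate patterns and the name `match_ip_hosts` are assumptions for illustration; IPv6 literals are only recognized in the bracketed form used inside URLs.

```python
import ipaddress
import re

# Loose candidate patterns; ipaddress does the strict validation afterwards.
IPV4_CANDIDATE = re.compile(r"(?<![\w.])\d{1,3}(?:\.\d{1,3}){3}(?![\w.])")
IPV6_CANDIDATE = re.compile(r"\[([0-9a-fA-F:]+)\]")


def match_ip_hosts(text: str) -> dict:
    """Group valid IPv4 and IPv6 host literals found in the text."""
    found = {"ipv4": [], "ipv6": []}
    for candidate in IPV4_CANDIDATE.findall(text):
        try:
            ipaddress.IPv4Address(candidate)
            found["ipv4"].append(candidate)
        except ValueError:
            pass  # looks like an address but an octet is out of range
    for candidate in IPV6_CANDIDATE.findall(text):
        try:
            ipaddress.IPv6Address(candidate)
            found["ipv6"].append(candidate)
        except ValueError:
            pass
    return found


hosts = match_ip_hosts("http://192.168.1.10/a http://[2001:db8::1]:8080/ http://999.1.1.1/")
```

Splitting the work this way keeps the patterns short while still rejecting strings such as `999.1.1.1`.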
Step 6, according to the matching results of steps 3 to 5, store the matched URL data by category; the storage can be a relational database, a file system, or a NoSQL database.
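As one example of the relational option, classified storage can be sketched with the standard `sqlite3` module. The table name `url_data`, its columns, and the function `store_classified` are hypothetical choices, not part of the patent.

```python
import sqlite3


def store_classified(conn: sqlite3.Connection, matches: dict) -> None:
    """Persist matched URL data under the category that matched it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS url_data ("
        "id INTEGER PRIMARY KEY, category TEXT NOT NULL, url TEXT NOT NULL, "
        "UNIQUE(category, url))"
    )
    rows = [(category, url) for category, urls in matches.items() for url in urls]
    # INSERT OR IGNORE skips records that were already stored.
    conn.executemany("INSERT OR IGNORE INTO url_data (category, url) VALUES (?, ?)", rows)
    conn.commit()


conn = sqlite3.connect(":memory:")
store_classified(conn, {"protocol": ["http://news.qq.com/"], "domain": ["news.qq.com"]})
store_classified(conn, {"domain": ["news.qq.com"]})  # duplicate, ignored
stored = conn.execute("SELECT category, url FROM url_data ORDER BY id").fetchall()
```

Keeping the category in its own column allows later reads by category or by record number.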
Step 7, read URL data in sequence from the classified storage; reading in sequence means reading the URL data one record at a time, either by category or by record number.
Step 8, according to the protocol template library, precisely check the URL data read against the protocol specifications and protocol features.
Step 9, according to the result of the precise check in step 8, determine whether the URL data is correct; if it is correct, return to step 7 and read the next stored URL record; if it is incorrect, go to step 10.
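The precise check of steps 8 and 9 might be approximated as follows. The patent does not list its protocol specifications or features, so the criteria used here (known scheme, non-empty host, valid port) and the `PROTOCOL_FEATURES` table are assumptions.

```python
from urllib.parse import urlsplit

# Hypothetical protocol features: scheme -> default port.
PROTOCOL_FEATURES = {"http": 80, "https": 443, "ftp": 21, "telnet": 23}


def is_correct_url(url: str) -> bool:
    """Precise check: known scheme, non-empty host, and a valid port if given."""
    try:
        parts = urlsplit(url)
        _ = parts.port  # accessing .port raises ValueError for an out-of-range port
    except ValueError:
        return False
    return parts.scheme in PROTOCOL_FEATURES and bool(parts.hostname)
```

A record such as `news.qq.com` fails this check because it has no scheme, which is the case that step 10 is meant to repair.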
Step 10, complete the URL data. Part of a URL is often omitted; for example, "news.qq.com" usually omits "http://" and should be completed to "http://news.qq.com/".
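The completion in this example can be reproduced with a few lines of Python. The function name `complete_url` and the choice of `http` as the default scheme are assumptions; the patent only gives the single `news.qq.com` example.

```python
from urllib.parse import urlsplit, urlunsplit


def complete_url(url: str, default_scheme: str = "http") -> str:
    """Fill in an omitted scheme and an empty path."""
    candidate = url.strip()
    if "://" not in candidate:
        candidate = f"{default_scheme}://{candidate}"
    parts = urlsplit(candidate)
    path = parts.path or "/"  # an empty path becomes the root path
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, parts.fragment))
```

URLs that are already complete pass through unchanged, so the function is safe to apply to every record that failed the check.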
Step 11, check the completed URL data again to determine whether it meets the URL definition; if it does, write the completed URL data to the classified storage, return to step 7, and read the next stored URL record; if it does not, the completed URL data is invalid, so go to step 12.
Step 12, delete the URL data; specifically, delete this record from the classified URL storage, then return to step 7 and read the next stored URL record, until all URL records have been processed.
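The verification loop of steps 7 to 12 can be sketched end to end as below. The helpers `is_valid` and `complete` are deliberately simplified stand-ins, and a plain dictionary replaces the classified storage; all names are hypothetical.

```python
from urllib.parse import urlsplit

KNOWN_SCHEMES = {"http", "https", "ftp"}  # hypothetical protocol template library


def is_valid(url: str) -> bool:
    """Simplified check: known scheme and a dotted host name."""
    parts = urlsplit(url)
    return parts.scheme in KNOWN_SCHEMES and "." in (parts.hostname or "")


def complete(url: str) -> str:
    """Simplified completion: prepend http:// when no scheme is present."""
    return url if "://" in url else f"http://{url}/"


def verify_store(store: dict) -> dict:
    """Steps 7 to 12: check each record, repair it if possible, else delete it."""
    for key in list(store):            # step 7: read records one by one
        url = store[key]
        if is_valid(url):              # steps 8 and 9: precise check
            continue
        repaired = complete(url)       # step 10: completion
        if is_valid(repaired):         # step 11: re-check against the definition
            store[key] = repaired      # write the completed record back
        else:
            del store[key]             # step 12: delete the invalid record
    return store


records = {1: "http://news.qq.com/", 2: "news.qq.com", 3: "not a url"}
verify_store(records)
```

Iterating over `list(store)` takes a snapshot of the keys, so records can be deleted safely while the loop is running.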
The above specific embodiments describe the essence of the present invention in detail but do not limit its scope of protection. Those of ordinary skill in the art can make many improvements and modifications under the inspiration of the present invention, and all such improvements and modifications fall within the scope of the claims of the present invention.

Claims (7)

1. A method for extracting and verifying URLs, characterized in that it comprises the following steps:
Step 1, build a template library for extracting and verifying URLs; the template library includes a protocol template library, a domain name template library, and an IP address template library;
Step 2, read content from massive content sources and convert it into an input stream for reading; the content sources include web page content from the internet, user behavior data collected from social tools, or log data recorded by sensors;
Step 3, according to the protocol template library, match the input content stream by protocol category and filter out the URL data that satisfies the protocol template library;
Step 4, according to the domain name template library, match the input content stream by domain name level and domain name type and filter out the URL data that satisfies the domain name template library;
Step 5, according to the IP address template library, match the input content stream against IPv4 and IPv6 and filter out the URL data that satisfies the IP address template library;
Step 6, according to the matching results of steps 3 to 5, store the matched URL data by category;
Step 7, read URL data in sequence from the classified storage;
Step 8, according to the protocol template library, precisely check the URL data read against the protocol specifications and protocol features;
Step 9, according to the result of the precise check in step 8, determine whether the URL data is correct; if it is correct, return to step 7 and read the next stored URL record; if it is incorrect, go to step 10;
Step 10, complete the URL data;
Step 11, check the completed URL data again to determine whether it meets the URL definition; if it does, write the completed URL data to the classified storage, return to step 7, and read the next stored URL record; if it does not, the completed URL data is invalid, so go to step 12;
Step 12, delete the URL data.
2. The method for extracting and verifying URLs according to claim 1, characterized in that the protocol template library includes an HTTP protocol template, an HTTPS protocol template, an SMTP protocol template, an FTP protocol template, a UDP protocol template, a Telnet protocol template, or an NFS protocol template.
3. The method for extracting and verifying URLs according to claim 1, characterized in that the protocol template library includes protocol specifications and protocol features.
4. The method for extracting and verifying URLs according to claim 1, characterized in that the domain name template library is a set of domain name templates designed according to domain name specifications and constructed by domain name level and domain name type.
5. The method for extracting and verifying URLs according to claim 1, characterized in that the IP address template library is a set of domain name templates designed according to IPv4 and IPv6.
6. The method for extracting and verifying URLs according to claim 1, characterized in that the matching in steps 3 to 5 uses both approximate matching and fuzzy matching.
7. The method for extracting and verifying URLs according to claim 1, characterized in that reading in sequence in step 7 means reading the URL data one record at a time, either by category or by record number.
CN201611042612.2A 2016-11-11 2016-11-11 Method for extracting and verifying URLs Active CN106453689B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611042612.2A CN106453689B (en) 2016-11-11 2016-11-11 Method for extracting and verifying URLs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611042612.2A CN106453689B (en) 2016-11-11 2016-11-11 Method for extracting and verifying URLs

Publications (2)

Publication Number Publication Date
CN106453689A CN106453689A (en) 2017-02-22
CN106453689B true CN106453689B (en) 2019-05-24

Family

ID=58218073

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611042612.2A Active CN106453689B (en) 2016-11-11 2016-11-11 Method for extracting and verifying URLs

Country Status (1)

Country Link
CN (1) CN106453689B (en)

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant