CN106453689B - Method for extracting and verifying URLs - Google Patents
Method for extracting and verifying URLs
- Publication number
- CN106453689B (application CN201611042612.2A)
- Authority
- CN
- China
- Prior art keywords
- url
- url data
- data
- domain name
- content
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L61/00—Network arrangements, protocols or services for addressing or naming
- H04L61/30—Managing network names, e.g. use of aliases or nicknames
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
- G06F16/9566—URL specific, e.g. using aliases, detecting broken or misspelled links
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L2101/00—Indexing scheme associated with group H04L61/00
- H04L2101/30—Types of network names
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L2101/00—Indexing scheme associated with group H04L61/00
- H04L2101/30—Types of network names
- H04L2101/33—Types of network names containing protocol addresses or telephone numbers
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L2101/00—Indexing scheme associated with group H04L61/00
- H04L2101/60—Types of network addresses
- H04L2101/618—Details of network addresses
- H04L2101/659—Internet protocol version 6 [IPv6] addresses
Abstract
The invention discloses a method for extracting and verifying URLs, comprising: building a matching template library; reading content from massive content sources; matching the input content stream against protocol templates, domain name templates, and IP address templates; storing the matched content by category; reading the stored URL data; checking the content that was read against the protocol templates; judging whether the URL content is correct and, if the URL data is correct, continuing to read the next stored URL data, or, if the URL data is incorrect, proceeding to the next step; completing the URL content; and checking whether the URL content conforms to the definition. If the URL data conforms to the definition, the completed data is written to the classified storage and reading continues with the next stored URL data. If the URL data does not conform to the definition, the record is deleted from the stored URLs. The method provided by the invention is a basic method for big data analysis in certain business scenarios and has strong practical value.
Description
Technical field
The present invention relates to the field of communications, and in particular to a method for extracting and verifying URLs.
Background
A URL (Uniform Resource Locator) is what is commonly called a web address. A URL is a concise representation of the location of a resource available on the Internet and of the method for accessing it; it is the address of a standard resource on the Internet. A URL is equivalent to an extension of a file name to the scope of the network. A URL is therefore a pointer to any accessible object on a machine connected to the Internet.
The syntax of a URL usually takes the form "protocol://username:password@subdomain.domain.top-level-domain:port/directory/filename.suffix?parameter=value#fragment". In practice, only the components that are needed are used. Because Internet content is so diverse, extracting URLs from massive content raises two problems: first, how to extract URLs correctly; second, how to correct errors in the extracted URLs.
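The component layout described above can be illustrated with Python's standard `urllib.parse.urlsplit`. This is only a sketch for orientation; the example URL and the helper name `describe_url` are assumptions, not part of the patent.

```python
from urllib.parse import urlsplit


def describe_url(url: str) -> dict:
    """Split a URL into the components named in the syntax above."""
    parts = urlsplit(url)
    return {
        "protocol": parts.scheme,
        "username": parts.username,
        "password": parts.password,
        "host": parts.hostname,      # subdomain.domain.top-level-domain
        "port": parts.port,
        "path": parts.path,          # /directory/filename.suffix
        "query": parts.query,        # parameter=value
        "fragment": parts.fragment,
    }


# Hypothetical URL that uses every component of the syntax.
example = describe_url(
    "https://user:secret@news.example.com:8443/docs/index.html?lang=en#top"
)
```

Most real URLs carry only a few of these components, which is why an extractor cannot rely on any single marker being present.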
The traditional solution is to filter out URLs by searching for the "http://" marker and then to correct the URLs manually. This approach is time-consuming and laborious, and not practical enough.
Summary of the invention
The present invention overcomes the deficiencies of the prior art and provides a method for extracting and verifying URLs.
To solve the above technical problem, the invention adopts the following technical solution:
A method for extracting and verifying URLs, comprising the following steps:
Step 1, build a template library for extracting and verifying URLs, the template library including a protocol template library, a domain name template library, and an IP address template library;
Step 2, read content from massive content sources and convert the content into an input stream for reading, the content sources including web page content from the Internet, user behavior data collected from social tools, or log data recorded by sensors;
Step 3, according to the protocol template library, match the input content stream by protocol category and filter out the URL data that satisfies the protocol template library;
Step 4, according to the domain name template library, match the input content stream by domain name level and domain name type and filter out the URL data that satisfies the domain name template library;
Step 5, according to the IP address template library, match the input content stream by IPv4 and IPv6 and filter out the URL data that satisfies the IP address template library;
Step 6, according to the matching results of steps 3 to 5, store the matched URL data by category;
Step 7, read URL data in sequence from the classified storage;
Step 8, according to the protocol template library, check the URL data that was read precisely against the protocol rules and protocol features;
Step 9, according to the precise check result of step 8, determine whether the URL data is correct; if the URL data is correct, return to step 7 and continue reading the next stored URL data; if the URL data is incorrect, go to step 10;
Step 10, complete the URL data;
Step 11, check the completed URL data again to see whether it conforms to the definition; if the URL data conforms to the definition, write the completed URL data to the classified storage, return to step 7, and continue reading the next stored URL data; if the URL data does not conform to the definition, the completed URL data is invalid, and the method goes to step 12;
Step 12, delete the URL data.
In a further technical solution, the protocol template library includes an HTTP protocol template, an HTTPS protocol template, an SMTP protocol template, an FTP protocol template, a UDP protocol template, a Telnet protocol template, or an NFS protocol template.
In a further technical solution, the protocol template library includes protocol rules and protocol features.
In a further technical solution, the domain name template library is a set of domain name templates, designed according to the domain name rules and built by domain name level and domain name type.
In a further technical solution, the IP address template library is a set of templates designed according to IPv4 and IPv6.
In a further technical solution, the matching in steps 3 to 5 uses approximate matching and fuzzy matching.
In a further technical solution, reading in sequence in step 7 means reading the URL data one by one, by category or by record number.
Compared with the prior art, the beneficial effects of the present invention are as follows:
The method provided by the invention is a basic method for big data analysis in certain business scenarios and has strong practical value.
Brief description of the drawings
Fig. 1 is a flow chart of the method for extracting and verifying URLs according to one embodiment of the present invention.
Fig. 2 is a flow chart of the method for extracting and verifying URLs according to another embodiment of the present invention.
Detailed description
The present invention is further elaborated with reference to the accompanying drawing.
The method for extracting and verifying URLs shown in Fig. 1 and Fig. 2 comprises the following steps:
Step 1, build a template library for extracting and verifying URLs. The template library includes a protocol template library, a domain name template library, and an IP address template library.
The protocol template library includes an HTTP protocol template, an HTTPS protocol template, an SMTP protocol template, an FTP protocol template, a UDP protocol template, a Telnet protocol template, or an NFS protocol template.
The protocol template library includes protocol rules and protocol features.
The domain name template library is a set of domain name templates, designed according to the domain name rules and built by domain name level and domain name type.
The IP address template library is a set of templates designed according to IPv4 and IPv6.
Templates can also be added to the library dynamically as needed, so that more protocols and more domain name types can be supported.
Type.
Step 2, read content from massive content sources and convert the content into an input stream for reading. The content sources include web page content from the Internet, user behavior data collected from social tools, or log data recorded by sensors.
Step 3, according to the protocol template library, match the input content stream by protocol category and filter out the URL data that satisfies the protocol template library. The matching uses approximate matching and fuzzy matching.
Step 4, according to the domain name template library, match the input content stream by domain name level and domain name type and filter out the URL data that satisfies the domain name template library. The matching uses approximate matching and fuzzy matching.
Step 5, according to the IP address template library, match the input content stream by IPv4 and IPv6 and filter out the URL data that satisfies the IP address template library. The matching uses approximate matching and fuzzy matching.
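Steps 2 to 5 can be sketched as a single pass over a line-oriented input stream that tries the protocol, domain name, and IP address templates in turn. The patent does not define its approximate or fuzzy matching, so plain regular expressions stand in for it here; the patterns, the function name `extract`, and the sample text are assumptions, and IPv6 is left out for brevity.

```python
import io
import re

# Assumed templates, tried in the order of steps 3, 4, and 5.
TEMPLATES = (
    ("protocol", re.compile(r"\b(?:https?|ftp|telnet|nfs)://[^\s<>]+", re.I)),
    ("domain", re.compile(
        r"\b(?:[a-z0-9-]+\.)+(?:com|net|org|cn)\b(?:/[^\s<>]*)?", re.I)),
    ("ip", re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b(?::\d+)?(?:/[^\s<>]*)?")),
)


def extract(stream):
    """Yield (category, text) pairs for every template match in the stream."""
    for line in stream:
        covered = []  # spans already claimed by an earlier template
        for category, pattern in TEMPLATES:
            for match in pattern.finditer(line):
                start, end = match.span()
                if any(start < e and s < end for s, e in covered):
                    continue  # part of a URL that was already extracted
                covered.append((start, end))
                yield category, match.group(0)


# Hypothetical content source, wrapped as an input stream (step 2).
sample = io.StringIO(
    "Visit http://news.qq.com/ or news.example.com today.\n"
    "Router at 192.168.1.1:8080/admin is up.\n"
)
matches = list(extract(sample))
```

Tracking the spans that are already covered keeps the domain name template from reporting `news.qq.com/` a second time after the protocol template has claimed the full URL.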
Step 6, according to the matching results of steps 3 to 5, store the matched URL data by category. The storage can be a relational database, a file system, or a NoSQL database.
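For the relational option in step 6, a minimal sketch with Python's built-in `sqlite3` module might look as follows. The table name `url_records`, its columns, and the helper functions are assumptions chosen for illustration; the patent does not prescribe a schema.

```python
import sqlite3


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the classified URL store."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS url_records ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " category TEXT NOT NULL,"
        " url TEXT NOT NULL,"
        " UNIQUE (category, url))"
    )
    return conn


def store_matches(conn: sqlite3.Connection, matches) -> None:
    """Insert (category, url) pairs, silently skipping exact duplicates."""
    with conn:  # commits on success, rolls back on error
        conn.executemany(
            "INSERT OR IGNORE INTO url_records (category, url) VALUES (?, ?)",
            matches,
        )


def read_by_category(conn: sqlite3.Connection, category: str) -> list:
    """Return the stored URLs of one category in insertion order."""
    rows = conn.execute(
        "SELECT url FROM url_records WHERE category = ? ORDER BY id",
        (category,),
    )
    return [row[0] for row in rows]
```

Keeping the category as a column lets step 7 read the records either by category or by record number, whichever order is convenient.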
Step 7, read URL data in sequence from the classified storage. Reading in sequence means reading the URL data one by one, by category or by record number.
Step 8, according to the protocol template library, check the URL data that was read precisely against the protocol rules and protocol features.
Step 9, according to the precise check result of step 8, determine whether the URL data is correct. If the URL data is correct, return to step 7 and continue reading the next stored URL data. If the URL data is incorrect, go to step 10.
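The precise check of steps 8 and 9 is not spelled out in the patent. One plausible reading, sketched below, accepts a record only when it has a known scheme, a well-formed host, and a port in the valid range. The set `ALLOWED_SCHEMES`, the host pattern, and the function name `is_correct` are assumptions; the host pattern is deliberately simple and rejects bare names such as `localhost` as well as IPv6 literals.

```python
import re
from urllib.parse import urlsplit

# Assumed protocol rules: the schemes this sketch treats as valid.
ALLOWED_SCHEMES = {"http", "https", "ftp", "telnet", "nfs"}
# Assumed host rule: dot-separated labels (covers domain names and IPv4).
HOST_RE = re.compile(r"(?:[a-z0-9-]+\.)+[a-z0-9-]+")


def is_correct(url: str) -> bool:
    """Return True when the URL passes the assumed protocol checks."""
    try:
        parts = urlsplit(url)
        parts.port  # accessing .port raises ValueError for an invalid port
    except ValueError:
        return False
    if parts.scheme not in ALLOWED_SCHEMES:
        return False
    return HOST_RE.fullmatch(parts.hostname or "") is not None
```

A record such as `news.qq.com` fails this check because it has no scheme, which is exactly the case that the completion step is meant to handle.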
Step 10, complete the URL data. It is common for part of a URL to be omitted; for example, "news.qq.com" usually omits the "http://" prefix and should be completed to "http://news.qq.com/".
Step 11, check the completed URL data again to see whether it conforms to the definition. If the URL data conforms to the definition, write the completed URL data to the classified storage, return to step 7, and continue reading the next stored URL data. If the URL data does not conform to the definition, the completed URL data is invalid, and the method goes to step 12.
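Steps 10 and 11 can be sketched as a completion function followed by a second check. The completion rule below (prepend a default scheme, add a trailing slash to a bare host) follows the "news.qq.com" example; the default scheme, the conformance rule, and the function names are assumptions rather than the patent's definition.

```python
import re
from typing import Optional
from urllib.parse import urlsplit

HOST_RE = re.compile(r"(?:[a-z0-9-]+\.)+[a-z0-9-]+")  # assumed host rule


def complete_url(raw: str, default_scheme: str = "http") -> str:
    """Step 10: fill in the parts that are commonly omitted."""
    candidate = raw.strip()
    if "://" not in candidate:
        candidate = f"{default_scheme}://{candidate}"
    try:
        parts = urlsplit(candidate)
    except ValueError:
        return candidate  # leave unparseable input for the re-check to reject
    if parts.netloc and parts.path == "":
        candidate = parts._replace(path="/").geturl()
    return candidate


def conforms(url: str) -> bool:
    """Step 11: assumed definition check (known scheme, well-formed host)."""
    try:
        parts = urlsplit(url)
    except ValueError:
        return False
    return (parts.scheme in {"http", "https"}
            and HOST_RE.fullmatch(parts.hostname or "") is not None)


def complete_and_recheck(raw: str) -> Optional[str]:
    """Return the completed URL, or None when it should be deleted (step 12)."""
    completed = complete_url(raw)
    return completed if conforms(completed) else None
```

Checking again after completion matters because prepending a scheme makes almost any string look like a URL; the second check is what filters out text that was never a URL to begin with.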
Step 12, delete the URL data. Specifically, delete the record from the stored URL records, then return to step 7 and continue reading the next stored URL data, until all URL records have been processed.
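The control flow of steps 7 to 12 reduces to a loop that keeps correct records, tries to repair incorrect ones, and drops those that cannot be repaired. The sketch below takes the check and completion routines as parameters so that it shows only the flow; `verify_records`, `toy_check`, and `toy_complete` are hypothetical names, and the two toy functions are placeholders rather than the patent's rules.

```python
from typing import Callable, Dict, List


def verify_records(
    records: Dict[str, List[str]],
    check: Callable[[str], bool],
    complete: Callable[[str], str],
) -> Dict[str, List[str]]:
    """Return a cleaned copy of the classified URL records."""
    cleaned: Dict[str, List[str]] = {}
    for category, urls in records.items():   # step 7: read by category
        kept: List[str] = []
        for url in urls:
            if check(url):                   # steps 8-9: record is correct
                kept.append(url)
                continue
            candidate = complete(url)        # step 10: completion
            if check(candidate):             # step 11: check again
                kept.append(candidate)       # write the completed record back
            # step 12: otherwise the record is dropped
        cleaned[category] = kept
    return cleaned


# Toy stand-ins, used only to exercise the loop.
def toy_check(url: str) -> bool:
    return url.startswith("http://") and " " not in url


def toy_complete(url: str) -> str:
    return f"http://{url}/"


result = verify_records(
    {"protocol": ["http://news.qq.com/"], "domain": ["news.qq.com", "not a url"]},
    toy_check,
    toy_complete,
)
```

Building a cleaned copy instead of deleting in place avoids mutating a collection while iterating over it; a database-backed store would issue an update or a delete per record instead.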
The above specific embodiments describe the essence of the present invention in detail but do not limit its scope of protection. Under the inspiration of the present invention, those of ordinary skill in the art can obviously make many improvements and modifications, and it should be noted that all such improvements and modifications fall within the scope of the claims of the present invention.
Claims (7)
1. A method for extracting and verifying URLs, characterized in that it comprises the following steps:
Step 1, build a template library for extracting and verifying URLs, the template library including a protocol template library, a domain name template library, and an IP address template library;
Step 2, read content from massive content sources and convert the content into an input stream for reading, the content sources including web page content from the Internet, user behavior data collected from social tools, or log data recorded by sensors;
Step 3, according to the protocol template library, match the input content stream by protocol category and filter out the URL data that satisfies the protocol template library;
Step 4, according to the domain name template library, match the input content stream by domain name level and domain name type and filter out the URL data that satisfies the domain name template library;
Step 5, according to the IP address template library, match the input content stream by IPv4 and IPv6 and filter out the URL data that satisfies the IP address template library;
Step 6, according to the matching results of steps 3 to 5, store the matched URL data by category;
Step 7, read URL data in sequence from the classified storage;
Step 8, according to the protocol template library, check the URL data that was read precisely against the protocol rules and protocol features;
Step 9, according to the precise check result of step 8, determine whether the URL data is correct; if the URL data is correct, return to step 7 and continue reading the next stored URL data; if the URL data is incorrect, go to step 10;
Step 10, complete the URL data;
Step 11, check the completed URL data again to see whether it conforms to the definition; if the URL data conforms to the definition, write the completed URL data to the classified storage, return to step 7, and continue reading the next stored URL data; if the URL data does not conform to the definition, the completed URL data is invalid, and the method goes to step 12;
Step 12, delete the URL data.
2. The method for extracting and verifying URLs according to claim 1, characterized in that the protocol template library includes an HTTP protocol template, an HTTPS protocol template, an SMTP protocol template, an FTP protocol template, a UDP protocol template, a Telnet protocol template, or an NFS protocol template.
3. The method for extracting and verifying URLs according to claim 1, characterized in that the protocol template library includes protocol rules and protocol features.
4. The method for extracting and verifying URLs according to claim 1, characterized in that the domain name template library is a set of domain name templates, designed according to the domain name rules and built by domain name level and domain name type.
5. The method for extracting and verifying URLs according to claim 1, characterized in that the IP address template library is a set of templates designed according to IPv4 and IPv6.
6. The method for extracting and verifying URLs according to claim 1, characterized in that the matching in steps 3 to 5 uses approximate matching and fuzzy matching together.
7. The method for extracting and verifying URLs according to claim 1, characterized in that reading in sequence in step 7 means reading the URL data one by one, by category or by record number.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611042612.2A CN106453689B (en) | 2016-11-11 | 2016-11-11 | Method for extracting and verifying URLs |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611042612.2A CN106453689B (en) | 2016-11-11 | 2016-11-11 | Method for extracting and verifying URLs |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106453689A CN106453689A (en) | 2017-02-22 |
CN106453689B (en) | 2019-05-24 |
Family
ID=58218073
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611042612.2A (Active, CN106453689B (en)) | Method for extracting and verifying URLs | 2016-11-11 | 2016-11-11 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106453689B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105187439A (en) * | 2015-09-25 | 2015-12-23 | 北京奇虎科技有限公司 | Phishing website detection method and device |
CN108363740B (en) * | 2018-01-22 | 2020-09-04 | 中国平安人寿保险股份有限公司 | IP address analysis method and device, storage medium and terminal |
CN111241082B (en) * | 2020-01-13 | 2020-10-23 | 贝壳找房(北京)科技有限公司 | Data correction method and device |
CN111931113B (en) * | 2020-09-16 | 2021-01-05 | 深圳壹账通智能科技有限公司 | Data cleaning method and related equipment |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101695164A (en) * | 2009-09-28 | 2010-04-14 | 华为技术有限公司 | Verification method, device and system for controlling resource access |
CN103514189A (en) * | 2012-06-25 | 2014-01-15 | 上海博腾信息科技有限公司 | Implementing method for web crawler based on search engines |
CN104052737A (en) * | 2014-05-19 | 2014-09-17 | 北京网康科技有限公司 | Network data message processing method and device |
CN104462257A (en) * | 2014-11-21 | 2015-03-25 | 百度在线网络技术(北京)有限公司 | Method and device for verifying information of intermediate pages |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150081419A1 (en) * | 2013-09-19 | 2015-03-19 | Oracle International Corporation | Method and system for implementing dynamic link tracking |
Also Published As
Publication number | Publication date |
---|---|
CN106453689A (en) | 2017-02-22 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||