CN112149008B - Method for calculating document version set - Google Patents

Info

Publication number
CN112149008B
Authority
CN
China
Prior art keywords
content
value
key
document
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010986308.3A
Other languages
Chinese (zh)
Other versions
CN112149008A (en)
Inventor
曾祥宇
王君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Technology and Business University
Original Assignee
Sichuan Technology and Business University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Technology and Business University
Priority to CN202010986308.3A
Publication of CN112149008A
Application granted
Publication of CN112149008B
Active legal status
Anticipated expiration legal status

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links

Abstract

The invention discloses a method for calculating a document version set, belonging to the intersection of computing and big data applications. The method comprises the following steps: downloading the snapshots of a specified URL, using each snapshot's timestamp as the file name Fn and storing the snapshot content as the file content Content; clearing the HTML tags and the Wayback Machine's special tags, and saving the modified content; calculating the MD5 value of Content and replacing Content with the MD5 value, a tab character and Fn; uploading all documents to the HDFS file system of a Hadoop cluster; in the Map stage, splitting Content so that the key is the MD5 value and the value is Fn, and emitting the key-value pair; in the Reduce stage, accumulating a count for each identical key and appending each value Fn to a container; and, for each key, outputting the key, the count and the container.

Description

Method for calculating document version set
Technical Field
A method for calculating a document version set is a document version management method based on data captured by the Internet Wayback Machine, and belongs to the intersection of computing and big data applications.
Background
A URL (Uniform Resource Locator) published on the Internet for a product's documentation usually serves only the latest version of that documentation. A user can generally view, on the Wayback Machine, all the documents it has stored for a given URL; each is stored by point in time, namely the time at which the Wayback Machine's crawler captured it.
If a product has gone through many versions over the past decade, a user of any version other than the latest cannot obtain the matching documentation from the published documentation URL, nor can the documentation for a particular version be accurately located on the Wayback Machine.
The MD5 Message-Digest Algorithm is a cryptographic hash function that generates a 128-bit hash value and is used to check the integrity of transmitted messages. Computing MD5 over the entire binary content of a file yields the file's MD5 value, and the value changes if even a single byte of the file is modified. The library functions of many languages support MD5; in PHP, for example, calling md5_file(filename) computes the MD5 value of a file.
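The sensitivity of MD5 to a single-byte change can be illustrated with a minimal Python sketch. It uses the standard hashlib module; the two byte strings are hypothetical sample data and do not come from the patent.

```python
import hashlib


def md5_hex(data: bytes) -> str:
    """Return the MD5 digest of a byte string as 32 hex characters (128 bits)."""
    return hashlib.md5(data).hexdigest()


# Hypothetical document contents that differ in exactly one byte.
original = b"Nutch tutorial, step 1: install Java."
modified = b"Nutch tutorial, step 2: install Java."

digest_a = md5_hex(original)
digest_b = md5_hex(modified)

print(digest_a)
print(digest_b)
print("same version" if digest_a == digest_b else "different versions")
```

Two snapshots are treated as the same version only when their digests are equal, which is why any residual timestamp in the content has to be removed before hashing.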
Hadoop is an open-source framework for distributed parallel programming, developed by the Apache Software Foundation, that runs on large computer clusters. It began as a sub-project of the full-text search engine Lucene, was designed at the outset to process the massive indexes captured for Lucene, covering both storage and computation, and later became an independent distributed infrastructure framework. It consists mainly of modules such as the file system HDFS (Hadoop Distributed File System) and the computing model MapReduce.
MapReduce lets developers concentrate on their own processing logic without attending to the implementation details of the distributed computing framework. A MapReduce program has two core steps, Map and Reduce. When a computing job is received, it is first divided into several Map tasks that are distributed to different nodes for execution; each Map task processes part of the input data and usually stores its results as key-value pairs. When the Map tasks finish they produce intermediate files, which serve as the input of the Reduce tasks; Reduce further merges the key-value pairs and outputs the final result.
HDFS is a distributed file storage and management system, usually built on top of the operating system's local file system and shared by the nodes of a cluster network. On HDFS a large file is divided into data blocks that are stored across the cluster, and its efficient access pattern is write once, read many times.
HTML (HyperText Markup Language) is a markup language made up of a series of tags. The tags give documents on the network a uniform format and link scattered Internet resources into a logical whole. HTML is typically read by a browser, which presents the content to the user as its tags direct. A tag starts with a less-than sign and ends with a greater-than sign; tags are interpreted by the browser and are not normally shown in the content the user sees.
The invention mainly cleans out HTML tags and all script code between <script> and </script>. Such code is generally not meant for reading and serves only for logic; if it contains a timestamp, it affects the MD5 calculation. Deleting it yields document content close to what the browser displays. The MD5 value of that content is then calculated, and finally documents with the same MD5 value are grouped to form the version set of the document.
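The cleaning described above might look like the following Python sketch. The two regular expressions are one possible reading of "clear the script code, then clear anything between a less-than sign and a greater-than sign"; the names clean_html and version_id and the two sample snapshots are hypothetical and are not taken from the patent.

```python
import hashlib
import re

# Everything from "<script" through "</script>", across line breaks.
SCRIPT_RE = re.compile(r"<script.*?</script\s*>", re.IGNORECASE | re.DOTALL)
# Anything delimited by "<" and ">", with no check that tags are balanced.
TAG_RE = re.compile(r"<[^>]*>")


def clean_html(content: str) -> str:
    """Remove script blocks first, then all remaining tags."""
    without_scripts = SCRIPT_RE.sub("", content)
    return TAG_RE.sub("", without_scripts)


def version_id(content: str) -> str:
    """MD5 of the cleaned content, used as the version identifier."""
    return hashlib.md5(clean_html(content).encode("utf-8")).hexdigest()


# Hypothetical snapshots: same visible text, different embedded timestamps.
snap_a = "<html><body><script>var t = 20060529210949;</script><p>Install Nutch.</p></body></html>"
snap_b = "<html><body><script>var t = 20061121000000;</script><p>Install Nutch.</p></body></html>"

print(clean_html(snap_a))
print(version_id(snap_a) == version_id(snap_b))
```

Scripts are removed before tags so that the script body, which is not enclosed in angle brackets, does not survive as plain text.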
Disclosure of Invention
The object of the invention is to provide a method for calculating a document version set that simplifies novelty searches and duplicate checks through the comparison of document sets.
The technical solution adopted by the invention is a method for calculating a document version set, comprising the following steps:
1. For a specified product document URL, download all snapshots for the relevant time period from the Wayback Machine website. Each snapshot uses the timestamp recorded when it was captured on the Wayback Machine as the file name variable Fn, the timestamp being expressed as year, month, day, hour, minute and second; the file content is held in the variable Content, which at this step is the content as stored on the Wayback Machine;
2. The subsequent steps modify Content substantially; if the original Content needs to be kept separately, save it at this step;
3. Clear the code between tags starting with "<script" and ending with "</script>"; clear HTML tags delimited by a less-than sign and a greater-than sign, such as <html>, <body> and <script>, without checking that tags are properly closed and without handling abnormal cases such as partial or broken tags; save the modified content as the file content Content;
4. Clear the two marker lines that the Wayback Machine inserts at run time, namely the "Wayback Rewrite JS Include" line and Wayback's "DOMContentLoaded" line, deleting the entire content of both lines; save the modified content as the file content Content;
5. Calculate the MD5 value of Content, separate the MD5 value from Fn with a tab character, end with a carriage return, and save this single line as the file content Content;
6. Upload all processed documents to the HDFS file system of a Hadoop cluster;
7. In the Map stage, each document is processed as one Map task: Content is split using the tab character as the delimiter so that the key is the MD5 value and the value is Fn, and this key-value pair is the result of the Map task;
8. In the Reduce stage, the outputs of Map tasks with the same key are collected by the same Reduce. For a given key, each time a value is collected the counter count is incremented by 1 and the value Fn is appended to a string container, with entries separated by spaces. The keys are the elements of the set, and the sum of the Fn counts over all keys equals the number of documents uploaded for the task;
9. For each key, output the key, count and container, likewise separated by spaces, with a carriage return after the container;
10. Retrieve the output of the task from HDFS.
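Steps 7 to 9 can be sketched in plain Python without a Hadoop cluster. The sketch below is an in-memory stand-in, not the patent's Hadoop program: the shuffle function imitates the grouping that the MapReduce framework performs between the two stages, and the input lines use placeholder digests ("aaaa", "bbbb") rather than real MD5 values.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def map_line(line: str) -> Tuple[str, str]:
    """Map: split one 'MD5<TAB>Fn' line into (key, value)."""
    key, value = line.rstrip("\r\n").split("\t", 1)
    return key, value


def shuffle(pairs: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Group values by key, as the framework does between Map and Reduce."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_key(key: str, values: List[str]) -> str:
    """Reduce: count the values and join the Fn timestamps with spaces."""
    count = 0
    container = ""
    for fn in values:
        count += 1
        container = fn if not container else container + " " + fn
    return f"{key} {count} {container}"


# Hypothetical step-5 files, one line each (placeholder digests).
lines = [
    "aaaa\t20060529210949\n",
    "aaaa\t20060612093000\n",
    "bbbb\t20061121101500\n",
]

groups = shuffle(map_line(line) for line in lines)
output = [reduce_key(key, groups[key]) for key in sorted(groups)]
for row in output:
    print(row)
```

The number of output rows is the number of distinct versions, and the counts across all rows add up to the number of input documents, which gives a simple consistency check on a run.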
The working principle of the invention is as follows. The first problem to be solved is to determine how many distinct versions of the document exist in total and to list every time at which each version appeared. If the MD5 values of all documents are placed in a mathematical set, each element represents one MD5 value, that is, one version; the invention calculates how many elements the set has, and one version of the document may have multiple timestamps captured by the Wayback Machine.
The second problem is that the Wayback Machine modifies the page content at every capture, adding annotations at specific positions that record the timestamp, the server node that handled the capture, and similar information. So even when the page renders identically in the browser, the stored content of each capture differs. Cleaning out the HTML tags and the Wayback Machine annotations such as timestamps recovers the original document content as far as possible, and the cleaned content essentially matches what the browser displays after interpreting the HTML.
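A minimal Python sketch of removing the capture annotations follows. It assumes that each annotation sits on its own line and can be recognised by a marker substring; the marker strings follow the names given in this document, while the sample captures, node names and function names are hypothetical.

```python
# Marker substrings, following the names used in this document (assumed spelling).
WAYBACK_MARKERS = ("Wayback Rewrite JS Include", "DOMContentLoaded")


def drop_wayback_lines(content: str) -> str:
    """Delete every line that contains one of the Wayback Machine markers."""
    kept = [
        line
        for line in content.splitlines()
        if not any(marker in line for marker in WAYBACK_MARKERS)
    ]
    return "\n".join(kept)


# Hypothetical captures: same page body, different capture annotations.
capture_1 = "\n".join([
    "<!-- End Wayback Rewrite JS Include, node A, 20060529210949 -->",
    "Install Nutch.",
    "document.addEventListener('DOMContentLoaded', showBanner20060529);",
])
capture_2 = "\n".join([
    "<!-- End Wayback Rewrite JS Include, node B, 20060612093000 -->",
    "Install Nutch.",
    "document.addEventListener('DOMContentLoaded', showBanner20060612);",
])

cleaned_1 = drop_wayback_lines(capture_1)
cleaned_2 = drop_wayback_lines(capture_2)
print(capture_1 == capture_2)
print(cleaned_1 == cleaned_2)
```

Deleting whole lines rather than only the marker text keeps the per-capture details (timestamp, node) out of the content that is later hashed.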
The third problem is that the resulting document data may be massive; the invention therefore stores the data on Hadoop's HDFS and computes the document version set with a program based on the MapReduce framework.
In summary, by adopting the above technical solution, the invention has the following beneficial effects:
The invention can be used to determine the time boundaries of page modifications and serves tasks such as thesis duplicate checking and patent novelty searches. To obtain the documentation for a particular product version, one first calculates how many distinct versions of the product's documentation have appeared in total, then, from the release date of the product version, looks up the documentation near that date in the version set statistics.
Drawings
The invention will now be described, by way of example, with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart of the operation of the present invention;
FIG. 2 is a Gantt chart of the durations of the Nutch document versions in Example 1 of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in FIG. 1, a method for calculating a document version set includes the following steps:
S1, specifying a product document URL, downloading all snapshots for the corresponding time period, and storing each snapshot with its timestamp as the file name variable Fn and its content as the file content Content;
S2, determining whether the original Content needs to be kept separately before it is substantially modified; if so, storing the document content separately, and if not, proceeding to the next step;
S3, clearing the HTML tags and the Wayback Machine's special tags, and saving the modified content as the file content Content;
S4, calculating the MD5 value of Content, separating the MD5 value from Fn with a tab character, ending with a carriage return, and saving this single line as the file content Content;
S5, uploading all documents processed in the preceding steps to the HDFS file system of the Hadoop cluster;
S6, in the Map stage, processing each document as one Map task, splitting Content with the tab character as the delimiter so that the key is the MD5 value and the value is Fn, and emitting the key-value pair;
S7, in the Reduce stage, the outputs of Map tasks with the same key are collected by the same Reduce; for a given key, each time a value is collected the counter count is incremented by 1 and the value Fn is appended to a string container, with entries separated by spaces; the keys are the elements of the set, and the sum of the Fn counts over all keys equals the number of documents uploaded for the task;
S8, for each key, outputting the key, count and container, likewise separated by spaces, with a carriage return after the container;
S9, retrieving the output of the task from HDFS.
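For step S1, the handling of the file name Fn can be sketched as follows. Two things are assumptions rather than statements from the patent: that Fn is written as 14 digits (YYYYMMDDhhmmss), and that a capture is addressed as prefix/timestamp/original URL on web.archive.org. The page URL and the capture list are hypothetical, and no network request is made.

```python
from datetime import datetime
from typing import Iterable, List

# Assumed snapshot address prefix of the Wayback Machine.
WAYBACK_PREFIX = "https://web.archive.org/web"


def parse_fn(fn: str) -> datetime:
    """Parse a 14-digit file name Fn (YYYYMMDDhhmmss) into a datetime."""
    return datetime.strptime(fn, "%Y%m%d%H%M%S")


def snapshot_url(fn: str, page_url: str) -> str:
    """Build the address of one capture: prefix/timestamp/original URL."""
    return f"{WAYBACK_PREFIX}/{fn}/{page_url}"


def select_period(fns: Iterable[str], start: datetime, end: datetime) -> List[str]:
    """Keep the captures whose timestamp lies inside [start, end], oldest first."""
    return sorted(fn for fn in fns if start <= parse_fn(fn) <= end)


page_url = "https://example.org/docs/tutorial"  # hypothetical product document URL
captures = ["20120301000000", "20060529210949", "20081103120000"]  # hypothetical
start = datetime(2006, 5, 29, 21, 9, 49)
end = datetime(2012, 2, 16, 23, 59, 59)

selected = select_period(captures, start, end)
for fn in selected:
    print(fn, snapshot_url(fn, page_url))
```

Because the digits run from year down to second, sorting the file names as strings also sorts the captures chronologically.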
Example 1
Taking the official documentation of Apache Nutch as an example, this embodiment analyzes how many versions the official Nutch document has had. The official document URL of Apache Nutch is https://wiki.apache.org/nutch/NutchTutorial, and this URL holds only the latest operating instructions for the Nutch software; a user cannot obtain the instructions for a particular version from it. In step 1 the URL is queried on the Wayback Machine, which returns all recorded documents, duplicates included. The start time is the Wayback Machine's first record of the URL, 2006-05-29 21:09:49, and the end time is 2012-02-16; 172 documents can be downloaded in total, each named by its capture time as year, month, day, hour, minute and second. Steps 3 to 5 are then executed to clean the documents and calculate their MD5 values, which solves the second problem. To address the third problem, massive data, steps 6 to 10 are executed: the files are uploaded to Hadoop and the calculation runs under the MapReduce framework. The most-repeated versions have 19 documents each; there are two such versions, 38 documents in all. Partial results are shown in Table 1:
[Table 1: partial results for Example 1; table image not reproduced]
Plotting the result data as a Gantt chart makes the change of document versions along the timeline clearly visible. For ease of display, only the 2006-2008 data are plotted, as shown in FIG. 2. The left side of the figure is a table in which each row is one version: the first column is the version's MD5 value, the second column, Points, is the number of times the crawler captured that version, and Days is the number of days between the version's start date and end date; for ease of calculation, Days is set to 1 if it is 0. The start and end dates are the minimum and maximum of the dates in a version's date set, and the Gantt chart is drawn on the right side. For the 19 documents with MD5 value 37d0b4942f074bf1a7289a16ba24d1b6, the official document was not modified between May and November 2006, so they are all the same version. The next modification occurred on 21 November 2006 and may correspond to a product released in that period, a change in commodity price, or other required information. Because document modification is a continuous activity, the invention readily finds the boundaries of the time range of each modification and the document corresponding to a given software version, which solves the stated problem.
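The Points and Days columns described above can be derived from one version's timestamp list as in the following Python sketch. The function name version_span, the 14-digit timestamp format and the sample timestamps are assumptions made for illustration; they are not the figures of Table 1.

```python
from datetime import datetime
from typing import Dict, List


def version_span(timestamps: List[str]) -> Dict[str, object]:
    """Summarise one version: capture count, start date, end date, duration."""
    dates = [datetime.strptime(ts, "%Y%m%d%H%M%S").date() for ts in timestamps]
    start, end = min(dates), max(dates)
    days = (end - start).days
    if days == 0:
        days = 1  # a version seen on a single day still occupies one day
    return {"points": len(dates), "start": start, "end": end, "days": days}


# Hypothetical capture timestamps of one version.
span = version_span(["20060815120000", "20060529210949", "20061120080000"])
single = version_span(["20061121101500"])

print(span["points"], span["start"], span["end"], span["days"])
print(single["points"], single["days"])
```

Each resulting row (start, days) is what a Gantt bar needs: the bar begins at the start date and extends for the given number of days.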
Example 2
Take the iPhone page of Apple's official Chinese website as an example; the URL is https://www.apple.com/cn/iphone. Partial results calculated using the steps of the invention are shown in the following table.
[Table: partial results for Example 2; table image not reproduced]
For the document version 0370eba2c84c144a2eba7c5766bf8030, three capture points were collected in total, covering the period 2020-07-23 to 2020-08-06; the next version of the document starts on 2020-08-22. Comparing the two versions shows that the change was the addition of a field trip extracurricular activity, and that this was the only point modified.
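Comparing two consecutive versions, as done above, can be sketched with Python's standard difflib module. The patent does not say which comparison tool was used, so this is only one possible approach; the two cleaned texts below are hypothetical stand-ins for the page content.

```python
import difflib
from typing import List


def added_lines(old: str, new: str) -> List[str]:
    """Return the lines that appear in the newer version but not in the older."""
    diff = difflib.ndiff(old.splitlines(), new.splitlines())
    return [line[2:] for line in diff if line.startswith("+ ")]


# Hypothetical cleaned contents of two consecutive versions.
old_version = "iPhone\nCompare models\nAccessories"
new_version = "iPhone\nCompare models\nField trip activity\nAccessories"

changes = added_lines(old_version, new_version)
print(changes)
```

Running the comparison on the cleaned content rather than the raw HTML keeps capture annotations from showing up as spurious differences.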
The above is only a preferred embodiment of the present invention and is not intended to limit it. The scope of the present invention is defined by the appended claims, and all equivalent structural changes made using the contents of the specification and the drawings of the present invention fall within that scope.

Claims (5)

1. A method for calculating a document version set, characterized by comprising the following steps:
S1, specifying a product document URL, downloading all snapshots for the corresponding time period, and storing each snapshot with its timestamp as the file name variable Fn and its content as the file content Content;
S2, determining whether the original Content needs to be kept separately before it is substantially modified; if so, storing the document content separately, and if not, proceeding to the next step;
S3, clearing the HTML tags and the Wayback Machine's special tags, and saving the modified content as the file content Content;
S4, calculating the MD5 value of Content, separating the MD5 value from Fn with a tab character, ending with a carriage return, and saving this single line as the file content Content;
S5, uploading all documents processed in the preceding steps to the HDFS file system of the Hadoop cluster;
S6, in the Map stage, processing each document as one Map task, splitting Content with the tab character as the delimiter so that the key is the MD5 value and the value is Fn, and emitting the key-value pair;
S7, in the Reduce stage, the outputs of Map tasks with the same key are collected by the same Reduce; for a given key, each time a value is collected the counter count is incremented by 1 and the value Fn is appended to a string container, with entries separated by spaces; the keys are the elements of the set, and the sum of the Fn counts over all keys equals the number of documents uploaded for the task;
S8, for each key, outputting the key, count and container, likewise separated by spaces, with a carriage return after the container;
S9, retrieving the output of the task from HDFS.
2. The method for calculating a document version set according to claim 1, wherein step S3 clears HTML tags delimited by a less-than sign and a greater-than sign, and all script code between "<script" and "</script>".
3. The method for calculating a document version set according to claim 1, wherein the Wayback Machine's special tags in step S3 are the two marker lines it inserts at run time, "Wayback Rewrite JS Include" and Wayback's "DOMContentLoaded" line; the entire content of both lines is deleted, and the modified content is saved as the file content Content.
4. The method according to claim 1, wherein the output of step S9 is presented as a table giving the MD5 value of each document version, the count of all documents having that value, and their timestamps.
5. The method of claim 4, wherein the timestamp is expressed as year, month, day, hour, minute and second.
CN202010986308.3A 2020-09-18 2020-09-18 Method for calculating document version set Active CN112149008B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010986308.3A CN112149008B (en) 2020-09-18 2020-09-18 Method for calculating document version set

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010986308.3A CN112149008B (en) 2020-09-18 2020-09-18 Method for calculating document version set

Publications (2)

Publication Number Publication Date
CN112149008A CN112149008A (en) 2020-12-29
CN112149008B true CN112149008B (en) 2022-09-23

Family

ID=73893234

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010986308.3A Active CN112149008B (en) 2020-09-18 2020-09-18 Method for calculating document version set

Country Status (1)

Country Link
CN (1) CN112149008B (en)

Also Published As

Publication number Publication date
CN112149008A (en) 2020-12-29

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant