CN112149008B

CN112149008B - Method for calculating document version set

Info

Publication number: CN112149008B
Application number: CN202010986308.3A
Authority: CN
Inventors: 曾祥宇; 王君
Original assignee: Sichuan Technology and Business University
Current assignee: Sichuan Technology and Business University
Priority date: 2020-09-18
Filing date: 2020-09-18
Publication date: 2022-09-23
Anticipated expiration: 2040-09-18
Also published as: CN112149008A

Abstract

The invention discloses a method for calculating a document version set, belonging to the cross field of computing and big data applications. The method comprises the following steps: specifying a URL and downloading its snapshots, taking each snapshot's timestamp as the file name Fn and storing the snapshot content as the file content Content; clearing the HTML tags and the Wayback Machine-specific tags, and saving the modified content; calculating the MD5 value of Content and replacing Content with the MD5 value, a tab character and Fn; uploading all documents to the HDFS file system of a Hadoop cluster; in the Map stage, splitting Content so that the key is the MD5 value and the value is Fn, and emitting the key-value pair; in the Reduce stage, accumulating the count for each identical key and appending each value Fn to a container; and, for each key, outputting the key, the count and the container.

Description

Method for calculating document version set

Technical Field

The document version set calculation method is a document version management method based on data captured by the internet Wayback Machine, and belongs to the cross field of computing and big data applications.

Background

A URL (Uniform Resource Locator) published on the internet for a product's documentation usually points to the latest version of that document. In general, a user can view on the Wayback Machine all the documents it has stored for a given URL; each is stored under a point in time, namely the time at which the Wayback Machine crawler captured it.

If a product has gone through many versions over the last decade, a user of any version other than the latest cannot obtain the matching documentation from the published documentation URL, and cannot accurately locate the documentation for a particular version on the Wayback Machine either.

The MD5 Message-Digest Algorithm is a cryptographic hash function that generates a 128-bit hash value and is used to check the integrity of transmitted messages. Performing the MD5 calculation over the entire binary content of a file yields the MD5 value of the file; if even a single byte is modified, the MD5 value changes. Many language libraries support MD5 calculation; for example, PHP's md5_file(filename) function calculates the MD5 value of a file.
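
The sensitivity of MD5 to a single-byte change can be sketched as follows. This is an illustration in Python (using the standard hashlib module) rather than PHP; the helper name md5_of_bytes and the sample strings are assumptions made for the example and do not come from the patent.

```python
import hashlib


def md5_of_bytes(data: bytes) -> str:
    """Return the MD5 digest of `data` as 32 hex characters (128 bits)."""
    return hashlib.md5(data).hexdigest()


# Two hypothetical document bodies that differ in exactly one byte.
original = b"Nutch tutorial, version A"
modified = b"Nutch tutorial, version B"

digest_original = md5_of_bytes(original)
digest_modified = md5_of_bytes(modified)

print(digest_original)
print(digest_modified)
print(digest_original != digest_modified)
```

To hash a file rather than an in-memory string, the same function can be applied to the bytes read from the file opened in binary mode.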

Hadoop is an open-source distributed parallel programming framework developed by the Apache Software Foundation that runs on large computer clusters. It began as a sub-project of the full-text search engine Lucene, was originally designed to handle the massive indexes captured by Lucene, covering both storage and computation, and later became an independent distributed infrastructure framework. It mainly comprises the file system HDFS (Hadoop Distributed File System) and the computing model MapReduce. MapReduce lets developers concentrate on writing their own processing logic without attending to the implementation details of the distributed computing framework. The core of a MapReduce program is divided into two parts, Map and Reduce. When a computing job is received, it is first divided into several Map tasks, which are distributed to different nodes for execution; each Map task processes part of the input data and generally stores its result as key-value pairs. The intermediate files produced when the Map tasks finish serve as the input of the Reduce tasks, and Reduce outputs the final result after further merging the key-value pairs. HDFS is a distributed file storage and management system, generally built on top of the local file system of the operating system and shared by the nodes of the cluster network. On HDFS a large file is divided into multiple data blocks for distributed storage, and its efficient access pattern is write once, read many times.
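
The Map, group-by-key and Reduce flow described above can be illustrated with a small single-process simulation. This is only a sketch of the programming model, not Hadoop itself: run_mapreduce, word_mapper and sum_reducer are hypothetical names, and the word-count job is a generic example chosen for illustration.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Run map, group-by-key, and reduce sequentially in one process."""
    grouped = defaultdict(list)
    for record in records:
        # Map: each input record yields zero or more (key, value) pairs.
        for key, value in map_fn(record):
            # Grouping by key stands in for the shuffle between Map and Reduce.
            grouped[key].append(value)
    # Reduce: merge all values that share a key into one result per key.
    return {key: reduce_fn(key, values) for key, values in grouped.items()}


def word_mapper(line):
    """Emit (word, 1) for every word in one line of input."""
    return [(word, 1) for word in line.split()]


def sum_reducer(key, values):
    """Add up the values collected for one key."""
    return sum(values)


counts = run_mapreduce(["map reduce map", "reduce hdfs"], word_mapper, sum_reducer)
print(counts)
```

In a real cluster the Map tasks run on different nodes and the grouping is done by the framework; only the two user-supplied functions correspond to code a developer would write.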

HTML (HyperText Markup Language) is a markup language consisting of a series of tags; the tags unify the format of documents on the network so that scattered internet resources are connected into a logical whole. HTML is typically read by a browser, which presents the content to the user as its tags require. Each tag starts with a less-than sign and ends with a greater-than sign; tags are interpreted by the browser and are not normally shown in the content the user sees.

The invention mainly cleans the HTML tags and all script code between <script> and </script>. Such code is generally not meant to be read and serves only for logic; if it contains a timestamp it affects the calculation of the MD5 value, so it is deleted. This yields document content close to what the browser displays. The MD5 value of that content is then calculated, and finally the documents with the same MD5 value are grouped to form the version set of the document.
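
A minimal sketch of this cleaning, assuming regular expressions are an acceptable way to delete script blocks and tags. The patent does not say how the cleaning is implemented; the names SCRIPT_BLOCK, ANY_TAG and clean_html and the sample pages are illustrative. Tag closure is not checked and broken tags are not handled.

```python
import re

# Everything from "<script" up to and including the next "</script>".
SCRIPT_BLOCK = re.compile(r"<script\b.*?</script\s*>", re.IGNORECASE | re.DOTALL)
# Anything enclosed by "<" and ">", with no check that tags are balanced.
ANY_TAG = re.compile(r"<[^>]*>")


def clean_html(content: str) -> str:
    """Delete script blocks first, then every remaining tag."""
    without_scripts = SCRIPT_BLOCK.sub("", content)
    return ANY_TAG.sub("", without_scripts)


# Two hypothetical captures that differ only in a timestamp inside a script.
page_a = '<html><body><script>var t = "20060529210949";</script><p>Install Nutch</p></body></html>'
page_b = '<html><body><script>var t = "20060613000000";</script><p>Install Nutch</p></body></html>'

print(clean_html(page_a))
print(clean_html(page_a) == clean_html(page_b))
```

Because the script blocks are removed before hashing, the two captures reduce to the same text and would therefore receive the same MD5 value.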

Disclosure of Invention

The object of the invention is to provide a document version set calculation method that simplifies the work of finding new documents and finding duplicate documents by comparing document sets.

The technical scheme adopted by the invention is as follows: a method for calculating a document version set comprises the following steps:

1. downloading from the Wayback Machine website all snapshots of the specified product document URL for the corresponding time period, wherein for each snapshot the timestamp recorded when the Wayback Machine captured it, expressed as year, month, day, hour, minute and second, is taken as the file name variable Fn, and the content of the file, which is the content stored on the Wayback Machine, is set as the variable Content;

2. storing Content separately at this step if the original needs to be kept, because the subsequent steps modify Content substantially;

3. clearing the code between tags starting with "<script" and ending with "</script>"; clearing HTML tags enclosed by a less-than sign and a greater-than sign, such as <html>, <body> and <script>, without checking that tags are closed and without handling abnormal conditions such as partial or broken tags; and saving the modified content as the file content Content;

4. clearing the two markers specific to the Wayback Machine, namely "Wayback Rewrite JS Include" and Wayback's "DOMContentLoaded" line, deleting the entire content of these two lines, and saving the modified content as the file content Content;

5. calculating the MD5 value of Content, writing the MD5 value and Fn separated by a tab character and terminated by a carriage return as a single line, and saving that line as the file content Content;

6. uploading all processed documents to the HDFS file system of the Hadoop cluster;

7. in the Map stage, processing each document as one Map task and splitting Content on the tab character so that the key is the MD5 value and the value is Fn, this key-value pair being the result of the Map task;

8. in the Reduce stage, collecting the Map results with the same key in the same Reduce; for each key, every time a value is collected, incrementing the counter count by 1 and appending the value Fn to a string container, with entries separated by a space; the keys are the elements of the set, and the sum of the numbers of Fn values over all keys equals the number of documents uploaded for the task;

9. for each key, organizing the output as key, count and container, likewise separated by spaces, with the container terminated by a carriage return;

10. retrieving the output result of the task from HDFS.
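
Step 5 above, which turns each cleaned document into a one-line record, can be sketched as follows. The function name make_record is hypothetical, and two details are assumptions not fixed by the text: the content is encoded as UTF-8 before hashing, and the line terminator is written as "\n".

```python
import hashlib


def make_record(fn: str, cleaned_content: str) -> str:
    """Build the one-line record 'MD5<TAB>Fn' that replaces the file content."""
    # Assumption: the cleaned text is hashed as UTF-8 bytes.
    digest = hashlib.md5(cleaned_content.encode("utf-8")).hexdigest()
    # Assumption: "\n" is used as the line terminator.
    return f"{digest}\t{fn}\n"


record = make_record("20060529210949", "Install Nutch")
print(record, end="")
```

After this step every file uploaded to HDFS contains a single short line, so the Map stage only has to split that line on the tab character.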

The working principle of the invention is as follows. The first problem to be solved is how many different versions of the document exist in total, and at which times each version appeared. If the MD5 values of all documents are placed in a mathematical set, each element represents one MD5 value, that is, one version; the invention calculates how many elements the set has. One version of the document can have multiple timestamps captured by the Wayback Machine.

The second problem is that the Wayback Machine modifies the page each time it captures it, adding annotations at specific positions that record the timestamp and information such as the server node that handled the capture. As a result, even when the pages render identically in a browser, the document content stored each time differs. Cleaning the HTML tags and the Wayback Machine annotations such as timestamps recovers the original document content as far as possible; after cleaning, the document content essentially matches what the browser displays after interpreting the HTML.

The third problem is that the resulting document data may be massive. The invention stores the data in Hadoop's HDFS and then computes the document version set with a program based on the MapReduce framework.
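
The set view of the first problem can be sketched on a handful of records. The pairs below are hypothetical: the short keys stand in for 32-character MD5 values, and only the first timestamp is taken from the example in this document.

```python
# Hypothetical (MD5 value, Fn) pairs, one per downloaded snapshot.
records = [
    ("md5_a", "20060529210949"),
    ("md5_a", "20060613000000"),
    ("md5_b", "20061121000000"),
]

# Each distinct MD5 value is one element of the set, that is, one version.
version_set = {md5_value for md5_value, _ in records}

# One version can carry several capture timestamps.
timestamps_by_version = {
    version: [fn for md5_value, fn in records if md5_value == version]
    for version in version_set
}

print(len(version_set))
# Every snapshot belongs to exactly one version, so the counts add up to the input size.
print(sum(len(fns) for fns in timestamps_by_version.values()) == len(records))
```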
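
Removing the lines injected by the Wayback Machine can be sketched as a line filter. The two marker strings follow the markers named in step 4, but their exact spelling here is an assumption, as are the function name drop_wayback_lines and the made-up snapshot fragment.

```python
# Assumed spellings of the two markers named in step 4.
WAYBACK_MARKERS = ("Wayback Rewrite JS Include", "DOMContentLoaded")


def drop_wayback_lines(content: str) -> str:
    """Delete every line that contains one of the marker strings."""
    kept = [
        line
        for line in content.splitlines()
        if not any(marker in line for marker in WAYBACK_MARKERS)
    ]
    return "\n".join(kept)


# Hypothetical snapshot fragment; the first and last lines are invented for illustration.
snapshot = "\n".join(
    [
        "<!-- End Wayback Rewrite JS Include -->",
        "Install Nutch",
        "document.addEventListener('DOMContentLoaded', wayback_init);",
    ]
)

print(drop_wayback_lines(snapshot))
```

Whole lines are deleted rather than only the matching substring, which follows the description that all contents of the two lines are removed.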

In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:

The invention can be used to determine the time boundaries of page modifications, and in services such as paper plagiarism checking and patent novelty searching. When the documentation for a particular version of a product is needed, first calculate how many different versions of the product's documentation have appeared in total, then search for the documentation captured near the release time of that version, and carry out the version set statistics.

Drawings

The invention will now be described, by way of example, with reference to the accompanying drawings, in which:

FIG. 1 is a flow chart of the operation of the present invention;

FIG. 2 is a Gantt chart of the durations of the Nutch document versions according to Example 1 of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

As shown in FIG. 1, a method for calculating a document version set includes the following steps:

S1, specifying a product document URL, downloading all snapshots of the corresponding time period, and for each snapshot taking the timestamp as the file name variable Fn and storing the snapshot content as the file content Content;

S2, determining whether the original Content needs to be stored separately, since the subsequent steps modify it substantially; if so, storing the document content independently, and if not, performing the next step;

S3, clearing the HTML tags and the Wayback Machine-specific tags, and saving the modified content as the file content Content;

S4, calculating the MD5 value of Content, writing the MD5 value and Fn separated by a tab character and terminated by a carriage return as a single line, and saving that line as the file content Content;

S5, uploading all documents processed in the preceding steps to the HDFS file system of the Hadoop cluster;

S6, in the Map stage, processing each document as one Map task, splitting Content on the tab character so that the key is the MD5 value and the value is Fn, and emitting the key-value pair;

S7, in the Reduce stage, collecting the Map results with the same key in the same Reduce; for each key, every time a value is collected, incrementing the counter count by 1 and appending the value Fn to a string container, with entries separated by a space;

S8, for each key, organizing the output as key, count and container, likewise separated by spaces, with the container terminated by a carriage return;

S9, retrieving the output result of the task from HDFS.
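
The Map and Reduce logic of the steps above can be sketched as two plain functions plus a small local driver. This is an illustration only and does not use the Hadoop API: map_line and reduce_group are hypothetical names, sorting and grouping in memory stand in for the framework's shuffle, "\n" is assumed as the line terminator, and the keys and all timestamps except the first are made up.

```python
from itertools import groupby
from operator import itemgetter


def map_line(line: str):
    """Map: split one 'MD5<TAB>Fn' line into a (key, value) pair."""
    md5_value, fn = line.rstrip("\r\n").split("\t", 1)
    return md5_value, fn


def reduce_group(key: str, values) -> str:
    """Reduce: count the values of one key and join them with spaces."""
    count = 0
    container = ""
    for fn in values:
        count += 1
        container = fn if not container else container + " " + fn
    return f"{key} {count} {container}\n"


# Hypothetical one-line documents as they would sit on HDFS.
lines = [
    "md5_a\t20060529210949\n",
    "md5_b\t20061121000000\n",
    "md5_a\t20060613000000\n",
]

# Sorting by key stands in for the shuffle that brings equal keys together.
pairs = sorted(map(map_line, lines), key=itemgetter(0))
output = [
    reduce_group(key, (fn for _, fn in group))
    for key, group in groupby(pairs, key=itemgetter(0))
]
print("".join(output), end="")
```

Each output line holds the key, the count and the space-separated container, so the number of output lines is the number of versions and the counts sum to the number of input documents.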

Example 1

Taking the official documentation of Apache Nutch as an example, this example analyzes how many versions the official Nutch document has had. The official document URL of Apache Nutch is https://wiki.apache.org/Nutch Tutorial, and this URL contains only the latest operating instructions for the Nutch software, so a user cannot obtain the instructions for a particular version from it. In step 1 the URL is queried on the Wayback Machine, which returns all recorded documents, including duplicates. The start time is the Wayback Machine's first record of the URL, 2006-05-29 21:09:49, and the end time is 2012-02-16; in total 172 documents can be downloaded, each named by its capture time as year, month, day, hour, minute and second. Steps 3 to 5 are then executed to clean the documents and calculate the MD5 values, which solves the second problem. To solve the third problem, that of massive data, steps 6 to 10 are executed: the documents are uploaded to Hadoop and the calculation is carried out under the MapReduce framework. The highest repetition count is 19, reached by two versions, and 38 are presented to the user. Partial results are shown in Table 1:

Drawing the result data as a Gantt chart makes the change of the document versions along the timeline clearly visible. For ease of display, only the 2006-2008 data are plotted, as shown in FIG. 2. The left side of the figure is a table in which each row is a version: the first column is the MD5 value of the version, the second column, Points, is the number of times the crawler captured it, and Days is the number of days between the start date and the end date of the version; for convenience of calculation, if Days is 0 it is set to 1. The start date and end date are the minimum and maximum of the dates in a version's date set, and the Gantt chart is drawn on the right side from them. For the 19 documents whose MD5 value is 37d0b4942f074bf1a7289a16ba24d1b6, the official document was not modified between May and November 2006, so these documents are the same version. The next modification occurred on 21 November 2006, which corresponds to the product released in that period, or to a change in commodity price or other required information. Because document modification is a continuous activity, the invention readily finds the boundaries of the time range in which a document was modified and readily finds the document corresponding to a software version, and thus solves the stated problem.
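
The Points and Days columns described above can be sketched as follows. The function name version_span is hypothetical, and two details are assumptions: timestamps are 14-digit strings in year, month, day, hour, minute, second order, and Days is counted between calendar dates rather than exact times. The second timestamp in the usage example is made up.

```python
from datetime import datetime


def version_span(timestamps):
    """Return (start, end, points, days) for the timestamps of one version."""
    dates = [datetime.strptime(ts, "%Y%m%d%H%M%S") for ts in timestamps]
    start, end = min(dates), max(dates)  # start and end of the Gantt bar
    days = (end.date() - start.date()).days
    if days == 0:
        days = 1  # as in the text, a zero-day span is set to one day
    return start, end, len(dates), days


start, end, points, days = version_span(["20060613120000", "20060529210949"])
print(start.date(), end.date(), points, days)
```

A version captured only once has the same start and end date, so the Days rule above gives it a bar one day long instead of an invisible one.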

Example 2

Taking the iPhone page of Apple's official Chinese website as an example, the URL is https://www.apple.com/cn/iPhone; partial results calculated using the steps of the present invention are shown in the following table.

For the document version 0370eba2c84c144a2eba7c5766bf8030, 3 points were collected in total, covering the period 2020-07-23 to 2020-08-06, and the next version of the document starts on 2020-08-22. Comparing the two versions shows that the modification is the addition of a field trip extracurricular activity, and that this is the only place modified.
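
The patent does not say how the two versions are compared. One possible approach, sketched here with Python's standard difflib module, is a line-by-line diff of the cleaned texts; the two line lists below are invented placeholders, not the actual page content.

```python
import difflib

# Hypothetical cleaned text of two consecutive versions, one line per entry.
old_version = ["Product overview", "Trade-in offer"]
new_version = ["Product overview", "Trade-in offer", "Field trip activity"]

# ndiff prefixes added lines with "+ " and removed lines with "- ".
diff = list(difflib.ndiff(old_version, new_version))
added = [line[2:] for line in diff if line.startswith("+ ")]
removed = [line[2:] for line in diff if line.startswith("- ")]

print(added)
print(removed)
```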

The above description is only a preferred embodiment of the present invention, and not intended to limit the present invention, the scope of the present invention is defined by the appended claims, and all equivalent structural changes made by using the contents of the specification and the drawings of the present invention should be covered by the scope of the present invention.

Claims

1. A method for calculating a document version set is characterized by comprising the following steps:

S4, calculating the MD5 value of Content, separating the MD5 value from Fn by a tab character, ending with a carriage return, and saving the result as a single line as the file content Content;

S7, in the Reduce stage, collecting the Map results with the same key in the same Reduce; for each key, every time a value is collected, incrementing the counter count by 1 and appending the value Fn to a string container, with entries separated by a space; the keys are the elements of the set, and the sum of the numbers of Fn values over all keys equals the number of documents uploaded for the task;

S8, for each key, organizing the output as key, count and container, likewise separated by spaces, with the container terminated by a carriage return;

and S9, retrieving the output result of the task from the HDFS.

2. The method for calculating the document version set according to claim 1, wherein step S3 removes HTML tags enclosed by a less-than sign and a greater-than sign, and all script code between "<script" and "</script>".

3. The method for calculating a document version set according to claim 1, wherein the Wayback Machine-specific tags in step S3 are the two markers "Wayback Rewrite JS Include" and Wayback's "DOMContentLoaded" line; the entire content of these two lines is deleted, and the modified content is saved as the file content Content.

4. The method according to claim 1, wherein the output result of step S9 is in the form of a table giving the MD5 value of each document version, the count of all documents having that value, and their timestamps.

5. The method of claim 4, wherein the timestamp is expressed as year, month, day, hour, minute and second.