CN111026942A

CN111026942A - Hot word extraction method, device, terminal and medium based on web crawler

Info

Publication number: CN111026942A
Application number: CN201911060879.8A
Authority: CN
Inventors: 崔凯; 王健宗
Original assignee: Ping An Technology Shenzhen Co Ltd
Current assignee: Ping An Technology Shenzhen Co Ltd
Priority date: 2019-11-01
Filing date: 2019-11-01
Publication date: 2020-04-17
Anticipated expiration: 2039-11-01
Also published as: CN111026942B

Abstract

The invention provides a hot word extraction method based on a web crawler, which comprises the following steps: initializing a website queue, wherein at least one URL is stored in the website queue, the URL comprises a first URL and a second URL, starting a first thread, adding a second URL which is crawled from the first URL and is not identical to the URL in the website queue to the tail of the website queue, starting a second thread, obtaining the URL and a hypertext markup language document corresponding to the URL from the head of the website queue, and executing the first thread and the second thread in parallel; extracting a text data set in a hypertext markup language document, performing word segmentation processing, and then counting the occurrence frequency of each word; and taking the vocabulary with the frequency greater than the preset frequency threshold as the hot vocabulary. The invention also provides a hot word extraction device, a terminal and a medium based on the web crawler. According to the method and the device, the URL is added at the tail part of the website queue, and the hypertext markup language document corresponding to the URL is obtained at the head part of the website queue, so that resource conflict is prevented, and the hot vocabulary extraction efficiency is improved.

Description

Hot word extraction method, device, terminal and medium based on web crawler

Technical Field

The invention relates to the technical field of internet communication, in particular to a hot word extraction method, a hot word extraction device, a hot word extraction terminal and a hot word extraction medium based on a web crawler.

Background

Scientific research topic selection means choosing, at the strategic level, the main direction of scientific research, and a selected topic needs to be novel and advanced. With the continuous development of network technology, electronic text information carried by the internet has grown explosively, and how to extract hot words that meet requirements from massive internet data for use as scientific research topics has become an important research question.

In the prior art, when keywords are collected from a sudden hot event or a topic with high user participation, words are counted manually from massive data and at least two words are extracted to form a word list; at least two words are then selected as keywords according to the weight of each word in the word list, and the keywords are clustered into topics directly according to the similarity between them. The extraction precision for the target words is low, and the processing speed is slow.

Disclosure of Invention

In view of the above, it is necessary to provide a method, an apparatus, a terminal and a medium for extracting hot words based on a web crawler, in which a hypertext markup language document corresponding to a URL is obtained at the head of a website queue while adding the URL at the tail of the website queue, so as to prevent resource conflict and improve efficiency of extracting hot words.

The first aspect of the invention provides a hot word extraction method based on a web crawler, which comprises the following steps:

initializing a website queue, wherein at least one URL is stored in the website queue, the URL comprises a first URL and a second URL which exist at present, and starting a first thread to crawl the second URL from the first URL;

judging whether the second URL is the same as the URL in the website queue;

when the second URL is determined to be different from the URL in the website queue, adding the second URL to the tail of the website queue;

starting a second thread to acquire a URL and a hypertext markup language document corresponding to the URL from the head of the website queue, wherein the first thread and the second thread are executed in parallel;

extracting a text data set in the hypertext markup language document;

performing word segmentation processing on the text data set to obtain a target vocabulary list;

counting the occurrence frequency of each vocabulary in the target vocabulary list;

and determining the vocabulary corresponding to the frequency greater than the preset frequency threshold value in the frequency as the hot vocabulary.

Preferably, after determining the vocabulary corresponding to the frequency greater than the preset frequency threshold value in the frequencies as the hot vocabulary, the method further includes:

calculating the similarity between the hot vocabulary and a pre-stored keyword related to a target object;

when the similarity is larger than a preset similarity threshold, determining that the hot vocabulary can be used as a target object;

and when the similarity is smaller than or equal to the preset similarity threshold, determining that the hot vocabulary cannot be used as a target object.

Preferably, the determining whether the second URL is the same as the URL in the website queue includes:

calculating an MD5 hash value for each second URL;

comparing each MD5 hash value with a prestored hash value one by one;

when the MD5 hash value is the same as any one of the pre-stored MD5 hash values, determining that the second URL is the same as the URL in the website queue;

when the MD5 hash value is different from any pre-stored MD5 hash value, determining that the second URL is not the same as the URL in the website queue.

Preferably, after the starting of the second thread obtains the URL and the hypertext markup language document corresponding to the URL from the head of the website queue, the method further includes:

deleting the URL with the subscript of 0 at the head of the website queue;

and simultaneously, subtracting 1 from subscripts corresponding to the residual URLs in the website queue to obtain new subscripts of the residual URLs.

Preferably, after the adding the second URL to the tail of the website queue, the method further includes:

acquiring subscripts of URLs at the tail of the website queue;

adding 1 to the subscript yields the subscript of the second URL.

Preferably, the performing word segmentation processing on the text data set and determining the target vocabulary list includes:

performing word segmentation processing on the text data set to obtain an initial vocabulary list;

matching the initial vocabulary list with a preset filtering vocabulary list;

and deleting the vocabulary in the initial vocabulary list which is the same as the vocabulary in the preset filtering vocabulary list to obtain a target vocabulary list.

Preferably, the method further comprises:

skipping the second URL and continuing crawling when the second URL is determined to be the same as the URL in the website queue.

The second aspect of the present invention provides a hot word extraction apparatus based on web crawlers, the apparatus comprising:

the device comprises an initialization module, a first processing module and a second processing module, wherein the initialization module is used for initializing a website queue, at least one URL is stored in the website queue, the URL comprises a first URL and a second URL which exist at present, and a first thread is started to crawl the second URL from the first URL;

the judging module is used for judging whether the second URL is the same as the URL in the website queue;

the adding module is used for adding the second URL to the tail part of the website queue when the judging module determines that the second URL is different from the URL in the website queue;

the starting module is used for starting a second thread to acquire a URL and a hypertext markup language document corresponding to the URL from the head of the website queue, wherein the first thread and the second thread are executed in parallel;

the extraction module is used for extracting a text data set in the hypertext markup language document;

the word segmentation module is used for carrying out word segmentation on the text data set to obtain a target word list;

the statistic module is used for counting the frequency of each vocabulary in the target vocabulary list;

and the determining module is used for determining the vocabulary corresponding to the frequency greater than the preset frequency threshold value in the frequency as the hot vocabulary.

A third aspect of the present invention provides a terminal, where the terminal includes a processor, and the processor is configured to implement the web crawler-based hot word extraction method when executing a computer program stored in a memory.

A fourth aspect of the present invention provides a computer-readable storage medium having stored thereon a computer program, which, when executed by a processor, implements the web crawler-based hot word extraction method.

In summary, according to the web crawler-based hot word extraction method, device, terminal and medium of the present invention, by initializing a website queue, where at least one URL is stored in the website queue, where the URL includes a first URL and a second URL that currently exist, a first thread is started to crawl the second URL from the first URL; judging whether the second URL is the same as the URL in the website queue; when the second URL is determined to be different from the URL in the website queue, adding the second URL to the tail of the website queue; starting a second thread to acquire a URL and a hypertext markup language document corresponding to the URL from the head of the website queue, wherein the first thread and the second thread are executed in parallel; extracting a text data set in the hypertext markup language document; performing word segmentation processing on the text data set to obtain a target vocabulary list; counting the occurrence frequency of each vocabulary in the target vocabulary list; and determining the vocabulary corresponding to the frequency greater than the preset frequency threshold value in the frequency as the hot vocabulary. On one hand, the first thread is started to crawl the first URL, the second URL which does not exist in the website queue is stored at the tail of the website queue, and the second thread is started to obtain the hypertext markup language document corresponding to the URL at the head of the website queue, so that resource conflict is prevented, the crawling efficiency is improved, the time for obtaining the hypertext markup language document is shortened, the efficiency for obtaining word frequency is accelerated, on the other hand, whether the second URL exists in the website queue or not is judged, repeated crawl can be avoided, and the crawler time is shortened.

Drawings

Fig. 1 is a flowchart of a hot word extraction method based on web crawlers according to an embodiment of the present invention.

Fig. 2 is a block diagram of a hot word extraction device based on web crawlers according to a second embodiment of the present invention.

Fig. 3 is a schematic structural diagram of a terminal according to a third embodiment of the present invention.

The following detailed description will further illustrate the invention in conjunction with the above-described figures.

Detailed Description

In the following description, numerous specific details are set forth to provide a thorough understanding of the present invention, and the described embodiments are merely a subset of the embodiments of the present invention, rather than a complete embodiment. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used in the description of the invention herein is for the purpose of describing particular embodiments only and is not intended to be limiting of the invention.

Example one

In this embodiment, the web crawler-based hot word extraction method may be applied to a terminal. For a terminal that needs to perform hot word extraction based on a web crawler, the hot word extraction function provided by the method of the present invention may be integrated directly on the terminal, or may run on the terminal in the form of a Software Development Kit (SDK).

As shown in fig. 1, the hot word extraction method based on web crawlers specifically includes the following steps, and the order of the steps in the flowchart may be changed and some may be omitted according to different requirements.

S11: initializing a website queue, wherein at least one URL is stored in the website queue, the URL comprises a first URL and a second URL which exist at present, and starting a first thread to crawl the second URL from the first URL.

In this embodiment, the URL (Uniform Resource Locator) is a compact representation of the location and access method of a resource available on the internet. It is the address of a standard resource on the internet, and each file on the internet has a unique URL. A first thread is started to crawl the currently existing first URL from the internet. The page at the first URL contains a plurality of first Href tags, the pages those tags link to contain a plurality of second Href tags, and each first Href tag and each second Href tag corresponds to a second URL.

In this embodiment, an Href (hypertext reference) is a URL specifying a hyperlink target, an Href tag indicates "a link is located below," a depth-first algorithm based on a web crawler is used to crawl a second URL corresponding to the Href tag in the first URL, a search depth range is preset, the search depth range may be set to 10-20 layers, and different search depth ranges may also be set according to different time requirements or topics.

In this embodiment, a first thread is started to control a plurality of crawler engines to crawl the first URLs on the internet at the same time. Starting from the first URL links of one or more initial web pages, the crawler acquires the second URL links on those pages. While crawling, it continuously crawls the second URL link corresponding to each first Href tag in the current web page and puts that link at the tail of the website queue, and crawling stops once the preset search depth range is reached.

For example, multiple crawler engines are started to crawl all the first URLs in the website queue at the same time. If the website queue contains three first URLs, namely first URL₁, first URL₂ and first URL₃, the crawler engines simultaneously crawl the second URLs (second URL₁ᵢ) corresponding to the first Href tags in first URL₁, the second URLs (second URL₂ᵢ) corresponding to the first Href tags in first URL₂, and the second URLs (second URL₃ᵢ) corresponding to the first Href tags in first URL₃. The crawled second URL₁ᵢ, second URL₂ᵢ and second URL₃ᵢ are added to the tail of the website queue in the order in which their crawling completes. The crawler engines then crawl second URL₁ᵢ, second URL₂ᵢ and second URL₃ᵢ, adding the second URLs corresponding to the second Href tags to the tail of the website queue while crawling, again in order of crawling completion, and so on until crawling is complete, as shown in Table I.

Table I

Initial website queue of the present invention	Initial website queue in the prior art
First URL₁	First URL₁
First URL₂	First URL₂
First URL₃	First URL₃

Website queue after crawling (present invention)	Website queue after crawling (prior art)
First URL₁	……
First URL₂	Second URL₁₁₃ in second URL₁₁
First URL₃	Second URL₁₁₂ in second URL₁₁
Second URL₁₁ in first URL₁	Second URL₁₁₁ in second URL₁₁
Second URL₂₁ in first URL₂	Second URL₁₃ in first URL₁
Second URL₃₁ in first URL₃	Second URL₁₂ in first URL₁
Second URL₁₂ in first URL₁	Second URL₁₁ in first URL₁
Second URL₂₂ in first URL₂	First URL₁
Second URL₃₂ in first URL₃	……
Second URL₁₃ in first URL₁	Second URL₂₃ in first URL₂
Second URL₂₃ in first URL₂	Second URL₂₂ in first URL₂
Second URL₃₃ in first URL₃	Second URL₂₁ in first URL₂
Second URL₁₁₁ in second URL₁₁	First URL₂
Second URL₂₁₁ in second URL₂₁	……
Second URL₃₁₁ in second URL₃₁	Second URL₃₁ in first URL₃
……	First URL₃

In the embodiment, the crawler engines are started to simultaneously crawl the second URL in the first URL, and the second URL is added to the tail of the website queue while crawling, so that crawling efficiency is improved, and time is saved. The depth-first algorithm in the crawler technology is the prior art, and the depth-first algorithm is not described in detail herein.
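The link-collection part of S11 can be sketched as follows. This is only an illustration under stated assumptions: the class HrefCollector, the function crawl_second_urls and the sample page are hypothetical names and data, the page content is passed in as a string instead of being downloaded, and the search depth limit and the multiple crawler engines are not modelled. The Python standard-library html.parser module is used to read the Href tags.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin


class HrefCollector(HTMLParser):
    """Collects the href target of every <a> tag (hypothetical helper)."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                # Resolve relative links against the URL of the page itself.
                self.links.append(urljoin(self.base_url, value))


def crawl_second_urls(first_url, html_text, url_queue):
    """Append every link found in html_text to the tail of url_queue."""
    collector = HrefCollector(first_url)
    collector.feed(html_text)
    for second_url in collector.links:
        url_queue.append(second_url)  # insertion happens at the tail only
    return collector.links


# Assumed sample data; a real crawler would download html_text itself.
url_queue = deque(["http://example.com/"])
sample_html = (
    '<a href="/a.html">A</a><p>text</p>'
    '<a href="http://example.com/b.html">B</a>'
)
found = crawl_second_urls("http://example.com/", sample_html, url_queue)
```

After the call, the queue holds the first URL followed by the two second URLs in the order in which they were found.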

S12: and judging whether the second URL is the same as the URL in the website queue.

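In this embodiment, the URLs in the website queue include the first URL and the second URLs already crawled into the queue. After a second URL corresponding to an Href tag is crawled, it is determined whether that second URL is the same as any URL in the website queue.

Preferably, the judging whether the second URL is the same as the URL in the website queue includes:

calculating an MD5 hash value for each second URL;

comparing each MD5 hash value with the pre-stored MD5 hash values one by one;

when the MD5 hash value is the same as any one of the pre-stored MD5 hash values, determining that the second URL is the same as the URL in the website queue;

when the MD5 hash value is different from every pre-stored MD5 hash value, determining that the second URL is not the same as the URL in the website queue.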

In this embodiment, MD5 (Message-Digest Algorithm 5) is a message digest algorithm that generates a 128-bit hash value from a plaintext string of any length. A hash function has the following basic property: if two hash values computed with the same function differ, the original inputs that produced them must also differ. Conversely, two inputs with different contents produce different hash values with very high probability, although collisions are possible in principle. The invention therefore uses a hash table method to judge whether the second URL exists in the website queue: a Python dictionary global variable URL_MAP is created, a 128-bit MD5 hash value (16 bytes, usually written as 32 hexadecimal characters) is calculated for the second URL, and this value is compared one by one with the MD5 hash values pre-stored in URL_MAP to determine whether the second URL exists in the website queue.

Further, when it is determined that the second URL is the same as the URL in the web site queue, the method further includes:

skipping over the second URL to continue crawling.

In this embodiment, when the calculated MD5 hash value of the second URL is the same as an MD5 hash value already stored in URL_MAP, the second URL is not crawled, and crawling skips directly to the next second URL.

Further, when the MD5 hash value is different from every pre-stored MD5 hash value, the method further includes:

adding the MD5 hash value of the second URL to URL_MAP.

In this embodiment, the MD5 hash value of the second URL is calculated and compared with the MD5 hash values in the previously created URL_MAP, so that repeated URLs are skipped during crawling; this prevents repeated crawling and shortens crawling time to a certain extent.
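A minimal sketch of the duplicate check in S12, using the standard-library hashlib module. URL_MAP is the dictionary named in the text; the helper names md5_of and add_if_new and the sample URL are assumptions made for illustration. Dictionary membership is used here in place of a literal one-by-one comparison, which gives the same answer through the dictionary's own hashing.

```python
import hashlib

# MD5 hex digest -> URL; plays the role of the URL_MAP dictionary in the text.
URL_MAP = {}


def md5_of(url):
    """Return the 128-bit MD5 digest of a URL as 32 hexadecimal characters."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()


def add_if_new(url, url_queue):
    """Append url to the tail of url_queue unless its digest is already known."""
    digest = md5_of(url)
    if digest in URL_MAP:
        return False          # duplicate: skip it and keep crawling
    URL_MAP[digest] = url     # remember the digest for later checks
    url_queue.append(url)     # new URL goes to the tail of the queue
    return True


# Assumed sample data.
url_queue = []
first = add_if_new("http://example.com/a", url_queue)
repeat = add_if_new("http://example.com/a", url_queue)
```

Here the first call adds the URL and records its digest, while the second call recognises the repeated URL and leaves the queue unchanged.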

And S13, adding the second URL to the tail part of the website queue when the second URL is determined not to be the same as the URL in the website queue.

In this embodiment, when the calculated MD5 hash value of the second URL is different from every hash value stored in URL_MAP in advance, it is determined that the second URL does not exist in the website queue, and the second URL is added to the tail of the website queue.

Preferably, after the adding the second URL to the tail of the website queue, the method further includes:

acquiring the subscript of the URL at the tail of the website queue;

adding 1 to the subscript to obtain the subscript of the second URL.

Illustratively, if the website queue holds N URLs and the URL at the head of the queue has subscript 0, the historical URL at the tail of the queue has subscript N − 1. After the second URL is added to the tail of the website queue, the number of elements in the queue increases by 1, and the subscript of the second URL is (N − 1) + 1 = N.

S14: and starting a second thread to acquire the URL and the hypertext markup language document corresponding to the URL from the head of the website queue, wherein the first thread and the second thread are executed in parallel.

In this embodiment, the URL includes the first URL and the second URL currently existing in the website queue. A second thread is started to control a text collection module to acquire the hypertext markup language document corresponding to the URL at the head of the website queue.

In the prior art, all URLs are collected into one website queue: after all the URLs contained in one URL have been crawled into the queue, another URL is crawled, and each crawled URL is added to the head of the website queue, while the hypertext markup language document corresponding to a URL is also acquired from the head of the queue. Resource conflicts therefore occur easily. For example: at time T1, an earlier URL is added to the head of the website queue with subscript 0 (URL₀), and acquisition of its hypertext markup language document begins; at time T2, a later URL is added to the head of the queue and takes subscript 0, so the earlier URL becomes the URL with subscript 1 (URL₁); at time T3, acquisition of the earlier URL's document completes and the URL with subscript 0 is deleted, but the URL deleted is no longer the earlier URL added at time T1. It is the later URL newly added at time T2, which is therefore deleted before its hypertext markup language document has been acquired, so a resource conflict occurs. In the present invention, the first thread and the second thread are started simultaneously: crawled URLs are added to the tail of the website queue, while the hypertext markup language document corresponding to the URL at the head of the queue is acquired, so additions and acquisitions take place at opposite ends. Each time the URL with subscript 0 is taken from the head, the subscript of the URL in the second row changes from 1 to 0, and no new URL can be inserted at the head of the queue. Resource conflicts are thereby prevented, and maintaining a single website queue is sufficient.

Preferably, after the second thread is started to acquire the URL and the corresponding hypertext markup language document from the head of the website queue, the method further includes:

deleting the URL with the subscript of 0 at the head of the website queue;

and simultaneously, subtracting 1 from the subscripts of the remaining URLs in the website queue to obtain new subscripts of the remaining URLs.

Illustratively, the URL at the head of the website queue has subscript 0 (URL₀), the URL in the next row has subscript 1 (URL₁), and the URL in the row after URL₁ has subscript 2 (URL₂). When the URL with subscript 0 at the head of the website queue is deleted, 1 is subtracted from the subscript of URL₁ to give its new subscript 0, and 1 is subtracted from the subscript of URL₂ to give its new subscript 1.

In this embodiment, each URL in the website queue is updated after the URL whose subscript is 0 at the head of the website queue is deleted, so that the subscript of the URL at the head of the website queue is always 0, and resource conflict is prevented.
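The subscript bookkeeping can be illustrated with a standard-library deque, assuming zero-based subscripts as in the head-of-queue example. The URL names are placeholders, and the deque performs the renumbering implicitly: the code never rewrites subscripts itself.

```python
from collections import deque

# Placeholder URLs; subscripts are positions in the deque, starting at 0.
url_queue = deque(["URL_a", "URL_b", "URL_c"])

# Tail insertion: the new URL receives the old tail subscript plus 1.
tail_subscript = len(url_queue) - 1
url_queue.append("URL_d")
new_subscript = tail_subscript + 1

# Head deletion: the URL with subscript 0 is removed, and every remaining
# URL moves down by one position, so its subscript decreases by 1.
removed = url_queue.popleft()
subscripts_after = {url: i for i, url in enumerate(url_queue)}
```

After these steps the URL that previously had subscript 1 sits at subscript 0, which matches the description of the head always carrying subscript 0.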

It should be noted that, S11 and S14 are performed simultaneously, that is, the first thread and the second thread are started simultaneously to execute corresponding operations, and the hypertext markup language document corresponding to the URL is obtained from the head of the website queue while adding the newly added URL from the tail of the website queue, so that time is saved, and resource conflict is prevented.
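The parallel execution of S11 and S14 might be sketched as below. The sketch rests on several assumptions: the function names, the sample URLs and the stub "document" string are hypothetical, no network access takes place, and collections.deque is chosen because its append and popleft operations are thread-safe in CPython. One thread only ever appends at the tail and the other only ever removes from the head.

```python
import threading
import time
from collections import deque

url_queue = deque(["http://example.com/seed"])
crawl_done = threading.Event()
fetched = []  # (url, document) pairs collected by the second thread


def crawler_thread(new_urls):
    """First thread: add crawled URLs to the tail of the queue."""
    for url in new_urls:
        url_queue.append(url)
    crawl_done.set()


def fetcher_thread():
    """Second thread: take URLs from the head and 'acquire' their documents."""
    while True:
        # Read the flag before popping, so an empty queue seen afterwards
        # really means that nothing more will arrive.
        finished = crawl_done.is_set()
        try:
            url = url_queue.popleft()
        except IndexError:
            if finished:
                return
            time.sleep(0.001)
            continue
        fetched.append((url, "<html>stub document for %s</html>" % url))


# Assumed sample data standing in for crawled second URLs.
second_urls = ["http://example.com/1", "http://example.com/2"]
t1 = threading.Thread(target=crawler_thread, args=(second_urls,))
t2 = threading.Thread(target=fetcher_thread)
t1.start()
t2.start()
t1.join()
t2.join()
```

With a single producer and a single consumer, the documents are acquired in the same order in which the URLs entered the queue.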

S15: extracting a text data set in the hypertext markup language document.

In this embodiment, a web page downloaded by the web crawler is a hypertext markup language document, which contains a large amount of code and information unrelated to hotspot event mining. The text data set in the hypertext markup language document may be extracted with an HTML parsing module under Python. Beautiful Soup is a Python HTML parsing module; parsing the hypertext markup language with Beautiful Soup quickly yields the content of the web page tags, which is converted into text format to obtain the text data set.

Illustratively, with the tool library Beautiful Soup, for example soup = BeautifulSoup(html_text, 'html.parser'), the parameter html_text represents the content of the hypertext markup language document, and the parameter 'html.parser' designates html.parser as the parser used by Beautiful Soup. The function find_all('a') of Beautiful Soup is called to obtain all Href tags, all nodes of the hypertext markup language document corresponding to the 'a' tags are traversed, the text content is obtained with the get_text method of the tool library, and the text content is spliced and extracted to obtain the text data set of the hypertext markup language document. It then calls an open _ text in the database (' ml _ text _.
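As a rough illustration of S15 that runs without third-party packages, the sketch below uses the standard-library html.parser module as a stand-in for Beautiful Soup's get_text; it is not the Beautiful Soup call sequence described above. The class TextCollector, the function extract_text and the sample document are assumptions.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Gathers visible text and ignores script and style content."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def extract_text(html_text):
    """Splice the text content of a document into one string."""
    collector = TextCollector()
    collector.feed(html_text)
    collector.close()
    return " ".join(collector.chunks)


# Assumed sample document.
html_text = (
    "<html><head><style>p{color:red}</style></head><body>"
    "<h1>Hot words</h1><p>Crawler <a href='/x'>link text</a></p>"
    "<script>var a = 1;</script></body></html>"
)
text_data = extract_text(html_text)
```

The style rule and the script body are dropped, and only the heading, paragraph and link text remain in text_data.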

S16: and performing word segmentation processing on the text data set to obtain a target vocabulary list.

In this embodiment, the jieba Chinese word segmentation tool library under Python is used to segment the text content in the text data set, so as to obtain a target vocabulary list.

Preferably, the performing word segmentation processing on the text data set to obtain the target vocabulary list includes:

performing word segmentation processing on the text data set to obtain an initial vocabulary list;

matching the initial vocabulary list with a preset filtering vocabulary list;

and deleting the vocabulary in the initial vocabulary list which is the same as the vocabulary in the preset filtering vocabulary list to obtain a target vocabulary list.

In this embodiment, a filtering vocabulary list may be preset, where the filtering vocabulary list includes common prepositions, meaningless words, and the like. The text content in the text data set is segmented with the jieba.cut method under Python, and the words that are the same as words in the filtering vocabulary list are deleted from the initial vocabulary list to obtain the target vocabulary list.

Illustratively, the initial vocabulary list obtained after the text content is segmented is, for example, ['text', 'is', 'professional', 'of', 'IT', 'consult']. The initial vocabulary list is matched with the pre-stored filtering vocabulary list, for example ['of', 'ground', 'get', 'in', 'is'], and the words that also appear in the filtering vocabulary list, ['of', 'is'], are deleted to obtain the target vocabulary list ['text', 'professional', 'IT', 'consult'].

In this embodiment, removing the meaningless words listed in the filtering vocabulary list effectively shortens the counting time.
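The filtering step can be sketched with the word lists from the example above. The function name filter_vocabulary is an assumption, and the initial list is taken as already segmented, so no segmentation library is called here.

```python
def filter_vocabulary(initial_words, filter_words):
    """Remove every word that also appears in the filtering vocabulary list."""
    blocked = set(filter_words)  # set membership keeps each lookup cheap
    return [word for word in initial_words if word not in blocked]


# Lists taken from the example in the text.
initial_vocabulary = ['text', 'is', 'professional', 'of', 'IT', 'consult']
filtering_vocabulary = ['of', 'ground', 'get', 'in', 'is']
target_vocabulary = filter_vocabulary(initial_vocabulary, filtering_vocabulary)
```

The order of the remaining words is preserved, so target_vocabulary matches the target vocabulary list given in the example.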

S17: and counting the occurrence frequency of each vocabulary in the target vocabulary list.

In this embodiment, the vocabulary list obtained after word segmentation is used as the input parameter. For example, the vocabulary list may be denoted chn_phrase_arr and passed as a parameter to the Counter class in the collections library in Python to count the occurrence frequency of all the words, for example ret = Counter(chn_phrase_arr); the result ret contains the occurrence frequency of every Chinese word in the vocabulary list chn_phrase_arr.

Further, after the frequency of each word in the target vocabulary list has been counted, the vocabulary list is output in order of word frequency from high to low.

In this embodiment, the Counter.most_common(N) method in Python returns a list of the N elements with the highest frequency. After the occurrence frequency of all the words in the vocabulary list has been counted, ret.most_common() of the Counter class in the collections library is called to output the occurrence frequencies of all the words in order from high to low, which completes the fast word frequency calculation.

In this embodiment, functions from tool libraries under Python, including the jieba word segmentation library, are called to segment the text content and to count the occurrence frequency of words, which improves the efficiency of word frequency statistics to a certain extent.
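A short sketch of the counting in S17 with collections.Counter. The names chn_phrase_arr and ret follow the text; the sample words are assumed English placeholders for segmented vocabulary.

```python
from collections import Counter

# Assumed sample vocabulary list after segmentation and filtering.
chn_phrase_arr = ["crawler", "queue", "crawler", "thread", "queue", "crawler"]

ret = Counter(chn_phrase_arr)   # word -> number of occurrences
ranked = ret.most_common()      # all words, highest frequency first
top_two = ret.most_common(2)    # only the two most frequent words
```

Calling most_common() without an argument returns every word with its count, ordered from the most frequent to the least frequent.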

S18: and determining the vocabulary corresponding to the frequency greater than the preset frequency threshold value in the frequency as the hot vocabulary.

In this embodiment, a frequency threshold for word occurrences may be preset according to the actual situation; for example, the preset frequency threshold may be set to 100 occurrences. Words whose frequency is greater than the preset frequency threshold are extracted from the vocabulary list and used as hot words.
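The threshold test in S18 might look like the following sketch. The function name select_hot_words and the sample frequencies are assumptions; the threshold of 100 is the example value from the text, and the comparison is strictly "greater than".

```python
def select_hot_words(frequencies, threshold):
    """Return the words whose frequency is strictly greater than threshold."""
    return [word for word, count in frequencies.items() if count > threshold]


# Assumed sample frequencies.
frequencies = {"crawler": 150, "queue": 100, "thread": 101}
hot_words = select_hot_words(frequencies, 100)
```

A word that occurs exactly 100 times is not selected, because the method requires the frequency to exceed the threshold.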

Further, after determining the vocabulary corresponding to the frequency greater than the preset frequency threshold value in the frequencies as the hot vocabulary, the method further includes:

calculating the similarity between the hot vocabulary and a pre-stored keyword related to a target object;

when the similarity is larger than a preset similarity threshold, determining that the hot vocabulary can be used as a target object;

and when the similarity is smaller than or equal to a preset similarity threshold, determining that the hot vocabulary cannot be used as a target object.

In this embodiment, the target object refers to a scientific research topic in a scientific research project. Some keywords are stored in advance, including keywords for the technical field to which the scientific research project belongs, keywords for national key support projects, and keywords for frontier science and technology. The similarity between the hot words obtained after word segmentation and the pre-stored keywords is calculated based on whether the hot word and the keywords belong to the same technical field, to a national key support project, and to frontier science and technology: belonging to the same technical field accounts for 60%, belonging to a national key support project accounts for 20%, and belonging to frontier science and technology in the field is the third factor. A similarity threshold may be preset, for example 80%. If the similarity between a hot word and the pre-stored keywords is greater than or equal to the similarity threshold, it is determined that the hot word can be used as a scientific research topic; if the similarity is smaller than the similarity threshold, it is determined that the hot word does not belong to the scientific research topic selection range.

In the embodiment, by determining whether the hot words are similar to the scientific research project, the national attention degree and the advanced science and technology, some hot network terms can be excluded to a certain degree, and the accuracy of scientific research topic selection is improved.
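The weighted similarity check can be sketched as follows, with loud assumptions. The text gives 60% for the same technical field and 20% for a national key support project; the 20% weight used below for frontier science and technology is an assumption, since the text does not state it. The function names and the boolean inputs are also hypothetical: how membership in each category is decided is not specified. The comparison uses "greater than or equal to" as in the embodiment, whereas the method steps state "larger than".

```python
# Weights in percent. The first two come from the text; the third is an
# ASSUMPTION made only so that the weights add up to 100.
WEIGHT_SAME_FIELD = 60
WEIGHT_NATIONAL_KEY_PROJECT = 20
WEIGHT_FRONTIER_TECH = 20  # assumed, not stated in the text

SIMILARITY_THRESHOLD = 80  # example value from the text, in percent


def similarity(same_field, national_key_project, frontier_tech):
    """Sum the weights of the criteria that the hot word satisfies."""
    score = 0
    if same_field:
        score += WEIGHT_SAME_FIELD
    if national_key_project:
        score += WEIGHT_NATIONAL_KEY_PROJECT
    if frontier_tech:
        score += WEIGHT_FRONTIER_TECH
    return score


def can_be_topic(same_field, national_key_project, frontier_tech):
    """Embodiment rule: accept when the similarity reaches the threshold."""
    score = similarity(same_field, national_key_project, frontier_tech)
    return score >= SIMILARITY_THRESHOLD


accepted = can_be_topic(True, True, False)   # 60 + 20 = 80
rejected = can_be_topic(False, True, True)   # 20 + 20 = 40
```

Integer percentages are used instead of fractions so that the comparison at exactly 80% does not depend on floating-point rounding.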

In summary, the hot word extraction method based on web crawlers according to the present invention includes: initializing a website queue, wherein at least one URL is stored in the website queue, the URL comprises a first URL and a second URL which exist at present, and starting a first thread to crawl the second URL from the first URL; judging whether the second URL is the same as the URL in the website queue; when the second URL is determined to be different from the URL in the website queue, adding the second URL to the tail of the website queue; starting a second thread to acquire a URL and a hypertext markup language document corresponding to the URL from the head of the website queue, wherein the first thread and the second thread are executed in parallel; extracting a text data set in the hypertext markup language document; performing word segmentation processing on the text data set to obtain a target vocabulary list; counting the occurrence frequency of each vocabulary in the target vocabulary list; and determining the vocabulary corresponding to the frequency greater than the preset frequency threshold value in the frequency as the hot vocabulary. On one hand, the first thread is started to crawl the second URL, the second URL which does not exist in the website queue is stored at the tail of the website queue, and the second thread is started to obtain the hypertext markup language document corresponding to the URL at the head of the website queue, so that resource conflict is prevented, the crawling efficiency is improved, the time for obtaining the hypertext markup language document is shortened, the efficiency for obtaining word frequency is accelerated, on the other hand, whether the second URL exists in the website queue or not is judged, repeated crawl can be avoided, and the crawler time is shortened.

In addition, the vocabulary corresponding to the frequency greater than the preset frequency threshold value in the frequency is used as the hot vocabulary, so that the hot vocabulary extraction accuracy is improved to a certain extent.

Example two

In some embodiments, the web crawler-based hot word extraction apparatus 20 may include a plurality of functional modules composed of program code segments. The program code of each segment in the web crawler-based hot word extraction apparatus 20 may be stored in a memory of the terminal and executed by the at least one processor to perform the web crawler-based hot word extraction (see fig. 1 for details).

In this embodiment, the web crawler-based hot word extraction apparatus 20 may be divided into a plurality of functional modules according to the functions it performs. The functional modules may include: an initialization module 201, a judging module 202, a crawling module 203, an adding module 204, an obtaining module 205, a determining module 206, a starting module 207, a deleting module 208, an extraction module 209, a word segmentation module 210, a counting module 211 and a calculating module 212. A module as referred to herein is a series of computer program segments that is stored in the memory, can be executed by the at least one processor and performs a fixed function. The functions of the modules are described in detail below.

The initialization module 201: initializing a website queue, wherein at least one URL is stored in the website queue, the URL comprises a first URL and a second URL which exist at present, and starting a first thread to crawl the second URL from the first URL.

For example, multiple crawler engines are started to crawl all first URLs in the website queue at the same time. If there are three first URLs in the website queue (first URL₁, first URL₂ and first URL₃), the engines simultaneously crawl the second URL₁ᵢ corresponding to the first Href tag in the first URL₁, the second URL₂ᵢ corresponding to the first Href tag in the first URL₂, and the second URL₃ᵢ corresponding to the first Href tag in the first URL₃. The crawled second URL₁ᵢ, second URL₂ᵢ and second URL₃ᵢ are added to the tail of the website queue in the order in which their crawling completes. The second URLs corresponding to the second Href tags are then crawled and added to the tail of the website queue in the same way, again in order of crawling completion, and so on until crawling is completed, as shown in Table I.

Initial website queue (present application)	Initial website queue (prior art)
First URL₁	First URL₁
First URL₂	First URL₂
First URL₃	First URL₃
Website queue after crawling (present application)	Website queue after crawling (prior art)
First URL₁	……
First URL₂	3rd second URL₁₁₃ in second URL₁₁
First URL₃	2nd second URL₁₁₂ in second URL₁₁
1st second URL₁₁ in first URL₁	1st second URL₁₁₁ in second URL₁₁
1st second URL₂₁ in first URL₂	3rd second URL₁₃ in first URL₁
1st second URL₃₁ in first URL₃	2nd second URL₁₂ in first URL₁
2nd second URL₁₂ in first URL₁	1st second URL₁₁ in first URL₁
2nd second URL₂₂ in first URL₂	First URL₁
2nd second URL₃₂ in first URL₃	……
3rd second URL₁₃ in first URL₁	3rd second URL₂₃ in first URL₂
3rd second URL₂₃ in first URL₂	2nd second URL₂₂ in first URL₂
3rd second URL₃₃ in first URL₃	1st second URL₂₁ in first URL₂
Second URL₁₁₁ in second URL₁₁	First URL₂
Second URL₂₁₁ in second URL₂₁	……
Second URL₃₁₁ in second URL₃₁	1st second URL₃₁ in first URL₃
……	First URL₃

In the embodiment, the crawler engines are started to simultaneously crawl the second URLs in the first URLs, and the second URLs are added to the tail of the website queue while crawling, so that crawling efficiency is improved, and time is saved. The depth-first algorithm in the crawler technology is the prior art, and the depth-first algorithm is not described in detail herein.
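The queue order in the present-application column of Table I can be reproduced with a small simulation. The Python sketch below is illustrative only: the function name `interleaved_crawl_order`, the `links` dictionary and the placeholder URL names are assumptions that do not appear in the source, and a deterministic round-robin interleaving stands in for the completion order of real concurrent crawler engines, which the source does not fix.

```python
from itertools import zip_longest

def interleaved_crawl_order(seeds, links, max_depth=2):
    """Simulate the tail-append order when every URL of one level is crawled at once.

    `links` maps a URL to the list of URLs behind its Href tags (hypothetical data).
    """
    queue = list(seeds)        # website queue; new URLs go to the tail
    seen = set(seeds)          # URLs already in the queue are skipped
    frontier = list(seeds)     # URLs whose links are crawled in this pass
    for _ in range(max_depth):
        next_frontier = []
        per_page = [links.get(url, []) for url in frontier]
        # Round k takes the k-th Href of every page in the frontier.
        for kth_links in zip_longest(*per_page):
            for url in kth_links:
                if url is not None and url not in seen:
                    seen.add(url)
                    queue.append(url)
                    next_frontier.append(url)
        frontier = next_frontier
    return queue

seeds = ["u1", "u2", "u3"]
links = {
    "u1": ["u11", "u12", "u13"],
    "u2": ["u21", "u22", "u23"],
    "u3": ["u31", "u32", "u33"],
    "u11": ["u111"], "u21": ["u211"], "u31": ["u311"],
}
order = interleaved_crawl_order(seeds, links)
```

With this hypothetical link structure, `order` lists the three first URLs, then the first, second and third second URLs of each first URL in turn, then `u111`, `u211` and `u311`, which matches the present-application column of Table I.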

The judging module 202: used for judging whether the second URL is the same as a URL in the website queue.

Preferably, the judging module 202 judging whether the second URL is the same as a URL in the website queue includes:

calculating an MD5 hash value for each second URL;

comparing each MD5 hash value with a prestored hash value one by one;

Further, when the judging module 202 determines that the second URL is the same as a URL in the website queue, the web crawler-based hot word extraction apparatus further includes:

and the crawling module 203 is used for skipping the second URL to continue crawling.

Further, when the MD5 hash value is different from any pre-stored hash value, the web crawler-based hot word extraction apparatus further includes:

an adding module 204, configured to add the MD5 hash value of the second URL to the URL_MAP.

The adding module 204 is further configured to add the second URL to the tail of the website queue when the second URL is determined to be different from the URLs in the website queue.
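The duplicate check and tail insertion described above can be sketched as follows. This is a minimal sketch under stated assumptions rather than the patented implementation: the names `url_map`, `website_queue`, `md5_of` and `add_if_new` are hypothetical, and a Python `set` lookup is swapped in for the one-by-one comparison against prestored hash values that the text describes.

```python
import hashlib
from collections import deque

url_map = set()          # stands in for URL_MAP: MD5 digests of URLs already queued
website_queue = deque()  # new URLs are appended at the tail (right end)

def md5_of(url: str) -> str:
    """Return the hexadecimal MD5 digest of a URL string."""
    return hashlib.md5(url.encode("utf-8")).hexdigest()

def add_if_new(url: str) -> bool:
    """Queue the URL only if its MD5 digest has not been stored before."""
    digest = md5_of(url)
    if digest in url_map:
        return False           # same as a queued URL: skip it and keep crawling
    url_map.add(digest)        # remember the digest for later comparisons
    website_queue.append(url)  # add the new URL to the tail of the queue
    return True
```

Storing fixed-length digests instead of the URLs themselves keeps each comparison cheap regardless of URL length.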

Preferably, after the adding module 204 adds the second URL to the tail of the website queue, the web crawler-based hot word extracting apparatus further includes:

an obtaining module 205, configured to obtain a subscript of a URL at a tail of the website queue;

the determining module 206 is further configured to add 1 to the subscript to determine the subscript of the second URL.

The starting module 207: used for starting a second thread to acquire the hypertext markup language document corresponding to a URL from the head of the website queue, wherein the first thread and the second thread are started and run in parallel.

In the prior art, all URLs are collected into one website queue: after all URLs contained in one URL have been crawled into the website queue, another URL is crawled, each crawled URL is added to the head of the website queue, and at the same time the hypertext markup language document corresponding to a URL is obtained from the head of the website queue. Resource conflicts therefore occur easily. For example: at time T1, a first URL₀ with subscript 0 is added to the head of the website queue, and acquisition of the hypertext markup language document corresponding to the first URL₀ begins; at time T2, a second URL₀ with subscript 0 is added to the head of the website queue, so that the first URL₀ becomes URL₁ with subscript 1; at time T3, once the hypertext markup language document corresponding to the first URL₀ has been acquired, the URL with subscript 0 is deleted. The URL deleted is then not the first URL₀ added at time T1 but the second URL₀ newly added at time T2, so the second URL₀ is deleted before its hypertext markup language document has been acquired, and a resource conflict occurs. In the present application, the first thread and the second thread are started simultaneously: crawled URLs are added to the tail of the website queue, while the hypertext markup language document corresponding to the URL at the head of the website queue is obtained, so that adding and obtaining take place at opposite ends of the queue. Each time the URL with subscript 0 is obtained from the head, the subscript of the URL in the second position changes from 1 to 0, and no new URL can be inserted at the head of the website queue. Resource conflicts are thereby prevented, and maintaining a single website queue is sufficient.

Preferably, after the second thread is started to obtain the URL and the hypertext markup language document corresponding to the URL from the head of the website queue, the web crawler-based hot spot vocabulary extracting apparatus further includes:

a deleting module 208, configured to delete the URL with index 0 at the head of the website queue;

the determining module 206 is configured to subtract 1 from all subscripts corresponding to the remaining URLs in the website queue to determine new subscripts of the remaining URLs.

It should be noted that, the first thread and the second thread are started simultaneously, and while a newly added URL is added from the tail of the website queue, a hypertext markup language document corresponding to the URL is obtained from the head of the website queue, which saves time and prevents resource conflict.
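The interplay of the two threads on a single queue can be sketched in Python as below. The sketch is illustrative only: `run_crawl`, `discovered_urls` and the `fetch` callback are hypothetical names, the lock and the completion flag are implementation choices not stated in the source, and `deque.popleft()` plays the role of deleting the URL with subscript 0, after which the remaining URLs implicitly move up by one position.

```python
import threading
import time
from collections import deque

def run_crawl(seed_urls, discovered_urls, fetch):
    """Append URLs at the tail in one thread while another fetches from the head."""
    website_queue = deque(seed_urls)   # head is index 0, tail is the right end
    lock = threading.Lock()            # guards every access to the queue
    crawling_done = threading.Event()  # set once no more URLs will be added
    documents = []                     # (url, html) pairs in fetch order

    def first_thread():
        for url in discovered_urls:
            with lock:
                website_queue.append(url)      # add only at the tail
        crawling_done.set()

    def second_thread():
        while True:
            finished = crawling_done.is_set()  # read the flag before the queue
            with lock:
                url = website_queue.popleft() if website_queue else None
            if url is not None:
                documents.append((url, fetch(url)))
            elif finished:
                break                          # queue drained and producer done
            else:
                time.sleep(0.001)              # wait briefly for new URLs

    threads = [threading.Thread(target=first_thread),
               threading.Thread(target=second_thread)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return documents
```

Because insertion happens only at the tail and removal only at the head, the URL being fetched can never be displaced by a newly added one, which is the conflict the prior-art example above describes.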

The extraction module 209: used for extracting a text data set from the hypertext markup language document.

The word segmentation module 210: used for performing word segmentation processing on the text data set to determine a target vocabulary list.

Preferably, the word segmentation module 210 performing word segmentation processing on the text data set to determine the target vocabulary list includes:

matching the initial vocabulary list with a preset filtering vocabulary list;

The counting module 211: used for counting the occurrence frequency of each vocabulary in the target vocabulary list.

Further, after the counting module 211 counts the occurrence frequency of each vocabulary in the target vocabulary list, the vocabulary list is output according to the order of the occurrence frequency of the vocabularies from high to low.

The determining module 206: used for determining the vocabulary whose frequency is greater than the preset frequency threshold as the hot vocabulary.
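The filtering, counting and thresholding steps can be sketched as follows. This is a simplified sketch under stated assumptions: the function name `hot_words` and its parameters are hypothetical, and a regular expression that splits lower-cased Latin-letter words stands in for the word segmentation step, which for Chinese text would require a dedicated segmenter that the source does not name.

```python
import re
from collections import Counter

def hot_words(text, filter_words, threshold):
    """Return words whose frequency exceeds `threshold`, most frequent first."""
    # Stand-in for word segmentation: split into lower-case alphabetic tokens.
    initial_vocabulary = re.findall(r"[a-z]+", text.lower())
    # Remove every word that matches the preset filtering vocabulary list.
    target_vocabulary = [w for w in initial_vocabulary if w not in filter_words]
    # Count occurrences and order the words from high to low frequency.
    ranked = Counter(target_vocabulary).most_common()
    # Keep only words strictly above the preset frequency threshold.
    return [word for word, count in ranked if count > threshold]

sample = "the crawler queue and the crawler thread and the crawler"
result = hot_words(sample, filter_words={"the", "and"}, threshold=1)
```

In this sample, only `crawler` occurs more than once after filtering, so it is the only word returned.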

Further, after the determining module 206 determines the vocabulary corresponding to the frequency greater than the preset frequency threshold in the frequencies as the hot vocabulary, the web crawler-based hot vocabulary extracting apparatus further includes:

a calculating module 212, configured to calculate similarity between the hot vocabulary and a pre-stored keyword related to the target object;

the determining module 206 is further configured to determine that the hot vocabulary can be used as a target object when the similarity is greater than a preset similarity threshold;

the determining module 206 is further configured to determine that the hot vocabulary cannot be used as the target object when the similarity is smaller than or equal to a preset similarity threshold.
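The similarity check can be illustrated with a short Python sketch. The source does not specify how similarity is computed, so the character-based ratio from `difflib.SequenceMatcher` is purely an assumption, as are the function name `is_target_object`, the default threshold of 0.6 and the choice to take the best match over all prestored keywords.

```python
from difflib import SequenceMatcher

def is_target_object(hot_word, keywords, threshold=0.6):
    """Return True if the hot word is similar enough to any prestored keyword."""
    best = max(
        (SequenceMatcher(None, hot_word, keyword).ratio() for keyword in keywords),
        default=0.0,  # no prestored keywords means nothing can match
    )
    # Strictly greater than the threshold, as in the text above.
    return best > threshold
```

A semantic measure, for example one based on word embeddings, could replace the character-based ratio without changing the thresholding logic.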

In summary, the web crawler-based hot word extraction device according to the present invention is configured to perform the following: initializing a website queue in which at least one URL is stored, wherein the URLs comprise a first URL and a second URL that currently exist, and starting a first thread to crawl the second URL from the first URL; judging whether the second URL is the same as a URL in the website queue; when the second URL is determined to be different from the URLs in the website queue, adding the second URL to the tail of the website queue; starting a second thread to acquire a URL and the hypertext markup language document corresponding to the URL from the head of the website queue, wherein the first thread and the second thread are executed in parallel; extracting a text data set from the hypertext markup language document; performing word segmentation processing on the text data set to obtain a target vocabulary list; counting the occurrence frequency of each vocabulary in the target vocabulary list; and determining the vocabulary whose frequency is greater than the preset frequency threshold as the hot vocabulary. On the one hand, the first thread crawls the second URL and stores any second URL not yet present in the website queue at the tail of the website queue, while the second thread obtains the hypertext markup language document corresponding to the URL at the head of the website queue; this prevents resource conflicts, improves crawling efficiency, shortens the time needed to obtain the hypertext markup language documents, and speeds up word frequency acquisition. On the other hand, judging whether the second URL already exists in the website queue avoids repeated crawling and shortens the crawling time.

Example three

Fig. 3 is a schematic structural diagram of a terminal according to a third embodiment of the present invention. In the preferred embodiment of the present invention, the terminal 3 includes a memory 31, at least one processor 32, at least one communication bus 33, and a transceiver 34.

It will be appreciated by those skilled in the art that the configuration of the terminal shown in fig. 3 is not limiting to the embodiments of the present invention, and may be a bus-type configuration or a star-type configuration, and the terminal 3 may include more or less hardware or software than those shown, or a different arrangement of components.

In some embodiments, the terminal 3 is a terminal capable of automatically performing numerical calculation and/or information processing according to preset or stored instructions, and the hardware includes but is not limited to a microprocessor, an application specific integrated circuit, a programmable gate array, a digital processor, an embedded device, and the like. The terminal 3 may further include a client device, which includes, but is not limited to, any electronic product capable of performing human-computer interaction with a client through a keyboard, a mouse, a remote controller, a touch panel, or a voice control device, for example, a personal computer, a tablet computer, a smart phone, a digital camera, and the like.

It should be noted that the terminal 3 is only an example; other existing or future electronic products that can be adapted to the present invention should also fall within the scope of protection of the present invention and are incorporated herein by reference.

In some embodiments, the memory 31 is used for storing program code and various data, such as the web crawler-based hot word extraction apparatus 20 installed in the terminal 3, and provides high-speed, automatic access to programs or data during operation of the terminal 3. The memory 31 includes a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), a One-Time Programmable Read-Only Memory (OTPROM), an Electrically-Erasable Programmable Read-Only Memory (EEPROM), a Compact Disc Read-Only Memory (CD-ROM) or other optical disk memory, a magnetic disk memory, a tape memory, or any other computer-readable medium capable of carrying or storing data.

In some embodiments, the at least one processor 32 may be composed of an integrated circuit, for example, a single packaged integrated circuit, or may be composed of a plurality of integrated circuits packaged with the same or different functions, including one or more Central Processing Units (CPUs), microprocessors, digital Processing chips, graphics processors, and combinations of various control chips. The at least one processor 32 is a Control Unit (Control Unit) of the terminal 3, connects various components of the whole terminal 3 by using various interfaces and lines, and executes various functions of the terminal 3 and processes data by running or executing programs or modules stored in the memory 31 and calling data stored in the memory 31, for example, extracting a hot word based on a web crawler.

In some embodiments, the at least one communication bus 33 is arranged to enable connection communication between the memory 31 and the at least one processor 32 or the like.

Although not shown, the terminal 3 may further include a power supply (such as a battery) for supplying power to various components, and preferably, the power supply may be logically connected to the at least one processor 32 through a power management device, so as to implement functions of managing charging, discharging, and power consumption through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The terminal 3 may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.

It is to be understood that the described embodiments are for purposes of illustration only and that the scope of the appended claims is not limited to such structures.

The integrated unit implemented in the form of a software functional module may be stored in a computer-readable storage medium. The software functional module is stored in a storage medium and includes several instructions to enable a computer device (which may be a personal computer, a terminal, or a network device) or a processor (processor) to execute parts of the methods according to the embodiments of the present invention.

In a further embodiment, referring to fig. 2, the at least one processor 32 may execute the operating system of the terminal 3 and the various installed applications (such as the web crawler-based hot word extraction apparatus 20), program code, and the like, for example the modules described above.

The memory 31 has program code stored therein, and the at least one processor 32 can call the program code stored in the memory 31 to perform related functions. For example, the modules illustrated in fig. 2 are program codes stored in the memory 31 and executed by the at least one processor 32, so as to implement the functions of the modules for the purpose of web crawler-based hot word extraction.

In one embodiment of the present invention, the memory 31 stores a plurality of instructions that are executed by the at least one processor 32 to implement web crawler-based hot word extraction functionality.

Specifically, for the way in which the at least one processor 32 implements the above instructions, reference may be made to the description of the relevant steps in the embodiment corresponding to fig. 1, and details are not repeated here.

In the embodiments provided in the present invention, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative: the division of the modules is only one logical functional division, and other divisions may be used in practice.

The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.

In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.

It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or that the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims

1. A hot word extraction method based on web crawlers is characterized by comprising the following steps:

judging whether the second URL is the same as the URL in the website queue;

extracting a text data set in the hypertext markup language document;

2. The method of claim 1, wherein after determining the vocabulary corresponding to the frequency greater than the preset frequency threshold value as the hot vocabulary, the method further comprises:

3. The method of claim 1, wherein said determining whether the second URL is the same as a URL in the web site queue comprises:

calculating an MD5 hash value for each second URL;

comparing each MD5 hash value with a prestored hash value one by one;

4. The method of claim 1, wherein after the initiating the second thread obtains a URL and a hypertext markup language document corresponding to the URL from a head of the web site queue, the method further comprises:

deleting the URL with the subscript of 0 at the head of the website queue;

5. The method of claim 1, wherein after the adding the second URL to the tail of the web site queue, the method further comprises:

acquiring subscripts of URLs at the tail of the website queue;

adding 1 to the subscript yields the subscript of the second URL.

6. The method of claim 1, wherein the tokenizing the text data set to determine a target list of words comprises:

matching the initial vocabulary list with a preset filtering vocabulary list;

7. The method of any one of claims 1 to 6, further comprising:

8. A hot word extraction device based on a web crawler, characterized in that the device comprises:

9. A terminal, characterized in that the terminal comprises a processor, and the processor is configured to implement the web crawler-based hot spot vocabulary extracting method according to any one of claims 1 to 7 when executing the computer program stored in the memory.

10. A computer-readable storage medium, on which a computer program is stored, wherein the computer program, when executed by a processor, implements the web crawler-based hot word extraction method according to any one of claims 1 to 7.