CN108062337B - Method and device for labeling crawler seeds - Google Patents

Info

Publication number
CN108062337B
Authority
CN
China
Prior art keywords
keywords
keyword
crawler
seeds
array
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201610987244.2A
Other languages
Chinese (zh)
Other versions
CN108062337A (en)
Inventor
贺达
曹志明
陈晓敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201610987244.2A
Publication of CN108062337A
Application granted
Publication of CN108062337B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9558 Details of hyperlinks; Management of linked annotations

Abstract

The invention discloses a method and a device for labeling crawler seeds. The method comprises: establishing in advance an association between labels and keyword arrays; upon receiving any crawler seed to be labeled, crawling webpage content by using the crawler seed; extracting keywords from the webpage content and aggregating them to obtain the word frequency of each keyword; sorting the keywords by word frequency to generate a position identifier for each keyword; matching the keywords, together with their position identifiers, against each keyword array in the association between labels and keyword arrays; and finally taking the label associated with the keyword array having the highest matching degree as the label of the crawler seed. Compared with existing manual labeling, the method labels crawler seeds automatically, with higher efficiency and more accurate labels.

Description

Method and device for labeling crawler seeds
Technical Field
The invention relates to the field of data processing, in particular to a method and a device for tagging crawler seeds.
Background
A web crawler is a program or script that automatically captures web information according to certain rules. A crawler seed is the entry URL from which the web crawler starts capturing website content.
At present, a large amount of information on the internet is available to be acquired and learned from. However, because website content captured automatically by a web crawler is not manually checked and classified, it is difficult to determine the knowledge field to which the captured information belongs, so the value of the captured information is very low.
Although there are trillions of websites on the network, each website has its own characteristic knowledge domain. For example, there are literature discussion websites, professional computer technology discussion websites, and so on. Information on a literature discussion website mainly concerns literature, information on a computer technology discussion website mainly concerns computer technology, and in general the information from a website is related to the website itself. Therefore, information from websites can be classified by classifying the websites themselves. A website may have multiple crawler seeds, through which the web crawler crawls the website content. By tagging the crawler seeds of a website, the crawled content can be classified: information crawled from a given crawler seed is classified according to that seed's tag. In this way, the website content captured by the web crawler becomes classified information that can be put to use.
However, crawler seeds are currently tagged mainly by hand: a user browses the content behind a crawler seed and assigns tags from a pre-constructed tag system. Because different people may understand the same crawler seed differently, the tags they assign to the same seed may be inconsistent; that is, there is no unified standard for tagging crawler seeds.
Disclosure of Invention
In view of the above problems, the invention provides a method and a device for tagging crawler seeds, which can automatically complete tagging of the crawler seeds, and have the advantages of higher efficiency and more accurate tagging.
The invention provides a method for tagging crawler seeds, which comprises the following steps:
pre-establishing an association between tags and keyword arrays, wherein each keyword array comprises a correspondence between keywords and position identifiers, and each position identifier is determined according to the keyword to which it corresponds;
receiving any crawler seed to be tagged, and crawling webpage content by using the crawler seed;
extracting keywords from the webpage content, and generating a position identifier for each keyword according to the keywords;
matching the keywords, together with their position identifiers, against each keyword array in the association between tags and keyword arrays;
and taking the tag associated with the keyword array having the highest matching degree as the tag of the crawler seed.
Preferably, generating the position identifier for each keyword according to the keywords includes:
aggregating the keywords to obtain the word frequency of each keyword;
and sorting the keywords according to the word frequency of each keyword to generate the position identifier of each keyword.
Preferably, pre-establishing the association between tags and keyword arrays, wherein each keyword array comprises a correspondence between keywords and position identifiers and each position identifier is determined according to the keyword to which it corresponds, includes:
crawling webpage content by using crawler seeds with preset tags;
extracting keywords from the webpage content, and aggregating the keywords belonging to the same group of tags to obtain the word frequency of each keyword;
sorting the keywords belonging to the same group of tags according to the word frequency of each keyword to generate a position identifier for each keyword;
and taking the correspondence between the keywords belonging to the same group of tags and their position identifiers as a keyword array, and associating the keyword array with that group of tags.
Preferably, the extracting keywords from the web page content, and aggregating the keywords to obtain a word frequency of each keyword includes:
performing word segmentation processing on the webpage content through a natural language processing technology;
and extracting keywords in the webpage content by using a TF-IDF algorithm.
Preferably, before the step of aggregating the keywords to obtain the word frequency of each keyword, the method further includes:
and normalizing synonyms and near-synonyms among the extracted keywords into unified keywords by means of a synonym and near-synonym word list.
Preferably, matching the keywords, together with their position identifiers, against each keyword array in the association between tags and keyword arrays includes:
matching each keyword against the keywords in each keyword array, in the order obtained by sorting the keywords;
judging whether the position identifiers of a successfully matched keyword are consistent; if they are consistent, increasing the matching degree by a first set value; if not, increasing the matching degree by a second set value; wherein the first set value is greater than the second set value.
The invention also provides a device for tagging crawler seeds, which comprises:
the establishing module is used for establishing an association between tags and keyword arrays in advance, wherein each keyword array comprises a correspondence between keywords and position identifiers, and each position identifier is determined according to the keyword to which it corresponds;
the crawling module is used for receiving any crawler seed to be tagged and crawling webpage content by utilizing the crawler seed;
the extraction module is used for extracting keywords from the webpage content;
the generating module is used for generating a position identifier for each keyword according to the keywords;
the matching module is used for matching the keywords, together with their position identifiers, against each keyword array in the association between tags and keyword arrays;
and the labeling module is used for taking the tag associated with the keyword array having the highest matching degree as the tag of the crawler seed.
Preferably, the generating module includes:
the aggregation submodule is used for aggregating the keywords to obtain the word frequency of each keyword;
and the first sequencing submodule is used for sequencing the keywords according to the word frequency of each keyword to generate the position identification of each keyword.
Preferably, the establishing module includes:
the crawling submodule is used for crawling webpage content by utilizing crawler seeds preset with labels;
the first extraction submodule is used for extracting keywords from the webpage content and aggregating the keywords belonging to the same group of tags to obtain the word frequency of each keyword;
the second sequencing submodule is used for sorting the keywords belonging to the same group of tags according to the word frequency of each keyword to generate a position identifier for each keyword;
and the establishing sub-module is used for establishing an association relation with the tags by taking the corresponding relation between the keywords belonging to the same group of tags and the position identifications as a keyword array.
Preferably, the extraction module comprises:
the word segmentation sub-module is used for carrying out word segmentation processing on the webpage content through a natural language processing technology;
and the second extraction submodule is used for extracting the key words in the webpage content by using a TF-IDF algorithm.
Preferably, the apparatus further comprises:
and the normalization module is used for normalizing the synonyms and the near synonyms in the extracted keywords into unified keywords through the synonym and near synonym word list.
Preferably, the matching module includes:
the matching sub-module is used for matching each keyword with each keyword in the keyword array according to the sequence obtained by sequencing the keywords;
the judgment submodule is used for judging whether the position identifications of the successfully matched keywords are consistent or not;
the first increasing submodule is used for increasing the matching degree by a first set value when the result of the judging submodule is yes;
the second increasing submodule is used for increasing the matching degree by a second set value when the result of the judging submodule is negative; wherein the first set value is greater than the second set value.
By means of the above technical solution, in the method for tagging crawler seeds provided by the invention, an association between tags and keyword arrays is established in advance; when any crawler seed to be tagged is received, webpage content is crawled by using the crawler seed; keywords are extracted from the webpage content and aggregated to obtain the word frequency of each keyword; the keywords are sorted by word frequency to generate a position identifier for each keyword; the keywords, together with their position identifiers, are matched against each keyword array in the association between tags and keyword arrays; and finally the tag associated with the keyword array having the highest matching degree is taken as the tag of the crawler seed. Compared with existing manual tagging, the method tags crawler seeds automatically, with higher efficiency and more accurate tags.
The foregoing is only an overview of the technical solution of the invention. So that the technical means of the invention can be understood more clearly, and so that the above and other objects, features, and advantages are more readily apparent, specific embodiments of the invention are described below.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to refer to like parts throughout the drawings. In the drawings:
FIG. 1 is a flow chart illustrating a method for tagging crawler seeds according to an embodiment of the present invention;
FIG. 2 is a flow chart illustrating another method for tagging crawler seeds provided by embodiments of the present invention;
FIG. 3 is a schematic structural diagram illustrating a device for tagging crawler seeds according to an embodiment of the present invention.
Detailed Description
Exemplary embodiments of the present disclosure will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present disclosure are shown in the drawings, it should be understood that the present disclosure may be embodied in various forms and should not be limited to the embodiments set forth herein. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.
Specific embodiments are described below.
An embodiment of the invention provides a method for tagging crawler seeds. Referring to fig. 1, which is a flow chart of the method, the method specifically comprises the following steps:
S101: Pre-establish an association between tags and keyword arrays. Each keyword array comprises a correspondence between keywords and position identifiers, and each position identifier is determined according to the keyword to which it corresponds.
In the embodiment of the invention, the association between tags and keyword arrays is established in advance by using crawler seeds that have already been tagged. Each keyword array includes a correspondence between keywords and position identifiers. Specifically, webpage content is crawled by using the crawler seeds with preset tags, keywords are extracted from the crawled webpage content, and the keywords belonging to the same group of tags are aggregated to obtain the word frequency of each keyword, that is, the number of times the keyword appears in the webpage content. A group of tags may include only one tag or several tags.
Preferably, the extracted keywords belonging to the same group of tags are sorted according to their word frequencies, and the position identifiers are generated from the positions of the keywords in the sorted order. For example, after the keywords are sorted in descending order of word frequency, the keyword with the highest word frequency is given position identifier 1, and the position identifier is incremented by 1 for each subsequent keyword in the order. As an example, a group of tags may be {swordsmen online game, game forum}, and the keyword array associated with it may be [<Bixue Jian, 1>, <Dashen, 2>, <sister, 3>].
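The construction described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function name `build_association`, the use of a `frozenset` to represent a group of tags, and the sample keywords are all assumptions made for the example, and ties in word frequency are broken by first occurrence, which the text does not specify.

```python
from collections import Counter, defaultdict

def build_association(labeled_pages):
    """Build {tag group: keyword array} from already-tagged crawl results.

    labeled_pages: iterable of (tags, keywords) pairs, one per crawled page,
    where keywords is the list of keywords extracted from that page.
    """
    # Pool the word frequencies of all pages that share the same group of tags.
    pooled = defaultdict(Counter)
    for tags, keywords in labeled_pages:
        pooled[frozenset(tags)].update(keywords)

    association = {}
    for tag_group, freq in pooled.items():
        # most_common() sorts by descending frequency; equal counts keep
        # first-seen order (an assumption, the text leaves ties open).
        ranked = freq.most_common()
        association[tag_group] = [
            (keyword, position)
            for position, (keyword, _count) in enumerate(ranked, start=1)
        ]
    return association

# Illustrative data only.
labeled_pages = [
    ({"swordsmen online game", "game forum"}, ["Bixue Jian", "Dashen", "Bixue Jian"]),
    ({"game forum", "swordsmen online game"}, ["Dashen", "sister", "Bixue Jian"]),
]
association = build_association(labeled_pages)
```

Both sample pages carry the same two tags, so their keywords are pooled into a single keyword array before position identifiers are assigned.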
The embodiment of the invention associates the keywords and position identifiers obtained from the crawler seeds with preset tags with those tags, and this association is then used in the method for tagging crawler seeds.
S102: receiving any crawler seed to be tagged, and crawling the webpage content by using the crawler seed.
In practical application, when the system receives any crawler seed to be tagged, it first uses the crawler seed to crawl webpage content. Specifically, the crawler seed is the entry URL from which the web crawler captures information, and the embodiment of the invention crawls webpage content starting from that URL.
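A minimal sketch of this crawling step, using only the Python standard library, might look like the following. The names `crawl_seed`, `html_to_text`, and `TextExtractor` are hypothetical, and the sketch fetches only the entry page itself; following links from the seed, politeness rules, and error handling are left out.

```python
from html.parser import HTMLParser
from urllib.request import urlopen

class TextExtractor(HTMLParser):
    """Collect visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.parts.append(data.strip())

def html_to_text(html):
    """Reduce an HTML document to its visible text."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)

def crawl_seed(seed_url, timeout=10):
    """Fetch the entry URL (the crawler seed) and return its visible text."""
    with urlopen(seed_url, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return html_to_text(html)
```

Separating `html_to_text` from the network call keeps the text extraction testable without internet access.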
S103: Extract keywords from the webpage content, and generate a position identifier for each keyword according to the keywords.
After crawling the webpage content by using the crawler seed, the system extracts keywords from the crawled webpage content. In one implementation, word segmentation is first performed on each crawled webpage by natural language processing (NLP), and keywords are then extracted from the segmented content by the TF-IDF algorithm. For example, a sentence in the crawled webpage content is split into several words by NLP, and keywords are selected from those words by using the TF-IDF algorithm. It should be emphasized that the embodiment is not limited to this implementation; other ways of extracting keywords from webpage content may also be used.
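As a rough illustration of segmentation followed by TF-IDF selection, the sketch below scores each term by term frequency times log(N / document frequency), one common TF-IDF variant; the text does not say which variant is used. The regex tokenizer is only a stand-in for a real word-segmentation step (it would not segment Chinese text), treating the crawled pages as the corpus is an assumption, and `tokenize`, `tfidf_keywords`, and `top_k` are hypothetical names.

```python
import math
import re
from collections import Counter

def tokenize(text):
    # Stand-in for NLP word segmentation: lowercase alphanumeric runs only.
    return re.findall(r"[a-z0-9]+", text.lower())

def tfidf_keywords(pages, top_k=3):
    """Return the top_k highest-scoring terms for each page in pages."""
    docs = [tokenize(page) for page in pages]
    n_docs = len(docs)
    # Document frequency: in how many pages each term appears.
    doc_freq = Counter(term for doc in docs for term in set(doc))

    keywords_per_page = []
    for doc in docs:
        term_freq = Counter(doc)
        scores = {
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in term_freq.items()
        }
        # Highest score first; alphabetical order breaks ties deterministically.
        ranked = sorted(scores, key=lambda term: (-scores[term], term))
        keywords_per_page.append(ranked[:top_k])
    return keywords_per_page
```

Terms that occur on every page receive a score of zero under this variant, so they fall to the bottom of the ranking.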
After extracting the keywords from the webpage content, the embodiment of the invention aggregates them, that is, counts the number of occurrences of each keyword to obtain its word frequency. Each keyword and its word frequency can form a key-value pair <keyword, word frequency>.
In the embodiment of the invention, the keywords are sorted according to their word frequencies, and the position identifier of each keyword is generated according to the sorted order.
It is worth noting that, to reduce the cost of the subsequent matching algorithm, the keywords are preferably sorted in the same way as the keywords of the keyword arrays in the pre-established association between tags and keyword arrays. That is, the keywords can be arranged in descending order of word frequency, so that the keyword that occurs most often in the webpage content has position identifier 1 and the position identifier is incremented by 1 for each subsequent keyword, giving position identifiers 1, 2, 3, and so on.
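The aggregation and ranking for a single crawler seed can be sketched as below. The function name `position_ids` is hypothetical, and breaking ties alphabetically is an assumption added so that the result is deterministic; the text only requires that the ordering rule agree with the one used for the stored keyword arrays.

```python
from collections import Counter

def position_ids(keywords):
    """Map each keyword to its position identifier (1 = highest word frequency)."""
    # Aggregate into <keyword, word frequency> key-value pairs.
    pairs = Counter(keywords).items()
    # Descending word frequency; alphabetical tie-break (an assumption).
    ranked = sorted(pairs, key=lambda pair: (-pair[1], pair[0]))
    return {
        keyword: position
        for position, (keyword, _freq) in enumerate(ranked, start=1)
    }
```

Returning a dictionary rather than a list makes the later position comparison a constant-time lookup per keyword.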
S104: Match the keywords, together with their position identifiers, against each keyword array in the association between tags and keyword arrays.
In the embodiment of the invention, after the keywords with position identifiers corresponding to the crawler seed are obtained, they are matched against the keyword array of each tag in the pre-established association.
In practical application, both the keywords and their position identifiers need to be matched, so that the keyword array with the highest matching degree can finally be determined.
In a preferred embodiment, the keywords are matched, in their sorted order, against each keyword array in the association between tags and keyword arrays. Specifically, each keyword extracted from the webpage content is first compared with the keywords in each keyword array; if it is the same as a keyword in a keyword array, the keyword is successfully matched with that array. Next, it is judged whether the position identifiers of a successfully matched keyword are consistent, that is, whether the position identifier of the keyword extracted from the webpage content equals the position identifier of the same keyword in the keyword array. For a keyword with consistent position identifiers, the matching degree is increased by a first set value, which may be 1; for a keyword with inconsistent position identifiers, the matching degree is increased by a smaller second set value, which may be 0.5. Combining the contributions of all keywords of the crawler seed yields a single numerical matching degree for each keyword array, and the embodiment of the invention takes the tag associated with the keyword array having the maximum matching degree as the tag of the crawler seed.
The matching process is illustrated below. Assume that a keyword extracted from the webpage content, with its position identifier, is <Dashen, 1>, and that the pre-established association between tags and keyword arrays includes the keyword array [<Dashen, 1>, <sister, 2>]. When <Dashen, 1> is matched against this keyword array, the keyword "Dashen" is matched successfully and both position identifiers are "1", so the position identifiers are consistent and the matching degree of this keyword array is increased by 1. The association may also include the keyword array [<Bixue Jian, 1>, <Dashen, 2>, <sister, 3>]. When <Dashen, 1> is matched against this keyword array, the keyword "Dashen" is matched successfully, but the position identifiers are "1" and "2" respectively; because the keyword matches but the position identifiers are inconsistent, the matching degree of this keyword array is increased by only 0.5.
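The scoring rule and the worked example can be expressed in a few lines of Python. The set values 1 and 0.5 come from the text; the function names `matching_degree` and `best_tag` and the tag names "tag A" and "tag B" are hypothetical placeholders.

```python
FIRST_SET_VALUE = 1.0    # keyword matches and position identifiers agree
SECOND_SET_VALUE = 0.5   # keyword matches but position identifiers differ

def matching_degree(seed_keywords, keyword_array):
    """Score one keyword array against the seed's (keyword, position) pairs."""
    array_positions = dict(keyword_array)
    degree = 0.0
    for keyword, position in seed_keywords:
        if keyword not in array_positions:
            continue  # no keyword match, no contribution
        if array_positions[keyword] == position:
            degree += FIRST_SET_VALUE
        else:
            degree += SECOND_SET_VALUE
    return degree

def best_tag(seed_keywords, association):
    """Return the tag whose keyword array has the highest matching degree."""
    return max(
        association,
        key=lambda tag: matching_degree(seed_keywords, association[tag]),
    )

# The example from the text, with placeholder tag names.
seed = [("Dashen", 1)]
association = {
    "tag A": [("Dashen", 1), ("sister", 2)],
    "tag B": [("Bixue Jian", 1), ("Dashen", 2), ("sister", 3)],
}
```

With this data the first array scores 1 and the second scores 0.5, so `best_tag` selects "tag A". When two arrays tie, `max` returns the first one encountered; the text does not say how ties should be resolved.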
S105: Take the tag associated with the keyword array having the highest matching degree as the tag of the crawler seed.
In the method for tagging crawler seeds provided by the embodiment of the invention, an association between tags and keyword arrays is established in advance; when any crawler seed to be tagged is received, webpage content is crawled by using the crawler seed; keywords are extracted from the webpage content and aggregated to obtain the word frequency of each keyword; the keywords are sorted by word frequency to generate a position identifier for each keyword; the keywords, together with their position identifiers, are matched against each keyword array in the association between tags and keyword arrays; and finally the tag associated with the keyword array having the highest matching degree is taken as the tag of the crawler seed. Compared with existing manual tagging, the method tags crawler seeds automatically, with higher efficiency and more accurate tags.
In a preferred implementation, referring to fig. 2, an embodiment of the invention provides another method for tagging crawler seeds, shown as a flow chart. The method specifically comprises the following steps:
S201: Crawl webpage content by using crawler seeds with preset tags, and extract keywords from the webpage content.
S202: Normalize synonyms and near-synonyms among the keywords belonging to the same group of tags into unified keywords by means of a synonym and near-synonym word list, and aggregate the keywords to obtain the word frequency of each keyword.
In the embodiment of the invention, synonyms and near-synonyms among the keywords belonging to the same group of tags are normalized into unified keywords. For example, "the People's Republic of China" and "China" are synonyms and are normalized into the unified keyword "China". This effectively reduces the system cost that synonyms and near-synonyms arising from different ways of expression would otherwise cause in the subsequent matching degree calculation.
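A dictionary lookup is one simple way to sketch this normalization step. The word list below is illustrative: the first entry mirrors the example in the text, while "PRC" and the names `SYNONYM_LIST` and `normalize` are assumptions added for the example.

```python
from collections import Counter

# Hypothetical synonym and near-synonym word list: variant -> unified keyword.
SYNONYM_LIST = {
    "People's Republic of China": "China",
    "PRC": "China",
}

def normalize(keywords, synonym_list=SYNONYM_LIST):
    """Replace each keyword by its unified form; unknown keywords pass through."""
    return [synonym_list.get(keyword, keyword) for keyword in keywords]

extracted = ["China", "People's Republic of China", "economy", "PRC"]
word_freq = Counter(normalize(extracted))
```

Normalizing before aggregation means the three variants are counted as a single keyword, so the word frequency and the resulting position identifier reflect the concept rather than its spelling.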
S203: Arrange the keywords belonging to the same group of tags in descending order of word frequency to generate the position identifier of each keyword.
S204: Take the correspondence between the keywords belonging to the same group of tags and their position identifiers as a keyword array, and associate the keyword array with that group of tags.
In the embodiment of the invention, crawler seeds that have already been tagged are used to determine the website characteristics, that is, the keyword array of the invention, corresponding to the crawler seeds belonging to the same group of tags. The established association between tags and keyword arrays is then used to tag the crawler seeds to be tagged automatically.
S205: When any crawler seed to be tagged is received, crawl webpage content by using the crawler seed.
S206: Extract keywords from the webpage content, normalize synonyms and near-synonyms among the extracted keywords into unified keywords by means of a synonym and near-synonym word list, and aggregate the keywords to obtain the word frequency of each keyword.
After extracting the keywords from the webpage content, the embodiment of the invention normalizes synonyms and near-synonyms among the keywords into unified keywords, which simplifies the subsequent matching degree calculation.
S207: Arrange the keywords in descending order of word frequency to generate the position identifier of each keyword.
S208: Match the keywords, together with their position identifiers and in descending order, against each keyword array in the association between tags and keyword arrays.
S209: Take the tag associated with the keyword array having the highest matching degree as the tag of the crawler seed.
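Steps S205 to S209 can be strung together as in the sketch below, which assumes the association from S201 to S204 already exists. The function `label_seed` and its parameters are hypothetical; `fetch_keywords` stands for whatever crawling and keyword-extraction routine is used and is passed in so that the sketch stays self-contained. The set values 1 and 0.5 follow the earlier example, and the alphabetical tie-break is an assumption.

```python
from collections import Counter

def label_seed(seed_url, association, fetch_keywords, synonym_list=None):
    """Return the tag group whose keyword array best matches the seed's keywords."""
    synonym_list = synonym_list or {}

    # S205-S206: crawl, extract, and normalize keywords.
    keywords = [synonym_list.get(kw, kw) for kw in fetch_keywords(seed_url)]

    # S206-S207: aggregate word frequencies and rank in descending order.
    ranked = sorted(Counter(keywords).items(), key=lambda pair: (-pair[1], pair[0]))
    seed_positions = {kw: pos for pos, (kw, _n) in enumerate(ranked, start=1)}

    # S208: score every keyword array (1 for same position, 0.5 otherwise).
    def degree(keyword_array):
        array_positions = dict(keyword_array)
        return sum(
            1.0 if array_positions[kw] == pos else 0.5
            for kw, pos in seed_positions.items()
            if kw in array_positions
        )

    # S209: the tag group with the highest matching degree wins.
    return max(association, key=lambda tags: degree(association[tags]))
```

Calling `label_seed` with an empty association raises `ValueError` from `max`; a production version would need to decide what to return when no tagged seeds are available or when no keyword matches at all.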
An embodiment of the present invention further provides a device for tagging crawler seeds. Referring to fig. 3, which is a schematic structural diagram of the device, the device includes:
the establishing module 301, configured to establish an association between tags and keyword arrays in advance, wherein each keyword array comprises a correspondence between keywords and position identifiers, and each position identifier is determined according to the keyword to which it corresponds;
the crawling module 302 is configured to receive any crawler seed to be tagged, and crawl web page content by using the crawler seed;
an extracting module 303, configured to extract keywords from the web page content;
a generating module 304, configured to generate a position identifier for each keyword according to the keywords;
a matching module 305, configured to match the keywords, together with their position identifiers, against each keyword array in the association between tags and keyword arrays;
and the tagging module 306, configured to take the tag associated with the keyword array having the highest matching degree as the tag of the crawler seed.
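One way to picture the module structure is as a small class whose fields are the pluggable modules. The sketch below is only an illustration of that composition: the class name `SeedLabelingDevice`, the field names, and the type aliases are assumptions, and each module is represented by a plain callable rather than by any particular implementation.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, List, Tuple

KeywordArray = List[Tuple[str, int]]            # [(keyword, position identifier)]
Association = Dict[FrozenSet[str], KeywordArray]

@dataclass
class SeedLabelingDevice:
    establish: Callable[[], Association]                   # establishing module 301
    crawl: Callable[[str], str]                            # crawling module 302
    extract: Callable[[str], List[str]]                    # extracting module 303
    generate: Callable[[List[str]], KeywordArray]          # generating module 304
    match: Callable[[KeywordArray, KeywordArray], float]   # matching module 305

    def label(self, seed_url: str) -> FrozenSet[str]:      # tagging module 306
        association = self.establish()
        seed_array = self.generate(self.extract(self.crawl(seed_url)))
        # Pick the tag group whose keyword array has the highest matching degree.
        return max(
            association,
            key=lambda tags: self.match(seed_array, association[tags]),
        )
```

Keeping the modules as injected callables makes it straightforward to swap the extraction or matching strategy without touching the rest of the device.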
In practical applications, the generating module includes:
the aggregation submodule is used for aggregating the keywords to obtain the word frequency of each keyword;
and the first sequencing submodule is used for sequencing the keywords according to the word frequency of each keyword to generate the position identification of each keyword.
The establishing module may include:
the crawling submodule is used for crawling webpage content by utilizing crawler seeds preset with labels;
the first extraction submodule is used for extracting keywords from the webpage content and aggregating the keywords belonging to the same group of tags to obtain the word frequency of each keyword;
the second sequencing submodule is used for sorting the keywords belonging to the same group of tags according to the word frequency of each keyword to generate a position identifier for each keyword;
and the establishing sub-module is used for establishing an association relation with the tags by taking the corresponding relation between the keywords belonging to the same group of tags and the position identifications as a keyword array.
In one implementation, the extraction module in the apparatus may include:
the word segmentation sub-module is used for carrying out word segmentation processing on the webpage content through a natural language processing technology;
and the second extraction submodule is used for extracting the key words in the webpage content by using a TF-IDF algorithm.
In order to reduce system consumption in the matching degree calculation, the apparatus may further include:
and the normalization module is used for normalizing the synonyms and the near synonyms in the extracted keywords into unified keywords through the synonym and near synonym word list.
In a preferred embodiment, the matching module includes:
the matching sub-module is used for matching each keyword with each keyword in the keyword array according to the sequence obtained by sequencing the keywords;
the judgment submodule is used for judging whether the position identifications of the successfully matched keywords are consistent or not;
the first increasing submodule is used for increasing the matching degree by a first set value when the result of the judging submodule is yes;
the second increasing submodule is used for increasing the matching degree by a second set value when the result of the judging submodule is negative; wherein the first set value is greater than the second set value.
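The matching degree calculation performed by these submodules can be sketched as follows. The concrete set values (2.0 and 1.0), the function name matching_degree, and the dictionary representation of keyword arrays are assumptions; the embodiment only requires that the first set value be greater than the second.

```python
def matching_degree(seed_positions, keyword_array, first=2.0, second=1.0):
    """Score one keyword array against the seed's keywords and position identifications."""
    if first <= second:
        raise ValueError("the first set value must be greater than the second set value")
    degree = 0.0
    # Visit the seed keywords in their sorted order (ascending position identification).
    for word, pos in sorted(seed_positions.items(), key=lambda kv: kv[1]):
        if word not in keyword_array:
            continue  # keyword not matched: the matching degree is unchanged
        # Matched keyword: larger increment when the position identifications agree.
        degree += first if keyword_array[word] == pos else second
    return degree


seed = {"car": 1, "engine": 2, "price": 3}
auto = {"car": 1, "price": 2, "engine": 3, "tire": 4}
food = {"recipe": 1, "salt": 2}
print(matching_degree(seed, auto), matching_degree(seed, food))
```

Here "car" matches at the same position and contributes the first set value, while "engine" and "price" match at different positions and each contribute the second set value.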
In addition, the device for labeling the crawler seeds comprises a processor and a memory, wherein the establishing module, the crawling module, the extracting module, the sequencing module, the matching module, the labeling module and the like are stored in the memory as program units, and the processor executes the program units stored in the memory to realize corresponding functions.
The processor includes a kernel, and the kernel calls the corresponding program unit from the memory. One or more kernels may be provided; by adjusting kernel parameters, the crawler seeds are labeled automatically, with higher efficiency and more accurate labels.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The device for tagging crawler seeds provided by the embodiment of the invention can realize the following functions: an association relation between labels and keyword arrays is established in advance; when any crawler seed to be labeled is received, webpage content is crawled by using the crawler seed; keywords are extracted from the webpage content and aggregated to obtain the word frequency of each keyword; the keywords are sorted according to word frequency to generate a position identification for each keyword; the keywords with their position identifications are matched against each keyword array in the association relation between the labels and the keyword arrays; and finally, the label associated with the keyword array having the highest matching degree is taken as the label of the crawler seed. Compared with the existing manual labeling mode, the device provided by the embodiment of the invention labels crawler seeds automatically, with higher efficiency and more accurate labels.
The present application further provides a computer program product which, when executed on a data processing device, is adapted to execute program code that initializes the following method steps:
pre-establishing an association relation between a label and a keyword array; the keyword array comprises a corresponding relation between keywords and position identifications, and the position identifications are determined according to the keywords which have the corresponding relation with the position identifications;
receiving any crawler seed to be tagged, and crawling webpage content by using the crawler seed;
extracting keywords from the webpage content, and generating position identifications corresponding to the keywords according to the keywords;
matching the keywords with the position identifications with each keyword array in the association relation between the labels and the keyword arrays respectively;
and taking the label with the association relation with the keyword array with the highest matching degree as the label of the crawler seed.
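The method steps above can be tied together in one short sketch. It is illustrative only: the fetch callable stands in for the crawler, a regular-expression tokenizer stands in for word segmentation and TF-IDF extraction, and the set values 2 and 1, the function names, the URL, and the sample data are all hypothetical.

```python
import re
from collections import Counter


def rank(keywords):
    """Generate position identifications from word frequency (1 = most frequent)."""
    ordered = sorted(Counter(keywords).items(), key=lambda kv: (-kv[1], kv[0]))
    return {word: pos for pos, (word, _) in enumerate(ordered, start=1)}


def score(seed, array, first=2, second=1):
    """Matching degree: `first` for equal positions, `second` for a keyword-only match."""
    return sum(first if array[w] == p else second for w, p in seed.items() if w in array)


def label_seed(seed_url, fetch, tag_index):
    """Return the tag whose keyword array has the highest matching degree."""
    text = fetch(seed_url)  # crawl webpage content with the crawler seed
    keywords = re.findall(r"[a-z]+", text.lower())  # simplified keyword extraction
    seed = rank(keywords)
    return max(tag_index, key=lambda tag: score(seed, tag_index[tag]))


# Pre-established association relation between labels and keyword arrays.
tag_index = {
    "auto": {"car": 1, "engine": 2},
    "food": {"recipe": 1, "salt": 2},
}
# Stub standing in for real crawling; the URL is a placeholder.
pages = {"http://example.com/a": "car car engine salt"}
print(label_seed("http://example.com/a", lambda url: pages[url], tag_index))
```

Passing the fetch function as a parameter keeps the sketch runnable without network access; a real crawler could be substituted without changing the ranking and matching logic.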
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). The memory is an example of a computer-readable medium.
Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
The above are merely examples of the present application and are not intended to limit the present application. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims (12)

1. A method of tagging crawler seeds, the method comprising:
pre-establishing an association relation between a label and a keyword array; the keyword array comprises a corresponding relation between keywords and position identifications, and the position identifications are determined according to the keywords which have the corresponding relation with the position identifications; the association relation between the label and the keyword array is established by using a crawler seed preset with the label; the position identification is used for representing word frequency ordering of the keywords in the webpage content;
receiving any crawler seed to be tagged, and crawling webpage content by using the crawler seed;
extracting keywords from the webpage content, and generating position identifications corresponding to the keywords according to the keywords;
matching the keywords with the position identifications with each keyword array in the association relation between the labels and the keyword arrays respectively; the matching of the keywords with the position identifications with each keyword array in the association relation between the labels and the keyword arrays is specifically as follows: matching the keywords and the position identifications corresponding to the keywords against the keywords and position identifications, having the corresponding relation, that are included in each keyword array in the association relation between the labels and the keyword arrays;
and taking the label with the association relation with the keyword array with the highest matching degree as the label of the crawler seed.
2. The method for tagging crawler seeds according to claim 1, wherein the generating position identifiers corresponding to the keywords according to the keywords comprises:
aggregating the keywords to obtain the word frequency of each keyword;
and sequencing the keywords according to the word frequency of each keyword to generate the position identification of each keyword.
3. The method for tagging crawler seeds of claim 1, wherein the pre-establishing of the association relation between the label and the keyword array, the keyword array comprising a corresponding relation between keywords and position identifications and the position identifications being determined according to the keywords which have the corresponding relation with the position identifications, comprises:
crawling web page content by using crawler seeds with preset labels;
extracting keywords from the webpage content, and aggregating the keywords belonging to the same group of tags to obtain the word frequency of each keyword;
sorting the keywords belonging to the same group of tags according to the word frequency of each keyword to generate a position identifier of each keyword;
and taking the corresponding relation between the keywords belonging to the same group of tags and the position identification as a keyword array, and establishing an association relation with the tags.
4. The method for tagging crawler seeds of claim 1, wherein the extracting keywords from the web page content comprises:
performing word segmentation processing on the webpage content through a natural language processing technology;
and extracting keywords in the webpage content by using a TF-IDF algorithm.
5. The method for tagging crawler seeds as recited in claim 2, wherein before the step of aggregating the keywords to obtain the word frequency of each keyword, the method further comprises:
and normalizing the synonyms and the similar synonyms in the extracted keywords into unified keywords through a synonym and similar synonym word list.
6. The method for tagging crawler seeds as recited in claim 1, wherein the matching the keywords with position identifiers with the keyword arrays in the association relationship between the tags and the keyword arrays comprises:
matching each keyword with each keyword in the keyword array according to the sequence obtained by sequencing the keywords;
judging whether the position identifications of the successfully matched keywords are consistent; if they are consistent, increasing the matching degree by a first set value; if not, increasing the matching degree by a second set value; wherein the first set value is greater than the second set value.
7. An apparatus for tagging crawler seeds, the apparatus comprising:
the establishing module is used for establishing the association relation between the label and the keyword array in advance; the keyword array comprises a corresponding relation between keywords and position identifications, and the position identifications are determined according to the keywords which have the corresponding relation with the position identifications; the association relation between the label and the keyword array is established by using a crawler seed preset with the label; the position identification is used for representing word frequency ordering of the keywords in the webpage content;
the crawling module is used for receiving any crawler seed to be tagged and crawling webpage content by utilizing the crawler seed;
the extraction module is used for extracting keywords from the webpage content;
the generating module is used for generating position identifications corresponding to the keywords according to the keywords;
the matching module is used for respectively matching the keywords with the position identifications with each keyword array in the association relation between the tags and the keyword arrays; the matching of the keywords with the position identifications with each keyword array in the association relation between the labels and the keyword arrays is specifically as follows: matching the keywords and the position identifications corresponding to the keywords against the keywords and position identifications, having the corresponding relation, that are included in each keyword array in the association relation between the labels and the keyword arrays;
and the labeling module is used for taking the label having an association relation with the keyword array with the highest matching degree as the label of the crawler seed.
8. The apparatus for tagging crawler seeds of claim 7, wherein the generating module comprises:
the aggregation submodule is used for aggregating the keywords to obtain the word frequency of each keyword;
and the first sequencing submodule is used for sequencing the keywords according to the word frequency of each keyword to generate the position identification of each keyword.
9. The apparatus for tagging crawler seeds of claim 7, wherein said building module comprises:
the crawling submodule is used for crawling webpage content by utilizing crawler seeds preset with labels;
the first extraction submodule is used for extracting keywords from the webpage content and aggregating the keywords belonging to the same group of tags to obtain the word frequency of each keyword;
the second sequencing submodule is used for sequencing the keywords belonging to the same group of tags according to the word frequency of each keyword to generate the position identification of each keyword;
and the establishing sub-module is used for establishing an association relation with the tags by taking the corresponding relation between the keywords belonging to the same group of tags and the position identifications as a keyword array.
10. The apparatus for labeling crawler seeds of claim 7, wherein said extraction module comprises:
the word segmentation sub-module is used for carrying out word segmentation processing on the webpage content through a natural language processing technology;
and the second extraction submodule is used for extracting the keywords in the webpage content by using a TF-IDF algorithm.
11. The apparatus for tagging crawler seeds of claim 8, further comprising:
and the normalization module is used for normalizing the synonyms and the near synonyms in the extracted keywords into unified keywords through the synonym and near synonym word list.
12. The apparatus for tagging crawler seeds of claim 7, wherein said matching module comprises:
the matching sub-module is used for matching each keyword with each keyword in the keyword array according to the sequence obtained by sequencing the keywords;
the judgment submodule is used for judging whether the position identifications of the successfully matched keywords are consistent or not;
the first increasing submodule is used for increasing the matching degree by a first set value when the result of the judging submodule is yes;
the second increasing submodule is used for increasing the matching degree by a second set value when the result of the judging submodule is negative; wherein the first set value is greater than the second set value.
CN201610987244.2A 2016-11-09 2016-11-09 Method and device for labeling crawler seeds Active CN108062337B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610987244.2A CN108062337B (en) 2016-11-09 2016-11-09 Method and device for labeling crawler seeds

Publications (2)

Publication Number Publication Date
CN108062337A (en) 2018-05-22
CN108062337B (en) 2021-03-16

Family

ID=62136621

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610987244.2A Active CN108062337B (en) 2016-11-09 2016-11-09 Method and device for labeling crawler seeds

Country Status (1)

Country Link
CN (1) CN108062337B (en)

Also Published As

Publication number Publication date
CN108062337A (en) 2018-05-22

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100080 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

GR01 Patent grant