CN107908698B - Topic web crawler method, electronic device, storage medium and system - Google Patents



Publication number
CN107908698B
Authority
CN
China
Prior art keywords
topic
url
text
module
document
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201711071026.5A
Other languages
Chinese (zh)
Other versions
CN107908698A (en)
Inventor
石忠民
徐叶强
钟力
殷长涛
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiaxing Xiaoda Intelligent Technology Co.,Ltd.
Original Assignee
GUANGZHOU SUMMBA INFORMATION TECHNOLOGY CO LTD
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by GUANGZHOU SUMMBA INFORMATION TECHNOLOGY CO LTD
Priority to CN201711071026.5A
Publication of CN107908698A
Application granted
Publication of CN107908698B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G06F16/9566 URL specific, e.g. using aliases, detecting broken or misspelled links
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a topic web crawler method comprising: obtaining the starting URLs of a topic crawler and loading them into a seed task queue; having the topic crawler take the starting URLs from the task queue in order, crawl them, and download the corresponding network documents locally; classifying the network documents by topic relevance to obtain topic-related text; and storing the topic-related text in a data warehouse in structured form. The invention also relates to an electronic device and a readable storage medium for performing the topic web crawler method, and to a topic web crawler system. Word vectors for the word segmentation result are obtained with the CBOW model of word2vec, their dimension is reduced with the principal component analysis (PCA) algorithm, and the dimension-reduced word vectors are classified with an LSTM model. Crawled documents are classified first and then stored selectively by topic, which improves the relevance between the topic and the crawled documents and increases the coverage of a search engine.

Description

Topic web crawler method, electronic device, storage medium and system
Technical Field
The invention relates to the technical field of web crawlers, and in particular to a topic web crawler method, an electronic device, a storage medium and a system.
Background
The rapid growth of the network brings people abundant information but also poses great challenges for information retrieval. A web crawler, also called a network robot, is a program that browses the network automatically. Web crawlers are widely used by Internet search engines and similar websites to automatically collect the page content those engines or sites can access, so that users can quickly retrieve the information they need; the collected page content can also be processed further by the search engine or website, for example for training. At present, even a very large search engine covers less than 50% of the information on the Internet, and the resources of retrieval servers fall far behind the growth of the network. If traditional crawling methods continue to be used, the coverage of information retrieval will keep shrinking.
To solve these problems, and because a user is typically interested in only certain topics when retrieving information, the invention provides a topic web crawler method that uses text classification to target web pages on a specific topic.
Disclosure of Invention
In order to overcome the defects of the prior art, one of the purposes of the invention is to provide a topic web crawler method, which improves the correlation degree of topics and crawled documents and increases the coverage rate of a search engine by classifying the crawled documents.
One of the purposes of the invention is realized by adopting the following technical scheme:
a method of topic web crawler comprising the steps of:
initializing a topic crawler, acquiring a starting URL of the topic crawler, and loading the starting URL into a seed task queue;
crawling document contents, wherein the topic crawler sequentially acquires the initial URL from a task queue for crawling, and downloads a network document corresponding to the initial URL to the local;
text classification, namely performing topic relevance classification on the network document to obtain topic relevant texts;
and text storage, namely storing the subject related text to a data warehouse in a structured manner.
Further, the step of text classification comprises the steps of:
preprocessing a document, namely performing word segmentation and stop word removal processing on the network document;
text data conversion, namely converting the text data after word segmentation into word vectors by adopting a CBOW model of word2vec;
reducing the dimension of the word vector, and reducing the dimension of the word vector by adopting a Principal Component Analysis (PCA) algorithm to obtain a dimension-reduced word vector;
and text classification, namely classifying the dimensionality reduction word vectors by adopting an LSTM model to obtain the classification result.
Further, a step of collecting un-crawled URLs is included between the document content crawling step and the text classification step: the URLs in the network document are obtained, and any URL that has not yet been crawled is added to the seed task queue.
Further, the step of document preprocessing specifically includes performing word segmentation on the network document by using a forward maximum matching algorithm and a CRF word segmentation algorithm.
An electronic device, comprising: a processor;
a memory; and a program, wherein the program is stored in the memory and configured to be executed by the processor, the program comprising instructions for performing the topic web crawler method described above.
A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the topic web crawler method described above.
A topic web crawler system comprises an initialization module, a document crawling module, a text classification module and a text storage module, wherein the initialization module acquires a starting URL of a topic crawler, loads the starting URL into a seed task queue and maintains a crawling sequence of the seed task queue; the document crawling module sequentially acquires the initial URL from a task queue for crawling, downloads a network document corresponding to the initial URL to the local, the text classification module performs topic relevance classification on the network document to acquire a topic related text, and the text storage module structurally stores the topic related text to a data warehouse.
The text classification module further comprises a document preprocessing module, a text data conversion module and a word vector dimension reduction module. The document preprocessing module performs word segmentation and stop-word removal on the network document; the text data conversion module converts the segmented text data into word vectors with the CBOW model of word2vec; the word vector dimension reduction module reduces the dimension of the word vectors with the principal component analysis (PCA) algorithm to obtain dimension-reduced word vectors; and the text classification module classifies the dimension-reduced word vectors with an LSTM model to obtain the classification result.
And further, the system also comprises an un-crawled URL collecting module, wherein the un-crawled URL collecting module acquires the URL of the network document and adds the un-crawled URL into the seed task queue.
Furthermore, the document preprocessing module performs word segmentation on the network document by adopting a forward maximum matching algorithm and a CRF word segmentation algorithm.
Compared with the prior art, the invention has the beneficial effects that:
Word vectors for the word segmentation result are obtained with the CBOW model of word2vec, their dimension is reduced with the principal component analysis (PCA) algorithm, and the dimension-reduced word vectors are classified with an LSTM model. Crawled documents are classified first and then stored selectively by topic, which improves the relevance between the topic and the crawled documents. After the topic-related documents have been crawled and their content structurally extracted, the un-crawled URLs related to the topic in those documents are extracted and passed into the seed task queue, which increases the coverage of the search engine.
The foregoing description is only an overview of the technical solutions of the present invention, and in order to make the technical solutions of the present invention more clearly understood and to implement them in accordance with the contents of the description, the following detailed description is given with reference to the preferred embodiments of the present invention and the accompanying drawings. The detailed description of the present invention is given in detail by the following examples and the accompanying drawings.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a flow chart of the topic web crawler method of the present invention;
FIG. 2 is a flow chart of text classification according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of an LSTM model according to an embodiment of the present invention;
FIG. 4 is a block diagram of a system for a topic web crawler according to the present invention.
Detailed Description
The present invention will be further described with reference to the accompanying drawings and the detailed description, and it should be noted that any combination of the embodiments or technical features described below can be used to form a new embodiment without conflict.
A topic web crawler method, as shown in FIG. 1, comprises the following steps:
Initializing the topic crawler: a traditional crawler collects navigation pages related to the topic on the Internet to obtain a large number of URLs, which serve as the starting URLs of the topic crawler; these starting URLs are loaded into a seed task queue. Preferably, this step further includes popping a single starting URL from the seed task queue as the crawler requires. The seed task queue mainly maintains the seed URLs; its operations are the initial loading of URLs, dequeuing a single URL for crawling, and enqueuing the URLs extracted after a single document has been crawled.
Document content crawling: the topic crawler takes the starting URLs from the task queue in order, crawls them, and downloads the corresponding network documents locally. In this embodiment the URLs in the seed task queue are crawled in order: the network document corresponding to the current URL is crawled and downloaded locally, and the subsequent steps of un-crawled URL collection, text classification and data storage are applied to it; the next URL in the seed task queue is then taken, and the process repeats until all URLs in the seed task queue have been crawled.
In one embodiment, a step of collecting un-crawled URLs is preferably included between the document content crawling step and the text classification step: the URLs in the network document are obtained, and any URL that has not yet been crawled is added to the seed task queue, which increases the coverage of the search engine.
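The queue handling described above (initial loading, dequeuing a single URL, enqueuing newly found URLs that have not been crawled) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the `crawl` function, the `fetch` callable, the `max_pages` limit and the in-memory `FAKE_WEB` are all hypothetical names introduced here, and a real crawler would download and parse pages over HTTP.

```python
from collections import deque
from typing import Callable, Iterable, List, Set, Tuple


def crawl(seed_urls: Iterable[str],
          fetch: Callable[[str], Tuple[str, List[str]]],
          max_pages: int = 100) -> List[Tuple[str, str]]:
    """Crawl in queue order, starting from the seed URLs.

    `fetch` is a hypothetical callable returning (document_text, outgoing_urls)
    for a URL; it stands in for downloading and parsing a page.
    """
    queue = deque()
    seen: Set[str] = set()
    for url in seed_urls:              # initial loading of the seed task queue
        if url not in seen:
            seen.add(url)
            queue.append(url)

    documents = []
    while queue and len(documents) < max_pages:
        url = queue.popleft()          # dequeue a single URL for crawling
        text, links = fetch(url)
        documents.append((url, text))
        for link in links:             # enqueue only URLs not seen before
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return documents


# Toy in-memory "web" standing in for real downloads (assumption).
FAKE_WEB = {
    "a": ("page a", ["b", "c"]),
    "b": ("page b", ["a", "c"]),
    "c": ("page c", []),
}


def fake_fetch(url: str) -> Tuple[str, List[str]]:
    return FAKE_WEB[url]


pages = crawl(["a"], fake_fetch)
```

Tracking a `seen` set alongside the queue is one simple way to decide whether a URL "is not crawled"; the patent does not say how that check is made.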
Text classification: the network documents are classified by topic relevance to obtain topic-related text. As shown in FIG. 2, the text classification step preferably comprises the following steps:
Document preprocessing: word segmentation and stop-word removal are performed on the network document. Preferably, the network document is segmented with a Chinese word segmentation algorithm that combines a forward maximum matching algorithm based on the Sogou corpus with a CRF word segmentation algorithm. Stop-word removal removes from the document the words that are widely used and appear frequently, such as "and", "but" and "this"; stop words are mainly modal words, adverbs, auxiliary words, conjunctions, prepositions and the like.
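As an illustration of the preprocessing step, the sketch below implements forward maximum matching followed by stop-word removal. It is a simplified example under stated assumptions: the dictionary and stop-word list are tiny hypothetical samples rather than the Sogou corpus, the `max_len` window is an arbitrary choice, and the CRF segmenter that the patent combines with forward matching is not shown.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Segment `text` by greedily taking the longest dictionary word at each position."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until a dictionary hit;
        # a single character is always accepted as a fallback.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if size == 1 or candidate in dictionary:
                tokens.append(candidate)
                i += size
                break
    return tokens


def remove_stop_words(tokens, stop_words):
    """Drop tokens that appear in the stop-word list."""
    return [t for t in tokens if t not in stop_words]


# Hypothetical toy resources, for illustration only.
DICTIONARY = {"网络", "爬虫", "网络爬虫", "主题", "方法"}
STOP_WORDS = {"的", "和"}

tokens = forward_max_match("主题网络爬虫的方法", DICTIONARY)
filtered = remove_stop_words(tokens, STOP_WORDS)
```

Here the segmenter prefers the longer entry "网络爬虫" over "网络" followed by "爬虫", which is the defining behavior of maximum matching.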
Text data conversion: the segmented text data is converted into word vectors with the CBOW model of word2vec. The CBOW model is a probabilistic model that predicts a center word from its surrounding words: for a center word $W_t$, the probability that $W_t$ is a given word in the dictionary is computed from the $2K$ surrounding words $W_{t-K}, W_{t-K+1}, \dots, W_{t+K-1}, W_{t+K}$. For a sentence $S$, the CBOW model gives its probability of being natural language:
Figure BDA0001456838610000051
where $P(S)$ is the probability that $S$ is natural language, $T$ is the size of the dictionary, and $P(W_1, \dots, W_t)$ is the joint probability of the words in the text. The following model is constructed for the entire text:
Figure BDA0001456838610000061
Making the value of $P(S)$ as large as possible gives the objective function of the model:
Figure BDA0001456838610000062
for the model of the entire text, the objective function is:
Figure BDA0001456838610000063
Maximizing the objective function gives the training target of the CBOW model. A hierarchical training strategy is adopted, specifically the hierarchical softmax algorithm: Huffman coding is added to the output layer and a Huffman tree is constructed. Each leaf node of the Huffman tree has a unique code and represents a word of the corpus, while the non-leaf nodes represent categories of words. The input layer holds the word vectors of the context words, the hidden layer sums the input-layer word vectors, and the output of the hidden layer is connected to every non-leaf node of the output Huffman tree, each connection having a corresponding weight.
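Two pieces of this description lend themselves to a short sketch: assigning each word a unique Huffman code from its corpus frequency, and forming the hidden layer by summing the context word vectors. The code below is illustrative only; the function names and the toy frequencies are assumptions, and it does not train a model or compute the hierarchical softmax probabilities along the tree path.

```python
import heapq
import itertools


def huffman_codes(freqs):
    """Return {word: binary code string} for a {word: count} mapping.

    Each word is a leaf of the Huffman tree; frequent words receive shorter codes.
    """
    if len(freqs) == 1:                       # degenerate tree with a single leaf
        return {word: "0" for word in freqs}
    tie = itertools.count()                   # tie-breaker so dicts are never compared
    heap = [(count, next(tie), {word: ""}) for word, count in freqs.items()]
    heapq.heapify(heap)
    while len(heap) > 1:
        c1, _, left = heapq.heappop(heap)     # two least frequent subtrees
        c2, _, right = heapq.heappop(heap)
        merged = {w: "0" + code for w, code in left.items()}
        merged.update({w: "1" + code for w, code in right.items()})
        heapq.heappush(heap, (c1 + c2, next(tie), merged))
    return heap[0][2]


def cbow_hidden(context_vectors):
    """Hidden layer of CBOW as described above: element-wise sum of context vectors."""
    return [sum(column) for column in zip(*context_vectors)]


# Hypothetical corpus frequencies, for illustration only.
codes = huffman_codes({"crawler": 5, "topic": 2, "url": 1, "web": 1})
hidden = cbow_hidden([[1.0, 2.0], [3.0, 4.0]])
```

In hierarchical softmax, each bit of a word's code corresponds to a left or right decision at a non-leaf node, so shorter codes for frequent words mean fewer node evaluations per training step.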
Word vector dimension reduction: the dimension of the word vectors obtained from the CBOW model is reduced with the principal component analysis (PCA) algorithm to obtain dimension-reduced word vectors. For a matrix $A = \{a_{ij}\}$ ($i = 1, \dots, m$; $j = 1, \dots, n$), the dimension reduction proceeds as follows:
Calculate the mean $\bar{a}_k$ ($k = 1, 2, \dots, n$) of each column of the matrix $A$;
Calculate the adjusted matrix $B = \{b_{jk}\}$, where $b_{jk} = a_{jk} - \bar{a}_k$ ($j = 1, \dots, m$);
Calculate the covariance matrix $S = \{s_{ik}\}$;
Calculate the eigenvalues $\lambda$ and eigenvectors $e$ of the matrix $S$;
Arrange the eigenvalues in descending order ($\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_n$) and select the corresponding eigenvectors $e$ in that order to construct an eigenvector matrix, in which each row vector represents a document.
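The PCA steps listed above can be sketched with NumPy as follows. This is a minimal sketch under assumptions: rows of the input are taken to be documents and columns to be features, the number of retained components (`n_components`) is a hypothetical parameter the patent does not specify, and the final projection of the centered data onto the selected eigenvectors is the usual PCA convention rather than something stated in the text.

```python
import numpy as np


def pca_reduce(A, n_components):
    """Reduce the columns of A (rows = documents) to `n_components` dimensions."""
    A = np.asarray(A, dtype=float)
    B = A - A.mean(axis=0)                  # subtract each column's mean
    S = np.cov(B, rowvar=False)             # n x n covariance matrix of the columns
    eigvals, eigvecs = np.linalg.eigh(S)    # eigh: S is symmetric; ascending order
    order = np.argsort(eigvals)[::-1]       # largest eigenvalue first
    E = eigvecs[:, order[:n_components]]    # selected eigenvectors as columns
    return B @ E, eigvals[order]            # one reduced row per document


# Toy data: 4 documents with 3 features each (illustrative values only).
A = np.array([[2.0, 0.0, 1.0],
              [4.0, 0.1, 3.0],
              [6.0, -0.1, 5.0],
              [8.0, 0.0, 7.0]])
reduced, eigvals = pca_reduce(A, n_components=2)
```

`np.linalg.eigh` is used instead of `np.linalg.eig` because the covariance matrix is symmetric, which guarantees real eigenvalues and orthogonal eigenvectors.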
Text classification: an LSTM model with an attention model classifies the dimension-reduced word vectors to obtain the classification result. The attention model is used in the word vector encoding stage. For the input sequence $x_1, x_2, \dots, x_T$ of a text, the input vectors of the history nodes are accumulated and averaged to obtain the input vector $X'$ of the whole text, which is the final input of the word vector encoding stage. $h_1, h_2, h_3, \dots, h_T$ are the hidden layer state values corresponding to the input sequence $x_1, x_2, \dots, x_T$, and $H_k$ is the hidden layer state value corresponding to the input $X'$. The model structure is shown in FIG. 3, where $a_{ki}$ is the attention probability of a history node with respect to the last node, $x_1, x_2, \dots, x_T$ are the words of the text, and $X'$ is the input vector of the whole article. To reduce the semantic influence of non-keywords on the whole text, the influence weight of each of $x_1, x_2, \dots, x_T$ on the whole article is computed as follows:
semantic coding to compute attention distribution probability:
Figure BDA0001456838610000071
$e_{ki} = v \tanh(W h_k + U h_i + b)$
where $a_{ki}$ is the attention probability weight of node $i$ with respect to node $k$, $T$ is the number of elements in the input sequence, $v$, $W$ and $U$ are weight matrices, $h_k$ is the hidden layer state corresponding to the last input, and $h_i$ is the hidden layer state value corresponding to the $i$-th element of the input sequence;
calculating semantic codes and feature vectors of attention distribution probability:
Figure BDA0001456838610000072
$H_k' = H(C, h_k, X')$
The final semantic code, that is, the attention-weighted semantic code of the history nodes together with the overall text vector, serves as the input of a conventional LSTM module, and the hidden layer state value $H_k'$ of the final node is taken as the final feature vector. Through these steps the acquired documents are classified according to the required topic, to determine whether each document belongs to that topic.
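The attention computation can be illustrated with a short NumPy sketch. Only the score $e_{ki} = v \tanh(W h_k + U h_i + b)$ and the averaging that produces $X'$ come from the text; normalizing the scores with a softmax over the $T$ history nodes and forming the semantic code $C$ as the weighted sum of hidden states are assumptions based on the common attention formulation, because the corresponding equation images are not reproduced here. The shapes, random parameters and function names are hypothetical, and the LSTM itself is not implemented.

```python
import numpy as np


def text_vector(X):
    """X' as described: the average of the input vectors x_1..x_T."""
    return np.asarray(X, dtype=float).mean(axis=0)


def attention_context(H, h_k, W, U, v, b):
    """Return (attention weights a_ki, semantic code C).

    H   : (T, d) hidden states h_1..h_T of the input sequence
    h_k : (d,)   hidden state corresponding to the last input
    """
    scores = np.array([v @ np.tanh(W @ h_k + U @ h_i + b) for h_i in H])
    scores = scores - scores.max()            # numerical stability
    weights = np.exp(scores) / np.exp(scores).sum()   # assumed softmax normalization
    context = weights @ H                     # assumed weighted sum of hidden states
    return weights, context


# Randomly initialized toy parameters (illustration only, not trained values).
rng = np.random.default_rng(0)
T, d, a = 5, 4, 3
X = rng.normal(size=(T, d))
H = rng.normal(size=(T, d))
h_k = rng.normal(size=d)
W = rng.normal(size=(a, d))
U = rng.normal(size=(a, d))
v = rng.normal(size=a)
b = np.zeros(a)

x_prime = text_vector(X)
weights, context = attention_context(H, h_k, W, U, v, b)
```

In a full model, `context`, `h_k` and `x_prime` would be fed to an LSTM cell to produce the final feature vector used for classification.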
Text storage: the topic-related text is stored in a data warehouse in structured form. Because the data structures of the documents are not uniform, a NoSQL database is used to store the topic-related text.
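One way to picture structured storage of documents with non-uniform structure is as schema-free JSON records keyed by URL. The sketch below is only an illustration: the record fields, the `DocumentStore` class and the example URL are hypothetical, and an in-memory dictionary stands in for whichever NoSQL database is actually used, which the patent does not name.

```python
import json


def to_record(url, text, topic, score):
    """Build a structured record; all field names are hypothetical."""
    return {
        "url": url,
        "topic": topic,
        "score": round(score, 4),
        "text": text,
        "length": len(text),
    }


class DocumentStore:
    """In-memory stand-in for a NoSQL collection keyed by URL."""

    def __init__(self):
        self._docs = {}

    def upsert(self, record):
        # Serialize to JSON, as a document database would store it.
        self._docs[record["url"]] = json.dumps(record, ensure_ascii=False)

    def get(self, url):
        raw = self._docs.get(url)
        return None if raw is None else json.loads(raw)


store = DocumentStore()
store.upsert(to_record("http://example.com/a", "topic crawler text", "crawler", 0.93))
```

Because each record is a free-form document, pages with extra or missing fields can be stored side by side without a fixed table schema.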
An electronic device comprises a processor, a memory and a program, wherein the program is stored in the memory and configured to be executed by the processor, and comprises instructions for performing the topic web crawler method described above. A computer-readable storage medium stores a computer program which, when executed by a processor, performs the topic web crawler method described above.
A topic web crawler system, shown in FIG. 4, comprises an initialization module, a document crawling module, a text classification module and a text storage module. The initialization module obtains the starting URLs of the topic crawler, loads them into a seed task queue and maintains the crawling order of the task queue. The seed task queue mainly maintains the seed URLs; its operations are the initial loading of URLs, dequeuing a single URL for crawling, and enqueuing the URLs extracted after a single document has been crawled. The document crawling module takes the starting URLs from the task queue in order, crawls them, and downloads the corresponding network documents locally; the network documents are unstructured text. The text classification module classifies the network documents by topic relevance to obtain topic-related text, and the text storage module stores the topic-related text in a data warehouse in structured form, for example in a NoSQL data warehouse.
In an embodiment, the text classification module preferably further includes a document preprocessing module, a text data conversion module and a word vector dimension reduction module. The document preprocessing module performs word segmentation and stop-word removal on the network document; further, it segments the network document with a forward maximum matching algorithm based on the Sogou corpus and a CRF word segmentation algorithm. The text data conversion module converts the segmented text data into word vectors with the CBOW model of word2vec; the word vector dimension reduction module reduces the dimension of the word vectors with the principal component analysis (PCA) algorithm to obtain dimension-reduced word vectors; and the text classification module classifies the dimension-reduced word vectors with an LSTM model to obtain the classification result.
In an embodiment, the system preferably further includes an un-crawled URL collection module, which obtains the URLs in the network document and adds the URLs that have not been crawled to the seed task queue.
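The module structure of the system can be sketched as a small class in which each module is a pluggable component. This is a hedged illustration, not the patented design: the class and method names, the `fetch` and `is_on_topic` callables, and the toy web are all hypothetical, and the keyword check stands in for the word2vec, PCA and LSTM classifier described above.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Set, Tuple


@dataclass
class TopicCrawlerSystem:
    fetch: Callable[[str], Tuple[str, List[str]]]         # document crawling module
    is_on_topic: Callable[[str], bool]                    # text classification module
    store: Dict[str, str] = field(default_factory=dict)   # text storage module
    queue: deque = field(default_factory=deque)           # seed task queue
    seen: Set[str] = field(default_factory=set)

    def initialize(self, start_urls: Iterable[str]) -> None:
        """Initialization module: load the starting URLs into the seed task queue."""
        self.collect_uncrawled(start_urls)

    def collect_uncrawled(self, urls: Iterable[str]) -> None:
        """Un-crawled URL collection module: enqueue URLs not seen before."""
        for url in urls:
            if url not in self.seen:
                self.seen.add(url)
                self.queue.append(url)

    def run(self) -> None:
        while self.queue:
            url = self.queue.popleft()
            text, links = self.fetch(url)
            self.collect_uncrawled(links)
            if self.is_on_topic(text):        # keep only topic-related text
                self.store[url] = text


# Toy web and keyword classifier, for illustration only.
WEB = {
    "a": ("crawler topics", ["b"]),
    "b": ("cooking recipes", ["c"]),
    "c": ("crawler queue", []),
}
system = TopicCrawlerSystem(fetch=lambda u: WEB[u],
                            is_on_topic=lambda t: "crawler" in t)
system.initialize(["a"])
system.run()
```

Passing the modules in as callables keeps the crawl loop independent of how documents are downloaded or classified, so either part can be replaced without touching the queue logic.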
The foregoing is merely a preferred embodiment of the invention and is not intended to limit the invention in any manner; those skilled in the art can readily practice the invention as shown and described in the drawings and detailed description herein; however, those skilled in the art should appreciate that they can readily use the disclosed conception and specific embodiments as a basis for designing or modifying other structures for carrying out the same purposes of the present invention without departing from the scope of the invention as defined by the appended claims; meanwhile, any changes, modifications, and evolutions of the equivalent changes of the above embodiments according to the actual techniques of the present invention are still within the protection scope of the technical solution of the present invention.

Claims (6)

1. A method for topic web crawler, comprising the steps of:
initializing a topic crawler, acquiring a starting URL of the topic crawler, and loading the starting URL into a seed task queue;
crawling document contents, wherein the topic crawler sequentially acquires the initial URL from a task queue for crawling, and downloads a network document corresponding to the initial URL to the local;
collecting the un-crawled URL, acquiring the URL of the network document, and adding the un-crawled URL into the seed task queue when the URL of the network document is not crawled;
text classification, namely performing topic relevance classification on the network document to obtain topic relevant texts;
the text storage is used for storing the subject related text to a data warehouse in a structured manner;
the step of text classification comprises the steps of:
preprocessing a document, namely performing word segmentation and stop word removal processing on the network document;
text data conversion, namely converting the text data after word segmentation into word vectors by adopting a CBOW model of word2vec;
reducing the dimension of the word vector, and reducing the dimension of the word vector by adopting a Principal Component Analysis (PCA) algorithm to obtain a dimension-reduced word vector;
and text classification, namely classifying the dimensionality reduction word vectors by adopting an LSTM model to obtain the classification result.
2. The subject web crawler method of claim 1, wherein: the step of document preprocessing specifically comprises the step of performing word segmentation on the network document by adopting a forward maximum matching algorithm and a CRF word segmentation algorithm.
3. An electronic device, characterized by comprising: a processor;
a memory; and a program, wherein the program is stored in the memory and configured to be executed by the processor, the program comprising instructions for performing the method of any of claims 1-2.
4. A computer-readable storage medium having stored thereon a computer program, characterized in that: the computer program is executed by a processor for performing the method according to any of claims 1-2.
5. A subject web crawler system, comprising: the system comprises an initialization module, a document crawling module, a text classification module and a text storage module, wherein the initialization module acquires a starting URL of a topic crawler, loads the starting URL into a seed task queue and maintains a crawling sequence of the seed task queue; the document crawling module sequentially acquires the initial URL from a task queue for crawling, downloads a network document corresponding to the initial URL to the local, the text classification module performs topic relevance classification on the network document to acquire a topic related text, and the text storage module structurally stores the topic related text to a data warehouse; the non-crawl URL collection module acquires the URL of the network document and adds the non-crawl URL into the seed task queue;
the text classification module also comprises a document preprocessing module, a text data conversion module and a word vector dimension reduction module, wherein the document preprocessing module carries out word segmentation and stop word removal processing on the network document; the text data conversion module converts the text data after word segmentation into word vectors by adopting a CBOW model of word2vec; the word vector dimension reduction module adopts a Principal Component Analysis (PCA) algorithm to reduce the dimension of the word vector to obtain a dimension-reduced word vector; and the text classification module classifies the dimensionality reduction word vectors by adopting an LSTM model to obtain the classification result.
6. The subject web crawler system of claim 5, wherein: and the document preprocessing module performs word segmentation on the network document by adopting a forward maximum matching algorithm and a CRF word segmentation algorithm.
CN201711071026.5A 2017-11-03 2017-11-03 Topic web crawler method, electronic device, storage medium and system Active CN107908698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711071026.5A CN107908698B (en) 2017-11-03 2017-11-03 Topic web crawler method, electronic device, storage medium and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711071026.5A CN107908698B (en) 2017-11-03 2017-11-03 Topic web crawler method, electronic device, storage medium and system

Publications (2)

Publication Number Publication Date
CN107908698A (en) 2018-04-13
CN107908698B (en) 2021-04-13

Family

ID=61842571

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711071026.5A Active CN107908698B (en) 2017-11-03 2017-11-03 Topic web crawler method, electronic device, storage medium and system

Country Status (1)

Country Link
CN (1) CN107908698B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595590A (en) * 2018-04-19 2018-09-28 中国科学院电子学研究所苏州研究院 A kind of Chinese Text Categorization based on fusion attention model
CN108681571B (en) * 2018-05-05 2024-02-27 吉林大学 Theme crawler system and method based on Word2Vec
CN109408809A (en) * 2018-09-25 2019-03-01 天津大学 A kind of sentiment analysis method for automobile product comment based on term vector
CN111191095A (en) * 2018-11-14 2020-05-22 中国移动通信集团河北有限公司 Webpage data acquisition method, device, equipment and medium
CN110134858A (en) * 2019-03-26 2019-08-16 国网重庆市电力公司 Method for transformation, system, storage medium and the electronic equipment of unstructured data
CN110532450B (en) * 2019-05-13 2021-05-04 南京大学 Topic crawler method based on improved shark search
CN111767482B (en) * 2020-05-21 2023-06-06 中国地质大学(武汉) Self-adaptive crawling method for focused web crawlers
CN112650570A (en) * 2020-12-29 2021-04-13 百果园技术(新加坡)有限公司 Dynamically expandable distributed crawler system, data processing method and device

Citations (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101320375A (en) * 2008-07-04 2008-12-10 浙江大学 Digital book search method based on user click action
CN101408883A (en) * 2008-11-24 2009-04-15 电子科技大学 Method for collecting network public feelings viewpoint
CN101630327A (en) * 2009-08-14 2010-01-20 昆明理工大学 Design method of theme network crawler system
CN101694658A (en) * 2009-10-20 2010-04-14 浙江大学 Method for constructing webpage crawler based on repeated removal of news
CN104199974A (en) * 2013-09-22 2014-12-10 中科嘉速(北京)并行软件有限公司 Microblog-oriented dynamic topic detection and evolution tracking method
CN105279288A (en) * 2015-12-04 2016-01-27 深圳大学 Online content recommending method based on deep neural network
CN105740381A (en) * 2016-01-27 2016-07-06 北京工业大学 User interest mining method based on complex network characteristics and neural network clustering
CN105824904A (en) * 2016-03-15 2016-08-03 浙江大学 Chinese herbal medicine plant picture capturing method based on professional term vector of traditional Chinese medicine and pharmacy field
CN106611055A (en) * 2016-12-27 2017-05-03 大连理工大学 Chinese hedge scope detection method based on stacked neural network
CN106709052A (en) * 2017-01-06 2017-05-24 电子科技大学 Keyword based topic-focused web crawler design method
CN106737692A (en) * 2017-02-10 2017-05-31 杭州迦智科技有限公司 A kind of mechanical paw Grasp Planning method and control device based on depth projection
CN107133210A (en) * 2017-04-20 2017-09-05 中国科学院上海高等研究院 Scheme document creation method and system
CN107133928A (en) * 2017-04-27 2017-09-05 扬州大学 A kind of facial symmetry plane acquiring method based on iteration mirror image with registration

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020194161A1 (en) * 2001-04-12 2002-12-19 Mcnamee J. Paul Directed web crawler with machine learning

Also Published As

Publication number Publication date
CN107908698A (en) 2018-04-13

Similar Documents

Publication Publication Date Title
CN107908698B (en) Topic web crawler method, electronic device, storage medium and system
CN110222160B (en) Intelligent semantic document recommendation method and device and computer readable storage medium
CN106874378B (en) Method for constructing knowledge graph based on entity extraction and relation mining of rule model
Saraç et al. An ant colony optimization based feature selection for web page classification
CN107577671B (en) Subject term extraction method based on multi-feature fusion
WO2021051518A1 (en) Text data classification method and apparatus based on neural network model, and storage medium
WO2017090051A1 (en) A method for text classification and feature selection using class vectors and the system thereof
CN105139237A (en) Information push method and apparatus
CN110795527B (en) Candidate entity ordering method, training method and related device
CN107329954B (en) Topic detection method based on document content and mutual relation
CN108959413B (en) Topic webpage crawling method and topic crawler system
CN111159485A (en) Tail entity linking method, device, server and storage medium
CN110717042A (en) Method for constructing document-keyword heterogeneous network model
CN114329225A (en) Search method, device, equipment and storage medium based on search statement
CN112199508A (en) Parameter adaptive agricultural knowledge graph recommendation method based on remote supervision
CN109948154A (en) A kind of personage's acquisition and relationship recommender system and method based on name
Babapour et al. Web pages classification: An effective approach based on text mining techniques
CN111325018A (en) Domain dictionary construction method based on web retrieval and new word discovery
CN112445862B (en) Internet of things equipment data set construction method and device, electronic equipment and storage medium
CN110728135A (en) Text theme indexing method and device, electronic equipment and computer storage medium
CN107766419B (en) Threshold denoising-based TextRank document summarization method and device
JP2005346598A (en) Web information collection device, web crawler program and web information collection method
CN106294295B (en) Article similarity recognition method based on word frequency
CN110609997B (en) Method and device for generating abstract of text
CN110020024B (en) Method, system and equipment for classifying link resources in scientific and technological literature

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20220105

Address after: Room 503, Block C, Future Science and Technology Plaza, No. 136 Xiuzhou Avenue, Xincheng Street, Xiuzhou District, Jiaxing City, Zhejiang Province 314000

Patentee after: Jiaxing Xiaoda Intelligent Technology Co.,Ltd.

Address before: Room 431, 4th Floor, No. 1 Tiantai 1st Road, Science City, Guangzhou Hi-Tech Industrial Development Zone, Guangdong 510000

Patentee before: GUANGZHOU SUMMBA INFORMATION TECHNOLOGY Co.,Ltd.
