CN112035723A - Resource library determination method and device, storage medium and electronic device

Info

Publication number: CN112035723A
Application number: CN202010888627.0A
Authority: CN
Inventors: 朱学锋; 铁力; 何沉; 田然; 田江; 向小佳; 丁永建; 李璠
Original assignee: Everbright Technology Co ltd
Current assignee: Everbright Technology Co ltd
Priority date: 2020-08-28
Filing date: 2020-08-28
Publication date: 2020-12-04

Abstract

The invention discloses a method and a device for determining a resource library, a storage medium, and an electronic device. The method comprises the following steps: acquiring, through a network capturer, a first webpage address corresponding to a target theme, and generating a first address queue according to the first webpage address; acquiring, by using the network capturer, the webpage source code corresponding to the first address queue; generating a webpage vector according to the webpage source code, determining a first distance between the webpage vector and a pre-established standard feature vector, and deleting from the first address queue each second webpage address whose first distance is smaller than a first preset threshold value, to obtain a second address queue, wherein the webpage vector is generated according to text information corresponding to the webpage source code; and acquiring target resource information from the webpage source code corresponding to the second address queue, and storing the target resource information in a target resource library, wherein the target resource information describes resources related to the target theme, and the target resource library stores the resource information related to the target theme.

Description

Resource library determination method and device, storage medium and electronic device

Technical Field

The invention relates to the field of computers, in particular to a method and a device for determining a resource library, a storage medium and an electronic device.

Background

A traditional search engine uses a general-purpose crawler to collect web pages and information from the internet, and the collected web page information is used to build the index that supports the search engine. The crawler determines whether the content of the whole engine system is rich and whether its information is up to date, so the performance of the crawler directly influences the effectiveness of the search engine.

Although a traditional search engine has powerful web crawlers and wide coverage, its classification speciality is poor, its search results are unsatisfactory, and it cannot accurately understand words in certain specific fields, such as financial industry terms.

On the other hand, existing financial information acquisition systems concentrate on algorithm optimization problems in financial topic search engines, such as the design of the search engine, the topic crawler algorithm, and the information source discovery method, but pay little attention to the identification and extraction of financial knowledge.

Therefore, in the related art, no effective solution has yet been proposed for the problem that the poor classification speciality of the traditional search engine leads to a poor vocabulary recognition effect in certain specific fields.

Disclosure of Invention

The embodiment of the invention provides a method and a device for determining a resource library, a storage medium and an electronic device, which are used for at least solving the technical problem that the vocabulary recognition effect in certain specific fields is poor due to the poor classification speciality of the traditional search engine in the related technology.

According to an aspect of an embodiment of the present invention, a method for determining a resource pool is provided, including: acquiring a first webpage address corresponding to a target theme through a network capturer, and generating a first address queue according to the first webpage address; acquiring a webpage source code corresponding to the first address queue by using the network capturer; generating a webpage vector according to the webpage source code, determining a first distance between the webpage vector and a pre-established standard feature vector, and deleting a second webpage address corresponding to a first distance smaller than a first preset threshold value from the first address queue to obtain a second address queue, wherein the webpage vector is generated according to text information corresponding to the webpage source code; and acquiring target resource information from the webpage source code corresponding to the second address queue, and storing the target resource information into a target resource library, wherein the target resource information is used for describing resources related to the target theme, and the target resource library is used for storing the resource information related to the target theme.

According to another aspect of the embodiments of the present invention, there is also provided a device for determining a resource pool, including: the first processing unit is used for acquiring a first webpage address corresponding to a target theme through a network capturer and generating a first address queue according to the first webpage address; a first obtaining unit, configured to obtain, by using the network capturer, a webpage source code corresponding to the first address queue; a first determining unit, configured to generate a webpage vector according to the webpage source code, determine a first distance between the webpage vector and a pre-established standard feature vector, and delete a second webpage address corresponding to a first distance smaller than a first preset threshold from the first address queue to obtain a second address queue, where the webpage vector is generated according to text information corresponding to the webpage source code; and the second processing unit is used for acquiring target resource information from the webpage source code corresponding to the second address queue and storing the target resource information into a target resource library, wherein the target resource information is used for describing resources related to the target theme, and the target resource library is used for storing the resource information related to the target theme.

According to another aspect of the embodiments of the present invention, there is also provided a computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to execute the method for determining the repository when running.

According to another aspect of the embodiments of the present invention, there is also provided an electronic apparatus, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor executes the method for determining the resource pool through the computer program.

According to the invention, a first webpage address corresponding to a target subject is obtained through a network capturer, and a first address queue is generated according to the first webpage address; acquiring a webpage source code corresponding to the first address queue by using the network capturer; generating a webpage vector according to the webpage source code, determining a first distance between the webpage vector and a pre-established standard feature vector, and deleting a second webpage address corresponding to a first distance smaller than a first preset threshold value from the first address queue to obtain a second address queue, wherein the webpage vector is generated according to text information corresponding to the webpage source code; and acquiring target resource information from the webpage source code corresponding to the second address queue, and storing the target resource information into a target resource library, wherein the target resource information is used for describing resources related to the target theme, and the target resource library is used for storing the resource information related to the target theme, so that the technical problem that in the related technology, the vocabulary identification effect on certain specific fields is poor due to poor classification speciality of the traditional search engine is solved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a schematic diagram of an application environment of a resource pool determination method according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a web crawler workflow common in the related art;

FIG. 3 is a schematic diagram of a data acquisition system according to the related art;

FIG. 4 is a flow chart illustrating an alternative method for determining a resource pool according to an embodiment of the present invention;

FIG. 5 is a flow chart illustrating an alternative resource pool determination method according to an embodiment of the present invention;

FIG. 6 is a diagram illustrating an alternative tree node structure according to an embodiment of the present invention;

FIG. 7 is a flow diagram of an alternative web crawler algorithm according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of an alternative resource pool determination apparatus according to an embodiment of the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

According to an aspect of an embodiment of the present invention, a method for determining a resource pool is provided. Alternatively, the method for determining the resource pool can be applied to, but is not limited to, the application environment shown in fig. 1. As shown in fig. 1, the terminal device 102 or the server 104 acquires a first web page address corresponding to a target topic through a network capturer, and generates a first address queue according to the first web page address; acquiring a webpage source code corresponding to the first address queue by using the network capturer; generating a webpage vector according to the webpage source code, determining a first distance between the webpage vector and a pre-established standard feature vector, and deleting a second webpage address corresponding to a first distance smaller than a first preset threshold value from the first address queue to obtain a second address queue, wherein the webpage vector is generated according to text information corresponding to the webpage source code; and acquiring target resource information from the webpage source code corresponding to the second address queue, and storing the target resource information into a target resource library, wherein the target resource information is used for describing resources related to the target theme, and the target resource library is used for storing the resource information related to the target theme. The above is merely an example, and the embodiments of the present application are not limited herein.

Optionally, in this embodiment, the terminal device may include, but is not limited to, at least one of the following: mobile phones (such as Android phones, iOS phones, etc.), notebook computers, tablet computers, palm computers, MID (Mobile Internet Devices), PAD, desktop computers, etc. The network may include, but is not limited to, a wired network and a wireless network, wherein the wired network comprises a local area network, a metropolitan area network and a wide area network, and the wireless network comprises Bluetooth, WIFI and other networks that enable wireless communication. The server may be a single server or a server cluster composed of a plurality of servers. The above is only an example, and the present embodiment is not limited to this.

In the related art, the conventional search engine uses a general crawler to collect web pages from the internet and collect information, and the web page information is used for establishing an index for the search engine so as to provide support.

The basic workflow of the universal crawler is shown in FIG. 2:

First, a set of seed URLs is selected and put into the queue of URLs to be captured.

Next, a URL to be captured is taken out, DNS is resolved to obtain the IP of the host, the webpage corresponding to the URL is downloaded and stored in the downloaded webpage library, and the URL is put into the captured URL queue.

Then, the webpages corresponding to the URLs in the captured URL queue are analyzed to extract the other URLs they contain, these URLs are put into the queue of URLs to be captured, and the next cycle begins while the queue is not empty.
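The loop described above can be sketched in Python as follows. This is a minimal illustration rather than the crawler of any particular search engine: the in-memory `FAKE_WEB` dictionary, the `fetch` stand-in for DNS resolution and download, and all function names are hypothetical, chosen so that the sketch runs without network access.

```python
from collections import deque
import re

# Hypothetical in-memory "web" so the sketch runs without network access.
FAKE_WEB = {
    "http://a.example/": '<a href="http://b.example/">b</a> <a href="http://c.example/">c</a>',
    "http://b.example/": '<a href="http://a.example/">a</a>',
    "http://c.example/": "no links here",
}


def fetch(url):
    """Stand-in for DNS resolution plus page download (assumption)."""
    return FAKE_WEB.get(url, "")


def extract_links(html):
    """Pull href targets out of a page with a simple pattern."""
    return re.findall(r'href\s*=\s*["\'](.*?)["\']', html)


def crawl(seeds):
    to_crawl = deque(seeds)   # queue of URLs to be captured
    seen = set(seeds)         # every URL that has ever been queued
    crawled = []              # captured URL queue, in visit order
    pages = {}                # downloaded webpage library
    while to_crawl:           # next cycle while the queue is not empty
        url = to_crawl.popleft()
        pages[url] = fetch(url)
        crawled.append(url)
        for link in extract_links(pages[url]):
            if link not in seen:
                seen.add(link)
                to_crawl.append(link)
    return crawled, pages
```

Calling `crawl(["http://a.example/"])` visits the three hypothetical pages in breadth-first order. The `seen` set is a design choice of the sketch that keeps a URL from being queued twice.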

In the related art, a typical financial information acquisition system has 3 modules, namely a data acquisition module, a data preprocessing module and a data application module.

1. A data acquisition module: the data acquisition module is the main body of the resource layer of the system and is responsible for searching for webpages related to the financial profession. It comprises 3 parts, namely a data collector, a data analyzer and a WEB database, and completes the scanning, collection and page analysis of webpage data, the judgment of the relevance between a URL and the theme (link filtering), the judgment of the relevance between a page and the theme (page filtering), processing, warehousing, and the like. The module runs automatically and periodically in the background of the server. The data collector reads the records in the key dictionary and starts and releases a plurality of network capturers at regular intervals; the network capturers scan the internet and capture webpage content related to the financial profession. The data analyzer performs preliminary analysis and processing on the data extracted by the data collector and stores the processed result in the WEB database. This temporary database mainly comprises fields such as URL address, title, website name (judged according to the URL), main content (or local file link), release time, extraction time, attributes and snapshot link, and repeated information is deleted.

2. A data preprocessing module: this module runs in the foreground and requires the intervention of an administrator during operation. It analyzes and processes the data of the system resource layer. Its main functions are to read newly added data records from the WEB database, index, analyze and process them, establish a classified index for the data, and finally store the data in the central database of the local server in a specified format, while establishing an efficient index suitable for query. Staff use the source code editor provided by the system to analyze the source code of unanalyzed webpages and determine the position and identifying symbols of information such as the title and main content. If a piece of data meets the system requirements, it is stored in the central database.

3. A data application module: the data application module provides a retrieval interface to the user in the form of a website. The user submits a query, the system retrieves the data matching the query from the central database, and the search result is fed back to the user in the form of a webpage. The final portal website provides functions such as classified display, combined query and result output. Its pages mainly comprise a home page, column pages, detail pages, a query page, a query result output page and the like. All data comes from the central database after analysis and processing. A user registered on the website can use the "information customizing function" on the personalized service webpage to meet individual query requirements, and can regularly receive the latest customized information by email.

The overall workflow is shown in FIG. 3.

However, the above solutions have the following disadvantage:

Although the traditional search engine has strong web crawlers and wide coverage, its classification speciality is poor, its search results are unsatisfactory, and it cannot accurately understand words used in the financial industry.

In order to solve the above problem, in this embodiment, as an optional implementation manner, the method may be executed by a server, or may be executed by a terminal device, or may be executed by both the server and the terminal device, and in this embodiment, the description is given by taking an example that the method is executed by a terminal device (e.g., the terminal device 102 described above). As shown in fig. 4, the flow of the method for determining a resource pool may include the steps of:

Step S402, acquiring a first webpage address corresponding to a target theme through a network capturer, and generating a first address queue according to the first webpage address;

Step S404, acquiring a webpage source code corresponding to the first address queue by using the network capturer;

Step S406, generating a webpage vector according to the webpage source code, determining a first distance between the webpage vector and a pre-established standard feature vector, and deleting a second webpage address corresponding to a first distance smaller than a first preset threshold value from the first address queue to obtain a second address queue, wherein the webpage vector is generated according to text information corresponding to the webpage source code;

Step S408, acquiring target resource information from the webpage source code corresponding to the second address queue, and storing the target resource information in a target resource library, wherein the target resource information is used to describe a resource related to the target theme, and the target resource library is used to store the resource information related to the target theme.
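The filtering in step S406 can be illustrated with a short Python sketch. Several points here are assumptions rather than statements of the claimed method: the webpage vector is modelled as a bag-of-words term count, and the "first distance" is read as a cosine similarity score, so that a smaller value means a page is less related to the target theme and its address is removed. The function names and the threshold value are hypothetical.

```python
import math
from collections import Counter


def text_vector(text):
    """Bag-of-words term counts as a sparse vector (assumed representation)."""
    return Counter(text.lower().split())


def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def filter_queue(first_queue, page_text, standard_vector, threshold):
    """Keep only addresses whose page scores at least `threshold`."""
    second_queue = []
    for url in first_queue:
        score = cosine(text_vector(page_text[url]), standard_vector)
        if score >= threshold:      # addresses scoring below are deleted
            second_queue.append(url)
    return second_queue


# Hypothetical data: one finance page and one unrelated page.
standard = text_vector("bond yield interest rate")
texts = {
    "u1": "bond yield rises as interest rate falls",
    "u2": "football match result",
}
second = filter_queue(["u1", "u2"], texts, standard, 0.3)
```

With this hypothetical data, only the address whose text shares terms with the standard vector remains in `second`.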

According to the embodiment, a first webpage address corresponding to a target theme is obtained through a network capturer, and a first address queue is generated according to the first webpage address; acquiring a webpage source code corresponding to the first address queue by using the network capturer; generating a webpage vector according to the webpage source code, determining a first distance between the webpage vector and a pre-established standard feature vector, and deleting a second webpage address corresponding to a first distance smaller than a first preset threshold value from the first address queue to obtain a second address queue, wherein the webpage vector is generated according to text information corresponding to the webpage source code; and acquiring target resource information from the webpage source code corresponding to the second address queue, and storing the target resource information into a target resource library, wherein the target resource information is used for describing resources related to the target theme, and the target resource library is used for storing the resource information related to the target theme, so that the technical problem that in the related technology, the vocabulary identification effect on certain specific fields is poor due to poor classification speciality of the traditional search engine is solved.

Optionally, in this embodiment, the obtaining, by the network capturer, the first web page address corresponding to the target topic includes: crawling, through the network capturer and according to the target theme, the addresses in the address base in sequence according to a preset priority to obtain the first webpage address.

Optionally, in this embodiment, after the web page source code corresponding to the first address queue is obtained by using the network capturer, the method further includes: determining a tree node structure corresponding to the webpage source code corresponding to the first address queue; and denoising the parts of the tree node structure that correspond to visual features, wherein the parts corresponding to visual features are the style and script nodes.
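As a hedged illustration of this denoising step, the sketch below uses Python's standard `html.parser` module to walk the tag structure and discard everything inside `style` and `script` elements. The class and function names are hypothetical, and a production system would likely need a more tolerant parser for malformed pages.

```python
from html.parser import HTMLParser


class Denoiser(HTMLParser):
    """Collects page text while skipping style and script content."""

    NOISE_TAGS = {"style", "script"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0   # > 0 while inside a noise element
        self.chunks = []       # retained text fragments, in document order

    def handle_starttag(self, tag, attrs):
        if tag in self.NOISE_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.NOISE_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def denoise(html):
    parser = Denoiser()
    parser.feed(html)
    parser.close()
    return parser.chunks


sample = (
    "<html><head><title>Rates</title><style>p{color:red}</style>"
    "<script>var x=1;</script></head>"
    "<body><p>Bond yields rose.</p></body></html>"
)
clean = denoise(sample)
```

For the sample page, `clean` keeps the title and paragraph text and drops the stylesheet and script bodies.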

Optionally, in this embodiment, the generating a webpage vector according to the webpage source code includes: extracting the webpage source codes corresponding to the first address queue by adopting a regular expression to obtain the text information; and generating the webpage vector according to the text information.

Optionally, in this embodiment, after obtaining the target resource information from the web page source code corresponding to the second address queue, the method further includes: determining fingerprint information corresponding to each webpage address in the second address queue according to the target resource information, wherein the fingerprint information is used for representing the target resource information; determining a second distance between any two webpage addresses in a second webpage address according to the fingerprint information, wherein the second webpage address is a webpage address in the second address queue; and deleting one webpage address of any two webpage addresses corresponding to the second distance smaller than a second preset threshold value to obtain a target address queue.
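The deduplication described above can be sketched as a greedy pass over the second address queue. The sketch makes several assumptions: fingerprints are supplied as a precomputed mapping, the distance function is passed in by the caller, and of two near-duplicate addresses the one that appears later in the queue is the one deleted. The names are hypothetical.

```python
def deduplicate(second_queue, fingerprints, distance, threshold):
    """Drop an address when its fingerprint is closer than `threshold`
    to the fingerprint of an address that has already been kept."""
    target_queue = []
    for url in second_queue:
        is_duplicate = any(
            distance(fingerprints[url], fingerprints[kept]) < threshold
            for kept in target_queue
        )
        if not is_duplicate:
            target_queue.append(url)
    return target_queue


# Hypothetical one-dimensional fingerprints and an absolute-difference distance.
fps = {"u1": 10, "u2": 11, "u3": 50}
target = deduplicate(["u1", "u2", "u3"], fps, lambda a, b: abs(a - b), 5)
```

Here `u2` is removed because its fingerprint lies within the threshold of `u1`, while `u3` is kept.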

It should be noted that, in recent years, the overall market size of the industry intelligence system (BI) industry has been rising year by year. The problem processing capability of an intelligent BI system is determined by the richness of its knowledge base, and how to improve that richness is a difficult problem. Although the traditional search engine has strong web crawlers and wide coverage, its classification speciality is poor, its search results are unsatisfactory, and it cannot accurately understand words used in the financial industry.

Moreover, most existing financial information acquisition systems focus on research into financial topic search engines, mainly studying algorithm optimization problems such as the design of the search engine, the topic crawler algorithm and the information source discovery method, while paying little attention to the identification and extraction of financial knowledge.

Based on these problems, the present application uses web crawlers to identify the financial knowledge in webpages and then evaluates, extracts and deduplicates it, so that the problems of the traditional approach can be solved.

The following describes the flow of the method for determining a resource library with reference to an optional example, where the method mainly includes the following steps:

By analyzing the main financial websites, the basic workflow, system function modules and database of the financial knowledge acquisition system are designed, the web crawler rules are improved, and a financial topic crawler algorithm is provided. Credible financial knowledge is found through technologies such as webpage denoising, intelligent knowledge matching, knowledge evaluation, knowledge extraction and knowledge deduplication, which enriches the financial knowledge base and improves the problem processing capability of intelligent systems in the financial industry. Applying the device can improve the knowledge richness of a financial industry intelligent system, so that the knowledge in the system is updated in step with the information on the internet.

Optionally, the specific technical solution is as follows:

A resource library determining system comprises an initialization module, a storage module, an address queue generating module, a webpage capturing module, an analyzing module, a financial knowledge filtering module, a financial knowledge extracting module, a financial knowledge formatting processing module and the like. The modules work cooperatively to form an organic whole and realize functions such as crawler address queue generation, webpage information capture, financial knowledge denoising, financial knowledge identification, financial knowledge extraction, financial knowledge evaluation, financial knowledge deduplication and financial knowledge storage.

Optionally, the basic workflow of the resource library determination system (the financial knowledge acquisition system) is shown in FIG. 5:

Step 1, a web crawler acquires addresses from a financial subject address library to form a work queue;

Step 2, capturing the webpage source code corresponding to each address in the address queue acquired by the crawler, and denoising the financial knowledge;

Optionally, the financial webpage needs to be denoised. Analysis of the main financial industry websites shows that the HTML source code of a financial topic webpage usually contains 2 parts, namely head and body. The web crawler obtains and processes the address queue and, through program operation, acquires the HTML source code of the financial webpage documents corresponding to the address queue. A tree node structure is drawn for each financial topic webpage collected by the crawler, as shown in FIG. 6. The page tree structure may be expressed as (html (meta, keywords, description), style, description)), (body (tr (td (text))), (div (li (a)), (span (text))), description …).

The visual characteristics of a financial topic webpage are generally embodied in aspects such as fonts, background colors and paragraph division; semantic information is typically expressed as a type of page content, such as text, multimedia or hyperlinks. From the tree representation, the visual feature nodes and the page content nodes of the financial topic webpage can be found. The title, keywords and description in the head part directly describe the page or the financial knowledge, closely match the content of the webpage, and can be used to judge whether the page is related to financial topics. In a financial topic webpage, the style and script parts represent visual features, interfere with financial knowledge acquisition, and are all treated as noise. Financial knowledge is generally hidden in text tags such as the table cell tag <td></td>, <span></span>, <b></b> and <h1></h1>, and must be purified after matching to obtain the knowledge. The <ul></ul> and <li></li> tags can be used to locate addresses and obtain anchor text information. After the anchor text is evaluated to determine whether it is related to the financial knowledge topic, it can be decided whether to add the address to the crawling queue. Most other tags are noise that interferes with the acquisition of financial knowledge.
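The use of anchor text to decide whether an address joins the crawling queue can be sketched as follows. The relevance test here is a simple keyword match, which is an assumption made for illustration; the text does not specify how the anchor text is scored. The class name, function name and topic words are hypothetical.

```python
from html.parser import HTMLParser


class AnchorCollector(HTMLParser):
    """Collects (href, anchor text) pairs from <a> elements."""

    def __init__(self):
        super().__init__()
        self._href = None    # href of the <a> currently open, if any
        self._text = []      # text fragments seen inside that <a>
        self.anchors = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._href = dict(attrs).get("href")
            self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.anchors.append((self._href, "".join(self._text).strip()))
            self._href = None


def relevant_links(html, topic_words):
    """Return hrefs whose anchor text mentions any topic word."""
    collector = AnchorCollector()
    collector.feed(html)
    collector.close()
    return [
        href
        for href, text in collector.anchors
        if any(word in text.lower() for word in topic_words)
    ]


menu = (
    '<ul><li><a href="/bonds">Bond market news</a></li>'
    '<li><a href="/sports">Football scores</a></li></ul>'
)
queue_candidates = relevant_links(menu, {"bond", "loan"})
```

Only the link whose anchor text contains a topic word is returned as a candidate for the crawling queue.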

Step 3, extracting text information characteristics according to the source code characteristic intelligent matching rule, establishing a target webpage vector, extracting link addresses, combining the same addresses and adding the same addresses to a crawler work queue;

Optionally, financial knowledge acquisition requires knowledge extraction, which identifies data related to financial knowledge in the unstructured or semi-structured information contained in a financial topic webpage and converts the data into a format with clearer structure and semantics. In the invention, financial knowledge extraction is realized by a method based on the HTML structure, and this extraction method is implemented with regular expressions. In general, the regular expressions of such a system are fixed. In the invention, however, in addition to the knowledge acquisition regular expressions contained in the system rule base, the method also allows a user to specify an HTML tag rule for a specific page and intelligently generates the regular expression for that page. Besides the content part, user-defined rules may also cover page attributes (including meta information such as title, keywords and description), addresses, article titles, publishing time, information sources and the like. Customized rules allow the matching performed by the system to meet the diverse requirements of users. The crawler needs to analyze the financial webpage source code to obtain the group of addresses contained in each page and form a work queue. Assuming that the regular expression for extracting all links of a page is z1, z1 is expressed as shown in equation (1):

z1 = (?<=href*?=*?[\'\"]).*?(?=[\'\"])    (1)
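A Python sketch of link extraction in the spirit of equation (1) follows. It is not a literal transcription: Python's `re` module accepts only fixed-width lookbehind, so the sketch uses a capture group instead of the lookbehind and lookahead, and the `\s*` whitespace handling around the equals sign is an assumption. Merging identical addresses, as mentioned in Step 3, is done with a set; the names are hypothetical.

```python
import re

# Capture group in place of lookbehind/lookahead; whitespace handling assumed.
LINK_PATTERN = re.compile(r'href\s*=\s*[\'"](.*?)[\'"]', re.IGNORECASE)


def extract_links(html):
    """Return the distinct link targets of a page, in order of first appearance."""
    seen = set()
    links = []
    for link in LINK_PATTERN.findall(html):
        if link not in seen:     # merge identical addresses
            seen.add(link)
            links.append(link)
    return links


page = '<a href="/a">x</a><a HREF = \'/b\'>y</a><a href="/a">z</a>'
links = extract_links(page)
```

The repeated `/a` address appears only once in `links`, and the differently quoted and spaced `HREF` attribute is still matched.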

When feature URLs are extracted according to URL features input by a user, assume that the regular expression to be generated is z2; z2 is expressed as shown in equation (2):

z2 = "(?<=" + s1.split[i] + ").*?(?=" + s1.split[i+1] + ")"    (2)

According to the generated regular expression, the system analyzes the page source code acquired by the crawler and then extracts the address queue to be collected. Regular expressions are not fixed, and a variety of conditions must be considered when defining them. Suppose the title of a piece of financial knowledge uses the <h1></h1> tag. There may be many identical tags in the page, so the tag alone does not uniquely identify the title; all <h1></h1> tag contents are therefore extracted and analyzed further. In processing, several wildcards of the regular expression are used, such as (+?. Assuming that the regular expression for extracting the title content is z3, z3 is expressed as shown in equation (3):

z3 = "(?<=" + s2.split[i] + "+?>)*?(?=" + s2.split[i+1] + ")"    (3)

After all the tag queues are obtained, the characteristic information of each record in the queues, such as id, class and other attribute contents, is analyzed, and financial knowledge information is extracted according to the characteristic attributes of different websites. The extraction rule for the body text is more complicated than address extraction and title extraction. Text extraction not only requires precision but also needs to keep an appropriate line-break format so that the result can be applied directly in an industry intelligent system. Assuming that the regular expression for extracting the text is z4, z4 is expressed as shown in equation (4):

z₄ = "(?<=" + s₃.split[i] + ")([^^]*?)(?=" + s₃.split[i+1] + ")"    (4)

After the body text is extracted, the content is screened: visual-feature tags irrelevant to financial knowledge are removed, and some line-break tags, such as <hr/> and <br/>, are retained. In equations (2), (3) and (4), s1, s2 and s3 all denote tag character strings input by the user, and split[i] denotes the element with subscript i of the string array obtained by splitting the character string with the split() method.
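The screening of extracted body text can be sketched as follows. This is an illustrative assumption about how the tags might be handled, not the patent's code: every tag is removed except `<br>` and `<hr>`, which are rewritten to one canonical form. The names `KEPT_TAGS`, `TAG_PATTERN` and `screen_body` are hypothetical.

```python
import re

# Tags kept as line-break markers; every other tag is treated as visual markup.
KEPT_TAGS = ("br", "hr")
TAG_PATTERN = re.compile(r"<\s*/?\s*([a-zA-Z][a-zA-Z0-9]*)[^>]*>")


def screen_body(html_fragment: str) -> str:
    """Strip tags except <br/> and <hr/>, which are normalised to a canonical form."""

    def replace(match):
        name = match.group(1).lower()
        return f"<{name}/>" if name in KEPT_TAGS else ""

    return TAG_PATTERN.sub(replace, html_fragment)


# Hypothetical extracted fragment.
raw = '<p style="color:red">Rates rose.<br>Bonds <b>fell</b>.</p><hr />'
cleaned = screen_body(raw)
```

A regular expression is enough for this simple case; heavily malformed markup would be handled more robustly by an HTML parser.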

Step 4, computing the target webpage vector and the financial feature vector, and identifying and filtering out webpages irrelevant to financial knowledge;

Step 5, extracting financial knowledge, calculating the fingerprint similarity of the extracted knowledge, removing repeated knowledge, evaluating the financial-topic relevance of the knowledge, outputting the financial knowledge that meets the requirements, and storing it in a knowledge base;

Optionally, before collection the crawler first looks up the address library to check whether the address has already been collected, and acquires the page only if it has not. After topic filtering and extraction have produced knowledge, the knowledge is checked for repetition. The general idea of duplicate detection is to generate a fingerprint for each collected piece of financial knowledge and to calculate the similarity of two fingerprints using a string-comparison-based method. If the fingerprint similarity of two pieces of knowledge is greater than a certain threshold, the two pieces are considered repeated. Euclidean distance is a commonly used method for computing the distance between two n-dimensional vectors. The larger the Euclidean distance, the farther apart the vectors and the lower the document similarity; the smaller the Euclidean distance, the closer the vectors and the higher the document similarity. For the financial knowledge to be compared, TF-IDF is used to establish vector Vq and vector Vd, and the distance is calculated using equation (5).

The smaller the obtained value, the higher the knowledge similarity; the larger the value, the lower the knowledge similarity. If the value is not within the agreed range, the address of the page and the processed financial knowledge are stored in the database.
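The duplicate check can be illustrated with a small pure-Python sketch that builds TF-IDF vectors and compares them with the standard Euclidean distance, sqrt(sum_i (Vq_i - Vd_i)^2). The smoothed IDF weighting, the threshold value 0.1 and the names `tfidf_vectors`, `euclidean` and `is_duplicate` are assumptions made for illustration; the source does not specify them.

```python
import math
from collections import Counter
from typing import Dict, List


def tfidf_vectors(docs: List[List[str]]) -> List[Dict[str, float]]:
    """Build one sparse TF-IDF vector per tokenised document."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        vectors.append({
            # Smoothed IDF (an assumption) keeps terms shared by every document.
            term: (count / len(doc)) * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors


def euclidean(vq: Dict[str, float], vd: Dict[str, float]) -> float:
    """Euclidean distance between two sparse vectors."""
    terms = set(vq) | set(vd)
    return math.sqrt(sum((vq.get(t, 0.0) - vd.get(t, 0.0)) ** 2 for t in terms))


def is_duplicate(vq: Dict[str, float], vd: Dict[str, float], threshold: float = 0.1) -> bool:
    """Treat two items as repeated when their distance is below the threshold."""
    return euclidean(vq, vd) < threshold


# Hypothetical, already tokenised knowledge items.
docs = [
    "central bank raises benchmark rate".split(),
    "central bank raises benchmark rate".split(),
    "insurance fund launches new trust product".split(),
]
v0, v1, v2 = tfidf_vectors(docs)
```

Here the first two items are identical, so their distance is zero and they count as repeated, while the third shares no terms with them and is kept.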

Step 6, performing word segmentation on the data in the database by using the financial lexicon, and creating an index library for convenient retrieval.
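One way to picture the index library is a simple inverted index from tokens to record ids. The sketch below is an assumption about the data structure, which the source does not describe. The tokenizer is passed in as a parameter, and whitespace splitting stands in for a real word segmenter so that the example has no external dependency. The names `build_index`, `search` and `records` are hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Set


def build_index(
    records: Dict[int, str],
    tokenize: Callable[[str], Iterable[str]],
) -> Dict[str, Set[int]]:
    """Map each token to the ids of the records that contain it."""
    index: Dict[str, Set[int]] = defaultdict(set)
    for record_id, text in records.items():
        for token in tokenize(text):
            if token.strip():
                index[token].add(record_id)
    return dict(index)


def search(index: Dict[str, Set[int]], tokenize, query: str) -> List[int]:
    """Return ids of records containing every token of the query."""
    hits = [index.get(tok, set()) for tok in tokenize(query) if tok.strip()]
    return sorted(set.intersection(*hits)) if hits else []


# Whitespace splitting stands in for a real segmenter; with jieba installed,
# `jieba.lcut` could be passed as `tokenize` after loading a finance dictionary.
records = {1: "bank deposit rate", 2: "insurance fund rate", 3: "bank trust fund"}
index = build_index(records, str.split)
```

For example, `search(index, str.split, "fund rate")` returns only the record that contains both tokens.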

Optionally, taking a financial topic as an example, in the embodiment of the present invention, a financial topic crawler algorithm is provided.

In the related art, a general web crawler is a program for automatically extracting web pages, and provides an information acquisition method. The universal web crawler generally takes one or more initial URLs as an entry, repeatedly extracts URLs in source codes of target web pages to form a queue, and supplements new URLs to the URL queue to be crawled until a system stop condition is met. The ordinary crawler does not identify the content, does not filter the links, does not classify the content, and only crawls the links widely and obtains the webpage content corresponding to the links.

In the embodiment of the invention, the financial topic crawler filters the links irrelevant to the financial topic according to the financial webpage identification algorithm, reserves the links relevant to the financial topic, and then forms the URL queue to be crawled of the crawler. In the capturing process, the crawler stores and extracts captured contents, identifies whether the contents conform to financial topics, identifies subordinate classifications according to the contents, and establishes indexes to facilitate retrieval. In addition, the finance topic crawler also carries out Rank rating on the website according to the administrative grade division, so that the content credibility is ensured; adding a financial proper word bank to accurately identify financial proper nouns; and establishing a financial feature vector to filter and capture the content.

Optionally, in the embodiment of the present invention, in view of the characteristics of financial knowledge and the design requirements of the web crawler module, a common topic crawler technique is combined with a depth-first crawler, and definition rules for crawling are added. Financial knowledge is specialized and places high requirements on the credibility of information sources. The development of the Chinese financial industry and research on it are mainly guided by government policies. In terms of overall authority in the financial industry, governments, scientific research units and colleges rank above enterprises and social institutions, and central units rank above local units. Therefore, the method is based on authority: the crawler does not crawl Internet data indiscriminately; websites are ranked according to the administrative levels of country, province (autonomous region or municipality directly under the central government), city and county (district), and the higher the administrative level, the higher the website weight and the higher the credibility. In the design of the web crawler, the credibility requirement of financial knowledge is fully considered, and only links whose host names are stored in the address library are processed. During operation, the system calculates the topic relevance of the anchor text of each collected link, and only links meeting the topic requirement are added to the work queue. In addition, the crawler determines the score of an address according to the ranking of its website, and determines the collection order of the work queue according to the score. The invention adopts jieba as the system word segmenter to process the page information of websites. Some finance-related words also appear in non-financial webpages. If such words are not distinguished and identified according to their specific meaning in the financial domain, the crawler may mistake the page for advertising information and filter it out. Therefore, on the basis of the system lexicon, which covers the basic common vocabulary, the invention adds words specific to the financial industry, such as securities, funds, banks, trusts and insurance.
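The ranking and filtering rules described above can be sketched as follows. The numeric weights, the substring test used as a stand-in for anchor-text topic relevance, and the names `LEVEL_WEIGHT`, `Candidate`, `score` and `build_work_queue` are all assumptions for illustration; the source gives no concrete values or formulas.

```python
from typing import Dict, Iterable, List, NamedTuple
from urllib.parse import urlparse

# Hypothetical weights: a higher administrative level gives a higher score.
LEVEL_WEIGHT = {"national": 4, "provincial": 3, "city": 2, "county": 1}


class Candidate(NamedTuple):
    url: str
    anchor_text: str


def score(url: str, host_levels: Dict[str, str]) -> int:
    """Score a URL by the administrative level of its host; 0 if the host is unknown."""
    host = urlparse(url).hostname or ""
    return LEVEL_WEIGHT.get(host_levels.get(host, ""), 0)


def build_work_queue(
    candidates: Iterable[Candidate],
    host_levels: Dict[str, str],
    topic_terms: Iterable[str],
) -> List[str]:
    """Keep in-library, on-topic links and order them by descending score."""
    terms = list(topic_terms)
    kept = [
        c.url for c in candidates
        if score(c.url, host_levels) > 0
        and any(term in c.anchor_text for term in terms)
    ]
    return sorted(kept, key=lambda u: score(u, host_levels), reverse=True)


# Hypothetical address library and collected links.
host_levels = {"national.example.org": "national", "city.example.org": "city"}
candidates = [
    Candidate("http://city.example.org/fund.html", "fund notice"),
    Candidate("http://national.example.org/bank.html", "bank policy"),
    Candidate("http://national.example.org/sports.html", "football results"),
    Candidate("http://other.example.com/bank.html", "bank advert"),
]
queue = build_work_queue(candidates, host_levels, ["bank", "fund"])
```

The off-topic link and the link from a host outside the address library are dropped, and the national-level link is placed ahead of the city-level one.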

Optionally, the workflow of the web crawler algorithm is shown in FIG. 7. The algorithm is as follows:

Step 1, pre-reading data, and calculating the priority score of each URL in the URL queue.

Step 2, reordering the queue according to the URL priority scores.

Step 3, acquiring the corresponding webpage source code document according to the URL.

Step 4, extracting the URL link group from the webpage source code document and adding it to the Todo work queue.

Step 5, acquiring the URL of the i-th item of the Todo work queue, namely Todo[i-1].Url, and judging whether the URL is within the host name range of the URL index library.

Step 6, if yes, extracting the page information, performing further operations such as formatting on the information, setting i += 1, and executing step 3; if not, skipping the page information processing, setting i += 1, and executing step 5.

Step 7, judging whether any URL in the Todo queue remains unprocessed; if yes, setting i += 1 and executing step 5; if not, jumping to step 8.

Step 8, finishing the work. By processing only in-site links, the method ensures that all information is collected from websites provided in the index library, which improves the reliability of the information sources.
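The loop formed by steps 3 to 8 can be sketched as follows. This is a simplified illustration rather than the workflow of FIG. 7: priority scoring is omitted, the page download is injected as a `fetch` function so the example runs without network access, and the names `crawl`, `HREF`, `fake_site` and the `max_pages` limit are hypothetical.

```python
import re
from typing import Callable, Dict, Iterable, List
from urllib.parse import urljoin, urlparse

HREF = re.compile(r"""href\s*=\s*['"](.*?)['"]""", re.IGNORECASE)


def crawl(
    seeds: Iterable[str],
    allowed_hosts: Iterable[str],
    fetch: Callable[[str], str],
    max_pages: int = 100,
) -> Dict[str, str]:
    """Process the Todo queue in order, handling only URLs whose host is allowed."""
    allowed = set(allowed_hosts)
    todo: List[str] = list(seeds)
    seen = set(todo)
    pages: Dict[str, str] = {}
    i = 0
    while i < len(todo) and len(pages) < max_pages:
        url = todo[i]
        i += 1
        if urlparse(url).hostname not in allowed:
            continue  # host not in the index library: skip page processing
        source = fetch(url)  # acquire the page source for this URL
        pages[url] = source
        for href in HREF.findall(source):  # enqueue the extracted link group
            link = urljoin(url, href)
            if link not in seen:
                seen.add(link)
                todo.append(link)
    return pages


# Hypothetical in-memory site used instead of real HTTP requests.
fake_site = {
    "http://gov.example.org/": '<a href="/a.html">a</a><a href="http://ads.example.com/x">x</a>',
    "http://gov.example.org/a.html": '<a href="/">home</a>',
}
pages = crawl(["http://gov.example.org/"], ["gov.example.org"], lambda u: fake_site.get(u, ""))
```

The link to the host outside the allowed set is queued but never fetched, which mirrors the rule that only in-site links are processed.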

It should be noted that the data collected by the crawler is not necessarily financial knowledge, and therefore, the knowledge collected by the crawler needs to be filtered, and only the financial-related knowledge is retained.

With this embodiment, the knowledge acquisition approach stores fewer links in total than an ordinary crawler, but collects more links that match the financial knowledge topic, and the percentage of on-topic links is far higher than with an ordinary crawler.

The ordinary crawler only carries out rough HTML tag removing processing on the page, retains all texts of the page, does not carry out precise content extraction and content formatting processing, and cannot form financial knowledge. The financial knowledge acquisition system carries out formatting processing and accurate content processing on the page information through intelligent rule matching, so that the information is divided into various attributes including knowledge title, release time, knowledge source, knowledge content and the like. The system also evaluates the credibility of the knowledge so as to directly store the knowledge in a knowledge base and apply the knowledge in an intelligent system in the financial industry.

The financial knowledge acquisition system solves the problem of acquiring financial knowledge on the Internet, and improves the knowledge richness of the financial industry intelligent system. On the basis of analyzing the financial knowledge acquisition problem, a financial industry proprietary word bank is established, the network crawler rule is improved, and a financial knowledge acquisition system is designed and realized by utilizing the technologies of a financial topic crawler algorithm, financial webpage denoising, financial knowledge intelligent matching, financial knowledge duplication removal and the like. The method analyzes the characteristics of the financial topic website, establishes the financial characteristic vector to filter the collected content, and uses the Euclidean distance to perform financial knowledge fingerprint identification, thereby obtaining the financial knowledge with high relevancy, high accuracy and low repeatability.

It should be noted that, for simplicity of description, the above-mentioned method embodiments are described as a series of acts or combination of acts, but those skilled in the art will recognize that the present invention is not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the invention. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required by the invention.

According to another aspect of the embodiments of the present invention, there is also provided an apparatus for determining a resource pool, as shown in fig. 8, the apparatus including:

a first processing unit 802, configured to obtain, through a network capturer, a first web page address corresponding to a target topic, and generate a first address queue according to the first web page address;

a first obtaining unit 804, configured to obtain, by using the network capturer, a webpage source code corresponding to the first address queue;

a first determining unit 806, configured to generate a webpage vector according to the webpage source code, determine a first distance between the webpage vector and a pre-established standard feature vector, and delete a second webpage address corresponding to a first distance that is smaller than a first preset threshold from the first address queue to obtain a second address queue, where the webpage vector is generated according to text information corresponding to the webpage source code;

a second processing unit 808, configured to obtain target resource information from the web page source code corresponding to the second address queue, and store the target resource information in a target resource library, where the target resource information is used to describe a resource related to the target topic, and the target resource library is used to store the resource information related to the target topic.

Alternatively, the first processing unit 802 may be configured to execute step S402, the first obtaining unit 804 may be configured to execute step S404, the first determining unit 806 may be configured to execute step S406, and the second processing unit 808 may be configured to execute step S408.

As an optional technical solution, the first processing unit is further configured to crawl addresses in an address base according to a preset priority level sequentially through a network capturer and by using the target topic, so as to obtain the first web page address.

As an optional technical solution, the apparatus further includes: a second determining unit, configured to determine a tree node structure corresponding to the webpage source code corresponding to the first address queue after the webpage source code corresponding to the first address queue is acquired by using the network capturer; and a third processing unit, configured to perform denoising processing on the portion of the tree node structure corresponding to the visual feature, where the portion of the tree node structure corresponding to the visual feature is style and script.
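The denoising of style and script nodes can be sketched with the Python standard library. This is a minimal illustration under the assumption that denoising means dropping the contents of `<style>` and `<script>` while keeping the remaining text. It streams over the markup instead of building an explicit tree, and the names `NOISE_TAGS`, `TextExtractor` and `denoise` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List

NOISE_TAGS = {"style", "script"}


class TextExtractor(HTMLParser):
    """Collect text nodes while skipping everything inside <style> and <script>."""

    def __init__(self) -> None:
        super().__init__()
        self._noise_depth = 0
        self.chunks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self._noise_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS and self._noise_depth > 0:
            self._noise_depth -= 1

    def handle_data(self, data):
        if self._noise_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def denoise(page_source: str) -> str:
    """Return the visible text of a page with style and script content removed."""
    parser = TextExtractor()
    parser.feed(page_source)
    parser.close()
    return " ".join(parser.chunks)


# Hypothetical page used only for illustration.
html_doc = (
    "<html><head><style>p{color:red}</style><script>var x = 1;</script></head>"
    "<body><h1>Fund notice</h1><p>Rates rose.</p></body></html>"
)
text = denoise(html_doc)
```

The resulting text contains only the heading and paragraph content, which is the kind of input a later vectorisation step would work on.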

As an optional technical solution, the first determining unit includes: the first processing module is used for extracting the webpage source codes corresponding to the first address queue by adopting a regular expression to obtain the text information; and the second processing module is used for generating the webpage vector according to the text information.

As an optional technical solution, the apparatus further includes: a third determining unit, configured to determine, after obtaining target resource information from the web page source code corresponding to the second address queue, fingerprint information corresponding to each web page address in the second address queue according to the target resource information, where the fingerprint information is used to represent the target resource information; a fourth determining unit, configured to determine a second distance between any two web addresses in a second web address according to the fingerprint information, where the second web address is a web address in the second address queue; and the fourth processing unit is used for deleting one webpage address of any two webpage addresses corresponding to the second distance smaller than the second preset threshold value to obtain the target address queue.

According to a further aspect of embodiments of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above-mentioned method embodiments when executed.

Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:

S1, acquiring a first webpage address corresponding to the target subject through a network capturer, and generating a first address queue according to the first webpage address;

S2, acquiring the webpage source code corresponding to the first address queue by using the network capturer;

S3, generating a webpage vector according to the webpage source code, determining a first distance between the webpage vector and a pre-established standard feature vector, and deleting a second webpage address corresponding to a first distance smaller than a first preset threshold value from the first address queue to obtain a second address queue, wherein the webpage vector is generated according to text information corresponding to the webpage source code;

S4, obtaining target resource information from the webpage source code corresponding to the second address queue, and storing the target resource information in a target resource library, where the target resource information is used to describe a resource related to the target topic, and the target resource library is used to store the resource information related to the target topic.

Optionally, in this embodiment, a person skilled in the art may understand that all or part of the steps in the methods of the foregoing embodiments may be implemented by a program instructing hardware associated with a terminal device, where the program may be stored in a computer-readable storage medium, and the storage medium may include: flash disks, ROM (Read-Only Memory), RAM (Random Access Memory), magnetic or optical disks, and the like.

Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.

Optionally, the electronic device may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.

Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:

Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A method for determining a resource pool, comprising:

acquiring a first webpage address corresponding to a target theme through a network capturer, and generating a first address queue according to the first webpage address;

acquiring a webpage source code corresponding to the first address queue by using the network capturer;

generating a webpage vector according to the webpage source code, determining a first distance between the webpage vector and a pre-established standard feature vector, and deleting a second webpage address corresponding to a first distance smaller than a first preset threshold value from the first address queue to obtain a second address queue, wherein the webpage vector is generated according to text information corresponding to the webpage source code;

and acquiring target resource information from the webpage source code corresponding to the second address queue, and storing the target resource information to a target resource library, wherein the target resource information is used for describing resources related to the target subject, and the target resource library is used for storing the resource information related to the target subject.

2. The method of claim 1, wherein obtaining, by the network capturer, the first webpage address corresponding to the target subject comprises:

and sequentially crawling the addresses in an address base according to a preset priority by using the target theme through a network capturer to obtain the first webpage address.

3. The method of claim 1, wherein after the obtaining, by the network capturer, the source code of the web page corresponding to the first address queue, the method further comprises:

determining a tree node structure corresponding to the webpage source code corresponding to the first address queue;

and denoising a part corresponding to the visual feature in the tree node structure, wherein the part corresponding to the visual feature in the tree node structure is style and script.

4. The method of claim 1, wherein generating a webpage vector from the webpage source code comprises:

extracting the webpage source codes corresponding to the first address queue by adopting a regular expression to obtain the text information;

and generating the webpage vector according to the text information.

5. The method according to any one of claims 1 to 4, wherein after the obtaining target resource information from the web page source code corresponding to the second address queue, the method further comprises:

determining fingerprint information corresponding to each webpage address in the second address queue according to the target resource information, wherein the fingerprint information is used for representing the target resource information;

determining a second distance between any two webpage addresses in a second webpage address according to the fingerprint information, wherein the second webpage address is a webpage address in the second address queue;

and deleting one webpage address of any two webpage addresses corresponding to the second distance smaller than a second preset threshold value to obtain a target address queue.

6. An apparatus for determining a resource pool, comprising:

the first processing unit is used for acquiring a first webpage address corresponding to a target subject through a network capturer and generating a first address queue according to the first webpage address;

a first obtaining unit, configured to obtain, by using the network capturer, a webpage source code corresponding to the first address queue;

the first determining unit is used for generating a webpage vector according to the webpage source code, determining a first distance between the webpage vector and a pre-established standard feature vector, and deleting a second webpage address corresponding to a first distance smaller than a first preset threshold value from the first address queue to obtain a second address queue, wherein the webpage vector is generated according to text information corresponding to the webpage source code;

and the second processing unit is used for acquiring target resource information from the webpage source code corresponding to the second address queue and storing the target resource information into a target resource library, wherein the target resource information is used for describing resources related to the target subject, and the target resource library is used for storing the resource information related to the target subject.

7. The apparatus of claim 6, wherein the first processing unit is further configured to crawl addresses in an address library sequentially according to a preset priority through a network capturer and using the target topic to obtain the first web page address.

8. The apparatus of claim 6, further comprising:

a second determining unit, configured to determine a tree node structure corresponding to the webpage source code corresponding to the first address queue after the webpage source code corresponding to the first address queue is acquired by using the network capturer;

and a third processing unit, configured to perform denoising processing on the portion of the tree node structure that represents the visual feature, where the portion of the tree node structure that represents the visual feature is style and script.

9. A computer-readable storage medium, comprising a stored program, wherein the program is operable to perform the method of any one of claims 1 to 5.

10. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to execute the method of any of claims 1 to 5 by means of the computer program.