CN116361594A - Mining method, device, equipment and medium for bidding information release platform - Google Patents

Publication number
CN116361594A
Authority
CN
China
Prior art keywords
column address
expansion column
page
acquiring
address
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310638881.9A
Other languages
Chinese (zh)
Other versions
CN116361594B (en)
Inventor
贾新
田小亮
张金坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Tuopu Fenglian Information Technology Co ltd
Original Assignee
Beijing Tuopu Fenglian Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Tuopu Fenglian Information Technology Co ltd
Priority to CN202310638881.9A
Publication of CN116361594A
Application granted
Publication of CN116361594B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9027 Trees
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/08 Auctions
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Finance (AREA)
  • Accounting & Taxation (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Development Economics (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a mining method, device, equipment, and medium for bidding information release platforms, relating to the technical field of data processing. The method comprises: acquiring a stock site library; acquiring a column address list based on the stock site library, collecting the page of each column address, and parsing the page of each column address to obtain first expansion column addresses; acquiring a website home page list based on the stock site library, extracting the friend links of each website home page to obtain a friend link address list, and acquiring second expansion column addresses based on the friend link address list; and acquiring the text content of the pages corresponding to the first and second expansion column addresses, and judging whether the text content contains bidding information so as to determine target expansion column addresses. In this way, bid release platforms that have not yet been recorded can be mined automatically from existing bid release platforms, with high mining efficiency and accuracy.

Description

Mining method, device, equipment and medium for bidding information release platform
Technical Field
The application relates to the technical field of data processing, in particular to a mining method, a mining device, mining equipment and mining media for bidding information release platforms.
Background
Bidding is short for tendering and bidding. Tendering (inviting bids) and bidding (submitting bids) are a form of commodity transaction and represent the two sides of the transaction process. In the procurement of goods, engineering works, and services, the tenderer publishes its procurement requirements in advance to attract multiple bidders to compete on equal terms under the same conditions, and organizes experts in technology, economics, law, and other fields to evaluate the bidders comprehensively according to a prescribed procedure, so as to select the best bidder as the winner of the project. Monitoring, collecting, counting, and analyzing bidding information helps enterprises obtain more valuable data in real time and improves their market competitiveness.
There are currently estimated to be tens of thousands of bidding information release platforms, and the number keeps growing over time. Collecting these platforms manually, and manually discovering changes in their columns, incurs relatively high resource and time costs.
Disclosure of Invention
In view of the above, an object of the present application is to provide a bid information distribution platform mining method, apparatus, device, and medium capable of mining a bid distribution platform that has not yet been recorded by using existing resources.
In a first aspect, an embodiment of the present application provides a method for mining a bidding information distribution platform, the method including the steps of:
Acquiring an inventory site library, wherein the inventory site library comprises a determined website home page and a column address for issuing bidding information;
acquiring a column address list based on the stock site library, acquiring pages of each column address, and analyzing the pages of each column address to acquire a first expansion column address;
acquiring a website top page list based on the stock site library, extracting a friend link of each website top page to acquire a friend link address list, and acquiring a second expansion column address based on the friend link address list;
acquiring text contents in corresponding pages of the first expansion column address and the second expansion column address, and judging whether the text contents contain bidding information or not; and if the text content is judged to contain bidding information, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address.
In some embodiments, the parsing the page of each column address to obtain the first extended column address includes the following steps:
parsing the page DOM tree of each column address;
extracting the sibling nodes of the current column DOM node based on the page DOM tree and taking them as first expansion column addresses; and/or extracting all child nodes under the sibling nodes of the parent node of the current column DOM node based on the page DOM tree and taking them as first expansion column addresses.
In some embodiments, the acquiring the second expansion column address based on the friendship link address list includes the following steps:
based on the friend link address list, collecting pages of each friend link address;
resolving the page of each friend link address to obtain all hyperlinks in the page of each friend link address to obtain a hyperlink address list;
collecting page source codes of each hyperlink address based on the hyperlink address list;
and analyzing the corresponding page based on the page source code, and filtering through preset characteristic words to obtain a second expansion column address.
In some embodiments, before the parsing of the corresponding page based on the page source code, the method further includes the following steps:
setting a number of optimization rounds;
acquiring a friend link address list in reverse from the page source code;
collecting the page of each friend link address based on the reversely acquired friend link address list, parsing the page of each friend link address, and acquiring all hyperlinks in each page to obtain an optimized hyperlink address list;
and collecting the page source code of each hyperlink address based on the optimized hyperlink address list, repeating in this manner for the set number of optimization rounds to obtain the page source code used for parsing the page.
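As an illustration only, the round-limited expansion described in these steps can be sketched in Python. The sketch reads the "number of optimization rounds" as a cap on how many times newly found addresses are expanded again; that reading, the function name `mine_links`, and the injected `get_links` callable (which stands in for page collection and hyperlink parsing) are assumptions and are not taken from the application.

```python
from typing import Callable, Iterable, List


def mine_links(seeds: Iterable[str],
               get_links: Callable[[str], Iterable[str]],
               rounds: int) -> List[str]:
    """Expand a seed address list for at most `rounds` rounds.

    `get_links` is a hypothetical stand-in for "collect the page and
    parse out its hyperlinks"; it returns the addresses found on a page.
    """
    seeds = list(seeds)
    seen = set(seeds)
    frontier = seeds
    found: List[str] = []
    for _ in range(rounds):
        next_frontier: List[str] = []
        for url in frontier:
            for link in get_links(url):
                if link not in seen:          # skip addresses already collected
                    seen.add(link)
                    found.append(link)
                    next_frontier.append(link)
        if not next_frontier:                 # nothing new: stop early
            break
        frontier = next_frontier
    return found
```

Passing the link extractor in as a parameter keeps the sketch independent of any particular crawler; a larger `rounds` value trades more requests for fewer missed addresses.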
In some embodiments, the parsing the corresponding page based on the page source code and filtering through a preset feature word to obtain a second expansion column address, which includes the following steps:
setting phrases with tendering or bidding attributes as feature words;
analyzing the corresponding page according to the page source code to obtain a page title name;
judging whether the page title name contains the feature word or not; and if the page title name contains any set feature word, taking the page address corresponding to the page title name as a second expansion column address.
In some embodiments, the obtaining text content in the corresponding pages of the first expansion column address and the second expansion column address, and judging whether the text content includes bidding information, includes the following steps:
Crawling text contents in corresponding pages of the first expansion column address and the second expansion column address based on a crawler script;
judging whether the text content contains bidding information or not by using a preset keyword group; if the text content contains the keyword group, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address;
and if the text content does not contain the keyword group, judging whether the text content contains bidding information or not based on a semantic analysis model, and if the text content does not contain the bidding information, filtering a first expansion column address or a second expansion column address corresponding to the text content.
In some embodiments, the keyword groups are divided into primary keyword groups and secondary keyword groups based on the parameter types of tendering information and bidding information, and whether the text content contains bidding information is judged using the preset keyword groups in the following manner:
if the text content contains the primary key word group, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address;
Judging whether the text content contains a secondary keyword group or not if the text content does not contain the primary keyword group, and taking a first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address if the text content contains the secondary keyword group;
and if the text content does not contain the secondary keyword group, judging whether the text content contains bidding information or not based on a semantic analysis model.
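The tiered judgment above can be sketched as follows. This is a minimal illustration under stated assumptions: the keyword lists are placeholders supplied by the caller, matching is a simple case-insensitive substring test, and `semantic_model` is a hypothetical callable standing in for the semantic analysis model, whose form the application does not specify.

```python
from typing import Callable, Iterable


def contains_any(text: str, keywords: Iterable[str]) -> bool:
    """Case-insensitive substring test against a group of keywords."""
    lowered = text.lower()
    return any(kw.lower() in lowered for kw in keywords)


def is_bidding_page(text: str,
                    primary: Iterable[str],
                    secondary: Iterable[str],
                    semantic_model: Callable[[str], bool]) -> bool:
    """Primary keywords first, then secondary, then the model as fallback."""
    if contains_any(text, primary):
        return True
    if contains_any(text, secondary):
        return True
    # Only texts that match neither keyword group reach the slower model.
    return semantic_model(text)
```

Ordering the checks this way means the more expensive model is consulted only for texts that the cheap keyword tests cannot decide.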
In a second aspect, an embodiment of the present application provides a bid information release platform mining apparatus, the apparatus including:
the first acquisition module is used for acquiring an inventory site library, wherein the inventory site library comprises a determined website home page and a column address for issuing bidding information;
the second acquisition module is used for acquiring a column address list based on the stock site library, acquiring pages of each column address, analyzing the pages of each column address and acquiring a first expansion column address;
the third acquisition module is used for acquiring a website first page list based on the stock site library, extracting a friend link of each website first page to obtain a friend link address list, and acquiring a second expansion column address based on the friend link address list;
The judging module is used for acquiring text contents in the corresponding pages of the first expansion column address and the second expansion column address and judging whether the text contents contain bidding information or not; and if the text content is judged to contain bidding information, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, and a bus, where the memory stores machine-readable instructions executable by the processor, and when the electronic device is running, the processor communicates with the memory through the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the bidding information issue platform mining method of any one of the first aspect above.
In a fourth aspect, an embodiment of the present application provides a computer readable storage medium having a computer program stored thereon, where the computer program when executed by a processor performs the steps of the bidding information distribution platform mining method of any one of the first aspects above.
The mining method, device, electronic equipment, and storage medium for bidding information release platforms provided by the embodiments of the present application acquire a stock site library; acquire a column address list based on the stock site library, collect the page of each column address, and parse the page of each column address to obtain first expansion column addresses; acquire a website home page list based on the stock site library, extract the friend links of each website home page to obtain a friend link address list, and acquire second expansion column addresses based on the friend link address list; and acquire the text content of the pages corresponding to the first and second expansion column addresses, and judge whether the text content contains bidding information so as to determine target expansion column addresses. In this way, bid release platforms that have not yet been recorded can be mined automatically from existing bid release platforms, with high mining efficiency and accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered limiting the scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart illustrating a method of mining a bid information distribution platform according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating the parsing of a page of each column address to obtain a first extended column address according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of acquiring a second extended column address based on the friends link address list according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating the steps of obtaining text content in corresponding pages of the first and second expanded column addresses and determining whether the text content includes bidding information according to an embodiment of the present application;
FIG. 5 is a schematic diagram showing the construction of a mining apparatus for bidding information distribution platform according to an embodiment of the present application;
FIG. 6 shows a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it should be understood that the accompanying drawings in the present application are only for the purpose of illustration and description, and are not intended to limit the protection scope of the present application. In addition, it should be understood that the schematic drawings are not drawn to scale. A flowchart, as used in this application, illustrates operations implemented according to some embodiments of the present application. It should be understood that the operations of the flow diagrams may be implemented out of order and that steps without logical context may be performed in reverse order or concurrently. Moreover, one or more other operations may be added to the flow diagrams and one or more operations may be removed from the flow diagrams as directed by those skilled in the art.
In addition, the described embodiments are only some, but not all, of the embodiments of the present application. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that the term "comprising" will be used in the embodiments of the present application to indicate the presence of the features stated hereinafter, but not to exclude the addition of other features.
In view of the technical problems set forth in the background art, the application provides a method, a device, electronic equipment and a storage medium for mining bidding information release platforms, which can automatically mine out the bidding information release platforms which are not recorded yet from the existing bidding information release platforms, and improve mining efficiency and accuracy.
Referring to fig. 1 of the specification, the mining method for the bidding information release platform provided by the embodiment of the application includes the following steps:
S1, acquiring an inventory site library, wherein the inventory site library comprises a determined website home page and a column address for issuing bidding information;
In step S1, the stock site library may be compiled and stored manually from commonly used websites. Each entry in the stock site library carries website data attributes, including the website name, the website home page URL, the column name, and the column address URL. For example, the website name is National Bidding Information Network, the website home page URL is https://www.bidnews.cn/, the column name is Engineering Bidding, and the column address URL is https://www.bidnews.cn/caibou/style-gongchen. A column is a main content section of a website, generally a navigation column, a secondary column, a tertiary column, and so on; columns mainly help users quickly find the topics they want and improve the user experience.
Because the initial stock site library is compiled manually, while release platforms keep increasing over time and the bidding sections of existing websites are updated, the stock site library lags behind, which affects the timeliness and accuracy of bidding information statistics. Therefore, in the present application, the current stock site library is mined automatically to obtain more bid release platforms and columns that have not yet been recorded, and the stock site library is updated in real time, so that enterprises can obtain more valuable data promptly and improve their market competitiveness.
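For illustration, one entry of the stock site library with the four attributes named above could be modeled as in the following Python sketch. The class name `SiteRecord`, the field names, the helper functions, and the example.com addresses are all hypothetical; the application does not prescribe a storage format.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass(frozen=True)
class SiteRecord:
    """One stock site library entry (field names are illustrative)."""
    site_name: str    # website name
    home_url: str     # website home page URL
    column_name: str  # column name
    column_url: str   # column address URL


def column_address_list(library: Iterable[SiteRecord]) -> List[str]:
    """Column address URLs, de-duplicated, in first-seen order."""
    return list(dict.fromkeys(r.column_url for r in library))


def home_page_list(library: Iterable[SiteRecord]) -> List[str]:
    """Home page URLs, de-duplicated, in first-seen order."""
    return list(dict.fromkeys(r.home_url for r in library))


# Placeholder data; real entries would come from the manually compiled library.
library = [
    SiteRecord("Example Bidding Net", "https://bids.example.com/",
               "Engineering", "https://bids.example.com/engineering"),
    SiteRecord("Example Bidding Net", "https://bids.example.com/",
               "Goods", "https://bids.example.com/goods"),
]
```

The two helpers correspond to the column address list used for in-site mining and the home page list used for off-site mining.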
It should be noted that, in the embodiment of the present application, the mining method of the bidding information publishing platform may be operated in a terminal device or a server; the terminal device may be a server terminal device, and when the bidding information release platform mining method is operated on the server, the bidding information release platform mining method may be implemented and executed based on a cloud interaction system, where the cloud interaction system at least includes the server and a client device (i.e., the terminal device).
S2, acquiring a column address list based on the stock site library, acquiring pages of each column address, and analyzing the pages of each column address to acquire a first expansion column address;
In step S2, in-site address mining is mainly performed. Specifically, the acquired stock site library is classified, and the column names and column address URLs are selected from it and arranged into a column address list; the page of each column address in the list is then collected through a crawler script and parsed to obtain first expansion column addresses. A crawler script is a program that automatically acquires web page content; it is a technical means well known to those skilled in the art and is not described further here.
In this embodiment, when resolving a page of each column address in the column address list to obtain a first extended column address, referring specifically to fig. 2 of the specification, the method includes the following steps:
S201, parsing the page DOM tree of each column address;
S202, extracting the sibling nodes of the current column DOM node based on the page DOM tree and taking them as first expansion column addresses; and/or extracting all child nodes under the sibling nodes of the parent node of the current column DOM node based on the page DOM tree and taking them as first expansion column addresses.
In steps S201-S202, a DOM (Document Object Model) tree is generated based on the column address URL, and the corresponding tag position can be located quickly through a DOM node for addition, deletion, and verification. In the DOM tree, the top node is called the root; every node except the root has a parent node, a parent node has child nodes, and child nodes at the same level are called sibling nodes. This is a technical means well known to those skilled in the art and is not described further here. In the present application, each column address in the column address list is mined using the two expansion modes above, so that a sufficient number of first expansion column addresses can be obtained initially. It should be noted that expansion from the current DOM node is not limited to the parent node; for some websites, for example, it is necessary to go up several parent levels to mine the first expansion column addresses.
In other embodiments, to improve the efficiency of acquiring first expansion column addresses, only one of the expansion modes may be used, that is, extracting only the sibling nodes of the current column DOM node, or only all child nodes under the sibling nodes of the parent node of the current column DOM node. In other words, part of the expansion quantity is sacrificed to improve expansion efficiency or reduce hardware requirements.
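The two expansion modes of step S202 can be sketched as follows. This is a simplified illustration, not the application's implementation: it assumes the navigation markup is well-formed so that Python's standard `xml.etree.ElementTree` can parse it (real pages would usually need a tolerant HTML parser), and it assumes the element directly wrapping the current column's link is the "current column DOM node". The markup, the function names, and the paths are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

NAV = """
<div>
  <ul id="menu-a">
    <li><a href="/bids/engineering">Engineering</a></li>
    <li><a href="/bids/goods">Goods</a></li>
  </ul>
  <ul id="menu-b">
    <li><a href="/bids/services">Services</a></li>
  </ul>
</div>
"""


def hrefs(element: ET.Element) -> List[str]:
    """All link targets at or below the given element."""
    return [a.get("href") for a in element.iter("a") if a.get("href")]


def expand_columns(root: ET.Element, current_href: str) -> Tuple[List[str], List[str]]:
    """Return (mode 1, mode 2) candidate addresses for the current column."""
    # ElementTree keeps no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    anchor = next((a for a in root.iter("a") if a.get("href") == current_href), None)
    node = parent_of.get(anchor) if anchor is not None else None
    parent = parent_of.get(node) if node is not None else None
    if parent is None:
        return [], []
    # Mode 1: sibling nodes of the current column node.
    siblings = [h for s in parent if s is not node for h in hrefs(s)]
    # Mode 2: child nodes under the sibling nodes of the parent node.
    grandparent = parent_of.get(parent)
    cousins: List[str] = []
    if grandparent is not None:
        for uncle in grandparent:
            if uncle is not parent:
                for child in uncle:
                    cousins.extend(hrefs(child))
    return siblings, cousins


root = ET.fromstring(NAV.strip())
siblings, cousins = expand_columns(root, "/bids/engineering")
```

Going up further parent levels, as mentioned for some websites, would amount to repeating the `parent_of` lookup before collecting links.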
S3, acquiring a website first page list based on the stock site library, extracting a friend link of each website first page to acquire a friend link address list, and acquiring a second expansion column address based on the friend link address list;
In step S3, off-site address mining is mainly performed. Specifically, the acquired stock site library is classified, and the website names and website home page URLs are selected from it and arranged into a website home page list, where the website name is generally displayed on the website home page and serves to distinguish websites; the friend links on each website home page in the list are then identified and extracted, and second expansion column addresses are acquired based on the friend links obtained. Specifically, referring to FIG. 3 of the specification, acquiring the second expansion column addresses based on the friend link address list includes the following steps:
S301, acquiring pages of each friend link address based on the friend link address list;
S302, parsing the page of each friend link address to obtain all hyperlinks in the page of each friend link address, so as to obtain a hyperlink address list;
S303, collecting the page source code of each hyperlink address based on the hyperlink address list;
S304, parsing the corresponding page based on the page source code, and filtering through preset feature words to obtain a second expansion column address.
Friend links are links that websites place to one another so as to increase traffic through mutual recommendation, and the sites exchanging them are generally in related industries. Friend links typically appear in the bottom area of a website's home page and are identified by the select drop-down boxes in which they are commonly presented.
In this embodiment, the friend links of each website home page in the website home page list are first acquired and compiled into a list of all friend link addresses. The page information of each friend link address in the list is then collected, all hyperlinks and title names in the page are extracted, and a hyperlink address list containing all hyperlinks and title names is compiled. Next, the hyperlink address list is traversed through a crawler script to acquire the page source code of each hyperlink. A hyperlink is essentially part of a web page: it is a connection pointing from the web page to a target, where the target may be another web page or a different position on the same web page. In the present application, off-site addresses can be excluded, so that the acquired hyperlink addresses are equivalent to the column addresses in the stock site library. Finally, the corresponding page is parsed based on the page source code and filtered through preset feature words to obtain second expansion column addresses.
It should be noted that, when the page information of a friend link address is collected and all hyperlinks and title names in the page are extracted, it may not be possible to identify and extract all of them in one pass; that is, some may be missed.
Therefore, in the present application, the number of extended mining rounds can be set; controlling this number controls the depth of the link hierarchy that is mined, so that as many hyperlink addresses as possible are collected. Specifically, a friend link address list is acquired again from the collected page source code through the domain name hierarchy; the page of each friend link address in the re-acquired list is collected and parsed, and all hyperlinks in each page are acquired to obtain a secondarily mined hyperlink address list. This is repeated for the set number of extended mining rounds to obtain the final page source code. The corresponding page is then parsed according to the final page source code and filtered through preset feature words to obtain second expansion column addresses, which specifically includes the following steps:
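The hyperlink extraction in this paragraph can be sketched with Python's standard library. The sketch assumes the page HTML has already been fetched (no network access is shown), interprets "off-site addresses can be excluded" as keeping only links whose host equals that of the friend link page, and uses hypothetical names (`LinkCollector`, `same_site_links`) and example.com addresses.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collect (href, link text) pairs from <a> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.links: List[Tuple[str, str]] = []
        self._href = None
        self._text: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:                       # ignore anchors without a target
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def same_site_links(page_url: str, html: str) -> List[Tuple[str, str]]:
    """Absolute (address, title name) pairs that stay on the page's own host."""
    collector = LinkCollector()
    collector.feed(html)
    host = urlparse(page_url).netloc
    result = []
    for href, title in collector.links:
        absolute = urljoin(page_url, href)   # resolve relative links
        if urlparse(absolute).netloc == host:
            result.append((absolute, title))
    return result
```

Each returned pair holds a hyperlink address and the link text used here as its title name, matching the hyperlink address list described above.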
S2041, setting phrases with tendering or bidding attributes as feature words;
S2042, parsing the corresponding page according to the page source code to obtain a page title name;
S2043, judging whether the page title name contains a feature word; and if the page title name contains any set feature word, taking the page address corresponding to the page title name as a second expansion column address.
Because a very large number of hyperlinks is acquired through the friend links of each website home page, and, on the one hand, not all friend links are related to the industry (bidding) of the website home page while, on the other hand, not all hyperlinks relate to bidding topics, a preliminary screening is necessary. In steps S2041-S2043, the feature words may be phrases with bidding attributes, including bid, winning bid, intention to purchase, and the like. The page corresponding to the page source code is parsed to obtain its title name, and if the title name contains any one of bid, winning bid, and intention to purchase, the page address corresponding to that title name is taken as a second expansion column address.
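A minimal sketch of this title-based screening follows. It assumes the page source code is already available as a string, uses the English renderings of the feature words given above, and performs a plain case-insensitive substring match; the names `TitleParser`, `page_title`, and `filter_by_title` are hypothetical.

```python
from html.parser import HTMLParser
from typing import Iterable, List, Mapping

# English renderings of the example feature words from the text above.
FEATURE_WORDS = ("bid", "winning bid", "intention to purchase")


class TitleParser(HTMLParser):
    """Capture the text inside the page's <title> element."""

    def __init__(self) -> None:
        super().__init__()
        self._in_title = False
        self.title = ""

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def page_title(source: str) -> str:
    parser = TitleParser()
    parser.feed(source)
    return parser.title.strip()


def filter_by_title(pages: Mapping[str, str],
                    feature_words: Iterable[str] = FEATURE_WORDS) -> List[str]:
    """Keep the addresses whose page title contains any feature word."""
    words = [w.lower() for w in feature_words]
    kept = []
    for address, source in pages.items():
        title = page_title(source).lower()
        if any(w in title for w in words):
            kept.append(address)
    return kept
```

Substring matching is crude for English (for example, "bid" also occurs inside unrelated words), so a real deployment would likely use the original-language feature words or word-boundary matching.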
S4, acquiring text contents in corresponding pages of the first expansion column address and the second expansion column address, and judging whether the text contents contain bidding information or not; and if the text content is judged to contain bidding information, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address.
In other words, in the present application, the preliminarily obtained first expansion column address and second expansion column address are further verified to determine whether they are websites that actually issue bidding information. To balance the efficiency and accuracy of automatic verification, two verification modes are adopted: one uses a preset keyword group, and the other is based on a semantic analysis model. Verification through the preset keyword group is efficient but relatively less accurate; verification based on the semantic analysis model is less efficient but relatively more accurate.
Specifically, referring to fig. 4 of the present disclosure, the steps of obtaining text content in the corresponding pages of the first expansion column address and the second expansion column address, and judging whether the text content includes bidding information include the following steps:
S401, crawling text contents in corresponding pages of the first expansion column address and the second expansion column address based on a crawler script;
S402, judging whether the text content contains bidding information by using a preset keyword group; if the text content contains the keyword group, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address;
and if the text content does not contain the keyword group, judging whether the text content contains bidding information based on a semantic analysis model, and if the text content does not contain bidding information, filtering out the first expansion column address or the second expansion column address corresponding to the text content.
That is, in steps S401-S402, whether the text content includes bidding information is first determined by using a keyword group, where the keyword group includes at least two of preview, bid, show, combination, etc. If the text content includes the set keyword group, the first expansion column address or the second expansion column address corresponding to the text content can be directly determined to be the target expansion column address, and verification based on the semantic analysis model is no longer performed.
If the text content does not include the set keyword group, verification is performed based on the semantic analysis model: if the text content is judged not to include bidding information, the first expansion column address or the second expansion column address corresponding to the text content is filtered out; if the text content is judged to include bidding information, the corresponding first expansion column address or second expansion column address is finally determined to be the target expansion column address. The semantic analysis model is obtained through training in the following manner: a plurality of training sample groups are acquired, where each training sample group consists of a training feature code corresponding to a training sentence in a bidding text and a corresponding reference result; the training feature code and the corresponding reference result in each training sample group are input into the semantic analysis model to be trained so as to train it. The construction and training process of the semantic analysis model is known to those skilled in the art and is not described here.
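The two-stage verification can be sketched as follows. The sketch assumes the semantic analysis model is available as a callable returning a truthy value when the text contains bidding information; `KEYWORD_GROUP`, `min_hits`, `is_target_column`, and `select_targets` are hypothetical names, and the text does not state how many keywords of the group must be present, so `min_hits` defaults to 1 purely for illustration.

```python
# Placeholder keyword group; a real deployment would use terms in the site's language.
KEYWORD_GROUP = ("advance notice", "bid", "winning bid")


def contains_keyword_group(text, keywords=KEYWORD_GROUP, min_hits=1):
    """Count how many keywords occur in the text and compare with min_hits."""
    hits = sum(1 for keyword in keywords if keyword in text)
    return hits >= min_hits


def is_target_column(text, semantic_model, keywords=KEYWORD_GROUP):
    # Stage 1: cheap keyword check; a hit accepts the address without calling the model.
    if contains_keyword_group(text, keywords):
        return True
    # Stage 2: the slower semantic check decides the remaining cases.
    return bool(semantic_model(text))


def select_targets(pages, semantic_model):
    """pages maps an expansion column address to its crawled text content."""
    return [url for url, text in pages.items() if is_target_column(text, semantic_model)]
```

Because the model is only called for texts that miss the keyword group, the expensive check runs on a fraction of the crawled pages, which matches the efficiency argument made above.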
In other embodiments, in order to further improve verification efficiency and lower the hardware requirements, the keyword group may be divided into a primary keyword group and a secondary keyword group. For example, terms such as advance notice, bid, and winning bid directly indicate bidding information and can be used as the primary keyword group, while terms such as item number, item name, purchase unit related amount, etc. indirectly indicate bidding information and can be used as the secondary keyword group. Accordingly, judging whether the text content contains bidding information by using the preset keyword group comprises the following steps: if the text content contains the primary keyword group, directly taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address; if the text content does not contain the primary keyword group, judging whether the text content contains the secondary keyword group, and if it does, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address; and if the text content does not contain the secondary keyword group, judging whether the text content contains bidding information based on the semantic analysis model.
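A minimal sketch of the tiered check follows, returning both the decision and the stage that made it. The keyword tuples are illustrative English stand-ins, and `classify` and the `semantic_model` callable are hypothetical names introduced here.

```python
# Illustrative keyword groups based on the examples given in the text.
PRIMARY = ("advance notice", "bid", "winning bid")
SECONDARY = ("item number", "item name", "purchase unit")


def classify(text, semantic_model, primary=PRIMARY, secondary=SECONDARY):
    """Return (is_target, stage), where stage names the check that decided."""
    if any(keyword in text for keyword in primary):
        return True, "primary"
    if any(keyword in text for keyword in secondary):
        return True, "secondary"
    # Only texts that miss both keyword groups reach the semantic model.
    return bool(semantic_model(text)), "semantic"
```

Recording the deciding stage is a design choice of this sketch: it makes it easy to measure how often the semantic model is actually needed.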
Therefore, the mining method of the bidding information release platform can utilize the stock site library manually acquired in the initial stage to perform local site address mining and external site address mining, and obtain the final target expansion column address through bidding information verification, so that the stock site library is updated in real time, timeliness and accuracy of acquiring bidding information are improved, and further market competitiveness of enterprises is improved.
Based on the same inventive concept, the embodiment of the present application further provides a device for mining a bidding information release platform, and since the principle of solving the problem by the device in the embodiment of the present application is similar to that of the foregoing mining method for the bidding information release platform, the implementation of the device can refer to the implementation of the method, and the repetition is omitted.
As shown in fig. 5 of the specification, the present application further provides a device for mining a bidding information distribution platform, where the device includes:
a first obtaining module 501, configured to obtain a stock site library, where the stock site library includes determined website home pages and column addresses that issue bidding information;
a second obtaining module 502, configured to obtain a column address list based on the stock site library, collect pages of each column address, and parse the pages of each column address to obtain a first expansion column address;
a third obtaining module 503, configured to obtain a website home page list based on the stock site library, extract the friend links of each website home page to obtain a friend link address list, and obtain a second expansion column address based on the friend link address list;
a judging module 504, configured to obtain text content in the pages corresponding to the first expansion column address and the second expansion column address, and judge whether the text content includes bidding information; and if the text content is judged to contain bidding information, take the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address.
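How the four modules fit together can be sketched as a single function that receives each stage as a callable. All names here (`mine_platforms`, `expand_columns`, `expand_friend_links`, `fetch_text`, `has_bidding_info`) and the dictionary layout of the site library are assumptions made for illustration; skipping addresses that are already in the library is also an assumption, motivated by the stated goal of finding platforms not yet recorded.

```python
def mine_platforms(site_library, expand_columns, expand_friend_links,
                   fetch_text, has_bidding_info):
    """Return target expansion column addresses not yet in the site library.

    site_library: {"home_pages": [...], "columns": [...]} (hypothetical layout).
    expand_columns(column_url) -> first expansion column addresses.
    expand_friend_links(home_url) -> second expansion column addresses.
    fetch_text(url) -> text content of the page.
    has_bidding_info(text) -> bool.
    """
    first = [u for col in site_library["columns"] for u in expand_columns(col)]
    second = [u for home in site_library["home_pages"] for u in expand_friend_links(home)]

    known = set(site_library["columns"])
    targets = []
    for url in first + second:
        if url in known:
            continue  # already recorded or already examined in this run
        known.add(url)
        if has_bidding_info(fetch_text(url)):
            targets.append(url)
    return targets
```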
In some embodiments, the second obtaining module 502 parses a page of each column address to obtain a first extended column address, including:
parsing the page DOM tree of each column address;
extracting the sibling nodes of the current column DOM node based on the page DOM tree and taking them as the first expansion column address; and/or extracting, based on the page DOM tree, all child nodes under the same-level sibling nodes of the parent node of the current column DOM node as the first expansion column address.
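The two DOM extraction rules can be illustrated with a small sketch using the standard library's `xml.etree.ElementTree`. It assumes well-formed markup and assumes that the "current column DOM node" is the element wrapping the column's `<a>` link (for example an `<li>`); the markup in `NAV` and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical navigation markup; real pages would first need an HTML-tolerant parser.
NAV = """
<div>
  <ul>
    <li><a href="/tender">Tender notices</a></li>
    <li><a href="/award">Award notices</a></li>
  </ul>
  <ul>
    <li><a href="/purchase">Purchase intentions</a></li>
  </ul>
</div>
"""


def _parents(root):
    # ElementTree has no parent pointers, so build a child -> parent map once.
    return {child: parent for parent in root.iter() for child in parent}


def _links_under(node):
    return [a.get("href") for a in node.iter("a") if a.get("href")]


def _column_node(root, parents, current_href):
    # Assumption: the column's DOM node is the element wrapping its <a>.
    anchor = next(a for a in root.iter("a") if a.get("href") == current_href)
    return parents[anchor]


def sibling_columns(root, current_href):
    """Links held by sibling nodes of the current column node."""
    parents = _parents(root)
    node = _column_node(root, parents, current_href)
    return [href for sib in parents[node] if sib is not node
            for href in _links_under(sib)]


def cousin_columns(root, current_href):
    """Links in child nodes under same-level siblings of the column node's parent."""
    parents = _parents(root)
    parent = parents[_column_node(root, parents, current_href)]
    return [href for uncle in parents[parent] if uncle is not parent
            for href in _links_under(uncle)]
```

With `root = ET.fromstring(NAV.strip())`, the first rule applied to `/tender` yields the link in the neighbouring `<li>`, and the second rule yields the link in the neighbouring `<ul>`.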
In some embodiments, the third obtaining module 503 obtains a second expansion column address based on the friend link address list, including:
based on the friend link address list, collecting pages of each friend link address;
parsing the page of each friend link address and acquiring all hyperlinks in the page of each friend link address to obtain a hyperlink address list;
collecting the page source code of each hyperlink address based on the hyperlink address list;
and parsing the corresponding page based on the page source code, and filtering through preset feature words to obtain a second expansion column address.
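Extracting all hyperlinks from a friend link page can be sketched with the standard library alone. `LinkCollector` and `extract_hyperlinks` are hypothetical names; resolving relative links against the page address, dropping fragments, and de-duplicating are choices made for this sketch rather than requirements stated in the application.

```python
from html.parser import HTMLParser
from urllib.parse import urldefrag, urljoin


class LinkCollector(HTMLParser):
    """Collects absolute, de-duplicated hyperlink addresses from <a href=...>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip empty links, script pseudo-links, mail links, and in-page anchors.
        if not href or href.startswith(("javascript:", "mailto:", "#")):
            return
        absolute, _fragment = urldefrag(urljoin(self.base_url, href))
        if absolute not in self.links:
            self.links.append(absolute)


def extract_hyperlinks(page_source, base_url):
    """Return the hyperlink address list for one friend link page."""
    collector = LinkCollector(base_url)
    collector.feed(page_source)
    return collector.links
```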
In some embodiments, before the third obtaining module 503 parses the corresponding page based on the page source code, the following is further included:
setting the number of extended mining rounds;
acquiring a friend link address list again based on the page source code;
collecting pages of each friend link address based on the re-acquired friend link address list, parsing the pages of each friend link address, and acquiring all hyperlinks in the pages of each friend link address to obtain a secondarily mined hyperlink address list;
and collecting the page source code of each hyperlink address based on the secondarily mined hyperlink address list, and repeating in this way for the set number of extended mining rounds to obtain the page source codes used for parsing the pages.
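The round-limited expansion described above behaves like a breadth-first traversal with a fixed depth. The sketch below assumes a hypothetical callable `fetch_links(url)` that returns the hyperlinks found on a page; the text does not specify whether only the last round or every round feeds the later feature-word filter, so this sketch keeps the addresses discovered in every round.

```python
def mine_hyperlinks(seed_links, fetch_links, rounds):
    """Expand a list of friend link addresses for a fixed number of rounds.

    seed_links: initial friend link address list.
    fetch_links(url): hypothetical callable returning hyperlinks found at url.
    rounds: the set number of mining rounds (link hierarchy depth).
    Returns newly discovered addresses in discovery order, seeds excluded.
    """
    seen = set(seed_links)
    frontier = list(seed_links)
    discovered = []
    for _ in range(rounds):
        next_frontier = []
        for url in frontier:
            for link in fetch_links(url):
                if link not in seen:  # avoid revisiting pages and looping on cycles
                    seen.add(link)
                    next_frontier.append(link)
        discovered.extend(next_frontier)
        frontier = next_frontier
    return discovered
```

Each extra round can multiply the number of pages fetched, so the round count is the main lever for trading coverage against crawl cost.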
In some embodiments, the third obtaining module 503 parses the corresponding page based on the page source code, and filters the page source code through a preset feature word to obtain a second expansion column address, including:
setting phrases with bidding attributes as feature words;
analyzing the corresponding page according to the page source code to obtain a page title name;
judging whether the page title name contains the feature word or not; and if the page title name contains any set feature word, taking the page address corresponding to the page title name as a second expansion column address.
In some embodiments, the judging module 504 obtains text content in the pages corresponding to the first expansion column address and the second expansion column address, and judges whether the text content includes bidding information, including:
crawling the text content in the pages corresponding to the first expansion column address and the second expansion column address based on a crawler script;
judging whether the text content contains bidding information by using a preset keyword group; if the text content contains the keyword group, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address;
and if the text content does not contain the keyword group, judging whether the text content contains bidding information based on a semantic analysis model, and if the text content does not contain bidding information, filtering out the first expansion column address or the second expansion column address corresponding to the text content.
In some embodiments, the judging module 504 divides the keyword group into a primary keyword group and a secondary keyword group based on the bidding information and the parameter types of the bidding information, and judges whether the text content includes bidding information by using the preset keyword group, including:
if the text content contains the primary keyword group, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address;
if the text content does not contain the primary keyword group, judging whether the text content contains the secondary keyword group, and if it does, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address;
and if the text content does not contain the secondary keyword group, judging whether the text content contains bidding information based on a semantic analysis model.
In the mining device for the bidding information release platform provided by the embodiment of the present application, the first obtaining module acquires the stock site library; the second obtaining module acquires a column address list based on the stock site library, collects the pages of each column address, and parses them to obtain a first expansion column address; the third obtaining module acquires a website home page list based on the stock site library, extracts the friend links of each website home page to obtain a friend link address list, and obtains a second expansion column address based on the friend link address list; and the judging module acquires the text content in the pages corresponding to the first expansion column address and the second expansion column address and judges whether the text content contains bidding information, so as to determine the target expansion column address. Bidding information release platforms that have not yet been recorded can therefore be mined automatically from the existing ones, which improves mining efficiency and accuracy.
Based on the same concept of the present invention, fig. 6 of the present disclosure shows a structure of an electronic device 600 according to an embodiment of the present application, where the electronic device 600 includes: at least one processor 601, at least one network interface 604 or other user interface 603, memory 605, at least one communication bus 602. The communication bus 602 is used to enable connected communications between these components. The electronic device 600 optionally includes a user interface 603 including a display (e.g., a touch screen, LCD, CRT, holographic imaging (Holographic) or projection (Projector), etc.), a keyboard or pointing device (e.g., a mouse, trackball, touch pad or touch screen, etc.).
Memory 605 may include read-only memory and random access memory and provide instructions and data to processor 601. A portion of the memory 605 may also include non-volatile random access memory (NVRAM).
In some implementations, the memory 605 stores the following elements, executable modules or data structures, or a subset thereof, or an extended set thereof:
an operating system 6051, containing various system programs for implementing various basic services and handling hardware-based tasks;
an application program module 6052, containing various application programs, such as a desktop, a media player (Media Player), and a browser (Browser), for implementing various application services.
In the embodiment of the present application, by calling a program or instructions stored in the memory 605, the processor 601 executes the steps of the mining method of the bidding information release platform described above, so that bidding information release platforms that have not yet been recorded can be mined automatically from the existing ones, with high mining efficiency and accuracy.
The present application also provides a computer readable storage medium having a computer program stored thereon, which when executed by a processor performs steps as in a bid information distribution platform mining method.
Specifically, the storage medium can be a general-purpose storage medium, such as a mobile disk, a hard disk, or the like, and when the computer program on the storage medium is executed, the bidding information distribution platform mining method described above can be executed.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments provided in the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part of it that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program code.
Finally, it should be noted that the foregoing examples are merely specific embodiments of the present application, intended to illustrate its technical solutions rather than to limit its scope of protection. Although the present application has been described in detail with reference to the foregoing examples, those skilled in the art will understand that, within the technical scope disclosed in the present application, anyone skilled in the art may still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions for some of their technical features; such modifications, changes, or substitutions do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be encompassed within the scope of protection of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A bid information distribution platform mining method, the method comprising the steps of:
acquiring an inventory site library, wherein the inventory site library comprises a determined website home page and a column address for issuing bidding information;
acquiring a column address list based on the stock site library, acquiring pages of each column address, and analyzing the pages of each column address to acquire a first expansion column address;
Acquiring a website top page list based on the stock site library, extracting a friend link of each website top page to acquire a friend link address list, and acquiring a second expansion column address based on the friend link address list;
acquiring text contents in corresponding pages of the first expansion column address and the second expansion column address, and judging whether the text contents contain bidding information or not; and if the text content is judged to contain bidding information, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address.
2. The mining method of bidding information distribution platform according to claim 1, wherein the parsing of the page of each column address to obtain the first expanded column address comprises the following steps:
analyzing a page dom tree of each column address;
extracting brother nodes of the current column dom node based on the page dom tree, and taking the brother nodes as a first expansion column address; or/and extracting all child nodes under the same-level brother node of the father node of the current column dom node based on the page dom tree to serve as the first expansion column address.
3. The method for mining the bidding information distribution platform according to claim 2, wherein the step of obtaining the second expansion column address based on the friendship link address list comprises the following steps:
based on the friend link address list, collecting pages of each friend link address;
resolving the page of each friend link address to obtain all hyperlinks in the page of each friend link address to obtain a hyperlink address list;
collecting page source codes of each hyperlink address based on the hyperlink address list;
and analyzing the corresponding page based on the page source code, and filtering through preset characteristic words to obtain a second expansion column address.
4. The mining method of bidding information distribution platform according to claim 3, further comprising the following steps before parsing the corresponding page based on the page source code:
setting the number of extension excavation times;
secondarily acquiring a friend link address list based on the page source code;
acquiring pages of each friend link address based on the twice-acquired friend link address list, analyzing the pages of each friend link address, and acquiring all hyperlinks in the pages of each friend link address to obtain a twice-mined hyperlink address list;
And acquiring page source codes of each hyperlink address based on the twice-mined hyperlink address list, and accordingly, acquiring page source codes for analyzing the pages according to the set extending mining times.
5. The mining method of bidding information distribution platform according to claim 4, wherein the analyzing the corresponding page based on the page source code and filtering through the preset feature word to obtain the second expanded column address comprises the following steps:
setting a phrase with bidding attributes and bidding attributes as a feature word;
analyzing the corresponding page according to the page source code to obtain a page title name;
judging whether the page title name contains the feature word or not; and if the page title name contains any set feature word, taking the page address corresponding to the page title name as a second expansion column address.
6. The mining method of bidding information distribution platform according to claim 5, wherein the steps of obtaining text content in the corresponding pages of the first expansion column address and the second expansion column address, and judging whether the text content contains bidding information, include:
Crawling text contents in corresponding pages of the first expansion column address and the second expansion column address based on a crawler script;
judging whether the text content contains bidding information or not by using a preset keyword group; if the text content contains the keyword group, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address;
and if the text content does not contain the keyword group, judging whether the text content contains bidding information or not based on a semantic analysis model, and if the text content does not contain the bidding information, filtering a first expansion column address or a second expansion column address corresponding to the text content.
7. The method according to claim 6, wherein the keyword group is divided into a primary keyword group and a secondary keyword group based on the bid information and the parameter type of the bid information, and whether the text content contains the bid information is judged by using a preset keyword group as follows:
if the text content contains the primary key word group, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address;
Judging whether the text content contains a secondary keyword group or not if the text content does not contain the primary keyword group, and taking a first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address if the text content contains the secondary keyword group;
and if the text content does not contain the secondary keyword group, judging whether the text content contains bidding information or not based on a semantic analysis model.
8. A bid information distribution platform mining apparatus, the apparatus comprising:
the first acquisition module is used for acquiring an inventory site library, wherein the inventory site library comprises a determined website home page and a column address for issuing bidding information;
the second acquisition module is used for acquiring a column address list based on the stock site library, acquiring pages of each column address, analyzing the pages of each column address and acquiring a first expansion column address;
the third acquisition module is used for acquiring a website first page list based on the stock site library, extracting a friend link of each website first page to obtain a friend link address list, and acquiring a second expansion column address based on the friend link address list;
The judging module is used for acquiring text contents in the corresponding pages of the first expansion column address and the second expansion column address and judging whether the text contents contain bidding information or not; and if the text content is judged to contain bidding information, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address.
9. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory in communication over the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the bidding information distribution platform mining method of any of claims 1 to 7.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the bidding information distribution platform mining method of any of claims 1 to 7.
CN202310638881.9A 2023-06-01 2023-06-01 Mining method, device, equipment and medium for bidding information release platform Active CN116361594B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310638881.9A CN116361594B (en) 2023-06-01 2023-06-01 Mining method, device, equipment and medium for bidding information release platform

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310638881.9A CN116361594B (en) 2023-06-01 2023-06-01 Mining method, device, equipment and medium for bidding information release platform

Publications (2)

Publication Number Publication Date
CN116361594A true CN116361594A (en) 2023-06-30
CN116361594B CN116361594B (en) 2023-08-25

Family

ID=86909455

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310638881.9A Active CN116361594B (en) 2023-06-01 2023-06-01 Mining method, device, equipment and medium for bidding information release platform

Country Status (1)

Country Link
CN (1) CN116361594B (en)

Also Published As

Publication number Publication date
CN116361594B (en) 2023-08-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant