CN116361594A - Mining method, device, equipment and medium for bidding information release platform - Google Patents
Mining method, device, equipment and medium for bidding information release platform
- Publication number
- CN116361594A (CN202310638881.9A)
- Authority
- CN
- China
- Prior art keywords
- column address
- expansion column
- page
- acquiring
- address
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/901—Indexing; Data structures therefor; Storage structures
- G06F16/9027—Trees
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q30/00—Commerce
- G06Q30/06—Buying, selling or leasing transactions
- G06Q30/08—Auctions
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Business, Economics & Management (AREA)
- Data Mining & Analysis (AREA)
- General Engineering & Computer Science (AREA)
- Finance (AREA)
- Accounting & Taxation (AREA)
- Entrepreneurship & Innovation (AREA)
- Development Economics (AREA)
- Economics (AREA)
- Marketing (AREA)
- Strategic Management (AREA)
- General Business, Economics & Management (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The application provides a mining method, device, equipment and medium for bidding information release platforms, relating to the technical field of data processing. The method comprises: acquiring a stock site library; acquiring a column address list based on the stock site library, collecting the page of each column address, and parsing the page of each column address to obtain a first expansion column address; acquiring a website home page list based on the stock site library, extracting the friend links of each website home page to obtain a friend link address list, and acquiring a second expansion column address based on the friend link address list; and acquiring the text content of the pages corresponding to the first expansion column address and the second expansion column address, and judging whether the text content contains bidding information, so as to determine a target expansion column address. In this way, bidding information release platforms that have not yet been recorded can be mined automatically starting from the existing ones, with high mining efficiency and accuracy.
Description
Technical Field
The application relates to the technical field of data processing, in particular to a mining method, a mining device, mining equipment and mining media for bidding information release platforms.
Background
Bidding here is short for tendering and bidding. Inviting tenders and submitting bids are a commodity transaction behavior and form the two sides of one transaction process. In the procurement of goods, engineering works and services, the tenderer publishes its procurement requirements in advance to attract multiple bidders to compete on equal terms, organizes experts in technology, economics, law and other fields to evaluate the bidders comprehensively according to a prescribed procedure, and then selects the best bidder as the winner of the project. Monitoring, collecting, counting and analyzing bidding information helps enterprises grasp more valuable data in real time and improves their market competitiveness.
There are currently estimated to be tens of thousands of bidding information release platforms, and their number keeps growing over time. Collecting these platforms manually, and discovering manually when their columns change, incurs relatively high resource and time costs.
Disclosure of Invention
In view of the above, an object of the present application is to provide a bid information distribution platform mining method, apparatus, device, and medium capable of mining a bid distribution platform that has not yet been recorded by using existing resources.
In a first aspect, an embodiment of the present application provides a method for mining a bidding information distribution platform, the method including the steps of:
acquiring a stock site library, wherein the stock site library comprises determined website home pages and column addresses that release bidding information;
acquiring a column address list based on the stock site library, acquiring pages of each column address, and analyzing the pages of each column address to acquire a first expansion column address;
acquiring a website home page list based on the stock site library, extracting the friend links of each website home page to obtain a friend link address list, and acquiring a second expansion column address based on the friend link address list;
acquiring text contents in corresponding pages of the first expansion column address and the second expansion column address, and judging whether the text contents contain bidding information or not; and if the text content is judged to contain bidding information, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address.
In some embodiments, parsing the page of each column address to obtain the first expansion column address includes the following steps:
parsing the page DOM tree of each column address;
extracting the sibling nodes of the current column DOM node based on the page DOM tree and taking them as first expansion column addresses; and/or extracting, based on the page DOM tree, all child nodes under the same-level sibling nodes of the parent node of the current column DOM node as first expansion column addresses.
In some embodiments, acquiring the second expansion column address based on the friend link address list includes the following steps:
based on the friend link address list, collecting pages of each friend link address;
resolving the page of each friend link address to obtain all hyperlinks in the page of each friend link address to obtain a hyperlink address list;
collecting page source codes of each hyperlink address based on the hyperlink address list;
and analyzing the corresponding page based on the page source code, and filtering through preset characteristic words to obtain a second expansion column address.
In some embodiments, before the parsing of the corresponding page based on the page source code, the method further includes the following steps:
setting a number of optimization rounds;
acquiring a friend link address list in the reverse direction based on the page source code;
collecting the page of each friend link address based on the reversely acquired friend link address list, parsing the page of each friend link address, and obtaining all hyperlinks in the page of each friend link address, so as to obtain an optimized hyperlink address list;
and collecting the page source code of each hyperlink address based on the optimized hyperlink address list, and repeating the above for the set number of optimization rounds to obtain the page source code used for parsing the page.
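The bounded re-expansion in the steps above can be sketched as a round-limited traversal. This is a minimal illustration, not the patent's implementation: the names `expand_links`, `fetch`, `extract_links` and `FAKE_WEB` are hypothetical, and the in-memory dictionary stands in for real page collection so that the sketch runs without network access.

```python
def expand_links(seed_urls, fetch, extract_links, rounds):
    """Repeat 'collect page -> extract hyperlinks' for a fixed number of rounds.

    fetch(url) returns the page source or None; extract_links(source)
    returns the hyperlink addresses found in that source. Both are
    injected, so any crawler and parser can be plugged in.
    """
    seen = set(seed_urls)
    frontier = list(seed_urls)
    discovered = []
    for _ in range(rounds):
        next_frontier = []
        for url in frontier:
            source = fetch(url)
            if source is None:  # the page could not be collected
                continue
            for link in extract_links(source):
                if link not in seen:
                    seen.add(link)
                    discovered.append(link)
                    next_frontier.append(link)
        frontier = next_frontier  # only newly found addresses are expanded next
    return discovered


# Hypothetical stand-in for the web: each "page source" is already its link list.
FAKE_WEB = {
    "https://a.example/": ["https://b.example/", "https://c.example/"],
    "https://b.example/": ["https://d.example/"],
    "https://c.example/": ["https://a.example/", "https://d.example/"],
    "https://d.example/": ["https://e.example/"],
}

one_round = expand_links(["https://a.example/"], FAKE_WEB.get, lambda s: s, rounds=1)
two_rounds = expand_links(["https://a.example/"], FAKE_WEB.get, lambda s: s, rounds=2)
```

Raising `rounds` widens coverage at the cost of more page collections, which is the trade-off the set number of rounds is meant to control.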
In some embodiments, parsing the corresponding page based on the page source code and filtering through preset feature words to obtain the second expansion column address includes the following steps:
setting phrases having tendering and bidding attributes as feature words;
analyzing the corresponding page according to the page source code to obtain a page title name;
judging whether the page title name contains the feature word or not; and if the page title name contains any set feature word, taking the page address corresponding to the page title name as a second expansion column address.
In some embodiments, the obtaining text content in the corresponding pages of the first expansion column address and the second expansion column address, and judging whether the text content includes bidding information, includes the following steps:
Crawling text contents in corresponding pages of the first expansion column address and the second expansion column address based on a crawler script;
judging whether the text content contains bidding information or not by using a preset keyword group; if the text content contains the keyword group, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address;
and if the text content does not contain the keyword group, judging whether the text content contains bidding information or not based on a semantic analysis model, and if the text content does not contain the bidding information, filtering a first expansion column address or a second expansion column address corresponding to the text content.
In some embodiments, the keyword groups are divided into primary keyword groups and secondary keyword groups based on the parameter types of tendering information and bidding information, and whether the text content contains bidding information is judged using the preset keyword groups in the following manner:
if the text content contains the primary key word group, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address;
Judging whether the text content contains a secondary keyword group or not if the text content does not contain the primary keyword group, and taking a first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address if the text content contains the secondary keyword group;
and if the text content does not contain the secondary keyword group, judging whether the text content contains bidding information or not based on a semantic analysis model.
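The tiered judgment above can be sketched as a cascade that tries the primary group, then the secondary group, and only then the semantic analysis model. The keyword lists and the `classify` function are assumptions made for illustration; the source does not enumerate the keyword groups here, and the semantic model is represented by any callable that returns a truthy value for bidding-related text.

```python
# Hypothetical keyword groups; the actual groups are not listed in the text above.
PRIMARY_KEYWORDS = ("tender announcement", "bid announcement")
SECONDARY_KEYWORDS = ("budget amount", "bid opening time")


def classify(text, semantic_model,
             primary=PRIMARY_KEYWORDS, secondary=SECONDARY_KEYWORDS):
    """Return (contains_bidding_info, tier_that_decided)."""
    lowered = text.lower()
    if any(keyword in lowered for keyword in primary):
        return True, "primary"
    if any(keyword in lowered for keyword in secondary):
        return True, "secondary"
    # Neither keyword group matched: fall back to the slower semantic check.
    return bool(semantic_model(text)), "semantic"


def stub_model(text):
    """Placeholder for a trained semantic analysis model."""
    return "procurement" in text.lower()
```

Returning the deciding tier alongside the verdict makes it easy to audit how often the slower semantic check is actually needed.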
In a second aspect, an embodiment of the present application provides a bid information release platform mining apparatus, the apparatus including:
the first acquisition module is used for acquiring a stock site library, wherein the stock site library comprises determined website home pages and column addresses that release bidding information;
the second acquisition module is used for acquiring a column address list based on the stock site library, acquiring pages of each column address, analyzing the pages of each column address and acquiring a first expansion column address;
the third acquisition module is used for acquiring a website home page list based on the stock site library, extracting the friend links of each website home page to obtain a friend link address list, and acquiring a second expansion column address based on the friend link address list;
The judging module is used for acquiring text contents in the corresponding pages of the first expansion column address and the second expansion column address and judging whether the text contents contain bidding information or not; and if the text content is judged to contain bidding information, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address.
In a third aspect, an embodiment of the present application provides an electronic device, including a processor, a memory, and a bus, where the memory stores machine-readable instructions executable by the processor, and when the electronic device is running, the processor communicates with the memory through the bus, and the machine-readable instructions, when executed by the processor, perform the steps of the bidding information issue platform mining method of any one of the first aspect above.
In a fourth aspect, an embodiment of the present application provides a computer readable storage medium having a computer program stored thereon, where the computer program when executed by a processor performs the steps of the bidding information distribution platform mining method of any one of the first aspects above.
With the mining method, mining device, electronic equipment and storage medium for bidding information release platforms provided by the application, a stock site library is acquired; a column address list is acquired based on the stock site library, the page of each column address is collected, and the page of each column address is parsed to obtain a first expansion column address; a website home page list is acquired based on the stock site library, the friend links of each website home page are extracted to obtain a friend link address list, and a second expansion column address is acquired based on the friend link address list; and the text content of the pages corresponding to the first expansion column address and the second expansion column address is acquired, and whether the text content contains bidding information is judged, so as to determine a target expansion column address. In this way, bidding information release platforms that have not yet been recorded can be mined automatically starting from the existing ones, with high mining efficiency and accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered limiting the scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flow chart illustrating a method of mining a bid information distribution platform according to an embodiment of the present application;
FIG. 2 is a flowchart illustrating the parsing of a page of each column address to obtain a first extended column address according to an embodiment of the present disclosure;
FIG. 3 is a flowchart of acquiring a second extended column address based on the friends link address list according to an embodiment of the present application;
FIG. 4 is a flowchart illustrating the steps of obtaining text content in corresponding pages of the first and second expanded column addresses and determining whether the text content includes bidding information according to an embodiment of the present application;
FIG. 5 is a schematic diagram showing the construction of a mining apparatus for bidding information distribution platform according to an embodiment of the present application;
FIG. 6 shows a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application, and it should be understood that the accompanying drawings in the present application are only for the purpose of illustration and description, and are not intended to limit the protection scope of the present application. In addition, it should be understood that the schematic drawings are not drawn to scale. A flowchart, as used in this application, illustrates operations implemented according to some embodiments of the present application. It should be understood that the operations of the flow diagrams may be implemented out of order and that steps without logical context may be performed in reverse order or concurrently. Moreover, one or more other operations may be added to the flow diagrams and one or more operations may be removed from the flow diagrams as directed by those skilled in the art.
In addition, the described embodiments are only some, but not all, of the embodiments of the present application. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present application without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that the term "comprising" will be used in the embodiments of the present application to indicate the presence of the features stated hereinafter, but not to exclude the addition of other features.
In view of the technical problems set forth in the background art, the application provides a method, a device, electronic equipment and a storage medium for mining bidding information release platforms, which can automatically mine out the bidding information release platforms which are not recorded yet from the existing bidding information release platforms, and improve mining efficiency and accuracy.
Referring to fig. 1 of the specification, the mining method for the bidding information release platform provided by the embodiment of the application includes the following steps:
S1, acquiring a stock site library, wherein the stock site library comprises determined website home pages and column addresses that release bidding information;
In step S1, the stock site library may be compiled and stored manually from commonly used websites. Each entry in the stock site library carries website data attributes: the website name, the website home page URL, the column name, and the column address URL. For example, the website name is national bidding information network, the website home page URL is https://www.bidnews.cn/, the column name is engineering bidding, and the column address URL is https://www.bidnews.cn/caibou/style-gongchen. A column is a main content section of a website, generally a navigation column, a secondary column, a tertiary column, and so on; columns mainly help users quickly find the topics they want to know about and improve the user experience.
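As an illustration of the data attributes just listed, one entry of the stock site library can be modeled as a small record, from which the column address list (used in step S2) and the website home page list (used in step S3) are derived. The class name, field names and example.com URLs below are hypothetical and serve only as a sketch.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SiteRecord:
    """One entry of the stock site library (field names are assumptions)."""
    site_name: str
    home_url: str
    column_name: str
    column_url: str


def column_address_list(library):
    """Column address URLs, de-duplicated in first-seen order."""
    return list(dict.fromkeys(record.column_url for record in library))


def home_page_list(library):
    """Website home page URLs, de-duplicated in first-seen order."""
    return list(dict.fromkeys(record.home_url for record in library))


library = [
    SiteRecord("Example Bidding Network", "https://bids.example.com/",
               "Engineering bidding", "https://bids.example.com/engineering/"),
    SiteRecord("Example Bidding Network", "https://bids.example.com/",
               "Goods procurement", "https://bids.example.com/goods/"),
]
columns = column_address_list(library)
homes = home_page_list(library)
```

One website contributes one home page but possibly several columns, which is why the two lists are de-duplicated separately.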
Because the initial stock site library is compiled manually, release platforms keep increasing over time, and the bidding release sections of existing websites are updated, the stock site library lags behind, which affects the timeliness and accuracy of bidding information statistics. Therefore, in the present application, the current stock site library is mined automatically to obtain more bidding release platforms and columns that have not yet been recorded, and the stock site library is updated in real time, so that enterprises can grasp more valuable data in time and improve their market competitiveness.
It should be noted that, in the embodiment of the present application, the mining method of the bidding information publishing platform may be operated in a terminal device or a server; the terminal device may be a server terminal device, and when the bidding information release platform mining method is operated on the server, the bidding information release platform mining method may be implemented and executed based on a cloud interaction system, where the cloud interaction system at least includes the server and a client device (i.e., the terminal device).
S2, acquiring a column address list based on the stock site library, acquiring pages of each column address, and analyzing the pages of each column address to acquire a first expansion column address;
In step S2, on-site address mining (mining within the same website) is mainly performed. Specifically, the acquired stock site library is classified, and the column names and column address URLs are selected from it and arranged into a column address list; the page of each column address in the list is then collected by a crawler script and parsed to obtain the first expansion column address. A crawler script is a program that automatically fetches web page content; it is a technique well known to those skilled in the art and is not described further here.
In this embodiment, parsing the page of each column address in the column address list to obtain the first expansion column address, with reference to FIG. 2 of the specification, includes the following steps:
S201, parsing the page DOM tree of each column address;
S202, extracting the sibling nodes of the current column DOM node based on the page DOM tree and taking them as first expansion column addresses; and/or extracting, based on the page DOM tree, all child nodes under the same-level sibling nodes of the parent node of the current column DOM node as first expansion column addresses.
In steps S201-S202, a DOM (Document Object Model) tree is generated based on the column address URL, and the position of the corresponding tag can be located quickly through a DOM node so that nodes can be added, deleted, and checked. In the DOM tree, the top node is called the root; every node except the root has a parent node, a parent node may have child nodes, and child nodes at the same level are called sibling nodes. This is well known to those skilled in the art and is not described further here. In the present application, each column address in the column address list is mined using the two expansion modes above, so that a sufficient number of first expansion column addresses can be obtained initially. It should be noted that expansion from the current DOM node is not limited to the immediate parent node; for example, some websites require moving upward (to ancestor levels) several times to mine the first expansion column addresses.
In other embodiments, to improve the efficiency of acquiring the first expansion column address, only one of the expansion modes may be adopted: either only the sibling nodes of the current column DOM node are extracted, or only all child nodes under the sibling nodes of the parent node of the current column DOM node are extracted, as first expansion column addresses. That is, some expansion coverage is sacrificed to improve expansion efficiency or reduce hardware requirements.
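The two expansion modes of step S202 can be sketched on a small page as follows. This is only an illustration under stated assumptions: the sample markup is well-formed so that the standard-library XML parser can build the tree (real pages would generally need a lenient HTML parser), the `<li>` element wrapping the link is treated as the current column DOM node, and `expand_columns` is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

PAGE = """
<html><body>
  <ul id="nav">
    <li><a href="/engineering/">Engineering bidding</a></li>
    <li><a href="/goods/">Goods procurement</a></li>
    <li><a href="/services/">Service procurement</a></li>
  </ul>
  <ul id="more">
    <li><a href="/results/">Award results</a></li>
  </ul>
</body></html>
"""


def hrefs(element):
    """All link targets found in an element and its descendants."""
    return [a.get("href") for a in element.iter("a")]


def expand_columns(page, current_href):
    root = ET.fromstring(page.strip())
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    anchor = next(a for a in root.iter("a") if a.get("href") == current_href)
    node = parent_of[anchor]   # assumed "current column DOM node" (the <li>)
    parent = parent_of[node]   # its parent node (the <ul>)

    # Mode 1: sibling nodes of the current column node.
    siblings = [h for sib in parent if sib is not node for h in hrefs(sib)]

    # Mode 2: everything under the same-level siblings of the parent node.
    grandparent = parent_of.get(parent)
    under_parent_siblings = []
    if grandparent is not None:
        under_parent_siblings = [h for other in grandparent
                                 if other is not parent for h in hrefs(other)]
    return siblings, under_parent_siblings


siblings, under_parent_siblings = expand_columns(PAGE, "/engineering/")
```

Using only the first return value corresponds to the single-mode variant described above, trading coverage for speed.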
S3, acquiring a website home page list based on the stock site library, extracting the friend links of each website home page to obtain a friend link address list, and acquiring a second expansion column address based on the friend link address list;
In step S3, off-site address mining is mainly performed. Specifically, the acquired stock site library is classified, and the website names and website home page URLs are selected from it and arranged into a website home page list; the website name is generally displayed on the website home page and serves to distinguish websites. The friend links on each website home page in the list are then identified and extracted, and the second expansion column address is acquired based on the obtained friend links. Specifically, referring to FIG. 3 of the specification, acquiring the second expansion column address based on the friend link address list includes the following steps:
S301, acquiring pages of each friend link address based on the friend link address list;
S302, parsing the page of each friend link address to obtain all hyperlinks in the page of each friend link address, so as to obtain a hyperlink address list;
S303, collecting the page source code of each hyperlink address based on the hyperlink address list;
S304, parsing the corresponding page based on the page source code, and filtering through preset feature words to obtain the second expansion column address.
Friend links are links that websites place to each other's sites so as to increase traffic through mutual recommendation; websites that exchange friend links generally belong to related industries. Friend links typically appear in the bottom area of a website's home page, and they can be identified by the selection drop-down boxes in which they are commonly placed.
In this embodiment, the friend links of each website home page in the website home page list are first acquired and compiled into a list of all friend link addresses. The page information of each friend link address in the list is then collected, all hyperlinks and title names in the page are extracted, and a hyperlink address list containing all hyperlinks and title names is compiled. Next, the hyperlink address list is traversed by a crawler script to acquire the page source code of each hyperlink. A hyperlink is part of a web page and denotes a connection from the web page to a target, where the target may be another web page or a different position on the same web page; in the present application, off-site addresses can be excluded, and each acquired hyperlink address is equivalent to a column address in the stock site library. Finally, the corresponding page is parsed based on the page source code and filtered through preset feature words to obtain the second expansion column address.
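Extracting all hyperlinks and their names from a collected page, as described above, can be sketched with the standard-library HTML parser. The class and function names, the sample markup and the `same_site_only` switch are assumptions for illustration; the switch shows one possible way to exclude off-site addresses and is not taken from the source.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkCollector(HTMLParser):
    """Collects (absolute URL, link text) pairs from <a href=...> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self._href = urljoin(self.base_url, href)
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def extract_links(html, base_url, same_site_only=False):
    parser = LinkCollector(base_url)
    parser.feed(html)
    parser.close()
    host = urlparse(base_url).netloc
    seen, result = set(), []
    for url, text in parser.links:
        parts = urlparse(url)
        if parts.scheme not in ("http", "https"):
            continue  # skip javascript:, mailto: and similar targets
        if same_site_only and parts.netloc != host:
            continue
        if url not in seen:
            seen.add(url)
            result.append((url, text))
    return result


SAMPLE = (
    '<a href="/tender/">Tender notices</a> '
    '<a href="https://other.example.org/">Partner site</a> '
    '<a href="javascript:void(0)">More</a> '
    '<a href="/tender/">Tender notices again</a>'
)
links = extract_links(SAMPLE, "https://friend.example.com/index.html")
```

Relative addresses are resolved against the page's own URL, so the resulting list can be fed directly to the crawler in the next step.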
It should be noted that, when the page information of a friend link address is collected and all hyperlinks and title names in the page are extracted, they may not all be identified and extracted in a single pass; that is, some may be missed.
Therefore, in the present application, a number of extended mining rounds can be set; by controlling the number of rounds, that is, the depth of the mined link hierarchy, as many hyperlink addresses as possible are collected. Specifically, a friend link address list is acquired again from the collected page source code through the domain name hierarchy; the page of each friend link address in this list is collected and parsed, and all hyperlinks in each page are obtained to form a secondary-mined hyperlink address list. This is repeated for the set number of extended mining rounds to obtain the final page source code. The corresponding pages are then parsed based on the final page source code and filtered through preset feature words to obtain the second expansion column address, which specifically includes the following steps:
S2041, setting phrases having tendering and bidding attributes as feature words;
S2042, parsing the corresponding page according to the page source code to obtain the page title name;
S2043, judging whether the page title name contains a feature word; and if the page title name contains any of the set feature words, taking the page address corresponding to the page title name as a second expansion column address.
Too many hyperlinks are acquired through all the friend links on each website home page: on the one hand, not all friend links are related to the industry (bidding) of the website home page, and on the other hand, not all hyperlinks are related to bidding topics, so preliminary screening and filtering are necessary. In steps S2041-S2043, the feature words may be phrases having bidding attributes, including bid, winning bid, intention to purchase, and the like. The page corresponding to the page source code is parsed to obtain its title name; if the title name contains any one of bid, winning bid, or intention to purchase, the page address corresponding to that page title name is taken as a second expansion column address.
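Steps S2041-S2043 amount to a title filter, sketched below. The English feature words mirror the examples in the text; the case-insensitive substring match, the regular expression used to read the title, and the function names are assumptions made for this illustration.

```python
import re
from html import unescape

FEATURE_WORDS = ("bid", "winning bid", "intention to purchase")


def page_title(source):
    """Return the text of the <title> element, or an empty string."""
    match = re.search(r"<title[^>]*>(.*?)</title>", source, re.I | re.S)
    return unescape(match.group(1)).strip() if match else ""


def filter_by_title(pages, feature_words=FEATURE_WORDS):
    """pages maps a page address to its page source code."""
    selected = []
    for url, source in pages.items():
        title = page_title(source).lower()
        if any(word.lower() in title for word in feature_words):
            selected.append(url)
    return selected


pages = {
    "https://a.example.com/zb/": "<html><head><title>Winning Bid Announcements</title></head></html>",
    "https://a.example.com/about/": "<html><head><title>About us</title></head></html>",
    "https://a.example.com/raw/": "<html><body>No title here</body></html>",
}
second_expansion = filter_by_title(pages)
```

A plain substring test is cheap but coarse (a short English word such as bid can also match inside unrelated words), which is consistent with treating this step as preliminary screening before the content check in step S4.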
S4, acquiring text contents in corresponding pages of the first expansion column address and the second expansion column address, and judging whether the text contents contain bidding information or not; and if the text content is judged to contain bidding information, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address.
In other words, in the present application, the preliminarily obtained first expansion column address and second expansion column address are further checked to determine whether they are websites that actually publish bidding information. In addition, to balance the efficiency and accuracy of the automatic check, two check modes are adopted: one check is performed through a preset keyword group, and the other is performed based on a semantic analysis model. The check through the preset keyword group is efficient but relatively less accurate; the check based on the semantic analysis model is less efficient but relatively more accurate.
Specifically, referring to fig. 4 of the present disclosure, the steps of obtaining text content in the corresponding pages of the first expansion column address and the second expansion column address, and judging whether the text content includes bidding information include the following steps:
S401, crawling text contents in corresponding pages of the first expansion column address and the second expansion column address based on a crawler script;
S402, judging whether the text content contains bidding information or not by using a preset keyword group; if the text content contains the keyword group, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address;
and if the text content does not contain the keyword group, judging whether the text content contains bidding information or not based on a semantic analysis model, and if the text content does not contain the bidding information, filtering a first expansion column address or a second expansion column address corresponding to the text content.
That is, in steps S401-S402, whether the text content includes bidding information is first determined by using a keyword group, where the keyword group includes at least two of preview, bid, show, combination, and the like. If the text content includes a set keyword group, the first expansion column address or the second expansion column address corresponding to the text content can be directly determined to be the target expansion column address, and verification based on the semantic analysis model is no longer performed. If the text content does not include the set keyword group, verification is performed based on the semantic analysis model: if the text content is judged not to include bidding information, the first expansion column address or the second expansion column address corresponding to the text content is filtered out; if the text content is judged to include bidding information, the first expansion column address or the second expansion column address corresponding to the text content is finally determined to be the target expansion column address.
The semantic analysis model is obtained through training in the following manner: a plurality of training sample groups are acquired, where each training sample group consists of a training feature code corresponding to a training sentence in a bidding text and a corresponding reference result; the training feature code and the corresponding reference result in each training sample group are then input into the semantic analysis model to be trained, so as to train that model. The construction and training process of the semantic analysis model should be known to those skilled in the art, and will not be described herein.
In other embodiments, in order to further improve the verification efficiency and reduce the hardware requirements, the keyword groups may be divided into a primary keyword group and a secondary keyword group. For example, advance notice, bid, winning bid, and the like directly indicate bidding information and can be used as the primary keyword group, while item number, item name, purchasing unit, related amount, and the like indirectly indicate bidding information and can be used as the secondary keyword group. Therefore, judging whether the text content contains bidding information by using a preset keyword group comprises the following steps: if the text content contains the primary keyword group, directly taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address; if the text content does not contain the primary keyword group, judging whether the text content contains the secondary keyword group, and if the text content contains the secondary keyword group, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address; and if the text content does not contain the secondary keyword group, judging whether the text content contains bidding information or not based on a semantic analysis model.
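The tiered check described above (primary keyword group, then secondary keyword group, then semantic analysis model) might be sketched as below. All names are hypothetical, the keyword tuples only reuse examples from the description, and the semantic analysis model is represented by an arbitrary callable returning `True` or `False`, since its construction is not specified here.

```python
from typing import Callable

# Hypothetical keyword groups built from the examples in the description.
PRIMARY_KEYWORDS = ("advance notice", "bid", "winning bid")
SECONDARY_KEYWORDS = ("item number", "item name")


def contains_bidding_info(text: str,
                          semantic_model: Callable[[str], bool],
                          primary=PRIMARY_KEYWORDS,
                          secondary=SECONDARY_KEYWORDS) -> tuple[bool, str]:
    """Return (decision, stage) where stage names the check that decided."""
    lowered = text.lower()
    # Cheap check first: primary keywords directly indicate bidding information.
    if any(keyword in lowered for keyword in primary):
        return True, "primary"
    # Next: secondary keywords indirectly indicate bidding information.
    if any(keyword in lowered for keyword in secondary):
        return True, "secondary"
    # Fall back to the slower but more accurate semantic analysis model.
    return bool(semantic_model(text)), "semantic"


def select_target_addresses(candidates: dict[str, str],
                            semantic_model: Callable[[str], bool]) -> list[str]:
    """`candidates` maps an expansion column address to its crawled text."""
    return [address for address, text in candidates.items()
            if contains_bidding_info(text, semantic_model)[0]]
```

Ordering the checks from cheapest to most expensive means the model is only invoked for text that neither keyword group resolves, which is the efficiency argument made in the paragraph above.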
Therefore, the mining method of the bidding information release platform can utilize the stock site library manually acquired in the initial stage to perform local site address mining and external site address mining, and obtain the final target expansion column address through bidding information verification, so that the stock site library is updated in real time, timeliness and accuracy of acquiring bidding information are improved, and further market competitiveness of enterprises is improved.
Based on the same inventive concept, the embodiment of the present application further provides a device for mining a bidding information release platform, and since the principle of solving the problem by the device in the embodiment of the present application is similar to that of the foregoing mining method for the bidding information release platform, the implementation of the device can refer to the implementation of the method, and the repetition is omitted.
As shown in fig. 5 of the specification, the present application further provides a device for mining a bidding information distribution platform, where the device includes:
a first obtaining module 501, configured to obtain an inventory site library, where the inventory site library includes a determined website home page and a column address for issuing bidding information;
the second obtaining module 502 is configured to obtain a column address list based on the stock site library, collect pages of each column address, and parse the pages of each column address to obtain a first extended column address;
A third obtaining module 503, configured to obtain a website top page list based on the stock site library, extract a friend link of each website top page, obtain a friend link address list, and obtain a second expansion column address based on the friend link address list;
the judging module 504 is configured to obtain text content in the corresponding pages of the first expansion column address and the second expansion column address, and judge whether the text content includes bidding information; and if the text content is judged to contain bidding information, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address.
In some embodiments, the second obtaining module 502 parses a page of each column address to obtain a first extended column address, including:
analyzing a page DOM tree of each column address;
extracting sibling nodes of the current column DOM node based on the page DOM tree, and taking the sibling nodes as a first expansion column address; and/or extracting all child nodes under the same-level sibling nodes of the parent node of the current column DOM node based on the page DOM tree to serve as the first expansion column address.
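The two DOM rules above can be illustrated with a short sketch. It assumes, purely for illustration, that the column page is well-formed XML so that the standard-library `xml.etree.ElementTree` can parse it (real pages would usually need a tolerant HTML parser), and that column nodes are `<a>` elements identified by their `href`; the function name `first_expansion_addresses` is hypothetical.

```python
import xml.etree.ElementTree as ET


def first_expansion_addresses(page_xml: str, current_href: str) -> list[str]:
    """Collect column addresses that sit next to the known column in the DOM."""
    root = ET.fromstring(page_xml)
    # ElementTree has no parent pointers, so build a child -> parent map.
    parent_of = {child: parent for parent in root.iter() for child in parent}

    # Locate the DOM node of the current (already known) column.
    current = next((a for a in root.iter("a")
                    if a.get("href") == current_href), None)
    if current is None:
        return []

    def href_of(node):
        return node.get("href") if node.tag == "a" else None

    found = []
    parent = parent_of.get(current)
    if parent is not None:
        # Rule 1: sibling nodes of the current column node.
        for sibling in parent:
            if sibling is not current:
                found.append(href_of(sibling))
        # Rule 2: child nodes under the same-level siblings of the parent node.
        grandparent = parent_of.get(parent)
        if grandparent is not None:
            for parent_sibling in grandparent:
                if parent_sibling is not parent:
                    found.extend(href_of(child) for child in parent_sibling)

    # Drop non-links and duplicates while keeping document order.
    return list(dict.fromkeys(h for h in found if h and h != current_href))
```

Rule 1 covers flat navigation bars where the links are direct siblings, while rule 2 covers the common `<ul><li><a>` layout in which each link is wrapped in its own list item.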
In some embodiments, the third obtaining module 503 obtains a second expansion column address based on the friendship link address list, including:
based on the friend link address list, collecting pages of each friend link address;
resolving the page of each friend link address to obtain all hyperlinks in the page of each friend link address to obtain a hyperlink address list;
collecting page source codes of each hyperlink address based on the hyperlink address list;
and analyzing the corresponding page based on the page source code, and filtering through preset characteristic words to obtain a second expansion column address.
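Obtaining all hyperlinks in the page of a friend link address could look like the following sketch. It uses only the Python standard library; the class and function names are hypothetical, and skipping fragment, `javascript:`, and `mailto:` links is an assumption made here, not something stated in the application.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collects absolute, de-duplicated hyperlink addresses from <a> tags."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Assumption: in-page anchors and non-HTTP schemes are not columns.
        if not href or href.startswith(("#", "javascript:", "mailto:")):
            return
        absolute = urljoin(self.base_url, href)
        if absolute not in self.links:
            self.links.append(absolute)


def extract_hyperlinks(page_source: str, base_url: str) -> list[str]:
    """Return the hyperlink address list for one collected page."""
    collector = LinkCollector(base_url)
    collector.feed(page_source)
    return collector.links
```

Resolving each `href` against the address of the page it came from yields absolute addresses, which the later page source collection step needs.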
In some embodiments, before parsing the corresponding page based on the page source code, the third obtaining module 503 is further configured for:
setting the number of extended mining times;
acquiring a friend link address list again based on the page source code;
acquiring pages of each friend link address based on the re-acquired friend link address list, analyzing the pages of each friend link address, and acquiring all hyperlinks in the pages of each friend link address to obtain a secondary mined hyperlink address list;
acquiring page source codes of each hyperlink address based on the secondary mined hyperlink address list, and so on, until the page source codes used for parsing the pages are acquired according to the set number of extended mining times.
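The repeated expansion controlled by a set number of times can be sketched as a depth-limited traversal. This is a simplification under stated assumptions: friend link extraction and hyperlink extraction are folded into a single hypothetical callable `fetch_links`, which returns the addresses found on a page, and the function returns every newly discovered address rather than only those of the last round.

```python
from typing import Callable, Iterable


def extended_mining(seed_links: Iterable[str],
                    fetch_links: Callable[[str], list[str]],
                    rounds: int) -> list[str]:
    """Expand the link list `rounds` times and return the new addresses."""
    frontier = list(dict.fromkeys(seed_links))  # de-duplicated, order kept
    seen = set(frontier)
    collected: list[str] = []
    for _ in range(rounds):
        next_frontier = []
        for address in frontier:
            for link in fetch_links(address):
                if link not in seen:  # never revisit a known address
                    seen.add(link)
                    next_frontier.append(link)
        collected.extend(next_frontier)
        frontier = next_frontier
        if not frontier:  # nothing new was found, stop early
            break
    return collected
```

Because each round can multiply the number of addresses, the configured number of rounds is the main control over how much crawling and later filtering work is generated.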
In some embodiments, the third obtaining module 503 parses the corresponding page based on the page source code, and filters the page source code through a preset feature word to obtain a second expansion column address, including:
setting phrases with bidding attributes as feature words;
analyzing the corresponding page according to the page source code to obtain a page title name;
judging whether the page title name contains the feature word or not; and if the page title name contains any set feature word, taking the page address corresponding to the page title name as a second expansion column address.
In some embodiments, the determining module 504 obtains text content in the corresponding pages of the first expansion column address and the second expansion column address, and determines whether the text content includes bidding information, including:
crawling text contents in corresponding pages of the first expansion column address and the second expansion column address based on a crawler script;
Judging whether the text content contains bidding information or not by using preset keywords; if the text content contains the keyword, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address;
and if the text content does not contain the keywords, judging whether the text content contains bidding information or not based on a semantic analysis model, and if the text content does not contain the bidding information, filtering out a first expansion column address or a second expansion column address corresponding to the text content.
In some embodiments, the determining module 504 divides the keywords into primary keywords and secondary keywords based on the bidding information and the parameter types of the bidding information, and determines whether the text content includes bidding information by using the preset keywords, including:
if the text content contains the primary key word, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address;
judging whether the text content contains a secondary keyword or not if the text content does not contain the primary keyword, and taking a first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address if the text content contains the secondary keyword;
And if the text content does not contain the secondary keywords, judging whether the text content contains bidding information or not based on a semantic analysis model.
According to the mining device for the bidding information release platform, the stock site library is acquired through the first acquisition module; a column address list is acquired based on the stock site library through the second acquisition module, the pages of each column address are collected, and the pages of each column address are analyzed to acquire a first expansion column address; a website home page list is acquired based on the stock site library through the third acquisition module, the friend links of each website home page are extracted to obtain a friend link address list, and a second expansion column address is acquired based on the friend link address list; and the text contents in the corresponding pages of the first expansion column address and the second expansion column address are acquired through the judging module, which judges whether the text contents contain bidding information so as to determine a target expansion column address. Therefore, bidding release platforms that have not yet been recorded can be automatically mined from the existing bidding release platforms, and the mining efficiency and accuracy are improved.
Based on the same concept of the present invention, fig. 6 of the present disclosure shows a structure of an electronic device 600 according to an embodiment of the present application, where the electronic device 600 includes: at least one processor 601, at least one network interface 604 or other user interface 603, memory 605, at least one communication bus 602. The communication bus 602 is used to enable connected communications between these components. The electronic device 600 optionally includes a user interface 603 including a display (e.g., a touch screen, LCD, CRT, holographic imaging (Holographic) or projection (Projector), etc.), a keyboard or pointing device (e.g., a mouse, trackball, touch pad or touch screen, etc.).
In some implementations, the memory 605 stores the following elements, executable modules or data structures, or a subset thereof, or an extended set thereof:
an operating system 6051 containing various system programs for implementing various basic services and handling hardware-based tasks;
The application program module 6052 includes various application programs such as a desktop (desktop), a Media Player (Media Player), a Browser (Browser), and the like for implementing various application services.
In the embodiment of the present application, by calling a program or an instruction stored in the memory 605, the processor 601 is configured to execute the steps of the bidding information release platform mining method, so that bidding release platforms that have not yet been recorded can be automatically mined from existing bidding release platforms, with high mining efficiency and accuracy.
The present application also provides a computer readable storage medium having a computer program stored thereon, which when executed by a processor performs steps as in a bid information distribution platform mining method.
Specifically, the storage medium can be a general-purpose storage medium, such as a removable disk, a hard disk, or the like, and when the computer program on the storage medium is executed, the bidding information release platform mining method described above can be performed.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments provided in the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.
Finally, it should be noted that the foregoing examples are merely specific embodiments of the present application and are not intended to limit its scope. Although the present application is described in detail with reference to the foregoing examples, it will be understood by those skilled in the art that any person skilled in the art may, within the technical scope disclosed in the present application, modify or readily conceive of changes to the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features. Such modifications, changes or substitutions do not depart from the spirit and scope of the corresponding technical solutions and are intended to be encompassed within the scope of this application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A bid information distribution platform mining method, the method comprising the steps of:
acquiring an inventory site library, wherein the inventory site library comprises a determined website home page and a column address for issuing bidding information;
acquiring a column address list based on the stock site library, acquiring pages of each column address, and analyzing the pages of each column address to acquire a first expansion column address;
Acquiring a website top page list based on the stock site library, extracting a friend link of each website top page to acquire a friend link address list, and acquiring a second expansion column address based on the friend link address list;
acquiring text contents in corresponding pages of the first expansion column address and the second expansion column address, and judging whether the text contents contain bidding information or not; and if the text content is judged to contain bidding information, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address.
2. The mining method of bidding information distribution platform according to claim 1, wherein the parsing of the page of each column address to obtain the first expanded column address comprises the following steps:
analyzing a page DOM tree of each column address;
extracting sibling nodes of the current column DOM node based on the page DOM tree, and taking the sibling nodes as a first expansion column address; and/or extracting all child nodes under the same-level sibling nodes of the parent node of the current column DOM node based on the page DOM tree to serve as the first expansion column address.
3. The method for mining the bidding information distribution platform according to claim 2, wherein the step of obtaining the second expansion column address based on the friendship link address list comprises the following steps:
based on the friend link address list, collecting pages of each friend link address;
resolving the page of each friend link address to obtain all hyperlinks in the page of each friend link address to obtain a hyperlink address list;
collecting page source codes of each hyperlink address based on the hyperlink address list;
and analyzing the corresponding page based on the page source code, and filtering through preset characteristic words to obtain a second expansion column address.
4. The mining method of bidding information distribution platform according to claim 3, further comprising the following steps before parsing the corresponding page based on the page source code:
setting the number of extension mining times;
secondarily acquiring a friend link address list based on the page source code;
acquiring pages of each friend link address based on the twice-acquired friend link address list, analyzing the pages of each friend link address, and acquiring all hyperlinks in the pages of each friend link address to obtain a twice-mined hyperlink address list;
And acquiring page source codes of each hyperlink address based on the twice-mined hyperlink address list, and accordingly, acquiring page source codes for analyzing the pages according to the set extending mining times.
5. The mining method of bidding information distribution platform according to claim 4, wherein the analyzing the corresponding page based on the page source code and filtering through the preset feature word to obtain the second expanded column address comprises the following steps:
setting phrases with bidding attributes as feature words;
analyzing the corresponding page according to the page source code to obtain a page title name;
judging whether the page title name contains the feature word or not; and if the page title name contains any set feature word, taking the page address corresponding to the page title name as a second expansion column address.
6. The mining method of bidding information distribution platform according to claim 5, wherein the steps of obtaining text content in the corresponding pages of the first expansion column address and the second expansion column address, and judging whether the text content contains bidding information, include:
Crawling text contents in corresponding pages of the first expansion column address and the second expansion column address based on a crawler script;
judging whether the text content contains bidding information or not by using a preset keyword group; if the text content contains the keyword group, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address;
and if the text content does not contain the keyword group, judging whether the text content contains bidding information or not based on a semantic analysis model, and if the text content does not contain the bidding information, filtering a first expansion column address or a second expansion column address corresponding to the text content.
7. The method according to claim 6, wherein the keyword group is divided into a primary keyword group and a secondary keyword group based on the bid information and the parameter type of the bid information, and whether the text content contains the bid information is judged by using a preset keyword group as follows:
if the text content contains the primary key word group, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address;
Judging whether the text content contains a secondary keyword group or not if the text content does not contain the primary keyword group, and taking a first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address if the text content contains the secondary keyword group;
and if the text content does not contain the secondary keyword group, judging whether the text content contains bidding information or not based on a semantic analysis model.
8. A bid information distribution platform mining apparatus, the apparatus comprising:
the first acquisition module is used for acquiring an inventory site library, wherein the inventory site library comprises a determined website home page and a column address for issuing bidding information;
the second acquisition module is used for acquiring a column address list based on the stock site library, acquiring pages of each column address, analyzing the pages of each column address and acquiring a first expansion column address;
the third acquisition module is used for acquiring a website first page list based on the stock site library, extracting a friend link of each website first page to obtain a friend link address list, and acquiring a second expansion column address based on the friend link address list;
The judging module is used for acquiring text contents in the corresponding pages of the first expansion column address and the second expansion column address and judging whether the text contents contain bidding information or not; and if the text content is judged to contain bidding information, taking the first expansion column address or the second expansion column address corresponding to the text content as a target expansion column address.
9. An electronic device, comprising: a processor, a memory and a bus, the memory storing machine-readable instructions executable by the processor, the processor and the memory in communication over the bus when the electronic device is running, the machine-readable instructions when executed by the processor performing the steps of the bidding information distribution platform mining method of any of claims 1 to 7.
10. A computer readable storage medium, characterized in that the computer readable storage medium has stored thereon a computer program which, when executed by a processor, performs the steps of the bidding information distribution platform mining method of any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310638881.9A CN116361594B (en) | 2023-06-01 | 2023-06-01 | Mining method, device, equipment and medium for bidding information release platform |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310638881.9A CN116361594B (en) | 2023-06-01 | 2023-06-01 | Mining method, device, equipment and medium for bidding information release platform |
Publications (2)
Publication Number | Publication Date |
---|---|
CN116361594A true CN116361594A (en) | 2023-06-30 |
CN116361594B CN116361594B (en) | 2023-08-25 |
Family
ID=86909455
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310638881.9A Active CN116361594B (en) | 2023-06-01 | 2023-06-01 | Mining method, device, equipment and medium for bidding information release platform |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116361594B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2001175759A (en) * | 1999-12-20 | 2001-06-29 | Kyowa:Kk | Synergic type building integrated information system utilizing the internet |
WO2015172490A1 (en) * | 2014-05-16 | 2015-11-19 | 百度在线网络技术(北京)有限公司 | Method and apparatus for providing extended search item |
CN105468664A (en) * | 2015-05-12 | 2016-04-06 | 北京众标网络科技有限公司 | Information acquisition method and apparatus |
CN107239891A (en) * | 2017-05-26 | 2017-10-10 | 山东省科学院情报研究所 | A kind of bid checking method based on big data |
CN108415968A (en) * | 2018-02-08 | 2018-08-17 | 湖南慧集网络科技有限责任公司 | A kind of acquisition method of information on bidding |
CN108427721A (en) * | 2018-02-08 | 2018-08-21 | 湖南慧集网络科技有限责任公司 | A kind of standardized method of the information on bidding based on database and system |
CN109582883A (en) * | 2017-09-29 | 2019-04-05 | 北京国双科技有限公司 | The determination method and apparatus of column page |
CN112685620A (en) * | 2020-12-31 | 2021-04-20 | 山东奥邦交通设施工程有限公司 | Bidding information processing method, system, readable storage medium and device |
CN114912905A (en) * | 2022-07-15 | 2022-08-16 | 北京拓普丰联信息科技股份有限公司 | Target object mining method and device |
Non-Patent Citations (1)
Title |
---|
张涛;廖力;: "基于链接的网站搜索引擎优化策略", 湖北工业大学学报, no. 05, pages 61 - 63 * |
Also Published As
Publication number | Publication date |
---|---|
CN116361594B (en) | 2023-08-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US10853566B2 (en) | Systems and methods for automatically creating tables using auto-generated templates | |
CN103023714B (en) | The liveness of topic Network Based and cluster topology analytical system and method | |
US8645385B2 (en) | System and method for automating categorization and aggregation of content from network sites | |
CN108090104B (en) | Method and device for acquiring webpage information | |
CN107943838B (en) | Method and system for automatically acquiring xpath generated crawler script | |
JP2004139304A (en) | Hyper text inspection device, its method, and program | |
CN103778151A (en) | Method and device for identifying characteristic group and search method and device | |
US8560518B2 (en) | Method and apparatus for building sales tools by mining data from websites | |
US20110145398A1 (en) | System and Method for Monitoring Visits to a Target Site | |
EP3289487B1 (en) | Computer-implemented methods of website analysis | |
CN105205080A (en) | Redundant file clearing method, device and system | |
CN108446136B (en) | Element code extraction method and system | |
CN103838862A (en) | Video searching method, device and terminal | |
CN105468627A (en) | Method and system for shielding and filtering web page contents | |
CN116226494B (en) | Crawler system and method for information search | |
CN113505317A (en) | Illegal advertisement identification method and device, electronic equipment and storage medium | |
CN111158973B (en) | Web application dynamic evolution monitoring method | |
CN108846134A (en) | A kind of O&M scheme recommender system and method based on web crawlers | |
CN112612990A (en) | Webpage analysis method, system and computer readable storage medium | |
CN116361594B (en) | Mining method, device, equipment and medium for bidding information release platform | |
CN113806647A (en) | Method for identifying development framework and related equipment | |
CN109948015B (en) | Meta search list result extraction method and system | |
CN114528448B (en) | Accurate analytic system of drawing of portrait of global foreign trade customer | |
Martinez | Two datasets of questions and answers for studying the development of cross-platform mobile applications using Xamarin framework | |
CN111222918B (en) | Keyword mining method and device, electronic equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||