CN112035594A - Bidding information extraction result screening system and method - Google Patents

Bidding information extraction result screening system and method

Info

Publication number
CN112035594A
Authority
CN
China
Prior art keywords
result
weight
extraction
extraction result
score
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911060290.8A
Other languages
Chinese (zh)
Inventor
贾新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing swordfish Information Technology Co.,Ltd.
Beijing Tuopu Fenglian Information Technology Co ltd
Hefei Topnet System Engineering Co ltd
Henan Tupu Computer Network Engineering Co ltd
Original Assignee
Henan Tupu Computer Network Engineering Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Henan Tupu Computer Network Engineering Co ltd
Priority to CN201911060290.8A
Publication of CN112035594A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/3332 Query translation
    • G06F16/3334 Selection or weighting of terms from queries, including natural language queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q30/00 Commerce
    • G06Q30/06 Buying, selling or leasing transactions
    • G06Q30/08 Auctions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Data Mining & Analysis (AREA)
  • Accounting & Taxation (AREA)
  • Finance (AREA)
  • Computational Linguistics (AREA)
  • General Business, Economics & Management (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Software Systems (AREA)
  • Marketing (AREA)
  • Economics (AREA)
  • Development Economics (AREA)
  • Artificial Intelligence (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a bidding information extraction result screening system and method, which aim to solve the technical problem that existing bidding information extraction results are inaccurate. The invention comprises four parts: 1. configuring an attribute weight table; 2. importing the extraction results and initializing a score for each of the multiple results; 3. calculating the weighted result of each extraction result from its attributes in combination with the attribute weight table; 4. selecting the extraction result with the largest weighted result as the optimal extraction result. The beneficial technical effects of the invention are high accuracy, high flexibility and high efficiency.

Description

Bidding information extraction result screening system and method
Technical Field
The invention relates to the technical field of internet data processing, and in particular to a bidding information extraction result screening system and method.
Background
With the development of internet technology, more and more enterprises and institutions publish bidding information on the internet. As the volume of information grows rapidly, bidders rely increasingly on bid retrieval and push services to obtain relevant bidding information.
However, existing bidding information retrieval has many shortcomings in data accuracy, precise matching and other respects. Typically, the bidding data collected by web crawlers matches several suspected results during extraction, and it is difficult to select the most accurate and reasonable one.
The selection schemes in common use are as follows:
1) First-match principle: the first matched result is taken as the final result;
2) Random principle: one of the matched results is selected at random as the final result.
The two approaches are very similar: both carry a large element of uncertainty, and the final result chosen is not necessarily optimal.
Disclosure of Invention
The invention provides a bidding information extraction result screening system and method, which aim to solve the technical problem that existing bidding information extraction results are inaccurate.
To solve this technical problem, the invention adopts the following technical scheme:
A bidding information extraction result screening system is designed, comprising a preset parameter setting unit, an initialization score setting unit, a result score calculating unit and an output unit;
the preset parameter setting unit is used for assigning weight values to the attributes of the extraction results;
the initialization score setting unit is used for setting an initial score for each extraction result;
the result score calculating unit is used for calculating the score of each extraction result after attribute weighting;
and the output unit is used for sorting the weighted results and taking the extraction result with the largest weighted result as the final extraction result.
Further, the attributes of the extraction result include character length, positive/negative words, numerical range, paragraph index, extraction mode, and field tag.
A bidding information extraction result screening method is also designed, comprising the following steps:
S1: a computer processor configures an attribute weight table characterizing the information extraction results, and assigns weights to each group of attributes;
S2: the computer loads the extracted bidding information into memory and gives each item of information an initial score;
S3: the processor calculates a final score for each extraction result according to its attribute weights;
S4: the extraction result with the highest final score from step S3 is selected as the output extraction result.
Preferably, in step S3, the initial score is multiplied in turn by the weight of each attribute of the extraction result, and the product is the final score.
Preferably, the attribute weight table covers the paragraph index, the extraction mode, the field tag, positive words, negative words, character length range matching, and the numerical range.
Preferably, the paragraph index, the extraction mode and the field tag are additional data generated during data extraction.
Preferably, the extraction modes include tag-value pair sequence with a weight of 1, table identification with a weight of 0.9, and regular expression with a weight of 0.7.
Preferably, the positive word weight is 1 and the negative word weight is 0.7.
Preferably, in step S4, the weighted results are sorted in descending order, and the first one is the extraction result with the highest score.
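Steps S1 to S4 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the implementation of the invention: the function name `screen`, the dictionary layout of `WEIGHTS`, the attribute keys and the sample candidates are all hypothetical; only the weight values and the multiply-in-turn rule are taken from the text.

```python
from math import prod

# S1: attribute weight table (layout and key names are assumptions;
# the numeric weights are the ones quoted in the text).
WEIGHTS = {
    "mode": {"tag_value_pair": 1.0, "table": 0.9, "regex": 0.7},
    "polarity": {"positive": 1.0, "negative": 0.7},
}

def screen(candidates, weights=WEIGHTS, initial_score=100.0):
    """Return (score, candidate) for the highest-scoring candidate."""
    scored = []
    for cand in candidates:
        # S2: every candidate starts from the same initial score.
        # S3: multiply by the weight of each attribute in turn.
        factor = prod(weights[attr][cand[attr]] for attr in weights)
        scored.append((initial_score * factor, cand))
    # S4: sort in descending order; the first entry is the best result.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[0]

# Hypothetical candidates for one field.
candidates = [
    {"text": "result A", "mode": "regex", "polarity": "positive"},
    {"text": "result B", "mode": "table", "polarity": "positive"},
]
best_score, best = screen(candidates)
print(best["text"], best_score)
```

Because the weights live in a plain table, a different field or data source only needs a different `weights` argument, which matches the configurability the text describes.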
Compared with the prior art, the main beneficial technical effects of the invention are:
1. The processed information is highly accurate: the multiple suspected results are scored on a variety of factors, so the most reasonable result can be selected effectively, improving the accuracy of bidding information retrieval.
2. The method is flexible to use and configure: the weight table for each scoring stage can be configured according to the attribute characteristics, and the method can be reused for different bidding information simply by changing the preset weight parameters, which widens its range of application.
3. The invention extracts the best result automatically, greatly reducing the workload of manual checking and helping to improve working efficiency while maintaining quality.
Drawings
Fig. 1 is a flowchart of the bidding information extraction result screening method according to the present invention.
Detailed Description
The following examples are intended to illustrate the present invention in detail and should not be construed as limiting the scope of the present invention in any way.
The procedures and/or methods involved in or relied on in the following examples are routine procedures or simple procedures in the art, and those skilled in the art can make routine selection or adaptation according to specific application scenarios, unless otherwise specified.
Example 1: a bidding information extraction result screening system comprising the following four parts:
(1) The preset parameter setting unit, which assigns weight values to the attributes of the extraction results. The attributes cover the bidding data crawled by the web crawler, for which positive words, negative words, character length range matching and numerical ranges are configured, and the intermediate data generated during extraction, including paragraph indexes, extraction modes and field tags.
A field tag relates to the item fields contained in the bidding text (such as project name, purchasing unit name, purchasing unit contact person, purchasing unit contact details, project budget, bid price, winning bidder, and so on). An item field is always preceded by a descriptive or introductory sentence, phrase or word, which is called the item field tag; for example, the project number in Table 1 is a field tag. Tag libraries for item fields such as project name and purchasing unit name are compiled from historical data, and each tag library contains the different names that most websites use for that item field.
The extraction mode indicates which method was used to obtain the data. Different extraction modes affect data accuracy to some degree; conventionally, the modes rank by accuracy as follows: tag-value pair sequence > table identification > regular expression extraction. For example, an extraction mode weight table may be set as [tag-value pair sequence: 1, table identification: 0.9, regular expression extraction: 0.7].
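The extraction mode weight table maps naturally onto a lookup table. The sketch below uses the example weights from the text; the key names and the function `mode_weight` are hypothetical.

```python
# Extraction-mode weight table, using the example values from the text.
MODE_WEIGHTS = {
    "tag_value_pair_sequence": 1.0,
    "table_identification": 0.9,
    "regular_expression": 0.7,
}

def mode_weight(mode: str) -> float:
    """Look up the weight for an extraction mode; unknown modes raise KeyError."""
    return MODE_WEIGHTS[mode]

# Ranking the modes by weight reproduces the stated accuracy order.
ranked = sorted(MODE_WEIGHTS, key=MODE_WEIGHTS.get, reverse=True)
print(ranked)
```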
Positive and negative words: positive and negative word banks are configured for the different item fields, and when an extraction result matches a word bank its score is weighted accordingly (reward or penalty), that is, positive words are rewarded and negative words are penalized. For example, a positive/negative word weight table may be set as [positive word regular expression /{ 2,100} (project | engineering | construction | service | equipment | procurement | design | system) $/, weight 1.0; negative word regular expression /\\[ | [ ]/, weight 0.7], meaning that a result matching the positive expression is recorded as 1.0 and a result judged negative is recorded as 0.7.
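A positive/negative word check of this kind might look like the sketch below. Both patterns are hypothetical stand-ins: the positive pattern assumes the expression means "2 to 100 characters followed by one of the listed keywords at the end", and the negative pattern (any square bracket) is purely illustrative because the negative expression is not fully legible in the text. Treating an unmatched result as negative is likewise an assumption, based on the embodiment's statement that everything other than positive words is judged negative.

```python
import re

# Hypothetical English stand-ins for the positive and negative word banks.
POSITIVE = re.compile(
    r".{2,100}(project|engineering|construction|service|equipment|"
    r"procurement|design|system)$"
)
NEGATIVE = re.compile(r"[\[\]]")  # illustrative only: any square bracket

def polarity_weight(text, positive_weight=1.0, negative_weight=0.7):
    """Return the reward/penalty weight for one extraction result."""
    if NEGATIVE.search(text):
        return negative_weight      # penalty: a negative word matched
    if POSITIVE.search(text):
        return positive_weight      # reward: ends with a positive keyword
    return negative_weight          # assumption: unmatched counts as negative

print(polarity_weight("playground reconstruction project"))
```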
Character length range matching: for the different item fields (mainly character types), a character length range weight table is configured. The character length range of each item field is set according to historical experience, and data outside the range is penalized by weight. For example, the table may be set as [0 < character length ≤ 3, weight 0.2; 3 < character length ≤ 5, weight 0.7; 5 < character length ≤ 35, weight 1; character length > 35, weight 0.7].
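The character length table can be expressed as a list of upper bounds checked in order. The band values below are the example values from the text; the names `LENGTH_BANDS` and `length_weight` are hypothetical, and raising an error for an empty string is an assumption, since the table defines no weight for length 0.

```python
import math

# (inclusive upper bound, weight), in ascending order of bound.
LENGTH_BANDS = [(3, 0.2), (5, 0.7), (35, 1.0), (math.inf, 0.7)]

def length_weight(text: str) -> float:
    """Return the weight of the first band whose upper bound covers len(text)."""
    n = len(text)
    if n == 0:
        raise ValueError("the table defines no weight for an empty result")
    for upper, weight in LENGTH_BANDS:
        if n <= upper:
            return weight

print(length_weight("abcdef"))
```

Note that `len` counts characters of the string as given, so the band boundaries only make sense for text in the language the table was tuned on.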
Numerical range: for the different item fields (mainly numerical types such as contract amounts), a data range weight table is configured. The value range of each item field is set according to historical experience, and data outside the range is penalized by weight. For example, the table may be set as [provincial project: 0 < amount ≤ 50000, weight 0.7; 50000 < amount ≤ 10000000, weight 1; 10000000 < amount ≤ 100000000, weight 0.8; amount > 100000000, weight 0.6] and [municipal project: 0 < amount ≤ 50000, weight 0.7; 50000 < amount ≤ 10000000, weight 1; amount > 10000000, weight 0.7].
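The numerical range table can be sketched the same way, with one list of bands per project level. The band values are the example values from the text; the level keys `provincial` and `municipal` are hypothetical labels for the two project levels in the example, and rejecting non-positive amounts is an assumption because the table starts above 0.

```python
import math

# Per project level: (inclusive upper bound, weight), ascending.
AMOUNT_BANDS = {
    "provincial": [(50_000, 0.7), (10_000_000, 1.0),
                   (100_000_000, 0.8), (math.inf, 0.6)],
    "municipal": [(50_000, 0.7), (10_000_000, 1.0), (math.inf, 0.7)],
}

def amount_weight(level: str, amount: float) -> float:
    """Return the weight for a project amount at the given level."""
    if amount <= 0:
        raise ValueError("the table defines no weight for amounts <= 0")
    for upper, weight in AMOUNT_BANDS[level]:
        if amount <= upper:
            return weight

print(amount_weight("provincial", 2_000_000))
```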
(2) Importing the extraction results and initializing scores. When the web crawler extracts results, the feature attributes of each result must be identified, such as how many characters it contains, whether it contains certain words, and the number of the paragraph in which it is located. The second part of the system collates the results generated during data extraction together with their feature attributes, and then places all results in a cache to await evaluation and screening.
(3) The result score calculating unit, which calculates a weighted score for each extraction result. Specifically, an initial value is assigned to each extraction result, the weight corresponding to each attribute of the extraction result is then looked up in the preset parameter setting unit of the first part, the initial value is multiplied by the weight of each attribute to obtain a score, and the scores are then summed to obtain the final score.
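The paragraph above can be read literally as a sum of per-attribute scores, while the method steps describe multiplying the initial score by each weight in turn. The sketch below shows both readings side by side so the difference is concrete. The function names and sample weights are hypothetical, and neither function is asserted to be the implementation of the invention.

```python
from math import prod

def score_sum(initial, weights):
    """Literal reading: one score per attribute (initial x weight), then summed."""
    return sum(initial * w for w in weights)

def score_product(initial, weights):
    """Multiplicative reading: the initial score times every weight in turn."""
    return initial * prod(weights)

weights = [1.0, 1.0, 0.7]  # illustrative weights for three attributes
print(score_sum(100, weights), score_product(100, weights))
```

Under the multiplicative reading the score never exceeds the initial score as long as every weight is at most 1, so a single low weight penalizes the result directly.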
(4) The output unit, which sorts the final scores of all results from the result score calculating unit of the third part in descending order. The first result has the highest score of all extraction results, indicating that it is the most reliable, and it is then written to the final result text.
Example 2: this embodiment is a method of screening results using the bidding information extraction result screening system of Example 1. The bidding information shown in Table 1 (desensitized data) is taken as an example, with reference to Fig. 1.
Table 1: Bidding information for a project
[Table 1 appears as an image in the original publication and is not reproduced here.]
In step 401, the system weight tables are configured and a weight is set for each attribute, including a field tag weight table, an extraction mode weight table, a positive word weight table, a negative word weight table, a character length weight table, and a data range weight table. In the extraction mode weight table, the regular expression is set to 0.7 and the tag-value pair sequence to 1. The weight of the positive word "project" is set to 1; everything else is judged to be a negative word with a weight of 0.7, and positive words can be added or removed as needed. For the character length, a length of less than 3 has a weight of 0.2, a length greater than 3 and less than 5 a weight of 0.7, a length greater than 5 and less than 35 a weight of 1, and a length greater than 35 a weight of 0.7.
In this embodiment the bidding information is extracted as follows:
1) load the content in text format;
2) match using regular expressions and identify the extracted data, such as the project name;
3) identify the extracted data by table identification, such as the project name;
4) identify the extracted data by tag-value pair sequence identification, such as the project name.
The extracted data is stored with the following information: the extraction mode (regular expression mode / table mode / key-value (KV) mode), the natural paragraph (paragraph number) in which the data is located, and the extracted data itself.
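The stored record can be modelled as a small data structure. The sketch below is one possible layout; the class names `Mode` and `ExtractionRecord`, the field names, and the sample values are hypothetical, while the three stored items (extraction mode, paragraph number, extracted data) come from the text.

```python
from dataclasses import dataclass
from enum import Enum

class Mode(Enum):
    """The three extraction modes named in the text."""
    REGEX = "regular expression"
    TABLE = "table"
    KV = "tag-value pair sequence"

@dataclass
class ExtractionRecord:
    field_name: str      # which item field was extracted, e.g. project name
    mode: Mode           # how the value was obtained
    paragraph: int       # number of the natural paragraph it came from
    value: str           # the extracted data itself
    score: float = 100   # filled in later by the scoring steps

# Hypothetical sample record.
rec = ExtractionRecord("project name", Mode.REGEX, 1, "example value")
print(rec)
```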
In step 402, the output of the text extraction platform is imported. For example, when the project name field is extracted from Table 1, the extraction result contains 2 records: "playground reconstruction project" (in the first natural paragraph) is extracted by regular expression, and "playground reconstruction project of XX Primary School, Mudanjiang City" (in the fourth natural paragraph) is extracted by tag-value pair sequence. A total of 2 suspected results are found, and the flow proceeds to step 403.
In step 403, it is determined whether there are multiple suspected extraction results; if so, the flow proceeds to step 404, otherwise it ends directly. In this embodiment two results are extracted, so further evaluation from step 404 onward is required.
In step 404, an initial score is set for each extraction result; the default is 100.
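Steps 403 and 404 amount to a short branch followed by score initialization. The sketch below is an illustration only; the function name `init_or_finish` and its return convention are hypothetical.

```python
def init_or_finish(results, initial_score=100):
    """Step 403/404 sketch: return (final, pending).

    With zero or one result there is nothing to compare, so the flow ends
    and `final` holds that result (or None). With several suspected results,
    `pending` pairs each one with the initial score for later weighting.
    """
    if len(results) < 2:
        return (results[0] if results else None), []
    return None, [{"value": r, "score": initial_score} for r in results]

final, pending = init_or_finish(["result A", "result B"])
print(final, pending)
```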
With its attributes completed, the project name extraction result from Table 1 reads as follows: [first result: extraction mode = regular expression, extraction result = playground reconstruction project, natural paragraph number = 1, score = 100; second result: extraction mode = tag-value pair sequence, extraction result = playground reconstruction project of XX Primary School, Mudanjiang City, natural paragraph number = 4, score = 100], the default score being 100. The flow proceeds to step 405 for calculation.
In step 405, the type of the extraction result (character type or numerical type) determines whether the flow proceeds to step 406 or step 408: character type goes to step 406, and numerical type goes to step 408. The project name extracted from Table 1 is of character type, so the flow proceeds to step 406.
In step 406, the extraction result is matched against the positive and negative word banks and its score is weighted according to the configured weight table. For the extraction results above, the positive/negative word scores are [first result score = 100 × weight 1 = 100; second result score = 100 × weight 1 = 100], and the flow proceeds to step 407.
In step 407, the character length of the extraction result is matched against the character length range weight table, and the score is weighted according to the configured weight table. The scores from the character length range weight table are [first result score = 100 × weight 1 = 100; second result score = 100 × weight 1 = 100], and the flow proceeds to step 409.
In step 409, the extraction mode of the extraction result is matched against the extraction mode weight table, and the score is weighted accordingly. The scores from the extraction mode weight table are [first result score = 100 × 0.7 = 70; second result score = 100 × 1 = 100], and the flow proceeds to step 410.
In step 410, the weighting operations above yield the final scores: first result "playground reconstruction project": 100 × 1 × 1 × 0.7 = 70; second result "playground reconstruction project of XX Primary School, Mudanjiang City": 100 × 1 × 1 × 1 = 100. The two extraction results are then sorted by score in descending order, and the first, "playground reconstruction project of XX Primary School, Mudanjiang City", is selected as the optimal result.
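The arithmetic of this embodiment can be reproduced in a few lines. The sketch below takes the per-step weights exactly as reported in steps 406, 407 and 409 rather than recomputing them from the text, so the dictionary keys and variable names are hypothetical labels only.

```python
from math import prod

INITIAL_SCORE = 100

# Weights per step as reported in the embodiment:
# [positive/negative word, character length, extraction mode].
results = {
    "regular expression result": [1.0, 1.0, 0.7],
    "tag-value pair sequence result": [1.0, 1.0, 1.0],
}

# Multiply the initial score by each weight in turn.
final = {name: INITIAL_SCORE * prod(ws) for name, ws in results.items()}

# Sort in descending order; the first entry is the selected result.
ranking = sorted(final.items(), key=lambda kv: kv[1], reverse=True)
for name, score in ranking:
    print(f"{name}: {score:g}")
```

The only weight that differs between the two candidates is the extraction mode, which is why the tag-value pair sequence result is ranked first.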
While the present invention has been described in detail with reference to the drawing and the embodiments, those skilled in the art will understand that the specific parameters in the above embodiments can be changed without departing from the spirit of the invention to form further specific embodiments, all of which fall within the ordinary range of variation of the invention and are not described in detail here.

Claims (9)

1. A bidding information extraction result screening system, characterized by comprising a preset parameter setting unit, an initialization score setting unit, a result score calculating unit and an output unit;
the preset parameter setting unit is used for assigning weight values to the attributes of the extraction results;
the initialization score setting unit is used for setting an initial score for each extraction result;
the result score calculating unit is used for calculating the score of each extraction result after attribute weighting;
and the output unit is used for sorting the weighted results and taking the extraction result with the largest weighted result as the final extraction result.
2. The bidding information extraction result screening system according to claim 1, wherein the attributes of the extraction result include character length, positive/negative words, numerical range, paragraph index, extraction mode, and field tag.
3. A method for result screening using the bidding information extraction result screening system of claim 1, comprising the following steps:
S1: a computer processor configures an attribute weight table characterizing the information extraction results, and assigns weights to each group of attributes;
S2: the computer loads the extracted bidding information into memory and gives each item of information an initial score;
S3: the processor calculates a final score for each extraction result according to its attribute weights;
S4: the extraction result with the highest final score from step S3 is selected as the output extraction result.
4. The bidding information extraction result screening method according to claim 3, wherein in step S3 the initial score is multiplied in turn by the weight of each attribute of the extraction result to obtain the final score.
5. The bidding information extraction result screening method according to claim 3, wherein the attribute weight table covers the paragraph index, the extraction mode, the field tag, positive words, negative words, character length range matching, and the numerical range.
6. The bidding information extraction result screening method according to claim 5, wherein the paragraph index, the extraction mode and the field tag are additional data generated during data extraction.
7. The bidding information extraction result screening method according to claim 5, wherein the extraction modes include tag-value pair sequence with a weight of 1, table identification with a weight of 0.9, and regular expression with a weight of 0.7.
8. The bidding information extraction result screening method according to claim 5, wherein the positive word weight is 1 and the negative word weight is 0.7.
9. The bidding information extraction result screening method according to claim 3, wherein in step S4 the weighted results are sorted in descending order, and the first one is the extraction result with the highest score.
CN201911060290.8A 2019-10-29 2019-10-29 Bidding information extraction result screening system and method Pending CN112035594A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911060290.8A CN112035594A (en) 2019-10-29 2019-10-29 Bidding information extraction result screening system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911060290.8A CN112035594A (en) 2019-10-29 2019-10-29 Bidding information extraction result screening system and method

Publications (1)

Publication Number Publication Date
CN112035594A true CN112035594A (en) 2020-12-04

Family

ID=73576221

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911060290.8A Pending CN112035594A (en) 2019-10-29 2019-10-29 Bidding information extraction result screening system and method

Country Status (1)

Country Link
CN (1) CN112035594A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114912905A (en) * 2022-07-15 2022-08-16 北京拓普丰联信息科技股份有限公司 Target object mining method and device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101398836A (en) * 2008-11-11 2009-04-01 丘雷 Search ordering method based on subjectivity and objectivity index and weight allocation
CN102542000A (en) * 2011-12-07 2012-07-04 北京风灵创景科技有限公司 Method and equipment for retrieving contacts
CN105976207A (en) * 2016-05-11 2016-09-28 山东大学 Information search result generation method and system based on multi-attribute dynamic weight distribution
CN107844601A (en) * 2017-11-23 2018-03-27 四川长虹电器股份有限公司 Bid message screening system and method based on web crawlers
CN108874771A (en) * 2018-05-25 2018-11-23 福州大学 A kind of information extraction method towards bid text

Similar Documents

Publication Publication Date Title
JP6398510B2 (en) Entity linking method and entity linking apparatus
CN102982153B (en) A kind of information retrieval method and device thereof
CN108628833B (en) Method and device for determining summary of original content and method and device for recommending original content
CN110569322A (en) Address information analysis method, device and system and data acquisition method
CN106940788B (en) Intelligent scoring method and device, computer equipment and computer readable medium
US8886624B2 (en) Searching method using extended keyword pool and system thereof
CN108763362A (en) Method is recommended to the partial model Weighted Fusion Top-N films of selection based on random anchor point
CN112667794A (en) Intelligent question-answer matching method and system based on twin network BERT model
US20170018033A1 (en) Stock fluctuatiion prediction method and server
CN104778186B (en) Merchandise items are mounted to the method and system of standardized product unit
Vakulenko et al. Enriching iTunes App Store Categories via Topic Modeling.
CN110334356A (en) Article matter method for determination of amount, article screening technique and corresponding device
JP2008282366A (en) Query response device, query response method, query response program, and recording medium with program recorded thereon
CN109740642A (en) Invoice category recognition methods, device, electronic equipment and readable storage medium storing program for executing
CN103870528A (en) Method and system for question classification and feature mapping in deep question answering system
CN109284389A (en) A kind of information processing method of text data, device
CN111125443A (en) On-line updating method of test question bank based on automatic duplicate removal
CN112035594A (en) Bidding information extraction result screening system and method
JP6942759B2 (en) Information processing equipment, programs and information processing methods
CN112184021A (en) Answer quality evaluation method based on similar support set
CN110941638B (en) Application classification rule base construction method, application classification method and device
CN109657034A (en) Address similarity calculating method and its system
CN115738285A (en) Game quality evaluation feedback method and system
CN110909532B (en) User name matching method and device, computer equipment and storage medium
CN114067343A (en) Data set construction method, model training method and corresponding device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20220609

Address after: 450000 floor 3, building 7, Henan new technology market, No. 199, Yangjin Road, Jinshui District, Zhengzhou City, Henan Province

Applicant after: Henan Tupu computer network engineering Co.,Ltd.

Applicant after: Beijing Tuopu Fenglian Information Technology Co.,Ltd.

Applicant after: HEFEI TOPNET SYSTEM ENGINEERING CO.,LTD.

Applicant after: Beijing swordfish Information Technology Co.,Ltd.

Address before: 450000 floor 3, building 7, Henan new technology market, No. 199, Yangjin Road, Jinshui District, Zhengzhou City, Henan Province

Applicant before: Henan Tupu computer network engineering Co.,Ltd.