CN112035594A - Bidding information extraction result screening system and method

Info

Publication number: CN112035594A
Application number: CN201911060290.8A
Authority: CN
Inventors: 贾新
Original assignee: Henan Tupu Computer Network Engineering Co ltd
Current assignee: Beijing swordfish Information Technology Co.,Ltd.; Beijing Tuopu Fenglian Information Technology Co ltd; Hefei Topnet System Engineering Co ltd; Henan Tupu Computer Network Engineering Co ltd
Priority date: 2019-10-29
Filing date: 2019-10-29
Publication date: 2020-12-04

Abstract

The invention discloses a bidding information extraction result screening system and method, which aim to solve the technical problem that existing bidding information extraction results are inaccurate. The invention comprises four parts: 1. configuring an attribute weight table; 2. importing the extraction results and initializing a score for each of them; 3. calculating the weighted result of each extraction result from its attributes and the attribute weight table; 4. selecting the extraction result with the largest weighted result as the optimal extraction result. The beneficial technical effects of the invention are high accuracy, high flexibility and high efficiency.

Description

Bidding information extraction result screening system and method

Technical Field

The invention relates to the technical field of internet data information processing, in particular to a bidding information extraction result screening system and method.

Background

Currently, with the development of internet technology, more and more enterprises and institutions begin to use the internet to issue bidding information. With the rapid increase of the information amount, bidders rely more and more on bid retrieval and push services to acquire relevant bid information.

However, existing bidding information retrieval has many shortcomings in data accuracy, precise matching and the like. Typically, when bidding data collected by web crawlers is processed, the extraction step may match several suspected results, and it is difficult to select the most accurate and reasonable one.

The selection schemes in common use at present are as follows:

1) first-match principle: the first matched result is taken as the final result;

2) random principle: one of the matched results is selected at random as the final result.

The two approaches are similar in that both carry a large element of uncertainty, and the final result chosen is not necessarily optimal.

Disclosure of Invention

The invention provides a bidding information extraction result screening system and method, which aim to solve the technical problem that existing bidding information extraction results are inaccurate.

In order to solve the above technical problem, the invention adopts the following technical scheme:

A bidding information extraction result screening system is designed, which comprises a pre-parameter setting unit, an initialization score setting unit, a result score calculating unit and an output unit;

the pre-parameter setting unit is used for assigning a weight to each attribute of the extraction result;

the initialization score setting unit is used for setting an initial score for the extraction result;

the result score calculating unit is used for calculating the score of the extraction result after attribute weighting;

and the output unit is used for sorting the weighted results and outputting the extraction result with the largest weighted result as the final extraction result.

Further, the attributes of the extraction result include character length, positive/negative words, numerical range, paragraph index, extraction mode, and field tag.

A bidding information extraction result screening method is also designed, which comprises the following steps:

S1: a computer processor configures an attribute weight table characterizing the information extraction results and assigns a weight to each group of attributes;

S2: the computer imports the extracted bidding information into memory and assigns an initial score to each item of information;

S3: the processor calculates a final score for each extraction result according to its attribute weights;

S4: the extraction result with the highest final score in step S3 is selected as the output extraction result.

Preferably, in step S3, the initial score is multiplied by the weight of each attribute of the extracted result in turn, and the obtained result is the final score.
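The sequential multiplication in step S3 can be sketched in Python as follows. This is a minimal illustration only: the function name `final_score` and the list-of-weights argument are assumptions made for the example, not part of the disclosure.

```python
def final_score(initial_score, attribute_weights):
    """Multiply the initial score by each attribute weight in turn.

    `attribute_weights` is an assumed layout: one weight per attribute
    of the extraction result (e.g. word weight, length weight, mode weight).
    """
    score = initial_score
    for weight in attribute_weights:
        score *= weight
    return score


# An initial score of 100 with attribute weights 1, 1 and 0.7.
print(final_score(100, [1, 1, 0.7]))
```

Because the weights are multiplied, a single low weight lowers the final score no matter how the other attributes score.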

Preferably, the attribute weight table covers the index of the paragraph where the result is located, the extraction mode, the field tag, positive words, negative words, character length range matching, and numerical range.

Preferably, the paragraph index, the extraction mode and the field tag are additional data generated in the data extraction process.

Preferably, the extraction modes include tag-value pair sequence with a weight of 1, table identification with a weight of 0.9, and regular expression with a weight of 0.7.

Preferably, the positive word weight is 1, and the negative word weight is 0.7.

Preferably, in step S4, the weighted results are sorted in descending order, and the top-ranked one is the extraction result with the highest score.
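Step S4 amounts to sorting by score from highest to lowest and taking the first entry. A hedged sketch, assuming each scored result is a `(text, final_score)` pair and using the hypothetical helper name `select_best`:

```python
def select_best(scored_results):
    """Return the (text, score) pair with the highest score, or None if empty.

    `scored_results` is assumed to be a list of (text, final_score) pairs.
    """
    ranked = sorted(scored_results, key=lambda pair: pair[1], reverse=True)
    return ranked[0] if ranked else None


best = select_best([("candidate A", 70.0), ("candidate B", 100.0)])
print(best)
```

Python's `sorted` is stable, so under this sketch candidates with equal scores keep their input order; how ties should be resolved is not stated in the text.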

Compared with the prior art, the invention has the main beneficial technical effects that:

1. The information obtained is highly accurate. Multiple suspected results are scored on several factors, so the most reasonable result can be selected effectively, which improves the accuracy of bidding information retrieval.

2. The method is flexible to use and configure. The weight table of each scoring stage can be configured according to the attribute characteristics, and the method can be reused for different bidding information merely by changing the preset weight parameters, which broadens its range of application.

3. The invention extracts the best result automatically, which greatly reduces the workload of manual checking and helps improve working efficiency while maintaining quality.

Drawings

Fig. 1 is a flowchart of the bidding information extraction result screening method according to the present invention.

Detailed Description

The following examples are intended to illustrate the present invention in detail and should not be construed as limiting the scope of the present invention in any way.

The procedures and/or methods involved in or relied on in the following examples are routine procedures or simple procedures in the art, and those skilled in the art can make routine selection or adaptation according to specific application scenarios, unless otherwise specified.

Example 1: a bidding information extraction result screening system comprises the following four parts:

(1) The pre-parameter setting unit assigns a weight to each attribute of the extraction result. The attributes fall into two groups: properties of the output data extracted from the bidding data crawled by the web crawler, namely positive words, negative words, character length range matching and numerical range; and intermediate data generated in the extraction process, namely paragraph index, extraction mode and field label.

The field label is defined as follows. A bid text contains project fields (such as project name, purchasing unit name, purchasing unit contact person, purchasing unit contact information, project budget, winning bid amount, winning bidder, and the like), and a descriptive or introductory sentence, phrase or word always precedes each project field; this is called the project field label. For example, "project number" in Table 1 is a field label. Label libraries for project fields such as project name and purchasing unit name are collected from historical data, and each library contains the different names that most websites use for that project field.

The extraction mode refers to the method by which the data was obtained. Different extraction modes affect data accuracy to some degree; conventionally, accuracy ranks as follows: tag-value pair sequence > table identification > regular expression extraction. For example, an extraction mode weight table may be set to [tag-value pair sequence: 1, table identification: 0.9, regular expression extraction: 0.7].
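The extraction mode weight table maps naturally onto a dictionary lookup. In the sketch below the key strings and the fallback weight for an unlisted mode are assumptions; only the three weights come from the example above.

```python
# Weights from the example table; the key names are illustrative.
EXTRACTION_MODE_WEIGHTS = {
    "tag_value_pair_sequence": 1.0,
    "table_identification": 0.9,
    "regular_expression": 0.7,
}


def extraction_mode_weight(mode, default=1.0):
    """Look up the weight for an extraction mode.

    The `default` for a mode missing from the table is an assumption;
    the text does not say how unlisted modes are handled.
    """
    return EXTRACTION_MODE_WEIGHTS.get(mode, default)


print(extraction_mode_weight("table_identification"))
```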

Positive and negative words: positive and negative word libraries are configured for the different project fields, and when an extraction result matches a library its score is weighted accordingly, that is, positive words are rewarded and negative words are penalized. For example, a positive/negative word weight table may be set to [positive word regular expression /{ 2,100} (project | engineering | construction | service | equipment | procurement | design | system) $/, weight 1.0; negative word regular expression /\\[ | [ ]/, weight 0.7], meaning that a result matching the positive word expression is weighted 1.0 and a result matching the negative word expression is weighted 0.7.
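A hedged sketch of positive/negative word weighting with Python's `re` module follows. Both patterns are stand-ins written for this illustration: the positive pattern is an English rendering of the word list above, and the negative pattern (any square bracket) is only a guess at the intent of the source expression. The precedence of the negative check and the weight for a result matching neither pattern are likewise assumptions.

```python
import re

# Illustrative stand-in: text ending in one of the listed positive words.
POSITIVE_PATTERN = re.compile(
    r".{2,100}(project|engineering|construction|service|equipment"
    r"|procurement|design|system)$"
)
# Illustrative stand-in: text containing a square bracket.
NEGATIVE_PATTERN = re.compile(r"[\[\]]")


def word_weight(text, positive_weight=1.0, negative_weight=0.7, neutral_weight=1.0):
    """Return the word weight for a candidate string.

    Assumptions: a negative match takes precedence over a positive one,
    and a string matching neither pattern gets `neutral_weight`.
    """
    if NEGATIVE_PATTERN.search(text):
        return negative_weight
    if POSITIVE_PATTERN.search(text):
        return positive_weight
    return neutral_weight


print(word_weight("playground reconstruction project"))
print(word_weight("[attachment] reconstruction project"))
```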

Character length range matching: for the different project fields (mainly character-type fields), a character length range weight table is configured. The character length range of each project field is set from historical experience, and data outside the range is penalized by weight. For example, the table may be set to [0 < character length ≤ 3, weight 0.2; 3 < character length ≤ 5, weight 0.7; 5 < character length ≤ 35, weight 1; character length > 35, weight 0.7].
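The character length table can be expressed as a list of half-open ranges that is scanned until one matches. The sketch below uses the example thresholds; the table layout, the function name, and the weight returned for an empty string (which the table does not cover) are assumptions.

```python
import math

# (exclusive lower bound, inclusive upper bound, weight), from the example table.
CHAR_LENGTH_WEIGHTS = [
    (0, 3, 0.2),
    (3, 5, 0.7),
    (5, 35, 1.0),
    (35, math.inf, 0.7),
]


def char_length_weight(text, default=0.0):
    """Return the weight for the length of `text`.

    `default` applies to an empty string, a case the example table
    does not define; 0.0 is an assumption.
    """
    length = len(text)
    for low, high, weight in CHAR_LENGTH_WEIGHTS:
        if low < length <= high:
            return weight
    return default


print(char_length_weight("abcdef"))
```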

Numerical range: for the different project fields (mainly numerical-type fields such as contract amount), a data range weight table is configured. The value range of each project field is set from historical experience, and data outside the range is penalized by weight. For example, the table may be set to [provincial project: 0 < project amount ≤ 50000, weight 0.7; 50000 < project amount ≤ 10000000, weight 1; 10000000 < project amount ≤ 100000000, weight 0.8; project amount > 100000000, weight 0.6; municipal project: 0 < project amount ≤ 50000, weight 0.7; 50000 < project amount ≤ 10000000, weight 1; project amount > 10000000, weight 0.7].
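The numerical range table can be sketched the same way, keyed by project level. The level keys `provincial` and `municipal` are illustrative labels for the two project levels in the example, and the fallback weight for an unknown level or a non-positive amount is an assumption.

```python
import math

# (exclusive lower bound, inclusive upper bound, weight) per project level.
AMOUNT_RANGE_WEIGHTS = {
    "provincial": [
        (0, 50_000, 0.7),
        (50_000, 10_000_000, 1.0),
        (10_000_000, 100_000_000, 0.8),
        (100_000_000, math.inf, 0.6),
    ],
    "municipal": [
        (0, 50_000, 0.7),
        (50_000, 10_000_000, 1.0),
        (10_000_000, math.inf, 0.7),
    ],
}


def amount_weight(level, amount, default=1.0):
    """Return the weight for a project amount at the given level.

    `default` covers cases the example table leaves undefined
    (unknown level, amount <= 0); its value is an assumption.
    """
    for low, high, weight in AMOUNT_RANGE_WEIGHTS.get(level, []):
        if low < amount <= high:
            return weight
    return default


print(amount_weight("provincial", 2_000_000))
```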

(2) The extraction results are imported and their scores initialized. When the web crawler extracts results, the feature attributes of each result must be identified, such as how many characters the result contains, whether it contains certain words, and the index of the paragraph where it is located. This second part of the system collates the results generated in the data extraction process together with their feature attributes, then places all the results in a cache to await evaluation and screening.
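One way to picture the cache is a mapping from field name to the list of candidates awaiting screening. The sketch below is an assumption about layout only; the tuple shape, the attribute names and the helper `build_cache` are hypothetical.

```python
from collections import defaultdict


def build_cache(results):
    """Group candidates by field name so each field can be screened separately.

    `results` is assumed to be an iterable of (field_name, text, attributes)
    tuples, where `attributes` is a dict of feature attributes.
    """
    cache = defaultdict(list)
    for field_name, text, attributes in results:
        cache[field_name].append({"text": text, "attributes": attributes})
    return dict(cache)


cache = build_cache([
    ("project_name", "candidate A", {"paragraph_index": 1, "length": 11}),
    ("project_name", "candidate B", {"paragraph_index": 4, "length": 11}),
    ("budget", "50000", {"paragraph_index": 2, "length": 5}),
])
print(len(cache["project_name"]))
```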

(3) The result score calculating unit calculates a weighted value for each extraction result. Specifically, an initial value is assigned to each extraction result; the weight corresponding to each attribute of the extraction result is then looked up in the pre-parameter setting unit of the first part; the initial value is multiplied by the weight of each attribute to obtain a score, and the scores are then summed to obtain the final score.

(4) The output unit sorts the final scores of all results from the result score calculating unit of the third part in descending order. The top-ranked result has the highest score among all extraction results, indicating that it is the most reliable, and it is output to the final result text.

Example 2: this embodiment is a method of screening results using the bidding information extraction result screening system of Example 1. It is described with reference to Fig. 1, taking the bidding information shown in Table 1 (desensitized data) as an example.

Table 1: Bidding information for a project

In step 401, the system weight tables are configured and a weight is set for each attribute; the tables comprise a field label weight table, an extraction mode weight table, a positive word weight table, a negative word weight table, a character length weight table, and a data range weight table. In the extraction mode weight table, regular expression is set to 0.7 and tag-value pair sequence to 1. The positive word "project" is given a weight of 1; all other words are judged to be negative words and given a weight of 0.7; positive words can be added or deleted as actually required. In the character length weight table, a length of no more than 3 is weighted 0.2, a length greater than 3 and no more than 5 is weighted 0.7, a length greater than 5 and no more than 35 is weighted 1, and a length greater than 35 is weighted 0.7.

In this embodiment the bidding information is extracted as follows:

1) load the text-format content;

2) match with regular expressions and identify the extracted data, such as the project name;

3) identify the extracted data by table identification, such as the project name;

4) identify the extracted data by tag-value pair sequence identification, such as the project name;

The extracted data is stored together with the following information: the extraction mode used (regular expression / table / key-value (KV)), the natural paragraph (paragraph number) in which the data is located, and the extracted data itself.
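The stored record described above has three parts, which can be sketched as a small data class. The class name, field names and mode strings are assumptions made for illustration, and the sample values are placeholders.

```python
from dataclasses import dataclass


@dataclass
class ExtractedRecord:
    """One stored extraction, as described in the text (names are illustrative)."""
    mode: str             # extraction mode: "regex", "table" or "kv"
    paragraph_index: int  # number of the natural paragraph holding the data
    text: str             # the extracted data itself


record = ExtractedRecord(mode="regex", paragraph_index=1, text="sample project name")
print(record)
```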

In step 402, the output of the text extraction platform is imported. For example, when the project name field is extracted from Table 1, the extraction result contains 2 records: "playground reconstruction project" (in the first natural paragraph), extracted by regular expression, and "playground reconstruction project of XX Primary School, Mudanjiang City" (in the fourth natural paragraph), extracted by tag-value pair sequence. Two suspected results are found in total, and the process proceeds to step 403.

In step 403, it is determined whether there are multiple suspected extraction results; if so, the process proceeds to step 404, otherwise it ends directly. In this embodiment two results are extracted, so the process proceeds to step 404.

In step 404, an initial score is set for each extracted result, with a default of 100.

After the attributes are filled in, the project name results extracted from Table 1 are as follows: [first result: extraction mode = regular expression, extraction result = playground reconstruction project, natural paragraph index = 1, score = 100; second result: extraction mode = tag-value pair sequence, extraction result = playground reconstruction project of XX Primary School, Mudanjiang City, natural paragraph index = 4, score = 100], the default score being 100. The process proceeds to step 405 for calculation.

In step 405, the type of the extraction result (character type or numerical type) determines whether the process proceeds to step 406 or step 408: character-type results proceed to step 406 and numerical-type results proceed to step 408. The project name extracted from Table 1 is of character type, so the process proceeds to step 406.

In step 406, the extraction result is matched against the positive and negative word libraries, and its score is weighted according to the configured weight table. For the extraction results above, the positive/negative word scores are [first result score = 100 × weight 1 = 100; second result score = 100 × weight 1 = 100], and the process then proceeds to step 407.

In step 407, the character length of the extraction result is matched against the character length range weight table, and the score of the extraction result is weighted according to the configured weight table. The scores after character length weighting are [first result score = 100 × weight 1 = 100; second result score = 100 × weight 1 = 100], and the process then proceeds to step 409.

In step 409, the extraction mode of the extraction result is matched against the extraction mode weight table, and the score of the extraction result is weighted accordingly. The scores after extraction mode weighting are [first result score = 100 × 0.7 = 70; second result score = 100 × 1 = 100], and the process then proceeds to step 410.

In step 410, the weighting operations above yield the final scores: first result, "playground reconstruction project": 100 × 1 × 1 × 0.7 = 70; second result, "playground reconstruction project of XX Primary School, Mudanjiang City": 100 × 1 × 1 × 1 = 100. The two extraction results are then sorted by score in descending order, and the first, "playground reconstruction project of XX Primary School, Mudanjiang City", is selected as the optimal result.
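The arithmetic of steps 403 to 410 can be reproduced with a short Python sketch. The per-attribute weights are copied from steps 406, 407 and 409 rather than recomputed, because the character length weight was determined on the original text and not on the translated strings used here. The function `screen`, the dictionary layout and the candidate strings are illustrative assumptions.

```python
INITIAL_SCORE = 100.0  # default initial score from step 404

# Weights as stated in steps 406 (word), 407 (length) and 409 (mode).
candidates = [
    {"text": "playground reconstruction project",
     "weights": {"word": 1.0, "length": 1.0, "mode": 0.7}},
    {"text": "playground reconstruction project of XX Primary School, Mudanjiang City",
     "weights": {"word": 1.0, "length": 1.0, "mode": 1.0}},
]


def screen(candidates, initial_score=INITIAL_SCORE):
    """Return (text, score) pairs ranked from highest to lowest score."""
    if len(candidates) <= 1:
        # Step 403: nothing to screen when there are not multiple results.
        return [(c["text"], initial_score) for c in candidates]
    scored = []
    for candidate in candidates:
        score = initial_score
        for weight in candidate["weights"].values():
            score *= weight  # steps 406, 407, 409: multiply in turn
        scored.append((candidate["text"], score))
    # Step 410: descending sort, best result first.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranked = screen(candidates)
for text, score in ranked:
    print(f"{score:6.1f}  {text}")
```

The first entry of `ranked` is the tag-value pair candidate, matching the result selected in step 410.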

While the present invention has been described in detail with reference to the drawings and the embodiments, those skilled in the art will understand that the specific parameters in the above embodiments can be changed without departing from the spirit of the present invention to form a plurality of specific embodiments, all of which fall within the ordinary range of variation of the present invention and are not described in detail here.

Claims

1. A bidding information extraction result screening system, characterized by comprising a pre-parameter setting unit, an initialization score setting unit, a result score calculating unit and an output unit;

the pre-parameter setting unit is used for assigning a weight to each attribute of the extraction result;

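the initialization score setting unit is used for setting an initial score for the extraction result;

the result score calculating unit is used for calculating the score of the extraction result after attribute weighting;

and the output unit is used for sorting the weighted results and outputting the extraction result with the largest weighted result as the final extraction result.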

2. The bidding information extraction result screening system according to claim 1, wherein the attributes of the extraction result include character length, positive/negative words, numerical range, paragraph index, extraction mode, and field tag.

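3. A method for result screening using the bidding information extraction result screening system of claim 1, comprising the steps of:

S1: a computer processor configures an attribute weight table characterizing the information extraction results and assigns a weight to each group of attributes;

S2: the computer imports the extracted bidding information into memory and assigns an initial score to each item of information;

S3: the processor calculates a final score for each extraction result according to its attribute weights;

S4: the extraction result with the highest final score in step S3 is selected as the output extraction result.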

4. The bidding information extraction result screening method according to claim 3, wherein in step S3 the initial score is multiplied in sequence by the weight of each attribute of the extraction result to obtain the final score.

5. The bidding information extraction result screening method according to claim 3, wherein the attribute weight table covers the index of the paragraph where the result is located, the extraction mode, the field tag, positive words, negative words, character length range matching, and numerical range.

6. The bidding information extraction result screening method according to claim 5, wherein the paragraph index, the extraction mode and the field tag are additional data generated in the data extraction process.

7. The bidding information extraction result screening method according to claim 5, wherein the extraction modes include tag-value pair sequence with a weight of 1, table identification with a weight of 0.9, and regular expression with a weight of 0.7.

8. The bidding information extraction result screening method according to claim 5, wherein the positive word weight is 1 and the negative word weight is 0.7.

9. The bidding information extraction result screening method according to claim 3, wherein in step S4 the weighted results are sorted in descending order, and the top-ranked one is the extraction result with the highest score.