CN116303909B

CN116303909B - Matching method, equipment and medium for electronic bidding documents and clauses

Info

Publication number: CN116303909B
Application number: CN202310456405.5A
Authority: CN
Inventors: 李志杰; 王金亮; 徐明礼; 孙宁振; 魏晓军; 姬建华; 顾华伟; 唐莉; 周志刚; 张津铭
Original assignee: Shandong Qilu Electronic Tendering And Procurement Service Co ltd
Current assignee: Shandong Qilu Electronic Tendering And Procurement Service Co ltd
Priority date: 2023-04-26
Filing date: 2023-04-26
Publication date: 2023-08-22
Anticipated expiration: 2043-04-26
Also published as: CN116303909A

Abstract

The application discloses a method, equipment and medium for matching electronic bidding documents and clauses, and relates to the field of data identification. The method comprises the following steps: acquiring a target bid document and target tender terms, and performing text extraction on both to obtain initial text data; performing Jieba word segmentation on the initial text data to obtain word segmentation data; removing stop words and useless words from the word segmentation data to obtain word segmentation results, and storing the word segmentation results in an engineering construction word stock; vectorizing each paragraph of the bid text and the tender text to obtain paragraph vectors; and inputting the paragraph vectors into an optimized text matching model, and determining the matching relation between bid text paragraphs and tender text paragraphs according to the output result. Establishing a professional word stock for engineering construction avoids the paragraph-matching errors in later steps that are caused by incorrect segmentation of domain-specific words.

Description

Matching method, equipment and medium for electronic bidding documents and clauses

Technical Field

The application relates to the field of data identification, in particular to a method, equipment and medium for matching electronic bidding documents and clauses.

Background

In recent years, with the popularization of electronic bidding, electronic and structured bidding documents have become the norm, which gives big data analysis and natural language processing a better foundation for application.

Electronic bidding documents for engineering projects are long, cover many topics, are highly technical, and come from many bidding units, so the traditional paper-based document comparison cannot meet the need for accurate and rapid bid evaluation in electronic bidding.

Disclosure of Invention

In order to solve the above problems, the present application provides a method, apparatus and medium for matching electronic bidding documents and terms, including:

acquiring a target bid document and target tender terms, and performing text extraction on the target bid document and the target tender terms to obtain initial text data; performing Jieba word segmentation on the initial text data to obtain word segmentation data; according to a pre-established stop-word list, removing stop words and useless words from the word segmentation data to obtain word segmentation results, and storing the word segmentation results in an engineering construction word stock in the format of information table data; acquiring, from the engineering construction word stock, a bid text corresponding to the target bid document and a tender text corresponding to the target tender terms; vectorizing each paragraph of the bid text and the tender text to obtain paragraph vectors; and inputting the paragraph vectors into an optimized text matching model, and determining the matching relation between each paragraph in the bid text and each paragraph in the tender text according to the output result.

In one example, acquiring the target bid document and the target tender terms and performing text extraction on them specifically includes: performing image correction on the target bid document image and the target tender terms image through a geometric correction convolutional neural network to obtain a first intermediate image; performing image enhancement on the first intermediate image through a first convolutional neural network to obtain a second intermediate image; splitting the three channels of the second intermediate image to obtain a blue channel grayscale map, a green channel grayscale map and a red channel grayscale map respectively; binarizing the channel grayscale maps according to a preset threshold, and merging them back into a three-channel image to obtain a third intermediate image; performing layout analysis on the third intermediate image through a second convolutional neural network to extract the table areas and picture areas in the third intermediate image so as to obtain a fourth intermediate image; and performing text extraction on the fourth intermediate image to obtain the initial text data.

In one example, before the Jieba word segmentation processing is performed on the initial text data, the method further includes: acquiring a thesaurus sample set, and establishing an initial thesaurus according to the sample set; professional words in the word segmentation result are screened to reduce words with wrong word segmentation in the word segmentation result; and according to the importance degree of the professional word, giving the initial word frequency of the professional word, and adding the professional word into the engineering construction word stock.

In one example, the performing Jieba word segmentation on the initial text data to obtain word segmentation data specifically includes: acquiring a plurality of threads, and simultaneously using the threads to perform Jieba word segmentation on a plurality of paragraphs in the initial text data; importing the engineering construction word stock, and performing Jieba word segmentation on a plurality of paragraphs in the initial text according to the engineering construction word stock.

In one example, after acquiring, from the engineering construction word stock, the bid text corresponding to the target bid document and the tender text corresponding to the target tender terms, the method further includes: determining, for the word segmentation result of each paragraph in the information table data, the corresponding term frequency and inverse document frequency (TF-IDF) weight matrix; determining the edit distance between the parent title of each paragraph and each segmented word of that paragraph; and weighting the weight matrix by the edit distance, normalizing the weighted matrix, and, according to the normalization result, determining a preset number of segmented words in each paragraph as the subject words of that paragraph.

In one example, determining, for the word segmentation result of each paragraph in the information table data, the corresponding term frequency and inverse document frequency (TF-IDF) weight matrix specifically includes: determining the document term frequency of the word segmentation result set of each paragraph by the following formula: $tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$; wherein $tf_{i,j}$ represents the term frequency of feature word $t_i$ in document $d_j$, $n_{i,j}$ represents the number of occurrences of feature word $t_i$ in document $d_j$, and $\sum_k n_{k,j}$ represents the total number of occurrences of all words in document $d_j$; determining the inverse document frequency of the word segmentation result set of each paragraph by the following formula, given here in its standard form consistent with the definitions: $idf_i = \log\frac{|D|}{|\{j : t_i \in d_j\}|}$; wherein $idf_i$ represents the inverse document frequency of feature word $t_i$, $|D|$ represents the total number of texts in the document set, and $|\{j : t_i \in d_j\}|$ represents the number of documents in the document set that contain feature word $t_i$; determining the normalized TF-IDF weight matrix of the feature words as $W = [w_{i,j}]$, with $w_{i,j} \propto tf_{i,j} \cdot idf_i$ after normalization; wherein $W$ is the term frequency and inverse document frequency weight matrix; determining the edit distance between the parent title and each segmented word by the following formula:

$$\mathrm{lev}_{a,b}(i,j) = \begin{cases} \max(i,j), & \min(i,j) = 0 \\ \min\begin{cases} \mathrm{lev}_{a,b}(i-1,j) + 1 \\ \mathrm{lev}_{a,b}(i,j-1) + 1 \\ \mathrm{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)} \end{cases}, & \text{otherwise} \end{cases}$$

wherein $\mathrm{lev}_{a,b}(i,j)$ represents the edit distance between the first $i$ characters of $a$ and the first $j$ characters of $b$; $a$ represents the parent title string, $b$ represents a word of the paragraph text, and $1_{(a_i \neq b_j)}$ is an indicator function; and weighting the weight matrix by the edit distance according to a weighting formula (its expression is not preserved in this text); wherein $W'$ represents the weighted weight matrix, $W$ is the term frequency and inverse document frequency weight matrix, $|a|$ is the parent title length, and $|b|$ is the word length.

In one example, the optimized text matching model includes an original text matching model and a ranking output layer placed after the original text matching model; the ranking output layer is used for extracting the softmax values between any bid text paragraph and all tender text paragraphs, the softmax value being an output value of the text matching model; and for sorting the tender text paragraphs whose softmax value is larger than a preset threshold, and taking the tender text paragraph with the maximum softmax value as the matching term of that bid text paragraph.

In one example, determining, according to the output result, the matching relation between each paragraph in the bid text and each paragraph in the tender text specifically includes: converting the matching relation between each paragraph in the bid text and each paragraph in the tender text into a binary classification problem; if the output result for a bid text paragraph and a tender text paragraph is 0, the bid text paragraph does not match the tender text paragraph; and if the output result is 1, the bid text paragraph matches the tender text paragraph.

The application also provides a device for matching electronic bidding documents and clauses, which comprises: at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to perform: acquiring a target bid document and target tender terms, and performing text extraction on the target bid document and the target tender terms to obtain initial text data; performing Jieba word segmentation on the initial text data to obtain word segmentation data; according to a pre-established stop-word list, removing stop words and useless words from the word segmentation data to obtain word segmentation results, and storing the word segmentation results in an engineering construction word stock in the format of information table data; acquiring, from the engineering construction word stock, a bid text corresponding to the target bid document and a tender text corresponding to the target tender terms; vectorizing each paragraph of the bid text and the tender text to obtain paragraph vectors; and inputting the paragraph vectors into an optimized text matching model, and determining the matching relation between each paragraph in the bid text and each paragraph in the tender text according to the output result.

The present application also provides a non-volatile computer storage medium storing computer-executable instructions, characterized in that the computer-executable instructions are configured to: acquire a target bid document and target tender terms, and perform text extraction on the target bid document and the target tender terms to obtain initial text data; perform Jieba word segmentation on the initial text data to obtain word segmentation data; according to a pre-established stop-word list, remove stop words and useless words from the word segmentation data to obtain word segmentation results, and store the word segmentation results in an engineering construction word stock in the format of information table data; acquire, from the engineering construction word stock, a bid text corresponding to the target bid document and a tender text corresponding to the target tender terms; vectorize each paragraph of the bid text and the tender text to obtain paragraph vectors; and input the paragraph vectors into an optimized text matching model, and determine the matching relation between each paragraph in the bid text and each paragraph in the tender text according to the output result.

The method provided by the application has the following beneficial effects: establishing a professional word stock for engineering construction avoids the paragraph-matching errors in later steps that are caused by incorrect segmentation of domain-specific words. Features in the data are fully extracted through a deep network model, giving higher accuracy than the prior art. The method is also highly adaptable and can adjust itself to different data sets, so that it is suitable for matching in different application scenarios.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:

FIG. 1 is a flow chart of a method for matching electronic bidding documents and terms according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a device for matching electronic bidding documents and terms.

Detailed Description

In order to make the objects, technical solutions and advantages of the present application more apparent, the technical solutions of the present application will be clearly and completely described below with reference to specific embodiments of the present application and corresponding drawings. It will be apparent that the described embodiments are only some, but not all, embodiments of the application. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

The following describes in detail the technical solutions provided by the embodiments of the present application with reference to the accompanying drawings.

FIG. 1 is a flow chart of a method for matching electronic bidding documents and terms according to one or more embodiments of the present application. The method can be applied to paragraph matching between electronic bid documents and tender documents in the engineering construction field. The process can be executed by computing equipment in the corresponding field, and certain input parameters or intermediate results in the process may be manually adjusted to help improve accuracy.

The method according to the embodiment of the present application may be executed by a terminal device or a server, which is not particularly limited in the present application. For ease of understanding and description, the following embodiments are described in detail with a server as the executing entity.

It should be noted that the server may be a single device, or may be a system formed by a plurality of devices, that is, a distributed server, which is not particularly limited in the present application.

As shown in fig. 1, an embodiment of the present application provides a method, including:

S101: Acquire a target bid document and target tender terms, and perform text extraction on the target bid document and the target tender terms to obtain initial text data.

First, text extraction is performed on the target bid document and the target tender terms, thereby obtaining initial text data. The target bid document and the target tender terms may be PDF documents or image documents; after OCR recognition, a corresponding JSON document may be obtained in addition to the initial text data.

In one embodiment, in order to increase the accuracy of OCR text extraction, the image is first preprocessed. The main flow includes image correction, image enhancement and layout analysis.

For image correction, document distortion (i.e., folded, bent, or wrinkled paper) not only reduces document readability but also greatly degrades the performance of the later OCR text extraction. Image correction is therefore a critical step. The target bid document image and the target tender terms image are corrected through a geometric correction convolutional neural network to obtain a first intermediate image. Specifically, in the architecture of the geometric correction convolutional neural network, the vertical dimension represents the spatial resolution of the feature maps and the horizontal dimension represents the output channels, and image patches are integrated to obtain a more accurate correction result.

Image enhancement addresses the low character-recognition accuracy caused by shadows and blur in the image. Image enhancement is performed on the first intermediate image through a first convolutional neural network to obtain a second intermediate image. The three channels of the second intermediate image are then split to obtain a blue channel grayscale map, a green channel grayscale map and a red channel grayscale map respectively; the channel grayscale maps are binarized according to a preset threshold and merged back into a three-channel image to obtain a third intermediate image. The first convolutional neural network has three convolutional layers at the beginning and at the end, and five residual blocks in the middle. Each residual block contains two convolutional layers and a shortcut connection, which helps to achieve lower loss. All convolutional layers use a 3 x 3 kernel with stride 1. A batch normalization layer and a ReLU function are added after each convolutional layer. A skip connection links the input of the first residual block to the output of the last residual block.
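The channel splitting and binarization step can be illustrated with a minimal pure-Python sketch. It assumes the image is a nested list of (blue, green, red) pixel tuples and uses a hypothetical threshold of 128; the patent does not state the threshold value or the image library used, and the function names here are illustrative only.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]  # (blue, green, red), each in 0..255


def split_channels(image: List[List[Pixel]]):
    """Split a three-channel image into blue, green and red grayscale maps."""
    blue = [[px[0] for px in row] for row in image]
    green = [[px[1] for px in row] for row in image]
    red = [[px[2] for px in row] for row in image]
    return blue, green, red


def binarize(channel: List[List[int]], threshold: int = 128) -> List[List[int]]:
    """Map every value above the threshold to 255 and everything else to 0."""
    return [[255 if value > threshold else 0 for value in row] for row in channel]


def merge_channels(blue, green, red) -> List[List[Pixel]]:
    """Recombine three single-channel maps into one three-channel image."""
    return [list(zip(b, g, r)) for b, g, r in zip(blue, green, red)]


def binarize_image(image: List[List[Pixel]], threshold: int = 128) -> List[List[Pixel]]:
    """Split, binarize each channel with the same threshold, then merge."""
    blue, green, red = split_channels(image)
    return merge_channels(
        binarize(blue, threshold),
        binarize(green, threshold),
        binarize(red, threshold),
    )


# Usage: a 1 x 2 image
sample = [[(10, 200, 130), (128, 129, 0)]]
print(binarize_image(sample))
```

Binarizing each channel separately, rather than a single grayscale conversion, keeps colour-specific marks distinguishable in the merged image, which is presumably useful for the later removal of red seals.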

For layout analysis, interfering red seals are first removed, and then the pictures, tables and text in the image are identified. Layout analysis is performed on the third intermediate image through the second convolutional neural network to extract the table areas and picture areas in the third intermediate image so as to obtain a fourth intermediate image. The identified information is then classified, for example into name information, certificate number information and address information, and the file information is extracted and processed. Text extraction is performed on the fourth intermediate image to obtain the initial text data.

In order to improve the accuracy of OCR character recognition, a convolutional neural network is used to extract the pictures, tables and flowcharts from the image, so that only text remains in the image and interfering items are eliminated. Image enhancement makes the characters in the image clearer and brighter, which improves the accuracy of OCR text extraction and lays a foundation for subsequent text classification and recognition. The method can output a clear text data record as the recognition result of a certificate file, improves the accuracy of content recognition for certificate files, and can extract the corresponding key information for the user.

S102: Perform Jieba word segmentation on the initial text data to obtain word segmentation data.

Jieba word segmentation is performed on the initial text data to obtain the word segmentation data corresponding to the initial text data.

In one embodiment, since the prior art has no professional word stock specific to the engineering construction field, word segmentation is often inaccurate, which easily causes problems in the later matching process. Therefore, before word segmentation, a professional word stock for the engineering construction field needs to be established. A word stock sample set is first acquired, and an initial word stock is established from the sample set. The professional words in the word segmentation result are then manually screened to reduce the number of wrongly segmented words. Finally, each professional word is given an initial word frequency according to its importance and added to the engineering construction word stock.
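As an illustration of assigning initial word frequencies by importance, the sketch below builds a small lexicon and renders it as lines of the form `word frequency`, which is the plain-text layout Jieba accepts for user dictionaries. The importance levels and the frequency values attached to them are hypothetical; the patent does not specify how importance maps to frequency.

```python
# Hypothetical mapping from importance level to initial word frequency.
IMPORTANCE_TO_FREQ = {"high": 1000, "medium": 100, "low": 10}


def build_lexicon(screened_words):
    """Build {word: initial frequency} from (word, importance) pairs.

    If a word appears more than once, the highest frequency is kept.
    """
    lexicon = {}
    for word, importance in screened_words:
        freq = IMPORTANCE_TO_FREQ[importance]
        lexicon[word] = max(freq, lexicon.get(word, 0))
    return lexicon


def to_user_dict_lines(lexicon):
    """Render the lexicon as 'word frequency' lines, one entry per line."""
    return [f"{word} {freq}" for word, freq in sorted(lexicon.items())]


# Usage: manually screened professional words with an importance label
screened = [
    ("预应力混凝土", "high"),
    ("监理工程师", "medium"),
    ("预应力混凝土", "low"),
]
lexicon = build_lexicon(screened)
for line in to_user_dict_lines(lexicon):
    print(line)
```

A higher initial frequency makes the segmenter more likely to keep a professional word as one token instead of splitting it into shorter common words.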

Further, because Jieba's built-in word stock is not complete enough, words from many professional fields are easily segmented incorrectly, which causes later matching errors. Therefore, the professional word stock established in advance is imported when Jieba word segmentation is used, which improves segmentation accuracy. In addition, the word segmentation is processed in parallel: a plurality of threads perform Jieba word segmentation on a plurality of paragraphs of the initial text data at the same time, which improves the processing speed.
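The combination of a domain lexicon with per-paragraph parallel segmentation can be sketched as follows. To keep the example self-contained, a simple forward-maximum-matching segmenter stands in for Jieba; it is not Jieba's algorithm. In a real pipeline the stand-in would presumably be replaced by Jieba's `load_userdict` and `lcut` calls. The function names and the `max_len` and `workers` parameters are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def segment(paragraph, lexicon, max_len=8):
    """Forward maximum matching: take the longest lexicon word at each position.

    Characters that start no lexicon word are emitted as single-character tokens.
    """
    tokens, i = [], 0
    while i < len(paragraph):
        for size in range(min(max_len, len(paragraph) - i), 0, -1):
            piece = paragraph[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def segment_paragraphs(paragraphs, lexicon, workers=4):
    """Segment several paragraphs concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda p: segment(p, lexicon), paragraphs))


# Usage with a tiny engineering-construction lexicon
lexicon = {"预应力", "混凝土", "施工"}
print(segment_paragraphs(["预应力混凝土施工", "采用混凝土"], lexicon))
```

Note that in CPython, threads give limited speed-up for CPU-bound pure-Python work; the sketch only shows the paragraph-level task split described in the text.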

S103: According to a pre-established stop-word list, remove stop words and useless words from the word segmentation data to obtain word segmentation results, and store the word segmentation results in an engineering construction word stock in the format of information table data.

Stop words and useless words are removed from the obtained word segmentation data to obtain the word segmentation result corresponding to the initial text data. During removal, the word segmentation data is compared against the pre-established stop-word list, and the stop words and useless words contained in the word segmentation data are taken out. The word segmentation result is then stored in the pre-established engineering construction word stock in the format of information table data.
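A minimal sketch of the stop-word filtering and of one possible information-table layout is given below. The record fields (`doc_id`, `paragraph_id`, `tokens`) are hypothetical; the patent does not describe the columns of the information table.

```python
def remove_stop_words(tokens, stop_words):
    """Drop stop words and empty or whitespace-only tokens."""
    return [t for t in tokens if t.strip() and t not in stop_words]


def to_info_table(doc_id, paragraphs_tokens, stop_words):
    """Build one table row (a dict) per paragraph of a document."""
    return [
        {
            "doc_id": doc_id,
            "paragraph_id": idx,
            "tokens": remove_stop_words(tokens, stop_words),
        }
        for idx, tokens in enumerate(paragraphs_tokens)
    ]


# Usage
stop_words = {"的", "和", "，"}
rows = to_info_table(
    "bid-001",
    [["工期", "的", "要求"], ["质量", "和", "安全", " "]],
    stop_words,
)
for row in rows:
    print(row)
```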

S104: Acquire, from the engineering construction word stock, a bid text corresponding to the target bid document and a tender text corresponding to the target tender terms.

When paragraph text matching is carried out, the bid text corresponding to the target bid document and the tender text corresponding to the target tender terms are extracted from the engineering construction word stock for the subsequent operations.

In one embodiment, after the bid text and the tender text are acquired, the subject words of each paragraph can be determined so that staff can quickly learn the subject of each paragraph. First, the term frequency and inverse document frequency (TF-IDF) weight matrix corresponding to the word segmentation result of each paragraph in the information table data is determined. Then the edit distance between the parent title of each paragraph and each segmented word of that paragraph is determined. The weight matrix is then weighted by the edit distance and the weighted matrix is normalized. Finally, according to the normalization result, a preset number of segmented words in each paragraph are determined as the subject words of that paragraph.

Specifically, the document term frequency of the word segmentation result set of each paragraph is determined by the following formula:

$$tf_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}}$$

wherein $tf_{i,j}$ represents the term frequency of feature word $t_i$ in document $d_j$, $n_{i,j}$ represents the number of occurrences of feature word $t_i$ in document $d_j$, and $\sum_k n_{k,j}$ represents the total number of occurrences of all words in document $d_j$.
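The term frequency defined here is the count of a word divided by the total number of word occurrences in the same paragraph. A minimal sketch, treating each paragraph's token list as one document (the function name is illustrative):

```python
from collections import Counter


def term_frequency(tokens):
    """Return {term: occurrences of term / total occurrences of all terms}."""
    counts = Counter(tokens)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {term: n / total for term, n in counts.items()}


# Usage: "a" occurs 2 times out of 4 tokens
print(term_frequency(["a", "b", "a", "c"]))
```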

The inverse document frequency of the word segmentation result set of each paragraph is determined by the following formula, given here in its standard form consistent with the definitions:

$$idf_i = \log\frac{|D|}{|\{j : t_i \in d_j\}|}$$

wherein $idf_i$ represents the inverse document frequency of feature word $t_i$, $|D|$ represents the total number of texts in the document set, and $|\{j : t_i \in d_j\}|$ represents the number of documents in the document set that contain feature word $t_i$.
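The inverse document frequency can be sketched as below. The unsmoothed form, the logarithm of the number of documents divided by the number of documents containing the word, is an assumption; implementations often add a smoothing constant, and the patent text does not make the exact variant recoverable.

```python
import math


def inverse_document_frequency(documents):
    """Return {term: log(N / df)} where df counts documents containing the term."""
    n_docs = len(documents)
    doc_freq = {}
    for tokens in documents:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return {term: math.log(n_docs / df) for term, df in doc_freq.items()}


# Usage: "a" appears in every document, so its idf is 0
docs = [["a", "b"], ["a", "c"], ["a"]]
print(inverse_document_frequency(docs))
```

A word that appears in every paragraph gets weight zero, which is why common filler words contribute little even before stop-word removal.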

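Then, the normalized term frequency and inverse document frequency weight matrix of the feature words is determined, each entry being the normalized product of term frequency and inverse document frequency:

$$W = [w_{i,j}], \quad w_{i,j} \propto tf_{i,j} \cdot idf_i$$

wherein $W$ is the term frequency and inverse document frequency weight matrix, $tf_{i,j}$ is the term frequency of feature word $t_i$ in document $d_j$, and $idf_i$ is the inverse document frequency of feature word $t_i$.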
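A sketch of a normalized TF-IDF weight computation follows. The normalization used here, dividing each paragraph's weights by their Euclidean (L2) norm, is an assumption chosen because it is a common convention; the patent text does not make the normalization recoverable.

```python
import math
from collections import Counter


def tfidf_weights(documents):
    """Return one {term: weight} dict per document, L2-normalized per document."""
    n_docs = len(documents)
    # Number of documents that contain each term.
    doc_freq = Counter(term for tokens in documents for term in set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = sum(counts.values())
        raw = {
            term: (n / total) * math.log(n_docs / doc_freq[term])
            for term, n in counts.items()
        }
        norm = math.sqrt(sum(w * w for w in raw.values()))
        weights.append({t: (w / norm if norm else 0.0) for t, w in raw.items()})
    return weights


# Usage
docs = [["a", "b"], ["a", "c", "c"]]
for row in tfidf_weights(docs):
    print(row)
```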

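Then the edit distance between the parent title and each segmented word is determined by the following formula:

$$\mathrm{lev}_{a,b}(i,j) = \begin{cases} \max(i,j), & \min(i,j) = 0 \\ \min\begin{cases} \mathrm{lev}_{a,b}(i-1,j) + 1 \\ \mathrm{lev}_{a,b}(i,j-1) + 1 \\ \mathrm{lev}_{a,b}(i-1,j-1) + 1_{(a_i \neq b_j)} \end{cases}, & \text{otherwise} \end{cases}$$

wherein $\mathrm{lev}_{a,b}(i,j)$ represents the edit distance between the first $i$ characters of $a$ and the first $j$ characters of $b$; $a$ represents the parent title string, $b$ represents a word of the paragraph text, and $1_{(a_i \neq b_j)}$ is an indicator function that equals 1 when $a_i \neq b_j$ and 0 otherwise.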
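The edit distance between a parent title and a segmented word can be computed with the usual dynamic-programming form of the Levenshtein recurrence. The sketch below keeps only two rows of the table; the function name is illustrative.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between strings a and b (insert, delete, substitute)."""
    # prev[j] holds the distance between the first i-1 chars of a and first j chars of b.
    prev = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        curr = [i]  # distance between the first i chars of a and the empty string
        for j, char_b in enumerate(b, start=1):
            curr.append(
                min(
                    prev[j] + 1,                       # deletion
                    curr[j - 1] + 1,                   # insertion
                    prev[j - 1] + (char_a != char_b),  # substitution (indicator term)
                )
            )
        prev = curr
    return prev[-1]


# Usage: distance between a parent title and a word taken from the paragraph
print(edit_distance("施工组织设计", "施工"))
```

A word that is close to the parent title in edit distance is more likely to reflect the paragraph's subject, which is presumably why the distance is used to adjust the TF-IDF weights.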

Finally, the weight matrix is weighted by the edit distance (the weighting formula itself is not preserved in this text), wherein $W'$ represents the weighted weight matrix, $W$ is the term frequency and inverse document frequency weight matrix, $|a|$ is the length of the parent title string $a$, and $|b|$ is the length of the word $b$.

A subject word expresses the central subject of a paragraph and occurs more frequently than other words. Therefore, the weight of each word is calculated from the term frequency and the other factors above, and the words are ranked by weight: the higher the weight, the more relevant the word is to the subject. All keywords of each paragraph are identified, and the three top-ranked subject words (TOP3) of each paragraph are finally extracted, so that an expert can quickly learn the subject of the paragraph, which improves the efficiency of expert bid evaluation.
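Selecting the TOP3 subject words from the final weights amounts to a sort by descending weight. The sketch below assumes the weights of one paragraph are already available as a dictionary; breaking ties alphabetically is an added assumption that makes the output deterministic.

```python
def top_subject_words(weights, k=3):
    """Return the k words with the highest weight, ties broken alphabetically."""
    ranked = sorted(weights.items(), key=lambda item: (-item[1], item[0]))
    return [word for word, _ in ranked[:k]]


# Usage with hypothetical final weights for one paragraph
paragraph_weights = {"schedule": 0.9, "quality": 0.7, "safety": 0.7, "the": 0.1}
print(top_subject_words(paragraph_weights))
```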

S105: Vectorize each paragraph of the bid text and the tender text to obtain paragraph vectors.

Each paragraph in the bid text and the tender text is first given a vectorized representation, so as to obtain the paragraph vectors corresponding to the bid text and the tender text respectively.

In one embodiment, the vectorized representation of the data is obtained using an embedding model, and different vectorization methods are selected according to the matching model. For example, BiMPM uses a skip-gram model to derive the vectorized representation of the data, while the BERT text matching model takes three kinds of vectorized data as input: Token Embedding, Position Embedding and Segment Embedding. For the current bidding task, the matching model must then be trained first, so as to obtain a model that can better judge whether two texts match.
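For the BERT-style input, the three embeddings are looked up from three parallel id sequences. The sketch below builds those sequences for one bid paragraph and one tender paragraph using the usual `[CLS] A [SEP] B [SEP]` layout. The toy vocabulary and function name are hypothetical, and a real system would rely on the tokenizer shipped with the chosen model.

```python
def build_bert_inputs(tokens_a, tokens_b, vocab):
    """Return (token_ids, segment_ids, position_ids) for a text pair."""
    tokens = ["[CLS]"] + tokens_a + ["[SEP]"] + tokens_b + ["[SEP]"]
    # Token ids: unknown tokens fall back to the [UNK] id.
    token_ids = [vocab.get(token, vocab["[UNK]"]) for token in tokens]
    # Segment ids: 0 for [CLS] + A + [SEP], 1 for B + [SEP].
    segment_ids = [0] * (len(tokens_a) + 2) + [1] * (len(tokens_b) + 1)
    # Position ids: the index of each token in the sequence.
    position_ids = list(range(len(tokens)))
    return token_ids, segment_ids, position_ids


# Usage with a toy vocabulary
vocab = {"[CLS]": 0, "[SEP]": 1, "[UNK]": 2, "工期": 3, "质量": 4}
print(build_bert_inputs(["工期"], ["质量", "安全"], vocab))
```

The model sums the token, segment and position embeddings looked up from these ids, so the segment ids are what let it tell the bid paragraph from the tender paragraph.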

S106: Input the paragraph vectors into an optimized text matching model, and determine the matching relation between each paragraph in the bid text and each paragraph in the tender text according to the output result.

After the paragraph vectors are input into the optimized text matching model, the matching relation between each paragraph in the bid text and each paragraph in the tender text can be determined from the output of the text matching model. The extracted data is run through the model to obtain the final classification result, which is converted into the corresponding clause content and written into a database.
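Writing the classification result back as clause content can be sketched with an in-memory SQLite database. The table name, column names and identifiers below are hypothetical; the patent does not describe the database schema.

```python
import sqlite3


def save_matches(conn, matches, terms):
    """Store one row per matched bid paragraph: (paragraph id, term id, term content).

    matches: {bid_paragraph_id: term_id}; terms: {term_id: clause content}.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS paragraph_match ("
        "bid_paragraph_id TEXT PRIMARY KEY, term_id TEXT, term_content TEXT)"
    )
    rows = [(bid_id, term_id, terms[term_id]) for bid_id, term_id in matches.items()]
    conn.executemany("INSERT OR REPLACE INTO paragraph_match VALUES (?, ?, ?)", rows)
    conn.commit()


# Usage
conn = sqlite3.connect(":memory:")
terms = {"term-3": "The construction period shall not exceed 180 days."}
save_matches(conn, {"bid-1": "term-3"}, terms)
print(conn.execute("SELECT * FROM paragraph_match").fetchall())
```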

In one embodiment, because the number and categories of terms and the contents of the bid documents are not fixed in the original data, direct classification into categories is inconvenient. The task is therefore converted into a binary classification problem of whether two text paragraphs match: when the output result is 0, the two text paragraphs do not match, and when the output result is 1, the two text paragraphs match. Each bid content is paired with each term and labeled, in the form [bid content, term, label (0/1)].
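The conversion to binary classification can be sketched as building one labeled sample for every (bid content, term) pair. The `gold` set of known matching id pairs is a hypothetical input representing annotated training data.

```python
def build_pairs(bid_paragraphs, terms, gold):
    """Return [bid content, term, label] samples for every bid/term combination.

    bid_paragraphs: {bid_id: text}; terms: {term_id: text};
    gold: set of (bid_id, term_id) pairs known to match.
    """
    return [
        [bid_text, term_text, 1 if (bid_id, term_id) in gold else 0]
        for bid_id, bid_text in bid_paragraphs.items()
        for term_id, term_text in terms.items()
    ]


# Usage
bids = {"b1": "The project will be completed within 150 days."}
terms = {"t1": "Construction period requirements", "t2": "Quality requirements"}
for sample in build_pairs(bids, terms, gold={("b1", "t1")}):
    print(sample)
```

Framing the task this way lets one model handle any number of terms, at the cost of scoring every bid paragraph against every term.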

In one embodiment, one bid content may be matched against multiple terms, whereas each bid content should correspond to a single term (one term may correspond to multiple bid contents). In the prior art, the matching result between a bid document and a term is obtained directly as 1 or 0, so one bid content can be matched with several terms at the same time. Therefore, the softmax values between the same bid content id and all terms are extracted, the terms whose values are larger than a given threshold are sorted, and the term with the maximum softmax value is selected: the matching label between the bid content and that term is set to 1, and all other matching labels are set to 0. If the softmax values between the bid content and all terms are smaller than the threshold, the matching labels between the bid content and all terms are 0. In this way, one bid content corresponds to no more than one term.
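The ranking output layer described above can be sketched as a threshold followed by an arg-max over the per-term match probabilities. The input layout and the default threshold of 0.5 are assumptions made for illustration; the patent only refers to a preset threshold.

```python
def select_matches(scores, threshold=0.5):
    """Keep at most one matching term per bid content.

    scores: {bid_id: {term_id: softmax probability of the 'match' class}}.
    Returns {bid_id: {term_id: 0 or 1}}.
    """
    labels = {}
    for bid_id, term_scores in scores.items():
        # Only terms whose score exceeds the threshold are candidates.
        candidates = {t: s for t, s in term_scores.items() if s > threshold}
        best = max(candidates, key=candidates.get) if candidates else None
        labels[bid_id] = {t: int(t == best) for t in term_scores}
    return labels


# Usage: b1 exceeds the threshold for two terms, b2 for none
scores = {
    "b1": {"t1": 0.9, "t2": 0.7, "t3": 0.2},
    "b2": {"t1": 0.3, "t2": 0.1},
}
print(select_matches(scores))
```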

As shown in FIG. 2, the embodiment of the application further provides a device for matching the electronic bidding document with terms, comprising:

at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to enable the at least one processor to:

acquire a target bid document and target tender terms, and perform text extraction on the target bid document and the target tender terms to obtain initial text data; perform Jieba word segmentation on the initial text data to obtain word segmentation data; according to a pre-established stop-word list, remove stop words and useless words from the word segmentation data to obtain word segmentation results, and store the word segmentation results in an engineering construction word stock in the format of information table data; acquire, from the engineering construction word stock, a bid text corresponding to the target bid document and a tender text corresponding to the target tender terms; vectorize each paragraph of the bid text and the tender text to obtain paragraph vectors; and input the paragraph vectors into an optimized text matching model, and determine the matching relation between each paragraph in the bid text and each paragraph in the tender text according to the output result.

The embodiment of the application also provides a nonvolatile computer storage medium, which stores computer executable instructions, wherein the computer executable instructions are configured to:

The embodiments of the present application are described in a progressive manner; for parts that are the same or similar across embodiments, reference may be made between them, and each embodiment focuses on its differences from the others. In particular, the device and medium embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.

The devices and media provided in the embodiments of the present application are in one-to-one correspondence with the methods, so that the devices and media also have similar beneficial technical effects as the corresponding methods, and since the beneficial technical effects of the methods have been described in detail above, the beneficial technical effects of the devices and media are not repeated here.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

In one typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as random access memory (RAM), and/or nonvolatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of computer-readable media.

Computer-readable media include both permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile discs (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. Computer-readable media, as defined herein, do not include transitory computer-readable media (transmission media), such as modulated data signals and carrier waves.

It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a …" does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.

The foregoing is merely exemplary of the present application and is not intended to limit the present application. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, improvement, etc. which come within the spirit and principles of the application are to be included in the scope of the claims of the present application.

Claims

1. A method for matching electronic bidding documents with terms, comprising:

acquiring a target bidding document and target bidding terms, and performing text extraction on the target bidding document and the target bidding terms to obtain initial text data;

performing Jieba word segmentation processing on the initial text data to obtain word segmentation data;

according to a pre-established stop word list, removing stop words and useless words from the word segmentation data to obtain word segmentation results, and storing the word segmentation results in an engineering construction word stock in the format of information list data;

acquiring, from the engineering construction word stock, a bid text corresponding to the target bidding document and a term text corresponding to the target bidding terms;

vectorizing each paragraph of the bid text and of the term text to obtain paragraph vectors;

inputting the paragraph vectors into an optimized text matching model, and determining the matching relation between each paragraph in the bid text and each paragraph in the term text according to an output result;

wherein acquiring the target bidding document and the target bidding terms, and performing text extraction on the target bidding document and the target bidding terms, specifically comprises the following steps:

performing image correction on the target bidding document image and the target bidding terms image through a geometric correction convolutional neural network to obtain a first intermediate image; the vertical dimension of the geometric correction convolutional neural network is the spatial resolution of the feature mapping, and the horizontal dimension is the output channel;

performing image enhancement on the first intermediate image through a first convolutional neural network to obtain a second intermediate image; the first convolutional neural network is provided with three convolutional layers at each of a starting position and a finishing position, and five residual blocks are arranged between the starting position and the finishing position; each residual block contains two convolutional layers using a 3 × 3 kernel with a stride of 1; a batch normalization layer and a ReLU function are added after each convolutional layer; a skip connection connects the input of the first residual block to the output of the last residual block;

splitting the three channels of the second intermediate image to obtain a blue channel grayscale map, a green channel grayscale map and a red channel grayscale map respectively;

binarizing the channel grayscale maps according to a preset threshold value, and converting them into a three-channel image to obtain a third intermediate image;

performing layout analysis on the third intermediate image through a second convolutional neural network to extract a table area and a picture area in the third intermediate image so as to obtain a fourth intermediate image;

performing text extraction on the fourth intermediate image to obtain the initial text data;

before the initial text data is subjected to the Jieba word segmentation, the method further comprises the following steps:

acquiring a thesaurus sample set, and establishing an initial thesaurus according to the sample set;

screening professional words in the word segmentation result to reduce incorrectly segmented words in the word segmentation result;

assigning an initial word frequency to each professional word according to the importance degree of the professional word, and adding the professional word into the engineering construction word stock;

the optimized text matching model comprises an original text matching model and a sorting output layer arranged after the original text matching model; the sorting output layer is used for extracting the softmax values between any one bid text paragraph and each of the term text paragraphs;

the softmax value is an output value of the text matching model; the term text paragraphs whose softmax value is larger than a preset threshold value are sorted, and the term text paragraph with the maximum softmax value is taken as the matching term of said bid text paragraph;

wherein determining the matching relation between each paragraph in the bid text and each paragraph in the term text according to the output result specifically comprises the following steps:

converting the matching relation between each paragraph in the bid text and each paragraph in the term text into a binary classification problem;

if the output result for a bid text paragraph and a term text paragraph is 0, the bid text paragraph does not match the term text paragraph;

if the output result for a bid text paragraph and a term text paragraph is 1, the bid text paragraph matches the term text paragraph;

connecting the bidding content with the clause data, marking the bidding content with a label, and outputting the bidding content, the clause and the label in a data format;

the step of performing the Jieba word segmentation processing on the initial text data to obtain word segmentation data specifically comprises the following steps:

acquiring a plurality of threads, and simultaneously using the threads to perform Jieba word segmentation on a plurality of paragraphs in the initial text data;

importing the engineering construction word stock, and performing Jieba word segmentation on a plurality of paragraphs in the initial text data according to the engineering construction word stock.
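For illustration only, the multi-threaded segmentation step with an imported word stock can be sketched in Python as follows. The sketch is not part of the claimed method. To stay self-contained it replaces Jieba with a simple forward maximum matching segmenter (`forward_max_match`, an assumed stand-in), and the lexicon contents, the worker count and the function names are hypothetical. With Jieba installed, the word stock would typically be loaded with `jieba.load_userdict` and each paragraph segmented with `jieba.lcut`.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import partial
from typing import Callable, List, Sequence, Set


def forward_max_match(text: str, lexicon: Set[str], max_len: int = 6) -> List[str]:
    """Stand-in segmenter: greedy longest match against the lexicon."""
    tokens: List[str] = []
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def segment_paragraphs(
    paragraphs: Sequence[str],
    segment: Callable[[str], List[str]],
    workers: int = 4,
) -> List[List[str]]:
    """Segment several paragraphs at the same time, preserving their order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Executor.map returns results in the order of the input paragraphs.
        return list(pool.map(segment, paragraphs))


# Hypothetical professional words from an engineering construction word stock.
lexicon = {"投标保证金", "项目经理"}
paragraphs = ["投标保证金已缴纳", "项目经理到场"]
result = segment_paragraphs(paragraphs, partial(forward_max_match, lexicon=lexicon))
print(result)
```

Because the professional words are in the lexicon, they are kept whole instead of being split into shorter pieces, which is the effect the imported word stock is meant to have.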

2. An electronic bidding document and term matching apparatus, comprising:

at least one processor; and a memory communicatively coupled to the at least one processor; wherein

the memory stores instructions executable by the at least one processor to enable the at least one processor to perform:

3. A non-transitory computer storage medium storing computer-executable instructions, the computer-executable instructions configured to: