CN116205212A - Bid file information extraction method, device, equipment and storage medium - Google Patents

Bid file information extraction method, device, equipment and storage medium

Info

Publication number
CN116205212A
CN116205212A
Authority
CN
China
Prior art keywords
text
information
bidding
target
classification
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202310217202.0A
Other languages
Chinese (zh)
Inventor
曾志贤
王伟
陈焕坤
张黔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
China Resources Digital Technology Co Ltd
Original Assignee
China Resources Digital Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by China Resources Digital Technology Co Ltd filed Critical China Resources Digital Technology Co Ltd
Priority to CN202310217202.0A priority Critical patent/CN116205212A/en
Publication of CN116205212A publication Critical patent/CN116205212A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • G06F16/353Clustering; Classification into predefined classes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/38Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/383Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

An embodiment of the invention provides a method, device, equipment and storage medium for extracting bidding document information, relating to the technical field of artificial intelligence. A target file is acquired and parsed into a plurality of text block data based on visual features; the text block data are input into a text block classification model for text block classification to obtain target text classification labels; an attribute rule is then selected for each target text classification label and used to extract the attribute information of the text block data corresponding to that label; finally, the extraction information is obtained, the extraction information being the tendering information of a tendering document or the bidding information of a bidding document. Because the target file is split into text blocks based on visual features, the method adapts to bidding documents with different format information; at the same time, following the idea of prompt learning, the features of the text block data are fully exploited to obtain the target text classification labels, which reduces the required sample size while improving the accuracy of extracting bidding information from bidding documents.

Description

Bid file information extraction method, device, equipment and storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a bidding document information extraction method, device, equipment and storage medium.
Background
A large number of tendering and bidding documents are produced in the bidding business, and enterprises need to extract information from them to complete the bidding work; for example, required fields are extracted from different types of bidding documents and entered into a system for checking, comparison and similar operations. In the related art, with the development of deep learning in the field of natural language processing, recurrent neural networks and convolutional neural networks have been applied to information extraction.
However, unlike general documents, bidding documents come in many types, and the formats of documents of the same type provided by different bidders also vary widely. In the related art, multiple deep learning models must be trained to meet the information extraction requirements of different formats, so the models lack universality. In addition, many bidding document types contain enterprise-specific data that is difficult to collect in advance for model training, so an information extraction model trained on a small sample size has weak generalization ability and cannot extract bidding information accurately. It is therefore desirable to provide a method that can improve the accuracy of information extraction from bidding documents.
Disclosure of Invention
The embodiment of the application mainly aims to provide a method, a device, equipment and a storage medium for extracting bidding document information, and the accuracy of bidding document information extraction is improved.
In order to achieve the above object, a first aspect of an embodiment of the present application provides a bid document information extraction method, including:
acquiring a target file, and carrying out text analysis on the target file based on visual features to obtain a plurality of text block data; the target file includes: a tendering document or a bidding document;
inputting the text block data into a pre-trained text block classification model for text block classification to obtain a target text classification label of each text block data, wherein the text block classification model is a prompt learning pre-training language model;
selecting attribute rules of each target text classification label from a preset pattern matching rule library, and extracting attribute information of the text block data corresponding to the target text classification label by utilizing the attribute rules;
and obtaining the extraction information of the target file according to the target text classification tag and the attribute information, wherein the extraction information is the tendering information of a tendering document or the bidding information of a bidding document.
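The four claimed steps above can be sketched end to end. This is a minimal illustration, not the patented implementation: `classify_block` stands in for the prompt-learning text block classification model, and the rule library entries, label names, and field formats are all assumed for demonstration.

```python
import re

# Preset pattern-matching rule library (illustrative rules and labels).
RULE_LIBRARY = {
    "project_name": re.compile(r"Project Name[:：]\s*(\S+)"),
    "bid_price": re.compile(r"Bid Price[:：]\s*([\d,.]+)"),
}

def classify_block(block_text):
    # Placeholder for the pre-trained text block classification model.
    return "bid_price" if "Price" in block_text else "project_name"

def extract_info(blocks):
    results = {}
    for text in blocks:
        label = classify_block(text)    # target text classification label
        rule = RULE_LIBRARY[label]      # attribute rule selected for that label
        m = rule.search(text)
        if m:
            results[label] = m.group(1) # attribute information
    return results

info = extract_info(["Project Name: Metro-Line-3", "Bid Price: 1,250,000.00"])
```

The model assigns each block a label, and the label alone decides which rule runs, so adding a new field means adding one label and one rule rather than retraining an end-to-end extractor.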
In some embodiments, the visual features include spatial location features and typesetting features; the text analysis is performed on the target file based on the visual characteristics to obtain a plurality of text block data, including:
segmenting according to the text information of the target file to obtain a plurality of text paragraphs, wherein the text paragraphs comprise attribute information, and the attribute information comprises space position information and typesetting information;
obtaining the spatial position information of the text paragraph according to the spatial position characteristics of the text paragraph in the target file;
acquiring the typesetting information of the text paragraph based on the typesetting characteristics of the target file;
and obtaining the text block data according to the text paragraph and the attribute information of the text paragraph.
In some embodiments, the text block classification model includes a pre-trained language model and a tag classification model; inputting the text block data into a pre-trained text block classification model for text block classification to obtain a target text classification label of each text block data, wherein the method comprises the following steps:
adding prompt learning information into the text paragraph, and generating text information based on the prompt learning information and the text paragraph, wherein the prompt learning information comprises mask information;
generating a text block vector according to the text information and the corresponding attribute information;
inputting the text block vector into the pre-training language model to generate a vector, so as to obtain a mask vector representing the mask information;
and inputting the mask vector into the pre-trained label classification model to perform label prediction to obtain a target text classification label of the text block data.
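The classification flow above can be made concrete with a small sketch. The template wording, the verbalizer mapping, and `mask_word_predictor` are assumptions for illustration; the embodiment does not fix these details.

```python
MASK = "[MASK]"

def build_prompt(paragraph):
    # Prompt learning information (a template containing mask information)
    # is appended to the text paragraph to form the text information.
    return f"{paragraph} This paragraph is about {MASK}."

# Verbalizer: maps the word predicted at the mask position to a label.
VERBALIZER = {"price": "bid_price", "date": "opening_date"}

def classify(paragraph, mask_word_predictor):
    # `mask_word_predictor` stands in for the pre-trained language model plus
    # the label classification model operating on the mask vector.
    prompt = build_prompt(paragraph)
    word = mask_word_predictor(prompt)
    return VERBALIZER.get(word, "other")

label = classify("The bid price is 1,000,000 yuan.", lambda p: "price")
```

Because classification reduces to filling in a masked word, the pre-trained language model's own masked-prediction ability does most of the work, which is why prompt learning needs fewer labeled samples than training a classifier head from scratch.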
In some embodiments, the generating a text block vector according to the text information and the corresponding attribute information includes:
after word segmentation is carried out on the text information, a first preset vector table is queried to obtain a text vector;
inquiring a second preset vector table, and generating a space position vector according to the space position information;
inquiring a third preset vector table, and generating typesetting vectors according to the typesetting information;
and splicing the text vector, the space position vector and the typesetting vector to obtain the text block vector.
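The table lookups and splicing above can be sketched as follows. The three vector tables here are toy stand-ins with made-up values and dimensions; in practice they would be learned embedding tables, and mean-pooling the token vectors is just one simple pooling choice.

```python
TEXT_TABLE = {"bid": [0.1, 0.2], "price": [0.3, 0.4]}   # first preset vector table
POSITION_TABLE = {"0%-10%": [1.0, 0.0]}                 # second preset vector table
LAYOUT_TABLE = {"paragraph": [0.0], "table": [1.0]}     # third preset vector table

def text_block_vector(tokens, position, layout):
    # Pool the per-token text vectors by averaging.
    dims = len(next(iter(TEXT_TABLE.values())))
    text_vec = [sum(TEXT_TABLE[t][d] for t in tokens) / len(tokens)
                for d in range(dims)]
    # Splice text, spatial position, and typesetting vectors together.
    return text_vec + POSITION_TABLE[position] + LAYOUT_TABLE[layout]

vec = text_block_vector(["bid", "price"], "0%-10%", "table")
# 2 text dims + 2 position dims + 1 layout dim = 5 dimensions
```

Splicing (concatenation) keeps the three feature sources separate in the input vector, so the downstream model can weigh textual content, document position, and layout independently.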
In some embodiments, the inputting the mask vector into the pre-trained tag classification model to perform tag prediction to obtain a target text classification tag of the text block data includes:
inputting the pre-training vector into the label classification model, wherein the pre-training vector is generated by utilizing the pre-training language model according to the label of the word to be predicted;
initializing model parameters of the tag classification model based on the pre-training vector;
and carrying out label prediction by using the initialized label classification model, and outputting the target text classification label.
In some embodiments, extracting the attribute information of the text block data corresponding to the target text classification tag using the attribute rule includes:
matching the text block data by utilizing the attribute rule to obtain the label position of the target text classification label, wherein the attribute rule is a regular expression;
and obtaining the attribute information based on the label position.
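A minimal sketch of this rule-based step, assuming one label whose attribute rule is a regular expression that locates the label position, with the attribute value read from the text that follows. The field name and formats are illustrative.

```python
import re

# Assumed attribute rule for one label; [:：] accepts both the ASCII and
# the fullwidth (Chinese) colon.
attribute_rule = re.compile(r"Opening Date\s*[:：]")

def extract_attribute(block_text):
    m = attribute_rule.search(block_text)   # label position in the block
    if m is None:
        return None
    # Attribute information: the text following the matched label.
    return block_text[m.end():].strip()

value = extract_attribute("Opening Date: 2023-03-01 09:00")
```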
In some embodiments, the target text classification tag includes a first tag and a second tag, and the obtaining the extraction information of the target file according to the target text classification tag and the attribute information includes:
when the same target text classification label corresponds to more than one attribute information, the target text classification label is a first label, and a plurality of attribute information are combined to obtain target attribute information of the first label;
forming a first extraction result by the first tag and the corresponding target attribute information;
forming a second extraction result by the second label and the corresponding attribute information;
and obtaining the extraction information based on the first extraction result and the second extraction result.
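The assembly step above can be sketched as follows: labels that occur with more than one attribute value (first labels) have their values combined, while single-valued (second) labels are kept as-is. The label names and the "; " join separator are assumptions for illustration.

```python
def assemble(pairs):
    # pairs: (target text classification label, attribute information) tuples.
    grouped = {}
    for label, value in pairs:
        grouped.setdefault(label, []).append(value)
    extraction = {}
    for label, values in grouped.items():
        if len(values) > 1:
            extraction[label] = "; ".join(values)   # first label: merge values
        else:
            extraction[label] = values[0]           # second label: keep as-is
    return extraction

result = assemble([("qualification", "ISO9001"),
                   ("qualification", "Grade A"),
                   ("bidder_name", "Example Corp")])
```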
To achieve the above object, a second aspect of the embodiments of the present application provides a bidding document information extraction apparatus, including:
the acquisition module is used for acquiring a target file, and carrying out text analysis on the target file based on visual features to obtain a plurality of text block data; the target file includes: a tendering document or a bidding document;
the text block classification module is used for inputting the text block data into a pre-trained text block classification model to perform text block classification to obtain a target text classification label of each text block data, wherein the text block classification model is a prompt learning pre-trained language model;
the attribute matching module is used for selecting an attribute rule of each target text classification label from a preset pattern matching rule library and extracting attribute information of the text block data corresponding to the target text classification label by utilizing the attribute rule;
and the information extraction module is used for obtaining the extraction information of the target file according to the target text classification tag and the attribute information, wherein the extraction information is the tendering information of a tendering document or the bidding information of a bidding document.
To achieve the above object, a third aspect of the embodiments of the present application proposes an electronic device, which includes a memory and a processor, the memory storing a computer program, the processor implementing the method according to the first aspect when executing the computer program.
To achieve the above object, a fourth aspect of the embodiments of the present application proposes a storage medium, which is a computer-readable storage medium, storing a computer program, which when executed by a processor implements the method described in the first aspect.
According to the bidding document information extraction method, device, equipment and storage medium provided by the embodiments of the application, a target file is acquired and text analysis is performed on it based on visual features to obtain a plurality of text block data. The text block data are input into a pre-trained text block classification model for text block classification to obtain a target text classification label for each text block data; an attribute rule for each target text classification label is selected from a preset pattern matching rule library, and the attribute rule is used to extract the attribute information of the text block data corresponding to that label; finally, the extraction information of the target file is obtained according to the target text classification labels and the attribute information, the extraction information being the tendering information of a tendering document or the bidding information of a bidding document. Because the target file is split into text blocks based on visual features, the method adapts to bidding documents with different format information; at the same time, based on the idea of prompt learning, the features of the text block data are fully exploited to obtain the target text classification labels, which reduces the required sample size while improving the accuracy of extracting bidding information from bidding documents.
Drawings
Fig. 1 is a flowchart of a bidding document information extraction method provided by an embodiment of the present invention.
Fig. 2 is a flowchart of step S110 in fig. 1.
Fig. 3 is a flowchart of step S120 in fig. 1.
Fig. 4 is a schematic diagram of a text block classification model of the bidding document information extraction method according to an embodiment of the present invention.
Fig. 5 is a flowchart of step S122 in fig. 3.
Fig. 6 is an input-output schematic diagram of a pre-training language model of the bidding document information extraction method provided by the embodiment of the present invention.
Fig. 7 is a flowchart of step S124 in fig. 3.
Fig. 8 is a flowchart of step S130 in fig. 1.
Fig. 9 is a flowchart of step S140 in fig. 1.
Fig. 10 is a flowchart of a bid document information extraction method according to still another embodiment of the present invention.
Fig. 11 is a block diagram showing a construction of a bidding document information extracting apparatus according to still another embodiment of the present invention.
Fig. 12 is a schematic hardware structure of an electronic device according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
It should be noted that although functional block division is performed in a device diagram and a logic sequence is shown in a flowchart, in some cases, the steps shown or described may be performed in a different order than the block division in the device, or in the flowchart.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. The terminology used herein is for the purpose of describing embodiments of the invention only and is not intended to be limiting of the invention.
First, several nouns involved in the present invention are parsed:
Artificial intelligence (AI): a new technical science that studies and develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence. As a branch of computer science, it attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence; research in this field includes robotics, speech recognition, image recognition, natural language processing and expert systems. Artificial intelligence can simulate the information processes of human consciousness and thinking. It is also a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results.
Deep learning: learns the inherent laws and representation levels of sample data; the information obtained in the learning process greatly helps the interpretation of data such as text, images and sound, and its final goal is to enable machines to analyze and learn like humans and to recognize such data. Deep learning is a complex machine learning algorithm whose results in speech and image recognition far exceed those of earlier techniques. It has achieved many results in search technology, data mining, machine learning, machine translation, natural language processing, multimedia learning, speech, recommendation and personalization, and other related fields. Deep learning enables machines to imitate human activities such as seeing, hearing and thinking, solves many complex pattern recognition problems, and has greatly advanced artificial-intelligence-related technologies.
Prompt learning: a deep learning method that, without significantly changing the structure and parameters of a pre-trained language model, converts a downstream task into a text generation task by adding prompt information to the input.
A large number of tendering and bidding documents are produced in the bidding business, and enterprises need to extract information from them to complete the bidding work; for example, required fields are extracted from different types of bidding documents and entered into a system for checking, comparison and similar operations. In the related art, with the development of deep learning in the field of natural language processing, recurrent neural networks and convolutional neural networks have been applied to information extraction.
However, unlike general documents, bidding documents come in many types, and the formats of documents of the same type provided by different bidders also vary widely. In the related art, multiple deep learning models must be trained to meet the information extraction requirements of different formats, so the models lack universality. In addition, many bidding document types contain enterprise-specific data that is difficult to collect in advance for model training; a usable model must be trained on small samples of enterprise-specific document data, so the information extraction model has a high training cost and weak generalization ability, and bidding information cannot be extracted accurately. It is therefore desirable to provide a method that can improve the accuracy of information extraction from bidding documents.
On this basis, embodiments of the invention provide a bidding document information extraction method, device, equipment and storage medium that split the target file into text blocks based on visual features and can therefore adapt to bidding documents with different format information; at the same time, based on the idea of prompt learning, the features of the text block data are fully exploited to obtain the target text classification labels, which reduces the required sample size while improving the accuracy of extracting bidding information from bidding documents.
The embodiment of the invention provides a bidding document information extraction method, a bidding document information extraction device, bidding document information extraction equipment and a storage medium, and specifically, the bidding document information extraction method in the embodiment of the invention is described first through the following embodiment.
The embodiments of the invention can acquire and process related data based on artificial intelligence technology. Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and to produce new intelligent machines that can react in a manner similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, giving machines the functions of perception, reasoning and decision-making.
Artificial intelligence technology is a comprehensive subject involving a wide range of fields, covering both hardware-level and software-level technologies. Basic artificial intelligence technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing, operation/interaction systems, mechatronics and the like. Artificial intelligence software technologies mainly include computer vision, speech processing, natural language processing, and machine learning/deep learning.
The embodiments of the invention provide a bidding document information extraction method that relates to the technical field of artificial intelligence, in particular to the technical field of data mining. The method can be applied to a terminal, a server, or a computer program running in a terminal or server. For example, the computer program may be a native program or a software module in an operating system; it may be a native application (APP), i.e. a program that must be installed in an operating system to run, such as a client that supports bidding document information extraction; or an applet, i.e. a program that only needs to be downloaded into a browser environment to run; it may also be an applet embedded in any APP. In general, the computer program may be any form of application, module or plug-in. The terminal communicates with the server through a network. The bidding document information extraction method may be performed by the terminal or the server, or by the terminal and server in cooperation.
In some embodiments, the terminal may be a smartphone, tablet, notebook computer, desktop computer, smart watch, or the like. The server may be an independent server, or a cloud server providing cloud services, cloud databases, cloud computing, cloud functions, cloud storage, network services, cloud communication, middleware services, domain name services, security services, content delivery networks (CDN), and basic cloud computing services such as big data and artificial intelligence platforms; it may also be a service node in a blockchain system, where the service nodes form a peer-to-peer (P2P) network, the P2P protocol being an application-layer protocol running on top of the Transmission Control Protocol (TCP). The server may host the bidding document information extraction system, through which the terminal interacts; for example, the server may be provided with corresponding software, such as an application implementing the bidding document information extraction method, but is not limited to the above forms. The terminal and the server may be connected by Bluetooth, USB (Universal Serial Bus), a network, or another communication connection, which is not limited herein.
The invention is operational with numerous general purpose or special purpose computer system environments or configurations. For example: personal computers, server computers, hand-held or portable devices, tablet devices, multiprocessor systems, microprocessor-based systems, set top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, distributed computing environments that include any of the above systems or devices, and the like. The invention may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, etc. that perform particular tasks or implement particular abstract data types. The invention may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
The following describes a bid document information extraction method in an embodiment of the present invention.
Fig. 1 is an optional flowchart of the bidding document information extraction method provided by an embodiment of the present invention, and the method in fig. 1 may include, but is not limited to, steps S110 to S140. It should be understood that the order of steps S110 to S140 in fig. 1 is not particularly limited, and steps may be reordered, removed, or added according to actual requirements.
Step S110: and acquiring the target file, and carrying out text analysis on the target file based on the visual characteristics to obtain a plurality of text block data.
In one embodiment, the target file includes a tendering document or a bidding document. The tendering document is the programmatic outline of the tendered project and provides bidding units with all the conditions required to participate in the bidding; the bidding document, also called the bid response document, is compiled by a bidder according to the requirements of the tendering document. In one embodiment, because tendering documents and bidding documents have different information characteristics, the information extraction models built for them share the same model framework but have different model parameters.
In an embodiment, the target file may be obtained by uploading the file, or may be obtained by reading a database of the target file, and the method for obtaining the target file is not specifically limited in this embodiment.
In an embodiment, the visual features include spatial location features and layout features. Referring to fig. 2, a flowchart of a specific implementation of step S110 is shown in an embodiment, where the step of performing text parsing on the target file based on the visual features to obtain a plurality of text block data includes:
Step S111: and segmenting according to the text information of the target file to obtain a plurality of text paragraphs.
In an embodiment, the target file is a tendering document or a bidding document, in which most information related to tendering and bidding is presented as text. The text information exists in the target file in the form of paragraphs, with paragraph identifiers such as carriage returns between different paragraphs, so this embodiment can segment the text information in the target file according to the paragraph identifiers to obtain a plurality of text paragraphs. In addition, to distinguish different text paragraphs, each text paragraph obtained in this embodiment carries corresponding attribute information, which includes spatial position information and typesetting information.
In one embodiment, the spatial location information is primarily identifying the paragraph location of the text paragraph in the target file, e.g., the beginning of the document, the middle of the document, the end of the document, etc. The layout information is mainly layout information identifying a text paragraph, for example, whether the text paragraph is text in a table, i.e., the layout information of the text paragraph.
Step S112: and obtaining the spatial position information of the text paragraph according to the spatial position characteristics of the text paragraph in the target file.
In one embodiment, if the target file contains multiple pages, the contents of all pages are spliced according to the sequence to obtain an overall document, and the position of the text paragraph in the overall document, namely the spatial position characteristic of the text paragraph, is calculated according to the position information of the text paragraph in the overall document, for example, the text paragraph is positioned at 0% -10% of the position of the overall document, and 0% -10% of the position information is the spatial position information of the text paragraph. It will be appreciated that in an embodiment, the position of the text document in the overall document may be calculated according to the position of the first line, the middle line, or the last line of the text paragraph, which is not particularly limited in this embodiment.
Step S113: and acquiring typesetting information of the text paragraph based on typesetting characteristics of the target file.
In an embodiment, the typesetting feature refers to a typesetting format contained in the target file, where the typesetting format includes: paragraph and/or form, so that its layout information is obtained according to the layout format of the text paragraph, and similarly, the layout information includes: paragraphs and/or tables.
Step S114: and obtaining text block data according to the text paragraphs and the attribute information of the text paragraphs.
In an embodiment, the attribute information of the obtained text paragraph is added as additional information to the corresponding text paragraph to obtain text block data. For example, n text paragraphs are obtained in one embodiment, denoted as C1, C2, and Cn, and the corresponding attribute information is denoted as: s1, S2,..and Sn), the corresponding n text block data are expressed as: C1-S1, C2-S2, cn-Sn. It will be appreciated that the number of text block data and the number of text paragraphs correspond to the same.
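Steps S111-S114 can be sketched as follows. This is a minimal illustration only: it assumes newline characters as the paragraph identifiers and a pre-computed set of table-paragraph indices, neither of which is prescribed by this embodiment.

```python
from dataclasses import dataclass

@dataclass
class TextBlock:
    text: str                 # the text paragraph Ci
    spatial_position: float   # fraction of the overall document where it starts
    typesetting: str          # "paragraph" or "table" (illustrative attribute Si)

def parse_text_blocks(pages, table_paragraphs=frozenset()):
    # Splice the pages in order into an overall document, split on the
    # paragraph identifiers (here: carriage returns / newlines), and attach
    # the spatial position and typesetting attributes to each paragraph.
    whole = "\n".join(pages)
    paragraphs = [p for p in whole.split("\n") if p.strip()]
    total = sum(len(p) for p in paragraphs) or 1
    blocks, offset = [], 0
    for i, p in enumerate(paragraphs):
        layout = "table" if i in table_paragraphs else "paragraph"
        blocks.append(TextBlock(p, offset / total, layout))
        offset += len(p)
    return blocks
```

The spatial position here is computed from the first character of the paragraph, matching the "first line" option mentioned above.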
Step S120: inputting the text block data into a pre-trained text block classification model to classify the text blocks, and obtaining a target text classification label of each text block data.
In an embodiment, a text block classification model is constructed and trained in advance. Because there are many types of bidding documents, and different bidders provide documents of the same type in many different formats, the text block classification model needs to adapt to the requirement of extracting information in different formats. At the same time, because many bidding document types are enterprise-specific data, it is difficult to collect data in advance for model training; therefore, the training process of the text block classification model uses small-sample document data specific to the enterprise. In order to reduce the model training cost and improve the model generalization capability so as to accurately extract bidding information, this embodiment constructs and trains the text block classification model using a prompt learning method, i.e., the text block classification model is a prompt-learning pre-trained language model.
In an embodiment, referring to fig. 3, which is a flowchart showing a specific implementation of step S120, in this embodiment, the step of inputting a plurality of text block data into a pre-trained text block classification model to perform text block classification, and obtaining a target text classification label of each text block data includes:
Step S121: and adding prompt learning information in the text paragraph, and generating text information based on the prompt learning information and the text paragraph.
In an embodiment, the learning-prompting concept is added to the text paragraph based on the learning-prompting concept that the text classification task is changed into the text generation task by changing the reasoning mode into the "query pre-training language model" which is the complete blank filling problem about "xxx", i.e. by adding the "learning-prompting information" to the input without significantly changing the pre-training language model structure and parameters.
In one embodiment, text information is generated by concatenating hint learning information in a text passage, wherein the hint learning information includes mask information. For example, the text paragraph content is "bid deposit is refund. ", prompt learning information is: "this text block data belongs to the description of [ MASK ], the resulting text information is expressed as: the bid deposit is refund. This text block data belongs to the description of MASK. The purpose of text block classification of text block data in this embodiment is to generate content of [ MASK ], and the obtained information of [ MASK ], that is, the target text classification tag of the text block data.
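Step S121 amounts to simple string concatenation. A minimal sketch, assuming the default prompt template above (the template wording is this embodiment's example, not a fixed API):

```python
MASK_TOKEN = "[MASK]"

def build_text_information(paragraph,
                           prompt="This text block data belongs to the description of [MASK]."):
    # Concatenate the text paragraph with the prompt learning information;
    # the [MASK] slot is the content the model will be asked to generate.
    assert MASK_TOKEN in prompt, "prompt learning information must contain mask information"
    return paragraph + " " + prompt
```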
In one embodiment, the target text classification label is one of the following: bidding content, item name, item number, bid receiving time, buyer name, bid deposit, quality deposit, bid submission deadline, bid opening time, bidding place, bid opening place, contact telephone, purchasing center address, announcement deadline, bid announcement, purchasing unit, budget amount, bidding document selling price, bidding document acquisition mode, item contact mode, highest bid limit price, and the like. The target text classification label is not particularly limited in this embodiment and can be set according to actual requirements during application.
As can be seen from the foregoing, in the embodiment of the present application the prompt learning information contains mask information, so that the text block classification model performs text generation on the mask information and thereby obtains the target text classification label corresponding to the text paragraph, which identifies the nature of the paragraph content.
In one embodiment, referring to FIG. 4, the text block classification model 100 includes a pre-trained language model 200 and a label classification model 300 connected in sequence: the text block data is first input into the pre-trained language model 200, the output of the pre-trained language model 200 is then used as the input of the label classification model 300, and the target text classification label corresponding to the text paragraph is obtained through label classification.
In an embodiment, the pre-trained language model 200 may be a BERT model or another similar pre-trained language model. Taking the BERT model as an example, its structure is a multi-layer Transformer structure represented by the bidirectional encoder of the Transformer; it reduces the effective distance between two words at any positions to 1 through the attention mechanism, and pre-trains deep bidirectional representations by jointly conditioning on context in all layers, giving it strong language representation and feature extraction capability. The Transformer network architecture is an encoder-decoder structure formed by stacking a plurality of encoders and decoders. The left part of the Transformer architecture is the encoder, which consists of a multi-head attention layer and a fully connected layer and converts the input corpus into feature vectors. The right part is the decoder, whose inputs are the feature vectors output by the encoder and the previously predicted results; it consists of a multi-head attention layer and a fully connected layer and outputs the conditional probabilities of the final results.
In one embodiment, the label classification model 300 is a classifier, such as a softmax classifier. The structure of the classifier is not particularly limited in this embodiment.
Step S122: and generating a text block vector according to the text information and the corresponding attribute information.
In one embodiment, the text information and corresponding attribute information are in text format and cannot be used directly for model calculation, so vector conversion is required to convert them into text block vectors.
In an embodiment, referring to fig. 5, which is a flowchart showing a specific implementation of step S122, the step of generating a text block vector according to the text information and the corresponding attribute information includes:
step S1221: and after word segmentation is carried out on the text information, inquiring a first preset vector table to obtain a text vector.
In one embodiment, a word segmentation method based on a dictionary may be used to segment text information to obtain a word segmentation sequence composed of a plurality of word segments, and the method matches a character string to be matched in the text information with words in a pre-established dictionary according to a preset strategy. The preset strategy comprises the following steps: a forward maximum matching method, a reverse maximum matching method, a bidirectional matching word segmentation method and the like. According to the method, the word segmentation is carried out on the text information by adopting a machine learning algorithm based on statistics to obtain a word segmentation sequence, and different words in the text information are labeled and trained by utilizing a deep learning related algorithm, so that the frequency of occurrence of the words is considered, the context information is also considered, and a good effect is achieved. Or combining machine learning and a dictionary to segment text information to obtain a word segmentation sequence, so that the word segmentation accuracy can be improved. The word segmentation operation process also comprises a process of removing stop words, and the word segmentation method is not particularly limited in the embodiment.
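The forward maximum matching strategy mentioned above can be sketched as follows; the dictionary, the English toy input, and the maximum word length are illustrative assumptions:

```python
def forward_max_match(text, dictionary, max_len=5):
    # Forward maximum matching: at each position, greedily take the longest
    # dictionary word; fall back to a single character when nothing matches.
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words
```

The reverse maximum matching method works the same way but scans from the end of the string toward the beginning.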
After the word segmentation process, a plurality of tokens of the text information are obtained. For example, in one embodiment the text information is: "The bid deposit is refunded without interest. This text block data belongs to the description of [MASK]." In the original Chinese, the word segmentation sequence obtained by character-level segmentation is: [CLS] 投 标 保 证 金 均 无 息 退 还 。 这 一 文 本 块 数 据 属 于 [MASK] 的 描 述 。 [SEP]. Here [CLS] represents the feature used by the classification model (it may be omitted for non-classification tasks), and [SEP] is the separator symbol used to separate two sentences in the input corpus.
As can be seen from the above, the text information can be expressed as a word segmentation sequence, and the tokens in the sequence can be stored in a dictionary in advance. It can be understood that because the number of Chinese characters is limited, the number of tokens stored in the dictionary is also limited, so each token in the dictionary can be represented by a number, and the correspondence between tokens and numbers is stored to obtain the first preset vector table. The text vector is then obtained by looking up the tokens of the word segmentation sequence in order; the text vector contains the numeric representation of every token in the sequence.
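A minimal sketch of the first preset vector table, assuming the special symbols are assigned the lowest ids (the id layout is an assumption, not specified by this embodiment):

```python
def build_first_vector_table(corpus, specials=("[PAD]", "[CLS]", "[SEP]", "[MASK]")):
    # The "first preset vector table": a stored correspondence between
    # tokens and numbers, with the special symbols getting the lowest ids.
    table = {tok: i for i, tok in enumerate(specials)}
    for token in corpus:
        table.setdefault(token, len(table))
    return table

def text_vector(tokens, table):
    # Look up each token of the word segmentation sequence, in order.
    return [table[t] for t in tokens]
```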
Step S1222: and inquiring a second preset vector table, and generating a space position vector according to the space position information.
In one embodiment, the information carried in text paragraphs appearing at different locations in the document is different, such as text paragraphs describing "quality assurance" concerns, typically appearing in the template portion of the bidding contract, the template portion typically being located in the second half of the document; and text paragraphs that are related to the "open time" are described, typically in the first half of the document. Therefore, the embodiment obtains the space position information of the text paragraph so as to obtain more semantic information and improve the accuracy of information extraction.
In an embodiment, different spatial position information corresponds to different spatial position vectors. For example, if the spatial position information of text paragraphs is divided into 10 classes, the 10 classes can be marked 1-10, where "1" indicates that the spatial position information of the text paragraph is: located in the 0%-10% range of the document, "2" indicates: located in the 10%-20% range of the document, and so on.
As can be seen from the above, the correspondence between spatial position information and spatial position vectors can be stored in the second preset vector table; for the spatial position information in the attribute information, the corresponding spatial position vector can be generated by querying the second preset vector table.
In one embodiment, the spatial position vector corresponding to the prompt learning information is set to 0.
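The 10-class bucketing above, with id 0 reserved for the prompt tokens, can be sketched as (using `None` to mark prompt tokens is an illustrative convention):

```python
def spatial_position_id(fraction, n_buckets=10):
    # "Second preset vector table": map the document-position fraction to
    # one of n_buckets classes (1..10); id 0 is reserved for the prompt
    # learning information, whose spatial position vector is set to 0.
    if fraction is None:          # token belongs to the prompt learning information
        return 0
    return min(int(fraction * n_buckets), n_buckets - 1) + 1
```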
Step S1223: inquiring a third preset vector table, and generating typesetting vectors according to typesetting information.
In one embodiment, the information carried in text paragraphs of different typesetting formats of the document are different, such as text paragraphs describing "bid deposit" related text paragraphs typically appear in a table of bidding documents, and text paragraphs describing "quality deposit" related text paragraphs typically appear in the form of paragraphs in the document. Therefore, the embodiment obtains typesetting information of the text paragraphs to obtain more semantic information, and improves the accuracy of information extraction.
In an embodiment, the different typesetting information corresponds to different typesetting vectors, for example, the typesetting information of the text paragraph is divided into 3 types, and the 3 types of typesetting vectors can be marked with 1-3 types of typesetting vectors, for example: "1" indicates that the typesetting information of the text paragraph is: paragraph, "2" indicates that the typesetting information of the text paragraph is: the table, "3" indicates that the typesetting information of the text paragraph is: list, and so on. Wherein, list typesetting format refers to: a series of text formats arranged in a particular order.
From the above, the correspondence between the typesetting information and the typesetting vector may be stored in a third preset vector table, and for the typesetting information in the attribute information, the corresponding typesetting vector may be generated according to the typesetting information by querying the third preset vector table.
In one embodiment, the typesetting vector corresponding to the prompt learning information is set to 0.
Step S1224: and splicing the text vector, the space position vector and the typesetting vector to obtain a text block vector.
In an embodiment, vector splicing is performed on the text vector, the spatial position vector and the typesetting vector obtained in the above process according to the sequence, so that a corresponding text block vector can be obtained.
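The splicing in step S1224 can be sketched as a per-token concatenation of the three looked-up vectors. The toy embedding tables below (their dimensions and values) are illustrative assumptions only:

```python
# Toy lookup tables; real models would use learned embedding matrices.
tok_table = {0: [0.0, 0.0], 1: [0.1, 0.2], 2: [0.3, 0.4]}
pos_table = {0: [0.0], 1: [0.5]}
lay_table = {0: [0.0], 1: [1.0], 2: [2.0]}

def splice_block_vector(text_ids, spatial_ids, typesetting_ids):
    # Per-token splicing: concatenate the looked-up text, spatial position,
    # and typesetting vectors, in order, into one row of the text block vector.
    return [tok_table[t] + pos_table[s] + lay_table[l]
            for t, s, l in zip(text_ids, spatial_ids, typesetting_ids)]
```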
Step S123: and inputting the text block vector into a pre-training language model to generate a vector, and obtaining a mask vector representing mask information.
In an embodiment, the pre-trained language model may be a BERT model, and the BERT model structure may output MASK vectors for MASK information MASK based on context information of the text block vectors. Referring to fig. 6, it is assumed that the chinese character of the inputted text block vector is expressed as: [ CLS ] "throw", "target", "guarantee", "gold", "none", "information", "return", "still", "and". "," this "," belongs to "," [ MASK ] "," drawing "," said "[ SEP ], it is to be understood that the actual text block vector is represented numerically here by a chinese character schematic. After inputting the vector into the BERT model, a vector corresponding to "[ MASK ]" is output, and the vector is a MASK vector.
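Extracting the mask vector from the encoder output reduces to picking the hidden-state row at the [MASK] position. A minimal sketch; the vocabulary id of [MASK] and the list-of-lists representation of the hidden states are assumptions for illustration:

```python
MASK_ID = 3  # assumed vocabulary id of the [MASK] token

def pick_mask_vector(token_ids, hidden_states):
    # The label classification model only needs the encoder output at the
    # [MASK] position; that row is the mask vector.
    return hidden_states[token_ids.index(MASK_ID)]
```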
Step S124: and inputting the mask vector into a pre-trained label classification model for label prediction to obtain a target text classification label of the text block data.
In one embodiment, since the purpose of the text block classification model is to obtain the target text classification labels of the text block data, this is equivalent to predicting that the "bid deposit" is a refund in one embodiment. This text block data belongs to the description of [ MASK ] what the part obscured by [ MASK ] is. Since knowledge or mode contained in the text block classification model is established, words are not necessarily predicted in an expected mode, so that a plurality of prompts are needed to guide the text block classification model to do downstream classification tasks, all possible target text classification labels are used as word labels to be predicted in the embodiment, and the classification capability of the text block classification model is improved by using the word labels to be predicted.
In one embodiment, the target text classification labels include bid information labels or bidding information labels. The labels related to bidding, such as {bidding content, item name, item number, bid receiving time, buyer name, bid deposit, quality deposit, bid submission deadline, bid opening time, bidding place, bid opening place, contact telephone, purchasing center address, announcement deadline, bid announcement, purchasing unit, budget amount, bidding document selling price, bidding document acquisition mode, item contact mode, highest bid limit price, and the like}, are taken as the set of word labels to be predicted. When performing label prediction on the mask vectors, the classification data is divided into a plurality of batches, and the word label set to be predicted is updated once after each batch is processed. The update operation is specifically: traverse the word labels to be predicted of the current sample one by one, find k candidate word labels to be predicted, find the one among the k candidates with the greatest lifting effect on the text block classification model, and replace the original word label to be predicted at the current position with it.
In one embodiment, the process of finding the k candidate word labels to be predicted is: calculate the average loss of the samples in one batch, then calculate the gradient of the word vector corresponding to the k-th word label to be predicted, traverse the vocabulary, and for each word label to be predicted take the dot product of its word vector and the obtained gradient vector as an index value. The meaning of this index is: if the original word label to be predicted were replaced with the current one, the prediction effect could be improved.
In one embodiment, the process of selecting the final word label to be predicted from the k candidates is: traverse the k candidate words; replace the j-th word label to be predicted with the current replacement word and execute the previous process to obtain a probability value; and select the word label to be predicted corresponding to the maximum probability value as the final word label to be predicted.
As can be seen from the above, when performing label prediction on a mask vector, the word label set to be predicted formed by all target text classification labels needs to be input into the label classification model, and after calculation the word label to be predicted with the largest probability value is selected as the final target text classification label.
In an embodiment, referring to fig. 7, which is a flowchart showing a specific implementation of step S124, in this embodiment, the step of inputting the mask vector into the pre-trained label classification model to perform label prediction, and obtaining the target text classification label of the text block data includes:
Step S1241: the pre-training vectors are input into a tag classification model.
In an embodiment, the pre-training vector is generated from the word label to be predicted using the pre-trained language model, i.e., the output obtained by inputting each word label to be predicted into the pre-trained language model is the pre-training vector of that word label.
Step S1242: model parameters of the tag classification model are initialized based on the pre-training vector.
In an embodiment, the prompt learning method uses the pre-training vectors of the word labels to be predicted to initialize the model parameters of the label classification model; the initialization may take the average of the pre-training vectors. The purpose of initialization is to make full use of the features learned by the pre-trained language model and the information of the word labels to be predicted, which helps improve the classification effect of the label classification model.
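The averaging initialization can be sketched as follows, assuming each word label may have several pre-training vectors (e.g., one per token of the label) that are averaged element-wise:

```python
def init_label_weights(pretraining_vectors):
    # Warm-start each label's classifier weights with the element-wise
    # average of the pre-training vectors of its word label to be predicted.
    weights = {}
    for label, vectors in pretraining_vectors.items():
        n = len(vectors)
        weights[label] = [sum(column) / n for column in zip(*vectors)]
    return weights
```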
Step S1243: and carrying out label prediction by using the initialized label classification model, and outputting a target text classification label.
In an embodiment, the tag classification model is a softmax layer, and can output probabilities of different word tags to be predicted, and select the word tag to be predicted with the highest probability from the probabilities as the target text classification tag.
For example, in one embodiment, for convenience of description, it is assumed that there are only two kinds of target text classification labels, and the label classification model is a linear two-classifier. The content to be predicted is: the bid deposit is refund. This text chunk data belongs to the MASK in the description of [ MASK ], assuming that the set of word tags to be predicted of [ MASK ] includes { bid-deposit, quality-deposit }. The pre-training language model outputs mask vectors, then the mask vectors are input into the label classification model, and the classifier is initialized to train by using the pre-training vectors of the bid deposit and the quality deposit, and the prediction result of the classifier is classified as the bid deposit or the quality deposit.
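The two-label example can be sketched as a softmax head over the mask vector. The weight vectors in the test are toys; a real classifier would use the initialized, then trained, weights:

```python
import math

def predict_label(mask_vector, label_weights):
    # Softmax layer: score each word label to be predicted by the dot
    # product of its weight vector with the mask vector, normalize the
    # scores, and return the most probable label and the probabilities.
    scores = {label: sum(m * w for m, w in zip(mask_vector, vec))
              for label, vec in label_weights.items()}
    peak = max(scores.values())                       # for numerical stability
    exps = {label: math.exp(s - peak) for label, s in scores.items()}
    z = sum(exps.values())
    probs = {label: e / z for label, e in exps.items()}
    return max(probs, key=probs.get), probs
```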
Through the above process, the target text classification label of each piece of text block data can be obtained; the following steps then obtain the attribute information of the text block data based on the target text classification label.
Step S130: select the attribute rule of each target text classification label from a preset pattern matching rule base, and extract the attribute information of the text block data corresponding to the target text classification label by using the attribute rule.
In an embodiment, since the content that a text paragraph is expected to present is already known through its target text classification label, the attribute rule of each target text classification label can be generated based on this prior information, and all attribute rules are stored in a preset pattern matching rule base; when needed, the attribute rule of each target text classification label can be selected from the rule base according to the content of the label.
In an embodiment, referring to fig. 8, which is a flowchart showing a specific implementation of step S130, the step of selecting the attribute rule of each target text classification label from the preset pattern matching rule base and extracting the attribute information of the corresponding text block data by using the attribute rule includes:
Step S131: match the text block data by using the attribute rule to obtain the label position of the target text classification label.
Step S132: obtain the attribute information based on the label position.
In an embodiment, the attribute rule is a regular expression. Using a pattern matching method based on lexical syntax, regular expressions corresponding to the different classification labels are written according to the distribution rules of the entities and attribute information; it can be understood that each target text classification label corresponds to one regular expression. For example, the "bid opening time" usually appears in the presentation format "20xx year xx month xx day", so if a text paragraph contains the bid opening time, the character string position of this date, i.e., the label position, can be located by the regular expression. After the label position is obtained, the "20xx year xx month xx day" string in the text paragraph can be extracted; the extracted string is the attribute information of the text paragraph.
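Steps S131-S132 can be sketched with a single regular expression. The pattern below, for the Chinese date format behind "20xx year xx month xx day" (年/月/日 marking year/month/day), is an illustrative assumption; the rule base would hold one such expression per label:

```python
import re

# Illustrative attribute rule for the "bid opening time" label.
OPEN_TIME_RULE = re.compile(r"20\d{2}年\d{1,2}月\d{1,2}日")

def extract_attribute(text_block, rule=OPEN_TIME_RULE):
    # Match the text block data with the attribute rule: the match span is
    # the label position, and the matched string is the attribute information.
    match = rule.search(text_block)
    if match is None:
        return None, None
    return match.group(0), match.span()
```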
Step S140: and obtaining the extraction information of the target file according to the target text classification tag and the attribute information, wherein the extraction information is bidding information of a bidding file or bidding information of a bidding file.
In an embodiment, it is possible that the different text paragraphs describe the same content, for example, the target text classification label "open time" is located in different paragraphs, so after the whole target file is segmented, a plurality of corresponding "open times", that is, attribute information thereof, can be obtained through the calculation of the steps. Alternatively, the content in the same target text classification tag is described in segments, for example, "bid announcement" has related descriptions in text paragraph D1, text paragraph D2, and text paragraph D3, and text paragraph D1 describes: bulletin G1, text paragraph D2: bulletin G2 and text paragraph D3: bulletin G3, therefore, the target text classification labels of text paragraph D1, text paragraph D2, and text paragraph D3 are all "bid bulletins", but their corresponding attribute information is different, expressed as:
text paragraph D1: "bid announcement" { announcement G1};
text paragraph D2: "bid announcement" { announcement G2};
text paragraph D3: "bid announcement" { announcement G3};
In view of the above scenario, in an embodiment, referring to fig. 9, which is a flowchart showing a specific implementation of step S140, the step of obtaining the extraction information of the target file according to the target text classification labels and the attribute information includes:
Step S141: when the same target text classification label corresponds to more than one piece of attribute information, that label is a first label, and the multiple pieces of attribute information are combined to obtain the target attribute information of the first label.
In an embodiment, the target text classification labels of all text block data are classified according to the number of pieces of attribute information: if a target text classification label corresponds to more than one piece of attribute information, it is a first label, such as "bid announcement" in the above embodiment; otherwise, it is a second label, such as "bid opening time" in the above embodiment.
For a first label, its multiple pieces of attribute information are combined to obtain the target attribute information. For example, in the above embodiment, combining the attribute information of "bid announcement" yields {announcement G1, announcement G2, announcement G3}, which is the target attribute information of "bid announcement".
Step S142: and constructing a first extraction result by the first label and the corresponding target attribute information.
In an embodiment, the first tag and the corresponding target attribute information are spliced to obtain a first extraction result, where in the above embodiment, the first extraction result is expressed as: "bid announcement" - { announcement G1, announcement G2, announcement G3}.
Step S143: and forming a second extraction result by the second label and the corresponding attribute information.
In an embodiment, the second label and the corresponding attribute information are spliced to obtain a second extraction result, where in the above embodiment, the second extraction result is expressed as: "open time" - {20xx year xx month xx day }.
Step S144: and obtaining extraction information based on the first extraction result and the second extraction result.
In an embodiment, the first extraction result and the second extraction result are combined to obtain the extraction information of the target file. When the target file is a bidding document, extracting bidding information of the bidding document; and when the target file is a bidding file, extracting the bidding information of the bidding file.
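Steps S141-S144 can be sketched as a grouping pass over (label, attribute) pairs; whether a label is a first or second label falls out of the number of attributes it accumulated:

```python
from collections import defaultdict

def build_extraction_info(label_attribute_pairs):
    # Group the (target text classification label, attribute information)
    # pairs: first labels (more than one attribute) get their attributes
    # combined; second labels keep their single attribute as-is.
    grouped = defaultdict(list)
    for label, attribute in label_attribute_pairs:
        grouped[label].append(attribute)
    return {label: attrs if len(attrs) > 1 else attrs[0]
            for label, attrs in grouped.items()}
```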
In one embodiment, referring to fig. 10, an overall flow diagram of the bid document information extraction method is shown, taking as an example the text paragraph "The bid deposit is refunded without interest." with the prompt learning information "This text block data belongs to the description of [MASK]."
First, the text paragraph together with its attribute information and the prompt learning information is input into the embedding layer, whose purpose is to obtain the text block vector from the text vector, the spatial position vector, and the typesetting vector. The text block vector is then input into the BERT model to obtain the output mask vector, and the mask vector is input into the classifier. The pre-training vectors of the word labels to be predicted are obtained in advance and used to initialize the classifier, so that the classifier's output based on the mask vector is the target text classification label. By using textual features such as word vector information together with visual features such as the spatial position and typesetting format contained in the text block, the embedding layer can better represent the semantic information of the text block.
Then the attribute rule is selected from the preset pattern matching rule base according to the target text classification label, the attribute information of the corresponding text block data is extracted using the attribute rule, and the extraction information of the target file is obtained from the attribute information.
In this embodiment, by constructing a text block classification model for prompt learning, the bidding information extraction task is converted into a classification prediction task on the mask [MASK] in the constructed prompt; the pre-training vectors of the word labels to be predicted are used to initialize the parameters of the classifier, making full use of the features learned by the pre-trained language model and of the label text, which reduces the required sample size and increases the accuracy of bidding document information extraction.
According to the technical scheme provided by the embodiment of the invention, a target file is obtained and parsed into a plurality of text block data based on visual features. The text block data are input into a pre-trained text block classification model to obtain a target text classification label for each text block; an attribute rule for each target text classification label is selected from a preset pattern matching rule library, and the attribute information of the corresponding text block data is extracted using that rule. Finally, the extraction information of the target file, namely the tender information of a tender document or the bid information of a bid document, is obtained from the target text classification labels and the attribute information. Because the target file is split into text blocks based on visual features, the scheme adapts to bidding documents with different layouts; meanwhile, following the idea of prompt learning, the target text classification labels are obtained by fully exploiting the features of the text block data, which reduces the required sample size while improving the accuracy of information extraction from bidding documents.
The embodiment of the invention also provides a bidding document information extraction apparatus, which can implement the above bidding document information extraction method. Referring to fig. 11, the apparatus comprises:
The obtaining module 1110 is configured to obtain a target file and perform text parsing on the target file based on visual features to obtain a plurality of text block data; the target file is a tender document or a bid document.
The text block classification module 1120 is configured to input the plurality of text block data into a pre-trained text block classification model for text block classification to obtain a target text classification label for each text block data, where the text block classification model is a pre-trained language model adapted via prompt learning.
The attribute matching module 1130 is configured to select an attribute rule of each target text classification tag from a preset pattern matching rule library, and extract attribute information of text block data corresponding to the target text classification tag by using the attribute rule.
The information extraction module 1140 is configured to obtain, according to the target text classification tags and the attribute information, the extraction information of the target file, where the extraction information is the tender information of the tender document or the bid information of the bid document.
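The final assembly performed by a module like 1140 can be sketched as follows: matched (label, value) pairs are grouped, and a label that matched several pieces of attribute information has them combined into one target attribute value. The label names and merge delimiter are illustrative assumptions.

```python
def assemble_extraction(pairs):
    """pairs: list of (label, attribute_value) tuples from the matching step."""
    grouped = {}
    for label, value in pairs:
        grouped.setdefault(label, []).append(value)
    # A label with several matched values gets them combined into one target
    # attribute; a label with a single value keeps it as-is.
    return {label: (values[0] if len(values) == 1 else "; ".join(values))
            for label, values in grouped.items()}

info = assemble_extraction([
    ("contact_info", "Tel: 010-0000000"),
    ("contact_info", "Email: bid@example.com"),
    ("bid_deadline", "2023-02-27 09:30"),
])
```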
The specific implementation of the bidding document information extraction apparatus of the present embodiment is basically the same as the specific implementation of the bidding document information extraction method described above, and will not be described here again.
The embodiment of the invention also provides an electronic device, comprising:
At least one memory;
at least one processor;
at least one program;
the program is stored in the memory, and the processor executes the at least one program to implement the bidding document information extraction method of the present invention described above. The electronic device can be any intelligent terminal, including a mobile phone, a tablet computer, a personal digital assistant (PDA), a vehicle-mounted computer, and the like.
Referring to fig. 12, which illustrates the hardware structure of an electronic device according to another embodiment, the electronic device includes:
the processor 1201, which may be implemented by a general-purpose CPU (central processing unit), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits, and is configured to execute related programs to implement the technical solution provided by the embodiments of the present invention;
the memory 1202, which may be implemented in the form of a ROM (read-only memory), a static storage device, a dynamic storage device, or a RAM (random access memory). The memory 1202 may store an operating system and other application programs; when the technical solutions provided in the embodiments of the present disclosure are implemented by software or firmware, the relevant program codes are stored in the memory 1202 and invoked by the processor 1201 to execute the bidding document information extraction method of the embodiments of the present disclosure;
An input/output interface 1203 for implementing information input and output;
the communication interface 1204, configured to implement communication interaction between this device and other devices, either in a wired manner (e.g., USB, network cable) or in a wireless manner (e.g., mobile network, Wi-Fi, Bluetooth); and
a bus 1205 for transferring information between various components of the device such as the processor 1201, memory 1202, input/output interface 1203, and communication interface 1204;
wherein the processor 1201, the memory 1202, the input/output interface 1203, and the communication interface 1204 are communicatively connected to one another inside the device via the bus 1205.
The embodiment of the application also provides a storage medium, which is a computer-readable storage medium storing a computer program; when executed by a processor, the computer program implements the bidding document information extraction method described above.
The memory, as a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. In addition, the memory may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located relative to the processor, the remote memory being connectable to the processor through a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The bidding document information extraction method, apparatus, electronic device, and storage medium provided by the embodiments of the invention obtain a target file and parse it into a plurality of text block data based on visual features. The text block data are input into a pre-trained text block classification model to obtain a target text classification label for each text block; an attribute rule for each target text classification label is selected from a preset pattern matching rule library, and the attribute information of the corresponding text block data is extracted using that rule. Finally, the extraction information of the target file, namely the tender information of a tender document or the bid information of a bid document, is obtained from the target text classification labels and the attribute information. Splitting the target file into text blocks based on visual features makes the scheme applicable to bidding documents with different layouts, while the prompt-learning-based classification fully exploits the features of the text block data, reducing the required sample size and improving the accuracy of bidding information extraction.
The embodiments described above are intended to explain the technical solutions of the present application more clearly and do not constitute a limitation on them; as those skilled in the art will appreciate, with the evolution of technology and the emergence of new application scenarios, the technical solutions provided by the embodiments of the present application remain applicable to similar technical problems.
It will be appreciated by those skilled in the art that the technical solutions shown in the figures do not limit the embodiments of the present application, which may include more or fewer steps than shown, combine certain steps, or use different steps.
The apparatus embodiments described above are merely illustrative; the units described as separate components may or may not be physically separate, i.e., they may be located in one place or distributed over a plurality of network elements. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
Those of ordinary skill in the art will appreciate that all or some of the steps of the methods, systems, functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof.
The terms "first," "second," "third," "fourth," and the like in the description of the present application and in the above-described figures, if any, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that embodiments of the present application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
It should be understood that in this application, "at least one" means one or more, and "a plurality" means two or more. "And/or" describes an association relationship between associated objects and covers three cases; for example, "A and/or B" may represent: only A, only B, or both A and B, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of" and similar expressions mean any combination of the listed items, including any combination of single or plural items. For example, at least one of a, b, or c may represent: a, b, c, "a and b", "a and c", "b and c", or "a and b and c", where a, b, and c may be singular or plural.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the above-described division of units is merely a logical function division, and there may be another division manner in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described above as separate components may or may not be physically separate, and components shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence or in the part contributing to the prior art, or in whole or in part, may be embodied in the form of a software product stored in a storage medium, including multiple instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the various embodiments of the present application. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, or other media capable of storing a program.
Preferred embodiments of the present application are described above with reference to the accompanying drawings, and thus do not limit the scope of the claims of the embodiments of the present application. Any modifications, equivalent substitutions and improvements made by those skilled in the art without departing from the scope and spirit of the embodiments of the present application shall fall within the scope of the claims of the embodiments of the present application.

Claims (10)

1. A bidding document information extraction method, characterized by comprising:
acquiring a target file, and performing text parsing on the target file based on visual features to obtain a plurality of text block data; the target file is a tender document or a bid document;
inputting the text block data into a pre-trained text block classification model for text block classification to obtain a target text classification label of each text block data, wherein the text block classification model is a prompt learning pre-training language model;
selecting attribute rules of each target text classification label from a preset pattern matching rule library, and extracting attribute information of the text block data corresponding to the target text classification label by utilizing the attribute rules;
and obtaining the extraction information of the target file according to the target text classification tag and the attribute information, wherein the extraction information is tender information of the tender document or bid information of the bid document.
2. The bidding document information extraction method of claim 1, wherein the visual features include spatial location features and typesetting features; the text analysis is performed on the target file based on the visual characteristics to obtain a plurality of text block data, including:
Segmenting according to the text information of the target file to obtain a plurality of text paragraphs, wherein the text paragraphs comprise attribute information, and the attribute information comprises space position information and typesetting information;
obtaining the spatial position information of the text paragraph according to the spatial position characteristics of the text paragraph in the target file;
acquiring the typesetting information of the text paragraph based on the typesetting characteristics of the target file;
and obtaining the text block data according to the text paragraph and the attribute information of the text paragraph.
3. The bidding document information extraction method of claim 2, wherein the text block classification model comprises a pre-trained language model and a tag classification model; inputting the text block data into a pre-trained text block classification model for text block classification to obtain a target text classification label of each text block data, wherein the method comprises the following steps:
adding prompt learning information into the text paragraph, and generating text information based on the prompt learning information and the text paragraph, wherein the prompt learning information comprises mask information;
generating a text block vector according to the text information and the corresponding attribute information;
Inputting the text block vector into the pre-training language model to generate a vector, so as to obtain a mask vector representing the mask information;
and inputting the mask vector into the pre-trained label classification model to perform label prediction to obtain a target text classification label of the text block data.
4. A bidding document information extraction method as claimed in claim 3, wherein said generating a text block vector from said text information and corresponding said attribute information comprises:
after word segmentation is carried out on the text information, a first preset vector table is queried to obtain a text vector;
inquiring a second preset vector table, and generating a space position vector according to the space position information;
inquiring a third preset vector table, and generating typesetting vectors according to the typesetting information;
and splicing the text vector, the space position vector and the typesetting vector to obtain the text block vector.
5. The method for extracting bidding document information of claim 3, wherein the inputting the mask vector into the pre-trained tag classification model for tag prediction to obtain the target text classification tag of the text block data comprises:
inputting a pre-training vector into the label classification model, wherein the pre-training vector is generated by the pre-trained language model from the label word to be predicted;
initializing model parameters of the tag classification model based on the pre-training vector;
and carrying out label prediction by using the initialized label classification model, and outputting the target text classification label.
6. The bidding document information extraction method according to any one of claims 1 to 4, wherein the extracting attribute information of the text block data corresponding to the target text classification tag using the attribute rule includes:
matching the text block data by utilizing the attribute rule to obtain the label position of the target text classification label, wherein the attribute rule is a regular expression;
and obtaining the attribute information based on the label position.
7. The method of claim 6, wherein the target text classification tag includes a first tag and a second tag, wherein the obtaining the extracted information of the target file according to the target text classification tag and the attribute information includes:
when the same target text classification label corresponds to more than one piece of attribute information, that label is a first label, and the pieces of attribute information are combined to obtain the target attribute information of the first label;
forming a first extraction result by the first tag and the corresponding target attribute information;
forming a second extraction result by the second label and the corresponding attribute information;
and obtaining the extraction information based on the first extraction result and the second extraction result.
8. A bidding document information extraction apparatus, comprising:
the acquisition module, configured to acquire a target file and perform text parsing on the target file based on visual features to obtain a plurality of text block data; the target file is a tender document or a bid document;
the text block classification module is used for inputting the text block data into a pre-trained text block classification model to perform text block classification to obtain a target text classification label of each text block data, wherein the text block classification model is a prompt learning pre-trained language model;
the attribute matching module is used for selecting an attribute rule of each target text classification label from a preset pattern matching rule library and extracting attribute information of the text block data corresponding to the target text classification label by utilizing the attribute rule;
and the information extraction module, configured to obtain the extraction information of the target file according to the target text classification tag and the attribute information, wherein the extraction information is tender information of the tender document or bid information of the bid document.
9. An electronic device comprising a memory storing a computer program and a processor that when executing the computer program implements the bidding document information extraction method of any one of claims 1 to 7.
10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the bidding document information extraction method of any one of claims 1 to 7.
CN202310217202.0A 2023-02-27 2023-02-27 Bid file information extraction method, device, equipment and storage medium Pending CN116205212A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310217202.0A CN116205212A (en) 2023-02-27 2023-02-27 Bid file information extraction method, device, equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310217202.0A CN116205212A (en) 2023-02-27 2023-02-27 Bid file information extraction method, device, equipment and storage medium

Publications (1)

Publication Number Publication Date
CN116205212A true CN116205212A (en) 2023-06-02

Family

ID=86509320

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310217202.0A Pending CN116205212A (en) 2023-02-27 2023-02-27 Bid file information extraction method, device, equipment and storage medium

Country Status (1)

Country Link
CN (1) CN116205212A (en)

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117934130A (en) * 2024-01-19 2024-04-26 湖南湘能创业项目管理有限公司 Intelligent bidding method, system, equipment and medium for project material bidding
CN117934130B (en) * 2024-01-19 2025-02-07 湖南湘能创业项目管理有限公司 Intelligent bidding method, system, equipment and medium for project material bidding
CN118377912A (en) * 2024-06-27 2024-07-23 山东捷瑞数字科技股份有限公司 Electronic manual processing method, interactive system, electronic device and readable storage medium
CN118377912B (en) * 2024-06-27 2024-11-08 山东捷瑞数字科技股份有限公司 Electronic manual processing method, interaction system, electronic device and readable storage medium
CN118469040A (en) * 2024-07-10 2024-08-09 中建五局第三建设(深圳)有限公司 Bidder ring training method, predicting device, equipment and medium for detecting model
CN118917295A (en) * 2024-10-10 2024-11-08 北京数科网维技术有限责任公司 Layout file processing method and device and computing equipment
CN119248934A (en) * 2024-12-05 2025-01-03 北京仁科互动网络技术有限公司 Unstructured data extraction method, device, electronic device and storage medium

Similar Documents

Publication Publication Date Title
CN111488931B (en) Article quality evaluation method, article recommendation method and corresponding devices
CN116205212A (en) Bid file information extraction method, device, equipment and storage medium
CN114298121B (en) Multi-mode-based text generation method, model training method and device
CN113392209B (en) Text clustering method based on artificial intelligence, related equipment and storage medium
CN110688854A (en) Named entity recognition method, device and computer readable storage medium
Arumugam et al. Hands-On Natural Language Processing with Python: A practical guide to applying deep learning architectures to your NLP applications
CN116258137A (en) Text error correction method, device, equipment and storage medium
CN116578688A (en) Text processing method, device, equipment and storage medium based on multiple rounds of questions and answers
CN117807482B (en) Method, device, equipment and storage medium for classifying customs clearance notes
CN113392179A (en) Text labeling method and device, electronic equipment and storage medium
CN116775875A (en) Question corpus construction method and device, question answering method and device and storage medium
CN116595023A (en) Address information updating method and device, electronic equipment and storage medium
CN114492661B (en) Text data classification method and device, computer equipment and storage medium
CN114492669B (en) Keyword recommendation model training method, recommendation device, equipment and medium
CN113221553A (en) Text processing method, device and equipment and readable storage medium
Andriyanov Combining text and image analysis methods for solving multimodal classification problems
WO2023134085A1 (en) Question answer prediction method and prediction apparatus, electronic device, and storage medium
CN114372454B (en) Text information extraction method, model training method, device and storage medium
CN113657092A (en) Method, apparatus, device and medium for identifying label
CN118468863A (en) Title generation method and device
CN114398903B (en) Intention recognition method, device, electronic equipment and storage medium
CN113392190B (en) Text recognition method, related equipment and device
CN115617959A (en) Question answering method and device
CN116089602B (en) Information processing method, apparatus, electronic device, storage medium, and program product
CN115712704A (en) Sentence vector generating method and device, matching method and device and storage medium

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination