CN113515587A - Object information extraction method and device, computer equipment and storage medium


Info

Publication number
CN113515587A
Authority
CN
China
Prior art keywords
object information
data
information
extracting
target
Prior art date
Legal status
Granted
Application number
CN202110614055.1A
Other languages
Chinese (zh)
Other versions
CN113515587B (en)
Inventor
严蕾
王进强
袁明
沈志远
李维盈
陈建
高振祥
Current Assignee
China Shenhua International Engineering Co ltd
Original Assignee
China Shenhua International Engineering Co ltd
Priority date
Filing date
Publication date
Application filed by China Shenhua International Engineering Co ltd
Priority to CN202110614055.1A
Publication of CN113515587A
Application granted
Publication of CN113515587B
Legal status: Active

Classifications

    • G06F16/31: Information retrieval of unstructured textual data; indexing; data structures therefor; storage structures
    • G06F16/332: Information retrieval of unstructured textual data; querying; query formulation
    • G06F16/35: Information retrieval of unstructured textual data; clustering; classification
    • G06F40/151: Handling natural language data; text processing; use of codes for handling textual entities; transformation
    • G06F40/242: Handling natural language data; natural language analysis; lexical tools; dictionaries
    • G06N3/045: Neural networks; architecture; combinations of networks
    • G06N3/047: Neural networks; architecture; probabilistic or stochastic networks
    • G06N3/08: Neural networks; learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a subject matter information extraction method and device, computer equipment and a storage medium. The method comprises: acquiring a current bidding document from which subject matter information is to be extracted; performing data cleaning on the text content of the current bidding document to obtain initial data; locating, from the initial data by regular matching, a key sentence set suspected to contain subject matter information, and obtaining the target phrase set corresponding to the key sentence set; and performing data classification and labeling on the target phrase set by named entity recognition, and extracting structured subject matter information based on the labeling results. By applying the scheme of the invention, the subject matter information can be extracted automatically, and the extracted information is structured, which improves the working efficiency of enterprise staff and the quality of the bidding business and makes the bidding business more intelligent and electronic. Comparison with manually labeled subject matter information shows that the method achieves good results in extracting subject matter information.

Description

Object information extraction method and device, computer equipment and storage medium
Technical Field
The present invention relates to the field of computer and bidding technologies, and in particular, to a method and an apparatus for extracting information of a target object, a computer device, and a storage medium.
Background
Bidding is an important task in enterprise project management, and bidding documents typically have relatively standardized writing requirements and textual content. In the bidding business, bid-administering enterprises generate data quickly and in large volumes, yet the extraction of subject matter information during bidding project management is still carried out manually, which consumes a large amount of manpower and material resources and makes it difficult to guarantee the accuracy of the extracted subject matter.
Therefore, if the text content of bidding documents is studied as a corpus, functions such as management, application, feedback and iterative updating of standard bidding documents can be realized, which improves the working efficiency of enterprise staff and the quality of the bidding business, facilitates risk control, and promotes the development of enterprises' bidding management modes toward intelligence and electronization. There is thus a need for a subject matter information extraction scheme that achieves standardized, efficient and accurate bidding project management.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a subject matter information extraction scheme that improves the working efficiency of enterprise staff and the quality of the bidding business and promotes the development of enterprises' bidding management modes toward intelligence and electronization.
In order to solve the above technical problem, the present invention provides a method for extracting object information, including:
acquiring a current bidding document of the subject matter information to be extracted;
performing data cleaning on the text content of the current bid document to obtain initial data;
positioning a key sentence set of suspected object information from the initial data based on a regular matching mode, and obtaining a target phrase set corresponding to the key sentence set;
and carrying out data classification and annotation on the target phrase set based on a named entity identification mode, and extracting structured object information based on annotation results.
Optionally, the performing data cleaning on the text content of the current bid-marking document includes:
converting the current bid document in the HTML format into text content in a target format;
and carrying out data cleaning on the text content and removing useless information to obtain initial data.
Optionally, the locating a key sentence set of suspected object information from the initial data based on the regular matching manner includes:
and positioning the key sentence set of the suspected object information in the initial data by utilizing a pre-established rule set and a field dictionary.
Optionally, the rule set and the field dictionary are established as follows, including:
taking the associated content of the object information winning the bid in the history bidding document as initial information; wherein the associated content of the subject matter information at least comprises a document name of the historical bidding document;
removing a preset symbol from the initial information to obtain effective data;
detecting whether the valid data contain preset high-frequency words aiming at each item type;
if yes, removing text contents behind the segmentation points by taking the detected high-frequency words as the segmentation points to obtain main body contents of the effective data;
removing non-critical data in the main body content of the valid data; wherein the non-critical data is data which has no significant influence on extracting the object information;
and carrying out redundancy and compatibility processing on the main content of the effective data after the non-critical data is removed, and establishing a rule set and a field dictionary.
Optionally, the classifying and labeling the data of the target phrase set based on the named entity recognition mode, and extracting structured target information based on a labeling result includes:
utilizing a cyclic neural network recognition model to label named entities of phrases in the target phrase set;
and removing the unstructured phrases from the marked target phrase set to obtain structured target object information.
Optionally, the performing named entity tagging on the phrases in the target phrase set by using a recurrent neural network recognition model includes:
inputting each phrase in the target phrase set to an input layer of the recurrent neural network recognition model, and converting each phrase into a corresponding word vector;
inputting each word vector into a convolution layer of the recurrent neural network identification model, wherein the convolution layer utilizes a maximum downsampling algorithm to extract a feature vector of each word vector to obtain an optimal local feature vector;
sequentially inputting the optimal local feature vector into a linear connection layer and a nonlinear activation layer in the recurrent neural network model, and extracting a high-level abstract feature vector;
and inputting the high-level abstract feature vectors to an output layer of the recurrent neural network recognition model, and determining the labeling result of the named entity label of the phrase corresponding to each high-level abstract feature vector by calculating the probability of belonging to each preset classification label.
Optionally, after extracting the structured subject matter information based on the annotation result, the method further includes:
integrating the information of the extracted object and the standardized data; wherein the normalized data includes at least: item name, principal unit.
In order to solve the above technical problem, the present invention provides a subject matter information extraction device, including:
the document acquisition module is used for acquiring the current bidding document of the object information to be extracted;
the preprocessing module is used for carrying out data cleaning on the text content of the current bid document to obtain initial data;
the positioning module is used for positioning a key sentence set of suspected object information from the initial data based on a regular matching mode and obtaining a target phrase set corresponding to the key sentence set;
and the object information extraction module is used for carrying out data classification and labeling on the target phrase set based on a named entity identification mode and extracting structured object information based on a labeling result.
Optionally, the preprocessing module is specifically configured to convert a current bid document in an HTML format into text content in a target format; and carrying out data cleaning on the text content and removing useless information to obtain initial data.
Optionally, the positioning module is specifically configured to position a key sentence set of the suspected object information in the initial data by using a pre-established rule set and a field dictionary.
Optionally, the system further comprises a rule set and field dictionary establishing module, configured to use the associated content of the item information in the historical bidding document as initial information; wherein the associated content of the subject matter information at least comprises a document name of the historical bidding document; removing a preset symbol from the initial information to obtain effective data; detecting whether the valid data contain preset high-frequency words aiming at each item type; if yes, removing text contents behind the segmentation points by taking the detected high-frequency words as the segmentation points to obtain main body contents of the effective data; removing non-critical data in the main body content of the valid data; wherein the non-critical data is data which has no significant influence on extracting the object information; and carrying out redundancy and compatibility processing on the main content of the effective data after the non-critical data is removed, and establishing a rule set and a field dictionary.
Optionally, the subject matter information extraction module includes: labeling the submodule and the structuring submodule; wherein,
the labeling submodule is used for carrying out named entity labeling on the phrases in the target phrase set by utilizing a recurrent neural network recognition model;
and the structuring sub-module is used for removing the non-structuring phrases from the marked target phrase set to obtain the structured target object information.
Optionally, the labeling submodule is specifically configured to: inputting each phrase in the target phrase set to an input layer of the recurrent neural network recognition model, and converting each phrase into a corresponding word vector; inputting each word vector into a convolution layer of the recurrent neural network identification model, wherein the convolution layer utilizes a maximum downsampling algorithm to extract a feature vector of each word vector to obtain an optimal local feature vector; sequentially inputting the optimal local feature vector into a linear connection layer and a nonlinear activation layer in the recurrent neural network model, and extracting a high-level abstract feature vector; and inputting the high-level abstract feature vectors to an output layer of the recurrent neural network recognition model, and determining the labeling result of the named entity label of the phrase corresponding to each high-level abstract feature vector by calculating the probability of belonging to each preset classification label.
Optionally, the system further comprises a subject matter information integration module, configured to, after structured subject matter information is extracted based on the labeling result, perform information integration on the extracted subject matter information and the standardized data; wherein the normalized data includes at least: item name, principal unit.
In order to solve the above technical problem, the present invention provides a computer device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the above method when executing the computer program.
To solve the above technical problem, the present invention provides a computer-readable storage medium having a computer program stored thereon, wherein the computer program is configured to implement the above method when executed by a processor.
Compared with the prior art, one or more embodiments in the above scheme can have the following advantages or beneficial effects:
when the scheme provided by the invention is applied to extracting the object information, the current bidding document of the object information to be extracted is obtained, the text content of the current bidding document is subjected to data cleaning to obtain initial data, and the accuracy of extracting the object information is improved through the step of data cleaning; in order to enable the extraction effect not to be widely influenced by the industry field related to the bidding document, a target phrase set corresponding to a key sentence set of the suspected object information is obtained from initial data based on a regular matching mode, and structured object information is extracted from the target phrase set based on a named entity recognition mode.
According to the method, when the target object information in the bidding document is extracted, the target phrase set suspected of having the target object information can be obtained in a regular matching mode, then the phrases in the target phrase set are labeled in a named entity identification mode, and the labeled target phrase set is screened, so that the structured target object information is obtained, the target object information can be automatically extracted, the extracted target object information is the structured target object information, the working efficiency of enterprise workers and the quality of the bidding service can be improved, and the bidding service is more intelligent and electronic. In addition, the comparison with the manually marked object information shows that the method can obtain good effect in the extraction of the object information.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a method for extracting object information according to an embodiment of the present invention;
FIG. 2 is a flowchart of establishing a rule set and a field dictionary according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of a set of target phrases extracted based on the rule set and field dictionary shown in FIG. 2;
fig. 4 is a flowchart illustrating data classification and labeling of the target phrase set and extraction of structured target information based on the labeling result according to the embodiment of the present invention;
fig. 5 is another flowchart of performing data classification tagging on the target phrase set and extracting structured target object information based on tagging results according to the embodiment of the present invention;
FIG. 6 is a block diagram of a recurrent neural network recognition model provided in accordance with an embodiment of the present invention;
fig. 7 is another flowchart of a method for extracting object information according to an embodiment of the present invention;
FIG. 8 is a schematic diagram of object information extraction using the embodiment of the method shown in FIG. 7;
fig. 9 is a structural diagram of a subject matter information extracting apparatus according to an embodiment of the present invention;
fig. 10 is a block diagram of a computer device according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
With the development of natural language processing and data mining technologies, the demand for information extraction from unstructured data keeps increasing. Common text information extraction methods fall mainly into three categories: rule-based, statistics-based and deep-learning-based. Rule-based methods mainly adopt character-string and pattern matching, implemented through regular matching; they are simple to operate but highly dependent on the extraction rules, and are mainly suitable for standardized documents. Statistics-based methods are trained on actual texts; their precision is high but the process is complex. Bidding document texts are usually fairly standardized, so a rule-based method is well suited to extracting information from bidding document text files, and the subject matter information extraction is finally realized on this basis.
In order to improve the working efficiency of enterprise workers and the quality of tender service and promote the development of the management mode of enterprises for tender towards intellectualization and electronization, the invention provides a method and a device for extracting tender object information, computer equipment and a storage medium.
The method for extracting the object information provided by the embodiment of the invention comprises the following steps: acquiring a current bidding document of the subject matter information to be extracted; performing data cleaning on the text content of the current bid document to obtain initial data; positioning a key sentence set of suspected object information from the initial data based on a regular matching mode, and obtaining a target phrase set corresponding to the key sentence set; and carrying out data classification and annotation on the target phrase set based on a named entity identification mode, and extracting structured object information based on annotation results.
When the scheme provided by the invention is applied to extraction of the target object information, the target phrase set suspected of having the target object information can be obtained in a regular matching mode, then the phrases in the target phrase set are labeled in a named entity identification mode, and the labeled target phrase set is screened, so that the structured target object information is obtained, the target object information can be automatically extracted, and the extracted target object information is the structured target object information, so that the working efficiency of enterprise workers and the quality of the bidding service are improved, and the bidding service is more intelligent and electronic.
As shown in fig. 1, a flowchart of a method for extracting object information according to an embodiment of the present invention is provided, where the method for extracting object information may include the following steps:
step S100: and acquiring the current bidding document of the object information to be extracted.
Generally, bidding documents can be divided into online documents and offline documents. For a bidding document in HTML format, when data cleaning is performed, the current HTML-format bidding document can first be converted into text content in a target format; the text content is then cleaned and useless information removed to obtain the initial data.
In practice, all or part of the HTML formatted bid document may be converted into text content, such as the text content of a TXT text document, and the converted text content may also be stored in the text document for subsequent use. In one implementation, the HTML parser of python can be used to implement document format conversion, and the main content of the bidding announcement text is parsed out by defining an HTML text content extraction parser.
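As a concrete illustration, the following is a minimal sketch of such an HTML text-content extraction parser built on Python's standard html.parser module; the tags treated as noise, the helper function names and the output path handling are assumptions for illustration, not details specified by this embodiment.

```python
# Illustrative sketch only: one possible HTML-to-text extraction parser built on
# Python's standard html.parser module. The tag names treated as "noise" are assumed.
from html.parser import HTMLParser


class BidTextExtractor(HTMLParser):
    """Collects visible text from an HTML bidding announcement."""

    SKIP_TAGS = {"script", "style", "head", "title"}  # assumed noise tags

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self._skip_depth == 0:
            self.chunks.append(text)


def html_to_txt(html_path: str, txt_path: str) -> str:
    """Convert an HTML bidding document into plain text and store it in a TXT file."""
    parser = BidTextExtractor()
    with open(html_path, encoding="utf-8") as f:
        parser.feed(f.read())
    text = "\n".join(parser.chunks)
    with open(txt_path, "w", encoding="utf-8") as f:
        f.write(text)  # store the converted text for subsequent use
    return text
```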
Step S200: and performing data cleaning on the text content of the current bid document to obtain initial data.
Generally, the text content of the bidding document may contain some information which is useless for extracting the object information, such as an item number, a chapter number, and the like, and in order to ensure the accuracy of extracting the object information, data cleaning is required to be performed on the text content of the bidding document, and the bidding document after deleting the useless information is taken as initial data.
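A minimal data-cleaning sketch is given below; the particular regular expressions for item numbers, chapter headings and page numbers are illustrative assumptions, since the actual useless-information rules depend on each enterprise's document conventions.

```python
# Minimal data-cleaning sketch. The patterns below are assumed examples of
# "useless information" (item numbers, chapter headings, page numbers).
import re

USELESS_PATTERNS = [
    r"^\s*\d+(\.\d+)*\s*",             # leading item numbers such as "3.2.1"
    r"^第[一二三四五六七八九十]+章.*$",  # chapter headings, e.g. "第三章 ..."
    r"第\s*\d+\s*页",                   # page numbers
]


def clean_text(raw_text: str) -> str:
    cleaned_lines = []
    for line in raw_text.splitlines():
        for pattern in USELESS_PATTERNS:
            line = re.sub(pattern, "", line)
        line = line.strip()
        if line:                        # drop lines that became empty
            cleaned_lines.append(line)
    return "\n".join(cleaned_lines)
```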
Step S300: and positioning a key sentence set of the suspected object information from the initial data based on a regular matching mode, and obtaining a target phrase set corresponding to the key sentence set.
A regular expression, as its name implies, expresses a rule: it summarizes a pattern from character strings and expresses that pattern in the form of a regular expression. A regular expression is therefore a logical formula that operates on character strings; a "regular string" (i.e. a rule describing character strings, a logic for filtering them) is constructed from predefined specific characters and their combinations. The idea behind regular matching is to define the rules of a string in a general descriptive language, and strings that meet the rules are "matched".
The study of regular expressions began with research on the working patterns of the human nervous system, in which two neurophysiologists, Warren McCulloch and Walter Pitts, developed a mathematical way to describe neural networks, describing biological neurons as automatic control elements. In 1956, the mathematician Stephen Kleene introduced the concept of regular expressions (abbreviated RE or regex) in a paper on the representation of neural net events. Expressions describing the algebra of regular sets can be used for lexical analysis through matching of character patterns; they were first used to describe single character strings, and their function is to explain or match one or more character strings that conform to a certain syntactic rule, so that, as a representative language, regular expressions have their own set of expression patterns for character classes that conform to various rules. Thanks to their strong and precise pattern-description capability, regular expressions are widely used in many fields: the qed editor in Unix was the first practical application of regular expressions, after which regular expressions continued to develop in various computer languages and their range of application expanded. At present, regular expression matching is embedded in all kinds of text editing tools, programming languages, lexical analyzers and the like, supporting efficient text searching and replacement.
It should be noted that the target phrase set corresponding to the key sentence set acquired through regular matching contains key data that have a significant influence on extracting the subject matter, such as the project name, the bidding conditions, the project profile and the bidding scope.
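For illustration, the following sketch shows how key sentences suspected to contain subject matter information could be located with regular expressions; the trigger keywords mirror the key fields mentioned above (project name, bidding conditions, project profile, bidding scope), but the exact patterns are assumptions rather than the rule set of this embodiment.

```python
# Sketch of regular-expression key-sentence location. The trigger keywords and
# length limits are illustrative assumptions.
import re

KEY_SENTENCE_PATTERNS = [
    re.compile(r"(项目名称|招标项目)[::].{2,80}"),    # project name
    re.compile(r"(招标范围|采购内容)[::].{2,120}"),   # bidding / procurement scope
    re.compile(r"(项目概况)[::].{2,200}"),            # project profile
    re.compile(r"(招标条件)[::].{2,200}"),            # bidding conditions
]


def locate_key_sentences(initial_data: str) -> list[str]:
    """Return sentences suspected to contain subject-matter information."""
    key_sentences = []
    for sentence in re.split(r"[。!?\n]", initial_data):
        if any(p.search(sentence) for p in KEY_SENTENCE_PATTERNS):
            key_sentences.append(sentence.strip())
    return key_sentences
```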
In one case, as shown in FIG. 2, a rule set and field dictionary may be established as follows:
step S301: taking the associated content of the object information winning the bid in the history bidding document as initial information; wherein the associated content of the subject matter information at least comprises a document name of the historical bidding document.
For a bidding document, the content associated with the subject matter may be from the document name of the bidding document or from the document body of the bidding document, and the document name usually contains the subject matter information, so that the document name of the bidding document may be used as the initial information for quickly and accurately extracting the subject matter information. Further, when the extraction of the subject matter information with the document name of the bidding document as the initial information fails, the extraction of the subject matter may be performed with a part or the whole of the document body of the bidding document as the initial information.
Step S302: and removing preset symbols from the initial information to obtain effective data.
The document contents of a bidding document often contain symbols that are useless for extracting subject matter information, such as "#" in the project name, enumeration commas, spaces and brackets. In view of this, the present invention presets a certain set of symbols (the preset symbols), so that when one of the preset symbols is detected in the historical bidding document or the current bidding document, the corresponding symbol can be deleted, thereby obtaining effective data useful for extracting the subject matter information.
Step S303: detecting whether the valid data contains preset high-frequency words for each item type, if yes, executing step S304, and if not, executing step S305.
According to statistics, a certain number of general high-frequency words exist in bidding documents regardless of industry classification and item type; these high-frequency words are useless for extracting the subject matter information, and within a sentence the text content that follows such a high-frequency word is likewise useless for extracting the subject matter information. The invention therefore presets a certain number of high-frequency words and, when a high-frequency word is detected in a historical bidding document or the current bidding document, deletes the text content that follows it.
Step S304: and removing the text content behind the segmentation point by taking the detected high-frequency words as the segmentation point to obtain the main content of the effective data.
Step S305: removing non-critical data in the main body content of the valid data; wherein the non-critical data is data that has no significant impact on extracting subject matter information.
Step S306: and carrying out redundancy and compatibility processing on the main content of the effective data after the non-critical data is removed, and establishing a rule set and a field dictionary.
Referring to fig. 3, the following description uses a September 2018 bidding announcement for a certain group's procurement of four pipeline equipment units as an example. First, "#", enumeration commas, spaces, brackets and their contents, and the like are removed, yielding the valid data "September 2018 certain group four pipeline equipment units procurement bidding announcement". Second, the content behind the summarized high-frequency word "procurement", which serves as the segmentation point, is cut off, yielding "September 2018 certain group four pipeline equipment units". Third, the enterprise unit name, project date and other content that appear in most corpora, together with everything preceding them, are cut off, yielding "four pipeline equipment units". Fourth, some common word structures are cut, such as redundant words like "framework", "framework agreement" and "year"; these words do not appear in this example, so no cutting is performed here. The word "framework" does appear in the example "Wuhai energy anchor (framework) procurement public bidding project bidding announcement", which requires this fourth cutting step after the first three. That is, the non-key data, namely data having no significant influence on extracting the subject matter information, generally include necessary information (such as the name of the business entity, the project date, and the like) contained in the main content of most bidding documents, together with the content preceding that information; these data have no significant influence on extracting the subject matter information and therefore need to be removed to effectively establish the rule set and field dictionary.
After the non-key data are removed, the remaining content in the main content of the valid data is key data useful for extracting the subject matter information. However, redundant words still exist to some extent in the main content of the valid data: for example, redundant words such as "framework", "framework agreement" and "year" often appear in bidding documents of the goods type, and phrase structures of the form "item (...)" often appear in bidding documents of the service type. Therefore, in order to reduce the processing workload for repeated data, the main content of the valid data needs to be processed for redundancy and compatibility, so as to establish a more effective rule set and field dictionary.
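The following sketch strings steps S301 to S306 together for a single document name; the preset symbols, high-frequency words and redundant words listed in it are illustrative assumptions that would in practice be summarized from the historical bidding corpus.

```python
# Sketch of the rule-set construction steps S301-S306 described above. The symbol
# list, high-frequency words and redundant words are illustrative assumptions.
import re

PRESET_SYMBOLS = r"[#、\s()()]"                        # step S302: symbols to strip
HIGH_FREQ_WORDS = ["采购", "招标公告", "中标候选人"]     # steps S303/S304: cut points
REDUNDANT_WORDS = ["框架协议", "框架", "年度"]           # step S306: redundancy handling


def build_rule_entry(document_name: str) -> str:
    # Step S302: remove preset symbols to obtain valid data.
    valid = re.sub(PRESET_SYMBOLS, "", document_name)
    # Steps S303/S304: cut text behind the first detected high-frequency word.
    for word in HIGH_FREQ_WORDS:
        pos = valid.find(word)
        if pos != -1:
            valid = valid[:pos]
            break
    # Step S305: drop leading non-key data such as project dates before the subject.
    valid = re.sub(r"^\d{4}年\d{1,2}月", "", valid)
    # Step S306: redundancy and compatibility processing.
    for word in REDUNDANT_WORDS:
        valid = valid.replace(word, "")
    return valid.strip()


def build_field_dictionary(history_names: list[str]) -> set[str]:
    return {entry for entry in map(build_rule_entry, history_names) if entry}
```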
Step S400: and carrying out data classification and annotation on the target phrase set based on a named entity identification mode, and extracting structured object information based on annotation results.
Named entity recognition is the structured processing of computer-readable unstructured, semi-structured or structured text. Unstructured text has no uniform rules, typesetting format and the like, and named entity recognition for such text is realized by training and learning on large amounts of text on the basis of grammar; structured text mainly comes from databases, has a uniform format, and its key information can be extracted directly; semi-structured text generally has no fixed format and cannot be processed directly. Research methods for named entity recognition can be roughly summarized into three categories: rule-based (such as methods using character strings and pattern matching), statistics-based (such as recognition based on the occurrence probabilities of different characters), and deep-learning-based (such as deep neural networks, long short-term memory networks, and the like).
In another embodiment of the present invention, as shown in fig. 4, the target phrase set may be labeled in the following manner, and structured target information may be extracted based on the labeling result:
step S401: and carrying out named entity labeling on the phrases in the target phrase set by utilizing a recurrent neural network recognition model.
Step S402: and removing the unstructured phrases from the marked target phrase set to obtain structured target object information.
Referring to fig. 5 and 6, in a preferred implementation, the phrases in the target phrase set may be named entity labeled as follows:
step S4011: and inputting each phrase in the target phrase set into an input layer of the recurrent neural network recognition model, and converting the phrase into a corresponding word vector.
The embodiment of the invention adopts a cyclic neural network recognition model to extract object information, and because the input of the neural network model is a vector, the embodiment of the invention converts each phrase in a target phrase set entering an input layer into a corresponding word vector by using word2 vec. It should be noted that the word2vec function can be used to generate a model for converting phrases in text into word vectors. In one implementation, the word2vec function integrated in python's third party toolkit, gensim, may be used.
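A minimal gensim word2vec sketch of this conversion is shown below; the example corpus, vector_size and other hyper-parameters are illustrative assumptions, not values specified by this embodiment.

```python
# Minimal gensim word2vec sketch, assuming the target phrase sets have already been
# produced by the regular-matching step. Hyper-parameters are assumed values.
from gensim.models import Word2Vec

# Each training sample is one target phrase set (a list of phrases/tokens).
corpus = [
    ["四台", "管道", "设备"],
    ["离子", "交换", "树脂", "采购"],
]

w2v = Word2Vec(sentences=corpus, vector_size=100, window=5, min_count=1, sg=1)

vector = w2v.wv["管道"]   # 100-dimensional word vector for one phrase
print(vector.shape)       # (100,)
```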
Step S4012: and inputting each word vector into a convolution layer of the recurrent neural network identification model, wherein the convolution layer utilizes a maximum downsampling algorithm to extract the feature vector of each word vector to obtain the optimal local feature vector.
The optimal local features in the convolutional layer are captured through a maximum downsampling algorithm, so that the cyclic neural network recognition model is used for realizing target named entity recognition. For the convolutional layer, the dimension of the output feature vector depends on the number of phrases in the input sentence, and the dimension of the feature vector output by the convolutional layer is different because the number of phrases in the target phrase set corresponding to each sentence in the key sentence set is different. In order to avoid the influence on the extraction of the object information caused by the boundary problem of the beginning and the end of the sentence in the down-sampling process, the local features captured by the convolution layer need to be combined in a certain mode, so that the dimensionality of the output feature vector is fixed and has no relation with the dimensionality of the feature vector input to the convolution layer, and the influence caused by the different quantities of phrases in the target phrase set is eliminated.
Specifically, the downsampling may be performed according to the following expression, where the output of the current layer is:

$a_i^{(l)} = \max_{j \in \mathcal{N}_t(i)} a_j^{(l-1)}, \quad i = 1, \dots, n_{hu}^{(l)}$

wherein $a^{(l)}$ is the output vector of the current layer, $a^{(l-1)}$ is the output vector of the upper layer, $t$ represents the size of the downsampling selection area, $\mathcal{N}_t(i)$ denotes the region of size $t$ near the $i$-th neuron, $i$ represents the $i$-th neuron, and $n_{hu}^{(l)}$ is the number of hidden units in the convolutional layer.

As can be seen from the above expression, the max function selects, within the region of size $t$ near the $i$-th neuron, the largest element of the upper-layer feature vector as the output of the $i$-th neuron. It should be noted that $n_{hu}^{(l)}$, the number of hidden units in the convolutional layer, is also the dimension of the local feature vector generated by the sliding window that downsamples the feature vector output by the input layer.
The main factor influencing the labeling result of a phrase-derived entity comes from the phrases within a set window in the vicinity centered on that phrase. For a given phrase to be labeled, the sliding-window method only considers the effect of the phrases within a fixed window size around the word on its labeling result, while other phrases are ignored. This is not friendly to phrases near the beginning or end of a sentence: for such phrases the sliding window goes "out of bounds"; for example, if the first phrase is taken as the center, there is no phrase in the left region. To solve this problem, the invention adopts a sentence-expansion method: special filling words amounting to half the window size are padded at the beginning and the end of the sentence, so that no window centered on a phrase is ever empty and every word has its own window, ensuring that the boundary problem does not affect the experimental results; the word vectors of the filling words are initialized with zero initialization. Through the downsampling layer, the problem of non-uniform feature-vector dimensionality is solved and the influence of differing numbers of phrases in the target phrase set is eliminated, so the context information of the phrase to be labeled does not need to be discarded.
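The padding and max-downsampling ideas described above can be illustrated with the following NumPy sketch; the window size, embedding dimension and zero-initialized padding vector are assumptions for illustration.

```python
# Sketch of sentence padding and max downsampling using NumPy only.
# Window size and embedding dimension are assumed values.
import numpy as np

EMB_DIM = 50
WINDOW = 5                                   # sliding-window size (odd)
PAD_VEC = np.zeros(EMB_DIM)                  # zero-initialized padding "word"


def pad_sentence(word_vectors: np.ndarray) -> np.ndarray:
    """Pad half a window of zero vectors at both ends of the sentence."""
    half = WINDOW // 2
    pad = np.tile(PAD_VEC, (half, 1))
    return np.vstack([pad, word_vectors, pad])


def max_downsample(features: np.ndarray) -> np.ndarray:
    """Max over sentence positions, giving one fixed-length vector regardless of
    how many phrases the target phrase set contains."""
    return features.max(axis=0)


sentence = np.random.randn(8, EMB_DIM)       # 8 phrases, one vector each
padded = pad_sentence(sentence)              # shape (12, 50)
# Applied here directly to the padded embeddings as a stand-in for the conv output.
pooled = max_downsample(padded)              # shape (50,)
print(padded.shape, pooled.shape)
```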
Step S4013: and sequentially inputting the optimal local feature vector into a linear connection layer and a nonlinear activation layer in the recurrent neural network model, and extracting a high-level abstract feature vector.
After the optimal local feature vector is obtained through the maximum downsampling layer, this fixed-dimension vector is input into the hidden layers of the recurrent neural network recognition model for the subsequent label decision. As shown in fig. 5, the hidden layers comprise 3 layers: the first is a linear connection layer, the second is a nonlinear activation layer, and the third is a Softmax layer.
The linear connection layer mainly performs a linear combination of the input optimal local vectors to acquire more complex features. The operation performed at the linear connection layer can be expressed by the following formula:

$a^{(l)} = W^{(l)} a^{(l-1)} + b^{(l)}$

wherein $W^{(l)}$ is the linear connection parameter of the $l$-th linear connection layer, $b^{(l)}$ is the offset (bias) value of the current layer, $a^{(l-1)}$ is the output vector of the previous layer (i.e. the input vector of the current layer), and $a^{(l)}$ is the output vector of the current layer.

In addition, if the number of hidden units of the $l$-th linear connection layer is defined as $n_{hu}^{(l)}$, then the linear connection operation produces a vector of dimension $n_{hu}^{(l)}$, which is used as the input vector of the second, nonlinear activation layer for activation.
The nonlinear activation layer applies a specific nonlinear activation function to the output vector of the linear connection layer. The nonlinear activation function is the key to capturing high-level abstract features; without it, the neural network model degenerates into a linear model. In general, in order to extract high-level features more fully, several linear connection layers and nonlinear activation layers are interleaved and stacked to obtain higher-order feature representations, but the cost of model training grows rapidly as the depth of the neural network increases, so the number of layers needs to be chosen according to experimental results and available computing power. Because the computing resources of the laboratory hardware are limited, fig. 5 uses a combination of a single linear connection layer and a single nonlinear activation layer to extract high-level abstract features; in practical applications, those skilled in the art can configure this according to the actual situation. In addition, in order to accelerate model training, the HardTanh function is used as the activation function of the nonlinear activation layer, mainly because of its simple derivative, and it is defined as:

$\mathrm{HardTanh}(x) = \begin{cases} -1, & x < -1 \\ x, & -1 \le x \le 1 \\ 1, & x > 1 \end{cases}$

where x is the output of the previous layer.

Analyzing the HardTanh function, its derivative is constantly 1 in the range [-1, 1] and 0 elsewhere, so the computation is simpler when the recurrent neural network recognition model is trained with the back-propagation algorithm, which accelerates the training process of the model.
The Softmax layer converts the linear prediction values into class probabilities, and can adopt the following function:

$\sigma_i(z) = \frac{e^{z_i}}{\sum_{j=1}^{k} e^{z_j}}$

wherein $\sigma_i(z)$ is the probability that the previous-layer input $z_i$ belongs to label $i$, exp denotes taking the natural exponential of each $z_i$, and $k$ is the number of neurons.

It should be noted that feeding the outputs of the $k$ neurons of the neural network into Softmax in effect takes the natural exponential of each $z_i$, turning it into a non-negative value, and then divides by the sum of all terms for normalization. Since each $z_i$ is obtained from the initial input phrase transformed layer by layer through the recurrent neural network recognition model, each $z_i$ represents a phrase, and thus each output $\sigma_i(z)$ can be regarded as the probability that the input phrase to be annotated belongs to label $i$, otherwise known as the likelihood.
Step S4014: and inputting the high-level abstract feature vectors to an output layer of the recurrent neural network recognition model, and determining the labeling result of the named entity label of the phrase corresponding to each high-level abstract feature vector by calculating the probability of belonging to each preset classification label.
The final output of the recurrent neural network model labels each phrase of the input sentence, and all the label types used in this labeling scheme are shown in the following table:

B-begin: indicates the start of an entity
I-inside: indicates the interior of an entity
E-end: indicates the end of an entity
S-single: indicates that the phrase itself is a complete entity
O-other: indicates other, non-entity characters
Taking the phrase "procurement of the ion exchange resin used" as an example, the model outputs the following labels for the tokens of the phrase, in order:

O-other, O-other, B-begin, I-inside, I-inside, E-end, S-single, O-other

that is, the tokens of "ion exchange" are labeled B-I-I-E, "resin" is labeled S-single as a single-token entity, and the remaining tokens ("used", "of", "procurement") are labeled O-other.
As can be seen from the above label set and example, different phrases in the target phrase set corresponding to a sentence in the key sentence set are labeled with different labels. Further, for different labeling results, the following processing is adopted when the subject matter information is extracted (see the sketch below): 1) if all phrases in the target phrase set corresponding to a sentence in the key sentence set are labeled as entity phrases, the key sentence is directly taken as the subject matter information; 2) if the phrases at the beginning and/or the end of the sentence are labeled as other, non-entity characters, those non-entity phrases are deleted, and the phrases retained in the key sentence are taken as the target object; 3) if all phrases in the target phrase set corresponding to a sentence in the key sentence set are labeled as other non-entity characters, the key sentence is directly deleted.
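The three processing rules above can be sketched as follows; treating every label other than O-other as an entity label is an assumption made for illustration.

```python
# Sketch of the three post-processing rules for labeling results listed above.
def extract_subject(phrases, labels):
    entity = [lab != "O" for lab in labels]
    if all(entity):                       # rule 1: whole sentence is subject matter
        return "".join(phrases)
    if not any(entity):                   # rule 3: no subject matter, drop sentence
        return None
    # rule 2: strip non-entity phrases from the beginning and end, keep the rest
    start, end = 0, len(phrases)
    while not entity[start]:
        start += 1
    while not entity[end - 1]:
        end -= 1
    return "".join(phrases[start:end])


print(extract_subject(["用", "的", "离", "子", "交", "换", "树脂", "采购"],
                      ["O", "O", "B", "I", "I", "E", "S", "O"]))
# -> "离子交换树脂"
```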
When the scheme provided by the invention is applied to extraction of the target object information, the target phrase set suspected of having the target object information can be obtained in a regular matching mode, then the phrases in the target phrase set are labeled in a named entity identification mode, and the labeled target phrase set is screened, so that the structured target object information is obtained, the target object information can be automatically extracted, and the extracted target object information is the structured target object information, so that the working efficiency of enterprise workers and the quality of the bidding service are improved, and the bidding service is more intelligent and electronic. In addition, the comparison with the manually marked object information shows that the method can obtain good effect in the extraction of the object information.
Further, when the named entities of phrases containing the target are labeled, the labeling effects of four networks are compared: a deep neural network (DNN), a convolutional neural network (CNN), a feedback neural network (HNN) and a recurrent neural network (RNN), as shown in the following table:

            DNN       CNN       HNN       RNN
Recall      91.22%    95.43%    95.22%    95.65%
Accuracy    91.59%    95.43%    95.49%    95.81%
F1 score    91.32%    95.41%    95.30%    95.42%
It can be seen that, among the DNN-, CNN-, HNN- and RNN-based networks, the RNN performs best on named entity classification, with an accuracy about 0.3% higher than that of HNN, the best of the other three networks. The invention also carried out ten-fold cross validation on data from different industries and different project types, obtaining an accuracy of about 95.8%.
As shown in fig. 7, a flowchart of a method for extracting object information according to an embodiment of the present invention is provided, where the method for extracting object information may include the following steps:
step S110: and acquiring the current bidding document of the object information to be extracted.
Step S210: and performing data cleaning on the text content of the current bid document to obtain initial data.
Step S310: and positioning a key sentence set of the suspected object information from the initial data based on a regular matching mode, and obtaining a target phrase set corresponding to the key sentence set.
Step S410: and carrying out data classification and annotation on the target phrase set based on a named entity identification mode, and extracting structured object information based on annotation results.
It should be noted that steps S110 to S410 in the method embodiment shown in fig. 7 are similar to steps S100 to S400 in the method embodiment shown in fig. 1, and reference may be made to the method embodiment shown in fig. 1 for relevant points, which are not described herein again.
Step S500: integrating the information of the extracted object and the standardized data; wherein the normalized data includes at least: item name, principal unit.
Referring to fig. 8, the project name plus the key fields in the current bidding document are used as the initial data for extracting the subject matter information, the target phrase set corresponding to the initial data is obtained based on the pre-established rule set and field dictionary, and the subject matter is then extracted based on the recurrent neural network recognition model. Considering the labeling results discussed above, when all phrases in the target phrase set corresponding to a sentence in the key sentence set are labeled as other non-entity characters, the key sentence is directly deleted; that is, no subject matter information is extracted, which indicates that the subject matter information extraction has failed. For this situation, in order to keep the bidding business automated and electronic, the method can also obtain the subject matter information by integrating the extracted information with the standardized data: specifically, the current bidding document is matched against the standardized data on key fields such as the project name or the entrusting unit, and the corresponding subject matter information is extracted after the matching succeeds.
In actual bidding business, besides the bidding documents there is often standardized data such as a bidding project information table, and when the scheme provided by the present invention is applied to extract the subject matter information, information integration with this standardized data is required. Taking the bidding project information table as an example, the current bidding document and the table are associated by approximate matching: for example, the project name in the current bidding document is matched with the project name in the bidding project information table, the subject matter information can then be inserted into the table, the project name and the entrusting unit are added through a join operation, and the formatted subject matter information is obtained from the subject matter field.
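A sketch of this integration step is given below, using approximate matching on the project name; the table fields, similarity measure and threshold are illustrative assumptions rather than details fixed by this embodiment.

```python
# Sketch of integrating extracted subject-matter information with a standardized
# bidding project information table via approximate matching on the project name.
# The record fields and the similarity threshold are assumed values.
import difflib

project_table = [
    {"item_name": "某集团四台管道设备采购", "principal_unit": "某集团物资公司"},
    {"item_name": "乌海能源锚杆采购公开招标项目", "principal_unit": "乌海能源"},
]


def integrate(extracted_subject: str, doc_item_name: str, threshold: float = 0.6):
    """Attach the project name and principal unit from the standardized table."""
    best, best_score = None, 0.0
    for record in project_table:
        score = difflib.SequenceMatcher(None, doc_item_name, record["item_name"]).ratio()
        if score > best_score:
            best, best_score = record, score
    if best is None or best_score < threshold:
        return None                      # no sufficiently similar project found
    return {
        "item_name": best["item_name"],
        "principal_unit": best["principal_unit"],
        "subject_matter": extracted_subject,
    }


print(integrate("四台管道设备", "某集团四台管道设备采购招标公告"))
```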
The embodiment of the method shown in fig. 7 has all the advantages of the embodiment of the method shown in fig. 1, and in addition, when the extraction of the subject matter information based on the bidding document fails, the final subject matter information can be obtained by integrating the extracted subject matter information with the standardized data, so that the accuracy of the extraction of the subject matter and the success rate of the extraction of the subject matter information are improved.
The effect of the method for extracting the object information provided by the embodiment of the invention is verified by combining experiments.
(1) Data set used for the experiments:
A total of 33586 bidding documents were collected from the national energy bidding network. The bidding projects corresponding to part of the bidding documents have more detailed information such as classification results, while the rest do not. There are 19980 project records that contain information such as the bidding project name and entrusting unit and whose classification results are known; their corresponding bidding documents are all contained in the complete bidding document set, and information integration is performed at a later stage in order to keep the quantity of data information consistent.
(2) The experimental setup was as follows:
the programming language is Python3, and different interpreters are used in the experiment because different third party libraries need to be used, and Pycharm community version is used as the integration tool.
(3) The processing structure of the built regular expression is shown in the following table:
[Table omitted: processing structure of the constructed regular expressions]
(4) the experimental results are as follows:
in order to evaluate the method effect, 1% (namely 200) of bidding documents are randomly extracted from 19980 pieces of bidding document texts, and the effect evaluation is carried out on the bidding documents, so that the accuracy rate and the recall rate of the obtained structured fields are higher than 90%, and the goods type bidding documents are relatively more standard and have higher accuracy rate and recall rate. The adoption of the rule-based method is more suitable for extracting the information of the bidding document text file, and finally realizes the extraction of the target object.
Next, a subject information extraction device provided in an embodiment of the present invention will be described.
As shown in fig. 9, a structural diagram of an object information extracting apparatus provided in an embodiment of the present invention includes the following modules: a document acquisition module 610, a preprocessing module 620, a positioning module 630, and a subject matter information extraction module 640.
The document acquiring module 610 is configured to acquire a current bid document of the subject matter information to be extracted;
the preprocessing module 620 is configured to perform data cleaning on the text content of the current bid document to obtain initial data;
a positioning module 630, configured to position a key sentence set of suspected object information from the initial data based on a regular matching manner, and obtain a target phrase set corresponding to the key sentence set;
and the target object information extraction module 640 is configured to perform data classification and labeling on the target phrase set based on a named entity identification manner, and extract structured target object information based on a labeling result.
When the scheme provided by the invention is applied to extraction of the target object information, the target phrase set suspected of having the target object information can be obtained in a regular matching mode, then the phrases in the target phrase set are labeled in a named entity identification mode, and the labeled target phrase set is screened, so that the structured target object information is obtained, the target object information can be automatically extracted, and the extracted target object information is the structured target object information, so that the working efficiency of enterprise workers and the quality of the bidding service are improved, and the bidding service is more intelligent and electronic.
In one case, the preprocessing module 620 is specifically configured to convert the current bidding document in HTML format into text content in a target format, and to perform data cleaning on the text content and remove useless information to obtain the initial data.
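A minimal sketch of such preprocessing, assuming the Python standard library is used for HTML parsing (the cleaning rules shown are illustrative assumptions, not the exact rules of the embodiment):

    import re
    from html.parser import HTMLParser

    class TextExtractor(HTMLParser):
        """Collect the text nodes of an HTML bidding document."""

        def __init__(self):
            super().__init__()
            self.chunks = []

        def handle_data(self, data):
            self.chunks.append(data)

    def clean_bid_document(html_content):
        # Convert the HTML document into plain text content.
        parser = TextExtractor()
        parser.feed(html_content)
        text = "\n".join(parser.chunks)
        # Data cleaning: collapse whitespace and drop empty lines (useless information).
        text = re.sub(r"[ \t]+", " ", text)
        lines = [line.strip() for line in text.splitlines() if line.strip()]
        return "\n".join(lines)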
In another case, the positioning module 630 is specifically configured to locate, in the initial data, the set of key sentences suspected of containing the subject matter information by using a pre-established rule set and field dictionary.
In one embodiment of the invention, the apparatus further comprises a rule set and field dictionary establishing module, configured to: take the associated content of the subject matter information in historical bidding documents as initial information, wherein the associated content of the subject matter information at least comprises the document names of the historical bidding documents; remove preset symbols from the initial information to obtain valid data; detect, for each item type, whether the valid data contains preset high-frequency words; if so, take the detected high-frequency word as a segmentation point and remove the text content after the segmentation point to obtain the main body content of the valid data; remove non-critical data from the main body content of the valid data, the non-critical data being data that has no significant influence on extracting the subject matter information; and perform redundancy and compatibility processing on the main body content of the valid data after the non-critical data is removed, so as to establish the rule set and field dictionary.
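A minimal sketch of the first steps of this rule set construction, with hypothetical preset symbols and high-frequency words (the real lists are mined from the historical bidding documents):

    import re

    # Hypothetical preset symbols and per-item-type high-frequency words.
    PRESET_SYMBOLS = r"[()()\[\]【】《》\"']"
    HIGH_FREQ_WORDS = {
        "goods": ["procurement", "purchase"],
        "service": ["service"],
        "engineering": ["construction"],
    }

    def main_body_content(initial_information, item_type):
        # Step 1: remove the preset symbols to obtain the valid data.
        valid_data = re.sub(PRESET_SYMBOLS, "", initial_information)
        # Step 2: if a preset high-frequency word for this item type is detected,
        # use it as a segmentation point and remove the text content after it.
        for word in HIGH_FREQ_WORDS.get(item_type, []):
            position = valid_data.find(word)
            if position != -1:
                valid_data = valid_data[:position]
                break
        return valid_data.strip()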
In one implementation, the subject matter information extraction module 640 includes a labeling submodule and a structuring submodule. The labeling submodule is configured to perform named entity labeling on the phrases in the target phrase set by using a recurrent neural network recognition model; the structuring submodule is configured to remove unstructured phrases from the labeled target phrase set to obtain the structured subject matter information.
In one case, the labeling submodule is specifically configured to: input each phrase in the target phrase set into the input layer of the recurrent neural network recognition model and convert each phrase into a corresponding word vector; input each word vector into the convolution layer of the recurrent neural network recognition model, where the convolution layer extracts the feature vector of each word vector by using a maximum down-sampling algorithm to obtain an optimal local feature vector; input the optimal local feature vector sequentially into a linear connection layer and a non-linear activation layer of the recurrent neural network recognition model to extract a high-level abstract feature vector; and input the high-level abstract feature vectors into the output layer of the recurrent neural network recognition model, and determine the labeling result of the named entity label of the phrase corresponding to each high-level abstract feature vector by calculating the probability of the phrase belonging to each preset classification label.
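The following is a minimal sketch of a network with the layer structure described above, written with PyTorch as an assumed framework; the layer sizes, label count, and softmax output are hypothetical choices rather than values given by the embodiment:

    import torch
    import torch.nn as nn

    class PhraseLabelingModel(nn.Module):
        """Input layer -> convolution with max down-sampling -> linear + non-linear layers -> output layer."""

        def __init__(self, vocab_size=20000, embed_dim=128, num_labels=8):
            super().__init__()
            self.embedding = nn.Embedding(vocab_size, embed_dim)             # input layer: phrase tokens -> word vectors
            self.conv = nn.Conv1d(embed_dim, 256, kernel_size=3, padding=1)  # convolution layer
            self.pool = nn.AdaptiveMaxPool1d(1)                              # maximum down-sampling (max pooling)
            self.linear = nn.Linear(256, 128)                                # linear connection layer
            self.activation = nn.ReLU()                                      # non-linear activation layer
            self.output = nn.Linear(128, num_labels)                         # output layer over preset classification labels

        def forward(self, token_ids):                                        # token_ids: (batch, seq_len)
            word_vectors = self.embedding(token_ids).transpose(1, 2)         # (batch, embed_dim, seq_len)
            local_features = self.pool(self.conv(word_vectors)).squeeze(-1)  # optimal local feature vector
            abstract_features = self.activation(self.linear(local_features)) # high-level abstract feature vector
            return torch.softmax(self.output(abstract_features), dim=-1)     # probability of each preset label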
In another case, the apparatus further comprises a subject matter information integration module, configured to perform information integration on the extracted subject matter information and standardized data after the structured subject matter information is extracted based on the labeling result, wherein the standardized data includes at least the item name and the principal unit. When extraction of the subject matter information from the bidding document fails, the final subject matter information can still be obtained by integrating the extracted information with the standardized data, which improves both the accuracy of subject matter extraction and the success rate of extracting the subject matter information.
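A minimal sketch of such integration (the field names are hypothetical; the standardized record is assumed to carry at least the item name and the principal unit):

    def integrate_subject_matter(extracted, standardized):
        """Fall back to the standardized data for any field that was not extracted."""
        final_information = dict(standardized)  # e.g. {"item_name": ..., "principal_unit": ...}
        for field, value in extracted.items():
            if value:  # keep extracted values where extraction succeeded for the field
                final_information[field] = value
        return final_information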
To solve the above technical problem, the present invention provides a computer device. As shown in fig. 10, the computer device includes a memory 710, a processor 720, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the method described above.
The computer device may be a desktop computer, a notebook computer, a palmtop computer, a cloud server, or another computing device. The computer device may include, but is not limited to, a processor 720 and a memory 710. Those skilled in the art will appreciate that fig. 10 is merely an example of a computer device and is not intended to be limiting; the device may include more or fewer components than those shown, may combine certain components, or may use different components. For example, the computer device may also include input/output devices, network access devices, buses, and the like.
The Processor 720 may be a Central Processing Unit (CPU), other general purpose Processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 710 may be an internal storage unit of the computer device, such as a hard disk or an internal memory of the computer device. The memory 710 may also be an external storage device of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash memory Card (Flash Card) provided on the computer device. Further, the memory 710 may include both an internal storage unit and an external storage device of the computer device. The memory 710 is used to store the computer program and other programs and data required by the computer device, and may also be used to temporarily store data that has been output or is to be output.
The embodiment of the present application further provides a computer-readable storage medium, which may be a computer-readable storage medium contained in the memory in the foregoing embodiment; or it may be a computer-readable storage medium that exists separately and is not incorporated into a computer device. The computer-readable storage medium stores one or more computer programs which, when executed by a processor, implement the methods described above.
The integrated modules/units, if implemented in the form of software functional units and sold or used as separate products, may be stored in a computer-readable storage medium. Based on such understanding, all or part of the flow of the methods of the embodiments described above can be realized by a computer program, which can be stored in a computer-readable storage medium and, when executed by a processor, implements the steps of the method embodiments described above. The computer program comprises computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like. It should be noted that the content contained in the computer-readable medium may be appropriately increased or decreased as required by legislation and patent practice in a jurisdiction; for example, in some jurisdictions, computer-readable media do not include electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
For system or apparatus embodiments, since they are substantially similar to method embodiments, they are described in relative simplicity, and reference may be made to some descriptions of method embodiments for related points.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-mentioned division of the functional units and modules is illustrated, and in practical applications, the above-mentioned function distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules to perform all or part of the above-mentioned functions. Each functional unit and module in the embodiments may be integrated in one processing unit, or each unit may exist alone physically, or two or more units are integrated in one unit, and the integrated unit may be implemented in a form of hardware, or in a form of software functional unit. In addition, specific names of the functional units and modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application. The specific working processes of the units and modules in the system may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
It is to be understood that the terminology used in the description of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the application. As used in the specification of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise.
It should also be understood that the term "and/or" as used in this specification and the appended claims refers to and includes any and all possible combinations of one or more of the associated listed items.
As used in this specification and the appended claims, the term "if" may be interpreted contextually as "when", "upon" or "in response to a determination" or "in response to a detection". Similarly, the phrase "if it is determined" or "if a described condition or event is detected" may be interpreted, depending on the context, to mean "upon determining" or "in response to determining" or "upon detecting a described condition or event" or "in response to detecting a described condition or event".
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (10)

1. A method for extracting subject matter information, comprising:
acquiring a current bidding document of the subject matter information to be extracted;
performing data cleaning on the text content of the current bid document to obtain initial data;
positioning a key sentence set of suspected object information from the initial data based on a regular matching mode, and obtaining a target phrase set corresponding to the key sentence set;
and carrying out data classification and annotation on the target phrase set based on a named entity identification mode, and extracting structured object information based on annotation results.
2. The method for extracting the subject matter information according to claim 1, wherein the performing data cleaning on the text content of the current bid document comprises:
converting the current bid document in the HTML format into text content in a target format;
and carrying out data cleaning on the text content and removing useless information to obtain initial data.
3. The method according to claim 1, wherein the locating a key sentence set suspected of having the subject matter information from the initial data based on the regular matching comprises:
and positioning the key sentence set of the suspected object information in the initial data by utilizing a pre-established rule set and a field dictionary.
4. The subject matter information extraction method according to claim 3, wherein the rule set and field dictionary are established in the following manner:
taking the associated content of the subject matter information in the historical bidding documents as initial information; wherein the associated content of the subject matter information at least comprises a document name of the historical bidding document;
removing a preset symbol from the initial information to obtain effective data;
detecting whether the valid data contain preset high-frequency words aiming at each item type;
if yes, removing text contents behind the segmentation points by taking the detected high-frequency words as the segmentation points to obtain main body contents of the effective data;
removing non-critical data in the main body content of the valid data;
and carrying out redundancy and compatibility processing on the main content of the effective data after the non-critical data is removed, and establishing a rule set and a field dictionary.
5. The method for extracting the object information according to claim 1, wherein the classifying and labeling the target phrase set based on the named entity recognition mode and extracting the structured object information based on the labeling result comprise:
utilizing a cyclic neural network recognition model to label named entities of phrases in the target phrase set;
and removing the unstructured phrases from the marked target phrase set to obtain structured target object information.
6. The method of claim 5, wherein the labeling named entities of the phrases in the target phrase set using a recurrent neural network recognition model comprises:
inputting each phrase in the target phrase set to an input layer of the recurrent neural network recognition model, and converting each phrase into a corresponding word vector;
inputting each word vector into a convolution layer of the recurrent neural network recognition model, wherein the convolution layer extracts a feature vector of each word vector by using a maximum down-sampling algorithm to obtain an optimal local feature vector;
sequentially inputting the optimal local feature vector into a linear connection layer and a nonlinear activation layer in the recurrent neural network model, and extracting a high-level abstract feature vector;
and inputting the high-level abstract feature vectors to an output layer of the recurrent neural network recognition model, and determining the labeling result of the named entity label of the phrase corresponding to each high-level abstract feature vector by calculating the probability of belonging to each preset classification label.
7. The method of claim 1, wherein after extracting the structured target object information based on the labeling result, the method further comprises:
integrating the information of the extracted object and the standardized data; wherein the normalized data includes at least: item name, principal unit.
8. An object information extraction device characterized by comprising:
the document acquisition module is used for acquiring the current bidding document of the object information to be extracted;
the preprocessing module is used for carrying out data cleaning on the text content of the current bid document to obtain initial data;
the positioning module is used for positioning a key sentence set of suspected object information from the initial data based on a regular matching mode and obtaining a target phrase set corresponding to the key sentence set;
and the object information extraction module is used for carrying out data classification and labeling on the target phrase set based on a named entity identification mode and extracting structured object information based on a labeling result.
9. A computer device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the processor implements the method according to any of claims 1 to 7 when executing the computer program.
10. A computer-readable storage medium, on which a computer program is stored, which program, when being executed by a processor, carries out the method according to any one of claims 1 to 7.
CN202110614055.1A 2021-06-02 2021-06-02 Target information extraction method, device, computer equipment and storage medium Active CN113515587B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110614055.1A CN113515587B (en) 2021-06-02 2021-06-02 Target information extraction method, device, computer equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110614055.1A CN113515587B (en) 2021-06-02 2021-06-02 Target information extraction method, device, computer equipment and storage medium

Publications (2)

Publication Number Publication Date
CN113515587A true CN113515587A (en) 2021-10-19
CN113515587B CN113515587B (en) 2024-06-21

Family

ID=78065438

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110614055.1A Active CN113515587B (en) 2021-06-02 2021-06-02 Target information extraction method, device, computer equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113515587B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114154480A (en) * 2021-12-13 2022-03-08 竹间智能科技(上海)有限公司 Information extraction method, device, equipment and storage medium
CN116881582A (en) * 2023-07-18 2023-10-13 北京粉笔蓝天科技有限公司 Entry time extraction method based on pattern matching and part-of-speech tagging

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776538A (en) * 2016-11-23 2017-05-31 国网福建省电力有限公司 The information extracting method of enterprise's noncanonical format document
CN110781299A (en) * 2019-09-18 2020-02-11 平安科技(深圳)有限公司 Asset information identification method and device, computer equipment and storage medium
CN111274804A (en) * 2020-01-17 2020-06-12 珠海市新德汇信息技术有限公司 Case information extraction method based on named entity recognition
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN112214987A (en) * 2020-09-08 2021-01-12 深圳价值在线信息科技股份有限公司 Information extraction method, extraction device, terminal equipment and readable storage medium
CN112364660A (en) * 2020-10-27 2021-02-12 中国平安人寿保险股份有限公司 Corpus text processing method and device, computer equipment and storage medium
CN112860919A (en) * 2021-02-20 2021-05-28 平安科技(深圳)有限公司 Data labeling method, device and equipment based on generative model and storage medium

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776538A (en) * 2016-11-23 2017-05-31 国网福建省电力有限公司 The information extracting method of enterprise's noncanonical format document
CN110781299A (en) * 2019-09-18 2020-02-11 平安科技(深圳)有限公司 Asset information identification method and device, computer equipment and storage medium
CN111274804A (en) * 2020-01-17 2020-06-12 珠海市新德汇信息技术有限公司 Case information extraction method based on named entity recognition
CN111444721A (en) * 2020-05-27 2020-07-24 南京大学 Chinese text key information extraction method based on pre-training language model
CN112214987A (en) * 2020-09-08 2021-01-12 深圳价值在线信息科技股份有限公司 Information extraction method, extraction device, terminal equipment and readable storage medium
CN112364660A (en) * 2020-10-27 2021-02-12 中国平安人寿保险股份有限公司 Corpus text processing method and device, computer equipment and storage medium
CN112860919A (en) * 2021-02-20 2021-05-28 平安科技(深圳)有限公司 Data labeling method, device and equipment based on generative model and storage medium

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114154480A (en) * 2021-12-13 2022-03-08 竹间智能科技(上海)有限公司 Information extraction method, device, equipment and storage medium
CN116881582A (en) * 2023-07-18 2023-10-13 北京粉笔蓝天科技有限公司 Entry time extraction method based on pattern matching and part-of-speech tagging
CN116881582B (en) * 2023-07-18 2024-02-13 北京粉笔蓝天科技有限公司 Entry time extraction method based on pattern matching and part-of-speech tagging

Also Published As

Publication number Publication date
CN113515587B (en) 2024-06-21

Similar Documents

Publication Publication Date Title
US11734328B2 (en) Artificial intelligence based corpus enrichment for knowledge population and query response
US20170300565A1 (en) System and method for entity extraction from semi-structured text documents
US9645988B1 (en) System and method for identifying passages in electronic documents
CN113254574A (en) Method, device and system for auxiliary generation of customs official documents
CN110609998A (en) Data extraction method of electronic document information, electronic equipment and storage medium
CN111046660B (en) Method and device for identifying text professional terms
CN112364628B (en) New word recognition method and device, electronic equipment and storage medium
CN113515587B (en) Target information extraction method, device, computer equipment and storage medium
CN108205524B (en) Text data processing method and device
US20230028664A1 (en) System and method for automatically tagging documents
CN111651994B (en) Information extraction method and device, electronic equipment and storage medium
CN111782793A (en) Intelligent customer service processing method, system and equipment
CN111178080A (en) Named entity identification method and system based on structured information
US20240054281A1 (en) Document processing
Tang et al. Enriching feature engineering for short text samples by language time series analysis
CN111126064A (en) Money identification method and device, computer equipment and readable storage medium
CN113254583B (en) Document marking method, device and medium based on semantic vector
CN115640378A (en) Work order retrieval method, server, medium and product
BOUGHACI et al. An improved N-grams based Model for Authorship Attribution
CN113934849A (en) Text clustering method and device, electronic equipment and storage medium
CN114676699A (en) Entity emotion analysis method and device, computer equipment and storage medium
CN114611489A (en) Text logic condition extraction AI model construction method, extraction method and system
CN114117047A (en) Method and system for classifying illegal voice based on C4.5 algorithm
CN114154480A (en) Information extraction method, device, equipment and storage medium
CN109255122B (en) Method for classifying and marking thesis citation relation

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant