CN115310425A

CN115310425A - Policy text analysis method based on policy text classification and key information identification

Info

Publication number: CN115310425A
Application number: CN202211229194.3A
Authority: CN
Inventors: 杨象笋; 李响; 胡奇韬; 王江华
Original assignee: Tiandao Jinke Co ltd; Zhejiang Zhelixin Credit Reporting Co ltd
Current assignee: Tiandao Jinke Co ltd; Zhejiang Zhelixin Credit Reporting Co ltd
Priority date: 2022-10-08
Filing date: 2022-10-08
Publication date: 2022-11-08
Anticipated expiration: 2042-10-08
Also published as: CN115310425B

Abstract

The invention discloses a policy text analysis method based on policy text classification and key information identification, and belongs to the technical field of natural language processing. The policy text classifier provided by the invention is used for classifying the text through the original paragraph

The prompt language of the classification task is added, the prompt language comprises a mask position which needs to predict and fill in a label, the paragraph classification problem is converted into a classification prediction problem of type completion and filling in a blank, the process of paragraph classification prediction is simplified, and a complete policy file element system can be constructed based on the constructed complete policy file element system to more accurately select the contentAnd (4) forming an angle of a file structure and analyzing a policy file text, and mining deeper information. The policy information recognizer provided by the invention also simplifies the recognition difficulty of the text entity by predicting the vacant content labels under the constructed policy text element system, and has better performance when the training data is smaller in scale.

Description

Policy text analysis method based on policy text classification and key information identification

Technical Field

The invention relates to the technical field of natural language processing, in particular to a policy text analysis method based on policy text classification and key information identification.

Background

Generally, the text structure division of the policy document has standard followability, and even has a uniform standard in terms of words. The content and the structure of the policy document are automatically identified and analyzed, and the method is particularly important for improving the analysis efficiency of the policy document. In recent years, natural language processing technology has been developed rapidly, and is mainly applied to machine translation, public opinion monitoring, automatic summarization, viewpoint extraction, text classification, question answering, text semantic comparison, voice recognition, chinese OCR and other aspects. Thus, for policy documents having structured textual content, natural language processing techniques are no longer an effective means of analyzing the textual content of the policy document.

At present, policy document text content identification methods with high classification and identification precision are few, and part of students train text classification identification models in an unsupervised learning mode to identify policy text content, but the performance of the trained text classification identification models is not stable enough due to lack of classification and identification standards for the policy text content. Still, some scholars train the text classification recognition model by adopting a supervised learning method, but there is no uniform standard to label the policy text content, so that the trained text classification recognition model is not stable enough, and a large amount of training samples for supervised learning are usually acquired at a high cost.

Disclosure of Invention

The invention provides a policy text analysis method based on policy text classification and key information identification, aiming at realizing accurate classification of text paragraphs of policy documents and accurate identification of key information.

In order to achieve the purpose, the invention adopts the following technical scheme:

a policy text analysis method based on policy text classification and key information identification is provided, and the method comprises the following steps:

s1, input paragraphs are classified by a policy text classifier based on pre-training

Predicting and outputting the paragraph

Type of (d);

s2, finishing each paragraph of the classification based on the pre-trained policy information recognizer

Key information is further extracted at the entity level.

Preferably, in step S1, the policy text classifier predicts the paragraphs

The method of type (d) specifically comprises the steps of:

s11, regarding the paragraph in the given policy document

Using template functions

Will be provided with

Conversion to language models

Is inputted

，

In the original paragraph

A prompt language of a classification task is added, wherein the prompt language comprises a mask position needing to predict and fill in a label;

s12, the language model

Predicting out tags filling in the mask locations

；

S13, label converter

Attaching the label to the container

Set of tagged words mapped as a pre-constructed policy document element system

Corresponding label word in

The paragraphs obtained as predictions

Type (c) of the cell.

Preferably, the language model is trained

The method comprises the following steps:

a1, for each as training sample

Calculating the set of tagged words

Each tag word in (1)

Probability scores for filling in the mask locations

；

A2, calculating probability distribution through softmax function

；

A3 is according to

And

calculating model prediction loss by using the constructed loss function;

a4, judging whether a model iterative training termination condition is reached,

if yes, terminating iteration and outputting the language model

；

If not, adjusting the model parameters and returning to the step A1 to continue the iterative training.

Preferably, the language model

For forming a plurality of language sub-models

The fusion language model formed by fusion, and the method for training the fusion language model comprises the following steps:

b1, defining a template function setClosing box

The set of template functions

Comprising a plurality of different said template functions

；

B2, for each as training sample

By corresponding said language sub-model

Calculating the set of tagged words

Each tag word in (1)

Probability score of filling in the mask position

；

B3, associating each template function

Is/are as follows

Carrying out fusion to obtain

；

B4, calculating probability distribution through softmax function

；

B5 according to

And

calculating model prediction loss by using the constructed loss function;

b6, judging whether a model iterative training termination condition is reached,

if yes, terminating iteration and outputting the fusion language model;

if not, the model parameters are adjusted and then the step B2 is returned to continue the iterative training.

Preferably, the language model

Or the language submodel

Is a BERT language model.

Preferably, the system of policy document elements includes sentence-level elements and entity-level elements, the sentence-level elements including any one or more of 27 sub-categories of policy objective, application review, policy tool-supply type, policy tool-environment type, policy tool-demand type, fund management, regulatory evaluation, admission condition 8,

wherein, the policy tool-supply type category includes any one or more of the 4 sub-categories of talent culture, fund support, technical support and public service;

the policy tool-environment type comprises any one or more of 6 sub-categories of regulation and control, target planning, tax and revenue discount, financial support, organization and construction and policy promotion;

the policy tool-demand type comprises any one or more of the 3 sub-categories of government procurement, company cooperation and overseas cooperation;

the supervision evaluation category comprises 2 sub-categories of supervision management and/or assessment evaluation;

the capital management category includes 2 sub-categories of sources of capital and/or management principles.

Preferably, in step S2, each paragraph is extracted based on the policy information identifier

The method of the key information in (1) comprises the steps of:

s21, defining a sentence template set

Label word set for entity identification in policy document element system

And a language model

Label sets for entity identification

The sentence template set

Sentence template containing entity type and non-entity type

Sentence template

Including two words to be filled in the vacancy, wherein the first vacancy is the paragraph input from the input

The second vacancy is a category label for classifying the intercepted text segment, and the label set is

Middle label

With said set of tagged words

The label word in

Having a mapping relationship;

s22, from the paragraph

Each of the text fragments and each of the labels intercepted in

Corresponding label words

Filling in the sentence template set respectively

Each of the sentence templates

In the first and second vacancies and then using the language model

Calculating probability scores of the filled-in sentences

The calculation method is expressed by the following formula (1):

in the formula (1), the first and second groups,

text passage for representing use candidate

And a label

Label word with mapping relation

Filling sentence template

The sentence obtained;

representing the sentence

The sequence length of (a);

representing the sentence

Of the word sequence

An item;

representing the sentence

1 st to 1 st item in the word sequence of

An item;

representing input to the language model

Text sequence of

；

Representing text at a given input

And 1 st to 1 st item in the word sequence of the sentence template

Item(s)

In the case where the model predicts that the c-th term is

The probability of (d);

s23, the sentence with the highest score

The filled text segment is used as key information entity and corresponding type label

Mapping to the tagged word

Then as corresponding entity type, jointly form the paragraph

The key information of (a).

Preferably, the language model in step S2

Is a BART model.

Preferably, in the step A1,

is expressed by the following formula (2):

in step A2

Calculating by softmax function (3):

in the formulas (2) and (3),

representing a set of tags

The Chinese and the label words

A label having a mapping relationship;

a set of labels representing a text classification task;

the loss function constructed in step A3 is expressed by the following formula (4):

in the formula (4), the first and second groups,

representing a fine tuning coefficient;

representing the distribution of model predictions

The difference from the true distribution;

score representing model prediction

The difference from the true score.

Preferably, in step B2

Is expressed by the following formula (5):

in step B3

Obtained by fusing the following formula (6):

in the formula (6), the first and second groups,

representing the set of template functions

The template function of

The number of (2);

representing said template function

In the calculation of

The weight occupied by the hour;

in step B4

Calculating by means of the softmax function (7):

in the formulas (5), (6) and (7),

representing a set of tags

The Chinese and the label words

A label having a mapping relationship;

a set of labels representing a text classification task;

the loss function constructed in step B5 is expressed by the following formula (8):

in the formula (8), the first and second groups,

representing a fine tuning coefficient;

representing the distribution of model predictions

The difference from the true distribution;

score representing model prediction

The difference from the true score.

Preferably, the fine adjustment coefficients in formula (4) and formula (8)

。

The invention has the following beneficial effects:

1. a set of complete policy document element system is constructed, and different elements in the policy document are clearly divided. Subsequently, based on the system, the classification of each paragraph in the policy document and the key information extraction of the text paragraphs at the entity level can be more accurately realized.

2. By following the original paragraph

The prompt language of the classification task is added, the prompt language comprises a mask position which needs to predict and fill in a label, the paragraph classification problem is converted into a classification prediction problem similar to a complete shape and fill in the blank, the process of paragraph classification prediction is simplified, the policy document text can be more accurately analyzed from the aspects of content composition and document structure based on the constructed complete policy document element system,and deeper information is mined, and excellent performance is achieved under the condition that the scale of the labeled training data set is small.

3. The policy information recognizer provided simplifies the recognition difficulty of text entities by predicting two vacant content labels under the constructed policy document element system, can more accurately extract useful key information from the text based on the constructed policy document element system, and has excellent performance under the condition that the scale of a labeled training data set is small.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.

FIG. 1 is a diagram of a policy document element system constructed according to an embodiment of the present invention;

FIG. 2 is a logic block diagram of paragraph classes of a prediction policy file provided by an embodiment of the present invention;

FIG. 3 is a logic block diagram of a policy information identifier based on hint learning provided by an embodiment of the present invention;

FIG. 4 is a logic block diagram of a pre-training-fine-tuning-based policy information identifier for comparison according to an embodiment of the present invention;

FIG. 5 is a diagram illustrating the steps for implementing a policy text analysis method based on policy text classification and key information identification according to an embodiment of the present invention;

FIG. 6 is a policy text classifier prediction section

A method implementation step diagram of the type(s);

FIG. 7 is a diagram of a policy information identifier extraction section

Is key ofA method step diagram of information.

Detailed Description

The technical scheme of the invention is further explained by the specific implementation mode in combination with the attached drawings.

Wherein the showings are for the purpose of illustration only and are shown by way of illustration only and not in actual form, and are not to be construed as limiting the present patent; for a better explanation of the embodiments of the present invention, some parts of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product; it will be understood by those skilled in the art that certain well-known structures in the drawings and descriptions thereof may be omitted.

The same or similar reference numerals in the drawings of the embodiments of the present invention correspond to the same or similar components; in the description of the present invention, it should be understood that if the terms "upper", "lower", "left", "right", "inner", "outer", etc. are used for indicating the orientation or positional relationship based on the orientation or positional relationship shown in the drawings, it is only for convenience of description and simplification of description, but it is not indicated or implied that the referred device or element must have a specific orientation, be constructed in a specific orientation and be operated, and therefore, the terms describing the positional relationship in the drawings are only used for illustrative purposes and are not to be construed as limitations of the present patent, and the specific meanings of the terms may be understood by those skilled in the art according to specific situations.

In the description of the present invention, unless otherwise explicitly specified or limited, the term "connected" or the like, if appearing to indicate a connection relationship between components, is to be understood broadly, for example, as being either fixedly connected, detachably connected, or integrated; can be mechanically or electrically connected; they may be directly connected or indirectly connected through intervening media, or may be connected through one or more other components or may be in an interactive relationship with one another. The specific meanings of the above terms in the present invention can be understood in a specific case to those of ordinary skill in the art.

In the embodiment of the invention, the applicant collects a certain amount of policy documents as references for constructing a policy document element system and model training data of a subsequent policy text classifier and a policy information recognizer. The policy documents relate to various fields such as agriculture, industry, business, service industry and the like, and applicable objects of the policy documents include individuals, enterprises, institutions and the like. The policy document element system constructed by the embodiment is as shown in fig. 1, and elements in the system are divided into a sentence level and an entity level according to the length of characters in a text passage. Sentence-level elements generally cover the entire sentence in a paragraph, such as "give a marketing team a city-linked 200 ten thousand yuan reward for a business enterprise that successfully comes into the market," which is a complete sentence and thus is considered sentence-level; and elements at the entity level are typically included in words having a particular meaning, such as policy name, policy number, distribution area, department of formulation, etc. in paragraphs.

Further, the sentence-level elements are subdivided into general forms and "body-relationship-domain" forms, and the sentence-level elements in the general forms are used for distinguishing content compositions of paragraphs in the policy text, such as policy targets, application reviews, policy tools, supervision evaluations, fund management and the like in fig. 1. While sentence-level elements in the form of "body-relationship-domain" are used to structurally represent the admission conditions of the policies, such as the admission condition "business registry-owned-shanghai" associated with the business registry. Specifically, as shown in fig. 1, the content of the policy document element system constructed in this embodiment is as follows:

1. elements at the entity level include: 7 categories of policy name, policy number, release area, formulation department, execution department, release time and execution period;

2. the general form of sentence-level elements includes: policy objectives, application auditing, policy tools, regulatory evaluation, and fund management 5 categories. Wherein, the supervision evaluation is further subdivided into 2 subclasses of supervision management and assessment evaluation. Funding management is further subdivided into funding sources and management rules 2 subclasses. Policy tools are further subdivided into 13 subclasses of 3 types as follows:

supply-type policy tools (i.e., policy tools-supply type) include talent training (establishing talent development plans, actively perfecting various education systems and training systems, etc.), fund support (providing financial support, such as development expenses and infrastructure construction expenses, etc.), technical support (technical counseling and consultation, enhancing technical infrastructure construction, etc.), public services (perfecting relevant supporting facilities, policy environments, etc.).

The environmental policy tool (i.e. policy tool-environmental policy) includes regulation and control (making regulation, standard, standardizing market order, increasing supervision), target planning (top-level design, providing corresponding policy support service), tax benefits (tax incentives such as tax exemption and tax refund, including investment exemption, accelerated discount, tax exemption, tax refund, etc.), financial support (providing loan, subsidy, risk investment, credit guarantee, fund, risk control and other financial support for enterprises through financial institutions), organization and construction (setting leadership, supervision, service, etc. organization and team construction for promoting industry health development), and policy propaganda (propaganda related policy for promoting industry development).

Demand-type policy tools (i.e., policy tools — demand type) include government procurement (products procurement from government to related enterprises), public and private cooperation (government and many social entities participate in the related activities of industry development, such as joint investment, joint technical research development, development planning research, etc.), overseas cooperation (introduction of foreign materials, development of cooperation and communication with overseas government, enterprises or scientific research institutions in terms of generation technology, standard customization, etc.).

The sentence-level elements in the form of "body-relationship-domain" include admission conditions, which in turn can be subdivided into 8 subclasses: registration places, property right requirements, business fields, employee composition, legal qualifications, enterprise types, operational requirements, and research and development requirements.

Before paragraph classification and key information identification are performed on the policy text, paragraph splitting is performed on the text content of the policy document. There are many existing methods for paragraph splitting the text content of the policy document, and the way of splitting paragraphs is not the scope of the claimed invention, so the detailed way of paragraph splitting will not be described here.

After the paragraph is split, the paragraph classification and key information identification process is carried out. In this embodiment, the paragraphs are classified by a pre-trained policy text classifier, and the content composition and file structure of the policy file are further analyzed. In this embodiment, a general-form sentence-level element in the policy document element system shown in fig. 1 is selected as a candidate category set of a paragraph, and two category sets with different classification granularities are used as samples to respectively perform training of a policy text classifier and compare training effects, where a sentence-level element with one classification granularity is 7 major categories including a policy target, an application audit, a policy tool-supply type, a policy tool-environment type, a policy tool-demand type, and fund management and supervision evaluation shown in fig. 1; the other classification granularity is 17 small classes after expanding the 3 major classes of policy tools, supervision evaluation and fund management, and 19 classes of policy targets and application auditing. When classifying paragraphs, the policy text classifier also determines whether the paragraph does not belong to any of these categories, i.e., whether the paragraph is a nonsense paragraph.

The method for classifying the input paragraphs by using the pre-trained policy text classifier in the embodiment is specifically described as follows:

in this embodiment, the technical core of classifying the input paragraphs is to adopt the idea of prompt learning, which can simplify the classification process and improve the classification efficiency, and has higher classification superiority for small-scale data sets. Specifically, in order to more fully exert the powerful question-answer and reading comprehension capabilities of the policy text classifier and mine deeper information contained in the labeled small-scale policy file text data set, the input paragraph text is processed according to a specific mode, and a task prompt language is added to the paragraph text, so that the paragraph text is more adaptive to the question-answer form of the language model. The principle of paragraph recognition by the policy text classifier based on prompt learning is as follows:

is provided with

For pre-trained language models (preferred)Is a BERT language model),

is a label word set and a mask word in a policy document element system

Is used for filling out language model

Is masked in the input

In a word of

Is a set of labels for a text classification task (paragraph classification task). Obtaining an input language model after segmenting words of text paragraphs of each policy

Word sequence of

Then use the self-defined template function

Will be provided with

Conversion to language models

Is inputted

，

In that

The method is characterized in that a prompt language of a classification task is added, and the prompt language comprises a mask position needing to predict and fill in a label. Warp beam

After conversion, the paragraph type prediction problem can be converted into a complete fill-in-the-blank problem, i.e., a language model

Expressed in the form of a filled-in-space problem

For input, the most suitable word to fill in the mask position is predicted to be used as a pair

The classification of the expressed paragraphs predicts the outcome.

It is emphasized that the present application, based on the idea of prompt learning, makes better use of the language model

The question answering and reading comprehension ability of the policy text classifier is achieved, meanwhile, the classification problem is converted into a complete form filling-in-the-air problem, the prediction process is simpler, and the classification efficiency of the policy text classifier is improved. Further, the present embodiment defines a set of tags for classifying tasks from text

Set of tagged words into policy document element hierarchy

As a converter of the label

. For example, for

The label in (1)

The label converter

Map it to tagged words

The policy objective is the predicted paragraph category.

Fig. 2 is a logic block diagram of a paragraph category of a predicted policy document according to an embodiment of the present invention. It is emphasized that for each template function

And label converter

The present embodiment implements classification of paragraphs by the following steps:

given an input paragraph

(preferably a sequence of words of the original passage), using a template function

Will be provided with

Conversion to language models

Is inputted

Language model

Will predict

Tag with most suitable middle mask position

，

Then using a label converter

Mapping the label to label words in a policy document element system

，

And use it as a pair paragraph

The classification of (2). Preferably, the present embodiment employs a pre-trained Chinese BERT model as the language model

The prediction method of the mask position follows the pre-training task of the BERT model, namely the pair thereof is used

The label of the mask position is predicted by the output corresponding to the middle mask position (the prediction method is consistent with the mask Language Model pre-training task of the BERT Model, and detailed description is not given).

For example, with respect to template functions

Suppose to be defined as "

. In general, this is a paragraph of the policy text about _____. "wherein," _____ "represents the mask location, and thus the original text paragraph

A prompt language for the classification task is added.

For example, "for a successful business to market, give a 200 ten thousand dollar reward to the business team for city linkage" for that paragraph

After adding the above-mentioned prompt language, the language model

The classification task of (1) is to predict that the enterprise successfully appeared on the market and the management team is awarded 200 ten thousand yuan of linkage in the urban area. In general, this is a paragraph of policy text about _____. "mask position in" _____

. After predicting the label after the mask position, the predicted label

Mapping to a set of tagged words in a policy document element hierarchy

Corresponding label word in

Paragraphs obtained as predictions

Of the type (c).

The language model is trained for the present embodiment as follows

Method of (a)：

Language model

There are many existing training methods that preferably use the BERT model, which can be applied in the present application for training the language model

With the difference that the present embodiment is used to train language models

Is a template function

Converted to obtain

And via-tag converter

Label word set obtained by conversion

Corresponding label word in

And a loss function for evaluating model performance improved for improved classification accuracy.

Training language models

In the method, the sample data set is randomly divided into a training set and a verification set according to the proportion of 7:3, and the training process is as follows:

sequence generated for each policy text paragraph containing only one mask position

To, forLabel word set in policy document element system

Each tag word in (1)

The probability of filling in the mask position calculates a score (due to the label)

Word set on tag

Has a label word with a mapping relation

Thus predicting the label

The probability score of filling in the mask position is equivalent to predicting the corresponding tag word

Probability score of filling in the mask location), this score is determined by the language model

Predictions represent the likelihood that the predicted tag word can fill in the mask position. More specifically, for a sequence

The present application computes a set of labels for a text classification task

The label in (1)

The method of filling the probability score of the mask position is based on the following formula(1) Expressing:

in the formula (1), the first and second groups,

presentation label

Probability score of filling mask position due to label

Tab word set related to policy document element system

Corresponding label word in

Have a mapping relationship, therefore

Equivalent to representing label words

Filling in probability scores for mask locations;

for example, the label of the label word "policy target" in FIG. 1 may be mapped to

Mapping the label of the label word 'apply for review' to

By establishing the mapping in this way, the task is changed from assigning a label without meaning to the input sentence to selectingThe most likely word to fill in the mask position.

Is calculated to obtain

After all the label words are filled in the scores of the same mask position, obtaining a probability distribution through a softmax function, wherein the specific calculation method is expressed by the following formula (2):

in the formula (2), the first and second groups of the compound,

a set of labels representing a text classification task;

then, according to

And

and calculating a model predicted loss using the constructed loss function expressed by the following formula (3):

in the formula (3), the first and second groups,

represents a trimming coefficient (preferably 0.0001);

representing the distribution of model predictions

The difference from the true one-hot vector distribution;

score representing model prediction

The difference from the true score;

finally, whether a model iterative training termination condition is reached is judged,

if yes, stopping iteration and outputting the language model

；

If not, the iterative training is continued after the model parameters are adjusted.

In order to further improve the training effect of the model and further improve the language model

Preferably, language models

For forming a plurality of language sub-models

The method for training the fusion language model comprises the following steps:

first, a template function set is defined

Set of template functions

Comprising several different template functions

E.g.) "

. This policy text paragraph and what concerns _____ ", again for example,"this policy text paragraph relates to what relates to _____," and so on. For different template functions

In this embodiment, the fusion language model is trained by the following method:

for each as training sample

By corresponding language submodels

Computing a set of tagged words

Each tag word in (1)

Probability score of filling mask position

The calculation method is expressed by the following formula (4):

for associating each template function

Is/are as follows

Carrying out fusion to obtain

Specifically, it is expressed by the following formula (5):

in the formula (5), the first and second groups,

representing a set of template functions

Template function in (1)

The number of (2);

representing a template function

In the calculation of

、

The weight of each language, in this embodiment, according to each language sub-model

Determining individuals with the accuracy obtained on the training and validation sets

The weight of (c).

Then, the probability distribution is calculated by the softmax function

The calculation method is expressed by the following formula (6):

in the formulas (4), (5) and (6),

；

a set of labels representing a text classification task;

finally, according to

And

and calculating a model predicted loss using the constructed loss function expressed by the following formula (7):

in the formula (7), the first and second groups,

represents a trimming coefficient (preferably 0.0001);

representing the distribution of model predictions

The difference from the true distribution;

score representing model prediction

The difference from the true score.

Provided with a prompt language

For language models

The input mask position label prediction method has excellent prediction performance under the condition that the scale of a labeled training data set is small, and in order to verify the excellent performance of the labeled training data set when the training data is small, the application also designs various policy text classifiers based on fully supervised learning for performance comparison, and the specific method comprises the following steps:

(1) For policy document paragraphs

Using word segmentation tool to obtain word sequence, and recording it as

,

Representing sequences of words

To

And carrying out distributed representation on each word after word segmentation through a word vector representation model obtained by pre-training on a large-scale comprehensive field corpus. In this embodiment, a static word vector is used, each word being represented as a 300-dimensional pre-trained vector

，

Representing sequences of words

To (1)

Word, obtaining paragraphs by word vectors

Is characterized by

Then, the characteristics of the paragraph are expressed

Inputting a multi-classifier to predict the probability that each paragraph belongs to each class, the prediction process is expressed as:

，

in order to characterize the function in the form of a feature,

presentation paragraph

Is a first

The probability of each class, and the class with the highest probability is selected as the paragraph

The categories mentioned.

(2) In the multi-classifier part, the method based on statistical machine learning and the method based on deep learning are selected to carry out complete supervised learning on the multi-classifier. The multi-classifier based on statistical machine learning is designed on the basis of a support vector machine model and an XGboost model; the deep learning-based multi-classifier is designed based on a TextCNN model and a Bi-LSTM + Attention model.

1) In a statistical machine learning based multi-classifier, a policy text paragraph is identified

300 of all words of the paragraph to be participatedAveraging each dimension of the dimension distributed representation, and splicing the two characteristics of the length of the paragraph and the relative position of the paragraph in the whole policy document (the index value of the paragraph in the document/the total number of the paragraphs of the document) to obtain a 302-dimension feature vector

It is input into the multi-classifier, and the label of the paragraph classification is output.

2) In the deep learning based multi-classifier, one policy text paragraph is classified

Distributed representation of all words of a segmented paragraph

Splicing into a matrix, extracting features by using convolution kernels of 3 different sizes, wherein the sizes of the convolution kernels of 3 sizes can be respectively 3 multiplied by 3, 4 multiplied by 4 and 5 multiplied by 5, performing maximum pooling after convolution, splicing the features extracted by the convolution kernels of different sizes into a feature vector, inputting the feature vector into a softmax activation function, and outputting a label of the paragraph classification.

3) In another deep learning based multi-classifier, a policy text paragraph is classified

300-dimensional distributed representation of all words of a participled paragraph

Forward input into LSTM long-short-term memory network

Reverse input LSTM to obtain

Adding the elements of the two corresponding time sequences to obtain an output vector of each time sequence

. Then, through an Attention mechanism, the weight of each time sequence is calculated, the vectors of all the time sequences are weighted and summed to be used as a feature vector, and finally, a softmax function is used for classification.

The following shows the multi-classifiers obtained by training on a small-scale training data set by the method (1) and the four methods 1), 2) and 3) in the method (2) and the language model trained by the prompt language and mask position label prediction-based policy text classification method provided by the embodiment of the invention

The evaluation index is the accuracy rate on the test set for the effect comparison table of paragraph classification of two policy documents with different granularities, namely 'policy target, application audit, policy tool-supply type, policy tool-environment type, policy tool-demand type, supervision evaluation, capital management' 7 categories and 'policy target, application audit, talent culture, capital support, technical support, public service, regulation control, target planning, tax discount, financial support, organization construction, policy propaganda, government procurement, public and private cooperation, overseas cooperation, supervision management, assessment evaluation, capital source and management principle' 19 categories shown in figure 1. The following table a shows: language model trained by the embodiment

In paragraphs

The paragraph text classification method for performing mask position label prediction by adding a classification task prompt language shows that the paragraph classification performance is better than that of a multi-classifier trained by other four methods on a small-scale data set, and proves that the language model trained by the embodiment

The superiority of paragraph classes is predicted on small scale datasets.

TABLE a

After the paragraphs in the policy text are classified, it is sometimes necessary to automatically identify key information in each paragraph. The application identifies key information in a policy document through a prompt learning based policy information identifier. In the present application, elements at the entity level in the policy document element system shown in fig. 1 are defined as 7 categories of key information categories of the policy, that is, "policy name, policy document number, distribution area, establishment department, execution department, distribution time, and execution period" shown in fig. 1.

Following extraction of each paragraph from the hint learning based policy information identifier

The method for key information in (1) is specifically described:

in general, each paragraph is regarded as a character sequence, and a policy information identifier is used to identify whether each digit in the character sequence is an entity boundary and identify the type of the entity. Specifically, as shown in fig. 3, setting is performed

For pre-trained language models, in models

In the step (1), the first step,

is a label word set used for entity identification in a policy document element system and order

Tag set for identifying tasks for an entity, tag set

Each of which isLabel (R)

Word set on tag

In which there is a label word with mapping relation

And defining sentence templates

Form board

The method comprises two gaps of words to be filled, wherein the filling content of the first gap is text segments intercepted from an input paragraph, the segments are regarded as candidate entities, and the second gap is an entity class label of the filled text segment needing to be predicted. Set of tagged words for entity identification in policy document element system

Each of the tag words in

The entity type represented, filling this entity type in

Defining a new template, e.g. a sentence template

Is "[ text fragment ]]Is an entity type]Policy entity ", then for the set of tagged words identified by the entity

The entity type of the "department" in (1) is filled into the template

A new template may be defined after the process, for example, as "[ candidate entity ]]Is a department policy making entity ". In addition, in order to deal with the case where the text fragment is not an entity, a sentence template of "non-entity" type is further defined, that is, "[ text fragment" ]]Not a policy entity ", such that a plurality of sentence templates of different entity types and sentence templates of non-entity types constitute a set of sentence templates

。

Will be followed by paragraph

Filling each text segment intercepted into the sentence template set

Each sentence template in (1)

Then using the language model

The probability scores of these filled sentences are calculated (again preferably by the BART model), the calculation method being expressed by the following equation (8):

in the formula (8), the first and second groups of the chemical reaction are shown in the specification,

text passage for representing use candidate

And a label

Filling into sentence templates

The sentence is obtained;

representing the sentence

The sequence length of (a);

representing sentences

Of the word sequence

An item;

representing sentences

1 st to the first in the word sequence of (1)

An item;

representing input to the language model

Text sequence of (2)

；

Representing text at a given input

And 1 st to 1 st item in the word sequence of the sentence template

Item(s)

In the case where the model predicts that the c-th term is

The probability is calculated by a pre-training generative language model.

Through the above process, the language model is used

For each sentence template of both entity type and non-entity type, a probability score for filling in the second gap with tag words is calculated, and then each candidate text segment is classified as the type corresponding to the sentence template with the highest score, although this type may also be "non-entity". The text segment assigned with the entity type is the entity identified in the text segment, and the entity type is the assigned entity type.

The following briefly describes a method of training a policy information recognizer:

to be provided with

And

corresponding real label word

For model training samples, the sample data set is randomly divided into a training set and a verification set according to the proportion of 7:3. For data in the training set, ifText segment

The entity type of

Then will be

And

sentence templates for filling in entity types separately

In the first and second vacancies. If the text segment

Not an entity, it will

Sentence template filled in non-entity type

A filled sentence is also obtained. In addition, the sentence template containing the entity is filled by using all entity samples in the training set, and the non-entity sentence template is filled by randomly sampling in the remaining non-entity type words, wherein the proportion of the two is preferably 1.5, so that the interference of the non-entity sentence template on the recognition of the entity sentence template is increased, and the key information extraction precision of the policy information recognizer is further improved.

It is emphasized that, in the present application, the language model

Preferably a BART model. BART model computation sentence template

Is scored by

The principle of (1) is as follows:

given a policy text passage

And sentence template set

Will be

Inputting into the encoder of BART model to obtain paragraphs

Is characterized by

. In each step of the decoder of the BART model,

output before decoder

Taking the input of the current step together, and using an attention method to obtain a characteristic representation of the current step

After the characteristic representation is subjected to linear transformation, the softmax function is used for obtaining the word output in the current step

Is calculated in such a way that the conditional probability (which means the probability distribution of the c-th term after a given preceding c-1 term and input paragraph) of (2) is given

Wherein

Are model parameters.

In the training BART model, the difference between the output of the decoder and a real template is calculated by using a cross entropy loss function, the difference is used as the basis for adjusting the model parameters, and the training BART model is continued to be iterated after the model parameters are adjusted until the model convergence condition is reached.

The policy information extraction method based on prompt learning provided by the application has an excellent recognition effect on a small-scale data set, in order to verify the performance of the policy information extraction method when the scale of a training data set is small, the application also designs various policy information recognizers based on pre-training-fine tuning to perform performance comparison on the same data set, and a specific method is shown in fig. 4 and comprises the following steps:

in the distributed feature representation part of the input data of the policy information recognizer, distributed feature representations of vocabulary level and character level are used simultaneously, the distributed feature representation of each word of the vocabulary level is realized by a word vector representation model pre-trained on a large-scale integrated domain corpus, and the distributed feature representation of each character of the character level is realized by a Chinese RoBERTA model pre-trained. Since the process of performing distributed feature representation on input data by the word vector representation model and the chinese RoBERTa model is not the scope of protection claimed in the present application, the specific process is not described.

The context encoding layer of the policy information recognizer takes over the output of the distributed representation layer, further modeling the dependency between text semantics and words. In this embodiment, the structure and the construction method of the three models are briefly described as follows by using a multilayer perceptron, a Transformer and a Flat-Lattice transform:

in the context coding layer based on the multi-layer perceptron, a structure of a linear layer-a ReLU function layer-a linear layer is adopted.

In the transform-based context coding layer, a transform Encoder is used to feature code the text.

In a context coding layer based on a Flat-text Transformer (FLAT), a variant FLAT of a Transformer is used, meanwhile, distributed representation of characters and words of a text is used, position coding in the Transformer is further expanded, head and tail relative positions of the characters and the words of the text are introduced, and the problem of unbalanced entity length of a policy file is solved. The relative position coding calculation method of the text segment of the FLAT is expressed by the following formula (9):

in the formula (9), the first and second groups,

and

respectively represent

The position of the first and last character of each text segment in the original sequence is indexed. For example, in the text of "3 years" policy validity period, "the" policy "has a head and a tail of 1 and 2, respectively, while for" administration "the character has a head and a tail of 1, respectively.

Is a parameter that can be learned by the user,

Included

，

the calculation method of (2) is expressed by the following formulas (10), (11):

in the formulas (10) and (11),

Included

、

、

、

any one of the above;

representing the vector length of the input model.

The decoding layer of the policy information recognizer uses a conditional random field model, the decoding process uses a Viterbi algorithm based on dynamic programming to obtain higher decoding efficiency, and a conditional random field loss function is used for optimization.

The following shows a comparison table of extraction effects of policy information of 7 categories, i.e., "policy name, policy number, release area, establishment department, execution department, release time, and execution period" shown in fig. 1, when the scale of a labeled training data set is small, in a policy information identifier based on pre-training-fine adjustment and a policy information identifier based on prompt learning provided by an embodiment of the present invention, and an evaluation index is an F1 score on a test set. The following table b shows: the language model N trained by the embodiment has better performance than the policy information recognizer trained by other methods on a small-scale training data set, and proves the superiority of recognizing policy key information when the labeled training data set is less.

Table b

To sum up, as shown in fig. 5, the policy text analysis method based on policy text classification and key information identification according to the embodiment of the present invention includes the steps of:

Predicting output paragraphs

Type of (d);

s2, the classified paragraphs are completed by the policy information recognizer based on pre-training

Key information is further extracted at the entity level.

More specifically, as shown in FIG. 6, the policy text classifier predicts paragraphs

The method of type (d) specifically comprises the steps of:

s11, for the paragraph in the given policy document

Using template functions

Will be provided with

Conversion to language models

Is inputted

，

In the original paragraph

A prompt language of a classification task is added, and the prompt language comprises a mask position needing to predict and fill in a label;

s12, language model

Predicting out tags filling in mask locations

；

S13, label converter

Will label

Set of tagged words mapped as a pre-constructed policy document element system

Corresponding label word in

As a predicted paragraph

Type (c) of the cell.

More specifically, as shown in fig. 7, in step S2, each paragraph is extracted based on the policy information identifier

The method of the key information in (1) comprises the steps of:

s21, sentence pattern definitionPlate assembly

Label word set for entity identification in policy document element system

And a language model

Label sets for entity identification

Sentence template set

Sentence template containing entity type and non-entity type

Sentence template

Including two words to be filled in the vacancy, wherein the first vacancy is a paragraph from the input

The second vacancy is a category label for classifying the intercepted text segment, and the label set

Middle label

With said set of tag words

Tag word in (1)

Having a mapping relationship;

s22, from the paragraph

Filling each intercepted text segment into a sentence template set

Each sentence template in (1)

Then using the language model

Template each sentence filled with text segments

Computing a set of tags

Each tag in (1)

The probability score of filling the second vacancy is calculated by the following formula (12):

in the formula (12), the first and second groups,

text segments representing usage candidates

And a label

Filling sentence template

The sentence obtained;

representing the sentence

The sequence length of (a);

representing the sentence

Of the word sequence

An item;

representing the sentence

1 st to 1 st item in the word sequence of

An item;

representing input to a language model

Text sequence of

；

Representing text at a given input

And 1 st to 1 st item in the word sequence of the sentence template

Item(s)

In the case where the model predicts that the c-th term is

The probability is calculated by a pre-training generative language model;

s23, the sentence with the highest score

The filled text segment is used as key information entity, and the corresponding type label

Mapping to the tagged word

Then as corresponding entity type, jointly form the paragraph

Key information of (1).

The invention has the following beneficial effects:

1. a set of complete policy document element system is constructed, and different elements in the policy document are clearly divided. Subsequently, based on the system, the classification of each paragraph type in the policy document and the key information extraction of the text paragraphs at the entity level can be more accurately realized.

2. By following the original paragraph

The prompt language is added with a classification task and comprises a mask position needing to predict and fill in a labelThe method converts the paragraph classification problem into the classification prediction problem similar to complete filling, simplifies the paragraph classification prediction process, can more accurately analyze the policy document text from the aspects of content composition and document structure based on the constructed complete policy document element system, and excavates deeper information, and has excellent performance under the condition that the scale of the labeled training data set is small.

3. The policy information recognizer provided simplifies the recognition difficulty of text entities by predicting two vacant content labels under the constructed policy document element system, can more accurately extract useful key information from texts based on the constructed policy document element system, and has excellent performance under the condition that the scale of a labeled training data set is small.

It should be understood that the above-described embodiments are merely preferred embodiments of the invention and the technical principles applied thereto. It will be understood by those skilled in the art that various modifications, equivalents, changes, and the like can be made to the present invention. However, such variations are within the scope of the invention as long as they do not depart from the spirit of the invention. In addition, certain terms used in the specification and claims of the present application are not limiting, but are used merely for convenience of description.

Claims

1. A policy text analysis method based on policy text classification and key information identification is characterized by comprising the following steps:

Predicting and outputting the paragraph

Type of (d);

Key information is further extracted at the entity level.

2. The method of claim 1, wherein in step S1, the policy text classifier predicts the paragraphs

The method of type (d) specifically comprises the steps of:

s11, regarding the paragraph in the given policy document

Using template functions

Will be provided with

Conversion to language model

Is inputted

，

In the original paragraph

s12, the language model

Predicting out tags filling in the mask locations

；

S13, label converter

Attaching the label to the container

Tagged word set mapped as pre-constructed policy document element system

Corresponding label word in

The paragraph obtained as a prediction

Type (c) of the cell.

3. The method of claim 2, wherein the language model is trained by training the policy text classification and key information recognition based policy text analysis method

The method comprises the following steps:

a1, for each as training sample

Calculating the set of tagged words

Each tag word in (1)

Probability scores for filling in the mask locations

；

A2, calculating probability distribution through softmax function

；

A3 is according to

And

calculating model prediction loss by using the constructed loss function;

if yes, terminating iteration and outputting the language model

；

If not, the model parameters are adjusted and then the step A1 is returned to continue the iterative training.

4. The method of claim 2, wherein the language model is based on a policy text classification and key information identification policy text analysis method

For forming a plurality of language sub-models

b1, defining a template function set

The set of template functions

Comprising a plurality of different said template functions

；

B2, for each as training sample

By corresponding said language submodel

Calculating the set of tagged words

Each tag word in (1)

Probability score of filling in the mask position

；

B3, associating each template function

Is/are as follows

Carrying out fusion to obtain

；

B4, calculating probability distribution through softmax function

；

B5 is according to

And

calculating model prediction loss by using the constructed loss function;

if yes, terminating iteration and outputting the fusion language model;

5. The method of claim 4 wherein the language model is based on policy text classification and key information identification

Or the language submodel

Is a BERT language model.

6. The method of claim 1 wherein the system of policy document elements includes sentence-level elements and entity-level elements, the sentence-level elements include any one or more of 27 sub-categories, such as policy objective, application review, policy tool-supply type, policy tool-environment type, policy tool-demand type, fund management, regulatory evaluation, and admission condition 8,

the policy tool-environment type comprises any one or more of 6 sub-categories of regulation and control, target planning, tax and discount, financial support, organization and construction and policy promotion;

7. The method according to any of claims 1-6, wherein in step S2, each paragraph is further extracted at entity level based on the policy information identifier

The method of the key information in (1) comprises the steps of:

s21, defining a sentence template set

Label word set for entity identification in policy document element system

And a language model

Label sets for entity identification

The sentence template set

Sentence template containing entity type and non-entity type

Sentence template

Middle label

With said set of tagged words

The label word in

Having a mapping relationship;

s22, will follow the paragraph

Each of the text fragments and each of the labels intercepted in

Corresponding label words

Filling in the sentence template set respectively

Each of the sentence templates

In the first vacancy, in the second vacancy, and then using the language model

Calculating probability scores of the filled-in sentences

The calculation method is expressed by the following formula (1):

in the formula (1), the first and second groups,

text passage for representing use candidate

And a label

Filling sentence template

The obtained sentence;

representing the sentence

The sequence length of (c);

representing the sentence

Of the word sequence

An item;

representing the sentence template

1 st to 1 st item in the word sequence of

An item;

representing input to the language model

Text sequence of

；

Representing text at a given input

And items 1 to 1 in the word sequence of the sentence templateFirst, the

Item(s)

In the case where the model predicts that the c-th term is

By the pre-trained said language model

Calculating to obtain;

s23, the score is the highest

The filled text segment is used as a key information entity, and the corresponding type label is used

Mapping to the tagged word

Then as corresponding entity type, jointly form the paragraph

Key information of (1).

8. The method of claim 7, wherein the language model is based on the policy text classification and key information identification policy text analysis method

Is a BART model.

9. The policy-based text classification and keyword message of claim 3The policy text analysis method for information recognition is characterized in that, in the step A1,