CN115310425A - Policy text analysis method based on policy text classification and key information identification

Info

Publication number: CN115310425A (application granted and published as CN115310425B)
Application number: CN202211229194.3A
Authority: CN (China)
Prior art keywords: policy, text, label, paragraph, model
Other languages: Chinese (zh)
Inventors: 杨象笋, 李响, 胡奇韬, 王江华
Assignees: Tiandao Jinke Co ltd; Zhejiang Zhelixin Credit Reporting Co ltd
Application filed by Tiandao Jinke Co ltd and Zhejiang Zhelixin Credit Reporting Co ltd
Legal status: Granted; Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/279 Recognition of textual entities

Abstract

The invention discloses a policy text analysis method based on policy text classification and key information identification, belonging to the technical field of natural language processing. The policy text classifier provided by the invention appends to the original paragraph x a prompt for the classification task, the prompt containing a mask position whose label must be predicted and filled in. This converts the paragraph classification problem into a cloze-style ("fill-in-the-blank") classification prediction problem, simplifying the process of paragraph classification prediction, and, based on the complete policy document element system constructed herein, allows the policy document text to be analyzed more accurately from the angles of content composition and document structure, mining deeper information. The policy information recognizer provided by the invention likewise simplifies the recognition of text entities by predicting the labels of vacant content slots under the constructed policy document element system, and performs well when the training data is small in scale.

Description

Policy text analysis method based on policy text classification and key information identification
Technical Field
The invention relates to the technical field of natural language processing, in particular to a policy text analysis method based on policy text classification and key information identification.
Background
Generally, the text structure of policy documents follows recognized standards, and even the wording tends to be uniform. Automatically identifying and analyzing the content and structure of policy documents is therefore particularly important for improving the efficiency of policy document analysis. In recent years, natural language processing technology has developed rapidly and is applied mainly to machine translation, public opinion monitoring, automatic summarization, viewpoint extraction, text classification, question answering, text semantic comparison, speech recognition, Chinese OCR and other tasks. Thus, for policy documents with structured textual content, natural language processing technology offers an effective means of analyzing the document text.
At present, few methods identify policy document content with both high classification and high recognition precision. Some scholars train text classification and recognition models by unsupervised learning to identify policy text content, but for lack of classification and recognition standards for such content, the performance of the trained models is not stable enough. Other scholars train such models by supervised learning, but there is no uniform standard for labeling policy text content, so the trained models are likewise not stable enough, and the large number of training samples required for supervised learning is usually costly to acquire.
Disclosure of Invention
The invention provides a policy text analysis method based on policy text classification and key information identification, aiming to realize accurate classification of the text paragraphs of policy documents and accurate identification of their key information.
In order to achieve the purpose, the invention adopts the following technical scheme:
a policy text analysis method based on policy text classification and key information identification is provided, and the method comprises the following steps:
s1, input paragraphs are classified by a policy text classifier based on pre-training
Figure DEST_PATH_IMAGE001
Predicting and outputting the paragraph
Figure 79006DEST_PATH_IMAGE001
Type of (d);
s2, finishing each paragraph of the classification based on the pre-trained policy information recognizer
Figure 933830DEST_PATH_IMAGE001
Key information is further extracted at the entity level.
Preferably, in step S1, the method by which the policy text classifier predicts the type of the paragraph x comprises the steps of:
S11, for a paragraph x in a given policy document, converting x with a template function T(·) into the input T(x) of a language model M, where T(x) appends to the original paragraph x a prompt for the classification task, the prompt containing a mask position whose label is to be predicted and filled in;
S12, the language model M predicting the label ŷ that fills the mask position;
S13, a label converter g mapping the label ŷ to the corresponding label word g(ŷ) in the label word set V of a pre-constructed policy document element system, as the predicted type of the paragraph x.
Preferably, the method of training the language model M comprises the steps of:
A1, for each training sample T(x), calculating the probability score s(y | x) with which each label word g(y) in the label word set V fills the mask position;
A2, calculating the probability distribution q(y | x) through a softmax function;
A3, calculating the model prediction loss from s(y | x) and q(y | x) using a constructed loss function;
A4, judging whether the termination condition of model iterative training is reached;
if yes, terminating the iteration and outputting the language model M;
if not, adjusting the model parameters and returning to step A1 to continue the iterative training.
Preferably, the language model M is a fusion language model formed by fusing several language sub-models M_i, and the method of training the fusion language model comprises the steps of:
B1, defining a template function set T = {T_1, …, T_n} comprising several different template functions T_i;
B2, for each training sample T_i(x), calculating with the corresponding language sub-model M_i the probability score s_i(y | x) with which each label word g(y) in the label word set V fills the mask position;
B3, fusing the scores s_i(y | x) of the individual template functions T_i to obtain the fused score s̄(y | x);
B4, calculating the probability distribution q̄(y | x) through a softmax function;
B5, calculating the model prediction loss from s̄(y | x) and q̄(y | x) using a constructed loss function;
B6, judging whether the termination condition of model iterative training is reached;
if yes, terminating the iteration and outputting the fusion language model;
if not, adjusting the model parameters and returning to step B2 to continue the iterative training.
Preferably, the language model M or the language sub-models M_i are BERT language models.
Preferably, the policy document element system comprises sentence-level elements and entity-level elements, the sentence-level elements comprising any one or more of the 8 categories of policy objective, application review, policy tool-supply type, policy tool-environment type, policy tool-demand type, fund management, supervision evaluation, and admission conditions, covering 27 sub-categories in total,
wherein the policy tool-supply type category comprises any one or more of the 4 sub-categories of talent training, fund support, technical support and public service;
the policy tool-environment type category comprises any one or more of the 6 sub-categories of regulation and control, target planning, tax incentives, financial support, organization construction and policy promotion;
the policy tool-demand type category comprises any one or more of the 3 sub-categories of government procurement, public-private cooperation and overseas cooperation;
the supervision evaluation category comprises either or both of the 2 sub-categories of supervision management and assessment evaluation;
the fund management category comprises either or both of the 2 sub-categories of fund source and management principle.
Preferably, in step S2, the method by which the policy information recognizer extracts the key information in each paragraph x comprises the steps of:
S21, defining a sentence template set T′, the label word set V′ used for entity recognition in the policy document element system, and the label set Y′ used by a language model M′ for entity recognition; the sentence template set T′ contains sentence templates T′_k of entity types and of the non-entity type; each sentence template T′_k contains two vacancies of words to be filled, of which the first vacancy takes a text fragment intercepted from the input paragraph x and the second vacancy takes the category label classifying the intercepted text fragment, each label y′ in the label set Y′ having a mapping relationship with a label word w′ in the label word set V′;
S22, filling each text fragment intercepted from the paragraph x, together with the label word w′ corresponding to each label y′, respectively into the first and second vacancies of each sentence template T′_k in the sentence template set T′, and then calculating with the language model M′ the probability score f(t) of each filled sentence t, the calculation method being expressed by the following formula (1):

    f(t) = Σ_{c=1}^{|t|} log p(t_c | X, t_{1:c-1})    (1)

in formula (1), t denotes the sentence obtained by filling the sentence template T′_k with a candidate text fragment and the label word w′ that has a mapping relationship with a label y′; |t| denotes the sequence length of the sentence t; t_c denotes the c-th item in the word sequence of the sentence t; t_{1:c-1} denotes the 1st to (c-1)-th items in the word sequence of the sentence t; X denotes the text sequence input to the language model M′; p(t_c | X, t_{1:c-1}) denotes the probability, given the input text X and the 1st to (c-1)-th items t_{1:c-1} of the word sequence of the sentence template, that the model predicts the c-th item to be t_c;
S23, taking the text fragment filled into the highest-scoring sentence t* as a key information entity, and mapping its corresponding type label y′ to the label word w′ as the corresponding entity type, the two together constituting the key information of the paragraph x.
Preferably, the language model M′ in step S2 is a BART model.
Preferably, in step A1 the score s(y | x) is expressed by the following formula (2):

    s(y | x) = M([MASK] = g(y) | T(x))    (2)

and in step A2 the distribution q(y | x) is calculated by the softmax function (3):

    q(y | x) = exp(s(y | x)) / Σ_{y′∈Y} exp(s(y′ | x))    (3)

in formulas (2) and (3), y denotes the label in the label set Y having a mapping relationship with the label word g(y), and Y denotes the label set of the text classification task;
the loss function constructed in step A3 is expressed by the following formula (4):

    L = (1 − α)·L_q + α·L_s    (4)

in formula (4), α denotes a fine-tuning coefficient; L_q denotes the difference between the model-predicted distribution q(y | x) and the true distribution; L_s denotes the difference between the model-predicted score s(y | x) and the true score.
Preferably, in step B2 the score s_i(y | x) is expressed by the following formula (5):

    s_i(y | x) = M_i([MASK] = g(y) | T_i(x))    (5)

and in step B3 the fused score s̄(y | x) is obtained by the following formula (6):

    s̄(y | x) = (1/n) Σ_{i=1}^{n} β_i · s_i(y | x)    (6)

in formula (6), n denotes the number of template functions T_i in the template function set T; β_i denotes the weight given to the template function T_i when calculating s̄(y | x);
in step B4 the distribution q̄(y | x) is calculated by the softmax function (7):

    q̄(y | x) = exp(s̄(y | x)) / Σ_{y′∈Y} exp(s̄(y′ | x))    (7)

in formulas (5), (6) and (7), y denotes the label in the label set Y having a mapping relationship with the label word g(y), and Y denotes the label set of the text classification task;
the loss function constructed in step B5 is expressed by the following formula (8):

    L = (1 − α)·L_q̄ + α·L_s̄    (8)

in formula (8), α denotes a fine-tuning coefficient; L_q̄ denotes the difference between the model-predicted distribution q̄(y | x) and the true distribution; L_s̄ denotes the difference between the model-predicted score s̄(y | x) and the true score.
Preferably, the fine-tuning coefficient α in formula (4) and formula (8) is 0.0001.
The invention has the following beneficial effects:
1. A complete policy document element system is constructed, clearly dividing the different elements of a policy document. Based on this system, the classification of each paragraph in a policy document and entity-level key information extraction from text paragraphs can subsequently be realized more accurately.
2. By appending to the original paragraph x a prompt for the classification task, the prompt containing a mask position whose label must be predicted and filled in, the paragraph classification problem is converted into a cloze-style classification prediction problem. This simplifies the process of paragraph classification prediction, allows the policy document text to be analyzed more accurately from the angles of content composition and document structure based on the constructed complete policy document element system, mines deeper information, and performs excellently when the labeled training data set is small in scale.
3. The provided policy information recognizer simplifies the recognition of text entities by predicting the labels filling two vacancies under the constructed policy document element system, can extract useful key information from text more accurately based on that system, and performs excellently when the labeled training data set is small in scale.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required to be used in the embodiments of the present invention will be briefly described below. It is obvious that the drawings described below are only some embodiments of the invention, and that for a person skilled in the art, other drawings can be derived from them without inventive effort.
FIG. 1 is a diagram of a policy document element system constructed according to an embodiment of the present invention;
FIG. 2 is a logic block diagram of paragraph classes of a prediction policy file provided by an embodiment of the present invention;
FIG. 3 is a logic block diagram of the prompt-learning-based policy information recognizer provided by an embodiment of the present invention;
FIG. 4 is a logic block diagram of the pre-training-plus-fine-tuning-based policy information recognizers used for comparison, provided by an embodiment of the present invention;
FIG. 5 is a diagram illustrating the steps for implementing a policy text analysis method based on policy text classification and key information identification according to an embodiment of the present invention;
FIG. 6 is a diagram of the implementation steps of the method by which the policy text classifier predicts the type of a paragraph x;
FIG. 7 is a diagram of the steps of the method by which the policy information recognizer extracts the key information of a paragraph x.
Detailed Description
The technical scheme of the invention is further explained below through specific embodiments in combination with the drawings.
The drawings are for illustration only, are schematic rather than depictions of actual form, and are not to be construed as limiting this patent; to better explain the embodiments of the invention, some parts of the drawings may be omitted, enlarged or reduced and do not represent the size of an actual product; and it will be understood by those skilled in the art that certain well-known structures and their descriptions may be omitted from the drawings.
The same or similar reference numerals in the drawings of the embodiments of the invention denote the same or similar components. In the description of the invention, it should be understood that terms such as "upper", "lower", "left", "right", "inner" and "outer", where they indicate an orientation or positional relationship, are based on the orientation or positional relationship shown in the drawings, serve only for convenience and simplicity of description, and do not indicate or imply that the device or element referred to must have a specific orientation or be constructed and operated in a specific orientation; the terms describing positional relationships in the drawings are therefore illustrative only, are not to be construed as limiting this patent, and their specific meanings can be understood by those skilled in the art according to the specific situation.
In the description of the invention, unless otherwise explicitly specified and limited, the term "connected" and the like, where it indicates a connection relationship between components, is to be understood broadly: for example, as a fixed connection, a detachable connection or an integral connection; as a mechanical or an electrical connection; as a direct connection or an indirect connection through an intermediate medium; or as a connection through one or more other components or an interaction relationship between components. The specific meanings of the above terms in the invention can be understood by those of ordinary skill in the art according to the specific situation.
In the embodiment of the invention, the applicant collected a certain number of policy documents as references for constructing the policy document element system and as model training data for the subsequent policy text classifier and policy information recognizer. The policy documents cover fields such as agriculture, industry, commerce and the service industry, and their applicable objects include individuals, enterprises, public institutions and the like. The policy document element system constructed in this embodiment is shown in fig. 1; elements in the system are divided into the sentence level and the entity level according to the length of the text span. Sentence-level elements generally cover an entire sentence of a paragraph, for example "for an enterprise that successfully goes public, the city and district jointly award the management team 2 million yuan", which is a complete sentence and is therefore treated as sentence-level. Entity-level elements are typically contained in words with a specific meaning, such as the policy name, policy document number, issuing area and formulating department appearing in a paragraph.
Further, the sentence-level elements are subdivided into a general form and a "subject-relation-domain" form. Sentence-level elements of the general form are used to distinguish the content composition of paragraphs in the policy text, such as policy objective, application review, policy tool, supervision evaluation and fund management in fig. 1. Sentence-level elements of the "subject-relation-domain" form are used to represent the admission conditions of a policy structurally, for example the admission condition "registration place - located in - Shanghai" associated with the place of business registration. Specifically, as shown in fig. 1, the policy document element system constructed in this embodiment is as follows:
1. Entity-level elements include 7 categories: policy name, policy document number, issuing area, formulating department, executing department, issuing time and execution period;
2. Sentence-level elements of the general form include 5 categories: policy objective, application review, policy tool, supervision evaluation and fund management. Supervision evaluation is further subdivided into the 2 sub-categories of supervision management and assessment evaluation; fund management is further subdivided into the 2 sub-categories of fund source and management principle; and policy tools are further subdivided into 13 sub-categories of the following 3 types:
Supply-type policy tools (i.e., policy tool-supply type) include talent training (establishing talent development plans, actively perfecting education and training systems, etc.), fund support (providing financial support such as development expenses and infrastructure construction expenses), technical support (technical guidance and consultation, strengthening technical infrastructure construction, etc.) and public service (perfecting relevant supporting facilities, the policy environment, etc.).
Environment-type policy tools (i.e., policy tool-environment type) include regulation and control (formulating regulations and standards, normalizing market order, strengthening supervision), target planning (top-level design, providing corresponding policy support services), tax incentives (incentives such as tax exemption and tax refund, including investment credits, accelerated depreciation, tax exemption and tax rebates), financial support (providing loans, subsidies, venture capital, credit guarantees, funds, risk control and other financial support to enterprises through financial institutions), organization construction (establishing leadership, supervision and service organizations and team building to promote the healthy development of the industry) and policy promotion (publicizing policies that promote the development of the industry).
Demand-type policy tools (i.e., policy tool-demand type) include government procurement (government purchase of products from relevant enterprises), public-private cooperation (government and social entities jointly participating in industry development activities, such as joint investment, joint technical research and development, and development planning research) and overseas cooperation (introducing foreign materials, and developing cooperation and exchange with overseas governments, enterprises or research institutions in areas such as technology and standard setting).
Sentence-level elements of the "subject-relation-domain" form comprise the admission conditions, which are subdivided into 8 sub-categories: registration place, property right requirement, business field, employee composition, legal qualification, enterprise type, operating requirement, and research and development requirement.
Before paragraph classification and key information identification are performed on the policy text, the text content of the policy document is split into paragraphs. Many existing methods can split the text content of a policy document into paragraphs, and the manner of splitting is not within the scope of the claimed invention, so it is not described in detail here.
After paragraph splitting, the paragraph classification and key information identification process is carried out. In this embodiment, paragraphs are classified by a pre-trained policy text classifier, and the content composition and document structure of the policy document are analyzed further. The general-form sentence-level elements in the policy document element system of fig. 1 are selected as the candidate category set for paragraphs, and category sets of two different classification granularities are used as samples to train the policy text classifier separately and compare the training effects. At one granularity the sentence-level elements are the 7 major categories shown in fig. 1: policy objective, application review, policy tool-supply type, policy tool-environment type, policy tool-demand type, fund management and supervision evaluation. At the other granularity, the 3 major categories of policy tool, supervision evaluation and fund management are expanded into their 17 sub-categories, giving 19 categories together with policy objective and application review. When classifying a paragraph, the policy text classifier also judges whether the paragraph belongs to none of these categories, i.e., whether it is a meaningless paragraph.
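For concreteness, the two classification granularities can be encoded as plain label lists. A minimal sketch (Python), with category names taken from fig. 1 and an extra "meaningless" class as described above; the identifier names are illustrative, not from the patent:

```python
# Candidate category sets for the policy text classifier, at two granularities.
# "meaningless" covers paragraphs that belong to none of the element categories.
COARSE_LABELS = [
    "policy objective", "application review", "policy tool-supply type",
    "policy tool-environment type", "policy tool-demand type",
    "fund management", "supervision evaluation", "meaningless",
]

FINE_LABELS = [
    "policy objective", "application review",
    # policy tool-supply type
    "talent training", "fund support", "technical support", "public service",
    # policy tool-environment type
    "regulation and control", "target planning", "tax incentives",
    "financial support", "organization construction", "policy promotion",
    # policy tool-demand type
    "government procurement", "public-private cooperation", "overseas cooperation",
    # supervision evaluation / fund management
    "supervision management", "assessment evaluation",
    "fund source", "management principle",
    "meaningless",
]

assert len(FINE_LABELS) == 20  # 19 element categories + "meaningless"
```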
The method for classifying the input paragraphs by using the pre-trained policy text classifier in the embodiment is specifically described as follows:
in this embodiment, the technical core of classifying the input paragraphs is to adopt the idea of prompt learning, which can simplify the classification process and improve the classification efficiency, and has higher classification superiority for small-scale data sets. Specifically, in order to more fully exert the powerful question-answer and reading comprehension capabilities of the policy text classifier and mine deeper information contained in the labeled small-scale policy file text data set, the input paragraph text is processed according to a specific mode, and a task prompt language is added to the paragraph text, so that the paragraph text is more adaptive to the question-answer form of the language model. The principle of paragraph recognition by the policy text classifier based on prompt learning is as follows:
is provided with
Figure DEST_PATH_IMAGE074
For pre-trained language models (preferred)Is a BERT language model),
Figure DEST_PATH_IMAGE075
is a label word set and a mask word in a policy document element system
Figure DEST_PATH_IMAGE076
Is used for filling out language model
Figure 1980DEST_PATH_IMAGE074
Is masked in the input
Figure DEST_PATH_IMAGE077
In a word of
Figure DEST_PATH_IMAGE078
Is a set of labels for a text classification task (paragraph classification task). Obtaining an input language model after segmenting words of text paragraphs of each policy
Figure 825448DEST_PATH_IMAGE074
Word sequence of
Figure DEST_PATH_IMAGE079
Then use the self-defined template function
Figure DEST_PATH_IMAGE080
Will be provided with
Figure 555375DEST_PATH_IMAGE079
Conversion to language models
Figure 752876DEST_PATH_IMAGE074
Is inputted
Figure DEST_PATH_IMAGE081
Figure 219761DEST_PATH_IMAGE081
In that
Figure 813248DEST_PATH_IMAGE079
The method is characterized in that a prompt language of a classification task is added, and the prompt language comprises a mask position needing to predict and fill in a label. Warp beam
Figure 605754DEST_PATH_IMAGE081
After conversion, the paragraph type prediction problem can be converted into a complete fill-in-the-blank problem, i.e., a language model
Figure 323175DEST_PATH_IMAGE074
Expressed in the form of a filled-in-space problem
Figure 623444DEST_PATH_IMAGE081
For input, the most suitable word to fill in the mask position is predicted to be used as a pair
Figure 313182DEST_PATH_IMAGE079
The classification of the expressed paragraphs predicts the outcome.
It is emphasized that, based on the idea of prompt learning, the present application makes better use of the question-answering and reading-comprehension ability of the language model M while converting the classification problem into a cloze problem, making the prediction process simpler and improving the classification efficiency of the policy text classifier. Further, this embodiment defines a label converter g mapping the label set Y of the text classification task into the label word set V of the policy document element system. For example, for a label y in Y, the label converter g maps it to the label word g(y) = "policy objective", and "policy objective" is then the predicted paragraph category.
Fig. 2 is a logic block diagram of predicting the paragraph categories of a policy document according to an embodiment of the invention. It is emphasized that, for each template function T(·) and label converter g, this embodiment classifies paragraphs through the following steps:
Given an input paragraph x (preferably the word sequence of the original paragraph), the template function T(·) converts x into the input T(x) of the language model M; the language model M predicts the label ŷ ∈ Y most suitable for the mask position in T(x); the label converter g then maps this label to the label word g(ŷ) ∈ V in the policy document element system, which is taken as the classification of the paragraph x. Preferably, this embodiment employs a pre-trained Chinese BERT model as the language model M, and the prediction at the mask position follows the pre-training task of the BERT model, i.e., the label of the mask position is predicted from the model output corresponding to the mask position in T(x) (the prediction method is consistent with the masked language model pre-training task of the BERT model and is not described in detail).
For example, suppose the template function T(·) is defined as "x. In general, this is a paragraph of the policy text about _____.", where "_____" denotes the mask position; a prompt for the classification task is thereby appended to the original text paragraph x. Taking as x the paragraph "for an enterprise that successfully goes public, the city and district jointly award the management team 2 million yuan", after the above prompt is appended, the classification task of the language model M is to predict the label ŷ at the mask position "_____" in "for an enterprise that successfully goes public, the city and district jointly award the management team 2 million yuan. In general, this is a paragraph of the policy text about _____.". After the label of the mask position is predicted, the predicted label ŷ is mapped to the corresponding label word g(ŷ) in the label word set V of the policy document element system as the predicted type of the paragraph x.
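As an illustration of this cloze-style classification, the following is a minimal sketch (Python, using the Hugging Face transformers library). The English model name, the English template and the two-label verbalizer are illustrative assumptions, not fixed by the patent, which uses a Chinese BERT and the label words of fig. 1; single-token label words are assumed for simplicity:

```python
# A minimal sketch of the prompt-based paragraph classifier (assumed setup:
# an English BERT and single-token label words).
import torch
from transformers import BertForMaskedLM, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForMaskedLM.from_pretrained("bert-base-uncased")
model.eval()

# Verbalizer g: task label -> label word filling the mask (illustrative words).
verbalizer = {"policy objective": "goals", "fund support": "funding"}

def template(x: str) -> str:
    # T(x): append the classification prompt containing one mask position.
    return f"{x} In general, this is a paragraph of the policy text about {tokenizer.mask_token}."

def classify(x: str) -> str:
    enc = tokenizer(template(x), return_tensors="pt")
    mask_pos = (enc["input_ids"][0] == tokenizer.mask_token_id).nonzero().item()
    with torch.no_grad():
        logits = model(**enc).logits[0, mask_pos]  # scores over the vocabulary
    scores = {}
    for label, word in verbalizer.items():
        word_id = tokenizer.convert_tokens_to_ids([word])[0]
        scores[label] = logits[word_id].item()     # s(y|x): logit of g(y) at the mask
    return max(scores, key=scores.get)

print(classify("An award of 2 million yuan is given to enterprises that go public."))
```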
The method of training the language model M in this embodiment is described below:
There are many existing methods for training the language model M (preferably a BERT model), and they can be applied to training the language model M in the present application. The differences are that the training samples used here are the inputs T(x) obtained through the template function T(·), together with the corresponding label words g(y) in the label word set V obtained through the label converter g, and that the loss function used to evaluate model performance is improved for better classification precision.
When training the language model M, the sample data set is randomly divided into a training set and a validation set in the ratio 7:3, and the training process is as follows:
For each sequence T(x) generated from a policy text paragraph and containing exactly one mask position, a probability score is calculated for each label word g(y) in the label word set V of the policy document element system filling the mask position (since a label y has a mapped label word g(y) in V, predicting the probability score of the label y filling the mask position is equivalent to predicting the probability score of the corresponding label word g(y) filling it). This score is predicted by the language model M and represents the likelihood that the predicted label word can fill the mask position. More specifically, for a sequence T(x), the application calculates the probability score of a label y in the label set Y of the text classification task filling the mask position by the following formula (1):

    s(y | x) = M([MASK] = g(y) | T(x))    (1)

in formula (1), s(y | x) denotes the probability score of the label y filling the mask position; since the label y has a mapping relationship with the corresponding label word g(y) in the label word set V of the policy document element system, s(y | x) is equivalent to the probability score of the label word g(y) filling the mask position. For example, the label word "policy objective" in fig. 1 may be mapped to the label y = 1, and the label word "application review" to the label y = 2. By establishing the mapping in this way, the task changes from assigning a meaningless label to the input sentence to choosing the word most likely to fill the mask position.
After the scores of all label words in V filling the same mask position are calculated, a probability distribution is obtained through the softmax function, the specific calculation being expressed by the following formula (2):

    q(y | x) = exp(s(y | x)) / Σ_{y′∈Y} exp(s(y′ | x))    (2)

in formula (2), Y denotes the label set of the text classification task;
then, the model prediction loss is calculated from s(y | x) and q(y | x) using the constructed loss function expressed by the following formula (3):

    L = (1 − α)·L_q + α·L_s    (3)

in formula (3), α denotes a fine-tuning coefficient (preferably 0.0001); L_q denotes the difference between the model-predicted distribution q(y | x) and the true one-hot vector distribution; L_s denotes the difference between the model-predicted score s(y | x) and the true score;
finally, whether the termination condition of model iterative training is reached is judged;
if yes, the iteration is stopped and the language model M is output;
if not, iterative training continues after the model parameters are adjusted.
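A compact sketch of this training objective (Python/PyTorch). The combination follows the (1 − α)·L_q + α·L_s form reconstructed above, with cross-entropy for the distribution term and mean squared error for the score term as assumed concrete choices:

```python
import torch
import torch.nn.functional as F

def prompt_loss(scores: torch.Tensor, true_label: torch.Tensor,
                true_scores: torch.Tensor, alpha: float = 1e-4) -> torch.Tensor:
    """scores: [batch, |Y|] mask-position logits s(y|x), one per label word.
    true_label: [batch] gold label indices; true_scores: [batch, |Y|] target scores."""
    # L_q: difference between the softmax distribution q(y|x) and the true one-hot.
    l_q = F.cross_entropy(scores, true_label)
    # L_s: difference between the predicted scores and the true scores.
    l_s = F.mse_loss(scores, true_scores)
    return (1 - alpha) * l_q + alpha * l_s

# Toy usage: 2 samples, 7 candidate categories.
scores = torch.randn(2, 7, requires_grad=True)
loss = prompt_loss(scores, torch.tensor([3, 0]), torch.zeros(2, 7))
loss.backward()
```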
In order to further improve the training effect and the prediction precision of the language model M, preferably, the language model M is a fusion language model formed by fusing several language sub-models M_i. The method of training the fusion language model is as follows:
First, a template function set T = {T_1, …, T_n} is defined, comprising several different template functions T_i, e.g. "x. This policy text paragraph concerns _____." or "x. What this policy text paragraph relates to is _____.", and so on. For the different template functions T_i, this embodiment trains the fusion language model by the following method:
For each training sample T_i(x), the corresponding language sub-model M_i calculates the probability score s_i(y | x) of each label word g(y) in the label word set V filling the mask position, the calculation being expressed by the following formula (4):

    s_i(y | x) = M_i([MASK] = g(y) | T_i(x))    (4)

The scores s_i(y | x) of the individual template functions T_i are then fused to obtain s̄(y | x), as expressed by the following formula (5):

    s̄(y | x) = (1/n) Σ_{i=1}^{n} β_i · s_i(y | x)    (5)

in formula (5), n denotes the number of template functions T_i in the template function set T; β_i denotes the weight of the template function T_i when calculating s̄(y | x); in this embodiment, the weight β_i of each language sub-model M_i is determined from the accuracy it obtains on the training and validation sets.
Then, the probability distribution q̄(y | x) is calculated by the softmax function, the calculation being expressed by the following formula (6):

    q̄(y | x) = exp(s̄(y | x)) / Σ_{y′∈Y} exp(s̄(y′ | x))    (6)

in formulas (4), (5) and (6), Y denotes the label set of the text classification task;
finally, the model prediction loss is calculated from s̄(y | x) and q̄(y | x) using the constructed loss function expressed by the following formula (7):

    L = (1 − α)·L_q̄ + α·L_s̄    (7)

in formula (7), α denotes a fine-tuning coefficient (preferably 0.0001); L_q̄ denotes the difference between the model-predicted distribution q̄(y | x) and the true distribution; L_s̄ denotes the difference between the model-predicted score s̄(y | x) and the true score.
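A sketch of the score fusion across templates (Python/PyTorch). The accuracy-derived weights β_i and the averaging follow the reconstruction of formulas (5) and (6) above:

```python
import torch

def fuse_scores(per_template_scores: list, betas: list) -> torch.Tensor:
    """per_template_scores: list of [batch, |Y|] tensors, one s_i(y|x) per
    template function T_i; betas: per-template weights, e.g. derived from each
    sub-model's accuracy on the training and validation sets."""
    n = len(per_template_scores)
    stacked = torch.stack(per_template_scores)           # [n, batch, |Y|]
    weights = torch.tensor(betas).view(n, 1, 1)
    s_bar = (weights * stacked).sum(dim=0) / n           # formula (5): fused score
    return torch.softmax(s_bar, dim=-1)                  # formula (6): q̄(y|x)

# Toy usage: 3 templates, 2 samples, 7 categories.
scores = [torch.randn(2, 7) for _ in range(3)]
q_bar = fuse_scores(scores, betas=[0.9, 0.8, 0.85])
```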
Taking the prompt-appended T(x) as the input to the language model M, the above mask-position label prediction method has excellent prediction performance when the labeled training data set is small in scale. To verify this excellent performance on small training data, the application also designed several policy text classifiers based on fully supervised learning for performance comparison, as follows:
(1) For a policy document paragraph x, a word segmentation tool is used to obtain its word sequence, denoted x = (w_1, …, w_m). Each word after segmentation is given a distributed representation through a word vector representation model pre-trained on a large-scale comprehensive-domain corpus. In this embodiment static word vectors are used, each word w_j being represented as a 300-dimensional pre-trained vector e(w_j), so that the word vectors yield the feature representation e(x) of the paragraph x. The feature representation e(x) is then input to a multi-classifier that predicts the probability of the paragraph belonging to each class, the prediction process being expressed as

    p_i = G(e(x))_i

where G denotes the multi-classifier applied to the feature representation and p_i denotes the probability that the paragraph x belongs to the i-th class; the class with the highest probability is selected as the category of the paragraph x.
(2) For the multi-classifier part, methods based on statistical machine learning and on deep learning are selected for fully supervised learning. The multi-classifiers based on statistical machine learning are designed on the basis of the support vector machine model and the XGBoost model; the multi-classifiers based on deep learning are designed on the basis of the TextCNN model and the Bi-LSTM + Attention model.
1) In the multi-classifiers based on statistical machine learning, for a policy text paragraph x, each dimension of the 300-dimensional distributed representations of all words of the segmented paragraph is averaged, and two further features, the paragraph length and the relative position of the paragraph in the whole policy document (the index of the paragraph in the document divided by the total number of paragraphs in the document), are concatenated to obtain a 302-dimensional feature vector; this vector is input to the multi-classifier, and the label of the paragraph classification is output.
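A sketch of this 302-dimensional feature construction (Python/NumPy; the embedding lookup is assumed to come from the pre-trained static word vectors described above):

```python
import numpy as np

def paragraph_features(word_vectors: np.ndarray, para_index: int,
                       total_paras: int) -> np.ndarray:
    """word_vectors: [num_words, 300] static embeddings of the segmented paragraph.
    Returns the 302-dim feature: mean embedding + length + relative position."""
    mean_vec = word_vectors.mean(axis=0)                  # 300 dims, averaged
    length = float(len(word_vectors))                     # paragraph length
    rel_pos = para_index / total_paras                    # position in the document
    return np.concatenate([mean_vec, [length, rel_pos]])  # 302 dims total

feats = paragraph_features(np.random.rand(25, 300), para_index=3, total_paras=40)
assert feats.shape == (302,)
```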
2) In one multi-classifier based on deep learning, for a policy text paragraph x, the distributed representations of all words of the segmented paragraph are concatenated into a matrix, and features are extracted with convolution kernels of 3 different sizes (the 3 kernel sizes may be 3×3, 4×4 and 5×5 respectively); max pooling is applied after convolution, the features extracted by kernels of different sizes are concatenated into a feature vector, and the feature vector is input to a softmax activation function to output the label of the paragraph classification.
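A sketch of such a convolutional classifier (Python/PyTorch), written in the common TextCNN form where each kernel spans the full 300-dimensional embedding width and the kernel heights 3, 4 and 5 cover 3 to 5 words; this is an assumed simplification of the sizes quoted above:

```python
import torch
import torch.nn as nn

class TextCNN(nn.Module):
    def __init__(self, emb_dim: int = 300, num_classes: int = 7,
                 kernel_heights=(3, 4, 5), channels: int = 100):
        super().__init__()
        self.convs = nn.ModuleList(
            [nn.Conv2d(1, channels, (h, emb_dim)) for h in kernel_heights])
        self.fc = nn.Linear(channels * len(kernel_heights), num_classes)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        # emb: [batch, num_words, emb_dim] word-vector matrix of the paragraph.
        x = emb.unsqueeze(1)                               # [batch, 1, words, dim]
        pooled = [torch.relu(c(x)).squeeze(3).max(dim=2).values
                  for c in self.convs]                     # max pooling per kernel size
        feats = torch.cat(pooled, dim=1)                   # concatenated feature vector
        return torch.softmax(self.fc(feats), dim=-1)       # class probabilities

probs = TextCNN()(torch.randn(2, 50, 300))                 # 2 paragraphs, 50 words each
```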
3) In another multi-classifier based on deep learning, for a policy text paragraph x, the 300-dimensional distributed representations of all words of the segmented paragraph are input into an LSTM (long short-term memory) network in the forward direction, and into an LSTM in the reverse direction; the elements of the two outputs at corresponding time steps are added to obtain the output vector of each time step. Then the weight of each time step is calculated through an Attention mechanism, the vectors of all time steps are weighted and summed to serve as a feature vector, and finally the softmax function is used for classification.
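A sketch of this Bi-LSTM + Attention classifier (Python/PyTorch; summing the two directions per time step and using a single learned attention scorer are assumed concrete choices matching the description):

```python
import torch
import torch.nn as nn

class BiLSTMAttention(nn.Module):
    def __init__(self, emb_dim: int = 300, hidden: int = 128, num_classes: int = 7):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.att = nn.Linear(hidden, 1)               # scores each time step
        self.fc = nn.Linear(hidden, num_classes)

    def forward(self, emb: torch.Tensor) -> torch.Tensor:
        out, _ = self.lstm(emb)                       # [batch, T, 2*hidden]
        fwd, bwd = out.chunk(2, dim=-1)
        h = fwd + bwd                                 # add the two directions per step
        weights = torch.softmax(self.att(h), dim=1)   # attention weights over steps
        feat = (weights * h).sum(dim=1)               # weighted sum as feature vector
        return torch.softmax(self.fc(feat), dim=-1)

probs = BiLSTMAttention()(torch.randn(2, 50, 300))
```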
The following compares the paragraph classification effect of the four multi-classifiers (support vector machine, XGBoost, TextCNN and Bi-LSTM + Attention) obtained by training on a small-scale training data set with the feature representation of method (1) and the designs 1), 2), 3) of method (2), against the language model M trained by the prompt- and mask-position-label-prediction-based policy text classification method provided by this embodiment of the invention. The comparison covers the two category granularities shown in fig. 1, namely the 7 categories "policy objective, application review, policy tool-supply type, policy tool-environment type, policy tool-demand type, supervision evaluation, fund management" and the 19 categories "policy objective, application review, talent training, fund support, technical support, public service, regulation and control, target planning, tax incentives, financial support, organization construction, policy promotion, government procurement, public-private cooperation, overseas cooperation, supervision management, assessment evaluation, fund source, management principle"; the evaluation index is accuracy on the test set. Table a below shows that the paragraph text classification method of this embodiment, which appends a classification-task prompt to the paragraph x and predicts the mask-position label with the trained language model M, outperforms the multi-classifiers trained by the other four methods in paragraph classification on the small-scale data set, demonstrating the superiority of the trained language model M in predicting paragraph categories on small-scale data sets.
[Table a: test-set accuracy of the compared classifiers at the two classification granularities; table image not reproduced.]
After the paragraphs of a policy text are classified, it is sometimes necessary to automatically identify the key information in each paragraph. The application identifies key information in policy documents through a prompt-learning-based policy information recognizer. In the present application, the entity-level elements of the policy document element system shown in fig. 1 are defined as the 7 categories of key policy information, namely the "policy name, policy document number, issuing area, formulating department, executing department, issuing time and execution period" shown in fig. 1.
The method by which the prompt-learning-based policy information recognizer extracts the key information of each paragraph x is described below:
in general, each paragraph is regarded as a character sequence, and a policy information identifier is used to identify whether each digit in the character sequence is an entity boundary and identify the type of the entity. Specifically, as shown in fig. 3, setting is performed
Figure DEST_PATH_IMAGE154
For pre-trained language models, in models
Figure 569767DEST_PATH_IMAGE154
In the step (1), the first step,
Figure DEST_PATH_IMAGE155
is a label word set used for entity identification in a policy document element system and order
Figure DEST_PATH_IMAGE156
Tag set for identifying tasks for an entity, tag set
Figure 292741DEST_PATH_IMAGE156
Each of which isLabel (R)
Figure DEST_PATH_IMAGE157
Word set on tag
Figure 782540DEST_PATH_IMAGE155
In which there is a label word with mapping relation
Figure DEST_PATH_IMAGE158
And defining sentence templates
Figure DEST_PATH_IMAGE159
Form board
Figure 921266DEST_PATH_IMAGE159
The method comprises two gaps of words to be filled, wherein the filling content of the first gap is text segments intercepted from an input paragraph, the segments are regarded as candidate entities, and the second gap is an entity class label of the filled text segment needing to be predicted. Set of tagged words for entity identification in policy document element system
Figure 236579DEST_PATH_IMAGE155
Each of the tag words in
Figure 927192DEST_PATH_IMAGE158
The entity type represented, filling this entity type in
Figure 258947DEST_PATH_IMAGE159
Defining a new template, e.g. a sentence template
Figure 624201DEST_PATH_IMAGE159
Is "[ text fragment ]]Is an entity type]Policy entity ", then for the set of tagged words identified by the entity
Figure 647215DEST_PATH_IMAGE155
The entity type of the "department" in (1) is filled into the template
Figure 338091DEST_PATH_IMAGE159
A new template may be defined after the process, for example, as "[ candidate entity ]]Is a department policy making entity ". In addition, in order to deal with the case where the text fragment is not an entity, a sentence template of "non-entity" type is further defined, that is, "[ text fragment" ]]Not a policy entity ", such that a plurality of sentence templates of different entity types and sentence templates of non-entity types constitute a set of sentence templates
Figure DEST_PATH_IMAGE160
Will be followed by paragraph
Figure 327781DEST_PATH_IMAGE151
Filling each text segment intercepted into the sentence template set
Figure 965567DEST_PATH_IMAGE160
Each sentence template in (1)
Figure 350412DEST_PATH_IMAGE159
Then using the language model
Figure 710724DEST_PATH_IMAGE154
The probability scores of these filled sentences are calculated (again preferably by the BART model), the calculation method being expressed by the following equation (8):
Figure DEST_PATH_IMAGE161
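A sketch of scoring a filled template with BART (Python, Hugging Face transformers), summing token log-probabilities as in formula (8); the model name and the English example strings are illustrative assumptions:

```python
import torch
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")
model.eval()

def template_score(paragraph: str, filled_sentence: str) -> float:
    """f(t) = sum_c log p(t_c | X, t_1:c-1): log-probability that the decoder
    generates the filled template t given the paragraph X as encoder input."""
    enc = tokenizer(paragraph, return_tensors="pt")
    target = tokenizer(filled_sentence, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(input_ids=enc.input_ids, labels=target).logits
    log_probs = torch.log_softmax(logits, dim=-1)
    token_lp = log_probs.gather(-1, target.unsqueeze(-1)).squeeze(-1)
    return token_lp.sum().item()

x = "The measures are formulated by the Municipal Finance Bureau."
print(template_score(x, "Municipal Finance Bureau is a formulating-department policy entity"))
print(template_score(x, "Municipal Finance Bureau is not a policy entity"))
```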
in formula (8), t denotes the sentence obtained by filling a candidate text fragment and a label word w′ into the sentence template T′_k; |t| denotes the sequence length of the sentence t; t_c denotes the c-th item in the word sequence of the sentence t; t_{1:c-1} denotes the 1st to (c-1)-th items in the word sequence of the sentence t; X denotes the text sequence input to the language model M′; p(t_c | X, t_{1:c-1}) denotes the probability, given the input text X and the 1st to (c-1)-th items t_{1:c-1} of the word sequence of the sentence template, that the model predicts the c-th item to be t_c; this probability is calculated by the pre-trained generative language model.
Through the above process, the language model M′ calculates, for every sentence template of both the entity types and the non-entity type, the probability score of filling the second vacancy with the label word, and each candidate text fragment is then classified as the type corresponding to its highest-scoring sentence template, which may of course also be "non-entity". The text fragments assigned an entity type are the entities identified in the text, and the assigned entity type is their type.
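A sketch of the surrounding search: enumerate candidate fragments of the paragraph, score every (fragment, template) pair with the template_score function from the sketch above, and keep the argmax type per fragment. The length bound and the template texts here are illustrative:

```python
ENTITY_TEMPLATES = {
    "formulating department": "{} is a formulating-department policy entity",
    "issuing time": "{} is an issuing-time policy entity",
    "non-entity": "{} is not a policy entity",
}

def extract_entities(paragraph: str, max_len: int = 10):
    """Classify each candidate fragment by its best-scoring filled template."""
    chars = list(paragraph)
    results = []
    for i in range(len(chars)):
        for j in range(i + 1, min(i + 1 + max_len, len(chars) + 1)):
            fragment = "".join(chars[i:j])
            best = max(ENTITY_TEMPLATES,
                       key=lambda ty: template_score(
                           paragraph, ENTITY_TEMPLATES[ty].format(fragment)))
            if best != "non-entity":
                results.append((fragment, best))   # keep fragments typed as entities
    return results
```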
The following briefly describes a method of training a policy information recognizer:
Pairs of a text segment $s_i$ and its corresponding real label word $v(y_i)$ serve as model training samples, and the sample data set is randomly divided into a training set and a validation set at a ratio of 7:3. For the data in the training set, if a text segment $s_i$ has entity type $y_i$, then $s_i$ and $v(y_i)$ are filled into the first and second gaps, respectively, of the sentence template $t_k$ of that entity type. If the text segment $s_i$ is not an entity, $s_i$ is filled into the non-entity sentence template $t_k$, likewise yielding a filled sentence. All entity samples in the training set are used to fill the entity-type templates, while the non-entity templates are filled by random sampling among the remaining non-entity text segments, the ratio of the two preferably being 1:1.5; this increases the interference of the non-entity sentences against the entity sentences during training and thereby further improves the key information extraction precision of the policy information recognizer.
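The construction of these filled training sentences can be sketched as follows; the function name, the data layout (span, label-word-or-None pairs), and the reading of the preferred ratio as 1:1.5 positives to negatives are assumptions for illustration:

import random

def build_training_sentences(samples, entity_templates, non_entity_template,
                             neg_ratio=1.5, seed=0):
    """samples: list of (span, label_word or None); returns filled sentences.
    All entity spans are used; non-entity spans are subsampled at ~1:1.5."""
    random.seed(seed)
    positives, negatives = [], []
    for span, label_word in samples:
        if label_word is not None:                 # labelled entity span
            positives.append(entity_templates[label_word].format(span=span))
        else:                                      # non-entity distractor
            negatives.append(non_entity_template.format(span=span))
    k = min(len(negatives), int(len(positives) * neg_ratio))
    return positives + random.sample(negatives, k)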
It is emphasized that, in the present application, the language model $N$ is preferably a BART model. The principle by which the BART model computes the score $f(t_{y,s})$ of a filled sentence template is as follows:
Given a policy text paragraph $x$ and the sentence template set $T$, $x$ is input into the encoder of the BART model to obtain the feature representation $h^{enc}$ of the paragraph $x$. At each step of the BART decoder, $h^{enc}$ and the previous decoder outputs $t_{1:c-1}$ together form the input of the current step, and an attention mechanism yields the feature representation $h_c^{dec}$ of the current step. After a linear transformation of this feature representation, a softmax function gives the conditional probability of the word $t_c$ output at the current step (that is, the probability distribution of the $c$-th item given the preceding $c-1$ items and the input paragraph):
$$P(t_c \mid t_{1:c-1}, x) = \mathrm{softmax}\big(W h_c^{dec} + b\big),$$
where $W$ and $b$ are model parameters.
When training the BART model, a cross-entropy loss function measures the difference between the decoder output and the real (gold) template; this difference serves as the basis for adjusting the model parameters, and iterative training continues after each adjustment until the model convergence condition is reached.
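A single fine-tuning step of this procedure can be sketched as follows, assuming the Hugging Face transformers convention that passing labels to a BART model returns the token-level cross-entropy loss; the optimizer choice and all names are illustrative:

import torch
from torch.optim import AdamW

def training_step(model, tokenizer, paragraph, gold_filled_template, optimizer):
    """One gradient step: cross-entropy between decoder output and gold template."""
    src = tokenizer(paragraph, return_tensors="pt")
    tgt = tokenizer(gold_filled_template, return_tensors="pt").input_ids
    loss = model(input_ids=src.input_ids,
                 attention_mask=src.attention_mask,
                 labels=tgt).loss                   # token-level cross-entropy
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# usage sketch: optimizer = AdamW(model.parameters(), lr=3e-5), then loop over
# the filled sentences of the training set until the convergence condition.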
The prompt-learning-based policy information extraction method provided by the present application achieves excellent recognition on small-scale data sets. To verify its performance when the training data set is small, the present application also designed several policy information recognizers based on the pre-training-fine-tuning paradigm and compared them on the same data set. The specific method, shown in fig. 4, comprises the following steps:
In the distributed feature representation part for the input data of the policy information recognizer, word-level and character-level distributed feature representations are used simultaneously: the word-level distributed representation of each word is produced by a word vector model pre-trained on a large-scale mixed-domain corpus, and the character-level distributed representation of each character is produced by a pre-trained Chinese RoBERTa model. Since the process by which the word vector model and the Chinese RoBERTa model produce distributed feature representations of the input data is not within the scope of protection claimed in the present application, it is not described in detail.
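As an illustration of the character-level branch only, the sketch below extracts one contextual vector per character from a pretrained Chinese RoBERTa; the checkpoint name "hfl/chinese-roberta-wwm-ext" is an assumption for illustration, and the word-level branch would analogously look up pretrained word vectors and be concatenated with these per token:

import torch
from transformers import AutoTokenizer, AutoModel

tok = AutoTokenizer.from_pretrained("hfl/chinese-roberta-wwm-ext")
roberta = AutoModel.from_pretrained("hfl/chinese-roberta-wwm-ext").eval()

def char_level_features(text: str) -> torch.Tensor:
    """Character-level distributed representation: one contextual vector per
    character, taken from the last hidden layer ([CLS]/[SEP] dropped)."""
    enc = tok(text, return_tensors="pt")
    with torch.no_grad():
        hidden = roberta(**enc).last_hidden_state   # (1, n_chars + 2, 768)
    return hidden[0, 1:-1]                          # (n_chars, 768)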
The context encoding layer of the policy information recognizer takes the output of the distributed representation layer and further models the text semantics and the dependencies between words. In this embodiment, three alternative models are used, namely a multilayer perceptron, a Transformer, and a Flat-Lattice Transformer; their structures and construction methods are briefly described as follows:
The context encoding layer based on the multilayer perceptron adopts a structure of linear layer, ReLU activation layer, and linear layer in sequence.
The Transformer-based context encoding layer uses a Transformer encoder to feature-encode the text.
The context encoding layer based on the Flat-Lattice Transformer (FLAT) uses the FLAT variant of the Transformer. It uses the character-level and word-level distributed representations of the text simultaneously, and it extends the position encoding of the Transformer by introducing the relative positions of the heads and tails of the characters and words of the text, which alleviates the problem of highly variable entity lengths in policy documents. The relative position encoding of text spans in FLAT is calculated as expressed by the following equation (9):
$$R_{ij} = \mathrm{ReLU}\Big(W_r\big(p_{d_{ij}^{(hh)}} \oplus p_{d_{ij}^{(ht)}} \oplus p_{d_{ij}^{(th)}} \oplus p_{d_{ij}^{(tt)}}\big)\Big), \qquad (9)$$
where $d_{ij}^{(hh)} = head[i] - head[j]$, $d_{ij}^{(ht)} = head[i] - tail[j]$, $d_{ij}^{(th)} = tail[i] - head[j]$, and $d_{ij}^{(tt)} = tail[i] - tail[j]$.
In equation (9), $head[i]$ and $tail[i]$ respectively denote the position indices, in the original character sequence, of the first and last characters of the $i$-th text span. For example, in the text "政策有效期3年" ("the policy validity period is 3 years"), the word "政策" ("policy") has head 1 and tail 2, while the single character "政" has head and tail both equal to 1. $W_r$ is a learnable parameter, and $p_d$ comprises the components $p_d^{(2k)}$ and $p_d^{(2k+1)}$, which are calculated by the following equations (10) and (11):
$$p_d^{(2k)} = \sin\big(d / 10000^{2k/d_{model}}\big) \qquad (10)$$
$$p_d^{(2k+1)} = \cos\big(d / 10000^{2k/d_{model}}\big) \qquad (11)$$
In equations (10) and (11), $d$ is any one of $d_{ij}^{(hh)}$, $d_{ij}^{(ht)}$, $d_{ij}^{(th)}$, and $d_{ij}^{(tt)}$, and $d_{model}$ denotes the vector length of the model input.
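A compact sketch of this span relative position encoding, following equations (9) to (11); for readability the sine and cosine components are concatenated blockwise rather than interleaved at even and odd indices, and all names are illustrative:

import torch

def flat_relative_position(head, tail, d_model=128):
    """head, tail: lists of head/tail character indices of each span.
    Returns R with shape (n, n, d_model), cf. equation (9)."""
    n = len(head)
    h = torch.tensor(head, dtype=torch.float).view(n, 1)
    t = torch.tensor(tail, dtype=torch.float).view(n, 1)
    # the four relative distances d^(hh), d^(ht), d^(th), d^(tt)
    dists = [h - h.T, h - t.T, t - h.T, t - t.T]

    def sinusoid(d):                                # equations (10) and (11)
        k = torch.arange(0, d_model, 2, dtype=torch.float)
        angle = d.unsqueeze(-1) / torch.pow(10000.0, k / d_model)
        return torch.cat([torch.sin(angle), torch.cos(angle)], dim=-1)

    p = torch.cat([sinusoid(d) for d in dists], dim=-1)       # (n, n, 4*d_model)
    W_r = torch.nn.Linear(4 * d_model, d_model, bias=False)   # learnable W_r
    return torch.relu(W_r(p))                                 # R_ij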
The decoding layer of the policy information recognizer uses a conditional random field (CRF) model; decoding uses the dynamic-programming-based Viterbi algorithm for higher decoding efficiency, and the model is optimized with the CRF loss function.
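A minimal sketch of the dynamic-programming Viterbi decoder used at this layer; the emission scores are assumed to come from the context encoding layer and the transition scores from the learned CRF parameters:

import torch

def viterbi_decode(emissions: torch.Tensor, transitions: torch.Tensor):
    """emissions: (seq_len, num_tags); transitions: (num_tags, num_tags).
    Returns the highest-scoring tag sequence as a list of tag indices."""
    seq_len, _ = emissions.shape
    score = emissions[0].clone()       # best score of paths ending in each tag
    backpointers = []
    for step in range(1, seq_len):
        # score[i] + transitions[i, j] + emissions[step, j] for all (i, j)
        total = score.unsqueeze(1) + transitions + emissions[step].unsqueeze(0)
        score, best_prev = total.max(dim=0)
        backpointers.append(best_prev)
    best_tag = int(score.argmax())
    path = [best_tag]
    for bp in reversed(backpointers):  # follow backpointers to recover the path
        best_tag = int(bp[best_tag])
        path.append(best_tag)
    return list(reversed(path))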
The table below compares, when the labelled training data set is small, the extraction performance of the pre-training-fine-tuning-based policy information recognizers and of the prompt-learning-based policy information recognizer provided by the embodiment of the present invention on the 7 categories of policy information shown in fig. 1, namely policy name, policy number, release area, establishment department, execution department, release time, and execution period; the evaluation index is the F1 score on the test set. Table b shows that the language model $N$ trained in this embodiment performs better on a small-scale training data set than the policy information recognizers trained by the other methods, demonstrating its superiority in recognizing policy key information when labelled training data are scarce.
[Table b: F1 scores of each recognizer on the test set; the original table is an image and is not reproduced here.]
To sum up, as shown in fig. 5, the policy text analysis method based on policy text classification and key information identification according to the embodiment of the present invention comprises the steps of:
S1, a pre-trained policy text classifier predicts and outputs the type of an input paragraph $x$;
S2, a pre-trained policy information recognizer further extracts key information at the entity level from each classified paragraph $x$.
More specifically, as shown in fig. 6, the method by which the policy text classifier predicts the type of a paragraph $x$ specifically comprises the steps of:
S11, for a paragraph $x$ in a given policy document, a template function $T(\cdot)$ converts $x$ into the input $x_{prompt}$ of the language model $M$; $x_{prompt}$ adds to the original paragraph $x$ a prompt for the classification task, the prompt containing a mask position at which a label must be predicted and filled in;
S12, the language model $M$ predicts the label $\hat{y}$ that fills the mask position;
S13, a label converter $v(\cdot)$ maps the label $\hat{y}$ to the corresponding label word $v(\hat{y})$ in the label word set $V$ of the pre-constructed policy document element system, which is taken as the predicted type of the paragraph $x$.
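Steps S11 to S13 can be sketched with a masked language model as follows; the prompt wording, the two-character label words in the verbalizer, and the checkpoint "bert-base-chinese" are illustrative assumptions only:

import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tok = AutoTokenizer.from_pretrained("bert-base-chinese")
mlm = AutoModelForMaskedLM.from_pretrained("bert-base-chinese").eval()

# assumed verbalizer: label word at the mask -> paragraph type
VERBALIZER = {"目标": "policy objective", "资金": "fund management"}

def classify_paragraph(paragraph: str) -> str:
    # S11: template function T(x) appends a cloze-style prompt with mask slots
    prompt = paragraph + "。这段话属于[MASK][MASK]类。"
    enc = tok(prompt, return_tensors="pt", truncation=True)
    mask_pos = (enc.input_ids[0] == tok.mask_token_id).nonzero().squeeze(-1)
    with torch.no_grad():
        logits = mlm(**enc).logits[0]               # (seq_len, vocab_size)

    # S12/S13: score each candidate label word at the mask positions, then map
    # the best label back to its element-system type via the verbalizer
    def word_score(word):
        ids = tok.convert_tokens_to_ids(list(word))
        return sum(logits[p, i].item() for p, i in zip(mask_pos.tolist(), ids))

    best = max(VERBALIZER, key=word_score)
    return VERBALIZER[best]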
More specifically, as shown in fig. 7, in step S2 the method by which the policy information recognizer extracts the key information in each paragraph $x$ comprises the steps of:
S21, a sentence template set $T'$, the label word set $V'$ used for entity identification in the policy document element system, and a language model $N$ are defined, the label set used for entity identification being $Y'$. The sentence template set $T'$ contains sentence templates $t_k$ of the entity types and of the non-entity type; each sentence template $t_k$ contains two gaps to be filled with words, of which the first gap takes a text segment intercepted from the input paragraph $x$ and the second gap takes the category label that classifies the intercepted text segment; each label $y$ in the label set $Y'$ has a mapping relationship with a label word $v(y)$ in the label word set $V'$;
S22, each text segment intercepted from the paragraph $x$ is filled into every sentence template $t_k$ of the sentence template set $T'$, and the language model $N$ then computes, for each sentence template $t_k$ filled with a text segment, the probability score with which each label $y$ in the label set $Y'$ fills the second gap, as expressed by the following equation (12):
$$f(t_{y,s}) = \sum_{c=1}^{|t|} \log P\big(t_c \mid t_{1:c-1}, x\big) \qquad (12)$$
In equation (12), $t_{y,s}$ denotes the sentence obtained by filling the candidate text segment $s$ and the label $y$ into the sentence template $t_k$; $|t|$ denotes the sequence length of the sentence $t_{y,s}$; $t_c$ denotes the $c$-th item of the word sequence of $t_{y,s}$; $t_{1:c-1}$ denotes the 1st to $(c-1)$-th items of that word sequence; $x$ denotes the text sequence of the paragraph input to the language model $N$; and $P(t_c \mid t_{1:c-1}, x)$ denotes the probability that the model predicts the $c$-th item to be $t_c$ given the input text $x$ and the first $c-1$ items of the word sequence of the sentence template, computed by the pre-trained generative language model;
S23, the text segment filling the highest-scoring sentence $t_{y,s}$ is taken as a key information entity, and the label word $v(y)$ to which its corresponding type label $y$ maps is taken as the corresponding entity type; together these constitute the key information of the paragraph $x$.
The invention has the following beneficial effects:
1. A complete policy document element system is constructed, clearly dividing the different elements in a policy document. Based on this system, the classification of each paragraph type in a policy document and entity-level key information extraction from text paragraphs can subsequently be realized more accurately.
2. By appending to the original paragraph $x$ a prompt for the classification task that contains a mask position at which a label must be predicted and filled in, the method converts the paragraph classification problem into a cloze-style classification prediction problem and simplifies the paragraph classification process. Based on the constructed complete policy document element system, the policy document text can be analysed more accurately from the perspectives of content composition and document structure, deeper information can be mined, and performance remains excellent when the labelled training data set is small.
3. The provided policy information recognizer simplifies the difficulty of text entity recognition by predicting the label contents of the two gaps under the constructed policy document element system; based on this element system, it can extract useful key information from the text more accurately, and it performs excellently when the labelled training data set is small.
It should be understood that the above-described embodiments are merely preferred embodiments of the invention and of the technical principles applied therein. Those skilled in the art will understand that various modifications, equivalents, and changes may be made to the present invention; such variations remain within the scope of the invention as long as they do not depart from its spirit. In addition, certain terms used in the specification and claims of the present application are not limiting and are used merely for convenience of description.

Claims (10)

1. A policy text analysis method based on policy text classification and key information identification is characterized by comprising the following steps:
S1, a pre-trained policy text classifier predicts and outputs the type of an input paragraph $x$;
S2, a pre-trained policy information recognizer further extracts key information at the entity level from each classified paragraph $x$.
2. The policy text analysis method based on policy text classification and key information identification according to claim 1, wherein in step S1 the method by which the policy text classifier predicts the type of the paragraph $x$ specifically comprises the steps of:
S11, for a paragraph $x$ in a given policy document, converting $x$ by a template function $T(\cdot)$ into the input $x_{prompt}$ of a language model $M$, wherein $x_{prompt}$ adds to the original paragraph $x$ a prompt for the classification task, the prompt containing a mask position at which a label must be predicted and filled in;
S12, the language model $M$ predicting the label $\hat{y}$ that fills the mask position;
S13, a label converter $v(\cdot)$ mapping the label $\hat{y}$ to the corresponding label word $v(\hat{y})$ in the label word set $V$ of a pre-constructed policy document element system, which is obtained as the predicted type of the paragraph $x$.
3. The policy text analysis method based on policy text classification and key information identification according to claim 2, wherein the method of training the language model $M$ comprises the steps of:
A1, for each $x_{prompt}$ serving as a training sample, calculating the probability score $s(y \mid x_{prompt})$ with which each label word $v(y)$ in the label word set $V$ fills the mask position;
A2, calculating the probability distribution $q(y \mid x_{prompt})$ through a softmax function;
A3, calculating the model prediction loss from $s(y \mid x_{prompt})$ and $q(y \mid x_{prompt})$ using the constructed loss function;
A4, judging whether the termination condition of model iterative training is reached:
if yes, terminating the iteration and outputting the language model $M$;
if not, adjusting the model parameters and returning to step A1 to continue the iterative training.
4. The policy text analysis method based on policy text classification and key information identification according to claim 2, wherein the language model $M$ is a fusion language model formed by fusing a plurality of language sub-models $M_j$, and the method of training the fusion language model comprises the steps of:
B1, defining a template function set $\mathcal{T}$, the template function set $\mathcal{T}$ comprising a plurality of different template functions $T_j(\cdot)$;
B2, for each $x_{prompt,j} = T_j(x)$ serving as a training sample, calculating through the corresponding language sub-model $M_j$ the probability score $s_j(y \mid x_{prompt,j})$ with which each label word $v(y)$ in the label word set $V$ fills the mask position;
B3, fusing the scores $s_j$ associated with the respective template functions $T_j(\cdot)$ to obtain the fused score $\bar{s}(y \mid x)$;
B4, calculating the probability distribution $q(y \mid x)$ through a softmax function;
B5, calculating the model prediction loss from $\bar{s}(y \mid x)$ and $q(y \mid x)$ using the constructed loss function;
B6, judging whether the termination condition of model iterative training is reached:
if yes, terminating the iteration and outputting the fusion language model;
if not, adjusting the model parameters and returning to step B2 to continue the iterative training.
5. The policy text analysis method based on policy text classification and key information identification according to claim 4, wherein the language model $M$ or the language sub-model $M_j$ is a BERT language model.
6. The policy text analysis method based on policy text classification and key information identification according to claim 1, wherein the policy document element system comprises sentence-level elements and entity-level elements, the sentence-level elements comprising any one or more of the 27 sub-categories under the 8 major categories of policy objective, application review, policy tool-supply type, policy tool-environment type, policy tool-demand type, fund management, supervision evaluation, and admission condition,
wherein the policy tool-supply type category comprises any one or more of the 4 sub-categories of talent cultivation, fund support, technical support, and public service;
the policy tool-environment type category comprises any one or more of the 6 sub-categories of regulation and control, target planning, tax incentives, financial support, organization and construction, and policy promotion;
the policy tool-demand type category comprises any one or more of the 3 sub-categories of government procurement, company cooperation, and overseas cooperation;
the supervision evaluation category comprises one or both of the 2 sub-categories of supervision management and assessment evaluation;
the fund management category comprises one or both of the 2 sub-categories of fund sources and management principles.
7. The policy text analysis method based on policy text classification and key information identification according to any one of claims 1 to 6, wherein in step S2 the method by which the policy information recognizer further extracts, at the entity level, the key information in each paragraph $x$ comprises the steps of:
S21, defining a sentence template set $T'$, the label word set $V'$ used for entity identification in the policy document element system, and a language model $N$, the label set used for entity identification being $Y'$; the sentence template set $T'$ contains sentence templates $t_k$ of the entity types and of the non-entity type; each sentence template $t_k$ contains two gaps to be filled with words, wherein the first gap takes a text segment intercepted from the input paragraph $x$ and the second gap takes the category label that classifies the intercepted text segment; each label $y$ in the label set $Y'$ has a mapping relationship with a label word $v(y)$ in the label word set $V'$;
S22, filling each text segment intercepted from the paragraph $x$ and the label word $v(y)$ corresponding to each label $y$ respectively into the first gap and the second gap of every sentence template $t_k$ of the sentence template set $T'$, and then calculating the probability score $f(t_{y,s})$ of each filled sentence using the language model $N$, as expressed by the following formula (1):
$$f(t_{y,s}) = \sum_{c=1}^{|t|} \log P\big(t_c \mid t_{1:c-1}, x\big) \qquad (1)$$
in formula (1), $t_{y,s}$ denotes the sentence obtained by filling the candidate text segment $s$ and the label $y$ into the sentence template $t_k$; $|t|$ denotes the sequence length of the sentence $t_{y,s}$; $t_c$ denotes the $c$-th item of the word sequence of $t_{y,s}$; $t_{1:c-1}$ denotes the 1st to $(c-1)$-th items of the word sequence of $t_{y,s}$; $x$ denotes the text sequence of the paragraph input to the language model $N$; and $P(t_c \mid t_{1:c-1}, x)$ denotes the probability that the model predicts the $c$-th item to be $t_c$ given the input text $x$ and the first $c-1$ items of the word sequence of the sentence template, computed by the pre-trained language model $N$;
S23, taking the text segment filling the highest-scoring sentence $t_{y,s}$ as a key information entity, and taking the label word $v(y)$ to which its corresponding type label $y$ maps as the corresponding entity type, which together constitute the key information of the paragraph $x$.
8. The policy text analysis method based on policy text classification and key information identification according to claim 7, wherein the language model $N$ is a BART model.
9. The policy text analysis method based on policy text classification and key information identification according to claim 3, wherein in step A1 the probability score $s(y \mid x_{prompt})$ is expressed by the following formula (2):
$$s(y \mid x_{prompt}) = M\big(v(y) \mid x_{prompt}\big) \qquad (2)$$
and the probability distribution $q(y \mid x_{prompt})$ is calculated by the softmax function (3):
$$q(y \mid x_{prompt}) = \frac{\exp s(y \mid x_{prompt})}{\sum_{y' \in Y} \exp s(y' \mid x_{prompt})} \qquad (3)$$
in formulas (2) and (3), $v(y)$ represents the label word in the label word set $V$ having a mapping relationship with the label $y$; $Y$ represents the label set of the text classification task;
the constructed loss function is expressed by the following formula (4):
$$\mathcal{L} = \lambda\,\mathcal{L}_{CE} + (1-\lambda)\,\mathcal{L}_{MSE} \qquad (4)$$
in formula (4), $\lambda$ represents a fine-tuning coefficient; $\mathcal{L}_{CE}$ represents the difference between the model-predicted distribution $q(y \mid x_{prompt})$ and the true distribution; $\mathcal{L}_{MSE}$ represents the difference between the model-predicted score $s(y \mid x_{prompt})$ and the true score.
10. The policy text analysis method based on policy text classification and key information identification according to claim 4, wherein the probability score $s_j(y \mid x_{prompt,j})$ is expressed by the following formula (5):
$$s_j(y \mid x_{prompt,j}) = M_j\big(v(y) \mid x_{prompt,j}\big) \qquad (5)$$
and the fused score $\bar{s}(y \mid x)$ is obtained by fusion according to the following formula (6):
$$\bar{s}(y \mid x) = \sum_{j=1}^{n} w_j\, s_j\big(y \mid T_j(x)\big) \qquad (6)$$
in formula (6), $n$ represents the number of template functions $T_j(\cdot)$ in the template function set $\mathcal{T}$; $w_j$ represents the weight of the template function $T_j(\cdot)$ in the calculation of $\bar{s}(y \mid x)$;
the probability distribution $q(y \mid x)$ is calculated by the softmax function (7):
$$q(y \mid x) = \frac{\exp \bar{s}(y \mid x)}{\sum_{y' \in Y} \exp \bar{s}(y' \mid x)} \qquad (7)$$
in formulas (5), (6) and (7), $v(y)$ represents the label word in the label word set $V$ having a mapping relationship with the label $y$; $Y$ represents the label set of the text classification task;
the constructed loss function is expressed by the following formula (8):
$$\mathcal{L} = \lambda\,\mathcal{L}_{CE} + (1-\lambda)\,\mathcal{L}_{MSE} \qquad (8)$$
in formula (8), $\lambda$ represents a fine-tuning coefficient; $\mathcal{L}_{CE}$ represents the difference between the model-predicted distribution $q(y \mid x)$ and the true distribution; $\mathcal{L}_{MSE}$ represents the difference between the model-predicted score $\bar{s}(y \mid x)$ and the true score.
CN202211229194.3A 2022-10-08 2022-10-08 Policy text analysis method based on policy text classification and key information identification Active CN115310425B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211229194.3A CN115310425B (en) 2022-10-08 2022-10-08 Policy text analysis method based on policy text classification and key information identification

Publications (2)

Publication Number Publication Date
CN115310425A 2022-11-08
CN115310425B (en) 2023-01-03

Also Published As

Publication number Publication date
CN115310425B (en) 2023-01-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant