CN115455155B - Method for extracting subject information of government affair text and storage medium - Google Patents

Method for extracting subject information of government affair text and storage medium

Info

Publication number
CN115455155B
CN115455155B
Authority
CN
China
Prior art keywords
information
keywords
model
government affair
text
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211402800.7A
Other languages
Chinese (zh)
Other versions
CN115455155A (en)
Inventor
赵习枝
仇阿根
张福浩
罗宁
朱鹏
陶坤旺
方美丽
陈才
郑佳荣
陈颂
刘尚钦
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chinese Academy of Surveying and Mapping
Original Assignee
Chinese Academy of Surveying and Mapping
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chinese Academy of Surveying and Mapping filed Critical Chinese Academy of Surveying and Mapping
Priority to CN202211402800.7A priority Critical patent/CN115455155B/en
Publication of CN115455155A publication Critical patent/CN115455155A/en
Application granted granted Critical
Publication of CN115455155B publication Critical patent/CN115455155B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/335Filtering based on additional data, e.g. user or group profiles
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method for extracting subject information of government affair text, and a storage medium. The method first preprocesses unstructured government affair text data; vector extraction is then performed on the preprocessed text data using a MacBERT model; semantic information in sentences is captured through a BiGRU model to obtain high-level feature vectors of the keywords; finally, keyword importance is calculated, the keywords are sorted in descending order of importance, and the keywords with higher importance are selected as the subject-information keywords, realizing subject information extraction from the government affair text. The invention combines the MacBERT model and the BiGRU model to extract subject information from unstructured government affair text data, which not only reduces the overfitting risk of the models but also extracts high-level keyword features well, obtains more accurate subject-information keywords, and helps government departments to quickly mine and analyze unstructured texts.

Description

Method for extracting subject information of government affair text and storage medium
Technical Field
The invention relates to the technical field of natural language processing, in particular to a method for extracting subject information of government affair texts and a storage medium.
Background
Government affair big data refers to data owned and managed by the government. It has wide sources and various forms, specifically including (but not limited to) natural information, district construction, district health-management statistical monitoring, and service and civil consumption data. At present, the quantity of unstructured government affair data keeps increasing; its data structure is irregular or incomplete, it has no predefined data model, and it is difficult to represent with a two-dimensional database logic table. How to extract the subject information of such government affair data quickly and efficiently has therefore become a technical problem that urgently needs to be solved.
By utilizing natural language processing technology from the field of artificial intelligence, the subject information in government affair data can be extracted, realizing mining and analysis of unstructured text. For example, for a document of the General Office of the Shanghai Municipal People's Government on issuing a work plan for the safety treatment of privately built self-built houses in Shanghai, the file is analyzed with a subject information extraction model, the general characteristics of subject expression in the text are analyzed, and the subject-information keywords "self-built house", "privately built", "investigation", "treatment", "elimination", "potential safety hazard", "strengthened guarantee", and "supervision and guidance" are finally obtained. Subject information extraction from government affair text thus enables fast text understanding.
Disclosure of Invention
Aiming at the problem of irregular data structures in unstructured government affair text data, the invention provides a method for extracting the subject information of government affair text, which can effectively extract the subject information and realize fast text understanding.
In order to achieve the purpose, the invention adopts the following technical scheme:
a method for extracting subject information of government affair texts comprises the following steps:
data preprocessing step S110:
preprocessing unstructured government affair text data, wherein the preprocessing comprises filtering out irrelevant information and performing word segmentation processing on the text data;
text feature vector extraction and processing step S120:
extracting word vectors of the preprocessed government affair text information data by adopting a MacBERT model to obtain keyword feature vectors, then taking the keyword feature vectors as input, capturing semantic information in sentences through a BiGRU model, and optimizing the feature vectors to obtain high-level feature vectors of the keywords;
subject information obtaining step S130: receiving the high-level feature vectors of the keywords extracted in step S120, calculating the importance of the keywords, sorting the keywords in descending order of importance, and selecting the keywords with high importance as the subject-information keywords, thereby realizing subject information extraction from the government affair text.
Optionally, the preprocessing specifically includes: deleting punctuation marks and spaces, introducing a domain dictionary into the government affair text data, performing word segmentation on the data, filtering stop words using a general stop-word list, and removing the corresponding stop words from the segmented government affair text data.
Optionally, the government affair text information data includes unstructured government affair text data, specifically: natural-language text describing information such as district construction and district health-management statistical monitoring.
Optionally, in step S120, the BiGRU model is a bidirectional improved recurrent neural network.
Optionally, the BiGRU model includes a forward GRU model $\overrightarrow{h_t}$ and a reverse GRU model $\overleftarrow{h_t}$, where the forward GRU model reads the keyword feature vectors in the forward direction and the reverse GRU model reads the keyword feature vectors in the backward direction. Each GRU unit is controlled by an update gate $z_t$ and a reset gate $r_t$, and the information propagation process inside the GRU model is as follows:

$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$$
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$$
$$\tilde{h}_t = \tanh(W_{\tilde{h}} \cdot [r_t \odot h_{t-1}, x_t])$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where $x_t$ is the input vector, $W_r$ is the weight matrix of the reset gate $r_t$, $W_z$ is the weight matrix of the update gate $z_t$, $W_{\tilde{h}}$ is the weight matrix of the present information $\tilde{h}_t$, $\odot$ denotes element-wise multiplication, $\sigma$ is the sigmoid function, and $\tanh$ is the hyperbolic tangent function. The present information $\tilde{h}_t$ is decided jointly by the past information $h_{t-1}$ and the current input $x_t$; $h_t$ is the information output at time $t$, combining the past information $h_{t-1}$ and the present information $\tilde{h}_t$. The update gate $z_t$ controls how much historical information is forgotten and how much new information is accepted in the current state; the reset gate $r_t$ controls how much information in the candidate state is obtained from the historical information.

Finally, the output $h_t$ of the BiGRU model is defined by the following equation:

$$h_t = w_t \overrightarrow{h_t} + v_t \overleftarrow{h_t} + b_t$$

where $\overrightarrow{h_t}$ is the output of the forward GRU model, $\overleftarrow{h_t}$ is the output of the reverse GRU model, $w_t$ denotes the weight corresponding to $\overrightarrow{h_t}$ at time $t$, $v_t$ denotes the weight corresponding to $\overleftarrow{h_t}$, and $b_t$ denotes the bias term corresponding to $h_t$ at time $t$.
Optionally, in step S120, the MacBERT model is used to extract word vectors, the extracted word vectors are passed through the bidirectional GRU model to extract context features, and the high-level feature vectors of the keywords are generated by concatenation.
Optionally, the subject-information keyword importance P is obtained by a sigmoid function, where 0 < P < 1:

$$P = \sigma(W_P h_t + b_P)$$

where $W_P$ is the weight matrix of $h_t$ and $b_P$ is the bias term of $h_t$.
Optionally, the importance P of each subject-information keyword is sorted in descending order, and the first eight keywords are selected as the subject-information keywords.
The invention further discloses a storage medium for storing computer-executable instructions, wherein:
the computer-executable instructions, when executed by a processor, perform the above method for extracting subject information of government affair text.
Compared with the prior art, the method for extracting the subject information of government affair text according to the invention has the following advantages:
1) The invention adopts the MacBERT model, which can obtain keyword feature vectors and solves the problem of insufficient local feature extraction capability.
2) Because the invention adopts the BiGRU model, semantic information in sentences can be captured and high-level feature vectors of the keywords obtained; the text information is used effectively, and parallel computation is adopted, greatly improving the efficiency of subject information extraction.
3) By fusing the MacBERT model and the BiGRU model, the invention improves on the extraction effect of a single model, further improves the accuracy of subject information extraction, and reduces the overfitting risk of the model.
Drawings
Fig. 1 is a basic flowchart of a method for extracting subject information of a government affair text and a storage medium according to an embodiment of the present invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples. It is to be understood that the specific embodiments described herein are merely illustrative of the invention and are not to be construed as limiting the invention. It should be further noted that, for the convenience of description, only some of the structures related to the present invention are shown in the drawings, not all of the structures.
The invention is characterized in that a MacBERT model (MLM as correction BERT, a variant of Bidirectional Encoder Representations from Transformers) and a BiGRU model (Bidirectional Gated Recurrent Unit) are combined to extract the subject information of unstructured government affair text data. First, a MacBERT layer extracts word vectors to obtain keyword feature vectors; then a BiGRU layer captures semantic information in sentences and extracts high-level feature vectors of the keywords, making the features more discriminative; finally, keyword importance is calculated, the keywords are sorted in descending order of importance, and the keywords with higher importance are selected as the subject-information keywords, realizing subject information extraction from the government affair text.
Referring to fig. 1, a basic flowchart of a method for extracting subject information of a government affairs text and a storage medium according to an embodiment of the present invention is shown.
Data preprocessing step S110:
preprocessing unstructured government affair text data, wherein the preprocessing comprises filtering out irrelevant information and performing word segmentation processing on the text data;
specifically, the pretreatment specifically comprises: deleting punctuation marks, blank spaces and the like, introducing a field dictionary into the government affair text data, performing word segmentation processing on the data, filtering stop words by using a general stop word library, and removing corresponding stop words in the government affair text data after word segmentation.
Specifically, in step S110, the unstructured government affair text data includes natural-language text describing information such as district construction and district health-management statistical monitoring.
Of course, the invention is not limited thereto, and the processing method of the invention can be applied to other government affair text information.
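As an illustration, preprocessing step S110 can be sketched as follows. This is a minimal, self-contained sketch, not the patent's implementation: the helper name and the stop-word list are hypothetical, and real Chinese government-affair text would be segmented with a dictionary-aware word segmenter (e.g. jieba loaded with the domain dictionary) rather than the whitespace split used here.

```python
import re

# Stand-in for a general stop-word list (hypothetical).
STOP_WORDS = {"the", "of", "and", "a"}

def preprocess(text: str) -> list[str]:
    # 1) delete punctuation marks (irrelevant information)
    cleaned = re.sub(r"[^\w\s]", " ", text)
    # 2) word segmentation (placeholder: whitespace split instead of jieba)
    tokens = cleaned.lower().split()
    # 3) remove stop words using the general stop-word list
    return [t for t in tokens if t not in STOP_WORDS]

result = preprocess("Treatment of the self-built houses, and supervision!")
print(result)
```

The same three stages (cleaning, segmentation, stop-word filtering) apply unchanged when a real segmenter replaces the whitespace split.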
Text feature vector extraction and processing step S120:
and performing word vector extraction on the preprocessed government affair text information data, such as unstructured government affair text data, by using a MacBERT model to obtain a keyword feature vector, capturing semantic information in a sentence by using a BiGRU model by taking the keyword feature vector as input, and optimizing the feature vector to obtain a high-level feature vector of the keyword.
Specifically, in step S120, the MacBERT model obtains the keyword feature vectors and solves the problem of insufficient local feature extraction capability.
Specifically, in step S120, the BiGRU model is a bidirectional improved recurrent neural network, including a forward GRU model $\overrightarrow{h_t}$ and a reverse GRU model $\overleftarrow{h_t}$, where the forward GRU model reads the keyword feature vectors in the forward direction and the reverse GRU model reads the keyword feature vectors in the backward direction. Each GRU unit is controlled by an update gate $z_t$ and a reset gate $r_t$, and the information propagation process inside the GRU model is as follows:

$$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$$
$$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$$
$$\tilde{h}_t = \tanh(W_{\tilde{h}} \cdot [r_t \odot h_{t-1}, x_t])$$
$$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$$

where $x_t$ is the input vector, $W_r$ is the weight matrix of the reset gate $r_t$, $W_z$ is the weight matrix of the update gate $z_t$, $W_{\tilde{h}}$ is the weight matrix of the present information $\tilde{h}_t$, $\odot$ denotes element-wise multiplication, $\sigma$ is the sigmoid function, and $\tanh$ is the hyperbolic tangent function. The present information $\tilde{h}_t$ is decided jointly by the past information $h_{t-1}$ and the current input $x_t$; $h_t$ is the information output at time $t$, combining the past information $h_{t-1}$ and the present information $\tilde{h}_t$. The update gate $z_t$ controls how much historical information needs to be forgotten and how much new information needs to be accepted in the current state, which helps capture long-term dependencies in the sequence. The reset gate $r_t$ controls how much information in the candidate state is obtained from the historical information, which helps capture short-term dependencies in the sequence.

Finally, the output $h_t$ of the BiGRU model is defined by the following equation:

$$h_t = w_t \overrightarrow{h_t} + v_t \overleftarrow{h_t} + b_t$$

where $\overrightarrow{h_t}$ is the output of the forward GRU model, $\overleftarrow{h_t}$ is the output of the reverse GRU model, $w_t$ denotes the weight corresponding to $\overrightarrow{h_t}$ at time $t$, $v_t$ denotes the weight corresponding to $\overleftarrow{h_t}$, and $b_t$ denotes the bias term corresponding to $h_t$ at time $t$.
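The gated propagation described above can be sketched numerically. The following is a toy NumPy implementation of a single GRU step under the standard update-gate/reset-gate formulation; the function name, parameter names, and dimensions are illustrative assumptions, not values from the patent, and the weights are random rather than trained.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_r, W_z, W_h):
    """One GRU step; each weight matrix acts on a concatenation with x_t."""
    hx = np.concatenate([h_prev, x_t])
    r_t = sigmoid(W_r @ hx)                                       # reset gate
    z_t = sigmoid(W_z @ hx)                                       # update gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate
    return (1.0 - z_t) * h_prev + z_t * h_tilde                   # blend past/present

rng = np.random.default_rng(0)
d_in, d_h = 4, 3  # illustrative input and hidden sizes
W_r, W_z, W_h = (rng.standard_normal((d_h, d_h + d_in)) for _ in range(3))
h_t = gru_step(rng.standard_normal(d_in), np.zeros(d_h), W_r, W_z, W_h)
print(h_t.shape)
```

Because the output is a gated blend of the previous state and a tanh candidate, every component of $h_t$ stays in (-1, 1) when the initial state is zero.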
Specifically, in step S120, the MacBERT model is used to extract word vectors, the extracted word vectors are passed through the bidirectional GRU model to extract context features, and the high-level feature vectors of the keywords are generated by concatenation, improving the accuracy of subject information extraction.
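To illustrate the bidirectional wiring, the toy sketch below runs a simplified recurrent step forward and backward over a short sequence of feature vectors and combines the two directions with per-direction weights and a bias. The recurrent step is a parameter-free stand-in for a trained GRU cell, and the weights w, v, b are made-up constants, not the patent's learned parameters.

```python
import numpy as np

def toy_step(x_t, h_prev):
    # Parameter-free stand-in for a trained GRU cell.
    return np.tanh(x_t + 0.5 * h_prev)

def bigru(xs, w=0.6, v=0.4, b=0.0):
    h = np.zeros_like(xs[0]); fwd = []
    for x in xs:                         # forward pass over the sequence
        h = toy_step(x, h)
        fwd.append(h)
    h = np.zeros_like(xs[0]); bwd = []
    for x in reversed(xs):               # backward pass over the sequence
        h = toy_step(x, h)
        bwd.append(h)
    bwd.reverse()                        # realign with forward time order
    # Weighted sum of the two directions at each time step.
    return [w * f + v * r + b for f, r in zip(fwd, bwd)]

xs = [np.array([0.1, -0.2]), np.array([0.3, 0.0]), np.array([-0.1, 0.2])]
out = bigru(xs)
print(len(out), out[0].shape)
```

Each output position thus sees context from both sides of the sentence, which is what lets the layer capture semantic information around every keyword.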
Subject information obtaining step S130: receiving the high-level feature vectors of the keywords extracted in step S120, calculating the importance of the keywords, sorting the keywords in descending order of importance, and selecting the keywords with higher importance as the subject-information keywords, thereby realizing subject information extraction from the government affair text.
Specifically, in step S130, the subject-information keyword importance P is obtained by a sigmoid function, where 0 < P < 1:

$$P = \sigma(W_P h_t + b_P)$$

where $W_P$ is the weight matrix of $h_t$ and $b_P$ is the bias term of $h_t$. The model is trained on the data to obtain its optimal parameters.
In particular, for the importance P of each subject-information keyword, the first eight in descending order may be selected as the subject-information keywords.
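Step S130 then reduces to a scored sort. The sketch below is a hypothetical illustration: a sigmoid layer with made-up weights scores each keyword's feature vector as an importance P in (0, 1), and the top eight keywords by descending P are kept; the function and variable names are not from the patent.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def select_topic_keywords(keywords, features, W_p, b_p, k=8):
    scores = sigmoid(features @ W_p + b_p)  # importance P of each keyword
    order = np.argsort(-scores)             # indices sorted descending by P
    return [keywords[i] for i in order[:k]]

rng = np.random.default_rng(1)
keywords = [f"kw{i}" for i in range(12)]        # candidate keywords
features = rng.standard_normal((12, 5))         # high-level feature vectors
top8 = select_topic_keywords(keywords, features, rng.standard_normal(5), 0.1)
print(top8)
```

In a trained system the features would come from the MacBERT+BiGRU layers and W_p, b_p from training, but the selection logic is the same.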
A storage medium for storing computer-executable instructions, characterized in that:
the computer executable instructions, when executed by a processor, perform the above-described government affairs text subject information extraction method.
Compared with the prior art, the method for extracting the subject information of government affair text according to the invention has the following advantages:
1) The invention adopts the MacBERT model, which can obtain keyword feature vectors and solves the problem of insufficient local feature extraction capability.
2) Because the invention adopts the BiGRU model, semantic information in sentences can be captured and high-level feature vectors of the keywords obtained; the text information is used effectively, and parallel computation is adopted, greatly improving the efficiency of subject information extraction.
3) By fusing the MacBERT model and the BiGRU model, the invention improves on the extraction effect of a single model, improves the accuracy of subject information extraction, and reduces the overfitting risk of the model.
It will be apparent to those skilled in the art that the various elements or steps of the invention described above may be implemented using a general purpose computing device, they may be centralized on a single computing device, or alternatively, they may be implemented using program code that is executable by a computing device, such that they may be stored in a memory device and executed by a computing device, or they may be separately fabricated into various integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.
The above is a further detailed description of the invention with reference to specific preferred embodiments, which should not be considered as limiting the invention to the specific embodiments described herein, but rather as a matter of simple deductions or substitutions by a person skilled in the art without departing from the inventive concept, it should be considered that the invention lies within the scope of protection defined by the claims as filed.

Claims (5)

1. A method for extracting subject information of a government affair text is characterized by comprising the following steps:
data preprocessing step S110:
preprocessing unstructured government affair text data, wherein the preprocessing comprises filtering out irrelevant information and performing word segmentation processing on the text data;
text feature vector extraction and processing step S120:
extracting word vectors of the preprocessed government affair text data by adopting a MacBERT model to obtain keyword feature vectors, then taking the keyword feature vectors as input, capturing semantic information in sentences through a BiGRU model, and optimizing the feature vectors to obtain high-level feature vectors of the keywords;
subject information obtaining step S130: receiving the high-level feature vectors of the keywords extracted in step S120, calculating the importance of the keywords, sorting the keywords in descending order of importance, and selecting the keywords with high importance as the subject-information keywords, so as to realize subject information extraction from the government affair text;
in step S120, the BiGRU model is a bidirectional modified recurrent neural network;
the BiGRU model comprises a forward GRU model
Figure 218286DEST_PATH_IMAGE001
And reverse GRU model
Figure 284593DEST_PATH_IMAGE002
Among them forward GRU model
Figure 26284DEST_PATH_IMAGE003
In which the feature vector of the keyword is input in the forward direction
Figure 551943DEST_PATH_IMAGE004
Reverse GRU model
Figure 286550DEST_PATH_IMAGE005
Using inverse inputs for the feature vectors of the keywords
Figure 456631DEST_PATH_IMAGE006
Each GRU model
Figure 25016DEST_PATH_IMAGE007
By renewing the door
Figure 132911DEST_PATH_IMAGE008
And a reset gate
Figure 546575DEST_PATH_IMAGE009
The information propagation process inside the GRU model is as follows:
Figure 571163DEST_PATH_IMAGE010
wherein, the first and the second end of the pipe are connected with each other,
Figure 169503DEST_PATH_IMAGE011
in order to input the vector, the input vector is input,
Figure 404176DEST_PATH_IMAGE012
to reset the door
Figure 231317DEST_PATH_IMAGE009
The weight matrix of (a) is determined,
Figure 126723DEST_PATH_IMAGE013
for updating the door
Figure 771331DEST_PATH_IMAGE008
The weight matrix of (a) is determined,
Figure 103087DEST_PATH_IMAGE014
for the present information
Figure 858553DEST_PATH_IMAGE015
The weight matrix of (a) is determined,
Figure 841421DEST_PATH_IMAGE016
to be made intoThe elements are multiplied by each other, and the multiplication,
Figure 656931DEST_PATH_IMAGE017
is a function of the sigmoid and is,
Figure 475982DEST_PATH_IMAGE018
is a hyperbolic tangent function, now information
Figure 35139DEST_PATH_IMAGE015
From past information
Figure 639558DEST_PATH_IMAGE019
And the current input
Figure 235756DEST_PATH_IMAGE011
The decision is made in a joint manner,
Figure 932316DEST_PATH_IMAGE020
is composed of
Figure 154219DEST_PATH_IMAGE021
Outputting time information including past information
Figure 127991DEST_PATH_IMAGE019
And present information
Figure 19724DEST_PATH_IMAGE015
Updating door
Figure 564100DEST_PATH_IMAGE008
Reset gate for controlling how much history information is forgotten and how much new information is accepted in current state
Figure 465060DEST_PATH_IMAGE009
Used for controlling how much information in the candidate state is obtained from the history information;
finally, the output of the BiGRU model
Figure 27759DEST_PATH_IMAGE022
Defined by the following equation:
Figure 355973DEST_PATH_IMAGE023
wherein, the first and the second end of the pipe are connected with each other,
Figure 620601DEST_PATH_IMAGE001
for the output of the forward GRU model,
Figure 325252DEST_PATH_IMAGE001
for the output of the inverse GRU model,
Figure 8037DEST_PATH_IMAGE024
to represent
Figure 133250DEST_PATH_IMAGE021
Time of day
Figure 26120DEST_PATH_IMAGE001
The weight of the corresponding one of the first and second weights,
Figure 409827DEST_PATH_IMAGE025
to represent
Figure 196387DEST_PATH_IMAGE026
The weight of the corresponding one of the first and second weights,
Figure 7348DEST_PATH_IMAGE027
to represent
Figure 121934DEST_PATH_IMAGE021
Time of day
Figure 782768DEST_PATH_IMAGE022
The corresponding bias term;
in step S120, a MacBERT model extracts a word vector, the extracted word vector extracts context features through a bidirectional GRU model, and high-level feature vectors of the keywords are generated by concatenation;
in the step S130, in the step S,
topic information keyword importancePObtained by sigmoid function, where 0<P<1:
Figure 440145DEST_PATH_IMAGE028
Wherein, the first and the second end of the pipe are connected with each other,
Figure 546641DEST_PATH_IMAGE029
is that
Figure 538737DEST_PATH_IMAGE030
The weight matrix of (a) is determined,
Figure 264248DEST_PATH_IMAGE031
is that
Figure 526864DEST_PATH_IMAGE030
The bias term of (c).
2. The subject information extraction method according to claim 1, characterized in that:
the pretreatment specifically comprises: and deleting punctuation marks and spaces, introducing a field dictionary into the government affair text data, performing word segmentation on the data, filtering stop words by using a general stop word bank, and removing corresponding stop words in the divided government affair text data.
3. The subject information extraction method according to claim 2, characterized in that:
the government affair text information data comprise unstructured government affair text data, and specifically comprise: and describing natural text language of the statistical monitoring condition of the construction and health management of the district.
4. The subject information extraction method according to claim 1, characterized in that:
For the importance P of each subject-information keyword, the first eight keywords in descending order are selected as the subject-information keywords.
5. A storage medium for storing computer-executable instructions, characterized in that:
the computer-executable instructions, when executed by a processor, perform the method of extracting subject information of government affairs texts according to any one of claims 1 to 4.
CN202211402800.7A 2022-11-10 2022-11-10 Method for extracting subject information of government affair text and storage medium Active CN115455155B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211402800.7A CN115455155B (en) 2022-11-10 2022-11-10 Method for extracting subject information of government affair text and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211402800.7A CN115455155B (en) 2022-11-10 2022-11-10 Method for extracting subject information of government affair text and storage medium

Publications (2)

Publication Number Publication Date
CN115455155A CN115455155A (en) 2022-12-09
CN115455155B true CN115455155B (en) 2023-03-03

Family

ID=84295516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211402800.7A Active CN115455155B (en) 2022-11-10 2022-11-10 Method for extracting subject information of government affair text and storage medium

Country Status (1)

Country Link
CN (1) CN115455155B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113535886A (en) * 2020-04-15 2021-10-22 北大方正信息产业集团有限公司 Information processing method, device and equipment
CN114398877A (en) * 2022-01-12 2022-04-26 平安普惠企业管理有限公司 Theme extraction method and device based on artificial intelligence, electronic equipment and medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11755840B2 (en) * 2021-02-09 2023-09-12 Tata Consultancy Services Limited Extracting mentions of complex relation types from documents by using joint first and second RNN layers to determine sentence spans which correspond to relation mentions
CN114153802A (en) * 2021-12-03 2022-03-08 西安交通大学 Government affair file theme classification method based on Bert and residual self-attention mechanism
CN114357172A (en) * 2022-01-07 2022-04-15 北京邮电大学 Rumor detection method based on ERNIE-BiGRU-Attention
CN115310448A (en) * 2022-08-10 2022-11-08 南京邮电大学 Chinese named entity recognition method based on combining bert and word vector

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113535886A (en) * 2020-04-15 2021-10-22 北大方正信息产业集团有限公司 Information processing method, device and equipment
CN114398877A (en) * 2022-01-12 2022-04-26 平安普惠企业管理有限公司 Theme extraction method and device based on artificial intelligence, electronic equipment and medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Candidate Sentence Extraction Algorithm for Machine Reading Comprehension; Guo Xin et al.; Computer Science (计算机科学); 2020-05-31; Vol. 47, No. 5; pp. 198-203 *

Also Published As

Publication number Publication date
CN115455155A (en) 2022-12-09

Similar Documents

Publication Publication Date Title
CN111144131B (en) Network rumor detection method based on pre-training language model
CN109800310B (en) Electric power operation and maintenance text analysis method based on structured expression
CN111143576A (en) Event-oriented dynamic knowledge graph construction method and device
CN111767725B (en) Data processing method and device based on emotion polarity analysis model
CN111368086A (en) CNN-BilSTM + attribute model-based sentiment classification method for case-involved news viewpoint sentences
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN111222305A (en) Information structuring method and device
CN111325029A (en) Text similarity calculation method based on deep learning integration model
Liu et al. A multi-label text classification model based on ELMo and attention
Qian et al. Sentiment analysis model on weather related tweets with deep neural network
Sartakhti et al. Persian language model based on BiLSTM model on COVID-19 corpus
Shijia et al. Aspect-based Financial Sentiment Analysis with Deep Neural Networks.
CN111831783A (en) Chapter-level relation extraction method
CN113742733A (en) Reading understanding vulnerability event trigger word extraction and vulnerability type identification method and device
CN114925157A (en) Nuclear power station maintenance experience text matching method based on pre-training model
Abujar et al. An approach for bengali text summarization using word2vector
CN107526721A (en) A kind of disambiguation method and device to electric business product review vocabulary
CN114462379A (en) Improved script learning method and device based on event evolution diagram
CN113886601A (en) Electronic text event extraction method, device, equipment and storage medium
Fu et al. Improving distributed word representation and topic model by word-topic mixture model
Wang et al. Mongolian named entity recognition with bidirectional recurrent neural networks
Li et al. [Retracted] Emotion Analysis Model of Microblog Comment Text Based on CNN‐BiLSTM
Wang et al. Weighted graph convolution over dependency trees for nontaxonomic relation extraction on public opinion information
CN113761192A (en) Text processing method, text processing device and text processing equipment
Sairam et al. Image Captioning using CNN and LSTM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant