CN112989811A - BiLSTM-CRF-based historical book reading auxiliary system and control method thereof

Info

Publication number
CN112989811A
CN112989811A CN202110224356.3A CN202110224356A CN112989811A CN 112989811 A CN112989811 A CN 112989811A CN 202110224356 A CN202110224356 A CN 202110224356A CN 112989811 A CN112989811 A CN 112989811A
Authority
CN
China
Prior art keywords
entity
sequence
name
crf
bilstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110224356.3A
Other languages
Chinese (zh)
Other versions
CN112989811B (en
Inventor
张宇
崔涵
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Institute of Technology
Original Assignee
Harbin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Institute of Technology filed Critical Harbin Institute of Technology
Priority to CN202110224356.3A priority Critical patent/CN112989811B/en
Publication of CN112989811A publication Critical patent/CN112989811A/en
Application granted granted Critical
Publication of CN112989811B publication Critical patent/CN112989811B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F40/00 Handling natural language data > G06F40/20 Natural language analysis > G06F40/279 Recognition of textual entities
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F40/00 Handling natural language data > G06F40/20 Natural language analysis > G06F40/279 Recognition of textual entities > G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking > G06F40/295 Named entity recognition
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06F ELECTRIC DIGITAL DATA PROCESSING > G06F40/00 Handling natural language data > G06F40/30 Semantic analysis
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N3/00 Computing arrangements based on biological models > G06N3/02 Neural networks > G06N3/04 Architecture, e.g. interconnection topology > G06N3/045 Combinations of networks
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N3/00 Computing arrangements based on biological models > G06N3/02 Neural networks > G06N3/04 Architecture, e.g. interconnection topology > G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS > G06 COMPUTING; CALCULATING OR COUNTING > G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS > G06N3/00 Computing arrangements based on biological models > G06N3/02 Neural networks > G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS > Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE > Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE > Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Engineering & Computer Science (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Biophysics (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A historical book reading auxiliary system based on BiLSTM-CRF and a control method thereof use BiLSTM-CRF to label named entities; when a user accesses an entity, the program queries the database for information about that word; the invention labels texts from the Sui and Tang period as training data; these data are then used to train a BiLSTM-CRF model; in actual use, after a reader opens a classical Chinese file with the application, the text is preprocessed; the resulting representation is then passed into a BiLSTM layer, which calculates the probability of each label at each position, and a CRF layer then calculates the label sequence with the highest score; after the entities are extracted, the related information of each entity is queried in a preset database, and the query result is displayed in the application in the form of a floating window.

Description

BiLSTM-CRF-based historical book reading auxiliary system and control method thereof
Technical Field
The invention belongs to the technical field of natural language processing, and particularly relates to a historical book reading auxiliary system based on BiLSTM-CRF and a control method thereof.
Background
Chinese culture has a long history, and historical books are the most common carrier through which this brilliant culture has been handed down. However, now that vernacular Chinese prevails, the study of classical Chinese does not receive sufficient attention, and modern readers find it increasingly difficult to read this cultural heritage. Moreover, even for readers with a good grounding in classical Chinese, a lack of historical knowledge and of information about the main figures in an event greatly affects the experience of reading the classics; for beginners the difficulty is greater still.
Currently, there is relatively little research on named entity recognition for classical Chinese. Several methods in the development of named entity recognition technology are surveyed in (Nadeau D, Sekine S. A survey of named entity recognition and classification [J]. Lingvisticae Investigationes, 2007, 30(1): 3-26). However, these methods perform poorly on classical Chinese: its structure is more compact and its expression more condensed, many entities consist of a single character, and a single character may carry the meanings of several entities, so general named entity recognition models cannot represent the entities in classical Chinese well. For example, (Morteza Ziyadi, Yuting Sun, Abhishek Goswami, Jade Huang, Weizhu Chen: Example-Based Named Entity Recognition) gives an example-based named entity recognition model, but very few labeled corpora are available for classical Chinese, so this type of approach does not work well.
Meanwhile, in the current software market, applications related to reading classical Chinese fall mainly into two types. The first type is teaching applications serving primary and middle school students, which cover the classical Chinese texts taught in class; their information is comprehensive, but they cover few articles and only in-class content, so they cannot meet the needs of the large number of classical Chinese enthusiasts. The second type is classical Chinese translation tools, which provide a vernacular translation of the classical text to help the reader understand it, but cannot provide the detailed background information the reader needs.
The prior art has the following disadvantages:
1. It serves only target groups such as primary and middle school students. Although the effect is relatively good, the scope is limited to in-class texts and a small number of extracurricular articles; most of the information is provided by storing ready-made data, so no reading assistance can be given for anything not contained in that data, which is also why such applications cannot assist classical Chinese enthusiasts.
2. Most existing reading assistance applications that can be used at scale provide only a translation function; they cannot identify entity information in an article, such as person names, place names and times, and simply display to the user the vernacular words corresponding to the classical Chinese words. Users are therefore often unable to understand this information or to organize what they already know efficiently.
3. Most existing labeled entity recognition corpora are in vernacular Chinese, and the only open-source classical Chinese word vectors are those trained on the Siku Quanshu. The style of classical Chinese differs greatly between dynasties, and training on corpora that span dynasties gives unsatisfactory results. For most existing named entity recognition models, classical Chinese contains too many word forms. Taking person names as an example, the same person may have several given names, courtesy names and art names, and may be written with or without the surname.
Disclosure of Invention
To solve these problems, the invention provides a historical book reading auxiliary system based on BiLSTM-CRF and a control method thereof.
The invention is realized as follows:
a historical book reading auxiliary system based on BiLSTM-CRF, the system comprising:
a front-end module based on ASP.NET Core Blazor: ASP.NET Core Blazor is a framework for building interactive client-side Web UI, in which both server-side and client-side application logic are written in .NET; the front-end module is built or ported using an existing Web front-end framework;
a back-end module in the form of .NET Core + Python + PaddlePaddle: the .NET Core part preprocesses the incoming file, passes the data to the Python + PaddlePaddle model part to complete prediction, processes the prediction result, queries the database, and passes the data to the front-end module for display;
a display module: data returned from the back-end module is rendered into an HTML document for the user to read; entity words identified as times, place names and person names are highlighted; when the user moves the mouse over a highlighted entity word, a floating information card is displayed, and clicking the word jumps to a detail page in the application showing information related to that word; the detail page for a person name is a person relationship graph displayed with Apache ECharts;
an Electron.NET packaging module: the Electron.NET packaging module packages the Web application formed by the front-end module, the back-end module and the display module into a cross-platform desktop application.
A control method applied to the historical book reading auxiliary system based on BiLSTM-CRF, the method comprising the steps of:
A. establishing a database;
B. constructing an entity recognition model based on BiLSTM-CRF;
C. evaluating the model result;
D. processing special entities, querying the database, and outputting the query result to the display module.
Further, the step A comprises the following steps:
collecting the contents of related entries in Baidu Baike and Wikipedia with a simple crawler built on the Python Requests library, with the collection interval set to 5 s;
parsing the collected data with beautifulsoup4 and extracting the content describing relationships, with complex formats sorted and extracted manually;
establishing a database locally for processing and retrieving the data, preferably using SQLite for storage and retrieval;
dividing the local data into two tables, wherein the first is a content table whose index is an Integer ID and whose content comprises the name and information on relatives such as father, mother, brothers, sisters and children; the second is a lookup table whose index is an alias and whose content is the ID corresponding to that alias;
when data is extracted from the local data, the corresponding ID is first found in the lookup table from the incoming person name, the corresponding row is then found in the content table by that ID, and the retrieved corpus is output.
Further, the step B comprises the following steps:
labeling each character of the corpus output in step A with the BIOES labeling scheme; the three entity types, namely person names, place names and times, are marked with the letters N, L and T respectively;
combining a forward LSTM with a backward LSTM to construct a BiLSTM (bidirectional long short-term memory) network that captures bidirectional semantic information; capturing the bidirectional semantic information specifically comprises: first, randomly generating a vector for each character of the input sentence, and passing all the randomly generated vectors into the forward LSTM model and the backward LSTM model respectively; then integrating the two resulting output vectors to obtain a vector containing the context information of the original text as the input sequence of the CRF model, wherein the input sequence is the feature sequence obtained by splitting a sentence into individual characters and passing them through the BiLSTM layer;
the CRF model is a discriminative undirected probabilistic graphical model used to predict the probability distribution of the label sequence given the observation sequence to be labeled;
the defining formula of the CRF model is: if P(Y | X) is a linear-chain conditional random field, then under the condition that the random variable X takes the value x, the conditional probability that the random variable Y takes the value y has the following form:
$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i,k} \lambda_k \, t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l \, s_l(y_i, x, i) \right)$$
$$Z(x) = \sum_{y} \exp\left( \sum_{i,k} \lambda_k \, t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l \, s_l(y_i, x, i) \right)$$
wherein $t_k(y_{i-1}, y_i, x, i)$ is a transition feature function describing the relation between adjacent label variables and the influence of the input sequence on them; $s_l(y_i, x, i)$ is a state feature function defined at label position i of the input sequence, describing the influence of the input sequence on the label variable; i takes values in the range $0 \le i < L$, where L is the sequence length, and the output sequence has the same length as the input sequence; $\lambda_k$ and $\mu_l$ are the corresponding weights, and $Z(x)$ is the normalization factor; x is the content at a position of the input sequence, x ∈ X, X is the input sequence; y is the content at the corresponding position of the output sequence, y ∈ Y, $y_i \in Y$, $y_{i-1} \in Y$, Y is the output sequence; $0 \le k < M$, $0 \le l < M$, where M is the number of combinations of the input sequence and the output value at a position;
the sequence for which the sum of the prediction score obtained from the BiLSTM and the transition score in the CRF is highest is taken as the final output sequence, wherein the output sequence consists of the label category of each character.
Further, the step C comprises the following steps:
calculating the precision P, the recall R and the F value of the BiLSTM-CRF entity recognition results separately for each of the three categories as evaluation criteria, wherein the calculation formulas are:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$F = \frac{2PR}{P + R}$$
wherein TP is the number of positive instances predicted as positive; FP is the number of negative instances predicted as positive; FN is the number of positive instances predicted as negative; during calculation, characters whose correct label and predicted label are both O are excluded;
in the actual evaluation, weighting P and R to calculate a new F value; the new F value calculation formula is as follows:
Figure RE-GDA0003050521760000044
wherein, α is the relative weight of P to R, and the value range of α is (0, 1).
Further, the step D comprises the steps of:
for an identified time entity, the entity is automatically split after recognition, and the era-name part is looked up in the database to obtain the starting year of that era; the time part is converted into an offset value, which is added directly to the starting year, and the sum is taken as the Common Era year corresponding to the time entity;
for an identified person name entity, the unique ID of the name is first obtained from the unified ID lookup table in the database; where duplicate names occur in an article, that is, where two IDs match the same entity name, the N person name entities closest to the entity are found, and the distance from these N entities to the person corresponding to each candidate ID is calculated:
Figure RE-GDA0003050521760000045
wherein $s(p_j, p_i)$ is calculated as follows: if a path connecting persons $p_j$ and $p_i$ is found, $s(p_j, p_i)$ is the number of persons on that path; otherwise $s(p_j, p_i)$ is $k' + 1$, where $k'$ is the number of search steps:
Figure RE-GDA0003050521760000051
an identified place name entity is queried directly in the corresponding place name library; for identical place names that refer to different places, each place name is unique within a single dynasty, that is, within one dynasty the same place name can only correspond to one place; when an identified place name has two records in the database, the N time entities and person name entities nearest to the place name are collected, and these N entities vote to determine the era of the place name and hence the matching sub-entry in the database;
finally, after the entities are extracted, the related information of each entity is queried in the database established in step A, and the query result is displayed in the application in the form of a floating window.
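The place-name voting in step D can be sketched as follows. This is only an illustration under stated assumptions: the function names `nearest_entities` and `pick_place_record`, the representation of context entities as `(character position, era)` pairs, and the sample records are all hypothetical and do not come from the patent.

```python
from collections import Counter

def nearest_entities(entities, position, n):
    """Return the n context entities closest to `position`.

    `entities` is a list of (char_position, era) pairs for time and
    person-name entities whose era is already known (assumed format).
    """
    return sorted(entities, key=lambda e: abs(e[0] - position))[:n]

def pick_place_record(records, context_entities, position, n=5):
    """Choose among several database records for one place name by majority vote.

    `records` maps an era label to the record for that era (assumed format).
    Returns None when no voted era has a record.
    """
    votes = Counter(era for _, era in nearest_entities(context_entities, position, n))
    for era, _count in votes.most_common():
        if era in records:
            return records[era]
    return None

# Hypothetical data: one place name with a Sui record and a Tang record.
records = {"隋": "record-sui", "唐": "record-tang"}
context = [(3, "唐"), (10, "唐"), (15, "隋"), (80, "隋"), (90, "隋")]
chosen = pick_place_record(records, context, position=12, n=3)
```

With `n=3` only the three entities nearest to position 12 vote, so the two Tang entities outvote the single Sui entity even though Sui entities are more numerous in the whole passage.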
The invention has the beneficial effects
(1) The method provided by the invention is not tied to stored data but works on the article itself, so it can in theory process any article, which greatly expands the scope of the application. In addition, for out-of-vocabulary (unregistered) words, more candidate entities can be detected by adjusting the parameters and thresholds of the model at detection time and then filtered by the database retrieval step, which greatly improves the recognition performance of the named entity recognition model;
(2) in view of the highly condensed expression of classical Chinese text, the invention provides a scheme for handling the problems that may arise during information retrieval for the main entity types in classical Chinese text, such as person names, place names, and the conversion between ancient era names and Common Era years;
(3) in terms of data, the invention also labels an NER corpus for the Sui and Tang period; these corpora will subsequently be open-sourced for use in other research;
(4) for the construction of the application, the invention uses ASP.NET Core Blazor; the application interface is simple and clear and convenient to use, which greatly improves the reader's reading efficiency; meanwhile, a Web application built on the .NET framework ensures compatibility across platforms, and can be used as a standalone application on a mobile phone or PC or as a plug-in together with other applications, thereby providing readers with real-time and accurate reading assistance and filling a gap in the application market.
Drawings
FIG. 1 is a schematic diagram of the process of the present invention;
FIG. 2 is a schematic view of the LSTM framework of the present invention;
FIG. 3 is a diagram illustrating exemplary relationships between persons according to the present invention;
FIG. 4 is a labeled example diagram of an identified entity in accordance with the present invention;
FIG. 5 is a diagram illustrating examples of entity information display related to names of people according to the present invention;
FIG. 6 is a diagram illustrating a display example of place name related entity information according to the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments; all other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
The method uses BiLSTM-CRF to label named entities; when a user accesses an entity (i.e., clicks on the word or hovers the mouse over it), the program queries the database for information about the word (e.g., a person's relationship network, place information, etc.).
First, the present invention labels some texts from the Tang dynasty as training data. These data are then used to train the BiLSTM-CRF model. In actual use, the reader first opens a classical Chinese file (which may be a text file or a PDF file) with the application, and the present invention preprocesses the text. The resulting representation is then passed into a BiLSTM layer, which calculates the probability of each label at each position, and a CRF layer then calculates the label sequence with the highest score. After the entities are extracted, the related information of each entity is queried in a preset database, and the query result is displayed in the application in the form of a floating window to provide a smooth user experience; the specific processing flow is shown in FIG. 1.
A historical book reading auxiliary system based on BiLSTM-CRF, the system comprising:
a front-end module based on ASP.NET Core Blazor: ASP.NET Core Blazor is a framework for building interactive client-side Web UI, in which both server-side and client-side application logic are written in .NET; the advantage of Blazor is that the front end can be built with many existing Web front-end frameworks (such as Bootstrap), which ensures an attractive interface, and that, as a Web application, it has a certain degree of portability;
in the design of the front end, in order to fully test its operation, a mock back end was also constructed; the mock back end has the same interface as the real back end finally implemented, but the data it provides is used only for testing;
a back-end module in the form of .NET Core + Python + PaddlePaddle: the .NET Core part efficiently preprocesses the incoming file, passes the data to the Python + PaddlePaddle model part to complete prediction, processes the prediction result, queries the database, and passes the data to the front-end module for display;
a display module: data returned from the back-end module is rendered into an HTML document for the user to read; entity words identified as times, place names and person names are highlighted; when the user moves the mouse over a highlighted entity word, a floating information card is displayed, and clicking the word jumps to a detail page in the application showing information related to that word; the detail page for a person name is a person relationship graph displayed with Apache ECharts;
in the invention, data labeling may be assisted by text matching, which is used to label fixed place names or times;
an Electron.NET packaging module: the Electron.NET packaging module packages the Web application formed by the front-end module, the back-end module and the display module into a cross-platform desktop application.
A control method applied to the historical book reading auxiliary system based on BiLSTM-CRF, the method comprising the steps of:
A. establishing a database;
B. constructing an entity recognition model based on BiLSTM-CRF;
C. evaluating the model result;
D. processing special entities, querying the database, and outputting the query result to the display module.
The step A comprises the following steps:
collecting the contents of related entries in Baidu Baike and Wikipedia with a simple crawler built on the Python Requests library; because the amount of data to be collected is small, and in consideration of the sites' crawler policies, the collection rate is reduced to one request every 5 s;
parsing the collected data with beautifulsoup4 and extracting the content describing relationships, with complex formats sorted and extracted manually;
because the data volume is small, a database is established locally for processing and retrieving the data, preferably using SQLite for storage and retrieval; SQLite is an ultra-lightweight open-source database and an ACID-compliant relational database management system, written in about 30,000 lines of C code; it can run in only about 500 KB of memory, stores the database in a single file, and does not depend on a running server, which keeps the whole system simple and efficient;
dividing the local data into two tables, wherein the first is a content table whose index is an Integer ID and whose content comprises the name and information on relatives such as father, mother, brothers, sisters and children; the second is a lookup table whose index is an alias and whose content is the ID corresponding to that alias;
when data is extracted from the local data, the corresponding ID is first found in the lookup table from the incoming person name, the corresponding row is then found in the content table by that ID, and the retrieved corpus is output.
For example, Li Yuan may be referred to as Emperor Gaozu of Tang, as Li Yuan, and so on. The ID of the corresponding person entity is found in the second table from the incoming name, and the relative information is then found in the first table by that ID;
the database query step of the invention can be linked to databases holding information on other dynasties, making it suitable for articles from different periods.
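The two-table design of step A can be sketched with Python's built-in `sqlite3` module. This is a minimal sketch under stated assumptions: the table names `content` and `lookup`, the column names, the helper names `build_demo_db` and `find_person`, and the sample rows are illustrative and are not taken from the patent.

```python
import sqlite3

def build_demo_db():
    """Create an in-memory database with a content table and an alias lookup table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE content ("
        "id INTEGER PRIMARY KEY, name TEXT, father TEXT, mother TEXT, children TEXT)"
    )
    # The alias is indexed but not unique: one alias may map to several people.
    conn.execute("CREATE TABLE lookup (alias TEXT, id INTEGER REFERENCES content(id))")
    conn.execute("CREATE INDEX idx_lookup_alias ON lookup(alias)")
    # Illustrative rows only.
    conn.execute("INSERT INTO content VALUES (1, '李渊', '李昞', NULL, '李世民')")
    conn.executemany(
        "INSERT INTO lookup VALUES (?, ?)",
        [("李渊", 1), ("唐高祖", 1), ("高祖", 1)],
    )
    conn.commit()
    return conn

def find_person(conn, alias):
    """Step 1: alias -> IDs in the lookup table. Step 2: IDs -> rows in the content table."""
    ids = [row[0] for row in conn.execute("SELECT id FROM lookup WHERE alias = ?", (alias,))]
    rows = []
    for person_id in ids:
        rows.extend(
            conn.execute(
                "SELECT id, name, father, mother, children FROM content WHERE id = ?",
                (person_id,),
            ).fetchall()
        )
    return rows

conn = build_demo_db()
matches = find_person(conn, "唐高祖")
```

Returning a list rather than a single row leaves room for the duplicate-name case handled in step D, where one alias matches more than one ID.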
The step B comprises the following steps:
the annals-and-biographies style histories, represented by the Twenty-Four Histories, are a bright pearl in the vast sea of classical texts of the Chinese nation. In the invention, the Old Book of Tang is selected as the main research corpus for two reasons: first, the Sui and Tang period has a special historical status; the figures of the period span the two dynasties of Sui and Tang, and the relationships between them are more complex than in other dynasties, so the research is more meaningful and the results more practical; second, the Sui and Tang period saw frequent changes of regime and many historical events, and compared with the New Book of Tang, the Old Book of Tang is written in plainer language and contains more historical material.
In the Old Book of Tang, the data to be labeled are selected from Basic Annals 1 to Basic Annals 5, Biographies 1 to Biographies 5, and Geography 1 and Geography 2; these corpora are used as the training set and test set for the model.
In the actual development process, manual labeling of the classical Chinese corpus was less efficient than expected, so the development of the application was for a time directed mainly at vernacular Chinese.
labeling each character of the corpus output in step A with the BIOES labeling scheme; the three entity types, namely person names, place names and times, are marked with the letters N, L and T respectively; the two kinds of label are combined as follows:
Figure RE-GDA0003050521760000081
for example:
the gaozu Shenyao Dasheng Guangxi emperor Li.
The labeling results are:
'B-N','E-N','O','O','O','O','O','O','O','O','O','O','B-N','E-N', 'O','O','S-N','O';
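A character-level BIOES labeling of this kind can be sketched as below. The helper name `bioes_tags`, the span format `(start, end, type)` with an exclusive end, and the sample sentence are assumptions made for illustration; the label strings follow the `B-N` / `E-N` / `S-N` pattern shown above, with `I-` assumed for characters inside an entity.

```python
def bioes_tags(length, spans):
    """Build one BIOES tag per character.

    `length` is the number of characters in the sentence.
    `spans` is a list of (start, end, entity_type) with `end` exclusive,
    where entity_type is 'N' (person), 'L' (place) or 'T' (time).
    """
    tags = ["O"] * length
    for start, end, etype in spans:
        if end - start == 1:
            tags[start] = f"S-{etype}"          # single-character entity
        else:
            tags[start] = f"B-{etype}"          # first character
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"          # inside characters
            tags[end - 1] = f"E-{etype}"        # last character
    return tags

# Illustrative sentence (not from the patent): a person name followed by a place name.
sentence = "李世民至长安"
tags = bioes_tags(len(sentence), [(0, 3, "N"), (4, 6, "L")])
```

Each character receives exactly one tag, so the tag list always has the same length as the sentence, which matches the requirement that the output sequence be as long as the input sequence.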
the BiLSTM-CRF model automatically constructs features through the BiLSTM, and then passes the resulting vector sequence into the CRF model for sequence labeling.
The long short-term memory network (LSTM, in full Long Short-Term Memory) is a variant of the RNN, proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber. Unlike the plain RNN, the LSTM unit adds a long-term state, as shown in FIG. 2, so that an LSTM model can better capture longer-distance dependencies, and which information is remembered and which is forgotten is learned during training;
because a long short-term memory network cannot encode information from back to front, a forward LSTM and a backward LSTM are combined to construct a BiLSTM (bidirectional long short-term memory) network that captures bidirectional semantic information; first, a vector is randomly generated for each character of the input sentence, and all the randomly generated vectors are passed into the forward LSTM model and the backward LSTM model respectively; after model training, the two resulting output vectors are integrated to obtain a vector containing the context information of the original text as the input sequence of the CRF model, wherein the input sequence is the feature sequence obtained by splitting a sentence into individual characters and passing them through the BiLSTM layer;
in the invention, the word vector training step is not limited to the BiLSTM network; pre-trained models such as BERT or GPT may also be used;
the CRF model is a discriminative undirected probabilistic graphical model proposed by Lafferty et al. in 2001, used to predict the probability distribution of the label sequence given the observation sequence to be labeled; adding a CRF layer has two main effects: first, during training the CRF can automatically learn constraints that ensure the final predicted label sequence is legal; second, the transition features in the CRF take the ordering of the output labels into account;
the discriminant of the CRF model is: if P (X | Y) is in the form of a linear continuous conditional random formula, the conditional probability that the random variable Y takes the value Y has the following form under the condition that the random variable X takes the value X:
Figure RE-GDA0003050521760000091
Figure RE-GDA0003050521760000092
wherein, tk(yi-1,yiX, i) are characteristic functions respectively used for describing the relative relation between adjacent marked variables and the influence of an input sequence on the adjacent marked variables; sl(yiX, i) is a state feature function defined at a marker position i of the input sequence, i having a value in the range 0 ≦ i<L and L are sequence length, the output sequence is as long as the target sequence, and is used for describing the influence of the input sequence on the mark variable, lambdakAnd mulZ (x) is a normalization factor; x is the content of a certain position of the input sequence, X belongs to X, X is the input sequence, Y is the content of the corresponding position of the output sequence, Y belongs to Y, Yi∈Y, yi-1E is Y, and Y is an output sequence; k is not less than 0<M,0≤l<M, M is the number of combinations of the input sequence and a certain position output value;
the sequence for which the sum of the predicted-sequence score from the BiLSTM and the transition score from the CRF is highest is taken as the final output sequence; the output sequence consists of the category label of each individual character.
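The selection of the highest-scoring sequence described above is commonly done with Viterbi decoding. The following is a minimal pure-Python sketch, not the patent's implementation: the three-tag label set, the emission scores, the transition scores, and the `ILLEGAL` penalty are all made-up values for illustration.

```python
from typing import List, Sequence, Tuple


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> Tuple[List[int], float]:
    """Return the tag-index path that maximizes emission + transition scores.

    emissions[t][k]   : BiLSTM score of tag k at position t
    transitions[j][k] : CRF score of moving from tag j to tag k
    """
    n_tags = len(transitions)
    score = list(emissions[0])  # best score of any path ending in each tag
    backpointers: List[List[int]] = []

    for t in range(1, len(emissions)):
        new_score, pointers = [], []
        for cur in range(n_tags):
            # Best previous tag for a path that ends in `cur` at position t.
            prev = max(range(n_tags), key=lambda j: score[j] + transitions[j][cur])
            pointers.append(prev)
            new_score.append(score[prev] + transitions[prev][cur] + emissions[t][cur])
        score = new_score
        backpointers.append(pointers)

    # Follow the back-pointers from the best final tag.
    last = max(range(n_tags), key=lambda k: score[k])
    best_score = score[last]
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return path, best_score


# Toy data (assumed values, not taken from the patent).
TAGS = ["O", "B-N", "E-N"]
ILLEGAL = -1000.0  # large penalty standing in for a learned "forbidden" transition
EMISSIONS = [
    [0.0, 2.0, 0.0],
    [1.5, 0.0, 1.0],  # taken alone, "O" scores highest at this position
    [2.0, 0.0, 0.0],
]
TRANSITIONS = [
    [0.5, 0.0, ILLEGAL],      # from O
    [ILLEGAL, ILLEGAL, 1.0],  # from B-N: must be closed by E-N in this toy tag set
    [0.25, -0.25, ILLEGAL],   # from E-N
]

best_path, best_score = viterbi_decode(EMISSIONS, TRANSITIONS)
best_tags = [TAGS[k] for k in best_path]
print(best_tags, best_score)
```

Picking the highest emission score at each position independently would give B-N, O, O here, which is not a legal BIOES sequence; the transition scores steer the decoder to B-N, E-N, O instead.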
Step C comprises the following steps:
the precision P, the recall R, and the F value of the BiLSTM-CRF entity recognition results are calculated separately for each of the three categories and used as evaluation criteria; the formulas are as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$F = \frac{2PR}{P + R}$$
wherein TP is the number of positive instances predicted as positive; FP is the number of negative instances predicted as positive; FN is the number of positive instances predicted as negative; characters whose correct label and predicted label are both O are excluded from the calculation;
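As a hedged illustration of the per-category evaluation above, the sketch below counts TP, FP, and FN at the character level for each entity type. Treating a character as belonging to a category whenever its tag ends in that category's letter (regardless of the B/I/E/S prefix) is an assumption of this example, and the gold and predicted tag lists are made up.

```python
from typing import Dict, Sequence, Tuple


def entity_type(tag: str) -> str:
    """Map 'B-N' to 'N'; the tag 'O' stays 'O'."""
    return tag.split("-", 1)[1] if "-" in tag else "O"


def per_class_prf(
    gold: Sequence[str],
    pred: Sequence[str],
    classes: Sequence[str] = ("N", "L", "T"),
) -> Dict[str, Tuple[float, float, float]]:
    """Character-level precision, recall and F value for each entity type."""
    results = {}
    for c in classes:
        tp = fp = fn = 0
        for g, p in zip(gold, pred):
            g_hit, p_hit = entity_type(g) == c, entity_type(p) == c
            if g_hit and p_hit:
                tp += 1
            elif p_hit:
                fp += 1
            elif g_hit:
                fn += 1
            # Characters where both tags are O add nothing to any count.
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f_value = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        results[c] = (precision, recall, f_value)
    return results


# Made-up example: the place name is only half recovered, and one extra
# character is wrongly tagged as a place.
gold_tags = ["B-N", "E-N", "O", "B-L", "E-L", "O"]
pred_tags = ["B-N", "E-N", "O", "B-L", "O", "S-L"]
scores = per_class_prf(gold_tags, pred_tags)
print(scores)
```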
in actual evaluation, P and R are weighted to compute a new F value; because the entities identified by the model are filtered again when they are looked up, the weight of recall is increased so that the search is comprehensive; the formula is as follows:
Figure RE-GDA0003050521760000101
wherein α is the weight of P relative to R; to increase the weight of recall, α takes values in the range (0, 1).
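The weighted formula itself is only available as an image in this text, so the sketch below rests on an assumption: it uses the standard F-beta measure rewritten with α = 1/β, which gives F = (1 + α²)PR / (P + α²R). This form agrees with the description (α in (0, 1) raises the weight of recall, and α = 1 reduces to the ordinary F value), but it may differ from the patent's exact expression.

```python
def weighted_f(precision: float, recall: float, alpha: float) -> float:
    """Weighted F value, ASSUMED form: (1 + a^2) * P * R / (P + a^2 * R).

    alpha is the weight of precision relative to recall. The source gives
    the range (0, 1); alpha = 1 is also accepted here only so that the
    result can be compared with the unweighted F value.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    denominator = precision + alpha ** 2 * recall
    if denominator == 0.0:
        return 0.0
    return (1 + alpha ** 2) * precision * recall / denominator


# A high-recall model scores better than a high-precision one when alpha < 1.
high_recall = weighted_f(precision=0.5, recall=1.0, alpha=0.5)
high_precision = weighted_f(precision=1.0, recall=0.5, alpha=0.5)
print(high_recall, high_precision)
```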
Step D comprises the following steps:
identified time entities mostly take the form "era name + year", for example "Zhenguan year 3" (贞观三年); once identified, the entity is automatically split, and the era-name part is looked up in the database to obtain the start time of the era; the year part is converted into an offset, the offset is added directly to the start time, and the sum is taken as the Common Era year of the time entity; for example, "Zhenguan year 3" is split into "Zhenguan" and "year 3"; according to the database, "Zhenguan" is converted to 627 and "year 3" to 3; adding the two gives 630, and "AD 630" is displayed as the information associated with "Zhenguan year 3";
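The split-and-add conversion described above can be sketched as follows. The `ERA_START` dictionary is a hypothetical stand-in for the era-name table in the database and holds only the value quoted in the example; the numeral parser is an illustrative helper limited to 1 through 99. The code follows the source's arithmetic literally (start value plus offset); whether the stored start value is the era's first year or the year before it is a database convention that the text does not spell out.

```python
# Hypothetical stand-in for the era-name table; 627 is the value quoted in the text.
ERA_START = {"贞观": 627}

_DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
           "六": 6, "七": 7, "八": 8, "九": 9}


def parse_chinese_number(text: str) -> int:
    """Parse a Chinese numeral from 1 to 99, e.g. '三' -> 3, '二十三' -> 23."""
    if "十" in text:
        tens, _, ones = text.partition("十")
        return (_DIGITS[tens] if tens else 1) * 10 + (_DIGITS[ones] if ones else 0)
    return _DIGITS[text]


def era_year_to_ce(entity: str) -> int:
    """Split an 'era name + year' entity and add the offset to the era start."""
    for era, start in ERA_START.items():
        if entity.startswith(era):
            offset = parse_chinese_number(entity[len(era):].rstrip("年"))
            return start + offset
    raise KeyError(f"unknown era name in {entity!r}")


print(era_year_to_ce("贞观三年"))
```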
for an identified person-name entity, the unique ID of the name is first obtained from the unified-ID lookup table in the database; when a name is ambiguous within an article, that is, when two unified IDs match the same entity name, the N person-name entities nearest to the entity in the text are found, and the distance from each of these N entities to the person corresponding to each candidate ID is calculated:
Figure RE-GDA0003050521760000103
wherein s(p_j, p_i) is calculated as follows: if persons p_j and p_i are connected within the search range, s(p_j, p_i) is the number of people on the path connecting them; otherwise s(p_j, p_i) is k' + 1, where k' is the number of search steps:
Figure RE-GDA0003050521760000102
for example, assuming k' = 3, i.e., relationships are searched at most 3 levels deep, in FIG. 3 the distance between "Empress Wenxian" and "Princess Nanyang" is 2, and the distance between "Emperor Gaozu of Tang, Li Yuan" and "Princess Nanyang" is 4;
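A depth-limited breadth-first search is one way to compute the distance s described above. In this sketch the distance is counted in relation steps, which matches the values 2 and 4 in the example; the toy graph is hypothetical and is not the graph of FIG. 3 ("relative A" and "relative B" are placeholders). Summing the distances over the nearby persons to choose a candidate is likewise an assumption, since the aggregate formula is only available as an image.

```python
from collections import deque
from typing import Dict, Iterable, Sequence, Set, Tuple


def build_graph(edges: Iterable[Tuple[str, str]]) -> Dict[str, Set[str]]:
    """Build an undirected adjacency map from (person, person) relation pairs."""
    graph: Dict[str, Set[str]] = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def relation_distance(graph: Dict[str, Set[str]], src: str, dst: str, max_steps: int) -> int:
    """Steps between src and dst, or max_steps + 1 if not reachable within max_steps."""
    if src == dst:
        return 0
    seen = {src}
    queue = deque([(src, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_steps:
            continue  # do not expand beyond the search limit
        for neighbour in graph.get(node, ()):
            if neighbour == dst:
                return depth + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, depth + 1))
    return max_steps + 1


def pick_candidate(graph, candidates: Sequence[str], nearby: Sequence[str], max_steps: int) -> str:
    """Choose the candidate ID with the smallest total distance (assumed aggregation)."""
    totals = {c: sum(relation_distance(graph, p, c, max_steps) for p in nearby)
              for c in candidates}
    return min(totals, key=totals.get)


# Hypothetical toy graph for illustration only.
GRAPH = build_graph([
    ("Empress Wenxian", "Emperor Yang of Sui"),
    ("Emperor Yang of Sui", "Princess Nanyang"),
    ("Li Yuan", "relative A"),
    ("relative A", "relative B"),
    ("relative B", "Empress Wenxian"),
])

near = relation_distance(GRAPH, "Empress Wenxian", "Princess Nanyang", max_steps=3)
far = relation_distance(GRAPH, "Li Yuan", "Princess Nanyang", max_steps=3)
chosen = pick_candidate(GRAPH, ["Empress Wenxian", "Li Yuan"],
                        ["Princess Nanyang", "Emperor Yang of Sui"], max_steps=3)
print(near, far, chosen)
```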
an identified place-name entity is queried directly in the corresponding place-name database; where the same name refers to different places, each place name is unique within a given dynasty, i.e., within one dynasty the same name can correspond to only one place; when an identified place name has two records in the database, the N time entities and person-name entities nearest to the place name are collected and vote to determine the era of the place name, which in turn determines the matching sub-entry in the database;
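The voting step above can be sketched with a simple counter. The record identifiers and dynasty labels are hypothetical placeholders, and each nearby entity is assumed to have already been mapped to a dynasty (or to `None` when that is unknown).

```python
from collections import Counter
from typing import Dict, Optional, Sequence, Tuple


def pick_place_record(
    records_by_era: Dict[str, str],
    nearby_eras: Sequence[Optional[str]],
) -> Optional[Tuple[str, str]]:
    """Return (era, record) for the era that receives the most votes.

    records_by_era : candidate database records for one place name, keyed by dynasty
    nearby_eras    : dynasty of each of the N nearest time/person entities
    """
    votes = Counter(era for era in nearby_eras if era is not None)
    for era, _count in votes.most_common():
        if era in records_by_era:
            return era, records_by_era[era]
    return None  # no nearby entity points at any candidate record


# Placeholder data: one place name with two records in the database.
candidates = {"Han": "record_A", "Tang": "record_B"}
winner = pick_place_record(candidates, ["Tang", "Han", "Tang", None])
print(winner)
```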
in the invention, rules may additionally be used to assist the entity recognition step;
finally, after the entities are extracted, their related information is queried from the database established in step A, and the query results are displayed in the application as a floating window.
Examples
When a user opens a classical Chinese text in the application, the application reads the text selected by the user and preprocesses it to obtain character-level vector representations. The character vectors are then passed to the BiLSTM layer, which, once trained, outputs the probability of each label for each character in a sentence. These probabilities are passed to the CRF layer, and the path with the highest probability is computed as the final named entity recognition result. Using the constructed database, the information related to every identified entity can be retrieved and passed to the front end for display.
Take the Chu Shi Biao (出师表) as an example: when the user opens the corresponding file, the application reads its text and preprocesses it, including sentence segmentation, vector initialization, and so on. Take the first sentence as an example: "先帝创业未半而中道崩殂，今天下三分，益州疲弊，此诚危急存亡之秋也。" (roughly: "The late emperor had not completed half of his undertaking when he passed away midway; now the realm is divided into three and Yizhou is exhausted; this is truly a critical moment of survival or ruin.")
The preprocessed sentence is passed to the BiLSTM layer, which yields a 33 × 13 matrix, where 33 is the sentence length, 13 is the size of the label space, and each row gives the probability of each label for the character at that position. The CRF layer then yields the highest-probability path as the sequence labeling result. For this sentence, the result is:
'B-N','E-N','O','O','O','O','O','O','O','O','O','O','O','O','O', 'O','O','O','B-L','E-L','O','O','O','O','O','O','O','O','O','O', 'O','O','O';
After all sentences are processed, the text is displayed in the application interface with the identified entities marked in a distinct color; the interface is shown in FIG. 4.
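The tag sequence above can be turned back into entity spans with a small BIOES decoder. This is an illustrative sketch rather than the application's own code. The sentence is written here in its original Chinese so that its 33 characters line up with the 33 tags, and the label list is built on the assumption that the label space consists of B, I, E, and S tags for the three entity types N, L, and T plus O, which gives the stated size of 13.

```python
from typing import List, Tuple

ENTITY_TYPES = ("N", "L", "T")  # person name, place name, time
LABELS = ["O"] + [f"{p}-{t}" for t in ENTITY_TYPES for p in "BIES"]


def decode_bioes(text: str, tags: List[str]) -> List[Tuple[str, str, int, int]]:
    """Return (surface, type, start, end) spans; end is exclusive."""
    entities = []
    start, current = None, None
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        prefix, label = tag.split("-", 1)
        if prefix == "S":            # single-character entity
            entities.append((text[i], label, i, i + 1))
            start = None
        elif prefix == "B":          # open a new span
            start, current = i, label
        elif prefix == "I":          # continue a span; drop it if inconsistent
            if start is None or label != current:
                start = None
        elif prefix == "E":          # close the span if it was opened correctly
            if start is not None and label == current:
                entities.append((text[start:i + 1], label, start, i + 1))
            start = None
    return entities


SENTENCE = "先帝创业未半而中道崩殂，今天下三分，益州疲弊，此诚危急存亡之秋也。"
TAGS = ["B-N", "E-N"] + ["O"] * 16 + ["B-L", "E-L"] + ["O"] * 13
ENTITIES = decode_bioes(SENTENCE, TAGS)
print(len(LABELS), ENTITIES)
```

The two recovered spans are the person-name entity 先帝 ("the late emperor") and the place-name entity 益州 (Yizhou), which are the spans the interface would highlight.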
After the entity recognition step, when the user hovers the mouse over an identified entity, a query is initiated. First, the entity name to unified ID table for the identified entity type is located in the database. For example, for "the late emperor" (先帝) in the first sentence, the database contains:
Figure RE-GDA0003050521760000111
Figure RE-GDA0003050521760000121
from which the unified ID corresponding to the entity "the late emperor" (先帝) is recognized as "Liu Bei". Then, using "Liu Bei", the following information is found in the person relationship table:
Figure RE-GDA0003050521760000122
This information is passed to the application layer and displayed directly. Entities of the place-name and time types are queried directly in the database. The displayed information for person-name and place-name entities is shown in FIG. 5 and FIG. 6, respectively.
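The two-step lookup in this example (alias to unified ID, then ID to relationship record) can be sketched with Python's built-in `sqlite3` module. The table and column names are hypothetical, and the single inserted row is illustrative data rather than the contents of the tables shown in the figures.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a content table and an alias lookup table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE content (
            id       INTEGER PRIMARY KEY,
            name     TEXT NOT NULL,
            father   TEXT,
            mother   TEXT,
            children TEXT
        );
        CREATE TABLE lookup (
            alias     TEXT NOT NULL,
            person_id INTEGER NOT NULL REFERENCES content(id)
        );
        -- An index rather than a primary key, so one alias may map to several people.
        CREATE INDEX idx_lookup_alias ON lookup(alias);
        """
    )
    # Illustrative row only.
    conn.execute("INSERT INTO content (id, name, children) VALUES (1, 'Liu Bei', 'Liu Shan')")
    conn.executemany(
        "INSERT INTO lookup (alias, person_id) VALUES (?, ?)",
        [("先帝", 1), ("Liu Bei", 1)],
    )
    return conn


def find_people(conn: sqlite3.Connection, alias: str) -> list:
    """Resolve an alias to its ID(s), then fetch the matching content rows."""
    return conn.execute(
        """
        SELECT c.id, c.name, c.father, c.mother, c.children
        FROM lookup AS l
        JOIN content AS c ON c.id = l.person_id
        WHERE l.alias = ?
        ORDER BY c.id
        """,
        (alias,),
    ).fetchall()


conn = build_demo_db()
rows = find_people(conn, "先帝")
print(rows)
```

If an alias returns more than one row, the name is ambiguous and the distance-based disambiguation of step D would be needed to choose between the candidates.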
The BiLSTM-CRF-based history book reading assistance system and its control method have been described in detail above. Specific examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is intended only to help in understanding the method and core idea of the invention. For those skilled in the art, the specific embodiments and the scope of application may vary according to the idea of the invention. In summary, the content of this specification should not be construed as limiting the invention.

Claims (6)

1. A history book reading assistance system based on BiLSTM-CRF, the system comprising:
a front-end module based on ASP.NET Core Blazor: ASP.NET Core Blazor is a framework for building interactive client-side Web UI, in which server-side and client-side application logic is written in .NET; the front-end module may be built or ported using an existing Web front-end framework;
a back-end module in the form of .NET Core + Python + PaddlePaddle: the .NET Core part preprocesses the incoming file, passes the data to the Python + PaddlePaddle model part for prediction, processes the prediction result, queries the database, and passes the data to the front-end module for display;
a display module: the data returned from the back-end module is rendered into an HTML document for the user to read; entity words identified as time, place, or person name are highlighted; when the user moves the mouse over a highlighted entity word, a floating information card is displayed, and clicking the word jumps to a detail page with information related to that word; the detail page for a person name is a person relationship graph rendered with Apache ECharts;
an Electron.NET packaging module: the Electron.NET packaging module packages the Web application formed by the front-end module, the back-end module, and the display module into a cross-platform desktop application.
2. A control method applied to a history book reading assistance system based on BiLSTM-CRF, characterized by comprising the following steps:
A. establishing a database;
B. constructing an entity recognition model based on BiLSTM-CRF;
C. evaluating the model result;
D. processing special entities, querying the database, and outputting the query result to the display module.
3. The method of claim 2, wherein step A comprises the following steps:
collecting the contents of related entries in Baidu Baike and Wikipedia using a simple crawler built with the Python Requests library, with the collection interval set to 5 s;
parsing the collected data with beautifulsoup4 and extracting the relationship-related content, with complex formats sorted and extracted manually;
establishing a local database for processing and retrieving the data, preferably using SQLite for storage and retrieval;
dividing the local data into two tables: the first is a content table indexed by an Integer ID, whose content includes the name and information on relatives such as father, mother, brothers, sisters, and children; the second is a lookup table indexed by alias, whose content is the ID corresponding to the alias;
when data is extracted from the local database, the corresponding ID is looked up in the lookup table by the incoming person name, the corresponding row is then retrieved from the content table by that ID, and the retrieved corpus is output.
4. The method of claim 3, wherein step B comprises the following steps:
labeling each character of the corpus output in step A using the BIOES tagging scheme; the three entity types, person name, place name, and time, are marked with the letters N, L, and T respectively;
combining a forward LSTM with a backward LSTM to construct a bidirectional long short-term memory (BiLSTM) network that captures bidirectional semantic information; capturing the bidirectional semantic information specifically comprises: first, randomly initializing a vector for each character of the input sentence and feeding all of these vectors into both the forward LSTM and the backward LSTM; combining the two output vectors into a vector that carries the context of the original text and serves as the input sequence of the CRF model, wherein the input sequence is the sequence of features obtained by splitting a sentence into individual characters and passing them through the BiLSTM layer;
the CRF model is a discriminative undirected probabilistic graphical model used to predict the probability distribution of the label sequence given an observation sequence to be labeled;
the defining expression of the CRF model is as follows: if P(Y | X) is a linear-chain conditional random field, then, given that the random variable X takes the value x, the conditional probability that the random variable Y takes the value y has the following form:
$$P(y \mid x) = \frac{1}{Z(x)} \exp\left( \sum_{i,k} \lambda_k\, t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l\, s_l(y_i, x, i) \right)$$
$$Z(x) = \sum_{y} \exp\left( \sum_{i,k} \lambda_k\, t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l\, s_l(y_i, x, i) \right)$$
wherein t_k(y_{i-1}, y_i, x, i) is a transition feature function that describes the relationship between adjacent label variables and the influence of the input sequence on them; s_l(y_i, x, i) is a state feature function defined at label position i of the input sequence, which describes the influence of the input sequence on the label variable; i takes values in the range 0 ≤ i < L, where L is the sequence length, and the output sequence has the same length as the input sequence; λ_k and μ_l are the weights of the corresponding feature functions, and Z(x) is the normalization factor; x is the content at a given position of the input sequence, x ∈ X, where X is the input sequence; y is the content at the corresponding position of the output sequence, y ∈ Y, y_i ∈ Y, y_{i-1} ∈ Y, where Y is the output sequence; 0 ≤ k < M and 0 ≤ l < M, where M is the number of combinations of the input sequence with the output value at a given position;
the sequence for which the sum of the predicted-sequence score from the BiLSTM and the transition score from the CRF is highest is taken as the final output sequence; the output sequence consists of the category label of each individual character.
5. The method of claim 4, wherein step C comprises the following steps:
the precision P, the recall R, and the F value of the BiLSTM-CRF entity recognition results are calculated separately for each of the three categories and used as evaluation criteria; the formulas are as follows:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$F = \frac{2PR}{P + R}$$
wherein TP is the number of positive instances predicted as positive; FP is the number of negative instances predicted as positive; FN is the number of positive instances predicted as negative; characters whose correct label and predicted label are both O are excluded from the calculation;
in actual evaluation, P and R are weighted to compute a new F value; the formula for the new F value is as follows:
Figure FDA0002956459030000032
wherein, α is the relative weight of P to R, and the value range of α is (0, 1).
6. The method of claim 5, wherein step D comprises the following steps:
an identified time entity is automatically split after identification, and the era-name part is looked up in the database to obtain the start time of the era; the year part is converted into an offset, the offset is added directly to the start time, and the sum is taken as the Common Era year of the time entity;
for an identified person-name entity, the unique ID of the name is first obtained from the unified-ID lookup table in the database; when a name is ambiguous within an article, that is, when two unified IDs match the same entity name, the N person-name entities nearest to the entity in the text are found, and the distance from each of these N entities to the person corresponding to each candidate ID is calculated:
Figure FDA0002956459030000033
wherein s(p_j, p_i) is calculated as follows: if persons p_j and p_i are connected within the search range, s(p_j, p_i) is the number of people on the path connecting them; otherwise s(p_j, p_i) is k' + 1, where k' is the number of search steps:
Figure FDA0002956459030000034
an identified place-name entity is queried directly in the corresponding place-name database; where the same name refers to different places, each place name is unique within a given dynasty, i.e., within one dynasty the same name can correspond to only one place; when an identified place name has two records in the database, the N time entities and person-name entities nearest to the place name are collected and vote to determine the era of the place name, which in turn determines the matching sub-entry in the database;
finally, after the entities are extracted, their related information is queried from the database established in step A, and the query results are displayed in the application as a floating window.
CN202110224356.3A 2021-03-01 2021-03-01 History book reading auxiliary system based on BiLSTM-CRF and control method thereof Active CN112989811B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110224356.3A CN112989811B (en) 2021-03-01 2021-03-01 History book reading auxiliary system based on BiLSTM-CRF and control method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110224356.3A CN112989811B (en) 2021-03-01 2021-03-01 History book reading auxiliary system based on BiLSTM-CRF and control method thereof

Publications (2)

Publication Number Publication Date
CN112989811A true CN112989811A (en) 2021-06-18
CN112989811B CN112989811B (en) 2022-09-09

Family

ID=76351440

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110224356.3A Active CN112989811B (en) 2021-03-01 2021-03-01 History book reading auxiliary system based on BiLSTM-CRF and control method thereof

Country Status (1)

Country Link
CN (1) CN112989811B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113420296A (en) * 2021-07-08 2021-09-21 国网甘肃省电力公司电力科学研究院 C source code vulnerability detection method based on Bert model and BiLSTM
CN117933245A (en) * 2024-03-22 2024-04-26 四川省特种设备检验研究院 Chinese word segmentation method for special equipment maintenance question-answering system

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108460013A (en) * 2018-01-30 2018-08-28 大连理工大学 A kind of sequence labelling model based on fine granularity vocabulary representation model
CN109271529A (en) * 2018-10-10 2019-01-25 内蒙古大学 Cyrillic Mongolian and the double language knowledge mapping construction methods of traditional Mongolian
CN109657239A (en) * 2018-12-12 2019-04-19 电子科技大学 The Chinese name entity recognition method learnt based on attention mechanism and language model
CN110334300A (en) * 2019-07-10 2019-10-15 哈尔滨工业大学 Text aid reading method towards the analysis of public opinion
CN110347978A (en) * 2019-07-02 2019-10-18 深圳市数字星河科技有限公司 A kind of method of e-book aid reading
CN111274804A (en) * 2020-01-17 2020-06-12 珠海市新德汇信息技术有限公司 Case information extraction method based on named entity recognition
CN111324742A (en) * 2020-02-10 2020-06-23 同方知网(北京)技术有限公司 Construction method of digital human knowledge map
CN111611802A (en) * 2020-05-21 2020-09-01 苏州大学 Multi-field entity identification method
CN111859887A (en) * 2020-07-21 2020-10-30 北京北斗天巡科技有限公司 Scientific and technological news automatic writing system based on deep learning
CN112417880A (en) * 2020-11-30 2021-02-26 太极计算机股份有限公司 Court electronic file oriented case information automatic extraction method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DOTNET跨平台: "【译】使用Blazor构建桌面应用", 《BLOG.CSDN.NET/SD7O95O/ARTICLE/DETAILS/103081336》 *
盛雅兰 等: "基于BiLSTM-CRF的《曹沧州医案》实体识别研究", 《第五届中国中医药信息大会——大数据标准化与智慧中医药论文集》 *

Also Published As

Publication number Publication date
CN112989811B (en) 2022-09-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant