CN112989811A

CN112989811A - BiLSTM-CRF-based historical book reading auxiliary system and control method thereof

Info

Publication number: CN112989811A
Application number: CN202110224356.3A
Authority: CN
Inventors: 张宇; 崔涵
Original assignee: Harbin Institute of Technology
Current assignee: Harbin Institute of Technology
Priority date: 2021-03-01
Filing date: 2021-03-01
Publication date: 2021-06-18
Anticipated expiration: 2041-03-01
Also published as: CN112989811B

Abstract

A BiLSTM-CRF-based historical book reading auxiliary system and a control method thereof use a BiLSTM-CRF model to label named entities; when a user selects an entity, the program queries a database for information about that word. The invention annotates texts from the Sui and Tang period as training data and uses these data to train a BiLSTM-CRF model. In actual use, after a reader opens a classical Chinese text file in the application, the text is preprocessed; the resulting representation is then passed into a BiLSTM layer, which computes the probability of each label at each position, and a CRF layer then computes the label sequence with the highest score. After the entities are extracted, related information about each entity is queried in a preset database, and the query result is displayed in the application as a floating window.

Description

BiLSTM-CRF-based historical book reading auxiliary system and control method thereof

Technical Field

The invention belongs to the technical field of natural language processing, and particularly relates to a BiLSTM-CRF-based historical book reading auxiliary system and a control method thereof.

Background

Chinese culture has a long history, and historical books are the most common carrier through which this brilliant culture has been handed down. However, now that vernacular Chinese prevails, the study of classical Chinese does not receive sufficient attention, and modern readers find it increasingly difficult to read this cultural heritage. Furthermore, even for people with a good foundation in reading classical Chinese, a lack of historical knowledge and of information about the main figures in an event can greatly affect the experience of reading the classics, and the difficulty is far greater for beginners.

Currently, there is relatively little research on named entity recognition for classical Chinese. Surveys list several methods from the development of named entity recognition technology, for example (Nadeau D, Sekine S. A survey of named entity recognition and classification [J]. Lingvisticae Investigationes, 2007, 30(1): 3-26). However, these methods recognize entities in classical Chinese poorly: the structure of classical Chinese is more compact and its expression more concise, many entities consist of a single character, and a single character may carry the meanings of several entities, so a general named entity recognition model cannot represent entities in classical Chinese well. For example, (Morteza Ziyadi, Yuting Sun, Abhishek Goswami, Jade Huang, Weizhu Chen: Example-Based Named Entity Recognition) gives an example-based named entity recognition model, but for classical Chinese very few labeled corpora are available, and this type of approach does not work well.

Meanwhile, in the current software market, applications related to reading classical Chinese fall mainly into two types. The first type is teaching applications that serve primary and middle school students and cover the classical Chinese texts taught in class; their information is comprehensive, but they cover few texts, only those taught in class, so they cannot meet the needs of the many classical Chinese enthusiasts. The second type is classical Chinese translation tools, which provide the vernacular translation of a classical text to help the reader understand it but cannot provide the detailed background information the reader needs.

The prior art has the following disadvantages:

1. They serve only target groups such as primary and middle school students. Although they work relatively well, their scope is limited to in-class texts and a small number of extracurricular texts, and most of their information comes from stored ready-made data; for anything not contained in that data they cannot assist reading, which is also why they cannot assist classical Chinese enthusiasts.

2. Most existing reading-assistance applications that can be used at scale provide only translation. They cannot identify entity information in a text, such as person names, place names and times, and simply show the user the vernacular equivalent of the classical wording. Users are therefore often unable to understand this information or to organize what they already know efficiently.

3. Most existing annotated entity recognition corpora are in vernacular Chinese, and the only open-source classical Chinese word vectors are those trained on the Siku Quanshu. The style of classical Chinese differs greatly between dynasties, so training on corpora from other dynasties gives unsatisfactory results. For most existing named entity recognition models, classical Chinese also contains too many word forms. Taking person names as an example, the same person may have several given names, courtesy names and art names, and may be referred to with or without the surname.

Disclosure of Invention

To solve the above problems, the invention provides a BiLSTM-CRF-based historical book reading auxiliary system and a control method thereof.

The invention is realized by the following method:

a BiLSTM-CRF-based historical book reading auxiliary system, the system comprising:

a front-end module based on ASP.NET Core Blazor: ASP.NET Core Blazor is a framework for building interactive client-side Web UI; server-side and client-side application logic is written in .NET; the front-end module is built or ported using an existing Web front-end framework;

a back-end module in the form of .NET Core + Python + PaddlePaddle: the .NET Core part preprocesses the incoming file, passes the data to the Python + PaddlePaddle model part to complete prediction, processes the prediction result, queries the database, and passes the data to the front-end module for display;

a display module: data returned from the back-end module is rendered into an HTML document for the user to read; entity words identified as times, places and person names are highlighted; when the user hovers the mouse over a highlighted entity word, a floating information card is displayed; clicking the word jumps to a detail page with information related to that word; the detail page for a person name is a person relationship graph rendered with Apache ECharts;

an Electron.NET packaging module: the Electron.NET packaging module packages the Web application formed by the front-end module, the back-end module and the display module into a cross-platform desktop application.

A control method applied to a history book reading assistance system based on BiLSTM-CRF, the method comprising the steps of:

A. establishing a database;

B. constructing an entity recognition model based on BiLSTM-CRF;

C. evaluating the model result;

D. processing special entities, querying the database, and outputting the query result to the display module.

Further, the step A comprises the following steps:

collecting the contents of related entries in Baidu Baike and Wikipedia with a simple crawler built on the Python Requests library, with the collection interval set to 5 s;

parsing the collected data with beautifulsoup4, extracting the relationship-related content, and manually sorting and extracting content in complex formats;

establishing a local database for processing and retrieving the data, preferably using SQLite for storage and retrieval;

dividing the local data into two tables, wherein the first is a content table whose index is an Integer ID and whose content comprises a person's name and information on relatives such as father, mother, brothers, sisters and children, and the second is a lookup table whose index is an alias and whose content is the ID corresponding to that alias;

when extracting data from the local database, looking up the ID in the lookup table according to the incoming person name, then finding the corresponding row in the content table by that ID, and outputting the retrieved data.
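The two-table lookup described above can be sketched with Python's built-in `sqlite3` module. This is only a minimal illustration: the table names `content` and `lookup`, the column names, and the sample row are assumptions made for the example and are not taken from the invention's actual database.

```python
import sqlite3


def build_demo_db() -> sqlite3.Connection:
    """Create an in-memory database with a content table and an alias lookup table."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE content (
            id INTEGER PRIMARY KEY,   -- integer ID used as the index
            name TEXT NOT NULL,
            father TEXT,
            mother TEXT,
            children TEXT
        );
        CREATE TABLE lookup (
            alias TEXT PRIMARY KEY,   -- alias used as the index
            person_id INTEGER NOT NULL REFERENCES content(id)
        );
        """
    )
    # Illustrative sample row (hypothetical data entry, not from the source database).
    conn.execute(
        "INSERT INTO content (id, name, father, mother, children) VALUES (?, ?, ?, ?, ?)",
        (1, "Li Yuan", "Li Bing", None, "Li Shimin"),
    )
    conn.executemany(
        "INSERT INTO lookup (alias, person_id) VALUES (?, ?)",
        [("Li Yuan", 1), ("Gaozu", 1)],
    )
    conn.commit()
    return conn


def find_person(conn: sqlite3.Connection, alias: str):
    """Resolve an alias to an ID in the lookup table, then fetch the content row."""
    row = conn.execute(
        "SELECT person_id FROM lookup WHERE alias = ?", (alias,)
    ).fetchone()
    if row is None:
        return None
    cur = conn.execute(
        "SELECT id, name, father, mother, children FROM content WHERE id = ?",
        (row[0],),
    )
    record = cur.fetchone()
    if record is None:
        return None
    columns = [d[0] for d in cur.description]
    return dict(zip(columns, record))
```

Calling `find_person(build_demo_db(), "Gaozu")` returns the same record as a lookup by `"Li Yuan"`. In this sketch each alias maps to one ID; an alias shared by several people would need a non-unique index plus the disambiguation of step D.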

Further, the step B comprises the following steps:

labeling each character in the corpus output in step A using the BIOES tagging scheme, with the letters N, L and T marking the three entity types of person name, place name and time, respectively;

combining a forward LSTM with a backward LSTM to construct a bidirectional long short-term memory network (BiLSTM) that captures bidirectional semantic information, which specifically comprises: first, randomly generating a vector for each character of the input sentence and passing all of these vectors into the forward LSTM model and the backward LSTM model, respectively; then combining the two resulting output vectors to obtain a vector containing the context information of the original text as the input sequence of the CRF model, the input sequence being the feature sequence obtained by splitting the sentence into single characters and passing them through the BiLSTM layer;

the CRF model is a discriminative undirected probabilistic graphical model used to predict the probability distribution of the label sequence given an observation sequence to be labeled;

the CRF model is defined as follows: if $P(Y \mid X)$ is a linear-chain conditional random field, then, given that the random variable X takes the value x, the conditional probability that the random variable Y takes the value y has the following form:

$$P(y \mid x) = \frac{1}{Z(x)} \exp\Big(\sum_{i,k} \lambda_k\, t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l\, s_l(y_i, x, i)\Big)$$

$$Z(x) = \sum_{y} \exp\Big(\sum_{i,k} \lambda_k\, t_k(y_{i-1}, y_i, x, i) + \sum_{i,l} \mu_l\, s_l(y_i, x, i)\Big)$$

wherein $t_k(y_{i-1}, y_i, x, i)$ is a transition feature function describing the relation between adjacent label variables and the influence of the input sequence on them; $s_l(y_i, x, i)$ is a state feature function defined at label position i of the input sequence, describing the influence of the input sequence on the label variable; i takes values in the range $0 \le i < L$, where L is the sequence length, and the output sequence is as long as the input sequence; $\lambda_k$ and $\mu_l$ are the weights of the corresponding feature functions, and $Z(x)$ is the normalization factor; x is the content at a given position of the input sequence, $x \in X$, X being the input sequence; y is the content at the corresponding position of the output sequence, $y \in Y$, $y_i \in Y$, $y_{i-1} \in Y$, Y being the output sequence; $0 \le k < M$, $0 \le l < M$, where M is the number of combinations of the input sequence and an output value at a given position;

and taking the sequence with the highest sum of the prediction score obtained from the BiLSTM and the transition score in the CRF as the final output sequence, the output sequence consisting of the category label of each character.
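Selecting the label sequence whose combined BiLSTM score and CRF transition score is highest is usually done with Viterbi decoding. The following is a minimal pure-Python sketch of that idea; the function name `viterbi_decode` and the score layout (`emissions[i][t]` for the BiLSTM score of label t at position i, `transitions[s][t]` for the CRF score of moving from label s to label t) are assumptions for illustration, not the invention's implementation.

```python
from typing import List, Sequence


def viterbi_decode(
    emissions: Sequence[Sequence[float]],
    transitions: Sequence[Sequence[float]],
) -> List[int]:
    """Return the label index path with the highest total score.

    emissions[i][t]   : score of label t at position i (e.g. from a BiLSTM)
    transitions[s][t] : score of moving from label s to label t (from the CRF)
    """
    if not emissions:
        return []
    n_labels = len(emissions[0])
    score = list(emissions[0])          # best score of any path ending in each label
    backpointers: List[List[int]] = []  # best previous label for each position/label
    for i in range(1, len(emissions)):
        new_score = []
        pointers = []
        for t in range(n_labels):
            best_prev = max(range(n_labels), key=lambda s: score[s] + transitions[s][t])
            new_score.append(score[best_prev] + transitions[best_prev][t] + emissions[i][t])
            pointers.append(best_prev)
        score = new_score
        backpointers.append(pointers)
    # Follow the back-pointers from the best final label to recover the path.
    path = [max(range(n_labels), key=lambda t: score[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path
```

With labels indexed as 0 = O, 1 = B, 2 = E, a transition matrix that rewards B followed by E and penalizes B followed by O can override a per-position choice that would otherwise produce an invalid sequence, which is the role the CRF layer plays on top of the BiLSTM scores.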

Further, the step C comprises the following steps:

calculating the precision P, the recall R and the F value of the BiLSTM-CRF-based entity recognition model result separately for the three categories as evaluation criteria, wherein the calculation formulas are as follows:

$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F = \frac{2PR}{P + R}$$

wherein TP is the number of positive instances predicted as positive; FP is the number of negative instances predicted as positive; FN is the number of positive instances predicted as negative; during calculation, characters for which both the correct label and the predicted label are O are excluded;

in the actual evaluation, weighting P and R to calculate a new F value; the new F value calculation formula is as follows:

wherein, α is the relative weight of P to R, and the value range of α is (0, 1).
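The evaluation above can be illustrated with a short sketch. The precision, recall and F computations follow the TP, FP and FN definitions given in the text. The exact weighted formula is not shown here, so `weighted_f` assumes one common form, the weighted harmonic mean $F = 1 / (\alpha / P + (1 - \alpha) / R)$ with α in (0, 1) as the weight on precision; that form, and both function names, are assumptions rather than the invention's stated formula.

```python
def precision_recall_f(tp: int, fp: int, fn: int):
    """Compute precision, recall and the balanced F value from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def weighted_f(p: float, r: float, alpha: float) -> float:
    """Weighted harmonic mean of P and R (assumed form, see lead-in).

    alpha is the weight on precision; values below 0.5 give recall more weight.
    """
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must lie in the open interval (0, 1)")
    if p == 0.0 or r == 0.0:
        return 0.0
    return 1.0 / (alpha / p + (1.0 - alpha) / r)
```

Under this assumed form, `alpha = 0.5` reproduces the balanced F value, and lowering alpha moves the score toward recall, which matches the stated goal of weighting recall more heavily because false positives can still be filtered out during database lookup.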

Further, the step D comprises the steps of:

for an identified time entity, the entity is automatically split after identification, and the era-name part is looked up in the database to obtain the starting year of the era; the time part is converted into an offset value, which is added directly to the starting year, and the sum is taken as the Common Era year corresponding to the time entity;

for an identified person-name entity, the unique ID of the name is first obtained from the unified ID lookup table in the database; when a name in the article is ambiguous, that is, two unified IDs match the same entity name, the N person-name entities closest to that entity are found, and for each candidate ID the distances from those N entities to the person corresponding to the ID are calculated;

wherein $s(p_j, p_i)$ is calculated as follows: if a path connecting persons $p_j$ and $p_i$ is found within the search depth, $s(p_j, p_i)$ is the number of people on that connecting path; otherwise $s(p_j, p_i)$ is k' + 1, where k' is the number of search steps;

an identified place-name entity is queried directly in the corresponding place-name database; for a place name shared by different places, it is assumed that within a single dynasty each place name is unique, that is, the same place name can correspond to only one place; when an identified place name has two records in the database, the N time entities and person-name entities nearest to the place name are collected and vote to determine the era to which the place name belongs, which in turn determines the place name's entry in the database;

and finally, after the entities are extracted, the related information of each entity is queried in the database established in step A, and the query result is displayed in the application as a floating window.
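The time-entity conversion in step D (era name plus offset) can be sketched as follows. The era table, the numeral parser and the function names are assumptions for illustration; the start year 627 for 贞观 (Zhenguan) is the value used in this document's own example. The sketch follows the text's arithmetic literally (start year plus offset); a calendar-accurate conversion might need to treat the first year of an era as offset 0, which is not something the source states.

```python
# Era-name start years; 627 for 贞观 (Zhenguan) is the value given in the text.
ERA_START_YEAR = {"贞观": 627}

# Digits needed to read simple year counts written in Chinese numerals.
DIGITS = {"一": 1, "二": 2, "三": 3, "四": 4, "五": 5,
          "六": 6, "七": 7, "八": 8, "九": 9}


def chinese_numeral_to_int(text: str) -> int:
    """Convert a Chinese numeral from 1 to 99 (e.g. 三, 十三, 二十三) to an int."""
    if "十" in text:
        tens, _, ones = text.partition("十")
        return (DIGITS[tens] if tens else 1) * 10 + (DIGITS[ones] if ones else 0)
    return DIGITS[text]


def era_date_to_year(entity: str, era_table: dict):
    """Split 'era name + year count' and add the offset to the era's start year."""
    # Try longer era names first so a short name cannot shadow a longer one.
    for era in sorted(era_table, key=len, reverse=True):
        if entity.startswith(era):
            offset_text = entity[len(era):].rstrip("年")
            try:
                offset = chinese_numeral_to_int(offset_text)
            except KeyError:
                return None  # year count not understood by this simple parser
            return era_table[era] + offset
    return None  # era name not found in the table
```

For the entity 贞观三年 ("the third year of Zhenguan") this returns 627 + 3 = 630, the value the description displays as the corresponding Common Era year.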

The invention has the beneficial effects

(1) The method provided by the invention is oriented to texts rather than to stored data and can in theory process any text, which greatly expands the range of application. In addition, for out-of-vocabulary words, more candidate entities can be detected by adjusting the model's parameters and thresholds at detection time and then filtered in the database retrieval step, which greatly improves the recognition performance of the named entity recognition model;

(2) because expression in classical Chinese is highly condensed, the invention provides a set of solutions for the problems that may arise during information retrieval for the main entity types found in classical Chinese texts, such as person names, place names, and conversion between ancient era names and Common Era years;

(3) in terms of data, the invention also annotates corpora for the NER task from the Sui and Tang period; these corpora will subsequently be open-sourced for use in other research;

(4) for building the application, the invention uses ASP.NET Core Blazor; the application interface is simple and clear and convenient to use, and greatly improves the reader's reading efficiency; meanwhile, a Web application built on the .NET framework ensures compatibility across platforms and can be used as a standalone application on a mobile phone or PC, or together with other applications as a plug-in, thereby providing readers with real-time, accurate reading assistance and filling a gap in the application market.

Drawings

FIG. 1 is a schematic diagram of the process of the present invention;

FIG. 2 is a schematic view of the LSTM framework of the present invention;

FIG. 3 is a diagram illustrating exemplary relationships between persons according to the present invention;

FIG. 4 is a labeled example diagram of an identified entity in accordance with the present invention;

FIG. 5 is a diagram illustrating examples of entity information display related to names of people according to the present invention;

FIG. 6 is a diagram illustrating an example display of entity information related to place names according to the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments; all other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The method uses BiLSTM-CRF to label named entities; when a user selects an entity (i.e., clicks on the word or hovers the mouse over it), the program queries the database for information about the word (e.g., the person relationship network, location information, etc.).

First, the present invention labels some texts from the Tang dynasty as training data. These data are then used to train the BiLSTM-CRF model. In actual use, the reader first opens a classical Chinese file (which may be a text file or a PDF file) in the application, and the invention preprocesses the text. The resulting representation is then passed into the BiLSTM layer, which computes the probability of each label at each position, and the CRF layer then computes the label sequence with the highest score. After the entities are extracted, the related information of each entity is queried in a preset database, and the query result is displayed in the application as a floating window to give the user a smooth experience; the specific processing flow is shown in FIG. 1.

a front-end module based on ASP.NET Core Blazor: ASP.NET Core Blazor is a framework for building interactive client-side Web UI, with server-side and client-side application logic written in .NET; the advantages of Blazor are that the front end can be built with many existing Web front-end frameworks (such as Bootstrap), which ensures an attractive appearance, and that, as a Web application, it is reasonably portable;

during front-end design, a mock back end was also built in order to test operation fully; the mock back end has the same interface as the real back end finally implemented, but the data it provides is used only for testing;

a back-end module in the form of .NET Core + Python + PaddlePaddle: the .NET Core part efficiently preprocesses the incoming file, passes the data to the Python + PaddlePaddle model part to complete prediction, processes the prediction result, queries the database, and passes the data to the front-end module for display;

in the invention, data annotation can be assisted by text matching, with fixed places or times labeled by text matching;

A. establishing a database;

B. constructing an entity recognition model based on BiLSTM-CRF;

C. evaluating the model result;

The step A comprises the following steps:

collecting the contents of related entries in Baidu Baike and Wikipedia with a simple crawler built on the Python Requests library; considering that the amount of data to be collected is small, and in view of the sites' crawler policies, the collection rate is reduced to one request every 5 s;

because the data volume is small, the database is established locally for processing and retrieving the data, preferably using SQLite for storage and retrieval; SQLite is an ultra-lightweight open-source relational database management system that complies with ACID, is written in about 30,000 lines of C code, needs only about 500 KB of memory to run, is stored in a single file, and does not depend on a running server, so the whole remains simple and efficient;

for example, Li Yuan, Emperor Gaozu of Tang, may be referred to as 'Li Yuan' or by other names; the ID of the corresponding person entity is found in the second table from the incoming name, and the information on relatives is found in the first table by that ID;

the database query step of the invention can be linked to databases holding information on other dynasties, making it suitable for texts from different periods.

The step B comprises the following steps:

the dynastic histories represented by the Twenty-Four Histories are a bright pearl in the vast sea of Chinese cultural classics. In the invention, the Old Book of Tang is selected as the main research corpus for two reasons: first, the Sui-Tang period has a special historical status, its figures span the two dynasties of Sui and Tang, and the relationships between them are more complex than in other dynasties, so the research is more meaningful and the results more practical; second, the Sui-Tang period saw frequent changes of regime and many historical events, and compared with the New Book of Tang, the Old Book of Tang is plainer in language and richer in historical material.

From the Old Book of Tang, the data to be annotated are selected from Basic Annals 1 to 5, Biographies 1 to 5, and Geography 1 and 2, and these corpora are used as the training set and test set for the model.

In actual development, manual annotation of the classical Chinese corpus was less efficient than expected, so for the time being application development focuses mainly on vernacular Chinese.

Each character in the corpus output in step A is labeled using the BIOES tagging scheme; the letters N, L and T mark the three entity types of person name, place name and time, respectively; the two kinds of label are combined as follows:

for example, for the sentence:

高祖神尧大圣大光孝皇帝姓李氏，讳渊。 (Emperor Gaozu, Shenyao Dasheng Daguangxiao Huangdi, was surnamed Li, with the given name Yuan.)

The labeling results are:

'B-N','E-N','O','O','O','O','O','O','O','O','O','O','B-N','E-N', 'O','O','S-N','O'；
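The BIOES labeling above can be reproduced with a small helper. The sketch below assumes the example sentence is the opening of the first Basic Annals of the Old Book of Tang, 高祖神尧大圣大光孝皇帝姓李氏，讳渊。, an assumption that is consistent with the 18 labels listed; the function name `bioes_tags` and the `(start, end, type)` span format are likewise illustrative.

```python
from typing import List, Tuple

# Full label space: O plus B/I/E/S for each of the three entity types (13 labels).
LABELS = ["O"] + [f"{prefix}-{etype}" for etype in "NLT" for prefix in "BIES"]


def bioes_tags(text: str, entities: List[Tuple[int, int, str]]) -> List[str]:
    """Tag each character; entities are (start, end_exclusive, type_letter) spans."""
    tags = ["O"] * len(text)
    for start, end, etype in entities:
        if end - start == 1:
            tags[start] = f"S-{etype}"        # single-character entity
        else:
            tags[start] = f"B-{etype}"        # beginning
            for i in range(start + 1, end - 1):
                tags[i] = f"I-{etype}"        # inside
            tags[end - 1] = f"E-{etype}"      # end
    return tags


sentence = "高祖神尧大圣大光孝皇帝姓李氏，讳渊。"
# Person-name spans: 高祖 (0-2), 李氏 (12-14), 渊 (16-17).
example_tags = bioes_tags(sentence, [(0, 2, "N"), (12, 14, "N"), (16, 17, "N")])
```

`LABELS` contains 13 entries, which is why a label space of size 13 arises for three entity types under the BIOES scheme.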

the BiLSTM-CRF model automatically constructs features through the BiLSTM and then passes the resulting sequence of character vectors into the CRF model for sequence labeling.

The long short-term memory network (LSTM), in full Long Short-Term Memory, is a variant of the RNN proposed in 1997 by Sepp Hochreiter and Jürgen Schmidhuber. Unlike a plain RNN, an LSTM unit adds a long-term state, as shown in FIG. 2, so an LSTM model can better capture longer-distance dependencies, and which information is remembered and which is forgotten is learned during training;

because an LSTM cannot encode information from back to front, a forward LSTM and a backward LSTM are combined to construct a bidirectional long short-term memory network (BiLSTM) that captures bidirectional semantic information; first, a vector is randomly generated for each character of the input sentence, and all of these vectors are passed into the forward LSTM model and the backward LSTM model, respectively; after model training, the two resulting output vectors are combined to obtain a vector containing the context information of the original text as the input sequence of the CRF model, the input sequence being the feature sequence obtained by splitting the sentence into single characters and passing them through the BiLSTM layer;

in the invention, the character vectors need not be trained only with the BiLSTM network; pre-trained models such as BERT or GPT can also be adopted;

the CRF model is a discriminative undirected probabilistic graphical model proposed by Lafferty et al. in 2001 and is used to predict the probability distribution of the label sequence given an observation sequence to be labeled; adding a CRF layer has two main effects: first, during training the CRF automatically learns constraints that ensure the final predicted label sequence is valid; second, the transition features in the CRF take the order of the output labels into account;

The step C comprises the following steps:

in the actual evaluation, P and R are appropriately weighted to calculate a new F value; because the entities identified by the model are filtered again during database lookup, the weight of recall is increased to make the lookup as comprehensive as possible, and the formula is as follows:

wherein α is the relative weight of P to R, and the value range of α is (0, 1) so as to increase the weight of recall.

The step D comprises the following steps:

for the identified time entities, most are in the form 'era name + time', such as 'the third year of Zhenguan'; after such an entity is identified, it is automatically split, and the era-name part is looked up in the database to obtain the starting year of the era; the time part is converted into an offset value, which is added directly to the starting year, and the sum is taken as the Common Era year corresponding to the time entity; thus 'the third year of Zhenguan' is split into 'Zhenguan' and 'third year', where 'Zhenguan' is converted into '627' according to the database and 'third year' is converted into '3'; the two are added to obtain '630', and '630 CE' is displayed as the related information for 'the third year of Zhenguan';

for example, assuming that k' is 3, that is, at most 3 layers of relationships are searched, then in FIG. 3 the distance between 'Empress Wende' and 'Princess Nanyang' is 2, and the distance between 'Emperor Gaozu of Tang, Li Yuan' and 'Princess Nanyang' is 4;

in the invention, the entity recognition step can also be assisted by rules;
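The person-name disambiguation described in step D can be sketched as a bounded breadth-first search over the relationship graph. Several points here are assumptions for illustration: the distance is taken as the number of relationship steps on the shortest path, capped at k' + 1 when no path is found within k' steps; the candidate with the smallest summed distance to the nearby person entities is chosen; and the graph, names and function names are hypothetical.

```python
from collections import deque
from typing import Dict, Iterable, List


def relation_distance(graph: Dict[str, List[str]], source: str, target: str,
                      max_steps: int) -> int:
    """Shortest number of relationship steps, or max_steps + 1 if none is found."""
    if source == target:
        return 0
    visited = {source}
    queue = deque([(source, 0)])
    while queue:
        person, steps = queue.popleft()
        if steps == max_steps:
            continue  # do not expand beyond the search depth k'
        for neighbor in graph.get(person, ()):
            if neighbor == target:
                return steps + 1
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, steps + 1))
    return max_steps + 1


def pick_candidate(graph: Dict[str, List[str]], candidate_ids: Iterable[str],
                   context_ids: Iterable[str], max_steps: int) -> str:
    """Choose the candidate ID closest, in total, to the nearby person entities."""
    context = list(context_ids)
    return min(
        candidate_ids,
        key=lambda c: sum(relation_distance(graph, c, p, max_steps) for p in context),
    )
```

With `max_steps = 3`, two people joined by a four-step chain get distance 4 (that is, k' + 1), which mirrors the capped distance in the example above.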

Examples

When a user opens a classical Chinese text in the application, the application reads the text content selected by the user and preprocesses it to obtain character-level vector representations. The character vectors are then passed into the trained BiLSTM layer to obtain the probability of each label for each character in a sentence. These probabilities are passed into the CRF layer, and the path with the maximum probability is computed as the final named entity recognition result. Using the constructed database, the related information for all identified entities can be found and passed to the front end for display.

Taking the Chu Shi Biao (出师表) as an example, when the user opens the corresponding file, the application reads its text content and then preprocesses it, including sentence segmentation, vector initialization and so on. Take the first sentence as an example: "先帝创业未半而中道崩殂，今天下三分，益州疲弊，此诚危急存亡之秋也。" (The late emperor passed away midway, with his undertaking not yet half accomplished; today the realm is divided in three and Yizhou is exhausted; this is truly a critical moment of survival or ruin.)

The preprocessed sentence is passed to the BiLSTM layer, which yields a 33 × 13 matrix, where 33 is the sentence length, 13 is the size of the label space, and each row holds the probability of each label for the character at the corresponding position. The CRF layer then gives the path with the highest probability as the sequence labeling result. For this sentence, the result is:

'B-N','E-N','O','O','O','O','O','O','O','O','O','O','O','O','O', 'O','O','O','B-L','E-L','O','O','O','O','O','O','O','O','O','O', 'O','O','O'；
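Turning such a label sequence back into entity strings is a simple scan over the BIOES tags. The sketch below assumes the example sentence is the first sentence of the Chu Shi Biao, 先帝创业未半而中道崩殂，今天下三分，益州疲弊，此诚危急存亡之秋也。, which has 33 characters and is consistent with the 33 labels above; the function name `extract_entities` is illustrative, and for brevity it does not check that the B and E tags of one span carry the same type.

```python
from typing import List, Tuple


def extract_entities(text: str, tags: List[str]) -> List[Tuple[str, str]]:
    """Collect (entity_text, type_letter) pairs from a BIOES label sequence."""
    entities = []
    start = None  # index of the most recent unmatched B- tag
    for i, tag in enumerate(tags):
        if tag == "O":
            start = None
            continue
        prefix, etype = tag.split("-", 1)
        if prefix == "S":
            entities.append((text[i], etype))
            start = None
        elif prefix == "B":
            start = i
        elif prefix == "E" and start is not None:
            entities.append((text[start:i + 1], etype))
            start = None
        # An "I" tag simply continues the current span.
    return entities


sentence = "先帝创业未半而中道崩殂，今天下三分，益州疲弊，此诚危急存亡之秋也。"
tags = ["B-N", "E-N"] + ["O"] * 16 + ["B-L", "E-L"] + ["O"] * 13
found = extract_entities(sentence, tags)
```

For this sentence the scan yields the person entity 先帝 and the place entity 益州, which are the two spans the application would highlight and make available for database lookup.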

after all sentences are processed, the text is displayed in the application interface with the identified entities marked in a special color, as shown in FIG. 4;

after the entity recognition step, when the user hovers the mouse over an identified entity, a query step is initiated. First, according to the identified entity type, the corresponding 'entity name to unified ID' table is found in the database. For example, for '先帝' (the late emperor) in the first sentence, we find in the database:

thereby recognizing that the unified ID corresponding to the entity '先帝' is 'Liu Bei'. Then, using 'Liu Bei', the following information is found in the person relationship table:

This information is passed to the application layer and displayed directly. Entities related to place names and times are queried directly in the database. The displays of entity information related to person names and place names are shown in FIG. 5 and FIG. 6, respectively.

The BiLSTM-CRF-based historical book reading auxiliary system and control method thereof have been described in detail above; examples are used herein to explain the principle and implementation of the invention, and the description of the embodiments is intended only to help in understanding the method and core idea of the invention; meanwhile, a person skilled in the art may, following the idea of the invention, vary the specific embodiments and the scope of application; in summary, the content of this specification should not be construed as limiting the invention.

Claims

1. A history book reading assistance system based on BiLSTM-CRF, the system comprising:

2. A control method applied to a BiLSTM-CRF-based historical book reading auxiliary system, characterized by comprising the following steps:

A. establishing a database;

B. constructing an entity recognition model based on BiLSTM-CRF;

C. evaluating the model result;

3. The method of claim 2, wherein step A comprises the following steps:

4. The method of claim 3, wherein step B comprises the following steps:

wherein $t_k(y_{i-1}, y_i, x, i)$ is a transition feature function describing the relation between adjacent label variables and the influence of the input sequence on them; $s_l(y_i, x, i)$ is a state feature function defined at label position i of the input sequence, describing the influence of the input sequence on the label variable; i takes values in the range $0 \le i < L$, where L is the sequence length, and the output sequence is as long as the input sequence; $\lambda_k$ and $\mu_l$ are the weights of the corresponding feature functions, and $Z(x)$ is the normalization factor; x is the content at a given position of the input sequence, $x \in X$, X being the input sequence; y is the content at the corresponding position of the output sequence, $y \in Y$, $y_i \in Y$, $y_{i-1} \in Y$, Y being the output sequence; $0 \le k < M$, $0 \le l < M$, where M is the number of combinations of the input sequence and an output value at a given position;

5. The method of claim 4, wherein step C comprises the following steps:

6. The method of claim 5, wherein the step D comprises the steps of: