CN112347765B - Entity labeling method, module and device based on dictionary matching - Google Patents

Entity labeling method, module and device based on dictionary matching

Publication number
CN112347765B
Authority
CN
China
Prior art keywords
entity
sentence
dictionary
word
prefix
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011079331.0A
Other languages
Chinese (zh)
Other versions
CN112347765A (en)
Inventor
胡振中
刘毅
吴浪韬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tsinghua University
Original Assignee
Tsinghua University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tsinghua University
Priority to CN202011079331.0A
Publication of CN112347765A
Application granted
Publication of CN112347765B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Abstract

The invention discloses an entity labeling method, module and device based on dictionary matching. The method comprises: arranging the entity words from small to large into an ordered dictionary; establishing a forward index strip F for each entity word, wherein the ith element of F is the maximum prefix, in the ordered dictionary, of the first i characters of the entity word; inserting a sentence s into the ordered dictionary according to its size, taking the largest entity word smaller than s as the maximum common prefix base word w, computing the number x of leading characters shared by s and w, and thereby obtaining the maximum prefix of s in the ordered dictionary; if the maximum prefix exists, labeling the corresponding entity word in s with the labeling information of the maximum prefix and cutting it out of s, otherwise cutting out the first character of s; and taking the remainder as the new sentence s and repeating these steps until s is an empty string. With the invention, a segmented entity is obtained through a single binary search, which yields a pre-labeling result and reduces the repetitive labor of the user.

Description

Entity labeling method, module and device based on dictionary matching
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to an entity labeling method, module and device based on dictionary matching.
Background
With the continuous development of the internet and big data, large amounts of raw data have accumulated in every field, and extracting structured knowledge from this scattered data is one of the important routes to machine intelligence. As natural language processing technology matures, it has become feasible to automatically extract key information such as entities and relations from text, which lays a foundation for computers to understand textual knowledge. Entity extraction, also called named entity recognition, refers to recognizing named entities with specific meanings, such as person names, place names and equipment, from a text. Extracting entities and relations from a text requires segmenting the text into words and then labeling the words with a model trained in advance on a large amount of labeled data.
The current word segmentation method for text comprises the following steps:
(1) Dictionary-based word segmentation: the text to be segmented is cut into small segments according to a certain strategy, each segment is looked up in a dictionary, and a segment that is found is cut out as a word.
(2) Statistics-based word segmentation: adjacent characters that frequently occur together are grouped into a word, so new words can be recognized without relying on a dictionary.
(3) Neural-network-based word segmentation: a neural network model is trained on text that has already been segmented, and the resulting model is used to segment sentences.
Dictionary-based word segmentation is widely used because words that do not appear in the dictionary usually need not be considered. In this method the dictionary is divided by word length, for example into a dictionary of one-character words, a dictionary of two-character words, and so on. A prefix substring is taken from the remainder of the sentence for comparison, and only the dictionary whose word length equals the substring length needs to be searched; the words in each dictionary are usually stored in order so that binary search can be used to speed up the lookup. If the dictionary contains a word equal to the substring, the substring is cut out; otherwise the last character of the substring is removed and the comparison is repeated with the dictionary of the corresponding shorter word length. This process continues until the substring is matched or only one character is left in it, which completes one word cut.
The disadvantages of this word segmentation method are that dictionaries for the different word lengths must be built and that every cut starts from the dictionary with the largest word length. When the dictionary contains long words, many dictionaries must be searched and the number of lookups needed for a single cut increases accordingly. In addition, long words generally occur infrequently, so starting every cut in the long-word dictionaries is very inefficient for the short words that occur most often.
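The length-bucketed lookup described above can be sketched as follows. This is a minimal illustration of the prior-art scheme as the text describes it, not code from the patent; the function names `build_buckets`, `contains` and `segment` are hypothetical.

```python
from bisect import bisect_left


def build_buckets(words):
    """Group words by length and sort each group so it can be binary searched."""
    buckets = {}
    for word in words:
        buckets.setdefault(len(word), []).append(word)
    for bucket in buckets.values():
        bucket.sort()
    return buckets


def contains(bucket, word):
    """Binary search for word in one sorted bucket."""
    i = bisect_left(bucket, word)
    return i < len(bucket) and bucket[i] == word


def segment(sentence, buckets):
    """Cut the sentence by trying the longest word length first."""
    max_len = max(buckets, default=1)
    pieces = []
    while sentence:
        n = min(max_len, len(sentence))
        # One binary search per candidate length until a match or one character remains.
        while n > 1 and not contains(buckets.get(n, []), sentence[:n]):
            n -= 1
        pieces.append(sentence[:n])
        sentence = sentence[n:]
    return pieces


buckets = build_buckets(["AB", "ABC", "ABCC", "ABDD", "BCD"])
pieces = segment("ABCDE", buckets)
```

Each cut may run one binary search per word length, starting from the longest, which is the cost the text criticizes.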
Moreover, for labeling the segmented words, several Chinese entity labeling tools are available that give users an easy-to-operate graphical user interface. For example, BRAT is a web-based text labeling tool that lets the user define entity categories by modifying configuration files (Chinese category names are not supported); during labeling the user only needs to select a span of text and choose its entity category. YEDDA is a text labeling tool based on tkinter that also supports user-defined entity categories and entity labels, but it allows only 7 entity categories. However, these tools only provide a graphical user interface for display and interaction and do nothing to assist the manual annotation itself, which remains very labor intensive.
Disclosure of Invention
In order to solve the above problems, the present invention discloses an entity labeling method based on dictionary matching, comprising:
arranging the entity words in order from small to large to form an ordered dictionary;
establishing a forward index strip F for each entity word, wherein the ith element of F is the maximum prefix, in the ordered dictionary, of the character string formed by the first i characters of the entity word;
obtaining a sentence s to be labeled and virtually inserting it at the corresponding position in the ordered dictionary according to the size order, wherein if the ordered dictionary has no entity word smaller than the sentence s, the ordered dictionary contains no prefix of s;
if the ordered dictionary has entity words smaller than the sentence s, selecting the largest of them as the maximum common prefix base word w, and computing the number x of leading characters shared by s and w, which form the maximum common prefix p of s and w, wherein if x is equal to the word length w.length of w, w is the maximum prefix of s in the ordered dictionary; if x is 0, there is no prefix of s in the ordered dictionary; otherwise 0 < x < w.length, and the xth element in the forward index strip of w is the maximum prefix of s in the ordered dictionary;
if the maximum prefix of s exists in the ordered dictionary, labeling the corresponding entity word in s with the labeling information of the maximum prefix and cutting that entity word out of s, otherwise cutting out the first character of s; and taking the remainder as the sentence s and repeating the virtual insertion, the maximum prefix search and the labeling until s is an empty string.
Optionally, entity words are compared character by character according to their Unicode codes: at the first position where the characters differ, the entity word whose character has the larger code is the larger one; if no such position exists, the longer entity word is the larger one.
Optionally, the method further comprises: acquiring a labeled sentence, wherein the labeled sentence comprises a sentence and its labeling information, and the labeling information comprises an entity position, an entity word and an entity category; and updating the labeling information of the labeled sentence into the ordered dictionary, wherein for each entity, if the entity word is not in the ordered dictionary, the entity word and its entity category are added to the ordered dictionary, and otherwise the category of that entity in the ordered dictionary is changed to the new entity category.
Optionally, a sentence unit is used to store all information of one sentence, namely the content, the labeling information and the state information of the sentence; the labeling information of each entity is represented by a triple consisting of a start index, an end index and a category, wherein the start index and the end index record the start position and the end position of the entity word in the sentence and the category records the category of the entity word; and a sentence list records the list formed by all sentence units;
an entity unit is used to store one entity, wherein the entity unit encapsulates the entity word, the entity category and the forward index strip of the entity; and an entity list stores the ordered dictionary formed by the entity units.
The invention also provides an entity labeling module based on dictionary matching, comprising:
a dictionary sorting unit, configured to arrange the entity words in order from small to large by their character strings to form an ordered dictionary;
a forward index construction unit, configured to establish a forward index strip F for each entity word, wherein the ith element of F is the maximum prefix, in the ordered dictionary, of the character string formed by the first i characters of the entity word;
a pre-labeling unit, configured to obtain a sentence s to be labeled and virtually insert it at the corresponding position in the ordered dictionary according to the size order, wherein if the ordered dictionary has no entity word smaller than the sentence s, the ordered dictionary contains no prefix of s;
if the ordered dictionary has entity words smaller than the sentence s, the largest of them is selected as the maximum common prefix base word w, and the number x of leading characters shared by s and w, which form the maximum common prefix p of s and w, is computed, wherein if x is equal to the word length w.length of w, w is the maximum prefix of s in the ordered dictionary; if x is 0, there is no prefix of s in the ordered dictionary; otherwise 0 < x < w.length, and the xth element in the forward index strip of w is the maximum prefix of s in the ordered dictionary;
if the maximum prefix of s exists in the ordered dictionary, the corresponding entity word in s is labeled with the labeling information of the maximum prefix and cut out of s, otherwise the first character of s is cut out; the remainder is taken as the sentence s, and the virtual insertion, the maximum prefix search and the labeling are repeated until s is an empty string.
Optionally, the module further comprises a Unicode comparison unit, configured to convert the entity words into Unicode codes, wherein entity words are compared character by character according to the Unicode codes: at the first position where the characters differ, the entity word whose character has the larger code is the larger one; if no such position exists, the longer entity word is the larger one.
Optionally, the module further comprises a dictionary updating unit, configured to obtain one or more labeled sentences, wherein each labeled sentence comprises a sentence and its labeling information, and the labeling information comprises entity positions, entity words and entity categories; and to update the labeling information of the labeled sentences into the ordered dictionary, wherein for each entity, the entity word and its entity category are added to the ordered dictionary if the entity word is not in the ordered dictionary, and otherwise the category of that entity in the ordered dictionary is changed to the new entity category.
The invention also provides an entity labeling device, which comprises the above entity labeling module and further comprises:
a sentence list module, which stores all information of one sentence in a sentence unit, wherein the sentence unit stores the content, the labeling information set and the state information of the sentence; the labeling information of each entity is represented by a triple consisting of a start index, an end index and a category, wherein the start index and the end index record the start position and the end position of the entity word in the sentence and the category records the category to which the entity word belongs; and the sentence list records the list formed by all sentence units;
an entity list module, which stores one entity in an entity unit, wherein the entity unit encapsulates the entity word, the entity category and the forward index strip, and the entity list stores the ordered dictionary formed by the entity units;
a labeling management module, which records, by an index, the position in the sentence list of the sentence unit currently displayed and operated on;
reads an entity annotation file from a specified path and adds it to the sentence list, and saves the content of the sentence list to a specified path in the entity annotation file format, wherein the entity annotation file stores sentences to be labeled or already labeled;
and adds the entity-category pairs in a file to the entity list, and exports the entity units in the entity list to a file in the entity-category pair format;
and a graphical user interface module, configured to visually present the operations of the sentence list module, the entity list module, the labeling management module and the entity labeling module.
Optionally, the sentence list module displays the annotation information in a rich text format.
Optionally, the entity annotation file comprises a BIO or BIOES annotation file.
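As an illustration of how the (start index, end index, category) triples might map onto a BIO annotation file, the sketch below tags each character of a sentence. It is a hedged example only: the text does not say whether the end index is inclusive or exclusive (an exclusive end is assumed here), and the function name `to_bio` and the category name "Z" are hypothetical.

```python
def to_bio(sentence, tags):
    """Return (character, BIO tag) pairs for one sentence.

    tags is a list of (start, end, category) triples; end is assumed exclusive.
    """
    bio = ["O"] * len(sentence)
    for start, end, category in tags:
        bio[start] = f"B-{category}"        # first character of the entity
        for i in range(start + 1, end):
            bio[i] = f"I-{category}"        # remaining characters of the entity
    return list(zip(sentence, bio))


rows = to_bio("ABDDA", [(0, 4, "Z")])
```

A BIOES variant would additionally mark the last character of a multi-character entity with E and a single-character entity with S.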
The method establishes a forward index strip for each entity word in the ordered dictionary, virtually inserts the sentence into the ordered dictionary by binary search, and uses the forward index strip to segment the entity word from the sentence and attach the corresponding category label. Because of the forward index strip, a segmented entity is obtained with a single binary search instead of several, which yields a pre-labeling result and reduces the repetitive labor of the user.
Drawings
The above features and technical advantages of the present invention will become more apparent and readily appreciated from the following description of the embodiments thereof taken in conjunction with the accompanying drawings.
FIG. 1 is a block diagram of an entity labeling apparatus according to an embodiment of the present invention;
FIG. 2 is a logic diagram of an entity tagging method according to an embodiment of the present invention;
FIG. 3 is a diagram of a unit structure of an entity tagging module according to an embodiment of the present invention;
FIG. 4 is a functional view of a graphical user interface module of an embodiment of the present invention.
Detailed Description
Embodiments of the entity labeling method, module and apparatus based on dictionary matching according to the present invention will be described below with reference to the accompanying drawings. Those of ordinary skill in the art will recognize that the described embodiments can be modified in various different ways, or combinations thereof, without departing from the spirit and scope of the present invention. Accordingly, the drawings and description are illustrative in nature and not intended to limit the scope of the claims. Furthermore, in the present description, the drawings are not to scale and like reference numerals refer to like parts.
The entity refers to a named entity identified from a text, and the named entity refers to a concrete or abstract entity in the real world, such as a computer, a train, an airplane, a building, a teacher and the like.
The entity word refers to a unique identifier used to represent the entity. The entity position refers to the position of the entity word in the labeled sentence.
The entity categories refer to different categories divided according to entity words, for example, the above-mentioned computers, trains, airplanes, buildings and teachers can be divided into the entity categories of equipment, vehicles, buildings and professions.
The entity labeling method based on dictionary matching comprises the following steps:
S1, arranging the entity words in order from small to large to form an ordered dictionary;
Entity words are compared according to their Unicode codes. Two entity words are compared character by character: at the first position where the characters differ, the entity word whose character has the larger code is the larger one; if no such position exists, the longer entity word is the larger one. Two entity words are equal if and only if they have the same length and the same character at every position.
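The comparison rule in S1 can be written out as a small function. This is a sketch for illustration; the name `compare_entity_words` is hypothetical. Python's built-in string comparison already orders strings by code point and treats a proper prefix as the smaller string, so plain `sorted` gives the same order.

```python
from functools import cmp_to_key


def compare_entity_words(a, b):
    """Return -1, 0 or 1 according to the character-by-character rule."""
    for ca, cb in zip(a, b):
        if ca != cb:
            # First differing character decides, by Unicode code point.
            return -1 if ord(ca) < ord(cb) else 1
    # No differing character: the longer word is the larger one.
    return (len(a) > len(b)) - (len(a) < len(b))


words = ["BCD", "ABDD", "AB", "ABCC", "ABC"]
ordered = sorted(words, key=cmp_to_key(compare_entity_words))
```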
S2, establishing a forward index strip F for each entity word, wherein the ith element of F is the maximum prefix, in the ordered dictionary, of the character string formed by the first i characters of the entity word; if no such prefix exists, the element is marked as None. Here the maximum prefix is the longest entity word, among all entity words preceding this entity word in the ordered dictionary, that matches the beginning of those first i characters.
Taking the ordered dictionary [AB, ABC, ABCC, ABDD, BCD] as an example, each of AB, ABC, ABCC, ABDD and BCD has its own forward index strip F. For example:
The first character "A" of "AB" has no prefix in the ordered dictionary, so this position is marked as None.
The first character "A" of "ABC" has no prefix in the ordered dictionary, so this position is marked as None; the first two characters "AB" of "ABC" have the maximum prefix "AB" in the ordered dictionary, so this position is marked as "AB".
The first character "A" of "ABDD" has no prefix in the ordered dictionary, so this position is marked as None; the first two characters "AB" of "ABDD" have the maximum prefix "AB" in the ordered dictionary, so this position is marked as "AB"; the first three characters "ABD" of "ABDD" have the maximum prefix "AB" in the ordered dictionary, so this position is also marked as "AB".
Proceeding in the same way for each entity word, a forward index strip F is established for every entry. The ordered dictionary with the forward index strips F added is shown in Table 1 below:
TABLE 1
[Table 1 is provided as an image in the original: the ordered dictionary with the forward index strip F of each entity word.]
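The construction of the forward index strips can be sketched as follows. This is an illustrative reading of S2, not the patent's own code: the function name `build_forward_index` is hypothetical, and the strip is assumed to cover only the proper prefixes (i from 1 to the word length minus 1), since those are the only positions the worked examples list and the only ones the lookup uses.

```python
def build_forward_index(words):
    """Return (ordered dictionary, {word: forward index strip})."""
    ordered = sorted(set(words))
    vocabulary = set(ordered)
    forward = {}
    for word in ordered:
        strip, best = [], None
        for i in range(1, len(word)):
            # A longer prefix that is itself an entity word replaces the previous best.
            if word[:i] in vocabulary:
                best = word[:i]
            strip.append(best)
        forward[word] = strip
    return ordered, forward


ordered, forward = build_forward_index(["AB", "ABC", "ABCC", "ABDD", "BCD"])
```

For the example dictionary this reproduces the entries spelled out in the text, for instance [None, "AB", "AB"] for "ABDD".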
S3, obtaining the sentence s to be labeled and virtually inserting it (for example by binary search) at the corresponding position in the ordered dictionary according to its size. If the ordered dictionary has entity words smaller than the sentence s, the largest of them is selected as the maximum common prefix base word w, and the maximum common prefix length x of s and w is computed, that is, the largest x such that the first x characters of s and w are the same; these characters form the maximum common prefix p of s and w. If x is equal to the word length of w (denoted w.length), then w is the maximum prefix of s in the ordered dictionary; if x is 0, no prefix of s exists in the ordered dictionary; otherwise 0 < x < w.length, and the xth element in the forward index strip of w is the maximum prefix of s. Virtual insertion means that the sentence s is only located within the ordered dictionary temporarily so that it can be pre-labeled; it is not permanently inserted.
Taking Table 1 as an example, if s lands at the top, no prefix of s exists in the ordered dictionary; otherwise the word w immediately above s is found and the maximum common prefix length x of s and w is computed. If x is equal to the word length of w, then w is the maximum prefix of s; if x is 0, no prefix of s exists in the ordered dictionary; otherwise 0 < x < w.length, and the xth element in the forward index strip of w is the maximum prefix of s.
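The lookup in S3 can be sketched with Python's `bisect` module. The sketch is illustrative only: `max_prefix`, `ORDERED` and `FORWARD` are hypothetical names, and the strips are written out by hand from the rule in S2. One assumption deserves attention: `bisect_right` is used so that a sentence exactly equal to an entity word takes that word as w, whereas the text speaks of the largest entity word smaller than s.

```python
from bisect import bisect_right

# Ordered dictionary and forward index strips for the example (proper prefixes only).
ORDERED = ["AB", "ABC", "ABCC", "ABDD", "BCD"]
FORWARD = {
    "AB": [None],
    "ABC": [None, "AB"],
    "ABCC": [None, "AB", "ABC"],
    "ABDD": [None, "AB", "AB"],
    "BCD": [None, None],
}


def max_prefix(s, ordered=ORDERED, forward=FORWARD):
    """Return the longest entity word that is a prefix of s, or None."""
    pos = bisect_right(ordered, s)      # virtual insertion: nothing is stored
    if pos == 0:
        return None                     # s sits at the top of the dictionary
    w = ordered[pos - 1]                # base word immediately above s
    x = 0
    while x < len(w) and x < len(s) and w[x] == s[x]:
        x += 1                          # length of the common prefix of s and w
    if x == len(w):
        return w
    if x == 0:
        return None
    return forward[w][x - 1]            # the xth element, counted from 1


result = max_prefix("ABCDE")
```

Only one binary search is performed per call; the rest is a character scan and a single list access.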
S4, if the maximum prefix of s exists in the ordered dictionary, the corresponding entity word in the sentence s is labeled with the labeling information of the maximum prefix and cut out of s; otherwise the first character of s is cut out. The remainder is taken as the sentence s, and steps S3 and S4 are repeated until s is an empty string.
In this way, the entities in the sentence s to be labeled can be pre-labeled using the ordered dictionary.
Two sentences are taken as examples below.
Let the sentence s be "ABDDA". According to the Unicode comparison it is inserted between "ABDD" and "BCD" in Table 1.
The maximum common prefix of "ABDD" and "ABDDA" is "ABDD", and its length equals the length 4 of the closest entity word "ABDD", so "ABDD" is the maximum prefix of s in the ordered dictionary, and the "ABDD" in the sentence s is labeled with the labeling information of the maximum prefix "ABDD". "ABDD" is then cut out, and the remaining "A" is inserted into the ordered dictionary by binary search, where it lands above "AB". It has no maximum common prefix in the ordered dictionary, so the single character "A" is cut out directly; s is now empty and the process ends.
Let the sentence s be "ABCDE". It is inserted between "ABCC" and "ABDD" in Table 1. Since the sentence is not at the top, the closest entity word above it, "ABCC", is found. The maximum common prefix of "ABCC" and "ABCDE" is "ABC", whose length 3 is smaller than the length 4 of "ABCC", so the 3rd element in the forward index strip of "ABCC" is the maximum prefix of s in the ordered dictionary; that is, "ABC" is the maximum prefix of "ABCDE".
"ABC" is cut out of "ABCDE", and "DE" is then placed into the ordered dictionary by binary search, where it lands after "BCD" in Table 1. The maximum common prefix length of "DE" and "BCD" is 0 (no common prefix), which means "DE" has no prefix in the ordered dictionary, so the single character "D" is cut out. "E" is then placed into the ordered dictionary by binary search; it has no prefix in the ordered dictionary either, so the single character "E" is cut out, and s is empty.
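The two worked examples can be reproduced end to end with the sketch below, which combines steps S1 to S4. It is an illustration under stated assumptions rather than the patent's implementation: the names `pre_label` and `CATEGORIES` and the category labels "X", "Y", "Z" are hypothetical, end indices are assumed exclusive, and `bisect_right` is used so that a remainder equal to an entity word matches it.

```python
from bisect import bisect_right

# Hypothetical entity-category pairs for the example dictionary.
CATEGORIES = {"AB": "X", "ABC": "Y", "ABCC": "Y", "ABDD": "Z", "BCD": "X"}


def pre_label(sentence, categories=CATEGORIES):
    """Return (start, end, category) triples found by dictionary matching."""
    ordered = sorted(categories)                          # S1
    forward = {}                                          # S2
    for word in ordered:
        best, strip = None, []
        for i in range(1, len(word)):
            best = word[:i] if word[:i] in categories else best
            strip.append(best)
        forward[word] = strip

    labels, offset, s = [], 0, sentence
    while s:                                              # S3 and S4
        hit = None
        pos = bisect_right(ordered, s)
        if pos:
            w = ordered[pos - 1]
            x = 0
            while x < min(len(w), len(s)) and w[x] == s[x]:
                x += 1
            if x == len(w):
                hit = w
            elif x > 0:
                hit = forward[w][x - 1]
        step = len(hit) if hit else 1                     # cut entity or one character
        if hit:
            labels.append((offset, offset + step, categories[hit]))
        s, offset = s[step:], offset + step
    return labels


labels_1 = pre_label("ABDDA")
labels_2 = pre_label("ABCDE")
```

In both examples only the first cut produces a label; the remaining characters are cut one at a time without one.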
Further, the method also includes step S5 (the position of step S5 is not limited; it may be performed before or after any other step):
obtaining a labeled sentence, wherein the labeled sentence comprises a sentence and its labeling information, and the labeling information comprises an entity position, an entity word and an entity category; and updating the labeling information of the labeled sentence into the ordered dictionary, wherein for each entity word, the entity word and its entity category are added to the ordered dictionary if the entity word does not exist in the ordered dictionary, and otherwise the category of the entity word in the ordered dictionary is changed to the new entity category.
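Step S5 can be sketched as an add-or-overwrite update. The example is illustrative only: `update_dictionary` is a hypothetical name, the dictionary is held as a sorted list plus a word-to-category mapping, and end indices are assumed exclusive. Because a newly added word can change the forward index strips of the words that follow it, the sketch assumes the strips are rebuilt after the update.

```python
from bisect import insort


def update_dictionary(ordered, categories, sentence, labels):
    """Merge the labels of one labeled sentence into the ordered dictionary."""
    for start, end, category in labels:
        word = sentence[start:end]
        if word not in categories:
            insort(ordered, word)      # new entity word: keep the list sorted
        categories[word] = category    # add, or overwrite with the new category


ordered = ["AB", "BCD"]
categories = {"AB": "X", "BCD": "X"}
update_dictionary(ordered, categories, "ABDDAB", [(0, 4, "Z"), (4, 6, "Y")])
```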
Furthermore, after the sentence to be labeled has been labeled, the labeled sentence is displayed visually so that the user can check for labeling errors and correct them.
Further, the present embodiment defines several data types for storing sentences and dictionaries. A sentence unit stores all information of one sentence and encapsulates the content sntc of the sentence, the labeling information set tag and the state information state. The labeling information of each entity is represented by a triple in the format (start index, end index, category), wherein the start index and the end index are integers that record the start position and the end position of the entity word in the sentence, and the category element records the category of the entity word. SntcList records the list of all sentence units.
An entity unit, WordItem, stores one entity and encapsulates the content word, the entity category kid and the forward index strip forward of the entity. WordList stores the list of entity units ordered from small to large by the character strings of their entity words.
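These data types might look as follows in Python. This is a sketch under assumptions: the field names sntc, tag, state, word, kid and forward and the type names WordItem, SntcList and WordList come from the text, while the class name `SntcItem`, the field types and the default state value are hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

Tag = Tuple[int, int, str]  # (start index, end index, category)


@dataclass
class SntcItem:
    """Sentence unit: content, labeling information set and state."""
    sntc: str
    tag: List[Tag] = field(default_factory=list)
    state: str = "unlabeled"  # assumed representation of the state information


@dataclass(order=True)
class WordItem:
    """Entity unit, ordered by its entity word only."""
    word: str
    kid: str = field(compare=False)
    forward: List[Optional[str]] = field(default_factory=list, compare=False)


SntcList = List[SntcItem]
WordList = List[WordItem]

word_list: WordList = sorted([WordItem("BCD", "X"), WordItem("AB", "Y")])
sntc_list: SntcList = [SntcItem("ABDDA", [(0, 4, "Z")], "labeled")]
```

Excluding kid and forward from comparison keeps the sort order of WordList identical to the string order of the entity words.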
The invention also provides an entity labeling module 10 based on dictionary matching, as shown in fig. 3, which comprises a dictionary sorting unit 11, a forward index building unit 12 and a pre-labeling unit 13. The entity labeling module 10 may be integrated into a computer device comprising a processor, a memory and an input module. A module, as referred to herein, is a series of computer program instruction segments capable of performing a specified function. The entity labeling module 10 is executed by the processor to perform the pre-labeling of sentences.
The memory includes at least one type of readable storage medium. The at least one type of readable storage medium may be a non-volatile storage medium such as a flash memory, a hard disk, a multimedia card, a card-type storage, and the like. In some embodiments, the readable storage medium may be an internal storage unit of the computer device, such as a hard disk of the computer device. In other embodiments, the readable storage medium may also be an external memory of the computer device, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) Card, a Flash memory Card (Flash Card), and the like, provided on the computer device.
In this embodiment, the readable storage medium of the memory is used for storing the entity labeling module 10. The memory may also be used to store input and output data.
The processor may be a Central Processing Unit (CPU), a microprocessor or other data Processing chip in some embodiments, and is configured to execute program codes stored in the memory or process data, such as program codes corresponding to the entity tagging module 10.
The dictionary sorting unit 11 is configured to arrange the entity words in order from small to large to form an ordered dictionary;
the forward index constructing unit 12 is configured to establish a forward index strip F for each entity word, wherein the ith element of F is the maximum prefix, in the ordered dictionary, of the character string formed by the first i characters of the entity word; if no such prefix exists, the element is marked as None. Here the maximum prefix is the longest entity word, among all entity words preceding this entity word in the ordered dictionary, that matches the beginning of those first i characters.
Taking the ordered dictionary [AB, ABC, ABCC, ABDD, BCD] as an example, each of AB, ABC, ABCC, ABDD and BCD has its own forward index strip F. For example:
The first character "A" of "AB" has no prefix in the ordered dictionary, so this position is marked as None.
The first character "A" of "ABC" has no prefix in the ordered dictionary, so this position is marked as None; the first two characters "AB" of "ABC" have the maximum prefix "AB" in the ordered dictionary, so this position is marked as "AB".
The first character "A" of "ABDD" has no prefix in the ordered dictionary, so this position is marked as None; the first two characters "AB" of "ABDD" have the maximum prefix "AB" in the ordered dictionary, so this position is marked as "AB"; the first three characters "ABD" of "ABDD" have the maximum prefix "AB" in the ordered dictionary, so this position is also marked as "AB".
Proceeding in the same way for each entity word, a forward index strip F is established for every entry. The ordered dictionary with the forward index strips F added is shown in Table 1 below:
TABLE 1
[Table 1 is provided as images in the original: the ordered dictionary with the forward index strip F of each entity word.]
S3, obtaining the sentence s to be labeled and virtually inserting it at the corresponding position in the ordered dictionary according to its size. If the ordered dictionary has entity words smaller than the sentence s, the largest of them is selected as the maximum common prefix base word w, and the maximum common prefix length x of s and w is computed, that is, the largest x such that the first x characters of s and w are the same; these characters form the maximum common prefix p of s and w. If x is equal to the word length of w (denoted w.length), then w is the maximum prefix of s in the ordered dictionary; if x is 0, no prefix of s exists in the ordered dictionary; otherwise 0 < x < w.length, and the xth element in the forward index strip of w is the maximum prefix of s. Virtual insertion means that the sentence s is only located within the ordered dictionary temporarily so that it can be pre-labeled; it is not permanently inserted.
Taking Table 1 as an example, if s lands at the top, no prefix of s exists in the ordered dictionary; otherwise the word w immediately above s is found and the maximum common prefix length x of s and w is computed. If x is equal to the word length of w, then w is the maximum prefix of s; if x is 0, no prefix of s exists in the ordered dictionary; otherwise 0 < x < w.length, and the xth element in the forward index strip of w is the maximum prefix of s.
S4, if the maximum prefix of s exists in the ordered dictionary, the corresponding entity word in the sentence s is labeled with the labeling information of the maximum prefix and cut out of s; otherwise the first character of s is cut out. The remainder is taken as the sentence s, and steps S3 and S4 are repeated until s is an empty string.
In this way, the entities in the sentence s to be labeled can be pre-labeled using the ordered dictionary.
Two example sentences are given below.
Let the sentence s be "ABDDA". By Unicode code comparison, it is inserted between "ABDD" and "BCD" in Table 1.
The maximum common prefix of "ABDD" and "ABDDA" is "ABDD", and its length equals the length 4 of the closest entity word "ABDD", so "ABDD" is the maximum prefix of s in the ordered dictionary, and the "ABDD" part of the sentence s is labeled with the labeling information of the maximum prefix "ABDD". "ABDD" is then cut out, and the remaining "A" is inserted into the ordered dictionary by binary search, where it lands above "AB", that is, at the top. It therefore has no prefix in the ordered dictionary, so the single character "A" is cut out directly; s is now empty and the process ends.
Let the sentence s be "ABCDE", which is inserted between "ABCC" and "ABD" in Table 1. Since the sentence is not at the top, the closest entity word above it, "ABCC", is found. The maximum common prefix of "ABCC" and "ABCDE" is "ABC", whose length 3 is smaller than the length 4 of the closest entity word "ABCC", so the 3rd element in the forward index strip of "ABCC" is the maximum prefix of s in the ordered dictionary; that is, "ABC" is the maximum prefix of "ABCDE".
"ABC" is cut out of "ABCDE", and "DE" is then placed into the ordered dictionary by binary search, where it is inserted below "BCD" in Table 1. The maximum common prefix length of "DE" and "BCD" is 0 (no common prefix), so "DE" has no prefix in the ordered dictionary and the single character "D" is cut out. "E" is then placed into the ordered dictionary by binary search; it also has no prefix in the ordered dictionary, so the single character "E" is cut out, s is empty, and the process ends.
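Steps S3 and S4 and the two worked examples can be reproduced with the short Python sketch below. It is an illustration under stated assumptions rather than the implementation of the invention: the function names, the category labels "X", "Y" and "Z", the five-word sample dictionary, and the convention that the end index is exclusive are all hypothetical. The virtual insertion is modelled with `bisect_right` from the standard library, which only computes a position and leaves the list unchanged. Python compares strings by Unicode code point, which matches the ordering the method requires.

```python
from bisect import bisect_right


def build_forward_index(words):
    """F[i - 1] = longest dictionary entry that is a prefix of word[:i], else None."""
    entries = set(words)
    index = {}
    for word in words:
        strip, longest = [], None
        for i in range(1, len(word) + 1):
            if word[:i] in entries:
                longest = word[:i]
            strip.append(longest)
        index[word] = strip
    return index


def max_prefix(s, words, index):
    """Step S3: longest dictionary entry that is a prefix of s, or None."""
    pos = bisect_right(words, s)      # virtual insertion position of s
    if pos == 0:
        return None                   # s sorts above every entry
    w = words[pos - 1]                # largest entry not greater than s
    x = 0
    while x < min(len(s), len(w)) and s[x] == w[x]:
        x += 1                        # x = maximum common prefix length
    if x == len(w):
        return w
    if x == 0:
        return None
    return index[w][x - 1]            # x-th element of the strip of w


def pre_label(sentence, categories):
    """Step S4: return (start, end, category) triples, end exclusive."""
    words = sorted(categories)
    index = build_forward_index(words)
    labels, start = [], 0
    while start < len(sentence):
        p = max_prefix(sentence[start:], words, index)
        if p is None:
            start += 1                # cut out a single character
        else:
            labels.append((start, start + len(p), categories[p]))
            start += len(p)           # cut out the matched entity word
    return labels


# Hypothetical dictionary and categories, for illustration only.
categories = {"AB": "X", "ABC": "Y", "ABCC": "Y", "ABDD": "Z", "BCD": "X"}
print(pre_label("ABDDA", categories))
print(pre_label("ABCDE", categories))
```

With this sample dictionary, "ABDDA" yields one label covering "ABDD" and "ABCDE" yields one label covering "ABC", in line with the walk-through above.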
Further, the module also comprises a Unicode code comparison unit, which converts the entity words into Unicode codes. Two entity words are compared character by character according to their Unicode codes: at the first position where the characters differ, the entity word whose character has the larger code is the larger one; if no such position exists, the longer entity word is the larger one.
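The comparison rule can be written out explicitly as in the sketch below. The function name `compare_words` is hypothetical. In Python the rule coincides with the built-in ordering of `str`, which is lexicographic by Unicode code point, so the explicit function is shown only to make the rule concrete.

```python
from functools import cmp_to_key


def compare_words(a, b):
    """Return -1, 0 or 1 according to the character-by-character rule."""
    for ca, cb in zip(a, b):
        if ca != cb:
            # The first differing character decides the order.
            return -1 if ord(ca) < ord(cb) else 1
    # No differing character: the longer word is the larger one.
    return (len(a) > len(b)) - (len(a) < len(b))


words = ["BCD", "ABDD", "AB", "ABCC", "ABC"]
ordered = sorted(words, key=cmp_to_key(compare_words))
print(ordered)
```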
The module further comprises a dictionary updating unit for acquiring one or more labeled sentences, wherein each labeled sentence comprises a sentence and its labeling information, and the labeling information comprises entity positions, entity words and entity categories. The labeling information in the labeled sentences is updated into the ordered dictionary: for each entity, if the entity word is not in the ordered dictionary, the entity word and its entity category are added to the ordered dictionary; otherwise, the category of that entity word in the ordered dictionary is changed to the new entity category. In other words, the category of an entity in a piece of text is assumed most likely to be its most recently observed category. Other implementations may of course be adopted; for example, a counter may be kept for each category of an entity word, the occurrences of each category counted, and the category with the largest count taken as the category of the entity word.
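Both update policies can be sketched as follows. The function names, the plain `dict` used to hold word-to-category pairs, and the exclusive end index are assumptions made for the illustration. The sketch covers only the category bookkeeping; keeping the dictionary sorted and refreshing the forward index strips after a new word is added would be handled separately.

```python
from collections import Counter, defaultdict


def update_latest(dictionary, labeled_sentences):
    """Policy 1: the most recently seen category of a word wins."""
    for sentence, labels in labeled_sentences:
        for start, end, category in labels:
            dictionary[sentence[start:end]] = category
    return dictionary


def update_by_count(counts, labeled_sentences):
    """Policy 2: keep a counter per category and take the most frequent one."""
    for sentence, labels in labeled_sentences:
        for start, end, category in labels:
            counts[sentence[start:end]][category] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}


# Hypothetical labeled sentences: (text, [(start, end, category), ...]).
labeled = [
    ("ABDDA", [(0, 4, "Z")]),
    ("ABDD", [(0, 4, "Z")]),
    ("ABDD", [(0, 4, "Q")]),
]
latest = update_latest({}, labeled)
majority = update_by_count(defaultdict(Counter), labeled)
print(latest, majority)
```

In this sample the two policies disagree: the latest-category policy records "Q" for "ABDD", while the counting policy records "Z".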
The invention also provides an entity labeling device 20, which can be installed in computer equipment. According to the functions realized, as shown in Fig. 1, the entity labeling device 20 may include an entity labeling module 10, a sentence list module 21, an entity list module 22, an annotation management module 23, and a graphical user interface module 24. A module in the present invention refers to a series of computer program segments that are stored in a memory of the computer device, can be executed by a processor of the computer device, and perform a fixed function. The computer program segments can be written in Python 3, with the code implementing each function encapsulated in a corresponding class in an object-oriented manner; each module is then formed from these classes. The classes contained in each module are as follows:
Sentence list module 21: SntcItem, SntcList
Entity list module 22: WordItem, WordList
Annotation management module 23: TagManager
Graphical user interface module 24: EntitytTagFrame
The graphical user interface module 24 is implemented with the wxPython graphics library. The program is written so that business logic is separated from the user interface; that is, the functional implementation of the software is placed in the annotation management module as far as possible, which makes the program easier to debug and modify.
A schematic logical view of the entity labeling device 20 is shown in Fig. 2. All labeling operations performed by the user on a sentence, such as changing its labeling state or adding labeling information, are reflected as modifications of the corresponding sentence unit in the sentence list, and the sentence list then feeds the result back to the user through the graphical user interface. When the user finishes labeling a sentence and submits it, the sentence list updates the entity list with the submitted sentence unit. The entity labeling module then uses the entity list to pre-label the next sentence to be labeled and stores the pre-labeling result in the sentence list; the pre-labeled sentence unit is displayed to the user, who can modify the labeling information and, after confirming that it is correct, save the labeled sentence in the sentence unit.
In the present embodiment, the functions of the respective modules/units are as follows:
The sentence list module 21 provides the following functions:
Add annotation information: after an entity in a sentence is labeled, a piece of labeling information, namely a triple of "start index, end index, category", is added to the labeling information set;
Delete annotation information: given a start index and an end index, all labeling information located within the interval between the two indexes is deleted from the labeling information set;
Clear annotation information: the labeling information set is emptied;
Change annotation state: the labeling state is changed to one of "unlabeled", "to be completed" and "completed";
Extract all entities: all entities contained in the sentence unit are extracted as "entity-category" pairs;
Add sentence unit: a sentence unit is constructed from the sentence content, the labeling information set and the labeling state, and is appended to the end of the sentence list.
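The functions listed above could be organised roughly as in the following sketch. Only the class names `SntcItem` and `SntcList` come from the text; the method names, field names, state strings, exclusive end index, and the reading of "delete" as removing every label that lies entirely inside the given interval are assumptions.

```python
from dataclasses import dataclass, field

# Illustrative renderings of the three labeling states named above.
STATES = ("unlabeled", "to be completed", "completed")


@dataclass
class SntcItem:
    content: str
    labels: list = field(default_factory=list)  # (start, end, category) triples
    state: str = STATES[0]

    def add_label(self, start, end, category):
        self.labels.append((start, end, category))
        self.labels.sort()                       # keep labels in sentence order

    def delete_labels(self, start, end):
        # Drop every label lying entirely inside [start, end).
        self.labels = [l for l in self.labels
                       if not (start <= l[0] and l[1] <= end)]

    def clear_labels(self):
        self.labels.clear()

    def set_state(self, state):
        if state not in STATES:
            raise ValueError(f"unknown labeling state: {state!r}")
        self.state = state

    def extract_entities(self):
        # Return "entity-category" pairs for all labels of this sentence.
        return [(self.content[s:e], c) for s, e, c in self.labels]


class SntcList(list):
    def add_item(self, content, labels=(), state=STATES[0]):
        # Build a sentence unit and append it to the end of the list.
        self.append(SntcItem(content, sorted(labels), state))


sentences = SntcList()
sentences.add_item("ABDDA", [(0, 4, "Z")])
print(sentences[0].extract_entities())
```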
The entity list module 22 provides the following functions:
Add entity unit: an entity unit is constructed from the entity word and its category and inserted into the entity list in such a way that the entity list remains an ordered list after the insertion;
Find entity: an entity is searched for in the entity list, and if the search succeeds, the matching entity unit is returned.
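An ordered entity list with these two functions can be sketched with the standard `bisect` module. Only the class names `WordItem` and `WordList` appear in the text; the method names and the internal layout are assumptions, and the forward index strip that each entity unit also encapsulates is left out for brevity.

```python
from bisect import bisect_left


class WordItem:
    def __init__(self, word, category):
        self.word = word
        self.category = category


class WordList:
    def __init__(self):
        self._words = []   # sorted entity words, parallel to _items
        self._items = []

    def add(self, word, category):
        """Insert a new entity unit in sorted position, or update its category."""
        pos = bisect_left(self._words, word)
        if pos < len(self._words) and self._words[pos] == word:
            self._items[pos].category = category
        else:
            self._words.insert(pos, word)
            self._items.insert(pos, WordItem(word, category))
        return self._items[pos]

    def find(self, word):
        """Return the matching entity unit, or None if the word is absent."""
        pos = bisect_left(self._words, word)
        if pos < len(self._words) and self._words[pos] == word:
            return self._items[pos]
        return None


entities = WordList()
for w, c in [("BCD", "X"), ("AB", "X"), ("ABDD", "Z"), ("ABC", "Y")]:
    entities.add(w, c)
print([item.word for item in entities._items])
```

Binary search keeps lookups logarithmic in the number of entity words, while each insertion still shifts list elements and is linear in the worst case.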
The entity labeling module 10 performs automatic labeling: a sentence s is passed in and segmented through steps S1, S2, S3 and S4 above, and the categories and in-sentence positions of all matching entities are returned in the form of a labeling information set, so that the sentence s with the labeling information added can be stored in a sentence unit.
Further, an annotation management module 23 is also included.
The annotation management module 23 maintains an index that records the position, in the sentence list, of the sentence unit currently displayed and operated on. Its functions are designed as follows:
Current sentence unit: the sentence unit at the position in the sentence list given by the index is returned;
Go to the next sentence: the next sentence unit with the specified state is searched for in the sentence list, and if it is found, the index is modified to point to it;
Go to the previous sentence: the previous sentence unit with the specified state is searched for in the sentence list, and if it is found, the index is modified to point to it;
Update entity list: the WordItems in the entity list are updated from the current sentence unit or from the whole sentence list by means of the dictionary updating unit;
Open entity annotation file: the content of an entity annotation file is read from a specified path and the corresponding sentence list is created;
Save entity annotation file: the content of the sentence list is stored to a specified path in the format of the entity annotation file;
Import sentence set: an unlabeled sentence set is read from a specified path and the sentence list is initialized;
Export sentence set: only the sentence content of all SntcItems in the sentence list is exported to a file, without any labeling information;
Import BIO or BIOES annotation file: for compatibility with common labeling schemes, reading information from a BIO or BIOES annotation file and initializing the sentence list from it is supported;
Export BIO or BIOES annotation file: all labeled sentences are exported to a file in BIO or BIOES format;
Import entity-category pair set: all entity-category pairs in a specified file are added to the entity list, so that the user can use existing information to assist pre-labeling;
Export entity-category pair set: all WordItems in the entity list are exported to a file in the format of entity-category pairs, so that the user can inspect them and import them when labeling new files.
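The BIO and BIOES export can be illustrated by converting the "start index, end index, category" triples of one sentence into a tag sequence. The sketch below is an assumption-laden illustration: the function name `to_tags` is hypothetical, tagging is done per character, the end index is treated as exclusive, and the on-disk layout of the exported file is not specified here.

```python
def to_tags(sentence, labels, scheme="BIO"):
    """Convert (start, end, category) triples into one tag per character."""
    tags = ["O"] * len(sentence)
    for start, end, category in labels:
        if scheme == "BIOES" and end - start == 1:
            tags[start] = "S-" + category        # single-character entity
            continue
        tags[start] = "B-" + category            # beginning of the entity
        for i in range(start + 1, end):
            tags[i] = "I-" + category            # inside the entity
        if scheme == "BIOES":
            tags[end - 1] = "E-" + category      # end of the entity
    return tags


print(to_tags("ABDDA", [(0, 4, "Z")]))
print(to_tags("ABDDA", [(0, 4, "Z")], scheme="BIOES"))
```

Importing works in the opposite direction: consecutive tags starting at a B- (or S-) tag are merged back into one triple.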
Further, a graphical user interface module 24 is included.
The graphical user interface module 24 is mainly used for receiving user input and implements its main functions through the methods of the annotation management module 23. As shown in Fig. 4, the file-handling functions include creating, opening, saving or saving as, importing and exporting; the settings include entity type, page jump, visible sentence state, current sentence state and enabling pre-labeling; and the labeling functions include displaying labeling information, labeling an entity, removing an entity, previous sentence and next sentence.
The device displays the labeling information in rich text format. Unlike plain text, rich text can carry text formatting, which makes it more readable. For example, for the sentence "increase the installation height of the equipment and the lead", the labeling information can be displayed in rich text format as follows:
improvement of mounting height of device and lead wire
Further, a packaging module is included for packaging the entity labeling device 20 into an EXE executable file using PyInstaller.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. An entity labeling method based on dictionary matching, characterized by comprising the following steps:
arranging entity words in order from small to large to form an ordered dictionary;
establishing a forward index strip F for each entity word, wherein the i-th element of F is the maximum prefix, in the ordered dictionary, of the character string formed by the first i characters of the entity word;
obtaining a sentence s to be labeled and virtually inserting it into the corresponding position in the ordered dictionary according to its sort order; if the ordered dictionary contains no entity word smaller than the sentence s, no prefix of the sentence s exists in the ordered dictionary;
if the ordered dictionary contains entity words smaller than the sentence s, selecting the largest of them as the maximum-common-prefix base word w, and determining the first x identical characters of s and w, which form the maximum common prefix p of s and w; if x is equal to the word length w.length of w, w is the maximum prefix of s in the ordered dictionary; if x is 0, no prefix of s exists in the ordered dictionary; otherwise 0 < x < w.length, and the x-th element in the forward index strip of w is the maximum prefix of s in the ordered dictionary;
and if the maximum prefix of the sentence s exists in the ordered dictionary, adding labeling information to the corresponding entity word of the sentence s using the labeling information of the maximum prefix and cutting the entity word corresponding to the maximum prefix out of s; otherwise, cutting the first character out of s; taking the part remaining after cutting as the sentence s, and repeating the virtual insertion into the ordered dictionary, the search for the maximum prefix and the addition of labeling information until s is an empty character string.
2. The entity labeling method based on dictionary matching according to claim 1,
wherein the size of entity words is determined by comparing two entity words character by character according to their Unicode codes: if a first differing character exists, the entity word whose character has the larger code is the larger one; otherwise, the longer entity word is the larger one.
3. The entity labeling method based on dictionary matching according to claim 1,
further comprising: obtaining a labeled sentence, wherein the labeled sentence comprises a sentence and labeling information of the sentence, and the labeling information comprises an entity position, an entity word and an entity category; and updating the labeling information in the labeled sentence into the ordered dictionary, wherein for each entity, if the entity word does not exist in the ordered dictionary, the entity word and the entity category are added to the ordered dictionary; otherwise, the entity category in the ordered dictionary is changed to the new entity category.
4. The entity labeling method based on dictionary matching according to claim 1,
wherein a sentence unit is used to store all information of a sentence, the sentence unit storing the content, the labeling information set and the state information of the sentence; the labeling information of each entity is represented by a triple comprising a start index, an end index and a category, the start index and the end index recording the start position and the end position of the entity word in the sentence and the category recording the category to which the entity word belongs; and a sentence list is used to record the list formed by all sentence units;
and an entity unit is used to store an entity, the entity unit encapsulating the entity word, the entity category and the forward index strip of the entity, and an entity list is used to store the ordered dictionary formed by the entity units.
5. An entity labeling module based on dictionary matching, comprising:
a dictionary sorting unit for arranging entity words in order of their character strings from small to large to form an ordered dictionary;
a forward index construction unit for establishing a forward index strip F for each entity word, wherein the i-th element of F is the maximum prefix, in the ordered dictionary, of the character string formed by the first i characters of the entity word;
a pre-labeling unit for obtaining a sentence s to be labeled and virtually inserting it into the corresponding position in the ordered dictionary according to its sort order; if the ordered dictionary contains no entity word smaller than the sentence s, no prefix of the sentence s exists in the ordered dictionary;
if the ordered dictionary contains entity words smaller than the sentence s, selecting the largest of them as the maximum-common-prefix base word w, and determining the first x identical characters of s and w, which form the maximum common prefix p of s and w; if x is equal to the word length w.length of w, w is the maximum prefix of s in the ordered dictionary; if x is 0, no prefix of s exists in the ordered dictionary; otherwise 0 < x < w.length, and the x-th element in the forward index strip of w is the maximum prefix of s in the ordered dictionary;
if the maximum prefix of the sentence s exists in the ordered dictionary, adding labeling information to the corresponding entity word of the sentence s using the labeling information of the maximum prefix and cutting the entity word corresponding to the maximum prefix out of s; otherwise, cutting the first character out of s; taking the part remaining after cutting as the sentence s, and repeating the virtual insertion into the ordered dictionary, the search for the maximum prefix and the addition of labeling information until s is an empty character string.
6. The entity labeling module based on dictionary matching according to claim 5,
further comprising a Unicode code comparison unit for converting the entity words into Unicode codes, wherein the size of entity words is determined by comparing two entity words character by character according to their Unicode codes: if a first differing character exists, the entity word whose character has the larger code is the larger one; otherwise, the longer entity word is the larger one.
7. The entity labeling module based on dictionary matching according to claim 5,
further comprising a dictionary updating unit for obtaining one or more labeled sentences, wherein each labeled sentence comprises a sentence and labeling information of the sentence, and the labeling information comprises an entity position, an entity word and an entity category; the labeling information in the labeled sentences is updated into the ordered dictionary, wherein for each entity, if the entity word is not in the ordered dictionary, the entity word and the entity category are added to the ordered dictionary; otherwise, the entity category in the ordered dictionary is changed to the new entity category.
8. An entity tagging device comprising the entity tagging module of any one of claims 5 to 7, and further comprising:
a sentence list module, which uses a sentence unit to store all information of a sentence, the sentence unit storing the content, the labeling information set and the state information of the sentence, wherein the labeling information of each entity is represented by a triple comprising a start index, an end index and a category, the start index and the end index recording the start position and the end position of the entity word in the sentence and the category recording the category to which the entity word belongs, and wherein a sentence list records the list formed by all sentence units;
an entity list module, which uses an entity unit to store an entity, the entity unit encapsulating the entity word, the entity category and the forward index strip, and which uses an entity list to store the ordered dictionary formed by the entity units;
an annotation management module, which records, by means of an index, the position in the sentence list of the sentence unit currently displayed and operated on,
which reads an entity annotation file from a specified path and adds it to the sentence list, and stores the content of the sentence list to a specified path in the format of the entity annotation file, wherein the entity annotation file stores sentences that are to be labeled or have been labeled,
and which adds the entity-category pairs in a file to the entity list and exports the entity units in the entity list to a file in the format of entity-category pairs;
and a graphical user interface module for visually presenting the operations of the sentence list module, the entity list module, the annotation management module and the entity labeling module.
9. The entity tagging device of claim 8 wherein said sentence listing module displays tagging information in a rich text format.
10. The entity tagging device of claim 8,
the entity annotation file comprises a BIO or BIOES annotation file.
CN202011079331.0A 2020-10-10 2020-10-10 Entity labeling method, module and device based on dictionary matching Active CN112347765B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011079331.0A CN112347765B (en) 2020-10-10 2020-10-10 Entity labeling method, module and device based on dictionary matching

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011079331.0A CN112347765B (en) 2020-10-10 2020-10-10 Entity labeling method, module and device based on dictionary matching

Publications (2)

Publication Number Publication Date
CN112347765A CN112347765A (en) 2021-02-09
CN112347765B true CN112347765B (en) 2022-06-07

Family

ID=74361553

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011079331.0A Active CN112347765B (en) 2020-10-10 2020-10-10 Entity labeling method, module and device based on dictionary matching

Country Status (1)

Country Link
CN (1) CN112347765B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant