CN111079436B - Geological named entity extraction method and device - Google Patents


Info

Publication number
CN111079436B
Authority
CN
China
Prior art keywords
character
regular
geological
rule
named entity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911322290.0A
Other languages
Chinese (zh)
Other versions
CN111079436A (en)
Inventor
邓吉秋
路馥毓
刘文毅
李晨菡
何美香
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Central South University
Original Assignee
Central South University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2019-12-20
Filing date
2019-12-20
Publication date
2021-09-21
Application filed by Central South University
Priority to CN201911322290.0A
Publication of CN111079436A
Application granted
Publication of CN111079436B

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to a geological named entity extraction method comprising the following steps: acquiring a target text consisting of a plurality of characters or character strings; obtaining a first regular expression from a preset first rule character, extracting from the target text the first character string that matches it, and replacing the first character string with preset eighth rule characters to obtain a second target text; judging, based on a preset third rule character, whether the second target text contains the third rule character; if so, building a second regular expression from the preset fourth rule character, second rule character, fifth rule character and sixth rule character corresponding to the third rule character, together with the third rule character itself, and using it to obtain a second character string in the second target text; and obtaining the length of the second character string and determining the geological named entity in the target text according to that length and the preset minimum length value corresponding to the third rule character.

Description

Geological named entity extraction method and device
Technical Field
The invention relates to the field of natural language processing, and in particular to a method and a device for extracting geological named entities.
Background
The current state of named entity recognition is as follows: good results are achieved only for limited text types (mainly news corpora) and entity categories (mainly names of people, places and organizations); compared with other information retrieval fields, the evaluation corpora for named entity recognition are small, so overfitting occurs easily; named entity recognition tends to emphasize high recall, whereas in information retrieval high precision is more important; and general systems that identify multiple types of named entities perform poorly.
General named entity extraction methods usually need a large corpus, but when a particular document is analyzed it is difficult to find a sufficiently large and accurately matched background corpus. When rules are applied to extract geological named entities, simple rules generally give poor results, probably because they cannot effectively account for the different levels of Chinese rules, the different ways basic words combine, and similar factors.
Disclosure of Invention
(I) Technical problem to be solved
In order to solve the problems that named entity extraction in the prior art depends on a large corpus and has low extraction precision, the invention provides a method and a device for extracting geological named entities.
(II) Technical solution
To achieve the above object, the invention provides a method for extracting a geological named entity, comprising:
A1, obtaining a target text consisting of a plurality of characters or character strings;
A2, obtaining, based on the target text and a preset first rule character, a first regular expression corresponding to the first rule character, and extracting from the target text the first character string that matches the first regular expression, to obtain a second target text; the second target text is the target text without the first character string;
the first rule character is a word that appears before geological named entities of multiple categories but does not belong to the geological named entity;
A3, judging, based on the second target text and a preset third rule character, whether the second target text contains the third rule character;
wherein the third rule character is an ending word of a geological named entity;
A4, if so, building a second regular expression from the preset fourth rule character, second rule character, fifth rule character and sixth rule character corresponding to the third rule character, together with the third rule character, and using the second regular expression to obtain a second character string in the second target text that matches it;
the second rule character is: a word that precedes the ending word in all categories of geological named entities but does not belong to the geological named entity;
the fourth rule character is: a word at any position before the ending word of the geological named entity that does not belong to the geological named entity;
the fifth rule character is: a word adjacent to the head word of the geological named entity that does not belong to the geological named entity;
the sixth rule character is: a word adjacent to the ending word of the geological named entity that does not belong to the geological named entity;
the seventh rule character is the type code of the geological named entity corresponding to the ending word;
A5, obtaining length information of the second character string, and obtaining the geological named entity in the target text according to the length information and the preset minimum length value corresponding to the third rule character.
Preferably, step A2 includes:
A2-1, obtaining the first regular expression from the preset first rule character;
A2-2, using the first regular expression to extract from the target text the first character string that matches it;
A2-3, replacing the first character string in the target text with a character string of the same length consisting of eighth rule characters, to obtain the second target text;
the eighth rule character is a space.
Preferably, step A5 includes:
A5-1, obtaining the length value of the second character string;
A5-2, judging whether the length value of the second character string satisfies the preset minimum length value of the third rule character corresponding to the second character string;
if it does, obtaining a geological named entity in the target text, having an entity text character string and an entity type code, based on the second character string and the preset seventh rule character of the third rule character corresponding to the second character string;
the entity text character string of the geological named entity is the second character string;
and the type code of the geological named entity is the preset seventh rule character of the third rule character corresponding to the second character string.
Preferably, the second regular expression characters corresponding to the fourth, second, third, fifth and sixth rule characters include: second regular expression characters carrying a first label character and second regular expression characters carrying a second label character;
wherein the first label character indicates that the second regular expression built from the second regular expression characters carrying it takes the first form;
the first form of the second regular expression is the following sequence: fourth rule character, second rule character, fifth rule character, third rule character, sixth rule character;
wherein the second label character indicates that the second regular expression built from the second regular expression characters carrying it takes the second form;
the second form of the second regular expression is a preset form different from the first form.
A geological named entity extraction device stores a first instruction;
the first instruction causes the geological named entity extraction device to perform the geological named entity extraction method described in any one of the above.
(III) Advantageous effects
The beneficial effects of the invention are as follows: geological named entities are extracted with the first regular expression, obtained from the first rule character, and the second regular expression, without a large corpus, so that high-precision geological named entity extraction can be achieved and the dependence on a corpus of geological terminology is reduced or eliminated.
Drawings
FIG. 1 is a flow chart of the geological named entity extraction method of the present invention;
FIG. 2 is a schematic diagram of the geological named entity extraction method corresponding to FIG. 1 in an embodiment of the present invention.
Detailed Description
To better explain the present invention and to facilitate understanding, the invention is described in detail below through specific embodiments with reference to the accompanying drawings.
(1) Rule characters in the present embodiment
In this embodiment, the first rule character is a general forward boundary word, the second rule character is a general prefix boundary word, the third rule character is a tail word, the fourth rule character is a specific forward boundary word, the fifth rule character is a specific prefix boundary word, the sixth rule character is a specific suffix boundary word, the seventh rule character is the type code of the geological named entity corresponding to the tail word, and the eighth rule character is a space.
In this embodiment, the tail word used as the third rule character is an ending word common to geological named entities of the same category. For example, the "set" in "Yuenu mountain set" appears at the end of the geological named entity as a stratigraphic division unit. The basic suffixes (tail words) are defined in Table 1:
TABLE 1 basic suffix definitions
[Table 1 is provided as an image in the original publication]
The entity category is a category of geological named entity common in geological documents; the category code is a user-defined word-segmentation part of speech for that category; and the tail words listed for a category are those used to extract its geological named entities by multiple regular-expression matching.
In this embodiment, the general forward boundary words used as the first rule character are words that frequently occur before geological named entities of multiple categories but are not part of them, such as "explain" in "explain Yuenu mountain group …" and "explain F1 fault …".
In this embodiment, the general prefix boundary word used as the second rule character is a word that precedes any tail word but is not part of the geological named entity, such as "see" in "in-zone see fault …".
The general boundary words are defined in Table 2:
TABLE 2 General boundary word definitions
[Table 2 is provided as an image in the original publication]
The general boundary word definition table has the following characteristics:
A. The character strings with serial numbers 1 to 8 are general forward boundary word combinations, and the character string with serial number 9 is the general prefix boundary word combination.
B. The first character of the strings with serial numbers 1 to 6 is ^, which marks the string as a combination of regular expressions: several strings that each satisfy regular-expression syntax are joined with commas, and ^ is added before the first of them. These strings delimit the boundary of a geological named entity by phrases.
C. The strings with serial numbers 7 to 9 delimit the boundary of a geological named entity by single characters. Strings 7 and 8 satisfy regular-expression syntax directly, and string 9 satisfies it once its first character, $, is removed.
D. The general boundary word definition table is stored as a database table named general_words; all general boundary words are predefined according to the rules shown in Table 2.
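As a rough illustration of the three string formats described in A to D, the following Python sketch classifies one stored string and turns it into a usable regular expression. The function name `normalize_general_word` and the sample strings are assumptions made for this example; the actual contents of Table 2 are not reproduced here, and the parenthesized alternation follows the conversion described for step (2-3-2) of the embodiment.

```python
from typing import Tuple


def normalize_general_word(word: str) -> Tuple[str, str]:
    """Classify one stored string and return (kind, regex_pattern).

    kind is "forward" for a general forward boundary word and
    "prefix" for the general prefix boundary word.
    """
    if word.startswith("$"):
        # Serial number 9: a valid regex once the leading $ is removed.
        return "prefix", word[1:]
    if word.startswith("^"):
        # Serial numbers 1-6: comma-joined phrases behind a leading ^.
        phrases = word[1:].split(",")
        return "forward", "(" + ")|(".join(phrases) + ")"
    # Serial numbers 7-8: already a valid regular expression.
    return "forward", word


# Hypothetical sample strings, not taken from Table 2.
print(normalize_general_word("^explain,according to"))
print(normalize_general_word("$[^see]"))
```

Keeping the marker character in the stored string lets a single table column hold all three kinds of entry, at the cost of a small parsing step like this one.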
In this embodiment, the specific forward boundary words used as the fourth rule character are words that appear anywhere before the tail word of a particular category but do not belong to the geological named entity. For example, the first "group" in "Yunyuan ruyang group Yunmengshan group" is not part of "Yunmengshan group", so it is a specific forward boundary word of "group".
In this embodiment, the specific prefix boundary words used as the fifth rule character are words that appear in a position before the tail word of a particular category but do not belong to a geological named entity of that category. For example, the "county" in "Ji county group" generally appears only in geological named entities ending in "group" and "group", not in other stratigraphic unit entities, so "county" is a specific prefix boundary word for the other stratigraphic unit entities.
In this embodiment, the specific suffix boundary words used as the sixth rule character are words that appear in a position after the tail word of a particular category but do not belong to a geological named entity of that category. For example, the "face" in "fault face with calcite filling" cannot be combined with "fault".
The geological named entity categories and their specific boundary words are defined in Table 3, which has the following characteristics:
A. ID is the number of the geological named entity category, Text is the tail word, and Class is the type code of the tail word.
B. When the first character of Rules is [, Rules is a specific forward boundary word, and the regular expression for extracting the geological named entity has the form: specific forward boundary word + general prefix boundary word + specific prefix boundary word + tail word + specific suffix boundary word. Here the specific forward boundary word is the Rules string in Table 3, the general prefix boundary word is the string with serial number 9 in Table 2, the specific prefix boundary word is the part of the Reserve string in Table 3 before the tail word, and the specific suffix boundary word is the part of the Reserve string after the tail word. For example, the Reserve string of the geological named entity category with ID 102 is "the world of the border of China with province, city, county, bottom, south, west and north (? [ ^ to) facial line ])".
TABLE 3 Geological named entity categories and specific boundary word definitions
[Table 3 is provided as an image in the original publication]
C. When the first character of Rules is $, the remainder of Rules is itself the regular expression for extracting the geological named entity.
D. Mini is the minimum length required of a geological named entity; an extracted character string shorter than Mini is not treated as a geological named entity.
The geological named entity category and specific boundary word definition table is stored as a database table named word_types; all geological named entity categories and specific boundary words are predefined according to the rules shown in Table 3.
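The description of the Reserve field says that the part before the tail word is the specific prefix boundary word and the part after it is the specific suffix boundary word. A minimal sketch of that split is shown below. It assumes the tail word occurs literally in Reserve and splits at its first occurrence, which the source does not specify; `split_reserve` and the sample values are hypothetical.

```python
from typing import Tuple


def split_reserve(reserve: str, tail_word: str) -> Tuple[str, str]:
    """Split a Reserve string into (specific prefix part, specific suffix part)."""
    # Assumption: the first literal occurrence of the tail word is the separator.
    prefix, found, suffix = reserve.partition(tail_word)
    if not found:
        raise ValueError("tail word not present in Reserve string")
    return prefix, suffix


# Hypothetical Reserve value, not taken from Table 3.
print(split_reserve("[^xy]Formation(?!s)", "Formation"))
```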
(2) The steps for extracting geological named entities in the present embodiment are shown in FIG. 1 and FIG. 2.
A1. Referring to FIG. 1 and FIG. 2, a target text consisting of a plurality of characters or character strings is obtained.
For example, in a specific application of this embodiment, step A1 may include the following steps (2-1), (2-1-1), (2-1-2), (2-1-3) and (2-2):
(2-1) Start system initialization and define the text and rule matching function re_text, whose input parameters are text and the regular expression rule and whose output is a list of the strings in text that match rule; the function is implemented in steps (2-1-1) to (2-1-3). Then go to (2-2).
(2-1-1) Obtain the input parameters text and rule, initialize the output parameter re_words to an empty list, and go to (2-1-2).
(2-1-2) Judge whether text contains a character string matching the regular expression rule. If so, obtain each matching character string and its starting position in text; each matching string S and its starting position L form a tuple [S, L], which is appended to re_words; then go to (2-1-3). If no string matches rule, go to (2-1-3).
(2-1-3) Output re_words as the function return value.
(2-2) Obtain the target text, which in this embodiment is the geological text string geo_text, and initialize the geological named entity list entry_list to an empty list.
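Steps (2-1-1) to (2-1-3) map closely onto Python's standard `re` module. The sketch below is one possible reading of re_text, not the patent's own implementation; it assumes 0-based start positions, which the text does not state.

```python
import re
from typing import List


def re_text(text: str, rule: str) -> List[list]:
    """Return [S, L] pairs for every substring of text that matches rule."""
    re_words = []                            # (2-1-1) start with an empty list
    for match in re.finditer(rule, text):    # (2-1-2) scan all matches
        re_words.append([match.group(0), match.start()])
    return re_words                          # (2-1-3) return the list


# Hypothetical input text for illustration.
print(re_text("explain F1 fault, explain F2 fault", "explain"))
```

An empty list is returned when nothing matches, so callers can test the result directly, as steps (2-3-3) and (2-4-4) do.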
A2. Referring to FIG. 1 and FIG. 2, a first regular expression corresponding to the first rule character is obtained based on the target text, the preset first rule character and the preset second rule character, and the first character string matching that regular expression is extracted from the target text to obtain a second target text; the second target text is the target text without the first character string;
in this embodiment, the first rule character is a general forward boundary word, and the second rule character is a general prefix boundary word.
For example, a specific application of step A2 in this embodiment includes the following steps (2-3), (2-3-1), (2-3-2), (2-3-3) and (2-3-4).
(2-3) Initialize the general forward boundary word list pre_words and the general prefix boundary word string prefix_words to empty strings, obtain the first record of the general boundary word definition table general_words, and process it through steps (2-3-1) to (2-3-4), repeating until all records in general_words have been processed.
(2-3-1) Obtain the word field of the current record, assign it to the current general forward boundary word string g_words, and go to (2-3-2).
(2-3-2) Obtain the first character of g_words. If it is [ or (, go to (2-3-3). If it is $, delete the first character from g_words, append g_words to the general prefix boundary word string prefix_words, and go to (2-3-4). If it is ^, delete the leading ^ from g_words, replace each comma in g_words with )|(, insert a left parenthesis before the first character of g_words and a right parenthesis after its last character, and go to (2-3-3).
(2-3-3) Call the text and rule matching function re_text to obtain the output value re_words. If re_words is not an empty list, take its first element as the current element and go to (2-3-3-1); if re_words is an empty list, go to (2-3-4).
(2-3-3-1) Obtain the current element value [S, L], calculate the length len of the string S (the first character string), replace the characters of geo_text from position L (counted from the left) up to position L + len with spaces, and go to (2-3-3-2).
(2-3-3-2) If the current element is not the last element of re_words, read the next element of re_words as the current element and go to (2-3-3-1); if it is the last element of re_words, go to (2-3-4).
(2-3-4) If g_words is not the last record of general_words, read the next record of general_words as the current record and go to (2-3-1); if g_words is the last record of general_words, go to (2-4).
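The key property of steps (2-3-3) to (2-3-3-2) is that each matched general forward boundary word is overwritten with the same number of spaces, so the positions of the remaining text do not shift. A compact sketch of that masking follows, assuming the patterns have already been converted to valid regular expressions; `mask_forward_words` and the sample pattern are hypothetical.

```python
import re
from typing import Iterable


def mask_forward_words(geo_text: str, patterns: Iterable[str]) -> str:
    """Overwrite every match of every pattern with spaces of equal length."""
    for pattern in patterns:
        # The replacement keeps the text length unchanged, so later
        # start positions still refer to the original target text.
        geo_text = re.sub(pattern, lambda m: " " * len(m.group(0)), geo_text)
    return geo_text


# Hypothetical pattern and text for illustration.
masked = mask_forward_words("explain F1 fault", ["(explain)|(describe)"])
print(repr(masked))
```

Masking instead of deleting appears to be a deliberate choice: a space cannot be absorbed into a later entity match, and the start position L recorded for each entity stays valid for the unmodified text.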
A3. Referring to FIG. 1 and FIG. 2, whether the second target text contains the third rule character is judged based on the second target text and the preset third rule character;
wherein the third rule character is a tail word appearing in a geological named entity.
For example, in this embodiment, step A3 may include the following steps (2-4), (2-4-1) and (2-4-2):
(2-4) Obtain the geological named entity category and specific boundary word definition table word_types, take its first record as the current record w_type, and go to (2-4-1).
(2-4-1) Obtain the field values of the current record w_type, as shown in Table 3, assign them to the strings ID, text, class, rule, reserve and mini respectively, and go to (2-4-2).
(2-4-2) Judge whether the current geo_text, which at this point is the second target text, contains the tail word text. If it does, go to (2-4-3); if not, go to (2-4-5).
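Step A3 acts as a cheap pre-filter: a category's regular expression is only built and run if its tail word occurs in the masked text. The sketch below models one word_types record and that containment check. The field names are adapted from Table 3 (`class` is a Python keyword, so `type_code` is used instead), and every record value is invented for illustration.

```python
from typing import List, NamedTuple


class WordType(NamedTuple):
    type_id: int     # ID
    text: str        # tail word
    type_code: str   # Class
    rule: str        # Rules
    reserve: str     # Reserve
    mini: int        # Mini


def types_present(geo_text: str, word_types: List[WordType]) -> List[WordType]:
    """Keep only the records whose tail word occurs in the masked text."""
    return [w for w in word_types if w.text in geo_text]


# Hypothetical records, not taken from Table 3.
word_types = [
    WordType(1, "Formation", "STRAT", r"$[A-Z][a-z]+ Formation", "", 11),
    WordType(2, "fault", "FAULT", r"$F\d+ fault", "", 4),
]
print([w.type_id for w in types_present("        F1 fault zone", word_types)])
```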
A4. Referring to FIG. 1 and FIG. 2, in this embodiment, when the second target text contains the third rule character, a second regular expression is built from the preset fourth rule character, second rule character, fifth rule character and sixth rule character corresponding to the third rule character, together with the third rule character, and is used to obtain the second character string in the second target text that matches it.
In this embodiment, the fourth rule character is a specific forward boundary word, the fifth rule character is a specific prefix boundary word, the sixth rule character is a specific suffix boundary word, the seventh rule character is the type code of the geological named entity corresponding to the tail word, and the eighth rule character is a space.
For example, in a specific application of this embodiment, step A4 may include the following steps (2-4-3) and (2-4-4):
(2-4-3) Initialize the geological named entity extraction regular expression entry_rule to an empty string and obtain the first character of rule. If the first character of rule is $, delete it and assign the rest of rule to entry_rule; otherwise, append the strings rule, prefix_words and reserve to entry_rule in that order. Go to (2-4-4).
(2-4-4) Call the text and rule matching function re_text to obtain the output value re_words. If re_words is not an empty list, take its first element as the current element and go to (2-4-4-1); if re_words is an empty list, go to (2-4-5).
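Step (2-4-3) reduces to a two-way branch on the first character of rule. The sketch below shows that branch. The fragment values passed in are invented English stand-ins chosen only so the result is easy to read; they do not come from Tables 2 or 3, and `build_entry_rule` is a hypothetical name.

```python
import re


def build_entry_rule(rule: str, prefix_words: str, reserve: str) -> str:
    """Assemble the extraction regular expression for one category."""
    if rule.startswith("$"):
        # Second form: rule already holds the complete regular expression.
        return rule[1:]
    # First form: specific forward part + general prefix part + Reserve
    # (Reserve carries the specific prefix part, tail word and suffix part).
    return rule + prefix_words + reserve


# Hypothetical fragments for illustration only.
full = build_entry_rule(r"$F\d+ fault", "", "")
composed = build_entry_rule("[A-Z]", "[a-z]*", " Formation")
match = re.search(composed, "the Yunmeng Formation is exposed")
print(full, composed, match.group(0))
```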
A5. Length information of the second character string is obtained, and the geological named entity in the target text is obtained according to the length information and the preset minimum length value corresponding to the third rule character.
For example, in a specific application of this embodiment, step A5 may include the following steps:
(2-4-4-1) Obtain the current element value [S, L] and calculate the length len of the string S (the second character string). If len is greater than or equal to mini, go to (2-4-4-2); if len is less than mini, go to (2-4-4-3).
(2-4-4-2) Append class to the end of the current element, add the resulting element [S, L, class] to entry_list, and go to (2-4-4-3).
(2-4-4-3) If the current element is not the last element of re_words, read the next element of re_words as the current element and go to (2-4-4-1); if it is the last element of re_words, go to (2-4-5).
(2-4-5) If w_type is not the last record of word_types, read the next record as the current record w_type and go to (2-4-1); if w_type is the last record of word_types, go to (2-5).
(2-5) Output the geological named entity list entry_list.
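Steps (2-4-4) to (2-5) can be summarized as a nested loop over categories and matches with a minimum-length filter. The following sketch assumes each category has already been reduced to a tuple of (entry_rule, type code, mini); the function name, rules and sample text are hypothetical and serve only to show the filtering.

```python
import re
from typing import List, Tuple


def extract_entities(geo_text: str,
                     type_rules: List[Tuple[str, str, int]]) -> List[list]:
    """Return [S, L, class] for every match that meets its minimum length."""
    entry_list = []
    for entry_rule, type_code, mini in type_rules:
        for match in re.finditer(entry_rule, geo_text):
            s = match.group(0)
            if len(s) >= mini:                 # (2-4-4-1) length check
                entry_list.append([s, match.start(), type_code])  # (2-4-4-2)
    return entry_list                          # (2-5)


# Hypothetical masked text and rules for illustration.
text = " " * 8 + "F1 fault cuts the Yunmeng Formation"
rules = [(r"F\d+ fault", "FAULT", 4), (r"[A-Z][a-z]+ Formation", "STRAT", 20)]
print(extract_entities(text, rules))
```

In this example the second candidate is discarded because its 17 characters fall below the minimum of 20 set for that category.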
With this method, geological named entities can be extracted with high precision without a large corpus, and the dependence on a corpus of geological terminology is reduced or avoided.
The technical principles of the present invention have been described above in connection with specific embodiments. These descriptions are intended only to explain the principles of the invention and should not be construed as limiting its scope in any way. Based on the explanations herein, those skilled in the art can conceive of other embodiments of the invention without inventive effort, and such embodiments fall within the scope of the invention.

Claims (3)

1. A method for extracting a geological named entity, characterized by comprising the following steps:
A1, obtaining a target text consisting of a plurality of characters or character strings;
A2, obtaining, based on the target text and a preset first rule character, a first regular expression corresponding to the first rule character, and extracting from the target text the first character string that matches the first regular expression corresponding to the first rule character, to obtain a second target text; the second target text is the target text without the first character string;
the first rule character is a word that appears before geological named entities of multiple categories but does not belong to the geological named entity;
A3, judging, based on the second target text and a preset third rule character, whether the second target text contains the third rule character;
wherein the third rule character is an ending word of a geological named entity;
A4, if so, building a second regular expression from the preset fourth rule character, second rule character, fifth rule character, sixth rule character and seventh rule character corresponding to the third rule character, together with the third rule character, and using the second regular expression to obtain a second character string in the second target text that matches it;
the second rule character is: a word that precedes the ending word in all categories of geological named entities but does not belong to the geological named entity;
the fourth rule character is: a word at any position before the ending word of the geological named entity that does not belong to the geological named entity;
the fifth rule character is: a word adjacent to the head word of the geological named entity that does not belong to the geological named entity;
the sixth rule character is: a word adjacent to the ending word of the geological named entity that does not belong to the geological named entity;
the seventh rule character is the type code of the geological named entity corresponding to the ending word;
A5, obtaining length information of the second character string, and obtaining the geological named entity in the target text according to the length information and the preset minimum length value corresponding to the third rule character;
the geological named entity in the target text has an entity text character string and an entity type code;
step A2 includes:
A2-1, obtaining the first regular expression from the preset first rule character;
A2-2, using the first regular expression to extract from the target text the first character string that matches the first regular expression corresponding to the first rule character;
A2-3, replacing the first character string in the target text with a character string of the same length consisting of eighth rule characters, to obtain the second target text;
the eighth rule character is a space;
step A5 includes:
A5-1, obtaining the length value of the second character string;
A5-2, judging whether the length value of the second character string satisfies the preset minimum length value of the third rule character corresponding to the second character string;
if it does, obtaining a geological named entity in the target text, having an entity text character string and an entity type code, based on the second character string and the preset seventh rule character of the third rule character corresponding to the second character string;
the entity text character string of the geological named entity is the second character string;
and the type code of the geological named entity is the preset seventh rule character of the third rule character corresponding to the second character string.
2. The method of claim 1, wherein the second regular expression characters corresponding to the fourth, second, third, fifth, sixth and seventh rule characters include: second regular expression characters carrying a first label character and second regular expression characters carrying a second label character;
wherein the first label character indicates that the second regular expression built from the second regular expression characters carrying it takes the first form;
the first form of the second regular expression is the following sequence: fourth rule character, second rule character, fifth rule character, third rule character, sixth rule character and seventh rule character;
wherein the second label character indicates that the second regular expression built from the second regular expression characters carrying it takes the second form;
the second form of the second regular expression is a preset form different from the first form.
3. A geological named entity extraction device, wherein the geological named entity extraction device stores a first instruction;
the first instruction causes the geological named entity extraction device to perform the geological named entity extraction method of any one of claims 1 to 2.
CN201911322290.0A 2019-12-20 2019-12-20 Geological named entity extraction method and device Active CN111079436B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911322290.0A CN111079436B (en) 2019-12-20 2019-12-20 Geological named entity extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911322290.0A CN111079436B (en) 2019-12-20 2019-12-20 Geological named entity extraction method and device

Publications (2)

Publication Number Publication Date
CN111079436A (en) 2020-04-28
CN111079436B (en) 2021-09-21

Family

ID=70316038

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911322290.0A Active CN111079436B (en) 2019-12-20 2019-12-20 Geological named entity extraction method and device

Country Status (1)

Country Link
CN (1) CN111079436B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112241558A (en) * 2020-09-03 2021-01-19 深圳市华阳国际工程设计股份有限公司 Element type name unifying method and device and computer storage medium
CN112699637B (en) * 2021-01-08 2024-04-12 中南大学 Paragraph type recognition method and system and document structure recognition method and system
CN115438198B (en) * 2022-11-07 2023-03-31 四川大学 Interpretable medical data structuring method and system based on knowledge base

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8321434B1 (en) * 2006-08-15 2012-11-27 Trend Micro Incorporated Two tiered architecture of named entity recognition engine
CN107133220B (en) * 2017-06-07 2020-11-24 东南大学 Geographic science field named entity identification method
CN109740159B (en) * 2018-12-29 2022-04-26 北京泰迪熊移动科技有限公司 Processing method and device for named entity recognition
CN109858040B (en) * 2019-03-05 2021-05-07 腾讯科技(深圳)有限公司 Named entity identification method and device and computer equipment

Also Published As

Publication number Publication date
CN111079436A (en) 2020-04-28

Similar Documents

Publication Publication Date Title
CN111079436B (en) Geological named entity extraction method and device
CN107729314B (en) Chinese time identification method and device, storage medium and program product
CN108763591B (en) Webpage text extraction method and device, computer device and computer readable storage medium
AU2010208523B2 (en) Methods and systems for matching records and normalizing names
US20100257440A1 (en) High precision web extraction using site knowledge
US8732116B1 (en) Harvesting relational tables from lists on the web
CN110609998A (en) Data extraction method of electronic document information, electronic equipment and storage medium
US9558295B2 (en) System for data extraction and processing
CN106776495B (en) Document logic structure reconstruction method
CN106502991B (en) Publication treating method and apparatus
CN110008473B (en) Medical text named entity identification and labeling method based on iteration method
CN107967250A (en) A kind of information processing method and device
CN110837568A (en) Entity alignment method and device, electronic equipment and storage medium
Rehman et al. Morpheme matching based text tokenization for a scarce resourced language
CN111984845A (en) Website wrongly-written character recognition method and system
CN105573981A (en) Method and device for extracting Chinese names of people and places
Lampert et al. Segmenting email message text into zones
US6470362B1 (en) Extracting ordered list of words from documents comprising text and code fragments, without interpreting the code fragments
CN111160445A (en) Bid document similarity calculation method and device
CN110705285A (en) Government affair text subject word bank construction method, device, server and readable storage medium
JP2016018279A (en) Document file search program, document file search device, document file search method, document information output program, document information output device, and document information output method
CN112699636A (en) Multi-source Markdown geological data text format standardization method and system
CN111090997B (en) Geological document feature lexical item ordering method and device based on hierarchical lexical items
CN112836477B (en) Method and device for generating code annotation document, electronic equipment and storage medium
US20150019208A1 (en) Method for identifying a set of sentences in a digital document, method for generating a digital document, and associated device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB03 Change of inventor or designer information

Inventor after: Deng Jiqiu, Lu Biyu, Liu Wenyi, Li Chenhan, He Meixiang

Inventor before: Deng Jiqiu, Lu Biyu, Li Chenhan

GR01 Patent grant