CN110399501B - Geological field literature map generation method based on language statistical model - Google Patents

Geological field literature map generation method based on language statistical model

Publication number
CN110399501B
Authority
CN
China
Prior art keywords
relation
probability
triples
word
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910705465.XA
Other languages
Chinese (zh)
Other versions
CN110399501A (en)
Inventor
付立军
吕鹏飞
贺金龙
安梦飞
岳正飞
吴恩廷
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhongke Zhihe Digital Technology Beijing Co ltd
Original Assignee
Beijing Zhongkeruitong Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Zhongkeruitong Information Technology Co ltd
Priority to CN201910705465.XA
Publication of CN110399501A
Application granted
Publication of CN110399501B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology

Abstract

The invention relates to the technical field of knowledge maps (knowledge graphs) and provides a geological field literature map generation method based on a language statistical model. The method aims to discover associations among similar information (here, information in the geological gold mine field); the novelty of the constructed map is that it captures knowledge of the gold mine field and is built from the literature. The main scheme is as follows. Each sentence is segmented into words; part-of-speech filtering leaves main sentences containing only nouns, verbs and prepositions; verb filtering then removes verbs that take a human subject and verbs that take no object. Relation binary groups are extracted from the result, and a probability p is calculated for each binary group. Binary groups that share the same relation word are spliced into relation triples, and the probability P of each triple is calculated from the binary group probabilities p. The relation triples are filtered by relation word and sorted by probability to obtain a relation triple list, from which the map is generated.

Description

Geological field literature map generation method based on language statistical model
Technical Field
The invention relates to the technical field of knowledge maps (knowledge graphs) and provides a geological field literature map generation method based on a language statistical model.
Background
As disciplines become increasingly specialized, the research scope of professional literature narrows and communication between specialties becomes more difficult. Because professional documents are so highly differentiated, valuable associations among them are increasingly buried under the mass of information within each specialty and are hard to discover. Yet implicit associations between professional documents are often of great significance for scientific discovery and for guiding practical applications. Geological knowledge discovery based on geological literature can therefore drive further development of the field.
The relations in the corpus papers are numerous and complicated, so an unsupervised method is preferred in order to extract as many of them as possible. Supervised learning requires basic relation categories and fixed rules to be given in advance; the number of relation triples in the papers is extremely large, and such rules may not fully cover the characteristics of the relations, so supervised learning is not adopted. Semi-supervised learning is also unsuitable: it still requires the relations to be constrained, merely without given rules, so relations would still be missed during extraction.
To address these problems, relation recognition and extraction can be completed with an n-gram statistical language model, and a map is then generated on the page from the extracted relations.
Disclosure of Invention
The invention aims to discover associations among similar information (information in the geological gold mine field). The novelty of the constructed map is that it captures knowledge of the gold mine field and is built from the literature.
To solve the above technical problems, the invention adopts the following technical scheme:
A geological field literature map generation method based on a language statistical model, comprising the following steps:
Step 1: performing a word segmentation operation on each sentence to obtain a segmented sentence;
Step 2: performing a part-of-speech filtering operation on the segmented sentence to obtain a main sentence containing only nouns, verbs and prepositions;
Step 3: performing a verb filtering operation on the main sentence to obtain a sentence free of verbs that take a human subject and verbs that take no object;
Step 4: performing a relation binary group extraction operation on the result of the previous step and calculating probabilities, wherein the relation binary groups take four formats, namely n1-v, n1-p, v-n2 and p-n2, and each relation binary group has a probability p; the probability p is formed from the probability that the entity word occurs in the corpus and the probability that the relation word (a preposition or verb) occurs given that the current entity word occurs;
wherein n1 denotes entity 1, n2 denotes entity 2, v denotes a verb, and p denotes a preposition;
Step 5: splicing relation binary groups that share the same relation word, and calculating the probability of each resulting relation triple, wherein the two relation binary groups forming a relation triple have probability values P1 and P2 respectively, and the probability of the relation triple is P = P1 × P2 / (the probability that the relation word occurs in the corpus);
Step 6: performing relation word filtering and probability sorting operations on the relation triples, deleting the relation triples whose relation words are in the stop word list, and sorting the remaining relation triples by relation triple probability P from high to low to obtain a relation triple list;
Step 7: performing a map generation operation on the relation triples, taking the entity words as central nodes and connecting extension nodes with the relations as edges to obtain the map.
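The probability calculation in steps 4 and 5 can be illustrated with a minimal Python sketch. It rests on several assumptions that are not stated in the patent: sentences are already segmented, tagged and filtered into (word, tag) tuples with the tags "n", "v" and "p"; the binary group probability p is read as the product P(entity) × P(relation word | entity); and all probabilities are estimated from raw counts. The function and variable names are hypothetical.

```python
from collections import Counter

REL_TAGS = {"v", "p"}  # verbs and prepositions act as relation words


def score_triples(corpus):
    """Score every n1-(v/p)-n2 triple found in a tagged corpus.

    corpus: list of sentences, each a list of (word, tag) tuples.
    Returns {(n1, rel, n2): P} with P = P1 * P2 / P(rel).
    """
    word_count = Counter()    # occurrences of each word
    left_count = Counter()    # (n1, rel): entity precedes the relation word
    right_count = Counter()   # (rel, n2): entity follows the relation word
    candidates = set()

    for sentence in corpus:
        word_count.update(word for word, _ in sentence)
        for j, (rel, tag) in enumerate(sentence):
            if tag not in REL_TAGS:
                continue
            # Adjacency is not required: any noun before or after counts.
            lefts = [w for w, t in sentence[:j] if t == "n"]
            rights = [w for w, t in sentence[j + 1:] if t == "n"]
            left_count.update((n1, rel) for n1 in lefts)
            right_count.update((rel, n2) for n2 in rights)
            candidates.update((n1, rel, n2) for n1 in lefts for n2 in rights)

    total = sum(word_count.values())

    def pair_probability(entity, pair_occurrences):
        # Assumed reading of the text: p = P(entity) * P(rel | entity).
        p_entity = word_count[entity] / total
        p_rel_given_entity = pair_occurrences / word_count[entity]
        return p_entity * p_rel_given_entity

    scores = {}
    for n1, rel, n2 in candidates:
        p1 = pair_probability(n1, left_count[(n1, rel)])
        p2 = pair_probability(n2, right_count[(rel, n2)])
        p_rel = word_count[rel] / total
        scores[(n1, rel, n2)] = p1 * p2 / p_rel
    return scores


# Toy corpus with a single tagged sentence (illustrative data only).
corpus = [[("gold", "n"), ("occurs", "v"), ("in", "p"), ("quartz", "n")]]
scores = score_triples(corpus)
```

In this toy corpus each word has probability 1/4, so both candidate triples, gold-occurs-quartz and gold-in-quartz, receive P = 0.25 × 0.25 / 0.25 = 0.25. A real corpus would separate the candidates, since frequent relation words are penalized by the division.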
In the above technical scheme, in step 6 the candidate set of relation triples is filtered by the following rules:
for an extracted n1-(v/p)-n2 structure, if the distance between n1 and n2 exceeds 5, the relation is considered weak and is discarded; for an extracted n1-(v/p)-n2 result, if n2 is followed by a verb, the extracted relation is considered incomplete and is discarded.
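The two filtering rules can be sketched as a single predicate. This is a minimal illustration under stated assumptions: the distance is measured in token positions within the filtered sentence (the patent does not name the unit), tags are "n", "v" and "p", and the function name is hypothetical.

```python
MAX_DISTANCE = 5  # threshold given in the text; the unit (tokens) is assumed


def passes_rules(sentence, i_n1, i_n2):
    """Return True if the triple spanning positions i_n1..i_n2 survives.

    sentence: list of (word, tag) tuples; i_n1 and i_n2 index n1 and n2.
    """
    # Rule 1: entities too far apart indicate a weak relation.
    if i_n2 - i_n1 > MAX_DISTANCE:
        return False
    # Rule 2: a verb right after n2 means the relation was cut short.
    following = i_n2 + 1
    if following < len(sentence) and sentence[following][1] == "v":
        return False
    return True


# The incomplete "I-to-him" case: a verb follows n2.
incomplete = [("I", "n"), ("to", "p"), ("him", "n"), ("leave", "v")]
complete = [("gold", "n"), ("in", "p"), ("quartz", "n")]
```

Here `passes_rules(incomplete, 0, 2)` rejects the triple because "leave" follows "him", while `passes_rules(complete, 0, 2)` accepts it.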
In the above technical scheme, an extension node in step 7 can serve as a new central node, to which new extension nodes are connected with relations as edges.
Owing to the above technical scheme, the invention has the following beneficial effects:
1. Geological gold mine documents are often numerous; through the map, researchers can grasp the current research hotspots and lines of development of the gold mine field within a short time, saving the time needed to read many papers.
2. The constructed map presents knowledge in a concentrated form and reflects the associations among pieces of knowledge; it is convenient to present externally and makes the information clear at a glance.
Drawings
FIG. 1 is a flow diagram of the relation triple extraction process;
FIG. 2 is an association relation map;
FIG. 3 is an association relation map;
FIG. 4 is a user literature map.
Detailed Description
A bigram language model with part-of-speech and word-frequency statistics is introduced for relation extraction, and the documents are associated with entity word lists from different fields. Word segmentation is performed sentence by sentence; noun entities, verbs and prepositions are retained; the relation words are then filtered to remove verbs that take no object (such as "run") and verbs that take a human subject (such as "see"). The filtered triples are scored with a joint probability model, combined with filtering rules: for an n1-(v/p)-n2 structure, the distance between n1 and n2 must not exceed 5, otherwise the relation is considered weak and is discarded; and if n2 is followed by a verb, the extracted relation is considered incomplete and is discarded. For example, from "I said to him, take a holiday" the relation triple "I-to-him" would be extracted, and this relation is incomplete.
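The part-of-speech and relation word filtering described above can be sketched as follows. This is a minimal illustration, assuming the sentence has already been segmented and tagged by some external tool into (word, tag) tuples; the tag names and the small excluded-verb set are hypothetical stand-ins for whatever tag set and verb lists an actual system would use.

```python
KEEP_TAGS = {"n", "v", "p"}  # nouns, verbs, prepositions

# Hypothetical examples: a verb taking no object and a human-subject verb.
EXCLUDED_VERBS = {"run", "see"}


def filter_sentence(tagged):
    """Keep nouns, verbs and prepositions, then drop excluded verbs."""
    kept = [(word, tag) for word, tag in tagged if tag in KEEP_TAGS]
    return [
        (word, tag)
        for word, tag in kept
        if not (tag == "v" and word in EXCLUDED_VERBS)
    ]


tagged = [("gold", "n"), ("rich", "a"), ("see", "v"), ("in", "p"), ("vein", "n")]
filtered = filter_sentence(tagged)
```

The adjective "rich" is removed by the part-of-speech filter and "see" by the verb filter, leaving only the tokens from which relation triples are built.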
(1) Word segmentation. Each sentence is segmented, and only nouns, verbs and prepositions are retained;
(2) Relation word filtering. Verbs that take no object (e.g., "run") and verbs that take a human subject (e.g., "see") are removed;
(3) Obtaining the set of possible relation triples: all triples with the n-v/p-n structure in the sentence, whether or not the words are adjacent;
(4) Calculating the relation triple probability P of every triple obtained, as the score of the triple (using a bigram model): the two relation binary groups forming the relation triple have probability values P1 and P2 respectively, and the relation word shared by the two binary groups has a probability value p, so the final relation triple probability is P = P1 × P2 / p. Obtaining the candidate set of relation triples: the n-v/p-n triple with the highest score is taken as the candidate relation triple;
(5) Determining the relation triples: the candidate set of relation triples is filtered by rules to obtain the relation triples. At present two rules are mainly used: for an extracted n1-(v/p)-n2 structure, if the distance between n1 and n2 exceeds 5, the relation is considered weak and is discarded; for an extracted n1-(v/p)-n2 result, if n2 is followed by a verb, the extracted relation is considered incomplete and is discarded. For example, from "I said to him, take a holiday tomorrow" the relation triple "I-to-him" would be extracted, and this relation is incomplete;
(6) Calculating the confidence of the relation triples: a scoring function is added to calculate the confidence of each extracted relation triple, namely its joint probability. After the calculation, the relation triples constructed from the same entity (the entity in the entity-relation word binary group) are sorted by confidence, and the top 20 percent are used for map construction. The scoring function is the frequency with which the relation triple appears in the corpus (all documents provided by the quality library). The node order is the randomly generated order of the selected relation triples, because the map is displayed around the entity. The specific extraction flow is shown in FIG. 1.
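The selection in item (6) can be sketched as a per-entity ranking. This is a minimal illustration under stated assumptions: triples are grouped by their first entity n1, the 20 percent cut is rounded up, and at least one triple is kept per entity; none of these details is specified in the text, and the function name is hypothetical.

```python
import math
from collections import defaultdict


def top_fraction_per_entity(scored, fraction=0.2):
    """Keep the highest-scoring share of triples for each entity.

    scored: {(n1, rel, n2): confidence}.
    Returns a list of ((n1, rel, n2), confidence) pairs.
    """
    by_entity = defaultdict(list)
    for triple, confidence in scored.items():
        by_entity[triple[0]].append((triple, confidence))

    selected = []
    for items in by_entity.values():
        items.sort(key=lambda item: item[1], reverse=True)
        # Assumed rounding: round up and never drop an entity entirely.
        keep = max(1, math.ceil(len(items) * fraction))
        selected.extend(items[:keep])
    return selected


# Illustrative scores only; not results from the patent.
scored = {
    ("gold", "in", "quartz"): 0.5,
    ("gold", "with", "pyrite"): 0.4,
    ("gold", "near", "fault"): 0.3,
    ("gold", "from", "vein"): 0.2,
    ("gold", "under", "cover"): 0.1,
}
selected = top_fraction_per_entity(scored)
```

With five triples for the entity "gold", the 20 percent cut keeps one triple, the highest-scoring gold-in-quartz.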
Association relation map
The user searches for an entity word, and a basic map centered on that word is displayed in the page. The user can click any category that allows expansion to view nodes by category, and can click a node to extend the map (if the node can be expanded, it also becomes a central node); the map can keep extending as long as a node can be expanded. The edge connecting two nodes is the relation between the two entities, which can be viewed by hovering the mouse over the connecting line. Document recommendation gives the source documents of the relation triple formed by the expanded node and the original central node, taking the top 10 documents by relevance score. If no relation is found between the two searched words, no map can be generated.
(1) The categories shown at the top are all the classifications: the biology category comprises plant vocabulary, fungus vocabulary and comprehensive geological paleontology vocabulary (obtained from a geological vocabulary and not explicitly classified); the geology category comprises geological entities, spatio-temporal regions, research methods and geochemistry. Within the classifications, expandable nodes are shown as circles and all other nodes as squares;
(2) The map can continue to expand.
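The expansion behavior described above can be sketched with a plain adjacency structure. This is a minimal illustration, assuming edges are directed from n1 to n2 and that a node is expandable when it has at least one outgoing relation; both are assumptions, and all function names are hypothetical.

```python
from collections import defaultdict


def build_map(triples):
    """Index relation triples by their first entity: n1 -> [(rel, n2), ...]."""
    graph = defaultdict(list)
    for n1, rel, n2 in triples:
        graph[n1].append((rel, n2))
    return graph


def expand(graph, center):
    """Return the (relation, neighbor) edges shown around a central node."""
    return list(graph.get(center, []))


def node_shape(graph, node):
    """Expandable nodes are drawn as circles, all others as squares."""
    return "circle" if graph.get(node) else "square"


# Illustrative triples only.
triples = [("gold", "in", "quartz"), ("quartz", "contains", "pyrite")]
literature_map = build_map(triples)
```

Searching "gold" shows the edge to "quartz"; because "quartz" has its own outgoing relation it is drawn as a circle and can become the next central node, whereas "pyrite" is a square and ends the expansion.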
After entering a word and clicking, the user can filter by the selection types provided on the page and, from the map, find related relations, the information surrounding them, current relations, subsequent relations, multi-category relations, recommended documents and future applications, as shown in FIG. 3.
User literature map
The user can build a map from one or more of their own geological documents and view the relation network within them. From the generated map, the user can quickly learn the main content of a document and the entity words common to several documents, and thereby explore the research directions of the documents. This is the purpose of the map. Nodes are colored yellow and blue to make the map easier to read. The generated literature map is shown in FIG. 4.

Claims (3)

1. A geological field literature map generation method based on a language statistical model, characterized by comprising the following steps:
Step 1: performing a word segmentation operation on each sentence to obtain a segmented sentence;
Step 2: performing a part-of-speech filtering operation on the segmented sentence to obtain a main sentence in which only nouns, verbs and prepositions remain;
Step 3: performing a verb filtering operation on the main sentence to obtain a sentence free of verbs that take a human subject and verbs that take no object;
Step 4: performing a relation binary group extraction operation on the result of the previous step and calculating probabilities, wherein the relation binary groups take four formats, namely n1-v, n1-p, v-n2 and p-n2, and each relation binary group has a probability p; the probability p is formed from the probability that the entity word occurs in the corpus and the probability that the relation word occurs given that the current entity word occurs, wherein n1 denotes entity word 1, n2 denotes entity word 2, v denotes a verb, and p denotes a preposition;
Step 5: splicing relation binary groups that share the same relation word, and calculating the probability of each resulting relation triple, wherein the two relation binary groups forming a relation triple have probability values P1 and P2 respectively, and the probability of the relation triple is P = P1 × P2 / (the probability that the relation word occurs in the corpus);
Step 6: performing relation word filtering and probability sorting operations on the relation triples, deleting the relation triples whose relation words are in the stop word list, and sorting the set of relation triples by relation triple probability P from high to low to obtain a relation triple list;
Step 7: performing a map generation operation on the relation triples, taking the entity words as central nodes and connecting extension nodes with the relations as edges to obtain the map.
2. The method according to claim 1, wherein in step 6 the candidate set of relation triples is filtered by the following rules:
for an extracted n1-(v/p)-n2 structure, if the distance between n1 and n2 exceeds 5, the relation is considered weak and is discarded; for an extracted n1-(v/p)-n2 result, if n2 is followed by a verb, the extracted relation is considered incomplete and is discarded.
3. The method according to claim 1, wherein an extension node in step 7 can serve as a new central node, to which new extension nodes are connected with relations as edges.
CN201910705465.XA 2019-07-31 2019-07-31 Geological field literature map generation method based on language statistical model Active CN110399501B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910705465.XA CN110399501B (en) 2019-07-31 2019-07-31 Geological field literature map generation method based on language statistical model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910705465.XA CN110399501B (en) 2019-07-31 2019-07-31 Geological field literature map generation method based on language statistical model

Publications (2)

Publication Number Publication Date
CN110399501A (en) 2019-11-01
CN110399501B (en) 2022-08-19

Family

ID=68327009

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910705465.XA Active CN110399501B (en) 2019-07-31 2019-07-31 Geological field literature map generation method based on language statistical model

Country Status (1)

Country Link
CN (1) CN110399501B (en)

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6101515A (en) * 1996-05-31 2000-08-08 Oracle Corporation Learning system for classification of terminology
CN105528349B (en) * 2014-09-29 2019-02-01 华为技术有限公司 The method and apparatus that question sentence parses in knowledge base
CN108121829B (en) * 2018-01-12 2022-05-24 扬州大学 Software defect-oriented domain knowledge graph automatic construction method

Also Published As

Publication number Publication date
CN110399501A (en) 2019-11-01

Similar Documents

Publication Publication Date Title
US10678816B2 (en) Single-entity-single-relation question answering systems, and methods
CN106776711B (en) Chinese medical knowledge map construction method based on deep learning
CN109933785B (en) Method, apparatus, device and medium for entity association
US11301637B2 (en) Methods, devices, and systems for constructing intelligent knowledge base
US9971974B2 (en) Methods and systems for knowledge discovery
CN111950285B (en) Medical knowledge graph intelligent automatic construction system and method with multi-mode data fusion
US8892420B2 (en) Text segmentation with multiple granularity levels
CN111737496A (en) Power equipment fault knowledge map construction method
CN110532328B (en) Text concept graph construction method
CN111444330A (en) Method, device and equipment for extracting short text keywords and storage medium
US20200073890A1 (en) Intelligent search platforms
US20220114340A1 (en) System and method for an automatic search and comparison tool
WO2021190662A1 (en) Medical text sorting method and apparatus, electronic device, and storage medium
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN113792123B (en) Data-driven domain knowledge graph construction method and system
US11227183B1 (en) Section segmentation based information retrieval with entity expansion
CN113157859A (en) Event detection method based on upper concept information
CN112989813A (en) Scientific and technological resource relation extraction method and device based on pre-training language model
CN114266256A (en) Method and system for extracting new words in field
CN115713072A (en) Relation category inference system and method based on prompt learning and context awareness
CN116501875A (en) Document processing method and system based on natural language and knowledge graph
CN109614493B (en) Text abbreviation recognition method and system based on supervision word vector
CN112307364B (en) Character representation-oriented news text place extraction method
CN114239828A (en) Supply chain affair map construction method based on causal relationship
CN117057346A (en) Domain keyword extraction method based on weighted textRank and K-means

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20231117

Address after: 101499 Room 101, building 1, 3 Xingfu West Street, Beifang Town, Huairou District, Beijing

Patentee after: Zhongke Zhihe Digital Technology (Beijing) Co.,Ltd.

Address before: No. 14, South Academy of Sciences Road, Zhongguancun, Haidian District, Beijing 100086

Patentee before: Beijing zhongkeruitong Information Technology Co.,Ltd.