CN110399501B

CN110399501B - Geological field literature map generation method based on language statistical model

Info

Publication number: CN110399501B
Application number: CN201910705465.XA
Authority: CN
Inventors: 付立军; 吕鹏飞; 贺金龙; 安梦飞; 岳正飞; 吴恩廷
Original assignee: Beijing Zhongkeruitong Information Technology Co ltd
Current assignee: Zhongke Zhihe Digital Technology Beijing Co ltd
Priority date: 2019-07-31
Filing date: 2019-07-31
Publication date: 2022-08-19
Anticipated expiration: 2039-07-31
Also published as: CN110399501A

Abstract

The invention relates to the technical field of knowledge graphs and provides a method for generating a geological-domain literature graph based on a statistical language model. The method aims to discover associations between pieces of information of the same kind (here, information in the geological gold mine field); the novelty of the constructed graph is that it captures knowledge of the gold mine field and is built from the literature. The main scheme comprises the following steps: word segmentation is performed on each sentence; part-of-speech filtering is performed to obtain main sentences containing only nouns, verbs and prepositions; and verb filtering is performed on the main sentences to remove intransitive verbs and verbs that take a personal subject. Relation binary groups are then extracted from the result of the previous step, and a probability p is calculated for each relation binary group. Binary groups that share the same relation word are spliced into relation triples, and the probability P of each relation triple is calculated from the binary-group probabilities p. Finally, relation-word filtering and probability sorting are applied to the relation triples to obtain a relation triple list, from which the graph is generated.

Description

Geological field literature map generation method based on language statistical model

Technical Field

The invention relates to the technical field of knowledge graphs and provides a method for generating a geological-domain literature graph based on a statistical language model.

Background

As disciplines become increasingly specialized, the research scope of each body of professional literature narrows, and communication between specialties becomes more difficult. Because professional literature is so highly differentiated, valuable associations between documents are increasingly buried under the mass of information within each specialty and are hard to discover. Yet implicit associations between professional documents are often of great significance for scientific discovery and for guiding practical applications. Geological knowledge discovery based on geological literature will therefore drive further development in this field.

The relationships in the corpus papers are numerous and complicated, and in order to extract as many of them as possible an unsupervised method is preferred. Supervised learning requires basic relationship categories and fixed rules to be given in advance; because the number of relation triples in the papers is extremely large, such predefined rules may fail to cover the characteristics of all relationships, so supervised learning is not adopted. Semi-supervised learning is also unsuitable, because it still requires the relationships to be constrained even though no rules are given, and extraction omissions would still result.

In view of these problems, relation recognition and extraction can be carried out with an n-gram statistical language model, and a graph is generated on the page from the extracted relations.

Disclosure of Invention

The invention aims to discover associations between pieces of information of the same kind (information in the geological gold mine field). The novelty of the constructed graph is that it captures knowledge of the gold mine field and is built from the literature.

In order to solve the technical problems, the invention adopts the following technical scheme:

A geological domain literature atlas generation method based on a statistical language model, comprising the following steps:

Step 1: performing a word segmentation operation on each sentence to obtain segmented sentences;

Step 2: performing a part-of-speech filtering operation on the segmented sentences to obtain main sentences containing only nouns, verbs and prepositions;

Step 3: performing a verb filtering operation on the main sentences to obtain sentences from which intransitive verbs and verbs that take a personal subject have been removed;

Step 4: performing a relation binary group extraction operation on the result of the previous step and calculating probabilities, wherein the relation binary groups take four forms, namely n1-v, n1-p, v-n2 and p-n2, and each relation binary group has a probability p; the probability p is given by the probability that the entity word occurs in the corpus and the probability that the relation word (preposition or verb) occurs given that the current entity word occurs;

wherein n1 represents entity 1, n2 represents entity 2, v represents a verb, and p represents a preposition;

Step 5: splicing relation binary groups that share the same relation word, and calculating the probability of the resulting relation triples, wherein the two relation binary groups forming a relation triple carry probability values P1 and P2 respectively, and the probability of the relation triple is P = P1 × P2 / P(r), where P(r) is the probability of the relation word appearing in the corpus;

Step 6: performing relation-word filtering and probability sorting operations on the relation triples, deleting the relation triples whose relation words are in the stop word list, and sorting the relation triples by relation triple probability P from high to low to obtain a relation triple list;

Step 7: performing a graph generation operation on the relation triples, taking the entity words as central nodes and the relations as edges that connect the extension nodes, to obtain the graph.
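The probabilities in steps 4 and 5 can be written out as follows. This is a sketch under one stated assumption: that the "and" in step 4 denotes a product, so that p is the joint probability of the entity word and the relation word. The symbols P(n), P(r | n) and P(r) and the numbers in the worked line are illustrative and are not taken from the patent.

```latex
% Step 4 (assumed product reading): probability of each relation binary group
P_1 = P(n_1)\,P(r \mid n_1), \qquad P_2 = P(n_2)\,P(r \mid n_2)

% Step 5: probability of the spliced relation triple n_1 - r - n_2
P = \frac{P_1\,P_2}{P(r)}

% Illustrative numbers only:
% P(n_1) = 0.02,\ P(r \mid n_1) = 0.5 \Rightarrow P_1 = 0.01
% P(n_2) = 0.04,\ P(r \mid n_2) = 0.25 \Rightarrow P_2 = 0.01
% P(r) = 0.05 \Rightarrow P = (0.01 \times 0.01) / 0.05 = 0.002
```

Under this reading, dividing by P(r) removes the relation word's probability that would otherwise be counted twice, once in each binary group.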

In the above technical solution, in step 6, the candidate set of relationship triples is filtered through a rule, where the rule is as follows:

for an extracted n1-(v/p)-n2 structure, if the distance between n1 and n2 exceeds 5, the relationship is considered weak and is discarded; for an extracted n1-(v/p)-n2 result, if n2 is followed by a verb, the relationship is considered incompletely extracted and is discarded.
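The two rules above can be sketched as a single predicate. This is a minimal illustration, not the patented implementation: the `Candidate` type, the `passes_rules` name, the one-letter tags and the choice to measure distance in token positions are all assumptions, as is reading "followed by a verb" as "the next token is a verb".

```python
from typing import List, NamedTuple, Tuple

Token = Tuple[str, str]  # (word, tag); tags "n", "v", "p" are an assumed tag set


class Candidate(NamedTuple):
    """Hypothetical container for the token positions of one n1-(v/p)-n2 triple."""
    n1_idx: int
    rel_idx: int
    n2_idx: int


def passes_rules(tokens: List[Token], cand: Candidate, max_distance: int = 5) -> bool:
    # Rule 1: discard when n1 and n2 are more than max_distance positions apart
    # (distance in token positions is an assumption; the text only says "exceeds 5").
    if cand.n2_idx - cand.n1_idx > max_distance:
        return False
    # Rule 2: discard when the token immediately after n2 is a verb,
    # since the extracted relation is then considered incomplete.
    nxt = cand.n2_idx + 1
    if nxt < len(tokens) and tokens[nxt][1] == "v":
        return False
    return True


# Word order mirrors the "I-to-him" example: a verb follows n2.
incomplete = [("I", "n"), ("to", "p"), ("him", "n"), ("said", "v")]
complete = [("gold", "n"), ("in", "p"), ("quartz", "n")]
```

With these inputs, `passes_rules(incomplete, Candidate(0, 1, 2))` rejects the triple under rule 2, while `passes_rules(complete, Candidate(0, 1, 2))` accepts it.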

In the above technical solution, the extension node in step 7 may be used as a new central node, and a relationship is used as an edge to connect a new extension node.
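The expansion described above, in which an extension node can itself become a central node, can be sketched with a simple adjacency index. The names `build_index` and `expand` and the sample triples are hypothetical, and indexing each triple under both of its entities is an assumption made so that either entity can serve as a centre.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (entity 1, relation word, entity 2)


def build_index(triples: List[Triple]) -> Dict[str, List[Tuple[str, str]]]:
    """Index every triple under both entities so either can act as a central node."""
    index: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for n1, rel, n2 in triples:
        index[n1].append((rel, n2))
        index[n2].append((rel, n1))
    return index


def expand(index: Dict[str, List[Tuple[str, str]]], center: str) -> List[Triple]:
    """Return the edges (center, relation, extension node) around one central node."""
    return [(center, rel, neighbour) for rel, neighbour in index.get(center, [])]


# Illustrative triples only; they are not taken from the patent.
triples = [("gold", "occurs_in", "quartz vein"), ("quartz vein", "cuts", "granite")]
index = build_index(triples)
first_ring = expand(index, "gold")          # basic graph around the searched word
second_ring = expand(index, "quartz vein")  # the extension node becomes a new centre
```

A node is expandable in this sketch when `expand` returns at least one edge leading to a node other than the previous centre.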

By adopting the above technical scheme, the invention has the following beneficial effects:

1. Geological gold mine documents are very numerous. With the graph, researchers who have spent only a short time in the field can quickly grasp the current research hotspots and the development context of the gold mine field, which saves the time needed to read many papers.

2. The constructed graph presents knowledge in a concentrated form and shows the associations among items of knowledge; it is convenient to display externally and makes the information clear at a glance.

Drawings

FIG. 1 is a flow diagram of a ternary relationship group extraction process;

FIG. 2 is an associative relationship map;

FIG. 3 is an association map;

FIG. 4 is a user literature map.

Detailed Description

In relation extraction, the method introduces a bigram language model with part-of-speech and word-frequency statistics and associates the documents with entity word lists from different fields. Sentences are segmented one at a time, and noun entities, verbs and prepositions are retained. Relation words are then filtered by removing intransitive verbs (such as "running") and verbs that take a personal subject (such as "seeing"). The filtered triples are scored with a joint probability calculation model and combined with filtering rules: for an n1-(v/p)-n2 structure, the distance between n1 and n2 must not exceed 5, otherwise the relationship is considered weak and is discarded; and for an extracted n1-(v/p)-n2 result, if n2 is followed by a verb, the relationship is considered incomplete and is discarded. For example, from "I said to him, tomorrow is a holiday", the relation triple "I-to-him" would be extracted, and this relationship is incomplete.
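The filtering and candidate enumeration described above can be sketched as follows, assuming the sentence has already been segmented and tagged by some external tool. The tag letters, the function names and the sample sentence are hypothetical; the patent does not name a tagger or a tag set.

```python
from typing import List, Set, Tuple

Token = Tuple[str, str]  # (word, tag); "n", "v", "p" form an assumed tag set
KEEP_TAGS = {"n", "v", "p"}


def pos_filter(tokens: List[Token]) -> List[Token]:
    """Keep only nouns, verbs and prepositions (the 'main sentence')."""
    return [tok for tok in tokens if tok[1] in KEEP_TAGS]


def verb_filter(tokens: List[Token], blocked_verbs: Set[str]) -> List[Token]:
    """Drop verbs on a block list, e.g. intransitive or personal-subject verbs."""
    return [(w, t) for w, t in tokens if not (t == "v" and w in blocked_verbs)]


def candidate_triples(tokens: List[Token]) -> List[Tuple[str, str, str]]:
    """Enumerate every n-(v/p)-n combination in sentence order, adjacent or not."""
    out = []
    for i, (w1, t1) in enumerate(tokens):
        if t1 != "n":
            continue
        for j in range(i + 1, len(tokens)):
            wr, tr = tokens[j]
            if tr not in ("v", "p"):
                continue
            for k in range(j + 1, len(tokens)):
                w2, t2 = tokens[k]
                if t2 == "n":
                    out.append((w1, wr, w2))
    return out


# Illustrative pre-tagged sentence; "d" stands for an adverb in this assumed tag set.
tagged = [("pyrite", "n"), ("often", "d"), ("hosts", "v"),
          ("gold", "n"), ("in", "p"), ("veins", "n")]
main_sentence = verb_filter(pos_filter(tagged), blocked_verbs={"running", "seeing"})
candidates = candidate_triples(main_sentence)
```

The enumeration deliberately ignores adjacency, so it over-generates; the scoring and rule-based filtering steps are what narrow the set down.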

(1) Word segmentation. Each sentence is segmented, and only nouns, verbs and prepositions are retained;

(2) Relation word filtering. Intransitive verbs (e.g., running) and verbs that take a personal subject (e.g., seeing) are filtered out;

(3) Obtaining the set of possible relation triples: all triples of n-v/p-n structure in the sentence, regardless of adjacency;

(4) Calculating the relation triple probability P of every triple obtained, as the score of the triple (using a bigram model): the two relation binary groups forming a relation triple carry probability values P1 and P2 respectively, and the relation word r shared by the two binary groups also has a probability value P(r), so the final relation triple probability is P = P1 × P2 / P(r). Obtaining a candidate set of relation triples: the n-v/p-n triple with the highest score is taken as the candidate relation triple;

(5) Determining the relation triples: the candidate set of relation triples is filtered by rules to obtain the relation triples. Two rules are currently used: for an extracted n1-(v/p)-n2 structure, if the distance between n1 and n2 exceeds 5, the relationship is considered weak and is discarded; for an extracted n1-(v/p)-n2 result, if n2 is followed by a verb, the relationship is considered incompletely extracted and is discarded. For example, from "I said to him, tomorrow is a holiday", the relation triple "I-to-him" can be extracted, and this relationship is incomplete;

(6) Calculating the confidence of the relation triples: a scoring function is added to calculate the confidence of each extracted relation triple, namely its joint probability. After the confidences are calculated, the relation triples built from the same entity (the entity in the entity-relation word pair) are sorted by confidence, and the top 20 percent are used for graph construction. The scoring function is the frequency with which the relation triple appears in the corpus (all documents provided by the quality library). The node order is the randomly generated order of the selected relation triples, because the graph is displayed around the entity. The specific extraction flow is shown in FIG. 1.
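Items (4) and (6) above can be sketched with plain frequency counts. This is a hedged illustration: the function names, the count tables and the sample numbers are hypothetical, the binary-group probability is assumed to be the product P(entity) × P(relation word | entity), and rounding the "top 20 percent" down with a minimum of one triple is also an assumption.

```python
from collections import Counter
from typing import List, Tuple


def pair_probability(entity: str, rel: str, unigram: Counter,
                     pair_counts: Counter, total: int) -> float:
    """p = P(entity) * P(rel | entity), estimated from counts (assumed product reading)."""
    if total == 0 or unigram[entity] == 0:
        return 0.0
    p_entity = unigram[entity] / total
    p_rel_given_entity = pair_counts[(entity, rel)] / unigram[entity]
    return p_entity * p_rel_given_entity


def triple_probability(n1: str, rel: str, n2: str, unigram: Counter,
                       pair_counts: Counter, total: int) -> float:
    """P = P1 * P2 / P(r), with P(r) the corpus probability of the relation word."""
    p_rel = unigram[rel] / total if total else 0.0
    if p_rel == 0.0:
        return 0.0
    p1 = pair_probability(n1, rel, unigram, pair_counts, total)
    p2 = pair_probability(n2, rel, unigram, pair_counts, total)
    return p1 * p2 / p_rel


def top_fraction(scored: List[Tuple[tuple, float]], fraction: float = 0.2):
    """Sort by score, high to low, and keep the leading fraction (at least one)."""
    ranked = sorted(scored, key=lambda item: item[1], reverse=True)
    keep = max(1, int(len(ranked) * fraction)) if ranked else 0
    return ranked[:keep]


# Illustrative counts only: 100 tokens in a toy corpus.
unigram = Counter({"pyrite": 20, "gold": 10, "hosts": 5})
pair_counts = Counter({("pyrite", "hosts"): 4, ("gold", "hosts"): 2})
score = triple_probability("pyrite", "hosts", "gold", unigram, pair_counts, total=100)
```

With these toy counts the two binary groups have probabilities 0.04 and 0.02 and the relation word has probability 0.05, so the triple score is approximately 0.016.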

Association relation map

The user searches for an entity word, and a basic graph with that word as the central word is displayed on the page. The user can click a category that allows expansion to view the results by category, and can also click a node to extend the graph (if the node can be expanded, it also becomes a central node); the graph can keep being extended as long as nodes can be expanded. The edge connecting two nodes is the relationship between the two entities and can be viewed by hovering the mouse over the connecting line. Document recommendation gives the source documents of the relation triple formed by the expanded node and the original central node, and the top 10 documents are taken according to relevance score. If no discovered relationship exists between the two searched words, the graph cannot be generated.

(1) The categories shown at the top are all classifications: the biology category comprises plant vocabulary, fungus vocabulary and a comprehensive geological paleontology vocabulary (obtained from a geological vocabulary and not explicitly classified); the geology category comprises geological entities, spatio-temporal regions, research methods and geochemistry. Within a classification, expandable nodes are drawn as circles and all other nodes as squares;

(2) the map may continue to expand.

After a term is entered and the search is clicked, the results are filtered according to the selection types provided on the page. From the graph, the user can find the relevant relationships, the information surrounding them, the current relationships, subsequent relationships, multi-category relationships, recommended documents and future applications, as shown in FIG. 3.

User literature map

Users can build a graph from one or more of their own geological documents and view the relationship network within those documents. From the generated graph, a user can quickly learn the main content of the documents and the entity words common to several documents, and thereby explore the research directions of the literature; this is what the graph is used for. Yellow nodes and blue nodes are used to make the graph easier to read. The generated literature map is shown in FIG. 4.
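Finding the entity words common to several documents can be sketched as follows, assuming relation triples have already been extracted per document. The `shared_entities` name, the document identifiers and the sample triples are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Triple = Tuple[str, str, str]  # (entity 1, relation word, entity 2)


def shared_entities(doc_triples: Dict[str, List[Triple]]) -> Dict[str, List[str]]:
    """Map each entity word to the documents it appears in, keeping only shared ones."""
    seen = defaultdict(set)
    for doc_id, triples in doc_triples.items():
        for n1, _rel, n2 in triples:
            seen[n1].add(doc_id)
            seen[n2].add(doc_id)
    return {entity: sorted(docs) for entity, docs in seen.items() if len(docs) > 1}


# Illustrative input only: two user documents with one entity in common.
docs = {
    "doc_a": [("gold", "occurs_in", "quartz")],
    "doc_b": [("gold", "associated_with", "pyrite")],
}
common = shared_entities(docs)
```

Here `common` contains only "gold", the one entity word that appears in both documents.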

Claims

1. A geological domain literature atlas generation method based on a language statistic model is characterized by comprising the following steps:

Step 2: performing a part-of-speech filtering operation on the segmented sentences to obtain main sentences in which only nouns, verbs and prepositions remain;

Step 3: performing a verb filtering operation on the main sentences to obtain sentences from which intransitive verbs and verbs that take a personal subject have been removed;

Step 4: performing a relation binary group extraction operation on the result of the previous step and calculating probabilities, wherein the relation binary groups take four forms, namely n1-v, n1-p, v-n2 and p-n2, and each relation binary group has a probability p; the probability p is given by the probability that the entity word occurs in the corpus and the probability that the relation word occurs given that the current entity word occurs, wherein n1 represents entity word 1, n2 represents entity word 2, v represents a verb, and p represents a preposition;

Step 6: performing relation-word filtering and probability sorting operations on the relation triples, deleting the relation triples whose relation words are in the stop word list, and sorting the set of relation triples by relation triple probability P from high to low to obtain a relation triple list;

2. The method according to claim 1, wherein the candidate set of relationship triples is filtered by a rule in step 6, wherein the rule is as follows:

3. The method according to claim 1, wherein the extended nodes in step 7 can be used as new central nodes, and new extended nodes are connected with the relationship as edges.