CN113836265A - Knowledge mining method based on cross-model and cross-language knowledge modeling - Google Patents

Knowledge mining method based on cross-model and cross-language knowledge modeling Download PDF

Info

Publication number
CN113836265A
CN113836265A (application CN202111112651.6A)
Authority
CN
China
Prior art keywords
knowledge
model
entity
language
cross
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111112651.6A
Other languages
Chinese (zh)
Inventor
方明
赵蔚彬
岳晨
刘世刚
代勋勋
方宸
任国政
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute Of Foreign Military Studies Academy Of War Studies Pla Academy Of Military Sciences
Original Assignee
Institute Of Foreign Military Studies Academy Of War Studies Pla Academy Of Military Sciences
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute Of Foreign Military Studies Academy Of War Studies Pla Academy Of Military Sciences filed Critical Institute Of Foreign Military Studies Academy Of War Studies Pla Academy Of Military Sciences
Priority to CN202111112651.6A priority Critical patent/CN113836265A/en
Publication of CN113836265A publication Critical patent/CN113836265A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention belongs to the technical field of knowledge mining, and particularly relates to a knowledge mining method based on cross-model and cross-language knowledge modeling. In the mining method, multi-language news knowledge is mined; the knowledge models established for different countries have both a certain degree of isolation and a certain degree of relevance; knowledge extraction is performed for corpora and knowledge models of different source languages by combining a supervised knowledge extraction method with an unsupervised one; the knowledge obtained during extraction is mapped to the knowledge models; a cross-language knowledge base is built on the basis of open-source encyclopedia libraries and open-source knowledge bases; and knowledge fusion and alignment are performed on the basis of cross-language links. Finally, cross-model and cross-language knowledge modeling and knowledge mining are completed, and the required cross-language news knowledge base is obtained.

Description

Knowledge mining method based on cross-model and cross-language knowledge modeling
Technical Field
The invention belongs to the technical field of knowledge mining, and particularly relates to a knowledge mining method based on cross-model and cross-language knowledge modeling.
Background
With the explosive growth of open-source news text data, traditional data retrieval and browsing can no longer meet research needs. As research directions become increasingly specialized, researchers urgently need cross-model and cross-language knowledge graph technologies to extract the target entities they care about, and the relations between those entities, from large volumes of natural-language text, and thereby mine potentially high-value information.
Since Google introduced the knowledge graph in 2012, it has attracted attention from both academia and industry and has been widely studied and applied in many fields. At present, leading Internet enterprises such as Google, Baidu, Alibaba and Tencent, as well as emerging enterprises such as Ming and Xingjiang, have established knowledge graph frameworks and technologies, aiming to improve their data application capabilities with knowledge graphs. Conventionally, knowledge graph construction includes knowledge modeling, data access, knowledge mapping, knowledge extraction and knowledge fusion, finally forming a knowledge graph that provides knowledge data support for upper-layer applications.
Chinese patent CN112199511A discloses a method for constructing a cross-language, multi-source vertical-domain knowledge graph. Given input cross-language texts, domain dictionaries, domain term libraries and domain materials, a parallel corpus is built through content and link analysis; foreign-language texts are automatically translated with a trained translation model after preprocessing; semantic features and entity relations are then extracted with deep learning by combining the vertical-domain translation data with the actual scene; finally, domain knowledge fusion and disambiguation merge equivalent entities from different sources to obtain the cross-language, multi-source vertical-domain knowledge graph.
Traditional knowledge graph construction usually targets a single knowledge model and single-language text, and cannot achieve cross-language, cross-knowledge-model knowledge mining. The cross-language knowledge graph construction in Chinese patent CN112199511A is still aimed at a single knowledge model, and its cross-language capability is realized by machine-translating the texts rather than by genuinely cross-language knowledge graph construction, so it cannot meet the application requirements of cross-model, cross-language knowledge graph construction in the news research field.
Disclosure of Invention
Technical problem to be solved
The technical problem to be solved by the invention is as follows: according to users' requirements for cross-model modeling and cross-language knowledge extraction, provide a cross-model and cross-language knowledge graph construction technology, so that researchers can conveniently model their own research directions, extract source-language knowledge from the corpora collected for those directions, and achieve knowledge fusion through cross-language knowledge links.
(II) technical scheme
In order to solve the technical problems, the invention provides a knowledge mining method based on cross-model and cross-language knowledge modeling, which comprises the following steps:
step 1: respectively carrying out knowledge modeling aiming at different countries, wherein the knowledge modeling comprises multi-level entity concept modeling, entity attribute modeling and logic association relation modeling; the step 1 comprises the following specific steps;
step 11: respectively creating respective knowledge models for different countries in a knowledge modeling background system, respectively creating and naming the knowledge models corresponding to the different countries by the names of the countries, and then creating a multi-level entity concept constrained according to the membership relationship for each knowledge model; the multi-level entity concept comprises: a primary entity concept and a secondary entity concept; wherein the primary entity concept comprises: organization, people, places, weaponry; the second-level entity concepts are subsets of a certain first-level entity concept, and the two concepts have inclusion and contained relations in the membership relation;
step 12: creating, for each specific entity concept, its corresponding entity attributes; wherein, for an organization, its entity attributes include: establishment time, headquarters location, scale, number of people; for a person, its entity attributes include: age, gender, job title, place of birth, educational background; for a place, its entity attributes include: climate, latitude and longitude; for weaponry, its entity attributes include length, weight, radius of action, power;
step 13: in the whole knowledge model, for every two entity concepts capable of generating entity relations, defining the two entity concepts into a pair, and for each pair of entity concepts capable of generating entity relations, creating a logical association relation between the two entity concepts;
the logical association relationship comprises:
the relationship between a person and an organization includes "position held" and "member"; the relationship between a person and a place includes "place of birth" and "place of death"; the relationship between a person and a person includes "colleague" and "relative"; the relationship between weaponry and an organization includes "development unit"; the relationship between an organization and an organization includes "affiliation"; and the relationship between weaponry and weaponry includes "carried on" and "alias";
step 14: repeating the steps 11 to 13, and sequentially creating knowledge models of all countries involved in the research mission, multi-level entity concepts, entity attributes and logic association relations in the knowledge models of all countries, so as to form knowledge model data in a data form; grouping the knowledge model data according to countries, and storing the knowledge model data into a database table in a grouping mode;
step 2: regarding multi-level entity concepts, entity attributes and logic association relations in knowledge model data, taking the local official language of the knowledge model data as a source language, and modeling the knowledge model data by using the source language to form respective source language models of the multi-level entity concepts, the entity attributes and the logic association relations, namely a multi-level entity concept source language model, an entity attribute source language model and a logic association relation source language model;
then, according to the contrastive semantic relations between the source language and Chinese, establishing a multi-level entity concept Chinese model, an entity attribute Chinese model and a logical association relation Chinese model corresponding respectively to the multi-level entity concept source language model, the entity attribute source language model and the logical association relation source language model;
step 3: extracting knowledge from each news material;
the knowledge is extracted by combining a supervised knowledge extraction method and an unsupervised knowledge extraction method, and the step 3 comprises the following specific steps:
step 31: carrying out supervision training by manually marking data by adopting a supervision deep learning method to generate a supervision knowledge extraction model; the supervised knowledge extraction model is generated by training for each language respectively, and comprises the following steps: a Chinese supervision knowledge extraction model, an English supervision knowledge extraction model, a Japanese supervision knowledge extraction model and a Russian supervision knowledge extraction model;
step 32: defining a dictionary and rules according to a multi-level entity concept model, an entity attribute model and a logic association relation model established by a user;
for the multi-level entity concept model, when a user defines the entity concept "weaponry" in a knowledge model, data on aircraft, ships and missiles related to "weaponry" are sorted in advance to serve as a dictionary, where the dictionary is a mapping from specific entity names to entity concepts; once an entry in the dictionary is matched in a news material, it is considered to belong to the concept "weaponry";
meanwhile, a rule is defined: all entities ending in "warship" or "aircraft" are considered to belong to the concept "weaponry". For the entity attribute model, when the user defines the entity attribute "length" for the "weaponry" concept in the knowledge model, a rule is defined: once an entity under the "weaponry" concept and the keyword "length" are matched in a news material, the keyword "length" together with its corresponding numerical value is taken as an attribute of that "weaponry" entity;
for the logical association relation model, when the user defines the logical association relation "relatives" for the concept pair "person"-"person" in the knowledge model, a rule is defined: once the concept pair "person"-"person" and a keyword such as "father", "mother" or "relative" are matched in a news material, "relatives" is taken as the logical association relation of that "person"-"person" pair;
by analogy, defining all rules and dictionaries according to a multi-level entity concept model, an entity attribute model and a logic association relation model to form an unsupervised knowledge extraction model;
similarly, the rules and dictionaries are respectively generated for each language design, and comprise a Chinese rule dictionary, an English rule dictionary, a Japanese rule dictionary and a Russian rule dictionary, so that a Chinese unsupervised knowledge extraction model, an English unsupervised knowledge extraction model, a Japanese unsupervised knowledge extraction model and a Russian unsupervised knowledge extraction model are formed;
step 33: performing knowledge extraction on each news material, respectively calling a supervised knowledge extraction model and an unsupervised knowledge extraction model of the corresponding language according to the language of the news material, and fusing the returned results of the supervised knowledge extraction model and the unsupervised knowledge extraction model;
knowledge extraction encompasses three processes:
firstly, entity extraction is carried out, wherein a single news material is input, and all entity information contained in the material is output;
secondly, extracting attributes, inputting all entity information in the material, and outputting attribute information of each entity;
finally, extracting the relation, inputting all entity information in the material, and outputting logic association relation information between each group of entity pairs;
acquiring knowledge in news materials through the knowledge extraction process;
step 4: mapping the knowledge acquired in the knowledge extraction process to the knowledge models; since knowledge acquired from news materials in different languages is not necessarily mapped to the knowledge model of that language, a judgment must be made again according to the chapter-level and sentence-level semantic information of the news material, so as to complete the mapping between knowledge and knowledge models; for example, for Japanese domestic news reported in English, the extracted knowledge should be mapped to the Japanese knowledge model; for Indian domestic news reported in Japanese, the extracted knowledge should be mapped to the Indian knowledge model;
the step 4 comprises the following specific steps:
step 41: classifying each sentence in the news material according to a knowledge model, wherein the specific categories comprise a United states knowledge model, a Japanese knowledge model, an India knowledge model and a Russian knowledge model; if the credibility of the classification result is higher, the classification result is considered to be effective; mapping the knowledge extracted from the sentence to a knowledge model obtained by classification;
step 42: if the credibility of the sentence classification result is low, the sentence classification result is considered to be invalid; classifying the whole news material according to a knowledge model, wherein the specific categories comprise a United states knowledge model, a Japanese knowledge model, an India knowledge model and a Russian knowledge model; mapping the extracted knowledge of each sentence in the news material to a knowledge model obtained by classification;
step 43: through the mapping process of the above steps 41 and 42, the generated output is the combination of knowledge and knowledge model, i.e. knowledge base; obtaining a knowledge base in a source language form;
step 44: aiming at the knowledge base in the source language form, mapping the source language and the Chinese again according to the multi-level entity concept Chinese model, the entity attribute Chinese model and the logic association relation Chinese model which are respectively corresponding to the multi-level entity concept source language model, the entity attribute source language model and the logic association relation source language model in the step 2 to obtain the knowledge base in the Chinese form;
step 5: since the same knowledge often has different expression forms, the knowledge mapped to the knowledge models in the Chinese-form knowledge base needs to be further subjected to cross-language knowledge fusion;
the method comprises constructing a cross-language knowledge base based on open-source encyclopedia libraries and open-source knowledge bases, and realizing cross-language knowledge fusion based on the cross-language knowledge base; this step comprises the following specific sub-steps:
step 51, building a cross-language knowledge base by utilizing open-source encyclopedia libraries and open-source knowledge bases;
the open-source encyclopedia libraries and open-source knowledge bases are sorted and integrated to build a unified cross-language knowledge base; the cross-language knowledge base contains the aliases, attributes, descriptions and label information of the same entity in different language dimensions;
step 52, aligning the same knowledge in different languages by using a cross-language knowledge base to complete the fusion alignment of the knowledge;
therefore, cross-model, cross-language knowledge modeling and knowledge mining are finally completed, and a required cross-language news knowledge base is obtained.
Wherein, in step 11, the respective knowledge models created for different countries include: the United States knowledge model, the Japanese knowledge model and the Indian knowledge model.
In step 11, in the case of the United States knowledge model, first-level entity concepts are respectively created, including organization, people, places and weaponry;
continuing to create second-level entity concepts on the basis of the first-level entity concepts, second-level entity concepts are created under the first-level entity concept "organization", including the U.S. Department of Defense, the American Petroleum Institute, the U.S. Senate and the American Public Health Association.
In step 1, the knowledge models of different countries are isolated from one another in the display interface, physical storage and database, so as to control the access rights of different users or user groups.
In step 14, each group is given different database access permissions, so that the groups are isolated from one another and different users can only access the knowledge model data corresponding to their own permissions.
In step 32, the representation form of the dictionary includes: { Roosevelt aircraft carrier: weaponry }, { Jackson: person }, { New York: place }.
In step 41, a high confidence of the classification result means a confidence greater than or equal to 0.7.
In step 42, a low confidence of the classification result means a confidence lower than 0.7.
The open-source encyclopedia library refers to public encyclopedia knowledge bases including Wikipedia, Baidu Baike and Hudong Baike.
Wherein, the open source knowledge base refers to a public knowledge base including Baklib and Raneto.
(III) advantageous effects
Compared with the prior art, the present method is a cross-language, cross-country knowledge mining method: multi-language news knowledge is mined, and the knowledge models established for different countries have both a certain degree of isolation and a certain degree of relevance. For corpora and knowledge models in different source languages, the method automatically adapts its knowledge extraction algorithms and achieves knowledge extraction in the true source language. At the cross-language knowledge fusion level, a cross-language knowledge base is built on the basis of open-source encyclopedia libraries, open-source knowledge bases and user expert knowledge, and knowledge fusion and alignment are performed on the basis of cross-language links.
Compared with the prior art, the method mines multi-language news knowledge in a cross-language, cross-country manner, achieves source-language knowledge extraction in the true sense, and can automatically adapt its knowledge extraction algorithm to different source-language corpora and knowledge models. Meanwhile, a cross-language knowledge base is built on the basis of open-source encyclopedia libraries, open-source knowledge bases and user expert knowledge, and knowledge fusion and alignment are achieved.
In addition, the invention realizes cross-language and cross-model knowledge graph construction, retrieval and application in actual projects.
Drawings
FIG. 1 is a schematic diagram of a multilingual knowledge extraction process according to the technical solution of the present invention.
Fig. 2 is a flow chart of cross-language news knowledge graph construction in the technical scheme of the invention.
Detailed Description
In order to make the objects, contents, and advantages of the present invention clearer, the following detailed description of the embodiments of the present invention will be made in conjunction with the accompanying drawings and examples.
In order to solve the problems of the prior art, the invention provides a knowledge mining method based on cross-model and cross-language knowledge modeling, as shown in fig. 2, the method comprises the following steps:
step 1: news researchers in different fields pay attention to different news content. Knowledge modeling is carried out separately for different countries, including multi-level entity concept modeling, entity attribute modeling and logical association relation modeling; in step 1, the knowledge models of different countries are isolated from one another in the display interface, physical storage and database, so as to control the access rights of different users or user groups.
The step 1 comprises the following specific steps;
step 11: respectively creating respective knowledge models for different countries in a knowledge modeling background system, respectively creating and naming the knowledge models corresponding to the different countries by the names of the countries, and then creating a multi-level entity concept constrained according to the membership relationship for each knowledge model; the multi-level entity concept comprises: a primary entity concept and a secondary entity concept; wherein the primary entity concept comprises: organization, people, places, weaponry; the second-level entity concepts are subsets of a certain first-level entity concept, and the two concepts have inclusion and contained relations in the membership relation;
in step 11, the respective knowledge models created for different countries include: the United States knowledge model, the Japanese knowledge model, the Indian knowledge model, etc.
In step 11, in the case of the United States knowledge model, primary entity concepts are respectively created, including organization, people, places and weaponry;
continuing to create secondary entity concepts on the basis of the primary entity concepts, secondary entity concepts are created under the primary entity concept "organization", including the U.S. Department of Defense, the American Petroleum Institute, the U.S. Senate, the American Public Health Association, and the like.
Step 12: creating, for each specific entity concept, its corresponding entity attributes; wherein, for an organization, its entity attributes include: establishment time, headquarters location, scale, number of people, etc.; for a person, its entity attributes include: age, gender, job title, place of birth, educational background, etc.; for a place, its entity attributes include: climate, latitude and longitude, etc.; for weaponry, its entity attributes include length, weight, radius of action, power, etc.;
step 13: in the whole knowledge model, for every two entity concepts capable of generating entity relations, defining the two entity concepts into a pair, and for each pair of entity concepts capable of generating entity relations, creating a logical association relation between the two entity concepts;
the logical association relations include: the relationship between a person and an organization includes "position held", "member", etc.; the relationship between a person and a place includes "place of birth", "place of death", etc.; the relationship between a person and a person includes "colleague", "relative", etc.; the relationship between weaponry and an organization includes "development unit", etc.; the relationship between an organization and an organization includes "affiliation", etc.; and the relationship between weaponry and weaponry includes "carried on", "alias", etc.;
step 14: repeating steps 11 to 13, creating in turn the knowledge models of all countries involved in the research task, together with the multi-level entity concepts, entity attributes and logical association relations within each country's knowledge model, thereby forming knowledge model data in data form; the knowledge model data are grouped by country and stored in database tables by group; in step 14, each group is given different database access permissions, so that the groups are isolated from one another and different users can only access the knowledge model data corresponding to their own permissions.
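Purely as an illustration and not as part of the claimed method, the grouping of knowledge model data by country into separate database tables can be sketched in Python as follows; the table layout, column names and sample rows are assumptions made for the example, and real per-group access rights would be enforced by the database management system's own permission mechanism.

    import sqlite3

    # Illustrative knowledge-model rows: (country, element kind, name, parent concept).
    MODEL_ROWS = [
        ("United States", "concept", "weaponry", None),
        ("United States", "concept", "aircraft carrier", "weaponry"),
        ("Japan", "concept", "organization", None),
    ]

    def store_grouped_models(rows, db_path=":memory:"):
        """Group knowledge-model rows by country and store each group in its own table."""
        conn = sqlite3.connect(db_path)
        for country, kind, name, parent in rows:
            table = "model_" + country.lower().replace(" ", "_")  # one table per country group
            conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (kind TEXT, name TEXT, parent TEXT)")
            conn.execute(f"INSERT INTO {table} VALUES (?, ?, ?)", (kind, name, parent))
        conn.commit()
        return conn

    conn = store_grouped_models(MODEL_ROWS)
    print(conn.execute("SELECT name FROM model_united_states").fetchall())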
The display mode of the knowledge base is as follows:
american knowledge model
Person
Military figure
Mark Miller
Blinken
Scientists
Franklin
Edison
Weaponry
Aircraft carrier
Roosevelt aircraft carrier
Nimitz aircraft carrier
Aircraft
F22 fighter plane
F35 fighter plane
Organization
.......
Place
......
Japanese knowledge model … …
India knowledge model … …
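As an illustrative sketch only, the multi-level entity concepts, entity attributes and logical association relations of a country knowledge model shown above can be represented by data structures of the following kind; the concrete concept, attribute and relation names are examples taken from the embodiment rather than a fixed schema.

    from dataclasses import dataclass, field
    from typing import List, Tuple

    @dataclass
    class EntityConcept:
        name: str
        attributes: List[str] = field(default_factory=list)            # entity attributes of this concept
        children: List["EntityConcept"] = field(default_factory=list)  # secondary concepts (subsets)

    @dataclass
    class KnowledgeModel:
        country: str
        concepts: List[EntityConcept] = field(default_factory=list)          # primary entity concepts
        relations: List[Tuple[str, str, str]] = field(default_factory=list)  # (concept A, concept B, relation)

    us_model = KnowledgeModel(
        country="United States",
        concepts=[
            EntityConcept("person", ["age", "gender", "job title", "place of birth"]),
            EntityConcept("weaponry", ["length", "weight", "radius of action", "power"],
                          children=[EntityConcept("aircraft carrier"), EntityConcept("aircraft")]),
        ],
        relations=[("person", "organization", "position held"),
                   ("weaponry", "organization", "development unit")],
    )
    print(us_model.concepts[1].children[0].name)   # -> aircraft carrier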
Step 2: regarding multi-level entity concepts, entity attributes and logic association relations in knowledge model data, taking the local official language of the knowledge model data as a source language, and modeling the knowledge model data by using the source language to form respective source language models of the multi-level entity concepts, the entity attributes and the logic association relations, namely a multi-level entity concept source language model, an entity attribute source language model and a logic association relation source language model;
then, according to the contrastive semantic relations between the source language and Chinese, establishing a multi-level entity concept Chinese model, an entity attribute Chinese model and a logical association relation Chinese model corresponding respectively to the multi-level entity concept source language model, the entity attribute source language model and the logical association relation source language model;
(For example, the secondary entity concept "U.S. Senate" in the United States knowledge model is stored in the modeling system in its English source form "United States Senate" and annotated with its Chinese name; its entity attribute "number of people" is likewise stored in its English source form and annotated with the corresponding Chinese term. Similarly, a secondary entity concept in the Japanese knowledge model is stored in its Japanese source form and annotated with its Chinese name, and its entity attribute "establishment time" is stored in Japanese and annotated with the Chinese term for "establishment time".)
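A minimal sketch, under the assumption of a simple key-value store, of how each model element can carry both its source-language form and its Chinese annotation, so that the later source-language-to-Chinese mapping of step 44 reduces to a lookup; the element identifiers and labels below are illustrative.

    # Each model element keeps its source-language form plus a Chinese annotation.
    BILINGUAL_LABELS = {
        "us.concept.senate": {"source": "United States Senate", "zh": "美国参议院"},
        "us.attr.scale":     {"source": "scale",                "zh": "规模"},
    }

    def to_chinese(element_id: str) -> str:
        """Return the Chinese annotation of a model element, falling back to the source form."""
        entry = BILINGUAL_LABELS.get(element_id, {})
        return entry.get("zh", entry.get("source", element_id))

    print(to_chinese("us.concept.senate"))   # -> 美国参议院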
Step 3: knowledge extraction is performed on each news material. There are two main modes of knowledge extraction: general supervised knowledge extraction and unsupervised knowledge extraction based on the knowledge model. General supervised knowledge extraction adopts the current mainstream supervised deep-learning methods and performs supervised training on manually labeled data; it is accurate, but it requires a large amount of manually labeled data, so the cycle is long, the cost is high, and it cannot adapt well to a variety of knowledge models. Unsupervised knowledge extraction based on the knowledge model flexibly defines dictionaries and rules according to the multi-level entity concept model, entity attribute model and logical association relation model established by the user, and acquires knowledge through the grammatical and semantic information of the text. Therefore, for knowledge models flexibly configured by users, a combination of the supervised and unsupervised knowledge extraction methods is adopted. Step 3 comprises the following specific steps:
step 31: carrying out supervision training by manually marking data by adopting a supervision deep learning method to generate a supervision knowledge extraction model; the supervised knowledge extraction model is generated by training for each language respectively, and comprises the following steps: a Chinese supervision knowledge extraction model, an English supervision knowledge extraction model, a Japanese supervision knowledge extraction model and a Russian supervision knowledge extraction model;
step 32: flexibly defining dictionaries and rules according to the multi-level entity concept model, entity attribute model and logical association relation model established by the user. For the multi-level entity concept model, when a user defines the entity concept "weaponry" in a knowledge model, a batch of data on aircraft, ships and missiles related to "weaponry" is sorted in advance to serve as a dictionary; the dictionary is a mapping from specific entity names to entity concepts, such as { Roosevelt aircraft carrier: weaponry }, { Edison: person }, { New York: place }. Once an entry in the dictionary is matched in a news material, it is considered to belong to the concept "weaponry";
meanwhile, a rule is defined: all entities ending in "warship" or "aircraft" are considered to belong to the concept "weaponry". For the entity attribute model, when the user defines the entity attribute "length" for the "weaponry" concept in the knowledge model, a rule is defined: once an entity under the "weaponry" concept and the keyword "length" are matched in a news material, the keyword "length" together with its corresponding numerical value is taken as an attribute of that "weaponry" entity;
for the logical association relation model, when the user defines the logical association relation "relatives" for the concept pair "person"-"person" in the knowledge model, a rule is defined: once the concept pair "person"-"person" and a keyword such as "father", "mother" or "relative" is matched in a news material, "relatives" is taken as the logical association relation of that "person"-"person" pair;
by analogy, defining all rules and dictionaries according to a multi-level entity concept model, an entity attribute model and a logic association relation model to form an unsupervised knowledge extraction model;
similarly, the rules and dictionaries are respectively generated for each language design, and comprise a Chinese rule dictionary, an English rule dictionary, a Japanese rule dictionary and a Russian rule dictionary, so that a Chinese unsupervised knowledge extraction model, an English unsupervised knowledge extraction model, a Japanese unsupervised knowledge extraction model and a Russian unsupervised knowledge extraction model are formed;
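For illustration only, the dictionary and rule matching of step 32 could look roughly like the following sketch; the dictionary entries, suffix rules, keyword lists and regular expression are assumptions made for the example, and a production system would rely on proper tokenization and the per-language rule dictionaries rather than plain string matching.

    import re

    # Illustrative dictionary: specific entity name -> entity concept.
    ENTITY_DICT = {"Roosevelt aircraft carrier": "weaponry", "Jackson": "person", "New York": "place"}
    # Illustrative rules: suffix-based concept assignment, attribute keyword pattern, kinship keywords.
    SUFFIX_RULES = [("carrier", "weaponry"), ("fighter", "weaponry")]
    LENGTH_PATTERN = re.compile(r"length of ([\d.]+\s*(?:m|meters|feet))", re.I)
    KINSHIP_KEYWORDS = {"father", "mother", "relative"}

    def unsupervised_extract(sentence: str):
        entities, attributes, relations = [], [], []
        lowered = sentence.lower()
        # 1) dictionary match: a dictionary entry found in the text takes its mapped concept
        for name, concept in ENTITY_DICT.items():
            if name.lower() in lowered:
                entities.append((name, concept))
        # 2) suffix rule: phrases ending in ship/aircraft words fall under "weaponry"
        for phrase in re.split(r"[,;]", sentence):
            for suffix, concept in SUFFIX_RULES:
                if phrase.strip().lower().endswith(suffix):
                    entities.append((phrase.strip(), concept))
        # 3) attribute rule: a "weaponry" entity plus the keyword "length" yields a length attribute
        m = LENGTH_PATTERN.search(sentence)
        if m and any(c == "weaponry" for _, c in entities):
            attributes.append(("length", m.group(1)))
        # 4) relation rule: two "person" entities plus a kinship keyword yields a "relatives" relation
        persons = [n for n, c in entities if c == "person"]
        if len(persons) >= 2 and any(k in lowered for k in KINSHIP_KEYWORDS):
            relations.append((persons[0], "relatives", persons[1]))
        return entities, attributes, relations

    print(unsupervised_extract("The Roosevelt aircraft carrier has a length of 332.8 m."))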
step 33: performing knowledge extraction on each news material, respectively calling a supervised knowledge extraction model and an unsupervised knowledge extraction model of the corresponding language according to the language of the news material, and fusing the returned results of the supervised knowledge extraction model and the unsupervised knowledge extraction model; for example, when the news material language is English, the English supervised knowledge extraction model and the English unsupervised knowledge extraction model are respectively called, and the returned results of the English supervised knowledge extraction model and the English unsupervised knowledge extraction model are fused;
knowledge extraction encompasses three processes:
firstly, entity extraction is carried out, wherein a single news material is input, and all entity information contained in the material is output;
secondly, extracting attributes, inputting all entity information in the material, and outputting attribute information of each entity;
finally, extracting the relation, inputting all entity information in the material, and outputting logic association relation information between each group of entity pairs;
for example, taking English material, input the sentence "US inventor Edison worked in the White House for four years". First, the English entity extraction algorithm is called, and the returned result is Edison-person, White House-organization; next, the English attribute extraction algorithm is called, and the returned result is Edison-position-US inventor; finally, the relation extraction algorithm is called, and the returned result is Edison-position held-White House.
FIG. 1 specifically illustrates the process of multi-lingual knowledge extraction.
Acquiring knowledge in news materials through the knowledge extraction process;
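The per-language dispatch and result fusion of step 33 might be organised as in the sketch below; the two extractor functions are stand-ins for a trained supervised model and for the rule/dictionary extractor of step 32, and the union-based fusion is only one simple possibility.

    # Stand-in extractors; real implementations would wrap a trained NER/attribute/relation
    # model and the dictionary/rule extractor of step 32, one pair per language.
    def english_supervised(text):
        return {("Edison", "person"), ("White House", "organization")}

    def english_unsupervised(text):
        return {("Edison", "person")}

    SUPERVISED = {"en": english_supervised}
    UNSUPERVISED = {"en": english_unsupervised}

    def extract(text: str, lang: str):
        """Call both extraction models for the material's language and fuse the returned results."""
        supervised_result = SUPERVISED[lang](text)
        unsupervised_result = UNSUPERVISED[lang](text)
        return supervised_result | unsupervised_result   # simple fusion: union of both result sets

    print(extract("US inventor Edison worked in the White House for four years.", "en"))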
step 4: mapping the knowledge acquired in the knowledge extraction process to the knowledge models; since knowledge acquired from news materials in different languages is not necessarily mapped to the knowledge model of that language, a judgment must be made again according to the chapter-level and sentence-level semantic information of the news material, so as to complete the mapping between knowledge and knowledge models; for example, for Japanese domestic news reported in English, the extracted knowledge should be mapped to the Japanese knowledge model; for Indian domestic news reported in Japanese, the extracted knowledge should be mapped to the Indian knowledge model;
the step 4 comprises the following specific steps:
step 41: classifying each sentence in the news material according to a knowledge model, wherein the specific categories comprise a United states knowledge model, a Japanese knowledge model, an India knowledge model and a Russian knowledge model; if the confidence level of the classification result is high (higher than or equal to 0.7), the classification result is considered to be valid; mapping the knowledge extracted from the sentence to a knowledge model obtained by classification;
the classification algorithm is the most basic and general algorithm in the industry, and is used for determining which category a sentence or a chapter belongs to, the input of the classification algorithm is a sentence or a chapter, and the output is each category and corresponding credibility, such as { U.S. knowledge model: 0.7, japanese knowledge model: 0.1, indian knowledge model: 0.05, russian knowledge model: 0.1}. Acquiring a category with the highest credibility as a final credible category;
step 42: if the credibility of the sentence classification result is low (lower than 0.7), the sentence classification result is considered invalid; classifying the whole news material according to a knowledge model, wherein the specific categories comprise a United states knowledge model, a Japanese knowledge model, an India knowledge model and a Russian knowledge model; mapping the extracted knowledge of each sentence in the news material to a knowledge model obtained by classification;
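Steps 41 and 42 amount to sentence-level classification with a document-level fallback; a minimal sketch is given below, assuming classifier functions that return a dictionary of knowledge models and confidences, with the 0.7 threshold taken from the embodiment.

    def map_sentences_to_models(sentences, classify_sentence, classify_document, threshold=0.7):
        """Assign each sentence's extracted knowledge to a country knowledge model (steps 41-42)."""
        doc_scores = classify_document(" ".join(sentences))
        doc_model = max(doc_scores, key=doc_scores.get)        # document-level fallback class
        assignments = {}
        for sent in sentences:
            scores = classify_sentence(sent)
            best = max(scores, key=scores.get)
            if scores[best] >= threshold:   # step 41: sentence-level result is trusted
                assignments[sent] = best
            else:                           # step 42: fall back to the document-level class
                assignments[sent] = doc_model
        return assignments

    # Toy classifiers with hard-coded confidences, for illustration only.
    fake_sentence = lambda s: ({"United States knowledge model": 0.8, "Japanese knowledge model": 0.1}
                               if "Pentagon" in s else
                               {"United States knowledge model": 0.4, "Japanese knowledge model": 0.35})
    fake_document = lambda d: {"United States knowledge model": 0.3, "Japanese knowledge model": 0.6}
    print(map_sentences_to_models(["The Pentagon announced a budget.", "Officials met in Tokyo."],
                                  fake_sentence, fake_document))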
step 43: through the mapping process of the above steps 41 and 42, the generated output is the combination of knowledge and knowledge model, i.e. knowledge base; obtaining a knowledge base in a source language form;
step 44: aiming at the knowledge base in the source language form, mapping the source language and the Chinese again according to the multi-level entity concept Chinese model, the entity attribute Chinese model and the logic association relation Chinese model which are respectively corresponding to the multi-level entity concept source language model, the entity attribute source language model and the logic association relation source language model in the step 2 to obtain the knowledge base in the Chinese form;
step 5: since the same knowledge often has different expression forms, the knowledge mapped to the knowledge models in the Chinese-form knowledge base needs to be further subjected to cross-language knowledge fusion;
the method comprises constructing a cross-language knowledge base based on open-source encyclopedia libraries, open-source knowledge bases and user expert knowledge, and realizing cross-language knowledge fusion based on the cross-language knowledge base; this step comprises the following specific sub-steps:
step 51, building a cross-language knowledge base by utilizing open-source encyclopedia libraries and open-source knowledge bases (which can further include knowledge accumulated by user experts);
the open-source encyclopedia library refers to public encyclopedia knowledge bases including Wikipedia, Baidu Baike and Hudong Baike, and the open-source knowledge base refers to public knowledge bases including Baklib and Raneto (user expert accumulation refers to the industry knowledge a user has accumulated in their work); the open-source encyclopedia libraries and open-source knowledge bases are sorted and integrated to build a unified cross-language knowledge base; the cross-language knowledge base contains the aliases, attributes, descriptions and label information of the same entity in different language dimensions;
step 52, align the same knowledge expressed in different languages by using the cross-language knowledge base; for example, "Edison", "Thomas Alva Edison" and other aliases can all be aligned to the standard Chinese name for Thomas Alva Edison. The fusion and alignment of knowledge is thus completed;
therefore, cross-model, cross-language knowledge modeling and knowledge mining are finally completed, and a required cross-language news knowledge base is obtained.
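The alias-based alignment of step 52 can be sketched as a lookup over the cross-language knowledge base; the entry below is a toy example, and in practice the alias sets would be harvested from the open-source encyclopedia libraries and knowledge bases.

    # Toy cross-language knowledge base: each entry carries a canonical Chinese name and
    # the aliases of the same entity in different language dimensions.
    CROSS_LINGUAL_KB = [
        {"canonical_zh": "托马斯·阿尔瓦·爱迪生",
         "aliases": {"Edison", "Thomas Alva Edison", "爱迪生", "エジソン"}},
    ]

    ALIAS_INDEX = {alias: entry for entry in CROSS_LINGUAL_KB for alias in entry["aliases"]}

    def align(mention: str) -> str:
        """Map a mention in any language to its canonical Chinese name, or keep it unchanged."""
        entry = ALIAS_INDEX.get(mention)
        return entry["canonical_zh"] if entry else mention

    for m in ["Edison", "エジソン", "Nimitz"]:
        print(m, "->", align(m))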
On the basis of the above technical scheme, the method may further include:
step 6: knowledge correction and updating; for the knowledge obtained by knowledge extraction, a user can review and intervene to guide the continuous optimization of the extraction models, so that the accuracy of knowledge extraction and knowledge fusion is continuously improved. This step comprises the following specific sub-steps:
step 61: for the knowledge acquired by the multi-language knowledge extraction models, a user can perform manual review and correction, including multi-dimensional corrections of entity names, entity types, entity attributes, entity relations and the like; the high-accuracy knowledge finally obtained is then stored and applied.
step 62: on the other hand, the high-accuracy knowledge corrected by the user can be used as a high-quality training corpus for the knowledge extraction algorithms, guiding their continuous optimization and continuously improving the accuracy of knowledge extraction and knowledge fusion.
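A minimal sketch of steps 61 and 62, in which a reviewer's fixes overwrite the extracted record and the corrected record is banked as a training example for the next optimisation round of the extraction models; the record fields are illustrative.

    TRAINING_CORPUS = []   # accumulated (source text, corrected annotation) pairs

    def apply_correction(record, reviewer_fixes):
        """Apply a user's manual fixes to an extracted record and queue it as training data."""
        corrected = {**record, **reviewer_fixes}   # entity name/type/attribute/relation corrections
        TRAINING_CORPUS.append((record["source_text"], corrected))
        return corrected

    raw = {"source_text": "Edison worked in the White House.", "entity": "Ediso", "type": "person"}
    print(apply_correction(raw, {"entity": "Edison"}))
    print(len(TRAINING_CORPUS), "example(s) queued for the next retraining round")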
To sum up, the invention belongs to the technical field of knowledge mining, and particularly relates to a knowledge mining method based on cross-model and cross-language knowledge modeling. In the implementation of the mining method, multi-language news knowledge is mined; the established knowledge models have both a certain degree of isolation and a certain degree of relevance; knowledge extraction is performed for corpora and knowledge models in different source languages by combining supervised and unsupervised knowledge extraction methods; the knowledge obtained during extraction is mapped to the knowledge models; a cross-language knowledge base is built on the basis of open-source encyclopedia libraries and open-source knowledge bases; and knowledge fusion and alignment are performed on the basis of cross-language links. Finally, cross-model and cross-language knowledge modeling and knowledge mining are completed, and the required cross-language news knowledge base is obtained.
The above description is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, several modifications and variations can be made without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as the protection scope of the present invention.

Claims (10)

1. A knowledge mining method based on cross-model and cross-language knowledge modeling is characterized by comprising the following steps:
step 1: respectively carrying out knowledge modeling aiming at different countries, wherein the knowledge modeling comprises multi-level entity concept modeling, entity attribute modeling and logic association relation modeling; the step 1 comprises the following specific steps;
step 11: respectively creating respective knowledge models for different countries in a knowledge modeling background system, respectively creating and naming the knowledge models corresponding to the different countries by the names of the countries, and then creating a multi-level entity concept constrained according to the membership relationship for each knowledge model; the multi-level entity concept comprises: a primary entity concept and a secondary entity concept; wherein the primary entity concept comprises: organization, people, places, weaponry; the second-level entity concepts are subsets of a certain first-level entity concept, and the two concepts have inclusion and contained relations in the membership relation;
step 12: creating, for each specific entity concept, its corresponding entity attributes; wherein, for an organization, its entity attributes include: establishment time, headquarters location, scale, number of people; for a person, its entity attributes include: age, gender, job title, place of birth, educational background; for a place, its entity attributes include: climate, latitude and longitude; for weaponry, its entity attributes include length, weight, radius of action, power;
step 13: in the whole knowledge model, for every two entity concepts capable of generating entity relations, defining the two entity concepts into a pair, and for each pair of entity concepts capable of generating entity relations, creating a logical association relation between the two entity concepts;
the logical association relationship comprises:
the relationship between a person and an organization includes "position held" and "member"; the relationship between a person and a place includes "place of birth" and "place of death"; the relationship between a person and a person includes "colleague" and "relative"; the relationship between weaponry and an organization includes "development unit"; the relationship between an organization and an organization includes "affiliation"; and the relationship between weaponry and weaponry includes "carried on" and "alias";
step 14: repeating the steps 11 to 13, and sequentially creating knowledge models of all countries involved in the research mission, multi-level entity concepts, entity attributes and logic association relations in the knowledge models of all countries, so as to form knowledge model data in a data form; grouping the knowledge model data according to countries, and storing the knowledge model data into a database table in a grouping mode;
step 2: regarding multi-level entity concepts, entity attributes and logic association relations in knowledge model data, taking the local official language of the knowledge model data as a source language, and modeling the knowledge model data by using the source language to form respective source language models of the multi-level entity concepts, the entity attributes and the logic association relations, namely a multi-level entity concept source language model, an entity attribute source language model and a logic association relation source language model;
then, according to the contrastive semantic relations between the source language and Chinese, establishing a multi-level entity concept Chinese model, an entity attribute Chinese model and a logical association relation Chinese model corresponding respectively to the multi-level entity concept source language model, the entity attribute source language model and the logical association relation source language model;
step 3: extracting knowledge from each news material;
the knowledge is extracted by combining a supervised knowledge extraction method and an unsupervised knowledge extraction method, and the step 3 comprises the following specific steps:
step 31: carrying out supervision training by manually marking data by adopting a supervision deep learning method to generate a supervision knowledge extraction model; the supervised knowledge extraction model is generated by training for each language respectively, and comprises the following steps: a Chinese supervision knowledge extraction model, an English supervision knowledge extraction model, a Japanese supervision knowledge extraction model and a Russian supervision knowledge extraction model;
step 32: defining a dictionary and rules according to a multi-level entity concept model, an entity attribute model and a logic association relation model established by a user;
for the multi-level entity concept model, when a user defines the entity concept "weaponry" in a knowledge model, data on aircraft, ships and missiles related to "weaponry" are sorted in advance to serve as a dictionary, where the dictionary is a mapping from specific entity names to entity concepts; once an entry in the dictionary is matched in a news material, it is considered to belong to the concept "weaponry";
meanwhile, a rule is defined: all entities ending in "warship" or "aircraft" are considered to belong to the concept "weaponry". For the entity attribute model, when the user defines the entity attribute "length" for the "weaponry" concept in the knowledge model, a rule is defined: once an entity under the "weaponry" concept and the keyword "length" are matched in a news material, the keyword "length" together with its corresponding numerical value is taken as an attribute of that "weaponry" entity;
for the logical association relation model, when the user defines the logical association relation "relatives" for the concept pair "person"-"person" in the knowledge model, a rule is defined: once the concept pair "person"-"person" and a keyword such as "father", "mother" or "relative" is matched in a news material, "relatives" is taken as the logical association relation of that "person"-"person" pair;
by analogy, defining all rules and dictionaries according to a multi-level entity concept model, an entity attribute model and a logic association relation model to form an unsupervised knowledge extraction model;
similarly, the rules and dictionaries are respectively generated for each language design, and comprise a Chinese rule dictionary, an English rule dictionary, a Japanese rule dictionary and a Russian rule dictionary, so that a Chinese unsupervised knowledge extraction model, an English unsupervised knowledge extraction model, a Japanese unsupervised knowledge extraction model and a Russian unsupervised knowledge extraction model are formed;
step 33: performing knowledge extraction on each news material, respectively calling a supervised knowledge extraction model and an unsupervised knowledge extraction model of the corresponding language according to the language of the news material, and fusing the returned results of the supervised knowledge extraction model and the unsupervised knowledge extraction model;
knowledge extraction encompasses three processes:
firstly, entity extraction is carried out, wherein a single news material is input, and all entity information contained in the material is output;
secondly, extracting attributes, inputting all entity information in the material, and outputting attribute information of each entity;
finally, extracting the relation, inputting all entity information in the material, and outputting logic association relation information between each group of entity pairs;
acquiring knowledge in news materials through the knowledge extraction process;
step 4: mapping the knowledge acquired in the knowledge extraction process to the knowledge models; since knowledge acquired from news materials in different languages is not necessarily mapped to the knowledge model of that language, a judgment must be made again according to the chapter-level and sentence-level semantic information of the news material, so as to complete the mapping between knowledge and knowledge models; for example, for Japanese domestic news reported in English, the extracted knowledge should be mapped to the Japanese knowledge model; for Indian domestic news reported in Japanese, the extracted knowledge should be mapped to the Indian knowledge model;
the step 4 comprises the following specific steps:
step 41: classifying each sentence in the news material according to a knowledge model, wherein the specific categories comprise a United states knowledge model, a Japanese knowledge model, an India knowledge model and a Russian knowledge model; if the credibility of the classification result is higher, the classification result is considered to be effective; mapping the knowledge extracted from the sentence to a knowledge model obtained by classification;
step 42: if the credibility of the sentence classification result is low, the sentence classification result is considered to be invalid; classifying the whole news material according to a knowledge model, wherein the specific categories comprise a United states knowledge model, a Japanese knowledge model, an India knowledge model and a Russian knowledge model; mapping the extracted knowledge of each sentence in the news material to a knowledge model obtained by classification;
step 43: through the mapping process of the above steps 41 and 42, the generated output is the combination of knowledge and knowledge model, i.e. knowledge base; obtaining a knowledge base in a source language form;
step 44: aiming at the knowledge base in the source language form, mapping the source language and the Chinese again according to the multi-level entity concept Chinese model, the entity attribute Chinese model and the logic association relation Chinese model which are respectively corresponding to the multi-level entity concept source language model, the entity attribute source language model and the logic association relation source language model in the step 2 to obtain the knowledge base in the Chinese form;
step 5: since the same knowledge often has different expression forms, the knowledge mapped to the knowledge models in the Chinese-form knowledge base needs to be further subjected to cross-language knowledge fusion;
the method comprises constructing a cross-language knowledge base based on open-source encyclopedia libraries and open-source knowledge bases, and realizing cross-language knowledge fusion based on the cross-language knowledge base; this step comprises the following specific sub-steps:
step 51, building a cross-language knowledge base by utilizing open-source encyclopedia libraries and open-source knowledge bases;
the open-source encyclopedia libraries and open-source knowledge bases are sorted and integrated to build a unified cross-language knowledge base; the cross-language knowledge base contains the aliases, attributes, descriptions and label information of the same entity in different language dimensions;
step 52, aligning the same knowledge in different languages by using a cross-language knowledge base to complete the fusion alignment of the knowledge;
therefore, cross-model, cross-language knowledge modeling and knowledge mining are finally completed, and a required cross-language news knowledge base is obtained.
2. The method of claim 1, wherein the step 11 of creating respective knowledge models for different countries comprises: the United States knowledge model, the Japanese knowledge model and the Indian knowledge model.
3. The knowledge mining method based on cross-model and cross-language knowledge modeling according to claim 1, characterized in that in the step 11, in the case of the U.S. knowledge model, first-level entity concepts are respectively created, including organization, people, places, weaponry;
second-level entity concepts are then created on the basis of the first-level entity concepts; under the first-level entity concept of organization, second-level entity concepts are created, including the U.S. Department of Defense, the American Petroleum Institute, the U.S. Senate, and the American Academy of Public Health.
4. The knowledge mining method based on cross-model and cross-language knowledge modeling according to claim 1, wherein in step 1, the knowledge models of different countries are isolated from one another at the display-interface, physical-storage, and database levels, so as to control the access permissions of different users or user groups.
5. The knowledge mining method based on cross-model and cross-language knowledge modeling according to claim 4, wherein in step 14, different database access permissions are granted to each group, so that the groups are isolated from one another and each user can only access the knowledge-model data corresponding to his or her own permissions.
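A minimal sketch of the permission isolation described in claims 4 and 5, assuming a plain mapping from user group to the knowledge-model databases opened to that group; the group names and the check function are illustrative, not the patented access-control mechanism.

```python
from typing import Dict, Set

# Assumed group -> accessible knowledge models mapping (claim 5: each group
# receives its own database access permissions, so groups stay isolated).
GROUP_PERMISSIONS: Dict[str, Set[str]] = {
    "group_us": {"US"},
    "group_japan": {"Japan"},
    "group_regional": {"India", "Russia"},
}

def can_access(group: str, knowledge_model: str) -> bool:
    """Return True only when the group's permissions cover the requested knowledge model."""
    return knowledge_model in GROUP_PERMISSIONS.get(group, set())

assert can_access("group_us", "US")
assert not can_access("group_us", "Japan")   # groups cannot reach each other's model data
```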
6. The knowledge mining method based on cross-model and cross-language knowledge modeling according to claim 1, wherein in step 32, the representation of the dictionary comprises: {Roosevelt aircraft carrier: weaponry}, {Edison: person}, {New York: location}.
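As a minimal rendering of the claim-6 dictionary, the entries can be held as a plain mapping from entity mention to entity concept; the claim does not prescribe a storage format, so the Python dict below is only one possible sketch, and the lookup helper is an assumption for illustration.

```python
# One possible in-memory form of the entity-type dictionary of claim 6.
ENTITY_DICTIONARY = {
    "Roosevelt aircraft carrier": "weaponry",
    "Edison": "person",
    "New York": "location",
}

def lookup_entity_type(mention: str, default: str = "unknown") -> str:
    """Return the entity concept recorded for a mention, or a default for unseen mentions."""
    return ENTITY_DICTIONARY.get(mention, default)
```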
7. The knowledge mining method based on cross-model and cross-language knowledge modeling according to claim 1, wherein in step 41, a high confidence of the classification result means a confidence greater than or equal to 0.7.
8. The knowledge mining method based on cross-model and cross-language knowledge modeling according to claim 1, wherein in step 42, a low confidence of the classification result means a confidence less than 0.7.
9. The knowledge mining method based on cross-model and cross-language knowledge modeling according to claim 1, wherein the open-source encyclopedia library refers to a public encyclopedia knowledge base comprising Wikipedia, Baidu Encyclopedia, and Interactive Encyclopedia.
10. The knowledge mining method based on cross-model and cross-language knowledge modeling according to claim 1, wherein the open-source knowledge base refers to a public knowledge base comprising Baklib and Raneo.
CN202111112651.6A 2021-09-23 2021-09-23 Knowledge mining method based on cross-model and cross-language knowledge modeling Pending CN113836265A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111112651.6A CN113836265A (en) 2021-09-23 2021-09-23 Knowledge mining method based on cross-model and cross-language knowledge modeling

Publications (1)

Publication Number Publication Date
CN113836265A 2021-12-24

Family

ID=78969351

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111112651.6A Pending CN113836265A (en) 2021-09-23 2021-09-23 Knowledge mining method based on cross-model and cross-language knowledge modeling

Country Status (1)

Country Link
CN (1) CN113836265A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105677913A (en) * 2016-02-29 2016-06-15 哈尔滨工业大学 Machine translation-based construction method for Chinese semantic knowledge base
CN111310480A (en) * 2020-01-20 2020-06-19 昆明理工大学 Weakly supervised Hanyue bilingual dictionary construction method based on English pivot
CN112163097A (en) * 2020-09-23 2021-01-01 中国电子科技集团公司第十五研究所 Military knowledge graph construction method and system
CN112199511A (en) * 2020-09-28 2021-01-08 西南电子技术研究所(中国电子科技集团公司第十研究所) Cross-language multi-source vertical domain knowledge graph construction method
CN112541087A (en) * 2020-12-18 2021-03-23 清华大学 Cross-language knowledge graph construction method and device based on encyclopedia
WO2021139239A1 (en) * 2020-07-28 2021-07-15 平安科技(深圳)有限公司 Mechanism entity extraction method, system and device based on multiple training targets



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination