CN116226390A

CN116226390A - Method, device and storage medium for constructing a digital battlefield knowledge graph ontology

Info

Publication number: CN116226390A
Application number: CN202211077373.XA
Authority: CN
Inventors: 黄文勋; 鲍首熙; 洪万福; 黄勇
Original assignee: Xiamen Yuanting Information Technology Co ltd
Current assignee: Xiamen Yuanting Information Technology Co ltd
Priority date: 2022-09-05
Filing date: 2022-09-05
Publication date: 2023-06-06

Abstract

The invention provides a method, a device and a storage medium for constructing a digital battlefield knowledge graph ontology. The method comprises the following steps: S1, forming a term dictionary of a selected field; S2, preprocessing an original corpus to obtain concept vocabulary; S3, determining core concept vocabulary by using a TF-IWF algorithm; S4, clustering similar concepts by using an ART network based on the core concept vocabulary to obtain cluster groups; S5, selecting the word with the highest word frequency in each cluster group as the ontology concept of that class; S6, clustering the cluster group again based on a finer classification threshold to obtain finer cluster groups, and repeating S5-S6 to obtain the hierarchical relationship of the clustered concepts; S7, serializing the ontology concepts and the hierarchical relationship of the concepts into an OWL file. With this technical scheme, domain documents can be semi-automatically converted into a computer-processable domain ontology, which effectively improves the efficiency of constructing a domain ontology for the digital battlefield.

Description

Method, device and storage medium for constructing a digital battlefield knowledge graph ontology

Technical Field

The present invention relates to the field of ontology construction technology, and in particular, to a method, an apparatus, and a storage medium for constructing a digital battlefield knowledge graph ontology.

Background

An ontology is a model of the concepts in a given field and the relations among those concepts, and it enables a formal description of the field. The digital battlefield ontology is a formalized description model of the concepts and their interrelationships in digital battlefield (military) domain knowledge. A domain ontology builds a unified set of cognitive concepts, overcomes the communication barriers among people, organizations and systems that arise from different backgrounds, languages and technologies, and enables domain knowledge to be shared and reused. Constructing a digital battlefield ontology is highly specialized and difficult, and because it relies on manual work by domain experts, its construction efficiency is low.

Existing ontology construction methods at home and abroad mainly include the TOVE method, the skeleton method, the IDEF5 method, the seven-step method and the like. However, no complete ontology construction engineering methodology has yet been formed, and there is no mature method aimed specifically at constructing ontologies from Chinese domain document knowledge.

Disclosure of Invention

The embodiments of the invention provide a method, device and storage medium for constructing a digital battlefield knowledge graph ontology, so as to improve the efficiency of ontology construction in the digital battlefield field.

To achieve the above object, in one aspect, a method for constructing a digital battlefield knowledge graph ontology is provided, including:

step S1, according to the field of the selected digital battlefield knowledge graph ontology, forming a term dictionary of the selected field by combining the collected documents and terms related to the selected field, and adding the term dictionary as a custom dictionary to a selected natural language processing tool;

step S2, preprocessing an original corpus by using the selected natural language processing tool and the custom dictionary to obtain the concept vocabulary related to the selected field in the original corpus, wherein the preprocessing comprises word segmentation, part-of-speech tagging and interference item removal, and the interference items comprise modal particles, prepositions and measure words;

step S3, calculating the domain weight of each concept word in the obtained concept vocabulary by using the term frequency-inverse word frequency (TF-IWF) algorithm, and determining the core concept vocabulary in the obtained concept vocabulary according to the calculated domain weights;

step S4, clustering similar concepts by using a recursive adaptive resonance theory (ART) network based on the core concept vocabulary to obtain a cluster group;

step S5, selecting the candidate word with the highest word frequency in the cluster group as the ontology concept representing the class corresponding to the cluster group, and removing the selected candidate word from the cluster group;

step S6, re-clustering the cluster group from which the candidate word has been removed, based on a selected finer classification threshold, to obtain cluster groups at a finer level of detail, then returning to step S5 and repeating steps S5-S6 until the clusters cannot be subdivided further, to obtain the hierarchical relationship of the clustered concepts;

step S7, serializing the obtained ontology concepts and the hierarchical relationship of the concepts into a computer-processable Web Ontology Language (OWL) file.

Preferably, the method, wherein step S7 includes:

using the API provided by the Jena semantic web framework to serialize the ontology concepts and the hierarchical relationship of the concepts into a computer-processable OWL file, in Resource Description Framework (RDF) form and OWL format.

Preferably, in step S3, the domain weight of the concept vocabulary is calculated using the following TF-IWF formula:

$$\mathrm{TF\text{-}IWF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IWF}_i = \frac{n_{i,j}}{\sum_k n_{k,j}} \times \log \frac{\sum_{i=1}^{m} nt_i}{nt_i}$$

wherein $\mathrm{TF}_{i,j}$ represents the term frequency of word $t_i$ in text $j$; $n_{i,j}$ represents the number of occurrences of word $t_i$ in text $j$; $\sum_k n_{k,j}$ represents the sum of the numbers of occurrences of all $k$ words in text $j$; $\mathrm{IWF}_i$ represents the inverse word frequency of word $t_i$ in a corpus containing $m$ words; $\sum_{i=1}^{m} nt_i$ represents the sum of the numbers of occurrences of all $m$ words in the corpus; and $nt_i$ represents the total number of occurrences of word $t_i$ in the corpus.

Preferably, the method, wherein step S3 includes:

step S31, inputting a document set $D = \{d_i,\ i = 1, 2, \ldots, N\}$ composed of the obtained concept words $d_i$, N being the number of concept words;

step S32, calculating the domain weight of each conceptual vocabulary according to the TF-IWF formula, namely TF-IWF value;

and step S33, traversing the TF-IWF value of each concept vocabulary, and extracting the concept vocabularies with the TF-IWF values larger than a preset threshold value as core concept vocabularies.

Preferably, the method wherein the selected natural language processing tool is HanLP.

Preferably, the method further includes, after step S7:

step S8, opening the OWL file with the Protégé tool, and visually managing the OWL ontology.

Preferably, in the method, the term dictionary is a Chinese term dictionary and the original corpus is a Chinese original corpus.

In another aspect, an apparatus for constructing a digital battlefield knowledge graph ontology is provided, comprising a memory and a processor, the memory storing at least one program, the at least one program being executable by the processor to implement any of the methods as described above.

In yet another aspect, a computer readable storage medium is provided, in which at least one program is stored, the at least one program being executed by a processor to implement any of the methods described above.

The technical scheme has the following technical effects:

according to the technical scheme, a natural language preprocessing tool is used to perform word segmentation and part-of-speech tagging on document knowledge in the digital battlefield field, core concepts are mined by a statistical algorithm, ontology concepts and concept hierarchy relations are extracted by a clustering algorithm, and the ontology concepts and concept hierarchy relations are then serialized into an OWL ontology file. Domain documents are thereby semi-automatically converted into a domain ontology with a computer-processable ontology structure, which improves the efficiency of constructing domain ontologies for the digital battlefield field;

in a further improvement, the OWL ontology can be visually managed with an ontology modeling tool, which makes it convenient to refine and correct the ontology.

Drawings

FIG. 1 is a flow chart of a method for constructing a digital battlefield knowledge graph ontology according to an embodiment of the present invention;

FIG. 2 is an example of the effect of word segmentation and part-of-speech tagging on an original corpus in a method according to an embodiment of the present invention;

FIG. 3 is a schematic structural diagram of an apparatus for constructing a digital battlefield knowledge graph ontology according to an embodiment of the present invention.

Detailed Description

To further illustrate the embodiments, the invention is provided with accompanying drawings. The drawings, which are incorporated in and constitute a part of this disclosure, illustrate the embodiments and, together with the description, serve to explain their operating principles. With reference to them, one of ordinary skill in the art will understand other possible embodiments and advantages of the present invention. The components in the figures are not drawn to scale, and like reference numerals are generally used to designate like components.

The invention will now be further described with reference to the drawings and detailed description.

Embodiment one:

FIG. 1 is a flow chart of a method for constructing a digital battlefield knowledge graph ontology according to an embodiment of the present invention. As shown in FIG. 1, the method of this embodiment includes the following steps:

Step S1, according to the field of the selected digital battlefield knowledge graph ontology, forming a term dictionary of the selected field by combining the collected documents and terms related to the selected field, and adding the term dictionary as a custom dictionary to a selected natural language processing tool.

The digital battlefield involves many fields, such as personnel, weapons, environments, tactics and facilities, and the related data are broad and complex. Analysis of digital battlefield data shows that it comprises: military personnel data, weaponry data, military facility data, battlefield environment data, and military tactics data. The domain scope and the data involved are first determined as needed.

The term dictionary of the selected field is obtained from the terms related to the selected field that are listed in professional military literature such as the "Chinese People's Liberation Army" terminology, the "National Defense Science and Technology Thesaurus" and the "National Defense Science and Technology Terminology Dictionary". The obtained term dictionary is added as a custom dictionary to the selected natural language processing tool, such as the HanLP tool, so that professional terms are not segmented incorrectly during subsequent word segmentation.
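As a minimal sketch of how term lists collected from several sources might be merged into one custom dictionary, assume a plain-text layout with one entry per line, the term followed by a part-of-speech tag. The layout, the default tag `n`, the function name `build_term_dictionary` and the sample terms are illustrative assumptions and are not specified by this disclosure.

```python
def build_term_dictionary(term_lists, default_tag="n"):
    """Merge several term lists into one sorted, de-duplicated dictionary."""
    terms = set()
    for term_list in term_lists:
        for term in term_list:
            term = term.strip()
            if term:  # skip blank entries
                terms.add(term)
    # One "term tag" line per entry; this layout is an assumed convention.
    return [f"{term} {default_tag}" for term in sorted(terms)]


# Hypothetical glossaries: early-warning aircraft, electronic warfare,
# cruise missile. The second list repeats one term on purpose.
glossary_a = ["预警机", "电子战 ", ""]
glossary_b = ["电子战", "巡航导弹"]
dictionary_lines = build_term_dictionary([glossary_a, glossary_b])
```

The resulting lines could then be written to whatever custom dictionary file the chosen tool loads; de-duplicating first avoids conflicting entries for the same term.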

Preferably, the embodiment of the invention carries out ontology construction aiming at Chinese corpus; wherein the term dictionary is a Chinese term dictionary, and the original corpus is a Chinese original corpus.

Step S2, preprocessing the original corpus by using the selected natural language processing tool and the custom dictionary to obtain the concept vocabulary related to the selected field in the original corpus, wherein the preprocessing comprises word segmentation, part-of-speech tagging and interference item removal, and the interference items comprise modal particles, prepositions, measure words and the like. FIG. 2 shows an example segment of an original corpus and the result of word segmentation and part-of-speech tagging on it. In a specific implementation, the original corpus can be preprocessed with existing word segmentation and part-of-speech tagging methods.
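The interference-removal part of step S2 can be sketched as a filter over tagger output. The sketch below assumes the tokens have already been segmented and tagged, and that modal particles, prepositions and measure words carry the tags `y`, `p` and `q` (a PKU-style tag set is assumed; the actual tags depend on the tool and model used). The sample sentence and the name `remove_interference` are hypothetical.

```python
# Tags assumed for the interference items: "y" modal particle,
# "p" preposition, "q" measure word.
INTERFERENCE_TAGS = {"y", "p", "q"}


def remove_interference(tagged_tokens, drop_tags=INTERFERENCE_TAGS):
    """Keep only the words whose tag is not an interference tag."""
    return [word for word, tag in tagged_tokens if tag not in drop_tags]


# Hypothetical tagger output for a short sentence
# ("deploy three drones at the position").
tagged = [("在", "p"), ("阵地", "n"), ("部署", "v"), ("三", "m"),
          ("架", "q"), ("无人机", "n"), ("吧", "y")]
concept_candidates = remove_interference(tagged)
```

Only the three categories named in the text are dropped here; any further filtering (for example, keeping nouns only) would be an additional design choice.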

Step S3, calculating the domain weight of each concept word in the obtained concept vocabulary by using the term frequency-inverse word frequency (TF-IWF) algorithm, and determining the core concept vocabulary in the obtained concept vocabulary according to the calculated domain weights; wherein the domain weights are characterized by TF-IWF values.

The TF-IWF (Term Frequency-Inverse Word Frequency) algorithm is a relatively well-established statistical method that improves on the TF-IDF algorithm. The term frequency TF (Term Frequency) is the number of times, i.e., the frequency, that a term (a concept word) appears in a document. The inverse word frequency IWF (Inverse Word Frequency) quantifies how a feature item is distributed across the document set. The main idea of the algorithm is as follows: if a word or phrase appears with a high frequency TF in a document and is rarely found in the whole corpus, i.e., its IWF is large, the word or phrase is considered to have good class-distinguishing capability and to be suitable for extraction as an important concept word. The calculation formula of the TF-IWF value is as follows:

$$\mathrm{TF\text{-}IWF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IWF}_i = \frac{n_{i,j}}{\sum_k n_{k,j}} \times \log \frac{\sum_{i=1}^{m} nt_i}{nt_i}$$

wherein $\mathrm{TF}_{i,j}$ represents the term frequency of word $t_i$ in text $j$; the numerator $n_{i,j}$ of the TF part represents the number of occurrences, i.e., the frequency, of word $t_i$ in text $j$, and $\sum_k n_{k,j}$ represents the sum of the frequencies of all words in text $j$. $\mathrm{IWF}_i$ represents the inverse word frequency of word $t_i$ in a corpus containing $m$ words, defined by comparing the total word frequency of the corpus with the number of times the word under examination appears in the corpus. In the fraction inside the logarithm of the IWF part, the numerator $\sum_{i=1}^{m} nt_i$ represents the sum of the frequencies of all words in the corpus (in this example, $m$ words), and the denominator $nt_i$ represents the total frequency of word $t_i$ in the corpus.
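As a hedged numerical illustration of the TF-IWF value, suppose word $t_i$ occurs 3 times in a text $j$ that contains 100 word occurrences, and 10 times in a corpus with 10,000 word occurrences in total. These counts are invented for easy arithmetic, and a base-10 logarithm is assumed because the text does not fix the base:

$$\mathrm{TF}_{i,j} = \frac{3}{100} = 0.03, \qquad \mathrm{IWF}_i = \log_{10}\frac{10000}{10} = 3, \qquad \mathrm{TF\text{-}IWF}_{i,j} = 0.03 \times 3 = 0.09$$

If the same word instead occurred 1,000 times in the corpus, $\mathrm{IWF}_i$ would drop to $\log_{10} 10 = 1$ and the weight to 0.03, which shows how words that are common across the corpus are down-weighted.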

Specifically, in step S3, the process of extracting domain core concept vocabulary using TF-IWF algorithm includes the following steps:

S31, inputting the preprocessed document set $D = \{d_i,\ i = 1, 2, \ldots, N\}$; in this example a Chinese domain corpus is processed, so the preprocessed document set is a Chinese domain document set D composed of the concept words $d_i$ obtained in the above steps, where N is the number of concept words and is a natural number;

S32, calculating the domain weight, i.e., the TF-IWF value, of each concept word according to the TF-IWF formula;

S33, traversing the TF-IWF value of each concept word and judging whether it is larger than a preset threshold; if so, extracting the concept word as a core concept word of the processed original corpus.
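Steps S31 to S33 can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the implementation of this disclosure: documents are taken to be lists of already-preprocessed words, the logarithm is base 10, a word counts as a core concept if its weight exceeds the threshold in at least one document, and the names `tf_iwf` and `core_concepts`, the sample data and the threshold 0.15 are all hypothetical.

```python
import math
from collections import Counter


def tf_iwf(documents):
    """Return {(word, doc_index): TF-IWF value} for tokenized documents."""
    corpus_counts = Counter(w for doc in documents for w in doc)  # nt_i
    corpus_total = sum(corpus_counts.values())  # sum of nt_i over all words
    weights = {}
    for j, doc in enumerate(documents):
        doc_counts = Counter(doc)  # n_{i,j}
        doc_total = len(doc)       # sum over k of n_{k,j}
        for word, n_ij in doc_counts.items():
            tf = n_ij / doc_total
            iwf = math.log10(corpus_total / corpus_counts[word])
            weights[(word, j)] = tf * iwf
    return weights


def core_concepts(documents, threshold):
    """S33: keep the words whose weight exceeds the preset threshold."""
    weights = tf_iwf(documents)
    return sorted({w for (w, _), value in weights.items() if value > threshold})


# Toy corpus: "unit" is frequent everywhere, so its weight stays low.
docs = [["radar", "radar", "unit", "unit"],
        ["unit", "unit", "unit", "unit", "missile", "unit"]]
weights = tf_iwf(docs)
core = core_concepts(docs, threshold=0.15)
```

In this toy corpus "unit" is the most frequent word overall, yet it falls below the threshold because its inverse word frequency is small, while the rarer "radar" and "missile" are retained.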

Step S4, clustering similar concepts by using a recursive adaptive resonance theory ART (Adaptive Resonance Theory) network based on the obtained core concept vocabulary, so that similar concept words are aggregated into the same group, to obtain cluster groups. Specifically, within each cluster group a word-frequency method is used to determine one concept word that represents the group; once that word is selected as an ontology concept, it is removed from the cluster group, and the group is then clustered at a finer level to obtain the cluster groups of the next concept level. All cluster groups can be obtained in this way. See steps S5 and S6 below for details.

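Step S5, selecting the candidate word with the highest word frequency in the cluster group as the ontology concept representing the class corresponding to the cluster group, and removing the selected candidate word from the cluster group.

Step S6, re-clustering the cluster group from which the candidate word has been removed, based on a selected finer classification threshold, to obtain cluster groups at a finer level of detail, then returning to step S5 and repeating steps S5-S6 until the clusters cannot be subdivided further, to obtain the hierarchical relationship of the clustered concepts.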

For example, if a total of 100 conceptual words are obtained after the analysis and preprocessing of the original corpus, 50 of the conceptual words have higher similarity, they will be aggregated into a group. Performing word frequency statistics on the 50 concept words, and taking out a word with the highest word frequency as an ontology concept representing the category; then, the word is removed from the concept set obtained by the clustering, and clustering is performed on the remaining 49 concept words based on a predetermined finer classification threshold, and if the clustering again results in that 20 concept words with high similarity are clustered into one set and 29 concept words are clustered into another set, the cluster set of 50 concept words thus contains the cluster sets of 20 and 29 concept words on the concept level, thereby obtaining the hierarchical relationship between the concept words.
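The loop of steps S4 to S6 can be sketched as a recursion. The sketch below does not implement an ART network: it substitutes a much simpler leader-style clustering in which a vigilance-like similarity threshold decides whether a word joins an existing group, and it should be read only as an illustration of the control flow. The non-zero feature vectors, word frequencies, threshold values, stopping rule (recursion ends when a group is empty after its representative is removed) and all names are assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def cluster(words, vectors, vigilance):
    """Leader clustering: join the most similar group whose first member
    (the leader) is at least `vigilance` similar, else open a new group."""
    groups = []
    for word in words:
        best, best_sim = None, vigilance
        for group in groups:
            sim = cosine(vectors[word], vectors[group[0]])
            if sim >= best_sim:
                best, best_sim = group, sim
        if best is None:
            groups.append([word])
        else:
            best.append(word)
    return groups


def build_hierarchy(words, vectors, freqs, vigilance, step):
    """S5: pick the most frequent word of each group as its concept.
    S6: re-cluster the remainder with a stricter (finer) threshold."""
    nodes = []
    for group in cluster(words, vectors, vigilance):
        concept = max(group, key=lambda w: freqs[w])
        rest = [w for w in group if w != concept]
        finer = min(vigilance + step, 1.0)
        children = build_hierarchy(rest, vectors, freqs, finer, step) if rest else []
        nodes.append({"concept": concept, "children": children})
    return nodes


# Hypothetical two-dimensional feature vectors and word frequencies.
vectors = {"aircraft": (1.0, 0.1), "fighter": (1.0, 0.0),
           "bomber": (1.0, 0.3), "tank": (0.1, 1.0), "armor": (0.0, 1.0)}
freqs = {"aircraft": 9, "fighter": 5, "bomber": 4, "tank": 7, "armor": 3}
hierarchy = build_hierarchy(list(vectors), vectors, freqs, vigilance=0.8, step=0.17)
```

With these toy values the first pass yields two groups, represented by "aircraft" and "tank"; at the finer threshold "fighter" and "bomber" separate and become sibling sub-concepts of "aircraft", while "armor" becomes a sub-concept of "tank".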

Step S7, serializing the obtained ontology concepts and the hierarchical relationship of the concepts into a computer-processable Web Ontology Language (OWL) file.

Specifically, the API provided by the Jena semantic web framework may be used to serialize the ontology concepts and the concept hierarchy into a computer-processable OWL file, in Resource Description Framework RDF (Resource Description Framework) form and OWL format. The construction of the OWL ontology is thereby realized.
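To make the output of this step concrete, the sketch below writes a concept hierarchy as OWL classes linked by `rdfs:subClassOf`, using the Turtle syntax of RDF. It uses plain string formatting in place of the Jena API named in the text, purely for illustration; the base IRI `http://example.org/battlefield#`, the nested-dictionary input shape and the function name `to_owl_turtle` are assumptions, and concept names are assumed to be valid Turtle local names (no spaces or punctuation).

```python
def to_owl_turtle(hierarchy, base="http://example.org/battlefield#"):
    """Serialize nested {"concept", "children"} nodes as OWL classes in Turtle."""
    lines = [
        f"@prefix : <{base}> .",
        "@prefix owl: <http://www.w3.org/2002/07/owl#> .",
        "@prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .",
        "",
        f"<{base.rstrip('#')}> a owl:Ontology .",
    ]

    def visit(node, parent=None):
        name = node["concept"]
        if parent is None:
            # Top-level concept: declared as a class with no named superclass.
            lines.append(f":{name} a owl:Class .")
        else:
            # Lower-level concept: declared as a subclass of its parent.
            lines.append(f":{name} a owl:Class ; rdfs:subClassOf :{parent} .")
        for child in node["children"]:
            visit(child, name)

    for root in hierarchy:
        visit(root)
    return "\n".join(lines) + "\n"


# Hypothetical hierarchy with one top-level concept and two sub-concepts.
sample = [{"concept": "Aircraft", "children": [
    {"concept": "Fighter", "children": []},
    {"concept": "Bomber", "children": []},
]}]
owl_text = to_owl_turtle(sample)
```

Saving `owl_text` to a file with a `.ttl` or `.owl` extension gives a document that ontology editors can generally load; a production pipeline would more likely rely on an RDF library to handle escaping and IRI validation.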

After step S7, the method further comprises:

Step S8, opening the OWL file with the Protégé tool, and visually managing the constructed OWL file, i.e., the OWL ontology. Domain experts can refine and correct the constructed OWL ontology through this visual process.

Based on the above description, it can be seen that: the technical scheme of the embodiment of the invention realizes semi-automatic conversion of the domain document into the domain ontology which can be processed by a computer by using the natural language processing, statistical method and clustering algorithm, and effectively improves the efficiency of constructing the domain ontology of the digital battlefield.

Embodiment two:

The present invention also provides an apparatus for constructing a digital battlefield knowledge graph ontology. As shown in FIG. 3, the apparatus comprises a processor 301, a memory 302, a bus 303, and a computer program stored in the memory 302 and capable of running on the processor 301, wherein the processor 301 comprises one or more processing cores, the memory 302 is connected with the processor 301 through the bus 303, and the memory 302 is used for storing program instructions. The steps of the above method embodiment of the present invention are implemented when the processor executes the computer program.

Further, as an executable scheme, the device for constructing the digital battlefield knowledge graph ontology may be a computer unit, and the computer unit may be a computing device such as a desktop computer, a notebook computer, a palmtop computer or a cloud server. The computer unit may include, but is not limited to, a processor and a memory. It will be appreciated by those skilled in the art that the structure described above is merely an example of the computer unit and is not limiting; the computer unit may include more or fewer components than those described above, may combine certain components, or may use different components. For example, the computer unit may further include an input/output device, a network access device, a bus and the like, which is not limited by the embodiment of the present invention.

Further, as an implementation, the processor may be a central processing unit (Central Processing Unit, CPU), another general purpose processor, a digital signal processor (Digital Signal Processor, DSP), an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), a field-programmable gate array (Field-Programmable Gate Array, FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, etc. The general purpose processor may be a microprocessor or any conventional processor. The processor is the control center of the computer unit and connects the various parts of the entire computer unit using various interfaces and lines.

The memory may be used to store the computer program and/or modules, and the processor implements the various functions of the computer unit by running or executing the computer program and/or modules stored in the memory and by invoking data stored in the memory. The memory may mainly include a program storage area and a data storage area, wherein the program storage area may store an operating system and at least one application program required for a function, and the data storage area may store data created during use of the device, etc. In addition, the memory may include high-speed random access memory, and may also include non-volatile memory, such as a hard disk, internal memory, a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, a flash card, at least one disk storage device, a flash memory device, or other volatile solid-state storage device.

Embodiment three:

the present invention also provides a computer readable storage medium storing a computer program which when executed by a processor implements the steps of the above-described method of an embodiment of the present invention.

The modules/units integrated in the computer unit may be stored in a computer readable storage medium if they are implemented in the form of software functional units and sold or used as separate products. Based on this understanding, all or part of the flow of the method of the above embodiment may also be implemented by a computer program instructing related hardware; the computer program may be stored in a computer readable storage medium and, when executed by a processor, implements the steps of each of the method embodiments described above. The computer program comprises computer program code, which may be in source code form, object code form, an executable file or some intermediate form, etc. The computer readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a software distribution medium, and so forth. It should be noted that the content of the computer readable medium can be appropriately increased or decreased according to the requirements of legislation and patent practice in the jurisdiction.

While the invention has been particularly shown and described with reference to a preferred embodiment, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. A method of constructing a digital battlefield knowledge graph ontology, comprising:

step S1, according to the field of the selected digital battlefield knowledge graph ontology, forming a term dictionary of the selected field by combining the collected documents and terms related to the selected field, and adding the term dictionary as a custom dictionary to a selected natural language processing tool;

step S2, preprocessing an original corpus by using the selected natural language processing tool and the custom dictionary to obtain the concept vocabulary related to the selected field in the original corpus, wherein the preprocessing comprises word segmentation, part-of-speech tagging and interference item removal, and the interference items comprise modal particles, prepositions and measure words;

step S4, clustering similar concepts by using a recursive adaptive resonance theory ART network based on the core concept vocabulary to obtain a cluster group;

2. The method according to claim 1, wherein the step S7 includes:

3. The method according to claim 1, wherein in the step S3, the following TF-IWF formula is used to calculate the domain weights of the concept vocabulary:

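$$\mathrm{TF\text{-}IWF}_{i,j} = \mathrm{TF}_{i,j} \times \mathrm{IWF}_i = \frac{n_{i,j}}{\sum_k n_{k,j}} \times \log \frac{\sum_{i=1}^{m} nt_i}{nt_i}$$

wherein $\mathrm{TF}_{i,j}$ represents the term frequency of word $t_i$ in text $j$; $n_{i,j}$ represents the number of occurrences of word $t_i$ in text $j$; $\sum_k n_{k,j}$ represents the sum of the numbers of occurrences of all $k$ words in text $j$; $\mathrm{IWF}_i$ represents the inverse word frequency of word $t_i$ in a corpus containing $m$ words; $\sum_{i=1}^{m} nt_i$ represents the sum of the numbers of occurrences of all $m$ words in the corpus; and $nt_i$ represents the total number of occurrences of word $t_i$ in the corpus.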

4. A method according to claim 3, wherein said step S3 comprises:

step S32, calculating the domain weight of each concept vocabulary, namely TF-IWF value, according to the TF-IWF formula;

5. The method of claim 1, wherein the selected natural language processing tool is HanLP.

6. The method according to claim 1, wherein after the step S7, further comprises:

7. The method of claim 1, wherein the term dictionary is a chinese term dictionary and the raw corpus is a chinese raw corpus.

8. An apparatus for constructing a digital battlefield knowledge graph ontology, comprising a memory and a processor, the memory storing at least one program, the at least one program being executable by the processor to implement the method of any one of claims 1-7.

9. A computer readable storage medium, characterized in that at least one program is stored in the storage medium, the at least one program being executed by a processor to implement the method of any one of claims 1 to 7.