CN116450856B

CN116450856B - Meteorological ocean unstructured text knowledge construction method and device and electronic equipment

Info

Publication number: CN116450856B
Application number: CN202310722007.3A
Authority: CN
Inventors: 徐焱; 王宇翔; 孙万有; 何思远
Original assignee: Aerospace Hongtu Information Technology Co Ltd
Current assignee: Aerospace Hongtu Information Technology Co Ltd
Priority date: 2023-06-19
Filing date: 2023-06-19
Publication date: 2023-09-12
Anticipated expiration: 2043-06-19
Also published as: CN116450856A

Abstract

The invention provides a method, a device, and electronic equipment for constructing knowledge from meteorological ocean unstructured text. The method comprises the following steps: acquiring a weather marine unstructured text set to be constructed; determining a target core concept of the weather marine environment field according to the weather marine unstructured text set; performing knowledge entity extraction on the weather marine unstructured text set based on the target core concept to determine target knowledge entities; identifying, through a relation identification model, entity relationships between the target knowledge entities based on the weather marine unstructured text set and the target knowledge entities; and constructing a knowledge graph of the weather marine environment field based on the target knowledge entities and the entity relationships between them. The invention can intelligently extract key knowledge information from large volumes of complex unstructured text and construct graph relationships, thereby improving the ability to acquire target information and the retrieval speed, and facilitating knowledge sharing in the weather marine environment field so that weather marine environment knowledge can be studied more comprehensively.

Description

Meteorological ocean unstructured text knowledge construction method and device and electronic equipment

Technical Field

The invention relates to the technical field of knowledge graphs, and in particular to a method, a device, and electronic equipment for constructing knowledge from meteorological ocean unstructured text.

Background

With the development of science and technology, knowledge graphs are applied more and more widely. A knowledge graph has strong data description capability, provides a technical basis for intelligent information applications, and can present structured knowledge to a user graphically. However, weather marine environment knowledge is unstructured, multi-source and heterogeneous, and complex in both its spatiotemporal and semantic aspects, and no mature and comprehensive knowledge graph application currently exists for the weather marine environment field. As a result, retrieval of relevant knowledge in this field is slow, knowledge sharing is hindered, and weather marine environment knowledge cannot be studied comprehensively.

Disclosure of Invention

In view of the above, the invention aims to provide a method, a device, and electronic equipment for constructing weather marine unstructured text knowledge, which can intelligently extract key knowledge information from large volumes of complex unstructured text and construct graph relationships, thereby improving the ability to acquire target information and the retrieval speed, and facilitating knowledge sharing in the weather marine environment field so that weather marine environment knowledge can be studied more comprehensively.

In a first aspect, an embodiment of the present invention provides a method for constructing unstructured text knowledge of a meteorological ocean, including:

acquiring a weather marine unstructured text set to be constructed;

determining a target core concept of the weather marine environment field according to the weather marine unstructured text set;

extracting knowledge entities from the weather marine unstructured text set based on the target core concept to determine target knowledge entities;

identifying entity relationships between the target knowledge entities based on the weather marine unstructured text set and the target knowledge entities through a pre-trained relationship identification model;

and constructing a knowledge graph of the weather marine environment field based on the target knowledge entities and the entity relationships between the target knowledge entities.

In one embodiment, determining target core concepts of a weather marine environment field from the weather marine unstructured text set comprises:

dividing the weather marine unstructured text set into unstructured text subsets corresponding to each sub-field according to a plurality of sub-fields in the weather marine environment field;

acquiring an initial core concept based on the unstructured text subsets corresponding to each sub-field; the initial core concept is obtained by performing expert preliminary extraction and expert cross extraction on the unstructured text subsets corresponding to each sub-field;

crawling the explanation text in the target explanation page matched with each initial core concept;

performing word segmentation processing on each explanation text to obtain a first word segmentation data set, and determining a first word frequency corresponding to each first word in the first word segmentation data set;

and if the first word frequency corresponding to the first word is greater than a preset word frequency threshold, supplementing the first word into the initial core concept to obtain a target core concept in the weather marine environment field.
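The supplementing step above can be illustrated with a minimal Python sketch. It assumes the explanation texts have already been crawled and segmented into token lists; the function name `supplement_core_concepts`, the sample tokens, and the threshold value are hypothetical and not taken from the patent.

```python
from collections import Counter

def supplement_core_concepts(initial_concepts, segmented_texts, freq_threshold):
    """Append every token whose frequency exceeds the threshold to the concepts."""
    # Count each token across all segmented explanation texts
    counts = Counter(token for tokens in segmented_texts for token in tokens)
    target = list(initial_concepts)
    for token, freq in counts.items():
        # Strictly greater than the threshold, as stated in the step above
        if freq > freq_threshold and token not in target:
            target.append(token)
    return target

# Hypothetical, already-segmented explanation texts (illustrative only)
segmented = [
    ["sea fog", "visibility", "advection", "visibility"],
    ["sea fog", "visibility", "humidity"],
]
target_concepts = supplement_core_concepts(["sea fog"], segmented, freq_threshold=2)
print(target_concepts)
```

In this example only "visibility" occurs more than twice, so it is the only word added to the initial concept list.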

In one embodiment, knowledge entity extraction of the weather-ocean unstructured text set based on the target core concept to determine a target knowledge entity comprises:

taking the target core concept as a custom dictionary, and performing word segmentation on the weather marine unstructured text set to obtain a second word segmentation data set; the second word segmentation data set comprises a word segmentation list and a syntactic relation, wherein the syntactic relation comprises at least one predicate and a plurality of arguments corresponding to each predicate;

the word segmentation list is used as a trigger word matching data source, the syntactic relation is used as a trigger word matching rule, and knowledge entity extraction is carried out on the weather marine unstructured text set to determine an initial knowledge entity;

and screening the initial knowledge entity to obtain a target knowledge entity.

In one embodiment, the step of extracting the knowledge entity from the weather marine unstructured text set to determine an initial knowledge entity by using the word segmentation list as a trigger word matching data source and the syntactic relation as a trigger word matching rule includes:

for each word segmentation list, if clause information in the word segmentation list contains the target core concept, determining a first target predicate to which the target core concept belongs from the syntactic relation matched with the word segmentation list;

determining each argument corresponding to the first target predicate as a first correlation argument corresponding to the target core concept, and storing the first correlation argument into a first-order knowledge word set;

for each word segmentation list, if the clause information in the word segmentation list contains first-order knowledge words in the first-order knowledge word set, determining a second target predicate to which the first-order knowledge words belong from the syntactic relation matched with the word segmentation list;

determining each argument corresponding to the second target predicate as a second correlation argument corresponding to the first-order knowledge vocabulary, and storing the second correlation argument into a second-order knowledge vocabulary set;

and performing de-duplication processing on the target core concept, the first-order knowledge vocabulary and the second-order knowledge vocabulary to obtain an initial knowledge entity.
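The two-pass expansion described above can be sketched as follows. This is an illustrative sketch only: it assumes the syntactic relation of each clause is available as a list of `(predicate, arguments)` pairs, and the function names and sample clauses are hypothetical.

```python
def expand_by_predicates(seeds, clauses):
    """Collect the arguments of every predicate whose argument list contains a seed."""
    found = []
    for clause in clauses:
        for predicate, arguments in clause:
            if any(seed in arguments for seed in seeds):
                # Keep the co-occurring arguments as related knowledge words
                found.extend(arg for arg in arguments if arg not in seeds)
    return found

def extract_initial_entities(core_concepts, clauses):
    # First pass: arguments sharing a predicate with a target core concept
    first_order = expand_by_predicates(core_concepts, clauses)
    # Second pass: arguments sharing a predicate with a first-order word
    second_order = expand_by_predicates(first_order, clauses)
    # De-duplicate while preserving first-seen order
    return list(dict.fromkeys(core_concepts + first_order + second_order))

# Hypothetical predicate-argument structures, one list per clause
clauses = [
    [("reduces", ["sea fog", "visibility"])],
    [("affects", ["visibility", "ship navigation"])],
    [("drives", ["wind", "ocean current"])],
]
initial_entities = extract_initial_entities(["sea fog"], clauses)
print(initial_entities)
```

Here "visibility" is reached in the first pass and "ship navigation" in the second, while the unrelated third clause contributes nothing.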

In one embodiment, storing the first correlation argument into a first-order knowledge word set comprises:

performing word segmentation processing on each first correlation argument corresponding to the first target predicate;

if the word segmentation processing is successful, storing the nouns in the first correlation argument into the first-order knowledge word set;

and if the word segmentation process is unsuccessful, storing the first correlation argument into the first-order knowledge word set.

In one embodiment, the screening the initial knowledge entity to obtain the target knowledge entity includes:

for each second word in the initial knowledge entity, determining a second word frequency of the second word in the weather marine unstructured text set;

determining the total number of texts in the weather marine unstructured text set, determining the number of texts in the weather marine unstructured text set that contain the second word, and determining the logarithm of the ratio of the total number of texts to the number of texts containing the second word as the inverse document frequency of the second word;

determining the product of the second word frequency of the second word and the inverse document frequency as the word segmentation importance of the second word;

and if the word segmentation importance is greater than a preset importance threshold, determining the second word as a target knowledge entity.
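The screening above is a TF-IDF style filter. A minimal sketch follows, under several assumptions that the text does not fix: each text is a list of tokens, the word frequency is the raw count over the whole set, the logarithm is the natural logarithm, and the threshold value is arbitrary. The function name `tfidf_filter` is hypothetical.

```python
import math

def tfidf_filter(candidates, documents, threshold):
    """Keep candidates whose (word frequency x inverse document frequency) exceeds the threshold."""
    total_docs = len(documents)
    kept = []
    for word in candidates:
        # Word frequency: occurrences of the word across the whole text set
        tf = sum(doc.count(word) for doc in documents)
        # Number of texts that contain the word at least once
        docs_with_word = sum(1 for doc in documents if word in doc)
        if docs_with_word == 0:
            continue  # the word never occurs, so it cannot be important
        # Inverse document frequency: log of (total texts / texts containing the word)
        idf = math.log(total_docs / docs_with_word)
        if tf * idf > threshold:
            kept.append(word)
    return kept

# Hypothetical tokenized texts (illustrative only)
docs = [
    ["sea fog", "visibility", "sea fog"],
    ["tide", "visibility"],
    ["tide", "visibility", "current"],
]
target_entities = tfidf_filter(["sea fog", "visibility", "tide"], docs, threshold=1.0)
print(target_entities)
```

"visibility" appears in every text, so its inverse document frequency is zero and it is filtered out, whereas "sea fog" is frequent but concentrated in one text and is kept.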

In one embodiment, the relationship identification model comprises an entity input layer, a text selection layer, a feature extraction layer, a relationship identification layer and a relationship output layer;

identifying, by a pre-trained relationship identification model, a target entity relationship between the target knowledge entities based on the weather marine unstructured text set and the target knowledge entities, comprising:

receiving, by the entity input layer, the weather-ocean unstructured text set and the target knowledge entity;

screening target unstructured text matched with the target knowledge entity from the weather marine unstructured text set through the text selection layer;

extracting a forward feature vector and a backward feature vector of the target unstructured text through the feature extraction layer, and fusing the forward feature vector and the backward feature vector to obtain a target feature vector;

determining, by the relationship identification layer, a probability value for each candidate entity relationship relative to the target knowledge entity based on the target feature vector;

and determining a target entity relationship from the candidate entity relationships based on the probability value through the relationship output layer.
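The non-learned parts of this layered pipeline can be sketched in a few lines. The sketch below is an assumption-laden illustration, not the trained model: it shows one possible text selection rule (keep texts mentioning both entities), fusion by concatenation (a common choice that the text does not specify), and an arg-max output layer. The feature extraction and relation identification layers are not implemented; the probability values are made up for the example.

```python
def select_texts(texts, head, tail):
    """Text selection layer: keep the texts that mention both entities."""
    return [text for text in texts if head in text and tail in text]

def fuse(forward_vec, backward_vec):
    """Fuse forward and backward feature vectors by concatenation (assumed here)."""
    return list(forward_vec) + list(backward_vec)

def pick_relation(probabilities):
    """Relation output layer: return the candidate relation with the highest probability."""
    return max(probabilities, key=probabilities.get)

# Hypothetical inputs (illustrative only)
texts = ["sea fog reduces visibility", "tides follow the moon"]
selected = select_texts(texts, "sea fog", "visibility")
fused = fuse([0.1, 0.2], [0.3, 0.4])
# Placeholder probabilities standing in for the relation identification layer
relation = pick_relation({"causes": 0.7, "located_in": 0.1, "none": 0.2})
print(selected, fused, relation)
```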

In a second aspect, an embodiment of the present invention further provides a weather marine unstructured text knowledge construction device, including:

the text acquisition module is used for acquiring a weather marine unstructured text set to be constructed;

the concept determining module is used for determining a target core concept in the field of the weather marine environment according to the weather marine unstructured text set;

the entity determining module is used for extracting the knowledge entity from the weather marine unstructured text set based on the target core concept so as to determine a target knowledge entity;

the relationship determination module is used for identifying entity relationships between the target knowledge entities based on the weather marine unstructured text set and the target knowledge entities through a pre-trained relationship identification model;

and the graph construction module is used for constructing a knowledge graph of the weather marine environment field based on the target knowledge entities and the entity relationships between the target knowledge entities.

In a third aspect, an embodiment of the present invention further provides an electronic device comprising a processor and a memory storing computer-executable instructions executable by the processor to implement the method of any one of the first aspects.

In a fourth aspect, embodiments of the present invention also provide a computer-readable storage medium storing computer-executable instructions which, when invoked and executed by a processor, cause the processor to implement the method of any one of the first aspects.

According to the method, the device, and the electronic equipment for constructing weather marine unstructured text knowledge provided by the embodiments of the invention, after the weather marine unstructured text set to be constructed is acquired, the target core concept of the weather marine environment field is determined according to the weather marine unstructured text set; knowledge entity extraction is then performed on the weather marine unstructured text set based on the target core concept to determine the target knowledge entities; the entity relationships between the target knowledge entities are then identified, through the pre-trained relationship identification model, based on the weather marine unstructured text set and the target knowledge entities; and finally the knowledge graph of the weather marine environment field is constructed based on the target knowledge entities and the entity relationships between them. In this method, the target core concept guides the construction direction and main content of the weather marine environment knowledge graph. Because the target core concept is highly specialized and accurate, the target knowledge entities determined from it are more scientific and accurate. The relationship identification model then identifies the entity relationships among the target knowledge entities, and the knowledge graph of the weather marine environment field is constructed from them. The knowledge graph displays the knowledge structure and relationships of the weather marine environment field comprehensively and clearly. Key knowledge information is thus intelligently extracted from large volumes of complex unstructured text and organized into graph relationships, which improves the ability to acquire target information and the retrieval speed, facilitates knowledge sharing in the weather marine environment field, and allows weather marine environment knowledge to be studied more comprehensively.

Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.

In order to make the above objects, features and advantages of the present invention more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are needed in the description of the embodiments or the prior art will be briefly described, and it is obvious that the drawings in the description below are some embodiments of the present invention, and other drawings can be obtained according to the drawings without inventive effort for a person skilled in the art.

FIG. 1 is a schematic flow chart of a method for constructing unstructured text knowledge of a meteorological ocean, which is provided by the embodiment of the invention;

FIG. 2 is a schematic illustration of a weather marine unstructured text set provided by an embodiment of the present invention;

FIG. 3 is a flow chart of extracting a target knowledge entity according to an embodiment of the present invention;

FIG. 4 is a schematic diagram of a relationship recognition model according to an embodiment of the present invention;

FIG. 5 is a schematic flow chart of another method for constructing unstructured text knowledge of a meteorological ocean according to an embodiment of the present invention;

FIG. 6 is a diagram of a visual display sample provided by an embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a weather marine unstructured text knowledge construction device according to an embodiment of the present invention;

FIG. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the present invention will be clearly and completely described in conjunction with the embodiments, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

At present, with the development of science and technology, knowledge graphs are applied more and more widely. A knowledge graph has strong data description capability, provides a technical basis for intelligent information applications, and can present structured knowledge to a user graphically. However, in the field of weather marine intelligence specifically, no mature and comprehensive knowledge graph application currently exists, so that retrieval of relevant knowledge in the weather marine environment field is slow, knowledge sharing in the field is hindered, and weather marine environment knowledge cannot be studied comprehensively. On this basis, the invention provides a method, a device, and electronic equipment for constructing weather marine unstructured text knowledge, which can intelligently extract key knowledge information from large volumes of complex unstructured text and construct graph relationships, thereby improving the ability to acquire target information and the retrieval speed, facilitating knowledge sharing in the weather marine environment field, and allowing weather marine environment knowledge to be studied more comprehensively.

To facilitate understanding of this embodiment, a method for constructing weather marine unstructured text knowledge disclosed in an embodiment of the present invention is first described in detail. Referring to the schematic flow chart shown in FIG. 1, the method mainly includes steps S102 to S110:

Step S102, acquiring a weather marine unstructured text set to be constructed. The weather marine unstructured text set, that is, the weather marine environment data, may include weather reports, meteorological and oceanographic literature, patents, technical reports, and the like.

In one embodiment, after the weather marine unstructured text set is acquired, it may undergo data cleaning and preprocessing as well as data fusion and integration, with the spatiotemporal coupling of the data taken into account and its semantic complexity addressed. In this way the characteristics of the weather marine unstructured text set are fully understood and exploited, so that these information resources can be better mined and used in the subsequent knowledge graph construction.

Step S104, determining a target core concept of the weather marine environment field according to the weather marine unstructured text set. The target core concept describes the basic concepts, basic laws, and basic principles of the weather marine field, and thereby guides the construction direction and main content of the weather marine environment knowledge graph.

In one embodiment, knowledge classification may be performed on the weather marine unstructured text set to obtain the unstructured text subset corresponding to each sub-field, initial core concepts are obtained on the basis of these subsets, and the target core concepts of the weather marine environment field are then determined by crawling the relevant explanation texts to supplement the initial core concepts.

Step S106, knowledge entity extraction is carried out on the weather marine unstructured text set based on the target core concept to determine a target knowledge entity. In one embodiment, the target core concept is used as a custom dictionary, word segmentation is performed on the weather marine unstructured text set, then knowledge entity extraction is performed on the weather marine unstructured text set according to the word segmentation result to obtain an initial knowledge entity, and the initial knowledge entity is screened to obtain the target knowledge entity.

Step S108, identifying, through a pre-trained relationship recognition model, the entity relationships between the target knowledge entities based on the weather marine unstructured text set and the target knowledge entities. The relationship recognition model may be built from a bidirectional long short-term memory (Bi-LSTM) neural network model and a conditional random field (CRF) model.

In one embodiment, the inputs of the relationship recognition model are the weather marine unstructured text set and the target knowledge entities, and the output is the entity relationships between the target knowledge entities.

Step S110, constructing a knowledge graph of the weather marine environment field based on the target knowledge entities and the entity relationships between the target knowledge entities. In one embodiment, the target knowledge entities and the entity relationships between them may be divided into a subject word set, a relation word set, and an object word set, which are mapped respectively to the components of knowledge graph triples; the knowledge graph is then constructed from the triples.
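As an illustration of the triple representation, the sketch below stores (subject, relation, object) triples in a simple adjacency structure. The in-memory dictionary, the function name `build_graph`, and the sample triples are assumptions made for the example; the text does not prescribe a storage format or graph database.

```python
from collections import defaultdict

def build_graph(triples):
    """Map each subject to its list of (relation, object) edges, skipping duplicates."""
    graph = defaultdict(list)
    for subject, relation, obj in triples:
        edge = (relation, obj)
        if edge not in graph[subject]:
            graph[subject].append(edge)
    return dict(graph)

# Hypothetical triples (illustrative only)
triples = [
    ("sea fog", "reduces", "visibility"),
    ("visibility", "affects", "ship navigation"),
    ("sea fog", "reduces", "visibility"),  # duplicate, stored once
]
graph = build_graph(triples)
print(graph)
```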

In the weather marine unstructured text knowledge construction method provided by the embodiment of the invention, the target core concept guides the construction direction and main content of the weather marine environment knowledge graph. Because the target core concept is highly specialized and accurate, the target knowledge entities determined from it are more scientific and accurate. The relationship recognition model then identifies the entity relationships among the target knowledge entities, and the knowledge graph of the weather marine environment field is constructed from them. The knowledge graph displays the knowledge structure and relationships of the field comprehensively and clearly. Key knowledge information is thus intelligently extracted from large volumes of complex unstructured text and organized into graph relationships, which improves the ability to acquire target information and the retrieval speed, facilitates knowledge sharing in the weather marine environment field, and allows weather marine environment knowledge to be studied more comprehensively.

In order to facilitate understanding of the above embodiments, the embodiment of the present invention provides a specific implementation manner of a weather marine unstructured text knowledge construction method.

For the foregoing step S102, regarding the acquisition of the weather marine unstructured text set to be constructed: unstructured text in the weather marine environment field is an abundant and highly valuable resource with a wide range of data sources, mainly including the following: (1) Weather reports: weather reports issued by meteorological bureaus, meteorological centers, and similar institutions in various countries and regions, containing recent weather forecasts, meteorological disaster warnings, and the like, from which the weather, meteorological disasters, and other conditions of the weather marine environment can be extracted. (2) Meteorological and oceanographic literature: academic papers, textbooks, and other literature resources in meteorology and oceanography, covering theories, methods, and technologies, from which the latest research progress and application results can be extracted. (3) Patents and technical reports: patents and technical reports issued by enterprises and institutions in the weather marine environment field in various countries and regions, covering monitoring technologies, equipment, and products, from which the latest developments and application cases of meteorological and marine environment monitoring technology can be extracted.

In the process of constructing the weather marine environment knowledge graph, the embodiment of the invention takes the various data sources as the data sources of the knowledge graph, analyzes and mines the unstructured text data, and correlates various text data with the domain knowledge, thereby constructing a rich and complete weather marine environment knowledge graph.

In addition, weather marine environment data mainly has the following characteristics: (1) Unstructured nature: the data generally exists as text without a clear structure or format, and is therefore difficult to mine and use directly. To use it effectively, data cleaning and preprocessing are required to extract useful information. (2) Multi-source heterogeneity: the data comes from a wide range of sources, including papers, patents, academic reports, and the like, and these sources differ from one another, which makes fusing and using the data difficult. To increase the availability and value of the data, data fusion and integration are required. (3) Spatiotemporal complexity: the data has pronounced spatiotemporal complexity and involves multiple dimensions such as time and place, so the spatiotemporal coupling of the data must be considered. (4) Semantic complexity: the data has high semantic complexity, and factors such as polysemy and ambiguity of vocabulary and the complexity of grammatical structures must be considered in order to understand and use the data accurately.

To make better use of weather marine environment knowledge data, data cleaning and preprocessing and data fusion and integration are required, the spatiotemporal coupling of the data must be considered, and its semantic complexity must be addressed. Only when the characteristics of weather marine environment data are fully understood and exploited can these valuable information resources be properly mined and used to construct the knowledge graph.

For the foregoing step S104, when performing the step of determining the target core concept of the weather marine environment field from the weather marine unstructured text set, the following steps a to e may be referred to:

Step a, dividing the weather marine unstructured text set into unstructured text subsets corresponding to each sub-field according to a plurality of sub-fields in the weather marine environment field. The sub-fields may include four professional fields: meteorology, oceanography, geography, and military.

In one implementation, the embodiment of the present invention performs knowledge extraction by constructing a weather marine unstructured text set as the data source. Weather marine environment data mainly involves four professional fields (meteorology, oceanography, geography, and military), so the content of the weather marine unstructured text set is divided into four major classes: a meteorological data class, a marine hydrologic data class, a geographic data class, and a military data class. Each major class is further subdivided into several minor classes, as shown in the schematic diagram of a weather marine unstructured text set in FIG. 2. FIG. 2 illustrates that the meteorological data class includes cloud information, visibility information, strong convection information, meteorological equipment information, boundary layer information, tropospheric information, and near space information; the marine hydrologic data class includes sea state information, ocean current information, offshore equipment information, in-sea noise information, coastal information, tidal information, and thermocline information; the geographic data class includes land and shore topography information, seabed topography information, island information, beach information, and tidal zone information; and the military data class includes naval vessel information, security target information, aircraft information, electromagnetic radiation environments, and marine noise information.

These data sets will cover a number of aspects from scientific research to practical applications, including weather forecast, marine ecology, climate change, geographic information systems, military decisions, and the like. Through collection, arrangement and analysis of the unstructured text data, a powerful support and foundation can be provided for construction of a weather marine environment knowledge graph.

Step b, acquiring an initial core concept based on the unstructured text subsets corresponding to each sub-field; the initial core concept is obtained by performing expert preliminary extraction and expert cross extraction on unstructured text subsets corresponding to each sub-field.

In one embodiment, a core concept is the center of a given knowledge domain; it includes basic concepts, basic laws, and basic principles, and forms the backbone of the discipline's structure. Determining the core concepts is the foundation and the key point of the knowledge entity extraction method provided herein, and it determines the construction direction and main content of the weather marine environment knowledge graph.

Expert experience plays an important guiding role in each field, so knowledge entities grounded in expert experience add more value to the construction of a domain knowledge graph and improve its effectiveness in specific application scenarios. For example, when a weather marine environment knowledge graph is applied to ocean fishing operations, a knowledge graph constructed through generic steps may not meet the knowledge mining requirements of those operations. However, if expert experience in ocean fishing is converted into a framework of knowledge entities during knowledge extraction, and knowledge entity extraction is completed on the basis of that framework, the resulting knowledge graph is better suited to the ocean fishing scenario, which improves its application effect.

Aiming at different professional fields of meteorological data, marine hydrological data, geographic data and military data, 4 field experts are respectively organized, and vocabulary is screened and filtered according to a core concept determination principle. Finally, the initial core concept is determined, and the specific processes are as follows 1) to 2):

1) First, the 4 domain experts (meteorological, ocean, geographic and military domains) each perform a preliminary extraction of the core concepts of their own domain to obtain the first-version core concepts;

2) Then, according to the core concept determination principle, the domain experts cross-screen the first-version core concepts of the other domains. For example, a marine environment expert screens the first-version core concepts of the meteorological data class according to the preset core concept determination principle (highly relevant to the marine environment), finally obtaining, from the first-version core concepts of the meteorological data class, the initial core concepts that can be used to construct the meteorological marine environment knowledge graph. For example, in the meteorological data class, both fog and sandstorms are visibility-impairing phenomena and are important words in the meteorological field. However, under the principle of "highly relevant to the marine environment", sandstorms, which are rare in the marine environment, are filtered out, while fog, which is more common in the marine environment, is retained as a core concept.

The manual procedure provided in step b initially establishes a relatively accurate ontology vocabulary of initial weather marine core concepts; the final scope and attributes of the target core concept ontology are determined through the following steps c to e.

Step c, crawling the interpretation text in the target interpretation page matched with each initial core concept. The target interpretation page may be a web page such as Baidu Baike or Wikipedia, and the interpretation text is the content of that web page. In one embodiment, each initial core concept word is used as a search term to query Baidu Baike or Wikipedia, and the content of the result page (namely, the target interpretation page) is crawled, taking each first-level title as an attribute name and the content under that title as the attribute value.

Step d, performing word segmentation processing on each interpretation text to obtain a first word segmentation data set, and determining a first word frequency corresponding to each first word in the first word segmentation data set.

In one embodiment, all the retrieved interpretation texts can be merged, and the merged text can be segmented using Python's word frequency statistics library collections, the data processing library numpy and the word segmentation library jieba. To avoid adding repeated words to the initial core concepts, words already present among the initial core concept words are filtered out, and the nouns among the remaining words form the first word segmentation data set; that is, the first words in the first word segmentation data set are nouns that differ from the initial core concepts.

In one embodiment, word frequency statistics may be performed on the first tokens in the first token data set to obtain a first word frequency for each first token.

Step e, if the first word frequency corresponding to a first word is greater than a preset word frequency threshold, supplementing the first word into the initial core concepts to obtain the target core concepts of the weather marine environment field. In one embodiment, the word frequency threshold (i.e., the preset word frequency threshold) may be set through repeated trials; first words whose frequency exceeds the threshold are incorporated into the initial core concept words, making up for omissions in the expert-screened initial core concept words, and the attributes of the supplemented core concepts are then retrieved again according to step c.
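Steps d and e can be illustrated with a minimal Python sketch. It assumes the interpretation texts have already been segmented and part-of-speech tagged by some tool; the function name `expand_core_concepts`, the noun tag `"n"` and the sample tokens are hypothetical and are not taken from the embodiment.

```python
from collections import Counter

def expand_core_concepts(tagged_tokens, initial_concepts, freq_threshold):
    """Add frequent new nouns from the interpretation texts to the core concepts.

    tagged_tokens: iterable of (word, pos) pairs produced by any segmenter;
    the 'n' tag for nouns is an assumption of this sketch.
    """
    known = set(initial_concepts)
    # Step d: keep only nouns that are not already initial core concepts.
    first_words = [w for w, pos in tagged_tokens if pos == "n" and w not in known]
    word_freq = Counter(first_words)
    # Step e: supplement words whose frequency exceeds the preset threshold.
    added = sorted(w for w, c in word_freq.items() if c > freq_threshold)
    return list(initial_concepts) + added, word_freq

# Hypothetical tagged tokens from the merged interpretation texts.
tokens = [("sea fog", "n"), ("visibility", "n"), ("form", "v"),
          ("visibility", "n"), ("humidity", "n"), ("visibility", "n"),
          ("humidity", "n"), ("sea fog", "n")]
target_concepts, freq = expand_core_concepts(tokens, ["sea fog"], freq_threshold=1)
print(target_concepts)
```

In this sketch the threshold is a plain integer count; in practice it would be tuned through repeated trials, as described in step e.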

Through the above steps a to e, the target core concept profile of each classification data can be obtained, such as shown in table 1. The target core concept required by constructing the weather marine environment knowledge graph in the embodiment of the invention obtained by the step S102 has extremely high specificity and accuracy, so that the knowledge graph construction of the embodiment of the invention aiming at the weather marine environment field is more scientific and accurate.

TABLE 1

Sequence number	Core concept category	Core concept determination principle	Number of core concepts	Core concept sample
1	Meteorological data class	Highly relevant to the marine environment	49	Water vapor content, turbulence, typhoon, air pressure, atmospheric circulation, precipitation, sea fog, thunder and lightning, monsoon, atmospheric radiation
2	Ocean data class	Highly relevant to the meteorological environment	66	Ocean current, ocean surface elevation, seawater salinity, seawater density, submarine topography, ocean storm surge, seawater temperature, marine ecosystem, marine fishery
3	Geographic data class	Highly relevant to the weather marine environment	39	Coastline, altitude, ocean current path, atmospheric circulation, glacier, ice cover, typhoon path, ocean current simulation, sea-air coupling, ecological environment
4	Military data class	Highly relevant to the weather marine environment	105	Aircraft carrier, submarine, frigate, destroyer, cruiser, gunboat, carrier-borne aircraft, fighter aircraft, guided-missile destroyer, reconnaissance satellite

For the foregoing step S106, when performing the step of extracting the knowledge entity from the weather-ocean unstructured text set based on the target core concept to determine the target knowledge entity, the embodiment of the present invention further provides a flow chart for extracting the target knowledge entity, which is shown in fig. 3, and includes word segmentation processing on the weather-ocean unstructured text set, matching with the target core concept, performing object-law matching extraction to obtain a vocabulary extraction set (i.e. a second word data set), TF-IDF method calculation, and threshold judgment to determine the target knowledge entity. Specifically, see the following steps 1 to 3:

Step 1, taking the target core concept as a custom dictionary and performing word segmentation on the weather marine unstructured text set to obtain a second word segmentation data set. The second word segmentation data set comprises a word segmentation list and a syntactic relation; the word segmentation list is a two-dimensional list whose first dimension is sentence information and whose second dimension is the word segmentation information of the sentence, and the syntactic relation comprises at least one predicate and a plurality of arguments corresponding to each predicate.

In practical application, the characteristic of Chinese language is that there is no space, so that a segment of Chinese text needs to be segmented into individual words, namely, word segmentation needs to be performed so as to facilitate subsequent knowledge extraction, relationship extraction and other works. The principle of word segmentation is mainly based on two methods of statistics and rules, wherein the statistics method is used for word segmentation based on statistical probabilities of a large number of corpora, and the rules are used for word segmentation based on expert knowledge and language rules. The word segmentation effect has great influence on the subsequent knowledge graph construction and analysis. The accuracy and efficiency of knowledge graph construction can be improved by a good word segmentation effect, so that natural language query and semantic reasoning are better supported.

With the continuous development and application of Natural Language Processing (NLP) technology, word segmentation technology has also been continuously refined and improved. In the field of Chinese word segmentation, several effective tools now exist, such as jieba, HanLP, THULAC from Tsinghua University, NLPIR from the Chinese Academy of Sciences, and pyltp, the Python wrapper of LTP from Harbin Institute of Technology. These tools each have advantages and disadvantages in terms of word segmentation accuracy, efficiency, stability, ease of use, and so on.

According to the embodiment of the invention, the HanLP word segmentation tool can be used to perform word segmentation and semantic analysis on each document in the weather marine environment text data set. HanLP is an NLP toolkit consisting of a series of models and algorithms, whose goal is to popularize the application of natural language processing in production environments. HanLP is feature-complete, efficient, clearly architected, trained on up-to-date corpora and customizable, and its internal algorithms have been tested in both industry and academia. HanLP's default models are trained on a very large Chinese corpus, and the toolkit also ships corpus processing tools that help users train their own models.

The word segmentation tool may provide the following knowledge extraction capabilities: Chinese word segmentation, part-of-speech tagging, named entity recognition, keyword extraction, automatic summarization, phrase extraction, pinyin conversion, simplified-to-traditional conversion, text recommendation, dependency syntactic analysis, text classification, text clustering, semantic similarity calculation and the like. The embodiment of the invention mainly uses the part-of-speech analysis and dependency syntactic analysis capabilities to obtain the word segmentation list and the syntactic relation of each sentence in each document. For example, for the text "Water vapor condenses into tiny water droplets in the air, resulting in the appearance of heavy fog. Heavy fog can cause the gun sight to fail.", word segmentation yields the word segmentation list [['water vapor', 'in', 'air', 'middle', 'condense', 'into', 'tiny', 'water droplets', 'cause', 'heavy fog', 'appearance'], ['heavy fog', 'can', 'cause', 'gun sight', 'failure']]. The syntactic relation consists of two sub-lists, corresponding to the predicates in the two sentences respectively: [[{'condense': [('A0', 'water vapor'), ('ARGM-LOC', 'in air'), ('A1', 'tiny water droplets')]}, {'cause': [('A1', 'appearance of heavy fog')]}], [{'cause': [('A0', 'heavy fog'), ('A1', 'gun sight failure')]}, {'failure': [('A0', 'gun sight')]}]]. Each predicate has several arguments, and each argument consists of an argument label (e.g., A0, A1, etc.) and an entity.

The first sub-list indicates that "condense" is a predicate, "water vapor" is the subject of the predicate (the subject of the sentence), "in air" is the location, and "tiny water droplets" is the direct object of the predicate; "cause" is a predicate, and "appearance of heavy fog" is the direct object of the predicate;

the second sub-list indicates that "cause" is a predicate, "heavy fog" is the subject of the predicate, and "gun sight failure" is the direct object of the predicate; "failure" is a predicate, and "gun sight" is the subject of the predicate.

Before word segmentation, the technical terms of the four sub-fields (not limited to the target core concepts) are stored as a custom dictionary, which prevents a target core concept from being split into several words during segmentation. For example, when "gun sight" is not stored in the custom dictionary, it is split into the two nouns "gun" and "sight"; once it is stored in the custom dictionary, "gun sight" is extracted as a single word. This improves the precision and recall of word segmentation.
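The effect of a custom dictionary can be illustrated with a toy segmenter. The sketch below is not HanLP or jieba: it swaps in plain forward maximum matching over a word set, and the spaceless English string and both dictionaries are hypothetical stand-ins for Chinese text.

```python
def forward_max_match(text, dictionary, max_len=12):
    """Greedy longest-match segmentation over a set of known words."""
    words, i = [], 0
    while i < len(text):
        # Try the longest candidate first; fall back to a single character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                words.append(text[i:j])
                i = j
                break
    return words

# Hypothetical general-purpose dictionary and domain-specific addition.
base_dictionary = {"heavy", "fog", "disables", "gun", "sight"}
custom_dictionary = base_dictionary | {"gunsight"}

text = "heavyfogdisablesgunsight"
without_custom = forward_max_match(text, base_dictionary)
with_custom = forward_max_match(text, custom_dictionary)
print(without_custom)
print(with_custom)
```

With only the base dictionary the domain term is split into two words; once it is added to the dictionary, the longest match keeps it whole. Statistical segmenters behave differently internally, but a user dictionary serves the same purpose there.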

Step 2, extracting knowledge entities from the weather marine unstructured text set by taking the word segmentation list as the data source for trigger word matching and the syntactic relation as the rule for trigger word matching, so as to determine initial knowledge entities. Verbs tend to express the relationships between knowledge entities well, and such relationships in text are often introduced by verbs. In order to extract the relevant words in the weather marine unstructured text set, the embodiment of the invention formulates a trigger mechanism based on the weather marine core concepts: the word segmentation list is used as the data source for trigger word matching and the syntactic relation is used as the rule for trigger word matching. Specifically, the syntactic relation sub-lists containing a weather marine core concept are extracted, and the related arguments belonging to the same predicate are extracted. In a specific implementation, see steps 2.1 to 2.3 below:

Step 2.1, for each word segmentation list, if clause information in the word segmentation list contains a target core concept, determining a first target predicate to which the target core concept belongs from a syntactic relation matched with the word segmentation list; and determining each argument corresponding to the first target predicate as a first correlation argument corresponding to the target core concept, and storing the first correlation argument into a first-order knowledge word set.

In one embodiment, each first correlation argument corresponding to the first target predicate is subjected to a word segmentation process; if the word segmentation processing is successful, nouns in the first correlation argument are stored in a first-order knowledge word set; if the word segmentation process is unsuccessful, the first correlation argument is stored in a first order knowledge word set.

In practical application, when the first-order knowledge vocabulary is extracted, the clause information of all weather marine unstructured text sets can be traversed, clauses containing target core concepts are searched, the target core concepts are used as trigger words, and related argument elements which belong to the same predicate as the target core concepts are extracted and stored in the first-order knowledge vocabulary set.

Further, the arguments that can be segmented are segmented, and the nouns in the segmented arguments are stored in the first-order knowledge word set; arguments that cannot be segmented are also stored in the first-order knowledge word set. For example, in the word segmentation list ['heavy fog', 'can', 'cause', 'gun sight', 'failure'], "heavy fog" is a core concept and is included in the vocabulary extraction set; "gun sight failure" is an argument belonging to the same predicate "cause" as "heavy fog"; segmenting this argument gives ['gun sight', 'failure'], and the noun "gun sight" is stored in the first-order vocabulary extraction set.

Furthermore, the argument containing the target core concept, the related arguments belonging to the same predicate, and the words obtained by segmenting those related arguments are all extracted and stored in the first-order knowledge word set. For example, suppose the syntactic relation is as follows: "cause" is the predicate, "water vapor saturation" is the subject of the predicate, and "appearance of heavy fog" is the direct object of the predicate. Since "appearance of heavy fog" contains the core concept "heavy fog", both "appearance of heavy fog" and "water vapor saturation" are extracted and stored in the first-order knowledge word set.

Step 2.2, for each word segmentation list, if the clause information in the word segmentation list contains first-order knowledge words in the first-order knowledge word set, determining a second target predicate to which the first-order knowledge words belong from the syntactic relation matched with the word segmentation list; and determining each argument corresponding to the second target predicate as a second correlation argument corresponding to the first-order knowledge vocabulary, and storing the second correlation argument into a second-order knowledge vocabulary set.

In practical application, the process of extracting the second-order knowledge word set can be referred to as step 2.1. Specifically, when the second-order knowledge vocabulary is extracted, the clause information of all the weather marine unstructured text sets can be traversed again, the first-order knowledge vocabulary in the first-order knowledge vocabulary set is used as a trigger word, and the relevant argument which belongs to the same predicate as the first-order knowledge vocabulary is extracted, so that a second-order knowledge vocabulary set is constructed.

Further, the argument capable of performing word segmentation can be segmented, nouns in the segmented argument are taken and stored in a first-order knowledge word set, and argument incapable of performing word segmentation in the argument are also stored in the first-order knowledge word set.

Furthermore, the argument containing the meteorological ocean core concept, the related arguments belonging to the same predicate, and the words obtained by segmenting those related arguments are all extracted and stored in a first-order knowledge word set. For example, suppose the syntactic relation is as follows: "cause" is the predicate, "water vapor saturation" is the subject of the predicate, and "appearance of heavy fog" is the direct object of the predicate. Since "appearance of heavy fog" contains the core concept "heavy fog", both "appearance of heavy fog" and "water vapor saturation" are extracted and stored in a first-order knowledge word set.

Step 2.3, performing deduplication processing on the target core concept, the first-order knowledge vocabulary and the second-order knowledge vocabulary to obtain the initial knowledge entities. In one embodiment, the target core concept, the first-order knowledge word set and the second-order knowledge word set may be deduplicated and manually reviewed to ensure their accuracy and usability, resulting in the final vocabulary extraction set. Through two rounds of trigger word extraction, steps 2.1 to 2.3 can extract a large number of words related to the weather marine environment.
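Steps 2.1 to 2.3 can be sketched as two passes of the same trigger-word routine followed by deduplication. This is a simplified illustration rather than the embodiment's implementation: the function `extract_related`, the noun lexicon `nouns` (standing in for segmenting an argument and keeping its nouns) and the third sample sentence are hypothetical, and the manual review of step 2.3 is omitted.

```python
def extract_related(sentences, relations, triggers, nouns):
    """Collect words from arguments that share a predicate with a trigger word."""
    found = set()
    for tokens, predicates in zip(sentences, relations):
        # Only sentences whose token list contains a trigger are inspected.
        if not triggers & set(tokens):
            continue
        for predicate in predicates:
            for args in predicate.values():
                texts = [text for _label, text in args]
                # The trigger must occur in one of this predicate's arguments.
                if not any(t in text for t in triggers for text in texts):
                    continue
                for text in texts:
                    # Keep the nouns of the argument; if none are found,
                    # keep the whole argument (segmentation "unsuccessful").
                    hits = {n for n in nouns if n in text}
                    found |= hits if hits else {text}
    return found

sentences = [
    ["water vapor", "in", "air", "condense", "into", "tiny",
     "water droplets", "cause", "heavy fog", "appearance"],
    ["heavy fog", "can", "cause", "gun sight", "failure"],
    ["gun sight", "failure", "affect", "shooting accuracy"],  # hypothetical
]
relations = [
    [{"condense": [("A0", "water vapor"), ("ARGM-LOC", "in air"),
                   ("A1", "tiny water droplets")]},
     {"cause": [("A1", "appearance of heavy fog")]}],
    [{"cause": [("A0", "heavy fog"), ("A1", "gun sight failure")]},
     {"failure": [("A0", "gun sight")]}],
    [{"affect": [("A0", "gun sight failure"), ("A1", "shooting accuracy")]}],
]
nouns = {"water vapor", "water droplets", "heavy fog", "gun sight",
         "shooting accuracy"}
core = {"heavy fog"}

# Step 2.1: core concepts act as trigger words.
first_order = extract_related(sentences, relations, core, nouns) - core
# Step 2.2: first-order words act as trigger words.
second_order = (extract_related(sentences, relations, first_order, nouns)
                - core - first_order)
# Step 2.3: deduplicate the three sets into the initial knowledge entities.
initial_entities = sorted(core | first_order | second_order)
print(initial_entities)
```

Because the second pass reuses every first-order word as a trigger, it reaches words that never co-occur with a core concept, which is also why the result needs the screening described in step 3.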

To summarize, the embodiment of the invention takes the target core concept as a trigger word, and then realizes extraction of related words in the text data set by matching the related words in the word segmentation list. Through the steps, the vocabulary which is extracted by the embodiment of the invention and used for constructing the weather marine environment knowledge graph is more accurate, professional and comprehensive, and the weather marine environment knowledge graph constructed by the embodiment of the invention has more practicability and professionality.

Step 3, screening the initial knowledge entities to obtain the target knowledge entities. The two rounds of trigger word extraction in step 2 also introduce considerable noise and reduce the accuracy of knowledge entity extraction, so the vocabulary extraction set contains not only knowledge entities related to the weather marine environment but also many irrelevant words.

The TF-IDF algorithm is a simple and efficient text processing technique for measuring the importance of a word in a document set. The main idea of TF-IDF is: if a word or phrase appears with a high frequency in one article and rarely in other articles, it is considered to have good category discrimination ability and to be suitable for classification. TF-IDF is in fact the product TF × IDF, where TF is the term frequency (Term Frequency) and IDF is the inverse document frequency (Inverse Document Frequency). TF represents the frequency with which the term occurs in document d. The main idea of IDF is: the fewer documents contain the term t (i.e., the smaller n), the larger the IDF, and the better the category discrimination ability of the term t. The embodiment of the invention screens the knowledge entities with the TF-IDF algorithm through the following steps 3.1 to 3.4:

Step 3.1, for each second word in the initial knowledge entity, determining a second word frequency of the second word in the weather marine unstructured text set. In one embodiment, the second word frequency, i.e., the TF value, with which each second word appears in the weather marine unstructured text set is calculated. The formula for calculating the TF value is as follows:

$$TF_x = \frac{N_x}{N}$$

where $TF_x$ denotes the TF value of the second word x, $N$ denotes the total number of words of the texts in the weather marine unstructured text set, and $N_x$ denotes the number of occurrences of the second word x.

Step 3.2, determining the total text quantity of the weather marine unstructured text set, determining the number of texts in the weather marine unstructured text set that contain the second word, and determining the logarithm of the ratio of the total text quantity to that number as the inverse document frequency of the second word. In one embodiment, the IDF value of the second word is obtained from the total text quantity $D$ of the weather marine unstructured text set and the number $D_x$ of documents in the set that contain the second word. The calculation formula of the IDF value is as follows:

$$IDF_x = \log\frac{D}{D_x}$$

where $IDF_x$ denotes the IDF value of the second word x, $D$ denotes the total text quantity of the weather marine unstructured text set, and $D_x$ denotes the number of texts in the weather marine unstructured text set that contain the second word x.

Step 3.3, determining the product of the second word frequency of the second word and the inverse document frequency as the word importance of the second word. In one embodiment, the product of the TF value and the IDF value is used as the TF-IDF value to measure the importance of the second word (i.e., the word importance) in the weather marine unstructured text set. The TF-IDF value is calculated as follows:

$$TFIDF_x = TF_x \times IDF_x$$

where $TFIDF_x$ denotes the TF-IDF value of the second word x, $TF_x$ denotes the TF value of the second word x, and $IDF_x$ denotes the IDF value of the second word x.

Step 3.4, if the word importance is greater than a preset importance threshold, determining the second word as a target knowledge entity. In one embodiment, when screening the target knowledge entities required for constructing the knowledge graph, a threshold (i.e., the preset importance threshold) may be set after the TF-IDF values of all the second words are calculated; the second words whose TF-IDF values are higher than the threshold are extracted as target knowledge entities, and all other second words are filtered out. For example, the verb "cause" is used frequently across documents, which makes its IDF value small, so its TF-IDF value falls below the set threshold and it is filtered out. The TF value of "flap" is very small, so it is also filtered out. In this way, the vocabulary extraction set is cleaned and the knowledge entities are extracted.
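Steps 3.1 to 3.4 can be sketched in a few lines of Python. The sketch assumes corpus-level term frequency (occurrences divided by the total word count of the whole text set), a natural logarithm for the IDF, and documents that are already segmented; the function name `tfidf_filter`, the sample documents and the threshold value are hypothetical.

```python
import math
from collections import Counter

def tfidf_filter(documents, candidates, threshold):
    """Return {word: TF-IDF score} for candidates scoring above the threshold.

    documents: list of token lists (one per text in the set).
    """
    total_words = sum(len(doc) for doc in documents)
    counts = Counter(word for doc in documents for word in doc)
    n_docs = len(documents)
    scores = {}
    for word in candidates:
        docs_with_word = sum(1 for doc in documents if word in doc)
        if docs_with_word == 0:
            continue  # the word never occurs, so there is nothing to score
        tf = counts[word] / total_words            # step 3.1
        idf = math.log(n_docs / docs_with_word)    # step 3.2 (natural log assumed)
        scores[word] = tf * idf                    # step 3.3
    # Step 3.4: keep only words above the preset importance threshold.
    return {w: s for w, s in scores.items() if s > threshold}

# Hypothetical segmented documents.
documents = [
    ["heavy fog", "cause", "gun sight", "failure"],
    ["heavy fog", "cause", "low", "visibility"],
    ["wind", "cause", "wave"],
    ["rain", "cause", "flood"],
]
kept = tfidf_filter(documents, ["heavy fog", "cause"], threshold=0.05)
print(kept)
```

Here "cause" appears in every document, so its IDF is zero and it is filtered out, while "heavy fog" is retained.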

Before the foregoing step S108 is performed, the relationships between the target knowledge entities may be classified. In practical applications, the knowledge of the various sub-fields of the meteorological marine environment is often interwoven and mutually influencing, so the relationships between pieces of knowledge are often complex. Different types and different levels of knowledge are described differently, so multiple relationship descriptions may exist between the same two knowledge entities, and these descriptions are diverse and complex. To solve this problem, the embodiment of the invention, taking into account the data characteristics of the actual cross-disciplinary field (the meteorological marine environment), reduces the relationships among knowledge entities to 9 types: matching, composition, condition, mutual exclusion, inheritance, parallel, goal, command and co-reference.

Here, a matching relationship indicates that two entities are similar in certain attributes or characteristics; a composition relationship indicates that one entity is part of another entity; a condition relationship indicates that the existence or occurrence of one entity requires some condition of another entity to be satisfied; a mutual exclusion relationship indicates that the two entities exclude each other; an inheritance relationship indicates that one entity inherits certain properties or characteristics from another entity; a parallel relationship indicates that two entities are of equal standing; a goal relationship indicates that one entity is the goal to be achieved by another entity; a command relationship indicates that one entity controls the behavior of another entity; and a co-reference relationship indicates that two entities refer to the same thing. Specific examples of the knowledge relationship classification are set forth in Table 2 below.

TABLE 2

The definition of the relationship types can better describe and understand the relationship between knowledge entities in the field of the weather marine environment, and provides powerful support for knowledge extraction and knowledge graph construction.

For the foregoing step S108, when the step of identifying the target entity relationship between the target knowledge entities based on the weather marine unstructured text set and the target knowledge entities by the pre-trained relationship identification model is performed, relationship extraction may be performed by Bi-LSTM-CRF.

In knowledge extraction of a weather marine environment knowledge graph, relationship extraction is also a core work. The relation extraction is mainly to judge the semantic relation among the knowledge entities, and the relation among the knowledge entities is represented by the 9 knowledge relation types, so that the relation extraction problem is converted into the classification problem. And after training the model, mapping the relationship among the knowledge entities into 9 relationship categories to realize the extraction of the relationship.

The embodiment of the invention adopts a bidirectional long short-term memory network (Bi-directional Long Short-Term Memory, Bi-LSTM) model and a conditional random field (Conditional Random Field, CRF) model to construct the relationship identification model.

Bi-LSTM is a variant of a recurrent neural network (Recurrent Neural Network, RNN) that can process sequence data. Bi-LSTM has better performance in capturing long-term dependencies in sequence data than traditional RNNs.

The CRF is an undirected graph model and can model sequence labeling problems. In text processing, CRF is mainly used for solving the labeling problem, and labeling accuracy can be improved through global optimization of labeling sequences. Referring to a schematic diagram of a relationship recognition model shown in fig. 4, the relationship recognition model in the embodiment of the present invention mainly comprises five parts, namely a knowledge entity input layer, a text selection layer, a Bi-LSTM layer (i.e. a feature extraction layer), a CRF layer (i.e. a relationship recognition layer), and a relationship output layer.

On this basis, when performing the step of identifying entity relationships between target knowledge entities based on the weather marine unstructured text set and the target knowledge entities by means of a pre-trained relationship identification model, reference may be made to the following (a) to (e):

(a) Through the entity input layer, a weather marine unstructured text set and a target knowledge entity are received. In one implementation, the entity input layer is an input layer of a relational identification model provided by an embodiment of the present invention, and is used for receiving an input weather marine unstructured text set and a target knowledge entity. According to the embodiment of the invention, sentences are taken as basic units, and the target knowledge entity is matched with sentences in the weather marine unstructured text set, so that the sentences are taken as input of a relation recognition model.

(b) And screening out target unstructured texts matched with the target knowledge entity from the weather marine unstructured text set through a text selection layer. In one embodiment, the text selection layer is for screening and extracting target unstructured text related to a target knowledge entity. In the task of relationship identification, since a sentence may include a plurality of target knowledge entities, and the relationship between the target knowledge entities is generally identified at the sentence level, it is necessary to select a target unstructured text related to the target knowledge entities for processing, so as to obtain information such as sentences and indexes.

(c) Extracting a forward feature vector and a backward feature vector of the target unstructured text through the feature extraction layer, and fusing the forward feature vector and the backward feature vector to obtain the target feature vector. In one implementation, the Bi-LSTM layer is the core layer of the relationship identification model provided by the embodiment of the invention, and is used for modeling the text information and extracting features. The Bi-LSTM layer can fully consider the context of the text information and can better capture long-term dependencies in a sequence labeling task. In LSTM, $x_t$ generally denotes the model input and $h_t$ the model output. The Bi-LSTM layer is composed of a forward LSTM layer and a backward LSTM layer, which process the text information from left to right and from right to left respectively; finally, the outputs in the two directions (the forward feature vector and the backward feature vector) are combined into one output feature vector (namely the target feature vector), which is used as the final feature representation of the word. The specific calculation formula is as follows:

$$h_t = [\overrightarrow{h_t}; \overleftarrow{h_t}]$$

where, for any input $x_t$, $\overrightarrow{h_t}$ denotes its forward output feature vector, $\overleftarrow{h_t}$ denotes its backward output feature vector, and $h_t$ denotes its target feature vector.

(d) The probability value of each candidate entity relationship relative to the target knowledge entity is determined based on the target feature vector by a relationship identification layer. In one embodiment, the CRF layer is a layer for modeling and training the annotation sequence. The CRF layer can consider the dependency relationship among the labels, and unreasonable labeling sequences are avoided. In the relation recognition task, the CRF layer scores the possibility of each labeling sequence according to the output of the Bi-LSTM layer, and selects the labeling sequence with the highest probability as the output.

(e) And determining, by the relationship output layer, a target entity relationship from the candidate entity relationships based on the probability values. In one implementation, the relationship output layer is an output layer of the relationship identification model provided by the embodiment of the present invention, and is used for outputting the predicted relationship label. In the relation identification task, a relation output layer selects a relation label with the highest probability as a prediction result according to the output of the Bi-LSTM layer and the CRF layer. The main function of the relation output layer is to convert the output of the model into a predicted relation label for the subsequent application program.
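The way a CRF layer picks the highest-scoring label sequence from per-token scores can be illustrated with a small Viterbi decoder. This is a generic sketch rather than the embodiment's trained model: the labels "B", "I" and "O", the emission scores (standing in for Bi-LSTM outputs) and the transition scores are all hypothetical.

```python
def viterbi(emissions, transitions):
    """Return (score, path) of the best label sequence.

    emissions: list of {label: score}, one dict per token.
    transitions: {(previous_label, current_label): score}; missing pairs score 0.
    """
    labels = list(emissions[0])
    best = {lab: (emissions[0][lab], [lab]) for lab in labels}
    for scores in emissions[1:]:
        step = {}
        for cur in labels:
            # Choose the best predecessor for the current label.
            prev = max(labels,
                       key=lambda p: best[p][0] + transitions.get((p, cur), 0.0))
            total = best[prev][0] + transitions.get((prev, cur), 0.0) + scores[cur]
            step[cur] = (total, best[prev][1] + [cur])
        best = step
    return max(best.values(), key=lambda item: item[0])

# Hypothetical per-token scores and label-dependency scores.
emissions = [
    {"O": 1.0, "B": 1.4, "I": 1.5},
    {"O": 1.0, "B": 0.5, "I": 1.2},
    {"O": 2.0, "B": 0.1, "I": 0.3},
]
transitions = {("O", "I"): -10.0, ("B", "I"): 1.0}

score, path = viterbi(emissions, transitions)
greedy = [max(tok, key=tok.get) for tok in emissions]
print(path, greedy)
```

Taking the best label independently for each token gives a sequence that starts with "I", whereas the decoder, which also scores the dependencies between adjacent labels, prefers a sequence starting with "B". This is the sense in which the CRF layer avoids unreasonable labeling sequences.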

According to the embodiment of the invention, the output of the relation recognition model is used to classify the relationships among knowledge entities in the field of the meteorological marine environment into 9 categories, thereby realizing relation extraction. For example, consider two sentences that both contain the knowledge entities "wind" and "sea wave": "the sea waves rolled up by the strong wind made the soldiers dizzy" and "the height of sea waves is determined by the wind speed and depends on the pressure on a unit area". The model can identify that the relationship between the two entities is expressed by "rolled up" in the first sentence and by "depends on" in the second. Many other descriptive words are possible, but the model can classify all of them into the "condition" relationship, that is, "the higher the wind speed, the higher the sea wave". The classification of knowledge entity relationships lays a solid foundation for the construction of the knowledge graph.

For the foregoing step S110, when performing the step of constructing a knowledge graph of the weather marine environment field based on the entity relationship between the target knowledge entity and the target knowledge entity, the following procedure may be referred to:

The output of the relation extraction model mainly consists of a subject word set, a relation word set and an object word set, which are respectively mapped to the subject, relation and object of knowledge graph triples; the triple data is the basis of graph construction. Two problems need to be solved when constructing the knowledge graph triples. First, the extracted relationships contain noise: for example, multiple relationships may be extracted between the same two knowledge entities, such as both a "matching" relationship and a "composition" relationship appearing in the classification results output by the model. Second, the knowledge entities need to be ranked by importance.

For the first problem, the embodiment of the invention aggregates and counts the classification results extracted by the model, takes the classification type extracted most often as the final relation classification, and applies manual screening when several classification types have the same count. For the second problem, the knowledge entities are ranked by importance using the TF-IDF value as the main reference for the priority weight.

According to the output of the relationship identification model, each relation and the knowledge entities at its two ends are traversed and screened, triple data are obtained according to the defined structure, and finally the knowledge graph is constructed from the triple data.
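The aggregation and ranking described above can be sketched as follows. This is an illustration under stated assumptions rather than the patented implementation: the function names, the `(subject, relation, object)` tuple layout and the sample data are hypothetical. Tied relation types are returned separately so that they can be passed to manual screening.

```python
from collections import Counter, defaultdict


def resolve_relations(predictions):
    """Majority-vote the relation type for each (subject, object) pair.

    predictions: iterable of (subject, relation, object) tuples, one per
    sentence-level model output.
    Returns (triples, needs_review); pairs with tied counts go to review.
    """
    votes = defaultdict(Counter)
    for subject, relation, obj in predictions:
        votes[(subject, obj)][relation] += 1

    triples, needs_review = [], []
    for (subject, obj), counter in votes.items():
        ranked = counter.most_common()
        top_count = ranked[0][1]
        tied = [rel for rel, count in ranked if count == top_count]
        if len(tied) > 1:
            needs_review.append((subject, obj, sorted(tied)))
        else:
            triples.append((subject, tied[0], obj))
    return triples, needs_review


def rank_entities(tfidf_by_entity):
    """Order entities from highest to lowest TF-IDF value."""
    return sorted(tfidf_by_entity, key=tfidf_by_entity.get, reverse=True)


predictions = [
    ("wind", "condition", "sea wave"),
    ("wind", "condition", "sea wave"),
    ("wind", "composition", "sea wave"),
    ("tide", "matching", "current"),
    ("tide", "composition", "current"),
]
triples, needs_review = resolve_relations(predictions)
ranking = rank_entities({"wind": 0.2, "tide": 0.5})
```

Here the pair ("wind", "sea wave") resolves to "condition" by a 2-to-1 vote, while ("tide", "current") has a tie and is left for manual screening.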

In summary, the method for constructing meteorological ocean unstructured text knowledge provided by the embodiment of the invention addresses the characteristics of meteorological marine environment knowledge, such as being unstructured, multi-source and heterogeneous, and complex in space-time and semantics. It uses expert experience to determine the core concepts of the meteorological marine environment field as the basic framework of the knowledge ontology, expands and extracts new domain knowledge entities from unstructured text data through a trigger learning mechanism, cleans and determines the final knowledge entities through the TF-IDF algorithm, extracts the relationships among the knowledge entities through the Bi-LSTM-CRF method, and finally constructs the knowledge graph of the meteorological marine environment in the form of triples. The method fully considers the specificity and complexity of knowledge in the meteorological ocean field and provides a reference for the construction of knowledge graphs in this field.

To facilitate understanding, the embodiment of the invention also provides another method for constructing meteorological ocean unstructured text knowledge. Knowledge points in the meteorological marine environment field are numerous and interdisciplinary, and a knowledge graph can integrate scattered knowledge points into a visual graph that is easier to understand and share. By constructing a knowledge graph for the meteorological marine environment field, the embodiment of the invention can show the knowledge structure and relationships of the field comprehensively and clearly, help people understand the connections between related concepts and knowledge points, help researchers carry out knowledge analysis, trend prediction, development planning and similar work in the field, provide better support and guidance for scientific research, and assist knowledge-based intelligent decision-making for meteorological and ocean support.

Specifically, referring to a flow chart of another method for constructing unstructured text knowledge of a meteorological ocean shown in fig. 5, the method mainly includes the following steps S502 to S518:

Step S502, acquiring unstructured text data in the field of the meteorological marine environment.

Step S504, carrying out knowledge classification on the meteorological marine environment field.

Step S506, determining target core concepts in the field of the meteorological marine environment according to the knowledge classification.

Step S508, performing word segmentation and sorting on the unstructured text data to obtain a second word segmentation data set.

Step S510, extracting knowledge entities based on the target core concepts.

Step S512, screening the knowledge entities extracted by the trigger mechanism based on the target core concepts.

Step S514, classifying the relationships between the knowledge entities.

Step S516, performing relation extraction through Bi-LSTM-CRF according to the classification of the knowledge entity relationships.

Step S518, completing the construction of the knowledge graph according to the knowledge entities and the entity relationships.

Using the method and construction process of this embodiment, a meteorological marine environment knowledge graph was constructed. Its data sources mainly comprise reports, papers, teaching materials, patent data and the like in the meteorological ocean field. A core concept set was constructed using expert experience, with 231 core concepts in 4 categories determined in total. The HanLP word segmentation tool was used to perform word segmentation and trigger word extraction on the data sources, and 2681 knowledge entities in total were obtained through TF-IDF calculation and screening, with the minimum TF-IDF threshold set to 3.669e-5. Relation extraction was completed based on the Bi-LSTM-CRF algorithm; the training set is a manually constructed training corpus of 279 sentences covering the 9 relationship classifications among knowledge entities. After relation extraction, the extracted relations were screened and knowledge graph triples were constructed. Finally, 9967 entity relations were extracted and displayed using a multi-node parallel visualization method with the core concepts as main nodes. The number of knowledge graph triples extracted for each relation type is shown in Table 3, and a sample of the visualization is shown in Fig. 6.

Table 3

In summary, constructing the weather marine environment knowledge graph provides an intuitive and systematic description framework for the field, helps to better understand the rich relationships among influencing elements, and makes it possible to obtain practically applicable knowledge from massive data, which is of great significance for making the utilization of the weather marine environment more intelligent, accurate and efficient.

For the method for constructing the weather and ocean unstructured text knowledge provided in the foregoing embodiment, the embodiment of the present invention provides a device for constructing the weather and ocean unstructured text knowledge, and referring to a schematic structural diagram of the weather and ocean unstructured text knowledge constructing device shown in fig. 7, the device mainly includes the following parts:

a text acquisition module 702, configured to acquire a weather marine unstructured text set to be constructed;

a concept determination module 704, configured to determine a target core concept of the weather marine environment field according to the weather marine unstructured text set;

an entity determination module 706, configured to perform knowledge entity extraction on the weather marine unstructured text set based on the target core concept to determine a target knowledge entity;

a relationship determination module 708, configured to identify, through a pre-trained relationship identification model, entity relationships between the target knowledge entities based on the weather marine unstructured text set and the target knowledge entities;

a map construction module 710, configured to construct a knowledge graph of the weather marine environment field based on the target knowledge entities and the entity relationships between the target knowledge entities.

With the weather marine unstructured text knowledge construction device provided by the embodiment of the invention, the target core concepts guide the construction direction and main content of the weather marine environment knowledge graph. Because the target core concepts are highly professional and accurate, more scientific and accurate target knowledge entities can be determined based on them. The entity relationships among the target knowledge entities are then identified using the relationship identification model, and finally the knowledge graph of the weather marine environment field is constructed. Based on the knowledge graph, the knowledge structure and relationships of the field can be shown comprehensively and clearly, key knowledge information in large volumes of unstructured text is extracted intelligently and graph relationships are constructed, which improves the ability to acquire target information and the retrieval speed, facilitates knowledge sharing in the weather marine environment field, and supports a more comprehensive study of weather marine environment knowledge.

In one embodiment, the concept determination module 704 is further configured to:

based on the unstructured text subsets corresponding to each sub-field, acquiring an initial core concept; the initial core concept is obtained by performing expert preliminary extraction and expert cross extraction on unstructured text subsets corresponding to each sub-field;

if the first word frequency corresponding to the first word is larger than a preset word frequency threshold value, supplementing the first word into the initial core concept to obtain a target core concept in the weather marine environment field.
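The supplementing step can be sketched as a simple frequency filter. The sketch assumes the word frequency is a raw occurrence count over the segmented text, which this passage does not specify. The function name, the sample words and the threshold value are hypothetical.

```python
from collections import Counter


def supplement_core_concepts(initial_concepts, segmented_words, freq_threshold):
    """Add every word whose count exceeds the threshold to the core concepts.

    Assumption: "word frequency" is a raw occurrence count.
    """
    counts = Counter(segmented_words)
    supplemented = set(initial_concepts)
    for word, count in counts.items():
        if count > freq_threshold:
            supplemented.add(word)
    return supplemented


words = ["typhoon", "typhoon", "typhoon", "sea fog", "typhoon", "sea fog", "tide"]
target_concepts = supplement_core_concepts({"wind", "sea wave"}, words, freq_threshold=3)
```

With a threshold of 3, only "typhoon" (4 occurrences) is added to the expert-selected initial concepts.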

In one embodiment, the entity determination module 706 is further configured to:

taking the target core concept as a custom dictionary, and performing word segmentation on the weather marine unstructured text set to obtain a second word segmentation data set; the second word segmentation data set comprises a word segmentation list and a syntactic relation, wherein the syntactic relation comprises at least one predicate and a plurality of arguments corresponding to each predicate;

taking the word segmentation list as a trigger word matching data source and the syntactic relation as a trigger word matching rule, and performing knowledge entity extraction on the weather marine unstructured text set to determine an initial knowledge entity;

and screening the initial knowledge entity to obtain a target knowledge entity.

for each word segmentation list, if clause information in the word segmentation list contains a target core concept, determining a first target predicate to which the target core concept belongs from a syntactic relation matched with the word segmentation list;

for each word segmentation list, if the clause information in the word segmentation list contains first-order knowledge words in the first-order knowledge word set, determining a second target predicate to which the first-order knowledge words belong from a syntactic relation matched with the word segmentation list;

performing word segmentation on each first correlation argument corresponding to the first target predicate;

if the word segmentation processing is successful, nouns in the first correlation argument are stored in a first-order knowledge word set;

if the word segmentation process is unsuccessful, the first correlation argument is stored in a first order knowledge word set.

determining the total text quantity of the weather and ocean unstructured text sets, determining the text quantity of the weather and ocean unstructured text containing the second word in the weather and ocean unstructured text sets, and determining the logarithmic ratio of the total text quantity to the text quantity as the inverse document frequency of the second word;

and if the importance of the word segmentation is greater than a preset importance threshold, determining the second word as a target knowledge entity.
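The screening by inverse document frequency and importance threshold can be sketched as below. The inverse document frequency follows the definition above (logarithm of total texts over texts containing the word). The term frequency definition is not given in this passage, so the sketch assumes it is the word's occurrences divided by the total number of words, and that the importance is the product of the two. Function names, the toy texts and the threshold are hypothetical.

```python
import math


def inverse_document_frequency(word, documents):
    """log(total number of texts / number of texts containing the word)."""
    containing = sum(1 for doc in documents if word in doc)
    if containing == 0:
        return 0.0
    return math.log(len(documents) / containing)


def select_entities(candidates, documents, threshold):
    """Keep candidates whose TF-IDF importance exceeds the threshold.

    Assumption: TF = occurrences of the word / total words in all texts.
    """
    total_words = sum(len(doc) for doc in documents)
    selected = {}
    for word in candidates:
        occurrences = sum(doc.count(word) for doc in documents)
        tf = occurrences / total_words
        importance = tf * inverse_document_frequency(word, documents)
        if importance > threshold:
            selected[word] = importance
    return selected


# Each text is represented as a list of segmented words.
documents = [["wind", "wave", "wind"], ["wave", "tide"], ["wave", "current"]]
selected = select_entities(["wind", "wave", "tide"], documents, threshold=0.2)
```

In this toy corpus "wave" appears in every text, so its inverse document frequency is zero and it is filtered out, while "wind" is frequent but concentrated in one text and is kept.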

In one embodiment, the relationship-identifying model includes an entity input layer, a text selection layer, a feature extraction layer, a relationship-identifying layer, and a relationship output layer; the relationship determination module 708 is also to:

receiving a weather marine unstructured text set and a target knowledge entity through an entity input layer;

screening target unstructured text matched with a target knowledge entity from a weather marine unstructured text set through a text selection layer;

extracting a forward feature vector and a backward feature vector of the target unstructured text through a feature extraction layer, and fusing the forward feature vector and the backward feature vector to obtain a target feature vector;

determining a probability value of each candidate entity relation relative to the target knowledge entity based on the target feature vector through the relation recognition layer;

and determining, by the relationship output layer, a target entity relationship from the candidate entity relationships based on the probability values.

The device provided by the embodiment of the present invention has the same implementation principle and technical effects as those of the foregoing method embodiment, and for the sake of brevity, reference may be made to the corresponding content in the foregoing method embodiment where the device embodiment is not mentioned.

The embodiment of the invention provides electronic equipment, which comprises a processor and a storage device; the storage means has stored thereon a computer program which, when executed by the processor, performs the method of any of the embodiments described above.

Fig. 8 is a schematic structural diagram of an electronic device according to an embodiment of the present invention, where the electronic device 100 includes: a processor 80, a memory 81, a bus 82 and a communication interface 83, the processor 80, the communication interface 83 and the memory 81 being connected by the bus 82; the processor 80 is arranged to execute executable modules, such as computer programs, stored in the memory 81.

The memory 81 may include a high-speed random access memory (RAM) and may further include a non-volatile memory, such as at least one magnetic disk memory. The communication connection between this system network element and at least one other network element is implemented via at least one communication interface 83 (which may be wired or wireless), and may use the Internet, a wide area network, a local area network, a metropolitan area network, etc.

Bus 82 may be an ISA bus, a PCI bus, an EISA bus, or the like. The buses may be classified as address buses, data buses, control buses, etc. For ease of illustration, only one bi-directional arrow is shown in FIG. 8, but this does not mean that there is only one bus or only one type of bus.

The memory 81 is configured to store a program, and the processor 80 executes the program after receiving an execution instruction, and the method executed by the apparatus for flow defining disclosed in any of the foregoing embodiments of the present invention may be applied to the processor 80 or implemented by the processor 80.

The processor 80 may be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the above method may be performed by integrated logic circuitry in hardware or by instructions in software in the processor 80. The processor 80 may be a general-purpose processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; it may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field-Programmable Gate Array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component. It may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present invention. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present invention may be embodied directly as being executed by a hardware decoding processor, or executed by a combination of hardware and software modules in a decoding processor. The software modules may be located in a storage medium well known in the art, such as a random access memory, flash memory, read-only memory, programmable read-only memory, electrically erasable programmable memory, or registers. The storage medium is located in the memory 81; the processor 80 reads the information in the memory 81 and, in combination with its hardware, performs the steps of the method described above.

The computer program product of the readable storage medium provided by the embodiment of the present invention includes a computer readable storage medium storing a program code, where the program code includes instructions for executing the method described in the foregoing method embodiment, and the specific implementation may refer to the foregoing method embodiment and will not be described herein.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or a part of the technical solution, may be embodied in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or other various media capable of storing program codes.

Finally, it should be noted that the above examples are only specific embodiments of the present invention, intended to illustrate rather than limit its technical solutions, and the protection scope of the present invention is not limited thereto. Although the present invention has been described in detail with reference to the foregoing examples, those skilled in the art should understand that anyone familiar with the art may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of changes to them, or make equivalent substitutions of some of their technical features; such modifications, changes or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention and are intended to be included in the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. The method for constructing the unstructured text knowledge of the meteorological ocean is characterized by comprising the following steps of:

acquiring a weather marine unstructured text set to be constructed;

constructing a knowledge graph of the weather marine environment field based on the target knowledge entity and the entity relationship between the target knowledge entities;

determining a target core concept of the meteorological marine environment field according to the meteorological marine unstructured text set, wherein the target core concept comprises the following steps:

if the first word frequency corresponding to the first word is larger than a preset word frequency threshold value, supplementing the first word into the initial core concept to obtain a target core concept in the weather marine environment field;

performing knowledge entity extraction on the weather marine unstructured text set based on the target core concept to determine a target knowledge entity, including:

and screening the initial knowledge entity to obtain a target knowledge entity.

2. The method of claim 1, wherein the step of extracting the knowledge entity from the weather marine unstructured text set to determine an initial knowledge entity using the word segmentation list as a trigger word matching data source and the syntactic relation as a trigger word matching rule comprises:

3. The method of claim 2, wherein storing the first relatedness argument into a first order knowledge vocabulary, comprising:

4. The method for constructing the weather marine unstructured text knowledge according to claim 1, wherein the step of screening the initial knowledge entity to obtain a target knowledge entity comprises the steps of:

5. The weather marine unstructured text knowledge construction method according to claim 1, wherein the relation recognition model comprises an entity input layer, a text selection layer, a feature extraction layer, a relation recognition layer and a relation output layer;

6. A weather marine unstructured text knowledge construction device, comprising:

the map construction module is used for constructing a knowledge map of the meteorological marine environment field based on the entity relation between the target knowledge entity and the target knowledge entity;

the concept determination module is further configured to:

the entity determination module is further configured to:

and screening the initial knowledge entity to obtain a target knowledge entity.

7. An electronic device comprising a processor and a memory, the memory storing computer-executable instructions executable by the processor, the processor executing the computer-executable instructions to implement the method of any one of claims 1 to 5.

8. A computer readable storage medium storing computer executable instructions which, when invoked and executed by a processor, cause the processor to implement the method of any one of claims 1 to 5.