CN114238524A

CN114238524A - Satellite frequency-orbit data information extraction method based on enhanced sample model

Info

Publication number: CN114238524A
Application number: CN202111570758.5A
Authority: CN
Inventors: 何元智; 李志强
Original assignee: Institute of Network Engineering Institute of Systems Engineering Academy of Military Sciences
Current assignee: Institute of Network Engineering Institute of Systems Engineering Academy of Military Sciences
Priority date: 2021-12-21
Filing date: 2021-12-21
Publication date: 2022-03-25
Anticipated expiration: 2041-12-21
Also published as: CN114238524B

Abstract

The invention discloses a satellite frequency-orbit data information extraction method based on an enhanced sample model, comprising the following steps: defining entity types and a relation set; in a structured frequency-orbit data relation extraction stage, selecting the required data from a database, matching the related entities, and representing the entity pairs and their relationships as triples; in an unstructured frequency-orbit data relation extraction stage, labeling the word-segmented text data, training an entity recognition model, and completing entity recognition; enhancing the sample model by using the structured data to generate text that supplements the training sentence library, thereby addressing the long-tail problem, and by using reinforcement learning to separate correct-label sentences from noise sentences in each sentence bag; and training a segmented convolutional neural network model to complete the classification and extraction of entity relationships. The invention makes full use of both the structured data and the noise sentences, can efficiently complete knowledge extraction from satellite frequency-orbit data, and enriches the satellite frequency-orbit knowledge base; the scheme is flexible and achieves high relation extraction accuracy.

Description

Satellite frequency-orbit data information extraction method based on enhanced sample model

Technical Field

The invention relates to the technical field of satellite data processing, in particular to a satellite frequency and orbit data information extraction method based on an enhanced sample model.

Background

At present, with the rapid development of aerospace technology, countries around the world have launched many satellites into outer space, generating a large number of frequency-orbit resource data records that contain much useful information. Although traditional database storage records a large amount of structured data, the information is not complete enough to construct a complete frequency-orbit data graph. Establishing a frequency-orbit knowledge graph model shows the relationships among the data intuitively and lays a technical foundation for mining and using the data. A large volume of useful unstructured satellite frequency-orbit data is also available on the network and can serve as a supplement to the structured data.

How to identify the required entities and their relationships in unstructured frequency-orbit data is a basic problem in constructing a complete frequency-orbit data graph. Constructing such a graph involves two key tasks: named entity recognition and relationship extraction. Depending on whether the two tasks are modeled jointly, the approaches are divided into joint extraction methods and Pipeline methods.

Joint extraction models the two tasks in a single model, so it can exploit latent associations between the two tasks and reduce the propagation of accumulated errors. However, because the two tasks are modeled together, using the same feature representation for both may confuse the model during learning. How to strengthen the interaction between the entity model and the relationship model is also a difficult problem. The Pipeline method first recognizes named entities and then extracts relationships; this scheme is highly flexible, and the entity model and the relationship model can each use an independent data set.

For named entity recognition, existing methods are classified into rule-based methods, statistical-model-based methods, and neural-network-based methods. Rule-based methods require a large number of entity recognition rules, which are matched against input character strings to recognize named entities. The rules must be constructed by experts, which limits the applicability of these methods. Statistical-model-based methods treat named entity recognition as a sequence labeling problem, but still require manually defined features, and the defined features have a large impact on the final recognition result. Neural-network-based methods avoid this problem because features need not be defined manually, and because neural networks have stronger feature expression capability, they can fully learn the features of the entity context.

For satellite frequency-orbit entity relationship extraction, existing methods are divided into template-based methods, supervised-learning-based methods, and remote-supervision-based (distant supervision) methods. Template-based methods require a heavy manual template-building workload when the data scale is large. Supervised-learning-based methods need a large amount of manually labeled data, which becomes their limiting factor. Remote-supervision-based methods avoid manually labeling large amounts of data, but introduce noise. Existing research mainly selects the sentences that carry a correct label, or identifies and removes the noise, and does not consider the value of the noise for model training. Remote-supervision-based methods also suffer from the long-tail problem. As a result of these two points, relation extraction models trained with existing methods are biased and limited in accuracy.

Chinese patent CN108304911 proposes a knowledge extraction method, system, and device based on a memory neural network, which can be used for knowledge extraction tasks with predefined relationship types and can automatically extract structured information matching those types from unstructured text on the Internet. Chinese patent CN109359297 proposes a relationship extraction method and system that introduces the hierarchical structure information of relationships to construct a hierarchical attention mechanism, improving the stability of the relationship extraction model. Both patents can extract knowledge, but the data used in their technical solutions is entirely unstructured, so the information contained in existing structured data cannot be fully utilized. The first patent requires a large number of manual labels; the second adopts the idea of remote supervision but does not fully consider the role of noise data, so the accuracy of knowledge extraction is limited.

Disclosure of Invention

Aiming at the problems that the data records of traditional satellite databases are incomplete and the data volume is insufficient to establish a frequency-orbit knowledge graph model, the invention discloses a satellite frequency-orbit data information extraction method based on an enhanced sample model, which extracts useful knowledge from unstructured data as a supplement to the structured data.

The invention discloses a satellite frequency orbit data information extraction method based on an enhanced sample model, which comprises the following specific steps:

S1, according to the task requirements of satellite frequency-orbit data recognition and extraction, defining the entity types of the satellite frequency-orbit data, the six defined entity types being: satellite name, satellite network ID, governing department, orbit position, orbit type, and frequency band; an entity is a satellite communication subject in the satellite frequency-orbit data;

S2, defining a set of relationships between entities on the basis of the entity types defined in step S1, the relationships between entities being represented by triples, specifically: (satellite name, belonging to, satellite network ID), (satellite name, managed, governing department), (satellite name, orbital), (orbital type, suborbital, satellite name), (satellite name, usage, frequency band) and (governing department, owning network, satellite network ID); all the relationships between entities constitute the set of relationships between entities;

S3, acquiring structured satellite frequency-orbit data and extracting knowledge from it, the extraction comprising data preprocessing, entity recognition, and entity relationship extraction;

S31, data preprocessing: acquiring structured satellite frequency-orbit data from the SRS database of the International Telecommunication Union according to the defined entity types, selecting the data corresponding to those entity types, and storing it in an entity-relationship table;

S32, performing entity recognition on the structured satellite frequency-orbit data: matching the corresponding data in the entity-relationship table according to the defined entity types and their relationships, and selecting the related entities;

S33, entity relationship extraction: for the entities selected in step S32, taking the relationship defined in step S2 for the entity types defined in step S1 to which those entities belong as the relationship between the entities;

S34, establishing a triple set T from every two entities and their corresponding relationship;
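By way of illustration only, steps S31 to S34 can be sketched as follows. The column names and sample rows are hypothetical (the layout of the SRS export is not given here), and only the four relationships whose head and tail types are stated unambiguously in step S2 are used.

```python
# Hypothetical column names for the entity-relationship table; the real
# SRS export schema is not specified in the text.
RELATIONS = [
    ("satellite_name", "belongs_to", "satellite_network_id"),
    ("satellite_name", "managed_by", "governing_department"),
    ("satellite_name", "uses", "frequency_band"),
    ("governing_department", "owns_network", "satellite_network_id"),
]


def build_triples(rows):
    """Return the set T of (head, relation, tail) triples found in the rows."""
    triples = set()
    for row in rows:
        for head_type, relation, tail_type in RELATIONS:
            head, tail = row.get(head_type), row.get(tail_type)
            if head and tail:  # skip rows where either entity is missing
                triples.add((head, relation, tail))
    return triples


# Invented sample rows standing in for the entity-relationship table.
rows = [
    {"satellite_name": "SAT-A", "satellite_network_id": "NET-1",
     "governing_department": "Dept-X", "frequency_band": "Ku"},
    {"satellite_name": "SAT-B", "satellite_network_id": "NET-2",
     "governing_department": "Dept-X", "frequency_band": None},
]
T = build_triples(rows)
```

Because the relationship is determined by the column types alone, no model is needed at this stage; a set is used so that repeated rows do not produce duplicate triples.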

S4, extracting knowledge from the unstructured satellite frequency-orbit data: obtaining text data of the unstructured satellite frequency-orbit data from the Internet by data crawling, segmenting the text data into words to obtain a word segmentation sequence, labeling the word segmentation sequence with the BIO labeling method, and taking the labeled text as a training sentence library; fine-tuning a BERT-based pre-trained model to form a BERT-based named entity recognition model; training the BERT-based named entity recognition model with the training sentence library; and correctly classifying each word in the word segmentation sequence with the trained BERT-based named entity recognition model;

S41, crawling the unstructured satellite frequency-orbit data and performing word segmentation; marking the defined entity types, namely satellite name, satellite network ID, governing department, orbit position, satellite type, and frequency band, as six labels A1, A2, A3, A4, A5, and A6, respectively; labeling the segmented sentences with the BIO labeling method to obtain the training sentence library;
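As a minimal sketch of the BIO labeling in step S41, the function below turns token-level entity spans into BIO tags. The example sentence, the span format, and the `B-`/`I-` prefix convention combined with the labels A1 to A6 are assumptions for illustration; the actual annotation format is the one shown in FIG. 2.

```python
def bio_tag(tokens, spans):
    """Assign BIO tags to tokens.

    spans is a list of (start, end_exclusive, label) over token indices;
    tokens outside every span receive the tag "O".
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = "B-" + label          # first token of the entity
        for k in range(start + 1, end):
            tags[k] = "I-" + label          # remaining tokens of the entity
    return tags


# Invented example: A1 = satellite name, A6 = frequency band.
tokens = ["SAT-A", "operates", "in", "the", "Ku", "band"]
spans = [(0, 1, "A1"), (4, 6, "A6")]
tags = bio_tag(tokens, spans)
```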

S42, fine-tuning the sequence labeling layer of the BERT-based pre-trained model, namely replacing the hidden layer representation of BERT with a fully connected layer, to form the BERT-based named entity recognition model; training the BERT-based named entity recognition model with the training sentence library; after the input vector v of the input layer passes through a plurality of coding layers, the semantic association representation h of the sentences in the unstructured satellite frequency-orbit data is obtained;

S43, the sequence labeling layer outputs the probability distribution $P_t$ of the word segmentation sequence at each time step $t$ under the BIO labeling scheme, expressed as:

$$P_t = \mathrm{softmax}(h_t W_0 + b_0), \quad t = 1, 2, \ldots, N$$

wherein $h_t$ denotes the component of $h$ at time $t$, $W_0$ denotes the weight matrix of the fully connected layer, $b_0$ denotes the bias of the fully connected layer, and softmax denotes the activation function;
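The sequence labeling layer can be illustrated for a single time step with the plain-Python sketch below. The dimensions and numeric values are invented; a practical implementation would apply the same computation to BERT hidden states with a tensor library.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    m = max(scores)  # subtracting the max avoids overflow in exp
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def tag_distribution(h_t, W0, b0):
    """Compute P_t = softmax(h_t W0 + b0); W0 has shape (hidden, num_tags)."""
    logits = [
        sum(h_t[k] * W0[k][c] for k in range(len(h_t))) + b0[c]
        for c in range(len(b0))
    ]
    return softmax(logits)


# Toy values: hidden size 2, three tags.
h_t = [0.5, -1.0]
W0 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 0.5]]
b0 = [0.0, 0.1, 0.0]
P_t = tag_distribution(h_t, W0, b0)
predicted_tag = P_t.index(max(P_t))
```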

S44, after the probability distribution at each time step of the word segmentation sequence is obtained, the parameters of the BERT-based named entity recognition model are trained with a cross-entropy loss function to improve the classification prediction capability of the model; each word in the word segmentation sequence is then classified with the trained model to obtain its BIO label, the complete entity names and types are obtained from the BIO labels, and entity recognition for the satellite frequency-orbit data is thereby completed.
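The last part of step S44, recovering complete entity names and types from the predicted BIO labels, might look like the sketch below. Joining tokens with a space is an assumption that suits the invented English example; for Chinese text the tokens would normally be joined without a separator.

```python
def decode_bio(tokens, tags):
    """Collect (entity_text, label) pairs from a BIO-tagged token sequence."""
    entities, current, label = [], [], None
    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:  # a new entity starts right after another one
                entities.append((" ".join(current), label))
            current, label = [token], tag[2:]
        elif tag.startswith("I-") and current and tag[2:] == label:
            current.append(token)
        else:  # "O" or an inconsistent I- tag closes the open entity
            if current:
                entities.append((" ".join(current), label))
            current, label = [], None
    if current:
        entities.append((" ".join(current), label))
    return entities


tokens = ["SAT-A", "operates", "in", "the", "Ku", "band"]
tags = ["B-A1", "O", "O", "O", "B-A6", "I-A6"]
entities = decode_bio(tokens, tags)
```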

S5, according to the classification result of the step S4, a sentence containing the entity type defined in the step S1 is screened out; in the screened sentences, for the sentences containing the entities with the same entity type, packaging the sentences to be used as a sentence bag, and marking the entity relationship among the entities in the sentences as a sentence bag label;
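A possible way to build sentence bags is sketched below. It follows the wording of Example 2, which groups sentences by named entity pair and labels each bag with the relationship of the corresponding entity type pair. The type-pair mapping and the sentences are hypothetical, and only pairs in order of mention are considered, which is a simplification.

```python
from collections import defaultdict

# Hypothetical mapping from (head type, tail type) to a relationship label.
TYPE_PAIR_TO_RELATION = {
    ("A1", "A2"): "belongs_to",
    ("A1", "A6"): "uses",
}


def build_bags(tagged_sentences):
    """Group sentences into bags keyed by (head, tail, bag label).

    tagged_sentences is a list of (sentence, [(entity_text, type), ...]).
    """
    bags = defaultdict(list)
    for sentence, entities in tagged_sentences:
        for i, (head, head_type) in enumerate(entities):
            for tail, tail_type in entities[i + 1:]:
                relation = TYPE_PAIR_TO_RELATION.get((head_type, tail_type))
                if relation is not None:
                    bags[(head, tail, relation)].append(sentence)
    return dict(bags)


tagged = [
    ("SAT-A uses the Ku band", [("SAT-A", "A1"), ("Ku", "A6")]),
    ("SAT-A transmits in Ku", [("SAT-A", "A1"), ("Ku", "A6")]),
    ("SAT-A is filed under NET-1", [("SAT-A", "A1"), ("NET-1", "A2")]),
]
bags = build_bags(tagged)
```

The bag label is assigned from the entity types alone, which is why some sentences in a bag may not actually express that relationship; these are the noise sentences handled in step S7.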

S6, using the entity types and relationships extracted in step S3 to supplement the sentence bag data of step S5, increasing the number of sentence bags and balancing the number of sentence bags under different entity relationships;

the step S6 specifically includes:

S61, counting the number of sentence bags under each entity relationship, and finding the median of the sentence bag counts over all entity relationships;

S62, for each entity relationship whose number of sentence bags is less than the median, increasing the number of sentence bags under that entity relationship: the entities contained in the sentences of the existing sentence bags of that entity relationship are deleted, and the data of the corresponding entity types extracted in step S3 is filled into the deleted positions to form new sentence bags under that entity relationship, so that the number of sentence bags under entity relationships below the median is increased and the numbers of sentence bags under different entity relationships are balanced.
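Steps S61 and S62 can be sketched as below. The bag representation (a dictionary with `head`, `tail`, and `sentences`), the choice of the first existing bag as the template, and the use of plain string replacement to swap entities are all assumptions made for illustration.

```python
import statistics


def balance_bags(bags, structured_pairs):
    """Top up relationships whose bag count is below the median.

    bags: {relation: [bag, ...]}, each bag a dict with "head", "tail",
    and "sentences". structured_pairs: {relation: [(head, tail), ...]}
    taken from the structured data of step S3.
    """
    median = statistics.median([len(b) for b in bags.values()])
    balanced = {rel: list(rel_bags) for rel, rel_bags in bags.items()}
    for rel, rel_bags in bags.items():
        if not rel_bags:
            continue  # no template sentence available to rewrite
        template = rel_bags[0]
        for head, tail in structured_pairs.get(rel, []):
            if len(balanced[rel]) >= median:
                break
            # Swap the template's entities for a structured entity pair.
            sentences = [
                s.replace(template["head"], head).replace(template["tail"], tail)
                for s in template["sentences"]
            ]
            balanced[rel].append(
                {"head": head, "tail": tail, "sentences": sentences})
    return balanced


def bag(head, tail, sentence):
    return {"head": head, "tail": tail, "sentences": [sentence]}


bags = {
    "uses": [bag("SAT-A", "Ku", "SAT-A uses the Ku band"),
             bag("SAT-B", "Ka", "SAT-B uses the Ka band"),
             bag("SAT-C", "C", "SAT-C uses the C band")],
    "managed_by": [bag("SAT-A", "Dept-X", "SAT-A is managed by Dept-X"),
                   bag("SAT-B", "Dept-Y", "SAT-B is managed by Dept-Y")],
    "belongs_to": [bag("SAT-A", "NET-1", "SAT-A is filed under NET-1")],
}
structured_pairs = {"belongs_to": [("SAT-B", "NET-2"), ("SAT-C", "NET-3")]}
balanced = balance_bags(bags, structured_pairs)
```

With bag counts of 3, 2, and 1 the median is 2, so only `belongs_to` is topped up, and only until it reaches the median.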

S7, constructing an entity relationship extraction model, firstly screening noise sentences and correct label sentences in the sentence bags by using a reinforcement learning algorithm, and then training the entity relationship extraction model by using the correct label sentences and the noise sentences; the entity relation extraction model is realized by a segmented convolution neural network;

the step S7 includes the following steps:

S71, if the relationship between the entities contained in a sentence of a sentence bag is the sentence bag label of that sentence bag, the sentence is defined as a correct-label sentence; if the relationship between the entities contained in the sentence is not the sentence bag label of that sentence bag, the sentence is defined as a noise sentence; the sentences in the sentence bag and the sentence bag label are used as the input of the reinforcement learning algorithm;

S72, setting the agent of the reinforcement learning algorithm as a filter of correct-label sentences and noise sentences; the action $A_i$ of the agent on the i-th sentence is one of two types: judging the sentence to be a correct-label sentence, marked as 1, or judging the sentence to be a noise sentence, marked as -1; wherein $i$ is the serial number of the sentence in the input sentence bag and $A_i \in \{1, -1\}$; the expression of the action selection policy function of $A_i$ is:

wherein $\pi(A_i \mid S_i; \theta)$ denotes the probability of selecting action $A_i$ in state $S_i$, $S_i$ denotes the state of the agent at the i-th selection, $\theta$ denotes the parameters of the agent to be learned, $\sigma(\cdot)$ denotes the sigmoid function, and $W$ and $b$ denote the weight matrix and the bias to be learned, respectively;
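The policy expression itself is not reproduced in this text, so the sketch below rests on an assumed form: the probability of marking a sentence as correct ($A_i = 1$) is taken to be $\sigma(W \cdot S_i + b)$, and the probability of marking it as noise ($A_i = -1$) is the complement. This is a common choice for a binary sentence selector and may differ from the expression intended in the source; the state and parameter values are invented.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def action_probability(action, state, W, b):
    """pi(A_i | S_i; theta) under the ASSUMED form described above.

    action is 1 (correct-label sentence) or -1 (noise sentence).
    """
    p_correct = sigmoid(sum(w * s for w, s in zip(W, state)) + b)
    return p_correct if action == 1 else 1.0 - p_correct


def sample_action(state, W, b, rng=random):
    """Draw an action from the policy; rng is any object with random()."""
    return 1 if rng.random() < action_probability(1, state, W, b) else -1


# Toy state vector and parameters.
state = [0.2, -0.4, 1.0]
W = [0.5, 0.5, 1.0]
b = -0.1
p_correct = action_probability(1, state, W, b)
p_noise = action_probability(-1, state, W, b)
```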

S73, defining the state S of the agent as the vector formed by concatenating the vector representation of the sentences already selected as correctly labeled, the vector representation of the sentences already selected as noise, the vector representation of the current sentence, and the vector representation of the entity pair corresponding to the current sentence;

S74, after the agent takes an action on each sentence in the sentence bag, the agent receives a reward according to its actions; the reward for every action before the last one is set to 0, and the reward for the last action of the agent is set as:

wherein $B$ denotes a given sentence bag; $B_{sel+}$ is the current set of correct-label sentences and $r_+$ is the relationship corresponding to the correct-label sentences; $B_{sel-}$ is the current set of noise sentences and $r_-$ denotes no relationship, i.e., the NA relationship; $|\cdot|$ denotes the total number of sentences contained in a set, and $x_j$ denotes the j-th sentence in a sentence set;
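The reward expression is likewise not reproduced in this text. The sketch below assumes one plausible form suggested by the symbol definitions: the mean log-probability of $r_+$ over the sentences kept as correct plus the mean log-probability of the NA relationship over the sentences marked as noise, with the probabilities supplied by the relationship classifier. Both the form and the numbers are assumptions, not the source's formula.

```python
import math


def final_reward(p_pos_selected, p_na_noise):
    """Delayed reward for the last action under the ASSUMED form.

    p_pos_selected: p(r+ | x_j) for each sentence in B_sel+.
    p_na_noise:     p(NA | x_j) for each sentence in B_sel-.
    An empty set contributes 0 so the reward stays defined.
    """
    def mean_log(probs):
        if not probs:
            return 0.0
        return sum(math.log(p) for p in probs) / len(probs)

    return mean_log(p_pos_selected) + mean_log(p_na_noise)


# Invented classifier probabilities for one sentence bag.
reward = final_reward([0.9, 0.8], [0.7])
```

Under this assumed form the reward grows when the kept sentences fit the bag label well and the rejected sentences fit the NA relationship well, which is how noise sentences also contribute to training.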

S75, the optimization goal of the reinforcement learning algorithm is to maximize the expected total reward obtained by the agent; according to this goal, the optimization function is constructed as:

wherein the expectation term denotes the expected value of the reward obtained by the agent under the action set $[A_0, A_1, A_2, \ldots, A_n]$ and the state set $[S_0, S_1, S_2, \ldots, S_n]$, $n$ being the total number of actions selected;

S76, encoding the position of each word in the sentence text according to its distance from the entities, to obtain the position code of the sentence text;
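One common reading of step S76 is that each word is described by its signed distance to each of the two entities. The sketch below computes those distances; treating them as indices into learned position embeddings afterwards is an assumption, since the encoding itself is not detailed in the text.

```python
def relative_positions(num_tokens, head_index, tail_index):
    """Return, per token, its signed distance to the head and tail entity."""
    return [(k - head_index, k - tail_index) for k in range(num_tokens)]


# Invented example: head entity at token 0, tail entity at token 4.
tokens = ["SAT-A", "operates", "in", "the", "Ku", "band"]
positions = relative_positions(len(tokens), head_index=0, tail_index=4)
```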

S77, obtaining the word vectors of the words in the sentence with the word2vec tool, then concatenating the position codes and the word vectors to obtain the input matrix of the entity relationship extraction model, and extracting sentence features through a convolution operation, the formula of which is:

$$c_{ij} = w_i \, q_{j-m+1:j}, \quad 1 \le i \le n$$

wherein $w_i$ denotes the vector of the i-th convolution kernel of the entity relationship extraction model, $n$ denotes the number of convolution kernels, $m$ denotes the length of the convolution kernels, $j$ denotes the row index of the input matrix, $q_{i:j}$ denotes the matrix formed by the elements from the i-th row to the j-th row of the input matrix, and $c_{ij}$ denotes the result of convolving the i-th convolution kernel with the matrix formed by the elements from the (j-m+1)-th row to the j-th row of the input matrix; the vector formed by the results of all convolution operations is divided into several parts according to the row numbers of the vectors corresponding to the entities in the input matrix, and max pooling is then performed on each part to obtain the result vector of the segmented pooling;
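The convolution and segmented max pooling of step S77 can be illustrated for a single kernel as follows. Zero-padding the top of the input so that every row $j$ has a full window, splitting into three segments at the two entity rows, and the toy matrix values are assumptions of this sketch.

```python
def convolve(kernel, rows):
    """Return c_j = <kernel, rows[j-m+1 .. j]> for every row j.

    The input is zero-padded at the top so the first rows have a full window.
    """
    m, width = len(kernel), len(rows[0])
    padded = [[0.0] * width for _ in range(m - 1)] + rows
    return [
        sum(k * x
            for kernel_row, input_row in zip(kernel, padded[j:j + m])
            for k, x in zip(kernel_row, input_row))
        for j in range(len(rows))
    ]


def piecewise_max_pool(values, head_row, tail_row):
    """Split at the two entity rows into three segments and max-pool each."""
    segments = [
        values[:head_row + 1],
        values[head_row + 1:tail_row + 1],
        values[tail_row + 1:],
    ]
    return [max(seg) if seg else 0.0 for seg in segments]


# Toy input matrix (5 words, 2 features) and one kernel of length m = 2.
rows = [[1, 0], [0, 2], [3, 1], [-1, 0], [2, 2]]
kernel = [[1, 0], [0, 1]]
c = convolve(kernel, rows)
pooled = piecewise_max_pool(c, head_row=1, tail_row=3)
```

Pooling each segment separately keeps information about where a feature occurred relative to the two entities, which a single max over the whole sentence would discard.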

S78, concatenating the result vectors obtained after the segmented pooling, sending the concatenated result to the softmax layer of the entity relationship extraction model, and outputting the probabilities of all relationship categories, wherein the relationship categories comprise seven categories, namely the six defined entity relationships and no relationship (the NA category); the relationship category with the maximum probability is the relationship classification result finally extracted for the entities of the satellite frequency-orbit data.

S8, inputting the named entity information obtained in step S4 and the corresponding sentences into the entity relationship extraction model obtained in step S7 to obtain the relationship classification result of the entities in each sentence, completing relationship extraction for the named entities of the satellite frequency-orbit data.

S9, the entity extracted from the unstructured data and the relation thereof are represented by a triple, the triple is compared with the data in the triple set T, and if the triple data already exists in the triple set T, the triple data is not added; and if the data of the triples does not exist in the triple set T, adding the extracted entity and the triple data of the relationship thereof into the set T, and realizing the expansion of the structured satellite frequency-orbit data set represented in the form of the triples.
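Step S9 amounts to a set union with a membership check, as the short sketch below shows. The triples are invented placeholders.

```python
def merge_triples(T, extracted):
    """Add to T only the triples it does not already contain.

    Returns the list of triples that were actually added.
    """
    added = []
    for triple in extracted:
        if triple not in T:
            T.add(triple)
            added.append(triple)
    return added


# T comes from the structured data; extracted comes from unstructured text.
T = {("SAT-A", "uses", "Ku")}
extracted = [("SAT-A", "uses", "Ku"), ("SAT-B", "uses", "Ka")]
added = merge_triples(T, extracted)
```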

The invention has the beneficial effects that:

The invention realizes a satellite frequency-orbit data information extraction method based on an enhanced sample model, which can conveniently complete relation extraction from satellite frequency-orbit data and enrich the satellite frequency-orbit knowledge base. The invention adopts a Pipeline approach, which makes the scheme highly flexible. The invention makes full use of the existing structured data, addresses the long-tail problem of the data, and improves the accuracy of relation extraction.

Drawings

FIG. 1 is a flow chart of an implementation of a method for extracting satellite frequency-orbit data information based on an enhanced sample model according to the present invention;

FIG. 2 is an example of a BIO annotation mode annotation text in the present invention;

FIG. 3 is a schematic diagram of the components of the BERT-based named entity recognition model of the present invention.

Detailed Description

For a better understanding of the present disclosure, two examples are given herein.

The present invention will be described in detail below with reference to the accompanying drawings.

Aiming at the noise and long-tail data problems introduced by traditional remote supervision, the invention discloses a technical scheme for extracting satellite frequency-orbit data relationships by remote supervision based on reinforcement learning. The scheme has the following characteristics: 1. correct-label sentences and noise sentences are identified by reinforcement learning, and the noise is used as part of the training data for the relation extraction model; 2. structured data is introduced: corpora of the corresponding classes are generated from the text of the sentence-bag classes that need supplementing, which supplements the unstructured training data set and addresses the long-tail imbalance of the corpus. Here, satellite frequency-orbit data refers to data on satellite frequencies and orbits.

Example 1:

the invention discloses a satellite frequency orbit data information extraction method based on an enhanced sample model, the implementation flow of which is shown in figure 1, and the basic steps of the method comprise:

101. defining the entity types and the relation set between entities;

102. extracting entities of the predefined types and their relationships from the SRS database data, and establishing a triple set T;

103. labeling the unstructured text data with BIO tags, training a sequence labeling prediction model, and completing the recognition of satellite frequency-orbit named entities;

104. forming sentences that contain the same entity pair into a sentence bag, marking the relationship of the corresponding entity pair types as the sentence bag label, and using the structured data to generate corpora that supplement the unstructured data and balance the data;

105. selecting the correct class and the noise class within each bag, and training a relation classification model;

106. fusing the entities and relationships extracted from the unstructured data, expressed as triples, with the set T.

The method comprises the following specific steps:

S4, extracting knowledge from the unstructured satellite frequency-orbit data: obtaining text data of the unstructured satellite frequency-orbit data from the Internet by data crawling, segmenting the text data into words to obtain a word segmentation sequence, labeling the word segmentation sequence with the BIO labeling method, and taking the labeled text as a training sentence library; fine-tuning a BERT-based pre-trained model to form a BERT-based named entity recognition model; training the BERT-based named entity recognition model with the training sentence library;

$$P_t = \mathrm{softmax}(h_t W_0 + b_0), \quad t = 1, 2, \ldots, N$$

S5, according to the classification result of the step S44, a sentence containing the entity type defined in the step S1 is screened out; in the screened sentences, for the sentences containing the entities with the same entity type, packaging the sentences to be used as a sentence bag, and marking the entity relationship among the entities in the sentences as a sentence bag label;

S6, using the entity types and relationships extracted in step S3 to supplement the sentence bag data of step S5, increasing the number of sentence bags, balancing the number of sentence bags under different entity relationships, and correcting the bias of the entity relationship extraction model caused by the long-tail problem of the data set;

the step S6 specifically includes:

S62, for each entity relationship whose number of sentence bags is less than the median, increasing the number of sentence bags under that entity relationship: the entities contained in the sentences of the existing sentence bags of that entity relationship are deleted, and the data of the corresponding entity types extracted in step S3 is filled into the deleted positions to form new sentence bags under that entity relationship, so that the number of sentence bags under entity relationships below the median is increased and the data volume under different relationship types is balanced.

the step S7 includes the following steps:

wherein $B$ denotes a given sentence bag; $B_{sel+}$ is the current set of correct-label sentences and $r_+$ is the relationship corresponding to the correct-label sentences; $B_{sel-}$ is the current set of noise sentences and $r_-$ denotes no relationship, i.e., the NA relationship; $|\cdot|$ denotes the total number of sentences contained in a set;

$$c_{ij} = w_i \, q_{j-m+1:j}, \quad 1 \le i \le n$$

wherein $w_i$ denotes the vector of the i-th convolution kernel of the entity relationship extraction model, $n$ denotes the number of convolution kernels, $m$ denotes the length of the convolution kernels, $j$ denotes the row index of the input matrix, $q_{i:j}$ denotes the matrix formed by the elements from the i-th row to the j-th row of the input matrix, and $c_{ij}$ denotes the result of convolving the i-th convolution kernel with the matrix formed by the elements from the (j-m+1)-th row to the j-th row of the input matrix; the vector formed by the results of all convolution operations is divided into three parts $[c_{i1}, c_{i2}, c_{i3}]$ according to the row numbers of the vectors corresponding to the entities in the input matrix, and max pooling is then performed on each part to obtain the result vector of the segmented pooling,

$$p_{ij} = \max(c_{ij}), \quad 1 \le i \le n, \quad 1 \le j \le 3,$$

wherein $p_{ij}$ denotes the result after max pooling;

Example 2:

as shown in fig. 1, the present invention describes a method for extracting satellite frequency-orbit information, which comprises the following specific steps:

S1, defining entity types: according to the task requirements, six entity types are defined: satellite name, satellite network ID, governing department, orbit position, orbit type, and frequency band;

S2, defining the set of relationships between entities: on the basis of the entity types defined in step S1, the defined relationships between entities include: (satellite name, belonging to, satellite network ID), (satellite name, managed, governing department), (satellite name, orbital), (orbital type, suborbital, satellite name), (satellite name, usage, frequency band), (governing department, owning network, satellite network ID);

S3, extracting knowledge from the structured satellite frequency-orbit data, which mainly comprises structured data preprocessing, entity recognition, and entity relationship extraction;

S3-1, data preprocessing: selecting the data of the corresponding entity types from the SRS database according to the predefined entity types, and storing it in an Excel document;

S3-2, frequency-orbit data entity recognition: first matching the corresponding row and column data in the Excel document according to the defined entity types and their attributes, and selecting the relevant entity nodes;

S3-3, entity relationship extraction: for the entity nodes selected from the database, matching the relationship between the corresponding entities according to the entity type represented by the column in which each entity node is located and the relation set defined in step S2;

S3-4, establishing a triple set T from each entity pair and its corresponding relationship;

S4, in the unstructured satellite frequency-orbit data recognition and extraction stage, first crawling unstructured text data and segmenting it into words, then labeling the crawled and segmented data with the BIO labeling method, using the labeled text as a training sentence library, training a BERT-based named entity recognition model with the training sentence library, and finally completing named entity recognition for the satellite frequency-orbit data:

S4-1, first crawling unstructured data about satellite frequency-orbit knowledge and performing word segmentation; the defined named entity categories, namely satellite name, satellite network ID, governing department, orbit position, satellite type, and frequency band, are marked as six labels A1, A2, A3, A4, A5, and A6, respectively; the sentences in the training data set are labeled with the BIO labeling method to form the training sentence library, as shown in FIG. 2;

S4-2, the BERT pre-trained model can learn the semantic associations of text, and the model is adjusted to suit the entity recognition task. The overall structure comprises an input layer, a coding layer, and a sequence labeling layer; the model is trained with the self-built training sentence library; the input layer representation v is the sum of the input word vectors, segment vectors, and position vectors; v passes through multiple Transformer layers to learn the semantic associations of the sentence, which are expressed as h;

S4-3, the sequence labeling layer outputs the probability distribution $P_t$ of the input sequence at each time step $t$ under the BIO labeling method:

$$P_t = \mathrm{softmax}(h_t W_0 + b_0), \quad t = 1, 2, \ldots, N$$

wherein $h_t$ denotes the component of $h$ at time $t$, $W_0$ denotes the weight of the fully connected layer, and $b_0$ denotes the bias of the fully connected layer;

S4-4, after the classification probability distribution corresponding to each word is obtained, the model parameters are learned with a cross-entropy loss function to improve the classification prediction capability of the model; the trained model can correctly classify each word, and complete entity names and types can be obtained from the resulting BIO labels, finally achieving named entity recognition for the satellite frequency-orbit data.

S5, selecting sentences containing predefined entity types, packing the sentences containing the same named entity pairs as a sentence bag, and marking the relationship of the corresponding entity pair types as sentence bag labels.

S6, using the entities and relation knowledge extracted in S3 to supplement the sentence bag data of S5, which increases the number of sentence bags, balances the data, and reduces the model bias caused by the long-tail problem of the data set, specifically comprising the following steps:

S6-1, counting the number of sentence bags under each relation category and finding the median of these counts;

S6-2, for each relation category whose number of sentence bags is below the median, increasing the number of sentence bags under that category: new sentences are generated by filling the entities of that relation extracted in S3 into the entity positions of the text in the existing sentence bags, so that the data volume is balanced across relation categories.
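
Steps S6-1 and S6-2 can be sketched as follows, assuming that each existing sentence has been turned into a pattern with `{head}` and `{tail}` slots at its entity positions. The function name `augment_bags`, the relation names and all data values are hypothetical.

```python
from statistics import median

def augment_bags(bags_by_relation, templates, structured_triples):
    """Top up under-represented relations to the median bag count.

    bags_by_relation   : relation -> list of bags (each bag is a list of sentences)
    templates          : relation -> sentence pattern with {head} and {tail} slots,
                         taken from an existing sentence of that relation
    structured_triples : (head, relation, tail) triples from the structured stage
    """
    target = median(len(bags) for bags in bags_by_relation.values())
    augmented = {rel: list(bags) for rel, bags in bags_by_relation.items()}
    for head, rel, tail in structured_triples:
        if rel in augmented and len(augmented[rel]) < target:
            sentence = templates[rel].format(head=head, tail=tail)
            augmented[rel].append([sentence])
    return augmented, target

# Hypothetical bag counts: 3, 5 and 1, so the median is 3.
bags_by_relation = {
    "uses": [["s1"], ["s2"], ["s3"]],
    "managed_by": [["s%d" % k] for k in range(4, 9)],
    "in_orbit": [["The orbit position of SAT-A is 10E."]],
}
templates = {"in_orbit": "The orbit position of {head} is {tail}."}
structured_triples = [
    ("SAT-B", "in_orbit", "20E"),
    ("SAT-C", "in_orbit", "30E"),
    ("SAT-D", "in_orbit", "40E"),
    ("SAT-B", "uses", "Ku"),
]
augmented, target = augment_bags(bags_by_relation, templates, structured_triples)
```

Only the relation below the median receives generated bags, and generation stops once the median count is reached, so the third "in_orbit" triple is not used in this example.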

S7, screening the noise sentences and the correct-label sentences in the sentence bags by reinforcement learning, and training a segmented convolutional neural network on both the correct-label sentences and the noise sentences, which reduces the influence of the noise introduced by distant supervision and improves the accuracy of the entity relationship extraction model, specifically comprising the following steps:

S7-1, a sentence in a sentence bag whose actual entity relationship differs from the label of the sentence bag is defined as a noise sentence; otherwise the sentence is defined as a correct-label sentence; the sentences in the sentence bags and the sentence bag relation labels are input into the reinforcement learning algorithm;

S7-2, the agent is set as a filter that separates correct-label sentences from noise sentences, and its action A_i takes one of two values: 1, meaning the relation label of the sentence is judged correct, or -1, meaning the relation label is judged incorrect and the sentence is regarded as a noise sentence; here i is the index of the sentence in the input sentence bag, A_i ∈ {1, -1}, and the action selection policy function π_θ of A_i is:

where σ(·) denotes the sigmoid function with parameters (W, b);

S7-3, the state S of the agent is defined as the vector obtained by concatenating the average vector of the selected correct-label sentences, the average vector of the selected noise sentences, the vector representation of the current sentence, and the vectors of the corresponding entities;

S7-4, the agent obtains a reward after it has acted on the sentences in each sentence bag; the reward for every action before the last is 0, and the reward for the last action is set to:

where B denotes a sentence bag; B_sel+ is the current set of correct-label sentences and r₊ is the relation corresponding to the correct-label sentences; B_sel- is the current set of noise sentences and r₋ is the NA relation; |·| denotes the total number of sentences in a set; the reward takes the influence of both the correct-label sentences and the noise sentences into account and can therefore guide model training more effectively;

S7-5, the optimization objective of the reinforcement learning algorithm is to maximize the expected total reward obtained by the agent, and the optimization function is accordingly defined as:

S7-6, position-encoding the text data according to the distance between each word in a sentence and each entity; for example, in the sentence 'the orbit position of the Fengyun-4 01 satellite is 99.5° east longitude', the entities are 'Fengyun-4 01 satellite' and '99.5° east longitude', and the position codes of the sentence text are [0, 1, 2, 3, 4] relative to the first entity and [-4, -3, -2, -1, 0] relative to the second;
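
The position coding in S7-6 amounts to the signed distance of each word to an entity, as the short sketch below shows. The five-token segmentation and the placeholder token strings are assumptions chosen to match the two code sequences given in the example.

```python
def relative_positions(tokens, entity_index):
    """Signed distance of every token to the token at entity_index."""
    return [i - entity_index for i in range(len(tokens))]

# Assumed segmentation: the first and last tokens are the two entities.
tokens = ["<satellite>", "'s", "orbit position", "is", "<longitude>"]
pos_first = relative_positions(tokens, 0)
pos_second = relative_positions(tokens, 4)
```

With the entities at token indices 0 and 4, `pos_first` is [0, 1, 2, 3, 4] and `pos_second` is [-4, -3, -2, -1, 0].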

S7-7, obtaining word vectors for the words in a sentence with word2vec, concatenating the position codes with the word vectors, and extracting features by convolution; the convolution operation is:

c_ij = w_i q_{j-m+1:j}, 1 ≤ i ≤ n

where w_i denotes the i-th convolution kernel, n denotes the number of convolution kernels, m denotes the length of a convolution kernel, j denotes the row index of the input, q_{i:j} denotes the elements from q_i to q_j of the input sequence, and c_ij denotes the result of the convolution. The convolution result of each kernel is divided into three parts [c_i1, c_i2, c_i3], and segmented pooling is then carried out:

p_ij = max(c_ij), 1 ≤ i ≤ n, 1 ≤ j ≤ 3

S7-8, concatenating the pooled vectors and feeding them to a softmax layer, which outputs the probabilities of all relation categories, seven in total: the six predefined relations plus no relation (the NA category); the category with the maximum probability is the relation classification finally extracted for the satellite frequency-orbit data entities.
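
The convolution, segmented pooling and softmax steps of S7-7 and S7-8 can be sketched in pure Python as follows. The zero padding before the sentence start, the choice of segment boundaries at the two entity positions, the toy input and all weight values are assumptions; a trained model would learn the kernels and output weights.

```python
import math

def conv1d(q, w):
    """c_j = sum(w * q[j-m+1 : j+1]) for one kernel w with m rows.

    Rows before the start of the sentence are treated as zeros (assumed padding).
    """
    m, d = len(w), len(q[0])
    out = []
    for j in range(len(q)):
        total = 0.0
        for k in range(m):
            row = j - m + 1 + k
            if row >= 0:
                total += sum(w[k][t] * q[row][t] for t in range(d))
        out.append(total)
    return out

def piecewise_max_pool(c, head_idx, tail_idx):
    """Max-pool c over three segments delimited by the two entity positions."""
    segments = [c[: head_idx + 1], c[head_idx + 1 : tail_idx + 1], c[tail_idx + 1 :]]
    return [max(seg) if seg else 0.0 for seg in segments]

def softmax(z):
    top = max(z)
    exps = [math.exp(v - top) for v in z]
    total = sum(exps)
    return [e / total for e in exps]

# Toy input: 6 tokens, each a 2-dimensional vector (in practice the word
# vector concatenated with the position codes).
q = [[1, 0], [0, 1], [2, 0], [0, 3], [1, 1], [0, 0]]
kernels = [[[1, 0], [0, 1]]]   # n = 1 kernel of length m = 2
head_idx, tail_idx = 1, 3      # token indices of the two entities

features = []
for w in kernels:
    c = conv1d(q, w)
    features.extend(piecewise_max_pool(c, head_idx, tail_idx))

# Arbitrary output weights for 7 relation categories (6 relations + NA).
W_out = [[0.1 * (r + 1)] * len(features) for r in range(7)]
logits = [sum(wr * f for wr, f in zip(row, features)) for row in W_out]
probs = softmax(logits)
predicted = probs.index(max(probs))
```

Each kernel contributes three pooled values, so `features` has length 3n, and `predicted` is the index of the relation category with the largest probability.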

S8, inputting the named entity information obtained in S4 and the corresponding sentences into the relation extraction model trained in S7 to obtain the relation classification, which completes relation extraction for the named entities of the satellite frequency-orbit data.

S9, comparing each newly extracted entity-relation triple with the data in the triple set T: if the set already contains the triple, it is not added; if the set does not contain the triple, it is added to T, thereby expanding the structured data set represented in triple form.
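
The update of the triple set T in S9 is a duplicate-free set insertion, sketched below. The function name `merge_triples` and the example triples are hypothetical.

```python
def merge_triples(T, new_triples):
    """Add to T only the triples it does not already contain; return those added."""
    # dict.fromkeys removes repeats inside new_triples while keeping their order.
    added = [t for t in dict.fromkeys(new_triples) if t not in T]
    T.update(added)
    return added

# Hypothetical triples in (entity, relation, entity) form.
T = {("SAT-A", "usage", "Ku")}
new_triples = [
    ("SAT-A", "usage", "Ku"),
    ("SAT-A", "belonging to", "NET-1"),
    ("SAT-A", "belonging to", "NET-1"),
]
added = merge_triples(T, new_triples)
```

Representing T as a set of tuples makes the membership check in S9 a constant-time lookup on average.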

The foregoing is illustrative of the present application and is not to be construed as limiting thereof. Various modifications and changes may occur to those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application should be included in the scope of the claims of the present application.

Claims

1. A satellite frequency-orbit data information extraction method based on an enhanced sample model, characterized by comprising the following specific steps:

S1, defining the entity types of the satellite frequency-orbit data according to the task requirements of satellite frequency-orbit data information extraction, the six defined entity types being: satellite name, satellite network ID, department in charge, orbit position, orbit type and frequency band; an entity being a satellite communication subject in the satellite frequency-orbit data;

S2, defining the relation set among the entities: on the basis of the entity types defined in step S1, the relations among the entities are defined, each relation being represented by a triple;

S7, constructing an entity relationship extraction model: first screening the noise sentences and the correct-label sentences in the sentence bags with a reinforcement learning algorithm, and then training the entity relationship extraction model with both the correct-label sentences and the noise sentences;

S8, inputting the named entity information obtained in step S4 and the corresponding sentences into the entity relationship extraction model obtained in step S7 to obtain the relation classification result of the entities in the sentences, which completes relation extraction for the named entities of the satellite frequency-orbit data;

2. The satellite frequency-orbit data information extraction method based on the enhanced sample model as claimed in claim 1, wherein the relations among the entities specifically include: (satellite name, belonging to, satellite network ID), (satellite name, managed, governing department), (satellite name, orbital), (orbital type, orbital, satellite name), (satellite name, usage, frequency band) and (governing department, owning network, satellite network ID); all of these relations together constitute the relation set among the entities.

3. The method as claimed in claim 1, wherein the entity relationship extraction model is implemented by a segmented convolutional neural network.

4. The satellite frequency-orbit data information extraction method based on the enhanced sample model as claimed in claim 1, wherein

the step S3 specifically includes:

S34, establishing the triple set T from every pair of entities and the relation between them.

5. The satellite frequency-orbit data information extraction method based on the enhanced sample model as claimed in claim 1, wherein

the step S4 specifically includes:

P_t = softmax(h_t W₀ + b₀), t = 1, 2, ..., N

6. The satellite frequency-orbit data information extraction method based on the enhanced sample model as claimed in claim 1, wherein

the step S6 specifically includes:

7. The satellite frequency-orbit data information extraction method based on the enhanced sample model as claimed in claim 1, wherein

the step S7 includes the following steps:

where B denotes a sentence bag; B_sel+ is the current set of correct-label sentences and r₊ is the relation corresponding to the correct-label sentences; B_sel- is the current set of noise sentences and r₋ denotes no relation, i.e. the NA relation; |·| denotes the total number of sentences contained in a set; x_j denotes the j-th sentence in the sentence set;

wherein

c_ij = w_i q_{j-m+1:j}, 1 ≤ i ≤ n

where w_i denotes the vector of the i-th convolution kernel of the entity relationship extraction model, n denotes the number of convolution kernels, m denotes the length of a convolution kernel, j denotes the row index of the input matrix, q_{i:j} denotes the matrix formed by the i-th to j-th rows of the input matrix, and c_ij denotes the result of applying the i-th convolution kernel to the matrix formed by rows j-m+1 to j of the input matrix; the vector formed by all convolution results is divided into several parts according to the row indices of the entity vectors in the input matrix, and maximum pooling is then applied to each part to obtain the segmented pooling result vector;

S78, concatenating the result vectors obtained from segmented pooling and feeding the concatenation to the softmax layer of the entity relationship extraction model, which outputs the probabilities of all relation categories; the relation categories comprise seven categories, namely the six defined entity relations plus no relation, and the relation category with the maximum probability is the relation classification result finally extracted for the entities of the satellite frequency-orbit data.