CN112199508A - Parameter adaptive agricultural knowledge graph recommendation method based on remote supervision - Google Patents
- Publication number
- CN112199508A (application No. CN202010794151.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- entity
- agricultural
- text
- relation
- Prior art date
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06F16/367: Information retrieval of unstructured textual data; creation of semantic tools: Ontology
- G06F16/355: Information retrieval of unstructured textual data; clustering; classification: class or cluster creation or modification
- G06F16/951: Retrieval from the web: indexing; web crawling techniques
- G06F40/242: Natural language analysis; lexical tools: dictionaries
- G06F40/284: Natural language analysis; recognition of textual entities: lexical analysis, e.g. tokenisation or collocates
- G06F40/289: Natural language analysis; recognition of textual entities: phrasal analysis, e.g. finite state techniques or chunking
- G06Q50/02: Systems or methods specially adapted for specific business sectors: Agriculture; Fishing; Mining
Abstract
The invention discloses a parameter-adaptive agricultural knowledge-graph recommendation method based on remote supervision, comprising the following steps: text data are crawled with the Scrapy crawler framework and preprocessed, and a predicted text-classification data set Predict_data is obtained with a KNN algorithm classifier; when the Chinese crop corpus is processed, the predicted entity-classification results are mapped into the Wikipedia Chinese corpus to construct a Chinese entity dictionary. A parameter-adaptive neural-network model based on an improved remote-supervision algorithm is then built; the model adaptively searches for the parameters that make relation extraction perform best, automatically labels the text data, and obtains the relations between entities. The method improves the accuracy of relation extraction and, by exploiting agricultural text information, provides effective information screening for plant-cultivation enthusiasts.
Description
Technical Field
The invention belongs to the technical field of knowledge graphs and neural networks, and particularly relates to a parameter-adaptive agricultural knowledge-graph recommendation method based on remote supervision.
Background
The agricultural knowledge graph combines agriculture's regional, climatic, and product-diversity characteristics and uses the entity relations and concepts of the agricultural domain to build an intelligent auxiliary system that mines agriculture's latent value. Compared with the traditional agricultural information query mode, it combines visualization technology with an agricultural knowledge base to display and analyze the retrieved data, a new development of Chinese informetrics. The agricultural knowledge-graph service system provided by the invention can therefore analyze the environment and climate suitable for crop growth using data from an agricultural knowledge service system, the agricultural interactive encyclopedia, and Wikipedia, provides effective assistance to agricultural research institutes and plant-cultivation enthusiasts, and quickly retrieves the required information from an internet of exploding big data.
The existing research bases of Zhu Quanyin et al. include:
- Classification and extraction algorithm of Web science and technology news [J]. Journal of Huaiyin Institute of Technology, 2015, 24(5): 18-24;
- Li Xiang, Zhu Quanyin. Collaborative filtering recommendation with collaborative clustering and a shared scoring matrix [J]. Computer Science and Exploration, 2014, 8(6): 751-;
- Quanyin Zhu, Suqun Cao. A Novel Classifier-independent Feature Selection Algorithm for Imbalanced Datasets. 2009, p. 77-82;
- Quanyin Zhu, Yunyang Yan, Jin Ding, Jin Qian. The Case Study for Price Extracting of Mobile Phone Sell Online. 2011, p. 282-285;
- Quanyin Zhu, Suqun Cao, Pei Zhou, Yunyang Yan, Hong Zhou. Integrated Price Forecast based on Dichotomy Backfilling and Disturbance Factor Algorithm. International Review on Computers and Software, 2011, Vol. 6(6): 1089-;
- Li Xiang, Zhu Quanyin, Hu Ronglin, Zhou Hong. A cold-chain logistics stowage intelligent recommendation method based on spectral clustering. Chinese patent publication No. CN105654267A, 2016.06.08;
- Cao Suqun, Zhu Quanyin, Zuo Xiaoming, Gao Shangbing, et al. A feature selection method for pattern classification. Chinese patent publication No. CN103425994A, 2013.12.04;
- Liu Jinling, Feng Wanli, Zhang Yahong. Chinese text clustering method based on rescaling [J]. Computer Engineering and Applications, 2012, 48(21): 146-;
- Zhu Quanyin, Li Xiang, Xu Kang, et al. A network behavior habit clustering method based on K-means and LDA bidirectional verification. Chinese patent publication No. CN106202480A, 2016.12.07.
the traditional knowledge graph construction method relates to agricultural knowledge and relation extraction, and aims at the problems that: huihong remote supervision relation extraction method and device, chinese patent publication no: CN110209836A,2019.5.17, belonging to the application of remote supervision algorithm, aiming at generating an entity recognition training data set through bootstrap algorithm and recognizing the entity of a sentence through crf + + tool; generating an entity relation extraction training data set through a remote supervision method, and generating an entity relation extraction data set through a relation knowledge base and a natural language corpus; the method can automatically label training data through natural corpus to complete entity recognition and entity relationship extraction; sun encourage, an assistant diagnosis and treatment system based on knowledge map, chinese patent publication no: CN110459320A,2019.11.15, belonging to the field of medical diagnosis and treatment, aiming at defining the patient status between two successive medical operations as the side; the system comprises a patient information processing module, a diagnosis and treatment scheme pushing module, a patient information processing module and a diagnosis and treatment decision module, wherein the patient information processing module is used for receiving patient information, extracting historical medical operation and patient state information, sending the historical medical operation and the patient state information to the diagnosis and treatment scheme pushing module, matching the patient information with a knowledge graph, determining the position of the current state of a patient in the knowledge graph, pushing a medical index to be detected and/or next diagnosis and treatment operation based on the knowledge graph, quickly knowing the diagnosis and treatment stage of the patient, and giving a next; the Chinese patent publication No. 
is: CN110400327A,2019.11.1 belongs to the field of crop image segmentation, and aims to realize self-adaptive adjustment of PCNN model parameters in nighttime image segmentation of tomato plants, reduce PCNN iteration times and improve the real-time performance of algorithm application. However, at present, a system and a method for adopting a parameter adaptive optimization model combined with a neural network to identify entities and extract relationships in the agricultural field, construct a knowledge graph in the agricultural field and make an auxiliary decision do not exist.
Screening algorithm based on heuristic rules:
Information-filtering technology helps users find the information they are interested in more quickly. Information filtering is generally used to process large amounts of text and to filter out unwanted information in a targeted manner. A rule is a knowledge-representation method: modifying or replacing one rule does not affect the others; the rule base stores domain knowledge of many different categories, and the obtained rules can drive a chain of inference and prediction that finally yields a category. At present, information screening at home and abroad is performed through keywords with heuristic-rule screening methods.
Remote supervision algorithm:
The remote-supervision algorithm labels sentences in external documents with relation labels based on an existing, manually labeled knowledge graph; it is a semi-supervised algorithm. In the training stage, crop-related entities are first extracted from sentences; if two entities stand in a given relation in the corpus, the texts in the test set that mention both entities are assumed to express that same relation. The extracted text features are concatenated into a word vector, which serves as the feature vector of those texts. For this system, the proposed scheme is as follows: the existing triples are mapped onto a massive unstructured database to generate a large amount of training data, and the knowledge sources are diverse, such as manual labeling, an existing knowledge base, and specific sentence structures. For example: given a data set X = {x1, x2, x3, …, xn}, the relation h1 maps X to a space A = {A1, A2, A3, …, AM}, and the relation h2 then maps the space A to a space K = {K1, K2, K3, …, Kr}.
Algorithm based on the PCNN neural network model:
Traditional lexical features include the entities themselves, the word sequence between agricultural-product entities, hypernyms of words, and so on; such features depend on a manual feature-engineering process. Lexical-level features: words are converted to word vectors, and the word vectors represent the lexical-level functions. Syntactic-level features: context is taken into account by setting a sliding window of size K; after every K characters are read, the window slides forward one position, finally yielding a group of sentence-level features comprising word features and position features. The PCNN algorithm improves on the CNN algorithm at the pooling layer: the sentence is divided into segments according to the positions of the entity pair, max-pooling is performed on each segment independently to obtain the maximum value of each segment, and the maxima are finally concatenated into the feature vector.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the invention provides a parameter-adaptive agricultural knowledge-graph recommendation method based on remote supervision, which analyzes the environment and climate suitable for crop growth from the data of an agricultural knowledge service system, the agricultural interactive encyclopedia, and Wikipedia, provides effective assistance to agricultural research institutes and plant-cultivation enthusiasts, and quickly retrieves the required information from an internet of exploding big data.
The technical scheme is as follows: to solve the above technical problems, the invention provides a parameter-adaptive agricultural knowledge-graph recommendation method based on remote supervision, comprising the following specific steps:
(1) performing data preprocessing on the data crawled from the agricultural knowledge service system, the obtained data set being defined as Agri_data; crawling the agricultural interactive encyclopedia with Scrapy, the crawled data set being defined as HuDong_data; performing Chinese word segmentation and word-vector training on the text data in Agri_data and HuDong_data, the obtained data set being defined as Train_data;
(2) training a KNN algorithm model with the data set Train_data, performing feature extraction on the text data with the fastText classification tool, and comparing text similarity with the cosine-similarity algorithm to obtain the text entity classification T;
(3) predicting the entity classification result with the KNN algorithm model of step (2), storing it in the Predict_data set, and mapping the entities in Predict_data to the entity data in the Wikidata entity-relation data set Result_data to obtain the data set Wtrain_data;
(4) constructing an entity dictionary over the Wikipedia Chinese word stock with the heuristic-rule screening algorithm, and preprocessing the text data of filter_Wtrain_data to obtain the wikidataRelation data set;
(5) building the PCNN, CNN, RNN, and BiRNN neural network models respectively;
(6) comparing the four algorithm models to obtain the parameter-adaptive relation-extraction model M for the agricultural knowledge-graph domain;
(7) extracting the relations between entities from the agricultural-domain text data, rendering the entity-relation data with ECharts, and displaying the recommendation results on the web end.
Further, the specific steps of obtaining the data set Train_data in step (1) are as follows:
(1.1) running the data crawler and selecting the pages to crawl;
(1.2) selecting a page;
(1.3) selecting an agricultural knowledge service system;
(1.4) crawling the agricultural knowledge service system, obtaining the system's html files, limiting the crawling range with the front-end div elements, and obtaining each crop's name Title, detailed content Detail, photos ImageList, and web link Url; forming the data item Aitem = {Title, Detail, ImageList, Url} and the data set Agri_data = {Aitem1, Aitem2, …, Aitemn}; executing step (1.8);
(1.5) selecting an agricultural interactive encyclopedia;
(1.6) crawling the content of the agricultural interactive encyclopedia with Scrapy: declaring the crawler's address domain, acquiring the word list, constructing the original json file, generating the Url list, acquiring the crop Title, and crawling the picture list ImageList and the open-domain label list openTypeList;
(1.7) each crawled entity corresponds to one entry of the agricultural encyclopedia, an entry comprising the Title, the interactive-encyclopedia link Url, the pictures ImageList, the open classification list TypeList, the detailed information InfoList, and the basic information ValueList; forming the data item Hitem = {Title, Url, ImageList, TypeList, InfoList, ValueList} and the data set HuDong_data = {Hitem1, Hitem2, …, Hitemn};
(1.8) acquiring the two data sets Agri_data and HuDong_data;
(1.9) performing part-of-speech screening on the data in the two data sets Agri_data and HuDong_data;
(1.10) discarding words that contain characters other than Chinese or English, or that contain numeric characters;
(1.11) performing Chinese word segmentation and word-vector training on Agri_data and HuDong_data respectively;
(1.12) obtaining a data set Train _ data.
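The token filter of steps (1.9)-(1.10) can be sketched with a single regular expression. This is an illustrative reading of "discarding words that contain characters other than Chinese or English, or numeric characters", not the exact implementation:

```python
import re

# Keep only tokens made entirely of Chinese characters (CJK Unified
# Ideographs) or ASCII letters; tokens with digits or symbols are dropped.
VALID = re.compile(r"^[\u4e00-\u9fa5A-Za-z]+$")

def filter_tokens(tokens):
    return [t for t in tokens if VALID.match(t)]

print(filter_tokens(["水稻", "nitrogen", "pH7", "2020", "小麦-3"]))
# → ['水稻', 'nitrogen']
```

Tokens like "pH7" or "小麦-3" fail the whole-string match because of the digit and the hyphen, so only clean Chinese or English words survive into word-vector training.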
Further, the specific steps of obtaining the text entity classification T in the step (2) are as follows:
(2.1) feeding the text data set Train_data into the KNN text classifier; defining the mean of each component as Mean, the variance of each component as Var, the inverse document frequency as Text_IDF, and the number of texts as Item_Num; defining the similarity weights as Weight{Title, TypeList, Detail, InfoList, ValueList} = {0.2, 0.2, 0.2, 0.2, 0.2};
(2.2) if an Item has all 5 attributes in Weight, adding 1 to the IDF value of each attribute;
(2.3) for every pair of 2 Items, returning the Title similarity, defined as Title_sim; the similarity of the open classification lists TypeList, defined as TypeList_sim; the Detail content similarity, defined as Detail_sim; the InfoList similarity, defined as InfoList_sim; and the ValueList similarity, defined as ValueList_sim; the 5 attribute similarities together are defined as Dsim;
(2.4) linearly weighting the obtained similarities Dsim, the result being defined as Simi;
(2.5) storing the attribute similarities of the Items in a temporary table CurList, computing the variance and mean of each component, applying Gaussian normalization to the Title and TypeList similarities, and assigning the average similarity to any missing similarity values;
(2.6) computing the weighted sum of each Item's similarities, defined as Count_sim; sorting by Count_sim and grouping the top k values into one class;
and (2.7) obtaining the classification T of the text entity.
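The weighted similarity and top-k ranking of steps (2.3)-(2.6) can be sketched as follows; the candidate items and their per-attribute similarity values are invented for illustration:

```python
# Equal 0.2 weights over the five attributes, as defined in step (2.1).
WEIGHT = {"Title": 0.2, "TypeList": 0.2, "Detail": 0.2,
          "InfoList": 0.2, "ValueList": 0.2}

def count_sim(dsim):
    """Linearly weighted sum of the per-attribute similarities (Count_sim).
    Missing attributes contribute 0 in this toy version."""
    return sum(WEIGHT[a] * dsim.get(a, 0.0) for a in WEIGHT)

def top_k(candidates, k):
    """candidates: list of (item_id, dsim dict); return the k best ids."""
    ranked = sorted(candidates, key=lambda c: count_sim(c[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:k]]

cands = [("rice",  {"Title": 0.9, "Detail": 0.8}),
         ("wheat", {"Title": 0.4, "Detail": 0.5}),
         ("maize", {"Title": 0.7, "Detail": 0.9})]
print(top_k(cands, k=2))  # → ['rice', 'maize']
```

In the patented method the missing-value handling is richer (step (2.5) assigns the average similarity rather than 0, after Gaussian normalization of Title_sim and TypeList_sim), but the ranking mechanics are as above.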
Further, the specific steps of obtaining the data set filter_Wtrain_data in step (3) are as follows:
(3.1) storing the predicted entity classification results obtained from the KNN algorithm in Predict_data;
(3.2) crawling all the relations and their corresponding Chinese names from the Wikidata web pages with Scrapy, the storage format being json;
(3.3) the crawled content comprises the relation id rid, the attribute type rtype to which the relation belongs, the subclass statement to which it belongs, and the corresponding link rlink, stored in the relation.json file; a data sample is defined as Ritem = {rid, rtype, statement, rlink} and the data set as Relation = {Ritem1, Ritem2, …, Ritemn}; the relation id cid and the relation's Chinese representation chmention are stored in the mention.json file; a data sample is defined as Mitem = {cid, chmention} and the data set as Mention = {Mitem1, Mitem2, …, Mitemn};
(3.4) merging the data of relation.json and mention.json, defining the data set result.json, and storing the result in the result.json file;
(3.5) defining the entities.json database, searching Wikidata for the data in Predict_data, returning the json content, and storing it in the entities.json file;
(3.6) Wikidata is an open knowledge base; the description of each entity on its Wikidata entity page and the relations associated with the entity are crawled; the wikidataRelation.json file is defined and the results are stored in it; a data sample is defined as Witem = {entity1, relation, entity2} and the data set as WikidataRelation = {Witem1, Witem2, …, Witemn};
(3.7) processing the data of the wikidataRelation.json database into csv file format, and matching the HuDong_data of the agricultural interactive encyclopedia against the Predict_data database to obtain the node.csv file; a data sample is defined as Nitem = {Title, Label} and the data set as Node = {Nitem1, Nitem2, …, Nitemn};
(3.8) converting the Wiki Chinese encyclopedia corpus from traditional to simplified characters and removing the line-break symbols inside sentences;
(3.9) selecting the agriculture-related training set, selecting the relations "instance of", "taxon rank", "subclass of", and "parent taxon";
(3.10) preloading the entity list (selecting entities of the following categories: 5: Animal, 6: Plant, 7: Chemicals, 9: Foods, 10: Diseases, 12: Nutrients, 13: Biochemistry, 14: Agricultural examples, 15: Technology);
(3.11) storing the agriculture-related sentences in the FileRead file, aligning the triples obtained in wikidataRelation.json to the Chinese Wikipedia corpus, and defining the training corpus Wtrain_data;
(3.12) loading the entity-to-label mapping dictionary, performing part-of-speech screening on the words in each sentence, and reading entity categories consistent with the entity categories in Predict_data;
(3.13) filtering out the samples whose relation attribute is null in the training set obtained by alignment with the Wikipedia data set, obtaining filter_Wtrain_data; a data sample is defined as Fitem = {entity1id, entity1, entity2id, entity2, statement, relation} and the data set as Filter = {Fitem1, Fitem2, …, Fitemn}.
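The relation-file merge of steps (3.3)-(3.4) amounts to a join on the shared relation id. A toy sketch follows; the file layout, field names, and sample record are illustrative assumptions, not the patent's exact schema:

```python
import json

# Join each relation record with its Chinese name on the shared id.
# Sample data (property id, fields, Chinese mention) are invented.
relations = [{"rid": "P31", "rtype": "item", "rlink": "/wiki/Property:P31"}]
mentions = [{"cid": "P31", "chmention": "实例"}]

def merge(relations, mentions):
    names = {m["cid"]: m["chmention"] for m in mentions}
    # dict(r, chmention=...) copies the record and adds the joined name.
    return [dict(r, chmention=names.get(r["rid"], "")) for r in relations]

result = merge(relations, mentions)
print(json.dumps(result, ensure_ascii=False))
```

Records whose id has no Chinese name get an empty chmention here; the patent instead drops null-relation samples at step (3.13).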
Further, the specific steps of obtaining the wikidataRelation data set in the step (4) are as follows:
(4.1) screening out the entities consisting entirely of Chinese characters with a regular expression and converting them into dictionary format, the first element of each line in the filter_Wtrain_data data set being an entity;
(4.2) acquiring the sentence set according to the entity dictionary and storing it in list format; preprocessing it by removing all characters except Chinese characters and common Chinese punctuation, and splitting it into sentences;
(4.3) traversing each sentence by entity, storing all entities found in the sub-sentences according to the character-matching rule, filtering out sentences with no entity or only one entity, and processing the data into the format [[sentence, [entity1, …]], …];
(4.4) performing Chinese word segmentation with the jieba library and splitting the Chinese sentences; defining sentence as the text data, sentence_seg as the text segmentation, and entity1 as a text entity; processing the data into the format [[sentence, [entity1, …], [sentence_seg]], …];
(4.5) training the word vectors;
(4.6) re-screening the entities against the segmented sentences;
(4.7) checking whether each entity appears in the segmented sentence; if not, executing step (4.10);
(4.8) combining the entity set pairwise and processing the data into the format [sentence, entity1, entity2, [sentence_seg]], one sentence yielding multiple samples;
(4.9) dividing the data of the wikidataRelation data set into a training set and a test set at a ratio of 3:1;
(4.10) removing the entity and the text.
Further, the specific steps of respectively building the PCNN, CNN, RNN, BiRNN neural network models in the step (5) are as follows:
(5.1) building the artificial neural network; in the embedding layer, defining the word-mapping function as Word_embedding, the word-embedding vector size as Word_embedding_dim = 50, the position-feature embedding vector size as Position_embedding = 5, and the maximum sentence length as 120; setting the Word_Position_embedding function to add the results of the two embedding functions;
(5.2) defining the loss functions as softmax cross-entropy and sigmoid cross-entropy; for each sample, the softmax or sigmoid layer yields a probability distribution over the predicted relations, and the maximum prediction is taken as the entity prediction result for computing the cross-entropy loss;
(5.3) setting drop_out = 0.5; computing the maximum of the elements along the tensor dimension, dividing into three segments for max-pooling so that each convolution kernel yields a 3-dimensional vector; feeding the pooling-layer output into the normalization layer and applying the tanh activation function for non-linear processing;
(5.4) each instance set is defined as a bag; a bag in the training set is positive if it contains at least one positive instance, and negative if all its instances are negative;
(5.5) adding an attention mechanism over each bag;
(5.6) judging whether training is finished; if so, executing step (5.14);
(5.7) judging whether to apply dropout;
(5.8) defining the entity list bag_pre, and defining the attention-based logistic-regression value as attention_logit;
(5.9) defining the variable i for traversing the local scope, the scope being defined as scope;
(5.10) if i > scope.shape[0], executing step (5.13);
(5.11) computing the attention value of the softmax loss function;
(5.12) i = i + 1; executing step (5.10);
(5.13) storing the obtained rank vector in the bag_pre entity list and executing step (5.19);
(5.14) finishing training and starting testing;
(5.15) defining the variable i1 for traversing the local scope;
(5.16) if i1 > scope.shape[0], executing step (5.19);
(5.17) computing the sigmoid activation value and storing the obtained entity value in the logistic-regression list bag_logit;
(5.18) i1 = i1 + 1; executing step (5.16);
(5.19) defining the four functions pcnn, cnn, rnn, and birnn; defining the hidden size as hidden_size = 230, the convolution kernel size as 3 with stride 1, using the relu activation function, and setting the local variable count;
(5.20) training, and building a pcnn, cnn, rnn and birnn neural network model;
(5.21) if count > n, executing step (5.28);
(5.22) decomposing the sentence into words in turn, each word mapped to a dimensional vector dwCalled word embedding, learning an embedded vector through model training;
(5.23) indicating the tag's identity in the sentence using the location feature1And entity2Each entity has two relative positions, respectively mapped to different dpA dimension vector;
(5.24) the two relative position results of step (5.23) are concatenated to obtain the matrix M e Rh×RsAs being an input representation, wherein Rs=dw+2*dp;
(5.25) is provided withPut W ═ Wc*RsIs a convolution matrix, where WcIs the convolution window width, and by sliding the convolution window down to the sentence and applying this function to each valid position, a feature map c ═ c is generated1,c2,..., c(h-Wc+1)]Extracting n features from the sentence;
(5.26) repeating the above process with a different W matrix;
(5.27) count = count + 1, performing step (5.21);
(5.28) piecewise max-pooling is used to select the maximum activation value in each segment of every feature map;
(5.29) dividing each feature map ci into three segments {ci1, ci2, ci3} based on the positions of the two entities; when the piecewise max pooling is complete, the results of all feature maps are concatenated into a feature vector p ∈ R^(3n) as the characteristic representation of the sentence;
(5.30) building four neural network models of the step (5.20).
Further, the specific steps of obtaining the relation extraction model M for parameter adaptive optimization in the field of agricultural knowledge graph in the step (6) are as follows:
(6.1) carrying out parameter self-adaptive optimization on a link coefficient a, an initial threshold b and an attenuation coefficient c of the PCNN neural network model;
(6.2) determining a relatively fixed link coefficient, initial threshold and attenuation coefficient; first, searching within the initial parameter range with a larger search step, and finding the optimal model parameters at that stage;
(6.3) if a plurality of parameter combinations exist and the optimal neural network model is achieved at the same time, executing the step (6.10);
(6.4) selecting the combination with the smaller link coefficient as the optimization result;
(6.5) if step is less than 0.0005, executing step (6.10);
(6.6) searching for optimal parameters by the grid;
(6.7) if the parameter traversal has finished, continue; if not, executing the step (6.6);
(6.8) obtaining the link coefficient, the initial threshold and the attenuation coefficient of the optimal PCNN neural network model, which achieve the highest accuracy;
(6.10) obtaining an optimal parameter model of the neural network.
Further, the step (7) of extracting the relationship between the entities from the text data in the agricultural field, rendering the entity relationship data through Echarts, and displaying the recommendation result on the web side comprises the following specific steps:
(7.1) obtaining the neural network model PCNN with optimal parameters, and extracting the two entities entity1 and entity2 in the agricultural text data and the relation between the two;
(7.2) mapping the triple relation based on a built small triple knowledge graph library wikidataRelation;
(7.3) performing entity extraction on the sentences in the wiki corpus by using the algorithm model, and mapping the entities into a wikidataRelation database to realize the automatic labeling function in the remote supervision algorithm;
(7.4) inputting the agricultural text data, screening the entities in the text data, and extracting the relations between the entities;
(7.5) importing data in an agricultural knowledge service system, an agricultural interactive encyclopedia and a Wikipedia into a neo4j database;
(7.6) if the searched entity exists in the database, continue; if not, the step (7.9) is executed;
(7.7) displaying the search result on the web end in graph form by using a Cypher query;
(7.8) packaging a python interface, displaying the data by using the web framework Django, and executing the step (7.10);
(7.9) displaying the absence of the entity;
(7.10) searching the text data of agricultural knowledge questions in the task box, and applying Chinese word segmentation to the text data to obtain the entities;
(7.11) searching the database by using a Cypher query;
(7.12) if the answer to the question exists in the database, continue; if not, the step (7.9) is executed;
(7.13) extracting the relation between the entities from the text data in the agricultural field, rendering the entity relation data through Echarts, and displaying the recommendation result on a web end.
By adopting the technical scheme, the invention has the following beneficial effects:
according to the method, unstructured text data in the agricultural field are crawled by using the Scrapy crawler framework, entity recognition is carried out with the KNN algorithm, relation extraction is carried out with a parameter-adaptive PCNN neural network model, and the triple relations are constructed; compared with a traditional convolutional neural network, this model can extract more sentence features. The method replaces the traditional practice of setting neural network parameter values by experience with a parameter-adaptive optimization technology, which improves the accuracy of the model's relation extraction.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a flow diagram of pre-processing data crawled by the agricultural knowledge service system and agricultural interactive encyclopedia in an embodiment;
FIG. 3 is a flowchart of a KNN algorithm model construction and text similarity comparison in a specific embodiment;
fig. 4 is a flowchart illustrating that the classification results of the predicted entities are stored by using Predict_data, and the entities in the Predict_data dataset are mapped with the entity data in the Wikipedia entity relationship dataset Result_data to obtain the dataset Train_data in the embodiment;
FIG. 5 is a flowchart of the embodiment in which an entity dictionary in a Wikipedia Chinese lexicon is constructed by replacing an entity attribute relationship according to rules, and text data is preprocessed;
FIG. 6 is a flowchart of building PCNN, CNN, RNN, BiRNN neural network models, respectively, in a specific embodiment;
FIG. 7 is a flowchart of comparing the four algorithm models to obtain the relation extraction model M for parameter adaptive optimization in the field of the agricultural knowledge graph in an embodiment;
FIG. 8 is a flow chart of extracting relationships between entities on textual data in the agricultural domain and establishing an agricultural knowledge graph for use in aiding decision making in an embodiment.
Detailed Description
The present invention is further illustrated by the following specific examples in conjunction with the accompanying drawings. It should be understood that these examples are intended only to illustrate the invention and not to limit its scope, which is defined by the appended claims; modifications of various equivalent forms made by those skilled in the art after reading the present invention likewise fall within that scope.
As shown in fig. 1 to 8, the parameter adaptive agricultural knowledge graph recommendation method based on remote supervision according to the present invention includes the following steps:
step 1: carrying out data preprocessing on the data crawled from the agricultural knowledge service system, and defining the obtained data set as Agri_data; performing data crawling on the agricultural interactive encyclopedia by using Scrapy, and defining the crawled data set as HuDong_data; performing Chinese word segmentation and word vector training on the text data of Agri_data and HuDong_data, and defining the obtained data set as Train_data. The specific method comprises the following steps:
step 1.1: performing data crawler and selecting a crawler page;
step 1.2: selecting a page;
step 1.3: selecting an agricultural knowledge service system;
step 1.4: crawling the agricultural knowledge service system, obtaining its html files, limiting the crawling range by using the front-end div, and obtaining the crop name Title, the detailed content Detail, the picture ImageList and the webpage link Url; forming the data item Aitem = {Title, Detail, ImageList, Url} and the data set Agri_data = {Aitem1, Aitem2, …, Aitemn}; executing the step 1.8;
step 1.5: selecting an agricultural interactive encyclopedia;
step 1.6: crawling the content in the agricultural interactive encyclopedia by using Scrapy, declaring the address domain of the crawler, acquiring the word list, constructing the original json file, generating the Url list, acquiring the crop Title, and crawling the picture ImageList and the open-domain label openTypeList;
step 1.7: each entity obtained by crawling corresponds to one entry of the agricultural encyclopedia, wherein each entry comprises a Title, an interactive encyclopedia link Url, a picture ImageList, an open classification list TypeList, detailed information InfoList and basic information ValueList; forming the data item Hitem = {Title, Url, ImageList, TypeList, InfoList, ValueList} and the data set HuDong_data = {Hitem1, Hitem2, …, Hitemn};
Step 1.8: acquiring two types of database sets Agri _ data and HuDong _ data;
step 1.9: performing part-of-speech screening on data in the two data sets Agri _ data and HuDong _ data;
step 1.10: discarding words containing non-Chinese and English or numbers;
step 1.11: performing Chinese word segmentation and word vector training on two database sets Agri _ data and HuDong _ data respectively;
step 1.12: a data set Train _ data is obtained.
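The token screening of steps 1.9–1.10 can be sketched as follows. This is a minimal illustrative sketch; the function name and the exact regular expression are assumptions, not taken from the patent text:

```python
import re

# Keep only tokens consisting purely of Chinese characters or English letters;
# step 1.10 discards words containing digits or other non-Chinese/English characters.
_VALID_TOKEN = re.compile(r'^[\u4e00-\u9fa5A-Za-z]+$')

def screen_tokens(tokens):
    """Drop tokens that contain non-Chinese/English characters or digits."""
    return [t for t in tokens if _VALID_TOKEN.match(t)]

print(screen_tokens(["水稻", "rice", "abc123", "小麦2号", "病虫害", "!!"]))
# ['水稻', 'rice', '病虫害']
```

The surviving tokens would then be fed to the Chinese word segmentation and word vector training of step 1.11.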
Step 2: training a KNN algorithm model by using a data set Train _ data, performing feature extraction on text data by using a fast text classification tool, and performing text similarity comparison by using a cosine similarity algorithm to obtain a text entity classification T, wherein the specific method comprises the following steps:
step 2.1: passing the text data set Train_data into the KNN text classifier, defining the mean of each component as Mean, the variance of each component as Var, the inverse text frequency index as Text_IDF and the number of texts as Item_Num; defining the similarity weights as Weight{Title, TypeList, Detail, InfoList, ValueList} = {0.2, 0.2, 0.2, 0.2, 0.2};
step 2.2: if each Item has 5 attributes in Weight, the IDF value of each attribute is added with 1;
step 2.3: for every 2 Items, returning the similarity of their Title, defined as Title_sim; the similarity of their TypeList, defined as TypeList_sim; the similarity of their Detail, defined as Detail_sim; the similarity of their InfoList, defined as InfoList_sim; and the similarity of their ValueList, defined as ValueList_sim; averaging the similarities of the 5 attributes and defining the result as Dsim;
step 2.4: the obtained similarity Dsim is linearly weighted and defined as Simi;
step 2.5: storing the similarity of each attribute of the Item in a temporary table CurList, calculating the mean and variance of each component, carrying out Gaussian normalization on the similarities of Title and TypeList, and assigning the average similarity to similarity values that do not appear;
step 2.6: weighting and summing the similarity of each Item, defining the sum as Count _ sim, sequencing the first k values of the similarity Count _ sim, and classifying the k values into one class;
step 2.7: a classification T of the text entity is obtained.
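The weighted similarity and top-k voting of steps 2.3–2.6 can be sketched as below with a toy two-attribute example; the attribute vectors, weights and function names are illustrative assumptions, and the Gaussian normalization of step 2.5 is omitted for brevity:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(query, items, weights, k=3):
    """items: list of (attribute_vectors, label). The weighted sum of the
    per-attribute cosine similarities plays the role of Count_sim (step 2.6);
    the top-k neighbours then vote on the class."""
    scored = sorted(
        ((sum(w * cosine(query[a], attrs[a]) for a, w in weights.items()), label)
         for attrs, label in items),
        key=lambda x: -x[0])
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

With Weight = {0.2, 0.2, 0.2, 0.2, 0.2} over the five attributes of step 2.1, the weighted sum reduces to the averaged similarity Dsim.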
And step 3: predicting the Result of entity classification by using the KNN algorithm model of step 2, storing the Result in the Predict_data set, and mapping the entities in the Predict_data set with the entity data in the Wikipedia entity relationship data set Result_data to obtain the data set Train_data, wherein the specific method comprises the following steps:
step 3.1: storing the predicted entity classification result obtained from the KNN algorithm in Predict_data;
step 3.2: crawling all relations and corresponding Chinese names under Wikipedia webpages, wherein the storage format is json format;
step 3.3: the crawled content comprises the relation id rid, the relation attribute rtype, the relation sub-class statement and the corresponding rlink link, stored in a relation.json file; defining the data sample Ritem = {rid, rtype, statement, rlink} and the data set Relation = {Ritem1, Ritem2, …, Ritemn}; the Chinese representation of each relation, comprising the relation id rcid and the Chinese name rchinese, is stored in a relation_chinese.json file, defining the data sample Mitem = {rcid, rchinese} and the data set Mrelation = {Mitem1, Mitem2, …, Mitemn};
Step 3.4: merging data in relation data sets relationship.json and relationship.json, defining a data set result.json, and storing a result in a result.json file;
step 3.5: defining an entites.json database, searching the data in Predict_data on Wikipedia, returning the json content and storing it in the entites.json file;
step 3.6: Wikidata is an open knowledge base; crawling the description of each entity in its Wikidata entity page and the corresponding relations associated with the entity, defining a wikidataRelation.json file and storing the result in it; defining the data sample Witem = {entity1, relation, entity2} and the data set WikidataRelation = {Witem1, Witem2, …, Witemn};
Step 3.7: data in a WikidataRelations.json database is processed into a csv file format, data in an agricultural interactive encyclopedia HuDong _ data corresponds to a Presect _ data database to obtain a node.csv file, a data sample Nitem is defined to be { Title, Label }, and a data set Node is defined to be { Nitem }1,Nitem2,…,Nitemn};
Step 3.8: the Wiki Chinese encyclopedia corpus is simplified from traditional form to simplified form, and the line feed symbols in the line feed sentences are removed;
step 3.9: selecting a training set related to agriculture, and selecting the relations "instance of", "taxon rank", "subclass of" and "parent taxon";
step 3.10: preloading a list of entities (selecting entities in the following categories: 5: Animal, 6: Plant, 7: Chemicals, 9: Food items, 10: Diseases, 12: Nutrients, 13: Biochemistry, 14: Agricultural experiments, 15: Technology);
step 3.11: agricultural related statements are stored in a FileRead file, a triple relation obtained in wikidataRelation.json is aligned to a corpus of a Chinese wiki, and a corpus Wtrain _ data of a training set is defined;
step 3.12: loading the entity-to-label mapping dictionary, performing part-of-speech screening on the words in each sentence, and reading the category of each entity, which is consistent with the entity categories in Predict_data;
step 3.13: filtering out null values according to the attribute relations in the training set obtained by aligning the Wikipedia data set, obtaining Filter_Wtrain_data; defining the data sample Fitem = {entity1id, entity1, entity2id, entity2, statement, relation} and the data set Node = {Fitem1, Fitem2, …, Fitemn}.
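The remote-supervision alignment of step 3.11 — labelling a wiki sentence with a known triple whenever both of its entities occur in that sentence — can be sketched as follows; function and field names are illustrative assumptions:

```python
def align_corpus(sentences, triples):
    """Remote supervision: if a sentence mentions both entities of a known
    triple (entity1, relation, entity2) from the triple store, label the
    sentence with that relation automatically."""
    labeled = []
    for sent in sentences:
        for (e1, e2), rel in triples.items():
            if e1 in sent and e2 in sent:
                labeled.append({'entity1': e1, 'entity2': e2,
                                'statement': sent, 'relation': rel})
    return labeled
```

The labelling noise this heuristic inevitably introduces is what the bag-level attention mechanism of step 5 is meant to absorb.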
And 4, step 4: an entity dictionary in a Wikipedia Chinese word stock is constructed by utilizing a heuristic rule screening algorithm, and Filter _ Wtrain _ data text data are preprocessed to obtain a wikidataRelation data set, wherein the specific method comprises the following steps:
step 4.1: the first element of each line in the screened Filter _ Wtrain _ data dataset is an entity, and all entities which are Chinese characters are screened according to a regular expression and converted into a dictionary format;
step 4.2: acquiring the sentence set according to the entity dictionary and storing it in list format; preprocessing the sentence set, removing all characters except Chinese characters and common Chinese punctuation, and splitting the sentences;
step 4.3: traversing each sentence by entity; according to the character matching rule, storing all entities in each sub-sentence, filtering out sentences with no entity or only one entity, and processing the data into the format [[sentence, [entity1, …]], …];
step 4.4: performing Chinese word segmentation on the Chinese sentences by using the jieba library; defining sentence as the text data, sentence_seg as the text segmentation and entity1 as a text entity, and processing the data into the format [[sentence, [entity1, …], [sentence_seg]], …];
step 4.5: training a word vector;
step 4.6: re-screening the entities by the sentence after word segmentation;
step 4.7: if the entity appears in the segmented sentence, continue; if not, the step 4.10 is executed;
step 4.8: combining the entities in the entity set pairwise, and processing the data into the format [sentence, entity1, entity2, [sentence_seg]], one sentence yielding multiple samples;
step 4.9: dividing the data in the wikidataRelation data set into a training set and a test set in a 3:1 ratio;
step 4.10: removing the entity together with its text.
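Steps 4.1–4.8 — screening pure-Chinese entities, keeping only those that survive word segmentation, and pairing them so that one sentence yields several samples — can be sketched as follows; the function name and the regular expression are illustrative assumptions:

```python
import re
from itertools import combinations

# Step 4.1: keep only entities made purely of Chinese characters.
_CHINESE = re.compile(r'^[\u4e00-\u9fa5]+$')

def build_samples(sentence, sentence_seg, entities):
    """Keep pure-Chinese entities that appear in the segmented sentence
    (steps 4.6-4.7), then combine them pairwise (step 4.8) into samples
    of the form [sentence, entity1, entity2, [sentence_seg]]."""
    kept = [e for e in entities if _CHINESE.match(e) and e in sentence_seg]
    return [[sentence, e1, e2, sentence_seg] for e1, e2 in combinations(kept, 2)]
```

A sentence with fewer than two surviving entities yields no sample, matching the filtering of step 4.3.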
And 5: respectively building PCNN, CNN, RNN and BiRNN neural network models, wherein the specific method comprises the following steps:
step 5.1: building the artificial neural network; in the embedding layer, defining the word mapping function as word_embedding, the word embedding vector size word_embedding_dim = 50, the position-feature embedding vector size position_embedding = 5 and the maximum length 120, and setting a word_position_embedding function that adds the results of the two embedding functions;
step 5.2: defining the loss functions, including softmax cross-entropy and sigmoid cross-entropy; the softmax or sigmoid layer of each sample yields a different probability distribution and prediction relation, and the maximum prediction result is used as the entity prediction result for calculating the cross-entropy loss.
Step 5.3: and setting drop _ out as definition 0.5, calculating the maximum value of elements on tensor dimension, dividing the maximum value into three sections of maximum pooling, obtaining a 3-dimensional vector by each convolution kernel, inputting the result obtained by the pooling layer into a normalization function layer, and performing nonlinear processing by using a tanh activation function.
Step 5.4: each instance is defined as a bag, and if the bags in the training set are positive, the number of positive instances is more than or equal to 1; if negative, the examples are all negative;
step 5.5: adding an attention mechanism to each bag;
step 5.6: judging whether training has finished; if so, executing step 5.14;
step 5.7: judging whether dropout is applied;
step 5.8: defining an entity prediction list bag_pre, and defining the attention-based logit value as attention_logit;
step 5.9: defining a variable i for traversing in a local scope, wherein the scope is defined as scope;
step 5.10: if i > scope.shape[0], go to step 5.13;
step 5.11: calculating an attention value of the softmax loss function;
step 5.12: i = i + 1, performing step 5.10;
step 5.13: the obtained rank vector is stored in the bag _ pre entity list, and step 5.19 is executed;
step 5.14: after training, starting testing;
step 5.15: defining a variable i1 for traversing the local scope;
step 5.16: if i1 > scope.shape[0], executing step 5.19;
step 5.17: calculating the sigmoid activation function value, and storing the obtained entity value in the logistic regression list bag_logit;
step 5.18: i1 = i1 + 1, go to step 5.16;
step 5.19: defining four functions pcnn, cnn, rnn and birnn, setting the hidden size hidden_size to 230, the convolution kernel size to 3 and the stride to 1, using the relu activation function, and setting a local loop variable count;
step 5.20: training, and building a pcnn, cnn, rnn and birnn neural network model;
step 5.21: if count > n, go to step 5.28;
step 5.22: decomposing the sentence into words, each word being mapped to a dw-dimensional vector called a word embedding; the embedding vectors are learned through model training;
step 5.23: indicating the positions of entity1 and entity2 in the sentence using position features; each word has two relative positions, one to each entity, respectively mapped to dp-dimensional vectors;
step 5.24: concatenating the word embedding with the two relative position embeddings of step 5.23 to obtain a matrix M ∈ R^(h×s) as the input representation, where s = dw + 2·dp and h is the sentence length;
step 5.25: setting W ∈ R^(wc×s) as a convolution matrix, where wc is the convolution window width; sliding the convolution window over the sentence and applying the filter at each valid position generates a feature map c = [c1, c2, …, c(h−wc+1)]; n such filters extract n features from the sentence;
step 5.26: repeating the above process with different W matrices;
step 5.27: count = count + 1, performing step 5.21;
step 5.28: piecewise max-pooling is used to select the maximum activation value in each segment of every feature map;
step 5.29: dividing each feature map ci into three segments {ci1, ci2, ci3} based on the positions of the two entities; when the piecewise max pooling is complete, the results of all feature maps are concatenated into a feature vector p ∈ R^(3n) as the characteristic representation of the sentence;
step 5.30: and (5) building four neural network models of the step 5.20.
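The piecewise max-pooling of steps 5.28–5.29 can be sketched numerically as below. This is a minimal pure-Python sketch — in the model itself the operation runs on convolution outputs inside the network, and the function names are assumptions:

```python
def piecewise_max_pooling(feature_map, pos1, pos2):
    """Split one convolution feature map into three segments by the two
    entity positions and take the maximum of each segment, yielding the
    3-dimensional vector of step 5.29."""
    segments = [feature_map[:pos1], feature_map[pos1:pos2], feature_map[pos2:]]
    return [max(seg) if seg else 0.0 for seg in segments]

def sentence_feature(feature_maps, pos1, pos2):
    """Concatenate the pooled results of n feature maps into the
    3n-dimensional sentence representation p."""
    p = []
    for fm in feature_maps:
        p.extend(piecewise_max_pooling(fm, pos1, pos2))
    return p
```

This segment-wise pooling preserves the coarse positional structure around the two entities that plain max pooling discards, which is why the PCNN can extract more sentence features than a traditional CNN.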
Step 6: and comparing the four algorithm models to obtain a relation extraction model M for parameter self-adaption optimization in the field of the agricultural knowledge graph, and specifically comprising the following steps of:
step 6.1: performing parameter self-adaptive optimization on a link coefficient a, an initial threshold b and an attenuation coefficient c of the PCNN neural network model;
step 6.2: determining a relatively fixed link coefficient, initial threshold and attenuation coefficient; first, searching within the initial parameter range with a larger search step, and finding the optimal model parameters at that stage;
step 6.3: if a plurality of parameter combinations exist and the optimal neural network model is achieved at the same time, executing the step 6.10;
step 6.4: selecting the combination with the smaller link coefficient as an optimization result;
step 6.5: if step is less than 0.0005, executing step 6.10;
step 6.6: searching for optimal parameters by a grid;
step 6.7: if the parameter traversal has finished, continue; if not, the step 6.6 is executed;
step 6.8: obtaining the link coefficient, the initial threshold and the attenuation coefficient of the optimal PCNN neural network model, which achieve the highest accuracy;
step 6.10: and obtaining an optimal parameter model of the neural network.
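The coarse-to-fine parameter search of steps 6.2–6.8 can be sketched as follows. This is a generic sketch over the three parameters (a, b, c); the step-halving schedule and the tie-break are assumptions filling in details the patent states only loosely:

```python
from itertools import product

def frange(lo, hi, step):
    """Inclusive floating-point range, rounded to suppress drift."""
    vals, x = [], lo
    while x <= hi + 1e-12:
        vals.append(round(x, 10))
        x += step
    return vals

def grid_search(evaluate, bounds, step):
    """One grid pass over (a, b, c); ties are broken by the smaller link
    coefficient a (step 6.4)."""
    best, best_score = None, float('-inf')
    for a, b, c in product(*(frange(lo, hi, step) for lo, hi in bounds)):
        score = evaluate(a, b, c)
        if score > best_score or (score == best_score and best is not None and a < best[0]):
            best, best_score = (a, b, c), score
    return best, best_score

def coarse_to_fine(evaluate, bounds, step=0.1, min_step=0.0005):
    """Steps 6.2-6.8: search with a larger step first, then repeatedly
    halve the step around the current best point until the step falls
    below 0.0005 (the termination test of step 6.5)."""
    best, score = grid_search(evaluate, bounds, step)
    while step >= min_step:
        bounds = [(max(lo, p - step), min(hi, p + step))
                  for p, (lo, hi) in zip(best, bounds)]
        step /= 2
        best, score = grid_search(evaluate, bounds, step)
    return best, score
```

Here `evaluate` stands for the accuracy of the PCNN model trained with the given link coefficient, initial threshold and attenuation coefficient.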
And 7: extracting the relation between entities from the text data in the agricultural field, rendering the entity relation data through Echarts, and displaying the recommendation result on a web end, wherein the method comprises the following specific steps:
step 7.1: obtaining the neural network model PCNN with optimal parameters, and extracting the two entities entity1 and entity2 in the agricultural text data and the relation between the two;
step 7.2: mapping the triple relation based on a built small triple knowledge graph library wikidataRelation;
step 7.3: utilizing an algorithm model to extract entities of sentences in a wiki corpus, and mapping the entities into a wikidatarelationship database to realize an automatic labeling function in a remote supervision algorithm;
step 7.4: inputting the agricultural text data, screening the entities in the text data, and extracting the relations between the entities;
step 7.5: importing data in an agricultural knowledge service system, an agricultural interactive encyclopedia and a Wikipedia into a neo4j database;
step 7.6: if the searched entity exists in the database, continue; if not, step 7.9 is executed;
step 7.7: displaying the search result on the web end in graph form by using a Cypher query;
step 7.8: packaging a python interface, displaying the data by using the web framework Django, and executing step 7.10;
step 7.9: displaying that the entity does not exist;
step 7.10: searching the text data of the agricultural knowledge question in the task box, and applying Chinese word segmentation to the text data to obtain the entities;
step 7.11: searching the database by using a Cypher query;
step 7.12: if the answer to the question exists in the database, continue; if not, step 7.9 is executed;
step 7.13: and extracting the relation between the entities from the text data in the agricultural field, rendering the entity relation data through Echarts, and displaying the recommendation result on a web end.
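The graph query and rendering of steps 7.11–7.13 can be sketched as below. Note that the query language of neo4j is Cypher; the node label, property names and the ECharts data shape here are illustrative assumptions rather than the patent's actual schema, and the query string would be executed through a neo4j driver behind the Django interface:

```python
def entity_query(title):
    """Build a parameterized Cypher query returning the one-hop relations
    of an entity node matched by title (step 7.11)."""
    query = ("MATCH (e:Entity {title: $title})-[r]->(m) "
             "RETURN e.title AS source, type(r) AS relation, m.title AS target")
    return query, {"title": title}

def to_echarts(rows):
    """Shape query result rows into the nodes/links structure that an
    ECharts graph series renders on the web end (step 7.13)."""
    names = {r['source'] for r in rows} | {r['target'] for r in rows}
    return {
        'nodes': [{'name': n} for n in sorted(names)],
        'links': [{'source': r['source'], 'target': r['target'],
                   'value': r['relation']} for r in rows],
    }
```

Passing the title as a query parameter rather than interpolating it into the string avoids Cypher injection from user input in the task box.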
All of the above parameters are defined in the following table:
264093 pieces of data are processed; the KNN algorithm is used to extract features and to classify and predict entities, and a remote-supervision-based PCNN artificial neural network model is built for relation extraction and model training, obtaining the triple relations of the training texts, showing the relations among different crop entities to the user and accelerating information search. In tests, the accuracy of the experimental model using the PCNN algorithm exceeds 94%.
The invention creatively provides a parameter-adaptive agricultural knowledge graph system based on remote supervision, obtains through adaptive optimization and parameter adjustment an optimal neural network model for relation extraction in the agricultural field, and is applicable to general unstructured text data about crops.
Claims (8)
1. A parameter self-adaptive agricultural knowledge graph recommendation method based on remote supervision is characterized by comprising the following specific steps:
(1) data preprocessing is carried out on the data crawled by the agricultural knowledge service system, and the obtained data set is defined as Agri_data; performing data crawling on the agricultural interactive encyclopedia by using Scrapy, and defining the crawled data set as HuDong_data; performing Chinese word segmentation and word vector training on the text data in Agri_data and HuDong_data, and defining the obtained data set as Train_data;
(2) training a KNN algorithm model by using a data set Train _ data, performing feature extraction on text data by using a fast text classification tool, and performing text similarity comparison by using a cosine similarity algorithm to obtain a text entity classification T;
(3) predicting the entity classification Result by utilizing the KNN algorithm model of the step (2), storing it in the Predict_data set, and mapping the entities in the Predict_data set with the entity data in the Wikipedia entity relationship data set Result_data to obtain the data set Train_data;
(4) constructing an entity dictionary in a Wikipedia Chinese word stock by using a heuristic rule screening algorithm, and preprocessing the text data of Filter _ Wtrain _ data to obtain a wikidadata relationship data set;
(5) respectively building PCNN, CNN, RNN and BiRNN neural network models;
(6) comparing the four algorithm models to obtain a relation extraction model M for parameter self-adaption optimization in the field of agricultural knowledge maps;
(7) and extracting the relation between the entities from the text data in the agricultural field, rendering the entity relation data through Echarts, and displaying the recommendation result on a web end.
2. The remote supervision-based parameter adaptive agricultural knowledge graph recommendation method according to claim 1, wherein the data set Train _ data obtained in the step (1) comprises the following specific steps:
(1.1) performing data crawler and selecting a crawler page;
(1.2) selecting a page;
(1.3) selecting an agricultural knowledge service system;
(1.4) crawling the agricultural knowledge service system, obtaining its html files, limiting the crawling range by using the front-end div, and obtaining the crop name Title, the detailed content Detail, the picture ImageList and the webpage link Url; forming the data item Aitem = {Title, Detail, ImageList, Url} and the data set Agri_data = {Aitem1, Aitem2, …, Aitemn}; executing the step (1.8);
(1.5) selecting an agricultural interactive encyclopedia;
(1.6) crawling the contents in the agricultural interactive encyclopedia by using Scrapy, declaring the address domain of the crawler, acquiring the word list, constructing the original json file, generating the Url list, acquiring the crop Title, and crawling the picture ImageList and the open-domain label openTypeList;
(1.7) each entity obtained by crawling corresponds to one entry of the agricultural encyclopedia, wherein each entry comprises a Title, an interactive encyclopedia link Url, a picture ImageList, an open classification list TypeList, detailed information InfoList and basic information ValueList; forming the data item Hitem = {Title, Url, ImageList, TypeList, InfoList, ValueList} and the data set HuDong_data = {Hitem1, Hitem2, …, Hitemn};
(1.8) acquiring two types of database sets Agri _ data and HuDong _ data;
(1.9) performing part-of-speech screening on data in the two data sets Agri _ data and HuDong _ data;
(1.10) discarding words containing non-Chinese and English or numeric characters;
(1.11) performing Chinese word segmentation and word vector training on two database sets Agri _ data and HuDong _ data respectively;
(1.12) obtaining a data set Train _ data.
3. The remote supervision-based parameter adaptive agricultural knowledge base map recommendation method according to claim 1, wherein the specific steps of obtaining the text entity classification T in the step (2) are as follows:
(2.1) passing the text data set Train_data into the KNN text classifier, defining the mean of each component as Mean, the variance of each component as Var, the inverse text frequency index as Text_IDF and the number of texts as Item_Num; defining the similarity weights as Weight{Title, TypeList, Detail, InfoList, ValueList} = {0.2, 0.2, 0.2, 0.2, 0.2};
(2.2) if each Item has 5 attributes in Weight, adding 1 to the IDF value of each attribute;
(2.3) for every 2 Items, returning the similarity of their Title, defined as Title_sim; the similarity of their TypeList, defined as TypeList_sim; the similarity of their Detail, defined as Detail_sim; the similarity of their InfoList, defined as InfoList_sim; and the similarity of their ValueList, defined as ValueList_sim; averaging the similarities of the 5 attributes and defining the result as Dsim;
(2.4) linearly weighting the obtained similarity Dsim, and defining the similarity Dsim as Simi;
(2.5) storing the attribute similarities of the Item in a temporary table CurList, calculating the mean and variance of each component, carrying out Gaussian normalization on the similarities of Title and TypeList, and assigning the average similarity to similarity values that do not appear;
(2.6) carrying out weighted sum on the similarity of each Item, defining the sum as Count _ sim, sequencing the first k values of the similarity Count _ sim, and classifying the k values into one class;
and (2.7) obtaining the classification T of the text entity.
4. The remote supervision-based parameter adaptive agricultural knowledge graph recommendation method according to claim 1, wherein the specific steps of obtaining the data set Train _ data in the step (3) are as follows:
(3.1) storing the predicted entity classification result obtained from the KNN algorithm in Predict_data;
(3.2) crawling all the relations and the corresponding Chinese names under the Wikipedia webpages by utilizing Scrapy, wherein the storage format is json format;
(3.3) the crawled content comprises the relation id rid, the relation attribute rtype, the relation sub-class statement and the corresponding rlink link, stored in a relation.json file; defining the data sample Ritem = {rid, rtype, statement, rlink} and the data set Relation = {Ritem1, Ritem2, …, Ritemn}; the Chinese representation of each relation, comprising the relation id rcid and the Chinese name rchinese, is stored in a relation_chinese.json file, defining the data sample Mitem = {rcid, rchinese} and the data set Mrelation = {Mitem1, Mitem2, …, Mitemn};
(3.4) merging the data in the relation data sets relation.json and relation_chinese.json, defining a data set result.json, and storing the result in the result.json file;
(3.5) defining an entites.json database, searching the data in Predict_data on Wikipedia, returning the json content and storing it in the entites.json file;
(3.6) Wikitata is an open knowledge base, the description of an entity in a Wikitata entity page and the corresponding relation associated with the entity are crawled, a wikitataRelationship file is defined, the result is stored in the wikitataRelationship file, and a data sample Witem is defined as { entity ═ degree1,relation,entity2Data set wikidatarelationship (Witem)1,Witem2,…,Witemn};
(3.7) processing data in WikidataRelation. json database into csv file format, and performing agricultural interactionCorresponding data in the HuDong _ data in the encyclopedic to a Presect _ data database to obtain a node.csv file, defining a data sample Nitem ═ { Title, Label }, and defining a data set Node ═ Nitem ═ Title }1,Nitem2,…,Nitemn};
(3.8) converting the Wiki Chinese encyclopedia corpus from traditional to simplified Chinese and removing the line-feed symbols in the sentences;
(3.9) selecting a training set related to agriculture, choosing the relations "instance of", "taxon rank", "subclass of" and "parent taxon";
(3.10) preloading an entity list (selecting entities in the following categories: 5: Animal, 6: Plant, 7: Chemicals, 9: Foods, 10: Diseases, 12: Nutrients, 13: Biochemistry, 14: Agricultural entities, 15: Technology);
(3.11) storing the agriculture-related sentences in a FileRead file, aligning the triple relations obtained from wikidataRelation.json to the Chinese Wikipedia corpus, and defining the training corpus Wtrain_data;
(3.12) loading the entity-to-label mapping dictionary, performing part-of-speech screening on the words in each sentence, and reading the entity categories so that they are consistent with the categories in Predict_data;
(3.13) filtering out the null-valued attribute relations in the training set obtained by aligning with the Wikipedia data set to obtain filter_Wtrain_data; a data sample is defined as Fitem = {entity1id, entity1, entity2id, entity2, statement, relation} and the data set as Node = {Fitem1, Fitem2, …, Fitemn}.
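The distant-supervision alignment of steps (3.11)-(3.13) — matching known triples against corpus sentences and discarding null relations — can be sketched as follows. This is a simplified illustration; the helper name and the toy triples are assumptions, not part of the claim.

```python
def align_triples_to_corpus(triples, sentences):
    """Distant supervision in miniature: a sentence that mentions both
    entities of a known triple is taken as a training sample for that
    triple's relation (step 3.11); triples whose relation is null are
    filtered out (step 3.13)."""
    train = []
    for e1, rel, e2 in triples:
        if not rel:                      # drop null-valued relations
            continue
        for sent in sentences:
            if e1 in sent and e2 in sent:
                train.append({"entity1": e1, "entity2": e2,
                              "relation": rel, "sentence": sent})
    return train

# toy knowledge-base triples and corpus (illustrative only)
triples = [("wheat", "instance of", "crop"), ("rust", "", "disease")]
corpus = ["wheat is a widely grown crop", "rust damages wheat leaves"]
data = align_triples_to_corpus(triples, corpus)
```

Only the first sentence mentions both "wheat" and "crop", so it becomes the single labelled sample; the second triple is dropped for its empty relation.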
5. The method for recommending a parameter-adaptive agricultural knowledge graph based on remote supervision according to claim 1, wherein the specific steps of obtaining the wikidataRelation data set in step (4) are as follows:
(4.1) screening out the entities consisting entirely of Chinese characters with a regular expression and converting them into dictionary format, the first element of each line in the filter_Wtrain_data data set being an entity;
(4.2) acquiring the sentence set according to the entity dictionary, storing it in list format, preprocessing it by removing all characters other than Chinese characters and common Chinese punctuation, and splitting the sentences;
(4.3) traversing each sentence by entity, storing all the entities found in the sub-sentences according to the character matching rule, filtering out sentences containing no entity or only one entity, and processing the data into the format [[sentence, [entity1, …]], …];
(4.4) carrying out Chinese word segmentation with the jieba library and splitting the Chinese sentences; defining content as the text data, content_seg as the segmented text and entity1 as a text entity, and processing the data into the format [[sentence, [entity1, …], [sentence_seg]], …];
(4.5) training the word vectors;
(4.6) re-screening the entities against the segmented sentence;
(4.7) if an entity does not appear in the segmented sentence, executing step (4.10);
(4.8) combining the entity set pairwise and processing the data into the format [sentence, entity1, entity2, [sentence_seg]], one sentence yielding multiple samples;
(4.9) dividing the data in the wikidataRelation data set into a training set and a test set at a ratio of 3:1;
(4.10) removing the entity and the text.
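Steps (4.6)-(4.8) — re-screening entities against the segmented sentence and pairing them into samples — can be sketched as follows. This is an illustrative sketch: `seg` stands in for the output of jieba's `lcut`, and the helper name and example sentence are assumptions.

```python
from itertools import combinations

def build_samples(sentence, seg, entities):
    """Keep only the entities that survive word segmentation (step 4.7),
    then combine them pairwise (step 4.8) so that one sentence yields one
    sample per entity pair, in the format [sentence, e1, e2, sentence_seg]."""
    kept = [e for e in entities if e in seg]   # re-screen entities, step (4.6)/(4.7)
    if len(kept) < 2:                          # no usable pair: drop, step (4.10)
        return []
    return [[sentence, e1, e2, seg] for e1, e2 in combinations(kept, 2)]

sent = "小麦锈病危害小麦叶片"
seg = ["小麦", "锈病", "危害", "小麦", "叶片"]   # pretend jieba.lcut output
samples = build_samples(sent, seg, ["小麦", "锈病", "叶片", "玉米"])
```

Three of the four candidate entities appear in the segmented sentence, so the sentence yields three pairwise samples, matching the "one sentence for a plurality of samples" behaviour of step (4.8).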
6. The remote supervision-based parameter adaptive agricultural knowledge graph recommendation method according to claim 1, wherein the specific steps of respectively building PCNN, CNN, RNN and BiRNN neural network models in the step (5) are as follows:
(5.1) building an artificial neural network; in the embedding layer, defining the word mapping function as word_embedding, the word embedding vector size as word_embedding_dim = 50, the position feature embedding vector size as position_embedding = 5 and the maximum sentence length as 120, and setting the word_position_embedding function to add the results of the two embedding functions;
(5.2) defining three loss functions, including softmax cross entropy and sigmoid cross entropy; for each sample, the softmax or sigmoid layer yields a probability distribution over the candidate relations, and the maximum prediction result is taken as the entity prediction result for calculating the cross-entropy loss;
(5.3) setting drop_out = 0.5, calculating the maximum value of the elements in each tensor dimension, dividing each feature map into three segments for max pooling so that each convolution kernel yields a 3-dimensional vector, feeding the pooling result into a normalization layer and applying the tanh activation function for nonlinear processing;
(5.4) defining each set of instances as a bag; if a bag in the training set is positive, the number of positive instances in it is greater than or equal to 1; if a bag is negative, its instances are all negative;
(5.5) adding an attention mechanism on each bag;
(5.6) judging whether training is performed; if so, executing step (5.14);
(5.7) judging whether dropout is applied;
(5.8) defining an entity list bag_pre, and defining the attention-based logistic regression value as attention_logit;
(5.9) defining a variable i for traversing in the local scope, defining the scope as scope;
(5.10) if i > scope.shape[0], executing step (5.13);
(5.11) calculating an attention value of the softmax loss function;
(5.12) i = i + 1, executing step (5.10);
(5.13) storing the obtained rank vector in the bag_pre entity list and executing step (5.19);
(5.14) finishing training and starting testing;
(5.15) defining the variable i1 for traversing in a local scope;
(5.16) if i1 > scope.shape[0], executing step (5.19);
(5.17) calculating the sigmoid activation function value and storing the obtained entity value in the logistic regression list of bag_logit;
(5.18) i1 = i1 + 1, executing step (5.16);
(5.19) defining the four functions pcnn, cnn, rnn and birnn, with the hidden size hidden_size = 230, convolution kernel size 3 and stride 1, using the relu activation function and setting a local variable count;
(5.20) training, and building a pcnn, cnn, rnn and birnn neural network model;
(5.21) if count > n, executing step (5.28);
(5.22) decomposing the sentence into words in turn, each word being mapped to a dw-dimensional vector called a word embedding, the embedding vectors being learned through model training;
(5.23) using position features to indicate the positions of entity1 and entity2 in the sentence; each word has two relative positions, one with respect to each entity, each mapped to a different dp-dimensional vector;
(5.24) concatenating the embeddings of step (5.23) to obtain the input representation matrix M ∈ R^(h×s), where s = dw + 2*dp;
(5.25) setting W ∈ R^(wc×s) as the convolution matrix, where wc is the convolution window width; sliding the convolution window down the sentence and applying the filter at each valid position generates a feature map c = [c1, c2, …, c(h−wc+1)], extracting n features from the sentence;
(5.26) repeating the above process with a different W matrix;
(5.27) count +1, performing step (5.21);
(5.28) piecewise max pooling is used to select the maximum activation value in each segment of every feature map;
(5.29) dividing each feature map ci into three segments {ci1, ci2, ci3} based on the positions of the two entities; when the piecewise max pooling is complete, the results of all feature maps are concatenated into a vector p ∈ R^(3n) as the feature representation of the sentence;
(5.30) building four neural network models of the step (5.20).
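The piecewise max pooling of steps (5.28)-(5.29) can be shown concretely for a single feature map. This is an illustrative Python sketch (the real model operates on tensors inside the network); the helper name and the toy activation values are assumptions.

```python
def piecewise_max_pool(feature_map, pos1, pos2):
    """Split one convolution feature map into three segments around the
    positions of the two entities and keep the maximum of each segment,
    yielding a 3-dimensional vector per convolution kernel (steps 5.28-5.29).
    With n kernels, concatenating the n vectors gives p in R^(3n)."""
    segs = [feature_map[:pos1 + 1],
            feature_map[pos1 + 1:pos2 + 1],
            feature_map[pos2 + 1:]]
    return [max(s) if s else 0.0 for s in segs]

# one feature map over a 6-word sentence, entities at positions 1 and 3
c = [0.1, 0.9, 0.3, 0.5, 0.2, 0.7]
p = piecewise_max_pool(c, pos1=1, pos2=3)   # → [0.9, 0.5, 0.7]
```

Unlike ordinary max pooling, which would collapse the map to one scalar, the three-segment split preserves coarse structural information about what lies before, between and after the two entities.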
7. The method for recommending a parameter-adaptive agricultural knowledge graph based on remote supervision according to claim 1, wherein the specific steps of obtaining the parameter-adaptively optimized relation extraction model M for the agricultural knowledge graph field in step (6) are as follows:
(6.1) carrying out parameter self-adaptive optimization on a link coefficient a, an initial threshold b and an attenuation coefficient c of the PCNN neural network model;
(6.2) determining a relatively fixed link coefficient, initial threshold and attenuation coefficient by first searching the initial parameter range with a larger search step and finding the optimal model parameters at that step;
(6.3) if multiple parameter combinations reach the optimal neural network model simultaneously, executing step (6.10);
(6.4) selecting the combination with the smaller link coefficient as the optimization result;
(6.5) if the search step is less than 0.0005, executing step (6.10);
(6.6) searching for the optimal parameters over the grid;
(6.7) judging whether the parameter traversal has ended; if not, executing step (6.6);
(6.8) obtaining the highest accuracy with the link coefficient, initial threshold and attenuation coefficient of the optimal PCNN neural network model;
and (6.10) obtaining an optimal parameter model of the neural network.
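The coarse-to-fine grid search of steps (6.2)-(6.8) can be sketched in one dimension as follows. This is an assumed illustration: the claim searches three coupled parameters, while the sketch shows the search pattern for a single coefficient against a toy accuracy surface; all names and the 0.0005 stopping step are taken from or modelled on the claim.

```python
def coarse_to_fine_search(evaluate, lo, hi, step, min_step=0.0005):
    """Scan the range with the current step, keep the best value
    (preferring the smaller coefficient on ties, as in step 6.4),
    then halve the step and narrow the range around the best point
    until the step falls below min_step (step 6.5)."""
    best = lo
    while step >= min_step:
        x, grid = lo, []
        while x <= hi + 1e-12:          # build the current grid
            grid.append(x)
            x += step
        # maximize accuracy; ties resolved toward the smaller parameter
        best = min(grid, key=lambda v: (-evaluate(v), v))
        lo, hi = max(lo, best - step), min(hi, best + step)
        step /= 2
    return best

# toy accuracy surface peaking at 0.3 (an assumption for illustration)
acc = lambda a: -(a - 0.3) ** 2
a_opt = coarse_to_fine_search(acc, 0.0, 1.0, 0.1)
```

Starting with a coarse step of 0.1 and repeatedly halving it keeps the total number of model evaluations far below an exhaustive fine grid over the whole range.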
8. The parameter-adaptive agricultural knowledge graph recommendation method based on remote supervision according to claim 1, wherein the specific steps in step (7) of extracting the relations between entities from agricultural-field text data, rendering the entity relation data with Echarts and displaying the recommendation result on the web side are as follows:
(7.1) obtaining the neural network model PCNN with optimal parameters, and extracting the two entities entity1 and entity2 in the agricultural text data and the relation between them;
(7.2) mapping the triple relations based on the built small triple knowledge graph library wikidataRelation;
(7.3) performing entity extraction on the sentences in the wiki corpus with the algorithm model and mapping the entities into the wikidataRelation database to realize the automatic labeling function of the remote supervision algorithm;
(7.4) inputting the agricultural text data, screening the entities in the text data, and extracting the relations between the entities and the text data;
(7.5) importing the data of the agricultural knowledge service system, the agricultural Interactive Encyclopedia and Wikipedia into the neo4j database;
(7.6) judging whether the searched entity exists in the database; if not, executing step (7.9);
(7.7) displaying the search result in graph form on the web side using a Cypher statement;
(7.8) encapsulating the python interface, displaying the data with the web framework Django, and executing step (7.10);
(7.9) displaying that the entity does not exist;
(7.10) searching the text of an agricultural knowledge question in the task box and applying Chinese word segmentation to the text to obtain the entities;
(7.11) searching the database with a Cypher statement;
(7.12) judging whether the answer to the question exists in the database; if not, executing step (7.9);
(7.13) extracting the relations between entities from the agricultural-field text data, rendering the entity relation data with Echarts, and displaying the recommendation result on the web side.
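The neo4j lookup of steps (7.10)-(7.11) amounts to turning a recognized entity into a parameterized Cypher query. The sketch below only builds the query string; the node label `Entity` and property `title` are illustrative assumptions, not names fixed by the method, and no driver connection is shown.

```python
def build_entity_query(entity, limit=25):
    """Build a parameterized Cypher query that fetches an entity's
    outgoing relations for graph rendering (steps 7.10-7.11).
    Using a $title parameter avoids injecting user text into the query."""
    query = (
        "MATCH (e:Entity {title: $title})-[r]->(n) "
        f"RETURN e.title, type(r), n.title LIMIT {int(limit)}"
    )
    return query, {"title": entity}

query, params = build_entity_query("小麦")
```

In a full system the pair would be passed to a neo4j session (e.g. `session.run(query, params)`), and the returned triples rendered as a graph with Echarts on the web side.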
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010794151.4A CN112199508B (en) | 2020-08-10 | 2020-08-10 | Parameter self-adaptive agricultural knowledge graph recommendation method based on remote supervision |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112199508A true CN112199508A (en) | 2021-01-08 |
CN112199508B CN112199508B (en) | 2024-01-19 |
Family
ID=74004961
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010794151.4A Active CN112199508B (en) | 2020-08-10 | 2020-08-10 | Parameter self-adaptive agricultural knowledge graph recommendation method based on remote supervision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112199508B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113064999A (en) * | 2021-03-19 | 2021-07-02 | 南方电网调峰调频发电有限公司信息通信分公司 | Knowledge graph construction algorithm, system, equipment and medium based on IT equipment operation and maintenance |
CN113159320A (en) * | 2021-03-08 | 2021-07-23 | 北京航空航天大学 | Scientific and technological resource data integration method and device based on knowledge graph |
CN113723760A (en) * | 2021-07-30 | 2021-11-30 | 哈尔滨工业大学 | Wisdom agricultural thing networking platform |
WO2023097929A1 (en) * | 2021-12-01 | 2023-06-08 | 浙江师范大学 | Knowledge graph recommendation method and system based on improved kgat model |
CN116911963A (en) * | 2023-09-14 | 2023-10-20 | 南京龟兔赛跑软件研究院有限公司 | Data-driven pesticide byproduct transaction management method and cloud platform |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150178282A1 (en) * | 2013-12-23 | 2015-06-25 | Yahoo! Inc. | Fast and dynamic targeting of users with engaging content |
CN105279264A (en) * | 2015-10-26 | 2016-01-27 | 深圳市智搜信息技术有限公司 | Semantic relevancy calculation method of document |
CN108804521A (en) * | 2018-04-27 | 2018-11-13 | 南京柯基数据科技有限公司 | A kind of answering method and agricultural encyclopaedia question answering system of knowledge based collection of illustrative plates |
CN109871451A (en) * | 2019-01-25 | 2019-06-11 | 中译语通科技股份有限公司 | A kind of Relation extraction method and system incorporating dynamic term vector |
CN110209839A (en) * | 2019-06-18 | 2019-09-06 | 卓尔智联(武汉)研究院有限公司 | Agricultural knowledge map construction device, method and computer readable storage medium |
CN110555084A (en) * | 2019-08-26 | 2019-12-10 | 电子科技大学 | remote supervision relation classification method based on PCNN and multi-layer attention |
US20210391080A1 (en) * | 2018-12-29 | 2021-12-16 | New H3C Big Data Technologies Co., Ltd. | Entity Semantic Relation Classification |
Non-Patent Citations (5)
Title |
---|
KAI ZHANG et al.: "Chinese Agricultural Entity Relation Extraction via Deep Learning", INTELLIGENT COMPUTING METHODOLOGIES, pages 528 *
LYU Yilin; TIAN Hongtao; GAO Jianwei; WAN Huaiyu: "Relation Extraction Combining Encyclopedia Knowledge and Sentence Semantic Features", Computer Science, no. 1, pages 50-54 *
XIA Chuan: "Research on Entity Relation Extraction in the Field of Crop Diseases and Pests Based on Deep Learning", China Masters' Theses Full-text Database, Agricultural Science and Technology, no. 05, pages 046-7 *
ZHANG Weiru et al.: "Entity Relation Extraction Based on Wikipedia and Pattern Clustering", Chinese Information Processing Society of China, Frontiers of Chinese Computational Linguistics (2009-2011), pages 421-426 *
ZHU Suyang; HUI Haotian; QIAN Longhua; ZHANG Min: "Wikipedia Family Relation Extraction Based on Self-supervised Learning", Journal of Computer Applications, no. 04, pages 115-118 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110222160B (en) | Intelligent semantic document recommendation method and device and computer readable storage medium | |
CN106874378B (en) | Method for constructing knowledge graph based on entity extraction and relation mining of rule model | |
CN112199508B (en) | Parameter self-adaptive agricultural knowledge graph recommendation method based on remote supervision | |
CN111209738B (en) | Multi-task named entity recognition method combining text classification | |
US10783451B2 (en) | Ensemble machine learning for structured and unstructured data | |
KR101203345B1 (en) | Method and system for classifying display pages using summaries | |
CN112100344A (en) | Financial field knowledge question-answering method based on knowledge graph | |
CN107301199A (en) | A kind of data label generation method and device | |
CN110516074B (en) | Website theme classification method and device based on deep learning | |
CN114896388A (en) | Hierarchical multi-label text classification method based on mixed attention | |
CN111582506A (en) | Multi-label learning method based on global and local label relation | |
CN113269477B (en) | Scientific research project query scoring model training method, query method and device | |
CN112862569B (en) | Product appearance style evaluation method and system based on image and text multi-modal data | |
CN103049454B (en) | A kind of Chinese and English Search Results visualization system based on many labelings | |
CN112445862B (en) | Internet of things equipment data set construction method and device, electronic equipment and storage medium | |
CN112100395B (en) | Expert cooperation feasibility analysis method | |
CN111753151A (en) | Service recommendation method based on internet user behaviors | |
CN112613318B (en) | Entity name normalization system, method thereof and computer readable medium | |
CN113516202A (en) | Webpage accurate classification method for CBL feature extraction and denoising | |
CN113779387A (en) | Industry recommendation method and system based on knowledge graph | |
Maladkar | Content based hierarchical URL classification with Convolutional Neural Networks | |
RIZVI | A Systematic Overview on Data Mining: concepts and techniques | |
Ghosh et al. | Understanding Machine Learning | |
CN107341169B (en) | Large-scale software information station label recommendation method based on information retrieval | |
Chebil et al. | Clustering social media data for marketing strategies: Literature review using topic modelling techniques |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||