CN112199508A - Parameter adaptive agricultural knowledge graph recommendation method based on remote supervision - Google Patents
- Publication number
- CN112199508A (application No. CN202010794151.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- entity
- agricultural
- text
- relation
- Prior art date
- Legal status: Granted (the legal status is an assumption and is not a legal conclusion)
Classifications
- G06F16/367: Information retrieval of unstructured textual data; creation of semantic tools: Ontology
- G06F16/355: Information retrieval of unstructured textual data; clustering; classification: class or cluster creation or modification
- G06F16/951: Retrieval from the web: indexing; web crawling techniques
- G06F40/242: Natural language analysis; lexical tools: dictionaries
- G06F40/284: Natural language analysis; recognition of textual entities: lexical analysis, e.g. tokenisation or collocates
- G06F40/289: Natural language analysis; recognition of textual entities: phrasal analysis, e.g. finite state techniques or chunking
- G06Q50/02: Systems or methods specially adapted for specific business sectors: Agriculture; Fishing; Mining
Abstract
The invention discloses a parameter-adaptive agricultural knowledge-graph recommendation method based on remote supervision, comprising the following steps: text data are crawled with the Scrapy crawler framework and preprocessed, and a predicted text-classification data set Predict_data is obtained with a KNN algorithm classifier; when the Chinese crop corpus is processed, the predicted entity-classification results are mapped into the Wikipedia Chinese corpus to construct a Chinese entity dictionary. A parameter-adaptive neural-network model based on an improved remote-supervision algorithm is then built; the model adaptively searches for the parameters that make relation extraction perform best, automatically labels the text data, and obtains the relations between entities. The method improves the accuracy of relation extraction and, by exploiting agricultural text information, provides effective information screening for plant-cultivation enthusiasts.
Description
Technical Field
The invention belongs to the technical field of knowledge graphs and neural networks, and particularly relates to a parameter-adaptive agricultural knowledge-graph recommendation method based on remote supervision.
Background
The agricultural knowledge graph combines agriculture's regional, climatic, and product-diversity characteristics and uses the entity relations and concepts of the agricultural domain to build an intelligent auxiliary system that mines agriculture's latent value. Compared with the traditional agricultural information query mode, it combines visualization technology with an agricultural knowledge base to display and analyze the retrieved data, a new development of Chinese informetrics. The agricultural knowledge-graph service system provided by the invention can therefore analyze the environment and climate suitable for crop growth using data from an agricultural knowledge service system, the agricultural interactive encyclopedia, and Wikipedia, provides effective assistance to agricultural research institutes and plant-cultivation enthusiasts, and quickly retrieves the required information from an internet of exploding big data.
The existing research bases of Zhu Quanyin et al. include:
- Classification and extraction algorithm of Web science and technology news [J]. Journal of Huaiyin Institute of Technology, 2015, 24(5): 18-24;
- Li Xiang, Zhu Quanyin. Collaborative filtering recommendation with collaborative clustering and a shared scoring matrix [J]. Computer Science and Exploration, 2014, 8(6): 751-;
- Quanyin Zhu, Suqun Cao. A Novel Classifier-independent Feature Selection Algorithm for Imbalanced Datasets. 2009, p. 77-82;
- Quanyin Zhu, Yunyang Yan, Jin Ding, Jin Qian. The Case Study for Price Extracting of Mobile Phone Sell Online. 2011, p. 282-285;
- Quanyin Zhu, Suqun Cao, Pei Zhou, Yunyang Yan, Hong Zhou. Integrated Price Forecast based on Dichotomy Backfilling and Disturbance Factor Algorithm. International Review on Computers and Software, 2011, Vol. 6(6): 1089-;
- Li Xiang, Zhu Quanyin, Hu Ronglin, Zhou Hong. A cold-chain logistics stowage intelligent recommendation method based on spectral clustering. Chinese patent publication No. CN105654267A, 2016.06.08;
- Cao Suqun, Zhu Quanyin, Zuo Xiaoming, Gao Shangbing, et al. A feature selection method for pattern classification. Chinese patent publication No. CN103425994A, 2013.12.04;
- Liu Jinling, Feng Wanli, Zhang Yahong. Chinese text clustering method based on rescaling [J]. Computer Engineering and Applications, 2012, 48(21): 146-;
- Zhu Quanyin, Li Xiang, Xu Kang, et al. A network behavior habit clustering method based on K-means and LDA bidirectional verification. Chinese patent publication No. CN106202480A, 2016.12.07.
the traditional knowledge graph construction method relates to agricultural knowledge and relation extraction, and aims at the problems that: huihong remote supervision relation extraction method and device, chinese patent publication no: CN110209836A,2019.5.17, belonging to the application of remote supervision algorithm, aiming at generating an entity recognition training data set through bootstrap algorithm and recognizing the entity of a sentence through crf + + tool; generating an entity relation extraction training data set through a remote supervision method, and generating an entity relation extraction data set through a relation knowledge base and a natural language corpus; the method can automatically label training data through natural corpus to complete entity recognition and entity relationship extraction; sun encourage, an assistant diagnosis and treatment system based on knowledge map, chinese patent publication no: CN110459320A,2019.11.15, belonging to the field of medical diagnosis and treatment, aiming at defining the patient status between two successive medical operations as the side; the system comprises a patient information processing module, a diagnosis and treatment scheme pushing module, a patient information processing module and a diagnosis and treatment decision module, wherein the patient information processing module is used for receiving patient information, extracting historical medical operation and patient state information, sending the historical medical operation and the patient state information to the diagnosis and treatment scheme pushing module, matching the patient information with a knowledge graph, determining the position of the current state of a patient in the knowledge graph, pushing a medical index to be detected and/or next diagnosis and treatment operation based on the knowledge graph, quickly knowing the diagnosis and treatment stage of the patient, and giving a next; the Chinese patent publication No. 
is: CN110400327A,2019.11.1 belongs to the field of crop image segmentation, and aims to realize self-adaptive adjustment of PCNN model parameters in nighttime image segmentation of tomato plants, reduce PCNN iteration times and improve the real-time performance of algorithm application. However, at present, a system and a method for adopting a parameter adaptive optimization model combined with a neural network to identify entities and extract relationships in the agricultural field, construct a knowledge graph in the agricultural field and make an auxiliary decision do not exist.
Screening algorithm based on heuristic rules:
Information-filtering technology helps users find the information they are interested in more quickly. Information filtering is generally used to process large amounts of text and to filter out unwanted information in a targeted manner. A rule is a knowledge-representation method: modifying or replacing one rule does not affect the others; the rule base stores domain knowledge of many different categories, and the obtained rules can drive a chain of inference and prediction that finally yields a category. At present, information screening at home and abroad is performed through keywords with heuristic-rule screening methods.
Remote supervision algorithm:
The remote-supervision algorithm labels sentences in external documents with relation labels based on an existing, manually labeled knowledge graph; it is a semi-supervised algorithm. In the training stage, crop-related entities are first extracted from sentences; if two entities stand in a given relation in the corpus, the texts in the test set that mention both entities are assumed to express that same relation. The extracted text features are concatenated into a word vector, which serves as the feature vector of those texts. For this system, the proposed scheme is as follows: the existing triples are mapped onto a massive unstructured database to generate a large amount of training data, and the knowledge sources are diverse, such as manual labeling, an existing knowledge base, and specific sentence structures. For example: given a data set X = {x1, x2, x3, …, xn}, the relation h1 maps X to a space A = {A1, A2, A3, …, AM}, and the relation h2 then maps the space A to a space K = {K1, K2, K3, …, Kr}.
Algorithm based on the PCNN neural network model:
Traditional lexical features include the entities themselves, the word sequence between agricultural-product entities, hypernyms of words, and so on; such features depend on a manual feature-engineering process. Lexical-level features: words are converted to word vectors, and the word vectors represent the lexical-level functions. Syntactic-level features: context is taken into account by setting a sliding window of size K; after every K characters are read, the window slides forward one position, finally yielding a group of sentence-level features comprising word features and position features. The PCNN algorithm improves on the CNN algorithm at the pooling layer: the sentence is divided into segments according to the positions of the entity pair, max-pooling is performed on each segment independently to obtain the maximum value of each segment, and the maxima are finally concatenated into the feature vector.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the invention provides a parameter-adaptive agricultural knowledge-graph recommendation method based on remote supervision, which analyzes the environment and climate suitable for crop growth from the data of an agricultural knowledge service system, the agricultural interactive encyclopedia, and Wikipedia, provides effective assistance to agricultural research institutes and plant-cultivation enthusiasts, and quickly retrieves the required information from an internet of exploding big data.
The technical scheme is as follows: to solve the above technical problems, the invention provides a parameter-adaptive agricultural knowledge-graph recommendation method based on remote supervision, comprising the following specific steps:
(1) performing data preprocessing on the data crawled from the agricultural knowledge service system, the obtained data set being defined as Agri_data; crawling the agricultural interactive encyclopedia with Scrapy, the crawled data set being defined as HuDong_data; performing Chinese word segmentation and word-vector training on the text data in Agri_data and HuDong_data, the obtained data set being defined as Train_data;
(2) training a KNN algorithm model with the data set Train_data, performing feature extraction on the text data with the fastText classification tool, and comparing text similarity with the cosine-similarity algorithm to obtain the text entity classification T;
(3) predicting the entity classification result with the KNN algorithm model of step (2), storing it in the Predict_data set, and mapping the entities in Predict_data to the entity data in the Wikidata entity-relation data set Result_data to obtain the data set Wtrain_data;
(4) constructing an entity dictionary over the Wikipedia Chinese word stock with the heuristic-rule screening algorithm, and preprocessing the text data of filter_Wtrain_data to obtain the wikidataRelation data set;
(5) building the PCNN, CNN, RNN, and BiRNN neural network models respectively;
(6) comparing the four algorithm models to obtain the parameter-adaptive relation-extraction model M for the agricultural knowledge-graph domain;
(7) extracting the relations between entities from the agricultural-domain text data, rendering the entity-relation data with ECharts, and displaying the recommendation results on the web end.
Further, the specific steps of obtaining the data set Train_data in step (1) are as follows:
(1.1) running the data crawler and selecting the pages to crawl;
(1.2) selecting a page;
(1.3) selecting an agricultural knowledge service system;
(1.4) crawling the agricultural knowledge service system, obtaining the system's html files, limiting the crawling range with the front-end div elements, and obtaining each crop's name Title, detailed content Detail, photos ImageList, and web link Url; forming the data item Aitem = {Title, Detail, ImageList, Url} and the data set Agri_data = {Aitem1, Aitem2, …, Aitemn}; executing step (1.8);
(1.5) selecting an agricultural interactive encyclopedia;
(1.6) crawling the content of the agricultural interactive encyclopedia with Scrapy: declaring the crawler's address domain, acquiring the word list, constructing the original json file, generating the Url list, acquiring the crop Title, and crawling the picture list ImageList and the open-domain label list openTypeList;
(1.7) each crawled entity corresponds to one entry of the agricultural encyclopedia, an entry comprising the Title, the interactive-encyclopedia link Url, the pictures ImageList, the open classification list TypeList, the detailed information InfoList, and the basic information ValueList; forming the data item Hitem = {Title, Url, ImageList, TypeList, InfoList, ValueList} and the data set HuDong_data = {Hitem1, Hitem2, …, Hitemn};
(1.8) acquiring the two data sets Agri_data and HuDong_data;
(1.9) performing part-of-speech screening on the data in the two data sets Agri_data and HuDong_data;
(1.10) discarding words that contain characters other than Chinese or English, or that contain numeric characters;
(1.11) performing Chinese word segmentation and word-vector training on Agri_data and HuDong_data respectively;
(1.12) obtaining a data set Train _ data.
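The token filter of steps (1.9)-(1.10) can be sketched with a single regular expression. This is an illustrative reading of "discarding words that contain characters other than Chinese or English, or numeric characters", not the exact implementation:

```python
import re

# Keep only tokens made entirely of Chinese characters (CJK Unified
# Ideographs) or ASCII letters; tokens with digits or symbols are dropped.
VALID = re.compile(r"^[\u4e00-\u9fa5A-Za-z]+$")

def filter_tokens(tokens):
    return [t for t in tokens if VALID.match(t)]

print(filter_tokens(["水稻", "nitrogen", "pH7", "2020", "小麦-3"]))
# → ['水稻', 'nitrogen']
```

Tokens like "pH7" or "小麦-3" fail the whole-string match because of the digit and the hyphen, so only clean Chinese or English words survive into word-vector training.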
Further, the specific steps of obtaining the text entity classification T in the step (2) are as follows:
(2.1) feeding the text data set Train_data into the KNN text classifier; defining the mean of each component as Mean, the variance of each component as Var, the inverse document frequency as Text_IDF, and the number of texts as Item_Num; defining the similarity weights as Weight{Title, TypeList, Detail, InfoList, ValueList} = {0.2, 0.2, 0.2, 0.2, 0.2};
(2.2) if an Item has all 5 attributes in Weight, adding 1 to the IDF value of each attribute;
(2.3) for every pair of 2 Items, returning the Title similarity, defined as Title_sim; the similarity of the open classification lists TypeList, defined as TypeList_sim; the Detail content similarity, defined as Detail_sim; the InfoList similarity, defined as InfoList_sim; and the ValueList similarity, defined as ValueList_sim; the 5 attribute similarities together are defined as Dsim;
(2.4) linearly weighting the obtained similarities Dsim, the result being defined as Simi;
(2.5) storing the attribute similarities of the Items in a temporary table CurList, computing the variance and mean of each component, applying Gaussian normalization to the Title and TypeList similarities, and assigning the average similarity to any missing similarity values;
(2.6) computing the weighted sum of each Item's similarities, defined as Count_sim; sorting by Count_sim and grouping the top k values into one class;
and (2.7) obtaining the classification T of the text entity.
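The weighted similarity and top-k ranking of steps (2.3)-(2.6) can be sketched as follows; the candidate items and their per-attribute similarity values are invented for illustration:

```python
# Equal 0.2 weights over the five attributes, as defined in step (2.1).
WEIGHT = {"Title": 0.2, "TypeList": 0.2, "Detail": 0.2,
          "InfoList": 0.2, "ValueList": 0.2}

def count_sim(dsim):
    """Linearly weighted sum of the per-attribute similarities (Count_sim).
    Missing attributes contribute 0 in this toy version."""
    return sum(WEIGHT[a] * dsim.get(a, 0.0) for a in WEIGHT)

def top_k(candidates, k):
    """candidates: list of (item_id, dsim dict); return the k best ids."""
    ranked = sorted(candidates, key=lambda c: count_sim(c[1]), reverse=True)
    return [item_id for item_id, _ in ranked[:k]]

cands = [("rice",  {"Title": 0.9, "Detail": 0.8}),
         ("wheat", {"Title": 0.4, "Detail": 0.5}),
         ("maize", {"Title": 0.7, "Detail": 0.9})]
print(top_k(cands, k=2))  # → ['rice', 'maize']
```

In the patented method the missing-value handling is richer (step (2.5) assigns the average similarity rather than 0, after Gaussian normalization of Title_sim and TypeList_sim), but the ranking mechanics are as above.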
Further, the specific steps of obtaining the data set filter_Wtrain_data in step (3) are as follows:
(3.1) storing the predicted entity classification results obtained from the KNN algorithm in Predict_data;
(3.2) crawling all the relations and their corresponding Chinese names from the Wikidata web pages with Scrapy, the storage format being json;
(3.3) the crawled content comprises the relation id rid, the attribute type rtype to which the relation belongs, the subclass statement to which it belongs, and the corresponding link rlink, stored in the relation.json file; a data sample is defined as Ritem = {rid, rtype, statement, rlink} and the data set as Relation = {Ritem1, Ritem2, …, Ritemn}; the relation id cid and the relation's Chinese representation chmention are stored in the mention.json file; a data sample is defined as Mitem = {cid, chmention} and the data set as Mention = {Mitem1, Mitem2, …, Mitemn};
(3.4) merging the data of relation.json and mention.json, defining the data set result.json, and storing the result in the result.json file;
(3.5) defining the entities.json database, searching Wikidata for the data in Predict_data, returning the json content, and storing it in the entities.json file;
(3.6) Wikidata is an open knowledge base; the description of each entity on its Wikidata entity page and the relations associated with the entity are crawled; the wikidataRelation.json file is defined and the results are stored in it; a data sample is defined as Witem = {entity1, relation, entity2} and the data set as WikidataRelation = {Witem1, Witem2, …, Witemn};
(3.7) processing the data of the wikidataRelation.json database into csv file format, and matching the HuDong_data of the agricultural interactive encyclopedia against the Predict_data database to obtain the node.csv file; a data sample is defined as Nitem = {Title, Label} and the data set as Node = {Nitem1, Nitem2, …, Nitemn};
(3.8) converting the Wiki Chinese encyclopedia corpus from traditional to simplified characters and removing the line-break symbols inside sentences;
(3.9) selecting the agriculture-related training set, selecting the relations "instance of", "taxon rank", "subclass of", and "parent taxon";
(3.10) preloading the entity list (selecting entities of the following categories: 5: Animal, 6: Plant, 7: Chemicals, 9: Foods, 10: Diseases, 12: Nutrients, 13: Biochemistry, 14: Agricultural examples, 15: Technology);
(3.11) storing the agriculture-related sentences in the FileRead file, aligning the triples obtained in wikidataRelation.json to the Chinese Wikipedia corpus, and defining the training corpus Wtrain_data;
(3.12) loading the entity-to-label mapping dictionary, performing part-of-speech screening on the words in each sentence, and reading entity categories consistent with the entity categories in Predict_data;
(3.13) filtering out the samples whose relation attribute is null in the training set obtained by alignment with the Wikipedia data set, obtaining filter_Wtrain_data; a data sample is defined as Fitem = {entity1id, entity1, entity2id, entity2, statement, relation} and the data set as Filter = {Fitem1, Fitem2, …, Fitemn}.
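The relation-file merge of steps (3.3)-(3.4) amounts to a join on the shared relation id. A toy sketch follows; the file layout, field names, and sample record are illustrative assumptions, not the patent's exact schema:

```python
import json

# Join each relation record with its Chinese name on the shared id.
# Sample data (property id, fields, Chinese mention) are invented.
relations = [{"rid": "P31", "rtype": "item", "rlink": "/wiki/Property:P31"}]
mentions = [{"cid": "P31", "chmention": "实例"}]

def merge(relations, mentions):
    names = {m["cid"]: m["chmention"] for m in mentions}
    # dict(r, chmention=...) copies the record and adds the joined name.
    return [dict(r, chmention=names.get(r["rid"], "")) for r in relations]

result = merge(relations, mentions)
print(json.dumps(result, ensure_ascii=False))
```

Records whose id has no Chinese name get an empty chmention here; the patent instead drops null-relation samples at step (3.13).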
Further, the specific steps of obtaining the wikidataRelation data set in the step (4) are as follows:
(4.1) screening out the entities consisting entirely of Chinese characters with a regular expression and converting them into dictionary format, the first element of each line in the filter_Wtrain_data data set being an entity;
(4.2) acquiring the sentence set according to the entity dictionary and storing it in list format; preprocessing it by removing all characters except Chinese characters and common Chinese punctuation, and splitting it into sentences;
(4.3) traversing each sentence by entity, storing all entities found in the sub-sentences according to the character-matching rule, filtering out sentences with no entity or only one entity, and processing the data into the format [[sentence, [entity1, …]], …];
(4.4) performing Chinese word segmentation with the jieba library and splitting the Chinese sentences; defining sentence as the text data, sentence_seg as the text segmentation, and entity1 as a text entity; processing the data into the format [[sentence, [entity1, …], [sentence_seg]], …];
(4.5) training the word vectors;
(4.6) re-screening the entities against the segmented sentences;
(4.7) checking whether each entity appears in the segmented sentence; if not, executing step (4.10);
(4.8) combining the entity set pairwise and processing the data into the format [sentence, entity1, entity2, [sentence_seg]], one sentence yielding multiple samples;
(4.9) dividing the data of the wikidataRelation data set into a training set and a test set at a ratio of 3:1;
(4.10) removing the entity and the text.
Further, the specific steps of respectively building the PCNN, CNN, RNN, BiRNN neural network models in the step (5) are as follows:
(5.1) building the artificial neural network; in the embedding layer, defining the word-mapping function as Word_embedding, the word-embedding vector size as Word_embedding_dim = 50, the position-feature embedding vector size as Position_embedding = 5, and the maximum sentence length as 120; setting the Word_Position_embedding function to add the results of the two embedding functions;
(5.2) defining the loss functions as softmax cross-entropy and sigmoid cross-entropy; for each sample, the softmax or sigmoid layer yields a probability distribution over the predicted relations, and the maximum prediction is taken as the entity prediction result for computing the cross-entropy loss;
(5.3) setting drop_out = 0.5; computing the maximum of the elements along the tensor dimension, dividing into three segments for max-pooling so that each convolution kernel yields a 3-dimensional vector; feeding the pooling-layer output into the normalization layer and applying the tanh activation function for non-linear processing;
(5.4) each instance set is defined as a bag; a bag in the training set is positive if it contains at least one positive instance, and negative if all its instances are negative;
(5.5) adding an attention mechanism over each bag;
(5.6) judging whether training is finished; if so, executing step (5.14);
(5.7) judging whether to apply dropout;
(5.8) defining the entity list bag_pre, and defining the attention-based logistic-regression value as attention_logit;
(5.9) defining the variable i for traversing the local scope, the scope being defined as scope;
(5.10) if i > scope.shape[0], executing step (5.13);
(5.11) computing the attention value of the softmax loss function;
(5.12) i = i + 1; executing step (5.10);
(5.13) storing the obtained rank vector in the bag_pre entity list and executing step (5.19);
(5.14) finishing training and starting testing;
(5.15) defining the variable i1 for traversing the local scope;
(5.16) if i1 > scope.shape[0], executing step (5.19);
(5.17) computing the sigmoid activation value and storing the obtained entity value in the logistic-regression list bag_logit;
(5.18) i1 = i1 + 1; executing step (5.16);
(5.19) defining the four functions pcnn, cnn, rnn, and birnn; defining the hidden size as hidden_size = 230, the convolution kernel size as 3 with stride 1, using the relu activation function, and setting the local variable count;
(5.20) training, and building a pcnn, cnn, rnn and birnn neural network model;
(5.21) if count > n, executing step (5.28);
(5.22) decomposing the sentence into words in turn, each word mapped to a dimensional vector dwCalled word embedding, learning an embedded vector through model training;
(5.23) indicating the tag's identity in the sentence using the location feature1And entity2Each entity has two relative positions, respectively mapped to different dpA dimension vector;
(5.24) the two relative position results of step (5.23) are concatenated to obtain the matrix M e Rh×RsAs being an input representation, wherein Rs=dw+2*dp;
(5.25) is provided withPut W ═ Wc*RsIs a convolution matrix, where WcIs the convolution window width, and by sliding the convolution window down to the sentence and applying this function to each valid position, a feature map c ═ c is generated1,c2,..., c(h-Wc+1)]Extracting n features from the sentence;
(5.26) repeating the above process with a different W matrix;
(5.27) count = count + 1, performing step (5.21);
(5.28) piecewise max-pooling is used to select the maximum activation value in each segment of every feature map;
(5.29) dividing each feature map ci into three segments {ci1, ci2, ci3} based on the positions of the two entities; when the piecewise max pooling is complete, the results of all feature maps are concatenated into a feature vector p ∈ R^(3n) as the characteristic representation of the sentence;
(5.30) building four neural network models of the step (5.20).
Further, the specific steps of obtaining the relation extraction model M for parameter adaptive optimization in the field of agricultural knowledge graph in the step (6) are as follows:
(6.1) carrying out parameter self-adaptive optimization on a link coefficient a, an initial threshold b and an attenuation coefficient c of the PCNN neural network model;
(6.2) determining a relatively fixed link coefficient, initial threshold and attenuation coefficient; first, searching within the initial parameter range with a larger search step, and finding the optimal model parameters at that stage;
(6.3) if a plurality of parameter combinations exist and the optimal neural network model is achieved at the same time, executing the step (6.10);
(6.4) selecting the combination with the smaller link coefficient as the optimization result;
(6.5) if step is less than 0.0005, executing step (6.10);
(6.6) searching for optimal parameters by the grid;
(6.7) if the parameter traversal has finished, continue; if not, executing the step (6.6);
(6.8) obtaining the link coefficient, the initial threshold and the attenuation coefficient of the optimal PCNN neural network model, which achieve the highest accuracy;
(6.10) obtaining an optimal parameter model of the neural network.
Further, the step (7) of extracting the relationship between the entities from the text data in the agricultural field, rendering the entity relationship data through Echarts, and displaying the recommendation result on the web side comprises the following specific steps:
(7.1) obtaining the neural network model PCNN with optimal parameters, and extracting the two entities entity1 and entity2 in the agricultural text data and the relation between the two;
(7.2) mapping the triple relation based on a built small triple knowledge graph library wikidataRelation;
(7.3) performing entity extraction on the sentences in the wiki corpus by using the algorithm model, and mapping the entities into a wikidataRelation database to realize the automatic labeling function in the remote supervision algorithm;
(7.4) inputting the agricultural text data, screening the entities in the text data, and extracting the relations between the entities;
(7.5) importing data in an agricultural knowledge service system, an agricultural interactive encyclopedia and a Wikipedia into a neo4j database;
(7.6) if the searched entity exists in the database, continue; if not, the step (7.9) is executed;
(7.7) displaying the search result on the web end in graph form by using a Cypher query;
(7.8) packaging a python interface, displaying the data by using the web framework Django, and executing the step (7.10);
(7.9) displaying the absence of the entity;
(7.10) searching the text data of agricultural knowledge questions in the task box, and applying Chinese word segmentation to the text data to obtain the entities;
(7.11) searching the database by using a Cypher query;
(7.12) if the answer to the question exists in the database, continue; if not, the step (7.9) is executed;
(7.13) extracting the relation between the entities from the text data in the agricultural field, rendering the entity relation data through Echarts, and displaying the recommendation result on a web end.
By adopting the technical scheme, the invention has the following beneficial effects:
according to the method, unstructured text data in the agricultural field are crawled by using the Scrapy crawler framework, entity recognition is carried out with the KNN algorithm, relation extraction is carried out with a parameter-adaptive PCNN neural network model, and the triple relations are constructed; compared with a traditional convolutional neural network, this model can extract more sentence features. The method replaces the traditional practice of setting neural network parameter values by experience with a parameter-adaptive optimization technology, which improves the accuracy of the model's relation extraction.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a flow diagram of pre-processing data crawled by the agricultural knowledge service system and agricultural interactive encyclopedia in an embodiment;
FIG. 3 is a flowchart of a KNN algorithm model construction and text similarity comparison in a specific embodiment;
fig. 4 is a flowchart illustrating that the classification results of the predicted entities are stored by using Predict_data, and the entities in the Predict_data dataset are mapped with the entity data in the Wikipedia entity relationship dataset Result_data to obtain the dataset Train_data in the embodiment;
FIG. 5 is a flowchart of the embodiment in which an entity dictionary in a Wikipedia Chinese lexicon is constructed by replacing an entity attribute relationship according to rules, and text data is preprocessed;
FIG. 6 is a flowchart of building PCNN, CNN, RNN, BiRNN neural network models, respectively, in a specific embodiment;
FIG. 7 is a flowchart of comparing the four algorithm models to obtain the relation extraction model M for parameter adaptive optimization in the field of the agricultural knowledge graph in an embodiment;
FIG. 8 is a flow chart of extracting relationships between entities on textual data in the agricultural domain and establishing an agricultural knowledge graph for use in aiding decision making in an embodiment.
Detailed Description
The present invention is further illustrated by the following specific examples in conjunction with the accompanying drawings. It should be understood that these examples are intended only to illustrate the invention and not to limit its scope, which is defined by the appended claims; modifications of various equivalent forms made by those skilled in the art after reading the present invention likewise fall within that scope.
As shown in fig. 1 to 8, the parameter adaptive agricultural knowledge graph recommendation method based on remote supervision according to the present invention includes the following steps:
step 1: carrying out data preprocessing on the data crawled from the agricultural knowledge service system, and defining the obtained data set as Agri_data; performing data crawling on the agricultural interactive encyclopedia by using Scrapy, and defining the crawled data set as HuDong_data; performing Chinese word segmentation and word vector training on the text data of Agri_data and HuDong_data, and defining the obtained data set as Train_data. The specific method comprises the following steps:
step 1.1: performing data crawler and selecting a crawler page;
step 1.2: selecting a page;
step 1.3: selecting an agricultural knowledge service system;
step 1.4: crawling the agricultural knowledge service system, obtaining its html files, limiting the crawling range by using the front-end div, and obtaining the crop name Title, the detailed content Detail, the picture ImageList and the webpage link Url; forming the data item Aitem = {Title, Detail, ImageList, Url} and the data set Agri_data = {Aitem1, Aitem2, …, Aitemn}; executing the step 1.8;
step 1.5: selecting an agricultural interactive encyclopedia;
step 1.6: crawling the content in the agricultural interactive encyclopedia by using Scrapy, declaring the address domain of the crawler, acquiring the word list, constructing the original json file, generating the Url list, acquiring the crop Title, and crawling the picture ImageList and the open-domain label openTypeList;
step 1.7: each entity obtained by crawling corresponds to one entry of the agricultural encyclopedia, wherein each entry comprises a Title, an interactive encyclopedia link Url, a picture ImageList, an open classification list TypeList, detailed information InfoList and basic information ValueList; forming the data item Hitem = {Title, Url, ImageList, TypeList, InfoList, ValueList} and the data set HuDong_data = {Hitem1, Hitem2, …, Hitemn};
Step 1.8: acquiring two types of database sets Agri _ data and HuDong _ data;
step 1.9: performing part-of-speech screening on data in the two data sets Agri _ data and HuDong _ data;
step 1.10: discarding words containing non-Chinese and English or numbers;
step 1.11: performing Chinese word segmentation and word vector training on two database sets Agri _ data and HuDong _ data respectively;
step 1.12: a data set Train _ data is obtained.
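The token screening of steps 1.9–1.10 can be sketched as follows. This is a minimal illustrative sketch; the function name and the exact regular expression are assumptions, not taken from the patent text:

```python
import re

# Keep only tokens consisting purely of Chinese characters or English letters;
# step 1.10 discards words containing digits or other non-Chinese/English characters.
_VALID_TOKEN = re.compile(r'^[\u4e00-\u9fa5A-Za-z]+$')

def screen_tokens(tokens):
    """Drop tokens that contain non-Chinese/English characters or digits."""
    return [t for t in tokens if _VALID_TOKEN.match(t)]

print(screen_tokens(["水稻", "rice", "abc123", "小麦2号", "病虫害", "!!"]))
# ['水稻', 'rice', '病虫害']
```

The surviving tokens would then be fed to the Chinese word segmentation and word vector training of step 1.11.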
Step 2: training a KNN algorithm model by using a data set Train _ data, performing feature extraction on text data by using a fast text classification tool, and performing text similarity comparison by using a cosine similarity algorithm to obtain a text entity classification T, wherein the specific method comprises the following steps:
step 2.1: passing the text data set Train_data into the KNN text classifier, defining the mean of each component as Mean, the variance of each component as Var, the inverse text frequency index as Text_IDF and the number of texts as Item_Num; defining the similarity weights as Weight{Title, TypeList, Detail, InfoList, ValueList} = {0.2, 0.2, 0.2, 0.2, 0.2};
step 2.2: if each Item has 5 attributes in Weight, the IDF value of each attribute is added with 1;
step 2.3: for every 2 Items, returning the similarity of their Title, defined as Title_sim; the similarity of their TypeList, defined as TypeList_sim; the similarity of their Detail, defined as Detail_sim; the similarity of their InfoList, defined as InfoList_sim; and the similarity of their ValueList, defined as ValueList_sim; averaging the similarities of the 5 attributes and defining the result as Dsim;
step 2.4: the obtained similarity Dsim is linearly weighted and defined as Simi;
step 2.5: storing the similarity of each attribute of the Item in a temporary table CurList, calculating the mean and variance of each component, carrying out Gaussian normalization on the similarities of Title and TypeList, and assigning the average similarity to similarity values that do not appear;
step 2.6: weighting and summing the similarity of each Item, defining the sum as Count _ sim, sequencing the first k values of the similarity Count _ sim, and classifying the k values into one class;
step 2.7: a classification T of the text entity is obtained.
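The weighted similarity and top-k voting of steps 2.3–2.6 can be sketched as below with a toy two-attribute example; the attribute vectors, weights and function names are illustrative assumptions, and the Gaussian normalization of step 2.5 is omitted for brevity:

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(query, items, weights, k=3):
    """items: list of (attribute_vectors, label). The weighted sum of the
    per-attribute cosine similarities plays the role of Count_sim (step 2.6);
    the top-k neighbours then vote on the class."""
    scored = sorted(
        ((sum(w * cosine(query[a], attrs[a]) for a, w in weights.items()), label)
         for attrs, label in items),
        key=lambda x: -x[0])
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]
```

With Weight = {0.2, 0.2, 0.2, 0.2, 0.2} over the five attributes of step 2.1, the weighted sum reduces to the averaged similarity Dsim.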
And step 3: predicting the Result of entity classification by using the KNN algorithm model of step 2, storing the Result in the Predict_data set, and mapping the entities in the Predict_data set with the entity data in the Wikipedia entity relationship data set Result_data to obtain the data set Train_data, wherein the specific method comprises the following steps:
step 3.1: storing the predicted entity classification result obtained from the KNN algorithm in Predict_data;
step 3.2: crawling all relations and corresponding Chinese names under Wikipedia webpages, wherein the storage format is json format;
step 3.3: the crawled content comprises the relation id rid, the relation attribute rtype, the relation sub-class statement and the corresponding rlink link, stored in a relation.json file; defining the data sample Ritem = {rid, rtype, statement, rlink} and the data set Relation = {Ritem1, Ritem2, …, Ritemn}; the Chinese representation of each relation, comprising the relation id rcid and the Chinese name rchinese, is stored in a relation_chinese.json file, defining the data sample Mitem = {rcid, rchinese} and the data set Mrelation = {Mitem1, Mitem2, …, Mitemn};
Step 3.4: merging data in relation data sets relationship.json and relationship.json, defining a data set result.json, and storing a result in a result.json file;
step 3.5: defining an entites.json database, searching the data in Predict_data on Wikipedia, returning the json content and storing it in the entites.json file;
step 3.6: Wikidata is an open knowledge base; crawling the description of each entity in its Wikidata entity page and the corresponding relations associated with the entity, defining a wikidataRelation.json file and storing the result in it; defining the data sample Witem = {entity1, relation, entity2} and the data set WikidataRelation = {Witem1, Witem2, …, Witemn};
Step 3.7: data in a WikidataRelations.json database is processed into a csv file format, data in an agricultural interactive encyclopedia HuDong _ data corresponds to a Presect _ data database to obtain a node.csv file, a data sample Nitem is defined to be { Title, Label }, and a data set Node is defined to be { Nitem }1,Nitem2,…,Nitemn};
Step 3.8: the Wiki Chinese encyclopedia corpus is simplified from traditional form to simplified form, and the line feed symbols in the line feed sentences are removed;
step 3.9: selecting a training set related to agriculture, and selecting the relations "instance of", "taxon rank", "subclass of" and "parent taxon";
step 3.10: preloading a list of entities (selecting entities in the following categories: 5: Animal, 6: Plant, 7: Chemicals, 9: Food items, 10: Diseases, 12: Nutrients, 13: Biochemistry, 14: Agricultural experiments, 15: Technology);
step 3.11: agricultural related statements are stored in a FileRead file, a triple relation obtained in wikidataRelation.json is aligned to a corpus of a Chinese wiki, and a corpus Wtrain _ data of a training set is defined;
step 3.12: loading the entity-to-label mapping dictionary, performing part-of-speech screening on the words in each sentence, and reading the category of each entity, which is consistent with the entity categories in Predict_data;
step 3.13: filtering out null values according to the attribute relations in the training set obtained by aligning the Wikipedia data set, obtaining Filter_Wtrain_data; defining the data sample Fitem = {entity1id, entity1, entity2id, entity2, statement, relation} and the data set Node = {Fitem1, Fitem2, …, Fitemn}.
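The remote-supervision alignment of step 3.11 — labelling a wiki sentence with a known triple whenever both of its entities occur in that sentence — can be sketched as follows; function and field names are illustrative assumptions:

```python
def align_corpus(sentences, triples):
    """Remote supervision: if a sentence mentions both entities of a known
    triple (entity1, relation, entity2) from the triple store, label the
    sentence with that relation automatically."""
    labeled = []
    for sent in sentences:
        for (e1, e2), rel in triples.items():
            if e1 in sent and e2 in sent:
                labeled.append({'entity1': e1, 'entity2': e2,
                                'statement': sent, 'relation': rel})
    return labeled
```

The labelling noise this heuristic inevitably introduces is what the bag-level attention mechanism of step 5 is meant to absorb.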
And 4, step 4: an entity dictionary in a Wikipedia Chinese word stock is constructed by utilizing a heuristic rule screening algorithm, and Filter _ Wtrain _ data text data are preprocessed to obtain a wikidataRelation data set, wherein the specific method comprises the following steps:
step 4.1: the first element of each line in the screened Filter _ Wtrain _ data dataset is an entity, and all entities which are Chinese characters are screened according to a regular expression and converted into a dictionary format;
step 4.2: acquiring the sentence set according to the entity dictionary and storing it in list format; preprocessing the sentence set, removing all characters except Chinese characters and common Chinese punctuation, and splitting the sentences;
step 4.3: traversing each sentence by entity; according to the character matching rule, storing all entities in each sub-sentence, filtering out sentences with no entity or only one entity, and processing the data into the format [[sentence, [entity1, …]], …];
step 4.4: performing Chinese word segmentation on the Chinese sentences by using the jieba library; defining sentence as the text data, sentence_seg as the text segmentation and entity1 as a text entity, and processing the data into the format [[sentence, [entity1, …], [sentence_seg]], …];
step 4.5: training a word vector;
step 4.6: re-screening the entities by the sentence after word segmentation;
step 4.7: if the entity appears in the segmented sentence, continue; if not, the step 4.10 is executed;
step 4.8: combining the entities in the entity set pairwise, and processing the data into the format [sentence, entity1, entity2, [sentence_seg]], one sentence yielding multiple samples;
step 4.9: dividing the data in the wikidataRelation data set into a training set and a test set in a 3:1 ratio;
step 4.10: removing the entity together with its text.
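Steps 4.1–4.8 — screening pure-Chinese entities, keeping only those that survive word segmentation, and pairing them so that one sentence yields several samples — can be sketched as follows; the function name and the regular expression are illustrative assumptions:

```python
import re
from itertools import combinations

# Step 4.1: keep only entities made purely of Chinese characters.
_CHINESE = re.compile(r'^[\u4e00-\u9fa5]+$')

def build_samples(sentence, sentence_seg, entities):
    """Keep pure-Chinese entities that appear in the segmented sentence
    (steps 4.6-4.7), then combine them pairwise (step 4.8) into samples
    of the form [sentence, entity1, entity2, [sentence_seg]]."""
    kept = [e for e in entities if _CHINESE.match(e) and e in sentence_seg]
    return [[sentence, e1, e2, sentence_seg] for e1, e2 in combinations(kept, 2)]
```

A sentence with fewer than two surviving entities yields no sample, matching the filtering of step 4.3.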
And 5: respectively building PCNN, CNN, RNN and BiRNN neural network models, wherein the specific method comprises the following steps:
step 5.1: building the artificial neural network; in the embedding layer, defining the word mapping function as word_embedding, the word embedding vector size word_embedding_dim = 50, the position-feature embedding vector size position_embedding = 5 and the maximum length 120, and setting a word_position_embedding function that adds the results of the two embedding functions;
step 5.2: defining the loss functions, including softmax cross-entropy and sigmoid cross-entropy; the softmax or sigmoid layer of each sample yields a different probability distribution and prediction relation, and the maximum prediction result is used as the entity prediction result for calculating the cross-entropy loss.
Step 5.3: and setting drop _ out as definition 0.5, calculating the maximum value of elements on tensor dimension, dividing the maximum value into three sections of maximum pooling, obtaining a 3-dimensional vector by each convolution kernel, inputting the result obtained by the pooling layer into a normalization function layer, and performing nonlinear processing by using a tanh activation function.
Step 5.4: each instance is defined as a bag, and if the bags in the training set are positive, the number of positive instances is more than or equal to 1; if negative, the examples are all negative;
step 5.5: adding an attention mechanism to each bag;
step 5.6: judging whether training has finished; if so, executing step 5.14;
step 5.7: judging whether dropout is applied;
step 5.8: defining an entity prediction list bag_pre, and defining the attention-based logit value as attention_logit;
step 5.9: defining a variable i for traversing in a local scope, wherein the scope is defined as scope;
step 5.10: if i > scope.shape[0], go to step 5.13;
step 5.11: calculating an attention value of the softmax loss function;
step 5.12: i = i + 1, performing step 5.10;
step 5.13: the obtained rank vector is stored in the bag _ pre entity list, and step 5.19 is executed;
step 5.14: after training, starting testing;
step 5.15: defining a variable i1 for traversing the local scope;
step 5.16: if i1 > scope.shape[0], executing step 5.19;
step 5.17: calculating the sigmoid activation function value, and storing the obtained entity value in the logistic regression list bag_logit;
step 5.18: i1 = i1 + 1, go to step 5.16;
step 5.19: defining four functions pcnn, cnn, rnn and birnn, setting the hidden size hidden_size to 230, the convolution kernel size to 3 and the stride to 1, using the relu activation function, and setting a local loop variable count;
step 5.20: training, and building a pcnn, cnn, rnn and birnn neural network model;
step 5.21: if count > n, go to step 5.28;
step 5.22: decomposing the sentence into words, each word being mapped to a dw-dimensional vector called a word embedding; the embedding vectors are learned through model training;
step 5.23: indicating the positions of entity1 and entity2 in the sentence using position features; each word has two relative positions, one to each entity, respectively mapped to dp-dimensional vectors;
step 5.24: concatenating the word embedding with the two relative position embeddings of step 5.23 to obtain a matrix M ∈ R^(h×s) as the input representation, where s = dw + 2·dp and h is the sentence length;
step 5.25: setting W ∈ R^(wc×s) as a convolution matrix, where wc is the convolution window width; sliding the convolution window over the sentence and applying the filter at each valid position generates a feature map c = [c1, c2, …, c(h−wc+1)]; n such filters extract n features from the sentence;
step 5.26: repeating the above process with different W matrices;
step 5.27: count = count + 1, performing step 5.21;
step 5.28: piecewise max-pooling is used to select the maximum activation value in each segment of every feature map;
step 5.29: dividing each feature map ci into three segments {ci1, ci2, ci3} based on the positions of the two entities; when the piecewise max pooling is complete, the results of all feature maps are concatenated into a feature vector p ∈ R^(3n) as the characteristic representation of the sentence;
step 5.30: and (5) building four neural network models of the step 5.20.
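The piecewise max-pooling of steps 5.28–5.29 can be sketched numerically as below. This is a minimal pure-Python sketch — in the model itself the operation runs on convolution outputs inside the network, and the function names are assumptions:

```python
def piecewise_max_pooling(feature_map, pos1, pos2):
    """Split one convolution feature map into three segments by the two
    entity positions and take the maximum of each segment, yielding the
    3-dimensional vector of step 5.29."""
    segments = [feature_map[:pos1], feature_map[pos1:pos2], feature_map[pos2:]]
    return [max(seg) if seg else 0.0 for seg in segments]

def sentence_feature(feature_maps, pos1, pos2):
    """Concatenate the pooled results of n feature maps into the
    3n-dimensional sentence representation p."""
    p = []
    for fm in feature_maps:
        p.extend(piecewise_max_pooling(fm, pos1, pos2))
    return p
```

This segment-wise pooling preserves the coarse positional structure around the two entities that plain max pooling discards, which is why the PCNN can extract more sentence features than a traditional CNN.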
Step 6: and comparing the four algorithm models to obtain a relation extraction model M for parameter self-adaption optimization in the field of the agricultural knowledge graph, and specifically comprising the following steps of:
step 6.1: performing parameter self-adaptive optimization on a link coefficient a, an initial threshold b and an attenuation coefficient c of the PCNN neural network model;
step 6.2: determining a relatively fixed link coefficient, initial threshold and attenuation coefficient; first, searching within the initial parameter range with a larger search step, and finding the optimal model parameters at that stage;
step 6.3: if a plurality of parameter combinations exist and the optimal neural network model is achieved at the same time, executing the step 6.10;
step 6.4: selecting the combination with the smaller link coefficient as an optimization result;
step 6.5: if step is less than 0.0005, executing step 6.10;
step 6.6: searching for optimal parameters by a grid;
step 6.7: if the parameter traversal has finished, continue; if not, the step 6.6 is executed;
step 6.8: obtaining the link coefficient, the initial threshold and the attenuation coefficient of the optimal PCNN neural network model, which achieve the highest accuracy;
step 6.10: and obtaining an optimal parameter model of the neural network.
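The coarse-to-fine parameter search of steps 6.2–6.8 can be sketched as follows. This is a generic sketch over the three parameters (a, b, c); the step-halving schedule and the tie-break are assumptions filling in details the patent states only loosely:

```python
from itertools import product

def frange(lo, hi, step):
    """Inclusive floating-point range, rounded to suppress drift."""
    vals, x = [], lo
    while x <= hi + 1e-12:
        vals.append(round(x, 10))
        x += step
    return vals

def grid_search(evaluate, bounds, step):
    """One grid pass over (a, b, c); ties are broken by the smaller link
    coefficient a (step 6.4)."""
    best, best_score = None, float('-inf')
    for a, b, c in product(*(frange(lo, hi, step) for lo, hi in bounds)):
        score = evaluate(a, b, c)
        if score > best_score or (score == best_score and best is not None and a < best[0]):
            best, best_score = (a, b, c), score
    return best, best_score

def coarse_to_fine(evaluate, bounds, step=0.1, min_step=0.0005):
    """Steps 6.2-6.8: search with a larger step first, then repeatedly
    halve the step around the current best point until the step falls
    below 0.0005 (the termination test of step 6.5)."""
    best, score = grid_search(evaluate, bounds, step)
    while step >= min_step:
        bounds = [(max(lo, p - step), min(hi, p + step))
                  for p, (lo, hi) in zip(best, bounds)]
        step /= 2
        best, score = grid_search(evaluate, bounds, step)
    return best, score
```

Here `evaluate` stands for the accuracy of the PCNN model trained with the given link coefficient, initial threshold and attenuation coefficient.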
And 7: extracting the relation between entities from the text data in the agricultural field, rendering the entity relation data through Echarts, and displaying the recommendation result on a web end, wherein the method comprises the following specific steps:
step 7.1: obtaining the neural network model PCNN with optimal parameters, and extracting the two entities entity1 and entity2 in the agricultural text data and the relation between the two;
step 7.2: mapping the triple relation based on a built small triple knowledge graph library wikidataRelation;
step 7.3: utilizing an algorithm model to extract entities of sentences in a wiki corpus, and mapping the entities into a wikidatarelationship database to realize an automatic labeling function in a remote supervision algorithm;
step 7.4: inputting the agricultural text data, screening the entities in the text data, and extracting the relations between the entities;
step 7.5: importing data in an agricultural knowledge service system, an agricultural interactive encyclopedia and a Wikipedia into a neo4j database;
step 7.6: if the searched entity exists in the database, continue; if not, step 7.9 is executed;
step 7.7: displaying the search result on the web end in graph form by using a Cypher query;
step 7.8: packaging a python interface, displaying the data by using the web framework Django, and executing step 7.10;
step 7.9: displaying that the entity does not exist;
step 7.10: searching the text data of the agricultural knowledge question in the task box, and applying Chinese word segmentation to the text data to obtain the entities;
step 7.11: searching the database by using a Cypher query;
step 7.12: if the answer to the question exists in the database, continue; if not, step 7.9 is executed;
step 7.13: and extracting the relation between the entities from the text data in the agricultural field, rendering the entity relation data through Echarts, and displaying the recommendation result on a web end.
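The graph query and rendering of steps 7.11–7.13 can be sketched as below. Note that the query language of neo4j is Cypher; the node label, property names and the ECharts data shape here are illustrative assumptions rather than the patent's actual schema, and the query string would be executed through a neo4j driver behind the Django interface:

```python
def entity_query(title):
    """Build a parameterized Cypher query returning the one-hop relations
    of an entity node matched by title (step 7.11)."""
    query = ("MATCH (e:Entity {title: $title})-[r]->(m) "
             "RETURN e.title AS source, type(r) AS relation, m.title AS target")
    return query, {"title": title}

def to_echarts(rows):
    """Shape query result rows into the nodes/links structure that an
    ECharts graph series renders on the web end (step 7.13)."""
    names = {r['source'] for r in rows} | {r['target'] for r in rows}
    return {
        'nodes': [{'name': n} for n in sorted(names)],
        'links': [{'source': r['source'], 'target': r['target'],
                   'value': r['relation']} for r in rows],
    }
```

Passing the title as a query parameter rather than interpolating it into the string avoids Cypher injection from user input in the task box.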
All of the above parameters are defined in the following table:
264093 pieces of data are processed; the KNN algorithm is used to extract features and to classify and predict entities, and a remote-supervision-based PCNN artificial neural network model is built for relation extraction and model training, obtaining the triple relations of the training texts, showing the relations among different crop entities to the user and accelerating information search. In tests, the accuracy of the experimental model using the PCNN algorithm exceeds 94%.
The invention creatively provides a parameter-adaptive agricultural knowledge graph system based on remote supervision, obtains through adaptive optimization and parameter adjustment an optimal neural network model for relation extraction in the agricultural field, and is applicable to general unstructured text data about crops.
Claims (8)
1. A parameter self-adaptive agricultural knowledge graph recommendation method based on remote supervision is characterized by comprising the following specific steps:
(1) data preprocessing is carried out on the data crawled by the agricultural knowledge service system, and the obtained data set is defined as Agri_data; performing data crawling on the agricultural interactive encyclopedia by using Scrapy, and defining the crawled data set as HuDong_data; performing Chinese word segmentation and word vector training on the text data in Agri_data and HuDong_data, and defining the obtained data set as Train_data;
(2) training a KNN algorithm model by using a data set Train _ data, performing feature extraction on text data by using a fast text classification tool, and performing text similarity comparison by using a cosine similarity algorithm to obtain a text entity classification T;
(3) predicting the entity classification Result by utilizing the KNN algorithm model of the step (2), storing it in the Predict_data set, and mapping the entities in the Predict_data set with the entity data in the Wikipedia entity relationship data set Result_data to obtain the data set Train_data;
(4) constructing an entity dictionary in a Wikipedia Chinese word stock by using a heuristic rule screening algorithm, and preprocessing the text data of Filter _ Wtrain _ data to obtain a wikidadata relationship data set;
(5) respectively building PCNN, CNN, RNN and BiRNN neural network models;
(6) comparing the four algorithm models to obtain a relation extraction model M for parameter self-adaption optimization in the field of agricultural knowledge maps;
(7) and extracting the relation between the entities from the text data in the agricultural field, rendering the entity relation data through Echarts, and displaying the recommendation result on a web end.
2. The remote supervision-based parameter adaptive agricultural knowledge graph recommendation method according to claim 1, wherein the data set Train _ data obtained in the step (1) comprises the following specific steps:
(1.1) performing data crawler and selecting a crawler page;
(1.2) selecting a page;
(1.3) selecting an agricultural knowledge service system;
(1.4) crawling the agricultural knowledge service system, obtaining its html files, limiting the crawling range by using the front-end div, and obtaining the crop name Title, the detailed content Detail, the picture ImageList and the webpage link Url; forming the data item Aitem = {Title, Detail, ImageList, Url} and the data set Agri_data = {Aitem1, Aitem2, …, Aitemn}; executing the step (1.8);
(1.5) selecting an agricultural interactive encyclopedia;
(1.6) crawling the contents in the agricultural interactive encyclopedia by using Scrapy, declaring the address domain of the crawler, acquiring the word list, constructing the original json file, generating the Url list, acquiring the crop Title, and crawling the picture ImageList and the open-domain label openTypeList;
(1.7) each entity obtained by crawling corresponds to one entry of the agricultural encyclopedia, wherein each entry comprises a Title, an interactive encyclopedia link Url, a picture ImageList, an open classification list TypeList, detailed information InfoList and basic information ValueList; forming the data item Hitem = {Title, Url, ImageList, TypeList, InfoList, ValueList} and the data set HuDong_data = {Hitem1, Hitem2, …, Hitemn};
(1.8) acquiring two types of database sets Agri _ data and HuDong _ data;
(1.9) performing part-of-speech screening on data in the two data sets Agri _ data and HuDong _ data;
(1.10) discarding words containing non-Chinese and English or numeric characters;
(1.11) performing Chinese word segmentation and word vector training on two database sets Agri _ data and HuDong _ data respectively;
(1.12) obtaining a data set Train _ data.
3. The remote supervision-based parameter adaptive agricultural knowledge base map recommendation method according to claim 1, wherein the specific steps of obtaining the text entity classification T in the step (2) are as follows:
(2.1) passing the text data set Train_data into the KNN text classifier, defining the mean of each component as Mean, the variance of each component as Var, the inverse text frequency index as Text_IDF and the number of texts as Item_Num; defining the similarity weights as Weight{Title, TypeList, Detail, InfoList, ValueList} = {0.2, 0.2, 0.2, 0.2, 0.2};
(2.2) if each Item has 5 attributes in Weight, adding 1 to the IDF value of each attribute;
(2.3) for every 2 Items, returning the similarity of their Title, defined as Title_sim; the similarity of their TypeList, defined as TypeList_sim; the similarity of their Detail, defined as Detail_sim; the similarity of their InfoList, defined as InfoList_sim; and the similarity of their ValueList, defined as ValueList_sim; averaging the similarities of the 5 attributes and defining the result as Dsim;
(2.4) linearly weighting the obtained similarity Dsim, and defining the similarity Dsim as Simi;
(2.5) storing the attribute similarities of the Item in a temporary table CurList, calculating the mean and variance of each component, carrying out Gaussian normalization on the similarities of Title and TypeList, and assigning the average similarity to similarity values that do not appear;
(2.6) carrying out weighted sum on the similarity of each Item, defining the sum as Count _ sim, sequencing the first k values of the similarity Count _ sim, and classifying the k values into one class;
and (2.7) obtaining the classification T of the text entity.
4. The remote supervision-based parameter adaptive agricultural knowledge graph recommendation method according to claim 1, wherein the specific steps of obtaining the data set Train _ data in the step (3) are as follows:
(3.1) storing the predicted entity classification result obtained from the KNN algorithm in Predict_data;
(3.2) crawling all the relations and the corresponding Chinese names under the Wikipedia webpages by utilizing Scrapy, wherein the storage format is json format;
(3.3) the crawled content comprises the relation id rid, the relation attribute rtype, the relation sub-class statement and the corresponding rlink link, stored in a relation.json file; defining the data sample Ritem = {rid, rtype, statement, rlink} and the data set Relation = {Ritem1, Ritem2, …, Ritemn}; the Chinese representation of each relation, comprising the relation id rcid and the Chinese name rchinese, is stored in a relation_chinese.json file, defining the data sample Mitem = {rcid, rchinese} and the data set Mrelation = {Mitem1, Mitem2, …, Mitemn};
(3.4) merging the data in the relation data sets relation.json and relation_chinese.json, defining a data set result.json, and storing the result in the result.json file;
(3.5) defining an entites.json database, searching the data in Predict_data on Wikipedia, returning the json content and storing it in the entites.json file;
(3.6) Wikitata is an open knowledge base, the description of an entity in a Wikitata entity page and the corresponding relation associated with the entity are crawled, a wikitataRelationship file is defined, the result is stored in the wikitataRelationship file, and a data sample Witem is defined as { entity ═ degree1,relation,entity2Data set wikidatarelationship (Witem)1,Witem2,…,Witemn};
(3.7) processing data in WikidataRelation. json database into csv file format, and performing agricultural interactionCorresponding data in the HuDong _ data in the encyclopedic to a Presect _ data database to obtain a node.csv file, defining a data sample Nitem ═ { Title, Label }, and defining a data set Node ═ Nitem ═ Title }1,Nitem2,…,Nitemn};
(3.8) converting the Wiki Chinese encyclopedia corpus from traditional to simplified Chinese and removing the line-feed symbols in the sentences;
(3.9) selecting a training set related to agriculture, choosing the relations "instance of", "taxon rank", "subclass of" and "parent taxon";
(3.10) preloading an entity list (selecting entities in the following categories: 5: Animal, 6: Plant, 7: Chemicals, 9: Foods, 10: Diseases, 12: Nutrients, 13: Biochemistry, 14: Agricultural entities, 15: Technology);
(3.11) storing the agriculture-related sentences in a FileRead file, aligning the triple relations obtained from wikidataRelation.json to the Chinese Wikipedia corpus, and defining the training corpus Wtrain_data;
(3.12) loading the entity-to-label mapping dictionary, performing part-of-speech screening on the words in each sentence, and reading the entity categories so that they are consistent with the categories in Predict_data;
(3.13) filtering out the null-valued attribute relations in the training set obtained by aligning with the Wikipedia data set to obtain filter_Wtrain_data; a data sample is defined as Fitem = {entity1id, entity1, entity2id, entity2, statement, relation} and the data set as Node = {Fitem1, Fitem2, …, Fitemn}.
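The distant-supervision alignment of steps (3.11)-(3.13) — matching known triples against corpus sentences and discarding null relations — can be sketched as follows. This is a simplified illustration; the helper name and the toy triples are assumptions, not part of the claim.

```python
def align_triples_to_corpus(triples, sentences):
    """Distant supervision in miniature: a sentence that mentions both
    entities of a known triple is taken as a training sample for that
    triple's relation (step 3.11); triples whose relation is null are
    filtered out (step 3.13)."""
    train = []
    for e1, rel, e2 in triples:
        if not rel:                      # drop null-valued relations
            continue
        for sent in sentences:
            if e1 in sent and e2 in sent:
                train.append({"entity1": e1, "entity2": e2,
                              "relation": rel, "sentence": sent})
    return train

# toy knowledge-base triples and corpus (illustrative only)
triples = [("wheat", "instance of", "crop"), ("rust", "", "disease")]
corpus = ["wheat is a widely grown crop", "rust damages wheat leaves"]
data = align_triples_to_corpus(triples, corpus)
```

Only the first sentence mentions both "wheat" and "crop", so it becomes the single labelled sample; the second triple is dropped for its empty relation.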
5. The method for recommending a parameter-adaptive agricultural knowledge graph based on remote supervision according to claim 1, wherein the specific steps of obtaining the wikidataRelation data set in step (4) are as follows:
(4.1) screening out the entities consisting entirely of Chinese characters with a regular expression and converting them into dictionary format, the first element of each line in the filter_Wtrain_data data set being an entity;
(4.2) acquiring the sentence set according to the entity dictionary, storing it in list format, preprocessing it by removing all characters other than Chinese characters and common Chinese punctuation, and splitting the sentences;
(4.3) traversing each sentence by entity, storing all the entities found in the sub-sentences according to the character matching rule, filtering out sentences containing no entity or only one entity, and processing the data into the format [[sentence, [entity1, …]], …];
(4.4) carrying out Chinese word segmentation with the jieba library and splitting the Chinese sentences; defining content as the text data, content_seg as the segmented text and entity1 as a text entity, and processing the data into the format [[sentence, [entity1, …], [sentence_seg]], …];
(4.5) training the word vectors;
(4.6) re-screening the entities against the segmented sentence;
(4.7) if an entity does not appear in the segmented sentence, executing step (4.10);
(4.8) combining the entity set pairwise and processing the data into the format [sentence, entity1, entity2, [sentence_seg]], one sentence yielding multiple samples;
(4.9) dividing the data in the wikidataRelation data set into a training set and a test set at a ratio of 3:1;
(4.10) removing the entity and the text.
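Steps (4.6)-(4.8) — re-screening entities against the segmented sentence and pairing them into samples — can be sketched as follows. This is an illustrative sketch: `seg` stands in for the output of jieba's `lcut`, and the helper name and example sentence are assumptions.

```python
from itertools import combinations

def build_samples(sentence, seg, entities):
    """Keep only the entities that survive word segmentation (step 4.7),
    then combine them pairwise (step 4.8) so that one sentence yields one
    sample per entity pair, in the format [sentence, e1, e2, sentence_seg]."""
    kept = [e for e in entities if e in seg]   # re-screen entities, step (4.6)/(4.7)
    if len(kept) < 2:                          # no usable pair: drop, step (4.10)
        return []
    return [[sentence, e1, e2, seg] for e1, e2 in combinations(kept, 2)]

sent = "小麦锈病危害小麦叶片"
seg = ["小麦", "锈病", "危害", "小麦", "叶片"]   # pretend jieba.lcut output
samples = build_samples(sent, seg, ["小麦", "锈病", "叶片", "玉米"])
```

Three of the four candidate entities appear in the segmented sentence, so the sentence yields three pairwise samples, matching the "one sentence for a plurality of samples" behaviour of step (4.8).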
6. The remote supervision-based parameter adaptive agricultural knowledge graph recommendation method according to claim 1, wherein the specific steps of respectively building PCNN, CNN, RNN and BiRNN neural network models in the step (5) are as follows:
(5.1) building an artificial neural network; in the embedding layer, defining the word mapping function as word_embedding, the word embedding vector size as word_embedding_dim = 50, the position feature embedding vector size as position_embedding = 5 and the maximum sentence length as 120, and setting the word_position_embedding function to add the results of the two embedding functions;
(5.2) defining three loss functions, including softmax cross entropy and sigmoid cross entropy; for each sample, the softmax or sigmoid layer yields a probability distribution over the candidate relations, and the maximum prediction result is taken as the entity prediction result for calculating the cross-entropy loss;
(5.3) setting drop_out = 0.5, calculating the maximum value of the elements in each tensor dimension, dividing each feature map into three segments for max pooling so that each convolution kernel yields a 3-dimensional vector, feeding the pooling result into a normalization layer and applying the tanh activation function for nonlinear processing;
(5.4) defining each set of instances as a bag; if a bag in the training set is positive, the number of positive instances in it is greater than or equal to 1; if a bag is negative, its instances are all negative;
(5.5) adding an attention mechanism on each bag;
(5.6) judging whether training is performed; if so, executing step (5.14);
(5.7) judging whether dropout is applied;
(5.8) defining an entity list bag_pre, and defining the attention-based logistic regression value as attention_logit;
(5.9) defining a variable i for traversing in the local scope, defining the scope as scope;
(5.10) if i > scope.shape[0], executing step (5.13);
(5.11) calculating an attention value of the softmax loss function;
(5.12) i = i + 1, executing step (5.10);
(5.13) storing the obtained rank vector in the bag_pre entity list and executing step (5.19);
(5.14) finishing training and starting testing;
(5.15) defining the variable i1 for traversing in a local scope;
(5.16) if i1 > scope.shape[0], executing step (5.19);
(5.17) calculating the sigmoid activation function value and storing the obtained entity value in the logistic regression list of bag_logit;
(5.18) i1 = i1 + 1, executing step (5.16);
(5.19) defining the four functions pcnn, cnn, rnn and birnn, with the hidden size hidden_size = 230, convolution kernel size 3 and stride 1, using the relu activation function and setting a local variable count;
(5.20) training, and building a pcnn, cnn, rnn and birnn neural network model;
(5.21) if count > n, executing step (5.28);
(5.22) decomposing the sentence into words in turn, each word being mapped to a dw-dimensional vector called a word embedding, the embedding vectors being learned through model training;
(5.23) using position features to indicate the positions of entity1 and entity2 in the sentence; each word has two relative positions, one with respect to each entity, each mapped to a different dp-dimensional vector;
(5.24) concatenating the embeddings of step (5.23) to obtain the input representation matrix M ∈ R^(h×s), where s = dw + 2*dp;
(5.25) setting W ∈ R^(wc×s) as the convolution matrix, where wc is the convolution window width; sliding the convolution window down the sentence and applying the filter at each valid position generates a feature map c = [c1, c2, …, c(h−wc+1)], extracting n features from the sentence;
(5.26) repeating the above process with a different W matrix;
(5.27) count +1, performing step (5.21);
(5.28) piecewise max pooling is used to select the maximum activation value in each segment of every feature map;
(5.29) dividing each feature map ci into three segments {ci1, ci2, ci3} based on the positions of the two entities; when the piecewise max pooling is complete, the results of all feature maps are concatenated into a vector p ∈ R^(3n) as the feature representation of the sentence;
(5.30) building four neural network models of the step (5.20).
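The piecewise max pooling of steps (5.28)-(5.29) can be shown concretely for a single feature map. This is an illustrative Python sketch (the real model operates on tensors inside the network); the helper name and the toy activation values are assumptions.

```python
def piecewise_max_pool(feature_map, pos1, pos2):
    """Split one convolution feature map into three segments around the
    positions of the two entities and keep the maximum of each segment,
    yielding a 3-dimensional vector per convolution kernel (steps 5.28-5.29).
    With n kernels, concatenating the n vectors gives p in R^(3n)."""
    segs = [feature_map[:pos1 + 1],
            feature_map[pos1 + 1:pos2 + 1],
            feature_map[pos2 + 1:]]
    return [max(s) if s else 0.0 for s in segs]

# one feature map over a 6-word sentence, entities at positions 1 and 3
c = [0.1, 0.9, 0.3, 0.5, 0.2, 0.7]
p = piecewise_max_pool(c, pos1=1, pos2=3)   # → [0.9, 0.5, 0.7]
```

Unlike ordinary max pooling, which would collapse the map to one scalar, the three-segment split preserves coarse structural information about what lies before, between and after the two entities.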
7. The method for recommending a parameter-adaptive agricultural knowledge graph based on remote supervision according to claim 1, wherein the specific steps of obtaining the parameter-adaptively optimized relation extraction model M for the agricultural knowledge graph field in step (6) are as follows:
(6.1) carrying out parameter self-adaptive optimization on a link coefficient a, an initial threshold b and an attenuation coefficient c of the PCNN neural network model;
(6.2) determining a relatively fixed link coefficient, initial threshold and attenuation coefficient by first searching the initial parameter range with a larger search step and finding the optimal model parameters at that step;
(6.3) if multiple parameter combinations reach the optimal neural network model simultaneously, executing step (6.10);
(6.4) selecting the combination with the smaller link coefficient as the optimization result;
(6.5) if the search step is less than 0.0005, executing step (6.10);
(6.6) searching for the optimal parameters over the grid;
(6.7) judging whether the parameter traversal has ended; if not, executing step (6.6);
(6.8) obtaining the highest accuracy with the link coefficient, initial threshold and attenuation coefficient of the optimal PCNN neural network model;
and (6.10) obtaining an optimal parameter model of the neural network.
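The coarse-to-fine grid search of steps (6.2)-(6.8) can be sketched in one dimension as follows. This is an assumed illustration: the claim searches three coupled parameters, while the sketch shows the search pattern for a single coefficient against a toy accuracy surface; all names and the 0.0005 stopping step are taken from or modelled on the claim.

```python
def coarse_to_fine_search(evaluate, lo, hi, step, min_step=0.0005):
    """Scan the range with the current step, keep the best value
    (preferring the smaller coefficient on ties, as in step 6.4),
    then halve the step and narrow the range around the best point
    until the step falls below min_step (step 6.5)."""
    best = lo
    while step >= min_step:
        x, grid = lo, []
        while x <= hi + 1e-12:          # build the current grid
            grid.append(x)
            x += step
        # maximize accuracy; ties resolved toward the smaller parameter
        best = min(grid, key=lambda v: (-evaluate(v), v))
        lo, hi = max(lo, best - step), min(hi, best + step)
        step /= 2
    return best

# toy accuracy surface peaking at 0.3 (an assumption for illustration)
acc = lambda a: -(a - 0.3) ** 2
a_opt = coarse_to_fine_search(acc, 0.0, 1.0, 0.1)
```

Starting with a coarse step of 0.1 and repeatedly halving it keeps the total number of model evaluations far below an exhaustive fine grid over the whole range.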
8. The parameter-adaptive agricultural knowledge graph recommendation method based on remote supervision according to claim 1, wherein the specific steps in step (7) of extracting the relations between entities from agricultural-field text data, rendering the entity relation data with Echarts and displaying the recommendation result on the web side are as follows:
(7.1) obtaining the neural network model PCNN with optimal parameters, and extracting the two entities entity1 and entity2 in the agricultural text data and the relation between them;
(7.2) mapping the triple relations based on the built small triple knowledge graph library wikidataRelation;
(7.3) performing entity extraction on the sentences in the wiki corpus with the algorithm model and mapping the entities into the wikidataRelation database to realize the automatic labeling function of the remote supervision algorithm;
(7.4) inputting the agricultural text data, screening the entities in the text data, and extracting the relations between the entities and the text data;
(7.5) importing the data of the agricultural knowledge service system, the agricultural Interactive Encyclopedia and Wikipedia into the neo4j database;
(7.6) judging whether the searched entity exists in the database; if not, executing step (7.9);
(7.7) displaying the search result in graph form on the web side using a Cypher statement;
(7.8) encapsulating the python interface, displaying the data with the web framework Django, and executing step (7.10);
(7.9) displaying that the entity does not exist;
(7.10) searching the text of an agricultural knowledge question in the task box and applying Chinese word segmentation to the text to obtain the entities;
(7.11) searching the database with a Cypher statement;
(7.12) judging whether the answer to the question exists in the database; if not, executing step (7.9);
(7.13) extracting the relations between entities from the agricultural-field text data, rendering the entity relation data with Echarts, and displaying the recommendation result on the web side.
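The neo4j lookup of steps (7.10)-(7.11) amounts to turning a recognized entity into a parameterized Cypher query. The sketch below only builds the query string; the node label `Entity` and property `title` are illustrative assumptions, not names fixed by the method, and no driver connection is shown.

```python
def build_entity_query(entity, limit=25):
    """Build a parameterized Cypher query that fetches an entity's
    outgoing relations for graph rendering (steps 7.10-7.11).
    Using a $title parameter avoids injecting user text into the query."""
    query = (
        "MATCH (e:Entity {title: $title})-[r]->(n) "
        f"RETURN e.title, type(r), n.title LIMIT {int(limit)}"
    )
    return query, {"title": entity}

query, params = build_entity_query("小麦")
```

In a full system the pair would be passed to a neo4j session (e.g. `session.run(query, params)`), and the returned triples rendered as a graph with Echarts on the web side.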
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010794151.4A CN112199508B (en) | 2020-08-10 | 2020-08-10 | Parameter self-adaptive agricultural knowledge graph recommendation method based on remote supervision |
Publications (2)
Publication Number | Publication Date |
---|---|
CN112199508A true CN112199508A (en) | 2021-01-08 |
CN112199508B CN112199508B (en) | 2024-01-19 |
Family
ID=74004961
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202010794151.4A Active CN112199508B (en) | 2020-08-10 | 2020-08-10 | Parameter self-adaptive agricultural knowledge graph recommendation method based on remote supervision |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN112199508B (en) |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113064999A (en) * | 2021-03-19 | 2021-07-02 | 南方电网调峰调频发电有限公司信息通信分公司 | Knowledge graph construction algorithm, system, equipment and medium based on IT equipment operation and maintenance |
CN113159320A (en) * | 2021-03-08 | 2021-07-23 | 北京航空航天大学 | Scientific and technological resource data integration method and device based on knowledge graph |
CN113723760A (en) * | 2021-07-30 | 2021-11-30 | 哈尔滨工业大学 | Wisdom agricultural thing networking platform |
WO2023097929A1 (en) * | 2021-12-01 | 2023-06-08 | 浙江师范大学 | Knowledge graph recommendation method and system based on improved kgat model |
CN116911963A (en) * | 2023-09-14 | 2023-10-20 | 南京龟兔赛跑软件研究院有限公司 | Data-driven pesticide byproduct transaction management method and cloud platform |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20150178282A1 (en) * | 2013-12-23 | 2015-06-25 | Yahoo! Inc. | Fast and dynamic targeting of users with engaging content |
CN105279264A (en) * | 2015-10-26 | 2016-01-27 | 深圳市智搜信息技术有限公司 | Semantic relevancy calculation method of document |
CN108804521A (en) * | 2018-04-27 | 2018-11-13 | 南京柯基数据科技有限公司 | A kind of answering method and agricultural encyclopaedia question answering system of knowledge based collection of illustrative plates |
CN109871451A (en) * | 2019-01-25 | 2019-06-11 | 中译语通科技股份有限公司 | A kind of Relation extraction method and system incorporating dynamic term vector |
CN110209839A (en) * | 2019-06-18 | 2019-09-06 | 卓尔智联(武汉)研究院有限公司 | Agricultural knowledge map construction device, method and computer readable storage medium |
CN110555084A (en) * | 2019-08-26 | 2019-12-10 | 电子科技大学 | remote supervision relation classification method based on PCNN and multi-layer attention |
US20210391080A1 (en) * | 2018-12-29 | 2021-12-16 | New H3C Big Data Technologies Co., Ltd. | Entity Semantic Relation Classification |
Non-Patent Citations (5)
Title |
---|
KAI ZHANG et al.: "Chinese Agricultural Entity Relation Extraction via Deep Learning", INTELLIGENT COMPUTING METHODOLOGIES, pages 528 *
LYU Yilin; TIAN Hongtao; GAO Jianwei; WAN Huaiyu: "Relation Extraction Combining Encyclopedia Knowledge and Sentence Semantic Features", Computer Science, no. 1, pages 50-54 *
XIA Chuan: "Research on Entity Relation Extraction in the Field of Crop Diseases and Pests Based on Deep Learning", China Masters' Theses Full-text Database, Agricultural Science and Technology, no. 05, pages 046-7 *
ZHANG Weiru et al.: "Entity Relation Extraction Based on Wikipedia and Pattern Clustering", Chinese Information Processing Society of China, Frontiers of Chinese Computational Linguistics (2009-2011), pages 421-426 *
ZHU Suyang; HUI Haotian; QIAN Longhua; ZHANG Min: "Wikipedia Family Relation Extraction Based on Self-supervised Learning", Journal of Computer Applications, no. 04, pages 115-118 *
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110222160B (en) | Intelligent semantic document recommendation method and device and computer readable storage medium | |
CN106874378B (en) | Method for constructing knowledge graph based on entity extraction and relation mining of rule model | |
CN112199508B (en) | Parameter self-adaptive agricultural knowledge graph recommendation method based on remote supervision | |
CN111209738B (en) | Multi-task named entity recognition method combining text classification | |
US10783451B2 (en) | Ensemble machine learning for structured and unstructured data | |
KR101203345B1 (en) | Method and system for classifying display pages using summaries | |
CN112100344A (en) | Financial field knowledge question-answering method based on knowledge graph | |
CN107301199A (en) | A kind of data label generation method and device | |
CN110516074B (en) | Website theme classification method and device based on deep learning | |
CN114896388A (en) | Hierarchical multi-label text classification method based on mixed attention | |
CN111582506A (en) | Multi-label learning method based on global and local label relation | |
CN113269477B (en) | Scientific research project query scoring model training method, query method and device | |
CN112862569B (en) | Product appearance style evaluation method and system based on image and text multi-modal data | |
CN103049454B (en) | A kind of Chinese and English Search Results visualization system based on many labelings | |
CN112445862B (en) | Internet of things equipment data set construction method and device, electronic equipment and storage medium | |
CN112100395B (en) | Expert cooperation feasibility analysis method | |
CN111753151A (en) | Service recommendation method based on internet user behaviors | |
CN112613318B (en) | Entity name normalization system, method thereof and computer readable medium | |
CN113516202A (en) | Webpage accurate classification method for CBL feature extraction and denoising | |
CN113779387A (en) | Industry recommendation method and system based on knowledge graph | |
Maladkar | Content based hierarchical URL classification with Convolutional Neural Networks | |
RIZVI | A Systematic Overview on Data Mining: concepts and techniques | |
Ghosh et al. | Understanding Machine Learning | |
CN107341169B (en) | Large-scale software information station label recommendation method based on information retrieval | |
Chebil et al. | Clustering social media data for marketing strategies: Literature review using topic modelling techniques |
Legal Events
Date | Code | Title | Description
---|---|---|---
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||