CN112199508A - Parameter adaptive agricultural knowledge graph recommendation method based on remote supervision - Google Patents


Info

Publication number
CN112199508A
CN112199508A
Authority
CN
China
Prior art keywords
data
entity
agricultural
text
relation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010794151.4A
Other languages
Chinese (zh)
Other versions
CN112199508B (en)
Inventor
周泓
万瑾
朱全银
孙强
倪金霆
陈凌云
季睿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huaiyin Institute of Technology
Original Assignee
Huaiyin Institute of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huaiyin Institute of Technology filed Critical Huaiyin Institute of Technology
Priority to CN202010794151.4A priority Critical patent/CN112199508B/en
Publication of CN112199508A publication Critical patent/CN112199508A/en
Application granted granted Critical
Publication of CN112199508B publication Critical patent/CN112199508B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/02 Agriculture; Fishing; Mining

Abstract

The invention discloses a parameter-adaptive agricultural knowledge graph recommendation method based on remote supervision, comprising the following steps: text data are crawled with the Scrapy crawler framework and preprocessed, and a predicted text classification data set Predict_data is obtained with a KNN classifier. When the Chinese corpus of crops is processed, the predicted entity classification results are mapped into the Wikipedia Chinese corpus to construct a Chinese entity dictionary. A parameter-adaptive optimum-searching neural network model based on an improved remote supervision algorithm is built; the model adaptively searches for the parameters that make relation extraction best, realizes automatic labelling of text data, and obtains the relations between entities. The method can improve the accuracy of relation extraction and, by exploiting agricultural text information, provides effective information screening for plant cultivation enthusiasts.

Description

Parameter adaptive agricultural knowledge graph recommendation method based on remote supervision
Technical Field
The invention belongs to the technical field of knowledge graphs and neural networks, and particularly relates to a parameter-adaptive agricultural knowledge graph recommendation method based on remote supervision.
Background
The agricultural knowledge graph combines characteristics of agriculture such as regionality, climate dependence and diversity of products, and uses the entity relations and concepts of the agricultural field to build an intelligent auxiliary system that mines potential agricultural value. Compared with the traditional agricultural information query mode, it combines visualization technology with an agricultural knowledge base to display and analyze the retrieved data, a new development in this field. Accordingly, the agricultural knowledge graph service system provided by the invention can analyze the environment and climate suitable for crop growth using data from an agricultural knowledge service system, the agricultural interactive encyclopedia and Wikipedia, provides effective assistance to agricultural research institutes and plant cultivation enthusiasts, and quickly acquires the required information from an internet of exploding big data.
The existing research bases of Zhu Quanyin et al. include: Classification and extraction algorithm of Web science and technology news [J]. Journal of Huaiyin Institute of Technology, 2015, 24(5): 18-24; Li Xiang, Zhu Quanyin. Collaborative filtering recommendation with collaborative clustering and a shared scoring matrix [J]. Computer Science and Exploration, 2014, 8(6): 751-; Quanyin Zhu, Suqun Cao. A Novel Classifier-independent Feature Selection Algorithm for Imbalanced Datasets. 2009, p: 77-82; Quanyin Zhu, Yunyang Yan, Jin Ding, Jin Qian. The Case Study for Price Extraction of Mobile Phone Sell Online. 2011, p: 282-285; Quanyin Zhu, Suqun Cao, Pei Zhou, Yunyang Yan, Hong Zhou. Integrated Price Forecast based on Dichotomy Backfilling and Disturbance Factor Algorithm. International Review on Computers and Software, 2011, Vol. 6(6): 1089-; Li Xiang, Zhu Quanyin, Hu Ronglin, Zhou Hong. A cold-chain logistics stowage intelligent recommendation method based on spectral clustering. Chinese patent publication No. CN105654267A, 2016.06.08; Cao Suqun, Zhu Quanyin, Zuo Xiaoming, Gao Shangbing, et al. A feature selection method for pattern classification. Chinese patent publication No. CN103425994A, 2013.12.04; Liu Jinling, Feng Wanli, Zhang Yahong. Chinese text clustering method based on rescaling [J]. Computer Engineering and Applications, 2012, 48(21): 146-; Zhu Quanyin, Xin Cheng, Li Xiang, Xu Kang, et al. A network behavior habit clustering method based on K-means and LDA two-way verification. Chinese patent publication No. CN106202480A, 2016.12.07.
the traditional knowledge graph construction method relates to agricultural knowledge and relation extraction, and aims at the problems that: huihong remote supervision relation extraction method and device, chinese patent publication no: CN110209836A,2019.5.17, belonging to the application of remote supervision algorithm, aiming at generating an entity recognition training data set through bootstrap algorithm and recognizing the entity of a sentence through crf + + tool; generating an entity relation extraction training data set through a remote supervision method, and generating an entity relation extraction data set through a relation knowledge base and a natural language corpus; the method can automatically label training data through natural corpus to complete entity recognition and entity relationship extraction; sun encourage, an assistant diagnosis and treatment system based on knowledge map, chinese patent publication no: CN110459320A,2019.11.15, belonging to the field of medical diagnosis and treatment, aiming at defining the patient status between two successive medical operations as the side; the system comprises a patient information processing module, a diagnosis and treatment scheme pushing module, a patient information processing module and a diagnosis and treatment decision module, wherein the patient information processing module is used for receiving patient information, extracting historical medical operation and patient state information, sending the historical medical operation and the patient state information to the diagnosis and treatment scheme pushing module, matching the patient information with a knowledge graph, determining the position of the current state of a patient in the knowledge graph, pushing a medical index to be detected and/or next diagnosis and treatment operation based on the knowledge graph, quickly knowing the diagnosis and treatment stage of the patient, and giving a next; the Chinese patent publication No. 
is: CN110400327A,2019.11.1 belongs to the field of crop image segmentation, and aims to realize self-adaptive adjustment of PCNN model parameters in nighttime image segmentation of tomato plants, reduce PCNN iteration times and improve the real-time performance of algorithm application. However, at present, a system and a method for adopting a parameter adaptive optimization model combined with a neural network to identify entities and extract relationships in the agricultural field, construct a knowledge graph in the agricultural field and make an auxiliary decision do not exist.
Screening algorithm based on heuristic rule:
the information filtering technology is used for facilitating a user to find information which is interested by the user more quickly, and the information filtering technology can solve the problem. Information filtering is generally used to process a large amount of text information and to filter out objectionable information in a targeted manner. The rule is a knowledge representation method, the modification or replacement of the rule does not affect other rules, a plurality of domain knowledge of different categories are stored in the rule base, and a series of reasoning predictions can be completed by using the obtained rule to finally obtain the category. At present, information screening is carried out at home and abroad based on a heuristic rule screening method through keywords.
Remote supervision algorithm:
the remote supervision algorithm is based on a labeled artificial knowledge graph, relation labels are labeled on sentences in an external document, and the algorithm is also a semi-supervision algorithm. Firstly, based on the fact that the entities related to crops in the sentences are extracted in the training stage, and the two entities are in a corresponding relation in the corpus, the texts in the test set are considered to express the entitiesClass relationships. And the extracted text features are spliced and expressed as a word vector, and the word vector is used as a feature vector of the texts. Aiming at the system, the proposed scheme is as follows: the existing triple correspondence is mapped to a massive unstructured database to generate a large amount of training data, and the knowledge sources are diversified, such as manual labeling, the existing knowledge base, a specific statement structure and the like. For example: set data set X ═ X1,x2,x3,…,xnAccording to the relation h1Mapping the data set X to a space A where A ═ A1,A2,A3,…,AMIs then passed through the relationship h2Mapping space a to space K ═ K1,K2,K3,…,Kr}。
Based on a PCNN neural network model algorithm:
traditional vocabulary characteristics comprise characteristics such as partial entities, word sequences among agricultural product entities, hypernyms of words and the like, and the characteristics depend on manual processing characteristic processes. Characteristics of the lexical level: conversion to word vectors and the functions represented at the lexical level using the word vectors. Features at the syntactic level: and considering the context characteristics, setting a sliding window K, sliding back to a lattice after finishing reading every K characters, and finally obtaining a group of sentence-level characteristics including word characteristics and position characteristics. The PCNN algorithm makes an improvement over the CNN algorithm in terms of the pooling layer: and dividing the statement into k sections according to the position of the entity pair, performing maximum pooling operation on each pair independently, obtaining a maximum value in each section, and finally forming the maximum values into feature vectors.
Disclosure of Invention
The purpose of the invention is as follows: aiming at the problems in the prior art, the invention provides a parameter-adaptive agricultural knowledge graph recommendation method based on remote supervision, which analyzes the environment and climate suitable for crop growth through data in an agricultural knowledge service system, the agricultural interactive encyclopedia and Wikipedia, provides effective assistance to agricultural research institutes and plant cultivation enthusiasts, and quickly acquires the needed information from an internet of exploding big data.
The technical scheme is as follows: in order to solve the above technical problems, the invention provides a parameter-adaptive agricultural knowledge graph recommendation method based on remote supervision, comprising the following specific steps:
(1) carrying out data preprocessing on the data crawled from the agricultural knowledge service system, and defining the obtained data set as Agri_data; crawling the agricultural interactive encyclopedia with Scrapy, and defining the crawled data set as HuDong_data; performing Chinese word segmentation and word vector training on the text data in Agri_data and HuDong_data, and defining the obtained data set as Train_data;
(2) training a KNN algorithm model with the data set Train_data, performing feature extraction on the text data with the fastText classification tool, and performing text similarity comparison with the cosine similarity algorithm to obtain a text entity classification T;
(3) utilizing the KNN algorithm model of step (2) to predict the entity classification results, storing them in the Predict_data set, and mapping the entities in Predict_data to the entity data in the Wikidata entity-relation data set Result_data to obtain the data set Train_data;
(4) constructing an entity dictionary over the Wikipedia Chinese word stock with a heuristic rule screening algorithm, and preprocessing the Filter_Wtrain_data text data to obtain the wikidataRelation data set;
(5) respectively building PCNN, CNN, RNN and BiRNN neural network models;
(6) comparing the four algorithm models to obtain a relation extraction model M for parameter self-adaption optimization in the field of agricultural knowledge maps;
(7) extracting the relations between entities from text data in the agricultural field, rendering the entity-relation data through ECharts, and displaying the recommendation results on the web end.
Further, the specific steps of the data set Train _ data obtained in the step (1) are as follows:
(1.1) performing data crawler and selecting a crawler page;
(1.2) selecting a page;
(1.3) selecting an agricultural knowledge service system;
(1.4) crawling the agricultural knowledge service system: obtaining the system's html files, limiting the crawling range with the front-end div, and obtaining the crop name Title, detailed content Detail, photos ImageList and web page link Url; these form the data item Aitem = {Title, Detail, ImageList, Url} and the data set Agri_data = {Aitem1, Aitem2, …, Aitemn}; executing step (1.8);
(1.5) selecting an agricultural interactive encyclopedia;
(1.6) crawling the content of the agricultural interactive encyclopedia with Scrapy: declaring the address domain of the crawler, acquiring the word list, constructing the original json file, generating the Url list, acquiring the crop Title, and crawling the pictures ImageList and the open-domain label list openTypeList;
(1.7) each crawled entity corresponds to one entry of the agricultural encyclopedia; the entries comprise a Title, the interactive-encyclopedia link Url, pictures ImageList, an open classification list TypeList, detailed information InfoList and basic information ValueList; these constitute the data item Hitem = {Title, Url, ImageList, TypeList, InfoList, ValueList} and the data set HuDong_data = {Hitem1, Hitem2, …, Hitemn};
(1.8) acquiring two types of database sets Agri _ data and HuDong _ data;
(1.9) performing part-of-speech screening on data in the two data sets Agri _ data and HuDong _ data;
(1.10) discarding words containing non-Chinese and English or numeric characters;
(1.11) performing Chinese word segmentation and word vector training on two database sets Agri _ data and HuDong _ data respectively;
(1.12) obtaining a data set Train _ data.
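The token screening of steps (1.9)–(1.10) can be sketched as follows; the regular expression and function name are assumptions that match the stated rule (keep only tokens made entirely of Chinese or English characters):

```python
import re

# Keep only tokens consisting entirely of Chinese characters or English
# letters; tokens containing digits or other symbols are discarded.
VALID_TOKEN = re.compile(r"^[A-Za-z\u4e00-\u9fff]+$")

def clean_tokens(tokens):
    return [t for t in tokens if VALID_TOKEN.match(t)]
```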
Further, the specific steps of obtaining the text entity classification T in the step (2) are as follows:
(2.1) transmitting the text data set Train_data into the KNN text classifier; defining the mean of each component as Mean, the variance of each component as Var, the inverse text frequency index as Text_IDF and the number of texts as Item_Num; defining the similarity weights as Weight = {Title: 0.2, TypeList: 0.2, Detail: 0.2, InfoList: 0.2, ValueList: 0.2};
(2.2) for each Item, for each of the 5 attributes in Weight that is present, adding 1 to that attribute's IDF value;
(2.3) returning, for every pair of 2 Items, the similarity of their Titles, defined as Title_sim; the similarity of their open classification lists TypeList, defined as TypeList_sim; the similarity of their Detail content, defined as Detail_sim; the similarity of their InfoList, defined as InfoList_sim; and the similarity of their ValueList, defined as ValueList_sim; together the 5 attribute similarities form Dsim;
(2.4) linearly weighting the obtained similarities Dsim, the result being defined as Simi;
(2.5) storing the attribute similarities of each Item in a temporary table CurList, computing the variance and mean of each component, applying Gaussian normalization to the Title and TypeList similarities, and assigning the average similarity to similarity values that do not appear;
(2.6) computing the weighted sum of each Item's similarities, defined as Count_sim; sorting by Count_sim and grouping the top k Items into one class;
and (2.7) obtaining the classification T of the text entity.
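Steps (2.3)–(2.6) can be sketched as a weighted-similarity KNN vote; the equal 0.2 weights follow step (2.1), while the helper names and data are illustrative assumptions:

```python
from collections import Counter

# Equal attribute weights, as defined in step (2.1).
WEIGHT = {"Title": 0.2, "TypeList": 0.2, "Detail": 0.2,
          "InfoList": 0.2, "ValueList": 0.2}

def count_sim(attr_sims):
    """Weighted sum of the per-attribute similarities (Count_sim)."""
    return sum(WEIGHT[a] * s for a, s in attr_sims.items())

def knn_classify(query_sims, labels, k=3):
    """Rank training items by Count_sim and let the top k vote on the class.

    query_sims: one per-attribute similarity dict per training item.
    labels:     the class label of each training item.
    """
    scored = sorted(zip(map(count_sim, query_sims), labels), reverse=True)
    return Counter(lab for _, lab in scored[:k]).most_common(1)[0][0]
```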
Further, the specific steps of obtaining the data set Train _ data in the step (3) are as follows:
(3.1) storing the predicted entity classification results obtained from the KNN algorithm in Predict_data;
(3.2) crawling all the relations and corresponding Chinese names under the Wikipedia web pages with Scrapy, storing them in json format;
(3.3) the crawled content comprises the relation id rid, the attribute rtype to which the relation belongs, the subclass statement to which it belongs and the corresponding link rlink, stored in a relation json file; the data sample is defined as Ritem = {rid, rtype, statement, rlink} and the data set as Relation = {Ritem1, Ritem2, …, Ritemn}; the relation id cid and the Chinese representation of the relation, chrelation, are stored in a second json file, with the data sample Mitem = {cid, chrelation} and the data set Chrelation = {Mitem1, Mitem2, …, Mitemn};
(3.4) merging the data in the relation data set relation.json with the Chinese-relation json file, defining a data set result.json, and storing the result in the result.json file;
(3.5) defining an entities.json database, searching Wikipedia for the data in Predict_data, returning the json content, and storing it in the entities.json file;
(3.6) Wikidata is an open knowledge base; crawling the description on each entity's Wikidata entity page and the corresponding relations associated with the entity, defining a wikidataRelation.json file and storing the result in it; defining the data sample Witem = {entity1, relation, entity2} and the data set WikidataRelation = {Witem1, Witem2, …, Witemn};
(3.7) processing the data in the WikidataRelation.json database into csv file format, matching the HuDong_data data from the agricultural interactive encyclopedia against the Predict_data database to obtain the node.csv file, defining the data sample Nitem = {Title, Label} and the data set Node = {Nitem1, Nitem2, …, Nitemn};
(3.8) converting the Wiki Chinese encyclopedia corpus from traditional to simplified Chinese and removing the line-break symbols within sentences;
(3.9) selecting a training set related to agriculture, choosing the relations "instance of", "taxon rank", "subclass of" and "parent taxon";
(3.10) preloading the entity list (selecting entities in the following categories: 5: Animal, 6: Plant, 7: Chemicals, 9: Foods, 10: Diseases, 12: Nutrients, 13: Biochemistry, 14: Agricultural examples, 15: Technology);
(3.11) storing the agriculture-related sentences in a FileRead file, aligning the triple relations obtained in wikidataRelation.json to the Chinese Wikipedia corpus, and defining the training-set corpus Wtrain_data;
(3.12) loading the entity-to-label mapping dictionary, performing part-of-speech screening on the words in each sentence, and reading entity categories consistent with the categories in Predict_data;
(3.13) filtering out samples with null attribute relations from the training set obtained by aligning the Wikidata data set, obtaining Filter_Wtrain_data; defining the data sample Fitem = {entity1id, entity1, entity2id, entity2, statement, relation} and the data set Filter_Wtrain_data = {Fitem1, Fitem2, …, Fitemn}.
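The null-relation filtering of step (3.13) amounts to dropping aligned samples whose relation field is empty; a minimal sketch with illustrative field names mirroring the Fitem sample:

```python
def filter_null_relations(samples):
    """Keep only aligned samples that carry a non-empty relation."""
    return [s for s in samples if s.get("relation")]
```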
Further, the specific steps of obtaining the wikidataRelation data set in the step (4) are as follows:
(4.1) screening out the entities consisting entirely of Chinese characters with a regular expression and converting them into dictionary format, the first element of each line in the Filter_Wtrain_data data set being an entity;
(4.2) acquiring the sentence set according to the entity dictionary and storing it in list format; preprocessing the sentences by removing all characters other than Chinese characters and common Chinese punctuation, and splitting the sentences;
(4.3) traversing each sentence by entity, storing all entities in the sub-sentences according to the character-matching rule, filtering out sentences with no entity or only one entity, and processing the data into the format [[sentence, [entity1, …]], …];
(4.4) carrying out Chinese word segmentation with the jieba library and splitting the Chinese sentences; defining sentence as the text data, sentence_seg as the segmented text and entity1 as a text entity, the data being processed into the format [[sentence, [entity1, …], [sentence_seg]], …];
(4.5) training the word vector;
(4.6) re-screening the entities by the sentence after word segmentation;
(4.7) checking whether each entity appears in the segmented sentence; if not, executing step (4.10);
(4.8) combining the entity set pairwise, the data being processed into the format [[sentence, entity1, entity2, [sentence_seg]], …], so that one sentence can yield several samples;
(4.9) dividing the data in the wikidataRelation data set into a training set and a testing set according to the ratio of 3: 1;
(4.10) removing the entity and the text.
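Steps (4.6)–(4.8) — re-screening entities against the segmented sentence and pairing them so one sentence yields several samples — can be sketched as:

```python
from itertools import combinations

def make_samples(sentence, entities, sentence_seg):
    """Keep entities that survive word segmentation, then emit one sample
    per entity pair: [sentence, entity1, entity2, [sentence_seg]]."""
    kept = [e for e in entities if e in sentence_seg]
    return [[sentence, e1, e2, sentence_seg]
            for e1, e2 in combinations(kept, 2)]
```

A sentence with fewer than two surviving entities yields no sample, matching the filtering of step (4.3).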
Further, the specific steps of respectively building the PCNN, CNN, RNN, BiRNN neural network models in the step (5) are as follows:
(5.1) building the artificial neural network: in the embedding layer, defining the word mapping function as Word_embedding, the word embedding vector size as Word_embedding_dim = 50, the position-feature embedding vector size as Position_embedding = 5 and the maximum sentence length as 120; setting the Word_Position_embedding function to combine the two embedding results;
(5.2) defining the loss functions as softmax cross entropy and sigmoid cross entropy; for each sample, the softmax or sigmoid layer yields a probability distribution over the predicted relations, and the maximum prediction is taken as the entity prediction result for computing the cross-entropy loss;
(5.3) setting drop_out = 0.5; computing the maximum value of the elements along the tensor dimension, divided into three segments for max pooling so that each convolution kernel yields a 3-dimensional vector; inputting the result of the pooling layer into a normalization layer and applying the tanh activation function for the non-linearity;
(5.4) defining each bag as a set of instances: if a bag in the training set is positive, the number of positive instances in it is greater than or equal to 1; if negative, its instances are all negative;
(5.5) adding an attention mechanism on each bag;
(5.6) checking whether training has finished; if so, executing step (5.14);
(5.7) determining whether to apply dropout;
(5.8) defining an entity list, denoted bag_pre, and defining the attention-based logit value as attention_logit;
(5.9) defining a variable i for traversing in the local scope, defining the scope as scope;
(5.10) if i > scope.shape[0], performing step (5.13);
(5.11) calculating an attention value of the softmax loss function;
(5.12) i = i + 1, performing step (5.10);
(5.13) storing the obtained rank vector in a bag _ pre entity list, and executing the step (5.19);
(5.14) finishing training and starting testing;
(5.15) defining a variable i1 for traversing the local scope;
(5.16) if i1 > scope.shape[0], executing step (5.19);
(5.17) computing the sigmoid activation value, and storing the obtained entity value in the logistic-regression list bag_logit;
(5.18) i1 = i1 + 1, going to step (5.16);
(5.19) defining the four functions pcnn, cnn, rnn and birnn; defining the hidden size as hidden_size = 230, the convolution kernel size as 3 and the stride as 1; using the ReLU activation function and setting a local variable count;
(5.20) training, and building a pcnn, cnn, rnn and birnn neural network model;
(5.21) if count > n, executing step (5.28);
(5.22) decomposing the sentence into words in turn; each word is mapped to a dw-dimensional vector called a word embedding, and the embedding vectors are learned through model training;
(5.23) using position features to indicate the positions of entity1 and entity2 in the sentence; each word has two relative positions, one per entity, which are mapped to separate dp-dimensional vectors;
(5.24) concatenating the word embeddings with the two relative-position embeddings of step (5.23) to obtain the matrix M ∈ R^(h×s) as the input representation, where s = dw + 2·dp;
(5.25) setting W ∈ R^(wc×s) as a convolution matrix, where wc is the convolution window width; by sliding the convolution window down the sentence and applying the filter at each valid position, a feature map c = [c1, c2, …, c(h−wc+1)] is generated, extracting n features from the sentence;
(5.26) repeating the above process with a different W matrix;
(5.27) count +1, performing step (5.21);
(5.28) piecewise max-pooling is then used to select the maximum activation value in each feature map;
(5.29) dividing each feature map ci into three segments {ci1, ci2, ci3} based on the positions of the two entities; when the piecewise max pooling is complete, the results of all feature maps are concatenated into the vector p ∈ R^(3n) as the feature representation of the sentence;
(5.30) building four neural network models of the step (5.20).
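The input representation of steps (5.22)–(5.24) can be sketched in NumPy; the dimensions follow the section (dw = 50, dp = 5, maximum length 120), while the randomly initialized embedding tables are stand-ins for the learned ones:

```python
import numpy as np

# dw-dim word embeddings plus two dp-dim position embeddings per word,
# giving rows of s = dw + 2*dp columns (50 + 2*5 = 60 here).
dw, dp, vocab, max_dist = 50, 5, 100, 120
rng = np.random.default_rng(0)
word_emb = rng.normal(size=(vocab, dw))
pos_emb = rng.normal(size=(2 * max_dist, dp))   # offsets shifted by max_dist

def sentence_matrix(word_ids, e1_pos, e2_pos):
    """Concatenate word and relative-position embeddings per word."""
    rows = []
    for i, w in enumerate(word_ids):
        p1 = pos_emb[i - e1_pos + max_dist]     # position relative to entity1
        p2 = pos_emb[i - e2_pos + max_dist]     # position relative to entity2
        rows.append(np.concatenate([word_emb[w], p1, p2]))
    return np.stack(rows)                       # shape (h, dw + 2*dp)
```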
Further, the specific steps of obtaining the relation extraction model M for parameter adaptive optimization in the field of agricultural knowledge graph in the step (6) are as follows:
(6.1) carrying out parameter self-adaptive optimization on a link coefficient a, an initial threshold b and an attenuation coefficient c of the PCNN neural network model;
(6.2) determining relatively fixed values for the link coefficient, initial threshold and attenuation coefficient: first searching within the initial parameter range with a larger search step, and finding the optimal model parameters at that step;
(6.3) if a plurality of parameter combinations exist and the optimal neural network model is achieved at the same time, executing the step (6.10);
(6.4) selecting the combination with the smaller link coefficient as the optimizing result;
(6.5) if step is less than 0.0005, executing step (6.10);
(6.6) searching for optimal parameters by the grid;
(6.7) finishing the traversal of the parameters; if not finished, executing step (6.6);
(6.8) obtaining the highest accuracy rate by the link coefficient, the initial threshold value and the attenuation coefficient of the optimal PCNN neural network model;
(6.9) updating the search step size and parameter ranges according to the adaptive-optimization formulas (given as images, Figure BDA0002624901620000101 and Figure BDA0002624901620000102, in the original publication); performing step (6.5);
and (6.10) obtaining an optimal parameter model of the neural network.
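The coarse-to-fine search of steps (6.2)–(6.9) can be sketched as a grid search over (link coefficient a, initial threshold b, attenuation coefficient c) whose step size shrinks until it drops below 0.0005; the quadratic objective is a stand-in for the PCNN validation accuracy, and the step-halving schedule is an assumption, since the patent's update formulas are given only as images:

```python
import itertools
import numpy as np

def objective(a, b, c):
    # Stand-in for the validation accuracy of the PCNN model.
    return -((a - 0.3) ** 2 + (b - 0.5) ** 2 + (c - 0.1) ** 2)

def adaptive_search(step=0.1, min_step=0.0005, lo=0.0, hi=1.0):
    """Coarse-to-fine grid search: scan a small grid around the current
    best point, then halve the step until it falls below min_step."""
    best = (0.5, 0.5, 0.5)
    while step >= min_step:
        axes = [np.clip(np.arange(c - 2 * step, c + 2 * step + 1e-12, step),
                        lo, hi)
                for c in best]
        best = max(itertools.product(*axes), key=lambda p: objective(*p))
        step /= 2
    return best
```

Each refinement re-centers the grid on the best point found so far, so far fewer evaluations are needed than a single fine-grained sweep of the whole range.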
Further, the specific steps of step (7) — extracting the relations between entities from agricultural text data, rendering the entity-relation data through ECharts, and displaying the recommendation results on the web end — are as follows:
(7.1) obtaining the parameter-optimal PCNN neural network model, and extracting two entities entity1 and entity2 and the relation between them from the agricultural text data;
(7.2) mapping the triple relation based on a built small triple knowledge graph library wikidataRelation;
(7.3) performing entity extraction on the sentences in the wiki corpus by using the algorithm model, and mapping the entities into a wikidataRelation database to realize the automatic labeling function in the remote supervision algorithm;
(7.4) inputting agricultural text data, screening the entities in the text data, and extracting the relations between the entities;
(7.5) importing data in an agricultural knowledge service system, an agricultural interactive encyclopedia and a Wikipedia into a neo4j database;
(7.6) checking whether the searched entity exists in the database; if not, executing step (7.9);
(7.7) displaying the search result on the web side in graph form by using a Cypher statement;
(7.8) packaging a python interface, displaying the data with the web framework Django, and executing step (7.10);
(7.9) displaying the absence of the entity;
(7.10) entering the text of an agricultural knowledge question in the task box, and applying Chinese word segmentation to the text to obtain its entities;
(7.11) searching the database by using a Cypher statement;
(7.12) checking whether the answer to the question exists in the database; if not, executing step (7.9);
(7.13) extracting the relation between the entities from the text data in the agricultural field, rendering the entity relation data through Echarts, and displaying the recommendation result on a web end.
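Echarts draws a relation graph from a nodes/links structure. The data handoff implied by steps (7.7)–(7.13) can be sketched as below, with illustrative function and field names (the Cypher query against neo4j and the Django view are omitted): extracted (entity1, relation, entity2) triples are converted into the JSON an Echarts graph series can render.

```python
import json

def triples_to_echarts(triples):
    """Convert (entity1, relation, entity2) triples into the
    {"nodes": [...], "links": [...]} structure an Echarts graph
    series expects; each distinct entity becomes one node."""
    nodes, links = {}, []
    for e1, rel, e2 in triples:
        for name in (e1, e2):
            nodes.setdefault(name, {"name": name})
        links.append({"source": e1, "target": e2, "value": rel})
    return json.dumps({"nodes": list(nodes.values()), "links": links},
                      ensure_ascii=False)
```

The returned string would be embedded in the page (or served by a view) and passed to `setOption` of an Echarts instance on the web side.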
By adopting the technical scheme, the invention has the following beneficial effects:
According to the method, unstructured text data in the agricultural field are crawled with the Scrapy crawler framework, entity recognition is performed with the KNN algorithm, relation extraction is performed with a parameter-adaptive PCNN neural network model, and triple relations are constructed; compared with a traditional convolutional neural network, this model can extract more sentence features. The method replaces the traditional practice of setting neural network parameter values by experience with a parameter-adaptive optimization technique, improving the accuracy of the model's relation extraction.
Drawings
FIG. 1 is a general flow diagram of the present invention;
FIG. 2 is a flow diagram of pre-processing data crawled by the agricultural knowledge service system and agricultural interactive encyclopedia in an embodiment;
FIG. 3 is a flowchart of a KNN algorithm model construction and text similarity comparison in a specific embodiment;
FIG. 4 is a flowchart of storing the classification results of predicted entities in Predict_data and mapping the entities in the Predict_data dataset with the entity data in the wiki entity relationship dataset Result_data to obtain the dataset Train_data in the embodiment;
FIG. 5 is a flowchart of the embodiment in which an entity dictionary in a Wikipedia Chinese lexicon is constructed by replacing an entity attribute relationship according to rules, and text data is preprocessed;
FIG. 6 is a flowchart of building PCNN, CNN, RNN, BiRNN neural network models, respectively, in a specific embodiment;
FIG. 7 is a flowchart of comparing the four algorithm models to obtain the relation extraction model M with parameter-adaptive optimization in the agricultural knowledge graph domain in the embodiment;
FIG. 8 is a flow chart of extracting relationships between entities on textual data in the agricultural domain and establishing an agricultural knowledge graph for use in aiding decision making in an embodiment.
Detailed Description
The present invention is further illustrated by the following specific examples in conjunction with the accompanying drawings. It should be understood that these examples are intended only to illustrate the invention and not to limit its scope, which is defined by the appended claims; upon reading the present disclosure, those skilled in the art may make modifications in various equivalent forms that likewise fall within that scope.
As shown in fig. 1 to 8, the parameter adaptive agricultural knowledge graph recommendation method based on remote supervision according to the present invention includes the following steps:
step 1: perform data preprocessing on the data crawled from the agricultural knowledge service system and define the obtained data set as Agri_data; crawl the agricultural interactive encyclopedia with Scrapy and define the crawled data set as HuDong_data; perform Chinese word segmentation and word vector training on the text data of Agri_data and HuDong_data and define the obtained data set as Train_data. The specific method is as follows:
step 1.1: performing data crawler and selecting a crawler page;
step 1.2: selecting a page;
step 1.3: selecting an agricultural knowledge service system;
step 1.4: crawling the agricultural knowledge service system, obtaining its html files, limiting the crawling range with the front-end div, and obtaining the crop name Title, detailed content Detail, photo ImageList and web page link Url; the data item Aitem = {Title, Detail, ImageList, Url} is formed, and the data set Agri_data = {Aitem1, Aitem2, …, Aitemn}; executing step 1.8;
step 1.5: selecting an agricultural interactive encyclopedia;
step 1.6: crawling the content of the agricultural interactive encyclopedia with Scrapy, declaring the crawler's address domain, acquiring the word list, constructing the original json file, generating the Url list, acquiring the crop Title, and crawling the picture ImageList and the open-domain label openTypeList;
step 1.7: each entity obtained by crawling corresponds to one entry of the agricultural encyclopedia, an entry comprising a Title, an interactive-encyclopedia link Url, a picture ImageList, an open classification list TypeList, detailed information InfoList and basic information ValueList; the data item Hitem = {Title, Url, ImageList, TypeList, InfoList, ValueList} is formed, and the data set HuDong_data = {Hitem1, Hitem2, …, Hitemn};
Step 1.8: acquiring two types of database sets Agri _ data and HuDong _ data;
step 1.9: performing part-of-speech screening on data in the two data sets Agri _ data and HuDong _ data;
step 1.10: discarding words containing non-Chinese and English or numbers;
step 1.11: performing Chinese word segmentation and word vector training on two database sets Agri _ data and HuDong _ data respectively;
step 1.12: a data set Train _ data is obtained.
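The token screening of steps 1.9–1.10 can be sketched as follows, under the assumption that the rule is "discard any word containing a character that is neither a Chinese character nor an English letter, including digits"; the function name and regular expression are illustrative, and the part-of-speech filter of step 1.9 is not shown.

```python
import re

# Keep tokens composed solely of Chinese characters and/or English letters;
# anything containing digits or other symbols is discarded (cf. step 1.10).
_VALID_TOKEN = re.compile(r'^[\u4e00-\u9fa5A-Za-z]+$')

def screen_tokens(tokens):
    return [t for t in tokens if _VALID_TOKEN.match(t)]
```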
Step 2: training a KNN algorithm model with the data set Train_data, performing feature extraction on the text data with the fastText classification tool, and comparing text similarity with the cosine similarity algorithm to obtain the text entity classification T; the specific method is as follows:
step 2.1: transmitting the text data set Train_data into the KNN text classifier; defining the mean of each component as Mean, the variance of each component as Var, the inverse text frequency index as Text_IDF, and the number of texts as Item_Num.
Defining the similarity weight as Weight{Title, TypeList, Detail, InfoList, ValueList} = {0.2, 0.2, 0.2, 0.2, 0.2};
step 2.2: if each Item has 5 attributes in Weight, the IDF value of each attribute is added with 1;
step 2.3: for each pair of 2 items, returning the similarity of each attribute: the Title similarity, defined as Title_sim; the TypeList similarity, defined as TypeList_sim; the Detail similarity, defined as Detail_sim; the InfoList similarity, defined as InfoList_sim; and the ValueList similarity, defined as ValueList_sim; averaging the similarities of the 5 attributes and defining the result as Dsim;
step 2.4: the obtained similarity Dsim is linearly weighted and defined as Simi;
step 2.5: storing the similarity of each attribute of the Item in a temporary table CurList, calculating the variance and mean of each component, performing Gaussian normalization on the Title and TypeList similarities, and assigning the average similarity to attributes whose similarity value is missing;
step 2.6: weighting and summing the similarity of each Item, defining the sum as Count _ sim, sequencing the first k values of the similarity Count _ sim, and classifying the k values into one class;
step 2.7: a classification T of the text entity is obtained.
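The weighted similarity of steps 2.3–2.6 can be sketched as below, assuming the equal 0.2 weights of step 2.1; Count_sim is the weighted sum of the five per-attribute similarities, and neighbours are ranked by it before a top-k vote. The Gaussian normalization of step 2.5 is omitted, and the function names are illustrative.

```python
from collections import Counter

WEIGHT = {"Title": 0.2, "TypeList": 0.2, "Detail": 0.2,
          "InfoList": 0.2, "ValueList": 0.2}

def count_sim(attr_sims):
    """Weighted sum of the per-attribute similarities (Title_sim,
    TypeList_sim, ...); missing attributes contribute 0 in this sketch."""
    return sum(WEIGHT[a] * attr_sims.get(a, 0.0) for a in WEIGHT)

def knn_classify(neighbours, k=3):
    """neighbours: (per-attribute similarities to the query, class label)
    pairs.  Rank by Count_sim and vote among the first k (cf. step 2.6)."""
    ranked = sorted(neighbours, key=lambda n: count_sim(n[0]), reverse=True)
    return Counter(label for _, label in ranked[:k]).most_common(1)[0][0]
```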
Step 3: predicting the entity classification Result with the KNN algorithm model of step 2, storing it in the Predict_data set, and mapping the entities in the Predict_data set with the entity data in the wiki entity relationship data set Result_data to obtain the data set Train_data; the specific method is as follows:
step 3.1: storing the predicted entity classification results obtained from the KNN algorithm in Predict_data;
step 3.2: crawling all relations and corresponding Chinese names under Wikipedia webpages, wherein the storage format is json format;
step 3.3: the crawled content comprises the relation id rid, the attribute rtype to which the relation belongs, the sub-class statement to which the relation belongs, and the corresponding rlink link, stored in a relation json file; a data sample is defined as Ritem = {rid, rtype, statement, rlink}, and the data set Relation = {Ritem1, Ritem2, …, Ritemn}; the Chinese representation of each relation (relation id rcid, Chinese name rchinese) is stored in a separate json file, with data sample Mitem = {rcid, rchinese} and data set Mrelation = {Mitem1, Mitem2, …, Mitemn};
Step 3.4: merging the data in the two relation json files, defining a data set result.json, and storing the result in the result.json file;
step 3.5: defining an entities.json database, searching for the data in Predict_data on Wikipedia, returning the json content and storing it in the entities.json file;
step 3.6: Wikidata is an open knowledge base; the description of an entity in its Wikidata entity page and the relations associated with the entity are crawled, a wikidataRelation.json file is defined, and the result is stored in it; a data sample is defined as Witem = {entity1, relation, entity2}, and the data set wikidataRelation = {Witem1, Witem2, …, Witemn};
Step 3.7: the data in the wikidataRelation.json database is processed into csv file format, and the data in the agricultural interactive encyclopedia HuDong_data is matched against the Predict_data database to obtain a node.csv file; a data sample is defined as Nitem = {Title, Label}, and the data set Node = {Nitem1, Nitem2, …, Nitemn};
Step 3.8: the wiki Chinese encyclopedia corpus is converted from traditional to simplified Chinese, and the line-break symbols inside sentences are removed;
step 3.9: selecting a training set related to agriculture, choosing the relations "instance of", "taxon rank", "subclass of" and "parent taxon";
step 3.10: preloading the entity list (selecting entities in the following categories: 5: Animal, 6: Plant, 7: Chemicals, 9: FoodItems, 10: Diseases, 12: Nutrients, 13: Biochemistry, 14: Agricultural experiments, 15: Technology);
step 3.11: agricultural related statements are stored in a FileRead file, a triple relation obtained in wikidataRelation.json is aligned to a corpus of a Chinese wiki, and a corpus Wtrain _ data of a training set is defined;
step 3.12: loading an entity to a labeled mapping dictionary, performing part-of-speech screening on words in a sentence, and reading the category of the entity, wherein the category is consistent with the category of the entity in the Predict _ data;
step 3.13: filtering out null values of the attribute relations in the training set obtained by aligning the Wikipedia data set, obtaining Filter_Wtrain_data; a data sample is defined as Fitem = {entity1id, entity1, entity2id, entity2, statement, relation}, and the data set Node = {Fitem1, Fitem2, …, Fitemn}.
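The alignment in steps 3.6–3.13 is the classic distant-supervision heuristic: a sentence that mentions both entities of a known triple is taken as a training sample for that triple's relation. A minimal sketch with illustrative names (no Wikidata access; the triple store is a plain in-memory list):

```python
def align_sentences(sentences, triples):
    """Distant supervision: whenever both entities of a known
    (entity1, relation, entity2) triple occur in a sentence, emit an
    automatically labelled training sample for that relation."""
    samples = []
    for sent in sentences:
        for e1, rel, e2 in triples:
            if e1 in sent and e2 in sent:
                samples.append({"entity1": e1, "entity2": e2,
                                "relation": rel, "sentence": sent})
    return samples
```

In practice the substring test would operate on segmented sentences and an entity dictionary, but the labelling principle is the same.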
Step 4: constructing an entity dictionary in the Wikipedia Chinese word stock with a heuristic rule screening algorithm, and preprocessing the Filter_Wtrain_data text data to obtain the wikidataRelation data set; the specific method is as follows:
step 4.1: the first element of each line in the screened Filter _ Wtrain _ data dataset is an entity, and all entities which are Chinese characters are screened according to a regular expression and converted into a dictionary format;
step 4.2: acquiring the sentence set according to the entity dictionary and storing it in list format; preprocessing it by removing all characters except Chinese characters and common Chinese punctuation, and splitting the sentences;
step 4.3: traversing each sentence by entity; according to the character matching rule, storing all entities appearing in the sub-sentences, filtering out sentences with no entity or only one entity, and processing the data into the format [[sentence, [entity1, …]], …];
step 4.4: performing Chinese word segmentation on the sentences with the jieba library; defining sentence as the text data, sentence_seg as the segmented text and entity1 as a text entity, and processing the data into the format [[sentence, [entity1, …], [sentence_seg]], …];
step 4.5: training a word vector;
step 4.6: re-screening the entities by the sentence after word segmentation;
step 4.7: checking whether the entity appears in the segmented sentence; if not, executing step 4.10;
step 4.8: combining the entity set pairwise and processing the data into the format [sentence, entity1, entity2, [sentence_seg]], one sentence yielding multiple samples;
step 4.9: dividing the data in the wikidataRelation data set into a training set and a testing set according to the proportion of 3: 1;
step 4.10: the entity is removed as well as the text.
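Step 4.8 turns one sentence with n matched entities into C(n, 2) training samples. A minimal sketch, assuming the [sentence, entity1, entity2, [sentence_seg]] sample format described above (the function name is illustrative):

```python
from itertools import combinations

def pair_entities(sentence, entities, sentence_seg):
    """Steps 4.3-4.8: a sentence with n matched entities yields C(n, 2)
    samples [sentence, entity1, entity2, sentence_seg]; sentences with
    fewer than two entities are filtered out."""
    if len(entities) < 2:
        return []
    return [[sentence, e1, e2, sentence_seg]
            for e1, e2 in combinations(entities, 2)]
```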
Step 5: building the PCNN, CNN, RNN and BiRNN neural network models respectively; the specific method is as follows:
step 5.1: building the artificial neural network; in the embedding layer, defining the word mapping function as Word_embedding, the word embedding space vector size as Word_embedding_dim = 50, the position feature embedding space vector size as Position_embedding = 5 and the maximum length as 120, and setting the Word_Position_embedding function to add the two embedding results;
step 5.2: defining the loss functions, including softmax cross entropy and sigmoid cross entropy; for each sample, the softmax or sigmoid layer yields a probability distribution over the predicted relations, and the maximum prediction is taken as the entity prediction result for computing the cross-entropy loss.
Step 5.3: setting drop_out = 0.5; computing the maximum value of the elements along the tensor dimension with max pooling divided into three segments, so that each convolution kernel yields a 3-dimensional vector; inputting the pooling result into a normalization layer and applying the tanh activation function for the non-linearity.
Step 5.4: each group of instances is defined as a bag; if a bag in the training set is positive, the number of its positive instances is greater than or equal to 1; if it is negative, its instances are all negative;
step 5.5: adding an attention mechanism to each bag;
step 5.6: whether training is available or not, if so, executing step 5.14;
step 5.7: whether dropout is present;
step 5.8: defining the entity list bag_pre, and defining the attention-based logit as attention_logit;
step 5.9: defining a variable i for traversing in a local scope, wherein the scope is defined as scope;
step 5.10: if i > scope.shape[0], going to step 5.13;
step 5.11: calculating an attention value of the softmax loss function;
step 5.12: i +1, performing step 5.10;
step 5.13: the obtained rank vector is stored in the bag _ pre entity list, and step 5.19 is executed;
step 5.14: after training, starting testing;
step 5.15: defining a variable i1 for traversing the local scope;
step 5.16: if i1 > scope.shape[0], executing step 5.19;
step 5.17: calculating the sigmoid activation function value and storing the obtained entity value in the logistic regression list bag_logit;
step 5.18: i.e. i1=i1+1, go to step 5.16;
step 5.19: defining the four functions pcnn, cnn, rnn and birnn; defining the hidden size as hidden_size = 230 and the convolution kernel size as 3 with stride 1; using the relu activation function and setting a local variable count;
step 5.20: training, and building a pcnn, cnn, rnn and birnn neural network model;
step 5.21: if count > n, go to step 5.28;
step 5.22: the sentence is decomposed into words in turn, each word mapped to a dw-dimensional vector called a word embedding; the embedding vectors are learned through model training;
step 5.23: position features are used to indicate entity1 and entity2 in a sentence; each word has two relative positions, one per entity, each mapped to a different dp-dimensional vector;
step 5.24: the word embedding and the two relative-position vectors of step 5.23 are concatenated to obtain the input representation matrix M ∈ R^(h×s), where s = dw + 2*dp;
Step 5.25: setting W ∈ R^(wc×s) as the convolution matrix, where wc is the convolution window width; by sliding the convolution window down the sentence and applying the filter at each valid position, a feature map c = [c1, c2, …, c(h−wc+1)] is generated, extracting n features from the sentence;
step 5.26: repeating the above process with different W matrices;
step 5.27: step 5.21 is executed, if count is equal to count + 1;
step 5.28: piecewise max pooling is used to select the maximum activation value in each segment of each feature map;
step 5.29: each feature map ci is divided into three segments {ci1, ci2, ci3} based on the positions of the two entities; after piecewise max pooling is complete, the results of all feature maps are concatenated into a vector p ∈ R^(3n) as the feature representation of the sentence;
step 5.30: and (5) building four neural network models of the step 5.20.
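The piecewise max pooling of steps 5.28–5.29 can be sketched with NumPy as follows; shapes are illustrative, and the sketch assumes 0 < pos1 < pos2 < L so that all three segments are non-empty.

```python
import numpy as np

def piecewise_max_pool(feature_maps, pos1, pos2):
    """feature_maps: (n, L) array, one row per convolution feature map over
    a length-L sentence; pos1 < pos2 are the two entity positions.  Each
    map is split into three segments at the entity positions and
    max-pooled, and the results are concatenated into the 3n-dimensional
    sentence vector p (steps 5.28-5.29)."""
    assert 0 < pos1 < pos2 < feature_maps.shape[1]
    segments = np.split(feature_maps, [pos1, pos2], axis=1)
    return np.concatenate([seg.max(axis=1) for seg in segments])
```

Compared with ordinary max pooling over the whole sentence, keeping one maximum per segment preserves coarse information about where activations occur relative to the two entities.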
Step 6: comparing the four algorithm models to obtain the relation extraction model M with parameter-adaptive optimization in the agricultural knowledge graph domain; the specific steps are as follows:
step 6.1: performing parameter self-adaptive optimization on a link coefficient a, an initial threshold b and an attenuation coefficient c of the PCNN neural network model;
step 6.2: determining relatively fixed ranges for the link coefficient, initial threshold and attenuation coefficient; first searching the initial parameter range with a larger search step and finding the optimal model parameters at that step;
step 6.3: if a plurality of parameter combinations exist and the optimal neural network model is achieved at the same time, executing the step 6.10;
step 6.4: selecting the combination with the smaller link coefficient as an optimization result;
step 6.5: if step is less than 0.0005, executing step 6.10;
step 6.6: searching for optimal parameters by a grid;
step 6.7: checking whether the parameter traversal has finished; if not, executing step 6.6;
step 6.8: obtaining the link coefficient, initial threshold and attenuation coefficient of the PCNN neural network model that achieve the highest accuracy;
step 6.9: narrowing the parameter search range around the current optimum and reducing the search step (the update formulas are rendered as images in the source and are not reproduced here); executing step 6.5;
step 6.10: and obtaining an optimal parameter model of the neural network.
Step 7: extracting the relations between entities from the agricultural-domain text data, rendering the entity relation data through Echarts, and displaying the recommendation result on the web side; the specific steps are as follows:
step 7.1: obtaining the PCNN neural network model with optimal parameters, and extracting the two entities entity1 and entity2 in the agricultural text data and the relation between them;
step 7.2: mapping the triple relation based on a built small triple knowledge graph library wikidataRelation;
step 7.3: utilizing an algorithm model to extract entities of sentences in a wiki corpus, and mapping the entities into a wikidatarelationship database to realize an automatic labeling function in a remote supervision algorithm;
step 7.4: inputting agricultural text data, screening the entities in the text data, and extracting the relations between the entities;
step 7.5: importing data in an agricultural knowledge service system, an agricultural interactive encyclopedia and a Wikipedia into a neo4j database;
step 7.6: checking whether the searched entity exists in the database; if not, executing step 7.9;
step 7.7: displaying the search result on the web side in graph form by using a Cypher statement;
step 7.8: packaging a python interface, displaying the data with the web framework Django, and executing step 7.10;
step 7.9: displaying that the entity does not exist;
step 7.10: entering the text of an agricultural knowledge question in the task box, and applying Chinese word segmentation to the text to obtain its entities;
step 7.11: searching the database by using a Cypher statement;
step 7.12: checking whether the answer to the question exists in the database; if not, executing step 7.9;
step 7.13: and extracting the relation between the entities from the text data in the agricultural field, rendering the entity relation data through Echarts, and displaying the recommendation result on a web end.
All of the above parameters are defined in the following table:
[The parameter definition table is rendered as images in the source and is not reproduced here.]
264093 pieces of data were processed; features were extracted with the KNN algorithm to classify and predict entities, a remote-supervision-based PCNN artificial neural network model was built for relation extraction and model training, and the triple relations of the training texts were obtained, showing users the relations between different crop entities and accelerating information search. In tests, the accuracy of the experimental model using the PCNN algorithm exceeded 94%.
The invention creatively provides a remote-supervision-based parameter-adaptive agricultural knowledge graph system, obtains an optimal neural network model for agricultural-domain relation extraction through adaptive optimization and parameter tuning, and is applicable to unstructured text data about common crops.

Claims (8)

1. A parameter self-adaptive agricultural knowledge graph recommendation method based on remote supervision is characterized by comprising the following specific steps:
(1) data preprocessing is carried out on the data crawled from an agricultural knowledge service system, and the obtained data set is defined as Agri_data; data crawling is performed on the agricultural interactive encyclopedia with Scrapy, and the crawled data set is defined as HuDong_data; Chinese word segmentation and word vector training are performed on the text data in Agri_data and HuDong_data, and the obtained data set is defined as Train_data;
(2) a KNN algorithm model is trained with the data set Train_data, feature extraction is performed on the text data with the fastText classification tool, and text similarity comparison is performed with the cosine similarity algorithm to obtain a text entity classification T;
(3) the entity classification Result is predicted with the KNN algorithm model of step (2) and stored in the Predict_data set, and the entities in the Predict_data set are mapped with the entity data in the wiki entity relationship data set Result_data to obtain the data set Train_data;
(4) an entity dictionary in the Wikipedia Chinese word stock is constructed with a heuristic rule screening algorithm, and the Filter_Wtrain_data text data is preprocessed to obtain the wikidataRelation data set;
(5) respectively building PCNN, CNN, RNN and BiRNN neural network models;
(6) comparing the four algorithm models to obtain a relation extraction model M for parameter self-adaption optimization in the field of agricultural knowledge maps;
(7) and extracting the relation between the entities from the text data in the agricultural field, rendering the entity relation data through Echarts, and displaying the recommendation result on a web end.
2. The remote supervision-based parameter adaptive agricultural knowledge graph recommendation method according to claim 1, wherein the data set Train _ data obtained in the step (1) comprises the following specific steps:
(1.1) performing data crawler and selecting a crawler page;
(1.2) selecting a page;
(1.3) selecting an agricultural knowledge service system;
(1.4) crawling the agricultural knowledge service system, obtaining its html files, limiting the crawling range with the front-end div, and obtaining the crop name Title, detailed content Detail, photo ImageList and web page link Url; the data item Aitem = {Title, Detail, ImageList, Url} is formed, and the data set Agri_data = {Aitem1, Aitem2, …, Aitemn}; executing step (1.8);
(1.5) selecting an agricultural interactive encyclopedia;
(1.6) crawling the content of the agricultural interactive encyclopedia with Scrapy, declaring the crawler's address domain, acquiring the word list, constructing the original json file, generating the Url list, acquiring the crop Title, and crawling the picture ImageList and the open-domain label openTypeList;
(1.7) each entity obtained by crawling corresponds to one entry of the agricultural encyclopedia, an entry comprising a Title, an interactive-encyclopedia link Url, a picture ImageList, an open classification list TypeList, detailed information InfoList and basic information ValueList; the data item Hitem = {Title, Url, ImageList, TypeList, InfoList, ValueList} is formed, and the data set HuDong_data = {Hitem1, Hitem2, …, Hitemn};
(1.8) acquiring two types of database sets Agri _ data and HuDong _ data;
(1.9) performing part-of-speech screening on data in the two data sets Agri _ data and HuDong _ data;
(1.10) discarding words containing non-Chinese and English or numeric characters;
(1.11) performing Chinese word segmentation and word vector training on two database sets Agri _ data and HuDong _ data respectively;
(1.12) obtaining a data set Train _ data.
3. The remote supervision-based parameter adaptive agricultural knowledge graph recommendation method according to claim 1, wherein the specific steps of obtaining the text entity classification T in the step (2) are as follows:
(2.1) transmitting the text data set Train_data into the KNN text classifier; defining the mean of each component as Mean, the variance of each component as Var, the inverse text frequency index as Text_IDF, and the number of texts as Item_Num; defining the similarity weight as Weight{Title, TypeList, Detail, InfoList, ValueList} = {0.2, 0.2, 0.2, 0.2, 0.2};
(2.2) if each Item has 5 attributes in Weight, adding 1 to the IDF value of each attribute;
(2.3) for each pair of 2 items, returning the similarity of each attribute: the Title similarity, defined as Title_sim; the TypeList similarity, defined as TypeList_sim; the Detail content similarity, defined as Detail_sim; the InfoList similarity, defined as InfoList_sim; and the ValueList similarity, defined as ValueList_sim; averaging the similarities of the 5 attributes and defining the result as Dsim;
(2.4) linearly weighting the obtained similarity Dsim, and defining the similarity Dsim as Simi;
(2.5) storing the attribute similarities of the Item in a temporary table CurList, calculating the variance and mean of each component, performing Gaussian normalization on the Title and TypeList similarities, and assigning the average similarity to attributes whose similarity value is missing;
(2.6) carrying out weighted sum on the similarity of each Item, defining the sum as Count _ sim, sequencing the first k values of the similarity Count _ sim, and classifying the k values into one class;
and (2.7) obtaining the classification T of the text entity.
4. The remote supervision-based parameter adaptive agricultural knowledge graph recommendation method according to claim 1, wherein the specific steps of obtaining the data set Train _ data in the step (3) are as follows:
(3.1) storing the classification result of the predicted entity obtained from the KNN algorithm by using Predict _ data;
(3.2) crawling all relations under the Wikipedia webpage and the corresponding Chinese names with Scrapy, the storage format being json;
(3.3) the crawled content comprises the relation id rid, the attribute rtype to which the relation belongs, the sub-class statement to which the relation belongs, and the corresponding rlink link, stored in a relation json file; a data sample is defined as Ritem = {rid, rtype, statement, rlink}, and the data set Relation = {Ritem1, Ritem2, …, Ritemn}; the Chinese representation of each relation (relation id rcid, Chinese name rchinese) is stored in a separate json file, with data sample Mitem = {rcid, rchinese} and data set Mrelation = {Mitem1, Mitem2, …, Mitemn};
(3.4) merging the data in the two relation json files, defining a data set result.json, and storing the result in the result.json file;
(3.5) defining an entities.json database, searching for the data in Predict_data on Wikipedia, returning the json content and storing it in the entities.json file;
(3.6) Wikidata is an open knowledge base; the description of an entity in its Wikidata entity page and the relations associated with the entity are crawled, a wikidataRelation.json file is defined, and the result is stored in it; a data sample is defined as Witem = {entity1, relation, entity2}, and the data set wikidataRelation = {Witem1, Witem2, …, Witemn};
(3.7) the data in the wikidataRelation.json database is processed into csv file format, and the data in the agricultural interactive encyclopedia HuDong_data is matched against the Predict_data database to obtain a node.csv file; a data sample is defined as Nitem = {Title, Label}, and the data set Node = {Nitem1, Nitem2, …, Nitemn};
(3.8) the wiki Chinese encyclopedia corpus is converted from traditional to simplified Chinese, and the line-break symbols inside sentences are removed;
(3.9) selecting a training set related to agriculture, choosing the relations "instance of", "taxon rank", "subclass of" and "parent taxon";
(3.10) preloading the entity list (selecting entities in the following categories: 5: Animal, 6: Plant, 7: Chemicals, 9: FoodItems, 10: Diseases, 12: Nutrients, 13: Biochemistry, 14: Agricultural entities, 15: Technology);
(3.11) agriculture-related sentences are stored in a FileRead file, the triple relations obtained in wikidataRelation.json are aligned to the Chinese wiki corpus, and the training corpus Wtrain_data is defined;
(3.12) loading the entity-to-label mapping dictionary, performing part-of-speech screening on the words in the sentence, and reading the category of the entity, keeping it consistent with the entity categories in Predict_data;
(3.13) null values of the attribute relations in the training set obtained by aligning the Wikipedia data set are filtered out to obtain Filter_Wtrain_data; a data sample is defined as Fitem = {entity1id, entity1, entity2id, entity2, statement, relation}, and the data set Node = {Fitem1, Fitem2, …, Fitemn}.
5. The method for recommending parameter-adaptive agricultural knowledge-graph based on remote supervision as claimed in claim 1, wherein the specific step of obtaining wikidataRelation data set in step (4) is as follows:
(4.1) the first element of each line in the screened Filter_Wtrain_data data set is an entity; all entities consisting entirely of Chinese characters are screened out according to a regular expression and converted into dictionary format;
(4.2) acquiring the sentence set according to the entity dictionary and storing it in list format; preprocessing the sentence set by removing all characters other than Chinese characters and common Chinese punctuation, and splitting the sentences;
(4.3) traversing each sentence by entity, storing all entities in the sub-sentences according to a character matching rule, filtering out sentences containing no entity or only one entity, and processing the data into the format [[sentence, [entity1, …]], …];
(4.4) performing Chinese word segmentation with the jieba library and splitting the Chinese sentences; defining sentence as the text data, sentence_seg as the text word segmentation and entity1 as a text entity, and processing the data into the format [[sentence, [entity1, …], [sentence_seg]], …];
(4.5) training the word vector;
(4.6) re-screening the entities by the sentence after word segmentation;
(4.7) judging whether the entity appears in the segmented sentence; if not, executing step (4.10);
(4.8) combining the entity set pairwise and processing the data into the format [sentence, entity1, entity2, [sentence_seg]], one sentence yielding a plurality of samples;
(4.9) dividing the data in the wikidataRelation data set into a training set and a testing set according to the ratio of 3: 1;
(4.10) removing the entity and the text.
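Steps (4.2), (4.3) and (4.8) above can be sketched in Python; the character-class regular expression, the helper names and the toy sentence are assumptions for illustration, not the patent's exact preprocessing:

```python
# Hedged sketch of steps (4.2)/(4.8): keep only Chinese characters and
# common Chinese punctuation, split into sub-sentences, then emit one
# sample per entity pair found in a sentence (one sentence, many samples).
import re
from itertools import combinations


def clean_and_split(text: str) -> list:
    """Step (4.2): strip non-Chinese characters, split on 。！？."""
    text = re.sub(r"[^\u4e00-\u9fa5，。！？；：]", "", text)
    return [s for s in re.split(r"[。！？]", text) if s]


def pair_samples(sentence: str, entities: list) -> list:
    """Steps (4.3)/(4.8): keep entities present in the sentence and
    combine them pairwise into [sentence, entity1, entity2] samples."""
    found = [e for e in entities if e in sentence]
    return [[sentence, e1, e2] for e1, e2 in combinations(found, 2)]


sents = clean_and_split("水稻（Oryza）属于禾本科！小麦很常见。")
samples = pair_samples(sents[0], ["水稻", "禾本科", "玉米"])
print(sents)    # ['水稻属于禾本科', '小麦很常见']
print(samples)  # [['水稻属于禾本科', '水稻', '禾本科']]
```

Sentences in which fewer than two entities are matched produce no samples, which realizes the filtering of steps (4.3) and (4.7).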
6. The remote supervision-based parameter adaptive agricultural knowledge graph recommendation method according to claim 1, wherein the specific steps of respectively building PCNN, CNN, RNN and BiRNN neural network models in the step (5) are as follows:
(5.1) building an artificial neural network; in the embedding layer, defining the word mapping function as word_embedding, the word embedding space vector size as word_embedding_dim = 50, the position feature embedding space vector size as position_embedding = 5 and the maximum length as 120, and setting the word_position_embedding function to add the results of the two embedding functions;
(5.2) defining three loss functions, including softmax cross entropy and sigmoid cross entropy; for each sample, the softmax or sigmoid layer yields a probability distribution over the candidate relations, and the maximum prediction result is taken as the entity prediction result for calculating the cross-entropy loss;
(5.3) setting drop_out = 0.5; calculating the maximum value of the elements in the tensor dimension, dividing the map into three sections for max pooling so that each convolution kernel obtains a 3-dimensional vector; inputting the result of the pooling layer into a normalization function layer and performing nonlinear processing with the tanh activation function;
(5.4) grouping the instances into bags; if a bag in the training set is positive, the number of positive instances in it is greater than or equal to 1; if it is negative, its instances are all negative;
(5.5) adding an attention mechanism on each bag;
(5.6) judging whether training is performed; if so, executing step (5.14);
(5.7) judging whether dropout is applied;
(5.8) defining an entity list bag_pre, and defining the attention-based logistic regression value as attention_logit;
(5.9) defining a variable i for traversal within the local scope, the scope being denoted scope;
(5.10) if i > scope.shape[0], executing step (5.13);
(5.11) calculating an attention value of the softmax loss function;
(5.12) i = i + 1, executing step (5.10);
(5.13) storing the obtained rank vector in the bag_pre entity list and executing step (5.19);
(5.14) finishing training and starting testing;
(5.15) defining a variable i1 for traversal within the local scope;
(5.16) if i1 > scope.shape[0], executing step (5.19);
(5.17) calculating the sigmoid activation function value and storing the obtained entity value in the bag_logit logistic regression list;
(5.18) i1 = i1 + 1, executing step (5.16);
(5.19) defining the four functions pcnn, cnn, rnn and birnn; defining the hidden factor hidden_size = 230, the convolution kernel size as 3 and the stride as 1; using the relu activation function and setting a local variable count;
(5.20) training, and building a pcnn, cnn, rnn and birnn neural network model;
(5.21) if count > n, executing step (5.28);
(5.22) decomposing the sentence into words in turn, each word being mapped to a dw-dimensional vector called a word embedding; the embedding vectors are learned through model training;
(5.23) using position features to indicate the positions of entity1 and entity2 in the sentence; each word has two relative positions, one to each entity, each mapped to a different dp-dimensional vector;
(5.24) concatenating the word embedding and the two relative position embeddings of step (5.23) to obtain the matrix M ∈ R^(h×s) as the input representation, where s = dw + 2·dp;
(5.25) setting the convolution matrix W ∈ R^(Wc×s), where Wc is the convolution window width; by sliding the convolution window down the sentence and applying the function at each valid position, a feature map c = [c1, c2, …, c(h−Wc+1)] is generated, extracting n features from the sentence;
(5.26) repeating the above process with a different W matrix;
(5.27) count +1, performing step (5.21);
(5.28) piecewise max pooling is used to select the maximum activation value in each segment of every feature map;
(5.29) dividing each feature map ci into three segments {ci1, ci2, ci3} based on the positions of the two entities; when the piecewise max pooling is complete, the results of all feature maps are concatenated into a vector p ∈ R^(3n) as the feature representation of the sentence;
(5.30) building four neural network models of the step (5.20).
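The piecewise max pooling of steps (5.28) and (5.29) can be illustrated for a single feature map; the toy activation values and entity positions below are invented:

```python
# Hedged sketch of steps (5.28)-(5.29): split one convolutional feature
# map into three segments at the two entity positions and take the max
# of each segment, yielding a 3-dimensional vector per convolution kernel.
def piecewise_max_pool(feature_map, pos1, pos2):
    segments = (feature_map[:pos1], feature_map[pos1:pos2], feature_map[pos2:])
    return [max(seg) if seg else 0.0 for seg in segments]


# Toy activations of one kernel over a 7-position map; entities at 2 and 5.
c = [0.1, 0.9, 0.3, 0.4, 0.2, 0.8, 0.5]
pooled = piecewise_max_pool(c, 2, 5)
print(pooled)  # [0.9, 0.4, 0.8]
```

Concatenating the pooled results of n kernels gives the vector p ∈ R^(3n) of step (5.29).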
7. The parameter-adaptive agricultural knowledge graph recommendation method based on remote supervision as claimed in claim 1, wherein the specific steps of obtaining the parameter-adaptive optimized relation extraction model M for the agricultural knowledge graph field in step (6) are as follows:
(6.1) carrying out parameter self-adaptive optimization on a link coefficient a, an initial threshold b and an attenuation coefficient c of the PCNN neural network model;
(6.2) to determine a relatively fixed link coefficient, initial threshold and attenuation coefficient, first searching the initial parameter range with a larger search step size and finding the optimal model parameters at that resolution;
(6.3) if a plurality of parameter combinations simultaneously reach the optimal neural network model, executing step (6.10);
(6.4) selecting the combination with the smaller link coefficient as the optimization result;
(6.5) if step is less than 0.0005, executing step (6.10);
(6.6) searching for the optimal parameters by grid search;
(6.7) judging whether the parameter traversal has ended; if not, executing step (6.6);
(6.8) obtaining the highest accuracy rate by the link coefficient, the initial threshold value and the attenuation coefficient of the optimal PCNN neural network model;
(6.9) updating the search range and step size according to the two formulas of the original document (present there only as images FDA0002624901610000071 and FDA0002624901610000072), and executing step (6.5);
(6.10) obtaining the optimal parameter model of the neural network.
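The coarse-to-fine parameter search of steps (6.1) to (6.10) can be sketched for a single parameter; because the patent's update formulas survive only as images, the range-narrowing and step-halving schedule here is an assumption, and score() is a stand-in for model accuracy:

```python
# Hedged sketch of the claim-7 search: grid-search at a coarse step,
# narrow the range around the best point, halve the step, and stop
# once the step falls below 0.0005 as in step (6.5).
def coarse_to_fine(score, lo, hi, step, min_step=0.0005):
    best = lo
    while step >= min_step:
        grid, x = [], lo
        while x <= hi + 1e-12:          # grid at the current resolution
            grid.append(x)
            x += step
        best = max(grid, key=score)      # best parameter at this step size
        lo = max(lo, best - step)        # narrow the range around the best
        hi = min(hi, best + step)
        step /= 2                        # assumed decay schedule
    return best


# Stand-in objective with its optimum at 0.37.
opt = coarse_to_fine(lambda a: -(a - 0.37) ** 2, 0.0, 1.0, 0.1)
print(round(opt, 3))
```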
8. The parameter-adaptive agricultural knowledge graph recommendation method based on remote supervision according to claim 1, wherein in step (7) the relations between entities are extracted from the agricultural-field text data, the entity relation data are rendered through ECharts, and the recommendation results are displayed on the web side; the specific steps are as follows:
(7.1) obtaining the PCNN neural network model with optimal parameters, and extracting the two entities entity1 and entity2 in the agricultural text data and the relation between them;
(7.2) mapping the triple relations based on the constructed small triple knowledge graph library wikidataRelation;
(7.3) performing entity extraction on the sentences in the wiki corpus by using the algorithm model, and mapping the entities into a wikidataRelation database to realize the automatic labeling function in the remote supervision algorithm;
(7.4) inputting agricultural text data, screening entities in the text data, and extracting the relationship between the entities and the text data;
(7.5) importing data in an agricultural knowledge service system, an agricultural interactive encyclopedia and a Wikipedia into a neo4j database;
(7.6) judging whether the searched entity exists in the database; if not, executing step (7.9);
(7.7) displaying the search results on the web side in graph form by using a Cypher statement;
(7.8) encapsulating the python interface, displaying the data with the web framework Django, and executing step (7.10);
(7.9) displaying that the entity does not exist;
(7.10) searching the text data of the agricultural knowledge question in the task box and applying Chinese word segmentation to the text data to obtain the entities;
(7.11) searching the database by using a Cypher statement;
(7.12) judging whether the answer to the question exists in the database; if not, executing step (7.9);
(7.13) extracting the relations between entities from the agricultural-field text data, rendering the entity relation data through ECharts, and displaying the recommendation results on the web side.
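The ECharts rendering of step (7.13) can be sketched by assembling a graph-series option from extracted triples; the option layout below is a minimal assumption, not the patent's exact front end:

```python
# Hedged sketch of step (7.13): turn (head, relation, tail) triples into
# the nodes/links structure consumed by an ECharts graph series.
import json


def triples_to_echarts(triples):
    names = []
    for head, _, tail in triples:        # collect unique node names in order
        for name in (head, tail):
            if name not in names:
                names.append(name)
    nodes = [{"name": n} for n in names]
    links = [{"source": h, "target": t, "value": r} for h, r, t in triples]
    return {"series": [{"type": "graph", "data": nodes, "links": links}]}


option = triples_to_echarts([("水稻", "parent taxon", "禾本科")])
print(json.dumps(option, ensure_ascii=False))
```

The resulting JSON can be passed to echarts.setOption on the web side, which is how the triples extracted by the model become the displayed recommendation graph.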
CN202010794151.4A 2020-08-10 2020-08-10 Parameter self-adaptive agricultural knowledge graph recommendation method based on remote supervision Active CN112199508B (en)

Publications (2)

CN112199508A (publication): 2021-01-08
CN112199508B (grant): 2024-01-19

Family

ID=74004961






Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant