CN113128233B - Construction method and system of mental disease knowledge map - Google Patents
- Publication number
- CN113128233B, CN202110512846.3A
- Authority
- CN
- China
- Prior art keywords
- entity
- word
- mental disease
- data
- corpus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
- G06F40/295—Named entity recognition
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Artificial Intelligence (AREA)
- Databases & Information Systems (AREA)
- Evolutionary Computation (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Software Systems (AREA)
- Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- General Health & Medical Sciences (AREA)
- Medical Treatment And Welfare Office Work (AREA)
Abstract
The invention provides a method and a system for constructing a mental disease knowledge graph. Existing information related to mental diseases is acquired and a mental disease corpus is established; entity, relation and attribute indication word lists are determined from the mental disease corpus; the data in the corpus are fine-tuned with a language model, a mental disease named entity recognition data set is constructed, feature values of that data set are extracted, the fine-tuned data and the extracted features are fused, and a pre-constructed deep learning model is trained on the fused data; the trained deep learning model then predicts on the mental disease corpus to be processed, the predicted entity category index sequence is converted into an entity type sequence, each entity word is stored in an entity word list, and entity relation and attribute data are extracted and stored separately by relation type and attribute type. The invention can effectively improve the accuracy of entity recognition, including the accuracy of extracting complex entities.
Description
Technical Field
The invention belongs to the technical field of knowledge maps, and particularly relates to a construction method and a system of a mental disease knowledge map.
Background
The statements in this section merely provide background information related to the present disclosure and may not necessarily constitute prior art.
A knowledge graph is a semantic network that reveals the relationships between entities and presents real-world things and their interrelationships in a structured form. From a semantic perspective, a knowledge graph describes the concepts, entities and relations of the objective world as triples, yielding a knowledge structure that a computer can recognize, so that the computer can better organize, manage and utilize the massive information on the Internet.
At present, many people have psychological problems or disorders, but most people have no clear standard for judging mental health, the classification of mental disease types is unclear to them, and mental health education needs to be strengthened.
With the rapid development of "Internet Plus" technology and intelligent medical care, a large amount of mental disease data has been generated, but most of it is stored as unstructured text such as documents, the data items are poorly linked to one another, and the data cannot be used effectively. To manage and use mental disease data effectively and to interconnect the data, it is important to construct a mental disease knowledge graph.
The main steps in building a knowledge graph are entity recognition, relation extraction, attribute recognition and knowledge storage. With the continuing development of deep learning, using deep neural networks to recognize and extract entities and construct knowledge graphs has become the mainstream approach. However, labeling mental disease data is costly, and a neural network trained without a large amount of labeled data has low recognition accuracy. Moreover, a mental disease knowledge graph is a professional-domain knowledge graph and demands higher knowledge quality; existing entity recognition algorithms lack the guidance of prior knowledge, so errors are unavoidable when complex entities are extracted, and secondary correction by professionals is needed, which consumes manpower and material resources.
Disclosure of Invention
The invention provides a construction method and a system of a mental disease knowledge graph to solve the problems, and the method and the system can effectively improve the accuracy of entity identification and improve the accuracy of extracting complex entities.
According to some embodiments, the invention adopts the following technical scheme:
a construction method of a mental disease knowledge map comprises the following steps:
obtaining existing information related to psychological diseases, and establishing a psychological disease corpus;
determining an entity, a relation and an attribute indication word list according to the mental disease corpus;
finely adjusting data in the mental disease corpus by using a language model, constructing a mental disease named entity recognition data set, extracting characteristic values of the named entity recognition data set, fusing the finely adjusted data and the extracted characteristics, and training a pre-constructed deep learning model by using the fused data;
and predicting the psychological disease corpus to be processed by utilizing the trained deep learning model, converting the entity category index sequence obtained by prediction into an entity type sequence, storing each entity word into an entity word list, and respectively extracting entity relationship and attribute data according to the relationship type and the attribute type for respectively storing.
As an alternative embodiment, the specific process of acquiring the existing information related to the mental disease and establishing the corpus of the mental disease includes:
setting a mental disease term seed word set according to the book related to the mental disease;
according to the mental disease term seed set, related contents in the medical website are searched in a traversing manner, and related webpage url is recorded and stored as a url set;
crawling the webpage content of the url set by using a crawler technology;
extracting contents of the crawled webpage contents by adopting a regular expression and an xpath analyzer, storing unstructured data into a database, directly extracting triples for storing semi-structured data, and distinguishing and storing different relation types and different attribute types;
and labeling at least one part of the processed corpus.
As an alternative embodiment, the specific process of using the language model to fine-tune the data in the mental disease corpus includes:
executing a git command to download the albert_tiny_google_zh model open-sourced by Google;
processing the mental disease corpus, converting the txt files into tfrecords files of a specific format, and pre-training on the obtained tfrecords files;
executing the modeling.py function, loading the pre-trained and fine-tuned ALBERT language model, and using the language model to pre-train and fine-tune on the obtained corpus.
As an alternative embodiment, the specific process of constructing the mental disease named entity recognition data set comprises:
labeling each character of the labeled data;
generating a training set and a verification set for the data according to a certain proportion;
constructing word index files word2id and id2word from the obtained training set and verification set; constructing a word frequency statistics dictionary file word_frequency for the mental disease corpus;
and constructing tag index files tag2id and id2tag for the training set, the test set and the verification set.
As an alternative embodiment, the specific process of extracting the feature value of the named entity identification data set includes:
constructing four word sets of 'BMES' for each character in the input sequence, wherein 'B', 'M', 'E' and 'S' respectively represent segmentation information of each character in words;
for the 'BMES' word sets of the corresponding character, converting the words in the word sets into word vectors by consulting a word embedding lookup table, and setting the dimensionality of the word vectors;
and compressing the BMES word set of each character by adopting a weighted average algorithm.
As an alternative embodiment, the specific process of fusing the trimmed data and the extracted features includes:
for an input character vector sequence, creating a forward GRU hidden layer unit and a backward GRU hidden layer unit at each time step, creating a gated recurrent unit for each hidden layer unit, determining the state sequence, and setting the corresponding parameters;
adjusting the obtained state sequence, flattening the three-dimensional array into a two-dimensional array, calculating hidden layer output, and further calculating state output;
adjusting the calculated state output, and converting the two-dimensional array into a three-dimensional array;
constructing and initializing a transition matrix, taking the three-dimensional array obtained in the previous conversion step and the state transition matrix as the input of a CRF function, and calculating a loss value by maximum likelihood estimation;
and performing back propagation, and calculating the predicted optimal sequence labels.
As an alternative implementation, the specific process of storing each entity word into the entity word table, and extracting the entity relationship and attribute data respectively according to the relationship type and the attribute type, and performing respective storage includes:
consulting the id2tag index file, converting the obtained entity type index sequence into an entity type sequence, and storing each entity word in an entity word list;
using the entity word list, segmenting the mental disease corpus with a word segmentation tool, and removing stop words by reference to a stop word list;
consulting the relation indication word list to complete the extraction of entity relation triples;
consulting the attribute indication word list to complete the extraction of entity attribute triples;
storing the extracted entity relationship and entity attribute triples into csv files, and performing differentiated storage on different relationship types and different attribute types;
and respectively creating entities, entity relations and entity attributes according to the storage files to finish knowledge storage.
A system for constructing a mental disease knowledge map, comprising:
the mental disease corpus building module is configured to obtain existing information related to mental diseases and build a mental disease corpus;
the indicating word list building module is configured to determine an entity, a relation and an attribute indicating word list according to the mental disease corpus;
the prediction model building and training module is configured to utilize a language model to finely tune data in the mental disease corpus, build a mental disease named entity recognition data set, extract characteristic values of the named entity recognition data set, fuse the finely tuned data and the extracted characteristics, and utilize a pre-trained deep learning model to predict;
and the knowledge storage construction module is configured to convert the entity category index sequence obtained by prediction into an entity type sequence, store each entity word into an entity word list, and respectively extract entity relationship and attribute data according to the relationship type and the attribute type for respective storage.
An electronic device comprising a memory and a processor and computer instructions stored on the memory and executed on the processor, the computer instructions, when executed by the processor, performing the steps of the above method.
A computer readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the above method.
Compared with the prior art, the invention has the following beneficial effects:
the invention provides a construction method of a mental disease knowledge map, which can effectively manage and utilize massive mental disease data and can develop multiple applications such as knowledge search, intelligent question answering and the like on the basis of the knowledge map.
To address the problem that deep learning models lack a large amount of labeled training data, the invention pre-trains on the mental disease corpus with the ALBERT language model, which supplies the deep learning model with rich semantic information and effectively improves the accuracy of entity recognition.
To address the growing number and complexity of entity names in professional-domain knowledge graphs, the invention introduces prior knowledge by constructing MWI features, which greatly improves the accuracy of extracting complex entities.
In order to make the aforementioned and other objects, features and advantages of the present invention comprehensible, preferred embodiments accompanied with figures are described in detail below.
Drawings
The accompanying drawings, which are incorporated in and constitute a part of this specification, are included to provide a further understanding of the invention; they illustrate exemplary embodiments of the invention and, together with the description, serve to explain the invention and not to limit it.
FIG. 1 is an overall flowchart of the present embodiment;
FIG. 2 is a flowchart of searching and processing the mental disease corpus in this embodiment;
FIG. 3 is a flowchart illustrating the operation of the entity recognition model according to this embodiment;
FIG. 4 is a flowchart illustrating relationship extraction and attribute identification according to the present embodiment.
Detailed Description
the invention is further described with reference to the following figures and examples.
It is to be understood that the following detailed description is exemplary and is intended to provide further explanation of the invention as claimed. Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs.
It is noted that the terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the exemplary embodiments according to the invention. As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise, and it should be understood that the terms "comprises" and/or "comprising", when used in this specification, specify the presence of the stated features, steps, operations, devices, components and/or combinations thereof.
A method for constructing a mental disease knowledge graph, as shown in FIG. 1, specifically comprises:
Step (1): establish a mental disease corpus.
Step (2): construct a mental disease domain ontology.
Step (3): extract named entities based on the ALBERT-MWI-BiGRU-CRF model.
Step (4): extract relations and identify attributes based on template matching.
Step (5): store knowledge based on the Neo4j graph database.
Specifically, in this embodiment, the step (1) includes:
Step (1-1): set a mental disease term seed word set according to professional mental disease references such as CCMD-3 and DSM-5.
Step (1-2): using the mental disease term seed word set, traverse and search the relevant content on medical websites, record the URLs of the relevant web pages, and store them as a URL set.
Step (1-3): crawl the web page content of the URL set obtained in step (1-2) using crawler technology, as shown in FIG. 2.
The crawler uses the Scrapy framework.
Step (1-4): the web page content crawled in step (1-3) contains HyperText Markup Language and cannot be used directly as mental disease corpus. Therefore, extract the content of the crawled HTML pages using regular expressions and an XPath parser; store unstructured data in the database in txt file format; extract semi-structured data directly as triples and store them as csv files, keeping different relation types and different attribute types in separate storage.
The database is one that has already been established locally.
Step (1-5): label part of the processed corpus; the labeled data account for about 40% of the total data.
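As an illustration of step (1-4), the following is a minimal sketch and not the implementation of this embodiment. It uses only the Python standard library: `html.parser` stands in for the XPath parser, and the `FIELD_RE` pattern, the helper names and the sample page are hypothetical. It strips markup to obtain unstructured text and turns "field: value" lines into triples.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> contents."""

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip and data.strip():
            self.parts.append(data.strip())


def html_to_text(html):
    """Return the visible text of an HTML page, one text node per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.parts)


# Hypothetical pattern for semi-structured "field: value" lines
# (ASCII or full-width colon).
FIELD_RE = re.compile(r"^(?P<field>[^::\n]{1,20})[::][ \t]*(?P<value>.+)$", re.M)


def extract_attribute_triples(entity, text):
    """Turn every "field: value" line into an (entity, field, value) triple."""
    return [(entity, m.group("field").strip(), m.group("value").strip())
            for m in FIELD_RE.finditer(text)]


page = """<html><head><style>p{color:red}</style></head><body>
<h1>Depression</h1>
<p>Symptom: low mood</p>
<p>Department: psychiatry</p>
<p>Depression is a common mental disorder.</p>
<script>var x = 1;</script>
</body></html>"""

text = html_to_text(page)
triples = extract_attribute_triples("Depression", text)
```

In this sketch the free-text sentence would go to the txt corpus, while the two "field: value" lines become triples that could be written to csv files grouped by field.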
Specifically, in this embodiment, the step (2) includes:
step (2-1): according to the mental disease corpus created in step (1), in this embodiment, under the guidance of a professional, the determined entity types are shown in table 1:
TABLE 1
Step (2-2): according to the mental disease corpus created in step (1), in this embodiment, under the guidance of a professional, the determined relationship types are shown in table 2:
TABLE 2
Step (2-3): according to the mental disease corpus created in step (1), in this embodiment, under the guidance of a professional, the determined attribute types are shown in table 3:
TABLE 3
Of course, in other embodiments, the vocabulary may be adjusted.
Specifically, in this embodiment, as shown in fig. 3, the step (3) includes:
Step (3-1): obtain a pre-trained ALBERT language model and use it to pre-train and fine-tune on the corpus obtained in step (1).
Step (3-2): construct a mental disease named entity recognition data set and preprocess it.
Step (3-3): construct Multi-word Information (MWI) features for the data set obtained in step (3-2).
Step (3-4): fuse the word vectors obtained in step (3-1) with the MWI features obtained in step (3-3) to obtain enhanced word vectors.
Step (3-5): build the BiGRU-CRF deep learning model.
Step (3-6): input the enhanced word vector sequence obtained in step (3-4) into the model for training, and save the trained model.
Specifically, in this embodiment, the step (3-1) includes:
Step (3-1-1): execute a git command to download the albert_tiny_google_zh model open-sourced by Google.
The command is specifically: git clone https://github.com/brightmart/albert_zh
Step (3-1-2): process the corpus obtained in step (1) by executing the create_pretraining_data.py command, converting the txt files into tfrecords files of a specific format.
The command is specifically:
python3 create_pretraining_data.py --do_whole_word_mask=True --input_file=%s --output_file=%s --vocab_file=%s --do_lower_case=True --max_seq_length=512 --max_predictions_per_seq=20 --masked_lm_prob=0.10 % (address of the corpus file to be processed, address of the processed tfrecords file, address of the vocabulary file)
Step (3-1-3): execute the pre-training command on the tfrecords files obtained in step (3-1-2).
The pre-training command is as follows:
python3 run_pretraining.py --input_file=%s --output_dir=%s --do_train=True --do_eval=True --bert_config_file=%s --train_batch_size=4096 --max_seq_length=128 --max_predictions_per_seq=… --num_train_steps=100000 --num_warmup_steps=12500 --learning_rate=0.00176 --save_checkpoints_steps=1000 --init_checkpoint=%s % (address of the processed tfrecords corpus file, address for storing the fine-tuned language model, address of the language model configuration file, address of the language model downloaded in step (3-1-1))
Of course, the above programming statement is only an example of the embodiment, and in other embodiments, the adjustment may be performed.
Step (3-1-4): execute the modeling.py function and load the ALBERT language model pre-trained and fine-tuned in step (3-1-3).
The step (3-2) comprises the following steps:
Step (3-2-1): label each character of the labeled data using the BIOES labeling scheme.
Step (3-2-2): split the data into a training set and a verification set at a ratio of 9:1. In other embodiments, this ratio may be varied.
Step (3-2-3): construct the word index files word2id and id2word from the training set and verification set obtained in step (3-2-2), and construct the word frequency statistics dictionary file word_frequency for the mental disease corpus obtained in step (1).
Step (3-2-4): construct the tag index files tag2id and id2tag for the training set, the test set and the verification set.
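The preprocessing in step (3-2) can be sketched as follows. This is an illustrative sketch only: the entity type names "DIS" and "SYM", the span format and all function names are assumptions, and the index "files" are kept as in-memory dictionaries.

```python
import random


def bioes_tags(sentence, entities):
    """Tag each character; entities are (start, end_exclusive, type) spans."""
    tags = ["O"] * len(sentence)
    for start, end, etype in entities:
        if end - start == 1:
            tags[start] = "S-" + etype          # single-character entity
        else:
            tags[start] = "B-" + etype          # beginning
            for i in range(start + 1, end - 1):
                tags[i] = "I-" + etype          # inside
            tags[end - 1] = "E-" + etype        # end
    return tags


def build_index(items):
    """Build item->id and id->item maps in order of first appearance."""
    item2id = {}
    for item in items:
        item2id.setdefault(item, len(item2id))
    id2item = {i: item for item, i in item2id.items()}
    return item2id, id2item


def split_dataset(samples, ratio=0.9, seed=0):
    """Shuffle and split samples into training and verification sets."""
    samples = list(samples)
    random.Random(seed).shuffle(samples)
    cut = int(len(samples) * ratio)
    return samples[:cut], samples[cut:]


sentence = "抑郁症患者常失眠"
# Hypothetical entity types: "DIS" (disease) and "SYM" (symptom).
tags = bioes_tags(sentence, [(0, 3, "DIS"), (6, 8, "SYM")])
word2id, id2word = build_index(sentence)
tag2id, id2tag = build_index(tags)
train, dev = split_dataset(range(10))
```

Here `tags` is `["B-DIS", "I-DIS", "E-DIS", "O", "O", "O", "B-SYM", "E-SYM"]`, and ten samples split 9:1 into nine training samples and one verification sample.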
Specifically, in this embodiment, the step (3-3) includes:
Step (3-3-1): construct the four word sets 'BMES' for each character in the input sequence, where 'B', 'M', 'E' and 'S' respectively represent the segmentation information of the character within a word. For a character c in the input sequence s, the construction formula of the BMES word sets is as follows:
where B, M, E and S denote the four word sets, ci is the character for which the word sets are constructed, D denotes a pre-constructed mental disease dictionary, and w is a word contained in the mental disease dictionary D. That is, B collects the words of D that begin with ci, M the words in which ci occurs in the middle, E the words that end with ci, and S the single-character word ci. In addition, when a word set is empty, it is filled with the special word "NULL".
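A minimal sketch of the word set construction in step (3-3-1) is given here, assuming the usual begin/middle/end/single reading of 'BMES'. The function name, the `max_len` bound on word length and the toy dictionary are hypothetical and not taken from the embodiment.

```python
def bmes_word_sets(sentence, dictionary, max_len=6):
    """Return, for each character position, its word sets B, M, E and S."""
    sets = [{"B": set(), "M": set(), "E": set(), "S": set()} for _ in sentence]
    n = len(sentence)
    for start in range(n):
        for end in range(start + 1, min(n, start + max_len) + 1):
            word = sentence[start:end]
            if word not in dictionary:
                continue
            if end - start == 1:
                sets[start]["S"].add(word)      # single-character word
                continue
            sets[start]["B"].add(word)          # word begins at this character
            sets[end - 1]["E"].add(word)        # word ends at this character
            for mid in range(start + 1, end - 1):
                sets[mid]["M"].add(word)        # character is inside the word
    for char_sets in sets:
        for words in char_sets.values():
            if not words:
                words.add("NULL")               # fill empty sets with "NULL"
    return sets


# Toy stand-in for the mental disease dictionary D.
dictionary = {"抑郁", "抑郁症", "患者", "症"}
word_sets = bmes_word_sets("抑郁症患者", dictionary)
```

For the first character "抑", the B set holds both "抑郁" and "抑郁症", while its M, E and S sets contain only "NULL".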
Step (3-3-2): after the BMES word sets of the character c are obtained according to step (3-3-1), convert the words in the word sets into word vectors by consulting the word embedding lookup table, with the dimensionality of the word vectors set to 50.
The word embedding lookup table is constructed by applying the skip-gram algorithm of the Word2Vec model to the mental disease corpus data obtained in step (1).
Step (3-3-3): compress the "BMES" word sets of each character c. The compression uses a weighted average algorithm: specifically, if f(w) is the frequency with which the word w appears in the static data and ew denotes the word embedding lookup table, a word set S is compressed by the following formula:
Here, f(w) is obtained from the word_frequency file generated in step (3-2-3), and F denotes the sum of f(w) over all words in the word sets belonging to the character. The static data is the mental disease corpus obtained in step (1).
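The weighted average compression of step (3-3-3) might look like the sketch below. It assumes that each word is weighted by f(w) / F with no further scaling factor, which is one plausible reading of the description; the function name and the toy 2-dimensional embeddings (the embodiment uses 50 dimensions) are hypothetical.

```python
def compress_word_sets(char_sets, embedding, frequency):
    """Frequency-weighted average of the word vectors in each of B, M, E, S."""
    order = ("B", "M", "E", "S")
    # F: summed frequency of all words in the four sets of this character.
    total = sum(frequency.get(w, 0) for k in order for w in char_sets[k])
    dim = len(next(iter(embedding.values())))
    compressed = {}
    for k in order:
        vec = [0.0] * dim
        for w in char_sets[k]:
            weight = frequency.get(w, 0) / total if total else 0.0
            for d in range(dim):
                vec[d] += weight * embedding[w][d]
        compressed[k] = vec
    return compressed


char_sets = {"B": {"抑郁", "抑郁症"}, "M": {"NULL"}, "E": {"NULL"}, "S": {"NULL"}}
frequency = {"抑郁": 1, "抑郁症": 3, "NULL": 0}
embedding = {"抑郁": [1.0, 0.0], "抑郁症": [0.0, 1.0], "NULL": [0.0, 0.0]}
compressed = compress_word_sets(char_sets, embedding, frequency)
```

With F = 4, the B set compresses to 0.25 times the vector of "抑郁" plus 0.75 times the vector of "抑郁症", so more frequent words dominate the compressed feature.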
Specifically, in this embodiment, the step (3-4) includes:
Step (3-4-1): input the sentence sequence s = (c1, c2, c3, ..., cn) into the ALBERT language model obtained in step (3-1) to obtain the word vector sequence ec = (e1, e2, e3, ..., en), where the word vector dimension is 128.
Step (3-4-2): process each character ci in the input sentence sequence s = (c1, c2, c3, ..., cn) according to step (3-3) to obtain the MWI features.
Step (3-4-3): fuse the word vector sequence ec with the MWI feature sequence:
Ec=[ec;vf(B);vf(M);vf(E);vf(S)]
wherein vf is the compression algorithm in step (3-3-3).
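The fusion is a concatenation, so with the stated dimensions each enhanced vector has 128 + 4 × 50 = 328 components. A small sketch, with hypothetical names and placeholder values:

```python
ALBERT_DIM = 128   # dimension of the ALBERT character vector (from the text)
WORD_DIM = 50      # dimension of each compressed word-set vector (from the text)


def fuse(char_vector, compressed):
    """Concatenate [ec; vf(B); vf(M); vf(E); vf(S)] for one character."""
    fused = list(char_vector)
    for key in ("B", "M", "E", "S"):
        fused.extend(compressed[key])
    return fused


# Placeholder inputs standing in for real ALBERT output and MWI features.
char_vector = [0.1] * ALBERT_DIM
compressed = {key: [0.0] * WORD_DIM for key in ("B", "M", "E", "S")}
enhanced = fuse(char_vector, compressed)
```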
Specifically, in this embodiment, the step (3-5) includes:
Step (3-5-1): create the BiGRU model and construct the context relationship.
Step (3-5-2): connect the CRF function and calculate the predicted optimal sequence labels.
The step (3-5-1) comprises the following steps:
Step (3-5-1-1): for the input character vector sequence, create a forward and a backward GRU hidden layer unit at each time step, each with 256 neurons. Here, the number of time steps equals the number of characters.
Step (3-5-1-2): each hidden layer unit creates a gated recurrent unit H, which is defined as follows:
rt = σ(Wr·[ht-1, xt] + br)
zt = σ(Wz·[ht-1, xt] + bz)
Here, rt is the reset gate and zt is the update gate. The inputs of the gated recurrent unit H are the input character vector xt at the current time step and the state ht-1 of the gated recurrent unit at the previous time step; the output is the state ht at the current time step.
The reset gate rt controls whether the candidate state depends on the previous state ht-1. Its inputs are the input character vector xt at the current time step and the state ht-1 of the gated recurrent unit at the previous time step; after activation by the sigmoid function, all values lie within the range [0, 1].
The update gate zt controls how much information the current state retains from the historical state and how much new information it accepts from the candidate state. Its inputs are the input character vector xt at the current time step and the state ht-1 of the gated recurrent unit at the previous time step; after activation by the sigmoid function, all values lie within the range [0, 1].
Wr, Wz and the candidate-state weight matrix are the weight matrices of the GRU unit, and br, bz and the candidate-state bias are its biases; all of them are trainable parameters. σ is the sigmoid activation function, and · denotes the dot product operation.
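The gated recurrent unit and the bidirectional pass can be sketched in plain Python as follows. This assumes the common GRU formulation in which the candidate state is tanh of an affine map of [rt ⊙ ht-1, xt] and the new state is (1 − zt) ⊙ ht-1 + zt ⊙ candidate; the text does not spell out these two formulas, so that convention, the parameter layout and all names are assumptions.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def affine(W, v, b):
    """Compute W·v + b for a row-major matrix W."""
    return [sum(w * x for w, x in zip(row, v)) + bias for row, bias in zip(W, b)]


def zeros(rows, cols):
    return [[0.0] * cols for _ in range(rows)]


def gru_step(h_prev, x, params):
    """One GRU step; params = (Wr, br, Wz, bz, Wc, bc)."""
    Wr, br, Wz, bz, Wc, bc = params
    concat = h_prev + x                                   # [h_{t-1}, x_t]
    r = [sigmoid(a) for a in affine(Wr, concat, br)]      # reset gate
    z = [sigmoid(a) for a in affine(Wz, concat, bz)]      # update gate
    gated = [ri * hi for ri, hi in zip(r, h_prev)] + x    # [r_t * h_{t-1}, x_t]
    cand = [math.tanh(a) for a in affine(Wc, gated, bc)]  # candidate state
    # Assumed convention: z_t weights the candidate, (1 - z_t) the old state.
    return [(1 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h_prev, cand)]


def bigru(xs, fwd_params, bwd_params, hidden):
    """Run forward and backward GRUs and concatenate their states per step."""
    h, forward = [0.0] * hidden, []
    for x in xs:
        h = gru_step(h, x, fwd_params)
        forward.append(h)
    h, backward = [0.0] * hidden, []
    for x in reversed(xs):
        h = gru_step(h, x, bwd_params)
        backward.append(h)
    backward.reverse()
    return [f + b for f, b in zip(forward, backward)]


hidden, in_dim = 2, 3   # toy sizes; the embodiment uses 256 neurons
zero_params = (zeros(hidden, hidden + in_dim), [0.0] * hidden,
               zeros(hidden, hidden + in_dim), [0.0] * hidden,
               zeros(hidden, hidden + in_dim), [0.0] * hidden)
xs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
states = bigru(xs, zero_params, zero_params, hidden)
```

Each element of `states` has 2 × hidden components, which matches the gru_dim * 2 width used when the state sequence is flattened.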
Step (3-5-1-3): construct the parameter matrices Wh and Wp and the biases bh and bp, all of which are trainable parameters.
Step (3-5-1-4): apply a reshape operation to the state sequence obtained in step (3-5-1-2), flattening it from a three-dimensional array into a two-dimensional array hs whose first dimension has size batch_size * num_steps and whose second dimension has size gru_dim * 2.
batch_size is the number of samples in the current batch.
num_steps is the number of characters in each sample.
gru_dim is the number of neurons of one GRU unit.
Step (3-5-1-5): taking the output hs of step (3-5-1-4) as input, calculate the hidden layer output h by the following formula:
h=hs*Wh+bh
Step (3-5-1-6): calculate the state output p from the hidden layer output h obtained in step (3-5-1-5), as follows:
p=h*Wp+bp
Step (3-5-1-7): apply a reshape operation to the state output p obtained in step (3-5-1-6), converting it from a two-dimensional array into a three-dimensional array whose first dimension has size batch_size, whose second dimension has size num_steps, and whose third dimension has size num_tags.
num_tags is the total number of tags predicted in the entity recognition task.
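The reshape, projection and reshape-back sequence of steps (3-5-1-4) to (3-5-1-7) can be traced with nested lists. This sketch uses toy sizes; the width of the hidden layer output h is not stated in the text and is assumed here to equal gru_dim * 2, and the function names are hypothetical.

```python
def flatten(states):
    """[batch_size][num_steps][dim] -> [batch_size * num_steps][dim]."""
    return [step for sample in states for step in sample]


def linear(rows, W, b):
    """Row-wise rows*W + b, with W of shape [in_dim][out_dim]."""
    return [[sum(x * W[i][j] for i, x in enumerate(row)) + b[j]
             for j in range(len(b))] for row in rows]


def unflatten(rows, batch_size, num_steps):
    """[batch_size * num_steps][dim] -> [batch_size][num_steps][dim]."""
    return [rows[i * num_steps:(i + 1) * num_steps] for i in range(batch_size)]


def project(states, Wh, bh, Wp, bp):
    batch_size, num_steps = len(states), len(states[0])
    hs = flatten(states)       # [batch_size * num_steps, gru_dim * 2]
    h = linear(hs, Wh, bh)     # hidden layer output: h = hs*Wh + bh
    p = linear(h, Wp, bp)      # state output:        p = h*Wp + bp
    return unflatten(p, batch_size, num_steps)  # [batch_size, num_steps, num_tags]


# Toy sizes: batch_size=2, num_steps=3, gru_dim*2=4, num_tags=5.
states = [[[1.0] * 4 for _ in range(3)] for _ in range(2)]
Wh = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
bh = [0.0] * 4
Wp = [[1.0] * 5 for _ in range(4)]
bp = [0.0] * 5
logits = project(states, Wh, bh, Wp, bp)
```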
Specifically, in this embodiment, the step (3-5-2) includes the following steps:
Step (3-5-2-1): construct and initialize the transition matrix trans.
Step (3-5-2-2): take the output of step (3-5-1-7) and the state transition matrix trans as the input of the CRF function, and calculate the loss value loss by maximum likelihood estimation, with the formula as follows:
where y denotes a label sequence, h denotes the hidden layer output of the current input sequence s, θ denotes the trainable parameters, Y(s) denotes all possible label sequences of the current input sequence s, and the weight and the bias b(y′,y) associated with a label pair (y′, y) are trainable parameters.
Step (3-5-2-3): back-propagate and update the parameters using the Adam optimizer.
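For orientation, a standard linear-chain CRF computes the loss as the log of the summed scores of all label sequences minus the score of the gold sequence, and decodes with the Viterbi algorithm. The sketch below implements that textbook formulation for a single sentence; it is an assumption that this matches the embodiment's formula, and the toy scores and function names are hypothetical. Parameter updates with Adam are not shown.

```python
import math
from itertools import product


def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))


def path_score(emissions, trans, tags):
    """Sum of emission scores and transition scores along one tag path."""
    score = emissions[0][tags[0]]
    for t in range(1, len(tags)):
        score += trans[tags[t - 1]][tags[t]] + emissions[t][tags[t]]
    return score


def log_partition(emissions, trans):
    """Forward algorithm: log of the summed exp-scores of all tag paths."""
    alpha = list(emissions[0])
    n = len(alpha)
    for t in range(1, len(emissions)):
        alpha = [logsumexp([alpha[i] + trans[i][j] for i in range(n)])
                 + emissions[t][j] for j in range(n)]
    return logsumexp(alpha)


def crf_loss(emissions, trans, gold):
    """Negative log-likelihood of the gold tag sequence."""
    return log_partition(emissions, trans) - path_score(emissions, trans, gold)


def viterbi(emissions, trans):
    """Return the highest-scoring tag sequence."""
    n = len(emissions[0])
    score, back = list(emissions[0]), []
    for t in range(1, len(emissions)):
        prev = [max(range(n), key=lambda i: score[i] + trans[i][j])
                for j in range(n)]
        score = [score[prev[j]] + trans[prev[j]][j] + emissions[t][j]
                 for j in range(n)]
        back.append(prev)
    path = [max(range(n), key=lambda j: score[j])]
    for prev in reversed(back):
        path.append(prev[path[-1]])
    return path[::-1]


# Toy example: 3 characters, 2 tags. emissions[t][j] plays the role of p.
emissions = [[1.0, 0.0], [0.0, 2.0], [1.5, 0.5]]
trans = [[0.5, -0.5], [-1.0, 0.2]]
best_path = viterbi(emissions, trans)
loss = crf_loss(emissions, trans, best_path)

# Brute-force check over all 2**3 paths.
all_paths = list(product(range(2), repeat=len(emissions)))
brute = logsumexp([path_score(emissions, trans, p) for p in all_paths])
```

The brute-force sum over all paths agrees with the forward algorithm, which is a convenient sanity check when adapting the sketch.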
Specifically, in this embodiment, the step (3-6) includes:
Step (3-6-1): load the training set and verification set data of step (3-2) and process them according to steps (3-3) and (3-4).
Step (3-6-2): input the data loaded in step (3-6-1) into the model built in step (3-5) for training.
Step (3-6-3): save the trained model.
The step (3-6-2) comprises the following steps:
Step (3-6-2-1): set the number of training epochs to 30, i.e. model training stops after the training set has been iterated over completely 30 times.
Step (3-6-2-2): set the batch_size of each iteration to 24, i.e. 24 sentence sequences are trained at a time.
Step (3-6-2-3): set the initial learning rate lr0 to 1e-3 and the decay rate decay to 0.05; the learning rate update formula is as follows:
Specifically, in this embodiment, the step (3-6-3) includes the following steps:
Step (3-6-3-1): set the global variable best_f1 to 0.0.
Step (3-6-3-2): each time the model finishes one epoch of training, input the verification set data into the current model for prediction; if the obtained f1 is greater than best_f1, set best_f1 = f1 and save the model, overwriting the original model file.
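The save-on-improvement logic of step (3-6-3) can be sketched independently of any deep learning framework. In this sketch the training, evaluation and saving routines are passed in as callbacks; those callbacks, the entity tuple format used by `f1_score` and the stub values are all hypothetical.

```python
def f1_score(gold, pred):
    """Entity-level F1 over sets of (sentence_id, start, end, type) tuples."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)


def train_with_checkpointing(train_epoch, evaluate, save, epochs=30):
    """Train for a fixed number of epochs, saving whenever F1 improves."""
    best_f1 = 0.0
    for epoch in range(epochs):
        train_epoch(epoch)
        f1 = evaluate()           # predict on the verification set
        if f1 > best_f1:
            best_f1 = f1
            save()                # overwrite the stored model file
    return best_f1


# Stubbed run: three epochs with verification F1 of 0.5, 0.4 and 0.7.
scores = iter([0.5, 0.4, 0.7])
saved = []
best = train_with_checkpointing(lambda epoch: None,
                                lambda: next(scores),
                                lambda: saved.append("checkpoint"),
                                epochs=3)
example_f1 = f1_score({(0, 0, 3, "DIS"), (0, 6, 8, "SYM")},
                      {(0, 0, 3, "DIS"), (0, 5, 8, "SYM")})
```

In the stubbed run the model is saved after the first and third epochs only, since the second epoch does not improve on best_f1.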
Specifically, in this embodiment, as shown in fig. 4, the step (4) includes:
Step (4-1): load the entity recognition model trained in step (3).
Step (4-2): process the mental disease corpus obtained in step (1) according to steps (3-2), (3-3) and (3-4), and feed it into the entity recognition model for prediction.
Step (4-3): consulting the id2tag index file, convert the entity type index sequence obtained in step (4-2) into an entity type sequence, and store each entity word in the entity word list.
The entity word list is a pre-constructed entity word list.
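Converting a predicted index sequence back into entity words could look like the following sketch. It assumes BIOES tags of the form "B-TYPE", uses a hypothetical `id2tag` map with made-up type names, and for brevity does not check that the B and E tags of one entity carry the same type.

```python
def decode_entities(chars, tag_ids, id2tag):
    """Convert predicted tag indices to (entity_word, entity_type) pairs."""
    entities, start = [], None
    for i, tag_id in enumerate(tag_ids):
        prefix, _, etype = id2tag[tag_id].partition("-")
        if prefix == "S":                       # single-character entity
            entities.append((chars[i], etype))
            start = None
        elif prefix == "B":                     # entity begins here
            start = i
        elif prefix == "E" and start is not None:
            entities.append(("".join(chars[start:i + 1]), etype))
            start = None
        elif prefix == "O":                     # outside any entity
            start = None
    return entities


# Hypothetical tag index and prediction for one sentence.
id2tag = {0: "B-DIS", 1: "I-DIS", 2: "E-DIS", 3: "O", 4: "B-SYM", 5: "E-SYM"}
chars = "抑郁症患者常失眠"
entities = decode_entities(chars, [0, 1, 2, 3, 3, 3, 4, 5], id2tag)
entity_words = [word for word, _ in entities]
```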
Step (4-4): using the entity word list, segment the mental disease corpus with a word segmentation tool, and remove stop words by reference to the stop word list.
The word segmentation tool is the jieba word segmentation tool, which can be installed with pip install jieba in a Python environment.
The stop word list is pre-constructed.
Step (4-5): consult the relation indication word list to complete the extraction of entity relation triples.
Step (4-6): consult the attribute indication word list to complete the extraction of entity attribute triples.
Step (4-7): store the triples extracted in steps (4-5) and (4-6) in csv files, keeping different relation types and different attribute types in separate storage.
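The text does not give the matching templates themselves, so the following sketch assumes one simple template: an indication word lying between two entity words yields a triple. The input is taken to be already segmented (as jieba would produce), and the indicator word, relation name, stop word and helper names are all hypothetical.

```python
import csv
import io


def match_triples(tokens, entity_words, indicators, stop_words=()):
    """Emit (head, relation, tail) when an indication word sits between entities."""
    tokens = [t for t in tokens if t not in stop_words]
    triples = []
    for i, tok in enumerate(tokens):
        relation = indicators.get(tok)
        if relation is None:
            continue
        # Nearest entity before and after the indication word.
        head = next((t for t in reversed(tokens[:i]) if t in entity_words), None)
        tail = next((t for t in tokens[i + 1:] if t in entity_words), None)
        if head and tail:
            triples.append((head, relation, tail))
    return triples


def to_csv(triples):
    """Serialize triples as csv text with a header row."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["head", "relation", "tail"])
    writer.writerows(triples)
    return buffer.getvalue()


# Pre-segmented sentence; in the embodiment jieba performs the segmentation.
tokens = ["抑郁症", "的", "常见", "症状", "包括", "失眠"]
entity_words = {"抑郁症", "失眠"}
indicators = {"症状": "has_symptom"}      # hypothetical relation indication word
triples = match_triples(tokens, entity_words, indicators, stop_words={"的"})
csv_text = to_csv(triples)
```

Writing one csv file per relation type or attribute type then amounts to grouping the triples by their middle element before calling `to_csv`.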
Specifically, in this embodiment, step (5) includes:
Step (5-1): load all csv files obtained in step (4).
Step (5-2): create the entities by invoking the following command:
CREATE (<node-name>:<label-name>)
where <node-name> is the name of the node to be created and <label-name> is the name of the node label.
Step (5-3): create the entity attributes by invoking the following command:
CREATE (
<node-name>:<label-name>
{
<property1-name>:<property1-Value>,
<property2-name>:<property2-Value>,
...
<propertyn-name>:<propertyn-Value>
}
)
where <property1-name> ... <propertyn-name> are the names of the properties and <property1-Value> ... <propertyn-Value> are the values of the properties.
Step (5-4): create the entity relationships by invoking the following command:
CREATE (<node1-name>:<label1-name>)-[<relationship-name>:<relationship-label-name>]->(<node2-name>:<label2-name>)
where <node1-name> is the name of the From node, <label1-name> is the label name of the From node, <node2-name> is the name of the To node, <label2-name> is the label name of the To node, <relationship-name> is the name of the relationship, and <relationship-label-name> is the label name of the relationship.
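As a rough illustration of how the loaded triples could be turned into the commands of steps (5-2) to (5-4), the sketch below builds Cypher statement strings in Python. The labels (Disease, Symptom), the name property, and the helper functions are hypothetical. For the relationship, the sketch uses a MATCH ... CREATE variant instead of the all-in-one CREATE template, on the assumption that the nodes already exist and should not be duplicated.

```python
import json


def literal(value):
    """Quote a Python value as a double-quoted string/number literal (JSON-style)."""
    return json.dumps(value, ensure_ascii=False)


def props(mapping):
    """Render a property map such as {name: "x", alias: "y"}."""
    return "{" + ", ".join(f"{k}: {literal(v)}" for k, v in mapping.items()) + "}"


def create_node(var, label, properties):
    """Steps (5-2)/(5-3): create a labelled node together with its properties."""
    return f"CREATE ({var}:{label} {props(properties)})"


def create_relationship(head, rel_label, tail):
    """Step (5-4) variant: match two existing nodes by name, then link them."""
    (h_label, h_name), (t_label, t_name) = head, tail
    a = props({"name": h_name})
    b = props({"name": t_name})
    return f"MATCH (a:{h_label} {a}), (b:{t_label} {b}) CREATE (a)-[r:{rel_label}]->(b)"


node_stmt = create_node("n", "Disease", {"name": "抑郁症", "alias": "忧郁症"})
rel_stmt = create_relationship(("Disease", "抑郁症"), "HAS_SYMPTOM", ("Symptom", "失眠"))
```

When the statements are sent through a database driver, passing the values as query parameters rather than formatting them into the string is generally the safer choice.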
Similarly, in other embodiments, the specific programming statements described above may be modified; such variants still fall within the scope of the invention as long as the command logic is consistent with the invention.
As will be appreciated by one skilled in the art, embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.
Although the embodiments of the present invention have been described with reference to the accompanying drawings, they are not intended to limit the scope of the present invention; those skilled in the art should understand that various modifications and variations can be made without inventive effort on the basis of the technical solution of the present invention.
Claims (9)
1. A method for constructing a mental disease knowledge map, characterized in that the method comprises the following steps:
obtaining existing information related to mental diseases, and establishing a mental disease corpus;
determining entity, relation and attribute indication word lists according to the mental disease corpus;
fine-tuning with a language model on the data in the mental disease corpus, constructing a mental disease named entity recognition data set, extracting feature values of the named entity recognition data set, fusing the fine-tuned data with the extracted features, and training a pre-constructed deep learning model with the fused data;
wherein the specific process of fine-tuning with a language model on the data in the mental disease corpus, constructing the mental disease named entity recognition data set, and extracting the feature values of the named entity recognition data set comprises: acquiring a pre-trained ALBERT language model, and fine-tuning it on the obtained corpus; constructing a mental disease named entity recognition data set, and preprocessing the data set; constructing Multi-word Information (MWI) features for the obtained data set; fusing the word vectors obtained in the preceding steps with the obtained MWI features to obtain enhanced word vectors; establishing a BiGRU-CRF deep learning model; inputting the obtained enhanced word vector sequence into the model for training, and saving the trained model;
wherein the specific process of fusing the fine-tuned data with the extracted features comprises: for an input character vector sequence, creating a forward GRU hidden layer unit and a backward GRU hidden layer unit at each time step, creating a gated recurrent unit for each hidden layer unit, determining the state sequence, and setting the corresponding parameters; reshaping the obtained state sequence by flattening the three-dimensional array into a two-dimensional array, calculating the hidden layer output, and then calculating the state output; reshaping the calculated state output by converting the two-dimensional array back into a three-dimensional array; constructing and initializing a transition matrix, taking the three-dimensional array obtained in the preceding conversion step and the state transition matrix as the input of a CRF function, and calculating the loss value by maximum likelihood estimation; and performing back propagation, and calculating and predicting the optimal label sequence;
and predicting on the mental disease corpus to be processed with the trained deep learning model, converting the predicted entity category index sequence into an entity type sequence, storing each entity word in an entity word list, and extracting entity relationship data and attribute data according to relationship type and attribute type and storing them separately.
2. The method for constructing a mental disease knowledge map according to claim 1, characterized in that the specific process of obtaining existing information related to mental diseases and establishing the mental disease corpus comprises:
setting a seed word set of mental disease terms according to books related to mental diseases;
traversing and searching medical websites for related content according to the seed word set, recording the urls of the related webpages, and saving them as a url set;
crawling the webpage content of the url set with a web crawler;
extracting content from the crawled webpages with regular expressions and an xpath parser, storing unstructured data in a database, and, for semi-structured data, directly extracting and storing triples, with different relationship types and different attribute types stored separately;
and labeling at least part of the processed corpus.
3. The method for constructing a mental disease knowledge map according to claim 1, characterized in that the specific process of fine-tuning with the language model on the data in the mental disease corpus comprises:
executing a git command to download the albert_tiny_google_zh model open-sourced by Google;
processing the mental disease corpus by converting the txt files into tfrecords files of the required format, and pre-training on the resulting tfrecords files;
executing the modeling.py function, loading the pre-trained and fine-tuned ALBERT language model, and using the language model to pre-train and fine-tune on the obtained corpus.
4. The method for constructing a mental disease knowledge map according to claim 1, characterized in that the specific process of constructing the mental disease named entity recognition data set comprises:
labeling each character of the annotated data;
splitting the data into a training set and a validation set in a given proportion;
constructing the word index files word2id and id2word from the resulting training set and validation set; constructing a word-frequency dictionary file word_frequency for the mental disease corpus;
and constructing the tag index files tag2id and id2tag for the training set, the test set and the validation set.
5. The method for constructing a mental disease knowledge map according to claim 1, characterized in that the specific process of extracting the feature values of the named entity recognition data set comprises:
constructing four 'BMES' word sets for each character in the input sequence, where 'B', 'M', 'E' and 'S' each represent the segmentation position of the character within a word;
for the 'BMES' word sets of each character, looking up the word embedding table to convert the words in each set into word vectors, and setting the dimensionality of the word vectors;
and compressing the 'BMES' word sets of each character with a weighted average algorithm.
6. The method for constructing a mental disease knowledge map according to claim 1, characterized in that the specific process of storing each entity word in the entity word list, and extracting and separately storing entity relationship data and attribute data according to relationship type and attribute type, comprises:
converting the obtained entity category index sequence into an entity type sequence with reference to the id2tag index file, and storing each entity word in the entity word list;
segmenting the mental disease corpus with a word segmentation tool with reference to the entity word list, and removing stop words with reference to the stop-word list;
extracting the entity relationship triples with reference to the relation indication word list;
extracting the entity attribute triples with reference to the attribute indication word list;
storing the extracted entity relationship triples and entity attribute triples in csv files, with different relationship types and different attribute types stored separately;
and creating the entities, entity relationships and entity attributes from the stored files to complete knowledge storage.
7. A system for constructing a mental disease knowledge map, characterized in that the system comprises:
a mental disease corpus building module, configured to obtain existing information related to mental diseases and establish a mental disease corpus;
an indication word list building module, configured to determine entity, relation and attribute indication word lists according to the mental disease corpus;
a prediction model building and training module, configured to fine-tune with a language model on the data in the mental disease corpus, construct a mental disease named entity recognition data set, extract feature values of the named entity recognition data set, fuse the fine-tuned data with the extracted features, and predict with a pre-trained deep learning model; wherein the specific process of fine-tuning with a language model on the data in the mental disease corpus, constructing the mental disease named entity recognition data set, and extracting the feature values of the named entity recognition data set comprises: acquiring a pre-trained ALBERT language model, and fine-tuning it on the obtained corpus; constructing a mental disease named entity recognition data set, and preprocessing the data set; constructing Multi-word Information (MWI) features for the obtained data set; fusing the word vectors obtained in the preceding steps with the obtained MWI features to obtain enhanced word vectors; establishing a BiGRU-CRF deep learning model; inputting the obtained enhanced word vector sequence into the model for training, and saving the trained model; wherein the specific process of fusing the fine-tuned data with the extracted features comprises: for an input character vector sequence, creating a forward GRU hidden layer unit and a backward GRU hidden layer unit at each time step, creating a gated recurrent unit for each hidden layer unit, determining the state sequence, and setting the corresponding parameters; reshaping the obtained state sequence by flattening the three-dimensional array into a two-dimensional array, calculating the hidden layer output, and then calculating the state output;
and a knowledge storage construction module, configured to convert the predicted entity category index sequence into an entity type sequence, store each entity word in an entity word list, and extract entity relationship data and attribute data according to relationship type and attribute type and store them separately.
8. An electronic device, characterized by comprising a memory, a processor, and computer instructions stored on the memory and executable on the processor, wherein the computer instructions, when executed by the processor, perform the steps of the method according to any one of claims 1-6.
9. A computer-readable storage medium, characterized in that it stores computer instructions which, when executed by a processor, perform the steps of the method according to any one of claims 1-6.
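The 'BMES' word-set feature recited in claim 5 can be sketched as follows. The lexicon, the two-dimensional toy embeddings, and the use of word frequency as the averaging weight (normalized within each set) are assumptions for illustration; the claim only states that a weighted average algorithm is used.

```python
def bmes_word_sets(sentence, lexicon, max_len=4):
    """For each character, collect the lexicon words in which it is B, M, E or S."""
    sets = [{"B": set(), "M": set(), "E": set(), "S": set()} for _ in sentence]
    n = len(sentence)
    for start in range(n):
        for end in range(start + 1, min(n, start + max_len) + 1):
            word = sentence[start:end]
            if word not in lexicon:
                continue
            if len(word) == 1:
                sets[start]["S"].add(word)  # single-character word
                continue
            sets[start]["B"].add(word)      # character begins the word
            sets[end - 1]["E"].add(word)    # character ends the word
            for mid in range(start + 1, end - 1):
                sets[mid]["M"].add(word)    # character is inside the word
    return sets


def compress(word_set, embeddings, frequency, dim):
    """Frequency-weighted average of the word vectors in one set (zeros if empty)."""
    total = sum(frequency[w] for w in word_set)
    if total == 0:
        return [0.0] * dim
    return [sum(frequency[w] * embeddings[w][d] for w in word_set) / total
            for d in range(dim)]


# Hypothetical lexicon, frequencies and embeddings.
lexicon = {"抑郁", "抑郁症", "症", "患者"}
frequency = {"抑郁": 1, "抑郁症": 3, "症": 2, "患者": 5}
embeddings = {"抑郁": [1.0, 0.0], "抑郁症": [0.0, 1.0], "症": [0.5, 0.5], "患者": [1.0, 1.0]}

sets = bmes_word_sets("抑郁症患者", lexicon)
first_char_b = compress(sets[0]["B"], embeddings, frequency, dim=2)
# Concatenating the four compressed vectors gives the feature for one character.
first_char_feature = [x for key in "BMES" for x in compress(sets[0][key], embeddings, frequency, 2)]
```

In this example the first character begins both "抑郁" and "抑郁症", so its B set holds two words and the more frequent one dominates the averaged vector.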
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110512846.3A CN113128233B (en) | 2021-05-11 | 2021-05-11 | Construction method and system of mental disease knowledge map |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113128233A (en) | 2021-07-16 |
CN113128233B (en) | 2022-07-19 |
Family
ID=76781679
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110512846.3A (Active) CN113128233B (en) | 2021-05-11 | 2021-05-11 | Construction method and system of mental disease knowledge map |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113128233B (en) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN113722501B (en) * | 2021-08-06 | 2023-09-22 | 深圳清华大学研究院 | Knowledge graph construction method, device and storage medium based on deep learning |
CN114141379A (en) * | 2021-08-12 | 2022-03-04 | 北京好欣晴移动医疗科技有限公司 | Sleep disorder attribution analysis method, device and system based on knowledge graph |
CN114356990B (en) * | 2021-12-30 | 2024-10-01 | 中国人民解放军海军工程大学 | Base named entity recognition system and method based on transfer learning |
CN114504298B (en) * | 2022-01-21 | 2024-02-13 | 南京航空航天大学 | Physiological characteristic discriminating method and system based on multisource health perception data fusion |
CN118585631A (en) * | 2024-08-02 | 2024-09-03 | 北京双高国际人力资本集团有限公司 | Intelligent psychological test question recommending method and system based on knowledge graph |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105894088A (en) * | 2016-03-25 | 2016-08-24 | 苏州赫博特医疗信息科技有限公司 | Medical information extraction system and method based on depth learning and distributed semantic features |
CN110334211A (en) * | 2019-06-14 | 2019-10-15 | 电子科技大学 | A kind of Chinese medicine diagnosis and treatment knowledge mapping method for auto constructing based on deep learning |
CN112002411A (en) * | 2020-08-20 | 2020-11-27 | 杭州电子科技大学 | Cardiovascular and cerebrovascular disease knowledge map question-answering method based on electronic medical record |
CN112542223A (en) * | 2020-12-21 | 2021-03-23 | 西南科技大学 | Semi-supervised learning method for constructing medical knowledge graph from Chinese electronic medical record |
CN112635071A (en) * | 2020-12-25 | 2021-04-09 | 中国矿业大学 | Diabetes knowledge map construction method integrating traditional Chinese and western medicine knowledge |
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20170097984A1 (en) * | 2015-10-05 | 2017-04-06 | Yahoo! Inc. | Method and system for generating a knowledge representation |
CN111834014A (en) * | 2020-07-17 | 2020-10-27 | 北京工业大学 | Medical field named entity identification method and system |
CN112417100A (en) * | 2020-11-20 | 2021-02-26 | 大连民族大学 | Knowledge graph in Liaodai historical culture field and construction method of intelligent question-answering system thereof |
Non-Patent Citations (1)
Title |
---|
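Construction of a knowledge graph for rational clinical drug use based on natural language processing (基于自然语言处理的临床合理用药知识图谱构建); Zhang Xiaoliang et al.; Chinese Journal of Medical Library and Information Science (《中华医学图书情报杂志》); 2019-09-15 (No. 09); full text *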
Also Published As
Publication number | Publication date |
---|---|
CN113128233A (en) | 2021-07-16 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||