CN109241273B - Method for extracting minority subject data in new media environment - Google Patents
Method for extracting minority subject data in new media environment
- Publication number: CN109241273B (application CN201810969312.1A)
- Authority: CN (China)
- Prior art keywords: data, words, minority, domain, label
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a method for acquiring data from a new media platform and extracting minority topic data. Because new media data are massive, unstructured and multi-topic, the method applies an LDA (Latent Dirichlet Allocation) model to the preprocessed new media data to perform feature extraction, topic analysis and latent topic mining. It then constructs a knowledge graph (KG) from knowledge of the minority domain and uses this domain KG to guide the extraction of minority topic data. The parameters of the LDA model and of the KG-guided extraction are set according to the scale of the data, which optimizes the algorithm and achieves accurate, efficient and scalable extraction of new media data.
Description
Technical Field
The invention relates to a method for acquiring data from a new media platform and extracting minority topic data. It performs latent topic analysis and feature extraction on new media data with Latent Dirichlet Allocation (LDA), and extracts minority topic data with a domain knowledge graph (KG). The invention belongs to the field of data processing and knowledge discovery.
Background
New media is a form of media defined relative to traditional media such as newspapers, radio and television. It includes network media, mobile phone media and digital television, and is characterized by interactivity, immediacy, massive scale, sharing, multimedia, hypertext, personalization and socialization. As new media play an increasingly important role in information dissemination, the processing and analysis of network media data have attracted close attention from scholars in China and abroad. Data can be divided by main content, and data describing the same type of content are called data of the same topic, for example tourism, entertainment, or film and television. Obtaining data of a specific topic from massive, heterogeneous network media data, and performing topic analysis, content screening and information filtering for different domains, is an important research subject in new media data processing and knowledge discovery. It is also an important basis for decision support, influence prediction, knowledge base construction, public opinion analysis and similar work.
In addition, as China's cultural strategy advances, research on acquiring, analyzing and using information about ethnic minorities keeps growing. Massive new media data contain a large amount of valuable minority topic data, such as shared minority travel information, minority cultural exchange, minority hot-spot issues and minority news events on microblog platforms, which can enrich the data sources for research and development on minority topics. Studying the extraction of minority topic data in the new media environment is therefore significant: it allows massive new media data to be processed and analyzed for practical problems in minority politics, economy and culture, supports data-driven knowledge discovery, and serves data-intensive research and applications such as public opinion monitoring and policy formulation for minority regions, the dissemination of minority culture, and heritage protection.
Data extraction is the process of extracting target data from source data. Many research results on data extraction exist, and the techniques differ by data type and application. For example, liu jin et al. (master thesis of university of science and technology, 2016) studied the extraction of Chinese person social relations from news data based on unsupervised learning; wu self tiger et al. (master thesis of university of liaoning, 2017) proposed a template-based method for extracting network public opinion data; zhang et al. (master thesis of liberation military information engineering, 2017) proposed a relation extraction model for network text data based on a deep fusion convolutional neural network; yao xiao peng et al. (computer application and software, 2018) proposed a method for extracting and mining deep-web data in a global mode; yuanhua et al. (management engineering report, 2018) proposed a method for extracting hot topics and their feature words from massive data, analyzing the relation between hot topic words and local feature words in massive user-generated content; and Hades et al. (patent CN107436902A, 2017) proposed a data extraction method and system based on massive data that extracts the corresponding target data from static and dynamic data sources respectively. These methods complete data extraction well for their own data sources and problems, but they are not general techniques and do not carry over to extracting minority topic data from massive, unstructured, multi-topic new media platforms.
Therefore, in view of the massive, unstructured and multi-topic nature of new media, the invention mines the latent topics in network new media data with the LDA model to analyze multiple topics, and then uses the feature word sequences of the data together with the entities and inter-entity relations described by the knowledge graph to extract minority topic data more accurately and comprehensively.
The LDA model is a hierarchical Bayesian model that is widely applied in data extraction, text mining, social networks, natural language processing and related fields. For example, liu shao et al. (computer science and report, 2015) used LDA for qualitative and descriptive topic extraction from massive movie review data; liu ice jade et al. (software science and report, 2017) studied massive e-commerce review data and extracted product features and sentiment words with semantically constrained LDA; and zhao science et al. (patent CN107885754A, 2018) proposed a method and device for extracting credit variables from transaction data based on an LDA model. These results show that LDA performs well on massive data in topic analysis, feature extraction, text mining and similar problems. On this basis, the invention further exploits the strengths of LDA in analyzing massive, unstructured, multi-topic new media data.
A KG is a semantic network that expresses entities, concepts and the relations among them, and it is widely used in personalized recommendation, intelligent search, knowledge discovery and related fields. For example, chendhan et al. (computer research and development, 2017) proposed a link prediction model for temporal KGs in the clinical domain; heijun et al. (computer science report, 2016) proposed a relation extraction method for domain knowledge evolution in Chinese Wikipedia; and rekey et al. (patent CN108073711A, 2018) proposed a KG-based relation extraction method that mines latent semantic information from the path and attribute information of the KG. Whether in medical research or in relation extraction, these results show the value of the rich prior knowledge in a KG semantic network. However, the choice of KG for a given application also affects the efficiency and effectiveness of the research. For each application scenario and research domain, a corresponding KG must be constructed that covers the knowledge and semantic relations of the domain as completely as possible, which improves the accuracy and efficiency of data extraction.
However, because data sources on minority topics are limited, the knowledge is obscure and cultural differences exist, such research is relatively difficult. Interdisciplinary research has become necessary for many current topics, and extracting valuable data from the large volume of new media data has become the foundation of the related research.
Therefore, the invention addresses the extraction of minority topic data from new media. It is based on large-scale data from new media platforms and on minority domain knowledge, and its goal is to extract minority topic data from massive, unstructured, multi-topic new media data. It uses the LDA model to mine latent multi-topic information from unstructured data, perform topic analysis and extract data features, and it uses the rich semantic relations of the domain KG to handle the highly specialized terms, obscure word origins and ambiguous words encountered when extracting minority topic data from massive new media data. In summary, the invention provides a method for extracting minority topic data in the new media environment, lays a technical foundation for large-scale new media data processing, analysis, prediction and decision-making, and offers a reference for extracting new media data in other specific domains.
Disclosure of Invention
To overcome the efficiency bottleneck caused by obscure word origins, highly specialized terms, synonymy and similar issues in the minority domain, the invention provides a method, based on an LDA model and a KG, for acquiring data from a new media platform and extracting minority topic data. Given that new media data are massive, unstructured and multi-topic, the method extracts domain-specific new media data accurately, efficiently and scalably.
The method comprises three steps. The first step is data preprocessing: the required new media data are obtained and segmented with a word segmentation tool, domain vocabulary of the minority domain under study is added so that domain terms are segmented correctly, and customized stop words are added to simplify the preprocessing result. The second step is topic analysis and feature extraction: the preprocessed data are processed iteratively with the LDA model to mine the latent topics, yielding a topic vector for each piece of data and a high-frequency word vector for each topic; the high-frequency word vectors of the topics to which a piece of data belongs are then matched against its content to obtain its feature word sequence. The third step is KG-based extraction of minority topic data: minority domain knowledge is built into a domain KG whose rich semantic relations serve as prior knowledge, and the feature word sequences from the second step are matched against it to select the minority topic data. A second KG built from domain-irrelevant noise data provides reverse filtering and improves extraction accuracy.
The method comprises the following steps:
S1: Data preprocessing
S1.1: Obtain $M$ pieces of media data $I=\{I_1, I_2, \dots, I_M\}$ from social networks or news web pages, where $I_i$ denotes the $i$-th piece of data, $0 \le i \le M$. Each $I_i$ is represented by a triple $(id, T_i, A_i)$, where $id$ is the unique identifier of the data, $T_i$ is the text content of $I_i$, and $A_i=\{A_{i,u}, A_{i,p}, A_{i,l}, A_{i,v}, A_{i,f}, A_{i,q}, A_{i,c}, A_{i,r}\}$ is the additional information: the publisher $A_{i,u}$, publication time $A_{i,p}$, publication location $A_{i,l}$, publication source $A_{i,v}$, number of forwards $A_{i,f}$, number of likes $A_{i,q}$, number of comments $A_{i,c}$ and read time $A_{i,r}$ of the data;
S1.2: The minority domain knowledge $Z=\langle term, attributes, addition\rangle$ is given by domain experts, where $term$ is the entity name, $attributes$ is the attribute of the entity, and $addition$ is an additional description of the entry;
S1.3: Obtain the stop-word set $Stop\_words$;
S1.4: Use a word segmentation tool to segment the text content $T_i$ of the media data. Before segmentation, add $Stop\_words$ to the tool's default stop-word set and add the set of minority domain entity names $term$ to the tool's default vocabulary. The segmentation result of $T_i$ is stored separately at the end of the data $I_i$ and is denoted $Seg\_T_i$;
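Step S1 can be sketched in a few lines. The example below is a minimal sketch, not the tool used by the invention: `MediaItem`, `segment` and `max_len` are hypothetical names, forward maximum matching stands in for a real word segmentation tool, and the sample text "西藏的拉萨" ("Lhasa of Tibet") is invented for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class MediaItem:
    """One piece of media data I_i = (id, T_i, A_i) plus its segmentation result."""
    id: str
    text: str                                   # T_i, the text content
    extra: dict = field(default_factory=dict)   # A_i, the additional information
    seg: list = field(default_factory=list)     # Seg_T_i, filled in by segment()


def segment(text, vocabulary, stop_words, max_len=4):
    """Forward maximum matching: take the longest vocabulary word at each position."""
    tokens, pos = [], 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            # Fall back to a single character when no vocabulary word matches.
            if size == 1 or piece in vocabulary:
                break
        pos += size
        if piece.strip() and piece not in stop_words:
            tokens.append(piece)
    return tokens


# Domain terms go into the vocabulary, customized stop words into the stop set.
item = MediaItem(id="a0", text="西藏的拉萨")
item.seg = segment(item.text, vocabulary={"西藏", "拉萨"}, stop_words={"的"})
```

Adding the domain terms to the vocabulary is what keeps "西藏" (Tibet) and "拉萨" (Lhasa) from being split into single characters.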
S2: Topic analysis and feature extraction
S2.1: Define a dictionary $W=\{w_1, w_2, \dots, w_S\}$ that stores all the words contained in the data, where $S$ is the total number of words in the dictionary and $w_i \ne w_j$ ($1 \le i, j \le S$, $i \ne j$);
S2.2: Define the topic vector of data $I_i$ as $\Lambda_i=(\lambda_{1,i}, \lambda_{2,i}, \dots, \lambda_{K,i})$, where $\lambda_{k,i}$ is the probability that a word in $I_i$ belongs to topic $z_k$, $0 \le \lambda_{k,i} \le 1$. Topic $z_k$ is represented by a high-frequency word vector $\Delta_k=((w_1, \delta_{1,k}), (w_2, \delta_{2,k}), \dots, (w_{S_k}, \delta_{S_k,k}))$, where $S_k$ is the total number of words of $z_k$ and $\delta_{t,k}$ is the probability of word $w_t$ in $z_k$, $0 \le \delta_{t,k} \le 1$. $\delta_{t,k}$ and $\lambda_{k,i}$ are determined by formulas (1) and (2), respectively,
which use the total number of occurrences of word $w_t$ in topic $z_k$, the number of words of $I_i$ that belong to topic $z_k$, the total number of words $S$ in the dictionary, and the total number of topics $K$;
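Formulas (1) and (2) are not reproduced in this text. A plausible reconstruction, assuming the usual smoothed count estimates of LDA, is sketched below. The count symbols are notation introduced here and not taken from the source: $n_k^{(t)}$ is the number of occurrences of word $w_t$ assigned to topic $z_k$, and $n_i^{(k)}$ is the number of words of $I_i$ assigned to topic $z_k$.

```latex
\delta_{t,k} = \frac{n_k^{(t)} + \beta}{\sum_{t'=1}^{S} n_k^{(t')} + S\beta} \qquad (1)

\lambda_{k,i} = \frac{n_i^{(k)} + \alpha}{\sum_{k'=1}^{K} n_i^{(k')} + K\alpha} \qquad (2)
```

With $\beta = 0.01$ and $S = 21$, form (1) agrees with the word probabilities printed in the embodiment: a topic made of four words that each occur once gives $(1 + 0.01)/(4 + 21 \times 0.01) \approx 0.2399$. Form (2) is the analogous estimate for documents and is the less certain of the two.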
S2.3: Sample topics and words;
S2.3.1: Given the number of iterations $N_{iter}$, $N_{iter} \ge 1$, the total number of topics $K$, $K \ge 1$, and the parameters $\alpha$, $\beta$, $\kappa$, with $0 < \alpha, \beta < 1$ and $\kappa \ge 1$;
S2.3.2: For each topic $z_k$, sample the word probability distribution of the topic, $\varphi_k \sim Dir(\beta)$;
S2.3.3: For data $I_i$, sample the topic probability distribution of the data, $\theta_i \sim Dir(\alpha)$; for each word of $Seg\_T_i$, sample the topic of the word and then the word of that topic; count the total number of words of topic $z_k$ and the number of words of $I_i$ that belong to topic $z_k$;
S2.3.4: Repeat S2.3.3 for $N_{iter}$ iterations until the topic $z_{i,j}$ of every word $w_{i,j}$ converges, that is, until the topic to which each word belongs no longer changes;
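The text does not spell out the sampler behind S2.3. The sketch below uses collapsed Gibbs sampling, a common way to fit LDA that integrates out $\varphi_k$ and $\theta_i$ and resamples only the topic of each word; it is a stand-in under that assumption, not necessarily the sampler of the invention. The function name `gibbs_lda`, its return layout and the two toy documents are hypothetical.

```python
import random


def gibbs_lda(docs, K, alpha, beta, n_iter, seed=0):
    """Collapsed Gibbs sampling for LDA over tokenized documents."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    S = len(vocab)
    n_kt = [dict.fromkeys(vocab, 0) for _ in range(K)]  # count of word w in topic k
    n_k = [0] * K                                       # total words in topic k
    n_ik = [[0] * K for _ in docs]                      # words of document i in topic k

    def update(i, w, k, step):
        n_kt[k][w] += step
        n_k[k] += step
        n_ik[i][k] += step

    # Random initial topic for every word occurrence.
    z = []
    for i, doc in enumerate(docs):
        z.append([rng.randrange(K) for _ in doc])
        for w, k in zip(doc, z[i]):
            update(i, w, k, +1)

    for _ in range(n_iter):
        for i, doc in enumerate(docs):
            for j, w in enumerate(doc):
                update(i, w, z[i][j], -1)   # remove the current assignment
                weights = [
                    (n_ik[i][k] + alpha) * (n_kt[k][w] + beta) / (n_k[k] + S * beta)
                    for k in range(K)
                ]
                z[i][j] = rng.choices(range(K), weights=weights)[0]
                update(i, w, z[i][j], +1)   # record the new assignment
    return z, n_kt, n_k, n_ik


docs = [["Tibet", "Lhasa", "travel"], ["school", "Hada", "school"]]
z, n_kt, n_k, n_ik = gibbs_lda(docs, K=2, alpha=0.5, beta=0.01, n_iter=10)
```

The returned counts are exactly the statistics that step S2.3.3 asks for: words per topic and words of each document per topic.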
S2.4: Obtain the high-frequency word vector of each topic $z_k$ and the topic vector of each piece of data $I_i$;
S2.4.1: Read the words $w_{i,j}$ of each piece of data $I_i$ and their topics $z_{i,j}$; count the total number of words $w_{i,j}$ whose topic is $z_{i,j}=z_k$, and the number of words $w_{i,j}$ in $I_i$ whose topic is $z_{i,j}=z_k$;
S2.4.2: Compute the probability $\delta_{t,k}$ of each word $w_t$ in each topic $z_k$ according to formula (1), and sort by $\delta_{t,k}$ in descending order to obtain the high-frequency word vector of topic $z_k$, $\Delta_k=((w_1, \delta_{1,k}), (w_2, \delta_{2,k}), \dots, (w_{S_k}, \delta_{S_k,k}))$, $0 \le k \le K$;
S2.4.3: Compute the probability $\lambda_{k,i}$ that a word in each piece of data $I_i$ belongs to topic $z_k$ according to formula (2), and arrange by $\lambda_{k,i}$ in descending order to obtain the topic vector of $I_i$, $\Lambda_i=(\lambda_{1,i}, \lambda_{2,i}, \dots, \lambda_{K,i})$;
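Step S2.4 turns counts into the two vectors. The sketch below assumes the standard smoothed estimates for formulas (1) and (2), which are not reproduced in this text; `word_probabilities` and `topic_vector` are hypothetical names. The counts `[1, 4, 0]` and the smoothing value 0.2 are assumptions chosen because they reproduce the topic vector printed for a1 in the embodiment, whereas the text states $\alpha = 0.5$.

```python
def word_probabilities(topic_word_counts, beta, S):
    """delta_{t,k} for one topic, sorted in descending order (assumed form of formula (1))."""
    total = sum(topic_word_counts.values())
    probs = {w: (n + beta) / (total + S * beta) for w, n in topic_word_counts.items()}
    return sorted(probs.items(), key=lambda item: item[1], reverse=True)


def topic_vector(doc_topic_counts, smoothing):
    """lambda_{k,i} for one document (assumed form of formula (2))."""
    total = sum(doc_topic_counts)
    K = len(doc_topic_counts)
    return [(n + smoothing) / (total + K * smoothing) for n in doc_topic_counts]


# Third topic of the embodiment: four words, each assumed to occur once.
delta_3 = word_probabilities(
    {"fund": 1, "donation": 1, "aid construction": 1, "Gutianle": 1}, beta=0.01, S=21
)
# a1: one word in the first topic, four in the second, none in the third (assumed counts).
lambda_1 = topic_vector([1, 4, 0], smoothing=0.2)
```

Rounded to four decimals, `delta_3` gives 0.2399 for every word and `lambda_1` gives (0.2143, 0.75, 0.0357), matching the values in the embodiment.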
S2.5: Obtain the feature word sequence of the data;
S2.5.1: Read the topic vector $\Lambda_i$ of data $I_i$, sort by $\lambda_{k,i}$ in descending order, and take the top-$\kappa$ topics;
S2.5.2: Match the words of $Seg\_T_i$ against the words in the high-frequency word vectors $\Delta_k$ of these top-$\kappa$ topics. The words that appear in both are recorded as $d_i=\langle w_{i,1}, w_{i,2}, \dots, w_{i,m_i}\rangle$, the feature word sequence of data $I_i$;
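Step S2.5 amounts to a top-$\kappa$ selection followed by a membership filter. The following is a minimal sketch; `feature_words` is a hypothetical name, the word list for a1 is inferred from the dictionary order in the embodiment, and the `"residence"` entry of the second topic is assumed because the embodiment truncates that vector with an ellipsis.

```python
def feature_words(seg, topic_vector, high_freq, kappa=1):
    """Keep the words of seg that occur in the word vectors of the top-kappa topics."""
    top = sorted(range(len(topic_vector)), key=lambda k: topic_vector[k], reverse=True)[:kappa]
    allowed = {word for k in top for word, _ in high_freq[k]}
    return [w for w in seg if w in allowed]


high_freq = [
    [("Lhasa", 0.2118), ("vacation", 0.1414), ("Tibet", 0.1414), ("travel", 0.0711)],
    [("school", 0.2182), ("Tibetan", 0.2182), ("peace", 0.1097), ("Hada", 0.1097),
     ("liberation", 0.1097), ("residence", 0.1097)],  # last entry assumed
    [("fund", 0.2399), ("donation", 0.2399), ("aid construction", 0.2399), ("Gutianle", 0.2399)],
]
seg_a1 = ["Tibet", "peace", "liberation", "residence", "Tibetan"]
d_1 = feature_words(seg_a1, topic_vector=(0.2143, 0.75, 0.0357), high_freq=high_freq, kappa=1)
```

With $\kappa = 1$ only the second topic is used, so "Tibet" is dropped and the four remaining words form $d_1$, as in the embodiment.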
S3: Extraction of minority topic data
S3.1: Define the minority domain KG as $G_k=(V, E)$, where $V=\{v_1, v_2, \dots, v_n\}$ is the set of nodes corresponding to the entities in the KG and $E=\{e_1, e_2, \dots, e_m\}$ is the set of edges between entities. Each edge corresponds to a triple $e_x=(v_i, v_j, label)$, in which node $v_i$ is the start point, node $v_j$ is the end point, and $label$ is the relation label between the start point and the end point;
S3.2: Construct the domain KG, denoted $G_k$, from the minority domain knowledge $Z$;
S3.2.1: First obtain the minority domain knowledge $Z=\langle term, attributes, addition\rangle$ from domain experts. Take the entity name $v_i$ of each element of $Z$ in turn and express it together with the domain name $v_0$ as a triple $(v_0, v_i, label)$, where $label$ is the relation label taken from the attribute of $v_i$;
S3.2.2: Then build the triple $(v_i, v_j, label)$ for each pair of elements $v_i$ and $v_j$ in turn, where $label$ is the relation label obtained from the additional information $addition$ of the nodes. If $v_i$ and $v_j$ are unrelated, the corresponding edge does not exist. All triples together form the minority domain KG $G_k$;
S3.3: Domain-irrelevant data, called noise data, do not belong to the domain of interest but reduce the accuracy of domain data extraction. An irrelevant-domain KG, denoted $G'_k$, is constructed for them;
S3.3.1: Obtain the knowledge of the domain irrelevant to the minority, $Z=\langle term, attributes, addition\rangle$, from domain experts. Take the entity name $v_i$ of each element of $Z$ in turn and express it together with the domain name $v_0$ as a triple $(v_0, v_i, label)$, where $label$ is the relation label taken from the attribute of $v_i$;
S3.3.2: Then build the triple $(v_i, v_j, label)$ for each pair of elements $v_i$ and $v_j$ in turn, where $label$ is the relation label obtained from the additional information $addition$ of the nodes. If $v_i$ and $v_j$ are unrelated, the corresponding edge does not exist. All triples together form the irrelevant-domain KG $G'_k$;
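The KG construction in S3.2 and S3.3 can be sketched as a list of triples. The text does not say how a relation label is derived from $addition$, so the linking rule below (connect two entities when the description of one mentions the name of the other) is purely a hypothetical illustration; `build_kg` and `kg_nodes` are also hypothetical names.

```python
def build_kg(domain_name, knowledge):
    """Build (start, end, label) triples from <term, attributes, addition> entries."""
    # Edges from the domain name v_0 to every entity, labeled with the attribute.
    triples = [(domain_name, term, attributes) for term, attributes, _ in knowledge]
    terms = [term for term, _, _ in knowledge]
    for term, _, addition in knowledge:
        for other in terms:
            # Hypothetical rule: link `other` to `term` when the description
            # of `term` mentions `other`; unrelated pairs get no edge.
            if other != term and other in addition:
                triples.append((other, term, addition))
    return triples


def kg_nodes(triples):
    """Return the set of all node names that appear in the triples."""
    nodes = set()
    for start, end, _ in triples:
        nodes.update((start, end))
    return nodes


Z = [
    ("Tibet", "place name", "Tibetan province"),
    ("Lhasa", "city name", "Tibet province"),
    ("Gongga", "geographical name", "south Tibet"),
]
G = build_kg("Tibetan", Z)
```

The same function can build the irrelevant-domain KG by passing the noise knowledge and its domain name instead.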
S3.4: Extract the minority domain data;
S3.4.1: Given the decision parameter $\tau$, $0 \le \tau \le 1$;
S3.4.2: For data $I_i$, compute the length $m_i$ of its feature word sequence $d_i$, $m_i \ge 0$;
S3.4.3: For each word $w_{i,j}$ of $d_i$, search the adjacent nodes in turn using the associations $(v_x, v_{x+1}, label)$ between the nodes of $G_k$, and count the number of words of $I_i$ that are minority domain words, denoted $n$, $n \ge 0$;
S3.4.4: Likewise, for each word $w_{i,j}$ of $d_i$, search the adjacent nodes in turn using the associations $(v_x, v_{x+1}, label)$ between the nodes of the irrelevant-domain KG $G'_k$, and count the number of words of $I_i$ that are domain-irrelevant noise words, denoted $n'$;
S3.4.5: Compute the probability $p$ that data $I_i$ belongs to the domain $G_k$ and the probability $p'$ that it belongs to the irrelevant domain $G'_k$. If $p > \tau$ and $p' < \tau$, $I_i$ is judged to be minority topic data and is added to the final minority data set $D$;
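The decision in S3.4 can be sketched as follows. The expressions for the two probabilities are not reproduced in this text, so the sketch assumes they are the fractions of feature words found in each KG, $n/m_i$ and $n'/m_i$. It also replaces the adjacency search over KG edges with plain set membership. `is_minority_data` and the term sets are hypothetical, with entity names taken from the embodiment.

```python
def is_minority_data(features, domain_terms, noise_terms, tau=0.1):
    """Return True when the feature words point to the domain and not to the noise domain."""
    m = len(features)
    if m == 0:
        return False
    p = sum(w in domain_terms for w in features) / m        # assumed p = n / m_i
    p_noise = sum(w in noise_terms for w in features) / m   # assumed p' = n' / m_i
    return p > tau and p_noise < tau


domain_terms = {"Tibet", "Lhasa", "Gongga", "Tibetan region", "Tibetan", "Hada"}
noise_terms = {"Yunnan", "Qinghai", "Shangri-La", "Qinghai Lake"}

d_1 = ["peace", "liberation", "residence", "Tibetan"]
d_2 = ["Gutianle", "fund", "donation", "aid construction"]
d_4 = ["vacation", "Shangri-La", "Qinghai Lake", "Lhasa"]
results = [is_minority_data(d, domain_terms, noise_terms) for d in (d_1, d_2, d_4)]
```

Under these assumptions a1 is kept, a2 has no domain words, and a4 is rejected by the reverse filter because half of its feature words are noise terms, which agrees with the outcome described in the embodiment.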
Through the above steps the invention extracts minority topic data from the new media environment. To make the extraction more accurate and efficient, the parameters of the method are further constrained and optimized. In step S2.3.1, the number of iterations $N_{iter}$ affects both the efficiency of the method and the accuracy of the result. With too few iterations, the topic $z_{i,j}$ of each word $w_{i,j}$ does not converge and the topic feature words are inaccurate; with too many, the iterations after convergence add time and reduce efficiency. The method therefore derives $N_{iter}$ from $S$, the total number of words in the dictionary, rounded up to an integer, so that the number of iterations is tied directly to the amount of data. In addition, $\alpha=0.5$ when $K \le 40$ and $\alpha=20/K$ when $K>40$, while $\beta=0.01$. The value of $\kappa$ grows with the number of topics $K$, so the feature word sequence obtained from the high-frequency word vectors of the top-$\kappa$ topics also grows. Finally, in step S3.4.1, a domain decision parameter in the range $0.05 \le \tau \le 0.15$ gives a more accurate judgment of the domain to which the data belong.
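The parameter rules that survive in the text can be written down directly. The sketch below covers only $\alpha$, $\beta$ and the recommended range of $\tau$; the expressions for $N_{iter}$ and $\kappa$ are not reproduced in this text and are left out. The names `BETA`, `choose_alpha` and `tau_in_recommended_range` are hypothetical.

```python
BETA = 0.01  # value of beta stated in the text


def choose_alpha(K):
    """alpha = 0.5 for K <= 40, otherwise 20 / K."""
    return 0.5 if K <= 40 else 20 / K


def tau_in_recommended_range(tau):
    """Check the decision parameter against the range 0.05 <= tau <= 0.15."""
    return 0.05 <= tau <= 0.15
```

The two branches of `choose_alpha` meet at $K = 40$, where $20/K = 0.5$, so $\alpha$ decreases smoothly as the number of topics grows.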
Detailed Description
The following describes embodiments of the invention with reference to the accompanying drawings so that those skilled in the art can better understand the invention. Note that detailed descriptions of known functions and designs are omitted below where they would obscure the subject matter of the invention.
Example: extraction of "Tibetan" data from the "Sina microblog".
Step one: Preprocessing
Microblog data are acquired from the "Sina microblog" platform; a single piece of microblog data is shown in Table 1.
TABLE 1 microblog data example
For convenience of description, the additional information items $A_i$ are omitted in the following description of data extraction. The acquired Sina microblog data comprise 5 pieces of microblog data, a1 to a5, as shown in Table 2.
Table 2 Sina microblog data
Then the text content $T_i$ of the microblog data is segmented. A word segmentation tool that supports a custom dictionary and stop words is selected, and the Tibetan domain knowledge Z = {<Tibet, place name, Tibetan province>, <Lhasa, city name, Tibet province>, <Gongga, geographical name, south Tibet>, <Tibetan region, fuzzy geographical name, refers to areas inhabited by the Tibetan nationality>, <Tibetan, refers to people of the Tibetan nationality>, <Hada, name of things, Tibetan ceremonial silk fabric>} is introduced. The Tibetan domain words are added to the dictionary of the word segmentation tool, and the segmentation result is denoted $Seg\_T_i$, as shown in Table 3.
Table 3 Sina microblog data word segmentation results
Step two: topic analysis and feature extraction
The microblog data are read, and the dictionary is built from the words in order of appearance without repetition: W = {1: Tibet, 2: peace, 3: liberation, 4: residence, 5: Tibetan, 6: Gutianle, 7: fund, 8: donation, 9: aid construction, 10: school, 11: Tibetan region, 12: Hada, 13: vacation, 14: Shangri-La, 15: Qinghai Lake, 16: Lhasa, 17: travel, 18: strategy, 19: Gongga, 20: airport, 21: kilometers}.
Given the number of iterations $N_{iter}=10$, the total number of topics $K=3$ and the parameters $\alpha=0.5$, $\beta=0.01$, the topic of each word and the word of each topic are sampled for 10 iterations so that every word is assigned to a topic.
The words assigned to each topic are as follows:
Travel: {vacation, travel, strategy, Lhasa, airport, kilometers, Shangri-La, Qinghai Lake, Tibet, Gongga}
Loop over the microblog data to count the number of words of each piece of data in every topic and the total number of words of each topic. Taking microblog data a1 as an example, its topic vector is calculated from these counts by formula (2).
Thus the topic vector of microblog data a1 is (0.2143, 0.75, 0.0357). Similarly, the topic vector of a2 is (0.0357, 0.2143, 0.75), the topic vector of a3 is (0.0435, 0.9130, 0.0435), the topic vector of a4 is (0.9130, 0.0435, 0.0435), and the topic vector of a5 is (0.9583, 0.0208, 0.0208).
For each topic $z_k$, calculate $\delta_{t,k}$. Taking topic $z_1$ as an example, the dictionary $W$ shows that $t=1$ denotes "Tibet", $t=13$ denotes "vacation" and $t=16$ denotes "Lhasa", and $\delta_{t,1}$ is calculated by formula (1).
Sorting by $\delta_{t,1}$ in descending order gives the high-frequency word vector of topic $z_1$: $\Delta_1$ = (("Lhasa", 0.2118), ("vacation", 0.1414), ("Tibet", 0.1414), ("travel", 0.0711), ("strategy", 0.0711), …). In the same way, the high-frequency word vector of topic $z_2$ is $\Delta_2$ = (("school", 0.2182), ("Tibetan", 0.2182), ("peace", 0.1097), ("Hada", 0.1097), ("liberation", 0.1097), …), and the high-frequency word vector of topic $z_3$ is $\Delta_3$ = (("fund", 0.2399), ("donation", 0.2399), ("aid construction", 0.2399), ("Gutianle", 0.2399)).
Take $\kappa=1$, that is, match the words of each microblog's $Seg\_T_i$ against the words in the high-frequency word vector of its top-1 topic to obtain the feature word sequence of the data. Taking microblog data a1 as an example, its top-1 topic is $z_2$, so the high-frequency word vector $\Delta_2$ of topic $z_2$ is matched with the $Seg\_T_i$ of a1 to obtain $d_1$ = <"peace", "liberation", "residence", "Tibetan">. The others are obtained in the same way:
$d_2$ = <"Gutianle", "fund", "donation", "aid construction">
$d_3$ = <"Tibetan region", "school", "Tibetan", "Hada">
$d_4$ = <"vacation", "Shangri-La", "Qinghai Lake", "Lhasa">
$d_5$ = <"travel", "strategy", "vacation", "Tibet", "Lhasa", "Gongga", "airport", "Lhasa", "kilometers">
Step three: minority data extraction
First, the domain KG is constructed from the Tibetan domain knowledge Z = {<Tibet, place name, Tibetan province>, <Lhasa, city name, Tibet province>, <Gongga, geographical name, south Tibet>, <Tibetan region, fuzzy geographical name, refers to areas inhabited by the Tibetan nationality>, <Tibetan, refers to people of the Tibetan nationality>, <Hada, name of things, Tibetan ceremonial silk fabric>}.
Take the entity name $v_i$ of each element of $Z$ in turn and express it together with the domain name $v_0$ as a triple $(v_0, v_i, label)$, for example ("Tibetan", "Tibet", "Place"). Then build the triple $(v_i, v_j, label)$ for each pair of elements $v_i$ and $v_j$ in turn, with $label$ obtained from the additional information of the nodes, for example ("Tibet", "Lhasa", "province"). The resulting graph is shown in FIG. 3.
In the same way, from the domain-irrelevant knowledge of the "travel" topic, Z = {<Yunnan, province name, tourist province>, <Qinghai, province name, tourist province>, <Shangri-La, place name, tourist attraction in Yunnan>, <Qinghai Lake, lake name, tourist attraction in Qinghai>}, a travel KG $G'_k$ unrelated to "Tibetan" is constructed, as shown in FIG. 4.
Given the parameter $\tau=0.1$. For microblog data a1, the length of its feature word sequence $d_1$ is $m_1=4$. Each word of $d_1$ is searched among the nodes and edges of the domain KG $G_k$ and of the irrelevant-domain KG $G'_k$, and the numbers of matching words are counted.
From these counts, the probability $p$ that microblog data a1 belongs to the domain $G_k$ and the probability $p'$ that it belongs to the irrelevant domain $G'_k$ are obtained. Since $p>\tau$ and $p'<\tau$, microblog data a1 belongs to the "Tibetan" domain and is added to the extracted Tibetan data set $D$. In the same way, a3 and a5 also belong to the "Tibetan" domain. For a4, the condition is not met, so it is irrelevant noise data that merely mentions names related to the Tibetan nationality.
The extraction results of the "Tibetan" subject data are shown in Table 4.
TABLE 4 "Tibetan" subject data extraction results
Drawings
FIG. 1 is a flow chart of the implementation of the invention, which comprises three steps: new media data preprocessing, topic analysis and feature extraction, and minority data extraction.
FIG. 2 is the LDA graphical model.
FIG. 3 is a graphical illustration of the Tibetan domain knowledge graph in the embodiment.
FIG. 4 is a graphical illustration of the knowledge graph corresponding to the noise data in the embodiment.
Claims (3)
1. A method for extracting minority topic data in a new media environment, characterized by comprising the following steps:
S1: Data preprocessing
S1.1: Obtain M items of media data I = {I_1, I_2, …, I_M} from social networks or news web pages, where I_i denotes the i-th data item, 0 ≤ i ≤ M. Each I_i is represented by a triple (id, T_i, A_i), where id is the identifier of the data instance, T_i is the text content of I_i, and A_i = {A_{i,u}, A_{i,p}, A_{i,l}, A_{i,v}, A_{i,f}, A_{i,q}, A_{i,c}, A_{i,r}} is the additional information, namely the publisher A_{i,u}, publication time A_{i,p}, publication location A_{i,l}, publication source A_{i,v}, forward count A_{i,f}, like count A_{i,q}, comment count A_{i,c}, and read count A_{i,r};
S1.2: The minority domain knowledge Z = <term, attributes, addition> is given by domain experts, where term is the entity name, attributes are the attributes of the entity, and addition is an additional description of the entry;
S1.3: Obtain a stop-word set Stop_words;
S1.4: Use a word segmentation tool to segment the text content T_i of the media data. Before segmentation, add Stop_words to the default stop-word set of the segmentation tool and add the set of entity names term from the minority domain knowledge to the default vocabulary of the segmentation tool. The segmentation result of T_i is stored separately at the end of data I_i and is denoted Seg_T_i;
S2: topic analysis and feature extraction
S2.1: definition dictionaryW={w 1, w 2, …, w S Store all the words contained in the data,Sis the total number of words in the dictionary,w i ≠w j ,1≤i,j≤S,i≠j;
s2.2: defining dataI i Subject vector ofΛ i =(λ i1,, λ i2,, …, λ K,i ),λ k,i Is thatI i The Chinese vocabulary belongs to the subjectz k Probability of 0 ≦λ k,i 1 or less, wherein the subjectz k Using high-frequency word vectorsΔ k =((w 1,δ k1,), (w 2,δ k2,), …, () Is) to indicate that,S k is composed ofz k The total number of words of (a) is,δ t,k is thatz k Words in the general vocabularyw t Probability of 0 ≦δ t,k ≤1,δ t,k Andλ k,i the following equations (1) and (2) respectively determine:
wherein,representing a topicz k The words and phrases ofw t The total number of (a) and (b),to representI i Including a themez k The number of the Chinese words and phrases is,Sis the total number of words in the dictionary,Kas a result of the total number of themes,is composed oftThe hyperparameter of the dirichlet distribution of the dimension,is composed ofkA hyper-parameter of the dirichlet distribution of dimensions;
s2.3: sampling a theme and a vocabulary;
s2.3.1: given number of iterationsN iter ,N iter Not less than 1, total number of themesK,KNot less than 1, parameterα,β,κ,0<α,β<1,κ≥1;
S2.3.2: for each topic z k Probability distribution of words in a sample topicφ k ~Dir(β),Dir(β) Representing a hyperparameter ofβDirichlet distribution of (a);
s2.3.3: for dataI i Topic probability distribution of sampled dataθ i ~Dir(α),Dir(α) Representing a hyperparameter ofαOf the Dirichlet distribution, of the dataSeg_T i Sampling the subject of a wordz i,j ~Mult() Vocabulary of sample topicsw i,j ~Mult(),Mult() And Mult () Respectively represent parameters ofAnda polynomial distribution of (a); statistical themesz k Total number of words ofData, dataI i Including a themez k Number of Chinese words;
S2.3.4: repeat S2.3.3, iterateN iter Next to each vocabularyw i,j Subject matter of (1)z i,j Convergence is achieved, and at the moment, the subject to which each vocabulary belongs does not change any more;
s2.4: obtaining a themez k High frequency word vectors and dataI i The topic vector of (1);
s2.4.1: reading each piece of dataI i The words and phrases ofw i,j And corresponding subject matterz i,j Statistical topicz i,j =z k The words and phrases ofw i,j Total number ofAnd dataI i Inz i,j =z k Chinese vocabularyw i,j Number of (2);
S2.4.2: calculating according to formula (1) to obtain each themez k Chinese vocabularyw t Probability of (2)δ t,k According toδ t,k Arranging in descending order to obtain the themez k High frequency word vector ofΔ k =((w 1,δ k1,), (w 2,δ k2,), …, ()),0≤k≤K;
S2.4.3: calculating according to formula (2) to obtain each piece of dataI i The Chinese vocabulary belongs to the subjectz k Probability of (2)λ k,i According toλ k,i Arranging in descending order to obtain dataI i Subject vector ofΛ i =(λ i1,, λ i2,, …, λ K,i );
S2.5: acquiring a data feature word sequence;
s2.5.1: reading dataI i Subject vector ofΛ i Push buttonλ k,i Descending order, and taking top-kappa subjects;
s2.5.2: will be in dataSeg_T i Of the vocabulary of (1) with the high-frequency word vectors of the above top-k topicsΔ k The words are mapped and matched, and the union of the two words is recorded asd i =<w i,1 ,w i,2 ,…, >Represents dataI i The feature word sequence of (1);
s3: extraction of minority subject data
S3.1: defining the minority domain KG asG k =(V, E) WhereinV={v 1, v 2, …, v n Denotes the set of entity corresponding nodes in KG,E={e 1, e 2,…, e m represents a collection of edges between entities; any edge corresponds to a node triplee x =(v i , v j , label) Node ofv i Called origin, nodev j Referred to as the end point,labela relation label of a starting point and an end point;
s3.2: using knowledge in the minority domainZConstruction of the field KG, usingG k Represents;
s3.2.1: first, the minority domain knowledge is obtained from domain expertsZ=<term, attributes, addition>Sequentially takingZElement entity name ofv i And name of artv 0 Expressed as a triplet (v 0 , v i , label),labelGetv i Is given asThe relationship label of (1);
s3.2.2: then each element is built in turnv i Andv j the triad of (v i , v j , label) At this timelabelAdditional information by a nodeadditionTo obtainRelationship labels of, e.g.v i Andv j if there is no relation, the corresponding edge does not exist, and all the triples together form the minority domain KG ofG k ;
S3.3: for domain-independent data, called noise data, which do not belong to the field of interest but which influence the accuracy of the extraction of the domain data during the data extraction process, the constructed domain of independence KG is used inG k Represents;
s3.3.1: obtaining knowledge of minority irrelevant field from domain expertZ=<term, attributes, addition>In turn getZElement entity name ofv i And name of artv 0 Expressed as a triplet (v 0 , v i , label),labelGetv i Is given asThe relationship label of (1);
s3.3.2: then each element is built in turnv i Andv j the triad of (v i , v j , label),labelAdditional information by a nodeadditionTo obtainRelationship labels of, e.g.v i Andv j irrelevant, the corresponding edge does not exist, all triples jointly form an irrelevant field KGG k ;
S3.4: the extraction of minority domain data is realized;
s3.4.1: giving a decision parameterτ,0≤τ≤1;
S3.4.2: for dataI i Calculating the characteristic word sequence thereofd i Length of (2)m i ,m i ≥0;
S3.4.3: for datad i Each word of (1)w i,j By usingG k Association between nodes: (v x , v x+1, label) Searching adjacent points of nodes in sequence, and counting dataI i The number of the words in the minority domain of the Chinese vocabulary is recordedn,n≥0;
S3.4.4: of the same datad i Each word of (1)w i,j UtilizeG k Association between nodes: (v x , v x+1, label) Searching adjacent points of nodes in sequence, and counting dataI i The number of words in which the middle word is noise data that is not domain-related is recorded;
3. The method for extracting minority topic data in a new media environment as claimed in claim 1, wherein in step S3.4.1 the decision parameter τ is in the range 0.05 ≤ τ ≤ 0.15.
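Steps S2.3 to S2.5 of claim 1 can be illustrated with a compact Python sketch. It rests on several assumptions that the claims do not confirm: inference uses a collapsed Gibbs sampler (a standard technique swapped in for the iterative sampling of S2.3), the hyperparameters α and β are symmetric, and equations (1) and (2), which are not preserved in this text, take the usual smoothed-count form shown in the comments. The function names `gibbs_lda` and `feature_words` are hypothetical, and S2.5.2 is interpreted as keeping the document words that also occur in the high-frequency word vectors of the top-κ topics.

```python
import random
from typing import Dict, List, Tuple


def gibbs_lda(docs: List[List[str]], K: int, alpha: float = 0.5, beta: float = 0.1,
              n_iter: int = 50, seed: int = 0
              ) -> Tuple[List[Dict[str, float]], List[List[float]]]:
    rng = random.Random(seed)
    vocab = sorted({w for d in docs for w in d})
    S = len(vocab)
    n_kt = [dict.fromkeys(vocab, 0) for _ in range(K)]  # count of word t in topic k
    n_k = [0] * K                                       # total words in topic k
    n_ik = [[0] * K for _ in docs]                      # words of doc i in topic k
    z = []
    for i, d in enumerate(docs):                        # random initial topics
        z.append([rng.randrange(K) for _ in d])
        for w, k in zip(d, z[i]):
            n_kt[k][w] += 1; n_k[k] += 1; n_ik[i][k] += 1
    for _ in range(n_iter):
        for i, d in enumerate(docs):
            for j, w in enumerate(d):
                k = z[i][j]                             # remove current assignment
                n_kt[k][w] -= 1; n_k[k] -= 1; n_ik[i][k] -= 1
                weights = [(n_kt[t][w] + beta) / (n_k[t] + S * beta)
                           * (n_ik[i][t] + alpha) for t in range(K)]
                k = rng.choices(range(K), weights=weights)[0]
                z[i][j] = k                             # record resampled topic
                n_kt[k][w] += 1; n_k[k] += 1; n_ik[i][k] += 1
    # Assumed form of equation (1): delta = (n_kt + beta) / (n_k + S * beta),
    # kept only for words assigned to the topic and sorted in descending order.
    delta = []
    for k in range(K):
        probs = {w: (c + beta) / (n_k[k] + S * beta) for w, c in n_kt[k].items() if c > 0}
        delta.append(dict(sorted(probs.items(), key=lambda kv: kv[1], reverse=True)))
    # Assumed form of equation (2): lambda = (n_ik + alpha) / (len(doc) + K * alpha).
    lam = [[(n_ik[i][k] + alpha) / (len(d) + K * alpha) for k in range(K)]
           for i, d in enumerate(docs)]
    return delta, lam


def feature_words(doc: List[str], delta: List[Dict[str, float]],
                  lam_i: List[float], kappa: int) -> List[str]:
    # S2.5.1: topics of the document sorted by probability, top-kappa kept.
    top = sorted(range(len(lam_i)), key=lambda k: lam_i[k], reverse=True)[:kappa]
    high_freq = {w for k in top for w in delta[k]}
    # S2.5.2 (as interpreted here): document words matched in those topics, in order.
    seen, d_i = set(), []
    for w in doc:
        if w in high_freq and w not in seen:
            seen.add(w)
            d_i.append(w)
    return d_i


docs = [["tibet", "lasa", "temple", "tibet"], ["lake", "travel", "ticket"],
        ["lasa", "temple", "travel"]]
delta, lam = gibbs_lda(docs, K=2)
d_0 = feature_words(docs[0], delta, lam[0], kappa=1)
```

A fixed `seed` makes the run repeatable; a real data set would need far more iterations than the 50 used in this toy example.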
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810969312.1A CN109241273B (en) | 2018-08-23 | 2018-08-23 | Method for extracting minority subject data in new media environment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109241273A (en) | 2019-01-18 |
CN109241273B (en) | 2022-02-18 |
Family
ID=65069466
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810969312.1A Active CN109241273B (en) | 2018-08-23 | 2018-08-23 | Method for extracting minority subject data in new media environment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109241273B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110110013B (en) * | 2019-05-10 | 2020-03-24 | 成都信息工程大学 | Entity competition relation data mining method based on space-time attributes |
Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103577549A (en) * | 2013-10-16 | 2014-02-12 | 复旦大学 | Crowd portrayal system and method based on microblog label |
CN103699663A (en) * | 2013-12-27 | 2014-04-02 | 中国科学院自动化研究所 | Hot event mining method based on large-scale knowledge base |
CN104217038A (en) * | 2014-09-30 | 2014-12-17 | 中国科学技术大学 | Knowledge network building method for financial news |
CN105653706A (en) * | 2015-12-31 | 2016-06-08 | 北京理工大学 | Multilayer quotation recommendation method based on literature content mapping knowledge domain |
CN106156090A (en) * | 2015-04-01 | 2016-11-23 | 上海宽文是风软件有限公司 | A kind of designing for manufacturing knowledge personalized push method of knowledge based collection of illustrative plates (Man-tree) |
CN106909643A (en) * | 2017-02-20 | 2017-06-30 | 同济大学 | The social media big data motif discovery method of knowledge based collection of illustrative plates |
CN106960025A (en) * | 2017-03-19 | 2017-07-18 | 北京工业大学 | A kind of personalized literature recommendation method based on domain knowledge collection of illustrative plates |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104008106B (en) * | 2013-02-25 | 2018-07-20 | 腾讯科技(深圳)有限公司 | A kind of method and device obtaining much-talked-about topic |
Non-Patent Citations (3)
Title |
---|
SeCo-LDA Mining Service Co-occurrence Topics for Recommendation;Zhenfeng Gao et al.;《2016 IEEE International Conference on Web Services》;20160901;25-32 * |
基于Topic Model的我国档案学主题结构与演化研究;董克 等;《信息资源管理学报》;20170726;第7卷(第3期);97-105 * |
民族志传播:一幅不十分完备的研究地图——基于中文文献的考察;郭建斌;《新闻大学》;20180415(第2期);1-17,149 * |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
OL01 | Intention to license declared | ||