CN109241273B - Method for extracting minority subject data in new media environment - Google Patents

Info

Publication number
CN109241273B
Authority
CN
China
Prior art keywords
data
words
minority
domain
label
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810969312.1A
Other languages
Chinese (zh)
Other versions
CN109241273A (en
Inventor
岳昆
麻友
李维华
王笑一
郭建斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Yunnan University YNU
Original Assignee
Yunnan University YNU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Yunnan University YNU
Priority to CN201810969312.1A
Publication of CN109241273A
Application granted
Publication of CN109241273B
Legal status: Active
Anticipated expiration

Landscapes

  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for acquiring data from a new media platform and extracting ethnic-minority topic data. Because new media data are massive, unstructured and multi-topic, an LDA (Latent Dirichlet Allocation) model is applied to the preprocessed data to perform feature extraction, topic analysis and latent topic mining. A knowledge graph (KG) is then constructed from knowledge of the minority domain, and this domain KG guides the extraction of the minority topic data. The parameters of the LDA model and of the KG-guided extraction are set according to the data scale, which optimizes the algorithm and yields accurate, efficient and scalable extraction of new media data.

Description

Method for extracting minority subject data in new media environment
Technical Field
The invention discloses a method for acquiring data from a new media platform and extracting minority topic data. It relates to a method that performs latent topic analysis and feature extraction on new media data based on Latent Dirichlet Allocation (LDA) and extracts minority topic data using a domain Knowledge Graph (KG). It belongs to the field of data processing and knowledge discovery.
Background
New media is a media form defined relative to traditional media such as newspapers, radio and television. It includes network media, mobile phone media, digital television and the like, and is characterized by interactivity, immediacy, massive volume, sharing, multimedia, hypertext, personalization and socialization. As new media plays an increasingly important role in information dissemination, the processing and analysis of network media data has attracted wide attention from scholars in China and abroad. Data can be divided according to their main content, and data describing the same type of content are called data of the same topic, for example tourism, entertainment, or film and television. Obtaining data on a specific topic from massive, heterogeneous network media data, and performing topic analysis, content screening and information filtering for different domains, is an important research subject in new media data processing and knowledge discovery. It is also an important basis for decision support, influence prediction, knowledge base construction, public opinion analysis and similar work.
In addition, as China's cultural strategy advances, research on acquiring, analyzing and using information about ethnic minorities keeps growing. Massive new media data contain a large amount of valuable minority topic data, such as minority travel information sharing, minority cultural exchange, minority hot-spot issues and minority news events on microblog platforms, and these data can enrich the data sources for research and development related to minority topics. Studying the extraction of minority topic data in the new media environment is therefore significant: it supports processing and analyzing massive new media data for practical problems in minority politics, economy and culture, discovering data-driven knowledge, and developing data-intensive applications such as public opinion monitoring and policy making for minority regions, and the dissemination of minority culture and the protection of its heritage.
Data extraction is the process of extracting target data from source data. Many results on data extraction are known, and the techniques differ with the data and the application. For example, Liu Jin et al. (master thesis of university of science and technology, 2016) study the extraction of social relations between people from Chinese news data based on unsupervised learning; Wu Self Tiger et al. (master thesis of university of Liaoning, 2017) propose a template-based method for extracting network public opinion data; Zhang et al. (master thesis of liberation military information engineering, 2017) propose a relation extraction model for network text data based on a deep fused convolutional neural network; Yao Xiao Peng et al. (computer application and software, 2018) propose a deep-web data extraction and mining method in a global mode; Yuanhua et al. (management engineering report, 2018) propose a method for extracting hot topics and their feature words from massive data, analyzing the relation between hot topic words and local feature words in massive user-generated content; and Hades et al. (patent CN107436902A, 2017) propose a data extraction method and system for massive data that extracts the target data separately for static and dynamic data sources. These methods complete the extraction task well for their own data sources and problems, but they are not general techniques and do not carry over to extracting minority topic data from a massive, unstructured, multi-topic new media platform.
Therefore, given that new media data are massive, unstructured and multi-topic, the invention mines the latent topics in network new media data with an LDA model to analyze multiple topics, and then uses the feature word sequences of the data together with the entities and inter-entity relations described by a knowledge graph to extract minority topic data more accurately and comprehensively.
The LDA model is a hierarchical Bayesian model that is widely applied in known research on data extraction, text mining, social networks, natural language processing and other fields. For example, Liu Shao et al. (computer science and report, 2015) use LDA to extract qualitative and descriptive topics from massive movie review data; Liu Ice Jade et al. (software science and report, 2017) study massive e-commerce review data and extract product features and sentiment words with a semantically constrained LDA; and Zhao Science et al. (patent CN107885754A, 2018) provide a method and device for extracting credit variables from transaction data based on an LDA model. These applications to massive data show that the LDA model is effective for topic analysis, feature extraction and text mining. On this basis, the invention further exploits the strengths of the LDA model in analyzing massive, unstructured, multi-topic new media data.
A KG is a semantic network that expresses entities, concepts and the relations among them. In known research, KGs are widely used in personalized recommendation, intelligent search, knowledge discovery and other fields. For example, Chendhan et al. (computer research and development, 2017) propose a link prediction model for temporal KGs in the clinical domain; Heijun et al. (computer science report, 2016) propose a relation extraction method for the evolution of domain knowledge in Chinese Wikipedia; and Rekey et al. (patent CN108073711A, 2018) propose a KG-based relation extraction method that uses the path and attribute information of the KG to mine latent semantic information. These results, whether in medical research or in relation extraction, show the value of the rich prior knowledge held in a KG semantic network. However, the choice of KG for a given application also affects the efficiency and effectiveness of the work. For each application scenario and research domain, a corresponding KG must be constructed that covers the knowledge and semantic relations of that domain as completely as possible, so as to improve the accuracy and efficiency of the data extraction results.
However, because of limited data sources, obscure knowledge and cultural differences, interdisciplinary research on minority topics is relatively difficult, even though such research has become necessary for many current subjects. How to use the large amount of data in new media as a basis and extract the valuable data from it has become the foundation of the related research.
The invention therefore addresses the extraction of minority topic data from new media. It is based on large-scale data from a new media platform and on knowledge of the minority domain, and its goal is to extract minority topic data from massive, unstructured, multi-topic new media data. It uses an LDA model to mine latent multi-topic information from the unstructured data, performs topic analysis and extracts features of the data, and then uses the rich semantic relations of the domain KG to handle the highly specialized vocabulary, obscure word sources and ambiguous words that arise when extracting minority topic data from massive new media data. In summary, the invention provides a method for extracting minority topic data in the new media environment, lays a new technical foundation for applications such as large-scale new media data processing, analysis, prediction and decision making, and also provides a reference for extracting new media data in other specific domains.
Disclosure of Invention
To overcome the efficiency bottleneck caused by obscure word sources, highly specialized vocabulary and synonymy in the minority domain, the invention provides a method, based on an LDA model and a KG, for acquiring data from a new media platform and extracting minority topic data. Given that new media data are massive, unstructured and multi-topic, the method extracts new media data of a specific domain accurately, efficiently and scalably.
The method comprises three steps. The first step is data preprocessing: the required new media data are obtained and a word segmentation tool segments the data content, using domain vocabulary added for the minority domain under study and additional custom stop words to simplify the preprocessing result. The second step is topic analysis and feature extraction: the preprocessed data are processed iteratively with an LDA model to mine the latent topics, yielding a topic vector for each piece of data and a high-frequency word vector for each topic; the high-frequency word vectors of the topics to which a piece of data belongs are matched against its content to obtain its feature word sequence. The third step is KG-based extraction of minority topic data: the knowledge of the minority domain is built into a domain KG, whose rich semantic relations serve as prior knowledge, and the feature word sequences from the second step are matched against it to filter out the minority topic data. At the same time, a KG is built from noise data irrelevant to the domain to perform reverse filtering, which improves the accuracy of the extraction.
The method comprises the following steps:
S1: Data preprocessing
S1.1: Obtain $M$ pieces of media data $I = \{I_1, I_2, \dots, I_M\}$ from social networks or news web pages, where $I_i$ denotes the $i$-th piece of data, $0 \le i \le M$. $I_i$ is represented by a triple $(id, T_i, A_i)$, where $id$ is the identifier of the data instance, $T_i$ is the text content of $I_i$, and $A_i = \{A_{i,u}, A_{i,p}, A_{i,l}, A_{i,v}, A_{i,f}, A_{i,q}, A_{i,c}, A_{i,r}\}$ is the additional information, namely the publisher $A_{i,u}$, publication time $A_{i,p}$, publication location $A_{i,l}$, publication source $A_{i,v}$, forward count $A_{i,f}$, like count $A_{i,q}$, comment count $A_{i,c}$ and data reading time $A_{i,r}$;
S1.2: The minority-domain knowledge $Z = \langle term, attributes, addition \rangle$ is given by domain experts, where $term$ is the entity name, $attributes$ is the attribute of the entity, and $addition$ is an additional description of the entry;
S1.3: Obtain the stop-word set $Stop\_words$;
S1.4: Use a word segmentation tool to segment the text content $T_i$ of the media data. Before segmentation, add $Stop\_words$ to the tool's default stop-word set and add the set of minority-domain entity names $term$ to the tool's default vocabulary. The segmentation result of $T_i$ is stored separately at the end of $I_i$ and is denoted $Seg\_T_i$;
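The preprocessing in S1 can be sketched as follows. This is only an illustration under stated assumptions: the record type `MediaItem`, the function names and the toy forward-maximum-matching segmenter are hypothetical stand-ins for the unspecified word segmentation tool, and they only show how domain terms and stop words enter the segmentation.

```python
from dataclasses import dataclass, field


@dataclass
class MediaItem:
    """One record I_i = (id, T_i, A_i); seg holds Seg_T_i after S1.4."""
    id: str
    text: str
    extra: dict = field(default_factory=dict)   # A_i: publisher, time, ...
    seg: list = field(default_factory=list)


def segment(text, vocabulary, stop_words):
    """Toy forward-maximum-matching segmenter (stand-in for a real tool)."""
    max_len = max((len(w) for w in vocabulary), default=1)
    tokens, pos = [], 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        # Try the longest vocabulary entry starting at pos,
        # falling back to a single character.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if size == 1 or piece in vocabulary:
                break
        pos += size
        if piece not in stop_words:
            tokens.append(piece)
    return tokens


def preprocess(items, domain_terms, default_vocab, stop_words):
    """S1.4: extend the vocabulary with domain terms, then segment each T_i."""
    vocab = set(default_vocab) | set(domain_terms)
    stops = set(stop_words)
    for item in items:
        item.seg = segment(item.text, vocab, stops)
    return items
```

Without the domain terms the same text falls apart into single characters, which is why S1.4 adds the entity names to the vocabulary before segmenting.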
S2: topic analysis and feature extraction
S2.1: Define a dictionary $W = \{w_1, w_2, \dots, w_S\}$ that stores all the words contained in the data, where $S$ is the total number of words in the dictionary and $w_i \ne w_j$ ($1 \le i, j \le S$, $i \ne j$);
S2.2: Define the topic vector of data $I_i$ as $\Lambda_i = (\lambda_{1,i}, \lambda_{2,i}, \dots, \lambda_{K,i})$, where $\lambda_{k,i}$ is the probability that a word in $I_i$ belongs to topic $z_k$, $0 \le \lambda_{k,i} \le 1$. Topic $z_k$ is represented by its high-frequency word vector $\Delta_k = ((w_1, \delta_{1,k}), (w_2, \delta_{2,k}), \dots, (w_{S_k}, \delta_{S_k,k}))$, where $S_k$ is the total number of words of $z_k$ and $\delta_{t,k}$ is the probability of word $w_t$ in $z_k$, $0 \le \delta_{t,k} \le 1$. $\delta_{t,k}$ and $\lambda_{k,i}$ are determined by equations (1) and (2), respectively:

$$\delta_{t,k} = \frac{n_k^{(t)} + \beta}{\sum_{t=1}^{S} n_k^{(t)} + S\beta} \qquad (1)$$

$$\lambda_{k,i} = \frac{n_i^{(k)} + \alpha}{\sum_{k=1}^{K} n_i^{(k)} + K\alpha} \qquad (2)$$

where $n_k^{(t)}$ denotes the total number of occurrences of word $w_t$ in topic $z_k$, $n_i^{(k)}$ denotes the number of words in $I_i$ that belong to topic $z_k$, $S$ is the total number of words in the dictionary, and $K$ is the total number of topics;
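A minimal sketch of the two probabilities defined in S2.2, assuming they take the usual smoothed count-ratio form of LDA (counts plus the Dirichlet parameter, normalized over the dictionary or over the topics). The function names and the sample counts are illustrative, not taken from the source.

```python
def topic_word_prob(count_tk, topic_total, beta, vocab_size):
    """delta_{t,k}: smoothed share of word w_t among the words of topic z_k.

    count_tk    -- occurrences of w_t assigned to z_k
    topic_total -- all word occurrences assigned to z_k
    vocab_size  -- S, the number of words in the dictionary
    """
    return (count_tk + beta) / (topic_total + vocab_size * beta)


def doc_topic_probs(topic_counts, alpha):
    """lambda_{k,i} for every topic k, from the per-topic word counts of I_i."""
    total = sum(topic_counts)
    num_topics = len(topic_counts)
    return [(c + alpha) / (total + num_topics * alpha) for c in topic_counts]
```

For instance, `topic_word_prob(3, 14, 0.01, 21)` rounds to 0.2118, and the values returned by `doc_topic_probs` always sum to 1 because the smoothing term is added once per topic in both numerator and denominator.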
S2.3: Sample topics and words;
S2.3.1: Give the number of iterations $N_{iter}$ ($N_{iter} \ge 1$), the total number of topics $K$ ($K \ge 1$), and the parameters $\alpha$, $\beta$ and $\kappa$, with $0 < \alpha, \beta < 1$ and $\kappa \ge 1$;
S2.3.2: For each topic $z_k$, sample the probability distribution of the words in the topic, $\varphi_k \sim Dir(\beta)$;
S2.3.3: For data $I_i$, sample the topic probability distribution of the data, $\theta_i \sim Dir(\alpha)$. For each word of $Seg\_T_i$, sample its topic $z_{i,j} \sim Mult(\theta_i)$ and then sample the word of that topic, $w_{i,j} \sim Mult(\varphi_{z_{i,j}})$. Count the total number of words of topic $z_k$ and the number of words in $I_i$ that belong to topic $z_k$;
S2.3.4: Repeat S2.3.3 for $N_{iter}$ iterations, until the topic $z_{i,j}$ of each word $w_{i,j}$ converges, that is, until the topic to which each word belongs no longer changes;
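The iteration in S2.3 can be illustrated with a collapsed Gibbs sampler, a common way to fit LDA. This is a swapped-in technique chosen for the sketch and not confirmed by the source as the inventors' implementation; the function name, the integer word ids and the fixed seed are assumptions.

```python
import random


def gibbs_lda(docs, num_topics, vocab_size, alpha, beta, n_iter, seed=0):
    """Collapsed Gibbs sampling for LDA.

    docs -- list of documents, each a list of word ids in [0, vocab_size).
    Returns (z, n_kt, n_ik): topic of every word, word-in-topic counts,
    and topic-in-document counts.
    """
    rng = random.Random(seed)
    n_kt = [[0] * vocab_size for _ in range(num_topics)]  # word t in topic k
    n_k = [0] * num_topics                                # words in topic k
    n_ik = [[0] * num_topics for _ in docs]               # doc i, topic k

    def bump(i, k, t, step):
        n_kt[k][t] += step
        n_k[k] += step
        n_ik[i][k] += step

    # Random initial topic for every word occurrence.
    z = []
    for i, doc in enumerate(docs):
        z.append([rng.randrange(num_topics) for _ in doc])
        for t, k in zip(doc, z[i]):
            bump(i, k, t, +1)

    for _ in range(n_iter):
        for i, doc in enumerate(docs):
            for j, t in enumerate(doc):
                bump(i, z[i][j], t, -1)  # remove the current assignment
                weights = [
                    (n_ik[i][k] + alpha) * (n_kt[k][t] + beta)
                    / (n_k[k] + vocab_size * beta)
                    for k in range(num_topics)
                ]
                z[i][j] = rng.choices(range(num_topics), weights=weights)[0]
                bump(i, z[i][j], t, +1)  # record the new assignment
    return z, n_kt, n_ik
```

The returned counts are the quantities that S2.3.3 asks to be tallied, so they can be fed directly into the probability calculations of S2.4.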
S2.4: Obtain the high-frequency word vector of each topic $z_k$ and the topic vector of each piece of data $I_i$;
S2.4.1: Read the words $w_{i,j}$ of each piece of data $I_i$ and their corresponding topics $z_{i,j}$. Count the total number of words $w_{i,j}$ for which $z_{i,j} = z_k$, and the number of words $w_{i,j}$ in $I_i$ for which $z_{i,j} = z_k$;
S2.4.2: Compute by equation (1) the probability $\delta_{t,k}$ of each word $w_t$ in each topic $z_k$, and sort by $\delta_{t,k}$ in descending order to obtain the high-frequency word vector of topic $z_k$, $\Delta_k = ((w_1, \delta_{1,k}), (w_2, \delta_{2,k}), \dots, (w_{S_k}, \delta_{S_k,k}))$, $0 \le k \le K$;
S2.4.3: Compute by equation (2) the probability $\lambda_{k,i}$ that a word in each piece of data $I_i$ belongs to topic $z_k$, and sort by $\lambda_{k,i}$ in descending order to obtain the topic vector of $I_i$, $\Lambda_i = (\lambda_{1,i}, \lambda_{2,i}, \dots, \lambda_{K,i})$;
S2.5: Obtain the feature word sequence of the data;
S2.5.1: Read the topic vector $\Lambda_i$ of data $I_i$, sort it by $\lambda_{k,i}$ in descending order, and take the top-$\kappa$ topics;
S2.5.2: Match the words of $Seg\_T_i$ against the words in the high-frequency word vectors $\Delta_k$ of these top-$\kappa$ topics. The matching words are collected as $d_i = \langle w_{i,1}, w_{i,2}, \dots, w_{i,m_i} \rangle$, which is the feature word sequence of data $I_i$;
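The matching in S2.5 can be sketched as below. The sketch assumes that the feature word sequence keeps, in their original order, those words of $Seg\_T_i$ that occur in the high-frequency word vector of at least one of the top-$\kappa$ topics. The function name and the 0-based topic indices are illustrative.

```python
def feature_words(seg_words, topic_vector, topic_word_vectors, kappa):
    """Return d_i: words of Seg_T_i found in the top-kappa topics' word vectors.

    seg_words          -- segmented words of one piece of data
    topic_vector       -- lambda_{k,i} for k = 0..K-1
    topic_word_vectors -- per topic, a list of (word, probability) pairs
    """
    # Topic indices sorted by descending probability, truncated to kappa.
    top = sorted(range(len(topic_vector)),
                 key=lambda k: topic_vector[k], reverse=True)[:kappa]
    allowed = set()
    for k in top:
        allowed.update(word for word, _ in topic_word_vectors[k])
    return [w for w in seg_words if w in allowed]
```

With $\kappa = 1$ only the dominant topic contributes words, so a word that was assigned to a minor topic of the same piece of data is dropped from the sequence.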
S3: Extraction of minority topic data
S3.1: Define the minority-domain KG as $G_k = (V, E)$, where $V = \{v_1, v_2, \dots, v_n\}$ is the set of nodes corresponding to the entities in the KG and $E = \{e_1, e_2, \dots, e_m\}$ is the set of edges between entities. Each edge corresponds to a triple $e_x = (v_i, v_j, label)$, where node $v_i$ is called the start node, node $v_j$ is called the end node, and $label$ is the relation label between the start node and the end node;
S3.2: Construct the domain KG, denoted $G_k$, from the minority-domain knowledge $Z$;
S3.2.1: First obtain the minority-domain knowledge $Z = \langle term, attributes, addition \rangle$ from domain experts. For each element of $Z$ in turn, express its entity name $v_i$ and the domain name $v_0$ as a triple $(v_0, v_i, label)$, where $label$ is taken from the $attributes$ of $v_i$ and serves as the relation label of the edge from $v_0$ to $v_i$;
S3.2.2: Then build, in turn, the triples $(v_i, v_j, label)$ between elements $v_i$ and $v_j$, where $label$ is obtained from the additional information $addition$ of the node and serves as the relation label of the edge from $v_i$ to $v_j$. If $v_i$ and $v_j$ are unrelated, the corresponding edge does not exist. All triples together form the minority-domain KG $G_k$;
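The KG construction in S3.2 can be sketched as a list of triples. How two entities are judged to be related is not spelled out in the text, so this sketch makes a loud assumption: an edge from $v_i$ to $v_j$ is created when the name of $v_i$ appears in the `addition` text of $v_j$, and the whole `addition` text is used as the label. The function name and the sample entries in the usage note are hypothetical.

```python
def build_domain_kg(domain_name, knowledge):
    """Build KG triples (start, end, label) from expert knowledge Z.

    knowledge -- list of (term, attributes, addition) entries.
    """
    triples = []
    terms = [term for term, _, _ in knowledge]

    # S3.2.1: link the domain node v_0 to every entity, labelled by its attributes.
    for term, attributes, _ in knowledge:
        triples.append((domain_name, term, attributes))

    # S3.2.2: link entity pairs. ASSUMPTION: v_i relates to v_j when the
    # name of v_i occurs in the additional description of v_j.
    for v_j, _, addition in knowledge:
        for v_i in terms:
            if v_i != v_j and v_i in addition:
                triples.append((v_i, v_j, addition))
    return triples
```

For a hypothetical input with the entries `("Tibet", "place name", "Tibetan province")` and `("Lhasa", "city name", "city in Tibet")`, the result contains two edges from the domain node plus one edge from "Tibet" to "Lhasa". The same function can build the noise-domain KG of S3.3 from the domain-irrelevant knowledge.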
S3.3: Data that do not belong to the domain of interest but that affect the accuracy of domain data extraction are called noise data. A KG of this irrelevant domain is constructed and denoted $\bar{G}_k$;
S3.3.1: Obtain the knowledge of the domain irrelevant to the minority domain, $Z = \langle term, attributes, addition \rangle$, from domain experts. For each element of $Z$ in turn, express its entity name $v_i$ and the domain name $v_0$ as a triple $(v_0, v_i, label)$, where $label$ is taken from the $attributes$ of $v_i$ and serves as the relation label of the edge from $v_0$ to $v_i$;
S3.3.2: Then build, in turn, the triples $(v_i, v_j, label)$ between elements $v_i$ and $v_j$, where $label$ is obtained from the additional information $addition$ of the node and serves as the relation label of the edge from $v_i$ to $v_j$. If $v_i$ and $v_j$ are unrelated, the corresponding edge does not exist. All triples together form the irrelevant-domain KG $\bar{G}_k$;
S3.4: Extract the minority-domain data;
S3.4.1: Give the decision parameter $\tau$, $0 \le \tau \le 1$;
S3.4.2: For data $I_i$, compute the length $m_i$ of its feature word sequence $d_i$, $m_i \ge 0$;
S3.4.3: For each word $w_{i,j}$ of $d_i$, use the associations $(v_x, v_{x+1}, label)$ between the nodes of the domain KG $G_k$ to search the adjacent nodes in turn, and count the number of words of $I_i$ that are minority-domain words, denoted $n$, $n \ge 0$;
S3.4.4: Likewise, for each word $w_{i,j}$ of $d_i$, use the associations $(v_x, v_{x+1}, label)$ between the nodes of the irrelevant-domain KG (denoted $\bar{G}_k$) to search the adjacent nodes in turn, and count the number of words of $I_i$ that are domain-irrelevant noise words, denoted $\bar{n}$;
S3.4.5: Compute the probability $p = n / m_i$ that data $I_i$ belongs to the domain $G_k$ and the probability $\bar{p} = \bar{n} / m_i$ that it belongs to the noise domain $\bar{G}_k$. If $p > \tau$ and $\bar{p} < \tau$, then $I_i$ is determined to be minority topic data and is added to the final minority data set $D$;
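The decision in S3.4 can be sketched as follows. Two simplifications are assumed here and are not confirmed by the source: a word counts as a domain word when it equals any node of the corresponding KG (instead of walking adjacent nodes edge by edge), and each probability is the matched-word count divided by the length of the feature word sequence. All names are illustrative.

```python
def kg_nodes(triples):
    """Collect every start and end node of a KG given as (start, end, label)."""
    nodes = set()
    for start, end, _ in triples:
        nodes.add(start)
        nodes.add(end)
    return nodes


def is_domain_data(feature_seq, domain_triples, noise_triples, tau):
    """Return True when the data passes both the domain and the noise test."""
    m = len(feature_seq)
    if m == 0:
        return False  # no feature words, nothing to match
    domain_nodes = kg_nodes(domain_triples)
    noise_nodes = kg_nodes(noise_triples)
    n_domain = sum(1 for w in feature_seq if w in domain_nodes)
    n_noise = sum(1 for w in feature_seq if w in noise_nodes)
    p_domain = n_domain / m   # share of minority-domain words
    p_noise = n_noise / m     # share of noise-domain words
    return p_domain > tau and p_noise < tau
```

The second condition is what implements the reverse filtering: a piece of data that mentions a domain entity is still rejected when too many of its feature words belong to the noise KG.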
Through the above steps the invention extracts minority topic data from the new media environment. To make the extraction more accurate and efficient, the parameters of the method are further restricted and optimized. In step S2.3.1, the number of iterations $N_{iter}$ determines both the efficiency of the method and the accuracy of the result. If the number of iterations is too small, the topic $z_{i,j}$ of each word $w_{i,j}$ has not converged and the topic feature words are inaccurate; if it is too large, iterations continue after convergence, which increases the time cost and lowers efficiency. $N_{iter}$ is therefore set as a function of $S$, the total number of words in the dictionary, with rounding up ($\lceil \cdot \rceil$), so that the number of iterations is tied directly to the amount of data. In addition, the parameter $\alpha$ takes the value $\alpha = 0.5$ when $K \le 40$ and $\alpha = 20/K$ when $K > 40$, while $\beta$ takes the value 0.01. $\kappa$ is set as a function of $K$, so that as the number of topics $K$ increases, the number of feature words obtained from the high-frequency word vectors of the top-$\kappa$ topics also increases. Finally, in step S3.4.1, the domain decision parameter $\tau$ is restricted to $0.05 \le \tau \le 0.15$, which gives a more accurate judgment of the domain to which the data belong.
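The parameter rules that are stated numerically above can be captured in a few lines. This sketch covers only $\alpha$, $\beta$ and the range check for $\tau$; the names `default_alpha`, `DEFAULT_BETA` and `check_tau` are hypothetical.

```python
DEFAULT_BETA = 0.01  # value of beta given in the text


def default_alpha(num_topics):
    """alpha = 0.5 for K <= 40, otherwise 20 / K."""
    if num_topics < 1:
        raise ValueError("the number of topics K must be at least 1")
    return 0.5 if num_topics <= 40 else 20 / num_topics


def check_tau(tau):
    """Accept tau only inside the recommended range [0.05, 0.15]."""
    if not 0.05 <= tau <= 0.15:
        raise ValueError("tau should lie in [0.05, 0.15]")
    return tau
```

The two branches of `default_alpha` agree at $K = 40$, so the value of $\alpha$ changes continuously as the number of topics grows.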
Detailed Description
Specific embodiments of the invention are described below with reference to the accompanying drawings so that those skilled in the art can better understand the invention. Note that detailed descriptions of known functions and designs are omitted below where they would obscure the subject matter of the invention.
Example: extraction of "Tibetan" data from Sina Weibo.
Step one: preprocessing
Microblog data are acquired from the Sina Weibo platform; a single piece of microblog data is shown in Table 1.
Table 1 Microblog data example
For convenience of description, the additional information items $A_i$ are omitted in the following description of data extraction. The Sina Weibo data obtained comprise 5 microblogs, a1 to a5, as shown in Table 2.
Table 2 Sina Weibo data
Then the text $T_i$ of each microblog is segmented. A word segmentation tool that supports custom dictionaries and stop words is selected, the Tibetan domain knowledge $Z$ = {<Tibet, place name, Tibetan province>, <Lhasa, city name, Tibet province>, <Gongga, place name, southern Tibet>, <Tibetan region, vague place name, refers to areas where Tibetans live>, <Tibetan people, collection of people, refers to the Tibetan nationality>, <Hada, object name, Tibetan ceremonial silk fabric>} is introduced, and the Tibetan-domain words are added to the dictionary of the word segmentation tool. The segmentation result is denoted $Seg\_T_i$, as shown in Table 3.
Table 3 Word segmentation results of the Sina Weibo data
Step two: topic analysis and feature extraction
The microblog data are read, and the dictionary is built in order of first appearance without repetition: $W$ = {1: Tibet, 2: peace, 3: liberation, 4: residence, 5: Tibetan people, 6: Gutianle, 7: fund, 8: donation, 9: aid, 10: school, 11: Tibetan region, 12: Hada, 13: vacation, 14: Shangri-La, 15: Qinghai Lake, 16: Lhasa, 17: trip, 18: guide, 19: Gongga, 20: airport, 21: kilometers}.
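Building the dictionary by first appearance and without repetition can be sketched as below. The function name is illustrative and the indices start at 1 to match the numbering used for $W$.

```python
def build_dictionary(segmented_docs):
    """Map each distinct word to a 1-based index in order of first appearance."""
    index = {}
    for doc in segmented_docs:
        for word in doc:
            if word not in index:
                index[word] = len(index) + 1
    return index
```

Because Python dictionaries preserve insertion order, iterating over the result lists the words in the same order as their indices.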
The number of iterations is set to $N_{iter} = 10$, the total number of topics to $K = 3$, and the parameters to $\alpha = 0.5$, $\beta = 0.01$ and $\kappa = 1$. The topic of each word is sampled, then the word of each topic is sampled, and the procedure is iterated 10 times so that every word is assigned to a topic.
The topics obtained for all the words are as follows:
$z_1$, travel: {vacation, trip, guide, Lhasa, airport, kilometers, Shangri-La, Qinghai Lake, Tibet, Gongga}
$z_2$, culture: {peace, liberation, residence, school, Hada, Tibetan region, Tibetan people}
$z_3$, public welfare: {fund, donation, aid, Gutianle}
The number of words assigned to each topic in each microblog and the total number of words of each topic are then counted over all the microblog data. Taking microblog a1 as an example, the counts for the three topics are substituted into equation (2), which gives the topic vector of a1 as (0.2143, 0.75, 0.0357). Similarly, the topic vector of a2 is (0.0357, 0.2143, 0.75), the topic vector of a3 is (0.0435, 0.9130, 0.0435), the topic vector of a4 is (0.9130, 0.0435, 0.0435), and the topic vector of a5 is (0.9583, 0.0208, 0.0208).
For each topic $z_k$, the probability $\delta_{t,k}$ of each of its words is computed. Taking topic $z_1$ (travel) as an example, the dictionary $W$ gives $t = 1$ for "Tibet", $t = 13$ for "vacation" and $t = 16$ for "Lhasa", and $\delta_{t,1}$ is computed by equation (1) with $\beta = 0.01$ and $S = 21$:

$$\delta_{1,1} = \frac{2 + 0.01}{14 + 21 \times 0.01} \approx 0.1414, \quad \delta_{13,1} = \frac{2 + 0.01}{14 + 21 \times 0.01} \approx 0.1414, \quad \delta_{16,1} = \frac{3 + 0.01}{14 + 21 \times 0.01} \approx 0.2118$$

Sorting by $\delta_{t,1}$ in descending order gives the high-frequency word vector of topic $z_1$: $\Delta_1$ = (("Lhasa", 0.2118), ("vacation", 0.1414), ("Tibet", 0.1414), ("trip", 0.0711), ("guide", 0.0711), …). In the same way, the high-frequency word vector of topic $z_2$ is $\Delta_2$ = (("school", 0.2182), ("Tibetan people", 0.2182), ("peace", 0.1097), ("Hada", 0.1097), ("liberation", 0.1097), …), and the high-frequency word vector of topic $z_3$ is $\Delta_3$ = (("fund", 0.2399), ("donation", 0.2399), ("aid", 0.2399), ("Gutianle", 0.2399)).
Taking $\kappa = 1$, the words of $Seg\_T_i$ of each microblog are matched against the words in the high-frequency word vector of its top-1 topic to obtain the feature word sequence of the data. Taking microblog a1 as an example, its top-1 topic is $z_2$, so the high-frequency word vector $\Delta_2$ of topic $z_2$ is matched against $Seg\_T_1$ of a1, which gives $d_1$ = <"peace", "liberation", "residence", "Tibetan people">. The other sequences are obtained in the same way:
$d_2$ = <"Gutianle", "fund", "donation", "aid">
$d_3$ = <"Tibetan region", "school", "Tibetan people", "Hada">
$d_4$ = <"vacation", "Shangri-La", "Qinghai Lake", "Lhasa">
$d_5$ = <"trip", "guide", "vacation", "Tibet", "Lhasa", "Gongga", "airport", "Lhasa", "kilometers">
Step three: minority data extraction
First, the domain KG is constructed from the Tibetan domain knowledge $Z$ = {<Tibet, place name, Tibetan province>, <Lhasa, city name, Tibet province>, <Gongga, place name, southern Tibet>, <Tibetan region, vague place name, refers to areas where Tibetans live>, <Tibetan people, collection of people, refers to the Tibetan nationality>, <Hada, object name, Tibetan ceremonial silk fabric>}.
For each element of $Z$ in turn, its entity name $v_i$ and the domain name $v_0$ are expressed as a triple $(v_0, v_i, label)$, such as ("Tibetan", "Tibet", "place name"). Then the triples $(v_i, v_j, label)$ between elements $v_i$ and $v_j$ are built in turn, where $label$ is obtained from the additional information of the node, such as ("Tibet", "Lhasa", "province"). A graphical representation of the result is shown in FIG. 3.
In the same way, from the knowledge of the domain-irrelevant "travel" topic, $Z$ = {<Yunnan, province name, tourist province>, <Qinghai, province name, tourist province>, <Shangri-La, place name, tourist attraction in Yunnan>, <Qinghai Lake, lake, tourist attraction in Qinghai>}, the travel KG $\bar{G}_k$, which is irrelevant to the Tibetan domain, is constructed, as shown in FIG. 4.
The parameter is set to $\tau = 0.1$. For microblog a1, the length of its feature word sequence $d_1$ is $m_1 = 4$. Each word of $d_1$ is looked up among the nodes and edges of the domain KG $G_k$ and of the noise KG $\bar{G}_k$, and the matching words are counted. From these counts, the probability $p$ that a1 belongs to the domain $G_k$ and the probability $\bar{p}$ that it belongs to the noise domain $\bar{G}_k$ are obtained. Since $p > \tau$ and $\bar{p} < \tau$, microblog a1 belongs to the "Tibetan" domain and is added to the extracted Tibetan data set $D$. In the same way, a3 and a5 also belong to the "Tibetan" domain. For a4, the two probabilities do not satisfy this condition, so a4 is noise data irrelevant to the "Tibetan" domain.
The extraction results of the "Tibetan" subject data are shown in Table 4.
Table 4 "Tibetan" topic data extraction results
Drawings
FIG. 1 is a flow chart of the implementation of the invention. The method comprises three steps: preprocessing of new media data, topic analysis and feature extraction, and extraction of minority data.
FIG. 2 is the LDA graphical model.
FIG. 3 is a graphical illustration of the Tibetan domain knowledge graph in the embodiment.
FIG. 4 is a graphical illustration of the knowledge graph corresponding to the noise data in the embodiment.

Claims (3)

1. A method for extracting ethnic-minority topic data in a new media environment, characterized by comprising the following steps:
S1: Data preprocessing
S1.1: Obtain $M$ pieces of media data $I = \{I_1, I_2, \ldots, I_M\}$ from social networks or news web pages, where $I_i$ denotes the $i$-th piece of data, $0 \le i \le M$. Each $I_i$ is represented by a triple $(id, T_i, A_i)$, where $id$ is the identifier of the data instance, $T_i$ is the text content of $I_i$, and $A_i = \{A_{i,u}, A_{i,p}, A_{i,l}, A_{i,v}, A_{i,f}, A_{i,q}, A_{i,c}, A_{i,r}\}$ is the additional information, namely the publisher $A_{i,u}$, the release time $A_{i,p}$, the release location $A_{i,l}$, the release source $A_{i,v}$, the number of forwards $A_{i,f}$, the number of likes $A_{i,q}$, the number of comments $A_{i,c}$, and the number of reads $A_{i,r}$;
S1.2: The minority domain knowledge $Z = \langle term, attributes, addition \rangle$ is given by domain experts, where $term$ is the entity name, $attributes$ are the attributes of the entity, and $addition$ is an additional description of the entry;
S1.3: Obtain a stop-word set $Stop\_words$;
S1.4: Use a word segmentation tool to segment the text content $T_i$ of the media data. Before segmentation, add $Stop\_words$ to the tool's default stop-word set and add the set of minority-domain entity names $term$ to the tool's default vocabulary. The segmentation result of $T_i$ is stored separately at the end of the data $I_i$ and is denoted $Seg\_T_i$;
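The preprocessing in S1.1 to S1.4 can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `MediaItem` class, the `segment` and `preprocess` functions, and the use of forward maximum matching as a stand-in for the unnamed word segmentation tool are all assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class MediaItem:
    """One piece of media data I_i = (id, T_i, A_i), plus its segmentation Seg_T_i."""
    id: str
    text: str                                   # T_i
    extra: dict = field(default_factory=dict)   # A_i (publisher, time, ...)
    seg: list = field(default_factory=list)     # Seg_T_i


def segment(text, vocabulary, stop_words, max_len=6):
    """Forward maximum matching (assumed stand-in for a real segmentation tool).

    At each position the longest vocabulary word is taken; if none matches,
    a single character is taken. Stop words and whitespace are dropped.
    """
    tokens, pos = [], 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if size == 1 or piece in vocabulary:
                break
        pos += len(piece)
        if piece not in stop_words and not piece.isspace():
            tokens.append(piece)
    return tokens


def preprocess(items, domain_terms, stop_words, base_vocabulary=()):
    """Add the domain entity names to the vocabulary, then segment every item."""
    vocabulary = set(base_vocabulary) | set(domain_terms)
    for item in items:
        item.seg = segment(item.text, vocabulary, set(stop_words))
    return items
```

Adding the domain entity names to the vocabulary is what keeps multi-character terms such as "唐卡" from being split into single characters.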
S2: Topic analysis and feature extraction
S2.1: Define a dictionary $W = \{w_1, w_2, \ldots, w_S\}$ that stores all the words contained in the data, where $S$ is the total number of words in the dictionary and $w_i \ne w_j$ for $1 \le i, j \le S$, $i \ne j$;
S2.2: Define the topic vector of data $I_i$ as $\Lambda_i = (\lambda_{1,i}, \lambda_{2,i}, \ldots, \lambda_{K,i})$, where $\lambda_{k,i}$ is the probability that the words in $I_i$ belong to topic $z_k$, $0 \le \lambda_{k,i} \le 1$. Topic $z_k$ is represented by the high-frequency word vector $\Delta_k = ((w_1, \delta_{1,k}), (w_2, \delta_{2,k}), \ldots, (w_{S_k}, \delta_{S_k,k}))$, where $S_k$ is the total number of words of $z_k$ and $\delta_{t,k}$ is the probability of the word $w_t$ in $z_k$, $0 \le \delta_{t,k} \le 1$. $\delta_{t,k}$ and $\lambda_{k,i}$ are determined by equations (1) and (2), respectively:
[equation image (1): definition of $\delta_{t,k}$]
[equation image (2): definition of $\lambda_{k,i}$]
where [symbol image] is the total number of occurrences of the word $w_t$ in topic $z_k$, [symbol image] is the number of words in $I_i$ that belong to topic $z_k$, $S$ is the total number of words in the dictionary, $K$ is the total number of topics, [symbol image] is the hyperparameter of the $t$-dimensional Dirichlet distribution, and [symbol image] is the hyperparameter of the $k$-dimensional Dirichlet distribution;
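For orientation, a common form of these two estimates in Gibbs-sampled LDA is given below. It is an assumption, not necessarily the patent's exact equations (1) and (2): it uses symmetric hyperparameters $\beta$ and $\alpha$, and the count symbols $n_k^{(t)}$ (occurrences of word $w_t$ assigned to topic $z_k$) and $n_i^{(k)}$ (words of $I_i$ assigned to topic $z_k$) are chosen here for illustration.

```latex
\delta_{t,k} = \frac{n_k^{(t)} + \beta}{\sum_{t'=1}^{S} n_k^{(t')} + S\beta},
\qquad
\lambda_{k,i} = \frac{n_i^{(k)} + \alpha}{\sum_{k'=1}^{K} n_i^{(k')} + K\alpha}
```

With this form, $\sum_{t=1}^{S} \delta_{t,k} = 1$ for every topic and $\sum_{k=1}^{K} \lambda_{k,i} = 1$ for every piece of data, so both vectors are proper probability distributions.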
S2.3: Sample topics and words;
S2.3.1: Given the number of iterations $N_{iter}$, $N_{iter} \ge 1$; the total number of topics $K$, $K \ge 1$; and the parameters $\alpha$, $\beta$, $\kappa$, with $0 < \alpha, \beta < 1$ and $\kappa \ge 1$;
S2.3.2: For each topic $z_k$, sample the probability distribution of the words in the topic, $\phi_k \sim Dir(\beta)$, where $Dir(\beta)$ denotes the Dirichlet distribution with hyperparameter $\beta$;
S2.3.3: For data $I_i$, sample the topic probability distribution of the data, $\theta_i \sim Dir(\alpha)$, where $Dir(\alpha)$ denotes the Dirichlet distribution with hyperparameter $\alpha$. For each word in $Seg\_T_i$, sample the topic of the word, $z_{i,j} \sim Mult(\theta_i)$, and sample the word of the topic, $w_{i,j} \sim Mult(\phi_{z_{i,j}})$, where $Mult(\theta_i)$ and $Mult(\phi_{z_{i,j}})$ denote the multinomial distributions with parameters $\theta_i$ and $\phi_{z_{i,j}}$, respectively. Count the total number of words of topic $z_k$, [symbol image], and the number of words in data $I_i$ that belong to topic $z_k$, [symbol image];
S2.3.4: Repeat S2.3.3 for $N_{iter}$ iterations until the topic $z_{i,j}$ of each word $w_{i,j}$ converges, that is, until the topic to which each word belongs no longer changes;
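One standard way to carry out this kind of repeated topic sampling is collapsed Gibbs sampling, sketched below in plain Python. The claim does not name the inference procedure, so the choice of collapsed Gibbs sampling, the function name `gibbs_lda`, symmetric `alpha` and `beta`, and the fixed iteration count (rather than a convergence check) are all assumptions of this sketch.

```python
import random


def gibbs_lda(docs, K, alpha, beta, n_iter, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenised documents.

    Returns (assignments, topic_word, doc_topic), where assignments[i][j] is the
    topic index of word j in document i, topic_word[k][w] counts word w in topic k,
    and doc_topic[i][k] counts the words of document i assigned to topic k.
    """
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    S = len(vocab)
    topic_word = [dict.fromkeys(vocab, 0) for _ in range(K)]
    topic_total = [0] * K
    doc_topic = [[0] * K for _ in docs]
    assignments = []
    # Random initial topic for every word occurrence.
    for i, doc in enumerate(docs):
        zs = []
        for w in doc:
            k = rng.randrange(K)
            zs.append(k)
            topic_word[k][w] += 1
            topic_total[k] += 1
            doc_topic[i][k] += 1
        assignments.append(zs)
    for _ in range(n_iter):
        for i, doc in enumerate(docs):
            for j, w in enumerate(doc):
                k = assignments[i][j]
                # Remove the current assignment from the counts.
                topic_word[k][w] -= 1
                topic_total[k] -= 1
                doc_topic[i][k] -= 1
                # Conditional weight of each topic for this word occurrence.
                weights = [
                    (doc_topic[i][t] + alpha)
                    * (topic_word[t][w] + beta) / (topic_total[t] + S * beta)
                    for t in range(K)
                ]
                k = rng.choices(range(K), weights=weights)[0]
                assignments[i][j] = k
                topic_word[k][w] += 1
                topic_total[k] += 1
                doc_topic[i][k] += 1
    return assignments, topic_word, doc_topic
```

The two count tables returned here play the role of the per-topic and per-data word counts that the step asks to be tallied.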
S2.4: Obtain the high-frequency word vector of each topic $z_k$ and the topic vector of each piece of data $I_i$;
S2.4.1: Read the words $w_{i,j}$ of each piece of data $I_i$ and their corresponding topics $z_{i,j}$, and count the total number of words $w_{i,j}$ with $z_{i,j} = z_k$, [symbol image], and the number of words $w_{i,j}$ in data $I_i$ with $z_{i,j} = z_k$, [symbol image];
S2.4.2: Compute the probability $\delta_{t,k}$ of each word $w_t$ in each topic $z_k$ according to formula (1), and sort by $\delta_{t,k}$ in descending order to obtain the high-frequency word vector of topic $z_k$, $\Delta_k = ((w_1, \delta_{1,k}), (w_2, \delta_{2,k}), \ldots, (w_{S_k}, \delta_{S_k,k}))$, $0 \le k \le K$;
S2.4.3: Compute the probability $\lambda_{k,i}$ that the words in each piece of data $I_i$ belong to topic $z_k$ according to formula (2), and sort by $\lambda_{k,i}$ in descending order to obtain the topic vector of data $I_i$, $\Lambda_i = (\lambda_{1,i}, \lambda_{2,i}, \ldots, \lambda_{K,i})$;
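The counting and sorting in S2.4 can be sketched as follows. The function name `topic_vectors` and the smoothed estimates with symmetric `alpha` and `beta` are assumptions standing in for formulas (1) and (2); the sketch also lists every dictionary word per topic rather than only the $S_k$ words of the topic.

```python
from collections import Counter


def topic_vectors(docs, assignments, K, alpha, beta):
    """Return (Delta, Lambda) from per-word topic assignments.

    Delta[k]  : list of (word, probability of the word in topic k), descending.
    Lambda[i] : list of (topic index, probability of the topic in data i), descending.
    """
    vocab = sorted({w for doc in docs for w in doc})
    S = len(vocab)
    topic_word = [Counter() for _ in range(K)]   # word counts per topic
    doc_topic = [Counter() for _ in docs]        # topic counts per piece of data
    for i, (doc, zs) in enumerate(zip(docs, assignments)):
        for w, k in zip(doc, zs):
            topic_word[k][w] += 1
            doc_topic[i][k] += 1
    Delta = []
    for k in range(K):
        total = sum(topic_word[k].values())
        probs = [(w, (topic_word[k][w] + beta) / (total + S * beta)) for w in vocab]
        Delta.append(sorted(probs, key=lambda p: p[1], reverse=True))
    Lambda = []
    for i, doc in enumerate(docs):
        probs = [(k, (doc_topic[i][k] + alpha) / (len(doc) + K * alpha))
                 for k in range(K)]
        Lambda.append(sorted(probs, key=lambda p: p[1], reverse=True))
    return Delta, Lambda
```

Keeping the topic index next to each probability in `Lambda` preserves which topic a value belongs to after the descending sort.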
S2.5: Obtain the feature word sequence of the data;
S2.5.1: Read the topic vector $\Lambda_i$ of data $I_i$, sort by $\lambda_{k,i}$ in descending order, and take the top-$\kappa$ topics;
S2.5.2: Map and match the words in $Seg\_T_i$ against the words in the high-frequency word vectors $\Delta_k$ of the above top-$\kappa$ topics, and record the union of the two as $d_i = \langle w_{i,1}, w_{i,2}, \ldots, w_{i,m_i} \rangle$, the feature word sequence of data $I_i$;
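A minimal sketch of S2.5 follows. It reads "mapped and matched" as keeping the words of $Seg\_T_i$ that also occur among the high-frequency words of the top-$\kappa$ topics; that reading, the function name `feature_words`, and the `top_words` cutoff on each topic's word list are assumptions of the example.

```python
def feature_words(seg, Lambda_i, Delta, kappa, top_words=20):
    """Build the feature word sequence d_i for one piece of data.

    seg      : segmented words of the data (Seg_T_i)
    Lambda_i : list of (topic index, probability), in any order
    Delta    : Delta[k] is a list of (word, probability) sorted descending
    kappa    : number of top topics to use
    """
    ranked = sorted(Lambda_i, key=lambda p: p[1], reverse=True)
    top_topics = [k for k, _ in ranked[:kappa]]
    # High-frequency words of the selected topics (hypothetical cutoff top_words).
    high_freq = {w for k in top_topics for w, _ in Delta[k][:top_words]}
    d_i = []
    for w in seg:
        if w in high_freq and w not in d_i:   # keep first occurrence, preserve order
            d_i.append(w)
    return d_i
```

The length of the returned list is the value $m_i$ used later in the decision step.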
S3: Extraction of minority topic data
S3.1: Define the minority domain KG as $G_k = (V, E)$, where $V = \{v_1, v_2, \ldots, v_n\}$ is the set of nodes corresponding to entities in the KG and $E = \{e_1, e_2, \ldots, e_m\}$ is the set of edges between entities. Each edge corresponds to a node triple $e_x = (v_i, v_j, label)$, where node $v_i$ is called the start point, node $v_j$ is called the end point, and $label$ is the relationship label between the start point and the end point;
S3.2: Construct the domain KG, denoted $G_k$, from the minority domain knowledge $Z$;
S3.2.1: First, obtain the minority domain knowledge $Z = \langle term, attributes, addition \rangle$ from domain experts. Take each entity name $v_i$ in $Z$ in turn and express it together with the ethnic group name $v_0$ as a triple $(v_0, v_i, label)$, where $label$ takes the attributes of $v_i$ as the relationship label of [symbol image];
S3.2.2: Then build the triple $(v_i, v_j, label)$ for each pair of elements $v_i$ and $v_j$ in turn, where $label$ is the relationship label of [symbol image] obtained from the additional information $addition$ of the node. If $v_i$ and $v_j$ are not related, the corresponding edge does not exist. All the triples together form the minority domain KG $G_k$;
S3.3: Domain-irrelevant data that does not belong to the domain of interest but affects the accuracy of domain data extraction is called noise data. An irrelevant-domain KG is constructed for it and is denoted $\overline{G}_k$;
S3.3.1: Obtain the irrelevant-domain knowledge $Z = \langle term, attributes, addition \rangle$ from domain experts. Take each entity name $v_i$ in $Z$ in turn and express it together with the ethnic group name $v_0$ as a triple $(v_0, v_i, label)$, where $label$ takes the attributes of $v_i$ as the relationship label of [symbol image];
S3.3.2: Then build the triple $(v_i, v_j, label)$ for each pair of elements $v_i$ and $v_j$ in turn, where $label$ is the relationship label of [symbol image] obtained from the additional information $addition$ of the node. If $v_i$ and $v_j$ are not related, the corresponding edge does not exist. All the triples together form the irrelevant-domain KG $\overline{G}_k$;
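The two-pass construction used for both the domain KG and the irrelevant-domain KG can be sketched as below. The function name `build_kg` and the data layout are assumptions: each knowledge entry is taken to be a dict with `term`, `attributes`, and an `addition` mapping from a related term to a relation label, which is one possible encoding of the expert-supplied knowledge and not a format stated in the claim.

```python
def build_kg(root, knowledge):
    """Build a KG as a list of (start, end, label) triples.

    root      : the name v_0 to which every entity is attached
    knowledge : list of entries, each a dict with keys
                "term" (entity name), "attributes" (label of the edge from v_0)
                and "addition" (dict: related term -> relation label).
    """
    terms = {entry["term"] for entry in knowledge}
    triples = []
    # First pass: connect v_0 to every entity, labelled by the entity's attributes.
    for entry in knowledge:
        triples.append((root, entry["term"], entry["attributes"]))
    # Second pass: connect related entities, labelled from `addition`.
    for entry in knowledge:
        for other, label in entry.get("addition", {}).items():
            if other in terms:   # unrelated or unknown terms get no edge
                triples.append((entry["term"], other, label))
    return triples
```

Calling the same function once with the domain knowledge and once with the irrelevant-domain knowledge yields the two graphs used in the extraction step.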
S3.4: Extract the minority domain data;
S3.4.1: Given a decision parameter $\tau$, $0 \le \tau \le 1$;
S3.4.2: For data $I_i$, compute the length $m_i$ of its feature word sequence $d_i$, $m_i \ge 0$;
S3.4.3: For each word $w_{i,j}$ of $d_i$, use the associations $(v_x, v_{x+1}, label)$ between the nodes of $G_k$ to search the adjacent nodes in turn, and count the number of words in data $I_i$ that are minority-domain words, denoted $n$, $n \ge 0$;
S3.4.4: Likewise, for each word $w_{i,j}$ of $d_i$, use the associations $(v_x, v_{x+1}, label)$ between the nodes of $\overline{G}_k$ to search the adjacent nodes in turn, and count the number of words in data $I_i$ that are domain-irrelevant noise words, denoted [symbol image];
S3.4.5: Compute the probability that data $I_i$ belongs to the domain $G_k$, [formula image], and the probability that data $I_i$ belongs to the irrelevant domain $\overline{G}_k$, [formula image]. If $p > \tau$ and the irrelevant-domain probability [symbol image] is less than $\tau$, determine that data $I_i$ is minority topic data and add $I_i$ to the final minority data set $D$.
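The decision in S3.4 can be sketched as follows. Two assumptions are made and should be checked against the original formula images: each probability is taken to be the matched word count divided by the sequence length $m_i$, and the neighbour-by-neighbour graph search is simplified to a lookup in the set of all node names and relation labels. The function names are hypothetical.

```python
def kg_vocabulary(triples):
    """All node names and relation labels that appear in a KG."""
    words = set()
    for start, end, label in triples:
        words.update((start, end, label))
    return words


def is_domain_data(d_i, domain_kg, noise_kg, tau):
    """Return (decision, p, p_noise) for one feature word sequence d_i."""
    m = len(d_i)
    if m == 0:
        return False, 0.0, 0.0
    domain_words = kg_vocabulary(domain_kg)
    noise_words = kg_vocabulary(noise_kg)
    n = sum(1 for w in d_i if w in domain_words)       # minority-domain words
    n_noise = sum(1 for w in d_i if w in noise_words)  # noise words
    p, p_noise = n / m, n_noise / m                    # assumed ratio form
    return (p > tau and p_noise < tau), p, p_noise
```

Data for which the decision is true would be appended to the data set $D$; everything else, including data dominated by noise words, is left out.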
2. The method for extracting minority topic data in a new media environment as claimed in claim 1, characterized in that in step S2.3.1 the parameters take the values [formula image], where [symbol image] denotes rounding up to an integer; $\alpha = 0.5$ when $K \le 40$ and $\alpha = 20/K$ when $K > 40$; $\beta = 0.01$; and $\kappa =$ [formula image].
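The piecewise rule for $\alpha$ stated in this claim can be written as a small helper. Only the $\alpha$ and $\beta$ values come from the claim text; the function name `choose_alpha` and the constant name `BETA` are hypothetical, and the values of the remaining parameters are not covered because their formulas are not legible here.

```python
def choose_alpha(K):
    """alpha = 0.5 for K <= 40, otherwise 20 / K."""
    if K < 1:
        raise ValueError("K must be at least 1")
    return 0.5 if K <= 40 else 20 / K


BETA = 0.01  # fixed value of beta stated in the claim
```

The two branches agree at the boundary, since $20/40 = 0.5$, so $\alpha$ decreases continuously as the number of topics grows beyond 40.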
3. The method for extracting minority topic data in a new media environment as claimed in claim 1, characterized in that in step S3.4.1 the decision parameter $\tau$ takes values in the range $0.05 \le \tau \le 0.15$.
CN201810969312.1A 2018-08-23 2018-08-23 Method for extracting minority subject data in new media environment Active CN109241273B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810969312.1A CN109241273B (en) 2018-08-23 2018-08-23 Method for extracting minority subject data in new media environment

Publications (2)

Publication Number Publication Date
CN109241273A CN109241273A (en) 2019-01-18
CN109241273B true CN109241273B (en) 2022-02-18

Family

ID=65069466

Country Status (1)

Country Link
CN (1) CN109241273B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110110013B (en) * 2019-05-10 2020-03-24 成都信息工程大学 Entity competition relation data mining method based on space-time attributes

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103577549A (en) * 2013-10-16 2014-02-12 复旦大学 Crowd portrayal system and method based on microblog label
CN103699663A (en) * 2013-12-27 2014-04-02 中国科学院自动化研究所 Hot event mining method based on large-scale knowledge base
CN104217038A (en) * 2014-09-30 2014-12-17 中国科学技术大学 Knowledge network building method for financial news
CN105653706A (en) * 2015-12-31 2016-06-08 北京理工大学 Multilayer quotation recommendation method based on literature content mapping knowledge domain
CN106156090A (en) * 2015-04-01 2016-11-23 上海宽文是风软件有限公司 A kind of designing for manufacturing knowledge personalized push method of knowledge based collection of illustrative plates (Man-tree)
CN106909643A (en) * 2017-02-20 2017-06-30 同济大学 The social media big data motif discovery method of knowledge based collection of illustrative plates
CN106960025A (en) * 2017-03-19 2017-07-18 北京工业大学 A kind of personalized literature recommendation method based on domain knowledge collection of illustrative plates

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008106B (en) * 2013-02-25 2018-07-20 腾讯科技(深圳)有限公司 A kind of method and device obtaining much-talked-about topic

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
SeCo-LDA Mining Service Co-occurrence Topics for Recommendation;Zhenfeng Gao et al.;《2016 IEEE International Conference on Web Services》;20160901;25-32 *
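Research on the Topic Structure and Evolution of Archival Science in China Based on Topic Model;Dong Ke et al.;《信息资源管理学报》 (Journal of Information Resources Management);20170726;Vol. 7 (No. 3);97-105 *
Ethnography of Communication: A Not Entirely Complete Research Map, Based on an Examination of Chinese-Language Literature;Guo Jianbin;《新闻大学》;20180415 (No. 2);1-17,149 *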

Also Published As

Publication number Publication date
CN109241273A (en) 2019-01-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant