CN109241273B

CN109241273B - Method for extracting minority subject data in new media environment

Info

Publication number: CN109241273B
Application number: CN201810969312.1A
Authority: CN
Inventors: 岳昆; 麻友; 李维华; 王笑一; 郭建斌
Original assignee: Yunnan University YNU
Current assignee: Yunnan University YNU
Priority date: 2018-08-23
Filing date: 2018-08-23
Publication date: 2022-02-18
Anticipated expiration: 2038-08-23
Also published as: CN109241273A

Abstract

The invention discloses a method for acquiring data from a new media platform and extracting minority topic data. To handle the massive, unstructured and multi-topic nature of new media data, a latent Dirichlet allocation (LDA) model is applied to the preprocessed data for feature extraction, topic analysis and latent topic mining. A knowledge graph (KG) is then constructed from knowledge of the minority domain, and this domain KG guides the extraction of minority topic data. The parameters of the LDA model and of the KG-guided extraction are set according to the data scale, which optimizes the algorithm and makes new media data extraction accurate, efficient and scalable.

Description

Method for extracting minority subject data in new media environment

Technical Field

The invention relates to a method for acquiring data from a new media platform and extracting minority topic data. Specifically, it relates to a method that performs latent topic analysis and feature extraction on new media data based on Latent Dirichlet Allocation (LDA) and extracts minority topic data using a domain Knowledge Graph (KG). It belongs to the field of data processing and knowledge discovery.

Background

New media are media forms that have emerged relative to traditional media such as newspapers, radio and television. They include network media, mobile media and digital television, and are characterized by interactivity, immediacy, massive scale, sharing, multimedia, hypertext, personalization and socialization. As new media play an increasingly important role in information dissemination, the processing and analysis of network media data has drawn wide attention from scholars in China and abroad. Data can be divided by their main content, and data describing the same type of content are said to share a topic, such as tourism, entertainment, or film and television. Obtaining data on a specific topic from massive, heterogeneous network media data, and performing topic analysis, content screening and information filtering for different domains, is an important research problem in new media data processing and knowledge discovery. It is also an important basis for decision support, influence prediction, knowledge base construction and public opinion analysis.

In addition, as China's cultural strategy advances, research on acquiring, analyzing and using minority information keeps growing. Massive new media data contain a large amount of valuable minority topic data, for example microblog posts about minority travel information, cultural exchange, hot issues and news events, and these can enrich the data sources for research and development on minority topics. Studying the extraction of minority topic data in the new media environment is therefore significant. It supports processing and analyzing massive new media data and discovering data-driven knowledge for practical problems in minority politics, economy and culture, and it supports data-intensive applications such as public opinion monitoring and policy making in minority regions and the dissemination and heritage protection of minority cultures.

Data extraction is the process of extracting target data from source data. Many results on data extraction are known, and the techniques differ with the data and the application. For example, liu jin et al. (master's thesis, university of science and technology, 2016) studied the extraction of social relations between persons from Chinese news data based on unsupervised learning; wu self tiger et al. (master's thesis, university of liaoning, 2017) proposed a template-based method for extracting network public opinion data; zhang et al. (master's thesis, liberation military information engineering, 2017) proposed a relation extraction model for network text data based on a deeply fused convolutional neural network; yao xiao peng et al. (computer application and software, 2018) proposed a global-schema method for extracting and mining deep-web data; yuanhua et al. (management engineering report, 2018) proposed a method for extracting hot topics and their feature words from massive data, analyzing the relation between hot topic words and local feature words in massive user-generated content; and Hades et al. (patent CN107436902A, 2017) proposed a data extraction method and system for massive data that extracts target data separately from static and dynamic data sources. These methods complete the extraction task well for their own data sources and problems, but they are not general techniques and do not carry over to extracting minority topic data from a massive, unstructured and multi-topic new media platform.
Therefore, given the massive, unstructured and multi-topic nature of new media, the invention mines the latent topics in network new media data with the LDA model to analyze multiple topics, and then uses the feature word sequences of the data together with the entities and inter-entity relations described by the knowledge graph to extract minority topic data more accurately and comprehensively.

The LDA model is a hierarchical Bayesian model that is widely used in known research on data extraction, text mining, social networks and natural language processing. For example, liu shao et al. (computer science and report, 2015) used LDA for qualitative and descriptive topic extraction from massive movie review data; liu ice jade et al. (software science and report, 2017) studied massive e-commerce review data and extracted product features and sentiment words with a semantically constrained LDA; and zhao science et al. (patent CN107885754A, 2018) proposed a method and device for extracting credit variables from transaction data based on the LDA model. These results show that LDA is effective for topic analysis, feature extraction and text mining on massive data. On this basis, the invention further exploits the strengths of LDA in analyzing massive, unstructured and multi-topic new media data.

A KG is a semantic network that expresses entities, concepts and the relationships among them. In known research, KGs are widely used in personalized recommendation, intelligent search and knowledge discovery. For example, chendhan et al. (computer research and development, 2017) proposed a link prediction model for temporal KGs in the clinical domain; heijun et al. (computer science report, 2016) proposed a relation extraction method oriented to knowledge evolution in Chinese Wikipedia domains; and rekey et al. (patent CN108073711A, 2018) proposed a KG-based relation extraction method that extracts the path and attribute information of the KG to mine latent semantic information. Whether in medical research or in relation extraction, these results show the value of the rich prior knowledge held in a KG semantic network. However, the choice of KG for a given application also affects the efficiency and effectiveness of the work. For each application scenario and research domain, a corresponding KG must be constructed that covers the knowledge and semantic relations of that domain as completely as possible, so as to improve the accuracy and efficiency of data extraction.

However, minority topics suffer from limited data sources, obscure knowledge and cultural differences, so interdisciplinary research on them is relatively difficult, even though such research has become necessary for many current topics. How to use the large amount of data in new media and extract valuable data from it has become the foundation of the related research.

Therefore, the invention addresses the extraction of minority topic data from new media. It builds on large-scale data from a new media platform and on knowledge of the minority domain, and aims to extract minority topic data from massive, unstructured and multi-topic new media data. An LDA model mines latent multi-topic information from the unstructured data, performs topic analysis and extracts the features of the data. The rich semantic relations of a domain KG then address the highly specialized vocabulary, obscure word origins and polysemy encountered when extracting minority topic data from massive new media data. In summary, the invention provides a method for extracting minority topic data in the new media environment, lays a new technical foundation for large-scale new media data processing, analysis, prediction and decision making, and provides a reference for extracting new media data in other specific domains.

Disclosure of Invention

To overcome the efficiency bottleneck caused by obscure word origins, highly specialized vocabulary and synonymy in the minority domain, the invention provides a method, based on an LDA model and a KG, for acquiring data from a new media platform and extracting minority topic data. Given the massive, unstructured and multi-topic nature of new media data, the method extracts new media data of a specific domain accurately, efficiently and scalably.

The method comprises three steps. The first step is data preprocessing: the required new media data are obtained and a word segmentation tool segments the data content. Domain vocabulary from the minority domain under study is added so that domain terms are segmented correctly, and customized stop words are added to simplify the preprocessing result. The second step is topic analysis and feature extraction: an LDA model iterates over the preprocessed data to mine the latent topics, yielding a topic vector for each piece of data and a high-frequency word vector for each topic. The high-frequency word vectors of the topics to which a piece of data belongs are matched against its content to obtain its feature word sequence. The third step is KG-based extraction of minority topic data: knowledge of the minority domain is built into a domain KG whose rich semantic relations serve as prior knowledge, and the feature word sequences from the second step are matched against it to filter out the minority topic data. At the same time, a KG is built from domain-irrelevant noise data to perform reverse filtering and improve extraction accuracy.

The method comprises the following steps:

S1: Data preprocessing

S1.1: Obtain M pieces of media data I = {I_1, I_2, …, I_M} from social networks or news web pages, where I_i denotes the i-th piece of data, 0 ≤ i ≤ M. I_i is represented by a triple (id, T_i, A_i), where id is the unique identifier of the data, T_i is the text content of I_i, and A_i = {A_{i,u}, A_{i,p}, A_{i,l}, A_{i,v}, A_{i,f}, A_{i,q}, A_{i,c}, A_{i,r}} is the additional information: the publisher A_{i,u}, publication time A_{i,p}, publication location A_{i,l}, publication source A_{i,v}, number of forwards A_{i,f}, number of likes A_{i,q}, number of comments A_{i,c} and read time of the data A_{i,r};

S1.2: The minority domain knowledge Z = <term, attributes, addition> is given by domain experts, where term is the entity name, attributes are the attributes of the entity, and addition is an additional description of the entry;

S1.3: Obtain the stop word set Stop_words;

S1.4: Use a word segmentation tool to segment the text content T_i of the media data. Before segmentation, add Stop_words to the default stop word set of the tool and add the set of entity names term from the minority domain knowledge to the default vocabulary of the tool. The segmentation result of T_i is stored separately at the end of I_i and is denoted Seg_T_i;
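The effect of step S1.4 can be illustrated with a minimal sketch. The source does not name a segmentation tool, so the code below substitutes a simple forward maximum matching segmenter for it; the function name `segment`, the default vocabulary, the stop words, the sample sentence and the Chinese renderings of the domain terms are all hypothetical and chosen only for illustration.

```python
def segment(text, vocabulary, stop_words, max_len=4):
    """Forward maximum matching: at each position take the longest vocabulary word,
    falling back to a single character, then drop stop words."""
    tokens, pos = [], 0
    while pos < len(text):
        for size in range(min(max_len, len(text) - pos), 0, -1):
            piece = text[pos:pos + size]
            if size == 1 or piece in vocabulary:
                tokens.append(piece)
                pos += size
                break
    return [t for t in tokens if t not in stop_words]


# Hypothetical default resources of the segmentation tool.
default_vocab = {"旅游", "机场"}
default_stop_words = {"，", "。"}

# Assumed Chinese renderings of the `term` entries of Z, plus custom stop words.
domain_terms = {"西藏", "拉萨", "贡嘎", "藏区", "藏人", "哈达"}
custom_stop_words = {"的"}

text = "拉萨的贡嘎机场"
without_domain = segment(text, default_vocab, default_stop_words)
seg_t = segment(text, default_vocab | domain_terms,
                default_stop_words | custom_stop_words)
print(without_domain)
print(seg_t)
```

Without the domain terms the place names are split into single characters; with them, the names survive as whole words, which is what the later KG matching relies on.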

S2: Topic analysis and feature extraction

S2.1: Define a dictionary W = {w_1, w_2, …, w_S} that stores all words contained in the data, where S is the total number of words in the dictionary and w_i ≠ w_j (1 ≤ i, j ≤ S, i ≠ j);

S2.2: Define the topic vector of data I_i as Λ_i = (λ_{1,i}, λ_{2,i}, …, λ_{K,i}), where λ_{k,i} is the probability that the words of I_i belong to topic z_k, 0 ≤ λ_{k,i} ≤ 1. Topic z_k is represented by the high-frequency word vector Δ_k = ((w_1, δ_{1,k}), (w_2, δ_{2,k}), …, (w_{S_k}, δ_{S_k,k})), where S_k is the total number of words of z_k and δ_{t,k} is the probability of word w_t in z_k, 0 ≤ δ_{t,k} ≤ 1. δ_{t,k} and λ_{k,i} are determined by equations (1) and (2), respectively:

$$\delta_{t,k} = \frac{n_k^{(t)} + \beta_t}{\sum_{t=1}^{S} \left( n_k^{(t)} + \beta_t \right)} \qquad (1)$$

$$\lambda_{k,i} = \frac{n_i^{(k)} + \alpha_k}{\sum_{k=1}^{K} \left( n_i^{(k)} + \alpha_k \right)} \qquad (2)$$

where n_k^{(t)} is the total number of occurrences of word w_t in topic z_k, n_i^{(k)} is the number of words of I_i that belong to topic z_k, S is the total number of words in the dictionary, K is the total number of topics, and β_t and α_k are the Dirichlet hyperparameters for word dimension t and topic dimension k (the values β and α of step S2.3.1);

S2.3: Sample topics and words;

S2.3.1: Give the number of iterations N_iter, N_iter ≥ 1, the total number of topics K, K ≥ 1, and the parameters α, β, κ, with 0 < α, β < 1 and κ ≥ 1;

S2.3.2: For each topic z_k, sample the probability distribution of the words in the topic, φ_k ~ Dir(β);

S2.3.3: For data I_i, sample the topic probability distribution of the data, θ_i ~ Dir(α). For each word of Seg_T_i, sample its topic z_{i,j} ~ Mult(θ_i) and then sample the word w_{i,j} ~ Mult(φ_{z_{i,j}}). Count the total number of words n_k^{(t)} of topic z_k and the number of words n_i^{(k)} of I_i that belong to topic z_k;

S2.3.4: Repeat S2.3.3 for N_iter iterations, until the topic z_{i,j} of each word w_{i,j} converges, that is, the topic to which each word belongs no longer changes;
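The iterative topic assignment of step S2.3 can be sketched as follows. The source describes the sampling only in generative terms, so this sketch substitutes collapsed Gibbs sampling, a common way to fit LDA, and should be read as one possible realization rather than the patented procedure. The function name `gibbs_lda`, the `seed` argument and the English word lists are assumptions made for illustration.

```python
import random
from collections import Counter


def gibbs_lda(docs, K, alpha, beta, n_iter, seed=0):
    """Collapsed Gibbs sampling for LDA on tokenized documents (lists of words)."""
    rng = random.Random(seed)
    S = len({w for doc in docs for w in doc})   # dictionary size
    n_kt = [Counter() for _ in range(K)]        # count of word t in topic k
    n_k = [0] * K                               # total words in topic k
    n_ik = [[0] * K for _ in docs]              # words of document i in topic k

    def update(i, w, k, step):
        n_kt[k][w] += step
        n_k[k] += step
        n_ik[i][k] += step

    # Random initial topic for every word occurrence.
    z = []
    for i, doc in enumerate(docs):
        z.append([rng.randrange(K) for _ in doc])
        for w, k in zip(doc, z[i]):
            update(i, w, k, +1)

    for _ in range(n_iter):
        for i, doc in enumerate(docs):
            for j, w in enumerate(doc):
                update(i, w, z[i][j], -1)       # remove the current assignment
                weights = [
                    (n_ik[i][k] + alpha) * (n_kt[k][w] + beta) / (n_k[k] + S * beta)
                    for k in range(K)
                ]
                z[i][j] = rng.choices(range(K), weights=weights)[0]
                update(i, w, z[i][j], +1)       # record the new assignment
    return z, n_kt, n_ik


# Hypothetical segmented documents (English stand-ins for Seg_T_i).
docs = [
    ["Tibet", "peace", "liberation", "residence"],
    ["fund", "donation", "school"],
    ["vacation", "Lhasa", "airport", "Lhasa"],
]
z, n_kt, n_ik = gibbs_lda(docs, K=3, alpha=0.5, beta=0.01, n_iter=10)
```

The returned counts `n_kt` and `n_ik` play the role of the per-topic word counts and per-document topic counts that the later steps consume.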

S2.4: Obtain the high-frequency word vector of each topic z_k and the topic vector of each data I_i;

S2.4.1: Read the words w_{i,j} of each piece of data I_i and their corresponding topics z_{i,j}. Count the total number n_k^{(t)} of words w_{i,j} with z_{i,j} = z_k, and the number n_i^{(k)} of words w_{i,j} in I_i with z_{i,j} = z_k;

S2.4.2: Compute by equation (1) the probability δ_{t,k} of each word w_t in each topic z_k, and sort by δ_{t,k} in descending order to obtain the high-frequency word vector of topic z_k, Δ_k = ((w_1, δ_{1,k}), (w_2, δ_{2,k}), …, (w_{S_k}, δ_{S_k,k})), 0 ≤ k ≤ K;

S2.4.3: Compute by equation (2) the probability λ_{k,i} that the words of each piece of data I_i belong to topic z_k, and sort by λ_{k,i} in descending order to obtain the topic vector of I_i, Λ_i = (λ_{1,i}, λ_{2,i}, …, λ_{K,i});
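Step S2.4 reduces to two smoothed ratios over the counts. The sketch below assumes symmetric priors, so that the word probability is (count + β) / (topic total + S·β) and the topic probability is (count + α) / (document total + K·α); the function names and the toy counts are hypothetical.

```python
def high_freq_vector(word_counts, beta, S):
    """Word probabilities of one topic, sorted in descending order.

    word_counts maps each word to its count in the topic; S is the dictionary size.
    """
    total = sum(word_counts.values()) + S * beta
    probs = [(w, (c + beta) / total) for w, c in word_counts.items() if c > 0]
    return sorted(probs, key=lambda pair: -pair[1])


def topic_vector(topic_counts, alpha):
    """Topic probabilities of one document from its per-topic word counts."""
    total = sum(topic_counts) + len(topic_counts) * alpha
    return [(c + alpha) / total for c in topic_counts]


# Toy counts, not taken from the source.
delta = high_freq_vector({"a": 3, "b": 1}, beta=0.01, S=5)
lam = topic_vector([2, 0], alpha=0.5)
print(delta)
print(lam)
```

The smoothing terms keep every probability strictly positive, so a topic with no words in a document still receives a small share of the topic vector.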

S2.5: Obtain the feature word sequence of the data;

S2.5.1: Read the topic vector Λ_i of data I_i, sort by λ_{k,i} in descending order, and take the top-κ topics;

S2.5.2: Match the words of Seg_T_i against the words in the high-frequency word vectors Δ_k of these top-κ topics. The matched words are recorded as d_i = <w_{i,1}, w_{i,2}, …, w_{i,m_i}>, which is the feature word sequence of data I_i;
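A minimal sketch of step S2.5 follows. It assumes that the match keeps the words of the segmented text that also occur among the high-frequency words of the top-κ topics, in their original order. The function name is hypothetical; the word lists are English renderings of the topic lists in the embodiment, and the segmentation of microblog a1 is inferred from the dictionary order because the segmentation table is not reproduced in this text.

```python
def feature_words(seg_words, topic_vector, topic_words, kappa):
    """Keep the words of a document that occur in its top-kappa topics."""
    ranked = sorted(range(len(topic_vector)), key=lambda k: -topic_vector[k])
    allowed = set()
    for k in ranked[:kappa]:
        allowed.update(topic_words[k])
    return [w for w in seg_words if w in allowed]


topic_words = [
    ["vacation", "travel", "strategy", "Lhasa", "airport", "kilometers",
     "Shangri-La", "Qinghai Lake", "Tibet", "Gongga"],              # travel
    ["peace", "liberation", "residence", "school", "Hada",
     "Tibetan people", "Tibetan region"],                           # culture
    ["fund", "donation", "aid construction", "Gutianle"],           # public welfare
]

# Assumed segmentation of microblog a1 and its topic vector from the embodiment.
seg_a1 = ["Tibet", "peace", "liberation", "residence", "Tibetan people"]
lambda_a1 = [0.2143, 0.75, 0.0357]

d1 = feature_words(seg_a1, lambda_a1, topic_words, kappa=1)
print(d1)
```

With κ = 1 only the culture topic is used, so "Tibet", which was assigned to the travel topic, is dropped from the feature word sequence.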

S3: Extraction of minority topic data

S3.1: Define the minority domain KG as G_k = (V, E), where V = {v_1, v_2, …, v_n} is the set of nodes corresponding to the entities in the KG and E = {e_1, e_2, …, e_m} is the set of edges between entities. Each edge corresponds to a node triple e_x = (v_i, v_j, label), where node v_i is the start point, node v_j is the end point, and label is the relationship label between the start point and the end point;

S3.2: Construct the domain KG, denoted G_k, from the minority domain knowledge Z;

S3.2.1: First obtain the minority domain knowledge Z = <term, attributes, addition> from domain experts. Take the entity name v_i of each element of Z in turn and express it together with the ethnic group name v_0 as a triple (v_0, v_i, label), where label is taken from the attributes of v_i and serves as the relationship label between v_0 and v_i;

S3.2.2: Then build in turn the triple (v_i, v_j, label) for each pair of elements v_i and v_j, where label is obtained from the additional information addition of the node and serves as the relationship label between v_i and v_j. If v_i and v_j are unrelated, the corresponding edge does not exist. All triples together form the minority domain KG G_k;
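The construction in step S3.2 can be sketched as a list of labeled triples. The source does not say how a relationship label is derived from the addition field, so the sketch below simply takes the entity-to-entity relations as an explicit input; that input, the function name `build_kg` and the English wording of the entries are assumptions.

```python
def build_kg(v0, Z, relations):
    """Build a KG as (nodes, edges) from domain knowledge.

    Z is a list of (term, attributes, addition) entries.
    relations is an explicit list of (v_i, v_j, label) triples between entities.
    """
    nodes = {v0} | {term for term, _, _ in Z}
    # Edges from the ethnic group name to every entity, labeled by its attributes.
    edges = [(v0, term, attributes) for term, attributes, _ in Z]
    # Edges between related entities; unrelated pairs get no edge.
    edges += [t for t in relations if t[0] in nodes and t[1] in nodes]
    return nodes, edges


Z = [
    ("Tibet", "place name", "Tibetan province"),
    ("Lhasa", "city name", "provincial capital of Tibet"),
    ("Gongga", "place name", "southern Tibet"),
    ("Hada", "object name", "Tibetan ceremonial silk scarf"),
]
relations = [("Tibet", "Lhasa", "provincial capital")]

nodes, edges = build_kg("Tibetan", Z, relations)
for edge in edges:
    print(edge)
```

Storing the graph as plain triples keeps the later lookup simple: membership of a word in the domain can be checked against the node set, and adjacency can be read off the edge list.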

S3.3: Domain-irrelevant data, called noise data, do not belong to the domain of interest but affect the accuracy of domain data extraction during the extraction process. An irrelevant-domain KG, denoted G'_k, is constructed for them;

S3.3.1: Obtain the knowledge of the domain irrelevant to the minority, Z' = <term, attributes, addition>, from domain experts. Take the entity name v_i of each element of Z' in turn and express it together with the ethnic group name v_0 as a triple (v_0, v_i, label), where label is taken from the attributes of v_i and serves as the relationship label between v_0 and v_i;

S3.3.2: Then build in turn the triple (v_i, v_j, label) for each pair of elements v_i and v_j, where label is obtained from the additional information addition of the node and serves as the relationship label between v_i and v_j. If v_i and v_j are unrelated, the corresponding edge does not exist. All triples together form the irrelevant-domain KG G'_k;

S3.4: Extract the minority domain data;

S3.4.1: Give a decision parameter τ, 0 ≤ τ ≤ 1;

S3.4.2: For data I_i, compute the length m_i of its feature word sequence d_i, m_i ≥ 0;

S3.4.3: For each word w_{i,j} of d_i, search the adjacent nodes in sequence using the associations (v_x, v_{x+1}, label) between the nodes of G_k, and count the number of words of I_i that belong to the minority domain, denoted n, n ≥ 0;

S3.4.4: Likewise, for each word w_{i,j} of d_i, search the adjacent nodes in sequence using the associations (v_x, v_{x+1}, label) between the nodes of G'_k, and count the number of words of I_i that are domain-irrelevant noise, denoted n';

S3.4.5: Compute the probability that I_i belongs to domain G_k, p = n / m_i, and the probability that I_i belongs to domain G'_k, p' = n' / m_i. If p > τ and p' < τ, then I_i is judged to be minority topic data and is added to the final minority data set D;
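The decision rule of step S3.4 can be sketched as below, assuming p = n / m_i and p' = n' / m_i and simplifying the adjacency search to a membership test against the node names of each KG. The function name is hypothetical; the feature word sequences and entity names are English renderings of those in the embodiment.

```python
def is_minority_data(d, domain_nodes, noise_nodes, tau):
    """Return True if the feature word sequence d passes p > tau and p' < tau."""
    m = len(d)
    if m == 0:
        return False                      # nothing to match against
    n = sum(1 for w in d if w in domain_nodes)
    n_noise = sum(1 for w in d if w in noise_nodes)
    return n / m > tau and n_noise / m < tau


domain_nodes = {"Tibet", "Lhasa", "Gongga", "Tibetan region", "Tibetan people", "Hada"}
noise_nodes = {"Yunnan", "Qinghai", "Shangri-La", "Qinghai Lake"}

features = {
    "a1": ["peace", "liberation", "residence", "Tibetan people"],
    "a2": ["Gutianle", "fund", "donation", "aid construction"],
    "a3": ["Tibetan region", "school", "Tibetan people", "Hada"],
    "a4": ["vacation", "Shangri-La", "Qinghai Lake", "Lhasa"],
    "a5": ["travel", "strategy", "vacation", "Tibet", "Lhasa", "Gongga",
           "airport", "Lhasa", "kilometers"],
}

D = [name for name, d in features.items()
     if is_minority_data(d, domain_nodes, noise_nodes, tau=0.1)]
print(D)
```

Under these assumptions a4 is rejected by the reverse filter: it mentions Lhasa, but half of its feature words fall in the travel noise KG.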

Through the above steps, the invention extracts minority topic data from the new media environment. To make the extraction more accurate and efficient, the parameters of the method are further restricted and optimized. In step S2.3.1, the number of iterations N_iter determines both the efficiency of the method and the accuracy of the result. If it is too small, the topic z_{i,j} of each word w_{i,j} has not yet converged and the topic feature words are inaccurate. If it is too large, the iterations after convergence only add time and reduce efficiency. The method therefore ties the number of iterations directly to the amount of data by setting N_iter = ⌈…⌉, where S is the total number of words in the dictionary and ⌈·⌉ denotes rounding up to an integer. In addition, the parameter α takes the value α = 0.5 when K ≤ 40 and α = 20/K when K > 40, while β takes the value 0.01. The parameter κ takes the value κ = …, so that as the number of topics K increases, the number of feature words obtained from the high-frequency word vectors of the top-κ topics also increases. Finally, in step S3.4.1, when the domain decision parameter τ lies in the range 0.05 ≤ τ ≤ 0.15, the domain membership of the data is judged more accurately.
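The parameter rules that are fully stated in the text can be captured in a small helper. The expressions for N_iter and κ are not recoverable from this text and are deliberately left out; the function and constant names are hypothetical.

```python
BETA = 0.01                      # fixed value of beta given in the text
TAU_RANGE = (0.05, 0.15)         # recommended range of the decision parameter tau


def choose_alpha(K):
    """alpha = 0.5 for K <= 40 topics, otherwise 20 / K."""
    return 0.5 if K <= 40 else 20 / K


def tau_is_recommended(tau):
    """Check whether tau lies in the recommended range."""
    low, high = TAU_RANGE
    return low <= tau <= high


print(choose_alpha(3), choose_alpha(50), tau_is_recommended(0.1))
```

The rule shrinks α as the number of topics grows, which pushes each document toward a smaller share of the available topics.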

Detailed Description

The following description of the embodiments of the present invention is provided in order to better understand the present invention for those skilled in the art with reference to the accompanying drawings. It is to be expressly noted that in the following description, a detailed description of known functions and designs will be omitted when it may obscure the subject matter of the present invention.

Example: extraction of Tibetan data from Sina Weibo.

Step one: Preprocessing

Microblog data are acquired from the Sina Weibo platform. A single piece of microblog data is shown in Table 1.

Table 1. Microblog data example

For convenience, the additional information items A_i are hidden in the following description of data extraction. The acquired Sina Weibo data comprise 5 microblog entries a1 to a5, as shown in Table 2.

Table 2. Sina Weibo microblog data

Then the text part T_i of the microblog data is segmented. A word segmentation tool that supports a custom dictionary and stop words is selected, and the Tibetan domain knowledge Z = {<Tibet, place name, Tibetan province>, <Lhasa, city name, provincial capital of Tibet>, <Gongga, place name, southern Tibet>, <Tibetan region, vague place name, refers to areas inhabited by Tibetans>, <Tibetan people, group of people, refers to the Tibetan ethnic group>, <Hada, object name, Tibetan ceremonial silk scarf>} is introduced. The Tibetan domain words are added to the dictionary of the segmentation tool, and the segmentation result is denoted Seg_T_i, as shown in Table 3.

Table 3. Sina Weibo microblog data word segmentation result

Step two: Topic analysis and feature extraction

Read the microblog data and build the dictionary in order of first appearance, without repetition: W = {1: Tibet, 2: peace, 3: liberation, 4: residence, 5: Tibetan people, 6: Gutianle, 7: fund, 8: donation, 9: aid construction, 10: school, 11: Tibetan region, 12: Hada, 13: vacation, 14: Shangri-La, 15: Qinghai Lake, 16: Lhasa, 17: travel, 18: strategy, 19: Gongga, 20: airport, 21: kilometers}.

Given the number of iterations N_iter = 10, the total number of topics K = 3 and the parameters α = 0.5, β = 0.01 and κ = 1, sample the topic of each word z_{i,j} ~ Mult(θ_i) and the word of the topic w_{i,j} ~ Mult(φ_{z_{i,j}}), iterating 10 times so that every word is assigned to a topic.

The topics obtained for all words are as follows:

Travel: {vacation, travel, strategy, Lhasa, airport, kilometers, Shangri-La, Qinghai Lake, Tibet, Gongga}

Culture: {peace, liberation, residence, school, Hada, Tibetan people, Tibetan region}

Public welfare: {fund, donation, aid construction, Gutianle}

Loop over the microblog data and count, for each microblog, the number of its words in each topic, and, for each topic, the total number of each of its words. Taking microblog a1 as an example, its per-topic word counts are substituted into formula (2) to compute λ_{1,1}, λ_{2,1} and λ_{3,1}.

Thus the topic vector of microblog a1 is (0.2143, 0.75, 0.0357). Similarly, the topic vector of a2 is (0.0357, 0.2143, 0.75), the topic vector of a3 is (0.0435, 0.9130, 0.0435), the topic vector of a4 is (0.9130, 0.0435, 0.0435), and the topic vector of a5 is (0.9583, 0.0208, 0.0208).
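As a hedged check of these numbers, assume that formula (2) has the usual smoothed form, λ = (count + α) / (total + Kα), and that microblog a1 contributes 1, 4 and 0 words to the travel, culture and public welfare topics. These counts are inferred from the dictionary and the topic lists, since the segmentation table is not reproduced in this text. Under those assumptions the reported vector is obtained with α = 0.2:

$$\lambda_{1,1} = \frac{1 + 0.2}{5 + 3 \times 0.2} = \frac{1.2}{5.6} \approx 0.2143, \qquad \lambda_{2,1} = \frac{4 + 0.2}{5.6} = 0.75, \qquad \lambda_{3,1} = \frac{0 + 0.2}{5.6} \approx 0.0357$$

With the stated α = 0.5, the same counts would instead give approximately (0.2308, 0.6923, 0.0769), so the reported vectors appear to correspond to α = 0.2. The vectors of a2 to a5 follow the same pattern, for example 0.2 / (9 + 0.6) ≈ 0.0208 for a5.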

For each topic z_k, compute δ_{t,k}. Taking topic z_1 as an example, the dictionary W shows that t = 1 denotes "Tibet", t = 13 denotes "vacation" and t = 16 denotes "Lhasa", and formula (1) gives δ_{1,1} = 0.1414, δ_{13,1} = 0.1414 and δ_{16,1} = 0.2118.

Sorting by δ_{t,1} in descending order gives the high-frequency word vector of topic z_1: Δ_1 = (("Lhasa", 0.2118), ("vacation", 0.1414), ("Tibet", 0.1414), ("travel", 0.0711), ("strategy", 0.0711), …). In the same way, the high-frequency word vector of topic z_2 is Δ_2 = (("school", 0.2182), ("Tibetan people", 0.2182), ("peace", 0.1097), ("Hada", 0.1097), ("liberation", 0.1097), …), and the high-frequency word vector of topic z_3 is Δ_3 = (("fund", 0.2399), ("donation", 0.2399), ("aid construction", 0.2399), ("Gutianle", 0.2399)).
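These values can be checked by hand, assuming that formula (1) has the usual smoothed form, δ = (count + β) / (topic total + Sβ), with β = 0.01 and S = 21. The counts are inferred from the feature word sequences of the embodiment: the travel topic holds 14 word occurrences in total, of which "Lhasa" accounts for 3, "vacation" for 2 and "travel" for 1.

$$\delta_{16,1} = \frac{3 + 0.01}{14 + 21 \times 0.01} = \frac{3.01}{14.21} \approx 0.2118, \qquad \delta_{13,1} = \frac{2.01}{14.21} \approx 0.1414, \qquad \delta_{17,1} = \frac{1.01}{14.21} \approx 0.0711$$

The same form reproduces the other topics, for example 2.01 / 9.21 ≈ 0.2182 for "school" in the culture topic and 1.01 / 4.21 ≈ 0.2399 for each word of the public welfare topic.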

Take κ = 1. That is, the words of the Seg_T_i of each microblog are matched against the words of the high-frequency word vector of that microblog's top-1 topic to obtain its feature word sequence. Taking microblog a1 as an example, its top-1 topic is z_2, so the high-frequency word vector Δ_2 of topic z_2 is matched with the Seg_T_1 of a1 to obtain d_1 = <"peace", "liberation", "residence", "Tibetan people">. In the same way:

d_2 = <"Gutianle", "fund", "donation", "aid construction">

d_3 = <"Tibetan region", "school", "Tibetan people", "Hada">

d_4 = <"vacation", "Shangri-La", "Qinghai Lake", "Lhasa">

d_5 = <"travel", "strategy", "vacation", "Tibet", "Lhasa", "Gongga", "airport", "Lhasa", "kilometers">

Step three: Minority data extraction

First, the domain KG is constructed from the Tibetan domain knowledge Z = {<Tibet, place name, Tibetan province>, <Lhasa, city name, provincial capital of Tibet>, <Gongga, place name, southern Tibet>, <Tibetan region, vague place name, refers to areas inhabited by Tibetans>, <Tibetan people, group of people, refers to the Tibetan ethnic group>, <Hada, object name, Tibetan ceremonial silk scarf>}.

Take the entity name v_i of each element of Z in turn and express it together with the ethnic group name v_0 as a triple (v_0, v_i, label), such as ("Tibetan", "Tibet", "place name"). Then build in turn the triple (v_i, v_j, label) for each pair of elements v_i and v_j, where label is obtained from the additional information of the node, such as ("Tibet", "Lhasa", "provincial capital"). The resulting graph is shown in FIG. 3.

In the same way, from the knowledge of the Tibetan-irrelevant "travel" topic, Z' = {<Yunnan, province name, tourist province>, <Qinghai, province name, tourist province>, <Shangri-La, place name, Yunnan tourist attraction>, <Qinghai Lake, lake name, Qinghai tourist attraction>}, the Tibetan-irrelevant travel KG G'_k is constructed, as shown in FIG. 4.

The parameter τ = 0.1 is given. For microblog a1, the length of its feature word sequence d_1 is m_1 = 4. Each word of d_1 is looked up among the nodes and edges of G_k and of G'_k, and the counts n and n' are obtained.

From these counts, the probability p that a1 belongs to domain G_k and the probability p' that it belongs to domain G'_k are computed. Since p > τ and p' < τ, microblog a1 belongs to the "Tibetan" domain and is added to the extracted Tibetan data set D. In the same way, a3 and a5 also belong to the "Tibetan" domain. For a4, the computed p and p' do not satisfy this condition, so a4 is irrelevant noise data that merely mentions Tibetan-related names.

The extraction results of the "Tibetan" subject data are shown in Table 4.

TABLE 4 "Tibetan" subject data extraction results

Drawings

FIG. 1 is a flow chart of the implementation of the invention. It comprises three steps: new media data preprocessing, topic analysis and feature extraction, and minority data extraction.

FIG. 2 is the graphical model of LDA.

FIG. 3 is a graphical illustration of the Tibetan domain knowledge graph in the embodiment.

FIG. 4 is a graphical illustration of the knowledge graph corresponding to the noise data in the embodiment.

Claims

1. A method for extracting minority topic data in a new media environment, characterized by comprising the following steps:

S1: Data preprocessing

S1.3: Obtain the stop word set Stop_words;

S1.4: Use a word segmentation tool to segment the text content T_i of the media data. Before segmentation, add Stop_words to the default stop word set of the tool and add the set of entity names term from the minority domain knowledge to the default vocabulary of the tool. The segmentation result of T_i is stored separately at the end of I_i and is denoted Seg_T_i;

S2: Topic analysis and feature extraction

S2.1: Define a dictionary W = {w_1, w_2, …, w_S} that stores all words contained in the data, where S is the total number of words in the dictionary and w_i ≠ w_j, 1 ≤ i, j ≤ S, i ≠ j;

S2.2: Define the topic vector of data I_i as Λ_i = (λ_{1,i}, λ_{2,i}, …, λ_{K,i}), where λ_{k,i} is the probability that the words of I_i belong to topic z_k, 0 ≤ λ_{k,i} ≤ 1. Topic z_k is represented by the high-frequency word vector Δ_k = ((w_1, δ_{1,k}), (w_2, δ_{2,k}), …, (w_{S_k}, δ_{S_k,k})), where S_k is the total number of words of z_k and δ_{t,k} is the probability of word w_t in z_k, 0 ≤ δ_{t,k} ≤ 1. δ_{t,k} and λ_{k,i} are determined by equations (1) and (2), respectively:

$$\delta_{t,k} = \frac{n_k^{(t)} + \beta_t}{\sum_{t=1}^{S} \left( n_k^{(t)} + \beta_t \right)} \qquad (1)$$

$$\lambda_{k,i} = \frac{n_i^{(k)} + \alpha_k}{\sum_{k=1}^{K} \left( n_i^{(k)} + \alpha_k \right)} \qquad (2)$$

where n_k^{(t)} is the total number of occurrences of word w_t in topic z_k, n_i^{(k)} is the number of words of I_i that belong to topic z_k, S is the total number of words in the dictionary, K is the total number of topics, β_t is the hyperparameter of the Dirichlet distribution in dimension t, and α_k is the hyperparameter of the Dirichlet distribution in dimension k;

S2.3: Sample topics and words;

S2.3.2: For each topic z_k, sample the probability distribution of the words in the topic, φ_k ~ Dir(β), where Dir(β) denotes the Dirichlet distribution with hyperparameter β;

S2.3.3: For data I_i, sample the topic probability distribution of the data, θ_i ~ Dir(α), where Dir(α) denotes the Dirichlet distribution with hyperparameter α. For each word of Seg_T_i, sample its topic z_{i,j} ~ Mult(θ_i) and then sample the word w_{i,j} ~ Mult(φ_{z_{i,j}}), where Mult(θ_i) and Mult(φ_{z_{i,j}}) denote the multinomial distributions with parameters θ_i and φ_{z_{i,j}}, respectively. Count the total number of words n_k^{(t)} of topic z_k and the number of words n_i^{(k)} of I_i that belong to topic z_k;

… and the number n_i^{(k)} of words w_{i,j} in data I_i with z_{i,j} = z_k;

… 0 ≤ k ≤ K;

S2.5: Obtain the feature word sequence of the data;

S2.5.2: Match the words of Seg_T_i against the words in the high-frequency word vectors Δ_k of the above top-κ topics. The matched words are recorded as d_i = <w_{i,1}, w_{i,2}, …, w_{i,m_i}>, which is the feature word sequence of data I_i;

S3: Extraction of minority topic data

… as the relationship label;

… as the relationship label;

S3.4: Extract the minority domain data;

S3.4.1: Give a decision parameter τ, 0 ≤ τ ≤ 1;

S3.4.3: For each word w_{i,j} of d_i, search the adjacent nodes in sequence using the associations (v_x, v_{x+1}, label) between the nodes of G_k, and count the number of words of I_i that belong to the minority domain, denoted n, n ≥ 0;

S3.4.4: Likewise, for each word w_{i,j} of d_i, search the adjacent nodes in sequence using the associations (v_x, v_{x+1}, label) between the nodes of G'_k, and count the number of words of I_i that are domain-irrelevant noise, denoted n';

S3.4.5: Compute the probability that I_i belongs to domain G_k, p = n / m_i, and the probability that I_i belongs to domain G'_k, p' = n' / m_i. If p > τ and p' < τ, then I_i is judged to be minority topic data and is added to the final minority data set D.

2. The method for extracting minority topic data in a new media environment as claimed in claim 1, wherein in step S2.3.1 the parameters take the following values: N_iter = ⌈…⌉, where ⌈·⌉ denotes rounding up to an integer; α = 0.5 when K ≤ 40 and α = 20/K when K > 40; β = 0.01; and κ = ….

3. The method for extracting minority topic data in a new media environment as claimed in claim 1, wherein in step S3.4.1 the decision parameter τ lies in the range 0.05 ≤ τ ≤ 0.15.