CN108415901A - A short-text topic model based on word vectors and contextual information - Google Patents

A short-text topic model based on word vectors and contextual information

Info

Publication number
CN108415901A
Authority
CN
China
Prior art keywords
word
topic
document
semantic
model
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810124600.7A
Other languages
Chinese (zh)
Inventor
梁文新
冯然
张宪超
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Dalian University of Technology
Original Assignee
Dalian University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Dalian University of Technology filed Critical Dalian University of Technology
Priority to CN201810124600.7A
Publication of CN108415901A
Legal status: Pending (current)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a short-text topic model based on word vectors and contextual information. Semantic relations between words are extracted from word vectors, and explicitly acquiring these relations compensates for the lack of word co-occurrence in short-text data; the semantic relations between words are then further filtered with the training set data so that they better fit the training corpus. A background topic is added to the generative process, and noise words in documents are modeled by the background topic. The model is solved by Gibbs sampling, and during sampling the strategy of the generalized Pólya urn model is used to increase the probability, under the related topic, of words with strong semantic relatedness, so that the semantic coherence of the words under each topic is greatly improved. A series of experiments shows that the proposed method can substantially improve the semantic coherence of the topics and provides a new method for short-text topic modeling.

Description

A short-text topic model based on word vectors and contextual information
Technical field
The invention belongs to the field of natural language processing and relates to a short-text topic model based on word vectors and contextual information.
Background art
With the development of social networks, short texts have become one of the main channels of information spreading on the Internet. Short-text data contain rich information, and extracting topic information from short-text data is very valuable. Probabilistic topic models are an effective method for extracting topic information from document collections. A topic model is an unsupervised learning method: its input is a document collection, and its output is the topic information contained in that collection. Each topic can be regarded as a distribution over words, and the words with high probability under a topic reflect its semantic features; for example, if words such as "education", "university" and "student" have high probability under one topic, then that topic reflects an "education" theme. Topic models are effective largely because they rely on word co-occurrence information: the more often two words appear in the same document, the more likely they belong to the same topic. Classical topic models such as LDA and PLSA achieve good results on large-scale data.
In short-text data, however, word co-occurrence is sparse, so traditional topic models cannot effectively extract high-quality topics from short texts, and the semantic coherence of the resulting topics is low. To extract high-quality topics from short-text data, it is therefore necessary to make full use of external knowledge and of the features of the training data themselves to obtain semantic information about words, compensate for the lack of co-occurrence information, and further apply this semantic information during modeling to improve the semantic coherence of the topics.
Summary of the invention
Building on existing research, the present invention proposes a short-text topic model based on word vectors and contextual information. The semantic information of words is used to compensate for the effect of insufficient word co-occurrence, increasing the probability that semantically related words appear under the same topic. At the same time, a background topic is introduced into the model to capture noise words, which further improves the semantic coherence of the words under each topic.
Technical scheme of the present invention:
A short-text topic model based on word vectors and contextual information, with the following steps:
(1) Semantic feature extraction stage
First, word vectors are trained on a large-scale data set; the semantic similarity between any two words in the training set is obtained from their word vectors, and a set of semantically related words is then obtained for each word in the training set.
(2) Semantic information filtering stage
Because the word vectors are trained on a large-scale text corpus, the semantic relatedness between words is not necessarily suitable for the training data, so it needs to be further filtered using information from the training data.
(3) Generative process modeling stage
With reference to the DMM model, the generative process of the model is defined. It is assumed that each short document has only one topic, and that every word in a document is generated either by that topic or by a background topic. Each word in a document is associated with a binary indicator variable: when its value is 0, the word comes from a normal topic; when its value is 1, the word comes from the background topic and is a background word.
(4) Model parameter estimation stage
According to the generative process, the hidden variables of the model are sampled by Gibbs sampling, and the model parameters can be obtained by maximum a posteriori (MAP) estimation. The generalized Pólya urn model (general Pólya urn model) is used to increase the statistics of semantically related words under the same topic; after MAP estimation from the samples, the probabilities of semantically related words under each topic are increased, so the semantic coherence of the topics improves.
The beneficial effect of the present invention is a short-text topic model based on word vectors and contextual information that effectively uses word vectors and contextual information to obtain the semantic relatedness between words, compensating for the lack of word co-occurrence in short-text data. During model inference, the probabilities of strongly related words under the relevant topic are increased together, which improves the semantic coherence of the topics to a certain extent. A background topic is added to the generative process of the model to capture noise words in documents, which can further improve topic coherence. Compared with recently proposed short-text topic models, the present model improves both efficiency and effectiveness, shows robustness across different data sets, and provides a new framework for short-text topic modeling.
Description of the drawings
Fig. 1 is the probabilistic graphical model representation of the method of the present invention.
Fig. 2 shows the F1 score of document classification using the topics extracted by the present invention on the Amazon data set as document features.
Fig. 3 shows the classification accuracy of document classification using the topics extracted by the present invention on the Amazon data set as document features.
Fig. 4 shows the F1 score of document classification using the topics extracted by the present invention on the web query data set as document features.
Fig. 5 shows the classification accuracy of document classification using the topics extracted by the present invention on the web query data set as document features.
Specific embodiments
Specific embodiments of the present invention are described below to further illustrate the starting point of the present invention and the corresponding technical solution.
The present invention is a short-text topic model based on word vectors and contextual information, whose main purpose is to automatically extract high-quality topic information from short-text data. The method can be divided into the following four steps:
(1) Obtaining the semantic similarity between words:
First, word vectors are trained on Wikipedia data using Google's open-source tool word2vec. For English training data, word vectors are trained on the English Wikipedia data set; for Chinese training data, Chinese word vectors are trained on the Chinese Wikipedia data set. English training data are taken as the example here; the training data used are the Amazon review data set (Amazon Reviews) and the web query data set (Web Snippet). Accordingly, the word2vec tool is used to train word vectors for English words on English Wikipedia data, with the vector dimensionality set to 300.
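For illustration, a minimal Python sketch of this training step using the gensim implementation of word2vec rather than Google's original C tool; the corpus file name and all parameter values except the 300-dimensional vectors are assumptions, not part of the specification:

    # Sketch: train 300-dimensional word vectors on a plain-text Wikipedia dump (gensim >= 4).
    from gensim.models import Word2Vec
    from gensim.models.word2vec import LineSentence

    sentences = LineSentence("enwiki_plaintext.txt")   # assumed: one pre-tokenized sentence per line
    w2v = Word2Vec(sentences, vector_size=300, window=5, min_count=5, workers=4)
    w2v.wv.save("wiki_en_300d.kv")                     # keep only the word vectors for later steps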
The training data of the model need to be preprocessed for the subsequent operations. First, the nltk natural language processing library of Python is used to split the text into sentences, and each sentence is then tokenized; for English text, words are separated by spaces. Next, stop words are filtered out, then words that occur in fewer than 5 documents are removed, and documents shorter than 3 words are discarded. After this processing, the word list V of the training data is obtained.
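A possible sketch of this preprocessing, assuming the raw documents are available as a list of strings; the lowercasing and the alphabetic-token filter are assumptions beyond what the text specifies, while the thresholds (5 documents, 3 words) are taken from it:

    # Sketch of the preprocessing described above (requires the nltk "punkt" and "stopwords" data).
    from collections import Counter
    import nltk
    from nltk.corpus import stopwords

    STOP = set(stopwords.words("english"))

    def preprocess(raw_documents):
        docs = []
        for text in raw_documents:
            words = []
            for sent in nltk.sent_tokenize(text):                      # sentence splitting
                words += [w.lower() for w in nltk.word_tokenize(sent) if w.isalpha()]
            docs.append([w for w in words if w not in STOP])           # stop-word removal
        df = Counter(w for d in docs for w in set(d))                  # document frequency
        docs = [[w for w in d if df[w] >= 5] for d in docs]            # keep words in >= 5 documents
        docs = [d for d in docs if len(d) >= 3]                        # keep documents with >= 3 words
        V = sorted({w for d in docs for w in d})                       # word list V
        return docs, V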
For words w_i and w_j with corresponding word vectors v_i and v_j, the semantic similarity between the two words is defined as the cosine similarity between their vectors: SR(w_i, w_j) = cos(v_i, v_j) = (v_i · v_j) / (||v_i|| ||v_j||). For a word w in the word list V, its set of semantically related words S(w) is defined as S(w) = {w_o | SR(w, w_o) > ε}, where the value of ε depends on the data set and ε ∈ [0, 1].
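A small sketch of these two definitions, assuming `wv` is the gensim KeyedVectors object trained above and `V` the word list from the preprocessing step; the quadratic loop over the vocabulary is a simplification:

    # Sketch: cosine similarity SR and the related-word sets S(w).
    import numpy as np

    def semantic_similarity(wv, wi, wj):
        vi, vj = wv[wi], wv[wj]
        return float(np.dot(vi, vj) / (np.linalg.norm(vi) * np.linalg.norm(vj)))

    def related_word_sets(wv, V, epsilon=0.5):          # epsilon is data set dependent
        vocab = [w for w in V if w in wv]
        return {w: {wo for wo in vocab
                    if wo != w and semantic_similarity(wv, w, wo) > epsilon}
                for w in vocab}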
(2) Filtering the semantic similarity information between words with the training data
In previous work, word vectors have been used as external knowledge in the topic modeling field. Word vectors are typically trained on large-scale text corpora, and the semantic information they contain may not be suitable for the training data; for example, the words "bachelor" and "undergraduate" may have little association under a "family" topic. We therefore use pointwise mutual information (PMI) to filter the semantic similarity information between words, making it better suited to the short-text training data. Given words w_i and w_j, the PMI between them is defined as:
PMI(w_i, w_j) = log( p(w_i, w_j) / ( p(w_i) p(w_j) ) )
where p(w_i, w_j) denotes the probability that w_i and w_j co-occur in the same document, estimated as p(w_i, w_j) = |D_{w_i,w_j}| / |D|, with |D_{w_i,w_j}| the number of documents in which w_i and w_j co-occur and |D| the total number of documents in the training set; p(w) denotes the probability that word w occurs in the document collection, estimated from the document frequency of the word as p(w) = |D_w| / |D|. The set S(w) of word w is then redefined according to PMI as S(w) = {w_o | SR(w, w_o) > ε, PMI(w, w_o) ≥ η}. That is, if two semantically related words have little association in the training data, then to a large extent they are not semantically related within the training set. Here η ∈ (-∞, +∞), and its specific value depends on the data set.
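A sketch of the PMI filtering under the estimates given above, assuming `docs` is the preprocessed training corpus and `S` the related-word sets from the previous step:

    # Sketch: keep w_o in S(w) only if PMI(w, w_o) >= eta, with PMI estimated from document counts.
    import math
    from collections import Counter
    from itertools import combinations

    def pmi_filter(docs, S, eta=0.0):                   # eta is data set dependent
        D = len(docs)
        df = Counter(w for d in docs for w in set(d))                                   # |D_w|
        co = Counter(frozenset(p) for d in docs for p in combinations(set(d), 2))       # |D_{wi,wj}|
        def pmi(wi, wj):
            joint = co[frozenset((wi, wj))]
            if joint == 0:
                return float("-inf")                     # words never co-occur: filtered out
            return math.log((joint / D) / ((df[wi] / D) * (df[wj] / D)))
        return {w: {wo for wo in S[w] if pmi(w, wo) >= eta} for w in S}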
(3) Defining the generative process of the model
Since the method is aimed at short-text data, the DMM model and the Twitter-LDA model are used as references when defining the generative process involved in this method. In a topic model, the generative process is the assumed procedure by which documents are generated. It is first assumed that the document collection contains K topics and that each short document is related to only one topic; because the length of a short document is limited, generally within 100 words, this assumption is relatively simple and reasonable. Suppose a short document d is associated with topic k; then not all words in the document are related to topic k. For example, some words appear in most documents; such words can be regarded as background words, which make sentences more complete or the expression more fluent. A global background topic B can therefore be set up to be responsible for generating the background words. Suppose each word w in document d, whose associated topic is z, is further associated with a binary indicator variable y: if y = 1, the word comes from the background topic B; if y = 0, the word comes from topic z. The topic is generated by sampling from a multinomial distribution θ, where θ is sampled from a Dirichlet distribution with parameter α. For each topic k and for the background topic B, the multinomial distribution over words φ_k is sampled from a Dirichlet distribution with parameter β. The complete generative process is as follows (a small simulation sketch is given after step E)):
A) Sample the topic distribution of the document collection from a Dirichlet distribution with parameter α: θ ~ Dirichlet(α)
B) For the background topic, sample the multinomial distribution over words from a Dirichlet distribution with parameter β: φ_B ~ Dirichlet(β)
C) Sample the distribution of the binary indicator variable: ψ ~ Dirichlet(γ)
D) For each topic k, sample the topic-word distribution: φ_k ~ Dirichlet(β)
E) For every document d in the collection, first sample the topic of the document, z_d ~ Multinomial(θ). Then, for the i-th word in document d, sample a binary indicator variable y_{d,i} ~ Bernoulli(ψ); if y_{d,i} = 0, the word is generated from topic z_d, i.e. w_{d,i} ~ Multinomial(φ_{z_d}); if y_{d,i} = 1, the word is generated from the background topic, i.e. w_{d,i} ~ Multinomial(φ_B), where w_{d,i} denotes the i-th word in document d.
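The following small simulation sketch follows steps A) to E) literally; the number of topics, vocabulary size, number of documents, document lengths and hyperparameter values are all illustrative assumptions:

    # Sketch: forward simulation of the generative process with numpy.
    import numpy as np

    rng = np.random.default_rng(0)
    K, V_size, D = 20, 1000, 500                       # topics, vocabulary size, documents (assumed)
    alpha, beta, gamma = 0.1, 0.01, 0.3                # hyperparameters (assumed values)

    theta = rng.dirichlet([alpha] * K)                 # A) topic distribution of the collection
    phi_B = rng.dirichlet([beta] * V_size)             # B) word distribution of the background topic
    psi   = rng.dirichlet([gamma] * 2)                 # C) distribution of the indicator variable y
    phi   = rng.dirichlet([beta] * V_size, size=K)     # D) per-topic word distributions

    docs, doc_topics = [], []
    for d in range(D):                                 # E) generate every short document
        z_d = rng.choice(K, p=theta)                   #    one topic per document
        words = []
        for i in range(rng.integers(5, 30)):           #    assumed short-document length
            y = rng.choice(2, p=psi)                   #    y = 1 means a background word
            words.append(rng.choice(V_size, p=phi_B if y == 1 else phi[z_d]))
        docs.append(words)
        doc_topics.append(z_d)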
The probabilistic graphical model corresponding to the above generative process is shown in Fig. 1. We assume that all documents of the training data are generated according to this process, so given the observed variables, the hidden variables of the model need to be inferred from the generative process and the observations.
(4) Model parameter estimation
According to the generative process, the likelihood function L of the training data can be written down as the joint probability of the observed documents under the model. The parameters and hidden variables are obtained by maximizing this likelihood, but because of the coupling between the variables an exact solution is impossible, so approximate inference is used to solve for the hidden variables and parameters. Commonly used approximate inference methods for probabilistic graphical models include the EM algorithm, variational EM, variational expectation propagation and Gibbs sampling. Here Gibbs sampling is used to estimate the parameters: the method is relatively simple, and a good solution can be obtained after sufficient sampling. An equally important reason is that Gibbs sampling allows the semantic relatedness between words to be incorporated into the sampling process. According to the likelihood function, the hidden variables y and z need to be sampled, while the hidden variables φ_B, φ_{1,...,K}, θ and ψ can be obtained by maximum a posteriori estimation.
Given a word w and its set of semantically related words S(w), since the words in S(w) have a relatively strong semantic relation with w, if the probability of word w under a topic z is large, then for any word w_o in S(w), its probability under topic z should also be larger. To achieve this, the sampling strategy of the generalized Pólya urn model is adopted.
The Pólya urn model is one of the classical models in statistics, and many problems can be reduced to it. In the simple Pólya urn model, an urn contains many balls, each with a color; a ball is drawn at random, and then that ball is put back into the urn together with one additional ball of the same color. The generalized Pólya urn model extends this: a ball is drawn at random from the urn, its color is recorded, and the ball is then put back together with a certain number of additional balls of that color and of other similar colors. In the topic model, the urn corresponds to a topic and the balls correspond to words, and the simple Pólya urn corresponds to the ordinary Gibbs sampling process. In the inference of this model the generalized Pólya urn model is used: for a word w assigned to topic k, not only is the statistic of w under topic k increased, but the statistics under topic k of the words in S(w) are also increased. That is, during Gibbs sampling, if the topic of word w is set to k, then n_k^w is increased by 1; meanwhile, for each w_o ∈ S(w), n_k^{w_o} is increased by A_{w,w_o}, where n_k^w denotes the statistic associating topic k with word w and A_{w,w_o} is the promotion amount added for the related word w_o.
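A minimal sketch of this count update; `n_kw[k]` is a per-topic dictionary of word counts, `n_k[k]` the total count mass of topic k, and the promotion weight `mu` is an assumed parameter (the text only states that related words receive the amount A_{w,w_o}):

    # Sketch: generalized Polya urn update when word w is assigned to (sign=+1)
    # or removed from (sign=-1) topic k.
    def gpu_update(n_kw, n_k, k, w, S, mu=0.3, sign=+1):
        n_kw[k][w] = n_kw[k].get(w, 0.0) + sign * 1.0           # the sampled word itself
        n_k[k] += sign * 1.0
        for wo in S.get(w, ()):                                  # promote semantically related words
            n_kw[k][wo] = n_kw[k].get(wo, 0.0) + sign * mu
            n_k[k] += sign * mu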
For document d, the sampling formula for the topic z is:
p(z_d = k | z_¬d, w, y) ∝ (m_k^¬d + α) · ∏_{w∈d} ∏_{j=1..n_d^w} (n_k^{w,¬d} + β + j - 1) / ∏_{i=1..N_d} (n_k^¬d + Vβ + i - 1)
where m_k is the number of documents assigned to topic k, n_k^w is the statistic of word w under topic k, n_k is the statistic of topic k over all words, n_d^w is the number of occurrences of word w in document d, N_d is the number of non-background words in document d, V is the size of the word list, and the superscript ¬d indicates that the information of document d is excluded when computing these statistics.
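A sketch of this Gibbs step, implementing the DMM-style conditional written above for the non-background words of one document; `m_k`, `n_kw` and `n_k` are the counts defined above with document d already removed:

    # Sketch: collapsed Gibbs draw of the document topic z_d.
    import numpy as np

    def sample_document_topic(doc_words, K, V_size, alpha, beta, m_k, n_kw, n_k, rng):
        log_p = np.zeros(K)
        for k in range(K):
            log_p[k] = np.log(m_k[k] + alpha)
            seen = {}
            for i, w in enumerate(doc_words):                    # non-background words of document d
                seen[w] = seen.get(w, 0) + 1                     # j-th occurrence of w within d
                log_p[k] += np.log(n_kw[k].get(w, 0.0) + beta + seen[w] - 1)
                log_p[k] -= np.log(n_k[k] + V_size * beta + i)
        p = np.exp(log_p - log_p.max())                          # work in log space for stability
        return rng.choice(K, p=p / p.sum())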
For the i-th word of document d, the sampling formula for the binary indicator variable y is:
p(y_{d,i} = 1 | ·) ∝ (n_{y=1}^{¬(d,i)} + γ) · (n_B^{w_{d,i}, ¬(d,i)} + β) / (n_B^{¬(d,i)} + Vβ)
p(y_{d,i} = 0 | ·) ∝ (n_{y=0}^{¬(d,i)} + γ) · (n_{z_d}^{w_{d,i}, ¬(d,i)} + β) / (n_{z_d}^{¬(d,i)} + Vβ)
where n_{y=1} is the number of background words in the document collection, n_{y=0} is the number of non-background words in the collection, n_B^w is the number of times word w is generated by the background topic B, n_B is the total number of background words in the collection, and the superscript ¬(d,i) indicates that the information of the i-th word of document d is excluded when computing the corresponding statistics. After sufficient sampling, the probability of word w under topic k is obtained as:
φ_k^w = (n_k^w + β) / (n_k + Vβ).
Since we are not concerned with the distribution of the indicator variable or the word distribution of the background topic, in practical applications only φ needs to be computed.
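A sketch of the remaining two computations, following the formulas above; the counts exclude the word occurrence being resampled, and `rng` is a numpy random generator:

    # Sketch: Gibbs draw of the background indicator y and the MAP estimate of phi_k.
    def sample_indicator(w, z_d, n_y, n_Bw, n_B, n_kw, n_k, V_size, beta, gamma, rng):
        p_bg    = (n_y[1] + gamma) * (n_Bw.get(w, 0.0) + beta) / (n_B + V_size * beta)
        p_topic = (n_y[0] + gamma) * (n_kw[z_d].get(w, 0.0) + beta) / (n_k[z_d] + V_size * beta)
        return 1 if rng.random() < p_bg / (p_bg + p_topic) else 0      # 1 = background word

    def phi_map(n_kw, n_k, V, beta, k):
        return {w: (n_kw[k].get(w, 0.0) + beta) / (n_k[k] + len(V) * beta) for w in V}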
There are various ways of evaluating topic models. Here, the learned topics are used as document features for document classification, and the quality of the topics is judged by the classification accuracy: for a given classifier, the higher the semantic coherence of the topics, the higher the classification accuracy. A random forest is used as the classifier, and classification accuracy and the F1 score are used as evaluation metrics. In the experiments the model parameters are set to ε = 0.5 and η = 0.0, and the Amazon review data and the web query data are used as the training data of the model. To further demonstrate the effectiveness of the model, it is compared with five other common short-text topic models, with the number of topics K set to {20, 40, 60, 80}. The results are shown in Fig. 2, Fig. 3, Fig. 4 and Fig. 5. In terms of classification performance, the proposed model obtains better results in most cases, indicating that the topics it extracts are of better quality in most cases. From Fig. 4 and Fig. 5 it can further be observed that as the number of topics grows, the model shows a certain robustness: the quality of the topics does not drop sharply with the increase in the number of topics. The experimental results demonstrate the feasibility of the proposed short-text topic model based on word vectors and contextual information.
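A sketch of this evaluation protocol with scikit-learn; the topic features X and the document labels are placeholders standing in for the output of the trained model and the data set annotations:

    # Sketch: classify documents from their topic features with a random forest; report accuracy and F1.
    import numpy as np
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import accuracy_score, f1_score

    rng = np.random.default_rng(0)
    X = rng.random((500, 20))                      # placeholder: per-document topic features
    labels = rng.integers(0, 5, 500)               # placeholder: document category labels

    X_tr, X_te, y_tr, y_te = train_test_split(X, labels, test_size=0.2, random_state=0)
    clf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X_tr, y_tr)
    pred = clf.predict(X_te)
    print("accuracy:", accuracy_score(y_te, pred))
    print("macro F1:", f1_score(y_te, pred, average="macro"))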
The above describes specific embodiments of the present invention and the technical principles employed. Any modification made under the conception of the present invention, as long as the function it produces does not go beyond the spirit covered by the specification and the drawings, shall fall within the protection scope of the present invention.

Claims (1)

1. A short-text topic model based on word vectors and contextual information, characterized in that word vectors and contextual information are effectively used to obtain the semantic similarity between words, and the semantic similarity information is applied in the Gibbs sampling process to increase the semantic coherence of the topics:
(1) Obtaining the semantic similarity between words
Word vectors are trained on Wikipedia or Google News to obtain a vector representation of each word in the training data, and the cosine similarity between vectors is used to represent the semantic relatedness between two words. For words w_i and w_j with corresponding word vectors v_i and v_j, the semantic similarity between the two words is defined as SR(w_i, w_j) = cos(v_i, v_j) = (v_i · v_j) / (||v_i|| ||v_j||). For each word in the training set, the set S(w) of its semantically related words is obtained, defined as S(w) = {w_o | SR(w, w_o) > ε}, where the value of ε depends on the data set and ε ∈ [0, 1];
(2) Filtering the semantic similarity information between words with the training data
The word vectors are obtained by training on a large corpus, and the semantic information they contain may not be suitable for the training data; in order to further incorporate the characteristics of the training data, pointwise mutual information (PMI) is used to filter the obtained semantic similarity information. The PMI between words w_i and w_j is defined as PMI(w_i, w_j) = log( p(w_i, w_j) / ( p(w_i) p(w_j) ) ), where p(w_i, w_j) denotes the probability that words w_i and w_j co-occur in the same document, and p(w) denotes the probability that word w occurs in the document collection, estimated from the document frequency of the word. The set S(w) is redefined according to PMI as S(w) = {w_o | SR(w, w_o) > ε, PMI(w, w_o) ≥ η}, where η ∈ (-∞, +∞) and its specific value depends on the data set;
(3) Defining the generative process of the model
It is specified that the short-document collection contains K topics and one background topic; a short document involves only one topic, and each word in a document is generated either by a normal topic or by the background topic; the specific generative process is:
A) Sample the topic distribution of the document collection: θ ~ Dirichlet(α);
B) Sample the word distribution of the background topic: φ_B ~ Dirichlet(β);
C) Sample the distribution of the binary indicator variable: ψ ~ Dirichlet(γ);
D) For each topic k, sample the topic-word distribution: φ_k ~ Dirichlet(β);
E) For every document d in the collection, first sample the topic of the document, z_d ~ Multinomial(θ); then, for the i-th word in document d, sample a binary indicator variable y_{d,i} ~ Bernoulli(ψ); if y_{d,i} = 0, the word is generated from topic z_d, i.e. w_{d,i} ~ Multinomial(φ_{z_d}); if y_{d,i} = 1, the word is generated from the background topic B, i.e. w_{d,i} ~ Multinomial(φ_B);
(4) Model parameter estimation
The model parameters are solved by Gibbs sampling, and maximum a posteriori estimation of the parameters is carried out from the samples obtained. To improve the semantic coherence of the topics, the sampling method of the generalized Pólya urn (General Pólya Urn) model is used to increase the statistics, under the related topic, of words with high semantic similarity: when the topic of word w is assigned k, n_k^w is increased by 1, and meanwhile, for each w_o ∈ S(w), n_k^{w_o} is increased by A_{w,w_o}, where A_{w,w_o} is the promotion amount added for the related word w_o;
According to the generative process, the hidden variables to be sampled are z and y, and the hidden variables φ_B, φ_{1,...,K}, θ and ψ are obtained by maximum a posteriori estimation. For document d, the sampling formula for the topic z is:
p(z_d = k | z_¬d, w, y) ∝ (m_k^¬d + α) · ∏_{w∈d} ∏_{j=1..n_d^w} (n_k^{w,¬d} + β + j - 1) / ∏_{i=1..N_d} (n_k^¬d + Vβ + i - 1)
where α and β are the hyperparameters of the Dirichlet distributions, V is the size of the word list, n_k^w is the statistic of word w under topic k, n_k is the statistic of topic k over all words, m_k is the number of documents whose topic is k, N_d is the number of non-background words in document d, n_d^w is the number of occurrences of word w in document d, and the superscript ¬d indicates that document d is not taken into account when computing the current statistics. Afterwards, from the samples of z, the distribution of each topic over words is obtained by maximum a posteriori estimation:
φ_k^w = (n_k^w + β) / (n_k + Vβ).
CN201810124600.7A 2018-02-07 2018-02-07 A short-text topic model based on word vectors and contextual information Pending CN108415901A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810124600.7A CN108415901A (en) 2018-02-07 2018-02-07 A short-text topic model based on word vectors and contextual information

Publications (1)

Publication Number Publication Date
CN108415901A true CN108415901A (en) 2018-08-17

Family

ID=63127010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810124600.7A Pending CN108415901A (en) A short-text topic model based on word vectors and contextual information

Country Status (1)

Country Link
CN (1) CN108415901A (en)

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104391842A (en) * 2014-12-18 2015-03-04 苏州大学 Translation model establishing method and system
CN105320960A (en) * 2015-10-14 2016-02-10 北京航空航天大学 Voting based classification method for cross-language subjective and objective sentiments
CN105955948A (en) * 2016-04-22 2016-09-21 武汉大学 Short text topic modeling method based on word semantic similarity
CN107861939A (en) * 2017-09-30 2018-03-30 昆明理工大学 A kind of domain entities disambiguation method for merging term vector and topic model

Cited By (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110532378A (en) * 2019-05-13 2019-12-03 南京大学 A kind of short text aspect extracting method based on topic model
CN110532378B (en) * 2019-05-13 2021-10-26 南京大学 Short text aspect extraction method based on topic model
CN110516053A (en) * 2019-08-15 2019-11-29 出门问问(武汉)信息科技有限公司 Dialog process method, equipment and computer storage medium
CN110674783A (en) * 2019-10-08 2020-01-10 山东浪潮人工智能研究院有限公司 Video description method and system based on multistage prediction architecture
CN110674783B (en) * 2019-10-08 2022-06-28 山东浪潮科学研究院有限公司 Video description method and system based on multi-stage prediction architecture
CN113191134A (en) * 2021-05-31 2021-07-30 平安科技(深圳)有限公司 Document quality verification method, device, equipment and medium based on attention mechanism
CN113191134B (en) * 2021-05-31 2023-02-03 平安科技(深圳)有限公司 Document quality verification method, device, equipment and medium based on attention mechanism
CN113705247A (en) * 2021-10-27 2021-11-26 腾讯科技(深圳)有限公司 Theme model effect evaluation method, device, equipment, storage medium and product
CN113705247B (en) * 2021-10-27 2022-02-11 腾讯科技(深圳)有限公司 Theme model effect evaluation method, device, equipment, storage medium and product

Similar Documents

Publication Publication Date Title
CN108415901A (en) A short-text topic model based on word vectors and contextual information
Nguyen et al. Automatic image filtering on social networks using deep learning and perceptual hashing during crises
CN111126386B (en) Sequence domain adaptation method based on countermeasure learning in scene text recognition
CN105975499B (en) A kind of text subject detection method and system
CN109492026B (en) Telecommunication fraud classification detection method based on improved active learning technology
CN103995876A (en) Text classification method based on chi square statistics and SMO algorithm
CN112417157A (en) Emotion classification method of text attribute words based on deep learning network
CN113051932B (en) Category detection method for network media event of semantic and knowledge expansion theme model
CN108829661A (en) A kind of subject of news title extracting method based on fuzzy matching
CN113448843A (en) Defect analysis-based image recognition software test data enhancement method and device
CN103440315A (en) Web page cleaning method based on theme
CN114329455B (en) User abnormal behavior detection method and device based on heterogeneous graph embedding
CN111984790B (en) Entity relation extraction method
CN103853720B (en) User attention based network sensitive information monitoring system and method
US20150149374A1 (en) Relationship circle processing method and system, and computer storage medium
CN117114112A (en) Vertical field data integration method, device, equipment and medium based on large model
CN110929509B (en) Domain event trigger word clustering method based on louvain community discovery algorithm
CN112329857A (en) Image classification method based on improved residual error network
CN108229565A (en) A kind of image understanding method based on cognition
CN116450827A (en) Event template induction method and system based on large-scale language model
Yuan et al. A novel figure panel classification and extraction method for document image understanding
CN112328812A (en) Domain knowledge extraction method and system based on self-adjusting parameters and electronic equipment
Jingliang et al. A data-driven approach based on LDA for identifying duplicate bug report
Sánchez et al. Diatom classification including morphological adaptations using CNNs
CN112131384A (en) News classification method and computer-readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20180817