CN106095791A - Context-based abstract-sample information retrieval system and feature representation method for abstract samples - Google Patents
Context-based abstract-sample information retrieval system and feature representation method for abstract samples
- Publication number
- CN106095791A CN106095791A CN201610369833.4A CN201610369833A CN106095791A CN 106095791 A CN106095791 A CN 106095791A CN 201610369833 A CN201610369833 A CN 201610369833A CN 106095791 A CN106095791 A CN 106095791A
- Authority
- CN
- China
- Prior art keywords
- abstract
- word
- sample
- word vector
- centroid
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/355—Class or cluster creation or modification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/332—Query formulation
- G06F16/3329—Natural language query formulation or dialogue systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/23—Clustering techniques
- G06F18/232—Non-hierarchical techniques
- G06F18/2321—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
- G06F18/23213—Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
Abstract
The present invention proposes a context-based information retrieval system for abstract samples. In this system, the feature representation method for abstract samples uses Word2vector to extract word-sense features, obtaining a word vector for every abstract word. The word vectors of the abstract words are then clustered under an "optimal-fitness partition", and, according to the clustering result, each abstract word is replaced by the centroid of its cluster. Finally, from the centroids and the frequencies of the abstract words they represent, a word-vector cluster-centroid frequency model (ST-IDF) is built as the feature representation of an abstract sample. The invention reduces the number of clustering runs and fitness evaluations, improves the performance of abstract-sample similarity analysis, and raises sample classification accuracy.
Description
Technical field
The present invention relates to information retrieval over data-link messages, semi-structured text, and plain text, and in particular to sample similarity analysis and classification based on word vectors (Word2vector).
Background technology
An abstract word is a special word in an information retrieval sample whose meaning cannot be understood directly through language, i.e., whose actual semantics cannot be recognized from known linguistic rules (word sense, grammar, word order). Abstract words appear to varying degrees in information retrieval samples, typically military data-link messages (Link-16, Link-22) and the semi-structured text (XML) or plain text used for data exchange. Many data-link messages, semi-structured texts, and plain texts record their information entirely in abstract words. For this situation, we refer to such messages or texts in an information retrieval task as abstract samples.
At present, because the semantics of their abstract words cannot be recognized directly, abstract samples in information retrieval tasks are mostly handled with sample feature representation methods based on word statistics, such as the TF-IDF (Term Frequency-Inverse Document Frequency) model and the BOW (Bag of Words) model. These existing word-statistics methods cannot effectively extract word-sense (semantic) features.
Word2vector is a method for extracting word-sense (semantic) features from context, first proposed by Mikolov et al. in a Google open-source project in early 2013. When documents serve as information retrieval samples, Word2vector can efficiently extract the semantics (i.e., the word-sense features) of each word in each document from its context, and output them in the form of word vectors. Note, however, that Word2vector's word-sense extraction mechanism assigns different vectors to the same word in different documents. This makes it difficult to build a feature representation of a retrieval sample directly from Word2vector's word vectors, and in particular difficult to build a sample feature representation in VSM (vector space model) form.
A feature representation for abstract samples therefore needs to use Word2vector as a context-based word-sense feature extraction method while remaining compatible with existing information retrieval algorithms based on sample feature vectors. However, no generally recognized method has yet emerged that forms a VSM-style feature representation of abstract samples from Word2vector's word-sense features. A context-based information retrieval system for abstract samples, together with a corresponding feature representation method, is therefore urgently needed to solve the above problems.
Summary of the invention
For information retrieval applications, the invention provides a context-based information retrieval system for abstract samples and describes its feature representation method in detail. The object of the invention is to overcome the difficulty, in the prior art, of forming a sample feature representation from Word2vector's word vectors, and to solve the problem of word-sense feature extraction in abstract-sample feature representation.
A context-based information retrieval system for abstract samples comprises a word segmentation module, a word-sense feature extraction module, an abstract-word replacement-representation module, an ST-IDF module, and a classification module. The abstract-sample feature representation method of the system comprises the following steps:
Step 1: use the word segmentation module to segment a sample into abstract words. When the sample is a data-link message, the abstract words are delimited by the message format and word length; when the sample is text, they are delimited by whitespace and specific segmentation rules.
Step 2: use the word-sense feature extraction module to extract the semantic features of the abstract words. For the abstract words obtained in step 1, the Word2vector method extracts each word's word-sense features from its context and expresses them as word vectors.
Step 3: use the abstract-word replacement-representation module to replace each abstract word's features. First, with the number of clusters set under the optimal clustering-effect fitness, the word vectors obtained in step 2 are clustered with the K-means algorithm, i.e., the abstract-word vectors are clustered under the "optimal-fitness partition". Here the centroid of a word-vector cluster is denoted S (a vector in the word-vector space); the number k of centroids is the number of clusters; the total number of abstract words over all samples is N; the number of known sample classes is C; and f(k) is the function expressing the clustering-effect fitness, where α is the mean cosine distance between the k centroid vectors S and β is the mean, over the k clusters, of the mean cosine distance between the word vectors inside each cluster. Let k be a positive integer in [N, N×C]; when f(k) = max(f(k)), the number of clusters under the optimal clustering-effect fitness is K = k, and the number of centroids S is finally fixed at K. Then, according to the final clustering result, each abstract word is replaced by the centroid S of the cluster containing its word vector; equivalently, centroid S stands for the abstract words in its cluster, and the features of an abstract word are approximated by the centroid of its cluster.
Step 4: use the ST-IDF module to output the feature representation of an abstract sample. First, count the frequency of each abstract word in a sample; following the replacement relation of step 3, the occurrence frequency, in this sample, of the abstract words represented by a centroid S is counted as the frequency of S; and compute the inverse document frequency of each word-vector cluster centroid. Then, by analogy with the TF-IDF model, build the word-vector cluster-centroid frequency model ST-IDF. The ST-IDF model is in VSM form and serves as the feature representation of an abstract sample.
Step 5: similarity computation, realizing similarity analysis of abstract samples. From the feature representations of step 4, compute the similarity between two abstract samples, and on that basis run sample classification algorithms from the information retrieval field.
Step 6: use the classification module to assign a class to each represented abstract sample. Based on the similarities, the NWKNN algorithm classifies the abstract samples.
The beneficial effects of the present invention are as follows:
The invention proposes a context-based information retrieval system and its abstract-sample feature representation method, with improvements in two respects: (1) an optimal clustering-effect fitness partitioning algorithm, with abstract-word replacement representation performed on the word-vector clustering obtained under the optimal clustering-effect fitness; (2) the word-vector cluster-centroid frequency model ST-IDF for representing abstract samples.
The invention first extracts word-sense features with Word2vector, obtaining the word vectors of all abstract words in the samples. It then applies the proposed optimal clustering-effect fitness partitioning algorithm, K-means-clusters the abstract-word vectors under the optimal clustering-effect fitness, and, according to the clustering result, replaces each abstract word by the centroid (denoted S) of the cluster containing its word vector. Finally, the occurrence frequency, in a sample, of the abstract words represented by a centroid is counted as the frequency of that centroid S, yielding the word-vector cluster-centroid frequency model ST-IDF, which serves as the feature representation of an abstract sample. Compared with traditional sample feature representations based on word statistics, the ST-IDF model carries the word-sense features of the abstract words and is in VSM (vector space model) form, so it is compatible with existing feature-vector-based information retrieval algorithms (classification, regression, clustering).
Empirically, using NWKNN, a classical sample classification algorithm in the information retrieval field, the ST-IDF model was compared with the TF-IDF model on the public data sets Reuter-21578 and Wikipedia XML. The experimental results objectively demonstrate the clear advantage of the proposed method: it improves the accuracy of abstract-sample similarity computation, raises abstract-sample classification accuracy, and effectively extends the construction of vector space models in the information retrieval field.
Brief description of the drawings
Fig. 1 is the data and module diagram of the abstract-sample information retrieval system of the invention.
Fig. 2 is the flow chart of the information retrieval method of the invention.
Fig. 3 is a schematic diagram of the basic principle of the Word2vector method.
Fig. 4 is a plot of the clustering-effect fitness function.
Fig. 5 is a schematic diagram of the replacement relation in the word-vector space after clustering.
Detailed description of the invention
The invention is described further below with reference to the drawings and embodiments.
As shown in Fig. 1, the context-based abstract-sample information retrieval system of the invention comprises a word segmentation module, a word-sense feature extraction module, an abstract-word replacement-representation module, an ST-IDF module, and a classification module. The abstract-sample feature representation method of the system comprises the following steps:
Step 1: use the word segmentation module to segment a sample into abstract words. When a sample records its information entirely in abstract words, the words cannot be segmented using a dictionary or lexicon; this step therefore treats an abstract word simply as a string of ASCII characters. When the sample is a data-link message, the abstract words are delimited by the message format and word length; when the sample is text, they are delimited by whitespace and specific segmentation rules. An abstract-word token is denoted word_{i,t}, the t-th kind of abstract word in the i-th sample, with i = {1, 2, ..., |D|}, where |D| is the number of samples in the data set D, and t = {1, 2, ..., n}, where n is the number of kinds of abstract words; the total number of abstract-word tokens word_{i,t} over all samples is N.
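The segmentation rules above can be sketched as follows. The fixed four-character field width for data-link messages and plain whitespace splitting are illustrative assumptions, not the patent's exact rules:

```python
def segment(sample, is_datalink, word_len=4):
    """Split a sample into abstract-word tokens (treated as plain ASCII strings)."""
    if is_datalink:
        # Data-link message: cut fixed-length fields according to the message format.
        return [sample[i:i + word_len] for i in range(0, len(sample), word_len)]
    # Text sample: split on whitespace (stand-in for the specific segmentation rules).
    return sample.split()

tokens = segment("J32A TRK0 0042 QLTY", is_datalink=False)
```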
Step 2: use the word-sense feature extraction module to extract the semantic features of the abstract words. For the abstract words obtained in step 1, the Word2vector method extracts each word's word-sense features from its context and expresses them as word vectors. This step uses the Word2vec tool to obtain the word vectors of the abstract words.
Word2vec is a model implementation of the Word2vector method; based on word context, it trains and generates word vectors quickly and effectively. It contains two training modes, CBOW and Skip-gram. As a software tool for training word vectors, Word2vec builds its training models on the neural network language model NNLM, whose basic principle is shown in Fig. 3.
From the abstract words obtained in step 1, the NNLM computes the probability that the next word after some context is word_{i,t}, i.e., p(word_{i,t} | context); the word vectors are a by-product of this training. The NNLM derives a vocabulary V from the data set D, with each word in V corresponding to a token word_{i,t}. To determine the parameters of the neural network, training samples must be built from the data set as network inputs. The NNLM builds a word-context sample as follows: for any word word_{i,t} in D, take its context context(word_{i,t}) (e.g., the preceding n−1 words), forming the tuple (context(word_{i,t}), word_{i,t}); this tuple serves as the training input of the network. Unlike a traditional neural network model, each input node of the NNLM is not a scalar but a vector whose components are variables updated during training; that vector is the word vector. As Fig. 3 shows, for each word word_{i,t} the NNLM maps it to a vector w_{i,t}, its word vector.
The word vector w_{i,t} obtained with the Word2vec tool represents the word-sense features of the t-th kind of abstract-word token in the i-th sample, with i = {1, 2, ..., |D|}, where |D| is the number of samples; the total number of abstract-word vectors w_{i,t} over all samples is N.
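A minimal sketch of the (context, word) training-pair construction described above, pairing each word with its n−1 preceding tokens (the vector training itself is left to the Word2vec tool):

```python
def context_pairs(tokens, n=3):
    """Build NNLM training tuples (context(word), word): each word is paired
    with the n-1 tokens that precede it."""
    pairs = []
    for i in range(n - 1, len(tokens)):
        context = tuple(tokens[i - (n - 1):i])
        pairs.append((context, tokens[i]))
    return pairs

pairs = context_pairs(["A1", "B2", "C3", "D4"], n=3)
# each pair: ((two preceding tokens), current token)
```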
Step 3: use the abstract-word replacement-representation module so that word-vector cluster centroids represent the abstract words in their clusters. First, with the number of clusters set under the optimal clustering-effect fitness, the word vectors obtained in step 2 are clustered with the K-means algorithm, i.e., the abstract-word vectors are clustered under the "optimal-fitness partition". In the K-means clustering of word vectors, the cosine of the angle between two word vectors is used as the distance between them.
From step 2, the total number of abstract-word vectors w_{i,t} over all samples is N, where w_{i,t} represents the word-sense features of the t-th kind of abstract-word token in the i-th sample. The number of known sample classes is C and the number of samples is M. In this step, the centroid of a word-vector cluster is denoted S (a vector in the word-vector space), and the number k of centroids is the number of clusters.
To quantify the K-means clustering effect in the word-vector space, the invention computes an adaptive fitness for the number of clusters. Let f(k) be the function expressing the clustering-effect fitness, where α is the mean cosine distance between the k centroid vectors S and β is the mean, over the k clusters, of the mean cosine distance between the word vectors inside each cluster. Specifically, S and S′ are centroid vectors of different clusters, and w_{i,t} and w′_{i,t} are word vectors of different abstract-word tokens belonging to the b-th cluster.
Let the number of clusters k ∈ [N, N×C] be a positive integer. When f(k) = max(f(k)), the number of clusters under the optimal clustering-effect fitness is K = k, and f(K) is the maximum of the clustering-effect fitness. Computation shows that f(k) is monotonically increasing on the interval from N to K and monotonically decreasing on the interval from K to N×C; the graph of f(k) is shown in Fig. 4. Thus, when f(k) = max(f(k)), K = k and f(K) is the extremum of the fitness function, i.e., the optimal clustering-effect fitness, and the number of K-means cluster centroids S is finally fixed at K. In determining max(f(k)), K, and f(K), every evaluation of f(k) first requires a K-means clustering with k centroids; to reduce the number of K-means runs and f(k) evaluations, the invention proposes the optimal clustering-effect fitness partitioning algorithm, as follows:
Optimal clustering-effect fitness partitioning algorithm
Analysis of the optimal clustering-effect fitness partitioning algorithm: from its recursive structure, its time complexity is O(log₂[(N×C−N)/4]), so the number of K-means runs and f(k) evaluations actually performed in this step is at most log₂[(N×C−N)/4]. Without the partitioning algorithm, k = {N, N+1, N+2, ..., N×C}, and determining max(f(k)), K, and f(K) requires on average (N×C−N)/2 K-means runs and f(k) evaluations. The partitioning algorithm of this step therefore reduces the number of clustering runs and fitness evaluations.
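The patent gives the algorithm listing only as a figure. Since f(k) is stated to be unimodal on [N, N×C], a divide-and-conquer search of the kind described (the interval shrinks by a constant factor per recursion, so O(log) evaluations suffice) can be sketched as follows; the ternary-search form is an assumption for illustration:

```python
def best_k(f, lo, hi):
    """Locate K maximizing a unimodal fitness f on the integer interval
    [lo, hi] with O(log(hi - lo)) evaluations of f (each evaluation of f(k)
    stands for one K-means run with k centroids plus the fitness
    computation). Returns (K, number of evaluations performed)."""
    evals = 0
    while hi - lo > 2:
        third = (hi - lo) // 3
        m1, m2 = lo + third, hi - third
        evals += 2
        if f(m1) < f(m2):
            lo = m1 + 1   # the peak lies to the right of m1
        else:
            hi = m2 - 1   # the peak lies at or to the left of m2 - 1
    candidates = list(range(lo, hi + 1))
    evals += len(candidates)
    return max(candidates, key=f), evals
```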
Finally, according to the final clustering result, each abstract word is replaced by the centroid S of the cluster containing its word vector. Specifically, when f(k) = max(f(k)) and the number of clusters under the optimal clustering-effect fitness is K = k, every abstract word w_{i,t} is replaced by the centroid S of the cluster containing its word vector; that is, the features of the abstract word are approximated by the centroid of its cluster. In any local word-vector space, centroid S stands for the abstract words in its cluster; the correspondence is shown in Fig. 5, and the replacement relation is given by the following formula, in which the abstract words word_{i,t} represented by the b-th cluster centroid S_b form a set, w_{i,t} is the word vector of abstract word word_{i,t}, and W_b is the set of abstract words whose word vectors belong to the cluster of centroid S_b.
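A minimal sketch of the replacement representation, assigning each abstract word to the nearest centroid by cosine distance (toy 2-D vectors; in the system the centroids would come from the K-means run of step 3):

```python
import math

def cos_dist(u, v):
    """Cosine distance 1 - cos(angle) between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (nu * nv)

def replace_with_centroids(word_vecs, centroids):
    """Map each abstract word to the index b of the nearest centroid S_b,
    i.e. approximate its features by the centroid of its cluster."""
    return {
        word: min(range(len(centroids)), key=lambda b: cos_dist(v, centroids[b]))
        for word, v in word_vecs.items()
    }

centroids = [(1.0, 0.0), (0.0, 1.0)]
mapping = replace_with_centroids({"w1": (0.9, 0.1), "w2": (0.2, 0.8)}, centroids)
# w1 -> centroid 0, w2 -> centroid 1
```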
Step 4: use the ST-IDF module to output the feature representation of an abstract sample. First, count the frequency of each abstract word in a sample; following the replacement relation between centroids S and abstract words from step 3, the occurrence frequency, in this sample, of the abstract words represented by the b-th centroid S_b is counted as the frequency of S_b; and compute the inverse document frequency of each word-vector cluster centroid S_b, b = {1, 2, ..., K}. Then, by analogy with the TF-IDF model, build the word-vector cluster-centroid frequency model ST-IDF; the construction is detailed below.
In the TF-IDF model, the feature representation of a sample doc_i is realized by the feature vector d_i,
d_i = (d_i(1), d_i(2), ..., d_i(n)).
The t-th component d_i(t) of the vector d_i is computed as
d_i(t) = TF(word_t, doc_i) · IDF(word_t),
where TF(word_t, doc_i) is the frequency of the word word_t in the sample doc_i, computed as
TF(word_t, doc_i) = n(word_t, doc_i) / Σ_t′ n(word_t′, doc_i),
whose numerator is the number of occurrences of this word in the sample and whose denominator is the total number of occurrences of all words in the sample; and IDF(word_t) is the inverse document frequency of the word word_t, computed as
IDF(word_t) = log( |D| / |{doc_i | word_t ∈ doc_i}| ),
where D is the data set formed by the samples doc_i, |D| is the total number of samples in D, and |{doc_i | word_t ∈ doc_i}| is the number of samples containing the word word_t.
With reference to the TF-IDF model, the ST-IDF model is constructed as follows. SF(S_b, doc_i) is the frequency of the word-vector cluster centroid S_b in the abstract sample doc_i, computed as
SF(S_b, doc_i) = Σ_{w_{i,t} ∈ W_b} TF(w_{i,t}),
where W_b is the set of abstract words whose word vectors belong to the cluster of centroid S_b, and TF(w_{i,t}) is the frequency with which the abstract word w_{i,t} occurs in the abstract sample doc_i; SF(S_b, doc_i) counts only the abstract words in doc_i that are represented by S_b. IDF(S_b) is the inverse document frequency of the cluster centroid S_b, computed as
IDF(S_b) = log( |D| / |{doc_i | doc_i contains an abstract word represented by S_b}| ),
where D is the data set formed by the abstract samples doc_i, |D| is the total number of samples in D, and the denominator is the number of samples containing an abstract word represented by S_b.
In the ST-IDF model, the feature representation of an abstract sample doc_i is realized by the feature vector d′_i, whose b-th component d′_i(b) is computed as
d′_i(b) = SF(S_b, doc_i) · IDF(S_b).
The ST-IDF model proposed in this step is in VSM (vector space model) form and serves as the feature representation of an abstract sample.
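A minimal sketch of the ST-IDF construction under the formulas above. The word-to-centroid mapping is assumed to be given by step 3; the toy documents are illustrative:

```python
import math
from collections import Counter

def st_idf(docs, word2centroid, K):
    """Build the ST-IDF feature vector of each document.

    docs: list of token lists; word2centroid: abstract word -> centroid index b;
    K: number of centroids. Returns one K-dimensional vector per document.
    """
    # SF(S_b, doc): summed term frequency of the words represented by S_b.
    sf = []
    for doc in docs:
        counts = Counter(word2centroid[w] for w in doc)
        sf.append([counts[b] / len(doc) for b in range(K)])
    # IDF(S_b): log(|D| / number of docs containing a word represented by S_b).
    df = [sum(1 for row in sf if row[b] > 0) for b in range(K)]
    idf = [math.log(len(docs) / df[b]) if df[b] else 0.0 for b in range(K)]
    return [[row[b] * idf[b] for b in range(K)] for row in sf]

docs = [["w1", "w2", "w1"], ["w3", "w3"]]
vecs = st_idf(docs, {"w1": 0, "w2": 0, "w3": 1}, K=2)
```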
Step 5: similarity computation, realizing similarity analysis of abstract samples. From the feature representations of step 4, compute the similarity between two abstract samples, and on that basis run sample classification algorithms from the information retrieval field.
The context-based feature representation method for abstract samples uses the ST-IDF model of step 4 to represent abstract samples. The similarity of any two abstract samples doc_i and doc_i′ is given by the similarity function Sim(doc_i, doc_i′), computed as the cosine of the angle between their feature vectors d′_i and d′_i′ in the ST-IDF vector space.
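The similarity function can be sketched directly from its definition, the cosine of the angle between the two ST-IDF vectors:

```python
import math

def sim(di, dj):
    """Sim(doc_i, doc_j): cosine of the angle between two ST-IDF feature vectors."""
    dot = sum(a * b for a, b in zip(di, dj))
    ni = math.sqrt(sum(a * a for a in di))
    nj = math.sqrt(sum(b * b for b in dj))
    return dot / (ni * nj)

s = sim([1.0, 0.0, 2.0], [2.0, 0.0, 4.0])  # parallel vectors -> 1.0
```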
Step 6: use the classification module to assign a class to each represented abstract sample. Based on the similarities, the NWKNN algorithm classifies the abstract samples.
Using the similarity function Sim(doc_i, doc_i′), abstract-sample classification is performed with NWKNN, a classical sample classification algorithm in the information retrieval field. NWKNN is a neighbor-weighted KNN algorithm for classifying samples from imbalanced class sets. In its formula, the function score(doc, c_i) computes the score for assigning document doc to class c_i; the function Sim(doc, doc_j) is the similarity between sample doc and known-class sample doc_j, computed as the vector cosine distance; Weight_i is the class weight, whose setting value is taken as 3.5; and the function δ(doc_j, c_i) indicates whether sample doc_j belongs to class c_i: if doc_j belongs to c_i the function value is 1, otherwise it is 0.
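The NWKNN scoring formula itself appears only as a figure in the patent; the sketch below follows the description (per-class weight times the summed similarity of the k nearest neighbors in that class, with δ selecting the neighbors of each class) and is an assumption in that respect. The class weight is applied as the text's setting value of 3.5:

```python
import math

def cosine(u, v):
    """Cosine similarity between two feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def nwknn_classify(doc_vec, train, sim, k=5, weights=None):
    """Score each class c as Weight_c times the summed similarity of the
    k nearest training neighbors belonging to c (delta picks out the
    neighbors of class c), and return the highest-scoring class."""
    neighbors = sorted(train, key=lambda t: sim(doc_vec, t[0]), reverse=True)[:k]
    scores = {}
    for c in {label for _, label in train}:
        w = weights.get(c, 1.0) if weights else 1.0
        scores[c] = w * sum(sim(doc_vec, v) for v, label in neighbors if label == c)
    return max(scores, key=scores.get)

label = nwknn_classify(
    (0.9, 0.1),
    [((1.0, 0.0), "A"), ((0.8, 0.2), "A"), ((0.0, 1.0), "B")],
    cosine, k=2, weights={"A": 3.5, "B": 3.5},
)
```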
Classification performance is evaluated with the F1-measure, which combines recall (Recall) and precision (Precision) into the metric
F1 = 2 · Precision · Recall / (Precision + Recall).
With the F1-measure, the classification effect of the system on a data set can be observed. For ease of comparison, the macro-averaged F1 metric Macro-F1 of the abstract-sample classification results is summarized, and the average precision of the classification results is also obtained.
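A minimal sketch of the evaluation metric: per-class precision, recall, and F1, macro-averaged over the classes:

```python
def macro_f1(y_true, y_pred):
    """Macro-F1: mean over classes of F1 = 2*P*R / (P + R)."""
    classes = sorted(set(y_true))
    f1s = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

score = macro_f1(["A", "A", "B", "B"], ["A", "B", "B", "B"])
```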
With Wikipedia XML as the data set of semi-structured data-exchange text, and the Reuters document collection Reuter-21578 as the data set of plain text, abstract-sample classification experiments were run with the NWKNN algorithm and evaluated with the F1-measure. The classification results of the proposed ST-IDF vectors versus the prior-art TF-IDF vectors are shown in Tables 1 and 2:
Table 1: Classification comparison of TF-IDF vectors and ST-IDF vectors on the Wikipedia XML data set
Table 2: Classification comparison of TF-IDF vectors and ST-IDF vectors on the Reuter-21578 data set
Tables 1 and 2 show that the classification results of the proposed ST-IDF vectors are clearly better than those of the prior-art TF-IDF vectors: on the Wikipedia XML data set the average precision rises from 48.7% to 59.2%, and on the Reuter-21578 data set from 57.1% to 63.3%. The experimental results show that, for information retrieval tasks classifying abstract samples by similarity, the proposed ST-IDF model achieves better F1-measure results than the TF-IDF model, demonstrating that the feature representation method provided by the invention has the advantage of extracting word-sense features of abstract samples.
Claims (1)
1. A context-based abstract-sample information retrieval system, characterized in that it comprises a word segmentation module, a word-sense feature extraction module, an abstract-word replacement-representation module, an ST-IDF module, and a classification module, the abstract-sample feature representation method of the abstract-sample information retrieval system comprising the following steps:
Step 1: using the word segmentation module to segment a sample into abstract words: when the sample is a data-link message, delimiting each abstract word by the message format and word length; when the sample is text, delimiting each abstract word by whitespace and specific segmentation rules;
Step 2: using the word-sense feature extraction module to extract the semantic features of the abstract words: for the abstract words obtained in step 1, extracting each word's word-sense features from its context with the Word2vector method and expressing them as word vectors;
Step 3: using the abstract-word replacement-representation module to replace each abstract word's features: first, with the number of clusters set under the optimal clustering-effect fitness, clustering the word vectors obtained in step 2 with the K-means algorithm, i.e., clustering the abstract-word vectors under the "optimal-fitness partition", wherein the centroid of a word-vector cluster is denoted S (a vector in the word-vector space), the number k of centroids is the number of clusters, the total number of abstract words over all samples is N, the number of known sample classes is C, f(k) is the function expressing the clustering-effect fitness, α is the mean cosine distance between the k centroid vectors S, and β is the mean, over the k clusters, of the mean cosine distance between the word vectors inside each cluster; letting k be a positive integer in [N, N×C], when f(k) = max(f(k)), taking the number of clusters under the optimal clustering-effect fitness as K = k, the number of centroids S being finally fixed at K; then, according to the final clustering result, replacing each abstract word by the centroid S of the cluster containing its word vector, i.e., letting centroid S stand for the abstract words in its cluster and approximating the features of an abstract word by the centroid of its cluster;
Step 4: using the ST-IDF module to output the feature representation of an abstract sample: first, counting the frequency of each abstract word in a sample and, following the replacement relation of step 3, counting the occurrence frequency, in this sample, of the abstract words represented by a centroid S as the frequency of S; computing the inverse document frequency of each word-vector cluster centroid; then, by analogy with the TF-IDF model, building the word-vector cluster-centroid frequency model ST-IDF, the ST-IDF model being in VSM form and serving as the feature representation of an abstract sample;
Step 5: similarity computation, realizing similarity analysis of abstract samples: from the feature representations of step 4, computing the similarity between two abstract samples, and on that basis running sample classification algorithms from the information retrieval field;
Step 6: using the classification module to assign a class to each represented abstract sample: based on the similarities, classifying the abstract samples with the NWKNN algorithm.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610068972 | 2016-01-31 | ||
CN2016100689723 | 2016-01-31 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106095791A true CN106095791A (en) | 2016-11-09 |
CN106095791B CN106095791B (en) | 2019-08-09 |
Family
ID=57230265
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610369833.4A Active CN106095791B (en) | 2016-01-31 | 2016-05-29 | A kind of abstract sample information searching system based on context |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106095791B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101339551A (en) * | 2007-07-05 | 2009-01-07 | 日电(中国)有限公司 | Natural language query demand extension equipment and its method |
CN101847405A (en) * | 2009-03-23 | 2010-09-29 | 索尼公司 | Speech recognition equipment and method, language model generation device and method and program |
US20110087468A1 (en) * | 2009-10-12 | 2011-04-14 | Lewis James M | Approximating a System Using an Abstract Geometrical Space |
CN104598586A (en) * | 2015-01-18 | 2015-05-06 | 北京工业大学 | Large-scale text classifying method |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106502994A (en) * | 2016-11-29 | 2017-03-15 | 上海智臻智能网络科技股份有限公司 | A kind of method and apparatus of the keyword extraction of text |
CN106502994B (en) * | 2016-11-29 | 2019-12-13 | 上海智臻智能网络科技股份有限公司 | method and device for extracting keywords of text |
CN106874367A (en) * | 2016-12-30 | 2017-06-20 | 江苏号百信息服务有限公司 | A kind of sampling distribution formula clustering method based on public sentiment platform |
CN110363206A (en) * | 2018-03-26 | 2019-10-22 | 阿里巴巴集团控股有限公司 | Cluster, data processing and the data identification method of data object |
CN110363206B (en) * | 2018-03-26 | 2023-06-27 | 阿里巴巴集团控股有限公司 | Clustering of data objects, data processing and data identification method |
CN111241269A (en) * | 2018-11-09 | 2020-06-05 | 中移(杭州)信息技术有限公司 | Short message text classification method and device, electronic equipment and storage medium |
CN111241269B (en) * | 2018-11-09 | 2024-02-23 | 中移(杭州)信息技术有限公司 | Short message text classification method and device, electronic equipment and storage medium |
CN110110143A (en) * | 2019-04-15 | 2019-08-09 | 厦门网宿有限公司 | A kind of video classification methods and device |
CN110457470A (en) * | 2019-07-05 | 2019-11-15 | 深圳壹账通智能科技有限公司 | A kind of textual classification model learning method and device |
CN113127636A (en) * | 2019-12-31 | 2021-07-16 | 北京国双科技有限公司 | Method and device for selecting center point of text cluster |
CN113127636B (en) * | 2019-12-31 | 2024-02-13 | 北京国双科技有限公司 | Text clustering cluster center point selection method and device |
Also Published As
Publication number | Publication date |
---|---|
CN106095791B (en) | 2019-08-09 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106095791A (en) | A kind of abstract sample information searching system based on context and abstract sample characteristics method for expressing thereof | |
CN113792818B (en) | Intention classification method and device, electronic equipment and computer readable storage medium | |
CN107861939A (en) | A kind of domain entities disambiguation method for merging term vector and topic model | |
CN100583101C (en) | Text categorization feature selection and weight computation method based on field knowledge | |
CN109885824B (en) | Hierarchical Chinese named entity recognition method, hierarchical Chinese named entity recognition device and readable storage medium | |
CN104750844B (en) | Text eigenvector based on TF-IGM generates method and apparatus and file classification method and device | |
CN109948149B (en) | Text classification method and device | |
CN107818164A (en) | A kind of intelligent answer method and its system | |
CN108287858A (en) | The semantic extracting method and device of natural language | |
CN106980609A (en) | A kind of name entity recognition method of the condition random field of word-based vector representation | |
CN105808524A (en) | Patent document abstract-based automatic patent classification method | |
CN104573046A (en) | Comment analyzing method and system based on term vector | |
CN107992542A (en) | A kind of similar article based on topic model recommends method | |
CN106372061A (en) | Short text similarity calculation method based on semantics | |
CN109508379A (en) | A kind of short text clustering method indicating and combine similarity based on weighted words vector | |
CN103617157A (en) | Text similarity calculation method based on semantics | |
CN110321925A (en) | A kind of more granularity similarity comparison methods of text based on semantics fusion fingerprint | |
CN101231634A (en) | Autoabstract method for multi-document | |
CN114169442B (en) | Remote sensing image small sample scene classification method based on double prototype network | |
CN105760888A (en) | Neighborhood rough set ensemble learning method based on attribute clustering | |
CN103473217B (en) | The method and apparatus of extracting keywords from text | |
CN103020167B (en) | A kind of computer Chinese file classification method | |
CN105205124A (en) | Semi-supervised text sentiment classification method based on random feature subspace | |
CN111597328B (en) | New event theme extraction method | |
CN104484380A (en) | Personalized search method and personalized search device |
Legal Events
Date | Code | Title | Description
---|---|---|---
| C06 | Publication | |
| PB01 | Publication | |
| C10 | Entry into substantive examination | |
| SE01 | Entry into force of request for substantive examination | |
| TA01 | Transfer of patent application right | Effective date of registration: 2019-06-28. Address after: Room 201-004, Floor 2, Courtyard No. 1, Gaolizhang Road, Haidian District, Beijing 100095. Applicant after: Changyuan Power (Beijing) Technology Co., Ltd. Address before: Building 1, South District, Danfeng District, Guyunhu Street Office, Changqing District, Shandong Province 250300. Applicant before: Changyuan Power (Shandong) Technology Co., Ltd. |
| GR01 | Patent grant | |