CN106095791A - A context-based abstract-sample information retrieval system and its abstract-sample feature representation method - Google Patents

A context-based abstract-sample information retrieval system and its abstract-sample feature representation method Download PDF

Info

Publication number
CN106095791A
Authority
CN
China
Prior art keywords
abstract
word
sample
term vector
barycenter
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610369833.4A
Other languages
Chinese (zh)
Other versions
CN106095791B (en)
Inventor
吴琳
韩广
袁鑫攀
李亚楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changyuan Power (Beijing) Technology Co., Ltd.
Original Assignee
Changyuan Power (Shandong) Technology Co., Ltd.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changyuan Power (Shandong) Technology Co., Ltd.
Publication of CN106095791A publication Critical patent/CN106095791A/en
Application granted granted Critical
Publication of CN106095791B publication Critical patent/CN106095791B/en
Legal status: Active


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering

Abstract

The present invention provides a context-based abstract-sample information retrieval system. In this system, the abstract-sample feature representation method first uses Word2vector to extract word-sense features, obtaining a word vector for each abstract word. The word vectors of the abstract words are then clustered under an "optimal-fitness partition", and according to the clustering result each abstract word is replaced by the centroid of its cluster. Finally, from the centroids and the frequencies of the abstract words they represent, a word-vector cluster centroid frequency model (ST-IDF) is built to represent an abstract sample as a feature vector. The invention reduces the number of clustering runs and fitness evaluations, improves the performance of abstract-sample similarity analysis, and raises sample classification accuracy.

Description

A context-based abstract-sample information retrieval system and its abstract-sample feature representation method
Technical field
The present invention relates to the field of information retrieval over data-link messages, semi-structured text, and plain text, and in particular to word-vector (Word2vector) based sample similarity analysis and classification.
Background technology
An abstract word is a special word in an information retrieval sample that cannot be understood directly as language, i.e., its actual meaning cannot be recognized from known linguistic rules (word sense, grammar, word order). Abstract words occur widely in information retrieval samples, for example military data-link messages (Link-16, Link-22) and the semi-structured text (XML) or plain text used for data exchange. Moreover, many data-link messages, semi-structured texts, and plain texts record their information entirely in abstract words. We refer to such messages or texts in an information retrieval task as abstract samples.
At present, since the semantics of abstract words cannot be recognized directly, abstract samples in information retrieval tasks are mostly handled with sample feature representation methods based on word statistics, such as the TF-IDF (Term Frequency-Inverse Document Frequency) model and the BOW (Bag of Words) model. These word-statistics representations cannot effectively extract word-sense (semantic) features.
Word2vector is a method for extracting word-sense (semantic) features from context, first proposed by Mikolov et al. in early 2013 in a Google open-source project. When documents serve as information retrieval samples, Word2vector can efficiently extract, for each word in each document, a semantic (word-sense) feature from its context, given in the form of a word vector. Note, however, that the word-sense extraction mechanism of Word2vector assigns different word vectors to the same word in different documents. This makes it difficult to build a feature representation of a retrieval sample from Word2vector word vectors, and in particular to form a sample feature representation in VSM (vector space model) form.
A feature representation for abstract samples therefore needs to use Word2vector as a context-based word-sense extraction method while remaining compatible with existing retrieval algorithms based on sample feature vectors. However, no widely recognized method yet forms a VSM-style abstract-sample feature representation from Word2vector word-sense features.
Hence there is an urgent need for a context-based abstract-sample information retrieval system and a corresponding abstract-sample feature representation method that solve the above problems.
Summary of the invention
For information retrieval applications, the invention provides a context-based abstract-sample information retrieval system and describes its feature representation method in detail. The object of the invention is to overcome the difficulty in the prior art of forming a sample feature representation from Word2vector word vectors, and to solve the problem of word-sense feature extraction in abstract-sample feature representation.
A context-based abstract-sample information retrieval system comprises a word segmentation module, a word-sense feature extraction module, an abstract-word replacement representation module, an ST-IDF module, and a classification module. The abstract-sample feature representation method of the system comprises the following steps:
Step 1: segment the sample into abstract words with the segmentation module. When the sample is a data-link message, the abstract words can be separated according to the message format and word length; when the sample is text, according to spaces and specific segmentation rules.
Step 2: extract the semantic features of the abstract words with the word-sense feature extraction module. For the abstract words obtained in step 1, the Word2vector method extracts word-sense features from the context of each abstract word and expresses them as word vectors.
Step 3: replace the abstract-word features with the replacement representation module. First, using the cluster count under the optimal clustering-effect fitness, the word vectors obtained in step 2 are clustered with the K-means algorithm, i.e., an "optimal-fitness partition" of the abstract-word vectors is realized. Here the centroid of a word-vector cluster is denoted S (a vector in word-vector space); the number k of centroids S is the number of clusters; N is the number of abstract words in all samples; C is the number of known sample classes; and f(k) is the clustering-effect fitness function,

f(k) = α / β,  N ≤ k ≤ N × C,

where α is the mean cosine distance between the k centroid vectors S, and β is the average, over the k clusters, of the mean cosine distance between the word vectors within a cluster; k is a positive integer with k ∈ [N, N × C]. When f(k) = max(f(k)), the cluster count under the optimal clustering-effect fitness is K = k, and the number of centroids S is finally fixed at K. Then, according to the final clustering result, each abstract word is replaced by the centroid S of the cluster containing its word vector; equivalently, the centroid S represents the abstract words in its cluster, and the feature of an abstract word is approximated by the centroid of its cluster.
Step 4: output the abstract-sample feature representation with the ST-IDF module. First, count the frequency of each abstract word in a sample; by the replacement relation of step 3, the occurrence frequency in the sample of the abstract words represented by a centroid S is counted as the frequency of S, and the inverse document frequency of each cluster centroid is computed. Then, by analogy with the TF-IDF model, the word-vector cluster centroid frequency model ST-IDF is built; the ST-IDF model is in VSM form and serves as the feature representation of an abstract sample.
Step 5: similarity computation for abstract-sample similarity analysis. Using the feature representation of step 4, compute the similarity between two abstract samples, on which sample classification algorithms of the information retrieval field can then operate.
Step 6: judge the class of the feature-represented abstract sample with the classification module, applying the NWKNN algorithm to the similarities.
The beneficial effects of the present invention are as follows:
The invention proposes a context-based information retrieval system and an abstract-sample feature representation method with improvements in two respects: (1) an optimal clustering-effect fitness partition algorithm, with abstract-word replacement representation based on word-vector clustering under the optimal fitness; (2) the word-vector cluster centroid frequency model ST-IDF for abstract-sample feature representation.
The invention first extracts word-sense features with Word2vector, obtaining word vectors for all abstract words in the samples. It then proposes the optimal clustering-effect fitness partition algorithm, performs K-means clustering of the abstract-word vectors under the optimal clustering-effect fitness, and, according to the clustering result, replaces each abstract word by the centroid (denoted S) of the cluster containing its word vector. Finally, the in-sample occurrence frequency of the abstract words represented by a centroid is counted as the frequency of that centroid S, and the word-vector cluster centroid frequency model ST-IDF is built as the feature representation of an abstract sample. Compared with traditional word-statistics feature representations, the ST-IDF model incorporates the word-sense features of abstract words and is in VSM (vector space model) form, so it suits existing feature-vector-based retrieval algorithms (classification, regression, clustering).
Empirically, using NWKNN, a classic sample classification algorithm of the information retrieval field, comparative experiments between the ST-IDF and TF-IDF models on the public data sets Reuter-21578 and Wikipedia XML objectively demonstrate the clear advantage of the proposed method: it improves the accuracy of abstract-sample similarity computation, raises abstract-sample classification accuracy, and effectively extends the construction of vector space models in information retrieval.
Brief description of the drawings
Fig. 1 is the data and module diagram of the abstract-sample information retrieval system of the invention.
Fig. 2 is the flow chart of the information retrieval method of the invention.
Fig. 3 is a schematic diagram of the basic principle of the Word2vector method.
Fig. 4 is a plot of the clustering-effect fitness function.
Fig. 5 is a schematic diagram of the replacement relation in word-vector space according to the clustering.
Detailed description of the invention
The present invention is further described below with reference to the drawings and embodiments.
As shown in Fig. 1, the context-based abstract-sample information retrieval system of the invention comprises a word segmentation module, a word-sense feature extraction module, an abstract-word replacement representation module, an ST-IDF module, and a classification module.
The abstract-sample feature representation method of the abstract-sample information retrieval system comprises the following steps:
Step 1: segment the sample into abstract words with the segmentation module. When a sample records its information entirely in abstract words, segmentation cannot rely on a dictionary or lexicon, so this step simply treats each abstract word as a string of ASCII characters. When the sample is a data-link message, the abstract words are separated according to the message format and word length; when the sample is text, according to spaces and specific segmentation rules. The segmented abstract words are denoted word_{i,t}, the t-th kind of abstract word in the i-th sample, with i = {1, 2, ..., |D|}, where |D| is the number of samples in data set D, t = {1, 2, ..., n}, n is the number of kinds of abstract words, and the total number of abstract words word_{i,t} across all samples is N.
Step 2: extract the semantic features of the abstract words with the word-sense feature extraction module. For the abstract words obtained in step 1, the Word2vector method extracts word-sense features from the context of each abstract word and expresses them as word vectors. This step uses the Word2vec tool to obtain the word vectors of the abstract words.
Word2vec is a model implementation of the Word2vector method; it can quickly and effectively train and generate word vectors from word context. It contains two training modes, CBOW and Skip-gram. As a software tool for training word vectors, Word2vec bases its training model on the neural network language model NNLM, whose basic principle is shown in Fig. 3.
Given the abstract words obtained in step 1, NNLM computes the probability that the next word after some context is word_{i,t}, i.e., p(word_{i,t} = t | context); the word vectors are a by-product of this training. NNLM generates a vocabulary V corresponding to data set D; each word in V corresponds to a label word_{i,t}. To determine the parameters of the neural network, training samples must be built from the data set as the network input. The construction of the NNLM word-context samples is: for any word word_{i,t} in D, take its context context(word_{i,t}) (e.g., the preceding n-1 words), obtaining a tuple (context(word_{i,t}), word_{i,t}), which is fed to the network for training. Unlike a traditional neural network, each input node of NNLM is no longer a scalar but a vector whose components are variables updated during training; this vector is the word vector. As Fig. 3 shows, NNLM maps each word word_{i,t} to a vector w_{i,t}, its word vector.
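The construction of (context, word) training tuples described above can be sketched as follows. This is an illustrative assumption of one concrete reading (a window of the n-1 preceding words); the token strings are made-up stand-ins for abstract words, not data from the patent.

```python
# Hypothetical sketch: building the NNLM training tuples (context(word), word),
# assuming "context" means the n-1 words preceding each word (n = 3 here).
def build_context_tuples(tokens, n=3):
    """Return (context, word) pairs; context = up to n-1 preceding tokens."""
    pairs = []
    for idx, word in enumerate(tokens):
        context = tuple(tokens[max(0, idx - (n - 1)):idx])
        pairs.append((context, word))
    return pairs

# Toy abstract-word sequence (invented identifiers, not real Link-16 fields):
sample = ["A1F", "0x3B", "C09", "A1F", "77D"]
pairs = build_context_tuples(sample, n=3)
```

Each tuple is then one training input; the network's input-layer vectors, updated during training, become the word vectors.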
The word vector w_{i,t} obtained with the Word2vec tool represents the word-sense feature of the t-th kind of abstract word in the i-th sample, with i = {1, 2, ..., |D|}, where |D| is the number of samples; the total number of abstract-word vectors w_{i,t} across all samples is N.
Step 3: with the abstract-word replacement representation module, represent the abstract words of each cluster by the cluster's word-vector centroid. First, using the cluster count under the optimal clustering-effect fitness, the word vectors obtained in step 2 are clustered with the K-means algorithm, realizing the "optimal-fitness partition" of the abstract-word vectors. In the K-means clustering of word vectors, the cosine of the angle between two word vectors is used as the distance between them.
From step 2, the number of abstract-word vectors w_{i,t} across all samples is N, where w_{i,t} represents the word-sense feature of the t-th kind of abstract word in the i-th sample. The number of known sample classes is C and the number of samples is M. In this step, the centroid of a word-vector cluster is denoted S (a vector in word-vector space), and the number k of centroids S is the number of clusters.
To quantify the K-means clustering effect in word-vector space, the invention computes the cluster count adaptively. Let f(k) be the function expressing the clustering-effect fitness,

f(k) = α / β,  N ≤ k ≤ N × C,

where α is the mean cosine distance between the k centroid vectors S, and β is the average, over the k clusters, of the mean cosine distance between the word vectors within a cluster; specifically:
α = (1/k) Σ cos(S, S′),

β = (1/k) Σ_{b=1}^{k} mean cos(w_{i,t}, w′_{i,t}),

where S and S′ are the centroid vectors of different clusters, and w_{i,t} and w′_{i,t} are the word vectors of different abstract words belonging to the b-th cluster.
The cluster count k ∈ [N, N × C] is a positive integer. When f(k) = max(f(k)), let K = k be the cluster count under the optimal clustering-effect fitness; f(K) is the maximum of the fitness. Computation shows that f(k) is monotonically increasing on the interval from N to K and monotonically decreasing from K to N × C; the graph of f(k) is shown in Fig. 4.
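The fitness f(k) = α/β can be sketched for a given K-means partition as below. This is an assumption-laden illustration, not the patent's code: the function names and the toy 2-D vectors are invented, and α follows the (1/k)-normalized sum of pairwise centroid cosines exactly as the formula above is written.

```python
# Illustrative sketch of f(k) = alpha/beta for one k-means partition, using
# the cosine of the angle between word vectors as the distance measure.
from itertools import combinations
from math import sqrt

def cos_sim(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))

def fitness(centroids, clusters):
    """centroids: the k centroid vectors S; clusters: k lists of member vectors."""
    k = len(centroids)
    # alpha: (1/k)-normalized sum of pairwise cosines between the k centroids S
    alpha = sum(cos_sim(s, s2) for s, s2 in combinations(centroids, 2)) / k
    # beta: average over the k clusters of the mean pairwise cosine inside each
    within = []
    for members in clusters:
        pairs = list(combinations(members, 2))
        within.append(sum(cos_sim(u, v) for u, v in pairs) / len(pairs))
    beta = sum(within) / k
    return alpha / beta

# Two orthogonal, internally tight clusters: alpha = 0, beta = 1, so f = 0.
f_val = fitness([(1.0, 0.0), (0.0, 1.0)],
                [[(1.0, 0.0), (2.0, 0.0)], [(0.0, 1.0), (0.0, 3.0)]])
```

Low f(k) here reflects centroids that are far apart (small α) with compact clusters (large β), matching the intended shape of the fitness.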
Thus, when f(k) = max(f(k)), K = k and f(K) is the extremum of the clustering-effect fitness function, i.e., the optimal clustering-effect fitness, and the number of K-means cluster centroids S is finally fixed at K. In determining max(f(k)), K, and f(K), each evaluation of f(k) requires first running a K-means clustering with k centroids; to reduce the number of clustering runs and fitness evaluations, the invention proposes the optimal clustering-effect fitness partition algorithm, as follows:
Optimal clustering-effect fitness partition algorithm
Analysis of the optimal clustering-effect fitness partition algorithm: by its recursive structure, its time complexity is O(log₂[(N × C - N)/4]), so the number of K-means clusterings actually performed in this step, and of f(k) evaluations, is at most log₂[(N × C - N)/4]. Without the partition algorithm, k = {N, N+1, ..., N × C}, and determining max(f(k)), K, and f(K) requires on average (N × C - N)/2 K-means clusterings and f(k) evaluations. The partition algorithm of this step therefore reduces the number of clustering runs and fitness evaluations.
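One way the stated unimodality of f(k) (increasing up to K, then decreasing) can be exploited to find K in logarithmically many fitness evaluations is a ternary search, sketched below. This is an illustration under that unimodality assumption only; the patent's exact recursive partition algorithm is not reproduced here, and the toy fitness function is invented.

```python
# Hedged sketch: locating the argmax of a unimodal f(k) on [lo, hi] with far
# fewer evaluations than the (hi - lo)/2 average of an exhaustive scan.
def argmax_unimodal(f, lo, hi):
    evaluations = 0
    while hi - lo > 2:
        m1 = lo + (hi - lo) // 3
        m2 = hi - (hi - lo) // 3
        evaluations += 2
        if f(m1) < f(m2):
            lo = m1 + 1      # peak lies strictly right of m1
        else:
            hi = m2 - 1      # peak lies strictly left of m2
    best = max(range(lo, hi + 1), key=f)
    return best, evaluations

# Toy unimodal fitness peaking at k = 37 over [10, 100]:
K, n_evals = argmax_unimodal(lambda k: -(k - 37) ** 2, 10, 100)
```

Each f(k) probe here stands in for one K-means run plus a fitness computation, which is exactly the cost the partition algorithm is designed to bound.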
Finally, according to the final clustering result, each abstract word is replaced by the centroid S of the cluster containing its word vector. Specifically, when f(k) = max(f(k)) and the cluster count under the optimal clustering-effect fitness is K = k, each abstract word w_{i,t} is replaced by the centroid S of the cluster containing its word vector, i.e., the feature of the abstract word is approximated by the centroid of its cluster. In any word-vector subspace, the centroid S represents the abstract words of its cluster; the correspondence is shown in Fig. 5. The replacement relation is

W_b = { word_{i,t} | w_{i,t} belongs to the cluster with centroid S_b },
where the abstract words word_{i,t} represented by the b-th cluster centroid S_b form an abstract-word set, w_{i,t} is the word vector of abstract word word_{i,t}, and W_b is the set of abstract words whose word vectors belong to the cluster with centroid S_b.
Step 4: output the abstract-sample feature representation with the ST-IDF module. First, count the frequency of each abstract word in a sample; by the replacement relation between centroids S and abstract words given in step 3, the occurrence frequency in the sample of the abstract words represented by the b-th centroid S_b is counted as the frequency of S_b, and the inverse document frequency of each cluster centroid S_b is computed, b = {1, 2, ..., K}. Then, by analogy with the TF-IDF model, the word-vector cluster centroid frequency model ST-IDF is built; its construction is detailed below.
In the TF-IDF model, the feature representation of sample doc_i is realized by the feature vector d_i,

d_i = (d_i(1), d_i(2), ..., d_i(n)),

where the t-th element d_i(t) of vector d_i is computed as

d_i(t) = TF(word_t, doc_i) · IDF(word_t).

TF(word_t, doc_i) is the frequency of word word_t in sample doc_i, computed as

TF(word_t, doc_i) = count(word_t) / Σ_{j=1}^{n} count(word_j),

where the numerator is the number of occurrences of the word in the sample and the denominator is the total number of occurrences of all words in the sample.

IDF(word_t) is the inverse document frequency of word word_t, computed as

IDF(word_t) = |D| / |{doc_i | word_t ∈ doc_i}|,

where D is the data set containing sample doc_i, |D| is the total number of samples in data set D, and |{doc_i | word_t ∈ doc_i}| is the number of samples containing word word_t.
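The TF and IDF formulas above can be sketched in a few lines. Note this follows the formulas exactly as the description writes them, so the IDF variant omits the logarithm found in the usual TF-IDF; the toy documents are invented for illustration.

```python
# Minimal sketch of TF and IDF exactly per the formulas above (no log in IDF).
def tf(word, doc):
    # occurrences of `word` divided by total occurrences of all words in `doc`
    return doc.count(word) / len(doc)

def idf(word, docs):
    # |D| divided by the number of samples containing `word`
    return len(docs) / sum(1 for d in docs if word in d)

docs = [["a", "b", "a"], ["b", "c"], ["a", "c", "c"]]
weight = tf("a", docs[0]) * idf("a", docs)   # TF = 2/3, IDF = 3/2
```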
By analogy with the TF-IDF model, the ST-IDF model is built as follows:
SF(S_b, doc_i) is the frequency of word-vector cluster centroid S_b in abstract sample doc_i, computed as

SF(S_b, doc_i) = Σ_{w_{i,t} ∈ W_b} TF(w_{i,t}),

where W_b is the set of abstract words whose word vectors belong to the cluster with centroid S_b, TF(w_{i,t}) is the frequency with which abstract word w_{i,t} occurs in abstract sample doc_i, and SF(S_b, doc_i) counts only the abstract words in doc_i that centroid S_b represents.
IDF(S_b) is the inverse document frequency of word-vector cluster centroid S_b, computed as

IDF(S_b) = |D| / |{doc_i | w_{i,t} ∈ doc_i, w_{i,t} ∈ W_b}|,

where D is the data set formed by the abstract samples doc_i, |D| is the total number of samples in data set D, and the denominator is the number of samples containing an abstract word represented by centroid S_b.
In the ST-IDF model, the feature representation of abstract sample doc_i is realized by the feature vector ḋ_i,

ḋ_i = (ḋ_i(1), ḋ_i(2), ..., ḋ_i(K)),

where the b-th element ḋ_i(b) of vector ḋ_i is computed as

ḋ_i(b) = SF(S_b, doc_i) · IDF(S_b).
The ST-IDF model proposed in this step is in VSM (vector space model) form and serves as the feature representation of an abstract sample.
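A compact sketch of an ST-IDF feature vector is given below. It is a hypothetical illustration: the word-to-centroid mapping stands in for the K-means result of step 3, the abstract-word strings are invented, and the SF and IDF terms follow the formulas above.

```python
# Hypothetical ST-IDF sketch: abstract words are mapped to their cluster
# centroid index b; SF sums the in-sample frequency of the words a centroid
# represents; IDF counts samples containing any word the centroid represents.
def st_idf_vector(doc, word2centroid, docs, K):
    vec = []
    for b in range(K):
        # SF(S_b, doc): summed frequency in `doc` of centroid b's words
        sf = sum(1 for w in doc if word2centroid[w] == b) / len(doc)
        # document frequency of centroid b over the data set
        df = sum(1 for d in docs if any(word2centroid[w] == b for w in d))
        vec.append(sf * len(docs) / df if df else 0.0)
    return vec

# Stand-in clustering result (invented): three centroids over four words.
word2centroid = {"A1F": 0, "A2F": 0, "B7": 1, "C9": 2}
docs = [["A1F", "A2F", "B7"], ["B7", "C9"], ["C9", "C9", "A1F"]]
v = st_idf_vector(docs[0], word2centroid, docs, K=3)   # K-dimensional vector
```

The result is a fixed K-dimensional VSM vector per sample, which is what makes the representation compatible with feature-vector retrieval algorithms.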
Step 5: similarity computation for abstract-sample similarity analysis. Using the feature representation of step 4, compute the similarity between two abstract samples; sample classification algorithms of the information retrieval field then operate on it.
The context-based abstract-sample feature representation method uses the ST-IDF model of step 4 for abstract-sample feature representation. The similarity of any two abstract samples doc_i and doc_i′ is given by the similarity function Sim(doc_i, doc_i′), computed as

Sim(doc_i, doc_i′) = cos(ḋ_i, ḋ_i′),

the cosine of the angle between the feature vectors ḋ_i and ḋ_i′ in ST-IDF vector space.
Step 6: judge the class of the feature-represented abstract sample with the classification module, applying the NWKNN algorithm to the similarities.
Based on the similarity function Sim(doc_i, doc_i′), abstract-sample classification is performed with NWKNN, a classic sample classification algorithm of the information retrieval field. NWKNN is a neighbour-weighted KNN algorithm for classifying samples from imbalanced class sets; its formula is

score(doc, c_i) = Weight_i · ( Σ_{doc_j ∈ KNN(doc)} Sim(doc, doc_j) δ(doc_j, c_i) ),

where score(doc, c_i) is the score for assigning document doc to class c_i; Sim(doc, doc_j) is the similarity between sample doc and known-class sample doc_j, computed as the vector cosine distance; Weight_i is the class weight, set to 3.5; and δ(doc_j, c_i) indicates whether sample doc_j belongs to class c_i: its value is 1 if doc_j belongs to class c_i and 0 otherwise.
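The NWKNN scoring formula above can be sketched as follows. The neighbour similarities and class labels are invented toy values, and the class weight is taken as the fixed 3.5 stated in the description (a full NWKNN would typically weight classes individually).

```python
# Illustrative NWKNN scoring: the score of class c_i is the class weight times
# the summed similarity of the K nearest neighbours that belong to c_i.
def nwknn_score(neighbours, target_class, weight=3.5):
    """neighbours: (similarity, class_label) pairs for the K nearest samples."""
    return weight * sum(sim for sim, c in neighbours if c == target_class)

# Toy 3-nearest-neighbour list (invented similarities and labels):
neighbours = [(0.9, "link16"), (0.8, "xml"), (0.7, "link16")]
predicted = max({c for _, c in neighbours},
                key=lambda c: nwknn_score(neighbours, c))
```

The sample is assigned to the class with the highest score, here the label shared by the two most similar neighbours.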
Classification performance is evaluated with the F1-measure, which combines recall (Recall) and precision (Precision) into the assessment metric

F1 = 2 × Recall × Precision / (Recall + Precision).
With the F1-measure, the classification effect of the system on a data set can be observed. For ease of comparison, the macro F1 metric Macro-F1 of the abstract-sample classification results is summarized; the average precision of the classification results can be obtained at the same time.
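The F1 formula and its macro average over classes (Macro-F1) amount to the short computation below; the per-class recall and precision values are invented for illustration.

```python
# Sketch of the F1 measure and the Macro-F1 average used for evaluation.
def f1(recall, precision):
    return 2 * recall * precision / (recall + precision)

def macro_f1(per_class):
    """per_class: (recall, precision) pairs, one per class."""
    return sum(f1(r, p) for r, p in per_class) / len(per_class)

m = macro_f1([(0.5, 0.5), (1.0, 0.5)])   # (0.5 + 2/3) / 2
```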
With the Wikipedia XML data set of wikipedia XML data (semi-structured text for data exchange) and the Reuters document set Reuter-21578 (plain text) as data sets, abstract-sample classification experiments were run with the NWKNN algorithm and evaluated with the F1-measure. The classification results of the proposed ST-IDF vectors versus the prior-art TF-IDF vectors are shown in Tables 1 and 2:
Table 1: classification results of TF-IDF vectors versus ST-IDF vectors on the Wikipedia XML data set
Table 2: classification results of TF-IDF vectors versus ST-IDF vectors on the Reuter-21578 data set
As Tables 1 and 2 show, the classification effect of the proposed ST-IDF vectors is clearly better than that of the prior-art TF-IDF vectors: on the Wikipedia XML data set the average accuracy rises from the original 48.7% to 59.2%, and on the Reuter-21578 data set from the original 57.1% to 63.3%. The experimental results show that, for information retrieval tasks classifying by abstract-sample similarity, the proposed ST-IDF model achieves better F1-measure results than the TF-IDF model, demonstrating that the feature representation method provided by the invention has the advantage of extracting the word-sense features of abstract samples.

Claims (1)

1. A context-based abstract-sample information retrieval system, characterized in that it comprises a word segmentation module, a word-sense feature extraction module, an abstract-word replacement representation module, an ST-IDF module, and a classification module, the abstract-sample feature representation method of said abstract-sample information retrieval system comprising the following steps:
Step 1: segment the sample into abstract words with the segmentation module: when the sample is a data-link message, the abstract words can be separated according to the message format and word length; when the sample is text, according to spaces and specific segmentation rules;
Step 2: extract the semantic features of the abstract words with the word-sense feature extraction module: for the abstract words obtained in step 1, the Word2vector method extracts word-sense features from the context of each abstract word and expresses them as word vectors;
Step 3: replace the abstract-word features with the replacement representation module: first, using the cluster count under the optimal clustering-effect fitness, the word vectors obtained in step 2 are clustered with the K-means algorithm, i.e., an "optimal-fitness partition" of the abstract-word vectors is realized, where the centroid of a word-vector cluster is denoted S (a vector in word-vector space), the number k of centroids S is the number of clusters, N is the number of abstract words in all samples, C is the number of known sample classes, and f(k) is the clustering-effect fitness function,

f(k) = α / β,  N ≤ k ≤ N × C,

where α is the mean cosine distance between the k centroid vectors S, and β is the average, over the k clusters, of the mean cosine distance between the word vectors within a cluster, k being a positive integer with k ∈ [N, N × C]; when f(k) = max(f(k)), the cluster count under the optimal clustering-effect fitness is K = k, and the number of centroids S is finally fixed at K; then, according to the final clustering result, each abstract word is replaced by the centroid S of the cluster containing its word vector, i.e., the centroid S represents the abstract words in its cluster and the feature of an abstract word is approximated by the centroid of its cluster;
Step 4: output the abstract-sample feature representation with the ST-IDF module: first, count the frequency of each abstract word in a sample; by the replacement relation of step 3, the occurrence frequency in the sample of the abstract words represented by a centroid S is counted as the frequency of S, and the inverse document frequency of each cluster centroid is computed; then, by analogy with the TF-IDF model, the word-vector cluster centroid frequency model ST-IDF is built; the ST-IDF model is in VSM form and serves as the feature representation of an abstract sample;
Step 5, similarity computation, realizing similarity analysis of abstract samples: using the characteristic representation produced by step 4, compute the similarity between two abstract samples, and on that basis execute sample classification algorithms from the information retrieval field;
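The patent does not name the similarity measure in step 5; cosine similarity is the conventional choice for VSM vectors such as the ST-IDF representation, so it is shown here as an assumption.

```python
import math

def cosine_similarity(u, v):
    # standard cosine similarity between two VSM vectors
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0
```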
Step 6, use the classification module to judge the class of each characterized abstract sample: according to the similarities, apply the NWKNN (neighbor-weighted K-nearest-neighbor) algorithm to assign a class to the abstract sample.
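NWKNN extends KNN by weighting each class's votes inversely to its size, so large classes do not swamp small ones. The sketch below follows the usual formulation (Tan's neighbor-weighted KNN); the function names, the cosine similarity as the distance basis, and the exponent default are my assumptions, not details taken from the patent.

```python
import math
from collections import Counter

def _cos(u, v):
    # cosine similarity, as assumed for step 5
    dot = sum(a * b for a, b in zip(u, v))
    n = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / n if n else 0.0

def nwknn(query, train_vecs, train_labels, k=3, exponent=2.0):
    # k nearest neighbours of the query under cosine similarity
    neighbours = sorted(zip((_cos(query, v) for v in train_vecs), train_labels),
                        reverse=True)[:k]
    # class weighting: smaller classes get larger weights, countering imbalance
    sizes = Counter(train_labels)
    min_size = min(sizes.values())
    weights = {c: (min_size / n) ** (1.0 / exponent) for c, n in sizes.items()}
    # weighted similarity vote; the highest-scoring class wins
    scores = {}
    for sim, label in neighbours:
        scores[label] = scores.get(label, 0.0) + weights[label] * sim
    return max(scores, key=scores.get)
```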
CN201610369833.4A 2016-01-31 2016-05-29 A kind of abstract sample information searching system based on context Active CN106095791B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610068972 2016-01-31
CN2016100689723 2016-01-31

Publications (2)

Publication Number Publication Date
CN106095791A true CN106095791A (en) 2016-11-09
CN106095791B CN106095791B (en) 2019-08-09

Family

ID=57230265

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610369833.4A Active CN106095791B (en) 2016-01-31 2016-05-29 A kind of abstract sample information searching system based on context

Country Status (1)

Country Link
CN (1) CN106095791B (en)


Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101339551A (en) * 2007-07-05 2009-01-07 NEC (China) Co., Ltd. Natural language query expansion device and method
CN101847405A (en) * 2009-03-23 2010-09-29 Sony Corporation Speech recognition device and method, language model generation device and method, and program
US20110087468A1 (en) * 2009-10-12 2011-04-14 Lewis James M Approximating a System Using an Abstract Geometrical Space
CN104598586A (en) * 2015-01-18 2015-05-06 Beijing University of Technology Large-scale text classification method


Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106502994A (en) * 2016-11-29 2017-03-15 Shanghai Zhizhen Intelligent Network Technology Co., Ltd. Method and apparatus for text keyword extraction
CN106502994B (en) * 2016-11-29 2019-12-13 Shanghai Zhizhen Intelligent Network Technology Co., Ltd. Method and device for extracting keywords of text
CN106874367A (en) * 2016-12-30 2017-06-20 Jiangsu Haobai Information Service Co., Ltd. Sampling-based distributed clustering method for a public opinion platform
CN110363206A (en) * 2018-03-26 2019-10-22 Alibaba Group Holding Limited Clustering, data processing and data identification methods for data objects
CN110363206B (en) * 2018-03-26 2023-06-27 Alibaba Group Holding Limited Clustering of data objects, data processing and data identification method
CN111241269A (en) * 2018-11-09 2020-06-05 China Mobile (Hangzhou) Information Technology Co., Ltd. Short message text classification method and device, electronic equipment and storage medium
CN111241269B (en) * 2018-11-09 2024-02-23 China Mobile (Hangzhou) Information Technology Co., Ltd. Short message text classification method and device, electronic equipment and storage medium
CN110110143A (en) * 2019-04-15 2019-08-09 Xiamen Wangsu Co., Ltd. Video classification method and device
CN110457470A (en) * 2019-07-05 2019-11-15 Shenzhen OneConnect Smart Technology Co., Ltd. Text classification model learning method and device
CN113127636A (en) * 2019-12-31 2021-07-16 Beijing Gridsum Technology Co., Ltd. Method and device for selecting the center point of a text cluster
CN113127636B (en) * 2019-12-31 2024-02-13 Beijing Gridsum Technology Co., Ltd. Text clustering cluster center point selection method and device

Also Published As

Publication number Publication date
CN106095791B (en) 2019-08-09

Similar Documents

Publication Publication Date Title
CN106095791A (en) 2016-11-09 A kind of abstract sample information searching system based on context and abstract sample characteristics method for expressing thereof
CN113792818B (en) 2022-04-08 Intention classification method and device, electronic equipment and computer readable storage medium
CN107861939A (en) 2018-03-30 Domain entity disambiguation method fusing word vectors and a topic model
CN100583101C (en) 2010-01-20 Text categorization feature selection and weight computation method based on domain knowledge
CN109885824B (en) 2021-03-26 Hierarchical Chinese named entity recognition method, hierarchical Chinese named entity recognition device and readable storage medium
CN104750844B (en) 2018-03-02 Method and apparatus for generating TF-IGM-based text feature vectors, and text classification method and device
CN109948149B (en) 2020-07-28 Text classification method and device
CN107818164A (en) 2018-03-20 Intelligent question-answering method and system
CN108287858A (en) 2018-07-17 Method and device for semantic extraction from natural language
CN106980609A (en) 2017-07-25 Named entity recognition method using conditional random fields based on word vector representation
CN105808524A (en) 2016-07-27 Automatic patent classification method based on patent document abstracts
CN104573046A (en) 2015-04-29 Comment analysis method and system based on word vectors
CN107992542A (en) 2018-05-04 Topic-model-based similar article recommendation method
CN106372061A (en) 2017-02-01 Short text similarity calculation method based on semantics
CN109508379A (en) 2019-03-22 Short text clustering method based on weighted word vector representation combined with similarity
CN103617157A (en) 2014-03-05 Text similarity calculation method based on semantics
CN110321925A (en) 2019-10-11 Multi-granularity text similarity comparison method based on semantic fusion fingerprints
CN101231634A (en) 2008-07-30 Automatic summarization method for multiple documents
CN114169442B (en) 2022-12-02 Remote sensing image small sample scene classification method based on double prototype network
CN105760888A (en) 2016-07-13 Neighborhood rough set ensemble learning method based on attribute clustering
CN103473217B (en) 2017-12-29 Method and apparatus for extracting keywords from text
CN103020167B (en) 2016-05-11 Computer-based Chinese text classification method
CN105205124A (en) 2015-12-30 Semi-supervised text sentiment classification method based on random feature subspaces
CN111597328B (en) 2022-10-25 New event topic extraction method
CN104484380A (en) 2015-04-01 Personalized search method and personalized search device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right
TA01 Transfer of patent application right

Effective date of registration: 20190628

Address after: 100095 Beijing Haidian District Gaolizhang Road No. 1 Courtyard 2 Floor 201-004

Applicant after: Changyuan power (Beijing) Technology Co., Ltd.

Address before: 250300 Shandong Province Changqing District Guyunhu Street Office Danfeng District South District 1 Building

Applicant before: Changyuan power (Shandong) Technology Co. Ltd.

GR01 Patent grant
GR01 Patent grant