CN103793491B - Chinese news story segmentation method based on flexible semantic similarity measurement - Google Patents
- Publication number
- CN103793491B CN201410027012.3A
- Authority
- CN
- China
- Prior art keywords
- word
- sim
- semantic similarity
- semantic
- flexible
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/131—Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/253—Grammatical analysis; Style critique
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a Chinese news story segmentation method based on flexible semantic similarity measurement. The method includes the following steps: a target text set is input and word segmentation is performed on each news story transcript Ti in the text set; a context relation graph is built; the contextual relatedness between words is iteratively propagated over the context relation graph using the SimRank algorithm to obtain a flexible semantic relevance matrix; the flexible semantic similarity between sentences is defined through the flexible semantic relevance matrix; and the Chinese news stories are segmented using the flexible semantic similarity. The flexible measurement method can more reasonably express the semantic similarity between words and between word sets. Experiments show that, in Chinese news story segmentation and under the same segmentation criteria, the flexible semantic similarity measurement method improves segmentation accuracy by 3% to 10% compared with traditional similarity measurement methods.
Description
Technical field
The present invention relates to the field of Chinese news story segmentation, and in particular to a Chinese news story segmentation method based on flexible semantic similarity measurement.
Background art
With the popularization and development of the Internet, multimedia content such as broadcast news, meeting minutes and online open courses is growing rapidly, and an effective method for automatically organizing such multimedia data is urgently needed to support topic-based text retrieval and analysis. A multimedia document, for example an hour of broadcast news, generally consists of multiple stories. For efficient semantic retrieval, it is important to guide users to the beginning and end of the topics they are interested in; at the same time, segmenting a multimedia document is an important prerequisite for higher-level semantic browsing such as topic tracking [1], classification and summarization [2]. The purpose of news story segmentation is to divide a news story transcript into topically coherent stories. Technically, the effectiveness of news story segmentation depends on two factors: first, the similarity between words and between sentence sets; second, the criterion used to segment the news story transcript.
Much previous work has focused on designing reasonable segmentation criteria, for example TextTiling [3][4], the minimum normalized cut criterion (minimum NCuts) [5][6] and the maximum lexical cohesion criterion [7]. In contrast to the widely studied segmentation criteria, most existing work still uses a simple, repetition-based rigid similarity measure: the similarity between identical words is 1, and the similarity between different words is 0. Such a rigid measure clearly ignores the latent semantic relatedness between different words, so that the semantic measurement is inaccurate and the resulting Chinese news story segmentation is inaccurate. A more reasonable semantic similarity measure is therefore needed to improve the efficiency and precision of segmentation.
Summary of the invention
The present invention provides a Chinese news story segmentation method based on flexible semantic similarity measurement. The invention can reasonably represent the semantic similarity between words and can significantly improve the precision of Chinese news story segmentation, as described below.
A Chinese news story segmentation method based on flexible semantic similarity measurement, the method comprising the following steps:
(1) inputting a target corpus and performing word segmentation on each news story transcript t_i in the corpus;
(2) building a context relation graph;
(3) iteratively propagating the contextual relatedness between words over the context relation graph using the SimRank algorithm to obtain a flexible semantic relevance matrix;
(4) defining the flexible semantic similarity between sentences by means of the flexible semantic relevance matrix;
(5) segmenting the Chinese news stories using the flexible semantic similarity.
The step of building the context relation graph is specifically as follows:
1) reading in each news story transcript in turn and computing word frequency statistics over the words it contains;
2) deleting overly frequent words and low-frequency words according to predefined word frequency thresholds;
3) taking the retained words as the nodes of the context relation graph, their set being V;
4) judging whether any two words in the set appear simultaneously in some news story transcript with the distance between the two words no greater than a distance threshold; if so, creating an edge between the two words, the set of edges being E; if not, judging the next pair of words, until all words in the set have been traversed;
5) representing the edge weights s_c by the weight sim_c(a, b) between words and the weight sim_c(a, a) of a word with itself;
6) expressing the context relation graph as G = <V, E, s_c>.
The weight sim_c(a, b) between words is computed from the co-occurrence statistics of the corpus, where freq(a, b) denotes the number of times that words a and b co-occur, freq_max = max_(i,j){freq(i, j)} denotes the maximum co-occurrence frequency over all word pairs (i, j), and ε is a constant that guarantees 0 ≤ sim_c(a, b) ≤ 1.
The weight of a word with itself is sim_c(a, a) = 1.
The step of iteratively propagating the contextual relatedness between words over the context relation graph using the SimRank algorithm to obtain the flexible semantic relevance matrix is specifically as follows:
1) defining the semantic similarity between words in the context relation graph as sim_s(a, b), which satisfies the following three criteria: the similarity of a word with itself is 1, i.e. sim_s(a, a) = 1; sim_s(a, b) is positively correlated with sim_c(a, b); and sim_s(a, b) is proportional to the similarity between the neighbours of a and b;
2) defining the iterative propagation of the semantic similarity, where u ~ a and v ~ b indicate that u and v are neighbour nodes of word a and word b in the context relation graph, respectively, Z is a normalization factor, c is a control factor, sim_s^(t)(a, b) denotes the semantic similarity of words a and b at the t-th iteration, sim_s^(t-1)(a, b) denotes the semantic similarity at the (t-1)-th iteration, and sim_s^(0)(a, b) denotes the initialization;
3) solving the relation defined in 2) with the SimRank algorithm to obtain the semantic relatedness; the semantic relatedness is computed for every pair of words, and the resulting values form the flexible semantic relevance matrix, denoted S_s.
The step of defining the flexible semantic similarity between sentences by means of the flexible semantic relevance matrix is specifically as follows, where s_i and s_j denote sentences, ||f_i|| and ||f_j|| denote the 2-norms of the word-frequency vectors of the two sentences, and T denotes transposition.
The beneficial effect of the technical solution provided by the present invention is as follows: the present invention proposes an unsupervised semantic similarity measure based on the SimRank algorithm, so that the latent semantic relations between words can be incorporated into the traditional cosine similarity, and uses this flexible semantic similarity to improve Chinese news story segmentation. The flexible measure proposed by the present invention represents the semantic similarity between words and between word sets more reasonably. Experimental results show that, in Chinese news story segmentation and under the same segmentation criterion, the flexible semantic similarity measure improves segmentation precision by 3% to 10% compared with traditional similarity measures.
Brief description of the drawings
Fig. 1 is the flow chart of the Chinese news story segmentation technique based on flexible semantic similarity;
Fig. 2 is a schematic diagram of the context relation graph;
Fig. 3 compares the ratio of inter-story to intra-story sentence similarity on the standard data sets CCTV and TDT2;
Fig. 4 compares the results of three different similarity measures used by the Chinese news story segmentation algorithm over 100 groups of random parameters on the standard data set cctv_75_s.
Specific embodiment
To make the objectives, technical solutions and advantages of the present invention clearer, the embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Measuring semantic similarity is a challenging research topic in natural language processing. Existing methods fall into two classes: supervised and unsupervised. Supervised methods mainly include WordNet [8][9] and DISCO. WordNet is used to measure the similarity between any two English words. It relies on an annotated corpus and organizes nouns, verbs, adjectives and adverbs into a hierarchy, the division being based on the semantic definitions given to these words by language experts. Owing to its simplicity and effectiveness, WordNet has been widely applied in natural language processing tasks. Similar to WordNet, DISCO is another commonly used supervised method for retrieving the similarity between any two given words. Compared with WordNet, DISCO supports a richer set of languages, for example English, German, French and Spanish. Supervised methods can be used directly on a pre-defined language space without any extra computation, and they cover almost all common words. However, supervised methods depend on the knowledge of linguists, the similarity between words is largely defined by subjective judgement, and such methods are not suitable for applications built on a specific corpus. Unsupervised methods mainly include PMI, LSA and PLSA. PMI uses statistics obtained by querying a web search engine: it counts how many times two words appear in the same web page, and the more often they co-occur, the higher their PMI score. LSA is another unsupervised semantic similarity measure; it incorporates a mechanism of human knowledge learning to obtain the similarity between words or text fragments. The key step of LSA is dimensionality reduction by singular value decomposition, and LSA can also handle synonymy in natural language processing. PLSA is an improvement of LSA. Unlike the LSA algorithm, which originates from linear algebra, PLSA inherits the advantages of LSA while analysing the correlation between word pairs with probabilistic methods, and it handles synonymy and ambiguity well. Compared with LSA, PLSA is more general.
In recent years, the development of graph theory has attracted the attention of natural language processing researchers. Widdows et al. proposed an unsupervised graph-based method for obtaining semantic similarity, in which nodes represent words and edges represent the relations between words; the graph model is built on a specific corpus and can handle word ambiguity. Ambwani et al. proposed another graph model for measuring word semantic similarity, in which each word is represented as a series of nodes, each node corresponds to one of the sentences in which the word occurs, and edge weights represent the relatedness between words. This model incorporates the mutual influence between words and determines word relatedness according to context. The unsupervised methods for computing semantic similarity discussed above are all based on a specific corpus and are better suited to specific applications than supervised methods. Among these unsupervised methods, the simplicity and efficiency of graph models have drawn increasing attention from natural language processing researchers to graph-based semantic similarity computation.
Measuring the semantic similarity between word sets (such as paragraphs and texts) is also an urgent problem. The conventional measure of word-set semantic similarity is cosine similarity. Under the bag-of-words assumption, each word set is represented as a word-frequency vector, and cosine similarity measures the angle between word-frequency vectors: the larger the angle, the smaller the similarity, and vice versa. Because cosine similarity is simple and effective, it is widely used to measure the semantic similarity of word sets. However, cosine similarity only considers the relations between identical words and ignores the relatedness between different words within the word sets, which makes the similarity measurement between word sets inaccurate. To make the measurement of word-set semantic similarity more accurate and meaningful, the relatedness between words should be taken into account when measuring the similarity between word sets. Therefore, a method that incorporates the relatedness between words into the similarity measurement between word sets is urgently needed.
In order to reasonably represent the semantic similarity between words and to significantly improve the precision of Chinese news story segmentation, the embodiments of the present invention provide a Chinese news story segmentation method based on flexible semantic similarity measurement; see Fig. 1. Both the semantic similarity computation and the news story segmentation in this method are carried out on a specific data set. Meanwhile, in order to demonstrate the reasonableness of the flexible semantic similarity measure, a verification criterion is designed to verify this reasonableness, as described below:
101: inputting a target corpus and performing word segmentation on each news story transcript t_i in the corpus;
This step splits every sentence of a news story transcript into words. The step is well known to those skilled in the art and is not described further in the embodiments of the present invention.
102: building the context relation graph;
1) reading in the news story transcripts in turn and computing word frequency statistics over the words they contain;
2) deleting overly frequent words and low-frequency words according to predefined word frequency thresholds;
3) taking the retained words as the nodes of the context relation graph, their set being V;
4) judging whether any two words in the set appear simultaneously in some news story transcript with the distance between the two words no greater than a distance threshold; if so, creating an edge between the two words, the set of edges being E; if not, judging the next pair of words, until all words in the set have been traversed;
5) representing the edge weights s_c by the weight sim_c(a, b) between words and the weight sim_c(a, a) of a word with itself;
The weight sim_c(a, b) between words is computed from the co-occurrence statistics, where freq(a, b) denotes the number of times that words a and b co-occur, freq_max = max_(i,j){freq(i, j)} denotes the maximum co-occurrence frequency over all word pairs (i, j), and ε is a constant that guarantees 0 ≤ sim_c(a, b) ≤ 1 (a ≠ b). Meanwhile, the weight of a word with itself is sim_c(a, a) = 1.
6) The context relation graph can therefore be expressed as G = <V, E, s_c>. Fig. 2 is a schematic diagram of the context relation graph of one document in the data set, where w_u denotes the u-th word in t_i and the lines denote the relations between words.
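As a minimal illustrative sketch of step 102 (not the patent's reference implementation), the following Python function builds the context relation graph from word-segmented transcripts. The function and parameter names, the frequency and distance thresholds, and the concrete form of sim_c (taken here as freq(a, b)/(freq_max + ε)) are assumptions, since the corresponding formula is not reproduced above.

```python
from collections import Counter

def build_context_graph(scripts, min_freq=2, max_freq=100, dist_thresh=10, eps=1e-6):
    """Build the context relation graph G = <V, E, sc> from a list of
    word-segmented news story transcripts (each a list of words).
    Threshold values are illustrative placeholders, not values from the patent."""
    # 1) word frequency statistics over all transcripts
    freq = Counter(w for script in scripts for w in script)
    # 2) drop overly frequent and low-frequency words
    vocab = {w for w, f in freq.items() if min_freq <= f <= max_freq}
    # 3) nodes V are the retained words
    V = sorted(vocab)
    # 4) count co-occurrences of word pairs within the distance threshold
    co_occur = Counter()
    for script in scripts:
        for i, a in enumerate(script):
            if a not in vocab:
                continue
            for j in range(i + 1, min(i + dist_thresh + 1, len(script))):
                b = script[j]
                if b in vocab and b != a:
                    co_occur[frozenset((a, b))] += 1
    # 5) edge weights sim_c(a, b); the self-weight sim_c(a, a) is fixed to 1
    freq_max = max(co_occur.values()) if co_occur else 1
    sc = {}
    for pair, f in co_occur.items():
        a, b = tuple(pair)
        sc[(a, b)] = sc[(b, a)] = f / (freq_max + eps)   # assumed form of sim_c
    for w in V:
        sc[(w, w)] = 1.0
    # 6) edges E are the co-occurring word pairs
    E = {tuple(pair) for pair in co_occur}
    return V, E, sc
```

With V, E and sc in hand, the neighbour lists needed by the iterative propagation of step 103 follow directly from E.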
103: iteratively propagating the contextual relatedness between words over the context relation graph using the SimRank algorithm [10][11] to obtain the flexible semantic relevance matrix;
1) defining the semantic similarity between words in the context relation graph as sim_s(a, b), which satisfies the following three criteria:
the similarity of a word with itself is 1, i.e. sim_s(a, a) = 1;
sim_s(a, b) is positively correlated with sim_c(a, b);
sim_s(a, b) is proportional to the similarity between the neighbours of a and b.
2) the iterative propagation of the semantic similarity is defined by a relation in which u ~ a and v ~ b indicate that u and v are neighbour nodes of word a and word b in the context relation graph, respectively, Z is a normalization factor, c is a control factor, sim_s^(t)(a, b) denotes the semantic similarity of words a and b at the t-th iteration, sim_s^(t-1)(a, b) denotes the semantic similarity at the (t-1)-th iteration, and sim_s^(0)(a, b) denotes the initialization.
3) solving the relation defined in 2) with the SimRank algorithm to obtain the semantic relatedness; the semantic relatedness is computed for every pair of words, and the resulting values form the flexible semantic relevance matrix, denoted S_s. Similarly, the traditional rigid semantic similarity is defined as S_h = I, where I denotes the identity matrix.
The SimRank algorithm is based on the following hypothesis: if the neighbour words of two words are similar (relatedness greater than or equal to 0.5), then the two words themselves are also similar.
The SimRank algorithm takes the context relation graph as input; its complexity is O(k|V|^2), where k is the average degree in G, |V| is the number of nodes in the context relation graph, and O denotes the algorithmic complexity.
In this invention, the SimRank algorithm is implemented as a fully parallel GPU-based algorithm. Experiments show that, with the same context relation graph as input, the GPU-based implementation runs about 1000 times faster than a traditional CPU-based implementation.
The output of the SimRank algorithm is the flexible semantic relevance matrix after iterative propagation, S_s = {sim_s(a, b)}_{a,b∈C};
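The exact update formula of step 103 is not reproduced above, so the sketch below implements a plain SimRank-style iteration [10] initialized with the contextual weights sim_c. The normalization Z = |neighbours(a)|·|neighbours(b)|, the choice c = 0.8 and the fixed number of iterations are assumptions; the fixed diagonal sim_s(a, a) = 1 follows the first criterion above.

```python
import numpy as np

def flexible_relevance_matrix(sim_c, neighbours, c=0.8, iterations=5):
    """Iteratively propagate contextual relatedness over the context relation
    graph to obtain the flexible semantic relevance matrix S_s.

    sim_c      : n x n matrix of contextual weights sim_c(a, b), 1.0 on the diagonal
    neighbours : list of neighbour index lists, neighbours[a] = neighbours of word a
    c          : control (decay) factor
    """
    n = sim_c.shape[0]
    sim_s = sim_c.copy()                      # initialization sim_s^(0)
    for _ in range(iterations):
        new = np.eye(n)                       # keep sim_s(a, a) = 1 fixed
        for a in range(n):
            for b in range(a + 1, n):
                na, nb = neighbours[a], neighbours[b]
                if not na or not nb:
                    continue
                z = len(na) * len(nb)         # normalization factor (assumed)
                s = sum(sim_s[u, v] for u in na for v in nb)
                new[a, b] = new[b, a] = c * s / z
        sim_s = new
    return sim_s                              # flexible semantic relevance matrix S_s
```

Because each word pair is updated independently within an iteration, the inner loops parallelize naturally, which is consistent with the GPU-based implementation described above.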
104: defining the flexible semantic similarity between sentences (a sentence being a set of consecutive words in a news story transcript) by means of the flexible semantic relevance matrix;
In news story segmentation, besides the semantic similarity between words, the similarity between word sets, i.e. sentences, also needs to be measured. Each sentence in a story can be represented as a word-frequency vector that records the number of times each word appears in the sentence. Given the flexible semantic relevance matrix, the flexible semantic similarity between sentences is defined accordingly, where s_i and s_j denote sentences, ||f_i|| and ||f_j|| denote the 2-norms of the two sentences' word-frequency vectors, and T denotes transposition. This definition is an improvement of the traditional cosine similarity: it takes the latent semantic relatedness between different words into account and can therefore represent the semantic similarity between sentences more reasonably.
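The sentence-similarity formula itself is omitted above; a plausible reading, consistent with the mention of transposition, the 2-norms ||f_i|| and ||f_j|| and the improvement over cosine similarity, is the soft-cosine form sim(s_i, s_j) = f_i^T S_s f_j / (||f_i||·||f_j||). The sketch below implements that assumed form; the function name is illustrative.

```python
import numpy as np

def flexible_sentence_similarity(f_i, f_j, S_s):
    """Flexible semantic similarity between two sentences represented as
    word-frequency vectors f_i, f_j over the graph vocabulary, using the
    flexible semantic relevance matrix S_s (assumed soft-cosine form)."""
    num = f_i @ S_s @ f_j                 # cross-word term weighted by S_s
    den = np.linalg.norm(f_i) * np.linalg.norm(f_j)
    return 0.0 if den == 0 else float(num / den)
```

When S_s is replaced by the identity matrix S_h, the expression reduces to the traditional cosine similarity, which matches the rigid-similarity baseline used in the experiments below.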
105: segmenting the Chinese news stories using the flexible semantic similarity.
1) The segmentation criterion used by the Chinese news story segmentation technique is the normalized cut criterion [5][6]. This criterion is based on a graph model: sentences are represented as the nodes of the graph, the relations between sentences are represented as the edges, the similarity between sentences gives the edge weights, and the news story segmentation problem is thereby converted into a graph partitioning problem.
2) The news story transcripts in the input data set are segmented into Chinese news stories using the flexible semantic similarity between sentences.
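The patent relies on the normalized cut criterion [5][6] but does not spell out the optimization procedure here. As one possible realization, the dynamic-programming sketch below chooses contiguous story boundaries that minimize a normalized-cut-style objective over the flexible sentence-similarity matrix, under the simplifying assumption that the number of stories K is known; the function and variable names are illustrative.

```python
import numpy as np

def normalized_cut_segmentation(sim, num_stories):
    """Split sentences 0..n-1 into `num_stories` contiguous blocks minimizing
    the sum of per-block normalized-cut scores. `sim` is the n x n flexible
    sentence-similarity matrix. Returns the start indices of stories 2..K."""
    n = sim.shape[0]
    total = sim.sum(axis=1)                       # similarity mass of each sentence

    def block_score(i, j):
        # cut(block, rest) / volume(block) for the block of sentences i..j
        vol = total[i:j + 1].sum()
        assoc = sim[i:j + 1, i:j + 1].sum()
        return 0.0 if vol == 0 else (vol - assoc) / vol

    INF = float("inf")
    best = np.full((num_stories + 1, n), INF)     # best[k, j]: cost of k blocks covering 0..j
    back = np.zeros((num_stories + 1, n), dtype=int)
    for j in range(n):
        best[1, j] = block_score(0, j)
    for k in range(2, num_stories + 1):
        for j in range(k - 1, n):
            for i in range(k - 2, j):
                cost = best[k - 1, i] + block_score(i + 1, j)
                if cost < best[k, j]:
                    best[k, j] = cost
                    back[k, j] = i + 1            # start index of block k
    boundaries = []
    j, k = n - 1, num_stories
    while k > 1:                                  # backtrack the block start indices
        i = back[k, j]
        boundaries.append(int(i))
        j, k = i - 1, k - 1
    return sorted(boundaries)
```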
The feasibility of the Chinese news story segmentation method based on flexible semantic similarity measurement provided by the present invention is verified below with specific experiments.
Experiments on standard data sets:
To verify the effectiveness of the method, it is tested on two standard data sets, CCTV and TDT2. The CCTV data set contains 71 Chinese news story transcripts in total; according to story length and recognition error rate, the CCTV data set can be divided into 8 subsets, denoted cctv_59_f/s, cctv_66_f/s, cctv_75_f/s and cctv_ref_f/s, where f denotes a long-story set, s denotes a short-story set and ref denotes the reference set. The TDT2 data set contains 177 Chinese news story transcripts; according to recognition error rate, these 177 transcripts can be divided into two subsets, denoted tdt2_ref and tdt2_rcg. The news stories in the CCTV and TDT2 data sets are segmented using the rigid semantic similarity S_h, the contextual semantic similarity S_c and the flexible semantic similarity S_s, respectively, and the resulting segmentation precision is compared, with precision reported as the F1 score. Table 1 lists the segmentation precision obtained on the CCTV and TDT2 data sets with the different similarity measures.
Table 1
From Table 1 it can be observed that, compared with the traditional rigid semantic similarity, the flexible semantic similarity significantly improves segmentation precision, with gains of about 3% to 10%. It can also be observed that the contextual semantic similarity performs better than the rigid semantic similarity, and that the contextual semantic similarity is further improved by the SimRank iteration. To show the robustness of the method, a stricter experiment is carried out on the cctv_75_s data set by comparing the segmentation precision of the different methods over 100 groups of random parameters; Fig. 4 shows the results of this experiment. In news story segmentation, the quality of the similarity between sentences can also be measured by the ratio of inter-story similarity to intra-story similarity, which reflects the discriminability of sentences; the smaller the ratio, the better the similarity measure. The ratio is defined in terms of lab(s_i), the label of the story to which sentence s_i belongs, lab(s_j), the label of the story to which sentence s_j belongs, and mean, which denotes averaging.
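One plausible form of this ratio, assuming r is the mean inter-story sentence similarity divided by the mean intra-story sentence similarity (the equation itself is omitted above), is:

```latex
r = \frac{\operatorname{mean}\bigl\{\operatorname{sim}(s_i, s_j) \,:\, \operatorname{lab}(s_i) \neq \operatorname{lab}(s_j)\bigr\}}
         {\operatorname{mean}\bigl\{\operatorname{sim}(s_i, s_j) \,:\, \operatorname{lab}(s_i) = \operatorname{lab}(s_j),\ i \neq j\bigr\}}
```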
The rigid semantic similarity S_h, the contextual semantic similarity S_c and the flexible semantic similarity S_s are compared on the standard data sets, and the comparison results are shown in Fig. 3. The experiments show that the ratio r obtained with the flexible semantic similarity S_s is lower than that of the other two similarities, and that the contextual semantic similarity S_c yields a lower ratio than the rigid semantic similarity S_h. This experiment shows that the flexible semantic similarity S_s is more reasonable than the traditional rigid semantic similarity S_h, and that the flexible semantic similarity obtained by the SimRank iteration (i.e. the semantic similarity after iterative propagation) is more reasonable still. Applying this method to Chinese news story segmentation therefore significantly improves segmentation precision.
References:
[1] J. Allan, Ed., Topic Detection and Tracking: Event-Based Information Organization, Kluwer Academic Publishers, 2002.
[2] L.-S. Lee and B. Chen, "Spoken document understanding and organization," vol. 22, no. 5, pp. 42–60, 2005.
[3] S. Banerjee and A. I. Rudnicky, "A TextTiling based approach to topic boundary detection in meetings," in Interspeech, 2006.
[4] L. Xie, J. Zeng, and W. Feng, "Multi-scale TextTiling for automatic story segmentation in Chinese broadcast news," in AIRS, 2008.
[5] I. Malioutov and R. Barzilay, "Minimum cut model for spoken lecture segmentation," in ACL, 2006.
[6] J. Zhang, L. Xie, W. Feng, and Y. Zhang, "A subword normalized cut approach to automatic story segmentation of Chinese broadcast news," in AIRS, 2009.
[7] Z. Liu, L. Xie, and W. Feng, "Maximum lexical cohesion for fine-grained news story segmentation," in Interspeech, 2010.
[8] T. Pedersen, S. Patwardhan, and J. Michelizzi, "WordNet::Similarity – measuring the relatedness of concepts," in AAAI (Intelligent Systems Demonstration), 2004.
[9] C. Fellbaum, Ed., WordNet: An Electronic Lexical Database, MIT Press, 1998.
[10] G. Jeh and J. Widom, "SimRank: a measure of structural-context similarity," in ACM SIGKDD, 2002.
[11] G. He, H. Feng, C. Li, and H. Chen, "Parallel SimRank computation on large graphs with iterative aggregation," in ACM SIGKDD, 2010.
Those skilled in the art will appreciate that the accompanying drawings are schematic diagrams of a preferred embodiment, and that the serial numbers of the embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention shall be included within the protection scope of the present invention.
Claims (5)
1. A Chinese news story segmentation method based on flexible semantic similarity measurement, characterised in that the method comprises the following steps:
(1) inputting a target corpus and performing word segmentation on each news story transcript t_i in the corpus;
(2) building a context relation graph;
(3) iteratively propagating the contextual relatedness between words over the context relation graph using the SimRank algorithm to obtain a flexible semantic relevance matrix;
(4) defining the flexible semantic similarity between sentences by means of the flexible semantic relevance matrix;
(5) segmenting the Chinese news stories using the flexible semantic similarity;
wherein the step of building the context relation graph is specifically as follows:
1) reading in each news story transcript in turn and computing word frequency statistics over the words it contains;
2) deleting overly frequent words and low-frequency words according to predefined word frequency thresholds;
3) taking the retained words as the nodes of the context relation graph, their set being V;
4) judging whether any two words in the set appear simultaneously in some news story transcript with the distance between the two words no greater than a distance threshold; if so, creating an edge between the two words, the set of edges being E; if not, judging the next pair of words, until all words in the set have been traversed;
5) representing the edge weights s_c by the weight sim_c(a, b) between words and the weight sim_c(a, a) of a word with itself;
6) expressing the context relation graph as G = <V, E, s_c>.
2. The method according to claim 1, characterised in that the weight sim_c(a, b) between words is computed from co-occurrence statistics, where freq(a, b) denotes the number of times that words a and b co-occur, freq_max = max_(i,j){freq(i, j)} denotes the maximum co-occurrence frequency over all word pairs (i, j), and ε is a constant that guarantees 0 ≤ sim_c(a, b) ≤ 1.
3. The method according to claim 1, characterised in that the weight of a word with itself is sim_c(a, a) = 1.
4. The method according to claim 1, characterised in that the step of iteratively propagating the contextual relatedness between words over the context relation graph using the SimRank algorithm to obtain the flexible semantic relevance matrix is specifically as follows:
1) defining the semantic similarity between words in the context relation graph as sim_s(a, b), which satisfies three criteria: the similarity of a word with itself is 1, i.e. sim_s(a, a) = 1; sim_s(a, b) is positively correlated with sim_c(a, b); and sim_s(a, b) is proportional to the similarity between the neighbours of a and b;
2) defining the iterative propagation of the semantic similarity, where u ~ a and v ~ b indicate that u and v are neighbour nodes of word a and word b in the context relation graph, respectively, Z is a normalization factor, c is a control factor, sim_s^(t)(a, b) denotes the semantic similarity of words a and b at the t-th iteration, sim_s^(t-1)(a, b) denotes the semantic similarity at the (t-1)-th iteration, and sim_s^(0)(a, b) denotes the initialization;
3) solving the relation defined in 2) with the SimRank algorithm to obtain the semantic relatedness; the semantic relatedness is computed for every pair of words, and the resulting values form the flexible semantic relevance matrix, denoted S_s.
5. The method according to claim 1, characterised in that the step of defining the flexible semantic similarity between sentences by means of the flexible semantic relevance matrix is specifically as follows: s_i and s_j denote sentences, ||f_i|| and ||f_j|| denote the 2-norms of the word-frequency vectors of the two sentences, and T denotes transposition.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410027012.3A CN103793491B (en) | 2014-01-20 | 2014-01-20 | Chinese news story segmentation method based on flexible semantic similarity measurement |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201410027012.3A CN103793491B (en) | 2014-01-20 | 2014-01-20 | Chinese news story segmentation method based on flexible semantic similarity measurement |
Publications (2)
Publication Number | Publication Date |
---|---|
CN103793491A CN103793491A (en) | 2014-05-14 |
CN103793491B true CN103793491B (en) | 2017-01-25 |
Family
ID=50669157
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201410027012.3A Active CN103793491B (en) | 2014-01-20 | 2014-01-20 | Chinese news story segmentation method based on flexible semantic similarity measurement |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103793491B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2019023893A1 (en) * | 2017-07-31 | 2019-02-07 | Beijing Didi Infinity Technology And Development Co., Ltd. | System and method for segmenting a sentence |
CN110750617A (en) * | 2018-07-06 | 2020-02-04 | 北京嘀嘀无限科技发展有限公司 | Method and system for determining relevance between input text and interest points |
- 2014-01-20: Application CN201410027012.3A granted as patent CN103793491B (status: Active)
Non-Patent Citations (2)
Title |
---|
"基于上下文信息的新闻故事单元分割";冀中;《天津大学学报》;20090228;第42卷(第2期);摘要、第153页第1栏第1段-第158页第1栏第2段 * |
"基于本体概念的柔性相似度计算方法研究";张野;《计算机技术与发展》;20120930;第22卷(第9期);摘要、第103页第1栏第2段-第106页子1栏第4段 * |
Also Published As
Publication number | Publication date |
---|---|
CN103793491A (en) | 2014-05-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Liu et al. | Mining quality phrases from massive text corpora | |
Gupta et al. | Analyzing the dynamics of research by extracting key aspects of scientific papers | |
Christensen et al. | An analysis of open information extraction based on semantic role labeling | |
CN103399901B (en) | A kind of keyword abstraction method | |
Liu et al. | Measuring similarity of academic articles with semantic profile and joint word embedding | |
CN105808525A (en) | Domain concept hypernym-hyponym relation extraction method based on similar concept pairs | |
Kim et al. | Interpreting semantic relations in noun compounds via verb semantics | |
CN105701084A (en) | Characteristic extraction method of text classification on the basis of mutual information | |
CN104391942A (en) | Short text characteristic expanding method based on semantic atlas | |
Jang et al. | Metaphor detection in discourse | |
CN106528524A (en) | Word segmentation method based on MMseg algorithm and pointwise mutual information algorithm | |
KR101396131B1 (en) | Apparatus and method for measuring relation similarity based pattern | |
Paiva et al. | Discovering semantic relations from unstructured data for ontology enrichment: Asssociation rules based approach | |
Momtaz et al. | Graph-based Approach to Text Alignment for Plagiarism Detection in Persian Documents. | |
CN103793491B (en) | Chinese news story segmentation method based on flexible semantic similarity measurement | |
Celebi et al. | Segmenting hashtags using automatically created training data | |
CN103455638A (en) | Behavior knowledge extracting method and device combining reasoning and semi-automatic learning | |
Wan et al. | Chinese shallow semantic parsing based on multilevel linguistic clues | |
Gayen et al. | Automatic identification of Bengali noun-noun compounds using random forest | |
Nie et al. | Measuring semantic similarity by contextualword connections in chinese news story segmentation | |
Thilagavathi et al. | Document clustering in forensic investigation by hybrid approach | |
Reshadat et al. | Confidence measure estimation for open information extraction | |
Zolotarev | Research and development of linguo-statistical methods for forming a portrait of a subject area | |
CN112328811A (en) | Word spectrum clustering intelligent generation method based on same type of phrases | |
CN105808521A (en) | Semantic feature based semantic relation mode acquisition method and system |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
TR01 | Transfer of patent right | Effective date of registration: 20210628; Address after: No.48, 1st floor, No.58, No.44, Middle North Third Ring Road, Haidian District, Beijing 100088; Patentee after: BEIJING HONGBO ZHIWEI SCIENCE & TECHNOLOGY Co.,Ltd.; Address before: 300072 Tianjin City, Nankai District, Weijin Road No. 92; Patentee before: Tianjin University |