CN103793491B - Chinese news story segmentation method based on flexible semantic similarity measurement - Google Patents


Info

Publication number
CN103793491B
CN103793491B (application CN201410027012.3A)
Authority
CN
China
Prior art keywords
word
sim
semantic similarity
semantic
flexible
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201410027012.3A
Other languages
Chinese (zh)
Other versions
CN103793491A (en)
Inventor
冯伟
万亮
聂学成
高晓妮
党建武
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
BEIJING HONGBO ZHIWEI SCIENCE & TECHNOLOGY Co.,Ltd.
Original Assignee
Tianjin University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tianjin University filed Critical Tianjin University
Priority to CN201410027012.3A priority Critical patent/CN103793491B/en
Publication of CN103793491A publication Critical patent/CN103793491A/en
Application granted granted Critical
Publication of CN103793491B publication Critical patent/CN103793491B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/131 Fragmentation of text files, e.g. creating reusable text-blocks; Linking to fragments, e.g. using XInclude; Namespaces
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/253 Grammatical analysis; Style critique

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a Chinese news story segmentation method based on flexible semantic similarity measurement. The method comprises the following steps: a target text collection is input, and word segmentation is performed on each news story script Ti in the collection; a context relation graph is built; the context relevance between words is iteratively propagated over the context relation graph with the SimRank algorithm to obtain a flexible semantic relevance matrix; the flexible semantic similarity between sentences is defined through the flexible semantic relevance matrix; the Chinese news stories are segmented using the flexible semantic similarity. This flexible measure can more reasonably express the semantic similarity between words and between word sets. Experiments show that, under the same segmentation criteria, the flexible semantic similarity measure improves the accuracy of Chinese news story segmentation by 3% to 10% compared with traditional similarity measures.

Description

A Chinese news story segmentation method based on flexible semantic similarity measurement
Technical field
The present invention relates to the field of Chinese news story segmentation, and in particular to a Chinese news story segmentation method based on flexible semantic similarity measurement.
Background technology
With the popularization and development of networks, multimedia content such as broadcast news, meeting minutes and online open courses is growing rapidly, and an effective method is urgently needed to organize such multimedia data automatically for topic-based text retrieval and analysis. A multimedia document, for example an hour-long broadcast news program, generally consists of multiple stories. For efficient semantic retrieval, it is important to guide users to the beginning and end of the topics they are interested in; meanwhile, splitting a multimedia document is an important prerequisite for higher-level semantic browsing such as topic tracking[1], classification and summarization[2]. The purpose of news story segmentation is to divide a news story script into topically coherent stories. Technically, the effectiveness of news story segmentation depends on two factors: one is how the similarity between words, and between sentence sets, is measured; the other is the criterion used to split the news story script.
Much previous work has focused on designing reasonable segmentation criteria, for example TextTiling[3][4], the minimum normalized cuts criterion (minimum ncuts)[5][6] and the maximum lexical cohesion criterion[7]. In contrast to the widely studied segmentation criteria, most current work uses a simple repetition-based rigid similarity measure, i.e. the similarity between identical words is 1 and the similarity between different words is 0. Clearly, such a repetition-based rigid similarity measure ignores the latent semantic relevance between different words, making the semantic relation measure, and hence the resulting Chinese news story segmentation, inaccurate. A more reasonable semantic similarity measure is therefore needed to improve the effectiveness and precision of segmentation.
Content of the invention
The invention provides a Chinese news story segmentation method based on flexible semantic similarity measurement. The invention can reasonably express the semantic similarity between words and can significantly improve the precision of Chinese news story segmentation, as described below:
A Chinese news story segmentation method based on flexible semantic similarity measurement, the method comprising the following steps:
(1) inputting a target collection and performing word segmentation on each news story script t_i in the collection;
(2) building a context relation graph;
(3) iteratively propagating the context relevance between words over the context relation graph with the SimRank algorithm to obtain a flexible semantic relevance matrix;
(4) defining the flexible semantic similarity between sentences through the flexible semantic relevance matrix;
(5) segmenting the Chinese news stories using the flexible semantic similarity.
The step of building the context relation graph is specifically:
1) reading in each news story script in turn and computing word-frequency statistics for the words it contains;
2) deleting high-frequency and low-frequency words according to predefined word-frequency thresholds;
3) taking the retained words as the nodes of the context relation graph, their set being V;
4) judging whether any two words in the set appear in the same news story script with a distance between them less than or equal to a distance threshold; if so, building an edge between the two words, the set of edges being E; if not, judging another pair of words, until all words in the set have been traversed;
5) representing the edge weights S_c by the weight sim_c(a, b) between words and the self-weight sim_c(a, a) of each word;
6) the context relation graph being expressed as g = <V, E, S_c>.
The weight sim_c(a, b) between words is specifically:

sim_c(a, b) = freq(a, b) / (freq_max + ε)

where freq(a, b) is the number of times words a and b co-occur, freq_max = max_(i,j){freq(i, j)} is the maximum co-occurrence frequency over word pairs (i, j), and ε is a constant ensuring 0 ≤ sim_c(a, b) ≤ 1.
The self-weight of a word is sim_c(a, a) = 1.
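As an illustrative sketch (not part of the patent), the co-occurrence weight sim_c(a, b) above can be computed from tokenized scripts as follows; the window size, the ε value and the romanized sample tokens are assumptions of the example.

```python
from collections import Counter

def context_weights(scripts, window=2, eps=1e-6):
    """Compute sim_c(a, b) = freq(a, b) / (freq_max + eps) for word pairs
    co-occurring within `window` positions in the same script."""
    freq = Counter()
    for tokens in scripts:
        for i, a in enumerate(tokens):
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                if a != tokens[j]:
                    freq[frozenset((a, tokens[j]))] += 1
    freq_max = max(freq.values()) if freq else 1
    return {pair: n / (freq_max + eps) for pair, n in freq.items()}

# Toy script: "economy growth policy economy policy" (romanized tokens)
w = context_weights([["jingji", "zengzhang", "zhengce", "jingji", "zhengce"]])
```

Because of the ε in the denominator, even the most frequent pair gets a weight strictly below 1, as the definition requires.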
The step of iteratively propagating the context relevance between words over the context relation graph with the SimRank algorithm to obtain the flexible semantic relevance matrix is specifically:
1) defining the semantic similarity between words in the context relation graph as sim_s(a, b), satisfying the following three criteria:
the similarity of a word with itself is 1, i.e. sim_s(a, a) = 1; sim_s(a, b) is positively correlated with sim_c(a, b); sim_s(a, b) is proportional to the similarity between the neighbors of a and b;
2) defining the iterative propagation of semantic similarity:

sim_s^(0)(a, b) = sim_c(a, b);
sim_s^(t)(a, b) = (C/Z) Σ_{u~a, v~b} sim_s^(t-1)(u, v);
sim_s(a, b) = lim_{t→∞} sim_s^(t)(a, b);

where u~a and v~b mean that u and v are neighbor nodes of word a and word b in the context relation graph, Z is a normalization factor, C is a control factor, sim_s^(t)(a, b) is the semantic similarity of words a and b at iteration t, sim_s^(t-1)(a, b) is that at iteration t-1, and sim_s^(0)(a, b) is the initialization;
3) solving the relations defined in 2) with the SimRank algorithm to obtain the semantic relevance; the semantic relevance is computed for every pair of words, and these relevance values constitute the flexible semantic relevance matrix, denoted S_s.
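The three relations above describe a SimRank-style propagation (refs [10][11]). A minimal NumPy sketch, assuming a decay factor c and a fixed iteration count in place of the limit (illustrative, not the patent's GPU implementation):

```python
import numpy as np

def propagate_similarity(S_c, adjacency, c=0.8, iters=10):
    """Iterate sim_s(t)(a, b) = (c/Z) * sum over neighbor pairs (u, v)
    of sim_s(t-1)(u, v), with sim_s(a, a) pinned to 1 each round.
    Z = |N(a)| * |N(b)| is the normalization factor."""
    n = len(adjacency)
    S = np.array(S_c, dtype=float)
    for _ in range(iters):
        S_new = np.eye(n)  # sim_s(a, a) = 1
        for a in range(n):
            for b in range(a + 1, n):
                Na, Nb = adjacency[a], adjacency[b]
                if Na and Nb:
                    z = len(Na) * len(Nb)
                    s = sum(S[u, v] for u in Na for v in Nb)
                    S_new[a, b] = S_new[b, a] = c * s / z
        S = S_new
    return S

# Path graph 0-1-2: words 0 and 2 never co-occur but share neighbor 1.
S_c = np.eye(3)
S_c[0, 1] = S_c[1, 0] = 0.5
S_c[1, 2] = S_c[2, 1] = 0.5
S_s = propagate_similarity(S_c, [[1], [0, 2], [1]])
```

Here the rigid measure would leave sim(0, 2) at zero, while propagation assigns the two words a positive similarity through their shared neighbor.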
The step of defining the flexible semantic similarity between sentences through the flexible semantic relevance matrix is specifically:

sim(s_i, s_j | S_s) = (f_i^T S_s f_j) / (||f_i|| ||f_j||)

where s_i and s_j are sentences, ||f_i|| and ||f_j|| are the two-norms of the two sentence word-frequency vectors, and T denotes transposition.
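A sketch of this sentence-level measure, assuming NumPy term-frequency vectors over a shared vocabulary; with S_s equal to the identity matrix it reduces to the ordinary cosine similarity:

```python
import numpy as np

def flexible_similarity(f_i, f_j, S_s):
    """sim(s_i, s_j | S_s) = f_i^T S_s f_j / (||f_i|| * ||f_j||)."""
    denom = np.linalg.norm(f_i) * np.linalg.norm(f_j)
    return float(f_i @ S_s @ f_j) / denom if denom else 0.0

f1 = np.array([1.0, 1.0, 0.0])  # sentence using words 0 and 1
f2 = np.array([0.0, 0.0, 1.0])  # sentence using word 2 only
S = np.array([[1.0, 0.0, 0.6],
              [0.0, 1.0, 0.0],
              [0.6, 0.0, 1.0]])  # words 0 and 2 semantically related
cos = flexible_similarity(f1, f2, np.eye(3))  # rigid: no shared words
flex = flexible_similarity(f1, f2, S)         # flexible: related words count
```

The two sentences share no words, so the rigid (cosine) score is zero, while the flexible score is positive because S_s links words 0 and 2.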
The beneficial effect of the technical scheme provided by the invention is as follows: the invention proposes an unsupervised semantic similarity measure based on the SimRank algorithm, so that latent semantic relations between words can be incorporated to improve the traditional cosine similarity, and uses this flexible semantic similarity to improve Chinese news story segmentation. The proposed flexible measure can more reasonably express the semantic similarity between words and between word sets. Experiments show that, in Chinese news story segmentation under the same segmentation criteria, the flexible semantic similarity measure improves segmentation precision by 3%-10% compared with traditional similarity measures.
Brief description
Fig. 1 is a flow chart of Chinese news story segmentation based on flexible semantic similarity;
Fig. 2 is a schematic diagram of the context relation graph;
Fig. 3 compares the ratio of between-story to within-story sentence similarity on the standard data sets CCTV and TDT2;
Fig. 4 compares the results of three different similarity measures used by the Chinese news story segmentation algorithm over 100 groups of random parameters on the standard data set cctv-75-s.
Specific embodiment
To make the objectives, technical solutions and advantages of the present invention clearer, embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
Semantic similarity measurement is a highly challenging research topic in natural language processing. Existing methods fall into two classes: supervised and unsupervised. Supervised methods mainly include WordNet[8][9] and DISCO. WordNet is used to measure the similarity between any two English words. WordNet relies on annotated corpora and organizes nouns, verbs, adjectives and adverbs hierarchically, the division being based on linguists' semantic definitions of these words. Owing to its simplicity and effectiveness, WordNet has been widely applied in natural language processing tasks. Similar to WordNet, DISCO is another common supervised method for retrieving the similarity between any two given words. Compared with WordNet, DISCO supports more languages, for example English, German, French and Spanish. Supervised methods can be used directly on the predefined language space without any extra computation, and they cover almost all common words. However, supervised methods depend on the knowledge of linguists, their word-similarity measures are largely defined subjectively, and they are not suitable for applications based on specific corpora. Unsupervised methods mainly include PMI, LSA and PLSA. PMI counts, from web search engine statistics, the number of pages in which two words co-occur; the more co-occurrences, the higher the PMI score of the two words. LSA is also an unsupervised semantic similarity measure, which incorporates a mechanism of human knowledge learning to obtain the similarity between words or text fragments. The key step of LSA is dimensionality reduction by singular value decomposition. LSA can also handle the synonymy problem in natural language processing. PLSA is an improved LSA algorithm. Unlike the LSA algorithm, which comes from linear algebra, PLSA, while inheriting the advantages of LSA, analyzes the correlation between word pairs with probabilistic methods and handles synonymy and ambiguity well. Compared with LSA, PLSA is more general.
In recent years, the development of graph theory has attracted the attention of natural language processing researchers. Widdows et al. proposed an unsupervised graph-model method for obtaining semantic similarity, in which nodes represent words and edges represent relations between words. Moreover, this graph model is based on a specific corpus and can handle word ambiguity. Ambwani et al. proposed another graph model for measuring word semantic similarity, in which each word is represented as a series of nodes, each node corresponding to one sentence in the word's coverage, and edge weights represent the relevance between words. This model incorporates the mutual influence between words and determines the relevance between words according to their context. The unsupervised semantic-similarity methods discussed above are all based on specific corpora and are better suited to specific applications than supervised methods. Among these unsupervised methods, the simplicity and efficiency of graph models have drawn the attention of more and more natural language processing researchers to graph-based semantic similarity computation.
Measuring semantic similarity between word sets (such as paragraphs and texts) is also a problem demanding a prompt solution. The common measure of word-set semantic similarity is cosine similarity. Under the bag-of-words assumption, each word set is expressed as a word-frequency vector, and cosine similarity measures the angle between word-frequency vectors: the larger the angle, the smaller the similarity, and vice versa. Because cosine similarity is simple and effective, it is widely used to measure word-set semantic similarity; however, it only considers the relations between identical words and ignores the relevance between different words within the word sets, which makes the word-set similarity measure inaccurate. To make word-set similarity measurement more accurate and meaningful, the relevance between words should be taken into account when measuring similarity between word sets. A method that incorporates the relevance between words into word-set similarity measurement is therefore urgently needed.
To reasonably express the semantic similarity between words and significantly improve the precision of Chinese news story segmentation, the embodiment of the present invention provides a Chinese news story segmentation method based on flexible semantic similarity measurement. Referring to Fig. 1, both the semantic similarity computation and the news story segmentation in this method operate on a specific data set. Meanwhile, to demonstrate the reasonableness of the flexible semantic similarity measure, a verification criterion is designed to verify it, as described below:
101: input the target collection and perform word segmentation on each news story script t_i in the collection;
Through this step, each sentence in a news story script is split into words; this step is well known to those skilled in the art and is not elaborated here.
102: build the context relation graph;
1) read in the news story scripts in turn and compute word-frequency statistics for the words they contain;
2) delete high-frequency and low-frequency words according to predefined word-frequency thresholds;
3) take the retained words as the nodes of the context relation graph, their set being V;
4) judge whether any two words in the set appear in the same news story script with a distance between them less than or equal to a distance threshold; if so, build an edge between the two words, the set of edges being E; if not, judge another pair of words, until all words in the set have been traversed;
5) represent the edge weights S_c by the weight sim_c(a, b) between words and the self-weight sim_c(a, a) of each word;
The weight sim_c(a, b) between words is defined by the following formula:

sim_c(a, b) = freq(a, b) / (freq_max + ε)

where freq(a, b) is the number of times words a and b co-occur, freq_max = max_(i,j){freq(i, j)} is the maximum co-occurrence frequency over word pairs (i, j), and ε is a constant ensuring 0 ≤ sim_c(a, b) ≤ 1 (a ≠ b).
Meanwhile, the self-weight of a word is sim_c(a, a) = 1.
6) The context relation graph can therefore be expressed as g = <V, E, S_c>. Fig. 2 is a schematic diagram of the context relation graph of a document in the data set, where w_u denotes the u-th word in t_i and the lines denote relations between words.
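Steps 1)-6) above can be sketched as follows; the frequency thresholds and distance threshold are illustrative assumptions, since the patent does not fix their values.

```python
from itertools import combinations

def build_context_graph(scripts, min_freq=2, max_freq=100, dist_thresh=10):
    """Return the node set V and edge set E of the context relation graph
    g = <V, E, S_c> from tokenized news story scripts."""
    freq = {}
    for tokens in scripts:
        for w in tokens:
            freq[w] = freq.get(w, 0) + 1
    # Drop high- and low-frequency words; retained words are the nodes.
    V = {w for w, n in freq.items() if min_freq <= n <= max_freq}
    E = set()
    for tokens in scripts:
        positions = [(i, w) for i, w in enumerate(tokens) if w in V]
        for (i, a), (j, b) in combinations(positions, 2):
            if a != b and abs(i - j) <= dist_thresh:
                E.add(frozenset((a, b)))  # co-occur within the threshold
    return V, E

V, E = build_context_graph([["a", "b", "c", "a", "d"], ["a", "b", "a"]])
```

In the toy input, only "a" and "b" pass the frequency filter, and they co-occur within the distance threshold, giving a single edge.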
103: iteratively propagate the context relevance between words over the context relation graph with the SimRank algorithm[10][11] to obtain the flexible semantic relevance matrix;
1) define the semantic similarity between words in the context relation graph as sim_s(a, b), which satisfies the following three criteria:
the similarity of a word with itself is 1, i.e. sim_s(a, a) = 1;
sim_s(a, b) is positively correlated with sim_c(a, b);
sim_s(a, b) is proportional to the similarity between the neighbors of a and b.
2) the iterative propagation of semantic similarity is defined by the following relations:

sim_s^(0)(a, b) = sim_c(a, b);
sim_s^(t)(a, b) = (C/Z) Σ_{u~a, v~b} sim_s^(t-1)(u, v);
sim_s(a, b) = lim_{t→∞} sim_s^(t)(a, b)

where u~a and v~b mean that u and v are neighbor nodes of word a and word b in the context relation graph, Z is a normalization factor, C is a control factor, sim_s^(t)(a, b) is the semantic similarity of words a and b at iteration t, sim_s^(t-1)(a, b) is that at iteration t-1, and sim_s^(0)(a, b) is the initialization.
3) solve the relations defined in 2) with the SimRank algorithm to obtain the semantic relevance; the semantic relevance is computed for every pair of words, and these relevance values constitute the flexible semantic relevance matrix, denoted S_s. Similarly, the traditional rigid semantic similarity is defined as S_h = I, where I denotes the identity matrix.
The SimRank algorithm is based on the hypothesis that if the neighbor words of two words are more similar (relevance greater than or equal to 0.5), then the two words are also more similar;
The SimRank algorithm takes the context relation graph as input; its complexity is O(k|V|^2), where k is the average degree in g, |V| is the number of nodes in the context relation graph, and O denotes algorithmic complexity.
In this invention, the SimRank algorithm is implemented as a fully parallel algorithm on GPU. Experiments show that, with the same context relation graph as input, the GPU implementation of SimRank runs about 1000 times faster than the traditional CPU implementation.
The output of the SimRank algorithm is the flexible semantic relevance matrix after iterative propagation, S_s = {sim_s(a, b)}_{a,b∈V}.
104: define the flexible semantic similarity between sentences (a sentence being a set of consecutive words in a news story script) through the flexible semantic relevance matrix;
In news story segmentation, besides word semantic similarity, the similarity between word sets, i.e. sentences, also needs to be measured. Each sentence in a story can be represented as a word-frequency vector recording the number of occurrences of each word in the sentence. Given the flexible semantic relevance matrix, the flexible semantic similarity between sentences is defined as follows:

sim(s_i, s_j | S_s) = (f_i^T S_s f_j) / (||f_i|| ||f_j||)

where s_i and s_j are sentences, ||f_i|| and ||f_j|| are the two-norms of the two sentence word-frequency vectors, and T denotes transposition. This definition improves the traditional cosine similarity by taking the latent semantic relevance between different words into account, and can therefore more reasonably express the semantic similarity between sentences.
105: segment the Chinese news stories using the flexible semantic similarity.
1) The segmentation criterion used for Chinese news story segmentation is normalized cuts[5][6].
Normalized cuts is based on a graph model: sentences are taken as the nodes of the graph, relations between sentences as the edges, and the similarity between sentences as the edge weights; the news story segmentation problem is thereby converted into a graph partitioning problem.
2) The news story scripts in the input data set are segmented into Chinese news stories using the flexible semantic similarity between sentences.
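As a simplified illustration of the normalized-cuts criterion[5][6] (restricted to a single boundary, unlike the patent's full method), one can scan candidate boundaries and pick the one minimizing the normalized cut over the sentence-similarity matrix:

```python
import numpy as np

def best_boundary(W):
    """Return the boundary index b minimizing
    ncut(b) = cut(A, B)/assoc(A, V) + cut(A, B)/assoc(B, V),
    where A = sentences before b, B = sentences from b on,
    and W is the (flexible) sentence-similarity matrix."""
    n = W.shape[0]
    best_b, best_cost = None, float("inf")
    for b in range(1, n):
        A, B = np.arange(b), np.arange(b, n)
        cut = W[np.ix_(A, B)].sum()
        cost = cut / W[A].sum() + cut / W[B].sum()
        if cost < best_cost:
            best_b, best_cost = b, cost
    return best_b

# Two 3-sentence stories: high similarity within, low across.
W = np.full((6, 6), 0.1)
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
boundary = best_boundary(W)
```

With a block-structured similarity matrix, the minimum normalized cut falls exactly between the two stories.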
The feasibility of the Chinese news story segmentation method based on flexible semantic similarity measurement provided by the present invention is verified below with specific experiments:
Experiments on standard data sets:
To verify the validity of this method, it was tested on two standard data sets, CCTV and TDT2. The CCTV data set contains 71 Chinese news story scripts in total; according to story length and recognition error rate, it is divided into 8 subsets, denoted cctv_59_f/s, cctv_66_f/s, cctv_75_f/s and cctv_ref_f/s, where f denotes the long-story sets, s the short-story sets, and ref the reference set. The TDT2 data set contains 177 Chinese news story scripts; according to recognition error rate, the 177 scripts are divided into two subsets, denoted tdt2_ref and tdt2_rcg. The rigid semantic similarity S_h, the context semantic similarity S_c and the flexible semantic similarity S_s were each used to segment the news stories in the CCTV and TDT2 data sets, and their segmentation precision was compared, precision being reported as the F1 score. Table 1 lists the segmentation precision of the different similarity measures on the CCTV and TDT2 data sets.
Table 1
From Table 1 it can be observed that, compared with the traditional rigid semantic similarity, the flexible semantic similarity significantly improves segmentation precision, by about 3% to 10%. It is also found that the context semantic similarity outperforms the rigid semantic similarity, and that the SimRank algorithm further improves on the context semantic similarity. To show the robustness of this method, another stricter experiment was conducted on the cctv_75_s data set, comparing the segmentation precision of different methods over 100 groups of random parameters; Fig. 4 shows the result of this experiment. In news story segmentation, the quality of inter-sentence similarity can be measured by the ratio of between-story similarity to within-story similarity, which reflects the discriminability of sentences: the smaller the ratio, the better the similarity. The ratio is defined by the following formula:

r(C | S_s) = exp( mean_{lab(s_i)≠lab(s_j)} sim(s_i, s_j | S_s) ) / exp( mean_{lab(s_i)=lab(s_j)} sim(s_i, s_j | S_s) )

where lab(s_i) and lab(s_j) are the labels of the stories to which sentences s_i and s_j belong, and mean denotes averaging.
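The ratio r above can be sketched as follows, with an assumed sentence-similarity matrix and story labels:

```python
import numpy as np

def similarity_ratio(S, labels):
    """r(C | S_s) = exp(mean cross-story sim) / exp(mean within-story sim);
    diagonal (self-similarity) entries are excluded from the within mean."""
    labels = np.asarray(labels)
    diff = labels[:, None] != labels[None, :]
    same = ~diff & ~np.eye(len(labels), dtype=bool)
    return float(np.exp(S[diff].mean()) / np.exp(S[same].mean()))

# Two stories of three sentences each; sharper separation gives smaller r.
S = np.full((6, 6), 0.1)
S[:3, :3] = 1.0
S[3:, 3:] = 1.0
r = similarity_ratio(S, [0, 0, 0, 1, 1, 1])
```

A ratio below 1 indicates that sentences within a story are, on average, more similar to each other than to sentences of other stories.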
The rigid semantic similarity S_h, the context semantic similarity S_c and the flexible semantic similarity S_s were compared on the standard data sets; the comparison is shown in Fig. 3. Experiments show that the r ratio obtained with the flexible semantic similarity S_s is lower than with the other two similarities, and that the context semantic similarity S_c gives a lower ratio than the rigid semantic similarity S_h. This experiment shows that the flexible semantic similarity S_s is more reasonable than the traditional rigid semantic similarity S_h, and that the flexible semantic similarity solved by the SimRank algorithm (i.e. the semantic similarity after iterative propagation) is more reasonable still. Applying this method to Chinese news story segmentation significantly increases segmentation precision.
Bibliography:
[1] J. Allan, Ed., Topic Detection and Tracking: Event-based Information Organization, Kluwer Academic Publishers, 2002.
[2] L.-S. Lee and B. Chen, "Spoken document understanding and organization," vol. 22, no. 5, pp. 42-60, 2005.
[3] S. Banerjee and A. I. Rudnicky, "A TextTiling based approach to topic boundary detection in meetings," in INTERSPEECH, 2006.
[4] L. Xie, J. Zeng, and W. Feng, "Multi-scale TextTiling for automatic story segmentation in Chinese broadcast news," in AIRS, 2008.
[5] I. Malioutov and R. Barzilay, "Minimum cut model for spoken lecture segmentation," in ACL, 2006.
[6] J. Zhang, L. Xie, W. Feng, and Y. Zhang, "A subword normalized cut approach to automatic story segmentation of Chinese broadcast news," in AIRS, 2009.
[7] Z. Liu, L. Xie, and W. Feng, "Maximum lexical cohesion for fine-grained news story segmentation," in INTERSPEECH, 2010.
[8] T. Pedersen, S. Patwardhan, and J. Michelizzi, "WordNet::Similarity - measuring the relatedness of concepts," in AAAI (Intelligent Systems Demonstration), 2004.
[9] Christiane Fellbaum, Ed., WordNet: An Electronic Lexical Database, MIT Press, 1998.
[10] G. Jeh and J. Widom, "SimRank: a measure of structural-context similarity," in ACM SIGKDD, 2002.
[11] G. He, H. Feng, C. Li, and H. Chen, "Parallel SimRank computation on large graphs with iterative aggregation," in ACM SIGKDD, 2010.
Those skilled in the art will appreciate that the accompanying drawings are schematic diagrams of a preferred embodiment, and that the serial numbers of the embodiments of the present invention are for description only and do not indicate the relative merit of the embodiments.
The foregoing descriptions are merely preferred embodiments of the present invention and are not intended to limit the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (5)

1. A Chinese news story segmentation method based on flexible semantic similarity measurement, characterized in that the method comprises the following steps:
(1) inputting a target collection and performing word segmentation on each news story script t_i in the collection;
(2) building a context relation graph;
(3) iteratively propagating the context relevance between words over the context relation graph with the SimRank algorithm to obtain a flexible semantic relevance matrix;
(4) defining the flexible semantic similarity between sentences through the flexible semantic relevance matrix;
(5) segmenting the Chinese news stories using the flexible semantic similarity;
the step of building the context relation graph being specifically:
1) reading in each news story script in turn and computing word-frequency statistics for the words it contains;
2) deleting high-frequency and low-frequency words according to predefined word-frequency thresholds;
3) taking the retained words as the nodes of the context relation graph, their set being V;
4) judging whether any two words in the set appear in the same news story script with a distance between them less than or equal to a distance threshold; if so, building an edge between the two words, the set of edges being E; if not, judging another pair of words, until all words in the set have been traversed;
5) representing the edge weights S_c by the weight sim_c(a, b) between words and the self-weight sim_c(a, a) of each word;
6) the context relation graph being expressed as g = <V, E, S_c>.
2. The method according to claim 1, characterized in that the weight sim_c(a, b) between words is specifically:

sim_c(a, b) = freq(a, b) / (freq_max + ε)

where freq(a, b) is the number of times words a and b co-occur, freq_max = max_(i,j){freq(i, j)} is the maximum co-occurrence frequency over word pairs (i, j), and ε is a constant ensuring 0 ≤ sim_c(a, b) ≤ 1.
3. The method according to claim 1, characterized in that the self-weight of a word is sim_c(a, a) = 1.
4. method according to claim 1 is it is characterised in that described calculated by described context relation figure and quicksort Method the context dependence between word is iterated propagate the step obtaining flexible semantic dependency matrix particularly as follows:
1) defining the Semantic Similarity between context relation in figure word is sims(a, b), following three criterions of satisfaction:
Word is 1 with the similitude of itself, that is,
sims(a, a)=1;sims(a, b) and simc(a, b) positive correlation;simsSimilitude between (a, b) and their neighbours just becomes Than;
2) define the iterative diffusion process of the semantic similarity:

sim_s^(0)(a, b) = sim_c(a, b);
sim_s^(t)(a, b) = (c / z) Σ_{u~a, v~b} sim_s^(t-1)(u, v);
sim_s(a, b) = lim_{t→∞} sim_s^(t)(a, b);

Wherein u~a and v~b denote that u and v are neighbour nodes of word a and word b in the context-relation graph respectively, z is the normalization factor, c is the control factor, sim_s^(t)(a, b) denotes the semantic similarity of word a and word b at the t-th iteration, sim_s^(t-1)(u, v) denotes the semantic similarity of the neighbours u and v at the (t-1)-th iteration, and sim_s^(0)(a, b) denotes the initialization;
3) use the quicksort algorithm to solve the relation defined in 2) and obtain the semantic correlation; the semantic correlation is computed for every pair of words, and these correlations constitute the flexible semantic-correlation matrix, denoted S_s.
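The diffusion recurrence above resembles a SimRank-style update and can be sketched with a direct fixed-point iteration (rather than the claimed quicksort-based solver, whose details the claim does not spell out); the normalization z = deg(a)·deg(b) and the damping value c are assumptions:

```python
import numpy as np

def diffuse_similarity(S_c, A, c=0.8, iters=20):
    """Sketch of claim 4's iterative diffusion.

    S_c : initial word-similarity matrix (claim 2 weights, diagonal = 1).
    A   : binary adjacency matrix of the context-relation graph.
    Implements sim^(t)(a, b) = (c/z) * sum_{u~a, v~b} sim^(t-1)(u, v),
    assuming z = deg(a) * deg(b), which the claim text does not fix.
    """
    deg = A.sum(axis=1)
    S = S_c.copy()
    for _ in range(iters):
        # Neighbour-pair sums for every (a, b) at once: A @ S @ A^T.
        num = A @ S @ A.T
        z = np.outer(deg, deg)
        S = c * np.divide(num, z, out=np.zeros_like(num), where=z > 0)
        np.fill_diagonal(S, 1.0)  # criterion 1: sim_s(a, a) = 1
    return S
```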
5. The method according to claim 1, characterized in that the described step of defining the flexible semantic similarity between sentences through the described flexible semantic-correlation matrix is specifically:

sim(s_i, s_j | S_s) = (f_i^T S_s f_j) / (||f_i|| ||f_j||)

Wherein s_i and s_j denote the two sentences respectively, f_i and f_j are their word-frequency vectors, ||f_i|| and ||f_j|| denote the two-norms of the two sentence word-frequency vectors respectively, and T denotes the transpose.
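A direct implementation of this sentence-similarity formula, assuming f_i and f_j are word-frequency vectors over the graph vocabulary and S_s is the matrix from claim 4:

```python
import numpy as np

def flexible_sentence_similarity(f_i, f_j, S_s):
    """Claim 5's flexible semantic similarity between two sentences:
    sim(s_i, s_j | S_s) = f_i^T S_s f_j / (||f_i|| * ||f_j||).
    With S_s = I this reduces to ordinary cosine similarity.
    """
    denom = np.linalg.norm(f_i) * np.linalg.norm(f_j)
    if denom == 0:
        return 0.0  # an empty sentence has no defined similarity
    return float(f_i @ S_s @ f_j) / denom
```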
CN201410027012.3A 2014-01-20 2014-01-20 Chinese news story segmentation method based on flexible semantic similarity measurement Active CN103793491B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410027012.3A CN103793491B (en) 2014-01-20 2014-01-20 Chinese news story segmentation method based on flexible semantic similarity measurement


Publications (2)

Publication Number Publication Date
CN103793491A CN103793491A (en) 2014-05-14
CN103793491B true CN103793491B (en) 2017-01-25

Family

ID=50669157

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410027012.3A Active CN103793491B (en) 2014-01-20 2014-01-20 Chinese news story segmentation method based on flexible semantic similarity measurement

Country Status (1)

Country Link
CN (1) CN103793491B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2019023893A1 (en) * 2017-07-31 2019-02-07 Beijing Didi Infinity Technology And Development Co., Ltd. System and method for segmenting a sentence
CN110750617A (en) * 2018-07-06 2020-02-04 北京嘀嘀无限科技发展有限公司 Method and system for determining relevance between input text and interest points

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
"News story unit segmentation based on contextual information"; Ji Zhong; Journal of Tianjin University; 28 Feb. 2009; Vol. 42, No. 2; abstract, p. 153 col. 1 para. 1 - p. 158 col. 1 para. 2 *
"Research on a flexible similarity calculation method based on ontology concepts"; Zhang Ye; Computer Technology and Development; 30 Sep. 2012; Vol. 22, No. 9; abstract, p. 103 col. 1 para. 2 - p. 106 col. 1 para. 4 *

Also Published As

Publication number Publication date
CN103793491A (en) 2014-05-14

Similar Documents

Publication Publication Date Title
Liu et al. Mining quality phrases from massive text corpora
Gupta et al. Analyzing the dynamics of research by extracting key aspects of scientific papers
Christensen et al. An analysis of open information extraction based on semantic role labeling
CN103399901B (en) A kind of keyword abstraction method
Liu et al. Measuring similarity of academic articles with semantic profile and joint word embedding
CN105808525A (en) Domain concept hypernym-hyponym relation extraction method based on similar concept pairs
Kim et al. Interpreting semantic relations in noun compounds via verb semantics
CN105701084A (en) Characteristic extraction method of text classification on the basis of mutual information
CN104391942A (en) Short text characteristic expanding method based on semantic atlas
Jang et al. Metaphor detection in discourse
CN106528524A (en) Word segmentation method based on MMseg algorithm and pointwise mutual information algorithm
KR101396131B1 (en) Apparatus and method for measuring relation similarity based pattern
Paiva et al. Discovering semantic relations from unstructured data for ontology enrichment: Asssociation rules based approach
Momtaz et al. Graph-based Approach to Text Alignment for Plagiarism Detection in Persian Documents.
CN103793491B (en) Chinese news story segmentation method based on flexible semantic similarity measurement
Celebi et al. Segmenting hashtags using automatically created training data
CN103455638A (en) Behavior knowledge extracting method and device combining reasoning and semi-automatic learning
Wan et al. Chinese shallow semantic parsing based on multilevel linguistic clues
Gayen et al. Automatic identification of Bengali noun-noun compounds using random forest
Nie et al. Measuring semantic similarity by contextual word connections in Chinese news story segmentation
Thilagavathi et al. Document clustering in forensic investigation by hybrid approach
Reshadat et al. Confidence measure estimation for open information extraction
Zolotarev Research and development of linguo-statistical methods for forming a portrait of a subject area
CN112328811A (en) Word spectrum clustering intelligent generation method based on same type of phrases
CN105808521A (en) Semantic feature based semantic relation mode acquisition method and system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20210628

Address after: No.48, 1st floor, No.58, No.44, Middle North Third Ring Road, Haidian District, Beijing 100088

Patentee after: BEIJING HONGBO ZHIWEI SCIENCE & TECHNOLOGY Co.,Ltd.

Address before: 300072 Tianjin City, Nankai District Wei Jin Road No. 92

Patentee before: Tianjin University

TR01 Transfer of patent right