CN102411562A - Affective characteristic generation algorithm based on semantic chunk - Google Patents

Info

Publication number
CN102411562A
Authority
CN
China
Prior art keywords
semantic chunk
semantic
affective
tree
chunk
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2010102888550A
Other languages
Chinese (zh)
Inventor
朱俭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qiansong Science & Technology Development Co Ltd
BEIJING TONGZHOU DISTRICT SCIENCE TECHNOLOGY ASSOCIATION
Original Assignee
Beijing Qiansong Science & Technology Development Co Ltd
BEIJING TONGZHOU DISTRICT SCIENCE TECHNOLOGY ASSOCIATION
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qiansong Science & Technology Development Co Ltd, BEIJING TONGZHOU DISTRICT SCIENCE TECHNOLOGY ASSOCIATION
Priority to CN2010102888550A
Publication of CN102411562A
Legal status: Pending

Landscapes

  • Machine Translation (AREA)

Abstract

The invention relates to an affective feature (affective characteristic) generation algorithm based on semantic chunks and belongs to the field of Chinese text sentiment analysis. Its purpose is to provide a new affective feature generation algorithm. Affective features are represented by semantic chunks, where a semantic chunk is an independent semantic unit or grammatical unit. Chunks are found using a suffix tree (PAT tree) structure, and the best split is selected by strategy according to context statistics over the whole text collection.

Description

An affective feature generation algorithm based on semantic chunks
Technical field
The present invention is an affective feature generation algorithm based on semantic chunks and belongs to the field of Chinese text sentiment analysis.
Background
With the rapid development of network technologies, the Internet has gradually become an important source of information and a platform on which people express their own views. The rapidly growing volume of online comments produces massive amounts of data, so organizing the relevant data for a particular need and extracting useful information from it has become a major challenge for information science and technology. Text sentiment classification means classifying the sentiment orientation of a text by mining and analysing the subjective information it contains, such as viewpoints, opinions and attitudes. It can be widely applied to public opinion analysis, product quality evaluation, film and television reviews, and similar areas.
A text is a character string made up of characters and punctuation. Characters form words, words form phrases, and these in turn form sentences, paragraphs and whole documents. For text sentiment analysis, researchers therefore generally start by judging the sentiment orientation of words. Patent CN101609459A discloses a system for extracting affective feature words. The system uses the ratio of parameters such as tf (the number of times a word occurs in an article) and df (the number of different comments in the comment collection in which the word appears) to choose the higher-scoring words as broad-sense affective feature words. It then builds a narrow-sense affective feature vocabulary from the coordinate relations between words in a semantic relation graph. Because this technique depends on Chinese word segmentation, it inevitably inherits segmentation problems such as named-entity recognition and inconsistent segmentation standards, which directly affect the quality of the affective features.
The present invention proposes an affective feature generation algorithm based on semantic chunks. A semantic chunk is not necessarily a natural-language unit such as a character, word, phrase or sentence; it can be regarded either as a grammatical unit or as a semantic unit. Using semantic chunks in place of a traditional dictionary allows the affective features in a text to be captured more accurately.
Summary of the invention
The purpose of the invention is to provide a new affective feature generation algorithm. Affective features are represented by semantic chunks, and the chunks are obtained by selecting the best split according to a strategy, based on statistics of the text context.
The technical scheme of the invention is as follows:
Independent semantic units or grammatical units are found using a suffix tree (PAT tree) structure, and the best split is selected by strategy according to context statistics over the whole text collection. Taking two Chinese character strings S1 and S2 as an example, finding a semantic chunk amounts to finding the longest common substring of S1 and S2.
S1: "First time going to the cinema to see a film; the 3D effect is not obvious, but it wins on being funny."
S2: "Quite a humorous film; the funniest has to be those two foxes."
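The longest-common-substring operation named above can be sketched with a small dynamic-programming routine. This is only an illustration of the operation, not the suffix-tree method the patent goes on to describe. The function name `longest_common_substring` and the two sample strings are hypothetical: they are unspaced Latin-letter stand-ins for unsegmented text, not the patent's original sentences.

```python
def longest_common_substring(s1: str, s2: str) -> str:
    """Return the longest string that occurs contiguously in both s1 and s2."""
    best_len = 0
    best_end = 0  # end index (exclusive) of the best match inside s1
    # prev[j] holds the length of the common suffix of s1[:i-1] and s2[:j].
    prev = [0] * (len(s2) + 1)
    for i in range(1, len(s1) + 1):
        curr = [0] * (len(s2) + 1)
        for j in range(1, len(s2) + 1):
            if s1[i - 1] == s2[j - 1]:
                # Extend the match that ended one character earlier in both strings.
                curr[j] = prev[j - 1] + 1
                if curr[j] > best_len:
                    best_len = curr[j]
                    best_end = i
        prev = curr
    return s1[best_end - best_len:best_end]


# Hypothetical sample strings without word boundaries (assumption, not patent data).
s1 = "firstcinematrip3dweakbutveryfunny"
s2 = "quitehumorousfilmveryfunnyfoxes"
chunk = longest_common_substring(s1, s2)
print(chunk)  # veryfunny
```

The routine runs in time proportional to the product of the two string lengths, which is adequate for a pair of short comments but is presumably one reason a suffix structure is preferred for a whole corpus.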
If word segmentation is used, the result is:
S1: first/m time/qv go/vf cinema/n see/v film/n ,/wd 3D/x effect/n not/d obvious/a ,/wd win/v at/p make/v laugh/v ./wj
S2: quite/d humorous/a (particle)/ude1 film/n ,/wd most/d make/v laugh/v (particle)/ude1 will/v belong/v that/rzv two/m (measure word)/q fox/n ./wj
Clearly, an independent semantic unit has been split apart: the expression meaning "funny" (literally "make laugh") is cut into the two tokens make/v and laugh/v. Now consider processing the same two strings with a suffix tree. The algorithm is briefly described as follows:
S1 and S2 are concatenated into a single string and inserted into a suffix tree, and the deepest non-leaf node is located. Depth here means the number of characters traversed from the root, so the string spelled out on the path to the deepest non-leaf node is the longest repeated substring. The node must be a non-leaf node because the goal is to locate a part that is repeated in S1 and S2, so at least 2 leaves must lie beneath it. The principle is this: if T occurs twice in S, then S has two suffixes that begin with T, and the number of repetitions follows directly.
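The deepest-non-leaf-node idea can be sketched as follows. As a simplifying assumption, the sketch uses an uncompressed suffix trie built from nested dictionaries rather than a true suffix tree, so it needs memory quadratic in the text length and is suitable only for short inputs. The names `build_suffix_trie` and `longest_repeated_substring`, the `#` separator and the sample strings are all hypothetical. Note also that the search finds the longest substring repeated anywhere in the concatenation, which in general could lie inside one string only.

```python
def build_suffix_trie(text: str) -> dict:
    """Insert every suffix of text into a nested-dict trie.

    Each node is {"children": {char: node}, "count": n}, where count is the
    number of suffixes passing through the node, i.e. the number of times
    the substring spelled out from the root occurs in text.
    """
    root = {"children": {}, "count": 0}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node["children"].setdefault(ch, {"children": {}, "count": 0})
            node["count"] += 1
    return root


def longest_repeated_substring(text: str) -> tuple:
    """Return (substring, occurrences) for the deepest node shared by >= 2 suffixes."""
    root = build_suffix_trie(text)
    best, best_count = "", 0
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["children"].items():
            if child["count"] >= 2:  # at least two suffixes start with path + ch
                candidate = path + ch
                if len(candidate) > len(best):
                    best, best_count = candidate, child["count"]
                stack.append((child, candidate))
    return best, best_count


# Hypothetical stand-in strings; "#" is assumed not to occur in either of them,
# so no repeated substring can span the join.
s1 = "firstcinematrip3dweakbutveryfunny"
s2 = "quitehumorousfilmveryfunnyfoxes"
result = longest_repeated_substring(s1 + "#" + s2)
print(result)  # ('veryfunny', 2)
```

The count stored on each node is the repetition count mentioned in the text: it equals the number of suffixes that have the node's string as a prefix.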
In addition, a Patricia tree (PAT tree) is used as the storage structure to reduce the space complexity. A PAT tree is a special form of suffix tree that uses semi-infinite strings as the search structure for character strings. Put simply, it is a binary tree with compressed storage, and it performs very well on substring matching.
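The space saving from compressed storage can be illustrated with a character-level sketch. It is not a PAT tree: a PAT tree proper is a binary trie over the bit representation of semi-infinite strings, whereas the code below only shows the underlying path-compression idea, merging every chain of single-child nodes into one edge. The helper names `build_trie`, `compress` and `count_nodes` and the sample text are hypothetical.

```python
def build_trie(words) -> dict:
    """Build a plain trie as nested dicts, one node per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
    return root


def compress(node: dict) -> dict:
    """Merge each chain of single-child nodes into one edge labelled by a string."""
    compressed = {}
    for label, child in node.items():
        # Follow the chain while the child has exactly one outgoing edge.
        while len(child) == 1:
            (next_label, next_child), = child.items()
            label += next_label
            child = next_child
        compressed[label] = compress(child)
    return compressed


def count_nodes(node: dict) -> int:
    """Count the node itself plus all of its descendants."""
    return 1 + sum(count_nodes(child) for child in node.values())


text = "banana$"  # "$" is an assumed end marker that occurs nowhere else
suffixes = [text[i:] for i in range(len(text))]
plain = build_trie(suffixes)
packed = compress(plain)
print(count_nodes(plain), count_nodes(packed))  # 23 11
```

For this seven-character text the uncompressed trie has 23 nodes, while the compressed form has 11: the root, three internal nodes and one leaf per suffix.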
Using the semantic-chunk concept, S1 and S2 are cut into independent semantic units as follows:
S1: First time going to the cinema to see a film; the 3D effect is not obvious, but it wins on being funny.
S2: Quite a humorous film; the funniest has to be those two foxes.
The invention has the following advantages:
1. The invention proposes basing affective features on semantic chunks. This overcomes the shortcomings of traditional algorithms that rely on Chinese word segmentation and avoids the problems of inconsistent segmentation standards, segmentation ambiguity resolution and unknown-word recognition. The semantic chunks obtained by the algorithm are units with independent semantics or independent grammatical units.
2. The proposed algorithm is simple and easy to implement.
3. The affective features obtained by the proposed algorithm are superior to those obtained with traditional word segmentation tools.
Description of the drawings
Fig. 1 compares the number of features obtained when semantic chunks are used as affective features with the number obtained by Chinese word segmentation.
Fig. 2 compares the frequencies obtained when semantic chunks are used as affective features with those obtained by Chinese word segmentation.
Fig. 3 is a curve chart comparing semantic chunks used as affective features with Chinese word segmentation.
The invention is further described below with reference to the drawings and an embodiment.
We verify the effect of the algorithm with the following experiment. The data set consists of 11000 short film reviews of "Ice Age 3" collected from Douban. The corpus contains 206751 characters in total and includes only Chinese characters, punctuation, English words and digits; web page markup, spaces and similar information have been removed.
We compare two approaches experimentally: (1) initial feature selection based on semantic chunks; (2) initial feature selection based on Chinese word segmentation. For the segmentation baseline we use the ICTCLAS30 Chinese word segmenter developed by the Institute of Computing Technology, Chinese Academy of Sciences. On the same corpus (11000 short reviews), as shown in Fig. 1, the semantic-chunk method yields 11611 chunks in total, whereas the segmenter yields 13436 entries. After preliminary feature selection, 4106 semantic chunks remain, against 6134 entries for the segmenter.
As the comparison in Fig. 2 and Fig. 3 shows, the method used to select affective features directly affects the size of the initial feature set. The semantic-chunk method yields nearly one third fewer features than the method based on Chinese word segmentation.
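The "nearly one third" figure can be checked against the feature counts reported for the experiment. The short calculation below uses only those four numbers; the helper name `relative_reduction` is hypothetical.

```python
# Feature counts reported in the experiment.
chunk_initial, segmenter_initial = 11611, 13436    # before feature selection
chunk_selected, segmenter_selected = 4106, 6134    # after preliminary selection


def relative_reduction(smaller: int, larger: int) -> float:
    """Fraction by which `smaller` falls short of `larger`."""
    return 1 - smaller / larger


initial_cut = relative_reduction(chunk_initial, segmenter_initial)
selected_cut = relative_reduction(chunk_selected, segmenter_selected)
print(f"{initial_cut:.1%}")   # 13.6%
print(f"{selected_cut:.1%}")  # 33.1%
```

So the one-third reduction applies to the sets after preliminary feature selection; before selection the semantic-chunk set is about 13.6% smaller.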
Analysing the experiment, the reason is that semantic chunks are divided according to the context of the text, starting from independent grammatical or semantic units, whereas a segmenter divides the text in a fixed way according to its dictionary. For example, the semantic-chunk features include "funny" (literally "make laugh") 942 times and "laugh" 269 times, while the features obtained with the segmenter include "make" 1056 times and "laugh" 1883 times. Similarly, the semantic-chunk features include "worth watching" 40 times and "worth" 93 times, whereas the segmenter yields "worth" 133 times and "worth watching" 0 times, because the segmenter does not treat "worth watching" as a single word. Note that 40 + 93 = 133, which is consistent with the segmenter counting every occurrence of "worth watching" under "worth". These results indicate that using semantic chunks as affective features is a new approach that performs better than the traditional practice of using segmented words as affective features.
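The counting effect described above can be reproduced on a toy corpus. The sketch assumes a longest-match-first counting rule, which is an assumption and not something the patent states, and it models the segmenter simply as the same counter with a smaller vocabulary. The function name `count_features` and the four toy texts are hypothetical.

```python
from collections import Counter


def count_features(texts, vocabulary) -> Counter:
    """Count vocabulary items in unsegmented texts, trying the longest item first.

    An occurrence consumed by a longer item is not counted again for a
    shorter item contained in it.
    """
    ordered = sorted(vocabulary, key=len, reverse=True)
    counts = Counter()
    for text in texts:
        i = 0
        while i < len(text):
            for item in ordered:
                if text.startswith(item, i):
                    counts[item] += 1
                    i += len(item)
                    break
            else:
                i += 1  # no vocabulary item starts here; move on one character
    return counts


# Toy unsegmented texts (assumption, not data from the patent).
texts = ["worthwatching", "worthit", "worthwatchingtwice", "notworthit"]

chunk_counts = count_features(texts, ["worth", "worthwatching"])
word_counts = count_features(texts, ["worth"])  # vocabulary lacks the longer unit
print(chunk_counts["worthwatching"], chunk_counts["worth"])  # 2 2
print(word_counts["worthwatching"], word_counts["worth"])    # 0 4
```

As with the reported figures, the count for the short word under the smaller vocabulary equals the sum of the two chunk counts, and the longer unit is never seen as a feature.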

Claims (2)

1. An affective feature generation algorithm based on semantic chunks, characterized in that affective features are represented by semantic chunks, a semantic chunk being a unit with independent semantics or an independent grammatical unit.
2. The affective feature generation algorithm based on semantic chunks according to claim 1, characterized in that it further comprises: finding independent semantic units or grammatical units by means of a suffix tree (PAT tree) structure.
CN2010102888550A 2010-09-21 2010-09-21 Affective characteristic generation algorithm based on semantic chunk Pending CN102411562A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010102888550A CN102411562A (en) 2010-09-21 2010-09-21 Affective characteristic generation algorithm based on semantic chunk

Publications (1)

Publication Number Publication Date
CN102411562A true CN102411562A (en) 2012-04-11

Family

ID=45913640

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010102888550A Pending CN102411562A (en) 2010-09-21 2010-09-21 Affective characteristic generation algorithm based on semantic chunk

Country Status (1)

Country Link
CN (1) CN102411562A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294324A (en) * 2016-08-11 2017-01-04 上海交通大学 A kind of machine learning sentiment analysis device based on natural language parsing tree
CN106598935A (en) * 2015-10-16 2017-04-26 北京国双科技有限公司 Method and apparatus for determining emotional tendency of document
CN106919661A (en) * 2017-02-13 2017-07-04 腾讯科技(深圳)有限公司 A kind of affective style recognition methods and relevant apparatus
CN108874781A (en) * 2018-06-29 2018-11-23 北京千松科技发展有限公司 A kind of segmenting method and system for omnimedia popular science window

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7203680B2 (en) * 2003-10-01 2007-04-10 International Business Machines Corporation System and method for encoding and detecting extensible patterns
CN101576875A (en) * 2009-03-03 2009-11-11 杜小勇 System and method for adjective polarity judgment based on clustering

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
龚才春: "短文本语言计算的关键技术研究", 《中国博士学位论文全文数据库 信息科技辑》 *

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106598935A (en) * 2015-10-16 2017-04-26 北京国双科技有限公司 Method and apparatus for determining emotional tendency of document
CN106598935B (en) * 2015-10-16 2019-04-23 北京国双科技有限公司 A kind of method and device of determining document emotion tendency
CN106294324A (en) * 2016-08-11 2017-01-04 上海交通大学 A kind of machine learning sentiment analysis device based on natural language parsing tree
CN106294324B (en) * 2016-08-11 2019-04-05 上海交通大学 A kind of machine learning sentiment analysis device based on natural language parsing tree
CN106919661A (en) * 2017-02-13 2017-07-04 腾讯科技(深圳)有限公司 A kind of affective style recognition methods and relevant apparatus
CN106919661B (en) * 2017-02-13 2020-07-24 腾讯科技(深圳)有限公司 Emotion type identification method and related device
CN108874781A (en) * 2018-06-29 2018-11-23 北京千松科技发展有限公司 A kind of segmenting method and system for omnimedia popular science window

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120411