CN100444194C - Automatic extraction device, method and program of essay title and correlation information - Google Patents

Automatic extraction device, method and program of essay title and correlation information

Info

Publication number
CN100444194C
Authority
CN
China
Prior art keywords
title
article
candidate sentence
information
string
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CNB200510116866XA
Other languages
Chinese (zh)
Other versions
CN1955979A (en
Inventor
张正操
孙茂松
刘绍明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujifilm Business Innovation Corp
Original Assignee
Fuji Xerox Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fuji Xerox Co Ltd
Priority to CNB200510116866XA
Publication of CN1955979A
Application granted
Publication of CN100444194C
Expired - Fee Related
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

An automatic extraction device for an article title and related information comprises: a title candidate sentence extraction unit that extracts a plurality of title candidate sentences from a text article supplied by an article input unit; a feature quantity extraction unit that extracts a feature quantity from each of the title candidate sentences; and a title determination unit that determines the title from among the title candidate sentences according to the extracted feature quantities.

Description

Device and method for automatically extracting an article title and related information
Technical field
The present invention relates to an article title extraction device that automatically extracts the article title from an article read by a scanner or the like.
Background art
Devices that read a paper original with an optical scanner or the like and extract the article title from the resulting digitized image data are gradually coming into practical use. For example, Patent Document 1 relates to a title extraction device that extracts the title from an article image obtained by converting an article into image data. In this device, rectangular regions circumscribing connected black pixels in the article image are extracted as character rectangles; adjacent character rectangles are merged, and the rectangular region circumscribing them is extracted as a character string rectangle. Then, based on attributes of each character string rectangle, such as an underline attribute, a framed attribute, and a table attribute, together with the position of the character string rectangle in the article image and its positional relationship to the other rectangles, a score indicating how likely each character string rectangle is to be the title is calculated, and the character string rectangle with the highest score is extracted as the title.
Patent Document 2 relates to a title extraction device that performs character code recognition on character string rectangles cut out of an article image and extracts the title using the confidence of the character code recognition, the degree of similarity to a natural-language title as analyzed by a natural language analysis unit, statistical information on word endings, centering, underlining, specific fonts, the size of the character rectangles, and the like.
Non-Patent Document 1 discloses a technique that uses regular expressions to extract the address, city name, URL, and date from a technical paper, and then extracts the part of the beginning of the paper that was not extracted in this way as the author and title.
Non-Patent Document 2 discloses a technique that takes the beginning part of an article as its target, uses linguistic features (number of words, line position, ratio of words to non-words, ratio of words whose initial letter is uppercase or lowercase, ratio of digits) and the like as feature quantities, and determines the title with an SVM (Support Vector Machine).
Patent Document 1: Japanese Unexamined Patent Application Publication No. Hei 9-134406
Patent Document 2: Japanese Unexamined Patent Application Publication No. 2000-148788
Non-Patent Document 1: E. Berkowitz, M. Elkhadiri, T. Sahouri and M. Abraham. 2004. Intelligent Content Based Title and Author Name Extraction from Formatted Documents. Proceedings Fifteenth Midwest Artificial Intelligence and Cognitive Science Conference. Pages 119-124.
Non-Patent Document 2: Hui Han, C. Giles, E. Manavoglu, Hongyuan Zha, Zhenyue Zhang and E. Fox. 2003. Automatic Document Metadata Extraction using Support Vector Machines. ACM/IEEE Joint Conference on Digital Libraries. Pages 36-48.
However, the title extraction device of Patent Document 1 targets unformatted articles and extracts the title using layout features of the line regions, so its extraction rate is insufficient. Patent Document 2 evaluates title candidates using several title attributes, but an article containing many short character string rectangles has many short rectangles that carry title attributes, so misjudgment occurs easily.
In addition, the techniques disclosed in Non-Patent Document 1 and Non-Patent Document 2 depend on the structure of the article. They are therefore difficult to apply to articles other than technical papers, and they cannot extract the title correctly when the beginning of the article contains little information.
Summary of the invention
The present invention has been made to solve the above problems. Its object is to provide a title extraction device, extraction method, and extraction program that do not necessarily depend on the layout and context of the article but instead make full and effective use of linguistic knowledge. Information such as the length of the title candidate sentence, the rank of the similarity between the candidate sentence and other sentences, the author, the institution name, title keyword strings, the distance between the candidate sentence and the author, title-forbidden keyword strings, the postal code, and punctuation marks is used as the feature quantity of the title candidate sentence, and a classifier (for example, an SVM) determines from this feature quantity whether the candidate is the title. The attributes unique to titles are thereby exploited to the fullest, and the article title and its related information are extracted with high accuracy by a flexible determination method.
The article title extraction device of the present invention comprises: a title candidate sentence extraction unit that extracts a plurality of title candidate sentences from an article; a feature quantity extraction unit that extracts a feature quantity for each of the extracted title candidate sentences; a determination unit that determines the article title from among the title candidate sentences according to the extracted feature quantities; and an output unit that outputs the determination result. The feature quantity includes at least similarity information, which is a function value of the similarity between the title candidate sentence and a plurality of sentences in the article.
Preferably, the similarity information includes rank information indicating the magnitude of the similarity between the title candidate sentence and the plurality of sentences in the article. The similarity information is calculated from vector information of substrings selected from the title candidate sentence and vector information of substrings selected from the sentences in the article. The vector information is calculated from the frequency of occurrence of N-grams (N is a natural number greater than or equal to 2) selected from the title candidate sentence and the frequency of occurrence of N-grams selected from the sentences in the article. Using this similarity information makes it possible to exploit linguistic information effectively without word segmentation analysis and to extract and determine the article title with high accuracy.
Moreover, when the vector information is calculated from the N-gram frequencies and a predetermined forbidden N-gram is included, the vector information is corrected according to that N-gram. Removing character strings that cannot become a title, or are unlikely to become one, improves the accuracy of determining and extracting the article title.
The similarity information may also be calculated from the edit distance between the title candidate sentence and the sentences in the article, or from the length of the longest common character string shared by the title candidate sentence and a sentence in the article.
Furthermore, when the title candidate sentence contains a predetermined title keyword string, the feature quantity includes title keyword string information indicating the position and frequency of occurrence of that keyword string. When the title candidate sentence contains a predetermined title-forbidden keyword string, the feature quantity includes title-forbidden keyword string information indicating the position and frequency of occurrence of that forbidden keyword string. The feature quantity of a title candidate sentence thus covers a variety of features, which improves the accuracy of title determination.
The determination unit extracts the best title candidate sentence according to the feature quantities of the title candidate sentences. Preferably, the feature quantities are classified and judged with an SVM (Support Vector Machine). The output unit includes, for example, a display device such as a display, and outputs the determined title and its related information. The related information is, for example, the author and the institution name.
The article title extraction device may further comprise an input unit for inputting an image article and a text data extraction unit that extracts text data from the input image article, and the title candidate sentence extraction unit may extract title candidate sentences from the extracted text data. The input unit for inputting the image article includes an optically reading scanner, and the text data is extracted from the image article data read by the scanner using OCR (optical character recognition) or the like. Preferably, the title candidate sentence extraction unit extracts title candidate sentences from within a certain candidate target range starting at the beginning of the text data. This is because the sentence that becomes the title is in most cases contained in the beginning part.
The feature quantity may also include layout information obtained from the input image article. Using this information improves the accuracy of article title determination.
The article title extraction method of the present invention comprises the steps of: extracting a plurality of title candidate sentences from an article; extracting a feature quantity for every title candidate sentence, the feature quantity including similarity information between the title candidate sentence and a plurality of sentences in the article; determining the article title from among the title candidate sentences according to the extracted feature quantities; and outputting the determination result. The article title extraction program of the present invention comprises the same four steps.
According to the article title extraction device of the present invention, a feature quantity is extracted for each title candidate sentence, and this feature quantity includes similarity information that is a function value of the similarity between the title candidate sentence and a plurality of sentences in the article. The device therefore does not necessarily depend on the layout, image information, or context of the article, and can extract the article title and related information with high accuracy through a flexible determination method that makes full and effective use of linguistic knowledge. An extraction method based on an SVM reduces the influence of incomplete decision rules and of OCR misrecognition, so it is well suited to automatically extracting the title and related information from scanned text data (text data that has undergone OCR). Because the SVM is trained, the extraction performance of the system (the range of articles it can handle and the extraction accuracy) can be improved through learning.
Description of drawings
Fig. 1 is a hardware structure diagram of the article title extraction device that implements an embodiment of the present invention.
Fig. 2 is a functional block diagram of the article title extraction device of the present embodiment.
Fig. 3 is an operation flowchart of the title candidate sentence extraction part.
Fig. 4 shows an example of title candidate sentences extracted from a Japanese article.
Fig. 5 is an explanatory diagram of the feature quantities extracted by the candidate sentence feature quantity extraction part.
Fig. 6 is an explanatory diagram of how the feature quantities are obtained.
Fig. 7 is an example of a Japanese surname dictionary.
Fig. 8 is a flowchart of the calculation of the 2-gram vector feature quantity.
Fig. 9 is an explanatory diagram of the method of calculating the corrected 2-gram frequency #'(x).
Figure 10 is a functional block diagram of the article title extraction device of the 2nd embodiment of the present invention.
Figure 11 shows dictionaries of Chinese surnames and given names.
Figure 12 shows a dictionary of Chinese institution names.
Figure 13 shows a Chinese 2-gram title keyword string dictionary and a Chinese 2-gram title-forbidden keyword string dictionary.
Figure 14(a) shows the title candidate sentences of a sample Chinese article, and Figure 14(b) shows the feature quantities of those title candidate sentences.
Figure 15 shows the result of classifying the feature quantities of the title candidate sentences shown in Figure 14 with the SVM.
Description of reference numerals
10: article title extraction device; 12: input device; 14: display device; 16: main storage device; 18: storage device; 20: CPU; 30: article input part; 32: title candidate sentence extraction part; 34: candidate sentence title-determination feature quantity extraction part; 36: title determination part; 38: extraction result output part; 60: image article input part; 62: layout and image information extraction part; 64: text information extraction part.
Embodiment
Preferred embodiments of the present invention are described below with reference to the accompanying drawings.
(embodiment)
Fig. 1 is a structural diagram of the article title extraction device of an embodiment of the present invention. The title extraction device 10 comprises an input device 12, a display device 14, a main storage device 16, a storage device 18, a central processing unit (CPU) 20, and a bus 22 that connects these devices.
The input device 12 includes a keyboard for entering information by key operation, an optical reading device (scanner) that optically reads an article recorded on an original, an input interface that receives data from an external device or external memory, and the like. The display device 14 includes a display or the like for showing the title extracted from the article and its related information. The main storage device 16 includes ROM or RAM and stores programs and the data being processed; the stored programs extract title candidate sentences from the article, extract the feature quantities of the title candidate sentences, and determine the article title. The storage device 18 includes, for example, a mass storage device such as a hard disk, and stores the image article data read optically by the scanner, the various dictionary databases used during feature quantity extraction, and the like. The CPU (Central Processing Unit) 20 controls each part according to the programs stored in the main storage device 16.
Fig. 2 is a functional block diagram of the article title extraction device. The article input part 30 inputs the text of the article. The text may be, for example, text data received through the input interface, or text data extracted by OCR (optical character recognition) from image article data read optically by the scanner. It may of course also be text data obtained by some other method.
The title candidate sentence extraction part 32 extracts, from the input text, the parts that may become the title. It takes a specified range starting at the beginning of the input text article as the candidate target range and, within the text contained in that range, treats each part delimited by specific marks and line-feed marks as a title candidate sentence.
Fig. 3 shows the operation flow of the title candidate sentence extraction part 32. The title candidate sentence extraction part 32 sets the first α% of the input article, counted from its beginning, as the candidate target range (step S101). α is an integer, for example 50. Next, within the text contained in the candidate target range, it takes each part delimited by the marks (; . ? = ~ @ # $ % ^ & * _ | $n; ...) and by line-feed marks as a title candidate sentence (step S102). Finally, the resulting set of title candidate sentences is stored in the storage device (step S103).
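Steps S101 and S102 can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function name `extract_title_candidates`, the constant `DELIMITERS`, and the exact delimiter set are assumptions, since the mark list in the text is only partly legible.

```python
import re

# Assumed delimiter set, loosely based on the marks listed for step S102.
DELIMITERS = ";.?=~@#$%^&*_|"


def extract_title_candidates(text, alpha=50):
    """Return the title candidate sentences found in the first alpha% of text."""
    # Step S101: the candidate target range is the first alpha% of the article.
    cutoff = len(text) * alpha // 100
    scope = text[:cutoff]
    # Step S102: cut the range at every delimiter mark and every line feed.
    pattern = "[" + re.escape(DELIMITERS) + "\r\n]+"
    parts = re.split(pattern, scope)
    # Drop empty fragments and surrounding whitespace.
    return [p.strip() for p in parts if p.strip()]
```

Step S103 (storing the set) is omitted; the returned list stands in for the stored set of candidates.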
Fig. 4 shows an example in which a Japanese article is the input article. Fig. 4(a) is the input article, that is, text read by a scanner or the like. Fig. 4(b) is an example in which the first 50% of the input article is extracted as the candidate target range. Fig. 4(c) shows the set of title candidate sentences obtained by cutting the candidate target range at the marks and line-feed marks.
Returning to Fig. 2, the extracted title candidate sentences are supplied to the candidate sentence title-determination feature quantity extraction part 34, which extracts from every title candidate sentence the feature quantity used to judge it. As shown in Fig. 5, the feature quantity consists of 9 elements: the length 40 of the candidate sentence, the similarity rank information 41, the author information 42, the institution name information 43, the title keyword string information 44, the author position information 45, the title-forbidden keyword string information 46, the postal code 47, and the number of punctuation marks 48.
Fig. 6 is an explanatory diagram of how each element of the feature quantity is calculated. The "length of the candidate sentence" 40 is the length of the title candidate sentence in bytes. It is expressed, for example, as the value (length of the candidate sentence in bytes) / 150, where 150 is a constant.
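The length element can be illustrated with a short Python sketch. The text does not say which character encoding the byte count refers to, so the `encoding` parameter and the function name `length_feature` are assumptions made for illustration only.

```python
LENGTH_CONSTANT = 150  # the constant divisor given in the text


def length_feature(candidate, encoding="utf-8"):
    """Length of the candidate in bytes divided by 150.

    The encoding is an assumption; a double-byte encoding such as GBK
    gives a different byte count from UTF-8 for the same sentence.
    """
    return len(candidate.encode(encoding)) / LENGTH_CONSTANT
```

For a four-character Chinese candidate, the value is 8/150 under GBK and 12/150 under UTF-8, so the encoding should be fixed consistently between training and use.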
" ranking information of similarity " 41 at first is the similarity between other the sentence that calculates in title candidate sentence and the article, is similarity the highest similarity similarity as this title candidate sentence.For all title candidates, by the ascending order of similarity it is sorted, rank 1 to M (M is the quantity of title candidate sentence) is given to all title candidates.Rank with ranking information=1/ similarity of similarity is represented.
The similarity can be obtained by any of the following methods.
Method 1: obtain the similarity (or distance) between sentences using VSM (vector space model) vector feature quantities of the title candidate sentence. The VSM vector feature quantity can be the TF (term frequency) of words, TF/IDF, or another function of TF and IDF (inverse document frequency). Alternatively, the character string can be cut into N-grams, and the TF of the N-grams, TF/IDF, or another function of TF and IDF can be used. Any published method of calculating the similarity or distance between vectors may be used.
Method 2: obtain the distance between sentences using the edit distance between the character strings.
Method 3: obtain the similarity between sentences using the length of the longest common character string of the two character strings.
Method 4: any other published method.
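Methods 2 and 3 can be illustrated with standard dynamic-programming routines. The sketch below assumes the Levenshtein definition of edit distance and reads "longest common character string" as the longest common contiguous substring; the text fixes neither choice, and both function names are hypothetical.

```python
def edit_distance(a, b):
    """Levenshtein distance: insertions, deletions, substitutions cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def longest_common_substring_length(a, b):
    """Length of the longest contiguous string shared by a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0] * (len(b) + 1)
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best
```

A smaller edit distance, or a longer common substring, indicates a more similar pair of sentences; either value would need to be normalized by sentence length before being compared across candidates.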
In the present embodiment, as described later, the character string is cut into 2-grams and the similarity between the 2-gram vector feature quantities is calculated.
For the "author information" 42, the flag is set to "1" when the title candidate sentence contains an author and to "0" otherwise. For example, a published proper-name extraction technique or personal-name extraction technique can be used. Fig. 7 shows a surname dictionary that lists, in order, the Japanese surnames published in telephone directories and the like together with their counts. The title candidate sentence may be compared with the surname dictionary of Fig. 7, and the "author information" 42 flag set to "1" on a hit and to "0" on a miss. In the same way, a Japanese given-name dictionary may be prepared and compared with the title candidate sentence, with the flag set to "1" on a hit and "0" on a miss. Alternatively, the flag may be set to "1" only when both the surname and the given name hit. A hit is not limited to an exact match of the character string; partial matches such as a prefix match or a suffix match may also count.
About " mechanism's name information " 43, in the title candidate sentence, include under the situation of mechanism's name information, the mark of " mechanism's name information " 43 is set as " 1 ", is set as under other situation " 0 ".For example, mechanism's name dictionary and the title candidate sentence of having registered mechanism's name in advance compared, under the situation that mechanism's name has been hit, mark is made as " 1 ", under miss situation, be made as " 0 ".About whether hitting, be not only the on all four situation of character string, also can be the consistent situations of part such as the place ahead unanimity, rear unanimity.
" key word in title string information " the 44th is illustrated in the information that whether includes predefined key word in title string in the title candidate sentence, is the information of frequency of having concentrated the appearance of the position of key word in title string and key word in title string.The key word in title string is for example registered in advance as key word in title string dictionary." author's positional information " 45 given title candidate sentence numbering by the front and back order that the title candidate sentence occurs since 1 in article.Suppose, in i title candidate sentence, occur the author for the first time.Then from numbering 1 " author's positional information "=1, " author's positional information "=" 0 " of other candidate sentence to the title candidate sentence of numbering i+3.
" title forbidding keyword string information " the 46th is illustrated in whether include the information that the title of predesignating bans use of keyword string in the title candidate sentence, is the information of having concentrated the frequency of the position of title forbidding keyword string and the appearance that title is forbidden keyword string.About title forbidding keyword string, character string or the little character string of not using in title of use possibility registered in the dictionary in advance, whether correspondingly check with it.
" postcode " 47 6 continuous bit digital as postcode.In the title candidate sentence, comprise under the situation of postcode, mark is made as " 1 ", be made as under other situation " 0 "." quantity of punctuation mark " 48 be included in ", ", ". " in the title candidate sentence, "; " quantity.
Fig. 5 shows an example in which the feature quantity consists of 9 elements, but the feature quantity is not limited to this. For title extraction it is sufficient to include at least the 2nd element, the "similarity rank information", and this element may be combined with other information as appropriate. For example, the "similarity rank information" and the 5th element, the "title keyword string information", may be used as the feature quantity, or the "similarity rank information" and the 7th element, the "title-forbidden keyword string information". Other linguistic information may of course be added, for example certificate address information. Furthermore, if an image article is read with a scanner, layout information (such as the positional relationship of the candidate sentences) and image information (such as the size, color, and type of the characters) can be obtained, and these may also be added to the feature quantity.
Returning to Fig. 2, the feature quantities of all the title candidate sentences extracted by the candidate sentence title-determination feature quantity extraction part 34 are supplied to the title determination part 36. The title determination part 36 consists of a classifier built by learning. Any published classification method may be used for the classifier. As a concrete example, an SVM (Support Vector Machine) classification method can be used. See, for example, the SVM engine described in the paper "Support Vector Machine によるテキスト分類" ("Text Classification Using Support Vector Machines"), 1998, Natural Language Processing, 128-24.
When the title determination part 36 has extracted a title, the extraction result is supplied to the extraction result output part 38. The extraction result output part 38 shows the extracted title on the display device 14. Related information such as the author may be shown at the same time.
Next, the method of calculating the similarity used in the feature quantity is described. First, all strings of 2 consecutive characters (2-grams) are extracted from the title candidate sentence, moving from left to right. For example, if the title candidate sentence is "知识产权" (intellectual property), it is cut into the 2-grams "知识", "识产", and "产权". The 2-gram vector feature quantity of the title candidate sentence is written A = (β1, β2, ..., βN), and the 2-gram vector feature quantity of another sentence in the article is written B = (β'1, β'2, ..., β'N). The similarity sim(A, B) between the title candidate sentence and every other sentence in the article is calculated with the following formula.
[formula 1]
$$\mathrm{sim}(A, B) = \frac{\sum_{i \in N} \beta_i \cdot \beta'_i}{\sqrt{\sum_{i \in N} \beta_i^{2}} \, \sqrt{\sum_{i \in N} {\beta'_i}^{2}}}$$
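The 2-gram extraction and the cosine similarity between two 2-gram vectors can be sketched as follows. This is an illustration under a simplifying assumption: the vector components here are raw 2-gram counts, whereas the embodiment uses corrected frequencies derived as in Fig. 9. All function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def bigrams(sentence):
    """All strings of 2 consecutive characters, left to right."""
    return [sentence[i:i + 2] for i in range(len(sentence) - 1)]


def bigram_vector(sentence):
    """Sparse vector mapping each 2-gram to its raw count (a simplification)."""
    return Counter(bigrams(sentence))


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse 2-gram vectors."""
    dot = sum(a[key] * b[key] for key in a.keys() & b.keys())
    norm_a = sqrt(sum(value * value for value in a.values()))
    norm_b = sqrt(sum(value * value for value in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # a sentence shorter than 2 characters has no 2-grams
    return dot / (norm_a * norm_b)
```

Because the vectors are built from character 2-grams rather than words, no word segmentation is needed, and a single misrecognized character changes at most two components of the vector.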
Fig. 8 shows the flow for calculating the 2-gram vector feature quantity. All strings of 2 consecutive characters (2-grams) are extracted from the title candidate sentence, moving from left to right (step S201). Next, the frequency of occurrence #(x) of every 2-gram is obtained (step S202). Then the forbidden 2-gram dictionary 50, in which forbidden 2-grams have been registered in advance, is consulted; if the title candidate sentence contains a forbidden 2-gram, the dimensions of the vector feature quantity are corrected (step S203). Finally, the vector feature quantities A and B are generated from the corrected 2-gram frequencies #'(x) (step S204).
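Steps S201 to S203 can be sketched as follows. The text does not spell out how the vector is corrected, so this sketch simply drops the dimensions of forbidden 2-grams, which is one plausible reading of the summary's remark about removing strings that cannot become a title. The function name `corrected_bigram_vector` and the sample forbidden entry are hypothetical, and the mutual-information weighting of Fig. 9 is not reproduced.

```python
from collections import Counter


def corrected_bigram_vector(sentence, forbidden):
    """Count 2-grams, then drop every dimension listed as forbidden.

    Dropping the dimension is an assumed interpretation of step S203;
    the actual correction in the embodiment may differ.
    """
    # Steps S201 and S202: extract the 2-grams and count their occurrences.
    counts = Counter(sentence[i:i + 2] for i in range(len(sentence) - 1))
    # Step S203 (assumed form): remove the forbidden 2-grams.
    for gram in forbidden:
        counts.pop(gram, None)
    return counts
```

With this reading, a candidate made up mostly of forbidden 2-grams ends up with a nearly empty vector and therefore a low similarity to the other sentences.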
Fig. 9 shows the method of calculating the corrected 2-gram frequency #'(x), using the following quantities:
MI(x, y): the mutual information of 2-grams x and y;
#(x): the number of times 2-gram x occurs in the article;
N: the total number of occurrences of all 2-grams;
#(x, y): the number of times x and y co-occur in the article.
In this way, the article title extraction device of the present embodiment extracts title-determination feature quantities from the title candidate sentences and extracts and determines the article title according to them. By fusing linguistic information with a statistically based discrimination criterion, it can extract the article title and related information with high accuracy. Because the title and related information are extracted entirely from the content of the article, the device does not depend on the layout, image information, or context of the article and is highly versatile. Because it does not require keyword string information, the abstract of a paper, or information specific to a professional field, the range of articles it can handle does not depend on the field. It also has the following characteristic: without word segmentation analysis, it extracts 2-character substrings from the title candidate sentence, takes the mutual information between all the 2-character strings present in the sentence as the vector of that sentence, and takes the cosine value between vectors as the similarity between sentences. As a result, a small number of OCR misrecognitions hardly affects the title determination, which makes the device suitable for extracting titles from scanned image articles. Information such as the length of the title candidate sentence, the similarity rank, the author, the institution name, title keyword strings, the distance between the candidate sentence and the author, title-forbidden keyword strings, the postal code, and punctuation marks is used as the feature quantity, and a classifier (for example, an SVM) judges whether the candidate is the title, so the title can be extracted with high accuracy.
Figure 10 is a block diagram of the 2nd embodiment of the article title extraction device of the present invention. The 2nd embodiment is a modification of the article input part 30 shown in Fig. 2. The image article input part 60 inputs an image article, for example with a scanner, and outputs the input image article data to the layout and image information extraction part 62. The layout and image information extraction part 62 extracts layout information and image information from the image article data. The layout information includes, for example, the positional relationship of the title candidate sentences, and the image information includes the size, color, and font of the characters. Known techniques can be used to extract the layout information and image information, for example the techniques disclosed in Japanese Unexamined Patent Application Publication No. Hei 9-134406 and No. 2000-148788.
Next, the text information extraction part 64 extracts text information from the image information, for example by OCR. A known technique or a commercially available OCR product can be used. The extracted text information is supplied to the title candidate sentence extraction part 32. In the 2nd embodiment, when the candidate sentence title-determination feature quantity extraction part 34 extracts the feature quantities of the title candidate sentences, it can include the layout information and image information obtained by the layout and image information extraction part 62.
In the 2nd embodiment, an image article can be read in with a scanner or the like, and the article title can be extracted automatically from it. Adding the layout information contained in the image article data to the feature quantities of the title candidate sentences further improves the accuracy of article title determination.
Next, an example in which the article title extraction device of the present invention is applied to Chinese articles is described. For a Chinese article as well, as shown in Figure 2, the text article is input through the article input section 30, and the title candidate sentence extraction section 32 extracts title candidate sentences from the article. As described below, the candidate sentence title-determination characteristic quantity extraction section 34 optimizes the characteristic quantities for Chinese, such as the author name and the institution name.
Figure 11 shows a Chinese surname character dictionary and a given-name character dictionary. The method described here is limited to Chinese personal names. The author extraction method consists of surname recognition and given-name recognition for Chinese personal names, and can use the following criteria.
Because Chinese personal names rarely exceed 4 characters, a title candidate character string longer than 4 characters is judged not to be a personal name.
Because 2-character surnames are very rare among Chinese personal names, it is checked whether the first 2 characters of the title candidate character string form a 2-character surname. If they do, the candidate character string is judged to have a surname.
The name decision value is calculated. First, a list of the characters that occur in Chinese surnames, together with their frequencies of occurrence (called the surname character dictionary), and a list of the characters that occur in given names (called the given-name character dictionary) are prepared. Both dictionaries are sorted in descending order of character frequency. The surname character dictionary is then divided into three groups, A, B, and C.
Group A: scanning the surname character dictionary from the beginning, the set of characters scanned until the cumulative frequency reaches 95% of the total is defined as group A.
Group B: scanning the surname character dictionary from the beginning, the set of characters scanned until the cumulative frequency reaches 99% of the total is defined as group B.
Group C: the set of all characters in the surname character dictionary is defined as group C. In other words, characters that fall in the remaining 1%, outside groups A and B, belong to group C.
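The grouping by cumulative frequency can be sketched as below. This is an illustration under stated assumptions: the function name `split_by_cumulative_frequency` is hypothetical, the groups are treated as disjoint tiers (a character is assigned to the first group that covers it), and the character whose frequency crosses a boundary is placed in the lower-numbered group.

```python
def split_by_cumulative_frequency(char_freq: dict[str, int]) -> dict[str, str]:
    """Map each dictionary character to tier 'A', 'B', or 'C'.

    Characters are scanned in descending order of frequency. A character is in
    tier A while the share already covered is below 95%, in tier B while it is
    below 99%, and in tier C otherwise.
    """
    total = sum(char_freq.values())
    ordered = sorted(char_freq.items(), key=lambda item: item[1], reverse=True)
    tiers: dict[str, str] = {}
    covered = 0  # cumulative frequency of the characters scanned so far
    for char, freq in ordered:
        # Integer comparison avoids floating-point rounding at the boundaries.
        if covered * 100 < 95 * total:
            tiers[char] = "A"
        elif covered * 100 < 99 * total:
            tiers[char] = "B"
        else:
            tiers[char] = "C"
        covered += freq
    return tiers
```

The same function could be applied to a given-name dictionary by relabelling the three tiers.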
Likewise, the given-name character dictionary is divided into three groups, D, E, and F. The decision values for the surname and the given name are denoted M and N, respectively.
If the beginning of the candidate character string is a surname contained in set A, then M = SA;
If the beginning of the candidate character string is a surname contained in set B, then M = SB;
If the beginning of the candidate character string is a surname contained in set C, then M = SC;
If the beginning of the candidate character string is not a surname contained in set C, then M = 0;
If the end of the candidate character string is a given-name character contained in set D, then N = SD;
If the end of the candidate character string is a given-name character contained in set E, then N = SE;
If the end of the candidate character string is a given-name character contained in set F, then N = SF;
If the end of the candidate character string is not a given-name character contained in set F, then N = 0;
If M + N > threshold, the candidate character string is judged to be a personal name. Here SA, SB, SC, SD, SE, and SF are constants satisfying SA > SB > SC and SD > SE > SF.
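The name decision rule can be sketched as follows. Everything concrete here is an assumption for illustration: the dictionary contents, the constant values, the threshold, the treatment of a 2-character surname as a single dictionary entry, and the requirement of at least 2 characters are not taken from the patent.

```python
# Hypothetical tier tables; a real system would build them from frequency dictionaries.
SURNAME_TIER = {"王": "A", "李": "A", "欧": "B", "阳": "C", "欧阳": "B"}
GIVEN_TIER = {"伟": "D", "明": "D", "操": "E", "松": "F"}

# Hypothetical constants satisfying SA > SB > SC and SD > SE > SF.
SURNAME_SCORE = {"A": 3.0, "B": 2.0, "C": 1.0}
GIVEN_SCORE = {"D": 3.0, "E": 2.0, "F": 1.0}
THRESHOLD = 3.5  # hypothetical


def name_decision_value(text: str) -> float:
    """Return M + N for a candidate character string."""
    # Prefer a 2-character surname when the first two characters form one.
    surname = text[:2] if text[:2] in SURNAME_TIER else text[:1]
    m = SURNAME_SCORE.get(SURNAME_TIER.get(surname, ""), 0.0)
    # N is taken from the last character of the candidate.
    n = GIVEN_SCORE.get(GIVEN_TIER.get(text[-1:], ""), 0.0)
    return m + n


def is_chinese_name(text: str) -> bool:
    """Apply the length criterion, then compare M + N with the threshold."""
    if not 2 <= len(text) <= 4:  # the lower bound of 2 is an assumption
        return False
    return name_decision_value(text) > THRESHOLD
```

With these illustrative values, a common surname combined with a rare given-name character still passes, while two rare characters do not.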
Next, the method of extracting institution names is described. Here, a publicly disclosed name extraction technique can be used, or the institution name extraction can use the following criteria.
Length check. If the length of the input character string is less than or equal to 4 characters, the title candidate is regarded as not being an institution name, and processing ends.
Check whether the character string of the title candidate sentence contains an institution name. Figure 12 shows an example of a Chinese institution name dictionary. If the title candidate contains an institution name from the dictionary, A is added to the decision value.
Determine whether the character string of the title candidate sentence is the full name of an institution. If it is a full name, B is added to the decision value. This check is also performed by comparison with the institution name dictionary.
Check whether the character string of the title candidate sentence contains an institution name keyword string. If it does, C is added to the decision value. This check is performed by testing whether an institution name keyword string (for example, "university") appears at a specific position (for example, the end of the sentence) in the character string.
Based on the above, if the decision value > threshold, the character string of the title candidate sentence is judged to be an institution name. Here A, B, and C are constants.
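The institution name criteria can be sketched as an additive score. The dictionary entries, keyword strings, constants A, B, C, and the threshold below are hypothetical placeholders chosen for illustration, and checking the keyword only at the end of the string is one possible choice of "specific position".

```python
# Hypothetical dictionary and keyword strings (not the contents of Figure 12).
INSTITUTION_DICT = {"清华大学", "中国科学院软件研究所"}
INSTITUTION_KEYWORDS = ("大学", "研究所", "公司")

# Hypothetical constants and threshold.
SCORE_A, SCORE_B, SCORE_C = 1.0, 2.0, 1.5
THRESHOLD = 0.5


def institution_decision_value(text: str) -> float:
    """Sum the contributions of the three dictionary and keyword checks."""
    score = 0.0
    if any(name in text for name in INSTITUTION_DICT):
        score += SCORE_A  # contains a dictionary institution name
    if text in INSTITUTION_DICT:
        score += SCORE_B  # is exactly a full institution name
    if text.endswith(INSTITUTION_KEYWORDS):
        score += SCORE_C  # ends with an institution keyword string
    return score


def is_institution_name(text: str) -> bool:
    """Apply the length check first, then compare the score with the threshold."""
    if len(text) <= 4:
        return False
    return institution_decision_value(text) > THRESHOLD
```

Note that, under the stated length check, a 4-character string is rejected even if it appears in the dictionary.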
Figure 13 shows a Chinese 2-gram title keyword string dictionary and a Chinese 2-gram title prohibited keyword string dictionary. When the title keyword string information 44 and the title prohibited keyword string information 46 (see Figure 5) are extracted as characteristic quantities of a title candidate sentence, these two dictionaries are consulted. For example, "this disease", the first entry among the title prohibited keyword strings, is judged to be a character string that cannot be used in an article title.
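A dictionary lookup of this kind can be sketched as below. The helper `bigram_hits` and the example dictionary entries are hypothetical and are not the contents of Figure 13; the sketch returns the position of the first matching 2-character string and the number of matches, following the position and frequency wording of claims 8 and 9.

```python
def bigram_hits(sentence: str, dictionary: set[str]) -> tuple[int, int]:
    """Return (position of the first dictionary 2-gram or -1, number of hits)."""
    positions = [
        i for i in range(len(sentence) - 1)
        if sentence[i:i + 2] in dictionary
    ]
    first = positions[0] if positions else -1
    return first, len(positions)


# Hypothetical dictionaries for illustration only.
TITLE_KEYWORDS = {"初探", "研究", "分析"}
PROHIBITED_KEYWORDS = {"本文", "我们"}
```

The same function serves both dictionaries: a hit in `TITLE_KEYWORDS` supports a candidate being a title, while a hit in `PROHIBITED_KEYWORDS` counts against it.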
Next, an example of extracting title candidate sentences from a Chinese sample article is described. Figure 14(a) shows the title candidate sentences extracted from the Chinese sample article. Figure 14(b) shows the characteristic quantities for these title candidate sentences. In the figure, the number before each ":" is the dimension number of the characteristic quantity (the 9 elements of the characteristic quantity shown in Figure 5: 1 is the length of the candidate sentence, 2 is the similarity rank information, 3 is the author information, 4 is the institution name information, 5 is the title keyword string information, 6 is the author position information, 7 is the title prohibited keyword string information, 8 is the postal code, and 9 is the number of punctuation marks), and the number after the ":" is the value of that dimension. The text following "#" is the corresponding title candidate sentence.
For example, for the 2nd title candidate sentence from the top, the similarity rank in dimension 2 is "1", which means that its 2-gram string vector has the highest similarity to those of the other title candidate sentences of the article, and the title keyword string information in dimension 5 is "1", which means that it contains a title keyword string.
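The line format described for Figure 14(b) can be read with a short parser such as the following sketch. The function name `parse_feature_line`, the names in `FEATURE_NAMES`, and the sample line in the test are hypothetical; the actual values in the figure are not reproduced here.

```python
# Dimension numbers as listed in the text; the names are illustrative labels.
FEATURE_NAMES = {
    1: "length",
    2: "similarity_rank",
    3: "author",
    4: "institution_name",
    5: "title_keyword",
    6: "author_position",
    7: "prohibited_keyword",
    8: "postal_code",
    9: "punctuation_count",
}


def parse_feature_line(line: str) -> tuple[dict[int, float], str]:
    """Split 'dim:value ... # sentence' into a feature map and the sentence."""
    feature_part, _, sentence = line.partition("#")
    features: dict[int, float] = {}
    for token in feature_part.split():
        if ":" not in token:
            continue  # skip anything that is not a dim:value pair
        dim, _, value = token.partition(":")
        features[int(dim)] = float(value)
    return features, sentence.strip()
```

Dimensions that are absent from a line are simply missing from the returned map, so a caller can treat them as zero if a dense 9-element vector is needed.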
The title candidate sentence characteristic quantities obtained in this way are classified by an SVM. Figure 15 shows the result of classifying the characteristic quantities of Figure 14(b) with the SVM; the data in the first column, enclosed by a dashed line, represent the distance from the positive-example classification surface. As this result shows, the candidate closest to the positive-example classification surface is "track division gage work management pre-test", and this candidate sentence is extracted as the article title.
The preferred embodiments of the present invention have been described in detail above, but the present invention is not limited to these specific embodiments, and various modifications and changes can be made within the scope of the gist of the invention set forth in the claims.
As an article information extraction method that uses linguistic knowledge, the article title extraction device of the present invention can be used to extract article titles in various languages. Moreover, paper originals can be digitized in real time with the feel of making a copy and indexed automatically without depending on the page layout, image information, or context of the paper original, so the device is well suited to general-purpose scan-and-index systems.

Claims (23)

1. An article title extraction device, characterized by comprising:
a title candidate sentence extracting unit that extracts a plurality of title candidate sentences from an article;
a characteristic quantity extracting unit that extracts, from the plurality of title candidate sentences, the respective characteristic quantities used to judge the plurality of title candidate sentences;
an identifying unit that extracts a title from the plurality of title candidate sentences according to the extracted characteristic quantities; and
an output unit that outputs the extracted title,
wherein the characteristic quantity includes at least similarity information, the similarity information being a function value of the similarity between a title candidate sentence and a plurality of sentences in the article.
2. The article title extraction device according to claim 1, characterized in that
the similarity information includes rank information indicating the magnitude of the similarity between a title candidate sentence and a plurality of sentences in the article.
3. The article title extraction device according to claim 1 or 2, characterized in that
the similarity information is calculated using vector information of substrings selected from the title candidate sentence and vector information of substrings selected from sentences in the article.
4. The article title extraction device according to claim 3, characterized in that
the vector information is calculated from the frequency of occurrence of N-gram strings selected from the title candidate sentence and the frequency of occurrence of N-gram strings selected from sentences in the article, where N is a natural number greater than or equal to 2.
5. The article title extraction device according to claim 4, characterized in that
when the vector information is calculated from the frequency of occurrence of the N-gram strings, the vector information is corrected if a predefined prohibited N-gram string is included.
6. The article title extraction device according to claim 1, characterized in that
the similarity information is calculated from the edit distance between the title candidate sentence and sentences in the article.
7. The article title extraction device according to claim 1, characterized in that
the similarity information is calculated from the maximum length of the character strings common to the title candidate sentence and sentences in the article.
8. The article title extraction device according to claim 1, characterized in that
when the title candidate sentence contains a predefined title keyword string, the characteristic quantity includes title keyword string information indicating the position and frequency of occurrence of that keyword string.
9. The article title extraction device according to claim 1, characterized in that
when the title candidate sentence contains a predefined prohibited title keyword string, the characteristic quantity includes prohibited title keyword string information indicating the position and frequency of occurrence of that prohibited title keyword string.
10. The article title extraction device according to claim 1, characterized in that
the identifying unit classifies the characteristic quantity of each title candidate sentence with a support vector machine and extracts the best title candidate sentence according to the classification result.
11. The article title extraction device according to claim 1, characterized in that
the output unit outputs the determined title and related information.
12. The article title extraction device according to claim 1, characterized in that
the article title extraction device further comprises: an input unit for inputting an image article; and a text data extracting unit that extracts text data from the input image article, wherein the title candidate sentence extracting unit extracts title candidate sentences from the extracted text data.
13. The article title extraction device according to claim 12, characterized in that
the title candidate sentence extracting unit extracts title candidate sentences within a certain candidate target range from the beginning of the text data.
14. The article title extraction device according to claim 13, characterized in that
the article title extraction device further comprises a unit that extracts layout information from the image article, and the characteristic quantity includes the extracted layout information.
15. An article title extraction method for extracting a title from an article, characterized by comprising the steps of:
extracting a plurality of title candidate sentences from the article;
extracting, from the plurality of title candidate sentences, a characteristic quantity used to judge the plurality of title candidate sentences, the characteristic quantity including similarity information that is a function value of the similarity between a title candidate sentence and a plurality of sentences in the article;
extracting the article title from the plurality of title candidate sentences according to the extracted characteristic quantity; and
outputting the extracted article title.
16. The article title extraction method according to claim 15, characterized in that
the similarity information includes rank information indicating the magnitude of the similarity between a title candidate sentence and a plurality of sentences in the article.
17. The article title extraction method according to claim 15 or 16, characterized in that
the similarity information is calculated using vector information of substrings selected from the title candidate sentence and vector information of substrings selected from sentences in the article.
18. The article title extraction method according to claim 17, characterized in that
the vector information is calculated from the frequency of occurrence of N-gram strings selected from the title candidate sentence and the frequency of occurrence of N-gram strings selected from sentences in the article, where N is a natural number greater than or equal to 2.
19. The article title extraction method according to claim 18, characterized in that
when the vector information is calculated from the frequency of occurrence of the N-gram strings, the vector information is corrected if a predefined prohibited N-gram string is included.
20. The article title extraction method according to claim 15, characterized in that
the similarity information is calculated from the edit distance between the title candidate sentence and sentences in the article.
21. The article title extraction method according to claim 15, characterized in that
the similarity information is calculated from the maximum length of the character strings common to the title candidate sentence and sentences in the article.
22. The article title extraction method according to claim 15, characterized in that
when the title candidate sentence contains a predefined title keyword string, the characteristic quantity includes title keyword string information indicating the position and frequency of occurrence of that keyword string.
23. The article title extraction method according to claim 15, characterized in that
when the title candidate sentence contains a predefined prohibited title keyword string, the characteristic quantity includes prohibited title keyword string information indicating the position and frequency of occurrence of that prohibited keyword string.
CNB200510116866XA 2005-10-27 2005-10-27 Automatic extraction device, method and program of essay title and correlation information Expired - Fee Related CN100444194C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB200510116866XA CN100444194C (en) 2005-10-27 2005-10-27 Automatic extraction device, method and program of essay title and correlation information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNB200510116866XA CN100444194C (en) 2005-10-27 2005-10-27 Automatic extraction device, method and program of essay title and correlation information

Publications (2)

Publication Number Publication Date
CN1955979A CN1955979A (en) 2007-05-02
CN100444194C true CN100444194C (en) 2008-12-17

Family

ID=38063295

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB200510116866XA Expired - Fee Related CN100444194C (en) 2005-10-27 2005-10-27 Automatic extraction device, method and program of essay title and correlation information

Country Status (1)

Country Link
CN (1) CN100444194C (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102595214A (en) * 2012-03-06 2012-07-18 浪潮(山东)电子信息有限公司 Method for offering digital TV program correlation recommendation
CN106502985B (en) * 2016-10-20 2020-01-31 清华大学 neural network modeling method and device for generating titles
CN108388872B (en) * 2018-02-28 2021-10-22 北京奇艺世纪科技有限公司 Method and device for identifying news headlines based on font colors
CN116187307B (en) * 2023-04-27 2023-07-14 吉奥时空信息技术股份有限公司 Method, device and storage device for extracting keywords of titles of government articles

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1365080A (en) * 1995-09-06 2002-08-21 富士通株式会社 Title extracting device and its method for extracting title from file images
JP2002108888A (en) * 2000-09-29 2002-04-12 Nippon Telegraph & Telephone East Corp Device and method for extracting keyword of digital contents and computer readable recording medium

Also Published As

Publication number Publication date
CN1955979A (en) 2007-05-02

Similar Documents

Publication Publication Date Title
JP3292388B2 (en) Method and apparatus for summarizing a document without decoding the document image
JP3289968B2 (en) Apparatus and method for electronic document processing
US6907431B2 (en) Method for determining a logical structure of a document
US9514216B2 (en) Automatic classification of segmented portions of web pages
US6178417B1 (en) Method and means of matching documents based on text genre
US8005300B2 (en) Image search system, image search method, and storage medium
US7797622B2 (en) Versatile page number detector
US7756871B2 (en) Article extraction
JP3282860B2 (en) Apparatus for processing digital images of text on documents
JP3232144B2 (en) Apparatus for finding the frequency of occurrence of word phrases in sentences
El et al. Authorship analysis studies: A survey
US8510312B1 (en) Automatic metadata identification
US20070179932A1 (en) Method for finding data, research engine and microprocessor therefor
JP2007172077A (en) Image search system, method thereof, and program thereof
JP2007122403A (en) Device, method, and program for automatically extracting document title and relevant information
CN108197119A (en) The archives of paper quality digitizing solution of knowledge based collection of illustrative plates
CN100444194C (en) Automatic extraction device, method and program of essay title and correlation information
JP2008129793A (en) Document processing system, apparatus and method, and recording medium with program recorded thereon
WO2007070010A1 (en) Improvements in electronic document analysis
Lim et al. Automatic genre detection of web documents
Couasnon et al. Making handwritten archives documents accessible to public with a generic system of document image analysis
Déjean et al. On tables of contents and how to recognize them
CN116822634A (en) Document visual language reasoning method based on layout perception prompt
CN112560849B (en) Neural network algorithm-based grammar segmentation method and system
JPH10198683A (en) Method for sorting document picture

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20081217

Termination date: 20171027