CN1452098A - File classing system and program for carrying out same - Google Patents

Publication number
CN1452098A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN02141403.3A
Other languages
Chinese (zh)
Inventor
古贺昌史
丸川勝美
田中雅子
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hitachi Ltd
Original Assignee
Hitachi Ltd

Abstract

The present invention provides a document classification system, and a program for implementing it, that can efficiently classify letters and the like according to their content. A character recognition device measures the frequencies of occurrence of the words stored in an important word dictionary, and a document kind identification unit estimates the document kind.

Description

Document classification system and program for implementing it
Technical field
The present invention relates to document classification technology that uses a computer to automatically classify inquiry letters, e-mails, and the like received at the customer service desks of companies, government offices, and similar organizations, and to a support system for answering such inquiries.
Background art
In manufacturing, insurance, telecommunications, government offices, and the like, handling inquiries received directly from customers in document form, such as e-mail, letters, and faxes, has become increasingly important in recent years. In many cases it is difficult for one person to answer a wide variety of inquiries efficiently. The number of inquiries is usually more than one person can handle. In addition, the content often spans many areas. In manufacturing, for example, inquiries may concern complaints about products, how to purchase them, or how to operate them. Handling all of these requires broad knowledge, and it is generally difficult to secure staff with such broad knowledge.
A system is therefore needed that identifies the kind of each inquiry document, classifies it, and distributes it according to its content to staff who are experts in that area, who then answer it.
The present invention relates to technology that uses a computer to classify such inquiry documents, and to a computer-based support system for answering inquiries.
Technology that uses a computer to classify e-mail inquiries is already known. A typical method applies multivariate pattern recognition, using as features the frequencies of occurrence of a group of particular words (important words) in the document. The body and file names of an e-mail are text data, so the word frequencies can be obtained by simple word matching or morphological analysis. Technology that automatically replies by e-mail or the like according to the kind of an e-mail inquiry, once that kind has been identified, is also already known.
In addition, as a method for identifying the kind of an inquiry letter, there is a method that converts the letter content to text with a character recognition device and then applies the same technology as above.
However, existing letter kind identification based on character recognition has a problem of recognition accuracy. In general, to make character recognition highly accurate, the words that may appear must be stored in advance as a dictionary. In particular, to recognize handwritten characters, the number of words must be reduced to around several hundred. With the character recognition devices used in existing letter kind identification, however, it is difficult to narrow down in advance the range of words that may appear. It is therefore difficult to identify the kind of a letter with sufficiently high accuracy.
In addition, in ordinary character recognition, the results of character segmentation and character recognition remain ambiguous. For example, multiple candidate characters are usually obtained for each partial image produced by segmentation. Sometimes the character segmentation itself remains ambiguous. Estimating the frequencies of particular words from such a recognition result is not straightforward, and the method for computing word frequencies from text data cannot be used as is. Moreover, if this ambiguity is not tolerated and the character recognition result is processed as plain text, many of the words in the document will be missed.
In addition, recognition errors cannot be avoided in character recognition. Because of these errors, cases in which the kind of a document cannot be identified, or is identified wrongly, often occur. In existing methods, operating efficiency drops sharply when such rejections and misidentifications occur.
Summary of the invention
The first problem to be solved by the present invention is to realize, in such an inquiry answering support system, high-speed and high-accuracy identification of letter kinds using character recognition.
The second problem to be solved by the present invention is to improve the maintainability of the dictionary used for character recognition.
The third problem to be solved by the present invention is to improve the interface between character recognition processing, whose results are ambiguous, and document classification processing that takes text data as input, thereby improving compatibility within the system.
The fourth problem to be solved by the present invention is to provide an environment in which answering work can continue efficiently even when the document kind cannot be identified or is identified wrongly.
The set of important words used in document kind identification is used as the word dictionary for character recognition. In addition, instead of reading all characters as before, a word spotting technique is used to measure the frequencies of occurrence of the important words. Unlike in the past, the output of character recognition processing is a vector representing word frequencies. The resulting frequencies are input to existing document kind identification to identify the document kind.
The device that performs character recognition sends to the device used for answering work not only the document kind recognition result but also the second and lower-ranked document kind candidates and the word recognition results. The answering work device supports the answering work by highlighting the important words on the image of the inquiry letter. In addition, when the document kind is wrong, an environment is provided in which the second and lower-ranked document kind candidates can be used to forward the letter image to a suitable answerer.
Description of drawings
Fig. 1 shows the hardware configuration of the embodiment.
Fig. 2 is a data flow diagram showing the processing flow of the embodiment.
Fig. 3 is a data flow diagram showing the flow of learning processing.
Fig. 4 is a data flow diagram showing the flow of e-mail classification processing.
Fig. 5 is a data flow diagram showing the flow of letter classification processing.
Fig. 6 is a data flow diagram showing the flow of answer processing.
Fig. 7 shows the display screen used for answering work.
Fig. 8 is a diagram showing the steps of learning processing.
Fig. 9 is a diagram showing the steps of letter processing.
Figure 10 shows the data format of the word frequencies.
Figure 11 shows the data format of the answering work data.
Figure 12 shows an example of an input image.
Figure 13 shows a segmentation hypothesis network.
Figure 14 shows the detected important words.
Embodiment
Fig. 1 shows the configuration of the inquiry answering system of an embodiment of the present invention, that is, a support system that automatically classifies inquiries from customers and supports answering them. The inputs to this system are inquiries by e-mail, letter, telephone, and the like. The outputs are answers to those inquiries by e-mail, letter, telephone, and the like. To communicate with the outside, the system is connected to the outside through an interface and a telephone line. The computers of the system exchange information over a LAN.
Before the system can accept inquiries, the information needed to identify the kinds of inquiry documents must be computed and the dictionaries generated; that is, learning must be performed. 101 is the learning computer that manages learning. The learning computer 101 refers to the learning data collected in advance in the learning data file system 102, computes the information necessary for document kind identification, and stores it as the classification dictionary in the dictionary file system 103. The learning data is a set of pairs, each consisting of the text data of an inquiry and its inquiry kind identifier. Past inquiries are used as the text data, and the corresponding document kinds are assigned manually. The generated classification dictionary is copied over the LAN as needed to the dictionary file system 107 of the e-mail classification computer 106, the dictionary file system 109 of the letter classification computer 108, and the dictionary file system 115 of the voice inquiry classification computer 114.
An inquiry by e-mail arrives from the Internet 104 outside the system through the gateway 105 and is received by the e-mail classification computer 106. The e-mail classification computer 106 identifies the kind of the e-mail from the inquiry content and forwards the e-mail, in association with the document kind recognition result and the positions of the important words described later, to the automatic answering computer 116.
An inquiry letter is photoelectrically converted by the scanner with sorter 110 connected to the letter classification computer 108 and is input as an image. The letter classification computer 108 recognizes characters in the image using the word spotting technique described later and identifies the kind of the letter from the inquiry content. The image is forwarded to the automatic answering computer 116 in association with the recognition result and the positions of the important words described later. Inquiries sent by fax are also fed into the letter classification computer 108 from the telephone line 111 via the line control device 113 and are processed in the same way. After this processing, the letter is sorted and stored by the sorter of the scanner according to the document kind recognition result.
An inquiry by telephone is fed from the telephone line 111 via the line control device 113 to the voice inquiry classification computer 114. The voice inquiry classification computer 114 converts the speech to text, classifies it according to the inquiry content, and transfers the call to an answering telephone 112. At the answering telephone 112, an expert staff member corresponding to the content answers the call.
When the kind of the forwarded document is one that can be answered automatically, the automatic answering computer 116 retrieves a suitable example answer from the answer example file system 117 and either answers by e-mail or prints an answer in letter form using the automatic enveloping printer 118. When the document cannot be answered automatically, it is forwarded, according to the document kind of the inquiry document, to the answering work device (121, 125, 126) at which a suitable expert is on standby.
As shown at 121, an answering work device comprises a computer 122, an input device 123 consisting of a keyboard, a mouse, and the like, and an image display device 124. Using these devices, each staff member refers to the inquiry document, creates an answer document, and forwards it to the automatic answering computer 116. The automatic answering computer 116 then sends an answer e-mail or prints an answer letter in the same manner as above.
The processing flow of the system is described below using the data flow diagram of Fig. 2. In this figure, following the Gane-Sarson notation (J. Martin, "software configuration technology", Modern Science Society, ISBN4-7649-0124-2 C3050 P5562E), solid arrows represent flows of information and hollow arrows represent flows of physical objects. Rounded rectangles represent processes, and rectangles open on the right represent stored information.
First, learning 202 is performed using the learning data 201; that is, the information necessary for identifying the kind of an inquiry document is computed, and the word statistics dictionary 203 and the important word dictionary 204 are generated. The important word dictionary 204 stores the set of words that are key features for identifying the document kind. The word statistics dictionary 203 stores the statistics needed to identify the document kind from the frequencies of occurrence of the important words. This processing is carried out by the learning computer 101.
When an e-mail inquiry is received, e-mail classification 205 identifies the inquiry document kind by referring to the word statistics dictionary 203 and the important word dictionary 204. The resulting inquiry document kind is output as answering work data in association with the e-mail itself and the position information of the important words obtained during processing. This processing is carried out by the e-mail classification computer 106.
When an inquiry arrives by letter, letter classification 206 identifies the inquiry document kind by referring to the word statistics dictionary 203 and the important word dictionary 204. The resulting inquiry document kind is output as answering work data in association with the document image and the positions of occurrence of the important words obtained during processing. Meanwhile, for storage and similar purposes, the letter itself is sorted according to the resulting inquiry document kind. This processing is carried out by the letter classification computer 108 and the scanner with sorter 110.
Using the answering work data, answering 208 creates the answer text and prints an answer letter or sends an answer e-mail. To update the word statistics dictionary 203, the important word frequencies and the document kind determined at the time of answering are output together to dictionary update 210. When the input is an e-mail, the document kind and the text data of the e-mail content are added to the learning data. By learning again with the learning data added in this way, the word statistics dictionary 203 and the important word dictionary 204 can be adapted to actual operating conditions. This processing is carried out by the automatic answering computer 116, the automatic enveloping printer 118, and the answering work devices 121, 125, 126.
Dictionary update 210 updates the classification dictionary according to the important word frequencies and document kinds of the inquiry documents obtained during operation. The statistics used for identification are updated by standard pattern recognition methods. This processing is carried out by the learning computer 101.
When an inquiry is made by telephone, voice inquiry classification 207 identifies the kind of the inquiry content, and in answering 208 an expert staff member corresponding to the content answers by telephone conversation. This processing is carried out by the voice inquiry classification computer 114 and the answering telephone 112.
Next, the data flow diagram of Fig. 3 shows the processing flow of learning 202. First, important word extraction 301 extracts from the learning data 201 the words that are important for identifying the document kind and stores them in the important word dictionary 204. This processing is realized with morphological analysis of natural language and with feature selection techniques from pattern recognition. Next, word statistics calculation 302 computes the statistics needed to identify the document kind and stores them in the word statistics dictionary 203. The identification method used here is a standard pattern recognition method, such as a quadratic discriminant function or a neural network. When a quadratic discriminant function is used, the word statistics are the frequency of occurrence of each word and the associated covariance coefficients. When a neural network is used, the word statistics are the connection weights of the network.
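As an illustration of how stored word statistics could drive identification, the following is a minimal sketch of a quadratic discriminant function. It assumes a diagonal covariance matrix (one variance per important word), which is a simplification made here and not something the description states; the names `discriminant`, `classify`, and the `floor` parameter are hypothetical.

```python
import math

def discriminant(x, mean, variances, floor=1e-6):
    """Quadratic discriminant score of frequency vector x for one document kind.

    Assumption: diagonal covariance, so the score reduces to
    -sum((x_j - m_j)^2 / var_j + log(var_j)). Higher means a better match.
    """
    score = 0.0
    for xj, mj, vj in zip(x, mean, variances):
        vj = max(vj, floor)  # avoid division by zero for words with no variance
        score -= (xj - mj) ** 2 / vj + math.log(vj)
    return score

def classify(x, statistics):
    """Rank document kinds by score; statistics maps kind -> (mean, variances)."""
    scores = {kind: discriminant(x, m, v) for kind, (m, v) in statistics.items()}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

statistics = {
    "price": ([3.0, 0.0], [1.0, 1.0]),
    "usage": ([0.0, 2.0], [1.0, 1.0]),
}
ranked = classify([3, 0], statistics)
```

The ranked list keeps every candidate, which matches the description's point that several document kind candidates with different confidence levels are retained.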
Next, the data flow diagram of Fig. 4 shows the processing flow of e-mail classification 205. First, word extraction 401 refers to the important word dictionary 204 and computes the frequency of occurrence of each important word in the e-mail. General language processing techniques such as morphological analysis are used to detect the important words. Next, document kind identification 402 identifies the document kind using the word statistics in the word statistics dictionary 203. Standard pattern recognition methods such as a quadratic discriminant function or a neural network are used for identification. Finally, answering work data generation 403 generates and outputs the answering work data, that is, data that associates the positions of occurrence of the important words and the document kind recognition result with the e-mail. Usually, multiple document kind candidates with different confidence levels can be listed. All of them are stored in the answering work data.
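The word extraction step can be sketched with simple word matching, one of the two options the description names for text data. The function name `count_important_words` and the sample words are hypothetical, and positions are recorded as byte offsets in UTF-8, which is an assumption about the encoding.

```python
def count_important_words(text, important_words):
    """Return (frequency vector, hits) for the important words found in text.

    hits is a list of (word_id, byte_offset) pairs sorted by offset.
    Assumption: plain substring matching on UTF-8 bytes, no morphological analysis.
    """
    data = text.encode("utf-8")
    freqs = [0] * len(important_words)
    hits = []
    for word_id, word in enumerate(important_words):
        needle = word.encode("utf-8")
        start = data.find(needle)
        while start != -1:
            freqs[word_id] += 1
            hits.append((word_id, start))
            start = data.find(needle, start + len(needle))
    hits.sort(key=lambda h: h[1])
    return freqs, hits

freqs, hits = count_important_words(
    "price of the printer; is the price fixed?", ["price", "manual"]
)
```

The frequency vector feeds document kind identification, while the hit list supplies the positions later used for highlighting.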
Next, the data flow diagram of Fig. 5 shows the processing flow of letter classification 206. First, the region of the letter in which text is written is input as an image. Next, word spotting recognition 502 refers to the important word dictionary 204, recognizes important words in the image, and outputs the frequency of occurrence of each important word. Next, document kind identification 503 refers to the word statistics stored in the word statistics dictionary 203 and identifies the document kind from the important word frequencies. Next, answering work data generation 505 outputs the image in association with the important word frequencies and the document kind recognition result. The letter itself is sorted according to the document kind recognition result.
Next, the processing flow of answering 208 is described using the data flow diagram of Fig. 6. First, destination determination 601 forwards the answering work data, according to the document kind attached to it, to answering work 1 to 3 (602, 603, 604) or to automatic answering 605. This processing is carried out by the automatic answering computer 116. In answering work 1 to 3 (602, 603, 604), the expert staff member for each document kind examines the inquiry content and creates the answer text. This is carried out by the staff members using answering work devices 1 to 3 (121, 125, 126). In automatic answering 605, an example answer corresponding to the document kind is retrieved from the answer example collection 606 and output. This processing is carried out by the automatic answering computer 116. Answer e-mail sending 607 sends by e-mail the answer text obtained from answering work 1 to 3 (602, 603, 604) or from automatic answering 605. This processing is carried out by the automatic answering computer 116. In addition, answer text printing 608 prints the answer text on paper to produce an answer letter. This processing is carried out by the automatic enveloping printer 118.
In practice, document kind identification by computer is not always correct. In addition, identification sometimes fails or is rejected. The system therefore deals with errors and rejections in document kind identification as follows. In answering work 1, 2, or 3 (602, 603, 604), when the inquiry document assigned to a staff member is not in the field that staff member is responsible for, the staff member performs a forwarding operation on the operation screen described later. As mentioned above, multiple candidates with different confidence levels are usually obtained as the document kind recognition result. Using this, when a forwarding operation is performed, the answering work data is automatically forwarded to the destination corresponding to the second or lower-ranked document kind candidate. The destination can also be specified by the staff member. In addition, when identification is rejected, the important words detected in the e-mail text or in the image are highlighted to support the answering work.
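The automatic forwarding rule can be sketched as choosing the destination of the best-ranked candidate that has not been tried yet. This is only an illustration: the routing table, the station names, and the `tried` set are hypothetical and do not appear in the description.

```python
def next_destination(candidates, routing, tried):
    """Pick the destination for the highest-confidence candidate not yet tried.

    candidates: list of (kind_id, confidence) pairs
    routing:    kind_id -> destination name (hypothetical table)
    tried:      set of destinations that already rejected the document
    Returns None when no untried destination remains.
    """
    for kind_id, _confidence in sorted(candidates, key=lambda c: c[1], reverse=True):
        destination = routing.get(kind_id)
        if destination is not None and destination not in tried:
            return destination
    return None

candidates = [(1, 0.2), (3, 0.7), (2, 0.1)]
routing = {3: "station-1", 1: "station-2", 2: "auto-answer"}
forwarded_to = next_destination(candidates, routing, tried={"station-1"})
```

When the function returns None, a system of this kind would presumably fall back to manual selection of the destination, which the description also allows.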
Fig. 7 shows an example of the screen displayed on the image display device of answering work devices 1 to 3 (121, 125, 126). The inquiry document is displayed in the inquiry document window 708 of the screen 701. For an e-mail inquiry the text is displayed; for an inquiry in letter form the image of the letter is displayed. In addition, using the positions of occurrence of the important words in the answering work data, the important words are highlighted in the same window, which makes the answering work easier. The staff member edits the answer text in the answer text editing window 709. An ordinary word processor can be used here. When the document displayed in the inquiry document window 708 is not one the staff member is responsible for, the staff member can use the mouse provided as an input device to click the automatic forward button 703. In response, the answering work data is forwarded to the answering work device or automatic answering computer corresponding to the second or lower-ranked candidate of the recognition result. Window 702 displays the candidates of the document kind recognition result, including the second and lower-ranked candidates. To forward the answering work data to a destination of the operator's choosing, the operator specifies the document kind with the radio buttons in window 702 and then clicks the forward button 704. To retrieve past example answers corresponding to the document kind, the operator clicks the example search button 705. The example answer is then transferred over the LAN from the answer example file system 117 and displayed in the answer text editing window 709. Clicking the send button 706 returns the edited answer text to the sender of the e-mail inquiry. Clicking the print button 707 prints the edited answer text on the automatic enveloping printer 118.
Fig. 8 shows the processing steps of learning 202. In important word extraction, words are first extracted by morphological analysis from each learning text T_i (1 ≤ i ≤ N, N: number of data items), and their frequencies of occurrence are stored as a vector u_i = (u_i1, u_i2, ..., u_iM) (u_ij: number of times word j occurs in T_i; M: total number of words). From the set of pairs {(u_i, c_i)} of the vector u_i and the manually assigned kind c_i of each text T_i, the M' words that are important for classification (M' << M) are selected using a known feature selection method such as the branch and bound algorithm. If necessary, the important words can also be selected manually. Next, in word statistics calculation, the frequency of occurrence of each important word is computed and stored as a vector v_i = (v_i1, v_i2, ..., v_iM'). The statistics necessary for document kind identification are also computed. For example, when the identification method is a quadratic discriminant function, statistics such as the means and correlation coefficients of the variables v_i1, v_i2, ..., v_iM' are computed.
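The word statistics calculation can be illustrated with a short sketch that groups the frequency vectors v_i by their manually assigned kind c_i and computes a mean vector and a covariance matrix per kind. The function name `class_statistics` is hypothetical, and dividing by the group size (population covariance) is an assumption; the description does not say which estimator is used.

```python
from collections import defaultdict

def class_statistics(vectors, kinds):
    """Return kind -> (mean vector, covariance matrix) for labelled frequency vectors.

    Assumption: population covariance (divide by the number of samples).
    """
    groups = defaultdict(list)
    for vector, kind in zip(vectors, kinds):
        groups[kind].append(vector)

    stats = {}
    for kind, rows in groups.items():
        n = len(rows)
        dim = len(rows[0])
        mean = [sum(r[j] for r in rows) / n for j in range(dim)]
        cov = [
            [sum((r[a] - mean[a]) * (r[b] - mean[b]) for r in rows) / n
             for b in range(dim)]
            for a in range(dim)
        ]
        stats[kind] = (mean, cov)
    return stats

vectors = [[2, 0], [4, 0], [0, 1], [0, 3]]
kinds = ["price", "price", "usage", "usage"]
stats = class_statistics(vectors, kinds)
```

Feature selection itself (for example branch and bound) is left out here; the sketch starts from vectors that already contain only the M' important words.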
Fig. 9 shows the processing steps of letter classification 206. This processing comprises the steps of image input, word spotting recognition, document kind identification, sorting, and answering work data generation.
In ordinary character recognition, all characters in the image are recognized. In word spotting recognition, by contrast, the words to be read are specified from outside before recognition is performed. During recognition, the character types to be recognized are restricted to the characters that can occur in the specified words, and character strings that plausibly match a specified word are detected. In the present embodiment, words are recognized from the image using the method of Japanese Unexamined Patent Publication No. H11-85909, and their frequencies of occurrence are computed. This processing comprises a character segmentation hypothesis generation step, an exploratory important word recognition step, and an important word frequency calculation step. The important word frequencies are expressed as a vector w = (w_1, w_2, ..., w_M'). By using this method of exploratory recognition for the given set of words, recognition accuracy and speed can be greatly improved. Alternatively, although accuracy and speed are inferior to this method, all characters in the image can be recognized as in existing methods and the vector w obtained using existing word matching techniques.
In document kind identification, a general pattern recognition method such as a quadratic discriminant function or a neural network is used to compute a confidence level for each document kind from the vector w and the word statistics, and the document kinds are ranked in order of confidence. In addition, following common practice, identification is rejected when the difference between the confidence levels of the first-ranked and second-ranked document kind candidates is less than a certain value, and when the confidence level of the first-ranked document kind candidate is less than a certain value.
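A minimal sketch of the ranking and rejection rule follows. It assumes that either condition alone (a weak top candidate, or a small margin over the runner-up) is enough to reject, which is one reading of the text; the threshold names `min_top` and `min_margin` and the sample confidence values are hypothetical.

```python
def rank_and_reject(confidences, min_top, min_margin):
    """Rank document kinds by confidence and decide whether to reject.

    confidences: kind -> confidence level
    Assumption: reject if the best confidence is below min_top OR the gap
    between the best and second-best candidates is below min_margin.
    """
    ranked = sorted(confidences.items(), key=lambda kv: kv[1], reverse=True)
    top = ranked[0][1]
    runner_up = ranked[1][1] if len(ranked) > 1 else float("-inf")
    rejected = top < min_top or (top - runner_up) < min_margin
    return ranked, rejected

ranked, rejected = rank_and_reject(
    {"complaint": 0.7, "purchase": 0.2, "usage": 0.1}, min_top=0.5, min_margin=0.2
)
```

A rejected document still carries its ranked candidates, so the answering work screen can display them and highlight the detected important words.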
In sorting, the scanner with sorter 110 is controlled according to the document kind recognition result, and the letter is sorted into the prescribed stack.
In answering work data generation, the answering work data, that is, data that associates the positions of occurrence of the important words and the document kind recognition result with the image, is generated and output.
Figure 10 shows the data format of the word frequencies output by word spotting recognition. It is an array of M' records. Each record stores the frequency of occurrence of one of the M' important words. The stored frequency may be an integer value, or a real value that reflects the confidence of recognition.
Figure 11 shows the data format of the answering work data. The variable kindOfMessage 1101 stores a flag indicating whether the inquiry is represented as text, as with e-mail, or as an image, as with letters and faxes. The variable sizeOfMsg 1102 gives the size of the document stored in the answering work data. Next, the region 1103 of sizeOfMsg bytes stores the inquiry document itself: text for e-mail and the like, image data for letters and faxes. The variable numberOfCandidate 1104 stores the number of document kind candidates obtained as the document kind recognition result. Next, region 1105 stores numberOfCandidate document kind candidate records. Each record consists of a pair of an integer identifier representing the document kind and its confidence value. The variable numberOfWords 1106 stores the number of important words detected. Next, region 1107 stores numberOfWords important word detection records. Each record consists of a pair of the important word identifier wordID and a record location representing the detected position. For text data, the detected position is the byte offset at which the first character of the important word occurs in the text data. For image data, the coordinates of the top, bottom, left, and right edges of the region in which the important word was recognized are stored.
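The record layout above can be sketched as a binary encoding with Python's `struct` module. The field order follows the description, but the field widths (32-bit integers, 64-bit floating-point confidence), the little-endian byte order, and the flag values `TEXT` and `IMAGE` are assumptions made for illustration only.

```python
import struct

TEXT, IMAGE = 0, 1  # hypothetical values for the kindOfMessage flag

def pack_work_data(kind_of_message, message, candidates, words):
    """Serialize answering work data in the field order of Figure 11."""
    out = struct.pack("<II", kind_of_message, len(message)) + message
    out += struct.pack("<I", len(candidates))
    for kind_id, confidence in candidates:
        out += struct.pack("<id", kind_id, confidence)
    out += struct.pack("<I", len(words))
    for word_id, location in words:
        if kind_of_message == TEXT:
            out += struct.pack("<iI", word_id, location)    # byte offset in text
        else:
            out += struct.pack("<i4I", word_id, *location)  # top, bottom, left, right
    return out

def unpack_work_data(buf):
    """Inverse of pack_work_data."""
    kind, size = struct.unpack_from("<II", buf, 0)
    pos = 8
    message = buf[pos:pos + size]
    pos += size
    (n_candidates,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    candidates = []
    for _ in range(n_candidates):
        candidates.append(struct.unpack_from("<id", buf, pos))
        pos += 12
    (n_words,) = struct.unpack_from("<I", buf, pos)
    pos += 4
    words = []
    for _ in range(n_words):
        if kind == TEXT:
            words.append(struct.unpack_from("<iI", buf, pos))
            pos += 8
        else:
            word_id, *box = struct.unpack_from("<i4I", buf, pos)
            words.append((word_id, tuple(box)))
            pos += 20
    return kind, message, candidates, words

packed = pack_work_data(TEXT, b"price?", [(3, 0.75), (1, 0.25)], [(0, 0)])
```

Storing the location either as a single offset or as four coordinates, depending on the flag, mirrors the two cases the description distinguishes for text and image data.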
An outline of word spotting recognition is given below. Figure 12 schematically shows an example of an inquiry letter. Inquiry letters usually have no fixed format, so the positions of the text lines and the size of the characters cannot be known in advance. In many cases it is also unknown whether the text is written horizontally or vertically. Moreover, when the spacing between text lines is very small, as in this example, it is sometimes hard to decide whether a component such as the dot above the "え" in the second line belongs to the line above or to the line below.
To solve such problems, word spotting recognition in the present invention adopts the method of Japanese Unexamined Patent Publication No. H11-85909. In this method, character pattern candidates are extracted from the input image, the relations among them are represented as a segmentation hypothesis network, and exploratory recognition of the pre-specified words is then performed on the segmentation hypothesis network. By using knowledge of the words predictively, this method can recognize words with high accuracy and at high speed. As a method for extracting character pattern candidates, one can combine any number of connected components in a text line and select those combinations for which the height and width of the merged figure lie between pre-specified upper and lower limits. As the search method, general breadth-first search is used, and the decision whether to expand the search tree is made according to the result of character recognition.
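The candidate extraction rule can be sketched as follows. Connected components are given as bounding boxes (left, top, right, bottom); runs of neighbouring components are merged, and a merged box is kept when its width and height fall inside the size limits. Restricting the combinations to consecutive runs of at most `max_run` components, and using one pair of limits for both width and height, are simplifications assumed here; all names are hypothetical.

```python
def merge(boxes):
    """Bounding box (left, top, right, bottom) that encloses all given boxes."""
    return (min(b[0] for b in boxes), min(b[1] for b in boxes),
            max(b[2] for b in boxes), max(b[3] for b in boxes))

def character_candidates(components, min_size, max_size, max_run=3):
    """Return (component indices, merged box) for plausible character patterns.

    Indices refer to the components sorted by their left edge.
    """
    ordered = sorted(components)
    found = []
    for start in range(len(ordered)):
        last = min(start + max_run, len(ordered))
        for end in range(start + 1, last + 1):
            box = merge(ordered[start:end])
            width = box[2] - box[0]
            height = box[3] - box[1]
            if min_size <= width <= max_size and min_size <= height <= max_size:
                found.append((tuple(range(start, end)), box))
    return found

components = [(0, 0, 4, 10), (5, 0, 10, 10), (20, 0, 30, 10)]
candidates = character_candidates(components, min_size=8, max_size=12)
```

In this example the two narrow components are only accepted when merged, which is the situation of a character that consists of several separate strokes.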
In Japanese Unexamined Patent Publication No. H11-85909, the segmentation hypothesis network is introduced to resolve the difficulty of segmenting characters within a text line. In the example of Figure 12, however, extracting the text lines before character segmentation is itself very difficult. Therefore, in word spotting recognition of the present invention, character pattern candidates are extracted from the entire image, and the segmentation hypothesis network represents whether character pattern candidates can be connected in either the vertical or the horizontal direction, that is, whether the characters can be read in sequence as part of a word. Figure 13 shows an example of a segmentation hypothesis network obtained in this way. The ellipses in the figure represent character pattern candidates. For example, 1301 shows one character pattern candidate formed by merging two connected components. In this case, character pattern candidate 1301 corresponds to "ら", and candidate pattern 1302 corresponds to the dot at the top of "ら". Edge 1303 indicates that candidate pattern 1302 can be connected to candidate pattern 1304, in other words that the two may follow one another as characters in a word. Here, when an edge such as 1303 leads from the inside of a merged candidate to the outside, it indicates that candidate pattern 1301 may also be connected to candidate pattern 1304. Whether connection is possible is judged from the distance between the candidate patterns: connection is possible when the distance is at or below a predetermined threshold.
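The distance test for connecting candidate patterns can be sketched as below. Each candidate is a bounding box (left, top, right, bottom), and an edge is added when the next box lies to the right of or below the current one and the gap between them is within the threshold. Measuring the gap as the Euclidean distance between box edges is an assumption; the description only says that the distance is compared with a threshold.

```python
import math

def gap(a, b):
    """Euclidean gap between two boxes; 0.0 when they touch or overlap."""
    dx = max(b[0] - a[2], a[0] - b[2], 0)
    dy = max(b[1] - a[3], a[1] - b[3], 0)
    return math.hypot(dx, dy)

def build_network(boxes, threshold):
    """Return node -> list of successor nodes for the hypothesis network."""
    edges = {i: [] for i in range(len(boxes))}
    for i, a in enumerate(boxes):
        for j, b in enumerate(boxes):
            if i == j:
                continue
            follows_right = b[0] >= a[2]  # next character in horizontal writing
            follows_below = b[1] >= a[3]  # next character in vertical writing
            if (follows_right or follows_below) and gap(a, b) <= threshold:
                edges[i].append(j)
    return edges

boxes = [(0, 0, 10, 10), (12, 0, 22, 10), (0, 13, 10, 23), (40, 40, 50, 50)]
network = build_network(boxes, threshold=3)
```

Allowing both directions from every node is what lets the search proceed without knowing in advance whether the letter is written horizontally or vertically.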
By taking the segmentation hypothesis network obtained in this way as input and searching for important words with character recognition, the places where important words appear can be detected. For example, when "Chi ら" and "price" are important words, their positions can be detected as shown at 1401 and 1402 in Figure 14. Because character recognition uses a word dictionary made up of only the minimum set of words needed for document classification, letters can be classified automatically with high accuracy and at high speed.
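A breadth-first search for important words over such a network might look like the sketch below. It assumes, purely for illustration, that character recognition has already produced for each candidate the set of characters it could be (`labels`), and that `edges` lists the candidates that may follow each candidate. The function name `spot_words` and both data structures are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set, Tuple


def spot_words(labels: Dict[int, Set[str]], edges: Dict[int, List[int]],
               words: List[str]) -> List[Tuple[str, List[int]]]:
    """Return (word, path of candidate indices) for every occurrence found."""
    hits = []
    for word in words:
        if not word:
            continue
        # Start from every candidate that can be read as the first character.
        queue = deque([i] for i in labels if word[0] in labels[i])
        while queue:
            path = queue.popleft()
            if len(path) == len(word):
                hits.append((word, path))
                continue
            next_char = word[len(path)]
            # Expand the search tree only where recognition supports the word.
            for j in edges.get(path[-1], []):
                if next_char in labels[j]:
                    queue.append(path + [j])
    return hits
```

The returned paths give the positions of each important word in the image, and the number of hits per word gives its occurrence count.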
Because the important word dictionary is shared with e-mail classification, the word dictionary of the character recognition device can be generated easily. In addition, the word dictionary can be updated automatically from examples obtained from e-mail during operation.
Because the output of the character recognition process is the frequency of occurrence of words, it is highly compatible, within the system, with document kind identification processing based on word frequencies. As a result, most existing document kind identification devices can be reused, and document kind identification based on character recognition can easily coexist in the system with text-based document kind identification.
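To make the point about compatibility concrete, the sketch below classifies a document from word counts alone, so the same function can accept counts obtained from an image by word spotting or from e-mail text. The scoring rule (a weighted sum of counts per kind) and the names `classify` and `kind_profiles` are assumptions for illustration; the text does not specify how the frequencies are combined.

```python
from collections import Counter
from typing import Dict, Mapping, Optional


def classify(word_counts: Mapping[str, int],
             kind_profiles: Dict[str, Dict[str, float]]) -> Optional[str]:
    """Return the document kind with the highest score, or None if no
    important word was found (the document stays unclassified)."""
    scores = {
        kind: sum(word_counts.get(word, 0) * weight
                  for word, weight in profile.items())
        for kind, profile in kind_profiles.items()
    }
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


# Hypothetical profiles: weight of each important word for each kind.
PROFILES = {
    "price inquiry": {"price": 2.0, "estimate": 1.0},
    "complaint": {"broken": 2.0, "refund": 1.0},
}

counts_from_image = Counter(["price", "price", "refund"])
print(classify(counts_from_image, PROFILES))
```

Since only counts cross the interface, an existing text-based classifier could be substituted for `classify` without changing the recognition side.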
When a document cannot be classified, the reply staff's work can still be supported by indicating the important words on the image to them. In addition, even when a document is misclassified, an environment can be provided in which reply work can continue efficiently.

Claims (10)

1. A document classification system, characterized by comprising:
an input device for inputting image data of a document,
a storage device that stores information on important words used to identify the kind of the document and on their frequencies of occurrence, and
a processing device that processes the image data,
wherein the processing device detects the important words in the image data input from the input device by a word spotting technique, counts the number of occurrences of each, identifies the kind of the document from the information stored in the storage device and the counts, and outputs the document kind identification result in association with the image data.
2. The document classification system according to claim 1, characterized in that
the system further has a display device, and
the display device displays the document kind identification result in association with the image data.
3. The document classification system according to claim 2, characterized in that
the system records example reply texts for the document in a recording device in association with the kinds, and
the display device displays the image data together with the example reply text corresponding to the document kind identification result.
4. The document classification system according to claim 2 or 3, characterized in that
the processing device also outputs position information, within the image data, of the counted important words, and
the display device highlights the important words in the image data according to the position information.
5. The document classification system according to any one of claims 1 to 3, characterized in that
the system has a sorting device, and
the sorting device sorts the identified documents by kind and discharges them.
6. The document classification system according to any one of claims 1 to 3, characterized in that
the document classification system is connected to a communication network, and
the processing device also performs the kind identification on e-mail received via the communication network.
7. The document classification system according to any one of claims 1 to 3, characterized in that
the processing device updates the information stored in the storage device using the identification result and the numbers of occurrences of the important words.
8. A document classification system, characterized by
having a document kind identification device and a plurality of document processing devices connected via a network,
the document kind identification device comprising:
a device that obtains image data or text data of a document,
a recording device that records the kinds of the document in association with information on the important words used to identify each kind, and
a processing device that processes the image data or text data,
wherein the processing device identifies the document according to the information, outputs the identification result together with a confidence level for each of the kinds, and determines, according to the confidence levels, the document processing device to which the output is sent.
9. The document classification system according to claim 8, characterized in that
the processing device, upon accepting input indicating that the identification is wrong,
transfers the document to another of the document processing devices according to the confidence levels.
10. A program for causing a computer that is connected to an image data input device and has a data storage device and a control device to execute a document identification method,
characterized in that
the document identification method comprises:
a step of obtaining document data through the image data input device,
a step of using a word spotting technique to recognize, in the image data, important words stored in advance in the storage device, and counting the number of occurrences of each important word,
a step of identifying the document kind from the count values, and
a step of outputting the document data in association with the document kind identification result.
CN02141403.3A 2002-04-19 2002-08-28 File classing system and program for carrying out same Pending CN1452098A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP116976/2002 2002-04-19
JP2002116976A JP2003317034A (en) 2002-04-19 2002-04-19 Document classification system and program for realizing the same

Publications (1)

Publication Number Publication Date
CN1452098A true CN1452098A (en) 2003-10-29

Family

ID=29243476

Family Applications (1)

Application Number Title Priority Date Filing Date
CN02141403.3A Pending CN1452098A (en) 2002-04-19 2002-08-28 File classing system and program for carrying out same

Country Status (2)

Country Link
JP (1) JP2003317034A (en)
CN (1) CN1452098A (en)

Also Published As

Publication number Publication date
JP2003317034A (en) 2003-11-07

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication