CN105608074A - Word counting method and device - Google Patents
- Publication number
- CN105608074A
- Authority
- CN
- China
- Prior art keywords
- word
- language
- counting
- character
- punctuation mark
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Probability & Statistics with Applications (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a word counting method and device and relates to the field of computer technology. It addresses the technical problem that the prior art cannot produce an itemized, per-language word count for a file or passage of text containing multiple languages. The word counting method comprises: step 1, reading the text content into memory in batches of a certain length; step 2, after each batch is read into memory, scanning the text in memory, identifying and counting the punctuation marks in the text, and then removing the punctuation marks to form a new character string that contains no punctuation marks; step 3, reading the words or characters in the punctuation-free character string, identifying the language of each one, and counting it; step 4, adding up the successive punctuation counts and the successive word or character counts of each language separately.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a word counting method and device.
Background technology
Word counting for text written in a single language is a relatively mature technology. The current difficulty lies in text or documents that contain two or more languages, such as mixed Chinese and English, or multilingual files containing French, Japanese, and Korean: the prior art cannot count the words of each language separately.
Summary of the invention
The technical problem to be solved by the present invention is that the prior art cannot produce an itemized, per-language word count for a file or passage of text containing multiple languages.
To solve the above problem, the present invention provides a word counting method, comprising: step 1, reading the text content into memory in batches of a certain length; step 2, after each batch is read into memory, scanning the text in memory, identifying and counting the punctuation marks in the text, and then removing the punctuation marks to form a new character string that contains no punctuation marks; step 3, reading the words or characters in the punctuation-free character string, identifying the language of each one, and counting it; step 4, adding up the successive counts of punctuation marks and of the words or characters of each language separately.
The present invention also provides a word counting device, comprising: a reading module, configured to read the text content into memory in batches of a certain length; a punctuation mark identification module, configured to, after each batch is read into memory, scan the text in memory, identify and count the punctuation marks in the text, and then remove the punctuation marks to form a new character string that contains no punctuation marks; a language identification module, configured to read the words or characters in the punctuation-free character string, identify the language of each one, and count it; and an itemized statistics module, configured to add up the successive counts of punctuation marks and of the words or characters of each language separately.
As the above technical solutions show, the present invention provides a word counting method and device that counts the words of a text or document language by language, making the word count more accurate and more detailed. This makes it more convenient to count the words of files in the translation field and saves time.
Brief description of the drawings
Fig. 1 is a flowchart of a word counting method;
Fig. 2 is a flowchart of a word counting method;
Fig. 3 is a schematic structural diagram of a word counting device.
Detailed description of the invention
The technical solution of the present invention is described in detail below with reference to the drawings and embodiments.
It should be noted that, provided they do not conflict, the embodiments of the present invention and the features within them may be combined with one another, and all such combinations fall within the protection scope of the present invention.
Embodiment 1. As shown in Fig. 1, a word counting method comprises: step 1, reading the text content into memory in batches of a certain length. The length may be a fixed number of bytes, a sentence, a passage, or an entire article, and can be set as required. Step 2, after each batch is read into memory, scanning the text in memory, identifying and counting the punctuation marks in the text, and then removing the punctuation marks to form a new character string that contains no punctuation marks. Step 3, reading the words or characters in the punctuation-free character string, identifying the language of each one, and counting it. Step 4, adding up the successive counts of punctuation marks and of the words or characters of each language separately.
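By way of illustration only, steps 1 and 2 might be sketched in Python as follows. The names `read_batches` and `split_punctuation`, the batch size, and the use of Unicode general categories beginning with `P` to decide what counts as a punctuation mark are assumptions of this sketch and are not specified by the invention. Each removed mark is replaced by a space, which is also an assumption, so that neighbouring words do not run together.

```python
import io
import unicodedata


def read_batches(stream, batch_size=4096):
    """Step 1: yield the text in batches of at most batch_size characters."""
    while True:
        chunk = stream.read(batch_size)
        if not chunk:
            break
        yield chunk


def split_punctuation(text):
    """Step 2: count punctuation marks and return the text without them."""
    count = 0
    kept = []
    for ch in text:
        # Assumption: any Unicode category starting with "P" is punctuation.
        if unicodedata.category(ch).startswith("P"):
            count += 1
            kept.append(" ")  # assumption: leave a separator where the mark was
        else:
            kept.append(ch)
    return count, "".join(kept)


batches = list(read_batches(io.StringIO("Hi, 你好!"), batch_size=4))
results = [split_punctuation(batch) for batch in batches]
punctuation_total = sum(count for count, _ in results)
```

A fixed-size batch can cut a word in two at the batch boundary; batching by sentence or passage, which the embodiment also allows, would avoid that.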
The present invention provides a word counting method that counts the words of a text or document language by language, making the word count more accurate and more detailed. This makes it more convenient to count the words of files in the translation field and saves time.
Embodiment 2. As shown in Fig. 2, on the basis of Embodiment 1, preferably, the identifying and counting in step 3 proceeds in sequence: determine whether the word or character is Chinese, and if so, count it; if not, determine whether it is English, and if so, count it; if not, determine whether it is French, and if so, count it; if not, test the other languages in turn, until the language of each word or character has been identified.
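The sequential test described above might be sketched as an ordered list of recognizers. This is only a sketch: the three predicate functions are hypothetical stand-ins written for this example, and the French test in particular is a crude placeholder for the language-model matching described later in this embodiment.

```python
def is_chinese(token):
    # Assumption: only the CJK Unified Ideographs block is tested.
    return bool(token) and all("\u4e00" <= ch <= "\u9fff" for ch in token)


def is_english(token):
    # Assumption: a token made only of ASCII letters is treated as English.
    return token.isascii() and token.isalpha()


def is_french(token):
    # Crude placeholder: accepts any Latin-script letters. Pure ASCII tokens
    # never reach this test because English is tried first.
    return token.isalpha() and all(ord(ch) < 0x250 for ch in token)


# The order of this list is the order in which the languages are tested.
RECOGNIZERS = [
    ("chinese", is_chinese),
    ("english", is_english),
    ("french", is_french),
]


def identify_language(token):
    """Return the first language whose recognizer accepts the token."""
    for language, recognizer in RECOGNIZERS:
        if recognizer(token):
            return language
    return "other"
```

Further languages would be supported by appending more entries to `RECOGNIZERS`.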
Preferably, a code library and a language model are set up for each language. The code library is traversed to preliminarily identify the language category of a word or character, and the word or character is then fully identified according to the language model of each language and specific rules.
Preferably, in step 3, for languages that do not put spaces between words, the word count is the actual number of characters.
Preferably, in step 3, for languages that separate words with spaces, words are counted using spaces or punctuation marks as delimiters, and the spaces themselves are not counted.
As shown in Fig. 2, the concrete steps of a word counting method are:
Prepare a multilingual document or a string of text;
Read the text from the file or passage into memory in batches as required;
Count the punctuation marks using a punctuation mark algorithm;
Pass the text in memory through a punctuation mark filtering algorithm to remove the punctuation marks and form a new character string;
Read a word or character from the punctuation-free character string and pass it in turn through the Chinese recognition algorithm, the English recognition algorithm, the French recognition algorithm, and so on, until its language has been identified and the word or character has been fully recognized; then hand it to the counter;
Each language recognition algorithm first uses the computer's Unicode code library to preliminarily identify the language of a word or character. For words or characters that the Unicode code library cannot identify accurately, it then matches against a language model trained on a large monolingual corpus of each language and performs probabilistic identification. Finally, it applies a set of specific rules to fully identify the word or character.
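As a rough sketch of the two-stage identification just described, the following Python looks a character up in a small table of Unicode ranges and falls back to context only when the table is ambiguous. The table `CODE_RANGES` and both function names are assumptions of this example. The fallback, which treats an ideograph as Japanese when kana appears nearby, is a deliberately simple stand-in for the trained language model and is not the probabilistic matching of the invention.

```python
# Assumption: a tiny hand-written table stands in for the "code library".
CODE_RANGES = [
    (0x3040, 0x309F, "japanese"),  # Hiragana
    (0x30A0, 0x30FF, "japanese"),  # Katakana
    (0xAC00, 0xD7A3, "korean"),    # Hangul syllables
    (0x4E00, 0x9FFF, "han"),       # ideographs shared by Chinese and Japanese
]


def preliminary_language(ch):
    """Look a single character up in the code table; None if not listed."""
    code_point = ord(ch)
    for low, high, label in CODE_RANGES:
        if low <= code_point <= high:
            return label
    return None


def identify_character(ch, context):
    """Resolve the ambiguous "han" label using the surrounding text."""
    label = preliminary_language(ch)
    if label != "han":
        return label
    # Stand-in for the language model: kana nearby suggests Japanese text.
    if any(preliminary_language(c) == "japanese" for c in context):
        return "japanese"
    return "chinese"
```

The ideograph case shows why the code library alone is not enough: the same code point can belong to more than one language, so a second stage has to decide.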
The specific rules are as follows:
1. Languages such as Chinese, Japanese, and Korean, which do not put spaces between words, are counted by the actual number of characters; for example, the Chinese phrase for "Who am I" (我是谁), its Japanese equivalent (私は誰), and a Korean phrase are counted as 3, 3, and 5 respectively;
2. Languages such as English, which separate words with spaces, are counted using spaces or punctuation marks as delimiters; for example, "I am a Chinese, and you?" is counted as 8;
3. Each punctuation mark counts as one word or character;
4. Each special character, such as # or &, counts as one word or character;
5. A continuous run of digits counts as one word; for example, 123456 has a word count of 1;
6. A continuous run of letters counts as one word; for example, abcdefg has a word count of 1;
7. When one or more letters, digits, or special characters are inserted within a continuous run of digits or letters, each part is counted separately; for example, 123a456, 123abc456, 123$456, abc2def, abc123def, and abc$def each have a word count of 3;
8. Spaces are not counted;
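The rules above can be approximated with a single regular expression, shown here as a minimal sketch. The pattern `TOKEN` and the function `count_words` are names invented for this example, and the set of letters treated as "Latin" is an assumption. Unlike the main flow, the sketch counts punctuation marks together with the words instead of itemizing them first.

```python
import re

# One alternative per group of rules; the order of the alternatives matters.
TOKEN = re.compile(
    r"[0-9]+"  # rule 5: a run of digits is one word
    r"|[A-Za-z\u00C0-\u00D6\u00D8-\u00F6\u00F8-\u024F]+"  # rule 6: a run of Latin letters
    r"|\S"  # rules 1, 3, 4: any other non-space character counts on its own
)


def count_words(text):
    """Count words and characters; spaces are never counted (rule 8)."""
    return len(TOKEN.findall(text))
```

Rule 7 needs no separate branch: a digit run and a letter run never merge, so `count_words("123a456")` yields three parts.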
An itemized statistics counter records the separate counts for each language and for punctuation marks;
The itemized records in the counter are output according to the actual business rules. For example, in a Chinese-to-English translation job, English in the file content is ignored: only Chinese is counted and English is not recorded. If the file contains other languages, they are recorded and output, and punctuation marks are output as well.
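A minimal sketch of the itemized counter and of a business rule that suppresses one language might look as follows. The function names, the item labels, and the sample numbers are all hypothetical and serve only to show the shape of the computation.

```python
from collections import Counter


def merge_batches(batch_counts):
    """Add the per-batch counters together, item by item."""
    totals = Counter()
    for counts in batch_counts:
        totals.update(counts)
    return totals


def report(totals, ignored_languages=()):
    """Return the itemized counts, leaving out the ignored languages."""
    return {item: n for item, n in totals.items() if item not in ignored_languages}


# Hypothetical per-batch results for a Chinese-to-English translation job.
totals = merge_batches([
    Counter(chinese=3, english=2, punctuation=1),
    Counter(chinese=1, french=4),
])
quote_basis = report(totals, ignored_languages={"english"})
```

Keeping the full totals and filtering only at output time means the same counts can serve several business rules without recounting the file.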
To solve the above problem, the present invention provides different counting methods for text with different language features. The text features fall into three classes: Asian languages such as Chinese, Japanese, and Korean, which do not put spaces between words; European languages, which separate words with spaces; and special characters or punctuation marks.
Embodiment 3. As shown in Fig. 3, a word counting device comprises: a reading module, configured to read the text content into memory in batches of a certain length; a punctuation mark identification module, configured to, after each batch is read into memory, scan the text in memory, identify and count the punctuation marks in the text, and then remove the punctuation marks to form a new character string that contains no punctuation marks; a language identification module, configured to read the words or characters in the punctuation-free character string, identify the language of each one, and count it; and an itemized statistics module, configured to add up the successive counts of punctuation marks and of the words or characters of each language separately.
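One possible way to arrange the four modules in software is sketched below. The class names mirror the modules of this embodiment, but every method name, the batch size, and the restriction of the language module to Chinese ideographs and ASCII words are assumptions made to keep the example short.

```python
import io
import re
import unicodedata
from collections import Counter


class ReadModule:
    def __init__(self, batch_size=1024):
        self.batch_size = batch_size

    def batches(self, stream):
        # Note: a fixed-size batch may split a word at its boundary.
        while True:
            chunk = stream.read(self.batch_size)
            if not chunk:
                break
            yield chunk


class PunctuationModule:
    def process(self, text):
        is_punct = [unicodedata.category(c).startswith("P") for c in text]
        cleaned = "".join(" " if p else c for c, p in zip(text, is_punct))
        return sum(is_punct), cleaned


class LanguageModule:
    # Assumption: only ASCII words and single Chinese ideographs are recognized.
    TOKEN = re.compile(r"[A-Za-z]+|[\u4e00-\u9fff]")

    def count(self, text):
        counts = Counter()
        for token in self.TOKEN.findall(text):
            is_han = "\u4e00" <= token[0] <= "\u9fff"
            counts["chinese" if is_han else "english"] += 1
        return counts


class ItemizedStatisticsModule:
    def __init__(self):
        self.totals = Counter()

    def add(self, punctuation_count, language_counts):
        self.totals["punctuation"] += punctuation_count
        self.totals.update(language_counts)


def count_stream(stream, batch_size=1024):
    reader = ReadModule(batch_size)
    punctuation = PunctuationModule()
    language = LanguageModule()
    statistics = ItemizedStatisticsModule()
    for batch in reader.batches(stream):
        n, cleaned = punctuation.process(batch)
        statistics.add(n, language.count(cleaned))
    return dict(statistics.totals)
```

For instance, `count_stream(io.StringIO("我是谁? I am here."))` itemizes the sample sentence into punctuation, Chinese, and English counts.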
Preferably, the language identification module identifies and counts the language in sequence: determine whether the word or character is Chinese, and if so, count it; if not, determine whether it is English, and if so, count it; if not, determine whether it is French, and if so, count it; if not, test the other languages in turn, until the language of each word or character has been identified.
Preferably, a code library is set up for each language. The code library is traversed to preliminarily identify the language category of a word or character, and the word or character is then fully identified according to the features of each language and specific rules.
Preferably, in the language identification module, for languages that do not put spaces between words, the word count is the actual number of characters.
Preferably, in the language identification module, for languages that separate words with spaces, words are counted using spaces or punctuation marks as delimiters, and the spaces themselves are not counted.
The device corresponds one-to-one with the technical solution of the method above; for all explanations, refer to the method, which is not repeated here.
The invention can accurately count the words in a text or document containing two or more languages; it can accurately count the words in common document formats such as Word, Excel, and txt; and it can count the words of large files quickly and accurately. The present invention provides a word counting method and device that counts the words of a text or document language by language, making the word count more accurate and more detailed. This makes it more convenient to count the words of files in the translation field and saves time. The present invention can be used in the translation field to produce a composite translation quotation based on the different languages in the file content.
Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art can make various corresponding changes and modifications according to the present invention, but all such changes and modifications shall fall within the protection scope of the claims of the present invention.
Claims (10)
1. A word counting method, characterized by comprising: step 1, reading text content into memory in batches of a certain length; step 2, after each batch is read into memory, scanning the text in memory, identifying and counting the punctuation marks in the text, and then removing the punctuation marks to form a new character string that contains no punctuation marks; step 3, reading the words or characters in the punctuation-free character string, identifying the language of each one, and counting it; step 4, adding up the successive counts of punctuation marks and of the words or characters of each language separately.
2. The word counting method according to claim 1, characterized in that the identifying and counting in step 3 comprises, in sequence: determining whether the word or character is Chinese, and if so, counting it; if not, determining whether it is English, and if so, counting it; if not, determining whether it is French, and if so, counting it; if not, testing the other languages in turn, until the language of each word or character has been identified.
3. The word counting method according to claim 2, characterized in that a code library is set up for each language, the code library is traversed to preliminarily identify the language category of a word or character, and the word or character is then fully identified according to the language model of each language and specific rules.
4. The word counting method according to claim 1, characterized in that, in step 3, for languages that do not put spaces between words, the word count is the actual number of characters.
5. The word counting method according to claim 1, characterized in that, in step 3, for languages that separate words with spaces, words are counted using spaces or punctuation marks as delimiters, and the spaces themselves are not counted.
6. A word counting device, characterized by comprising: a reading module, configured to read text content into memory in batches of a certain length; a punctuation mark identification module, configured to, after each batch is read into memory, scan the text in memory, identify and count the punctuation marks in the text, and then remove the punctuation marks to form a new character string that contains no punctuation marks; a language identification module, configured to read the words or characters in the punctuation-free character string, identify the language of each one, and count it; and an itemized statistics module, configured to add up the successive counts of punctuation marks and of the words or characters of each language separately.
7. The word counting device according to claim 6, characterized in that the language identification module identifies and counts the language in sequence: determining whether the word or character is Chinese, and if so, counting it; if not, determining whether it is English, and if so, counting it; if not, determining whether it is French, and if so, counting it; if not, testing the other languages in turn, until the language of each word or character has been identified.
8. The word counting device according to claim 7, characterized in that a code library and a language model are set up for each language, the code library is traversed to preliminarily identify the language category of a word or character, and the word or character is then fully identified according to the language model of each language and specific rules.
9. The word counting device according to claim 6, characterized in that, in the language identification module, for languages that do not put spaces between words, the word count is the actual number of characters.
10. The word counting device according to claim 6, characterized in that, in the language identification module, for languages that separate words with spaces, words are counted using spaces or punctuation marks as delimiters, and the spaces themselves are not counted.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610028758.5A CN105608074B (en) | 2016-01-15 | 2016-01-15 | Word counting method and device |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201610028758.5A CN105608074B (en) | 2016-01-15 | 2016-01-15 | Word counting method and device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN105608074A (en) | 2016-05-25 |
CN105608074B (en) | 2018-06-29 |
Family
ID=55988018
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201610028758.5A Active CN105608074B (en) | Word counting method and device | 2016-01-15 | 2016-01-15 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN105608074B (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106354711A (en) * | 2016-08-18 | 2017-01-25 | 中译语通科技(北京)有限公司 | Method and device for language identification |
CN106527876A (en) * | 2016-11-10 | 2017-03-22 | 广东工业大学 | Method and system for counting webpage word number |
CN111160015A (en) * | 2019-12-24 | 2020-05-15 | 北京明略软件系统有限公司 | Method, device, computer storage medium and terminal for realizing text analysis |
CN112446262A (en) * | 2019-09-02 | 2021-03-05 | 深圳中兴网信科技有限公司 | Text analysis method, text analysis device, text analysis terminal and computer-readable storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US4049913A (en) * | 1975-10-31 | 1977-09-20 | Nippon Electric Company, Ltd. | System for recognizing speech continuously spoken with number of word or words preselected |
US20020013778A1 (en) * | 1999-09-10 | 2002-01-31 | Neal Michael Renn | Sequential subset catalog search engine |
JP2008065417A (en) * | 2006-09-05 | 2008-03-21 | Hottolink Inc | Associative word group retrieval device and system, and content match type advertisement system |
CN104281603A (en) * | 2013-07-05 | 2015-01-14 | 北大方正集团有限公司 | Word frequency grading statistical method and system |
CN104699669A (en) * | 2015-03-31 | 2015-06-10 | 中译语通科技(北京)有限公司 | Text word-counting method and device |
CN105204738A (en) * | 2015-09-18 | 2015-12-30 | 北京奇虎科技有限公司 | E-book reading quantity determining and ranking methods, terminal device and server |
- 2016-01-15: CN application CN201610028758.5A filed (patent CN105608074B, status: Active)
Non-Patent Citations (1)
Title |
---|
QIANG925 et al.: "java 统计字数" (Java word count) [question points: 20; thread closed by qiang925], 《HTTP://BBS.CSDN.NET/TOPICS/330038421》 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106354711A (en) * | 2016-08-18 | 2017-01-25 | 中译语通科技(北京)有限公司 | Method and device for language identification |
CN106527876A (en) * | 2016-11-10 | 2017-03-22 | 广东工业大学 | Method and system for counting webpage word number |
CN112446262A (en) * | 2019-09-02 | 2021-03-05 | 深圳中兴网信科技有限公司 | Text analysis method, text analysis device, text analysis terminal and computer-readable storage medium |
CN111160015A (en) * | 2019-12-24 | 2020-05-15 | 北京明略软件系统有限公司 | Method, device, computer storage medium and terminal for realizing text analysis |
CN111160015B (en) * | 2019-12-24 | 2024-03-05 | 北京明略软件系统有限公司 | Method, device, computer storage medium and terminal for realizing text analysis |
Also Published As
Publication number | Publication date |
---|---|
CN105608074B (en) | 2018-06-29 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 100040 Shijingshan District railway building, Beijing, the 16 floor; Applicant after: Chinese translation language through Polytron Technologies Inc; Address before: 100040 Shijingshan District railway building, Beijing, the 16 floor; Applicant before: Mandarin Technology (Beijing) Co., Ltd. |
GR01 | Patent grant | |