CN105608074A - Word counting method and device - Google Patents

Publication number
CN105608074A
Authority
CN
China
Prior art keywords
word
language
counting
character
punctuation mark
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610028758.5A
Other languages
Chinese (zh)
Other versions
CN105608074B (en)
Inventor
王建华
程国艮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Mandarin Technology (beijing) Co Ltd
Original Assignee
Mandarin Technology (beijing) Co Ltd
Application filed by Mandarin Technology (beijing) Co Ltd
Priority to CN201610028758.5A
Publication of CN105608074A
Application granted
Publication of CN105608074B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a word counting method and device, and relates to the technical field of computers. It addresses the technical problem that the prior art cannot produce itemized word counts for the multiple languages contained in a multilingual file or passage. The word counting method comprises: step 1, reading the text content, the text being read into memory in batches of a certain length; step 2, each time a batch of text has been read into memory, scanning the text in memory, identifying and counting the punctuation marks in it, and then removing the punctuation marks to form a new character string that contains no punctuation; step 3, reading the words or characters in the punctuation-free character string, identifying the language of each one, and counting it; step 4, adding up the successively counted numbers of punctuation marks and of words or characters of each language, each as a separate total.

Description

Word counting method and device
Technical field
The present invention relates to the field of computer technology, and in particular to a word counting method and device.
Background
Word counting for text in a single language is a relatively mature technology. The difficulty in current word counting arises when one passage or document contains two or more languages, such as mixed Chinese and English, or multilingual files containing French, Japanese, and Korean. The prior art cannot count the words of each language as a separate item.
Summary of the invention
The technical problem to be solved by the present invention is that the prior art cannot produce itemized word counts for the multiple languages contained in a multilingual file or passage.
To solve this problem, the invention provides a word counting method comprising: step 1, reading the text content, the text being read into memory in batches of a certain length; step 2, each time a batch of text has been read into memory, scanning the text in memory, identifying and counting the punctuation marks in it, and then removing the punctuation marks to form a new character string that contains no punctuation; step 3, reading the words or characters in the punctuation-free character string, identifying the language of each one, and counting it; step 4, adding up the successively counted numbers of punctuation marks and of words or characters of each language, each as a separate total.
The invention also provides a word counting device comprising: a reading module, for reading the text content, the text being read into memory in batches of a certain length; a punctuation identification module, for scanning the text in memory each time a batch has been read, identifying and counting the punctuation marks in it, and then removing the punctuation marks to form a new character string that contains no punctuation; a language identification module, for reading the words or characters in the punctuation-free character string, identifying the language of each one, and counting it; and an itemized statistics module, for adding up the successively counted numbers of punctuation marks and of words or characters of each language, each as a separate total.
As can be seen from the above technical solution, the invention provides a word counting method and device that count the words of a passage or document language by language. This makes word counts more accurate and detailed, and makes counting the words of files in the translation field more convenient and less time-consuming.
Brief description of the drawings
Fig. 1 is a flow chart of a word counting method;
Fig. 2 is a flow chart of a word counting method;
Fig. 3 is a schematic structural diagram of a word counting device.
Detailed description of the invention
The technical solution of the present invention is described in detail below with reference to the drawings and embodiments.
It should be noted that, provided they do not conflict, the embodiments of the present invention and the features of the embodiments may be combined with one another, and all such combinations fall within the protection scope of the present invention.
Embodiment 1. As shown in Fig. 1, a word counting method comprises: step 1, reading the text content, the text being read into memory in batches of a certain length. The "certain length" may be a fixed number of bytes, a sentence, a paragraph, or a whole article, and can be set as required. Step 2, each time a batch of text has been read into memory, scanning the text in memory, identifying and counting the punctuation marks in it, and then removing the punctuation marks to form a new character string that contains no punctuation. Step 3, reading the words or characters in the punctuation-free character string, identifying the language of each one, and counting it. Step 4, adding up the successively counted numbers of punctuation marks and of words or characters of each language, each as a separate total.
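The four steps can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the choice of one line per batch, the use of Unicode general category "P" to detect punctuation, and the two-way split into "zh" and "other" are all assumptions made for illustration. Punctuation is replaced by a space rather than deleted outright, so that neighbouring words do not fuse.

```python
import io
import unicodedata
from collections import Counter


def is_han(ch):
    # Assumption: only the basic CJK Unified Ideographs block counts as Chinese.
    return "\u4e00" <= ch <= "\u9fff"


def split_punctuation(batch):
    """Step 2: count punctuation marks and return the batch without them."""
    count = 0
    kept = []
    for ch in batch:
        if unicodedata.category(ch).startswith("P"):
            count += 1
            kept.append(" ")  # a space keeps neighbouring words apart
        else:
            kept.append(ch)
    return count, "".join(kept)


def count_by_language(text):
    """Step 3 (simplified): one count per Han character, one per other run."""
    counts = Counter()
    in_run = False
    for ch in text:
        if is_han(ch):
            counts["zh"] += 1
            in_run = False
        elif ch.isspace():
            in_run = False
        elif not in_run:
            counts["other"] += 1
            in_run = True
    return counts


def count_stream(stream):
    """Steps 1 and 4: read line-sized batches and add up the per-batch counts."""
    totals = Counter()
    for batch in stream:  # step 1: one line per batch (an assumption)
        punct, cleaned = split_punctuation(batch)
        totals["punctuation"] += punct
        totals.update(count_by_language(cleaned))  # step 4: running totals
    return totals


totals = count_stream(io.StringIO("我是谁?\nI am a Chinese, and you?\n"))
```

Reading whole lines avoids cutting a word in half at a batch boundary, which a fixed byte length would have to handle separately.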
The invention provides a word counting method that counts the words of a passage or document language by language. This makes word counts more accurate and detailed, and makes counting the words of files in the translation field more convenient and less time-consuming.
Embodiment 2. As shown in Fig. 2, on the basis of Embodiment 1, preferably, in step 3 the language is identified and counted as follows: each word or character is tested in turn. It is first tested for Chinese and, if it is Chinese, counted; if not, it is tested for English and, if it is English, counted; if not, it is tested for French and, if it is French, counted; if not, it is tested against the other languages, until the language of each word or character has been identified.
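The sequential test (Chinese, then English, then French, then other languages) can be sketched as an ordered list of recognizers. The recognizer functions below are hypothetical stand-ins: in particular, treating accented letters as the marker for French is a deliberately crude assumption, and it would misclassify unaccented French words as English.

```python
from collections import Counter

# Assumption: accented letters are a crude marker for French in this sketch.
FRENCH_MARKS = set("àâæçéèêëîïôœùûüÿ")


def is_chinese(token):
    # Assumption: Chinese means every character is a CJK Unified Ideograph.
    return bool(token) and all("\u4e00" <= ch <= "\u9fff" for ch in token)


def is_english(token):
    return token.isascii() and token.isalpha()


def is_french(token):
    return token.isalpha() and any(ch in FRENCH_MARKS for ch in token.lower())


# The order of this list is the order in which the languages are tried.
RECOGNIZERS = [("Chinese", is_chinese), ("English", is_english), ("French", is_french)]


def identify(token):
    """Return the first language whose recognizer accepts the token."""
    for language, recognizer in RECOGNIZERS:
        if recognizer(token):
            return language
    return "other"


def count_languages(tokens):
    return Counter(identify(token) for token in tokens)


counts = count_languages(["我", "hello", "été", "こんにちは"])
```

Adding a language to the chain only requires appending another `(name, recognizer)` pair to the list.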
Preferably, a code base and a language model are set up for each language. The code base is traversed to make a preliminary identification of the language category of a word or character, and the word or character is then fully identified according to the language model of each language and specific rules.
Preferably, in step 3, for languages that do not separate words with spaces, the word count is the actual number of characters.
Preferably, in step 3, for languages that separate words with spaces, words are counted using spaces or punctuation marks as delimiters, and the spaces themselves are not counted.
As shown in Fig. 2, the specific steps of a word counting method are:
Prepare a multilingual document or a passage of text;
Read the text from the file or passage into memory in batches as required;
Use a punctuation algorithm to calculate the number of punctuation marks, and count them;
Pass the text in memory through a punctuation filtering algorithm to remove the punctuation marks, forming a new character string;
Read a word or character from the punctuation-free character string and pass it in turn through the Chinese recognition algorithm, the English recognition algorithm, the French recognition algorithm, and so on, until its language has been identified and the word or character has been fully recognized, then pass it to the counter;
Each language recognition algorithm first makes a preliminary identification of the language of a word or character according to the computer's Unicode code base. For words or characters that the Unicode code base cannot identify accurately, it then matches against language models, each trained on a large monolingual corpus, and performs probabilistic identification. Finally, it applies specific rules to fully identify the word or character.
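The two-stage idea, a code-range lookup first and a fallback for ambiguous characters second, might look like the sketch below. The range table and the fallback are assumptions: Han characters are shared by Chinese and Japanese, so the sketch resolves them with a simple context rule (any kana nearby means Japanese) in place of the trained language models the text describes, and Latin letters simply default to the first candidate.

```python
# Each entry: (first code point, last code point, candidate languages).
RANGES = [
    (0x3040, 0x30FF, ("ja",)),       # Hiragana and Katakana
    (0xAC00, 0xD7A3, ("ko",)),       # Hangul syllables
    (0x4E00, 0x9FFF, ("zh", "ja")),  # CJK Unified Ideographs, shared
    (0x0041, 0x005A, ("en", "fr")),  # A-Z, shared by many languages
    (0x0061, 0x007A, ("en", "fr")),  # a-z
]


def candidates(ch):
    """Stage 1: preliminary identification from the code point alone."""
    code_point = ord(ch)
    for first, last, languages in RANGES:
        if first <= code_point <= last:
            return languages
    return ()


def identify_char(ch, context):
    """Stage 2: resolve ambiguous characters using the surrounding text."""
    languages = candidates(ch)
    if len(languages) == 1:
        return languages[0]
    if languages == ("zh", "ja"):
        # Stand-in for a language model: kana in the context implies Japanese.
        has_kana = any(candidates(c) == ("ja",) for c in context)
        return "ja" if has_kana else "zh"
    if languages:
        # A real system would consult a language model here; this sketch
        # just takes the first candidate.
        return languages[0]
    return "other"
```

For example, `identify_char("私", "私は誰")` yields `"ja"` because the context contains hiragana, while `identify_char("我", "我是谁")` yields `"zh"`.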
The specific rules are as follows:
1. Languages that do not separate words with spaces, such as Chinese, Japanese, and Korean, are counted by the actual number of characters; for example, the Chinese "我是谁", the Japanese "私は誰", and a five-character Korean phrase count as 3, 3, and 5 respectively;
2. Languages that separate words with spaces, such as English, are counted using spaces or punctuation marks as delimiters; for example, "I am a Chinese, and you?" counts as 8;
3. Each punctuation mark counts as one word or character;
4. Each special character, such as # or &, counts as one word or character;
5. A run of consecutive digits counts as one word; for example, 123456 counts as 1;
6. A run of consecutive letters counts as one word; for example, abcdefg counts as 1;
7. When one or more letters, digits, or special characters are inserted into a run of consecutive digits or letters, each part is counted separately; for example, 123a456, 123abc456, 123456, abc2def, abc123def, and abc $ def each count as 3;
8. Spaces are not counted;
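The eight rules map naturally onto a single tokenizing regular expression. The sketch below is one possible reading, not the patent's own algorithm: the `CJK` character ranges are an assumption standing in for "languages without spaces", and punctuation is counted here as a token instead of being filtered out in a separate pass.

```python
import re

# Assumption: these ranges stand in for languages that do not use spaces
# (Hiragana/Katakana, CJK Unified Ideographs, Hangul syllables).
CJK = r"[\u3040-\u30ff\u4e00-\u9fff\uac00-\ud7a3]"

TOKEN = re.compile(
    CJK                                  # rule 1: each CJK character is one token
    + r"|\d+"                            # rule 5: a run of digits is one token
    + r"|(?:(?!" + CJK + r")[^\W\d_])+"  # rules 2, 6: a run of non-CJK letters
    + r"|\s+"                            # rule 8: whitespace, discarded below
    + r"|."                              # rules 3, 4: any other single character
)


def count_words(text):
    """Count tokens under the rules above, ignoring whitespace tokens."""
    return sum(1 for m in TOKEN.finditer(text) if not m.group().isspace())
```

Rule 7 needs no branch of its own: a letter run ends where a digit or special character begins, so `count_words("123a456")` and `count_words("abc$def")` both return 3, and `count_words("I am a Chinese, and you?")` returns 8 (six words plus two punctuation marks).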
The itemized counter records a separate count for each language and for punctuation marks;
The itemized records of the counter are output according to the actual business rules. For example, in a Chinese-to-English translation job, English in the file content is ignored and only Chinese is counted: English is not recorded, any other languages present are recorded and output, and punctuation marks are output.
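Applying a business rule to the itemized counter can be as small as a filter over its entries. In this sketch the function name, the language keys, and the sample numbers are all made up for illustration; only the rule itself (drop English for a Chinese-to-English job, keep everything else including punctuation) comes from the text.

```python
from collections import Counter


def billable_counts(counts, ignored_languages):
    """Return the itemized counts with the ignored languages left out."""
    return {name: n for name, n in counts.items() if name not in ignored_languages}


# Hypothetical itemized counter for one document (illustrative numbers only).
counts = Counter({"zh": 1200, "en": 300, "ja": 15, "punctuation": 140})

# Chinese-to-English job: English already in the source is not counted.
report = billable_counts(counts, ignored_languages={"en"})
```

Keeping the full counter and filtering only at output time means the same counts can serve different business rules without recounting the document.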
To solve the above problem, the invention provides different ways of counting words for text with different language characteristics. The text characteristics fall into three classes: Asian languages such as Chinese, Japanese, and Korean, which do not separate words with spaces; European languages, which separate words with spaces; and special characters or punctuation marks.
Embodiment 3. As shown in Fig. 3, a word counting device comprises: a reading module, for reading the text content, the text being read into memory in batches of a certain length; a punctuation identification module, for scanning the text in memory each time a batch has been read, identifying and counting the punctuation marks in it, and then removing the punctuation marks to form a new character string that contains no punctuation; a language identification module, for reading the words or characters in the punctuation-free character string, identifying the language of each one, and counting it; and an itemized statistics module, for adding up the successively counted numbers of punctuation marks and of words or characters of each language, each as a separate total.
Preferably, the language identification module identifies and counts the language as follows: each word or character is tested in turn. It is first tested for Chinese and, if it is Chinese, counted; if not, it is tested for English and, if it is English, counted; if not, it is tested for French and, if it is French, counted; if not, it is tested against the other languages, until the language of each word or character has been identified.
Preferably, a code base is set up for each language. The code base is traversed to make a preliminary identification of the language category of a word or character, and the word or character is then fully identified according to the characteristics of each language and specific rules.
Preferably, in the language identification module, for languages that do not separate words with spaces, the word count is the actual number of characters.
Preferably, in the language identification module, for languages that separate words with spaces, words are counted using spaces or punctuation marks as delimiters, and the spaces themselves are not counted.
This device corresponds one-to-one to the technical solution of the method above; for all explanations, refer to the method, which is not described again here.
The invention can accurately count the words in a passage or document that contains two or more languages; can accurately count the words in common document formats such as Word, Excel, and txt; and can count the words of large files quickly and accurately. The invention provides a word counting method and device that count the words of a passage or document language by language. This makes word counts more accurate and detailed, and makes counting the words of files in the translation field more convenient and less time-consuming. The invention can be used in the translation field to produce a combined translation quotation based on the different languages in a file's content.
Of course, the present invention may also have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art can make various corresponding changes and modifications according to the present invention, but all such changes and modifications shall fall within the protection scope of the claims of the present invention.

Claims (10)

1. A word counting method, characterized by comprising: step 1, reading the text content, the text being read into memory in batches of a certain length; step 2, each time a batch of text has been read into memory, scanning the text in memory, identifying and counting the punctuation marks in it, and then removing the punctuation marks to form a new character string that contains no punctuation; step 3, reading the words or characters in the punctuation-free character string, identifying the language of each one, and counting it; step 4, adding up the successively counted numbers of punctuation marks and of words or characters of each language, each as a separate total.
2. The word counting method according to claim 1, characterized in that, in step 3, the language is identified and counted as follows: each word or character is tested in turn; it is first tested for Chinese and, if it is Chinese, counted; if not, it is tested for English and, if it is English, counted; if not, it is tested for French and, if it is French, counted; if not, it is tested against the other languages, until the language of each word or character has been identified.
3. The word counting method according to claim 2, characterized in that a code base is set up for each language, the code base is traversed to make a preliminary identification of the language category of a word or character, and the word or character is then fully identified according to the language model of each language and specific rules.
4. The word counting method according to claim 1, characterized in that, in step 3, for languages that do not separate words with spaces, the word count is the actual number of characters.
5. The word counting method according to claim 1, characterized in that, in step 3, for languages that separate words with spaces, words are counted using spaces or punctuation marks as delimiters, and the spaces themselves are not counted.
6. A word counting device, characterized by comprising: a reading module, for reading the text content, the text being read into memory in batches of a certain length; a punctuation identification module, for scanning the text in memory each time a batch has been read, identifying and counting the punctuation marks in it, and then removing the punctuation marks to form a new character string that contains no punctuation; a language identification module, for reading the words or characters in the punctuation-free character string, identifying the language of each one, and counting it; and an itemized statistics module, for adding up the successively counted numbers of punctuation marks and of words or characters of each language, each as a separate total.
7. The word counting device according to claim 6, characterized in that the language identification module identifies and counts the language as follows: each word or character is tested in turn; it is first tested for Chinese and, if it is Chinese, counted; if not, it is tested for English and, if it is English, counted; if not, it is tested for French and, if it is French, counted; if not, it is tested against the other languages, until the language of each word or character has been identified.
8. The word counting device according to claim 7, characterized in that a code base and a language model are set up for each language, the code base is traversed to make a preliminary identification of the language category of a word or character, and the word or character is then fully identified according to the language model of each language and specific rules.
9. The word counting device according to claim 6, characterized in that, in the language identification module, for languages that do not separate words with spaces, the word count is the actual number of characters.
10. The word counting device according to claim 6, characterized in that, in the language identification module, for languages that separate words with spaces, words are counted using spaces or punctuation marks as delimiters, and the spaces themselves are not counted.
CN201610028758.5A 2016-01-15 2016-01-15 A kind of word counting method and device Active CN105608074B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610028758.5A CN105608074B (en) 2016-01-15 2016-01-15 A kind of word counting method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201610028758.5A CN105608074B (en) 2016-01-15 2016-01-15 A kind of word counting method and device

Publications (2)

Publication Number Publication Date
CN105608074A true CN105608074A (en) 2016-05-25
CN105608074B CN105608074B (en) 2018-06-29

Family

ID=55988018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610028758.5A Active CN105608074B (en) 2016-01-15 2016-01-15 A kind of word counting method and device

Country Status (1)

Country Link
CN (1) CN105608074B (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106354711A (en) * 2016-08-18 2017-01-25 中译语通科技(北京)有限公司 Method and device for language identification
CN106527876A (en) * 2016-11-10 2017-03-22 广东工业大学 Method and system for counting webpage word number
CN111160015A (en) * 2019-12-24 2020-05-15 北京明略软件系统有限公司 Method, device, computer storage medium and terminal for realizing text analysis
CN112446262A (en) * 2019-09-02 2021-03-05 深圳中兴网信科技有限公司 Text analysis method, text analysis device, text analysis terminal and computer-readable storage medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4049913A (en) * 1975-10-31 1977-09-20 Nippon Electric Company, Ltd. System for recognizing speech continuously spoken with number of word or words preselected
US20020013778A1 (en) * 1999-09-10 2002-01-31 Neal Michael Renn Sequential subset catalog search engine
JP2008065417A (en) * 2006-09-05 2008-03-21 Hottolink Inc Associative word group retrieval device and system, and content match type advertisement system
CN104281603A (en) * 2013-07-05 2015-01-14 北大方正集团有限公司 Word frequency grading statistical method and system
CN104699669A (en) * 2015-03-31 2015-06-10 中译语通科技(北京)有限公司 Text word-counting method and device
CN105204738A (en) * 2015-09-18 2015-12-30 北京奇虎科技有限公司 E-book reading quantity determining and ranking methods, terminal device and server

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
QIANG925 等: "java 统计字数 [问题点数:20分,结帖人qiang925]", 《HTTP://BBS.CSDN.NET/TOPICS/330038421》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106354711A (en) * 2016-08-18 2017-01-25 中译语通科技(北京)有限公司 Method and device for language identification
CN106527876A (en) * 2016-11-10 2017-03-22 广东工业大学 Method and system for counting webpage word number
CN112446262A (en) * 2019-09-02 2021-03-05 深圳中兴网信科技有限公司 Text analysis method, text analysis device, text analysis terminal and computer-readable storage medium
CN111160015A (en) * 2019-12-24 2020-05-15 北京明略软件系统有限公司 Method, device, computer storage medium and terminal for realizing text analysis
CN111160015B (en) * 2019-12-24 2024-03-05 北京明略软件系统有限公司 Method, device, computer storage medium and terminal for realizing text analysis

Also Published As

Publication number Publication date
CN105608074B (en) 2018-06-29

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100040 Shijingshan District railway building, Beijing, the 16 floor

Applicant after: Chinese translation language through Polytron Technologies Inc

Address before: 100040 Shijingshan District railway building, Beijing, the 16 floor

Applicant before: Mandarin Technology (Beijing) Co., Ltd.

CB02 Change of applicant information
GR01 Patent grant
GR01 Patent grant