CN105608074A

CN105608074A - Word counting method and device

Info

Publication number: CN105608074A
Application number: CN201610028758.5A
Authority: CN
Inventors: 王建华; 程国艮
Original assignee: Mandarin Technology (beijing) Co Ltd
Current assignee: Mandarin Technology (beijing) Co Ltd
Priority date: 2016-01-15
Filing date: 2016-01-15
Publication date: 2016-05-25
Anticipated expiration: 2036-01-15
Also published as: CN105608074B

Abstract

The invention discloses a word counting method and device and relates to the field of computer technology. It addresses the technical problem that the prior art cannot produce itemized, per-language word counts for a file or passage containing multiple languages. The word counting method comprises: step 1, reading the text content into memory in batches of a certain length; step 2, each time a batch has been read into memory, scanning the text in memory, identifying and counting the punctuation marks in the text, then removing them to form a new character string that contains no punctuation marks; step 3, reading the words or characters in the punctuation-free string, identifying the language of each one, and counting it; step 4, summing the punctuation counts and the per-language word or character counts accumulated across batches, each under its own item.

Description

A word counting method and device

Technical field

The present invention relates to the field of computer technology, and in particular to a word counting method and device.

Background art

The prior art for counting words in text written in a single language is relatively mature. The current difficulty in word counting arises when a passage or document contains two or more languages, such as mixed Chinese and English, or multilingual files combining French, Japanese, Korean and so on: the word count of each language cannot be tallied separately, itemized by language.

Summary of the invention

The technical problem to be solved by the present invention is that the prior art cannot produce itemized word counts, language by language, for a file or passage that contains multiple languages.

To solve the above problem, the present invention provides a word counting method, comprising: step 1, reading the text content, the text being read into memory in batches of a certain length; step 2, each time a batch of text has been read into memory, scanning the text in memory, identifying and counting the punctuation marks in the text, and then removing the punctuation marks to form a new character string that contains no punctuation marks; step 3, reading the words or characters in the string from which punctuation has been filtered, identifying the language of each word or character one by one, and counting it; step 4, summing the successively counted numbers of punctuation marks and of words or characters of each language, each under its own item.

The present invention also provides a word counting device, comprising: a reading module, for reading the text content, the text being read into memory in batches of a certain length; a punctuation identification module, for scanning the text in memory each time a batch of text has been read into memory, identifying and counting the punctuation marks in the text, and then removing the punctuation marks to form a new character string that contains no punctuation marks; a language identification module, for reading the words or characters in the string from which punctuation has been filtered, identifying the language of each word or character one by one, and counting it; and an itemized statistics module, for summing the successively counted numbers of punctuation marks and of words or characters of each language, each under its own item.

As can be seen from the above technical solution, the present invention provides a word counting method and device that count the words of a passage or document separately by language, making the count more accurate and detailed; this makes counting words in files convenient for the translation field and saves time.

Brief description of the drawings

Fig. 1 is a flowchart of a word counting method;

Fig. 2 is a flowchart of a word counting method;

Fig. 3 is a schematic structural diagram of a word counting device.

Detailed description of the invention

The technical solution of the present invention is described in detail below with reference to the drawings and embodiments.

It should be noted that, provided they do not conflict, the embodiments of the present invention and the features within them may be combined with one another, and all such combinations fall within the protection scope of the present invention.

Embodiment 1. As shown in Fig. 1, a word counting method comprises: step 1, reading the text content, the text being read into memory in batches of a certain length. The certain length may be a fixed number of bytes, a sentence, a paragraph or a whole article, and can be set as required. Step 2, each time a batch of text has been read into memory, scanning the text in memory, identifying and counting the punctuation marks in the text, and then removing the punctuation marks to form a new character string that contains no punctuation marks. Step 3, reading the words or characters in the string from which punctuation has been filtered, identifying the language of each word or character one by one, and counting it. Step 4, summing the successively counted numbers of punctuation marks and of words or characters of each language, each under its own item.
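Step 1 with a batch length given as a fixed number of bytes can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `read_text_batches`, the default batch size and the choice of UTF-8 are all assumptions. The sketch uses Python's incremental decoder so that a multi-byte character split across two batches is carried over instead of being corrupted.

```python
import codecs
import io


def read_text_batches(binary_stream, batch_bytes=4096, encoding="utf-8"):
    """Yield decoded text from a binary stream, reading batch_bytes at a time.

    A character whose bytes straddle a batch boundary is held back by the
    incremental decoder and emitted with the next batch.
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    while True:
        chunk = binary_stream.read(batch_bytes)
        if not chunk:
            # Flush the decoder; raises UnicodeDecodeError if the stream
            # ended in the middle of a character.
            tail = decoder.decode(b"", final=True)
            if tail:
                yield tail
            break
        text = decoder.decode(chunk)
        if text:
            yield text
```

For example, `list(read_text_batches(io.BytesIO(data), batch_bytes=4))` splits a UTF-8 byte string `data` into several text batches whose concatenation equals the decoded input. Batching by sentence or paragraph would need a different splitter.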

The present invention provides a word counting method that counts the words of a passage or document separately by language, making the count more accurate and detailed; this makes counting words in files convenient for the translation field and saves time.

Embodiment 2. As shown in Fig. 2, on the basis of Embodiment 1, preferably, the specific procedure in step 3 for identifying the corresponding language and counting is as follows: test in turn whether the word or character is Chinese and, if so, count it; if not, test whether it is English and, if so, count it; if not, test whether it is French and, if so, count it; if not, test the remaining languages in the same way, until the language of each word or character has been identified.
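The sequential test described above (Chinese, then English, then French, then other languages) can be sketched as an ordered list of recognizers in which the first match wins. The three predicates below are deliberately crude stand-ins and are assumptions of this sketch, not the recognizers of the invention: Chinese is approximated by the CJK Unified Ideographs block, English by pure ASCII letters, and French by the presence of an accented letter.

```python
from collections import Counter

# Accented letters used here as a rough hint for French (assumption).
FRENCH_MARKS = set("àâçéèêëîïôùûüÿœæ")


def is_chinese(token):
    return bool(token) and all("\u4e00" <= ch <= "\u9fff" for ch in token)


def is_english(token):
    return token.isascii() and token.isalpha()


def is_french(token):
    return token.isalpha() and any(ch in FRENCH_MARKS for ch in token.lower())


# Order matters: each token is tested in turn until one recognizer accepts it.
RECOGNIZERS = [
    ("chinese", is_chinese),
    ("english", is_english),
    ("french", is_french),
]


def identify(token):
    """Return the name of the first recognizer that accepts the token."""
    for name, accepts in RECOGNIZERS:
        if accepts(token):
            return name
    return "other"


def count_by_language(tokens):
    """Tally tokens per identified language."""
    counts = Counter()
    for token in tokens:
        counts[identify(token)] += 1
    return counts
```

With these stand-ins an unaccented French word such as "bonjour" is accepted by the English test first, which is the kind of case the language model of the next paragraph is meant to resolve.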

Preferably, a code base and a language model are set up for each language. The code base is traversed to make a preliminary identification of the language category of a word or character; then, according to the language model of each language and a set of specific rules, a complete word or character is identified.
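A two-stage identification of this kind might look like the sketch below. Everything named here is hypothetical: `CODE_BASE` is a small table of Unicode code point ranges standing in for the per-language code base, and `MODELS` is a toy word-frequency table standing in for a trained language model, with made-up probabilities. The code base settles tokens whose script is unambiguous; Latin-script words, which the ranges alone cannot assign to one language, fall through to the model.

```python
# Stage 1: hypothetical "code base" of Unicode ranges per language or script.
CODE_BASE = {
    "japanese": [(0x3040, 0x30FF)],  # Hiragana and Katakana
    "korean": [(0xAC00, 0xD7A3)],    # Hangul syllables
    "chinese": [(0x4E00, 0x9FFF)],   # CJK ideographs (also used for Japanese kanji)
    "latin": [(0x0041, 0x005A), (0x0061, 0x007A), (0x00C0, 0x024F)],
}

# Stage 2: toy "language models"; the words and probabilities are invented.
MODELS = {
    "english": {"the": 0.05, "and": 0.03, "you": 0.02, "chat": 0.0001},
    "french": {"le": 0.05, "et": 0.03, "vous": 0.02, "chat": 0.002},
}


def preliminary(ch):
    """Traverse the code base and return the first range that contains ch."""
    cp = ord(ch)
    for language, ranges in CODE_BASE.items():
        if any(low <= cp <= high for low, high in ranges):
            return language
    return "other"


def identify_word(word):
    """Identify a word by script first, then by the language models."""
    if not word:
        return "other"
    scripts = {preliminary(ch) for ch in word}
    if scripts != {"latin"}:
        # Script alone decides (or the word is mixed): use the first character.
        return preliminary(word[0])
    scores = {lang: model.get(word.lower(), 0.0) for lang, model in MODELS.items()}
    best = max(scores, key=scores.get)
    # A word unknown to every model stays at the script level.
    return best if scores[best] > 0 else "latin"
```

In this sketch a kanji is reported as Chinese because the ideograph range is shared; separating Chinese from Japanese in such cases would again need context or a model, which is outside what the sketch attempts.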

Preferably, in step 3, the specific procedure for identifying the corresponding language and counting is: for languages that do not put spaces between words, the word count is calculated as the actual number of characters.

Preferably, in step 3, the specific procedure for identifying the corresponding language and counting is: for languages that separate words with spaces, the word count is calculated using spaces or punctuation marks as delimiters; the spaces themselves are not counted.

As shown in Fig. 2, the specific steps of a word counting method are:

Prepare a multilingual document or a string of text;

Read the text from the file or passage into memory in batches, as required;

Use a punctuation algorithm to calculate the number of punctuation marks, and count them;

Pass the text in memory through a punctuation filtering algorithm to remove the punctuation marks and form a new character string;

Read a word or character from the punctuation-free string and pass it in turn through the Chinese recognition algorithm, the English recognition algorithm, the French recognition algorithm and so on, until the corresponding language has been identified and a complete word or character recognized; then pass it to the counter;

Each language recognition algorithm first makes a preliminary identification of the language of a word or character according to the computer's UNICODE code base. For words or characters that the UNICODE code base cannot identify precisely, it then matches against language models, each trained on a large monolingual corpus of its own language, to make a probabilistic identification. Finally, according to certain specific rules, a complete word or character is identified.

The specific rules are as follows:

1. For languages such as Chinese, Japanese and Korean, which do not put spaces between words, the count is the actual number of characters; for example, 我是谁 (Chinese), 私は誰 (Japanese) and a five-character Korean phrase are counted as 3, 3 and 5 respectively;

2. For languages such as English, which separate words with spaces, the word count is calculated using spaces or punctuation marks as delimiters; for example, "I am a Chinese, and you?" is counted as 8;

3. Each punctuation mark is counted as one word or character;

4. Each special character, such as # or &, is counted as one word or character;

5. A continuous run of digits is counted as one word; for example, 123456 has a word count of 1;

6. A continuous run of letters is counted as one word; for example, abcdefg has a word count of 1;

7. Where one or more letters, digits or special characters are inserted into a continuous run of digits or letters, the parts are counted separately; for example, 123a456, 123abc456, 123456, abc2def, abc123def and abc$def each have a word count of 3;

8. Spaces are not counted;
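Rules 1 to 8 can be read as a single pass over the text that tracks whether the current character continues a run of digits or a run of letters. The sketch below is one possible reading under stated assumptions: the function names are hypothetical, and "languages without spaces" is approximated by three Unicode ranges (CJK ideographs, kana, Hangul syllables). It returns one overall count rather than per-language counts, and it follows the document's convention that every punctuation mark counts as one unit.

```python
def is_spaceless_script(ch):
    """Rough test for Chinese, Japanese and Korean characters (assumption)."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF      # CJK Unified Ideographs
        or 0x3040 <= cp <= 0x30FF   # Hiragana and Katakana
        or 0xAC00 <= cp <= 0xD7A3   # Hangul syllables
    )


def count_units(text):
    """Count words/characters in text according to rules 1 to 8."""
    count = 0
    run = None  # kind of run in progress: "digit", "alpha" or None
    for ch in text:
        if ch.isspace():
            run = None              # rule 8: spaces end a run, are not counted
        elif is_spaceless_script(ch):
            count += 1              # rule 1: one unit per character
            run = None
        elif ch.isdigit():
            if run != "digit":      # rules 5 and 7: a new digit run is one unit
                count += 1
            run = "digit"
        elif ch.isalpha():
            if run != "alpha":      # rules 2, 6 and 7: a new letter run is one unit
                count += 1
            run = "alpha"
        else:
            count += 1              # rules 3 and 4: punctuation, special characters
            run = None
    return count
```

Applied to the examples in the rules, `count_units("I am a Chinese, and you?")` gives 8 (six words plus the comma and the question mark), `count_units("123a456")` gives 3 and `count_units("abcdefg")` gives 1.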

An itemized counter records the separate counts for each language and for punctuation marks;

The itemized counter data are output according to the actual business rules. For example, in Chinese-to-English translation, English in the file content is ignored: only Chinese is counted and English is not recorded; if the content contains other languages, they are recorded and output, and punctuation marks are also output.
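The business rule in this example amounts to dropping the target language from the itemized counts before output. A minimal sketch, assuming the counts are held in a dictionary keyed by hypothetical item names such as "chinese", "english" and "punctuation":

```python
def report(counts, target_language):
    """Return the itemized counts to output for a translation job.

    Text already in the target language needs no translation, so its count
    is left out; every other language and the punctuation count are kept.
    """
    return {item: n for item, n in counts.items() if item != target_language}
```

For a Chinese-to-English job, `report(counts, target_language="english")` keeps the Chinese count, any other languages found, and the punctuation count. Other business rules, such as also dropping punctuation, would be separate filters.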

To solve the above problem, the present invention provides different ways of calculating the word count for text with different language features. There are three classes of text feature in total: Asian languages such as Chinese, Japanese and Korean, which do not put spaces between words; European languages, which separate words with spaces; and special characters or punctuation marks.

Embodiment 3. As shown in Fig. 3, a word counting device comprises: a reading module, for reading the text content, the text being read into memory in batches of a certain length; a punctuation identification module, for scanning the text in memory each time a batch of text has been read into memory, identifying and counting the punctuation marks in the text, and then removing the punctuation marks to form a new character string that contains no punctuation marks; a language identification module, for reading the words or characters in the string from which punctuation has been filtered, identifying the corresponding language of each word or character one by one, and counting it; and an itemized statistics module, for summing the successively counted numbers of punctuation marks and of words or characters of each language, each under its own item.
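The division of the device into four modules can be illustrated with four small functions chained together. This is only a structural sketch under simplifying assumptions: the function names are hypothetical, batches are fixed numbers of characters, punctuation is detected through Unicode general categories beginning with "P", and the language module tallies single characters into three coarse classes instead of recognizing whole words.

```python
import unicodedata
from collections import Counter


def read_module(text, batch_chars=1024):
    """Reading module: hand the text over in batches of batch_chars characters."""
    for start in range(0, len(text), batch_chars):
        yield text[start:start + batch_chars]


def punctuation_module(batch):
    """Punctuation module: return (punctuation count, batch without punctuation)."""
    kept = [ch for ch in batch if not unicodedata.category(ch).startswith("P")]
    return len(batch) - len(kept), "".join(kept)


def language_module(stripped):
    """Language module: crude per-character tally (a stand-in for real recognition)."""
    counts = Counter()
    for ch in stripped:
        if ch.isspace():
            continue                      # spaces are not counted
        if "\u4e00" <= ch <= "\u9fff":
            counts["chinese"] += 1
        elif ch.isascii() and ch.isalpha():
            counts["latin"] += 1
        else:
            counts["other"] += 1
    return counts


def itemized_statistics_module(text, batch_chars=1024):
    """Itemized statistics module: add up the per-batch results item by item."""
    totals = Counter()
    for batch in read_module(text, batch_chars):
        punctuation, stripped = punctuation_module(batch)
        totals["punctuation"] += punctuation
        totals.update(language_module(stripped))
    return totals
```

Because the language module here counts characters, a batch boundary cannot split a counted unit; a word-level module would have to carry an unfinished word over to the next batch.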

Preferably, the specific procedure by which the language identification module identifies the corresponding language and counts is as follows: test in turn whether the word or character is Chinese and, if so, count it; if not, test whether it is English and, if so, count it; if not, test whether it is French and, if so, count it; if not, test the remaining languages in the same way, until the language of each word or character has been identified.

Preferably, a code base is set up for each language. The code base is traversed to make a preliminary identification of the language category of a word or character; then, according to the features of each language and the specific rules, a complete word or character is identified.

Preferably, the specific procedure by which the language identification module identifies the corresponding language and counts is: for languages that do not put spaces between words, the word count is calculated as the actual number of characters.

Preferably, the specific procedure by which the language identification module identifies the corresponding language and counts is: for languages that separate words with spaces, the word count is calculated using spaces or punctuation marks as delimiters; the spaces themselves are not counted.

This device corresponds one-to-one with the technical solution of the method above; for all explanations, refer to the method, which is not repeated here.

The invention can accurately count the words in a passage or document containing two or more languages; can accurately count the words in common document formats such as word, excel and txt; and can count the words of large files quickly and accurately. The present invention provides a word counting method and device that count the words of a passage or document separately by language, making the count more accurate and detailed; this makes counting words in files convenient for the translation field and saves time. The present invention can be used in the translation field to produce a combined translation quotation according to the different languages in the file content.

Of course, the present invention may have various other embodiments. Without departing from the spirit and essence of the present invention, those of ordinary skill in the art can make various corresponding changes and modifications according to the present invention, but all such changes and modifications shall fall within the protection scope of the claims of the present invention.

Claims

1. A word counting method, characterized by comprising: step 1, reading the text content, the text being read into memory in batches of a certain length; step 2, each time a batch of text has been read into memory, scanning the text in memory, identifying and counting the punctuation marks in the text, and then removing the punctuation marks to form a new character string that contains no punctuation marks; step 3, reading the words or characters in the string from which punctuation has been filtered, identifying the language of each word or character one by one, and counting it; step 4, summing the successively counted numbers of punctuation marks and of words or characters of each language, each under its own item.

2. The word counting method according to claim 1, characterized in that, in step 3, the specific procedure for identifying the corresponding language and counting is: testing in turn whether the word or character is Chinese and, if so, counting it; if not, testing whether it is English and, if so, counting it; if not, testing whether it is French and, if so, counting it; if not, testing whether it is another language, until the language of each word or character has been identified.

3. The word counting method according to claim 2, characterized in that a code base is set up for each language; the code base is traversed to make a preliminary identification of the language category of a word or character; then, according to the language model of each language and specific rules, a complete word or character is identified.

4. The word counting method according to claim 1, characterized in that, in step 3, the specific procedure for identifying the corresponding language and counting is: for languages that do not put spaces between words, calculating the word count as the actual number of characters.

5. The word counting method according to claim 1, characterized in that, in step 3, the specific procedure for identifying the corresponding language and counting is: for languages that separate words with spaces, calculating the word count using spaces or punctuation marks as delimiters, the spaces themselves not being counted.

6. A word counting device, characterized by comprising: a reading module, for reading the text content, the text being read into memory in batches of a certain length; a punctuation identification module, for scanning the text in memory each time a batch of text has been read into memory, identifying and counting the punctuation marks in the text, and then removing the punctuation marks to form a new character string that contains no punctuation marks; a language identification module, for reading the words or characters in the string from which punctuation has been filtered, identifying the language of each word or character one by one, and counting it; and an itemized statistics module, for summing the successively counted numbers of punctuation marks and of words or characters of each language, each under its own item.

7. The word counting device according to claim 6, characterized in that the specific procedure by which the language identification module identifies the corresponding language and counts is: testing in turn whether the word or character is Chinese and, if so, counting it; if not, testing whether it is English and, if so, counting it; if not, testing whether it is French and, if so, counting it; if not, testing whether it is another language, until the language of each word or character has been identified.

8. The word counting device according to claim 7, characterized in that a code base and a language model are set up for each language; the code base is traversed to make a preliminary identification of the language category of a word or character; then, according to the language model of each language and specific rules, a complete word or character is identified.

9. The word counting device according to claim 6, characterized in that the specific procedure by which the language identification module identifies the corresponding language and counts is: for languages that do not put spaces between words, calculating the word count as the actual number of characters.

10. The word counting device according to claim 6, characterized in that the specific procedure by which the language identification module identifies the corresponding language and counts is: for languages that separate words with spaces, calculating the word count using spaces or punctuation marks as delimiters, the spaces themselves not being counted.