CN105760364A - Character set detection method and device - Google Patents

Character set detection method and device

Info

Publication number
CN105760364A
Authority
CN
China
Prior art keywords
character set
classification
participle
character
word segmentation
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201610096192.XA
Other languages
Chinese (zh)
Other versions
CN105760364B (en)
Inventor
徐佳宏
朱吕亮
陈栋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Ipanel TV Inc
Original Assignee
Shenzhen Ipanel TV Inc
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Ipanel TV Inc
Priority to CN201610096192.XA
Publication of CN105760364A
Application granted
Publication of CN105760364B
Active
Anticipated expiration

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Abstract

The invention discloses a character set detection method and device. The method comprises: receiving a character set to be processed; selecting character set categories one by one from a prestored set of character set categories; decoding the character set to be processed according to the coding rule corresponding to each selected category; recording the decoding result whenever decoding succeeds; performing word segmentation on the decoding results of all categories in the set that decoded successfully, to obtain word segmentation results; and determining, based on the word segmentation results, the target character set category to which the character set to be processed belongs. Because the target category is determined through word segmentation, the method is a form of semantic detection; compared with the grammar (syntax) detection used in the prior art, it is more accurate and raises the success rate of character set category detection.

Description

Character set detection method and device
Technical field
The present invention relates to the technical field of encoding and decoding, and in particular to a character set detection method and device.
Background
A character is the general term for all kinds of letters and symbols, and a character set is a collection of characters. In the computer field there are many kinds of character sets, such as the ASCII, GB2312, UTF-8 and GBK character sets. Different kinds of character sets have different coding rules. Therefore, when a system receives a character set to be processed, it must first determine the category to which that character set belongs so that subsequent operations can rely on it, for example decoding the character set to be processed according to the coding rule of its category and displaying the decoded content. When the received character set is not labeled with its category, the system has to detect the category itself.
In the prior art, the category is detected from the coding rules of the character sets, that is, by grammar (syntax) detection against the coding rule of each character set. With this approach, if the character set to be processed satisfies the coding rules of at least two categories, the system cannot determine which category it really belongs to, and detection errors occur easily. For example, if the character set to be processed was encoded as UTF-8 but the system detects it as GBK, decoding it with the GBK coding rule produces garbled text; the detection has clearly failed.
Therefore, how to improve the success rate of detecting the category of a character set to be processed has become a technical problem that urgently needs solving.
Summary of the invention
In view of this, the present invention provides a character set detection method and device to improve the accuracy of detecting the category to which a character set to be processed belongs.
To achieve the above object, the present invention provides the following technical solutions:
A character set detection method, comprising:
Receiving a character set to be processed;
Selecting character set categories one by one from a prestored set of character set categories;
Decoding the character set to be processed according to the coding rule corresponding to the selected character set category;
Recording the decoding result when decoding succeeds;
Performing word segmentation on the decoding results corresponding to all successfully decoded character set categories in the set, to obtain word segmentation results;
Determining, based on the word segmentation results, the target character set category to which the character set to be processed belongs.
Preferably, determining the target character set category based on the word segmentation results comprises:
Counting, in the word segmentation results, the number of characters in segmentable fragments and the total number of decoded characters;
Calculating the ratio of the number of characters in segmentable fragments to the total number of decoded characters, to produce a segmentable ratio;
Determining the character set category with the largest segmentable ratio as the target character set category to which the character set to be processed belongs.
Preferably, determining the target character set category based on the word segmentation results comprises:
Counting, in the word segmentation results, the number of characters in segmentable fragments;
Determining the character set category with the largest number of characters in segmentable fragments as the target character set category to which the character set to be processed belongs.
Preferably, performing word segmentation on the decoding results corresponding to all successfully decoded character set categories in the set, to obtain word segmentation results, specifically comprises:
Based on a prestored dictionary, applying forward maximum matching to the decoding results corresponding to all successfully decoded character set categories in the set.
Preferably, after receiving the character set to be processed, the method further comprises:
Judging whether the character set to be processed is labeled with the character set category to which it belongs;
Selecting character set categories one by one from the prestored set then specifically comprises:
When it is determined that the character set to be processed is not labeled with its character set category, selecting character set categories one by one from the prestored set of character set categories.
A character set detection device, comprising:
A first receiving unit, configured to receive a character set to be processed;
A first selecting unit, configured to select character set categories one by one from a prestored set of character set categories;
A first decoding unit, configured to decode the character set to be processed according to the coding rule corresponding to the selected character set category;
A first recording unit, configured to record the decoding result when decoding succeeds;
A word segmentation unit, configured to perform word segmentation on the decoding results corresponding to all successfully decoded character set categories in the set, to obtain word segmentation results;
A category determining unit, configured to determine, based on the word segmentation results, the target character set category to which the character set to be processed belongs.
Preferably, the category determining unit comprises:
A first statistics module, configured to count, in the word segmentation results, the number of characters in segmentable fragments and the total number of decoded characters;
A calculation module, configured to calculate the ratio of the number of characters in segmentable fragments to the total number of decoded characters, to produce a segmentable ratio;
A first determining module, configured to determine the character set category with the largest segmentable ratio as the target character set category to which the character set to be processed belongs.
Preferably, the category determining unit comprises:
A second statistics module, configured to count, in the word segmentation results, the number of characters in segmentable fragments;
A second determining module, configured to determine the character set category with the largest number of characters in segmentable fragments as the target character set category to which the character set to be processed belongs.
Preferably, the word segmentation unit is specifically configured to apply, based on a prestored dictionary, forward maximum matching to the decoding results corresponding to all successfully decoded character set categories in the set, to obtain word segmentation results.
Preferably, the device further comprises:
A first judging unit, configured to judge whether the character set to be processed is labeled with the character set category to which it belongs;
The first selecting unit is specifically configured to select character set categories one by one from the prestored set of character set categories when it is determined that the character set to be processed is not labeled with its character set category.
As can be seen from the above technical solutions, compared with the prior art, the embodiments of the present invention provide a character set detection method. Specifically, when a character set to be processed is received, character set categories are selected one by one from a prestored set of character set categories; the character set to be processed is decoded according to the coding rule corresponding to each selected category; the decoding result is recorded when decoding succeeds; word segmentation is performed on the decoding result of each successfully decoded category to obtain word segmentation results; and the target character set category to which the character set to be processed belongs is determined from those results. Determining the target category through word segmentation is a form of semantic detection; compared with the grammar (syntax) detection used in the prior art, it is more accurate and improves the success rate of character set category detection.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing them are briefly introduced below. Obviously, the drawings described below show only embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic flowchart of a character set detection method disclosed in one embodiment of the present invention;
Fig. 2 is a schematic flowchart of a character set detection method disclosed in a further embodiment of the present invention;
Fig. 3 is a schematic structural diagram of a character set detection device disclosed in one embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a character set detection device disclosed in a further embodiment of the present invention.
Detailed description
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments that a person of ordinary skill in the art obtains from these embodiments without creative effort fall within the scope of protection of the present invention.
One embodiment of the present invention discloses a character set detection method. As shown in Fig. 1, the method comprises the following steps:
Step 101: receive a character set to be processed;
Step 102: select character set categories one by one from a prestored set of character set categories;
The set of character set categories prestored in the system contains multiple categories, such as UTF-8, GBK, ASCII, BIG5 and GB18030. Which categories the set contains can be preset by the user; the present invention is not limited in this respect.
After the character set to be processed is received, character set categories are selected one by one from the prestored set. That is, one category is selected from the set at a time for subsequent processing, until every category in the set has been selected. Suppose the set is CS and contains three categories, C1, C2 and C3. Then C1 may be selected first for subsequent processing, followed by C2, and finally C3.
Step 103: decode the character set to be processed according to the coding rule corresponding to the selected character set category;
Each character set category corresponds to one coding rule, and this correspondence is preset in the system.
Step 104: record the decoding result when decoding succeeds;
Decoding the character set to be processed has two possible outcomes. In the first, the character set to be processed conforms to the coding rule of the currently selected category, so decoding succeeds and the decoding result is recorded. In the second, it does not conform to the coding rule of the currently selected category, so decoding fails and the currently selected category is discarded. After decoding with the current category finishes, a new category is selected from the set and the character set to be processed is decoded again, until every category in the set has been used for decoding.
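Steps 102 to 104 can be sketched as a loop over candidate encodings. This is a minimal sketch rather than the patent's implementation: it assumes Python's built-in codecs stand in for the per-category coding rules, and the names `CANDIDATE_CHARSETS` and `decode_with_candidates` are hypothetical.

```python
# Hypothetical candidate list; the patent leaves the set user-configurable.
CANDIDATE_CHARSETS = ["utf-8", "gbk", "ascii", "big5", "gb18030"]


def decode_with_candidates(data: bytes, candidates=CANDIDATE_CHARSETS) -> dict:
    """Try each candidate category in turn and record successful decodes."""
    results = {}
    for name in candidates:
        try:
            # Strict decoding raises when the bytes violate the coding rule.
            results[name] = data.decode(name)
        except UnicodeDecodeError:
            # Decoding failed: discard this category and move on.
            continue
    return results


if __name__ == "__main__":
    sample = bytes.fromhex("E4BDA0E5A5BD68656C6C6F")
    for charset, text in decode_with_candidates(sample).items():
        print(charset, repr(text))
```

More than one category can survive this step, which is exactly the ambiguity that the later word segmentation step is meant to resolve.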
Step 105: perform word segmentation on the decoding results corresponding to all successfully decoded character set categories in the set, to obtain word segmentation results;
The decoding results can be segmented using a prestored dictionary. The dictionary contains a large vocabulary, which can be collected in advance; an existing dictionary, such as the dictionary of an input method, can also be used directly.
It should be noted that the present invention does not limit the form or content of the vocabulary in the dictionary. For example, the dictionary may include various words and phrases built around "you" (given here as English renderings of the Chinese entries), such as "you, hello, you I, compete with one another, life-and-death, both of you, you come, and I am past, your my pressure whole, you etc. I, It is nice that you are fine".
When the decoding results are segmented using the dictionary, forward maximum matching can be used. Accordingly, step 105 specifically comprises: based on the prestored dictionary, applying forward maximum matching to the decoding results corresponding to all successfully decoded character set categories in the set, to obtain word segmentation results.
For example, suppose the decoding result contains the four-character phrase rendered here as "It is nice that you are fine". With the dictionary above, forward maximum matching segments it as the single word "It is nice that you are fine" rather than as the two words "hello" and "all right". Of course, other algorithms may also be used to segment the decoding results; the present invention is not limited in this respect.
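Forward maximum matching can be sketched as follows: at each position, take the longest dictionary word that starts there, and fall back to a single unmatched character when no word fits. The function name `forward_max_match` and the small dictionary are assumptions made for illustration; the entries are ordinary Chinese words chosen for this sketch, not the patent's own word list.

```python
def forward_max_match(text, dictionary, max_len=None):
    """Segment text left to right, preferring the longest dictionary word.

    Returns a list of (token, in_dictionary) pairs.
    """
    if max_len is None:
        max_len = max((len(word) for word in dictionary), default=1)
    tokens = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest window first, then shrink it one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            tokens.append((text[i], False))  # unmatched single character
            i += 1
        else:
            tokens.append((match, True))
            i += len(match)
    return tokens


if __name__ == "__main__":
    # Illustrative dictionary (an assumption, not taken from the patent).
    words = {"你", "你好", "我好", "你好我好"}
    print(forward_max_match("你好我好", words))
    print(forward_max_match("你好hello", words))
```

With this dictionary the four-character input is kept as one word, because the longest match wins over the two shorter words it contains.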
Step 106: determine, based on the word segmentation results, the target character set category to which the character set to be processed belongs.
Determining the target character set category from the word segmentation results amounts to semantic detection. Compared with the grammar (syntax) detection used in the prior art, it is more accurate and improves the success rate of character set category detection.
Another embodiment of the present invention discloses a character set detection method. Unlike the embodiment above, this embodiment mainly describes different ways of determining the target character set category from the word segmentation results, as follows:
In the first implementation, determining the target character set category from the word segmentation results comprises the following steps:
SA1: count, in the word segmentation results, the number of characters in segmentable fragments and the total number of decoded characters;
After word segmentation has been performed on the decoding result of each successfully decoded category, the number of characters in segmentable fragments and the total number of decoded characters are counted for that category.
SA2: calculate the ratio of the number of characters in segmentable fragments to the total number of decoded characters, to produce a segmentable ratio;
That is: segmentable ratio = total number of characters in segmentable fragments / total number of decoded characters
SA3: determine the character set category with the largest segmentable ratio as the target character set category to which the character set to be processed belongs.
Specifically, after the segmentable ratio has been calculated for every successfully decoded category in the set, the category with the largest segmentable ratio is determined to be the target character set category to which the character set to be processed belongs.
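Steps SA1 to SA3 can be sketched in a few lines. The sketch assumes each category's word segmentation result is a list of (token, in_dictionary) pairs; that representation and the names `segmentable_ratio` and `pick_by_ratio` are hypothetical.

```python
def segmentable_ratio(segments):
    """SA1 + SA2: characters in dictionary words / total decoded characters."""
    total = sum(len(token) for token, _ in segments)
    matched = sum(len(token) for token, in_dict in segments if in_dict)
    return matched / total if total else 0.0


def pick_by_ratio(segmented_by_category):
    """SA3: return the category whose result has the largest segmentable ratio."""
    return max(
        segmented_by_category,
        key=lambda name: segmentable_ratio(segmented_by_category[name]),
    )


if __name__ == "__main__":
    # Toy segmentation results for two hypothetical categories.
    toy = {
        "A": [("ab", True), ("c", False)],
        "B": [("x", False), ("y", False)],
    }
    print(pick_by_ratio(toy))
```

An empty decoding result is given a ratio of 0 here to avoid dividing by zero; the patent does not discuss that case.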
For ease of understanding, consider a concrete example. Suppose the character set to be processed that the system receives is the hexadecimal byte sequence "E4BDA0E5A5BD68656C6C6F" (which is in fact UTF-8 encoded). This byte sequence satisfies the coding rules of both UTF-8 and GBK. Decoding it with the coding rule of UTF-8 succeeds and yields "你好hello" (the Chinese greeting "hello" followed by the English word "hello"), as shown in Table 1:
Table 1
Decoding it with the coding rule of GBK also succeeds and yields "浣犲ソhello", as shown in Table 2:
Table 2
As can be seen, although the character set to be processed also satisfies the coding rule of GBK, an existing detection method that classifies it as GBK has clearly detected wrongly, and the decoded content will be garbled.
In the present invention, suppose the set of character set categories contains UTF-8 and GBK. After the successful decoding results above are recorded, the target category of the character set to be processed can be determined by calculating the segmentable ratio. Specifically, for the UTF-8 decoding result "你好hello", the content that can be segmented using the dictionary is "你好", so the number of characters in segmentable fragments is 2 and the total number of decoded characters is 7; the segmentable ratio is 2/7. For the GBK decoding result "浣犲ソhello", no word can be segmented using the dictionary, so the number of characters in segmentable fragments is 0 and the total number of decoded characters is 8; the segmentable ratio is 0/8. The category with the largest segmentable ratio is clearly UTF-8, so the target character set category of the character set to be processed is determined to be UTF-8.
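The worked example can be reproduced end to end. The sketch below assumes Python's `utf-8` and `gbk` codecs stand in for the two coding rules and uses a one-word dictionary containing only 你好 (the Chinese greeting in the example); the names `DATA`, `DICTIONARY`, `CANDIDATES` and `segmentable_chars` are hypothetical.

```python
from fractions import Fraction

DATA = bytes.fromhex("E4BDA0E5A5BD68656C6C6F")
DICTIONARY = {"你好"}           # assumed one-word dictionary
CANDIDATES = ["utf-8", "gbk"]   # the two categories named in the example


def segmentable_chars(text, dictionary):
    """Count characters covered by dictionary words (forward maximum matching)."""
    longest = max(len(word) for word in dictionary)
    covered, i = 0, 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + size] in dictionary:
                covered += size
                i += size
                break
        else:
            i += 1  # no dictionary word starts at this position
    return covered


ratios = {}
for name in CANDIDATES:
    try:
        decoded = DATA.decode(name)
    except UnicodeDecodeError:
        continue  # decoding failed: discard this category
    ratios[name] = Fraction(segmentable_chars(decoded, DICTIONARY), len(decoded))

target = max(ratios, key=ratios.get)

if __name__ == "__main__":
    print(ratios, target)
```

Both decodes succeed, so a purely rule-based check cannot separate them; the ratios are 2/7 for UTF-8 and 0 for GBK, and `target` is `"utf-8"`.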
In the second implementation, determining the target character set category from the word segmentation results comprises the following steps:
SB1: count, in the word segmentation results, the number of characters in segmentable fragments;
SB2: determine the character set category with the largest number of characters in segmentable fragments as the target character set category to which the character set to be processed belongs.
Taking the example from the first implementation again: for the UTF-8 decoding result "你好hello", the content that can be segmented using the dictionary is "你好", so the number of characters in segmentable fragments is 2. For the GBK decoding result "浣犲ソhello", no word can be segmented using the dictionary, so the number of characters in segmentable fragments is 0. The category with the largest number of characters in segmentable fragments is therefore UTF-8, and the target character set category of the character set to be processed is determined to be UTF-8.
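The two selection rules agree on the example above but are not equivalent in general, because the count is not normalized by the length of the decoded text. The sketch below assumes each category is summarized as a (segmentable characters, total characters) pair with a non-zero total; the function names and the `made_up` figures are hypothetical and serve only to show a case where the rules diverge.

```python
def pick_by_count(stats):
    """Second implementation: largest number of segmentable characters."""
    return max(stats, key=lambda name: stats[name][0])


def pick_by_ratio(stats):
    """First implementation: largest segmentable ratio (totals assumed > 0)."""
    return max(stats, key=lambda name: stats[name][0] / stats[name][1])


# (segmentable characters, total decoded characters) per category.
example = {"utf-8": (2, 7), "gbk": (0, 8)}   # figures from the worked example
made_up = {"A": (4, 10), "B": (3, 4)}        # invented to show a disagreement

if __name__ == "__main__":
    print(pick_by_count(example), pick_by_ratio(example))
    print(pick_by_count(made_up), pick_by_ratio(made_up))
```

On the invented figures the count rule favors category A (4 characters against 3) while the ratio rule favors B (0.75 against 0.4), so the choice of rule can matter when candidate decodings differ a lot in length.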
A further embodiment of the present invention discloses a character set detection method. As shown in Fig. 2, the method comprises the following steps:
Step 201: receive a character set to be processed;
Step 202: judge whether the character set to be processed is labeled with the character set category to which it belongs; if not, go to step 203; if so, go to step 208;
Step 203: select character set categories one by one from a prestored set of character set categories;
When it is determined that the character set to be processed is not labeled with its character set category, categories are selected one by one from the prestored set in order to determine the target character set category to which the character set to be processed belongs.
Step 204: decode the character set to be processed according to the coding rule corresponding to the selected character set category;
Step 205: record the decoding result when decoding succeeds;
Step 206: perform word segmentation on the decoding results corresponding to all successfully decoded character set categories in the set, to obtain word segmentation results;
Step 207: determine, based on the word segmentation results, the target character set category to which the character set to be processed belongs;
Step 208: determine the character set category labeled in the character set to be processed as the target character set category of the character set to be processed.
In other words, if the sender of the character set to be processed has labeled it with its character set category, the labeled category is used directly as the target character set category and no character set detection is needed. If the character set to be processed is not labeled with its category, character set detection must be carried out.
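The branch at step 202 can be sketched as a thin wrapper around whatever detector implements steps 203 to 207. The function name `resolve_charset`, the `declared` parameter and the `detect` callable are all hypothetical; how a label is carried alongside the data is not specified in the text.

```python
from typing import Callable, Optional


def resolve_charset(
    data: bytes,
    declared: Optional[str],
    detect: Callable[[bytes], str],
) -> str:
    """Return the labeled category if present, otherwise run detection."""
    if declared:
        # Step 208: trust the sender's label and skip detection.
        return declared
    # Steps 203-207: fall back to segmentation-based detection.
    return detect(data)


if __name__ == "__main__":
    # Stand-in detector for demonstration only.
    def fake_detect(_data: bytes) -> str:
        return "utf-8"

    print(resolve_charset(b"hello", "gbk", fake_detect))
    print(resolve_charset(b"hello", None, fake_detect))
```

Passing the detector as a parameter keeps the label check independent of the particular segmentation rule used.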
One embodiment of the present invention also discloses a character set detection device. As shown in Fig. 3, the device includes: a first receiving unit 301, a first selecting unit 302, a first decoding unit 303, a first recording unit 304, a word segmentation unit 305 and a category determining unit 306, wherein:
The first receiving unit 301 is configured to receive a character set to be processed;
The first selecting unit 302 is configured to select character set categories one by one from a prestored set of character set categories;
The first decoding unit 303 is configured to decode the character set to be processed according to the coding rule corresponding to the selected character set category;
Each character set category corresponds to one coding rule, and this correspondence is preset in the system.
The first recording unit 304 is configured to record the decoding result when decoding succeeds;
Decoding the character set to be processed has two possible outcomes. In the first, the character set to be processed conforms to the coding rule of the currently selected category, so decoding succeeds and the decoding result is recorded. In the second, it does not conform to the coding rule of the currently selected category, so decoding fails and the currently selected category is discarded. After decoding with the current category finishes, a new category is selected from the set and the character set to be processed is decoded again, until every category in the set has been used for decoding.
The word segmentation unit 305 is configured to perform word segmentation on the decoding results corresponding to all successfully decoded character set categories in the set, to obtain word segmentation results;
The decoding results can be segmented using a prestored dictionary. The dictionary contains a large vocabulary, which can be collected in advance; an existing dictionary, such as the dictionary of an input method, can also be used directly.
The word segmentation unit may, based on the prestored dictionary, apply forward maximum matching to the decoding results corresponding to all successfully decoded character set categories in the set, to obtain word segmentation results. Of course, other algorithms may also be used to segment the decoding results; the present invention is not limited in this respect.
The category determining unit 306 is configured to determine, based on the word segmentation results, the target character set category to which the character set to be processed belongs.
Determining the target character set category from the word segmentation results amounts to semantic detection. Compared with the grammar (syntax) detection used in the prior art, it is more accurate and improves the success rate of character set category detection.
Another embodiment of the present invention discloses a character set detection device. Unlike the embodiment above, this embodiment mainly describes different implementations of the category determining unit, as follows:
In one implementation, the category determining unit includes a first statistics module, a calculation module and a first determining module, wherein:
The first statistics module is configured to count, in the word segmentation results, the number of characters in segmentable fragments and the total number of decoded characters;
The calculation module is configured to calculate the ratio of the number of characters in segmentable fragments to the total number of decoded characters, to produce a segmentable ratio;
That is: segmentable ratio = total number of characters in segmentable fragments / total number of decoded characters
The first determining module is configured to determine the character set category with the largest segmentable ratio as the target character set category to which the character set to be processed belongs.
In a second implementation, the category determining unit includes a second statistics module and a second determining module, wherein:
The second statistics module is configured to count, in the word segmentation results, the number of characters in segmentable fragments;
The second determining module is configured to determine the character set category with the largest number of characters in segmentable fragments as the target character set category to which the character set to be processed belongs.
Another embodiment of the present invention discloses a character set detection device. As shown in Figure 4, the device includes: a first receiving unit 401, a first judging unit 402, a first selecting unit 403, a first decoding unit 404, a first recording unit 405, a word segmentation acquiring unit 406, and a category determining unit 407, wherein:
The first receiving unit 401 is configured to receive a character set to be processed;
The first judging unit 402 is configured to judge whether the character set to be processed is marked with the character set category to which it belongs;
The first selecting unit 403 is configured to select character set categories one by one from a prestored character set category set when it is determined that the character set to be processed is not marked with the character set category to which it belongs;
The first decoding unit 404 is configured to decode the character set to be processed based on the encoding rule corresponding to the selected character set category;
The first recording unit 405 is configured to record the decoding result after decoding succeeds;
The word segmentation acquiring unit 406 is configured to perform word segmentation on the decoding results corresponding to all successfully decoded character set categories in the character set category set, to obtain a word segmentation result;
The category determining unit 407 is configured to determine, based on the word segmentation result, the target character set category to which the character set to be processed belongs.
Of course, the device may also include a first determining unit, configured to directly determine the marked character set category as the target character set category when the first judging unit determines that the character set to be processed is marked with the character set category to which it belongs.
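The cooperation of the units above can be sketched end to end in Python. This is a rough sketch under stated assumptions rather than the device itself: the input is taken to be a `bytes` object, the optional `label` argument stands in for a marked character set category, `CANDIDATES` and `DICTIONARY` are hypothetical toy values, decoding uses Python's built-in strict codecs, segmentation uses a simple dictionary-based forward maximum matching, and the final decision uses the segmentable ratio.

```python
CANDIDATES = ["utf-8", "gbk", "big5"]            # hypothetical prestored category set
DICTIONARY = {"字符", "字符集", "检测", "方法"}    # hypothetical toy dictionary
MAX_WORD_LEN = max(len(word) for word in DICTIONARY)


def decode_all(data, candidates=CANDIDATES):
    """Try every candidate category and record only the successful decodings."""
    decoded = {}
    for name in candidates:
        try:
            decoded[name] = data.decode(name)  # strict mode: invalid bytes raise
        except UnicodeDecodeError:
            continue                           # decoding failed, nothing recorded
    return decoded


def forward_max_match(text, dictionary=DICTIONARY):
    """Greedy forward maximum matching; returns (fragment, is_word) pairs."""
    result = []
    i = 0
    while i < len(text):
        # Try the longest possible dictionary word first, then shorter ones.
        for size in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if piece in dictionary:
                result.append((piece, True))
                break
        else:
            piece = text[i]                    # no word starts here
            result.append((piece, False))
        i += len(piece)
    return result


def detect(data, label=None):
    """Return the marked category if present, else the best-scoring candidate."""
    if label is not None:
        return label
    decoded = decode_all(data)
    if not decoded:
        return None

    def ratio(name):
        text = decoded[name]
        if not text:
            return 0.0
        hits = sum(len(p) for p, is_word in forward_max_match(text) if is_word)
        return hits / len(text)

    return max(decoded, key=ratio)
```

For example, `detect("字符集检测方法".encode("gbk"))` selects `"gbk"` with this toy dictionary, because only the GBK decoding yields text that the dictionary can segment.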
The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar, the embodiments may be referred to one another. Because the device disclosed in an embodiment corresponds to the method disclosed in an embodiment, its description is relatively brief; for the relevant details, refer to the description of the method.
The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims (10)

1. A character set detection method, characterized by comprising:
receiving a character set to be processed;
selecting character set categories one by one from a prestored character set category set;
decoding the character set to be processed based on the encoding rule corresponding to the selected character set category;
recording the decoding result after decoding succeeds;
performing word segmentation on the decoding results corresponding to all successfully decoded character set categories in the character set category set, to obtain a word segmentation result;
determining, based on the word segmentation result, the target character set category to which the character set to be processed belongs.
2. The character set detection method according to claim 1, characterized in that the determining, based on the word segmentation result, of the target character set category to which the character set to be processed belongs comprises:
counting the number of characters in the segmentable fragments of the word segmentation result and the total number of decoded characters;
calculating the ratio of the number of characters in the segmentable fragments to the total number of decoded characters, generating a segmentable ratio;
determining the character set category with the largest segmentable ratio as the target character set category to which the character set to be processed belongs.
3. The character set detection method according to claim 1, characterized in that the determining, based on the word segmentation result, of the target character set category to which the character set to be processed belongs comprises:
counting the number of characters in the segmentable fragments of the word segmentation result;
determining the character set category with the largest number of characters in segmentable fragments as the target character set category to which the character set to be processed belongs.
4. The character set detection method according to claim 1, characterized in that the performing of word segmentation on the decoding results corresponding to all successfully decoded character set categories in the character set category set, to obtain a word segmentation result, is specifically:
based on a prestored dictionary, using the forward maximum matching method to perform word segmentation on the decoding results corresponding to all successfully decoded character set categories in the character set category set.
5. The method according to claim 1, characterized in that, after the receiving of the character set to be processed, the method further comprises:
judging whether the character set to be processed is marked with the character set category to which it belongs;
the selecting of character set categories one by one from the prestored character set category set is specifically:
when it is determined that the character set to be processed is not marked with the character set category to which it belongs, selecting character set categories one by one from the prestored character set category set.
6. A character set detection device, characterized by comprising:
a first receiving unit, configured to receive a character set to be processed;
a first selecting unit, configured to select character set categories one by one from a prestored character set category set;
a first decoding unit, configured to decode the character set to be processed based on the encoding rule corresponding to the selected character set category;
a first recording unit, configured to record the decoding result after decoding succeeds;
a word segmentation acquiring unit, configured to perform word segmentation on the decoding results corresponding to all successfully decoded character set categories in the character set category set, to obtain a word segmentation result;
a category determining unit, configured to determine, based on the word segmentation result, the target character set category to which the character set to be processed belongs.
7. The character set detection device according to claim 6, characterized in that the category determining unit comprises:
a first statistics module, configured to count the number of characters in the segmentable fragments of the word segmentation result and the total number of decoded characters;
a calculation and generation module, configured to calculate the ratio of the number of characters in the segmentable fragments to the total number of decoded characters, generating a segmentable ratio;
a first determining module, configured to determine the character set category with the largest segmentable ratio as the target character set category to which the character set to be processed belongs.
8. The character set detection device according to claim 6, characterized in that the category determining unit comprises:
a second statistics module, configured to count the number of characters in the segmentable fragments of the word segmentation result;
a second determining module, configured to determine the character set category with the largest number of characters in segmentable fragments as the target character set category to which the character set to be processed belongs.
9. The character set detection device according to claim 6, characterized in that the word segmentation acquiring unit is specifically configured to, based on a prestored dictionary, use the forward maximum matching method to perform word segmentation on the decoding results corresponding to all successfully decoded character set categories in the character set category set, to obtain a word segmentation result.
10. The character set detection device according to claim 6, characterized in that the device further comprises:
a first judging unit, configured to judge whether the character set to be processed is marked with the character set category to which it belongs;
the first selecting unit is specifically configured to select character set categories one by one from the prestored character set category set when it is determined that the character set to be processed is not marked with the character set category to which it belongs.
CN201610096192.XA 2016-02-22 2016-02-22 Character set detection method and device Active CN105760364B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610096192.XA CN105760364B (en) 2016-02-22 2016-02-22 Character set detection method and device

Publications (2)

Publication Number Publication Date
CN105760364A true CN105760364A (en) 2016-07-13
CN105760364B CN105760364B (en) 2018-09-04

Family

ID=56330980

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610096192.XA Active CN105760364B (en) 2016-02-22 2016-02-22 Character set detection method and device

Country Status (1)

Country Link
CN (1) CN105760364B (en)

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6456308B1 (en) * 1996-08-08 2002-09-24 Agranat Systems, Inc. Embedded web server
CN101013420A (en) * 2006-12-31 2007-08-08 中国科学院计算技术研究所 Method for identifying coding form of Chinese text
CN102508824A (en) * 2011-09-29 2012-06-20 苏州大学 Compression coding and decoding method and device for microblog information
CN102567293A (en) * 2010-12-13 2012-07-11 汉王科技股份有限公司 Coded format detection method and coded format detection device for text files
CN102799572A (en) * 2012-07-27 2012-11-28 深圳市万兴软件有限公司 Text coding manner and text coding apparatus
US8402366B1 (en) * 2009-12-18 2013-03-19 Amazon Technologies, Inc. Format tag stacks for stream-parsing format information
CN104516862A (en) * 2013-09-29 2015-04-15 北大方正集团有限公司 Method and system for selecting and reading coded format of target document
CN104750666A (en) * 2015-03-12 2015-07-01 明博教育科技有限公司 Text character encoding mode identification method and system

Also Published As

Publication number Publication date
CN105760364B (en) 2018-09-04

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant