CN105760364A - Character set detection method and device

Info

Publication number: CN105760364A
Application number: CN201610096192.XA
Authority: CN
Inventors: 徐佳宏; 朱吕亮; 陈栋
Original assignee: Shenzhen Ipanel TV Inc
Current assignee: Shenzhen Ipanel TV Inc
Priority date: 2016-02-22
Filing date: 2016-02-22
Publication date: 2016-07-13
Anticipated expiration: 2036-02-22
Also published as: CN105760364B

Abstract

The invention discloses a character set detection method and device. The method comprises: receiving a character set to be processed; selecting character set categories one by one from a prestored character set category set; decoding the character set to be processed based on the coding rule corresponding to each selected character set category; recording the decoding result whenever decoding succeeds; performing word segmentation on the decoding results corresponding to all successfully decoded character set categories in the character set category set to obtain word segmentation results; and determining, based on the word segmentation results, the target character set category to which the character set to be processed belongs. Because the target character set category is determined through word segmentation, the method is a semantic detection approach; compared with the grammar (syntax) detection used in the prior art, it is more accurate and raises the success rate of character set category detection.

Description

A character set detection method and device

Technical field

The present invention relates to the technical field of encoding and decoding, and more particularly to a character set detection method and device.

Background technology

A character is the general term for all kinds of letters and symbols, and a character set is a collection of characters. In the computer field there are many kinds of character sets, such as the ASCII character set, the GB2312 character set, the UTF-8 character set and the GBK character set. Different character sets have different coding rules. Therefore, when a system receives a character set to be processed, it must first determine the category to which that character set belongs so that subsequent operations can be carried out based on that category, for example decoding the character set to be processed according to the coding rule of its category and displaying the decoded content. When the received character set to be processed is not labelled with its character set category, the system has to detect that category.

In the prior art, the character set category of a character set to be processed is detected based on the coding rules of the character sets; that is, grammar (syntax) detection is performed using the coding rules of the different character sets. With this approach, if the character set to be processed satisfies the coding rules of at least two character set categories, the system cannot determine which category it really belongs to, and detection errors occur easily. For example, suppose the character set to be processed was encoded with the UTF-8 character set but the system detects its category as the GBK character set. When the system then decodes it with the coding rule of the GBK character set, garbled text appears, and the detection has clearly failed.

Therefore, how to improve the success rate of detecting the character set category of a character set to be processed has become a technical problem that urgently needs to be solved.
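The ambiguity described in this background is easy to reproduce. The following minimal Python sketch is an illustration only and is not part of the patent: the helper name `is_valid` is hypothetical, and the test bytes are simply the UTF-8 encoding of "你好" ("hello"). It shows one byte sequence passing the coding-rule check of two character sets, so a purely rule-based detector has no basis for choosing between them.

```python
# Illustration only: the same bytes are syntactically valid in two encodings.
data = "你好".encode("utf-8")  # the six bytes E4 BD A0 E5 A5 BD


def is_valid(raw: bytes, encoding: str) -> bool:
    """Return True if raw decodes without error under the given encoding."""
    try:
        raw.decode(encoding)  # strict decoding raises on rule violations
        return True
    except UnicodeDecodeError:
        return False


# Check the bytes against three candidate character sets.
valid_encodings = [enc for enc in ("utf-8", "gbk", "ascii") if is_valid(data, enc)]
print(valid_encodings)
```

ASCII is rejected because every byte is above 0x7F, but both UTF-8 and GBK accept the input, which is exactly the situation in which syntax-only detection can pick the wrong category.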

Summary of the invention

In view of this, the present invention provides a character set detection method and device to improve the accuracy of detecting the character set category to which a character set to be processed belongs.

To achieve the above object, the present invention provides the following technical solutions:

A character set detection method, comprising:

receiving a character set to be processed;

selecting character set categories one by one from a prestored character set category set;

decoding the character set to be processed based on the coding rule corresponding to the selected character set category;

recording the decoding result after decoding succeeds;

performing word segmentation on the decoding results corresponding to all successfully decoded character set categories in the character set category set to obtain word segmentation results;

determining, based on the word segmentation results, the target character set category to which the character set to be processed belongs.

Preferably, determining, based on the word segmentation results, the target character set category to which the character set to be processed belongs comprises:

counting, in the word segmentation results, the number of characters in segmentable fragments and the total number of decoded characters;

calculating the ratio of the number of characters in segmentable fragments to the total number of decoded characters to generate a segmentable ratio;

determining the character set category with the largest segmentable ratio as the target character set category to which the character set to be processed belongs.

Preferably, determining, based on the word segmentation results, the target character set category to which the character set to be processed belongs comprises:

counting, in the word segmentation results, the number of characters in segmentable fragments;

determining the character set category with the largest number of characters in segmentable fragments as the target character set category to which the character set to be processed belongs.

Preferably, performing word segmentation on the decoding results corresponding to all successfully decoded character set categories in the character set category set to obtain word segmentation results is specifically:

based on a prestored dictionary, using the forward maximum matching method to segment the decoding results corresponding to all successfully decoded character set categories in the character set category set.

Preferably, after receiving the character set to be processed, the method further comprises:

judging whether the character set to be processed is labelled with the character set category to which it belongs;

selecting character set categories one by one from the prestored character set category set is then specifically:

when it is determined that the character set to be processed is not labelled with the character set category to which it belongs, selecting character set categories one by one from the prestored character set category set.

A character set detection device, comprising:

a first receiving unit, configured to receive a character set to be processed;

a first selecting unit, configured to select character set categories one by one from a prestored character set category set;

a first decoding unit, configured to decode the character set to be processed based on the coding rule corresponding to the selected character set category;

a first recording unit, configured to record the decoding result after decoding succeeds;

a word segmentation acquiring unit, configured to perform word segmentation on the decoding results corresponding to all successfully decoded character set categories in the character set category set to obtain word segmentation results;

a category determining unit, configured to determine, based on the word segmentation results, the target character set category to which the character set to be processed belongs.

Preferably, the category determining unit comprises:

a first statistics module, configured to count, in the word segmentation results, the number of characters in segmentable fragments and the total number of decoded characters;

a calculation and generation module, configured to calculate the ratio of the number of characters in segmentable fragments to the total number of decoded characters to generate a segmentable ratio;

a first determining module, configured to determine the character set category with the largest segmentable ratio as the target character set category to which the character set to be processed belongs.

Preferably, the category determining unit comprises:

a second statistics module, configured to count, in the word segmentation results, the number of characters in segmentable fragments;

a second determining module, configured to determine the character set category with the largest number of characters in segmentable fragments as the target character set category to which the character set to be processed belongs.

Preferably, the word segmentation acquiring unit is specifically configured to use, based on a prestored dictionary, the forward maximum matching method to segment the decoding results corresponding to all successfully decoded character set categories in the character set category set to obtain word segmentation results.

Preferably, the device further comprises:

a first judging unit, configured to judge whether the character set to be processed is labelled with the character set category to which it belongs;

the first selecting unit is then specifically configured to select character set categories one by one from the prestored character set category set when it is determined that the character set to be processed is not labelled with the character set category to which it belongs.

As can be seen from the above technical solutions, compared with the prior art, the embodiments of the present invention provide a character set detection method. Specifically, when a character set to be processed is received, character set categories are selected one by one from a prestored character set category set, and the character set to be processed is decoded based on the coding rule corresponding to each selected character set category. The decoding result is recorded whenever decoding succeeds, and the decoding result for each successfully decoded character set category is segmented to obtain a word segmentation result, so that the target character set category to which the character set to be processed belongs is determined based on the word segmentation results. Determining the target character set category through word segmentation is a semantic detection approach; compared with the grammar detection used in the prior art, it is more accurate and improves the success rate of character set category detection.

Brief description of the drawings

To illustrate the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from the drawings provided without creative effort.

Fig. 1 is a schematic flowchart of a character set detection method disclosed in an embodiment of the present invention;

Fig. 2 is a schematic flowchart of a character set detection method disclosed in a further embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a character set detection device disclosed in an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of a character set detection device disclosed in a further embodiment of the present invention.

Detailed description of the invention

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.

An embodiment of the present invention discloses a character set detection method. As shown in Fig. 1, the method comprises the following steps:

Step 101: receive a character set to be processed;

Step 102: select character set categories one by one from a prestored character set category set;

The character set category set prestored by the system contains multiple character set categories, such as the UTF-8 character set, the GBK character set, ASCII, the BIG5 character set and the GB18030 character set. Which character set categories the set contains can be preset by the user; the present invention does not limit this.

After the character set to be processed is received, character set categories can be selected one by one from the prestored character set category set. That is, one character set category is selected from the set each time for subsequent processing, until every character set category in the set has been selected. For example, if the character set category set CS contains three character set categories C1, C2 and C3, then C1 can be selected first for subsequent processing, then C2, and finally C3.

Step 103: decode the character set to be processed based on the coding rule corresponding to the selected character set category;

Each character set category corresponds to one coding rule, and this correspondence is preset in the system.

Step 104: record the decoding result after decoding succeeds;

Decoding the character set to be processed can have two outcomes. In the first, the character set to be processed conforms to the coding rule of the currently selected character set category; decoding succeeds, and the decoding result is recorded. In the second, the character set to be processed does not conform to that coding rule; decoding fails, and the currently selected character set category is discarded. After each decoding attempt, a new character set category is selected from the character set category set and used to decode the character set to be processed, until every character set category in the set has been used for decoding.
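Steps 102 to 104 can be sketched as a simple loop. This is a minimal illustration under stated assumptions, not the patent's implementation: the function name `decode_with_all` is hypothetical, Python's built-in codecs stand in for the "coding rules", and a `UnicodeDecodeError` under strict decoding is treated as "decoding failed".

```python
def decode_with_all(raw: bytes, categories: list[str]) -> dict[str, str]:
    """Try every category in order and keep only the successful decodes."""
    results: dict[str, str] = {}
    for category in categories:                       # step 102: select one by one
        try:
            results[category] = raw.decode(category)  # steps 103/104: decode and record
        except UnicodeDecodeError:
            continue                                  # decoding failed: discard this category
    return results


raw = bytes.fromhex("E4BDA0E5A5BD68656C6C6F")
recorded = decode_with_all(raw, ["utf-8", "gbk", "ascii"])
print(sorted(recorded))
```

For this input, the UTF-8 and GBK decodes are recorded and ASCII is discarded, because the first six bytes are above 0x7F. The recorded results are what the word segmentation step then operates on.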

Step 105: perform word segmentation on the decoding results corresponding to all successfully decoded character set categories in the character set category set to obtain word segmentation results;

During word segmentation, the decoding result can be segmented based on a prestored dictionary. The dictionary contains a large number of words, which can be collected in advance; an existing dictionary, such as the dictionary of an input method, can also be used directly.

It should be noted that the present invention does not limit the form or content of the words in the dictionary. For example, the dictionary may include various words about "you" (你), such as "you, hello, you I, compete with one another, life-and-death, both of you, you come, and I am past, your my pressure whole, you etc. I, It is nice that you are fine".

When the decoding result is segmented based on the dictionary, the forward maximum matching method can be used. Accordingly, step 105 is specifically: based on the prestored dictionary, use the forward maximum matching method to segment the decoding results corresponding to all successfully decoded character set categories in the character set category set to obtain word segmentation results.

For example, suppose the decoding result contains the four-character phrase "It is nice that you are fine". According to the above dictionary, the forward maximum matching method segments it as the single word "It is nice that you are fine" rather than as the two words "hello" and "all right". Of course, other algorithms can also be used to segment the decoding result; the present invention does not limit this.
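The forward maximum matching method mentioned above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the function name `forward_max_match` and the toy dictionary are assumptions chosen for the example, and characters that match no dictionary word are emitted one at a time and flagged as unmatched.

```python
def forward_max_match(text: str, dictionary: set[str]) -> list[tuple[str, bool]]:
    """Segment text left to right, always taking the longest dictionary word.

    Returns (fragment, in_dictionary) pairs; a character that starts no
    dictionary word is emitted on its own with in_dictionary=False.
    """
    max_len = max((len(word) for word in dictionary), default=1)
    fragments: list[tuple[str, bool]] = []
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink one character at a time.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in dictionary:
                match = candidate
                break
        if match is None:
            fragments.append((text[i], False))
            i += 1
        else:
            fragments.append((match, True))
            i += len(match)
    return fragments


# Toy dictionary (hypothetical): the four-character entry wins over two shorter words.
toy = {"你", "你好", "好的", "你好好的"}
print(forward_max_match("你好好的", toy))
print(forward_max_match("你好好的", toy - {"你好好的"}))
```

With the four-character entry present, the whole phrase comes back as one word; without it, the same text falls back to two two-character words. That preference for the longest match is the behaviour the paragraph above describes.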

Step 106: determine, based on the word segmentation results, the target character set category to which the character set to be processed belongs.

Determining the target character set category through the word segmentation results amounts to semantic detection; compared with the grammar detection used in the prior art, it is more accurate and improves the success rate of character set category detection.

Another embodiment of the present invention discloses a character set detection method. Unlike the above embodiment, this embodiment mainly describes different implementations of determining, based on the word segmentation results, the target character set category to which the character set to be processed belongs, as follows:

In the first implementation, determining the target character set category to which the character set to be processed belongs based on the word segmentation results comprises the following steps:

SA1: count, in the word segmentation results, the number of characters in segmentable fragments and the total number of decoded characters;

After the decoding result corresponding to each successfully decoded character set category has been segmented and the word segmentation result obtained, the number of characters in segmentable fragments and the total number of decoded characters must be counted for each such character set category.

SA2: calculate the ratio of the number of characters in segmentable fragments to the total number of decoded characters to generate a segmentable ratio;

That is: segmentable ratio = total number of characters in segmentable fragments / total number of decoded characters.

SA3: determine the character set category with the largest segmentable ratio as the target character set category to which the character set to be processed belongs.

Specifically, after the segmentable ratio has been calculated for every successfully decoded character set category in the character set category set, the character set category with the largest segmentable ratio is determined to be the target character set category to which the character set to be processed belongs.
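In symbols, steps SA1 to SA3 can be summarised as follows. The notation is introduced here for illustration and does not appear in the original text: $S$ is the set of character set categories that decoded successfully, $N_{\mathrm{seg}}(c)$ is the number of characters in segmentable fragments for category $c$, $N_{\mathrm{total}}(c)$ is the total number of decoded characters for category $c$, $R(c)$ is the segmentable ratio, and $c^{*}$ is the target character set category.

```latex
R(c) = \frac{N_{\mathrm{seg}}(c)}{N_{\mathrm{total}}(c)},
\qquad
c^{*} = \arg\max_{c \in S} R(c)
```

The text does not say how ties between categories with equal ratios are resolved, so any tie-breaking rule would be an additional assumption.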

For ease of understanding, a specific example follows. Assume that the character set to be processed received by the system is "E4BDA0E5A5BD68656C6C6F" (this byte sequence, written in hexadecimal, is actually encoded in the UTF-8 character set). However, it conforms to the coding rules of both the UTF-8 and GBK character sets. Decoding it with the coding rule of the UTF-8 character set succeeds, and the decoding result is "你好hello" ("你好" means "hello"), as shown in Table 1:

Table 1

Decoding it with the coding rule of the GBK character set also succeeds, and the decoding result is "浣犲ソhello", as shown in Table 2:

Table 2

It can be seen that, with the existing detection approach, although the character set to be processed also conforms to the coding rule of the GBK character set, detecting it as the GBK character set is clearly wrong and causes the decoded content to be garbled.

In the present invention, assume that the character set category set contains the UTF-8 character set and the GBK character set. After the above successful decoding results are recorded, the target character set category can be determined by calculating the segmentable ratio. Specifically, for the UTF-8 decoding result "你好hello", the content that can be segmented based on the dictionary is "你好", so the number of characters in segmentable fragments is 2 and the total number of decoded characters is 7; the segmentable ratio is therefore 2/7. For the GBK decoding result "浣犲ソhello", no word can be segmented based on the dictionary, so the number of characters in segmentable fragments is 0 and the total number of decoded characters is 8; the segmentable ratio is therefore 0/8. Clearly, the character set category with the largest segmentable ratio is the UTF-8 character set, so the target character set category of the above character set to be processed is determined to be the UTF-8 character set.
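The worked example can be reproduced end to end with a short script. This is a sketch under stated assumptions rather than the patent's implementation: the names `RAW`, `DICTIONARY` and `matched_chars` are hypothetical, the three-word dictionary is a toy stand-in for the prestored dictionary, and Python's built-in `utf-8` and `gbk` codecs stand in for the coding rules.

```python
from fractions import Fraction

RAW = bytes.fromhex("E4BDA0E5A5BD68656C6C6F")  # the example input
DICTIONARY = {"你", "你好", "你我"}              # toy dictionary (assumption)


def matched_chars(text: str, dictionary: set[str]) -> int:
    """Count characters covered by dictionary words, using forward maximum matching."""
    longest = max(len(word) for word in dictionary)
    i = count = 0
    while i < len(text):
        for size in range(min(longest, len(text) - i), 0, -1):
            if text[i:i + size] in dictionary:
                count += size
                i += size
                break
        else:
            i += 1  # no dictionary word starts here: skip one character
    return count


ratios = {}
for category in ("utf-8", "gbk"):
    decoded = RAW.decode(category)
    # segmentable ratio = characters in segmentable fragments / total decoded characters
    ratios[category] = Fraction(matched_chars(decoded, DICTIONARY), len(decoded))

best = max(ratios, key=ratios.get)
print(ratios, best)
```

The UTF-8 decode yields 7 characters, 2 of which are covered by a dictionary word, giving 2/7; the GBK decode yields 8 characters with none covered, giving 0. The script therefore selects UTF-8, matching the result stated in the example.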

In the second implementation, determining the target character set category to which the character set to be processed belongs based on the word segmentation results comprises the following steps:

SB1: count, in the word segmentation results, the number of characters in segmentable fragments;

SB2: determine the character set category with the largest number of characters in segmentable fragments as the target character set category to which the character set to be processed belongs.

Taking again the example from the first implementation: for the UTF-8 decoding result "你好hello", the content that can be segmented based on the dictionary is "你好", so the number of characters in segmentable fragments is 2. For the GBK decoding result "浣犲ソhello", no word can be segmented based on the dictionary, so the number of characters in segmentable fragments is 0. The character set category with the largest number of characters in segmentable fragments is therefore the UTF-8 character set, and the target character set category of the above character set to be processed is determined to be the UTF-8 character set.
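Steps SB1 and SB2 differ from the first implementation only in the selection criterion: the raw count is compared instead of the ratio. A minimal sketch follows, assuming the word segmentation result for each category is already available as a list of (fragment, in_dictionary) pairs; that representation and the name `pick_by_count` are illustrative assumptions, not taken from the patent.

```python
def pick_by_count(segmented: dict[str, list[tuple[str, bool]]]) -> str:
    """Return the category whose segmentation covers the most characters."""
    counts = {
        category: sum(len(fragment) for fragment, in_dictionary in fragments if in_dictionary)
        for category, fragments in segmented.items()
    }
    return max(counts, key=counts.get)


# Segmentation results for the example input: "你好hello" (UTF-8) and "浣犲ソhello" (GBK).
segmented = {
    "utf-8": [("你好", True)] + [(ch, False) for ch in "hello"],
    "gbk": [(ch, False) for ch in "浣犲ソhello"],
}
print(pick_by_count(segmented))
```

Here the counts are 2 for UTF-8 and 0 for GBK, so UTF-8 is selected. Because every candidate decodes the same input bytes, the count and the ratio often agree, but they can differ when the candidates produce different numbers of decoded characters.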

A further embodiment of the present invention discloses a character set detection method. As shown in Fig. 2, the method comprises the following steps:

Step 201: receive a character set to be processed;

Step 202: judge whether the character set to be processed is labelled with the character set category to which it belongs; if not, go to step 203; if so, go to step 208;

Step 203: select character set categories one by one from a prestored character set category set;

When it is determined that the character set to be processed is not labelled with the character set category to which it belongs, character set categories are selected one by one from the prestored character set category set in order to determine the target character set category to which the character set to be processed belongs.

Step 204: decode the character set to be processed based on the coding rule corresponding to the selected character set category;

Step 205: record the decoding result after decoding succeeds;

Step 206: perform word segmentation on the decoding results corresponding to all successfully decoded character set categories in the character set category set to obtain word segmentation results;

Step 207: determine, based on the word segmentation results, the target character set category to which the character set to be processed belongs;

Step 208: determine the character set category labelled in the character set to be processed as the target character set category of the character set to be processed.

In other words, if the sender of the character set to be processed has labelled it with the character set category to which it belongs, the labelled character set category is used directly as the target character set category and no character set detection is needed. If the character set to be processed is not labelled with its character set category, character set detection must be carried out.
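The branch in step 202 can be sketched as a small dispatch function. This is an illustration under stated assumptions: the name `resolve_category` is hypothetical, the label is modelled as an optional string passed alongside the bytes, and `detect` stands for whatever implementation of steps 203 to 207 is in use.

```python
from typing import Callable, Optional


def resolve_category(
    raw: bytes,
    declared: Optional[str],
    detect: Callable[[bytes], str],
) -> str:
    """Use the sender's label when present; otherwise fall back to detection."""
    if declared is not None:
        return declared   # step 208: trust the labelled category
    return detect(raw)    # steps 203-207: run character set detection


# A stub detector (hypothetical) that always answers "utf-8".
def stub_detect(raw: bytes) -> str:
    return "utf-8"


print(resolve_category(b"hello", "gbk", stub_detect))
print(resolve_category(b"hello", None, stub_detect))
```

Passing the detector in as a parameter keeps the labelled path free of any decoding work, which mirrors the point above that detection is skipped entirely when a label is present.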

An embodiment of the present invention also discloses a character set detection device. As shown in Fig. 3, the device comprises: a first receiving unit 301, a first selecting unit 302, a first decoding unit 303, a first recording unit 304, a word segmentation acquiring unit 305 and a category determining unit 306, wherein:

the first receiving unit 301 is configured to receive a character set to be processed;

the first selecting unit 302 is configured to select character set categories one by one from a prestored character set category set;

the first decoding unit 303 is configured to decode the character set to be processed based on the coding rule corresponding to the selected character set category;

the first recording unit 304 is configured to record the decoding result after decoding succeeds;

the word segmentation acquiring unit 305 is configured to perform word segmentation on the decoding results corresponding to all successfully decoded character set categories in the character set category set to obtain word segmentation results;

Specifically, the word segmentation acquiring unit can use, based on a prestored dictionary, the forward maximum matching method to segment the decoding results corresponding to all successfully decoded character set categories in the character set category set to obtain word segmentation results. Of course, other algorithms can also be used to segment the decoding result; the present invention does not limit this.

the category determining unit 306 is configured to determine, based on the word segmentation results, the target character set category to which the character set to be processed belongs.

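Another embodiment of the present invention discloses a character set detection device. Unlike the above embodiment, this embodiment mainly describes different implementations of the category determining unit, as follows:

In the first implementation, the category determining unit comprises a first statistics module, a calculation and generation module and a first determining module, wherein:

the first statistics module is configured to count, in the word segmentation results, the number of characters in segmentable fragments and the total number of decoded characters;

the calculation and generation module is configured to calculate the ratio of the number of characters in segmentable fragments to the total number of decoded characters to generate a segmentable ratio;

the first determining module is configured to determine the character set category with the largest segmentable ratio as the target character set category to which the character set to be processed belongs.

In the second implementation, the category determining unit comprises a second statistics module and a second determining module, wherein:

the second statistics module is configured to count, in the word segmentation results, the number of characters in segmentable fragments;

the second determining module is configured to determine the character set category with the largest number of characters in segmentable fragments as the target character set category to which the character set to be processed belongs.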

A further embodiment of the present invention discloses a character set detection device. As shown in Fig. 4, the device comprises: a first receiving unit 401, a first judging unit 402, a first selecting unit 403, a first decoding unit 404, a first recording unit 405, a word segmentation acquiring unit 406 and a category determining unit 407, wherein:

the first receiving unit 401 is configured to receive a character set to be processed;

the first judging unit 402 is configured to judge whether the character set to be processed is labelled with the character set category to which it belongs;

the first selecting unit 403 is configured to select character set categories one by one from a prestored character set category set when it is determined that the character set to be processed is not labelled with the character set category to which it belongs;

the first decoding unit 404 is configured to decode the character set to be processed based on the coding rule corresponding to the selected character set category;

the first recording unit 405 is configured to record the decoding result after decoding succeeds;

the word segmentation acquiring unit 406 is configured to perform word segmentation on the decoding results corresponding to all successfully decoded character set categories in the character set category set to obtain word segmentation results;

the category determining unit 407 is configured to determine, based on the word segmentation results, the target character set category to which the character set to be processed belongs.

Of course, the device can also include a first determining unit, configured to directly determine the labelled character set category as the target character set category when the first judging unit determines that the character set to be processed is labelled with the character set category to which it belongs.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments, and for the parts that are the same or similar the embodiments can be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for the relevant details, refer to the description of the method.

The above description of the disclosed embodiments enables those skilled in the art to implement or use the present invention. Various modifications to these embodiments will be apparent to those skilled in the art, and the general principles defined herein can be implemented in other embodiments without departing from the spirit or scope of the present invention. Therefore, the present invention is not limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.

Claims

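1. A character set detection method, characterised by comprising:

receiving a character set to be processed;

selecting character set categories one by one from a prestored character set category set;

decoding the character set to be processed based on the coding rule corresponding to the selected character set category;

recording the decoding result after decoding succeeds;

performing word segmentation on the decoding results corresponding to all successfully decoded character set categories in the character set category set to obtain word segmentation results;

determining, based on the word segmentation results, the target character set category to which the character set to be processed belongs.

2. The character set detection method according to claim 1, characterised in that determining, based on the word segmentation results, the target character set category to which the character set to be processed belongs comprises:

counting, in the word segmentation results, the number of characters in segmentable fragments and the total number of decoded characters;

calculating the ratio of the number of characters in segmentable fragments to the total number of decoded characters to generate a segmentable ratio;

determining the character set category with the largest segmentable ratio as the target character set category to which the character set to be processed belongs.

3. The character set detection method according to claim 1, characterised in that determining, based on the word segmentation results, the target character set category to which the character set to be processed belongs comprises:

counting, in the word segmentation results, the number of characters in segmentable fragments;

determining the character set category with the largest number of characters in segmentable fragments as the target character set category to which the character set to be processed belongs.

4. The character set detection method according to claim 1, characterised in that performing word segmentation on the decoding results corresponding to all successfully decoded character set categories in the character set category set to obtain word segmentation results is specifically:

based on a prestored dictionary, using the forward maximum matching method to segment the decoding results corresponding to all successfully decoded character set categories in the character set category set.

5. The method according to claim 1, characterised in that, after receiving the character set to be processed, the method further comprises:

judging whether the character set to be processed is labelled with the character set category to which it belongs;

selecting character set categories one by one from the prestored character set category set is then specifically:

when it is determined that the character set to be processed is not labelled with the character set category to which it belongs, selecting character set categories one by one from the prestored character set category set.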

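6. A character set detection device, characterised by comprising:

a first receiving unit, configured to receive a character set to be processed;

a first selecting unit, configured to select character set categories one by one from a prestored character set category set;

a first decoding unit, configured to decode the character set to be processed based on the coding rule corresponding to the selected character set category;

a first recording unit, configured to record the decoding result after decoding succeeds;

a word segmentation acquiring unit, configured to perform word segmentation on the decoding results corresponding to all successfully decoded character set categories in the character set category set to obtain word segmentation results;

a category determining unit, configured to determine, based on the word segmentation results, the target character set category to which the character set to be processed belongs.

7. The character set detection device according to claim 6, characterised in that the category determining unit comprises:

a first statistics module, configured to count, in the word segmentation results, the number of characters in segmentable fragments and the total number of decoded characters;

a calculation and generation module, configured to calculate the ratio of the number of characters in segmentable fragments to the total number of decoded characters to generate a segmentable ratio;

a first determining module, configured to determine the character set category with the largest segmentable ratio as the target character set category to which the character set to be processed belongs.

8. The character set detection device according to claim 6, characterised in that the category determining unit comprises:

a second statistics module, configured to count, in the word segmentation results, the number of characters in segmentable fragments;

a second determining module, configured to determine the character set category with the largest number of characters in segmentable fragments as the target character set category to which the character set to be processed belongs.

9. The character set detection device according to claim 6, characterised in that the word segmentation acquiring unit is specifically configured to use, based on a prestored dictionary, the forward maximum matching method to segment the decoding results corresponding to all successfully decoded character set categories in the character set category set to obtain word segmentation results.

10. The character set detection device according to claim 6, characterised in that the device further comprises:

a first judging unit, configured to judge whether the character set to be processed is labelled with the character set category to which it belongs;

the first selecting unit is specifically configured to select character set categories one by one from the prestored character set category set when it is determined that the character set to be processed is not labelled with the character set category to which it belongs.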