CN108197087A

CN108197087A - Character encoding recognition method and device

Info

Publication number: CN108197087A
Application number: CN201810050150.1A
Authority: CN
Inventors: 王占; 王占一
Original assignee: Beijing Qianxin Technology Co Ltd
Current assignee: Beijing Qianxin Technology Co Ltd
Priority date: 2018-01-18
Filing date: 2018-01-18
Publication date: 2018-06-22
Anticipated expiration: 2038-01-18
Also published as: CN108197087B

Abstract

The present invention provides a character encoding recognition method and device. The method includes: obtaining text to be recognized; obtaining the encoding that matches the text to be recognized according to the text to be recognized and a preset encoding recognition model; and decoding the text to be recognized according to the obtained encoding to obtain a decoding result. In the embodiments, the text to be recognized is fed to the encoding recognition model to obtain its match probability for each preset encoding, the encoding that matches the text is determined from these probabilities, and the text is then decoded. As a result, there is no need to set the encoding manually or to hand-craft the feature sequences used to match an encoding, which reduces workload and improves flexibility.

Description

Character encoding recognition method and device

Technical field

Embodiments of the present invention relate to the technical field of information processing, and in particular to a character encoding recognition method and device.

Background art

In the field of computer information technology, character encoding is a basic technique. Character encoding, also simply called encoding, maps each character in a character set to an object in a specified set, so that text can be stored in a computer and transmitted over a communication network. Information stored in a computer is represented as binary numbers; to make it readable to users, it must be converted through a character encoding according to some character set. Common encodings include UTF-8, GB2312, GBK and BIG5. Different languages usually have their own applicable encodings: for example, ISO-8859-1 is mainly used for Latin characters, GBK and GB2312 are commonly used for Simplified Chinese, and BIG5 is commonly used for Traditional Chinese.

When a computer stores and displays information, the correct encoding may be unavailable because information has been lost or modified, and the information then cannot be used normally. Methods and systems for recognizing character encodings are therefore important. There are three common recognition methods: (1) Judging by code range. Each encoding has its own range of code values, but this method fails when the ranges of many encodings overlap. (2) Feature matching. The current information is matched against keywords in a dictionary or against manually defined features; a successful match settles the encoding, but if nothing matches, the encoding cannot be determined. (3) Character distribution. A probability model of characters is built in advance, the probability of the current character distribution is computed from the model, and the encoding is judged from it. This method has limited effect on encoded information that follows particular wording habits or is too short.

Summary of the invention

Embodiments of the present invention provide a character encoding recognition method and device, to solve the problem in the prior art that determining the encoding relies on manual settings and is inflexible.
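The weakness of range-based judgment can be illustrated in a few lines. The following is only a sketch, not part of the described method: it takes a short GBK-encoded byte sequence and shows that the same bytes also decode without error as ISO-8859-1, so legality of the byte values alone does not identify the encoding.

```python
# Bytes produced by encoding a short Chinese phrase with GBK.
raw = bytes.fromhex("bcc6cbe3bbfa")

# Both decodes succeed: every byte value is legal in ISO-8859-1,
# and these byte pairs are also legal in GBK.
as_gbk = raw.decode("gbk")
as_latin1 = raw.decode("iso-8859-1")

print(len(as_gbk))     # 3 characters when read as GBK
print(len(as_latin1))  # 6 characters when read as ISO-8859-1
```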

In a first aspect, an embodiment of the present invention provides a character encoding recognition method, including:

obtaining text to be recognized;

obtaining the encoding that matches the text to be recognized according to the text to be recognized and a preset encoding recognition model;

decoding the text to be recognized according to the obtained encoding to obtain a decoding result.

Optionally, obtaining the encoding that matches the text to be recognized according to the text to be recognized and the preset encoding recognition model includes:

sending the text to be recognized to the encoding recognition model for calculation, to obtain the match probability of the text to be recognized for each preset encoding;

determining the encoding that matches the text to be recognized according to the match probabilities.

Optionally, obtaining the encoding that matches the text to be recognized according to the text to be recognized and the preset encoding recognition model includes:

selecting multiple text segments from the text to be recognized;

sending each text segment to the encoding recognition model for calculation, to obtain the match probability of each text segment for each preset encoding, and determining the encoding that matches each text segment according to the match probabilities;

determining the encoding of the text to be recognized according to the encodings of the text segments.

Optionally, determining the encoding that matches the text to be recognized according to the match probabilities includes: selecting the maximum probability value from the match probabilities; and taking the encoding corresponding to the maximum probability value as the encoding that matches the text to be recognized.

In a second aspect, an embodiment of the present invention provides a character encoding recognition device, including:

an acquisition module, configured to obtain text to be recognized;

a processing module, configured to obtain the encoding that matches the text to be recognized according to the text to be recognized and a preset encoding recognition model;

a decoding module, configured to decode the text to be recognized according to the obtained encoding to obtain a decoding result.

Optionally, the processing module is specifically configured to:

select multiple text segments from the text to be recognized;

Optionally, the processing module includes a computing unit and a determination unit, wherein:

the computing unit is configured to send the text to be recognized to the encoding recognition model for calculation, to obtain the match probability of the text to be recognized for each preset encoding;

the determination unit is configured to select the maximum probability value from the match probabilities and to take the encoding corresponding to the maximum probability value as the encoding that matches the text to be recognized.

In a third aspect, an embodiment of the present invention provides an electronic device, including: a processor, a memory, a bus, and a computer program stored in the memory and executable on the processor;

wherein the processor and the memory communicate with each other through the bus;

the processor implements the method described above when executing the computer program.

In a fourth aspect, an embodiment of the present invention provides a non-transitory computer-readable storage medium on which a computer program is stored; the computer program implements the method described above when executed by a processor.

As can be seen from the above technical solutions, the embodiments of the present invention provide a character encoding recognition method and device. The obtained text to be recognized is fed to the encoding recognition model to obtain its match probability for each preset encoding, the encoding that matches the text is determined from the match probabilities, and the text is then decoded to obtain a decoding result. There is therefore no need to set the encoding manually or to hand-craft the feature sequences used to match an encoding, which reduces workload and improves flexibility.

Brief description of the drawings

Fig. 1 is a flow diagram of the character encoding recognition method provided by an embodiment of the present invention;

Fig. 2 is a framework diagram of a learning structure provided by an embodiment of the present invention;

Fig. 3 is a structural diagram of the character encoding recognition device provided by an embodiment of the present invention;

Fig. 4 is a structural diagram of the electronic device provided by an embodiment of the present invention.

Detailed description

Specific embodiments of the present invention are described in further detail below with reference to the accompanying drawings and examples. The following embodiments illustrate the present invention but do not limit its scope.

Fig. 1 shows a character encoding recognition method provided by an embodiment of the present invention, including:

S11, obtaining text to be recognized;

S12, obtaining the encoding that matches the text to be recognized according to the text to be recognized and a preset encoding recognition model;

S13, decoding the text to be recognized according to the obtained encoding to obtain a decoding result.
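Steps S11 to S13 can be sketched as a small pipeline. This is an illustrative outline only: `decode_with_recognition` and `always_gbk` are hypothetical names, and the recognizer here is a placeholder that stands in for a trained encoding recognition model.

```python
def decode_with_recognition(raw, recognize_encoding):
    """S11-S13: take raw bytes, ask a recognizer for the encoding, then decode."""
    hex_text = raw.hex()                     # S11: text to be recognized, as a hex sequence
    encoding = recognize_encoding(hex_text)  # S12: encoding chosen by the model
    return raw.decode(encoding)              # S13: decoding result


# Placeholder recognizer that always answers "gbk"; a trained model would go here.
def always_gbk(hex_text):
    return "gbk"


result = decode_with_recognition(bytes.fromhex("bcc6cbe3bbfa"), always_gbk)
print(len(result))  # 3
```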

Regarding steps S11 to S13, it should be noted that, in this embodiment, encoding data with a given encoding produces a particular sequence text.

For example, the phrase "computer technology is developing rapidly", encoded with UTF-8, is expressed in hexadecimal as e8aea1e7ae97e69cbae68a80e69cafe5bfabe9809fe58f91e5b195; encoded with GBK, it is expressed in hexadecimal as bcc6cbe3bbfabcbccaf5bfeccbd9b7a2d5b9. The sequence length is limited to at most L characters (L can be set flexibly, for example 128).
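The two hexadecimal sequences can be reproduced with Python's built-in codecs. In this sketch the nine-character Chinese phrase is an assumption, reconstructed from the hexadecimal values quoted above, and L is assumed to count hexadecimal characters.

```python
# Assumption: the example phrase in the original Chinese text is the
# nine-character string below; it was reconstructed from the hex sequences.
text = "计算机技术快速发展"

utf8_hex = text.encode("utf-8").hex()
gbk_hex = text.encode("gbk").hex()

L = 128  # maximum sequence length in hex characters (configurable)

print(utf8_hex[:L])
print(gbk_hex[:L])
```

The same text yields sequences of different length and different byte patterns under the two encodings, which is what the recognition model learns to tell apart.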

In this embodiment, it should also be noted that the encoding recognition model can be obtained by training with deep learning, specifically as follows:

100,000 or even several hundred thousand sequence samples are used for iterative deep learning until the training error and the accuracy reach an acceptable level. The model can use deep learning structures such as LSTM (long short-term memory, a recurrent neural network) or Text-CNN (Convolutional Neural Networks for Sentence Classification).

Fig. 2 shows a framework diagram of a learning structure provided by an embodiment of the present invention.

(1) The input layer input_1 is followed by the embedding layer embedding_1 (also called the representation layer); the parameter values of the embedding layer are learned automatically by the model.

After a sequence is read in, each hexadecimal character is first converted to a positive-integer index number to simplify computation. A mapping table is built as follows:

Character	Reserved	a	b	c	d	e	f	0	1	2	3	4	5	6	7	8	9
Index	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16

For example, abc123 is converted to 1, 2, 3, 8, 9, 10. The sequence of index numbers serves as the input-layer data and is received by the embedding layer of the model. Sequences shorter than L are padded with 0.
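The mapping and padding just described can be sketched as follows. The names `ALPHABET`, `CHAR_TO_INDEX` and `to_indices` are illustrative and do not come from the source; truncating over-long input to the maximum length is also an assumption.

```python
# Index 0 is reserved (used for padding); a-f map to 1-6 and 0-9 map to 7-16,
# following the mapping table above.
ALPHABET = "abcdef0123456789"
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(ALPHABET)}


def to_indices(hex_text, max_len=128):
    """Convert a hex string to index numbers, truncated or zero-padded to max_len."""
    indices = [CHAR_TO_INDEX[ch] for ch in hex_text.lower()[:max_len]]
    return indices + [0] * (max_len - len(indices))


print(to_indices("abc123", max_len=8))  # [1, 2, 3, 8, 9, 10, 0, 0]
```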

After receiving the sequence of index numbers, the embedding layer converts it into a matrix on which operations such as convolution can be performed; that is, each index number in the sequence is initialized as a vector. Common conversion methods include random initialization, one-hot encoding, and word embedding based on word2vec; one-hot encoding is used as the example here. Its basic idea is that, in the vector for a given character, exactly one position is 1 and all others are 0. For abc123, each of the six characters thus becomes a vector whose only 1 sits at the position given by its index number.
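A minimal sketch of the one-hot conversion is given below. The vector width of 17 (the reserved index 0 plus the 16 hexadecimal characters) is an assumption, since the source does not state the width, and `one_hot` is an illustrative name.

```python
VOCAB_SIZE = 17  # assumption: reserved index 0 plus the 16 hex characters


def one_hot(indices, vocab_size=VOCAB_SIZE):
    """Return one row per index, with a single 1 at that index's position."""
    return [[1 if col == idx else 0 for col in range(vocab_size)] for idx in indices]


for row in one_hot([1, 2, 3, 8, 9, 10]):  # index sequence for "abc123"
    print(row)
```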

(2) The embedding layer is followed by three one-dimensional convolutional layers, conv1d_1, conv1d_2 and conv1d_3, each with a different kernel size. The three convolutional layers run in parallel. The convolutional-layer parameters are learned automatically during training.

(3) The outputs of the three convolutional layers are merged in the concatenation layer concatenate_1.

(4) After the flatten layer flatten_1, the data passes through the dropout layer dropout_1 and the fully connected layer dense_1, which connects to multiple nodes representing the various encodings. Over multiple rounds of iteration, the output loss value (a measure of the difference between predicted and actual values) gradually decreases until it reaches an acceptable minimum. Meanwhile, the accuracy on a validation set can be used to check how well the model performs.
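The data flow through layers (1) to (4) can be sketched as a forward pass in plain Python. This is not the trained model: the weights are random, and the sizes, kernel widths (2, 3, 4), filter count, ReLU activation, softmax output and concatenation along the sequence axis are all assumptions made for illustration. The source only states that three parallel convolutions with different kernel sizes are merged, flattened and fed through dropout and a dense layer.

```python
import math
import random

random.seed(0)
VOCAB, EMBED, SEQ_LEN = 17, 4, 16     # small sizes, chosen for illustration
KERNEL_SIZES, FILTERS = (2, 3, 4), 2  # assumed; the source gives no sizes
ENCODINGS = ["UTF-8", "GBK", "Latin1"]


def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def conv1d(seq, kernels, size):
    """Valid 1-D convolution with ReLU; seq is a list of embedding vectors."""
    out = []
    for start in range(len(seq) - size + 1):
        window = [v for vec in seq[start:start + size] for v in vec]
        out.append([max(0.0, sum(w * x for w, x in zip(k, window))) for k in kernels])
    return out


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Untrained random weights stand in for the learned parameters.
embedding = rand_matrix(VOCAB, EMBED)
conv_weights = {k: rand_matrix(FILTERS, k * EMBED) for k in KERNEL_SIZES}
flat_len = sum((SEQ_LEN - k + 1) * FILTERS for k in KERNEL_SIZES)
dense = rand_matrix(len(ENCODINGS), flat_len)


def forward(indices):
    embedded = [embedding[i] for i in indices]                               # embedding_1
    branches = [conv1d(embedded, conv_weights[k], k) for k in KERNEL_SIZES]  # conv1d_1..3
    merged = [row for branch in branches for row in branch]                  # concatenate_1
    flat = [v for row in merged for v in row]                                # flatten_1
    # dropout_1 is active only during training, so it is skipped here
    scores = [sum(w * x for w, x in zip(row, flat)) for row in dense]        # dense_1
    return dict(zip(ENCODINGS, softmax(scores)))


probs = forward([1, 2, 3, 8, 9, 10] + [0] * (SEQ_LEN - 6))
print(probs)
```

Because the weights are random, the printed probabilities carry no meaning; the sketch only shows how one index sequence becomes one probability per encoding.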

Once the model performs satisfactorily, its structure and weights are saved for use by the system.

Obtaining an encoding recognition model through deep learning is a relatively mature technique.

In this embodiment, the system obtains the encoding that matches the text to be recognized according to the text to be recognized and the preset encoding recognition model, which may specifically include:

11) sending the text to be recognized to the encoding recognition model for calculation, to obtain the match probability of the text to be recognized for each preset encoding;

12) determining the encoding that matches the text to be recognized according to the match probabilities.

Regarding steps 11) and 12): when the text to be recognized is sent to the encoding recognition model, it is processed in the same way as during training, so it is first converted into a sequence of index numbers. For example, if the sequence of the text to be recognized before conversion is c4a7cadecad7d0..., the mapping table converts it to 3, 11, 1, 14, 3, 1, 4, 5, 3, 1, 4, 14, 4, 7, .... The saved weights are then used as the parameters of the embedding and convolutional layers, and the computation yields the probability of the text to be recognized under each encoding. The maximum value is selected from these match probabilities, and the corresponding encoding is taken as the encoding that matches the text to be recognized.

For example, given UTF-8: 0.01, GBK: 0.98, Latin1: 0.01, the value 0.98 is the largest, so GBK is taken as the predicted encoding.
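Selecting the encoding with the maximum probability is a one-line operation. The sketch below uses the three values from the example; `pick_encoding` is an illustrative name.

```python
def pick_encoding(probabilities):
    """Return the encoding name with the highest probability."""
    return max(probabilities, key=probabilities.get)


probabilities = {"UTF-8": 0.01, "GBK": 0.98, "Latin1": 0.01}
predicted = pick_encoding(probabilities)
print(predicted)  # GBK
```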

This embodiment of the present invention provides a character encoding recognition method. The obtained text to be recognized is fed to the encoding recognition model to obtain its match probability for each preset encoding, the encoding that matches the text is determined from the match probabilities, and the text is then decoded to obtain a decoding result. There is therefore no need to set the encoding manually or to hand-craft the feature sequences used to match an encoding, which reduces workload and improves flexibility.

Another embodiment of the present invention provides a character encoding recognition method, including:

S21, obtaining text to be recognized;

S22, obtaining the encoding that matches the text to be recognized according to the text to be recognized and a preset encoding recognition model;

S23, decoding the text to be recognized according to the obtained encoding to obtain a decoding result.

Regarding steps S21 to S23, it should be noted that, in this embodiment, encoding data with a given encoding produces a particular sequence text.

The system obtains the encoding that matches the text to be recognized according to the text to be recognized and the preset encoding recognition model, which may specifically include:

21) selecting multiple text segments from the text to be recognized;

22) sending each text segment to the encoding recognition model for calculation, to obtain the match probability of each text segment for each preset encoding, and determining the encoding that matches each text segment according to the match probabilities;

23) determining the encoding of the text to be recognized according to the encodings of the text segments.

Regarding steps 21) to 23): multiple text segments are selected from the text to be recognized and sent to the encoding recognition model. Each text segment is processed in the same way as during training, so it is first converted into a sequence of index numbers. For example, if the sequence of the text to be recognized before conversion is c4a7cadecad7d0..., the mapping table converts it to 3, 11, 1, 14, 3, 1, 4, 5, 3, 1, 4, 14, 4, 7, ....

The saved weights are then used as the parameters of the embedding and convolutional layers, and the computation yields the match probability of each text segment under each encoding. For each segment, the maximum value is selected from the match probabilities, and the corresponding encoding is taken as the encoding that matches that segment. The encoding that occurs most often among the segments is then taken as the encoding of the text to be recognized.
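The segment-based decision can be sketched as below. The source does not say how segments are chosen, so consecutive non-overlapping windows are an assumption here, and `split_segments` and `vote` are illustrative names. The per-segment encodings passed to `vote` would come from the recognition model.

```python
from collections import Counter


def split_segments(hex_text, segment_len=128):
    """Assumed strategy: consecutive, non-overlapping windows of segment_len hex characters."""
    return [hex_text[i:i + segment_len] for i in range(0, len(hex_text), segment_len)]


def vote(segment_encodings):
    """Return the encoding that occurs most often among the per-segment results."""
    return Counter(segment_encodings).most_common(1)[0][0]


segments = split_segments("c4a7cadecad7d0", segment_len=4)
print(segments)                      # ['c4a7', 'cade', 'cad7', 'd0']
print(vote(["GBK", "GBK", "BIG5"]))  # GBK
```

When two encodings tie, `Counter.most_common` returns the one encountered first; the source does not specify a tie-breaking rule.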

This embodiment of the present invention provides a character encoding recognition method. Multiple text segments are selected from the obtained text to be recognized, each segment is fed to the encoding recognition model to obtain its match probability for each preset encoding, the encoding that matches each segment is determined from the match probabilities, and the encoding of the text to be recognized is then determined and used to decode it. There is therefore no need to set the encoding manually or to hand-craft the feature sequences used to match an encoding, which reduces workload and improves flexibility.

Fig. 3 shows a character encoding recognition device provided by an embodiment of the present invention, including an acquisition module 31, a processing module 32 and a decoding module 33, wherein:

the acquisition module 31 is configured to obtain text to be recognized;

the processing module 32 is configured to obtain the encoding that matches the text to be recognized according to the text to be recognized and a preset encoding recognition model;

the decoding module 33 is configured to decode the text to be recognized according to the obtained encoding to obtain a decoding result.

The processing module is specifically configured to:

The processing module includes a computing unit and a determination unit, wherein:

Since the device of this embodiment works on the same principle as the method of the above embodiments, the more detailed explanation is not repeated here.

It should be noted that, in this embodiment, the related functional modules can be implemented by a hardware processor.

This embodiment of the present invention provides a character encoding recognition device. The obtained text to be recognized is fed to the encoding recognition model to obtain its match probability for each preset encoding, the encoding that matches the text is determined from the match probabilities, and the text is then decoded to obtain a decoding result. There is therefore no need to set the encoding manually or to hand-craft the feature sequences used to match an encoding, which reduces workload and improves flexibility.

Another embodiment of the present invention provides a character encoding recognition device, including an acquisition module, a processing module and a decoding module, wherein:

the acquisition module is configured to obtain text to be recognized;

the processing module is specifically configured to:

select multiple text segments from the text to be recognized;

This embodiment of the present invention provides a character encoding recognition device. Multiple text segments are selected from the obtained text to be recognized, each segment is fed to the encoding recognition model to obtain its match probability for each preset encoding, the encoding that matches each segment is determined from the match probabilities, and the encoding of the text to be recognized is then determined and used to decode it. There is therefore no need to set the encoding manually or to hand-craft the feature sequences used to match an encoding, which reduces workload and improves flexibility.

Fig. 4 shows an electronic device provided by an embodiment of the present invention, including: a processor 401, a memory 402, a bus 403, and a computer program stored in the memory and executable on the processor;

wherein the processor and the memory communicate with each other through the bus;

the processor implements the method described above when executing the computer program, for example including: obtaining text to be recognized; obtaining the encoding that matches the text to be recognized according to the text to be recognized and a preset encoding recognition model; and decoding the text to be recognized according to the obtained encoding to obtain a decoding result.

An embodiment of the present invention provides a non-transitory computer-readable storage medium on which a computer program is stored; the computer program implements the method described above when executed by a processor, for example including: obtaining text to be recognized; obtaining the encoding that matches the text to be recognized according to the text to be recognized and a preset encoding recognition model; and decoding the text to be recognized according to the obtained encoding to obtain a decoding result.

In addition, those skilled in the art will appreciate that although some embodiments described herein include certain features that are included in other embodiments and not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments can be used in any combination.

It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art can design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention can be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several devices, several of these devices can be embodied by one and the same item of hardware. The use of the words first, second and third does not indicate any order; these words can be interpreted as names.

Those of ordinary skill in the art will understand that the above embodiments are only intended to illustrate the technical solutions of the present invention, not to limit them. Although the invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they can still modify the technical solutions recorded in the foregoing embodiments, or replace some or all of the technical features with equivalents, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the scope defined by the claims of the present invention.

Claims

1. A character encoding recognition method, characterized by including:

obtaining text to be recognized;

obtaining the encoding that matches the text to be recognized according to the text to be recognized and a preset encoding recognition model;

decoding the text to be recognized according to the obtained encoding to obtain a decoding result.

2. The method according to claim 1, characterized in that obtaining the encoding that matches the text to be recognized according to the text to be recognized and the preset encoding recognition model includes:

sending the text to be recognized to the encoding recognition model for calculation, to obtain the match probability of the text to be recognized for each preset encoding;

determining the encoding that matches the text to be recognized according to the match probabilities.

3. The method according to claim 1, characterized in that obtaining the encoding that matches the text to be recognized according to the text to be recognized and the preset encoding recognition model includes:

selecting multiple text segments from the text to be recognized;

sending each text segment to the encoding recognition model for calculation, to obtain the match probability of each text segment for each preset encoding, and determining the encoding that matches each text segment according to the match probabilities;

determining the encoding of the text to be recognized according to the encodings of the text segments.

4. The method according to claim 2, characterized in that determining the encoding that matches the text to be recognized according to the match probabilities includes: selecting the maximum probability value from the match probabilities; and taking the encoding corresponding to the maximum probability value as the encoding that matches the text to be recognized.

5. A character encoding recognition device, characterized by including:

an acquisition module, configured to obtain text to be recognized;

a processing module, configured to obtain the encoding that matches the text to be recognized according to the text to be recognized and a preset encoding recognition model;

a decoding module, configured to decode the text to be recognized according to the obtained encoding to obtain a decoding result.

6. The device according to claim 5, characterized in that the processing module is specifically configured to:

7. The device according to claim 5, characterized in that the processing module is specifically configured to:

select multiple text segments from the text to be recognized;

8. The device according to claim 6, characterized in that the processing module includes a computing unit and a determination unit, wherein:

the computing unit is configured to send the text to be recognized to the encoding recognition model for calculation, to obtain the match probability of the text to be recognized for each preset encoding;

the determination unit is configured to select the maximum probability value from the match probabilities and to take the encoding corresponding to the maximum probability value as the encoding that matches the text to be recognized.

9. An electronic device, characterized by including: a processor, a memory, a bus, and a computer program stored in the memory and executable on the processor;

wherein the processor and the memory communicate with each other through the bus;

the processor implements the method according to any one of claims 1-4 when executing the computer program.

10. A non-transitory computer-readable storage medium, characterized in that a computer program is stored on the non-transitory computer-readable storage medium, and the computer program implements the method according to any one of claims 1-4 when executed by a processor.