CN108197087B

CN108197087B - Character code recognition method and device

Info

Publication number: CN108197087B
Application number: CN201810050150.1A
Authority: CN
Inventors: 王占一
Original assignee: Qianxin Technology Group Co Ltd
Current assignee: Qianxin Technology Group Co Ltd
Priority date: 2018-01-18
Filing date: 2018-01-18
Publication date: 2021-11-16
Anticipated expiration: 2038-01-18
Also published as: CN108197087A

Abstract

The invention provides a character code recognition method and device. The method comprises: acquiring a text to be recognized; obtaining a coding mode that conforms to the text to be recognized according to the text to be recognized and a preset coding mode recognition model; and decoding the file to be recognized according to the obtained coding mode to obtain a decoding result. In the embodiments, the text to be recognized is passed to the coding mode recognition model to obtain a coincidence probability value for each preset coding mode, the coding mode conforming to the text is determined from these probability values, and decoding is then performed to obtain a decoding result. Neither the coding mode nor the feature sequences needed to match a coding mode have to be set manually, which reduces the workload and improves flexibility.

Description

Character code recognition method and device

Technical Field

The embodiment of the invention relates to the technical field of information processing, in particular to a character code identification method and device.

Background

In the field of computer information technology, character encoding is a basic technology. Character encoding, also known as character set encoding, maps the characters of a character set to objects in a specified set so that text can be stored in a computer and transmitted over a communications network. Information stored in a computer is represented in binary, and to be understood by a user it must be converted through a character encoding according to a certain character set. Common encoding modes include UTF-8, GB2312, GBK and BIG5. Generally, different languages have their own applicable encodings: ISO-8859-1 is mainly used for Latin characters, GBK and GB2312 are commonly used for simplified Chinese, and BIG5 is commonly used for traditional Chinese.

When a computer stores and displays information, the correct coding mode sometimes cannot be obtained because the encoding information is missing or has been modified, and the text then cannot be used normally. A method and system for recognizing character codes is therefore very important. There are three common recognition methods: (1) Code-range detection: each encoding has its own range of code values, and the encoding is determined from that range; this approach fails when the ranges of different encodings overlap heavily. (2) Feature matching: the current information is matched against keywords in a dictionary or against manually defined features; the encoding is determined once a match succeeds, but cannot be determined if no match is found. (3) Character distribution: a probability model of characters is established in advance, and the encoding is judged by calculating the probability of the current character distribution under the model; this method has limited effect on encoded information that is short or that habitually uses specific words.

Disclosure of Invention

The embodiment of the invention provides a character code identification method and device, which are used for solving the problems that in the prior art, a coding mode depends on manual setting and the flexibility is poor.

In a first aspect, an embodiment of the present invention provides a character code recognition method, including:

acquiring a text to be identified;

acquiring a coding mode which accords with the text to be recognized according to the text to be recognized and a preset coding mode recognition model;

and decoding the file to be identified according to the obtained coding mode to obtain a decoding result.

Optionally, the obtaining, according to the text to be recognized and a preset coding mode recognition model, a coding mode that conforms to the text to be recognized includes:

sending the text to be recognized to the coding mode recognition model for calculation to obtain the coincidence probability values of the text to be recognized corresponding to the preset coding modes;

and determining the coding mode conforming to the text to be recognized according to the conforming probability value.

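Optionally, the obtaining, according to the text to be recognized and a preset coding mode recognition model, a coding mode that conforms to the text to be recognized includes:

selecting a plurality of text segments from the text to be recognized;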

sending each text segment to the coding mode identification model for calculation to obtain a coincidence probability value of each text segment corresponding to each preset coding mode, and determining the coding mode conforming to each text segment according to the coincidence probability value;

and determining the coding mode of the text to be recognized according to the coding mode of each text segment.

Optionally, determining, according to the coincidence probability value, a coding mode that coincides with the text to be recognized includes: selecting a maximum probability value according to the coincidence probability values; and taking the coding mode corresponding to the maximum probability value as the coding mode conforming to the text to be recognized.

In a second aspect, an embodiment of the present invention provides a character code recognition apparatus, including:

the acquisition module is used for acquiring a text to be recognized;

the processing module is used for acquiring a coding mode conforming to the text to be recognized according to the text to be recognized and a preset coding mode recognition model;

and the decoding module is used for decoding the file to be identified according to the obtained coding mode to obtain a decoding result.

Optionally, the processing module is specifically configured to:

selecting a plurality of text segments from the text to be recognized;

Optionally, the processing module comprises a computing unit and a determining unit, wherein:

the calculation unit is used for sending the text to be recognized to the coding mode recognition model for calculation to obtain the coincidence probability values of the text to be recognized corresponding to the preset coding modes;

and the determining unit is used for selecting a maximum probability value according to the coincidence probability values and taking the coding mode corresponding to the maximum probability value as the coding mode conforming to the text to be recognized.

In a third aspect, an embodiment of the present invention provides an electronic device, including: a processor, a memory, a bus, and a computer program stored on the memory and executable on the processor;

the processor and the memory complete mutual communication through the bus;

the processor, when executing the computer program, implements the method as described above.

In a fourth aspect, embodiments of the present invention provide a non-transitory computer-readable storage medium having a computer program stored thereon, which when executed by a processor implements the method as described above.

According to the above technical solution, the character code recognition method and device provided by the embodiments of the present invention obtain, for the acquired text to be recognized, the coincidence probability value of the text for each preset coding mode using the coding mode recognition model, determine the coding mode conforming to the text from these probability values, and then decode to obtain a decoding result. Neither the coding mode nor the feature sequences required to match a coding mode need to be set manually, so the workload is reduced and flexibility is high.

Drawings

Fig. 1 is a schematic flow chart of a character code recognition method according to an embodiment of the present invention;

Fig. 2 is a diagram of a learning structure framework according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a character code recognition apparatus according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of an electronic device according to an embodiment of the present invention.

Detailed Description

The following detailed description of embodiments of the present invention is provided in connection with the accompanying drawings and examples. The following examples are intended to illustrate the invention but are not intended to limit the scope of the invention.

Fig. 1 shows a character code recognition method provided by an embodiment of the present invention, including:

S11, acquiring a text to be recognized;

S12, acquiring a coding mode conforming to the text to be recognized according to the text to be recognized and a preset coding mode recognition model;

S13, decoding the file to be identified according to the obtained coding mode to obtain a decoding result.

It should be noted that, for steps S11 to S13 in the embodiment of the present invention, data encoded with a given encoding mode yields a corresponding character sequence.

For example, the Chinese phrase for "rapid development of computer technology" encoded in UTF-8 is expressed in hexadecimal as e8aea1e7ae97e69cbae68a80e69cafe5bfabe9809fe58f91e5b195; encoded in GBK, it is expressed in hexadecimal as bcc6cbe3bbfabcbccaf5bfeccbd9b7a2d5b9. Here the sequence length is limited to no more than L characters (L can be set flexibly, e.g. 128).
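As an illustrative sketch (not part of the original disclosure), the hexadecimal sequences above can be reproduced with Python's built-in codecs. The phrase is recovered by decoding the UTF-8 bytes quoted above. The helper name `to_hex_sequence` is hypothetical, and measuring L in hexadecimal characters rather than bytes is an assumption.

```python
# Hex form of the UTF-8 bytes quoted in the text above.
UTF8_HEX = "e8aea1e7ae97e69cbae68a80e69cafe5bfabe9809fe58f91e5b195"


def to_hex_sequence(data: bytes, max_len: int = 128) -> str:
    """Return the hexadecimal form of `data`, truncated to `max_len` characters.

    Assumption: the limit L counts hexadecimal characters, not bytes.
    """
    return data.hex()[:max_len]


# Recover the phrase from its UTF-8 bytes, then re-encode it both ways.
phrase = bytes.fromhex(UTF8_HEX).decode("utf-8")
utf8_seq = to_hex_sequence(phrase.encode("utf-8"))
gbk_seq = to_hex_sequence(phrase.encode("gbk"))

print(len(phrase))  # number of characters in the phrase
print(utf8_seq)     # three bytes per character in UTF-8
print(gbk_seq)      # two bytes per character in GBK
```

The same text produces visibly different byte sequences under the two encodings, which is the signal the recognition model is trained on.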

In the embodiment of the present invention, it should be further noted that the coding mode identification model may be obtained through deep learning training, and specifically may be:

Deep learning is iterated repeatedly over hundreds of thousands of sequence samples, or even more, until the training error and the accuracy reach an acceptable level. The model may use a deep learning structure such as LSTM (Long Short-Term Memory, a recurrent neural network) or Text-CNN (Convolutional Neural Networks for Sentence Classification).

Fig. 2 is a diagram of a learning structure framework according to an embodiment of the present invention.

(1) Starting from the input layer input_1, the model connects to the embedding layer (also called the representation layer) embedding_1, whose parameter values are learned automatically by the model.

After the sequence is read in, each hexadecimal character is first converted into a positive integer index number for ease of calculation. A mapping table is established, as shown in the following table:

Character	reserved	a	b	c	d	e	f	0	1	2	3	4	5	6	7	8	9
Index	0	1	2	3	4	5	6	7	8	9	10	11	12	13	14	15	16

For example, abc123 is converted to: 1, 2, 3, 8, 9, 10. These sequences of index numbers serve as the input layer data and are received by the embedding layer in the model. Sequences shorter than L are padded with 0.
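The mapping and padding described above can be sketched as follows. This is a minimal illustration: the names `CHAR_TO_INDEX` and `hex_to_indices` are hypothetical, and appending the zeros at the end of the sequence (rather than the beginning) is an assumption, since the text does not say where the padding goes.

```python
# Index 0 is reserved (used here for padding); 'a'-'f' map to 1-6 and
# '0'-'9' map to 7-16, following the mapping table above.
HEX_CHARS = "abcdef0123456789"
CHAR_TO_INDEX = {ch: i + 1 for i, ch in enumerate(HEX_CHARS)}


def hex_to_indices(hex_text: str, max_len: int = 128) -> list[int]:
    """Convert a hex string to index numbers, truncated or zero-padded to max_len."""
    indices = [CHAR_TO_INDEX[ch] for ch in hex_text.lower()[:max_len]]
    # Assumption: padding zeros are appended after the real indices.
    return indices + [0] * (max_len - len(indices))


print(hex_to_indices("abc123", max_len=8))  # [1, 2, 3, 8, 9, 10, 0, 0]
```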

After receiving the sequence of index numbers, the embedding layer converts it into a matrix on which operations such as convolution can be performed, that is, it initializes each index number of the sequence as a vector. Common conversion methods include random initialization, the one-hot method, and word embedding based on word2vec; the one-hot method is taken as an example here. The basic idea is that exactly one position in the vector corresponding to a character is 1 and all others are 0. For example, abc123 is converted to a matrix whose rows are the one-hot vectors of the index numbers 1, 2, 3, 8, 9 and 10.
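A minimal sketch of the one-hot method for the abc123 example is given below. The function name `one_hot` and the vector width of 17 (the reserved index plus 16 hexadecimal digits) are assumptions for illustration; the text does not state the vector dimension.

```python
VOCAB_SIZE = 17  # assumed: reserved index 0 plus the 16 hexadecimal digits


def one_hot(indices: list[int], vocab_size: int = VOCAB_SIZE) -> list[list[int]]:
    """Return one row per index; each row has a single 1 at the index position."""
    return [[1 if col == idx else 0 for col in range(vocab_size)] for idx in indices]


matrix = one_hot([1, 2, 3, 8, 9, 10])  # index sequence for "abc123"
for row in matrix:
    print(row)
```

Each row contains exactly one 1, so the resulting matrix has one row per character and one column per possible index.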

(2) The embedding layer is followed by three one-dimensional convolutional layers with convolution kernels of different sizes, conv1d_1, conv1d_2 and conv1d_3, which are arranged in parallel. The convolutional layer parameters are likewise learned automatically by the model.

(3) The outputs of the three convolutional layers are merged in the aggregation layer concatenate_1.
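The parallel convolution and aggregation in (2) and (3) can be illustrated with a small pure-Python forward pass. This is a sketch under stated assumptions, not the patented model: the kernel widths (3, 4, 5), the filter count, the ReLU activation, the random weights and the choice to concatenate along the time axis are all assumptions, and the helper names are hypothetical.

```python
import random


def conv1d(seq, kernels):
    """Valid 1D convolution with ReLU.

    seq is [time][channels]; kernels is [filters][width][channels].
    Returns a list of shape [time - width + 1][filters].
    """
    width = len(kernels[0])
    channels = len(seq[0])
    out = []
    for t in range(len(seq) - width + 1):
        row = []
        for kernel in kernels:
            total = sum(
                kernel[k][c] * seq[t + k][c]
                for k in range(width)
                for c in range(channels)
            )
            row.append(max(0.0, total))  # ReLU (assumed activation)
        out.append(row)
    return out


def random_kernels(filters, width, channels, rng):
    """Random weights standing in for learned convolution parameters."""
    return [
        [[rng.uniform(-1, 1) for _ in range(channels)] for _ in range(width)]
        for _ in range(filters)
    ]


rng = random.Random(0)
channels, filters = 17, 4
# Toy input: 10 time steps of one-hot vectors.
seq = [[1.0 if c == t else 0.0 for c in range(channels)] for t in range(10)]

# Three parallel branches with different kernel widths (assumed 3, 4, 5).
branches = [conv1d(seq, random_kernels(filters, w, channels, rng)) for w in (3, 4, 5)]

# Aggregation: stack the branch outputs along the time axis.
merged = [row for branch in branches for row in branch]
print([len(b) for b in branches], len(merged), len(merged[0]))
```

Each branch sees the same input but summarizes windows of a different width, so the merged output combines short and longer byte patterns.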

(4) After processing by the flattening layer flatten_1, the result passes through the dropout layer dropout_1 and the fully connected layer dense_1, which connects to a plurality of nodes representing the respective encoding modes. Through multiple iterations, the output loss function value (i.e., a measure of the difference between the predicted value and the true value) gradually decreases until it reaches an acceptable minimum. Meanwhile, the model's performance can be checked using the accuracy on the validation set.
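The fully connected output and the loss in (4) can be sketched as below. The text names neither the output activation nor the loss function, so softmax and cross-entropy are assumptions here, as are the label set, the toy feature values and the weights. Dropout is omitted because it only affects training.

```python
import math


def dense(features, weights, biases):
    """Fully connected layer: one output score per encoding-mode node."""
    return [
        sum(w * x for w, x in zip(row, features)) + b
        for row, b in zip(weights, biases)
    ]


def softmax(scores):
    """Turn raw scores into probabilities that sum to 1 (assumed activation)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, true_index):
    """Loss is small when the true class receives a high probability (assumed loss)."""
    return -math.log(probs[true_index])


ENCODINGS = ["utf-8", "gbk", "latin-1"]  # hypothetical label set
features = [0.5, 1.0, -0.5, 2.0]         # flattened features (toy values)
weights = [
    [0.1, 0.2, 0.0, -0.3],
    [0.4, 0.1, -0.2, 0.9],
    [-0.2, 0.0, 0.3, 0.1],
]
biases = [0.0, 0.0, 0.0]

probs = softmax(dense(features, weights, biases))
loss = cross_entropy(probs, ENCODINGS.index("gbk"))
print(probs, loss)
```

During training, the weights would be adjusted so that this loss decreases over the iterations.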

And when the model achieves a satisfactory effect, the model structure and the weight value are stored for the system to use.

Obtaining the coding mode recognition model through deep learning is a mature technique.

In the embodiment of the present invention, the system obtains, according to the text to be recognized and the preset coding mode recognition model, a coding mode conforming to the text to be recognized, which specifically includes:

11) sending the text to be recognized to the coding mode recognition model for calculation to obtain the coincidence probability values of the text to be recognized corresponding to the preset coding modes;

12) and determining the coding mode conforming to the text to be recognized according to the conforming probability value.

For steps 11) and 12), it should be noted that the text to be recognized is sent to the coding mode recognition model and processed in the same way as during deep learning training: the text is converted into a sequence of index numbers. For example, if the sequence of the text to be recognized before conversion is c4a7cadecad7d0…, it is converted according to the mapping table into: 3, 11, 1, 14, 3, 1, 4, 5, 3, 1, 4, 14, 4, 7, …. The stored weight values are then used as the parameters of the embedding layer and the convolutional layers, and the probability value of the text to be recognized under each coding mode is calculated. The maximum probability value is selected from the coincidence probability values, and the coding mode corresponding to it is taken as the coding mode conforming to the text to be recognized.
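As a consistency check on the mapping table, the index numbers listed above can be mapped back to hexadecimal characters. The name `INDEX_TO_CHAR` is hypothetical; the mapping itself follows the table given earlier.

```python
HEX_CHARS = "abcdef0123456789"
# Inverse of the mapping table: index 1 -> 'a', ..., index 16 -> '9'.
INDEX_TO_CHAR = {i + 1: ch for i, ch in enumerate(HEX_CHARS)}

indices = [3, 11, 1, 14, 3, 1, 4, 5, 3, 1, 4, 14, 4, 7]
hex_text = "".join(INDEX_TO_CHAR[i] for i in indices)
print(hex_text)  # c4a7cadecad7d0
```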

For example, given UTF-8: 0.01, GBK: 0.98 and Latin1: 0.01, GBK is taken as the predicted coding mode because 0.98 is the largest value.
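Selecting the maximum probability value can be sketched in a few lines. The function name `pick_encoding` is hypothetical; the probability values are those from the example above.

```python
def pick_encoding(probabilities: dict[str, float]) -> str:
    """Return the coding mode with the largest coincidence probability."""
    return max(probabilities, key=probabilities.get)


scores = {"UTF-8": 0.01, "GBK": 0.98, "Latin1": 0.01}
print(pick_encoding(scores))  # GBK
```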

The embodiment of the invention thus provides a character code recognition method in which the coincidence probability value of the acquired text to be recognized is obtained for each preset coding mode using the coding mode recognition model, the coding mode conforming to the text is determined from these probability values, and decoding is then performed to obtain a decoding result. Neither the coding mode nor the feature sequences required to match a coding mode need to be set manually, so the workload is reduced and flexibility is high.
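The final decoding step (S13) can be sketched with Python's built-in codecs. The function name `decode_with` is hypothetical, the predicted coding mode is hard-coded in place of a real model, and using `errors="replace"` is an illustrative choice so that a wrong prediction does not raise an exception.

```python
def decode_with(data: bytes, encoding: str) -> str:
    """Decode raw bytes with the coding mode chosen by the recognition model."""
    # errors="replace" substitutes U+FFFD for bytes that cannot be decoded.
    return data.decode(encoding, errors="replace")


raw = bytes.fromhex("e8aea1e7ae97e69cba")  # first nine UTF-8 bytes of the earlier example
predicted = "utf-8"                        # stands in for the model's output

result = decode_with(raw, predicted)
print(result)
print(decode_with(raw, "ascii"))  # a wrong coding mode yields replacement characters
```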

An embodiment of the present invention provides a character code recognition method, including:

S21, acquiring a text to be recognized;

S22, acquiring a coding mode conforming to the text to be recognized according to the text to be recognized and a preset coding mode recognition model;

S23, decoding the file to be identified according to the obtained coding mode to obtain a decoding result.

It should be noted that, for steps S21 to S23 in the embodiment of the present invention, data encoded with a given encoding mode yields a corresponding character sequence.

Obtaining the coding mode conforming to the text to be recognized according to the text to be recognized and the preset coding mode recognition model may specifically include:

21) selecting a plurality of text segments from the text to be recognized;

22) sending each text segment to the coding mode identification model for calculation to obtain a coincidence probability value of each text segment corresponding to each preset coding mode, and determining the coding mode conforming to each text segment according to the coincidence probability value;

23) and determining the coding mode of the text to be recognized according to the coding mode of each text segment.

For steps 21) to 23), it should be noted that a plurality of text segments are selected from the text to be recognized and sent to the coding mode recognition model. Each text segment is processed in the same way as during deep learning training, that is, it is converted into a sequence of index numbers. For example, if the sequence before conversion is c4a7cadecad7d0…, it is converted according to the mapping table into: 3, 11, 1, 14, 3, 1, 4, 5, 3, 1, 4, 14, 4, 7, ….

The stored weight values are then used as the parameters of the embedding layer and the convolutional layers, and the coincidence probability value of each text segment under each coding mode is calculated. For each text segment, the maximum probability value is selected, and the corresponding coding mode is taken as the coding mode conforming to that segment. The coding mode that occurs most frequently among the segments is then taken as the coding mode of the text to be recognized.
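The segment selection and majority vote can be sketched as follows. The text does not say how segments are chosen, so the evenly spaced, byte-aligned selection in `select_segments` is an assumption, as are the function names and the per-segment labels used in the example.

```python
from collections import Counter


def select_segments(hex_text: str, seg_len: int = 128, count: int = 5) -> list[str]:
    """Pick up to `count` evenly spaced segments of `seg_len` hex characters.

    Assumption: start offsets are rounded down to even positions so that
    each segment begins on a byte boundary.
    """
    if count < 2 or len(hex_text) <= seg_len:
        return [hex_text[:seg_len]]
    last_start = len(hex_text) - seg_len
    starts = sorted({(last_start * i // (count - 1)) // 2 * 2 for i in range(count)})
    return [hex_text[s:s + seg_len] for s in starts]


def majority_encoding(segment_labels: list[str]) -> str:
    """Return the coding mode that occurs most often among the segments."""
    return Counter(segment_labels).most_common(1)[0][0]


segments = select_segments("ab" * 300, seg_len=128, count=5)
print(len(segments), [len(s) for s in segments])

# Hypothetical per-segment predictions from the recognition model.
labels = ["GBK", "UTF-8", "GBK", "GBK", "Latin1"]
print(majority_encoding(labels))  # GBK
```

Voting over several segments makes the result less sensitive to a single segment that happens to look ambiguous.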

The embodiment of the invention thus provides a character code recognition method in which a plurality of text segments are selected from the acquired text to be recognized, the coincidence probability value of each text segment for each preset coding mode is obtained using the coding mode recognition model, the coding mode conforming to each text segment is determined from these probability values, the coding mode of the text to be recognized is then determined, and decoding is performed to obtain a decoding result. Neither the coding mode nor the feature sequences required to match a coding mode need to be set manually, so the workload is reduced and flexibility is high.

Fig. 3 shows a character code recognition apparatus provided in an embodiment of the present invention, which includes an obtaining module 31, a processing module 32, and a decoding module 33, where:

the acquiring module 31 is used for acquiring a text to be recognized;

the processing module 32 is configured to obtain a coding mode according with the text to be recognized according to the text to be recognized and a preset coding mode recognition model;

and the decoding module 33 is configured to decode the file to be identified according to the obtained encoding mode, so as to obtain a decoding result.

The processing module is specifically configured to:

The processing module comprises a calculation unit and a determination unit, wherein:

Since the principle of the apparatus according to the embodiment of the present invention is the same as that of the method according to the above embodiment, further details are not repeated here.

It should be noted that, in the embodiment of the present invention, the relevant functional modules may be implemented by a hardware processor.

The embodiment of the invention thus provides a character code recognition device that obtains, for the acquired text to be recognized, the coincidence probability value of the text for each preset coding mode using the coding mode recognition model, determines the coding mode conforming to the text from these probability values, and then decodes to obtain a decoding result. Neither the coding mode nor the feature sequences required to match a coding mode need to be set manually, so the workload is reduced and flexibility is high.

An embodiment of the present invention provides a character code recognition apparatus, including an obtaining module, a processing module, and a decoding module, wherein:

the acquisition module is used for acquiring a text to be recognized;

The processing module is specifically configured to:

selecting a plurality of text segments from the text to be recognized;

The embodiment of the invention thus provides a character code recognition device that selects a plurality of text segments from the acquired text to be recognized, obtains the coincidence probability value of each text segment for each preset coding mode using the coding mode recognition model, determines the coding mode conforming to each text segment from these probability values, then determines the coding mode of the text to be recognized, and decodes to obtain a decoding result. Neither the coding mode nor the feature sequences required to match a coding mode need to be set manually, so the workload is reduced and flexibility is high.

Fig. 4 shows an electronic device provided in an embodiment of the present invention, including: a processor 401, a memory 402, a bus 403, and computer programs stored on the memory and executable on the processor;

the processor and the memory complete mutual communication through the bus;

the processor, when executing the computer program, implements a method as described above, for example comprising: acquiring a text to be identified; acquiring a coding mode which accords with the text to be recognized according to the text to be recognized and a preset coding mode recognition model; and decoding the file to be identified according to the obtained coding mode to obtain a decoding result.

An embodiment of the present invention provides a non-transitory computer readable storage medium, on which a computer program is stored, and when executed by a processor, the computer program implements the method as described above, for example, including: acquiring a text to be identified; acquiring a coding mode which accords with the text to be recognized according to the text to be recognized and a preset coding mode recognition model; and decoding the file to be identified according to the obtained coding mode to obtain a decoding result.

Furthermore, those skilled in the art will appreciate that while some embodiments described herein include some features included in other embodiments, rather than other features, combinations of features of different embodiments are meant to be within the scope of the invention and form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third and so on does not indicate any ordering. These words may be interpreted as names.

Those of ordinary skill in the art will understand that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; such modifications and substitutions do not depart from the spirit of the corresponding technical solutions and scope of the present invention as defined in the appended claims.

Claims

1. A character code recognition method, comprising:

acquiring a text to be identified;

obtaining a coding mode which accords with the text to be recognized according to the text to be recognized and a preset coding mode recognition model, wherein the coding mode recognition model is obtained through deep learning training;

decoding the file to be identified according to the obtained coding mode to obtain a decoding result;

the obtaining of the coding mode conforming to the text to be recognized according to the text to be recognized and a preset coding mode recognition model comprises the following steps:

selecting a plurality of text segments from the text to be recognized;

determining the coding mode of the text to be recognized according to the occurrence frequency of the coding mode of each text segment;

and the coding mode identification model is connected to a plurality of nodes of each coding mode through full-connection layers to obtain the coincidence probability value of each coding mode.

2. The method according to claim 1, wherein the obtaining of the coding mode conforming to the text to be recognized according to the text to be recognized and a preset coding mode recognition model comprises:

3. The method of claim 2, wherein determining the encoding mode corresponding to the text to be recognized according to the corresponding probability value comprises: selecting a maximum probability value according to the coincidence probability values; and taking the coding mode corresponding to the maximum probability value as the coding mode conforming to the text to be recognized.

4. A character code recognition apparatus, comprising:

the acquisition module is used for acquiring a text to be recognized;

the processing module is used for acquiring a coding mode conforming to the text to be recognized according to the text to be recognized and a preset coding mode recognition model, and the coding mode recognition model is obtained through deep learning training;

the decoding module is used for decoding the file to be identified according to the obtained coding mode to obtain a decoding result;

wherein the processing module is specifically configured to:

selecting a plurality of text segments from the text to be recognized;

5. The apparatus of claim 4, wherein the processing module is specifically configured to:

6. The apparatus of claim 5, wherein the processing module comprises a computing unit and a determining unit, wherein:

7. An electronic device, comprising: a processor, a memory, a bus, and a computer program stored on the memory and executable on the processor;

the processor and the memory complete mutual communication through the bus;

the processor, when executing the computer program, implements the method of any of claims 1-3.

8. A non-transitory computer-readable storage medium, having stored thereon a computer program which, when executed by a processor, implements the method of any one of claims 1-3.