CN105653506A - Method and device for processing texts in GPU on basis of character encoding conversion

Info

Publication number: CN105653506A
Application number: CN201511020414.1A
Authority: CN
Inventors: 潘昊
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2015-12-30
Filing date: 2015-12-30
Publication date: 2016-06-08
Anticipated expiration: 2035-12-30
Also published as: CN105653506B

Abstract

Embodiments of the invention provide a method and device for processing texts in a GPU on the basis of character encoding conversion. The method comprises the following steps: obtaining a binary encoding form of each character in an input text; judging whether the binary encoding form is consistent with a preset encoding form; if the binary encoding form is not consistent with the preset encoding form, carrying out binary encoding on the character by adopting the preset encoding form and converting the encoded character into a binary floating point number type; and if the binary encoding form is consistent with the preset encoding form, converting the character into the binary floating point number type and submitting the character of the binary floating point number type to the GPU to calculate and process. Through the embodiments of the invention, the characters in the texts can be converted into the binary floating point number type which can be processed by the GPU, so that the floating point number processing ability of the GPU is effectively utilized.

Description

Method and device for processing text in a GPU based on character encoding conversion

Technical field

The present invention relates to the technical field of GPU data processing, and in particular to a method and device for processing text in a GPU based on character encoding conversion.

Background art

In recent years, the emergence of the graphics processing unit (Graphics Processing Unit, GPU) has given an immeasurable boost to the development of high-performance computing. The powerful computing performance of the GPU has produced more successful cases than traditional solutions, and GPU data processing is increasingly applied in the fields of big data computing and machine learning.

In data processing, the hardware architecture of the GPU gives it a clear advantage over the central processing unit (Central Processing Unit, CPU), particularly in floating-point processing capability. At present, however, GPUs are still mainly used in fields such as graphics processing, video transcoding, and speech analysis. Text is mainly processed by the CPU, but because of the CPU architecture, the CPU processes text slowly, which limits the efficiency of text processing.

Summary of the invention

The object of the embodiments of the present invention is to provide a method and device for processing text in a GPU based on character encoding conversion, so as to use the floating-point processing capability of the GPU to increase the speed of text processing and thereby significantly improve the speed of text analysis and processing.

To achieve the above object, an embodiment of the present invention discloses a method for processing text in a GPU based on character encoding conversion, the method comprising:

obtaining the binary encoding form of each character in an input text;

judging whether the binary encoding form is consistent with a preset encoding form;

if they are inconsistent, encoding the character in binary using the preset encoding form and converting the encoded character into a binary floating-point number type; if they are consistent, converting the character into the binary floating-point number type;

submitting the character converted into the binary floating-point number type to the GPU for computation.

Preferably, the preset encoding form comprises one of the following encoding forms:

the Unicode encoding form, the GB2312 encoding form, the GBK encoding form, or the GB18030 encoding form.

Preferably, before judging whether the binary encoding form is consistent with the preset encoding form, the method further comprises:

obtaining the binary floating-point number type supported by the GPU;

converting the encoded character into a binary floating-point number type comprises:

converting the encoded character into the binary floating-point number type supported by the GPU;

converting the character into the binary floating-point number type comprises:

converting the character into the binary floating-point number type supported by the GPU.

Preferably, obtaining the binary floating-point number type supported by the GPU comprises:

obtaining the binary floating-point number type supported by the GPU through an API provided by a GPU computing framework.

Preferably, converting the encoded character into a binary floating-point number type comprises:

converting the encoded character into a preset binary floating-point number type;

converting the character into the binary floating-point number type comprises:

converting the character into the preset binary floating-point number type.

Preferably, submitting the character converted into the binary floating-point number type to the GPU for computation comprises:

judging whether the length of the binary floating-point number type supported by the GPU is not less than the length of the binary floating-point number type into which the character has been converted;

if so, submitting the character converted into the binary floating-point number type to the GPU, where the GPU processes it directly;

otherwise, splitting the character converted into the binary floating-point number type into pieces of a length that the GPU can hold, and sending the split pieces to the GPU for processing.

Preferably, the binary floating-point number types supported by the GPU comprise: the half-precision binary floating-point number type, the single-precision binary floating-point number type, and the double-precision binary floating-point number type.

The present invention also provides a device for processing text in a GPU based on character encoding conversion, the device comprising:

a character encoding obtaining unit, configured to obtain the binary encoding form of each character in an input text;

an encoding judging unit, configured to judge whether the binary encoding form is consistent with a preset encoding form, to trigger the encoding arrangement unit if the binary encoding form is inconsistent with the preset encoding form, and to trigger the transcoder unit if the binary encoding form is consistent with the preset encoding form;

the encoding arrangement unit, configured to encode the character in binary using the preset encoding form;

the transcoder unit, configured to convert the encoded character or the character into a binary floating-point number type;

a character submission processing unit, configured to submit the character converted into the binary floating-point number type to the GPU for computation.

Preferably, the device further comprises a GPU obtaining unit, configured to obtain the binary floating-point number type supported by the GPU;

the transcoder unit is specifically configured to convert the encoded character into the binary floating-point number type supported by the GPU, or to convert the character into the binary floating-point number type supported by the GPU.

Preferably, the transcoder unit is specifically configured to convert the encoded character or the character into a preset binary floating-point number type.

Preferably, the character submission processing unit comprises a GPU parameter judging subunit, a character splitting subunit, and a character processing subunit;

the GPU parameter judging subunit is configured to judge whether the length of the binary floating-point number type supported by the GPU is not less than the length of the binary floating-point number type into which the character has been converted, to trigger the character splitting subunit if not, and to trigger the character processing subunit if so;

the character splitting subunit is configured to split the character converted into the binary floating-point number type into pieces of a length that the GPU can hold, and to send the split pieces to the GPU for processing;

the character processing subunit is configured to submit the character converted into the binary floating-point number type to the GPU, where the GPU processes it directly.

The method and device for processing text in a GPU based on character encoding conversion provided by the embodiments of the present invention supply a character encoding conversion method that converts the characters in a text into a binary floating-point number type that the GPU can process, so that the floating-point processing capability of the GPU is used effectively. Of course, a product or method implementing the present invention need not achieve all of the above advantages at the same time.

Brief description of the drawings

To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Evidently, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art may derive other drawings from them without creative effort.

Fig. 1 is a schematic flowchart of a method for processing text in a GPU based on character encoding conversion according to an embodiment of the present invention;

Fig. 2 is a schematic flowchart of another method for processing text in a GPU based on character encoding conversion according to an embodiment of the present invention;

Fig. 3 is a schematic structural diagram of a device for processing text in a GPU based on character encoding conversion according to an embodiment of the present invention;

Fig. 4 is a schematic structural diagram of another device for processing text in a GPU based on character encoding conversion according to an embodiment of the present invention.

Detailed description of the embodiments

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the accompanying drawings. Evidently, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

The method and device for processing text in a GPU based on character encoding conversion provided by the embodiments of the present invention supply a character encoding method that converts the characters in a text into a binary floating-point number type that the GPU can process, and thereby make effective use of the floating-point processing capability of the GPU.

Fig. 1 is a schematic flowchart of a method for processing text in a GPU based on character encoding conversion according to an embodiment of the present invention. The method comprises the following steps:

S101: obtain the binary encoding form of each character in the input text.

The binary encoding form of each character may be the ASCII encoding form, the Unicode encoding form, or another encoding form. This embodiment does not limit the binary encoding form of the characters in the text, as long as it is a binary encoding used in text.

Characters in text can be encoded in many ways, and several comparatively standardized and unified encoding schemes have emerged that make subsequent text processing easier, for example the Unicode, GB2312, GBK, and GB18030 encoding forms. Before the characters in a text are processed, their encoding form must first be confirmed. There are many ways to obtain the encoding form of the characters in a text. Typically, the binary encoding form of each character is obtained from known conditions, for example by using a common plug-in for identifying character encoding forms, which determines whether the text uses the Unicode encoding form or another encoding form. Obtaining the binary encoding of a character belongs to the prior art, and this embodiment does not limit how the binary encoding form of each character in the input text is obtained, as long as it can be obtained.
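As a non-limiting illustration of step S101, the following sketch guesses the encoding form of a byte string by trial decoding. It is written in Python, which this document does not prescribe. The function name `guess_encoding` and the candidate list are assumptions made only for illustration, and trial decoding is just one of many possible identification methods; it can misidentify short or ambiguous inputs.

```python
def guess_encoding(raw: bytes, candidates=("utf-8", "gb18030")) -> str:
    """Return the first candidate encoding that decodes `raw` without error."""
    for name in candidates:
        try:
            raw.decode(name)  # strict decoding raises on invalid byte sequences
        except UnicodeDecodeError:
            continue
        return name
    raise ValueError("no candidate encoding matches the input")
```

The order of the candidates matters in this sketch: the stricter encoding is tried first, because a permissive multi-byte encoding accepts many byte strings that were not produced with it.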

S102: judge whether the binary encoding form is consistent with the preset encoding form. If it is inconsistent, perform step S103; if it is consistent, skip step S103 and go directly to step S104.

The preset encoding form is a unified encoding form and comprises one of the following: the Unicode encoding form, the GB2312 encoding form, the GBK encoding form, or the GB18030 encoding form.

To judge whether the binary encoding form is consistent with the preset encoding form, information such as the encoding structure and code length of the binary encoding form of the characters in the text must be determined. This information is compared with the encoding structure, code length, and similar information of the preset encoding form to judge whether the two encoding forms are consistent, that is, whether the encoding form of the characters in the text is consistent with the preset encoding form.

The embodiment of the present invention converts the various encoding forms of the characters in a text into a single unified encoding form, namely the preset encoding form. Working from the preset encoding form simplifies the subsequent conversion strategy: it avoids needing a separate conversion process for each encoding form found in the texts being processed.
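As a non-limiting illustration of the judgment in step S102, the following Python sketch compares two encoding labels after normalising their aliases with the standard `codecs.lookup` function. This is a deliberately simplified stand-in: the embodiment compares encoding structure and code length, whereas the sketch only compares names. The constant `PRESET_ENCODING` and the function `is_preset_encoding` are hypothetical names chosen for illustration.

```python
import codecs

PRESET_ENCODING = "utf-16-be"  # assumed preset form, chosen only for illustration


def is_preset_encoding(detected: str, preset: str = PRESET_ENCODING) -> bool:
    """Compare two encoding labels after normalising aliases such as 'utf8'."""
    return codecs.lookup(detected).name == codecs.lookup(preset).name
```

If the function returns False, the flow continues with step S103; otherwise step S103 is skipped.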

S103: encode the character in binary using the preset encoding form.

When the binary encoding form of the character is inconsistent with the preset encoding form, the character must be encoded in binary using the preset encoding form. For example, if the preset encoding form is the Unicode encoding form and the characters in the text are in the GBK encoding form, the GBK-encoded characters in the text must be converted, according to the character encoding rules, into the corresponding characters in the Unicode encoding form.
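As a non-limiting illustration of step S103, the following Python sketch re-encodes GBK bytes into a Unicode encoding form using the built-in codecs. The function name `reencode` is hypothetical, and the choice of `utf-16-be` as the preset form is an assumption; for characters in plane 0 it yields the same two bytes as UCS-2.

```python
def reencode(raw: bytes, source: str, preset: str = "utf-16-be") -> bytes:
    """Decode bytes from the source encoding and re-encode them in the preset one."""
    return raw.decode(source).encode(preset)


# The GBK bytes D6 D0 encode the Chinese character "中", whose Unicode code is 4E2D.
unicode_bytes = reencode(b"\xd6\xd0", "gbk")
```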

S104: convert the encoded character or the character into a binary floating-point number type.

When the binary encoding form of the character is inconsistent with the preset encoding form, the character is first encoded in binary using the preset encoding form, and the encoded character is then converted into the binary floating-point number type. When the binary encoding form of the character is consistent with the preset encoding form, the character is converted into the binary floating-point number type directly.

The binary floating-point number types comprise: the half-precision binary floating-point number type, the single-precision binary floating-point number type, and the double-precision binary floating-point number type.

The conversion of a character into a binary floating-point number may be completed in the CPU or in the GPU. During conversion, the encoded character is converted into a particular binary floating-point number according to a fixed conversion rule, and the specific conversion may depend on the attributes of the character itself. For example, if the characters in the text can be converted to the Unicode encoding form, each character may preferentially be converted into a single-precision binary floating-point number; if the characters in the text include Unicode code points in the supplementary planes, each character may also be converted into a single-precision binary floating-point number. Here, the Unicode code space is divided into 17 groups, each called a plane, and each plane has 65536 code points. The Basic Multilingual Plane is plane 0, the range of Unicode code points from U+0000 to U+FFFF; the planes other than plane 0 are called supplementary planes. The binary floating-point number type may also be determined according to the size of the memory that the electronic device uses to store the converted characters, or according to the bandwidth available when the converted characters are sent to the GPU. For example, under such memory or bandwidth constraints, the character may instead be converted into a half-precision binary floating-point number.
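As a non-limiting illustration of the character attributes mentioned above, the following Python sketch computes the Unicode plane of a character and the number of binary digits in its code point. The function names `unicode_plane` and `code_bits` are hypothetical.

```python
def unicode_plane(ch: str) -> int:
    """Return the Unicode plane (0 to 16) that a single character belongs to."""
    return ord(ch) >> 16  # each plane spans 65536 code points


def code_bits(ch: str) -> int:
    """Return the number of binary digits needed to write the character's code point."""
    return ord(ch).bit_length()
```

The largest Unicode code point, U+10FFFF, needs 21 binary digits. A single-precision number stores 23 mantissa bits, so under the assumption that the code is placed in the mantissa, single precision has room for code points of both plane 0 and the supplementary planes, whereas the 10 mantissa bits of half precision do not.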

The above conversion may be completed in the CPU or in the GPU. If the conversion is completed in the CPU, the converted binary floating-point numbers must be sent to the GPU and stored there; if the conversion is completed in the GPU, the results are stored directly inside the GPU.

The conversion process is illustrated below with a concrete example.

According to the international standard IEEE 754, the binary floating-point number V for any character can be expressed in the following form:

V = (-1)^S × M × 2^E

Here (-1)^S represents the sign bit: when S = 0, V is positive, and when S = 1, V is negative. M represents the significand, which is greater than or equal to 1 and less than 2. 2^E represents the exponent part, where E is the exponent.

According to IEEE 754, in a 32-bit single-precision floating-point number the highest bit is the sign bit S, the next 8 bits are the exponent E, and the remaining 23 bits are the significand M. A 16-bit half-precision floating-point number occupies 2 bytes: the highest bit is the sign bit S, the next 5 bits are the exponent E, and the remaining 10 bits are the significand M.
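As a non-limiting illustration of the single-precision layout described above, the following Python sketch uses the standard `struct` module to separate a number into its sign bit, 8-bit exponent field, and 23-bit mantissa field. The function name `float32_fields` is hypothetical. The exponent field is returned as stored, that is, with the single-precision bias of 127 already added to E.

```python
import struct


def float32_fields(value: float) -> tuple:
    """Split a number, stored as IEEE 754 single precision, into its three bit fields."""
    bits = struct.unpack(">I", struct.pack(">f", value))[0]
    sign = bits >> 31                 # 1 bit: S
    exponent = (bits >> 23) & 0xFF    # 8 bits: E plus the bias of 127
    mantissa = bits & 0x7FFFFF        # 23 bits: fraction part of M
    return sign, exponent, mantissa
```

For example, the value 1.0 has S = 0, E = 0, and M = 1, so its stored fields are 0, 127, and 0.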

Suppose the characters in the text are ultimately converted into single-precision binary floating-point numbers and the preset encoding form is UCS-2 within the Unicode encoding form. When a UCS-2 encoded character is converted into a single-precision binary floating-point number, the exponent E is set to a fixed value, the binary digits of the character's UCS-2 code are used as the high-order or low-order bits of the mantissa M, and the remaining bits of M are filled with 0.

For example, after the Chinese character "中" is encoded with UCS-2 in the Unicode encoding form, its Unicode code in hexadecimal notation is "\u4e2d", and its binary code is 0100111000101101. With the exponent E fixed at 1, adding the single-precision exponent bias of 127 gives 128, and the resulting single-precision binary floating-point number is 01000000001001110001011010000000: the sign bit 0, the exponent field 10000000, and a mantissa consisting of the 16 code bits followed by seven 0 bits.
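As a non-limiting illustration, the worked example above can be reproduced with the following Python sketch, which places the 16-bit UCS-2 code in the high-order mantissa bits of a single-precision bit pattern. The names `EXPONENT_FIELD` and `ucs2_to_float32_bits` are hypothetical, and the sketch covers only the high-order placement used in the example.

```python
EXPONENT_FIELD = 1 + 127  # fixed exponent E = 1 plus the single-precision bias


def ucs2_to_float32_bits(ch: str) -> int:
    """Return the 32-bit pattern that carries a UCS-2 character in its mantissa."""
    code = ord(ch)
    if code > 0xFFFF:
        raise ValueError("character is outside the range of UCS-2")
    # sign bit 0 | 8-bit exponent field | 16 code bits | 7 zero padding bits
    return (EXPONENT_FIELD << 23) | (code << 7)


pattern = format(ucs2_to_float32_bits("中"), "032b")
```

Here `pattern` holds the same 32 binary digits as the example, which correspond to the hexadecimal value 40271680.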

The conversion in this embodiment is not limited to the above. Any conversion that uses a fixed rule to turn the encoded character or the character into a specific binary floating-point number may be used, provided it meets the system requirements.

S105: submit the character converted into the preset binary floating-point number type to the GPU for computation.

By applying this embodiment of the present invention, the characters in a text can be converted into a binary floating-point number type that the GPU can process, so that the floating-point processing capability of the GPU is used effectively.
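As a non-limiting illustration of steps S103 to S105 performed on the CPU side, the following Python sketch converts a byte string into a contiguous buffer of single-precision numbers and recovers the text from that buffer. The function names are hypothetical, the fixed exponent field of 128 and the high-order placement follow the worked example, and the actual transfer of the buffer to a GPU is not shown because it depends on the GPU computing framework in use.

```python
import struct
from array import array

EXPONENT_FIELD = 128  # assumed fixed exponent field (E = 1 plus bias 127)


def text_to_float_buffer(raw: bytes, source_encoding: str) -> array:
    """Decode the input, then pack each UCS-2 character into one float32."""
    buffer = array("f")  # contiguous single-precision storage
    for ch in raw.decode(source_encoding):
        code = ord(ch)
        if code > 0xFFFF:
            raise ValueError("character is outside the range of UCS-2")
        bits = (EXPONENT_FIELD << 23) | (code << 7)
        buffer.append(struct.unpack(">f", struct.pack(">I", bits))[0])
    return buffer


def float_buffer_to_text(buffer) -> str:
    """Recover the characters from the mantissa bits of each float32."""
    chars = []
    for value in buffer:
        bits = struct.unpack(">I", struct.pack(">f", value))[0]
        chars.append(chr((bits >> 7) & 0xFFFF))
    return "".join(chars)
```

Because the fixed rule is reversible, the result of any GPU-side computation that preserves the mantissa bits can be mapped back to characters in the same way.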

Fig. 2 is a schematic flowchart of another method for processing text in a GPU based on character encoding conversion according to an embodiment of the present invention. The method comprises the following steps:

S201: obtain the binary encoding form of each character in the input text.

S202: obtain the binary floating-point number type supported by the GPU.

The binary floating-point number types supported by the GPU comprise: the half-precision binary floating-point number type, the single-precision binary floating-point number type, and the double-precision binary floating-point number type.

Through the API provided by the GPU computing framework, the size of floating-point number that the GPU can hold is determined, and the binary floating-point number type supported by the GPU is thereby obtained.
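As a non-limiting illustration of step S202, the following Python sketch derives the supported floating-point types from a device extension string. It assumes that the GPU computing framework is OpenCL, where the extension names `cl_khr_fp16` and `cl_khr_fp64` indicate half-precision and double-precision support; this document does not name a framework. The function `supported_float_types` is hypothetical, and the call that retrieves the extension string from the device is not shown.

```python
def supported_float_types(extensions: str) -> list:
    """Map a space-separated extension string to the supported float types."""
    names = set(extensions.split())
    types = ["single"]  # assumption: single precision is always available
    if "cl_khr_fp16" in names:
        types.insert(0, "half")
    if "cl_khr_fp64" in names:
        types.append("double")
    return types
```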

S203: judge whether the binary encoding form is consistent with the preset encoding form. If it is inconsistent, perform step S204; if it is consistent, skip step S204 and go directly to step S205.

S204: encode the character in binary using the preset encoding form.

S205: convert the encoded character or the character into the binary floating-point number type.

S206: submit the character converted into the binary floating-point number type to the GPU for computation.

By applying this embodiment of the present invention, the characters in a text can be converted into a binary floating-point number type that the GPU directly supports. When processing floating-point numbers of a matching binary floating-point number type, the GPU can use its own floating-point processing capability quickly and efficiently.
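As a non-limiting illustration of how the type obtained in step S202 might guide the conversion in step S205, the following Python sketch picks the narrowest supported type whose mantissa can hold a character's code point. The selection rule, the table `MANTISSA_BITS`, and the function `pick_float_type` are assumptions made for illustration; the embodiment only requires that the target type be one the GPU supports.

```python
MANTISSA_BITS = {"half": 10, "single": 23, "double": 52}  # stored fraction bits


def pick_float_type(code_point: int, supported) -> str:
    """Pick the narrowest supported type whose mantissa can hold the code point."""
    needed = code_point.bit_length()
    for name in ("half", "single", "double"):
        if name in supported and MANTISSA_BITS[name] >= needed:
            return name
    raise ValueError("no supported type can hold this code point without splitting")
```

When the function raises, the character does not fit any supported type in one piece, which corresponds to the case handled by splitting.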

On the basis of Fig. 1, in another implementation provided by the embodiments of the present invention, step S105 of the method further comprises:

Judging whether the length of the binary floating-point number type supported by the GPU is not less than the length of the binary floating-point number type into which the character has been converted.

If so, the character converted into the binary floating-point number type is submitted to the GPU, and the GPU processes it directly.

Otherwise, the character converted into the binary floating-point number type is split into pieces of a length that the GPU can hold. For example, suppose the GPU supports only half-precision binary floating-point numbers, and the Chinese character "中" is represented as the single-precision binary floating-point number 01000000001001110001011010000000. The length of the binary floating-point number type supported by the GPU is less than that of the single-precision number for "中"; that is, the type supported by the GPU cannot hold it. The single-precision binary floating-point number for "中" must therefore be split into two half-precision binary floating-point numbers, so that the type supported by the GPU can hold it. After splitting, "中" is represented by the two half-precision binary floating-point numbers 0100000100111000 and 0100000010110100.
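As a non-limiting illustration of the split, the following Python sketch produces the two half-precision bit patterns of the example from the 16-bit code of "中". The layout is inferred from the example's bit patterns rather than stated in the text: each half carries 8 code bits in the high-order mantissa bits, followed by two 0 bits, with an exponent field of 10000 (E = 1 plus the half-precision bias of 15). The function names are hypothetical.

```python
HALF_EXPONENT_FIELD = 1 + 15  # fixed exponent E = 1 plus the half-precision bias


def _half_bits(byte: int) -> int:
    # sign bit 0 | 5-bit exponent field | 8 code bits | 2 zero padding bits
    return (HALF_EXPONENT_FIELD << 10) | (byte << 2)


def split_to_half_bits(code: int) -> tuple:
    """Split a 16-bit character code into two half-precision bit patterns."""
    if not 0 <= code <= 0xFFFF:
        raise ValueError("code must fit in 16 bits")
    return _half_bits(code >> 8), _half_bits(code & 0xFF)


def join_half_bits(first: int, second: int) -> int:
    """Reassemble the 16-bit character code from the two half-precision patterns."""
    return (((first >> 2) & 0xFF) << 8) | ((second >> 2) & 0xFF)
```

Calling `split_to_half_bits(0x4E2D)` yields the two patterns given in the example, and `join_half_bits` restores the original code from them.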

By applying this embodiment of the present invention, when the length of the binary floating-point number type supported by the GPU cannot accommodate the converted binary floating-point number type, the character of the binary floating-point number type is split, so that the floating-point processing capability of the GPU is still used effectively and the GPU's ability to process floating-point numbers is improved at the same time.

The characters in a text can be converted into a binary floating-point number type that the GPU directly supports, and when processing floating-point numbers of a matching binary floating-point number type, the GPU can use its own floating-point processing capability quickly and efficiently.

Fig. 3 is a schematic structural diagram of a device for processing text in a GPU based on character encoding conversion according to an embodiment of the present invention. Corresponding to the flow shown in Fig. 1, the device comprises a character encoding obtaining unit 301, an encoding judging unit 302, an encoding arrangement unit 303, a transcoder unit 304, and a character submission processing unit 305:

The character encoding obtaining unit 301 is configured to obtain the binary encoding form of each character in the input text;

The encoding judging unit 302 is configured to judge whether the binary encoding form is consistent with the preset encoding form, to trigger the encoding arrangement unit if the binary encoding form is inconsistent with the preset encoding form, and to trigger the transcoder unit if the binary encoding form is consistent with the preset encoding form;

The encoding arrangement unit 303 is configured to encode the character in binary using the preset encoding form;

The transcoder unit 304 is configured to convert the encoded character or the character into a binary floating-point number type;

The character submission processing unit 305 is configured to submit the character converted into the binary floating-point number type to the GPU for computation.

The transcoder unit 304 is specifically configured to convert the encoded character or the character into a preset binary floating-point number type.

The character submission processing unit 305 comprises a GPU parameter judging subunit (not shown), a character splitting subunit (not shown), and a character processing subunit (not shown).

By applying this embodiment of the present invention, when the length of the binary floating-point number type supported by the GPU cannot accommodate the converted binary floating-point number type, the character of the binary floating-point number type is split, so that the floating-point processing capability of the GPU is used effectively and the range of floating-point numbers that the GPU can process is extended.

Fig. 4 is a schematic structural diagram of another device for processing text in a GPU based on character encoding conversion according to an embodiment of the present invention. Corresponding to the flow shown in Fig. 2, the device comprises a character encoding obtaining unit 401, a GPU obtaining unit 402, an encoding judging unit 403, an encoding arrangement unit 404, a transcoder unit 405, and a character submission processing unit 406:

The character encoding obtaining unit 401 is configured to obtain the binary encoding form of each character in the input text;

The GPU obtaining unit 402 is configured to obtain the binary floating-point number type supported by the GPU;

The encoding judging unit 403 is configured to judge whether the binary encoding form is consistent with the preset encoding form, to trigger the encoding arrangement unit if the binary encoding form is inconsistent with the preset encoding form, and to trigger the transcoder unit directly if the binary encoding form is consistent with the preset encoding form;

The encoding arrangement unit 404 is configured to encode the character in binary using the preset encoding form;

The transcoder unit 405 is specifically configured to convert the encoded character into the binary floating-point number type supported by the GPU, or to convert the character into the binary floating-point number type supported by the GPU;

The character submission processing unit 406 is configured to submit the character converted into the preset binary floating-point number type to the GPU for computation.

It should be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between these entities or operations. Moreover, the terms "comprise", "include", and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or device that comprises the element.

The embodiments in this specification are described in a related manner. For identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, the system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for relevant details, refer to the description of the method embodiments.

The foregoing describes only preferred embodiments of the present invention and is not intended to limit its protection scope. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention falls within the protection scope of the present invention.

Claims

1. A method for processing text in a GPU based on character encoding conversion, applied to an electronic device, characterized in that the method comprises the steps of:

obtaining the binary encoding form of each character in an input text;

2. The method according to claim 1, characterized in that the preset encoding form comprises one of the following encoding forms:

3. The method according to claim 1, characterized in that, before judging whether the binary encoding form is consistent with the preset encoding form, the method further comprises:

obtaining the binary floating-point number type supported by the GPU;

4. The method according to claim 3, characterized in that obtaining the binary floating-point number type supported by the GPU comprises:

5. The method according to claim 1, characterized in that converting the encoded character into a binary floating-point number type comprises:

6. The method according to claim 5, characterized in that submitting the character converted into the binary floating-point number type to the GPU for computation comprises:

7. The method according to any one of claims 1-6, characterized in that the binary floating-point number types supported by the GPU comprise: the half-precision binary floating-point number type, the single-precision binary floating-point number type, and the double-precision binary floating-point number type.

8. A device for processing text in a GPU based on character encoding conversion, applied to an electronic device, characterized by comprising:

9. The device according to claim 8, characterized by further comprising:

a GPU obtaining unit, configured to obtain the binary floating-point number type supported by the GPU;

10. The device according to claim 8, characterized in that the transcoder unit is specifically configured to convert the encoded character or the character into a preset binary floating-point number type.

11. The device according to claim 10, characterized in that the character submission processing unit comprises a GPU parameter judging subunit, a character splitting subunit, and a character processing subunit;