CN111179937A - Method, apparatus and computer-readable storage medium for text processing - Google Patents
- Publication number
- CN111179937A (application CN201911349033.6A)
- Authority
- CN
- China
- Prior art keywords: text, Chinese, pinyin, string, sliding window
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
Abstract
The invention provides a method, an apparatus, and a computer-readable storage medium for text processing. The method comprises the following steps: acquiring the text of a corpus; extracting the Arabic numeral strings in the text and converting the extracted Arabic numeral strings into Chinese numeral strings to form a Chinese text of the corpus; intercepting sliding window texts from the Chinese text using a sliding window and converting the middle character of each sliding window text into pinyin to form sliding window pinyin texts; and recognizing the sliding window pinyin texts based on a predictive recognition network to obtain a recognition result for the pinyin in each sliding window pinyin text.
Description
Technical Field
The present invention relates to the field of text processing, and more particularly, to a method for text processing, an apparatus implementing the method, and a computer-readable storage medium.
Background
With the development of computer hardware and speech recognition technology, speech recognition is increasingly applied in various fields. In the field of aviation, the effect of weather on the flight of an aircraft is of great importance. However, in weather speech recognition, factors such as nonstandard pronunciation may introduce deviations into the recognition result, so that the weather broadcast generated from it is not accurate enough, adversely affecting the flight judgment of the aircraft pilot. In other fields as well, inaccuracy in the speech recognition result text may adversely affect subsequent processing (e.g., translation of the result text). Further, in fields such as text editing, deviations such as wrongly written characters may appear in an edited text through the author's own carelessness, and such deviations can be hard to spot even on repeated checking.
Disclosure of Invention
In view of the above problems, a method for text processing, particularly a method for processing a result text of speech recognition, is proposed herein to correct an erroneous recognition result text due to pronunciation inaccuracy or the like in speech recognition.
According to one aspect of the invention, a method for text processing is provided. The method comprises the following steps: acquiring the text of a corpus; extracting the Arabic numeral strings in the text and converting the extracted Arabic numeral strings into Chinese numeral strings to form a Chinese text of the corpus; intercepting sliding window texts from the Chinese text using a sliding window and converting the middle character of each sliding window text into pinyin to form sliding window pinyin texts; and recognizing the sliding window pinyin texts based on a predictive recognition network to obtain a recognition result for the pinyin in each sliding window pinyin text.
According to another aspect of the present invention, there is provided an apparatus for text processing. The apparatus comprises: a memory having computer program code stored thereon; and a processor configured to execute the computer program code to perform the method as described above.
According to yet another aspect of the present invention, a computer-readable storage medium is provided. The computer readable storage medium has stored thereon a computer program code which, when executed, performs the method as described above.
With the scheme of the invention, the numbers in a text are converted into Chinese and then into pinyin, and the pinyin is recognized with the predictive recognition network to obtain the maximum likelihood prediction for each character in the text, thereby correcting errors that may occur in the text.
Drawings
FIG. 1 shows a flow diagram of a method for text processing according to an embodiment of the invention;
FIG. 2 shows a flowchart of steps for obtaining text of a corpus, according to one embodiment of the invention;
FIG. 3 shows a flowchart of steps for forming a Chinese text of a corpus in accordance with one embodiment of the present invention;
FIG. 4 shows a flowchart of steps for forming a sliding window pinyin text;
FIG. 5 shows a flowchart of steps for obtaining recognition results, according to one embodiment of the invention;
- FIG. 6 is a diagram illustrating a predictive recognition network for obtaining recognition results according to an embodiment of the invention; and
FIG. 7 shows a schematic block diagram of an example device that may be used to implement an embodiment of the invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the accompanying drawings in order to more clearly understand the objects, features and advantages of the present invention. It should be understood that the embodiments shown in the drawings are not intended to limit the scope of the present invention, but are merely intended to illustrate the spirit of the technical solution of the present invention.
In the following description, for the purposes of illustrating various inventive embodiments, certain specific details are set forth in order to provide a thorough understanding of the various inventive embodiments. One skilled in the relevant art will recognize, however, that the embodiments may be practiced without one or more of the specific details. In other instances, well-known devices, structures and techniques associated with this application may not be shown or described in detail to avoid unnecessarily obscuring the description of the embodiments.
Throughout the specification and claims, the word "comprise" and variations thereof, such as "comprises" and "comprising," are to be understood as an open, inclusive meaning, i.e., as being interpreted to mean "including, but not limited to," unless the context requires otherwise.
Reference throughout this specification to "one embodiment" or "some embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least one embodiment. Thus, the appearances of the phrases "in one embodiment" or "in some embodiments" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
As used in the specification and the appended claims, the singular forms "a", "an", and "the" may include plural referents unless the context clearly dictates otherwise. It should be noted that the term "or" is generally employed in its sense including "and/or" unless the context clearly dictates otherwise.
FIG. 1 shows a flow diagram of a method 100 for text processing, according to an embodiment of the invention.
As shown in FIG. 1, the method 100 includes a step 110 in which the text of a corpus is obtained. The text may be a recognition result text generated by speech recognition, or text obtained in other ways (for example, an article written by an author that contains wrongly written characters); in either case the text may contain erroneous characters. The corpus refers to the original material containing the text; besides the text to be processed, it generally also includes formatting, symbols, and even picture content.
FIG. 2 shows a flowchart of step 110 for obtaining text of a corpus, according to an embodiment of the invention. As shown in FIG. 2, step 110 may include a sub-step 112 in which the original corpus associated with the item to be identified is obtained. Here, the items to be recognized define the range of the original corpus. For example, if the method 100 is used for post-processing of weather speech recognition results text, then the raw corpora relating to weather reports should be obtained in sub-step 112. These raw corpora may be automatically crawled from the internet using crawler technology or obtained in other ways, such as from a free or fee-based database.
Next, in sub-step 114, each raw corpus obtained in sub-step 112 is screened to obtain qualified raw corpora. Here, unqualified raw corpora are mainly those that do not include content related to the item to be recognized, are incomplete, or fail to meet other requirements (such as format requirements). The technique for screening the original corpora may be any screening technique known now or developed in the future, and is not described in detail here.
After the qualified raw corpora are obtained, in sub-step 116 they are filtered to remove the beginning, the end, specific characters, and the like, thereby generating the text of the corpus to be processed. As known to those skilled in the art, a raw corpus such as a weather report contains a certain amount of content irrelevant to the weather: the beginning of the report usually names the information source, the end usually carries a travel or dressing prompt, and the middle may contain specific characters irrelevant to the weather, such as item symbols, numbers, or typesetting marks like "#". Such extraneous content should be filtered out before the text is processed.
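As a minimal sketch of this filtering, assume a report whose first line names the source, whose last line is a travel prompt, and whose typesetting marks are "#" and "*"; all of these markers are illustrative assumptions, not fixed by the patent:

```python
import re

def clean_corpus(raw: str) -> str:
    """Strip report boilerplate and typesetting characters from a raw
    weather-report corpus, keeping only the text to be processed.
    The "Source:"/"Tip:" markers are invented for this sketch."""
    lines = raw.splitlines()
    # Drop an assumed source line at the beginning and a travel/dressing
    # prompt at the end, if present.
    if lines and lines[0].startswith("Source:"):
        lines = lines[1:]
    if lines and lines[-1].startswith("Tip:"):
        lines = lines[:-1]
    text = "\n".join(lines)
    # Remove typesetting symbols such as "#" and "*".
    return re.sub(r"[#*]", "", text)

print(clean_corpus("Source: demo\nBeijing sunny#\nTip: bring a coat"))
# prints "Beijing sunny"
```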
Returning next to FIG. 1, after obtaining the text of the corpus in step 110, the method 100 proceeds to step 120, where the Arabic numeral strings in the text are extracted and converted into Chinese numeral strings to form a Chinese text of the corpus.
FIG. 3 shows a flowchart of the step 120 of forming a Chinese text of a corpus according to one embodiment of the present invention. As shown in fig. 3, step 120 may include a sub-step 122 of traversing the text of the corpus obtained in step 110, and extracting a list of arabic numeral strings therein for each line of the text, respectively. There are various methods for extracting the arabic numeral string, for example, the arabic numeral string in the text can be extracted by using a regular method (e.g., a regular expression in Python).
Assume the text of the corpus to be processed has n lines (n is an integer greater than or equal to 1), denoted L_1, L_2, …, L_n. For each line L_i, a list res_i = [z_i1, z_i2, …, z_ik] of the Arabic numeral strings of that line is obtained, where res_i denotes the list of Arabic numeral strings of the ith line L_i, and z_i1, z_i2, …, z_ik denote the k Arabic numeral strings of the ith line L_i (k is an integer greater than or equal to 0, and the value of k may differ from line to line).
Here, Arabic numerals refer to the ten digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, and an Arabic numeral string refers to a numeral string composed of Arabic numerals and a decimal point ".", that is, what are commonly called integers and decimals; for clarity they are described herein as Arabic numeral strings. For example, an Arabic numeral string may take the form "123", "123.4", "123.45", etc.
Those skilled in the art will appreciate that lines of the text that contain only whitespace may also be deleted, prior to sub-step 122 or in step 110, to avoid wasting processing resources.
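Sub-step 122 can be sketched with a regular expression, as the text suggests; the pattern below matches the integer and decimal forms described above (the function name is illustrative):

```python
import re

# Matches an integer or a decimal such as "123", "123.4", "123.45".
NUM_RE = re.compile(r"\d+(?:\.\d+)?")

def extract_number_lists(text: str) -> list[list[str]]:
    """For each line L_i of the text, return res_i, the (possibly empty)
    list of Arabic numeral strings found in that line."""
    return [NUM_RE.findall(line) for line in text.splitlines()]

res = extract_number_lists("temperature 12.5 to 20\nwind force 3")
# res[0] is the list for line 1, res[1] the list for line 2
```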
Next, in sub-step 124, each Arabic numeral string z_ij (j = 1, 2, …, k) in each list res_i is converted into a Chinese numeral string Z_ij.
Specifically, as shown in FIG. 3, in one embodiment, sub-step 124 may further comprise a sub-step 1242, in which an Arabic numeral string z_ij is extracted from the list res_i, and a sub-step 1244, in which it is determined whether the Arabic numeral string z_ij includes a decimal point.
If the Arabic numeral string z_ij includes a decimal point (i.e., the determination of sub-step 1244 is yes), then in sub-step 1246 each digit to the left of the decimal point is converted into a Chinese digit with the unit of its place appended, yielding a left Chinese digit string; the decimal point is converted into the "dot" character; and each digit to the right of the decimal point is directly converted into a Chinese digit, yielding a right Chinese digit string. The left Chinese digit string, the dot character, and the right Chinese digit string are then concatenated in order to obtain the Chinese numeral string Z_ij corresponding to the Arabic numeral string z_ij.
For example, assume the Arabic numeral string extracted in sub-step 1242 is "123.45". Sub-step 1244 determines that it includes a decimal point ".". In sub-step 1246, each of the digits "1", "2", and "3" of "123" to the left of the decimal point is therefore converted into the Chinese digits "one", "two", and "three" respectively, with the unit of its place appended (hundreds, tens, and none for the units digit), giving the left Chinese digit string "one hundred twenty-three"; the decimal point "." is converted into "point"; and each of the digits "4" and "5" to the right of the decimal point is directly converted into the Chinese digits "four" and "five", giving the right Chinese digit string "four five". The left Chinese digit string "one hundred twenty-three", the word "point", and the right Chinese digit string "four five" are then concatenated in order to obtain the Chinese numeral string "one hundred twenty-three point four five" corresponding to the Arabic numeral string "123.45".
If, on the other hand, the Arabic numeral string z_ij does not include a decimal point (i.e., the determination in sub-step 1244 is no), then in sub-step 1248 each digit of the Arabic numeral string z_ij is converted into a Chinese digit with the unit of its place appended, yielding the Chinese numeral string Z_ij corresponding to the Arabic numeral string z_ij.
For example, assume the Arabic numeral string extracted in sub-step 1242 is "123". Sub-step 1244 determines that it does not include a decimal point ".". In sub-step 1248, each of the digits "1", "2", and "3" of "123" is therefore converted into the Chinese digits "one", "two", and "three" respectively, with the unit of its place appended (hundreds, tens, and none for the units digit), giving the Chinese numeral string "one hundred twenty-three" corresponding to "123"; this is the same procedure used for the left Chinese digit string in sub-step 1246.
After the Arabic numeral strings in the text have been converted into Chinese numeral strings, step 120 may further include (not shown) searching the text for the Arabic numeral strings in each list res_i and replacing each found Arabic numeral string with the corresponding Chinese numeral string. For example, if an Arabic numeral string found in the text is the same as the jth element z_ij of the list res_i, the jth element Z_ij can be found in the list RES_i of Chinese numeral strings corresponding to res_i, and the Arabic numeral string z_ij in the text is replaced with the corresponding Chinese numeral string Z_ij.
All the Arabic numeral strings in the text are replaced in turn with Chinese numeral strings in this way, and the resulting text is called the Chinese text.
Through the above process, the Arabic numerals in a text (e.g., the result text of speech recognition) are converted into corresponding Chinese numerals for subsequent further processing.
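The per-digit conversion and replacement of sub-steps 1242 to 1248 can be sketched as follows. This is a simplified illustration: it ignores special cases such as embedded zeros, supports at most five integer digits, and the function names are not from the patent:

```python
DIGITS = "零一二三四五六七八九"
# Place units for the simple per-digit scheme: ones, tens, hundreds, ...
UNITS = ["", "十", "百", "千", "万"]

def to_chinese(num: str) -> str:
    """Convert an Arabic numeral string to a Chinese numeral string in the
    per-digit fashion the patent describes: left of the decimal point each
    digit gets its place unit appended; right of it digits are converted
    directly; the decimal point becomes the dot word '点'."""
    left, dot, right = num.partition(".")
    n = len(left)
    left_cn = "".join(DIGITS[int(d)] + UNITS[n - 1 - i] for i, d in enumerate(left))
    right_cn = "".join(DIGITS[int(d)] for d in right)
    return left_cn + ("点" + right_cn if dot else "")

def replace_in_line(line: str, numbers: list[str]) -> str:
    """Replace each found Arabic numeral string in a line with its Chinese
    form (assumes the numbers are given in the order they appear)."""
    for z in numbers:
        line = line.replace(z, to_chinese(z), 1)
    return line
```

For example, `to_chinese("123.45")` yields "一百二十三点四五", matching the "one hundred twenty-three point four five" example above.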
Returning next to FIG. 1, the method 100 proceeds to step 130, where the sliding window text is truncated from the Chinese text generated in step 120 using a sliding window, and the middle characters of the sliding window text are converted to pinyin to form a sliding window pinyin text.
FIG. 4 shows a flowchart of the step 130 for forming a sliding window pinyin text, in accordance with one embodiment of the present invention. As shown in FIG. 4, step 130 may include a sub-step 132, in which the beginning and end of the Chinese text generated in step 120 are each padded with a specific character string to form a padded text. Assuming the Chinese text is N characters long (N is an integer greater than or equal to 1) and the added specific character string has length m (m is an integer greater than or equal to 1), the resulting padded text has size N + 2m. The specific character string consists of characters that do not occur in the Chinese text generated in step 120; it may be a repetition of a special character (e.g., a string of repeated "*" or "#" symbols) or a multi-character token with a specific meaning (e.g., the English word "start" or "stop"). Adding specific character strings at the beginning and end of the Chinese text serves, on the one hand, to mark where the text begins and ends and, on the other hand, to make the text suitable for the bidirectional prediction that follows (e.g., BiLSTM) without overflowing at either end. For example, if the sliding window to be employed is 11 characters wide, m may be set to 5, and a specific character string of length 5 is added at each of the beginning and end of the Chinese text.
Next, in sub-step 134, the padded text obtained in sub-step 132 is traversed with a sliding window of size 2m + 1 to intercept at least one sliding window text. Here, a sliding window text is a text fragment of length 2m + 1 cut out of the padded text. Since the length of the Chinese text is at least 1 (texts of length 0 are filtered out in step 110 or 120), at least one sliding window text is intercepted.
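Sub-steps 132 and 134, padding with a specific character string of length m and sliding a window of size 2m + 1, can be sketched as follows (the pad character "#" is one of the examples the text gives, and the function name is illustrative):

```python
def sliding_windows(chinese_text: str, m: int = 5, pad_char: str = "#") -> list[str]:
    """Pad the text with m copies of a special character at both ends,
    then slide a window of size 2*m + 1 over it, producing one window
    per original character; that character sits in the middle."""
    padded = pad_char * m + chinese_text + pad_char * m
    w = 2 * m + 1
    return [padded[i : i + w] for i in range(len(padded) - w + 1)]

wins = sliding_windows("天气晴朗", m=2, pad_char="#")
# wins[0] == "##天气晴": '天' is the middle character of the first window
```

Note that the number of windows equals the length of the original Chinese text, so every character gets a turn as the middle (to-be-predicted) character.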
Next, in sub-step 136, an embedding operation is performed on the first m characters and the last m characters of each sliding window text, and the middle character of the sliding window text is converted into tone-free pinyin (or into a similar sound, e.g., one produced by a recognition error due to incorrect pronunciation; both are collectively referred to herein as pinyin). The result of performing the embedding operation on the first m characters may be referred to as the embedded first character string, and the result of performing it on the last m characters as the embedded last character string. Here, an embedding operation is one that maps a large sparse vector into a low-dimensional space that preserves semantic relationships. For example, if a word is taken as the smallest unit of text, word embedding can be understood as a mapping that embeds a word from the text space into a numerical vector space by some method, usually with a reduction in dimension. There are many concrete ways to implement embedding, which are not detailed here.
Next, in sub-step 138, the embedded first character string, the tone-free pinyin, and the embedded last character string obtained in sub-step 136 are concatenated. The text resulting from the concatenation is referred to as the sliding window pinyin text of each sliding window text, such as the sliding window pinyin texts 602 and 604 shown in FIG. 6 (windows over a sentence meaning roughly "today's weather is sunny, suitable for going out", with the middle character replaced by its pinyin). Note that in a sliding window pinyin text only the middle character (i.e., the character to be predicted) is pinyin; the m characters on each side are Chinese characters.
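Sub-steps 136 and 138 then replace the middle character with its tone-free pinyin. The sketch below uses a toy character-to-pinyin table; a real implementation might obtain pinyin from a library such as pypinyin, and the embedding of the side characters is omitted here:

```python
# A toy character-to-pinyin table, invented for this sketch; a real system
# might use a pinyin library to produce tone-free pinyin for any character.
PINYIN = {"晴": "qing", "朗": "lang", "天": "tian", "气": "qi"}

def window_pinyin_text(window: str, m: int) -> str:
    """Replace the middle character of a sliding-window text with its
    tone-free pinyin, leaving the m characters on each side unchanged
    (those sides are what the embedding layer later consumes)."""
    mid = window[m]
    return window[:m] + PINYIN.get(mid, mid) + window[m + 1:]

print(window_pinyin_text("天气晴朗#", m=2))  # 天气qing朗#
```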
Returning to FIG. 1, after step 130, the method 100 continues to step 140, where the sliding window pinyin text is recognized based on the predictive recognition network to obtain a recognition result of the pinyins in the sliding window pinyin text.
Fig. 5 shows a flow chart of step 140 for obtaining a recognition result according to an embodiment of the invention. Fig. 6 shows a schematic structural diagram of a predictive recognition network 600 for obtaining recognition results according to an embodiment of the present invention. The predictive recognition network is described herein, by way of example, as a bidirectional long short-term memory network (BiLSTM) that predicts over the input text in both directions to produce predicted values for the character to be predicted. However, those skilled in the art will appreciate that the present invention is not so limited, and other bidirectional or unidirectional predictive recognition networks may equally be used, within the spirit of the present invention, to recognize the pinyin in a sliding window pinyin text.
As shown in fig. 5, step 140 may include a sub-step 142 in which, as shown in the embedding layer in fig. 6, the embedded vector length of the embedded first character string and the embedded last character string is obtained (for example, from the sliding window pinyin text 602 or 604) and the tone-free pinyin is one-hot encoded so that the vector length of the encoded pinyin equals the embedded vector length; in sub-step 144 the encoded pinyin is inserted between the embedded first character string and the embedded last character string to obtain the encoded text of the sliding window text.
One-hot encoding, also referred to herein as one-bit-effective encoding, employs an N-bit status register to encode N states; each state has its own independent register bit, and only one bit is active at any time. The specific implementation of one-hot encoding is well known to those skilled in the art and is not described here.
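A minimal illustration of the one-hot encoding described above, using an invented four-symbol pinyin vocabulary:

```python
def one_hot(index: int, n: int) -> list[int]:
    """N-state one-hot code: only the bit for the given state is 1."""
    vec = [0] * n
    vec[index] = 1
    return vec

# Encoding the pinyin "qing" against a toy 4-symbol vocabulary:
vocab = ["qing", "lang", "tian", "qi"]
code = one_hot(vocab.index("qing"), len(vocab))  # [1, 0, 0, 0]
```

In the patent, n would be the embedded vector length, so that the encoded pinyin can be inserted between the two embedded character strings.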
Next, in sub-step 146, the encoded text is input to the BiLSTM encoding layer shown in FIG. 6, whereby the encoded text of the sliding window text is recognized using the BiLSTM to obtain a plurality of prediction scores for the pinyin.
Next, in sub-step 148, the plurality of prediction scores are input to a Conditional Random Field (CRF) classification layer as shown in FIG. 6, which constrains the plurality of prediction scores and selects the label sequence with the highest prediction score as the recognition result for the pinyin to be predicted in the sliding window pinyin text.
In one embodiment, the label sequence with the highest prediction score may be selected using a scoring function, for example as follows:

y* = argmax_{y ∈ Y_X} s(X, y)

where X is the currently input pinyin, Y_X is the set of candidate label sequences, y is a candidate label sequence, s(X, y) is the score of the label sequence y for the currently input pinyin X as computed by the scoring function s(·, ·), and argmax(·) is the function that selects the argument with the maximum score.
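The argmax selection over candidate labels can be sketched as follows; the candidate characters and their scores are invented for illustration, and in the patent the scores would come from the BiLSTM and CRF layers rather than a hand-written table:

```python
def best_label(candidates: dict[str, float]) -> str:
    """Select the candidate label with the highest score, i.e.
    y* = argmax over Y_X of s(X, y)."""
    return max(candidates, key=candidates.get)

# Hypothetical scores for the pinyin "qing" in context:
scores = {"晴": 4.2, "情": 1.1, "清": 0.7}
print(best_label(scores))  # 晴
```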
As shown in fig. 6, through sub-steps 142 to 148, for the pinyin "qing" in the sliding window pinyin text 602, the CRF layer outputs its recognition result 606 "sunny", and for the pinyin "lang" in the sliding window pinyin text 604, the CRF layer outputs its recognition result 608 "lang".
By using the scheme of the invention, the Arabic numerals in a speech recognition result are converted into Chinese numerals and then into pinyin for renewed prediction and recognition, so that speech recognition errors caused by pronunciation and the like can be corrected. In addition, the data features input to the network are acquired through a sliding window, and predicting over both the preceding and following context can further confirm the speech recognition result or correct recognition errors (for example, errors caused by a speaker not distinguishing front and back nasal sounds can be corrected through the surrounding semantics).
FIG. 7 shows a schematic block diagram of an example device 200 that may be used to implement an embodiment of the invention. The device 200 may be, for example, a computer for text processing or a handheld mobile device or the like. As shown, device 200 may include one or more Central Processing Units (CPUs) 210 (only one shown schematically) that may perform various suitable actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 220 or loaded from a storage unit 280 into a Random Access Memory (RAM) 230. In the RAM 230, various programs and data required for the operation of the device 200 can also be stored. The CPU 210, ROM 220, and RAM 230 are connected to each other through a bus 240. An input/output (I/O) interface 250 is also connected to bus 240.
A number of components in device 200 are connected to I/O interface 250, including: an input unit 260 such as a keyboard, a mouse, etc.; an output unit 270 such as various types of displays, speakers, and the like; a storage unit 280 such as a magnetic disk, an optical disk, or the like; and a communication unit 290 such as a network card, modem, wireless communication transceiver, etc. The communication unit 290 allows the device 200 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The method 100 described above may be performed, for example, by the processing unit 210 of the apparatus 200. For example, in some embodiments, the method 100 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 280. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 200 via the ROM 220 and/or the communication unit 290. When the computer program is loaded into RAM 230 and executed by CPU 210, one or more of the operations of method 100 described above may be performed. Further, the communication unit 290 may support wired or wireless communication functions.
The method 100 and the apparatus 200 for text processing according to the present invention are described above with reference to the accompanying drawings. However, it will be appreciated by those skilled in the art that the performance of the steps of the method 100 is not limited to the order shown in the figures and described above, but may be performed in any other reasonable order. Further, the device 200 also need not include all of the components shown in fig. 7, it may include only some of the components necessary to perform the functions described in the present invention, and the manner in which these components are connected is not limited to the form shown in the drawings. For example, in the case where the device 200 is a portable device such as a cellular phone, the device 200 may have a different structure compared to that in fig. 7.
The present invention may be methods, apparatus, systems and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therein for carrying out aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, the electronic circuit executing the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary rather than exhaustive, and is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical application, or technical improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (10)
1. A method for text processing, comprising:
acquiring texts of the corpus;
extracting Arabic numeral strings in the text, and converting the extracted Arabic numeral strings into Chinese numeral strings to form a Chinese text of the corpus;
intercepting a sliding window text from the Chinese text by using a sliding window, and converting the middle character of the sliding window text into pinyin to form a sliding window pinyin text; and
recognizing the sliding window pinyin text based on a predictive recognition network to obtain a recognition result of the pinyin in the sliding window pinyin text.
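As a rough illustration of the pipeline in claim 1, the sketch below converts Arabic digits to Chinese numerals, pads the text, and replaces the middle character of each sliding window with toneless pinyin. The `PINYIN` map and the `#` padding character are toy stand-ins, not part of the patent; a real system would use a pinyin library (e.g. pypinyin) and a trained recognizer for the final step.

```python
# Toy pinyin mapping for illustration only; a real system would consult a
# full Chinese-character-to-pinyin dictionary or library.
PINYIN = {"一": "yi", "二": "er", "三": "san"}

def to_chinese_digits(text: str) -> str:
    """Replace each Arabic digit with its Chinese numeral (simplified sketch)."""
    digits = dict(zip("0123456789", "零一二三四五六七八九"))
    return "".join(digits.get(ch, ch) for ch in text)

def sliding_window_pinyin(chinese_text: str, m: int = 1):
    """Pad both ends with a filler character, slide a window of size 2m+1,
    and replace the middle character of each window with toneless pinyin."""
    pad = "#" * m  # '#' is assumed not to occur in the Chinese text
    padded = pad + chinese_text + pad
    windows = []
    for i in range(len(chinese_text)):
        win = padded[i:i + 2 * m + 1]
        mid = win[m]
        windows.append(win[:m] + PINYIN.get(mid, mid) + win[m + 1:])
    return windows

print(sliding_window_pinyin(to_chinese_digits("12"), m=1))  # ['#yi二', '一er#']
```

Each resulting sliding-window pinyin text would then be fed to the predictive recognition network of claim 1.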
2. The method of claim 1, wherein obtaining text of a corpus comprises:
acquiring an original corpus related to a task to be recognized;
screening the original corpus to obtain a qualified original corpus; and
filtering the qualified original corpus to remove characters at the beginning and end of the corpus as well as specific characters, so as to generate the text of the corpus.
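The screening and filtering of claim 2 can be sketched as follows. The set of "specific characters" is a hypothetical assumption here (the patent does not enumerate them), and dropping empty lines stands in for the "qualified corpus" screening step.

```python
import re

def clean_corpus_line(line: str, specific_chars: str = "、@#") -> str:
    """Strip whitespace at the beginning/end of a corpus line and remove
    specific characters (the character set is a hypothetical example)."""
    line = line.strip()
    return re.sub(f"[{re.escape(specific_chars)}]", "", line)

def filter_corpus(lines):
    """Keep only non-empty lines after cleaning, a stand-in for the
    'screening for qualified original corpus' step of claim 2."""
    cleaned = (clean_corpus_line(l) for l in lines)
    return [l for l in cleaned if l]

print(filter_corpus(["  你好@世界  ", "   ", "测试#文本"]))  # ['你好世界', '测试文本']
```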
3. The method of claim 1, wherein forming the chinese text of the corpus comprises:
traversing the text of the corpus, and respectively extracting a list of Arabic number strings in each line of the text; and
converting each Arabic numeral string in the list into a Chinese numeral string, respectively.
4. The method of claim 3, wherein separately converting each Arabic numeral string in the list to a Chinese numeral string comprises:
extracting the Arabic numeral strings in the list;
determining whether the Arabic number string includes a decimal point;
if the Arabic numeral string includes a decimal point, converting each digit to the left of the decimal point into a Chinese digit and appending the unit of its digit position to obtain a left Chinese numeral string, converting the decimal point into the character "dot" (点), directly converting each digit to the right of the decimal point into a Chinese digit to obtain a right Chinese numeral string, and splicing the left Chinese numeral string, the "dot" character and the right Chinese numeral string in order to serve as the Chinese numeral string of the Arabic numeral string; and
if the Arabic numeral string does not include a decimal point, converting each digit of the Arabic numeral string into a Chinese digit and appending the unit of its digit position to obtain the Chinese numeral string of the Arabic numeral string.
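A literal reading of claim 4 can be sketched as below: digits left of the decimal point each receive their positional unit, the point becomes 点, and digits to the right are read out one by one. This is a minimal sketch of the claimed per-digit rule, not a full idiomatic Chinese number formatter (e.g. it does not elide zeros).

```python
DIGITS = dict(zip("0123456789", "零一二三四五六七八九"))
UNITS = ["", "十", "百", "千", "万"]  # positional units for the integer part

def int_part_to_chinese(s: str) -> str:
    """Convert each digit left of the decimal point into a Chinese digit and
    append the unit of its digit position, per claim 4 (no zero-elision)."""
    out = []
    for pos, ch in enumerate(s):
        unit = UNITS[len(s) - 1 - pos]
        out.append(DIGITS[ch] + unit)
    return "".join(out)

def arabic_to_chinese(s: str) -> str:
    """Convert an Arabic numeral string, splitting on a decimal point if present."""
    if "." in s:
        left, right = s.split(".")
        # digits right of the point are converted directly, without units
        return int_part_to_chinese(left) + "点" + "".join(DIGITS[d] for d in right)
    return int_part_to_chinese(s)

print(arabic_to_chinese("12.5"))  # 一十二点五
```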
5. The method of claim 3, wherein separately converting each Arabic numeral string in the list to a Chinese numeral string further comprises:
searching the text for Arabic numeral strings identical to those in the list, and replacing each such Arabic numeral string with its corresponding Chinese numeral string.
6. The method of claim 1, wherein forming a sliding window pinyin text comprises:
padding the beginning and end of the Chinese text with a specific character string to form a padded text, wherein the specific character string comprises characters different from the characters in the Chinese text and has a length of m, m being an integer greater than or equal to 1;
traversing the padded text with a sliding window of size 2m+1 to intercept at least one sliding window text;
performing an embedding operation on the first m characters and the last m characters of each sliding window text, respectively, to obtain an embedded first character string and an embedded tail character string, and converting the middle character of the sliding window text into toneless pinyin; and
splicing the embedded first character string, the toneless pinyin and the embedded tail character string to obtain the sliding window pinyin text of each sliding window text.
7. The method of claim 6, wherein obtaining the recognition result of the pinyin in the sliding-window pinyin text comprises:
acquiring the embedding vector length of the embedded first character string and the embedded tail character string, and one-hot encoding the toneless pinyin so that the vector length of the encoded pinyin equals the embedding vector length;
inserting the encoded pinyin between the embedded first character string and the embedded tail character string to obtain an encoded text of the sliding window text;
recognizing the encoded text of the sliding window text with a bidirectional long short-term memory network (BiLSTM) to obtain a plurality of prediction samples of the pinyin; and
constraining the plurality of prediction samples with a conditional random field (CRF) to select the tag sequence with the highest prediction score as the recognition result of the pinyin.
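A minimal sketch of the encoding step in claims 6 and 7: the m context characters on each side of the window are embedded, the toneless pinyin is one-hot encoded to the same vector length, and the vectors are spliced into the sequence that a BiLSTM-CRF recognizer would consume. The pinyin vocabulary and the deterministic toy embedding below are illustrative assumptions, not trained values from the patent.

```python
import random

PINYIN_VOCAB = ["yi", "er", "san", "shi", "dian"]  # toy toneless-pinyin vocabulary
EMB_DIM = len(PINYIN_VOCAB)  # claim 7: one-hot length equals embedding length

def embed(ch: str):
    """Deterministic toy character embedding (stand-in for a trained table):
    seeds a PRNG with the character's code point for reproducibility."""
    rng = random.Random(ord(ch))
    return [rng.uniform(-1.0, 1.0) for _ in range(EMB_DIM)]

def one_hot(pinyin: str):
    """One-hot encode a toneless pinyin over the toy vocabulary."""
    vec = [0.0] * EMB_DIM
    vec[PINYIN_VOCAB.index(pinyin)] = 1.0
    return vec

def encode_window(head: str, pinyin: str, tail: str):
    """Splice embedded head characters, the one-hot pinyin, and embedded tail
    characters into the input sequence for the BiLSTM-CRF recognizer."""
    return [embed(c) for c in head] + [one_hot(pinyin)] + [embed(c) for c in tail]

encoded = encode_window("一", "er", "三")  # window '一er三' with m = 1
print(len(encoded), len(encoded[1]))       # 3 vectors, each of length EMB_DIM
```

The BiLSTM would emit per-position tag scores over such sequences, and the CRF layer would pick the highest-scoring tag sequence as the recognition result.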
8. The method of claim 1, wherein the text is a result text of speech recognition.
9. An apparatus for text processing, comprising:
a memory having computer program code stored thereon; and
a processor configured to execute the computer program code to perform the method of any of claims 1 to 8.
10. A computer readable storage medium having stored thereon computer program code which, when executed, performs the method of any of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911349033.6A CN111179937A (en) | 2019-12-24 | 2019-12-24 | Method, apparatus and computer-readable storage medium for text processing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111179937A true CN111179937A (en) | 2020-05-19 |
Family
ID=70652100
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911349033.6A Pending CN111179937A (en) | 2019-12-24 | 2019-12-24 | Method, apparatus and computer-readable storage medium for text processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111179937A (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1212404A (en) * | 1997-09-19 | 1999-03-31 | 国际商业机器公司 | Method for identifying character/numeric string in Chinese speech recognition system |
US9672827B1 (en) * | 2013-02-11 | 2017-06-06 | Mindmeld, Inc. | Real-time conversation model generation |
CN108536669A (en) * | 2018-02-27 | 2018-09-14 | 北京达佳互联信息技术有限公司 | Literal information processing method, device and terminal |
CN109147767A (en) * | 2018-08-16 | 2019-01-04 | 平安科技(深圳)有限公司 | Digit recognition method, device, computer equipment and storage medium in voice |
CN109461459A (en) * | 2018-12-07 | 2019-03-12 | 平安科技(深圳)有限公司 | Speech assessment method, apparatus, computer equipment and storage medium |
CN109801630A (en) * | 2018-12-12 | 2019-05-24 | 平安科技(深圳)有限公司 | Digital conversion method, device, computer equipment and the storage medium of speech recognition |
CN109815476A (en) * | 2018-12-03 | 2019-05-28 | 国网浙江省电力有限公司杭州供电公司 | A kind of term vector representation method based on Chinese morpheme and phonetic joint statistics |
CN109871535A (en) * | 2019-01-16 | 2019-06-11 | 四川大学 | A kind of French name entity recognition method based on deep neural network |
CN109977398A (en) * | 2019-02-21 | 2019-07-05 | 江苏苏宁银行股份有限公司 | A kind of speech recognition text error correction method of specific area |
CN110232923A (en) * | 2019-05-09 | 2019-09-13 | 青岛海信电器股份有限公司 | A kind of phonetic control command generation method, device and electronic equipment |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112037762A (en) * | 2020-09-10 | 2020-12-04 | 中航华东光电(上海)有限公司 | Chinese-English mixed speech recognition method |
CN113723082A (en) * | 2021-08-30 | 2021-11-30 | 支付宝(杭州)信息技术有限公司 | Method and device for detecting Chinese pinyin from text |
CN113723082B (en) * | 2021-08-30 | 2024-08-02 | 支付宝(杭州)信息技术有限公司 | Method and device for detecting Chinese pinyin from text |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107220235B (en) | Speech recognition error correction method and device based on artificial intelligence and storage medium | |
US10372821B2 (en) | Identification of reading order text segments with a probabilistic language model | |
CN113807098B (en) | Model training method and device, electronic equipment and storage medium | |
CN112036162B (en) | Text error correction adaptation method and device, electronic equipment and storage medium | |
CN110717331A (en) | Neural network-based Chinese named entity recognition method, device, equipment and storage medium | |
CN112395385B (en) | Text generation method and device based on artificial intelligence, computer equipment and medium | |
CN111160004B (en) | Method and device for establishing sentence-breaking model | |
CN111079432B (en) | Text detection method and device, electronic equipment and storage medium | |
CN111753532B (en) | Error correction method and device for Western text, electronic equipment and storage medium | |
JP2023012522A (en) | Method and device for training document reading model based on cross modal information | |
CN113673228B (en) | Text error correction method, apparatus, computer storage medium and computer program product | |
CN116757184B (en) | Vietnam voice recognition text error correction method and system integrating pronunciation characteristics | |
CN113918031A (en) | System and method for Chinese punctuation recovery using sub-character information | |
CN113076720A (en) | Long text segmentation method and device, storage medium and electronic device | |
CN111916063A (en) | Sequencing method, training method, system and storage medium based on BPE (Business Process Engineer) coding | |
CN111179937A (en) | Method, apparatus and computer-readable storage medium for text processing | |
CN113743101A (en) | Text error correction method and device, electronic equipment and computer storage medium | |
CN113836308B (en) | Network big data long text multi-label classification method, system, device and medium | |
CN114218940B (en) | Text information processing and model training method, device, equipment and storage medium | |
CN115730585A (en) | Text error correction and model training method and device, storage medium and equipment | |
CN113255329A (en) | English text spelling error correction method and device, storage medium and electronic equipment | |
CN110895659A (en) | Model training method, recognition method, device and computing equipment | |
CN113095082A (en) | Method, device, computer device and computer readable storage medium for text processing based on multitask model | |
US20240086637A1 (en) | Efficient hybrid text normalization | |
CN111783433A (en) | Text retrieval error correction method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
AD01 | Patent right deemed abandoned |
Effective date of abandoning: 20230616 |