CN111179937A - Method, apparatus and computer-readable storage medium for text processing - Google Patents
- Publication number
- CN111179937A (application CN201911349033.6A)
- Authority
- CN
- China
- Prior art keywords: text, Chinese, pinyin, string, sliding window
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G10—MUSICAL INSTRUMENTS; ACOUSTICS
- G10L—SPEECH ANALYSIS TECHNIQUES OR SPEECH SYNTHESIS; SPEECH RECOGNITION; SPEECH OR VOICE PROCESSING TECHNIQUES; SPEECH OR AUDIO CODING OR DECODING
- G10L15/00—Speech recognition
- G10L15/26—Speech to text systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/045—Combinations of networks
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/049—Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
Abstract
The invention provides a method, an apparatus, and a computer-readable storage medium for text processing. The method comprises the following steps: acquiring the text of a corpus; extracting the Arabic numeral strings in the text and converting the extracted Arabic numeral strings into Chinese numeral strings to form a Chinese text of the corpus; intercepting sliding window texts from the Chinese text using a sliding window and converting the middle character of each sliding window text into pinyin to form sliding window pinyin texts; and recognizing the sliding window pinyin texts based on a predictive recognition network to obtain a recognition result for the pinyin in each sliding window pinyin text.
Description
Technical Field
The present invention relates to the field of text processing, and more particularly, to a method for text processing, an apparatus implementing the method, and a computer-readable storage medium.
Background
With the development of computer hardware and speech recognition technology, speech recognition is increasingly applied in various fields. In the field of aviation, the effect of weather on the flight of an aircraft is of great importance. However, in weather speech recognition, factors such as nonstandard pronunciation may introduce deviations into the recognition result, so that the weather broadcast generated from it is not accurate enough, adversely affecting the flight judgment of the aircraft pilot. In other fields as well, inaccuracy in the speech recognition result text may adversely affect subsequent processing (e.g., translation of the result text). Further, in fields such as text editing, deviations such as wrongly written characters may appear in an edited text through the author's own carelessness, and such deviations can be hard to spot even on repeated checking.
Disclosure of Invention
In view of the above problems, a method for text processing, particularly a method for processing a result text of speech recognition, is proposed herein to correct an erroneous recognition result text due to pronunciation inaccuracy or the like in speech recognition.
According to one aspect of the invention, a method for text processing is provided. The method comprises the following steps: acquiring the text of a corpus; extracting the Arabic numeral strings in the text and converting the extracted Arabic numeral strings into Chinese numeral strings to form a Chinese text of the corpus; intercepting sliding window texts from the Chinese text using a sliding window and converting the middle character of each sliding window text into pinyin to form sliding window pinyin texts; and recognizing the sliding window pinyin texts based on a predictive recognition network to obtain a recognition result for the pinyin in each sliding window pinyin text.
According to another aspect of the present invention, there is provided an apparatus for text processing. The apparatus comprises: a memory having computer program code stored thereon; and a processor configured to execute the computer program code to perform the method as described above.
According to yet another aspect of the present invention, a computer-readable storage medium is provided. The computer readable storage medium has stored thereon a computer program code which, when executed, performs the method as described above.
With the scheme of the invention, the numbers in a text are converted into Chinese and then into pinyin, and the pinyin is recognized with the predictive recognition network to obtain the maximum likelihood prediction for each character in the text, thereby correcting errors that may occur in the text.
Drawings
FIG. 1 shows a flow diagram of a method for text processing according to an embodiment of the invention;
FIG. 2 shows a flowchart of steps for obtaining text of a corpus, according to one embodiment of the invention;
FIG. 3 shows a flowchart of steps for forming a Chinese text of a corpus in accordance with one embodiment of the present invention;
FIG. 4 shows a flowchart of steps for forming a sliding window pinyin text;
FIG. 5 shows a flowchart of steps for obtaining recognition results, according to one embodiment of the invention;
- FIG. 6 is a diagram illustrating a predictive recognition network for obtaining recognition results according to an embodiment of the invention; and
FIG. 7 shows a schematic block diagram of an example device that may be used to implement an embodiment of the invention.
Detailed Description
The embodiments of the present invention will be described in detail below with reference to the accompanying drawings in order to more clearly understand the objects, features and advantages of the present invention. It should be understood that the embodiments shown in the drawings are not intended to limit the scope of the present invention, but are merely intended to illustrate the spirit of the technical solution of the present invention.
In the following description, for the purposes of illustrating various inventive embodiments, certain specific details are set forth in order to provide a thorough understanding of the various inventive embodiments. One skilled in the relevant art will recognize, however, that the embodiments may be practiced without one or more of the specific details. In other instances, well-known devices, structures and techniques associated with this application may not be shown or described in detail to avoid unnecessarily obscuring the description of the embodiments.
Throughout the specification and claims, the word "comprise" and variations thereof, such as "comprises" and "comprising," are to be understood as an open, inclusive meaning, i.e., as being interpreted to mean "including, but not limited to," unless the context requires otherwise.
Reference throughout this specification to "one embodiment" or "some embodiments" means that a particular feature, structure, or characteristic described in connection with the embodiments is included in at least one embodiment. Thus, the appearances of the phrases "in one embodiment" or "in some embodiments" in various places throughout this specification are not necessarily all referring to the same embodiment. Furthermore, the particular features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
As used in the specification and the appended claims, the singular forms "a", "an", and "the" may include plural referents unless the context clearly dictates otherwise. It should be noted that the term "or" is generally employed in its sense including "and/or" unless the context clearly dictates otherwise.
FIG. 1 shows a flow diagram of a method 100 for text processing, according to an embodiment of the invention.
As shown in FIG. 1, the method 100 includes a step 110 in which the text of a corpus is obtained. The text may be a recognition result text generated by speech recognition, or text obtained in other ways (for example, an article written by an author that contains wrongly written characters); in either case the text may contain erroneous characters. The corpus refers to the original material containing the text; besides the text to be processed, it generally also includes formatting, symbols, and even picture content.
FIG. 2 shows a flowchart of step 110 for obtaining text of a corpus, according to an embodiment of the invention. As shown in FIG. 2, step 110 may include a sub-step 112 in which the original corpus associated with the item to be identified is obtained. Here, the items to be recognized define the range of the original corpus. For example, if the method 100 is used for post-processing of weather speech recognition results text, then the raw corpora relating to weather reports should be obtained in sub-step 112. These raw corpora may be automatically crawled from the internet using crawler technology or obtained in other ways, such as from a free or fee-based database.
Next, in sub-step 114, each raw corpus obtained in sub-step 112 is screened to obtain qualified raw corpora. Here, unqualified raw corpora are mainly those that do not include content related to the item to be recognized, are incomplete, or fail to meet other requirements (such as format requirements). The technique for screening the original corpora may be any screening technique known now or developed in the future, and is not described in detail here.
After the qualified raw corpora are obtained, in sub-step 116 they are filtered to remove the beginning, the end, specific characters, and the like, thereby generating the text of the corpus to be processed. As known to those skilled in the art, a raw corpus such as a weather report contains a certain amount of content irrelevant to the weather: the beginning of the report usually names the information source, the end usually carries a travel or dressing prompt, and the middle may contain specific characters irrelevant to the weather, such as item symbols, numbers, or typesetting marks like "#". Such extraneous content should be filtered out before the text is processed.
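As a minimal sketch of this filtering, assume a report whose first line names the source, whose last line is a travel prompt, and whose typesetting marks are "#" and "*"; all of these markers are illustrative assumptions, not fixed by the patent:

```python
import re

def clean_corpus(raw: str) -> str:
    """Strip report boilerplate and typesetting characters from a raw
    weather-report corpus, keeping only the text to be processed.
    The "Source:"/"Tip:" markers are invented for this sketch."""
    lines = raw.splitlines()
    # Drop an assumed source line at the beginning and a travel/dressing
    # prompt at the end, if present.
    if lines and lines[0].startswith("Source:"):
        lines = lines[1:]
    if lines and lines[-1].startswith("Tip:"):
        lines = lines[:-1]
    text = "\n".join(lines)
    # Remove typesetting symbols such as "#" and "*".
    return re.sub(r"[#*]", "", text)

print(clean_corpus("Source: demo\nBeijing sunny#\nTip: bring a coat"))
# prints "Beijing sunny"
```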
Returning next to FIG. 1, after obtaining the text of the corpus in step 110, the method 100 proceeds to step 120, where the Arabic numeral strings in the text are extracted and converted into Chinese numeral strings to form a Chinese text of the corpus.
FIG. 3 shows a flowchart of the step 120 of forming a Chinese text of a corpus according to one embodiment of the present invention. As shown in fig. 3, step 120 may include a sub-step 122 of traversing the text of the corpus obtained in step 110, and extracting a list of arabic numeral strings therein for each line of the text, respectively. There are various methods for extracting the arabic numeral string, for example, the arabic numeral string in the text can be extracted by using a regular method (e.g., a regular expression in Python).
Assume the text of the corpus to be processed has n lines (n is an integer greater than or equal to 1), denoted L_1, L_2, …, L_n. For each line L_i, a list res_i = [z_i1, z_i2, …, z_ik] of the Arabic numeral strings of that line is obtained, where res_i denotes the list of Arabic numeral strings of the ith line L_i, and z_i1, z_i2, …, z_ik denote the k Arabic numeral strings of the ith line L_i (k is an integer greater than or equal to 0, and the value of k may differ from line to line).
Here, Arabic numerals refer to the ten digits 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, and an Arabic numeral string refers to a numeral string composed of Arabic numerals and a decimal point ".", that is, what are commonly called integers and decimals; for clarity they are described herein as Arabic numeral strings. For example, an Arabic numeral string may take the form "123", "123.4", "123.45", etc.
Those skilled in the art will appreciate that lines of the text that contain only whitespace may also be deleted, prior to sub-step 122 or in step 110, to avoid wasting processing resources.
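Sub-step 122 can be sketched with a regular expression, as the text suggests; the pattern below matches the integer and decimal forms described above (the function name is illustrative):

```python
import re

# Matches an integer or a decimal such as "123", "123.4", "123.45".
NUM_RE = re.compile(r"\d+(?:\.\d+)?")

def extract_number_lists(text: str) -> list[list[str]]:
    """For each line L_i of the text, return res_i, the (possibly empty)
    list of Arabic numeral strings found in that line."""
    return [NUM_RE.findall(line) for line in text.splitlines()]

res = extract_number_lists("temperature 12.5 to 20\nwind force 3")
# res[0] is the list for line 1, res[1] the list for line 2
```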
Next, in sub-step 124, each Arabic numeral string z_ij (j = 1, 2, …, k) in each list res_i is converted into a Chinese numeral string Z_ij.
Specifically, as shown in FIG. 3, in one embodiment, sub-step 124 may further comprise a sub-step 1242, in which an Arabic numeral string z_ij is extracted from the list res_i, and a sub-step 1244, in which it is determined whether the Arabic numeral string z_ij includes a decimal point.
If the Arabic numeral string z_ij includes a decimal point (i.e., the determination of sub-step 1244 is yes), then in sub-step 1246 each digit to the left of the decimal point is converted into a Chinese digit with the unit of its place appended, yielding a left Chinese digit string; the decimal point is converted into the "dot" character; and each digit to the right of the decimal point is directly converted into a Chinese digit, yielding a right Chinese digit string. The left Chinese digit string, the dot character, and the right Chinese digit string are then concatenated in order to obtain the Chinese numeral string Z_ij corresponding to the Arabic numeral string z_ij.
For example, assume the Arabic numeral string extracted in sub-step 1242 is "123.45". Sub-step 1244 determines that it includes a decimal point ".". In sub-step 1246, each of the digits "1", "2", and "3" of "123" to the left of the decimal point is therefore converted into the Chinese digits "one", "two", and "three" respectively, with the unit of its place appended (hundreds, tens, and none for the units digit), giving the left Chinese digit string "one hundred twenty-three"; the decimal point "." is converted into "point"; and each of the digits "4" and "5" to the right of the decimal point is directly converted into the Chinese digits "four" and "five", giving the right Chinese digit string "four five". The left Chinese digit string "one hundred twenty-three", the word "point", and the right Chinese digit string "four five" are then concatenated in order to obtain the Chinese numeral string "one hundred twenty-three point four five" corresponding to the Arabic numeral string "123.45".
If, on the other hand, the Arabic numeral string z_ij does not include a decimal point (i.e., the determination in sub-step 1244 is no), then in sub-step 1248 each digit of the Arabic numeral string z_ij is converted into a Chinese digit with the unit of its place appended, yielding the Chinese numeral string Z_ij corresponding to the Arabic numeral string z_ij.
For example, assume the Arabic numeral string extracted in sub-step 1242 is "123". Sub-step 1244 determines that it does not include a decimal point ".". In sub-step 1248, each of the digits "1", "2", and "3" of "123" is therefore converted into the Chinese digits "one", "two", and "three" respectively, with the unit of its place appended (hundreds, tens, and none for the units digit), giving the Chinese numeral string "one hundred twenty-three" corresponding to "123"; this is the same procedure used for the left Chinese digit string in sub-step 1246.
After the Arabic numeral strings in the text have been converted into Chinese numeral strings, step 120 may further include (not shown) searching the text for the Arabic numeral strings in each list res_i and replacing each found Arabic numeral string with the corresponding Chinese numeral string. For example, if an Arabic numeral string found in the text is the same as the jth element z_ij of the list res_i, the jth element Z_ij can be found in the list RES_i of Chinese numeral strings corresponding to res_i, and the Arabic numeral string z_ij in the text is replaced with the corresponding Chinese numeral string Z_ij.
All the Arabic numeral strings in the text are replaced in turn with Chinese numeral strings in this way, and the resulting text is called the Chinese text.
Through the above process, the Arabic numerals in a text (e.g., the result text of speech recognition) are converted into corresponding Chinese numerals for subsequent further processing.
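The per-digit conversion and replacement of sub-steps 1242 to 1248 can be sketched as follows. This is a simplified illustration: it ignores special cases such as embedded zeros, supports at most five integer digits, and the function names are not from the patent:

```python
DIGITS = "零一二三四五六七八九"
# Place units for the simple per-digit scheme: ones, tens, hundreds, ...
UNITS = ["", "十", "百", "千", "万"]

def to_chinese(num: str) -> str:
    """Convert an Arabic numeral string to a Chinese numeral string in the
    per-digit fashion the patent describes: left of the decimal point each
    digit gets its place unit appended; right of it digits are converted
    directly; the decimal point becomes the dot word '点'."""
    left, dot, right = num.partition(".")
    n = len(left)
    left_cn = "".join(DIGITS[int(d)] + UNITS[n - 1 - i] for i, d in enumerate(left))
    right_cn = "".join(DIGITS[int(d)] for d in right)
    return left_cn + ("点" + right_cn if dot else "")

def replace_in_line(line: str, numbers: list[str]) -> str:
    """Replace each found Arabic numeral string in a line with its Chinese
    form (assumes the numbers are given in the order they appear)."""
    for z in numbers:
        line = line.replace(z, to_chinese(z), 1)
    return line
```

For example, `to_chinese("123.45")` yields "一百二十三点四五", matching the "one hundred twenty-three point four five" example above.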
Returning next to FIG. 1, the method 100 proceeds to step 130, where the sliding window text is truncated from the Chinese text generated in step 120 using a sliding window, and the middle characters of the sliding window text are converted to pinyin to form a sliding window pinyin text.
FIG. 4 shows a flowchart of the step 130 for forming a sliding window pinyin text, in accordance with one embodiment of the present invention. As shown in FIG. 4, step 130 may include a sub-step 132, in which the beginning and end of the Chinese text generated in step 120 are each padded with a specific character string to form a padded text. Assuming the Chinese text is N characters long (N is an integer greater than or equal to 1) and the added specific character string has length m (m is an integer greater than or equal to 1), the resulting padded text has size N + 2m. The specific character string consists of characters that do not occur in the Chinese text generated in step 120; it may be a repetition of a special character (e.g., a string of repeated "*" or "#" symbols) or a multi-character token with a specific meaning (e.g., the English word "start" or "stop"). Adding specific character strings at the beginning and end of the Chinese text serves, on the one hand, to mark where the text begins and ends and, on the other hand, to make the text suitable for the bidirectional prediction that follows (e.g., BiLSTM) without overflowing at either end. For example, if the sliding window to be employed is 11 characters wide, m may be set to 5, and a specific character string of length 5 is added at each of the beginning and end of the Chinese text.
Next, in sub-step 134, the padded text obtained in sub-step 132 is traversed with a sliding window of size 2m + 1 to intercept at least one sliding window text. Here, a sliding window text is a text fragment of length 2m + 1 cut out of the padded text. Since the length of the Chinese text is at least 1 (texts of length 0 are filtered out in step 110 or 120), at least one sliding window text is intercepted.
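Sub-steps 132 and 134, padding with a specific character string of length m and sliding a window of size 2m + 1, can be sketched as follows (the pad character "#" is one of the examples the text gives, and the function name is illustrative):

```python
def sliding_windows(chinese_text: str, m: int = 5, pad_char: str = "#") -> list[str]:
    """Pad the text with m copies of a special character at both ends,
    then slide a window of size 2*m + 1 over it, producing one window
    per original character; that character sits in the middle."""
    padded = pad_char * m + chinese_text + pad_char * m
    w = 2 * m + 1
    return [padded[i : i + w] for i in range(len(padded) - w + 1)]

wins = sliding_windows("天气晴朗", m=2, pad_char="#")
# wins[0] == "##天气晴": '天' is the middle character of the first window
```

Note that the number of windows equals the length of the original Chinese text, so every character gets a turn as the middle (to-be-predicted) character.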
Next, in sub-step 136, an embedding operation is performed on the first m characters and the last m characters of each sliding window text, and the middle character of the sliding window text is converted into tone-free pinyin (or into a similar sound, e.g., one produced by a recognition error due to incorrect pronunciation; both are collectively referred to herein as pinyin). The result of performing the embedding operation on the first m characters may be referred to as the embedded first character string, and the result of performing it on the last m characters as the embedded last character string. Here, an embedding operation is one that maps a large sparse vector into a low-dimensional space that preserves semantic relationships. For example, if a word is taken as the smallest unit of text, word embedding can be understood as a mapping that embeds a word from the text space into a numerical vector space by some method, usually with a reduction in dimension. There are many concrete ways to implement embedding, which are not detailed here.
Next, in sub-step 138, the embedded first character string, the tone-free pinyin, and the embedded last character string obtained in sub-step 136 are concatenated. The text resulting from the concatenation is referred to as the sliding window pinyin text of each sliding window text, such as the sliding window pinyin texts 602 and 604 shown in FIG. 6 (windows over a sentence meaning roughly "today's weather is sunny, suitable for going out", with the middle character replaced by its pinyin). Note that in a sliding window pinyin text only the middle character (i.e., the character to be predicted) is pinyin; the m characters on each side are Chinese characters.
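Sub-steps 136 and 138 then replace the middle character with its tone-free pinyin. The sketch below uses a toy character-to-pinyin table; a real implementation might obtain pinyin from a library such as pypinyin, and the embedding of the side characters is omitted here:

```python
# A toy character-to-pinyin table, invented for this sketch; a real system
# might use a pinyin library to produce tone-free pinyin for any character.
PINYIN = {"晴": "qing", "朗": "lang", "天": "tian", "气": "qi"}

def window_pinyin_text(window: str, m: int) -> str:
    """Replace the middle character of a sliding-window text with its
    tone-free pinyin, leaving the m characters on each side unchanged
    (those sides are what the embedding layer later consumes)."""
    mid = window[m]
    return window[:m] + PINYIN.get(mid, mid) + window[m + 1:]

print(window_pinyin_text("天气晴朗#", m=2))  # 天气qing朗#
```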
Returning to FIG. 1, after step 130, the method 100 continues to step 140, where the sliding window pinyin text is recognized based on the predictive recognition network to obtain a recognition result of the pinyins in the sliding window pinyin text.
Fig. 5 shows a flow chart of step 140 for obtaining a recognition result according to an embodiment of the invention. Fig. 6 shows a schematic structural diagram of a predictive recognition network 600 for obtaining recognition results according to an embodiment of the present invention. The predictive recognition network is described herein, by way of example, as a bidirectional long short-term memory network (BiLSTM) that predicts over the input text in both directions to produce predicted values for the character to be predicted. However, those skilled in the art will appreciate that the present invention is not so limited, and other bidirectional or unidirectional predictive recognition networks may equally be used, within the spirit of the present invention, to recognize the pinyin in a sliding window pinyin text.
As shown in fig. 5, step 140 may include a sub-step 142 in which, as shown in the embedding layer in fig. 6, the embedded vector length of the embedded first character string and the embedded last character string is obtained (for example, from the sliding window pinyin text 602 or 604) and the tone-free pinyin is one-hot encoded so that the vector length of the encoded pinyin equals the embedded vector length; in sub-step 144 the encoded pinyin is inserted between the embedded first character string and the embedded last character string to obtain the encoded text of the sliding window text.
One-hot encoding, also referred to herein as one-bit-effective encoding, employs an N-bit status register to encode N states; each state has its own independent register bit, and only one bit is active at any time. The specific implementation of one-hot encoding is well known to those skilled in the art and is not described here.
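A minimal illustration of the one-hot encoding described above, using an invented four-symbol pinyin vocabulary:

```python
def one_hot(index: int, n: int) -> list[int]:
    """N-state one-hot code: only the bit for the given state is 1."""
    vec = [0] * n
    vec[index] = 1
    return vec

# Encoding the pinyin "qing" against a toy 4-symbol vocabulary:
vocab = ["qing", "lang", "tian", "qi"]
code = one_hot(vocab.index("qing"), len(vocab))  # [1, 0, 0, 0]
```

In the patent, n would be the embedded vector length, so that the encoded pinyin can be inserted between the two embedded character strings.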
Next, in sub-step 146, the encoded text is input to the BiLSTM encoding layer shown in FIG. 6, whereby the encoded text of the sliding window text is recognized using the BiLSTM to obtain a plurality of prediction scores for the pinyin.
Next, in sub-step 148, the plurality of prediction scores are input to a Conditional Random Field (CRF) classification layer as shown in FIG. 6, which constrains the plurality of prediction scores and selects the label sequence with the highest prediction score as the recognition result for the pinyin to be predicted in the sliding window pinyin text.
In one embodiment, the label sequence with the highest prediction score may be selected using a scoring function, for example as follows:

y* = argmax_{y ∈ Y_X} s(X, y)

where X is the currently input pinyin, Y_X is the set of candidate label sequences, y is a candidate label sequence, s(X, y) is the score of the label sequence y for the currently input pinyin X as computed by the scoring function s(·, ·), and argmax(·) is the function that selects the argument with the maximum score.
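The argmax selection over candidate labels can be sketched as follows; the candidate characters and their scores are invented for illustration, and in the patent the scores would come from the BiLSTM and CRF layers rather than a hand-written table:

```python
def best_label(candidates: dict[str, float]) -> str:
    """Select the candidate label with the highest score, i.e.
    y* = argmax over Y_X of s(X, y)."""
    return max(candidates, key=candidates.get)

# Hypothetical scores for the pinyin "qing" in context:
scores = {"晴": 4.2, "情": 1.1, "清": 0.7}
print(best_label(scores))  # 晴
```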
As shown in fig. 6, through sub-steps 142 to 148, for the pinyin "qing" in the sliding window pinyin text 602, the CRF layer outputs its recognition result 606 "sunny", and for the pinyin "lang" in the sliding window pinyin text 604, the CRF layer outputs its recognition result 608 "lang".
By using the scheme of the invention, the Arabic numerals in a speech recognition result are converted into Chinese numerals and then into pinyin for renewed prediction and recognition, so that speech recognition errors caused by pronunciation and the like can be corrected. In addition, the data features input to the network are acquired through a sliding window, and predicting over both the preceding and following context can further confirm the speech recognition result or correct recognition errors (for example, errors caused by a speaker not distinguishing front and back nasal sounds can be corrected through the surrounding semantics).
FIG. 7 shows a schematic block diagram of an example device 200 that may be used to implement an embodiment of the invention. The device 200 may be, for example, a computer for text processing or a handheld mobile device or the like. As shown, device 200 may include one or more Central Processing Units (CPUs) 210 (only one shown schematically) that may perform various suitable actions and processes in accordance with computer program instructions stored in a Read Only Memory (ROM) 220 or loaded from a storage unit 280 into a Random Access Memory (RAM) 230. In the RAM 230, various programs and data required for the operation of the device 200 can also be stored. The CPU 210, ROM 220, and RAM 230 are connected to each other through a bus 240. An input/output (I/O) interface 250 is also connected to bus 240.
A number of components in device 200 are connected to I/O interface 250, including: an input unit 260 such as a keyboard, a mouse, etc.; an output unit 270 such as various types of displays, speakers, and the like; a storage unit 280 such as a magnetic disk, an optical disk, or the like; and a communication unit 290 such as a network card, modem, wireless communication transceiver, etc. The communication unit 290 allows the device 200 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The method 100 described above may be performed, for example, by the processing unit 210 of the apparatus 200. For example, in some embodiments, the method 100 may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 280. In some embodiments, part or all of the computer program may be loaded and/or installed onto the device 200 via the ROM 220 and/or the communication unit 290. When the computer program is loaded into RAM 230 and executed by CPU 210, one or more of the operations of method 100 described above may be performed. Further, the communication unit 290 may support wired or wireless communication functions.
The method 100 and the apparatus 200 for text processing according to the present invention are described above with reference to the accompanying drawings. However, it will be appreciated by those skilled in the art that the performance of the steps of the method 100 is not limited to the order shown in the figures and described above, but may be performed in any other reasonable order. Further, the device 200 also need not include all of the components shown in fig. 7, it may include only some of the components necessary to perform the functions described in the present invention, and the manner in which these components are connected is not limited to the form shown in the drawings. For example, in the case where the device 200 is a portable device such as a cellular phone, the device 200 may have a different structure compared to that in fig. 7.
The present invention may be methods, apparatus, systems and/or computer program products. The computer program product may include a computer-readable storage medium having computer-readable program instructions embodied therein for carrying out aspects of the present invention.
The computer readable storage medium may be a tangible device that can hold and store the instructions for use by the instruction execution device. The computer readable storage medium may be, for example, but not limited to, an electronic memory device, a magnetic memory device, an optical memory device, an electromagnetic memory device, a semiconductor memory device, or any suitable combination of the foregoing. More specific examples (a non-exhaustive list) of the computer readable storage medium would include the following: a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), a Static Random Access Memory (SRAM), a portable compact disc read-only memory (CD-ROM), a Digital Versatile Disc (DVD), a memory stick, a floppy disk, a mechanical coding device, such as punch cards or in-groove projection structures having instructions stored thereon, and any suitable combination of the foregoing. Computer-readable storage media as used herein is not to be construed as transitory signals per se, such as radio waves or other freely propagating electromagnetic waves, electromagnetic waves propagating through a waveguide or other transmission medium (e.g., optical pulses through a fiber optic cable), or electrical signals transmitted through electrical wires.
The computer-readable program instructions described herein may be downloaded from a computer-readable storage medium to a respective computing/processing device, or to an external computer or external storage device via a network, such as the internet, a local area network, a wide area network, and/or a wireless network. The network may include copper transmission cables, fiber optic transmission, wireless transmission, routers, firewalls, switches, gateway computers and/or edge servers. The network adapter card or network interface in each computing/processing device receives computer-readable program instructions from the network and forwards the computer-readable program instructions for storage in a computer-readable storage medium in the respective computing/processing device.
The computer program instructions for carrying out operations of the present invention may be assembler instructions, Instruction Set Architecture (ISA) instructions, machine instructions, machine-dependent instructions, microcode, firmware instructions, state-setting data, or source or object code written in any combination of one or more programming languages, including an object-oriented programming language such as Smalltalk or C++, and conventional procedural programming languages, such as the "C" programming language or similar programming languages. The computer-readable program instructions may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or server. Where a remote computer is involved, it may be connected to the user's computer through any type of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or the connection may be made to an external computer (for example, through the Internet using an Internet service provider). In some embodiments, aspects of the present invention are implemented by personalizing an electronic circuit, such as a programmable logic circuit, a Field Programmable Gate Array (FPGA), or a Programmable Logic Array (PLA), with state information of computer-readable program instructions, the electronic circuit executing the computer-readable program instructions.
Aspects of the present invention are described herein with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each block of the flowchart illustrations and/or block diagrams, and combinations of blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer-readable program instructions.
These computer-readable program instructions may be provided to a processing unit of a general purpose computer, special purpose computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processing unit of the computer or other programmable data processing apparatus, create means for implementing the functions/acts specified in the flowchart and/or block diagram block or blocks. These computer-readable program instructions may also be stored in a computer-readable storage medium that can direct a computer, programmable data processing apparatus, and/or other devices to function in a particular manner, such that the computer-readable medium storing the instructions comprises an article of manufacture including instructions which implement the function/act specified in the flowchart and/or block diagram block or blocks.
The computer readable program instructions may also be loaded onto a computer, other programmable data processing apparatus, or other devices to cause a series of operational steps to be performed on the computer, other programmable apparatus or other devices to produce a computer implemented process such that the instructions which execute on the computer, other programmable apparatus or other devices implement the functions/acts specified in the flowchart and/or block diagram block or blocks.
The flowchart and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to various embodiments of the present invention. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of instructions, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
Having described embodiments of the present invention, the foregoing description is intended to be exemplary rather than exhaustive, and is not limited to the embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the described embodiments. The terms used herein were chosen to best explain the principles of the embodiments, their practical application, or technical improvements over technologies in the marketplace, or to enable others of ordinary skill in the art to understand the embodiments disclosed herein.
Claims (10)
1. A method for text processing, comprising:
acquiring texts of the corpus;
extracting Arabic numeral strings in the text, and converting the extracted Arabic numeral strings into Chinese numeral strings to form a Chinese text of the corpus;
intercepting a sliding window text from the Chinese text by using a sliding window, and converting the middle character of the sliding window text into pinyin to form a sliding window pinyin text; and
recognizing the sliding window pinyin text based on a predictive recognition network to obtain a recognition result of the pinyin in the sliding window pinyin text.
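As a rough illustration of the pipeline in claim 1, the sketch below converts Arabic digits to Chinese numerals, pads the text, and replaces the middle character of each sliding window with toneless pinyin. The `PINYIN` map and the `#` padding character are toy stand-ins, not part of the patent; a real system would use a pinyin library (e.g. pypinyin) and a trained recognizer for the final step.

```python
# Toy pinyin mapping for illustration only; a real system would consult a
# full Chinese-character-to-pinyin dictionary or library.
PINYIN = {"一": "yi", "二": "er", "三": "san"}

def to_chinese_digits(text: str) -> str:
    """Replace each Arabic digit with its Chinese numeral (simplified sketch)."""
    digits = dict(zip("0123456789", "零一二三四五六七八九"))
    return "".join(digits.get(ch, ch) for ch in text)

def sliding_window_pinyin(chinese_text: str, m: int = 1):
    """Pad both ends with a filler character, slide a window of size 2m+1,
    and replace the middle character of each window with toneless pinyin."""
    pad = "#" * m  # '#' is assumed not to occur in the Chinese text
    padded = pad + chinese_text + pad
    windows = []
    for i in range(len(chinese_text)):
        win = padded[i:i + 2 * m + 1]
        mid = win[m]
        windows.append(win[:m] + PINYIN.get(mid, mid) + win[m + 1:])
    return windows

print(sliding_window_pinyin(to_chinese_digits("12"), m=1))  # ['#yi二', '一er#']
```

Each resulting sliding-window pinyin text would then be fed to the predictive recognition network of claim 1.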
2. The method of claim 1, wherein obtaining text of a corpus comprises:
acquiring an original corpus related to a task to be recognized;
screening the original corpus to obtain a qualified original corpus; and
filtering the qualified original corpus to remove characters at the beginning and end of the corpus as well as specific characters, so as to generate the text of the corpus.
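The screening and filtering of claim 2 can be sketched as follows. The set of "specific characters" is a hypothetical assumption here (the patent does not enumerate them), and dropping empty lines stands in for the "qualified corpus" screening step.

```python
import re

def clean_corpus_line(line: str, specific_chars: str = "、@#") -> str:
    """Strip whitespace at the beginning/end of a corpus line and remove
    specific characters (the character set is a hypothetical example)."""
    line = line.strip()
    return re.sub(f"[{re.escape(specific_chars)}]", "", line)

def filter_corpus(lines):
    """Keep only non-empty lines after cleaning, a stand-in for the
    'screening for qualified original corpus' step of claim 2."""
    cleaned = (clean_corpus_line(l) for l in lines)
    return [l for l in cleaned if l]

print(filter_corpus(["  你好@世界  ", "   ", "测试#文本"]))  # ['你好世界', '测试文本']
```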
3. The method of claim 1, wherein forming the chinese text of the corpus comprises:
traversing the text of the corpus, and respectively extracting a list of Arabic number strings in each line of the text; and
converting each Arabic numeral string in the list into a Chinese numeral string, respectively.
4. The method of claim 3, wherein separately converting each Arabic numeral string in the list to a Chinese numeral string comprises:
extracting the Arabic numeral strings in the list;
determining whether the Arabic number string includes a decimal point;
if the Arabic numeral string includes a decimal point, converting each digit to the left of the decimal point into a Chinese digit and appending the unit of its digit position to obtain a left Chinese numeral string, converting the decimal point into the character "dot" (点), directly converting each digit to the right of the decimal point into a Chinese digit to obtain a right Chinese numeral string, and splicing the left Chinese numeral string, the "dot" character and the right Chinese numeral string in order to serve as the Chinese numeral string of the Arabic numeral string; and
if the Arabic numeral string does not include a decimal point, converting each digit of the Arabic numeral string into a Chinese digit and appending the unit of its digit position to obtain the Chinese numeral string of the Arabic numeral string.
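A literal reading of claim 4 can be sketched as below: digits left of the decimal point each receive their positional unit, the point becomes 点, and digits to the right are read out one by one. This is a minimal sketch of the claimed per-digit rule, not a full idiomatic Chinese number formatter (e.g. it does not elide zeros).

```python
DIGITS = dict(zip("0123456789", "零一二三四五六七八九"))
UNITS = ["", "十", "百", "千", "万"]  # positional units for the integer part

def int_part_to_chinese(s: str) -> str:
    """Convert each digit left of the decimal point into a Chinese digit and
    append the unit of its digit position, per claim 4 (no zero-elision)."""
    out = []
    for pos, ch in enumerate(s):
        unit = UNITS[len(s) - 1 - pos]
        out.append(DIGITS[ch] + unit)
    return "".join(out)

def arabic_to_chinese(s: str) -> str:
    """Convert an Arabic numeral string, splitting on a decimal point if present."""
    if "." in s:
        left, right = s.split(".")
        # digits right of the point are converted directly, without units
        return int_part_to_chinese(left) + "点" + "".join(DIGITS[d] for d in right)
    return int_part_to_chinese(s)

print(arabic_to_chinese("12.5"))  # 一十二点五
```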
5. The method of claim 3, wherein separately converting each Arabic numeral string in the list to a Chinese numeral string further comprises:
searching the text for Arabic numeral strings identical to those in the list, and replacing each such Arabic numeral string with its corresponding Chinese numeral string.
6. The method of claim 1, wherein forming a sliding window pinyin text comprises:
padding the beginning and end of the Chinese text with a specific character string to form a padded text, wherein the specific character string comprises characters different from the characters in the Chinese text and has a length of m, m being an integer greater than or equal to 1;
traversing the padded text with a sliding window of size 2m+1 to intercept at least one sliding window text;
performing an embedding operation on the first m characters and the last m characters of each sliding window text, respectively, to obtain an embedded first character string and an embedded tail character string, and converting the middle character of the sliding window text into toneless pinyin; and
splicing the embedded first character string, the toneless pinyin and the embedded tail character string to obtain the sliding window pinyin text of each sliding window text.
7. The method of claim 6, wherein obtaining the recognition result of the pinyin in the sliding-window pinyin text comprises:
acquiring the embedding vector length of the embedded first character string and the embedded tail character string, and one-hot encoding the toneless pinyin so that the vector length of the encoded pinyin equals the embedding vector length;
inserting the encoded pinyin between the embedded first character string and the embedded tail character string to obtain an encoded text of the sliding window text;
recognizing the encoded text of the sliding window text with a bidirectional long short-term memory network (BiLSTM) to obtain a plurality of prediction samples of the pinyin; and
constraining the plurality of prediction samples with a conditional random field (CRF) to select the tag sequence with the highest prediction score as the recognition result of the pinyin.
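A minimal sketch of the encoding step in claims 6 and 7: the m context characters on each side of the window are embedded, the toneless pinyin is one-hot encoded to the same vector length, and the vectors are spliced into the sequence that a BiLSTM-CRF recognizer would consume. The pinyin vocabulary and the deterministic toy embedding below are illustrative assumptions, not trained values from the patent.

```python
import random

PINYIN_VOCAB = ["yi", "er", "san", "shi", "dian"]  # toy toneless-pinyin vocabulary
EMB_DIM = len(PINYIN_VOCAB)  # claim 7: one-hot length equals embedding length

def embed(ch: str):
    """Deterministic toy character embedding (stand-in for a trained table):
    seeds a PRNG with the character's code point for reproducibility."""
    rng = random.Random(ord(ch))
    return [rng.uniform(-1.0, 1.0) for _ in range(EMB_DIM)]

def one_hot(pinyin: str):
    """One-hot encode a toneless pinyin over the toy vocabulary."""
    vec = [0.0] * EMB_DIM
    vec[PINYIN_VOCAB.index(pinyin)] = 1.0
    return vec

def encode_window(head: str, pinyin: str, tail: str):
    """Splice embedded head characters, the one-hot pinyin, and embedded tail
    characters into the input sequence for the BiLSTM-CRF recognizer."""
    return [embed(c) for c in head] + [one_hot(pinyin)] + [embed(c) for c in tail]

encoded = encode_window("一", "er", "三")  # window '一er三' with m = 1
print(len(encoded), len(encoded[1]))       # 3 vectors, each of length EMB_DIM
```

The BiLSTM would emit per-position tag scores over such sequences, and the CRF layer would pick the highest-scoring tag sequence as the recognition result.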
8. The method of claim 1, wherein the text is a result text of speech recognition.
9. An apparatus for text processing, comprising:
a memory having computer program code stored thereon; and
a processor configured to execute the computer program code to perform the method of any of claims 1 to 8.
10. A computer readable storage medium having stored thereon computer program code which, when executed, performs the method of any of claims 1 to 8.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911349033.6A CN111179937A (en) | 2019-12-24 | 2019-12-24 | Method, apparatus and computer-readable storage medium for text processing |
Publications (1)
Publication Number | Publication Date |
---|---|
CN111179937A true CN111179937A (en) | 2020-05-19 |
Family
ID=70652100
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911349033.6A Pending CN111179937A (en) | 2019-12-24 | 2019-12-24 | Method, apparatus and computer-readable storage medium for text processing |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN111179937A (en) |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1212404A (en) * | 1997-09-19 | 1999-03-31 | 国际商业机器公司 | Method for identifying character/numeric string in Chinese speech recognition system |
US9672827B1 (en) * | 2013-02-11 | 2017-06-06 | Mindmeld, Inc. | Real-time conversation model generation |
CN108536669A (en) * | 2018-02-27 | 2018-09-14 | 北京达佳互联信息技术有限公司 | Literal information processing method, device and terminal |
CN109147767A (en) * | 2018-08-16 | 2019-01-04 | 平安科技(深圳)有限公司 | Digit recognition method, device, computer equipment and storage medium in voice |
CN109461459A (en) * | 2018-12-07 | 2019-03-12 | 平安科技(深圳)有限公司 | Speech assessment method, apparatus, computer equipment and storage medium |
CN109801630A (en) * | 2018-12-12 | 2019-05-24 | 平安科技(深圳)有限公司 | Digital conversion method, device, computer equipment and the storage medium of speech recognition |
CN109815476A (en) * | 2018-12-03 | 2019-05-28 | 国网浙江省电力有限公司杭州供电公司 | A kind of term vector representation method based on Chinese morpheme and phonetic joint statistics |
CN109871535A (en) * | 2019-01-16 | 2019-06-11 | 四川大学 | A kind of French name entity recognition method based on deep neural network |
CN109977398A (en) * | 2019-02-21 | 2019-07-05 | 江苏苏宁银行股份有限公司 | A kind of speech recognition text error correction method of specific area |
CN110232923A (en) * | 2019-05-09 | 2019-09-13 | 青岛海信电器股份有限公司 | A kind of phonetic control command generation method, device and electronic equipment |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112037762A (en) * | 2020-09-10 | 2020-12-04 | 中航华东光电(上海)有限公司 | Chinese-English mixed speech recognition method |
CN113723082A (en) * | 2021-08-30 | 2021-11-30 | 支付宝(杭州)信息技术有限公司 | Method and device for detecting Chinese pinyin from text |
CN113723082B (en) * | 2021-08-30 | 2024-08-02 | 支付宝(杭州)信息技术有限公司 | Method and device for detecting Chinese pinyin from text |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107220235B (en) | Speech recognition error correction method and device based on artificial intelligence and storage medium | |
US10372821B2 (en) | Identification of reading order text segments with a probabilistic language model | |
CN113807098B (en) | Model training method and device, electronic equipment and storage medium | |
CN112036162B (en) | Text error correction adaptation method and device, electronic equipment and storage medium | |
CN110717331A (en) | Neural network-based Chinese named entity recognition method, device, equipment and storage medium | |
CN112395385B (en) | Text generation method and device based on artificial intelligence, computer equipment and medium | |
CN111160004B (en) | Method and device for establishing sentence-breaking model | |
CN111079432B (en) | Text detection method and device, electronic equipment and storage medium | |
CN111753532B (en) | Error correction method and device for Western text, electronic equipment and storage medium | |
JP2023012522A (en) | Method and device for training document reading model based on cross modal information | |
CN113673228B (en) | Text error correction method, apparatus, computer storage medium and computer program product | |
CN116757184B (en) | Vietnam voice recognition text error correction method and system integrating pronunciation characteristics | |
CN113918031A (en) | System and method for Chinese punctuation recovery using sub-character information | |
CN113076720A (en) | Long text segmentation method and device, storage medium and electronic device | |
CN111916063A (en) | Sequencing method, training method, system and storage medium based on BPE (Business Process Engineer) coding | |
CN111179937A (en) | Method, apparatus and computer-readable storage medium for text processing | |
CN113743101A (en) | Text error correction method and device, electronic equipment and computer storage medium | |
CN113836308B (en) | Network big data long text multi-label classification method, system, device and medium | |
CN114218940B (en) | Text information processing and model training method, device, equipment and storage medium | |
CN115730585A (en) | Text error correction and model training method and device, storage medium and equipment | |
CN113255329A (en) | English text spelling error correction method and device, storage medium and electronic equipment | |
CN110895659A (en) | Model training method, recognition method, device and computing equipment | |
CN113095082A (en) | Method, device, computer device and computer readable storage medium for text processing based on multitask model | |
US20240086637A1 (en) | Efficient hybrid text normalization | |
CN111783433A (en) | Text retrieval error correction method and device |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
AD01 | Patent right deemed abandoned |
Effective date of abandoning: 20230616 |