CN113033200A - Data processing method, text recognition model generation method and text recognition method - Google Patents


Info

Publication number
CN113033200A
CN113033200A (application CN202110581037.8A; granted as CN113033200B)
Authority
CN
China
Prior art keywords
code
word segmentation
codes
participle
text recognition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110581037.8A
Other languages
Chinese (zh)
Other versions
CN113033200B (en)
Inventor
宁亚光 (Ning Yaguang)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Century TAL Education Technology Co Ltd
Original Assignee
Beijing Century TAL Education Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Century TAL Education Technology Co Ltd filed Critical Beijing Century TAL Education Technology Co Ltd
Priority to CN202110581037.8A priority Critical patent/CN113033200B/en
Publication of CN113033200A publication Critical patent/CN113033200A/en
Application granted granted Critical
Publication of CN113033200B publication Critical patent/CN113033200B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35Clustering; Classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/126Character encoding
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Data Mining & Analysis (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biophysics (AREA)
  • Biomedical Technology (AREA)
  • Evolutionary Computation (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Databases & Information Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Character Discrimination (AREA)

Abstract

The application provides a data processing method, a text recognition model generation method, and a text recognition method. The data processing method comprises the following steps: performing word segmentation on a text to be processed to obtain a word segmentation result; encoding the word segments in the word segmentation result to obtain codes of the word segmentation result, wherein the codes of the word segmentation result comprise at least two types of codes, each type of code is the code of a word segment with respect to one category of characters, and at least one category of characters belongs to mathematical characters; and determining a training sample according to the codes of the word segmentation result. A model trained with the training samples obtained by the application has improved capability on mathematics-related natural language processing tasks.

Description

Data processing method, text recognition model generation method and text recognition method
Technical Field
The present application relates to the field of data processing, and in particular, to a data processing method, a text recognition model generation method, and a text recognition method.
Background
Since the main application scenarios of current natural language processing are news, reading, translation, and the like, the encoding and dictionary lookup in the pre-training stage focus on understanding the order and semantics of natural language. In the scenario of mathematical problems, however, a problem not only contains many natural-language sentences but is also interspersed with many mathematical symbols and numbers. The applicant finds that existing models trained purely on natural language may not support most mathematical symbols, or give no special consideration to mathematical characters, so models trained in this way do not perform well in application scenarios related to mathematical problems.
Disclosure of Invention
The embodiments of the application provide a data processing method, a text recognition model generation method, and a text recognition method to solve the problems in the related art. The technical scheme is as follows:
in a first aspect, an embodiment of the present application provides a data processing method, including:
performing word segmentation on the text to be processed to obtain a word segmentation result;
encoding the participles in the participle result to obtain codes of the participle result, wherein the codes of the participle result comprise at least two types of codes, each type of code in the at least two types of codes is the code of the participle about each type of characters, and at least one type of characters in each type of characters belong to mathematical characters;
and determining a training sample according to the coding of the word segmentation result.
In a second aspect, an embodiment of the present application provides a method for generating a text recognition model, including:
acquiring training data, wherein the training data comprises training samples and labels of the training samples, and the training samples comprise the training samples determined by the data processing method of the first aspect;
and training a preset neural network according to the training data, and obtaining a text recognition model after the training is finished, wherein the text recognition model can recognize text containing mathematical characters.
In a third aspect, an embodiment of the present application provides a text recognition method, including:
performing word segmentation processing on a text to be recognized to obtain a word segmentation result;
encoding the participles in the participle result to obtain codes of the participle result, wherein the codes of the participle result comprise at least two types of codes, each type of code in the at least two types of codes is the code of the participle about each type of characters, and at least one type of characters in each type of characters belong to mathematical characters;
and inputting the codes of the word segmentation result into a text recognition model to obtain a text recognition result, wherein the text recognition model is generated by using the generation method of the text recognition model in the second aspect.
In a fourth aspect, an embodiment of the present application provides a data processing apparatus, including:
the first word segmentation module is used for carrying out word segmentation on the text to be processed to obtain a word segmentation result;
the first coding module is used for coding the participles in the participle result to obtain codes of the participle result, wherein the codes of the participle result comprise at least two types of codes, each type of code in the at least two types of codes is a code of the participle about each type of character, and at least one type of character in each type of character belongs to a mathematical character;
and the training sample determining module is used for determining the training sample according to the code of the word segmentation result.
In a fifth aspect, an embodiment of the present application provides an apparatus for generating a text recognition model, including:
a training data acquisition module, configured to acquire training data, where the training data includes training samples and labels of the training samples, where the training samples include the training samples determined by the data processing apparatus of the fourth aspect;
and the training module is used for training a preset neural network according to the training data, obtaining a text recognition model after the training is finished, and recognizing the text containing the mathematical characters by the text recognition model.
In a sixth aspect, an embodiment of the present application provides a text recognition apparatus, including:
the third word segmentation module is used for carrying out word segmentation processing on the text to be recognized to obtain word segmentation results;
the second coding module is used for coding the participles in the participle result to obtain codes of the participle result, wherein the codes of the participle result comprise at least two types of codes, each type of code in the at least two types of codes is a code of the participle about each type of character, and at least one type of character in each type of character belongs to a mathematical character;
and the recognition module is used for inputting the codes of the word segmentation results into a text recognition model to obtain text recognition results, wherein the text recognition model is obtained by using the generation device of the text recognition model in the fifth aspect.
In a seventh aspect, an embodiment of the present application provides an electronic device, including: a memory and a processor. Wherein the memory and the processor are in communication with each other via an internal connection path, the memory is configured to store instructions, the processor is configured to execute the instructions stored by the memory, and the processor is configured to perform the method of any of the above aspects when the processor executes the instructions stored by the memory.
In an eighth aspect, embodiments of the present application provide a computer-readable storage medium, which stores a computer program, and when the computer program runs on a computer, the method in any one of the above-mentioned aspects is executed.
The advantages or beneficial effects of the above technical solution at least include: mathematical characters receive special consideration, so a model trained with the training samples has improved capability on mathematics-related natural language processing tasks.
The foregoing summary is provided for the purpose of description only and is not intended to be limiting in any way. In addition to the illustrative aspects, embodiments, and features described above, further aspects, embodiments, and features of the present application will be readily apparent by reference to the drawings and following detailed description.
Drawings
In the drawings, like reference numerals refer to the same or similar parts or elements throughout the several views unless otherwise specified. The figures are not necessarily to scale. It is appreciated that these drawings depict only some embodiments in accordance with the disclosure and are therefore not to be considered limiting of its scope.
FIG. 1 is a flow chart of a data processing method according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of generating a text recognition model according to an embodiment of the present application;
fig. 3 is a flowchart of a specific example provided by a method for generating a text recognition model according to an embodiment of the present application;
FIG. 4 is a flow chart of a text recognition method according to an embodiment of the present application;
FIG. 5 is a block diagram of a data processing apparatus according to an embodiment of the present application;
fig. 6 is a block diagram of a device for generating a text recognition model according to an embodiment of the present application;
FIG. 7 is a block diagram of a text recognition apparatus according to an embodiment of the present application;
fig. 8 is a block diagram of an electronic device according to an embodiment of the present application.
Detailed Description
In the following, only certain exemplary embodiments are briefly described. As those skilled in the art will recognize, the described embodiments may be modified in various different ways, all without departing from the spirit or scope of the present application. Accordingly, the drawings and description are to be regarded as illustrative in nature, and not as restrictive.
Fig. 1 shows a flow chart of a data processing method according to an embodiment of the present application. As shown in fig. 1, the data processing method may include:
s101, performing word segmentation on the text to be processed to obtain word segmentation results.
S102, the participles in the participle result are coded to obtain codes of the participle result, wherein the codes of the participle result comprise at least two types of codes, each type of code in the at least two types of codes is a code of the participle about each type of characters, and at least one type of characters in each type of characters belong to mathematical characters.
And S103, determining a training sample according to the coding of the word segmentation result.
The text to be processed can be obtained from a corpus prepared in advance. The corpus includes mathematics-related text, such as mathematical problems and mathematical papers, and may also include plain text that is not related to mathematics. The text to be processed may be a sentence with a mathematical expression, for example "calculate 33 × 3.3".
The mathematical characters may be numbers, where the numbers include the Arabic digits 0 to 9. The mathematical characters may also be numerical values, each represented by one or more Arabic digits. It should be noted that numbers and numerical values mainly correspond to different word segmentation modes. For example, if "12 × 3.4" is segmented into "12", "×" and "3.4", then "12" and "3.4" are numerical values; if it is segmented into "1", "2", "×", "3", ".", "4", then "1", "2", "3", "4" are numbers. Mathematical characters may also be mathematical symbols, including but not limited to constant symbols (e.g., the circumference ratio "π"), arithmetic symbols (e.g., plus "+", minus "-"), relational symbols (e.g., equals "="), grouping symbols (e.g., parentheses "()"), and the like. In addition, the text to be processed may also include language characters such as Chinese characters and English.
In step S101, the text to be processed may be input into a word segmenter, and the word segmenter performs word segmentation on the text to obtain a word segmentation result. The segmenter may be trained on the pre-prepared corpus so that its vocabulary contains all characters or words present in the corpus. In addition, to expand the vocabulary size, different characters and words from other corpora may be merged to obtain several vocabularies, each corresponding to one category of characters, such as a Chinese character vocabulary V1, a digit vocabulary V2, and a mathematical symbol vocabulary V3.
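For illustration only, the following Python sketch shows one way such per-category vocabularies could be assembled from a corpus; the function name, the symbol inventory, and the character-class tests are assumptions, not details given in the application.

def build_vocabularies(corpus_sentences, extra_corpora=()):
    """Collect a Chinese-character vocabulary V1, a digit vocabulary V2,
    and a mathematical-symbol vocabulary V3 (assumed structure)."""
    math_symbols = set("+-*/=()<>×÷π")  # assumed symbol inventory
    v1, v2, v3 = set(), set(), set()
    sentences = list(corpus_sentences)
    for corpus in extra_corpora:        # merge other corpora to expand the vocabularies
        sentences.extend(corpus)
    for sentence in sentences:
        for ch in sentence:
            if ch.isdigit():
                v2.add(ch)              # Arabic digits 0-9
            elif ch in math_symbols:
                v3.add(ch)              # mathematical symbols
            elif '\u4e00' <= ch <= '\u9fff':
                v1.add(ch)              # Chinese characters
    return sorted(v1), sorted(v2), sorted(v3)  # stable orderings for one-hot indices

V1, V2, V3 = build_vocabularies(["计算 37.18 × 341.9"])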
In step S102, the code of each word segment includes at least two types of codes, one type of code corresponds to one category of characters, and the characters corresponding to at least one type of code belong to mathematical characters. For example, the code of a word segment may be the five-part code (e_c, e_s, e_sv, e_n, e_nv), where e_c corresponds to Chinese characters, (e_s, e_sv) corresponds to mathematical symbols, and (e_n, e_nv) corresponds to numbers. Specifically, e_c is one-hot encoded according to the Chinese character vocabulary V1, e_s indicates whether the word segment is a mathematical symbol, e_sv is one-hot encoded according to the mathematical symbol vocabulary V3, e_n indicates whether the word segment represents a numerical value, and e_nv is the specific numerical value of the word segment. The code of a word segment may also consist of a subset of these parts, and the like.
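As a minimal sketch of this five-part code, assuming the field layout just described (the helper names and the way numerical values are detected are illustrative, not prescribed by the application):

def one_hot(index, size):
    vec = [0.0] * size
    if 0 <= index < size:
        vec[index] = 1.0
    return vec  # all zeros when the segment is not in the vocabulary

def encode_segment(seg, v1, v3):
    """Concatenate (e_c, e_s, e_sv, e_n, e_nv) for one word segment."""
    is_symbol = seg in v3
    is_number = seg.replace('.', '', 1).isdigit()
    e_c = one_hot(v1.index(seg) if seg in v1 else -1, len(v1))   # Chinese-character one-hot
    e_s = [1.0 if is_symbol else 0.0]                            # mathematical-symbol flag
    e_sv = one_hot(v3.index(seg) if is_symbol else -1, len(v3))  # symbol one-hot
    e_n = [1.0 if is_number else 0.0]                            # numerical-value flag
    e_nv = [float(seg) if is_number else 0.0]                    # the value itself
    return e_c + e_s + e_sv + e_n + e_nv

For example, encode_segment("37.18", V1, V3) sets e_n to 1.0 and e_nv to 37.18, while encode_segment("×", V1, V3) sets e_s to 1.0 and the one-hot position of "×" in V3.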
In step S103, the training samples may be used to train a preset neural network to obtain a text recognition model.
The data processing method provided by the embodiments of the application adopts at least two types of codes for the training sample, one type of which concerns mathematical characters, so mathematical characters receive special consideration, and a model trained with these training samples has improved capability on mathematics-related natural language processing tasks. Thus, in mathematical scenarios, upper-layer applications such as automatic correction of mathematical problems and intelligent problem solving are better supported.
In one embodiment, before step S102, the method further includes: in the case where a numerical value exists in the word segmentation result, re-segmenting the numerical value into single digits. In step S102, encoding the word segments in the word segmentation result includes: encoding the word segments in the re-segmented word segmentation result.
For example, the sentence "12 × 3.4" is segmented to obtain the word segmentation result "12", "×", "3.4"; re-segmentation yields "1", "2", "×", "3", ".", "4"; then "1", "2", "×", "3", ".", "4" obtained by the re-segmentation are encoded.
In this way, a numerical value is split into single Arabic digits [0-9] before encoding, so the one-hot codes corresponding to the single digits [0-9] can be used, which makes the encoding more standardized.
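A sketch of this re-segmentation step, under the assumption that a numerical value is any segment of digits with an optional decimal point:

def resegment_numbers(segments):
    out = []
    for seg in segments:
        if seg.replace('.', '', 1).isdigit():  # the segment is a numerical value
            out.extend(seg)                    # split it into single characters
        else:
            out.append(seg)
    return out

# "12 × 3.4" segmented as ["12", "×", "3.4"] becomes:
assert resegment_numbers(["12", "×", "3.4"]) == ["1", "2", "×", "3", ".", "4"]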
In one embodiment, the mathematical characters include at least one of numbers, numerical values, and mathematical symbols. One of the at least two types of codes is the code of a word segment with respect to a number or a numerical value, and/or one of the at least two types of codes is the code of a word segment with respect to a mathematical symbol. Through these two types of codes, numbers or numerical values and mathematical symbols receive special consideration, so a model trained with the training samples can better process numbers or numerical values and mathematical symbols, which improves the model's processing capability in application scenarios related to mathematical problems.
For example, the code of a word segmentation result may be a combination of the code with respect to numbers or numerical values and the code with respect to mathematical symbols; a combination of the code with respect to numbers, numerical values, or mathematical symbols and the codes of other categories of characters; or a combination of all three, and the like.
Here, the codes of other categories of characters may be the codes of a word segment with respect to a third character, and the third character may be a Chinese character and/or an English character, etc. Specifically, the process of determining this type of code for a word segmentation result is: acquire a preset vocabulary comprising a plurality of word segments belonging to the third character and the preset codes (such as one-hot codes) corresponding to those word segments, find the preset code corresponding to the word segmentation result in the preset vocabulary, and thereby obtain the code of the word segmentation result with respect to the third character.
In one embodiment, each of the at least two types of codes includes a first sub-code and/or a second sub-code, where the first sub-code is used to indicate whether the participle includes the corresponding category character, and the second sub-code is used to indicate the content of the corresponding category character included in the participle.
In this way, the first sub-code indicates whether the word segment contains a certain category of characters, and the second sub-code carries the specific content of that category within the word segment. The information about the various categories of characters contained in the word segment is thus well captured; when the codes are input into the model, the model can more fully grasp this information, which improves the prediction effect.
In one embodiment, in the case where the first type of code among the at least two types of codes is the code of a word segment with respect to a number, the one-hot code corresponding to the number is adopted as the second sub-code of the first type of code. In the case where the first type of code is the code of a word segment with respect to a numerical value, the value itself is adopted as the second sub-code of the first type of code.
In this way, an encoding mode matched to the characteristics of each category of characters is adopted, which ensures the encoding effect. A digit has only ten states [0-9], so the ten states can be one-hot encoded directly, which helps standardize the encoding; a numerical value has infinitely many possible states, so it is hard to one-hot encode directly, and adopting the value itself as the code is easier to realize.
In one embodiment, the training sample comprises a part of codes left after the codes of the word segmentation results are randomly masked, and the labels of the training sample comprise the part of codes of the word segmentation results which are randomly masked;
or the training sample comprises a code of a first sentence content in the code of the word segmentation result, the label of the training sample comprises a code of a second sentence content in the code of the word segmentation result, and the first sentence content and the second sentence content are two adjacent sentences in the training sample.
In this way, the model is trained by random masking or by predicting the next sentence; the encoding of the training sample label is consistent with that of the training sample, and when the loss is calculated with the label's codes, special consideration can be given to the mathematical characters.
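As an illustrative sketch of the random-mask variant (the 15% mask rate and the zeroing convention are assumptions, not values given in the application):

import random

def make_masked_sample(segment_codes, mask_rate=0.15):
    """Return (inputs, labels): masked positions are zeroed in the input,
    and the label keeps only the codes that were masked out."""
    inputs, labels = [], []
    for code in segment_codes:
        if random.random() < mask_rate:
            inputs.append([0.0] * len(code))  # mask the code at the input stage
            labels.append(code)               # the model must output this code
        else:
            inputs.append(code)
            labels.append(None)               # not predicted at this position
    return inputs, labels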
Fig. 2 shows a flow chart of a method of generating a text recognition model according to an embodiment of the present application. As shown in fig. 2, the method for generating the text recognition model may include:
s201, obtaining training data, where the training data includes training samples and labels of the training samples, where the training samples include training samples obtained by using the data processing method provided in any embodiment of the present application.
S202, training a preset neural network according to the training data, and obtaining a text recognition model after the training is completed, wherein the text recognition model can recognize text containing mathematical characters.
Therefore, the training samples of the embodiments of the application give special consideration to mathematical characters, and a model trained with these training samples has improved capability on mathematics-related natural language processing tasks. Thus, in mathematical scenarios, upper-layer applications such as automatic correction of mathematical problems and intelligent problem solving are better supported.
In one embodiment, step S202 involves loss calculation over all or a specified portion of the codes in the labels of the training samples. Specifically, the code of a word segment in the label comprises a first sub-code and a second sub-code, where the first sub-code indicates whether the word segment contains the corresponding category of characters and the second sub-code indicates the content of that category contained in the word segment; either both the first and second sub-codes in the code of the word segment are used for the loss calculation, or only the first sub-code is used.
Illustratively, the code of a word segment in the training sample is (e_c, e_s, e_sv, e_n, e_nv), and the code of the word segment in the label is set to the same form. During training, either the whole code (e_c, e_s, e_sv, e_n, e_nv) participates in the loss calculation, or only the sub-codes (e_c, e_s, e_n) participate in the loss calculation.
Illustratively, the training process of step S202 may be: 1) input the training sample into the preset neural network to obtain a recognition result; 2) calculate the error between the recognition result and the label using a loss function; 3) propagate the error back along the direction of steepest gradient descent according to the derivative of the loss function, and correct each weight in the preset neural network; 4) return to step 1), and stop iterating when the loss function value reaches a satisfactory value, obtaining the text recognition model.
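A minimal PyTorch sketch of steps 1) to 4) above, assuming a regression-style loss over the codes (the optimizer, learning rate, and stopping threshold are illustrative choices, not specified by the application):

import torch

def train(model, samples, labels, max_iters=10000, target_loss=1e-3):
    optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.MSELoss()
    for _ in range(max_iters):
        optimizer.zero_grad()
        pred = model(samples)          # 1) obtain the recognition result
        loss = loss_fn(pred, labels)   # 2) error between result and label
        loss.backward()                # 3) propagate the error backwards
        optimizer.step()               #    and correct each weight
        if loss.item() < target_loss:  # 4) stop when the loss is satisfactory
            break
    return model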
In conventional natural language processing, which word follows a given word is probabilistically biased (some continuations are far more likely than others). In mathematics, however, a token may be followed by many possibilities with no probabilistic bias (for example, the digit 1 may be followed by any digit, or by a mathematical symbol), and the goal is often to know what kind of token comes next rather than its exact value. Since the learning capacity of a model is fixed, information that does not need to be predicted should not be predicted; the second method therefore lets only part of the code participate in the loss calculation, which reduces the difficulty of learning the remaining information and improves the prediction accuracy on it.
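A sketch of this second mode, in which only the first sub-codes participate in the loss; the index layout follows the five-part code sketched earlier and is an assumption:

import torch

def partial_loss(pred, label, len_v1, len_v3):
    # Offsets inside the concatenated code [e_c | e_s | e_sv | e_n | e_nv]
    idx = list(range(len_v1))          # e_c: Chinese-character one-hot
    idx.append(len_v1)                 # e_s: mathematical-symbol flag
    idx.append(len_v1 + 1 + len_v3)    # e_n: numerical-value flag
    idx = torch.tensor(idx)
    # Only the selected dimensions contribute to the loss
    return torch.nn.functional.mse_loss(pred[..., idx], label[..., idx])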
The following is a specific example provided in the embodiments of the present application, and referring to fig. 3, the steps of the specific example are as follows:
s300, obtaining the mathematical corpora and establishing a mathematical corpus.
S301, training the word segmenter on the whole mathematical corpus using a statistics-based method, the main goal being that the overall vocabulary contains all characters or words appearing in the corpus. At the same time, different characters and words from other corpora are merged to obtain a Chinese character vocabulary V1 of size K and a mathematical symbol vocabulary V3. Other corpora are merged because the mathematical corpus has a limited vocabulary; merging other existing corpora expands it, so that the vocabularies cover both mathematical words and other words.
S302, selecting a sentence to be processed, performing word segmentation on it with the word segmenter trained in step S301, and then encoding the word segmentation result. The sentence to be processed may preferably be a sentence with a mathematical expression.
Exemplarily, two word segmentation modes and the corresponding encoding modes are given below.
First, for any sentence, the word segmenter segments the numbers in the sentence as whole numbers and then encodes them. Taking the sentence "calculate 37.18 × 341.9" as an example, it can be segmented into "calculate", "37.18", "×", "341.9", and each of these word segments is then encoded. The coding region of each word segment is divided into five parts (e_c, e_s, e_sv, e_n, e_nv), where e_c is the vector one-hot encoded according to the Chinese character vocabulary V1, e_s is the mathematical symbol code indicating whether the word segment is a mathematical symbol, e_sv is the one-hot code of the mathematical symbol, e_n is the numerical code indicating whether the word segment represents a numerical value, and e_nv is the specific numerical value represented by the number.
Second, for any word segmentation result, the numbers in it are segmented into single digits and then encoded. Again taking the sentence "calculate 37.18 × 341.9" as an example, it can be segmented into "calculate", "3", "7", ".", "1", "8", "×", "3", "4", "1", ".", "9", and each of these word segments is then encoded. The coding region of each word segment is divided into five parts (e_c, e_s, e_sv, e_n, e_nv), where e_c is the vector one-hot encoded according to the Chinese character vocabulary V1, e_s is the mathematical symbol code indicating whether the word segment is a mathematical symbol, e_sv is the one-hot code of the mathematical symbol, e_n is the digit code indicating whether the word segment is a character in [0-9], and e_nv is the one-hot code of the digit.
The above two encoding methods are only examples; the coding region may also consist of only three of the five parts.
S303, the encoded vectors are used as the input of a neural network; the choice of neural network includes, but is not limited to, language models such as Bidirectional Encoder Representations from Transformers (BERT), GPT-2.0, word to vector (word2vec), or ELMo.
S304, during training, masking of random positions in sentences can be adopted. The specific operation is as follows: for the codes corresponding to characters or words at random positions in a sentence, mask them (null or zero them) at the input stage, and define the model to output the masked codes corresponding to those characters or words.
Alternatively, training is performed by predicting the next sentence in a paragraph. The specific operation is as follows: mask the next sentence of a passage at the input stage, and define the model to output the codes corresponding to that next sentence.
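An illustrative sketch of how next-sentence training pairs could be formed from adjacent sentences (the pairing logic is an assumption):

def make_next_sentence_pairs(sentence_codes):
    """sentence_codes: per-sentence code sequences for consecutive
    sentences in a paragraph. Yields (input, label) pairs where the
    label is the code sequence of the following sentence."""
    for current, following in zip(sentence_codes, sentence_codes[1:]):
        yield current, following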
S305, during training, the loss calculation needs to be performed according to the label, and there are two modes. First, for a word segment to be predicted, the label is set to the entire coding region, i.e., the whole of (e_c, e_s, e_sv, e_n, e_nv); during training, the entire coding region participates in the loss calculation. Second, for a word segment to be predicted, the label is again set to the entire coding region, but only the sub-codes (e_c, e_s, e_n) participate in the loss calculation. In the first mode, the model's prediction output is a full-result prediction; in the second mode, the model's prediction output is a character-type-only prediction.
And S306, obtaining a final text recognition model after full training.
When upper-layer tasks related to natural language processing (such as text classification, sentiment analysis, machine translation, intelligent problem solving, and the like) are performed in an application scenario, unsupervised training is first performed on a relatively large corpus to obtain an initial model; fine-tuning is then performed on the specific task data set. At this point, the method provided in steps S300 to S306 above may be adopted to complete the training of the initial model and obtain the final text recognition model.
Fig. 4 shows a flow chart of a text recognition method according to an embodiment of the application. As shown in fig. 4, the method may include:
s401, performing word segmentation processing on the text to be recognized to obtain word segmentation results.
S402, encoding the participles in the participle result to obtain the codes of the participle result, wherein the codes of the participles in the participle result comprise at least two types of codes, each type of code in the at least two types of codes is the code of the participle about each type of characters, and at least one type of characters in the characters belong to mathematical characters.
And S403, inputting the codes of the word segmentation results into a text recognition model to obtain text recognition results, wherein the text recognition model is generated by adopting the method for generating the text recognition model provided by any embodiment of the application.
In this way, in the prediction stage the codes input to the model give special consideration to mathematical characters, and in the training stage the training samples input to the model also give special consideration to mathematical characters, so the model improves its mathematics-related natural language processing capability and its adaptability in mathematical scenarios.
Further, regarding the encoding manner of the word segmentation result in step S402, reference may be made to the encoding manner of the word segmentation result in the data processing method provided in the embodiment of the present application, which is not described herein again.
Fig. 5 shows a block diagram of a data processing apparatus 500 according to an embodiment of the present application. As shown in fig. 5, the apparatus may include:
the first word segmentation module 501 is configured to perform word segmentation on the text to be processed to obtain a word segmentation result.
The first encoding module 502 is configured to encode the participles in the participle result to obtain codes of the participle result, where the codes of the participle result include at least two types of codes, each type of code in the at least two types of codes is a code of the participle about each type of character, and at least one type of character in each type of character belongs to a mathematical character.
And a training sample determining module 503, configured to determine a training sample according to the coding of the word segmentation result.
In one embodiment, the data processing apparatus further comprises: and a second word segmentation module.
And the second word segmentation module is used for carrying out word segmentation on the numerical value again according to a single number under the condition that the numerical value exists in the word segmentation result.
And the first coding module is used for coding the participles in the participle result after the participles are performed again.
In one embodiment, the mathematical characters include at least one of numbers, numerical values, and mathematical symbols.
One of the at least two types of codes is a code of a word segmentation relative to a number or a numerical value, and/or one of the at least two types of codes is a code of a word segmentation relative to a mathematical symbol.
In one embodiment, each of the at least two types of codes includes a first sub-code and/or a second sub-code, where the first sub-code is used to indicate whether the participle includes the corresponding category character, and the second sub-code is used to indicate the content of the corresponding category character included in the participle.
In one embodiment, in the case that the first type of code in the at least two types of codes is a word-segmentation-related code, the number-corresponding one-hot code is adopted as the second sub-code of the first type of code.
In the case where the first type of code of the at least two types of codes is a word-segmentation-related code with respect to a value, the value itself is employed as a second subcode of the first type of code.
In one embodiment, the training sample comprises a part of codes left after the codes of the word segmentation results are randomly masked, and the labels of the training sample comprise the part of codes of the word segmentation results which are randomly masked;
or the training sample comprises a code of a first sentence content in the code of the word segmentation result, the label of the training sample comprises a code of a second sentence content in the code of the word segmentation result, and the first sentence content and the second sentence content are two adjacent sentences in the training sample.
Fig. 6 shows a block diagram of a device 600 for generating a text recognition model according to an embodiment of the present application. As shown in fig. 6, the apparatus may include:
the training data obtaining module 601 is configured to obtain training data, where the training data includes training samples and labels of the training samples, where the training samples include training samples obtained by the data processing apparatus according to any embodiment of the present application.
The training module 602 is configured to train a preset neural network according to training data, and obtain a text recognition model after the training is completed, where the text recognition model is capable of recognizing a text containing mathematical characters.
In one embodiment, the encoding of the participle in the label comprises a first subcode and a second subcode, the first subcode indicates whether the participle contains the corresponding class character, and the second subcode indicates that the participle contains the content of the corresponding class character, the first subcode and the second subcode in the encoding of the mathematical character are used for loss calculation, or the first subcode in the encoding of the mathematical character is used for loss calculation.
Fig. 7 shows a block diagram of a text recognition apparatus 700 according to an embodiment of the present application. As shown in fig. 7, the apparatus may include:
and a third word segmentation module 701, configured to perform word segmentation processing on the text to be recognized to obtain a word segmentation result.
The second encoding module 702 is configured to encode the participle in the participle result to obtain an encoding of the participle result, where the encoding of the participle result includes at least two types of encoding, each type of encoding in the at least two types of encoding is an encoding of the participle with respect to each type of character, and at least one type of character in each type of character belongs to a mathematical character.
The recognition module 703 is configured to input the code of the word segmentation result into a text recognition model to obtain a text recognition result, where the text recognition model is obtained by using the generation apparatus of the text recognition model according to any embodiment of the present application.
The functions of each module in each apparatus in the embodiment of the present application may refer to corresponding descriptions in the above method, and are not described herein again.
Fig. 8 shows a block diagram of an electronic device according to an embodiment of the present application. As shown in fig. 8, the electronic apparatus includes: a memory 810 and a processor 820, the memory 810 having stored therein computer programs operable on the processor 820. The processor 820 realizes the data processing method, the generation method of the text recognition model, and the text recognition method in the above-described embodiments when executing the computer program. The number of the memory 810 and the processor 820 may be one or more.
The electronic device further includes:
and a communication interface 830, configured to communicate with an external device, and perform data interactive transmission.
If the memory 810, the processor 820 and the communication interface 830 are implemented independently, the memory 810, the processor 820 and the communication interface 830 may be connected to each other through a bus and perform communication with each other. The bus may be an Industry Standard Architecture (ISA) bus, a Peripheral Component Interconnect (PCI) bus, an Extended ISA (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 8, but this is not intended to represent only one bus or type of bus.
Optionally, in an implementation, if the memory 810, the processor 820 and the communication interface 830 are integrated on a chip, the memory 810, the processor 820 and the communication interface 830 may complete communication with each other through an internal interface.
Embodiments of the present application provide a computer-readable storage medium, which stores a computer program, and when the program is executed by a processor, the computer program implements the method provided in the embodiments of the present application.
The embodiment of the present application further provides a chip, where the chip includes a processor, and is configured to call and execute the instruction stored in the memory from the memory, so that the communication device in which the chip is installed executes the method provided in the embodiment of the present application.
An embodiment of the present application further provides a chip, including: the system comprises an input interface, an output interface, a processor and a memory, wherein the input interface, the output interface, the processor and the memory are connected through an internal connection path, the processor is used for executing codes in the memory, and when the codes are executed, the processor is used for executing the method provided by the embodiment of the application.
It should be understood that the processor may be a Central Processing Unit (CPU), other general purpose processor, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other programmable logic device, discrete gate or transistor logic, discrete hardware components, etc. A general purpose processor may be a microprocessor or any conventional processor or the like. It is noted that the processor may be an advanced reduced instruction set machine (ARM) architecture supported processor.
Further, optionally, the memory may include a read-only memory and a random access memory, and may further include a nonvolatile random access memory. The memory may be either volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The non-volatile memory may include a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an electrically Erasable EPROM (EEPROM), or a flash memory. Volatile memory can include Random Access Memory (RAM), which acts as external cache memory. By way of example, and not limitation, many forms of RAM are available. For example, Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), Synchronous Dynamic Random Access Memory (SDRAM), double data rate synchronous SDRAM (DDR SDRAM), Enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct memory bus RAM (DR RAM).
In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. The procedures or functions according to the present application are generated in whole or in part when the computer program instructions are loaded and executed on a computer. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium.
In the description herein, references to the description of the term "one embodiment," "some embodiments," "an example," "a specific example," or "some examples," etc., mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the application. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, various embodiments or examples and features of different embodiments or examples described in this specification can be combined and combined by one skilled in the art without contradiction.
Furthermore, the terms "first" and "second" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include at least one such feature. In the description of the present application, "a plurality" means two or more unless specifically limited otherwise.
Any process or method descriptions in flow charts or otherwise described herein may be understood as representing modules, segments, or portions of code which include one or more executable instructions for implementing specific logical functions or steps of the process. And the scope of the preferred embodiments of the present application includes other implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in reverse order, depending on the functionality involved.
The logic and/or steps represented in the flowcharts or otherwise described herein, e.g., an ordered listing of executable instructions that can be considered to implement logical functions, can be embodied in any computer-readable medium for use by or in connection with an instruction execution system, apparatus, or device, such as a computer-based system, processor-containing system, or other system that can fetch the instructions from the instruction execution system, apparatus, or device and execute the instructions.
It should be understood that portions of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, the various steps or methods may be implemented by software or firmware stored in memory and executed by a suitable instruction execution system. All or part of the steps of the method of the above embodiments may be implemented by a program instructing the relevant hardware; the program may be stored in a computer-readable storage medium and, when executed, includes one of or a combination of the steps of the method embodiments.
In addition, functional units in the embodiments of the present application may be integrated into one processing module, or each unit may exist alone physically, or two or more units are integrated into one module. The integrated module can be realized in a hardware mode, and can also be realized in a software functional module mode. The integrated module may also be stored in a computer-readable storage medium if it is implemented in the form of a software functional module and sold or used as a separate product. The storage medium may be a read-only memory, a magnetic or optical disk, or the like.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive various changes or substitutions within the technical scope of the present application, and these should be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

1. A data processing method, comprising:
performing word segmentation on the text to be processed to obtain a word segmentation result;
encoding the word segmentation in the word segmentation result to obtain the code of the word segmentation result, wherein the code of the word segmentation result comprises at least two types of codes, each type of code in the at least two types of codes is the code of the word segmentation about each type of character, and at least one type of character in each type of character belongs to a mathematical character;
and determining a training sample according to the coding of the word segmentation result.
2. The method of claim 1, further comprising: under the condition that a numerical value exists in the word segmentation result, carrying out word segmentation on the numerical value again according to a single number;
encoding the participles in the participle result, comprising: and encoding the participles in the participle result after the secondary participle.
3. The method of claim 1 or 2, wherein the mathematical characters comprise at least one of numbers, numerical values, and mathematical symbols;
one of the at least two types of codes is a code of the word segmentation about a number or a numerical value, and/or one of the at least two types of codes is a code of the word segmentation about a mathematical symbol.
4. The method according to claim 1 or 2, wherein each of the at least two types of codes comprises a first sub-code and/or a second sub-code, wherein the first sub-code indicates whether the segmented word contains a corresponding category character, and the second sub-code indicates content of the segmented word containing the corresponding category character.
5. The method according to claim 4, wherein in the case that the first type of code in the at least two types of codes is a code of the word segmentation with respect to a number, a one-hot code corresponding to the number is adopted as a second sub-code of the first type of code;
and in the case that the first code in the at least two types of codes is a code of the word segmentation about a numerical value, adopting the numerical value as a second sub-code of the first code.
6. The method according to claim 1, wherein the training samples comprise partial codes left after the codes of the word segmentation results are randomly masked, and the labels of the training samples comprise partial codes of the word segmentation results which are randomly masked;
or, the training sample includes a code of a first sentence content in the code of the word segmentation result, the label of the training sample includes a code of a second sentence content in the code of the word segmentation result, and the first sentence content and the second sentence content are two adjacent sentences in the training sample.
7. A method for generating a text recognition model, comprising:
obtaining training data comprising training samples and labels for the training samples, wherein the training samples comprise training samples determined using the data processing method of any one of claims 1 to 6;
and training a preset neural network according to the training data, and obtaining a text recognition model after the training is finished, wherein the text recognition model can recognize texts containing mathematical characters.
8. The method according to claim 7, wherein the encoding of the participle in the label includes a first subcode and a second subcode, the first subcode indicates whether the participle includes the corresponding category character, and the second subcode indicates that the participle includes the content of the corresponding category character, and the loss calculation is performed by using the first subcode and the second subcode in the encoding of the mathematic character, or the loss calculation is performed by using the first subcode in the encoding of the mathematic character.
9. A text recognition method, comprising:
performing word segmentation processing on a text to be recognized to obtain a word segmentation result;
encoding the word segmentation in the word segmentation result to obtain the code of the word segmentation result, wherein the code of the word segmentation result comprises at least two types of codes, each type of code in the at least two types of codes is the code of the word segmentation about each type of character, and at least one type of character in each type of character belongs to a mathematical character;
inputting the code of the word segmentation result into a text recognition model to obtain a text recognition result, wherein the text recognition model is generated by using the generation method of the text recognition model according to claim 7 or 8.
10. A data processing apparatus, comprising:
the first word segmentation module is used for carrying out word segmentation on the text to be processed to obtain a word segmentation result;
the first coding module is used for coding the participles in the participle result to obtain codes of the participle result, wherein the codes of the participle result comprise at least two types of codes, each type of code in the at least two types of codes is the code of the participle about each type of characters, and at least one type of characters in each type of characters belong to mathematical characters;
and the training sample determining module is used for determining a training sample according to the coding of the word segmentation result.
11. The apparatus of claim 10, further comprising: a second word segmentation module;
the second word segmentation module is used for carrying out word segmentation on the numerical value again according to a single number under the condition that the numerical value exists in the word segmentation result;
and the first coding module is used for coding the participles in the participle result after the participle is performed again.
12. The apparatus of claim 10 or 11, wherein the mathematical characters comprise at least one of numbers, numerical values, and mathematical symbols;
and one of the at least two types of codes is the code of the word segments with respect to numbers or numerical values, and/or one of the at least two types of codes is the code of the word segments with respect to mathematical symbols.
13. The apparatus according to claim 10 or 11, wherein each of the at least two types of codes comprises a first sub-code and/or a second sub-code, the first sub-code indicating whether the word segment contains a character of the corresponding category, and the second sub-code indicating the content of the character of the corresponding category contained in the word segment.
14. The apparatus according to claim 13, wherein, when a first type of code among the at least two types of codes is the code of the word segments with respect to numbers, a one-hot code corresponding to the number is used as the second sub-code of the first type of code;
and when the first type of code among the at least two types of codes is the code of the word segments with respect to numerical values, the numerical value itself is used as the second sub-code of the first type of code.
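One way to read claims 13 and 14 together is sketched below: each category code carries a presence flag plus a content part, with a 10-way one-hot for single digits and the raw value for numerical values (the exact layout of the code vectors is an assumption):

    def encode_digit_segment(seg):
        # First sub-code: does the segment contain a digit?
        # Second sub-code: 10-way one-hot of the digit itself.
        if len(seg) == 1 and seg.isdigit():
            one_hot = [0] * 10
            one_hot[int(seg)] = 1
            return [1] + one_hot    # flag = 1, one-hot content
        return [0] * 11             # flag = 0, empty content

    def encode_value_segment(seg):
        # For the numerical-value category, the value itself is
        # used as the second sub-code.
        try:
            return [1.0, float(seg)]
        except ValueError:
            return [0.0, 0.0]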
15. The apparatus according to claim 10, wherein the training sample comprises the partial codes that remain after codes of the word segmentation result are randomly masked, and the label of the training sample comprises the codes of the word segmentation result that were masked out;
or, the training sample comprises the code of a first sentence in the code of the word segmentation result, the label of the training sample comprises the code of a second sentence in the code of the word segmentation result, and the first sentence and the second sentence are two adjacent sentences in the training sample.
16. An apparatus for generating a text recognition model, comprising:
a training data acquisition module configured to acquire training data comprising training samples and labels of the training samples, wherein the training samples comprise training samples determined by the data processing apparatus of any one of claims 10 to 15;
and a training module configured to train a preset neural network according to the training data to obtain, after training is finished, a text recognition model capable of recognizing text that contains mathematical characters.
17. The apparatus according to claim 16, wherein the code of a word segment in the label comprises a first sub-code and a second sub-code, the first sub-code indicating whether the word segment contains a character of the corresponding category, and the second sub-code indicating the content of the character of the corresponding category contained in the word segment; and the loss calculation is performed using both the first sub-code and the second sub-code in the codes of the mathematical characters, or using only the first sub-code in the codes of the mathematical characters.
18. A text recognition apparatus, comprising:
a third word segmentation module configured to perform word segmentation on a text to be recognized to obtain a word segmentation result;
a second encoding module configured to encode the word segments in the word segmentation result to obtain the code of the word segmentation result, wherein the code of the word segmentation result comprises at least two types of codes, each of the at least two types of codes is the code of the word segments with respect to a corresponding category of characters, and at least one of the categories of characters belongs to mathematical characters;
and a recognition module configured to input the code of the word segmentation result into a text recognition model to obtain a text recognition result, wherein the text recognition model is obtained by the apparatus for generating a text recognition model according to claim 16 or 17.
19. An electronic device, comprising a processor and a memory, the memory storing instructions that are loaded and executed by the processor to implement the data processing method of any one of claims 1 to 6, the method for generating a text recognition model of claim 7 or 8, or the text recognition method of claim 9.
20. A computer-readable storage medium storing a computer program which, when executed by a processor, implements the data processing method of any one of claims 1 to 6, the method for generating a text recognition model of claim 7 or 8, or the text recognition method of claim 9.
CN202110581037.8A 2021-05-27 2021-05-27 Data processing method, text recognition model generation method and text recognition method Active CN113033200B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110581037.8A CN113033200B (en) 2021-05-27 2021-05-27 Data processing method, text recognition model generation method and text recognition method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110581037.8A CN113033200B (en) 2021-05-27 2021-05-27 Data processing method, text recognition model generation method and text recognition method

Publications (2)

Publication Number Publication Date
CN113033200A 2021-06-25
CN113033200B CN113033200B (en) 2021-08-24

Family

ID=76455689

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110581037.8A Active CN113033200B (en) 2021-05-27 2021-05-27 Data processing method, text recognition model generation method and text recognition method

Country Status (1)

Country Link
CN (1) CN113033200B (en)


Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105224521A (en) * 2015-09-28 2016-01-06 北大方正集团有限公司 Key phrases extraction method and use its method obtaining correlated digital resource and device
US20190362266A1 (en) * 2017-06-08 2019-11-28 Beijing Didi Infinity Technology And Development Co., Ltd. Systems and methods for text attribute determination using a conditional random field model
CN110709828A (en) * 2017-06-08 2020-01-17 北京嘀嘀无限科技发展有限公司 System and method for determining text attributes using conditional random field model
CN109255013A (en) * 2018-08-14 2019-01-22 平安医疗健康管理股份有限公司 Claims Resolution decision-making technique, device, computer equipment and storage medium
CN109960804A (en) * 2019-03-21 2019-07-02 江西风向标教育科技有限公司 A kind of topic text sentence vector generation method and device

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113792529A (en) * 2021-11-17 2021-12-14 北京华云安信息技术有限公司 Text character coding method and device for machine learning and electronic equipment
CN113792529B (en) * 2021-11-17 2022-05-06 北京华云安信息技术有限公司 Text character coding method and device for machine learning and electronic equipment
CN116052648A (en) * 2022-08-03 2023-05-02 荣耀终端有限公司 Training method, using method and training system of voice recognition model
CN116052648B (en) * 2022-08-03 2023-10-20 荣耀终端有限公司 Training method, using method and training system of voice recognition model

Also Published As

Publication number Publication date
CN113033200B (en) 2021-08-24

Similar Documents

Publication Publication Date Title
CN110489555B (en) Language model pre-training method combined with similar word information
CN107729313B (en) Deep neural network-based polyphone pronunciation distinguishing method and device
CN111062217B (en) Language information processing method and device, storage medium and electronic equipment
CN113033200B (en) Data processing method, text recognition model generation method and text recognition method
CN111611810A (en) Polyphone pronunciation disambiguation device and method
CN110377882B (en) Method, apparatus, system and storage medium for determining pinyin of text
CN113380223B (en) Method, device, system and storage medium for disambiguating polyphone
CN115599901B (en) Machine question-answering method, device, equipment and storage medium based on semantic prompt
Singh et al. HINDIA: a deep-learning-based model for spell-checking of Hindi language
CN111145914B (en) Method and device for determining text entity of lung cancer clinical disease seed bank
CN114997174B (en) Intention recognition model training and voice intention recognition method and device and related equipment
CN114881035A (en) Method, device, equipment and storage medium for augmenting training data
CN114333838A (en) Method and system for correcting voice recognition text
CN113268996A (en) Method for expanding corpus, training method for translation model and product
CN116702765A (en) Event extraction method and device and electronic equipment
CN113066510B (en) Vowel weak reading detection method and device
CN115577105A (en) Medical text information extraction method and device based on multitask learning
CN113012685B (en) Audio recognition method and device, electronic equipment and storage medium
CN114372467A (en) Named entity extraction method and device, electronic equipment and storage medium
CN113836297A (en) Training method and device for text emotion analysis model
CN111816171A (en) Training method of voice recognition model, voice recognition method and device
CN112989821B (en) Phonetic notation method for polyphone and computer storage medium
CN111368526B (en) Sequence labeling method and system
CN110866390B (en) Method and device for recognizing Chinese grammar error, computer equipment and storage medium
Olivo et al. CRFPOST: Part-of-Speech Tagger for Filipino Texts using Conditional Random Fields

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant