CN110705217A - Wrongly-written character detection method and device, computer storage medium and electronic equipment - Google Patents
- Publication number
- CN110705217A CN110705217A CN201910846339.6A CN201910846339A CN110705217A CN 110705217 A CN110705217 A CN 110705217A CN 201910846339 A CN201910846339 A CN 201910846339A CN 110705217 A CN110705217 A CN 110705217A
- Authority
- CN
- China
- Prior art keywords
- pinyin
- character
- data
- characteristic
- model
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Machine Translation (AREA)
Abstract
A method, an apparatus, a computer storage medium and an electronic device for detecting wrongly written characters. The method comprises: determining text data to be detected; converting the text data into pinyin data; generating a feature template of the pinyin data based on an n-gram model; inputting the feature template of the pinyin data into a pre-built wrongly-written-character detection model, the detection model being obtained by training a conditional random field CRF model with feature templates based on the n-gram model; and determining whether the text data to be detected contains wrongly written characters according to the output of the detection model. With the scheme of the application, wrongly written characters can be detected simply and efficiently.
Description
Technical Field
The present application relates to data processing technologies, and in particular, to a method and an apparatus for detecting wrongly written characters, a computer storage medium, and an electronic device.
Background
With the popularization of smartphones and other mobile devices, communication between people is mainly carried out by pinyin typing. Due to various accidental factors in the typing process, such as typing too fast, failing to find an uncommon character, or a slip of the hand, wrongly written characters may appear during communication. A human brain can recognize and correct wrongly written characters, but they cause great problems for machines. In a computer, characters are stored as 0s and 1s; different characters have different values, and those values are independent, without the correlations (such as identical pronunciation or similar glyph shape) that characters themselves carry. Hence wrongly written characters need to be corrected when computers perform natural language processing in human-computer communication.
The current technique for identifying wrongly written characters mainly relies on frequency statistics over large amounts of text together with a dictionary; this approach is complicated, its computation speed is low, and the resources used for recognition need to be updated from time to time.
Problems existing in the prior art:
the existing method for identifying wrongly written characters has a complex process and low efficiency.
Disclosure of Invention
The embodiment of the application provides a method and a device for detecting wrongly written characters, a computer storage medium and electronic equipment, so as to solve the technical problems.
According to a first aspect of embodiments of the present application, there is provided a method for detecting a wrongly written word, including:
determining text data to be detected;
converting the text data into pinyin data;
generating a feature template of the pinyin data based on an ngram model;
inputting the characteristic template of the pinyin data into a pre-established wrongly written character detection model; the wrongly-written character detection model is obtained by training according to a conditional random field CRF model and a feature template based on an ngram model;
and determining whether the text data to be detected has wrongly written characters according to the output result of the wrongly written character detection model.
According to a second aspect of embodiments of the present application, there is provided a wrongly written word detecting apparatus, including:
the data determining module is used for determining text data to be detected;
the pinyin conversion module is used for converting the text data into pinyin data;
the template generating module is used for generating a feature template of the pinyin data based on an ngram model;
the model detection module is used for inputting the characteristic template of the pinyin data into a pre-constructed wrongly written character detection model; the wrongly-written character detection model is obtained by training according to a conditional random field CRF model and a feature template based on an ngram model;
and the result determining module is used for determining whether the text data to be detected has wrongly written characters according to the output result of the wrongly written character detecting model.
According to a third aspect of embodiments of the present application, there is provided a computer storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of the method as described above.
According to a fourth aspect of embodiments herein, there is provided an electronic device comprising one or more processors, and memory for storing one or more programs; the one or more programs, when executed by the one or more processors, implement the method as described above.
According to the wrongly-written-character detection method and apparatus, computer storage medium and electronic device, text data to be detected is converted into pinyin, a feature template of the pinyin data is generated and input into the pre-built wrongly-written-character detection model, and whether wrongly written characters exist in the text data is thereby detected and determined.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:
fig. 1 is a schematic flow chart illustrating an implementation of a method for detecting a wrongly written word according to an embodiment of the present application;
fig. 2 is a schematic structural diagram of a device for detecting a wrongly written word according to a second embodiment of the present application;
fig. 3 shows a schematic structural diagram of an electronic device in the fourth embodiment of the present application.
Detailed Description
In the process of implementing the present application, the inventors found that:
based on a Long Short-Term Memory neural network model (LSTM), the method can be considered to correct wrongly written words; however, although this method can solve the problem of inconvenient updating based on frequency and dictionary method, LSTM is advantageous for long text prediction, and the wrongly written words in the sentence belong to a local problem in the text, and the LSTM has a general processing effect on the local problem.
In view of the above problems, embodiments of the present application provide a method and an apparatus for detecting wrongly written characters, a computer storage medium, and an electronic device. A feature template for a CRF model is constructed from training samples, the CRF model is trained and its parameters adjusted, and wrongly written characters are then recognized and corrected; in this way wrongly written characters can be corrected quickly and accurately, and the scheme is simple and fast.
CRF model, i.e., Conditional Random Field model. The mathematical description of a CRF is: let X and Y be random variables and P(Y|X) the conditional probability distribution of Y given X; if the random variable Y constitutes a Markov random field, then the conditional probability distribution P(Y|X) is called a conditional random field.
The n-gram model is a language model that uses collocation information between adjacent words in the context (for example, to convert pinyin automatically into Chinese characters). It assumes that the occurrence of the Nth word depends only on the preceding N-1 words and on no other word, so the probability of a whole sentence is the product of the occurrence probabilities of its words.
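The n-gram assumption described above can be sketched with a minimal bigram estimator (a toy illustration, not the patent's implementation; the corpus and probabilities are hypothetical):

```python
from collections import Counter

def bigram_probs(corpus):
    """Estimate P(w_n | w_{n-1}) from a list of tokenized sentences."""
    pair_counts, word_counts = Counter(), Counter()
    for sent in corpus:
        padded = ["<s>"] + sent          # sentence-start marker
        word_counts.update(padded)
        pair_counts.update(zip(padded, padded[1:]))
    return {pair: c / word_counts[pair[0]] for pair, c in pair_counts.items()}

def sentence_prob(sent, probs):
    """Whole-sentence probability = product of per-word conditional probabilities."""
    p = 1.0
    for pair in zip(["<s>"] + sent, sent):
        p *= probs.get(pair, 0.0)
    return p

# Toy pinyin corpus (hypothetical).
corpus = [["wo", "ai", "zu", "guo"], ["wo", "ai", "ni"]]
probs = bigram_probs(corpus)
print(sentence_prob(["wo", "ai", "zu", "guo"], probs))
```

Here each word's probability is conditioned only on the single preceding word (N = 2), and the sentence probability is the product of those conditional probabilities, as the paragraph above states.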
The scheme in the embodiment of the application can be implemented in various computer languages, such as the object-oriented programming language Java or the interpreted scripting language JavaScript.
In order to make the technical solutions and advantages of the embodiments of the present application clearer, exemplary embodiments of the present application are described in further detail below with reference to the accompanying drawings; it is clear that the described embodiments are only a part of the embodiments of the present application, not an exhaustive list of all embodiments. It should be noted that, without conflict, the embodiments in the present application and the features in the embodiments may be combined with each other.
Example one
Fig. 1 is a schematic flowchart illustrating an implementation of a method for detecting a wrongly written word according to an embodiment of the present application.
As shown in the figure, the method for detecting wrongly written words comprises the following steps:
step 101, determining text data to be detected;
step 102, converting the text data into pinyin data;
step 103, generating a feature template of the pinyin data based on an ngram model;
step 104, inputting the characteristic template of the pinyin data into a pre-established wrongly written character detection model; the wrongly-written character detection model is obtained by training according to a conditional random field CRF model and a feature template based on an ngram model;
step 105, determining whether the text data to be detected has wrongly written characters according to the output result of the wrongly written character detection model.
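The five steps above can be sketched as a single call chain (all component names here are hypothetical stand-ins used only to show the data flow, not APIs from the patent):

```python
def detect_typos(text, to_pinyin, make_template, crf_model):
    """Steps 101-105: determine text, convert to pinyin, build the feature
    template, and let the pre-built CRF-based model decide."""
    pinyin = to_pinyin(text)            # step 102: text -> pinyin data
    template = make_template(pinyin)    # step 103: n-gram feature template
    return crf_model(template)          # steps 104-105: model output -> verdict

# Toy stand-ins that only illustrate the flow of data between the steps.
result = detect_typos(
    "wo ai zu guo",
    to_pinyin=lambda t: t.split(),
    make_template=lambda p: [(w, i) for i, w in enumerate(p)],
    crf_model=lambda tmpl: False,       # pretend model: no typos found
)
print(result)
```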
In specific implementation, the text data to be detected is Chinese characters or Chinese text. The conversion of text data into pinyin data can be realized with an existing Chinese-to-pinyin tool or algorithm; applications that convert Chinese characters into pinyin are readily available, so the details are not repeated in the present application.
Assume the text data to be detected is a sentence meaning "rich wisdom is hidden in brief language"; it is converted into pinyin data such as: "zai duan duan de yu yan zhong cang you feng de zhi hui".
A feature template of the pinyin data is then generated based on the n-gram model and input into the pre-built wrongly-written-character detection model. Since the detection model is obtained by training a conditional random field CRF model with n-gram-based feature templates, it can directly output whether wrongly written characters exist in the text data to be detected.
In the wrongly-written-character detection method provided by the embodiment of the present application, text data to be detected is converted into pinyin, and a feature template generated from the pinyin data is input into the pre-built wrongly-written-character detection model to detect and determine whether wrongly written characters exist in the text data.
In one embodiment, the generating the feature template of the pinyin data includes:
generating a first feature for each pinyin according to the pinyin before and after it;
generating a second feature for each pinyin according to the number of times the pinyin appears in the pinyin data;
extracting the pinyin data with a preset window of size 2 or 3 to generate binary character groups, and generating two third features by taking each character in a binary character group as an n-gram feature;
generating the feature template of the pinyin data from the first feature, the second feature and the two third features; the feature template of the pinyin data includes a feature template for each pinyin.
In a specific implementation, the front pinyin and back pinyin of each pinyin refer to the pinyin immediately preceding and the pinyin immediately following it in the pinyin data.
When the current pinyin is at the beginning or the end, the pinyin before or after it is set to a begin or end mark, respectively.
In one embodiment, the generating a first feature for each pinyin according to its front and back pinyins includes:
determining a previous pinyin and a next pinyin of a current pinyin in the pinyin data;
generating a first characteristic of the current pinyin;
wherein the first characteristic is (current pinyin, previous pinyin of the current pinyin, and next pinyin of the current pinyin).
In specific implementation, the embodiment of the present application may determine, for each pinyin, its previous and next pinyin in the pinyin data. For example, the pinyin data corresponding to "I love the country" is wo/ai/zu/guo. For the first pinyin "wo", the previous pinyin is begin and the next pinyin is "ai", so the first feature of "wo" is (wo, begin, ai); for the second pinyin "ai", the previous pinyin is wo and the next pinyin is zu, so the first feature of "ai" is (ai, wo, zu); for the third pinyin "zu", the previous pinyin is ai and the next pinyin is "guo", so the first feature of "zu" is (zu, ai, guo); for the last pinyin "guo", the previous pinyin is zu and the next pinyin is end, so the first feature of "guo" is (guo, zu, end).
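The first-feature construction above, including the begin/end marks at the boundaries, can be sketched as follows (a minimal illustration with a hypothetical helper name):

```python
def first_features(pinyins):
    """For each pinyin emit (current, previous, next); boundary positions
    receive the begin/end marks described in the text."""
    feats = []
    for i, py in enumerate(pinyins):
        prev = pinyins[i - 1] if i > 0 else "begin"
        nxt = pinyins[i + 1] if i < len(pinyins) - 1 else "end"
        feats.append((py, prev, nxt))
    return feats

feats = first_features(["wo", "ai", "zu", "guo"])
print(feats)
# → [('wo', 'begin', 'ai'), ('ai', 'wo', 'zu'), ('zu', 'ai', 'guo'), ('guo', 'zu', 'end')]
```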
In an embodiment, the extracting the pinyin data according to a preset window of size 2 or 3 to generate binary character groups, and generating two third features by taking each character of a binary character group as an n-gram feature, includes:
extracting the pinyin data with a window whose preset size is 2 or 3 to generate binary character groups;
generating the first of the third features by taking the first character of a binary character group as the n-gram feature; this feature records, for the pinyin of the i-th and (i+1)-th characters, the probability that the pinyin preceding the (i+1)-th character's pinyin is the pinyin of the i-th character;
generating the second of the third features by taking the second character of a binary character group as the n-gram feature; this feature records, for the pinyin of the j-th and (j+1)-th characters, the probability that the pinyin following the j-th character's pinyin is the pinyin of the (j+1)-th character.
In specific implementation, assume the text data is "I graduated from Shanghai Jiao Tong University", with corresponding pinyin data "wo/bi/ye/yu/shang/hai/jiao/tang/da/xue".
Extracting binary character groups from the pinyin data with a window of size 2 yields: (wo|bi), (bi|ye), (ye|yu), (yu|shang), (shang|hai), (hai|jiao), (jiao|tang), (tang|da), (da|xue).
Taking the first character of each group as the n-gram feature generates the first of the third features. Taking the first pinyin wo as an example, the first third feature of wo is (wang: p1, wu: p2, …), where each entry is the probability that a character with that pinyin precedes a character pronounced wo.
Taking the second character of each group as the n-gram feature generates the second of the third features. For the first pinyin wo, the second third feature of wo is (men: p3, fang: p4, …), where each entry is the probability that a character with that pinyin follows a character pronounced wo.
In practical implementation, for each pinyin there may be many character combinations in the third features when each character is taken as the n-gram feature; the above is only an example.
In one embodiment, after the generating the binary word group and before generating the two third features, the method further includes:
counting the frequency of each binary word group;
and removing the binary word with the frequency lower than the preset frequency threshold value.
In specific implementation, the preset frequency threshold may be set according to actual needs, for example, may be 100, 35, and the like, and the specific numerical value of the frequency threshold is not limited in the present application.
In this way, after the binary character groups are generated and before the subsequent feature templates are generated, the groups are preprocessed: groups whose frequency is below the preset frequency threshold are removed, which reduces the subsequent computation and discards insignificant features.
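The frequency filtering described above can be sketched as follows (toy counts; the thresholds 100 and 35 mentioned in the text would play the role of `threshold` on a real corpus):

```python
from collections import Counter

def filter_bigrams(bigrams, threshold):
    """Count each binary character group and drop those below the threshold."""
    counts = Counter(bigrams)
    return {bg: c for bg, c in counts.items() if c >= threshold}

# Toy data: 'wo|bi' occurs only once and is filtered out.
bigrams = ["bi|ye"] * 3 + ["ye|yu"] * 2 + ["wo|bi"]
kept = filter_bigrams(bigrams, threshold=2)
print(kept)
# → {'bi|ye': 3, 'ye|yu': 2}
```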
In one embodiment, the wrongly written words detection model is constructed as follows:
collecting training corpora;
marking pinyin on the training corpus;
generating a feature template of the pinyin based on an ngram model;
and training a conditional random field CRF model by taking the characteristic template as a characteristic function to obtain the wrongly written character detection model.
In specific implementation, training corpora can be collected from network resources such as Weibo, Sina, Toutiao (Today's Headlines) and People's Daily; the application does not limit the sources of corpus collection.
Before marking pinyin on the corpus, the embodiment of the application may also perform a preliminary correction of wrongly written characters in the corpus, specifically by manual correction or other existing techniques.
After the training corpus is marked with pinyin, the pinyin and the Chinese characters are encoded separately. Assuming the set of pinyins is p and the set of Chinese characters is m, a feature template based on the n-gram model is constructed. The specific construction process can be as follows:
first, the corresponding features between pinyin and words are constructed, such as: suppose that the character vector feature of the pinyin 'wo' is [ you: 0, I: m1, He: 0, Wo: m2, and 0], the marking method is as follows: in the Chinese character set m, for a character with the pronunciation of 'wo', marking the occurrence frequency of the character in the training corpus; for words whose pronunciation is not 'wo', zero is marked; marking the corresponding characteristics of all the pinyins and the characters in the pinyin set p according to the marking mode.
Then, n-gram features between the pinyin and the characters are constructed. Specifically, all the corpora can be extracted sequentially according to a preset window size and labeled as binary character groups; the frequency of each group is counted, and low-frequency groups are removed.
A feature template of each pinyin is then constructed from the above data. Specifically, assuming the initial template of each pinyin is empty, the preceding and following pinyins are added in turn, then the corresponding pinyin-character features, and then the binary character groups that take each character as an n-gram feature, yielding the final feature template.
And finally, training a conditional random field CRF model by taking the characteristic template as a characteristic function to obtain the wrongly written character detection model.
The CRF model may specifically be:
P(Y|X) = (1/Z(X)) · exp( Σ_c Σ_i Σ_k λ_k · f_k(y_{i-1}, y_i, X, i) )
where the function f is a feature function, k is the index of the feature function, λ_k is the weight of the k-th feature function, P(X) is the joint probability of the pinyin, Y|X denotes the different Chinese character combinations given that the pinyin is X, P(Y|X) is the probability of the current Chinese character combination given that the pinyin is X, i is the serial number of a pinyin or character within a sentence of the pinyin data, and c is the sentence serial number in the pinyin data; Z(X) is a normalization function. Specifically, the formula for Z(X) is as follows:
Z(X) = Σ_Y exp( Σ_c Σ_i Σ_k λ_k · f_k(y_{i-1}, y_i, X, i) )
in one embodiment, the method further comprises:
and correcting the wrongly written characters when the text data to be detected has the wrongly written characters.
In specific implementation, the method and the device can correct the wrongly written characters when the wrongly written characters exist in the text data, so that the aim of automatically correcting the wrongly written characters based on the CRF model can be fulfilled, and a subsequent natural language processing result is more accurate.
Example two
Based on the same inventive concept, the embodiment of the present application provides a device for detecting wrongly written characters. Since the principle by which the device solves the technical problem is similar to that of the method for detecting wrongly written characters, repeated descriptions are omitted.
Fig. 2 is a schematic structural diagram of a device for detecting a wrongly written word according to a second embodiment of the present application.
As shown in the figure, the wrongly written word detecting apparatus includes:
a data determining module 201, configured to determine text data to be detected;
a pinyin conversion module 202, configured to convert the text data into pinyin data;
the template generating module 203 is used for generating a feature template of the pinyin data based on an ngram model;
the model detection module 204 is used for inputting the feature template of the pinyin data based on the ngram model into a pre-constructed wrongly written or mispronounced character detection model; the wrongly-written character detection model is obtained by training according to a conditional random field CRF model and a feature template based on an ngram model;
and a result determining module 205, configured to determine whether the text data to be detected has a wrong character according to an output result of the wrong character detection model.
The wrongly-written-character detection device provided by the embodiment of the present application converts text data to be detected into pinyin, generates a feature template of the pinyin data, and inputs it into the pre-built wrongly-written-character detection model to detect and determine whether wrongly written characters exist in the text data.
In one embodiment, the apparatus comprises a model building module, wherein the model building module comprises:
the corpus collection unit is used for collecting training corpuses;
the pinyin marking unit is used for marking pinyin for the training corpus;
the model template generating unit is used for generating a feature template of the pinyin based on the ngram model;
and the model training unit is used for training the CRF model by taking the feature template as a feature function to obtain the wrongly written character detection model.
In one embodiment, the template generation module includes:
the first characteristic unit is used for generating a first characteristic for each pinyin according to the front pinyin and the back pinyin of each pinyin;
a second feature unit for generating a second feature for each pinyin according to the number of times that each pinyin appears in the pinyin data;
the third characteristic unit is used for extracting the pinyin data according to a preset window 2 or 3 to generate a binary character group, and respectively generating two third characteristics by taking each character in the binary character group as an ngram characteristic;
the template generating unit is used for generating a characteristic template of the pinyin data according to the first characteristic, the second characteristic and the two third characteristics; the characteristic templates of the pinyin data include characteristic templates of each pinyin.
In one embodiment, the first feature cell includes:
a pinyin determining subunit, configured to determine a previous pinyin and a next pinyin of the current pinyin in the pinyin data;
a first feature generation subunit, configured to generate a first feature of the current pinyin;
wherein the first feature is (current pinyin, previous pinyin of the current pinyin, next pinyin of the current pinyin).
In one embodiment, the third feature cell includes:
the character group generating subunit is used for extracting the pinyin data according to a window with a preset window value of 2 to generate a binary character group;
a third feature generation subunit, configured to generate a first third feature by using the first word in the binary word group as an ngram feature; the first third characteristic is (the pinyin of the ith character and the (i + 1) th character, the probability that the pinyin which is the previous pinyin of the (i + 1) th character is the pinyin of the ith character); taking a second character in the binary character group as an ngram characteristic to generate a second third characteristic; the second third characteristic is (the pinyin of the jth character and the jth +1 character, the jth +1 character: the probability that the pinyin next to the pinyin of the jth character is the pinyin of the jth character).
In one embodiment, the template generation module further comprises:
the filtering unit is used for counting the frequency of each binary word group after the binary word group is generated and before two third features are generated; and removing the binary word with the frequency lower than the preset frequency threshold value.
In one embodiment, the apparatus further comprises:
and the wrongly written character correcting module is used for correcting the wrongly written characters when the text data to be detected has the wrongly written characters.
Example three
Based on the same inventive concept, embodiments of the present application further provide a computer storage medium, which is described below.
The computer storage medium has a computer program stored thereon, which when executed by a processor implements the steps of the method of detecting a wrongly written word according to an embodiment.
With the computer storage medium provided by the embodiment of the present application, text data to be detected is converted into pinyin, a feature template of the pinyin data is generated and input into the pre-built wrongly-written-character detection model, and whether wrongly written characters exist in the text data is thereby detected and determined.
Example four
Based on the same inventive concept, the embodiment of the present application further provides an electronic device, which is described below.
Fig. 3 shows a schematic structural diagram of an electronic device in the fourth embodiment of the present application.
As shown, the electronic device includes memory 301 for storing one or more programs, and one or more processors 302; the one or more programs, when executed by the one or more processors, implement the method for detecting wrongly written words as described in embodiment one.
The electronic device provided by the embodiment of the present application converts text data to be detected into pinyin, generates a feature template of the pinyin data, and inputs it into the pre-built wrongly-written-character detection model to detect and determine whether wrongly written characters exist in the text data.
Example five
In order to facilitate the implementation of the present application, the embodiments of the present application are described with a specific example.
Step one, collect training corpora from network resources such as Weibo, Toutiao (Today's Headlines) and People's Daily, and manually perform a preliminary correction of wrongly written characters on the collected corpora.
Step two, performing data processing on the collected corpus, wherein the specific data processing process is as follows:
A. marking pinyin on the collected corpus;
B. coding the pinyin and the Chinese characters respectively;
suppose the size of the pinyin set is p and the size of the Chinese character set is m.
C. And constructing corresponding characteristics between the pinyin and the characters.
For example: suppose the character vector feature of the pinyin "wo" is [you: 0, i: m1, he: 0, wo: m2, …, 0] (assuming the total number of Chinese characters is 4000, the character vector contains 4000 entries: entries for characters pronounced wo hold their occurrence counts, and entries for characters with other pronunciations are 0), where m1 represents the number of occurrences of "i" in the corpus and m2 represents the number of occurrences of "wo" in the corpus.
The marking method is as follows: each character in the Chinese character set m whose pinyin is "wo" is marked with the number of times it appears in the training corpus, and each character whose pinyin is not "wo" is marked with zero. The correspondence features between every pinyin in the pinyin set p and the characters are marked in the same way.
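Steps A–C can be sketched in Python. This is an illustrative sketch, not the patent's implementation: the toy corpus, the `build_char_vector_features` helper and its names are assumptions for illustration.

```python
from collections import Counter, defaultdict

def build_char_vector_features(annotated_corpus, char_set):
    """For each pinyin, count how often each character in the character set
    appears with that pinyin in the corpus; all other characters get 0."""
    counts = defaultdict(Counter)  # pinyin -> Counter over characters
    for char, pinyin in annotated_corpus:
        counts[pinyin][char] += 1
    # Expand each pinyin's counts into a full vector over the character set
    # (mostly zeros, as in the "wo" example above).
    return {py: {ch: c[ch] for ch in char_set} for py, c in counts.items()}

# Toy pinyin-annotated corpus: (character, pinyin) pairs.
corpus = [("我", "wo"), ("我", "wo"), ("沃", "wo"), ("你", "ni")]
char_set = ["你", "我", "他", "沃"]
features = build_char_vector_features(corpus, char_set)
# features["wo"] -> {"你": 0, "我": 2, "他": 0, "沃": 1}
```

In a full system the counts would be taken over the whole training corpus, so the vector for each pinyin has one entry per character in the set (4000 in the example above).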
D. Construct n-gram features between pinyin and characters.
This embodiment adopts the bi-gram technique of the n-gram language model; the specific construction process is as follows:
1) Slide a window of size 2 over all the corpora in order and, together with the pinyin annotation, form character bigrams.
For example, from the sentence "我毕业于上海交通大学" ("I graduated from Shanghai Jiao Tong University"), the bigrams (我毕, wo|bi), (毕业, bi|ye), (业于, ye|yu), (于上, yu|shang), (上海, shang|hai), etc. are extracted.
2) Count the frequency of each bigram and remove the low-frequency bigrams.
For example, with a bigram frequency threshold of 500, bigrams such as (毕业, bi|ye), (业于, ye|yu) and (上海, shang|hai) whose frequency is higher than 500 are kept, while bigrams such as (我毕, wo|bi) and (于上, yu|shang) whose frequency is lower than 500 are removed.
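Steps 1) and 2) above can be sketched as follows. The tuple encoding of the sentence and the threshold of 1 (in place of the example's 500, since the toy data is tiny) are assumptions for illustration.

```python
from collections import Counter

def extract_bigrams(annotated_sentences):
    """Slide a window of size 2 over each pinyin-annotated sentence (step 1)
    and count each (character pair, pinyin pair) bigram."""
    counts = Counter()
    for sent in annotated_sentences:  # sent: list of (char, pinyin) pairs
        for (c1, p1), (c2, p2) in zip(sent, sent[1:]):
            counts[(c1 + c2, p1 + "|" + p2)] += 1
    return counts

def filter_bigrams(counts, threshold):
    """Remove bigrams whose frequency is below the threshold (step 2)."""
    return {bg: n for bg, n in counts.items() if n >= threshold}

sentence = [("我", "wo"), ("毕", "bi"), ("业", "ye"),
            ("于", "yu"), ("上", "shang"), ("海", "hai")]
counts = extract_bigrams([sentence])
# counts contains ("我毕", "wo|bi"), ("毕业", "bi|ye"), ..., ("上海", "shang|hai")
kept = filter_bigrams(counts, threshold=1)  # the patent example uses 500
```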
Step three: construct the feature templates of the conditional random field (CRF) model.
1. Determine the CRF model.
The function f in the CRF model is a feature function. In the standard linear-chain form, the model is

P(y|x) = (1/Z(x)) · exp( Σ_i Σ_k λ_k · f_k(y_{i-1}, y_i, x, i) ),

where Z(x) is the normalization factor and each feature function f_k depends on the label pair (y_{i-1}, y_i) and the observation at position i; here x_i is a character in the sentence.
In this embodiment, the CRF model may define several feature functions, which in practice are represented by feature templates. For example, (wo, 你:0, 我:m1, 他:0, 沃:m2, ..., 0) represents the feature template of the pinyin wo.
2. Constructing templates for each pinyin
Assume that the initial pinyin template q is empty.
First add the pinyin's own context features. For example, in "我毕业于上海师范大学" ("I graduated from Shanghai Normal University"), the pinyin before wo is begin (sentence start) and the pinyin after it is bi; the pinyin before ye is bi and the pinyin after it is yu. The pinyin template q of the pinyin wo is therefore (wo, begin, bi).
Next add the pinyin-character correspondence features for each pinyin. Assuming "我" appears 320 times in the pinyin-annotated data and "沃" appears 240 times, the pinyin template q of the pinyin wo becomes (wo, begin, bi, 你:0, 我:320, 他:0, 沃:240, ...).
Then add the bigram features. The first-word n-gram features record, for each retained bigram, how often the given first character|pinyin precedes the current pinyin: q = (wo, begin, bi, 你:0, 我:320, 他:0, 沃:240, ..., 忘我|wang:p1, 无我|wu:p2, 北京|bei:0, ...).
Suppose the pinyin wang occurs before the pinyin wo 200 times in the corpus and the pinyin wu occurs before wo 300 times. For "我毕业于上海师范大学", however, the pinyin before wo is begin, so the occurrence counts of 忘我 and 无我 are both 0, giving q = (wo, begin, bi, 你:0, 我:320, 他:0, 沃:240, ..., 忘我|wang:0, 无我|wu:0, 北京|bei:0, ...).
Continue adding bigram features, now the second-word n-gram features, for example: q = (wo, begin, bi, 你:0, 我:320, 他:0, 沃:240, ..., 忘我|wang:0, 无我|wu:0, 北京|bei:0, 我们|men:p3, 我方|fang:p4, ...);
Suppose the bigram 我们 (pinyin men after wo) occurs 500 times in the corpus and the bigram 我毕 (pinyin bi after wo) occurs 400 times. For "我毕业于上海师范大学", the pinyin after wo is bi, so the count for 我们 is 0 and the count for 我毕 is 400, giving q = (wo, begin, bi, 你:0, 我:320, 他:0, 沃:240, ..., 忘我|wang:0, 无我|wu:0, 北京|bei:0, 我们|men:0, 我毕|bi:400, 我方|fang:0, ...).
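The incremental template construction above can be condensed into one sketch. The `build_template` helper and its dict-based inputs are illustrative assumptions; the counts are the ones from the worked example.

```python
def build_template(pinyin, prev_py, next_py, char_counts,
                   first_bigram_counts, second_bigram_counts):
    """Assemble the feature template q for one pinyin occurrence:
    context pinyins, then pinyin-character counts, then first-word
    and second-word bigram features."""
    q = [pinyin, prev_py, next_py]
    q += [f"{ch}:{n}" for ch, n in char_counts.items()]
    # First-word features count only when the preceding pinyin in this
    # sentence matches the bigram's first pinyin.
    q += [f"{bg}|{py}:{n if py == prev_py else 0}"
          for (bg, py), n in first_bigram_counts.items()]
    # Second-word features count only when the following pinyin matches.
    q += [f"{bg}|{py}:{n if py == next_py else 0}"
          for (bg, py), n in second_bigram_counts.items()]
    return q

q = build_template(
    "wo", "begin", "bi",
    {"你": 0, "我": 320, "他": 0, "沃": 240},
    {("忘我", "wang"): 200, ("无我", "wu"): 300},  # bigrams ending at wo
    {("我们", "men"): 500, ("我毕", "bi"): 400},   # bigrams starting at wo
)
# q -> ['wo', 'begin', 'bi', '你:0', '我:320', '他:0', '沃:240',
#       '忘我|wang:0', '无我|wu:0', '我们|men:0', '我毕|bi:400']
```

Because the preceding pinyin in this sentence is begin, both first-word features come out 0, while the second-word feature 我毕|bi keeps its corpus count of 400, matching the worked example.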
Step four: train the CRF model.
Taking "我毕业于上海交通大学" as the example, input in turn the feature templates (obtained in step three) corresponding to each pinyin of "wo/bi/ye/yu/shang/hai/jiao/tong/da/xue".
For example, for wo, the feature template q = (wo, begin, bi, 你:0, 我:320, 他:0, 沃:240, ..., 忘我|wang:0, 无我|wu:0, 北京|bei:0, 我们|men:0, 我毕|bi:400, 我方|fang:0, ...) is input into the CRF model's feature functions:
the specific parameter correspondence relationship is as follows:
… and so on.
Because there are many feature functions, they are not listed here one by one.
The final output is "我毕业于上海交通大学" ("I graduated from Shanghai Jiao Tong University").
All sample data are divided into a training set and a testing set.
During training, the parameters of the CRF model, such as the model complexity c, can be adjusted so that the model's accuracy on the test set is highest, yielding a well-trained CRF model.
Step five: prediction.
For new text data, for example "我来自于背景" ("I come from 背景", where 背景 "background" is a typo for 北京 "Beijing"):
first, the text data "我来自于背景" is converted into the pinyin data "wo/lai/zi/yu/bei/jing";
secondly, a feature template is generated for the text's pinyin according to step three;
finally, the template is input into the pre-constructed CRF model for prediction. If the predicted result is "我来自于北京" rather than the original text, wrongly written characters are determined to exist and are automatically corrected.
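The step-five pipeline can be sketched end to end with toy stand-ins: the `PINYIN` table replaces a real pinyin converter (e.g. a library such as pypinyin) and the `BEST_CHAR` table stands in for the trained CRF decoder. Both tables and all names are illustrative assumptions, not the patent's implementation.

```python
# Toy pinyin table standing in for a real character-to-pinyin converter.
PINYIN = {"我": "wo", "来": "lai", "自": "zi", "于": "yu",
          "背": "bei", "景": "jing", "北": "bei", "京": "jing"}
# Toy "model": most likely character for each (previous pinyin, pinyin)
# pair, standing in for the trained CRF's prediction.
BEST_CHAR = {(None, "wo"): "我", ("wo", "lai"): "来", ("lai", "zi"): "自",
             ("zi", "yu"): "于", ("yu", "bei"): "北", ("bei", "jing"): "京"}

def detect_and_correct(text):
    """Step five: text -> pinyin -> predicted characters; a mismatch with
    the input flags (and here also corrects) wrongly written characters."""
    pinyins = [PINYIN[ch] for ch in text]
    prev = [None] + pinyins[:-1]
    predicted = "".join(BEST_CHAR.get((p, py), ch)
                        for ch, p, py in zip(text, prev, pinyins))
    return predicted, predicted != text

corrected, has_typo = detect_and_correct("我来自于背景")
# corrected -> "我来自于北京", has_typo -> True
```

The real model conditions on the full feature template rather than a single preceding pinyin, but the detection logic — compare the decoded characters with the input — is the same.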
Compared with traditional dictionary-based and statistical methods, the CRF model has good statistical properties and little model-update workload, and there is currently no comparable technique for recognizing and correcting wrongly written characters based on a CRF model. In addition, during CRF model construction, feature templates based on the n-gram language model are added, effectively combining the characteristics of the language model with the extensibility of CRF feature functions and achieving good results.
In practical applications, the application scenarios of this embodiment are broad. For example, it can be applied to intelligent customer service: when a user types wrongly written characters while communicating with the intelligent customer service, the system can recognize and correct the user's sentence, so that the user's intent can be judged accurately.
In addition, the method can be applied in the data preprocessing stage of natural language processing: wrongly written characters in the text are handled in advance, so that the natural language processing results are more accurate.
For example:
User input: "I want to buy a train ticket from Shanghai to 背景" (背景 "background" is a typo for 北京 "Beijing");
by adopting the technical scheme provided by the embodiment of the application, the intelligent ticketing system can execute the following operations:
firstly, correcting wrongly written characters by using the technical scheme provided by the embodiment of the application, wherein the corrected wrongly written characters are 'I want to buy a train ticket from Shanghai to Beijing';
then, entities are extracted, for example: shanghai (origin), beijing (destination);
finally, ticket issuing is completed.
If the wrongly written characters were not corrected with the technical scheme provided by this embodiment, the intelligent ticketing system would be unable to issue a ticket because the destination "背景" (background) cannot be found.
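The ticketing flow above (correct, then extract entities, then issue) can be sketched as follows. The request pattern and the string-replace correction are simplifying assumptions standing in for the CRF-based corrector and a real entity extractor.

```python
import re

def extract_ticket_entities(text):
    """Pull origin and destination out of a corrected request shaped like
    '...从<origin>到<destination>的火车票' (from <origin> to <destination>)."""
    m = re.search(r"从(.+?)到(.+?)的火车票", text)
    if m is None:
        return None  # no entities found: the system cannot issue a ticket
    return {"origin": m.group(1), "destination": m.group(2)}

raw = "我想买从上海到背景的火车票"        # "背景" (background) is the typo
corrected = raw.replace("背景", "北京")   # stand-in for the CRF correction
entities = extract_ticket_entities(corrected)
# entities -> {"origin": "上海", "destination": "北京"}
```

Without the correction step, the extracted destination would be 背景, which matches no real station, so ticket issuing would fail downstream.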
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While the preferred embodiments of the present application have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all alterations and modifications as fall within the scope of the application.
It will be apparent to those skilled in the art that various changes and modifications may be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is intended to include such modifications and variations as well.
Claims (10)
1. A method for detecting wrongly written characters, comprising:
determining text data to be detected;
converting the text data into pinyin data;
generating a feature template of the pinyin data based on an ngram model;
inputting a feature template of the pinyin data based on an ngram model into a pre-constructed wrongly written character detection model; the wrongly-written character detection model is obtained by training according to a conditional random field CRF model and a feature template based on an ngram model;
and determining whether the text data to be detected has wrongly written characters according to the output result of the wrongly written character detection model.
2. The method of claim 1, wherein the wrongly written words detection model is constructed as follows:
collecting training corpora;
marking pinyin on the training corpus;
generating a feature template of the pinyin based on an ngram model;
and training a CRF model by taking the characteristic template as a characteristic function to obtain the wrongly written character detection model.
3. The method of claim 1 or 2, wherein the generating the feature template of the pinyin data includes:
generating a first characteristic for each pinyin according to the front pinyin and the rear pinyin of each pinyin;
generating a second characteristic for each pinyin according to the number of times each pinyin appears in the pinyin data;
extracting the pinyin data according to a preset window 2 or 3 to generate binary character groups, and generating two third features by taking each character in the binary character groups as an ngram feature;
generating a characteristic template of the pinyin data according to the first characteristic, the second characteristic and the two third characteristics; the characteristic templates of the pinyin data include characteristic templates of each pinyin.
4. The method of claim 3, wherein generating the first feature for each pinyin based on previous and subsequent pinyins of each pinyin comprises:
determining a previous pinyin and a next pinyin of a current pinyin in the pinyin data;
generating a first characteristic of the current pinyin;
wherein the first characteristic is (current pinyin, previous pinyin of the current pinyin, and next pinyin of the current pinyin).
5. The method of claim 3, wherein the extracting the pinyin data according to a preset window 2 or 3 to generate binary word groups, and generating two third features for the ngram feature for each word in the binary word groups respectively comprises:
extracting the pinyin data according to a window with a preset window value of 2 or 3 to generate a binary word group;
generating a first third feature by taking the first character in the character bigram as the n-gram feature, wherein the first third feature is (the ith character and the (i + 1)th character with their pinyin, the probability that the pinyin preceding the pinyin of the (i + 1)th character is the pinyin of the ith character);
generating a second third feature by taking the second character in the character bigram as the n-gram feature, wherein the second third feature is (the jth character and the (j + 1)th character with their pinyin, the probability that the pinyin following the pinyin of the jth character is the pinyin of the (j + 1)th character).
6. The method of claim 2, wherein after the generating the binary word group and before generating two third features, further comprising:
counting the frequency of each binary word group;
and removing the binary word with the frequency lower than the preset frequency threshold value.
7. The method of claim 1, further comprising:
and correcting the wrongly written characters when the text data to be detected has the wrongly written characters.
8. A wrongly written character detecting apparatus, comprising:
the data determining module is used for determining text data to be detected;
the pinyin conversion module is used for converting the text data into pinyin data;
the template generating module is used for generating a feature template of the pinyin data based on an ngram model;
the model detection module is used for inputting the characteristic template of the pinyin data into a pre-constructed wrongly written character detection model; the wrongly-written character detection model is obtained by training according to a conditional random field CRF model and a feature template based on an ngram model;
and the result determining module is used for determining whether the text data to be detected has wrongly written characters according to the output result of the wrongly written character detecting model.
9. A computer storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 7.
10. An electronic device comprising one or more processors, and memory for storing one or more programs; the one or more programs, when executed by the one or more processors, implement the method of any of claims 1 to 7.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910846339.6A CN110705217B (en) | 2019-09-09 | 2019-09-09 | Wrongly written or mispronounced word detection method and device, computer storage medium and electronic equipment |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910846339.6A CN110705217B (en) | 2019-09-09 | 2019-09-09 | Wrongly written or mispronounced word detection method and device, computer storage medium and electronic equipment |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110705217A true CN110705217A (en) | 2020-01-17 |
CN110705217B CN110705217B (en) | 2023-07-21 |
Family
ID=69195144
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910846339.6A Active CN110705217B (en) | 2019-09-09 | 2019-09-09 | Wrongly written or mispronounced word detection method and device, computer storage medium and electronic equipment |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110705217B (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112307748A (en) * | 2020-03-02 | 2021-02-02 | 北京字节跳动网络技术有限公司 | Method and device for processing text |
CN112560451A (en) * | 2021-02-20 | 2021-03-26 | 京华信息科技股份有限公司 | Wrongly written character proofreading method and device for automatically generating training data |
CN112800987A (en) * | 2021-02-02 | 2021-05-14 | 中国联合网络通信集团有限公司 | Chinese character processing method and device |
Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106297797A (en) * | 2016-07-26 | 2017-01-04 | 百度在线网络技术(北京)有限公司 | Method for correcting error of voice identification result and device |
CN108021712A (en) * | 2017-12-28 | 2018-05-11 | 中南大学 | The method for building up of N-Gram models |
CN108304375A (en) * | 2017-11-13 | 2018-07-20 | 广州腾讯科技有限公司 | A kind of information identifying method and its equipment, storage medium, terminal |
CN108491392A (en) * | 2018-03-29 | 2018-09-04 | 广州视源电子科技股份有限公司 | Method, system, computer device and storage medium for correcting character spelling errors |
CN108519973A (en) * | 2018-03-29 | 2018-09-11 | 广州视源电子科技股份有限公司 | Character spelling detection method, system, computer equipment and storage medium |
CN109271526A (en) * | 2018-08-14 | 2019-01-25 | 阿里巴巴集团控股有限公司 | Method for text detection, device, electronic equipment and computer readable storage medium |
CN109492202A (en) * | 2018-11-12 | 2019-03-19 | 浙江大学山东工业技术研究院 | A kind of Chinese error correction of coding and decoded model based on phonetic |
CN109992765A (en) * | 2017-12-29 | 2019-07-09 | 北京京东尚科信息技术有限公司 | Text error correction method and device, storage medium and electronic equipment |
CN110110041A (en) * | 2019-03-15 | 2019-08-09 | 平安科技(深圳)有限公司 | Wrong word correcting method, device, computer installation and storage medium |
CN110162789A (en) * | 2019-05-13 | 2019-08-23 | 北京一览群智数据科技有限责任公司 | A kind of vocabulary sign method and device based on the Chinese phonetic alphabet |
Patent Citations (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN106297797A (en) * | 2016-07-26 | 2017-01-04 | 百度在线网络技术(北京)有限公司 | Method for correcting error of voice identification result and device |
CN108304375A (en) * | 2017-11-13 | 2018-07-20 | 广州腾讯科技有限公司 | A kind of information identifying method and its equipment, storage medium, terminal |
CN108021712A (en) * | 2017-12-28 | 2018-05-11 | 中南大学 | The method for building up of N-Gram models |
CN109992765A (en) * | 2017-12-29 | 2019-07-09 | 北京京东尚科信息技术有限公司 | Text error correction method and device, storage medium and electronic equipment |
CN108491392A (en) * | 2018-03-29 | 2018-09-04 | 广州视源电子科技股份有限公司 | Method, system, computer device and storage medium for correcting character spelling errors |
CN108519973A (en) * | 2018-03-29 | 2018-09-11 | 广州视源电子科技股份有限公司 | Character spelling detection method, system, computer equipment and storage medium |
CN109271526A (en) * | 2018-08-14 | 2019-01-25 | 阿里巴巴集团控股有限公司 | Method for text detection, device, electronic equipment and computer readable storage medium |
CN109492202A (en) * | 2018-11-12 | 2019-03-19 | 浙江大学山东工业技术研究院 | A kind of Chinese error correction of coding and decoded model based on phonetic |
CN110110041A (en) * | 2019-03-15 | 2019-08-09 | 平安科技(深圳)有限公司 | Wrong word correcting method, device, computer installation and storage medium |
CN110162789A (en) * | 2019-05-13 | 2019-08-23 | 北京一览群智数据科技有限责任公司 | A kind of vocabulary sign method and device based on the Chinese phonetic alphabet |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112307748A (en) * | 2020-03-02 | 2021-02-02 | 北京字节跳动网络技术有限公司 | Method and device for processing text |
CN112800987A (en) * | 2021-02-02 | 2021-05-14 | 中国联合网络通信集团有限公司 | Chinese character processing method and device |
CN112800987B (en) * | 2021-02-02 | 2023-07-21 | 中国联合网络通信集团有限公司 | Chinese character processing method and device |
CN112560451A (en) * | 2021-02-20 | 2021-03-26 | 京华信息科技股份有限公司 | Wrongly written character proofreading method and device for automatically generating training data |
CN112560451B (en) * | 2021-02-20 | 2021-05-14 | 京华信息科技股份有限公司 | Wrongly written character proofreading method and device for automatically generating training data |
Also Published As
Publication number | Publication date |
---|---|
CN110705217B (en) | 2023-07-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110598203B (en) | Method and device for extracting entity information of military design document combined with dictionary | |
CN107291783B (en) | Semantic matching method and intelligent equipment | |
CN109147767B (en) | Method, device, computer equipment and storage medium for recognizing numbers in voice | |
CN110457689B (en) | Semantic processing method and related device | |
CN113901797B (en) | Text error correction method, device, equipment and storage medium | |
CN109543181B (en) | Named entity model and system based on combination of active learning and deep learning | |
CN111310440B (en) | Text error correction method, device and system | |
CN111209740B (en) | Text model training method, text error correction method, electronic device and storage medium | |
CN110705217B (en) | Wrongly written or mispronounced word detection method and device, computer storage medium and electronic equipment | |
EP4131076A1 (en) | Serialized data processing method and device, and text processing method and device | |
CN110555206A (en) | named entity identification method, device, equipment and storage medium | |
CN110222184A (en) | A kind of emotion information recognition methods of text and relevant apparatus | |
CN112560451B (en) | Wrongly written character proofreading method and device for automatically generating training data | |
CN116991875B (en) | SQL sentence generation and alias mapping method and device based on big model | |
CN114153971A (en) | Error-containing Chinese text error correction, identification and classification equipment | |
CN112036179B (en) | Electric power plan information extraction method based on text classification and semantic frame | |
CN113326702A (en) | Semantic recognition method and device, electronic equipment and storage medium | |
CN105389303B (en) | A kind of automatic fusion method of heterologous corpus | |
CN110472231B (en) | Method and device for identifying legal document case | |
CN114896966B (en) | Chinese text grammar error positioning method, system, equipment and medium | |
CN116070642A (en) | Text emotion analysis method and related device based on expression embedding | |
CN110929514A (en) | Text proofreading method and device, computer readable storage medium and electronic equipment | |
CN115691503A (en) | Voice recognition method and device, electronic equipment and storage medium | |
CN115936010A (en) | Text abbreviation data processing method and device | |
CN115718889A (en) | Industry classification method and device for company profile |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
TA01 | Transfer of patent application right | ||
Effective date of registration: 20210318 Address after: 200333 room 3110, No. 100, Lane 130, Taopu Road, Putuo District, Shanghai Applicant after: Shanghai zebra Laila Logistics Technology Co.,Ltd. Address before: Room 308-1, area C, 1718 Daduhe Road, Putuo District, Shanghai 200333 Applicant before: Shanghai kjing XinDa science and Technology Group Co.,Ltd. |
GR01 | Patent grant | ||