CN110489754A - Method and system for quickly generating a standard corpus - Google Patents

Publication number
CN110489754A
Authority
CN
China
Prior art keywords
text
sentence
specification information
similarity
initial position
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910768046.0A
Other languages
Chinese (zh)
Other versions
CN110489754B (en)
Inventor
刘云芳
江敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Shulan Technology Co Ltd
Original Assignee
Hangzhou Shulan Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Shulan Technology Co Ltd
Priority to CN201910768046.0A
Publication of CN110489754A
Application granted
Publication of CN110489754B
Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention relates to a method and system for quickly generating a standard corpus. The invention uses a computer to search a sentence automatically for information corresponding to specification information, so that a standard corpus can be generated more efficiently.

Description

Method and system for quickly generating a standard corpus
Technical field
The present invention relates to computer natural language processing technology, and more particularly to a method and system for quickly generating a standard corpus.
Background art
The problem of identifying and extracting words in a sentence (especially nonstandard or incorrect words) is usually addressed with supervised learning methods from the machine learning field, such as the BiLSTM+CRF model. In the art, supervised learning means feeding an annotated corpus (that is, a standard corpus) into a computer to train a machine learning model; when an unannotated sentence is later input into the computer, the machine learning model produces the annotation for that sentence. Before a supervised learning method can be used to identify or extract nonstandard words, the machine learning model must be trained on a large amount of standard corpus.
Existing methods for generating a standard corpus require a large amount of manual labor. For example, an environmental protection agency receives the complaint "the caller reports that someone is littering beside the wash shield scenic area, damaging the scenic area's environment". Here the place name is miswritten: "wash shield" (pinyin "xi hu") is a homophone of "West Lake". A staff member of the agency judges the place name manually and selects the standardized place name (referred to as "specification information"), which is "West Lake scenic area". Although the specification information in this example is a place name, specification information may equally relate to words of other parts of speech or categories; for example, the specification information corresponding to the nonstandard verb "learning seat" is "study". In this disclosure, specification information means a word or phrase that conforms to usual grammar and usage. In the example above, once the specification information has been selected, the existing method still requires the staff member to go back to the call information and, guided by the specification information "West Lake scenic area", annotate "wash shield scenic area" in it, thereby generating a standard corpus. Written in pinyin with one syllable per character, the standard corpus generated from this call information could be "lai/O dian/O ren/O fan/O ying/O xi/P hu/P jing/P qu/P pang/O bian/O you/O ren/O luan/O reng/O la/O ji/O ,/O po/O huai/O jing/O qu/O huan/O jing/O", where a character labeled "/O" is an ordinary (other) character and a character labeled "/P" is one that the machine learning model needs to identify (the symbols "O" and "P" are only illustrative; other symbols may be used as needed or by habit, provided the two differ).
To reduce manual involvement in generating a standard corpus, a new method for generating a standard corpus is needed.
Summary of the invention
One aspect of the present invention is a method of using a computer to search a sentence for information corresponding to specification information, comprising: (1) using the computer to set the position of the first character of the sentence as the starting position; (2) using the computer to determine, according to a predetermined rule and a rule for calculating the similarity between characters, whether information corresponding to the specification information exists in the sentence beginning at the starting position; and (3) using the computer, if it is determined that such information exists in the sentence, to terminate the search operation, and otherwise to advance the starting position by one character in the sentence and perform step (2) again.
According to an embodiment of the invention, the predetermined rule used when searching a sentence for information corresponding to specification information is: if the remaining length of the sentence from the starting position is greater than or equal to the length of the specification information, and each character of the sentence from the starting position is identical to, or reaches or exceeds a predetermined similarity with, the corresponding character of the specification information counted from its beginning, then it is determined that information corresponding to the specification information exists in the sentence from the starting position.
According to an embodiment of the invention, the predetermined rule is: if the remaining length of the sentence from the starting position is greater than or equal to the length of the specification information, and each character of the sentence from the starting position is identical to, or reaches or exceeds a predetermined similarity with, the corresponding character of the specification information, for the characters from the beginning of the specification information up to a predetermined proportion of its total length, then it is determined that information corresponding to the specification information exists in the sentence from the starting position.
According to an embodiment of the invention, the predetermined rule is: if the remaining length of the sentence from the starting position is greater than or equal to the length of the specification information, and, between the characters of the sentence from the starting position and the corresponding characters of the specification information from its beginning, the number of consecutive characters whose similarity falls below the predetermined similarity is less than a predetermined number, then it is determined that information corresponding to the specification information exists in the sentence from the starting position.
According to an embodiment of the invention, the predetermined rule is: if the remaining length of the sentence from the starting position is less than the length of the specification information, then it is determined that no information corresponding to the specification information exists in the sentence.
According to an embodiment of the invention, for Chinese, the similarity between characters may be calculated using Hanyu Pinyin or glyph shape.
One aspect of the present invention is a system for using a computer to search a sentence for information corresponding to specification information, comprising: (1) a device for using the computer to set the position of the first character of the sentence as the starting position; (2) a device for using the computer to determine, according to a predetermined rule and a rule for calculating the similarity between characters, whether information corresponding to the specification information exists in the sentence beginning at the starting position; and (3) a device for using the computer, if it is determined that such information exists in the sentence from the starting position, to terminate the search operation, and otherwise to advance the starting position by one character in the sentence and perform step (2) again.
Another aspect of the present invention is a computer-readable medium storing computer-readable instructions that, when executed by a computer, carry out the methods described in the embodiments of the invention.
Embodiments of the present invention reduce manual workload and improve the efficiency of annotating sentences, so that a standard corpus can be generated more quickly.
Brief description of the drawings
Fig. 1 is a schematic diagram of a character-splitting library according to an embodiment of the invention.
Detailed description
The invention is now described with reference to several exemplary embodiments. These embodiments are described only so that those of ordinary skill in the art can better understand and thereby implement the invention; they do not imply any limitation on its scope.
As used herein, the term "includes" and its variants are to be read as open-ended terms meaning "including but not limited to". The term "based on" is to be read as "based at least in part on". The terms "one embodiment" and "an embodiment" are to be read as "at least one embodiment". The term "another embodiment" is to be read as "at least one other embodiment".
The method of the embodiments of the invention uses a computer to generate a standard corpus more efficiently. The generated standard corpus can be used to train a machine learning model suited to supervised learning methods for identifying or extracting words. Such a model can be used in various scenarios. For example, it can identify words of different parts of speech in an input sentence, such as nouns, verbs and adjectives; it can also identify words of different categories in an input sentence, such as place names, personal names and organization names. The embodiments below illustrate the method using, as an example, the standard corpus needed by a machine learning model that identifies place names. Those skilled in the art will appreciate that the method can also serve machine learning models that identify words of other categories or parts of speech.
The method of the invention uses a computer to search the input sentence, according to a predetermined rule, for information "corresponding" to the specification information, and then automatically annotates that corresponding information in the input sentence. The annotated input sentence is the standard corpus. Information corresponding to the specification information means information that is identical to the specification information or whose similarity to the specification information exceeds a predetermined threshold.
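The automatic annotation step can be sketched as follows. This is a minimal illustration, assuming the matched span is already known as a start index and a length; the function name `label_sentence` and the default tags "P" and "O" are illustrative choices, not taken from the patent text beyond the tag symbols themselves.

```python
def label_sentence(sentence: str, start: int, length: int,
                   hit: str = "P", other: str = "O") -> str:
    """Tag every character: `hit` inside [start, start + length), `other` elsewhere."""
    tagged = []
    for i, ch in enumerate(sentence):
        tag = hit if start <= i < start + length else other
        tagged.append(f"{ch}/{tag}")
    return " ".join(tagged)


# Hypothetical six-character sentence whose characters 2..3 were matched.
print(label_sentence("abXYcd", start=2, length=2))
# a/O b/O X/P Y/P c/O d/O
```

Because the function works on Python strings, it treats each Chinese character as one position, which matches the per-character labels described in the background section.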
The method of the invention includes using a computer to search the input sentence for characters corresponding to the specification information. By calculating similarity, the computer compares the first character of the specification information with each character of the input sentence in turn. If a character sufficiently similar to the first character of the specification information is found, the same similarity calculation is used to judge, one by one, whether the characters that follow it in the input sentence are sufficiently similar to the characters that follow the first character of the specification information. For example, a character in the input sentence whose similarity to the first character of the specification information is greater than 0.5 is taken as a candidate starting position, and it is then judged whether the characters after the candidate starting position each have a similarity greater than 0.5 to the corresponding characters after the first character of the specification information. If all the following characters are similar in sequence, the characters corresponding to the specification information are annotated one by one in the input sentence, and the result is one standard corpus. The characters in the input sentence that correspond to the specification information may also be replaced with the specification information, and the specification information annotated in the replaced sentence, to give another standard corpus. If the following characters are identical in sequence to the corresponding characters of the specification information, the characters corresponding to the specification information are likewise annotated one by one in the input sentence to give a standard corpus.
In the sequential comparison above, if the characters at some position in the specification information and the input sentence are not sufficiently similar, the comparison is terminated, the starting position in the input sentence is advanced by one character, and the sequential comparison is repeated. (If, after the starting position has been advanced by one character, the remaining length of the input sentence is shorter than the specification information, it can be determined that the input sentence contains no information corresponding to the specification information.) In some embodiments, the comparison is terminated only when the characters at a predetermined number of consecutive positions (two or more) are not sufficiently similar; insufficient similarity at a single position, or at fewer consecutive positions than the predetermined number, is ignored. The starting position in the input sentence is then advanced by one character and the sequential comparison is repeated, until the remaining length of the input sentence is shorter than the specification information, at which point the search also terminates.
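The sliding comparison described above can be sketched as follows. This is a sketch under stated assumptions rather than the patent's own implementation: the names `matches_at`, `find_match` and `max_run` are hypothetical, the similarity function is passed in as a parameter, and "sufficiently similar" is taken to mean a score at or above the threshold. Setting `max_run=1` gives the strict rule (any single mismatch ends the comparison); a larger value tolerates shorter runs of consecutive mismatches.

```python
from typing import Callable, Optional


def matches_at(sentence: str, start: int, spec: str,
               similarity: Callable[[str, str], float],
               threshold: float = 0.5, max_run: int = 1) -> bool:
    """Apply the predetermined rule at one starting position."""
    if len(sentence) - start < len(spec):
        return False  # remaining sentence is shorter than the specification
    run = 0  # current number of consecutive insufficiently similar characters
    for k, target in enumerate(spec):
        if similarity(sentence[start + k], target) >= threshold:
            run = 0
        else:
            run += 1
            if run >= max_run:
                return False
    return True


def find_match(sentence: str, spec: str,
               similarity: Callable[[str, str], float],
               threshold: float = 0.5, max_run: int = 1) -> Optional[int]:
    """Return the first starting position that satisfies the rule, else None."""
    start = 0
    while len(sentence) - start >= len(spec):
        if matches_at(sentence, start, spec, similarity, threshold, max_run):
            return start
        start += 1  # advance the starting position by one character
    return None


def exact(a: str, b: str) -> float:
    """Stand-in similarity: 1.0 for identical characters, otherwise 0.0."""
    return 1.0 if a == b else 0.0


print(find_match("xxabcx", "abc", exact))            # 2
print(find_match("xaZcx", "abc", exact))             # None
print(find_match("xaZcx", "abc", exact, max_run=2))  # 1
```

In practice `exact` would be replaced by a pinyin-based or glyph-based similarity such as those described later in the description.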
According to some embodiments of the invention, it is not necessary to compare against every character of the specification information, and to find every character at or above the predetermined similarity, before determining that the input sentence contains information corresponding to the specification information. Instead, once the compared portion of the specification information reaches a predetermined proportion (for example 75%) of its total length, and every character in the compared portion reaches or exceeds the predetermined similarity, the input sentence is considered to contain information corresponding to the specification information. In that case, a character string in the input sentence equal in length to the whole specification information is annotated to generate the standard corpus.
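The proportion-based rule can be sketched as follows. The function name `prefix_rule` is hypothetical, and rounding the required number of characters up with `math.ceil` is an assumption; the text only states a proportion such as 75% and does not say how fractional lengths are handled.

```python
import math
from typing import Callable


def prefix_rule(sentence: str, start: int, spec: str,
                similarity: Callable[[str, str], float],
                threshold: float = 0.5, ratio: float = 0.75) -> bool:
    """Accept when the leading `ratio` of `spec` is sufficiently similar."""
    if len(sentence) - start < len(spec):
        return False  # the annotated span must still fit in the sentence
    needed = math.ceil(len(spec) * ratio)  # assumption: round up
    return all(similarity(sentence[start + k], spec[k]) >= threshold
               for k in range(needed))


def exact(a: str, b: str) -> float:
    """Stand-in similarity: 1.0 for identical characters, otherwise 0.0."""
    return 1.0 if a == b else 0.0


print(prefix_rule("abcZ", 0, "abcd", exact))  # True: first 3 of 4 characters match
print(prefix_rule("abZd", 0, "abcd", exact))  # False: third character differs
```

When the rule accepts, the span to annotate still has the full length of the specification information, as the paragraph above states.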
Preferably, because the input sentence may contain non-character content (such as punctuation and digits), the method of searching the input sentence for characters corresponding to specification information that consists only of characters can be improved. For example, the non-character content in the input sentence can be replaced with an arbitrary special character, such as "#". Then, if a special character is encountered in the input sentence during the sequential comparison, the comparison is terminated and restarted with the position immediately after the special character as the new starting position.
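The masking step and the restart after a special character can be sketched as follows. The names `mask_non_characters` and `find_match_masked` are hypothetical, and using `str.isalpha()` to decide what counts as a character is an assumption (it accepts Chinese characters and letters and rejects digits, punctuation and spaces).

```python
from typing import Callable, Optional


def mask_non_characters(sentence: str, marker: str = "#") -> str:
    """Replace every non-character symbol with `marker`."""
    return "".join(ch if ch.isalpha() else marker for ch in sentence)


def find_match_masked(sentence: str, spec: str,
                      similarity: Callable[[str, str], float],
                      threshold: float = 0.5, marker: str = "#") -> Optional[int]:
    """Search as before, but restart right after any marker that is hit."""
    masked = mask_non_characters(sentence, marker)
    start = 0
    while len(masked) - start >= len(spec):
        restart = start + 1
        matched = True
        for k, target in enumerate(spec):
            ch = masked[start + k]
            if ch == marker:
                restart = start + k + 1  # jump to the position after the marker
                matched = False
                break
            if similarity(ch, target) < threshold:
                matched = False
                break
        if matched:
            return start
        start = restart
    return None


def exact(a: str, b: str) -> float:
    """Stand-in similarity: 1.0 for identical characters, otherwise 0.0."""
    return 1.0 if a == b else 0.0


print(mask_non_characters("ab,1cd"))           # ab##cd
print(find_match_masked("ab,cd", "cd", exact))  # 3
print(find_match_masked("a,bc", "ab", exact))   # None
```

Masking keeps the original character positions, so an index returned by the search can be used directly to annotate the unmasked sentence.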
The specific method of calculating the similarity between two characters can be chosen according to the language. For Chinese, the similarity between two characters can be judged using their Hanyu Pinyin or their glyph shapes.
A pinyin-based method judges the similarity of the initials and the similarity of the finals of the two Chinese characters separately, and then combines the two results (for example, as a weighted sum) to obtain the final similarity. An illustrative method comprises the following steps:
(1) Convert the specification information and the input sentence into Hanyu Pinyin
First, each character in the input sentence and in the specification information is converted into pinyin. An existing method of converting Chinese characters into pinyin can be used (for example, a hidden Markov method or the Viterbi method).
In the example above, the call information can be converted into "lai dian ren fan ying xi hu jing qu pang bian you ren luan reng la ji, po huai jing qu huan jing", and the specification information into "xi hu jing qu".
As another example, the call information is "I am from Hu Jian Hu Zhou (a miswritten place name); the environment here used to be very good, but since a large number of factories were built here the environment has been seriously damaged, and I hope the relevant leaders will take this seriously", and the specification information is "Fujian Fuzhou". After conversion, the pinyin of the call information is "wo shi hu jian hu zhou ren, wo men zhe bian huan jing qing kuang ben lai hen hao, zi cong zhe bian jian le da liang de gong chang, huan jing zao dao le yan zhong de po huai, xi wang xiang guan ling dao zhong shi", and the pinyin of the specification information is "fu jian fu zhou".
(2) Split the pinyin
The pinyin obtained can be split using an initials table and a finals table. For example, the pinyin "hu" is split into "initial: h, final: u", and the pinyin "jian" into "initial: j, finals: i, an".
The initial is split off by a forward splitting method, and the final by a backward splitting method.
The forward splitting method uses the initials table: starting from the first letter of a pinyin syllable, it looks up the initials in the table with which the syllable begins. If the syllable begins with exactly one such initial, that initial is kept; if it begins with two or more, the longest is kept. For example, for the pinyin "shu", the initials beginning with "s" are s and sh, and sh is finally chosen as the initial.
The backward splitting method uses the finals table: starting from the last letter of a pinyin syllable, it looks up the finals in the table with which the syllable ends. If the syllable ends with exactly one such final, that final is kept; if it ends with two or more, the longest is kept. For example, for the pinyin "shei", the finals ending in "i" include "ei, ai, i"; the syllable ends with only i and ei, so the final ei is kept.
For a pinyin syllable that contains two finals, the invention also provides a way of splitting them. The initial found by the forward splitting method and the final found by the backward splitting method are deleted from the syllable. If nothing remains, the split is complete; otherwise it is judged whether the remaining part is a final in the finals table. If it is, the remaining part is kept as a final; if not, the syllable is an invalid pinyin. For example, for the pinyin "shuang", the initial found by forward splitting is "sh", the final found by backward splitting is "ang", and the remaining part is "u". The finals table confirms that "u" is a final, so "u" is kept.
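The forward, backward and remainder steps of the splitting procedure can be sketched as follows. The `INITIALS` and `FINALS` tables here are assumptions: they list common Hanyu Pinyin initials and basic finals (with "v" standing in for "ü"), chosen so that "jian" splits into "i" and "an" as in the example; the patent does not reproduce its own tables. The function name `split_pinyin` is likewise hypothetical.

```python
from typing import List, Tuple

# Assumed tables; the patent's actual initials and finals tables are not shown.
INITIALS = ["zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g",
            "k", "h", "j", "q", "x", "r", "z", "c", "s", "y", "w"]
FINALS = ["a", "o", "e", "i", "u", "v", "ai", "ei", "ui", "ao", "ou", "iu",
          "ie", "ve", "er", "an", "en", "in", "un", "vn",
          "ang", "eng", "ing", "ong"]


def split_pinyin(pinyin: str) -> Tuple[str, List[str]]:
    """Split one syllable into (initial, [finals]); raise on invalid pinyin."""
    # Forward split: keep the longest initial the syllable starts with.
    initial = max((s for s in INITIALS if pinyin.startswith(s)),
                  key=len, default="")
    rest = pinyin[len(initial):]
    # Backward split: keep the longest final the remainder ends with.
    final = max((f for f in FINALS if rest.endswith(f)), key=len, default="")
    if not final:
        raise ValueError(f"invalid pinyin: {pinyin!r}")
    # Whatever is left between initial and final must itself be a final.
    middle = rest[:len(rest) - len(final)]
    if middle and middle not in FINALS:
        raise ValueError(f"invalid pinyin: {pinyin!r}")
    return initial, ([middle, final] if middle else [final])


print(split_pinyin("hu"))      # ('h', ['u'])
print(split_pinyin("jian"))    # ('j', ['i', 'an'])
print(split_pinyin("shei"))    # ('sh', ['ei'])
print(split_pinyin("shuang"))  # ('sh', ['u', 'ang'])
```

One design choice in this sketch is that the backward split is applied to what remains after the initial is removed, so the initial and the final can never overlap.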
(3) Calculate the similarity
The similarity between the pinyin of two characters can be calculated from the results of splitting the pinyin. According to an embodiment of the invention, the similarity of the initials and the similarity of the finals are calculated separately. The calculation can use, for example, the Jaccard similarity, with the weights of the initial similarity and the final similarity both set to 0.5.
In the art, the Jaccard similarity is calculated as follows:
$$J(A, B) = \frac{|A \cap B|}{|A \cup B|}$$
where the numerator is the number of elements in the intersection of sets A and B, that is, the number of elements A and B have in common, and the denominator is the number of elements in the union of A and B.
In an embodiment of the invention, the pinyin "hu" and "fu" split into "initial: h, final: u" and "initial: f, final: u" respectively. The intersection of the initials has 0 elements and their union [h, f] has 2, so the initial similarity is 0. The intersection of the finals is [u], with 1 element, and their union is [u], with 1 element, so the final similarity is 1. The total similarity is 0.5*0+0.5*1=0.5, where the first "0.5" on the left of the equation is the weight of the initial, the second "0.5" on the left is the weight of the final, and the "0.5" on the right is the total similarity between the two pinyin. In some embodiments of the invention, the weights of the initial and the final are adjustable.
As another example, the pinyin "chuang" and "shuang" split into "initial: ch, finals: u, ang" and "initial: sh, finals: u, ang" respectively. The intersection of the initials has 0 elements and their union [ch, sh] has 2, so the initial similarity is 0. The intersection of the finals [u, ang] has 2 elements and their union [u, ang] has 2, so the final similarity is 1. The total similarity is 0.5*0+0.5*1=0.5.
As a further example, the pinyin "chang" and "chuang" split into "initial: ch, final: ang" and "initial: ch, finals: u, ang" respectively. The intersection of the initials is [ch], with 1 element, and their union is [ch], with 1 element, so the initial similarity is 1. The intersection of the finals is [ang], with 1 element, and their union is [u, ang], with 2 elements, so the final similarity is 0.5. The total similarity is 0.5*1+0.5*0.5=0.75.
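The three worked examples can be reproduced with a short sketch. It assumes the syllables have already been split into an initial and a list of finals; the names `jaccard` and `pinyin_similarity` are hypothetical, and returning 1.0 when both sets are empty is an assumption, since the text does not cover that case.

```python
from typing import Iterable, List, Tuple

Split = Tuple[str, List[str]]  # (initial, finals) for one syllable


def jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two collections, as sets."""
    set_a, set_b = set(a), set(b)
    union = set_a | set_b
    # Assumption: two empty sets are treated as identical.
    return len(set_a & set_b) / len(union) if union else 1.0


def pinyin_similarity(p: Split, q: Split,
                      w_initial: float = 0.5, w_final: float = 0.5) -> float:
    """Weighted sum of the initial similarity and the final similarity."""
    (initial_p, finals_p), (initial_q, finals_q) = p, q
    return (w_initial * jaccard([initial_p], [initial_q])
            + w_final * jaccard(finals_p, finals_q))


hu, fu = ("h", ["u"]), ("f", ["u"])
chuang, shuang = ("ch", ["u", "ang"]), ("sh", ["u", "ang"])
chang = ("ch", ["ang"])

print(pinyin_similarity(hu, fu))          # 0.5
print(pinyin_similarity(chuang, shuang))  # 0.5
print(pinyin_similarity(chang, chuang))   # 0.75
```

The weights are ordinary parameters here, which reflects the statement that the initial and final weights are adjustable in some embodiments.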
For Chinese characters, the similarity between two characters can also be judged from their glyph shapes. An illustrative method comprises the following steps:
(1) Prepare the character-splitting library
A character-splitting library can be prepared in advance. It stores every character together with the glyph structures into which the character can be split. Fig. 1 schematically shows the glyph structures for some of the characters in the library. For example, the character "paulownia" is followed by one bracket, indicating that it has only one way of being split in the library; the character "mulberry" is followed by two brackets, indicating that it has two ways of being split.
(2) Split the characters using the character-splitting library
In an embodiment of the invention, all the characters in the call information and the specification information are split using the character-splitting library. If a character can be split in more than one way, the result of each way is kept. Splitting can be applied repeatedly until the results cannot be split any further (that is, the split is complete). For example, the character "refined" (彬) is first split into "woods, San" (林, 彡); "woods" (林) can be further split into "wood, wood" (木, 木), so the final result for "refined" is "wood, wood, San" (木, 木, 彡).
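Repeated splitting with several alternatives per character can be sketched as follows. The table `SPLIT_TABLE` is a tiny hypothetical stand-in for the character-splitting library, keyed by the English glosses used in the text; the function name `decompose` is also an assumption, and the sketch assumes the table contains no cycles.

```python
from itertools import product
from typing import Dict, List

# Hypothetical excerpt of a character-splitting library.
# Each entry lists every known way of splitting the character.
SPLIT_TABLE: Dict[str, List[List[str]]] = {
    "refined": [["woods", "San"]],
    "woods": [["wood", "wood"]],
    "China fir": [["wood", "San"]],
}


def decompose(glyph: str) -> List[List[str]]:
    """Return every fully split component list for `glyph`."""
    if glyph not in SPLIT_TABLE:
        return [[glyph]]  # cannot be split any further
    results = []
    for parts in SPLIT_TABLE[glyph]:
        # Split each part completely, then combine one alternative per part.
        for combo in product(*(decompose(part) for part in parts)):
            results.append([c for piece in combo for c in piece])
    return results


print(decompose("refined"))    # [['wood', 'wood', 'San']]
print(decompose("China fir"))  # [['wood', 'San']]
```

Each inner list is one complete split, so a character with two ways of splitting yields two inner lists.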
(3) Calculate the glyph similarity
The glyph similarity between two characters is calculated from the results of splitting their glyphs. For example, "China fir" (杉) splits into "wood, San" (木, 彡), and "refined" (彬) splits into "wood, wood, San" (木, 木, 彡). The glyph similarity between these two characters can again be calculated with the Jaccard method, counting repeated components separately. The intersection of "China fir" and "refined" is "wood, San", with 2 elements, and their union is "wood, wood, San", with 3 elements. The similarity is therefore 2/3=0.667.
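The 2/3 result only comes out if repeated components are counted, so the sketch below uses a multiset (bag) version of the Jaccard similarity built on `collections.Counter`. Reading the example this way is an interpretation of the text, and the function name `multiset_jaccard` is hypothetical.

```python
from collections import Counter
from typing import Iterable


def multiset_jaccard(a: Iterable[str], b: Iterable[str]) -> float:
    """Jaccard similarity over multisets: repeated components count separately."""
    count_a, count_b = Counter(a), Counter(b)
    intersection = sum((count_a & count_b).values())  # per-item minimum counts
    union = sum((count_a | count_b).values())         # per-item maximum counts
    # Assumption: two empty component lists are treated as identical.
    return intersection / union if union else 1.0


china_fir = ["wood", "San"]
refined = ["wood", "wood", "San"]
print(round(multiset_jaccard(china_fir, refined), 3))  # 0.667
```

With plain sets the two characters would look identical (both reduce to wood and San), which is why the multiset form is used here.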
In an embodiment of the invention, either the pinyin-based or the glyph-based method of calculating the similarity between two characters can be chosen to carry out the method of generating a standard corpus. Alternatively, when one of the similarity methods does not work well (for example, when the pinyin-based similarity fails to find the specification information in the call information), the other method (for example, the glyph-based similarity) can be used to generate the standard corpus.
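The fallback between similarity methods can be sketched as follows. The function name `find_with_fallback` and the two stub finders are hypothetical placeholders; real finders would run the search with a pinyin-based or glyph-based similarity.

```python
from typing import Callable, Optional, Sequence, Tuple

Finder = Callable[[str, str], Optional[int]]


def find_with_fallback(sentence: str, spec: str,
                       finders: Sequence[Tuple[str, Finder]]
                       ) -> Optional[Tuple[str, int]]:
    """Try each finder in order; return (name, start) for the first success."""
    for name, finder in finders:
        start = finder(sentence, spec)
        if start is not None:
            return name, start
    return None


def pinyin_stub(sentence: str, spec: str) -> Optional[int]:
    """Placeholder for a pinyin-based search that finds nothing."""
    return None


def glyph_stub(sentence: str, spec: str) -> Optional[int]:
    """Placeholder for a glyph-based search; here just an exact substring search."""
    index = sentence.find(spec)
    return index if index >= 0 else None


print(find_with_fallback("xxabc", "abc",
                         [("pinyin", pinyin_stub), ("glyph", glyph_stub)]))
# ('glyph', 2)
```

Returning the name of the method that succeeded makes it easy to record which similarity produced each annotated sentence.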
The embodiments of the invention make use of a computer. The "computer" may be a single computer with a single central processing unit (CPU), a single computer with multiple CPUs, or a cluster formed of multiple computers. The computer may be a standalone device or an embedded device. The methods and devices of the embodiments may be implemented as pure software modules that run on a computer (for example, a software program written in Java), as pure hardware modules of a computer as required (for example, a dedicated ASIC chip or FPGA chip), or as modules of a computer that combine software and hardware (for example, a firmware system storing fixed code).
Another aspect of the present invention is a computer-readable medium storing computer-readable instructions that, when executed, carry out the methods of the embodiments of the invention.
Those of ordinary skill in the art will appreciate that the foregoing are only exemplary embodiments of the invention and are not intended to limit it. The invention may also include various modifications and variations. Any modification or variation made within the spirit and scope of the invention falls within the scope of protection of the invention.

Claims (18)

1. A method of searching a sentence for information corresponding to specification information, comprising:
(1) using a computer to set the position of the first character of the sentence as the starting position;
(2) using the computer to determine, according to a predetermined rule and a rule for calculating the similarity between characters, whether information corresponding to the specification information exists in the sentence beginning at the starting position; and
(3) using the computer, if it is determined that information corresponding to the specification information exists in the sentence from the starting position, to terminate the search operation, and otherwise to advance the starting position by one character in the sentence and then perform step (2).
2. according to the method described in claim 1, wherein the pre-defined rule is:
If residue length of the sentence since the initial position is greater than or equal to the length of the specification information, and Each text and the specification information each text from the beginning of the sentence since the initial position it is identical or Have or more than scheduled similarity, it is determined that the sentence exists corresponding with the specification information from the initial position Information.
3. The method according to claim 1, wherein the pre-defined rule is:
If the remaining length of the sentence starting from the initial position is greater than or equal to the length of the specification information, and each character of the sentence starting from the initial position is identical to, or has a similarity reaching or exceeding a predetermined similarity with, the corresponding character of the specification information counted from its beginning and within a predetermined proportion of its total length, it is determined that information corresponding to the specification information exists in the sentence starting from the initial position.
4. The method according to claim 1, wherein the pre-defined rule is:
If the remaining length of the sentence starting from the initial position is greater than or equal to the length of the specification information, and, when each character of the sentence starting from the initial position is compared with the corresponding character of the specification information counted from its beginning, the number of consecutive characters whose similarity is below a predetermined similarity is less than a predetermined quantity, it is determined that information corresponding to the specification information exists in the sentence starting from the initial position.
5. The method according to claim 1, wherein the pre-defined rule is:
If the remaining length of the sentence starting from the initial position is less than the length of the specification information, it is determined that no information corresponding to the specification information exists in the sentence starting from the initial position.
6. The method according to claim 1, wherein the rule for calculating the similarity between characters uses pinyin (the Chinese phonetic alphabet) to calculate the similarity between the characters.
7. The method according to claim 6, wherein the rule for calculating the similarity between characters comprises the following steps for calculating the similarity between a first character and a second character:
(a) using a computer, converting the first character and the second character into pinyin respectively;
(b) using the computer, splitting each pinyin into an initial and a final; and
(c) using the computer, calculating the similarity between the initials and the similarity between the finals of the first character and the second character, and calculating the similarity between the first character and the second character according to the respective weight values of the initial and the final.
8. The method according to claim 7, wherein calculating the similarity between the initials or between the finals of the first character and the second character is performed according to the following formula (not reproduced in this text):
wherein A and B respectively denote the sets of initials or finals in the pinyin of the first character and of the second character.
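The weighted combination in step (c) can be illustrated as follows. Because the formula of claim 8 is not reproduced in this text, the sketch assumes a Jaccard index over letter sets, which is only one plausible set-based measure; the function names and the equal weights of 0.5 are likewise hypothetical.

```python
from typing import Tuple


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B| (assumed here; the source formula is missing)."""
    if not a and not b:
        return 1.0          # two empty parts (e.g. two zero-initial syllables) count as identical
    return len(a & b) / len(a | b)


def pinyin_similarity(p1: Tuple[str, str], p2: Tuple[str, str],
                      w_initial: float = 0.5, w_final: float = 0.5) -> float:
    """p1 and p2 are (initial, final) pairs that have already been split."""
    (i1, f1), (i2, f2) = p1, p2
    return (w_initial * jaccard(set(i1), set(i2))
            + w_final * jaccard(set(f1), set(f2)))


print(pinyin_similarity(("d", "iao"), ("t", "iao")))
```

With equal weights, two syllables that differ only in the initial score 0.5; raising `w_final` would make such pairs count as more similar.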
9. The method according to claim 7, wherein the operation of splitting the pinyin into an initial and a final comprises:
Using an initials table, performing a forward split starting from the first letter of the pinyin to obtain the initial; and
Using a finals table, performing a backward split starting from the last letter of the pinyin to obtain the final.
10. The method according to claim 9, wherein the operation of splitting the pinyin into an initial and a final further comprises:
Deleting from the pinyin the initial obtained by the forward splitting step and the final obtained by the backward splitting step; and
If the remaining content of the pinyin after the deletion is empty, determining that the splitting of the pinyin is complete; otherwise determining whether the remaining part belongs to a final in the finals table, and if so, taking the remaining part as a left final.
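The forward split, backward split, and remainder check of claims 9 and 10 might look like the sketch below. The tables are assumptions: `INITIALS` and `FINALS` follow the commonly taught pinyin initials and basic finals (with y and w treated as initials), and the backward split is taken to mean "longest suffix found in the finals table". The source does not list its tables, so treat all of this as illustrative.

```python
from typing import Tuple

# Two-letter initials come first so that "zh" is tried before "z".
INITIALS = ("zh", "ch", "sh", "b", "p", "m", "f", "d", "t", "n", "l", "g", "k",
            "h", "j", "q", "x", "r", "z", "c", "s", "y", "w")
FINALS = {"a", "o", "e", "i", "u", "ü", "ai", "ei", "ui", "ao", "ou", "iu",
          "ie", "üe", "er", "an", "en", "in", "un", "ün",
          "ang", "eng", "ing", "ong"}


def split_pinyin(pinyin: str) -> Tuple[str, str, str]:
    """Return (initial, left_final, final) for one toneless pinyin syllable."""
    # Forward split: match an initial from the first letter.
    initial = next((i for i in INITIALS if pinyin.startswith(i)), "")
    rest = pinyin[len(initial):]
    # Backward split: longest suffix of the rest that is a known final.
    final = ""
    for k in range(len(rest)):
        if rest[k:] in FINALS:
            final = rest[k:]
            break
    # Delete the initial and the final; inspect what remains.
    remainder = rest[:len(rest) - len(final)]
    if remainder and remainder not in FINALS:
        raise ValueError(f"cannot split pinyin: {pinyin!r}")
    return initial, remainder, final


print(split_pinyin("xiang"))
```

For "xiang", the forward split yields "x", the backward split yields "ang", and the remaining "i" is itself in the finals table, so it is kept as the left final.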
11. The method according to claim 1, wherein the rule for calculating the similarity between characters uses glyph shape to calculate the similarity between a first character and a second character.
12. The method according to claim 11, wherein the similarity between the first character and the second character is calculated according to the following steps:
(a) preparing a character decomposition library;
(b) based on the character decomposition library, fully decomposing the first character and the second character to obtain a first plurality of glyph components and a second plurality of glyph components corresponding to the first character and the second character respectively; and
(c) calculating the similarity between the first plurality of glyph components and the second plurality of glyph components as the similarity between the first character and the second character.
13. The method according to claim 12, wherein the step of calculating the similarity between the first plurality of glyph components and the second plurality of glyph components is performed according to the following formula (not reproduced in this text):
wherein A and B respectively denote the sets corresponding to the first plurality of glyph components and the second plurality of glyph components.
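Steps (a) to (c) of claim 12 can be sketched with a tiny hand-written decomposition table. `DECOMP` is a hypothetical stand-in for the character decomposition library, and the Jaccard index is again an assumption, since the formula of claim 13 is not reproduced in this text.

```python
from typing import Dict, List, Set

# Hypothetical mini decomposition library: character -> immediate components.
DECOMP: Dict[str, List[str]] = {
    "好": ["女", "子"],
    "妈": ["女", "马"],
    "吗": ["口", "马"],
    "明": ["日", "月"],
    "萌": ["艹", "明"],
}


def decompose(ch: str) -> Set[str]:
    """Fully decompose a character: recurse until no component is in the library."""
    parts = DECOMP.get(ch)
    if parts is None:
        return {ch}                 # atomic component, cannot be split further
    result: Set[str] = set()
    for part in parts:
        result |= decompose(part)
    return result


def glyph_similarity(a: str, b: str) -> float:
    # Assumed measure: Jaccard index over the two component sets.
    A, B = decompose(a), decompose(b)
    return len(A & B) / len(A | B)


print(sorted(decompose("萌")), glyph_similarity("妈", "吗"))
```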
14. The method according to claim 1, further comprising:
Replacing non-text content in the sentence with a special symbol,
wherein the pre-defined rule includes: if the special symbol is encountered in the sentence while each character of the sentence starting from the initial position is being compared one by one with each character of the specification information counted from its beginning, it is determined that no information corresponding to the specification information exists in the sentence starting from the initial position.
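A minimal sketch of the masking step and the resulting early rejection. The choice of `#` as the special symbol, the use of `str.isalnum()` to decide what counts as text, and both function names are assumptions made for illustration.

```python
SPECIAL = "#"   # hypothetical special symbol


def mask_non_text(sentence: str) -> str:
    """Replace every character that is not a letter, digit, or CJK character."""
    return "".join(ch if ch.isalnum() else SPECIAL for ch in sentence)


def window_hits_special(masked: str, spec_len: int, start: int) -> bool:
    """True if the comparison window starting at `start` contains the special symbol."""
    return SPECIAL in masked[start:start + spec_len]


masked = mask_non_text("开空，调!")
print(masked, window_hits_special(masked, 2, 1))
```

A rule function would return "no match" as soon as `window_hits_special` is true, so that a match never spans punctuation.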
15. The method according to claim 1, further comprising:
(4) using the computer, if it is determined in step (3) that information corresponding to the specification information exists in the sentence, marking the corresponding information in the input sentence to generate a standard corpus.
16. The method according to claim 1, further comprising:
(4) using the computer, if it is determined in step (3) that information corresponding to the specification information exists in the sentence, replacing the corresponding information in the input sentence with the specification information.
17. A system for searching a sentence for information corresponding to specification information, comprising:
(1) a device for setting, using a computer, the position of the first character of the sentence as an initial position;
(2) a device for determining, using the computer, according to a pre-defined rule and a rule for calculating the similarity between characters, whether information corresponding to the specification information exists in the sentence starting from the initial position; and
(3) a device for, using the computer, ending the search operation if it is determined that information corresponding to the specification information exists in the sentence starting from the initial position, and otherwise moving the initial position one character later in the sentence and then executing step (2) again.
18. A computer-readable medium storing computer-executable instructions which, when executed, perform the method according to any one of claims 1-16.
CN201910768046.0A 2019-08-20 2019-08-20 Method and system for quickly generating standard corpus Active CN110489754B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910768046.0A CN110489754B (en) 2019-08-20 2019-08-20 Method and system for quickly generating standard corpus

Publications (2)

Publication Number Publication Date
CN110489754A true CN110489754A (en) 2019-11-22
CN110489754B CN110489754B (en) 2023-01-03

Family

ID=68552177

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910768046.0A Active CN110489754B (en) 2019-08-20 2019-08-20 Method and system for quickly generating standard corpus

Country Status (1)

Country Link
CN (1) CN110489754B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07121379A (en) * 1993-10-28 1995-05-12 Nec Software Ltd Plural languages mixing compiler
CN1558318A (en) * 2004-02-10 2004-12-29 姜金霞 Digital Chinese input method
WO2013178002A1 (en) * 2012-05-29 2013-12-05 中国移动通信集团公司 Voice recognition and matching method and device, and computer program and storage medium
CN109785842A (en) * 2017-11-14 2019-05-21 蔚来汽车有限公司 Speech recognition error correction method and speech recognition error correction system

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH07121379A (en) * 1993-10-28 1995-05-12 Nec Software Ltd Plural languages mixing compiler
CN1558318A (en) * 2004-02-10 2004-12-29 姜金霞 Digital Chinese input method
WO2013178002A1 (en) * 2012-05-29 2013-12-05 中国移动通信集团公司 Voice recognition and matching method and device, and computer program and storage medium
CN103456297A (en) * 2012-05-29 2013-12-18 中国移动通信集团公司 Method and device for matching based on voice recognition
CN109785842A (en) * 2017-11-14 2019-05-21 蔚来汽车有限公司 Speech recognition error correction method and speech recognition error correction system

Also Published As

Publication number Publication date
CN110489754B (en) 2023-01-03

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Method and System for Quickly Generating Standard Corpus

Granted publication date: 20230103

Pledgee: Wuhan Yuyun Shoudao Technology Co.,Ltd.

Pledgor: HANGZHOU DTWAVE TECHNOLOGY Co.,Ltd.

Registration number: Y2024980004293
