CN101246473B - Segmentation system evaluating method and segmentation evaluating system - Google Patents

Info

Publication number
CN101246473B
Authority
CN
China
Legal status
Active
Application number
CN2008100898349A
Other languages
Chinese (zh)
Other versions
CN101246473A (en)
Inventor
张耀杰
邵荣防
Current Assignee
Shenzhen Tencent Computer Systems Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Application filed by Tencent Technology Shenzhen Co Ltd
Priority to CN2008100898349A
Publication of CN101246473A
Application granted
Publication of CN101246473B

Abstract

Disclosed is a method for evaluating a word segmentation system, where the word segmentation systems involved comprise a benchmark word segmentation system and a target word segmentation system. The method comprises: segmenting a test corpus with the benchmark word segmentation system and with the target word segmentation system, respectively, to obtain a benchmark segmentation result and a target segmentation result; and generating an evaluation parameter from the two results in order to evaluate the target word segmentation system. The invention improves the efficiency of evaluating a word segmentation system while saving cost.

Description

Evaluation method for a word segmentation system and word segmentation evaluation system
Technical field
The present invention relates to the field of natural language processing, and in particular to an evaluation method for a word segmentation system and to a word segmentation evaluation system.
Background art
Word segmentation belongs to the field of natural language processing. Given a sentence, a person can tell from his or her own knowledge which character sequences are words and which are not; how to make a computer do the same is the problem that word segmentation sets out to solve. For example, English is written word by word, with spaces separating the words: for the English sentence "I am a student", a computer can easily tell from the spaces that "student" is a word. Chinese, by contrast, is written character by character, and the characters of a sentence must be read together to convey a meaning: for the Chinese sentence "我是一个学生" ("I am a student"), a computer cannot easily tell that the two characters "学" and "生" together form a single word. Cutting a Chinese character sequence into meaningful words is Chinese word segmentation. For "我是一个学生", for example, the segmentation result can be: 我 是 一个 学生.
Existing word segmentation algorithms fall into three broad categories: methods based on string matching, methods based on understanding, and methods based on statistics. No mature word segmentation system relies on a single algorithm; each combines several. Chinese word segmentation underlies other Chinese information processing: search engines, machine translation, speech synthesis, automatic classification, automatic summarization, automatic proofreading and similar technologies all depend on it. A corresponding evaluation mechanism is therefore needed to assess how reasonable the output of a word segmentation system is.
In the prior art, a word segmentation system is usually evaluated by matching its segmentation results against a base corpus (for example, the Peking University corpus) and judging the reasonableness of the segmentation from the matching result. In practice, this matching-based evaluation requires a large amount of manual work and processes a large volume of data, so evaluation efficiency is low and cost is high.
Therefore, a technical problem that those skilled in the art urgently need to solve is how to improve the efficiency of evaluating a word segmentation system while saving as much cost as possible.
Summary of the invention
The technical problem to be solved by the present invention is to provide an evaluation method for a word segmentation system that improves the efficiency of evaluating the word segmentation system while saving cost.
Another object of the present invention is to provide a word segmentation evaluation system that ensures the above method can be implemented and applied in practice.
To solve the above technical problem, an embodiment of the invention discloses an evaluation method for a word segmentation system, comprising:
Segmenting a test corpus multiple times with the target word segmentation system to obtain a plurality of target segmentation results, and segmenting the test corpus with the benchmark word segmentation system to obtain a benchmark segmentation result;
Comparing the plurality of target segmentation results with one another (vertical comparison) to obtain a first evaluation parameter for assessing the stability of the target word segmentation system; and comparing the target segmentation result with the benchmark segmentation result (horizontal comparison) to obtain a second evaluation parameter for assessing the accuracy of the target word segmentation system.
Preferably, the method further comprises:
Recording the test corpus whose evaluation parameter meets a preset condition.
Preferably, the method further comprises:
Selecting certain test corpus entries from the test corpus that does not meet the preset condition, and generating a corpus from them.
Preferably, the evaluation step comprises:
Calculating the first evaluation parameter from the plurality of target segmentation results of the test corpus; and matching the benchmark segmentation result against the target segmentation result to obtain the second evaluation parameter;
Evaluating the target word segmentation system according to the first evaluation parameter and the second evaluation parameter.
Preferably, the segmentation step comprises:
Obtaining characteristic information of the test corpus;
Segmenting the test corpus with the benchmark word segmentation system and with the target word segmentation system, respectively, according to the characteristic information, to obtain the benchmark segmentation result and the target segmentation result.
Preferably, the segmentation step further comprises:
Saving test corpus entries that have the same characteristic information in the same file.
Preferably, the test corpus whose evaluation parameter meets the preset condition is unreasonable corpus, and the recording is recording in a file or a database.
Preferably, the benchmark word segmentation system is the Hailiang (海量) intelligent word segmentation system.
An embodiment of the invention also discloses a word segmentation evaluation system. The evaluation system is used to evaluate word segmentation systems, which comprise a benchmark word segmentation system and a target word segmentation system, and the evaluation system comprises:
A result acquisition module, configured to segment a test corpus multiple times with the target word segmentation system to obtain a plurality of target segmentation results, and to segment the test corpus with the benchmark word segmentation system to obtain a benchmark segmentation result;
A parameter evaluation module, configured to compare the plurality of target segmentation results with one another (vertical comparison) to obtain a first evaluation parameter for assessing the stability of the target word segmentation system, and to compare the target segmentation result with the benchmark segmentation result (horizontal comparison) to obtain a second evaluation parameter for assessing the accuracy of the target word segmentation system.
Preferably, the evaluation system further comprises:
A recording module, configured to record the test corpus whose evaluation parameter meets a preset condition.
Preferably, the evaluation system further comprises:
A custom corpus generation module, configured to select certain test corpus entries from the test corpus that does not meet the preset condition and to generate a corpus from them.
Preferably, the parameter evaluation module comprises:
A parameter calculation submodule, configured to calculate the first evaluation parameter from the plurality of target segmentation results of the test corpus, and to match the benchmark segmentation result against the target segmentation result to obtain the second evaluation parameter;
An evaluation submodule, configured to evaluate the target word segmentation system according to the first evaluation parameter and the second evaluation parameter.
Preferably, the result acquisition module comprises:
A feature extraction submodule, configured to obtain the characteristic information of the test corpus;
A segmentation submodule, configured to obtain the benchmark segmentation result and the target segmentation result produced when the benchmark word segmentation system and the target word segmentation system each segment the test corpus according to the characteristic information.
Preferably, the result acquisition module further comprises:
A saving submodule, configured to save test corpus entries that have the same characteristic information in the same file.
Compared with the prior art, the present invention has the following advantages:
The present invention assesses the reasonableness of a word segmentation system through evaluation parameters, specifically through two indicators. First, the target segmentation results obtained by segmenting the test corpus multiple times with the target word segmentation system are compared with one another to determine how consistent they are, which yields the stability evaluation parameter of the word segmentation system. Second, the target segmentation result for the test corpus is compared with the benchmark segmentation result obtained with the benchmark word segmentation system to determine how well the two match, which yields the accuracy evaluation parameter of the word segmentation system. Evaluating the target word segmentation system on the basis of these evaluation parameters greatly simplifies the prior-art procedure, in which every segmentation result of the target word segmentation system has to be compared against a base corpus; it improves the efficiency of evaluating the word segmentation system and effectively saves the cost of manual evaluation.
Description of drawings
Fig. 1 is a flowchart of embodiment 1 of the evaluation method for a word segmentation system of the present invention;
Fig. 2 is a flowchart of embodiment 2 of the evaluation method for a word segmentation system of the present invention;
Fig. 3 is a structural block diagram of an embodiment of the word segmentation evaluation system of the present invention.
Detailed description of the embodiments
To make the above objects, features and advantages of the present invention clearer and easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.
The present invention can be described in the general context of computer-executable instructions executed by a computer, such as program modules. Generally, program modules include routines, programs, objects, components, data structures and the like that perform particular tasks or implement particular abstract data types. The invention can also be practiced in distributed computing environments, in which tasks are performed by remote processing devices connected through a communication network. In a distributed computing environment, program modules can be located in both local and remote computer storage media, including storage devices.
One of the core ideas of the embodiments of the invention is to assess the reasonableness of a word segmentation system through evaluation parameters, specifically through two indicators. First, the target segmentation results obtained by segmenting the test corpus multiple times with the target word segmentation system are compared with one another to determine how consistent they are, which yields the stability evaluation parameter of the word segmentation system. Second, the target segmentation result for the test corpus is compared with the benchmark segmentation result obtained with the benchmark word segmentation system to determine how well the two match, which yields the accuracy evaluation parameter of the word segmentation system. Evaluating the target word segmentation system on the basis of these evaluation parameters greatly simplifies the prior-art procedure, in which every segmentation result of the target word segmentation system has to be compared against a base corpus; it improves the efficiency of evaluating the word segmentation system and effectively saves the cost of manual evaluation.
Referring to Fig. 1, which shows a flowchart of embodiment 1 of the evaluation method for a word segmentation system of the present invention, the method can comprise the following steps:
Step 101: segment a test corpus with the benchmark word segmentation system and with the target word segmentation system, respectively, to obtain a benchmark segmentation result and a target segmentation result;
Step 102: generate an evaluation parameter from the benchmark segmentation result and the target segmentation result, in order to evaluate the target word segmentation system.
Segmentation methods commonly used in word segmentation systems include:
1. Segmentation based on string matching: the Chinese character string to be analyzed is matched, according to a certain strategy, against the entries of a preset machine dictionary; if a string is found in the dictionary, the match succeeds (a word is identified). Word segmentation systems in actual use treat this mechanical segmentation only as an initial segmentation step, and further improve segmentation accuracy by using various other kinds of linguistic information.
2. Segmentation based on feature scanning or marker segmentation: words with obvious features are identified and cut out of the string to be analyzed first; with these words as breakpoints, the original string is divided into shorter strings, which are then segmented mechanically, reducing the matching error rate. Alternatively, segmentation is combined with part-of-speech tagging: the rich part-of-speech information helps the segmentation decisions, and during tagging the segmentation result is in turn checked and adjusted, improving segmentation accuracy.
3. Segmentation based on understanding: the computer simulates a person's understanding of the sentence in order to identify words. The basic idea is to perform syntactic and semantic analysis during segmentation and to use the syntactic and semantic information to resolve ambiguity. Such a system usually comprises three parts: a segmentation subsystem, a syntactic-semantic subsystem, and a master control part. Coordinated by the master control part, the segmentation subsystem obtains syntactic and semantic information about the relevant words and sentences to resolve segmentation ambiguity; in other words, it simulates the process by which a person understands a sentence. This method requires a large amount of linguistic knowledge and information.
4. Segmentation based on statistics: in Chinese text, the frequency or probability with which characters co-occur adjacently reflects fairly well how likely they are to form a word. The frequencies of adjacent character combinations in a corpus can therefore be counted to compute their mutual information, including the adjacent co-occurrence probability of two Chinese characters X and Y. Mutual information reflects how tightly the characters are bound together. When this tightness exceeds a certain threshold, the character group can be considered a possible word. This method only needs frequency counts of character groups in the corpus and does not need a segmentation dictionary.
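The statistical method in item 4 can be illustrated with a small sketch in C. It is only an illustration under stated assumptions: the function names `cooccurrence_ratio` and `is_word_candidate`, the count parameters and the threshold are hypothetical and do not come from the patent. The sketch computes the ratio P(XY) / (P(X) * P(Y)), whose logarithm is the pointwise mutual information of the two characters, and treats the pair as a possible word when the ratio exceeds a threshold.

```c
/* Ratio P(XY) / (P(X) * P(Y)) estimated from raw corpus counts.
 * (Hypothetical helper for illustration; not taken from the patent.)
 *   count_xy         times character X is immediately followed by Y
 *   count_x, count_y times X and Y occur at all
 *   total            number of character positions counted
 * A ratio near 1 means X and Y co-occur about as often as chance predicts;
 * a large ratio means they are tightly bound. Returns 0.0 on empty counts. */
double cooccurrence_ratio(long count_xy, long count_x, long count_y, long total)
{
    if (count_xy <= 0 || count_x <= 0 || count_y <= 0 || total <= 0)
        return 0.0;
    return ((double)count_xy * (double)total) /
           ((double)count_x * (double)count_y);
}

/* Returns 1 when the pair is bound tightly enough to be a word candidate,
 * 0 otherwise. The threshold is an assumed tuning parameter. */
int is_word_candidate(long count_xy, long count_x, long count_y, long total,
                      double threshold)
{
    return cooccurrence_ratio(count_xy, count_x, count_y, total) > threshold;
}
```

Working with the ratio rather than its logarithm keeps the sketch free of the math library; a threshold on the ratio is equivalent to a threshold on the mutual information itself.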
A word segmentation system usually has to combine several segmentation methods to complete segmentation. Take the Hailiang (海量) intelligent word segmentation system, which is currently relatively mature, as an example: it adopts "compound segmentation". "Compound" here is analogous to the compound prescription of traditional Chinese medicine, in which different medicines are combined to treat a disease; likewise, recognizing Chinese words requires several algorithms to handle different problems. To obtain a better evaluation effect, in this embodiment the evaluation of the target word segmentation system can be based on a relatively mature and complete benchmark word segmentation system, such as the Hailiang intelligent word segmentation system just described.
To improve the evaluation effect, the test corpus can be a set of corpus entries that share certain characteristic information. In this case, segmentation step 101 of this embodiment can comprise the following sub-steps:
Sub-step A1: obtain the characteristic information of the test corpus;
Sub-step A2: the benchmark word segmentation system and the target word segmentation system each segment the test corpus according to the characteristic information, obtaining the benchmark segmentation result and the target segmentation result.
In practice, the characteristic information can be obtained from special marks applied to the corpus according to its features, such as part-of-speech annotations or part-of-speech marks.
For example, the test corpus is:
{ you } are { I } [walking] and then.
Xiaohong will [help] {me} take the exam tomorrow.
{ we } [writing] (distich) is not [dress] (delicately), but [] (knowledge).
(Ginseng), {this} kind of (plant), is very <delicate>.
In the above test corpus, nouns are marked with (), verbs with [], adjectives with <>, and pronouns with {}; during testing, the part of speech of each marked part can be analyzed directly from its mark.
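The marking scheme above can be processed mechanically. The following C sketch is an illustration only: the `Marker` table, `strip_markers` and `count_tagged` are hypothetical names, and the one-letter tags ("n", "v", "a", "r") are an assumed encoding. It shows how a marked line could be turned back into raw text for the segmenters, and how the marked spans of one part of speech could be counted for later comparison.

```c
#include <stddef.h>
#include <string.h>

/* One bracket pair and the part of speech it denotes (assumed encoding). */
struct Marker { char open; char close; const char *pos; };

static const struct Marker MARKERS[] = {
    {'(', ')', "n"},  /* noun      */
    {'[', ']', "v"},  /* verb      */
    {'<', '>', "a"},  /* adjective */
    {'{', '}', "r"},  /* pronoun   */
};

/* Copies line into out with every marker character removed, so the raw
 * sentence can be passed to a segmenter. Returns the number of bytes
 * written, not counting the terminating '\0'. */
size_t strip_markers(const char *line, char *out, size_t out_size)
{
    size_t n = 0;
    if (out_size == 0)
        return 0;
    for (const char *p = line; *p != '\0' && n + 1 < out_size; ++p) {
        if (strchr("()[]<>{}", *p) == NULL)
            out[n++] = *p;
    }
    out[n] = '\0';
    return n;
}

/* Counts the closed, non-empty marked spans that carry the given tag. */
int count_tagged(const char *line, const char *pos)
{
    int count = 0;
    for (const char *p = line; *p != '\0'; ++p) {
        for (size_t i = 0; i < sizeof MARKERS / sizeof MARKERS[0]; ++i) {
            if (*p == MARKERS[i].open && strcmp(MARKERS[i].pos, pos) == 0) {
                const char *end = strchr(p + 1, MARKERS[i].close);
                if (end != NULL && end > p + 1) {
                    ++count;
                    p = end;  /* resume scanning after the closing mark */
                }
            }
        }
    }
    return count;
}
```

Because all marker characters are single-byte ASCII, the same byte-wise scan also works on UTF-8 encoded Chinese text.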
As another embodiment, segmentation step 101 of this embodiment can comprise the following sub-steps:
Sub-step B1: obtain the characteristic information of the test corpus;
Sub-step B2: save test corpus entries that have the same characteristic information in the same file;
Sub-step B3: the benchmark word segmentation system and the target word segmentation system each segment the test corpus according to the characteristic information, obtaining the benchmark segmentation result and the target segmentation result.
For example, test corpus entries of the same part of speech are kept in the same file, so that the word segmentation system only needs to segment the test corpus directly according to that part of speech.
Of course, those skilled in the art can use any feasible method to complete the segmentation. For example, multiple marks can be used: "development", "life" and "dinner party" in "a very harmonious development", "this life" and "the sole dinner party" can be marked as noun or verb; likewise, "give" in "give the peasants" can be marked as both preposition and verb, so that the word segmentation system segments according to the weights of the respective marks. Alternatively, corpora such as Tang poems, Song ci and Yuan qu can be broken into phrases by position: each phrase produced by the break often represents one part of speech, and the parts of speech at the same positions in lines one, two and four (and also lines five, six and eight in regulated verse) are basically identical after the break, so segmentation can follow this positional rule. The present invention places no limitation on this.
Preferably, step 102 can also adopt the relevant steps of embodiment 2. Of course, those skilled in the art can equally use any prior-art technique, according to actual needs, to generate the evaluation parameter from the benchmark segmentation result and the target segmentation result; the present invention places no limitation on this.
Referring to Fig. 2, which shows a flowchart of embodiment 2 of the evaluation method for a word segmentation system of the present invention, the method can comprise the following steps:
Step 201: segment a test corpus with the benchmark word segmentation system and with the target word segmentation system, respectively, to obtain a benchmark segmentation result and a target segmentation result;
Step 202: generate an evaluation parameter from the benchmark segmentation result and the target segmentation result, in order to evaluate the target word segmentation system;
Step 203: record the test corpus whose evaluation parameter meets a preset condition;
Step 204: select certain test corpus entries from the test corpus that does not meet the preset condition, and generate a corpus from them.
Preferably, step 202 can comprise the following sub-steps:
Sub-step C1: calculate the first evaluation parameter from the plurality of target segmentation results of the test corpus; and match the benchmark segmentation result against the target segmentation result to obtain the second evaluation parameter;
Sub-step C2: evaluate the target word segmentation system according to the first evaluation parameter and the second evaluation parameter.
The first evaluation parameter and the second evaluation parameter are two indicators for assessing the reasonableness of a word segmentation system. Specifically, the first evaluation parameter can be obtained by vertically comparing the target segmentation results obtained from repeated segmentation, to determine how consistent they are; it is the stability evaluation parameter of the word segmentation system. The second evaluation parameter can be obtained by horizontally comparing the target segmentation result with the benchmark segmentation result, to determine how well the two match; it is the accuracy evaluation parameter of the word segmentation system.
A further technical effect of this embodiment is that test corpus entries that the target word segmentation system segments unreasonably are saved to a separate file or database, giving developers a basis for locating and fixing problems or for other processing.
For example, the target segmentation result can be evaluated with the following code:
char* SegWord(const char* word, int TYPE);
/*
Function: segment the given string in the given mode
Input:
word: the string to be segmented
TYPE: the segmentation type; 0 is the extended mode of the benchmark word segmentation system, with tagging
1 is the first mode of the target word segmentation system, with tagging
2 is the second mode of the target word segmentation system, with tagging
Output:
the string segmented as requested
*/
int WordsSegCheck(const char* filename, char* Pos, char* errorFile);
/*
Function: test a test corpus file whose entries share one part of speech, output the number of correct results, and write the incorrect corpus results to another file
Input:
filename: name of the input file
Pos: part of speech of the test corpus in the input file
errorFile: name of the output file for the test corpus entries that went wrong
Output:
number of correct test corpus results
*/
As can be seen, the above code is suitable for testing corpus entries that share one part of speech, as described earlier.
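The patent gives only the declaration of WordsSegCheck, not its body. The following C sketch shows one way such a check might work, under loudly stated assumptions: the names `SegmentFn`, `toy_segment` and `words_seg_check` are hypothetical, the segmenter is passed in as a callback instead of calling SegWord, open streams replace the file names, and a result is taken to be correct when it equals the input word followed by the expected tag (for example "/n").

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical callback standing in for the real segmenter: writes the
 * tagged result for `word` into `out`, e.g. "student/n". */
typedef void (*SegmentFn)(const char *word, char *out, size_t out_size);

/* Toy segmenter for demonstration only: tags words starting with 'x' as
 * verbs and everything else as nouns. */
void toy_segment(const char *word, char *out, size_t out_size)
{
    snprintf(out, out_size, "%s%s", word, word[0] == 'x' ? "/v" : "/n");
}

/* Reads one word per line from `in`. A word counts as correct when the
 * segmenter returns it unsplit with the expected tag `pos` appended;
 * otherwise the word and the actual result are written to `errors`.
 * Returns the number of correct entries. */
int words_seg_check(FILE *in, const char *pos, FILE *errors, SegmentFn segment)
{
    char line[256], result[512], expected[512];
    int correct = 0;

    while (fgets(line, sizeof line, in) != NULL) {
        line[strcspn(line, "\r\n")] = '\0';   /* drop the line ending */
        if (line[0] == '\0')
            continue;                          /* skip blank lines */
        segment(line, result, sizeof result);
        snprintf(expected, sizeof expected, "%s%s", line, pos);
        if (strcmp(result, expected) == 0)
            ++correct;
        else
            fprintf(errors, "%s\t%s\n", line, result);
    }
    return correct;
}
```

Passing the segmenter as a callback keeps the sketch testable without the real benchmark or target systems; a real implementation would presumably call SegWord with the appropriate TYPE instead.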
typedef struct SegSign {
char chStart; // opening mark symbol, a single character, e.g. "("
char chEnd; // closing mark symbol, a single character, e.g. ")"
char* pPos; // part-of-speech tag for the enclosed part
} SegSign;
int SentenceSegCheck(const char* filename,
struct SegSign segSign[], char* errorFile);
/*
Function: test a file containing sentence corpus, output the number of correct corpus results, and write the incorrect corpus results to another file
Input:
filename: name of the input file
segSign[]: the marks (brackets) and the parts of speech they denote
Taking the earlier scheme as an example, with nouns marked with (), verbs with [], adjectives with <> and pronouns with {}, this is:
segSign[0]={'(',')',"/n"};
segSign[1]={'[',']',"/v"};
segSign[2]={'<','>',"/a"};
segSign[3]={'{','}',"/r"};
errorFile: name of the output file for the test corpus entries that went wrong
Output:
number of correct corpus results
*/
As can be seen, the above code is suitable for testing corpus entries marked with parts of speech, as described earlier.
int PoemSegCheck(const char* filename, int bit, char* errorFile);
/*
Function: test a file containing poetry corpus (shi, ci and qu), output the number of correct corpus results, and write the incorrect corpus results to another file
Input:
filename: name of the input file
bit: with line numbers taken modulo four, the words of every line except line bit must form neat antithesis
errorFile: name of the output file for corpus entries that went wrong in the test
Output:
number of correct corpus results
(a corpus entry may begin with "!"; each occurrence of "!" marks one corresponding corpus entry)
*/
In this embodiment, step 204 can further generate a corpus of a custom standard, ensuring that its entries do not duplicate those of the benchmark word segmentation system, so that evaluating the target word segmentation system and regression-testing its problems become simpler.
For simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the order of actions described, because according to the invention some steps can be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and that the actions and modules involved are not necessarily required by the invention.
Referring to Fig. 3, which shows a structural block diagram of an embodiment of the word segmentation evaluation system of the present invention: the evaluation system is used to evaluate word segmentation systems, which comprise a benchmark word segmentation system and a target word segmentation system, and the evaluation system can comprise the following modules:
A result acquisition module 301, configured to segment a test corpus with the benchmark word segmentation system and with the target word segmentation system, respectively, to obtain a benchmark segmentation result and a target segmentation result;
A parameter evaluation module 302, configured to generate an evaluation parameter from the benchmark segmentation result and the target segmentation result, in order to evaluate the target word segmentation system.
Preferably, in this embodiment, the evaluation system can further comprise:
A recording module 303, configured to record the test corpus whose evaluation parameter meets a preset condition.
Preferably, in this embodiment, the evaluation system can further comprise:
A custom corpus generation module 304, configured to select certain test corpus entries from the test corpus that does not meet the preset condition and to generate a corpus from them.
Preferably, the parameter evaluation module can comprise the following submodules:
A parameter calculation submodule, configured to calculate the first evaluation parameter from the plurality of target segmentation results of the test corpus, and to match the benchmark segmentation result against the target segmentation result to obtain the second evaluation parameter;
An evaluation submodule, configured to evaluate the target word segmentation system according to the first evaluation parameter and the second evaluation parameter.
Preferably, the result acquisition module can comprise the following submodules:
A feature extraction submodule, configured to obtain the characteristic information of the test corpus;
A segmentation submodule, configured to obtain the benchmark segmentation result and the target segmentation result produced when the benchmark word segmentation system and the target word segmentation system each segment the test corpus according to the characteristic information.
Preferably, the result acquisition module can further comprise the following submodule:
A saving submodule, configured to save test corpus entries that have the same characteristic information in the same file.
The invention is further illustrated below through a concrete application of the embodiments of the invention.
The process by which an evaluation system applying the embodiments of the invention evaluates the reasonableness of a word segmentation system can comprise:
Step S1: the corpus K of the custom standard is read into the evaluation system;
In practice, for corpus entries that do not appear in corpus K, the segmentation and tagging of the benchmark word segmentation system are assumed to be correct.
Step S2: on receiving test corpus S, the evaluation system first makes a backup S';
Step S3: the evaluation system analyzes S and selects a suitable calling method for sending it to the benchmark word segmentation system (HL) and the target word segmentation system (QS);
Step S4: the benchmark word segmentation system and the target word segmentation system each segment S; the benchmark word segmentation system returns result S(HL) and the target word segmentation system returns result S(QS);
Step S5, parameter evaluation and test module send S to the target Words partition system, obtain Si (i=0...N) as a result, contrast S (QS) carries out judgement of stability, draws stability parameter (the first evaluation and test parameter) S_stab (S), if full marks are 100 minutes, then can obtain stability parameter by following formula:
S_tab(S)=100*X/N
Wherein, X is phase Si and the identical number of S (QS).
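As a minimal sketch of the stability formula in step S5, the Python function below counts how many repeated target results equal the reference result S(QS) and scales that count to a full score of 100. The function name, its parameters, and the representation of a segmentation result as a list of words are assumptions made for illustration; the patent does not specify them. The sketch also treats N as the number of repeated results supplied.

```python
def stability_score(repeated_results, reference_result, full_marks=100.0):
    """Return full_marks * X / N for one test corpus item.

    repeated_results: list of segmentation results Si, each a list of words
                      (assumed representation, not specified in the source).
    reference_result: the result S(QS) that each Si is compared against.
    """
    n = len(repeated_results)
    if n == 0:
        raise ValueError("at least one repeated result is required")
    # X: number of repeated results identical to the reference result.
    x = sum(1 for result in repeated_results if result == reference_result)
    return full_marks * x / n


if __name__ == "__main__":
    reference = ["研究", "生命", "起源"]
    runs = [
        ["研究", "生命", "起源"],
        ["研究", "生命", "起源"],
        ["研究生", "命", "起源"],  # one unstable run
        ["研究", "生命", "起源"],
    ]
    print(stability_score(runs, reference))
```

A target system that always returns the same segmentation for the same input scores 100 under this sketch; each differing run lowers the score proportionally.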
Step S6: from QS(S) and HL(S) (the results written S(QS) and S(HL) in step S4), the parameter evaluation module automatically counts the number Nsame of results that are identical in QS(S) and HL(S); the accuracy parameter (the second evaluation parameter) can then be obtained by the following formula:
S_vera(S)=100*Nsame/Nall
Wherein, Nall is max{HL(S), QS(S)}, that is, the larger of the numbers of results in HL(S) and QS(S).
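The accuracy formula in step S6 can be sketched as follows. The patent does not say how "identical results" are matched, so this example makes an explicit assumption: a word counts as the same in both segmentations only if it covers the same character span of the input. The function names and the list-of-words representation are likewise hypothetical.

```python
def word_spans(words):
    """Map a segmentation (list of words) to a set of (start, end, word) spans."""
    spans = set()
    start = 0
    for word in words:
        end = start + len(word)
        spans.add((start, end, word))
        start = end
    return spans


def accuracy_score(target_words, benchmark_words, full_marks=100.0):
    """Return full_marks * Nsame / Nall for one test corpus item.

    Nsame: words occupying identical spans in both segmentations
           (span matching is an assumption of this sketch).
    Nall:  the larger of the two word counts.
    """
    n_all = max(len(target_words), len(benchmark_words))
    if n_all == 0:
        raise ValueError("both segmentations are empty")
    n_same = len(word_spans(target_words) & word_spans(benchmark_words))
    return full_marks * n_same / n_all


if __name__ == "__main__":
    benchmark = ["研究", "生命", "起源"]
    target = ["研究生", "命", "起源"]
    print(accuracy_score(target, benchmark))
```

Matching by span rather than by word string alone avoids counting a word as correct when it appears at a different position in the sentence.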
Step S7: the stability parameter and the accuracy parameter above are combined by weighted summation to obtain the total evaluation parameter, which can be expressed by the following formula:
sum(S)=k1*S_stab(S)+(1-k1)*S_vera(S)
Wherein, k1 is a coefficient between 0 and 1; in practice, k1 can be set to as small a number as possible, such as 0.01.
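The weighted sum of step S7 is simple enough to check by hand; the short Python sketch below mirrors the formula directly. The function and parameter names are illustrative assumptions, and the default k1=0.01 is the example value given in the text.

```python
def total_score(s_stab, s_vera, k1=0.01):
    """Return k1 * S_stab + (1 - k1) * S_vera for one test corpus item."""
    if not 0.0 <= k1 <= 1.0:
        raise ValueError("k1 must lie between 0 and 1")
    return k1 * s_stab + (1.0 - k1) * s_vera


if __name__ == "__main__":
    # With a small k1 the total is dominated by the accuracy parameter.
    print(total_score(s_stab=100.0, s_vera=50.0, k1=0.01))
```

With k1 close to 0, a perfectly stable but inaccurate target system still receives a low total, which is consistent with the advice to keep k1 small.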
Step S8: a selection threshold F for unreasonable corpus items is set; corpus items whose total evaluation parameter is less than F are stored in the set temp;
Step S9: the results are compared against the Peking University corpus, and the results judged correct are stored in the set K;
Step S10: steps S1-S9 are repeated for the next corpus item; after n corpus items have been evaluated, the overall evaluation of the QS system according to the evaluation parameters is total=k2*|temp|/n.
Wherein, k2 is a coefficient, which can be a power of 10.
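Steps S8 and S10 can be sketched together: collect the items whose total evaluation parameter falls below the threshold F, then scale the share of such items by k2. In the Python sketch below, the function name, the input list of per-item totals, and the default k2=10 are assumptions for illustration only.

```python
def overall_evaluation(item_totals, threshold_f, k2=10.0):
    """Return k2 * |temp| / n over n evaluated corpus items.

    item_totals: total evaluation parameter sum(S) of each evaluated item.
    threshold_f: selection threshold F; items scoring below it go into temp.
    """
    n = len(item_totals)
    if n == 0:
        raise ValueError("no corpus items were evaluated")
    # temp: the items judged unreasonable because their total is below F.
    temp = [score for score in item_totals if score < threshold_f]
    return k2 * len(temp) / n


if __name__ == "__main__":
    totals = [95.0, 40.0, 80.0, 55.0]
    print(overall_evaluation(totals, threshold_f=60.0, k2=10.0))
```

Note that, as the formula is written, a larger value means a larger share of test items fell below the threshold F.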
In the above embodiments, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments. The above merely lists several embodiments of the invention; those skilled in the art can combine and select them as appropriate to the circumstances and fully achieve the technical effect of the invention. Any combination based on the above embodiments is an embodiment of the invention, but for reasons of space they are not described one by one here.
Because the system shown in Figure 3 corresponds to, and is applicable in, the foregoing method embodiments, its description is comparatively brief; for parts not detailed, refer to the corresponding parts earlier in this specification.
The evaluation method for a word segmentation system and the word segmentation evaluation system provided by the invention have been described in detail above. Specific examples have been used herein to set forth the principle and embodiments of the invention; the description of the above embodiments is only intended to help in understanding the method of the invention and its core idea. At the same time, those of ordinary skill in the art may, following the idea of the invention, make changes to the specific embodiments and the scope of application. In summary, the contents of this description should not be construed as limiting the invention.

Claims (13)

1. An evaluation method for a word segmentation system, characterized in that the word segmentation system comprises a benchmark word segmentation system and a target word segmentation system, and the method comprises:
Using the target word segmentation system to segment a test corpus multiple times, obtaining a plurality of target segmentation results, and using the benchmark word segmentation system to segment the test corpus, obtaining a benchmark segmentation result;
Comparing the plurality of target segmentation results longitudinally to obtain a first evaluation parameter for assessing the stability of the target word segmentation system; and comparing the target segmentation result with the benchmark segmentation result laterally to obtain a second evaluation parameter for assessing the accuracy of the target word segmentation system.
2. The method of claim 1, characterized by further comprising:
Recording the test corpus whose evaluation parameter meets a preset condition.
3. The method of claim 2, characterized by further comprising:
Selecting certain test corpus items from those that do not meet the preset condition to generate a corpus.
4. The method of claim 1, 2 or 3, characterized in that the evaluation step comprises:
Calculating the first evaluation parameter from the plurality of target segmentation results of the test corpus; and matching the benchmark segmentation result against the target segmentation result to obtain the second evaluation parameter;
Evaluating the target word segmentation system according to the first evaluation parameter and the second evaluation parameter.
5. The method of claim 1, characterized in that the segmentation step comprises:
Obtaining feature information of the test corpus;
The benchmark word segmentation system and the target word segmentation system respectively segmenting the test corpus according to the feature information, obtaining the benchmark segmentation result and the target segmentation result.
6. The method of claim 5, characterized in that the segmentation step further comprises:
Saving test corpus items having the same feature information to the same file.
7. The method of claim 2, characterized in that the test corpus whose evaluation parameter meets the preset condition is unreasonable corpus, and the recording is recording in a file or a database.
8. A word segmentation evaluation system, characterized in that the evaluation system is used to evaluate a word segmentation system, the word segmentation system comprises a benchmark word segmentation system and a target word segmentation system, and the evaluation system comprises:
A result acquisition module, configured to use the target word segmentation system to segment a test corpus multiple times, obtaining a plurality of target segmentation results, and to use the benchmark word segmentation system to segment the test corpus, obtaining a benchmark segmentation result;
A parameter evaluation module, configured to compare the plurality of target segmentation results longitudinally to obtain a first evaluation parameter for assessing the stability of the target word segmentation system; and to compare the target segmentation result with the benchmark segmentation result laterally to obtain a second evaluation parameter for assessing the accuracy of the target word segmentation system.
9. The evaluation system of claim 8, characterized by further comprising:
A recording module, configured to record the test corpus whose evaluation parameter meets a preset condition.
10. The evaluation system of claim 9, characterized by further comprising:
A custom corpus generation module, configured to select certain test corpus items from those that do not meet the preset condition and generate a corpus from them.
11. The evaluation system of claim 8, 9 or 10, characterized in that the parameter evaluation module comprises:
A parameter calculation submodule, configured to calculate the first evaluation parameter from the plurality of target segmentation results of the test corpus, and to match the benchmark segmentation result against the target segmentation result to obtain the second evaluation parameter;
An evaluation submodule, configured to evaluate the target word segmentation system according to the first evaluation parameter and the second evaluation parameter.
12. The evaluation system of claim 8, characterized in that the result acquisition module comprises:
A feature extraction submodule, configured to obtain feature information of the test corpus;
A segmentation submodule, configured to obtain the benchmark segmentation result and the target segmentation result produced when the benchmark word segmentation system and the target word segmentation system respectively segment the test corpus according to the feature information.
13. The evaluation system of claim 12, characterized in that the result acquisition module further comprises:
A saving submodule, configured to save test corpus items having the same feature information to the same file.
CN2008100898349A 2008-03-28 2008-03-28 Segmentation system evaluating method and segmentation evaluating system Active CN101246473B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2008100898349A CN101246473B (en) 2008-03-28 2008-03-28 Segmentation system evaluating method and segmentation evaluating system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2008100898349A CN101246473B (en) 2008-03-28 2008-03-28 Segmentation system evaluating method and segmentation evaluating system

Publications (2)

Publication Number Publication Date
CN101246473A CN101246473A (en) 2008-08-20
CN101246473B true CN101246473B (en) 2010-09-15

Family

ID=39946933

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2008100898349A Active CN101246473B (en) 2008-03-28 2008-03-28 Segmentation system evaluating method and segmentation evaluating system

Country Status (1)

Country Link
CN (1) CN101246473B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102043791B (en) * 2009-10-10 2014-04-30 深圳市世纪光速信息技术有限公司 Method and device for evaluating word classification
CN106156002A (en) * 2016-06-30 2016-11-23 乐视控股(北京)有限公司 The system of selection of participle dictionary and system
CN108415899B (en) * 2018-01-31 2021-09-17 北京联合大学 Braille word segmentation modification method and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5485372A (en) * 1994-06-01 1996-01-16 Mitsubishi Electric Research Laboratories, Inc. System for underlying spelling recovery
CN1107276C (en) * 1996-01-30 2003-04-30 华建机器翻译有限公司 Fully automatic system for separating Chinese words from sentences
CN101071421A (en) * 2007-05-14 2007-11-14 腾讯科技(深圳)有限公司 Chinese word cutting method and device

Also Published As

Publication number Publication date
CN101246473A (en) 2008-08-20

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C41 Transfer of patent application or patent right or utility model
TR01 Transfer of patent right

Effective date of registration: 20151221

Address after: The South Road in Guangdong province Shenzhen city Fiyta building 518057 floor 5-10 Nanshan District high tech Zone

Patentee after: Shenzhen Tencent Computer System Co., Ltd.

Address before: 2, 518044, East 410 room, SEG science and Technology Park, Zhenxing Road, Shenzhen, Guangdong, Futian District

Patentee before: Tencent Technology (Shenzhen) Co., Ltd.