CN103838794A - Word segmentation method suitable for specialized search engine

Info

Publication number: CN103838794A
Application number: CN201210491416.9A
Authority: CN
Inventors: 郑世明
Original assignee: DALIAN LINGDONG TECHNOLOGY DEVELOPMENT Co Ltd
Current assignee: DALIAN LINGDONG TECHNOLOGY DEVELOPMENT Co Ltd
Priority date: 2012-11-27
Filing date: 2012-11-27
Publication date: 2014-06-04

Abstract

The invention discloses a word segmentation method suitable for specialized search engines. A first-character index view and a first-character entry view are first built from a professional main dictionary table and a synonym dictionary table; the data of both views for the whole dictionary is then loaded into memory as arrays; and the lookup-and-match process is repeated over the text. The documents searched by a specialized search engine are generally technical documents from a professional field, their feature items are based on a terminological dictionary, and a terminological dictionary contains fewer words than a general dictionary. Only professional entries therefore need to be matched, and it is unnecessary to segment every entry in a sentence as a general-purpose search engine does. Inspired by the first-character hash structure, the method improves the efficiency of professional word segmentation, avoids the frequent dictionary lookups of traditional maximum-matching segmentation, avoids the storage space wasted by a first-character hash, and is simple and practical.

Description

A word segmentation method suitable for specialized search engines

Technical field

The present invention relates to automatic Chinese word segmentation technology, and in particular to a word segmentation method suitable for specialized search engines.

Background technology

Since the 1980s, a number of word segmentation systems have been developed in China, using a variety of segmentation methods. These methods nevertheless fall into two classes. The first class is understanding-based segmentation, which uses grammatical, semantic and psychological knowledge of Chinese to try to imitate the human reading process. It requires a segmentation database, a knowledge base and an inference engine, and mainly includes expert-system segmentation, grammar- and rule-based segmentation, and neural-network-based segmentation. The second class is mechanical segmentation, which generally relies on a segmentation dictionary and cuts out words by matching the character strings of a document one by one against the entries of a word list. The segmentation dictionary carries little information about the language itself, such as morphological, semantic or syntactic knowledge; it is essentially a word list. The number of entries in the dictionary and the choice of entries directly affect the final segmentation result. Mechanical segmentation mainly includes forward and reverse maximum matching, best matching, character-by-character traversal, and word-frequency statistics. By comparison, the first class of schemes has high algorithmic complexity, and its effectiveness and feasibility still need further verification in practice, because Chinese, after all, lacks word delimiters and strict word-formation rules. The existing morphological, syntactic and combination rules of linguistics remain very general and complex, and whether they can be converted effectively and systematically into a form a computer can use is hard to settle. This class of methods is therefore still at the research stage, remains far from practical use, and is generally not suitable for adoption. The second class of methods is simple to implement, is more concrete and practical than the first class, and can reach a relatively high accuracy.

The segmentation technique commonly used in search engines is a mechanical Chinese word segmentation method based on a segmentation dictionary, namely forward and reverse maximum matching. It cannot segment words according to the semantic features of the document context and depends heavily on the dictionary, so some segmentation errors are unavoidable in practice. To improve segmentation accuracy, practical search engines usually combine the forward maximum matching method with the reverse maximum matching method. The document is first roughly split at punctuation marks into several sub-segments, and each sub-segment is then scanned and segmented by both forward maximum matching and reverse maximum matching. If the two methods give the same result, the segmentation is taken to be correct; otherwise, the text is handled as the minimum-length span that contains both parts.
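The combined forward and reverse scheme described above can be sketched as follows. This is a minimal illustration, not the procedure of any particular search engine: the function names, the `max_len` parameter and the toy dictionary are assumptions, and the sketch only reports whether the two directions agree, leaving the handling of a disagreement to the caller.

```python
def forward_max_match(text, dictionary, max_len):
    """Greedy left-to-right longest match; unmatched characters become single tokens."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def reverse_max_match(text, dictionary, max_len):
    """Greedy right-to-left longest match; tokens are returned in reading order."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


def bidirectional_agree(text, dictionary, max_len=4):
    """Run both scans on one sub-segment and report whether they agree."""
    fwd = forward_max_match(text, dictionary, max_len)
    rev = reverse_max_match(text, dictionary, max_len)
    return fwd, rev, fwd == rev


# Toy dictionary (assumed for illustration only).
toy_dict = {"研究", "研究生", "生命", "起源"}
fwd, rev, same = bidirectional_agree("研究生命起源", toy_dict)
# fwd is ["研究生", "命", "起源"], rev is ["研究", "生命", "起源"], so same is False.
```

Both scans probe the dictionary with substrings of decreasing length at every position, which is the repeated lookup cost that the invention sets out to avoid.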

The segmentation algorithm combining forward and reverse maximum matching and the first-character hash dictionary structure currently used in search engines are both built on a general dictionary, and they require every entry to be cut out, down to single characters. The search objects of a specialized search engine, however, are usually technical documents of a professional field. The feature items of these documents are all based on a terminological dictionary, which contains fewer words than a general dictionary. Only professional entries therefore need to be matched, and there is no need to segment every entry in a sentence as a general-purpose search engine does.

Summary of the invention

To solve the above problems of the prior art, the present invention, inspired by the first-character hash structure, provides a simple and practical word segmentation method that improves the efficiency of professional word segmentation, avoids the frequent dictionary lookups of traditional maximum-matching segmentation, and avoids the storage space wasted by a first-character hash.

To achieve the above object, the technical solution of the present invention is as follows. A word segmentation method suitable for specialized search engines comprises the following steps:

A. Build a first-character index view and a first-character entry view from the professional main dictionary table and the synonym dictionary table.

B. At initialization, load the data of the two views of the whole dictionary into memory as arrays.

C. Roughly split the text at punctuation marks, then take one character at a time, in order, from the sentence and look it up in the first-character index view by binary search. If it is not found, proceed to the next iteration.

D. Otherwise, go to the first-character entry view and, for each "entry length" recorded there for this first character, intercept a character string of that length from the sentence.

E. Take out in turn, from the first-character entry view, every entry name that begins with this character and compare it with the intercepted string of the corresponding length. The number of comparisons is determined by the "first-character word count" in the first-character index view.

If a match succeeds, increment the count for the corresponding entry: if the entry comes from the main dictionary, count the word itself; if the entry comes from the synonym dictionary, count the main-dictionary word that corresponds to it.

At the same time, skip the characters covered by the matched entry and proceed to the next iteration. If no match succeeds, proceed directly to the next iteration.

F. Repeat steps A-E until the end of the article.
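The steps above can be sketched in Python as follows. This is an illustrative reading under stated assumptions, not the patented implementation: the row layouts, the start offset stored in the index, the longest-first ordering inside each first-character group, and the names `build_views` and `segment_and_count` are all hypothetical. The source only states that the index view holds the first character and a word count, and that the entry view holds entry names and lengths.

```python
from bisect import bisect_left
from collections import Counter


def build_views(main_terms, synonyms):
    """Steps A and B: build both views as plain in-memory lists (arrays).

    main_terms: iterable of non-empty main-dictionary words.
    synonyms:   dict mapping a synonym to its main-dictionary word.
    """
    # Entry view rows: (entry name, entry length, main word to count).
    rows = [(t, len(t), t) for t in main_terms]
    rows += [(s, len(s), main) for s, main in synonyms.items()]
    # Group by first character; longest entries first inside a group
    # (assumed, to keep the longest-match priority).
    rows.sort(key=lambda r: (r[0][0], -r[1], r[0]))
    # Index view rows: (first character, word count, start offset in rows).
    # The start offset is an assumption added to locate the group.
    index = []
    for pos, (name, _, _) in enumerate(rows):
        if index and index[-1][0] == name[0]:
            ch, count, start = index[-1]
            index[-1] = (ch, count + 1, start)
        else:
            index.append((name[0], 1, pos))
    return index, rows


def segment_and_count(sentence, index, rows):
    """Steps C to E for one sub-segment: count matched professional entries."""
    first_chars = [row[0] for row in index]
    counts, i = Counter(), 0
    while i < len(sentence):
        # Step C: binary search for the current character in the index view.
        k = bisect_left(first_chars, sentence[i])
        if k == len(first_chars) or first_chars[k] != sentence[i]:
            i += 1
            continue
        _, count, start = index[k]
        # Steps D and E: compare each entry with a slice of the same length.
        for name, length, main in rows[start:start + count]:
            if sentence[i:i + length] == name:
                counts[main] += 1  # synonyms are credited to the main word
                i += length        # skip the characters of the matched entry
                break
        else:
            i += 1
    return counts


index, rows = build_views(["搜索引擎", "分词", "搜索"],
                          {"切词": "分词", "检索引擎": "搜索引擎"})
counts = segment_and_count("专业搜索引擎的切词和分词", index, rows)
# counts["搜索引擎"] is 1 and counts["分词"] is 2 (one hit came from the synonym "切词").
```

Characters whose first-character lookup fails are skipped without any entry comparison, so text that contains no professional terms costs only one binary search per character.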

Compared with the prior art, the present invention has the following beneficial effects:

1. It retains the longest-match priority of traditional maximum matching, and is also suitable for segmentation statistics on entries that mix Chinese and English (such as "A-grade in the first class", "Java example", etc.).

2. It reverses the traditional maximum-matching practice of intercepting a string and matching it against dictionary entries: instead, each dictionary entry is matched against an intercepted string of the corresponding length. This ensures that every comparison is an effective one, avoids the large number of invalid match tests made when traditional maximum matching looks up the dictionary sequentially, and improves segmentation efficiency.

3. The first-character index is built from the terminological dictionary, which avoids the storage space that a traditional first-character hash index wastes in a specialized search engine.

4. The method is simple and easy to implement. No new index structure table needs to be built; the method is realized using only the existing database table structure, which reduces the complexity of building the index and makes it well suited to specialized search engines.
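Effect 2 can be made concrete by counting comparisons on a toy example. The sketch below is an assumption-laden illustration rather than a benchmark: the probe counters, the sample sentence and the small dictionary are hypothetical, and a plain `dict` keyed by first character stands in for the sorted index view and binary search described in the method.

```python
def fmm_probe_count(text, dictionary, max_len):
    """Count dictionary probes made by traditional forward maximum matching."""
    probes, i = 0, 0
    while i < len(text):
        step = 1
        # Try every candidate length from longest down to two characters.
        for size in range(min(max_len, len(text) - i), 1, -1):
            probes += 1
            if text[i:i + size] in dictionary:
                step = size
                break
        i += step
    return probes


def entry_driven_probe_count(text, entries_by_first_char):
    """Count comparisons when only entries sharing the first character are tried."""
    probes, i = 0, 0
    while i < len(text):
        step = 1
        for entry in entries_by_first_char.get(text[i], ()):
            probes += 1
            if text.startswith(entry, i):
                step = len(entry)
                break
        i += step
    return probes


sentence = "这是一个专业搜索引擎"
terms = {"搜索引擎", "分词"}
by_first = {"搜": ["搜索引擎"], "分": ["分词"]}

fmm = fmm_probe_count(sentence, terms, max_len=4)          # 19 probes
driven = entry_driven_probe_count(sentence, by_first)       # 1 comparison
```

In this example the traditional scan probes the dictionary at every position, while the entry-driven scan compares only where a professional entry could actually start.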

Brief description of the drawings

The present invention has one drawing, in which:

Fig. 1 is a flow diagram of the word segmentation method suitable for specialized search engines according to the present invention.

Embodiment

The present invention is further described below with reference to the drawing. The workflow of the invention is shown in Fig. 1. Two views, the first-character index view and the first-character entry view, are built from the professional main dictionary table and the synonym dictionary table. At initialization, the data of the two views of the whole dictionary is loaded into memory as arrays. During segmentation, the text is first roughly split at punctuation marks. One character at a time is then taken in order from the sentence and looked up by binary search in the "entry first character" field of the first-character index view. If it is not found, the next iteration begins. Otherwise, the method goes to the first-character entry view and, for each different "entry length" recorded there, intercepts a string of that length from the sentence. It then takes out in turn every entry name that begins with this character from the first-character entry view and compares it with the intercepted string of the corresponding length (the number of comparisons is determined by the "first-character word count" in the first-character index view). If a match succeeds, the count for the corresponding entry is incremented (if the entry comes from the main dictionary, the word itself is counted; if the entry comes from the synonym dictionary, the corresponding main-dictionary word is counted), the characters covered by the entry are skipped, and the next iteration begins. If no match succeeds, the next iteration begins directly. Matching is repeated in this way until the end of the article.
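The rough split at punctuation marks that precedes matching might look like the following. The source does not list which marks are used, so the punctuation set and the function name `rough_split` are assumptions made for illustration.

```python
import re

# Assumed set of Chinese and ASCII sentence punctuation; adjust as needed.
_PUNCT = re.compile(r"[，。！？；：、,.!?;:]+")


def rough_split(document):
    """Split a document into non-empty sub-segments at punctuation marks."""
    return [segment for segment in _PUNCT.split(document) if segment]


segments = rough_split("专业搜索引擎，分词方法。Java实例！")
# segments is ["专业搜索引擎", "分词方法", "Java实例"]
```

Each sub-segment is then scanned independently, so no candidate entry is ever compared across a punctuation boundary.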

Claims

1. A word segmentation method suitable for specialized search engines, characterized in that it comprises the following steps:

B. At initialization, load the data of the two views of the whole dictionary into memory as arrays.

F. Repeat steps A-E until the end of the article.