CN103838794A - Word segmentation method suitable for specialized search engine - Google Patents
Word segmentation method suitable for specialized search engine
- Publication number
- CN103838794A (application number CN201210491416.9A)
- Authority
- CN
- China
- Prior art keywords
- word
- entry
- dictionary
- professional
- lead
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Data Mining & Analysis (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a word segmentation method suitable for a specialized search engine. A first-character index view and a first-character entry view are built from a professional master dictionary table and a thesaurus table; at initialization, the data of both views for the whole dictionary is loaded into memory as arrays; the lookup-and-match process is then repeated over the text. The documents searched by a specialized search engine are generally technical documents in a professional field whose feature items are drawn from a terminological dictionary, and a terminological dictionary contains fewer words than a general dictionary. Only professional entries therefore need to be matched, and it is unnecessary to segment every word in a sentence as a general-purpose search engine does. Inspired by the first-character hash structure, the method improves the efficiency of professional word segmentation, avoids the frequent dictionary lookups of traditional maximum-matching segmentation, avoids the storage space wasted by first-character hashing, and is simple and practical.
Description
Technical field
The present invention relates to automatic Chinese word segmentation technology, and in particular to a word segmentation method suitable for a specialized search engine.
Background technology
Since the 1980s, a number of word segmentation systems have been developed in China, using a variety of segmentation methods. These fall into two classes. The first is understanding-based segmentation, which uses grammatical, semantic, and psychological knowledge of Chinese to imitate the human reading process. It requires a segmentation database, a knowledge base, and an inference engine, and mainly includes expert-system segmentation, grammar- and rule-based segmentation, and neural-network-based segmentation. The second is mechanical segmentation, which generally relies on a segmentation dictionary and cuts out words by matching the character strings in a document against the dictionary entries one by one. The segmentation dictionary carries little information about the language itself, such as morphological, semantic, or syntactic knowledge; it is essentially a word list. The number and choice of entries in the dictionary directly affect the final segmentation result. Mechanical methods mainly include forward and reverse maximum matching, best matching, word-by-word traversal, and word-frequency statistics. By comparison, methods of the first class have high algorithmic complexity, and their effectiveness and feasibility still need to be verified in practice, because Chinese lacks explicit word markers and strict word-formation rules. The morphology, syntax, and combination rules established by linguists remain very general and complex, and whether they can be converted effectively and systematically into a form a computer can use is still an open question. This class of method is therefore still at the research stage, remains far from practical, and is generally not adopted. Methods of the second class are simple to implement, more concrete and practical than the first class, and can reach relatively high accuracy.
The segmentation technique commonly used in search engines is a mechanical, dictionary-based Chinese segmentation method, namely forward and reverse maximum matching. It cannot segment according to the semantic features of the document context and depends heavily on the dictionary, so some segmentation errors are unavoidable in practice. To improve accuracy, search engines usually combine forward maximum matching with reverse maximum matching. The document is first split coarsely at punctuation marks into several sub-segments, and each sub-segment is then scanned and segmented with both forward and reverse maximum matching. If the two methods give the same result, the segmentation is considered correct; otherwise, the text is handled at the minimum length that contains both results.
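The combined scheme described above can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names, the `TERMS` dictionary, and the sample sentence are assumptions chosen for illustration, and the handling of disagreeing results is deliberately left out because the text does not specify it precisely.

```python
def forward_max_match(text, dictionary, max_len):
    """Scan left to right, always taking the longest dictionary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


def backward_max_match(text, dictionary, max_len):
    """Scan right to left, always taking the longest dictionary word."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens


# Hypothetical general-purpose dictionary and sentence.
TERMS = {"研究", "研究生", "生命", "起源"}
sentence = "研究生命起源"

forward = forward_max_match(sentence, TERMS, max_len=3)
backward = backward_max_match(sentence, TERMS, max_len=3)
print(forward)              # ['研究生', '命', '起源']
print(backward)             # ['研究', '生命', '起源']
print(forward == backward)  # False: the two scans disagree on this sentence
```

Each candidate substring costs one dictionary lookup, including those that fail, which is the overhead the background section attributes to traditional maximum matching.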
The combined forward and reverse maximum-matching algorithm and the first-character hash dictionary structure currently used in search engines are both built on a general dictionary, which requires every entry to be segmented completely, down to single characters. The documents searched by a specialized search engine, however, are usually technical documents in a professional field whose feature items are all based on a terminological dictionary. A terminological dictionary contains fewer words than a general dictionary, so only professional entries need to be matched, and there is no need to segment every word in a sentence as a general-purpose search engine does.
Summary of the invention
To solve the above problems of the prior art, the present invention, inspired by the first-character hash structure, provides a simple and practical word segmentation method that improves the efficiency of professional word segmentation and avoids both the frequent dictionary lookups of traditional maximum-matching segmentation and the storage space wasted by first-character hashing.
To achieve this, the technical solution of the present invention is a word segmentation method suitable for a specialized search engine, comprising the following steps:
A. Build a first-character index view and a first-character entry view from the professional main dictionary table and the synonym dictionary table;
B. At initialization, load the data of both views for the whole dictionary into memory as arrays;
C. Split the text coarsely at punctuation marks, then take one character at a time, in order, from the sentence and look it up in the first-character index view by binary search; if it is not found, go on to the next iteration;
D. Otherwise, go to the first-character entry view and, for each "entry length" listed there, extract a substring of the sentence of that length;
E. Take the entry names beginning with this character from the first-character entry view one by one and compare each with the extracted substring of the corresponding length; the number of comparisons is given by the "first-character word count" in the first-character index view;
If a match succeeds, increment the count for the corresponding entry: if the entry comes from the main dictionary, count the word itself; if it comes from the synonym dictionary, count the main-dictionary word it corresponds to;
At the same time, skip the characters covered by the entry and go on to the next iteration; if no match succeeds, go directly to the next iteration;
F. Repeat steps A-E until the end of the article.
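The text does not give the column layout of the two views, so the following Python sketch of steps A and B rests on assumptions: the names `Entry`, `IndexRow`, and `build_views`, the sample terms, and the choice to order entries sharing a first character from longest to shortest (one plausible way to keep longest-match priority) are all hypothetical.

```python
from collections import namedtuple

# Hypothetical row layouts for the two views.
Entry = namedtuple("Entry", "name length main_term")
IndexRow = namedtuple("IndexRow", "first_char start count")


def build_views(main_terms, synonyms):
    """Build (index_view, entry_view) as in-memory arrays.

    main_terms: iterable of main-dictionary terms.
    synonyms:   dict mapping a synonym to its main-dictionary term.
    """
    rows = [(term, term) for term in main_terms] + list(synonyms.items())
    # Group by first character; within a group, longest entry first (assumption).
    rows.sort(key=lambda r: (r[0][0], -len(r[0]), r[0]))
    entry_view = [Entry(name, len(name), main) for name, main in rows]

    index_view = []
    for pos, entry in enumerate(entry_view):
        if index_view and index_view[-1].first_char == entry.name[0]:
            last = index_view[-1]
            index_view[-1] = last._replace(count=last.count + 1)
        else:
            index_view.append(IndexRow(entry.name[0], pos, 1))
    return index_view, entry_view


# Hypothetical sample data.
main_terms = ["数据库", "数据挖掘", "搜索引擎", "分词"]
synonyms = {"资料库": "数据库", "切词": "分词"}
index_view, entry_view = build_views(main_terms, synonyms)

for row in index_view:
    print(row)
```

Because the index view is sorted by first character, it can be searched by binary search, and each row's `start` and `count` locate the contiguous slice of the entry view that holds the entries beginning with that character.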
Compared with the prior art, the present invention has the following beneficial effects:
1. It keeps the longest-match-first behavior of traditional maximum matching, and is also suitable for segmenting and counting mixed Chinese-English entries (such as "A-grade in the first class" and "Java example").
2. It replaces the traditional maximum-matching practice of extracting a substring and looking it up in the dictionary with one in which each dictionary entry is compared against an extracted substring of the same length. Every comparison is therefore an effective one, the many futile lookups that traditional maximum matching makes while searching the dictionary in order are avoided, and segmentation efficiency improves.
3. The first-character index is built from the terminological dictionary, which avoids the storage space that a traditional first-character hash index wastes in a specialized search engine.
4. The method is simple and easy to implement. No new index structure table is needed; it uses only the existing database table structure, which reduces the complexity of building the index and makes it well suited to specialized search engines.
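The storage argument in point 3 can be made concrete with a small Python comparison. The patent does not say how a traditional first-character hash table is laid out; the sketch assumes a dense table with one slot per code point in the CJK Unified Ideographs block (U+4E00 to U+9FFF), and the term list is hypothetical.

```python
# Hypothetical terminological dictionary.
terms = ["数据库", "数据挖掘", "搜索引擎", "分词", "分词词典"]

# Assumed dense first-character table: one slot per CJK code point.
CJK_START, CJK_END = 0x4E00, 0x9FFF
dense_slots = CJK_END - CJK_START + 1
dense_table = [None] * dense_slots
for term in terms:
    slot = ord(term[0]) - CJK_START
    if dense_table[slot] is None:
        dense_table[slot] = []
    dense_table[slot].append(term)
used_slots = sum(1 for slot in dense_table if slot is not None)

# Compact alternative: a sorted array holding only the first characters
# that actually occur, which can be searched by binary search.
compact_index = sorted({term[0] for term in terms})

print(dense_slots, used_slots)  # 20992 3
print(len(compact_index))       # 3
```

With a small terminological dictionary, almost every slot of the dense table stays empty, while the sorted index grows only with the number of distinct first characters.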
Description of the drawings
The present invention has one accompanying drawing:
Fig. 1 is a flow diagram of the word segmentation method suitable for a specialized search engine according to the present invention.
Embodiment
The present invention is further described below with reference to the accompanying drawing. The workflow is shown in Fig. 1. Two views, a first-character index view and a first-character entry view, are built from the professional main dictionary table and the synonym dictionary table. At initialization, the data of both views for the whole dictionary is loaded into memory as arrays. During segmentation, the text is first split coarsely at punctuation marks. One character at a time is then taken, in order, from the sentence and looked up by binary search in the "entry first character" field of the first-character index view. If it is not found, the next iteration begins. Otherwise, processing moves to the first-character entry view: for each distinct "entry length" in that view, a substring of the sentence of that length is extracted, and the entry names beginning with this character are taken one by one from the first-character entry view and compared with the extracted substring of the corresponding length (the number of comparisons is given by the "first-character word count" in the first-character index view). If a match succeeds, the count for the corresponding entry is incremented (if the entry comes from the main dictionary, the word itself is counted; if it comes from the synonym dictionary, the corresponding main-dictionary word is counted), the characters covered by the entry are skipped, and the next iteration begins. If no match succeeds, the next iteration begins directly. Matching repeats in this way until the end of the article.
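The matching loop of this workflow can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the contents of `ENTRY_VIEW` and `INDEX_VIEW`, the punctuation set used for coarse splitting, and the name `count_terms` are hypothetical, and entries that share a first character are assumed to be stored longest first.

```python
import re
from bisect import bisect_left
from collections import Counter

# Hypothetical first-character entry view: (entry name, main-dictionary term),
# grouped by first character, longest entry first within a group.
ENTRY_VIEW = [
    ("分词", "分词"),
    ("切词", "分词"),        # synonym entry, counted under "分词"
    ("搜索引擎", "搜索引擎"),
    ("数据挖掘", "数据挖掘"),
    ("数据库", "数据库"),
    ("资料库", "数据库"),    # synonym entry, counted under "数据库"
]
# Hypothetical first-character index view: (first character, start, word count),
# sorted by first character so it can be binary searched.
INDEX_VIEW = [("分", 0, 1), ("切", 1, 1), ("搜", 2, 1), ("数", 3, 2), ("资", 5, 1)]
FIRST_CHARS = [row[0] for row in INDEX_VIEW]


def count_terms(text):
    """Count occurrences of professional terms, mapped to main-dictionary terms."""
    counts = Counter()
    # Coarse split at punctuation and whitespace (assumed delimiter set).
    for sentence in re.split(r"[，。；！？,.;!?\s]+", text):
        i = 0
        while i < len(sentence):
            k = bisect_left(FIRST_CHARS, sentence[i])
            if k == len(FIRST_CHARS) or FIRST_CHARS[k] != sentence[i]:
                i += 1  # character starts no entry: next iteration
                continue
            _, start, n = INDEX_VIEW[k]
            for name, main_term in ENTRY_VIEW[start:start + n]:
                # Compare the entry with a substring of the same length.
                if sentence[i:i + len(name)] == name:
                    counts[main_term] += 1
                    i += len(name)  # skip the characters the entry covers
                    break
            else:
                i += 1  # no entry matched at this position
    return counts


result = count_terms("专业搜索引擎先切词，再查资料库和数据库。")
print(dict(result))  # {'搜索引擎': 1, '分词': 1, '数据库': 2}
```

Characters that begin no entry cost only one binary search, and comparisons are made only against entries that share the current first character, which is how the method avoids lookups for substrings that cannot match.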
Claims (1)
1. A word segmentation method suitable for a specialized search engine, characterized in that it comprises the following steps:
A. Build a first-character index view and a first-character entry view from the professional main dictionary table and the synonym dictionary table;
B. At initialization, load the data of both views for the whole dictionary into memory as arrays;
C. Split the text coarsely at punctuation marks, then take one character at a time, in order, from the sentence and look it up in the first-character index view by binary search; if it is not found, go on to the next iteration;
D. Otherwise, go to the first-character entry view and, for each "entry length" listed there, extract a substring of the sentence of that length;
E. Take the entry names beginning with this character from the first-character entry view one by one and compare each with the extracted substring of the corresponding length; the number of comparisons is given by the "first-character word count" in the first-character index view;
If a match succeeds, increment the count for the corresponding entry: if the entry comes from the main dictionary, count the word itself; if it comes from the synonym dictionary, count the main-dictionary word it corresponds to;
At the same time, skip the characters covered by the entry and go on to the next iteration; if no match succeeds, go directly to the next iteration;
F. Repeat steps A-E until the end of the article.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210491416.9A CN103838794A (en) | 2012-11-27 | 2012-11-27 | Word segmentation method suitable for specialized search engine |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201210491416.9A CN103838794A (en) | 2012-11-27 | 2012-11-27 | Word segmentation method suitable for specialized search engine |
Publications (1)
Publication Number | Publication Date |
---|---|
CN103838794A true CN103838794A (en) | 2014-06-04 |
Family
ID=50802303
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210491416.9A Pending CN103838794A (en) | 2012-11-27 | 2012-11-27 | Word segmentation method suitable for specialized search engine |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN103838794A (en) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105468584A (en) * | 2015-12-31 | 2016-04-06 | 武汉鸿瑞达信息技术有限公司 | Filtering method and system for bad literal information in text |
CN108170682A (en) * | 2018-01-18 | 2018-06-15 | 北京同盛科创科技有限公司 | A kind of Chinese word cutting method and computing device based on specialized vocabulary |
CN110825608A (en) * | 2018-08-08 | 2020-02-21 | 北京京东尚科信息技术有限公司 | Key semantic testing method and device, storage medium and electronic equipment |
CN113553408A (en) * | 2021-06-25 | 2021-10-26 | 西安电子科技大学 | Industrial big data search optimization method, system, equipment, medium and terminal |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20000060727A (en) * | 1999-03-18 | 2000-10-16 | 오민희 | The electronic dictionary whit multi-keyword |
CN102375842A (en) * | 2010-08-20 | 2012-03-14 | 姚尹雄 | Method for evaluating and extracting keyword set in whole field |
-
2012
- 2012-11-27 CN CN201210491416.9A patent/CN103838794A/en active Pending
Non-Patent Citations (1)
Title |
---|
刘峰: "通用中英文专业搜索引擎技术的研究及应用", 《中国硕士学位论文全文数据库 信息科技辑》 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN105468584A (en) * | 2015-12-31 | 2016-04-06 | 武汉鸿瑞达信息技术有限公司 | Filtering method and system for bad literal information in text |
CN108170682A (en) * | 2018-01-18 | 2018-06-15 | 北京同盛科创科技有限公司 | A kind of Chinese word cutting method and computing device based on specialized vocabulary |
CN108170682B (en) * | 2018-01-18 | 2021-09-07 | 北京同盛科创科技有限公司 | Chinese word segmentation method based on professional vocabulary and computing equipment |
CN110825608A (en) * | 2018-08-08 | 2020-02-21 | 北京京东尚科信息技术有限公司 | Key semantic testing method and device, storage medium and electronic equipment |
CN113553408A (en) * | 2021-06-25 | 2021-10-26 | 西安电子科技大学 | Industrial big data search optimization method, system, equipment, medium and terminal |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Zheng et al. | Question answering over knowledge graphs: question understanding via template decomposition | |
CN105701253B (en) | The knowledge base automatic question-answering method of Chinese natural language question semanteme | |
Cai et al. | An encoder-decoder framework translating natural language to database queries | |
CN105868204B (en) | A kind of method and device for converting Oracle scripting language SQL | |
CN106126620A (en) | Method of Chinese Text Automatic Abstraction based on machine learning | |
CN106682209A (en) | Cross-language scientific and technical literature retrieval method and cross-language scientific and technical literature retrieval system | |
CN108665141B (en) | Method for automatically extracting emergency response process model from emergency plan | |
CN105608232A (en) | Bug knowledge modeling method based on graphic database | |
CN106933869A (en) | A kind of method and apparatus of operating database | |
CN103678287A (en) | Method for unifying keyword translation | |
CN103838794A (en) | Word segmentation method suitable for specialized search engine | |
CN114625748A (en) | SQL query statement generation method and device, electronic equipment and readable storage medium | |
CN107679124B (en) | Knowledge graph Chinese question-answer retrieval method based on dynamic programming algorithm | |
Embley et al. | Transforming web tables to a relational database | |
Flor | A fast and flexible architecture for very large word n-gram datasets | |
CN103617265A (en) | Ontology query engine optimizing system based on ontology semantic information | |
CN109446277A (en) | Relational data intelligent search method and system based on Chinese natural language | |
Giordani et al. | Automatic generation and reranking of sql-derived answers to nl questions | |
Wang et al. | Semi-supervised chinese open entity relation extraction | |
He-ping et al. | Research and implementation of ontology automatic construction based on relational database | |
CN101706792A (en) | Chinese query clause oriented three-level inquired target analysis method | |
Gao et al. | ICST Math Retrieval System for NTCIR-11 Math-2 Task. | |
CN110717014A (en) | Ontology knowledge base dynamic construction method | |
Yi | An english pos tagging approach based on maximum entropy | |
CN115617965A (en) | Rapid retrieval method for language structure big data |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | Application publication date: 20140604 |