CN103838794A - Word segmentation method suitable for specialized search engine - Google Patents

Info

Publication number
CN103838794A
Authority
CN
China
Prior art keywords
word
entry
dictionary
professional
lead
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201210491416.9A
Other languages
Chinese (zh)
Inventor
郑世明
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
DALIAN LINGDONG TECHNOLOGY DEVELOPMENT Co Ltd
Original Assignee
DALIAN LINGDONG TECHNOLOGY DEVELOPMENT Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
2012-11-27
Filing date
2012-11-27
Publication date
2014-06-04
Application filed by DALIAN LINGDONG TECHNOLOGY DEVELOPMENT Co Ltd
Priority to CN201210491416.9A
Publication of CN103838794A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a word segmentation method suitable for a specialized search engine. The method comprises the following steps: first, a first-character index view and a first-character entry view are built from a professional main dictionary table and a synonym dictionary table; the data of both views for the whole dictionary is then loaded into memory as arrays; and the search-and-match process is repeated in a loop. The search objects of a specialized search engine are generally technical documents in a professional field, the feature terms of these documents come from a terminology dictionary, and a terminology dictionary contains fewer words than a general dictionary. Therefore only professional entries need to be matched, and it is unnecessary to segment every entry in a sentence as a general-purpose search engine does. Inspired by the first-character hash structure, the method improves the efficiency of professional word segmentation, avoids both the frequent dictionary lookups of traditional maximum matching segmentation and the wasted storage space of first-character hashing, and is simple and practical.

Description

A word segmentation method suitable for a specialized search engine
Technical field
The present invention relates to automatic Chinese word segmentation technology, and in particular to a word segmentation method suitable for a specialized search engine.
Background art
Since the 1980s, a number of word segmentation systems have been developed in China, using a variety of segmentation methods. They fall into two broad classes. The first is understanding-based segmentation, which uses Chinese grammatical, semantic and psychological knowledge to try to imitate the human reading process. This approach requires a segmentation database, a knowledge base and an inference engine, and mainly includes expert-system segmentation, grammar- and rule-based segmentation, and neural-network-based segmentation. The second is mechanical segmentation, which generally relies on a segmentation dictionary and segments a document by matching its character strings one by one against the words in the vocabulary. The segmentation dictionary carries little information about the language itself, such as morphological, semantic or syntactic knowledge; it is mainly a word list. The number of entries in the dictionary and the choice of entries directly affect the final segmentation quality. Mechanical methods mainly include forward and reverse maximum matching, best matching, word-by-word traversal, and word-frequency statistics.
By comparison, the first class of methods has high algorithmic complexity, and its effectiveness and feasibility still need further verification in practice, because Chinese lacks word delimiters and strict word-formation rules. The morphological, syntactic and combination rules established by linguists remain very general and complex, and whether they can be converted effectively and systematically into a form a computer can use is hard to settle. This class of methods is therefore still at the research stage, remains far from practical, and is generally not advisable to adopt. The second class is simple to implement, is more concrete and practical than the first, and can reach high accuracy.
The word segmentation technique commonly used in search engines is a mechanical Chinese segmentation method based on a segmentation dictionary, namely forward and reverse maximum matching. It cannot segment according to the semantic features of the document context and depends heavily on the dictionary, so some segmentation errors are unavoidable in practice. To improve accuracy, search engines usually combine forward maximum matching with reverse maximum matching. The document is first coarsely split at punctuation marks into several sub-segments, and each sub-segment is then scanned and segmented by both methods. If the two methods give the same result, the segmentation is considered correct; otherwise, the text is handled as the shortest span that contains both differing parts.
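The combined forward and reverse scan described above can be sketched as follows. This is a minimal illustration, not the implementation of any particular search engine: the function names, the punctuation set and the sample vocabulary are all assumptions, and the sketch only reports whether the two scans agree rather than resolving a disagreement.

```python
import re

def forward_max_match(text, vocab, max_len):
    """Scan left to right, always taking the longest vocabulary word."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # A single character is always accepted as a fallback token.
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens

def reverse_max_match(text, vocab, max_len):
    """Scan right to left, always taking the longest vocabulary word."""
    tokens, j = [], len(text)
    while j > 0:
        for size in range(min(max_len, j), 0, -1):
            piece = text[j - size:j]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                j -= size
                break
    tokens.reverse()
    return tokens

def bidirectional_segment(document, vocab):
    """Split at punctuation (assumed set), then run both scans per sub-segment."""
    max_len = max(map(len, vocab))
    results = []
    for sub in re.split(r"[，。！？；：、,.!?;:\s]+", document):
        if not sub:
            continue
        fwd = forward_max_match(sub, vocab, max_len)
        rev = reverse_max_match(sub, vocab, max_len)
        results.append((sub, fwd, rev, fwd == rev))
    return results

# Hypothetical general-purpose vocabulary, for illustration only.
vocab = {"研究", "研究生", "生命", "起源", "搜索", "引擎", "搜索引擎"}
results = bidirectional_segment("研究生命起源。搜索引擎", vocab)
```

In this example the two scans disagree on the first sub-segment and agree on the second, which is the situation the paragraph above describes.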
The combined forward and reverse maximum matching algorithm and the first-character hash dictionary structure currently used in search engines are both built on general-purpose dictionaries, which requires segmenting every entry all the way down to single characters. A specialized search engine, however, usually searches technical documents in a professional field. The feature terms of these documents all come from a terminology dictionary, which contains fewer words than a general dictionary. It is therefore enough to match the professional entries, without fully segmenting every entry in a sentence as a general-purpose search engine does.
Summary of the invention
To solve the above problems in the prior art, the present invention, inspired by the first-character hash structure, provides a simple and practical word segmentation method that improves the efficiency of professional word segmentation and avoids both the frequent dictionary lookups of traditional maximum matching segmentation and the wasted storage space of first-character hashing.
To achieve these goals, the technical scheme of the present invention is as follows. A word segmentation method suitable for a specialized search engine comprises the following steps:
A. Build a first-character index view and a first-character entry view from the professional main dictionary table and the synonym dictionary table;
B. On initialization, load the data of both views for the whole dictionary into memory as arrays;
C. Coarsely split the text at punctuation marks, then take one Chinese character at a time, in order, from the sentence and look it up in the first-character index view by binary search; if it is not found, go on to the next iteration;
D. Otherwise, go to the first-character entry view and, for each "entry length" recorded there for that first character, cut a string of the corresponding length from the sentence;
E. Take from the first-character entry view, in turn, every entry name that begins with this character and compare it with the cut string of the corresponding length; the number of comparisons is given by the "first-character count" in the first-character index view;
If a match succeeds, increment the count for the corresponding entry: if the entry comes from the main dictionary, count the word itself; if it comes from the synonym dictionary, count the main dictionary word that corresponds to it;
At the same time, skip the characters covered by the entry and go on to the next iteration; if nothing matches, go directly to the next iteration;
F. Repeat steps A-E until the end of the article.
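Steps A and B, together with the binary search of step C, might look like the sketch below. The text names only the "entry first character", "first-character count" and "entry length" columns, so everything else here is an assumption made for illustration: the function names, the offset column that links the index view to the entry view, the longest-first ordering inside each group, and the sample dictionaries.

```python
from bisect import bisect_left
from collections import defaultdict

def build_views(main_terms, synonyms):
    """Build the two views as plain arrays (lists of tuples).

    main_terms: entries of the main dictionary.
    synonyms:   mapping from a synonym to its main dictionary term.
    """
    rows = [(term, term) for term in main_terms]
    rows += list(synonyms.items())
    groups = defaultdict(list)
    for name, main_term in rows:
        # Entry view row: (entry name, entry length, main dictionary term).
        groups[name[0]].append((name, len(name), main_term))
    index_view, entry_view = [], []
    for first_char in sorted(groups):
        # Assumed ordering: longest entries first within one first character.
        group = sorted(groups[first_char], key=lambda row: (-row[1], row[0]))
        # Index view row: (first character, offset into entry view, count).
        index_view.append((first_char, len(entry_view), len(group)))
        entry_view.extend(group)
    return index_view, entry_view

def find_first_char(index_view, keys, char):
    """Binary search for a first character; returns its index row or None."""
    pos = bisect_left(keys, char)
    if pos < len(keys) and keys[pos] == char:
        return index_view[pos]
    return None

# Hypothetical sample dictionaries.
index_view, entry_view = build_views(
    ["搜索引擎", "搜索", "分词", "词典"],
    {"切词": "分词", "字典": "词典"},
)
keys = [row[0] for row in index_view]  # sorted first characters
hit = find_first_char(index_view, keys, "搜")
miss = find_first_char(index_view, keys, "的")
```

Because the index view holds one row per distinct first character of the terminology dictionary, its size follows the dictionary rather than the whole character set, which is the storage saving the text attributes to the method.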
Compared with the prior art, the present invention has the following beneficial effects:
1. It keeps the longest-match-first behavior of traditional maximum matching, and is also suitable for segmenting and counting mixed Chinese-English entries (such as "A-grade in the first class", "Java example", etc.).
2. It reverses the traditional maximum matching practice of cutting a string and looking it up in the dictionary; instead, each dictionary entry is matched against a cut string of the same length. This ensures that every comparison is a meaningful one, avoids the many futile comparisons made when traditional maximum matching looks up the dictionary sequentially, and improves segmentation efficiency.
3. The first-character index is built from the terminology dictionary, which avoids the storage space that the traditional first-character hash index wastes in a specialized search engine.
4. The method is simple and easy to implement. No new index structure table is needed; existing database table structures suffice, which reduces the complexity of building the index and makes the method well suited to specialized search engines.
Brief description of the drawing
The invention has one drawing, in which:
Fig. 1 is a flow chart of the word segmentation method suitable for a specialized search engine according to the invention.
Detailed description
The invention is further described below with reference to the drawing. The workflow is shown in Fig. 1. Two views, the first-character index view and the first-character entry view, are built from the professional main dictionary table and the synonym dictionary table. On initialization, the data of both views for the whole dictionary is loaded into memory as arrays. During segmentation, the text is first coarsely split at punctuation marks. One Chinese character at a time is then taken, in order, from the sentence and looked up by binary search in the "entry first character" column of the first-character index view. If it is not found, the process goes on to the next iteration. Otherwise, the process goes to the first-character entry view and, for each distinct "entry length" recorded there for that first character, cuts a string of the corresponding length from the sentence. It then takes from the entry view, in turn, every entry name that begins with this character and compares it with the cut string of the corresponding length (the number of comparisons is given by the "first-character count" in the first-character index view). If a match succeeds, the count for the corresponding entry is incremented (if the entry comes from the main dictionary, the word itself is counted; if it comes from the synonym dictionary, the corresponding main dictionary word is counted), the characters covered by the entry are skipped, and the process goes on to the next iteration. If nothing matches, the process goes directly to the next iteration. Matching is repeated in this way until the end of the article.
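The matching loop of this workflow can be sketched roughly as below, assuming the two views are already in memory. The small hand-built views, the offset column in the index rows, the punctuation set and the name `count_terms` are hypothetical. `str.startswith(name, i)` stands in for cutting a string of the entry's length at position `i` and comparing it with the entry name.

```python
import re
from bisect import bisect_left
from collections import Counter

# Hypothetical in-memory views.
# INDEX rows:   (first character, offset into ENTRIES, first-character count)
# ENTRIES rows: (entry name, main dictionary term); longest entry first per group
INDEX = [("分", 0, 1), ("切", 1, 1), ("搜", 2, 2)]
ENTRIES = [
    ("分词", "分词"),
    ("切词", "分词"),        # synonym, counted under its main term
    ("搜索引擎", "搜索引擎"),
    ("搜索", "搜索"),
]
KEYS = [row[0] for row in INDEX]  # sorted, so binary search applies

def count_terms(document):
    """Count professional terms in a document; all other characters are skipped."""
    counts = Counter()
    # Coarse split at punctuation (assumed set of marks).
    for sentence in re.split(r"[，。！？；：、,.!?;:\s]+", document):
        i = 0
        while i < len(sentence):
            step = 1  # default: move on by one character
            pos = bisect_left(KEYS, sentence[i])
            if pos < len(KEYS) and KEYS[pos] == sentence[i]:
                _, start, n = INDEX[pos]
                # At most n comparisons, one per entry with this first character.
                for name, main_term in ENTRIES[start:start + n]:
                    if sentence.startswith(name, i):
                        counts[main_term] += 1
                        step = len(name)  # skip the characters of the entry
                        break
            i += step
    return counts

counts = count_terms("搜索引擎的分词，切词和搜索。")
```

Only characters that start some dictionary entry trigger any comparison, so ordinary text between professional terms costs one binary search per character and nothing more.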

Claims (1)

1. A word segmentation method suitable for a specialized search engine, characterized in that it comprises the following steps:
A. Build a first-character index view and a first-character entry view from the professional main dictionary table and the synonym dictionary table;
B. On initialization, load the data of both views for the whole dictionary into memory as arrays;
C. Coarsely split the text at punctuation marks, then take one Chinese character at a time, in order, from the sentence and look it up in the first-character index view by binary search; if it is not found, go on to the next iteration;
D. Otherwise, go to the first-character entry view and, for each "entry length" recorded there for that first character, cut a string of the corresponding length from the sentence;
E. Take from the first-character entry view, in turn, every entry name that begins with this character and compare it with the cut string of the corresponding length; the number of comparisons is given by the "first-character count" in the first-character index view;
If a match succeeds, increment the count for the corresponding entry: if the entry comes from the main dictionary, count the word itself; if it comes from the synonym dictionary, count the main dictionary word that corresponds to it;
At the same time, skip the characters covered by the entry and go on to the next iteration; if nothing matches, go directly to the next iteration;
F. Repeat steps A-E until the end of the article.
CN201210491416.9A 2012-11-27 2012-11-27 Word segmentation method suitable for specialized search engine Pending CN103838794A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201210491416.9A CN103838794A (en) 2012-11-27 2012-11-27 Word segmentation method suitable for specialized search engine

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201210491416.9A CN103838794A (en) 2012-11-27 2012-11-27 Word segmentation method suitable for specialized search engine

Publications (1)

Publication Number Publication Date
CN103838794A true CN103838794A (en) 2014-06-04

Family

ID=50802303

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201210491416.9A Pending CN103838794A (en) 2012-11-27 2012-11-27 Word segmentation method suitable for specialized search engine

Country Status (1)

Country Link
CN (1) CN103838794A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20000060727A (en) * 1999-03-18 2000-10-16 오민희 The electronic dictionary whit multi-keyword
CN102375842A (en) * 2010-08-20 2012-03-14 姚尹雄 Method for evaluating and extracting keyword set in whole field

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
刘峰: "通用中英文专业搜索引擎技术的研究及应用", 《中国硕士学位论文全文数据库 信息科技辑》 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105468584A (en) * 2015-12-31 2016-04-06 武汉鸿瑞达信息技术有限公司 Filtering method and system for bad literal information in text
CN108170682A (en) * 2018-01-18 2018-06-15 北京同盛科创科技有限公司 A kind of Chinese word cutting method and computing device based on specialized vocabulary
CN108170682B (en) * 2018-01-18 2021-09-07 北京同盛科创科技有限公司 Chinese word segmentation method based on professional vocabulary and computing equipment
CN110825608A (en) * 2018-08-08 2020-02-21 北京京东尚科信息技术有限公司 Key semantic testing method and device, storage medium and electronic equipment
CN113553408A (en) * 2021-06-25 2021-10-26 西安电子科技大学 Industrial big data search optimization method, system, equipment, medium and terminal

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20140604
