CN102402501A - Term extraction method and device - Google Patents


Publication number
CN102402501A
Authority
CN
China
Prior art keywords
term
characteristic
speech
candidate
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN2010102826910A
Other languages
Chinese (zh)
Inventor
杨宇航
于浩
孟遥
陆应亮
夏迎炬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Fujitsu Ltd
Original Assignee
Fujitsu Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Fujitsu Ltd
Priority to CN2010102826910A
Publication of CN102402501A
Legal status: Pending


Abstract

The invention discloses a term extraction method and device. The term extraction method comprises: obtaining at least two features of a candidate term; and extracting the candidate term on the basis of the obtained features, wherein the features comprise a head-and-tail word or character feature and an indicator word feature.

Description

Term extraction method and device
Technical field
The present invention relates to the field of text processing. In particular, the present invention relates to a term extraction method and device.
Background
A term is a lexical unit used to express the most basic knowledge of a domain. The purpose of term extraction is to extract meaningful words or phrases that express meanings or concepts specific to that domain. Because the results of term extraction can be used to analyze text quickly, those skilled in the art have studied term extraction extensively.
Summary of the invention
An object of the present invention is to provide a method and a device for extracting terms. A brief overview of the invention is given below in order to provide a basic understanding of certain aspects of the invention. It should be understood that this overview is not an exhaustive overview of the invention. It is not intended to identify key or essential parts of the invention, nor is it intended to limit the scope of the invention. Its sole purpose is to present some concepts in a simplified form as a prelude to the more detailed description that follows.
According to an embodiment of the invention, at least two features of a candidate term are obtained, and the candidate term is extracted based on the obtained features, wherein the features comprise a head-and-tail word or character feature and an indicator word feature.
Embodiments of the invention can effectively improve term extraction results, and different features can be used for term extraction according to the specific circumstances of the practical application, thereby improving the accuracy of the extracted terms.
These and other advantages of the invention will become more apparent from the following detailed description of preferred embodiments of the invention taken in conjunction with the accompanying drawings.
Description of drawings
The invention can be better understood by referring to the description given below in conjunction with the accompanying drawings. The drawings, together with the detailed description below, are included in and form part of this specification, and serve to further illustrate preferred embodiments of the invention and to explain its principles and advantages. In the drawings:
Fig. 1 shows a flowchart of a method for extracting terms according to one embodiment of the invention;
Fig. 2 shows a flowchart of a method for extracting terms according to another embodiment of the invention;
Fig. 3 shows a schematic diagram of a device for extracting terms according to one embodiment of the invention;
Fig. 4 shows a schematic diagram of a device for extracting terms according to another embodiment of the invention;
Fig. 5 shows a block diagram of an exemplary structure of a computer that can be used to implement the method and/or device according to embodiments of the invention.
In the drawings, identical or corresponding method steps or components are denoted by identical or corresponding reference signs.
Detailed description
Exemplary embodiments of the invention are described below with reference to the accompanying drawings. For clarity and conciseness, not all features of an actual implementation are described in this specification. It should be understood, however, that many implementation-specific decisions must be made in developing any such actual implementation in order to achieve the developer's specific goals, and that these decisions may vary from one implementation to another. It should also be understood that, although such development work may be complex and time-consuming, it is merely a routine task for those skilled in the art having the benefit of this disclosure.
It should also be noted that, to avoid obscuring the invention with unnecessary detail, the drawings show only the device structures closely related to the solution according to the invention, and other details of little relevance to the invention are omitted.
The inventors have found that a variety of term extraction methods have been developed, but each has its own shortcomings. For example, statistics-based methods cannot recognize terms that lack statistical significance, because such methods are very sensitive to how frequently a term occurs. Methods based on trigger words always use predetermined linguistic rules for post-processing, so they may extract meaningless strings as meaningful words or may miss some meaningful words. Knowledge-based methods depend heavily on the quality and quantity of domain knowledge and are therefore difficult to apply to new domains. Consequently, if term extraction methods based on different features can be combined according to the specific application, the shortcomings of using a term extraction method based on only one feature can be effectively overcome.
Accordingly, one embodiment of the invention proposes a method for extracting terms. Fig. 1 shows a schematic flowchart of this method.
It should be noted that the candidate terms have already been obtained from an external source before the method shown in Fig. 1 is carried out. The candidate terms can be obtained by any method, including methods known in the prior art. For example, a raw corpus can first be obtained and then subjected to various kinds of pre-processing, such as sentence splitting, word segmentation, and part-of-speech tagging, to obtain the candidate terms. Existing natural language processing methods can be used for this pre-processing.
How candidate terms are obtained is well known to those skilled in the art and is therefore not described in detail here.
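The source leaves candidate generation open, so the following is only a minimal sketch of one common possibility: enumerating every contiguous word sequence of an already segmented sentence as a candidate. The function name `generate_candidates`, the `max_len` limit, and the English tokens are assumptions made for this illustration and do not come from the patent.

```python
from typing import List, Tuple


def generate_candidates(tokens: List[str], max_len: int = 3) -> List[Tuple[int, int]]:
    """Return (start, end) spans, end exclusive, for every contiguous
    token sequence containing at most max_len tokens."""
    spans = []
    for start in range(len(tokens)):
        longest_end = min(start + max_len, len(tokens))
        for end in range(start + 1, longest_end + 1):
            spans.append((start, end))
    return spans


# Hypothetical, already segmented sentence.
tokens = ["scanning", "tunneling", "microscope", "is", "useful"]
spans = generate_candidates(tokens, max_len=3)
candidates = [tokens[s:e] for s, e in spans]
```

Each span can then be passed on to feature acquisition; a real system would usually filter the spans first, for example by part of speech.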
As can be seen from Fig. 1, the method comprises the following steps.
Step S110: obtain at least two features of a candidate term. According to one embodiment of the invention, these features comprise a head-and-tail word or character feature and an indicator word feature.
Head and tail words or characters are the words or characters with which a term begins and ends. The inventors noticed that, in some technical fields, the words or characters at the beginning and end of a term may strongly indicate the term. In the field of biology, for example, the head and tail words "scanning" and "microscope" of the term "scanning tunneling microscope" are strongly indicative: from these two words it can largely be determined that they, together with the words between them, form a term of the field. Going further, the first and last characters of the Chinese term, which mean "sweep" and "mirror" respectively, are also fairly indicative, that is, they indicate that these characters, together with the content between them, can form a term of the field. Term extraction based on head and tail words or characters can therefore improve the accuracy of term extraction.
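A minimal sketch of how the head-and-tail word and character features described above might be read off a segmented candidate. The function name `head_tail_features`, the feature keys, and the particular Chinese segmentation of "scanning tunneling microscope" are assumptions for illustration only.

```python
from typing import Dict, List


def head_tail_features(term_words: List[str]) -> Dict[str, str]:
    """Collect the first and last word of a candidate, plus the first
    character of the first word and the last character of the last word."""
    first, last = term_words[0], term_words[-1]
    return {
        "head_word": first,
        "tail_word": last,
        "head_char": first[0],
        "tail_char": last[-1],
    }


# Assumed segmentation of the Chinese term for "scanning tunneling microscope".
features = head_tail_features(["扫描", "隧道", "显微镜"])
```

Word-level and character-level features are kept as separate keys so that a downstream extractor can use either or both.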
The other feature is the indicator word feature. The inventors noticed that the words occurring immediately before and after a term give some indication of the term's boundaries, and such words are therefore defined as indicator words. Using the indicator word feature likewise helps to improve the accuracy of term extraction.
Indicator words can be obtained with a lexicon in which various indicator words are predefined; indicator words are then identified by looking them up in the lexicon. Alternatively, a classifier can be trained on an annotated corpus and the trained classifier used to extract indicator words. These methods are well known to those skilled in the art and are not described in detail here.
The inventors found that front indicator words and back indicator words (that is, indicator words occurring before and after a term, respectively) have different characteristics, namely they mark different positions relative to the term. A front indicator word suggests that what follows it may be a term, whereas a back indicator word suggests that what precedes it may be a term. Distinguishing front from back indicator words characterizes the related term more clearly and also delimits its boundaries more precisely. It is therefore preferable to distinguish front and back indicator words. For example, if an indicator word is determined to be a front indicator word, the words after it may be a term; if it is determined to be a back indicator word, the words before it may be a term. It is further preferable to treat the beginning and end of a sentence as special indicator words: the sentence-start marker can serve as a special front indicator word and the sentence-end marker as a special back indicator word.
Consider, for example, the sentence "The scanning tunneling microscope is a high-resolution microscope based on the quantum tunneling effect", in which "scanning tunneling microscope", "quantum tunneling effect", and "high-resolution microscope" are terms of the field of biology. In an embodiment according to the invention, the front indicator words comprise the sentence-start marker, "based on", and "", and the back indicator words comprise the sentence-end marker, "is", and "". With these front and back indicator word features, term extraction can be carried out effectively.
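The front and back indicator word check, including the sentence boundaries as special indicator words, could be sketched as follows. The marker strings `<s>` and `</s>`, the function name `indicator_features`, the lexicon contents, and the English tokenization are all assumptions chosen for this example rather than details given in the source.

```python
from typing import Dict, List, Set

BOS, EOS = "<s>", "</s>"  # hypothetical sentence-start and sentence-end markers


def indicator_features(tokens: List[str], start: int, end: int,
                       front_words: Set[str], back_words: Set[str]) -> Dict[str, bool]:
    """Check whether the token just before tokens[start:end] is a front
    indicator word and whether the token just after it is a back indicator word."""
    padded = [BOS] + tokens + [EOS]  # padded[i + 1] corresponds to tokens[i]
    before = padded[start]           # token preceding the candidate, or BOS
    after = padded[end + 1]          # token following the candidate, or EOS
    return {
        "front_indicator": before in front_words,
        "back_indicator": after in back_words,
    }


# Hypothetical lexicons; the sentence markers act as special indicator words.
front_lexicon = {BOS, "based_on"}
back_lexicon = {EOS, "is"}

tokens = ["scanning", "tunneling", "microscope", "is", "a", "high-resolution",
          "microscope", "based_on", "quantum", "tunneling", "effect"]

first_term = indicator_features(tokens, 0, 3, front_lexicon, back_lexicon)
last_term = indicator_features(tokens, 8, 11, front_lexicon, back_lexicon)
non_term = indicator_features(tokens, 3, 5, front_lexicon, back_lexicon)
```

Keeping two separate lexicons reflects the preference stated above for distinguishing front indicator words from back indicator words.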
It should be noted that other features of the candidate term can of course also be obtained in step S110. For example, a word-frequency feature can be obtained, which indicates how many times a word or phrase occurs in a text of a given size. The part-of-speech feature of the candidate term can also be obtained. For example, words in Modern Chinese can be divided into 12 parts of speech: the content words comprise nouns, verbs, adjectives, numerals, measure words, and pronouns, and the function words comprise adverbs, prepositions, conjunctions, particles, onomatopoeia, and interjections. Word-frequency and part-of-speech features are well known to those skilled in the art and are therefore not described further here.
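As a rough illustration of these two optional features, the sketch below counts how often a candidate occurs in a small tokenized corpus and joins its part-of-speech tags into a pattern string. The function names, the tag labels, and the toy corpus are assumptions for this example; the source does not fix any particular representation.

```python
from typing import Dict, List, Sequence


def count_occurrences(candidate: Sequence[str], sentences: List[List[str]]) -> int:
    """Count how many times the candidate token sequence occurs in the corpus."""
    target = list(candidate)
    n = len(target)
    count = 0
    for sentence in sentences:
        for i in range(len(sentence) - n + 1):
            if sentence[i:i + n] == target:
                count += 1
    return count


def frequency_and_pos_features(candidate: Sequence[str], pos_tags: Sequence[str],
                               sentences: List[List[str]]) -> Dict[str, object]:
    """Combine a word-frequency feature with a part-of-speech pattern feature."""
    return {
        "frequency": count_occurrences(candidate, sentences),
        "pos_pattern": "+".join(pos_tags),
    }


# Hypothetical tokenized corpus and hypothetical tag labels.
corpus = [
    ["quantum", "tunneling", "effect", "is", "studied"],
    ["the", "quantum", "tunneling", "effect", "matters"],
]
features = frequency_and_pos_features(
    ["quantum", "tunneling", "effect"], ["ADJ", "NOUN", "NOUN"], corpus
)
```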
After at least two features of the candidate term have been obtained, the candidate term is extracted in step S120 based on the obtained features.
Once the above features have been obtained, the candidate term can be extracted by any of the methods commonly used in the art. For example, Fig. 2 shows a flowchart of a method for extracting terms according to another embodiment of the invention. As can be seen from Fig. 2, step S120 comprises step S1201: extracting the candidate term with a classifier based on the obtained features. In this approach, the classifier is first trained on pre-annotated seeds using the selected term features, and the trained classifier is then used to extract candidate terms. Extracting candidate terms with a classifier is well known to those skilled in the art and is therefore not described in detail.
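The source does not name a specific classifier. Purely for illustration, the sketch below trains a small naive Bayes model with add-one smoothing on annotated seeds and then uses it to decide whether a candidate is a term. The class name `NaiveBayesTermClassifier`, the feature keys, and the seed data are assumptions, and naive Bayes is chosen here only for brevity.

```python
import math
from collections import Counter, defaultdict
from typing import Dict, List, Tuple

Features = Dict[str, str]


class NaiveBayesTermClassifier:
    """Toy categorical naive Bayes; both labels must occur in the seeds."""

    def fit(self, seeds: List[Tuple[Features, bool]]) -> "NaiveBayesTermClassifier":
        self.label_counts = Counter(label for _, label in seeds)
        self.value_counts = defaultdict(Counter)  # (label, feature name) -> value counts
        self.vocab = defaultdict(set)             # feature name -> values seen in training
        for feats, label in seeds:
            for name, value in feats.items():
                self.value_counts[(label, name)][value] += 1
                self.vocab[name].add(value)
        return self

    def _log_score(self, feats: Features, label: bool) -> float:
        total = sum(self.label_counts.values())
        score = math.log(self.label_counts[label] / total)
        for name, value in feats.items():
            counts = self.value_counts[(label, name)]
            # Add-one smoothing; the extra +1 reserves mass for unseen values.
            denom = sum(counts.values()) + len(self.vocab[name]) + 1
            score += math.log((counts[value] + 1) / denom)
        return score

    def is_term(self, feats: Features) -> bool:
        return self._log_score(feats, True) > self._log_score(feats, False)


# Hypothetical annotated seeds: (features, is_term).
seeds = [
    ({"head_word": "scanning", "tail_word": "microscope", "front": "yes", "back": "yes"}, True),
    ({"head_word": "quantum", "tail_word": "effect", "front": "yes", "back": "yes"}, True),
    ({"head_word": "is", "tail_word": "a", "front": "no", "back": "no"}, False),
    ({"head_word": "based", "tail_word": "on", "front": "no", "back": "no"}, False),
]
classifier = NaiveBayesTermClassifier().fit(seeds)
unseen = {"head_word": "electron", "tail_word": "microscope", "front": "yes", "back": "yes"}
accepted = classifier.is_term(unseen)
```

Any other classifier that accepts the same feature dictionaries could be substituted without changing the surrounding steps.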
Because the method for extracting terms according to an embodiment of the invention extracts based on several features of the candidate term, it avoids the shortcomings of extraction methods based on a single feature. In particular, extracting terms based on the head-and-tail word or character feature and the indicator word feature improves the accuracy of term extraction.
It should be noted that, although the above embodiment describes both the head-and-tail word or character feature and the indicator word feature, those skilled in the art will readily understand that either feature can be used on its own in combination with other features; the invention is not limited to using both features together.
Correspondingly, one embodiment of the invention proposes a device for extracting terms. Fig. 3 shows a schematic diagram of this device.
Likewise, the candidate terms have been obtained in advance from an external source. The candidate terms can be obtained by any method, including methods known in the prior art. For example, a raw corpus can first be obtained and then subjected to various kinds of pre-processing, such as sentence splitting, word segmentation, and part-of-speech tagging, to obtain the candidate terms. Existing natural language processing methods can be used for this pre-processing.
How candidate terms are obtained is well known to those skilled in the art and is therefore not described in detail here.
As can be seen from Fig. 3, the device comprises a feature acquirer 310 and a term extractor 320.
The feature acquirer 310 is configured to obtain at least two features of a candidate term. According to one embodiment of the invention, these features comprise a head-and-tail word or character feature and an indicator word feature.
Head and tail words or characters are the words or characters with which a term begins and ends. The inventors noticed that, in some specific technical fields, the words or characters at the beginning and end of a term may strongly indicate the term. Term extraction based on head and tail words or characters can therefore improve the accuracy of term extraction. For details of this feature, see the description in the method section above, which is not repeated here.
The other feature is the indicator word feature. The inventors noticed that the words occurring immediately before and after a term give some indication of the term's boundaries, and such words are therefore defined as indicator words. Using the indicator word feature likewise helps to improve the accuracy of term extraction. In particular, the inventors found that distinguishing front indicator words from back indicator words characterizes the related term more clearly and also delimits its boundaries more precisely. It is therefore preferable to distinguish front and back indicator words. It is further preferable to treat the beginning and end of a sentence as special indicator words: the sentence-start marker can serve as a special front indicator word and the sentence-end marker as a special back indicator word. For details of the indicator word feature, see the description in the method section above, which is not repeated here.
It should be noted that the feature acquirer 310 can of course also be configured to obtain other features of the candidate term, for example a word-frequency feature or a part-of-speech feature. Word-frequency and part-of-speech features are well known to those skilled in the art and are therefore not described further here.
After at least two features of the candidate term have been obtained, the term extractor 320 extracts the candidate term based on the obtained features.
The term extractor 320 can extract the candidate term by any of the methods commonly used in the art. For example, Fig. 4 shows a schematic diagram of a device for extracting terms according to another embodiment of the invention. As can be seen from Fig. 4, the term extractor 320 comprises a classifier 3201, so that the candidate term is extracted with a classifier based on the obtained features. The classifier 3201 is first trained on pre-annotated seeds using the selected term features and is then used to extract candidate terms. Extracting candidate terms with a classifier is well known to those skilled in the art and is therefore not described in detail.
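One way to picture the division of labour between the feature acquirer 310 and the term extractor 320 is the sketch below. The class names, the `Candidate` record, the lexicon contents, and the rule-based `toy_classifier` that stands in for a trained classifier 3201 are all assumptions made for this example; the patent describes the units only at block-diagram level.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Features = Dict[str, object]


@dataclass
class Candidate:
    words: List[str]  # the candidate term itself
    before: str       # token preceding the candidate (or a sentence-start marker)
    after: str        # token following the candidate (or a sentence-end marker)


def head_tail(c: Candidate) -> Features:
    return {"head_word": c.words[0], "tail_word": c.words[-1]}


def indicators(c: Candidate) -> Features:
    # Hypothetical front and back indicator lexicons.
    return {
        "front_indicator": c.before in {"<s>", "based_on"},
        "back_indicator": c.after in {"</s>", "is"},
    }


class FeatureAcquirer:
    """Plays the role of unit 310: gathers features from at least two sources."""

    def __init__(self, extractors: List[Callable[[Candidate], Features]]):
        if len(extractors) < 2:
            raise ValueError("at least two feature extractors are required")
        self.extractors = extractors

    def acquire(self, candidate: Candidate) -> Features:
        features: Features = {}
        for extractor in self.extractors:
            features.update(extractor(candidate))
        return features


class TermExtractor:
    """Plays the role of unit 320: wraps a classifier (unit 3201)."""

    def __init__(self, classifier: Callable[[Features], bool]):
        self.classifier = classifier

    def extract(self, candidates: List[Candidate], acquirer: FeatureAcquirer) -> List[Candidate]:
        return [c for c in candidates if self.classifier(acquirer.acquire(c))]


def toy_classifier(features: Features) -> bool:
    # Stand-in rule, not a trained model: accept when both indicators fire.
    return bool(features["front_indicator"] and features["back_indicator"])


acquirer = FeatureAcquirer([head_tail, indicators])
extractor = TermExtractor(toy_classifier)
candidates = [
    Candidate(["scanning", "tunneling", "microscope"], "<s>", "is"),
    Candidate(["is", "a"], "microscope", "high-resolution"),
]
terms = extractor.extract(candidates, acquirer)
```

Passing the classifier in as a callable keeps the extractor independent of how the classifier was trained.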
Because the device for extracting terms according to an embodiment of the invention extracts based on several features of the candidate term, it avoids the shortcomings of extraction based on a single feature. In particular, extracting terms based on the head-and-tail word or character feature and the indicator word feature improves the accuracy of term extraction.
It should be noted that, although the above embodiment describes the feature acquirer 310 as obtaining both the head-and-tail word or character feature and the indicator word feature, those skilled in the art will readily understand that either feature can be used on its own in combination with other features; the invention is not limited to using both features together.
Each constituent module and unit of the above device can be configured by software, firmware, hardware, or a combination thereof. The specific means or manner of configuration are well known to those skilled in the art and are not repeated here. When implemented by software or firmware, a program constituting the software is installed from a storage medium or a network onto a computer having a dedicated hardware structure (for example, the general-purpose computer 500 shown in Fig. 5), which is capable of performing various functions when various programs are installed.
In Fig. 5, a central processing unit (CPU) 501 performs various processes according to a program stored in a read-only memory (ROM) 502 or a program loaded from a storage section 508 into a random access memory (RAM) 503. The RAM 503 also stores, as needed, data required when the CPU 501 performs the various processes. The CPU 501, the ROM 502, and the RAM 503 are connected to one another via a bus 504. An input/output interface 505 is also connected to the bus 504.
The following components are connected to the input/output interface 505: an input section 506 (including a keyboard, a mouse, and the like), an output section 507 (including a display, such as a cathode ray tube (CRT) or a liquid crystal display (LCD), and a speaker), a storage section 508 (including a hard disk and the like), and a communication section 509 (including a network interface card such as a LAN card, a modem, and the like). The communication section 509 performs communication via a network such as the Internet. A drive 510 can also be connected to the input/output interface 505 as needed. A removable medium 511, such as a magnetic disk, an optical disk, a magneto-optical disk, or a semiconductor memory, is mounted on the drive 510 as needed, so that a computer program read from it is installed into the storage section 508 as needed.
When the above series of processes is implemented by software, the program constituting the software is installed from a network such as the Internet or from a storage medium such as the removable medium 511.
Those skilled in the art will understand that this storage medium is not limited to the removable medium 511 shown in Fig. 5, which stores the program and is distributed separately from the device in order to provide the program to the user. Examples of the removable medium 511 include a magnetic disk (including a floppy disk (registered trademark)), an optical disk (including a compact disc read-only memory (CD-ROM) and a digital versatile disc (DVD)), a magneto-optical disk (including a MiniDisc (MD) (registered trademark)), and a semiconductor memory. Alternatively, the storage medium can be the ROM 502, a hard disk contained in the storage section 508, or the like, in which the program is stored and which is distributed to the user together with the device containing it.
The invention also proposes a program product storing machine-readable instruction code. When the instruction code is read and executed by a machine, the above method according to embodiments of the invention can be performed. Accordingly, a storage medium carrying the program product storing the machine-readable instruction code is also included in the disclosure of the invention. The storage medium includes, but is not limited to, a floppy disk, an optical disk, a magneto-optical disk, a memory card, a memory stick, and the like.
Finally, it should also be noted that the terms "comprise", "include", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Furthermore, in the absence of further limitations, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in the process, method, article, or device comprising that element.
Although embodiments of the invention have been described in detail above with reference to the accompanying drawings, it should be understood that the embodiments described above serve only to illustrate the invention and do not limit it. Those skilled in the art can make various modifications and changes to the above embodiments without departing from the spirit and scope of the invention. The scope of the invention is therefore defined only by the appended claims and their equivalents.
As the above description shows, embodiments of the invention provide the following solutions:
Supplementary note 1. A method for extracting terms, comprising:
- obtaining at least two features of a candidate term;
- extracting the candidate term based on the obtained features;
wherein the features comprise a head-and-tail word or character feature and an indicator word feature.
Supplementary note 2. The method according to supplementary note 1, wherein the head and tail words or characters are the words or characters with which a term begins and ends.
Supplementary note 3. The method according to supplementary note 1, wherein the indicator words are words that occur before and after a term and indicate the boundaries of the term.
Supplementary note 4. The method according to any one of supplementary notes 1 to 3, wherein the features further comprise a word-frequency feature and/or a part-of-speech feature.
Supplementary note 5. The method according to any one of supplementary notes 1 to 3, wherein the step of extracting the candidate term based on the obtained features comprises: extracting the candidate term with a classifier based on the features.
Supplementary note 6. A device for extracting terms, comprising:
a feature acquirer configured to obtain at least two features of a candidate term; and
a term extractor configured to extract the candidate term based on the obtained features;
wherein the features comprise a head-and-tail word or character feature and an indicator word feature.
Supplementary note 7. The device according to supplementary note 6, wherein the head and tail words or characters are the words or characters with which a term begins and ends.
Supplementary note 8. The device according to supplementary note 6, wherein the indicator words are words that occur before and after a term and indicate the boundaries of the term.
Supplementary note 9. The device according to any one of supplementary notes 6 to 8, wherein the features further comprise a word-frequency feature and/or a part-of-speech feature.
Supplementary note 10. The device according to any one of supplementary notes 6 to 8, wherein the term extractor extracts the candidate term with a classifier based on the features.
Supplementary note 11. A program product comprising machine-executable instructions which, when executed on an information processing device, cause the information processing device to carry out the method according to any one of supplementary notes 1 to 5.
Supplementary note 12. A storage medium comprising machine-readable program code which, when executed on an information processing device, causes the information processing device to carry out the method according to any one of supplementary notes 1 to 5.

Claims (10)

1. A method for extracting terms, comprising:
- obtaining at least two features of a candidate term;
- extracting the candidate term based on the obtained features;
wherein the features comprise a head-and-tail word or character feature and an indicator word feature.
2. The method according to claim 1, wherein the head and tail words or characters are the words or characters with which a term begins and ends.
3. The method according to claim 1, wherein the indicator words are words that occur before and after a term and indicate the boundaries of the term.
4. The method according to any one of claims 1 to 3, wherein the features further comprise a word-frequency feature and/or a part-of-speech feature.
5. The method according to any one of claims 1 to 3, wherein the step of extracting the candidate term based on the obtained features comprises: extracting the candidate term with a classifier based on the features.
6. A device for extracting terms, comprising:
a feature acquirer configured to obtain at least two features of a candidate term; and
a term extractor configured to extract the candidate term based on the obtained features;
wherein the features comprise a head-and-tail word or character feature and an indicator word feature.
7. The device according to claim 6, wherein the head and tail words or characters are the words or characters with which a term begins and ends.
8. The device according to claim 6, wherein the indicator words are words that occur before and after a term and indicate the boundaries of the term.
9. The device according to any one of claims 6 to 8, wherein the features further comprise a word-frequency feature and/or a part-of-speech feature.
10. The device according to any one of claims 6 to 8, wherein the term extractor extracts the candidate term with a classifier based on the features.
CN2010102826910A 2010-09-09 2010-09-09 Term extraction method and device Pending CN102402501A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN2010102826910A CN102402501A (en) 2010-09-09 2010-09-09 Term extraction method and device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN2010102826910A CN102402501A (en) 2010-09-09 2010-09-09 Term extraction method and device

Publications (1)

Publication Number Publication Date
CN102402501A (en) 2012-04-04

Family

ID=45884721

Family Applications (1)

Application Number Title Priority Date Filing Date
CN2010102826910A Pending CN102402501A (en) 2010-09-09 2010-09-09 Term extraction method and device

Country Status (1)

Country Link
CN (1) CN102402501A (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20020076112A1 (en) * 2000-12-18 2002-06-20 Philips Electronics North America Corporation Apparatus and method of program classification based on syntax of transcript information
CN101122919A (en) * 2007-09-14 2008-02-13 中国科学院计算技术研究所 Professional term extraction method and system
CN101320374A (en) * 2008-07-10 2008-12-10 昆明理工大学 Field question classification method combining syntax structural relationship and field characteristic
CN101599071A (en) * 2009-07-10 2009-12-09 华中科技大学 The extraction method of conversation text topic
CN101655866A (en) * 2009-08-14 2010-02-24 北京中献电子技术开发中心 Automatic decimation method of scientific and technical terminology
CN101799849A (en) * 2010-03-17 2010-08-11 哈尔滨工业大学 Method for realizing non-barrier automatic psychological consult by adopting computer

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
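Liu Kang et al.: "Research on sentence sentiment analysis based on a cascaded CRFs model" (基于层叠CRFs模型的句子褒贬度分析研究), Journal of Chinese Information Processing (《中文信息学报》) *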

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104572621A (en) * 2015-01-05 2015-04-29 语联网(武汉)信息技术有限公司 Decision tree based term judgment method
CN104572621B (en) * 2015-01-05 2018-01-26 语联网(武汉)信息技术有限公司 A kind of term decision method based on decision tree
CN107463548A (en) * 2016-06-02 2017-12-12 阿里巴巴集团控股有限公司 Short phrase picking method and device

Similar Documents

Publication Publication Date Title
Rahman et al. Content extraction from html documents
JP4463256B2 (en) System and method for providing automatically completed recommended words that link multiple languages
CN103123618B (en) Text similarity acquisition methods and device
US8041559B2 (en) System and method for disambiguating non diacritized arabic words in a text
US20100023318A1 (en) Method and device for retrieving data and transforming same into qualitative data of a text-based document
US11797607B2 (en) Method and apparatus for constructing quality evaluation model, device and storage medium
CN106777275A (en) Entity attribute and property value extracting method based on many granularity semantic chunks
Sedláček et al. A new Czech morphological analyser ajka
CN103309926A (en) Chinese and English-named entity identification method and system based on conditional random field (CRF)
CN103678285A (en) Machine translation method and machine translation system
CN113158653B (en) Training method, application method, device and equipment for pre-training language model
CN101464898A (en) Method for extracting feature word of text
CN102681981A (en) Natural language lexical analysis method, device and analyzer training method
CN100429648C (en) Automatic segmentation of texts comprising chunks without separators
JP2001092485A (en) Method for registering speech information, method for determining recognized character string, speech recognition device, recording medium in which software product for registering speech information is stored, and recording medium in which software product for determining recognized character string is stored
CN108681547A (en) A kind of web content converting method and device based on small routine
CN103514151A (en) Dependency grammar analysis method and device and auxiliary classifier training method
CN108304387B (en) Method, device, server group and storage medium for recognizing noise words in text
KR101070371B1 (en) Apparatus and Method for Words Sense Disambiguation Using Korean WordNet and its program stored recording medium
CN103678270B (en) Semantic primitive abstracting method and semantic primitive extracting device
CN110737770B (en) Text data sensitivity identification method and device, electronic equipment and storage medium
JP7040227B2 (en) Information processing programs, information processing methods, and information processing equipment
CN112134858B (en) Sensitive information detection method, device, equipment and storage medium
CN102402501A (en) Term extraction method and device
CN107862045A (en) A kind of across language plagiarism detection method based on multiple features

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20120404