CN106844326A - A method and device for obtaining words - Google Patents

A method and device for obtaining words

Info

Publication number
CN106844326A
Authority
CN
China
Prior art keywords
sentence
participle
candidate
independent
whole
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510886318.9A
Other languages
Chinese (zh)
Inventor
钦滨杰
陈晓敏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Gridsum Technology Co Ltd
Original Assignee
Beijing Gridsum Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Gridsum Technology Co Ltd
Priority to CN201510886318.9A
Publication of CN106844326A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a method and device for obtaining words and relates to the field of computer technology. Its main purpose is to improve the accuracy of extracting whole-part relations between corpus words by marking the words with domain information. The main technical scheme is: pre-process the obtained text data to obtain independent sentences with word segmentation information; among the independent sentences, screen out candidate sentences with a parallel structure using a structure template; using a domain dictionary and the word segmentation information in the candidate sentences, determine the domain words with a parallel structure in the candidate sentences, the domain dictionary being a dictionary that records words of the same domain; and, according to the position features of the domain words, output a set of domain words with a whole-part relation. The invention is mainly used to obtain the words in a text that stand in a whole-part relation.

Description

A method and device for obtaining words
Technical field
The present invention relates to the field of computer technology, and in particular to a method and device for obtaining words.
Background
With the development of network technology, the volume of data keeps growing, and extracting information from it requires more effective text classification techniques. Existing mature text classification techniques work relatively well on English text but not on Chinese text. The reason is that the role of semantic factors in Chinese text cannot be ignored. There are two most basic kinds of semantic relation: 1. the relation between a superordinate concept and a subordinate concept, where the subordinate concept appears only to limit the extension of the superordinate concept; 2. the predication relation, the most basic relation of all, in which one basic lexical unit makes a statement about another. Most grammatical forms arise to express these relations.
Among relations between superordinate and subordinate concepts, the most common is the whole-part relation: a whole usually has a structure, and its parts are separable and each have a specific function. In current classification processing, whole-part relation words are generally extracted using fixed patterns, including lexical and syntactic patterns, to determine the whole-part relation between words. For example, one method based on parallel structures obtains part-whole relations from web pages: it uses whole-part relation patterns to fetch corpus data from Google, matches sentences with a parallel structure, extracts from them the part concepts of a given whole concept, and clusters the candidate part concepts automatically with a hierarchical clustering algorithm to determine the words that have a whole-part relation. However, the corpus data matched by this parallel-structure approach only agrees with the template in form; its actual content is not necessarily a whole-part relation, so the extraction accuracy of this approach is low.
Summary of the invention
In view of this, the present invention provides a method and device for obtaining words, whose main purpose is to improve the accuracy of extracting whole-part relations between corpus words by marking the words with domain information.
To achieve the above purpose, the present invention mainly provides the following technical scheme:
In one aspect, the invention provides a method for obtaining words, the method including:
Pre-processing the obtained text data to obtain independent sentences with word segmentation information;
Among the independent sentences, screening out candidate sentences with a parallel structure using a structure template;
Using a domain dictionary and the word segmentation information in the candidate sentences, determining the domain words with a parallel structure in the candidate sentences, the domain dictionary being a dictionary that records words of the same domain;
According to the position features of the domain words, outputting a set of domain words with a whole-part relation.
In another aspect, the invention provides a device for obtaining words, the device including:
A pre-processing unit, for pre-processing the obtained text data to obtain independent sentences with word segmentation information;
A screening unit, for screening out, among the independent sentences obtained by the pre-processing unit, candidate sentences with a parallel structure using a structure template;
A determining unit, for using a domain dictionary and the word segmentation information in the candidate sentences to determine the domain words with a parallel structure in the candidate sentences selected by the screening unit;
An output unit, for outputting a set of domain words with a whole-part relation according to the position features of the domain words determined by the determining unit.
According to the method and device for obtaining words proposed by the invention, the text corpus is segmented into words and split into sentences, and structure templates are used to screen out candidate sentences with a parallel structure. This gives a preliminary selection of candidate sentences in the corpus whose parallel structure may express a whole-part relation. The word segmentation information in the candidate sentences and the selected domain dictionary are then used to judge whether the words with a parallel structure belong to the same domain; if so, the whole-part relation between the words can be determined from their positions in the sentence and output with the corresponding relation. Compared with the existing fixed-template matching used to judge whole-part relations, the method of the invention adds a further check on the words in the sentence, confirming that the words with a parallel structure belong to the same domain, so that the actual content of the words is taken into account and purely formal extraction is avoided. The positional relation between the words then determines which words are whole-domain words and which are part-domain words, which further improves the accuracy of extracting whole-part relations between words.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from reading the detailed description of the preferred embodiments below. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference signs denote the same parts. In the drawings:
Fig. 1 shows a flowchart of a method for obtaining words proposed by an embodiment of the invention;
Fig. 2 shows a flowchart of another method for obtaining words proposed by an embodiment of the invention;
Fig. 3 shows a block diagram of a device for obtaining words proposed by an embodiment of the invention;
Fig. 4 shows a block diagram of another device for obtaining words proposed by an embodiment of the invention.
Detailed description
Exemplary embodiments of the invention are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the invention, it should be understood that the invention may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the invention can be understood more thoroughly and so that its scope can be fully conveyed to those skilled in the art.
An embodiment of the invention provides a method for obtaining words. As shown in Fig. 1, the method is used to obtain the words with a whole-part relation in a text corpus, and the specific steps include:
101. Pre-process the obtained text data to obtain independent sentences with word segmentation information.
In this embodiment, the obtained text data is the corpus data from which words with a whole-part relation are to be extracted; it can be text data of different domains or topics chosen from different corpora. Pre-processing means dividing a long passage or a whole document into short units, such as sentences or phrases, that are easy to process. The text can be subdivided with text-processing techniques such as word segmentation and sentence splitting. Because these techniques are already widely used, this embodiment does not describe them in detail and does not restrict the specific word segmentation or sentence splitting method. The goal is to obtain independent sentences with word segmentation information. An independent sentence is a single sentence with a complete structure or form. Word segmentation information is the result of segmenting that sentence, such as which words the sentence contains and the position of each word in the sentence.
102. Screen out candidate sentences with a parallel structure using a structure template.
The independent sentences obtained in step 101 are screened with a structure template to select those with a parallel structure. A structure template is a template preset in the system for judging sentence structure; in this embodiment, the structure template used is one that judges whether an independent sentence has a parallel structure. A parallel structure within an independent sentence can be a juxtaposition of words or a juxtaposition of phrases, and this embodiment does not restrict the specific kind. The judgment is based only on the sentence structure of the independent sentence, and the independent sentences that have a parallel structure are taken as candidate sentences.
103. Using a domain dictionary and the word segmentation information in the candidate sentences, determine the domain words with a parallel structure in the candidate sentences.
A domain dictionary is a dictionary that records words of the same domain. Words or phrases in a whole-part relation necessarily belong to the same domain, so judging whether the words with a parallel structure in a candidate sentence belong to the same domain is a precondition for judging whether those words can form a whole-part relation. That is, if two words with a parallel structure do not belong to the same domain, they cannot stand in a whole-part relation with each other or with the other words in the sentence. In this step, besides determining the domain of the words with a parallel structure in the candidate sentence, it is also necessary to determine whether the other words in the sentence belong to the domain dictionary, so that the whole-part relation between words can subsequently be determined from the words' specific position information.
104. According to the position features of the domain words, output a set of domain words with a whole-part relation.
The position feature of a domain word is the position of the word in the independent sentence, recorded during the word segmentation described above; the whole-part relation between words is judged from this position information. For example, in the sentence "an automobile includes an engine, a gearbox, tires, etc.", "engine", "gearbox", and "tire" are the words with a parallel structure, and they belong to the automotive domain. By judging the positional relation between "automobile" and "engine", "gearbox", and "tire", it can be determined that "automobile" stands in a whole-part relation with "engine", "gearbox", and "tire".
After all the whole-part word combinations in the text have been obtained, a domain word set containing all of the whole-part relations is output.
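Steps 101 to 104 can be pictured on the automobile example with a minimal Python sketch. Everything in it is an illustrative assumption rather than the patented implementation: the function name `extract_whole_part`, the single Chinese pattern built around 包括 ("includes"), and the toy dictionary `AUTO_DICT`. Word segmentation is reduced here to splitting on the enumeration comma.

```python
import re

# Hypothetical toy domain dictionary for the automotive domain.
AUTO_DICT = {"汽车", "发动机", "变速箱", "轮胎"}

# Assumed structure template: "<whole>包括<part>、<part>...等"
TEMPLATE = re.compile(r"^(?P<whole>.+?)包括(?P<parts>.+、.+?)(?:等.*)?$")


def extract_whole_part(sentence, domain_dict):
    """Return (whole, parts) if the sentence fits the template and every
    word involved is in the domain dictionary, otherwise None."""
    match = TEMPLATE.match(sentence.rstrip("。！？"))
    if match is None:
        return None
    whole = match["whole"]
    parts = match["parts"].split("、")
    # Same-domain check: the whole and all parallel items must be listed.
    if whole in domain_dict and all(part in domain_dict for part in parts):
        return whole, parts
    return None


print(extract_whole_part("汽车包括发动机、变速箱、轮胎等。", AUTO_DICT))
print(extract_whole_part("手机包括苹果、香蕉、橙子。", AUTO_DICT))
```

The second sentence fits the template in form but fails the dictionary check, which is the kind of false match the domain dictionary is meant to filter out.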
As the above implementation shows, in the method for obtaining words used by this embodiment, the text corpus is segmented into words and split into sentences, and structure templates are used to screen out candidate sentences with a parallel structure. This gives a preliminary selection of candidate sentences in the corpus whose parallel structure may express a whole-part relation. The word segmentation information in the candidate sentences and the selected domain dictionary are then used to judge whether the words with a parallel structure belong to the same domain; if so, the whole-part relation between the words can be determined from their positions in the sentence and output with the corresponding relation. Compared with the existing fixed-template matching used to judge whole-part relations, the method of this embodiment adds a further check on the words in the sentence, confirming that the words with a parallel structure belong to the same domain, so that the actual content of the words is taken into account and purely formal extraction is avoided. The positional relation between the words then determines which words are whole-domain words and which are part-domain words, which further improves the accuracy of extracting whole-part relations between words.
To explain the above method in more detail, an embodiment of the invention also proposes a method for obtaining words. As shown in Fig. 2, when extracting words the method includes the following steps:
201. Split the obtained text data into sentences to obtain the independent sentences.
The obtained text data is split into sentences. The simplest way is to examine the punctuation in the text: marks that can end an independent sentence, such as the full stop, exclamation mark, and question mark, serve as the splitting criterion, while commas, enumeration commas, semicolons, and similar marks must not be used to split sentences. This completes the sentence splitting of the text.
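The punctuation rule in step 201 can be sketched as follows. The function name `split_sentences` and the exact set of end marks are assumptions for illustration; the patent does not prescribe a particular splitter.

```python
import re

# Marks that end an independent sentence (full-width and ASCII variants).
# Commas, enumeration commas and semicolons are deliberately left out,
# so they never cause a split.
SENTENCE_END = "。！？!?"

_SENTENCE = re.compile(rf"[^{SENTENCE_END}]+[{SENTENCE_END}]?")


def split_sentences(text):
    """Split text into independent sentences, keeping each end mark."""
    return [s.strip() for s in _SENTENCE.findall(text) if s.strip()]


for sentence in split_sentences("手机包括处理器、内存、屏幕。电脑很快，但是很贵！好吗？"):
    print(sentence)
```

Keeping the end mark attached to each sentence is a design choice of this sketch; later steps can strip it before template matching.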
202. Segment the independent sentences into words, and mark the resulting word segmentation information in the independent sentences.
After sentence splitting, the resulting independent sentences must also be segmented into words, and the segmentation result is marked in each independent sentence so that later processing can read it. The segmentation result includes the specific words and the position of each word in the independent sentence.
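The patent leaves the segmentation algorithm open. As one possible illustration, the sketch below uses forward maximum matching against a small vocabulary and records each word together with its start offset, which is the kind of position information step 202 calls for. The function name `segment`, the `max_len` parameter, and the vocabulary are all hypothetical.

```python
from typing import List, Set, Tuple


def segment(sentence: str, vocab: Set[str], max_len: int = 4) -> List[Tuple[str, int]]:
    """Forward maximum matching: return (word, start_offset) pairs."""
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append((piece, i))
                i += size
                break
    return tokens


vocab = {"手机", "包括", "处理器", "内存"}
print(segment("手机包括处理器、内存", vocab))
```

A production system would more likely use an established segmenter; the point here is only the shape of the output, a list of words with their positions.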
203. Extract the independent sentences with a parallel structure using feature symbols.
In this embodiment, a feature symbol indicates that an independent sentence contains a parallel structure. The feature symbols include at least one of the following: the enumeration comma and logical relation symbols. For example, the enumeration comma can be represented by "、", and a logical relation symbol can be a punctuation mark or character such as a coordination symbol (which can be represented by " ") or a choice symbol (which can be represented by "‖"). The feature symbols are used to screen the independent sentences for those with a parallel structure. The specific feature symbols can be selected according to the actual application and are not limited by the invention.
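A minimal sketch of the feature-symbol filter in step 203 follows. The default symbol set contains only the enumeration comma and the "‖" choice symbol; the name `filter_parallel` and the idea of passing the symbols as a parameter are assumptions made for illustration.

```python
# Assumed default feature symbols: the enumeration comma and a choice symbol.
DEFAULT_FEATURE_SYMBOLS = ("、", "‖")


def filter_parallel(sentences, symbols=DEFAULT_FEATURE_SYMBOLS):
    """Keep only the sentences that contain at least one feature symbol."""
    return [s for s in sentences if any(sym in s for sym in symbols)]


sentences = ["手机包括处理器、内存、屏幕。", "今天天气很好。"]
print(filter_parallel(sentences))
```

Passing `symbols` explicitly lets the symbol set be chosen per application, in line with the text's remark that the specific symbols are not fixed.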
204. Screen out candidate sentences with a whole-part relation using an affirmative template.
Among the independent sentences with a parallel structure, the affirmative template is used to select those with a whole-part relation, which are taken as candidate sentences. The affirmative template describes the sentence structures that indicate a whole-part relation in an independent sentence, and it can contain several such structures, for example: the structure ^(.*) includes (.*)、(.*)、(.*) (such as "a mobile phone includes a processor, memory, a screen, a casing, and other parts"); the structure (.*) consists of (.*)、(.*)、(.*) etc.$ (such as "a computer consists of a main unit, a display, a mouse, a keyboard, etc."); and the structure (.*)(is|as|has|is divided into)(.*)、(.*)、(.*)$ (such as "automobiles are divided into cars and lorries").
Sentence structures can be added to or removed from the affirmative template as needed, so the specific structures in the template are not limited in the embodiments of the invention.
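The affirmative template can be sketched as a list of regular expressions. The Chinese keywords below (包括, 由…等组成, 是|为|有|分为) are my reconstruction of the translated patterns and should be read as assumptions, as should the names `AFFIRMATIVE` and `matches_affirmative`.

```python
import re

# Assumed Chinese forms of the three affirmative structures in the text.
AFFIRMATIVE = [
    re.compile(r"^(.*)包括(.*)、(.*)、(.*)"),            # X includes A、B、C
    re.compile(r"(.*)由(.*)、(.*)、(.*)等组成$"),        # X consists of A、B、C etc.
    re.compile(r"(.*)(?:是|为|有|分为)(.*)、(.*)、(.*)$"),  # X is/has/is divided into A、B、C
]


def matches_affirmative(sentence):
    """True if the sentence fits at least one affirmative structure."""
    body = sentence.rstrip("。！？!?")  # drop the sentence-ending mark
    return any(pattern.search(body) for pattern in AFFIRMATIVE)


print(matches_affirmative("手机包括处理器、内存、屏幕、外壳等部件。"))
print(matches_affirmative("电脑由主机、显示器、鼠标、键盘等组成。"))
print(matches_affirmative("今天天气很好。"))
```

Because the template is a plain list, structures can be appended or removed without touching the matching function.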
Further, to improve the accuracy of judging whole-part relations, the candidate sentences that meet the affirmative template can also be checked against a negative template, so that sentences that have a parallel structure but no whole-part relation are excluded. For example, in "a mobile phone is a communication tool, a smart device, an electronic device", the items "communication tool", "smart device", and "electronic device" form a parallel structure but do not stand in a whole-part relation with "mobile phone". Candidate sentences with this kind of structure are therefore excluded. The sentence structures in the negative template include: the structure ^such as (.*)、(.*)、(.*)$; the structure ^(.*) is (.*)、(.*)、(.*)$; and the structure (.*)、(.*)、(.*)(is|as|has|is divided into)(.*)$. Candidate sentences that do not match the negative template are kept for subsequent processing.
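The negative check can be sketched in the same style. As before, the Chinese keywords (如, 是, 是|为|有|分为) are an assumed reconstruction of the translated patterns, and `NEGATIVE` and `is_excluded` are hypothetical names.

```python
import re

# Assumed Chinese forms of the three negative structures in the text.
NEGATIVE = [
    re.compile(r"^如(.*)、(.*)、(.*)$"),                  # such as A、B、C
    re.compile(r"^(.*)是(.*)、(.*)、(.*)$"),              # X is A、B、C
    re.compile(r"(.*)、(.*)、(.*)(?:是|为|有|分为)(.*)$"),  # A、B、C is/has ... X
]


def is_excluded(sentence):
    """True if the sentence fits a negative structure and must be dropped."""
    body = sentence.rstrip("。！？!?")
    return any(pattern.search(body) for pattern in NEGATIVE)


candidates = ["手机是通讯工具、智能设备、电子设备。", "手机包括处理器、内存、屏幕。"]
print([s for s in candidates if not is_excluded(s)])
```

Only the second sentence survives: the first has a parallel structure but describes categories of the phone rather than its parts.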
205. Using the domain dictionary and the word segmentation information in the candidate sentences, determine the domain words with a parallel structure in the candidate sentences.
Before domain words are determined, a domain dictionary must first be selected. The domain dictionary is generally chosen when the text is obtained, according to the content of the text, or it is selected from a provided table of available domain dictionaries. The domain dictionary contains all the words of the domain to which the text belongs. By matching the word segmentation information in a candidate sentence against the words in the domain dictionary, it can be determined which words in the candidate sentence belong to the same domain, and in particular whether the words with a parallel structure in the candidate sentence belong to the same domain; if they do, these words are determined to be domain words.
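One way to picture step 205 is shown below. The sketch assumes tokens are `(word, offset)` pairs, treats the words adjacent to an enumeration comma as the parallel items, and accepts them only if all of them are in the domain dictionary. The names `parallel_tokens` and `domain_parallel_words` are hypothetical, and the all-or-nothing rule is my reading of the text.

```python
from typing import List, Set, Tuple

Token = Tuple[str, int]  # (word, start offset in the sentence)
ENUM_COMMA = "、"


def parallel_tokens(tokens: List[Token]) -> List[Token]:
    """Tokens directly before or after an enumeration comma."""
    picked = []
    for idx, (word, pos) in enumerate(tokens):
        if word == ENUM_COMMA:
            continue
        prev_is_comma = idx > 0 and tokens[idx - 1][0] == ENUM_COMMA
        next_is_comma = idx + 1 < len(tokens) and tokens[idx + 1][0] == ENUM_COMMA
        if prev_is_comma or next_is_comma:
            picked.append((word, pos))
    return picked


def domain_parallel_words(tokens: List[Token], domain_dict: Set[str]) -> List[Token]:
    """Return the parallel items if all of them are in the domain dictionary."""
    items = parallel_tokens(tokens)
    if items and all(word in domain_dict for word, _ in items):
        return items
    return []


tokens = [("汽车", 0), ("包括", 2), ("发动机", 4), ("、", 7),
          ("变速箱", 8), ("、", 11), ("轮胎", 12)]
print(domain_parallel_words(tokens, {"汽车", "发动机", "变速箱", "轮胎"}))
```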
206. Determine the whole-domain words and part-domain words in the candidate sentences using a situation template.
The situation template is similar to the affirmative template described above. It is used to judge the specific role of a word from its position in the sentence, that is, whether the word is a whole-domain word or a part-domain word. In most cases, the words with a parallel structure are part-domain words. The relation between a whole-domain word and a part-domain word is the relation between a superordinate concept and a subordinate concept.
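The text does not spell out the contents of the situation template, so the sketch below substitutes a simple positional rule as an assumption: the parallel items are the parts, and the whole is the nearest domain word before them, or failing that the first domain word after them. The function name `split_whole_and_parts` is hypothetical.

```python
def split_whole_and_parts(domain_tokens, parallel):
    """Assumed positional rule: return (whole, parts) or None.

    domain_tokens: all (word, offset) pairs found in the domain dictionary.
    parallel: the subset of those pairs that form the parallel structure.
    """
    parallel_set = set(parallel)
    others = [tok for tok in domain_tokens if tok not in parallel_set]
    if not others or not parallel:
        return None
    first_part = min(pos for _, pos in parallel)
    before = [tok for tok in others if tok[1] < first_part]
    # Prefer the closest domain word in front of the parallel items.
    if before:
        whole = max(before, key=lambda tok: tok[1])
    else:
        whole = min(others, key=lambda tok: tok[1])
    parts = [word for word, _ in sorted(parallel, key=lambda tok: tok[1])]
    return whole[0], parts


domain_tokens = [("汽车", 0), ("发动机", 4), ("变速箱", 8), ("轮胎", 12)]
print(split_whole_and_parts(domain_tokens, domain_tokens[1:]))
```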
207. Extract the domain words with a whole-part relation.
Once the whole-domain words and part-domain words in a candidate sentence have been determined, the words can be extracted from the candidate sentence. Further, the extracted words can be corrected by removing superfluous modifiers, such as numerals, measure words, or trailing suffixes.
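The correction in step 207 might look like the following sketch. The numeral characters, the measure words, and the trailing suffix 等 ("etc.") are illustrative assumptions, as is the function name `clean_word`; a real system would need a fuller list and a guard against words that legitimately begin with a numeral character.

```python
import re

# Assumed leading numerals plus an optional measure word, e.g. "四个".
LEADING_MODIFIER = re.compile(r"^[0-9一二两三四五六七八九十百千]+[个只条台部块]?")
# Assumed trailing suffix, e.g. "等" ("etc.").
TRAILING_SUFFIX = re.compile(r"(?:等等|等)$")


def clean_word(word):
    """Strip leading numerals/measure words and a trailing suffix."""
    cleaned = TRAILING_SUFFIX.sub("", LEADING_MODIFIER.sub("", word))
    # Never return an empty string; fall back to the original word.
    return cleaned or word


for raw in ["四个轮胎", "外壳等", "发动机"]:
    print(raw, "->", clean_word(raw))
```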
208. Output the set of domain words with a whole-part relation in the form of a list.
Finally, the corrected whole-domain words and part-domain words are added to a corresponding table and output as a list. Note that the list includes the domain words with a whole-part relation extracted from all the sentences in the text; the list can therefore be regarded as a domain word set, specifically a set of domain words with whole-part correspondences.
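Collecting the per-sentence results into a single list can be sketched as below. The output shape, a list of `(whole, parts)` pairs with duplicates merged, is an assumption about what the list looks like; `build_relation_list` is a hypothetical name.

```python
from collections import defaultdict


def build_relation_list(pairs):
    """Merge (whole, parts) pairs from all sentences into one list."""
    merged = defaultdict(list)
    for whole, parts in pairs:
        for part in parts:
            if part not in merged[whole]:  # skip duplicates across sentences
                merged[whole].append(part)
    return [(whole, parts) for whole, parts in merged.items()]


extracted = [
    ("汽车", ["发动机", "轮胎"]),
    ("手机", ["屏幕"]),
    ("汽车", ["轮胎", "变速箱"]),
]
for whole, parts in build_relation_list(extracted):
    print(whole, "->", "、".join(parts))
```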
Further, as an implementation of the above method, an embodiment of the invention provides a device for obtaining words. This device embodiment corresponds to the preceding method embodiment. For ease of reading, it does not repeat the details of the method embodiment one by one, but it should be understood that the device in this embodiment can implement everything in the method embodiment. The device is arranged in equipment that analyzes text corpora, in particular a computing device that extracts words with a whole-part relation. As shown in Fig. 3, the device includes:
A pre-processing unit 31, for pre-processing the obtained text data to obtain independent sentences with word segmentation information;
A screening unit 32, for screening out, among the independent sentences obtained by the pre-processing unit 31, candidate sentences with a parallel structure using a structure template;
A determining unit 33, for using a domain dictionary and the word segmentation information in the candidate sentences to determine the domain words with a parallel structure in the candidate sentences selected by the screening unit 32, the domain dictionary being a dictionary that records words of the same domain;
An output unit 34, for outputting a set of domain words with a whole-part relation according to the position features of the domain words determined by the determining unit 33.
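The chaining of the four units can be pictured with a small Python sketch in which each unit is a callable. The class name `WordAcquisitionDevice` and the stub units in the usage example are assumptions for illustration only; the patent describes the units functionally and does not prescribe this structure.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class WordAcquisitionDevice:
    preprocess: Callable  # stands in for pre-processing unit 31
    screen: Callable      # stands in for screening unit 32
    determine: Callable   # stands in for determining unit 33
    output: Callable      # stands in for output unit 34

    def run(self, text):
        """Pass the text through the four units in order."""
        sentences = self.preprocess(text)
        candidates = self.screen(sentences)
        domain_words = self.determine(candidates)
        return self.output(domain_words)


# Stub units, just to show the data flow.
device = WordAcquisitionDevice(
    preprocess=lambda text: [s for s in text.split("。") if s],
    screen=lambda sentences: [s for s in sentences if "、" in s],
    determine=lambda candidates: candidates,
    output=lambda words: words,
)
print(device.run("甲包括乙、丙。今天晴。"))
```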
Further, as shown in Fig. 4, the pre-processing unit 31 includes:
A sentence splitting module 311, for splitting the text data into sentences to obtain the independent sentences;
A word segmentation module 312, for segmenting the independent sentences obtained by the sentence splitting module 311 into words to obtain the word segmentation information of the independent sentences;
A marking module, for marking the word segmentation information obtained by the word segmentation module 312 in the independent sentences.
Further, as shown in Fig. 4, the screening unit 32 includes:
An extraction module 321, for extracting the independent sentences with a parallel structure using feature symbols, where the feature symbols include at least one of the following: the enumeration comma and logical relation symbols;
A screening module 322, for screening out, among the independent sentences with a parallel structure extracted by the extraction module 321, candidate sentences with a whole-part relation using an affirmative template, the affirmative template describing the sentence structures that indicate a whole-part relation in an independent sentence.
Further, as shown in Fig. 4, the screening module 322 includes:
A screening submodule 3221, for using a negative template to screen the independent sentences that meet the affirmative template, the negative template describing the sentence structures in an independent sentence that do not indicate a whole-part relation;
A determining submodule 3222, for determining that the independent sentences that do not match the negative template used by the screening submodule 3221 are candidate sentences.
Further, as shown in Fig. 4, the determining unit 33 includes:
A selecting module 331, for selecting a domain dictionary;
A judging module 332, for judging, according to the word segmentation information in the candidate sentences, whether the words with a parallel structure in the candidate sentences are domain words in the domain dictionary selected by the selecting module 331;
A determining module 333, for determining that a word is a domain word when the judging module 332 judges that the word is in the domain dictionary.
Further, as shown in Fig. 4, the output unit 34 includes:
A determining module 341, for determining the whole-domain words and part-domain words in the candidate sentences using a situation template, the relation between a whole-domain word and a part-domain word being the relation between a superordinate concept and a subordinate concept;
An extraction module 342, for extracting the domain words with a whole-part relation determined by the determining module 341;
An output module 343, for outputting, in the form of a list, the set of domain words with a whole-part relation extracted by the extraction module 342.
Further, as shown in Fig. 4, the extraction module 342 includes:
A correction submodule 3421, for correcting the whole-domain words and part-domain words, the correction including: removing numerals, removing measure words, and/or removing trailing suffixes;
An extraction submodule 3422, for extracting the whole-domain words and part-domain words corrected by the correction submodule 3421.
In summary, in the method and device for obtaining words used by the embodiments of the invention, the text corpus is segmented into words and split into sentences, and structure templates are used to screen out candidate sentences with a parallel structure. This gives a preliminary selection of candidate sentences in the corpus whose parallel structure may express a whole-part relation. The word segmentation information in the candidate sentences and the selected domain dictionary are then used to judge whether the words with a parallel structure belong to the same domain; if so, the whole-part relation between the words can be determined from their positions in the sentence and output with the corresponding relation. Compared with the existing fixed-template matching used to judge whole-part relations, the method of the embodiments adds a further check on the words in the sentence, confirming that the words with a parallel structure belong to the same domain, so that the actual content of the words is taken into account and purely formal extraction is avoided. The positional relation between the words then determines which words are whole-domain words and which are part-domain words, which further improves the accuracy of extracting whole-part relations between words.
The device for obtaining words includes a processor and a memory. The pre-processing unit, screening unit, determining unit, output unit, and so on described above are stored in the memory as program units, and the processor executes the program units stored in the memory to implement the corresponding functions.
The processor contains a kernel, and the kernel retrieves the corresponding program unit from the memory. One or more kernels can be provided, and the accuracy of extracting whole-part relations between corpus words is improved by adjusting kernel parameters.
The memory may include forms of computer-readable media such as volatile memory, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code that initializes the following method steps: pre-processing the obtained text data to obtain independent sentences with word segmentation information; among the independent sentences, screening out candidate sentences with a parallel structure using a structure template; using a domain dictionary and the word segmentation information in the candidate sentences, determining the domain words with a parallel structure in the candidate sentences, the domain dictionary being a dictionary that records words of the same domain; and, according to the position features of the domain words, outputting a set of domain words with a whole-part relation.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) that contain computer-usable program code.
The present application is described with reference to flowcharts and/or block diagrams of methods, devices (systems), and computer program products according to embodiments of the present application. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in the flowcharts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, such that the instructions executed by the processor of the computer or other programmable data processing device produce means for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to operate in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means, the instruction means implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operational steps is performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.
In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may store information by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, commodity, or device that comprises a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, commodity, or device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, commodity, or device that comprises the element.
Those skilled in the art will understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) that contain computer-usable program code.
The above are merely embodiments of the present application and are not intended to limit the present application. For those skilled in the art, the present application may have various modifications and variations. Any modification, equivalent substitution, improvement, and the like made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.

Claims (10)

1. A method for obtaining words, characterized in that the method comprises:
preprocessing acquired text data to obtain independent sentences with word-segmentation information;
screening out, from the independent sentences, candidate sentences having a parallel structure by using a structure template;
determining, by using a domain lexicon and the word-segmentation information in the candidate sentences, domain words having a parallel structure in the candidate sentences, wherein the domain lexicon is a dictionary that records words of the same domain;
outputting, according to the position features of the domain words, a set of domain words having a whole-part relation.
2. The method according to claim 1, characterized in that preprocessing the acquired text data to obtain independent sentences with word-segmentation information comprises:
performing sentence splitting on the text data to obtain the independent sentences;
performing word segmentation on the independent sentences to obtain the word-segmentation information of the independent sentences;
marking the word-segmentation information in the independent sentences.
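As a non-limiting illustration, the three preprocessing steps of claim 2 might be sketched as follows. The toy lexicon, the forward-maximum-matching segmenter, and the names `split_sentences`, `segment`, and `preprocess` are assumptions made for this sketch; the application does not specify a particular segmenter.

```python
import re

# Hypothetical toy lexicon; a real system would rely on a full segmenter.
TOY_LEXICON = {"汽车", "包括", "发动机", "底盘", "车身"}


def split_sentences(text):
    """Split text into independent sentences at sentence-final punctuation."""
    parts = re.split(r"[。！？!?；;\n]+", text)
    return [p.strip() for p in parts if p.strip()]


def segment(sentence, lexicon=TOY_LEXICON, max_len=4):
    """Forward maximum matching: take the longest lexicon word at each
    position and fall back to a single character when nothing matches."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in lexicon:
                words.append(piece)
                i += size
                break
    return words


def preprocess(text):
    """Return each independent sentence marked with its segmentation."""
    return [{"sentence": s, "words": segment(s)} for s in split_sentences(text)]
```

Each returned record keeps the sentence and its word list together, which is one simple way of "marking" the segmentation information in the sentence.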
3. method according to claim 1 and 2, it is characterised in that using stay in place form screening Going out the candidate's sentence with parallel construction includes:
The independent sentence with parallel construction is extracted using characteristic symbol;Wherein, the characteristic symbol is extremely One of the following is included less:Pause mark, logical relation symbol;
In the independent sentence of the parallel construction, gone out with whole and part using template filter certainly Candidate's sentence of relation, the template certainly is used to judge have whole and part in the independent sentence The sentence structure of relation.
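The two-stage screening of claim 3 could look roughly like the sketch below. The set of conjunctions standing in for "logical relation symbols" and the single affirmative pattern are assumptions chosen for illustration only; the application does not list its templates.

```python
import re

# Characteristic symbols: the enumeration comma plus a few coordinating
# conjunctions used here as assumed "logical relation symbols".
# Caveat: single-character conjunctions can also occur inside longer words,
# so a real system would test segmented words rather than raw characters.
PARALLEL_MARK = re.compile(r"[、和及与或]")

# Hypothetical affirmative template: "<whole> 包括/包含 <parts>".
AFFIRMATIVE_TEMPLATES = [
    re.compile(r"(?P<whole>.+?)(?:包括|包含)(?P<parts>.+)"),
]


def has_parallel_structure(sentence):
    """True when the sentence contains at least one characteristic symbol."""
    return PARALLEL_MARK.search(sentence) is not None


def matches_affirmative(sentence):
    """True when any affirmative template matches the sentence."""
    return any(t.search(sentence) for t in AFFIRMATIVE_TEMPLATES)


def screen_candidates(sentences):
    """Keep parallel-structure sentences that also match an affirmative template."""
    parallel = [s for s in sentences if has_parallel_structure(s)]
    return [s for s in parallel if matches_affirmative(s)]
```

Running the cheap symbol check first narrows the set of sentences that the template patterns have to examine.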
4. The method according to claim 3, characterized in that screening out candidate sentences having a whole-part relation by using an affirmative template comprises:
filtering, by using a negative template, the independent sentences that match the affirmative template, the negative template being used to identify sentence structures having a non-whole-part relation in the independent sentences;
determining the independent sentences that do not match the negative template as the candidate sentences.
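One possible reading of the negative-template step in claim 4 is sketched below. The two patterns (explicit negation and a question form) are hypothetical examples of "non-whole-part" structures; the application does not enumerate its negative templates.

```python
import re

# Hypothetical negative templates: wording that looks like a whole-part
# statement but does not assert one.
NEGATIVE_TEMPLATES = [
    re.compile(r"不(?:包括|包含)"),  # explicit negation, e.g. "does not include"
    re.compile(r"是否(?:包括|包含)"),  # question form, e.g. "whether ... includes"
]


def filter_with_negative_templates(affirmative_matches):
    """Keep only sentences that match no negative template."""
    return [
        s for s in affirmative_matches
        if not any(t.search(s) for t in NEGATIVE_TEMPLATES)
    ]
```

The input is assumed to be the list of sentences that already matched an affirmative template, so the output corresponds to the candidate sentences of the claim.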
5. The method according to claim 4, characterized in that determining, by using a domain lexicon and the word-segmentation information in the candidate sentences, the domain words having a parallel structure in the candidate sentences comprises:
selecting a domain lexicon;
judging, according to the word-segmentation information in the candidate sentences, whether a word having a parallel structure in the candidate sentences is a domain word in the domain lexicon;
if so, determining that the word is a domain word.
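The lexicon check of claim 5 might be sketched as follows, assuming that "a word having a parallel structure" means a segmented word directly adjacent to a characteristic symbol. That interpretation, the mark set, and the function names are assumptions of this sketch.

```python
# Assumed characteristic symbols, as segmented tokens.
PARALLEL_MARKS = {"、", "和", "及", "与", "或"}


def parallel_words(words):
    """Return the words that sit directly beside a parallel mark."""
    picked = []
    for i, w in enumerate(words):
        if w in PARALLEL_MARKS:
            continue
        before = i > 0 and words[i - 1] in PARALLEL_MARKS
        after = i + 1 < len(words) and words[i + 1] in PARALLEL_MARKS
        if before or after:
            picked.append(w)
    return picked


def domain_words(words, lexicon):
    """Keep the parallel words that are recorded in the chosen domain lexicon."""
    return [w for w in parallel_words(words) if w in lexicon]
```

Here `lexicon` is any set of strings for the selected domain; words outside it are simply dropped.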
6. The method according to claim 5, characterized in that outputting, according to the position features of the domain words, a set of domain words having a whole-part relation comprises:
determining a whole domain word and part domain words in the candidate sentence by using a position template, the relation between the whole domain word and a part domain word being that between a superordinate concept and a subordinate concept;
extracting the domain words having the whole-part relation;
outputting the set of domain words having the whole-part relation in the form of a list.
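A minimal sketch of a position template for claim 6 is given below. It assumes that the whole domain word is the nearest domain word before a trigger word and that the part domain words follow the trigger; the trigger set and the name `whole_part_set` are hypothetical.

```python
# Assumed trigger words that separate the whole from its parts.
TRIGGERS = {"包括", "包含"}


def whole_part_set(words, lexicon, triggers=TRIGGERS):
    """Return (whole, part) pairs as a list.

    The nearest domain word before the first usable trigger is treated as
    the whole (superordinate concept); domain words after the trigger are
    treated as its parts (subordinate concepts).
    """
    for i, w in enumerate(words):
        if w in triggers:
            before = [x for x in words[:i] if x in lexicon]
            parts = [x for x in words[i + 1:] if x in lexicon]
            if before and parts:
                return [(before[-1], p) for p in parts]
    return []
```

Returning a list of pairs is one straightforward way to satisfy the "output in the form of a list" step.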
7. The method according to claim 6, characterized in that extracting the domain words having the whole-part relation comprises:
performing correction processing on the whole domain word and the part domain words, the correction processing comprising: removing numerals, removing measure words, and/or removing trailing suffixes;
extracting the whole domain word and the part domain words after the correction processing.
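The correction processing of claim 7 could be approximated with two regular expressions, as in the sketch below. The numeral set, the measure-word set, and the trailing suffixes are small assumed samples; a naive pattern like this can also strip characters that legitimately begin a word, so it is illustrative only.

```python
import re

# Assumed samples of numerals, measure words, and trailing suffixes.
NUMERAL = r"[0-9零一二两三四五六七八九十百千万]+"
MEASURE = r"[个只台条种类项]"
LEADING = re.compile(rf"^{NUMERAL}{MEASURE}")  # numeral followed by a measure word
TRAILING = re.compile(r"(?:等等|等)$")          # "etc."-style tail


def correct(word):
    """Strip a leading numeral-plus-measure-word prefix and a trailing suffix."""
    word = LEADING.sub("", word)
    return TRAILING.sub("", word)
```

Requiring the measure word after the numeral avoids trimming words that merely start with a numeral character.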
8. A device for obtaining words, characterized in that the device comprises:
a preprocessing unit, configured to preprocess acquired text data to obtain independent sentences with word-segmentation information;
a screening unit, configured to screen out, from the independent sentences obtained by the preprocessing unit, candidate sentences having a parallel structure by using a structure template;
a determining unit, configured to determine, by using a domain lexicon and the word-segmentation information in the candidate sentences, domain words having a parallel structure in the candidate sentences screened out by the screening unit, the domain lexicon being a dictionary that records words of the same domain;
an output unit, configured to output, according to the position features of the domain words determined by the determining unit, a set of domain words having a whole-part relation.
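The four units of claim 8 can be pictured as program units chained in order, as in the following sketch. The class name `WordObtainingDevice`, the callable signatures, and the intermediate data shapes are assumptions; the application only states that the units are stored as program units and executed by a processor.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Sentences = List[List[str]]  # each sentence as a list of segmented words


@dataclass
class WordObtainingDevice:
    """Holds the four program units and runs them in the claimed order."""
    preprocess: Callable[[str], Sentences]               # preprocessing unit
    screen: Callable[[Sentences], Sentences]             # screening unit
    determine: Callable[[Sentences], Sentences]          # determining unit
    output: Callable[[Sentences], List[Tuple[str, str]]]  # output unit

    def run(self, text: str) -> List[Tuple[str, str]]:
        sentences = self.preprocess(text)
        candidates = self.screen(sentences)
        domain_words = self.determine(candidates)
        return self.output(domain_words)
```

Passing the units in as callables keeps each stage replaceable, for example to swap in a different segmenter or template set.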
9. The device according to claim 8, characterized in that the preprocessing unit comprises:
a sentence-splitting module, configured to perform sentence splitting on the text data to obtain the independent sentences;
a word-segmentation module, configured to perform word segmentation on the independent sentences obtained by the sentence-splitting module to obtain the word-segmentation information of the independent sentences;
a marking module, configured to mark the word-segmentation information obtained by the word-segmentation module in the independent sentences.
10. The device according to claim 8 or 9, characterized in that the screening unit comprises:
an extraction module, configured to extract independent sentences having a parallel structure by using characteristic symbols, wherein the characteristic symbols comprise at least one of the following: the enumeration comma (、) and logical relation symbols;
a screening module, configured to screen out, from the independent sentences having the parallel structure extracted by the extraction module, candidate sentences having a whole-part relation by using an affirmative template, the affirmative template being used to identify sentence structures having a whole-part relation in the independent sentences.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510886318.9A CN106844326A (en) 2015-12-04 2015-12-04 A kind of method and device for obtaining word

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201510886318.9A CN106844326A (en) 2015-12-04 2015-12-04 A kind of method and device for obtaining word

Publications (1)

Publication Number Publication Date
CN106844326A true CN106844326A (en) 2017-06-13

Family

ID=59150525

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201510886318.9A Pending CN106844326A (en) 2015-12-04 2015-12-04 A kind of method and device for obtaining word

Country Status (1)

Country Link
CN (1) CN106844326A (en)

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102360383A (en) * 2011-10-15 2012-02-22 西安交通大学 Method for extracting text-oriented field term and term relationship

Cited By (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109543150A (en) * 2017-09-21 2019-03-29 北京国双科技有限公司 A kind for the treatment of method and apparatus of court's trial notes
CN109543150B (en) * 2017-09-21 2022-11-22 北京国双科技有限公司 Method and device for processing court trial notes
CN110244860B (en) * 2018-03-08 2024-02-02 北京搜狗科技发展有限公司 Input method and device and electronic equipment
CN110244860A (en) * 2018-03-08 2019-09-17 北京搜狗科技发展有限公司 A kind of input method, device and electronic equipment
CN111124144A (en) * 2018-10-31 2020-05-08 北京国双科技有限公司 Input data processing method and device
CN111124144B (en) * 2018-10-31 2023-04-07 北京国双科技有限公司 Input data processing method and device
CN109614499B (en) * 2018-11-22 2023-02-17 创新先进技术有限公司 Dictionary generation method, new word discovery method, device and electronic equipment
CN109543185B (en) * 2018-11-22 2021-11-16 联想(北京)有限公司 Statement topic acquisition method and device
CN109543185A (en) * 2018-11-22 2019-03-29 联想(北京)有限公司 Utterance topic acquisition methods and device
CN109614499A (en) * 2018-11-22 2019-04-12 阿里巴巴集团控股有限公司 A kind of dictionary generating method, new word discovery method, apparatus and electronic equipment
CN109657235B (en) * 2018-12-05 2022-11-25 云孚科技(北京)有限公司 Mixed word segmentation method
CN109657235A (en) * 2018-12-05 2019-04-19 云孚科技(北京)有限公司 A kind of mixing segmenting method
CN110413998A (en) * 2019-07-16 2019-11-05 深圳供电局有限公司 A kind of adaptive Chinese word cutting method and its system, medium towards power industry
CN110413998B (en) * 2019-07-16 2023-04-21 深圳供电局有限公司 Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof
CN111581358A (en) * 2020-04-08 2020-08-25 北京百度网讯科技有限公司 Information extraction method and device and electronic equipment
CN111581358B (en) * 2020-04-08 2023-08-18 北京百度网讯科技有限公司 Information extraction method and device and electronic equipment
CN111767715A (en) * 2020-06-10 2020-10-13 北京奇艺世纪科技有限公司 Method, device, equipment and storage medium for person identification

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing

Applicant after: Beijing Guoshuang Technology Co.,Ltd.

Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing

Applicant before: Beijing Guoshuang Technology Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20170613