CN106844326A - A method and device for obtaining words - Google Patents
A method and device for obtaining words
- Publication number
- CN106844326A (application number CN201510886318.9A)
- Authority
- CN
- China
- Prior art keywords
- sentence
- participle
- candidate
- independent
- whole
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Machine Translation (AREA)
Abstract
The invention discloses a method and device for obtaining words and relates to the field of computer technology. Its main purpose is to improve the accuracy with which whole-part relations between words are extracted from a corpus by labeling words with domain information. The main technical solution is as follows: preprocess the acquired text data to obtain independent sentences carrying word-segmentation information; among the independent sentences, use a structure template to screen out candidate sentences that have a parallel structure; using a domain dictionary and the word-segmentation information of the candidate sentences, determine the domain words that have a parallel structure in the candidate sentences, where the domain dictionary is a dictionary recording segmented words of the same domain; and, according to the position features of the domain words, output a set of domain words that have a whole-part relation. The invention is mainly used to obtain words that stand in a whole-part relation in text.
Description
Technical field
The present invention relates to the field of computer technology, and in particular to a method and device for obtaining words.
Background
With the development of network technology, the volume of data keeps growing, and obtaining information from it requires more effective text classification techniques. Some mature text classification techniques work relatively well on English text but perform poorly on Chinese text. The reason is that the role of semantic factors in Chinese text cannot be ignored. There are two most basic classes of semantic relation: 1. the relation between a superordinate concept and a subordinate concept, where the subordinate concept appears only to limit the extension of the superordinate concept; 2. the predication relation, which is the most basic relation: a statement made by one basic lexical unit about another. Grammatical forms arise largely in order to express these relations.
Among superordinate-subordinate relations, the most common is the whole-part relation: a whole generally has a structure, and its parts are separable and each serve a specific function. In current classification processing, words in a whole-part relation are generally extracted using fixed patterns, including lexical and syntactic patterns, that determine the whole-part relation between words. One example is a method that obtains part-whole relations from web pages on the basis of parallel structures: it uses whole-part relation patterns to retrieve a corpus from Google, matches sentences that have a parallel structure, extracts from them the part concepts of a given whole concept, and automatically clusters the candidate part concepts with a hierarchical clustering algorithm to determine the words that have a whole-part relation. However, the corpus data matched in this way agrees with the template only in form; its actual content does not necessarily express a whole-part relation, so the extraction accuracy of this approach is relatively low.
Summary of the invention
In view of this, the present invention provides a method and device for obtaining words, whose main purpose is to improve the accuracy with which whole-part relations between words are extracted from a corpus by labeling words with domain information.
To achieve this purpose, the present invention mainly provides the following technical solutions.
In one aspect, the invention provides a method for obtaining words, the method including:
preprocessing the acquired text data to obtain independent sentences carrying word-segmentation information;
among the independent sentences, using a structure template to screen out candidate sentences that have a parallel structure;
using a domain dictionary and the word-segmentation information of the candidate sentences, determining the domain words that have a parallel structure in the candidate sentences, where the domain dictionary is a dictionary recording segmented words of the same domain;
according to the position features of the domain words, outputting a set of domain words that have a whole-part relation.
In another aspect, the invention provides a device for obtaining words, the device including:
a preprocessing unit, configured to preprocess the acquired text data to obtain independent sentences carrying word-segmentation information;
a screening unit, configured to use a structure template to screen out, from the independent sentences obtained by the preprocessing unit, candidate sentences that have a parallel structure;
a determining unit, configured to use a domain dictionary and the word-segmentation information of the candidate sentences to determine the domain words that have a parallel structure in the candidate sentences selected by the screening unit;
an output unit, configured to output, according to the position features of the domain words determined by the determining unit, a set of domain words that have a whole-part relation.
According to the method and device for obtaining words proposed by the invention, the text corpus is split into sentences and segmented into words, and a structure template is used to screen out candidate sentences that have a parallel structure. This gives a preliminary selection, from the text corpus, of candidate sentences whose parallel structure may express a whole-part relation. Using the word-segmentation information of the candidate sentences and the selected domain dictionary, the method judges whether the segmented words in a parallel structure belong to the same domain; if so, the whole-part relation between the segmented words is determined from their positions in the sentence, and the words are output together with the corresponding relation. Compared with existing approaches that judge whole-part relations by matching fixed templates, the invention adds a further check on the segmented words in the sentence: by confirming that the words in a parallel structure belong to the same domain, it takes the actual content of the words into account and avoids purely formal extraction. The positional relations between the words are then used to judge which word is the whole domain word and which are the part domain words, which further improves the accuracy of whole-part relation extraction.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art from the detailed description of the preferred embodiments below. The drawings serve only to illustrate the preferred embodiments and are not to be considered limiting of the invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a flow chart of a method for obtaining words proposed by an embodiment of the invention;
Fig. 2 shows a flow chart of another method for obtaining words proposed by an embodiment of the invention;
Fig. 3 shows a block diagram of a device for obtaining words proposed by an embodiment of the invention;
Fig. 4 shows a block diagram of another device for obtaining words proposed by an embodiment of the invention.
Detailed description
Exemplary embodiments of the invention are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the invention, it should be understood that the invention may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the invention can be understood more thoroughly and its scope fully conveyed to those skilled in the art.
An embodiment of the invention provides a method for obtaining words. As shown in Fig. 1, the method is used to obtain words that have a whole-part relation in a text corpus, and its specific steps include:
101. Preprocess the acquired text data to obtain independent sentences carrying word-segmentation information.
In this embodiment, the acquired text data is the corpus data from which words with a whole-part relation are to be extracted; it may be text data of different domains or themes chosen from different corpora. Preprocessing means dividing a long passage or an entire document into short units, such as sentences or phrases, that are easy to process. Specifically, the text can be subdivided with word-segmentation and sentence-splitting techniques. Because these are widely used text-processing techniques, the embodiment does not describe them in detail, nor does it limit the specific word-segmentation or sentence-splitting method. The aim is to obtain independent sentences carrying word-segmentation information. Here, an independent sentence is a single sentence with a complete structure or form, and the word-segmentation information is the result of segmenting that sentence, for example which segmented words the sentence contains and the position of each segmented word in the sentence.
102. Use a structure template to screen out candidate sentences that have a parallel structure.
The independent sentences obtained in step 101 are screened with a structure template to select those that have a parallel structure. A structure template is a template preset in the system for judging sentence structure; in this embodiment it is used to judge whether an independent sentence contains a parallel structure. Within an independent sentence, a parallel structure may be a juxtaposition of words or of phrases, and the embodiment does not limit its specific form. The judgment is based only on the sentence structure of the independent sentence, and independent sentences that contain a parallel structure are taken as candidate sentences.
103. Using the domain dictionary and the word-segmentation information of the candidate sentences, determine the domain words that have a parallel structure in the candidate sentences.
The domain dictionary is a dictionary recording segmented words of the same domain. Because words or phrases in a whole-part relation necessarily belong to the same domain, judging whether the segmented words in a parallel structure in a candidate sentence belong to the same domain is a precondition for judging whether they can form a whole-part relation. In other words, if two segmented words in a parallel structure do not belong to the same domain, they cannot have a whole-part relation, either with each other or with the other segmented words in the sentence. In this step, besides determining the domain of the segmented words that are in a parallel structure, it is also necessary to determine whether the other segmented words in the sentence belong to the domain dictionary, so that the specific position information of the segmented words can later be used to decide whether a whole-part relation holds between them.
104. According to the position features of the domain words, output a set of domain words that have a whole-part relation.
The position feature of a domain word is the position information of the segmented word in the independent sentence, recorded during word segmentation; the whole-part relation between segmented words is judged from this position information. Take the sentence "an automobile includes an engine, a gearbox, tires, and so on": "engine", "gearbox", and "tire" are segmented words in a parallel structure, and they all belong to the automotive domain. By examining the positional relation between "automobile" and "engine", "gearbox", and "tire", it can therefore be determined that "automobile" stands in a whole-part relation with "engine", "gearbox", and "tire".
After all combinations of segmented words with a whole-part relation in the text have been obtained, a set of domain words containing all of these whole-part relations is output.
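The automobile example can be traced end to end in a few lines. The following is a minimal sketch under stated assumptions, not the patented implementation: the Chinese sentence, the single "include" template (rendered here as 包括), the use of 和 ("and") as an extra separator, the toy dictionary, and the function name `whole_part_pairs` are all hypothetical choices made for illustration.

```python
import re

# Toy automotive domain dictionary (assumed content).
DOMAIN_DICTIONARY = {"汽车", "发动机", "变速箱", "轮胎"}

# Assumed rendering of one "X include A、B、C (etc.)" structure.
TEMPLATE = re.compile(r"^(.+?)包括(.+?)等?。?$")


def whole_part_pairs(sentence):
    """Return (whole, part) pairs for one sentence, or [] if nothing qualifies."""
    match = TEMPLATE.match(sentence)
    if not match:
        return []
    whole, listing = match.groups()
    # The parallel structure: items separated by the pause mark or by "和".
    parts = re.split(r"[、和]", listing)
    # Domain check: the whole and every part must be in the same dictionary.
    if whole not in DOMAIN_DICTIONARY:
        return []
    if not all(part in DOMAIN_DICTIONARY for part in parts):
        return []
    return [(whole, part) for part in parts]


pairs = whole_part_pairs("汽车包括发动机、变速箱和轮胎等。")
```

In this sketch the position rule is implicit in the template: the word before the "include" keyword is treated as the whole, and the listed words after it as the parts.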
As can be seen from the above implementation, the method used by the embodiment splits the text corpus into sentences and segments it into words, and uses a structure template to screen out candidate sentences that have a parallel structure. This gives a preliminary selection, from the text corpus, of candidate sentences whose parallel structure may express a whole-part relation. Using the word-segmentation information of the candidate sentences and the selected domain dictionary, the method judges whether the segmented words in a parallel structure belong to the same domain; if so, the whole-part relation between the segmented words is determined from their positions in the sentence, and the words are output together with the corresponding relation. Compared with existing approaches that judge whole-part relations by matching fixed templates, the method adds a further check on the segmented words in the sentence: by confirming that the words in a parallel structure belong to the same domain, it takes the actual content of the words into account and avoids purely formal extraction. The positional relations between the words are then used to judge which word is the whole domain word and which are the part domain words, which further improves the accuracy of whole-part relation extraction.
To explain the above method in more detail, an embodiment of the invention further provides a method for obtaining words. As shown in Fig. 2, the method extracts words through the following steps:
201. Split the acquired text data into sentences to obtain the independent sentences.
The acquired text data is split into sentences. The simplest way is to examine the punctuation in the text: symbols that can end an independent sentence, such as the full stop, exclamation mark, and question mark, serve as the splitting criterion, whereas the comma, pause mark, semicolon, and similar symbols do not split sentences. The text is split into sentences in this way.
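The punctuation rule of step 201 can be sketched with a regular expression. This is an illustrative sketch only: the function name `split_sentences` and the exact set of end marks are assumptions, and the ASCII full stop is left out on purpose so that decimals such as "3.5" are not split.

```python
import re

# Marks that end an independent sentence (Chinese and ASCII forms).
# Commas, pause marks and semicolons are deliberately absent, so they
# never split a sentence.
SENTENCE_END = "。！？!?"

_SENTENCE = re.compile(rf"[^{SENTENCE_END}]+[{SENTENCE_END}]?")


def split_sentences(text):
    """Split text into independent sentences, keeping each end mark."""
    return [piece.strip() for piece in _SENTENCE.findall(text) if piece.strip()]


sentences = split_sentences("手机包括处理器、内存、屏幕。汽车分为轿车、货车！")
```

Text that does not end in one of the marks is still returned as a final sentence, so no input is silently dropped.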
202. Perform word segmentation on the independent sentences and mark the resulting word-segmentation information in the independent sentences.
After sentence splitting is complete, each resulting independent sentence is further segmented into words, and the segmentation result is marked in the independent sentence so that subsequent processing can read it. The segmentation result includes the specific segmented words and the position of each segmented word in the independent sentence.
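The text does not fix a segmentation method, so the sketch below swaps in forward maximum matching against a small vocabulary, purely to show what "segmented words plus their positions" can look like. The function name `segment`, the `max_len` parameter, and the vocabulary are assumptions; a production system would more likely use a trained segmenter.

```python
def segment(sentence, vocabulary, max_len=4):
    """Forward maximum matching: return (word, start_offset) pairs."""
    tokens = []
    i = 0
    while i < len(sentence):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            word = sentence[i:i + size]
            if size == 1 or word in vocabulary:
                tokens.append((word, i))
                i += size
                break
    return tokens


vocabulary = {"汽车", "包括", "发动机", "变速箱", "轮胎"}
tokens = segment("汽车包括发动机、变速箱和轮胎。", vocabulary)
```

Keeping the start offset with every word is what later steps rely on when they compare positions inside the sentence.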
203. Use feature symbols to extract the independent sentences that have a parallel structure.
In this embodiment, a feature symbol indicates that an independent sentence contains a parallel structure. The feature symbols include at least one of the following: the pause mark and logical relation symbols. For example, the pause mark can be represented by "、"; a logical relation symbol can be a punctuation mark or character such as a coordination symbol (which can be represented by " ") or an alternative-relation symbol (which can be represented by "‖"). Feature symbols are used to screen out the independent sentences that have a parallel structure. The specific feature symbols can be chosen according to the actual application, and the invention does not limit them.
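Screening by feature symbol amounts to a membership test. The sketch below is a hedged illustration: the default symbol set contains only the pause mark and the alternative-relation symbol, and any other symbol (for example a coordination symbol) is left to the caller. The names `DEFAULT_FEATURE_SYMBOLS`, `has_parallel_structure`, and `extract_parallel_sentences` are hypothetical.

```python
# Pause mark and alternative-relation symbol; further symbols can be
# supplied by the caller through the `symbols` argument.
DEFAULT_FEATURE_SYMBOLS = ("、", "‖")


def has_parallel_structure(sentence, symbols=DEFAULT_FEATURE_SYMBOLS):
    """True if the sentence contains at least one feature symbol."""
    return any(symbol in sentence for symbol in symbols)


def extract_parallel_sentences(sentences, symbols=DEFAULT_FEATURE_SYMBOLS):
    """Keep only the independent sentences that show a parallel structure."""
    return [s for s in sentences if has_parallel_structure(s, symbols)]


kept = extract_parallel_sentences(["手机包括处理器、内存、屏幕。", "今天天气很好。"])
```

Because the symbol set is a parameter, it can be extended per application without touching the screening logic.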
204. Use an affirmative template to screen out candidate sentences that have a whole-part relation.
From the independent sentences that have a parallel structure, an affirmative template is used to select those that have a whole-part relation, and these are taken as candidate sentences. An affirmative template describes sentence structures in which an independent sentence expresses a whole-part relation, and it may contain several such structures, for example:
the structure "^(.*) include (.*)、(.*)、(.*)" (e.g. "a mobile phone includes components such as a processor, memory, a screen, and a casing");
the structure "(.*) is composed of (.*)、(.*)、(.*) etc.$" (e.g. "a computer is composed of a host, a monitor, a mouse, a keyboard, etc.");
the structure "(.*)(is|such as|has|is divided into)(.*)、(.*)、(.*)$" (e.g. "automobiles are divided into cars and trucks").
Sentence structures can be added to or removed from the affirmative template as needed, so the embodiment does not limit the specific structures in the template.
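The affirmative structures can be expressed as a list of compiled regular expressions. The sketch below is an assumption-laden illustration: the Chinese keywords (包括 for "include", 由…等组成 for "is composed of … etc.", 分为 for "is divided into") are plausible renderings rather than text taken from the source, the third pattern is narrowed to "divided into" only, and the patterns accept any list with at least one pause mark instead of exactly three items.

```python
import re

# Hypothetical Chinese renderings of three affirmative structures.
AFFIRMATIVE_TEMPLATES = [
    re.compile(r"^(.+)包括(.+、.+)$"),      # X include A、B、C
    re.compile(r"^(.+)由(.+、.+)等组成$"),  # X is composed of A、B、C etc.
    re.compile(r"^(.+)分为(.+、.+)$"),      # X is divided into A、B
]


def matches_affirmative(sentence):
    """True if the sentence fits at least one affirmative structure."""
    body = sentence.rstrip("。！？!?")  # ignore the sentence-ending mark
    return any(template.match(body) for template in AFFIRMATIVE_TEMPLATES)


examples = [
    "手机包括处理器、内存、屏幕、外壳等部件。",
    "电脑由主机、显示器、鼠标、键盘等组成。",
    "汽车分为轿车、货车。",
]
flags = [matches_affirmative(s) for s in examples]
```

Since the templates live in a plain list, adding or removing a structure is a one-line change, which mirrors the point that the template set is not fixed.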
Further, to improve the accuracy of judging whole-part relations, the candidate sentences that match the affirmative template can additionally be checked against a negative template, so that sentences that have a parallel structure but no whole-part relation are excluded. Consider "a mobile phone is a communication tool, a smart device, an electronic device": "communication tool", "smart device", and "electronic device" form a parallel structure, but they do not stand in a whole-part relation with "mobile phone". Candidate sentences with this kind of structure are therefore excluded. The sentence structures in the negative template include the structure "^such as (.*)、(.*)、(.*)$", the structure "^(.*) is (.*)、(.*)、(.*)$", and the structure "(.*)、(.*)、(.*)(is|such as|has|is divided into)(.*)$". Candidate sentences that do not match the negative template are retained for subsequent processing.
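The negative check works the same way as the affirmative one, except that a match removes the sentence. As before, this is only a sketch: the Chinese keywords (如 for "such as", 是 for "is", 有 for "has", 分为 for "is divided into") are assumed renderings, and the names `NEGATIVE_TEMPLATES` and `keep_candidates` are hypothetical.

```python
import re

# Hypothetical Chinese renderings of the three negative structures.
NEGATIVE_TEMPLATES = [
    re.compile(r"^如.+、.+$"),                      # such as A、B、C
    re.compile(r"^[^、]+是.+、.+$"),                 # X is A、B、C
    re.compile(r"^.+、.+(?:是|如|有|分为)[^、]+$"),  # A、B、C is/has/... X
]


def matches_negative(sentence):
    """True if the sentence fits a structure that rules out whole-part."""
    body = sentence.rstrip("。！？!?")
    return any(template.match(body) for template in NEGATIVE_TEMPLATES)


def keep_candidates(sentences):
    """Retain only the sentences that match no negative structure."""
    return [s for s in sentences if not matches_negative(s)]


kept = keep_candidates([
    "手机包括处理器、内存、屏幕。",
    "手机是通讯工具、智能设备、电子设备。",
])
```

Running the affirmative check first and the negative check second keeps the two template sets independent, so either can be tuned without affecting the other.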
205. Using the domain dictionary and the word-segmentation information of the candidate sentences, determine the domain words that have a parallel structure in the candidate sentences.
Before the domain words are determined, a domain dictionary must first be selected. The domain dictionary is generally determined from the content of the text when the text is acquired, or selected from a list of available domain dictionaries. The domain dictionary is a dictionary of all segmented words in the domain to which the text belongs. By matching the word-segmentation information of a candidate sentence against the words in the domain dictionary, it can be determined which segmented words in the candidate sentence belong to the same domain, and in particular whether the segmented words in a parallel structure do; if they do, these segmented words are taken as domain words.
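One way to picture the dictionary match is to collect the words adjacent to a separator and accept them only if every one is in the domain dictionary. This is a sketch under assumptions: tokens are (word, start_offset) pairs, 和 ("and") is treated as a separator alongside the pause mark, and the helper names `parallel_words` and `domain_words` are hypothetical.

```python
def parallel_words(tokens, separators=("、", "和")):
    """Return the (word, offset) tokens directly before or after a separator."""
    found = []
    for index, (word, _) in enumerate(tokens):
        if word not in separators:
            continue
        for neighbour in (index - 1, index + 1):
            if 0 <= neighbour < len(tokens):
                candidate = tokens[neighbour]
                if candidate[0] not in separators and candidate not in found:
                    found.append(candidate)
    return found


def domain_words(tokens, domain_dictionary):
    """Return the parallel words if all of them are domain words, else []."""
    candidates = parallel_words(tokens)
    if candidates and all(word in domain_dictionary for word, _ in candidates):
        return candidates
    return []


tokens = [("汽车", 0), ("包括", 2), ("发动机", 4), ("、", 7),
          ("变速箱", 8), ("和", 11), ("轮胎", 12), ("。", 14)]
parts = domain_words(tokens, {"汽车", "发动机", "变速箱", "轮胎"})
```

Rejecting the whole group when a single member is outside the dictionary is the strict reading of "belong to the same domain"; a looser variant could keep the members that do match.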
206. Use a situation template to determine the whole domain word and the part domain words in the candidate sentence.
The situation template is similar to the affirmative template described above. It is used to judge the specific attribute of a segmented word from its position in the sentence, that is, whether the word is a whole domain word or a part domain word. In most cases, the segmented words in a parallel structure are part domain words. The relation between a whole domain word and a part domain word is that between a superordinate concept and a subordinate concept.
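The source does not spell out the situation templates, so the following sketch assumes one simple position rule: in an "X include A、B、C" sentence, the nearest domain word that precedes the parallel group is the whole, and the parallel words are the parts. The function name `assign_roles` and the rule itself are illustrative assumptions; a sentence of the form "A、B、C make up X" would need the mirror-image rule.

```python
def assign_roles(tokens, part_tokens, domain_dictionary):
    """Return (whole, [parts]) for one candidate sentence, or None.

    tokens:      all (word, start_offset) pairs of the sentence.
    part_tokens: the parallel domain words, as (word, start_offset) pairs.
    """
    if not part_tokens:
        return None
    first_part_offset = min(offset for _, offset in part_tokens)
    part_set = {word for word, _ in part_tokens}
    # Domain words located before the parallel group and not part of it.
    preceding = [
        word for word, offset in tokens
        if offset < first_part_offset
        and word in domain_dictionary
        and word not in part_set
    ]
    if not preceding:
        return None
    # The closest preceding domain word is taken as the whole.
    return preceding[-1], [word for word, _ in part_tokens]


tokens = [("汽车", 0), ("包括", 2), ("发动机", 4), ("、", 7),
          ("变速箱", 8), ("和", 11), ("轮胎", 12), ("。", 14)]
parts = [("发动机", 4), ("变速箱", 8), ("轮胎", 12)]
roles = assign_roles(tokens, parts, {"汽车", "发动机", "变速箱", "轮胎"})
```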
207. Extract the domain words that have a whole-part relation.
Once the whole domain word and the part domain words in a candidate sentence have been determined, they can be extracted from the candidate sentence. The extracted words can further be corrected by removing superfluous modifiers, for example numerals, measure words, or word-final suffixes.
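The correction step can be approximated with a small character-level cleanup. The numeral set, the sample measure words, the suffix list, and the name `clean_word` below are all assumptions chosen for illustration; a character-based rule like this can damage real words that begin with a numeral character (for example 万向节), so an implementation that relies on part-of-speech tags would be safer.

```python
import re

NUMERALS = "0-9一二三四五六七八九十百千万两"   # digits and common Chinese numerals
MEASURE_WORDS = "个只台块条部种"               # small assumed sample of measure words
SUFFIXES = ("等", "们")                        # assumed word-final suffixes

_LEADING_QUANTITY = re.compile(rf"^[{NUMERALS}]+[{MEASURE_WORDS}]?")


def clean_word(word):
    """Strip a leading numeral (plus measure word) and a trailing suffix."""
    cleaned = _LEADING_QUANTITY.sub("", word)
    for suffix in SUFFIXES:
        if cleaned.endswith(suffix) and len(cleaned) > len(suffix):
            cleaned = cleaned[: -len(suffix)]
    # Never return an empty string: fall back to the original word.
    return cleaned or word


cleaned = [clean_word(w) for w in ["四个轮胎", "2台显示器", "键盘等", "发动机"]]
```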
208. Output the set of domain words that have a whole-part relation in the form of a list.
Finally, the corrected whole domain words and part domain words are added to the corresponding table and output as a list. Note that the list contains the domain words with a whole-part relation extracted from all sentences of the text, so the list can also be regarded as a set of domain words, specifically a set of domain words with whole-part correspondences.
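Collecting the per-sentence results into one list can be sketched as a merge that removes duplicates while keeping the order of first appearance. The input shape (one `(whole, [parts])` pair per candidate sentence) and the name `build_relation_list` are assumptions for illustration.

```python
def build_relation_list(extracted):
    """Merge (whole, [parts]) results into a flat list of (whole, part) rows."""
    merged = {}
    for whole, parts in extracted:
        bucket = merged.setdefault(whole, [])
        for part in parts:
            if part not in bucket:  # skip parts already listed for this whole
                bucket.append(part)
    return [(whole, part) for whole, parts in merged.items() for part in parts]


rows = build_relation_list([
    ("汽车", ["发动机", "变速箱"]),
    ("手机", ["处理器", "屏幕"]),
    ("汽车", ["变速箱", "轮胎"]),
])
```

Each row pairs one whole with one part, so the result can be read either as a table or as a set of whole-part correspondences.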
Further, as an implementation of the above method, an embodiment of the invention provides a device for obtaining words. This device embodiment corresponds to the foregoing method embodiment; for ease of reading, it does not repeat every detail of the method embodiment, but it should be clear that the device in this embodiment can implement everything in the method embodiment. The device is provided in equipment for analyzing text corpora, in particular computing equipment that extracts words having a whole-part relation. As shown in Fig. 3, the device includes:
a preprocessing unit 31, configured to preprocess the acquired text data to obtain independent sentences carrying word-segmentation information;
a screening unit 32, configured to use a structure template to screen out, from the independent sentences obtained by the preprocessing unit 31, candidate sentences that have a parallel structure;
a determining unit 33, configured to use a domain dictionary and the word-segmentation information of the candidate sentences to determine the domain words that have a parallel structure in the candidate sentences selected by the screening unit 32, where the domain dictionary is a dictionary recording segmented words of the same domain;
an output unit 34, configured to output, according to the position features of the domain words determined by the determining unit 33, a set of domain words that have a whole-part relation.
Further, as shown in Fig. 4, the preprocessing unit 31 includes:
a sentence-splitting module 311, configured to split the text data into sentences to obtain the independent sentences;
a word-segmentation module 312, configured to segment the independent sentences obtained by the sentence-splitting module 311 into words to obtain the word-segmentation information of the independent sentences;
a marking module, configured to mark the word-segmentation information obtained by the word-segmentation module 312 in the independent sentences.
Further, as shown in Fig. 4, the screening unit 32 includes:
an extraction module 321, configured to use feature symbols to extract the independent sentences that have a parallel structure, where the feature symbols include at least one of the following: the pause mark and logical relation symbols;
a screening module 322, configured to use an affirmative template to screen out, from the independent sentences with a parallel structure extracted by the extraction module 321, candidate sentences that have a whole-part relation, where the affirmative template is used to identify sentence structures in the independent sentences that express a whole-part relation.
Further, as shown in Fig. 4, the screening module 322 includes:
a screening submodule 3221, configured to use a negative template to screen the independent sentences that match the affirmative template, where the negative template is used to identify sentence structures in the independent sentences that do not express a whole-part relation;
a determining submodule 3222, configured to determine that the independent sentences that do not match the negative template used by the screening submodule 3221 are candidate sentences.
Further, as shown in Fig. 4, the determining unit 33 includes:
a selecting module 331, configured to select a domain dictionary;
a judging module 332, configured to judge, from the word-segmentation information of a candidate sentence, whether the segmented words in a parallel structure in the candidate sentence are domain words in the domain dictionary selected by the selecting module 331;
a determining module 333, configured to determine that a segmented word is a domain word when the judging module 332 judges that the word is in the domain dictionary.
Further, as shown in Fig. 4, the output unit 34 includes:
a determining module 341, configured to use a situation template to determine the whole domain word and the part domain words in a candidate sentence, where the relation between the whole domain word and a part domain word is that between a superordinate concept and a subordinate concept;
an extraction module 342, configured to extract the domain words with a whole-part relation determined by the determining module 341;
an output module 343, configured to output, in the form of a list, the set of domain words with a whole-part relation extracted by the extraction module 342.
Further, as shown in Fig. 4, the extraction module 342 includes:
a correction submodule 3421, configured to correct the whole domain word and the part domain words, where the correction includes removing numerals, removing measure words, and/or removing word-final suffixes;
an extraction submodule 3422, configured to extract the whole domain word and the part domain words corrected by the correction submodule 3421.
In summary, the method and device for obtaining words used by the embodiments of the invention split the text corpus into sentences and segment it into words, and use a structure template to screen out candidate sentences that have a parallel structure. This gives a preliminary selection, from the text corpus, of candidate sentences whose parallel structure may express a whole-part relation. Using the word-segmentation information of the candidate sentences and the selected domain dictionary, the method judges whether the segmented words in a parallel structure belong to the same domain; if so, the whole-part relation between the segmented words is determined from their positions in the sentence, and the words are output together with the corresponding relation. Compared with existing approaches that judge whole-part relations by matching fixed templates, the method adds a further check on the segmented words in the sentence: by confirming that the words in a parallel structure belong to the same domain, it takes the actual content of the words into account and avoids purely formal extraction. The positional relations between the words are then used to judge which word is the whole domain word and which are the part domain words, which further improves the accuracy of whole-part relation extraction.
The device for obtaining words includes a processor and a memory. The preprocessing unit, screening unit, determining unit, output unit, and so on are stored in the memory as program units, and the processor executes these program units stored in the memory to realize the corresponding functions.
The processor contains a kernel, which retrieves the corresponding program units from the memory. One or more kernels may be provided, and the accuracy of extracting whole-part relations between corpus words is improved by adjusting kernel parameters.
The memory may include volatile memory in computer-readable media, random access memory (RAM), and/or non-volatile memory such as read-only memory (ROM) or flash memory (flash RAM); the memory includes at least one memory chip.
The present application also provides a computer program product which, when executed on a data processing device, is adapted to execute program code that initializes the following method steps: preprocessing the acquired text data to obtain independent sentences carrying word-segmentation information; among the independent sentences, using a structure template to screen out candidate sentences that have a parallel structure; using a domain dictionary and the word-segmentation information of the candidate sentences, determining the domain words that have a parallel structure in the candidate sentences, where the domain dictionary is a dictionary recording segmented words of the same domain; and, according to the position features of the domain words, outputting a set of domain words that have a whole-part relation.
Those skilled in the art should understand that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Moreover, the application may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to magnetic disk storage, CD-ROM, optical storage, etc.) that contain computer-usable program code.
The application is described with reference to flow charts and/or block diagrams of the method, equipment (system), and computer program product according to embodiments of the application. It should be understood that each flow and/or block in the flow charts and/or block diagrams, and combinations of flows and/or blocks in the flow charts and/or block diagrams, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or other programmable data processing equipment to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing equipment produce a device for realizing the functions specified in one or more flows of a flow chart and/or one or more blocks of a block diagram.
These computer program instructions may also be stored in a computer-readable memory capable
of directing a computer or other programmable data processing device to operate in a
particular manner, so that the instructions stored in the computer-readable memory produce
an article of manufacture including instruction means, the instruction means implementing
the functions specified in one or more flows of the flowchart and/or one or more blocks of
the block diagram.
These computer program instructions may also be loaded onto a computer or other programmable
data processing device, so that a series of operational steps is performed on the computer
or other programmable device to produce computer-implemented processing, whereby the
instructions executed on the computer or other programmable device provide steps for
implementing the functions specified in one or more flows of the flowchart and/or one or
more blocks of the block diagram.
In a typical configuration, a computing device includes one or more processors (CPUs),
input/output interfaces, network interfaces, and memory.
Memory may include volatile memory in a computer-readable medium, random access memory
(RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash
RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable
media, which may store information by any method or technology. The information may be
computer-readable instructions, data structures, program modules, or other data. Examples
of computer storage media include, but are not limited to, phase-change memory (PRAM),
static random access memory (SRAM), dynamic random access memory (DRAM), other types of
random access memory (RAM), read-only memory (ROM), electrically erasable programmable
read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only
memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes,
magnetic tape or disk storage or other magnetic storage devices, or any other
non-transmission medium that can be used to store information accessible by a computing
device. As defined herein, computer-readable media do not include transitory
computer-readable media (transitory media), such as modulated data signals and carrier
waves.
It should also be noted that the terms "comprise", "include", or any other variant thereof
are intended to cover a non-exclusive inclusion, so that a process, method, article, or
device that comprises a series of elements includes not only those elements but also other
elements not expressly listed, or elements inherent to such a process, method, article, or
device. Without further limitation, an element defined by the phrase "comprising a ..."
does not exclude the presence of additional identical elements in the process, method,
article, or device that comprises the element.
The above are merely embodiments of the present application and are not intended to limit
the present application. For those skilled in the art, the present application may have
various modifications and variations. Any modification, equivalent replacement, improvement,
and the like made within the spirit and principle of the present application shall fall
within the scope of the claims of the present application.
Claims (10)
1. A method for obtaining words, characterized in that the method comprises:
preprocessing acquired text data to obtain independent sentences with word segmentation
information;
in the independent sentences, screening out candidate sentences having a parallel structure
using structure templates;
using a domain dictionary and the word segmentation information in the candidate sentences,
determining the domain words having a parallel structure in the candidate sentences,
wherein the domain dictionary is a dictionary that records words of the same domain;
according to the position features of the domain words, outputting a set of domain words
having a whole-part relation.
2. The method according to claim 1, characterized in that preprocessing the acquired text
data to obtain the independent sentences with word segmentation information comprises:
performing sentence splitting on the text data to obtain the independent sentences;
performing word segmentation on the independent sentences to obtain the word segmentation
information of the independent sentences;
marking the word segmentation information in the independent sentences.
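By way of non-limiting illustration, the three preprocessing steps of claim 2 can be sketched as follows. The claim does not name a segmentation algorithm; the forward-maximum-matching segmenter, the set of sentence delimiters, and the toy vocabulary `VOCAB` are assumptions introduced only for this sketch.

```python
import re

# Hypothetical toy vocabulary; a real system would use a full segmentation lexicon.
VOCAB = {"汽车", "包括", "发动机", "底盘", "和", "车身"}
MAX_WORD_LEN = max(len(w) for w in VOCAB)


def split_sentences(text):
    """Sentence-splitting step: cut the text at assumed sentence-final punctuation."""
    parts = re.split(r"[。！？!?；;\n]+", text)
    return [p.strip() for p in parts if p.strip()]


def segment(sentence):
    """Word-segmentation step (assumed forward maximum matching)."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(MAX_WORD_LEN, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            # Fall back to a single character when nothing in VOCAB matches.
            if size == 1 or piece in VOCAB:
                words.append(piece)
                i += size
                break
    return words


def preprocess(text):
    """Marking step: return each independent sentence together with its segmentation."""
    return [{"sentence": s, "words": segment(s)} for s in split_sentences(text)]
```

Each returned record keeps the original sentence next to its word list, which is one simple way of marking the segmentation information in the sentence.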
3. method according to claim 1 and 2, it is characterised in that using stay in place form screening
Going out the candidate's sentence with parallel construction includes:
The independent sentence with parallel construction is extracted using characteristic symbol;Wherein, the characteristic symbol is extremely
One of the following is included less:Pause mark, logical relation symbol;
In the independent sentence of the parallel construction, gone out with whole and part using template filter certainly
Candidate's sentence of relation, the template certainly is used to judge have whole and part in the independent sentence
The sentence structure of relation.
4. The method according to claim 3, characterized in that screening out the candidate
sentences having a whole-part relation using the affirmative template comprises:
screening the independent sentences that match the affirmative template using a negative
template, the negative template being used to identify sentence structures having a
non-whole-part relation in the independent sentences;
determining the independent sentences that do not match the negative template as the
candidate sentences.
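By way of non-limiting illustration, the screening of claims 3 and 4 can be sketched with regular expressions. The claims do not disclose the templates themselves; the patterns below (the connective words treated as logical-relation symbols, and the wording matched by the two templates) are hypothetical examples chosen for this sketch.

```python
import re

# Characteristic symbols marking a parallel structure: the enumeration comma and,
# as an assumed example of logical-relation symbols, the words 和 / 及 / 与.
PARALLEL = re.compile(r"、|和|及|与")
# Hypothetical template for wording that suggests a whole-part relation.
AFFIRMATIVE = re.compile(r"包括|包含|由.+组成")
# Hypothetical template for wording that rules a whole-part relation out.
NEGATIVE = re.compile(r"不包括|不包含|不含")


def select_candidates(sentences):
    """Keep sentences that have a parallel structure, match the first template,
    and do not match the second (negative) template."""
    candidates = []
    for s in sentences:
        if not PARALLEL.search(s):
            continue  # no parallel structure
        if not AFFIRMATIVE.search(s):
            continue  # no whole-part wording
        if NEGATIVE.search(s):
            continue  # rejected by the negative template
        candidates.append(s)
    return candidates
```

Applying the negative template last mirrors the order in claim 4, where it only filters sentences that have already passed the first template.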
5. The method according to claim 4, characterized in that using the domain dictionary and
the word segmentation information in the candidate sentences to determine the domain words
having a parallel structure in the candidate sentences comprises:
selecting a domain dictionary;
according to the word segmentation information in the candidate sentence, judging whether a
word having a parallel structure in the candidate sentence is a domain word in the domain
dictionary;
if so, determining that the word is a domain word.
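By way of non-limiting illustration, the dictionary check of claim 5 can be sketched as a membership test on the words that sit next to a parallel-structure marker. The automotive dictionary `AUTO_DOMAIN` and the marker set `CONNECTORS` are hypothetical; the claim does not specify how a dictionary is chosen or how the parallel members are located.

```python
# Hypothetical domain dictionary for an automotive domain.
AUTO_DOMAIN = {"汽车", "发动机", "底盘", "车身"}
# Assumed markers that join the members of a parallel structure.
CONNECTORS = {"、", "和", "及", "与"}


def parallel_domain_words(words, domain_dict):
    """Return, in order and without repeats, the words adjacent to a connector
    that are recorded in the domain dictionary."""
    found = []
    for i, w in enumerate(words):
        if w not in CONNECTORS:
            continue
        for j in (i - 1, i + 1):
            if 0 <= j < len(words) and words[j] in domain_dict and words[j] not in found:
                found.append(words[j])
    return found
```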
6. The method according to claim 5, characterized in that outputting the set of domain words
having a whole-part relation according to the position features of the domain words
comprises:
determining the whole domain word and the part domain words in the candidate sentence using
a situation template, the relation between the whole domain word and a part domain word
being that of a superordinate concept and a subordinate concept;
extracting the domain words having the whole-part relation;
outputting the set of domain words having the whole-part relation in the form of a list.
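By way of non-limiting illustration, one possible reading of claim 6 is that the position of a domain word relative to a trigger word decides whether it is the whole or a part. The trigger set `TRIGGERS` and the rule "whole before the trigger, parts after it" are assumptions made for this sketch; the claim does not disclose the template's contents.

```python
# Assumed trigger words: the whole is taken to precede the trigger
# and the parts are taken to follow it.
TRIGGERS = {"包括", "包含"}


def whole_part_pairs(words, domain_dict):
    """Return a list of (whole, part) tuples for one segmented candidate sentence."""
    for i, w in enumerate(words):
        if w in TRIGGERS:
            wholes = [x for x in words[:i] if x in domain_dict]
            parts = [x for x in words[i + 1:] if x in domain_dict]
            if wholes and parts:
                whole = wholes[-1]  # nearest domain word before the trigger
                return [(whole, p) for p in parts]
    return []
```

The result is a plain list, matching the list-form output named in the claim.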
7. The method according to claim 6, characterized in that extracting the domain words having
the whole-part relation comprises:
performing correction processing on the whole domain word and the part domain words, the
correction processing comprising: removing numerals, removing measure words, and/or
removing trailing-word suffixes;
extracting the whole domain word and the part domain words after the correction processing.
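By way of non-limiting illustration, the correction processing of claim 7 can be sketched as string clean-up. The character lists below are hypothetical and deliberately short; a naive rule of this kind would also strip a numeral that is a genuine part of a term, so a practical implementation would need a more careful check.

```python
import re

# Hypothetical character lists; real lists would be far longer.
NUMERALS = "0-9一二两三四五六七八九十百千"
MEASURE_WORDS = "个台辆种条只"
TAIL_SUFFIXES = ("等等", "等")

# A leading run of numerals, optionally followed by one measure word.
_LEADING_QUANTITY = re.compile(rf"^[{NUMERALS}]+[{MEASURE_WORDS}]?")


def correct(word):
    """Strip a leading numeral (plus optional measure word) and one trailing suffix."""
    word = _LEADING_QUANTITY.sub("", word)
    for suffix in TAIL_SUFFIXES:
        # Never reduce a word to the empty string.
        if word.endswith(suffix) and len(word) > len(suffix):
            word = word[: -len(suffix)]
            break
    return word
```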
8. A device for obtaining words, characterized in that the device comprises:
a preprocessing unit, configured to preprocess acquired text data to obtain independent
sentences with word segmentation information;
a screening unit, configured to screen out, from the independent sentences obtained by the
preprocessing unit, candidate sentences having a parallel structure using structure
templates;
a determining unit, configured to determine, using a domain dictionary and the word
segmentation information in the candidate sentences, the domain words having a parallel
structure in the candidate sentences selected by the screening unit, the domain dictionary
being a dictionary that records words of the same domain;
an output unit, configured to output, according to the position features of the domain words
determined by the determining unit, a set of domain words having a whole-part relation.
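By way of non-limiting illustration, the four units of claim 8 can be pictured as four stages chained in a pipeline. The class name `WordAcquisitionDevice`, its field names, and the callable signatures are hypothetical; the claim describes functional units and does not prescribe any software structure.

```python
from dataclasses import dataclass
from typing import Callable, List, Set, Tuple

Sentence = List[str]  # one segmented sentence


@dataclass
class WordAcquisitionDevice:
    """Chains the four claimed units; each unit is supplied as a callable."""
    preprocess: Callable[[str], List[Sentence]]
    screen: Callable[[List[Sentence]], List[Sentence]]
    determine: Callable[[Sentence, Set[str]], List[str]]
    output: Callable[[Sentence, List[str]], List[Tuple[str, str]]]

    def run(self, text: str, domain_dict: Set[str]) -> List[Tuple[str, str]]:
        pairs = []
        # Each stage consumes the result of the previous one, as in the claim.
        for sent in self.screen(self.preprocess(text)):
            domain_words = self.determine(sent, domain_dict)
            pairs.extend(self.output(sent, domain_words))
        return pairs
```

Passing the units in as callables keeps the sketch independent of any particular segmenter or template set.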
9. The device according to claim 8, characterized in that the preprocessing unit comprises:
a sentence-splitting module, configured to perform sentence splitting on the text data to
obtain the independent sentences;
a word segmentation module, configured to perform word segmentation on the independent
sentences obtained by the sentence-splitting module to obtain the word segmentation
information of the independent sentences;
a marking module, configured to mark the word segmentation information obtained by the word
segmentation module in the independent sentences.
10. The device according to claim 8 or 9, characterized in that the screening unit
comprises:
an extraction module, configured to extract the independent sentences having a parallel
structure using characteristic symbols, wherein the characteristic symbols comprise at
least one of the following: the enumeration comma (、), a logical-relation symbol;
a screening module, configured to screen out, from the independent sentences having the
parallel structure extracted by the extraction module, candidate sentences having a
whole-part relation using an affirmative template, the affirmative template being used to
identify sentence structures having a whole-part relation in the independent sentences.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201510886318.9A CN106844326A (en) | 2015-12-04 | 2015-12-04 | A kind of method and device for obtaining word |
Publications (1)
Publication Number | Publication Date |
---|---|
CN106844326A true CN106844326A (en) | 2017-06-13 |
Family
ID=59150525
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201510886318.9A Pending CN106844326A (en) | 2015-12-04 | 2015-12-04 | A kind of method and device for obtaining word |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106844326A (en) |
Citations (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102360383A (en) * | 2011-10-15 | 2012-02-22 | 西安交通大学 | Method for extracting text-oriented field term and term relationship |
Cited By (17)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109543150A (en) * | 2017-09-21 | 2019-03-29 | 北京国双科技有限公司 | A kind for the treatment of method and apparatus of court's trial notes |
CN109543150B (en) * | 2017-09-21 | 2022-11-22 | 北京国双科技有限公司 | Method and device for processing court trial notes |
CN110244860B (en) * | 2018-03-08 | 2024-02-02 | 北京搜狗科技发展有限公司 | Input method and device and electronic equipment |
CN110244860A (en) * | 2018-03-08 | 2019-09-17 | 北京搜狗科技发展有限公司 | A kind of input method, device and electronic equipment |
CN111124144A (en) * | 2018-10-31 | 2020-05-08 | 北京国双科技有限公司 | Input data processing method and device |
CN111124144B (en) * | 2018-10-31 | 2023-04-07 | 北京国双科技有限公司 | Input data processing method and device |
CN109614499B (en) * | 2018-11-22 | 2023-02-17 | 创新先进技术有限公司 | Dictionary generation method, new word discovery method, device and electronic equipment |
CN109543185B (en) * | 2018-11-22 | 2021-11-16 | 联想(北京)有限公司 | Statement topic acquisition method and device |
CN109543185A (en) * | 2018-11-22 | 2019-03-29 | 联想(北京)有限公司 | Utterance topic acquisition methods and device |
CN109614499A (en) * | 2018-11-22 | 2019-04-12 | 阿里巴巴集团控股有限公司 | A kind of dictionary generating method, new word discovery method, apparatus and electronic equipment |
CN109657235B (en) * | 2018-12-05 | 2022-11-25 | 云孚科技(北京)有限公司 | Mixed word segmentation method |
CN109657235A (en) * | 2018-12-05 | 2019-04-19 | 云孚科技(北京)有限公司 | A kind of mixing segmenting method |
CN110413998A (en) * | 2019-07-16 | 2019-11-05 | 深圳供电局有限公司 | A kind of adaptive Chinese word cutting method and its system, medium towards power industry |
CN110413998B (en) * | 2019-07-16 | 2023-04-21 | 深圳供电局有限公司 | Self-adaptive Chinese word segmentation method oriented to power industry, system and medium thereof |
CN111581358A (en) * | 2020-04-08 | 2020-08-25 | 北京百度网讯科技有限公司 | Information extraction method and device and electronic equipment |
CN111581358B (en) * | 2020-04-08 | 2023-08-18 | 北京百度网讯科技有限公司 | Information extraction method and device and electronic equipment |
CN111767715A (en) * | 2020-06-10 | 2020-10-13 | 北京奇艺世纪科技有限公司 | Method, device, equipment and storage medium for person identification |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 100083 No. 401, 4th Floor, Haitai Building, 229 North Fourth Ring Road, Haidian District, Beijing. Applicant after: Beijing Guoshuang Technology Co.,Ltd. Address before: 100086 Cuigong Hotel, 76 Zhichun Road, Shuangyushu District, Haidian District, Beijing. Applicant before: Beijing Guoshuang Technology Co.,Ltd. |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20170613 |