CN105095391A - Device and method for identifying organization name by word segmentation program - Google Patents

Publication number
CN105095391A
Authority
CN
China
Prior art keywords
entry
speech
organization names
dictionary
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201510379024.7A
Other languages
Chinese (zh)
Inventor
李月雷
王志青
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Original Assignee
Beijing Qihoo Technology Co Ltd
Qizhi Software Beijing Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Qihoo Technology Co Ltd, Qizhi Software Beijing Co Ltd filed Critical Beijing Qihoo Technology Co Ltd
Priority to CN201510379024.7A priority Critical patent/CN105095391A/en
Publication of CN105095391A publication Critical patent/CN105095391A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/374Thesaurus
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of network data communication and discloses a device and a method for identifying organization names using a word segmentation program. The device comprises a storage module, a word segmentation module, an identification module and an output module. The storage module is adapted to store data. The word segmentation module is adapted to segment a sentence to be identified using an entry dictionary, obtaining the entries in the sentence. The identification module is adapted to extract, from the entries obtained by segmentation, the entries that are found in a part-of-speech dictionary and have a preset organization-name-related part of speech; to splice the extracted entries according to preset related part-of-speech concatenation rules; to add each spliced entry to a candidate set as a candidate organization name; and to select from the candidate set the entries that satisfy a preset organization name output condition. The output module is adapted to output the selected entries as organization names. The device and method solve the problem of extracting organization names from text and achieve the beneficial effect of extracting them automatically.

Description

Device and method for identifying organization names using a word segmentation program
Technical field
The present invention relates to the technical field of network data communication, and in particular to a device and method for identifying organization names using a word segmentation program.
Background art
In the prior art, an important operation in text mining is identifying named entities, for example the personal names and organization names in a text. Named entity recognition (NER) refers to identifying entities with a specific meaning in a text, mainly personal names, place names, organization names, and proper nouns.
An organization name is the name of an agency, group, or other enterprise or institution, including schools, companies, hospitals, research institutes, and government bodies. Organization names are a subset of proper nouns, and they are especially numerous. Compared with personal names and place names, the form of organization names is unstable: as society develops, new organization names appear, and old ones are eliminated, reorganized, or renamed. In addition, there is no unified national standard for how organization names are composed, and most of them are not included in dictionaries.
Therefore, a technical solution is needed that can extract organization names from text and adapt to their continual change.
Summary of the invention
In view of the above problems, the present invention is proposed to provide a device and method for identifying organization names using a word segmentation program that overcome the above problems or at least partially solve them.
According to one aspect of the present invention, a device for identifying organization names using a word segmentation program is provided. The device comprises:
A storage module, adapted to store an entry dictionary, a part-of-speech dictionary, preset organization-name-related parts of speech, preset related part-of-speech concatenation rules, and preset organization name output conditions;
A word segmentation module, adapted to segment a sentence to be identified using the entry dictionary, obtaining the entries in the sentence;
An identification module, adapted to extract, from the entries obtained by segmentation, the entries that are found in the part-of-speech dictionary and have a preset organization-name-related part of speech; to splice the extracted entries according to the preset related part-of-speech concatenation rules; to add the spliced entries to a candidate set as candidate organization names; and to select from the candidate set the entries that satisfy a preset organization name output condition;
An output module, adapted to output the entries selected from the candidate set as organization names.
Optionally, the preset organization-name-related parts of speech include at least one of the following: place, brand, field in an organization name, and suffix in an organization name.
Optionally, the device further comprises:
A construction module, adapted to build a filter dictionary corresponding to at least one preset organization name output condition according to search terms and the related information of the links found by searching, and to store the built filter dictionary in the storage module;
A filtering module, adapted to filter the entries that the identification module selected from the candidate set, using the filter dictionary stored in the storage module;
The output module is further adapted to output the entries remaining after filtering by the filtering module as organization names.
Optionally, the identification module is further adapted to extract, from the entries obtained by segmentation, the entries whose part of speech in the part-of-speech dictionary is complete organization name;
The output module is further adapted to output the entries extracted by the identification module whose part of speech is complete organization name as organization names.
Optionally, the identification module is further adapted to judge, when an entry obtained by segmentation is in the entry dictionary used for segmentation but not in the part-of-speech dictionary, whether the entry contains a preset organization name suffix; and, when it does, to add the entry to the part-of-speech dictionary stored in the storage module as a complete organization name.
Optionally, the identification module is further adapted to determine, when the sentence to be identified contains an entry combination matching a preset pattern, whether to add the entry combination to the candidate set as a candidate organization name, according to the frequency of occurrence of one or more words in at least one entry of the combination.
Optionally, the device further comprises:
A receiving module, adapted to receive verification information input by a user;
The identification module is further adapted to modify, according to the received verification information, the entry dictionary, the part-of-speech dictionary, the preset organization-name-related parts of speech, the preset related part-of-speech concatenation rules, the preset organization name output conditions, or the filter dictionary stored in the storage module.
Optionally, the construction module is further adapted to count, from search terms that contain an organization name and the related information of the links found by searching, the positive example entries corresponding to at least one preset organization name output condition, and to add the positive example entries to the filter dictionary stored in the storage module.
Optionally, the construction module is further adapted to count, from search terms that do not contain an organization name and the related information of the links found by searching, the negative example entries corresponding to at least one preset organization name output condition, and to add the negative example entries to the filter dictionary stored in the storage module.
Optionally, the construction module is further adapted to determine whether a search term and the related information of the links found by searching contain an organization name, according to the number of entries in them that have a preset organization-name-related part of speech and the positional relationship between those entries.
According to another aspect of the present invention, a method for identifying organization names using a word segmentation program is provided. The method comprises:
Segmenting a sentence to be identified using an entry dictionary, obtaining the entries in the sentence;
Extracting, from the entries obtained by segmentation, the entries that are found in a part-of-speech dictionary and have a preset organization-name-related part of speech;
Splicing the extracted entries according to preset related part-of-speech concatenation rules, and adding the spliced entries to a candidate set as candidate organization names;
Selecting from the candidate set the entries that satisfy a preset organization name output condition;
Outputting the entries selected from the candidate set as organization names.
Optionally, the preset organization-name-related parts of speech include at least one of the following: place, brand, field in an organization name, and suffix in an organization name.
Optionally, the method further comprises:
Building a filter dictionary corresponding to at least one preset organization name output condition according to search terms and the related information of the links found by searching;
Filtering the entries selected from the candidate set using the filter dictionary;
Outputting the entries selected from the candidate set as organization names then comprises:
Outputting the entries remaining after filtering with the filter dictionary as organization names.
Optionally, the method further comprises:
Extracting, from the entries obtained by segmentation, the entries whose part of speech in the part-of-speech dictionary is complete organization name;
Outputting the extracted entries whose part of speech is complete organization name as organization names.
Optionally, the method further comprises:
When an entry obtained by segmentation is in the entry dictionary used for segmentation but not in the part-of-speech dictionary, judging whether the entry contains a preset organization name suffix;
When the entry contains a preset organization name suffix, adding the entry to the part-of-speech dictionary as a complete organization name.
Optionally, the method further comprises:
When the sentence to be identified contains an entry combination matching a preset pattern, determining whether to add the entry combination to the candidate set as a candidate organization name, according to the frequency of occurrence of one or more words in at least one entry of the combination.
Optionally, the method further comprises:
Receiving verification information input by a user;
Modifying, according to the received verification information, the entry dictionary, the part-of-speech dictionary, the preset organization-name-related parts of speech, the preset related part-of-speech concatenation rules, the preset organization name output conditions, or the filter dictionary.
Optionally, building the filter dictionary corresponding to at least one preset organization name output condition according to search terms and the related information of the links found by searching comprises:
Counting, from search terms that contain an organization name and the related information of the links found by searching, the positive example entries corresponding to at least one preset organization name output condition, and adding the positive example entries to the filter dictionary.
Optionally, building the filter dictionary corresponding to at least one preset organization name output condition according to search terms and the related information of the links found by searching comprises:
Counting, from search terms that do not contain an organization name and the related information of the links found by searching, the negative example entries corresponding to at least one preset organization name output condition, and adding the negative example entries to the filter dictionary.
Optionally, the method further comprises:
Determining whether a search term and the related information of the links found by searching contain an organization name, according to the number of entries in them that have a preset organization-name-related part of speech and the positional relationship between those entries.
According to the technical solution of the present invention, a sentence to be identified is segmented using an entry dictionary, obtaining the entries in the sentence; the entries that are found in a part-of-speech dictionary and have a preset organization-name-related part of speech are extracted from the segmented entries; the extracted entries are spliced according to preset related part-of-speech concatenation rules, and the spliced entries are added to a candidate set as candidate organization names; the entries that satisfy a preset organization name output condition are selected from the candidate set; and the selected entries are output as organization names. Organization names can thus be extracted from text automatically. In addition, because the extracted entries can be spliced according to the preset related part-of-speech concatenation rules, the range of organization names available for output is widened, which in turn improves the accuracy of organization name extraction. Because segmentation uses an entry dictionary and entries are selected using the preset organization name settings, extraction takes less time and processing efficiency improves.
The above description is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented according to the content of the specification, and so that the above and other objects, features, and advantages of the invention become more apparent, specific embodiments of the invention are set out below.
Brief description of the drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art on reading the detailed description of the preferred embodiments below. The drawings serve only to illustrate the preferred embodiments and are not to be considered a limitation of the present invention. Throughout the drawings, the same reference symbols denote the same parts. In the drawings:
Fig. 1 shows a flowchart of a method for identifying organization names using a word segmentation program according to an embodiment of the invention;
Fig. 2 shows a flowchart of a method for supplementing the part-of-speech dictionary according to an embodiment of the invention;
Fig. 3 shows a flowchart of a method for verification based on user input according to an embodiment of the invention;
Fig. 4 shows a flowchart of a method for identifying organization names using a word segmentation program according to an embodiment of the invention;
Fig. 5 shows a structural diagram of a name identification device according to an embodiment of the invention;
Fig. 6 shows a structural diagram of a name identification device according to an embodiment of the invention; and
Fig. 7 shows a structural diagram of a name identification device according to an embodiment of the invention.
Detailed description of the embodiments
Exemplary embodiments of the present disclosure are described in more detail below with reference to the drawings. Although the drawings show exemplary embodiments of the disclosure, it should be understood that the disclosure can be realized in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure can be understood more thoroughly and its scope conveyed completely to those skilled in the art.
Fig. 1 shows a flowchart of a method for identifying organization names using a word segmentation program according to an embodiment of the invention. The method can be used in any device that has an organization name extraction task. The device may be a server or a terminal; the present invention places no particular limit on this. As shown in Fig. 1, the method comprises the following steps.
In step S110, the sentence to be identified is segmented using the entry dictionary, obtaining the entries in the sentence.
For example, the entry dictionary contains configured entries. The entries in the dictionary are matched against the sentence to be identified, the sentence is segmented accordingly, and the entries in the sentence are obtained.
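The dictionary matching in step S110 can be sketched as follows. The source does not say which matching strategy is used, so forward maximum matching is an assumption here, and the function name `segment`, the `max_len` parameter, and the sample dictionary are all hypothetical.

```python
def segment(sentence, entry_dict, max_len=6):
    """Forward maximum matching (assumed strategy, not stated in the source).

    At each position the longest substring found in entry_dict is taken as
    the next entry; a single character is used when nothing matches.
    """
    entries = []
    i = 0
    while i < len(sentence):
        # Try the longest window first, then shrink down to one character.
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in entry_dict:
                entries.append(piece)
                i += size
                break
    return entries


# Hypothetical entry dictionary for illustration only.
entry_dict = {"北京", "奇虎", "科技", "有限公司"}
print(segment("北京奇虎科技有限公司", entry_dict))
```

A production segmenter would typically also resolve ambiguous splits; this sketch only shows how an entry dictionary alone can drive segmentation.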
In step S120, the entries that are found in the part-of-speech dictionary and have a preset organization-name-related part of speech are extracted from the entries obtained by segmentation.
The preset organization-name-related parts of speech include at least one of the following: place, brand, field in an organization name, and suffix in an organization name.
Part of speech is one of the attributes of an entry. The part of speech of each segmented entry is looked up in the part-of-speech dictionary, and the entries whose part of speech is one of the preset organization-name-related parts of speech are identified. For example, the preset organization-name-related parts of speech include: place, such as Beijing; brand, such as Cisco; field in an organization name, such as technology; and suffix in an organization name, such as limited company. From the entries obtained by segmenting the sentence, those whose part of speech is one of the above are determined.
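The lookup in step S120 amounts to a dictionary query followed by a membership test. A minimal sketch follows; the labels `place`, `brand`, `field`, `suffix`, the function name `extract_related`, and the sample dictionary contents are illustrative assumptions, not names from the source.

```python
# Hypothetical part-of-speech dictionary: entry -> part-of-speech label.
POS_DICT = {"北京": "place", "奇虎": "brand", "科技": "field", "有限公司": "suffix"}

# The four preset organization-name-related parts of speech from the text.
ORG_RELATED_POS = {"place", "brand", "field", "suffix"}


def extract_related(entries, pos_dict=POS_DICT, related=ORG_RELATED_POS):
    """Keep the entries whose part of speech is organization-name-related."""
    return [(e, pos_dict[e]) for e in entries if pos_dict.get(e) in related]


print(extract_related(["我", "在", "北京", "奇虎", "上班"]))
```

Entries missing from the part-of-speech dictionary are simply skipped at this stage.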
In step S130, the extracted entries are spliced according to the preset related part-of-speech concatenation rules, and the spliced entries are added to the candidate set as candidate organization names.
For example, the preset related part-of-speech concatenation rules may include one or more of the following:
place + place; place + brand; place + suffix in organization name; place + field in organization name; brand + field in organization name; brand + brand; brand + suffix in organization name; field in organization name + field in organization name; field in organization name + suffix in organization name.
The extracted entries that have a preset organization-name-related part of speech are spliced according to the above concatenation rules, and the spliced entries are added to the candidate set.
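One way to read the concatenation rules is as a set of allowed adjacent part-of-speech pairs. The sketch below joins each maximal run of neighbouring entries in which every adjacent pair is allowed. That the rules apply pairwise to neighbouring entries, and that entries with no related part of speech break a run, are assumptions; the source lists the rules but not the joining procedure. The name `splice` and the labels are hypothetical.

```python
# Allowed (left, right) part-of-speech pairs, taken from the rule list above.
RULES = {
    ("place", "place"), ("place", "brand"), ("place", "suffix"),
    ("place", "field"), ("brand", "field"), ("brand", "brand"),
    ("brand", "suffix"), ("field", "field"), ("field", "suffix"),
}


def splice(tagged):
    """tagged: list of (entry, pos) in sentence order; pos is None when the
    entry has no organization-name-related part of speech.

    Returns (spliced_text, pos_sequence) for every run of two or more entries.
    """
    candidates = []
    run = []
    for entry, pos in tagged + [(None, None)]:  # sentinel flushes the last run
        if run and pos is not None and (run[-1][1], pos) in RULES:
            run.append((entry, pos))
            continue
        if len(run) > 1:
            text = "".join(e for e, _ in run)
            candidates.append((text, [p for _, p in run]))
        run = [(entry, pos)] if pos is not None else []
    return candidates


tagged = [("在", None), ("北京", "place"), ("奇虎", "brand"),
          ("科技", "field"), ("有限公司", "suffix"), ("上班", None)]
print(splice(tagged))
```

Single entries are not returned by this sketch; conditions such as a lone brand word would need the unspliced entries to be added to the candidate set separately.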
In addition, in one embodiment, the method further comprises: extracting, from the entries obtained by segmentation, the entries whose part of speech in the part-of-speech dictionary is complete organization name; and outputting the extracted entries whose part of speech is complete organization name as organization names.
In this embodiment, the entries whose part of speech is complete organization name are determined from the entries of the sentence to be identified according to the part-of-speech dictionary. Such entries need no splicing and are output directly as organization names. Outputting them may comprise: adding the extracted entries whose part of speech is complete organization name to the candidate set; when the entries satisfying the preset organization name output condition are selected from the candidate set, selecting these entries as satisfying the condition; and then outputting them.
In one embodiment, the method further comprises: when the sentence to be identified contains an entry combination matching a preset pattern, determining whether to add the entry combination to the candidate set as a candidate organization name, according to the frequency of occurrence of one or more words in at least one entry of the combination.
For example, the preset pattern is a string of the form "place + xx + (field + suffix)+", where xx denotes two words. When the frequency of occurrence of each word in xx, or the product of the two word frequencies, is greater than a certain threshold, the entry combination is judged to be a candidate organization name and is added to the candidate set.
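The frequency test on the two unknown units xx can be sketched as below. Treating each unit as one character, the frequency table, and both threshold values are assumptions made for illustration; the source only states that a frequency or a product of frequencies is compared with a threshold.

```python
def accept_unknown_name(xx, freq, unit_threshold, product_threshold):
    """Decide whether the two units in xx are plausible as part of a name.

    freq maps a unit (assumed here to be one character) to its frequency of
    occurrence. The combination is accepted when every unit is frequent
    enough, or when the product of the two frequencies is large enough.
    """
    if len(xx) != 2:
        return False
    f1, f2 = (freq.get(unit, 0.0) for unit in xx)
    return min(f1, f2) > unit_threshold or f1 * f2 > product_threshold


# Hypothetical frequencies and thresholds.
freq = {"蓝": 0.02, "翔": 0.01}
print(accept_unknown_name("蓝翔", freq, unit_threshold=0.05, product_threshold=0.0001))
```

When the test passes, the whole combination matched by the pattern (place, xx, and the field and suffix words) would be added to the candidate set.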
In step S140, the entries that satisfy a preset organization name output condition are selected from the candidate set.
For example, the preset organization name output conditions include one or more of the following.
(place word | brand word | field word in organization name) + suffix word in organization name: an entry composed of a word of any of the parts of speech place, brand, or field in organization name, followed by an organization name suffix word;
place word + brand word+: an entry composed of a place word and a brand word, or of a place word, a brand word, and words of other parts of speech;
brand word + field word in organization name+: an entry composed of a brand word and a field word in organization name, or of a brand word, a field word in organization name, and words of other parts of speech;
brand word: the entry is a single brand word;
place word: the entry is a single place word.
As noted above, when the entries whose part of speech in the part-of-speech dictionary is complete organization name are extracted from the segmented entries, the preset organization name output conditions may also include: complete organization name. In that case, the extracted entries whose part of speech is complete organization name are selected from the candidate set and output.
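The output conditions can be checked by encoding each candidate's part-of-speech sequence as a string and matching it against patterns. In the sketch below the one-letter codes, the regular expressions, and the reading of the trailing "+" as "optionally followed by further related words" are assumptions drawn from the descriptions above.

```python
import re

# Hypothetical one-letter codes for the parts of speech.
POS_CODE = {"place": "P", "brand": "B", "field": "F", "suffix": "S", "complete": "O"}

OUTPUT_CONDITIONS = [re.compile(p) for p in (
    r"[PBF]S",      # (place | brand | field) + suffix
    r"PB[PBFS]*",   # place + brand, optionally followed by other words
    r"BF[PBFS]*",   # brand + field, optionally followed by other words
    r"B",           # a single brand word
    r"P",           # a single place word
    r"O",           # a complete organization name
)]


def meets_output_condition(pos_seq):
    """True when the part-of-speech sequence matches any output condition."""
    code = "".join(POS_CODE[p] for p in pos_seq)
    return any(cond.fullmatch(code) for cond in OUTPUT_CONDITIONS)


print(meets_output_condition(["place", "brand", "field", "suffix"]))
print(meets_output_condition(["field", "field"]))
```

Encoding the sequence as a string keeps each condition to one line and makes it straightforward to add or revise conditions.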
In one embodiment, the method further comprises: building a filter dictionary corresponding to at least one preset organization name output condition according to search terms and the related information of the links found by searching; and filtering the entries selected from the candidate set using the filter dictionary.
Further, building the filter dictionary comprises: counting, from search terms that contain an organization name and the related information of the links found by searching, the positive example entries corresponding to at least one preset organization name output condition, and adding the positive example entries to the filter dictionary.
Further, building the filter dictionary comprises: counting, from search terms that do not contain an organization name and the related information of the links found by searching, the negative example entries corresponding to at least one preset organization name output condition, and adding the negative example entries to the filter dictionary.
In addition, the method further comprises: determining whether a search term and the related information contain an organization name, according to the number of entries in them that have a preset organization-name-related part of speech and the positional relationship between those entries.
For example, identified complete organization names and candidate organization names ending in an organization name suffix have higher precision, and these candidates are not filtered.
For each of the preset organization name output conditions place word + brand word, brand word + field word in organization name, brand word, and place word, a corresponding filter dictionary is built. A filter dictionary contains positive example entries and negative example entries. The filtering rule may be to retain the candidate organization names that match positive example entries, or to delete the candidate organization names that match negative example entries.
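The two filtering rules described above can be sketched as follows. The function name `filter_candidates`, the `mode` switch, and the label strings are hypothetical; the sample strings are the translated examples from the text.

```python
def filter_candidates(candidates, positives, negatives, mode="drop_negative"):
    """candidates: list of (text, pos_sequence).

    Complete names and names ending in a suffix are kept unfiltered, as the
    text describes. Other candidates are kept either when they match a
    positive example (mode="keep_positive") or when they do not match a
    negative example (mode="drop_negative").
    """
    kept = []
    for text, pos_seq in candidates:
        high_precision = pos_seq == ["complete"] or pos_seq[-1] == "suffix"
        if high_precision:
            kept.append(text)
        elif mode == "keep_positive" and text in positives:
            kept.append(text)
        elif mode == "drop_negative" and text not in negatives:
            kept.append(text)
    return kept


candidates = [
    ("Beijing Qihoo", ["place", "brand"]),
    ("rich and powerful", ["brand"]),
    ("Qihoo Technology", ["brand", "field"]),
    ("Qihoo Technology Co., Ltd.", ["brand", "field", "suffix"]),
]
print(filter_candidates(candidates, {"Beijing Qihoo"}, {"rich and powerful"}))
```

Retaining only positive matches favours precision, while dropping only negative matches favours recall; which rule to use is left open by the text.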
For the preset organization name output condition "place word + brand word", the filter dictionary is built as follows.
The search term (query) for which a link (URL) was clicked, and the related information corresponding to the URL (for example, title information), are collected. From the related information of the links found by queries that contain an organization name, the word frequency of entries of the form "place word + brand word" is counted, and the entries to use as positive examples are determined according to the word frequency information. From the related information found by queries that do not contain an organization name but do contain an entry of the form "place word + brand word", the negative examples are counted. The resulting positive and negative examples are added to the filter dictionary and used to filter candidate organization names of the form "place + brand". For example, a positive example is Beijing Qihoo, and a negative example is "country is rich and powerful".
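The counting step can be sketched with a frequency counter over click-log records. The record layout, the `min_count` cutoff, and the function name `build_filter_dict` are assumptions; the source says only that positive examples are chosen according to word frequency information.

```python
from collections import Counter


def build_filter_dict(records, min_count=2):
    """records: iterable of (query_has_org_name, entries), where entries are
    the strings of the target form (for example place word + brand word)
    found in the related information of the clicked link.

    Entries seen at least min_count times under queries containing an
    organization name become positive examples; those seen at least
    min_count times under other queries become negative examples.
    """
    pos_counts, neg_counts = Counter(), Counter()
    for has_org, entries in records:
        (pos_counts if has_org else neg_counts).update(entries)
    positives = {e for e, n in pos_counts.items() if n >= min_count}
    negatives = {e for e, n in neg_counts.items() if n >= min_count}
    return positives, negatives


# Hypothetical click-log records using the translated examples from the text.
records = [
    (True, ["Beijing Qihoo"]),
    (True, ["Beijing Qihoo"]),
    (False, ["country is rich and powerful"]),
    (False, ["country is rich and powerful"]),
    (True, ["seen only once"]),
]
print(build_filter_dict(records))
```

An entry could land in both sets under this sketch; how such conflicts are resolved is not covered by the text.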
For the preset organization name output condition "brand word + field word in organization name", the filter dictionary is built as follows.
The search term (query) for which a link (URL) was clicked, and the related information corresponding to the URL (for example, title information), are collected. From the related information of the links found by queries that contain an organization name, the word frequency of entries ending with a "field word in organization name" is counted, and the entries to use as positive examples are determined according to the word frequency information. From the related information found by queries that do not contain an organization name but end with a "field word in organization name", the negative examples are counted. The resulting positive and negative examples are added to the filter dictionary and used to filter candidate organization names of the form "brand word + field word in organization name". For example, a positive example is Qihoo Technology, and a negative example is modern moral education.
For the preset organization name output condition of a single "brand word", the filter dictionary is built as follows.
The search term (query) for which a link (URL) was clicked, and the related information corresponding to the URL (for example, title information), are collected. From the related information of the links found by queries that contain an organization name, the word frequency of single "brand words" is counted, and the entries to use as positive examples are determined according to the word frequency information. From the related information found by queries that do not contain an organization name but do contain a single brand word, the negative examples are counted. The resulting positive and negative examples are added to the filter dictionary and used to filter candidate organization names consisting of a single "brand word". For example, a positive example is Qihoo, and a negative example is rich and powerful.
For the preset organization name output condition of a single "place word", the filter dictionary is built as follows.
The search term (query) for which a link (URL) was clicked, and the related information corresponding to the URL (for example, title information), are collected. From the related information of the links found by queries that contain an organization name, the word frequency of single "place words" is counted, and the entries to use as positive examples are determined according to the word frequency information. From the related information found by queries that do not contain an organization name but do contain a single place word, the negative examples are counted. The resulting positive and negative examples are added to the filter dictionary and used to filter candidate organization names consisting of a single "place word". For example, a positive example is Shangdi Shenzhou Digital Building, and a negative example is exhibition road, Beijing.
Whether a query, or the related information corresponding to a URL, contains an organization name is judged as follows. When the query or the related information contains words of any three of the parts of speech "place, brand, field in organization name, suffix in organization name", and the three words satisfy a preset positional relationship (for example, the distances between the three words fall within a preset window distance), the query or title is determined to contain an organization name.
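The window test can be sketched as below. Measuring distance in entry positions, requiring three distinct parts of speech, and the default window size are assumptions; the source states only that three related words must satisfy a preset positional relationship. The name `contains_org_name` is hypothetical.

```python
ORG_POS = {"place", "brand", "field", "suffix"}


def contains_org_name(tagged, window=5, needed=3):
    """tagged: list of (entry, pos) in order; pos is None for unrelated entries.

    Returns True when at least `needed` distinct organization-name-related
    parts of speech occur within `window` consecutive entry positions.
    """
    hits = [(i, pos) for i, (_, pos) in enumerate(tagged) if pos in ORG_POS]
    for start in range(len(hits)):
        first_index = hits[start][0]
        seen = set()
        for i, pos in hits[start:]:
            if i - first_index >= window:
                break  # outside the window that begins at first_index
            seen.add(pos)
        if len(seen) >= needed:
            return True
    return False


title = [("北京", "place"), ("奇虎", "brand"), ("科技", "field"), ("招聘", None)]
print(contains_org_name(title))
```

The result of this check decides whether a record contributes to the positive or the negative counts when a filter dictionary is built.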
In step S150, the entries selected from the candidate set are output as organization names.
In one example, the entries selected from the candidate set are filtered with the filter dictionary. In this case, outputting the entries selected from the candidate set as organization names comprises outputting, as organization names, the entries that remain after filtering with the filter dictionary.
In one embodiment, the part-of-speech dictionary is supplemented. As shown in Fig. 2, the method further comprises:
Step S210: when an entry obtained by word segmentation is in the entry dictionary used for segmentation but not in the part-of-speech dictionary, judging whether the entry contains a preset organization-name suffix.
Step S220: when the entry contains a preset organization-name suffix, adding the entry to the part-of-speech dictionary as a complete organization name.
For example, entries such as "Lanxiang Technical School" and "Agricultural Bank" are contained in the segmentation dictionary but not in the part-of-speech dictionary. It is judged whether these entries contain a preset organization-name suffix; if they do, they are added to the part-of-speech dictionary as independent, complete organization names.
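Steps S210 and S220 can be illustrated with a short sketch. The suffix list, the tag value `org_full`, and the function name `supplement_pos_dict` are hypothetical, and treating "contains a suffix" as "ends with a suffix" is an assumption of this example.

```python
from typing import Dict, Iterable, List, Set, Tuple

# Hypothetical preset organization-name suffixes (English glosses).
ORG_SUFFIXES: Tuple[str, ...] = ("Technical School", "Bank", "Co., Ltd.")


def supplement_pos_dict(entries: Iterable[str],
                        entry_dict: Set[str],
                        pos_dict: Dict[str, str],
                        suffixes: Tuple[str, ...] = ORG_SUFFIXES) -> List[str]:
    """Add entries that are in the segmentation dictionary, missing from the
    part-of-speech dictionary, and end with a preset suffix (assumed reading
    of "contains a suffix") as complete organization names."""
    added = []
    for entry in entries:
        if entry in entry_dict and entry not in pos_dict and entry.endswith(suffixes):
            pos_dict[entry] = "org_full"  # hypothetical tag for "complete organization name"
            added.append(entry)
    return added
```

The function mutates `pos_dict` in place and returns the newly added entries so that a caller can log or review them.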
In one embodiment, verification is performed according to user input. As shown in Fig. 3, the method further comprises:
Step S310: receiving verification information input by a user.
Step S320: according to the received verification information, modifying the entry dictionary, the part-of-speech dictionary, the preset organization-name-related parts of speech, the preset related-part-of-speech splicing rules, the preset organization-name output conditions, or the filter dictionary.
This embodiment extracts organization names from text automatically. In addition, because the extracted entries can be spliced according to the preset related-part-of-speech splicing rules, the range of candidates available for output is widened, which improves the accuracy of organization-name extraction. Because segmentation uses the entry dictionary and entries are selected using the preset organization-name criteria, extraction takes less time, which improves processing efficiency.
Fig. 4 shows a flowchart of a method for identifying organization names using a word segmentation program according to an embodiment of the invention. The method is used in any device that performs an organization-name extraction task; the device may be a server or a terminal, and the invention places no particular limitation on this. As shown in Fig. 4, the method comprises the following steps.
In step S402, whether a search query and its related information contain an organization name is determined from the number of entries, in the search query and in the related information of the retrieved links, that belong to the preset organization-name-related parts of speech, and from the positional relationship between those entries.
In step S404, positive-example entries corresponding to at least one preset organization-name output condition are gathered from the search queries that contain an organization name and the related information of the retrieved links, and the positive-example entries are added to the filter dictionary.
In step S406, negative-example entries corresponding to at least one preset organization-name output condition are gathered from the search queries that do not contain an organization name and the related information of the retrieved links, and the negative-example entries are added to the filter dictionary.
In step S408, the sentence to be identified is segmented using the entry dictionary to obtain the entries in the sentence.
In step S410, the entries whose part of speech, as looked up in the part-of-speech dictionary, is one of the preset organization-name-related parts of speech are extracted from the entries obtained by segmentation.
The preset organization-name-related parts of speech comprise at least one of the following: place, brand, organization-name field, and organization-name suffix.
In step S412, the extracted entries are spliced according to the preset related-part-of-speech splicing rules, and the spliced entries are added to the candidate set as candidate organization names.
In step S414, when the sentence to be identified contains an entry combination that satisfies a preset rule, whether to add the entry combination to the candidate set as a candidate organization name is determined from the frequency of occurrence of one or more characters in at least one entry of the combination.
In step S416, the entries whose part of speech in the part-of-speech dictionary is "complete organization name" are extracted from the entries obtained by segmentation and added to the candidate set as organization names.
In step S418, the entries that satisfy a preset organization-name output condition are selected from the candidate set.
In step S420, the entries selected from the candidate set are filtered with the filter dictionary.
In step S422, the selected entries that remain after filtering with the filter dictionary are output as organization names, and the entries in the candidate set whose part of speech is "complete organization name" are also output as organization names.
The above is merely an exemplary description of the method of the invention for identifying organization names using a word segmentation program, and the invention is not limited to it. Any modification, equivalent replacement, or improvement made within the spirit or principle of the invention falls within its scope of protection.
Fig. 5 shows a structural diagram of a device for identifying organization names using a word segmentation program according to an embodiment of the invention. The device is used in any equipment that performs an organization-name extraction task; the equipment may be a server or a terminal, and the invention places no particular limitation on this. As shown in Fig. 5, the device comprises the following modules.
Storage module 510, adapted to store the entry dictionary, the part-of-speech dictionary, the preset organization-name-related parts of speech, the preset related-part-of-speech splicing rules, and the preset organization-name output conditions;
Word segmentation module 520, adapted to segment the sentence to be identified using the entry dictionary to obtain the entries in the sentence;
Identification module 530, adapted to extract, from the entries obtained by segmentation, the entries whose part of speech in the part-of-speech dictionary is one of the preset organization-name-related parts of speech; to splice the extracted entries according to the preset related-part-of-speech splicing rules; to add the spliced entries to the candidate set as candidate organization names; and to select from the candidate set the entries that satisfy a preset organization-name output condition;
Output module 540, adapted to output the entries selected from the candidate set as organization names.
In one embodiment, the preset organization-name-related parts of speech comprise at least one of the following: place, brand, organization-name field, and organization-name suffix.
For example, the entry dictionary contains configured entries. Word segmentation module 520 matches the entries in the entry dictionary against the sentence to be identified, segments the sentence accordingly, and thereby obtains the entries in the sentence.
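The source does not say which matching strategy the word segmentation module uses. As one possible illustration, the sketch below applies forward maximum matching, a common dictionary-based technique; this choice, the function name `segment`, and the sample dictionary are assumptions of the example, not part of the patent.

```python
from typing import List, Optional, Set


def segment(sentence: str, entry_dict: Set[str], max_len: Optional[int] = None) -> List[str]:
    """Forward maximum matching: at each position take the longest dictionary
    entry that matches, falling back to a single character."""
    if max_len is None:
        max_len = max((len(e) for e in entry_dict), default=1)
    max_len = max(1, max_len)
    entries = []
    i = 0
    while i < len(sentence):
        for size in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + size]
            if size == 1 or piece in entry_dict:
                entries.append(piece)
                i += size
                break
    return entries


# Illustrative dictionary: Beijing, Qihoo, technology, limited company, company.
SAMPLE_DICT = {"北京", "奇虎", "科技", "有限公司", "公司"}
```

With `SAMPLE_DICT`, calling `segment("北京奇虎科技有限公司", SAMPLE_DICT)` splits the string into the four entries for place, brand, field, and suffix, preferring the longer entry "有限公司" over "公司".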
Part of speech is one of the attributes of an entry. The part of speech of each entry obtained by segmentation is determined from the part-of-speech dictionary, and the entries whose part of speech is one of the preset organization-name-related parts of speech are identified. For example, the preset organization-name-related parts of speech comprise: place, such as "Beijing"; brand, such as "Cisco"; organization-name field, such as "Technology"; and organization-name suffix, such as "Co., Ltd.". Identification module 530 picks out, from the entries obtained by segmenting the sentence to be identified, those whose part of speech is one of the above.
The preset related-part-of-speech splicing rules may comprise one or more of the following:
place + place; place + brand; place + organization-name suffix; place + organization-name field; brand + organization-name field; brand + brand; brand + organization-name suffix; organization-name field + organization-name field; organization-name field + organization-name suffix.
Identification module 530 splices the extracted entries that have a preset organization-name-related part of speech according to the above splicing rules, and adds the spliced entries to the candidate set.
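A minimal sketch of the splicing step follows. The nine rules are taken from the list above; the tag names, the function name `splice`, and the decision to emit every spliced prefix of a run (rather than only the longest run) are assumptions made for illustration.

```python
from typing import List, Set, Tuple

# The nine adjacent-tag pairs listed in the splicing rules (tag names are hypothetical).
RULES: Set[Tuple[str, str]] = {
    ("place", "place"), ("place", "brand"), ("place", "suffix"), ("place", "field"),
    ("brand", "field"), ("brand", "brand"), ("brand", "suffix"),
    ("field", "field"), ("field", "suffix"),
}


def splice(tagged: List[Tuple[str, str]],
           rules: Set[Tuple[str, str]] = RULES,
           sep: str = "") -> List[Tuple[str, Tuple[str, ...]]]:
    """From each start position, keep appending the next entry while the
    adjacent tag pair is allowed; every spliced prefix becomes a candidate."""
    candidates = []
    for start in range(len(tagged)):
        words = [tagged[start][0]]
        tags = [tagged[start][1]]
        for word, tag in tagged[start + 1:]:
            if (tags[-1], tag) not in rules:
                break
            words.append(word)
            tags.append(tag)
            candidates.append((sep.join(words), tuple(tags)))
    return candidates
```

Each candidate keeps its tag sequence alongside the spliced text, so that a later step can test the candidate against the output conditions without re-tagging.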
The preset organization-name output conditions comprise one or more of the following.
(place word | brand word | organization-name field word) + organization-name suffix word: an entry composed of a word of any one of the parts of speech place, brand, and organization-name field, together with an organization-name suffix word;
place word + brand word +: an entry composed of a place word and a brand word, or of a place word, a brand word, and words of other parts of speech;
brand word + organization-name field word +: an entry composed of a brand word and an organization-name field word, or of a brand word, an organization-name field word, and words of other parts of speech;
brand word: the entry is a single standalone brand word;
place word: the entry is a single standalone place word.
Identification module 530 selects from the candidate set the entries that satisfy a preset organization-name output condition.
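The five output conditions can be expressed as patterns over a candidate's tag sequence. The sketch below encodes each tag as one letter and uses regular expressions; the condition names and the letter codes are hypothetical. It reads the first condition as "one or more place, brand, or field words followed by a suffix word", which is an assumption; a stricter reading would allow exactly one leading word.

```python
import re
from typing import Dict, List

# One-letter codes for the four tags; anything else maps to "X".
CODES: Dict[str, str] = {"place": "P", "brand": "B", "field": "F", "suffix": "S"}

# Hypothetical names for the five output conditions.
CONDITIONS: Dict[str, str] = {
    "suffix_ending": r"[PBF]+S",
    "place_brand": r"PB[PBFS]*",
    "brand_field": r"BF[PBFS]*",
    "brand_only": r"B",
    "place_only": r"P",
}


def matched_conditions(tags: List[str]) -> List[str]:
    """Return the names of all output conditions the tag sequence satisfies."""
    encoded = "".join(CODES.get(tag, "X") for tag in tags)
    return [name for name, pattern in CONDITIONS.items()
            if re.fullmatch(pattern, encoded)]
```

Returning the matched condition names, rather than a plain boolean, lets a later filtering step pick the filter dictionary that corresponds to the condition a candidate satisfied.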
Output module 540 outputs the entries selected from the candidate set as organization names.
In one embodiment, as shown in Fig. 6, the device further comprises:
Construction module 610, adapted to build a filter dictionary corresponding to at least one preset organization-name output condition from search queries and the related information of the retrieved links, and to store the built filter dictionary in storage module 510;
Filtering module 620, adapted to filter the entries that the identification module selects from the candidate set, using the filter dictionary stored in storage module 510;
Output module 540, further adapted to output as organization names the entries that remain after filtering by the filtering module.
Further, construction module 610 is adapted to gather positive-example entries corresponding to at least one preset organization-name output condition from the search queries that contain an organization name and the related information of the retrieved links, and to add the positive-example entries to the filter dictionary stored in storage module 510.
Further, construction module 610 is adapted to gather negative-example entries corresponding to at least one preset organization-name output condition from the search queries that do not contain an organization name and the related information of the retrieved links, and to add the negative-example entries to the filter dictionary stored in storage module 510.
Further, construction module 610 is also adapted to determine whether a search query and its related information contain an organization name from the number of entries, in the search query and in the related information of the retrieved links, that belong to the preset organization-name-related parts of speech, and from the positional relationship between those entries.
For example, identified complete organization names and candidate organization names that end with an organization-name suffix have relatively high precision, so these candidates are not filtered.
For the preset organization-name output conditions "place word + brand word", "brand word + organization-name field word", "brand word", and "place word", a corresponding filter dictionary is built for each; a filter dictionary contains positive-example entries and negative-example entries. The filtering rule may be to retain the candidate organization names that match a positive-example entry, or to delete the candidate organization names that match a negative-example entry.
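The two filtering rules mentioned above (retain matches of positive examples, or delete matches of negative examples) can be sketched in one small function. The function name `filter_candidates` and the `keep_unknown` switch that selects between the two rules are hypothetical.

```python
from typing import Iterable, List, Set


def filter_candidates(selected: Iterable[str],
                      positives: Set[str],
                      negatives: Set[str],
                      keep_unknown: bool = False) -> List[str]:
    """Drop candidates listed as negative examples. With keep_unknown=False,
    only candidates listed as positive examples survive (retain rule); with
    keep_unknown=True, everything not listed as negative survives (delete rule)."""
    kept = []
    for name in selected:
        if name in negatives:
            continue
        if name in positives or keep_unknown:
            kept.append(name)
    return kept
```

For example, `filter_candidates(["Qihoo", "rich and powerful"], {"Qihoo"}, {"rich and powerful"})` keeps only "Qihoo". Candidates that are complete organization names or end with an organization-name suffix would bypass this function entirely, as the text notes.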
For the preset organization-name output condition "place word + brand word", the filter dictionary is built as follows.
Construction module 610 joins each search query for which a link URL was clicked with the related information of that URL (for example, its title). From the related information of links retrieved by queries that contain an organization name, it gathers word-frequency statistics for entries of the form "place word + brand word" and determines the positive examples from those statistics. From the related information retrieved by queries that contain an entry of the form "place word + brand word" but no organization name, it gathers negative examples of that form. Construction module 610 adds the resulting positive and negative examples to the filter dictionary and uses them to filter candidate organization names of the form "place word + brand word". For example, a positive example is "Beijing Qihoo" and a negative example is "country rich and powerful".
For the preset organization-name output condition "brand word + organization-name field word", the filter dictionary is built as follows.
Construction module 610 joins each search query for which a link URL was clicked with the related information of that URL (for example, its title). From the related information of links retrieved by queries that contain an organization name, it gathers word-frequency statistics for entries that end with an organization-name field word and determines the positive examples from those statistics. From the related information retrieved by queries that end with an organization-name field word but contain no organization name, it gathers negative examples. Construction module 610 adds the resulting positive and negative examples to the filter dictionary and uses them to filter candidate organization names of the form "brand word + organization-name field word". For example, a positive example is "Qihoo Technology" and a negative example is "modern moral education".
For the preset organization-name output condition of a standalone brand word, the filter dictionary is built as follows.
Construction module 610 joins each search query for which a link URL was clicked with the related information of that URL (for example, its title). From the related information of links retrieved by queries that contain an organization name, it gathers word-frequency statistics for standalone brand words and determines the positive examples from those statistics. From the related information retrieved by queries that contain a standalone brand word but no organization name, it gathers negative examples. Construction module 610 adds the resulting positive and negative examples to the filter dictionary and uses them to filter candidate organization names that consist of a standalone brand word. For example, a positive example is "Qihoo" and a negative example is "rich and powerful".
For the preset organization-name output condition of a standalone place word, the filter dictionary is built as follows.
Construction module 610 joins each search query for which a link URL was clicked with the related information of that URL (for example, its title). From the related information of links retrieved by queries that contain an organization name, it gathers word-frequency statistics for standalone place words and determines the positive examples from those statistics. From the related information retrieved by queries that contain a standalone place word but no organization name, it gathers negative examples. Construction module 610 adds the resulting positive and negative examples to the filter dictionary and uses them to filter candidate organization names that consist of a standalone place word. For example, a positive example is "Shangdi Shenzhou Digital Building" and a negative example is "Beijing Exhibition Road".
Whether the related information corresponding to a query or URL contains an organization name is judged as follows. When that related information contains entries of any three of the four parts of speech (place, brand, organization-name field, organization-name suffix), and the three entries stand in a preset positional relationship, for example when the spacing between them falls within a preset window distance, the query or title is determined to contain an organization name.
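The construction of one filter dictionary from click data can be sketched as a simple counting pass. The record layout (a list of entries of the target form found in a query and its title, plus a flag saying whether that query was judged to contain an organization name), the `min_count` threshold, and the function name `build_filter_dict` are assumptions of this example; the source only says that positive examples are determined "according to word-frequency information".

```python
from collections import Counter
from typing import Iterable, List, Set, Tuple


def build_filter_dict(records: Iterable[Tuple[List[str], bool]],
                      min_count: int = 2) -> Tuple[Set[str], Set[str]]:
    """Count entries of the target form separately for records with and
    without an organization name, then keep those seen at least min_count times."""
    pos_counts: Counter = Counter()
    neg_counts: Counter = Counter()
    for entries, has_org_name in records:
        if has_org_name:
            pos_counts.update(entries)
        else:
            neg_counts.update(entries)
    positives = {e for e, c in pos_counts.items() if c >= min_count}
    negatives = {e for e, c in neg_counts.items() if c >= min_count}
    return positives, negatives
```

This sketch does not resolve the case where the same entry clears the threshold on both sides; a real system would need a tie-breaking policy, which the source does not describe.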
In one embodiment, identification module 530 is also adapted to extract, from the entries obtained by segmentation, the entries whose part of speech in the part-of-speech dictionary is "complete organization name";
Output module 540 is also adapted to output as organization names the entries, extracted by the identification module, whose part of speech is "complete organization name".
In one embodiment, identification module 530 is also adapted to judge, when an entry obtained by segmentation is in the entry dictionary used for segmentation but not in the part-of-speech dictionary, whether the entry contains a preset organization-name suffix; and, when the entry contains a preset organization-name suffix, to add the entry as a complete organization name to the part-of-speech dictionary stored in storage module 510.
In one embodiment, identification module 530 is also adapted to determine, when the sentence to be identified contains an entry combination that satisfies a preset rule, whether to add the entry combination to the candidate set as a candidate organization name, according to the frequency of occurrence of one or more characters in at least one entry of the combination.
For example, the preset rule is a character string of the form "place + xx + (field + suffix)+", where xx denotes two characters. When the frequency of occurrence of each character in xx, or the product of the two characters' frequencies, is greater than a certain threshold, identification module 530 judges the entry combination to be a candidate organization name and adds it to the candidate set.
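A sketch of the "place + xx + (field + suffix)+" rule follows. Several points are assumptions: the two unknown characters are taken to be untagged entries (tag `other`) directly after a place entry, the tail is read as one or more entries tagged field or suffix, and `char_freq` is a hypothetical table of character frequencies whose origin the source does not specify. The sample characters are made up for illustration.

```python
from typing import Dict, List, Optional, Tuple


def match_preset_rule(tagged: List[Tuple[str, str]],
                      char_freq: Dict[str, float],
                      threshold: float,
                      use_product: bool = True) -> Optional[str]:
    """Return the spliced text of the first place + xx + (field|suffix)+
    combination whose two characters pass the frequency test, else None."""
    n = len(tagged)
    for i, (_, tag) in enumerate(tagged):
        if tag != "place":
            continue
        # Collect exactly two untagged characters after the place entry.
        j, xx = i + 1, ""
        while j < n and len(xx) < 2 and tagged[j][1] == "other":
            xx += tagged[j][0]
            j += 1
        if len(xx) != 2:
            continue
        # Require at least one field or suffix entry after xx.
        k = j
        while k < n and tagged[k][1] in ("field", "suffix"):
            k += 1
        if k == j:
            continue
        f1, f2 = char_freq.get(xx[0], 0.0), char_freq.get(xx[1], 0.0)
        passed = (f1 * f2 > threshold) if use_product else (f1 > threshold and f2 > threshold)
        if passed:
            return "".join(word for word, _ in tagged[i:k])
    return None
```

The `use_product` flag switches between the two tests named in the text: the product of the two frequencies, or each frequency individually, compared against the threshold.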
In one embodiment, as shown in Fig. 7, the device further comprises: receiving module 710, adapted to receive verification information input by a user;
Identification module 530 is also adapted to modify, according to the received verification information, the entry dictionary, the part-of-speech dictionary, the preset organization-name-related parts of speech, the preset related-part-of-speech splicing rules, the preset organization-name output conditions, or the filter dictionary stored in the storage module.
It should be noted that:
The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other equipment. Various general-purpose systems may also be used with the teachings herein. From the description above, the structure required to construct such systems is apparent. Furthermore, the invention is not directed to any particular programming language. It should be understood that the content of the invention described herein may be implemented in various programming languages, and that the description above of a specific language is given to disclose the best mode of the invention.
Numerous specific details are described in the specification provided herein. It will be understood, however, that embodiments of the invention may be practiced without these specific details. In some instances, well-known methods, structures, and techniques are not shown in detail so as not to obscure the understanding of this description.
Similarly, it should be understood that, in order to streamline the disclosure and aid the understanding of one or more of the various inventive aspects, the features of the invention are sometimes grouped together in a single embodiment, figure, or description thereof in the above description of exemplary embodiments. However, the disclosed method should not be interpreted as reflecting an intention that the claimed invention requires more features than are expressly recited in each claim. Rather, as the following claims reflect, inventive aspects lie in fewer than all features of a single foregoing disclosed embodiment. The claims following the detailed description are therefore expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.
Those skilled in the art will understand that the modules in the device of an embodiment can be adaptively changed and arranged in one or more devices different from that embodiment. The modules, units, or components of the embodiments may be combined into one module, unit, or component, and may furthermore be divided into a plurality of sub-modules, sub-units, or sub-components. Except where at least some of such features and/or processes or units are mutually exclusive, all features disclosed in this specification (including the accompanying claims, abstract, and drawings) and all processes or units of any method or device so disclosed may be combined in any combination. Unless expressly stated otherwise, each feature disclosed in this specification (including the accompanying claims, abstract, and drawings) may be replaced by an alternative feature serving the same, an equivalent, or a similar purpose.
Furthermore, those skilled in the art will understand that although some embodiments described herein include some features that are included in other embodiments but not others, combinations of features of different embodiments are meant to be within the scope of the invention and to form different embodiments. For example, in the following claims, any of the claimed embodiments may be used in any combination.
The various component embodiments of the invention may be implemented in hardware, in software modules running on one or more processors, or in a combination thereof. Those skilled in the art will understand that a microprocessor or digital signal processor (DSP) may be used in practice to implement some or all of the functions of some or all of the components of the device for identifying organization names using a word segmentation program according to embodiments of the invention. The invention may also be implemented as device or apparatus programs (for example, computer programs and computer program products) for performing part or all of the method described herein. Such programs implementing the invention may be stored on a computer-readable medium, or may take the form of one or more signals. Such signals may be downloaded from an Internet website, provided on a carrier signal, or provided in any other form.
It should be noted that the above embodiments illustrate rather than limit the invention, and that those skilled in the art may design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference sign placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements and by means of a suitably programmed computer. In a unit claim enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, and so on does not indicate any order; these words may be interpreted as names.

Claims (10)

1. A device for identifying organization names using a word segmentation program, the device comprising:
a storage module, adapted to store an entry dictionary, a part-of-speech dictionary, preset organization-name-related parts of speech, preset related-part-of-speech splicing rules, and preset organization-name output conditions;
a word segmentation module, adapted to segment a sentence to be identified using the entry dictionary to obtain the entries in the sentence;
an identification module, adapted to extract, from the entries obtained by segmentation, the entries whose part of speech in the part-of-speech dictionary is one of the preset organization-name-related parts of speech; to splice the extracted entries according to the preset related-part-of-speech splicing rules; to add the spliced entries to a candidate set as candidate organization names; and to select from the candidate set the entries that satisfy a preset organization-name output condition;
an output module, adapted to output the entries selected from the candidate set as organization names.
2. The device according to claim 1, wherein
the preset organization-name-related parts of speech comprise at least one of the following: place, brand, organization-name field, and organization-name suffix.
3. The device according to claim 1 or 2, wherein the device further comprises:
a construction module, adapted to build a filter dictionary corresponding to at least one preset organization-name output condition from search queries and the related information of the retrieved links, and to store the built filter dictionary in the storage module;
a filtering module, adapted to filter the entries that the identification module selects from the candidate set, using the filter dictionary stored in the storage module;
the output module being further adapted to output as organization names the entries that remain after filtering by the filtering module.
4. The device according to any one of claims 1 to 3, wherein
the identification module is also adapted to extract, from the entries obtained by segmentation, the entries whose part of speech in the part-of-speech dictionary is "complete organization name";
the output module is also adapted to output as organization names the entries, extracted by the identification module, whose part of speech is "complete organization name".
5. The device according to any one of claims 1 to 4, wherein
the identification module is also adapted to judge, when an entry obtained by segmentation is in the entry dictionary used for segmentation but not in the part-of-speech dictionary, whether the entry contains a preset organization-name suffix; and, when the entry contains a preset organization-name suffix, to add the entry as a complete organization name to the part-of-speech dictionary stored in the storage module.
6. The device according to any one of claims 1 to 5, wherein
the identification module is also adapted to determine, when the sentence to be identified contains an entry combination that satisfies a preset rule, whether to add the entry combination to the candidate set as a candidate organization name, according to the frequency of occurrence of one or more characters in at least one entry of the combination.
7. The device according to any one of claims 1 to 6, wherein the device further comprises:
a receiving module, adapted to receive verification information input by a user;
the identification module being also adapted to modify, according to the received verification information, the entry dictionary, the part-of-speech dictionary, the preset organization-name-related parts of speech, the preset related-part-of-speech splicing rules, the preset organization-name output conditions, or the filter dictionary stored in the storage module.
8. The device according to any one of claims 1 to 7, wherein
the construction module is further adapted to gather positive-example entries corresponding to at least one preset organization-name output condition from the search queries that contain an organization name and the related information of the retrieved links, and to add the positive-example entries to the filter dictionary stored in the storage module.
9. The device according to any one of claims 1 to 8, wherein
the construction module is further adapted to gather negative-example entries corresponding to at least one preset organization-name output condition from the search queries that do not contain an organization name and the related information of the retrieved links, and to add the negative-example entries to the filter dictionary stored in the storage module.
10. A method for identifying organization names using a word segmentation program, the method comprising:
segmenting a sentence to be identified using an entry dictionary to obtain the entries in the sentence;
extracting, from the entries obtained by segmentation, the entries whose part of speech in a part-of-speech dictionary is one of the preset organization-name-related parts of speech;
splicing the extracted entries according to preset related-part-of-speech splicing rules, and adding the spliced entries to a candidate set as candidate organization names;
selecting from the candidate set the entries that satisfy a preset organization-name output condition;
outputting the entries selected from the candidate set as organization names.
CN201510379024.7A 2015-06-30 2015-06-30 Device and method for identifying organization name by word segmentation program Pending CN105095391A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201510379024.7A CN105095391A (en) 2015-06-30 2015-06-30 Device and method for identifying organization name by word segmentation program

Publications (1)

Publication Number Publication Date
CN105095391A true CN105095391A (en) 2015-11-25

Family

ID=54575828

Country Status (1)

Country Link
CN (1) CN105095391A (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107291695A (en) * 2017-06-28 2017-10-24 三角兽(北京)科技有限公司 Information processing device and its word segmentation processing method
CN108108379A (en) * 2016-11-25 2018-06-01 北京国双科技有限公司 Keyword word expansion method and device
CN108170672A (en) * 2017-12-22 2018-06-15 武汉数博科技有限责任公司 Real-time analysis method and system for Chinese organization names
CN108595435A (en) * 2018-05-03 2018-09-28 鹏元征信有限公司 Organization name recognition processing method, intelligent terminal and storage medium
CN109032375A (en) * 2018-06-29 2018-12-18 北京百度网讯科技有限公司 Candidate text sorting method, device, equipment and storage medium
CN109933800A (en) * 2019-03-22 2019-06-25 中国农业银行股份有限公司 Creation method of data structure system, information query method and device
CN110309175A (en) * 2018-03-02 2019-10-08 北大方正集团有限公司 Reference book checking method and reference book checking device
CN114220054A (en) * 2021-12-15 2022-03-22 北京中科智易科技有限公司 Method for analyzing tactical action of equipment and synchronously displaying equipment based on equipment bus data

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8041730B1 (en) * 2006-10-24 2011-10-18 Google Inc. Using geographic data to identify correlated geographic synonyms
CN102880647A (en) * 2012-08-24 2013-01-16 北京百度网讯科技有限公司 Method and device for acquiring another name of organization
CN103186524A (en) * 2011-12-30 2013-07-03 高德软件有限公司 Address name identification method and device
CN103309852A (en) * 2013-06-14 2013-09-18 瑞达信息安全产业股份有限公司 Method for discovering compound words in specific field based on statistics and rules
CN104679885A (en) * 2015-03-17 2015-06-03 北京理工大学 User search string organization name recognition method based on semantic feature model

Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108108379A (en) * 2016-11-25 2018-06-01 北京国双科技有限公司 Keyword word expansion method and device
CN108108379B (en) * 2016-11-25 2021-05-28 北京国双科技有限公司 Keyword word expansion method and device
CN107291695A (en) * 2017-06-28 2017-10-24 三角兽(北京)科技有限公司 Information processing device and its word segmentation processing method
CN107291695B (en) * 2017-06-28 2019-01-11 三角兽(北京)科技有限公司 Information processing device and its word segmentation processing method
CN108170672A (en) * 2017-12-22 2018-06-15 武汉数博科技有限责任公司 Real-time analysis method and system for Chinese organization names
CN110309175A (en) * 2018-03-02 2019-10-08 北大方正集团有限公司 Reference book checking method and reference book checking device
CN110309175B (en) * 2018-03-02 2021-12-03 北大方正集团有限公司 Tool book checking method and tool book checking device
CN108595435A (en) * 2018-05-03 2018-09-28 鹏元征信有限公司 Organization name recognition processing method, intelligent terminal and storage medium
CN108595435B (en) * 2018-05-03 2020-09-01 鹏元征信有限公司 Organization name recognition processing method, intelligent terminal and storage medium
CN109032375A (en) * 2018-06-29 2018-12-18 北京百度网讯科技有限公司 Candidate text sorting method, device, equipment and storage medium
CN109032375B (en) * 2018-06-29 2022-07-19 北京百度网讯科技有限公司 Candidate text sorting method, device, equipment and storage medium
CN109933800A (en) * 2019-03-22 2019-06-25 中国农业银行股份有限公司 Creation method of data structure system, information query method and device
CN114220054A (en) * 2021-12-15 2022-03-22 北京中科智易科技有限公司 Method for analyzing tactical action of equipment and synchronously displaying equipment based on equipment bus data

Similar Documents

Publication Publication Date Title
CN105095391A (en) Device and method for identifying organization name by word segmentation program
CN104715064B (en) Method and server for marking keywords on a webpage
CN107861753B (en) APP generation index, retrieval method and system and readable storage medium
CN104462505A (en) Search method and device
US10394939B2 (en) Resolving outdated items within curated content
CN104699737A (en) Method and system for managing a search
CN104699751A (en) Search recommending method and device based on search terms
CN104462508A (en) Character relation search method and device based on knowledge graph
CN104462504A (en) Method and device for providing reasoning process data in search
US11651014B2 (en) Source code retrieval
CN110765761A (en) Contract sensitive word checking method and device based on artificial intelligence and storage medium
CN105550169A (en) Method and device for identifying point of interest names based on character length
CN104699845A (en) Question-style search word based providing method and device of search results
CN103914533A (en) Promotion search result display method and device
CN108197243A (en) Input association recommendation method and device based on user identity
CN105095381A (en) Method and device for new word identification
CN102646124A (en) Method for automatically identifying address information
CN103942264A (en) Method and device for pushing webpages containing news information
CN104899214A (en) Data processing method and system for setting up input suggestions
US20110320466A1 (en) Methods and systems for filtering search results
CN106354721A (en) Retrieval method and device based on authority
CN105159921A (en) Method and apparatus for de-duplicating point-of-interest (POI) data in map
CN109614535B (en) Method and device for acquiring network data based on Scapy framework
CN104462552A (en) Question and answer page core word extracting method and device
CN113051919A (en) Method and device for identifying named entity

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20151125