CN104281565A - Semantic dictionary constructing method and device - Google Patents

Info

Publication number
CN104281565A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201410521385.6A
Other languages
Chinese (zh)
Other versions
CN104281565B (en
Inventor
曾增烽
李朋凯
林英展
何径舟
石磊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Baidu Online Network Technology Beijing Co Ltd
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN201410521385.6A
Publication of CN104281565A
Application granted
Publication of CN104281565B
Legal status: Active

Landscapes

  • Machine Translation (AREA)

Abstract

An embodiment of the invention discloses a semantic dictionary constructing method and device. The method includes: extracting sentences with same or similar meanings from weblogs of users; according to meanings of words in the sentences with the same or similar semantics, aligning the words in the sentences with the same or similar meanings so as to acquire alternative words with the same or similar meanings; according to contextual information of the alternative words in the sentences with the same or similar meanings, clustering the alternative words so as to acquire words with the same meanings, and adding the words with the same meanings into a semantic dictionary. By the method and device, the semantic dictionary including synonym data is constructed efficiently by mining the weblogs of the users.

Description

Semantic dictionary construction method and device
Technical field
Embodiments of the present invention relate to network data processing technology, and in particular to a semantic dictionary construction method and device.
Background art
In current natural language processing practice, a semantic dictionary with a rich corpus and reliable data plays a fundamental role in improving the processing efficiency and accuracy of a natural language processing system. Traditional semantic dictionaries, however, mostly rely on manual data collection and processing. Moreover, because a semantic dictionary places high professional demands on its data, the people who collect and process that data need strong domain knowledge and solid linguistic training. Few people meet these requirements and can take part in compiling a semantic dictionary. With so few qualified people, and with collection and processing done by hand, traditional semantic dictionary compilation is inefficient. In an age of information explosion, natural language processing systems must handle massive corpus data, and such slow compilation cannot meet the demand for information processing.
Summary of the invention
In view of this, embodiments of the present invention propose a semantic dictionary construction method and device for building a semantic dictionary efficiently.
In a first aspect, an embodiment of the present invention provides a semantic dictionary construction method, the method comprising:
Extracting sentences with the same or similar meanings from the web logs of users;
Aligning the words in the sentences with the same or similar meanings according to the meanings of those words, thereby obtaining alternative words with the same or similar meanings;
Clustering the alternative words according to their contextual information in the sentences with the same or similar meanings to obtain words with the same meaning, and adding the words with the same meaning to a semantic dictionary.
In a second aspect, an embodiment of the present invention provides a semantic dictionary construction device, the device comprising:
A sentence screening module, configured to extract sentences with the same or similar meanings from the web logs of users;
A word screening module, configured to align the words in the sentences with the same or similar meanings according to the meanings of those words, thereby obtaining alternative words with the same or similar meanings;
A word clustering module, configured to cluster the alternative words according to their contextual information in the sentences with the same or similar meanings to obtain words with the same meaning, and to add the words with the same meaning to a semantic dictionary.
The semantic dictionary construction method and device provided by the embodiments of the present invention extract sentences with the same or similar meanings from the web logs of users, align the words in those sentences according to their meanings to obtain alternative words with the same or similar meanings, cluster the alternative words according to their contextual information in those sentences to obtain words with the same meaning, and add the words with the same meaning to a semantic dictionary. Synonym data can thus be mined from the web logs of users, and the semantic dictionary can be built efficiently.
Brief description of the drawings
Other features, objects and advantages of the present invention will become more apparent from the following detailed description of non-limiting embodiments, read with reference to the accompanying drawings:
Fig. 1 is a flowchart of the semantic dictionary construction method provided by the first embodiment of the present invention;
Fig. 2 is a schematic diagram of word alignment in the semantic dictionary construction method provided by the first embodiment of the present invention;
Fig. 3 is a flowchart of the semantic dictionary construction method provided by the second embodiment of the present invention;
Fig. 4 is a flowchart of sentence screening in the semantic dictionary construction method provided by the second embodiment of the present invention;
Fig. 5 is a flowchart of the semantic dictionary construction method provided by the third embodiment of the present invention;
Fig. 6 is a flowchart of sentence screening in the semantic dictionary construction method provided by the third embodiment of the present invention;
Fig. 7 is a flowchart of word screening in the semantic dictionary construction method provided by the fourth embodiment of the present invention;
Fig. 8 is a flowchart of the semantic dictionary construction method provided by the fifth embodiment of the present invention;
Fig. 9 is a schematic diagram of word clustering in the semantic dictionary construction method provided by the fifth embodiment of the present invention;
Fig. 10 is a flowchart of word clustering in the semantic dictionary construction method provided by the fifth embodiment of the present invention;
Fig. 11 is a structural diagram of the semantic dictionary construction device provided by the sixth embodiment of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the drawings and embodiments. The specific embodiments described here only explain the present invention and do not limit it. Note also that, for ease of description, the drawings show only the parts related to the present invention rather than the full content.
Fig. 1 and Fig. 2 show the first embodiment of the present invention.
Fig. 1 is a flowchart of the semantic dictionary construction method provided by the first embodiment of the present invention. Referring to Fig. 1, the semantic dictionary construction method comprises:
S110, extract sentences with the same or similar meanings from the web logs of users.
Now that the internet is widely used, more and more people obtain information through it. When a user browses or searches the web, and especially when the user searches with a search engine, the server generates a large amount of web log data from the user's actual operations. In this embodiment, the semantic dictionary is built by mining these web logs.
The web logs comprise user click logs and user session logs. A user click log records the process, when a user uses a search engine, from entering a query sentence to clicking the result entry in the search results page that corresponds to the web page the user wants to browse. A user session log records the different query operations a user performs within one query session, where different query operations are web queries that use different query sentences.
Preferably, extracting sentences with the same or similar meanings from the web logs of users comprises: obtaining, from the user click log, the query sentence used in a user's query and the title of the web page the user clicked, and taking the query sentence and the web page title as sentences with the same or similar meanings; or obtaining, from the user session log, at least two query sentences used in one query session of a user, and taking the at least two query sentences as sentences with the same or similar meanings.
A query sentence is the statement a user submits when searching web pages with a search engine; the search engine matches it against the content of different web pages to produce search results. A query sentence is usually a complete statement, such as "opening achievement is universally acknowledged". In some cases, however, it may be incomplete, such as "iphone 6 price".
S120, align the words in the sentences with the same or similar meanings according to the meanings of those words, thereby obtaining alternative words with the same or similar meanings.
After the sentences with the same or similar meanings have been extracted from the web logs of users, the words in those sentences are aligned according to the meanings of the different words.
Fig. 2 is a schematic diagram of word alignment in the semantic dictionary construction method provided by the first embodiment of the present invention. Referring to Fig. 2, the two sentences 210 and 220 with the same or similar meanings are split into words 201, and the words in the two sentences are then aligned word by word. Because the alignment is based on the meanings of the words, two aligned words 201 and 202 generally have the same or similar meanings.
Once the words in the sentences with the same or similar meanings have been aligned, the aligned words are taken as alternative words with the same or similar meanings.
S130, cluster the alternative words according to their contextual information in the sentences with the same or similar meanings to obtain words with the same meaning, and add the words with the same meaning to the semantic dictionary.
The alternative words with the same or similar meanings are only candidate entries for the semantic dictionary to be built. They must be processed further before the words that truly share the same meaning can be extracted and used to build the semantic dictionary.
This further processing is clustering of the alternative words, performed according to their contextual information in the sentences with the same or similar meanings. Specifically, the contextual information of each alternative word in those sentences is treated as the attribute information of that word, and the alternative words are clustered according to this attribute information. Words that have similar contextual information in the sentences with the same or similar meanings are thus grouped into one class. When different words repeatedly substitute for one another in similar contexts, those words can be considered to have the same meaning. The words grouped into one class are therefore taken as words with the same meaning.
After the words in the sentences with the same or similar meanings have been clustered, the words with the same meaning are added to the semantic dictionary.
This embodiment extracts sentences with the same or similar meanings from the web logs of users, aligns the words in those sentences according to their meanings to obtain alternative words with the same or similar meanings, and finally clusters the alternative words according to their contextual information in those sentences to obtain words with the same meaning, which are added to the semantic dictionary. With the web logs of users as the data source, words with the same meaning are mined automatically and added to the semantic dictionary, so the semantic dictionary is built efficiently.
Fig. 3 and Fig. 4 show the second embodiment of the present invention.
Fig. 3 is a flowchart of the semantic dictionary construction method provided by the second embodiment of the present invention. The method is based on the first embodiment. Further, extracting sentences with the same or similar meanings from the web logs of users comprises: obtaining, from the user click log, the query sentence used in a user's query and the title of the web page the user clicked, and taking the query sentence and the web page title as sentences with the same or similar meanings.
Referring to Fig. 3, the semantic dictionary construction method comprises:
S310, obtain, from the user click log, the query sentence used in a user's query and the title of the web page the user clicked, and take the query sentence and the web page title as sentences with the same or similar meanings.
In general, the query sentence a user enters when searching with a search engine and the title of the web page behind the link the user finally clicks have the same or similar meanings. In this embodiment, therefore, the query sentence used in the user's query and the title of the clicked web page are taken as sentences with the same or similar meanings.
S320, align the words in the sentences with the same or similar meanings according to the meanings of those words, thereby obtaining alternative words with the same or similar meanings.
S330, cluster the alternative words according to their contextual information in the sentences with the same or similar meanings to obtain words with the same meaning, and add the words with the same meaning to the semantic dictionary.
Fig. 4 is a flowchart of sentence screening in the semantic dictionary construction method provided by the second embodiment of the present invention. Referring to Fig. 4, preferably, obtaining the query sentence and the clicked web page title from the user click log and taking them as sentences with the same or similar meanings comprises:
S311, calculate from the user click log the number of times users click the same web page link after searching with the same query sentence.
The user click log records the whole process from the user searching with a query sentence to the user selecting a web page link from the search results and clicking it to browse. The number of times users click the same web page link after using the same query sentence can therefore be counted from the user click log.
S312, if the number of times exceeds a frequency threshold, take the query sentence and the web page title of the web page link as sentences with the same or similar meanings.
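Steps S311 and S312 can be sketched in a few lines of Python. This is a minimal illustration, not the patent's implementation: the record layout `(query, url, title)`, the function name `mine_click_pairs` and the default threshold are assumptions introduced here.

```python
from collections import Counter

def mine_click_pairs(click_log, threshold=2):
    """Return (query, title) pairs whose click count exceeds `threshold`.

    `click_log` is an iterable of (query, clicked_url, clicked_title)
    records; this layout is a hypothetical one chosen for illustration.
    """
    counts = Counter()
    titles = {}
    for query, url, title in click_log:
        counts[(query, url)] += 1   # S311: count clicks per (query, link)
        titles[url] = title         # remember the title behind each link
    # S312: keep only the pairs clicked more often than the threshold
    return [(q, titles[u]) for (q, u), n in counts.items() if n > threshold]
```

Raising the threshold trades recall for precision: a link clicked once after a query may be an accident, while a link clicked many times after the same query is more likely to carry the same meaning as the query.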
This embodiment takes the query sentence and the web page title as sentences with the same or similar meanings, so that such sentences are mined from the user click log and the semantic dictionary is built efficiently.
Fig. 5 and Fig. 6 show the third embodiment of the present invention.
Fig. 5 is a flowchart of the semantic dictionary construction method provided by the third embodiment of the present invention. The method is based on the first embodiment. Further, extracting sentences with the same or similar meanings from the web logs of users comprises: obtaining, from the user session log, at least two query sentences used in one query session of a user, and taking the at least two query sentences as sentences with the same or similar meanings.
Referring to Fig. 5, the semantic dictionary construction method comprises:
S510, obtain, from the user session log, at least two query sentences used in one query session of a user, and take the at least two query sentences as sentences with the same or similar meanings.
When a user searches web pages with a search engine and the results for one query sentence are unsatisfactory, the user generally rephrases, that is, switches to another query sentence with the same or similar meaning and searches again. For example, if the results for the query sentence "modern architecture in Japan construction situation" are unsatisfactory, the user may search again with the query sentence "Japan High-speed Railway construction situation". If the user wants more comprehensive results, such replacement of query sentences may occur several times within one query session.
Because such replacement often occurs within a single query session of a user, at least two query sentences from one query session can be taken as sentences with the same or similar meanings.
S520, align the words in the sentences with the same or similar meanings according to the meanings of those words, thereby obtaining alternative words with the same or similar meanings.
S530, cluster the alternative words according to their contextual information in the sentences with the same or similar meanings to obtain words with the same meaning, and add the words with the same meaning to the semantic dictionary.
Fig. 6 is a flowchart of sentence screening in the semantic dictionary construction method provided by the third embodiment of the present invention. Referring to Fig. 6, preferably, obtaining at least two query sentences used in one query session from the user session log and taking them as sentences with the same or similar meanings comprises:
S511, obtain from the user session log the number of times at least two query sentences occur in succession within one query session of a user.
The user session log records the different query operations a user performs within one query session. When recording these query operations, the user session log also records the query sentence that corresponds to each of them. The number of times at least two query sentences occur in succession within one query session can therefore be obtained from the user session log.
S512, if the number of times the at least two query sentences occur in succession is greater than a frequency threshold, take the at least two query sentences as sentences with the same or similar meanings.
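Steps S511 and S512 can be illustrated as follows. This is only a sketch under stated assumptions: each session is represented as a list of query sentences in the order they were issued, only adjacent pairs are counted, and the name `mine_session_pairs` and the default threshold are hypothetical.

```python
from collections import Counter

def mine_session_pairs(sessions, threshold=1):
    """Return query pairs that occur in succession more than `threshold` times.

    `sessions` is an iterable of lists; each list holds the query sentences
    of one query session in order (an assumed representation).
    """
    counts = Counter()
    for queries in sessions:
        # S511: count each pair of consecutive, distinct query sentences
        for first, second in zip(queries, queries[1:]):
            if first != second:
                counts[(first, second)] += 1
    # S512: keep pairs whose count is greater than the threshold
    return [pair for pair, n in counts.items() if n > threshold]
```

Counting across many sessions matters here: a single rewrite inside one session may be a change of topic, while the same rewrite repeated across sessions is more plausibly a paraphrase.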
This embodiment takes at least two query sentences that occur within one query session as sentences with the same or similar meanings, so that such sentences are mined from the user session log and the semantic dictionary is built efficiently.
Fig. 7 shows the fourth embodiment of the present invention.
Fig. 7 is a flowchart of word screening in the semantic dictionary construction method provided by the fourth embodiment of the present invention. The method is based on the first embodiment. Further, aligning the words in the sentences with the same or similar meanings according to the meanings of those words, thereby obtaining alternative words with the same or similar meanings, comprises: performing text matching on the words in the sentences and aligning the words that match exactly; aligning the words in the sentences according to a preset word matching template; and/or, according to statistics on the alignment results of other sentences with the same or similar meanings, aligning the words that have a high alignment probability in the existing alignment results.
Referring to Fig. 7, aligning the words in the sentences with the same or similar meanings according to the meanings of those words, thereby obtaining alternative words with the same or similar meanings, comprises:
S121, perform text matching on the words in the sentences with the same or similar meanings, and align the words that match exactly.
Leaving polysemy aside, the same word should have the same meaning in different sentences. Text matching is therefore performed between the sentences with the same or similar meanings to find identical words, and the identical words are aligned.
S122, align the words in the sentences with the same or similar meanings according to a preset word matching template.
The words in the sentences with the same or similar meanings can also be aligned according to a predefined word matching template. The word matching template defines the rules for recognizing the words to be aligned during alignment. During matching, the words in the sentences are aligned according to the word matching template.
S123, according to statistics on the alignment results of other sentences with the same or similar meanings, align the words that have a high alignment probability in the existing alignment results.
Besides text matching and word matching templates, the words in the sentences with the same or similar meanings can also be aligned according to statistical information about existing alignment results. Specifically, words that have a high alignment probability in the existing alignment results can be aligned.
The above is one preferred implementation of text alignment. In practice, text alignment may also be performed by text matching together with templates, by templates together with statistics, by text matching alone, by templates alone, or by statistics alone.
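The following sketch illustrates two of the three strategies: exact text matching (S121) and alignment by statistics (S123). The template step (S122) is omitted because the text does not specify the template format. The function name `align_words`, the `align_prob` table keyed by word pairs and the `min_prob` cut-off are assumptions made for illustration only.

```python
def align_words(sent_a, sent_b, align_prob=None, min_prob=0.5):
    """Align the words of two tokenized sentences; returns (word_a, word_b) pairs.

    `align_prob` maps (word_a, word_b) to an alignment probability estimated
    from earlier alignment results (a hypothetical table).
    """
    align_prob = align_prob or {}
    pairs, used_a, used_b = [], set(), set()

    # S121: align words whose text matches exactly.
    for i, wa in enumerate(sent_a):
        for j, wb in enumerate(sent_b):
            if j not in used_b and wa == wb:
                pairs.append((wa, wb))
                used_a.add(i)
                used_b.add(j)
                break

    # S123: for each remaining word, pick the unused partner with the
    # highest alignment probability, provided it exceeds `min_prob`.
    for i, wa in enumerate(sent_a):
        if i in used_a:
            continue
        best_j, best_p = None, min_prob
        for j, wb in enumerate(sent_b):
            p = align_prob.get((wa, wb), 0.0)
            if j not in used_b and p > best_p:
                best_j, best_p = j, p
        if best_j is not None:
            pairs.append((wa, sent_b[best_j]))
            used_a.add(i)
            used_b.add(best_j)
    return pairs
```

In this sketch, the aligned pairs whose two words differ, such as ("make", "cook"), are the ones that would be passed on as alternative words; identical pairs carry no new synonym information.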
This embodiment aligns the words in the sentences with the same or similar meanings by performing text matching and aligning the words that match exactly, by aligning words according to a preset word matching template, and/or by aligning the words that have a high alignment probability in the existing alignment results of other sentences with the same or similar meanings.
Fig. 8 to Fig. 10 show the fifth embodiment of the present invention.
Fig. 8 is a flowchart of the semantic dictionary construction method provided by the fifth embodiment of the present invention. The method is based on the first embodiment. Further, clustering the alternative words according to their contextual information in the sentences with the same or similar meanings to obtain words with the same meaning comprises: clustering the alternative words according to their context in the sentences with the same or similar meanings, and taking the alternative words that fall into the same class after clustering as words with the same meaning.
Referring to Fig. 8, the semantic dictionary construction method comprises:
S810, extract sentences with the same or similar meanings from the web logs of users.
S820, align the words in the sentences with the same or similar meanings according to the meanings of those words, thereby obtaining alternative words with the same or similar meanings.
S830, cluster the alternative words according to their context in the sentences with the same or similar meanings, and take the alternative words that fall into the same class after clustering as words with the same meaning.
The present invention uses the contextual information of words to identify whether they have the same meaning. In this embodiment, the context of a word in the sentences with the same or similar meanings serves as the contextual information of that word. Once the contexts of the alternative words are obtained, the alternative words are clustered according to those contexts, and the alternative words that fall into the same class after clustering are taken as words with the same meaning.
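As an illustration of what "context" could look like in practice, the sketch below collects the left and right neighbours of each alternative word. The text does not say which context attributes are used, so the two-neighbour window, the padding token and the name `extract_contexts` are all assumptions.

```python
def extract_contexts(sentences, candidates, pad="<s>"):
    """Collect (left neighbour, right neighbour) contexts per alternative word.

    `sentences` is a list of tokenized sentences and `candidates` is the set
    of alternative words; the two-word window is an assumed choice.
    """
    contexts = {}
    for tokens in sentences:
        padded = [pad] + list(tokens) + [pad]   # pad so edge words have neighbours
        for i in range(1, len(padded) - 1):
            word = padded[i]
            if word in candidates:
                contexts.setdefault(word, []).append((padded[i - 1], padded[i + 1]))
    return contexts
```

Two alternative words that keep appearing between the same neighbours end up with matching context lists, which is the signal the clustering step relies on.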
Fig. 9 is a schematic diagram of word clustering in the semantic dictionary construction method provided by the fifth embodiment of the present invention. Referring to Fig. 9, for word clustering, each alternative word is mapped into the alternative word space 900 according to its own context attributes. Each node 901 in the alternative word space 900 represents one alternative word. After the mapping is complete, a clustering algorithm is run in the alternative word space 900 to cluster the alternative words into different categories 910.
Fig. 10 is a flowchart of word clustering in the semantic dictionary construction method provided by the fifth embodiment of the present invention. Referring to Fig. 10, preferably, clustering the alternative words according to their context in the sentences with the same or similar meanings, and taking the alternative words that fall into the same class after clustering as words with the same meaning, comprises:
S831, project the alternative words into the alternative word space according to their context.
The alternative word space is a space with n dimensions, and each alternative word can be represented by a point in it. Each alternative word has a context vector made up of n context attributes extracted from the web logs. Mapping these n context attributes to the n dimensions of the alternative word space projects each alternative word to a unique point in that space.
To make distances in the alternative word space easy to compute, the different values of the context attributes are indexed. For example, the context attribute value "I" is given index number 1, and the value "we" is given index number 2.
S832: randomly specify k center points in the alternative word space.
In this embodiment, the alternative words in the alternative word space are clustered with reference to the k randomly specified center points. The positive integer k is the number of distinct meanings in the semantic dictionary to be built. For example, if k is set to 1000, the resulting semantic dictionary contains 1000 distinct meanings. It should be understood that once a center point is selected, the context attribute values in its context vector are determined. Because the context attribute values have been indexed, the corresponding index numbers are determined as well.
S833: compute the distance between each alternative word and each of the k center points according to the context attributes of the alternative word, and assign each alternative word to the category represented by its nearest center point in the alternative word space.
Preferably, the distance between an alternative word and a center point can be computed from the differences between the index numbers of their context attribute values.
S834: recompute the center point of each category so that the sum of the distances between the new center point and all alternative words in the category is minimized.
S835: determine whether any center point has changed. If a center point has changed, perform S833 again; if the center points are unchanged, the clustering of the alternative words is complete.
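Steps S832 to S835 can be sketched in Python as follows. This is an illustration under stated assumptions, not the patented implementation: the distance is taken to be the sum of absolute differences between index numbers, the recomputed center point is restricted to a member of the category (a medoid), and the names `distance`, `cluster`, `seed`, and `max_iter` are hypothetical.

```python
import random


def distance(a, b):
    # Assumption: sum of absolute differences between index numbers.
    return sum(abs(x - y) for x, y in zip(a, b))


def cluster(points, k, seed=0, max_iter=100):
    """Cluster points (tuples of index numbers) around k center points."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # S832: pick k center points at random
    labels = []
    for _ in range(max_iter):
        # S833: assign every point to its nearest center point.
        labels = [
            min(range(k), key=lambda c: distance(p, centers[c]))
            for p in points
        ]
        # S834: recompute each center as the member that minimizes the
        # sum of distances to all members of its category.
        new_centers = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if not members:
                new_centers.append(centers[c])  # keep an empty category's center
                continue
            new_centers.append(
                min(members, key=lambda m: sum(distance(m, q) for q in members))
            )
        # S835: stop once no center point has changed.
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Hypothetical points: two well-separated groups of alternative words.
sample = [(1, 1), (1, 2), (2, 1), (10, 10), (10, 11), (11, 10)]
sample_labels, sample_centers = cluster(sample, k=2)
```

Restricting the center to an existing member keeps the sketch short; an unrestricted center that minimizes the same sum would be an equally valid reading of S834.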
After clustering, the alternative words are grouped into different categories. Alternative words in the same category are words with identical semantics. The words in each category are therefore added to the semantic dictionary as words with identical semantics. Table 1 shows words with identical semantics that were clustered into the same category:
Table 1
Category 1 | Category 2 | Category 3 | Category 4 | Category 5
Refining | Customization | Regard | Burn | Take
Manufacture | Tailor-made | Make | Boil in water for a while, then dress with soy, vinegar, etc. | Sit
Make | Make | Be used as | Stew in shallow water | Take
Make | - | - | Fry | Catch up with
Forging | - | - | Fire | Take
- | - | - | Baked | -
- | - | - | Cook | -
As Table 1 shows, the alternative words clustered into the same category have identical semantics and can be added to the semantic dictionary as words with identical semantics.
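Adding clustered words to the semantic dictionary can be sketched as below. The dictionary layout (each word mapped to the set of its synonyms), the function name `add_clusters_to_dictionary`, and the sample labels are assumptions for illustration; the specification does not prescribe a storage format.

```python
from collections import defaultdict


def add_clusters_to_dictionary(words, labels, dictionary=None):
    """Record every word's category mates as its synonyms.

    Assumption: the semantic dictionary maps a word to a set of synonyms.
    """
    dictionary = {} if dictionary is None else dictionary
    groups = defaultdict(list)
    for word, label in zip(words, labels):
        groups[label].append(word)
    for members in groups.values():
        for word in members:
            synonyms = dictionary.setdefault(word, set())
            synonyms.update(m for m in members if m != word)
    return dictionary


# Hypothetical clustering result using words from Table 1.
table_words = ["refining", "manufacture", "customization", "tailor-made"]
table_labels = [0, 0, 1, 1]
semantic_dictionary = add_clusters_to_dictionary(table_words, table_labels)
```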
In this embodiment, the alternative words are clustered according to their contexts in the sentences with identical or similar semantics, and the alternative words grouped together after clustering are taken as words with identical semantics. Words with identical semantics are thus identified by clustering the alternative words, which improves the efficiency of building the semantic dictionary.
Figure 11 shows the sixth embodiment of the present invention.
Figure 11 is a structural diagram of the semantic dictionary construction device provided by the sixth embodiment of the present invention. Referring to Figure 11, the semantic dictionary construction device comprises a statement screening module 1110, a word screening module 1120, and a word clustering module 1130.
The statement screening module 1110 is configured to extract sentences with identical or similar semantics from the web logs of users.
The word screening module 1120 is configured to align the words in the sentences with identical or similar semantics according to the semantics of those words, thereby obtaining alternative words with identical or similar semantics.
The word clustering module 1130 is configured to cluster the alternative words according to their context information in the sentences with identical or similar semantics to obtain words with identical semantics, and to add the words with identical semantics to the semantic dictionary.
Preferably, the statement screening module 1110 comprises a first statement screening unit 1111 or a second statement screening unit 1112.
The first statement screening unit 1111 is configured to obtain, from user click logs, the search queries used during user queries and the titles of the web pages that were clicked, and to take each search query and the corresponding web page title as sentences with identical or similar semantics.
The second statement screening unit 1112 is configured to obtain, from user session logs, at least two search queries used within a single user query session, and to take the at least two search queries as sentences with identical or similar semantics.
Preferably, the first statement screening unit 1111 is specifically configured to:
compute, from the user click logs, the number of times users clicked the same web page link after searching with the same search query; and
if that number exceeds a count threshold, take the search query and the web page title of that web page link as sentences with identical or similar semantics.
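The click-count filter of the first statement screening unit can be sketched as follows. The record layout `(query, url, title)`, the function name `pairs_from_click_log`, and the sample data are assumptions; real click logs would need parsing into this form first.

```python
from collections import Counter


def pairs_from_click_log(click_log, threshold):
    """Return (query, page title) pairs whose click count exceeds threshold.

    Assumption: each log record is a (query, url, title) tuple.
    """
    counts = Counter((query, url) for query, url, _title in click_log)
    titles = {(query, url): title for query, url, title in click_log}
    return [
        (query, titles[(query, url)])
        for (query, url), n in counts.items()
        if n > threshold
    ]


# Hypothetical click log: three clicks on one link, one click on another.
log = [
    ("how to make cake", "http://example.com/a", "Cake baking method"),
    ("how to make cake", "http://example.com/a", "Cake baking method"),
    ("how to make cake", "http://example.com/a", "Cake baking method"),
    ("how to make cake", "http://example.com/b", "Bread recipes"),
]
click_pairs = pairs_from_click_log(log, threshold=2)
```

Only the query and title that were paired more than `threshold` times survive, which is intended to filter out incidental clicks.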
Preferably, the second statement screening unit 1112 is specifically configured to:
obtain, from the user session logs, the number of times at least two search queries occur in succession within a single user query session; and
if the number of times the at least two search queries occur in succession is greater than a count threshold, take the at least two search queries as sentences with identical or similar semantics.
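The session-based filter of the second statement screening unit can be sketched in the same style. Representing each session as an ordered list of queries, counting only adjacent query pairs, and the name `pairs_from_session_log` are assumptions made for illustration.

```python
from collections import Counter


def pairs_from_session_log(sessions, threshold):
    """Return query pairs that occur in succession more than threshold times.

    Assumption: each session is an ordered list of the queries a user issued.
    """
    counts = Counter()
    for queries in sessions:
        for first, second in zip(queries, queries[1:]):
            if first != second:  # ignore a query repeated verbatim
                counts[(first, second)] += 1
    return [pair for pair, n in counts.items() if n > threshold]


# Hypothetical session log with one frequently repeated reformulation.
session_log = [
    ["custom suit", "tailor-made suit"],
    ["custom suit", "tailor-made suit", "suit price"],
    ["weather", "custom suit"],
]
session_pairs = pairs_from_session_log(session_log, threshold=1)
```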
Preferably, the word screening module 1120 comprises a matching alignment unit 1121, a template alignment unit 1122, and/or a statistics alignment unit 1123.
The matching alignment unit 1121 is configured to perform text matching on the words in the sentences with identical or similar semantics and to align the words that match exactly.
The template alignment unit 1122 is configured to align the words in the sentences with identical or similar semantics according to preset word matching templates.
The statistics alignment unit 1123 is configured to align words that have a high alignment probability in existing alignment results, according to statistics on the alignment results of other sentences with identical or similar semantics.
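The exact-match alignment performed by the matching alignment unit can be illustrated with Python's standard `difflib` module. This is a sketch under assumptions: sentences are already segmented into word lists, exactly matching words serve as anchors, and a single unmatched word lying between anchors on each side is reported as a pair of alternative words. The function name `align_exact` is hypothetical, and the template-based and statistics-based units are not covered.

```python
from difflib import SequenceMatcher


def align_exact(words_a, words_b):
    """Align two word lists on exact matches and return alternative word pairs.

    Assumption: a one-for-one substitution between matching anchors is
    treated as a pair of alternative words.
    """
    matcher = SequenceMatcher(None, words_a, words_b, autojunk=False)
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "replace" and i2 - i1 == 1 and j2 - j1 == 1:
            pairs.append((words_a[i1], words_b[j1]))
    return pairs


# Hypothetical pair of sentences with identical or similar semantics.
sentence_a = ["how", "to", "make", "cake"]
sentence_b = ["how", "to", "cook", "cake"]
alternative_pairs = align_exact(sentence_a, sentence_b)
```

Here "how", "to", and "cake" match exactly, so "make" and "cook" are left as the pair of alternative words.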
Preferably, the word clustering module 1130 comprises a context clustering unit 1131.
The context clustering unit 1131 is configured to cluster the alternative words according to their contexts in the sentences with identical or similar semantics, and to take the alternative words grouped together after clustering as words with identical semantics.
The serial numbers of the above embodiments are for description only and do not indicate the relative merits of the embodiments.
Those of ordinary skill in the art should understand that the modules or steps of the present invention described above can be implemented with general-purpose computing devices. They can be concentrated on a single computing device or distributed over a network formed by multiple computing devices. Optionally, they can be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by a computing device; alternatively, they can each be made into a separate integrated circuit module, or several of the modules or steps can be made into a single integrated circuit module. The present invention is therefore not limited to any specific combination of hardware and software.
The embodiments in this specification are described progressively. Each embodiment focuses on its differences from the other embodiments, and the same or similar parts of the embodiments may be cross-referenced.
The foregoing describes only the preferred embodiments of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and changes. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (12)

1. A semantic dictionary construction method, characterized by comprising:
extracting sentences with identical or similar semantics from the web logs of users;
aligning the words in the sentences with identical or similar semantics according to the semantics of those words, thereby obtaining alternative words with identical or similar semantics; and
clustering the alternative words according to their context information in the sentences with identical or similar semantics to obtain words with identical semantics, and adding the words with identical semantics to a semantic dictionary.
2. The method according to claim 1, characterized in that extracting sentences with identical or similar semantics from the web logs of users comprises:
obtaining, from user click logs, the search queries used during user queries and the titles of the web pages that were clicked, and taking each search query and the corresponding web page title as sentences with identical or similar semantics; or
obtaining, from user session logs, at least two search queries used within a single user query session, and taking the at least two search queries as sentences with identical or similar semantics.
3. The method according to claim 2, characterized in that obtaining, from user click logs, the search queries used during user queries and the titles of the web pages that were clicked, and taking each search query and the corresponding web page title as sentences with identical or similar semantics comprises:
computing, from the user click logs, the number of times users clicked the same web page link after searching with the same search query; and
if that number exceeds a count threshold, taking the search query and the web page title of that web page link as sentences with identical or similar semantics.
4. The method according to claim 2, characterized in that obtaining, from user session logs, at least two search queries used within a single user query session, and taking the at least two search queries as sentences with identical or similar semantics comprises:
obtaining, from the user session logs, the number of times at least two search queries occur in succession within a single user query session; and
if the number of times the at least two search queries occur in succession is greater than a count threshold, taking the at least two search queries as sentences with identical or similar semantics.
5. The method according to claim 1, characterized in that aligning the words in the sentences with identical or similar semantics according to the semantics of those words, thereby obtaining alternative words with identical or similar semantics, comprises:
performing text matching on the words in the sentences with identical or similar semantics, and aligning the words that match exactly;
aligning the words in the sentences with identical or similar semantics according to preset word matching templates; and/or
aligning words that have a high alignment probability in existing alignment results, according to statistics on the alignment results of other sentences with identical or similar semantics.
6. The method according to claim 1, characterized in that clustering the alternative words according to their context information in the sentences with identical or similar semantics to obtain words with identical semantics comprises:
clustering the alternative words according to their contexts in the sentences with identical or similar semantics, and taking the alternative words grouped together after clustering as words with identical semantics.
7. A semantic dictionary construction device, characterized by comprising:
a statement screening module, configured to extract sentences with identical or similar semantics from the web logs of users;
a word screening module, configured to align the words in the sentences with identical or similar semantics according to the semantics of those words, thereby obtaining alternative words with identical or similar semantics; and
a word clustering module, configured to cluster the alternative words according to their context information in the sentences with identical or similar semantics to obtain words with identical semantics, and to add the words with identical semantics to a semantic dictionary.
8. The device according to claim 7, characterized in that the statement screening module comprises:
a first statement screening unit, configured to obtain, from user click logs, the search queries used during user queries and the titles of the web pages that were clicked, and to take each search query and the corresponding web page title as sentences with identical or similar semantics; or
a second statement screening unit, configured to obtain, from user session logs, at least two search queries used within a single user query session, and to take the at least two search queries as sentences with identical or similar semantics.
9. The device according to claim 8, characterized in that the first statement screening unit is specifically configured to:
compute, from the user click logs, the number of times users clicked the same web page link after searching with the same search query; and
if that number exceeds a count threshold, take the search query and the web page title of that web page link as sentences with identical or similar semantics.
10. The device according to claim 8, characterized in that the second statement screening unit is specifically configured to:
obtain, from the user session logs, the number of times at least two search queries occur in succession within a single user query session; and
if the number of times the at least two search queries occur in succession is greater than a count threshold, take the at least two search queries as sentences with identical or similar semantics.
11. The device according to claim 7, characterized in that the word screening module comprises:
a matching alignment unit, configured to perform text matching on the words in the sentences with identical or similar semantics and to align the words that match exactly;
a template alignment unit, configured to align the words in the sentences with identical or similar semantics according to preset word matching templates; and/or
a statistics alignment unit, configured to align words that have a high alignment probability in existing alignment results, according to statistics on the alignment results of other sentences with identical or similar semantics.
12. The device according to claim 7, characterized in that the word clustering module comprises:
a context clustering unit, configured to cluster the alternative words according to their contexts in the sentences with identical or similar semantics, and to take the alternative words grouped together after clustering as words with identical semantics.
CN201410521385.6A 2014-09-30 2014-09-30 Semantic dictionary construction method and device Active CN104281565B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410521385.6A CN104281565B (en) 2014-09-30 2014-09-30 Semantic dictionary construction method and device

Publications (2)

Publication Number Publication Date
CN104281565A true CN104281565A (en) 2015-01-14
CN104281565B CN104281565B (en) 2017-09-05

Family

ID=52256450

Country Status (1)

Country Link
CN (1) CN104281565B (en)

Also Published As

Publication number Publication date
CN104281565B (en) 2017-09-05

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant