WO2017197947A1 - Method and apparatus for determining an antecedent (先行词的确定方法和装置) - Google Patents

Method and apparatus for determining an antecedent

Publication number
WO2017197947A1
Authority
WO
WIPO (PCT)
Prior art keywords
candidate
pronoun
antecedent
word
antecedents
Prior art date
Application number
PCT/CN2017/074800
Other languages
English (en)
French (fr)
Inventor
杨月奎
陈雨杰
赵琳
黄玉兰
刘莉
王迪
Original Assignee
腾讯科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 腾讯科技(深圳)有限公司
Priority to EP17798514.0A (EP3460678A4)
Priority to JP2018529148A (JP6752282B2)
Priority to KR1020187015847A (KR102163549B1)
Publication of WO2017197947A1
Priority to US16/009,474 (US10810372B2)

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/268 Morphological analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/211 Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • The present invention relates to the field of information processing, and in particular to a method and apparatus for determining an antecedent.
  • In man-machine dialogue, the machine needs to accurately understand the context information in a statement. If it cannot, the dialogue information becomes ambiguous, and unresolved references are the main cause of that ambiguity.
  • Anaphora resolution (rendered as "referential digestion" in the machine translation) is the problem of determining which noun phrase a pronoun points to in a discourse.
  • Several kinds of anaphora resolution algorithms exist: (1) searching from left to right and traversing the syntax tree level by level; this algorithm needs to traverse all of the information to be identified, so the traversal workload is large; (2) adding semantic constraints on top of syntactic knowledge; this method is effective for English pronouns, but Chinese vocabulary is difficult to handle.
  • The embodiments of the invention provide a method and an apparatus for determining an antecedent, to solve at least the technical problem of low processing efficiency in anaphora resolution.
  • According to one aspect, a method for determining an antecedent is provided, comprising: obtaining statement information to be recognized; when a pronoun is identified in the statement information, extracting from the statement information a plurality of candidate antecedents and word features of the plurality of candidate antecedents; and determining the target antecedent referred to by the pronoun from the plurality of candidate antecedents based on the word features of the plurality of candidate antecedents.
  • According to another aspect, an apparatus for determining an antecedent is provided, comprising: an obtaining unit, configured to acquire statement information to be recognized; an extracting unit, configured to extract, when a pronoun is identified in the statement information, a plurality of candidate antecedents and word features of the plurality of candidate antecedents from the statement information; and a determining unit, configured to determine the target antecedent referred to by the pronoun from the plurality of candidate antecedents based on the word features of the plurality of candidate antecedents.
  • In the embodiments, the candidate antecedents and the word features of each candidate antecedent are extracted from the statement information, and the target antecedent of the pronoun is determined using the word features of the candidate antecedents.
  • In this way, the target antecedent referred to by the pronoun can be located automatically from the word features of the candidate antecedents extracted from the statement information, which solves the prior-art problem of low processing efficiency in anaphora resolution and achieves accurate and efficient determination of the pronoun's antecedent.
  • FIG. 1 is a schematic diagram of a network environment of an optional method for determining an antecedent according to an embodiment of the present invention.
  • FIG. 2 is a first flowchart of a method for determining an antecedent according to an embodiment of the present invention.
  • FIG. 3 is a second flowchart of a method for determining an antecedent according to an embodiment of the present invention.
  • FIG. 4 is a third flowchart of a method for determining an antecedent according to an embodiment of the present invention.
  • FIG. 5 is a first schematic diagram of an apparatus for determining an antecedent according to an embodiment of the present invention.
  • FIG. 6 is a second schematic diagram of an apparatus for determining an antecedent according to an embodiment of the present invention.
  • FIG. 7 is a third schematic diagram of an apparatus for determining an antecedent according to an embodiment of the present invention.
  • FIG. 8 is a fourth schematic diagram of an apparatus for determining an antecedent according to an embodiment of the present invention.
  • FIG. 9 is a block diagram showing the internal structure of a server according to an embodiment of the present invention.
  • Antecedent: a phrase that is semantically related to the current pronoun, that is, the word or phrase referred to by the pronoun.
  • Session: the session set (collection).
  • Predicate: a term used to describe or assert an individual's properties or features, or the relationships between individuals.
  • Predicates typically include verbs and adjectives.
  • According to an embodiment of the present invention, an embodiment of a method for determining an antecedent is provided. It should be noted that the steps illustrated in the flowcharts of the accompanying drawings may be performed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order from the one described herein.
  • The above information processing method is applied to the network environment shown in FIG. 1.
  • The network environment includes a terminal 101 and a server 103 (which may be a server or a cloud platform of a network connection application). The terminal may establish a connection with the server through the network, and a processor may be provided on both the terminal and the server.
  • The above networks include, but are not limited to, a wide area network, a metropolitan area network, or a local area network.
  • The terminal may be a terminal having an input device, such as a mobile terminal (for example, a mobile phone or a tablet), and an intelligent conversation client may be installed on the terminal.
  • The server corresponds to the intelligent conversation client, and the server can be used to process information sent by the terminal through the intelligent conversation client.
  • FIG. 2 is a flowchart of a method for determining an antecedent in accordance with an embodiment of the present invention. As shown in FIG. 2, the method may include the following steps:
  • Step S202: Acquire statement information to be identified;
  • Step S204: When a pronoun exists in the statement information, extract from the statement information a plurality of candidate antecedents and word features of the plurality of candidate antecedents;
  • Step S206: Determine the target antecedent referred to by the pronoun from the plurality of candidate antecedents based on the word features of the plurality of candidate antecedents.
  • With the above embodiment, the candidate antecedents and the word features of each candidate antecedent are extracted from the statement information, and the target antecedent referred to by the pronoun is determined from the word features of the candidate antecedents.
  • In this way, the target antecedent referred to by the pronoun can be located automatically from the word features of the candidate antecedents extracted from the statement information, which solves the prior-art problem of low processing efficiency in anaphora resolution and achieves accurate and efficient determination of the pronoun's antecedent.
  • The statement information to be identified in the foregoing embodiment may be sent by the terminal 101 to the server. The statement information may be text information; the text may be obtained by converting voice information in the session information, may be text extracted directly from the session information, or may be information extracted from an article. The source of the information is not limited in this application.
  • The statement information is a set of session information generated by a client and a server during a man-machine conversation.
  • The candidate antecedents and the word features of each candidate antecedent may be extracted from the statement information in sequence, or the word features may be extracted at the same time as the candidate antecedents are extracted from the statement information.
  • The words referred to by pronouns may be nouns or noun phrases, so the candidate antecedents extracted are nouns or noun phrases.
  • The statement information may be segmented by a word segmenter, and from the words obtained by segmentation, those whose part of speech is pronoun (i.e., the pronouns) and those whose part of speech is noun or noun phrase (i.e., the candidate antecedents) are extracted.
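  • The extraction by part of speech described above can be sketched as follows. This is a minimal illustration, assuming the word segmenter has already produced (word, tag) pairs; the tag names `r`, `n`, and `np` and the function name are hypothetical and not taken from the patent.

```python
from typing import List, Tuple

# Hypothetical part-of-speech tags; a real word segmenter/tagger supplies its own tag set.
PRONOUN_TAG = "r"
NOUN_TAGS = {"n", "np"}


def extract_pronouns_and_candidates(tagged: List[Tuple[str, str]]):
    """Split (word, tag) pairs into pronouns and candidate antecedents.

    Token positions are kept so that later steps can measure distances.
    """
    pronouns = [(i, w) for i, (w, t) in enumerate(tagged) if t == PRONOUN_TAG]
    candidates = [(i, w) for i, (w, t) in enumerate(tagged) if t in NOUN_TAGS]
    return pronouns, candidates


# Illustrative, already-segmented input (assumed, not from the patent).
tagged_tokens = [("police", "n"), ("found", "v"), ("thief", "n"),
                 ("escaped", "v"), ("he", "r")]
pronouns, candidates = extract_pronouns_and_candidates(tagged_tokens)
```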
  • The target antecedent referred to by the pronoun may be determined from the plurality of candidate antecedents based on the word features of the plurality of candidate antecedents, where the word features may include semantic features and grammatical features.
  • Communication is established between the intelligent conversation client and the server, and the session information is sent to the server through the intelligent conversation client over that connection. After the server receives the session information:
  • if the session information is text information, the session information is used as the statement information;
  • if the session information is voice information, the voice information is converted into text information, and the converted text information is used as the statement information.
  • The server then identifies the statement information. If a pronoun is found in the statement information, the session set generated during the session (i.e., the above statement information) is obtained, multiple candidate antecedents and the word features of each candidate antecedent are extracted from the statement information, and the word features are used to determine the target antecedent of the pronoun.
  • The pronoun in the statement information may then be replaced with the target antecedent to complete the statement information.
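  • Replacing the pronoun with the target antecedent can be sketched as below. The token-list representation and the function name `complete_statement` are assumptions made for illustration only.

```python
from typing import List


def complete_statement(tokens: List[str], pronoun_index: int,
                       target_antecedent: str) -> List[str]:
    """Return a copy of the tokens with the pronoun replaced by its antecedent."""
    completed = list(tokens)  # copy so the original statement is left untouched
    completed[pronoun_index] = target_antecedent
    return completed


# Illustrative statement (assumed): the pronoun "it" at index 3 refers to "the phone".
question = ["how", "much", "is", "it"]
completed_question = complete_statement(question, 3, "the phone")
```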
  • Determining the target antecedent referred to by the pronoun from the plurality of candidate antecedents based on their word features may include: determining a referential weight value of each candidate antecedent based on the word features of that candidate antecedent; and selecting the candidate antecedent with the largest referential weight value as the target antecedent of the pronoun.
  • The word features in the foregoing embodiment may be semantic features or grammatical features. The semantic and/or grammatical features are used to determine a referential weight value of each candidate antecedent relative to the pronoun, and the obtained referential weight values are sorted to obtain a sequence of weight values. If the sequence is arranged from largest to smallest, the candidate antecedent corresponding to the first weight value in the sequence is used as the target antecedent of the pronoun; if the sequence is arranged from smallest to largest, the candidate antecedent corresponding to the last weight value in the sequence is used as the target antecedent of the pronoun.
  • Alternatively, the largest of the referential weight values may be obtained by pairwise comparison.
  • The candidate antecedent corresponding to the largest referential weight value is selected as the target antecedent referred to by the pronoun.
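  • Both selection strategies described above (sorting the weights, or pairwise comparison) can be sketched as follows. The dictionary from candidate word to referential weight value and the example weights are hypothetical.

```python
from typing import Dict


def select_by_sorting(weights: Dict[str, float]) -> str:
    """Sort from largest to smallest and take the first entry."""
    ranked = sorted(weights.items(), key=lambda item: item[1], reverse=True)
    return ranked[0][0]


def select_by_pairwise_comparison(weights: Dict[str, float]) -> str:
    """Keep a running maximum by comparing weights two at a time."""
    best_word, best_weight = None, float("-inf")
    for word, weight in weights.items():
        if weight > best_weight:
            best_word, best_weight = word, weight
    return best_word


# Illustrative referential weight values (assumed).
example_weights = {"police": 0.4, "thief": 0.9, "prison": 0.2}
target_sorted = select_by_sorting(example_weights)
target_pairwise = select_by_pairwise_comparison(example_weights)
```

  • The two functions return the same candidate whenever the maximum is unique; pairwise comparison avoids the cost of a full sort.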
  • Each of the plurality of candidate antecedents includes one or more word features. Where each candidate antecedent includes a single word feature, the word feature of each candidate antecedent is converted into a feature value, and the feature value is used as the referential weight value of that candidate antecedent.
  • Where each candidate antecedent includes one or more word features, determining the referential weight value of each candidate antecedent based on its word features includes: converting the extracted word features into feature values; and performing a linear weighting calculation on the feature values of each candidate antecedent, using preset feature coefficients of the one or more word features, to obtain the referential weight value of each candidate antecedent.
  • Where each candidate antecedent includes multiple word features, each word feature of each candidate antecedent is separately converted into a feature value, and a linear weighting calculation is performed on the resulting feature values, using the preset feature coefficients of the word features, to obtain the referential weight value of each candidate antecedent.
  • For example, if there are two word features whose feature values are $t_1$ and $t_2$, the preset feature coefficients $\lambda_1$ and $\lambda_2$ of the two word features are acquired, and the linear weighting calculation is performed on the two feature values: $\mathrm{Weight} = \lambda_1 \times t_1 + \lambda_2 \times t_2$.
  • The feature coefficients may be given initial values based on experience, and their sizes may then be adjusted by training on a corpus.
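  • The linear weighting of feature values can be sketched as follows. The function name and the example coefficient values are assumptions; the patent only states that coefficients are preset and may be tuned on a training corpus.

```python
from typing import Sequence


def referential_weight(feature_values: Sequence[float],
                       coefficients: Sequence[float]) -> float:
    """Linear weighting: sum of coefficient * feature value over all word features."""
    if len(feature_values) != len(coefficients):
        raise ValueError("one coefficient is required per feature value")
    return sum(c * t for c, t in zip(coefficients, feature_values))


# Two word features t1, t2 with hypothetical preset coefficients.
t1, t2 = 1.0, 2.0
coefficient_1, coefficient_2 = 0.5, 0.25
weight = referential_weight([t1, t2], [coefficient_1, coefficient_2])
```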
  • Each candidate antecedent of the plurality of candidate antecedents includes one or more word features, and the word features include at least one of: the singular/plural feature of the candidate antecedent, the distance between the candidate antecedent and the pronoun, whether the candidate antecedent appears in a prepositional phrase, and the semantic relevance of the pronoun and the candidate antecedent.
  • Where the word features include the singular/plural feature of the candidate antecedent: a singular pronoun cannot refer to a plural antecedent, so number agreement is an important feature for judging whether two words have a referential relationship. For example, in "The weather is very good today; my classmates and I are ready to go out for a walk," the pronoun "I" is singular while "classmates" is plural, and a singular pronoun cannot refer to a plural antecedent.
  • Whether the number of the candidate antecedent agrees with the number of the pronoun can be used to convert the singular/plural feature into a feature value: if the two agree, the feature value is set to a first constant; if they do not agree, the feature value is set to a second constant.
  • For example, the first constant may be 1 and the second constant may be 0.
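  • The conversion of the singular/plural feature into a feature value can be sketched as below. How number is detected for each word is outside this sketch; the boolean inputs and the function name are assumptions.

```python
def number_agreement_value(candidate_is_plural: bool, pronoun_is_plural: bool,
                           first_constant: int = 1, second_constant: int = 0) -> int:
    """Return the first constant when number agrees, otherwise the second constant."""
    return first_constant if candidate_is_plural == pronoun_is_plural else second_constant


# "classmates" (plural) against the singular pronoun "I": no agreement.
classmates_vs_i = number_agreement_value(candidate_is_plural=True, pronoun_is_plural=False)
# A singular noun against a singular pronoun: agreement.
singular_vs_singular = number_agreement_value(candidate_is_plural=False, pronoun_is_plural=False)
```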
  • The distance between the candidate antecedent and the pronoun in the above embodiment generally refers to the distance between the sentences or paragraphs in which the two words are located, and may also refer to the number of characters between the two words. In a multi-round conversation, a complete piece of information may need to be expressed over multiple sentences. The closer the sentence containing the candidate antecedent is to the sentence containing the pronoun, the greater the correlation, so the distance between the pronoun and the antecedent is also a significant feature.
  • Where the word features include the distance between the candidate antecedent and the pronoun, the distance between the sentences in which the two words are located (the number of intervening statements), or the number of characters between the two words, may be used as the feature value.
  • Nouns in direct-object and indirect-object positions show no significant difference in the probability of being referred to, whereas nouns in prepositional phrases are referred to with lower probability. Therefore, in the embodiment of the present invention, whether the candidate antecedent appears in a prepositional phrase can be used as a word feature.
  • If the candidate antecedent appears in a prepositional phrase, the feature value may be set to one constant, such as 1; if the candidate antecedent does not appear in a prepositional phrase, the feature value may be set to another constant, such as 0.
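  • The distance and prepositional-phrase features described above can be turned into feature values along these lines. Sentence indices and the prepositional-phrase flag are assumed to come from earlier segmentation and parsing; the function names are hypothetical.

```python
def sentence_distance(candidate_sentence_index: int, pronoun_sentence_index: int) -> int:
    """Number of statements separating the candidate antecedent from the pronoun."""
    return abs(pronoun_sentence_index - candidate_sentence_index)


def prepositional_phrase_value(in_prepositional_phrase: bool) -> int:
    """1 if the candidate appears in a prepositional phrase, otherwise 0."""
    return 1 if in_prepositional_phrase else 0


# Candidate in the first statement of the session, pronoun in the third.
distance_value = sentence_distance(0, 2)
pp_value = prepositional_phrase_value(True)
```

  • Because a larger distance and membership in a prepositional phrase both make reference less likely, a linear weighting that selects the maximum would presumably pair these values with negative coefficients; the patent text does not state the signs, so that is an assumption.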
  • The relevance of the semantic dependent words may also be used as a word feature (i.e., the semantic relevance of the pronoun and the candidate antecedent in the above embodiment). For example, the statement information is "The police found that the thief had broken out of jail and increased his penalty." Here the candidate antecedent "thief" and the pronoun "he" depend on "jailbreak" and "penalty" respectively. These two semantic dependent words are strongly correlated, so the degree of correlation between the dependent words of the pronoun and those of the candidate antecedent can help determine the referential relationship.
  • The semantic relevance of the pronoun and the candidate antecedent can be determined based on the correlation between the semantic dependent words of the two words.
  • Where P is the pronoun to be resolved, A is a candidate antecedent, (Px_1, Px_2, ..., Px_i) are the dependent words of the pronoun, (Ax_1, Ax_2, ..., Ax_j) are the dependent words of the candidate antecedent, i and j are natural numbers, i is the number of dependent words of the pronoun, and j is the number of dependent words of the candidate antecedent, the semantic relevance WordSence(P, A) is:
  • The feature value may be the value calculated by the above formula.
  • In the embodiment of the present invention, the candidate antecedent set is first determined for each pronoun to be resolved in the training corpus; whether the pronoun needs to be resolved is then judged according to consistency constraint rules, and feature extraction is performed. Based on the distance, semantic, and grammatical information of the pronoun and the candidate antecedents, a method for resolving Chinese pronouns in man-machine dialogue is proposed, and the final antecedent is determined.
  • Optionally, before the plurality of candidate antecedents and their word features are extracted from the statement information, it is determined whether the pronoun needs to be resolved.
  • When it is judged that the pronoun needs to be resolved, the plurality of candidate antecedents and their word features are extracted from the statement information; when it is judged that the pronoun does not need to be resolved, the candidate antecedents and their word features are not extracted from the statement information.
  • Whether the pronoun needs to be resolved can be judged by whether the word adjacent to the pronoun is a noun. If the adjacent word is a noun, it is determined that the pronoun does not need to be resolved; if the adjacent word is not a noun, it is determined that the pronoun needs to be resolved, and the plurality of candidate antecedents and their word features may be extracted from the statement information.
  • That is, extracting the plurality of candidate antecedents and their word features from the statement information includes: searching for a pronoun in the statement information and obtaining the word adjacent to the found pronoun; and, where the adjacent word is not a noun, extracting the plurality of candidate antecedents and their word features from the statement information.
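  • The adjacency check can be sketched as follows. The text does not say on which side of the pronoun the adjacent word lies; this sketch assumes the word immediately after the pronoun, and the tag name `n` is hypothetical.

```python
from typing import List, Tuple


def needs_resolution(tagged: List[Tuple[str, str]], pronoun_index: int,
                     noun_tags=("n",)) -> bool:
    """Return False when the word right after the pronoun is a noun, True otherwise."""
    next_index = pronoun_index + 1
    if next_index < len(tagged) and tagged[next_index][1] in noun_tags:
        return False  # e.g. "this phone": the pronoun already has its noun next to it
    return True


# Illustrative tagged input (assumed).
with_noun = [("this", "r"), ("phone", "n")]
without_noun = [("buy", "v"), ("it", "r")]
resolve_first = needs_resolution(with_noun, 0)
resolve_second = needs_resolution(without_noun, 1)
```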
  • extracting a plurality of candidate antecedents from the statement information includes:
  • the embodiment of the present invention is described in detail below with reference to FIG. 3. As shown in FIG. 3, the embodiment may include the following steps:
  • Step S301: Detect a pronoun in the statement information.
  • Before this, a step of detecting whether a pronoun appears in the statement information (i.e., step S306 described below) may be performed, and when a pronoun is detected, the next step is entered.
  • Step S302: Determine whether the pronoun needs to be resolved.
  • If it is determined that the pronoun needs to be resolved, step S303 is performed; if it is determined that the pronoun does not need to be resolved, step S306 is performed: continue detecting whether a pronoun appears in the statement information.
  • Specifically, it is determined whether the word adjacent to the pronoun is a noun. If the adjacent word is a noun, it is judged that the pronoun does not need to be resolved; if the adjacent word is not a noun, it is judged that the pronoun needs to be resolved.
  • Step S303: Acquire a plurality of candidate antecedents.
  • Whether a word is extracted may be determined by whether a referential relationship is possible between the word to be extracted and the pronoun. If a referential relationship is possible, the word is extracted; otherwise, it is not.
  • After the candidate antecedents (such as nouns or noun phrases) are extracted, whether each candidate antecedent and the pronoun can refer to each other can be used to filter the plurality of candidate antecedents and obtain the filtered candidate antecedents.
  • The word features of the filtered candidate antecedents are then extracted from the statement information, and the target antecedent is selected from the filtered candidate antecedents based on the extracted word features.
  • Step S304: Extract the word features of the candidate antecedents.
  • Step S305: Determine the target antecedent of the pronoun using the word features of the candidate antecedents.
  • Specifically, nouns or noun phrases close to the pronoun may be searched for; that is, noun phrases whose distance from the pronoun in the statement information is within a preset distance are obtained.
  • After a noun phrase is found, if no referential relationship is possible between the noun phrase and the pronoun, the noun or noun phrase is not extracted, that is, it is not used as a candidate antecedent of the pronoun;
  • if a referential relationship is possible, the noun or noun phrase is extracted and used as a candidate antecedent.
  • Determining whether the noun phrase and the pronoun can refer to each other includes: determining whether the part of speech of the word connecting the noun phrase and the pronoun is a predicate; if it is not a predicate, determining that the noun phrase and the pronoun can refer to each other; if it is a predicate, determining that the noun phrase and the pronoun cannot refer to each other.
  • The predicate can be a verb or an adjective.
  • For example, when the candidate antecedent "juice extractor" and the pronoun "fruit" are bound by the predicate "squeeze", the two cannot refer to each other.
  • Whether the pronoun and the candidate antecedent can refer to each other can be determined from the output of the parser.
  • Filtering the candidate antecedents by judging whether the noun phrase and the pronoun can refer to each other reduces the number of words and word features to be processed.
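  • One way to sketch the predicate-based filter is shown below. It assumes, purely for illustration, that the parser output is a dependency parse given as a list of head indices, and that "bound by a predicate" means the candidate and the pronoun share the same head word whose tag is a verb or adjective. Neither the representation nor the tag names `v` and `a` come from the patent.

```python
from typing import List

PREDICATE_TAGS = {"v", "a"}  # hypothetical tags for verbs and adjectives


def can_corefer(heads: List[int], tags: List[str],
                candidate_index: int, pronoun_index: int) -> bool:
    """False when candidate and pronoun hang off the same predicate head.

    heads[i] is the index of token i's syntactic head, or -1 for the root.
    """
    head_c, head_p = heads[candidate_index], heads[pronoun_index]
    bound = head_c == head_p and head_c >= 0 and tags[head_c] in PREDICATE_TAGS
    return not bound


def filter_candidates(heads: List[int], tags: List[str],
                      candidate_indices: List[int], pronoun_index: int) -> List[int]:
    """Keep only the candidates that may still corefer with the pronoun."""
    return [c for c in candidate_indices if can_corefer(heads, tags, c, pronoun_index)]


# Tokens: thief(0) escape(1) police(2) punish(3) him(4); an assumed dependency parse.
tags = ["n", "v", "n", "v", "r"]
heads = [1, -1, 3, 1, 3]
remaining = filter_candidates(heads, tags, [0, 2], pronoun_index=4)
```

  • In this toy parse, "police" and "him" are both arguments of "punish" and are filtered out as a pair, while "thief" remains a candidate.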
  • After the candidate antecedents (such as nouns or noun phrases) are extracted, whether each candidate antecedent and the pronoun can refer to each other can be used to filter the plurality of candidate antecedents and obtain the filtered candidate antecedents.
  • The word features of the filtered candidate antecedents are extracted from the statement information, and the target antecedent is selected from the filtered candidate antecedents based on the extracted word features.
  • The weights of the candidate antecedents (i.e., the referential weight values) can be computed by linearly weighting the different feature values and then sorted; the candidate antecedent with the highest weight is the antecedent finally selected for the pronoun.
  • The embodiment may include the following steps:
  • Step S401: When the recognized pronoun needs to be resolved, filter the candidate antecedents using grammatical constraints.
  • The grammatical constraint here refers to a rule under which the pronoun and a candidate antecedent cannot refer to each other. If the pronoun and a candidate antecedent cannot refer to each other, the candidate antecedent is filtered out directly.
  • Step S402: Extract the word features of the remaining candidate antecedents.
  • The word features may include: the singular/plural feature, the distance between the candidate antecedent and the pronoun, the semantic relevance of the candidate antecedent and the pronoun, and whether the candidate antecedent is in a prepositional phrase.
  • Step S403: Convert the features into feature values.
  • Singular/plural consistency weight Sp: 1 if the candidate antecedent and the pronoun agree in number, 0 if they do not.
  • Distance feature weight Dis: the number of conversation rounds between the candidate antecedent and the pronoun.
  • Grammatical constraint weight Sc: 1 if the candidate antecedent is in a prepositional phrase, 0 if it is not.
  • Semantic dependency correlation feature Ws (i.e., the semantic relevance of the candidate antecedent and the pronoun): this may be obtained using the corresponding steps in the foregoing embodiments, and details are not repeated here.
  • Step S404: Calculate the total weight of each candidate antecedent (i.e., the referential weight value in the above embodiment).
  • The coefficients of these feature weights (such as $\lambda_1$) are given initial values based on experience and then adjusted by training on the corpus.
  • Step S405: Determine the candidate antecedent with the largest weight value as the target antecedent.
  • The candidate antecedent with the maximum weight is selected as the resolution result.
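  • Steps S403 to S405 can be sketched end to end as follows. The feature values Sp, Dis, Sc, and Ws are assumed to have been computed already; the coefficient values, including the negative signs on Dis and Sc, are hypothetical starting points rather than values given in the patent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Candidate:
    word: str
    sp: int    # 1 if number agrees with the pronoun, else 0
    dis: int   # conversation rounds between candidate and pronoun
    sc: int    # 1 if the candidate is inside a prepositional phrase, else 0
    ws: float  # semantic relevance with the pronoun (assumed precomputed)


# Hypothetical initial coefficients; in practice they would be tuned on a corpus.
COEFFICIENTS = {"sp": 1.0, "dis": -0.5, "sc": -0.5, "ws": 1.0}


def total_weight(c: Candidate) -> float:
    """Step S404: linear weighting of the four feature values."""
    return (COEFFICIENTS["sp"] * c.sp + COEFFICIENTS["dis"] * c.dis
            + COEFFICIENTS["sc"] * c.sc + COEFFICIENTS["ws"] * c.ws)


def resolve(candidates: List[Candidate]) -> str:
    """Step S405: the candidate with the largest total weight is the target antecedent."""
    return max(candidates, key=total_weight).word


filtered = [
    Candidate("thief", sp=1, dis=0, sc=0, ws=0.8),
    Candidate("police", sp=1, dis=0, sc=0, ws=0.3),
    Candidate("prison", sp=1, dis=1, sc=1, ws=0.1),
]
target = resolve(filtered)
```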
  • The method according to the above embodiment can be implemented by means of software plus a necessary general-purpose hardware platform, and of course can also be implemented in hardware, but in many cases the former is the better implementation.
  • The technical solution of the present invention, in essence or in the part that contributes to the prior art, may be embodied in the form of a software product stored in a storage medium (such as ROM/RAM, a magnetic disk, or an optical disc), including a number of instructions for causing a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to perform the methods described in the various embodiments of the present invention.
  • The apparatus includes:
  • the obtaining unit 51, configured to obtain statement information to be identified;
  • the extracting unit 53, configured to extract from the statement information, when a pronoun exists in the statement information, a plurality of candidate antecedents and word features of the plurality of candidate antecedents;
  • the determining unit 55, configured to determine the target antecedent referred to by the pronoun from the plurality of candidate antecedents based on the word features of the plurality of candidate antecedents.
  • With the above embodiment, the candidate antecedents and the word features of each candidate antecedent are extracted from the statement information, and the target antecedent referred to by the pronoun is determined from the word features of the candidate antecedents.
  • In this way, the target antecedent referred to by the pronoun can be located automatically from the word features of the candidate antecedents extracted from the statement information, which solves the prior-art problem of low processing efficiency in anaphora resolution and achieves accurate and efficient determination of the pronoun's antecedent.
  • The statement information to be identified in the foregoing embodiment may be sent by the terminal 101 to the server. The statement information may be text information; the text may be obtained by converting voice information in the session information, may be text extracted directly from the session information, or may be information extracted from an article.
  • The source of the information is not limited in this application.
  • The statement information is a set of session information generated by a client and a server during a man-machine conversation.
  • The candidate antecedents and the word features of each candidate antecedent may be extracted from the statement information in sequence, or the word features may be extracted at the same time as the candidate antecedents are extracted from the statement information.
  • The words referred to by pronouns may be nouns or noun phrases, so the candidate antecedents extracted are nouns or noun phrases.
  • The statement information may be segmented by a word segmenter, and from the words obtained by segmentation, those whose part of speech is pronoun (i.e., the pronouns) and those whose part of speech is noun or noun phrase (i.e., the candidate antecedents) are extracted.
  • The target antecedent referred to by the pronoun may be determined from the plurality of candidate antecedents based on the word features of the plurality of candidate antecedents, where the word features may include semantic features and grammatical features.
  • The pronoun in the statement information may then be replaced with the target antecedent to complete the statement information.
  • The determining unit includes, as shown in FIG. 6: a determining module 61 for determining a referential weight value of each candidate antecedent based on its word features; and a selection module 63 for selecting the candidate antecedent with the largest referential weight value as the target antecedent referred to by the pronoun.
  • The word feature in the foregoing embodiment may be a semantic feature or a grammatical feature. The semantic and/or grammatical features are used to determine the referential weight value of each candidate antecedent relative to the pronoun, and the resulting values are sorted to obtain a sequence of referential weight values.
  • If the sequence is in descending order, the candidate antecedent corresponding to the first value in the sequence is used as the target antecedent of the pronoun; if the sequence is in ascending order, the candidate antecedent corresponding to the last value is used.
  • the largest weighting value of the plurality of referential weight values may be obtained according to a pairwise comparison manner.
  • the candidate antecedent corresponding to the largest referential weight value is selected as the target antecedent referred to by the pronoun.
  • Each of the plurality of candidate antecedents includes one or more word features. When each candidate antecedent includes exactly one word feature, that word feature is converted into a feature value, and the feature value is used as the referential weight value of the candidate antecedent.
  • each candidate antecedent of the plurality of candidate antecedents includes one or more word features
  • the determining module 61 shown in FIG. 6 includes:
  • a conversion sub-module 611 configured to convert the extracted word features into feature values
  • the calculating sub-module 613 is configured to perform linear weighting calculation on the feature values of each candidate antecedent by using feature coefficients of one or more word features set in advance, to obtain a referential weight value of each candidate antecedent.
  • When each candidate antecedent includes multiple word features, each word feature is converted into a feature value, and a linear weighted calculation is performed on these feature values, using the preset feature coefficients of the one or more word features, to obtain the referential weight value of each candidate antecedent.
  • each of the plurality of candidate antecedents includes one or more word features
  • The word features include at least one of: the singular/plural feature of the candidate antecedent, the distance between the candidate antecedent and the pronoun, whether the candidate antecedent appears in a prepositional phrase, and the semantic relevance of the pronoun and the candidate antecedent.
  • The singular/plural feature is converted into a feature value according to whether the number of the candidate antecedent agrees with that of the pronoun.
  • If the number of the candidate antecedent agrees with that of the pronoun, the feature value is set to the first constant; if it does not, the feature value is set to the second constant.
  • the first constant may be 1 and the second constant may be 0.
  • When the word features include the distance between the candidate antecedent and the pronoun, the distance between the sentences in which the two are located, or the number of characters or sentences between the two words, may be used as the feature value.
  • The feature value may be set to one constant, such as 1, if the candidate antecedent appears in a prepositional phrase, and to another constant, such as 0, if it does not.
  • the feature value may be a value calculated by the above formula.
  • The extracting unit 53 may include: a searching module 71 for finding the adjacent word of a pronoun in the sentence information; and an extracting module 73 for extracting, when the part of speech of the adjacent word is not a noun, the plurality of candidate antecedents and their word features from the sentence information.
  • the extracting unit may include: an obtaining module 81, configured to acquire a noun phrase whose distance from the pronoun in the sentence information is within a preset distance; and a determining module 83, configured to determine between the noun phrase and the pronoun Whether they refer to each other, if the noun phrase and the pronoun refer to each other, the noun phrase is used as the candidate antecedent.
  • The determining module includes a determining sub-module, configured to determine whether the part of speech of the connecting word between the noun phrase and the pronoun is a predicate. If it is not a predicate, the noun phrase and the pronoun are judged to be able to refer to each other; if it is a predicate, they are judged to be unable to refer to each other.
  • To better complete the dialogue information, a candidate antecedent set is first determined for each pronoun to be resolved in the training corpus; whether the pronoun needs to be resolved is then judged according to consistency constraint rules, and features are extracted. Based on the distance, semantic, and grammatical information of the pronoun and the candidate antecedents, a resolution method for Chinese personal pronouns suited to human-machine dialogue is proposed to determine the final candidate antecedent.
  • the modules provided in this embodiment are the same as the methods used in the corresponding steps of the method embodiment, and the application scenarios may be the same.
  • the solution involved in the above module may not be limited to the content and scenario in the foregoing embodiment, and the foregoing module may be run on a computer terminal or a mobile terminal, and may be implemented by software or hardware.
  • a server for implementing the foregoing method and apparatus for determining an antecedent is further provided.
  • The server includes one or more processors 901 (only one is shown in the figure), a memory 903, and a transmission device 905 (such as the transmitting device in the above embodiment), as shown in FIG. 9.
  • the terminal may also include an input and output device 907.
  • The memory 903 can be used to store software programs and modules, such as the program instructions/modules corresponding to the method and apparatus for determining an antecedent in the embodiments of the present invention. The processor 901 runs the software programs and modules stored in the memory 903 to perform various functional applications and data processing, that is, to implement the above-described method for determining an antecedent.
  • Memory 903 can include high speed random access memory, and can also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid state memory.
  • memory 903 can further include memory remotely located relative to processor 901, which can be connected to the terminal over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
  • The transmission device 905 described above is used to receive or transmit data via a network, and can also be used for data transmission between the processor and the memory. Specific examples of the network include wired networks and wireless networks.
  • the transmission device 905 includes a Network Interface Controller (NIC) that can be connected to other network devices and routers via a network cable to communicate with the Internet or a local area network.
  • the transmission device 905 is a Radio Frequency (RF) module for communicating with the Internet wirelessly.
  • the memory 903 is used to store an application.
  • The processor is configured to perform the following steps: acquire the sentence information to be identified; when a pronoun is identified in the sentence information, extract a plurality of candidate antecedents and the word features of the plurality of candidate antecedents from the sentence information; and determine, based on those word features, the target antecedent referred to by the pronoun from among the plurality of candidate antecedents.
  • The processor is further configured to: determine a referential weight value of each candidate antecedent based on its word features; and select the candidate antecedent with the largest referential weight value as the target antecedent referred to by the pronoun.
  • The processor is further configured such that, where each of the plurality of candidate antecedents includes one or more word features, determining the referential weight value of each candidate antecedent based on its word features includes: converting the extracted word features into feature values; and performing a linear weighted calculation on the feature values of each candidate antecedent, using the preset feature coefficients of the one or more word features, to obtain the referential weight value of each candidate antecedent.
  • Each of the plurality of candidate antecedents includes one or more word features, and the word features include at least one of: the singular/plural feature of the candidate antecedent, the distance between the candidate antecedent and the pronoun, whether the candidate antecedent appears in a prepositional phrase, and the semantic relevance of the pronoun and the candidate antecedent.
  • The processor is further configured such that extracting the plurality of candidate antecedents and their word features from the sentence information includes: finding the adjacent word of the pronoun in the sentence information; and, when the part of speech of the adjacent word is not a noun, extracting the plurality of candidate antecedents and their word features from the sentence information.
  • The processor is further configured such that extracting the plurality of candidate antecedents from the sentence information includes: obtaining a noun phrase in the sentence information whose distance from the pronoun is within a preset distance; determining whether the noun phrase and the pronoun refer to each other; and, if they do, using the noun phrase as a candidate antecedent.
  • The processor is further configured such that determining whether the noun phrase and the pronoun refer to each other includes: determining whether the part of speech of the connecting word between the noun phrase and the pronoun is a predicate; if it is not a predicate, judging that the noun phrase and the pronoun can refer to each other; and if it is a predicate, judging that they cannot refer to each other.
  • The structure shown in FIG. 9 is only illustrative. The terminal may be a terminal device such as a smartphone (for example, an Android phone or an iOS phone), a tablet computer, a palmtop computer, a mobile Internet device (MID), or a PAD.
  • FIG. 9 does not limit the structure of the above electronic device.
  • the terminal may also include more or fewer components (such as a network interface, processing device, etc.) than shown in FIG. 9, or have a different configuration than that shown in FIG.
  • Embodiments of the present invention also provide a storage medium.
  • the foregoing storage medium may be used to store program code for executing the above method.
  • The storage medium is arranged to store program code for performing the following steps: acquiring the sentence information to be identified; when a pronoun is identified in the sentence information, extracting a plurality of candidate antecedents and the word features of the plurality of candidate antecedents from the sentence information; and determining, based on those word features, the target antecedent referred to by the pronoun from among the plurality of candidate antecedents.
  • The storage medium is arranged to store program code for performing the following steps: determining a referential weight value of each candidate antecedent based on its word features; and selecting the candidate antecedent with the largest referential weight value as the target antecedent referred to by the pronoun.
  • The storage medium is arranged to store program code such that, where each of the plurality of candidate antecedents includes one or more word features, determining the referential weight value of each candidate antecedent includes: converting the extracted word features into feature values; and performing a linear weighted calculation on the feature values of each candidate antecedent, using the preset feature coefficients of the one or more word features, to obtain the referential weight value of each candidate antecedent.
  • Each of the plurality of candidate antecedents includes one or more word features, and the word features include at least one of: the singular/plural feature of the candidate antecedent, the distance between the candidate antecedent and the pronoun, whether the candidate antecedent appears in a prepositional phrase, and the semantic relevance of the pronoun and the candidate antecedent.
  • The storage medium is arranged to store program code such that extracting the plurality of candidate antecedents and their word features from the sentence information includes: finding the adjacent word of the pronoun in the sentence information; and, when the part of speech of the adjacent word is not a noun, extracting the plurality of candidate antecedents and their word features from the sentence information.
  • The storage medium is arranged to store program code such that extracting the plurality of candidate antecedents from the sentence information includes: obtaining a noun phrase in the sentence information whose distance from the pronoun is within a preset distance; determining whether the noun phrase and the pronoun refer to each other; and, if they do, using the noun phrase as a candidate antecedent.
  • The storage medium is arranged to store program code such that determining whether the noun phrase and the pronoun refer to each other includes: determining whether the part of speech of the connecting word between the noun phrase and the pronoun is a predicate; if it is not a predicate, judging that the noun phrase and the pronoun can refer to each other; and if it is a predicate, judging that they cannot refer to each other.
  • The foregoing storage medium may include, but is not limited to, any medium that can store program code, such as a USB flash drive, a read-only memory (ROM), a random access memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
  • the integrated unit in the above embodiment if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in the above-described computer readable storage medium.
  • The technical solution of the present invention, in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product stored in a storage medium.
  • The software product includes a number of instructions that cause one or more computer devices (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the various embodiments of the present invention.
  • the disclosed client may be implemented in other manners.
  • the device embodiments described above are merely illustrative.
  • the division of the unit is only a logical function division.
  • In actual implementation other divisions are possible; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection through some interface, unit or module, and may be electrical or otherwise.
  • the units described as separate components may or may not be physically separated, and the components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of the embodiment.
  • each functional unit in each embodiment of the present invention may be integrated into one processing unit, or each unit may exist physically separately, or two or more units may be integrated into one unit.
  • the above integrated unit can be implemented in the form of hardware or in the form of a software functional unit.

Abstract

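A method and apparatus for determining an antecedent. The method includes: acquiring sentence information to be identified (S202); when a pronoun is identified in the sentence information, extracting a plurality of candidate antecedents and the word features of the plurality of candidate antecedents from the sentence information (S204); and determining, based on the word features of the plurality of candidate antecedents, the target antecedent referred to by the pronoun from among the plurality of candidate antecedents (S206). This solves the technical problem of low processing efficiency in anaphora resolution.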

Description

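Method and Apparatus for Determining an Antecedent
This application claims priority to Chinese Patent Application No. 201610341637.6, filed with the Chinese Patent Office on May 20, 2016 and entitled "Method and Apparatus for Determining an Antecedent", which is incorporated herein by reference in its entirety.
Technical Field
The present invention relates to the field of information processing, and in particular to a method and apparatus for determining an antecedent.
Background
In human-machine dialogue, the machine needs to understand the contextual information in a sentence accurately. If it cannot, the dialogue information becomes ambiguous, and reference (anaphora) is the main cause of such ambiguity.
Broadly speaking, anaphora resolution is the problem of determining which noun phrase in a discourse a pronoun points to. The prior art includes the following anaphora resolution algorithms: (1) a left-to-right, breadth-first search that traverses the syntax tree level by level to perform resolution; this algorithm has to traverse the information to be identified, and the traversal workload is large; (2) adding semantic constraints on top of syntactic knowledge; this works reasonably well for English pronoun resolution, but Chinese lexical processing is difficult, so the method is not suitable for Chinese anaphora resolution; (3) adding semantic information to the LRC (left-right centering) algorithm to filter candidate antecedents; however, the semantic information this algorithm uses must be defined manually in advance, and the test corpus was likewise manually cleaned of disfluent elements.
Shallow lexical processing of Chinese is relatively difficult: word segmentation is required during resolution, nouns have no explicit number or gender marking, pronouns have no explicit nominative or accusative marking, and ellipsis is common in spoken conversation. These difficulties make the above schemes unsuitable for Chinese pronoun resolution. Current pronoun resolution work relies largely on manual corpus cleaning and annotation, and there is no effective processing scheme for anaphora resolution.
No effective solution to the above problems has yet been proposed.
Summary
Embodiments of the present invention provide a method and apparatus for determining an antecedent, to at least solve the technical problem of low processing efficiency in anaphora resolution.
According to one aspect of the embodiments of the present invention, a method for determining an antecedent is provided. The method includes: acquiring sentence information to be identified; when a pronoun is identified in the sentence information, extracting a plurality of candidate antecedents and the word features of the plurality of candidate antecedents from the sentence information; and determining, based on the word features of the plurality of candidate antecedents, the target antecedent referred to by the pronoun from among the plurality of candidate antecedents.
According to another aspect of the embodiments of the present invention, an apparatus for determining an antecedent is also provided. The apparatus includes: an acquiring unit, configured to acquire sentence information to be identified; an extracting unit, configured to extract, when a pronoun is identified in the sentence information, a plurality of candidate antecedents and the word features of the plurality of candidate antecedents from the sentence information; and a determining unit, configured to determine, based on the word features of the plurality of candidate antecedents, the target antecedent referred to by the pronoun from among the plurality of candidate antecedents.
In the embodiments of the present invention, when a pronoun exists in the sentence information, candidate antecedents and the word features of each candidate antecedent are extracted from the sentence information, and the word features of the candidate antecedents are used to determine the target antecedent referred to by the pronoun. In this solution, the target antecedent referred to by the pronoun can be identified automatically from the word features of the candidate antecedents extracted from the sentence information, which addresses the low processing efficiency of anaphora resolution in the prior art and allows the pronoun's antecedent to be determined accurately and efficiently.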
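Brief Description of the Drawings
The accompanying drawings described here are provided for further understanding of the present invention and form part of this application. The exemplary embodiments of the present invention and their description are used to explain the invention and do not unduly limit it. In the drawings:
FIG. 1 is a schematic diagram of the network environment of an optional method for determining an antecedent according to an embodiment of the present invention;
FIG. 2 is a first flowchart of the method for determining an antecedent according to an embodiment of the present invention;
FIG. 3 is a second flowchart of the method for determining an antecedent according to an embodiment of the present invention;
FIG. 4 is a third flowchart of the method for determining an antecedent according to an embodiment of the present invention;
FIG. 5 is a first schematic diagram of the apparatus for determining an antecedent according to an embodiment of the present invention;
FIG. 6 is a second schematic diagram of the apparatus for determining an antecedent according to an embodiment of the present invention;
FIG. 7 is a third schematic diagram of the apparatus for determining an antecedent according to an embodiment of the present invention;
FIG. 8 is a fourth schematic diagram of the apparatus for determining an antecedent according to an embodiment of the present invention;
FIG. 9 is a block diagram of the internal structure of a server according to an embodiment of the present invention.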
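Detailed Description
To enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are evidently only some, not all, of the embodiments of the invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort shall fall within the protection scope of the present invention.
It should be noted that the terms "first", "second", and so on in the specification, claims, and drawings are used to distinguish similar objects and do not necessarily describe a specific order or sequence. Data so used may be interchanged where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that comprises a series of steps or units is not necessarily limited to the steps or units expressly listed, and may include other steps or units that are not expressly listed or that are inherent to the process, method, product, or device.
First, the terms used in the embodiments of this application are explained as follows:
Reference (anaphora): the semantic association between the current pronoun and a word or phrase that appears in the preceding text.
Antecedent: a phrase that is semantically associated with the current pronoun, such as the word or phrase the pronoun refers to.
Query: the text information in a session.
Session: a set of conversations.
Predicate: a term used to describe or assert the properties or characteristics of an object, or the relations between objects; predicates generally include verbs and adjectives.
Adjacent word: a word at a neighbouring position in the sentence information.
Dependent word: a word that is semantically dependent on, and attached to, another word.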
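Embodiment 1
According to an embodiment of the present invention, an embodiment of a method for determining an antecedent is provided. It should be noted that the steps shown in the flowcharts of the drawings may be executed in a computer system, for example as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in a different order.
Optionally, in this embodiment, the above information processing method is applied in the network environment shown in FIG. 1. The network environment includes a terminal 101 and a server 103 (which may be the server of a network-connected application or a cloud platform). The terminal can establish a connection with the server over a network, and both the terminal and the server may be provided with a processor.
The network includes, but is not limited to, a wide area network, a metropolitan area network, or a local area network. The terminal may be a terminal with an input device, such as a mobile terminal (for example, a mobile phone or a tablet computer), on which an intelligent dialogue client can be installed. Optionally, the server corresponds to the intelligent dialogue client and can be used to process the information that the terminal sends through the client.
FIG. 2 is a flowchart of the method for determining an antecedent according to an embodiment of the present invention. As shown in FIG. 2, the method may include the following steps:
Step S202: acquire the sentence information to be identified;
Step S204: when a pronoun is identified in the sentence information, extract a plurality of candidate antecedents and the word features of the plurality of candidate antecedents from the sentence information;
Step S206: based on the word features of the plurality of candidate antecedents, determine the target antecedent referred to by the pronoun from among the plurality of candidate antecedents.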
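With the above embodiment, when a pronoun exists in the sentence information, candidate antecedents and the word features of each candidate antecedent are extracted from the sentence information, and the word features of the candidate antecedents are used to determine the target antecedent referred to by the pronoun. In this solution, the target antecedent referred to by the pronoun can be identified automatically from the word features of the candidate antecedents extracted from the sentence information, which addresses the low processing efficiency of anaphora resolution in the prior art and allows the pronoun's antecedent to be determined accurately and efficiently.
It should be noted that the pronoun, the candidate antecedents, and the word features of the candidate antecedents in the above embodiment are all extracted from the sentence information. No predefinition and no manual corpus cleaning or annotation is required, which greatly increases processing speed.
The sentence information to be identified in the above embodiment may be sent by the terminal 101 to the server. It may be text information, which may be converted from voice information in the session information, extracted directly from the sentence information, or extracted from an article. This application does not limit the source of the information.
Specifically, the sentence information is the set of session information generated between a client and the server during a human-machine dialogue.
In extracting the plurality of candidate antecedents and their word features from the sentence information, the candidate antecedents and the features of each candidate antecedent may be extracted in sequence, or the features may be extracted at the same time as the candidate antecedents.
It should be noted that the word a pronoun refers to may be a noun or a noun phrase, and the extracted candidate antecedents are all nouns or noun phrases.
Further, in extracting the pronoun and the plurality of candidate antecedents from the sentence information, a preset word segmenter may be used to segment the sentence information; from the resulting words, those whose part of speech is pronoun (i.e., the pronouns) and the nouns or noun phrases (i.e., the candidate antecedents) are extracted.
According to the above embodiment of the present invention, the target antecedent referred to by the pronoun may be determined from the plurality of candidate antecedents based on their word features, where the word features may include semantic features and grammatical features.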
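The extraction step can be sketched as follows. This is only a minimal illustration: it assumes the word segmenter has already produced (word, part-of-speech) pairs, and the tag names `"r"` (pronoun) and `"n"` (noun), the function name, and the hand-tagged sentence are assumptions for illustration rather than anything specified in the source.

```python
from typing import List, Tuple

# Hypothetical tag names: "r" marks a pronoun, "n" a noun or noun phrase.
PRONOUN_TAG = "r"
NOUN_TAG = "n"


def extract_pronouns_and_candidates(
    tagged: List[Tuple[str, str]],
) -> Tuple[List[Tuple[int, str]], List[Tuple[int, str]]]:
    """Split segmenter output into pronouns and candidate antecedents.

    Each returned item is (token index, word), so that later steps can
    measure the distance between a pronoun and a candidate.
    """
    pronouns = [(i, w) for i, (w, pos) in enumerate(tagged) if pos == PRONOUN_TAG]
    candidates = [(i, w) for i, (w, pos) in enumerate(tagged) if pos == NOUN_TAG]
    return pronouns, candidates


# Hand-tagged form of "警察发现小偷越狱,加重对他的刑罚" (tags are illustrative).
tagged_sentence = [
    ("警察", "n"), ("发现", "v"), ("小偷", "n"), ("越狱", "v"),
    (",", "x"), ("加重", "v"), ("对", "p"), ("他", "r"),
    ("的", "u"), ("刑罚", "n"),
]
pronouns, candidates = extract_pronouns_and_candidates(tagged_sentence)
```

Keeping the token index alongside each word is a design choice of this sketch; it makes the distance feature described later easy to compute.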
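The embodiment of the present invention is described in detail below:
After the terminal starts the intelligent dialogue client (hereinafter, the client), communication is established between the client and the server, and session information is sent to the server through the client. After receiving the session information, the server uses it as the sentence information if it is text; if it is voice information, the server converts it into text and uses the converted text as the sentence information.
The server identifies the sentence information. If a pronoun is identified in it, the server obtains the session set generated during the conversation (that is, the sentence information described above), extracts a plurality of candidate antecedents and the word features of each from it, and uses the word features to determine the target antecedent the pronoun refers to.
After the target antecedent referred to by the pronoun is determined, the pronoun in the sentence information may be replaced with the target antecedent so as to complete the sentence information.
According to the above embodiment of the present invention, determining the target antecedent referred to by the pronoun from the plurality of candidate antecedents based on their word features may include: determining a referential weight value of each candidate antecedent based on its word features; and selecting the candidate antecedent with the largest referential weight value as the target antecedent referred to by the pronoun.
Specifically, the word feature in the above embodiment may be a semantic feature or a grammatical feature. The semantic and/or grammatical features are used to determine the referential weight value of each candidate antecedent relative to the pronoun, and the resulting values are sorted to obtain a sequence of referential weight values. If the sequence is in descending order, the candidate antecedent corresponding to the first value in the sequence is used as the target antecedent referred to by the pronoun; if the sequence is in ascending order, the candidate antecedent corresponding to the last value is used.
In an optional implementation, after the referential weight value of each candidate antecedent relative to the pronoun is determined, the largest of the referential weight values may be obtained by pairwise comparison, and the candidate antecedent corresponding to it is selected as the target antecedent referred to by the pronoun.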
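The two selection strategies just described (sorting, or finding the maximum by pairwise comparison) can be sketched as below. The function names and the example weight values are hypothetical and chosen only for illustration; the source does not give concrete numbers.

```python
from typing import Dict


def select_target_antecedent(weights: Dict[str, float]) -> str:
    """Return the candidate with the largest referential weight value."""
    if not weights:
        raise ValueError("no candidate antecedents to choose from")
    # max() scans the candidates once, comparing values pairwise.
    return max(weights, key=weights.get)


def select_by_sorting(weights: Dict[str, float]) -> str:
    """Same selection via a descending sort: take the first element."""
    ranked = sorted(weights.items(), key=lambda kv: kv[1], reverse=True)
    return ranked[0][0]


# Illustrative referential weight values, not taken from the source.
example_weights = {"警察": 0.4, "小偷": 1.3, "刑罚": 0.2}
```

Both functions return the same candidate; the maximum scan avoids the cost of a full sort when only the top candidate is needed.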
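In an optional embodiment, each of the plurality of candidate antecedents includes one or more word features. When each candidate antecedent includes exactly one word feature, that word feature is converted into a feature value, and the feature value is used as the referential weight value of the candidate antecedent.
In another optional embodiment, each of the plurality of candidate antecedents includes one or more word features, and determining the referential weight value of each candidate antecedent based on its word features includes: converting the extracted word features into feature values; and performing a linear weighted calculation on the feature values of each candidate antecedent, using the preset feature coefficients of the one or more word features, to obtain the referential weight value of each candidate antecedent.
Specifically, when each candidate antecedent includes multiple word features, each word feature is converted into a feature value, and a linear weighted calculation is performed on these feature values, using the preset feature coefficients of the one or more word features, to obtain the referential weight value of each candidate antecedent.
For example, if there are two word features whose feature values are t1 and t2, the preset feature coefficients λ1 and λ2 of the two word features are obtained, and a linear weighted calculation is performed on the two feature values: Weight = λ1·t1 + λ2·t2.
The feature coefficients of these features may be given initial values based on experience, or may be adjusted using a training corpus.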
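The linear weighting can be written as a short function. This is a sketch under stated assumptions: the function name and the numeric coefficients are illustrative, since the source only says that the coefficients are set from experience or tuned on a training corpus.

```python
from typing import Sequence


def referential_weight(feature_values: Sequence[float],
                       coefficients: Sequence[float]) -> float:
    """Weight = sum over k of coefficient_k * feature_value_k."""
    if len(feature_values) != len(coefficients):
        raise ValueError("one coefficient is required per feature value")
    return sum(c * t for c, t in zip(coefficients, feature_values))


# Two features t1 = 1.0, t2 = 3.0 with illustrative λ1 = 0.5, λ2 = 0.25.
weight = referential_weight([1.0, 3.0], [0.5, 0.25])
```

With these values the result is 0.5·1.0 + 0.25·3.0 = 1.25.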
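In an optional embodiment, each of the plurality of candidate antecedents includes one or more word features, and the word features include at least one of: the singular/plural feature of the candidate antecedent, the distance between the candidate antecedent and the pronoun, whether the candidate antecedent appears in a prepositional phrase, and the semantic relevance of the pronoun and the candidate antecedent.
When the word features include the singular/plural feature of the candidate antecedent: a singular pronoun cannot refer to a plural antecedent, so number agreement is an important feature for judging whether two words stand in a referential relationship. For example, in "今天天气很好,我和同学们准备出去逛逛" ("The weather is nice today; my classmates and I are going out for a walk"), the pronoun "我" ("I") is singular while "同学们" ("classmates") is plural, and a singular cannot refer to a plural. After the singular/plural feature is extracted, it can be converted into a feature value according to whether the number of the candidate antecedent agrees with that of the pronoun: if it agrees, the feature value is set to a first constant; if it does not, the feature value is set to a second constant. Optionally, the first constant may be 1 and the second constant may be 0.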
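A minimal sketch of the number-agreement feature follows. The plural test is an assumption made for illustration: it treats the suffix 们 as the only plural marker, which is far cruder than what a real system would need, and the source does not say how number is detected.

```python
def is_plural(word: str) -> bool:
    """Crude plural test: treat the suffix 们 as a plural marker (assumption)."""
    return word.endswith("们")


def number_agreement_feature(candidate: str, pronoun: str,
                             first_constant: float = 1.0,
                             second_constant: float = 0.0) -> float:
    """Return the first constant when number agrees, else the second."""
    if is_plural(candidate) == is_plural(pronoun):
        return first_constant
    return second_constant
```

For the example above, `number_agreement_feature("同学们", "我")` yields the second constant, because the candidate is plural and the pronoun is singular.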
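The distance between the candidate antecedent and the pronoun in the above embodiment usually means the distance between the sentences or paragraphs in which the two words are located; it may also mean the number of characters between the two words. In a multi-turn conversation, complete sentence information takes several sentences to express, and the closer the sentences containing the candidate antecedent and the pronoun, the greater their relevance, so the distance between the pronoun and the antecedent is highly informative here. When the word features include this distance, the distance between the sentences in which the candidate antecedent and the pronoun are located, or the number of characters or sentences between the two words, may be used as the feature value.
Analysis of a large multi-turn dialogue corpus shows that grammatical structure has a large influence on anaphora resolution. Nouns in direct objects and in indirect objects are referred to with no significant difference in probability, whereas nouns inside prepositional phrases are referred to with relatively low probability. Therefore, in the embodiments of the present invention, whether the candidate antecedent appears in a prepositional phrase may be used as a word feature. When converting this word feature into a feature value, the feature value may be set to one constant, such as 1, when the candidate antecedent appears in a prepositional phrase, and to another constant, such as 0, when it does not.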
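The distance and prepositional-phrase features can be sketched as two small functions. The function names are hypothetical, and the sketch assumes that sentence (or turn) indices and the in-prepositional-phrase flag are supplied by earlier processing such as a syntactic parser.

```python
def distance_feature(candidate_sentence_index: int,
                     pronoun_sentence_index: int) -> float:
    """Number of sentences (or dialogue turns) separating the two words."""
    return float(abs(pronoun_sentence_index - candidate_sentence_index))


def prepositional_phrase_feature(in_prepositional_phrase: bool) -> float:
    """1 if the candidate appears inside a prepositional phrase, else 0."""
    return 1.0 if in_prepositional_phrase else 0.0
```

Because a larger distance and a position inside a prepositional phrase both make reference less likely according to the text, one would expect the corresponding feature coefficients to be chosen so that these features lower the total weight; the source does not state the sign convention.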
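Optionally, the relevance of the semantic dependent words may also be used as a word feature (that is, the semantic relevance of the pronoun and the candidate antecedent in the above embodiment). For example, in the sentence information "警察发现小偷越狱,加重对他的刑罚" ("The police found that the thief had escaped from prison and increased his punishment"), the candidate antecedent "小偷" ("thief") and the pronoun "他" ("him") depend on "越狱" ("escape from prison") and "刑罚" ("punishment") respectively. These two dependent words are strongly related, which shows that the relevance between the dependent words of the pronoun and of the candidate antecedent can help determine the referential relationship.
The semantic relevance of the pronoun and the candidate antecedent can be determined from the relevance between the semantic dependent words of the two words.
In an optional embodiment, P is the pronoun to be resolved, A is a candidate antecedent, (Px1,Px2...Pxi) are the dependent words of the pronoun, and (Ax1,Ax2...Axj) are the dependent words of the candidate antecedent, where i and j are natural numbers, i is the number of dependent words of the pronoun, and j is the number of dependent words of the candidate antecedent. Specifically, the semantic relevance WordSence(P,A) of the pronoun P and the candidate antecedent A is:
[Formula image PCTCN2017074800-appb-000001: expression for WordSence(P,A); not reproduced in this text]
When the word features include the semantic relevance between the pronoun and the candidate antecedent, the feature value may be the value calculated by the above formula.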
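To better complete the dialogue information, a candidate antecedent set is first determined for each pronoun to be resolved in the training corpus; whether the pronoun needs to be resolved is then judged according to consistency constraint rules, and features are extracted. Based on the distance, semantic, and grammatical information of the pronoun and the candidate antecedents, a resolution method for Chinese personal pronouns suited to human-machine dialogue is proposed to determine the final candidate antecedent.
Optionally, before the plurality of candidate antecedents and their word features are extracted from the sentence information, it is judged whether the pronoun needs to be resolved. If it does, the plurality of candidate antecedents and their word features are extracted from the sentence information; if it does not, they are not extracted.
Specifically, whether the pronoun needs to be resolved can be judged by checking whether the adjacent word of the pronoun is a noun. If the adjacent word is a noun, the pronoun is judged not to need resolution; if the adjacent word is not a noun, the pronoun is judged to need resolution, and the plurality of candidate antecedents and their word features can be extracted from the sentence information.
For example: "今天天气很好,小明他要出去逛逛" ("The weather is nice today; Xiao Ming, he is going out for a walk"). Utterances like this are common in everyday dialogue, and the pronoun "他" ("he") here does not need to be resolved. Grammatically, the adjacent word of the pronoun "他" is "小明" ("Xiao Ming"), which is a noun. When the two words are adjacent and one of them is a noun, the meaning of the pronoun is clear without resolution.
Specifically, extracting the plurality of candidate antecedents and their word features from the sentence information includes: finding the pronoun in the sentence information and obtaining its adjacent word; and, when the part of speech of the adjacent word is not a noun, extracting the plurality of candidate antecedents and their word features from the sentence information.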
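The adjacency check can be sketched as follows. The source does not say whether the adjacent word is the one before or after the pronoun, so this sketch assumes both neighbours are inspected; the tag name `"n"` and the hand-tagged sentences are likewise assumptions for illustration.

```python
from typing import List, Tuple


def needs_resolution(tagged: List[Tuple[str, str]], pronoun_index: int,
                     noun_tag: str = "n") -> bool:
    """A pronoun with a noun as an adjacent word is treated as already clear."""
    neighbours = []
    if pronoun_index > 0:
        neighbours.append(tagged[pronoun_index - 1])
    if pronoun_index + 1 < len(tagged):
        neighbours.append(tagged[pronoun_index + 1])
    # Resolution is needed only if no neighbour is a noun.
    return all(pos != noun_tag for _, pos in neighbours)


# "小明他要出去逛逛": the pronoun 他 directly follows the noun 小明.
appositive = [("小明", "n"), ("他", "r"), ("要", "v"), ("出去", "v"), ("逛逛", "v")]
# "加重对他的刑罚": neither neighbour of 他 is a noun.
plain = [("加重", "v"), ("对", "p"), ("他", "r"), ("的", "u"), ("刑罚", "n")]
```

Here `needs_resolution(appositive, 1)` is false, so candidate extraction would be skipped, while `needs_resolution(plain, 2)` is true.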
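In an optional embodiment, extracting the plurality of candidate antecedents from the sentence information includes:
obtaining a noun phrase in the sentence information whose distance from the pronoun is within a preset distance; determining whether the noun phrase and the pronoun refer to each other; and, if they do, using the noun phrase as a candidate antecedent.
The embodiment of the present invention is described in detail below with reference to FIG. 3. As shown in FIG. 3, the embodiment may include the following steps:
Step S301: detect that a pronoun appears in the sentence information.
Optionally, the step of detecting whether a pronoun appears in the sentence information (step S306 below) may be executed, and this step is entered when a pronoun is detected.
Step S302: judge whether the pronoun needs to be resolved.
If the pronoun is judged to need resolution, step S303 is executed; if it is judged not to need resolution, execution continues with step S306: detect whether a pronoun appears in the sentence information.
Specifically, this can be done by checking whether the adjacent word of the pronoun is a noun: if it is, the pronoun is judged not to need resolution; if it is not, the pronoun is judged to need resolution.
Step S303: obtain a plurality of candidate antecedents.
In this step, when candidate antecedents are extracted from the sentence information, whether to extract a word can be decided by whether a mutual referential relationship can exist between that word and the pronoun. If such a relationship can exist, the word is extracted; otherwise it is not.
Optionally, in this embodiment, all candidate antecedents (such as nouns or noun phrases) may be extracted first and then filtered according to whether each candidate antecedent and the pronoun can refer to each other. The word features of the filtered candidate antecedents are then extracted from the sentence information, and the target antecedent is selected from the filtered candidate antecedents based on the extracted word features.
Step S304: extract the word features of the candidate antecedents.
Step S305: use the word features of the candidate antecedents to determine the target antecedent referred to by the pronoun.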
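According to the above embodiment of the present invention, nouns or noun phrases close to the pronoun may be searched for in the sentence information; that is, a noun phrase whose distance from the pronoun is within a preset distance is obtained. After a noun phrase is found, if no referential relationship can exist between it and the pronoun, the noun or noun phrase is not extracted, that is, it is not used as a candidate antecedent of the pronoun; if the noun phrase and the pronoun can refer to each other, the noun or noun phrase is extracted and used as a candidate antecedent.
Specifically, determining whether the noun phrase and the pronoun refer to each other includes: determining whether the part of speech of the connecting word between the noun phrase and the pronoun is a predicate; if it is not a predicate, judging that the noun phrase and the pronoun can refer to each other; and if it is a predicate, judging that they cannot refer to each other.
The predicate may be a verb or an adjective. For example, in "使用榨汁机榨水果很健康" ("Using a juicer to juice fruit is healthy"), the candidate antecedent "榨汁机" ("juicer") and the pronoun "水果" ("fruit") are both bound by the predicate "榨" ("to juice"), so the two cannot refer to each other. Optionally, whether the pronoun and a candidate antecedent can refer to each other may be judged from the output of a syntactic parser.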
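The predicate constraint can be sketched as a small filter. This is a sketch under explicit assumptions: the part of speech of the connecting word is taken as an input supplied by a syntactic parser (the source does not describe how the connecting word is found), and the tag names `"v"` and `"a"` are hypothetical.

```python
from typing import Optional

# Hypothetical tag names for predicates: "v" for verbs, "a" for adjectives.
PREDICATE_TAGS = frozenset({"v", "a"})


def can_corefer(connecting_word_pos: Optional[str]) -> bool:
    """False when the word linking noun phrase and pronoun is a predicate.

    None means the parser reported no connecting word, in which case
    the pair is kept as a possible referential pair.
    """
    return connecting_word_pos not in PREDICATE_TAGS
```

In the juicer example, the connecting word "榨" would be tagged as a verb, so `can_corefer("v")` is false and "榨汁机" is dropped from the candidate set.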
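In this embodiment, judging whether a noun phrase and the pronoun refer to each other makes it possible to filter the candidate antecedents and so reduce the number of words and word features to be processed.
Further, in this embodiment, all candidate antecedents (such as nouns or noun phrases) may be extracted first and then filtered according to whether each candidate antecedent and the pronoun can refer to each other. The word features of the filtered candidate antecedents are then extracted from the sentence information, and the target antecedent is selected from the filtered candidate antecedents based on the extracted word features.
According to the above embodiment of the present invention, a linear weighting of the different feature weights may be used: during resolution, the candidate antecedents are sorted by weight (that is, by referential weight value), and the one with the highest weight is selected as the final antecedent.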
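The above embodiment of the present invention is described in detail below with reference to FIG. 4. As shown in FIG. 4, the embodiment may include the following steps:
Step S401: when the identified pronoun needs to be resolved, filter the candidate antecedents using grammatical constraints.
Specifically, the grammatical constraint here may be a rule under which the pronoun and a candidate antecedent cannot refer to each other; if they cannot, the candidate antecedent is filtered out directly.
Step S402: extract the word features of the remaining candidate antecedents.
The word features may include: the singular/plural feature, the distance between the candidate antecedent and the pronoun, the semantic relevance of the candidate antecedent and the pronoun, and whether the candidate antecedent is in a prepositional phrase.
Step S403: convert the features into feature values.
The number-agreement weight Sp is 1 if the candidate antecedent and the pronoun agree in number and 0 if they do not.
The distance feature weight Dis equals the number of conversation turns between the candidate antecedent and the pronoun.
The grammatical constraint weight Sc is 1 if the candidate antecedent is in a prepositional phrase and 0 if it is not.
The semantic dependent-word relevance feature Ws (that is, the semantic relevance of the candidate antecedent and the pronoun) may optionally be obtained as in the corresponding step of the above embodiment, which is not repeated here.
Step S404: calculate the total weight of each candidate antecedent (that is, the referential weight value in the above embodiment).
The total weight of a candidate antecedent is: Weight = λ1·Sp + λ2·Dis + λ3·Sc + λ4·Ws.
The coefficients of these feature weights (such as λ1) are given initial values based on experience and are then adjusted using a training corpus.
Step S405: determine the candidate antecedent with the largest referential weight value as the target antecedent.
That is, the candidate antecedent with the largest weight is selected as the resolution result.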
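Steps S403 to S405 can be sketched end to end as follows. The class and function names, the coefficient values, and the feature values in the example are all assumptions made for illustration; in particular, the negative signs on λ2 and λ3 reflect this sketch's reading that distance and prepositional-phrase membership should lower the score, which the source does not state explicitly.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Candidate:
    word: str
    sp: float   # number agreement: 1 if consistent with the pronoun, else 0
    dis: float  # number of dialogue turns between candidate and pronoun
    sc: float   # 1 if inside a prepositional phrase, else 0
    ws: float   # semantic relevance of the dependent words


def total_weight(c: Candidate, lam: Sequence[float]) -> float:
    """Weight = λ1·Sp + λ2·Dis + λ3·Sc + λ4·Ws."""
    l1, l2, l3, l4 = lam
    return l1 * c.sp + l2 * c.dis + l3 * c.sc + l4 * c.ws


def resolve(candidates: List[Candidate], lam: Sequence[float]) -> str:
    """Return the word of the candidate with the largest total weight."""
    return max(candidates, key=lambda c: total_weight(c, lam)).word


# Illustrative coefficients and feature values (not from the source).
lam = (1.0, -0.5, -0.5, 1.0)
candidates = [
    Candidate("警察", sp=1.0, dis=0.0, sc=0.0, ws=0.25),
    Candidate("小偷", sp=1.0, dis=0.0, sc=0.0, ws=0.75),
]
```

With these made-up values the two candidates differ only in semantic relevance, so `resolve(candidates, lam)` picks "小偷".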
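The above technical solution considers the distance, grammatical, and semantic features of the pronoun and the candidate antecedents together and, through analysis of a large real multi-turn conversation corpus, adds grammatical constraint rules. The technique was ultimately applied in real human-machine conversation scenarios and achieved good results.
It should be noted that, for simplicity of description, each of the foregoing method embodiments is expressed as a series of combined actions. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, because according to the present invention some steps may be performed in other orders or simultaneously. Those skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by the present invention.
From the description of the above implementations, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence, or the part that contributes to the prior art, may be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disk) and includes a number of instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, etc.) to perform the methods described in the various embodiments of the present invention.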
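Embodiment 2
According to an embodiment of the present invention, a determining apparatus for implementing the above method for determining an antecedent is also provided. As shown in FIG. 5, the apparatus includes:
an acquiring unit 51, configured to acquire the sentence information to be identified;
an extracting unit 53, configured to extract, when a pronoun is identified in the sentence information, a plurality of candidate antecedents and the word features of the plurality of candidate antecedents from the sentence information; and
a determining unit 55, configured to determine, based on the word features of the plurality of candidate antecedents, the target antecedent referred to by the pronoun from among the plurality of candidate antecedents.
With the above embodiment, when a pronoun exists in the sentence information, candidate antecedents and the word features of each candidate antecedent are extracted from the sentence information, and the word features of the candidate antecedents are used to determine the target antecedent referred to by the pronoun. In this solution, the target antecedent referred to by the pronoun can be identified automatically from the word features of the candidate antecedents extracted from the sentence information, which addresses the low processing efficiency of anaphora resolution in the prior art and allows the pronoun's antecedent to be determined accurately and efficiently.
It should be noted that the pronoun, the candidate antecedents, and the word features of the candidate antecedents in the above embodiment are all extracted from the sentence information. No predefinition and no manual corpus cleaning or annotation is required, which greatly increases processing speed.
The sentence information to be identified in the above embodiment may be sent by the terminal 101 to the server. It may be text information, which may be converted from voice information in the session information, extracted directly from the sentence information, or extracted from an article. This application does not limit the source of the information.
Specifically, the sentence information is the set of session information generated between a client and the server during a human-machine dialogue.
In extracting the plurality of candidate antecedents and their word features from the sentence information, the candidate antecedents and the features of each candidate antecedent may be extracted in sequence, or the features may be extracted at the same time as the candidate antecedents.
It should be noted that the word a pronoun refers to may be a noun or a noun phrase, and the extracted candidate antecedents are all nouns or noun phrases.
Further, in extracting the pronoun and the plurality of candidate antecedents from the sentence information, a preset word segmenter may be used to segment the sentence information; from the resulting words, those whose part of speech is pronoun (i.e., the pronouns) and the nouns or noun phrases (i.e., the candidate antecedents) are extracted.
According to the above embodiment of the present invention, the target antecedent referred to by the pronoun may be determined from the plurality of candidate antecedents based on their word features, where the word features may include semantic features and grammatical features.
After the target antecedent referred to by the pronoun is determined, the pronoun in the sentence information may be replaced with the target antecedent so as to complete the sentence information.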
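According to the above embodiment of the present invention, the determining unit includes, as shown in FIG. 6: a determining module 61, configured to determine a referential weight value of each candidate antecedent based on its word features; and a selecting module 63, configured to select the candidate antecedent with the largest referential weight value as the target antecedent referred to by the pronoun.
Specifically, the word feature in the above embodiment may be a semantic feature or a grammatical feature. The semantic and/or grammatical features are used to determine the referential weight value of each candidate antecedent relative to the pronoun, and the resulting values are sorted to obtain a sequence of referential weight values. If the sequence is in descending order, the candidate antecedent corresponding to the first value in the sequence is used as the target antecedent referred to by the pronoun; if the sequence is in ascending order, the candidate antecedent corresponding to the last value is used.
In an optional implementation, after the referential weight value of each candidate antecedent relative to the pronoun is determined, the largest of the referential weight values may be obtained by pairwise comparison, and the candidate antecedent corresponding to it is selected as the target antecedent referred to by the pronoun.
In an optional embodiment, each of the plurality of candidate antecedents includes one or more word features. When each candidate antecedent includes exactly one word feature, that word feature is converted into a feature value, and the feature value is used as the referential weight value of the candidate antecedent.
Specifically, each of the plurality of candidate antecedents includes one or more word features, and the determining module 61 shown in FIG. 6 includes:
a converting sub-module 611, configured to convert the extracted word features into feature values; and
a calculating sub-module 613, configured to perform a linear weighted calculation on the feature values of each candidate antecedent, using the preset feature coefficients of the one or more word features, to obtain the referential weight value of each candidate antecedent.
Specifically, when each candidate antecedent includes multiple word features, each word feature is converted into a feature value, and a linear weighted calculation is performed on these feature values, using the preset feature coefficients of the one or more word features, to obtain the referential weight value of each candidate antecedent.
According to the above embodiment of the present invention, each of the plurality of candidate antecedents includes one or more word features, and the word features include at least one of: the singular/plural feature of the candidate antecedent, the distance between the candidate antecedent and the pronoun, whether the candidate antecedent appears in a prepositional phrase, and the semantic relevance of the pronoun and the candidate antecedent.
The singular/plural feature is converted into a feature value according to whether the number of the candidate antecedent agrees with that of the pronoun: if it agrees, the feature value is set to a first constant; if it does not, the feature value is set to a second constant. Optionally, the first constant may be 1 and the second constant may be 0.
When the word features include the distance between the candidate antecedent and the pronoun, the distance between the sentences in which the two are located, or the number of characters or sentences between the two words, may be used as the feature value.
When converting the word feature into a feature value, the feature value may be set to one constant, such as 1, when the candidate antecedent appears in a prepositional phrase, and to another constant, such as 0, when it does not.
When the word features include the semantic relevance between the pronoun and the candidate antecedent, the feature value may be the value calculated by the above formula.
According to the above embodiment of the present invention, as shown in FIG. 7, the extracting unit 53 may include: a searching module 71, configured to find the adjacent word of the pronoun in the sentence information; and an extracting module 73, configured to extract, when the part of speech of the adjacent word is not a noun, the plurality of candidate antecedents and their word features from the sentence information.
Specifically, as shown in FIG. 8, the extracting unit may include: an obtaining module 81, configured to obtain a noun phrase in the sentence information whose distance from the pronoun is within a preset distance; and a judging module 83, configured to determine whether the noun phrase and the pronoun refer to each other and, if they do, to use the noun phrase as a candidate antecedent.
Further, the judging module includes a judging sub-module, configured to determine whether the part of speech of the connecting word between the noun phrase and the pronoun is a predicate. If it is not a predicate, the noun phrase and the pronoun are judged to be able to refer to each other; if it is a predicate, they are judged to be unable to refer to each other.
To better complete the dialogue information, a candidate antecedent set is first determined for each pronoun to be resolved in the training corpus; whether the pronoun needs to be resolved is then judged according to consistency constraint rules, and features are extracted. Based on the distance, semantic, and grammatical information of the pronoun and the candidate antecedents, a resolution method for Chinese personal pronouns suited to human-machine dialogue is proposed to determine the final candidate antecedent.
The modules provided in this embodiment are used in the same way as the corresponding steps of the method embodiment, and the application scenarios may also be the same. It should be noted that the solutions involving the above modules are not limited to the content and scenarios of the above embodiments, and that the modules may run on a computer terminal or a mobile terminal and may be implemented in software or hardware.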
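Embodiment 3
According to an embodiment of the present invention, a server for implementing the above method and apparatus for determining an antecedent is also provided.
As shown in FIG. 9, the server includes one or more processors 901 (only one is shown in the figure), a memory 903, and a transmission device 905 (such as the sending device in the above embodiment). As shown in FIG. 9, the terminal may further include an input/output device 907.
The memory 903 can be used to store software programs and modules, such as the program instructions/modules corresponding to the method and apparatus for determining an antecedent in the embodiments of the present invention. The processor 901 runs the software programs and modules stored in the memory 903 to perform various functional applications and data processing, that is, to implement the above method for determining an antecedent. The memory 903 may include high-speed random access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 903 may further include memory located remotely from the processor 901, and such remote memory may be connected to the terminal through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 905 is used to receive or send data via a network and can also be used for data transmission between the processor and the memory. Specific examples of the network include wired networks and wireless networks. In one example, the transmission device 905 includes a Network Interface Controller (NIC), which can be connected to other network devices and to a router through a network cable so as to communicate with the Internet or a local area network. In one example, the transmission device 905 is a Radio Frequency (RF) module, which is used to communicate with the Internet wirelessly.
Specifically, the memory 903 is used to store an application program.
The processor is configured to perform the following steps: acquire the sentence information to be identified; when a pronoun is identified in the sentence information, extract a plurality of candidate antecedents and the word features of the plurality of candidate antecedents from the sentence information; and determine, based on those word features, the target antecedent referred to by the pronoun from among the plurality of candidate antecedents.
The processor is further configured to perform the following steps: determine a referential weight value of each candidate antecedent based on its word features; and select the candidate antecedent with the largest referential weight value as the target antecedent referred to by the pronoun.
The processor is further configured such that, where each of the plurality of candidate antecedents includes one or more word features, determining the referential weight value of each candidate antecedent based on its word features includes: converting the extracted word features into feature values; and performing a linear weighted calculation on the feature values of each candidate antecedent, using the preset feature coefficients of the one or more word features, to obtain the referential weight value of each candidate antecedent.
Each of the plurality of candidate antecedents includes one or more word features, and the word features include at least one of: the singular/plural feature of the candidate antecedent, the distance between the candidate antecedent and the pronoun, whether the candidate antecedent appears in a prepositional phrase, and the semantic relevance of the pronoun and the candidate antecedent.
The processor is further configured such that extracting the plurality of candidate antecedents and their word features from the sentence information includes: finding the adjacent word of the pronoun in the sentence information; and, when the part of speech of the adjacent word is not a noun, extracting the plurality of candidate antecedents and their word features from the sentence information.
The processor is further configured such that extracting the plurality of candidate antecedents from the sentence information includes: obtaining a noun phrase in the sentence information whose distance from the pronoun is within a preset distance; determining whether the noun phrase and the pronoun refer to each other; and, if they do, using the noun phrase as a candidate antecedent.
The processor is further configured such that determining whether the noun phrase and the pronoun refer to each other includes: determining whether the part of speech of the connecting word between the noun phrase and the pronoun is a predicate; if it is not a predicate, judging that the noun phrase and the pronoun can refer to each other; and if it is a predicate, judging that they cannot refer to each other.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in Embodiment 1 and Embodiment 2, which are not repeated here.
A person of ordinary skill in the art will understand that the structure shown in FIG. 9 is only illustrative. The terminal may be a terminal device such as a smartphone (for example, an Android phone or an iOS phone), a tablet computer, a palmtop computer, a Mobile Internet Device (MID), or a PAD. FIG. 9 does not limit the structure of the above electronic device. For example, the terminal may include more or fewer components than shown in FIG. 9 (such as a network interface or a processing device), or have a configuration different from that shown in FIG. 9.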
实施例4
本发明的实施例还提供了一种存储介质。可选地,在本实施例中,上述存储介质可以用于存储执行上述方法的程序代码。
可选地,在本实施例中,存储介质被设置为存储用于执行以下步骤的程序代码:
获取待识别的语句信息;在识别出语句信息中存在代词的情况下,从语句信息中提取多个候选先行词和多个候选先行词的词语特征;基于多个候选先行词的词语特征,从多个候选先行词中确定代词所指代的目标先行词。
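The storage medium is configured to store program code for performing the following steps: determining a reference weight value for each candidate antecedent based on the word features of that candidate antecedent; and selecting the candidate antecedent with the largest reference weight value as the target antecedent to which the pronoun refers.
The storage medium is configured to store program code for performing the following steps, where each of the multiple candidate antecedents has one or more word features. Determining the reference weight value of each candidate antecedent based on its word features includes: converting the extracted word features into feature values; and performing a linear weighted calculation on the feature values of each candidate antecedent, using preset feature coefficients of the one or more word features, to obtain the reference weight value of each candidate antecedent.
The storage medium is configured to store program code for performing the following steps, where each of the multiple candidate antecedents has one or more word features, and the word features include at least one of the following: the singular/plural feature of the candidate antecedent, the distance between the candidate antecedent and the pronoun, whether the candidate antecedent appears in a prepositional phrase, and the semantic relatedness between the pronoun and the candidate antecedent.
The storage medium is configured to store program code for performing the following steps, where extracting the multiple candidate antecedents and their word features from the sentence information includes: finding the word adjacent to the pronoun in the sentence information; and, when the part of speech of the adjacent word is not a noun, extracting the multiple candidate antecedents and the word features of the multiple candidate antecedents from the sentence information.
The storage medium is configured to store program code for performing the following steps, where extracting the multiple candidate antecedents from the sentence information includes: obtaining the noun phrases in the sentence information whose distance from the pronoun is within a preset distance; determining whether each noun phrase and the pronoun can refer to each other; and, if the noun phrase and the pronoun can refer to each other, using the noun phrase as a candidate antecedent.
The storage medium is configured to store program code for performing the following steps, where determining whether the noun phrase and the pronoun can refer to each other includes: determining whether the part of speech of the connecting word between the noun phrase and the pronoun is a predicate; if the part of speech of the connecting word is not a predicate, determining that the noun phrase and the pronoun can refer to each other; and if the part of speech of the connecting word is a predicate, determining that the noun phrase and the pronoun cannot refer to each other.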
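Optionally, in this embodiment, the storage medium may include, but is not limited to, any medium that can store program code, such as a USB flash drive, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, or an optical disc.
Optionally, for specific examples in this embodiment, refer to the examples described in the foregoing embodiments; details are not repeated here.
The sequence numbers of the foregoing embodiments of the present invention are for description only and do not indicate that one embodiment is better than another.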
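If the integrated units in the foregoing embodiments are implemented in the form of software functional units and sold or used as independent products, they may be stored in the foregoing computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause one or more computer devices (which may be personal computers, servers, network devices, or the like) to perform all or some of the steps of the methods described in the embodiments of the present invention.
In the foregoing embodiments of the present invention, the description of each embodiment has its own emphasis. For a part that is not described in detail in one embodiment, refer to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed client may be implemented in other ways. The apparatus embodiments described above are merely illustrative. For example, the division into units is merely a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, units, or modules, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present invention may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
The foregoing describes only preferred implementations of the present invention. It should be noted that a person of ordinary skill in the art may make several improvements and refinements without departing from the principle of the present invention, and these improvements and refinements shall also fall within the protection scope of the present invention.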

Claims (14)

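  1. An antecedent determining method, comprising:
    obtaining sentence information to be recognized;
    when a pronoun is recognized in the sentence information, extracting multiple candidate antecedents and word features of the multiple candidate antecedents from the sentence information; and
    determining, based on the word features of the multiple candidate antecedents, a target antecedent to which the pronoun refers from among the multiple candidate antecedents.
  2. The method according to claim 1, wherein determining, based on the word features of the multiple candidate antecedents, the target antecedent to which the pronoun refers from among the multiple candidate antecedents comprises:
    determining a reference weight value of each candidate antecedent based on the word features of that candidate antecedent; and
    selecting the candidate antecedent with the largest reference weight value as the target antecedent to which the pronoun refers.
  3. The method according to claim 2, wherein each of the multiple candidate antecedents has one or more of the word features, and determining the reference weight value of each candidate antecedent based on the word features of that candidate antecedent comprises:
    converting the extracted word features into feature values; and
    performing a linear weighted calculation on the feature values of each candidate antecedent, using preset feature coefficients of the one or more word features, to obtain the reference weight value of each candidate antecedent.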
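  4. The method according to claim 2, wherein each of the multiple candidate antecedents has one or more of the word features, and the word features comprise at least one of the following:
    the singular/plural feature of the candidate antecedent, the distance between the candidate antecedent and the pronoun, whether the candidate antecedent appears in a prepositional phrase, and the semantic relatedness between the pronoun and the candidate antecedent.
  5. The method according to claim 1, wherein extracting the multiple candidate antecedents and the word features of the multiple candidate antecedents from the sentence information comprises:
    finding the word adjacent to the pronoun in the sentence information; and
    when the part of speech of the adjacent word is not a noun, extracting the multiple candidate antecedents and the word features of the multiple candidate antecedents from the sentence information.
  6. The method according to claim 1 or 5, wherein extracting the multiple candidate antecedents from the sentence information comprises:
    obtaining a noun phrase in the sentence information whose distance from the pronoun is within a preset distance;
    determining whether the noun phrase and the pronoun can refer to each other; and
    if the noun phrase and the pronoun can refer to each other, using the noun phrase as the candidate antecedent.
  7. The method according to claim 6, wherein determining whether the noun phrase and the pronoun can refer to each other comprises:
    determining whether the part of speech of the connecting word between the noun phrase and the pronoun is a predicate;
    if the part of speech of the connecting word between the noun phrase and the pronoun is not a predicate, determining that the noun phrase and the pronoun can refer to each other; and
    if the part of speech of the connecting word between the noun phrase and the pronoun is a predicate, determining that the noun phrase and the pronoun cannot refer to each other.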
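  8. An antecedent determining apparatus, comprising:
    an obtaining unit, configured to obtain sentence information to be recognized;
    an extraction unit, configured to, when a pronoun is recognized in the sentence information, extract multiple candidate antecedents and word features of the multiple candidate antecedents from the sentence information; and
    a determining unit, configured to determine, based on the word features of the multiple candidate antecedents, a target antecedent to which the pronoun refers from among the multiple candidate antecedents.
  9. The apparatus according to claim 8, wherein the determining unit comprises:
    a determining module, configured to determine a reference weight value of each candidate antecedent based on the word features of that candidate antecedent; and
    a selection module, configured to select the candidate antecedent with the largest reference weight value as the target antecedent to which the pronoun refers.
  10. The apparatus according to claim 9, wherein each of the multiple candidate antecedents has one or more of the word features, and the determining module comprises:
    a conversion submodule, configured to convert the extracted word features into feature values; and
    a calculation submodule, configured to perform a linear weighted calculation on the feature values of each candidate antecedent, using preset feature coefficients of the one or more word features, to obtain the reference weight value of each candidate antecedent.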
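  11. The apparatus according to claim 9, wherein each of the multiple candidate antecedents has one or more of the word features, and the word features comprise at least one of the following:
    the singular/plural feature of the candidate antecedent, the distance between the candidate antecedent and the pronoun, whether the candidate antecedent appears in a prepositional phrase, and the semantic relatedness between the pronoun and the candidate antecedent.
  12. The apparatus according to claim 8, wherein the extraction unit comprises:
    a search module, configured to find the word adjacent to the pronoun in the sentence information; and
    an extraction module, configured to, when the part of speech of the adjacent word is not a noun, extract the multiple candidate antecedents and the word features of the multiple candidate antecedents from the sentence information.
  13. The apparatus according to claim 8 or 12, wherein the extraction unit comprises:
    an obtaining module, configured to obtain a noun phrase in the sentence information whose distance from the pronoun is within a preset distance; and
    a judging module, configured to determine whether the noun phrase and the pronoun can refer to each other,
    and, if the noun phrase and the pronoun can refer to each other, to use the noun phrase as the candidate antecedent.
  14. The apparatus according to claim 13, wherein the judging module comprises:
    a judging submodule, configured to determine whether the part of speech of the connecting word between the noun phrase and the pronoun is a predicate; if the part of speech of the connecting word between the noun phrase and the pronoun is not a predicate, determine that the noun phrase and the pronoun can refer to each other; and if the part of speech of the connecting word between the noun phrase and the pronoun is a predicate, determine that the noun phrase and the pronoun cannot refer to each other.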
PCT/CN2017/074800 2016-05-20 2017-02-24 Antecedent determining method and apparatus WO2017197947A1 (zh)

Priority Applications (4)

Application Number Priority Date Filing Date Title
EP17798514.0A EP3460678A4 (en) 2016-05-20 2017-02-24 METHOD AND APPARATUS FOR DETERMINING ANCEDENTS
JP2018529148A JP6752282B2 (ja) 2016-05-20 2017-02-24 Antecedent determining method and apparatus
KR1020187015847A KR102163549B1 (ko) 2016-05-20 2017-02-24 Antecedent determining method and apparatus
US16/009,474 US10810372B2 (en) 2016-05-20 2018-06-15 Antecedent determining method and apparatus

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201610341637.6A CN107402913B (zh) 2016-05-20 2016-05-20 Antecedent determining method and apparatus
CN201610341637.6 2016-05-20

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US16/009,474 Continuation US10810372B2 (en) 2016-05-20 2018-06-15 Antecedent determining method and apparatus

Publications (1)

Publication Number Publication Date
WO2017197947A1 true WO2017197947A1 (zh) 2017-11-23

Family

ID=60325646

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2017/074800 WO2017197947A1 (zh) 2016-05-20 2017-02-24 Antecedent determining method and apparatus

Country Status (6)

Country Link
US (1) US10810372B2 (zh)
EP (1) EP3460678A4 (zh)
JP (1) JP6752282B2 (zh)
KR (1) KR102163549B1 (zh)
CN (1) CN107402913B (zh)
WO (1) WO2017197947A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
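CN109919161A (zh) * 2019-04-01 2019-06-21 成都大学 Communication method and apparatus based on image recognition
CN112733534A (zh) * 2020-12-25 2021-04-30 北京左医科技有限公司 Method and system for obtaining the symptoms referred to by truncated words in doctor-patient dialogue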

Families Citing this family (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
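CN108920500B (zh) * 2018-05-24 2022-02-11 众安信息技术服务有限公司 Time parsing method
CN108681538B (zh) * 2018-05-28 2022-02-22 哈尔滨工业大学 Deep-learning-based verb phrase ellipsis resolution method
CN109446517B (zh) * 2018-10-08 2022-07-05 平安科技(深圳)有限公司 Coreference resolution method, electronic apparatus, and computer-readable storage medium
CN109325234B (zh) * 2018-10-10 2023-06-20 深圳前海微众银行股份有限公司 Sentence processing method, device, and computer-readable storage medium
CN109471919B (zh) * 2018-11-15 2021-08-10 北京搜狗科技发展有限公司 Zero pronoun resolution method and apparatus
CN110162600B (zh) * 2019-05-20 2024-01-30 腾讯科技(深圳)有限公司 Information processing method, and session response method and apparatus
CN111984766B (zh) * 2019-05-21 2023-02-24 华为技术有限公司 Missing semantics completion method and apparatus
CN110705206B (zh) * 2019-09-23 2021-08-20 腾讯科技(深圳)有限公司 Text information processing method and related apparatus
CN110674630B (zh) * 2019-09-24 2023-03-21 北京明略软件系统有限公司 Coreference resolution method and apparatus, electronic device, and storage medium
CN111325034A (zh) * 2020-02-12 2020-06-23 平安科技(深圳)有限公司 Method, apparatus, device, and storage medium for semantic completion in multi-turn dialogue
CN113297843B (zh) * 2020-02-24 2023-01-13 华为技术有限公司 Coreference resolution method and apparatus, and electronic device
CN111522909B (zh) * 2020-04-10 2024-04-02 海信视像科技股份有限公司 Voice interaction method and server
CN111651578B (zh) * 2020-06-02 2023-10-03 北京百度网讯科技有限公司 Human-machine dialogue method, apparatus, and device
CN112148847B (zh) * 2020-08-27 2024-03-12 出门问问创新科技有限公司 Voice information processing method and apparatus
CN112989008A (zh) * 2021-04-21 2021-06-18 上海汽车集团股份有限公司 Multi-turn dialogue rewriting method and apparatus, and electronic device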
US11848017B2 (en) * 2021-06-10 2023-12-19 Sap Se Pronoun-based natural language processing
US20240073161A1 (en) * 2022-08-26 2024-02-29 SoundHound AI IP, LLC. Message processing method, information processing apparatus, and program

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
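CN101446943A (zh) * 2008-12-10 2009-06-03 苏州大学 Coreference resolution method based on semantic role information in Chinese processing
CN102110087A (zh) * 2009-12-24 2011-06-29 北京大学 Method and apparatus for entity resolution in character data
CN104462053A (zh) * 2013-09-22 2015-03-25 江苏金鸽网络科技有限公司 Semantic-feature-based personal pronoun coreference resolution method within a text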

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7383169B1 (en) * 1994-04-13 2008-06-03 Microsoft Corporation Method and system for compiling a lexical knowledge base
US5799268A (en) * 1994-09-28 1998-08-25 Apple Computer, Inc. Method for extracting knowledge from online documentation and creating a glossary, index, help database or the like
US7813916B2 (en) * 2003-11-18 2010-10-12 University Of Utah Acquisition and application of contextual role knowledge for coreference resolution
US7376551B2 (en) * 2005-08-01 2008-05-20 Microsoft Corporation Definition extraction
US8594996B2 (en) * 2007-10-17 2013-11-26 Evri Inc. NLP-based entity recognition and disambiguation
CN103150405B (zh) * 2013-03-29 2014-12-10 苏州大学 Classification model building method, and Chinese cross-document coreference resolution method and system
US9497153B2 (en) * 2014-01-30 2016-11-15 Google Inc. Associating a segment of an electronic message with one or more segment addressees
US9652453B2 (en) * 2014-04-14 2017-05-16 Xerox Corporation Estimation of parameters for machine translation without in-domain parallel data
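CN104281645B (zh) * 2014-08-27 2017-06-16 北京理工大学 Method for identifying key sentiment sentences based on lexical semantics and syntactic dependency
CN105988990B (zh) * 2015-02-26 2021-06-01 索尼公司 Chinese zero anaphora resolution apparatus and method, model training method, and storage medium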

Also Published As

Publication number Publication date
US20180307671A1 (en) 2018-10-25
CN107402913B (zh) 2020-10-09
EP3460678A1 (en) 2019-03-27
KR102163549B1 (ko) 2020-10-08
JP6752282B2 (ja) 2020-09-09
CN107402913A (zh) 2017-11-28
US10810372B2 (en) 2020-10-20
JP2019504395A (ja) 2019-02-14
KR20180078318A (ko) 2018-07-09
EP3460678A4 (en) 2019-06-05

Legal Events

Date Code Title Description
ENP Entry into the national phase

Ref document number: 20187015847

Country of ref document: KR

Kind code of ref document: A

WWE Wipo information: entry into national phase

Ref document number: 2018529148

Country of ref document: JP

NENP Non-entry into the national phase

Ref country code: DE

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 17798514

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2017798514

Country of ref document: EP

Effective date: 20181220