CN110413996A - Method and device for constructing a zero reference resolution corpus - Google Patents

Method and device for constructing a zero reference resolution corpus

Info

Publication number
CN110413996A
CN110413996A (application number CN201910635597.XA)
Authority
CN
China
Prior art keywords
sentence
word
target
processed
zero reference
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910635597.XA
Other languages
Chinese (zh)
Other versions
CN110413996B (en)
Inventor
梁忠平
温祖杰
蒋亮
张家兴
李小龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Advanced New Technologies Co Ltd
Advantageous New Technologies Co Ltd
Original Assignee
Alibaba Group Holding Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Group Holding Ltd filed Critical Alibaba Group Holding Ltd
Priority to CN201910635597.XA priority Critical patent/CN110413996B/en
Publication of CN110413996A publication Critical patent/CN110413996A/en
Application granted granted Critical
Publication of CN110413996B publication Critical patent/CN110413996B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

Embodiments of this specification provide a method and device for constructing a zero reference resolution corpus. The method includes: first obtaining the word sequence corresponding to a sentence to be processed, and tagging the part of speech of each word in the word sequence; then determining, among the words in the sequence, the number of occurrences in the word sequence of each word whose part of speech is noun. When the word sequence contains one or more candidate words whose part of speech is noun and which occur at least twice, one candidate word can be selected as the target word; at least one target position is selected from the multiple positions at which the target word occurs in the sentence to be processed, and the target word is deleted at each target position, yielding a calibration sentence containing a zero reference item. The calibration sentence, the target word, and the target positions can then be combined to obtain a zero reference resolution corpus entry for performing zero reference resolution on sentences to be analyzed.

Description

Method and device for constructing a zero reference resolution corpus
Technical field
One or more embodiments of this specification relate to the computer field, and in particular to a method and device for constructing a zero reference resolution corpus.
Background technique
A zero reference item is a referring word that has been omitted from a sentence. The omitted word should fill a corresponding grammatical role in the sentence, and users can usually infer it from the sentence itself. For example, for the sentence "Wang Laoshi, in order to teach Xiao Ming to study, brought Xiao Ming back to the office", the grammatically complete sentence should be "Wang Laoshi, in order to teach Xiao Ming to study, [he] brought Xiao Ming back to the office". The omitted referring word [he] is a zero reference item, and the object it refers to is "Wang Laoshi".
Zero reference resolution is a widely used natural language processing task whose main purpose is to find the zero reference items contained in a sentence and determine the objects they refer to. To perform zero reference resolution on sentences, a large-scale zero reference resolution corpus usually needs to be constructed in advance.
Currently, zero reference resolution corpora are mainly constructed by manual annotation, which cannot quickly produce a large-scale corpus. In view of this, an improved scheme that facilitates quickly obtaining a large-scale zero reference resolution corpus is desirable.
Summary of the invention
One or more embodiments of this specification provide a method and device for constructing a zero reference resolution corpus, which facilitate quickly obtaining a large-scale zero reference resolution corpus.
In a first aspect, a method for constructing a zero reference resolution corpus is provided. The method includes:
obtaining the word sequence corresponding to a sentence to be processed, and tagging the part of speech of each word in the word sequence;
determining, among the words in the word sequence, the number of occurrences in the word sequence of each word whose part of speech is noun;
detecting whether the words in the word sequence include at least one candidate word, where a candidate word has noun as its part of speech and occurs at least twice;
when at least one candidate word exists, selecting one candidate word as the target word, selecting at least one target position from the multiple positions at which the target word occurs in the sentence to be processed, and deleting the target word at each target position to obtain a calibration sentence;
combining the calibration sentence, the target word, and each target position to obtain a zero reference resolution corpus entry, which is used to perform zero reference resolution on a sentence to be analyzed.
In a kind of possible embodiment,
The zero reference resolution corpus serves as positive samples for training a language model, where the language model predicts the position of a zero reference item contained in an input sentence and predicts the object that the zero reference item refers to.
In a kind of possible embodiment,
The method further includes:
when no candidate word exists, detecting whether any calibration sentence already obtained is identical to the sentence to be processed;
when no identical calibration sentence exists, determining the sentence to be processed as a negative sample for training the language model.
In a kind of possible embodiment,
Before obtaining the word sequence corresponding to the sentence to be processed, the method further includes:
collecting text data from web pages;
performing data cleaning and preprocessing on the text data to obtain text to be processed;
splitting the text to be processed into sentences to obtain at least one sentence to be processed.
In a kind of possible embodiment,
The multiple positions at which the target word occurs in the sentence to be processed are indicated by the multiple serial numbers corresponding to the target word in the word sequence.
In a kind of possible embodiment,
Selecting at least one target position from the multiple positions at which the target word occurs in the sentence to be processed includes: randomly selecting at least one target position from those positions.
In a kind of possible embodiment,
Selecting at least one target position from the multiple positions at which the target word occurs in the sentence to be processed includes:
determining, according to a data set containing multiple sample sentences, the conditional probability that each occurrence position of the target word in the sentence to be processed is referred to by a zero reference item;
selecting at least one target position from the multiple occurrence positions according to each position's conditional probability, where the conditional probability of each selected target position is not lower than that of any unselected occurrence position.
In a kind of possible embodiment,
Determining, according to the data set containing multiple sample sentences, the conditional probability that each occurrence position of the target word in the sentence to be processed is referred to by a zero reference item includes:
determining at least one target sentence from the data set, where each target sentence contains at least one occurrence of the target word;
for each target sentence, obtaining the first position, i.e., the position in the target sentence of the zero reference item that refers to the target word, and the second position, i.e., the position of the target word itself in the target sentence;
determining, over the target sentences, a first frequency with which the first position precedes its corresponding second position, and a second frequency with which the first position follows its corresponding second position;
calculating, from the first frequency and the second frequency, the conditional probability that each occurrence position of the target word in the sentence to be processed is referred to by a zero reference item.
In a second aspect, a device for constructing a zero reference resolution corpus is provided. The device includes:
a word segmentation module, configured to obtain the word sequence corresponding to a sentence to be processed and tag the part of speech of each word in the word sequence;
a word frequency statistics module, configured to determine, among the words in the word sequence, the number of occurrences in the word sequence of each word whose part of speech is noun;
a first detection module, configured to detect whether the words in the word sequence include at least one candidate word, where a candidate word has noun as its part of speech and occurs at least twice;
a sentence processing module, configured to, when at least one candidate word exists, select one candidate word as the target word, select at least one target position from the multiple positions at which the target word occurs in the sentence to be processed, and delete the target word at each target position to obtain a calibration sentence;
a corpus construction module, configured to combine the calibration sentence, the target word, and each target position to obtain a zero reference resolution corpus entry, which is used to perform zero reference resolution on a sentence to be analyzed.
In a kind of possible embodiment,
The zero reference resolution corpus serves as positive samples for training a language model, where the language model predicts the position of a zero reference item contained in an input sentence and predicts the object that the zero reference item refers to.
In a kind of possible embodiment,
The device further includes:
a second detection module, configured to, when no candidate word exists, detect whether any calibration sentence already obtained is identical to the sentence to be processed;
a negative sample determination module, configured to, when no identical calibration sentence exists, determine the sentence to be processed as a negative sample for training the language model.
In a kind of possible embodiment,
The device further includes:
a data collection module, configured to collect text data from web pages;
a preprocessing module, configured to perform data cleaning and preprocessing on the text data to obtain text to be processed;
a sentence splitting module, configured to split the text to be processed into sentences to obtain at least one sentence to be processed.
In a kind of possible embodiment,
The multiple positions at which the target word occurs in the sentence to be processed are indicated by the multiple serial numbers corresponding to the target word in the word sequence.
In a kind of possible embodiment,
The sentence processing module is specifically configured to randomly select at least one target position from the multiple positions at which the target word occurs in the sentence to be processed.
In a kind of possible embodiment,
The sentence processing module includes:
a conditional probability determination unit, configured to determine, according to a data set containing multiple sample sentences, the conditional probability that each occurrence position of the target word in the sentence to be processed is referred to by a zero reference item;
a sentence processing unit, configured to select at least one target position from the multiple occurrence positions according to each position's conditional probability, where the conditional probability of each selected target position is not lower than that of any unselected occurrence position.
In a kind of possible embodiment,
The conditional probability determination unit is specifically configured to: determine at least one target sentence from the data set containing multiple sample sentences, where each target sentence contains at least one occurrence of the target word; for each target sentence, obtain the first position in the target sentence of the zero reference item that refers to the target word, and the second position of the target word itself in the target sentence; determine, over the target sentences, a first frequency with which the first position precedes its corresponding second position and a second frequency with which it follows; and calculate, from the first frequency and the second frequency, the conditional probability that each occurrence position of the target word in the sentence to be processed is referred to by a zero reference item.
In a third aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to perform the method of any one of the first aspect.
In a fourth aspect, a computing device is provided, including a memory and a processor; executable code is stored in the memory, and when the processor executes the executable code, the method of any one of the first aspect is implemented.
With the method and device provided by the embodiments of this specification, the word sequence corresponding to a sentence to be processed is first obtained, and the part of speech of each word in the word sequence is tagged. The number of occurrences in the word sequence of each word whose part of speech is noun is then determined. When the word sequence contains one or more candidate words whose part of speech is noun and which occur at least twice, one candidate word can be selected as the target word; at least one target position is selected from the target word's multiple occurrence positions in the sentence to be processed, and the target word is deleted at each target position, yielding a calibration sentence containing a zero reference item. The calibration sentence, the target word, and the target positions can then be combined into a zero reference resolution corpus entry for performing zero reference resolution on sentences to be analyzed. As can be seen, constructing the corpus requires no time-consuming semantic analysis or manual annotation of sentences, which facilitates quickly obtaining a large-scale zero reference resolution corpus.
Detailed description of the invention
To describe the technical solutions of the embodiments of the present invention more clearly, the accompanying drawings used in the description of the embodiments are briefly introduced below. The drawings described below are obviously only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a schematic diagram of an application scenario to which one or more embodiments of this specification are applicable;
Fig. 2 is a flowchart of a method for constructing a zero reference resolution corpus provided by an embodiment of this specification;
Fig. 3 is a schematic diagram of processing an example sentence to construct a zero reference resolution corpus;
Fig. 4 is a flowchart of another method for constructing a zero reference resolution corpus provided by an embodiment of this specification;
Fig. 5 is a schematic structural diagram of a device for constructing a zero reference resolution corpus provided by an embodiment of this specification.
Specific embodiment
The non-limiting embodiments provided by this specification are described in detail below with reference to the accompanying drawings.
Fig. 1 is a schematic diagram of an application scenario to which one or more embodiments of this specification are applicable.
For business systems that respond to sentences provided by users in order to deliver specified services such as intelligent question answering and machine translation, the system must accurately understand the semantics the sentence is intended to express in order to deliver those services well. In practice, because of users' speech habits or other reasons, the sentences provided to the business system often contain zero reference items. In that case, zero reference resolution must be performed on such sentences, according to a pre-trained language model or a pre-built grammar rule base, so that the business system can accurately understand the intended semantics. In general, the language model is trained on a large-scale zero reference resolution corpus, and the grammar rule base is built from the results of a statistical analysis of such a corpus. As can be seen, performing zero reference resolution on sentences first requires obtaining a large-scale zero reference resolution corpus.
Conventionally, a large amount of text is read manually and its sentences are semantically analyzed to find the sentences that contain zero reference items and the objects those items refer to. For each sentence containing a zero reference item, the position of the item in the sentence and the object it refers to are then annotated manually. This yields zero reference resolution corpus entries, each consisting of a sentence containing a zero reference item, the position of the item in the sentence, and the object the item refers to.
However, when a corpus is obtained in this way, sentences containing zero reference items make up a relatively small proportion of the text, while sentences without them make up a relatively large proportion, so considerable time is wasted semantically analyzing sentences that contain no zero reference item. Moreover, manually annotating the sentences that do contain zero reference items also takes a long time. In this approach, each corpus entry therefore takes a long time to construct, and a large-scale zero reference resolution corpus cannot be obtained quickly.
In view of the problems above, the embodiments of this specification consider the following situation: for any sentence, if a target word whose part of speech is noun occurs multiple times in the sentence, then deleting the target word at one or more of its occurrence positions produces a calibration sentence that is likely a sentence containing a zero reference item; each deleted target position is a possible occurrence position of a zero reference item in the calibration sentence, and the object that the zero reference item refers to is the target word. In this case, simply combining the calibration sentence, the target word, and each target position quickly yields a zero reference resolution corpus entry, with no need to spend much time semantically analyzing or annotating the sentence. This facilitates quickly obtaining a large-scale zero reference resolution corpus.
In view of the foregoing, the basic idea of the embodiments of this specification is to provide a method and device for constructing a zero reference resolution corpus. Embodiments of this basic idea are described in detail below with reference to the accompanying drawings.
Fig. 2 shows a flow diagram of a method for constructing a zero reference resolution corpus.
It can be understood that the execution subject of the method shown in Fig. 2 may be the computing device in the application scenario shown in Fig. 1, including but not limited to a server or a general-purpose computer. As shown in Fig. 2, the method includes at least steps 21 to 29: step 21, obtain the word sequence corresponding to a sentence to be processed, and tag the part of speech of each word in the word sequence; step 23, determine, among the words in the word sequence, the number of occurrences in the word sequence of each word whose part of speech is noun; step 25, detect whether the words in the word sequence include at least one candidate word, where a candidate word has noun as its part of speech and occurs at least twice; step 27, when at least one candidate word exists, select one candidate word as the target word, select at least one target position from the multiple positions at which the target word occurs in the sentence to be processed, and delete the target word at each target position to obtain a calibration sentence; step 29, combine the calibration sentence, the target word, and each target position to obtain a zero reference resolution corpus entry, which is used to perform zero reference resolution on a sentence to be analyzed.
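For illustration only, steps 21 to 29 can be sketched in Python. Everything below — the `(word, pos)` pair representation of the tagged word sequence, the tag names, and the function name — is an assumed stand-in for illustration, not the patented implementation:

```python
import random
from collections import Counter

def build_corpus_entry(tagged, rng=None):
    """Steps 21-29 applied to one sentence. `tagged` is a list of (word, pos)
    pairs, i.e. the word sequence with part-of-speech tags (step 21 output).
    Returns (calibration_words, target_word, target_positions) or None."""
    rng = rng or random.Random(0)
    # Step 23: count occurrences of each noun in the word sequence.
    counts = Counter(w for w, pos in tagged if pos == "n")
    # Step 25: candidate words are nouns occurring at least twice.
    candidates = [w for w, c in counts.items() if c >= 2]
    if not candidates:
        return None
    # Step 27: pick a target word and delete it at some (but not all) positions.
    target = rng.choice(candidates)
    occurrences = [i for i, (w, _) in enumerate(tagged, start=1) if w == target]
    k = rng.randint(1, len(occurrences) - 1)   # keep at least one occurrence
    targets = sorted(rng.sample(occurrences, k))
    calibration = [w for i, (w, _) in enumerate(tagged, start=1)
                   if i not in targets]
    # Step 29: combine into one corpus entry.
    return calibration, target, targets

# Tagged word sequence for "Xiao Ming ate an apple; the apple was very sweet".
tagged = [("Xiao Ming", "nr"), ("ate", "v"), ("le", "ul"), ("one", "mq"),
          ("apple", "n"), ("apple", "n"), ("very", "dc"), ("sweet", "a")]
entry = build_corpus_entry(tagged)
```

For the example sentence this yields the calibration sentence with one "apple" deleted, the target word "apple", and the deleted position — the triple that forms one corpus entry.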
First, in step 21, the word sequence corresponding to the sentence to be processed is obtained, and the part of speech of each word in the word sequence is tagged.
Specifically, word segmentation can be performed on the sentence to be processed by calling the Language Technology Platform (LTP), the Natural Language Processing & Information Retrieval sharing platform (NLPIR), or another word segmentation tool, to obtain the corresponding word sequence and tag the part of speech of each word it contains.
Turning next to Fig. 3, which shows the process of constructing a zero reference resolution corpus entry from an example sentence. As shown in Fig. 3, the example sentence "Xiao Ming ate an apple; the apple was very sweet" serves as the sentence to be processed. Word segmentation first yields the word sequence [Xiao Ming, ate, le, one, apple, apple, very, sweet]; part-of-speech tagging of each word in the sequence then yields the result [Xiao Ming/nr, ate/v, le/ul, one/mq, apple/n, apple/n, very/dc, sweet/a].
Then, in step 23, the number of occurrences in the word sequence of each word whose part of speech is noun is determined.
Referring to Fig. 3, it is easy to count from the tagging result that in the word sequence of the example sentence, the noun "apple" occurs twice. It can be understood that in a practical business scenario, the word sequence of a sentence to be processed may contain multiple distinct words whose part of speech is noun.
Then, in step 25, it is detected whether the words in the word sequence include at least one candidate word, where a candidate word has noun as its part of speech and occurs at least twice.
Here, any word whose part of speech is noun and which occurs multiple times in the sentence to be processed (i.e., in its word sequence) can serve as a candidate word.
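The candidate-word test in step 25 amounts to a frequency count over the tagged nouns. A minimal sketch, in which the tag set and the threshold parameter are assumptions rather than anything the text prescribes:

```python
from collections import Counter

def find_candidate_words(tagged, noun_tags=("n", "nr"), min_count=2):
    """Return the words whose POS is a noun tag and whose occurrence
    count in the word sequence is at least `min_count`."""
    counts = Counter(w for w, pos in tagged if pos in noun_tags)
    return [w for w, c in counts.items() if c >= min_count]

tagged = [("Xiao Ming", "nr"), ("ate", "v"), ("le", "ul"), ("one", "mq"),
          ("apple", "n"), ("apple", "n"), ("very", "dc"), ("sweet", "a")]
print(find_candidate_words(tagged))  # ['apple']
```

"Xiao Ming" is a noun-tagged word but occurs only once, so only "apple" qualifies.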
Further, in step 27, when at least one candidate word exists, one candidate word is selected as the target word; at least one target position is selected from the multiple positions at which the target word occurs in the sentence to be processed, and the target word at each target position is deleted to obtain a calibration sentence.
Specifically, one candidate word can be selected at random as the target word.
It should be noted that in practical application scenarios, the occurrence positions of the target word in the sentence to be processed can be represented in several ways. Specifically, in one possible implementation, the multiple occurrence positions of the target word in the sentence to be processed are represented by the target word's corresponding serial numbers in the word sequence; that is, for each occurrence of the target word in the sentence, the serial number of that occurrence in the word sequence serves as its occurrence position. Referring again to Fig. 3, for the target word "apple" in the example sentence, the first occurrence of "apple" corresponds to serial number 5 in the word sequence, so serial number 5 represents the position of the first occurrence; similarly, the second occurrence corresponds to serial number 6, so serial number 6 represents the position of the second occurrence.
In another possible implementation, for each occurrence of the target word in the sentence to be processed, the character ordinal in the sentence of the first character of that occurrence serves as its occurrence position. Referring again to Fig. 3, for the target word "apple", the first character of the first occurrence of "apple" has character ordinal 7 in the sentence, so character ordinal 7 represents the position of the first occurrence.
Obviously, the occurrence positions of the target word in the sentence to be processed can also be represented in other ways.
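The two position encodings described above (word-sequence serial numbers vs. character ordinals) might be computed as follows; the function names are illustrative, and 1-based indexing is used to match the serial numbers 5 and 6 from the Fig. 3 example. The character-ordinal variant is shown on the original Chinese sentence, where the first "苹果" starts at character 7 as the text states:

```python
def word_serial_positions(tagged, target):
    """1-based serial numbers of `target` in the (word, pos) sequence."""
    return [i for i, (w, _) in enumerate(tagged, start=1) if w == target]

def char_ordinal_positions(sentence, target):
    """1-based character ordinal of the first character of each occurrence."""
    positions, start = [], 0
    while (idx := sentence.find(target, start)) != -1:
        positions.append(idx + 1)
        start = idx + 1
    return positions

tagged = [("Xiao Ming", "nr"), ("ate", "v"), ("le", "ul"), ("one", "mq"),
          ("apple", "n"), ("apple", "n"), ("very", "dc"), ("sweet", "a")]
print(word_serial_positions(tagged, "apple"))                 # [5, 6]
print(char_ordinal_positions("小明吃了一个苹果，苹果很甜", "苹果"))  # [7, 10]
```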
It should be noted that when selecting target positions from the target word's multiple occurrence positions in the sentence to be processed, the number of selected target positions should be smaller than the total number of occurrence positions. This ensures that after the target word is subsequently deleted at each target position, the resulting calibration sentence still retains at least one occurrence of the target word, so that the calibration sentence becomes a sentence that contains a zero reference item whose referent is the target word.
In a more specific example, selecting at least one target position from the target word's multiple occurrence positions in the sentence to be processed includes: randomly selecting at least one target position from those positions. Random selection over the occurrence positions helps construct zero reference resolution corpus entries relatively quickly.
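Random target-position selection under the keep-at-least-one constraint noted above might look like the following sketch; the function name and the explicit seeding are illustrative assumptions:

```python
import random

def random_target_positions(occurrences, rng=None):
    """Choose between 1 and n-1 of the n occurrence positions at random,
    so the calibration sentence retains at least one copy of the word."""
    if len(occurrences) < 2:
        raise ValueError("target word must occur at least twice")
    rng = rng or random.Random()
    k = rng.randint(1, len(occurrences) - 1)
    return sorted(rng.sample(occurrences, k))

picked = random_target_positions([5, 6], rng=random.Random(42))
print(picked)  # either [5] or [6]
```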
However, if the target positions are selected at random from the target word's occurrence positions, the calibration sentence obtained after deleting the target word at each target position may not conform to users' speech habits. For example, for the example sentence "Xiao Ming ate an apple; the apple was very sweet", deleting the first occurrence of the target word "apple" yields the calibration sentence "Xiao Ming ate an; the apple was very sweet". This calibration sentence clearly does not conform to users' speech habits; that is, a user would almost never produce a sentence with grammar similar to it. Whether a grammar rule base is built from, or a language model is trained on, corpus entries containing such calibration sentences, it is difficult to perform zero reference resolution well on other sentences.
Therefore, to obtain calibration sentences that better conform to users' speech habits, in another more specific example, selecting at least one target position from the target word's multiple occurrence positions in the sentence to be processed includes: determining, according to a data set containing multiple sample sentences, the conditional probability that each occurrence position of the target word in the sentence to be processed is referred to by a zero reference item; and selecting at least one target position from the occurrence positions according to each position's conditional probability, where the conditional probability of each selected target position is not lower than that of any unselected occurrence position.
In this example, for each occurrence position of the target word in the sentence to be processed, the larger the position's conditional probability, the more likely most users would be to omit the target word at that position when expressing the sentence's intended meaning. That is, after deleting the target word at a position with a larger conditional probability, the resulting calibration sentence is more likely to conform to users' speech habits.
In this example, the multiple sample sentences in the data set should all be sentences that conform well to users' language habits. Specifically, a sample sentence may be a sentence in which, after manual semantic analysis, the zero anaphor (including its position in the sentence) and the object it refers to have been annotated.
In a more specific example, the determining, according to the data set containing multiple sample sentences, of the conditional probability that each of the multiple occurrence positions of the target word in the sentence to be processed is referred to by a zero anaphor includes: determining at least one target sentence from the data set, where each target sentence contains at least one occurrence of the target word; for each target sentence, obtaining a first position, in the target sentence, of the zero anaphor that refers to the target word, and obtaining a second position of the target word in the target sentence; determining, over the at least one target sentence, a first frequency with which the first position precedes its corresponding second position and a second frequency with which the first position follows its corresponding second position; and calculating, from the first frequency and the second frequency, the conditional probability that each of the multiple occurrence positions of the target word in the sentence to be processed is referred to by a zero anaphor.
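For ease of understanding, the frequency counting and probability calculation described above can be sketched in Python as follows. This is an illustrative sketch only: the patent does not prescribe an implementation, and the data structures and function names here are assumptions. Each annotated sample is assumed to carry the position of the zero anaphor and the position of the noun it refers to.

```python
def occurrence_probabilities(target_word, annotated_samples):
    """Estimate, for a target word with two occurrences, the conditional
    probability that each occurrence position is referred to by a zero anaphor.

    annotated_samples: list of (words, anaphor_pos, antecedent_pos) tuples,
    where anaphor_pos is the index of the zero anaphor and antecedent_pos
    is the index of the noun it refers to (both hand-annotated).
    """
    first_freq = 0   # zero anaphor precedes the explicit target word
    second_freq = 0  # zero anaphor follows the explicit target word
    for words, anaphor_pos, antecedent_pos in annotated_samples:
        if target_word not in words:
            continue  # only "target sentences" containing the word count
        if anaphor_pos < antecedent_pos:
            first_freq += 1
        else:
            second_freq += 1
    total = first_freq + second_freq
    if total == 0:
        return None  # no evidence for this word in the data set
    # a/(a+b): probability the FIRST occurrence is the omitted one;
    # b/(a+b): probability the SECOND occurrence is the omitted one.
    return first_freq / total, second_freq / total
```

The returned pair corresponds to the quantities a/(a+b) and b/(a+b) used in the worked example that follows.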
Still taking the sentence to be processed shown in Fig. 3 as an example, after multiple target sentences containing the target word "apple" are determined from the data set, for each target sentence one can obtain the first position, in that sentence, of the zero anaphor referring to "apple", and the second position of "apple" itself. Suppose that, across the multiple target sentences, the frequency with which the first position of the zero anaphor referring to "apple" precedes the corresponding second position of "apple" is a, and the frequency with which the first position follows the corresponding second position is b. Then, for the target word "apple" in the sentence to be processed "Xiao Ming ate an apple, the apple is very sweet", the conditional probability that the first occurrence position of "apple" is referred to by a zero anaphor is calculated as a/(a+b), and the conditional probability that the second occurrence position is referred to by a zero anaphor is calculated as b/(a+b).
According to a specific embodiment, suppose that the conditional probability b/(a+b) is greater than the conditional probability a/(a+b); that is, the second occurrence position of "apple" in the sentence to be processed has the higher conditional probability of being referred to by a zero anaphor. The "apple" at the second occurrence position is then deleted, yielding the calibrated sentence "Xiao Ming ate an apple, very sweet".
Correspondingly, in step 29, the calibrated sentence, the target word, and each target position are combined to obtain a zero anaphora resolution corpus item, which is used for performing zero anaphora resolution on a sentence to be parsed.
Referring again to Fig. 3, the calibrated sentence "Xiao Ming ate an apple, very sweet", the target position "5", and the target word "apple" can be combined to obtain one zero anaphora resolution corpus item.
It should be noted that, to construct a syntax rule library that assists in performing zero anaphora resolution on sentences, the large-scale corpus items combined from calibrated sentences, target words, and target positions can be used directly. To train a language model for predicting the zero anaphors a sentence contains and the objects they refer to, however, besides using a large number of such corpus items as positive samples, a certain number of sentences that contain no zero anaphor are also needed as negative samples.
Correspondingly, in order to obtain zero anaphora resolution corpus items serving as negative samples for training the language model, in a possible embodiment the method further includes: in a case where the at least one candidate word does not exist, detecting whether any of the calibrated sentences already obtained is identical to the sentence to be processed; and in a case where no calibrated sentence identical to the sentence to be processed exists, determining the sentence to be processed as a negative sample for training the language model.
In this embodiment, among everyday sentences, those containing a zero anaphor are relatively few, while those containing no zero anaphor are relatively many. Therefore, after a large-scale corpus of positive samples has been obtained, for a sentence to be processed whose word sequence contains no candidate word, the calibrated sentences already obtained can be further queried for one identical to the sentence to be processed. If such a calibrated sentence exists, the sentence to be processed may itself contain a zero anaphor and should not be used as a negative sample for training the language model; otherwise, the sentence to be processed likely contains no zero anaphor and can be determined as a negative-sample corpus item for training the language model.
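The negative-sample screening in this embodiment amounts to a set-membership test. A minimal sketch follows; the function and variable names are assumptions, not the patent's actual implementation:

```python
def collect_negative_samples(sentences_without_candidates, calibrated_sentences):
    """Keep as negative samples only those candidate-free sentences that do
    not coincide with any calibrated sentence already produced: a sentence
    identical to a calibrated sentence may itself contain a zero anaphor."""
    calibrated = set(calibrated_sentences)  # fast membership test
    return [s for s in sentences_without_candidates if s not in calibrated]
```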
From the above description it can be seen that, in obtaining zero anaphora resolution corpus items by the method provided in the embodiments of this application, no excessive time needs to be spent on semantic analysis or on manual annotation of sentences. Accordingly, if large-scale sentences to be processed can be obtained, a large-scale zero anaphora resolution corpus can be obtained quickly. Therefore, on the basis of the embodiment shown in Fig. 2, as shown in Fig. 4, in a possible embodiment the method may further include, before step 21, the following steps 31 to 35: step 31, collecting text data from web pages; step 33, performing data cleaning and preprocessing on the text data to obtain a text to be processed; step 35, performing sentence splitting on the text to be processed to obtain the at least one sentence to be processed.
In step 31, text data is collected from web pages. The data carried in web pages is usually public and easy to collect; only simple processing of such public data is needed to obtain large-scale sentences to be processed that can be used for constructing a zero anaphora resolution corpus.
Specifically, the user may select the data sources of the text data according to actual needs. For example, Weibo, Baidu Baike, knowledge bases that publish and manage papers, and knowledge bases that manage patent application documents may all serve as data sources: the text data under these sources occupies a relatively large proportion of the corresponding web pages, and the sentences actually carried in that text data conform well to users' language habits. In general, text data can be collected quickly from the web pages of these data sources by a web crawler.
Then, in step 33, data cleaning and preprocessing are performed on the text data to obtain the text to be processed. Text data collected from web pages may contain characters that are unhelpful for constructing a zero anaphora resolution corpus; data cleaning removes such characters, for example the HTML (HyperText Markup Language) tags contained in the text data. Furthermore, preprocessing the cleaned text data yields a text to be processed that meets user needs; for example, the layout of sentences and passages in the cleaned text data can be adjusted so that complete sentences can subsequently be extracted from the preprocessed text.
Later, in step 35, sentence splitting is performed on the text to be processed to obtain the at least one sentence to be processed.
Specifically, sentence splitting can be performed on the text to be processed according to the punctuation marks that mark the complete expression of a sentence, such as "。", "？", and "！", so that multiple complete sentences are extracted from the text. It is easy to see that, for the multiple sentences to be processed obtained in step 35, the processing of each sentence can be completed quickly through the steps shown in Fig. 2, so that a large-scale zero anaphora resolution corpus is obtained quickly.
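Steps 33 and 35 can be illustrated with a small sketch. The regular expressions and helper names below are assumptions rather than the actual implementation, and real cleaning would typically be more thorough:

```python
import re

def clean_text(raw_html: str) -> str:
    """Step 33 (simplified): strip HTML tags and collapse whitespace."""
    text = re.sub(r"<[^>]+>", "", raw_html)   # remove HTML tags
    return re.sub(r"\s+", " ", text).strip()  # normalize whitespace

def split_sentences(text: str) -> list:
    """Step 35 (simplified): split after sentence-final punctuation 。？！."""
    parts = re.split(r"(?<=[。？！])", text)
    return [p.strip() for p in parts if p.strip()]
```

For example, `split_sentences("小明吃了一个苹果。苹果很甜！")` yields the two complete sentences "小明吃了一个苹果。" and "苹果很甜！".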
Based on the same concept as the method embodiments, the embodiments of this specification further provide a device for constructing a zero anaphora resolution corpus. The device can be implemented by any software, hardware, or combination thereof that has computing and processing capabilities. In general, the device can be deployed in the computing apparatus of the application scenario shown in Fig. 1.

Fig. 5 shows a schematic structural diagram of a device for constructing a zero anaphora resolution corpus.

As shown in Fig. 5, the device for constructing a zero anaphora resolution corpus may include at least:

a word segmentation module 51, configured to obtain the word sequence corresponding to a sentence to be processed, and to tag the part of speech of each word in the word sequence;

a word frequency statistics module 53, configured to determine, among the words in the word sequence, the number of occurrences in the word sequence of each word whose part of speech is a noun;

a first detection module 55, configured to detect whether the word sequence contains at least one candidate word, where the part of speech of the candidate word is a noun and its corresponding number of occurrences is not less than 2;

a sentence processing module 57, configured to, in a case where the at least one candidate word exists, select one candidate word as a target word, select at least one target position from among the multiple occurrence positions of the target word in the sentence to be processed, and delete the target word at each target position to obtain a calibrated sentence;

a corpus construction module 59, configured to combine the calibrated sentence, the target word, and each target position to obtain a zero anaphora resolution corpus item, the corpus item being used for performing zero anaphora resolution on a sentence to be parsed.
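For illustration, the cooperation of modules 51 to 59 over one sentence can be sketched as follows. The sketch assumes the sentence is already segmented and POS-tagged (by any off-the-shelf tagger, with "n" marking nouns), uses the random position selection of the simpler embodiment, and all names are assumptions rather than the patent's implementation:

```python
from collections import Counter
import random

def build_corpus_item(tagged_words, rng=random):
    """One pass of the Fig. 2 flow over a segmented, POS-tagged sentence.

    tagged_words: list of (word, pos) pairs, with pos == "n" for nouns.
    Returns (calibrated_words, target_word, target_positions), or None
    when no candidate word exists (modules 51-59, simplified).
    """
    words = [w for w, _ in tagged_words]
    # modules 53/55: nouns appearing at least twice are candidate words
    noun_counts = Counter(w for w, pos in tagged_words if pos == "n")
    candidates = [w for w, c in noun_counts.items() if c >= 2]
    if not candidates:
        return None
    # module 57: pick a target word and one occurrence position at random,
    # then delete the target word at that position
    target = rng.choice(candidates)
    positions = [i for i, w in enumerate(words) if w == target]
    target_pos = rng.choice(positions)
    calibrated = [w for i, w in enumerate(words) if i != target_pos]
    # module 59: combine into a corpus item
    return calibrated, target, [target_pos]
```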
In a possible embodiment, the zero anaphora resolution corpus item is a positive sample for training a language model, where the language model is used for predicting the position of the zero anaphor contained in an input sentence and the object the zero anaphor refers to.
In a possible embodiment, the device further includes:

a second detection module, configured to, in a case where the at least one candidate word does not exist, detect whether any of the calibrated sentences already obtained is identical to the sentence to be processed;

a negative sample determination module, configured to, in a case where no calibrated sentence identical to the sentence to be processed exists, determine the sentence to be processed as a negative sample for training the language model.
In a possible embodiment, the device further includes:

a data collection module, configured to collect text data from web pages;

a preprocessing module, configured to perform data cleaning and preprocessing on the text data to obtain a text to be processed;

a sentence splitting module, configured to perform sentence splitting on the text to be processed to obtain the at least one sentence to be processed.
In a possible embodiment, the multiple occurrence positions of the target word in the sentence to be processed are indicated by the multiple serial numbers corresponding to the target word in the word sequence.
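For example, under this representation the occurrence positions are simply the serial numbers (indices) at which the target word appears in the word sequence, as in this trivial illustrative sketch:

```python
def occurrence_positions(words, target):
    """Occurrence positions of a target word as serial numbers (indices)
    in the word sequence."""
    return [i for i, w in enumerate(words) if w == target]
```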
In a possible embodiment, the sentence processing module 57 is specifically configured to randomly select at least one target position from among the multiple occurrence positions of the target word in the sentence to be processed.
In a possible embodiment, the sentence processing module 57 includes:

a conditional probability determination unit, configured to determine, according to a data set containing multiple sample sentences, the conditional probability that each of the multiple occurrence positions of the target word in the sentence to be processed is referred to by a zero anaphor;

a sentence processing unit, configured to select at least one target position from the multiple occurrence positions according to the conditional probability corresponding to each occurrence position, where the conditional probability corresponding to each selected target position is not less than the conditional probability corresponding to any unselected occurrence position.
In a possible embodiment, the conditional probability determination unit is specifically configured to: determine at least one target sentence from the data set containing multiple sample sentences, where each target sentence contains at least one occurrence of the target word; for each target sentence, obtain a first position, in the target sentence, of the zero anaphor referring to the target word, and obtain a second position of the target word in the target sentence; determine, over the at least one target sentence, a first frequency with which the first position precedes its corresponding second position and a second frequency with which the first position follows its corresponding second position; and calculate, from the first frequency and the second frequency, the conditional probability that each of the multiple occurrence positions of the target word in the sentence to be processed is referred to by a zero anaphor.
The embodiments of this specification further provide a computing apparatus including a memory and a processor, the memory storing executable code; when the processor executes the executable code, the method described in any one of the embodiments of this specification is implemented.
Those skilled in the art will appreciate that, in one or more of the above examples, the functions described in this specification can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the computer programs corresponding to these functions can be stored in a computer-readable medium, or transmitted as one or more instructions or segments of code on a computer-readable medium, so that when these computer programs are executed by a computer, any of the methods described in the embodiments of the present invention is implemented by the computer.
All the embodiments in this specification are described in a progressive manner; identical or similar parts of the embodiments may be referred to mutually, and each embodiment focuses on its differences from the other embodiments. In particular, since the device embodiments are substantially similar to the method embodiments, their description is relatively brief, and relevant details may be found in the corresponding description of the method embodiments.
Specific embodiments of this specification have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims can be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The specific embodiments described above further explain in detail the objectives, technical solutions, and beneficial effects of the present invention. It should be understood that the above is merely a specific embodiment of the present invention and is not intended to limit the protection scope of the present invention; any modification, equivalent substitution, or improvement made on the basis of the technical solutions of the present invention shall fall within the protection scope of the present invention.

Claims (18)

1. A method for constructing a zero anaphora resolution corpus, the method comprising:
obtaining a word sequence corresponding to a sentence to be processed, and tagging the part of speech of each word in the word sequence;
determining, among the words in the word sequence, the number of occurrences in the word sequence of each word whose part of speech is a noun;
detecting whether the word sequence contains at least one candidate word, wherein the part of speech of the candidate word is a noun and its corresponding number of occurrences is not less than 2;
in a case where the at least one candidate word exists, selecting one candidate word as a target word, selecting at least one target position from among multiple occurrence positions of the target word in the sentence to be processed, and deleting the target word at each target position to obtain a calibrated sentence;
combining the calibrated sentence, the target word, and each target position to obtain a zero anaphora resolution corpus item, the zero anaphora resolution corpus item being used for performing zero anaphora resolution on a sentence to be parsed.
2. The method according to claim 1, wherein
the zero anaphora resolution corpus item is a positive sample for training a language model; wherein the language model is used for predicting the position of a zero anaphor contained in an input sentence and predicting the object the zero anaphor refers to.
3. The method according to claim 2, wherein the method further comprises:
in a case where the at least one candidate word does not exist, detecting whether any of the calibrated sentences already obtained is identical to the sentence to be processed;
in a case where no calibrated sentence identical to the sentence to be processed exists, determining the sentence to be processed as a negative sample for training the language model.
4. The method according to claim 1, wherein, before the obtaining a word sequence corresponding to a sentence to be processed, the method further comprises:
collecting text data from web pages;
performing data cleaning and preprocessing on the text data to obtain a text to be processed;
performing sentence splitting on the text to be processed to obtain the at least one sentence to be processed.
5. The method according to claim 1, wherein
the multiple occurrence positions of the target word in the sentence to be processed are indicated by multiple serial numbers corresponding to the target word in the word sequence.
6. The method according to any one of claims 1 to 5, wherein
the selecting at least one target position from among multiple occurrence positions of the target word in the sentence to be processed comprises: randomly selecting at least one target position from among the multiple occurrence positions of the target word in the sentence to be processed.
7. The method according to any one of claims 1 to 5, wherein
the selecting at least one target position from among multiple occurrence positions of the target word in the sentence to be processed comprises:
determining, according to a data set containing multiple sample sentences, a conditional probability that each of the multiple occurrence positions of the target word in the sentence to be processed is referred to by a zero anaphor;
selecting at least one target position from the multiple occurrence positions according to the conditional probability corresponding to each occurrence position, wherein the conditional probability corresponding to each target position is not less than the conditional probability corresponding to any unselected occurrence position.
8. The method according to claim 7, wherein
the determining, according to a data set containing multiple sample sentences, a conditional probability that each of the multiple occurrence positions of the target word in the sentence to be processed is referred to by a zero anaphor comprises:
determining at least one target sentence from the data set containing multiple sample sentences, wherein each target sentence contains at least one occurrence of the target word;
for each target sentence, obtaining a first position, in the target sentence, of a zero anaphor that refers to the target word, and obtaining a second position of the target word in the target sentence;
determining, in the at least one target sentence, a first frequency with which the first position precedes its corresponding second position, and a second frequency with which the first position follows its corresponding second position;
calculating, according to the first frequency and the second frequency, the conditional probability that each of the multiple occurrence positions of the target word in the sentence to be processed is referred to by a zero anaphor.
9. A device for constructing a zero anaphora resolution corpus, the device comprising:
a word segmentation module, configured to obtain a word sequence corresponding to a sentence to be processed, and tag the part of speech of each word in the word sequence;
a word frequency statistics module, configured to determine, among the words in the word sequence, the number of occurrences in the word sequence of each word whose part of speech is a noun;
a first detection module, configured to detect whether the word sequence contains at least one candidate word, wherein the part of speech of the candidate word is a noun and its corresponding number of occurrences is not less than 2;
a sentence processing module, configured to, in a case where the at least one candidate word exists, select one candidate word as a target word, select at least one target position from among multiple occurrence positions of the target word in the sentence to be processed, and delete the target word at each target position to obtain a calibrated sentence;
a corpus construction module, configured to combine the calibrated sentence, the target word, and each target position to obtain a zero anaphora resolution corpus item, the zero anaphora resolution corpus item being used for performing zero anaphora resolution on a sentence to be parsed.
10. The device according to claim 9, wherein
the zero anaphora resolution corpus item is a positive sample for training a language model; wherein the language model is used for predicting the position of a zero anaphor contained in an input sentence and predicting the object the zero anaphor refers to.
11. The device according to claim 10, wherein the device further comprises:
a second detection module, configured to, in a case where the at least one candidate word does not exist, detect whether any of the calibrated sentences already obtained is identical to the sentence to be processed;
a negative sample determination module, configured to, in a case where no calibrated sentence identical to the sentence to be processed exists, determine the sentence to be processed as a negative sample for training the language model.
12. The device according to claim 9, wherein the device further comprises:
a data collection module, configured to collect text data from web pages;
a preprocessing module, configured to perform data cleaning and preprocessing on the text data to obtain a text to be processed;
a sentence splitting module, configured to perform sentence splitting on the text to be processed to obtain the at least one sentence to be processed.
13. The device according to claim 9, wherein
the multiple occurrence positions of the target word in the sentence to be processed are indicated by multiple serial numbers corresponding to the target word in the word sequence.
14. The device according to any one of claims 9 to 13, wherein
the sentence processing module is specifically configured to randomly select at least one target position from among the multiple occurrence positions of the target word in the sentence to be processed.
15. The device according to any one of claims 9 to 13, wherein the sentence processing module comprises:
a conditional probability determination unit, configured to determine, according to a data set containing multiple sample sentences, a conditional probability that each of the multiple occurrence positions of the target word in the sentence to be processed is referred to by a zero anaphor;
a sentence processing unit, configured to select at least one target position from the multiple occurrence positions according to the conditional probability corresponding to each occurrence position, wherein the conditional probability corresponding to each target position is not less than the conditional probability corresponding to any unselected occurrence position.
16. The device according to claim 15, wherein
the conditional probability determination unit is specifically configured to: determine at least one target sentence from the data set containing multiple sample sentences, wherein each target sentence contains at least one occurrence of the target word; for each target sentence, obtain a first position, in the target sentence, of a zero anaphor that refers to the target word, and obtain a second position of the target word in the target sentence; determine, in the at least one target sentence, a first frequency with which the first position precedes its corresponding second position, and a second frequency with which the first position follows its corresponding second position; and calculate, according to the first frequency and the second frequency, the conditional probability that each of the multiple occurrence positions of the target word in the sentence to be processed is referred to by a zero anaphor.
17. A computer-readable storage medium on which a computer program is stored, wherein, when the computer program is executed in a computer, the computer is caused to perform the method of any one of claims 1-8.
18. A computing apparatus, comprising a memory and a processor, wherein the memory stores executable code, and when the processor executes the executable code, the method of any one of claims 1-8 is implemented.
CN201910635597.XA 2019-07-15 2019-07-15 Method and device for constructing a zero anaphora resolution corpus Active CN110413996B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910635597.XA CN110413996B (en) 2019-07-15 2019-07-15 Method and device for constructing a zero anaphora resolution corpus


Publications (2)

Publication Number Publication Date
CN110413996A true CN110413996A (en) 2019-11-05
CN110413996B CN110413996B (en) 2023-01-31

Family

ID=68361511

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910635597.XA Active CN110413996B (en) 2019-07-15 2019-07-15 Method and device for constructing zero-index digestion corpus

Country Status (1)

Country Link
CN (1) CN110413996B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428490A (en) * 2020-01-17 2020-07-17 北京理工大学 Reference resolution weak supervised learning method using language model
CN113011162A (en) * 2021-03-18 2021-06-22 北京奇艺世纪科技有限公司 Reference resolution method, device, electronic equipment and medium

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2005208782A (en) * 2004-01-21 2005-08-04 Fuji Xerox Co Ltd Natural language processing system, natural language processing method, and computer program
US20050216443A1 (en) * 2000-07-06 2005-09-29 Streamsage, Inc. Method and system for indexing and searching timed media information based upon relevance intervals
CN105373527A (en) * 2014-08-27 2016-03-02 中兴通讯股份有限公司 Omission recovery method and question-answering system
CN106815215A (en) * 2015-11-30 2017-06-09 华为技术有限公司 The method and apparatus for generating annotation repository
CN109165386A (en) * 2017-08-30 2019-01-08 哈尔滨工业大学 A kind of Chinese empty anaphora resolution method and system
CN109471919A (en) * 2018-11-15 2019-03-15 北京搜狗科技发展有限公司 Empty anaphora resolution method and device


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SIMONE PEREIRA et al.: "ZAC.PB: An Annotated Corpus for Zero Anaphora Resolution in Portuguese", Student Research Workshop *
KONG Fang et al.: "Construction of a Chinese Discourse Zero-Element Corpus" (中文篇章零元素语料库构建), Acta Scientiarum Naturalium Universitatis Pekinensis (Journal of Peking University, Natural Science Edition) *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111428490A (en) * 2020-01-17 2020-07-17 北京理工大学 Reference resolution weak supervised learning method using language model
CN111428490B (en) * 2020-01-17 2021-05-18 北京理工大学 Reference resolution weak supervised learning method using language model
CN113011162A (en) * 2021-03-18 2021-06-22 北京奇艺世纪科技有限公司 Reference resolution method, device, electronic equipment and medium
CN113011162B (en) * 2021-03-18 2023-07-28 北京奇艺世纪科技有限公司 Reference digestion method, device, electronic equipment and medium

Also Published As

Publication number Publication date
CN110413996B (en) 2023-01-31

Similar Documents

Publication Publication Date Title
Kranjc et al. Active learning for sentiment analysis on data streams: Methodology and workflow implementation in the ClowdFlows platform
Yimam et al. Exploring Amharic sentiment analysis from social media texts: Building annotation tools and classification models
CN110337645B (en) Adaptable processing assembly
RU2704531C1 (en) Method and apparatus for analyzing semantic information
Dąbrowski et al. Mining user opinions to support requirement engineering: an empirical study
US20200081980A1 (en) Features for classification of stories
CN110427627A (en) Task processing method and device based on a semantic representation model
US10210251B2 (en) System and method for creating labels for clusters
CN110019389A (en) Financial unstructured text analysis system and method
de Does et al. Creating research environments with blacklab
CN110413996A (en) Construct the method and device of zero reference resolution corpus
JP6885506B2 (en) Response processing program, response processing method, response processing device and response processing system
US11416556B2 (en) Natural language dialogue system perturbation testing
US20210390258A1 (en) Systems and methods for identification of repetitive language in document using linguistic analysis and correction thereof
AU2018273369A1 (en) Automated classification of network-accessible content
Putri et al. Software feature extraction using infrequent feature extraction
CN110516157A (en) Document retrieval method, device and storage medium
US20230024040A1 (en) Machine learning techniques for semantic processing of structured natural language documents to detect action items
Bontcheva et al. Extracting information from social media with gate
Gomide Corpus linguistics software: Understanding their usages and delivering two new tools
Hosseini et al. Compositional generalization for natural language interfaces to web apis
Leopold et al. On labeling quality in business process models
Gulnara et al. The development of a web application for the automatic analysis of the tonality of texts based on machine learning methods
CN116028620B (en) Method and system for generating patent abstract based on multi-task feature cooperation
Lou et al. Extracting facts from case rulings through paragraph segmentation of judicial decisions

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20201010

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Innovative advanced technology Co.,Ltd.

Address before: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant before: Advanced innovation technology Co.,Ltd.

Effective date of registration: 20201010

Address after: Cayman Enterprise Centre, 27 Hospital Road, George Town, Grand Cayman Islands

Applicant after: Advanced innovation technology Co.,Ltd.

Address before: P.O. Box 847, Fourth Floor, Capital Building, Grand Cayman, British Cayman Islands

Applicant before: Alibaba Group Holding Ltd.

GR01 Patent grant