Summary of the invention
One or more embodiments of this specification provide a method and an apparatus for constructing a zero reference resolution corpus, which help to quickly obtain a large-scale zero reference resolution corpus.
In a first aspect, a method for constructing a zero reference resolution corpus is provided, the method comprising:
obtaining a word sequence corresponding to a sentence to be processed, and tagging the part of speech of each word included in the word sequence;
determining, among the words included in the word sequence, the frequency of occurrence in the word sequence of each word whose part of speech is noun;
detecting whether at least one candidate word exists among the words included in the word sequence, wherein the part of speech of the candidate word is noun and its frequency of occurrence is not less than 2;
in a case where at least one candidate word exists, selecting one candidate word as a target word, selecting at least one target position from the multiple appearance positions of the target word in the sentence to be processed, and deleting the target word at each target position to obtain a calibration sentence;
combining the calibration sentence, the target word and each target position to obtain a zero reference resolution corpus, the zero reference resolution corpus being used for performing zero reference resolution on a sentence to be parsed.
In a possible implementation,
the zero reference resolution corpus is a positive sample for training a language model, wherein the language model is used to predict the position of a zero reference item contained in an input sentence and to predict the object to which the zero reference item refers.
In a possible implementation,
the method further comprises:
in a case where the at least one candidate word does not exist, detecting whether a calibration sentence identical to the sentence to be processed exists among the multiple calibration sentences already obtained;
in a case where no calibration sentence identical to the sentence to be processed exists, determining the sentence to be processed as a negative sample for training the language model.
In a possible implementation,
before obtaining the word sequence corresponding to the sentence to be processed, the method further comprises:
collecting text data from web pages;
performing data cleaning and preprocessing on the text data to obtain a text to be processed;
performing sentence segmentation on the text to be processed to obtain at least one sentence to be processed.
In a possible implementation,
the multiple appearance positions of the target word in the sentence to be processed are represented by the multiple serial numbers corresponding to the target word in the word sequence.
In a possible implementation,
selecting at least one target position from the multiple appearance positions of the target word in the sentence to be processed comprises: randomly selecting at least one target position from the multiple appearance positions of the target word in the sentence to be processed.
In a possible implementation,
selecting at least one target position from the multiple appearance positions of the target word in the sentence to be processed comprises:
determining, according to a data set containing multiple sample sentences, the conditional probability that the target word at each of its multiple appearance positions in the sentence to be processed is referred to by a zero reference item;
selecting, according to the conditional probability corresponding to each appearance position, at least one target position from the multiple appearance positions, wherein the conditional probability corresponding to each target position is not less than the conditional probability corresponding to each appearance position that is not selected.
In a possible implementation,
determining, according to a data set containing multiple sample sentences, the conditional probability that the target word at each of its multiple appearance positions in the sentence to be processed is referred to by a zero reference item comprises:
determining at least one target sentence from the data set containing multiple sample sentences, wherein each target sentence contains at least one occurrence of the target word;
for each target sentence, obtaining a first position, in the target sentence, of the zero reference item that refers to the target word, and obtaining a second position of the target word in the target sentence;
determining, over the at least one target sentence, a first frequency with which the first position is located before its corresponding second position and a second frequency with which the first position is located after its corresponding second position;
calculating, according to the first frequency and the second frequency, the conditional probability that the target word at each of its multiple appearance positions in the sentence to be processed is referred to by a zero reference item.
In a second aspect, an apparatus for constructing a zero reference resolution corpus is provided, the apparatus comprising:
a word segmentation module, configured to obtain a word sequence corresponding to a sentence to be processed and to tag the part of speech of each word included in the word sequence;
a word frequency statistics module, configured to determine, among the words included in the word sequence, the frequency of occurrence in the word sequence of each word whose part of speech is noun;
a first detection module, configured to detect whether at least one candidate word exists among the words included in the word sequence, wherein the part of speech of the candidate word is noun and its frequency of occurrence is not less than 2;
a sentence processing module, configured to, in a case where at least one candidate word exists, select one candidate word as a target word, select at least one target position from the multiple appearance positions of the target word in the sentence to be processed, and delete the target word at each target position to obtain a calibration sentence;
a corpus construction module, configured to combine the calibration sentence, the target word and each target position to obtain a zero reference resolution corpus, the zero reference resolution corpus being used for performing zero reference resolution on a sentence to be parsed.
In a possible implementation,
the zero reference resolution corpus is a positive sample for training a language model, wherein the language model is used to predict the position of a zero reference item contained in an input sentence and to predict the object to which the zero reference item refers.
In a possible implementation,
the apparatus further comprises:
a second detection module, configured to, in a case where the at least one candidate word does not exist, detect whether a calibration sentence identical to the sentence to be processed exists among the multiple calibration sentences already obtained;
a negative sample determination module, configured to, in a case where no calibration sentence identical to the sentence to be processed exists, determine the sentence to be processed as a negative sample for training the language model.
In a possible implementation,
the apparatus further comprises:
a data collection module, configured to collect text data from web pages;
a preprocessing module, configured to perform data cleaning and preprocessing on the text data to obtain a text to be processed;
a sentence segmentation module, configured to perform sentence segmentation on the text to be processed to obtain at least one sentence to be processed.
In a possible implementation,
the multiple appearance positions of the target word in the sentence to be processed are represented by the multiple serial numbers corresponding to the target word in the word sequence.
In a possible implementation,
the sentence processing module is specifically configured to randomly select at least one target position from the multiple appearance positions of the target word in the sentence to be processed.
In a possible implementation,
the sentence processing module comprises:
a conditional probability determination unit, configured to determine, according to a data set containing multiple sample sentences, the conditional probability that the target word at each of its multiple appearance positions in the sentence to be processed is referred to by a zero reference item;
a sentence processing unit, configured to select, according to the conditional probability corresponding to each appearance position, at least one target position from the multiple appearance positions, wherein the conditional probability corresponding to each target position is not less than the conditional probability corresponding to each appearance position that is not selected.
In a possible implementation,
the conditional probability determination unit is specifically configured to: determine at least one target sentence from the data set containing multiple sample sentences, wherein each target sentence contains at least one occurrence of the target word; for each target sentence, obtain a first position, in the target sentence, of the zero reference item that refers to the target word, and obtain a second position of the target word in the target sentence; determine, over the at least one target sentence, a first frequency with which the first position is located before its corresponding second position and a second frequency with which the first position is located after its corresponding second position; and calculate, according to the first frequency and the second frequency, the conditional probability that the target word at each of its multiple appearance positions in the sentence to be processed is referred to by a zero reference item.
In a third aspect, a computer-readable storage medium is provided, on which a computer program is stored; when the computer program is executed in a computer, the computer is caused to perform the method according to any implementation of the first aspect.
In a fourth aspect, a computing device is provided, comprising a memory and a processor, wherein executable code is stored in the memory, and the processor, when executing the executable code, implements the method according to any implementation of the first aspect.
With the method and apparatus provided by the embodiments of this specification, a word sequence corresponding to a sentence to be processed is first obtained, and the part of speech of each word included in the word sequence is tagged. Then, among the words included in the word sequence, the frequency of occurrence in the word sequence of each word whose part of speech is noun is determined. When the word sequence contains one or more candidate words whose part of speech is noun and whose frequency of occurrence is not less than 2, one candidate word can be selected as a target word, at least one target position can be selected from the multiple appearance positions of the target word in the sentence to be processed, and the target word at each target position can be deleted to obtain a calibration sentence containing a zero reference item. Afterwards, the calibration sentence, the target word and each target position can be combined to obtain a zero reference resolution corpus used for performing zero reference resolution on a sentence to be parsed. It can be seen that, in the course of constructing the zero reference resolution corpus, no excessive time has to be spent on semantic analysis and annotation of sentences, which helps to quickly obtain a large-scale zero reference resolution corpus.
Specific embodiments
The non-limiting embodiments provided by this specification are described in detail below with reference to the accompanying drawings.
Fig. 1 shows a schematic diagram of an application scenario to which one or more embodiments of this specification are applicable.
For a business system such as intelligent question answering or machine translation, which responds to the sentences that users provide to it in order to implement a specified service, the business system can implement the specified service well only if it accurately understands the semantics that the provided sentences are intended to express. In practice, because of users' language habits or for other reasons, the sentences provided to the business system often contain zero reference items; in that case, zero reference resolution has to be performed on the sentences containing zero reference items according to a pre-trained language model or a pre-built grammar rule base, so that the business system can accurately understand the semantics the sentences are intended to express. Generally, the language model is trained on a large-scale zero reference resolution corpus, and the grammar rule base is built from the results of statistical analysis of a large-scale zero reference resolution corpus. It can be seen that, to perform zero reference resolution on sentences, a large-scale zero reference resolution corpus must first be obtained.
Conventionally, a large amount of text is read manually and semantic analysis is performed on the sentences in the text, so as to find the sentences that contain a zero reference item and the object to which each zero reference item refers. Then, for each sentence containing a zero reference item, the appearance position of the zero reference item in the sentence and the object to which the zero reference item refers are annotated manually. In this way, a zero reference resolution corpus is obtained that consists of the sentence containing the zero reference item, the appearance position of the zero reference item in the sentence, and the object to which the zero reference item refers.
However, when a zero reference resolution corpus is obtained in the above manner, sentences containing a zero reference item account for a relatively small proportion of the text and sentences not containing a zero reference item account for a relatively large proportion, so a great deal of time is wasted on semantic analysis of sentences that do not contain a zero reference item; moreover, manually annotating the sentences that do contain a zero reference item also takes a long time. Therefore, in the above manner, constructing each zero reference resolution corpus takes a long time, and a large-scale zero reference resolution corpus cannot be obtained quickly.
In view of the above problems, the embodiments of this specification consider the following situation: for any sentence, if a target word whose part of speech is noun occurs multiple times in the sentence, then, among the multiple appearance positions of the target word in the sentence, after the target word at one or more target positions is deleted, the resulting calibration sentence may be a sentence containing a zero reference item; moreover, each target position may be the appearance position of a zero reference item contained in the calibration sentence, and the object to which that zero reference item refers is the target word. In this situation, it is only necessary to combine the calibration sentence, the target word and each target position accordingly in order to quickly obtain a zero reference resolution corpus, without spending excessive time on semantic analysis of the sentence or on annotating the sentence, which helps to quickly obtain a large-scale zero reference resolution corpus.
In view of the foregoing, the basic concept of the embodiments of this specification is to provide a method and an apparatus for constructing a zero reference resolution corpus. Implementations of this basic concept are described in detail below with reference to the accompanying drawings.
Fig. 2 shows a schematic flowchart of a method for constructing a zero reference resolution corpus.
It can be understood that the method for constructing a zero reference resolution corpus shown in Fig. 2 may be executed by the computing device in the application scenario shown in Fig. 1; the computing device includes, but is not limited to, a server or a general-purpose computer. As shown in Fig. 2, the method for constructing a zero reference resolution corpus may include at least the following steps 21 to 29:
Step 21: obtain a word sequence corresponding to a sentence to be processed, and tag the part of speech of each word included in the word sequence;
Step 23: determine, among the words included in the word sequence, the frequency of occurrence in the word sequence of each word whose part of speech is noun;
Step 25: detect whether at least one candidate word exists among the words included in the word sequence, wherein the part of speech of the candidate word is noun and its frequency of occurrence is not less than 2;
Step 27: in a case where at least one candidate word exists, select one candidate word as a target word, select at least one target position from the multiple appearance positions of the target word in the sentence to be processed, and delete the target word at each target position to obtain a calibration sentence;
Step 29: combine the calibration sentence, the target word and each target position to obtain a zero reference resolution corpus, the zero reference resolution corpus being used for performing zero reference resolution on a sentence to be parsed.
First, in step 21, a word sequence corresponding to the sentence to be processed is obtained, and the part of speech of each word included in the word sequence is tagged.
Specifically, word segmentation of the sentence to be processed can be performed by calling the Language Technology Platform (LTP), the Natural Language Processing & Information Retrieval (NLPIR) sharing platform, or another word segmentation tool, so as to obtain the word sequence corresponding to the sentence to be processed and tag the part of speech of each word included in the word sequence.
Turning next to Fig. 3, Fig. 3 shows a schematic diagram of the process of handling an exemplary sentence to construct a zero reference resolution corpus. As shown in Fig. 3, the exemplary sentence "Xiao Ming has eaten an apple, the apple is very sweet" (in Chinese, 小明吃了一个苹果，苹果很甜) is used as the sentence to be processed. Word segmentation can first be performed on the exemplary sentence, giving the corresponding word sequence [小明 (Xiao Ming), 吃 (eat), 了 (aspect particle), 一个 (one), 苹果 (apple), 苹果 (apple), 很 (very), 甜 (sweet)]; part-of-speech tagging is then performed on each word included in the word sequence, and the tagging result may be [小明/nr, 吃/v, 了/ul, 一个/mq, 苹果/n, 苹果/n, 很/dc, 甜/a].
Then, in step 23, the frequency of occurrence in the word sequence of each word whose part of speech is noun is determined among the words included in the word sequence.
Referring to Fig. 3, it is easy to count from the tagging result that, in the word sequence corresponding to the exemplary sentence, the noun "apple" has a frequency of occurrence of 2. It can be understood that, in actual business scenarios, the word sequence corresponding to a sentence to be processed may contain multiple different words whose part of speech is noun.
Then, in step 25, it is detected whether at least one candidate word exists among the words included in the word sequence, wherein the part of speech of the candidate word is noun and its frequency of occurrence is not less than 2.
Here, if a word whose part of speech is noun occurs multiple times in the sentence to be processed, that is, in the word sequence, the word can serve as a candidate word.
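Steps 23 and 25 can be sketched as follows. This is a minimal illustration rather than the implementation of the embodiments: the tagged sequence is hard-coded instead of being produced by a segmentation tool, the function names `noun_counts` and `candidate_words` are hypothetical, and treating only the tag `n` as "noun" is an assumption.

```python
from collections import Counter

# Tagged word sequence of the Fig. 3 example as (word, tag) pairs.
# Hard-coded here; in practice it would come from a segmentation tool.
TAGGED = [("小明", "nr"), ("吃", "v"), ("了", "ul"), ("一个", "mq"),
          ("苹果", "n"), ("苹果", "n"), ("很", "dc"), ("甜", "a")]


def noun_counts(tagged, noun_tags=("n",)):
    """Step 23: count the occurrences of every word tagged as a noun.

    Which tags count as "noun" is an assumption of this sketch.
    """
    return Counter(word for word, tag in tagged if tag in noun_tags)


def candidate_words(tagged, min_count=2):
    """Step 25: nouns whose frequency of occurrence is not less than min_count."""
    return [word for word, count in noun_counts(tagged).items()
            if count >= min_count]


print(candidate_words(TAGGED))
```

An empty result means that no candidate word exists, in which case the sentence is not used to build a calibration sentence.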
Further, in step 27, in a case where at least one candidate word exists, one candidate word is selected as the target word, at least one target position is selected from the multiple appearance positions of the target word in the sentence to be processed, and the target word at each target position is deleted to obtain a calibration sentence.
Specifically, one candidate word can be randomly selected as the target word.
It should be noted that, in practical application scenarios, the appearance position of the target word in the sentence to be processed can be represented in several ways. Specifically, in a possible implementation, the multiple appearance positions of the target word in the sentence to be processed are represented by the multiple serial numbers corresponding to the target word in the word sequence; that is, for each occurrence of the target word in the sentence to be processed, the serial number of that occurrence in the word sequence is used as the appearance position of that occurrence in the sentence to be processed. Referring again to Fig. 3, for the target word "apple" in the exemplary sentence to be processed, the first occurrence of "apple" has serial number 5 in the word sequence, so serial number 5 can represent the appearance position of the first occurrence of "apple" in the sentence to be processed; similarly, the second occurrence of "apple" has serial number 6 in the word sequence, so serial number 6 can represent the appearance position of the second occurrence of "apple" in the sentence to be processed.
In another possible implementation, for each occurrence of the target word in the sentence to be processed, the character ordinal, within the sentence to be processed, of the first character of that occurrence can be used as the appearance position of that occurrence in the sentence to be processed. Referring again to Fig. 3, for the target word "apple" (苹果) in the exemplary sentence to be processed, the first character "苹" of the first occurrence of "apple" has character ordinal 7 in the sentence to be processed, so character ordinal 7 can represent the appearance position of the first occurrence of "apple" in the sentence to be processed.
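The two position representations described above can be compared with a short sketch. It assumes the Chinese form of the Fig. 3 sentence (小明吃了一个苹果，苹果很甜) and 1-based numbering; the helper names `word_positions` and `char_positions` are hypothetical, and counting the comma as a character is an assumption.

```python
def word_positions(words, target):
    """1-based serial numbers of the target word in the word sequence."""
    return [i for i, word in enumerate(words, start=1) if word == target]


def char_positions(sentence, target):
    """1-based character ordinals of the first character of each occurrence.

    Plain substring search; it does not check word boundaries, so a target
    that is part of a longer word would also be matched.
    """
    positions = []
    start = sentence.find(target)
    while start != -1:
        positions.append(start + 1)
        start = sentence.find(target, start + len(target))
    return positions


WORDS = ["小明", "吃", "了", "一个", "苹果", "苹果", "很", "甜"]
SENTENCE = "小明吃了一个苹果，苹果很甜"

print(word_positions(WORDS, "苹果"))     # [5, 6]
print(char_positions(SENTENCE, "苹果"))  # [7, 10] when the comma counts as a character
```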
Obviously, the appearance position of the target word in the sentence to be processed can also be represented in other ways.
It should be noted that, when target positions are selected from the multiple appearance positions of the target word in the sentence to be processed, the total number of selected target positions should be smaller than the total number of appearance positions of the target word in the sentence to be processed. This ensures that, after the target word at each target position is deleted in the subsequent process, the resulting calibration sentence still retains at least one occurrence of the target word, so that the calibration sentence can be a sentence that contains a zero reference item whose referent is the target word.
In a more specific example, selecting at least one target position from the multiple appearance positions of the target word in the sentence to be processed comprises: randomly selecting at least one target position from the multiple appearance positions of the target word in the sentence to be processed. Selecting randomly among the multiple appearance positions of the target word in the sentence to be processed makes it possible to construct zero reference resolution corpora relatively quickly.
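Random selection, together with the constraint that at least one occurrence of the target word must be retained, can be sketched as follows. The names `random_target_positions` and `delete_targets` are hypothetical, positions are 1-based serial numbers in the word sequence, and the tokens are the assumed Chinese form of the Fig. 3 example.

```python
import random


def random_target_positions(positions, rng=random):
    """Randomly pick at least one, but never all, of the appearance positions."""
    if len(positions) < 2:
        raise ValueError("the target word must occur at least twice")
    k = rng.randint(1, len(positions) - 1)  # keep at least one occurrence
    return sorted(rng.sample(positions, k))


def delete_targets(words, target_positions):
    """Remove the words at the given 1-based serial numbers."""
    drop = set(target_positions)
    return [word for i, word in enumerate(words, start=1) if i not in drop]


WORDS = ["小明", "吃", "了", "一个", "苹果", "苹果", "很", "甜"]
targets = random_target_positions([5, 6], random.Random(0))
print(targets, delete_targets(WORDS, targets))
```

Passing a seeded `random.Random` instance keeps the selection reproducible, which is convenient when a corpus has to be regenerated.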
However, if the multiple appearance positions of the target word in the sentence to be processed are selected randomly, the calibration sentence obtained after deleting the target word at each target position may not conform to users' language habits. For example, for the exemplary sentence to be processed "Xiao Ming has eaten an apple, the apple is very sweet" (小明吃了一个苹果，苹果很甜), if the first occurrence of the target word "apple" is deleted, the resulting calibration sentence is "Xiao Ming has eaten one, the apple is very sweet" (小明吃了一个，苹果很甜). This calibration sentence clearly does not conform to users' language habits; that is, a user is very unlikely to provide a sentence with a grammar similar to that of this calibration sentence. Whether a grammar rule base is built or a language model is trained from a zero reference resolution corpus containing this calibration sentence, it is difficult to perform zero reference resolution well on other sentences.
Therefore, in order to obtain calibration sentences that better conform to users' language habits, in another more specific example, selecting at least one target position from the multiple appearance positions of the target word in the sentence to be processed comprises: determining, according to a data set containing multiple sample sentences, the conditional probability that the target word at each of its multiple appearance positions in the sentence to be processed is referred to by a zero reference item; and selecting, according to the conditional probability corresponding to each appearance position, at least one target position from the multiple appearance positions, wherein the conditional probability corresponding to each target position is not less than the conditional probability corresponding to each appearance position that is not selected.
In this example, for each appearance position of the target word in the sentence to be processed, the larger the conditional probability corresponding to the appearance position, the more likely it is that most users, when expressing the semantics the sentence is intended to express, would omit the target word at that appearance position. In other words, after the target word at an appearance position with a larger conditional probability is deleted, the resulting calibration sentence is more likely to conform to users' language habits.
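Selection by conditional probability can be sketched as below. The function names and the probability values are hypothetical and serve only to illustrate the rule that every selected position has a conditional probability not less than that of any unselected position; the number of positions to select, `k`, is a parameter of this sketch.

```python
def select_target_positions(position_probs, k=1):
    """Pick the k appearance positions with the highest conditional probability.

    position_probs maps each appearance position to its conditional
    probability; k must leave at least one occurrence of the target word.
    """
    if not 1 <= k < len(position_probs):
        raise ValueError("k must be >= 1 and smaller than the number of positions")
    ranked = sorted(position_probs, key=position_probs.get, reverse=True)
    return sorted(ranked[:k])


def delete_words(words, target_positions):
    """Remove the words at the given 1-based serial numbers."""
    drop = set(target_positions)
    return [word for i, word in enumerate(words, start=1) if i not in drop]


WORDS = ["小明", "吃", "了", "一个", "苹果", "苹果", "很", "甜"]
probs = {5: 0.3, 6: 0.7}  # hypothetical conditional probabilities
targets = select_target_positions(probs)
print(targets, delete_words(WORDS, targets))
```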
In this example, the multiple sample sentences included in the data set should all be sentences that conform well to users' language habits; specifically, a sample sentence may be a sentence on which semantic analysis has been performed manually and in which the zero reference item (that is, the position of the zero reference item in the sentence) and the object it refers to have been annotated.
In a more specific example, determining, according to a data set containing multiple sample sentences, the conditional probability that the target word at each of its multiple appearance positions in the sentence to be processed is referred to by a zero reference item comprises: determining at least one target sentence from the data set containing multiple sample sentences, wherein each target sentence contains at least one occurrence of the target word; for each target sentence, obtaining a first position, in the target sentence, of the zero reference item that refers to the target word, and obtaining a second position of the target word in the target sentence; determining, over the at least one target sentence, a first frequency with which the first position is located before its corresponding second position and a second frequency with which the first position is located after its corresponding second position; and calculating, according to the first frequency and the second frequency, the conditional probability that the target word at each of its multiple appearance positions in the sentence to be processed is referred to by a zero reference item.
Still taking the sentence to be processed shown in Fig. 3 as an example, after multiple target sentences containing the target word "apple" have been determined from the data set, for each target sentence, the first position in the target sentence of the zero reference item that refers to the target word "apple" can be obtained, and the second position of the target word "apple" in the target sentence can be obtained. Suppose that, over the multiple target sentences, the first frequency, with which the first position of the zero reference item referring to "apple" is located before the second position of the "apple" it refers to in the same target sentence, is a, and the second frequency, with which the first position of the zero reference item referring to "apple" is located after the second position of the "apple" it refers to in the same target sentence, is b. Then, for the target word "apple" in the sentence to be processed "Xiao Ming has eaten an apple, the apple is very sweet", the conditional probability that the first appearance position of "apple" in the sentence to be processed is referred to by a zero reference item can be calculated as a/(a+b), and the conditional probability that the second appearance position of "apple" in the sentence to be processed is referred to by a zero reference item can be calculated as b/(a+b).
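The calculation of a/(a+b) and b/(a+b) can be sketched as follows for a target word with two appearance positions, as in the example. The input format (one pair of first position and second position per target sentence), the function name `position_probabilities` and the sample values are assumptions of this sketch.

```python
def position_probabilities(position_pairs):
    """Return (P(first appearance position), P(second appearance position)).

    position_pairs holds one (first_position, second_position) pair per
    target sentence: the position of the zero reference item and the
    position of the target word it refers to.
    """
    # a: zero reference item located before the target word it refers to
    a = sum(1 for first, second in position_pairs if first < second)
    # b: zero reference item located after the target word it refers to
    b = sum(1 for first, second in position_pairs if first > second)
    if a + b == 0:
        raise ValueError("no usable target sentences")
    return a / (a + b), b / (a + b)


# Hypothetical annotated positions taken from four target sentences.
pairs = [(1, 4), (6, 2), (7, 3), (9, 5)]
print(position_probabilities(pairs))  # (0.25, 0.75)
```

With these sample values a = 1 and b = 3, so the second appearance position receives the higher conditional probability.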
According to a specific embodiment, suppose that the conditional probability b/(a+b) is greater than the conditional probability a/(a+b), that is, "apple" at the second appearance position in the sentence to be processed has the higher conditional probability of being referred to by a zero reference item. Then "apple" at the second appearance position is deleted, giving the calibration sentence "Xiao Ming has eaten an apple, very sweet" (小明吃了一个苹果，很甜).
Correspondingly, in step 29, the calibration sentence, the target word and each target position are combined to obtain a zero reference resolution corpus, the zero reference resolution corpus being used for performing zero reference resolution on a sentence to be parsed.
Referring again to Fig. 3, the calibration sentence "Xiao Ming has eaten an apple, very sweet" (小明吃了一个苹果，很甜), the target position "5" and the target word "apple" can be combined to obtain one zero reference resolution corpus.
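One way to hold the combination produced in step 29 is a small record type. The class name `CorpusEntry` and its field names are hypothetical, and the values are those stated for the Fig. 3 example, with the sentence given in its assumed Chinese form.

```python
from dataclasses import dataclass, asdict
from typing import Tuple


@dataclass(frozen=True)
class CorpusEntry:
    """One zero reference resolution corpus entry (a positive sample)."""
    calibration_sentence: str          # sentence after deleting the target word
    target_word: str                   # object the zero reference item refers to
    target_positions: Tuple[int, ...]  # position(s) of the zero reference item


entry = CorpusEntry(
    calibration_sentence="小明吃了一个苹果，很甜",
    target_word="苹果",
    target_positions=(5,),
)
print(asdict(entry))
```

A frozen dataclass keeps entries immutable and hashable, so duplicates can be removed by collecting entries in a set.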
It should be noted that, for building a grammar rule base that assists in performing zero reference resolution on sentences, a large-scale zero reference resolution corpus formed by combining calibration sentences, target words and target positions can be used. For training a language model that predicts the zero reference items contained in a sentence and the objects they refer to, in addition to a large number of zero reference resolution corpora formed by combining calibration sentences, target words and target positions as positive samples, a certain number of sentences that do not contain a zero reference item are also needed as negative samples.
Correspondingly, in order to obtain zero reference resolution corpora that serve as negative samples for training the language model, in a possible implementation the method further comprises: in a case where the at least one candidate word does not exist, detecting whether a calibration sentence identical to the sentence to be processed exists among the multiple calibration sentences already obtained; and, in a case where no calibration sentence identical to the sentence to be processed exists, determining the sentence to be processed as a negative sample for training the language model.
In this embodiment, among sentences in everyday use, those containing a zero reference item are relatively few and those containing none are relatively many. Therefore, after large-scale zero reference resolution corpora serving as positive samples have been obtained, when no candidate word exists among the words of a sentence to be processed, the calibration sentences already obtained can be further queried for one that is identical to the sentence to be processed. If such a calibration sentence exists, the sentence to be processed may itself contain a zero reference item and should not be used as a negative sample for training the language model. Otherwise, the sentence to be processed probably contains no zero reference item and can be determined as a zero reference resolution corpus that serves as a negative sample for training the language model.
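The negative-sample check described above can be sketched as follows. This is a minimal illustration only: the function name `collect_negative_samples` and the representation of sentences as plain strings are assumptions made for the example and are not taken from the embodiments.

```python
from typing import Iterable, List


def collect_negative_samples(
    sentences_without_candidates: Iterable[str],
    calibration_sentences: Iterable[str],
) -> List[str]:
    """Keep a sentence as a negative sample only if it matches no calibration sentence.

    `sentences_without_candidates` are sentences to be processed in which no
    candidate word (a noun occurring at least twice) was found.
    """
    # A set gives constant-time membership tests over the calibration sentences.
    known = set(calibration_sentences)
    # A sentence identical to a calibration sentence may itself contain a
    # zero reference item, so it is skipped rather than used as a negative sample.
    return [s for s in sentences_without_candidates if s not in known]
```

Exact string equality is used here because the text speaks of a calibration sentence being identical to the sentence to be processed; any normalisation before comparison would be an additional design choice.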
As can be seen from the above description, obtaining zero reference resolution corpora by the method provided in the embodiments of the present application requires neither excessive time for semantic analysis of sentences nor excessive time for labelling sentences. Accordingly, if large-scale sentences to be processed can be obtained, large-scale zero reference resolution corpora can be obtained quickly. Therefore, on the basis of the embodiment shown in Fig. 2, and as shown in Fig. 4, in a possible embodiment the method may further include the following steps 31 to 35 before step 21: step 31, collecting text data from web pages; step 33, performing data cleaning and preprocessing on the text data to obtain text to be processed; step 35, performing sentence segmentation on the text to be processed to obtain at least one sentence to be processed.
In step 31, text data is collected from web pages. The data carried in web pages is usually public data and is easy to collect. With only simple processing of this public data, large-scale sentences to be processed that can be used to construct zero reference resolution corpora can be obtained.
Specifically, the user may also select the data source of the text data according to actual needs. For example, Weibo (microblog), Baidu Baike (Baidu Encyclopedia), knowledge bases that publish and manage papers, and knowledge bases that manage patent application documents may be used as data sources. Text data accounts for a relatively large proportion of the web pages of these data sources, and the sentences actually carried in this text data better match users' language habits. In general, a web crawler can quickly collect text data from the web pages corresponding to these data sources.
Then, in step 33, data cleaning and preprocessing are performed on the text data to obtain text to be processed. Text data collected from web pages may also contain characters that hinder the construction of zero reference resolution corpora. Data cleaning removes such characters from the text data, for example the HTML (Hyper Text Markup Language) tags contained in it. Preprocessing the cleaned text data then yields text to be processed that meets the user's needs; for example, the layout of the sentences and sections in the cleaned text data is adjusted so that complete sentences can subsequently be extracted from the preprocessed text to be processed.
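As an illustration of the data cleaning in step 33, the following sketch strips HTML tags and normalises whitespace using only the Python standard library. The function name `clean_text` is hypothetical, and the regular-expression approach is a simplification chosen for brevity; a full HTML parser would handle malformed markup, scripts, and comments more robustly.

```python
import html
import re

# Matches anything that looks like an HTML tag, e.g. <p>, </div>, <a href="...">.
_TAG_RE = re.compile(r"<[^>]+>")
# Matches runs of whitespace (spaces, tabs, newlines).
_WS_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Remove HTML tags and entities from crawled text and tidy the whitespace."""
    # Replace each tag with a space so that text on either side does not fuse.
    no_tags = _TAG_RE.sub(" ", raw)
    # Turn entities such as &amp; back into the characters they stand for.
    unescaped = html.unescape(no_tags)
    # Collapse whitespace runs into single spaces and trim both ends.
    return _WS_RE.sub(" ", unescaped).strip()
```

Replacing tags with a space suits space-delimited text; for Chinese text an empty replacement may be preferable, which is a choice this sketch leaves open.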
Afterwards, in step 35, sentence segmentation is performed on the text to be processed to obtain at least one sentence to be processed.
Specifically, the text to be processed can be segmented at the punctuation marks that indicate a sentence has been completely expressed, such as ".", "?", and "!", so that multiple complete sentences are extracted from the text to be processed. It is not difficult to understand that each of the multiple sentences to be processed obtained in step 35 can be handled quickly through the steps shown in Fig. 2, so that large-scale zero reference resolution corpora are obtained quickly.
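The sentence segmentation of step 35 can be sketched as below. The function name `split_sentences` and the exact set of terminating marks (both full-width and ASCII forms) are assumptions made for the example; abbreviations and decimal numbers containing "." are not handled.

```python
import re
from typing import List

# One sentence = a run of non-terminator characters followed by one or more
# sentence-ending marks (full-width or ASCII period, question mark, exclamation mark).
_SENTENCE_RE = re.compile(r"[^。？！.?!]+[。？！.?!]+")


def split_sentences(text: str) -> List[str]:
    """Extract complete sentences from the text to be processed.

    Trailing text that has no terminating punctuation is treated as an
    incomplete sentence and is dropped.
    """
    sentences = []
    for match in _SENTENCE_RE.finditer(text):
        sentence = match.group().strip()
        if sentence:
            sentences.append(sentence)
    return sentences
```

Each returned sentence can then be passed independently through the steps shown in Fig. 2.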
Based on the same concept as the method embodiments, the embodiments of this specification further provide a device for constructing a zero reference resolution corpus. The device can be implemented by any software, hardware, or combination thereof that has computing and processing capabilities. In general, the device can be deployed in the computing device of the application scenario shown in Fig. 1.
Fig. 5 shows a schematic structural diagram of a device for constructing a zero reference resolution corpus.
As shown in Fig. 5, the device for constructing a zero reference resolution corpus may include at least:
a word segmentation processing module 51, configured to obtain the word sequence corresponding to a sentence to be processed and to mark the part of speech of each word included in the word sequence;
a word frequency statistics module 53, configured to determine, for each word in the word sequence whose part of speech is noun, its frequency of occurrence in the word sequence;
a first detection module 55, configured to detect whether at least one candidate word exists among the words included in the word sequence, where the part of speech of a candidate word is noun and its frequency of occurrence is not less than 2;
a sentence processing module 57, configured to, in the case where at least one candidate word exists, select one candidate word as a target word, select at least one target position from the multiple appearance positions of the target word in the sentence to be processed, and delete the target word at each target position to obtain a calibration sentence;
a corpus constructing module 59, configured to combine the calibration sentence, the target word, and each target position to obtain a zero reference resolution corpus, where the zero reference resolution corpus is used for performing zero reference resolution on a sentence to be parsed.
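The cooperation of modules 53 to 59 can be sketched in a few lines. This is a minimal sketch under stated assumptions, not the implementation of the embodiments: the input is assumed to be already segmented and tagged as (word, part-of-speech) pairs, the noun tag `"n"` is a hypothetical label, positions are 0-based indices into the word sequence, and the function name `build_sample` is invented for the example. The sketch also keeps at least one occurrence of the target word in the sentence, which is an extra assumption; the text only requires that at least one target position be selected.

```python
import random
from collections import Counter
from typing import List, Optional, Tuple


def build_sample(
    tagged: List[Tuple[str, str]],
    rng: random.Random,
    noun_tag: str = "n",
) -> Optional[dict]:
    """Build one positive sample, or return None when no candidate word exists."""
    # Word frequency statistics: count only words tagged as nouns.
    noun_counts = Counter(word for word, pos in tagged if pos == noun_tag)
    # Detection: a candidate word is a noun that occurs at least twice.
    candidates = sorted(word for word, count in noun_counts.items() if count >= 2)
    if not candidates:
        return None
    # Sentence processing: pick one target word and its appearance positions.
    target = rng.choice(candidates)
    positions = [
        i for i, (word, pos) in enumerate(tagged) if word == target and pos == noun_tag
    ]
    # Assumption of this sketch: delete between 1 and len(positions) - 1
    # occurrences so that one occurrence of the target word remains.
    k = rng.randint(1, len(positions) - 1)
    target_positions = sorted(rng.sample(positions, k))
    deleted = set(target_positions)
    calibration = [word for i, (word, _) in enumerate(tagged) if i not in deleted]
    # Corpus construction: combine calibration sentence, target word, and positions.
    return {
        "calibration_sentence": calibration,
        "target_word": target,
        "target_positions": target_positions,
    }
```

Passing an explicit `random.Random` instance makes the random selection reproducible, which is convenient when a corpus has to be regenerated.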
In a possible embodiment, the zero reference resolution corpus is a positive sample for training a language model, where the language model is used to predict the position of the zero reference item contained in a sentence input to it and to predict the object to which the zero reference item refers.
In a possible embodiment, the device further includes:
a second detection module, configured to detect, in the case where the at least one candidate word does not exist, whether a calibration sentence identical to the sentence to be processed exists among the multiple calibration sentences already obtained;
a negative sample determining module, configured to determine, in the case where no calibration sentence identical to the sentence to be processed exists, the sentence to be processed as a negative sample for training the language model.
In a possible embodiment, the device further includes:
a data acquisition module, configured to collect text data from web pages;
a preprocessing module, configured to perform data cleaning and preprocessing on the text data to obtain text to be processed;
a sentence segmentation module, configured to perform sentence segmentation on the text to be processed to obtain at least one sentence to be processed.
In a possible embodiment, the multiple appearance positions of the target word in the sentence to be processed are represented by the multiple serial numbers corresponding to the target word in the word sequence.
In a possible embodiment, the sentence processing module 57 is specifically configured to randomly select at least one target position from the multiple appearance positions of the target word in the sentence to be processed.
In a possible embodiment, the sentence processing module 57 includes:
a conditional probability determination unit, configured to determine, according to a data set containing multiple sample sentences, the conditional probability that each of the multiple appearance positions of the target word in the sentence to be processed is referred to by a zero reference item;
a sentence processing unit, configured to select at least one target position from the multiple appearance positions according to the conditional probability corresponding to each appearance position, where the conditional probability corresponding to each target position is not less than the conditional probability corresponding to each appearance position that is not selected.
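The selection rule of the sentence processing unit amounts to taking the appearance positions with the highest conditional probabilities. A minimal sketch follows; the function name `select_by_probability`, the parameter `k` (how many target positions to select), and the mapping from position to probability are assumptions made for the example.

```python
from typing import Dict, List


def select_by_probability(position_probs: Dict[int, float], k: int = 1) -> List[int]:
    """Select the k appearance positions with the highest conditional probability.

    Every selected position has a probability not less than that of every
    position left unselected. Ties are broken in favour of the earlier position.
    """
    if not 1 <= k <= len(position_probs):
        raise ValueError("k must be between 1 and the number of appearance positions")
    # Sort by descending probability, then by ascending position for stable ties.
    ranked = sorted(position_probs, key=lambda p: (-position_probs[p], p))
    # Return the chosen positions in sentence order.
    return sorted(ranked[:k])
```

For instance, with probabilities 0.3, 0.7, and 0.5 at positions 2, 5, and 9 and `k=2`, positions 5 and 9 are selected.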
In a possible embodiment, the conditional probability determination unit is specifically configured to: determine at least one target sentence from the data set containing multiple sample sentences, where each target sentence contains at least one occurrence of the target word; for each target sentence, obtain a first position, in the target sentence, of the zero reference item that refers to the target word, and obtain a second position of the target word in the target sentence; determine, over the at least one target sentence, a first frequency with which the first position is located before its corresponding second position and a second frequency with which the first position is located after its corresponding second position; and calculate, according to the first frequency and the second frequency, the conditional probability that each of the multiple appearance positions of the target word in the sentence to be processed is referred to by a zero reference item.
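The counting of the first frequency and the second frequency can be sketched as follows. The function names and the input format, a list of (first position, second position) pairs with one pair per target sentence, are assumptions made for the example. This passage does not state how the two frequencies are converted into a conditional probability for each appearance position, so the sketch stops at normalising them into two relative shares, which is itself an assumption rather than the formula of the embodiments.

```python
from typing import Iterable, Tuple


def direction_frequencies(pairs: Iterable[Tuple[int, int]]) -> Tuple[int, int]:
    """Count how often the zero reference item precedes or follows the target word.

    Each pair is (first_position, second_position): the position of the zero
    reference item and the position of the target word in one target sentence.
    """
    first_frequency = 0   # zero reference item located before the target word
    second_frequency = 0  # zero reference item located after the target word
    for first_position, second_position in pairs:
        if first_position < second_position:
            first_frequency += 1
        elif first_position > second_position:
            second_frequency += 1
    return first_frequency, second_frequency


def direction_shares(first_frequency: int, second_frequency: int) -> Tuple[float, float]:
    """Normalise the two frequencies into relative shares (an assumed step)."""
    total = first_frequency + second_frequency
    if total == 0:
        # No evidence in the data set: fall back to equal shares.
        return 0.5, 0.5
    return first_frequency / total, second_frequency / total
```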
The embodiments of this specification further provide a computing device including a memory and a processor, where executable code is stored in the memory, and when the processor executes the executable code, the method described in any one of the embodiments of this specification is implemented.
Those skilled in the art will appreciate that, in one or more of the above examples, the functions described in this specification can be implemented in hardware, software, firmware, or any combination thereof. When implemented in software, the computer programs corresponding to these functions can be stored in a computer-readable medium or transmitted as one or more instructions or code on a computer-readable medium, so that when the computer programs corresponding to these functions are executed by a computer, the computer implements the method described in any one of the embodiments of the present invention.
The embodiments in this specification are described in a progressive manner. Identical or similar parts of the embodiments may be understood with reference to each other, and each embodiment focuses on its differences from the other embodiments. In particular, because the device embodiments are substantially similar to the method embodiments, they are described relatively briefly; for related details, refer to the corresponding description of the method embodiments.
Specific embodiments of this specification have been described above. Other embodiments fall within the scope of the appended claims. In some cases, the actions or steps recited in the claims may be performed in an order different from that in the embodiments and still achieve the desired results. In addition, the processes depicted in the drawings do not necessarily require the particular order shown, or a sequential order, to achieve the desired results. In some embodiments, multitasking and parallel processing are also possible or may be advantageous.
The specific embodiments described above further explain in detail the purpose, technical solutions, and beneficial effects of the present invention. It should be understood that the foregoing is merely a description of specific embodiments of the present invention and is not intended to limit the protection scope of the present invention. Any modification, equivalent substitution, improvement, and the like made on the basis of the technical solutions of the present invention shall fall within the protection scope of the present invention.