CN110389999A - Information extraction method, apparatus, storage medium and electronic device

Information extraction method, apparatus, storage medium and electronic device

Info

Publication number
CN110389999A
Authority
CN
China
Prior art keywords
subproblem
target
answer
text
open problems
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910684300.9A
Other languages
Chinese (zh)
Inventor
李夏禹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Shannon Huiyu Technology Co Ltd
Original Assignee
Beijing Shannon Huiyu Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Beijing Shannon Huiyu Technology Co Ltd
Priority to CN201910684300.9A
Publication of CN110389999A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/3329 Natural language query formulation or dialogue systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Human Computer Interaction (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The present invention provides an information extraction method, apparatus, storage medium and electronic device. The method comprises: obtaining an open problem (the question to be answered) and decomposing it into multiple subproblems arranged in a multi-level tree structure; choosing a leaf subproblem as the target subproblem, extracting the answer to the target subproblem, and updating the matching expandable information in the subproblem at the next-higher level to that answer; then taking the other leaf subproblems as the target subproblem in turn and repeating the above steps; taking each next-higher-level subproblem whose expandable information has all been updated as the target subproblem, until the root problem becomes the target subproblem; and extracting the answer to that target subproblem and using it as the answer to the open problem. The method, apparatus, storage medium and electronic device provided by the embodiments of the present invention can extract the answer to the open problem accurately, which improves extraction precision and substantially improves extraction accuracy for complex questions.

Description

Information extraction method, apparatus, storage medium and electronic device
Technical field
The present invention relates to the technical field of information extraction, and in particular to an information extraction method, apparatus, storage medium and electronic device.
Background
At present, question answering (QA) models based on general-purpose deep learning provide a generic way to obtain the answer to a question from a text passage. However, existing QA models are only suitable for simple questions and cannot answer complex ones.
For example, suppose the source text is: "The current principal of Middle School A is Zhang San. Zhang San previously worked at Kindergarten B, where his post was form master, and at Primary School C, where his post was dean of studies."
Question 1: Who is the current principal of Middle School A? The QA model answers: Zhang San.
Question 2: Where did the current principal of Middle School A previously work? The QA model cannot answer.
Question 3: What posts did the current principal of Middle School A previously hold? The QA model cannot answer.
Existing question answering models (QA models) therefore cannot answer complex questions, and for some questions they cannot provide an answer quickly and accurately.
Summary of the invention
To solve the above problems, the embodiments of the present invention aim to provide an information extraction method, apparatus, storage medium and electronic device.
In a first aspect, an embodiment of the present invention provides an information extraction method, comprising:
obtaining an open problem, and decomposing the open problem into multiple subproblems according to a multi-level tree structure, wherein the subproblems include at least a leaf subproblem and a root problem, and the answer to a subproblem at the current level matches the expandable information in the corresponding subproblem at the next-higher level;
choosing a leaf subproblem as the target subproblem, extracting the answer to the target subproblem from preset text content, and updating the matching expandable information in the subproblem at the next-higher level to the answer to the target subproblem; then taking the other leaf subproblems as the target subproblem and repeating the above steps until all expandable information in the subproblem at the next-higher level has been updated;
taking the next-higher-level subproblem whose expandable information has all been updated as the target subproblem and repeating the above steps until the root problem is taken as the target subproblem; and
extracting the answer to the target subproblem from the preset text content, and using the answer extracted at this point as the answer to the open problem.
In a second aspect, an embodiment of the present invention further provides an information extraction apparatus, comprising:
a problem decomposition module, configured to obtain an open problem and decompose the open problem into multiple subproblems according to a multi-level tree structure, wherein the subproblems include at least a leaf subproblem and a root problem, and the answer to a subproblem at the current level matches the expandable information in the corresponding subproblem at the next-higher level;
a subproblem answer extraction module, configured to choose a leaf subproblem as the target subproblem, extract the answer to the target subproblem from preset text content, and update the matching expandable information in the subproblem at the next-higher level to the answer to the target subproblem; then take the other leaf subproblems as the target subproblem and repeat the above steps until all expandable information in the subproblem at the next-higher level has been updated; and take the next-higher-level subproblem whose expandable information has all been updated as the target subproblem and repeat the above steps until the root problem is taken as the target subproblem; and
an open problem answer extraction module, configured to extract the answer to the target subproblem from the preset text content, and use the answer extracted at this point as the answer to the open problem.
In a third aspect, an embodiment of the present invention further provides a computer storage medium storing computer-executable instructions, the computer-executable instructions being used to execute the information extraction method described in any of the above.
In a fourth aspect, an embodiment of the present invention further provides an electronic device, comprising:
at least one processor; and
a memory communicatively connected to the at least one processor; wherein
the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the information extraction method described in any of the above.
In the solution provided by the first aspect of the embodiments of the present invention, the problem is decomposed into multiple subproblems in a multi-level tree structure, the answer to each subproblem is determined in turn from the bottom up, and the answer to the top-level subproblem is determined last. Decomposition yields several subproblems that are clearer and from which an accurate answer is easier to extract. The answer to the subproblem at one level is used to update the subproblem at the level above it, so that the answer to the open problem is finally extracted accurately, which improves extraction precision and substantially improves extraction accuracy for complex questions.
To make the above objects, features and advantages of the present invention clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
To explain the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present invention; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 shows a flow chart of an information extraction method provided by an embodiment of the present invention;
Fig. 2 shows a schematic diagram of a multi-level tree structure composed of subproblems, provided by an embodiment of the present invention;
Fig. 3 shows a flow chart of the method for extracting the answer to the target subproblem in the information extraction method provided by an embodiment of the present invention;
Fig. 4 shows a schematic structural diagram of an information extraction apparatus provided by an embodiment of the present invention;
Fig. 5 shows a schematic structural diagram of an electronic device for executing the information extraction method provided by an embodiment of the present invention.
Detailed description
In the description of the present invention, it should be understood that orientation or positional terms such as "center", "longitudinal", "transverse", "length", "width", "thickness", "upper", "lower", "front", "rear", "left", "right", "vertical", "horizontal", "top", "bottom", "inner", "outer", "clockwise" and "counterclockwise" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the device or element referred to must have a particular orientation or be constructed and operated in a particular orientation; they therefore should not be construed as limiting the present invention.
In addition, the terms "first" and "second" are used for descriptive purposes only and should not be understood as indicating or implying relative importance or implicitly indicating the number of the technical features referred to. A feature defined with "first" or "second" may therefore explicitly or implicitly include one or more such features. In the description of the present invention, "multiple" means two or more, unless specifically defined otherwise.
In the present invention, unless otherwise expressly specified and limited, terms such as "installed", "connected", "coupled" and "fixed" should be understood broadly. For example, a connection may be fixed, detachable or integral; it may be mechanical or electrical; it may be direct, indirect through an intermediary, or an internal connection between two elements. A person of ordinary skill in the art can understand the specific meaning of these terms in the present invention according to the specific circumstances.
An information extraction method provided by an embodiment of the present invention, as shown in Fig. 1, comprises:
Step 101: obtain an open problem, and decompose the open problem into multiple subproblems according to a multi-level tree structure; the subproblems include at least a leaf subproblem and a root problem, and the answer to a subproblem at the current level matches the expandable information in the corresponding subproblem at the next-higher level.
In this embodiment, the open problem is a question that needs to be answered according to preset text content; after the open problem is obtained, it is divided into multiple subproblems. The subproblems are represented as a multi-level tree structure, that is, each subproblem corresponds to one node of the tree: each leaf node corresponds to one subproblem (a leaf subproblem), and the root node at the top level corresponds to one subproblem, the root problem. The root problem corresponds exactly to the open problem. The leaf subproblems are the subproblems at the lowest level and the root problem is the subproblem at the highest level; a schematic diagram of a multi-level tree structure composed of subproblems is shown in Fig. 2.
The hierarchy among subproblems in this embodiment is determined by the answers to the subproblems and their expandable information. Specifically, the answer to a subproblem at the current level matches the expandable information in the corresponding subproblem at the next-higher level. "Expandable information" here refers to information in a subproblem that can be expanded (or obtained by expansion); it may be a word, a phrase, a clause or the like in the subproblem.
For example, the open problem is "Where did the current principal of the middle school previously work?". It can be decomposed into two subproblems: "Who is the current principal of the middle school?" and "Where did he previously work?". The "he" in the second subproblem refers to "the current principal of the middle school", so "he" is a piece of expandable information that matches the answer to the first subproblem; the second subproblem is therefore the next-higher-level subproblem of the first. Since the open problem is decomposed into only two subproblems, the first subproblem "Who is the current principal of the middle school?" is the leaf subproblem and the second subproblem "Where did he previously work?" is the root problem.
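The tree described above can be held in a very small data structure. The following is a minimal sketch, not the patent's implementation: the class name `SubQuestion` and its fields `text`, `slot` and `children` are hypothetical names chosen for illustration, with `slot` standing for the expandable information in the parent that a node's answer is meant to fill.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class SubQuestion:
    """One node of the multi-level tree (all names are illustrative)."""
    text: str                     # wording of this subproblem
    slot: Optional[str] = None    # expandable information in the PARENT that this node's answer fills
    children: List["SubQuestion"] = field(default_factory=list)

    def is_leaf(self) -> bool:
        # A leaf subproblem has no lower-level subproblems.
        return not self.children


# The two-node example from the text: the leaf's answer fills "he" in the root.
leaf = SubQuestion("Who is the current principal of the middle school?", slot="he")
root = SubQuestion("Where did he previously work?", children=[leaf])
```

Storing the slot on the child keeps the match between an answer and the expandable information it replaces explicit, which is the relationship the text uses to define the hierarchy.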
The open problem can be decomposed by a decomposition model. Specifically, "decomposing the open problem into multiple subproblems according to a multi-level tree structure" in step 101 may include:
Step A1: establish a problem decomposition model, and obtain in advance sample open problems and, for each of them, the multiple sample subproblems of the corresponding multi-level tree structure.
Step A2: train the problem decomposition model with a sample open problem as input and the corresponding sample subproblems of the multi-level tree structure as output, and determine the trained problem decomposition model.
Step A3: after an open problem is obtained, take the open problem as input and determine, based on the trained problem decomposition model, the multiple subproblems of the multi-level tree structure corresponding to the open problem.
In this embodiment, sample open problems and their corresponding sample subproblems are obtained in advance and used as training samples to train the problem decomposition model, which determines the parameters of the model and yields the trained problem decomposition model. Whenever an open problem later needs to be decomposed, the trained model is used. The problem decomposition model may specifically be a Transformer model: the model encodes and decodes the input open problem and uses a self-attention mechanism during encoding and decoding to obtain the subproblems that the open problem contains.
Alternatively, the open problem can be decomposed from the top down by natural language understanding. Specifically, the open problem is taken as the root problem, the expandable information in the root problem is determined, and the next-lower-level subproblem matching each piece of expandable information is determined; then the subproblems one level further down that match the expandable information of those subproblems are determined, and so on until the subproblems determined contain no expandable information, that is, the leaf subproblems at the lowest level contain no expandable information.
For example, the open problem is "Where did the current principal of the middle school previously work?", in which "the current principal of the middle school" is expandable information, so the corresponding next-lower-level subproblem "Who is the current principal of the middle school?" is generated. That subproblem contains no expandable information, so it is the leaf subproblem after decomposition, and the root problem is "Where did the current principal of the middle school previously work?".
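The top-down procedure can be sketched as a short recursion. This is only an illustration under stated assumptions: the text does not say how expandable information is detected, so the detector is passed in as a caller-supplied function (`find_expandable`, a hypothetical name) and is replaced here by a hand-written lookup table; the dictionary keys `question`, `slot` and `children` are likewise illustrative.

```python
from typing import Callable, Dict, List, Tuple

Node = Dict[str, object]
# A detector returns (expandable span, lower-level subproblem) pairs for a question.
Detector = Callable[[str], List[Tuple[str, str]]]


def decompose(question: str, find_expandable: Detector) -> Node:
    """Build the tree top-down; recursion stops at questions with no expandable information."""
    children = []
    for span, sub_question in find_expandable(question):
        child = decompose(sub_question, find_expandable)
        child["slot"] = span  # the span in the parent that the child's answer will replace
        children.append(child)
    return {"question": question, "slot": None, "children": children}


# Stand-in detector for the example in the text (a real system would use NLU here).
ROOT_Q = "Where did the current principal of the middle school previously work?"
RULES = {
    ROOT_Q: [("the current principal of the middle school",
              "Who is the current principal of the middle school?")],
}

tree = decompose(ROOT_Q, lambda q: RULES.get(q, []))
```

In this example the recursion produces one root and one leaf, because the generated subproblem has no entry in the table and therefore no expandable information.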
Step 102: choose a leaf subproblem as the target subproblem, extract the answer to the target subproblem from the preset text content, and update the matching expandable information in the subproblem at the next-higher level to the answer to the target subproblem; then take the other leaf subproblems as the target subproblem and repeat the above steps until all expandable information in the subproblem at the next-higher level has been updated.
Step 103: take the next-higher-level subproblem whose expandable information has all been updated as the target subproblem and repeat the above steps until the root problem is taken as the target subproblem.
In this embodiment, after all subproblems are determined, the answer to each subproblem needs to be obtained in turn from the preset text content. That is, the traditional approach of obtaining the answer to the open problem directly from the text content is replaced by obtaining the answers to its subproblems one after another and finally extracting the answer to the open problem. The answers are determined following the multi-level tree structure from the bottom up: the answers to the leaf subproblems are determined first, then the answers to the subproblems at each next-higher level in turn, until the answer to the root problem is finally determined.
Specifically, step 102 first determines the answers to the leaf subproblems, which determines the content corresponding to the expandable information in the subproblems at the next-higher level. Once all expandable information in a next-higher-level subproblem has been updated, step 103 can be performed, that is, the answer to that subproblem is extracted from the text content, and so on until the content corresponding to all expandable information in the root problem has been determined; the root problem can then be taken as the target subproblem.
For example, the open problem is "What post did the current principal of Middle School A hold while at the kindergarten founded by Li Si?". It is divided into three subproblems: subproblem 1, "Who is the current principal of Middle School A?"; subproblem 2, "Which kindergarten did Li Si found?"; and subproblem 3, "What post did someone hold while at somewhere?". Subproblems 1 and 2 are leaf subproblems and subproblem 3 is the root problem. The preset text content is: "Zhang San and Li Si are good friends. Li Si once founded Kindergarten B and invited Zhang San to work there as form master. ... Later, Zhang San went to Primary School C as dean of studies. ... After that, Zhang San moved to Middle School A. ... At present, the current principal of Middle School A is Zhang San." In this embodiment the answers to the leaf subproblems are determined first: according to the text content, the answer to subproblem 1 is "Zhang San" and the answer to subproblem 2 is "Kindergarten B". "Someone" and "somewhere" in the root problem are expandable information corresponding to subproblems 1 and 2 respectively. After the expandable information is updated with the answers to the lower-level subproblems, the root problem becomes "What post did Zhang San hold while at Kindergarten B?", and its answer can now be extracted accurately from the text content.
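Steps 102 and 103 amount to a post-order walk of the tree: every lower-level subproblem is answered before its parent is rewritten and answered. A minimal sketch of that walk follows, assuming plain string replacement for updating expandable information and a stub dictionary in place of a real QA model; `SubQuestion`, `resolve`, `stub_extract` and `STUB_ANSWERS` are hypothetical names, not part of the patent.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class SubQuestion:
    text: str
    slot: Optional[str] = None    # expandable information in the parent filled by this node's answer
    children: List["SubQuestion"] = field(default_factory=list)


def resolve(node: SubQuestion, extract: Callable[[str], str]) -> str:
    """Answer all children first, substitute their answers into this question, then answer it."""
    question = node.text
    for child in node.children:
        child_answer = resolve(child, extract)
        # Update the matching expandable information with the child's answer.
        question = question.replace(child.slot, child_answer)
    return extract(question)


# Stub standing in for answer extraction from the preset text content.
STUB_ANSWERS = {
    "Who is the current principal of Middle School A?": "Zhang San",
    "Which kindergarten did Li Si found?": "Kindergarten B",
    "What post did Zhang San hold while at Kindergarten B?": "form master",
}


def stub_extract(question: str) -> str:
    return STUB_ANSWERS[question]


root = SubQuestion(
    "What post did someone hold while at somewhere?",
    children=[
        SubQuestion("Who is the current principal of Middle School A?", slot="someone"),
        SubQuestion("Which kindergarten did Li Si found?", slot="somewhere"),
    ],
)
final_answer = resolve(root, stub_extract)
```

The recursion visits the same nodes in the same dependency order as the level-by-level description in the text; the only difference is that it does not process the tree strictly one level at a time.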
Step 104: extract the answer to the target subproblem from the preset text content, and use the answer extracted at this point as the answer to the open problem.
In this embodiment, the root problem corresponds exactly to the open problem, so the answer determined when the root problem is the target subproblem can be used as the answer to the open problem. In the example above, the answer to the root problem "What post did Zhang San hold while at Kindergarten B?" is the answer to the open problem "What post did the current principal of Middle School A hold while at the kindergarten founded by Li Si?", namely "form master".
Traditional question answering models usually try to extract the answer to a question in a single pass. In contrast, the information extraction method provided by the embodiment of the present invention decomposes the problem into multiple subproblems in a multi-level tree structure, determines the answer to each subproblem in turn from the bottom up, and determines the answer to the top-level subproblem last. Decomposition yields several subproblems that are clearer and from which an accurate answer is easier to extract. The answer to the subproblem at one level is used to update the subproblem at the level above it, so that the answer to the open problem is finally extracted accurately, which improves extraction precision and substantially improves extraction accuracy for complex questions.
On the basis of the above embodiment, as shown in Fig. 3, "extracting the answer to the target subproblem from the preset text content" in step 102 and step 104 specifically includes:
Step 301: divide the text content into multiple text units, a text unit being one or more of a word, a phrase, a sentence and a paragraph.
Step 302: determine the similarity between each text unit and the target subproblem, take the text units whose similarity is greater than a preset threshold as effective text units, and extract the answer to the target subproblem from all effective text units.
Text content generally contains a large amount of information, so extracting the answer to every subproblem from the complete text content would hurt processing efficiency. In this embodiment, the text content is divided into multiple text units by word, phrase, sentence or paragraph, and the text units relevant to the current target subproblem are selected from all text units to extract the answer. This reduces the amount of processing needed to determine a subproblem's answer, shortens computation time and improves extraction efficiency. Which text units are relevant to the target subproblem is determined by computing the similarity between the target subproblem and each text unit. The preset threshold may be a fixed value set in advance; it may also be determined after the similarities between the text units and the target subproblem have been computed, so that the several text units with the highest similarity are selected.
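Steps 301 and 302 can be sketched as a split followed by a filter. The sketch below makes several assumptions that are not in the text: text units are sentences, sentences are split naively on terminal punctuation followed by whitespace, and similarity scores are supplied by the caller. The function names `split_sentences` and `select_units` and the parameters `threshold` and `top_k` are hypothetical; `top_k` illustrates the variant in which the threshold is chosen after scoring so that the highest-scoring units are kept.

```python
import re
from typing import List, Optional, Sequence


def split_sentences(text: str) -> List[str]:
    """Naive sentence splitter: break after '.', '!' or '?' when followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text)
    return [p.strip() for p in parts if p.strip()]


def select_units(units: Sequence[str], scores: Sequence[float],
                 threshold: Optional[float] = None,
                 top_k: Optional[int] = None) -> List[str]:
    """Keep units scoring above a fixed threshold, or the top_k best units in document order."""
    if top_k is not None:
        best = sorted(range(len(units)), key=lambda j: scores[j], reverse=True)[:top_k]
        return [units[j] for j in sorted(best)]
    if threshold is None:
        raise ValueError("either threshold or top_k must be given")
    return [u for u, s in zip(units, scores) if s > threshold]


units = split_sentences("Li Si founded Kindergarten B. Zhang San worked there. He later moved.")
effective = select_units(units, [0.2, 0.9, 0.5], threshold=0.4)
```

A real splitter would need to handle abbreviations, decimals and languages without spaces after punctuation; the regular expression here is only enough for the short example.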
Specifically, "determining the similarity between each text unit and the target subproblem" in step 302 includes:
Step B1: perform word segmentation on the target subproblem and determine all of its segmented words $p_i$, $i \in [1, m]$, where $m$ is the number of segmented words in the target subproblem.
In this embodiment, word segmentation is first performed on the target subproblem to determine its $m$ segmented words, that is, $p_1, p_2, \ldots, p_m$. Word segmentation may be performed with a word segmentation model; this embodiment does not limit this.
Step B2: determine the relevance $r_{ij}$ between each segmented word $p_i$ and text unit $D_j$, and the weight $\omega_i$ of each segmented word, where:
$$r_{ij} = g_1\!\left(\frac{f_{ij} \cdot avgl}{l_j}\right), \qquad \omega_i = g_2\!\left(\frac{N}{n(p_i) + \lambda}\right)$$
Here $f_{ij}$ denotes the frequency of segmented word $p_i$ in text unit $D_j$, $l_j$ denotes the length of text unit $D_j$, $avgl$ denotes the average length of all text units, $N$ denotes the total number of text units, $j \in [1, N]$, $\lambda$ is a preset non-zero adjustment coefficient, and $n(p_i)$ denotes the number of text units containing $p_i$; $g_1(\cdot)$ and $g_2(\cdot)$ are positively correlated (increasing) functions.
In this embodiment, the text content is divided into $N$ text units $D_j$ and the target subproblem contains $m$ segmented words $p_i$; the weight of each segmented word can then be determined. Specifically, the weight is determined from the share of the text units containing $p_i$, $n(p_i)$, in the total number of text units $N$: the smaller the share, the larger the weight. For example, if the text content is divided into 100 text units and the segmented word "Zhang San" occurs in only two of them, its weight is relatively large; if a segmented word such as "is" occurs in 99 text units, it is a common word and its weight is relatively low. In this embodiment, $\omega_i = g_2\!\left(\frac{N}{n(p_i) + \lambda}\right)$, where $\lambda$ is a preset non-zero adjustment coefficient that prevents a zero denominator when $n(p_i)$ is zero. $g_2(\cdot)$ is a positively correlated function, that is, the larger $\frac{N}{n(p_i) + \lambda}$ is, the larger the corresponding weight $\omega_i$. $g_2(\cdot)$ may specifically be a linear function, an exponential function, a logarithmic function or the like; this embodiment does not limit this.
Meanwhile, for each text unit, its relevance $r_{ij}$ to each segmented word of the target subproblem can be determined. The longer text unit $D_j$ is, the more easily $p_i$ occurs in it, and so the lower the relevance between the two. Specifically, $r_{ij} = g_1\!\left(\frac{f_{ij} \cdot avgl}{l_j}\right)$, where $g_1(\cdot)$, like $g_2(\cdot)$, is a positively correlated function.
Step B3: compute the similarity $R_j$ between the target subproblem and text unit $D_j$ from the relevances and weights of all segmented words of the target subproblem:
$$R_j = \sum_{i=1}^{m} \omega_i \, r_{ij}$$
In this embodiment, once the relevance between each segmented word and $D_j$ has been determined, the similarity $R_j$ between the target subproblem and text unit $D_j$ can be determined. This embodiment determines the similarity between the target subproblem and the text units quickly and accurately, which makes it easy to select the text units relevant to the target subproblem for answer extraction.
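Steps B1 to B3 can be illustrated with a short scoring routine. Because the formulas are only described through their symbol definitions, the sketch assumes the forms r_ij = g1(f_ij * avgl / l_j), w_i = g2(N / (n(p_i) + lambda)) and R_j = sum_i w_i * r_ij, which agree with those definitions but are an assumption. It further assumes that text units are already tokenized and non-empty, that length is counted in tokens, that g1 is the identity and g2 is log(1 + x), and that lambda is 0.5; the function name `similarity_scores` is hypothetical.

```python
import math
from typing import Callable, List


def similarity_scores(question_tokens: List[str],
                      units: List[List[str]],
                      lam: float = 0.5,
                      g1: Callable[[float], float] = lambda x: x,
                      g2: Callable[[float], float] = lambda x: math.log(1.0 + x)) -> List[float]:
    """Return R_j for every (non-empty, tokenized) text unit D_j."""
    n_units = len(units)
    avgl = sum(len(u) for u in units) / n_units  # average unit length in tokens

    # Weight of each question token: rarer across units -> larger weight.
    weights = []
    for p in question_tokens:
        n_p = sum(1 for u in units if p in u)    # number of units containing p
        weights.append(g2(n_units / (n_p + lam)))

    scores = []
    for unit in units:
        score = 0.0
        for p, w in zip(question_tokens, weights):
            f = unit.count(p)                    # term frequency f_ij
            score += w * g1(f * avgl / len(unit))  # length-normalized relevance r_ij
        scores.append(score)
    return scores


units = ["zhang san is the principal".split(), "li si founded the kindergarten".split()]
scores = similarity_scores(["zhang", "san"], units)
```

With these choices a unit that contains none of the question's tokens scores exactly zero, and common tokens contribute little because their weight shrinks as n(p_i) approaches N.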
Optionally, on the basis of the above embodiment, "extracting the answer to the target subproblem from the preset text content" in step 102 and step 104 specifically includes:
Step C1: when multiple answers to the target subproblem are extracted from the text content, take all of them as candidate answers and determine the confidence of each candidate answer.
Step C2: take the candidate answer with the highest confidence as the finally extracted answer to the target subproblem.
In this embodiment, several answers, that is, several candidate answers, may be found when extracting the answer to the target subproblem from the text content; a unique answer is then determined according to the confidence of each candidate answer. The confidence of each candidate answer may specifically be determined by a question answering model.
The information extraction method provided by the embodiment of the present invention decomposes the problem into multiple subproblems in a multi-level tree structure, determines the answer to each subproblem in turn from the bottom up, and determines the answer to the top-level subproblem last. Decomposition yields several subproblems that are clearer and from which an accurate answer is easier to extract. The answer to the subproblem at one level is used to update the subproblem at the level above it, so that the answer to the open problem is finally extracted accurately, which improves extraction precision and substantially improves extraction accuracy for complex questions. The text content is divided into multiple text units, and the text units relevant to the current target subproblem are selected from all text units to extract the answer to the target subproblem, which reduces the amount of processing needed to determine a subproblem's answer, shortens computation time and improves extraction efficiency.
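Steps C1 and C2 reduce to taking the maximum over candidate answers by confidence. A minimal sketch follows, assuming the candidates arrive as (answer, confidence) pairs from some QA model; the function name `pick_answer` and the confidence values are made up for illustration.

```python
from typing import List, Tuple


def pick_answer(candidates: List[Tuple[str, float]]) -> str:
    """Return the candidate answer with the highest confidence."""
    if not candidates:
        raise ValueError("no candidate answers were extracted")
    best_answer, _ = max(candidates, key=lambda c: c[1])
    return best_answer


# Hypothetical candidates for "What post did Zhang San hold while at Kindergarten B?"
chosen = pick_answer([("form master", 0.91), ("dean of studies", 0.40)])
```

When two candidates tie, `max` keeps the first one encountered, so the order in which candidates are produced acts as the tie-breaker.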
The method flow of information extraction is described in detail above, this method can also be realized by corresponding device, below The structure and function of the device is discussed in detail.
An information extraction device provided by an embodiment of the invention, as shown in Fig. 4, comprises:
A problem decomposition module 41, configured to obtain an open problem and decompose the open problem into multiple subproblems according to a multi-level tree structure; the subproblems include at least a leaf subproblem and a root problem, and the answer to a subproblem at the current level matches expandable information in the corresponding subproblem one level above;
A subproblem answer extraction module 42, configured to select a leaf subproblem as the target subproblem, extract the answer to the target subproblem from preset text content, and update the matching expandable information in the subproblem one level above to the answer to the target subproblem; then take another leaf subproblem as the target subproblem and repeat the above steps until all expandable information in the subproblem one level above has been updated; then take the upper-level subproblem whose expandable information has all been updated as the target subproblem and repeat the above steps until the root problem is taken as the target subproblem;
An open-problem answer extraction module 43, configured to extract the answer to the target subproblem from the preset text content and take the answer extracted at this point as the answer to the open problem.
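The bottom-up procedure carried out by modules 42 and 43 can be sketched as a post-order traversal of the subproblem tree. The following is a minimal illustration under stated assumptions, not the patented implementation: the SubProblem class, the "{0}"-style placeholders standing in for expandable information, the resolve function, and the dictionary-backed toy_extract are all hypothetical names introduced here.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SubProblem:
    """One node of the decomposition tree (hypothetical structure)."""
    text: str  # a non-leaf text holds placeholders "{0}", "{1}", ... for child answers
    children: List["SubProblem"] = field(default_factory=list)


def resolve(node: SubProblem, extract: Callable[[str], str]) -> str:
    """Answer all children first, write their answers into the parent, then answer the parent."""
    answers = [resolve(child, extract) for child in node.children]
    # Replace each placeholder (expandable information) with the matching child answer.
    question = node.text.format(*answers) if answers else node.text
    return extract(question)


# Toy stand-in for answer extraction from text content (assumption for illustration).
KNOWLEDGE = {
    "Who founded company A?": "Zhang San",
    "Which university did Zhang San attend?": "University B",
}


def toy_extract(question: str) -> str:
    return KNOWLEDGE[question]


root = SubProblem(
    "Which university did {0} attend?",
    [SubProblem("Who founded company A?")],
)
answer = resolve(root, toy_extract)
```

Here the leaf is answered first, its answer replaces the placeholder in the root problem, and the answer to the updated root problem is taken as the answer to the open problem.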
On the basis of the above embodiment, the problem decomposition module 41 comprises:
A model building unit, configured to build a problem decomposition model and to obtain in advance a sample open problem together with the multiple sample subproblems of the multi-level tree structure corresponding to that sample open problem;
A training unit, configured to train the problem decomposition model with the sample open problem as input and the multiple sample subproblems of the corresponding multi-level tree structure as output, thereby determining the trained problem decomposition model;
A problem decomposition unit, configured to, after an open problem is obtained, take the open problem as input and determine, based on the trained problem decomposition model, the multiple subproblems of the multi-level tree structure corresponding to the open problem.
On the basis of the above embodiment, the subproblem answer extraction module 42 extracting the answer to the target subproblem from the preset text content comprises:
Dividing the text content into multiple text units, each text unit being one or more of a word, a phrase, a sentence, and a paragraph;
Determining the similarity between each text unit and the target subproblem, taking the text units whose similarity is greater than a preset threshold as effective text units, and extracting the answer to the target subproblem from all effective text units.
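The two steps above (splitting, then threshold filtering) can be sketched as follows. This is only an illustration: sentence-level units, the word-overlap (Jaccard) similarity, the threshold value 0.2, and all function names are assumptions chosen for the example and are not taken from the patent.

```python
import re
from typing import Callable, List


def split_into_sentences(text: str) -> List[str]:
    """Split text into sentence-level text units on common end-of-sentence marks."""
    parts = re.split(r"[.!?。！？]+", text)
    return [p.strip() for p in parts if p.strip()]


def jaccard(unit: str, question: str) -> float:
    """Toy similarity: overlap of lowercase word sets (stand-in for the real measure)."""
    a, b = set(unit.lower().split()), set(question.lower().split())
    return len(a & b) / len(a | b) if a | b else 0.0


def select_effective_units(units: List[str], question: str,
                           similarity: Callable[[str, str], float],
                           threshold: float) -> List[str]:
    """Keep only the units whose similarity to the question exceeds the threshold."""
    return [u for u in units if similarity(u, question) > threshold]


text = ("Company A was founded by Zhang San. The weather was fine. "
        "Zhang San attended University B.")
units = split_into_sentences(text)
effective = select_effective_units(units, "Who founded Company A", jaccard, 0.2)
```

Only the effective units are then passed to answer extraction, which is where the reduction in processing comes from.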
On the basis of the above embodiment, the subproblem answer extraction module 42 determining the similarity between each text unit and the target subproblem comprises:
Performing word segmentation on the target subproblem to determine all tokens p_i of the target subproblem, i ∈ [1, m], where m is the number of tokens in the target subproblem;
Determining the degree of correlation r_ij between each token p_i and text unit D_j, and the weight ω_i of each token; the formulas for r_ij and ω_i are not reproduced in this text;
In those formulas, f_ij denotes the frequency of token p_i in text unit D_j, l_j denotes the length of text unit D_j, avgl denotes the average length of all text units, N denotes the total number of text units, j ∈ [1, N], λ is a preset non-zero adjustment coefficient, and n(p_i) denotes the number of text units containing token p_i; g1() and g2() are positively correlated functions;
Calculating the similarity R_j between the target subproblem and text unit D_j from the degrees of correlation and weights of all tokens of the target subproblem; the formula for R_j is likewise not reproduced in this text.
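The formulas for r_ij, ω_i and R_j are missing from this text, so the sketch below does not reproduce them. As a rough illustration only, it assumes an Okapi BM25-style form built from the same quantities: a relevance term that grows with f_ij and is damped by l_j/avgl via λ, a weight that grows as n(p_i) shrinks relative to N, and R_j as the weighted sum over tokens. The specific functional forms, the default λ = 1.2, and the function name are assumptions.

```python
import math
from typing import List


def similarity_scores(question_tokens: List[str],
                      units: List[List[str]],
                      lam: float = 1.2) -> List[float]:
    """Return one score R_j per text unit; each unit is a list of tokens."""
    if not units:
        return []
    n_units = len(units)                                   # N
    avgl = sum(len(u) for u in units) / n_units            # average unit length
    scores = []
    for unit in units:
        score = 0.0
        for token in question_tokens:
            f = unit.count(token)                          # f_ij
            if f == 0:
                continue                                   # absent token contributes nothing
            n_containing = sum(1 for u in units if token in u)  # n(p_i)
            # Assumed weight: larger when fewer units contain the token.
            weight = math.log(1 + (n_units - n_containing + 0.5) / (n_containing + 0.5))
            # Assumed relevance: increases with f_ij, damped for longer units.
            relevance = f * (lam + 1) / (f + lam * len(unit) / avgl)
            score += weight * relevance                    # accumulate into R_j
        scores.append(score)
    return scores


units = [["zhang", "san", "founded", "company", "a"],
         ["the", "weather", "was", "fine"]]
scores = similarity_scores(["founded", "company"], units)
```

With these toy units, the first unit receives a positive score and the second, which shares no token with the question, scores zero.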
On the basis of the above embodiment, the subproblem answer extraction module 42 extracting the answer to the target subproblem from the preset text content comprises:
When multiple answers to the target subproblem are extracted from the text content, taking all of the answers as candidate answers and determining the confidence of each candidate answer;
Taking the candidate answer with the highest confidence as the finally extracted answer to the target subproblem.
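Selecting among candidate answers reduces to an argmax over confidence. A minimal sketch, assuming each candidate arrives as an (answer, confidence) pair already scored by some question-answering model; the pair format and the function name pick_answer are hypothetical.

```python
from typing import List, Tuple


def pick_answer(candidates: List[Tuple[str, float]]) -> str:
    """Return the candidate answer with the highest confidence."""
    if not candidates:
        raise ValueError("no candidate answers were extracted")
    best_answer, _ = max(candidates, key=lambda pair: pair[1])
    return best_answer


chosen = pick_answer([("Li Si", 0.41), ("Zhang San", 0.87)])
```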
The information extraction device provided by this embodiment of the invention decomposes a problem into multiple subproblems arranged in a multi-level tree structure, determines the answer to each subproblem level by level from the bottom up, and finally determines the answer to the top-level subproblem. Decomposition yields multiple clearer subproblems from which accurate answers are easier to extract. The answer to each subproblem is used to update the subproblem one level above it, so that the answer to the open problem is finally extracted accurately; this improves extraction precision, and substantially improves it for complex problems. The text content is divided into multiple text units, and only the text units relevant to the current target subproblem are selected for extracting its answer, which reduces the amount of processing needed to determine each subproblem answer, shortens computation time, and improves extraction efficiency.
An embodiment of the invention further provides a computer storage medium storing computer-executable instructions, which include a program for performing the above information extraction method; the computer-executable instructions can perform the method of any of the above method embodiments.
The computer storage medium may be any available medium or data storage device that a computer can access, including but not limited to magnetic storage (such as floppy disks, hard disks, magnetic tape, and magneto-optical disks (MO)), optical storage (such as CD, DVD, BD, and HVD), and semiconductor memory (such as ROM, EPROM, EEPROM, non-volatile memory (NAND FLASH), and solid-state drives (SSD)).
Fig. 5 shows a structural block diagram of an electronic device according to another embodiment of the invention. The electronic device 1100 may be a host server with computing capability, a personal computer (PC), a portable computer, a terminal, or the like. The specific embodiments of the invention do not limit the specific implementation of the electronic device.
The electronic device 1100 includes at least one processor 1110, a communications interface 1120, a memory (memory array) 1130, and a bus 1140. The processor 1110, the communications interface 1120, and the memory 1130 communicate with one another via the bus 1140.
The communications interface 1120 is used to communicate with network elements, where the network elements include, for example, a virtual machine management center and shared storage.
The processor 1110 is used to execute programs. The processor 1110 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of the invention.
The memory 1130 is used to store executable instructions. The memory 1130 may include high-speed RAM and may also include non-volatile memory, for example at least one disk storage. The memory 1130 may also be a memory array. The memory 1130 may also be divided into blocks, and the blocks may be combined into virtual volumes according to certain rules. The instructions stored in the memory 1130 can be executed by the processor 1110, so that the processor 1110 can perform the information extraction method of any of the above method embodiments.
The above is merely a specific embodiment of the invention, but the scope of protection of the invention is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed by the invention shall fall within the scope of protection of the invention. Therefore, the scope of protection of the invention shall be subject to the scope of protection of the claims.

Claims (10)

1. An information extraction method, characterized by comprising:
Obtaining an open problem and decomposing the open problem into multiple subproblems according to a multi-level tree structure; the subproblems include at least a leaf subproblem and a root problem, and the answer to a subproblem at the current level matches expandable information in the corresponding subproblem one level above;
Selecting a leaf subproblem as the target subproblem, extracting the answer to the target subproblem from preset text content, and updating the matching expandable information in the subproblem one level above to the answer to the target subproblem; then taking another leaf subproblem as the target subproblem and repeating the above steps until all expandable information in the subproblem one level above has been updated;
Taking the upper-level subproblem whose expandable information has all been updated as the target subproblem and repeating the above steps until the root problem is taken as the target subproblem;
Extracting the answer to the target subproblem from the preset text content, and taking the answer to the target subproblem extracted at this point as the answer to the open problem.
2. The method according to claim 1, characterized in that decomposing the open problem into multiple subproblems according to a multi-level tree structure comprises:
Building a problem decomposition model, and obtaining in advance a sample open problem together with the multiple sample subproblems of the multi-level tree structure corresponding to that sample open problem;
Training the problem decomposition model with the sample open problem as input and the multiple sample subproblems of the corresponding multi-level tree structure as output, thereby determining the trained problem decomposition model;
After an open problem is obtained, taking the open problem as input and determining, based on the trained problem decomposition model, the multiple subproblems of the multi-level tree structure corresponding to the open problem.
3. The method according to claim 1, characterized in that extracting the answer to the target subproblem from the preset text content comprises:
Dividing the text content into multiple text units, each text unit being one or more of a word, a phrase, a sentence, and a paragraph;
Determining the similarity between each text unit and the target subproblem, taking the text units whose similarity is greater than a preset threshold as effective text units, and extracting the answer to the target subproblem from all effective text units.
4. The method according to claim 3, characterized in that determining the similarity between each text unit and the target subproblem comprises:
Performing word segmentation on the target subproblem to determine all tokens p_i of the target subproblem, i ∈ [1, m], where m is the number of tokens in the target subproblem;
Determining the degree of correlation r_ij between each token p_i and text unit D_j, and the weight ω_i of each token; the formulas for r_ij and ω_i are not reproduced in this text;
In those formulas, f_ij denotes the frequency of token p_i in text unit D_j, l_j denotes the length of text unit D_j, avgl denotes the average length of all text units, N denotes the total number of text units, j ∈ [1, N], λ is a preset non-zero adjustment coefficient, and n(p_i) denotes the number of text units containing token p_i; g1() and g2() are positively correlated functions;
Calculating the similarity R_j between the target subproblem and text unit D_j from the degrees of correlation and weights of all tokens of the target subproblem; the formula for R_j is likewise not reproduced in this text.
5. The method according to any one of claims 1 to 4, characterized in that extracting the answer to the target subproblem from the preset text content comprises:
When multiple answers to the target subproblem are extracted from the text content, taking all of the answers as candidate answers and determining the confidence of each candidate answer;
Taking the candidate answer with the highest confidence as the finally extracted answer to the target subproblem.
6. An information extraction device, characterized by comprising:
A problem decomposition module, configured to obtain an open problem and decompose the open problem into multiple subproblems according to a multi-level tree structure; the subproblems include at least a leaf subproblem and a root problem, and the answer to a subproblem at the current level matches expandable information in the corresponding subproblem one level above;
A subproblem answer extraction module, configured to select a leaf subproblem as the target subproblem, extract the answer to the target subproblem from preset text content, and update the matching expandable information in the subproblem one level above to the answer to the target subproblem; then take another leaf subproblem as the target subproblem and repeat the above steps until all expandable information in the subproblem one level above has been updated; then take the upper-level subproblem whose expandable information has all been updated as the target subproblem and repeat the above steps until the root problem is taken as the target subproblem;
An open-problem answer extraction module, configured to extract the answer to the target subproblem from the preset text content and take the answer to the target subproblem extracted at this point as the answer to the open problem.
7. The device according to claim 6, characterized in that the problem decomposition module comprises:
A model building unit, configured to build a problem decomposition model and to obtain in advance a sample open problem together with the multiple sample subproblems of the multi-level tree structure corresponding to that sample open problem;
A training unit, configured to train the problem decomposition model with the sample open problem as input and the multiple sample subproblems of the corresponding multi-level tree structure as output, thereby determining the trained problem decomposition model;
A problem decomposition unit, configured to, after an open problem is obtained, take the open problem as input and determine, based on the trained problem decomposition model, the multiple subproblems of the multi-level tree structure corresponding to the open problem.
8. The device according to claim 6, characterized in that the subproblem answer extraction module extracting the answer to the target subproblem from the preset text content comprises:
Dividing the text content into multiple text units, each text unit being one or more of a word, a phrase, a sentence, and a paragraph;
Determining the similarity between each text unit and the target subproblem, taking the text units whose similarity is greater than a preset threshold as effective text units, and extracting the answer to the target subproblem from all effective text units.
9. A computer storage medium, characterized in that the computer storage medium stores computer-executable instructions, the computer-executable instructions being used to perform the information extraction method according to any one of claims 1 to 5.
10. An electronic device, characterized by comprising:
At least one processor; and
A memory communicatively connected to the at least one processor; wherein
The memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the information extraction method according to any one of claims 1 to 5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910684300.9A CN110389999A (en) 2019-07-26 2019-07-26 A kind of method, apparatus of information extraction, storage medium and electronic equipment

Publications (1)

Publication Number Publication Date
CN110389999A true CN110389999A (en) 2019-10-29

Family

ID=68287543

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910684300.9A Pending CN110389999A (en) 2019-07-26 2019-07-26 A kind of method, apparatus of information extraction, storage medium and electronic equipment

Country Status (1)

Country Link
CN (1) CN110389999A (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108549658A (en) * 2018-03-12 2018-09-18 浙江大学 A kind of deep learning video answering method and system based on the upper attention mechanism of syntactic analysis tree
CN109710928A (en) * 2018-12-17 2019-05-03 新华三大数据技术有限公司 The entity relation extraction method and device of non-structured text

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ALON TALMOR et al.: "The Web as Knowledge-base for Answering Complex Questions", Proceedings of NAACL-HLT 2018 *

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111813961A (en) * 2020-08-25 2020-10-23 腾讯科技(深圳)有限公司 Data processing method and device based on artificial intelligence and electronic equipment
CN112905777A (en) * 2021-03-19 2021-06-04 北京百度网讯科技有限公司 Extended question recommendation method and device, electronic equipment and storage medium
CN112905777B (en) * 2021-03-19 2023-10-17 北京百度网讯科技有限公司 Extended query recommendation method and device, electronic equipment and storage medium
CN113420111A (en) * 2021-06-17 2021-09-21 中国科学院声学研究所 Intelligent question-answering method and device for multi-hop inference problem
CN113420111B (en) * 2021-06-17 2023-08-11 中国科学院声学研究所 Intelligent question answering method and device for multi-hop reasoning problem

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20191029
