CN107622126A - Method and apparatus for classifying entity data in a data set
- Publication number: CN107622126A
- Application number: CN201710903481.0A
- Authority: CN (China)
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Abstract
The present disclosure provides a method for classifying entity data in a data set. The method includes: extracting at least one item of entity data from any data set; determining the entity category to which each item of the at least one item of entity data belongs; and, when second-class entity data classified into an unknown category exists, training an entity classification model on the feature information of first-class entity data whose entity category has been determined, and predicting, by means of the entity classification model, the entity category to which each item of the second-class entity data belongs. The present disclosure also provides an apparatus for classifying entity data in a data set.
Description
Technical field
The present disclosure relates to a method and an apparatus for classifying entity data in a data set.
Background
Structured, standardized data can be recognized by machines and can be widely applied to information retrieval, intelligent input, automatic proofreading, automatic error correction, parallel translation, and the like.
However, as the Internet flourishes, the information on it is expanding rapidly. In this massive body of information, all kinds of information are typically mixed together in a very non-standard way. For example, much of the information has not been structured at all (for instance, the data has no explicit classification), and even where some information has been structured to a degree, the structuring is often incomplete or even confused (for instance, the data is classified inconsistently). As a result, the information cannot be effectively recognized and used by machines.
Summary of the invention
One aspect of the present disclosure provides a method for classifying entity data in a data set. The method includes: extracting at least one item of entity data from any data set, the entity data including words in the data set that have independent meaning and can be used to indicate any object; determining the entity category to which each item of the at least one item of entity data belongs, the entity category being a category in a preset entity category library, wherein the entity category library includes an unknown category into which entity data whose entity category cannot be determined is classified; when second-class entity data classified into the unknown category exists, training an entity classification model on the feature information of first-class entity data whose entity category has been determined; and predicting, by means of the entity classification model, the entity category to which each item of the second-class entity data belongs.
Optionally, when entity data of the unknown category still exists in the second-class entity data after the prediction, the method further includes: mixing the first-class entity data with the entity data in the second-class entity data whose entity category has been determined by prediction, to obtain new training data; retraining the entity classification model on the feature information of the new training data; predicting again, by means of the retrained entity classification model, the entity category of the entity data of the unknown category in the second-class entity data; and, while entity data of the unknown category still exists in the second-class entity data, repeating the mixing, retraining, and re-prediction operations until either no entity data of the unknown category remains in the second-class entity data, in which case the classification is complete, or the prediction results for the entity data of the unknown category in the second-class entity data no longer change, in which case further prediction is abandoned.
Optionally, extracting at least one item of entity data from any data set includes extracting at least one triple from the data set, and extracting the entity data from each of the at least one triple. Each triple includes two items of entity data and a relation predicate between them, the relation predicate describing the relationship between the two items of entity data.
Optionally, determining the entity category to which each item of the at least one item of entity data belongs includes: for each item of entity data, determining at least one relation predicate corresponding to the entity data from at least one triple that includes the entity data; and determining the entity category to which the entity data belongs according to the content of the relation predicate and/or the number of occurrences of the same relation predicate.
Optionally, the feature information includes triple information, relation predicate information, and/or relation predicate frequency information.
Another aspect of the present disclosure provides an apparatus for classifying entity data in a data set. The apparatus includes an entity data extraction module, an entity category determination module, a classification model training module, and an unknown category prediction module. The entity data extraction module is configured to extract at least one item of entity data from any data set, the entity data including words in the data set that have independent meaning and can be used to indicate any object. The entity category determination module is configured to determine the entity category to which each item of the at least one item of entity data belongs, the entity category being a category in a preset entity category library, wherein the entity category library includes an unknown category into which entity data whose entity category cannot be determined is classified. The classification model training module is configured to, when second-class entity data classified into the unknown category exists, train an entity classification model on the feature information of first-class entity data whose entity category has been determined. The unknown category prediction module is configured to predict, by means of the entity classification model, the entity category to which each item of the second-class entity data belongs.
Optionally, the apparatus further includes a combined training data generation module. The combined training data generation module is configured to, when entity data of the unknown category still exists in the second-class entity data after the prediction, mix the first-class entity data with the entity data in the second-class entity data whose entity category has been determined by prediction, to obtain new training data. The classification model training module is further configured to retrain the entity classification model on the feature information of the new training data. The unknown category prediction module is further configured to predict again, by means of the retrained entity classification model, the entity category of the entity data of the unknown category in the second-class entity data.
Optionally, the entity data extraction module includes a triple extraction submodule and an entity data extraction submodule. The triple extraction submodule is configured to extract at least one triple from the data set, each triple including two items of entity data and a relation predicate between them, the relation predicate describing the relationship between the two items of entity data. The entity data extraction submodule is configured to extract the entity data from each of the at least one triple.
Optionally, the entity category determination module includes a relation predicate determination submodule and an entity category determination submodule. The relation predicate determination submodule is configured to, for each item of entity data, determine at least one relation predicate corresponding to the entity data from at least one triple that includes the entity data. The entity category determination submodule is configured to identify the entity category to which the entity data belongs according to the content of the relation predicate and/or the number of occurrences of the same relation predicate.
Optionally, the feature information includes triple information, relation predicate information, and/or relation predicate frequency information.
Another aspect of the present disclosure provides a non-volatile storage medium storing computer-executable instructions which, when executed, implement the method for classifying entity data in a data set as described above.
Another aspect of the present disclosure provides a computer program including computer-executable instructions which, when executed, implement the method for classifying entity data in a data set as described above.
Brief description of the drawings
For a more complete understanding of the present disclosure and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 schematically shows a flowchart of a method for classifying entity data in a data set according to an embodiment of the present disclosure;
Fig. 2 schematically shows a flowchart of a method for classifying entity data in a data set according to another embodiment of the present disclosure;
Fig. 3 schematically shows a flowchart of extracting entity data in the method according to an embodiment of the present disclosure;
Fig. 4 schematically shows a flowchart of determining the entity category to which each item of entity data belongs in the method according to an embodiment of the present disclosure;
Fig. 5 schematically shows a block diagram of an apparatus for classifying entity data in a data set according to an embodiment of the present disclosure; and
Fig. 6 schematically shows a block diagram of an apparatus for classifying entity data in a data set according to another embodiment of the present disclosure.
Detailed description
Hereinafter, embodiments of the present disclosure are described with reference to the accompanying drawings. It should be understood, however, that these descriptions are merely exemplary and are not intended to limit the scope of the present disclosure. In addition, descriptions of well-known structures and techniques are omitted below to avoid unnecessarily obscuring the concepts of the present disclosure.
The terminology used herein is for the purpose of describing particular embodiments only and is not intended to limit the present disclosure. The terms "includes", "comprises", and the like, as used herein, indicate the presence of the stated features, steps, operations, and/or components, but do not exclude the presence or addition of one or more other features, steps, operations, or components.
All terms used herein (including technical and scientific terms) have the meanings commonly understood by those skilled in the art, unless otherwise defined. It should be noted that the terms used herein should be interpreted as having meanings consistent with the context of this specification, and should not be interpreted in an idealized or overly rigid manner.
Where an expression such as "at least one of A, B, and C" is used, it should generally be interpreted in the sense commonly understood by those skilled in the art (for example, "a system having at least one of A, B, and C" includes, but is not limited to, systems having A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B, and C). Where an expression such as "at least one of A, B, or C" is used, it should likewise be interpreted in the sense commonly understood by those skilled in the art (for example, "a system having at least one of A, B, or C" includes, but is not limited to, systems having A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B, and C). Those skilled in the art should also understand that virtually any disjunctive word and/or phrase presenting two or more alternative items, whether in the description, the claims, or the drawings, should be understood to contemplate the possibilities of including one of the items, either of the items, or both items. For example, the phrase "A or B" should be understood to include the possibilities of "A", "B", or "A and B".
Some block diagrams and/or flowcharts are shown in the drawings. It should be understood that some blocks in the block diagrams and/or flowcharts, or combinations thereof, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that the instructions, when executed by the processor, create means for implementing the functions/operations illustrated in the block diagrams and/or flowcharts.
Accordingly, the techniques of the present disclosure may be implemented in hardware and/or software (including firmware, microcode, and the like). In addition, the techniques of the present disclosure may take the form of a computer program product on a computer-readable medium storing instructions, the computer program product being usable by or in connection with an instruction execution system. In the context of the present disclosure, a computer-readable medium may be any medium that can contain, store, communicate, propagate, or transport instructions. For example, a computer-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. Specific examples of computer-readable media include: a magnetic storage device, such as a magnetic tape or a hard disk (HDD); an optical storage device, such as a compact disc (CD-ROM); a memory, such as a random access memory (RAM) or a flash memory; and/or a wired/wireless communication link.
Embodiments of the present disclosure provide a method and an apparatus for classifying entity data in a data set. The method includes extracting at least one item of entity data from any data set, determining the entity category to which each item of the at least one item of entity data belongs, and, when second-class entity data classified into an unknown category exists, training an entity classification model on the feature information of first-class entity data whose entity category has been determined, and predicting, by means of the entity classification model, the entity category to which each item of the second-class entity data belongs. The entity data includes words in the data set that have independent meaning and can be used to indicate any object. The entity category is a category in a preset entity category library, and the entity category library includes an unknown category into which entity data whose entity category cannot be determined is classified.
The method and apparatus provided by the embodiments of the present disclosure can extract and effectively classify the entity data in any data set containing massive data, thereby building a standardized data structure for the data set so that the data set can be effectively recognized and used by machines.
The data set may contain a very large volume of data, for example a data set formed from all the data of a website over one year. The raw data in the data set may be unclassified; it may be classified, but inconsistently and in a rather disorderly, non-standard way; or part of it may be classified in a fairly standard way while another part is disorderly and non-standard.
The method and apparatus provided by the embodiments of the present disclosure extract entity data from any data set, determine the part of the entity data that has definite classification results, and then train an entity classification model on that entity data, so that the entity classification model effectively learns the association between the various kinds of feature information of the first-class entity data and the corresponding entity categories. The trained model is then used to predict the category of the entity data that has no definite classification result.
In this way, the entity data in any data set can be classified effectively and in greater depth. Moreover, because every assigned entity category is a category in the preset entity category library, the resulting entity data classification is concise, uniform, and fairly standard whenever the categories in the preset entity category library are themselves uniform and concise.
The method and apparatus provided by the embodiments of the present disclosure thus help to build a standardized structure for any data set, for example a standardized structure for a full year of data from any website on the Internet. The massive data on the Internet can then be effectively used and analyzed by machines, for example in information retrieval, intelligent input, automatic proofreading, or automatic error correction, which improves the usability and processability of the data.
Fig. 1 schematically shows a flowchart of a method for classifying entity data in a data set according to an embodiment of the present disclosure.
As shown in Fig. 1, the method for classifying entity data in a data set according to an embodiment of the present disclosure includes operations S101 to S104.
In operation S101, at least one item of entity data is extracted from any data set, the entity data including words in the data set that have independent meaning and can be used to indicate any object.
Specifically, the entity data can be used to indicate any object. The object may be, for example, a time, a place, an artistic work, an organization, a building, a tangible product, a virtual product, an event, a title, an award, a person, a number, quantity information, a color, and so on.
Relation predicate data is distinct from entity data. Relation predicate data consists of words that describe the relationship between two items of entity data.
For example, in the sentence "Fan Bingbing's occupation is actor", "Fan Bingbing" is entity data indicating a person object, "actor" is entity data indicating an occupation object, and the word "occupation" is the relation predicate describing the relationship between the two entities "Fan Bingbing" and "actor".
The entity data is a word that has independent meaning. For example, the word "Fan Bingbing" is composed of three Chinese characters and indicates a person object; if the three characters are split apart, no single character on its own clearly indicates an object.
To this end, words with independent meaning can be obtained from a sentence, a passage of text, or a chart. In Chinese, for example, this can be done with a Chinese word segmentation method; in English, for example, words are already delimited by the spaces between them.
In operation S102, the entity category to which each item of the at least one item of entity data belongs is determined. The entity category is a category in a preset entity category library, and the entity category library includes an unknown category into which entity data whose entity category cannot be determined is classified.
Specifically, for a data set that is unclassified or whose classification is rather disorderly, the preset entity category library may be an entity category library with explicit categories compiled from, for example, a standard part-of-speech corpus, the People's Daily word segmentation corpus, or the Xinhua Dictionary.
For a data set in which part of the data is classified in a fairly standard way, the preset entity category library may be a more complete entity category library formed by organizing that well-classified part of the data, and it includes the categories of that part.
The preset entity category library may include, for example, but is not limited to, the following categories: time, place, number, quantity, person, organization, artistic work, tangible product, event, building, award, title, color, education degree, regulation, ethnicity, religion, language, deity, chemical, biological agent, medical treatment, medicine, symptom, disease, body part, organism, animal, food, website, broadcast network, broadcast program, television channel, currency, stock exchange, algorithm, programming language, transportation system, transportation line, and the unknown category.
The categories in the preset category library may also have a hierarchical structure; that is, the library may contain multiple first-level categories, and each first-level category may contain multiple subcategories.
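As a minimal sketch of such a hierarchical category library, the following Python snippet stores first-level categories with their subcategories in a dictionary. The subcategory names and the helper `top_level_of` are hypothetical and chosen only for illustration; the disclosure does not prescribe any particular data structure.

```python
# Hypothetical two-level category library. The first-level names come from the
# list above; the subcategories are invented purely for illustration.
CATEGORY_LIBRARY = {
    "place": ["city", "country"],
    "artistic work": ["film", "book"],
    "person": [],
    "unknown": [],
}


def top_level_of(category):
    """Map a category or subcategory to its first-level category, else "unknown"."""
    for top, subcategories in CATEGORY_LIBRARY.items():
        if category == top or category in subcategories:
            return top
    return "unknown"


print(top_level_of("film"))
```

A flat library is simply the special case in which every subcategory list is empty.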
It will be appreciated that under different application scenarios or different classification standards, the specific contents and relationships of the categories in the resulting preset entity category library may differ.
The entity category to which each item of the at least one item of entity data belongs is then determined. In this process, the entity category of part of the entity data can be determined from its feature information, for example by means of a part-of-speech tagged corpus, the People's Daily word segmentation corpus, or data labels. Entity data whose category cannot be determined, that is, the second-class entity data, can be temporarily placed in the unknown category.
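The initial split into first-class and second-class entity data can be sketched as follows. The snippet assumes a small lookup table, `LEXICON`, standing in for a tagged corpus or data labels; both the table and the function name `initial_classify` are hypothetical.

```python
UNKNOWN = "unknown"

# Hypothetical lexicon standing in for a tagged corpus or existing data labels.
LEXICON = {"Fan Bingbing": "person", "Beijing": "place"}


def initial_classify(entities, lexicon=LEXICON):
    """Split entities into first-class (category known) and second-class (unknown)."""
    first_class = {}   # entity -> determined category
    second_class = []  # entities temporarily placed in the unknown category
    for entity in entities:
        category = lexicon.get(entity, UNKNOWN)
        if category == UNKNOWN:
            second_class.append(entity)
        else:
            first_class[entity] = category
    return first_class, second_class


known, unknown = initial_classify(["Fan Bingbing", "One Night Surprise"])
print(known, unknown)
```

Only the entities returned in `second_class` need to be handled by the trained model in the later operations.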
Then, in operation S103, when second-class entity data classified into the unknown category exists, an entity classification model is trained on the feature information of the first-class entity data whose entity category has been determined.
The feature information of the entity data includes triple information, relation predicate information, and/or relation predicate frequency information.
Triple information includes two items of entity data and the relation predicate describing the relationship between them.
The feature information is illustrated below using the entity data "Fan Bingbing" as an example.
For example, (Fan Bingbing, occupation, actor) is one triple of the entity data "Fan Bingbing", in which "occupation" is one relation predicate of the entity data "Fan Bingbing".
The entity data "Fan Bingbing" may also correspond to other relation predicates, for example (Fan Bingbing, representative works, One Night Surprise), in which "representative works" is another relation predicate.
The same relation predicate may appear in multiple triples of the entity data. For instance, (Fan Bingbing, representative works, I Am Not Pan Jinlian) is another triple related to the entity data "Fan Bingbing", in which the relation predicate "representative works" describes the relationship between "Fan Bingbing" and another item of entity data. The same relation predicate may thus occur multiple times, and this constitutes the relation predicate frequency information of the entity data.
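The relation predicate frequency information described above can be sketched as a per-entity counter. The snippet assumes that triples are plain (subject, predicate, object) tuples and that a predicate is counted for both entities of a triple, ignoring direction; these are illustrative choices, not requirements of the disclosure.

```python
from collections import Counter, defaultdict


def predicate_frequencies(triples):
    """Count, for each entity, how often each relation predicate occurs with it."""
    frequencies = defaultdict(Counter)
    for subject, predicate, obj in triples:
        frequencies[subject][predicate] += 1  # predicate seen with the first entity
        frequencies[obj][predicate] += 1      # and with the second entity
    return frequencies


triples = [
    ("Fan Bingbing", "occupation", "actor"),
    ("Fan Bingbing", "representative works", "One Night Surprise"),
    ("Fan Bingbing", "representative works", "I Am Not Pan Jinlian"),
]
freq = predicate_frequencies(triples)
print(dict(freq["Fan Bingbing"]))
```

Here "representative works" is counted twice for "Fan Bingbing" and "occupation" once, which mirrors the example in the text.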
Suppose it has been determined that the entity data "Fan Bingbing" belongs to the person category. The relation predicates of the person category may then include "occupation", "representative works", and so on.
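One simple way to use such predicate knowledge is a frequency-weighted vote, sketched below. The table `PREDICATE_HINTS` and the function `category_by_predicates` are hypothetical; the disclosure only states that the category can be determined from the content and/or frequency of relation predicates, without fixing a rule.

```python
from collections import Counter

# Hypothetical hints: the category that a predicate typically points to.
PREDICATE_HINTS = {
    "occupation": "person",
    "representative works": "person",
    "capital": "place",
}


def category_by_predicates(predicate_counts, hints=PREDICATE_HINTS):
    """Vote for a category, weighting each known predicate by its frequency."""
    votes = Counter()
    for predicate, count in predicate_counts.items():
        if predicate in hints:
            votes[hints[predicate]] += count
    if not votes:
        return "unknown"  # no predicate gives any evidence
    return votes.most_common(1)[0][0]


print(category_by_predicates({"occupation": 1, "representative works": 2}))
```

An entity whose predicates are all absent from the hint table stays in the unknown category and is left for the trained model.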
The entity classification model is trained on the feature information of the first-class entity data whose entity category has been determined, so that the model learns the association between the feature information of the first-class entity data and the corresponding entity categories.
When the data set is large enough to supply sufficient training data, the entity classification model can fully learn the association (for example, the degree of association) between the feature information of the first-class entity data and the corresponding entity categories.
In operation S104, the entity category to which each item of the second-class entity data belongs is predicted by means of the entity classification model.
Specifically, the entity classification model is trained repeatedly on the feature information of the first-class entity data whose entity category has been determined, so that the model learns the association between the various kinds of feature information of the first-class entity data and the corresponding entity categories. The trained entity classification model is then used to predict the entity category to which each item of the second-class entity data belongs. In this way, the entity category of at least part of the second-class entity data can be predicted, so that the categories of more entity data in the data set are made explicit and the data structure of the data set becomes more standardized.
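Operations S103 and S104 can be sketched with any supervised classifier. The disclosure does not name a model, so the following snippet uses a small multinomial naive Bayes classifier with add-one smoothing over relation predicate counts purely as an assumed stand-in; the entity names and the functions `train` and `predict` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train(features, labels):
    """S103: accumulate predicate counts per category from first-class entity data."""
    class_counts = Counter(labels.values())
    predicate_counts = defaultdict(Counter)
    for entity, category in labels.items():
        predicate_counts[category].update(features[entity])
    vocabulary = {p for counts in predicate_counts.values() for p in counts}
    return class_counts, predicate_counts, vocabulary


def predict(model, entity_predicates):
    """S104: return the most likely category, or "unknown" without any evidence."""
    class_counts, predicate_counts, vocabulary = model
    known = {p: n for p, n in entity_predicates.items() if p in vocabulary}
    if not known:
        return "unknown"
    total = sum(class_counts.values())
    best, best_score = "unknown", -math.inf
    for category, n_category in class_counts.items():
        denominator = sum(predicate_counts[category].values()) + len(vocabulary)
        score = math.log(n_category / total)
        for predicate, n in known.items():
            # Add-one smoothing keeps unseen (category, predicate) pairs finite.
            score += n * math.log((predicate_counts[category][predicate] + 1) / denominator)
        if score > best_score:
            best, best_score = category, score
    return best


features = {
    "entity_a": Counter({"occupation": 1, "representative works": 2}),
    "entity_b": Counter({"located in": 1, "population": 1}),
}
labels = {"entity_a": "person", "entity_b": "place"}
model = train(features, labels)
print(predict(model, Counter({"occupation": 1})))
```

An entity whose predicates were never seen during training is returned as "unknown", which corresponds to second-class entity data that remains unclassified after the prediction.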
According to embodiments of the present disclosure, entity data is extracted from the data set, and the part of the entity data whose entity category is definite (that is, the first-class entity data) is determined. An entity classification model is then trained on the first-class entity data so that it effectively learns the association between the feature information of the first-class entity data and the corresponding entity categories, and the trained model is used to predict the category of the entity data classified into the unknown category (that is, the second-class entity data).
In this way, the entity data in any data set can be classified effectively and in greater depth. Moreover, because every entity category is a category in the preset entity category library, the resulting entity data classification is concise, uniform, and fairly standard whenever the categories in the preset entity category library are themselves uniform and concise.
Once the entity data in the data set has been classified effectively and uniformly, a standardized structure can be built for the data set, so that the data set can be effectively used and analyzed by machines, for example in information retrieval, intelligent input, automatic proofreading, or automatic error correction.
Fig. 2 schematically shows a flowchart of a method for classifying entity data in a data set according to another embodiment of the present disclosure.
As shown in Fig. 2, in the method for classifying entity data in a data set according to this embodiment of the present disclosure, after operation S104, when entity data of the unknown category still exists in the second-class entity data after the prediction, the method further includes operations S105 to S109.
In operation S105, the first-class entity data is mixed with the entity data in the second-class entity data whose entity category has been determined by prediction, to obtain new training data.
In operation S106, the entity classification model is retrained on the feature information of the new training data.
In operation S107, the entity category of the entity data of the unknown category in the second-class entity data is predicted again by means of the retrained entity classification model.
Then, in operation S108, it is determined whether entity data of the unknown category still exists. If so, entity data of the unknown category remains among the entity data extracted from the data set, and the method proceeds to operation S109. If not, all the entity data extracted from the data set has been classified, and the classification ends.
In operation S109, it is determined whether the result of this round of classification has changed compared with the previous one.
If the result of this round of classification has changed, then although this round did not classify all of the second-class entity data, part of the second-class entity data was assigned a definite entity category by prediction in this round. That part can then also be used as training data for the entity classification model; that is, operations S105 to S108 are performed in a loop so that the entity data still in the unknown category is classified repeatedly.
If the result of this round of classification has not changed, then although some entity data of the unknown category remains in the second-class entity data, no new training data is available. Once the training data no longer changes and the classification model has already been trained thoroughly on that same training data, further training and prediction with the model would yield little, so prediction for the remaining entity data of the unknown category in the second-class entity data is abandoned.
It should be understood, of course, that "the result of this round of classification has not changed" may mean that the classification results have remained unchanged over several consecutive rounds after a certain point, whereupon it is decided to abandon further prediction.
In other words, the absence of change does not necessarily mean that training ends as soon as two adjacent classification results are identical. Two adjacent results may be identical because of random factors; checking whether the results remain unchanged over several subsequent consecutive rounds excludes the influence of such random factors.
According to an embodiment of the present disclosure, following the idea of bootstrapping, the entity data in the second-type entity data whose entity category has been determined by prediction is also used as part of the training data, and the entity classification model is trained in a heuristic, iterative manner. The entity data of the unknown category remaining in the second-type entity data is classified repeatedly until either classification is complete, because no entity data of the unknown category remains in the second-type entity data, or further prediction is abandoned, because the prediction results for the unknown-category entity data in the second-type entity data no longer change.
In this way, almost all of the entity data extracted from the data set can be classified. A more complete and standardized data structure can thus be built for the data set effectively, which greatly improves the usability of the data set.
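The iterative procedure of operations S105 to S109 can be sketched roughly as follows. This is only a minimal illustration, not the implementation of the disclosure: the vote-counting "model", the `min_votes` threshold, the `patience` parameter (the number of consecutive unchanged rounds tolerated before prediction is abandoned) and `max_rounds` are all assumptions introduced for this example.

```python
from collections import Counter

UNKNOWN = "unknown"


def train(labeled):
    """Toy model: for each relation predicate, count the categories seen with it."""
    model = {}
    for predicates, category in labeled:
        for p in predicates:
            model.setdefault(p, Counter())[category] += 1
    return model


def predict(model, predicates, min_votes=2):
    """Vote over one entity's predicates; return UNKNOWN if the evidence is weak."""
    votes = Counter()
    for p in predicates:
        votes.update(model.get(p, {}))
    if not votes:
        return UNKNOWN
    category, count = votes.most_common(1)[0]
    return category if count >= min_votes else UNKNOWN


def bootstrap(first_type, second_type, patience=2, max_rounds=20):
    """first_type: list of (predicates, category); second_type: dict name -> predicates."""
    labels = {name: UNKNOWN for name in second_type}
    unchanged = 0
    for _ in range(max_rounds):
        # Mix first-type data with second-type data that already has a category.
        training = list(first_type) + [
            (second_type[n], c) for n, c in labels.items() if c != UNKNOWN
        ]
        model = train(training)
        # Re-predict only the entities that are still in the unknown category.
        new_labels = {
            n: (c if c != UNKNOWN else predict(model, second_type[n]))
            for n, c in labels.items()
        }
        unchanged = unchanged + 1 if new_labels == labels else 0
        labels = new_labels
        # Stop when nothing is unknown, or the result stayed the same for
        # `patience` consecutive rounds.
        if UNKNOWN not in labels.values() or unchanged >= patience:
            break
    return labels
```

Requiring several unchanged rounds, rather than a single one, mirrors the remark that two adjacent identical results may be due to random factors. With this deterministic toy model the extra rounds are redundant, but they matter for a model whose training involves randomness.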
Fig. 3 schematically shows a flowchart of extracting entity data in the method according to an embodiment of the present disclosure.
As shown in Fig. 3, extracting at least one entity data item from any data set in operation S101 may include operation S301 and operation S302.
In operation S301, at least one triple is extracted from the data set. Each triple includes two entity data items and the relation predicate between them, the relation predicate being used to describe the relation between the two entity data items.
Specifically, the at least one triple in the data set may be extracted by a triple extraction tool, by general-purpose Chinese word segmentation and part-of-speech tagging tools, or the like.
When a triple extraction tool is used, rule templates may be applied. For example, triples expressing a nationality relation can be extracted with the rule template "[,.] the nationality of (.*) is (.*) [,.]".
For example, consider a passage such as: "..., Fan Bingbing's nationality is China. ...".
The triple (Fan Bingbing, nationality, China) can be extracted from this passage. Here, "Fan Bingbing" is one entity data item and can be regarded as the subject entity data; "China" is another entity data item and can accordingly be regarded as the object entity data; and "nationality" is the relation predicate between the two entity data items.
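Rule-template extraction of this kind can be sketched with a regular expression. The pattern below is a hypothetical English counterpart of the nationality template, written for this illustration only; the function name and the stricter character classes (which stop a match at the nearest comma or full stop) are assumptions, not part of the disclosure.

```python
import re

# Hypothetical English counterpart of the nationality rule template.
# A match starts at the beginning of the text or after a comma/full stop
# and ends at the next comma/full stop.
NATIONALITY_RULE = re.compile(
    r"(?:^|[,.])\s*([^,.]+?)'s nationality is ([^,.]+?)\s*[,.]"
)


def extract_nationality_triples(text):
    """Return (subject entity, relation predicate, object entity) triples."""
    return [
        (subject.strip(), "nationality", obj.strip())
        for subject, obj in NATIONALITY_RULE.findall(text)
    ]
```

Applied to a passage like the one above, the function yields a single triple whose first and third elements are the subject and object entity data.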
Alternatively, the at least one triple in the data set may be extracted using general-purpose Chinese word segmentation and part-of-speech tagging tools or the like.
Many Chinese word segmentation and part-of-speech tagging tools are available. The basic idea is that, given a sentence, the tool segments it into words and at the same time assigns a part of speech to each word. Take the following sentence as an example:
"The headquarters location of Chongqing Chang Ping Construction Group Co., Ltd. is situated at No. 162 Yu Nan Avenue, Banan District, Chongqing City."
The Chinese word segmentation and part-of-speech tagging tool can split this sentence into the following structure:
"Chongqing Chang Ping Construction Group Co., Ltd./ns headquarters location/ns is situated/v at/p No. 162 Yu Nan Avenue, Banan District, Chongqing City/ns",
from which the triple (Chongqing Chang Ping Construction Group Co., Ltd., headquarters location, No. 162 Yu Nan Avenue, Banan District, Chongqing City) can then be extracted. Here, "Chongqing Chang Ping Construction Group Co., Ltd." and "No. 162 Yu Nan Avenue, Banan District, Chongqing City" are the two entity data items, and "headquarters location" is the relation predicate between these two entity data items.
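One possible way to turn such tagged output into a triple is sketched below. The heuristic (two noun phrases before the verb, one after it) is an assumption made for this example; the disclosure does not specify how the triple is derived from the tags, and a real segmentation tool would supply the `(word, tag)` pairs.

```python
def triple_from_tagged(tokens):
    """Extract (subject, predicate, object) from a list of (word, tag) pairs.

    Assumed heuristic: the first two noun phrases before the verb are the
    subject entity and the relation predicate; the first noun phrase after
    the verb (skipping any preposition) is the object entity.
    """
    verb_index = next(
        (i for i, (_, tag) in enumerate(tokens) if tag == "v"), None
    )
    if verb_index is None:
        return None
    nouns_before = [w for w, tag in tokens[:verb_index] if tag.startswith("n")]
    nouns_after = [w for w, tag in tokens[verb_index + 1:] if tag.startswith("n")]
    if len(nouns_before) < 2 or not nouns_after:
        return None
    return (nouns_before[0], nouns_before[1], nouns_after[0])
```

Sentences that do not fit the assumed pattern return `None` rather than a guessed triple.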
In operation S302, the entity data in each of the at least one triple is extracted.
Once the triples have been obtained, the entity data they contain can be obtained from them.
In the example above, the triple (Fan Bingbing, nationality, China) has the data structure <subject entity, relation predicate, object entity>, so the first and third elements of the triple are entity data.
It will be appreciated, of course, that triples obtained by different methods may have different structures. For example, if the structure of a triple is <subject entity, object entity, relation predicate>, the first two elements of the triple are entity data. The entity data is extracted according to the specific data structure of each kind of triple.
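Operation S302 can be illustrated with a short sketch in which the triple layout is passed in explicitly. The function name and the `entity_positions` parameter are hypothetical names chosen for this example.

```python
def entities_from_triples(triples, entity_positions=(0, 2)):
    """Collect distinct entity data from triples, keeping first-seen order.

    entity_positions gives the indices holding entity data: (0, 2) for
    <subject, predicate, object>, (0, 1) for <subject, object, predicate>.
    """
    seen = {}
    for triple in triples:
        for pos in entity_positions:
            seen.setdefault(triple[pos], None)
    return list(seen)
```

For triples laid out as <subject entity, object entity, relation predicate>, the same function would be called with `entity_positions=(0, 1)`.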
Fig. 4 schematically shows a flowchart of determining the entity category to which each entity data item belongs in the method according to an embodiment of the present disclosure.
As shown in Fig. 4, determining, in operation S102, the entity category to which each of the at least one entity data item belongs includes operation S401 and operation S402.
In operation S401, for each entity data item, at least one relation predicate corresponding to that entity data item is determined from the at least one triple that includes it.
In operation S402, the entity category to which the entity data item belongs is determined according to the content of the relation predicates and/or the frequency of occurrence of the same relation predicate.
For example, for the data in Table 1:
Chinese name | Zheng Chouyu | Date of birth | 1933 |
Alias | Zheng Wentao | Occupation | Poet; writer-in-residence at Ocean University of China |
Nationality | China | Alma mater | Taiwan Chung Hsing University |
Birthplace | Jinan, Shandong Province | Representative works | 《Mistake》, 《Clasp Knife》 |
Table 1
The following triples can be extracted:
(Zheng Chouyu, Chinese name, Zheng Chouyu)
(Zheng Chouyu, alias, Zheng Wentao)
(Zheng Chouyu, nationality, China)
(Zheng Chouyu, birthplace, Jinan, Shandong Province)
(Zheng Chouyu, date of birth, 1933)
(Zheng Chouyu, occupation, poet)
(Zheng Chouyu, occupation, writer-in-residence at Ocean University of China)
(Zheng Chouyu, alma mater, Taiwan Chung Hsing University)
(Zheng Chouyu, representative works, Mistake)
(Zheng Chouyu, representative works, Clasp Knife)
In operation S401, it can be determined from the above triples that, when the entity data item "Zheng Chouyu" acts as the subject entity, its corresponding relation predicates include Chinese name, alias, birthplace, date of birth, occupation, alma mater and representative works.
It can further be determined that "China" is the object entity of the relation predicate "nationality" and, correspondingly, that "Jinan, Shandong Province" is the object entity of the relation predicate "birthplace".
In operation S402, the entity data item "Zheng Chouyu" can be determined to belong to the person category on the basis of the full combination of its corresponding relation predicates, a single one of them, or a combination of some of them.
In this example, "Chinese name" or "date of birth" alone may not be enough to determine with certainty that "Zheng Chouyu" belongs to the person category. For instance, if the data set also contains entity data of an animal category, the relation predicates of that animal entity data may also include a Chinese name or a recorded date of birth. In that case, the entity data item "Zheng Chouyu" can be determined to belong to the person category by additionally taking "representative works", "alma mater", "occupation" and the like into account.
Likewise, for the object entity data in the above triples, the category of some items can be determined from a single relation predicate. For example, "Jinan, Shandong Province" is the object entity of the relation predicate "birthplace", so it can be determined that "Jinan, Shandong Province" belongs to the place category.
The amount of data in Table 1 is small, so some entity data items have only a few associated relation predicates, and their categories cannot necessarily be determined. For example, on the basis of the current relation predicate "representative works" alone, it cannot be fully determined which category "Mistake" and "Clasp Knife" belong to (for example, artistic works, buildings or engineering drawings). Descriptions of more data relations are therefore needed before the category can be determined.
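The reasoning in this example can be sketched as a small rule-based classifier that falls back to the unknown category when the evidence is insufficient. The rule tables, the English predicate names and the `min_hits` threshold below are hypothetical and chosen only to mirror the Table 1 discussion; the disclosure does not prescribe concrete rules.

```python
from collections import Counter

UNKNOWN = "unknown"

# Hypothetical cue predicates: their presence suggests a category for the
# subject entity. "Chinese name" and "date of birth" are deliberately absent
# because persons and animals may share them.
SUBJECT_RULES = {
    "person": {"occupation", "alma mater", "representative works"},
    "animal": {"species", "habitat"},
}
# Hypothetical predicates whose object entity has a fixed category.
OBJECT_RULES = {"birthplace": "place"}


def classify_subject(predicates, min_hits=2):
    """Pick the category with the most cue predicates, or UNKNOWN."""
    hits = Counter()
    for category, cues in SUBJECT_RULES.items():
        hits[category] = len(cues & set(predicates))
    category, count = hits.most_common(1)[0]
    return category if count >= min_hits else UNKNOWN


def classify_object(predicate):
    """Classify an object entity from the single predicate pointing at it."""
    return OBJECT_RULES.get(predicate, UNKNOWN)
```

With these rules, an entity described only by "Chinese name" and "date of birth" stays in the unknown category, as does an object entity reached only through "representative works". Such items are what the later model-based prediction is meant to handle.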
It can be understood that the data given in Table 1 is merely an exemplary illustration of determining the category of entity data from relation predicates.
The amount of data in the data sets to which the method of the embodiments of the present disclosure is applied is far larger than the exemplary data in Table 1, for example massive Internet data (such as one year of data from an encyclopedia website, or one or more years of data from a blog website). Such a data set can contain a large amount of entity data, triple relations and so on, and can thus provide enough training data to train the classification model thoroughly.
Fig. 5 schematically shows a block diagram of a device for classifying entity data in a data set according to an embodiment of the present disclosure.
As shown in Fig. 5, the device 500 for classifying entity data in a data set according to an embodiment of the present disclosure includes an entity data extraction module 510, an entity category determination module 520, a classification model training module 530 and an unknown category prediction module 540.
The device 500 can be used to implement the method for classifying entity data in a data set described with reference to Figs. 1 to 4.
The entity data extraction module 510 is used to extract at least one entity data item from any data set, the entity data comprising words in the data set that are used to denote any object.
The entity category determination module 520 is used to determine the entity category to which each of the at least one entity data item belongs, the entity category being a category in a preset entity category library, wherein the entity category library includes an unknown category for holding entity data whose entity category cannot be determined.
The classification model training module 530 is used to train an entity classification model according to the feature information of first-type entity data, whose entity category has been determined, when second-type entity data classified into the unknown category exists.
The unknown category prediction module 540 is used to determine, by prediction with the entity classification model, the entity category to which each entity data item in the second-type entity data belongs.
According to an embodiment of the present disclosure, the device 500 extracts entity data from the data set and determines the portion of the entity data that has a definite classification result. It then trains the entity classification model on this entity data, so that the model effectively learns the association between the various items of feature information of the first-type entity data and the corresponding entity categories, and uses the trained model to predict the category of the entities that have no definite classification result. In this way, the entity data in any data set can be classified effectively and in greater depth.
Moreover, because the entity categories are those of the preset entity category library, entity data categories that are concise, uniform and fairly standardized can be obtained when the categories in the preset entity category library are themselves uniform and concise.
Once the entity data in the data set has been classified effectively and uniformly, a standardized structure can be built for the data set, so that the data set can be used and analyzed effectively by machines, for example in information retrieval, intelligent input, automatic proofreading or automatic error correction.
According to an embodiment of the present disclosure, the device 500 further includes a combined training data generation module 550.
The combined training data generation module 550 is used, when entity data of the unknown category still exists in the second-type entity data after prediction, to mix the first-type entity data with the entity data in the second-type entity data whose entity category has been determined by prediction, thereby obtaining new training data.
The classification model training module 530 is further used to retrain the entity classification model according to the feature information of the new training data.
Here, the feature information includes triple information, relation predicate information and/or relation predicate frequency information.
The unknown category prediction module 540 is further used to predict again, with the retrained entity classification model, the entity category of the entity data of the unknown category in the second-type entity data.
According to an embodiment of the present disclosure, the device 500 also uses the entity data in the second-type entity data whose entity category has been determined by prediction as part of the training data, and trains the entity classification model in a heuristic, iterative manner. The entity data of the unknown category in the second-type entity data is classified repeatedly until either classification is complete, because no entity data of the unknown category remains in the second-type entity data, or further prediction is abandoned, because the prediction results for the unknown-category entity data in the second-type entity data no longer change.
In this way, almost all of the entity data extracted from the data set can be classified. A more complete and standardized data structure can thus be built for the data set effectively, which greatly improves the usability of the data set.
According to an embodiment of the present disclosure, the entity data extraction module 510 includes a triple extraction submodule 511 and an entity data extraction submodule 512.
The triple extraction submodule 511 is used to extract at least one triple from the data set. Each triple includes two entity data items and the relation predicate between them, the relation predicate being used to describe the relation between the two entity data items.
The entity data extraction submodule 512 is used to extract the entity data in each of the at least one triple.
According to an embodiment of the present disclosure, the entity category determination module 520 includes a relation predicate determination submodule 521 and an entity category determination submodule 522.
The relation predicate determination submodule 521 is used, for each entity data item, to determine at least one relation predicate corresponding to that entity data item from the at least one triple that includes it.
The entity category determination submodule 522 is used to determine the entity category to which the entity data item belongs according to the content of the relation predicates and/or the frequency of occurrence of the same relation predicate.
It can be understood that the entity data extraction module 510, the entity category determination module 520, the classification model training module 530, the unknown category prediction module 540 and the combined training data generation module 550 may be combined and implemented in a single module, or any one of them may be split into multiple modules. Alternatively, at least part of the functionality of one or more of these modules may be combined with at least part of the functionality of other modules and implemented in one module. According to an embodiment of the present invention, at least one of the entity data extraction module 510, the entity category determination module 520, the classification model training module 530, the unknown category prediction module 540 and the combined training data generation module 550 may be implemented at least partially as a hardware circuit, such as a field-programmable gate array (FPGA), a programmable logic array (PLA), a system on chip, a system on substrate, a system in package or an application-specific integrated circuit (ASIC); or may be implemented in hardware or firmware by any other reasonable way of integrating or packaging circuits; or may be implemented as an appropriate combination of the three implementation forms of software, hardware and firmware. Alternatively, at least one of these modules may be implemented at least partially as a computer program module which, when run by a computer, performs the function of the corresponding module.
Fig. 6 schematically shows a block diagram of a device for classifying entity data in a data set according to another embodiment of the present disclosure.
As shown in Fig. 6, the device 600 for classifying entity data in a data set includes a processor 610 and a computer-readable storage medium 620. The device 600 can perform the method described above with reference to Figs. 2 to 4 in order to classify the entity data in a data set.
Specifically, the processor 610 may include, for example, a general-purpose microprocessor, an instruction set processor and/or an associated chipset, and/or a special-purpose microprocessor (for example, an application-specific integrated circuit (ASIC)). The processor 610 may also include on-board memory for caching purposes. The processor 610 may be a single processing unit or multiple processing units for performing the different actions of the method flow according to the embodiments of the present disclosure described with reference to Figs. 2 to 4.
The computer-readable storage medium 620 may be, for example, any medium that can contain, store, communicate, propagate or transport instructions. For example, the readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus, device or propagation medium. Specific examples of the readable storage medium include: a magnetic storage device, such as a magnetic tape or a hard disk drive (HDD); an optical storage device, such as a compact disc (CD-ROM); a memory, such as a random access memory (RAM) or a flash memory; and/or a wired/wireless communication link.
The computer-readable storage medium 620 may include a computer program 621, which may include code/computer-executable instructions that, when executed by the processor 610, cause the processor 610 to perform, for example, the method flow described above with reference to Figs. 2 to 4 and any variations thereof.
The computer program 621 may be configured to have computer program code that includes, for example, computer program modules. For example, in an exemplary embodiment, the code in the computer program 621 may include one or more program modules, for example module 621A, module 621B, and so on. It should be noted that the way the modules are divided and their number are not fixed. Those skilled in the art may use suitable program modules or combinations of program modules according to the actual situation, and when these combinations of program modules are executed by the processor 610, they enable the processor 610 to perform, for example, the method flow described above with reference to Figs. 2 to 4 and any variations thereof.
According to an embodiment of the present invention, at least one of the entity data extraction module 510, the entity category determination module 520, the classification model training module 530, the unknown category prediction module 540 and the combined training data generation module 550 may be implemented as a computer program module as described with reference to Fig. 6, which, when executed by the processor 610, can carry out the corresponding operations described above.
Those skilled in the art will understand that the features described in the various embodiments and/or claims of the present disclosure can be combined and/or integrated in many ways, even if such combinations or integrations are not expressly described in the present disclosure. In particular, the features described in the various embodiments and/or claims of the present disclosure can be combined and/or integrated in many ways without departing from the spirit and teaching of the present disclosure. All such combinations and/or integrations fall within the scope of the present disclosure.
Although the present disclosure has been shown and described with reference to certain exemplary embodiments thereof, those skilled in the art should understand that various changes in form and detail may be made to the present disclosure without departing from the spirit and scope of the present disclosure as defined by the appended claims and their equivalents. Therefore, the scope of the present disclosure should not be limited to the above embodiments, but should be determined not only by the appended claims but also by the equivalents of the appended claims.
Claims (10)
1. A method for classifying entity data in a data set, comprising:
extracting at least one entity data item from any data set, the entity data comprising words in the data set that have independent meaning and can be used to denote any object;
determining the entity category to which each of the at least one entity data item belongs, the entity category being a category in a preset entity category library, wherein the entity category library includes an unknown category for holding entity data whose entity category cannot be determined;
when second-type entity data classified into the unknown category exists, training an entity classification model according to the feature information of first-type entity data whose entity category has been determined; and
determining, by prediction with the entity classification model, the entity category to which each entity data item in the second-type entity data belongs.
2. The method according to claim 1, wherein, when entity data of the unknown category still exists in the second-type entity data after prediction, the method further comprises:
mixing the first-type entity data with the entity data in the second-type entity data whose entity category has been determined by prediction, to obtain new training data;
retraining the entity classification model according to the feature information of the new training data;
predicting again, with the retrained entity classification model, the entity category of the entity data of the unknown category in the second-type entity data; and
when entity data of the unknown category still exists in the second-type entity data, repeating the mixing, retraining and re-prediction operations until classification is complete, when no entity data of the unknown category remains in the second-type entity data, or until further prediction is abandoned, when the prediction results for the entity data of the unknown category in the second-type entity data no longer change.
3. The method according to claim 1, wherein extracting at least one entity data item from any data set comprises:
extracting at least one triple from the data set, the triple including two entity data items and the relation predicate between the two entity data items, the relation predicate being used to describe the relation between the two entity data items; and
extracting the entity data in each of the at least one triple.
4. The method according to claim 3, wherein determining the entity category to which each of the at least one entity data item belongs comprises:
for each entity data item, determining at least one relation predicate corresponding to the entity data item from the at least one triple that includes the entity data item; and
determining the entity category to which the entity data item belongs according to the content of the relation predicates and/or the frequency of occurrence of the same relation predicate.
5. The method according to claim 1, wherein:
the feature information includes triple information, relation predicate information and/or relation predicate frequency information.
6. A device for classifying entity data in a data set, comprising:
an entity data extraction module for extracting at least one entity data item from any data set, the entity data comprising words in the data set that have independent meaning and can be used to denote any object;
an entity category determination module for determining the entity category to which each of the at least one entity data item belongs, the entity category being a category in a preset entity category library, wherein the entity category library includes an unknown category for holding entity data whose entity category cannot be determined;
a classification model training module for training an entity classification model according to the feature information of first-type entity data whose entity category has been determined, when second-type entity data classified into the unknown category exists; and
an unknown category prediction module for determining, by prediction with the entity classification model, the entity category to which each entity data item in the second-type entity data belongs.
7. The device according to claim 6, further comprising:
a combined training data generation module for mixing, when entity data of the unknown category still exists in the second-type entity data after prediction, the first-type entity data with the entity data in the second-type entity data whose entity category has been determined by prediction, to obtain new training data;
wherein the classification model training module is further used to retrain the entity classification model according to the feature information of the new training data; and
the unknown category prediction module is further used to predict again, with the retrained entity classification model, the entity category of the entity data of the unknown category in the second-type entity data.
8. The device according to claim 6, wherein the entity data extraction module comprises:
a triple extraction submodule for extracting at least one triple from the data set, the triple including two entity data items and the relation predicate between the two entity data items, the relation predicate being used to describe the relation between the two entity data items; and
an entity data extraction submodule for extracting the entity data in each of the at least one triple.
9. The device according to claim 8, wherein the entity category determination module comprises:
a relation predicate determination submodule for determining, for each entity data item, at least one relation predicate corresponding to the entity data item from the at least one triple that includes the entity data item; and
an entity category determination submodule for determining the entity category to which the entity data item belongs according to the content of the relation predicates and/or the frequency of occurrence of the same relation predicate.
10. The device according to claim 6, wherein:
the feature information includes triple information, relation predicate information and/or relation predicate frequency information.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710903481.0A CN107622126A (en) | 2017-09-28 | 2017-09-28 | The method and apparatus sorted out to the solid data in data acquisition system |
Publications (1)
Publication Number | Publication Date |
---|---|
CN107622126A true CN107622126A (en) | 2018-01-23 |
Family
ID=61091385
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201710903481.0A Pending CN107622126A (en) | 2017-09-28 | 2017-09-28 | The method and apparatus sorted out to the solid data in data acquisition system |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN107622126A (en) |
Patent Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102662923A (en) * | 2012-04-23 | 2012-09-12 | 天津大学 | Entity instance leading method based on machine learning |
CN103678316A (en) * | 2012-08-31 | 2014-03-26 | 富士通株式会社 | Entity relationship classifying device and entity relationship classifying method |
CN103617239A (en) * | 2013-11-26 | 2014-03-05 | 百度在线网络技术(北京)有限公司 | Method and device for identifying named entity and method and device for establishing classification model |
US20160092406A1 (en) * | 2014-09-30 | 2016-03-31 | Microsoft Technology Licensing, Llc | Inferring Layout Intent |
Non-Patent Citations (1)
Title |
---|
贾真: "面向中文网络百科的本体学习与知识获取研究", 《中国博士学位论文全文数据库信息科技辑》 * |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN108304381A (en) * | 2018-01-25 | 2018-07-20 | 北京百度网讯科技有限公司 | Entity based on artificial intelligence builds side method, apparatus, equipment and storage medium |
CN108304381B (en) * | 2018-01-25 | 2021-09-21 | 北京百度网讯科技有限公司 | Entity edge establishing method, device and equipment based on artificial intelligence and storage medium |
CN110555208A (en) * | 2018-06-04 | 2019-12-10 | 北京三快在线科技有限公司 | ambiguity elimination method and device in information query and electronic equipment |
CN110555208B (en) * | 2018-06-04 | 2021-11-19 | 北京三快在线科技有限公司 | Ambiguity elimination method and device in information query and electronic equipment |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Kim et al. | Transparency and accountability in AI decision support: Explaining and visualizing convolutional neural networks for text information | |
Manjunatha et al. | Explicit bias discovery in visual question answering models | |
Vylomova et al. | Take and took, gaggle and goose, book and read: Evaluating the utility of vector differences for lexical relation learning | |
US20170364773A1 (en) | Training a classifier algorithm used for automatically generating tags to be applied to images | |
US20190347571A1 (en) | Classifier training | |
US20170270096A1 (en) | Method and system for generating large coded data set of text from textual documents using high resolution labeling | |
CN112257421A (en) | Nested entity data identification method and device and electronic equipment | |
Na et al. | Discovery of natural language concepts in individual units of cnns | |
Noguti et al. | Legal document classification: An application to law area prediction of petitions to public prosecution service | |
CN116628229B (en) | Method and device for generating text corpus by using knowledge graph | |
CN112132238A (en) | Method, device, equipment and readable medium for identifying private data | |
Gadek et al. | An interpretable model to measure fakeness and emotion in news | |
CN108090099A (en) | Text processing method and device | |
CN113947086A (en) | Sample data generation method, training method, corpus generation method and apparatus | |
CN114638914A (en) | Image generation method and device, computer equipment and storage medium | |
CN107622126A (en) | Method and apparatus for classifying entity data in a data set | |
CN110069558A (en) | Data analysing method and terminal device based on deep learning | |
CN112015915A (en) | Question-answering system and device based on knowledge base generated by questions | |
CN114842982B (en) | Knowledge expression method, device and system for medical information system | |
CN115081452B (en) | Method for extracting entity relationship | |
Aurnhammer et al. | Manual Annotation of Unsupervised Models: Close and Distant Reading of Politics on Reddit. | |
CN110888940A (en) | Text information extraction method and device, computer equipment and storage medium | |
CN113610080B (en) | Cross-modal perception-based sensitive image identification method, device, equipment and medium | |
CN112732910B (en) | Cross-task text emotion state evaluation method, system, device and medium | |
CN115048536A (en) | Knowledge graph generation method and device, computer equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
RJ01 | Rejection of invention patent application after publication | | Application publication date: 20180123 |