CN107622126A - Method and apparatus for classifying entity data in a data set - Google Patents

Publication number
CN107622126A
Authority
CN
China
Prior art keywords
data
solid
entity
class
solid data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710903481.0A
Other languages
Chinese (zh)
Inventor
胡长建
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Lenovo Beijing Ltd
Original Assignee
Lenovo Beijing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Lenovo Beijing Ltd filed Critical Lenovo Beijing Ltd
Priority to CN201710903481.0A priority Critical patent/CN107622126A/en
Publication of CN107622126A publication Critical patent/CN107622126A/en
Pending legal-status Critical Current

Abstract

The present disclosure provides a method for classifying entity data in a data set. The method includes: extracting at least one piece of entity data from any data set; determining the entity category to which each piece of entity data belongs; when second-class entity data that has been assigned to the unknown category exists, training an entity classification model on the feature information of the first-class entity data, whose entity categories have already been determined; and predicting, with the entity classification model, the entity category to which each piece of second-class entity data belongs. The present disclosure also provides an apparatus for classifying entity data in a data set.

Description

Method and apparatus for classifying entity data in a data set
Technical field
The present disclosure relates to a method and an apparatus for classifying entity data in a data set.
Background
Structured, standardized data can be recognized by machines and is widely used in information retrieval, intelligent input, automatic proofreading, automatic error correction, translation, and similar applications.
However, as the Internet has flourished, the volume of information on it has expanded rapidly. Within this mass of information, many kinds of content are typically mixed together in a highly non-standard way. Much of the information has not been structured at all (for example, the data has no explicit classification). Some has been structured to a degree, but the structuring is often incomplete or even inconsistent (for example, the classification scheme is not uniform). As a result, machines cannot effectively recognize or use it.
Summary of the invention
One aspect of the present disclosure provides a method for classifying entity data in a data set. The method includes: extracting at least one piece of entity data from any data set, where entity data comprises words in the data set that have independent meaning and can denote an object; determining the entity category to which each piece of entity data belongs, where the entity category is a category in a preset entity category library, and the library includes an unknown category that collects entity data whose category cannot be determined; when second-class entity data assigned to the unknown category exists, training an entity classification model on the feature information of the first-class entity data, whose entity categories have been determined; and predicting, with the entity classification model, the entity category to which each piece of second-class entity data belongs.
Optionally, when entity data of unknown category remains in the second-class entity data after prediction, the method further includes: mixing the first-class entity data with the second-class entity data whose entity categories have been determined by prediction, to obtain new training data; retraining the entity classification model on the feature information of the new training data; predicting again, with the retrained entity classification model, the entity category of the entity data of unknown category in the second-class entity data; and, while entity data of unknown category remains in the second-class entity data, repeating the mixing, retraining, and re-prediction operations. Classification is complete when no entity data of unknown category remains in the second-class entity data. Alternatively, prediction is abandoned when the prediction results for the remaining entity data of unknown category no longer change.
Optionally, extracting at least one piece of entity data from any data set includes extracting at least one triple from the data set and extracting the entity data from each triple. Each triple comprises two pieces of entity data and a relation predicate that describes the relationship between them.
Optionally, determining the entity category to which each piece of entity data belongs includes, for each piece of entity data: determining at least one relation predicate corresponding to the entity data from the at least one triple that contains it; and determining the entity category of the entity data according to the content of the relation predicates and/or the number of occurrences of the same relation predicate.
Optionally, the feature information includes triple information, relation predicate information, and/or relation predicate frequency information.
Another aspect of the present disclosure provides an apparatus for classifying entity data in a data set. The apparatus includes an entity data extraction module, an entity category determination module, a classification model training module, and an unknown category prediction module. The entity data extraction module is configured to extract at least one piece of entity data from any data set, where entity data comprises words in the data set that have independent meaning and can denote an object. The entity category determination module is configured to determine the entity category to which each piece of entity data belongs, where the entity category is a category in a preset entity category library, and the library includes an unknown category that collects entity data whose category cannot be determined. The classification model training module is configured to train an entity classification model on the feature information of the first-class entity data, whose entity categories have been determined, when second-class entity data assigned to the unknown category exists. The unknown category prediction module is configured to predict, with the entity classification model, the entity category to which each piece of second-class entity data belongs.
Optionally, the apparatus further includes a mixed training data generation module. When entity data of unknown category remains in the second-class entity data after prediction, the mixed training data generation module is configured to mix the first-class entity data with the second-class entity data whose entity categories have been determined by prediction, to obtain new training data. The classification model training module is further configured to retrain the entity classification model on the feature information of the new training data. The unknown category prediction module is further configured to predict again, with the retrained entity classification model, the entity category of the entity data of unknown category in the second-class entity data.
Optionally, the entity data extraction module includes a triple extraction submodule and an entity data extraction submodule. The triple extraction submodule is configured to extract at least one triple from the data set, where each triple comprises two pieces of entity data and a relation predicate that describes the relationship between them. The entity data extraction submodule is configured to extract the entity data from each triple.
Optionally, the entity category determination module includes a relation predicate determination submodule and an entity category determination submodule. The relation predicate determination submodule is configured to determine, for each piece of entity data, at least one relation predicate corresponding to the entity data from the at least one triple that contains it. The entity category determination submodule is configured to determine the entity category of the entity data according to the content of the relation predicates and/or the number of occurrences of the same relation predicate.
Optionally, the feature information includes triple information, relation predicate information, and/or relation predicate frequency information.
Another aspect of the present disclosure provides a non-volatile storage medium storing computer-executable instructions that, when executed, implement the method for classifying entity data in a data set described above.
Another aspect of the present disclosure provides a computer program comprising computer-executable instructions that, when executed, implement the method for classifying entity data in a data set described above.
Brief description of the drawings
For a more complete understanding of the present disclosure and its advantages, reference is now made to the following description taken in conjunction with the accompanying drawings, in which:
Fig. 1 schematically shows a flowchart of a method for classifying entity data in a data set according to an embodiment of the present disclosure;
Fig. 2 schematically shows a flowchart of a method for classifying entity data in a data set according to another embodiment of the present disclosure;
Fig. 3 schematically shows a flowchart of extracting entity data in the method according to an embodiment of the present disclosure;
Fig. 4 schematically shows a flowchart of determining the entity category of each piece of entity data in the method according to an embodiment of the present disclosure;
Fig. 5 schematically shows a block diagram of an apparatus for classifying entity data in a data set according to an embodiment of the present disclosure; and
Fig. 6 schematically shows a block diagram of an apparatus for classifying entity data in a data set according to another embodiment of the present disclosure.
Detailed description
Embodiments of the present disclosure are described below with reference to the accompanying drawings. It should be understood that these descriptions are merely exemplary and are not intended to limit the scope of the present disclosure. In addition, descriptions of well-known structures and techniques are omitted below to avoid unnecessarily obscuring the concepts of the present disclosure.
The terminology used herein is for describing specific embodiments only and is not intended to limit the present disclosure. The terms "include", "comprise", and the like, as used herein, indicate the presence of the stated features, steps, operations, and/or components, but do not exclude the presence or addition of one or more other features, steps, operations, or components.
Unless otherwise defined, all terms used herein (including technical and scientific terms) have the meanings commonly understood by those skilled in the art. Terms used herein should be interpreted with meanings consistent with the context of this specification, and should not be interpreted in an idealized or overly rigid manner.
Where an expression such as "at least one of A, B, and C" is used, it should generally be interpreted as those skilled in the art commonly understand it (for example, "a system having at least one of A, B, and C" includes, but is not limited to, systems having A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B, and C). The same applies to an expression such as "at least one of A, B, or C" (for example, "a system having at least one of A, B, or C" includes, but is not limited to, systems having A alone, B alone, C alone, A and B, A and C, B and C, and/or A, B, and C). Those skilled in the art will also understand that virtually any disjunctive word and/or phrase presenting two or more alternatives, whether in the description, the claims, or the drawings, should be understood to contemplate the possibility of including one of the items, either of the items, or both. For example, the phrase "A or B" should be understood to include the possibilities of "A", "B", or "A and B".
The drawings show some block diagrams and/or flowcharts. It should be understood that some blocks in the block diagrams and/or flowcharts, or combinations of them, can be implemented by computer program instructions. These instructions can be provided to a processor of a general-purpose computer, a special-purpose computer, or another programmable data processing apparatus, so that the instructions, when executed by the processor, create means for implementing the functions/operations illustrated in the block diagrams and/or flowcharts.
Accordingly, the techniques of the present disclosure can be implemented in hardware and/or software (including firmware, microcode, and so on). In addition, the techniques of the present disclosure can take the form of a computer program product on a computer-readable medium storing instructions, for use by or in connection with an instruction execution system. In the context of the present disclosure, a computer-readable medium can be any medium that can contain, store, communicate, propagate, or transport instructions. For example, a computer-readable medium can include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, device, or propagation medium. Specific examples of computer-readable media include: magnetic storage devices, such as magnetic tape or a hard disk (HDD); optical storage devices, such as a compact disc (CD-ROM); memory, such as random access memory (RAM) or flash memory; and/or wired/wireless communication links.
Embodiments of the present disclosure provide a method and an apparatus for classifying entity data in a data set. The method includes extracting at least one piece of entity data from any data set and determining the entity category to which each piece of entity data belongs. When second-class entity data assigned to the unknown category exists, an entity classification model is trained on the feature information of the first-class entity data, whose entity categories have been determined, and the model is then used to predict the entity category of each piece of second-class entity data. Here, entity data comprises words in the data set that have independent meaning and can denote an object. The entity category is a category in a preset entity category library, which includes an unknown category that collects entity data whose category cannot be determined.
The method and apparatus provided by the embodiments of the present disclosure can extract and effectively classify the entity data in any data set containing massive amounts of data, thereby building a standardized data structure for the data set so that machines can effectively recognize and use it.
The data set can be very large, for example the data set formed by all the data of a website over one year. The raw data in the data set may be unclassified; classified, but in an inconsistent and disorderly way; or partly well classified and partly disorderly.
The method and apparatus provided by the embodiments of the present disclosure extract entity data from any data set and identify the portion of entity data that has a definite classification result. An entity classification model is then trained on this entity data, so that the model learns the associations between the various feature information of the first-class entity data and the corresponding entity categories. The trained model is then used to predict the categories of the entity data that has no definite classification result.
In this way, the entity data in any data set can be classified effectively and at a deeper level. Moreover, because every assigned entity category belongs to the preset entity category library, a concise, uniform, and well-standardized classification of the entity data is obtained whenever the categories in the library are themselves uniform and concise.
The method and apparatus provided by the embodiments of the present disclosure thus help build a standardized structure for any data set, for example for a full year of data from any website on the Internet. The massive data on the Internet can then be used and analyzed effectively by machines, for example in information retrieval, intelligent input, automatic proofreading, or automatic error correction, which improves the usability and processability of the data.
Fig. 1 schematically shows a flowchart of a method for classifying entity data in a data set according to an embodiment of the present disclosure.
As shown in Fig. 1, the method for classifying entity data in a data set according to an embodiment of the present disclosure includes operations S101 to S104.
In operation S101, at least one piece of entity data is extracted from any data set. Entity data comprises words in the data set that have independent meaning and can denote an object.
Specifically, entity data can denote any object. The object may be, for example, a time, a place, an artistic work, an organization, a building, a tangible product, a virtual product, an event, a professional title, an award, a person, a number, a quantity, a color, and so on.
Relation predicate data is distinct from entity data. A relation predicate is a word that describes the relationship between two pieces of entity data.
For example, in the sentence "Fan Bingbing's occupation is actor", "Fan Bingbing" is entity data denoting a person, "actor" is entity data denoting an occupation, and "occupation" is the relation predicate describing the relationship between the two entities "Fan Bingbing" and "actor".
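The distinction between entity data and relation predicates can be sketched in code. The following is a minimal illustration, not the patent's implementation: the `Triple` type and the `extract_entities` helper are hypothetical names introduced here, and the triples are assumed to have been produced already by some upstream extraction step.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """Hypothetical container: two entities linked by a relation predicate."""
    subject: str    # first piece of entity data
    predicate: str  # relation predicate describing the relationship
    obj: str        # second piece of entity data


def extract_entities(triples):
    """Return the distinct entities of all triples, in first-seen order."""
    seen = {}
    for t in triples:
        seen.setdefault(t.subject, None)
        seen.setdefault(t.obj, None)
    return list(seen)


triples = [
    Triple("Fan Bingbing", "occupation", "actor"),
    Triple("Fan Bingbing", "representative work", "One Night Surprise"),
]
entities = extract_entities(triples)
print(entities)
```

The relation predicates ("occupation", "representative work") are deliberately left out of the entity list, since they describe relationships rather than denote objects.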
Entity data is a word with independent meaning. For example, "Fan Bingbing" consists of three Chinese characters that together denote a person, whereas if the three characters are separated, none of them alone clearly denotes an object.
In Chinese, words with independent meaning can be obtained from a sentence, a passage, or a chart by Chinese word segmentation. In languages such as English, words are already delimited by the spaces between them.
In operation S102, the entity category to which each piece of entity data belongs is determined. The entity category is a category in a preset entity category library, and the library includes an unknown category that collects entity data whose category cannot be determined.
Specifically, for data sets with no classification or a disorderly classification, the preset entity category library may be a library with well-defined categories compiled from standard resources such as a part-of-speech-tagged corpus, the People's Daily word-segmentation corpus, or the Xinhua Dictionary.
For data sets in which part of the data is classified in a relatively standard way, the preset entity category library may be a more complete library formed by organizing that well-classified data and incorporating its categories.
The preset entity category library may include, but is not limited to, the following categories: time, place, number, quantity, person, organization, artistic work, tangible product, event, building, award, professional title, color, education level, regulation, ethnicity, religion, language, deity, chemical, biological agent, medical treatment, drug, symptom, disease, body part, organism, animal, food, website, broadcast network, broadcast program, television channel, currency, stock exchange, algorithm, programming language, transportation system, supply line, and the unknown category.
The categories in the preset library may also be organized hierarchically: the library may contain multiple top-level categories, each of which may contain multiple subcategories.
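A two-level category library of this kind could be represented as a simple mapping. The sketch below is only illustrative: the top-level names are taken from the list above, while the subcategories and the helper functions `is_valid_category` and `parent_of` are assumptions made for the example.

```python
UNKNOWN = "unknown"

# Top-level category -> list of subcategories (subcategories are hypothetical).
CATEGORY_LIBRARY = {
    "person": [],
    "place": ["country", "city"],
    "artistic work": ["film", "book"],
    UNKNOWN: [],
}


def is_valid_category(name):
    """True if `name` is a top-level category or a subcategory in the library."""
    return any(name == top or name in subs
               for top, subs in CATEGORY_LIBRARY.items())


def parent_of(name):
    """Return the top-level category containing subcategory `name`, else None."""
    for top, subs in CATEGORY_LIBRARY.items():
        if name in subs:
            return top
    return None
```

Restricting every assigned label to names accepted by `is_valid_category` is one way to keep the resulting classification uniform.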
It will be appreciated that the specific categories in the preset entity category library, and the relationships among them, may differ across application scenarios and classification standards.
When the entity category of each piece of entity data is determined, the category of some entity data can be established from its feature information, for example by means of a part-of-speech-tagged corpus, the People's Daily word-segmentation corpus, or data labels. Any second-class entity data whose category cannot be established in this way is temporarily placed in the unknown category.
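As a rough sketch of this initial pass, the function below splits entities into first-class data (category found) and second-class data (placed in the unknown category). The `lexicon` mapping is a hypothetical stand-in for whatever labeled resource is available, such as a tagged corpus or existing data labels; the patent does not prescribe this lookup.

```python
UNKNOWN = "unknown"


def initial_classification(entities, lexicon):
    """Split entities into (first_class, second_class).

    first_class: dict mapping entity -> known category.
    second_class: list of entities temporarily assigned to UNKNOWN.
    `lexicon` is an assumed entity -> category lookup.
    """
    first_class, second_class = {}, []
    for entity in entities:
        category = lexicon.get(entity, UNKNOWN)
        if category == UNKNOWN:
            second_class.append(entity)
        else:
            first_class[entity] = category
    return first_class, second_class


first_class, second_class = initial_classification(
    ["Fan Bingbing", "Beijing", "One Night Surprise"],
    {"Fan Bingbing": "person", "Beijing": "place"},
)
```

Here "One Night Surprise" is absent from the lexicon, so it ends up in the second-class list and is left for the model to predict.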
Then, in operation S103, when second-class entity data assigned to the unknown category exists, an entity classification model is trained on the feature information of the first-class entity data, whose entity categories have been determined.
The feature information of entity data includes triple information, relation predicate information, and/or relation predicate frequency information.
A triple comprises two pieces of entity data and the relation predicate that describes the relationship between them.
The feature information is illustrated below using the entity "Fan Bingbing" as an example.
For example, (Fan Bingbing, occupation, actor) is one triple of the entity "Fan Bingbing", and "occupation" is one relation predicate of that entity.
The entity "Fan Bingbing" can also have other relation predicates. In (Fan Bingbing, representative work, One Night Surprise), for example, "representative work" is another relation predicate.
The same relation predicate can appear in several triples of the same entity. For example, (Fan Bingbing, representative work, I Am Not Madame Bovary) is another triple related to the entity "Fan Bingbing", in which the relation predicate "representative work" describes the relationship between "Fan Bingbing" and another entity. The same relation predicate can therefore occur multiple times, and these occurrence counts constitute the relation predicate frequency information of the entity.
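The relation predicate frequency information described above can be computed with a counter per entity. This is a minimal sketch under one simplifying assumption that the source does not state: predicates are counted only for the first entity of each triple. The function name `predicate_frequencies` is hypothetical.

```python
from collections import Counter, defaultdict


def predicate_frequencies(triples):
    """Map each subject entity to a Counter of its relation predicates.

    Assumption: only the first entity of each (subject, predicate, object)
    triple is credited with the predicate.
    """
    features = defaultdict(Counter)
    for subject, predicate, _obj in triples:
        features[subject][predicate] += 1
    return features


triples = [
    ("Fan Bingbing", "occupation", "actor"),
    ("Fan Bingbing", "representative work", "One Night Surprise"),
    ("Fan Bingbing", "representative work", "I Am Not Madame Bovary"),
]
features = predicate_frequencies(triples)
```

For "Fan Bingbing" this yields a count of 1 for "occupation" and 2 for "representative work", matching the example in the text.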
Suppose it has been determined that the entity "Fan Bingbing" belongs to the person category. The relation predicates of the person category may then include "occupation", "representative work", and so on.
The entity classification model is trained on the feature information of the first-class entity data, whose entity categories have been determined, so that the model learns the associations between that feature information and the corresponding entity categories.
When the data set is large enough that sufficient training data is available, the entity classification model can fully learn the associations (for example, the strength of association) between the feature information of the first-class entity data and the corresponding entity categories.
In operation S104, the entity classification model is used to predict the entity category of each piece of second-class entity data.
Specifically, the entity classification model is trained repeatedly on the feature information of the first-class entity data, so that it learns the associations between the various feature information of that data and the corresponding entity categories. The trained model is then used to predict the entity category of each piece of second-class entity data. In this way, the entity category of at least part of the second-class entity data can be predicted, so that the categories of more entity data in the data set become definite and the data structure of the data set becomes more standardized.
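The source does not specify which kind of classification model is used. Purely as an illustration of operations S103 and S104, the sketch below swaps in a very simple stand-in: each category is summarized by its aggregated predicate counts, and an unknown entity is assigned to the category whose predicate distribution best overlaps its own. The functions `train` and `predict`, the scoring rule, and the sample entity "Zhang San" are all assumptions made for this example.

```python
from collections import Counter, defaultdict

UNKNOWN = "unknown"


def train(features, labels):
    """Aggregate predicate counts per category.

    features: entity -> Counter of relation predicates.
    labels:   entity -> known category (first-class entity data).
    """
    model = defaultdict(Counter)
    for entity, category in labels.items():
        model[category].update(features.get(entity, {}))
    return dict(model)


def predict(model, entity_features):
    """Return the best-overlapping category, or UNKNOWN if nothing overlaps."""
    best_category, best_score = UNKNOWN, 0.0
    for category in sorted(model):  # sorted for deterministic tie-breaking
        counts = model[category]
        total = sum(counts.values())
        if total == 0:
            continue
        # Weight each of the entity's predicates by its share in the category.
        score = sum(n * counts[p] / total for p, n in entity_features.items())
        if score > best_score:
            best_category, best_score = category, score
    return best_category


features = {
    "Fan Bingbing": Counter({"occupation": 1, "representative work": 2}),
    "Beijing": Counter({"located in": 1}),
    "Zhang San": Counter({"occupation": 1}),   # hypothetical unknown entity
    "Entity X": Counter({"founded": 1}),       # hypothetical unknown entity
}
model = train(features, {"Fan Bingbing": "person", "Beijing": "place"})
prediction_a = predict(model, features["Zhang San"])
prediction_b = predict(model, features["Entity X"])
```

An entity that shares no predicate with any labeled category stays in the unknown category, which mirrors the case in which some second-class entity data remains unclassified after prediction.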
According to an embodiment of the present disclosure, entity data is extracted from the data set, and the portion of entity data whose entity category is definite (the first-class entity data) is identified. An entity classification model is then trained on the first-class entity data, so that the model learns the associations between the feature information of the first-class entity data and the corresponding entity categories. The trained model is then used to predict the categories of the entity data assigned to the unknown category (the second-class entity data).
In this way, the entity data in any data set can be classified effectively and at a deeper level. Moreover, because every assigned entity category belongs to the preset entity category library, a concise, uniform, and well-standardized classification of the entity data is obtained whenever the categories in the library are themselves uniform and concise.
Once the entity data in the data set has been classified effectively and uniformly, a standardized structure can be built for the data set, so that machines can use and analyze it effectively, for example in information retrieval, intelligent input, automatic proofreading, or automatic error correction.
Fig. 2 schematically shows a flowchart of a method for classifying entity data in a data set according to another embodiment of the present disclosure.
As shown in Fig. 2, when entity data of unknown category remains in the second-class entity data after the prediction of operation S104, the method for classifying entity data in a data set according to an embodiment of the present disclosure further includes operations S105 to S109.
In operation S105, the first-class entity data is mixed with the second-class entity data whose entity categories have been determined by prediction, to obtain new training data.
In operation S106, the entity classification model is retrained on the feature information of the new training data.
In operation S107, the retrained entity classification model is used to predict again the entity category of the entity data of unknown category in the second-class entity data.
Then, in operation S108, it is determined whether any entity data of unknown category still remains. If so, some of the entity data extracted from the data set is still of unknown category, and the method proceeds to operation S109. If not, all the entity data extracted from the data set has been classified, and classification ends.
In operation S109, it is determined whether the result of this round of classification has changed compared with the previous round.
If the result has changed, then although this round did not classify all of the second-class entity data, part of the second-class entity data was assigned a definite entity category by prediction during this round. This newly classified part of the second-class entity data can also be used as training data for the entity classification model. That is, operations S105 to S108 are performed in a loop to repeatedly classify the second-class entity data that still belongs to the unknown category.
If the result has not changed, then although some entity data of unknown category remains in the second-class entity data, no new training data is available. Once the training data no longer changes and the classification model has already been trained thoroughly on it, further training and prediction with the same model will yield nothing new. Prediction of the remaining entity data of unknown category in the second-class entity data is therefore abandoned.
Of course, it should be understood that "the result has not changed" may mean that the classification results have remained unchanged over several consecutive rounds from some point on, after which further prediction is abandoned.
In other words, training need not stop as soon as two adjacent rounds produce the same classification result. Identical results in two adjacent rounds may be caused by random factors, so the influence of such factors can be ruled out by checking whether the results also remain unchanged over several subsequent rounds.
According to an embodiment of the present disclosure, following the idea of bootstrapping, the second-class entity data whose entity categories have been determined by prediction is also used as training data, and the entity classification model is trained in a heuristic, iterative manner. The entity data of unknown category remaining in the second-class entity data is classified repeatedly, until either no entity data of unknown category remains in the second-class entity data and classification is complete, or the prediction results for the remaining entity data of unknown category no longer change and further prediction is abandoned.
Completed in this manner it is possible to which the almost all of the solid data extracted from the data acquisition system is classified.So as to have Effect ground build the data acquisition system more completely, the data structure of specification, drastically increase the available of the data acquisition system Property.
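The iterative procedure of operations S105 to S109 can be illustrated with a minimal Python sketch. It is not the classifier of the disclosure: the predicate-voting model, the function names (`train`, `predict`, `classify_iteratively`) and the `patience` parameter are assumptions chosen only to show the loop and its two stopping conditions.

```python
from collections import Counter, defaultdict

UNKNOWN = "unknown"

def train(labeled):
    """Toy model (assumption): count how often each relation predicate
    co-occurs with each entity class in the labeled data."""
    votes = defaultdict(Counter)
    for predicates, entity_class in labeled.values():
        for p in predicates:
            votes[p][entity_class] += 1
    return votes

def predict(votes, predicates):
    """Return the class with the most votes, or UNKNOWN if there is
    no evidence or the top two classes are tied."""
    tally = Counter()
    for p in predicates:
        tally.update(votes.get(p, {}))
    if not tally:
        return UNKNOWN
    (best, n), *rest = tally.most_common(2)
    if rest and rest[0][1] == n:
        return UNKNOWN
    return best

def classify_iteratively(known, unknown, patience=2):
    """known: {entity: (predicates, class)}; unknown: {entity: predicates}.
    Stops when nothing is left to classify, or when `patience`
    consecutive rounds classify nothing new."""
    known, unknown = dict(known), dict(unknown)
    unchanged = 0
    while unknown and unchanged < patience:
        model = train(known)                      # (re)train on all labeled data
        newly = {e: predict(model, ps) for e, ps in unknown.items()}
        newly = {e: c for e, c in newly.items() if c != UNKNOWN}
        if newly:
            unchanged = 0
            for e, c in newly.items():            # mix predictions into training data
                known[e] = (unknown.pop(e), c)
        else:
            unchanged += 1
    return {e: c for e, (_, c) in known.items()}, set(unknown)
```

Because this toy model is deterministic, one unchanged round already implies that later rounds will not change either; `patience` is kept only to mirror the "several consecutive rounds" check, which matters for models with random components.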
Fig. 3 schematically illustrates a flowchart of extracting entity data in the method according to an embodiment of the present disclosure.
As shown in Fig. 3, extracting at least one entity data item from any one data set in operation S101 may include operation S301 and operation S302.
In operation S301, at least one triple is extracted from the data set. Each triple includes two entity data items and the relation predicate between them; the relation predicate describes the relation between the two entity data items.
Specifically, the at least one triple may be extracted from the data set with a triple extraction tool, or with general-purpose Chinese word segmentation and part-of-speech tagging tools, among others.
When a triple extraction tool is used, rule templates may be applied. For example, triples with a nationality relation may be extracted according to the rule template "[,.](.*)'s nationality is (.*)[,.]".
For example, consider the passage: "..., Fan Bingbing's nationality is China. ...".
The triple (Fan Bingbing, nationality, China) can be extracted from this passage. Here, "Fan Bingbing" is one entity data item and can be regarded as the subject entity data; "China" is another entity data item and can accordingly be regarded as the object entity data; "nationality" is the relation predicate between the two.
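As a rough illustration of rule-template extraction, the sketch below applies an English rendering of the nationality template with Python's standard `re` module. The exact pattern, the constant `NATIONALITY_RULE` and the function name are assumptions for illustration; the disclosure does not specify an implementation.

```python
import re

# English analogue of the nationality rule template (assumption).
# The subject starts at the text start or after a delimiter; the object
# ends just before the next comma or full stop, which is left unconsumed
# so that it can open the next match.
NATIONALITY_RULE = re.compile(
    r"(?:^|[,.])\s*([^,.]+?)'s nationality is ([^,.]+?)\s*(?=[,.])"
)

def extract_nationality_triples(text):
    """Return (subject entity, relation predicate, object entity) triples."""
    return [
        (subject.strip(), "nationality", obj.strip())
        for subject, obj in NATIONALITY_RULE.findall(text)
    ]

print(extract_nationality_triples("..., Fan Bingbing's nationality is China. ..."))
```

Each relation predicate would need its own template in this approach, which is one reason the text also mentions segmentation and part-of-speech tagging as an alternative.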
Alternatively, the at least one triple may be extracted from the data set with general-purpose Chinese word segmentation and part-of-speech tagging tools.
Many Chinese word segmentation and part-of-speech tagging tools are available. The basic idea is that, given a sentence, the tool segments it into words and assigns a part of speech to each word. Take the following sentence as an example:
The headquarters location of Chongqing Chang Ping Construction Group Co., Ltd. is located at No. 162 Yu Nan Avenue, Banan District, Chongqing City.
The Chinese word segmentation and part-of-speech tagging tool can split the sentence into the following structure:
"Chongqing Chang Ping Construction Group Co., Ltd./ns headquarters location/ns is located/v at/p No. 162 Yu Nan Avenue, Banan District, Chongqing City/ns",
The triple (Chongqing Chang Ping Construction Group Co., Ltd., headquarters location, No. 162 Yu Nan Avenue, Banan District, Chongqing City) can then be extracted from it. Here, "Chongqing Chang Ping Construction Group Co., Ltd." and "No. 162 Yu Nan Avenue, Banan District, Chongqing City" are the two entity data items, and "headquarters location" is the relation predicate between these two entity data items.
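One way to turn such a tagged sentence into a triple is sketched below. The heuristic (exactly three nominal chunks plus a verb, read as subject, predicate and object) and the function name `triple_from_tagged` are assumptions made for this example only; the input is assumed to be already segmented and tagged by an external tool.

```python
def triple_from_tagged(tokens):
    """tokens: list of (word, tag) pairs in sentence order.

    Heuristic (assumption): if the sentence has a verb and exactly three
    nominal chunks (tags starting with "n"), read them in order as
    subject entity, relation predicate and object entity.
    Returns None when the pattern does not apply.
    """
    nominal = [word for word, tag in tokens if tag.startswith("n")]
    has_verb = any(tag == "v" for _, tag in tokens)
    if len(nominal) != 3 or not has_verb:
        return None
    subject, predicate, obj = nominal
    return (subject, predicate, obj)

tagged = [
    ("Chongqing Chang Ping Construction Group Co., Ltd.", "ns"),
    ("headquarters location", "ns"),
    ("is located", "v"),
    ("at", "p"),
    ("No. 162 Yu Nan Avenue, Banan District, Chongqing City", "ns"),
]
print(triple_from_tagged(tagged))
```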
In operation S302, the entity data in each of the at least one triple is extracted.
Once the triples have been obtained, the entity data they contain can be obtained from them.
In the example above, the triple (Fan Bingbing, nationality, China) has the data structure <subject entity, relation predicate, object entity>, so the first and third elements of the triple are entity data.
Of course, it should be understood that triples obtained by different methods may have different structures. For example, if a triple has the structure <subject entity, object entity, relation predicate>, its first two elements are the entity data. The entity data is therefore extracted according to the specific data structure of each kind of triple.
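The dependence on triple layout can be sketched in a few lines of Python. The layout codes "spo" and "sop" and the function name `entities_of` are hypothetical labels introduced here for illustration.

```python
# Positions of the two entity data items for each triple layout.
# "spo" = <subject, predicate, object>; "sop" = <subject, object, predicate>.
ENTITY_POSITIONS = {"spo": (0, 2), "sop": (0, 1)}

def entities_of(triple, layout="spo"):
    """Return the (subject entity, object entity) pair of a triple."""
    i, j = ENTITY_POSITIONS[layout]
    return triple[i], triple[j]

print(entities_of(("Fan Bingbing", "nationality", "China")))
print(entities_of(("Fan Bingbing", "China", "nationality"), layout="sop"))
```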
Fig. 4 schematically illustrates a flowchart of determining the entity class to which each entity data item belongs in the method according to an embodiment of the present disclosure.
As shown in Fig. 4, determining in operation S102 the entity class to which each of the at least one entity data item belongs includes operation S401 and operation S402.
In operation S401, for each entity data item, at least one relation predicate corresponding to that entity data item is determined from the at least one triple that includes it.
In operation S402, the entity class to which the entity data item belongs is determined according to the content of the relation predicates and/or the number of occurrences of the same relation predicate.
For example, for the data in Table 1:
Chinese name | Zheng Chouyu | Date of birth | 1933
Alias | Zheng Wentao | Occupation | Poet; writer-in-residence at Ocean University of China
Nationality | China | Alma mater | Taiwan Chung Hsing University
Birthplace | Jinan, Shandong Province | Representative works | "Mistake", "Clasp Knife"
Table 1
The following triples can be extracted:
(Zheng Chouyu, Chinese name, Zheng Chouyu)
(Zheng Chouyu, alias, Zheng Wentao)
(Zheng Chouyu, nationality, China)
(Zheng Chouyu, birthplace, Jinan, Shandong Province)
(Zheng Chouyu, date of birth, 1933)
(Zheng Chouyu, occupation, poet)
(Zheng Chouyu, occupation, writer-in-residence at Ocean University of China)
(Zheng Chouyu, alma mater, Taiwan Chung Hsing University)
(Zheng Chouyu, representative works, Mistake)
(Zheng Chouyu, representative works, Clasp Knife)
In operation S401, based on the triples above, it can be determined that when the entity data item "Zheng Chouyu" is the subject entity, its corresponding relation predicates include Chinese name, alias, nationality, birthplace, date of birth, occupation, alma mater and representative works.
It can also be determined that "China" is the object entity of the relation predicate "nationality" and, correspondingly, "Jinan, Shandong Province" is the object entity of the relation predicate "birthplace".
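Operation S401 amounts to indexing the triples by entity. A minimal sketch follows, assuming <subject, predicate, object> triples and recording for each entity both its role and how often each predicate occurs; the function name `predicates_by_entity` and the (role, predicate) key format are assumptions. The sample triples are a subset of those from Table 1, with the poet's name rendered as Zheng Chouyu.

```python
from collections import Counter, defaultdict

def predicates_by_entity(triples):
    """Map each entity to a Counter of (role, predicate) occurrences."""
    index = defaultdict(Counter)
    for subject, predicate, obj in triples:
        index[subject][("subject", predicate)] += 1
        index[obj][("object", predicate)] += 1
    return index

triples = [
    ("Zheng Chouyu", "nationality", "China"),
    ("Zheng Chouyu", "birthplace", "Jinan, Shandong Province"),
    ("Zheng Chouyu", "occupation", "poet"),
    ("Zheng Chouyu", "occupation", "writer-in-residence at Ocean University of China"),
    ("Zheng Chouyu", "representative works", "Mistake"),
]
index = predicates_by_entity(triples)
print(index["Zheng Chouyu"][("subject", "occupation")])
print(dict(index["China"]))
```

The occurrence counts correspond to the relation predicate frequency that operation S402 may take into account.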
In operation S402, it can be determined that the entity data item "Zheng Chouyu" belongs to the person class according to the combination of all relation predicates corresponding to "Zheng Chouyu", any one of them, or a combination of some of them.
In this example, "Chinese name" or "date of birth" alone may not be enough to determine with certainty that "Zheng Chouyu" belongs to the person class. For example, if the data set also contains entity data of the animal class, the relation predicates of such entity data may also include a Chinese name or a recorded date of birth. In that case, further combining "representative works", "alma mater" or "occupation" makes it possible to determine that the entity data item "Zheng Chouyu" belongs to the person class.
Likewise, for the object entity data in the triples above, the class of some items can be determined from a single relation predicate. For example, "Jinan, Shandong Province" is the object entity of the relation predicate "birthplace", so it can be determined that "Jinan, Shandong Province" belongs to the place class.
Table 1 contains little data, so some entity data items may have only a few associated relation predicates, which are not necessarily enough to determine their class. For example, based only on the relation predicate "representative works", it cannot be fully determined which class "Mistake" and "Clasp Knife" belong to (for example, the artistic works class, the building class or the engineering drawing class). More relation data is therefore needed before their class can be determined.
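The reasoning above can be sketched as a small rule table that assigns a class only when the predicate evidence is sufficient and otherwise falls back to the unknown class. The rules in `RULES`, the class names and the function name `determine_class` are hypothetical; a real entity class library and its rules would be defined by the implementer.

```python
UNKNOWN = "unknown"

# Hypothetical rules: for each class, any one of the listed sets of
# (role, predicate) pairs counts as sufficient evidence.
RULES = {
    "person": [
        {("subject", "occupation")},
        {("subject", "alma mater")},
        {("subject", "representative works")},
    ],
    "place": [
        {("object", "birthplace")},
    ],
}

def determine_class(evidence):
    """evidence: iterable of (role, predicate) pairs seen for one entity.
    Returns a class only if exactly one class is supported."""
    evidence = set(evidence)
    matches = [
        entity_class
        for entity_class, alternatives in RULES.items()
        if any(required <= evidence for required in alternatives)
    ]
    return matches[0] if len(matches) == 1 else UNKNOWN

print(determine_class({("subject", "Chinese name"), ("subject", "occupation")}))
print(determine_class({("subject", "Chinese name")}))
print(determine_class({("object", "representative works")}))
```

Entities that fall into the unknown class here are the ones later handed to the trained classification model.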
It should be understood that the data in Table 1 is only an exemplary illustration of determining the class of entity data according to relation predicates.
The data sets to which the method according to the embodiments of the present disclosure is applied are far larger than the exemplary data in Table 1. They may be, for example, massive Internet data (such as the data of an encyclopedia website for a given year, or the data of a blog website for one or more years), which contains a large amount of entity data, triple relations and so on. Sufficient training data can thus be provided to train the classification model thoroughly.
Fig. 5 schematically illustrates a block diagram of a device for classifying entity data in a data set according to an embodiment of the present disclosure.
As shown in Fig. 5, the device 500 for classifying entity data in a data set according to an embodiment of the present disclosure includes an entity data extraction module 510, an entity class determination module 520, a classification model training module 530 and an unknown class prediction module 540.
The device 500 can be used to implement the method for classifying entity data in a data set described with reference to Figs. 1 to 4.
The entity data extraction module 510 is configured to extract at least one entity data item from any one data set, the entity data including words in the data set that indicate any one object;
The entity class determination module 520 is configured to determine the entity class to which each of the at least one entity data item belongs, the entity class being a class in a preset entity class library, wherein the entity class library includes an unknown class for entity data whose entity class cannot be determined;
The classification model training module 530 is configured to, when second-class entity data classified into the unknown class exists, train an entity classification model according to feature information of first-class entity data whose entity class has been determined;
The unknown class prediction module 540 is configured to predict, with the entity classification model, the entity class to which each entity data item in the second-class entity data belongs.
According to an embodiment of the present disclosure, the device 500 extracts entity data from the data set and determines the part of the entity data that has a definite classification result. It then trains the entity classification model on that entity data, so that the model effectively learns the association between the various feature information of the first-class entity data and the corresponding entity classes, and uses the trained model to predict the class of the entity data that has no definite classification result.
In this way, the entity data in any one data set can be classified effectively and at a deeper level. Moreover, because the entity classes belong to the preset entity class library, when the classes in the preset entity class library are consistent and concise, the resulting classification of the entity data is concise, unified and relatively standardized.
Once the entity data in the data set has been classified effectively and uniformly, a normalized structure can be built for the data set, so that the data set can be used and analyzed effectively by machines, for example in information retrieval, intelligent input, automatic proofreading or automatic error correction.
According to an embodiment of the present disclosure, the device 500 further includes a combined training data generation module 550.
The combined training data generation module 550 is configured to, when entity data of the unknown class still exists in the second-class entity data after prediction, mix the first-class entity data with the entity data in the second-class entity data whose entity class has been predicted, to obtain new training data.
The classification model training module 530 is further configured to retrain the entity classification model according to the feature information of the new training data.
The feature information includes triple information, relation predicate information, and/or relation predicate frequency information.
The unknown class prediction module 540 is further configured to predict again, with the retrained entity classification model, the entity class of the entity data of the unknown class in the second-class entity data.
According to an embodiment of the present disclosure, the device 500 also uses the entity data in the second-class entity data whose entity class has been predicted as part of the training data, and trains the entity classification model heuristically and iteratively. The entity data of the unknown class in the second-class entity data is classified repeatedly, until classification completes because no entity data of the unknown class remains, or until further prediction is abandoned because the prediction results for the remaining entity data of the unknown class no longer change.
In this way, almost all of the entity data extracted from the data set can be classified. A more complete and standardized data structure can thus be built for the data set, which greatly improves the usability of the data set.
According to an embodiment of the present disclosure, the entity data extraction module 510 includes a triple extraction submodule 511 and an entity data extraction submodule 512.
The triple extraction submodule 511 is configured to extract at least one triple from the data set, each triple including two entity data items and the relation predicate between them, the relation predicate describing the relation between the two entity data items;
The entity data extraction submodule 512 is configured to extract the entity data in each of the at least one triple.
According to an embodiment of the present disclosure, the entity class determination module 520 includes a relation predicate determination submodule 521 and an entity class determination submodule 522.
The relation predicate determination submodule 521 is configured to determine, for each entity data item, at least one relation predicate corresponding to that entity data item from the at least one triple that includes it.
The entity class determination submodule 522 is configured to identify the entity class to which the entity data item belongs according to the content of the relation predicates and/or the number of occurrences of the same relation predicate.
It should be understood that the entity data extraction module 510, the entity class determination module 520, the classification model training module 530, the unknown class prediction module 540 and the combined training data generation module 550 may be combined and implemented in one module, or any one of them may be split into multiple modules. Alternatively, at least part of the functions of one or more of these modules may be combined with at least part of the functions of other modules and implemented in one module. According to an embodiment of the present invention, at least one of the entity data extraction module 510, the entity class determination module 520, the classification model training module 530, the unknown class prediction module 540 and the combined training data generation module 550 may be implemented at least partly as a hardware circuit, such as a field programmable gate array (FPGA), a programmable logic array (PLA), a system on chip, a system on a substrate, a system in a package or an application specific integrated circuit (ASIC); or may be implemented in hardware or firmware by any other reasonable way of integrating or packaging circuits; or may be implemented in an appropriate combination of the three implementation forms of software, hardware and firmware. Alternatively, at least one of these modules may be implemented at least partly as a computer program module which, when run by a computer, performs the function of the corresponding module.
Fig. 6 schematically illustrates a block diagram of a device for classifying entity data in a data set according to another embodiment of the present disclosure.
As shown in Fig. 6, the device 600 for classifying entity data in a data set includes a processor 610 and a computer-readable storage medium 620. The device 600 can perform the method described above with reference to Figs. 2 to 4 to classify the entity data in a data set.
Specifically, the processor 610 may include, for example, a general-purpose microprocessor, an instruction set processor and/or a related chipset, and/or a special-purpose microprocessor (for example, an application specific integrated circuit (ASIC)). The processor 610 may also include on-board memory for caching purposes. The processor 610 may be a single processing unit or multiple processing units for performing the different actions of the method flow according to the embodiments of the present disclosure described with reference to Figs. 2 to 4.
The computer-readable storage medium 620 may be, for example, any medium that can contain, store, communicate, propagate or transport instructions. For example, a readable storage medium may include but is not limited to an electronic, magnetic, optical, electromagnetic, infrared or semiconductor system, apparatus, device or propagation medium. Specific examples of readable storage media include: magnetic storage devices, such as magnetic tape or hard disks (HDD); optical storage devices, such as optical discs (CD-ROM); memories, such as random access memory (RAM) or flash memory; and/or wired/wireless communication links.
The computer-readable storage medium 620 may include a computer program 621, which may include code/computer-executable instructions that, when executed by the processor 610, cause the processor 610 to perform, for example, the method flow described above in connection with Figs. 2 to 4 and any variations thereof.
The computer program 621 may be configured with computer program code that includes, for example, computer program modules. For example, in an exemplary embodiment, the code in the computer program 621 may include one or more program modules, for example module 621A, module 621B, and so on. It should be noted that the way the modules are divided and their number are not fixed; those skilled in the art may use suitable program modules or combinations of program modules according to the actual situation. When these combinations of program modules are executed by the processor 610, they cause the processor 610 to perform, for example, the method flow described above in connection with Figs. 2 to 4 and any variations thereof.
According to an embodiment of the present invention, at least one of the entity data extraction module 510, the entity class determination module 520, the classification model training module 530, the unknown class prediction module 540 and the combined training data generation module 550 may be implemented as a computer program module described with reference to Fig. 6 which, when executed by the processor 610, implements the corresponding operations described above.
Those skilled in the art will understand that the features described in the embodiments and/or claims of the present disclosure may be combined and/or integrated in various ways, even if such combinations or integrations are not expressly described in the present disclosure. In particular, the features described in the embodiments and/or claims of the present disclosure may be combined and/or integrated in various ways without departing from the spirit or teaching of the present disclosure. All such combinations and/or integrations fall within the scope of the present disclosure.
Although the present disclosure has been shown and described with reference to certain exemplary embodiments thereof, those skilled in the art should understand that various changes in form and detail may be made to the present disclosure without departing from the spirit and scope of the present disclosure as defined by the appended claims and their equivalents. Therefore, the scope of the present disclosure should not be limited to the above embodiments, but should be determined not only by the appended claims but also by their equivalents.

Claims (10)

1. A method for classifying entity data in a data set, comprising:
extracting at least one entity data item from any one data set, the entity data including words in the data set that have independent meaning and can be used to indicate any one object;
determining the entity class to which each of the at least one entity data item belongs, the entity class being a class in a preset entity class library, wherein the entity class library includes an unknown class for entity data whose entity class cannot be determined;
when second-class entity data classified into the unknown class exists, training an entity classification model according to feature information of first-class entity data whose entity class has been determined;
predicting, with the entity classification model, the entity class to which each entity data item in the second-class entity data belongs.
2. The method according to claim 1, wherein, when entity data of the unknown class still exists in the second-class entity data after prediction, the method further comprises:
mixing the first-class entity data with the entity data in the second-class entity data whose entity class has been predicted, to obtain new training data;
retraining the entity classification model according to the feature information of the new training data;
predicting again, with the retrained entity classification model, the entity class of the entity data of the unknown class in the second-class entity data;
when entity data of the unknown class still exists in the second-class entity data, repeating the mixing, retraining and re-prediction operations, until classification completes because no entity data of the unknown class remains in the second-class entity data, or until further prediction is abandoned because the prediction results for the entity data of the unknown class in the second-class entity data no longer change.
3. The method according to claim 1, wherein extracting at least one entity data item from any one data set comprises:
extracting at least one triple from the data set, each triple including two entity data items and the relation predicate between the two entity data items, the relation predicate describing the relation between the two entity data items;
extracting the entity data in each of the at least one triple.
4. The method according to claim 3, wherein determining the entity class to which each of the at least one entity data item belongs comprises:
for each entity data item, determining at least one relation predicate corresponding to that entity data item from the at least one triple that includes it;
determining the entity class to which the entity data item belongs according to the content of the relation predicates and/or the number of occurrences of the same relation predicate.
5. The method according to claim 1, wherein:
the feature information includes triple information, relation predicate information, and/or relation predicate frequency information.
6. A device for classifying entity data in a data set, comprising:
an entity data extraction module, configured to extract at least one entity data item from any one data set, the entity data including words in the data set that have independent meaning and can be used to indicate any one object;
an entity class determination module, configured to determine the entity class to which each of the at least one entity data item belongs, the entity class being a class in a preset entity class library, wherein the entity class library includes an unknown class for entity data whose entity class cannot be determined;
a classification model training module, configured to, when second-class entity data classified into the unknown class exists, train an entity classification model according to feature information of first-class entity data whose entity class has been determined;
an unknown class prediction module, configured to predict, with the entity classification model, the entity class to which each entity data item in the second-class entity data belongs.
7. The device according to claim 6, further comprising:
a combined training data generation module, configured to, when entity data of the unknown class still exists in the second-class entity data after prediction, mix the first-class entity data with the entity data in the second-class entity data whose entity class has been predicted, to obtain new training data;
wherein the classification model training module is further configured to retrain the entity classification model according to the feature information of the new training data;
and the unknown class prediction module is further configured to predict again, with the retrained entity classification model, the entity class of the entity data of the unknown class in the second-class entity data.
8. The device according to claim 6, wherein the entity data extraction module comprises:
a triple extraction submodule, configured to extract at least one triple from the data set, each triple including two entity data items and the relation predicate between the two entity data items, the relation predicate describing the relation between the two entity data items;
an entity data extraction submodule, configured to extract the entity data in each of the at least one triple.
9. The device according to claim 8, wherein the entity class determination module comprises:
a relation predicate determination submodule, configured to determine, for each entity data item, at least one relation predicate corresponding to that entity data item from the at least one triple that includes it;
an entity class determination submodule, configured to identify the entity class to which the entity data item belongs according to the content of the relation predicates and/or the number of occurrences of the same relation predicate.
10. The device according to claim 6, wherein:
the feature information includes triple information, relation predicate information, and/or relation predicate frequency information.
CN201710903481.0A 2017-09-28 2017-09-28 The method and apparatus sorted out to the solid data in data acquisition system Pending CN107622126A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710903481.0A CN107622126A (en) 2017-09-28 2017-09-28 The method and apparatus sorted out to the solid data in data acquisition system

Publications (1)

Publication Number Publication Date
CN107622126A true CN107622126A (en) 2018-01-23

Family

ID=61091385

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710903481.0A Pending CN107622126A (en) 2017-09-28 2017-09-28 The method and apparatus sorted out to the solid data in data acquisition system

Country Status (1)

Country Link
CN (1) CN107622126A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102662923A (en) * 2012-04-23 2012-09-12 天津大学 Entity instance leading method based on machine learning
CN103617239A (en) * 2013-11-26 2014-03-05 百度在线网络技术(北京)有限公司 Method and device for identifying named entity and method and device for establishing classification model
CN103678316A (en) * 2012-08-31 2014-03-26 富士通株式会社 Entity relationship classifying device and entity relationship classifying method
US20160092406A1 (en) * 2014-09-30 2016-03-31 Microsoft Technology Licensing, Llc Inferring Layout Intent

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
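Jia Zhen: "Research on Ontology Learning and Knowledge Acquisition for Chinese Online Encyclopedias", China Doctoral Dissertations Full-text Database, Information Science and Technology *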

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108304381A (en) * 2018-01-25 2018-07-20 北京百度网讯科技有限公司 Entity edge establishing method, device and equipment based on artificial intelligence and storage medium
CN108304381B (en) * 2018-01-25 2021-09-21 北京百度网讯科技有限公司 Entity edge establishing method, device and equipment based on artificial intelligence and storage medium
CN110555208A (en) * 2018-06-04 2019-12-10 北京三快在线科技有限公司 Ambiguity elimination method and device in information query and electronic equipment
CN110555208B (en) * 2018-06-04 2021-11-19 北京三快在线科技有限公司 Ambiguity elimination method and device in information query and electronic equipment

Similar Documents

Publication Publication Date Title
Kim et al. Transparency and accountability in AI decision support: Explaining and visualizing convolutional neural networks for text information
Manjunatha et al. Explicit bias discovery in visual question answering models
Vylomova et al. Take and took, gaggle and goose, book and read: Evaluating the utility of vector differences for lexical relation learning
US20170364773A1 (en) Training a classifier algorithm used for automatically generating tags to be applied to images
US20190347571A1 (en) Classifier training
US20170270096A1 (en) Method and system for generating large coded data set of text from textual documents using high resolution labeling
CN112257421A (en) Nested entity data identification method and device and electronic equipment
Na et al. Discovery of natural language concepts in individual units of cnns
Noguti et al. Legal document classification: An application to law area prediction of petitions to public prosecution service
CN116628229B (en) Method and device for generating text corpus by using knowledge graph
CN112132238A (en) Method, device, equipment and readable medium for identifying private data
Gadek et al. An interpretable model to measure fakeness and emotion in news
CN108090099A (en) Text processing method and device
CN113947086A (en) Sample data generation method, training method, corpus generation method and apparatus
CN114638914A (en) Image generation method and device, computer equipment and storage medium
CN107622126A (en) Method and apparatus for classifying entity data in a data set
CN110069558A (en) Data analysing method and terminal device based on deep learning
CN112015915A (en) Question-answering system and device based on knowledge base generated by questions
CN114842982B (en) Knowledge expression method, device and system for medical information system
CN115081452B (en) Method for extracting entity relationship
Aurnhammer et al. Manual Annotation of Unsupervised Models: Close and Distant Reading of Politics on Reddit.
CN110888940A (en) Text information extraction method and device, computer equipment and storage medium
CN113610080B (en) Cross-modal perception-based sensitive image identification method, device, equipment and medium
CN112732910B (en) Cross-task text emotion state evaluation method, system, device and medium
CN115048536A (en) Knowledge graph generation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180123