CN110287302A - Method and system for determining the confidence of open-source information in the defense science and technology field - Google Patents

Method and system for determining the confidence of open-source information in the defense science and technology field

Info

Publication number
CN110287302A
Authority
CN
China
Prior art keywords
attribute
entity
corrigendum
feature vector
name
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910572637.0A
Other languages
Chinese (zh)
Other versions
CN110287302B (en)
Inventor
姚晗
晏裕生
程洁丹
孙孟阳
董文轩
江洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
INTRODUCTION OF TECHNOLOGY RESEARCH & ECONOMY DEVELOPMENT INSTITUTE
Original Assignee
INTRODUCTION OF TECHNOLOGY RESEARCH & ECONOMY DEVELOPMENT INSTITUTE
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by INTRODUCTION OF TECHNOLOGY RESEARCH & ECONOMY DEVELOPMENT INSTITUTE
Priority to CN201910572637.0A
Publication of CN110287302A
Application granted
Publication of CN110287302B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G06F40/247 Thesauruses; Synonyms
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G06F40/30 Semantic analysis

Abstract

The invention discloses a method and system for determining the confidence of open-source information in the defense science and technology field. The method performs named entity recognition and attribute extraction on existing open-source information in the defense science and technology field to extract named entities and their corresponding attributes. It then applies entity unification and entity disambiguation to further distinguish and correct the named entities and attributes, improving the accuracy of entity and attribute extraction. In use, the confidence of the open-source information and the confidence of its information source are calculated by cross-checking the same attribute of the same entity across different information sources, providing more accurate information services for users in the defense science and technology field.

Description

Method and system for determining the confidence of open-source information in the defense science and technology field
Technical field
The present invention relates to the technical field of information confidence evaluation and analysis, and in particular to a method and system for determining the confidence of open-source information in the defense science and technology field.
Background
Open-source information is information that can be obtained from public or semi-public channels. When open-source information is processed, an attribute of an entity may be reported differently by different information sources. For example, one article (a piece of information) records the length (attribute) of a certain type of equipment (entity) as 26 meters, while another article records the length of the same equipment as 20 meters. In such cases the user cannot tell which of the two articles provides the more accurate and reliable data. The defense science and technology field places particular emphasis on data accuracy, and erroneous data can have serious consequences for related work.
Summary of the invention
The object of the present invention is to provide a method and system for determining the confidence of open-source information in the defense science and technology field, so as to solve the problem that users cannot judge the reliability of open-source information when they obtain it.
To achieve the above object, the present invention provides the following solutions:
A method for determining the confidence of open-source information in the defense science and technology field, the method comprising:
obtaining open-source information in the defense science and technology field;
identifying all named entities in the open-source information, and the attribute information corresponding to the named entities, using a named entity recognition method based on conditional random fields, the attribute information comprising attributes and attribute values;
performing entity unification and entity disambiguation on the named entities and their corresponding attribute information to form corrected entities and the corrected attribute information corresponding to the corrected entities;
determining the confidence of the open-source information according to the corrected entities and their corresponding corrected attribute information.
Optionally, identifying all named entities in the open-source information and the attribute information corresponding to the named entities using the named entity recognition method based on conditional random fields specifically comprises:
identifying all named entities in the open-source information using the named entity recognition method based on conditional random fields;
performing attribute extraction according to the context of each named entity to obtain the attribute information corresponding to the named entity.
Optionally, performing entity unification on the named entities and their corresponding attribute information to form corrected entities and the corrected attribute information corresponding to the corrected entities specifically comprises:
calculating, using a vector space model, the entity feature vector formed by the words surrounding each of the named entities that have different names;
comparing the entity feature vectors of the named entities that have different names using cosine similarity, and classifying named entities whose entity feature vectors are similar but whose names differ as the same corrected named entity;
calculating, using the vector space model, the attribute feature vector formed by the words surrounding each of the differently named attributes of the corrected named entity;
comparing the attribute feature vectors of the differently named attributes using cosine similarity, and classifying attributes whose attribute feature vectors are similar but whose names differ as the same corrected attribute; the corrected attribute and its corresponding attribute value constitute the corrected attribute information.
Optionally, performing entity disambiguation on the named entities and their corresponding attribute information to form corrected entities and the corrected attribute information corresponding to the corrected entities further comprises:
calculating, using the vector space model, the entity feature vector formed by the words surrounding each of multiple named entities that have the same name;
comparing the entity feature vectors of the multiple named entities that have the same name using cosine similarity, and classifying named entities whose names are the same but whose entity feature vectors are dissimilar as different corrected named entities;
calculating, using the vector space model, the attribute feature vector formed by the words surrounding each of multiple identically named attributes of the corrected named entity;
comparing the attribute feature vectors of the multiple identically named attributes using cosine similarity, and classifying attributes whose names are the same but whose attribute feature vectors are dissimilar as different corrected attributes; the corrected attribute and its corresponding attribute value constitute the corrected attribute information.
A system for determining the confidence of open-source information in the defense science and technology field, the system comprising:
an open-source information acquisition module, configured to obtain open-source information in the defense science and technology field;
a named entity recognition and attribute extraction module, configured to identify all named entities in the open-source information, and the attribute information corresponding to the named entities, using a named entity recognition method based on conditional random fields, the attribute information comprising attributes and attribute values;
an entity unification and entity disambiguation module, configured to perform entity unification and entity disambiguation on the named entities and their corresponding attribute information to form corrected entities and the corrected attribute information corresponding to the corrected entities;
a confidence calculation module, configured to determine the confidence of the open-source information according to the corrected entities and their corresponding corrected attribute information.
Optionally, the named entity recognition and attribute extraction module specifically comprises:
a named entity recognition unit, configured to identify all named entities in the open-source information using the named entity recognition method based on conditional random fields;
an attribute extraction unit, configured to perform attribute extraction according to the context of each named entity to obtain the attribute information corresponding to the named entity.
Optionally, the entity unification and entity disambiguation module specifically comprises:
a first entity feature vector calculation unit, configured to calculate, using a vector space model, the entity feature vector formed by the words surrounding each of the named entities that have different names;
a first entity feature vector comparison unit, configured to compare the entity feature vectors of the named entities that have different names using cosine similarity, and to classify named entities whose entity feature vectors are similar but whose names differ as the same corrected named entity;
a first attribute feature vector calculation unit, configured to calculate, using the vector space model, the attribute feature vector formed by the words surrounding each of the differently named attributes of the corrected named entity;
a first attribute feature vector comparison unit, configured to compare the attribute feature vectors of the differently named attributes using cosine similarity, and to classify attributes whose attribute feature vectors are similar but whose names differ as the same corrected attribute; the corrected attribute and its corresponding attribute value constitute the corrected attribute information.
Optionally, the entity unification and entity disambiguation module further comprises:
a second entity feature vector calculation unit, configured to calculate, using the vector space model, the entity feature vector formed by the words surrounding each of multiple named entities that have the same name;
a second entity feature vector comparison unit, configured to compare the entity feature vectors of the multiple named entities that have the same name using cosine similarity, and to classify named entities whose names are the same but whose entity feature vectors are dissimilar as different corrected named entities;
a second attribute feature vector calculation unit, configured to calculate, using the vector space model, the attribute feature vector formed by the words surrounding each of multiple identically named attributes of the corrected named entity;
a second attribute feature vector comparison unit, configured to compare the attribute feature vectors of the multiple identically named attributes using cosine similarity, and to classify attributes whose names are the same but whose attribute feature vectors are dissimilar as different corrected attributes; the corrected attribute and its corresponding attribute value constitute the corrected attribute information.
According to the specific embodiments provided herein, the invention discloses the following technical effects:
The present invention provides a method and system for determining the confidence of open-source information in the defense science and technology field. The method performs named entity recognition and attribute extraction on existing open-source information in the defense science and technology field to extract named entities and their corresponding attributes. It then applies entity unification and entity disambiguation to further distinguish and correct the named entities and attributes, improving the accuracy of entity and attribute extraction. In use, the confidence of the open-source information and the confidence of its information source are calculated by cross-checking the same attribute of the same entity across different information sources, providing more accurate information services for users in the defense science and technology field.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of the method for determining the confidence of open-source information in the defense science and technology field provided by the invention;
Fig. 2 is a schematic diagram of the principle of the method for determining the confidence of open-source information in the defense science and technology field provided by the invention;
Fig. 3 is a structural diagram of the system for determining the confidence of open-source information in the defense science and technology field provided by the invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those of ordinary skill in the art on the basis of these embodiments without creative effort fall within the protection scope of the invention.
The object of the present invention is to provide a method and system for determining the confidence of open-source information in the defense science and technology field. The confidence of a piece of information and the confidence index of its information source are calculated by cross-checking the same open-source information across different sources, so as to solve the problem that users cannot judge the reliability of open-source information when they obtain it.
To make the above objects, features and advantages of the invention clearer and easier to understand, the invention is described in further detail below with reference to the drawings and specific embodiments.
Fig. 1 is a flowchart of the method for determining the confidence of open-source information in the defense science and technology field provided by the invention. Fig. 2 is a schematic diagram of the principle of the method. Referring to Fig. 1 and Fig. 2, the method comprises:
Step 101: Obtain open-source information in the defense science and technology field.
Open-source information (referred to below simply as information) is information that can be obtained from public or semi-public channels. In the present invention, open-source information in the defense science and technology field mainly refers to data resources in that field. These data resources are mainly text data, typically news reports, literature, research reports and the like.
The data resources in the defense science and technology field are collated to serve as the initial data source for the confidence calculation.
Step 102: Identify all named entities in the open-source information, and the attribute information corresponding to the named entities, using a named entity recognition method based on conditional random fields.
Named entity recognition is performed on the data resources formed in step 101. Named entity recognition means automatically identifying named entities in a text data set, mainly proper nouns such as person names, place names, equipment names and organization names, together with entity information such as meaningful times. The invention uses a named entity recognition method based on CRF (Conditional Random Field) to identify all named entities in the data resources.
For each extracted named entity (referred to below simply as an entity), attributes are extracted from the entity's context. The goal of attribute extraction is to obtain the attribute information of a specific entity, which comprises attributes and attribute values. If a certain type of equipment is the entity, then its length, width, price and so on are the attributes of that entity, and the specific length, width and price values are the attribute values of those attributes.
Named entities and their attributes depend on the specific text. For example, from "the length of the X-type steamer is 45m" the named entity "X-type steamer", the attribute name "length" and the attribute value "45m" can be extracted. In a specific implementation, named entities and attribute names do not need to be preset; they are adjusted dynamically according to the specific text.
The process of identifying named entities and attribute information with the named entity recognition method based on conditional random fields comprises:
1. Construct a training set: randomly select part of the candidate data set (the open-source information) as the training set and have domain experts annotate it using the BIEO scheme, where B (Begin) marks the beginning of an entity, I (Intermediate) marks the middle of an entity, E (End) marks the end of an entity, and O (Other) marks a word that is not part of an entity.
2. Train on the training set with the CRF (Conditional Random Field) algorithm to form a named entity recognition model.
3. Identify all named entities in the open-source information using the named entity recognition model.
4. Perform attribute extraction according to the context of each named entity to obtain the attribute information corresponding to the named entity.
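As a rough illustration of how the B/I/E/O labels from step 1 translate back into entity spans after a model has tagged a sentence, the sketch below decodes a tag sequence by hand. The function name `decode_bieo` and the hand-written tags are assumptions made for illustration; the tags stand in for the output of a trained CRF model, which is not shown here.

```python
from typing import List, Tuple

def decode_bieo(tokens: List[str], tags: List[str]) -> List[Tuple[int, int, str]]:
    """Collect (start, end_exclusive, text) spans from B/I/E/O tags."""
    spans = []
    start = None
    for i, tag in enumerate(tags):
        if tag == "B":
            start = i                      # open a new entity
        elif tag == "E" and start is not None:
            spans.append((start, i + 1, " ".join(tokens[start:i + 1])))
            start = None                   # entity closed
        elif tag == "O":
            start = None                   # discard an entity that never reached E
        # tag == "I": remain inside the current entity
    return spans

# Hand-labelled example sentence (illustrative, not model output).
tokens = ["The", "length", "of", "X-type", "steamer", "is", "45", "m"]
tags = ["O", "O", "O", "B", "E", "O", "O", "O"]
entities = decode_bieo(tokens, tags)
print(entities)
```

The scheme as described has no dedicated tag for single-word entities, so this sketch only emits spans that have both a B and an E label.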
Template-based attribute extraction can also be used: attribute extraction templates are written according to the training samples and then used to extract the attributes of named entities.
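A template of this kind could be as simple as a regular expression. The sketch below uses a single hypothetical template of the form "<attribute> of <entity> is <value><unit>"; the pattern, the group names and the function `extract_attribute` are illustrative assumptions, not templates taken from the patent.

```python
import re
from typing import Dict, Optional

# Hypothetical template: "<attribute> of <entity> is <value><unit>".
TEMPLATE = re.compile(
    r"(?:the\s+)?(?P<attribute>\w+) of (?P<entity>[\w\- ]+?) is "
    r"(?P<value>\d+(?:\.\d+)?)\s*(?P<unit>[a-zA-Z]+)",
    re.IGNORECASE,
)

def extract_attribute(sentence: str) -> Optional[Dict[str, str]]:
    """Return entity, attribute, value and unit if the template matches."""
    match = TEMPLATE.search(sentence)
    return match.groupdict() if match else None

record = extract_attribute("The length of X-type steamer is 45m")
print(record)
```

In practice one template would be written per recurring phrasing in the training samples, and a sentence that matches none of them simply yields no attribute.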
Step 103: Perform entity unification and entity disambiguation on the named entities and their corresponding attribute information to form corrected entities and the corrected attribute information corresponding to the corrected entities.
Entity unification and entity disambiguation are performed on the named entities and corresponding attributes formed in step 102. Entity disambiguation is a technique for resolving the ambiguity caused by different entities sharing the same name, and entity unification is a technique for resolving the case where multiple names refer to the same entity. The invention uses clustering: using a vector space model, the feature vector formed by the words surrounding each entity is calculated, and the feature vectors are compared using cosine similarity. Entities with similar descriptions are grouped into one class, and entities with dissimilar descriptions are placed in different classes. This resolves both different names for the same entity and the same name for different entities, and thereby corrects the named entities. The attributes of the entities are corrected by the same method.
Specifically, the process of using entity unification to resolve multiple names referring to the same entity comprises:
calculating, using the vector space model, the entity feature vector formed by the words surrounding each of the named entities that have different names;
comparing the entity feature vectors of the named entities that have different names using cosine similarity, and classifying named entities whose entity feature vectors are similar but whose names differ as the same corrected named entity;
calculating, using the vector space model, the attribute feature vector formed by the words surrounding each of the differently named attributes of the corrected named entity;
comparing the attribute feature vectors of the differently named attributes using cosine similarity, and classifying attributes whose attribute feature vectors are similar but whose names differ as the same corrected attribute; the corrected attribute and its corresponding attribute value constitute the corrected attribute information.
For example, given "the length of the X-type ship is 45m" and "the X-type steamer is about 45m long", if the cosine similarity comparison shows that the two entity feature vectors are similar, "X-type ship" and "X-type steamer" can be regarded as the same entity. Likewise, if the cosine similarity comparison shows that the two attribute feature vectors are similar, "length" and "long" can be regarded as the same attribute.
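The comparison in this example can be sketched with a bag-of-words vector space model and cosine similarity. The context words, the helper names and the similarity threshold of 0.5 below are assumptions chosen for illustration; the patent does not state a threshold or a term-weighting scheme.

```python
import math
from collections import Counter
from typing import List

def context_vector(context_words: List[str]) -> Counter:
    """Bag-of-words vector built from the words surrounding a mention."""
    return Counter(word.lower() for word in context_words)

def cosine_similarity(a: Counter, b: Counter) -> float:
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

SIMILARITY_THRESHOLD = 0.5  # assumed value, not specified in the text

# Illustrative context words around "X-type ship" and "X-type steamer".
ship = context_vector(["length", "is", "45m", "displacement", "hull"])
steamer = context_vector(["about", "45m", "long", "hull", "displacement"])

similarity = cosine_similarity(ship, steamer)
same_entity = similarity >= SIMILARITY_THRESHOLD
print(round(similarity, 2), same_entity)
```

The same two functions apply unchanged to attribute names: the vectors are then built from the words surrounding each attribute mention rather than each entity mention.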
The process of using entity disambiguation to resolve the ambiguity caused by entities with the same name comprises:
calculating, using the vector space model, the entity feature vector formed by the words surrounding each of multiple named entities that have the same name;
comparing the entity feature vectors of the multiple named entities that have the same name using cosine similarity, and classifying named entities whose names are the same but whose entity feature vectors are dissimilar as different corrected named entities;
calculating, using the vector space model, the attribute feature vector formed by the words surrounding each of multiple identically named attributes of the corrected named entity;
comparing the attribute feature vectors of the multiple identically named attributes using cosine similarity, and classifying attributes whose names are the same but whose attribute feature vectors are dissimilar as different corrected attributes; the corrected attribute and its corresponding attribute value constitute the corrected attribute information.
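One way to picture the disambiguation step is as a grouping of same-name mentions by context similarity. The sketch below uses a simple greedy grouping; the patent only says that clustering is used, so the greedy strategy, the threshold of 0.5 and the sample contexts are all assumptions.

```python
import math
from collections import Counter
from typing import List

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def split_same_name(contexts: List[List[str]], threshold: float = 0.5) -> List[List[int]]:
    """Greedy grouping: a mention joins the first group whose first member
    has a similar context, otherwise it starts a new group (a new entity)."""
    vectors = [Counter(words) for words in contexts]
    groups: List[List[int]] = []
    for idx, vec in enumerate(vectors):
        for group in groups:
            if cosine(vectors[group[0]], vec) >= threshold:
                group.append(idx)
                break
        else:
            groups.append([idx])
    return groups

# Three mentions that share one name but appear in different contexts.
mentions = [
    ["missile", "range", "warhead", "launch"],
    ["launch", "missile", "warhead", "test"],
    ["satellite", "orbit", "payload", "launch"],
]
groups = split_same_name(mentions)
print(groups)
```

Each resulting group corresponds to one corrected entity: the first two mentions are treated as the same entity, and the third as a different entity that happens to share the name.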
Step 104: Determine the confidence of the open-source information according to the corrected entities and their corresponding corrected attribute information.
For the corrected entities and corresponding corrected attributes formed by the entity unification and entity disambiguation of step 103, the multiple attribute values of the same corrected attribute of the same corrected entity are compared to judge whether they are consistent. For example, the tactical and technical indicators of equipment may be given in different units in different information sources, so the attribute values differ. After unit conversion, it is judged whether the error between the attribute values is within an acceptable range. In the defense equipment field, attribute values whose error is within 0.1% are generally regarded as the same attribute value.
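The unit conversion and tolerance check can be sketched as follows. The conversion table and function names are illustrative assumptions, and the text does not say which value the 0.1% error is measured against, so the larger of the two magnitudes is assumed as the reference.

```python
# Conversion factors to meters; the unit table is an illustrative assumption.
TO_METERS = {"m": 1.0, "km": 1000.0, "ft": 0.3048}
RELATIVE_TOLERANCE = 0.001  # 0.1%, the threshold quoted in the text

def to_meters(value: float, unit: str) -> float:
    return value * TO_METERS[unit]

def same_attribute_value(a: float, b: float, tolerance: float = RELATIVE_TOLERANCE) -> bool:
    """Treat two converted values as identical if their relative error is within tolerance."""
    reference = max(abs(a), abs(b))  # assumed reference for the relative error
    if reference == 0.0:
        return True
    return abs(a - b) / reference <= tolerance

length_a = to_meters(45.0, "m")
length_b = to_meters(147.64, "ft")  # the same length reported in feet
consistent = same_attribute_value(length_a, length_b)
conflicting = same_attribute_value(to_meters(26.0, "m"), to_meters(20.0, "m"))
print(consistent, conflicting)
```

Values that pass this check are counted as occurrences of one attribute value in the confidence calculation; values that fail it are counted as separate classes.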
If the information from all data sources is consistent, the confidence of the information and of the information source is increased; if there are inconsistencies, the confidence of the information and of the information source is reduced. A piece of information usually refers to an article, and an information source refers to the organization that published the article. The confidence of an article is calculated from the multiple entities in it, and the confidence of an organization is calculated from the multiple articles it has published. The higher the confidence, the more accurate the organization, the article, or the attribute value of the entity's attribute.
The confidence calculation of the invention uses a 5-point scale. The open-source information confidence calculation proceeds as follows:
Calculate the attribute value confidence: the confidence of attribute value i is given by formula (1), where VC_i denotes the confidence of attribute value i, VF_i denotes the number of times attribute value i occurs, and N denotes the number of classes of attribute value of the attribute to which attribute value i belongs.
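Formula (1) itself is not reproduced in this text. A reconstruction consistent with the symbol definitions and with the worked example (a value seen in 8 of 10 occurrences receiving a confidence of 4) would be the following; the summation index k is introduced here for illustration.

```latex
VC_i = 5 \times \frac{VF_i}{\sum_{k=1}^{N} VF_k} \tag{1}
```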
Calculate the attribute confidence: the attribute confidence of a corrected attribute of a corrected entity is calculated from the percentage of the total number of occurrences accounted for by identical attribute values, multiplied by 5.
For example, suppose corrected attribute A of a corrected entity occurs 10 times in total and the attribute value is A1 in 8 of those occurrences; then the confidence that the value of corrected attribute A is A1 is 4. If the remaining 2 occurrences have the same attribute value, A2, the confidence that the value of corrected attribute A is A2 is 1. If the remaining 2 occurrences have different attribute values, for example one is A3 and the other is A4, the confidence of A3 and of A4 is 0.5 each.
Specifically, the confidence of attribute j is given by formula (2), where AC_j denotes the confidence of attribute j; the remaining terms are the number of times the i-th class of attribute value of attribute j occurs, the confidence of the i-th class of attribute value of attribute j, and N, the number of classes of attribute value of attribute j.
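Formula (2) is not reproduced in this text. Given the quantities listed, and by analogy with the frequency-weighted averages described for entities, information and information sources, one plausible form is the weighted average below. The symbols VF_i^{(j)} and VC_i^{(j)} are assumed names for the occurrence count and confidence of the i-th class of attribute value of attribute j.

```latex
AC_j = \frac{\sum_{i=1}^{N} VF_i^{(j)} \, VC_i^{(j)}}{\sum_{i=1}^{N} VF_i^{(j)}} \tag{2}
```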
Table 1 shows an example calculation of the confidence of attribute values and of an attribute; the confidence of the attribute is calculated from the confidences of all of its attribute values.
Table 1. Example confidence calculation for attribute values and an attribute
Type             Name     Occurrences   Confidence
Attribute name   Length   10            3.4
Attribute value  500      8             4.0
Attribute value  480      1             0.5
Attribute value  530      1             0.5
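The attribute value confidence described above (share of occurrences multiplied by 5) can be sketched in a few lines, using the A1/A2 and A1/A3/A4 cases from the worked example. The attribute-level function assumes a frequency-weighted average of the value confidences, which is an interpretation rather than a formula quoted from the text; all function and variable names are illustrative.

```python
from typing import Dict

SCALE = 5.0  # the 5-point scale used in the text

def value_confidences(counts: Dict[str, int]) -> Dict[str, float]:
    """Confidence of each attribute value: its share of all occurrences times 5."""
    total = sum(counts.values())
    return {value: SCALE * n / total for value, n in counts.items()}

def attribute_confidence(counts: Dict[str, int]) -> float:
    """Assumed form: frequency-weighted average of the value confidences."""
    confidences = value_confidences(counts)
    total = sum(counts.values())
    return sum(counts[v] * confidences[v] for v in counts) / total

same_remainder = {"A1": 8, "A2": 2}             # remaining 2 occurrences agree
split_remainder = {"A1": 8, "A3": 1, "A4": 1}   # remaining 2 occurrences differ

print(value_confidences(same_remainder))    # A1 -> 4.0, A2 -> 1.0
print(value_confidences(split_remainder))   # A1 -> 4.0, A3 -> 0.5, A4 -> 0.5
print(attribute_confidence(same_remainder))
```

A value that dominates the occurrences pulls the attribute confidence toward 5, while an attribute whose sources disagree widely ends up with a low score.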
Calculate the entity confidence: using the number of occurrences of each corrected attribute of the corrected entity as the weight, calculate the weighted average of the confidences of all corrected attributes of the corrected entity, and take it as the confidence of the corrected entity. Specifically, the confidence of entity j is given by formula (3), where EC_j denotes the confidence of entity j; the remaining terms are the total number of times the i-th class of attribute of entity j occurs, the confidence of the i-th class of attribute of entity j, and N, the number of classes of attribute of entity j.
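Formula (3) is not reproduced in this text. The weighted average described in the paragraph can be written as follows, where AF_i^{(j)} and AC_i^{(j)} are assumed names for the occurrence count and confidence of the i-th class of attribute of entity j.

```latex
EC_j = \frac{\sum_{i=1}^{N} AF_i^{(j)} \, AC_i^{(j)}}{\sum_{i=1}^{N} AF_i^{(j)}} \tag{3}
```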
Table 2 shows an example calculation of the confidence of an entity; the confidence of the entity is calculated from the confidences of all of its attributes.
Table 2. Example confidence calculation for an entity
Type        Name       Occurrences   Confidence
Entity      XX warship 20            4.13
Attribute   Length     10            3.4
Attribute   Width      4             4.8
Attribute   Range      6             4.9
Calculate the information confidence: using the number of occurrences of each corrected entity as the weight, calculate the weighted average of the confidences of all corrected entities in the information, and take it as the confidence of the open-source information. Specifically, the confidence of information j is given by formula (4), where IC_j denotes the confidence of information j; the remaining terms are the total number of times the i-th class of entity of information j occurs, the confidence of the i-th class of entity of information j, and N, the number of classes of entity in information j.
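Formula (4) is not reproduced in this text. The weighted average described in the paragraph can be written as follows, where EF_i^{(j)} and EC_i^{(j)} are assumed names for the occurrence count and confidence of the i-th class of entity of information j.

```latex
IC_j = \frac{\sum_{i=1}^{N} EF_i^{(j)} \, EC_i^{(j)}}{\sum_{i=1}^{N} EF_i^{(j)}} \tag{4}
```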
Table 3 shows an example of the confidence calculation for a piece of information; the confidence of the information is calculated from the confidences of all of its entities.
Table 3. Example confidence calculation for information
Type | Name | Occurrences | Confidence
Information | XX warship development trend | 10 | 4.585
Entity | XX warship-1 | 15 | 4.5
Entity | XX warship-2 | 4 | 4.8
Entity | XX warship-3 | 1 | 5.0
Calculating the confidence of the information source: using the number of occurrences of each piece of information as the weight, the weighted average of the confidences of all information of the information source is calculated and taken as the confidence of the information source. The higher the confidence, the more reliable the data published by the information source. Specifically, writing the occurrence count as $n_i^j$ and the information confidence as $IC_i^j$, the confidence of information source j is given by formula (5):
$$SC_j = \frac{\sum_{i=1}^{N} n_i^j \, IC_i^j}{\sum_{i=1}^{N} n_i^j} \quad (5)$$
where $SC_j$ denotes the confidence of information source j, $n_i^j$ the total number of occurrences of the i-th class of information of information source j, $IC_i^j$ the confidence of the i-th class of information of information source j, and N the number of information classes of information source j.
Table 4 shows an example of the confidence calculation for an information source; the confidence of an information source is calculated from the confidences of all of its information.
Table 4. Example confidence calculation for an information source
Type | Name | Occurrences | Confidence
Information source | XX media | 10 | 4.53
Information | XX development trend | 8 | 4.5
Information | XX research status | 6 | 4.4
Information | XX technical research | 6 | 5.7
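The entity, information, and information-source levels all use the same frequency-weighted rule, so they can be chained. The following is a minimal sketch of such a roll-up, assuming a simple tree of nodes; the `Node` class and the sample source are hypothetical. The entity rows are taken from Table 3, while the second information item and the grouping under one source are invented for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Node:
    """One level of the hierarchy: entity, information, or information source."""
    name: str
    count: int                          # number of occurrences (the weight)
    confidence: Optional[float] = None  # given for leaves, computed otherwise
    children: List["Node"] = field(default_factory=list)

    def resolve(self) -> float:
        """Return this node's confidence, computing it from children if any."""
        if self.children:
            total = sum(child.count for child in self.children)
            self.confidence = sum(
                child.count * child.resolve() for child in self.children
            ) / total
        if self.confidence is None:
            raise ValueError(f"{self.name}: leaf node without a confidence")
        return self.confidence


# Entities of Table 3 (leaves with known confidences)
trend = Node("XX warship development trend", 10, children=[
    Node("XX warship-1", 15, 4.5),
    Node("XX warship-2", 4, 4.8),
    Node("XX warship-3", 1, 5.0),
])
# Hypothetical second item whose confidence is already known
status = Node("XX research status", 6, 4.4)
source = Node("hypothetical source", 16, children=[trend, status])

source_confidence = source.resolve()
print(round(trend.confidence, 3))  # 4.585, the information row of Table 3
```

Each parent is weighted by the occurrence counts of its children, so the counts stored on the parent itself only matter one level further up.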
When the service is provided externally, the confidence value can be marked at the corresponding position of each piece of information and each information source for the user's reference, so as to provide a more accurate information service for users in the national defense science and technology field. In addition, the same attribute of the same entity is marked with a hyperlink, through which the user can quickly view other reports on that attribute and grasp the information content comprehensively.
When a new data resource (open-source information) arrives, entities and their corresponding attributes are extracted by the method of the invention and compared with the existing entities and attributes, and the confidence of the open-source information and of the related information source is adjusted accordingly.
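The adjustment step can be pictured as updating occurrence counts and recomputing. The sketch below does this for the attribute values of Table 1. It assumes a value's confidence is its share of all reports scaled to 5. That rule is inferred here from the value rows of Table 1 (8 of 10 reports gives 4.0, 1 of 10 gives 0.5) and is not taken from the formulas in this section; `SCALE` and `value_confidences` are hypothetical names.

```python
from collections import Counter

SCALE = 5.0  # assumption: top of the confidence scale, inferred from Table 1


def value_confidences(counts):
    """Confidence of each reported value as its share of all reports, scaled."""
    total = sum(counts.values())
    return {value: SCALE * n / total for value, n in counts.items()}


# Reported values of the "Length" attribute, as in Table 1
length_reports = Counter({"500": 8, "480": 1, "530": 1})
before = value_confidences(length_reports)

# A newly collected report repeats the value 500
length_reports.update(["500"])
after = value_confidences(length_reports)

print(before["500"], before["480"])  # 4.0 0.5
print(round(after["500"], 2))        # 4.09
```

After the new report the majority value gains confidence and the minority values lose some, and the higher levels would then be recomputed from these updated values.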
Based on the confidence determination method provided by the invention, the invention also provides a system for determining the confidence of open-source information in the national defense science and technology field. As shown in Fig. 3, the system comprises:
An open-source information acquisition module 301, configured to obtain open-source information in the national defense science and technology field;
A named entity recognition and attribute extraction module 302, configured to identify, using a named entity recognition method based on conditional random fields, all named entities in the open-source information and the attribute information corresponding to the named entities, the attribute information comprising attributes and attribute values;
An entity unification and entity disambiguation module 303, configured to perform entity unification and entity disambiguation operations on the named entities and their corresponding attribute information to form corrected entities and the corrected attribute information corresponding to the corrected entities;
A confidence calculation module 304, configured to determine the confidence of the open-source information according to the corrected entities and the corrected attribute information corresponding to the corrected entities.
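The data flow between the four modules can be sketched as a chain of functions. This is only an illustration of the ordering described above: the type aliases, the `EntityRecord` shape, and the stub stages are all hypothetical, and the stubs stand in for the real recognition, correction, and scoring logic.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class EntityRecord:
    name: str
    attributes: Dict[str, List[str]]  # attribute name -> reported values


Acquire = Callable[[], List[str]]                             # module 301
Recognize = Callable[[str], List[EntityRecord]]               # module 302
Correct = Callable[[List[EntityRecord]], List[EntityRecord]]  # module 303
Score = Callable[[List[EntityRecord]], float]                 # module 304


def run_pipeline(acquire: Acquire, recognize: Recognize,
                 correct: Correct, score: Score) -> List[float]:
    """Run each acquired text through recognition, correction, and scoring."""
    return [score(correct(recognize(text))) for text in acquire()]


# Stub stages, for illustration only
def acquire_stub() -> List[str]:
    return ["XX warship length 500"]


def recognize_stub(text: str) -> List[EntityRecord]:
    return [EntityRecord("XX warship", {"length": ["500"]})]


def correct_stub(records: List[EntityRecord]) -> List[EntityRecord]:
    return records  # no unification or disambiguation performed


def score_stub(records: List[EntityRecord]) -> float:
    return float(len(records))  # placeholder, not a confidence formula


scores = run_pipeline(acquire_stub, recognize_stub, correct_stub, score_stub)
```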
Wherein the named entity recognition and attribute extraction module 302 specifically comprises:
A named entity recognition unit, configured to identify all named entities in the open-source information using the named entity recognition method based on conditional random fields;
An attribute extraction unit, configured to perform attribute extraction according to the context of the named entities to obtain the attribute information corresponding to the named entities.
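For readers unfamiliar with conditional random fields, the sketch below shows the decoding step of a linear-chain CRF (Viterbi search over label sequences). It assumes BIO-style labels and uses hand-set emission and transition scores purely for illustration; a trained CRF would learn these scores from annotated text, and nothing here is taken from the patent's own model.

```python
def viterbi(emissions, transitions, labels):
    """Best label sequence under per-token and label-transition scores.

    emissions: one dict per token mapping label -> score
    transitions: dict mapping (previous label, current label) -> score
    """
    best = [{lab: emissions[0].get(lab, 0.0) for lab in labels}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in labels:
            prev = max(labels,
                       key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0))
            scores[cur] = (best[-1][prev] + transitions.get((prev, cur), 0.0)
                           + emissions[t].get(cur, 0.0))
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final label
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["B-ENT", "I-ENT", "O"]
tokens = ["XX", "warship", "length", "500"]
# Hypothetical scores; an entity may not continue (I-ENT) right after O
emissions = [
    {"B-ENT": 2.0, "I-ENT": 0.5, "O": 0.1},
    {"B-ENT": 0.3, "I-ENT": 1.5, "O": 1.0},
    {"B-ENT": 0.1, "I-ENT": 0.1, "O": 2.0},
    {"B-ENT": 0.2, "I-ENT": 0.1, "O": 1.5},
]
transitions = {("B-ENT", "I-ENT"): 1.0, ("I-ENT", "I-ENT"): 0.5,
               ("O", "I-ENT"): -5.0}

tags = viterbi(emissions, transitions, labels)
print(list(zip(tokens, tags)))
```

With these scores the first two tokens are tagged as one entity and the remaining tokens as outside any entity; the attribute extraction unit would then look at that surrounding context.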
Wherein the entity unification and entity disambiguation module 303 specifically comprises:
A first entity feature vector calculation unit, configured to calculate, using a vector space model, the entity feature vectors formed by the words surrounding named entities with different names;
A first entity feature vector comparison unit, configured to compare, using cosine similarity, the entity feature vectors of named entities with different names, and to classify named entities whose entity feature vectors are similar but whose names differ as the same corrected named entity;
A first attribute feature vector calculation unit, configured to calculate, using a vector space model, the attribute feature vectors formed by the words surrounding the differently named attributes of the corrected named entity;
A first attribute feature vector comparison unit, configured to compare, using cosine similarity, the attribute feature vectors of attributes with different names, and to classify attributes whose attribute feature vectors are similar but whose names differ as the same corrected attribute; the corrected attributes and the attribute values corresponding to the corrected attributes constitute the corrected attribute information;
A second entity feature vector calculation unit, configured to calculate, using a vector space model, the entity feature vectors formed by the words surrounding multiple named entities with the same name;
A second entity feature vector comparison unit, configured to compare, using cosine similarity, the entity feature vectors of multiple named entities with the same name, and to classify named entities whose names are the same but whose entity feature vectors are dissimilar as different corrected named entities;
A second attribute feature vector calculation unit, configured to calculate, using a vector space model, the attribute feature vectors formed by the words surrounding multiple identically named attributes of the corrected named entity;
A second attribute feature vector comparison unit, configured to compare, using cosine similarity, the attribute feature vectors of multiple attributes with the same name, and to classify attributes whose names are the same but whose attribute feature vectors are dissimilar as different corrected attributes; the corrected attributes and the attribute values corresponding to the corrected attributes constitute the corrected attribute information.
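The comparison performed by these units can be sketched with a bag-of-words context vector and cosine similarity. This is a minimal illustration under stated assumptions: raw term counts stand in for the vector space model (the patent does not specify the weighting in this section), and the 0.5 threshold, function names, and sample contexts are hypothetical.

```python
import math
from collections import Counter

SIMILARITY_THRESHOLD = 0.5  # hypothetical cut-off, not given in the patent


def context_vector(context_words):
    """Term-count vector of the words surrounding a mention."""
    return Counter(context_words)


def cosine(a, b):
    """Cosine similarity of two sparse term-count vectors."""
    dot = sum(a[word] * b[word] for word in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def refers_to_same(context_a, context_b, threshold=SIMILARITY_THRESHOLD):
    """True if two mentions should be treated as the same corrected item."""
    return cosine(context_vector(context_a), context_vector(context_b)) >= threshold


# Unification: different names, similar surrounding words
ctx_full_name = ["destroyer", "displacement", "length", "navy", "missile"]
ctx_short_name = ["destroyer", "length", "navy", "missile", "radar"]
unify = refers_to_same(ctx_full_name, ctx_short_name)

# Disambiguation: same name, dissimilar surrounding words
ctx_ship = ["destroyer", "navy", "missile", "length"]
ctx_other = ["museum", "ticket", "tour", "exhibit"]
split = not refers_to_same(ctx_ship, ctx_other)

print(unify, split)  # True True
```

The same comparison applies to attributes: the first units merge differently named items with similar contexts, and the second units separate identically named items with dissimilar contexts.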
The method and system of the invention extract entities and their corresponding attributes from data resources by means of named entity recognition and attribute extraction; further distinguish and correct the entities and attributes by means of entity unification and entity disambiguation, improving the accuracy of entity and attribute extraction; and, according to the different reports on the same attribute of the same entity, determine the confidence of the information and of the corresponding information source, so that a more accurate information service can be provided to users in the national defense science and technology field when they query open-source information.
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be referred to one another. Because the system disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; for relevant details, refer to the description of the method.
Specific examples are used herein to explain the principle and implementation of the invention; the above description of the embodiments is intended only to help in understanding the method of the invention and its core idea. Meanwhile, those of ordinary skill in the art may, following the idea of the invention, make changes to the specific implementation and scope of application. In summary, the content of this specification should not be construed as limiting the invention.

Claims (8)

1. A method for determining the confidence of open-source information in the national defense science and technology field, characterized in that the method comprises:
Obtaining open-source information in the national defense science and technology field;
Identifying, using a named entity recognition method based on conditional random fields, all named entities in the open-source information and the attribute information corresponding to the named entities, the attribute information comprising attributes and attribute values;
Performing entity unification and entity disambiguation operations on the named entities and their corresponding attribute information to form corrected entities and the corrected attribute information corresponding to the corrected entities;
Determining the confidence of the open-source information according to the corrected entities and the corrected attribute information corresponding to the corrected entities.
2. The method for determining the confidence of open-source information in the national defense science and technology field according to claim 1, characterized in that identifying, using a named entity recognition method based on conditional random fields, all named entities in the open-source information and the attribute information corresponding to the named entities specifically comprises:
Identifying all named entities in the open-source information using the named entity recognition method based on conditional random fields;
Performing attribute extraction according to the context of the named entities to obtain the attribute information corresponding to the named entities.
3. The method for determining the confidence of open-source information in the national defense science and technology field according to claim 2, characterized in that performing the entity unification operation on the named entities and their corresponding attribute information to form corrected entities and the corrected attribute information corresponding to the corrected entities specifically comprises:
Calculating, using a vector space model, the entity feature vectors formed by the words surrounding named entities with different names;
Comparing, using cosine similarity, the entity feature vectors of named entities with different names, and classifying named entities whose entity feature vectors are similar but whose names differ as the same corrected named entity;
Calculating, using a vector space model, the attribute feature vectors formed by the words surrounding the differently named attributes of the corrected named entity;
Comparing, using cosine similarity, the attribute feature vectors of attributes with different names, and classifying attributes whose attribute feature vectors are similar but whose names differ as the same corrected attribute; the corrected attributes and the attribute values corresponding to the corrected attributes constitute the corrected attribute information.
4. The method for determining the confidence of open-source information in the national defense science and technology field according to claim 3, characterized in that performing the entity disambiguation operation on the named entities and their corresponding attribute information to form corrected entities and the corrected attribute information corresponding to the corrected entities further comprises:
Calculating, using a vector space model, the entity feature vectors formed by the words surrounding multiple named entities with the same name;
Comparing, using cosine similarity, the entity feature vectors of multiple named entities with the same name, and classifying named entities whose names are the same but whose entity feature vectors are dissimilar as different corrected named entities;
Calculating, using a vector space model, the attribute feature vectors formed by the words surrounding multiple identically named attributes of the corrected named entity;
Comparing, using cosine similarity, the attribute feature vectors of multiple attributes with the same name, and classifying attributes whose names are the same but whose attribute feature vectors are dissimilar as different corrected attributes; the corrected attributes and the attribute values corresponding to the corrected attributes constitute the corrected attribute information.
5. A system for determining the confidence of open-source information in the national defense science and technology field, characterized in that the system comprises:
An open-source information acquisition module, configured to obtain open-source information in the national defense science and technology field;
A named entity recognition and attribute extraction module, configured to identify, using a named entity recognition method based on conditional random fields, all named entities in the open-source information and the attribute information corresponding to the named entities, the attribute information comprising attributes and attribute values;
An entity unification and entity disambiguation module, configured to perform entity unification and entity disambiguation operations on the named entities and their corresponding attribute information to form corrected entities and the corrected attribute information corresponding to the corrected entities;
A confidence calculation module, configured to determine the confidence of the open-source information according to the corrected entities and the corrected attribute information corresponding to the corrected entities.
6. The system for determining the confidence of open-source information in the national defense science and technology field according to claim 5, characterized in that the named entity recognition and attribute extraction module specifically comprises:
A named entity recognition unit, configured to identify all named entities in the open-source information using the named entity recognition method based on conditional random fields;
An attribute extraction unit, configured to perform attribute extraction according to the context of the named entities to obtain the attribute information corresponding to the named entities.
7. The system for determining the confidence of open-source information in the national defense science and technology field according to claim 6, characterized in that the entity unification and entity disambiguation module specifically comprises:
A first entity feature vector calculation unit, configured to calculate, using a vector space model, the entity feature vectors formed by the words surrounding named entities with different names;
A first entity feature vector comparison unit, configured to compare, using cosine similarity, the entity feature vectors of named entities with different names, and to classify named entities whose entity feature vectors are similar but whose names differ as the same corrected named entity;
A first attribute feature vector calculation unit, configured to calculate, using a vector space model, the attribute feature vectors formed by the words surrounding the differently named attributes of the corrected named entity;
A first attribute feature vector comparison unit, configured to compare, using cosine similarity, the attribute feature vectors of attributes with different names, and to classify attributes whose attribute feature vectors are similar but whose names differ as the same corrected attribute; the corrected attributes and the attribute values corresponding to the corrected attributes constitute the corrected attribute information.
8. The system for determining the confidence of open-source information in the national defense science and technology field according to claim 7, characterized in that the entity unification and entity disambiguation module further comprises:
A second entity feature vector calculation unit, configured to calculate, using a vector space model, the entity feature vectors formed by the words surrounding multiple named entities with the same name;
A second entity feature vector comparison unit, configured to compare, using cosine similarity, the entity feature vectors of multiple named entities with the same name, and to classify named entities whose names are the same but whose entity feature vectors are dissimilar as different corrected named entities;
A second attribute feature vector calculation unit, configured to calculate, using a vector space model, the attribute feature vectors formed by the words surrounding multiple identically named attributes of the corrected named entity;
A second attribute feature vector comparison unit, configured to compare, using cosine similarity, the attribute feature vectors of multiple attributes with the same name, and to classify attributes whose names are the same but whose attribute feature vectors are dissimilar as different corrected attributes; the corrected attributes and the attribute values corresponding to the corrected attributes constitute the corrected attribute information.
CN201910572637.0A 2019-06-28 2019-06-28 Method and system for determining confidence of open source information in national defense science and technology field Active CN110287302B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910572637.0A CN110287302B (en) 2019-06-28 2019-06-28 Method and system for determining confidence of open source information in national defense science and technology field

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910572637.0A CN110287302B (en) 2019-06-28 2019-06-28 Method and system for determining confidence of open source information in national defense science and technology field

Publications (2)

Publication Number Publication Date
CN110287302A true CN110287302A (en) 2019-09-27
CN110287302B CN110287302B (en) 2021-03-30

Family

ID=68020006

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910572637.0A Active CN110287302B (en) 2019-06-28 2019-06-28 Method and system for determining confidence of open source information in national defense science and technology field

Country Status (1)

Country Link
CN (1) CN110287302B (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102298588A (en) * 2010-06-25 2011-12-28 株式会社理光 Method and device for extracting object from non-structured document
CN102495892A (en) * 2011-12-09 2012-06-13 北京大学 Webpage information extraction method
CN105989080A (en) * 2015-02-11 2016-10-05 富士通株式会社 Apparatus and method for determining entity attribute values
CN106998264A (en) * 2017-02-21 2017-08-01 中国科学院信息工程研究所 A kind of IP location database credibility evaluation methods based on dynamic trust model
CN109783651A (en) * 2019-01-29 2019-05-21 北京百度网讯科技有限公司 Extract method, apparatus, electronic equipment and the storage medium of entity relevant information
CN110580337A (en) * 2019-06-11 2019-12-17 福建奇点时空数字科技有限公司 professional entity disambiguation implementation method based on entity similarity calculation

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
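Li Jiguang (李继光) et al.: "Data Mining and Processing Analysis in the Context of Big Data" (《大数据背景下数据挖掘及处理分析》) *
Yang Yan (杨燕): "Research on Several Key Technologies of Intelligent Question Answering Systems for the E-Commerce Domain", China Doctoral Dissertations Full-text Database, Information Science and Technology (《中国博士学位论文全文数据库信息科技辑》) *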

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674304A (en) * 2019-10-09 2020-01-10 北京明略软件系统有限公司 Entity disambiguation method and device, readable storage medium and electronic equipment
CN111125438A (en) * 2019-12-25 2020-05-08 北京百度网讯科技有限公司 Entity information extraction method and device, electronic equipment and storage medium
CN111125438B (en) * 2019-12-25 2023-06-27 北京百度网讯科技有限公司 Entity information extraction method and device, electronic equipment and storage medium

Also Published As

Publication number Publication date
CN110287302B (en) 2021-03-30

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant