CN103678436A - Information processing system and information processing method - Google Patents

Information processing system and information processing method

Info

Publication number
CN103678436A
CN103678436A
Authority
CN
China
Prior art keywords
data
graph
label
teacher
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310322481.3A
Other languages
Chinese (zh)
Other versions
CN103678436B (en)
Inventor
柳濑利彦
今一修
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hitachi Ltd
Original Assignee
Hitachi Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hitachi Ltd filed Critical Hitachi Ltd
Publication of CN103678436A publication Critical patent/CN103678436A/en
Application granted granted Critical
Publication of CN103678436B publication Critical patent/CN103678436B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/904: Browsing; Visualisation therefor
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/90: Details of database functions independent of the retrieved data types
    • G06F 16/901: Indexing; Data structures therefor; Storage structures
    • G06F 16/9024: Graphs; Linked lists

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)
  • Image Analysis (AREA)

Abstract

The invention provides an information processing system and an information processing method that reduce the labor cost and equipment cost of machine learning on documents. In the information processing system, when feature types are input, feature vectors of the teacher data (numeric vectors representing the features related to each teacher datum) are generated based on the input feature types and the teacher data; a graph of the teacher data is generated based on these feature vectors; based on the graph of the teacher data, the feature type used to generate a first graph that is most suitable for propagating the labels of the teacher data is selected, and the first graph is output; based on the first graph and the unlabeled data, the unlabeled data to which the labels given in the teacher data should be propagated are selected, and a second graph is generated by including the selected unlabeled data in the first graph; and based on the second graph, the labels given in the teacher data are propagated to the selected unlabeled data.

Description

Information Processing System and Information Processing Method
Technical Field
The present invention relates to an information processing system.
Background Art
In recent years, many enterprises have been making flexible use of large amounts of electronic data known as big data. This is because, with the appearance of open-source software such as Apache Hadoop, techniques for distributed parallel computation on general-purpose PC servers have become widespread. As these techniques have spread, the cost of the computing resources needed to process large amounts of data in a short time has dropped significantly.
Data processing for big data includes aggregation of large amounts of numeric data and automatic extraction, by computer, of patterns useful to the user from electronic documents. Machine learning is used as a method for making a computer perform such knowledge work, which was originally done by humans. In machine learning, and in supervised learning in particular, data generated by humans are used as teacher data (labeled training data); the computer learns the patterns in the teacher data, so that human knowledge work can be performed by the computer.
Teacher data must be created by humans, so supervised learning by a computer incurs labor cost. In particular, when information is extracted from specialized documents, the teacher data must be created by an expert in that field (a domain expert), so the labor cost is especially large.
For example, to perform knowledge work such as information extraction from statute documents, a legal expert such as a lawyer or a paralegal must generate examples of the information to be extracted before the computer performs machine learning. Likewise, to perform knowledge work such as information extraction from documents related to intellectual property, a patent attorney or a corporate intellectual property officer must prepare examples of the information to be extracted.
In general, the more teacher data there is, the better the learning result. However, because generating teacher data incurs labor cost, it is difficult to prepare large amounts of teacher data. Today, when the large amounts of diverse data contained in big data are processed, the labor cost of generating teacher data has become a problem in applying supervised learning.
As one strategy for resolving the problem of the labor cost of generating teacher data, attempts have been made to make flexible use, in learning, of data without teacher information (labels), that is, unlabeled data. Machine learning that uses unlabeled data in learning in addition to teacher data is called semi-supervised learning (see, for example, Patent Documents 1 and 2).
Patent Documents 1 and 2 propose methods that use semi-supervised learning to extract documents containing harmful words from a group of documents.
Among semi-supervised approaches, graph-based semi-supervised learning, described in Non-Patent Document 1, has attracted particular attention from the standpoint of computational efficiency. Graph-based semi-supervised learning has been applied to, for example, sentiment analysis, word-sense disambiguation, and part-of-speech estimation.
Also, a method has been proposed that, starting from a few words extracted based on a certain viewpoint, extracts other words based on the same viewpoint (see, for example, Patent Document 3).
Furthermore, a method has been proposed that, for the problem of giving a degree-of-relevance label to documents with respect to a retrieval query, propagates the degree of relevance from documents that have been given labels to documents that have not (see, for example, Patent Document 4).
Here, a graph in machine learning means the following mathematical graph: for example, each piece of data (such as a word) is taken as a node, and the similarity between pieces of data, that is, between nodes, is quantified as the weight of the edge between those nodes. In such a graph, similar data are connected by edges with larger weights. Therefore, by propagating label information along the edge weights, labels can be assigned to unlabeled data.
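The propagation described above can be sketched as an iterative averaging over edge weights, in the style of the label propagation of Non-Patent Document 1. This is a minimal illustration, not the patent's implementation: the toy graph, the +1/-1 label encoding, and the fixed iteration count are all assumptions made for the example.

```python
import numpy as np

def propagate_labels(W, labels, n_iter=100):
    """Propagate labels over a weighted graph.

    W      : (n, n) symmetric edge-weight matrix (node similarities).
    labels : length-n array; +1 or -1 for teacher (labeled) nodes, 0 for unlabeled.
    Returns a score per node; its sign is the predicted label.
    """
    clamped = labels != 0                  # teacher nodes keep their labels
    f = labels.astype(float)
    deg = W.sum(axis=1)
    deg[deg == 0] = 1.0                    # avoid division by zero for isolated nodes
    for _ in range(n_iter):
        f = W @ f / deg                    # each node takes the weighted mean of its neighbors
        f[clamped] = labels[clamped]       # re-clamp the teacher labels
    return f

# Toy graph: nodes 0 and 3 are teacher data (+1, -1); nodes 1 and 2 are unlabeled.
W = np.array([[0, 2, 0, 0],
              [2, 0, 1, 0],
              [0, 1, 0, 2],
              [0, 0, 2, 0]], dtype=float)
labels = np.array([1, 0, 0, -1])
scores = propagate_labels(W, labels)
print(scores)  # node 1 leans positive, node 2 leans negative
```

Because node 1 is more strongly connected to the positive teacher node and node 2 to the negative one, the labels flow along the heavier edges, matching the intuition that similar data receive the same label.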
The propagation of label information is shown below for the example of processing that extracts person-name information from electronic documents. In this processing, the document is decomposed into tokens by morphological analysis, and judging whether each token is a name is handled as a binary classification problem.
In this name-extraction example, the computer takes each token to be identified as a node and computes the similarity between tokens as the weight of an edge. The similarity of tokens is computed from information about the token itself, such as its part of speech and the length of its character string, and from information shared with adjacent tokens. Specifically, this token information is converted into numeric vectors, and the similarity of tokens is obtained by computing the distance between the numeric vectors. A graph containing each token is thereby obtained.
When labels are propagated using a graph obtained in this way, tokens used in similar contexts are connected by edges with larger weights, so identical labels are easily assigned to them.
In graph-based semi-supervised learning, the method of constructing the graph has a large impact on learning accuracy. To date, edge pruning (deletion of unnecessary edges) has been performed with the aims of improving the accuracy of the constructed graph and of speeding up computation.
For example, a method has been proposed that approximates the original graph by a k-nearest-neighbor graph or a b-matching graph (see, for example, Non-Patent Document 2). Here, the k-nearest-neighbor graph and the b-matching graph are graphs containing only the top-k highest-similarity edges generated by the k-nearest-neighbor method or the b-matching method, respectively.
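A k-nearest-neighbor graph of this kind can be sketched as follows. The Gaussian similarity (with unit bandwidth) and the brute-force distance computation are illustrative assumptions; the cited b-matching variant, which enforces an exact degree for every node, is not shown.

```python
import numpy as np

def knn_graph(X, k):
    """Build a k-nearest-neighbor graph from feature vectors.

    X : (n, d) array of feature vectors, one row per node.
    Returns an (n, n) weight matrix keeping, for each node, only the
    edges to its k most similar other nodes.
    """
    n = X.shape[0]
    # pairwise squared Euclidean distances between all nodes
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
    sim = np.exp(-d2)                     # similarity: larger = more alike
    np.fill_diagonal(sim, 0.0)            # no self-edges
    W = np.zeros_like(sim)
    for i in range(n):
        nbrs = np.argsort(sim[i])[-k:]    # indices of the top-k similarities
        W[i, nbrs] = sim[i, nbrs]         # prune every other edge
    return np.maximum(W, W.T)             # symmetrize: keep an edge if either endpoint kept it

# Two tight clusters; with k = 1 each node keeps only its nearest neighbor.
X = np.array([[0.0], [0.1], [5.0], [5.1]])
W = knn_graph(X, k=1)
print((W > 0).sum())  # 4 nonzero entries = 2 undirected edges, one per cluster
```

Pruning the weak cross-cluster edges is exactly what keeps label propagation from leaking labels between dissimilar regions of the data.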
Furthermore, an edge generation method has been proposed that, without performing edge pruning, avoids generating hub nodes where edges concentrate (see, for example, Non-Patent Document 3).
In these documents, the features (attributes) used to convert the nodes into numeric vectors must be determined in advance in order to generate the graph. These features must be determined by a person who is both a domain expert and familiar with machine learning processing.
Also, when the performance of machine learning is evaluated, the experimental results may be verified again, so commonly used, previously published teacher data and unlabeled data have often been employed. However, when a user actually processes the documents to be processed, the amount of unlabeled data usually becomes enormous, so in order to learn within a practical time, useful unlabeled data must be selected from among the unlabeled data.
Prior Art Documents
Patent Documents
Patent Document 1: Japanese Unexamined Patent Application Publication No. 2011-039576
Patent Document 2: Japanese Unexamined Patent Application Publication No. 2011-039575
Patent Document 3: Japanese Unexamined Patent Application Publication No. 2010-257406
Patent Document 4: Japanese Unexamined Patent Application Publication (Translation of PCT Application) No. 2009-528628
Non-Patent Documents
Non-Patent Document 1: Learning from Labeled and Unlabeled Data with Label Propagation, Technical Report CMU-CALD-02-107, 2002
Non-Patent Document 2: Graph sparsification for semi-supervised word-sense disambiguation (in Japanese), IPSJ SIG Technical Report, 2010
Non-Patent Document 3: Semi-supervised word-sense disambiguation using a hub-free graph construction method (in Japanese), IPSJ SIG Technical Report, 2010
Non-Patent Document 4: Efficient Graph-Based Semi-Supervised Learning of Structured Tagging Models, Proceedings of the 2010 Conference on Empirical Methods in Natural Language Processing, pp. 167-176, 2010
Summary of the Invention
Problems to Be Solved by the Invention
In graph-based semi-supervised learning, obtaining an optimal graph structure requires a person with expert knowledge of both the object domain (the technical field to which the content of the documents being processed belongs) and machine learning, so the labor cost is large.
As a method of optimizing the graph structure with the optimization of features as its object, one can consider having the final output obtained after the machine learning processing evaluated using the expert knowledge of the object domain. However, when this method is used, evaluation by a domain expert is still needed, requiring even more labor cost. Specifically, this is because, in order to evaluate the graph structure by means of the machine learning processing, the domain expert must generate teacher data for evaluation by manual work, and the labor cost of doing so is large.
Furthermore, optimizing the graph structure requires executing the machine learning processing a number of times that increases in proportion to the number of candidate patterns of the graph structure. When machine learning is repeated many times, a large amount of computation time is needed, and enormous equipment cost is required.
In this way, problems arise such as an increase in labor cost and in the equipment cost of the computer.
An object of the present invention is to provide a system that performs suitable machine learning on documents while reducing the labor cost and the equipment cost of the computer.
Means for Solving the Problems
A representative example of the present invention is as follows. An information processing system performs machine learning on a plurality of document data, wherein the information processing system has: an initialization unit that obtains a plurality of document data to which a label has been given (a plurality of teacher data), document data to which the label has not been given (unlabeled data), and a plurality of feature types each representing a method of extracting a feature related to each of the document data; a feature vector generation unit that, when at least one of the obtained feature types is input, generates, from the input feature type and each of the obtained teacher data, a feature vector of each teacher datum, the feature vector being a numeric vector representing the features related to that teacher datum; a graph construction unit that generates a graph of the teacher data from the feature vectors of the teacher data generated by the feature vector generation unit; a feature selection unit that, based on the graph of the teacher data generated by the graph construction unit, selects, from among the feature types obtained by the initialization unit, the feature type used to generate a first graph that is most suitable for propagating the labels of the teacher data, and further outputs the first graph generated by the graph construction unit; a data selection unit that, based on the first graph and the unlabeled data, selects the unlabeled data to which the labels given in the teacher data should be propagated, and further generates a second graph by including the selected unlabeled data in the first graph; and a machine learning unit that, through the second graph, propagates the labels given in the teacher data to the selected unlabeled data.
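The units named above can be pictured as a small data-flow pipeline. The following sketch is purely illustrative: every function name, the length-based feature, the distance-weighted graph, and the nearest-teacher propagation rule are assumptions standing in for the patent's actual units, which are defined only functionally here.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Datum:
    text: str
    label: Optional[str] = None        # None means unlabeled data

def feature_vector(feature_type, datum):
    # Illustrative stand-in for the feature vector generation unit:
    # one feature type, here just the text length.
    return [float(len(datum.text))] if feature_type == "length" else [0.0]

def build_graph(vectors):
    # Illustrative graph construction unit: complete graph weighted
    # by negative distance (closer vectors = heavier edges).
    n = len(vectors)
    return {(i, j): -abs(vectors[i][0] - vectors[j][0])
            for i in range(n) for j in range(n) if i < j}

def run_pipeline(teacher, unlabeled, feature_types):
    # 1. Feature selection over the teacher data (here trivially the first type).
    f = feature_types[0]
    g1 = build_graph([feature_vector(f, d) for d in teacher])        # "first graph"
    # 2. Data selection: include the unlabeled data to form the "second graph".
    g2 = build_graph([feature_vector(f, d) for d in teacher + unlabeled])
    # 3. Machine learning unit: propagate each unlabeled node's label
    #    from its nearest teacher node.
    for u in unlabeled:
        best = max(teacher, key=lambda t: -abs(len(t.text) - len(u.text)))
        u.label = best.label
    return g1, g2, unlabeled

teacher = [Datum("ab", "short"), Datum("abcdefgh", "long")]
unlabeled = [Datum("xyz"), Datum("abcdefg")]
_, _, out = run_pipeline(teacher, unlabeled, ["length"])
print([d.label for d in out])  # ['short', 'long']
```

The point of the sketch is the order of operations the claim fixes: feature selection is done on teacher data alone, unlabeled data enters only in the second graph, and propagation runs only over that second graph.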
Effects of the Invention
According to an embodiment of the present invention, the labor cost and the equipment cost in machine learning can be reduced.
Brief Description of the Drawings
Fig. 1 is a block diagram showing the physical configuration of the information extraction system of Embodiment 1.
Fig. 2 is a block diagram showing the logical configuration of the information extraction system of Embodiment 1.
Fig. 3A is an explanatory diagram showing the document database of Embodiment 1.
Fig. 3B is an explanatory diagram showing the label database of Embodiment 1.
Fig. 3C is an explanatory diagram showing the feature type database of Embodiment 1.
Fig. 4 is a functional block diagram showing machine learning in Embodiment 1 when the optimization of feature types and the selection of unlabeled data are not performed.
Fig. 5 is a functional block diagram showing an overview of the data flow up to the machine learning of documents performed by the information extraction system of Embodiment 1.
Fig. 6A is an explanatory diagram showing the teacher data list L of Embodiment 1.
Fig. 6B is an explanatory diagram showing the unlabeled data list U of Embodiment 1.
Fig. 7 is a flowchart showing the flow of processing performed by the feature selection unit of Embodiment 1.
Fig. 8A is an explanatory diagram showing the feature vectors of the teacher data of Embodiment 1.
Fig. 8B is an explanatory diagram showing the feature vectors of the unlabeled data of Embodiment 1.
Fig. 9A is an explanatory diagram showing the evaluation value of a graph of Embodiment 1 computed using only the different-label connection score.
Fig. 9B is an explanatory diagram showing the evaluation value of a graph of Embodiment 1 computed using the same-label connection score together with the different-label connection score.
Fig. 10 is a flowchart showing the processing of the data selection unit of Embodiment 1.
Fig. 11A is an explanatory diagram showing the graph g2 and unlabeled data of Embodiment 1.
Fig. 11B is an explanatory diagram showing the unlabeled data extracted in Embodiment 1 when the data with the maximum distance are extracted.
Fig. 11C is an explanatory diagram showing the non-dispersed unlabeled data of Embodiment 1.
Fig. 12 is a functional block diagram showing an overview of the data flow up to the machine learning of documents performed by the information extraction system of Embodiment 5.
Fig. 13 is a flowchart showing the processing of the feature selection unit of Embodiment 5 when the evaluation of the machine learning is low.
Reference Signs List
110: processor; 120: memory; 130: local file system; 140: input device; 150: output device; 160: network device; 170: bus; 200: information extraction computer; 210: Local Area Network (LAN); 220: document database; 225: label database; 230: feature type database; 290: label generation computer.
Description of Embodiments
In the following embodiments, when the number of elements and the like are mentioned, the number is not limited to the specific number, except where specifically stated or where the number is clearly limited in principle; it may be more than or less than the specific number.
Furthermore, in the following embodiments, the constituent elements are not necessarily essential, except where specifically stated or where they are clearly essential in principle. Similarly, when the shapes and positional relationships of the constituent elements are mentioned, shapes that are substantially approximate or similar are included, except where expressly stated otherwise or where this is clearly not the case in principle. The same applies to the numerical values and ranges above.
[Embodiment 1]
Fig. 1 is a block diagram showing the physical configuration of a computer 100 included in the information extraction system of Embodiment 1.
The computer 100 included in the information extraction system of this embodiment is a general-purpose computer as shown in Fig. 1. It may also be, for example, a PC server.
The computer 100 has a processor 110, a memory 120, a local file system 130, an input device 140, an output device 150, a network device 160, and a bus 170. The processor 110, the memory 120, the local file system 130, the input device 140, the output device 150, and the network device 160 are connected by the bus 170.
The processor 110 is, for example, a central processing unit (CPU), and may have a plurality of processor cores. The memory 120 is a storage device for storing programs and data.
The input device 140 is a device such as a keyboard or a mouse, for receiving data input by the user. The output device 150 is a device such as a display or a printer, for outputting information to the user. When the computer 100 is operated remotely via a network, the computer 100 need not have the input device 140 and the output device 150.
The local file system 130 is a rewritable storage device accessible from the computer 100. The local file system 130 may be a storage device built into the computer 100, or a storage device placed outside the computer 100 and connected to it. The local file system 130 is, for example, a hard disk drive, a solid-state drive, or a RAM disk.
The network device 160 is a device for connecting the computer 100 to a network.
Fig. 2 is a block diagram showing the logical configuration of each computer included in the information extraction system of Embodiment 1.
The information extraction system of this embodiment has an information extraction computer 200 and a label generation computer 290. The information extraction computer 200 and the label generation computer 290 each have the physical configuration of the computer 100 shown in Fig. 1.
The information extraction system of this embodiment also has a document database 220, a label database 225, a feature type database 230, and a Local Area Network (LAN) 210. The computers and the databases are connected by the LAN 210.
As processing units, the information extraction computer 200 has an initialization unit 235, a feature vector generation unit 237, a feature selection unit 240, a data selection unit 255, a graph construction unit 270, a multi-objective optimization unit 275, and a machine learning unit 280.
The initialization unit 235 is a processing unit that converts data such as documents into data for performing machine learning. The feature vector generation unit 237 is a processing unit that generates feature vectors.
The feature selection unit 240 is a processing unit that performs feature optimization. The feature selection unit 240 has a feature evaluation unit 245 and a feature selection convergence determination unit 250.
The data selection unit 255 is a processing unit that selects the unlabeled data to which labels are propagated from the teacher data. The data selection unit 255 has a data evaluation unit 260 and a data selection convergence determination unit 265. The graph construction unit 270 is a processing unit that generates a graph by obtaining nodes and edges. The multi-objective optimization unit 275 is a processing unit that selects the solution candidate yielding the optimal evaluation value when the evaluation value varies according to a plurality of objectives. The machine learning unit 280 is a processing unit that performs machine learning.
Each processing unit of the information extraction computer 200 may be realized by a program, or by a physical device that implements the corresponding function. In the following, it is assumed that each processing unit of the information extraction computer 200 is realized by a program: the processor 110 reads the program corresponding to each processing unit into the memory 120 and realizes the function of that processing unit.
The functions of a plurality of processing units of the information extraction computer 200 may also be realized by a single processing unit, and a plurality of processes included in one processing unit shown in Fig. 2 may be realized by a plurality of processing units.
The label generation computer 290 has a label generation unit 295. The label generation unit 295 generates the data to be stored in the label database 225 according to the user's instructions, and stores the generated data in the label database 225. The label generation unit 295 also deletes data from the label database 225 according to the user's instructions.
Therefore, when the data of a predetermined label database 225 is used, the information extraction system of this embodiment may omit the label generation computer 290.
The document database 220 is a database for storing the data of the documents that are the object of machine learning in this embodiment. The label database 225 is a database for storing the teacher data. The feature type database 230 is a database for storing data representing the types of features used to generate the graph.
The information extraction computer 200 may incorporate the document database 220, the label database 225, the feature type database 230, and the label generation unit 295. When all the databases and the label generation unit 295 are incorporated in the information extraction computer 200, the information extraction system may omit the LAN 210.
The databases of the information extraction system shown in Fig. 2 can be realized with any data storage mechanism. In the simplest case, a database of the information extraction system can be realized as a text file in which one line describes one record. The databases may also be realized by a database management system (DBMS) such as a relational database or a key-value store.
Furthermore, in order to achieve high speed and short response times, the network (LAN 210 in Fig. 2) connecting the information extraction computer 200, the label generation computer 290, the document database 220, the label database 225, and the feature type database 230 may be placed in a single data center.
Alternatively, the computers of the information extraction system and the databases may each be placed in different data centers.
The startup procedure of the information extraction system of this embodiment is as follows. The user turns on the power of the information extraction computer 200 and starts the OS (operating system) of the information extraction computer 200. The user then turns on the power of the document database 220, the label database 225, the feature type database 230, and the label generation computer 290. The user further turns on the power of the LAN 210, so that the information extraction computer 200, the document database 220, the label database 225, the feature type database 230, and the label generation computer 290 can communicate with one another over the LAN 210. Thereafter, the computers and the databases of the information extraction system communicate using, for example, IP addresses and host names.
Fig. 3 A is the key diagram that the bibliographic data base 220 of the present embodiment 1 is shown.
Bibliographic data base 220 is databases of information of the document of the storage object that carries out machine learning as the information extracting system of the present embodiment.
Bibliographic data base 220 keeps document ID2201 and text 2202.The identifier that document ID2201 comprises unique expression document, for distinguishing the object of each document.Text 2202 represents the character string comprising in the document shown in document ID2201.
Fig. 3 B is the key diagram that the tag database 225 of the present embodiment 1 is shown.
Tag database 225 means the database of label definite in each document.Tag database 225 comprises label ID2251, document ID2252 and label 2253.
The identifier that label ID2251 comprises only table indicating label.Document ID2252 has represented to give the document of the label shown in label ID2251, is equivalent to the identifier of the document ID2201 of bibliographic data base 220.
The data that label 2253 represents to have given label appear at which position of document.For example, record 2254 is illustrated in the document of document ID2252 " 1 ", and the beginning text point of having given the node of " 1 " such label is " 10 ", and finishing text point is " 14 ".And record 2254 represents the label in the document of document ID2252 " 1 " " 1 " distributing labels ID2251 " 1 ".
In addition, for example, in the situation that to have given the data of label be each mark, tag database 225 also can keep representing starting position and end position etc., data based on giving the object of label by marker number.
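A label record of this kind can be illustrated as follows. The field names, the sample document text, and the assumption that the end position is inclusive are all illustrative; the patent only specifies that the record carries a label ID, a document ID, and start/end character positions.

```python
# One record of the label database, in the spirit of Fig. 3B (field names assumed):
record = {"label_id": 1, "document_id": 1, "start": 10, "end": 14, "label": "1"}

# text 2202 of document ID 1 (made-up content; positions 10..14 hold the labeled span):
document_text = "0123456789PERSONNAME..."

# Recover the labeled span from the character positions, treating "end" as inclusive:
span = document_text[record["start"]:record["end"] + 1]
print(span)  # the five characters at positions 10 through 14
```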
Fig. 3 C is the key diagram that the characteristic type database 230 of the present embodiment 1 is shown.
Characteristic type database 230 means the database of the pattern of the feature obtaining for node.Characteristic type database 230 comprises characteristic ID 2301 and feature name 2302.Characteristic ID 2301 is identifiers of unique expression feature mode.
Feature name 2302 means the character string of feature mode.Feature name 2302 represents data in literature numerical value to turn to the method that eigenvector is used.
For example, the character string of character string itself that feature name 2302 " token_surface_0 " expression of the characteristic ID shown in Fig. 3 C 2301 " 1 " obtains node is as feature.And feature name 2302 " token_surface_1 " expression of the characteristic ID 2301 " 2 " shown in Fig. 3 C obtains the rear character string of character string of object as feature.
In characteristic type database 230, the characteristic type of storage is the predetermined characteristic type of user.
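Feature names of this shape can be interpreted mechanically. The sketch below reads the numeric suffix as a token offset, which is an assumption consistent with the two examples given for Fig. 3C (0 = the token itself, 1 = the following token); the function name and error handling are likewise illustrative.

```python
def extract_feature(feature_name, tokens, i):
    """Interpret a feature name such as 'token_surface_0' or 'token_surface_1'.

    tokens : list of token strings for one document.
    i      : index of the object token.
    """
    kind, offset = feature_name.rsplit("_", 1)   # split off the trailing offset
    j = i + int(offset)
    if not 0 <= j < len(tokens):
        return None                              # offset falls outside the document
    if kind == "token_surface":
        return tokens[j]                         # surface form: the string itself
    raise ValueError(f"unknown feature type: {feature_name}")

tokens = ["Taro", "visited", "Kyoto"]
print(extract_feature("token_surface_0", tokens, 1))  # visited
print(extract_feature("token_surface_1", tokens, 1))  # Kyoto
```

Storing feature types as such strings lets the feature selection unit enumerate and switch feature patterns without code changes, which is what makes the later feature optimization loop possible.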
Fig. 4 illustrates the optimization of not carrying out characteristic type of the present embodiment 1 and without the functional block diagram of the machine learning in the situation of the selection of label data.
Fig. 4 illustrates the data stream of the functional block input and output in the processing of embodiment 1 that are equivalent to the handling part shown in Fig. 2.
First, label generate label generating unit 295 with computing machine 290 by the tag storage of user's appointment in tag database 225.In addition, in characteristic type database 230, store the preassigned characteristic type of user.
Initialization section 235 obtains characteristic type f arbitrarily from characteristic type database 230, according to tag database 225 and bibliographic data base 220, generates teacher's data list.And initialization section 235 generates without label data list according to bibliographic data base 220.Initialization section 235 is constructed portion 270 by comprising characteristic type f, teacher's data list and outputing to chart without the data 30 of label data list.
The graph construction unit 270 generates a graph from the feature type f, the teacher data list, and the unlabeled data list. When generating the graph, the graph construction unit 270 causes the feature vector generation unit 237 to generate the feature vectors of the teacher data and of the unlabeled data from the teacher data list and the unlabeled data list.
A feature vector is a vector of values that, according to the feature type f, represents information about each piece of data and the data before and after it, so that the data contained in each document is expressed quantitatively.
An example of the feature vector generation processing performed by the feature vector generation unit 237 and of the graph generation processing in the graph construction unit 270 is shown below. In the following example, the feature vector generation unit 237 divides the data contained in a document into tokens, each representing a word, and generates feature vectors with each token as a node.
As a concrete method of vectorizing the information of the tokens contained in a document, the feature vector generation unit 237 can use a correspondence table between the information and the dimensions of the vector. For example, as a correspondence table between part-of-speech names and vector dimensions, the feature vector generation unit 237 holds in advance a table such as "noun: 1, verb: 2, particle: 3, ...", and vectorizes the part of speech of each token according to this correspondence table.
Specifically, in this example, when the part of speech of a token is a noun, the feature vector generation unit 237 generates the vector (1, 0, 0, ...), and when the part of speech of a token is a particle, the feature vector generation unit 237 generates the vector (0, 0, 1, ...). That is, the feature vector generation unit 237 assigns "1" to the element of the correspondence table that matches the token, and "0" to the elements that do not match.
By the same procedure, the feature vector generation unit 237 can generate vectors from the surface form and base form of a token, its inflection form and inflection type, matches against dictionary entries, and the like.
Furthermore, by the same procedure, the feature vector generation unit 237 can use the information of the tokens adjacent to the target token of vector generation. Specifically, when the feature type f indicates that the token immediately preceding the target token is to be used as a feature, the feature vector generation unit 237 vectorizes the part-of-speech information of the token preceding the target token, and then appends that vector to the vector of the target token, thereby generating the vector of the target token.
In addition to the method of setting an element of the vector to "1" when it matches the correspondence table between the information and the vector dimensions, the feature vector generation unit 237 can also use, as values representing information shared by two adjacent tokens, values such as the pointwise mutual information or the number of dictionary matches across the whole document.
When all vectors for the target token of feature vector generation have been generated, the feature vector generation unit 237 combines the generated vectors in a predetermined order to generate one feature vector representing the token. Here, combining vectors means generating a vector whose elements are all the elements of each vector; for example, the combination x of a vector v(v1, v2, v3) and a vector w(w1, w2) is (v1, v2, v3, w1, w2).
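As a rough illustration, the per-token vectorization and concatenation described above can be sketched as follows; the part-of-speech table and the token representation here are assumptions for illustration, not the actual table of the embodiment.

```python
# Sketch of one-hot vectorization of a token's part of speech and
# concatenation with the previous token's vector (illustrative assumptions).
POS_INDEX = {"noun": 0, "verb": 1, "particle": 2}  # "noun: 1, verb: 2, particle: 3, ..."

def vectorize_pos(pos):
    """One-hot vector: 1 for the matching table entry, 0 for the others."""
    v = [0.0] * len(POS_INDEX)
    if pos in POS_INDEX:
        v[POS_INDEX[pos]] = 1.0
    return v

def token_vector(tokens, i, use_previous=True):
    """Concatenate the target token's vector with its previous token's vector."""
    v = vectorize_pos(tokens[i])
    if use_previous:
        prev = vectorize_pos(tokens[i - 1]) if i > 0 else [0.0] * len(POS_INDEX)
        v = v + prev  # combination: (v1, v2, v3, w1, w2, w3)
    return v

tokens = ["noun", "particle", "verb"]
print(token_vector(tokens, 2))  # verb one-hot followed by particle one-hot
```

The concatenation in `token_vector` corresponds to the combination x of vectors v and w described above.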
Next, the graph construction unit 270 calculates the similarity between tokens, for example as the distance between the vectors of two tokens. Distances between vectors include the Euclidean distance, the cosine distance, and so on, and the suitable distance differs for each task and data set.
The graph construction unit 270 determines, for the edge between each pair of tokens, a weight based on the calculated distance. For example, the graph construction unit 270 can determine a larger weight for an edge between tokens whose calculated distance is smaller. By determining the weights of the edges between tokens, the graph construction unit 270 generates the graph g. In this embodiment, when nodes are connected by an edge whose determined weight is at least a threshold specified by the user, the nodes are described as connected.
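A minimal sketch of this weighting step, assuming a Gaussian kernel over the Euclidean distance as the weight function (one common choice; the embodiment leaves the distance and the weighting open):

```python
import math

def gaussian_weight(x, y, sigma=1.0):
    """Weight grows as the Euclidean distance between vectors shrinks."""
    d2 = sum((a - b) ** 2 for a, b in zip(x, y))  # squared Euclidean distance
    return math.exp(-d2 / (2 * sigma ** 2))

def build_graph(vectors, threshold=0.5):
    """Keep only edges whose weight reaches the user-specified threshold."""
    edges = {}
    n = len(vectors)
    for i in range(n):
        for j in range(i + 1, n):
            w = gaussian_weight(vectors[i], vectors[j])
            if w >= threshold:  # the nodes are then described as "connected"
                edges[(i, j)] = w
    return edges

vectors = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0]]
print(build_graph(vectors))  # only the nearby pair is connected
```

The threshold corresponds to the user-specified weight above which nodes are described as connected.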
The graph construction unit 270 inputs data 31 containing the graph g to the machine learning unit 280. When the data 31 containing the graph g is input, the machine learning unit 280 uses the graph g to propagate the labels of the teacher data to the unlabeled data connected by edges. The machine learning unit 280 then outputs the result of the label propagation as the final output 32.
Here, the form of the final output 32 differs depending on the algorithm of the machine learning unit 280. For example, when the algorithm is a CRF, the final output is the model parameters of the CRF, and in the case of a label propagation algorithm, the labels given to the unlabeled data are the final output 32.
The machine learning algorithm of the machine learning unit 280 of this embodiment is briefly described below.
A typical example of machine learning using a graph is the label propagation method proposed in non-patent literature 1. In the algorithm of the label propagation method described in non-patent literature 1, the machine learning unit 280 first arranges the N teacher data and the M unlabeled data into a one-dimensional array D.
Each teacher data and each unlabeled data corresponds to one of K labels. The machine learning unit 280 arranges the labels corresponding to the teacher data and the unlabeled data into a one-dimensional array E.
Next, the machine learning unit 280 calculates a probability transition matrix T. The (i, j) element of the matrix T is the similarity between the i-th data of the array D and the j-th data of the array E. The machine learning unit 280 then calculates a matrix Y. The (i, j) element of the matrix Y is the probability that the i-th data of the array D takes the j-th label of the array E.
After calculating the matrix T and the matrix Y, the machine learning unit 280 repeats the following three steps, step A1 to step A3, until the matrix Y converges.
(Step A1) Calculate the product of the matrix T and the matrix Y, and define it as the new Y.
(Step A2) Normalize the rows of the new matrix Y.
(Step A3) Overwrite the elements of the normalized matrix Y that correspond to the teacher data with the label information.
In the algorithm of the label propagation method described above, as the result of the machine learning, the labels given to the unlabeled data, or probability values representing the likelihood of the labels that may be given to the unlabeled data, are output.
Regarding the procedure of the label propagation method, many variations other than the procedure described in non-patent literature 1 also exist.
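The iteration of steps A1 to A3 can be sketched as below. For simplicity, a square row-normalized similarity matrix over all data points is assumed, with the rows of the teacher data clamped each round; the exact construction of T differs among the variations mentioned above.

```python
import numpy as np

def propagate_labels(T, Y0, labeled_rows, n_iter=100):
    """Iterate steps A1-A3: Y <- T @ Y, row-normalize, clamp teacher rows."""
    Y = Y0.copy()
    clamp = Y0[labeled_rows]          # label information of the teacher data
    for _ in range(n_iter):
        Y = T @ Y                     # step A1: product of T and Y
        Y /= Y.sum(axis=1, keepdims=True)  # step A2: row normalization
        Y[labeled_rows] = clamp       # step A3: overwrite teacher rows
    return Y

# Toy example: 3 nodes, node 0 labeled with class 0, node 2 with class 1.
T = np.array([[0.6, 0.4, 0.0],
              [0.3, 0.4, 0.3],
              [0.0, 0.4, 0.6]])
Y0 = np.array([[1.0, 0.0],
               [0.5, 0.5],
               [0.0, 1.0]])
Y = propagate_labels(T, Y0, [0, 2])
print(Y[1])  # node 1 ends up split evenly between the two labels
```

The output of the iteration gives, for each unlabeled node, probability values over the K labels, as described above.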
There are also algorithms that use a graph as auxiliary information for supervised learning in order to perform label propagation. For example, as in non-patent literature 4, there is an algorithm that uses unlabeled data in the learning of a conditional random field (CRF) and therefore adopts a graph structure.
In this case, the machine learning unit 280 gives pseudo-labels to the unlabeled data and learns the CRF again. The machine learning unit 280 then determines the pseudo-labels from the score of the previously learned CRF and the score determined by propagating labels on the graph.
In the case of this algorithm, the machine learning unit 280 obtains, as the learning result, the same CRF model parameters as an ordinary CRF. Therefore, when an arbitrary document is given afterwards, the machine learning unit 280 can perform recognition at high speed using the Viterbi algorithm or the like, just as with an ordinary CRF. Although this algorithm has characteristics different from the label propagation method of non-patent literature 1, it is the same in that label information is propagated when giving the pseudo-labels, so the present invention can be applied in the same way as with the label propagation method of non-patent literature 1.
In addition, the machine learning unit 280 of this embodiment described below can perform label propagation by any variation of the label propagation method as long as the graph g is input.
The user (a domain expert) evaluates the final output 32 and, when the evaluation result is poor, adds labels using the label generation unit 295. Also, when the evaluation result is poor, the domain expert newly determines a feature type f' and inputs the feature type f' to the initialization unit 235 as the feature type f.
Here, according to the processing shown in Fig. 4, in order to select the optimum feature type f, the information extraction system of this embodiment needs to make the machine learning unit 280 execute the label propagation processing repeatedly.
Furthermore, the graph g includes all the data contained in the document database 220. Therefore, when the amount of data contained in the document database 220 is large, the processing of calculating the distances between data may strain the resources of the information extraction computer 200.
Therefore, in the processing of embodiment 1 described below, the information extraction system of this embodiment performs optimization of the feature types by the feature selection unit 240 before the processing by the machine learning unit 280. In addition, the information extraction system of this embodiment appropriately selects, by the data selection unit 255, the data (unlabeled data) to be included in the graph input to the machine learning unit 280.
Fig. 5 is a functional block diagram showing an outline of the data flow before the information extraction system of embodiment 1 performs machine learning on the documents.
Fig. 5 shows the data flow input to and output from the functional blocks, corresponding to the processing units shown in Fig. 2, in the processing of embodiment 1.
First, as with the label generation unit 295 shown in Fig. 4, the label generation unit 295 of the label generation computer 290 stores the labels specified by the user in the tag database 225.
Next, the initialization unit 235 of the information extraction computer 200 performs initialization processing using the data stored in the document database 220, the tag database 225, and the feature type database 230. Specifically, as initialization processing, the initialization unit 235 generates a teacher data list L601 and an unlabeled data list U602 from the document database 220 and the tag database 225. Also, as initialization processing, the initialization unit 235 extracts all the feature types from the feature type database 230 and generates a feature type set F containing the extracted feature types.
The feature type set F, the unlabeled data list U602, and the teacher data list L601 may also be specified by the user.
Fig. 6A is an explanatory diagram showing the teacher data list L601 of embodiment 1.
The teacher data list L601 is a list of the documents containing teacher data. The initialization unit 235 extracts the label ID 2251 and the document ID 2252 from the tag database 225 and includes the extracted data in the teacher data list L601.
The teacher data list L601 has a label ID 6011 and a document ID 6012. The label ID 6011 corresponds to the label ID 2251, and the document ID 6012 corresponds to the document ID 2252.
Fig. 6B is an explanatory diagram showing the unlabeled data list U602 of embodiment 1.
The unlabeled data list U602 is a list of the documents that do not contain teacher data. The initialization unit 235 extracts, from the identifiers in the document ID 2201 of the document database 220, the identifiers other than those in the document ID 2252 of the tag database 225, and includes the extracted identifiers in the unlabeled data list U602.
The unlabeled data list U602 includes an ID 6021 and a document ID 6022. The ID 6021 stores the sequence number, within the unlabeled data list U602, of a document containing unlabeled data. The document ID 6022 includes the identifier of the document containing the unlabeled data.
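A sketch of this initialization step, under the assumption that the tag database rows are (label ID, document ID) pairs and that the document database yields the document identifiers:

```python
def make_lists(doc_ids, tag_rows):
    """Build teacher data list L601 and unlabeled data list U602 (hypothetical shapes)."""
    teacher_list = [(label_id, doc_id) for label_id, doc_id in tag_rows]  # label ID 6011, document ID 6012
    labeled_docs = {doc_id for _, doc_id in tag_rows}
    # documents whose identifier is not in the tag database become unlabeled data
    unlabeled_list = [(seq, doc_id) for seq, doc_id in
                      enumerate((d for d in doc_ids if d not in labeled_docs), start=1)]  # ID 6021, document ID 6022
    return teacher_list, unlabeled_list

teacher, unlabeled = make_lists(["d1", "d2", "d3"], [("l1", "d2")])
print(teacher)    # [('l1', 'd2')]
print(unlabeled)  # [(1, 'd1'), (2, 'd3')]
```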
As a result of the initialization processing, the initialization unit 235 inputs the feature type set F and the teacher data list L601 to the feature selection unit 240 as data 300.
When the data 300 is input, the feature selection unit 240, in the same manner as in the processing shown in Fig. 4, uses the feature vector generation unit 237 and the graph construction unit 270 to generate a graph g1 relating to the teacher data. Here, to generate the graph g1, the feature selection unit 240 selects the optimum feature type from the feature type set F, and outputs the selected feature type as a feature type f1.
The feature selection unit 240 inputs the generated graph g1, the feature vectors of the teacher data, and the feature type f1 to the data selection unit 255 as data 310. The initialization unit 235 inputs the unlabeled data list U602 to the data selection unit 255 as data 320.
When the data 310 and the data 320 are input, the data selection unit 255 selects, from the graph g1, the feature vectors of the teacher data, and the feature vectors of the unlabeled data, the unlabeled data suitable for propagating labels, and outputs the selected data as unlabeled data u2. The data selection unit 255 also generates a graph g2 obtained by adding the unlabeled data u2 to the graph g1.
The graph g2 is a graph in which the data of the unlabeled data u2 are appended to the graph g1 as nodes. The initial value of the graph g2 is the graph g1.
The data selection unit 255 inputs the graph g2, the feature vectors of the teacher data, and the feature vectors of the unlabeled data u2 to the machine learning unit 280 as data 330.
When the data 330 is input, the machine learning unit 280 performs machine learning on the data 330 and generates the final output 340 as the result of the machine learning. The machine learning unit 280 performs label propagation by applying, to the graph g2, the same method as the machine learning unit 280 shown in Fig. 4.
Fig. 7 is a flowchart showing the flow of the processing performed by the feature selection unit 240 of embodiment 1.
The processing shown in Fig. 7 is the processing executed by the feature selection unit 240 when the data 300 is input from the initialization unit 235 in Fig. 2.
The feature selection unit 240 selects, from the feature type set F, at least one feature type to be used in graph construction (400). The feature type selected in step 400 is denoted as feature type f1. The number of feature types selected in step 400 is an arbitrary value set by the user.
After step 400, the feature selection unit 240 inputs the feature type f1 and the teacher data list L601 to the feature vector generation unit 237.
The feature vector generation unit 237 generates feature vectors 710 from the input feature type f1, the teacher data list L601, the document database 220, and the tag database 225 (410). In step 410, the feature vector generation unit 237 generates the feature vectors by the same method as the feature vector generation in the processing shown in Fig. 4.
Fig. 8A is an explanatory diagram showing the feature vectors 710 of the teacher data of embodiment 1.
The feature vectors 710 are the feature vectors of the teacher data. Each row of the feature vectors 710 represents a feature vector relating to one piece of teacher data.
The beginning of each row of the feature vectors 710 contains the value of the label given to the teacher data. Each row contains elements representing the features relating to the target data, and the elements are divided by a delimiter such as a separator character.
For example, in an element such as "1:0.5", the number on the left of ":" represents the dimension "1" of the feature, and the number on the right of ":" represents the value "0.5" of the feature.
The dimension of a feature is a number assigned to a word according to the grammar of the content described in the document; for example, it is a value expressing a particle, an adjective, or the like numerically. The value of a feature is the value of the feature in the document itself; for example, when the dimension of the feature indicates an adjective, the value of the feature is "high-speed" or the like.
Furthermore, in Fig. 8A, a row containing the elements "1:0.5", "2:0.8", and "5:-0.1" represents the feature vector (0.5, 0.8, 0, 0, -0.1).
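Assuming a space-separated row layout of the familiar sparse "label dim:value" style (the exact delimiter of the embodiment is left open), the conversion of such a row into a dense vector can be sketched as:

```python
def parse_feature_line(line, dim):
    """Parse 'label d1:v1 d2:v2 ...' (1-based dimensions) into (label, dense vector)."""
    parts = line.split()
    label = parts[0]                 # label value at the beginning of the row
    vec = [0.0] * dim                # unmentioned dimensions default to 0
    for elem in parts[1:]:
        d, v = elem.split(":")
        vec[int(d) - 1] = float(v)
    return label, vec

print(parse_feature_line("1 1:0.5 2:0.8 5:-0.1", 5))
# ('1', [0.5, 0.8, 0.0, 0.0, -0.1])
```

This reproduces the Fig. 8A example: the row with elements "1:0.5", "2:0.8", "5:-0.1" yields the vector (0.5, 0.8, 0, 0, -0.1).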
Fig. 8B is an explanatory diagram showing the feature vectors 700 of the unlabeled data of embodiment 1.
In the processing of the data selection unit 255 described later, the unlabeled data list U602 is also converted into feature vectors 700.
The feature vectors 700 are the feature vectors of the unlabeled data. Each row of the feature vectors 700 represents a feature vector relating to one piece of unlabeled data.
The feature vectors 700 contain the same vector values as the feature vectors 710. However, the feature vectors 700 differ from the feature vectors 710 in that no label is given in each row of the feature vectors 700.
In step 410, the feature vector generation unit 237 stores the pairs of feature dimension and feature value in the feature vectors 710 such that one row of the teacher data list L601 corresponds to one row of the feature vectors 710. The feature vector generation unit 237 then identifies the row of the tag database 225 having the label ID 2251 corresponding to the label ID 6011 of the teacher data list L601, extracts the value of the label from the label 2253 of the identified row, and stores the extracted label value at the beginning of each row of the feature vectors 710.
As described above, the feature vector generation unit 237 generates the feature vectors 710 from the feature type f1 and the teacher data list L601.
After step 410, the graph construction unit 270 converts the feature vectors 710 generated in step 410 into a graph g1 (420). Specifically, since each row of the feature vectors 710 corresponds to a node, the graph construction unit 270 calculates the distance between rows using the feature vectors and determines, for each edge between nodes, a weight based on the calculated distance. In this way, the graph construction unit 270 converts the feature vectors 710 of the teacher data into the graph g1.
After step 420, the feature evaluation unit 245 calculates the evaluation values of the graph g1 according to a feature evaluation function (430). Here, the feature evaluation function can return two or more evaluation values for one graph.
The feature evaluation unit 245 uses, for example, Formula 1 to calculate the cross-label error (Err_diff) as one evaluation value of the feature evaluation function. The cross-label error is an evaluation value indicating to what degree different labels are connected in the graph.
[Mathematical expression 1]
Err_diff(G) = ( Σ_{(i,j)∈E} W_ij · 1[l(i) ≠ l(j)] ) / ( Σ_{(i,j)∈E} W_ij )   (Formula 1)
The symbol G in Formula 1 denotes the graph. The symbol E denotes all the edges contained in the graph. The symbol W_ij is the weight determined for the edge between nodes i and j. The symbol l(i) is the value of the label of node i. The function 1[l(i) ≠ l(j)] returns 1 when the label values of node i and node j differ, and 0 otherwise. Therefore, the cross-label error shown in Formula 1 is the value obtained by dividing the sum of the weights between nodes whose label values differ (the numerator) by the sum of the weights between all nodes (the denominator).
Furthermore, the feature evaluation unit 245 uses, for example, Formula 2 to calculate the different-label connection score (Score_diff). The different-label connection score is calculated by multiplying the cross-label error by -1.
[Mathematical expression 2]
Score_diff(G) = -Err_diff(G)   (Formula 2)
The cross-label error is also used in non-patent literature 2, and is a value that evaluates a graph in terms of the proportion of nodes with different labels that are connected by edges. In a graph in which nodes with different labels are connected to each other by edges of larger weight, labels cannot be propagated accurately. Therefore, by evaluating the graph with the cross-label error as an evaluation index, the feature evaluation unit 245 can penalize edges connecting different labels.
Furthermore, the feature evaluation unit 245 uses, for example, Formula 3 to calculate the same-label connection score (Score_same). The same-label connection score is an evaluation value indicating to what degree the same labels are connected in the graph; that is, it is an evaluation value that evaluates the proportion of nodes with the same label that are connected to each other.
[Mathematical expression 3]
Score_same(G) = ( Σ_{(i,j)∈E} W_ij · 1[l(i) = l(j)] ) / ( Σ_{(i,j)∈E} W_ij )   (Formula 3)
The function 1[l(i) = l(j)] returns 1 when the label values of node i and node j are the same, and 0 otherwise. Therefore, the same-label connection score shown in Formula 3 is the value obtained by dividing the sum of the weights between nodes whose label values are the same (the numerator) by the sum of the weights between all nodes (the denominator).
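Formulas 1 to 3 can be computed directly from a weighted edge list; a minimal sketch (the edge and label representations are illustrative):

```python
def graph_scores(edges, labels):
    """edges: {(i, j): weight}; labels: node -> label value.
    Returns (Score_diff, Score_same) per Formulas 1-3."""
    total = sum(edges.values())                                  # denominator
    diff = sum(w for (i, j), w in edges.items() if labels[i] != labels[j])
    err_diff = diff / total              # Formula 1: cross-label error
    score_diff = -err_diff               # Formula 2
    score_same = (total - diff) / total  # Formula 3
    return score_diff, score_same

edges = {(0, 1): 1.0, (1, 2): 3.0}
labels = {0: "A", 1: "A", 2: "B"}
print(graph_scores(edges, labels))  # (-0.75, 0.25)
```

Note that since every edge is either a same-label or a different-label edge, Score_same(G) = 1 - Err_diff(G) holds.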
The feature evaluation unit 245 of embodiment 1 calculates the evaluation values of the graph g1 using the same-label connection score and the different-label connection score. The feature evaluation unit 245 then stores the evaluation values (the different-label connection score and the same-label connection score) of each graph g1.
Fig. 9A is an explanatory diagram showing the evaluation values of graphs of embodiment 1 calculated only from the different-label connection score.
The evaluation values of the graphs shown in Fig. 9A are calculated only from the different-label connection score. The filled circles in Fig. 9A represent the evaluation values of the graphs. Fig. 9A shows an evaluation value 90 and an evaluation value 91.
The evaluation value 90 is the evaluation value calculated for a graph 900 whose nodes are connected by edges as shown in Fig. 9A. The evaluation value 91 is the evaluation value calculated for a graph 910 or a graph 911 whose nodes are connected by edges as in the graph 910 or the graph 911. The horizontal axis of Fig. 9A is the different-label connection score. Each graph shown in Fig. 9A is a graph generated from a different feature type f1.
The squares and circles shown in the graphs 900, 910, and 911 represent nodes that are teacher data to which labels have been given. Nodes shown with the same shape are nodes to which the same label has been given.
The graph 900 is a graph in which only nodes with different labels are connected by edges. The graph 910 is a graph in which only nodes with the same label are connected by edges. The graph 911 is a graph in which no nodes are connected by edges.
Here, for both the graph 910 and the graph 911, the different-label connection score is "0" (the maximum value of the different-label connection score), so the same different-label connection score is calculated for both. However, since no nodes of the graph 911 are connected by edges, the graph 911 cannot be said to be a graph suitable for propagating labels.
Specifically, this is because, when unlabeled data is added, the graph 911 becomes an excessively sparse graph that is highly likely to hinder label propagation, so that the information extraction system of this embodiment may not be able to propagate labels appropriately to the unlabeled data.
Therefore, a method of selecting a graph only by the different-label connection score would select the graph 911 and is thus inappropriate; in other words, judging graphs only by the different-label connection score is inappropriate.
Fig. 9B is an explanatory diagram showing the evaluation values of graphs of embodiment 1 calculated from the same-label connection score and the different-label connection score.
Fig. 9B shows the evaluation values of graphs in the case where the evaluation value of a graph is calculated from the same-label connection score and the different-label connection score. The horizontal axis of Fig. 9B represents the different-label connection score, and the vertical axis of Fig. 9B represents the same-label connection score. Fig. 9B shows evaluation values 92, 93, 94, and 95.
The evaluation value 92 is the evaluation value calculated for a graph 920, the evaluation value 93 is the evaluation value calculated for a graph 930, the evaluation value 94 is the evaluation value calculated for a graph 940, and the evaluation value 95 is the evaluation value calculated for a graph 950. Each graph shown in Fig. 9B is a graph generated from a different feature type f1.
The closer an evaluation value is shown to the right side of Fig. 9B, the larger its different-label connection score, and the closer an evaluation value is shown to the upper side of Fig. 9B, the larger its same-label connection score. An evaluation value shown in the lower-left region of some evaluation value means that its different-label connection score, its same-label connection score, or both are worse than the evaluation value located to its upper right.
For example, since the evaluation value 94 is located to the lower left of the evaluation value 93, both its different-label connection score and its same-label connection score are worse than those of the evaluation value 93. Conversely, the closer to the upper right an evaluation value is, the higher it is, and the more its graph can be said to contribute to propagating labels.
In this way, when there are two or more objectives (in Fig. 9B, the different-label connection score and the same-label connection score), each graph can be evaluated by sorting the graphs in ascending order of the number of graphs whose evaluation values are shown to the upper right of its own evaluation value.
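The sorting criterion above is Pareto dominance: a graph ranks higher the fewer other graphs sit to its upper right in both scores. A sketch, treating each evaluation value as a (Score_diff, Score_same) pair with larger meaning better (the numbers are illustrative):

```python
def dominates(a, b):
    """a dominates b if a is >= b on every objective and > on at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def nondominated_front(points):
    """Candidates with no other point to their upper right (the Pareto front)."""
    return [p for p in points if not any(dominates(q, p) for q in points if q != p)]

evals = [(-0.1, 0.9), (-0.3, 0.95), (-0.5, 0.5)]
print(nondominated_front(evals))  # [(-0.1, 0.9), (-0.3, 0.95)]
```

Here (-0.5, 0.5) is dominated by (-0.1, 0.9), which is better on both scores, while the first two points do not dominate each other.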
After step 430, the feature selection convergence determination unit 250 compares the evaluation values calculated by the feature evaluation unit 245 with the evaluation values calculated in step 430 of past executions, and determines whether the evaluation values calculated by the feature evaluation unit 245 have converged (440).
Here, the feature selection convergence determination unit 250 may determine that the evaluation values have converged when it determines that the evaluation values calculated by the feature evaluation unit 245 are lower than, or about the same as, the evaluation values calculated in the past. Alternatively, the feature selection convergence determination unit 250 may determine that the evaluation values have converged when, after determining that the calculated evaluation values are lower than or about the same as the past evaluation values, the calculated evaluation values do not change significantly as a result of repeating step 450, step 410, step 420, and step 430 a number of times specified in advance by the user.
When the feature selection convergence determination unit 250 determines that the evaluation values have converged, it outputs the graph g1 with the highest evaluation values calculated in step 430, the feature type f1 used to generate the graph g1, and the feature vectors 710 of the teacher data. The feature selection unit 240 then ends the processing shown in Fig. 7.
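One simple way to realize the convergence judgment of step 440 is sketched below; it assumes, for illustration only, a scalar history of evaluation values and a user-specified patience, whereas the embodiment compares the multi-objective scores themselves.

```python
def has_converged(history, patience=3, tol=1e-6):
    """Judge convergence: no significant improvement over the last `patience` rounds."""
    if len(history) < patience + 1:
        return False  # not enough past executions to compare against
    best_before = max(history[:-patience])   # best value from earlier rounds
    recent_best = max(history[-patience:])   # best value in the recent rounds
    return recent_best <= best_before + tol  # no significant change -> converged

print(has_converged([0.1, 0.2, 0.3, 0.4]))  # False: still improving
print(has_converged([0.5, 0.5, 0.5, 0.5]))  # True: no change
```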
When the feature selection convergence determination unit 250 determines that the evaluation values have not converged, the multi-objective optimization unit 275 selects a new feature type f1 from the evaluation values calculated in step 430 and the current feature type f1 (450).
A concrete example of the method by which the multi-objective optimization unit 275 selects the new feature type f1 is shown below.
As methods that apply evolutionary computation techniques, based on the method of sorting graphs by the evaluation values shown in Fig. 9B, to the optimization of two or more objectives (in this example, the different-label connection score and the same-label connection score), evolutionary multi-objective optimization methods such as NSGA-II are known. In step 450, the multi-objective optimization unit 275 can use such evolutionary multi-objective optimization.
In NSGA-II, the method of ordering the solution candidates (the graphs generated from feature types f1) using the sorting method described above is called non-dominated sorting. As the evolutionary multi-objective optimization of embodiment 1, NSGA-II is described below.
When step 450 is executed for the first time after step 400, the multi-objective optimization unit 275 initializes a solution candidate group P and a child solution candidate group Q. Specifically, the multi-objective optimization unit 275 initializes the solution candidate group P with the feature type f1, and initializes the child solution candidate group Q with an empty list.
Then, every time step 450 is executed, the multi-objective optimization unit 275 repeats the following steps B1 to B5, thereby obtaining, as the child solution candidate group Q, the graphs with the optimum evaluation values and the feature types used to generate those graphs. The total number of solution candidates to be sought is denoted as S.
(Step B1) The multi-objective optimization unit 275 generates a list R obtained by combining the solution candidate group P and the child solution candidate group Q, orders the list R by non-dominated sorting, and partitions it into groups according to the order given by the non-dominated sorting. The order given by the non-dominated sorting is determined from the evaluation values calculated in step 430.
(Step B2) The multi-objective optimization unit 275 calculates the proximity of the solution candidates to one another (the crowding distance) within each group.
(Step B3) The multi-objective optimization unit 275 generates a new solution candidate group P and initializes the new solution candidate group P to an empty list. Then, while the number of elements of the new solution candidate group P is less than S, the multi-objective optimization unit 275 repeatedly moves solution candidates from the list R to the new solution candidate group P in units of groups.
(Step B4) The multi-objective optimization unit 275 moves the solution candidates of the highest-ranked remaining group of the list R to the new solution candidate group P in descending order of crowding distance, until the number of elements of the new solution candidate group P equals S.
(Step B5) The multi-objective optimization unit 275 generates the child solution candidate group Q from the new solution candidate group P by genetic operations such as selection, crossover, and mutation, and then returns to step B1.
Steps B1 to B5 are repeated until a termination condition is satisfied. The multi-objective optimization unit 275 retains the solution candidate group P generated in step B4 and the child solution candidate group Q generated in step B5, and when the processing returns to step B1, it uses the retained solution candidate group P and child solution candidate group Q.
Then, when the termination condition of steps B1 to B5 is satisfied, the multi-objective optimization unit 275 inputs the child solution candidate group Q generated in step B5 to the feature vector generation unit 237 as the next feature type f1, and ends step 450.
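Steps B1 to B4 above can be sketched in code. The following is a minimal illustration of the non-dominated sorting, crowding distance, and group-wise selection described in the text; the function names, the two-objective tuples, and the maximization convention are assumptions for illustration, not the patent's actual implementation.

```python
def dominates(a, b):
    """a dominates b when a is at least as good in every objective and
    strictly better in at least one (maximization assumed)."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def non_dominated_sort(scores):
    """Step B1: group candidate indices into ranked groups (fronts)."""
    fronts, remaining = [], set(range(len(scores)))
    while remaining:
        front = {i for i in remaining
                 if not any(dominates(scores[j], scores[i]) for j in remaining if j != i)}
        fronts.append(sorted(front))
        remaining -= front
    return fronts

def crowding_distance(front, scores):
    """Step B2: per-candidate proximity measure within one group."""
    dist = {i: 0.0 for i in front}
    for m in range(len(scores[front[0]])):
        order = sorted(front, key=lambda i: scores[i][m])
        dist[order[0]] = dist[order[-1]] = float('inf')
        span = scores[order[-1]][m] - scores[order[0]][m] or 1.0
        for k in range(1, len(order) - 1):
            dist[order[k]] += (scores[order[k + 1]][m] - scores[order[k - 1]][m]) / span
    return dist

def next_parents(scores, S):
    """Steps B3-B4: fill the new group P up to S elements, whole groups first,
    breaking the last group by descending crowding distance."""
    P = []
    for front in non_dominated_sort(scores):
        if len(P) + len(front) <= S:
            P.extend(front)                      # step B3: move a whole group
        else:
            cd = crowding_distance(front, scores)
            P.extend(sorted(front, key=lambda i: -cd[i])[:S - len(P)])  # step B4
            break
    return P
```

Step B5 (selection, crossover, mutation on P to produce the child group Q) is omitted here because its operators depend on the encoding of the feature types.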
The termination condition of steps B1 to B5 in step 450 is that step 450 has been executed a user-specified number of times, or that the solution can no longer be improved. A case where the solution can no longer be improved is, for example, a case where the number of solution candidates contained in the highest-ranked group of the non-dominated sorting order does not change even when step 450 is executed repeatedly.
Another case where the solution can no longer be improved is, for example, a case where the hypervolume of the region bounded by the evaluation values of the solution candidates contained in the highest-ranked group of the non-dominated sorting order and the evaluation axes (the horizontal and vertical axes shown in Fig. 9B) does not increase even when step 450 is executed repeatedly.
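The hypervolume mentioned as a termination criterion can be computed for the two-objective case as follows. This is a minimal sketch assuming maximization of both axes and a reference point at the origin; the reference point of the actual system is not specified in the text.

```python
def hypervolume_2d(front, ref=(0.0, 0.0)):
    """Area dominated by a 2-D maximization front, measured from the
    reference point. If this value stops growing across iterations of
    step 450, the front is no longer improving."""
    pts = sorted(front, key=lambda p: p[0], reverse=True)  # descending 1st objective
    area, prev_y = 0.0, ref[1]
    for x, y in pts:
        if y > prev_y:                       # each point adds a new strip of area
            area += (x - ref[0]) * (y - prev_y)
            prev_y = y
    return area
```

For example, the front {(3, 1), (1, 3)} dominates the union of the rectangles [0, 3] x [0, 1] and [0, 1] x [0, 3], giving a hypervolume of 5.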
Here, an optimal solution in the presence of two or more objectives is not necessarily obtained as a single point; it may be obtained as a set of points such that no other point lies to the upper right of any of them (a Pareto optimal solution set). Specifically, a plurality of optimal feature types f1 may be obtained in step B5. In that case, the multi-objective optimization unit 275 may output a plurality of feature types f1 as the result of step 450, and the feature vector generation unit 237 in step 410 may then generate a plurality of feature vectors from the plurality of feature types f1.
The advantage of the Pareto optimal solutions is that, when the optimization ends, diverse solution candidates are obtained, ranging from candidates that emphasize the different-label connection score to candidates that emphasize the same-label connection score. Therefore, even if the machine learning performance of one solution candidate cannot be improved, a reselected learning result can still be obtained by trying the Pareto optimal solutions one after another.
In addition, since the different-label connection score and the same-label connection score are in a trade-off relationship, the same function can be realized even when an index different from these two scores is used in place of one of them. For example, instead of the same-label connection score, the total number of edges can be used as a score for calculating the evaluation value of the graph. In that case, the evaluation value is calculated as follows: the total-edge-count score has the effect of increasing the number of edges, while the different-label connection score penalizes edges between nodes with different labels; as a result, edges between nodes with the same label increase, and edges between nodes with different labels are suppressed.
Thus, the method using the two objectives of the different-label connection score and the same-label connection score is merely an example, and any number of other indices having the same effect may be used.
Through the processing shown in Fig. 7, the feature selection unit 240 can generate a plurality of graphs with different feature types and calculate an evaluation value for each generated graph. Based on the evaluation values, it can then select the feature type f1 best suited for generating a graph for propagating labels to unlabeled data, together with the graph generated by that feature type f1. As a result, the graph g1 is optimized by the processing of the feature selection unit 240.
Fig. 10 is a flowchart showing the processing of the data selection unit 255 of embodiment 1.
When the graph g1, the feature type f1, the feature vectors 710 of the teacher data, and the unlabeled data list U602 are input to the data selection unit 255, the data selection unit 255 inputs the unlabeled data list U602 and the feature type f1 to the feature vector generation unit 237. The feature vector generation unit 237 then converts the unlabeled data list U602 into the feature vectors 700 of the unlabeled data shown in Fig. 8B according to the feature type f1 (step 1090).
Here, the feature vector generation unit 237 generates feature vectors 700 according to the feature type f1 for all data contained in the documents indicated by the document IDs 6022 of the unlabeled data list U602. Therefore, each row of the feature vectors 700 corresponds to one of the nodes contained in one of the documents.
After step 1090, the data evaluation unit 260 calculates, from the feature vectors 700 of the unlabeled data and the feature vectors 710 of the teacher data, the distances between each node of the unlabeled data and the nodes contained in the graph g1. The data evaluation unit 260 then accumulates in the memory 120, for each node of the unlabeled data, the minimum of the distances to the nodes contained in the graph g1 (step 1100).
Specifically, for example, when the distances between a node A of the unlabeled data and the nodes (node B to node D) contained in the graph g1 are calculated and the distance between node A and node D is shorter than the distance between node A and any other node, the data evaluation unit 260 accumulates only the distance between node A and node D in the memory 120 as the distance between node A and the graph g1. By repeating this distance calculation, the data evaluation unit 260 calculates the distances between all nodes of the unlabeled data and the graph g1.
After step 1100, the data evaluation unit 260 selects, from the accumulated distances, the data d' whose distance to the graph g1 (or to the graph g2 after step 1130 has been executed) is the longest. The data evaluation unit 260 then adds the selected data d' as a node to the graph g1 (or to the graph g2 after step 1130 has been executed). The graph g1 to which the data d' has been added is hereinafter referred to as the graph g2.
Furthermore, the data evaluation unit 260 deletes the row corresponding to the data d' from the feature vectors 700 of the unlabeled data, and adds the data d' added to the graph g2 and the feature vector of the data d' to the unlabeled data u2 (step 1110).
After step 1110, the data selection convergence determination unit 265 determines, from the number of data d' added to the graph g2 in step 1110, the distance of the data d', or the like, whether the process of adding data d' has converged (step 1120).
Specifically, the number of data d' to be added, or the minimum value of the distance between the data d' to be added and the graph g2, can be specified in advance by the user in the data selection convergence determination unit 265. In step 1120, the data selection convergence determination unit 265 can then determine that the process of adding data d' has converged when the specified number of data d' have been added to the graph g2. Alternatively, the data selection convergence determination unit 265 can determine that the process has converged when the distance of the data d' selected in step 1110 is shorter than the specified minimum distance.
When it is determined that the process of adding data d' has converged, the data selection unit 255 ends the processing shown in Fig. 10 and outputs the graph g2, the feature vectors 710 of the teacher data, and the feature vectors 700 of the unlabeled data u2.
When it is determined that the process of adding data d' has not converged, the data evaluation unit 260 calculates, from the feature vectors 700 of the unlabeled data and the feature vectors 710 of the teacher data, the distance between each unlabeled data item remaining in the feature vectors 700 and the data d' added to the graph g2 in step 1110. Then, based on the calculated distances, the data evaluation unit 260 updates the minimum distance between each unlabeled data item and the data belonging to the graph g2 (step 1130). After step 1130, the data evaluation unit 260 returns to step 1110 and selects the next data d'.
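The loop of steps 1100 to 1130 described above amounts to farthest-point sampling in the feature space. The following is a minimal sketch under assumed Euclidean distance and a fixed count as the convergence condition; the function names and data layout are illustrative, not the patent's actual interfaces.

```python
import math

def euclid(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def select_far_points(graph_nodes, unlabeled, count):
    """Steps 1100-1130: repeatedly add the unlabeled point farthest from the
    growing graph, updating each remaining point's minimum distance after
    every addition (step 1130) so that selected points end up dispersed."""
    # step 1100: minimum distance from each unlabeled point to the initial graph g1
    min_dist = {i: min(euclid(u, g) for g in graph_nodes)
                for i, u in enumerate(unlabeled)}
    selected = []
    for _ in range(count):                       # convergence check of step 1120
        d = max(min_dist, key=min_dist.get)      # step 1110: farthest data d'
        selected.append(d)
        del min_dist[d]
        for i in min_dist:                       # step 1130: update against new node d'
            min_dist[i] = min(min_dist[i], euclid(unlabeled[i], unlabeled[d]))
    return selected
```

With a tight cluster of points far from the graph and two spread-out points nearer to it, the update in step 1130 prevents the cluster from being picked more than once, which is exactly the behavior illustrated by Figs. 11B and 11C below.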
The data d' extracted by steps 1110 to 1130 will now be described.
Fig. 11A is an explanatory diagram showing the graph g2 and the unlabeled data of embodiment 1.
The data 10 to 14 shown in Fig. 11A represent unlabeled data. The data 20 to 22 represent teacher data and unlabeled data that have already been added to the graph g2.
The data 10 to 12 are located close to one another in the feature vector space, and their distances to the graph g2 are approximately equal. The data 10, 13, and 14 are located apart from one another in the feature vector space.
Fig. 11B is an explanatory diagram showing the graph in the case where the unlabeled data whose distances to the graph g2 are the longest are selected in embodiment 1.
Here, it is assumed that "three" has been specified in advance in the data selection convergence determination unit 265 as the number of data d' to be added, used for the convergence determination in step 1120.
When the processing shown in Fig. 10 starts, in step 1100 the data evaluation unit 260 accumulates, for example, the distance between the data 14 and the data 22 as the minimum distance between the unlabeled data 14 and the graph g2, and the distance between the data 11 and the data 20 as the minimum distance between the data 11 and the graph g2.
Then, in step 1110, the data evaluation unit 260 selects, from the accumulated distances, the data d' whose distance to the graph g2 is the longest. By repeatedly executing step 1110, the data evaluation unit 260 therefore selects the data 10, 13, and 14 as the data d' to be added to the graph g2.
Here, in order to generate a new graph for propagating labels to unlabeled data, it is preferable that the selected unlabeled data be dispersed in the feature vector space, as in Fig. 11B. However, if unlabeled data are selected solely according to the maximum distance, the data evaluation unit 260 may select data d' only from densely clustered unlabeled data and fail to select data d' from dispersed unlabeled data.
Fig. 11C is an explanatory diagram showing the graph in the case where the unlabeled data whose distances to the graph are the longest are selected without updating the accumulated distances.
Suppose that the distances between the data 10 to 12 and the graph g2 are greater than the distances between the data 13, 14 and the graph g2. If the data evaluation unit 260 simply selects the unlabeled data with the maximum distance to the graph, the data d' selected in step 1110 are the data 10 to 12, as shown by the black triangles in Fig. 11C.
In step 1130, however, the data evaluation unit 260 of embodiment 1 updates the accumulated distances. For example, when the data 10 is extracted in step 1110, the distance between the data 11 and the graph g2 is updated in step 1130 based on the distance between the data 11 and the data 10. Therefore, the data 11 is not selected as the data d' in the next execution of step 1110.
That is, by executing step 1130, the data evaluation unit 260 of embodiment 1 can select dispersed unlabeled data in the next execution of step 1110. By selecting unlabeled data from regions where the node density is low, the data selection unit 255 of the present invention can generate a graph with little bias in the data.
Then, by the processing of the data selection unit 255 shown in Fig. 10, the graph g2 containing the optimal unlabeled data can be input to the machine learning unit 280.
In embodiment 1, the density of the data is used as the viewpoint for selecting unlabeled data; however, the data selection unit 255 may also add a new index to this selection method and, as with the feature selection in the feature selection unit 240, select the data as a multi-objective optimization problem.
Here, the computation time required for the processing of the feature selection unit 240 is estimated from the viewpoint of the number of data. Let N be the number of teacher data and M be the number of unlabeled data. The computation time required for one evaluation of the different-label connection score, i.e., of formulas 1 and 2, is O(N*N). Likewise, the computation time required for one evaluation of the same-label connection score, i.e., of formula 3, is O(N*N).
When the feature selection unit 240 of embodiment 1 is not used and simple label propagation is used in the machine learning, i.e., when the processing shown in Fig. 4 is executed, the computation time for the machine learning unit 280 to repeatedly perform machine learning in order to select the optimal feature is O((N+M)*(N+M)*t), where t is the number of iterations of the label propagation method.
The present invention assumes the premise that teacher data are difficult to obtain while unlabeled data are abundant, so the number N is far smaller than the number M. On the other hand, the computation time of the processing in the feature selection unit 240 is, as described above, O(N*N), which does not depend on the number M. Therefore, compared with the processing in the machine learning unit 280, which depends on the number M, the feature selection unit 240 of the present embodiment can significantly shorten the time for selecting features.
Next, the computation time required for the processing of the data selection unit 255 is estimated. Let M_u be the number of elements of the unlabeled data u2 added to the graph g2 (the number of extracted data d'). The computation time required for the distance calculation in step 1100 is O(N*M).
The computation time of the first execution of step 1130 is O(M-1), and that of the second execution is O(M-2). Since step 1130 is executed M_u-1 times, the computation time for executing all of step 1130 is O((M-1)+(M-2)+...+(M-(M_u-1))) = O(M(M_u-1)-M_u*M_u+M_u).
When the data selection unit 255 of embodiment 1 is not used and simple label propagation is used in the machine learning, i.e., when the processing shown in Fig. 4 is executed, the label propagation without data selection takes O((N+M)*(N+M)*t). On the other hand, the total of the computation time of the data selection by the data selection unit 255 and the label propagation after the data selection is O(M(M_u-1)-M_u*M_u+M_u+(N+M_u)*(N+M_u)*t).
Since the number M is larger than the numbers N and M_u, when attention is focused on M, the computation time without data selection in the data selection unit 255 is O(tM^2+tNM), which is proportional to M^2 (the square of M). On the other hand, when the data selection unit 255 of embodiment 1 performs data selection, the computation time is O((M_u-1)M), which is proportional to M. This means that the larger the number M of unlabeled data, the more the computation time can be shortened by the processing of the data selection unit 255 of embodiment 1.
Embodiment 1 provides the following effects.
The first effect is that the information extraction computer 200 of embodiment 1 optimizes the feature type and the unlabeled data, whereby the graph structure is optimized; therefore, the number of teacher data that a domain expert must select can be reduced, and labor costs can be suppressed.
The second effect is that the feature selection unit 240 of embodiment 1 uses objective feature evaluation functions to optimize the feature type, so the evaluation of the graph requires no judgment by a domain expert or a machine learning expert. Labor costs can thereby be suppressed. Furthermore, by automating the machine learning, the speed of the machine learning can be increased and equipment costs can be reduced.
The third effect is that the feature evaluation functions of embodiment 1 are evaluation functions under which nodes with the same label connect easily and nodes with different labels connect with difficulty, so the accuracy of the learning can be improved.
The fourth effect is that the feature evaluation functions of embodiment 1 are calculated before the machine learning unit 280 performs machine learning; therefore, the graph optimization does not require the result of the machine learning, and a graph structure suitable for propagating labels can be obtained with little computation time.
The fifth effect is that the data selection unit 255 of embodiment 1 selects, without performing machine learning, the data that have a good influence on the machine learning from among a large amount of unlabeled data; therefore, the speed of the machine learning can be increased and equipment costs can be reduced.
[Embodiment 2]
The information extraction system of embodiment 2 adopts the same configuration as the information extraction system of embodiment 1 shown in Fig. 2. However, it differs from the information extraction system of embodiment 1 in that the data selection unit 255 need not include the data evaluation unit 260 and the data selection convergence determination unit 265.
In embodiment 1, the unlabeled data to which labels are to be propagated are optimized together with the optimization of the features. This is because, when there are very many unlabeled data, the required computer resources and the required learning time increase, so the number of unlabeled data needs to be limited. However, when the number of unlabeled data is small, or when the computer resources are abundant, performing the machine learning with all the unlabeled data causes no problem such as a shortage of computer resources or an excessive increase in learning time.
In such a case, the information extraction system of embodiment 2 omits the unlabeled-data selection processing (Fig. 10) of the data selection unit 255.
For example, when the user wishes to propagate the labels of the teacher data to all the unlabeled data, the user instructs the information extraction computer 200 via the input device 140 that the graph g2 should contain all the unlabeled data. In that case, instead of the processing shown in Fig. 10, the data selection unit 255 generates the graph g2 by adding all the unlabeled data to the graph g1.
The data selection unit 255 then outputs the generated graph g2, the feature vectors of all the unlabeled data, and the feature vectors of the teacher data as the data 330. Thus, the processing time of the data selection unit 255 in Fig. 5 is shortened, and the whole processing shown in Fig. 5 is sped up.
Also, for example, when the user wishes to propagate the labels of the teacher data to only part of the unlabeled data, the user instructs the information extraction computer 200 via the input device 140 which unlabeled data the graph g2 should contain. In that case, instead of the processing shown in Fig. 10, the data selection unit 255 generates the graph g2 by adding only the user-designated unlabeled data to the graph g1.
[Embodiment 3]
The information extraction system of embodiment 3 adopts the same configuration as the information extraction system of embodiment 1. However, it differs from the information extraction system of embodiment 1 in that the feature evaluation unit 245 and the feature selection convergence determination unit 250 are not required.
In embodiment 1, the features (i.e., the feature type) were optimized together with the optimization of the unlabeled data serving as the label propagation destination. This is because it is generally difficult to choose which features should be used in label propagation, and the work needs to be done by a domain expert.
However, depending on the kind of data and the documents to be learned, the feature type is sometimes uniquely determined. In that case, the processing shown in Fig. 7 performed by the feature selection unit 240 can be omitted, and the whole processing shown in Fig. 5 can be sped up.
For example, when the user designates a uniquely determined feature type to the information extraction computer 200 via the input device 140, the feature selection unit 240 omits the processing shown in Fig. 7.
When the processing shown in Fig. 7 is omitted, the feature selection unit 240, instead of performing that processing, inputs the teacher data list L601 and the uniquely determined feature type to the feature vector generation unit 237 and causes the feature vector generation unit 237 to generate the feature vectors 710 of the teacher data. Furthermore, the feature selection unit 240 causes the graph construction unit 270 to generate the graph g1 from the generated feature vectors 710. The feature selection unit 240 then outputs the generated graph g1, the uniquely determined feature type, and the feature vectors 710 of the teacher data as the data 310.
A case where the feature type is uniquely determined is, for example, the application of machine learning to part-of-speech classification of electronic documents. In this case, the range of feature choices has only the degree of freedom of, e.g., the number of adjacent tokens. The number of adjacent tokens is determined by a trade-off between computation time and accuracy, so the feature is uniquely determined by external factors, namely the performance of the computer to be used and the accuracy to be sought.
Part-of-speech classification is a general task for electronic documents, so the number of unlabeled data can be enormous, and the data need to be narrowed down so that the learning can be completed within a realistic time. Embodiment 3 assumes such a case and can select the data efficiently.
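A uniquely determined feature type of the kind described for part-of-speech classification, namely the surrounding tokens within a fixed window, can be sketched as follows. The function name and the feature-key format are illustrative assumptions; the actual feature type format is defined by the feature type database 230.

```python
def window_features(tokens, index, width):
    """Extract the adjacent-token features for the token at `index`:
    the tokens at offsets -width..+width, padded at sentence boundaries.
    `width` is the single degree of freedom mentioned in the text."""
    feats = {}
    for off in range(-width, width + 1):
        j = index + off
        feats[f"tok[{off}]"] = tokens[j] if 0 <= j < len(tokens) else "<pad>"
    return feats
```

Increasing `width` enlarges each feature vector and hence the computation time, which is the trade-off between computation time and accuracy noted above.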
[Embodiment 4]
The configuration of the information extraction system of embodiment 4 is the same as that of the information extraction system of embodiment 1. However, it differs from the information extraction system of embodiment 1 in that the multi-objective optimization unit 275 is replaced by a single-objective optimization unit described below.
In embodiment 1, the feature type is selected by the multi-objective optimization unit 275; in contrast, the feature selection unit 240 of embodiment 4 optimizes the feature type by the single-objective optimization unit. Formula 4 is used as the feature evaluation function in the feature selection unit 240.
In step 450, the single-objective optimization unit of embodiment 4 calculates the evaluation value Score_merge of the graph using the different-label connection score and the same-label connection score calculated by formulas 1 to 3, together with formula 4.
[Mathematical formula 4]
Score_merge(G) = λ·Score_diff(G) + (1 − λ)·Score_same(G)   (formula 4)
Formula 4 is a linear sum of the different-label connection score and the same-label connection score. The weight λ, an arbitrary real number from 0 to 1 determined by the user, expresses the respective weights of the different-label connection score and the same-label connection score. The evaluation value of the graph calculated by formula 4 is lower when the graph contains more nodes connected across different labels and fewer nodes connected within the same label, and higher when the graph contains fewer nodes connected across different labels and more nodes connected within the same label.
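The scalarization of formula 4 can be sketched directly. This is a minimal illustration; the formulas 1 to 3 that produce the two input scores are defined elsewhere in the description, so the two scores are treated here as precomputed values, and the function name is an assumption.

```python
def score_merge(score_diff, score_same, lam):
    """Formula 4: linear blend of the different-label connection score and the
    same-label connection score, with user weight lam in [0, 1]."""
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lambda must lie in [0, 1]")
    return lam * score_diff + (1.0 - lam) * score_same
```

With λ = 1 only the different-label connection score matters, with λ = 0 only the same-label connection score, and intermediate values trade the two objectives off against each other, which is why this scalarization suits the case where the Pareto optimal solution reduces to a single point.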
In embodiment 4, the multi-objective optimization unit 275 of embodiment 1 is replaced by the single-objective optimization unit. In step 450, the single-objective optimization unit of embodiment 4 generates a new feature type f1 from the previously selected feature types f1 and the previously calculated evaluation values Score_merge. The single-objective optimization unit of embodiment 4 uses a known method such as a genetic algorithm or simulated annealing. For example, when the single-objective optimization unit uses a simple genetic algorithm, it selects the two feature types that yield the higher graph evaluation values and generates a new feature type f1 by exchanging elements between the two feature lists.
The single-objective optimization unit of embodiment 4 is suitable for the case where the Pareto optimal solution reduces to a single point. Furthermore, the single-objective optimization unit does not need to retain a plurality of solution candidates, so the memory resources of the computer can be reduced.
[Embodiment 5]
The information extraction system of embodiment 5 is the same as the information extraction system of embodiment 1.
In embodiment 1, the feature evaluation functions (formulas 1 to 3) are not determined by the result of the machine learning. However, depending on the kind of data (documents), a divergence may arise between the evaluation value obtained from the result of the machine learning and the feature evaluation functions. Therefore, as shown in Fig. 12, the information extraction system of embodiment 5 feeds back the result of the machine learning to improve the feature evaluation function.
Fig. 12 is a functional block diagram showing an outline of the flow of data leading up to the machine learning of documents performed by the information extraction system of embodiment 5.
Fig. 12 shows the data flow input to and output from the functional blocks corresponding to the processing units shown in Fig. 2 in the processing of embodiment 5.
The processing in the label generation unit 295, the document database 220, the label database 225, and the feature type database 230 is the same as in embodiment 1.
The initialization unit 235 of embodiment 5 separates an arbitrary part of the teacher data as test data 1310. Specifically, the initialization unit 235 copies an arbitrary part of the teacher data as the test data 1310 and deletes the data identical to the copied test data 1310 from the teacher data. The user specifies in advance the amount of test data 1310 to be separated from the teacher data, and so on.
The test data 1310 separated by the initialization unit 235 are not used as teacher data in the graph construction and the machine learning; they are used only for the evaluation of the machine learning by the machine learning unit 280. The initialization unit 235 of embodiment 5 inputs the test data 1310 to the machine learning unit 280.
The data selection unit 255 of embodiment 5 adds the feature type f1 to the data 330 input to the machine learning unit 280.
The machine learning in the machine learning unit 280 of embodiment 5 will now be described specifically.
When the data 330 and the test data 1310 are input, the machine learning unit 280 inputs the test data 1310 and the feature type f1 to the feature vector generation unit 237 and converts the input test data 1310 into the feature vectors of the test data according to the feature type f1. The feature vectors of the test data have the same form as the feature vectors of the unlabeled data shown in Fig. 8B, with no labels attached.
Then, when simple label propagation is performed on the input data 330, the machine learning unit 280 of embodiment 5 adds the feature vectors of the test data to the feature vectors 700 of the unlabeled data contained in the data 330, and performs label propagation using the data 330.
Furthermore, the machine learning unit 280 of embodiment 5 compares the labels of the test data estimated by the label propagation with the true labels of the test data, thereby calculating at least one of the recall, the precision, and the like as an evaluation value.
On the other hand, when the feature selection unit 240 executes its processing for the first time after the processing of the initialization unit 235, the feature selection unit 240 selects the feature type according to the feature evaluation functions of formulas 1 to 3, as in embodiment 1. Then, after the processing including the data selection unit 255, the machine learning unit 280 obtains the graphs of the Pareto optimal solutions and performs machine learning on the data 330 and the test data.
If the first evaluation result of the machine learning in the machine learning unit 280 does not reach the level the user expects, i.e., if the machine learning unit 280 determines that the accuracy is insufficient, the feature selection unit 240 performs the second feature selection.
Fig. 13 is a flowchart showing the processing of the feature selection unit 240 in the case where the evaluation of the machine learning of embodiment 5 is low.
The feature selection unit 240 approximates the evaluation function from the values of the feature evaluation functions for the graphs up to the previous iteration and the corresponding evaluation values of the machine learning by the machine learning unit 280 (step 1400). Specifically, letting the values of the feature evaluation functions be x1, x2, x3, ... and the corresponding evaluation values of the machine learning by the machine learning unit 280 be y1, y2, y3, ..., the feature selection unit 240 performs regression analysis and thereby obtains an approximation function r that, given a value x of the feature evaluation function, returns an estimate y of the evaluation value of the machine learning by the machine learning unit 280.
Here, in addition to linear regression, Support Vector Regression (SVR) or the like can be used for the regression analysis.
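The simplest instance of step 1400, ordinary least-squares linear regression over the accumulated (feature evaluation value, machine learning evaluation value) pairs, can be sketched as follows; the function name is an illustrative assumption, and SVR could be substituted where a nonlinear fit is needed.

```python
def fit_linear(xs, ys):
    """Fit y ~ a*x + b by ordinary least squares from past pairs of
    (feature-evaluation value x, machine-learning evaluation value y),
    and return the approximation function r of step 1400."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    a = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    b = my - a * mx
    return lambda x: a * x + b    # r: estimated machine-learning evaluation value
```

The returned function r is then applied in step 1410 to the feature-evaluation value of a new graph to estimate how well the machine learning would score it, without actually running the learning.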
Steps 400, 410, and 420 executed after step 1400 are the same as steps 400, 410, and 420 of embodiment 1.
After step 420, the feature evaluation unit 245 inputs the evaluation value of the graph g1 based on the feature evaluation functions into the approximation function r, and determines the result calculated by the approximation function r as the evaluation value (step 1410). Steps 440 and 450 executed after step 1410 are the same as steps 440 and 450 of embodiment 1.
In this way, the graph evaluation for the newly optimized feature type is corrected by the machine learning, and the processing in the feature selection unit 240 shown in Fig. 13, the processing by the data selection unit 255, and the machine learning by the machine learning unit 280 are repeated until the accuracy the user expects is satisfied. The processing of the feature selection unit 240, the data selection unit 255, and the machine learning unit 280 may also be stopped not only when the feature selection unit 240 obtains an evaluation value representing the accuracy set by the user, but also, for example, when the number of repetitions of the processing of the feature selection unit 240, the data selection unit 255, and the machine learning unit 280 exceeds a predetermined upper limit, when the improvement rate of the accuracy of the machine learning falls below that of the previous execution, or when the accuracy of the machine learning is worse than in the previous execution.
Unlike embodiment 1, embodiment 5 requires machine learning to be executed repeatedly. However, by restricting the execution of machine learning to the candidates rated highly by the approximation function r, the number of executions of the computationally expensive machine learning can be kept small.
In addition, embodiment 5 may also perform the processing of the information extraction system of embodiment 2. That is, the data selection portion 255 of embodiment 5 need not have the data evaluation portion 260 and the data selection convergence determination portion 265.
Furthermore, embodiment 5 may also perform the processing of the information extraction system of embodiment 4. That is, the multi-objective optimization portion 275 of embodiment 5 may be replaced with a single-objective optimization portion.
The invention made by the inventors has been described above based on the embodiments; however, the invention is not limited to these embodiments, and various modifications can be made without departing from its gist.
[Industrial applicability]
The distributed computing system of the present invention is a technology particularly useful for extracting information from electronic document data; it is not limited to this, however, and can be widely applied to any data processing that includes graph-based machine learning.

Claims (12)

1. An information processing system that performs machine learning on a plurality of document data, characterized in that the information processing system comprises:
an initialization portion that obtains a plurality of document data to which labels have been given as a plurality of teacher data, obtains document data to which no label has been given as unlabeled data, and obtains a plurality of feature types each representing a method of extracting a feature related to each of the document data;
a feature vector generation portion that, when at least one of the obtained feature types is input, generates, from the input feature type and each of the obtained teacher data, feature vectors each representing by vector values the feature related to the corresponding teacher data;
a graph construction portion that generates a graph of the teacher data from the feature vectors of the teacher data generated by the feature vector generation portion;
a feature selection portion that, based on the graph of the teacher data generated by the graph construction portion, selects, from the feature types obtained by the initialization portion, the feature type most suitable for generating a first graph for propagating the labels of the teacher data, and outputs the first graph generated by the graph construction portion;
a data selection portion that, based on the first graph and the unlabeled data, selects the unlabeled data to which the labels given to the teacher data should be propagated, and generates a second graph by including the selected unlabeled data in the first graph; and
a machine learning portion that propagates the labels given to the teacher data to the selected unlabeled data through the second graph.
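For background, the "propagating labels through the graph" recited in claim 1 can be illustrated by the following toy sketch; this is not the patent's implementation, and the graph, weights, and labels are hypothetical.

```python
# Toy label propagation on a weighted graph: labeled (teacher) nodes keep their
# labels; unlabeled nodes repeatedly take the weighted average of their neighbors.
def propagate(edges, labels, n_nodes, iters=50):
    """edges: {(i, j): weight}; labels: {node: 0.0 or 1.0} for teacher data.
    Returns a score per node; thresholding at 0.5 reads off a propagated label."""
    score = [labels.get(i, 0.5) for i in range(n_nodes)]
    nbrs = {i: [] for i in range(n_nodes)}
    for (i, j), w in edges.items():
        nbrs[i].append((j, w))
        nbrs[j].append((i, w))
    for _ in range(iters):
        for i in range(n_nodes):
            if i in labels:                  # teacher data: label stays clamped
                continue
            total = sum(w for _, w in nbrs[i])
            if total > 0:
                score[i] = sum(w * score[j] for j, w in nbrs[i]) / total
    return score

# Chain 0-1-2-3 with node 0 labeled 0.0 and node 3 labeled 1.0:
scores = propagate({(0, 1): 1.0, (1, 2): 1.0, (2, 3): 1.0}, {0: 0.0, 3: 1.0}, 4)
```

The interior nodes converge toward 1/3 and 2/3, i.e. each unlabeled node ends up closer to the label of its nearer teacher node.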
2. The information processing system according to claim 1, characterized in that:
the graph construction portion calculates the distances between the teacher data from the generated feature vectors of the teacher data;
the graph construction portion generates the graph of the teacher data by determining, between the teacher data, weights based on the calculated distances between the teacher data;
the feature selection portion has:
a feature evaluation portion that evaluates the generated graph of the teacher data;
a feature selection convergence determination portion that, when the evaluation result of the feature evaluation portion for the graph of the teacher data satisfies a first predetermined condition, outputs the graph of the teacher data as the first graph; and
a feature optimization portion that, when the evaluation result of the feature evaluation portion for the graph of the teacher data does not satisfy the first predetermined condition, selects a new feature type from the feature types obtained by the initialization portion according to that evaluation result, and inputs the selected feature type to the feature vector generation portion; and
the feature evaluation portion evaluates the graph of the teacher data using a feature evaluation function that rates the graph higher as the weights determined between teacher data given different labels become smaller, and rates the graph higher as the weights determined between teacher data given the same label become larger.
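The feature evaluation function recited in claim 2 (higher score when same-label pairs carry large weights and different-label pairs carry small weights) can be illustrated as follows; the specific formula `same - diff` is an assumption of this sketch, not the patent's formula.

```python
# Toy feature evaluation function: reward weight on same-label edges, penalize
# weight on different-label edges (`same - diff` is one simple choice of formula).
def evaluate_graph(edges, labels):
    """edges: {(i, j): weight} between teacher data; labels: {node: label}."""
    same = sum(w for (i, j), w in edges.items() if labels[i] == labels[j])
    diff = sum(w for (i, j), w in edges.items() if labels[i] != labels[j])
    return same - diff

score = evaluate_graph({(0, 1): 0.9, (1, 2): 0.2, (2, 3): 0.8},
                       {0: "A", 1: "A", 2: "B", 3: "B"})
```

Here the only cross-label edge (1, 2) is weak, so the graph scores well (0.9 + 0.8 - 0.2 = 1.5); a graph whose heavy edges crossed label boundaries would score low.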
3. The information processing system according to claim 1 or 2, characterized in that:
the feature vector generation portion generates, from the feature type used to generate the first graph and the plurality of unlabeled data obtained by the initialization portion, feature vectors each representing by vector values the feature related to the corresponding unlabeled data;
the data selection portion has a data evaluation portion that calculates, from the feature vectors of the teacher data and the feature vectors of the unlabeled data, the minimum of the distances between each unlabeled data and the teacher data included in the first graph, as the distance between the first graph and each unlabeled data;
the data evaluation portion holds the calculated distances between the first graph and each unlabeled data;
the data evaluation portion selects the unlabeled data having the maximum of the held distances between the first graph and each unlabeled data;
the data evaluation portion changes the selected unlabeled data into document data included in the first graph;
the data evaluation portion calculates the minimum of the distances between each document data included in the first graph and each unlabeled data; and
the data evaluation portion updates the held distances between the first graph and each unlabeled data according to the calculated distances between each document data and each unlabeled data.
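The selection procedure recited in claim 3 amounts to a farthest-first (max-min distance) strategy with incremental updating of the held distances. A toy sketch with Euclidean distances follows; the vectors and the helper name are hypothetical, not from the patent.

```python
import math

def farthest_first(graph_vecs, unlabeled_vecs, k):
    """Pick k unlabeled vectors: repeatedly take the one whose minimum distance
    to the current graph is largest, add it to the graph, update held distances."""
    held = {i: min(math.dist(v, g) for g in graph_vecs)
            for i, v in enumerate(unlabeled_vecs)}
    picked = []
    for _ in range(k):
        i = max(held, key=held.get)        # farthest unlabeled point
        picked.append(i)
        added = unlabeled_vecs[i]
        del held[i]                        # it now belongs to the graph
        for j in held:                     # update the held minimum distances
            held[j] = min(held[j], math.dist(unlabeled_vecs[j], added))
    return picked

# One teacher vector at the origin, three hypothetical unlabeled vectors:
picked = farthest_first([(0.0, 0.0)], [(1.0, 0.0), (5.0, 0.0), (9.0, 0.0)], 2)
```

Holding and updating the minimum distances avoids recomputing all pairwise distances each round: after a point joins the graph, each remaining held distance only needs to be compared against the distance to the newly added point.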
4. The information processing system according to claim 2, characterized in that:
the initialization portion obtains the teacher data and test data by dividing the plurality of document data to which labels have been given;
the feature evaluation portion calculates feature evaluation values using the feature evaluation function;
the feature vector generation portion generates, from the feature type used to generate the second graph and the obtained test data, feature vectors representing by vector values the feature related to the test data;
the machine learning portion includes the feature vectors of the test data among the feature vectors of the selected unlabeled data;
the machine learning portion propagates the labels given to the teacher data to the selected unlabeled data through the second graph, according to the feature vectors of the selected unlabeled data and the feature vectors of the teacher data;
the machine learning portion calculates an evaluation value of the machine learning by comparing the labels propagated to the test data included in the selected unlabeled data with the labels given to the test data;
the feature evaluation portion, when the evaluation value of the machine learning satisfies a second predetermined condition, obtains a regression function from the evaluation value of the machine learning and the calculated feature evaluation values; and
the feature evaluation portion evaluates the graph of the teacher data using the obtained regression function and the feature evaluation function.
5. The information processing system according to claim 2, characterized in that:
the information processing system further has an input device that accepts instructions from a user; and
when the user instructs, via the input device, unlabeled data to be included in the first graph, the data selection portion selects the unlabeled data instructed by the user as the unlabeled data to be added to the first graph.
6. The information processing system according to claim 1, characterized in that:
the information processing system further has an input device that accepts instructions from a user; and
when the user instructs, via the input device, a feature type for generating the graph most suitable for propagating the labels of the teacher data, the feature selection portion selects the feature type instructed by the user as the feature type for generating the first graph.
7. An information processing method for an information processing system that performs machine learning on a plurality of document data, characterized in that:
the information processing system has a processor and a memory; and
the method comprises the following steps:
an initialization step in which the processor obtains a plurality of document data to which labels have been given as a plurality of teacher data, obtains document data to which no label has been given as unlabeled data, and obtains a plurality of feature types each representing a method of extracting a feature related to each of the document data;
a feature vector generation step in which, when at least one of the obtained feature types is input, the processor generates, from the input feature type and each of the obtained teacher data, feature vectors each representing by vector values the feature related to the corresponding teacher data;
a graph construction step in which the processor generates a graph of the teacher data from the feature vectors of the teacher data generated in the feature vector generation step;
a feature selection step in which the processor, based on the graph of the teacher data generated in the graph construction step, selects, from the feature types obtained in the initialization step, the feature type most suitable for generating a first graph for propagating the labels of the teacher data, and outputs the first graph generated in the graph construction step;
a data selection step in which the processor, based on the first graph and the unlabeled data, selects the unlabeled data to which the labels given to the teacher data should be propagated, and generates a second graph by including the selected unlabeled data in the first graph; and
a machine learning step in which the processor propagates the labels given to the teacher data to the selected unlabeled data through the second graph.
8. The information processing method according to claim 7, characterized in that:
the graph construction step comprises the following steps:
a step in which the processor calculates the distances between the teacher data from the generated feature vectors of the teacher data; and
a step in which the processor generates the graph of the teacher data by determining, between the teacher data, weights based on the calculated distances between the teacher data;
the feature selection step comprises the following steps:
a feature evaluation step in which the processor evaluates the generated graph of the teacher data;
a feature selection convergence determination step in which, when the evaluation result of the feature evaluation step for the graph of the teacher data satisfies a first predetermined condition, the processor outputs the graph of the teacher data as the first graph; and
a feature optimization step in which, when the evaluation result of the feature evaluation step for the graph of the teacher data does not satisfy the first predetermined condition, the processor selects a new feature type from the feature types obtained in the initialization step according to that evaluation result, and inputs the selected feature type in the feature vector generation step; and
the feature evaluation step further comprises a step in which the processor evaluates the graph of the teacher data using a feature evaluation function that rates the graph higher as the weights determined between teacher data given different labels become smaller, and rates the graph higher as the weights determined between teacher data given the same label become larger.
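One common way to realize the "weights based on distances" recited in claim 8 (and claim 2) is a Gaussian kernel over feature-vector distances. The kernel choice, the sigma parameter, and the vectors below are assumptions of this sketch, not stated in the patent.

```python
import math

# Illustrative sketch: turn feature-vector distances between teacher data into
# graph edge weights with a Gaussian kernel (assumption; one common choice).
def build_graph(vectors, sigma=1.0):
    """Return {(i, j): weight} with weight = exp(-d^2 / (2 * sigma^2)),
    so nearby teacher data get heavy edges and distant ones get light edges."""
    edges = {}
    for i in range(len(vectors)):
        for j in range(i + 1, len(vectors)):
            d = math.dist(vectors[i], vectors[j])
            edges[(i, j)] = math.exp(-d * d / (2 * sigma * sigma))
    return edges

edges = build_graph([(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)])
```

Because the weight decays monotonically with distance, the feature evaluation function of claim 8 effectively asks whether the chosen feature type places same-label teacher data close together and different-label teacher data far apart.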
9. The information processing method according to claim 7 or 8, characterized in that:
the feature vector generation step comprises a step in which the processor generates, from the feature type used to generate the first graph and the unlabeled data obtained in the initialization step, feature vectors each representing by vector values the feature related to the corresponding unlabeled data;
the data selection step has a data evaluation step in which the processor calculates, from the feature vectors of the teacher data and the feature vectors of the unlabeled data, the minimum of the distances between each unlabeled data and the teacher data included in the first graph, as the distance between the first graph and each unlabeled data; and
the data evaluation step comprises the following steps:
a step in which the processor stores in the memory the calculated distances between the first graph and each unlabeled data;
a step in which the processor selects the unlabeled data having the maximum of the distances, stored in the memory, between the first graph and each unlabeled data;
a step in which the processor changes the selected unlabeled data into document data included in the first graph;
a step in which the processor calculates the minimum of the distances between each document data included in the first graph and each unlabeled data; and
a step in which the processor updates the distances, stored in the memory, between the first graph and each unlabeled data according to the calculated distances between each document data and each unlabeled data.
10. The information processing method according to claim 8, characterized in that:
the initialization step comprises a step in which the processor obtains the teacher data and test data by dividing the plurality of document data to which labels have been given;
the feature evaluation step comprises a step in which the processor calculates feature evaluation values using the feature evaluation function;
the feature vector generation step comprises a step in which the processor generates, from the feature type used to generate the second graph and the obtained test data, feature vectors representing by vector values the feature related to the test data;
the machine learning step comprises the following steps:
a step in which the processor includes the feature vectors of the test data among the feature vectors of the selected unlabeled data;
a step in which the processor propagates the labels given to the teacher data to the selected unlabeled data through the second graph, according to the feature vectors of the selected unlabeled data and the feature vectors of the teacher data; and
a step in which the processor calculates an evaluation value of the machine learning by comparing the labels propagated to the test data included in the selected unlabeled data with the labels given to the test data; and
the feature evaluation step further comprises the following steps:
a step in which, when the evaluation value of the machine learning does not satisfy a second predetermined condition, the processor obtains a regression function from the evaluation value of the machine learning and the calculated feature evaluation values; and
a step in which the processor evaluates the graph of the teacher data using the obtained regression function and the feature evaluation function.
11. The information processing method according to claim 8, characterized in that:
the information processing system further has an input device that accepts instructions from a user; and
the data selection step comprises a step in which, when the user instructs, via the input device, unlabeled data to be included in the first graph, the processor selects the unlabeled data instructed by the user as the unlabeled data to be added to the first graph.
12. The information processing method according to claim 7, characterized in that:
the information processing system further has an input device that accepts instructions from a user; and
the feature selection step comprises a step in which, when the user instructs, via the input device, a feature type for generating the graph most suitable for propagating the labels of the teacher data, the processor selects the feature type instructed by the user as the feature type for generating the first graph.
CN201310322481.3A 2012-09-18 2013-07-29 Information processing system and information processing method Expired - Fee Related CN103678436B (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
JP2012204680A JP5881048B2 (en) 2012-09-18 2012-09-18 Information processing system and information processing method
JP2012-204680 2012-09-18

Publications (2)

Publication Number Publication Date
CN103678436A true CN103678436A (en) 2014-03-26
CN103678436B CN103678436B (en) 2017-04-12

Family

ID=50316016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310322481.3A Expired - Fee Related CN103678436B (en) 2012-09-18 2013-07-29 Information processing system and information processing method

Country Status (2)

Country Link
JP (1) JP5881048B2 (en)
CN (1) CN103678436B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107347050A * 2016-05-05 2017-11-14 腾讯科技(深圳)有限公司 Malicious identification method and device based on reverse phishing

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CA3198484A1 (en) * 2014-06-30 2016-01-07 Amazon Technologies, Inc. Feature processing tradeoff management
JP6935059B2 (en) 2017-03-30 2021-09-15 日油株式会社 A method for purifying polyethylene glycol having one carboxyl group
KR101864380B1 (en) * 2017-12-28 2018-06-04 (주)휴톰 Surgical image data learning system
JP7006296B2 (en) 2018-01-19 2022-01-24 富士通株式会社 Learning programs, learning methods and learning devices
JP7006297B2 (en) 2018-01-19 2022-01-24 富士通株式会社 Learning programs, learning methods and learning devices
KR102543698B1 (en) * 2018-05-28 2023-06-14 삼성에스디에스 주식회사 Computing system and method for data labeling thereon
CN109522961B (en) * 2018-11-23 2022-09-13 中山大学 Semi-supervised image classification method based on dictionary deep learning
JP2020140452A (en) 2019-02-28 2020-09-03 富士通株式会社 Node information estimation method, node information estimation program and information processing device
JP7399998B2 (en) 2022-03-29 2023-12-18 本田技研工業株式会社 Teacher data collection device

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8001121B2 (en) * 2006-02-27 2011-08-16 Microsoft Corporation Training a ranking function using propagated document relevance
JP4963245B2 (en) * 2007-03-16 2012-06-27 日本電信電話株式会社 Syntax / semantic analysis result ranking model creation method and apparatus, program, and recording medium
JP4433323B2 (en) * 2007-10-22 2010-03-17 ソニー株式会社 Information processing apparatus, information processing method, and program
US8655817B2 (en) * 2008-02-20 2014-02-18 Digital Medical Experts Inc. Expert system for determining patient treatment response
WO2010075408A1 (en) * 2008-12-22 2010-07-01 The Trustees Of Columbia University In The City Of New York System and method for annotating and searching media
CN101840516A (en) * 2010-04-27 2010-09-22 上海交通大学 Feature selection method based on sparse fraction

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107347050A * 2016-05-05 2017-11-14 腾讯科技(深圳)有限公司 Malicious identification method and device based on reverse phishing
CN107347050B (en) * 2016-05-05 2019-12-20 腾讯科技(深圳)有限公司 Malicious identification method and device based on reverse phishing

Also Published As

Publication number Publication date
JP2014059754A (en) 2014-04-03
CN103678436B (en) 2017-04-12
JP5881048B2 (en) 2016-03-09

Similar Documents

Publication Publication Date Title
CN103678436A (en) Information processing system and information processing method
Galbrun et al. From black and white to full color: extending redescription mining outside the Boolean world
CN109446341A (en) The construction method and device of knowledge mapping
CN106997341B (en) A kind of innovation scheme matching process, device, server and system
US11775594B2 (en) Method for disambiguating between authors with same name on basis of network representation and semantic representation
CN106445988A (en) Intelligent big data processing method and system
Thesen Computer methods in operations research
Gong et al. Novel heuristic density-based method for community detection in networks
Saini et al. Extractive single document summarization using binary differential evolution: Optimization of different sentence quality measures
CN109033277A (en) Class brain system, method, equipment and storage medium based on machine learning
CN113010688A (en) Knowledge graph construction method, device and equipment and computer readable storage medium
CN107507028A (en) User preference determines method, apparatus, equipment and storage medium
CN104573130A (en) Entity resolution method based on group calculation and entity resolution device based on group calculation
CN108733644A (en) A kind of text emotion analysis method, computer readable storage medium and terminal device
CN107644051A (en) System and method for the packet of similar entity
CN102508971B (en) Method for establishing product function model in concept design stage
CN103412878A (en) Document theme partitioning method based on domain knowledge map community structure
CN116402166B (en) Training method and device of prediction model, electronic equipment and storage medium
CN107862166A (en) A kind of intelligent Simulation experiment design system and design method
CN116244277A (en) NLP (non-linear point) identification and knowledge base construction method and system
CN105373561B (en) The method and apparatus for identifying the logging mode in non-relational database
CN106156259A (en) A kind of user behavior information displaying method and system
CN111339258B (en) University computer basic exercise recommendation method based on knowledge graph
CN113505117A (en) Data quality evaluation method, device, equipment and medium based on data indexes
CN113010642A (en) Semantic relation recognition method and device, electronic equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170412

Termination date: 20210729
