CN102682065B - Semantic entity manipulation using input-output examples
- Publication number
- CN102682065B CN201210023688.6A
- Authority
- CN
- China
- Prior art keywords
- input
- output
- item
- conversion
- entity
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Stored Programmes (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to semantic entity manipulation using input-output examples. Semantic entity manipulation technique embodiments are presented that generate a probabilistic program capable of manipulating character strings representing semantic entities based on input-output examples. The program can then be used to produce a desired output consistent with the examples from inputs of the type included in the input-output examples. The probabilistic program is generated based on the output of parsing, transformation and formatting modules. The parsing module employs a probabilistic approach to parsing the input-output examples. The transformation module identifies a weighted set of transformations that are capable of producing the output item from the input items of an input-output example, with a likelihood specified by their assigned weights. The formatting module generates formatting instructions that transform selected output parts into the form specified by the output items in the input-output examples.
Description
Technical field
The present invention relates to semantic entity manipulation, and in particular to semantic entity manipulation using input-output examples.
Background
Millions of people worldwide use spreadsheets and the like to store and manipulate data. These data manipulation scenarios often involve converting large amounts of input information from one format to another, or require performing computations on the input information to produce a desired output. Typically, these tasks are accomplished manually or with small, often one-off applications created either by the end user or by a programmer for the end user.
Summary of the invention
The semantic entity manipulation technique embodiments described herein generate a probabilistic program capable of manipulating character strings representing semantic entities based on input-output examples. The program can then be used to produce a desired output consistent with the examples from inputs of the type included in the input-output examples. This allows input information to be converted from one format to another, and computations to be performed on the input information to produce a desired output, based on input-output examples provided by an end user.
Generally, in one implementation, the foregoing is accomplished by first receiving the aforementioned input-output examples. Each input-output example provides one or more input items and a corresponding desired output item. The received input and output items are parsed to produce a weighted set of parses. Each parse in the weighted set represents a different potential parse of an input or output item, and is weighted according to a measure of the likelihood that the parse is a valid one, based on a comparison against a prescribed library of parses. Next, for each input-output example, one or more transformations that are capable of producing the desired output item from the input items of the example are identified from a library of transformations. In addition, formatting instructions are identified that format the output item so that it matches the formatting of the desired output item of the input-output example. Once all the input-output examples have been processed, a probabilistic program is generated. Given one or more input items of the same type as the input items of the input-output examples, this probabilistic program employs the identified transformations and formatting instructions to produce an output item corresponding to those input items. One or more input items of the same type as the input items of the input-output examples are then received, and the generated probabilistic program is used to produce an output item.
It should be noted that this Summary is provided to introduce, in simplified form, a selection of concepts that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used as an aid in determining the scope of the claimed subject matter.
Brief description of the drawings
The specific features, aspects, and advantages of the invention will be better understood with reference to the following description, the appended claims, and the accompanying drawings, in which:
Fig. 1 shows a program generation system for generating a program that performs a data manipulation task based on input-output examples, together with a program execution module that applies the program to new input items.
Fig. 2 shows a data manipulation system that includes the program generation system and program execution module of Fig. 1.
Fig. 3 is a flowchart showing an overview of one manner of operation of the program generation system of Fig. 2.
Fig. 4 is a flowchart showing how the program generation system (of Fig. 2) generates a program using a three-part operation.
Figs. 5 and 6 together show an example that illustrates one manner of operation of the transformation module used in the program generation system of Fig. 2.
Fig. 7 is a flowchart that complements the example of Figs. 5 and 6 by showing an overview of one manner of operation of the transformation module.
Fig. 8 shows a rounding transformation procedure for rounding a semantic entity instance.
Fig. 9 shows a currency data table in which an end user wants to convert the currency in column 1 into the currency type shown in column 2 using the exchange rate on the date shown in column 3, obtaining the result shown in the output column. The first row shows the input-output example provided by the end user.
Fig. 10 shows a distance data table in which an end user wants to make an informed choice among 8 apartments (whose addresses are presented succinctly in the first column of the table) depending on the commute attributes of some specific locations (such as the rush-hour driving time to the user's office, the walking distance to the nearest gym, and the driving distance to the nearest university).
Fig. 11 shows a table of format strings for formatting a double-precision value into various output formats.
Fig. 12 shows a table of format strings for formatting the date-time 24/9/1986 18:23:05 into various output formats.
Fig. 13 shows a spreadsheet with dates in three different formats (namely, the U.S. format month/day/year, the European format day.month.year, and the Chinese format year-month-day). Note that some of these dates are missing the year, which is assumed to default to the current year.
Fig. 14 is a flowchart generally outlining an exemplary process for implementing the semantic entity manipulation technique embodiments described herein.
Fig. 15 shows one implementation of a semantic entity manipulation system for carrying out the process of Fig. 14.
Fig. 16 is a diagram depicting a general-purpose computing device constituting an exemplary system for implementing the semantic entity manipulation technique embodiments described herein.
Detailed description of the invention
In the following description of semantic entity manipulation technique embodiments, reference is made to the accompanying drawings, which form a part hereof and which show, by way of illustration, specific embodiments in which the technique may be practiced. It is understood that other embodiments may be utilized and structural changes may be made without departing from the scope of the technique.
In addition, some of the figures describe concepts in the context of one or more structural components (variously referred to as functionality, modules, features, elements, and so on). The components shown in the figures can be implemented in any manner. In one case, the illustrated separation of various components in the figures into distinct units may reflect the use of corresponding distinct components in an actual implementation. Alternatively, or in addition, any single component illustrated in the figures may be implemented by plural actual components. Alternatively, or in addition, the depiction of any two or more separate components in the figures may reflect different functions performed by a single actual component.
Other figures describe concepts in flowchart form. In this form, certain operations are described as constituting distinct blocks performed in a certain order. Such implementations are illustrative and non-limiting. Certain blocks described herein can be grouped together and performed in a single operation, certain blocks can be broken apart into plural component blocks, and certain blocks can be performed in an order that differs from the one illustrated herein (including performing the blocks in parallel). The blocks shown in the flowcharts can be implemented in any manner.
As to terminology, the phrase "configured to" encompasses any way that any kind of functionality can be constructed to perform an identified operation. The terms "logic" or "logic component" encompass any functionality for performing a task. For instance, each operation illustrated in the flowcharts corresponds to a logic component for performing that operation. When implemented by a computing system (e.g., "computing functionality"), a logic component represents a physical component that is a physical part of the computing system, however implemented.
The following explanation may identify one or more features as "optional." This type of statement is not to be interpreted as an exhaustive indication of the features that may be considered optional; that is, other features can be considered optional, although not expressly identified in the text. Similarly, the explanation may indicate that one or more features can be implemented in the plural (that is, by providing more than one of the features). This statement is not to be interpreted as an exhaustive indication of the features that can be duplicated. Finally, the terms "exemplary" or "illustrative" refer to one implementation among potentially many implementations.
1.0 Overview and relationship to the parent patent application
The parent application of this continuation-in-part application describes a program generation system that generates a program based on plural input-output examples. Each input-output example includes an input item and a corresponding output item. In one implementation, the program generation system includes three component modules. A parsing module processes the input items and output items to provide a plurality of input parts and output parts, respectively. A transformation module determines, for each output part, whether the output part can be produced from a corresponding input part using one or more converter modules selected from a collection of candidate converter modules. A formatting module generates formatting instructions that transform selected output parts into the form specified by the original output items. These three modules yield a generated program that embodies the logic learned from the input-output examples; the generated program can subsequently be used to transform new input items into new respective output items.
To better understand the semantic entity manipulation technique embodiments described herein, a review of the system described in the parent application is presented first, followed by the modifications to that system that represent the semantic entity manipulation technique embodiments.
1.1 Parent patent application system
Fig. 1 shows an illustrative program generation system 102 that creates a program based on input-output examples. Each input-output example includes an input item and a corresponding output item. An input item may include one or more component parts, referred to herein as input parts. An output item may also include one or more component parts, referred to as output parts.
Each output item represents some type of transformation performed on the corresponding input item. For example, an output item may include one or more output parts that are direct copies of one or more corresponding input parts taken from the input item. In addition, or alternatively, an output item may include one or more output parts that represent transformations (not direct copies) of one or more corresponding input parts. In addition, or alternatively, an output item may include formatting applied to its content that differs from the formatting applied to the corresponding input item. In addition, or alternatively, an output item may include one or more output parts that have no counterpart in the corresponding input item. In addition, or alternatively, an output item need not include a counterpart for every input part in the corresponding input item.
For example, Fig. 1 presents an illustrative set of input-output examples 104 in a data file 106. The set of input-output examples 104 includes a plurality of input items 108 and a corresponding plurality of output items 110. In this example, the input items 108 comprise a single column of alphanumeric input information; likewise, the output items 110 comprise a single column of alphanumeric output information. In other cases, however, the data file 106 can include a single column of input information that maps to two or more columns of output information. In another case, the data file 106 can include two or more columns of input information that map to a single column of output information. In another case, the data file 106 can include two or more columns of input information that map to two or more columns of output information, and so on. Further, the data file 106 can organize the sets of input and output information in any manner (that is, as an alternative or in addition to a columnar organization). More generally, the example shown in Fig. 1 can be varied in many different ways.
In the particular scenario of Fig. 1, the input items represent invoices in an original format. The output items represent a transformed version of the invoices in an output format. For example, the first input item includes a date in a particular format ("2-2-15"), representing February 2, 2015. The output item transforms this date into another format, namely by printing an abbreviation of the month name (i.e., "Feb."). In addition, the output item transforms the first letter of the month abbreviation from uppercase to lowercase, producing "feb.". The first input item also includes the name of a city, namely "Denver". The output item transforms this city information into corresponding abbreviated state information, namely "CO" (Colorado). The first input item also includes a cost value in dollars, namely "152.02". The output item repeats this cost value but rounds it to the nearest dollar amount, producing "152". The first input item also includes the string "Paid". The output item repeats this string verbatim.
Further, note that the output item (of the first input-output example) includes additional information that is not present in the corresponding input item. For example, the output item includes three commas, whereas the input item includes only a single comma. In addition, the output item adds a dollar sign ("$") before the cost figure "152". In addition, the output item arranges information differently from the corresponding input item. For example, the input item places the location information ("Denver") before the cost information ("152.02"), whereas the output item places the cost information ("152") before the location information ("CO"). Finally, the output item presents the last string ("Paid") in boldface, whereas the input item presents it without boldface. As can be appreciated, this particular example is presented for illustrative purposes only. Other input-output examples can differ from this scenario in any manner.
The data file 106 also includes another set of untransformed input items 112 that do not yet have corresponding output items. For a small data set, a user can study the set of input-output examples 104 to discover the logic used to transform input items into corresponding output items. The user can then manually generate new output items for the set of new input items 112 in conformance with this logic. However, this manual operation becomes impractical as the size of the data file 106 grows.
To address this issue, the program generation system 102 automatically generates a program 114 that helps the user transform the set of input items 112 into the desired output form. From a high-level perspective, the program generation system 102 generates the program 114 based on the set of input-output examples 104. A program execution module 116 then applies the program 114 to the set of new input items 112. This yields a set of new output items. For example, the program 114 automatically transforms the new input item "4-19-15 Yakima 292.88, Paid" into "apr 2015, $293, WA, Paid".
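As a rough illustration of what a generated program of this kind might look like for the invoice scenario, the following Python sketch hand-codes the transformations described above: reformatting the date, mapping the city to a state abbreviation, rounding the cost, and reordering the parts. It is not output of the program generation system. The function name `transform_invoice`, the two-entry `CITY_TO_STATE` table, and the regular expression are all assumptions made for illustration. Month abbreviations are printed without a trailing period and the amount is printed as "$293", following the Yakima example, and the boldface on the final string is omitted because a plain string cannot carry it.

```python
import re

# Hypothetical lookup table; a real converter would draw on a far larger source.
CITY_TO_STATE = {"Denver": "CO", "Yakima": "WA"}
MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]

# Assumed input layout: "<month>-<day>-<yy> <city> <cost>, <status>"
INVOICE = re.compile(r"(\d{1,2})-(\d{1,2})-(\d{2}) (\w+) (\d+\.\d+), (\w+)")


def transform_invoice(item: str) -> str:
    """Apply the invoice transformation described in the text to one input item."""
    match = INVOICE.fullmatch(item)
    if match is None:
        raise ValueError(f"unrecognized invoice: {item!r}")
    month, _day, yy, city, cost, status = match.groups()
    date_part = f"{MONTHS[int(month) - 1]} 20{yy}"   # "4", "15" -> "apr 2015"
    cost_part = f"${round(float(cost))}"             # "292.88" -> "$293"
    state_part = CITY_TO_STATE[city]                 # "Yakima" -> "WA"
    # The output places the cost before the location, unlike the input.
    return ", ".join([date_part, cost_part, state_part, status])


print(transform_invoice("4-19-15 Yakima 292.88, Paid"))
```

The point of the system described here is that a user would not write such a function by hand; it would be inferred from the examples.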
Fig. 2 shows one illustrative data manipulation system 200 that can make use of the program generation system 102 and the program execution module 116 of Fig. 1. In general, Fig. 2 demarcates different modules to clearly identify the functions performed by those respective modules. In one case, these modules may represent distinct physical components. In other cases, one or more of the modules may represent components within one or more other modules.
From a high-level perspective, the program generation system 102 operates in conjunction with any type of data manipulation functionality 202. The data manipulation functionality 202 represents any tool for performing processing on data items. In one case, the data manipulation functionality 202 can provide a user interface that allows a user to inspect and modify data items. For example, in one case, the data manipulation functionality 202 may represent a spreadsheet system that allows a user to manipulate data items in tabular form. One spreadsheet system that can be used is the Microsoft Office spreadsheet application provided by Microsoft Corporation of Redmond, Washington. In another case, the data manipulation functionality 202 may represent table manipulation functionality within a document editing application, and so on.
The data manipulation functionality 202 can interact with other functionality 204. For example, the data manipulation functionality 202 can receive data items from the other functionality 204, or send data items to the other functionality 204. The other functionality 204 may represent an application module of any type (such as a document editing application, a spreadsheet application, and so on). Alternatively, or in addition, the other functionality 204 may represent a network-accessible entity of any type. For example, the other functionality 204 may represent a collection of data items maintained in a remote data store that is accessible via the Internet.
In operation, the user can provide a set of input-output examples to the data manipulation functionality 202. For example, in one case, the user can manually create the set of input-output examples. In another case, the user can instruct the data manipulation functionality 202 to read in a data file that contains the input-output examples. The data file can be obtained from any source, such as the other functionality 204, which may represent a local source and/or a remote source (with respect to the data manipulation functionality 202). Upon instruction, the data manipulation functionality 202 can use the program generation system 102 to provide the program 114. The program 114 expresses the logic embodied in the input-output examples. The program execution module 116 can then use the program 114 to automatically process new input items to generate new output items.
Fig. 2 depicts the data manipulation functionality 202 and the program generation system 102 as two distinct respective modules. In another case, the data manipulation functionality 202 may incorporate the program generation system 102 as one of its components, or vice versa. Likewise, Fig. 2 shows the program execution module 116 as a component within the data manipulation functionality 202. In another case, the data manipulation functionality 202 and the program execution module 116 may represent two distinct modules.
The data manipulation functionality 202 can invoke the program generation system 102 in different modes. In one mode, the user can expressly invoke the program generation system 102, for example by activating a command button, menu item, or the like within a user interface presentation provided by the data manipulation functionality 202. The user can then expressly identify a set of input-output examples for use in generating the program 114.
In another mode, the data manipulation functionality 202 can include detection functionality that detects that the user is repeatedly performing the same type of transformation on a collection of input items to provide corresponding output items. The data manipulation functionality 202 can then automatically invoke the program generation system 102 based on the input-output examples that the user has already supplied.
These usage modes are representative rather than exhaustive. The data manipulation functionality 202 can interact with the program generation system 102 in still other modes of operation.
The user can directly or indirectly invoke the program generation system 102 to accomplish different data manipulation goals. In a first scenario, the user can invoke the program generation system 102 when there is some environment-specific need to convert information expressed in a first format into information expressed in a second format. For example, in one case, the user may receive information in the first format from another person (or persons). Based on any environment-specific consideration, the user may wish to transform this information into a second format that is more acceptable to the user. In another case, the user may have created the information in the first format and may now wish to transform it into the second format. In another case, the user may receive information expressed in the first format from a source application, data store, or the like. The user may wish to convert this information into a second format that is better suited to a target application, data store, or the like. For example, the user may wish to convert information from a format used by a document editing application into a format used by a spreadsheet application, or vice versa. In another case, the user may wish to convert information expressed in a markup language (e.g., XML, HTML, etc.) into a non-markup format, and so on. These examples are presented by way of illustration, not limitation.
In a second scenario, the user can directly or indirectly invoke the program generation system 102 for the primary purpose of extracting one or more data items from input items (obtained from any source). In this scenario, the second format represents a subset of the information expressed in the first format.
In a third scenario, the user can directly or indirectly invoke the program generation system 102 for a combination of the reasons associated with the first and second scenarios. For example, in addition to extracting information from the input items, the user may wish to perform any type of transformation on the extracted information. The user may also add information to the output items that has no counterpart in the input items.
The data manipulation scenarios described above are representative rather than exhaustive. The user can invoke the program generation system 102 to accomplish other data manipulation goals.
As to physical implementation, the various modules and systems shown in Fig. 2 can be implemented by one or more computing devices. These computing devices can be located at a single location or distributed over plural locations. For example, local data manipulation functionality 202 can interact with a local program generation system 102 to perform the functions summarized above. In another case, local data manipulation functionality 202 can interact with a remote, network-implemented program generation system 102 to implement the functions described herein. Further, the various modules and systems shown in Fig. 2 can be administered by a single entity or by plural entities.
Any type of computing device can be used to implement the functions described in Fig. 2, such as those mentioned in the exemplary operating environment section of this description.
The program generation system 102 and the data manipulation functionality 202 can also interact with one or more data stores 206. For example, the data stores 206 can store input-output examples and the like.
With the above introduction, the explanation now advances to the illustrative composition of the program generation system 102. The program generation system 102 includes (or can be conceptualized as including) a collection of modules. This section provides an overview of these modules; later sections provide additional details regarding each module. By way of overview, the program generation system 102 can convert the input-output examples into the program 114 in a three-part process: a parsing module 208 performs the first part, a transformation module 210 performs the second part, and a formatting module 212 performs the third part.
More specifically, the parsing module 208 identifies respective parts of the input items. The parsing module 208 can also identify respective parts of the output items. The transformation module 210 determines whether each output part identified by the parsing module 208 can be computed using one or more converter modules. The transformation module 210 performs this task by searching within a data store 220. The data store 220 provides a collection of candidate converter modules. In general, each candidate converter module transforms input information into output information based on at least one predetermined rule. The formatting module 212 provides formatting instructions that transform selected output parts into the form specified by the original output items. For example, the formatting module 212 can arrange the output parts in an order that matches the format specified by the output items. In addition, the formatting module 212 can print constant information to match invariable information presented in the output items.
The program generation system 102 can output the generated program 114, which reflects the processing performed by the parsing module 208, the transformation module 210, and the formatting module 212. The generated program 114 can be used to transform new input items into new output items based on the logic embodied in the set of input-output examples. In one case, the program generation system 102 expresses the generated program 114 as a collection of program modules to be invoked in a particular order. One or more of the program modules may represent instantiations of converter modules identified by the transformation module 210. One or more other program modules may operate by extracting content from a new input item and printing the extracted content in the corresponding new output item. One or more other program modules may carry out formatting operations identified by the formatting module 212 that affect the appearance (but not necessarily the content) of the output parts, and so on.
Fig. 3 shows a procedure 300 that presents a high-level description of the operation of the data manipulation system 200 of Fig. 2. At block 302, the data manipulation system 200 receives a set of input-output examples. Each input-output example includes a data item (comprising one or more input string items) and an output item. At block 304, the data manipulation system 200 creates the program 114 based on the input-output examples. At block 306, the data manipulation system 200 uses the program 114 to transform additional new input items (which have not yet been transformed) into new output items.
Fig. 4 shows a procedure 400 that presents a more detailed description of the manner in which the program generation system 102 generates the program 114. At block 402, the program generation system 102 receives a set of input-output examples. At block 404, the program generation system 102 parses each input item into its component input parts and each output item into its component output parts. At block 406, the program generation system 102 identifies transformations (if any) that can convert input parts into corresponding output parts. These transformations are performed by respective converter modules selected from the collection of candidate converter modules. At block 408, the program generation system 102 generates formatting instructions that transform selected output parts into the appropriate form specified by the output items. At block 410, the program generation system 102 outputs the generated program 114 on the basis of the analysis performed in blocks 404-408.
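The three-part procedure of Fig. 4 can be sketched in a few lines of Python. This is a deliberately tiny toy under stated assumptions, not the system's actual algorithm: parsing is plain whitespace and comma splitting, the candidate pool `CONVERTERS` holds three made-up converters, the search is brute force, and formatting is reduced to joining the output parts with ", ". All names (`Step`, `generate_program`, `run_program`) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical pool of candidate converters (string -> string).
CONVERTERS: Dict[str, Callable[[str], str]] = {
    "copy": lambda s: s,
    "upper": lambda s: s.upper(),
    "round": lambda s: str(round(float(s))),
}


@dataclass
class Step:
    converter: Optional[str]  # None means "print constant text"
    source: int = -1          # index of the input part fed to the converter
    text: str = ""            # constant text used when converter is None


def parse(item: str) -> List[str]:
    """Part one: split an item into parts (a stand-in for real parsing)."""
    return item.replace(",", " ").split()


def find_step(in_parts: List[str], out_part: str) -> Step:
    """Part two: search for a converter that maps some input part to out_part."""
    for index, part in enumerate(in_parts):
        for name, convert in CONVERTERS.items():
            try:
                produced = convert(part)
            except ValueError:  # e.g. "round" applied to a non-numeric part
                continue
            if produced == out_part:
                return Step(converter=name, source=index)
    return Step(converter=None, text=out_part)  # no counterpart: a constant


def run_program(steps: List[Step], new_input: str) -> str:
    """Part three plus execution: convert, then join parts in output order."""
    parts = parse(new_input)
    rendered = [
        step.text if step.converter is None
        else CONVERTERS[step.converter](parts[step.source])
        for step in steps
    ]
    return ", ".join(rendered)


def generate_program(examples: List[Tuple[str, str]]) -> List[Step]:
    """Learn steps from the first example and check them against all examples."""
    first_in, first_out = examples[0]
    steps = [find_step(parse(first_in), part) for part in parse(first_out)]
    for item_in, item_out in examples:
        if run_program(steps, item_in) != item_out:
            raise ValueError(f"program is inconsistent with example {item_in!r}")
    return steps


program = generate_program([("denver 152.02", "DENVER, 152")])
print(run_program(program, "yakima 292.88"))
```

Output parts with no matching input part fall back to constants, which mirrors the "print constant information" role the text assigns to the formatting module.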
2.0 Semantic entity manipulation
The semantic entity manipulation technique embodiments described herein generally modify the manner in which the aforementioned parsing, transformation and formatting modules operate, and the manner in which the program generation system generates a program. More particularly, these embodiments introduce a probabilistic approach to parsing, transformation, formatting and program generation.
In general, a modular framework for semantic entities is introduced. This framework extends the notion of a class to make its functionality easily accessible to end users through a string interface, and it allows designers to easily define new entities and operations over those entities. In addition, the design of a probabilistic programming scheme is introduced for manipulating strings that represent the aforementioned semantic entities. A program in this scheme consists of a weighted set of parse-compute-print expressions. A way of learning programs in this probabilistic programming scheme that are consistent with a set of user-provided input-output examples is also provided.
In general, a semantic entity is like a wrapper around a library class (whose methods are used without side effects) that makes the functionality of the class easily accessible to end users through a string interface. It facilitates conversion from unstructured and possibly noisy strings to a canonical representation of the underlying class, and uses a pretty printer to convert the canonical representation of the underlying class back into a string.
Providing a string-based input and output interface has the benefit of giving end users easy access to class functionality (including functionality or dictionaries available online). Note, however, that the input and output examples provided by an end user may exhibit latent ambiguity (i.e., multiple interpretations), noise (i.e., data errors), and missing information (e.g., implicit defaults). To address this, a generic probabilistic parser framework is employed, which requires the entity designer to specify hard and soft constraints on possible field values, along with a similarity measure between an arbitrary string and the valid field values.
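To make the idea of a similarity measure between an arbitrary string and the valid field values concrete, the following Python sketch scores a noisy string against a small set of valid values. The text does not say which measure is used; the ratio from the standard library's `difflib.SequenceMatcher` is substituted here purely as a stand-in, and the function name `weighted_matches` and the 0.5 cutoff are assumptions.

```python
from difflib import SequenceMatcher
from typing import List, Tuple


def weighted_matches(text: str, valid_values: List[str],
                     floor: float = 0.5) -> List[Tuple[str, float]]:
    """Return valid values resembling `text`, best first, with similarity in [0, 1]."""
    scored = []
    for value in valid_values:
        # Case-insensitive similarity; 1.0 means the strings are identical.
        score = SequenceMatcher(None, text.lower(), value.lower()).ratio()
        if score >= floor:
            scored.append((value, score))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored


# A misspelled city still resolves to a valid field value, with a weight.
print(weighted_matches("Denvr", ["Denver", "Yakima", "Dallas"]))
```

The returned scores could then serve as weights on candidate parses, so that a noisy input contributes a less certain parse rather than failing outright.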
To manipulate string representations of semantic entities, the aforementioned probabilistic programming scheme is employed, in which a program consists of a weighted set of parse-compute-print expressions (also called pcp expressions). Parse expressions sit at the leaves of a pcp expression and allow conversion from a string representation into some canonical representation (i.e., a normalized intermediate representation). Print expressions sit at the top level of the expression and allow conversion from the canonical representation into the desired string representation. Because an input string can have multiple parses even under a given parse expression, the semantic interpretation of each pcp expression on an input string returns a weighted set of output strings. Having a (weighted) set of such pcp expressions serves two purposes. First, it allows conditional control flow to be encoded. Second, because these expressions are learned from imperfect information, it allows a probability distribution over pcp expressions to be expressed. Each program also includes a weighted set of format descriptors. This set captures statistical parsing knowledge that can be built from a corpus of examples. The set is used to refine the possible parses of a given string before the pcp expressions are executed on that string.
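The way a single parse-compute-print expression yields a weighted set of output strings can be shown with a small Python sketch. It assumes an ambiguous date string of the form "a-b-yy" that may be read month-first or day-first; the 0.6 and 0.4 weights are invented for illustration, and `parse_date`, `print_date`, and `evaluate_pcp` are hypothetical names rather than anything defined in the text.

```python
from collections import defaultdict
from datetime import date
from typing import Callable, Dict, List, Tuple

MONTHS = ["jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"]


def parse_date(text: str) -> List[Tuple[date, float]]:
    """Parse expression: return every valid reading of 'a-b-yy' with a weight."""
    a, b, yy = (int(part) for part in text.split("-"))
    parses = []
    # Made-up prior: month-first reading 0.6, day-first reading 0.4.
    for month, day, weight in ((a, b, 0.6), (b, a, 0.4)):
        try:
            parses.append((date(2000 + yy, month, day), weight))
        except ValueError:  # impossible reading, e.g. month 19
            continue
    total = sum(weight for _, weight in parses)
    return [(value, weight / total) for value, weight in parses]


def print_date(value: date) -> str:
    """Print expression: canonical date -> desired string form."""
    return f"{value.day} {MONTHS[value.month - 1]} {value.year}"


def evaluate_pcp(text: str,
                 compute: Callable[[date], date] = lambda d: d) -> Dict[str, float]:
    """Evaluate print(compute(parse(text))) and return weighted output strings."""
    outputs: Dict[str, float] = defaultdict(float)
    for canonical, weight in parse_date(text):
        outputs[print_date(compute(canonical))] += weight  # merge equal outputs
    return dict(outputs)


print(evaluate_pcp("2-3-15"))   # two readings survive, each with its own weight
print(evaluate_pcp("4-19-15"))  # only the month-first reading is a valid date
```

A full program in the scheme described above would hold a weighted set of such expressions, so the weight of each output string would additionally be combined with the weight of the expression that produced it.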
Semantic entity manipulation technology embodiment described herein also enables terminal use use data available on web to calculate.This is important, because research shows that most of user solves the most of computational problem in their daily life by using available data or service on web.Semantic entity manipulation technology embodiment described herein provides various ways to enable the seamless access of data to available on web and service.Such as, in one implementation, do not do to suppose to the interface of computing module, and in fact, these computing modules can be the API Calls by internet, such as so that the exchange rate between calculating to fix the date two kinds of currency, or to calculate the driving distance between two addresses.In addition, in one implementation, the virtual value set of an entity or the given field of an entity can be constrained to by certain data, services provider and belong to set available on web.This is useful when this set is huge and cannot be locally stored or cannot upgrades constantly in time.
Now by the preceding feature of descriptive semantics manipulations of physical technical em-bodiments in more detail in subsequent section.
2.1 Semantic entities
A semantic entity (or simply, an entity) is a wrapper around a library class that makes the class's functionality easily accessible to end users through a string interface. In one implementation, an entity is associated with an underlying class, a set of fields, constraints, and ToCanonical and FromCanonical methods. These semantic entities can be obtained from a database of such entities. In addition, developers can create new entities and either provide them directly or store them in the semantic entity database.
The underlying semantic entity class has a standard interface with a set of side-effect-free methods. Semantic entity fields have typed representations, where each type can be another entity or a base type. Note that the term "field" as used in this description can also refer to some representation of the actual field. Semantic entity constraints include hard and soft constraints. Soft constraints are a weighted set of constraints that specify valid field assignments and the likely ordering among fields or delimiters. A hard constraint is defined as a soft constraint whose weight equals 1. Any set of constraints that can be implemented as a Boolean check on an entity instance (comprising an assignment to some subset of the fields) and a parse format descriptor (a sequence of field names and delimiters) can also be included. In one implementation, constraints are expressed using the following predicates:
1) InSet(f, S): field f takes its value from the finite set S. Range(f, i, j) is the special case of InSet(f, S) in which S = {i, ..., j}.
2) FieldOrder(f, f'): field f' follows field f in the entity's parse format descriptor.
3) DelAfter(f, d): delimiter d follows field f.
4) DelBefore(f, d): delimiter d appears before field f.
For example, suppose that a semantic entity for a class named "DateTime" has the following set of fields: "Month", "Day", "Year", "Hours", "Minutes", "Seconds", and "AM-PM", and that the "Month" field has the following representations, each with a corresponding hard constraint:
Month_num: this representation takes integer month values from the set {1, 2, ..., 12}, i.e., Range(Month_num, 1, 12); and
Month_words: this representation takes string month values from the set {January, February, ..., December}, i.e., InSet(Month_words, {January, ..., December}).
Also suppose that "DateTime" is associated with the following soft constraints. The "Minutes" field is more likely to follow the "Hours" field, i.e., FieldOrder(Hours, Minutes). In addition, the "Seconds", "Minutes", or "Hours" field is more likely to appear before the "AM-PM" field, i.e., FieldOrder(Seconds, AM-PM), FieldOrder(Minutes, AM-PM), and FieldOrder(Hours, AM-PM).
Other examples of soft constraints include the use of online dictionaries to match the first and last names of a Name entity and the addresses of an Address entity.
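The constraint predicates above can be sketched as Boolean checks over a parse format descriptor. The following is a minimal illustration, not the patented implementation: the representation of a descriptor as a list of (field, delimiter) pairs, the reading of FieldOrder as "anywhere after", and the scoring function `soft_score` (weighted fraction of satisfied constraints) are all assumptions made for this example.

```python
from typing import Callable, List, Sequence, Tuple

# Assumption: a parse format descriptor is a sequence of (field, delimiter) pairs.
Descriptor = Sequence[Tuple[str, str]]


def in_set(value, allowed) -> bool:
    """InSet(f, S): the field value belongs to the finite set S."""
    return value in allowed


def range_(value: int, i: int, j: int) -> bool:
    """Range(f, i, j): special case of InSet with S = {i, ..., j}."""
    return in_set(value, range(i, j + 1))


def field_order(desc: Descriptor, f: str, f2: str) -> bool:
    """FieldOrder(f, f'): f' appears somewhere after f (assumed reading)."""
    names = [name for name, _ in desc]
    return f in names and f2 in names and names.index(f) < names.index(f2)


def del_after(desc: Descriptor, f: str, d: str) -> bool:
    """DelAfter(f, d): delimiter d immediately follows field f."""
    return (f, d) in list(desc)


def del_before(desc: Descriptor, f: str, d: str) -> bool:
    """DelBefore(f, d): delimiter d immediately precedes field f."""
    names = [name for name, _ in desc]
    if f not in names or names.index(f) == 0:
        return False
    return desc[names.index(f) - 1][1] == d


def soft_score(desc: Descriptor,
               constraints: List[Tuple[Callable[[Descriptor], bool], float]]) -> float:
    """Hypothetical scoring: weighted fraction of satisfied soft constraints."""
    total = sum(weight for _, weight in constraints)
    if total == 0:
        return 1.0
    return sum(weight for check, weight in constraints if check(desc)) / total


time_desc = [("Hours", ":"), ("Minutes", " "), ("AM-PM", "")]
time_constraints = [
    (lambda d: field_order(d, "Hours", "Minutes"), 0.8),
    (lambda d: field_order(d, "AM-PM", "Hours"), 0.2),
]
```

Here `soft_score(time_desc, time_constraints)` rewards the descriptor for placing "Minutes" after "Hours" and gives no credit for the unsatisfied second constraint.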
The aforesaid ToCanonical and FromCanonical methods are used, respectively, to convert an entity instance to its canonical representation and to convert a canonical representation back to an entity instance. More specifically, the ToCanonical method, e → e_c, converts an entity instance to the canonical representation of the entity. The FromCanonical method, e_c × F → e, performs the inverse operation of converting an entity instance in canonical form to a given set of fields F.
For example, suppose that a semantic entity for a class named "Length" has the following set of fields: "kms", "m", "cm", "mm", "ft", "inches", and so on. Also suppose that the "cm" field serves as the canonical representation of this entity. The ToCanonical method of "Length" normalizes every field value using its conversion factor to "cm" and then adds the results. The FromCanonical method takes a canonical length instance e_c and a set of fields (f_1, ..., f_n), and splits the entity value e_c among the fields so that the value is packed into the next higher field. If there is no higher field to pack into, the highest field stores the extra amount. For example, suppose a school teacher wants to record the heights of the students in a class in ft-inches format, but the height data obtained from the students is very irregular, i.e., expressed in various measurement systems (e.g., m, cm, in, m-cm, etc.). The ToCanonical and FromCanonical methods can convert the irregular inputs to a consistent format, such as ft-inches or any other required format. This is done by first converting the data to canonical form and then converting it to the required format.
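The "Length" example can be sketched as follows. This is an illustration under stated assumptions: the function names `to_canonical` and `from_canonical`, the dictionary encoding of an entity instance, and the choice to keep whole numbers in every field except the smallest are not taken from the source. The conversion factors are the standard definitions of the units.

```python
from typing import Dict, List

# Conversion factors to the canonical field "cm" (standard unit definitions).
CM_PER_UNIT = {"kms": 100000.0, "m": 100.0, "cm": 1.0,
               "mm": 0.1, "ft": 30.48, "inches": 2.54}


def to_canonical(fields: Dict[str, float]) -> float:
    """ToCanonical: normalize every field value to cm and add the results."""
    return sum(value * CM_PER_UNIT[name] for name, value in fields.items())


def from_canonical(cm: float, target_fields: List[str]) -> Dict[str, float]:
    """FromCanonical: split a canonical value among the requested fields.

    Larger fields are filled first with whole units, so any excess ends up
    in the highest field; the smallest field keeps the fractional remainder
    (an assumption of this sketch).
    """
    ordered = sorted(target_fields, key=lambda f: CM_PER_UNIT[f], reverse=True)
    result: Dict[str, float] = {}
    remaining = cm
    for name in ordered[:-1]:
        whole = int(remaining // CM_PER_UNIT[name])
        result[name] = whole
        remaining -= whole * CM_PER_UNIT[name]
    result[ordered[-1]] = remaining / CM_PER_UNIT[ordered[-1]]
    return result


# A height given as "1 m 80 cm", re-expressed in ft-inches.
height_cm = to_canonical({"m": 1, "cm": 80})
height_ft_in = from_canonical(height_cm, ["ft", "inches"])
```

Because the sketch uses binary floating point, values that sit exactly on a unit boundary may need rounding or exact rational arithmetic in a more careful implementation.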
In one implementation, when an entity field is represented by a base type rather than an entity type, the base type is associated with a set of pretty printers pp, each of which takes a string as input and produces another string. Examples of such base types are integer, string, and double. In one implementation, the following pretty printer scheme is adopted to facilitate rich formatting of base types (and thereby rich formatting of entities whose fields have these base types).
1) Identity: the Identity printer prints the string unchanged;
2) UpperCase: the UpperCase printer prints the string in upper case;
3) LowerCase: the LowerCase printer prints the string in lower case;
4) ProperCase: the ProperCase printer prints the string with proper capitalization;
5) Prefix(k): the Prefix(k) printer prints the prefix of length k of the string;
6) IntOrd: the IntOrd printer prints the integer represented by the string in ordinal form, e.g., 1 is printed as 1st;
7) IntegerPrecision(d): the IntegerPrecision(d) printer prints the integer in a format that uses d digits;
8) DoublePrecision(d): the DoublePrecision(d) printer prints the floating-point number with exactly d decimal places;
9) DoubleAtMost(d): the DoubleAtMost(d) printer prints the floating-point number with at most d decimal places; and
10) DoubleAtLeast(d): the DoubleAtLeast(d) printer prints the floating-point number with at least d decimal places.
The pretty printer values in a parse format descriptor are drawn from these printers. The parse format descriptor also needs to learn the additional argument of the Prefix(k), IntegerPrecision(d), and DoublePrecision(d) pretty printers. A developer can supply a set of argument values that are likely to be useful for a given entity. For example, for the "Month" field in "DateTime", a prefix of size k = 3 is often useful.
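The pretty printers listed above are string-to-string functions, so they can be sketched as small Python callables. This is an illustrative sketch only: the function names are invented for the example, and details the text leaves open (such as what "proper capitalization" means, or zero-padding for IntegerPrecision) are assumptions noted in the comments.

```python
from typing import Callable

Printer = Callable[[str], str]


def identity(s: str) -> str:
    return s


def upper_case(s: str) -> str:
    return s.upper()


def lower_case(s: str) -> str:
    return s.lower()


def proper_case(s: str) -> str:
    # Assumption: first letter upper case, the rest lower case.
    return s[:1].upper() + s[1:].lower()


def prefix(k: int) -> Printer:
    return lambda s: s[:k]


def int_ord(s: str) -> str:
    n = int(s)
    if 10 <= n % 100 <= 20:          # 11th, 12th, 13th, ...
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(n % 10, "th")
    return f"{n}{suffix}"


def integer_precision(d: int) -> Printer:
    # Assumption: "d digits" means left-padding with zeros.
    return lambda s: str(int(s)).zfill(d)


def double_precision(d: int) -> Printer:
    return lambda s: f"{float(s):.{d}f}"


def double_at_most(d: int) -> Printer:
    def pp(s: str) -> str:
        text = f"{float(s):.{d}f}"
        return text.rstrip("0").rstrip(".") if "." in text else text
    return pp


def double_at_least(d: int) -> Printer:
    def pp(s: str) -> str:
        whole, _, frac = s.partition(".")
        return f"{whole}.{frac.ljust(d, '0')}" if d > 0 else s
    return pp
```

Printers compose naturally: `lower_case(prefix(3)("February"))` yields the three-letter lower-case month form used later in scenario "A".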
2.2 Probabilistic programming scheme
As mentioned previously, in one implementation, a probabilistic programming scheme is adopted to manipulate the strings that represent the aforesaid semantic entities. This programming scheme is now described. The language of the scheme is then used in the descriptions of parsing, manipulating, and printing semantic entities.
In general, a program generated using the probabilistic programming scheme comprises a weighted set of parse-compute-print (pcp) expressions. A distinguishing feature of the scheme is that, for a given input, the generated program produces a weighted set of outputs rather than a single output.
2.2.1 Syntax
In one implementation, the probabilistic programming scheme adopts the following syntax:
1) Program P := {PD, PCP}, where PD = {(p_1, w_1), ..., (p_k, w_k)} and PCP = {(O_1, w_1), ..., (O_n, w_n)};
2) Parse descriptor p := [fd_1, ..., fd_m];
3) Field-delimiter pair fd := (f_ij, constStr);
4) pcp expression O := printEntity_E(C, q);
5) Computation expression C := parseEntity_E(s_i, p) | T(C_1, ..., C_n); and
6) Print descriptor q := Format((fd_1, pp_1), ..., (fd_n, pp_n)).
A program P in this language is defined as a weighted set of parse descriptors PD and a weighted set of parse-compute-print expressions PCP. The symbol w denotes the weights.
A parse (format) descriptor p (of an entity E) is a sequence of pairs fd, where each pair comprises some field of E (together with its representation identifier) and a delimiter. For notational convenience, a parse descriptor is usually written by simply concatenating the sequence, with every pair flattened into a string (e.g., [(f_1, str_1), (f_2, str_2), (f_3, str_3)] is written as f_1 str_1 f_2 str_2 f_3 str_3).
A parse-compute-print expression O, also called a pcp expression, has the form printEntity_E(C, q), where C is some computation expression and q is a print descriptor of entity E. The computation expression C has a recursive syntax. It is either a parseEntity_E(s_i, p) expression that takes a string s_i and a parse format descriptor p of entity E as input (the base case), or a transformation expression T whose arguments are themselves computation expressions (the recursive case).
A print (format) descriptor q is just like a parse descriptor, except that each field is associated with a pretty printer. The pretty printer of a field whose type is itself an entity is a print descriptor, and the pretty printer of a field whose type is a base type is one of that type's associated pretty printers (as described in Section 2.1).
For example, consider a spreadsheet with two input columns. The first column contains dates, and the second column contains integers representing a number of business days. The end user wants to add the number of business days in the second column to the date in the first column, and to format the resulting date in a particular way in a third column. For example, if a spreadsheet row contains the string "24/09/1986" in the first column and the string "3" in the second column, and the user expects "Monday, 24th September 1986" to be displayed in the third column, then one of the pcp expressions O in the program that performs the required manipulation can be expressed as:
O = printEntity_DateTime(addBusinessDays(C_1, C_2), q_3), where
C_1 ≡ parseEntity_DateTime(s_1, p_1)
C_2 ≡ parseEntity_Duration(s_2, p_2)
p_1 ≡ Day_1/Month_2/Year_1
p_2 ≡ Days_1
q_3 ≡ Format(((DayOfWeek_1, ","), stringP), ((Day_1, " "), intSupP), ((Month_2, " "), stringP), ((Year_1, ε), intP))
Here the subscript on an entity field name indicates the corresponding field representation. The strings s_1 and s_2 denote the strings in the first and second columns, respectively. The empty string is denoted by ε.
2.2.2 Semantics
In one implementation, the probabilistic programming scheme adopts the following semantics:
The semantics of evaluating a program P in an input state σ (an assignment of values to the string variables s_i) is simply to evaluate the pcp expression set PCP in the input state σ. The semantics of evaluating PCP = {(O_1, w_1), ..., (O_n, w_n)} in the input state σ is to first evaluate each pcp expression O_j in state σ to obtain a set of weighted output strings (o_i, w'_i), and then to take the union of all these sets after normalizing the weights by multiplying each w'_i by w_j.
Evaluating a pcp expression O_j in the input state σ first involves computing the set of weighted parse states π from σ using the function getAllParses. This is where the weighted parse descriptor set PD that forms part of program P is used. The getAllParses function for parsing strings is described in Section 2.3. Unlike the state σ, which assigns string values to variables, a weighted parse state π assigns to each variable s a tuple (e, p, w) comprising an entity instance e (in canonical representation), a parse descriptor p, and a weight w. The pcp expression O_j is then evaluated in each weighted parse state π to obtain a weighted string as output. The result of evaluating O_j in the input state σ is the set of all such weighted strings. Note that this is how a probabilistic interpretation is assigned to the execution of a pcp expression on input strings that may have ambiguous parses.
Evaluating a pcp expression O_j = printEntity_E(C, q) in a weighted parse state π involves evaluating the computation expression C in π, which yields a weighted entity instance (e, w). The entity instance e thus obtained is converted from its canonical representation to a representation e' according to the print descriptor q. The entity instance e' is then converted to a string representation, again according to q.
Evaluating the recursive case T(C_1, ..., C_n) of a computation expression involves recursively evaluating its arguments. A canonical entity instance is generated by invoking the transformation T on the canonical entity instances that result from evaluating the arguments. The corresponding weight is computed as the minimum of the weights associated with the results of evaluating the individual arguments. This expresses the fact that a high confidence is associated with the result of a computation if and only if every input required by the computation has a high confidence.
Evaluating the base case parseEntity_E(s_i, p) of a computation expression in a weighted parse state π involves computing a closeness measure w' between the parse descriptors p' and p, where p' is the parse descriptor in the tuple π(s_i). The weight w' is then used to adjust the weight w of the entity instance e that corresponds to the string s_i in state π. This weight adjustment helps to give a higher weight to the top-level pcp expression if and only if the string s_i has a valid parse descriptor similar to p.
Note the following. Let P = {PD, {(O_1, w_1), ..., (O_n, w_n)}} and let σ be an input program state. For each (weighted) parse π_i of state σ (a tuple per variable) and each pcp expression (O_j, w_j), a weighted output string pair (o_ij, w'_ij) is generated. The value of w'_ij is relatively high if and only if the following quantities are high: the weights of the parses, in state π_i, of the strings used in O_j; the similarity measure between those parses and the corresponding parse descriptors that appear at the leaves of the pcp expression O_j; and w_j.
The following three sections highlight three key functional aspects of the aforesaid probabilistic programming scheme: parsing strings into entities, computing on entities, and printing entities as string representations.
2.3 Illustrative parsing module
This section describes one implementation of the aforesaid parsing module, which gives the probabilistic programming scheme rich probabilistic support for parsing a string into a weighted set of parses, including ambiguous strings (strings with multiple parses) and noisy strings (strings without a valid parse). Parsing is performed during the learning phase on both the input and the output strings of the examples, in order to learn the set of potential computation expressions. Parsing is also performed at run time on the inputs, for program execution.
2.3.1 Field matches
Parsing begins by identifying field matches. These are used to build the parse graph, as described later. Given a string s and an entity e, a field match is defined as a longest substring of s that matches a field of e. The set of all field matches M_f(s) of a field f in the string s is defined as:
M_f(s) = {(i, j) | s[i..j] is a longest substring of s that matches field f}.
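A direct, unoptimized reading of the definition of M_f(s) is sketched below. It assumes that a field is given as a Boolean predicate on substrings, that spans are half-open Python slices rather than the inclusive s[i..j] notation of the text, and that "longest" means "not contained in a longer matching substring". The name `field_matches` and the sample predicate are hypothetical.

```python
from typing import Callable, List, Tuple


def field_matches(s: str, matches: Callable[[str], bool]) -> List[Tuple[int, int]]:
    """Return half-open spans (i, j) such that s[i:j] matches the field and
    is not contained in any longer matching substring."""
    n = len(s)
    spans = [(i, j) for i in range(n) for j in range(i + 1, n + 1)
             if matches(s[i:j])]
    return [(i, j) for (i, j) in spans
            if not any(a <= i and j <= b and (a, b) != (i, j) for a, b in spans)]


def is_day(text: str) -> bool:
    """Hypothetical matcher for a numeric Day field: 1..31, at most two digits."""
    return text.isdigit() and len(text) <= 2 and 1 <= int(text) <= 31


day_spans = field_matches("24/09/1986", is_day)
```

For the sample date string, the Day field matches not only "24" and "09" but also fragments of the year, which is exactly the kind of ambiguity the parse graph in the next section has to resolve.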
2.3.2 Parse graph
A parse graph G is used to represent all parses of a given string s as an instance of an entity e. Given a string s and an entity e, the parse graph can be built as follows.
In one implementation, the parse graph G = (V, E) is built from the field match sets M_f(s) in the following manner. There is a node n_i ∈ V(G) corresponding to each index i in the string s (0 < i ≤ Len(s)). This yields |V| = Len(s) nodes in the graph, where Len(s) is the length of s. The edge set of the graph is divided into two kinds, field edges E_f(G) and delimiter edges E_d(G), such that E(G) = E_f(G) ∪ E_d(G).
A directed edge (u, v, l) ∈ E(G) corresponds to an edge from node u to node v labeled with the label l. If (i, j) ∈ M_f(s), then there is a field edge (n_i, n_j, f) ∈ E_f(G). If j ≥ i, then there is also a delimiter edge (n_i, n_j, d) ∈ E_d(G). Note that a given substring can match multiple fields, and therefore multiple field edges can exist between two given nodes. Delimiter edges also give rise to self-loops that correspond to the delimiter value ε.
A parse P is defined as a directed path in G from node n_1 to node n_Len(s) such that the edges on the path alternate between field edges E_f(G) and delimiter edges E_d(G). Formally, a parse P is defined as a directed path (e_1, e_2, ..., e_k) that satisfies the following conditions:
1) e_1.u = n_1, i.e., the start node e_1.u of the path is the beginning of the string;
2) e_k.v = n_Len(s), i.e., the end node e_k.v of the path is the end of the string;
3) e_i ∈ E_f(G) if and only if e_(i-1) ∈ E_d(G), i.e., if e_i is a field edge, then the preceding edge e_(i-1) on the path is a delimiter edge; and
4) e_i ∈ E_d(G) if and only if e_(i-1) ∈ E_f(G), i.e., if e_i is a delimiter edge, then the preceding edge e_(i-1) on the path is a field edge.
The parse set SP comprises all such parses P. The function getParseGraph takes a string s and an entity e as arguments and returns the corresponding parse graph G.
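The construction can be illustrated with a small enumeration of alternating field/delimiter paths. The sketch makes several simplifying assumptions that are not in the source: nodes are the boundary positions 0..Len(s), delimiters are restricted to a small supplied set (with "" standing for ε), each field is used at most once per parse, and a parse is returned directly as a list of (field, delimiter) pairs instead of an explicit graph object. The names `get_parses`, `longest_matches`, and `FIELDS` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Matcher = Callable[[str], bool]
Pair = Tuple[str, str]  # (field name, delimiter that follows it)


def longest_matches(s: str, ok: Matcher) -> List[Tuple[int, int]]:
    """Half-open spans of s that match and lie inside no longer match."""
    spans = [(i, j) for i in range(len(s))
             for j in range(i + 1, len(s) + 1) if ok(s[i:j])]
    return [(i, j) for (i, j) in spans
            if not any(a <= i and j <= b and (a, b) != (i, j) for a, b in spans)]


def get_parses(s: str, fields: Dict[str, Matcher],
               delimiters: Tuple[str, ...] = ("/", "-", " ", "")) -> List[List[Pair]]:
    """Enumerate paths that alternate field edges and delimiter edges."""
    field_edges = [(i, j, f) for f, ok in fields.items()
                   for (i, j) in longest_matches(s, ok)]
    parses: List[List[Pair]] = []

    def extend(pos: int, used: frozenset, path: List[Pair]) -> None:
        if pos == len(s):                 # reached the end node of the string
            if path:
                parses.append(list(path))
            return
        for i, j, f in field_edges:       # take a field edge starting at pos
            if i != pos or f in used:
                continue
            for d in delimiters:          # then a delimiter edge ("" is epsilon)
                if s.startswith(d, j):
                    path.append((f, d))
                    extend(j + len(d), used | {f}, path)
                    path.pop()

    extend(0, frozenset(), [])
    return parses


def _num(lo: int, hi: int, width: int) -> Matcher:
    return lambda t: t.isdigit() and len(t) <= width and lo <= int(t) <= hi


FIELDS = {"Day": _num(1, 31, 2), "Month": _num(1, 12, 2),
          "Year": lambda t: t.isdigit() and len(t) == 4}
```

With these toy fields, "24/09/1986" has a single parse because 24 cannot be a month, whereas "09/10/1986" is ambiguous and yields both a Day/Month/Year and a Month/Day/Year parse.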
2.3.3 Parse matching
The function MatchParse takes two parse descriptors p_1 and p_2 as arguments (where one of the parse descriptors comes from the parse descriptor library, and the other is associated with a semantic entity that is in turn associated with an input or output item), and computes a weight w that represents the similarity between the two format descriptors. A perfect match of fields and delimiters is given the highest weight. In one implementation, if the two format descriptors do not match perfectly, the Hamming distance between the fields and delimiters of the two format descriptors is computed. The weight representing the closeness measure is then computed in inverse relation to the Hamming distance.
2.3.4 getAllParses method
The function getAllParses takes a string s and a parse descriptor set PD as arguments, and returns a set of weighted entity instances together with the corresponding parse descriptors that represent the different parses of the string s. The function first uses the getParseGraph function to build the parse graph G from s and e. Each parse path p in the graph is assigned a weight w that is proportional to the fraction of the soft constraints it satisfies and to its similarity to the parse descriptors in PD. The function then returns a set of weighted entity and parse descriptor tuples (e, p, w).
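A closeness measure of this kind can be sketched as follows. The source only states that the weight varies inversely with the Hamming distance; the specific formula 1 / (1 + distance), the treatment of descriptors of unequal length (missing positions count as mismatches), and the name `match_parse` are assumptions of this example.

```python
from itertools import zip_longest
from typing import List, Tuple

Descriptor = List[Tuple[str, str]]  # sequence of (field, delimiter) pairs


def match_parse(p1: Descriptor, p2: Descriptor) -> float:
    """Return a similarity weight in (0, 1]; 1.0 means a perfect match."""
    distance = 0
    for a, b in zip_longest(p1, p2, fillvalue=(None, None)):
        distance += int(a[0] != b[0])   # field mismatch at this position
        distance += int(a[1] != b[1])   # delimiter mismatch at this position
    return 1.0 / (1.0 + distance)


dmy_slash = [("Day", "/"), ("Month", "/"), ("Year", "")]
dmy_dash = [("Day", "-"), ("Month", "/"), ("Year", "")]
mdy_slash = [("Month", "/"), ("Day", "/"), ("Year", "")]
```

Under this formula a single differing delimiter halves the weight, while swapping two fields costs two mismatches.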
2.4 Illustrative transformation module
This section describes one implementation of the transformation module 210 of Fig. 2. As mentioned previously, the transformation module 210 identifies one or more converter modules that can be used to transform an input part into a corresponding output part. As described in the parent patent application, and with reference to Figs. 5 and 6, it will be shown how different output parts correspond to respective different input parts. For each such case, the transformation module 210 investigates a set of candidate converter modules (in data store 220) to determine whether there are one or more converter modules that can transform the identified input part into the corresponding output part. In one case, the transformation module 210 identifies a single converter module that can perform the required transformation. In other cases, the transformation module 210 identifies two or more converter modules that perform the required transformation when applied in a specified order. In still other cases, the transformation module 210 may be unable to find any converter module that performs the required transformation.
Starting with the scenario labeled "A", the transformation module 210 identifies two converter modules (502, 504) that can generate "feb" in the output item based on the numeric month "2" in the input item. That is, converter module 502 receives a numeric month as input and generates a three-letter month abbreviation (e.g., Feb) as output. Converter module 504 receives text with an initial capital letter as input and generates lower-case text as output.
In scenario B, the transformation module 210 identifies a single converter module 506 that can generate "2015" in the output item based on the number "15" in the input item. That is, converter module 506 receives a two-digit year as input and generates a four-digit year as output.
In scenario C, the transformation module 210 identifies a single converter module 508 that can generate "CO" in the output item based on "Denver" in the input item. That is, converter module 508 receives a city name as input and generates the corresponding state name as output, where the state name corresponds to the state in which the city is located. Converter module 508 can use a predetermined lookup table to perform this conversion.
In scenario D, the transformation module 210 identifies a single converter module 602 that can generate the number "152" in the output item based on the number "152.02" in the input item. That is, converter module 602 receives a floating-point dollar amount as input and generates the rounded dollar amount as output. Although not used in this example, in other scenarios the transformation module 210 can rely on converter module 604 to convert monetary information from one currency base to another, e.g., from pounds sterling to dollars.
In scenario E, the transformation module 210 determines, based on the parsing information provided by the parsing module 208, that the word "Paid" in the output item exactly matches the same word in the input item. In this case, the transformation module 210 can forgo any attempt to find a converter module that generates the output "Paid". Instead, the program generation system 102 can generate a program module 606 that simply extracts the last word in the input item and repeats it as the last word in the output item.
In one implementation, the transformation module 210 applies different candidate converter modules in succession to identify the converter modules that produce the required result. After each computation that uses a particular converter module, the transformation module 210 can determine whether the output information generated by that converter module consistently matches the output part in the output item. For example, for scenario C, the transformation module 210 can feed the set of city names to different candidate converter modules in succession. The transformation module 210 can conclude that the city-to-state converter module 508 generates strings that consistently match the state names identified in the output items. The transformation module 210 can therefore conclude that converter module 508 is an appropriate choice for generating the state name in the output item. As noted above, in some cases a transformation requires two or more converter modules to convert an input part into the corresponding output part. The transformation module 210 can therefore also investigate different combinations of converter modules in a methodical manner.
Fig. 7 shows a process 700 that summarizes the above concepts in flowchart form. At block 702, the transformation module 210 determines whether the output part under consideration can be derived by directly copying a corresponding input part from the corresponding input item. If so, the transformation module 210 can skip the investigation performed in block 704. At block 704, the transformation module 210 determines whether the output part under consideration can be derived from a corresponding input part using one or more converter modules.
The loop in Fig. 7 indicates that the transformation module 210 can repeat the above analysis until it has identified the full set of converter modules that can be used to generate the output parts of the output item. In some cases, the transformation module 210 may conclude that one or more output parts cannot be derived using converter modules. Indeed, in some cases, the transformation module 210 may conclude that no output part can be derived using converter modules.
In the above examples, it was assumed that all input items (associated with the input-output examples) adopt a consistent format. In that case, a converter module (or combination of converter modules) is a viable choice if it successfully transforms all instances of the input-output examples. In other implementations, the program generation system 102 can process input items that adopt two or more formats. For example, suppose that the input items of Fig. 1 include two ways of describing invoice information. In this case, the transformation module 210 can identify a first converter module (or combination of converter modules) that works for a first subset of the input-output examples and a second converter module (or combination of converter modules) that works for a second subset of the input-output examples. The generated program 114 can use conditional cues that appear in the input items to determine whether to invoke the first or the second converter module.
In one implementation, the program generation system 102 uses an extensible framework to implement the set of candidate converter modules. Developers can add new converter modules to the set, provided these new converter modules conform to the format set forth by the extensible framework (which, in the context of the semantic entity manipulation technique embodiments described herein, can be the aforesaid probabilistic programming scheme).
In the context of the semantic entity manipulation technique embodiments described herein, the transformation module 210 identifies converter modules that transform the semantic entity associated with the input of an example input-output item into the corresponding semantic entity associated with the output. However, the search for suitable converter modules does not stop when one is found that generates the required result. Instead, the analysis of the results produced by each converter module associated with the type of semantic entity under consideration continues until all such converter modules have been considered. When each of several converter modules (or each of several different combinations of converter modules) produces the required transformation of the semantic entity, all of them are taken into account when producing output, as described later.
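The search outlined above (copy check first, then single converters, then ordered combinations, collecting every consistent candidate) can be sketched as a brute-force enumeration. The candidate converters, their names, and the depth limit below are hypothetical stand-ins chosen to mirror scenarios A and B; they are not the modules 502 to 508 themselves.

```python
from itertools import permutations
from typing import Callable, Dict, List, Tuple

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]

# Hypothetical candidate converter modules (string -> string).
CANDIDATES: Dict[str, Callable[[str], str]] = {
    "month_abbrev": lambda s: MONTHS[int(s) - 1],  # "2" -> "Feb"
    "lower": lambda s: s.lower(),                  # "Feb" -> "feb"
    "year4": lambda s: "20" + s,                   # "15" -> "2015"
}


def find_converters(examples: List[Tuple[str, str]],
                    candidates: Dict[str, Callable[[str], str]],
                    max_depth: int = 2) -> List[Tuple[str, ...]]:
    """Return every ordered converter chain consistent with all examples."""
    # Block 702: can the output part simply be copied from the input part?
    if all(inp == out for inp, out in examples):
        return [("copy",)]
    found: List[Tuple[str, ...]] = []
    # Block 704: try single converters, then ordered combinations.
    for depth in range(1, max_depth + 1):
        for names in permutations(candidates, depth):
            def run(value: str, chain: Tuple[str, ...] = names) -> str:
                for name in chain:
                    value = candidates[name](value)
                return value
            try:
                if all(run(inp) == out for inp, out in examples):
                    found.append(names)
            except (ValueError, IndexError, KeyError):
                continue  # the chain does not apply to these inputs
    return found
```

The search deliberately keeps going after the first success, so that every chain consistent with the examples is available when outputs are later ranked.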
In addition, in the context of the semantic entity manipulation technique embodiments described herein, the probabilistic programming scheme described earlier, which the transformation module adopts, gives designers rich support for implementing arbitrary computation modules, without requiring end users to worry about discoverability or syntax. Recall that a computation expression C is either a parseEntity_E(s_i, p) expression or a transformation expression T(C_1, ..., C_n). Here, T is any transformation that takes the canonical representations of semantic entities as input and outputs some semantic entity in canonical representation. The side-effect-free methods associated with the underlying class or data type wrapped by an entity are good choices for useful transformations.
The following sections describe some useful transformations that support a large class of tasks that end users are in fact likely to want. As will be appreciated, these examples are presented by way of illustration and not limitation. The set of converter modules can also include other types of modules.
2.4.1 Arithmetic transformations
In many contexts, performing arithmetic operations on input entities to produce the required output is a useful transformation. For example, in one implementation, a "DateTime" entity can be built that identifies date-time strings in the input and also identifies duration strings. Given this data, several arithmetic transformations can be performed to produce a date-time output. For example, "addDuration" and "subtractDuration" converter modules can be created to add an identified duration to, or subtract it from, an identified date-time, respectively, to produce another date-time. The "DateTime" entity can also support a "getBusinessDays" converter module that computes the number of business days between two date-time strings, or a "getDuration" converter module that computes the duration between two date-times.
As another example, in one implementation, a "Unit" entity can be built that identifies strings in the input that represent length measurements. An "addLength" or "subtractLength" converter module can then be applied to add or subtract the identified length measurements, respectively, to compute the output length measurement.
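Date arithmetic of this kind maps directly onto Python's standard `datetime` module. The sketch below assumes that a business day is any Monday to Friday and ignores public holidays; the function names merely echo the converter module names in the text and are not an existing API.

```python
from datetime import date, timedelta


def add_business_days(start: date, days: int) -> date:
    """Move forward (or backward, for negative days) by whole business days.

    Assumption: business days are Monday to Friday; holidays are ignored.
    """
    step = 1 if days >= 0 else -1
    current = start
    remaining = abs(days)
    while remaining:
        current += timedelta(days=step)
        if current.weekday() < 5:      # 0..4 are Monday..Friday
            remaining -= 1
    return current


def get_business_days(start: date, end: date) -> int:
    """Count the business days after start, up to and including end."""
    count = 0
    current = start
    while current < end:
        current += timedelta(days=1)
        if current.weekday() < 5:
            count += 1
    return count


def add_duration(start: date, duration: timedelta) -> date:
    return start + duration


def subtract_duration(start: date, duration: timedelta) -> date:
    return start - duration
```

Both helper pairs operate on canonical values (`date`, `timedelta`), which matches the requirement that a transformation T consumes and produces canonical representations.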
2.4.2 to round off conversion
Performing calculating of rounding off to produce required output to input entity also can be very useful conversion.Such as, conversion module can be used to round off to the input entity of the data such as working time of employee to representing from bank balance, stock price.This round off may need by amount of currency upwards (upper) be rounded to 1/4th dollars, by the time downwards (lower) be rounded to time slot half an hour, numeral be rounded to integer etc. closest to (nearest).Whole entities that are that be associated with digital metric and that have defined whole order between its field can support to reach the conversion of rounding off of required precision.Some common solid falling into this kind has: Date-Time, duration, unit, numeral and currency.
In one implementation, this conversion of rounding off is realized as follows.Using entity z as zero reference, the value of entity e is rounded to the multiple of the k value of field f by roundOff (e, z, k, f, Mode) conversion (use can be downwards, upwards or closest to the pattern " Mode " of input).Define this round off based on properties:
1) The numeric measure $NM(e)$ associated with an entity instance e. A base field $f_{base} \in Fields(e)$ is defined for each entity. For every other field $f \in Fields(e)$, a conversion factor $c_f$ is defined that converts a value of that field to a base-field value. The numeric measure is then defined as $NM(e) = \sum_{f \in Fields(e)} (c_f \cdot v_f)$, where $v_f$ denotes the value of field f.
2) The zero element z.
3) The precision factor (PF), defined by $PF = k \cdot c_f$.
A "roundOff" transformation module such as the one shown in Figure 8 can be employed to perform the (e, z, k, f, Mode) rounding transformation on an entity instance e. It first computes the numeric measure $NM(e)$ and the precision factor PF. It then computes the quotient q and the remainder r obtained by dividing $NM(e) - z$ by PF. From the quotient and remainder values, it computes the lower bound $L_e$ and the upper bound $U_e$ on the entity value that lie on the required precision boundaries. The method getEntity: Integer → e returns the entity corresponding to its numeric-measure argument. Depending on the required rounding mode "Mode", the appropriate entity e′ is returned.
2.4.3 Network transformations
Network transformations are those entity transformations that are supported by semantic dictionaries or services available over a computer network (such as the Internet or a private intranet). The semantic entity manipulation technique embodiments described herein give end users seamless access to such services without the need to learn and write web scripts. Some exemplary network transformations associated with two exemplary entities (namely, the currency entity and the address entity) are described below.
2.4.3.1 Currency entity
In one implementation, a transformation module can be designed and employed that, given two currencies and a date, obtains the prevailing currency exchange rate for that date. For example, in the table 900 shown in Fig. 9, an end user wants to convert the currencies 902 in column 1 to the currency types shown in column 2 904, using the currency exchange rate on the dates 906 shown in column 3, so as to obtain the results shown in the output column 908. To do so, the end user manually employs a currency conversion widget that takes a source currency type, a target currency type, and a transaction date and provides the currency exchange rate. Specifically, the end user provides the strings "USD", "EUR", and "24/05/2010" to the currency conversion widget and obtains a result such as "0.800192046091". The end user uses this result to fill in the entry in the first row of the output column. One implementation of the semantic entity manipulation technique embodiments described herein uses a currency converter module to automatically fill in the remaining cells of the output column with the values shown in bold.
2.4.3.2 Address entity
Some exemplary transformations related to the address entity include the following. For an address object, obtain its current local time, weather, latitude, longitude, and nearest facilities (for example, airport, bus stop, coffee shop, and so on). For a pair of address objects, obtain the driving time and distance, the driving time during peak traffic hours, and the walking time and distance between them. For a pair of cities (not addresses as such, but locations) and a given date, obtain information on direct flights between them and the cheapest quoted fare for such a flight. Information of this kind can readily be obtained via the Internet.
Thus, in one implementation, a transformation module can be designed that, given an address (or some type of location), obtains the required computed information as output. For example, in the table 1000 shown in Figure 10, an end user wants to make an informed choice among 8 apartments (whose addresses are listed in the first column 1002 of table 1000) based on commute attributes of certain specific locations, such as the peak-hour driving time to the user's office, the walking distance to the nearest gym, and the driving distance to the nearest university (presented in the output column headings 1004, 1006, 1008 of table 1000). The end user obtains this information for the first row. One implementation of the semantic entity manipulation technique embodiments described herein uses a suitable transformation module to automatically fill in the remaining cells of the output columns with the required information (not shown).
2.4.3.3 Other entities
Other interesting examples of transformations that obtain information related to other semantic entities include those that obtain personal data stored at various websites (for example, financial portfolio or credit card spending) or organizational data stored in an active directory (for example, calendar details or management hierarchy).
2.5 Illustrative formatting module
This section describes one implementation of the formatting module 212 of Fig. 2. As described previously, the formatting module 212 generates formatting instructions that cause the output to be presented to the end user in the form specified by the original output items, which are part of the examples that the user provided.
The foregoing probabilistic programming scheme provides rich support for printing entities as strings. This is enabled by the top-level construct $printEntity_e$ of pcp expressions, which takes as input a print descriptor $q = \{(f_1, d_1, pp_1), \ldots, (f_n, d_n, pp_n)\}$. The print descriptor is a sequence of tuples, each comprising a field representation, a delimiter, and a pretty printer. The semantics of the printEntity constructor involve using the function printEntityFD to print the corresponding entity instance according to q, that is, as a sequence of field representations and delimiter strings. Each field representation is printed using the pretty printer associated with it in its tuple. The printing of base-data-type fields is handled by the pretty printer scheme described in Section 2.1. The printing of entity-type fields is accomplished by recursively calling printEntityFD on the entity instance and the print descriptor corresponding to that field.
For example, Figure 11 shows the format strings required by the Microsoft spreadsheet application or the C# programming language to print double-precision values in different formats. Figure 12 shows the format strings for date-time objects in the Microsoft spreadsheet application or the C# programming language. The print descriptor expression q is expressive enough to represent each of these formatting intents. In addition, the formatting module can infer each of these formatting intents from input-output examples. The user therefore does not need to remember format descriptors.
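As a rough sketch of how a descriptor of (f, d, pp) tuples can drive printing, the Python below walks such a descriptor and recurses for entity-type fields. The dictionary-based entity, the `prefix` and `int_ord` printers, and the use of a nested list to mark an entity-type field are all assumptions made for illustration; they are not the scheme of Section 2.1.

```python
def print_entity_fd(entity, descriptor):
    """Print an entity as field representations followed by their delimiters."""
    parts = []
    for field, delimiter, printer in descriptor:
        value = entity[field]
        if isinstance(printer, list):
            # Entity-type field: the "printer" is a nested descriptor.
            parts.append(print_entity_fd(value, printer))
        else:
            # Base-type field: apply the printer function directly.
            parts.append(printer(value))
        parts.append(delimiter)
    return "".join(parts)

def prefix(length):
    """Printer that keeps the first `length` characters of a string."""
    return lambda text: text[:length]

def int_ord(number):
    """Printer that appends an English ordinal suffix to an integer."""
    if 10 <= number % 100 <= 20:
        suffix = "th"
    else:
        suffix = {1: "st", 2: "nd", 3: "rd"}.get(number % 10, "th")
    return f"{number}{suffix}"

date_entity = {"month": "June", "day": 3, "year": 2008}
descriptor = [("month", " ", prefix(3)), ("day", ", ", int_ord), ("year", "", str)]
printed = print_entity_fd(date_entity, descriptor)
```

With this descriptor the date entity prints as "Jun 3rd, 2008"; a different descriptor over the same entity would yield a different surface form without changing the entity itself.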
2.6 Learning probabilistic programs
Given input-output examples, a semantic entity manipulation program P can be learned that computes the required output when run on other, non-example inputs. Without loss of generality, the input is assumed to be a tuple of strings and the output a single string. The probabilistic program P comprises two components: a weighted set PD of parse descriptors and a weighted set of pcp expressions.
First, how the weighted set PD of parse descriptors is learned from the given input-output examples will be described. For each string s in all of the input-output examples, the method getAllParses is used to compute the set $\{(e_1, p_1, w_1), \ldots, (e_n, p_n, w_n)\}$ of all parses of s. Let PD′ be the multiset union of the sets $\{(p_1, w_1), \ldots, (p_n, w_n)\}$ corresponding to each string s. If $p_1 = p_2$, then $(p_1, w_1 + w_2)$ can iteratively replace any two elements $(p_1, w_1)$ and $(p_2, w_2)$ in PD′. PD is then defined as the resulting set PD′.
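The merging of (p, w) pairs can be sketched in Python as below. The descriptor strings and the unit weights are illustrative assumptions; in the described technique the weights come from getAllParses. The date strings are taken from the spreadsheet example discussed in this section.

```python
from collections import defaultdict

def merge_weighted(pairs):
    """Collapse (descriptor, weight) pairs so equal descriptors share one summed weight."""
    totals = defaultdict(float)
    for descriptor, weight in pairs:
        totals[descriptor] += weight
    return dict(totals)

# Hypothetical parses per string; every parse is given weight 1.0 for illustration.
parses_by_string = {
    "2.5.2008": [("Day.Month.Year", 1.0), ("Month.Day.Year", 1.0)],
    "25.3.2007": [("Day.Month.Year", 1.0)],
    "26.3.2007": [("Day.Month.Year", 1.0)],
    "27.3.2007": [("Day.Month.Year", 1.0)],
}

# PD' is the multiset union over all strings; PD is its merged form.
pd_prime = [pair for parses in parses_by_string.values() for pair in parses]
pd = merge_weighted(pd_prime)
```

Under these assumed weights the day-first descriptor ends up with four times the weight of the month-first one, so the ambiguous string 2.5.2008 is most plausibly read day first.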
For example, consider a spreadsheet with dates in three different formats (namely, US format: month/day/year; European format: day.month.year; and Chinese format: year-month-day), as shown in Figure 13. Note that some of these dates lack a year, which is assumed to default to the current year. The computation the user requires is best described as a conditional computation, in which the year is copied from the input to the output if it is present, and defaults to the current year if the input does not contain one. The semantic entity manipulation program handles this by representing it as a (weighted) set of pcp expressions, as described previously. The probabilistic interpretation of this program (on an input) gives more weight to those pcp expressions whose corresponding parse expression matches the input more closely.
The end user also wants to format the dates in the uniform format indicated by the first two output examples (shown in italics). The output format has some subtleties. For example, the month string is printed using a three-letter prefix with an initial capital (that is, the first letter is upper case). The day is printed with the applicable suffix "st", "nd", "rd", or "th". Note that the semantic entity manipulation technique embodiments described herein support such rich formatting while requiring developers to associate pretty printer routines only with base types such as strings and integers.
The input string data has multiple interpretations and is noisy. This is handled by using the probabilistic interpretation described previously to generate a set of weighted parses (the higher the weight, the more likely the parse). Specifically, for the input 6/3/2008, the parse month/day/year is preferred over day/month/year because the computation that produces Jun 3rd, 2008 is then the simplest one (namely, the identity map). For the input 2.5.2008, the parse day.month.year is preferred over month.day.year because the parse descriptor day.month.year occurs more prominently in the spreadsheet (specifically, 25.3.2007 has the parse day.month.year). The input '09-Fabruary-1 is most likely parsed as year-month-day, because the string Fabruary most closely matches the month name February, and the character "'" precedes the integer 09, which suggests a delimiter that appears before a two-digit year string.
A semantic entity manipulation program that will generate the required outputs can be learned from the first two input-output examples and the remaining non-example inputs shown in Figure 13. More specifically, the weight of the parse descriptor $p_1 = Day_1\,Month_1\,Year_1$ (day month year) is 4 times the weight of the parse descriptor $p_2 = Month_1\,Day_1\,Year_1$ (month day year). This is because there are 4 inputs (namely, 2.5.2008, 25.3.2007, 26.3.2007, 27.3.2007) for which $p_1$ is a valid parse descriptor, and only 1 input (namely, 2.5.2008) for which $p_2$ is a valid parse descriptor. Hence, for the input date 2.5.2008, the parse corresponding to $p_1$ has a higher associated weight than the parse corresponding to $p_2$. This illustrates that using weighted parse descriptors allows probabilistic parsing to correctly estimate the most likely parse of an ambiguous input string.
Now suppose the spreadsheet has a noisy input 5.16.2010. The weights associated with the parse descriptors will then favor $p_1$ by 4 to 2, and therefore the highest-weighted parse of 2.5.2008 will still be the one corresponding to $p_1$. This illustrates that using weighted parse descriptors keeps probabilistic parsing robust when noise is present in the data.
Next, how the weighted set PCP of pcp expressions is learned will be described. Consider a single row comprising one of the input-output examples. For the input string $s_i$ in the i-th column of the example, the getAllParses method is used to obtain its set of weighted parses (e, p, w). Each parse returns the entity instance e in its canonical form. In a similar manner, the set of parses for the entire row is computed. The set of parses for the output string r is then obtained. For each vector of input entity instances drawn from the parses of the row, and each output entity instance e′ drawn from the parses of r, a set of weighted computations that map the input vector to e′ is learned. A type-based exhaustive search is performed to compute the set of available mappings (in one implementation, the number of transformations used in a computation is bounded). The weight of any computation C, weight(C), is defined by combining the weights of the parses involved with simplicity(C), where in one implementation the combining function is defined as the min (minimum) function and the simplicity measure of C is defined as the inverse of the size of C. To print the computed entity $e'_j$ in the required output format, a print descriptor $q'_j$ also needs to be learned. The print descriptor $q'_j$ is learned from the parse descriptor $p'_j$, which comprises (f, d) pairs, by exhaustively searching for the pretty printer to use for printing each field representation f in $p'_j$.
Let PCP′ be the multiset union of the sets of weighted pcp expressions learned in this way. If $O_1 = O_2$, then $(O_1, w_1 + w_2)$ can iteratively replace any two elements $(O_1, w_1)$ and $(O_2, w_2)$ in PCP′. PCP is then defined as the resulting set PCP′.
For example, suppose an end user wants to format stock prices that have 6 decimal places in column 1 of a spreadsheet so that they have 2 decimal places. The end user provides one example row containing "12.124532" as the input and "12.12" as the corresponding output. Two possible computations can realize this transformation: print at most 2 decimal places, and print exactly 2 decimal places. The end user then provides another example row with "12.1" as the input and "12.10" as the output. Again there are two possible computations: print at least 2 decimal places, and print exactly 2 decimal places. The learned probabilistic program P will comprise a weighted set of pcp expressions in which the pcp expression corresponding to the "exactly 2 decimal places" transformation has twice the weight of the pcp expressions corresponding to the "at most 2 decimal places" and "at least 2 decimal places" transformations, because each of the latter two transformations explains only one of the examples. Therefore, when P is run on a new input, the highest-ranked output will correspond to the "exactly 2 decimal places" transformation. This example illustrates that the semantic entity manipulation program learned by generating a weighted set of pcp expressions has the desirable property of giving more weight to pcp expressions that represent computations common across multiple input-output examples.
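The weighting in this example can be reproduced with a small Python sketch. The three candidate printers, their names, and the choice of giving each consistent candidate a weight of 1 per example are assumptions for illustration; the described technique derives weights from parses and computation size.

```python
from collections import Counter
from decimal import Decimal

def exactly_2(text):
    """Print exactly 2 decimal places."""
    return format(Decimal(text).quantize(Decimal("0.01")), "f")

def at_most_2(text):
    """Print at most 2 decimal places (trailing zeros dropped)."""
    return exactly_2(text).rstrip("0").rstrip(".")

def at_least_2(text):
    """Print at least 2 decimal places (extra input digits kept)."""
    value = Decimal(text)
    places = max(2, -value.as_tuple().exponent)
    return format(value.quantize(Decimal(10) ** -places), "f")

CANDIDATES = {"exactly 2": exactly_2, "at most 2": at_most_2, "at least 2": at_least_2}

def learn(examples):
    """Add weight 1 to every candidate that reproduces an example's output."""
    weights = Counter()
    for source, target in examples:
        for name, printer in CANDIDATES.items():
            if printer(source) == target:
                weights[name] += 1
    return weights

weights = learn([("12.124532", "12.12"), ("12.1", "12.10")])
best = max(weights, key=weights.get)
```

Only the "exactly 2" candidate explains both rows, so it accumulates twice the weight of the other two and is the one applied to new inputs.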
Consider another example. This example illustrates how the learned semantic entity manipulation program models conditionals, such as the missing-year feature described in connection with Figure 13. Two pcp expressions with high weight are learned from the two input-output examples, in which $p_1 = Month_1/Day_1/Year_1$, $q_1 = \{(Month_2, \text{" "}, \text{string-PrefixP}(3)), (Day_1, \text{", "}, \text{IntOrd}), (Year_1, \epsilon, \text{Identity})\}$, $p_2 = Month_1/Day_1$, and $q_2$ is the same as $q_1$.
Now, given a new input "4.24", its parse descriptor $p = Month_1.Day_1$ (month day) matches $p_2$ more closely than $p_1$, and therefore the corresponding computation is performed on that input with a much higher weight than the identity computation. The learned program thus implicitly distinguishes inputs whose parse descriptor has no year field.
2.7 Illustrative semantic entity manipulation process
In view of the foregoing, an exemplary process for implementing the semantic entity manipulation technique embodiments described herein will now be described. Referring to Figure 14, the exemplary process begins with receiving input-output examples (block 1400). As described previously, each input-output example provides one or more input items and a corresponding required output item. The received input and output items are parsed to produce a set of weighted parses (block 1402). Each of these weighted parses represents a different potential parse of an input or output item, and its weight reflects a measure of the likelihood that the parse is a valid parse, based on a comparison with a prescribed parse library. Next, a previously unselected input-output example is selected (block 1404). One or more transformations of a type that can produce the required output item from the input items of the selected example are identified from a transformation library (block 1406). In addition, formatting instructions are identified that can format an output item to match the formatting of the required output item of the selected input-output example (block 1408). It is then determined whether any input-output examples remain unselected (block 1410). If so, the actions of blocks 1404 through 1410 are repeated. Once all the input-output examples have been selected and processed, a probabilistic program is generated (block 1412) which, given one or more input items of the same type as the input items of the input-output examples, employs the identified transformations and formatting instructions to produce an output item corresponding to those input items. One or more input items of the same type as the input items of the input-output examples are then received (block 1414), and the generated probabilistic program is used to produce an output item corresponding to each received input item (block 1416).
It is also possible to receive one or more input items of the same type as the input items of the input-output examples together with the input-output examples themselves. In that case, the foregoing action of parsing the input and output items to produce the set of weighted parses can include parsing both the received input items of that type and the input and output items associated with the input-output examples.
Regarding the action of parsing the input and output items to produce the set of weighted parses, note that in one implementation each input item and output item is a character string, and the parse library comprises multiple semantic entities, each having an entity class and a set of fields. Given this, for each character string representing an input or output item and each semantic entity found in the parse library, a set of field matches is identified, where each field match represents a longest substring of the character string that matches a field of the semantic entity from the parse library. As described previously, a parse graph is built from this set of field matches such that there is one node for each character in the character string. For each field match, a field edge is directed from the node representing the starting character of the field match to the node representing its ending character. In addition, a delimiter edge is established from each node representing a character in the string to the node representing the next character in the string.
Next, parse paths through the parse graph are identified. Each parse path comprises a sequence of nodes that begins with the node representing the first character of the string and ends with the node representing the last character, and the nodes in the sequence are connected by edges that alternate between field edges and delimiter edges. Each identified parse path thus comprises a sequence of semantic entity field and delimiter pairs.
Next, a constraint weight factor is computed for each identified parse path. More specifically, each semantic entity is associated with a set of weighted constraints that specify, for that semantic entity, particular field assignments and the ordering among fields or delimiters, where a constraint that is more likely to be exhibited by the semantic entity has a larger constraint weight. Given this, the constraint weight factor is made proportional to the number of constraints associated with the semantic entity under consideration that the parse path satisfies, and to the constraint weight assigned to each satisfied constraint. For example, in one implementation the constraint weight factor is defined as the ratio of the sum of the weights of the satisfied constraints to the sum of the weights of all the constraints.
Next, a measure of the similarity between the parse path and each of a set of valid parse descriptors associated with the semantic entity is computed. Each of these valid parse descriptors comprises a sequence of semantic entity field and delimiter pairs that represents a valid parse of the semantic entity. In one implementation, this is achieved by computing a proximity measure that is inversely proportional to the Hamming distance between the parse path and a parse descriptor: the closer the pattern of semantic entity fields and delimiters in the parse path is to the pattern in the parse descriptor, the larger the proximity measure. The proximity measure is then designated as the valid parse descriptor similarity measure of the parse path.
The weight value of the parse path is then computed from the constraint weight factor and the valid parse descriptor similarity measure computed for the parse path. In one implementation, the weight value increases with both the constraint weight factor and the similarity measure; for example, it can be chosen to be the product of the two quantities. The parse path is then designated as a parse descriptor of the semantic entity, and the designated parse descriptor and the computed weight value are associated with the semantic entity to form a weighted semantic entity and parse descriptor tuple. Note that in one implementation, the semantic entity of each weighted semantic entity and parse descriptor tuple is converted to a prescribed canonical form of that semantic entity, as described previously. The prescribed canonical form of a semantic entity is identified in the entity class of the entity.
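The weighting of a single parse path can be sketched in Python as follows. Several details are assumptions chosen only to make the sketch concrete: the proximity measure is taken as 1 / (1 + Hamming distance), a length difference counts as extra mismatches, the best-matching valid descriptor is used, and the two constraints and their weights are invented for illustration.

```python
def constraint_weight_factor(path, constraints):
    """Ratio of the weights of satisfied constraints to the weights of all constraints."""
    total = sum(weight for _, weight in constraints)
    satisfied = sum(weight for check, weight in constraints if check(path))
    return satisfied / total if total else 0.0

def hamming(path, descriptor):
    """Position-wise mismatches between two (field, delimiter) sequences."""
    mismatches = sum(1 for a, b in zip(path, descriptor) if a != b)
    return mismatches + abs(len(path) - len(descriptor))

def similarity(path, valid_descriptors):
    """Assumed proximity measure: 1 / (1 + distance) to the closest valid descriptor."""
    return max(1.0 / (1 + hamming(path, d)) for d in valid_descriptors)

def parse_weight(path, constraints, valid_descriptors):
    """Product of the constraint weight factor and the similarity measure."""
    return constraint_weight_factor(path, constraints) * similarity(path, valid_descriptors)

# Hypothetical date entity: parse paths are sequences of (field, delimiter) pairs.
MDY = [("month", "/"), ("day", "/"), ("year", "")]
DMY = [("day", "/"), ("month", "/"), ("year", "")]
YMD = [("year", "-"), ("month", "-"), ("day", "")]

CONSTRAINTS = [
    (lambda p: len({field for field, _ in p}) == len(p), 2.0),  # no field repeats
    (lambda p: p[-1][0] == "year", 1.0),                        # year comes last
]
VALID = [MDY]

weight_mdy = parse_weight(MDY, CONSTRAINTS, VALID)
weight_dmy = parse_weight(DMY, CONSTRAINTS, VALID)
```

A path that matches a valid descriptor exactly and satisfies every constraint receives the full weight of 1.0, while paths that deviate in field order or violate a constraint are scaled down rather than rejected.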
Regarding the action of identifying, from a transformation library, one or more transformations of a type that can produce the required output item from the input items of an input-output example, note that in one implementation the entity class of each semantic entity identifies one or more transformations applicable to that entity class. Given this, the parse descriptor group associated with the input items of the input-output example and the parse descriptor group associated with its output item are identified, as described previously. Then, for each of the one or more transformations associated with the entity classes in the parse descriptor groups of the input and output items of the input-output example, the transformations are identified that, when applied to the semantic entities of the identified parse descriptor tuples associated with the input items, produce the entity of the identified parse descriptor tuple associated with the output item. In addition, a weight factor is computed for each applied transformation; this factor represents a measure of how likely the applied transformation is to be the required transformation. In general, the weight factor is computed from the weight values computed for the parse descriptor groups of the input and output items of the input-output example and from a measure of the complexity of the transformation. More specifically, in one implementation the weight factor is computed by first identifying the minimum of the following two weight values: 1) the weight value of the parse descriptor group associated with the input items of the input-output example to which the transformation is applied, and 2) the weight value of the parse descriptor group associated with the output item of that input-output example. Next, the inverse of the size of the applied transformation is computed. The product of the identified minimum weight and the computed inverse of the transformation size is then computed and designated as the weight factor.
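The weight factor just described reduces to a one-line formula, sketched here in Python. The function name and the sample weights are hypothetical, and how the size of a transformation is measured is left to the caller.

```python
def transformation_weight(input_group_weight, output_group_weight, transformation_size):
    """min(input weight, output weight) multiplied by 1 / size of the transformation."""
    if transformation_size < 1:
        raise ValueError("transformation size must be at least 1")
    minimum_weight = min(input_group_weight, output_group_weight)
    return minimum_weight * (1.0 / transformation_size)

# Illustrative weights: the same parses explained by a one-step and a two-step transformation.
single_step = transformation_weight(0.8, 0.5, transformation_size=1)
two_step = transformation_weight(0.8, 0.5, transformation_size=2)
```

With equal parse weights, the smaller transformation always receives the larger factor, which is how simpler explanations of an example come to be preferred.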
Note that in some cases the required transformation may involve applying more than one transformation. Accordingly, in an alternative implementation of the action of identifying, from a transformation library, transformations of a type that can produce the required output item from the input items of an input-output example, the following change to the foregoing process is employed. More specifically, for each combination of the one or more transformations associated with the entity classes in the parse descriptor groups of the input and output items of the input-output example, up to a prescribed number of transformations per combination, the transformation combinations are identified that, when applied to the semantic entities of the identified parse descriptor tuples associated with the input items, produce the entity of the identified parse descriptor tuple associated with the output item. In addition, the weight factor of an applied transformation combination is defined as a measure of how likely it is to be the required transformation combination.
Regarding the action of identifying formatting instructions that can format an output item to match the formatting of the required output item of an input-output example, in general this involves associating a formatting instruction from a formatting instruction library with each semantic entity field identified in the parse descriptors of the parse descriptor group associated with the output item of the input-output example. The formatting instruction produces the corresponding field of the semantic entity of that descriptor tuple in the form exhibited by the portion of the output item associated with that semantic entity field. More specifically, in one implementation, for each semantic entity field identified in a parse descriptor of the parse descriptor group associated with the output item of the input-output example, it is first determined whether the field is one of a prescribed set of base-type fields. Whenever the field is determined to be a base-type field, a formatting instruction predefined for that base type is associated with the field. However, whenever the field is determined not to be a base-type field (that is, it is an entity-type field), the formatting instructions are identified by recursively applying the foregoing process to that entity field.
Regarding the action of generating a probabilistic program that, given one or more input items of the same type as the input items of the input-output examples, employs the identified transformations and formatting instructions to produce an output item corresponding to those input items, in one implementation this involves the following. For each semantic entity in the parse descriptor group of the output item of an input-output example, a transformation that produces the semantic entity of the identified parse descriptor tuple associated with the output item, its computed weight factor, and the identified formatting instructions are combined to form a parse-compute-print tuple. This is done for each of the one or more transformations associated with the entity classes in the parse descriptor group of the input items of the input-output example. Parse-compute-print tuples that have identical transformations and formatting instructions are then identified. Each group of parse-compute-print tuples that have identical transformations and formatting instructions is replaced with a single parse-compute-print tuple that has the same transformation and formatting instructions as the group and a weight factor equal to the sum of all the weight factors in the group. In addition, parse descriptor groups of the input and output items of the input-output examples that have identical parse descriptors are identified. Each set of parse descriptor groups that have identical parse descriptors is replaced with a single parse descriptor group that has the same parse descriptors and a weight value equal to the sum of all the weight values in the set. Finally, the remaining parse descriptor groups and parse-compute-print tuples are combined to form the probabilistic program.
Regarding the action of using the generated probabilistic program to produce an output item corresponding to a received input item, note that in implementations where the semantic entity of each weighted semantic entity and parse descriptor tuple is converted to a prescribed canonical form, applying a transformation likewise produces a semantic entity in canonical form. Accordingly, the foregoing action of identifying formatting instructions that can format an output item to match the formatting of the required output item of an input-output example also includes, where necessary, converting the output item from the prescribed canonical form to a form consistent with the required output item of the input-output example.
2.8 Illustrative semantic entity manipulation system
In view of the foregoing, an exemplary system for implementing the semantic entity manipulation technique embodiments described herein will now be described. Referring to Figure 15, the exemplary semantic entity manipulation system 1500 comprises a parsing module 1502, a transformation module 1504, and a formatting module 1506. The parsing module 1502 parses the input items and output items 1508 that make up the input-output examples to produce a set of weighted parses. Each of these weighted parses represents a different potential parse of an input or output item, and is weighted according to a measure of the likelihood that the parse is a valid parse, based on a comparison with a parse library. The transformation module 1504 identifies, from a transformation library 1512, one or more transformations of a type that can produce the required output item from the input items of each input-output example. The formatting module 1506 identifies formatting instructions that can format an output item to match the formatting of the required output item of an input-output example.
There is also a program generation module 1514 that generates a probabilistic program 1516. Given one or more input items of the same type as the input items of the input-output examples, the probabilistic program 1516 employs the transformations identified by the transformation module 1504 and the formatting instructions from the formatting module 1506 to produce a set of ranked output items. The output items are ranked in order of the weight factors computed for them, where a weight factor represents a measure of how accurately a transformation produces the output item from the input items associated with an input-output example. A program execution module 1518 uses the generated probabilistic program 1516 to produce an output item that corresponds to a received input item of the same type as the input items of the input-output examples. One implementation of the program execution module 1518 produces the highest-ranked output item in the set of ranked output items. When the set of ranked output items contains more than one output item, the program execution module 1518 produces an indicator that other, lower-ranked output items were produced and are available for review.
2.9 Illustrative User Interface Operation
This section describes one implementation of a graphical user interface used in conjunction with a spreadsheet program. The user first selects a rectangular region of the spreadsheet that contains the input and output columns. The selected region contains one or more input-output examples together with the non-example inputs for which outputs are needed. The columns that are most filled are treated as input columns, and the less-filled columns are treated as output columns.
In one implementation, however, the user can instead select multiple column ranges and state explicitly which columns are inputs and which are outputs. This is especially useful when most cells of an input column are empty, because such a column might otherwise be taken for an output column. The rows that contain an entry in an output column are treated as the input-output examples from which the program for that output column is learned.
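The column-selection behavior described in this section can be sketched as follows. This is a minimal illustration only: the function names `classify_columns` and `example_rows`, the representation of the selected region as a list of rows of strings, and the reading of "most filled" as "having the maximum count of non-empty cells" are assumptions made for the sketch.

```python
from typing import List, Optional, Sequence, Tuple


def classify_columns(
    rows: Sequence[Sequence[str]],
    explicit_inputs: Optional[List[int]] = None,
) -> Tuple[List[int], List[int]]:
    """Split the columns of a selected region into input and output columns."""
    n_cols = max(len(row) for row in rows)
    if explicit_inputs is not None:
        # The user stated explicitly which columns are inputs.
        inputs = sorted(explicit_inputs)
        outputs = [c for c in range(n_cols) if c not in inputs]
        return inputs, outputs
    # Count the non-empty cells in each column.
    filled = [
        sum(1 for row in rows if c < len(row) and row[c].strip())
        for c in range(n_cols)
    ]
    most = max(filled)
    # Assumption: the columns tied for the highest fill count are the inputs.
    inputs = [c for c in range(n_cols) if filled[c] == most]
    outputs = [c for c in range(n_cols) if filled[c] < most]
    return inputs, outputs


def example_rows(rows: Sequence[Sequence[str]], output_col: int) -> List[int]:
    """Rows with an entry in the output column serve as input-output examples."""
    return [
        i for i, row in enumerate(rows)
        if output_col < len(row) and row[output_col].strip()
    ]


region = [
    ["3/4/2011", "Mar 4"],
    ["5/6/2011", ""],
    ["7/8/2011", ""],
]
print(classify_columns(region))  # column 0 is fuller, so it is the input
print(example_rows(region, 1))   # only the first row carries an example
```

Passing `explicit_inputs` corresponds to the case in which a sparsely filled input column would otherwise be mistaken for an output column.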
In one implementation, the user selects an "Apply" button (or similar), and the spreadsheet is filled in as follows. A semantic entity manipulation program is generated for each output column as described previously. For each output cell α_{r,c} (row r and output column c), the system runs the program learned for output column c on the input state specified in row r to generate a ranked set O of possible outputs (or possibly only those outputs having a sufficiently high weight). The system then fills in the cell α_{r,c} as follows:
1) if O contains a single string, the system fills the cell with that string;
2) if O contains multiple strings, the system fills the cell with the string representing the highest-ranked solution, but highlights the cell to indicate to the user that multiple computations explain the examples the user provided, and that the user may want to check the accuracy of the output in the highlighted cell; and
3) if O is empty, the system fills the cell with " " to draw attention to the fact that the user should provide an output for this cell.
The user can subsequently modify the contents of any cell by right-clicking on it, which opens a dialog box that lets the user either select from the other strings in the corresponding ranked set O or provide a new output altogether. After any such modification, the learning process described above is automatically repeated with the expanded set of input-output examples, and the contents of the spreadsheet are automatically updated to reflect the newly learned results.
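The three cell-filling cases set out in this section can be sketched in Python as follows. The `Cell` class, the `fill_cell` function and in particular the `PLACEHOLDER` text are hypothetical; the text does not specify how a cell needing user input is rendered beyond the marker quoted above, so the placeholder here is an assumption chosen only to make the sketch runnable.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical marker for a cell whose output the user must supply.
PLACEHOLDER = "<output needed>"


@dataclass
class Cell:
    """Hypothetical model of a filled output cell."""
    text: str
    highlighted: bool = False        # True when alternatives exist
    needs_user_output: bool = False  # True when no output could be produced


def fill_cell(ranked_outputs: List[str]) -> Cell:
    """Fill one output cell from the ranked set O of possible outputs."""
    if not ranked_outputs:
        # Case 3: O is empty, so ask the user to provide an output.
        return Cell(PLACEHOLDER, needs_user_output=True)
    if len(ranked_outputs) == 1:
        # Case 1: a single string fills the cell with no highlight.
        return Cell(ranked_outputs[0])
    # Case 2: several strings, so use the highest-ranked one and highlight.
    return Cell(ranked_outputs[0], highlighted=True)


print(fill_cell(["2011-03-04"]))
print(fill_cell(["2011-03-04", "2011-04-03"]))
print(fill_cell([]))
```

A user correction to a highlighted or placeholder cell would then be added to the example set and the learning step rerun, as described above.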
3.0 Illustrative Operating Environment
The semantic entity manipulation technique embodiments described herein are operational within numerous types of general-purpose or special-purpose computing system environments or configurations. Figure 16 illustrates a simplified example of a general-purpose computer system on which various embodiments and elements of the semantic entity manipulation technique embodiments described herein may be implemented. It should be noted that any boxes represented by broken or dashed lines in Figure 16 represent alternate embodiments of the simplified computing device, and that any or all of these alternate embodiments, as described below, may be used in combination with other alternate embodiments described throughout this document.
For example, Figure 16 shows a general system diagram illustrating a simplified computing device 10. Such computing devices are typically found in devices having at least some minimum computational capability, including but not limited to personal computers, server computers, hand-held computing devices, laptop or mobile computers, communication devices such as cell phones and PDAs, multiprocessor systems, microprocessor-based systems, set-top boxes, programmable consumer electronics, network PCs, minicomputers, mainframe computers, video media players, and the like.
To allow a device to implement the semantic entity manipulation technique embodiments described herein, the device should have sufficient computational capability and system memory to enable basic computational operations. In particular, as illustrated in Figure 16, the computational capability is generally illustrated by one or more processing units 12, and may also include one or more GPUs 14, either or both in communication with system memory 16. Note that the processing units 12 of the general computing device may be specialized microprocessors, such as a DSP, a VLIW, or other microcontroller, or may be conventional CPUs having one or more processing cores, including specialized GPU-based cores in a multi-core CPU.
In addition, the simplified computing device of Figure 16 may also include other components, such as a communications interface 18. The simplified computing device of Figure 16 may also include one or more conventional computer input devices 20 (e.g., pointing devices, keyboards, audio input devices, video input devices, haptic input devices, devices for receiving wired or wireless data transmissions, etc.). The simplified computing device of Figure 16 may also include other optional components, such as one or more conventional display devices 24 and other computer output devices 22 (e.g., audio output devices, video output devices, devices for transmitting wired or wireless data transmissions, etc.). Note that typical communications interfaces 18, input devices 20, output devices 22, and storage devices 26 for general-purpose computers are well known to those skilled in the art and will not be described in detail herein.
The simplified computing device of Figure 16 may also include a variety of computer-readable media. Computer-readable media can be any available media that can be accessed by the computer 10 via storage devices 26, and include both volatile and nonvolatile media that are removable 28 and/or non-removable 30, for storage of information such as computer-readable or computer-executable instructions, data structures, program modules, or other data. By way of example, and not limitation, computer-readable media may comprise computer storage media and communication media. Computer storage media include, but are not limited to, computer- or machine-readable media or storage devices such as DVDs, CDs, floppy disks, tape drives, hard drives, optical drives, solid-state memory devices, RAM, ROM, EEPROM, flash memory or other memory technology, magnetic cassettes, magnetic tapes, magnetic disk storage or other magnetic storage devices, or any other device that can be used to store the desired information and that can be accessed by one or more computing devices.
Retention of information such as computer-readable or computer-executable instructions, data structures, program modules, and the like can also be accomplished by using any of a variety of the aforementioned communication media to encode one or more modulated data signals or carrier waves, or other transport mechanisms or communications protocols, and includes any wired or wireless information delivery mechanism. Note that the terms "modulated data signal" and "carrier wave" generally refer to a signal that has one or more of its characteristics set or changed in such a manner as to encode information in the signal. For example, communication media include wired media, such as a wired network or direct-wired connection carrying one or more modulated data signals, and wireless media, such as acoustic, RF, infrared, laser, and other wireless media for transmitting and/or receiving one or more modulated data signals or carrier waves. Combinations of any of the above should also be included within the scope of communication media.
Further, software, programs, and/or computer program products embodying some or all of the semantic entity manipulation technique embodiments described herein, or portions thereof, may be stored, received, transmitted, or read from any desired combination of computer- or machine-readable media or storage devices and communication media in the form of computer-executable instructions or other data structures.
Finally, the semantic entity manipulation technique embodiments described herein may be further described in the general context of computer-executable instructions, such as program modules, being executed by a computing device. Generally, program modules include routines, programs, objects, components, data structures, etc., that perform particular tasks or implement particular abstract data types. The embodiments described herein may also be practiced in distributed computing environments where tasks are performed by one or more remote processing devices, or within a cloud of one or more devices, that are linked through one or more communications networks. In a distributed computing environment, program modules may be located in both local and remote computer storage media including media storage devices. Still further, the aforementioned instructions may be implemented, in part or in whole, as hardware logic circuits, which may or may not include a processor.
4.0 Other Embodiments
It is noted that any or all of the aforementioned embodiments throughout the description may be used in any combination desired to form additional hybrid embodiments. In addition, although the subject matter has been described in language specific to structural features and/or methodological acts, it is to be understood that the subject matter defined in the appended claims is not necessarily limited to the specific features or acts described above. Rather, the specific features and acts described above are disclosed as example forms of implementing the claims.
Claims (10)
1. A computer-implemented method for generating a required output item from one or more input items using a probabilistic program, the probabilistic program being generated using input-output examples, the method comprising:
using a computer to perform the following method actions:
receiving input-output examples (1400), each input-output example providing one or more input items and a corresponding required output item;
parsing the input items and the output items to produce a set of weighted parses (1402), each weighted parse representing a different potential parse of an input or output item, wherein each parse is weighted according to a measure of the likelihood that it is a valid parse, based on a comparison against a parsing library;
for each input-output example,
identifying, from a transformation library, one or more transformations of a type that can produce the required output item from the input item (1406), and
identifying formatting instructions that can format an output item to match the formatting of the required output item of the input-output example (1408);
generating a probabilistic program (1412) which, given one or more input items of the same type as the input items of the input-output examples, employs the identified transformations and the formatting instructions to produce an output item corresponding to the one or more input items; and
receiving one or more input items of the same type as the input items of the input-output examples (1414), and using the generated probabilistic program to produce an output item corresponding to the received input items (1416).
2. The method of claim 1, wherein each input item and output item comprises a character string, wherein the parsing library comprises a plurality of semantic entities, each semantic entity comprising an entity class and a set of fields, and wherein the method action of parsing the input items and the output items to produce the set of weighted parses comprises the actions of:
for each character string representing an input or output item and each semantic entity found in the parsing library,
identifying a set of field matches, wherein each field match comprises the longest substring of the input or output character string that matches a semantic entity field from the parsing library,
constructing a parse graph from the set of field matches such that there is one node for each character in the character string, such that for each field match a field edge is directed from the node representing the starting character of the field match to the node representing the ending character of the field match, and such that a delimiter edge is established from each node representing a character in the character string to the node representing any later character in the character string,
identifying parse paths through the parse graph, wherein each parse path comprises a sequence of nodes that starts with the node representing the first character in the character string and ends with the node representing the last character in the character string, the nodes in the sequence being connected by edges that alternate between field edges and delimiter edges, so that each identified parse path comprises a sequence of semantic entity field and delimiter pairs, and
for each identified parse path,
computing a constraint weighting factor for the parse path, the constraint weighting factor being proportional to the number of constraints associated with the semantic entity under consideration that the parse path satisfies and to the constraint weight assigned to each satisfied constraint, wherein each semantic entity is associated with a set of weighted constraints that specify the field assignments of the semantic entity and the ordering between fields or delimiters, and wherein a constraint that is more likely to be exhibited by the semantic entity has a larger constraint weight,
computing a measure of the similarity between the parse path and each valid parse descriptor in a set of valid parse descriptors associated with the semantic entity, wherein each valid parse descriptor comprises a sequence of semantic entity field and delimiter pairs that represents a valid parse of the semantic entity,
computing a weight value for the parse path based on the constraint weighting factor computed for the parse path and the valid parse descriptor similarity measure, and
designating the parse path as the designated parse descriptor of the semantic entity, and associating the designated parse descriptor and the computed weight value with the semantic entity to form a weighted semantic entity and parse descriptor tuple.
3. The method of claim 2, wherein the entity class of each semantic entity identifies one or more transformations applicable to that entity class, and wherein, for each input-output example, the method action of identifying, from the transformation library, one or more transformations of a type that can produce the required output item from the input item comprises the actions of:
identifying the parse descriptor tuples associated with the input item of the input-output example;
identifying the parse descriptor tuples associated with the output item of the input-output example; and
for each of the one or more transformations associated with the entity classes in the parse descriptor tuples of the input and output items of the input-output example,
identifying those transformations that, when applied to the semantic entities of the identified parse descriptor tuples associated with the input item, produce the entities of the identified parse descriptor tuples associated with the output item, and
computing a weighting factor for each such transformation, wherein the weighting factor is a measure of the likelihood that the transformation is the required transformation.
4. The method of claim 3, wherein the method action of identifying formatting instructions that can format an output item to match the formatting of the required output item of the input-output example comprises the action of: for each semantic entity field identified in the parse descriptor of a parse descriptor tuple associated with the output item of the input-output example, associating a formatting instruction from a formatting instruction library, the formatting instruction producing the corresponding field of the semantic entity of the descriptor tuple in the form exhibited by the portion of the output item associated with that semantic entity field.
5. The method of claim 4, wherein the method action of generating the probabilistic program which, given one or more input items of the same type as the input items of the input-output examples, employs the identified transformations and the formatting instructions to produce an output item corresponding to the one or more input items, comprises the actions of:
for each semantic entity in the parse descriptor tuples of the output item of the input-output example,
for each of the one or more transformations associated with the entity classes of the parse descriptor tuples of the input item of the input-output example, associating the transformation, the weighting factor computed for the transformation, and the formatting instructions identified for the semantic entity to form a parse-compute-print tuple, the semantic entity being the semantic entity of the identified parse descriptor tuple associated with the output item;
identifying the parse-compute-print tuples that have the same transformation and formatting instructions;
for each group of parse-compute-print tuples having the same transformation and formatting instructions, replacing the group with a single parse-compute-print tuple that has the same transformation and formatting instructions as the group and a weighting factor equal to the sum of all the weighting factors in the group;
identifying the parse descriptor tuples of the input and output items of the input-output examples that have the same parse descriptor;
for each group of parse descriptor tuples having the same parse descriptor, replacing the group with a single parse descriptor tuple that has the same parse descriptor as the group and a weight value equal to the sum of all the weight values in the group; and
grouping the remaining parse descriptor tuples and parse-compute-print tuples to form the probabilistic program.
6. The method of claim 3, wherein the method action of identifying formatting instructions that can format an output item to match the formatting of the required output item of the input-output example comprises the actions of:
for each semantic entity field identified in the parse descriptor of a parse descriptor tuple associated with the output item of the input-output example, determining whether the semantic entity field is one of a prescribed group of base-type fields;
whenever a semantic entity field in the parse descriptor of a parse descriptor tuple associated with the output item of the input-output example is determined to be a base-type field, associating with that field the formatting instruction predefined for that type of base-type field;
whenever a semantic entity field in the parse descriptor of a parse descriptor tuple associated with the output item of the input-output example is determined not to be a base-type field, associating a formatting instruction that produces the corresponding field of the semantic entity of the descriptor tuple in the form exhibited by the portion of the output item associated with that semantic entity field.
7. The method of claim 2, wherein the entity class of each semantic entity identifies one or more transformations applicable to that entity class, and wherein, for each input-output example, the method action of identifying, from the transformation library, one or more transformations of a type that can produce the required output item from the input item comprises the actions of:
identifying the parse descriptor tuples associated with the input item of the input-output example;
identifying the parse descriptor tuples associated with the output item of the input-output example; and
for combinations of the one or more transformations associated with the entity classes in the parse descriptor tuples of the input and output items of the input-output example, up to a prescribed number of transformations per combination,
identifying those transformation combinations that, when applied to the semantic entities of the identified parse descriptor tuples associated with the input item, produce the entities of the identified parse descriptor tuples associated with the output item, and
computing a weighting factor for each identified transformation combination, the weighting factor comprising a measure of the likelihood that the transformation combination is the required transformation combination.
8. The method of claim 1, wherein the one or more input items of the same type as the input items of the input-output examples are received in conjunction with receiving the input-output examples, and wherein the method action of parsing the input items and the output items to produce the set of weighted parses comprises an action of parsing both the input and output items associated with the input-output examples and the received input items of the same type as the input items of the input-output examples.
9. A system for producing a required output item from one or more input items using a probabilistic program, the probabilistic program being generated using input-output examples, the system comprising:
a parsing module (1502) that parses the input items and output items (1508) making up the input-output examples to produce a set of weighted parses, each weighted parse representing a different potential parse of an input or output item, wherein each parse is weighted according to a measure of the likelihood that it is a valid parse, based on a comparison against a parsing library (1510),
a transformation module (1504) that, for each input-output example, identifies, from a transformation library (1512), one or more transformations of a type that can produce the required output item from the input item, and
a formatting module (1506) that identifies formatting instructions that can format an output item to match the formatting of the required output item of the input-output example;
a program generation module (1514) that generates a probabilistic program (1516) which, given one or more input items of the same type as the input items of the input-output examples, employs the identified transformations and the formatting instructions to produce a ranked group of output items, wherein the output items are ranked in order of the weighting factors computed for them, each weighting factor comprising a measure of the accuracy with which a transformation produces the output item from the input item associated with an input-output example; and
a program execution module (1518) that uses the generated probabilistic program to produce an output item corresponding to a received input item of the same type as the input items of the input-output examples.
10. The system of claim 9, wherein the program execution module produces the output item corresponding to the highest-ranked output item in the ranked group of output items and, whenever the ranked group contains more than one output item, further produces an indicator that the other, lower-ranked output items were produced and are available for review.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/020,153 US8799234B2 (en) | 2010-07-12 | 2011-02-03 | Semantic entity manipulation using input-output examples |
US13/020,153 | 2011-02-03 |
Publications (2)
Publication Number | Publication Date |
---|---|
CN102682065A CN102682065A (en) | 2012-09-19 |
CN102682065B true CN102682065B (en) | 2015-03-25 |
Family
ID=46813998
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201210023688.6A Active CN102682065B (en) | 2011-02-03 | 2012-02-02 | Semantic entity control using input and output sample |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN102682065B (en) |
Families Citing this family (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9251467B2 (en) * | 2013-03-03 | 2016-02-02 | Microsoft Technology Licensing, Llc | Probabilistic parsing |
KR102146261B1 (en) * | 2014-02-14 | 2020-08-20 | 삼성전자 주식회사 | Electronic Device And Method For Extracting And Using Semantic Entity In Conversation Message Of The Same |
CN113966518B (en) * | 2019-02-14 | 2024-02-27 | 魔力生物工程公司 | Controlled agricultural system and method of managing agricultural system |
CN109948164A (en) * | 2019-04-02 | 2019-06-28 | 北京三快在线科技有限公司 | Processing method, device, computer equipment and the storage medium of statistical demand information |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1400547A (en) * | 2001-08-03 | 2003-03-05 | 富士通株式会社 | Format file information extracting device and method |
WO2010088523A1 (en) * | 2009-01-30 | 2010-08-05 | Ab Initio Technology Llc | Processing data using vector fields |
Family Cites Families (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8296461B2 (en) * | 2007-08-07 | 2012-10-23 | Object Innovation Inc. | Data transformation and exchange |
US8103951B2 (en) * | 2008-09-30 | 2012-01-24 | Apple Inc. | Dynamic schema creation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C14 | Grant of patent or utility model | ||
GR01 | Patent grant | ||
ASS | Succession or assignment of patent right | Owner name: MICROSOFT TECHNOLOGY LICENSING LLC; Free format text: FORMER OWNER: MICROSOFT CORP.; Effective date: 20150728 |
C41 | Transfer of patent application or patent right or utility model | |
TR01 | Transfer of patent right | Effective date of registration: 20150728; Address after: Washington State; Patentee after: Microsoft Technology Licensing LLC; Address before: Washington State; Patentee before: Microsoft Corp. |