ES2886459T3

ES2886459T3 - Extraction of financial event and relationship

Info

Publication number: ES2886459T3
Application number: ES09706670T
Authority: ES
Inventors: Frank Schilder; Christopher Dozier; Ravi Kumar Kondadadi
Original assignee: Thomson Reuters Enterprise Centre GmbH
Current assignee: Thomson Reuters Enterprise Centre GmbH
Priority date: 2008-01-30
Filing date: 2009-01-30
Publication date: 2021-12-20
Anticipated expiration: 2029-01-30

Abstract

A computer system for extracting data and related information from tables in electronic documents, having at least one processor and at least one memory, the system comprising: means for automatically identifying and tagging a text segment in an electronic document (110); means for automatically tagging entity names, monetary expressions, and temporal expressions within the text segment (120); means for identifying a financial event described within the automatically tagged text segment; a support vector machine classifier (310) adapted to filter the document and identify a table comprising information of interest by distinguishing tables from non-tables, wherein tables used for formatting reasons are identified as non-tables and the information of interest comprises a plurality of desired attributes and desired values, the identified genuine tables being processed by: a. table classification using relation-specific classifiers based on supervised machine learning, b. label row and column classification, distinguishing label columns and label rows from the values within the tables, c. table structure recognition, associating each value with its labels in the same column and the same row to generate a list of attribute-value pairs, d. table understanding by comparing each of the attribute-value pairs; means for defining in the memory a data record associated with the financial event, the data record including data derived from the tagged text segment (319); and means for extracting relationship data (320) from the text segment and for determining a role of at least one entity, the entity being tagged within the text segment and related to the data record.

Description

DESCRIPTION

Extraction of financial event and relationship

Copyright notice and permission

A portion of this patent document contains material that is subject to copyright protection. The copyright owner has no objection to the exact-copy reproduction of the original patent document or the patent disclosure, as it appears in the patent files or records of the Patent and Trademark Office, but otherwise reserves all copyright rights. The following notice applies to this document: Copyright © 2007-2008, Thomson Reuters Global Resources.

Related applications

This application claims priority to United States Patent Application 12/341,926, which was filed on December 22, 2008, and to United States Provisional Application 61/063,047, which was filed on January 30, 2008.

Technical field

Various embodiments of the present invention relate to the extraction of data and related information from documents, such as identifying and tagging names and events in text and automatically inferring relationships between tagged entities, events, and so forth.

Background

The present inventors recognized the need to provide information consumers with relational and event information about entities, such as companies, persons, and cities, that are mentioned in electronic documents, particularly financial documents. For example, documents such as news stories and SEC (Securities and Exchange Commission) filings may indicate that Company A merged, or is rumored to be merging, with Company B, or that Company C announced actual or projected earnings of X dollars per share.

However, because of variations in language and the unstructured nature of many of the documents, automatically discerning the relational and event information about these entities is difficult and time-consuming, even with state-of-the-art computing equipment.

Document US 2005/131935 A1 (O'Leary, Paul J. [US] et al.) describes a content mining system and process that uses a combination of term recognition and rule-based classification of activity events, performed using a modular database that defines one or more vertical information markets or sectors, to identify relevant sector evidence. The primary elements of the identified evidence are scored in a manner that rates the relevance of a content item with respect to a set of identified named entities and a set of activity-based event categories, further associated as sets of entity-event pairs. A database built from the scored information provides a relevance-indexed repository of the original unstructured content items.

Summary

The invention is set forth in the appended set of claims.

Brief description of the drawings

Figure 1 is a block and flow diagram of an example system for named-entity tagging, resolution, and event extraction, which corresponds to one or more embodiments of the present invention.

Figure 2 is a diagram illustrating guided sequence decoding for named-entity tagging, which corresponds to one or more embodiments of the present invention.

Figure 3 is a block diagram of an example named-entity tagging, resolution, and event extraction system corresponding to one or more embodiments of the present invention.

Figure 4 is a flow diagram of an example method of named-entity tagging and resolution and event extraction corresponding to one or more embodiments of the present invention.

Figure 5 is a block and flow diagram of another example system for named-entity tagging and resolution, and event extraction, which corresponds to one or more embodiments of the present invention.

Detailed description of the example embodiment(s)

This description, which incorporates the Figures and the claims, describes one or more specific embodiments of an invention. These embodiments, offered not to limit but only to exemplify and teach the invention, are shown and described in sufficient detail to enable those skilled in the art to implement or practice the invention. Therefore, where appropriate to avoid obscuring the invention, the description may omit certain information known to those skilled in the art.

Example named-entity tagging and resolution system

Figure 1 shows an example named-entity tagging and resolution system 100. In addition to processors 101 and a memory 102, system 100 includes an entity tagger 110, an entity resolver 120, and authority files 130. Tagger 110, resolver 120, and authority files 130 are implemented using machine-readable data and/or machine-executable instructions stored in memory 102, which may take a variety of consolidated and/or distributed forms.

Entity tagger 110, which receives textual input in the form of documents or other text segments, such as a sentence 109, includes a tokenizer 111, a zoner 112, and a statistical tagger 113.

Tokenizer 111 processes and classifies sections of an input character string, such as sentence 109. The tokenization process is used to split the sentence or other text segment into word tokens. The resulting tokens are sent to zoner 112.

Zoner 112 locates the parts of the text that need to be processed for tagging, using patterns or rules. For example, the zoner may isolate portions of the document or text that contain proper names. After that determination, the parts of the text that need further processing are passed to statistical sequence tagger 113.

Statistical sequence tagger (or decoder) 113 uses one or more unambiguous name lists (lookup tables) 114 and rules 115 to tag the text within sentence 109 as a company, person, or place, or as a non-name. The rules and lists are regarded in this document as high-precision classifiers.

Example pattern rules can be implemented using regex+Java, Jape rules within GATE, ANTLR, and so forth. An example rule for illustration dictates that "if a sequence of words is capitalized and ends with 'Inc.', then it is tagged as a company or organization." The rules are developed by a human (for example, a researcher) and encoded in a rule formalism or directly in a procedural programming language. These rules tag an entity in the text when the preconditions of the rule are met.
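The example rule above can be sketched with a regular expression. This is only an illustration in Python rather than the regex+Java or Jape tooling the description names; the pattern, the `COMPANY` label string, and the function name `tag_companies` are assumptions made for the sketch, not part of the described system.

```python
import re

# Hypothetical rendering of the example rule: a run of capitalized words
# (optionally followed by a comma) that ends with "Inc." is a company.
COMPANY_RULE = re.compile(r"\b(?:[A-Z][\w'&-]*,?\s+)+Inc\.")

def tag_companies(text):
    """Return (start, end, label) spans for every match of the rule.

    `end` is exclusive, following Python's slicing convention.
    """
    return [(m.start(), m.end(), "COMPANY") for m in COMPANY_RULE.finditer(text)]

if __name__ == "__main__":
    print(tag_companies("Hank's Hardware, Inc. has a sale right now"))
```

A rule of this kind fires only when its precondition (capitalized sequence plus the "Inc." suffix) is met, which is what makes it usable as a high-precision classifier.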

Example name lists identify companies, such as Microsoft, Google, AT&T, Medtronics, Xerox; places, such as Minneapolis, Fort Dodge, Des Moines, Hong Kong; and drugs, such as Vioxx, Viagra, Aspirin, Penicillin. In the example embodiment, the lists are produced offline and made available at runtime. To produce a list, a large corpus of documents, for example a set of news stories, is passed through a statistical model (for example, a conditional random field (CRF) model) and/or various rules to determine whether a name is considered unambiguous. Example rules for creating the lists include: 1) being listed in a common-noun dictionary; and 2) being used as a company name more than ninety percent of the times the name is mentioned in a corpus. The lookup tagger also finds systematic variants of the names to add to the unambiguous list. In addition, the lookup tagger guides and forces partial solutions. Using this list helps the statistical model (the sequence tagger) by immediately pinning that exact name without having to make statistical determinations.
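As a rough illustration of the ninety-percent rule for building the unambiguous company list, the sketch below counts how often each name is labeled as a company in a tagged corpus. The input format (pairs of name and label), the `COMPANY` label string, and the function name are assumptions for this example; the common-noun dictionary rule and the variant generation are not modeled.

```python
from collections import Counter

def unambiguous_company_names(mentions, threshold=0.9):
    """Return names used as a company in more than `threshold` of mentions.

    `mentions` is an iterable of (name, label) pairs, a hypothetical format
    standing in for the output of a statistical model run over a corpus.
    """
    total = Counter()
    as_company = Counter()
    for name, label in mentions:
        total[name] += 1
        if label == "COMPANY":
            as_company[name] += 1
    return {name for name, n in total.items() if as_company[name] / n > threshold}

if __name__ == "__main__":
    corpus = [("Xerox", "COMPANY")] * 10 + [("Apple", "COMPANY")] * 5 + [("Apple", "O")] * 5
    print(unambiguous_company_names(corpus))
```

Because the list is computed offline, the threshold can be tuned without touching the runtime tagger.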

Examples of statistical sequence classifiers include linear-chain conditional random field (CRF) classifiers, which provide both accuracy and speed. Integrating the high-precision classifiers with the statistical sequence tagging approach involves two steps. First, the feature set of the original statistical model is modified by including features corresponding to the labels assigned by the high-precision classifiers, in effect "turning on" the appropriate label features according to the label assigned by the external classifier. Second, at runtime, a Viterbi decoder (or a decoder of similar function) is constrained to respect the partially labeled or tagged sequences assigned by the high-precision classifiers.

This form of guided decoding provides several benefits. First, decoding speed is improved, because the search space is limited by the prior tagging. Second, the results are more consistent, because three sources of knowledge are taken into account: the lists, the rules, and the statistical model of the trained decoder. The third benefit is the ease of customization that stems from eliminating the need to retrain the decoder if new rules and list items are added.

Figure 2 is a conceptual diagram showing how a text segment "Microsoft announced Monday a" is pre-tagged and how this pre-tagging (or pinning) restricts the possible tags or tagging options that a decoder, such as the Viterbi decoder, has to process. In the Figure, the term Microsoft is tagged or pinned as a company based on its inclusion in a list of company names; the term Monday is marked as "outside" based on its inclusion in a list of terms that should always be marked as "outside"; and the term "in" is marked as outside based on a rule that it should be marked as "outside" if it is followed by a term marked as "outside", in this case the term "Monday".
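The effect of pinning on decoding can be sketched as a Viterbi search in which each pinned position admits exactly one label. The code below is a minimal illustration under stated assumptions: the scores are made-up log-scores rather than the output of a trained CRF, and the function name `guided_viterbi`, the label strings, and the `pinned` mapping are hypothetical.

```python
def guided_viterbi(emissions, transitions, labels, pinned):
    """Best label path where pinned positions are restricted to one label.

    emissions:   list of dicts, label -> log-score, one dict per token
    transitions: dict, (previous_label, label) -> log-score
    pinned:      dict, position -> label fixed by a list or rule
    """
    n = len(emissions)
    # Pinning shrinks the set of labels the decoder considers per position.
    allowed = [[pinned[i]] if i in pinned else list(labels) for i in range(n)]
    best = [{lab: emissions[0][lab] for lab in allowed[0]}]
    back = [{}]
    for i in range(1, n):
        best.append({})
        back.append({})
        for cur in allowed[i]:
            prev, score = max(
                ((p, best[i - 1][p] + transitions[(p, cur)]) for p in allowed[i - 1]),
                key=lambda pair: pair[1],
            )
            best[i][cur] = score + emissions[i][cur]
            back[i][cur] = prev
    # Follow the back-pointers from the best final label.
    path = [max(best[-1], key=best[-1].get)]
    for i in range(n - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

if __name__ == "__main__":
    labels = ["COMPANY", "O"]
    tokens = ["Microsoft", "announced", "Monday", "a"]
    emissions = [  # toy scores, chosen so the unpinned decoder errs
        {"COMPANY": -2.0, "O": -1.0},
        {"COMPANY": -3.0, "O": -0.1},
        {"COMPANY": -0.5, "O": -1.0},
        {"COMPANY": -3.0, "O": -0.1},
    ]
    transitions = {(p, c): 0.0 for p in labels for c in labels}
    print(guided_viterbi(emissions, transitions, labels, {0: "COMPANY", 2: "O"}))
```

With the toy scores, the unpinned decoder would mislabel the first and third tokens; pinning position 0 from a company list and position 2 from an always-outside list forces the intended path and reduces the number of label combinations explored.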

In the example embodiment, the statistical sequence tagger computes the probability of a sequence of tags given the input text. The model parameters are estimated from a corpus of training data, that is, text in which a human has annotated all entity mentions or occurrences. (Unannotated text can also be used to improve the parameter estimation.) The statistical model then takes the training data, develops a set of features, and uses rules for pinning. Pinning is a specific way of using a statistical model to tag a sequence of characters and of integrating several different types of information and methods into the tagging process.

The statistical model locates the character offset positions (that is, start and end) in the document for each named entity. The document is a sequence of characters; therefore, the character offset positions are determined. For example, within the sentence "Hank's Hardware, Inc. has a sale right now", the text fragment "Hank's Hardware, Inc." has an offset position of (0,20). The character sequence has a start point and an end point; however, the path between those points varies.
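The offset pair in the example can be checked with a few lines of Python. The sketch assumes zero-based offsets with an inclusive end position, which is the convention that reproduces (0,20) for the 21-character fragment; the function name `char_offsets` is hypothetical.

```python
def char_offsets(text, mention):
    """Return (start, end) character offsets of `mention` in `text`.

    Offsets are zero-based and `end` is inclusive (an assumption made to
    match the example in the description). Returns None if absent.
    """
    start = text.find(mention)
    if start == -1:
        return None
    return (start, start + len(mention) - 1)

if __name__ == "__main__":
    sentence = "Hank's Hardware, Inc. has a sale right now"
    print(char_offsets(sentence, "Hank's Hardware, Inc."))
```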

After the character offset positions are located, information about the entity is identified through the use of features. This information ranges from general information (that is, determining that the text is a last name) to specific information (for example, a unique identifier). The example embodiment uses the features described below, but other embodiments use other types and quantities of features:

• Regular expressions: contains an uppercase letter, last character is a period, acronym format, contains a digit, punctuation

• Single-word lists: last names, job titles, location words, etc.

• Multi-word lists: country names, country capitals, universities, company names, state names, etc.

• Combination features: title@-1 AND (firstname OR lastname)

• Copy features: copy features from one token to neighboring tokens, e.g., the token two to the left of me is capitalized (Cap@-2)

• Word-itself features: "was" has the feature was@0

• First-sentence features: copy features from the words of the first sentence to others

• Abbreviation feature: copy the features of the name to the mentions of the abbreviation.
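The positional notation used in the list (was@0, Cap@-2, title@-1) can be illustrated with a small feature extractor. The sketch below is an assumption-laden illustration, not the embodiment's feature code: the word lists `LAST_NAMES` and `TITLES`, the feature names, and the window size are all hypothetical, and the combination feature checks only the last-name branch.

```python
from __future__ import annotations

import re

LAST_NAMES = {"williams", "hayek"}  # hypothetical single-word list
TITLES = {"mr.", "ms.", "dr."}      # hypothetical title list


def base_features(token: str) -> set[str]:
    """Features of the token itself, written with the @0 position suffix."""
    feats = {f"{token.lower()}@0"}  # word-itself feature, e.g. was@0
    if re.search(r"[A-Z]", token):
        feats.add("Cap@0")
    if token.endswith("."):
        feats.add("EndsInPeriod@0")
    if re.fullmatch(r"(?:[A-Z]\.?){2,}", token):
        feats.add("Acronym@0")
    if re.search(r"\d", token):
        feats.add("HasDigit@0")
    if token.lower() in LAST_NAMES:
        feats.add("LastName@0")
    if token.lower() in TITLES:
        feats.add("Title@0")
    return feats


def sequence_features(tokens: list[str], window: int = 2) -> list[set[str]]:
    """Copy each token's features to its neighbors within the window."""
    base = [base_features(t) for t in tokens]
    result = []
    for i in range(len(tokens)):
        feats = set(base[i])
        for offset in range(-window, window + 1):
            j = i + offset
            if offset == 0 or not 0 <= j < len(tokens):
                continue
            for feat in base[j]:
                name = feat.rsplit("@", 1)[0]
                feats.add(f"{name}@{offset}")  # e.g. Cap@-2
        # Combination feature: title@-1 AND lastname (simplified).
        if "Title@-1" in feats and "LastName@0" in feats:
            feats.add("Title@-1&Name@0")
        result.append(feats)
    return result


features = sequence_features(["Dr.", "Williams", "was", "hired"])
```

Here the token "was" receives Cap@-2 because "Dr.", two tokens to its left, contains an uppercase letter.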

The feature computation does not compute features for isolated anchored tokens. The computations combine hashes, combine tries, and combine regular expressions. Features are computed only when necessary (e.g., punctuation tokens are not in any hash, so they are not looked up). Once the model has been trained, the Viterbi algorithm (or an algorithm of similar function) is used to efficiently find the most probable sequence of labels given the input and the trained model. After the algorithm determines the most probable sequence of labels, the text, such as tagged sentence 119, in which the entities have been located, is passed to a resolver, such as entity resolver 120.
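The decoding step can be sketched as a standard log-space Viterbi search. The scores below are invented toy numbers rather than the output of a trained model, and the function signature is an assumption made for illustration.

```python
from __future__ import annotations


def viterbi(obs_scores: list[dict[str, float]],
            trans_scores: dict[tuple[str, str], float],
            labels: list[str]) -> list[str]:
    """Return the highest-scoring label sequence.

    obs_scores[t][label] is the log-score of label at position t;
    trans_scores[(prev, label)] is the log-score of the transition.
    """
    best = [{lab: obs_scores[0][lab] for lab in labels}]
    backpointers = []
    for t in range(1, len(obs_scores)):
        column, pointers = {}, {}
        for lab in labels:
            prev = max(labels,
                       key=lambda p: best[-1][p] + trans_scores[(p, lab)])
            column[lab] = (best[-1][prev] + trans_scores[(prev, lab)]
                           + obs_scores[t][lab])
            pointers[lab] = prev
        best.append(column)
        backpointers.append(pointers)
    # Trace back from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["ORG", "O"]
# Toy scores for the tokens "Swatch", "Group", "rose" (hypothetical values).
obs = [{"ORG": -0.2, "O": -1.8},
       {"ORG": -1.0, "O": -0.9},
       {"ORG": -3.0, "O": -0.1}]
trans = {("ORG", "ORG"): -0.3, ("ORG", "O"): -1.2,
         ("O", "ORG"): -1.5, ("O", "O"): -0.5}
path = viterbi(obs, trans, labels)
```

With these toy scores the second token is labeled ORG even though its own score slightly favors O, because the ORG-to-ORG transition outweighs the difference.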

Entity resolver 120 provides additional information about an entity by matching it to an identifier for the external object, within authority files 130, to which the entity refers. The resolver in the exemplary embodiment uses rules rather than a statistical model to resolve named entities. In the exemplary embodiment, the external object is a company authority file that contains unique identifiers. The exemplary embodiment also resolves person names.

The exemplary resolver uses three types of rules to link names in text to authority file entries: rules for massaging the authority file entries, rules for normalizing the input text, and rules for using prior links to influence future links. Other embodiments include integration of the statistical model and the resolver.

This list, together with the original text, is the input to an entity resolution module. The entity resolution module takes these tagged entities and decides which element in an authority file the tagged entity refers to. In the exemplary embodiment, authority file 130 is a database of information about entities. For example, an authority file entry for Swatch may have an address for the company, a standard name such as Swatch Ltd., the name of the current chief executive officer, and a stock exchange ticker symbol. Each entry in the authority file has a unique identity. In the example above, a unique identification could be: ID:345428, "Swatch Ltd.", Nicholas G. Hayek Jr., UHRN.S. The goal of the resolver is to determine which entry in the authority file matches a name mention in the corresponding text. For example, it should determine that the Swatch Group refers to entity ID:345428. Of course, resolving names like Swatch is relatively easy compared with a name like Acme. However, even for names like Swatch, several related but different companies may be possible referents. What follows is a heuristic resolution algorithm used in the exemplary embodiment:

Heuristic resolution algorithm for companies

Iterate through the entities tagged by the CRF:
    If the tagged entity is ORG:
        If a "do not resolve" ORG (e.g., stock exchange abbreviations):
            set ID attribute to "UNRESOLVED"
        Else:
            If the entity is in the company authority file:
                set ID attribute to the company ID
            Else:
Iterate through the UNRESOLVED entities:
    If E is a left-anchored substring of a resolved company:
        set ID attribute to the ID of the matching resolved company,
        change the tag type to ORG, if necessary
    If E is an acronym of an already resolved company:
        set ID attribute to the ID of the already resolved non-acronym company
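The two passes of the heuristic can be sketched in Python as follows. Several details are assumptions rather than part of the algorithm as stated: every entity starts out as UNRESOLVED, an ORG that is not found in the authority file simply stays UNRESOLVED, the authority file is modeled as a dictionary from name variants to IDs, and the stop list `DO_NOT_RESOLVE` and the second company ID are invented for the example.

```python
from __future__ import annotations

from dataclasses import dataclass

UNRESOLVED = "UNRESOLVED"
DO_NOT_RESOLVE = {"NYSE", "NASDAQ"}  # hypothetical "do not resolve" list


@dataclass
class Entity:
    text: str
    tag: str
    id: str = UNRESOLVED  # assumption: entities start unresolved


def acronym(name: str) -> str:
    """Build an acronym from the initial letters of a company name."""
    return "".join(w[0] for w in name.split() if w[0].isalpha()).upper()


def resolve(entities: list[Entity], authority: dict[str, str]) -> None:
    """Resolve entities in place against a name -> company ID mapping."""
    resolved: dict[str, str] = {}
    # First pass: direct authority-file lookup for ORG entities.
    for e in entities:
        if e.tag != "ORG" or e.text in DO_NOT_RESOLVE:
            continue
        if e.text in authority:
            e.id = authority[e.text]
            resolved[e.text] = e.id
    # Second pass: link unresolved mentions to already resolved companies.
    for e in entities:
        if e.id != UNRESOLVED or not e.text:
            continue
        for name, company_id in resolved.items():
            is_prefix = name.startswith(e.text)
            is_acronym = e.text.replace(".", "").upper() == acronym(name)
            if is_prefix or is_acronym:
                e.id = company_id
                e.tag = "ORG"  # change the tag type if necessary
                break


authority = {"Swatch Group": "345428", "Swatch Ltd.": "345428",
             "International Business Machines": "900001"}  # second ID invented
entities = [Entity("Swatch Group", "ORG"), Entity("NYSE", "ORG"),
            Entity("Swatch", "PER"),
            Entity("International Business Machines", "ORG"),
            Entity("IBM", "ORG")]
resolve(entities, authority)
```

In this example the mis-tagged mention "Swatch" is linked to the same ID as "Swatch Group" and retagged as ORG, while "NYSE" remains UNRESOLVED.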

Note that the exemplary entity tagger and variations of it are useful not only for named entity tagging. Various important data mining tasks can be framed as sequence labeling. In addition, there are various problems for which high-precision (but low-recall) external classifiers are available that may have been trained on a separate training set.

Exemplary event and relationship extraction system

Figure 3 shows an exemplary system 300, which builds on the components of system 100 with a classifier 310 and a template extractor 320. These are shown as part of memory 102 and are understood to be implemented using machine-readable and machine-executable instructions.

Classifier 310, which accepts tagged and resolved text such as sentence 129 from resolver 120, identifies sentences that contain extractable relationship information pertaining to a specific relationship class. For example, if one is interested in the hiring relationship, where the relationship is hire (company, person), the filter (or classifier) 312 identifies sentence (1.1) as belonging to the class of sentences that contain a hiring or job-change event, and sentence (1.2) as not belonging to the class.

(1.1) John Williams has joined the firm of Skadden & Arps as an associate.

(1.2) John Williams manages the billing department at Skadden & Arps.

The exemplary embodiment implements classifier 310 as a binary classifier. In the exemplary embodiment, building this binary classifier for relationship extraction involves:

1) Extracting articles from a target database;

2) Splitting all of the articles into sentences and loading them into a single file;

3) Tagging and resolving the entity types relevant to a relationship type that occur within each sentence;

4) Selecting from the set of sentences all sentences that have the minimum number of tagged entities needed to form a relationship of interest. This means, for example, that at least the name of a person and the name of a law firm must be specified in a sentence for it to contain a job-change event. Sentences that contain the necessary number of tagged entity types are called candidate sentences;

5) Identifying 500 positive instances and 500 negative instances from the candidate set. A sentence in the candidate set that actually contains a relationship of interest is called a positive instance. A sentence in the candidate set that does not contain a relationship of interest is called a negative instance. All sentences within the candidate set are either positive or negative instances. These sampled instances should be representative of their respective sets and should be found as efficiently as possible;

6) Creating a classifier that combines selected features with selected training methods. Exemplary training methods include naive Bayes and Support Vector Machine (SVM). Exemplary features include co-occurring terms and syntax trees that connect the relationship entities; and

7) Testing the classification of sentences randomly selected from the candidate pool. After testing, the exemplary embodiment evaluates the first hundred sentences classified as positive (e.g., as containing a job-change event) and the first hundred classified as negative, calculating precision and recall and saving the evaluated sentences as gold data for future testing.
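Steps 4 and 7 above can be illustrated with two small helpers: one that tests whether a sentence carries the minimum set of tagged entity types, and one that computes precision and recall from manually evaluated sentences. The entity type names and the toy evaluation data are assumptions made for the sketch.

```python
from __future__ import annotations

REQUIRED_TYPES = {"PERSON", "LAW_FIRM"}  # hypothetical type names


def is_candidate(entity_types: list[str]) -> bool:
    """A candidate sentence contains every required entity type."""
    return REQUIRED_TYPES <= set(entity_types)


def precision_recall(gold: list[bool],
                     predicted: list[bool]) -> tuple[float, float]:
    """Compute precision and recall of predicted labels against gold labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy evaluation: True means "contains a job-change event".
gold = [True, True, False, False, True]
predicted = [True, False, True, False, True]
precision, recall = precision_recall(gold, predicted)
```

In the toy data there are two true positives, one false positive, and one false negative, so both measures come out at two thirds.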

A range of filters is developed, which are either document-dependent filters or complex relationship detection filters based on machine learning algorithms and tools that are easily retargeted to new document types. The structure of a document type provides very reliable clues as to where the information sought can be found. Ideally, the filter is flexible and automatically detects promising areas in a document. For example, a filter may include a machine learning tool (e.g., Weka) that detects promising areas and produces pipelines that can be changed according to the relevant features needed for the task.

Depending on the requirements, different levels of co-reference resolution can be implemented. In some domains, no co-reference resolution is used. Other situations use a relatively simple set of rules for co-reference resolution, based on recent mentions in the text and identifiable attributes (i.e., gender, plurality, etc.) of the named entities of interest. For example, in the job-change event, almost all co-reference problems are resolved simply by reference to the most recent mention of the matching entity type (i.e., law firm or attorney name).
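The most-recent-mention rule can be sketched as a single pass over the mentions in document order. The mention representation (text, entity type, and a flag marking referring expressions) and the type names are assumptions made for illustration.

```python
from __future__ import annotations


def resolve_by_recency(
        mentions: list[tuple[str, str, bool]]) -> list[tuple[str, str | None]]:
    """Link each referring expression to the latest name of the same type.

    mentions holds (text, entity_type, is_reference) tuples in document
    order. The result pairs each referring expression with its antecedent,
    or with None if no name of that type has been seen yet.
    """
    last_seen: dict[str, str] = {}
    links = []
    for text, entity_type, is_reference in mentions:
        if is_reference:
            links.append((text, last_seen.get(entity_type)))
        else:
            last_seen[entity_type] = text
    return links


mentions = [("Skadden & Arps", "LAW_FIRM", False),
            ("John Williams", "ATTORNEY", False),
            ("the firm", "LAW_FIRM", True),
            ("he", "ATTORNEY", True)]
links = resolve_by_recency(mentions)
```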

Template extractor 320 extracts event templates from positively classified sentences, such as sentence 319, from classifier 310. In the exemplary embodiment, extracting templates from sentences involves identifying the named entities that participate in the relationship and linking them so that their respective roles in the relationship are identified. A parser is used to identify noun phrase chunks and to provide a full syntactic parse of the sentence.

In the exemplary embodiment, implementing extractor 320 involves:

1) Creating gold data by taking positive example sentences from the classification phase and manually generating the appropriate template records. The user is automatically presented with all possible templates that could be generated from the sentence and is asked to select the correct one;

2) Taking 400 sentences from a gold data set as training data and developing extraction programs based on one or more of the following technologies: association rules, a chunk-based kernel, CRF, and a tree kernel based on syntactic structure;

3) Testing solutions on 100 held-out test samples;

4) Combining the classifier with the extractor to test accuracy using unseen data.

For example, a sentence that contains a job-change event is one that describes an attorney joining a law firm or other organization in a professional capacity. The target corpora from which job-change events are extracted are legal newspaper databases. The minimum number of tagged entities that qualifies a sentence for inclusion in the candidate set is an attorney name and a legal organization name.

One way to efficiently collect positive and negative training instances is to stratify the samples. This can be done by sorting the sentences according to the head word of the verb phrase that connects a person with a law firm in the sentence. Then, for every head verb that occurs at least five times, gather its sentences into a single bucket. After collection, select five example sentences from each bucket at random and mark them as positive or negative examples. For each bucket that yields only positive examples, add all remaining instances to the pool of positive examples. For each bucket that yields only negative examples, add all of its examples to the pool of negative examples. If there are fewer than 500 positive examples or fewer than 500 negative examples, manually grade randomly selected sentences until 500 examples of each kind are identified.

The job-change event extractor moves the identified entities of a positively classified job-change event sentence into a structured template record. The template record identifies the roles that the named entities and tagged phrases play in the event.
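The stratified collection of training instances described above can be sketched as follows. The sketch assumes that each sentence arrives already paired with its head verb and that a callable `label_fn` stands in for the human annotator; both are hypothetical. How to treat a bucket whose sample is mixed is not stated in the text, so the sketch keeps only the hand-labeled sentences from such buckets, and it omits the final manual top-up to 500 examples.

```python
from __future__ import annotations

import random
from collections import defaultdict
from typing import Callable


def stratify(sentences: list[tuple[str, str]],
             label_fn: Callable[[str], bool],
             min_count: int = 5,
             sample_size: int = 5,
             seed: int = 0) -> tuple[list[str], list[str]]:
    """Collect positive and negative pools from head-verb buckets."""
    rng = random.Random(seed)
    buckets: dict[str, list[str]] = defaultdict(list)
    for head_verb, sentence in sentences:
        buckets[head_verb].append(sentence)

    positives: list[str] = []
    negatives: list[str] = []
    for bucket in buckets.values():
        if len(bucket) < min_count:
            continue  # head verb is too rare to form a bucket
        sample = rng.sample(bucket, min(sample_size, len(bucket)))
        labels = [label_fn(s) for s in sample]
        if all(labels):
            positives.extend(bucket)  # sample plus remaining instances
        elif not any(labels):
            negatives.extend(bucket)
        else:
            # Mixed bucket (assumption): keep only the hand-labeled sample.
            for s, is_positive in zip(sample, labels):
                (positives if is_positive else negatives).append(s)
    return positives, negatives


data = ([("joined", f"Person{i} joined Firm{i}") for i in range(6)]
        + [("manages", f"Person{i} manages Firm{i}") for i in range(6)]
        + [("said", f"Person{i} said Firm{i}") for i in range(2)])
positives, negatives = stratify(data, label_fn=lambda s: "joined" in s)
```

Only five sentences per bucket are labeled by hand, yet all six "joined" sentences and all six "manages" sentences end up in the pools, which is the saving the stratification is meant to deliver.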

The following template (which also represents a data structure) refers to sentence 1.1 above.
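A template record of this kind could be held in a structure like the one sketched here. The class name and the field names (`event_type`, `person`, `organization`, `position`, `source_sentence`) are assumptions chosen for illustration and are not taken from the specification; only the values come from sentence 1.1.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class JobChangeTemplate:
    """Hypothetical structured record for a job-change event."""
    event_type: str
    person: str
    organization: str
    position: str
    source_sentence: str


record = JobChangeTemplate(
    event_type="job_change",
    person="John Williams",
    organization="Skadden & Arps",
    position="associate",
    source_sentence="1.1",
)
record_dict = asdict(record)
```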

Several assumptions are built into the exemplary embodiment. One main assumption is that the identity of the entities is usually independent of the way an event or relationship is talked about. Another assumption is that extracting sentences considered to be paraphrases, based on the equality of the constituent entities and the time window, is relatively error free. The precision of this latter filtering step is improved by additional checks, such as the cosine similarity between the documents in which the two sentences are found, the similarity of the document titles, etc. This approach involves:

1) P ro p o rc io n a r un g ran cu e rp o de d o cu m e n to s que p re fe rib le m e n te te n g a n la p rop ied a d de que d ive rso s d o cu m e n to s q ue h ab lan de l m ism o h echo o re lac ió n de d ife re n te s a u to re s son fá c ile s de e ncon tra r. Un e je m p lo es un cu e rp o de n o tic ia s con se llo de tie m p o de d ife re n te s fu e n te s de no tic ias , d on de es p rob a b le que el m ism o e ve n to sea cub ie rto p o r d ife re n te s fue n te s;1) P ro p o r tio n a large body of do cu men ts that p re fe rib le m a n t h ave the p ro p rop i e t of th a d ive rso d o cu men ts they talk about of the same made or relatio n of difer en te sau to re s are easy to find. An example is a time-stamped news body from different news sources, where it is likely that the same o e ven to be covered b y different sources;

2) Using a named entity recognizer to tag the entities in the corpus with reasonable accuracy. Clearly, the set of entities that the NER (named entity recognizer) must cover depends on the extraction problem;

3) Providing an indexer for efficient search and retrieval over the corpus;

4) Providing a human-generated list of high-precision sentences with the entities replaced by wildcards. For example, for M&A, a human could provide the rule “ORG1 acquired ORG2”, meaning that this is an M&A sentence with ORG1 as the buyer and ORG2 as the target.
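The four steps above can be sketched in Python. This is a minimal illustration under stated assumptions, not the patented implementation: the `SEED` pattern, the sentence dictionaries (`entities`, `date`), and the three-day window are all hypothetical choices made for the example.

```python
import re
from datetime import date, timedelta

# Hypothetical seed rule in which wildcards stand in for tagged organizations.
SEED = re.compile(r"(?P<buyer>ORG\d+) acquired (?P<target>ORG\d+)")

def apply_seed(masked_sentence):
    """Return buyer/target roles if the masked sentence matches the seed rule."""
    m = SEED.search(masked_sentence)
    if m is None:
        return None
    return {"buyer": m.group("buyer"), "target": m.group("target")}

def paraphrase_candidates(seed_hit, corpus, window_days=3):
    """Sentences with the same entity set as seed_hit, dated within the window."""
    window = timedelta(days=window_days)
    return [
        s for s in corpus
        if s is not seed_hit
        and set(s["entities"]) == set(seed_hit["entities"])
        and abs(s["date"] - seed_hit["date"]) <= window
    ]
```

A sentence that matches a seed rule supplies the entity pair and the date; other sentences mentioning the same pair inside the window are treated as likely paraphrases and could then be screened further, for instance by document similarity.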

Exemplary methods of operating a named-entity tagging, resolution, and event and relation extraction system

Figure 4 shows a flow diagram 400 of an exemplary method of operating a named-entity tagging, resolution, and event extraction system, such as system 300 in Figure 3. Flow diagram 400 includes blocks 410-460, which are arranged and described in series. However, other embodiments also provide different functional partitions or blocks to achieve analogous results.

Block 410 involves dividing the extracted text into tokens. Execution proceeds to block 420.

Block 420 involves locating the parts of the extracted text that need to be processed. In the exemplary embodiment, this involves using zone 112 to locate candidate sentences for processing. Execution then advances to block 430.

Block 430 involves finding the named entities within the processed parts of the extracted text. The entities of interest in the candidate sentences are then tagged. Candidate sentences are sentences from the target corpus that may contain a relation of interest. For example, one embodiment identifies text segments that indicate job change events; another identifies segments that indicate merger and acquisition activity; yet another identifies segments that may indicate corporate earnings announcements. Execution continues at block 440.

Block 440 involves resolving the named entities. Each entity is attached to a unique identifier that maps the entity to a unique real-world object, such as an entry in an authority file. Execution then advances to block 450.

Block 450 classifies the candidate sentences. The candidate sentences are classified into two sets: those that contain the relation of interest and those that do not. For example, one embodiment identifies text segments that indicate job change events; another identifies segments that indicate merger and acquisition activity; yet another identifies segments that may indicate corporate earnings announcements. When the text has been classified, execution advances to block 460.

Block 460 involves extracting the relation of interest using a template. More specifically, this involves extracting entities from the text that contains the relation and placing them in a relation template that properly defines the relation between the entities. When the template is completed, the extracted data can be stored in a database, but it can also feed more complex operations, such as representing the data along a timeline or mapping it into an index.
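The serial flow of blocks 410-460 can be sketched as a chain of small Python functions. This is a toy illustration only: the `AUTHORITY` dictionary, the keyword test standing in for the classifier, and the "first mentioned company is the acquirer" rule are assumptions made for the example and are not taken from the source.

```python
import re

# Hypothetical authority file: surface name -> unique entity identifier.
AUTHORITY = {"Acme Corp": "ORG-001", "Widget Inc": "ORG-002"}

def tokenize(text):
    """Block 410: split text into word and punctuation tokens."""
    return re.findall(r"\w+|[^\w\s]", text)

def candidate_sentences(text):
    """Block 420: locate the sentences to be processed."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def tag_entities(sentence):
    """Block 430: find known entity names in the sentence."""
    return [name for name in AUTHORITY if name in sentence]

def resolve(names):
    """Block 440: attach each name to its unique identifier."""
    return {name: AUTHORITY[name] for name in names}

def has_relation(tokens, names):
    """Block 450: toy stand-in for the sentence classifier."""
    return len(names) >= 2 and any(t.lower().startswith("acquir") for t in tokens)

def fill_template(sentence, resolved):
    """Block 460: place entities into a relation template by order of mention."""
    ordered = sorted(resolved, key=sentence.index)
    return {"acquirer": resolved[ordered[0]], "target": resolved[ordered[1]]}

def run(text):
    records = []
    for sentence in candidate_sentences(text):
        tokens = tokenize(sentence)
        names = tag_entities(sentence)
        if has_relation(tokens, names):
            records.append(fill_template(sentence, resolve(names)))
    return records
```

Keeping each block as its own function mirrors the point made in the text that other embodiments may partition the same work differently.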

Some embodiments of the present invention are implemented using a pipeline that adds annotations to text documents, with each component receiving the output of one or more preceding components. These implementations use the Unstructured Information Management Architecture (UIMA) framework; they ingest plain text and break it down into components. Each component implements interfaces defined by the framework and provides self-describing metadata through XML descriptor files. The framework manages these components and the data flow between them. The components are written in Java or C++; the data that flows between the components is designed for efficient mapping between these languages. In addition, UIMA provides a subsystem that manages the exchange between different modules in the processing pipeline. The Common Analysis System (CAS) holds the representation of the structured information that the Text Analysis Engines (TAEs) add to the unstructured data. The TAEs receive results from other UIMA components and produce new results that are added to the CAS. At the end of the processing pipeline, all results stored in the CAS can be extracted from it by the invoking application (for example, database population) through a CAS consumer. Primitive TAEs (for example, tokenizer, sentence splitter) can be grouped into an aggregate TAE. Other embodiments use alternatives to the UIMA framework.
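The idea of components that each add annotations to a shared structure can be sketched in plain Python. The sketch below is not the UIMA API; `Cas`, `Annotation`, `aggregate`, and the two toy engines are hypothetical names used only to illustrate how a downstream engine consumes what an upstream engine produced.

```python
from dataclasses import dataclass, field

@dataclass
class Annotation:
    kind: str
    begin: int
    end: int

@dataclass
class Cas:
    """Toy shared analysis structure: the raw text plus accumulated annotations."""
    text: str
    annotations: list = field(default_factory=list)

    def select(self, kind):
        return [a for a in self.annotations if a.kind == kind]

def sentence_splitter(cas):
    """Primitive engine: add one Sentence annotation per period-terminated span."""
    start = 0
    for i, ch in enumerate(cas.text):
        if ch == ".":
            cas.annotations.append(Annotation("Sentence", start, i + 1))
            start = i + 1

def tokenizer(cas):
    """Primitive engine: add Token annotations inside each upstream Sentence."""
    for sent in cas.select("Sentence"):
        pos = sent.begin
        for word in cas.text[sent.begin:sent.end].split():
            begin = cas.text.index(word, pos)
            cas.annotations.append(Annotation("Token", begin, begin + len(word)))
            pos = begin + len(word)

def aggregate(*engines):
    """Group primitive engines into one engine that runs them in order."""
    def run(cas):
        for engine in engines:
            engine(cas)
        return cas
    return run

def consumer(cas):
    """Read results back out of the shared structure at the end of the pipeline."""
    return [cas.text[t.begin:t.end] for t in cas.select("Token")]
```

Because every engine reads from and writes to the same structure, engines can be reordered or regrouped without changing their signatures, which is the property the aggregate engine relies on.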

Exemplary financial event extraction and resolution system and method

Figure 5 shows an extension or enhancement of system 300 in the form of a system 500 that automatically extracts and resolves financial events from text documents. Although not explicitly shown in this drawing, system 100 is implemented using one or more processors and memory devices, which store data and machine-readable and executable instruction sets. The processors and memory devices can be organized or arranged in any desirable centralized or distributed computing architecture. Some embodiments implement system 500 as a Java pipeline that can be easily integrated into an editorial workflow. The system can be configured to run in batch mode or as a web service.

In particular, system 500 includes a set of electronic documents 510, a relevance filter 520, recognizers 530, text segment classifiers 540, template or slot fillers 550, and an output module 560.

Documents 510 include a set of structured and/or unstructured text documents. For example, in the exemplary embodiment, documents 510 include press releases, news articles, and SEC (Securities and Exchange Commission) documents. Documents 510 are fed in batches or serially to relevance filter 520.

Relevance filter 520 includes one or more financial event classifiers. In the exemplary embodiment, filter 520 determines, using one or more machine-learning-based classifiers, whether the documents are likely to include text that is representative of a financial event that can be extracted by the system. Exemplary financial events include mergers and acquisitions, earnings announcements, or earnings guidance reports. The determinations can be based, for example, on whether two companies are mentioned in a single sentence or within some other defined text segment, such as a paragraph, or within a certain distance of each other, or on whether a monetary amount is mentioned near a company name or near terms related to the occurrence of a financial event. The determinations can also be based on the inclusion of terms such as merger, acquisition, and earnings, along with related roots, stems, synonyms, and so on. Documents that are determined to be unlikely to include a financial event are excluded from further processing, whereas those considered likely to include such events are fed to recognizers 530.
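The cues listed above can be turned into simple features. The sketch below is a hand-written stand-in, assuming hypothetical names (`relevance_features`, `is_relevant`), a 40-character proximity threshold, and a "two of three cues" rule; the source describes machine-learned classifiers, for which features like these would only be the input.

```python
import re

EVENT_TERMS = re.compile(r"\b(merg\w*|acqui\w*|earnings)\b", re.I)
MONEY = re.compile(r"\$\s?\d[\d,.]*(\s?(million|billion))?", re.I)

def relevance_features(sentence, company_names, max_gap=40):
    """Compute three boolean cues for one sentence."""
    company_spans = [
        (m.start(), m.end())
        for name in company_names
        for m in re.finditer(re.escape(name), sentence)
    ]
    money_spans = [(m.start(), m.end()) for m in MONEY.finditer(sentence)]
    # A monetary amount counts as "near" if it starts or ends within
    # max_gap characters of some company mention.
    near = any(
        min(abs(ms - ce), abs(cs - me)) <= max_gap
        for cs, ce in company_spans
        for ms, me in money_spans
    )
    mentioned = {name for name in company_names if name in sentence}
    return {
        "two_companies": len(mentioned) >= 2,
        "money_near_company": near,
        "event_term": bool(EVENT_TERMS.search(sentence)),
    }

def is_relevant(sentences, company_names):
    """Keep a document if any sentence shows at least two of the three cues."""
    return any(
        sum(relevance_features(s, company_names).values()) >= 2
        for s in sentences
    )
```

In a learned filter, the feature dictionary would be fed to a classifier rather than thresholded by hand.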

Recognizers 530 extract and resolve companies, percentages, and monetary amounts in the same general manner as described for system 100. In particular, recognizers 530 include a named entity extractor and resolver 532, a monetary extractor 534, and a temporal extractor 536. Named entity extractor and resolver 532 in the exemplary embodiment is identical to system 100 shown in Figure 1. Monetary extractor 534 identifies and tags percentage expressions and monetary expressions, including monetary ranges, the color of the money (actual earnings, projected earnings, etc.), and possibly a trend (for example, up or down). In the exemplary embodiment, this involves normalizing the percentage and the monetary amount, for example to U.S. currency. Temporal extractor 536 identifies and tags temporal terms and/or time windows. In the exemplary embodiment, the temporal extractor (for example, an ANTLR lexer, which is also used to parse monetary expressions) also grounds time expressions (for example, Q2 means the second quarter of the current year) and converts them to an ISO time value. The exemplary embodiment implements this extractor programmatically using the following:

- TIEMPO : { inicializar.tiempo(); }

{ Tiempo.calcularValor(); }

- The grounded-time information class records the temporal meaning of the expression and computes the grounded time.

- Indexicals: today, tomorrow, Wednesday

- Specific: 2008-05-06T02:30:30

- Periods: 3 months

- Indefinite: Monday night

- Anaphoric expressions: this period
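Grounding a time expression against the publication date of a document can be sketched in Python. This is an illustration under assumptions, not the ANTLR-based extractor of the source: `ground`, `quarter_start`, and the `fiscal_start_month` parameter are hypothetical, and only a few expression types are handled (indefinite and anaphoric expressions are returned as `None`).

```python
import re
from datetime import date, datetime, timedelta

def quarter_start(quarter, pub_date, fiscal_start_month=1):
    """First day of fiscal quarter 1-4 in the fiscal year containing pub_date."""
    fiscal_year = pub_date.year if pub_date.month >= fiscal_start_month else pub_date.year - 1
    month_index = (fiscal_start_month - 1) + 3 * (quarter - 1)
    return date(fiscal_year + month_index // 12, month_index % 12 + 1, 1)

def ground(expr, pub_date, fiscal_start_month=1):
    """Map a time expression to an ISO 8601 string, or None if it cannot be grounded."""
    text = expr.strip().lower()
    if text == "today":
        return pub_date.isoformat()
    if text == "tomorrow":
        return (pub_date + timedelta(days=1)).isoformat()
    m = re.fullmatch(r"q([1-4])", text)
    if m:
        return quarter_start(int(m.group(1)), pub_date, fiscal_start_month).isoformat()
    try:
        # Already specific, for example 2008-05-06T02:30:30.
        return datetime.fromisoformat(expr.strip()).isoformat()
    except ValueError:
        return None
```

With a calendar fiscal year, "Q2" in a document published on 2008-05-06 grounds to the quarter beginning 2008-04-01; for a company whose fiscal year starts in October, the same expression grounds to the quarter beginning 2008-01-01.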

To achieve this grounding functionality, the exemplary system uses a database that contains information about the fiscal year of various companies. Some embodiments restrict the tagging of time expressions to those longer than one month and those that are current relative to the publication date of the document. In addition, if there are several valid time expressions, the one closest to any monetary expression is tagged and the other is omitted unless it has a corresponding monetary expression. If there is a valid time expression, it is extracted. The output of recognizers 530, which takes the form of tagged sentences or other text segments, is fed to sentence classifiers 540.
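One reading of the "closest time expression" rule can be sketched as follows. The function name, the use of character offsets as `(begin, end)` spans, and the choice to measure distance in characters are assumptions made for illustration.

```python
def pair_times_with_money(time_spans, money_spans):
    """For each monetary span, keep the nearest time span; unpaired times are dropped."""
    def gap(a, b):
        # Character distance between two (begin, end) spans; 0 if they overlap.
        return max(a[0] - b[1], b[0] - a[1], 0)

    kept = {}
    for money in money_spans:
        if time_spans:
            kept[money] = min(time_spans, key=lambda t: gap(t, money))
    return kept
```

A time expression survives only if it is the nearest one to some monetary expression, so a sentence with one amount and two time expressions yields a single pairing.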

Sentence classifiers 540 (more generally, text segment classifiers) include a set of classifiers for directing the processing of the sentences or text segments to one or more record- or template-filling modules within slot fillers 550. Specifically, sentence classifiers 540 include an M&A (mergers and acquisitions) event classifier 542, a guidance event classifier 544, and an earnings event classifier 546.

M&A classifier 542 determines whether the tagged and resolved sentences (or more generally text segments) from recognizers 530 include an M&A event. Within the exemplary embodiment, an M&A event is defined as a relation between two companies and a monetary amount (or a percentage stake). The two companies in an M&A event are the acquirer and the target. An M&A event also has a status (i.e., rumored, planned, announced, pending, completed, withdrawn). Shown below is an example text containing an M&A event, together with the corresponding structured event record (data structure) produced by M&A slot filler 552 (relation extractor) and status classifier 558.

Example merger and acquisition text

Under the agreement announced Thursday, Glu Mobile (GLUU) will pay about $14.7 million in ADDED VALUE to acquire Beijing Zhangzhong MIG Information Technology Co. Ltd.

Extracted mergers and acquisitions template (record)

In the exemplary embodiment, creating a structured template given an input document involves identifying whether the document contains an M&A event and filling the template(s) with the correct entity information, such as the company name, the company identifiers, or the normalized monetary amount.
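A structured M&A record of the kind described can be sketched as a Python data class. The class and field names (`MaEvent`, `amount_usd`, `acquirer_id`, and so on) and the `normalize_usd` helper are hypothetical; the values in the usage note come from the example text above, and treating the ticker as a company identifier and the status as "announced" is an interpretation of that text rather than a record reproduced from the source.

```python
import re
from dataclasses import dataclass
from decimal import Decimal
from enum import Enum
from typing import Optional

class Status(Enum):
    RUMORED = "rumored"
    PLANNED = "planned"
    ANNOUNCED = "announced"
    PENDING = "pending"
    COMPLETED = "completed"
    WITHDRAWN = "withdrawn"

@dataclass
class MaEvent:
    acquirer: str
    target: str
    status: Status
    amount_usd: Optional[int] = None       # normalized monetary amount
    stake_percent: Optional[float] = None  # alternative to a monetary amount
    acquirer_id: Optional[str] = None      # hypothetical company identifier

SCALES = {"thousand": 10**3, "million": 10**6, "billion": 10**9}

def normalize_usd(expr):
    """Convert an expression such as '$14.7 million' to whole dollars, else None."""
    m = re.fullmatch(
        r"\$\s?([\d,]+(?:\.\d+)?)\s*(thousand|million|billion)?",
        expr.strip(),
        re.I,
    )
    if m is None:
        return None
    value = Decimal(m.group(1).replace(",", ""))
    scale = SCALES.get((m.group(2) or "").lower(), 1)
    return int(value * scale)

event = MaEvent(
    acquirer="Glu Mobile",
    target="Beijing Zhangzhong MIG Information Technology Co. Ltd.",
    status=Status.ANNOUNCED,
    amount_usd=normalize_usd("$14.7 million"),
    acquirer_id="GLUU",
)
```

Using `Decimal` for the multiplication avoids the rounding surprises that binary floating point can introduce when scaling amounts such as 14.7.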

M&A classifier 542 is implemented using a semi-supervised machine learning approach to determine which sentences contain acquirer-target company pairs. A rule-based approach is then used to associate one or more merger valuation figures or values with the acquirer-target pair. M&A status classifier 558 determines a status for the M&A event. The exemplary embodiment implements classifier 558 using a semi-supervised machine learning approach.

The success of any supervised machine learning approach depends on having high-quality training data. But training data requires manual tagging of hundreds of examples, so generating it can be costly and time-consuming. To alleviate this bottleneck, the exemplary embodiment employs a framework for generating large amounts of training data semi-automatically from a corpus of untagged, time-stamped news. Such methods are called “semi-supervised” because they require less human intervention in the training process. Sometimes multiple algorithms can be used to train each other (co-training), or high-recall features can be used to train other features (surrogate learning). Starting from a small set of 15 seed patterns (for example, “acquisition of ORG”), we derive the training data from a large corpus of untagged news. The training data was then used to learn models that identify the different pieces of information needed to extract a structured record for each M&A event from the input document.

The minimum number of tagged entities that qualifies a sentence for inclusion in the candidate set is two company names. To help collect training data, the exemplary embodiment uses structured records from the mergers and acquisitions database in the Westlaw® information retrieval system (or another suitable information retrieval system) to identify merger and acquisition events that have taken place in the recent past.

To identify positive training instances in the candidate set efficiently, the exemplary embodiment finds sentences that contain the entity names matching these records and that were published during the time period in which the merger event took place. To identify negative instances, the exemplary embodiment selects sentences that contain companies known not to have been involved in a merger or acquisition. Once the system determines that a text segment includes an M&A event, the segment is passed to M&A event extractor 552, which copies or places identified entities and tagged expressions from a positively classified M&A event sentence (text segment) into a structured template record that identifies the roles of the named entities and tagged expressions in the event.
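By way of illustration only, the labeling rules described above might be sketched as follows. The record and sentence types, their field names, and the choice to require every company in a negative sentence to be on the non-deal list are assumptions made for this sketch and are not taken from the described system.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class DealRecord:
    """Hypothetical structured M&A record from a deals database."""
    acquirer: str
    target: str
    start: date  # start of the period in which the deal took place
    end: date    # end of that period


@dataclass(frozen=True)
class Sentence:
    text: str
    companies: frozenset  # resolved company names tagged in the sentence
    published: date


def label_candidates(sentences, deals, non_deal_companies):
    """Return (sentence, label) pairs, label 1 = positive, 0 = negative.

    Candidates that satisfy neither rule stay unlabeled and are skipped.
    """
    labeled = []
    for s in sentences:
        if len(s.companies) < 2:
            continue  # the candidate set requires two company names
        positive = any(
            {d.acquirer, d.target} <= s.companies
            and d.start <= s.published <= d.end
            for d in deals
        )
        if positive:
            labeled.append((s, 1))
        elif s.companies <= non_deal_companies:
            # Assumption: all tagged companies are known non-participants.
            labeled.append((s, 0))
    return labeled
```

Sentences that mention both parties of a known deal but fall outside the deal period are deliberately left unlabeled in this sketch, since the text does not say how such cases are treated.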

Guidance event classifier 544 determines whether the tagged and resolved sentences (or, more generally, text segments) from recognizers 530 include a guidance event. Within the exemplary embodiment, a guidance event is defined as a relation between a company, a complex money amount, and a future time period. The complex money amount is called a MONEX for our purposes and may contain a money amount (or range), the color of the money (for example, earnings), and possibly a trend (for example, up or down). An example of a guidance statement and the corresponding event template produced by guidance event extractor 554 is shown below.

Sample guidance text

CA raised its full-year 2008 forecast, now expecting earnings of 87 cents to 91 cents per share and revenue in the range of $4.15 billion to $4.2 billion. (The tagged terms or phrases are highlighted in bold.)

Extracted guidance template

Because the language used in guidance events is somewhat formulaic, the exemplary guidance event classifier uses a rule-based approach to determine whether a text segment includes a guidance event. One aspect of this determination is deciding whether a time period tagged in the text segment is a future time period relative to a current time period or to a publication date associated with the document containing the text segment. In addition, the color of the MONEX is determined. “Earnings of $10 to $12 per share” describes a MONEX containing the following slots: [MinValue: 10, MaxValue: 12, Currency: Dollars, Measure: EPS]. The classifier then identifies the respective company and the time period.
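As a rough, non-authoritative sketch of how such slots could be filled by rule, the following uses a single regular expression. The pattern, the function names, and the restriction to dollar amounts are assumptions for illustration; only the slot names and the example phrase come from the text.

```python
import re
from datetime import date

# Illustrative pattern only: "<earnings|revenue> of $X [to $Y] [per share]".
MONEX_RE = re.compile(
    r"(?P<measure>earnings|revenue)\s+of\s+"
    r"\$(?P<lo>\d+(?:\.\d+)?)"
    r"(?:\s+to\s+\$(?P<hi>\d+(?:\.\d+)?))?"
    r"(?P<per_share>\s+per\s+share)?",
    re.IGNORECASE,
)


def parse_monex(text):
    """Fill MONEX slots from a phrase, or return None if nothing matches."""
    m = MONEX_RE.search(text)
    if m is None:
        return None
    lo = float(m.group("lo"))
    hi = float(m.group("hi")) if m.group("hi") else lo  # single amount
    measure = m.group("measure").lower()
    if measure == "earnings" and m.group("per_share"):
        measure = "EPS"  # earnings per share
    return {"MinValue": lo, "MaxValue": hi,
            "Currency": "Dollars", "Measure": measure}


def is_future_period(period_end, publication_date):
    """A tagged period counts as future if it ends after publication."""
    return period_end > publication_date


slots = parse_monex("Earnings of $10 to $12 per share")
```

Comparing the end of the tagged period with the publication date is one simple reading of “future relative to the publication date”; the text does not specify the exact comparison used.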

Earnings event classifier 546 determines whether the tagged and resolved sentences (or, more generally, text segments) from recognizers 530 include an earnings event. The exemplary embodiment defines an earnings event as a relation between a company, a complex money amount, and a past time period. The complex money amount is called a MONEX for our purposes and may contain a money amount (or range), the color of the money (for example, earnings), and possibly a trend (for example, up). An example of an earnings event and its corresponding structured record produced by earnings event extractor 556 is shown below.

Sample earnings text

Genpact Ltd., (G) the Gurgaon, India, manager of business processes for companies, reported that third-quarter earnings rose 27% on revenue that was 32% higher. Earnings reached $16.3 million, compared with $12.8 million in the same period of the previous year.

Extracted earnings template

As with guidance event processing, the exemplary embodiment uses a rule-based approach to classify earnings events because the underlying language is generally formulaic. In some embodiments, the minimum set of tagged entities that qualifies a sentence for inclusion in the candidate set (that is, as potentially including an earnings event) is a company name together with the phrase “net income” or the word “earnings”. To find positive cases efficiently, the exemplary embodiment extracts net income information from SEC documents for particular companies and finds positive candidates when the company named in the sentence and the dollar amount or percentage increase in earnings over a time period align with the information in an SEC document. Negative instances are found when the data for a particular company does not match the SEC records. Earnings event extractor 556 (net income announcement event extractor) moves the identified entities from a positively classified net income (earnings) announcement event sentence into a structured template record. The template record identifies the roles that the named entities and tagged phrases play in the event.

For a text segment to include a guidance or earnings event, some embodiments impose the rule that it must include at least one resolved company name that is not an analyst company (for example, Thomson First Call or MarketWatch) and one monetary expression.
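A minimal sketch of this precondition, given purely for illustration, is shown below. The function name and the way tagged entities are passed in are assumptions; the two analyst firms are the examples named in the text.

```python
# Analyst firms named in the text as examples; a real list would be longer.
ANALYST_FIRMS = {"Thomson First Call", "MarketWatch"}


def may_contain_event(resolved_companies, monetary_expressions,
                      analyst_firms=ANALYST_FIRMS):
    """True if the segment has a non-analyst company and a money expression."""
    has_company = any(c not in analyst_firms for c in resolved_companies)
    return has_company and len(monetary_expressions) > 0
```

A segment that fails this check can be discarded before any guidance or earnings rule is evaluated.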

In addition to text segment classifiers 540 and relation extractors (slot fillers) 550, system 500 includes output modules 560.

Output modules 560 include a database creation module 562 and a report creation module 564. Database creation module 562 creates a database from the event templates or records populated by relation extractors 550, which allows, for example, the event data to be accessed easily through conventional search.
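Purely as an illustration of storing filled templates so that they can be searched conventionally, the sketch below loads records into an in-memory SQLite database. The table layout, column names, and the use of SQLite are assumptions; the text does not describe the database schema or technology.

```python
import sqlite3


def build_event_db(records):
    """Create an in-memory database from filled event templates.

    Each record is a dict with hypothetical keys: event_type, company,
    amount, currency, period.
    """
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE events ("
        "event_type TEXT, company TEXT, amount REAL, "
        "currency TEXT, period TEXT)"
    )
    conn.executemany(
        "INSERT INTO events VALUES "
        "(:event_type, :company, :amount, :currency, :period)",
        records,
    )
    conn.commit()
    return conn


def companies_with_event(conn, event_type):
    """Conventional search: companies that have an event of the given type."""
    rows = conn.execute(
        "SELECT company FROM events WHERE event_type = ? ORDER BY company",
        (event_type,),
    )
    return [company for (company,) in rows]
```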

Report generator

Exemplary extraction of information from tables found in text

System 500 makes use of SEC filing data, for example, to determine timing, discern earnings trends, and so on. To facilitate the use of this data, the exemplary embodiment employs a novel system and methodology for extracting information from the tables found in the text of these documents. One component of the table data extraction system is an SVM classifier (or another functionally similar classifier) that distinguishes tables from non-tables. Tables that are used only for formatting reasons are identified as non-tables. In addition, tables are classified as tables of interest, such as background, compensation, and so on. The feature set comprises the text before and after the tables as well as n-grams of the text in the table. The tables of interest are then processed as follows:

1) label/value detection. The table must be divided into labels and values. For the following example table, the system determines that the money amounts are values and the rest are labels;

2) label grouping. Some labels are grouped. For example, Eric Schmidt and his current position are one label. On the other hand, a table containing a year and a list of term names (that is, winter, spring, fall) are not grouped;

3) abstract table derivation. A derived Cartesian coordinate system leads to a notation that defines each value accordingly. [Name and principal position. Eric Schmidt Chairman of the Executive Committee and Chief Executive Officer. Year.2005, Annual compensation. Salary ($)]=1;

4) relation extraction. Given the abstract table representation, the desired relations are derived. The compensation relation, for example, is populated with: NAME: Eric Schmidt; COMPENSATION TYPE: salary; AMOUNT: 1; CURRENCY: $. Finally, an interpreter is created for the tables of interest. The input to the interpreter is a table and the output is a list of relations represented by the table.
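Steps 3) and 4) above can be sketched roughly as follows. This is an illustrative reading only: the toy table is reconstructed from the single salary value quoted in the text, and its layout, the 0-based indices, and both function names are assumptions.

```python
def derive_abstract_table(rows, last_label_row, last_label_col):
    """Key each value cell by its (row-label path, column-label path).

    Indices are 0-based; cells up to and including last_label_row and
    last_label_col are treated as labels, everything else as values.
    """
    abstract = {}
    for r in range(last_label_row + 1, len(rows)):
        row_labels = tuple(rows[r][: last_label_col + 1])
        for c in range(last_label_col + 1, len(rows[r])):
            col_labels = tuple(rows[h][c] for h in range(last_label_row + 1))
            abstract[(row_labels, col_labels)] = rows[r][c]
    return abstract


def extract_compensation(abstract):
    """Fill one compensation relation per salary value in the table."""
    relations = []
    for (row_labels, col_labels), value in abstract.items():
        header = col_labels[-1]
        if "salary" in header.lower():
            relations.append({
                "NAME": row_labels[0].split(",")[0],
                "COMPENSATION TYPE": "salary",
                "AMOUNT": value,
                "CURRENCY": "$" if "$" in header else None,
            })
    return relations


# Toy table rebuilt from the notation quoted in the text (layout assumed).
table = [
    ["Name and principal position", "Year", "Annual compensation"],
    ["", "", "Salary ($)"],
    ["Eric Schmidt, Chairman of the Executive Committee and "
     "Chief Executive Officer", "2005", "1"],
]
abstract_table = derive_abstract_table(table, last_label_row=1, last_label_col=1)
compensation = extract_compensation(abstract_table)
```

Keying each value by its full label paths mirrors the coordinate notation in step 3), so the relation in step 4) can be read off without revisiting the original table layout.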

For the exemplary embodiment, we downloaded hundreds of documents from the EDGAR database and annotated 150 of them for training and evaluation. We converted the documents to XHTML using Tidy (Raggett) before annotating them.

Table 3: A compensation table

Our information extraction system for genuine tables involves the following processes:

1. table classification

2. label row and column classification

3. table structure recognition

4. table understanding

Process 1, which improves efficiency, involves identifying tables that have a reasonable probability of containing the desired relation before other, more computationally expensive processes are applied. Tables that contain the desired information are identified quickly using relation-specific classifiers based on supervised machine learning.

Process 2 involves distinguishing label columns and label rows from the values within those tables. Here the same supervised machine learning approach is used, but the training data differs from that of process 1.

In process 3, after the label rows and columns have been identified, an elaborate procedure is applied to these complex tables to ensure that semantically coherent labels are not split across multiple cells and that multiple distinct labels are not squashed into a single cell. The goal here is to associate each value with its labels in the same column and the same row. The result of process 3 is a list of attribute-value pairs. In process 4, a rule-based inference module goes through each of the attribute-value pairs and identifies the desired ones to populate the officers and directors database.

The exemplary embodiment makes use of annotation when performing the supervised learning employed in both process 1 and process 2. To make the exemplary system more robust against lexical variations and table variations, supervised machine learning is used in processes 1 and 2. In supervised learning, one of the most challenging and time-consuming tasks is obtaining the labeled examples. To facilitate reuse across different domains, the exemplary embodiment uses a scheme that reduces or minimizes the human annotation effort required.

For tables that contain the desired information, the exemplary embodiment uses the following annotations:

1. isGenuine: a flag indicating whether the table is genuine or non-genuine.

2. relations: the relations that a table contains, such as “name title”, “name age”, “name year salary” or “name year bonus”, or a combination of them.

3. isContinuous: a flag indicating whether this table is a continuation of the previous genuine table.

4. last label row: the row number of the last label row.

5. last label column: the column number of the last label column associated with each relation.

6. valueColumn: the number of the column that contains the desired values for each relation.

The specified relations are used as training instances to build models for process 1. The last label row and last label column information is used to build models that classify rows and columns as label rows or label columns in process 2. In our guide for annotators, they are specifically asked to annotate the column number of the last label column for each relation.

The need for such fine-grained annotation is best illustrated with an example. In Table 3, for the relation “name title”, the last label column is 1, the “name and principal position” column. But for the relation “name year bonus”, the last label column is 3, “fiscal year”. When several relations are extracted from one table, these relations may share the same last label column, but that is not always the case. As a result, the associated label column must be annotated separately for each relation. The isContinuous flag indicates whether the current table is a continuation of the previous table. If so, the current table may “borrow” the header from the previous table, since that information is missing. The exemplary embodiment removes the tables marked with the isContinuous flag during training, but kept those tables during evaluation. The valueColumn annotation may be used for automatic evaluation in the future.
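One possible in-memory representation of this annotation scheme is sketched below for illustration. The class, its field names, and the use of dictionaries keyed by relation name are assumptions; the per-relation column numbers in the example are the ones given for Table 3 in the text.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class TableAnnotation:
    """Hypothetical record holding the six annotations for one table."""
    is_genuine: bool
    relations: List[str] = field(default_factory=list)
    is_continuous: bool = False
    last_label_row: Optional[int] = None
    # Annotated separately per relation, since relations may differ.
    last_label_column: Dict[str, int] = field(default_factory=dict)
    value_column: Dict[str, int] = field(default_factory=dict)


def training_subset(annotations):
    """Drop continuation tables, which are excluded during training."""
    return [a for a in annotations if not a.is_continuous]


# Table 3 as described: the two relations have different last label columns.
table3 = TableAnnotation(
    is_genuine=True,
    relations=["name title", "name year bonus"],
    last_label_column={"name title": 1, "name year bonus": 3},
)
```

Leaving `value_column` empty is also how this sketch would represent tables for which no valueColumn annotation is provided.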

There are a few rare cases in the corpus where the default layout of the header and stub, as shown in Table 3, is swapped. Currently, in our annotation, we simply do not provide “valueColumn” for the relations, since it does not apply. For the table classification and table understanding tasks this is not a major problem, but the annotation scheme above would need to be modified further to capture this difference.

Table classification: The example embodiment classifies or filters tables based on whether they are likely to include the desired relational information before attempting the detailed extraction processes. To identify tables that contain the desired relations, we use LIBSVM (Chang & Lin 2001), a well-known implementation of the support vector machine. Based on the annotated tables, a separate model is trained for each desired relation. In the SEC domain, a table can contain several relations.

Example features for use in the SVM include:

• the top 1000 words inside the tables in the corpus and the top 200 words in the text preceding the tables. These thresholds are based on experiments using LIBSVM with 5-fold cross-validation. A stop-word list was used.

• number of words in tables that are label words

• number of cells containing a single word

• number of cells containing numbers

• maximum string length of cells

• number of names

• number of label words in the first row
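The count-based features listed above could be computed roughly as follows. This is a minimal sketch under stated assumptions: the function name `table_features`, the dictionary keys, and the sample table (including its salary value) are hypothetical and not part of the described embodiment, and the vocabulary features (top 1000 and top 200 words) are omitted.

```python
import re

def table_features(rows, label_words, name_words):
    """Compute simple count features for one table.

    rows: list of rows, each a list of cell strings (hypothetical input format).
    label_words / name_words: sets of lower-cased words (assumed to be given).
    """
    cells = [cell.strip() for row in rows for cell in row]
    tokens = [tok.lower() for cell in cells for tok in cell.split()]
    first_row = [tok.lower() for cell in rows[0] for tok in cell.split()] if rows else []
    return {
        "label_words": sum(tok in label_words for tok in tokens),
        "single_word_cells": sum(len(cell.split()) == 1 for cell in cells),
        "numeric_cells": sum(bool(re.search(r"\d", cell)) for cell in cells),
        "max_cell_length": max((len(cell) for cell in cells), default=0),
        "names": sum(tok in name_words for tok in tokens),
        "label_words_first_row": sum(tok in label_words for tok in first_row),
    }

# Hypothetical two-row table; the values are made up for illustration.
example = [
    ["name", "fiscal year", "salary"],
    ["John T. Chambers", "2005", "350,000"],
]
features = table_features(
    example,
    label_words={"name", "fiscal", "year", "salary"},
    name_words={"john", "chambers"},
)
```

Each table would yield one such feature dictionary, which can then be converted into the sparse vector format expected by an SVM package.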

The example embodiment then uses one model for each desired relation. Because “name year salary” and “name year bonus” coincide 100% of the time in the annotated corpus, the same classifier was used for both relations. In this domain, the number of negative instances is significantly larger than the number of positive instances, perhaps because signature tables and tables containing background information in sentence form create a significant overlap between positive and negative instances. To address this, the example embodiment uses only a subset of the negative instances for training (75% of our training instances are negative instances). We also train a separate model to distinguish genuine from non-genuine tables based on annotated data. This second model is independent of the relation. Its feature set is similar to the feature set described above.
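One way to obtain a training set in which 75% of the instances are negative is to keep every positive instance and randomly sample negatives until that share is reached. The sketch below assumes this sampling strategy; the function name `subsample_negatives`, the random seed, and the toy data are hypothetical, and the text does not say how the subset was actually chosen.

```python
import random

def subsample_negatives(positives, negatives, negative_share=0.75, seed=0):
    """Keep all positives and sample negatives so they make up negative_share of the result."""
    # n_neg / (n_pos + n_neg) = share  =>  n_neg = share * n_pos / (1 - share)
    target = int(round(negative_share * len(positives) / (1.0 - negative_share)))
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    kept = rng.sample(negatives, min(target, len(negatives)))
    return positives, kept

# Toy data: 10 positive tables and 200 negative tables.
pos = [f"pos{i}" for i in range(10)]
neg = [f"neg{i}" for i in range(200)]
kept_pos, kept_neg = subsample_negatives(pos, neg)
```

With 10 positives this keeps 30 negatives, so negatives account for 30 of 40 training instances.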

To identify which words are likely to be names, we downloaded the list of names from the U.S. Census Bureau. The name list is further filtered by removing common words, such as “white”, “cook” or “president”, based on an English word list (Atkinson, August 2004). Although it is feasible to use a list of common title words, the example embodiment does not use such information, so that it can operate more easily in other domains. However, in embodiments that do use such a domain-specific list, this information would likely improve precision and recall significantly for extracting the “name title” relation.
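The name-list filtering can be illustrated with plain set operations. This is a sketch only: the helper names `filter_name_list` and `count_name_words` are hypothetical, and the tiny lists below stand in for the Census name list and the English word list.

```python
def filter_name_list(names, common_words):
    """Remove names that are also common English words."""
    common = {w.lower() for w in common_words}
    return {n.lower() for n in names} - common

def count_name_words(cell_text, name_list):
    """Count tokens in a cell that appear in the filtered name list."""
    return sum(tok.strip(".,").lower() in name_list for tok in cell_text.split())

# Stand-in data (assumed), not the real lists.
census_names = ["John", "Chambers", "White", "Cook", "President"]
english_words = ["white", "cook", "president", "table"]
name_list = filter_name_list(census_names, english_words)
```

Here `count_name_words("John T. Chambers", name_list)` counts two name words, while a title cell such as "President, Chief Executive" counts none.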

Label row and column classification: Based on the annotated data, LIBSVM is used again to classify which rows belong to the header and which columns belong to the stub. The training data for the models are words in the desired tables that were manually identified as header and stub words through the annotation features últimafiladeetiqueta (“last label row”) and últimacolumnadeetiqueta (“last label column”). Other features used include the frequency of label words, the frequency of name words, and the frequency of numbers.

For each relation, the example embodiment uses a different label column classifier, since últimacolumnadeetiqueta may differ between relations, as explained in the Annotation section.

Table structure recognition: Because the tables in SEC filings are somewhat complex and are formatted for visual purposes, a significant amount of effort is needed to normalize a table to facilitate later operations. Once the label rows and columns have been identified, several normalization operations are carried out:

1. create duplicate cells based on row span and column span

2. merge cells into coherent label cells

3. identify sub-headers

4. split a specific column according to a combination marker, such as “and” or parentheses (before the last label column)

5. split cells that contain several labels, such as the years “2005, 2006, 2007”.

Step 1 specifically addresses the problem caused by the use of column span and row span in an HTML table, as has been done in (Chen, Tsai and Tsai 2000). In Table 3, without copying the original labels into the spanned cells, the label “annual compensation” would not be attached to the value “1,300,000” using the HTML specification alone. After this step, we only need to associate all labels in the boxhead of that particular column with the value, and can ignore the other columns.
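Span expansion of this kind can be sketched as follows. The input format (each cell as a `(text, rowspan, colspan)` tuple), the function name `expand_spans`, and the sample header are assumptions made for illustration; a real system would first parse these values from the HTML.

```python
def expand_spans(rows):
    """Copy each cell's text into every grid position covered by its row/column span.

    rows: list of rows; each cell is a (text, rowspan, colspan) tuple.
    Returns a rectangular grid (list of lists of strings).
    """
    grid = {}
    for r, row in enumerate(rows):
        c = 0
        for text, rowspan, colspan in row:
            # Skip positions already filled by a cell spanning down from an earlier row.
            while (r, c) in grid:
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max(r for r, _ in grid) + 1
    n_cols = max(c for _, c in grid) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]

# Hypothetical two-row header in which "annual compensation" spans two columns.
header = [
    [("name", 2, 1), ("fiscal year", 2, 1), ("annual compensation", 1, 2)],
    [("salary ($)", 1, 1), ("bonus ($)", 1, 1)],
]
grid = expand_spans(header)
```

After expansion, the column holding "bonus ($)" also carries "annual compensation" in the row above it, so both labels can be found by reading straight up the column.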

In step 2, we use certain layout information, such as underlining, empty lines, or background color, to determine when a label is really complete. In SEC filings, there are several cases where a label is split across several cells in the header or in the stub. In those cases, we want to recreate the semantically meaningful labels to facilitate the later relation extraction, a process that depends largely on the quality of the labels attached to the values. For example, in Table 3, based on the separating row 5, the cells “John T. Chambers”, “President, Chief Executive” and “Officer and Director” are merged into one cell, with the line-break marker (#) inserted at the original break positions. The new cell is “John T. Chambers#President, Chief Executive#Officer and Director”; it is stored in the cell of row 2 and copied to the cells of rows 3 and 4.
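The merge in this example can be sketched as below, assuming the separator rows have already been found from the layout cues. The function name `merge_label_cells`, the 0-based row indexing, and the trailing "Next Person" cell are hypothetical.

```python
def merge_label_cells(stub_cells, separator_rows, marker="#"):
    """Merge the stub cells between separator rows into one label and copy it to each row.

    stub_cells: stub-column text for each body row (0-based; assumed input format).
    separator_rows: indices of rows that end a label group (e.g. empty lines).
    """
    merged = list(stub_cells)
    start = 0
    for end in sorted(separator_rows) + [len(stub_cells)]:
        group = range(start, end)
        label = marker.join(stub_cells[i] for i in group if stub_cells[i])
        for i in group:
            merged[i] = label  # every row in the group gets the full label
        start = end + 1  # continue after the separator row
    return merged

stub = [
    "John T. Chambers",
    "President, Chief Executive",
    "Officer and Director",
    "",             # separator row (empty line)
    "Next Person",  # hypothetical following entry
]
merged = merge_label_cells(stub, separator_rows={3})
```

The marker keeps the original cell boundaries recoverable, which later splitting steps rely on.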

In step 4, heuristic rules were applied to identify sub-headers. For example, if there is no value in the whole row except in the first label cell, then that label cell is classified as a sub-header. The sub-header label is assigned as part of the label to every cell below it until a new sub-header label cell is encountered.
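The sub-header heuristic can be sketched as follows. The function name `propagate_subheaders`, the use of "|" to join the sub-header with the row label, the choice to drop the sub-header rows from the output, and the sample rows are all assumptions made for illustration.

```python
def propagate_subheaders(rows, sep="|"):
    """Detect sub-header rows and prefix their label onto the rows beneath them.

    rows: list of rows; the first cell is the label, the rest are value cells.
    A row with a label and no values is treated as a sub-header.
    """
    result = []
    current = None
    for row in rows:
        label, values = row[0], row[1:]
        if label and not any(v.strip() for v in values):
            current = label  # new sub-header; applies until the next one
            continue
        new_label = f"{current}{sep}{label}" if current else label
        result.append([new_label] + list(values))
    return result

# Hypothetical table body with two sub-headers.
body = [
    ["Revenue", "", ""],
    ["Products", "10", "12"],
    ["Services", "5", "6"],
    ["Expenses", "", ""],
    ["Salaries", "3", "4"],
]
labelled = propagate_subheaders(body)
```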

Step 5 splits certain columns into several columns to ensure that a value cell does not contain multiple values. For example, in Table 3, the first cell of the first column is “name and principal position”. The system detects the word “and” and splits the column into two columns, “name” and “principal position”, and performs similar operations on all cells of the original column. Recall that in step 3, the cell in row 2 is the result of merging 3 cells, with line-break markers between the strings of the original cells. By default, we use the first line-break marker to split the combined cell into two cells. After this transformation, we have “John T. Chambers” and “President, Chief ...”, which correspond to “name” and “principal position”. This type of operation is not limited to “and”; it also applies to certain parentheses, as in “Non-Director Executive Officer (Age as of February 28, 2006)”. Such cells are split in two, as are the other cells in the same column.
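A minimal sketch of the “and” case: the header cell is split on the word “and”, and each body cell is split at its first line-break marker. The function names are hypothetical, and the parenthesis case is not handled here.

```python
def split_header(header):
    """Split a header such as 'name and principal position' on the word 'and'."""
    left, sep, right = header.partition(" and ")
    return (left.strip(), right.strip()) if sep else None

def split_body_cell(cell, marker="#"):
    """Split a merged body cell at the first line-break marker only."""
    left, _, right = cell.partition(marker)
    return left, right

def split_column(header, body_cells, marker="#"):
    """Return the two new headers and the split body cells, or None if no split applies."""
    parts = split_header(header)
    if parts is None:
        return None
    return parts, [split_body_cell(cell, marker) for cell in body_cells]

result = split_column(
    "name and principal position",
    ["John T. Chambers#President, Chief Executive#Officer and Director"],
)
```

Splitting only at the first marker leaves the remaining markers inside the second cell, so a multi-line title stays together.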

Step 6 deals with repeated sequences in the last label column. In Table 3, we are fortunate that every “fiscal year” cell contains only 1 value. There are cases in our corpus where such information is represented within a single cell, with a line break between the values. In such cases, there are no lines between these values, and the resulting table looks cleaner and is therefore visually more pleasing. It is clearly incorrect to assign the 3 years “2005, 2004, 2003” to the cell that contains the additional information “1,300,000”. To solve this problem, our system performs repeated-sequence detection on all last label columns. If a sequence pattern is detected, which does not always have to be exactly the same, the repeated sequence is split into several cells so that each cell can be correctly assigned to its associated value.
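The splitting part of this step can be sketched for the simplest case, where a label cell packs several values separated by line breaks and the value cells in the same row pack the same number of entries. The function name `split_repeated_sequences` and the sample figures other than “1,300,000” are hypothetical, and the detection of inexact sequence patterns mentioned above is not attempted.

```python
def split_repeated_sequences(rows, label_col, sep="\n"):
    """Split rows whose label cell holds several values separated by sep.

    Cells with one entry per label are split alongside the label cell;
    all other cells are copied unchanged into each new row.
    """
    out = []
    for row in rows:
        labels = row[label_col].split(sep)
        if len(labels) == 1:
            out.append(list(row))
            continue
        columns = [cell.split(sep) for cell in row]
        for i in range(len(labels)):
            out.append([
                parts[i] if len(parts) == len(labels) else sep.join(parts)
                for parts in columns
            ])
    return out

# One physical row holding three fiscal years; the last two amounts are made up.
packed = [["John T. Chambers", "2005\n2004\n2003", "1,300,000\n1,900,000\n2,100,000"]]
unpacked = split_repeated_sequences(packed, label_col=1)
```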

Transforming a normalized table into Wang's representation (Wang 1996) is a trivial process. Given a value cell at (r, c), all label cells in column (c) and row (r) are its associated labels. In addition, the labels in the stub may themselves have additional associated labels in the header, and these should also be associated with the value cell. For example, the value “1,300,000” will have the following 4 associated labels: [annual compensation|bonus ($)(1)], [fiscal year|2005], [principal position|president, chief executive officer and director], [name|John T. Chambers]. The “|” characters inside these associated labels indicate the hierarchical relationship between the labels. For tables with sub-headers, the sub-header labels have already been inserted into all associated labels in the stub earlier.
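The label association for the “1,300,000” example can be sketched on a normalized grid as follows. The function name `associated_labels`, the grid layout, and the placeholder salary cell are assumptions; the sketch simply reads the header cells above the value and, for each stub column, pairs that column's header with the stub cell in the value's row.

```python
def associated_labels(grid, n_header_rows, n_stub_cols, r, c, sep="|"):
    """Return the associated labels of the value cell at (r, c) in a normalized grid."""
    def header_path(col):
        # Header cells from top to bottom, with copies left by span expansion removed.
        cells = (grid[i][col] for i in range(n_header_rows) if grid[i][col])
        return sep.join(dict.fromkeys(cells))

    labels = [header_path(c)]
    for j in reversed(range(n_stub_cols)):
        # Stub labels inherit the header of their own column.
        labels.append(f"{header_path(j)}{sep}{grid[r][j]}")
    return labels

grid = [
    ["name", "principal position", "fiscal year", "annual compensation", "annual compensation"],
    ["name", "principal position", "fiscal year", "salary ($)", "bonus ($)(1)"],
    ["John T. Chambers", "president, chief executive officer and director",
     "2005", "(not shown)", "1,300,000"],  # salary cell is a placeholder
]
labels = associated_labels(grid, n_header_rows=2, n_stub_cols=3, r=2, c=4)
```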

Table understanding: Similar to (Gatterbauer et al. 2007), we consider that IE from Wang's model requires additional intelligent processing. To populate the database from Wang's representation, a rule-based system is used. We specifically look for certain patterns, such as “name”, “title” or “position”, in the associated labels to populate the “name title” relation. For different relations, a different set of patterns is used. It is important to perform error analysis at this stage to detect ineffective patterns. For example, several tables with “name-title” information used the phrase “non-director executive officer” instead of a label for “name”. Clearly, supervised machine learning could be applied to make the process more robust. In our annotation, we asked the annotators to identify the columns that contain the information we want in valorColumna. Such information could be used to train our table understanding module in the future.
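A rule-based lookup of this kind can be sketched as keyword matching over the associated labels. The pattern table `RELATION_PATTERNS`, the function name `extract_relation`, and the substring matching rule are hypothetical; the actual rules are specified manually and are not given in the text.

```python
# Hypothetical keyword patterns per relation slot.
RELATION_PATTERNS = {
    "name title": {"name": ["name"], "title": ["title", "position"]},
}

def extract_relation(associated, relation, patterns=RELATION_PATTERNS):
    """Fill the slots of a relation from 'attribute|value' labels, or return None."""
    record = {}
    for label in associated:
        attribute, _, value = label.rpartition("|")
        for slot, keywords in patterns[relation].items():
            if any(keyword in attribute.lower() for keyword in keywords):
                record[slot] = value
    # Only report the relation when every slot was found.
    return record if len(record) == len(patterns[relation]) else None

labels = [
    "annual compensation|bonus ($)(1)",
    "fiscal year|2005",
    "principal position|president, chief executive officer and director",
    "name|John T. Chambers",
]
record = extract_relation(labels, "name title")
```

A table that labels its name column "non-director executive officer" would return `None` here, which is the kind of miss that error analysis is meant to surface.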

The following procedures can be used to adapt our approach to a new application or domain:

• Collect a corpus and annotate the tables with the desired information as described in the Annotation section.

• Modify the features to take advantage of knowledge in the new domain.

• Train all classifiers. Depending on the size of the corpus, different thresholds can be specified to minimize the size of the vocabulary, which is used as features. This training process can be automated.

• Modify the table normalization to take advantage of domain knowledge. For example, in the SEC domain, splitting of the “name and title” label cell is applied in order to simplify the subsequent relation extraction operations.

• Modify the relation extraction rules. Different relations are signaled by different words in the labels. Currently, we specify these rules manually.

This process is designed to maximize precision and recall while minimizing the annotation effort. Each component can be modified to take advantage of domain-specific information to improve its performance.

Example Sentence Paraphrase Generation

A further embodiment of the present invention includes a tool that generates sentence paraphrases from initial templates provided by a user. The tool takes sentences that indicate an event with high precision, with the actual entities replaced by their generic types, for example:

<ORG> bought <ORG>

Merger of <ORG> with <ORG>

The sentence is searched for in a corpus, and the actual entity identities are obtained from sentences that match the seed pattern. Other sentences in the corpus that mention the same entities are then located, and these serve as paraphrases of the initial sentence. (In some embodiments, the other sentences are restricted to those occurring within a narrow time window.) Each of these other sentences can in turn be treated as a seed template or pattern by removing the named entities and then repeating the search for further sentences that match this new seed pattern. The sentences can be ranked according to the frequencies of the component sentences and can be manually verified to generate gold data for the classifiers.
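One round of this procedure can be sketched as follows, assuming the corpus sentences are already tagged with their organization names. The function names, the corpus format, and the company names are hypothetical, and the optional time-window restriction is left out.

```python
from collections import Counter

def to_pattern(text, orgs):
    """Replace each tagged organization name with its generic type."""
    for org in orgs:
        text = text.replace(org, "<ORG>")
    return text

def paraphrase_patterns(corpus, seed):
    """Find patterns of other sentences that mention the entities matched by the seed.

    corpus: list of (sentence, [organization names]) pairs (assumed pre-tagged).
    Returns (pattern, frequency) pairs, most frequent first.
    """
    # 1. Entity sets of the sentences that match the seed pattern.
    matched = {frozenset(orgs) for text, orgs in corpus if to_pattern(text, orgs) == seed}
    # 2. Other sentences about the same entities, generalized and counted.
    counts = Counter()
    for text, orgs in corpus:
        pattern = to_pattern(text, orgs)
        if pattern != seed and frozenset(orgs) in matched:
            counts[pattern] += 1
    return counts.most_common()

# Made-up corpus with fictional company names.
corpus = [
    ("Acme bought Globex", ["Acme", "Globex"]),
    ("Acme completed its acquisition of Globex", ["Acme", "Globex"]),
    ("Initech bought Hooli", ["Initech", "Hooli"]),
    ("Initech completed its acquisition of Hooli", ["Initech", "Hooli"]),
    ("Acme sued Initech", ["Acme", "Initech"]),
]
ranked = paraphrase_patterns(corpus, "<ORG> bought <ORG>")
```

Each returned pattern could be fed back in as a new seed, and the frequency ranking gives a natural order for manual verification.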

Conclusion

The embodiments described above are intended only to illustrate and teach one or more ways of practicing or implementing the present invention, not to restrict its breadth or scope. The actual scope of the invention, which embraces all ways of practicing or implementing the teachings of the invention, is defined only by the issued claims.

Claims

1. A computer system for extracting data and related information from tables in se le c tr o n ic docu ments that It has at least one process and at least one memory, understanding the system:

Means for autom a tically identifying and labeling a text seg m e nt in an e le c tron ic (110) docu m e n t; Means for autom a tically labeling en t y n ames , n e ta ry e xp re s io n s , an d tempo ra l e xp re s io ns within the se gment (120) of text ;

Means to id e n t ifi c a r a d e s c rite fin a n ci e r e v e n t within the auto m a tically tagged t e x g m e n t; a c las s if ica dor (310 ) m a chine of suppor t vec to rs adapted to f ilter the do cu ment an d id en tif ica r a ta b le that com p re Of in fo rm atio n of in te re s d is tin g u ing ta b la s from those that are not ta b la s and en d on from su tilized ta b las for forma t reasons are identified as no tab les, the in fo rm atio n of interest c o m p re n d a p lu ra lity o f desired attribu ts and v a lo re s , the ta b the sgenu in as id en tifica ted are p ro ce sa d by:

to. c las s ifica tio n de ta b le usin g s e c ific c las si ica dors of relatio n based on m e rv is a t machine learning,

b. c la s ificatio n o f la bel rows an d co lu m ns d is tin g ing betw een label col umns a n d l a bel rows of val u e s in tro The tables ,

c. RECOGNIZING THE TABLE S TRUCTURE ASSOCIATING EACH VALUE WITH ITS LABELS IN THE SAME COLUMN AND THE SAME ROW FOR GENERATION to r a list of peers to tribute-value,

d. com p re n s io n th e ta b le com p aring each one of the attr ibute-value pairs;

means for defining in the memory a data record associated with the financial event, the data record including data derived from the tagged text segment (319);

and means for extracting relationship data (320) from the text segment and for determining a role of at least one entity, the entity being tagged within the text segment and related to the data record.

2. The system of claim 1, where the text segment is a grammatical sentence.

3. The system of claim 1, where the data record includes:

a business field that includes text that identifies a named entity tagged in the text segment; a company identification field that includes an alphanumeric code that identifies the named entity; and a time period field that includes an alphanumeric code that identifies a financial information period.

4. The system of claim 1, where the data record includes a field that indicates whether a monetary expression tagged in the text segment has an up or down trend.

5. The system of claim 1, where the data record includes a field that indicates that a monetary expression tagged in the text segment is a measure of earnings per share.

6. The system of claim 1, wherein the means for automatically tagging entity names, monetary expressions, and temporal expressions within a text segment includes:

first means for tagging and resolving entity names;

second means for tagging monetary expressions; and

third means for tagging temporal expressions.

7. The system of claim 1, wherein the means for determining whether the automatically tagged text segment describes a financial event includes an M&A classifier for determining whether the text segment describes an M&A event.

8. The system of claim 1, wherein the means for determining whether the automatically tagged text segment describes a financial event includes an M&A classifier for determining whether the text segment describes or does not describe an M&A event.

9. The system of claim 8, where the M&A classifier is a classifier based on machine learning.

10. The system of claim 1, wherein the means for determining whether the automatically tagged text segment describes a financial event includes an earnings event classifier for determining whether the text segment describes or does not describe an earnings event.

11. The system of claim 1, wherein the means for determining whether the automatically tagged text segment describes a financial event includes a guidance event classifier for determining whether the text segment describes or does not describe a financial guidance event.

12. A computer-implemented method for extracting data and related information from tables in electronic documents, comprising:

automatically identifying and tagging a text segment in an electronic document;

automatically tagging entity names, monetary expressions, and temporal expressions within the text segment;

identifying a financial event described within the automatically tagged text segment;

filtering the document, using a support vector machine classifier, and identifying a table comprising information of interest by distinguishing tables from non-tables, wherein tables used for formatting reasons are identified as non-tables, the information of interest including a plurality of desired attributes and desired values, the identified genuine tables being processed by:

a. table classification using relation-specific classifiers based on supervised machine learning,

defining a data record associated with the financial event, the data record including data derived from the tagged text segment; and

extracting relationship data from the text segment and determining a role of at least one entity, the entity being tagged within the text segment and related to the data record.

13. The method of claim 12, further comprising displaying on a display device at least a portion of the data record in association with a user-selectable command feature for triggering retrieval of a document that includes the text segment.

14. The method of claim 12, in which the data record includes:

a business field that includes text that identifies a named entity tagged in the text segment; a company identification field that includes an alphanumeric code that identifies the named entity; and a time period field that includes an alphanumeric code that identifies a financial information period.

15. The method of claim 12, where the data record includes a field that indicates whether a monetary expression tagged in the text segment has an ascending or descending trend, the method further comprising:

automatically tagging the names of entities within a text segment as names of a person, a company, and a location; and automatically associating one or more of the designated entity names with an entry in a named entity data set.