WO2000026839A9

WO2000026839A9 - Advanced model for automatic extraction of skill and knowledge information from an electronic document

Info

Publication number: WO2000026839A9
Application number: PCT/US1999/026083
Authority: WO
Inventors: Prabhat K Andleigh; Nagaraju Pappu; Vasudeva V Kalindindi
Original assignee: Infodream Corp; Prabhat K Andleigh; Nagaraju Pappu; Vasudeva V Kalindindi
Priority date: 1998-11-04
Filing date: 1999-11-03
Publication date: 2001-08-02
Also published as: WO2000026839A8; WO2000026839A1

Abstract

An apparatus, method, and computer readable medium for analyzing and extracting skill and knowledge information from an electronic document (104) and for storing the extracted skill and knowledge information into predefined fields or tables in a target database (110) comprises a content analysis and semantic network engine (216) for analyzing and extracting skill and knowledge information from the electronic document (104). A skill and knowledge information extractor (702) is coupled to the content analysis and semantic network engine (216), for determining a skill level for the skill information extracted from the electronic document (104). In a preferred embodiment, the skill and knowledge section processor (702) uses a non-monotonic reasoning principle to determine a skill level for skill information extracted from the electronic document (104). The content analysis and semantic network engine (216) further comprises a thesaurus (221) for linking together terms (402) and skill information (404), and for defining relationships between and among the terms (402) and skill information (404), and a semantic network (220) coupled to the thesaurus (221), for organizing the terms (402) and skill information (404) in the thesaurus (221), along with knowledge information (502) and categories (504), in a hierarchical structure.

Description

ADVANCED MODEL FOR AUTOMATIC EXTRACTION OF SKILL AND KNOWLEDGE INFORMATION FROM AN ELECTRONIC DOCUMENT

RELATED APPLICAΉON

The subject matter of this application is a continuing application of and claims priority

from U.S. patent application Serial No. 09/380,219, filed August 27, 1999 descending in

priority from PCT application PCT/US98/27664, filed on December 28, 1998, and entitled

5 "Xtraction Server" by Prabhat K. Andleigh, Nagaraju Pappu, and Nasudeva Kalidindi. Said

two earlier applications are commonly assigned with the instant application.

The subject matter of this application is also related to and claims priority from U.S.

Provisional Application Serial No. 60/107,063, filed Novmeber 4, 1998, and entitled

"Advanced Model for Automatic Extraction of Content, Skills, and Knowledge from

10 Resumes" by Prabhat K. Andleigh, Nagaraju Pappu, and Nasudeva Kalidindi, which

application is commonly assigned with the instant application, and is incorporated herein by

reference in its entirety.

TECHNICAL FIELD

This invention relates to the field of computer analysis of electronic documents. More

15_. specifically, it relates to the field of information retrieval to convert and store information in

documents written in a natural language into a predefined structure which can be retrieved

and manipulated by computer program applications.

BACKGROUND OF THE INVENTION

Information to be sorted and stored in a computer database may reside in numerous

20 electronic documents. For example, information about people and their specific talents and

skills may reside in electronic documents, such as resumes, performance appraisals, design

documents, publications, books, patent documents, and email messages. When an individual is trying to organize and sort out specific information from such electromc documents, the

individual usually has to open each document separately and manually analyze, retrieve, and

store the relevant data in the particular database. For example, a project manager who would

like to find the best employee for a specific job may have a specific job description. When

searching for an employee whose skills, knowledge and talent are best suited for the specific

ob description, the project manager must sift through several documents which contain the

necessary information. Such a process is time consuming and inefficient, because the project

manager may have to read the documents several times and may have to review and type the

information into a computer database in order to organize the various pieces of information

into a coherent summary.

A computerized system which can analyze and extract pertinent information from

different electronic documents would provide a more efficient solution to this problem.

However, such text documents are often written in unstructured natural language text for

other people to understand. Thus, computer programs such as database applications cannot

efficiently process documents written in natural language texts. Rather, computer programs

can process only information which has been stored in a highly structured fashion in order to

retrieve and manipulate that information. Additionally, these documents may be prepared in

a variety of different file formats, such as Microsoft Word 97, Rich Text Format, PDF,

WordPerfect, ASCII files, and HTML, and may be stored in different areas within a computer.

There are a variety of information retrieval programs such as Internet search engines

that can retrieve documents that match a set of keywords. Their scope is very limited in the

context of the above mentioned problem, because they cannot understand the text, and certainly they cannot make any connection between the document and the person who is

related to that document. Another problem is that the 'information of interest' will vary

significantly from one organization to another. For example, a health care organization will

be interested in the skills and talents related to the medical field, but the skills related to

computers may not be of significant interest, whereas a software development organization

will be interested in the computer and software related skills, but may not be interested in

medical or first-aid related skills. The keyword based search engines cannot address this

problem of retrieving only the 'information of interest'. As a result, there is a vast amount of

information about people which cannot be easily processed by computer programs.

For example, in today's large corporations and government organizations, it is not

uncommon to receive hundreds of thousands of resumes of potential candidates in a very

short time. Recruiting the right candidates from such a vast pool of applicants is a very

complicated problem. It is crucial for organizations to find the people with the right

knowledge and skill set. In essence, managers have to deal with a vast number of resumes,

try to understand the content within the resumes, and short-list candidates who have the right

skills and knowledge. For example, if an organization wants to recruit a middle level

manager with 5 to 8 years of experience to lead a development project, the organization will

need to sort through thousands of resumes and determine from each one whether that

particular candidate has the requisite knowledge and skill level. It is not possible to find the

best resumes using a standard full text search engine because such search programs search for

a particular input string and retrieve only resumes which contain that particular input string.

Such an approach is not that useful, because a particular skill may be written using many

different terms (e.g. Microsoft Word, MS Word, Word 97, etc....) even though the terms all refer to the same or similar ideas. Moreover, in addition to not being able to correctly

identify a candidate's skills, a typical search program cannot identify the type of experience

with that skill, the duration of that experience, or the overall knowledge gained by the

candidate in a specific skill group. Additionally, it is also very desirable to have a system for

determining not only the knowledge and skills of a candidate but also the proficiency level of

a candidate in a particular skill.

Therefore, what is needed is a system for analyzing and extracting information from

an electronic document and for storing the extracted information in a database. Additionally,

what is needed is a system for analyzing and extracting skill and knowledge information from

an electronic document and for determining a skill level for skill information and for mapping

such skill level information to a qualitative scale.

DISCLOSURE OF INVENTION

The present invention is an apparatus, method, and computer-readable medium for

analyzing and extracting skill and knowledge information from an electronic document (104)

and for storing the extracted skill and knowledge information into predefined fields or tables

in a target database (110). The system for analyzing and extracting skill and knowledge

information from an electronic document (104) comprises a content analysis and semantic

network engine (216) for analyzing and extracting skill and knowledge information from the

electronic document (104), and a skill and knowledge information extractor (702) coupled to

the content analysis and semantic network engine (216), for determining a skill level for the

skill information extracted from the electronic document (104). In a preferred embodiment,

the skill and knowledge section processor (702) uses a non-monotonic reasoning principle to

determine a skill level for skill information extracted from the electronic document (104). The content analysis and semantic network engine (216) further comprises a thesaurus (221)

for linking together terms (402) and skill information (404) and for defining relationships

between and among the terms (402) and skill information (404), and a semantic network

(220) coupled to the thesaurus (221), for organizing the terms (402) and skill information

(404) in the thesaurus (221), knowledge information (502), and categories (504) in a

hierarchical structure.

A method for extracting skill and knowledge information from an electronic

document (104) comprises the steps of: identifying skill and knowledge information in the

electronic document (802); determining a skill level for skill information from the electronic

document (804); and mapping the skill level to a qualitative scale (806). The method further

comprises the step of storing the skill information and qualitative skill level scale mapping in

the target database (808).

BRIEF DESCRIPTION OF THE DRAWINGS

Figure 1 is a block diagram of a preferred embodiment of a system 100 in accordance

with the present invention.

Figure 2 is a block diagram of a preferred embodiment of an extraction server 108 in

accordance with the present invention.

Figure 3 is a flow chart of a preferred embodiment of the steps performed by the

document pre-processor 210.

Figure 4 is a block diagram of a preferred embodiment of a thesaurus. 221

Figure 5 is a block diagram of a preferred embodiment of a semantic network 220.

Figure 6 is a flow chart of a preferred embodiment of the steps performed by the

extraction server 108. Figure 7 is a block diagram of a preferred embodiment of a system 700 in accordance

with the present invention.

Figure 8 is a flow chart of a preferred embodiment of the steps performed by the skill

and knowledge information extractor 702.

Figure 9 is a screen shot of a user interface of a preferred embodiment of a target

database 110 display for skill information.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS

Referring now to Figure 1, a system 100 upon which a preferred embodiment of the

present invention operates is shown. A host computer 102, using the method and system

described herein, operates upon an electronic document 104, derived from a text document

which contains unstructured text. As used herein "unstructured text" refers to any document

which has been written in a natural language such as English. Examples of documents

containing unstructured text include, but are not limited to, a resume, performance appraisals,

design documents, publications, books, patent documents, and email messages. In a preferred

embodiment, the host computer 102 is a conventional computer having a keyboard and mouse

for input (not shown), and a conventional memory 106 associated with host computer 102 for

storing the electromc document 104. The electronic document 104 may be prepared in any

electronic file format, such as Microsoft Word 97, Rich Text Format, PDF, WordPerfect,

ASCII files, and HTML.

The electronic document 104 is processed by host computer 102 using the present

invention. Specifically, host computer 102 uses extraction server 108 to analyze, retrieve and

store words and word groups from the electronic document 104 into a predefined structure in target database 110. As used herein, the terms "words" and "word groups" are used to mean any text that may be derived from document 104 including, but not limited to, individual

words or numbers, phrases, whole sentences, and blocks of text. The extraction server 108

identifies the document type of the document 104 and determines which words and word

groups are to be extracted from the document 104. The structure and operation of the

extraction server 108 is described in more detail below with reference to Figures 2 through 6.

The target database 110 comprises predefined tables with predefined columns for

storing the word and word groups extracted from the electronic document 104. In a preferred

embodiment, a predefined table and predefined columns correspond to a particular document

type. For example, if document 104 is a resume, then a predefined table for a document type

called "resume" may have predefined columns such as "name and address", "education", and

"skills and experience". As another example, if document 104 is a patent document, then a

predefined table for a document type called "patent document" may have predefined columns

such as "inventors", "company", "patent number", and "field of search". The predefined

tables and columns in target database 110 are organized ahead of time, and one skilled in the

art will realize that the present invention is not limited to a particular document type or a

predefined table, but that many different compilations of predefined tables and columns may

be stored in target database 110 within the scope of this invention. The words and word

groups stored in the target database 110 can be stored in electronic form on any type of

computer data storage device or they may be printed out in a hard-copy printed format.

The process of extraction performed by the extraction server 108 preferably uses a

non-monotonic reasoning principle. As used herein, a "non-monotonic reasoning principle"

refers to a process whereby at every stage during extraction, the extraction server 108

assumes a reasonable default value. That default value is modified as further information becomes available. For example, a string '1987' is first assumed to be a number, and if

further information to qualify the string to be a date is available ( for example in this case,

that the string is preceded by another string 'Jan'), then the assumption is changed. If again

further information becomes available to negate the previous assumption, the assumption is

changed again.

Thus, the present invention advantageously allows a user to extract skill and

knowledge information from an electronic document directly into a database. More

specifically, the present invention analyzes an electronic copy of a text document and extracts

words and word groups relating to skill and knowledge information into a target database

comprising predefined tables and columns associated with a particular document type.

Moreover, the present invention operates upon electronic documents in any electronic file

format. The extracted skill and knowledge information stored in the target database can then

be retrieved and manipulated by other computer program applications.

Referring now to Figure 2, a block diagram of a preferred embodiment of the

extraction server 108 is shown. The electronic document 104 may be any electronic file

stored in memory 106 which is accessible by the extraction server 108. For example, the

electronic document 104 may be an electronic form of a hard copy of a document converted

using a conventional optical scanner and Optical Character Recognition (OCR) software 202,

a Microsoft Word file 204, an ASCII text file 206 or an email attachment 208. The database

applications which manipulate the extracted information in target database 110 are also

preferably stored in memory 106. In a preferred embodiment, the extraction server 108 comprises a document preprocessor 210 coupled to the memory 106 where the electronic

document 104 is stored, a heuristics engine 212 coupled to the document pre-processor 210, a morphological analysis engine 214 coupled to the heuristics engine 212, a content analysis

and semantic network engine 216 coupled to the document preprocessor 210, and a database

interface 222 coupled to the content analysis and semantic network engine 216 and to the

target database 110. The content analysis and semantic network engine 216 preferably

comprises section processors 218 and a semantic network 220.

The document pre-processor 210 retrieves the electronic document 104 from memory

106 and performs the initial analysis of the electronic document 104. Referring now to Figure

3, a flowchart of the steps of a preferred operation of the document pre-processor 210 is

shown. The document pre-processor 210 performs the initial analysis and extraction of the

electronic document 104 by first converting (302) the electronic document 104 from its native

file format into ASCII text. More specifically, the document pre-processor 210 identifies the

file format of the electronic document 104 and extracts the ASCII text out the document 104.

For example, if the electronic document 104 is a Microsoft Word file, then the document pre¬

processor 210 identifies the file by the Microsoft Word signature and uses the Microsoft

Object Linking and Embedding Software Development Kit (Microsoft OLE 2.0 SDK) to

extract text from the Microsoft Word File.

Next, the document pre-processor 210 filters out (304) any unnecessary and unwanted

information such as, but not limited to, email headers, OCR headers, blank pages, and

unwanted characters. Preferably, any information that is not part of the original document is

treated as unnecessary information. For example, email headers, non- ASCII characters at the

beginning or at the end of the file, extra blank lines and blank spaces are removed from the

text. Additionally, if the text contains vertical tables, these tables are preferably converted

into horizontal tables. If the text contains multiple columns, it is preferably converted into single column. The document pre-processor 210 then stores (306) formatting information for

, the document 104 such as, but not limited to, the fonts used, font sizes, section tittles, and

subsections.

The document pre-processor 210 then performs paragraph identification heuristics

(308) on the electronic document 104. During this step, the beginning and end of each

paragraph is identified, and the paragraph characteristics are gathered. As used herein, the

phrase "paragraph characteristics" refers to the statistical properties of the paragraph.

Paragraph characteristics include, but are not limited to, the number of words in the

paragraph, the number of lines in the paragraph, the average number of words per line,

whether any line has a bullet as the starting character, and whether there are any underlined

sentences in the paragraph.

Finally, the document pre-processor 210 performs paragraph grouping heuristics (310)

on the electronic document 104. Once the paragraphs have been identified, the document

pre-processor 210 groups the paragraphs into sections. During this step, the paragraphs are

grouped into sections based on the paragraph characteristics as well as using any section

tittles that precede the paragraphs. Starting at the beginning of the electronic document 104,

the first heading or section title is identified, and the following paragraphs until the next

section title are grouped into one section. If no section titles are found, then using the

paragraph characteristics, all the similar paragraphs are grouped into sections. Additionally,

paragraphs that have same or similar characteristics are grouped together into sections.

The heuristic engine 212 applies a set of heuristics, that is a set of rules, to the electronic document 104 for analyzing information in the electronic document 104. The set

of heuristics which are applied to the electronic document 104 are associated with a particular document type. For example, if the document type is a "resume", then the set of heuristics

associated with the document type "resume" is applied to the electronic document 104.

Heuristics are described below in more detail in commonly assigned U.S. patent application

Serial No. 09/380,219 entitled "Extraction Server" by Prabhat K. Andleigh, Nagaraju Pappu,

and Nasudeva Kalidindi, which is incorporated herein by reference in its entirety.

The morphological analysis engine 214 is used for target language analysis and is

preferably the LinguistiX 2.0 application programming interface (API) from InXight

Corporation in Palo Alto, CA. The LinguistiX 2.0 API is a language neutral programming

interface. In other words, the LinguistiX API can analyze documents in any language such as

English, French or German. Because the heuristics engine 212 and the LinguistiX API are

external to and separate from the document pre-processor 210 and the content analysis and

semantic network engine 216, the present invention can extract information from documents

in the English, French or German language, and any other languages which will be supported

by the LinguistiX API in future.

Preferably, the Heuristics Engine 212 uses the following features provided by the

LinguistiX API: tokenization, lexical analysis, tagging, and noun-phrase extraction. Before

text from the electronic document 104 can be analyzed in terms of its linguistic roots and

function, it must first be segmented into words, punctuation and idiomatic phrases.

LinguistiX tokenization includes the ability to recognize multi-word constructs such as

HTML tags. The lexical analysis feature identifies the grammatical features of a word in

addition to its root forms. The tagging feature identifies the grammatical category of words

by their context. The noun-phrase extraction identifies multi-word phrases in documents.

LinguistiX phrase extraction technology enables software to work with these larger concepts to provide improved information analysis and retrieval. For example, 'Windows

Programming' will be identified as one phrase, instead of two distinct words Windows and

Programming. This feature is used by the semantic network 220 to identify the multi-word

noun phrases.

These features of the LmguistiX API are used to implement the heuristics. For

example, by using the tagging feature, the extraction server 108 may discover that a particular

word is a proper noun. Whether that word is the name of the person or the name of a

company will depend on where the word occurred in a document. For example, if the word

occurs in a contact information section of a document, then it may be the name of the person,

or name of the street, city and so on. If the word occurs in an experience section of a

document, and if it is followed by the name of a city and state, it may be a company name.

The database interface 222 is a set of APIs that provide a mechanism for retrieving

and storing information to and from the target database 110. This is done in such a way that

the underlying implementation of the target database 110 is hidden from the application using

the database interface. Thus, the extraction server 108 can work with any industry standard

relational database software such as Oracle or Microsoft SQL Server without having to

change the software or its implementation. Additionally, the database interface 222 provides

the following mechanisms: a method to connect to the target database, a method to maintain

the connection to the database, a transaction model to maintain the consistency of the

database, and various methods to retrieve, query, update, insert and delete information from

the target database 110.

The content analyzer and semantic network engine 216 analyzes the content of the

electronic document 104, extracts words and word groups from the document 104, and stores the extracted information in the appropriate tables in the target database 110. In a preferred

embodiment, the content analyzer and semantic network engine 216 comprises section

processors 218 which extract information from a particular section of interest, and a semantic

network 220. The semantic network 220 uses a thesaurus 221 and a phrase extraction process

to identify the meta-concepts and categories in the electronic document 104 and extracts

related words and word groups into the target database 110. In a preferred embodiment, the

present invention may be implemented to run on a Windows NT Server and Oracle Database. Referring now to Figure 4, a block diagram of a preferred embodiment of a thesaurus

221 is shown. The thesaurus 221 is a vocabulary database for the extraction server 108 and is

organized by skills. The thesaurus 221 groups all related terms 402 in a language under a

language independent concept 404. As used herein, a "term" 402 refers to all the individual

words or word groups that belong to a particular language along with their alternatives. As

used herein, a "concept" or "skill" 404 comprises a set of terms 402 that are language specific

and alternatives to one another. However, the skill 404 itself is language independent. Skills

404 establish synonymous relationships among all terms 402 in the thesaurus 221 that have

the same meaning. In other words, skills 404 connect all the different names for the same

skill 404 that are known to the thesaurus 221 and specify certain characteristics for each name. Preferably, each skill 404 has a unique skill identifier (ConceptLD). The Concept ID

by itself has no intrinsic meaning. Each term 402 in each language in the thesaurus 221 has a

unique term identifier. The same term 402 in different languages, for example, in English

and Spanish, will have a different term identifier for each language.

To illustrate the relation between terms 402 and skills 404 consider an example in

which terml 402A may consist of 'MS NC++', term2 402B may consist of 'Microsoft Visual C++' and teraι3 402C may consist of 'MS Nisual C++'. All these terms 402 are linked to the

skill 404 'Visual C++'. In other words, if the electronic document 104 uses any of the words

or word groups 'MS VC++', 'Microsoft Visual C++' or 'MS Visual C++', the thesaurus 221

allows the extraction server 108 to recognize the words or word groups as being linked to the

skilll 404A Nisual C++'. In another example, term4, term5 and termό are respectively 'JDK

LI', 'Symantec Cafe', and 'JDBC, and all these terms 402 are linked to the skill2 404B called

'Java'. Thus, if the electronic document 104 uses any of the words or word groups 'JDK 1.1',

'Symantec Cafe', and 'JDBC, the thesaurus 221 allows the extraction server 108 to recognize

the word or word group as being linked to the skill2404B 'Java'.

The thesaurus 221 may also comprise other information such as the attributes of a

skill 404 or attributes of a term 402. Attributes provide additional information that helps to

define the meaning of a skill 404 and explain how it may be used in a document. In other

words, the different senses of a particular word or word groups are captured using the

attributes.

In addition to the relationship between a skill 404 and a set of terms 402, the thesaurus

221 also comprises relationships among skills 404. Preferably, these relationships are non-

subsumption relationships. As used herein, the term "non-subsumption" refers to

relationships that include related skills, co-occurring skills and/or associated expressions. In

other words, non-subsumption refers to relationships that are not based on subsumption. For example, C++ and Java are related, but neither subsumes the other. All these relationships

among skills 404 indicate that the skills 404 linked together are not exactly similar but are

associated with each other in different ways. One skilled in the art will realize that the terms

and skills of the thesaurus 221 are not limited to the examples given herein but may contain any number of terms and skills which have been predefined and stored in the thesaurus 221

prior to the processing of the electronic document 104. Thus, the thesaurus advantageously

allows the present invention to link together terms and skills used in specific industries,

disciplines, and technologies for which the thesaurus is being used, and preserves the

meanings and hierarchical connections between those terms and skills. Additionally, the

thesaurus facilitates the access to concept relationships and to term and skill attributes

irrespective of the term used as a point of entry.

Referring now to Figure 5, a block diagram of a preferred embodiment of a semantic

network 220 is shown. The semantic network 220 provides a way of arranging all the skills

404 at the lowest level and then builds a taxonomy or network of higher level knowledge-

concepts and categories. The semantic network 220 comprises skills 404 at the lowest level,

"knowledge" or knowledge-concepts 502 at a second level, and categories 504 at the highest

level. The semantic network 220 together with the thesaurus 221 provides a four level

hierarchy of terms 402, skills 404, knowledge-concepts 502 and categories 504.

A category 504 is the highest level in the semantic network 222. Broad categories

504 may be created according to a specific industry which fully subsume other knowledge-

concepts 502 and skills 404. The semantic network 220 categorizes all knowledge-concepts

502 into categories 504. Knowledge-concepts 502 comprises the next level in the semantic

network 220 hierarchy. Each knowledge-concept 502 is a collection of skills 404 that add to

the body of knowledge. The semantic network 220 categorizes all skills 404 into knowledge-

concepts 502. As described earlier with reference to Figure 4, skills 404 are generic and

language independent from all related terms 402. The semantic network 220 categorizes all

terms 402 into skills 404. As described earlier with reference to Figure 4, terms 402 comprise language dependent strings that are found in the electronic document 104. Terms 402

comprise the lowest level in the semantic network 220 hierarchy.

The entire semantic network 220, separate from the thesaurus 221, comprises

language independent knowledge that is arranged as a taxonomy. Preferably, the

relationships between skills 404 and knowledge-concepts 502 as well as the relationships

between knowledge-concepts 502 and categories 504 are many to many. In other words, a

single knowledge-concept 502 can comprise several skills 404 and a single skill 404 can be

linked to several knowledge-concepts 502. Similarly, several knowledge-concepts 502 may

comprises a category 504 and several categories may have links to a single knowledge-

concept 502.

To illustrate the terms 402, skills 404, knowledge-concepts 502, and categories 504 of

a semantic network 220, the two concepts discussed earlier with reference to Figure 4,

namely Nisual C++' and 'Java', will be used. Both these skills 404 may be grouped under a

knowledge-concept 502 Object Oriented programming languages'. Additionally, the skill

404 Nisual C++' may also belong to the knowledge-concept 502 Nisual Programming

Environment'. The knowledge-concept 502 "Visual Programming Environment" may also

be linked to other skills 404 such as Nisual Basic'.

The semantic network 220 uses subsumption as the basis for the hierarchical

organization of skills 404, knowledge-concepts 502, and categories 504. In other words, the

relationship between skills 404 and knowledge-concepts 502 and knowledge-concepts 502

and categories 504 in the semantic network 220 are based on conceptual subsumption, where

a more general object 'subsumes' a more specific object. The concept of subsumption is more general than the concept of synonymy. An object is subsumed by another object if the subsuming object is much more general than any other subsumed objects and effectively

summarizes the subsumed objects. Truly synonymous objects mutually subsume each other.

If only synonymous based relationships are allowed, then the granularity between the objects

cannot be captured effectively as there are not many truly synonymous objects. The

difference between the shades of meaning will not allow correct retrieval in a synonym-based

network. The subsumption-based network removes these drawbacks and aids in retrieving

related concepts more accurately, since a subsumption is more general compared to a

synonym. For example, the object 'JDBC is subsumed by a more general object called 'Java

Programming Language' (a knowledge-concept 502), which is further subsumed by an even

more generic object 'Software Engineering' (a category 504).

An object may also be subsumed by more than one higher level object. For example,

the skill 404 'JDBC may be subsumed by at least two knowledge-concepts 502 such as 'Java

Programming Language' and 'Database Connectivity Library'. Each of these knowledge-

concepts 502 may in turn be subsumed by several categories 504. Hence, the conceptual

subsumption also allows many-to-many relationships between skills 404 and knowledge-

concepts 502 and between knowledge-concepts 502 and categories 504.

Referring now to Figure 6, a flowchart of the steps of a preferred embodiment of a

method performed by the content analysis and semantic network engine 216 is shown. First,

identification heuristics are performed (602) on the electronic document 104 to identify the

beginning and end of the known sections of interest. The sections of interest are configured by the user when the extraction server 108 is first installed. The sections are then analyzed

(604) and information is extracted from the sections. The extracted information is stored

(606) in a predefined structure in the target database 110. Using the semantic network 220, words and word groups are analyzed (608) and the relationships between the different words

and word groups are determined and stored in the target database 110. Thus, the present

invention advantageously extracts meaningful information from electronic documents, and

stores them in a predefined structure in a target database. The extracted information stored in

the target database can then be retrieved and manipulated by computer program applications

accessing the database. Moreover, the present invention provides a powerful semantic

network and thesaurus for defining terms, concepts, meta-concepts, and categories and the

relationship between and among such terms, concepts, meta-concepts, and categories. Thus,

the semantic network can stored information relating to any field, industry or technology, and

allows the extraction server 108 to process various types of documents pertaining to such

fields, industries or technologies.

The section processors 218 extract information from sections of interest in an

electronic document 104. The particular sections of interest from which information is

extracted is determined by the document type. The content analysis and semantic network

engine 216 comprises a section processor 218 for extracting words or word groups from each

section of interest in an electronic document.

Section processors 218 are configured to operate on a specific document type and may

contain one or several section processors 218. For example, resumes typically contain several

sections such as a cover letter, contact information, an objective section, an experience

section, an education section, a patents section, a publications section, an awards and honors

received section, and a courses attended section. In a preferred embodiment, section

processors 218 for a resume document type may comprise a cover letter section processor for

extracting information from a cover letter, a contact information section processor for extracting contact information for a candidate, a skills and experience section processor for

extracting the skills and experience of a candidate, an education section processor for

extracting educational information from a candidate, an awards and honors section processor

for extracting any awards and honors received by a candidate, a patents section processor for

extracting information about patents obtained by a candidate, and a publications section

processor for extracting any articles or documents published by a candidate. Each section

processor 218 analyzes a particular section in the electronic document 104 and extracts

specific words and word groups from that section into a specific record in the target database

110. Additionally, as described in more detail in commonly assigned U. S. Patent

Application Serial No. 09/380,219 entitled "Xfraction Server" by Prabhat K. Andleigh,

Nagaraju Pappu, and Vasudeva Kalidindi, each section processor 218 applies a set of

heuristics to the particular section of interest in order to analyze and extract the desired

information.

Referring now to Figure 7, there is shown a preferred embodiment of the present

invention comprising a skills and knowledge information extractor 702. The skills and

knowledge information extractor 702 allows the system to automatically extract from a

document, such as a resume, the skills of a candidate, the candidate's knowledge in a

particular area, and to determine the proficiency level of the candidate in any given skill.

Thus, the skills and knowledge information extractor 702 allows a user to automatically

determine a "career profile" of a candidate from his or her resume. As used herein, a "career profile" refers to any qualitative and quantitative information about a candidate's work

history, experience, and proficiency. For example, such information includes, but is not

limited to, how long a candidate worked in a particular profession, when, where, and at what depth did the candidate gain experience in a particular skill, what is the candidate's overall

knowledge level in a particular area, how much management experience a candidate has, etc.

As used herein, "terms" refers to the actual word or words which are found in a

resume, "skill" or "skill information" refers to the skills 404 in the thesaurus 221 and

semantic network 220 which relate to those terms, and "knowledge" or "knowledge

information" refers to the knowledge-concepts 502 relating to the skills. For example, in a

resume, a candidate may have used the terms "Microsoft Visual C++" or "MS VC++". The

present invention would identify these terms as belonging to the skill "C++", which in turn is

related to the knowledge "object oriented programming" which in turn may be related to the

category "Software." Thus, although the only terms actually used in and extracted from the

resume were "Microsoft Visual C++" and "MS VC++", the present invention is able to

determine that the candidate has "skill" in C++ and has "knowledge" of object oriented

programming even though the words C++ and object oriented programming were never used

in the document.

The skill and knowledge information extractor 702 uses a non-monotonic reasoning

principle to determining a candidate's skill level. As described above, non-monotonic

reasoning refers to the use of default assumptions which are made about the state of unknown

factors. These default assumptions may be changed as new information or evidence becomes

available. Additionally, default assumptions may be changed due to the absence of certain

information or evidence. The operation of the non-monotonic reasoning approach used by

the skill and knowledge for information extractor 702 is best illustrated using an example.

During operation, the present invention finds a skill, X, in a candidate's resume, R. In

the absence of any other knowledge, the skill and knowledge information extractor 702 assumes that the skill level of the candidate for the skill X is average. As the skill and

knowledge information extractor 702 obtains additional information from the resume R about

skill X, the assumption of the skill level for skill X is refined. Additional knowledge that

may be used to refine the skill level includes, but is not limited to, the section in which the

skill X is found. For example, if the skill X is found in the Objective Section of a resume R,

a positive numerical value, or objective weightage factor W(O), will be added to the skill

level. Additionally, a positive weight for each project in which the skill X is used,

represented here by W(P;), may be added to the skill level. Preferably, this weightage value is

computed for all projects in resume R. The number of associated skills that are also used,

W(K), may also be added to the skill level. As used herein, associated skills are the skills

related to the main skill; knowing a main skill implies that a person also knows all associated

skills. For example, if one is an expert in the skill "database programming" or "database

administration," this person must be knowledgeable in the associated skill "SQL."

Associated skills can be determined using the semantic network 220 and the thesaurus 221.

For a given skill x, all its associated skills (X_! ... X are linked with x through the semantic

network 220 and thesaurus 221. For example, a thesaurus 221 entry for the "skill database

administration" would contain links to the "skills database server administration," "database

user management," and/or "SQL." Also, the number of years of experience for the skill X,

W(Y) may also be added to the skill level. Moreover, the number of years since the skill X

was used may represent a negative factor, W(LU), which is subtracted from the skill level.

Thus, in a preferred embodiment, a summation of the weights described above gives a

specific skill level for the skill X. A mathematical representation for determining the skill

level of a particular skill is as follows: SkillLevel(X') = SkillLevel(X) + W(O) + W(P_;) + W(K) + W(Y) - W(LU)

The weightage functions are computed using the total number of skill levels that are

defined, and the distance from the current skill level to the next skill level. One skilled in the

art will realize that the weightage factors used to adjust the skill level are not limited to those

listed in the above example but can comprise any number of factors to be determined by the

system creator.

The computation of the skill level of a particular skill for a candidate can also be

demonstrated using an example. Initially, the skill and knowledge information extractor 702

assumes that a person has an average skill level for a particular skill such as C++. If the

candidate's resume states that the candidate took a course in C++, that fact would add a

positive weightage factor to the skill level, thus adjusting the average skill level to a higher

value. If the candidate's resume also states that the candidate has two years of work

experience in C++, that fact would add another positive weightage factor to the skill level and

adjust the average skill level to another higher value. The values by which the average skill

level is adjusted for the C++ course and the two years of work experience are not necessarily

the same but may reflect the value attributed by the system creator. Each mention of C++ in

the resume would this be used to adjust the skill level either up or down. Additionally, the user of terms in the resume which are related in the semantic network 220 and thesaurus 221

to the concept or skill C++ could also be used to adjust the skill level of the candidate. After

all the relevant terms in the candidate's resume have been extracted and evaluated, the skill and knowledge information extractor 702 deteπnines a single value for the skill level for the candidate for the particular skill.

After a final skill value for a particular skill has been determined, the skill and

knowledge information extractor 702 then maps the skill value to a scale for qualitatively

illustrating the proficiency of the candidate in that particular skill. For example, if a final

skill value for a particular candidate has been determined to be the number 6.8, that number

may map to a rating of "good" on a scale of 1 to 10, with 1 being poor and 10 being excellent.

Thus, the present invention allows a user to determine the proficiency of a candidate's skill

level for a particular skill and to ascribe a qualitative value to that proficiency level. One

skilled in the art will realize that the qualitative scales used to describe a particular skill value

may be any type of scale with a range of numerical values and/or adjective descriptors. For

example, a qualitative scale may map the final skill value to a scale comprising numbers

such as 1 to 5 or 1 to 10. A scale may map the final skill value to a scale comprising numbers

and adjectives such as 1 (poor) to 10 (excellent). The qualitative scale may be determined by

the system creator.

The categories, knowledge, skills and terms are preferably set up in a relational database prior to the extraction process. As described above with reference to Figures 4 and 5,

in a preferred embodiment, the relationship between categories and knowledge is

many-many, the relationship between knowledge and skills is many-to-many, and the

relationship between skills and terms is one-to-many.

Referring now to Figure 8, there is shown a flow chart of a preferred embodiment of a

method for the present invention. In a preferred embodiment, a resume is evaluated (802) for

a particular skill. The skill level for that particular skill is then determined (804) using the above described techniques. After a final skill level value is determined, the skill level is

mapped (806) to a qualitative scale. Finally, the skill and the qualitative scale value of the skill level is stored (808) in the target database. More specifically, the categories, knowledge,

skills and terms (i.e. the semantic network) are loaded into main memory. The electronic

document text is then passed to the skill and knowledge information extractor 702. In a

preferred embodiment, knowledge, skills, skill levels and number of years are extracted from

the electronic document in the following manner: first, all the terms in the database are

checked against the document, then an initial scan of the document collects all the terms. The

frequency of appearance of the term is recorded. Afterwards, the weightage factors for the

skill level calculation are applied. A second scan of the electronic document analyses the

document and a running list is maintained for all terms to calculate the experience duration

where the term is maintained. On completion of the second scan, all the terms are rolled up

into skills according to the semantic network and thesaurus, all the skills are rolled up into

knowledge according to the semantic network, and all the knowledge items are rolled up into

categories. Additionally, categories specifically mentioned are added. Thus, based on this

information, the skill levels and years of experience are computed as described above.

Referring now to Figure 9, there is shown a screen shot of a user interface of a

preferred embodiment of a target database for a skill and knowledge information extractor

702. Window 902 displays the particular skills analyzed from a candidate's resume, the

qualitative level determined by the skill and knowledge information extractor 702, and the

years of experience the candidate has for the particular skill. For example, the highlighted

portion of window 902 indicates that the candidate has some skill as an analyst, that the

qualitative proficiency of the candidate's skill as an analyst is "excellent", and that the candidate has 4 years of experience as an analyst. Thus, the present invention advantageously

allows a user to extract, determine, and display from a candidate's resume the proficiency of a

particular skill of the candidate.

The present invention is designed as a set of Object Oriented Libraries and contains

the following major Object Libraries:

In a preferred embodiment, the present invention may be implemented to run on a

Windows NT Server and any relational database such as Oracle Database. Database tables

may be used to define how information is represented in a relational or object-oriented

database. In an object-oriented implementation, any relational table is preferably represented

as an object class. The following section describes a preferred embodiment of the content

and type of the fields that are extracted into a relational database, and also the definitions of

the categories, knowledge, skills and terms. The supporting tables are also explained. One

skilled in the art will realize that these tables are not limited to the specific information

illustrated therein but may be created as needed, depending on the document type being

processed.

Table 1

AutoEntryDocuments

Table 1 holds the documents that are to be extracted. It holds the following information:

Table 2

AutoEntrySchedule

Table 2 holds information about the scheduled extraction tasks.

Table 3 Candidate

Table 3 holds the personal information like name of the person, contact address, current employer, resume summary etc. The XtractionXpert automatically extracts the following information from the resume:

Table 4 ExperienceDetail

Table 9

Kno wledgeRecord

Table 13 Courses

Table 1

Miscellenouslnformation

Table 16 Category

Table 16 provides information regarding the relationships between categories and knowledge information.

Table 17 MetaConcept

Table 17 provides knowledge information for semantic network 220.

Table 18 Concept

Table 18 provides information relating to skills.

Table 19 ConceptRelation

Table 19 provides information on relationships between skills and knowledge.

Table 20 Term

Table 20 provides information on terms.

Table 21 Language

Table 21 stores information about different languages to which the terms belong.

Table 23 CaWordList

Table 24 Ca WordPosition

From the above description, it will be apparent that the invention disclosed herein

provides a novel and advantageous system and method for extracting and analyzing skill and

knowledge information from an electronic document. The foregoing discussion discloses and

describes merely exemplary methods and embodiments ofthe present invention. As will be

understood by those familiar with the art, the invention may be embodied in other specific

forms without departing from the spirit ofthe invention or essential characteristics thereof.

Accordingly, the disclosure ofthe present invention is intended to be illustrative, but not

limiting, ofthe scope ofthe invention, which is set forth in the following claims.

Claims

1. An apparatus for extracting skill and knowledge information from an

electromc document and for storing skill and knowledge information into a target database,

the apparatus comprising:

a content analysis and semantic network engine for analyzing and extracting skill and

knowledge information from the electronic document; and

a skill and knowledge information extractor, coupled to the content analysis and

semantic network engine, for determining a skill level for the skill information extracted from

the electronic document and for storing the skill level in the target database.

2. The apparatus of claim 1 wherein the skill and knowledge information

extractor also maps the skill level for the skill to a qualitative scale.

3. The apparatus of claim 1 wherein the content analysis and semantic network

engine further comprises:

a thesaurus for linking together terms and skills; and

a semantic network, coupled to the thesaurus, for organizing terms and skills ofthe

thesaurus, knowledge, and categories, and for defining relationships between and among the

terms, skills, knowledge, and categories.

4. The apparatus of claim 1 wherein the skill and knowledge information

extractor determines a skill level, at least in part, by using the mathematical equation:

SkillLevel(X) = SkillLevel(X) + W(O) + W(P_f) + W(K) + W(Y) - W(LU)

5. The apparatus of claim 1 wherein the skill and knowledge information extractor determines a skill level using a non-monotonic and default reasoning approach.

6. The apparatus of claim 2 wherein a skill extracted from the electronic

document and the skill mapping to a qualitative scale are displayed on a computer.

7. An apparatus for analyzing and extracting skill and knowledge information

from an electronic document into a target database having predefined fields, the apparatus

comprising:

a thesaurus for linking together terms and skills and for defining relationships between

and among the terms and skills; and

a semantic network coupled to the thesaurus for organizing terms and skills in the

thesaurus, knowledge, and categories in a hierarchical structure;

wherein the thesaurus and semantic network are used to analyze skill and knowledge

information in the electronic document.

8. The apparatus of claim 7 further comprising:

a document pre-processor coupled to the semantic network for classifying the

electronic document as a document type and for performing an initial analysis on the

electronic document.

9. The apparatus of claim 7 further comprising:

a heuristics engine coupled to the semantic network for applying a set of heuristics to

the electronic document.

10. The apparatus of claim 7 further comprising: a skill and knowledge information extractor for extracting skill and knowledge information from the electronic document and for determining a skill level for skill

information extracted from the electronic document.

11. The apparatus of claim 10 further comprising:

a target database coupled to the semantic network for storing skill and skill level

information in predefined fields in the target database.

12. A method for determining a skill level for skill information extracted from an

electronic document, the method comprising the steps of:

identifying skill and knowledge information in the electronic document;

extracting the skill and knowledge information from the electronic document; and

determining a skill level for skill information extracted from the electromc document.

13. The method of claim 12 wherein the step of determining a skill level is

performed by a skill and knowledge information extractor.

14. The method of claim 12 wherein the step of identifying skill and knowledge

information is performed using a semantic network.

15. A method for processing skill and knowledge information from an electronic

document, the method comprising the steps of:

identifying skill and knowledge information in the electronic document;

extracting the skill and knowledge information from the electronic document;

determimng a skill level for skill information extracted from the electronic document;

and mapping the skill level to a qualitative scale.

16. A computer implemented method for extracting and displaying skill and

knowledge information from an electronic document, the method comprising the steps of:

identifying skill and knowledge information in the electronic document;

extracting the skill and knowledge information from the electromc document;

determining a skill level for skill information extracted from the electronic document;

and

mapping the skill level to a qualitative scale.

17. A computer-readable medium for extracting and displaying skill and

knowledge information from an electronic document, the computer-readable medium

comprising code for performing the steps of:

identifying skill and knowledge information in the elecfronic document;

extracting the skill and knowledge information from the electronic document;

and

mapping the skill level to a qualitative scale.