WO2000026839A9 - Advanced model for automatic extraction of skill and knowledge information from an electronic document - Google Patents

Advanced model for automatic extraction of skill and knowledge information from an electronic document

Info

Publication number
WO2000026839A9
Authority
WO
WIPO (PCT)
Prior art keywords
skill
electronic document
information
knowledge
knowledge information
Prior art date
Application number
PCT/US1999/026083
Other languages
French (fr)
Other versions
WO2000026839A8 (en)
WO2000026839A1 (en)
Inventor
Prabhat K Andleigh
Nagaraju Pappu
Vasudeva V Kalindindi
Original Assignee
Infodream Corp
Prabhat K Andleigh
Nagaraju Pappu
Vasudeva V Kalindindi
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from PCT/US1998/027664 external-priority patent/WO1999034307A1/en
Application filed by Infodream Corp, Prabhat K Andleigh, Nagaraju Pappu, Vasudeva V Kalindindi filed Critical Infodream Corp
Publication of WO2000026839A1 publication Critical patent/WO2000026839A1/en
Publication of WO2000026839A8 publication Critical patent/WO2000026839A8/en
Priority to GB0113250A priority Critical patent/GB2359168A/en
Publication of WO2000026839A9 publication Critical patent/WO2000026839A9/en

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06Q: INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q99/00: Subject matter not provided for in other groups of this subclass

Definitions

  • This invention relates to the field of computer analysis of electronic documents. More
  • Information to be sorted and stored in a computer database may reside in numerous
  • the project manager must sift through several documents which contain the
  • manager may have to read the documents several times and may have to review and type the
  • a computerized system which can analyze and extract pertinent information from
  • HTML HyperText Markup Language
  • the keyword based search engines cannot address this
  • the present invention is an apparatus, method, and computer-readable medium for
  • information from an electronic document (104) comprises a content analysis and semantic
  • the skill and knowledge section processor (702) uses a non-monotonic reasoning principle to
  • the content analysis and semantic network engine (216) further comprises a thesaurus (221)
  • document (104) comprises the steps of: identifying skill and knowledge information in the
  • FIG. 1 is a block diagram of a preferred embodiment of a system 100 in accordance
  • FIG. 2 is a block diagram of a preferred embodiment of an extraction server 108 in
  • Figure 3 is a flow chart of a preferred embodiment of the steps performed by the
  • Figure 4 is a block diagram of a preferred embodiment of a thesaurus 221.
  • Figure 5 is a block diagram of a preferred embodiment of a semantic network 220.
  • Figure 6 is a flow chart of a preferred embodiment of the steps performed by the
  • FIG. 7 is a block diagram of a preferred embodiment of a system 700 in accordance with
  • Figure 8 is a flow chart of a preferred embodiment of the steps performed by the skill
  • Figure 9 is a screen shot of a user interface of a preferred embodiment of a target
  • database 110 display for skill information.
  • a host computer 102 using the method and system
  • unstructured text refers to any document
  • containing unstructured text include, but are not limited to, a resume, performance appraisals,
  • the host computer 102 is a conventional computer having a keyboard and mouse
  • the electronic document 104 may be prepared in any way.
  • the electronic document 104 is processed by host computer 102 using the present
  • host computer 102 uses extraction server 108 to analyze, retrieve and
  • words and word groups are used to mean any text that may be derived from document 104 including, but not limited to, individual
  • the extraction server 108 extracts words or numbers, phrases, whole sentences, and blocks of text.
  • extraction server 108 is described in more detail below with reference to Figures 2 through 6.
  • the target database 110 comprises predefined tables with predefined columns for
  • a predefined table and predefined columns correspond to a particular document
  • For example, if document 104 is a resume, then a predefined table for a document type
  • predefined table for a document type called "patent document” may have predefined columns
  • predefined table but that many different compilations of predefined tables and columns may
  • groups stored in the target database 110 can be stored in electronic form on any type of
  • the process of extraction performed by the extraction server 108 preferably uses a
  • As used herein, a "non-monotonic reasoning principle"
  • refers to a process whereby at every stage during extraction, the extraction server 108
  • a string '1987' is first assumed to be a number, and if
  • the present invention advantageously allows a user to extract skill
  • the present invention analyzes an electronic copy of a text document and extracts
  • the present invention operates upon electronic documents in any electronic file
  • the electronic document 104 may be any electronic file
  • electronic document 104 may be an electronic form of a hard copy of a document converted
  • OCR Optical Character Recognition
  • the extraction server 108 comprises a document pre-processor 210 coupled to the memory 106 where the electronic
  • document 104 is stored, a heuristics engine 212 coupled to the document pre-processor 210, a morphological analysis engine 214 coupled to the heuristics engine 212, a content analysis
  • the content analysis and semantic network engine 216 preferably
  • section processors 218 and a semantic network 220.
  • the document pre-processor 210 retrieves the electronic document 104 from memory
  • the document pre-processor 210 performs the initial analysis and extraction of the
  • the document pre-processor 210 identifies the
  • For example, if the electronic document 104 is a Microsoft Word file, then the document pre-processor 210 identifies the file by the Microsoft Word signature and uses the Microsoft
  • the document pre-processor 210 filters out (304) any unnecessary and unwanted
  • information such as, but not limited to, email headers, OCR headers, blank pages, and
  • any information that is not part of the original document.
  • the text contains vertical tables, these tables are preferably converted
  • the document pre-processor 210 then stores (306) formatting information for
  • the document 104 such as, but not limited to, the fonts used, font sizes, section titles, and
  • the document pre-processor 210 then performs paragraph identification heuristics
  • paragraph characteristics refers to the statistical properties of the paragraph.
  • Paragraph characteristics include, but are not limited to, the number of words in the
  • pre-processor 210 groups the paragraphs into sections. During this step, the paragraphs are
  • section title are grouped into one section. If no section titles are found, then using the
  • the heuristic engine 212 applies a set of heuristics, that is a set of rules, to the electronic document 104 for analyzing information in the electronic document 104.
  • heuristics which are applied to the electronic document 104 are associated with a particular document type. For example, if the document type is a "resume", then the set of heuristics
  • the morphological analysis engine 214 is used for target language analysis and is
  • LinguistiX 2.0 API is a language neutral programming
  • the LinguistiX API can analyze documents in any language such as
  • the present invention can extract information from documents
  • the Heuristics Engine 212 uses the following features provided by the
  • LinguistiX API tokenization, lexical analysis, tagging, and noun-phrase extraction.
  • text from the electronic document 104 can be analyzed in terms of its linguistic roots and
  • LinguistiX tokenization includes the ability to recognize multi-word constructs such as
  • the lexical analysis feature identifies the grammatical features of a word in
  • the tagging feature identifies the grammatical category of words
  • the noun-phrase extraction identifies multi-word phrases in documents.
  • LinguistiX phrase extraction technology enables software to work with these larger concepts to provide improved information analysis and retrieval. For example, 'Windows
  • the extraction server 108 may discover that a particular
  • word is a proper noun. Whether that word is the name of the person or the name of a
  • the database interface 222 is a set of APIs that provide a mechanism for retrieving
  • the underlying implementation of the target database 110 is hidden from the application using
  • the extraction server 108 can work with any industry standard
  • relational database software such as Oracle or Microsoft SQL Server without having to
  • the database interface 222 provides
  • the content analyzer and semantic network engine 216 analyzes the content of the
  • the content analyzer and semantic network engine 216 comprises section
  • processors 218 which extract information from a particular section of interest, and a semantic
  • the semantic network 220 uses a thesaurus 221 and a phrase extraction process
  • the target database 110 receives related words and word groups from the target database 110.
  • FIG. 4 is a block diagram of a preferred embodiment of a thesaurus
  • the thesaurus 221 is a vocabulary database for the extraction server 108 and is
  • the thesaurus 221 groups all related terms 402 in a language under a
  • a "concept” or “skill” 404 comprises a set of terms 402 that are language specific
  • each skill 404 that are known to the thesaurus 221 and specify certain characteristics for each name.
  • each skill 404 has a unique skill identifier (ConceptID).
  • Each term 402 in each language in the thesaurus 221 has a
  • term1 402A may consist of 'MS VC++'
  • term2 402B may consist of 'Microsoft Visual C++'
  • term3 402C may consist of 'MS Visual C++'. All these terms 402 are linked to the
  • the thesaurus 221 allows the extraction server 108 to recognize
  • the thesaurus 221 may also comprise other information such as the attributes of a
  • non-subsumption refers to relationships that are not based on subsumption.
  • C++ and Java are related, but neither subsumes the other. All these relationships
  • the thesaurus advantageously
  • thesaurus facilitates the access to concept relationships and to term and skill attributes
  • the semantic network 220 provides a way of arranging all the skills
  • the semantic network 220 comprises skills 404 at the lowest level
  • the semantic network 220 together with the thesaurus 221 provides a four level
  • a category 504 is the highest level in the semantic network 220. Broad categories
  • 504 may be created according to a specific industry which fully subsume other knowledge-
  • the semantic network 220 categorizes all knowledge-concepts
  • Knowledge-concepts 502 comprise the next level in the semantic
  • Each knowledge-concept 502 is a collection of skills 404 that add to
  • the semantic network 220 categorizes all skills 404 into knowledge-
  • the semantic network 220 categorizes all
  • terms 402 into skills 404. As described earlier with reference to Figure 4, terms 402 comprise language dependent strings that are found in the electronic document 104. Terms 402
  • the entire semantic network 220 separate from the thesaurus 221, comprises
  • single knowledge-concept 502 can comprise several skills 404 and a single skill 404 can be
  • Both these skills 404 may be grouped under a
  • 'Visual C++' may also belong to the knowledge-concept 502 'Visual Programming
  • the knowledge-concept 502 "Visual Programming Environment” may also be used as a "Visual Programming Environment"
  • the semantic network 220 uses subsumption as the basis for the hierarchical
  • the subsumption-based network removes these drawbacks and aids in retrieving
  • the object 'JDBC' is subsumed by a more general object called 'Java
  • An object may also be subsumed by more than one higher level object. For example,
  • the skill 404 'JDBC' may be subsumed by at least two knowledge-concepts 502 such as 'Java
  • identification heuristics are performed (602) on the electronic document 104 to identify the
  • the sections of interest are configured by the user when the extraction server 108 is first installed. The sections are then analyzed
  • the target database can then be retrieved and manipulated by computer program applications
  • the present invention provides a powerful semantic
  • the semantic network can store information relating to any field, industry or technology, and
  • the section processors 218 extract information from sections of interest in an
  • engine 216 comprises a section processor 218 for extracting words or word groups from each
  • Section processors 218 are configured to operate on a specific document type and may
  • resumes typically contain several section processors 218.
  • processors 218 for a resume document type may comprise a cover letter section processor for
  • an education section processor for extracting the skills and experience of a candidate
  • processor 218 analyzes a particular section in the electronic document 104 and extracts
  • each section processor 218 applies a set of
  • a skills and knowledge information extractor 702.
  • knowledge information extractor 702 allows the system to automatically extract from a
  • the skills and knowledge information extractor 702 allows a user to automatically
  • a "career profile” refers to any qualitative and quantitative information about a candidate's work
  • such information includes, but is not
  • “skill” or “skill information” refers to the skills 404 in the thesaurus 221 and
  • semantic network 220 which relate to those terms, and "knowledge” or “knowledge
  • a candidate may have used the terms "Microsoft Visual C++” or "MS VC++”.
  • object oriented programming which in turn may be related to the
  • the skill and knowledge information extractor 702 uses a non-monotonic reasoning
  • the present invention finds a skill, X, in a candidate's resume, R.
  • the skill and knowledge information extractor 702 assumes that the skill level of the candidate for the skill X is average. As the skill and
  • knowledge information extractor 702 obtains additional information from the resume R about
  • this weightage value is
  • W(K) may also be added to the skill level.
  • associated skills are the skills
  • Associated skills can be determined using the semantic network 220 and the thesaurus 221.
  • W(Y) may also be added to the skill level. Moreover, the number of years since the skill X
  • W(LU) a negative factor
  • SkillLevel(X') = SkillLevel(X) + W(O) + W(P) + W(K) + W(Y) - W(LU)
  • the weightage functions are computed using the total number of skill levels that are
  • weightage factors used to adjust the skill level are not limited to those
  • the computation of the skill level of a particular skill for a candidate can also be
  • the resume would thus be used to adjust the skill level either up or down. Additionally, the use of terms in the resume which are related in the semantic network 220 and thesaurus 221
  • the skill and knowledge information extractor 702 determines a single value for the skill level for the candidate for the particular skill.
  • knowledge information extractor 702 then maps the skill value to a scale for qualitatively
  • the present invention allows a user to determine the proficiency of a candidate's skill
  • a qualitative scale may map the final skill value to a scale comprising numbers
  • a scale may map the final skill value to a scale comprising numbers
  • the qualitative scale may be determined by
  • the categories, knowledge, skills and terms are preferably set up in a relational database prior to the extraction process. As described above with reference to Figures 4 and 5,
  • the relationship between categories and knowledge is
  • a resume is evaluated (802) for
  • the skill level for that particular skill is then determined (804) using the above described techniques. After a final skill level value is determined, the skill level is
  • Window 902 displays the particular skills analyzed from a candidate's resume, the
  • portion of window 902 indicates that the candidate has some skill as an analyst, that the
  • the present invention is designed as a set of Object Oriented Libraries and contains
  • the present invention may be implemented to run on a
  • any relational table is preferably represented
  • Table 1 holds the documents that are to be extracted. It holds the following information:
  • Table 2 holds information about the scheduled extraction tasks.
  • Table 3 holds the personal information like name of the person, contact address, current employer, resume summary etc.
  • the XtractionXpert automatically extracts the following information from the resume:
  • Table 16 provides information regarding the relationships between categories and knowledge information.
  • Table 17 provides knowledge information for semantic network 220.
  • Table 18 provides information relating to skills.
  • Table 19 provides information on relationships between skills and knowledge.
  • Table 20 provides information on terms.
  • Table 21 stores information about different languages to which the terms belong.
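
The skill-level adjustment listed above starts from an assumed level and then adds several weightage terms and subtracts W(LU). A minimal Python sketch of that arithmetic follows. The class name, the field names, the numeric value 5.0 standing in for an "average" starting level, and the sample weights are all assumptions made for illustration; this excerpt does not say how each weightage is computed or what scale the levels use.

```python
from dataclasses import dataclass

# Hypothetical numeric stand-in for the "average" default skill level.
AVERAGE_SKILL_LEVEL = 5.0


@dataclass
class SkillWeights:
    """Weightage terms named in the formula; values here are illustrative only."""
    w_o: float = 0.0   # W(O): meaning not given in this excerpt
    w_p: float = 0.0   # W(P): meaning not given in this excerpt
    w_k: float = 0.0   # W(K)
    w_y: float = 0.0   # W(Y)
    w_lu: float = 0.0  # W(LU): the negative factor, subtracted below


def adjust_skill_level(base_level: float, weights: SkillWeights) -> float:
    """Apply SkillLevel(X') = SkillLevel(X) + W(O) + W(P) + W(K) + W(Y) - W(LU)."""
    return (
        base_level
        + weights.w_o
        + weights.w_p
        + weights.w_k
        + weights.w_y
        - weights.w_lu
    )


if __name__ == "__main__":
    sample = SkillWeights(w_o=1.0, w_p=0.5, w_k=0.5, w_y=1.0, w_lu=2.0)
    print(adjust_skill_level(AVERAGE_SKILL_LEVEL, sample))
```

Starting from a default and revising it as evidence arrives mirrors the non-monotonic reasoning principle described in the text: with no weights at all, the function simply returns the assumed default.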

Landscapes

  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Machine Translation (AREA)

Abstract

An apparatus, method, and computer readable medium for analyzing and extracting skill and knowledge information from an electronic document (104) and for storing the extracted skill and knowledge information into predefined fields or tables in a target database (110) comprises a content analysis and semantic network engine (216) for analyzing and extracting skill and knowledge information from the electronic document (104). A skill and knowledge information extractor (702) is coupled to the content analysis and semantic network engine (216), for determining a skill level for the skill information extracted from the electronic document (104). In a preferred embodiment, the skill and knowledge section processor (702) uses a non-monotonic reasoning principle to determine a skill level for skill information extracted from the electronic document (104). The content analysis and semantic network engine (216) further comprises a thesaurus (221) for linking together terms (402) and skill information (404), and for defining relationships between and among the terms (402) and skill information (404), and a semantic network (220) coupled to the thesaurus (221), for organizing the terms (402) and skill information (404) in the thesaurus (221), along with knowledge information (502) and categories (504), in a hierarchical structure.

Description

ADVANCED MODEL FOR AUTOMATIC EXTRACTION OF SKILL AND KNOWLEDGE INFORMATION FROM AN ELECTRONIC DOCUMENT
RELATED APPLICATION
The subject matter of this application is a continuing application of and claims priority
from U.S. patent application Serial No. 09/380,219, filed August 27, 1999, descending in
priority from PCT application PCT/US98/27664, filed on December 28, 1998, and entitled
"Xtraction Server" by Prabhat K. Andleigh, Nagaraju Pappu, and Vasudeva Kalidindi. Said
two earlier applications are commonly assigned with the instant application.
The subject matter of this application is also related to and claims priority from U.S.
Provisional Application Serial No. 60/107,063, filed November 4, 1998, and entitled
"Advanced Model for Automatic Extraction of Content, Skills, and Knowledge from
Resumes" by Prabhat K. Andleigh, Nagaraju Pappu, and Vasudeva Kalidindi, which
application is commonly assigned with the instant application, and is incorporated herein by
reference in its entirety.
TECHNICAL FIELD
This invention relates to the field of computer analysis of electronic documents. More
specifically, it relates to the field of information retrieval to convert and store information in
documents written in a natural language into a predefined structure which can be retrieved
and manipulated by computer program applications.
BACKGROUND OF THE INVENTION
Information to be sorted and stored in a computer database may reside in numerous
electronic documents. For example, information about people and their specific talents and
skills may reside in electronic documents, such as resumes, performance appraisals, design
documents, publications, books, patent documents, and email messages. When an individual is trying to organize and sort out specific information from such electronic documents, the
individual usually has to open each document separately and manually analyze, retrieve, and
store the relevant data in the particular database. For example, a project manager who would
like to find the best employee for a specific job may have a specific job description. When
searching for an employee whose skills, knowledge and talent are best suited for the specific
job description, the project manager must sift through several documents which contain the
necessary information. Such a process is time consuming and inefficient, because the project
manager may have to read the documents several times and may have to review and type the
information into a computer database in order to organize the various pieces of information
into a coherent summary.
A computerized system which can analyze and extract pertinent information from
different electronic documents would provide a more efficient solution to this problem.
However, such text documents are often written in unstructured natural language text for
other people to understand. Thus, computer programs such as database applications cannot
efficiently process documents written in natural language texts. Rather, computer programs
can process only information which has been stored in a highly structured fashion in order to
retrieve and manipulate that information. Additionally, these documents may be prepared in
a variety of different file formats, such as Microsoft Word 97, Rich Text Format, PDF,
WordPerfect, ASCII files, and HTML, and may be stored in different areas within a computer.
There are a variety of information retrieval programs such as Internet search engines
that can retrieve documents that match a set of keywords. Their scope is very limited in the
context of the above mentioned problem, because they cannot understand the text, and certainly they cannot make any connection between the document and the person who is
related to that document. Another problem is that the 'information of interest' will vary
significantly from one organization to another. For example, a health care organization will
be interested in the skills and talents related to the medical field, but the skills related to
computers may not be of significant interest, whereas a software development organization
will be interested in the computer and software related skills, but may not be interested in
medical or first-aid related skills. The keyword based search engines cannot address this
problem of retrieving only the 'information of interest'. As a result, there is a vast amount of
information about people which cannot be easily processed by computer programs.
For example, in today's large corporations and government organizations, it is not
uncommon to receive hundreds of thousands of resumes of potential candidates in a very
short time. Recruiting the right candidates from such a vast pool of applicants is a very
complicated problem. It is crucial for organizations to find the people with the right
knowledge and skill set. In essence, managers have to deal with a vast number of resumes,
try to understand the content within the resumes, and short-list candidates who have the right
skills and knowledge. For example, if an organization wants to recruit a middle level
manager with 5 to 8 years of experience to lead a development project, the organization will
need to sort through thousands of resumes and determine from each one whether that
particular candidate has the requisite knowledge and skill level. It is not possible to find the
best resumes using a standard full text search engine because such search programs search for
a particular input string and retrieve only resumes which contain that particular input string.
Such an approach is not that useful, because a particular skill may be written using many
different terms (e.g. Microsoft Word, MS Word, Word 97, etc.) even though the terms all refer to the same or similar ideas. Moreover, in addition to not being able to correctly
identify a candidate's skills, a typical search program cannot identify the type of experience
with that skill, the duration of that experience, or the overall knowledge gained by the
candidate in a specific skill group. Additionally, it is also very desirable to have a system for
determining not only the knowledge and skills of a candidate but also the proficiency level of
a candidate in a particular skill.
Therefore, what is needed is a system for analyzing and extracting information from
an electronic document and for storing the extracted information in a database. Additionally,
what is needed is a system for analyzing and extracting skill and knowledge information from
an electronic document and for determining a skill level for skill information and for mapping
such skill level information to a qualitative scale.
DISCLOSURE OF INVENTION
The present invention is an apparatus, method, and computer-readable medium for
analyzing and extracting skill and knowledge information from an electronic document (104)
and for storing the extracted skill and knowledge information into predefined fields or tables
in a target database (110). The system for analyzing and extracting skill and knowledge
information from an electronic document (104) comprises a content analysis and semantic
network engine (216) for analyzing and extracting skill and knowledge information from the
electronic document (104), and a skill and knowledge information extractor (702) coupled to
the content analysis and semantic network engine (216), for determining a skill level for the
skill information extracted from the electronic document (104). In a preferred embodiment,
the skill and knowledge section processor (702) uses a non-monotonic reasoning principle to
determine a skill level for skill information extracted from the electronic document (104). The content analysis and semantic network engine (216) further comprises a thesaurus (221)
for linking together terms (402) and skill information (404) and for defining relationships
between and among the terms (402) and skill information (404), and a semantic network
(220) coupled to the thesaurus (221), for organizing the terms (402) and skill information
(404) in the thesaurus (221), knowledge information (502), and categories (504) in a
hierarchical structure.
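
As a rough illustration of the structure just described, the sketch below models the thesaurus as a mapping from terms to skills and the semantic network as mappings from skills to knowledge information and from knowledge information to categories. It is a simplified sketch under stated assumptions, not the disclosed implementation: the dictionary names, the `classify` function, and the "Software Development" category are hypothetical, and the sample entries are chosen only to echo the Visual C++ spellings that appear in this document.

```python
# Thesaurus level: language-specific terms (402) linked to one skill (404).
TERM_TO_SKILL = {
    "ms vc++": "Visual C++",
    "microsoft visual c++": "Visual C++",
    "ms visual c++": "Visual C++",
}

# Semantic network level: a skill (404) may belong to several
# knowledge-concepts (502).
SKILL_TO_KNOWLEDGE = {
    "Visual C++": ["Object Oriented Programming", "Visual Programming Environment"],
}

# Top level: knowledge-concepts (502) grouped under categories (504).
# The category name is a hypothetical example.
KNOWLEDGE_TO_CATEGORY = {
    "Object Oriented Programming": "Software Development",
    "Visual Programming Environment": "Software Development",
}


def classify(term):
    """Resolve a raw term to its skill, knowledge-concepts, and categories.

    Returns None when the term is not known to the thesaurus.
    """
    skill = TERM_TO_SKILL.get(term.strip().lower())
    if skill is None:
        return None
    knowledge = list(SKILL_TO_KNOWLEDGE.get(skill, []))
    categories = sorted(
        {KNOWLEDGE_TO_CATEGORY[k] for k in knowledge if k in KNOWLEDGE_TO_CATEGORY}
    )
    return {"skill": skill, "knowledge": knowledge, "categories": categories}


if __name__ == "__main__":
    print(classify("MS VC++"))
    print(classify("Microsoft Visual C++"))
```

Because every spelling resolves to the same skill before the hierarchy is consulted, two resumes that write the skill differently end up with identical skill, knowledge, and category records.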
A method for extracting skill and knowledge information from an electronic
document (104) comprises the steps of: identifying skill and knowledge information in the
electronic document (802); determining a skill level for skill information from the electronic
document (804); and mapping the skill level to a qualitative scale (806). The method further
comprises the step of storing the skill information and qualitative skill level scale mapping in
the target database (808).
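
The four steps of the method (802 through 808) can be pictured as a short pipeline. The Python sketch below is only an outline of that flow under loud assumptions: the skill list, the four-point scale, and the mention-counting rule used to pick a level are placeholders invented for this example and are not the technique the application describes, and a plain list stands in for the target database.

```python
# Hypothetical skill vocabulary and qualitative scale, for illustration only.
KNOWN_SKILLS = {"java": "Java", "sql": "SQL"}
SCALE = ["none", "basic", "intermediate", "advanced"]


def identify_skills(text):
    """Step 802: identify skill information in the document text."""
    words = {w.strip(".,;:()").lower() for w in text.split()}
    return sorted(KNOWN_SKILLS[w] for w in words if w in KNOWN_SKILLS)


def determine_skill_level(text, skill):
    """Step 804: determine a numeric skill level.

    Placeholder rule (an assumption): count mentions, capped at the top of the scale.
    """
    mentions = text.lower().count(skill.lower())
    return min(mentions, len(SCALE) - 1)


def map_to_scale(level):
    """Step 806: map the numeric level to a qualitative label."""
    return SCALE[level]


def extract_and_store(candidate, text, target_database):
    """Step 808: store each skill with its level and qualitative rating."""
    for skill in identify_skills(text):
        level = determine_skill_level(text, skill)
        target_database.append(
            {
                "candidate": candidate,
                "skill": skill,
                "level": level,
                "rating": map_to_scale(level),
            }
        )
    return target_database


if __name__ == "__main__":
    resume = "Java developer. Built Java services with SQL."
    for row in extract_and_store("A. Candidate", resume, []):
        print(row)
```

Keeping each step in its own function makes it straightforward to swap the placeholder level rule for a weight-based computation without touching identification or storage.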
BRIEF DESCRIPTION OF THE DRAWINGS
Figure 1 is a block diagram of a preferred embodiment of a system 100 in accordance
with the present invention.
Figure 2 is a block diagram of a preferred embodiment of an extraction server 108 in
accordance with the present invention.
Figure 3 is a flow chart of a preferred embodiment of the steps performed by the
document pre-processor 210.
Figure 4 is a block diagram of a preferred embodiment of a thesaurus 221.
Figure 5 is a block diagram of a preferred embodiment of a semantic network 220.
Figure 6 is a flow chart of a preferred embodiment of the steps performed by the
extraction server 108.
Figure 7 is a block diagram of a preferred embodiment of a system 700 in accordance
with the present invention.
Figure 8 is a flow chart of a preferred embodiment of the steps performed by the skill
and knowledge information extractor 702.
Figure 9 is a screen shot of a user interface of a preferred embodiment of a target
database 110 display for skill information.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENTS
Referring now to Figure 1, a system 100 upon which a preferred embodiment of the
present invention operates is shown. A host computer 102, using the method and system
described herein, operates upon an electronic document 104, derived from a text document
which contains unstructured text. As used herein "unstructured text" refers to any document
which has been written in a natural language such as English. Examples of documents
containing unstructured text include, but are not limited to, a resume, performance appraisals,
design documents, publications, books, patent documents, and email messages. In a preferred
embodiment, the host computer 102 is a conventional computer having a keyboard and mouse
for input (not shown), and a conventional memory 106 associated with host computer 102 for
storing the electronic document 104. The electronic document 104 may be prepared in any
electronic file format, such as Microsoft Word 97, Rich Text Format, PDF, WordPerfect,
ASCII files, and HTML.
The electronic document 104 is processed by host computer 102 using the present
invention. Specifically, host computer 102 uses extraction server 108 to analyze, retrieve and
store words and word groups from the electronic document 104 into a predefined structure in target database 110. As used herein, the terms "words" and "word groups" are used to mean any text that may be derived from document 104 including, but not limited to, individual
words or numbers, phrases, whole sentences, and blocks of text. The extraction server 108
identifies the document type of the document 104 and determines which words and word
groups are to be extracted from the document 104. The structure and operation of the
extraction server 108 is described in more detail below with reference to Figures 2 through 6.
The target database 110 comprises predefined tables with predefined columns for
storing the word and word groups extracted from the electronic document 104. In a preferred
embodiment, a predefined table and predefined columns correspond to a particular document
type. For example, if document 104 is a resume, then a predefined table for a document type
called "resume" may have predefined columns such as "name and address", "education", and
"skills and experience". As another example, if document 104 is a patent document, then a
predefined table for a document type called "patent document" may have predefined columns
such as "inventors", "company", "patent number", and "field of search". The predefined
tables and columns in target database 110 are organized ahead of time, and one skilled in the
art will realize that the present invention is not limited to a particular document type or a
predefined table, but that many different compilations of predefined tables and columns may
be stored in target database 110 within the scope of this invention. The words and word
groups stored in the target database 110 can be stored in electronic form on any type of
computer data storage device or they may be printed out in a hard-copy printed format.
The process of extraction performed by the extraction server 108 preferably uses a
non-monotonic reasoning principle. As used herein, a "non-monotonic reasoning principle"
refers to a process whereby at every stage during extraction, the extraction server 108
assumes a reasonable default value. That default value is modified as further information
becomes available. For example, a string '1987' is first assumed to be a number. If further
information qualifying the string as a date is available (in this case, for example, that the
string is preceded by another string 'Jan'), then the assumption is changed. If again
further information becomes available to negate the previous assumption, the assumption is
changed again.
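By way of illustration only, the default-and-revise behavior described above may be sketched in Python as follows. The function name classify_tokens, the MONTHS and COUNT_NOUNS sets, and the rule that a following count noun negates the date reading are assumptions made for this sketch; they are not taken from the described embodiment.

```python
# Hypothetical sketch of non-monotonic typing: assume a default, then revise it.
MONTHS = {"jan", "feb", "mar", "apr", "may", "jun",
          "jul", "aug", "sep", "oct", "nov", "dec"}
COUNT_NOUNS = {"units", "lines"}  # assumed trigger words that negate a date reading


def classify_tokens(tokens):
    """Label each token, starting from a default and revising it from context."""
    labels = []
    for i, tok in enumerate(tokens):
        # Default assumption: a run of digits is a plain number.
        label = "number" if tok.isdigit() else "word"
        prev_tok = tokens[i - 1].lower().rstrip(".") if i > 0 else ""
        next_tok = tokens[i + 1].lower() if i + 1 < len(tokens) else ""
        # Revision: a number preceded by a month abbreviation is read as a date.
        if label == "number" and prev_tok in MONTHS:
            label = "date"
        # Further revision: later evidence can negate the previous assumption.
        if label == "date" and next_tok in COUNT_NOUNS:
            label = "number"
        labels.append(label)
    return labels
```

Each rule only overrides the label left by the rule before it, so additional evidence can be handled by appending further rules without changing the default.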
Thus, the present invention advantageously allows a user to extract skill and
knowledge information from an electronic document directly into a database. More
specifically, the present invention analyzes an electronic copy of a text document and extracts
words and word groups relating to skill and knowledge information into a target database
comprising predefined tables and columns associated with a particular document type.
Moreover, the present invention operates upon electronic documents in any electronic file
format. The extracted skill and knowledge information stored in the target database can then
be retrieved and manipulated by other computer program applications.
Referring now to Figure 2, a block diagram of a preferred embodiment of the
extraction server 108 is shown. The electronic document 104 may be any electronic file
stored in memory 106 which is accessible by the extraction server 108. For example, the
electronic document 104 may be an electronic form of a hard copy of a document converted
using a conventional optical scanner and Optical Character Recognition (OCR) software 202,
a Microsoft Word file 204, an ASCII text file 206 or an email attachment 208. The database
applications which manipulate the extracted information in target database 110 are also
preferably stored in memory 106. In a preferred embodiment, the extraction server 108
comprises a document pre-processor 210 coupled to the memory 106 where the electronic
document 104 is stored, a heuristics engine 212 coupled to the document pre-processor 210,
a morphological analysis engine 214 coupled to the heuristics engine 212, a content analysis
and semantic network engine 216 coupled to the document pre-processor 210, and a database
interface 222 coupled to the content analysis and semantic network engine 216 and to the
target database 110. The content analysis and semantic network engine 216 preferably
comprises section processors 218 and a semantic network 220.
The document pre-processor 210 retrieves the electronic document 104 from memory
106 and performs the initial analysis of the electronic document 104. Referring now to Figure
3, a flowchart of the steps of a preferred operation of the document pre-processor 210 is
shown. The document pre-processor 210 performs the initial analysis and extraction of the
electronic document 104 by first converting (302) the electronic document 104 from its native
file format into ASCII text. More specifically, the document pre-processor 210 identifies the
file format of the electronic document 104 and extracts the ASCII text out of the document 104.
For example, if the electronic document 104 is a Microsoft Word file, then the document
pre-processor 210 identifies the file by the Microsoft Word signature and uses the Microsoft
Object Linking and Embedding Software Development Kit (Microsoft OLE 2.0 SDK) to
extract text from the Microsoft Word File.
Next, the document pre-processor 210 filters out (304) any unnecessary and unwanted
information such as, but not limited to, email headers, OCR headers, blank pages, and
unwanted characters. Preferably, any information that is not part of the original document is
treated as unnecessary information. For example, email headers, non-ASCII characters at the
beginning or at the end of the file, extra blank lines and blank spaces are removed from the
text. Additionally, if the text contains vertical tables, these tables are preferably converted
into horizontal tables. If the text contains multiple columns, it is preferably converted into
a single column. The document pre-processor 210 then stores (306) formatting information for
the document 104 such as, but not limited to, the fonts used, font sizes, section titles, and
subsections.
The document pre-processor 210 then performs paragraph identification heuristics
(308) on the electronic document 104. During this step, the beginning and end of each
paragraph is identified, and the paragraph characteristics are gathered. As used herein, the
phrase "paragraph characteristics" refers to the statistical properties of the paragraph.
Paragraph characteristics include, but are not limited to, the number of words in the
paragraph, the number of lines in the paragraph, the average number of words per line,
whether any line has a bullet as the starting character, and whether there are any underlined
sentences in the paragraph.
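As a minimal sketch, assuming the paragraph is available as plain text with newline-separated lines, the paragraph characteristics listed above may be gathered as follows. The function name paragraph_characteristics, the dictionary keys, and the BULLETS tuple are hypothetical. Underlining is omitted because plain text carries no underline information; a real implementation would read it from the stored formatting information.

```python
# Hypothetical sketch: gather statistical properties of one paragraph.
BULLETS = ("-", "*", "\u2022")  # assumed set of bullet characters


def paragraph_characteristics(paragraph):
    """Return simple statistics for a paragraph given as newline-separated text."""
    lines = [ln for ln in paragraph.splitlines() if ln.strip()]
    words_per_line = [len(ln.split()) for ln in lines]
    total_words = sum(words_per_line)
    return {
        "num_words": total_words,
        "num_lines": len(lines),
        "avg_words_per_line": total_words / len(lines) if lines else 0.0,
        # True if any line starts with one of the assumed bullet characters.
        "has_bullet": any(ln.lstrip()[:1] in BULLETS for ln in lines),
    }
```

Note that in this sketch a leading bullet character is counted as a word, since words are split on whitespace only.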
Finally, the document pre-processor 210 performs paragraph grouping heuristics (310)
on the electronic document 104. Once the paragraphs have been identified, the document
pre-processor 210 groups the paragraphs into sections. During this step, the paragraphs are
grouped into sections based on the paragraph characteristics as well as on any section
titles that precede the paragraphs. Starting at the beginning of the electronic document 104,
the first heading or section title is identified, and the following paragraphs until the next
section title are grouped into one section. If no section titles are found, then using the
paragraph characteristics, all the similar paragraphs are grouped into sections. Additionally,
paragraphs that have the same or similar characteristics are grouped together into sections.
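The title-driven part of the grouping step may be sketched as follows. This is an illustration only: the function group_into_sections and the is_title predicate passed to it are hypothetical, and the fallback grouping by similar paragraph characteristics is not shown.

```python
# Hypothetical sketch: group paragraphs under the section title that precedes them.
def group_into_sections(paragraphs, is_title):
    """Walk the paragraphs in order; each title starts a new section."""
    sections = []
    current = {"title": None, "paragraphs": []}
    for p in paragraphs:
        if is_title(p):
            # Close the section collected so far, if it holds anything.
            if current["title"] is not None or current["paragraphs"]:
                sections.append(current)
            current = {"title": p, "paragraphs": []}
        else:
            current["paragraphs"].append(p)
    if current["title"] is not None or current["paragraphs"]:
        sections.append(current)
    return sections
```

A caller might, for example, treat a short all-uppercase paragraph as a title; that choice is an assumption of the caller and not part of the described embodiment.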
The heuristics engine 212 applies a set of heuristics, that is, a set of rules, to the
electronic document 104 for analyzing information in the electronic document 104. The set
of heuristics applied to the electronic document 104 is associated with a particular
document type. For example, if the document type is a "resume", then the set of heuristics
associated with the document type "resume" is applied to the electronic document 104.
Heuristics are described in more detail in commonly assigned U.S. patent application
Serial No. 09/380,219 entitled "Extraction Server" by Prabhat K. Andleigh, Nagaraju Pappu,
and Vasudeva Kalidindi, which is incorporated herein by reference in its entirety.
The morphological analysis engine 214 is used for target language analysis and is
preferably the LinguistiX 2.0 application programming interface (API) from InXight
Corporation in Palo Alto, CA. The LinguistiX 2.0 API is a language neutral programming
interface. In other words, the LinguistiX API can analyze documents in any language such as
English, French or German. Because the heuristics engine 212 and the LinguistiX API are
external to and separate from the document pre-processor 210 and the content analysis and
semantic network engine 216, the present invention can extract information from documents
in the English, French or German language, and in any other language that is supported
by the LinguistiX API in the future.
Preferably, the Heuristics Engine 212 uses the following features provided by the
LinguistiX API: tokenization, lexical analysis, tagging, and noun-phrase extraction. Before
text from the electronic document 104 can be analyzed in terms of its linguistic roots and
function, it must first be segmented into words, punctuation and idiomatic phrases.
LinguistiX tokenization includes the ability to recognize multi-word constructs such as
HTML tags. The lexical analysis feature identifies the grammatical features of a word in
addition to its root forms. The tagging feature identifies the grammatical category of words
by their context. The noun-phrase extraction identifies multi-word phrases in documents.
LinguistiX phrase extraction technology enables software to work with these larger concepts
to provide improved information analysis and retrieval. For example, 'Windows
Programming' will be identified as one phrase, instead of two distinct words 'Windows' and
'Programming'. This feature is used by the semantic network 220 to identify the multi-word
noun phrases.
These features of the LinguistiX API are used to implement the heuristics. For
example, by using the tagging feature, the extraction server 108 may discover that a particular
word is a proper noun. Whether that word is the name of the person or the name of a
company will depend on where the word occurred in a document. For example, if the word
occurs in a contact information section of a document, then it may be the name of the person,
or name of the street, city and so on. If the word occurs in an experience section of a
document, and if it is followed by the name of a city and state, it may be a company name.
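The position-dependent rule in the example above may be sketched as a small function. The function name, the section labels, and the returned strings are hypothetical; the sketch assumes that tagging has already identified the word as a proper noun and that the caller knows whether a city and state follow it.

```python
# Hypothetical sketch: interpret a proper noun according to the section it occurs in.
def classify_proper_noun(section, followed_by_city_state):
    """Return a tentative reading for a word already tagged as a proper noun."""
    if section == "contact":
        # In a contact section the word may be a person, street, or city name.
        return "person_or_address"
    if section == "experience" and followed_by_city_state:
        # In an experience section, a following city and state suggest a company.
        return "company"
    return "unknown"
```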
The database interface 222 is a set of APIs that provide a mechanism for retrieving
and storing information to and from the target database 110. This is done in such a way that
the underlying implementation of the target database 110 is hidden from the application using
the database interface. Thus, the extraction server 108 can work with any industry standard
relational database software such as Oracle or Microsoft SQL Server without having to
change the software or its implementation. Additionally, the database interface 222 provides
the following mechanisms: a method to connect to the target database, a method to maintain
the connection to the database, a transaction model to maintain the consistency of the
database, and various methods to retrieve, query, update, insert and delete information from
the target database 110.
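One possible shape for such an interface is sketched below. It is not the interface of the described embodiment: the class and method names are hypothetical, only insertion and querying are shown, and SQLite (from the Python standard library) stands in for Oracle or Microsoft SQL Server so that the sketch is self-contained.

```python
# Hypothetical sketch: callers depend on an abstract interface, not on a database product.
import sqlite3
from abc import ABC, abstractmethod


class DatabaseInterface(ABC):
    @abstractmethod
    def insert(self, table, row):
        """Store one row, given as a dict of column name to value."""

    @abstractmethod
    def query(self, table, where):
        """Return rows whose columns equal the values in the dict `where`."""


class SqliteBackend(DatabaseInterface):
    def __init__(self, path=":memory:"):
        self.conn = sqlite3.connect(path)

    def create_table(self, table, columns):
        # Table and column names are assumed to be trusted, predefined identifiers.
        cols = ", ".join(f'"{c}" TEXT' for c in columns)
        self.conn.execute(f'CREATE TABLE "{table}" ({cols})')

    def insert(self, table, row):
        cols = ", ".join(f'"{c}"' for c in row)
        marks = ", ".join("?" for _ in row)
        with self.conn:  # commits on success, rolls back if an exception is raised
            self.conn.execute(
                f'INSERT INTO "{table}" ({cols}) VALUES ({marks})',
                tuple(row.values()),
            )

    def query(self, table, where):
        clause = " AND ".join(f'"{c}" = ?' for c in where)
        sql = f'SELECT * FROM "{table}"' + (f" WHERE {clause}" if clause else "")
        return self.conn.execute(sql, tuple(where.values())).fetchall()
```

Code written against DatabaseInterface would not need to change if another backend class were supplied, which is the property the paragraph above attributes to the database interface 222.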
The content analysis and semantic network engine 216 analyzes the content of the
electronic document 104, extracts words and word groups from the document 104, and stores
the extracted information in the appropriate tables in the target database 110. In a preferred
embodiment, the content analysis and semantic network engine 216 comprises section
processors 218 which extract information from a particular section of interest, and a semantic
network 220. The semantic network 220 uses a thesaurus 221 and a phrase extraction process
to identify the meta-concepts and categories in the electronic document 104 and extracts
related words and word groups into the target database 110. In a preferred embodiment, the
present invention may be implemented to run on a Windows NT Server and Oracle Database.
Referring now to Figure 4, a block diagram of a preferred embodiment of a thesaurus
221 is shown. The thesaurus 221 is a vocabulary database for the extraction server 108 and is
organized by skills. The thesaurus 221 groups all related terms 402 in a language under a
language independent concept 404. As used herein, a "term" 402 refers to all the individual
words or word groups that belong to a particular language along with their alternatives. As
used herein, a "concept" or "skill" 404 comprises a set of terms 402 that are language specific
and alternatives to one another. However, the skill 404 itself is language independent. Skills
404 establish synonymous relationships among all terms 402 in the thesaurus 221 that have
the same meaning. In other words, skills 404 connect all the different names for the same
skill 404 that are known to the thesaurus 221 and specify certain characteristics for each
name. Preferably, each skill 404 has a unique skill identifier (ConceptID). The ConceptID
by itself has no intrinsic meaning. Each term 402 in each language in the thesaurus 221 has a
unique term identifier. The same term 402 in different languages, for example, in English
and Spanish, will have a different term identifier for each language.
To illustrate the relation between terms 402 and skills 404, consider an example in
which term1 402A may consist of 'MS VC++', term2 402B may consist of 'Microsoft Visual
C++' and term3 402C may consist of 'MS Visual C++'. All these terms 402 are linked to the
skill 404 'Visual C++'. In other words, if the electronic document 104 uses any of the words
or word groups 'MS VC++', 'Microsoft Visual C++' or 'MS Visual C++', the thesaurus 221
allows the extraction server 108 to recognize the words or word groups as being linked to the
skill1 404A 'Visual C++'. In another example, term4, term5 and term6 are respectively 'JDK
1.1', 'Symantec Cafe', and 'JDBC', and all these terms 402 are linked to the skill2 404B called
'Java'. Thus, if the electronic document 104 uses any of the words or word groups 'JDK 1.1',
'Symantec Cafe', or 'JDBC', the thesaurus 221 allows the extraction server 108 to recognize
the word or word group as being linked to the skill2 404B 'Java'.
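The term-to-skill lookup in this example may be sketched as a simple mapping. The dictionary TERM_TO_SKILL and the function skills_in_text are hypothetical names, and case-insensitive substring matching is an assumption made to keep the sketch short; the described embodiment relies on tokenization and noun-phrase extraction instead.

```python
# Hypothetical sketch: each language-specific term points at one skill.
TERM_TO_SKILL = {
    "ms vc++": "Visual C++",
    "microsoft visual c++": "Visual C++",
    "ms visual c++": "Visual C++",
    "jdk 1.1": "Java",
    "symantec cafe": "Java",
    "jdbc": "Java",
}


def skills_in_text(text):
    """Return the set of skills whose terms occur anywhere in the text."""
    lowered = text.lower()
    return {skill for term, skill in TERM_TO_SKILL.items() if term in lowered}
```

For instance, a sentence mentioning only 'MS VC++' and 'JDBC' yields the skills 'Visual C++' and 'Java', although neither skill name appears in the sentence.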
The thesaurus 221 may also comprise other information such as the attributes of a
skill 404 or attributes of a term 402. Attributes provide additional information that helps to
define the meaning of a skill 404 and explain how it may be used in a document. In other
words, the different senses of a particular word or word groups are captured using the
attributes.
In addition to the relationship between a skill 404 and a set of terms 402, the thesaurus
221 also comprises relationships among skills 404. Preferably, these relationships are non-
subsumption relationships. As used herein, the term "non-subsumption" refers to
relationships that include related skills, co-occurring skills and/or associated expressions. In
other words, non-subsumption refers to relationships that are not based on subsumption. For example, C++ and Java are related, but neither subsumes the other. All these relationships
among skills 404 indicate that the skills 404 linked together are not exactly similar but are
associated with each other in different ways. One skilled in the art will realize that the terms
and skills of the thesaurus 221 are not limited to the examples given herein but may contain any number of terms and skills which have been predefined and stored in the thesaurus 221
prior to the processing of the electronic document 104. Thus, the thesaurus advantageously
allows the present invention to link together terms and skills used in specific industries,
disciplines, and technologies for which the thesaurus is being used, and preserves the
meanings and hierarchical connections between those terms and skills. Additionally, the
thesaurus facilitates the access to concept relationships and to term and skill attributes
irrespective of the term used as a point of entry.
Referring now to Figure 5, a block diagram of a preferred embodiment of a semantic
network 220 is shown. The semantic network 220 provides a way of arranging all the skills
404 at the lowest level and then builds a taxonomy or network of higher level knowledge-
concepts and categories. The semantic network 220 comprises skills 404 at the lowest level,
"knowledge" or knowledge-concepts 502 at a second level, and categories 504 at the highest
level. The semantic network 220 together with the thesaurus 221 provides a four level
hierarchy of terms 402, skills 404, knowledge-concepts 502 and categories 504.
A category 504 is the highest level in the semantic network 220. Broad categories
504, which fully subsume other knowledge-concepts 502 and skills 404, may be created
according to a specific industry. The semantic network 220 categorizes all knowledge-concepts
502 into categories 504. Knowledge-concepts 502 comprise the next level in the semantic
network 220 hierarchy. Each knowledge-concept 502 is a collection of skills 404 that add to
the body of knowledge. The semantic network 220 categorizes all skills 404 into knowledge-
concepts 502. As described earlier with reference to Figure 4, skills 404 are generic and
language independent from all related terms 402. The semantic network 220 categorizes all
terms 402 into skills 404. As described earlier with reference to Figure 4, terms 402 comprise language dependent strings that are found in the electronic document 104. Terms 402
comprise the lowest level in the semantic network 220 hierarchy.
The entire semantic network 220, separate from the thesaurus 221, comprises
language independent knowledge that is arranged as a taxonomy. Preferably, the
relationships between skills 404 and knowledge-concepts 502 as well as the relationships
between knowledge-concepts 502 and categories 504 are many to many. In other words, a
single knowledge-concept 502 can comprise several skills 404 and a single skill 404 can be
linked to several knowledge-concepts 502. Similarly, several knowledge-concepts 502 may
comprise a category 504 and several categories 504 may have links to a single
knowledge-concept 502.
To illustrate the terms 402, skills 404, knowledge-concepts 502, and categories 504 of
a semantic network 220, the two concepts discussed earlier with reference to Figure 4,
namely 'Visual C++' and 'Java', will be used. Both these skills 404 may be grouped under a
knowledge-concept 502 'Object Oriented programming languages'. Additionally, the skill
404 'Visual C++' may also belong to the knowledge-concept 502 'Visual Programming
Environment'. The knowledge-concept 502 'Visual Programming Environment' may also
be linked to other skills 404 such as 'Visual Basic'.
The semantic network 220 uses subsumption as the basis for the hierarchical
organization of skills 404, knowledge-concepts 502, and categories 504. In other words, the
relationship between skills 404 and knowledge-concepts 502 and knowledge-concepts 502
and categories 504 in the semantic network 220 are based on conceptual subsumption, where
a more general object 'subsumes' a more specific object. The concept of subsumption is more general than the concept of synonymy. An object is subsumed by another object if the subsuming object is much more general than any other subsumed objects and effectively
summarizes the subsumed objects. Truly synonymous objects mutually subsume each other.
If only synonymous based relationships are allowed, then the granularity between the objects
cannot be captured effectively as there are not many truly synonymous objects. The
difference between the shades of meaning will not allow correct retrieval in a synonym-based
network. The subsumption-based network removes these drawbacks and aids in retrieving
related concepts more accurately, since a subsumption is more general compared to a
synonym. For example, the object 'JDBC' is subsumed by a more general object called 'Java
Programming Language' (a knowledge-concept 502), which is further subsumed by an even
more generic object 'Software Engineering' (a category 504).
An object may also be subsumed by more than one higher level object. For example,
the skill 404 'JDBC' may be subsumed by at least two knowledge-concepts 502 such as 'Java
Programming Language' and 'Database Connectivity Library'. Each of these knowledge-
concepts 502 may in turn be subsumed by several categories 504. Hence, the conceptual
subsumption also allows many-to-many relationships between skills 404 and knowledge-
concepts 502 and between knowledge-concepts 502 and categories 504.
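The many-to-many subsumption relationships in the 'JDBC' example may be sketched as a small graph. The PARENTS mapping and the function subsumers are hypothetical, and placing 'Database Connectivity Library' under the category 'Software Engineering' is an assumption made for this sketch; the text above states that link only for 'Java Programming Language'.

```python
# Hypothetical sketch: each object maps to the set of objects that directly subsume it.
PARENTS = {
    "JDBC": {"Java Programming Language", "Database Connectivity Library"},
    "Java Programming Language": {"Software Engineering"},
    "Database Connectivity Library": {"Software Engineering"},  # assumed link
}


def subsumers(obj):
    """Return every object that directly or transitively subsumes obj."""
    found = set()
    stack = [obj]
    while stack:
        for parent in PARENTS.get(stack.pop(), ()):
            if parent not in found:
                found.add(parent)
                stack.append(parent)
    return found
```

Because a set of parents is stored for each object, a skill may sit under several knowledge-concepts and a knowledge-concept under several categories, as described above.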
Referring now to Figure 6, a flowchart of the steps of a preferred embodiment of a
method performed by the content analysis and semantic network engine 216 is shown. First,
identification heuristics are performed (602) on the electronic document 104 to identify the
beginning and end of the known sections of interest. The sections of interest are configured by the user when the extraction server 108 is first installed. The sections are then analyzed
(604) and information is extracted from the sections. The extracted information is stored
(606) in a predefined structure in the target database 110. Using the semantic network 220, words and word groups are analyzed (608) and the relationships between the different words
and word groups are determined and stored in the target database 110. Thus, the present
invention advantageously extracts meaningful information from electronic documents, and
stores it in a predefined structure in a target database. The extracted information stored in
the target database can then be retrieved and manipulated by computer program applications
accessing the database. Moreover, the present invention provides a powerful semantic
network and thesaurus for defining terms, concepts, meta-concepts, and categories and the
relationship between and among such terms, concepts, meta-concepts, and categories. Thus,
the semantic network can store information relating to any field, industry or technology, and
allows the extraction server 108 to process various types of documents pertaining to such
fields, industries or technologies.
The section processors 218 extract information from sections of interest in an
electronic document 104. The particular sections of interest from which information is
extracted are determined by the document type. The content analysis and semantic network
engine 216 comprises a section processor 218 for extracting words or word groups from each
section of interest in an electronic document.
The section processors 218 are configured to operate on a specific document type, and the set
for a given document type may contain one or several section processors 218. For example,
resumes typically contain several
sections such as a cover letter, contact information, an objective section, an experience
section, an education section, a patents section, a publications section, an awards and honors
received section, and a courses attended section. In a preferred embodiment, section
processors 218 for a resume document type may comprise a cover letter section processor for
extracting information from a cover letter, a contact information section processor for extracting contact information for a candidate, a skills and experience section processor for
extracting the skills and experience of a candidate, an education section processor for
extracting educational information from a candidate, an awards and honors section processor
for extracting any awards and honors received by a candidate, a patents section processor for
extracting information about patents obtained by a candidate, and a publications section
processor for extracting any articles or documents published by a candidate. Each section
processor 218 analyzes a particular section in the electronic document 104 and extracts
specific words and word groups from that section into a specific record in the target database
110. Additionally, as described in more detail in commonly assigned U. S. Patent
Application Serial No. 09/380,219 entitled "Extraction Server" by Prabhat K. Andleigh,
Nagaraju Pappu, and Vasudeva Kalidindi, each section processor 218 applies a set of
heuristics to the particular section of interest in order to analyze and extract the desired
information.
Referring now to Figure 7, there is shown a preferred embodiment of the present
invention comprising a skills and knowledge information extractor 702. The skills and
knowledge information extractor 702 allows the system to automatically extract from a
document, such as a resume, the skills of a candidate, the candidate's knowledge in a
particular area, and to determine the proficiency level of the candidate in any given skill.
Thus, the skills and knowledge information extractor 702 allows a user to automatically
determine a "career profile" of a candidate from his or her resume. As used herein, a "career profile" refers to any qualitative and quantitative information about a candidate's work
history, experience, and proficiency. For example, such information includes, but is not
limited to, how long a candidate worked in a particular profession, when, where, and at what
depth the candidate gained experience in a particular skill, what the candidate's overall
knowledge level in a particular area is, how much management experience a candidate has, etc.
As used herein, "terms" refers to the actual word or words which are found in a
resume, "skill" or "skill information" refers to the skills 404 in the thesaurus 221 and
semantic network 220 which relate to those terms, and "knowledge" or "knowledge
information" refers to the knowledge-concepts 502 relating to the skills. For example, in a
resume, a candidate may have used the terms "Microsoft Visual C++" or "MS VC++". The
present invention would identify these terms as belonging to the skill "C++", which in turn is
related to the knowledge "object oriented programming" which in turn may be related to the
category "Software." Thus, although the only terms actually used in and extracted from the
resume were "Microsoft Visual C++" and "MS VC++", the present invention is able to
determine that the candidate has "skill" in C++ and has "knowledge" of object oriented
programming even though the words C++ and object oriented programming were never used
in the document.
The skill and knowledge information extractor 702 uses a non-monotonic reasoning
principle to determine a candidate's skill level. As described above, non-monotonic
reasoning refers to the use of default assumptions which are made about the state of unknown
factors. These default assumptions may be changed as new information or evidence becomes
available. Additionally, default assumptions may be changed due to the absence of certain
information or evidence. The operation of the non-monotonic reasoning approach used by
the skill and knowledge information extractor 702 is best illustrated using an example.
During operation, the present invention finds a skill, X, in a candidate's resume, R. In
the absence of any other knowledge, the skill and knowledge information extractor 702 assumes that the skill level of the candidate for the skill X is average. As the skill and
knowledge information extractor 702 obtains additional information from the resume R about
skill X, the assumption of the skill level for skill X is refined. Additional knowledge that
may be used to refine the skill level includes, but is not limited to, the section in which the
skill X is found. For example, if the skill X is found in the Objective Section of a resume R,
a positive numerical value, or objective weightage factor W(O), will be added to the skill
level. Additionally, a positive weight for each project i in which the skill X is used,
represented here by W(P_i), may be added to the skill level. Preferably, this weightage value is
computed for all projects in resume R. The number of associated skills that are also used,
W(K), may also be added to the skill level. As used herein, associated skills are the skills
related to the main skill; knowing a main skill implies that a person also knows all associated
skills. For example, if one is an expert in the skill "database programming" or "database
administration," this person must be knowledgeable in the associated skill "SQL."
Associated skills can be determined using the semantic network 220 and the thesaurus 221.
For a given skill x, all its associated skills (x_1 ... x_n) are linked with x through the semantic
network 220 and thesaurus 221. For example, a thesaurus 221 entry for the skill "database
administration" would contain links to the skills "database server administration," "database
user management," and/or "SQL." Also, the number of years of experience for the skill X,
W(Y), may also be added to the skill level. Moreover, the number of years since the skill X
was used may represent a negative factor, W(LU), which is subtracted from the skill level.
Thus, in a preferred embodiment, a summation of the weights described above gives a
specific skill level for the skill X. A mathematical representation for determining the skill
level of a particular skill is as follows:
SkillLevel(X') = SkillLevel(X) + W(O) + W(P_i) + W(K) + W(Y) - W(LU)
The weightage functions are computed using the total number of skill levels that are
defined, and the distance from the current skill level to the next skill level. One skilled in the
art will realize that the weightage factors used to adjust the skill level are not limited to those
listed in the above example but can comprise any number of factors to be determined by the
system creator.
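The summation above may be sketched as a single function. The function and parameter names are hypothetical, and the numeric weights in the usage note are invented for illustration. Summing one weight per project is an interpretation of the statement that the project weight is computed for all projects in the resume. How each weight is derived from the number of defined skill levels is not specified in enough detail to reproduce and is not attempted here.

```python
# Hypothetical sketch of the skill-level adjustment:
# new level = current level + W(O) + W(P_i) over projects + W(K) + W(Y) - W(LU)
def refine_skill_level(base, w_objective=0.0, w_projects=(),
                       w_associated=0.0, w_years=0.0, w_last_used=0.0):
    """Return the refined skill level from a default level and weightage factors."""
    return (base
            + w_objective        # W(O): skill appears in the Objective section
            + sum(w_projects)    # W(P_i): one weight per project using the skill
            + w_associated       # W(K): associated skills that are also used
            + w_years            # W(Y): years of experience in the skill
            - w_last_used)       # W(LU): years since the skill was last used
```

With an assumed default level of 5.0, an objective weight of 0.5, two project weights of 0.25 each, an associated-skill weight of 0.5, a years weight of 1.0 and a last-used weight of 0.5, the function returns 7.0.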
The computation of the skill level of a particular skill for a candidate can also be
demonstrated using an example. Initially, the skill and knowledge information extractor 702
assumes that a person has an average skill level for a particular skill such as C++. If the
candidate's resume states that the candidate took a course in C++, that fact would add a
positive weightage factor to the skill level, thus adjusting the average skill level to a higher
value. If the candidate's resume also states that the candidate has two years of work
experience in C++, that fact would add another positive weightage factor to the skill level and
adjust the average skill level to another higher value. The values by which the average skill
level is adjusted for the C++ course and the two years of work experience are not necessarily
the same but may reflect the value attributed by the system creator. Each mention of C++ in
the resume would thus be used to adjust the skill level either up or down. Additionally, the
use of terms in the resume which are related in the semantic network 220 and thesaurus 221
to the concept or skill C++ could also be used to adjust the skill level of the candidate. After
all the relevant terms in the candidate's resume have been extracted and evaluated, the skill
and knowledge information extractor 702 determines a single value for the skill level for the
candidate for the particular skill.
After a final skill value for a particular skill has been determined, the skill and
knowledge information extractor 702 then maps the skill value to a scale for qualitatively
illustrating the proficiency of the candidate in that particular skill. For example, if a final
skill value for a particular candidate has been determined to be the number 6.8, that number
may map to a rating of "good" on a scale of 1 to 10, with 1 being poor and 10 being excellent.
Thus, the present invention allows a user to determine the proficiency of a candidate's skill
level for a particular skill and to ascribe a qualitative value to that proficiency level. One
skilled in the art will realize that the qualitative scales used to describe a particular skill value
may be any type of scale with a range of numerical values and/or adjective descriptors. For
example, a qualitative scale may map the final skill value to a scale comprising numbers
such as 1 to 5 or 1 to 10. A scale may map the final skill value to a scale comprising numbers
and adjectives such as 1 (poor) to 10 (excellent). The qualitative scale may be determined by
the system creator.
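By way of illustration only, one possible mapping from a final skill value to a qualitative scale is sketched below. The band boundaries and the adjectives "fair" and the placement of "good" are assumptions chosen so that the value 6.8 maps to "good" as in the example above; the description leaves the actual scale to the system creator.

```python
import bisect

# Assumption: band boundaries on a 1-to-10 scale. A value below 3.0 is "poor",
# 3.0 up to (not including) 5.0 is "fair", 5.0 up to 8.0 is "good",
# and 8.0 or above is "excellent".
BOUNDARIES = [3.0, 5.0, 8.0]
LABELS = ["poor", "fair", "good", "excellent"]


def to_qualitative(skill_value):
    """Map a numeric skill value in [1, 10] to an adjective descriptor."""
    if not 1.0 <= skill_value <= 10.0:
        raise ValueError("skill value must lie between 1 and 10")
    # bisect_right counts how many boundaries are <= skill_value,
    # which is exactly the index of the band the value falls into.
    return LABELS[bisect.bisect_right(BOUNDARIES, skill_value)]


rating = to_qualitative(6.8)
```

A table of boundaries keeps the scale configurable, so that a 1 to 5 scale or a different set of adjectives can be substituted without changing the lookup.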
The categories, knowledge, skills and terms are preferably set up in a relational database prior to the extraction process. As described above with reference to Figures 4 and 5,
in a preferred embodiment, the relationship between categories and knowledge is
many-to-many, the relationship between knowledge and skills is many-to-many, and the
relationship between skills and terms is one-to-many.
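By way of illustration only, the relationships described above may be laid out in a relational database as sketched below, here using SQLite through Python. All table and column names in this sketch are hypothetical and are not taken from the tables of the preferred embodiment; the sample rows are likewise invented for the example.

```python
import sqlite3

# Hypothetical schema: junction tables carry the two many-to-many
# relationships, and a foreign key on term carries the one-to-many one.
SCHEMA = """
CREATE TABLE category  (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE knowledge (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);
CREATE TABLE skill     (id INTEGER PRIMARY KEY, name TEXT UNIQUE NOT NULL);

-- one-to-many: each term belongs to exactly one skill
CREATE TABLE term (
    id       INTEGER PRIMARY KEY,
    label    TEXT NOT NULL,
    skill_id INTEGER NOT NULL REFERENCES skill(id)
);

-- many-to-many: categories <-> knowledge
CREATE TABLE category_knowledge (
    category_id  INTEGER NOT NULL REFERENCES category(id),
    knowledge_id INTEGER NOT NULL REFERENCES knowledge(id),
    PRIMARY KEY (category_id, knowledge_id)
);

-- many-to-many: knowledge <-> skills
CREATE TABLE knowledge_skill (
    knowledge_id INTEGER NOT NULL REFERENCES knowledge(id),
    skill_id     INTEGER NOT NULL REFERENCES skill(id),
    PRIMARY KEY (knowledge_id, skill_id)
);
"""


def build_demo_db():
    """Create an in-memory database holding a few invented sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO category (id, name) VALUES (1, 'Software development')")
    conn.executemany(
        "INSERT INTO knowledge (id, name) VALUES (?, ?)",
        [(1, "Object-oriented programming"), (2, "Systems programming")],
    )
    conn.executemany(
        "INSERT INTO skill (id, name) VALUES (?, ?)", [(1, "C++"), (2, "Java")]
    )
    conn.executemany(
        "INSERT INTO term (label, skill_id) VALUES (?, ?)",
        [("C++", 1), ("STL", 1), ("Java", 2)],
    )
    conn.executemany(
        "INSERT INTO category_knowledge VALUES (?, ?)", [(1, 1), (1, 2)]
    )
    conn.executemany(
        "INSERT INTO knowledge_skill VALUES (?, ?)", [(1, 1), (1, 2), (2, 1)]
    )
    return conn


def skills_for_knowledge(conn, knowledge_name):
    """Return the names of all skills linked to one knowledge item."""
    rows = conn.execute(
        """
        SELECT s.name
        FROM skill AS s
        JOIN knowledge_skill AS ks ON ks.skill_id = s.id
        JOIN knowledge AS k ON k.id = ks.knowledge_id
        WHERE k.name = ?
        ORDER BY s.name
        """,
        (knowledge_name,),
    )
    return [row[0] for row in rows]
```

In the sample data the skill C++ is linked to two knowledge items, which is what a many-to-many relationship permits and a single foreign key would not.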
Referring now to Figure 8, there is shown a flow chart of a preferred embodiment of a
method for the present invention. In a preferred embodiment, a resume is evaluated (802) for
a particular skill. The skill level for that particular skill is then determined (804) using the above described techniques. After a final skill level value is determined, the skill level is
mapped (806) to a qualitative scale. Finally, the skill and the qualitative scale value of the skill level are stored (808) in the target database. More specifically, the categories, knowledge,
skills and terms (i.e. the semantic network) are loaded into main memory. The electronic
document text is then passed to the skill and knowledge information extractor 702. In a
preferred embodiment, knowledge, skills, skill levels and number of years are extracted from
the electronic document in the following manner: first, all the terms in the database are
checked against the document, then an initial scan of the document collects all the terms. The
frequency of appearance of the term is recorded. Afterwards, the weightage factors for the
skill level calculation are applied. A second scan of the electronic document analyzes the
document, and a running list is maintained for all terms in order to calculate the experience
duration associated with each term. On completion of the second scan, all the terms are rolled up
into skills according to the semantic network and thesaurus, all the skills are rolled up into
knowledge according to the semantic network, and all the knowledge items are rolled up into
categories. Additionally, categories specifically mentioned are added. Thus, based on this
information, the skill levels and years of experience are computed as described above.
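By way of illustration only, the term collection and roll-up steps described above may be sketched as follows. The miniature semantic network, the tokenization rule, and the sample text are assumptions made for the example; the second scan that computes experience duration and the application of weightage factors are omitted from this sketch.

```python
import re
from collections import Counter

# Assumption: a tiny hand-made semantic network loaded into main memory.
TERM_TO_SKILL = {"c++": "C++", "stl": "C++", "java": "Java", "jdbc": "Java"}
SKILL_TO_KNOWLEDGE = {
    "C++": ["Object-oriented programming"],
    "Java": ["Object-oriented programming"],
}
KNOWLEDGE_TO_CATEGORY = {"Object-oriented programming": ["Software development"]}


def count_terms(text):
    """Initial scan: record how often each known term appears in the document."""
    # Assumption: simple tokenization that keeps '+' and '#' inside tokens.
    tokens = re.findall(r"[a-z0-9+#]+", text.lower())
    return Counter(token for token in tokens if token in TERM_TO_SKILL)


def roll_up(term_counts):
    """Roll terms up into skills, skills into knowledge, knowledge into categories."""
    skills = Counter()
    for term, n in term_counts.items():
        skills[TERM_TO_SKILL[term]] += n

    knowledge = Counter()
    for skill, n in skills.items():
        for item in SKILL_TO_KNOWLEDGE.get(skill, []):
            knowledge[item] += n

    categories = Counter()
    for item, n in knowledge.items():
        for category in KNOWLEDGE_TO_CATEGORY.get(item, []):
            categories[category] += n

    return skills, knowledge, categories


sample = "Developed C++ services using the STL. Maintained Java tools with JDBC; C++ for 2 years."
term_counts = count_terms(sample)
skills, knowledge, categories = roll_up(term_counts)
```

Because each term maps to a single skill while skills and knowledge items may map to several parents, the roll-up uses a direct lookup at the first level and lists at the higher levels.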
Referring now to Figure 9, there is shown a screen shot of a user interface of a
preferred embodiment of a target database for a skill and knowledge information extractor
702. Window 902 displays the particular skills analyzed from a candidate's resume, the
qualitative level determined by the skill and knowledge information extractor 702, and the
years of experience the candidate has for the particular skill. For example, the highlighted
portion of window 902 indicates that the candidate has some skill as an analyst, that the
qualitative proficiency of the candidate's skill as an analyst is "excellent", and that the candidate has 4 years of experience as an analyst. Thus, the present invention advantageously
allows a user to extract, determine, and display from a candidate's resume the proficiency of a
particular skill of the candidate.
The present invention is designed as a set of Object Oriented Libraries and contains
the following major Object Libraries:
Figure imgf000027_0001
In a preferred embodiment, the present invention may be implemented to run on a
Windows NT Server and any relational database such as Oracle Database. Database tables
may be used to define how information is represented in a relational or object-oriented
database. In an object-oriented implementation, any relational table is preferably represented
as an object class. The following section describes a preferred embodiment of the content
and type of the fields that are extracted into a relational database, and also the definitions of
the categories, knowledge, skills and terms. The supporting tables are also explained. One
skilled in the art will realize that these tables are not limited to the specific information
illustrated therein but may be created as needed, depending on the document type being
processed.
Table 1
AutoEntryDocuments
Table 1 holds the documents that are to be extracted. It holds the following information:
Figure imgf000028_0001
Table 2
AutoEntrySchedule
Table 2 holds information about the scheduled extraction tasks.
Figure imgf000028_0002
Table 3 Candidate
Table 3 holds personal information such as the name of the person, contact address, current employer, and resume summary. The XtractionXpert automatically extracts the following information from the resume:
Figure imgf000029_0001
Figure imgf000030_0001
Table 4 ExperienceDetail
Figure imgf000030_0002
Figure imgf000031_0001
Table 9
KnowledgeRecord
Figure imgf000032_0001
Table 13 Courses
Figure imgf000033_0001
Table 1
MiscellenousInformation
Figure imgf000033_0002
Table 16 Category
Table 16 provides information regarding the relationships between categories and knowledge information.
Figure imgf000033_0003
Table 17 MetaConcept
Table 17 provides knowledge information for semantic network 220.
Figure imgf000033_0004
Table 18 Concept
Table 18 provides information relating to skills.
Figure imgf000034_0001
Table 19 ConceptRelation
Table 19 provides information on relationships between skills and knowledge.
Figure imgf000034_0002
Table 20 Term
Table 20 provides information on terms.
Figure imgf000034_0003
Table 21 Language
Table 21 stores information about different languages to which the terms belong.
Figure imgf000034_0004
Table 23 CaWordList
Figure imgf000034_0005
Figure imgf000035_0001
Table 24 CaWordPosition
Figure imgf000035_0002
From the above description, it will be apparent that the invention disclosed herein
provides a novel and advantageous system and method for extracting and analyzing skill and
knowledge information from an electronic document. The foregoing discussion discloses and
describes merely exemplary methods and embodiments of the present invention. As will be
understood by those familiar with the art, the invention may be embodied in other specific
forms without departing from the spirit of the invention or essential characteristics thereof.
Accordingly, the disclosure of the present invention is intended to be illustrative, but not
limiting, of the scope of the invention, which is set forth in the following claims.

Claims

1. An apparatus for extracting skill and knowledge information from an
electronic document and for storing skill and knowledge information into a target database,
the apparatus comprising:
a content analysis and semantic network engine for analyzing and extracting skill and
knowledge information from the electronic document; and
a skill and knowledge information extractor, coupled to the content analysis and
semantic network engine, for determining a skill level for the skill information extracted from
the electronic document and for storing the skill level in the target database.
2. The apparatus of claim 1 wherein the skill and knowledge information
extractor also maps the skill level for the skill to a qualitative scale.
3. The apparatus of claim 1 wherein the content analysis and semantic network
engine further comprises:
a thesaurus for linking together terms and skills; and
a semantic network, coupled to the thesaurus, for organizing terms and skills of the
thesaurus, knowledge, and categories, and for defining relationships between and among the
terms, skills, knowledge, and categories.
4. The apparatus of claim 1 wherein the skill and knowledge information
extractor determines a skill level, at least in part, by using the mathematical equation:
SkillLevel(X) = SkillLevel(X) + W(O) + W(Pf) + W(K) + W(Y) - W(LU)
5. The apparatus of claim 1 wherein the skill and knowledge information extractor determines a skill level using a non-monotonic and default reasoning approach.
6. The apparatus of claim 2 wherein a skill extracted from the electronic
document and the skill mapping to a qualitative scale are displayed on a computer.
7. An apparatus for analyzing and extracting skill and knowledge information
from an electronic document into a target database having predefined fields, the apparatus
comprising:
a thesaurus for linking together terms and skills and for defining relationships between
and among the terms and skills; and
a semantic network coupled to the thesaurus for organizing terms and skills in the
thesaurus, knowledge, and categories in a hierarchical structure;
wherein the thesaurus and semantic network are used to analyze skill and knowledge
information in the electronic document.
8. The apparatus of claim 7 further comprising:
a document pre-processor coupled to the semantic network for classifying the
electronic document as a document type and for performing an initial analysis on the
electronic document.
9. The apparatus of claim 7 further comprising:
a heuristics engine coupled to the semantic network for applying a set of heuristics to
the electronic document.
10. The apparatus of claim 7 further comprising:
a skill and knowledge information extractor for extracting skill and knowledge information from the electronic document and for determining a skill level for skill
information extracted from the electronic document.
11. The apparatus of claim 10 further comprising:
a target database coupled to the semantic network for storing skill and skill level
information in predefined fields in the target database.
12. A method for determining a skill level for skill information extracted from an
electronic document, the method comprising the steps of:
identifying skill and knowledge information in the electronic document;
extracting the skill and knowledge information from the electronic document; and
determining a skill level for skill information extracted from the electronic document.
13. The method of claim 12 wherein the step of determining a skill level is
performed by a skill and knowledge information extractor.
14. The method of claim 12 wherein the step of identifying skill and knowledge
information is performed using a semantic network.
15. A method for processing skill and knowledge information from an electronic
document, the method comprising the steps of:
identifying skill and knowledge information in the electronic document;
extracting the skill and knowledge information from the electronic document;
determining a skill level for skill information extracted from the electronic document;
and mapping the skill level to a qualitative scale.
16. A computer implemented method for extracting and displaying skill and
knowledge information from an electronic document, the method comprising the steps of:
identifying skill and knowledge information in the electronic document;
extracting the skill and knowledge information from the electronic document;
determining a skill level for skill information extracted from the electronic document;
and
mapping the skill level to a qualitative scale.
17. A computer-readable medium for extracting and displaying skill and
knowledge information from an electronic document, the computer-readable medium
comprising code for performing the steps of:
identifying skill and knowledge information in the electronic document;
extracting the skill and knowledge information from the electronic document;
determining a skill level for skill information extracted from the electronic document;
and
mapping the skill level to a qualitative scale.
PCT/US1999/026083 1998-11-04 1999-11-03 Advanced model for automatic extraction of skill and knowledge information from an electronic document WO2000026839A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0113250A GB2359168A (en) 1998-11-04 2001-05-31 Advanced model for automatic extraction of skill and knowledge information from an electronic document

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US10706398P 1998-11-04 1998-11-04
US60/107,063 1998-11-04
USPCT/US98/27664 1998-12-28
PCT/US1998/027664 WO1999034307A1 (en) 1997-12-29 1998-12-28 Extraction server for unstructured documents
US38021999A 1999-08-27 1999-08-27
US09/380,219 1999-08-27

Publications (3)

Publication Number Publication Date
WO2000026839A1 WO2000026839A1 (en) 2000-05-11
WO2000026839A8 WO2000026839A8 (en) 2000-10-12
WO2000026839A9 true WO2000026839A9 (en) 2001-08-02

Family

ID=26804347

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US1999/026083 WO2000026839A1 (en) 1998-11-04 1999-11-03 Advanced model for automatic extraction of skill and knowledge information from an electronic document

Country Status (1)

Country Link
WO (1) WO2000026839A1 (en)


Also Published As

Publication number Publication date
WO2000026839A8 (en) 2000-10-12
WO2000026839A1 (en) 2000-05-11


Legal Events

Date Code Title Description
ENP Entry into the national phase in:

Ref country code: US

Ref document number: 1999 380219

Date of ref document: 19991112

Kind code of ref document: A

Format of ref document f/p: F

AK Designated states

Kind code of ref document: A1

Designated state(s): CA GB IN US

DFPE Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed before 20040101)
AK Designated states

Kind code of ref document: C1

Designated state(s): CA GB IN US

CFP Corrected version of a pamphlet front page
CR1 Correction of entry in section i
ENP Entry into the national phase in:

Ref country code: GB

Ref document number: 200113250

Kind code of ref document: A

Format of ref document f/p: F

WWE Wipo information: entry into national phase

Ref document number: 09831064

Country of ref document: US

AK Designated states

Kind code of ref document: C2

Designated state(s): CA GB IN US

COP Corrected version of pamphlet

Free format text: PAGES 1-36, DESCRIPTION, REPLACED BY NEW PAGES 1-33; PAGES 37-41, CLAIMS, REPLACED BY NEW PAGES 34-37; PAGES 1/8-8/8, DRAWINGS, REPLACED BY NEW PAGES 1/9-9/9; DUE TO LATE TRANSMITTAL BY THE RECEIVING OFFICE