WO2016045153A1 - Information visualization method and intelligent visual analysis system based on text resume information - Google Patents

Information visualization method and intelligent visual analysis system based on text resume information

Info

Publication number
WO2016045153A1
Authority
WO
WIPO (PCT)
Prior art keywords
history
growth
information
resume
experience
Prior art date
Application number
PCT/CN2014/088601
Other languages
English (en)
French (fr)
Inventor
王浩
张晨
徐帆江
王微
Original Assignee
中国科学院软件研究所
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中国科学院软件研究所
Priority to US14/898,897 (US20170200125A1)
Publication of WO2016045153A1

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/10Office automation; Time management
    • G06Q10/105Human resources
    • G06Q10/1053Employment or hiring
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/103Formatting, i.e. changing of presentation of documents
    • G06F40/117Tagging; Marking up; Designating a block; Setting of attributes
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/10Text processing
    • G06F40/12Use of codes for handling textual entities
    • G06F40/137Hierarchical processing, e.g. outlines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Definitions

  • The invention belongs to the technical field of computer applications, and relates to an intelligent visual analysis system and an information visualization method based on text resume information.
  • the resume information is a kind of information summarizing the personal experience. It exists in the resume data, and mainly includes the basic information of the individual and a brief description of the personal experience data.
  • Basic personal information includes name, gender, date of birth, ethnicity, education, political outlook, religious beliefs, major family members, major social relationships, marriage and personal health.
  • Personal experience as an important part of the resume usually includes the individual's past learning experience, job experience and so on.
  • Personal resume data is an important basis for personnel assessment. It reflects the individual's past behavior and current ability in many aspects.
  • the curriculum vitae analysis predicts future behavior based on past behaviors of the personnel reflected in the historical data, and is widely used in personnel selection and recruitment of various enterprises and institutions, cadre assessment and management of government departments, and research and evaluation of scientific and technological talent mobility.
  • By source, electronic resumes are mainly divided into: 1) public resumes available on the Internet; 2) non-public resumes held in enterprises and talent recruitment systems.
  • By format, electronic resumes can be divided into two types: structured resumes and unstructured resumes.
  • 1) Structured resumes. Usually in tabular form, they come from personnel recruitment systems or internal management systems. Their structure is standardized and fixed, which is convenient for unified management.
  • However, deep semantic analysis is difficult to perform on structured resumes.
  • 2) Unstructured resumes. Usually in text form, they come from a wide range of sources, such as major news sites or social media on the Internet. Their structure varies, which makes unified analysis and management inconvenient. However, because unstructured resumes use text as the carrier and often contain rich semantic information, they support semantic-based intelligent analysis, such as semantic search and classification.
  • CVAS: Curriculum Vitae Analysis System
  • CVAS mainly performs automated analysis and management of structured resume data. With its processing and analysis capabilities, it can quickly filter out non-compliant resumes, greatly improving the efficiency of resume analysis. It can also quantitatively analyze and evaluate resume data according to specific application requirements, making the results of resume analysis more reasonable and reliable. Therefore, in recent years, CVAS has been increasingly valued by the personnel management departments of enterprises and institutions, and is widely used for personnel selection and other human resource management activities.
  • However, existing CVAS still have the following shortcomings: (1) Current systems are not suited to analyzing unstructured resume data. Unstructured resumes are usually stored as text documents (e.g. txt, Word, PDF) whose formats are inconsistent and vary greatly, so they are difficult to feed directly into current CVAS. In other words, current CVAS lack the ability to convert unstructured resumes into structured resumes. (2) The analysis capability of current systems is mainly limited to qualitative analysis and quantitative calculation under simple rules (such as resume screening and scoring) and to statistical management (such as generating resume information reports). They neglect intelligent mining and intuitive visual analysis of the potential patterns contained in resumes, and therefore cannot help users complete complex tasks such as semantic-based resume search and classification, personnel appointment and recommendation, and career planning.
  • (3) Current systems analyze each resume in isolation, ignoring the correlations between resumes.
  • The potential associations between resumes can reflect potential social relationships between people, which arise where individual experiences intersect, such as classmates, colleagues, fellow villagers, comrades, collaborators, and competitors. Based on these relationships, a potential social network between people can be reconstructed, which supports the scientific management of resumes and deepens the user's understanding of the potential social associations and the organizational hierarchy among the people.
  • To overcome the deficiencies of existing methods and systems, the present invention provides an intelligent visual analysis system and an information visualization method based on text resume information. It makes full use of the potential pattern information in resume data and, based on natural language processing, data mining, machine learning, and information visualization technologies, builds a visual analysis environment that helps users understand the potential growth patterns in resumes and the potential associations between resumes, thereby supporting tasks such as semantic-based resume search and classification, personnel appointment and dismissal, career planning, and interpersonal relationship analysis.
  • the inventive technology is a general framework for discovering the potential growth patterns contained in the history data and the potential social relationships between people, and expressing these pattern features and social relationships in an intuitive visual way. It can be widely used in the field of intelligent mining and information visualization of staff resumes, cadre resumes, corporate executive resumes, and researcher resumes.
  • An intelligent visual analysis system based on text resume information, comprising: a text resume preprocessing module; a personal growth experience quantification module; a personal growth pattern mining module; a group potential social relationship mining module; an organization generation module; a resume information visualization module; and a resume visual analysis module. Wherein:
  • Text resume preprocessing module: preprocesses the unstructured text resume data, extracts the effective elements of the resume information (including basic personal information and experience information), and obtains structured resume element data in XML (Extensible Markup Language).
  • The module uses natural language processing technology to convert multi-source resume texts with non-uniform formats into resume element data with a unified structure, which provides the data foundation for the subsequent modules.
  • Personal growth experience quantification module: performs quantitative calculation of the experience level for the experience information in the resume elements, thereby obtaining the growth trajectory sequence data.
  • The module uses natural language processing technology to quantify the experience information in the resume elements into level information, which provides a basis for the mining and visualization performed by the subsequent modules.
  • Personal growth pattern mining module: uses machine learning and data mining technology to analyze the time dimension and the spatial dimension of the growth trajectory sequence data, and obtains the temporal and spatial growth patterns of the resume.
  • Group potential social relationship mining module: uses association algorithms from data mining to correlate the growth trajectory sequence data of multiple resumes, and obtains the potential social relationships between resumes (such as classmates, colleagues, fellow villagers, comrades, collaborators, and competitors).
  • Organization generation module: based on the potential social relationships of the group represented by the plurality of resumes, extracts and restores the hierarchical information of the organization from the unit intersection information of the group.
  • Resume information visualization module: based on an information visualization method for text resume information, the module converts the aforementioned growth trajectory sequence data and the mining results output by each mining module into intuitive, easy-to-understand visualizations. The generated visualizations help users quickly grasp the characteristics of the resume data and the knowledge it contains.
  • Resume visual analysis module: builds a visual analysis environment based on the information visualizations, and uses human-computer interaction technology to help the user understand the potential information and pattern features in the resumes from the time and space dimensions, thereby obtaining deep knowledge.
  • The algorithm is based on the idea of the growth metaphor, and transforms the abstract growth information in the resume into a visual representation of a space-time trajectory.
  • By visualizing the growth trajectory sequence data, the space-time trajectory visualization generated by the algorithm expresses the originally abstract personal growth information as a space-time diagram.
  • Resume potential social network visualization algorithm.
  • Based on the potential social relationships between resumes, the algorithm constructs a visual representation of the social network of the resumes.
  • The generated potential relationship map expresses the originally abstract potential relationships between resumes in the form of a network diagram.
  • Resume organization level visualization algorithm. Based on the potential social relationships between resumes, the algorithm constructs a visual representation of the organizational hierarchy of the unit to which the people belong.
  • The algorithm extracts the unit intersection information between resumes from the resume information, converts the resumes that intersect at a unit into the organizational hierarchy of that unit, and presents the hierarchy as an organization chart based on a table structure.
  • The present invention uses resume data in the form of unstructured text as the data source. Based on natural language processing technology, its structured element extraction mechanism meets the need for unified processing of multi-source heterogeneous resume data, which broadens the scope of application of the system and method.
  • The present invention focuses on intelligent mining of the potential pattern information contained in resume data and performs deep visual analysis on it, so that the growth trajectory patterns and growth category patterns in the resume data can be obtained.
  • This supports deep analysis tasks such as semantic-based resume search and classification, personnel assessment, and appointment and dismissal recommendations.
  • The present invention introduces the potential associations between resumes into the analysis process. Through mining and information visualization technology, the potential social relationships between the persons represented by the resumes can be obtained. Based on these relationships, a potential social network between people can be constructed, and based on that network the organizational hierarchy among the people can be restored. The pattern features reflected by a large number of resumes are thus presented to the user from a macroscopic perspective, giving a deep understanding of the social relationships of the group.
  • FIG. 1 is a block diagram of the constituent modules of the present invention.
  • FIG. 2 is the system architecture diagram of the present invention.
  • Fig. 3 shows example definitions of the growth trajectory categories in the time dimension of a resume, wherein: (a) is a growth-type trajectory, (b) is a steady-type trajectory, (c) is a wave-type trajectory, and (d) is a declining-type trajectory.
  • the solid line in each figure is the personal growth trajectory, and the dotted line is the average value of the growth trajectory of the overall sample.
  • Fig. 4 shows example definitions of the growth trajectory categories in the spatial dimension of a resume, wherein: (a) is a "local → central" trajectory, (b) is a "local → central → local" trajectory, (c) is a "central → local" trajectory, and (d) is a "central → local → local → central" trajectory.
  • Figure 5 is a schematic diagram showing the results of classification of personal growth trajectories.
  • FIG. 6 is a schematic diagram showing the results of group potential relationship mining, wherein: (a) the graph shows the growth trajectory similarity relationship diagram, and (b) the graph shows the experienced intersection relationship graph.
  • Figure 7 is a graph of personal growth, in which: (a) is a growth trajectory of the time dimension, and (b) is a growth trajectory of the spatial dimension.
  • Figure 8 is a potential relationship diagram.
  • Figure 9 is a diagram of the organization.
  • FIG. 10 is a schematic diagram of statistical analysis of information of a history track.
  • FIG. 11 is a schematic diagram of spatio-temporal association interaction analysis of a history track.
  • (a) is a time trajectory map
  • (b) is a spatial trajectory map
  • the experience segment shown by the broken line frame in (a) corresponds to the growth trajectory indicated by the dotted arrow in (b).
  • Fig. 12 is a schematic diagram showing the mode visual analysis of the history space-time trajectory.
  • the picture shows the “growth period”, “bottleneck period” and “breakthrough period” modes reflected in the personal growth process.
  • the “growth period” represents the rapid promotion of the early stage of life
  • the “bottleneck period” represents a bottleneck in the middle of the career, and the promotion is slower
  • the "breakthrough period" represents breaking through the bottleneck late in the career and continuing to advance.
  • FIG. 13 is a schematic diagram of interactive visual analysis of a resume social network.
  • (a) is a time trajectory map
  • (b) is a spatial trajectory map
  • (c) is a social network map.
  • the dashed box in (a) corresponds to the dashed box in (b) in the space-time dimension, and the specific information of the history of the intersection is shown in (c).
  • Personnel: the subjects represented by the resumes, such as employees of enterprises and institutions, government cadres, corporate executives, and scientific research personnel.
  • System user: usually a decision maker, such as a leader or other manager.
  • The invention is based on natural language processing, data mining, machine learning, and information visualization technologies. It builds a visual analysis environment that makes full use of the information in text resume data, extracts the potential knowledge in the resume information that matters for decision making, and displays that knowledge in intuitive visualizations based on the growth metaphor. This helps the user understand the potential pattern features of resumes and the potential associations between them, thereby supporting tasks such as fuzzy search and intelligent classification, automatic personnel appointment and dismissal, career planning, and interpersonal relationship analysis.
  • As shown in FIG. 1, the present invention includes: a text resume preprocessing module, a personal growth experience quantification module, a personal growth pattern mining module, a group potential social relationship mining module, an organization generation module, a resume information visualization module, and a resume visual analysis module.
  • The system architecture diagram of the present invention is shown in FIG. 2. Wherein:
  • The module preprocesses unstructured resume text data with natural language processing techniques such as format filtering, Chinese word segmentation, and named entity recognition, extracts the effective elements in the resume information, and obtains structured resume element data in XML (Extensible Markup Language).
  • the XML data format is designed according to the characteristics of the history data.
  • the XML data is hierarchical and its structure is as follows.
  • the XML data contains two parts of the resume elements: the resume basic information and the experience information table.
  • the basic information of the resume includes basic information such as name, gender, ethnicity, and place of birth;
  • the experience information table is a table structure, and the header includes fields such as start time, termination time, place, unit, position, etc.
  • Each record in the table represents one experience element of the person, that is, the person's experience (employment or study) within a certain period of time.
  • Unstructured resume text data mainly includes text history (html format) from the Internet, text history (txt, word, pdf, etc.) from the personnel system, and other personnel file history (stored in the database).
  • the Internet text resume is as follows. This data is usually obtained by the web crawler from the Internet. Because its format is complex and not uniform, the preprocessing for it is also the most complicated.
  • the module specifically includes the following steps:
  • the html analysis algorithm is used to remove noise such as advertisements and html formats from the original resume text, and a pure resume text including history information is obtained.
  • the pure resume text data is as follows. The data consists of two parts of text segments: a basic information segment and an experiential information segment. It should be noted that this step is only for Internet text history data.
  • the hierarchical structure organizes the history information according to the basic information segment (basic_info) and the experience information segment (office_record_array).
  • the basic information segment holds the basic information of the resume, and its structure is a fixed list form.
  • the experiential information segment is designed as a tree structure, and the tree nodes are different experience segments (office_record).
  • the tree structure has good scalability and can be easily and quickly expanded and queried. This structure can significantly improve the efficiency of feature matching calculation for large-scale history data.
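  • The hierarchy described above can be illustrated with a minimal sketch. The tag names basic_info, office_record_array, and office_record come from the text; the root tag, the field names, and the sample values are hypothetical, since the original XML listing is not reproduced here.

```python
import xml.etree.ElementTree as ET

def build_resume_xml(basic_info, office_records):
    """Build a resume-element tree: a flat basic_info list plus an office_record_array."""
    root = ET.Element("resume")  # hypothetical root tag
    basic = ET.SubElement(root, "basic_info")
    for field, value in basic_info.items():
        ET.SubElement(basic, field).text = value
    records = ET.SubElement(root, "office_record_array")
    for rec in office_records:
        # Each experience segment becomes one office_record node.
        node = ET.SubElement(records, "office_record")
        for field, value in rec.items():
            ET.SubElement(node, field).text = value
    return root

# Hypothetical sample data for illustration only.
root = build_resume_xml(
    {"name": "Zhang San", "gender": "male"},
    [{"start_time": "1989.1", "end_time": "1992.1", "place": "Beijing",
      "unit": "Example Institute", "position": "Engineer"}],
)
print(ET.tostring(root, encoding="unicode"))
```

  • Keeping each experience segment as its own child node means new segments can be appended, and existing ones queried by path, without touching the basic information.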
  • the history element extraction algorithm mentioned in step 2 is the core algorithm of the module, and the regular expression matching method is mainly used to extract each element.
  • the algorithm specifically includes the following steps:
  • The "time" and "place" elements are extracted by regular expression matching.
  • The "time" element is matched using "year" as the keyword of the regular expression.
  • The "place" element is matched using keywords such as "province", "city", "county", and "township".
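  • As a rough illustration of this keyword-driven matching, the sketch below assumes the keywords are the Chinese characters 年 (year), 省 (province), 市 (city), 县 (county), and 乡 (township). The exact patterns and the sample sentence are assumptions; the text names only the keywords.

```python
import re

# Assumed patterns: a four-digit year followed by the keyword 年, optionally a month.
TIME_RE = re.compile(r"\d{4}年(?:\d{1,2}月)?")
# Assumed pattern: at least two Chinese characters ending in a place keyword.
# Naive by design: in running text it may also capture characters preceding the place name.
PLACE_RE = re.compile(r"[\u4e00-\u9fa5]{2,}?(?:省|市|县|乡)")

def extract_time_and_place(segment):
    """Return (time strings, place strings) found in one experience segment."""
    return TIME_RE.findall(segment), PLACE_RE.findall(segment)

times, places = extract_time_and_place("1989年1月至1991年6月, 河北省石家庄市某局科员")
print(times, places)
```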
  • Each row of the unit keyword dictionary consists of two parts: a "keyword" and its "auxiliary keywords".
  • Auxiliary keywords are of two types, R-type and L-type; multiple auxiliary keywords are separated by commas.
  • The unit keyword dictionary is used for unit recognition as follows: when a "keyword" in the dictionary is found, recognition succeeds only if there is no R-type auxiliary keyword on its right side and no L-type auxiliary keyword on its left side; otherwise, recognition fails.
  • For example, the fourth row of Table 1 holds the keyword "部" (ministry or department); its R-type auxiliary keywords are "长" and "队", and its L-type auxiliary keyword is "干". The keyword is therefore not recognized as a unit inside "部长" (minister), "部队" (troops), or "干部" (cadre).
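  • The recognition rule can be sketched as follows. The dictionary row uses the Chinese characters that the translated example ("part", "long", "team", "dry") most plausibly stands for, and the check assumes that an auxiliary keyword must be directly adjacent to the keyword; both points are assumptions, not statements from the text.

```python
# Hypothetical dictionary row: keyword -> (R-type auxiliary keywords, L-type auxiliary keywords).
UNIT_KEYWORDS = {"部": (("长", "队"), ("干",))}

def is_unit_keyword(text, index, dictionary=UNIT_KEYWORDS):
    """Return True if a dictionary keyword starts at `index` and no auxiliary keyword blocks it."""
    for keyword, (r_aux, l_aux) in dictionary.items():
        if not text.startswith(keyword, index):
            continue
        right = text[index + len(keyword):]
        left = text[:index]
        # An R-type auxiliary keyword blocks recognition when it follows the keyword.
        blocked_right = any(right.startswith(aux) for aux in r_aux)
        # An L-type auxiliary keyword blocks recognition when it precedes the keyword.
        blocked_left = any(left.endswith(aux) for aux in l_aux)
        return not (blocked_right or blocked_left)
    return False

print(is_unit_keyword("教育部", 2), is_unit_keyword("部长", 0), is_unit_keyword("干部", 1))
```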
  • The module obtains the growth trajectory sequence data from the resume element XML data.
  • Each element of the sequence data is a six-tuple, that is, <start time, end time, place, unit, position, quantization level>, where the last field, "quantization level", represents the level of the experience segment.
  • the core algorithm of this module is the experiential level quantization recognition algorithm.
  • the algorithm specifically includes the following steps:
  • The experience information table of each text resume is sorted in ascending order by the "start time" field to obtain an ordered experience information table.
  • Repeat step 2 until the ordered experience information table has been fully scanned and processed.
  • The experience segments, each with its quantized level, are assembled into an ordered sequence to obtain the growth trajectory sequence data (see Table 2).
  • The quantification library is a dictionary whose elements are <unit, position, quantization level> triples.
  • The dictionary is the basis of the personal growth experience quantification module and is built through human-computer interaction:
  • The units and positions can be extracted from the resume corpus by the text resume preprocessing module, and the user can also add to and modify them.
  • The initial quantization value is first calculated by the computer according to a level quantization rule; the user can then handle special cases (see the special-case explanation below) based on his or her own knowledge and experience, which ensures the correctness of the adjusted quantized values.
  • The level quantization rule mentioned in step 2-2 depends on the specific application scenario:
  • For cadres, the quantization levels can be divided into: national level (quantified as 5), provincial level (quantified as 4), departmental level (quantified as 3), county level (quantified as 2), township level (quantified as 1), and so on, and each level can be further subdivided into principal and deputy ranks.
  • For scientific research personnel, the quantization levels can be divided into: academician (quantified as 5), full researcher (quantified as 4), associate researcher (quantified as 3), assistant researcher (quantified as 2), research intern (quantified as 1), and so on.
  • Special case: the computer can calculate the level from a position field of the form "Mayor of XX" (quantified as 3), which is correct in general. However, if the position is the mayor of a municipality directly under the central government, such as "Mayor of Beijing" or "Mayor of Shanghai", it should be quantified as provincial level (quantified as 4) because of the administrative particularity of such cities.
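  • A minimal sketch of how such a quantification library might be built: a rule table supplies the initial value and user-supplied overrides handle special cases such as municipality mayors. The levels follow the cadre example above; the English position titles, the unit names, and the substring matching are illustrative assumptions.

```python
# Rule table: position keyword -> level (levels from the cadre example; titles are assumed).
LEVEL_RULES = [("Governor", 4), ("Mayor", 3), ("County Head", 2), ("Township Head", 1)]

def initial_level(position):
    """Computer-generated initial value: first rule whose keyword occurs in the position."""
    for keyword, level in LEVEL_RULES:
        if keyword in position:
            return level
    return None  # unknown positions are left for the user to fill in

def build_quantification_library(entries, user_overrides=None):
    """Return {(unit, position): level}; user overrides win over rule-based values."""
    user_overrides = user_overrides or {}
    library = {}
    for unit, position in entries:
        key = (unit, position)
        library[key] = user_overrides.get(key, initial_level(position))
    return library

# Hypothetical entries; the override encodes the municipality special case.
entries = [("Shijiazhuang Municipal Government", "Mayor"),
           ("Beijing Municipal Government", "Mayor")]
overrides = {("Beijing Municipal Government", "Mayor"): 4}
library = build_quantification_library(entries, overrides)
print(library)
```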
  • The growth pattern classification algorithm in this module innovatively applies supervised machine learning classification algorithms (such as Naïve Bayes and SVM (Support Vector Machine)) to resume data.
  • An unknown resume can thus be classified automatically based on the growth patterns of known resumes, so the user can quickly identify the growth type to which the resume belongs and predict its future development trend from the growth pattern.
  • the algorithm specifically includes the following steps:
  • the definition of the four personal growth trajectory types is relative to the average value of the overall sample (see the dotted line in Fig. 3).
  • the personal growth rate (curve slope in Figure 3) can be obtained by measuring the time span experienced by each level in the personal growth trajectory.
  • The growth type has a growth rate significantly greater than the sample average over the entire time dimension; the steady type has a growth rate approximately equal to the sample average; the wave type has a growth rate greater than the sample average at some stages of the time dimension and smaller at others; the declining type has a growth rate significantly smaller than the sample average over the entire time dimension.
  • The "features" here are features in the machine learning and data mining sense; they describe the different types of growth trajectory sequence data.
  • A machine learning or data mining algorithm can learn the type of the data, or mine its patterns, only through these features.
  • 1) Time-dimension features. As the time-dimension types described in step 1 show, the growth rate of the growth trajectory sequence data can serve as its time-dimension feature. The growth rate can be quantified as two kinds of features:
  • Time span at each level, representing how long the individual stayed at each level.
  • The formal expression is: "<quantization level 1, time span 1>, <quantization level 2, time span 2>, ..., <quantization level n, time span n>".
  • n represents the sequence length of the growth trajectory sequence data (the number of elements in the sequence data).
  • The time span is obtained by subtracting the "start time" from the "end time" of each element in the sequence data.
  • For example, the time span features of the sequence data shown in Table 2 are: "<0, 3>, <1, 0>, <2, 3>, <3, 3>, <4, 4>, <5, 8>, <6, 4>, <7, 0>, <8, 0>".
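  • The time span feature can be sketched as below. The sketch assumes times are given as decimal years, that spans of segments at the same level are summed, and that every level from 0 to a chosen maximum is listed (with span 0 if never held); the sample trajectory is hypothetical and is not Table 2.

```python
from collections import defaultdict

def level_time_spans(sequence, max_level):
    """sequence: iterable of (start_year, end_year, level). Returns [(level, total_span), ...]."""
    totals = defaultdict(float)
    for start, end, level in sequence:
        totals[level] += end - start  # time span = end time minus start time
    # List every level up to max_level so that levels never held appear with span 0.
    return [(level, totals[level]) for level in range(max_level + 1)]

# Hypothetical trajectory for illustration only.
trajectory = [(1989.0, 1992.0, 0), (1992.0, 1995.0, 2), (1995.0, 1999.0, 3)]
spans = level_time_spans(trajectory, max_level=3)
print(spans)
```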
  • Timing growth slope, which represents the slope of the individual's growth trajectory in each time phase.
  • The formal expression is: "<time phase 1, slope 1>, <time phase 2, slope 2>, ..., <time phase m, slope m>".
  • For example, the sequence data shown in Table 2 can be divided into 10 time phases: "1989.1.1-1991.6.1", "1991.6.1-1994.1.1", ..., "2011.6.1-2014.1.1".
  • The slope of the growth trajectory in each time phase is the difference between the quantization level at the end of the phase and the quantization level at the beginning of the phase, so the timing growth slope features are: "<1, 0>, <2, 2>, <3, 1>, <4, 1>, <5, 0>, <6, 1>, <7, 0>, <8, 0>, <9, 1>, <10, 0>".
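  • A minimal sketch of the timing growth slope, assuming equal-length phases over decimal years and taking the level at a given time to be that of the most recently started segment. The helper names and the sample trajectory are hypothetical.

```python
def level_at(sequence, t):
    """Level held at time t: the last segment (sorted by start time) that has started by t."""
    current = sequence[0][2]
    for start, _end, level in sequence:
        if start <= t:
            current = level
    return current

def timing_growth_slopes(sequence, t0, t1, phases):
    """Return [(phase number, level at phase end minus level at phase start), ...]."""
    step = (t1 - t0) / phases
    result = []
    for k in range(phases):
        begin, end = t0 + k * step, t0 + (k + 1) * step
        result.append((k + 1, level_at(sequence, end) - level_at(sequence, begin)))
    return result

# Hypothetical trajectory of (start_year, end_year, level), sorted by start time.
trajectory = [(1989.0, 1992.0, 0), (1992.0, 1995.0, 2), (1995.0, 1999.0, 3)]
slopes = timing_growth_slopes(trajectory, 1989.0, 1999.0, phases=4)
print(slopes)
```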
  • time dimension features may be used alone or in combination in the machine learning process.
  • 2) Spatial-dimension features (also called the "spatial sequence"). As the spatial-dimension types described in step 1 show, the geographic location of the individual's unit can serve as the spatial-dimension feature of the growth trajectory sequence data.
  • The feature is formalized as: "<place type 1, place type 2, ..., place type k>".
  • The "place type" takes attribute values such as "central" and "local" described in step 1, and k represents the number of place types of the "place" field in the growth trajectory sequence data.
  • For example, the spatial-dimension feature of the sequence data shown in Table 2 is: "<local, central>".
  • In sequence pattern mining, these spatial-dimension features are the "sequences", and the spatial-dimension growth types described in step 1 are the "sequence patterns" found from a number of "sequences".
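  • The spatial sequence can be sketched as follows, assuming each place is classified as "central" or "local" by membership in a user-supplied set and that consecutive repeats are collapsed. The classification rule, the collapsing step, and the sample place names are assumptions made for illustration.

```python
from itertools import groupby

def spatial_sequence(places, central_places):
    """Map each place to 'central' or 'local' and collapse consecutive duplicates."""
    types = ["central" if place in central_places else "local" for place in places]
    return [place_type for place_type, _group in groupby(types)]

# Hypothetical "place" fields from a growth trajectory, in chronological order.
seq = spatial_sequence(["Province A", "Province A", "Province B", "Central"],
                       central_places={"Central"})
print(seq)
```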
  • For the growth trajectory sequence data in the known resume element XML data (referred to as "sample data"), the time-dimension growth type is labeled manually according to the time-dimension growth type definitions described in step 1 and the time-dimension features described in step 2.
  • A machine learning classifier is then trained on the labeled data, and the classifier model parameters are learned.
  • If a mined sequence pattern corresponds to a spatial-dimension growth type described in step 1, it can be manually labeled with that growth type.
  • Otherwise, the sequence pattern is considered a spatial sequence pattern that has not appeared in the sample data; it can be used as a new spatial-dimension growth type, and its type definition can be given manually for future resume classification tasks.
  • Figure 5 is a schematic diagram of the classification results.
  • Person A is the growth type.
  • Person B is the stable type.
  • Person C is the fluctuating type.
  • The social relationship mining algorithm in this module uses a growth trajectory distance measure and an association rule algorithm to mine the potential social relationships R between resumes (such as classmates, colleagues, people from the same hometown, comrades-in-arms, collaborators and competitors).
  • The algorithm specifically includes the following steps:
  • A resume database M is given; its size n is the total number of resumes.
  • The elements M1 to Mn of M are the resume element XML data of the individual resumes.
  • The similarity sim(i,j) of the growth trajectory sequence data of any two resumes Mi and Mj in M is measured with the cosine similarity algorithm, which yields the similarity matrix sim.
  • The matching degree mch(i,j) between any two resumes Mi and Mj in M is measured with the resume element matching degree algorithm, which yields the matching degree matrix mch.
  • Scan sim: if sim(i,j) > s0, the growth trajectories of Mi and Mj are considered similar, and the larger sim(i,j) is, the more similar they are. In other words, the magnitude of sim(i,j) measures the strength of the similarity. Here s0 is a similarity threshold.
  • FIG. 6 is a schematic diagram showing the results of potential relationship mining.
  • The resume element matching degree algorithm mentioned in step 3 takes Mi and Mj as input and outputs the matching degree mch(i,j) of Mi and Mj, the difference element components err(i,j) of Mi relative to Mj, and the resume element intersection its(i,j) of Mi and Mj.
  • The algorithm specifically includes the following steps:
  • Two counters Ct and Cr are initialized to 0: Ct is the number of element comparisons made between Mi and Mj, and Cr is the number of those comparisons in which Mi and Mj have the same element.
  • A difference element list err(i,j) is defined, whose elements are the resume elements that differ between Mi and Mj.
  • A resume element intersection list its(i,j) is defined, whose elements are the resume elements that Mi and Mj have in common.
  • The organization generation algorithm in this module extracts and restores the hierarchy of an organization from the potential social relationships among multiple resumes, and provides the basis for the subsequent organization chart visualization algorithm.
  • The algorithm specifically includes the following steps:
  • The potential social relationship matrix R is output by the group potential social relationship mining module. Its size is n×n; the elements R11 to Rnn are the potential social relationships between resumes, and the element Rij is the potential social relationship between resume Mi and resume Mj.
  • The organization library V is a list structure: <V1, V2, ..., Vm>.
  • Each element of the library is a tree: the root node is the "organization name" and the leaf nodes are "member information".
  • The specific structure of an element is: <organization name, <member 1, position 1, incumbent or not>, <member 2, position 2, incumbent or not>, ..., <member m, position m, incumbent or not>>.
  • Repeat step 4 until R has been fully traversed. At that point the elements of V are the required organization information.
  • This module presents the resume information to the user in an intuitive way, so that the user can view it and understand it correctly.
  • The module contains three visualization algorithms: the resume spatio-temporal trajectory visualization algorithm, the resume potential social network visualization algorithm, and the resume organization visualization algorithm. Based on these three algorithms the following visualizations can be generated: the personal growth graph, the potential relationship graph, and the organization chart.
  • The personal growth graph is drawn with the resume spatio-temporal trajectory visualization algorithm.
  • The algorithm uses the growth metaphor: by visualizing the growth trajectory sequence data, the generated spatio-temporal trajectory graph expresses otherwise abstract personal growth information intuitively as a time-space map.
  • The specific steps of the algorithm are as follows:
  • Axes of the time-dimension trajectory: the horizontal axis is the time axis, with two display modes, "calendar year" and "age"; the vertical axis is the level axis, representing the "quantization level" dimension of the growth trajectory sequence data (for cadres, levels such as "section level", "division level" and "department/bureau level"; for researchers, levels such as "intern researcher", "assistant researcher", "associate researcher", "full researcher" and "academician").
  • Axes of the spatial-dimension trajectory: the horizontal axis is the time axis, with the same "calendar year" and "age" display modes; the vertical axis is the spatial axis, which uses a two-dimensional map as the spatial reference system and represents spatial dimensions such as the "place" and "unit" of the growth trajectory sequence data.
  • The growth trajectory sequence data of a resume consists of a series of experience segments, each of which is the basic unit of the sequence data.
  • ① Trajectory visualization of the time dimension: each experience segment is drawn as a horizontal rectangular block with fixed width, variable length and a color fill, which serves as its visual metaphor.
  • The horizontal position of the block corresponds to the time axis, and its length along that axis represents the time interval of the experience segment (the left edge is the "start time" and the right edge is the "end time").
  • The vertical position of the block corresponds to the level axis and represents the "quantization level" of the experience segment.
  • The blocks are connected by vertical lines in the chronological order of their experience segments, which forms the complete visual representation of the time-dimension growth trajectory.
  • The time-dimension growth trajectories of different resumes are distinguished by the fill color of their rectangular blocks.
  • ② Trajectory visualization of the spatial dimension: each experience segment is drawn as a circle with variable radius and a color fill, which serves as its visual metaphor.
  • The position of the circle is mapped onto the two-dimensional map of the spatial axis and represents geographic information such as the "place" and "unit" of the experience segment.
  • The circles are connected, in the chronological order of their experience segments, by color-filled directed arrows of variable width, which forms the complete visual representation of the spatial-dimension growth trajectory. The width of each arrow changes gradually from its start to its end and represents the change in "quantization level" between the two experience segments (the width encodes how high the level is).
  • The spatial-dimension growth trajectories of different resumes are distinguished by the fill color of the rectangular blocks they contain.
  • The potential relationship graph is drawn with the potential social network visualization algorithm.
  • The algorithm uses the mined potential relationships between resumes to build a visual representation of the resume social network; the resulting potential relationship graph expresses otherwise abstract relationships between resumes intuitively as a network diagram.
  • The specific steps of the algorithm are as follows:
  • Each resume is drawn as a rounded rectangle, which serves as its visual metaphor.
  • The "name" from the resume's basic information is shown inside the rounded rectangle and serves as the rectangle ID; rectangles with different IDs represent different resumes.
  • A line segment connecting two rounded rectangles indicates that the growth trajectories of the two resumes are similar to some degree.
  • Similar growth trajectories mean that the growth experiences of the resumes are similar. For example, if persons A and B took a similar amount of time to advance from "division-level cadre" to "department/bureau-level cadre", the growth trajectories of A and B are similar.
  • The length of the line segment encodes the degree of similarity: the shorter the segment (the smaller the distance between the two rectangles), the greater the similarity, and vice versa.
  • The similarity between A and B is given by the similarity matrix sim mentioned in the group potential social relationship mining module.
  • An element intersection reflects an overlap between the persons the resumes represent, such as a classmate, hometown or colleague relationship.
  • The organization chart is drawn with the organization visualization algorithm.
  • The algorithm extracts the unit intersections between resumes from the resume information, converts resumes that share a unit into the organizational relationships of that unit, and visualizes them as a table-style organization chart.
  • The specific steps of the algorithm are as follows:
  • The horizontal axis of the table header is the personnel axis, which represents the members of the unit;
  • the vertical axis of the table is the level axis, which represents the position levels of the unit. The level axis is arranged in descending order from top to bottom, that is, the higher the position level, the higher the row.
  • Each table element is the avatar of the person a resume represents.
  • The row of an element represents the position level of that resume in the unit, and the column of an element represents the person the resume represents.
  • A table element has two states: ① active (the avatar is in color), meaning that the unit and position of the element are the person's current state (for example, the person currently holds that position in the unit); ② inactive (the avatar is gray), meaning that the unit and position of the element are a past state of the person (for example, the person once held that position in the unit but no longer does).
  • This module introduces human-computer interaction into the visual analysis environment for resume data. Building on the mining modules and the resume information visualization module, it helps the user understand the potential information in a resume and the pattern features exhibited by a large number of resumes, and thereby gain deeper insight.
  • The module specifically includes the following steps:
  • Based on the resume spatio-temporal growth trajectory graph, linked analysis is provided through human-computer interaction, so that the user can view trajectory changes jointly from the time and space perspectives and discover spatio-temporal trajectory patterns.
  • Predicting the future direction of a resume trajectory from existing spatio-temporal trajectory patterns is also an important part of interactive visual analysis.
  • As shown in FIG. 12, based on the resume spatio-temporal growth trajectory graph, the user can discover the category patterns of different growth trajectories from a side-by-side display of multiple resumes, and so quickly find the trajectory categories of interest.
  • For example, from the visualization of official promotions shown in FIG. 12 the user can perceive three stages of personal growth: the growth period (early career, fast promotion), the bottleneck period (mid-career, promotion stalls) and the breakthrough period (late career, the bottleneck is overcome and promotion continues).
  • To keep the trajectory graph from becoming too complex during interaction, the visual analysis environment is defined as follows: at any one time the growth trajectories of at most 3 resumes can be compared in the trajectory graph, and the spatio-temporal trajectories of different resumes are slightly offset along their time and spatial axes, which reduces occlusion between trajectories without reducing visual precision.
  • Interactive visual analysis of the resume social network: as shown in FIG. 13, based on the group potential relationship graph, the user can select a target resume and the resumes potentially related to it to form a specific social network according to his or her interests. Interactive editing and viewing functions are provided on that social network to guide the user toward the important potential relationships.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Strategic Management (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Economics (AREA)
  • General Business, Economics & Management (AREA)
  • Tourism & Hospitality (AREA)
  • Quality & Reliability (AREA)
  • Operations Research (AREA)
  • Marketing (AREA)
  • Data Mining & Analysis (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Game Theory and Decision Science (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

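The invention discloses an information visualization method and an intelligent visual analysis system based on text resume information. The method is: 1) for the experience information in each piece of text resume information, perform experience-level quantization to obtain growth trajectory sequence data, and visualize that data; 2) select the growth trajectory sequence data of multiple text resumes and perform association computation on it to obtain the potential social relationships between the resumes, and visualize those relationships as a social network; 3) based on the potential social relationships between resumes, convert resumes that share a unit into the organizational hierarchy of the unit the persons belong to, and visualize that hierarchy as an organization chart. Through data mining and information visualization, the invention can recover the spatio-temporal growth experience of the individual each resume represents, discover potential social relationships among persons, and restore the organizational hierarchy among them, thereby yielding a deeper understanding of group growth patterns and social relationships.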

Description

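Information visualization method and intelligent visual analysis system based on text resume information
Technical Field
The invention belongs to the field of computer application technology and relates to an intelligent visual analysis system and an information visualization method based on text resume information.
Background Art
Resume information summarizes a person's experience. It is found in resume data and mainly comprises basic personal information and a brief account of personal experience. Basic personal information includes name, gender, date of birth, ethnicity, education, political affiliation, religious belief, main family members, main social relations, marital status and health. Personal experience, an important part of a resume, usually covers past education and employment.
As an important basis for personnel assessment, resume data reflects a person's past behavior and current abilities from several angles. Resume analysis predicts future behavior from the past behavior recorded in resume data, and is therefore widely used in personnel selection and recruitment by enterprises and institutions, in cadre assessment and management by government departments, and in research on and evaluation of the mobility of scientific talent.
With the development of information technology, electronic resume data has grown and spread explosively in recent years. By source, electronic resumes are mainly: ① public resumes on the Internet; ② non-public resumes held by enterprises, institutions and recruitment systems. By form, they are either structured or unstructured. ① Structured resumes are usually tables and come from recruitment systems or internal management systems; their structure is regular and fixed, which makes unified management easy. However, because the structure is fixed and hard to extend, deep semantic analysis of them is difficult. ② Unstructured resumes are usually text and come from many sources, such as news sites and social media. Their structure varies, which makes unified analysis and management difficult. However, because text is the carrier, they often contain rich semantic information and can therefore be analyzed intelligently on a semantic basis, for example for semantic search and classification.
At the same time, as the volume of resume data keeps increasing, traditional manual resume analysis is too inefficient for processing large amounts of resume data quickly. Curriculum Vitae Analysis Systems (CVAS), which rely on the processing power of computers, emerged as a result. A CVAS mainly performs automated analysis and management of structured resume data. It can quickly filter out resumes that do not meet requirements, which greatly improves the efficiency of resume analysis. It can also analyze resume data quantitatively and evaluate it scientifically according to the needs of a specific application, which makes the results of resume analysis more sound and reliable. In recent years CVAS have therefore received increasing attention from personnel departments and are widely used in personnel selection and other human resource management activities.
In summary, resume analysis has evolved from manual analysis to automatic computer analysis in the Internet era. CVAS in particular have greatly improved the efficiency of resume analysis and are widely used in many fields.
However, existing CVAS still have the following shortcomings. (1) They are not suited to analyzing unstructured resume data. Unstructured resumes are usually stored as plain text (for example txt, word or pdf); their formats are inconsistent and vary widely, so they are hard to feed directly into current CVAS. In other words, current CVAS lack the ability to convert unstructured resumes into structured ones. (2) Their analytical capability lies mainly in qualitative analysis and quantitative computation under simple rules (for example resume screening and scoring) and in statistical management (for example generating resume reports). They neglect intelligent mining and intuitive visual analysis of the potential patterns contained in resumes; in particular they neglect mining personal growth patterns from resumes and visualizing them. They therefore cannot help users with complex tasks such as semantic resume search and classification, recommendation of appointments and removals, and career planning. (3) They analyze each resume in isolation and ignore the associations between resumes. Potential associations between resumes reflect potential social relationships between persons, which arise from overlaps in their experience, such as classmates, colleagues, people from the same hometown, comrades-in-arms, collaborators and competitors. From these relationships a potential social network among persons can be restored and constructed. Such a network supports the scientific management of resumes, helps users grasp potential social connections among persons, and helps them discover the organizational hierarchy among persons and so gain deeper insight.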
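Summary of the Invention
Problem solved by the invention: to overcome the shortcomings of existing methods and systems by providing an intelligent visual analysis system and an information visualization method based on text resume information. The invention makes full use of the potential pattern information in resume data and, using natural language processing, data mining, machine learning and information visualization, builds a visual analysis environment for resume information. It helps users understand the potential growth patterns in resumes and the potential associations between resumes, and thereby supports tasks such as semantic resume search and classification, recommendation of appointments and removals, career planning and understanding interpersonal relationships. The invention is a general framework that aims to discover the potential growth patterns contained in resume data and the potential social relationships among persons, and to express these pattern features and relationships through intuitive visualization. It can be widely applied to the intelligent mining and visualization of staff resumes, cadre resumes, corporate executive resumes and researcher resumes.
Technical solution of the invention: an intelligent visual analysis system based on text resume information, comprising a text resume preprocessing module, a personal growth experience quantization module, a personal growth pattern mining module, a group potential social relationship mining module, an organization generation module, a resume information visualization module and a resume visual analysis module, where:
Text resume preprocessing module. This module preprocesses unstructured text resume data and extracts the effective elements of the resume information (basic personal information and experience information) to obtain structured resume element XML (Extensible Markup Language) data. Using natural language processing, it converts multi-source resume texts with inconsistent formats into resume element data with a uniform structure, which provides the data basis for the subsequent modules.
Personal growth experience quantization module. This module quantizes the experience level of the experience information among the resume elements and obtains the growth trajectory sequence data. Using natural language processing, it quantizes experience information into level information, which provides the basis for mining and visualization in the subsequent modules.
Personal growth pattern mining module. Using machine learning and data mining, this module analyzes the type of the growth trajectory sequence data in the time dimension and the spatial dimension and obtains the spatio-temporal growth pattern of the resume.
Group potential social relationship mining module. Using association algorithms from data mining, this module performs association computation on the growth trajectory sequence data of multiple resumes and obtains the potential social relationships between resumes (such as classmates, colleagues, people from the same hometown, comrades-in-arms, collaborators and competitors).
Organization generation module. Based on the potential social relationships of the group represented by multiple resumes, this module extracts and restores the hierarchy of organizations from the group's unit intersection information.
Resume information visualization module. Based on an information visualization method for text resume information, and using visual metaphors, this module converts the growth trajectory sequence data and the results output by the mining modules into intuitive, easily understood visualizations. These help users quickly grasp the characteristics of the resume data and the knowledge it contains.
Resume visual analysis module. This module builds a visual analysis environment for resume information on top of the visualizations and uses human-computer interaction to help users understand the potential information and pattern features in resumes from the time and space dimensions, and thereby gain deeper insight.
An information visualization method based on text resume information, implemented as follows:
1. Resume spatio-temporal trajectory visualization algorithm. Based on the growth metaphor, this algorithm converts the abstract growth information in a resume into a vivid spatio-temporal trajectory visualization. By visualizing the growth trajectory sequence data, the resulting graph expresses otherwise abstract personal growth information intuitively as a time-space map.
2. Resume potential social network visualization algorithm. Based on the mined potential social relationships between resumes, this algorithm builds a visual representation of the resume social network. The resulting potential relationship graph expresses otherwise abstract relationships between resumes intuitively as a network diagram.
3. Resume organizational hierarchy visualization algorithm. Based on the potential social relationships between resumes, this algorithm builds a visual representation of the organizational hierarchy of the units the persons belong to. It extracts the unit intersections between resumes from the resume information, converts resumes that share a unit into the organizational hierarchy of that unit, and visualizes the hierarchy as a table-based organization chart.
Compared with the prior art, the positive effects of the invention are:
1. Compared with traditional methods, the invention takes unstructured text resume data as its data source. Based on natural language processing, its mechanism for extracting structured resume elements meets the need to process heterogeneous multi-source resume data uniformly, which greatly widens the applicability of the system and method.
2. Compared with traditional methods, the invention focuses on intelligent mining of the potential pattern information contained in resume data, together with in-depth visual analysis of that pattern information. It can obtain the growth trajectory patterns and growth category patterns in resume data, and can therefore support deeper resume analysis tasks such as semantic resume search and classification, personnel assessment, and recommendation of appointments and removals.
3. Compared with traditional methods, the invention brings the potential associations between resumes into the analysis. Through mining and information visualization it obtains the potential social relationships between the persons the resumes represent. From these relationships a potential social network among persons can be built, and from that network the organizational hierarchy among persons can be restored. The pattern features exhibited by a large number of resumes are thus presented to the user from a macro perspective, yielding a deeper understanding of group social relationships.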
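Brief Description of the Drawings
FIG. 1 is a block diagram of the modules of the invention.
FIG. 2 is the system architecture diagram.
FIG. 3 shows example definitions of the time-dimension growth trajectory categories of a resume: (a) growth-type trajectory, (b) stable-type trajectory, (c) fluctuating-type trajectory, (d) declining-type trajectory. In each panel the solid line is the personal growth trajectory and the dashed line is the average growth trajectory of the whole sample.
FIG. 4 shows example definitions of the spatial-dimension growth trajectory categories of a resume: (a) "local → central" trajectory, (b) "local → central → local" trajectory, (c) "central → local" trajectory, (d) "central → local → local → central" trajectory.
FIG. 5 is a schematic diagram of personal growth trajectory classification results.
FIG. 6 is a schematic diagram of group potential relationship mining results: (a) growth trajectory similarity relationships, (b) experience intersection relationships.
FIG. 7 is the personal growth graph: (a) growth trajectory in the time dimension, (b) growth trajectory in the spatial dimension.
FIG. 8 is the potential relationship graph.
FIG. 9 is the organization chart.
FIG. 10 is a schematic diagram of the statistical analysis of resume trajectory information.
FIG. 11 is a schematic diagram of the linked spatio-temporal interactive analysis of resume trajectories, where (a) is the time trajectory graph and (b) is the space trajectory graph; the experience segment in the dashed box in (a) corresponds to the growth trajectory shown by the dashed arrow in (b).
FIG. 12 is a schematic diagram of the visual pattern analysis of resume spatio-temporal trajectories. It shows patterns that appear during personal growth, such as the "growth period", "bottleneck period" and "breakthrough period". Taking the promotion of officials as an example, the growth period is fast promotion early in the career; the bottleneck period is slow promotion after hitting a bottleneck in mid-career; the breakthrough period is continued promotion after overcoming the bottleneck late in the career.
FIG. 13 is a schematic diagram of the interactive visual analysis of the resume social network, where (a) is the time trajectory graph, (b) is the space trajectory graph and (c) is the social network graph. The dashed box in (a) corresponds to the dashed box in (b) in the time and space dimensions, and the details of the resume intersection are shown in (c).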
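Detailed Description of the Embodiments
To make the purpose, technical solution and advantages of the invention clearer, the embodiments of the invention are described in detail below.
Definitions of terms
Person: the subject a resume represents, for example an employee of an enterprise or institution, a government cadre, a corporate executive or a researcher.
User: the user of the system, usually a decision maker, for example a leader or a manager in an enterprise or institution.
Resume: a cadre resume of a government department, a staff resume of an enterprise or institution, a corporate executive resume, a researcher resume, a celebrity resume, and so on.
The ideas, algorithms and system of the invention form a general framework and can all be extended to analysis tasks on any of the resume types above. For ease of explanation, the "cadre resume" of a government department is used as the example here.
Based on natural language processing, data mining, machine learning and information visualization, the invention builds a visual analysis environment for resume information. It makes full use of the information in text resume data, extracts the potential knowledge that matters for decision making, and presents that knowledge through intuitive visualization based on the growth metaphor. This helps users understand the potential pattern features a resume expresses and the potential associations between resumes, and thereby supports tasks such as fuzzy resume search and intelligent classification, automated appointments and removals, career planning and understanding interpersonal relationships.
As shown in FIG. 1, the invention comprises a text resume preprocessing module, a personal growth experience quantization module, a personal growth pattern mining module, a group potential social relationship mining module, an organization generation module, a resume information visualization module and a resume visual analysis module. The system architecture is shown in FIG. 2. In detail:
1. Text resume preprocessing module
This module preprocesses unstructured resume text data. Using natural language processing techniques such as format filtering, Chinese word segmentation and named entity recognition, it extracts the effective elements of the resume information and obtains structured resume element XML (Extensible Markup Language) data.
The XML data format is designed around the characteristics of resume data. The XML data is hierarchical, with the structure shown below.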
Figure PCTCN2014088601-appb-000001
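As shown above, the XML data contains two groups of resume elements: the basic resume information and the experience information table. The basic resume information includes the person's name, gender, ethnicity, birthplace and similar basic details. The experience information table is a table whose header contains the fields start time, end time, place, unit and title; each record in the table is one experience element of the person, that is, what the person did (a post held or a course of study) during a given period.
Unstructured resume text data mainly comprises text resumes from the Internet (html format), text resumes from personnel systems (txt, word, pdf and similar formats) and other personnel file resumes (stored in databases). An Internet text resume is shown below. Such data is usually obtained from the Internet by a web crawler; because its format is complex and inconsistent, its preprocessing is also the most complex.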
Figure PCTCN2014088601-appb-000002
Figure PCTCN2014088601-appb-000003
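The module specifically includes the following steps:
1) Use an html parsing algorithm to strip noise such as advertisements and html markup from the original resume text and obtain the pure resume text that contains the resume information. The pure resume text is shown below. It consists of two text segments: the basic information segment and the experience information segment. Note that this step applies only to Internet text resume data.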
张三,男,汉族,1975年8月2日生,湖南长沙人。1990年1月参加工作,1991年12月加入中国共产党。现任湖南省省长。
1989-1992年 湖南省宁乡县卫生局党支部书记。
1992-1995年 湖南省宁乡县县委书记。
1995-1998年 湖南省长沙市副市长。
1998-2002年 湖南省长沙市市委书记。
2002-2010年 湖南省副省长。
2010-至今 湖南省省长。
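2) For the pure resume text, use natural language processing to perform word segmentation and named entity recognition, then use the resume element extraction algorithm to extract the resume feature elements from the unstructured text, producing a structured text block that contains the resume elements. The structured text block is shown below and mainly comprises basic information and experience information. Structural identifiers such as "/NAME", "/TIME" and "/TITLE" denote the resume elements "name", "time" and "title" respectively.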
张三/NAME 男/GENDER 汉族/NATION 1975年8月2日/BIRTHDATE 湖南长沙/BIRTHPLACE 1990年1月1日/WORK TIME 1991年12月1日/PARTYTIME 湖南省省长/CURRENTTITLE
{1989-1992}/TIME 湖南省宁乡县/POS 卫生局党支部/ORG 书记/TITLE
{1992-1995}/TIME 湖南省宁乡县/POS 县委/ORG 书记/TITLE
{1995-1998}/TIME 湖南省长沙市/POS 市委/ORG 副市长/TITLE
{1998-2002}/TIME 湖南省长沙市/POS 市委/ORG 书记/TITLE
{2002-2010}/TIME 湖南省/POS 省委/ORG 副省长/TITLE
{2010-2014}/TIME 湖南省/POS 省委/ORG 省长/TITLE
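3) Convert the structured text block containing the resume elements into structured resume element XML data with the hierarchy shown below. The hierarchy organizes the resume information into two parts: the basic information segment (basic_info) and the experience information segment (office_record_array). The basic information segment stores the basic information of the resume in a fixed list form. The experience information segment is designed as a tree whose nodes are the individual experience segments (office_record). The tree is easily extensible and can be expanded and queried quickly, which can significantly improve the efficiency of element matching over large-scale resume data.
A complete example of the XML data is given here: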
Figure PCTCN2014088601-appb-000004
Figure PCTCN2014088601-appb-000005
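Since the XML example itself survives only as a figure reference, the following is a minimal sketch of how the two-part hierarchy described in step 3 could be assembled. Only the tag names `basic_info`, `office_record_array` and `office_record` come from the text; the root tag `resume` and the field tags (`name`, `start_time`, and so on) are illustrative assumptions, not the schema used in the patent.

```python
import xml.etree.ElementTree as ET


def build_resume_xml(basic, records):
    """Build a resume element tree with a flat basic_info list and a tree of records."""
    root = ET.Element("resume")  # hypothetical root tag
    basic_el = ET.SubElement(root, "basic_info")
    for key, value in basic.items():
        ET.SubElement(basic_el, key).text = value
    array_el = ET.SubElement(root, "office_record_array")
    for record in records:
        record_el = ET.SubElement(array_el, "office_record")
        for key, value in record.items():
            ET.SubElement(record_el, key).text = value
    return root


# Values taken from the sample resume text above; field names are assumed.
root = build_resume_xml(
    {"name": "张三", "gender": "男"},
    [
        {"start_time": "1989", "end_time": "1992", "place": "湖南省宁乡县",
         "unit": "卫生局党支部", "title": "书记"},
        {"start_time": "1992", "end_time": "1995", "place": "湖南省宁乡县",
         "unit": "县委", "title": "书记"},
    ],
)
xml_text = ET.tostring(root, encoding="unicode")
```

Keeping each experience segment as its own `office_record` node is what allows later modules to iterate over segments with a single path query such as `office_record_array/office_record`.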
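The resume element extraction algorithm mentioned in step 2 is the core algorithm of this module. It mainly uses regular expression matching to extract each element. The algorithm specifically includes the following steps:
2-1) Extraction of basic information: use regular expression matching to extract the name, native place, date of birth, date of starting work, date of joining the Party and similar information.
2-2) Extraction of experience information:
① For the "time" and "place" elements, use regular expression matching. For example, use "年" (year) as the keyword for extracting the "time" element, and use "省" (province), "市" (city), "县" (county), "乡" (township) and the like as keywords for extracting the "place" element;
② For the "unit" element, use keyword matching against a prepared unit keyword dictionary (see Table 1). Each row of the dictionary has two parts: a "keyword" and its "auxiliary keywords". Auxiliary keywords are of two kinds, R-type and L-type, and multiple auxiliary keywords are separated by commas. The rule for recognizing a unit element with the dictionary is: recognition succeeds when a "keyword" from the dictionary is found, no R-type auxiliary keyword is immediately to its right, and no L-type auxiliary keyword is immediately to its left; otherwise recognition fails.
Table 1. Unit keyword dictionary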
Figure PCTCN2014088601-appb-000006
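For example, row 4 of Table 1 has the keyword "部", the R-type auxiliary keywords "长" and "队", and the L-type auxiliary keyword "干". During recognition, the unit element is considered recognized when neither "长" nor "队" appears to the right of "部" and "干" does not appear to its left. In other words, "部队" (troops), "部长" (minister) and "干部" (cadre) should not be recognized as unit elements.
③ For the "title" element, use regular expression matching on the text that follows the extracted "unit" element.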
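The keyword rule with R-type and L-type auxiliary keywords can be sketched as below. This is a minimal illustration under assumptions: the only dictionary row used is the one quoted in the example above (keyword "部"), since Table 1 itself is not reproduced in this text, and the names `UnitKeyword` and `find_unit_keywords` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class UnitKeyword:
    keyword: str
    right_aux: set = field(default_factory=set)  # R-type: must NOT follow the keyword
    left_aux: set = field(default_factory=set)   # L-type: must NOT precede the keyword


def find_unit_keywords(text, dictionary):
    """Return (position, keyword) pairs where the keyword passes the R/L exclusion rule."""
    hits = []
    for entry in dictionary:
        start = text.find(entry.keyword)
        while start != -1:
            end = start + len(entry.keyword)
            blocked_right = any(text.startswith(aux, end) for aux in entry.right_aux)
            blocked_left = any(text.endswith(aux, 0, start) for aux in entry.left_aux)
            if not blocked_right and not blocked_left:
                hits.append((start, entry.keyword))
            start = text.find(entry.keyword, start + 1)
    return sorted(hits)


# The single row described in the text: keyword 部, R-type 长/队, L-type 干.
DICTIONARY = [UnitKeyword("部", right_aux={"长", "队"}, left_aux={"干"})]
```

With this row, `find_unit_keywords("组织部", DICTIONARY)` reports one hit, while "部长", "部队" and "干部" report none, which matches the behavior the example describes.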
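2. Personal growth experience quantization module
This module derives the growth trajectory sequence data from the resume element XML data. As shown in Table 2, each element of the sequence is a six-tuple, <start time, end time, place, unit, title, quantization level>, in which the last field, "quantization level", expresses the level of that experience segment.
Table 2. Growth trajectory sequence data
Start time  End time  Place  Unit  Title  Quantization level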
1989 1992 湖南省宁乡县 卫生局党支部 书记 0
1992 1995 湖南省宁乡县 县委 书记 2
1995 1998 湖南省长沙市 市委 副市长 3
1998 2002 湖南省长沙市 市委 书记 4
2002 2010 湖南省 省委 副省长 5
2010 2014 北京市 市委 市长 6
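The core algorithm of this module is the experience level quantization algorithm. It specifically includes the following steps:
1) Sort the experience information table of each text resume in ascending order of the "start time" field to obtain the ordered experience information table.
2) Scan the records of the ordered experience information table one by one. From each record extract the "place", "unit" and "title" fields, compare each field value against the existing experience level quantization library (see Table 3), and assign a numeric level to each matched entity. The size of the number expresses how high the level is, for example: 0 is a grassroots cadre, 1 is a section-level cadre, 2 is a division-level cadre, ..., 5 is a national-level cadre.
3) Repeat step 2 until the ordered experience information table has been fully scanned and processed. Assemble the experience segments with their levels into an ordered sequence to obtain the growth trajectory sequence data (see Table 2).
Table 3. Level quantization library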
Figure PCTCN2014088601-appb-000007
Figure PCTCN2014088601-appb-000008
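The experience level quantization library mentioned in step 2 is shown in Table 3. The library is a dictionary whose elements are <unit, title, quantization level> triples. As the basis of the personal growth experience quantization module, the dictionary is built interactively:
2-1) The "unit" and "title" fields can be extracted from the resume corpus by the text resume preprocessing module, and the user can also add and modify them.
2-2) For the "quantization level" field, the computer first computes an initial value according to level quantization rules; the user can then use his or her own knowledge and experience to handle special cases (explained below), which ensures that the adjusted values are correct.
The level quantization rules mentioned in step 2-2 depend on the application scenario:
① Taking the "cadre resume" of a government department as an example, cadres can be divided according to China's administrative ranks into the national level (quantized as 5), the provincial/ministerial level (4), the department/bureau level (3), the county/division level (2) and the township/section level (1), and each level can be subdivided further into chief and deputy posts.
② Taking the "researcher resume" of a research institute as an example, researchers can be divided according to professional title into academician (quantized as 5), full researcher (4), associate researcher (3), assistant researcher (2) and intern researcher (1).
Although the level quantization rules give correct results in general, some special cases require a person to adjust the result. For example, from the title field "XX市长" (mayor of XX) the computer computes the department/bureau level (quantized as 3), which is correct in general; however, if the title field is the mayor of a directly administered municipality, such as "北京市长" (mayor of Beijing) or "上海市长" (mayor of Shanghai), it should be quantized as the provincial/ministerial level (4) because of the special administrative status.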
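The two-stage quantization (a rule-based initial value followed by a user adjustment for special cases) can be sketched as follows. This is an assumption-laden illustration: Table 3 is not reproduced in the text, so the rule table holds only the mayor example given above, and the names `BASE_LEVEL_BY_TITLE`, `USER_OVERRIDES` and `quantize` are hypothetical.

```python
# Rule-based initial values; only the example from the text is included.
BASE_LEVEL_BY_TITLE = {"市长": 3}  # mayor -> department/bureau level

# User-supplied corrections, keyed by (place prefix, title). The text names
# Beijing and Shanghai as examples of directly administered municipalities.
USER_OVERRIDES = {("北京", "市长"): 4, ("上海", "市长"): 4}


def quantize(place, title):
    """Return the quantization level for one experience segment, or None if unknown."""
    for (place_prefix, override_title), level in USER_OVERRIDES.items():
        if title == override_title and place.startswith(place_prefix):
            return level  # a manual adjustment takes precedence over the rule
    return BASE_LEVEL_BY_TITLE.get(title)
```

Checking the overrides before the general rule mirrors the order of authority in the text: the computed value is only an initial value, and the user's correction wins when both apply.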
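3. Personal growth pattern mining module
The growth pattern classification algorithm in this module applies supervised machine learning classifiers (for example naive Bayes and SVM, Support Vector Machine) and sequence pattern mining to resume data. It can therefore classify unknown resumes automatically from the growth patterns of known resumes, help the user quickly identify the growth type of a resume, and predict the future development of a resume from its growth pattern. The algorithm specifically includes the following steps:
1) Define a set of personal growth trajectory types.
① Time dimension. Growth types describing how a resume changes over time can be defined, for example the following four (see FIG. 3): growth, stable, fluctuating and declining.
These four personal growth trajectory types (solid lines in FIG. 3) are defined relative to the average of the whole sample (dashed lines in FIG. 3). The personal growth speed (the slope of the curve in FIG. 3) is obtained by measuring the time span spent at each level of the personal growth trajectory. For the growth type, the growth speed is clearly above the sample average throughout the time dimension; for the stable type it is roughly equal to the sample average; for the fluctuating type it is above the sample average in some phases and below it in others; for the declining type it is clearly below the sample average throughout.
② Spatial dimension. Growth types describing how a resume moves through space can be defined, for example the four types shown in FIG. 4 (called "sequence patterns" in data mining): "local → central", "local → central → local", "central → local" and "central → local → central". Here "central" can stand for Beijing and "local" for the other provinces and cities. "Local" can also be subdivided as needed into smaller spatial scales such as "southeast coast", "western region" and "remote mountain areas". Note that these types cover only the cases with a clear spatial migration; without loss of generality, types without a clear migration such as "local → local" and "central → central" can be handled by the same method but are not considered here.
2) Define the features of the growth trajectory types. "Feature" is used here in the machine learning and data mining sense: features characterize growth trajectory sequence data of different types, and only through the features of the data can a learning algorithm learn the type the data belongs to, or a mining algorithm find the pattern in the data.
① Features of the time dimension. From the time-dimension types described in step 1, the growth speed of the growth trajectory sequence data can serve as its time-dimension feature. The growth speed can be quantified as the following two kinds of feature:
a. Time span per level: the time the individual spent at each level. Its formal expression is: "<quantization level 1, time span 1>, <quantization level 2, time span 2>, ..., <quantization level n, time span n>", where n is the length of the growth trajectory sequence data (the number of elements in the sequence) and each time span is the "end time" minus the "start time" of the element. For example, the time-span-per-level feature of the sequence data in Table 2 is: "<0,3>, <1,0>, <2,3>, <3,3>, <4,4>, <5,8>, <6,4>, <7,0>, <8,0>".
b. Timing growth slope: the slope of the individual's growth trajectory in each time phase. Its formal expression is: "<time phase 1, slope 1>, <time phase 2, slope 2>, ..., <time phase m, slope m>", where m is the number of time phases, usually chosen empirically; for example m=10 divides the growth trajectory sequence data into 10 equal parts along the time dimension. Note that different growth trajectory sequences generally span different lengths of time, so their slopes cannot be compared directly. The time dimension of the sequence data therefore has to be normalized, mapping the time span onto [time point 1, time point m]. For example, the sequence data in Table 2 can be divided into 10 time phases, "1989.1.1~1991.6.1", "1991.6.1~1994.1.1", ..., "2011.6.1~2014.1.1"; the slope in each phase is the quantization level at the end of the phase minus the quantization level at its beginning, so the timing growth slope feature is: "<1,0>, <2,2>, <3,1>, <4,1>, <5,0>, <6,1>, <7,0>, <8,0>, <9,1>, <10,0>".
Note that the two kinds of time-dimension feature above can be used separately or together in the machine learning process.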
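Both time-dimension features can be computed directly from the segments of Table 2. The sketch below is an illustration under assumptions: each segment is reduced to (start year, end year, level); the fourth segment is taken to start in 1998, as in the sample resume text (1998-2002); levels are enumerated from 0 to an assumed maximum of 8 to match the worked example; and the level at a phase boundary is taken from the segment that starts at or before it.

```python
# (start year, end year, quantization level), following Table 2.
TABLE2 = [(1989, 1992, 0), (1992, 1995, 2), (1995, 1998, 3),
          (1998, 2002, 4), (2002, 2010, 5), (2010, 2014, 6)]


def level_time_spans(segments, max_level):
    """Feature a: total years spent at each level from 0 to max_level."""
    spans = [0] * (max_level + 1)
    for start, end, level in segments:
        spans[level] += end - start
    return list(enumerate(spans))


def level_at(segments, t):
    """Level held at time t; the final end point maps to the last segment's level."""
    for start, end, level in segments:
        if start <= t < end:
            return level
    return segments[-1][2]


def timing_growth_slopes(segments, m=10):
    """Feature b: level difference across each of m equal time phases."""
    t0, t1 = segments[0][0], segments[-1][1]
    step = (t1 - t0) / m
    bounds = [t0 + k * step for k in range(m + 1)]
    return [(k + 1, level_at(segments, bounds[k + 1]) - level_at(segments, bounds[k]))
            for k in range(m)]
```

Dividing the whole career into a fixed number of phases is what normalizes careers of different lengths, so the slope vectors of two resumes always have the same dimension and can be fed to the same classifier.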
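② Features of the spatial dimension (also called the "spatial sequence"). From the spatial-dimension types described in step 1, the geographic location of the individual's unit can serve as the spatial-dimension feature of the growth trajectory sequence data. The feature is formalized as: "<place type 1, place type 2, ..., place type k>", where each "place type" is one of the attributes described in step 1, such as "central" and "local", and k is the number of place types in the "place" field of the growth trajectory sequence data. For example, the spatial-dimension feature of the sequence data in Table 2 is: "<local, central>". Note that in sequence pattern mining this spatial-dimension feature is called a "sequence", and the spatial-dimension growth types described in step 1 are the "sequence patterns" found from a number of such sequences.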
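One way to read the Table 2 example, in which six places yield the two-element feature <local, central>, is that each place is mapped to its place type and consecutive repeats are merged. The sketch below implements that reading; the merging rule and the "Beijing means central" test are assumptions drawn from the example and from step 1, and the function names are hypothetical.

```python
def place_type(place):
    """Map a place to its type; step 1 says 'central' can stand for Beijing."""
    return "中央" if place.startswith("北京") else "地方"


def spatial_sequence(places):
    """Collapse the chronological place list into a sequence of distinct place types."""
    sequence = []
    for place in places:
        kind = place_type(place)
        if not sequence or sequence[-1] != kind:
            sequence.append(kind)
    return sequence


# The "place" column of Table 2, in chronological order.
TABLE2_PLACES = ["湖南省宁乡县", "湖南省宁乡县", "湖南省长沙市",
                 "湖南省长沙市", "湖南省", "北京市"]
```

For the Table 2 places this yields `["地方", "中央"]`, that is, <local, central>, which corresponds to the "local → central" growth type.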
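3) For the growth trajectory sequence data in the known resume element XML data (the "sample data"), label the time-dimension growth type manually, according to the time-dimension growth type definitions of step 1 and the time-dimension features of step 2.
4) Using the labeled growth trajectory sequence data and its time-dimension features, train a machine learning classifier and learn the classifier model parameters.
5) For the existing growth trajectory sequence data, apply a sequence pattern mining algorithm to its spatial-dimension features to obtain its sequence patterns. These "sequence patterns" correspond to the spatial-dimension growth types described in step 1, and their spatial-dimension growth types can be labeled manually.
6) For resume data whose time-dimension growth type is unknown, obtain its growth trajectory sequence data, extract its time-dimension features, and classify the sequence data with the classifier trained in step 4 to compute the time-dimension growth type of the resume.
7) For resume data whose spatial-dimension growth type is unknown, obtain its growth trajectory sequence data, extract its spatial-dimension feature (the spatial sequence), and mine the sequence with the sequence pattern mining algorithm to compute the spatial-dimension growth type of the resume. Specifically, after the sequence pattern of the unknown spatial sequence has been mined, it is compared with the sequence patterns of the known spatial sequences mined in step 5:
① if an identical known sequence pattern is found, the type of that known pattern is assigned to the unknown sequence;
② if none is found, the sequence pattern is considered a spatial sequence pattern that did not appear in the sample data; it can be treated as a new spatial-dimension growth type, and its type definition can be given manually for future resume classification tasks.
FIG. 5 is a schematic diagram of the classification results, in which person A is the growth type, person B is the stable type and person C is the fluctuating type.
8) Based on the computed growth type of the resume and its current growth level, predict the future spatio-temporal growth trend of the individual the resume represents. For example, if a person's time-dimension growth type is computed to be the growth type, that person's future growth speed is likely to exceed the sample average; in addition, from the person's current growth level it is possible to predict the level he or she can reach in the future (for example in 10 years).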
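4. Group potential social relationship mining module
The social relationship mining algorithm in this module uses a growth trajectory distance measure and an association rule algorithm to mine the potential social relationships R between resumes (such as classmates, colleagues, people from the same hometown, comrades-in-arms, collaborators and competitors). The algorithm specifically includes the following steps:
1) A resume database M is given; its size n is the total number of resumes. The elements M1 to Mn of M are the resume element XML data of the individual resumes.
2) For the resume database M, use the cosine similarity algorithm to measure the similarity sim(i,j) of the growth trajectory sequence data of any two resumes Mi and Mj in M, obtaining the similarity matrix sim.
3) For the resume database M, use the resume element matching degree algorithm to measure the matching degree mch(i,j) between any two resumes Mi and Mj in M, obtaining the matching degree matrix mch.
4) Scan sim: if sim(i,j) > s0, the growth trajectories of Mi and Mj are considered similar, and the larger sim(i,j) is, the more similar they are. In other words, the magnitude of sim(i,j) measures the strength of the similarity. Here s0 is the similarity threshold.
5) Scan mch: if mch(i,j) > 0, the growth experiences of Mi and Mj are considered to intersect in some way, and the larger mch(i,j) is, the more prominent the intersection. The details of the intersection are given by the resume element intersection its(i,j), which reflects potential relationships between the persons, such as classmates, colleagues, people from the same hometown and comrades-in-arms, as they appear in the resume information.
6) Repeat steps 4 and 5 until all resumes in M have been scanned and processed, obtaining the potential social relationships R between all resumes. There are two kinds of potential social relationship: growth trajectory similarity relationships, derived from the similarity matrix sim, and experience intersection relationships, derived from the matching degree matrix mch. FIG. 6 is a schematic diagram of the potential relationship mining results.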
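Steps 2 and 4 can be sketched as follows. The text does not say how a growth trajectory is turned into a vector for the cosine measure, so this illustration assumes that each resume is represented by a fixed-length numeric feature vector (for example its time span per level); the function names and the sample vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def similar_pairs(vectors, s0):
    """Build the similarity matrix sim and list the pairs (i, j), i < j, with sim > s0."""
    n = len(vectors)
    sim = [[cosine(vectors[i], vectors[j]) for j in range(n)] for i in range(n)]
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n) if sim[i][j] > s0]
    return sim, pairs


# Three illustrative trajectories; the first two have the same shape.
VECTORS = [[3, 0, 3], [6, 0, 6], [0, 5, 0]]
sim, pairs = similar_pairs(VECTORS, s0=0.9)
```

Because cosine similarity ignores vector length, two careers that spend time at the same levels in the same proportions score close to 1 even if one is twice as long, which is one reason the choice of feature vector matters.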
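The resume element matching degree algorithm mentioned in step 3 takes Mi and Mj as input and outputs the matching degree mch(i,j) of Mi and Mj, the difference element components err(i,j) of Mi relative to Mj, and the resume element intersection its(i,j) of Mi and Mj. The algorithm specifically includes the following steps:
3-1) Define two counters Ct and Cr, both initialized to 0: Ct is the number of element comparisons made between Mi and Mj, and Cr is the number of those comparisons in which the elements are the same. Define a difference element list err(i,j), whose elements are the resume elements that differ between Mi and Mj. Define a resume element intersection list its(i,j), whose elements are the resume elements that Mi and Mj have in common.
3-2) Scan the basic information elements of Mi and Mj one by one (for example name, gender, ethnicity, birthplace). For each element scanned, add 1 to Ct. For any element f, if f(Mi) = f(Mj), add 1 to Cr and add f to its(i,j); otherwise add f to err(i,j). For example, if person i was born in Beijing and person j in Shanghai, then when the element "birthplace" is scanned, f(Mi) = Beijing and f(Mj) = Shanghai.
3-3) Scan the experience information tables of Mi and Mj row by row. For each experience segment, scan its time, place, unit, title and other elements one by one. For each element scanned, add 1 to Ct. For any element e, if e(Mi) = e(Mj), add 1 to Cr and add e to its(i,j); otherwise add the element e of that experience segment to err(i,j).
3-4) Repeat steps 3-2 and 3-3 until all resume elements of Mi and Mj have been scanned and processed. Compute the matching degree mch(i,j) of Mi and Mj with the following formula:
mch(i,j) = Cr / Ct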
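The counters and the two lists can be sketched as below. The text does not specify how the rows of the two experience tables are paired, so this illustration assumes they are compared position by position; a real system might instead align rows by overlapping time. The field names and the function name `matching_degree` are hypothetical.

```python
BASIC_FIELDS = ("name", "gender", "ethnicity", "birthplace")
RECORD_FIELDS = ("time", "place", "unit", "title")


def matching_degree(mi, mj):
    """Return (mch, its, err) for two resumes given as dicts with 'basic' and 'records'."""
    ct = cr = 0
    its, err = [], []

    def compare(label, a, b):
        nonlocal ct, cr
        ct += 1                      # every comparison counts toward Ct
        if a == b:
            cr += 1                  # identical elements count toward Cr
            its.append((label, a))
        else:
            err.append((label, a, b))

    for f in BASIC_FIELDS:
        compare(f, mi["basic"][f], mj["basic"][f])
    # Assumption: experience rows are paired by position.
    for row, (ri, rj) in enumerate(zip(mi["records"], mj["records"])):
        for e in RECORD_FIELDS:
            compare(f"{e}[{row}]", ri[e], rj[e])
    return (cr / ct if ct else 0.0), its, err


A = {"basic": {"name": "A", "gender": "M", "ethnicity": "Han", "birthplace": "Beijing"},
     "records": [{"time": "1992-1995", "place": "X", "unit": "U", "title": "T1"}]}
B = {"basic": {"name": "B", "gender": "M", "ethnicity": "Han", "birthplace": "Shanghai"},
     "records": [{"time": "1992-1995", "place": "X", "unit": "U", "title": "T2"}]}
mch, its, err = matching_degree(A, B)
```

Here 5 of the 8 compared elements agree, so mch is 5/8, and the shared time, place and unit in `its` are the kind of intersection that step 5 interprets as a possible colleague relationship.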
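5. Organization generation module
The organization generation algorithm in this module extracts and restores the hierarchy of organizations from the group potential social relationships among multiple resumes, and provides the basis for the subsequent organization chart visualization algorithm. The algorithm specifically includes the following steps:
1) The potential social relationship matrix R of the resumes is given. R is output by the group potential social relationship mining module and has size n×n; its elements R11 to Rnn are the potential social relationships between resumes, and the element Rij is the potential social relationship between resume Mi and resume Mj.
2) Define the organization library V, which stores all organizations and their member information. The library is a list: <V1, V2, ..., Vm>. Each element Vi (i = 1, 2, ..., m) represents one organization, and m is the number of organizations. Each element is a tree whose root node is the "organization name" and whose leaf nodes are "member information". The specific structure of an element is: <organization name, <member 1, title 1, incumbent or not>, <member 2, title 2, incumbent or not>, ..., <member m, title m, incumbent or not>>.
3) Define a counter k (initial value zero).
4) Traverse R. If the resumes Mi and Mj represented by Rij share a unit, store that unit together with Mi and Mj in Vk and add 1 to k. Then store Vk in V; Vk is an element of V.
5) Repeat step 4 until R has been fully traversed. At that point the elements of V are the required organization information.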
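A minimal sketch of the traversal follows. It departs from the literal steps in one respect, which is an assumption on my part: instead of appending a new Vk for every pair, entries for the same organization name are merged, so each organization appears once with all of its members. Unit intersections are recomputed from the resumes rather than read from R, and the input layout and function name are hypothetical.

```python
from itertools import combinations


def build_organizations(resumes):
    """Return {organization name: [(member, title, incumbent), ...]} for shared units."""
    organizations = {}
    for a, b in combinations(resumes, 2):
        units_a = {unit for unit, _, _ in a["records"]}
        units_b = {unit for unit, _, _ in b["records"]}
        shared = units_a & units_b  # the unit intersection of this pair
        for person in (a, b):
            for unit, title, incumbent in person["records"]:
                if unit in shared:
                    entry = (person["name"], title, incumbent)
                    members = organizations.setdefault(unit, [])
                    if entry not in members:  # avoid duplicates across pairs
                        members.append(entry)
    return organizations


# Illustrative data: each record is (unit, title, incumbent).
RESUMES = [
    {"name": "A", "records": [("U1", "director", True)]},
    {"name": "B", "records": [("U1", "deputy", False), ("U2", "clerk", True)]},
    {"name": "C", "records": [("U2", "head", False)]},
]
ORGS = build_organizations(RESUMES)
```

Each dictionary entry corresponds to one tree in V: the key is the root ("organization name") and the tuples are the leaves ("member information").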
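6. Resume information visualization module
Based on information visualization, this module presents the resume information to the user in an intuitive way, so that the user can view it and understand it correctly. The module contains three visualization algorithms: the resume spatio-temporal trajectory visualization algorithm, the resume potential social network visualization algorithm and the resume organization visualization algorithm. Based on these three algorithms the following visualizations can be generated: the personal growth graph, the potential relationship graph and the organization chart.
6.1 Personal growth graph
As shown in FIG. 7, the personal growth graph is drawn with the resume spatio-temporal trajectory visualization algorithm. The algorithm uses the growth metaphor: by visualizing the growth trajectory sequence data, the generated spatio-temporal trajectory graph expresses otherwise abstract personal growth information intuitively as a time-space map. The specific steps of the algorithm are as follows:
1) Define the axes of the time-dimension trajectory visualization. The horizontal axis is the time axis, with two display modes, "calendar year" and "age"; the vertical axis is the level axis, representing the "quantization level" dimension of the growth trajectory sequence data (for cadres, levels such as "section level", "division level" and "department/bureau level"; for researchers, levels such as "intern researcher", "assistant researcher", "associate researcher", "full researcher" and "academician").
2) Define the axes of the spatial-dimension trajectory visualization. The horizontal axis is the time axis, with the same two display modes; the vertical axis is the spatial axis, which uses a two-dimensional map as the spatial reference system and represents spatial dimensions such as the "place" and "unit" of the growth trajectory sequence data.
3) Define how growth trajectory sequence data is visualized. The growth trajectory sequence data of a resume consists of a series of experience segments, each of which is the basic unit of the sequence data.
① Trajectory visualization of the time dimension: each experience segment is drawn as a horizontal rectangular block with fixed width, variable length and a color fill, which serves as its visual metaphor. The horizontal position of the block corresponds to the time axis, and its length along that axis represents the time interval of the experience segment (the left edge is the "start time" and the right edge is the "end time"). The vertical position of the block corresponds to the level axis and represents the "quantization level" of the experience segment. The blocks are connected by vertical lines in the chronological order of their experience segments, which forms the complete visual representation of the time-dimension growth trajectory. The time-dimension growth trajectories of different resumes are distinguished by the fill color of their rectangular blocks.
② Trajectory visualization of the spatial dimension: each experience segment is drawn as a circle with variable radius and a color fill, which serves as its visual metaphor. The position of the circle is mapped onto the two-dimensional map of the spatial axis and represents geographic information such as the "place" and "unit" of the experience segment. The circles are connected, in the chronological order of their experience segments, by color-filled directed arrows of variable width, which forms the complete visual representation of the spatial-dimension growth trajectory; the width of each arrow changes gradually from its start to its end and represents the change in "quantization level" between the experience segments (the width encodes how high the level is). The spatial-dimension growth trajectories of different resumes are distinguished by the fill color of the rectangular blocks they contain.
4) For the input growth trajectory sequence data, assign fill colors and draw the visualization according to the definitions in steps 1 to 3, obtaining the resume spatio-temporal growth trajectory graph.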
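The geometry of the time-dimension track can be computed independently of any drawing library. The sketch below is an illustration under assumptions: segments are (start year, end year, level) triples, coordinates are in data units (years on the horizontal axis, levels on the vertical axis), and the block thickness `bar_height` and all names are hypothetical choices.

```python
def layout_time_track(segments, bar_height=0.6):
    """Return the rectangles and vertical connectors of one time-dimension trajectory."""
    rectangles = [
        {"x": start, "y": level, "width": end - start, "height": bar_height}
        for start, end, level in segments
    ]
    # A vertical line joins each block to the next one at the moment of transition.
    connectors = [
        {"x": segments[k + 1][0], "y0": segments[k][2], "y1": segments[k + 1][2]}
        for k in range(len(segments) - 1)
    ]
    return rectangles, connectors


# First two segments of Table 2.
rectangles, connectors = layout_time_track([(1989, 1992, 0), (1992, 1995, 2)])
```

Returning plain dictionaries keeps the layout step testable on its own; a renderer can then draw the blocks with one fill color per resume, as the text requires.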
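6.2 Potential relationship graph
As shown in FIG. 8, the potential relationship graph is drawn with the potential social network visualization algorithm. The algorithm uses the mined potential relationships between resumes to build a visual representation of the resume social network; the resulting potential relationship graph expresses otherwise abstract relationships between resumes intuitively as a network diagram. The specific steps of the algorithm are as follows:
1) Define how a resume is visualized. Each resume is drawn as a rounded rectangle, which serves as its visual metaphor. The "name" from the resume's basic information is shown inside the rounded rectangle and serves as the rectangle ID; rectangles with different IDs represent different resumes.
2) Define how the potential relationships between resumes are visualized. According to the mining algorithm that produced them, the relationships fall into two categories:
① Similar growth trajectories. A line segment connecting two rounded rectangles indicates that the growth trajectories of the two resumes are similar to some degree. Similar growth trajectories mean that the growth experiences of the resumes are similar; for example, if persons A and B took a similar amount of time to advance from "division-level cadre" to "department/bureau-level cadre", the growth trajectories of A and B are similar. The length of the line segment encodes the degree of similarity: the shorter the segment (the smaller the distance between the two rectangles), the greater the similarity, and vice versa. The similarity between A and B is given by the similarity matrix sim mentioned in the group potential social relationship mining module.
② Intersecting resume elements. A line segment connecting two rounded rectangles indicates that the resumes share elements to some degree. An element intersection reflects an overlap between the persons the resumes represent, such as a classmate, hometown or colleague relationship.
3) For the input resume XML data and the mining results on that data, draw the visualization according to the definitions in steps 1 and 2, obtaining the potential relationship graph (see FIG. 8).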
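The text only states that a shorter line means greater similarity; it does not give the mapping. One simple choice, offered purely as an assumption, is a linear interpolation between a minimum and a maximum target length, which a layout algorithm could then try to honor. The pixel values and the function name are hypothetical.

```python
def edge_length(similarity, min_length=40.0, max_length=200.0):
    """Map a similarity in [0, 1] to a target line length; higher similarity is shorter."""
    s = max(0.0, min(1.0, similarity))  # clamp out-of-range values
    return min_length + (1.0 - s) * (max_length - min_length)
```

Under these defaults a similarity of 1 gives a length of 40, a similarity of 0 gives 200, and a similarity of 0.5 gives 120.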
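6.3 Organization chart
As shown in FIG. 9, the organization chart is drawn with the organization visualization algorithm. The algorithm extracts the unit intersections between resumes from the resume information, converts resumes that share a unit into the organizational relationships of that unit, and visualizes them as a table-style organization chart. The specific steps of the algorithm are as follows:
1) Define the header of the organization chart. The horizontal axis of the header is the personnel axis, which represents the members of the unit; the vertical axis of the table is the level axis, which represents the position levels of the unit. The level axis is arranged in descending order from top to bottom, that is, the higher the position level, the higher the row.
2) Define the table elements of the organization chart. Each table element is the avatar of the person a resume represents. The row of an element represents the position level of that resume in the unit, and the column of an element represents the person. A table element has two states: ① active (the avatar is in color), meaning that the unit and position of the element are the person's current state (for example, the person currently holds that position in the unit); ② inactive (the avatar is gray), meaning that the unit and position of the element are a past state of the person (for example, the person once held that position in the unit but no longer does).
3) For the input resume XML data, draw the visualization according to the definitions in steps 1 and 2, obtaining the organization chart of the corresponding unit.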
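The table behind the organization chart can be sketched as a grid with one row per position level (highest first) and one column per person. The input layout (name, title, level, incumbent) and the function name are assumptions for illustration; the avatar rendering itself is left out.

```python
def org_chart(members):
    """Return (levels, people, grid) where grid[level][person] is a cell state or None."""
    people = []
    for name, _title, _level, _incumbent in members:
        if name not in people:
            people.append(name)  # one column per person, in order of first appearance
    levels = sorted({level for _name, _title, level, _incumbent in members}, reverse=True)
    grid = {level: {person: None for person in people} for level in levels}
    for name, _title, level, incumbent in members:
        # Active cells are drawn in color, inactive (historical) cells in gray.
        grid[level][name] = "active" if incumbent else "inactive"
    return levels, people, grid


MEMBERS = [
    ("A", "director", 3, True),
    ("B", "deputy", 2, True),
    ("A", "deputy", 2, False),  # A held the deputy post before becoming director
]
levels, people, grid = org_chart(MEMBERS)
```

A person can occupy several cells in the same column, one per position held in the unit, which is how the chart shows both current and past posts at once.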
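7. Resume visual analysis module
This module introduces human-computer interaction into the visual analysis environment for resume data. Building on the mining modules and the resume information visualization module, it helps the user understand the potential information in a resume and the pattern features exhibited by a large number of resumes, and thereby gain deeper insight. The module specifically includes the following steps:
1) Statistical analysis of resume trajectory information. As shown in FIG. 10, based on the "quantization level" information in the growth trajectory sequence data, a chart of the time the person spent at each level is provided (the horizontal axis is "level" and the vertical axis is "time"). This statistical distribution presents the general pattern of personal growth to the user.
2) Linked spatio-temporal interactive analysis of resume trajectories. As shown in FIG. 11, based on the resume spatio-temporal growth trajectory graph, linked analysis is provided through human-computer interaction, so that the user can view trajectory changes jointly from the time and space perspectives and discover spatio-temporal trajectory patterns. Predicting the future direction of a resume trajectory from existing spatio-temporal trajectory patterns is also an important part of interactive visual analysis.
3) Visual pattern analysis of resume spatio-temporal trajectories. As shown in FIG. 12, based on the resume spatio-temporal growth trajectory graph, the user can discover the category patterns of different growth trajectories from a side-by-side display of multiple resumes, and so quickly find the trajectory categories of interest. For example, from the visualization of official promotions shown in FIG. 12 the user can perceive three stages of personal growth: the growth period (early career, fast promotion), the bottleneck period (mid-career, promotion stalls) and the breakthrough period (late career, the bottleneck is overcome and promotion continues). To keep the trajectory graph from becoming too complex during interaction and to make the visualization easier to understand, the visual analysis environment is defined as follows: at any one time the growth trajectories of at most 3 resumes can be compared in the trajectory graph, and the spatio-temporal trajectories of different resumes are slightly offset along their time and spatial axes, which reduces occlusion between trajectories without reducing visual precision.
4) Interactive visual analysis of the resume social network. As shown in FIG. 13, based on the group potential relationship graph, the user can select a target resume and the resumes potentially related to it to form a specific social network according to his or her interests. Interactive editing and viewing functions are provided on that social network to guide the user toward the important potential relationships.
5) Interactive resume information mining. On top of the mining modules, a human-computer interaction mechanism allows the user to bring expert knowledge and judgment into the mining process on the basis of the mining results (for example by modifying mining parameters or labeling resume categories). Iteratively correcting and refining the mining results helps the user understand the potential knowledge contained in the resumes and gain deeper insight.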

Claims (12)

  1. 一种基于文本履历信息的信息可视化方法,其步骤为:
    1)对每一文本履历信息中的经历信息,进行经历等级量化计算,得到成长轨迹序列数据,并将该数据进行可视化;
    2)选取多份文本履历信息的成长轨迹序列数据进行关联计算,得到文本履历间的潜在社交关系,并将该潜在社交关系进行社交网络可视化;
    3)基于履历间的潜在社交关系,构建人员所在单位的组织层级可视化表达,将具有单位交集的履历转化成相应单位的组织层级关系,并将该组织层级关系进行组织机构可视化。
  2. 如权利要求1所述的方法,其特征在于如果履历为非结构化文本履历,则首先将其转换为结构化的文本履历信息,其方法为:
    1)对非结构化文本履历进行格式过滤,获得包含履历信息的纯履历文本;
    2)利用自然语言处理技术对纯履历文本进行分词与命名实体识别,然后进行履历特征要素抽取,处理得到包含履历要素的结构化文本块;
    3)将包含履历要素的结构化文本块进行格式转化,形成结构化的文本履历信息。
  3. 如权利要求2所述的方法,其特征在于所述结构化的文本履历信息包括:履历基本信息和经历信息表;所述履历基本信息包括姓名、性别、民族和出生地,所述经历信息表为一个表结构,表头包含开始时间、终止时间、地点、单位、职务字段。
  4. 如权利要求3所述的方法,其特征在于对于单位履历特征要素,采用关键字匹配算法进行履历特征要素的抽取:首先创建一单位关键词词典,所述单位关键词词典中每一行元素包括关键字和辅助关键字两部分信息,其中,辅助关键字包括R型和L型两种,多个辅助关键字用逗号相隔;然后利用单位关键词词典进行单位要素识别:当识别到了词典中的某一关键字,且其右侧无R型辅助关键字,同时左侧无L型辅助关键字时,则识别成功;反之,识别失败;对于其他履历特征要素,采取正则表达式匹配法进行履历特征要素的抽取。
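  5. The method of claim 3, characterized in that the growth trajectory sequence data is obtained as follows:
    1) sorting the experience information table of each piece of text resume information in ascending order of the start time field to obtain an ordered experience information table;
    2) scanning the records of the ordered experience information table one by one, extracting the location, unit and position fields from each record, comparing each field value against an existing experience-rank quantification library, and assigning the preset quantified rank to each matched entity;
    3) assembling the set of experience segments carrying their respective ranks into an ordered sequence to obtain the growth trajectory sequence data.
  6. The method of claim 1 or 5, characterized in that the growth trajectory sequence data is a six-tuple, namely <start time, end time, location, unit, position, quantified rank>.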
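  7. The method of any one of claims 1 to 5, characterized in that the latent social relationships are obtained as follows:
    1) selecting the growth trajectory sequence data of n resumes and computing the similarity sim(i,j) between the growth trajectory sequence data of any two resumes Mi and Mj to obtain a similarity matrix sim;
    2) scanning the matrix sim; if sim(i,j)>s0, the growth trajectories of Mi and Mj are considered similar, where s0 is the similarity threshold;
    3) computing the matching degree mch(i,j) between any two resumes Mi and Mj in the growth trajectory sequence data of the n resumes, and recording the details of their shared experience in a resume element intersection its(i,j);
    4) determining from the matching degree mch(i,j) whether the growth experiences of Mi and Mj intersect; if so, determining the latent relationship between Mi and Mj from the corresponding intersection its(i,j), and determining the closeness between Mi and Mj from sim(i,j).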
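  8. The method of claim 7, characterized in that the matching degree mch(i,j) between any two resumes Mi and Mj in the growth trajectory sequence data of the n resumes is computed, and the details of their shared experience are recorded in a resume element intersection its(i,j), as follows:
    1) setting up two counters Ct and Cr, both initialized to 0: Ct is the number of element comparisons made between Mi and Mj; Cr is the number of those comparisons in which the elements were the same; defining a differing-element list err(i,j), whose entries are the resume elements that differ between Mi and Mj; and defining a resume element intersection list its(i,j), whose entries are the resume elements that Mi and Mj share;
    2) scanning the basic information elements of Mi and Mj item by item, incrementing Ct by 1 for each element scanned; for any element f, if its value is the same in Mi and Mj, incrementing Cr by 1 and adding f to its(i,j); otherwise, adding f to err(i,j);
    3) scanning the experience information tables of Mi and Mj row by row; for each experience segment (row), scanning its time, location, unit and position fields item by item, incrementing Ct by 1 for each field scanned; for any field e, if its value is the same in Mi and Mj, incrementing Cr by 1 and adding the element to its(i,j); otherwise, adding the element to err(i,j);
    4) computing the matching degree of Mi and Mj by the formula mch(i,j)=Cr/Ct.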
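  9. The method of any one of claims 1 to 5, characterized by an organization generation method that constructs the organizational hierarchy of the units to which the persons belong based on the latent social relationships between resumes, the method being:
    1) recording the latent social relationships as a matrix R, whose element Rij represents the latent social relationship between resume Mi and resume Mj;
    2) building an organization library V for storing all organizations and their member information; each element of the library is a tree whose root node is the organization name and whose leaf nodes are member information, with the structure: <organization name, <member 1, position 1, currently in office>, <member 2, position 2, currently in office>, …, <member m, position m, currently in office>>;
    3) traversing the matrix R; if resumes Mi and Mj represented by Rij share a unit, saving that unit together with resumes Mi and Mj to the organization library V;
    4) rendering all elements of V, according to the tree structure, using the organization chart visualization method.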
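  10. The method of claim 1 or 2, characterized in that each growth trajectory sequence is analyzed by type in the time dimension and the space dimension to obtain the spatio-temporal growth pattern of the corresponding text resume; the spatio-temporal growth pattern is obtained as follows: first, the growth types of a resume as it changes over time and the growth types of a resume as it moves through space are defined, and the features of each growth type are determined; the features of the growth types over time comprise a rank time-span feature and/or a temporal growth-slope feature, and the features of the growth types through space are determined from the geographic locations of the units in the resume; a portion of the growth trajectory sequence data is selected as sample data and labeled with growth types according to the determined features; a machine learning classifier is trained on the sample data to obtain the classifier model parameters, and the unlabeled growth trajectory sequence data is then classified and labeled.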
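  11. An intelligent visual analysis system based on text resume information, characterized by comprising a personal growth experience quantification module, a group latent social relationship mining module, an organization generation module and a resume information visualization module, wherein:
    the personal growth experience quantification module is configured to perform experience-rank quantification on the experience information among the resume elements to obtain growth trajectory sequence data;
    the group latent social relationship mining module is configured to perform correlation computation on the growth trajectory sequence data of multiple resumes to obtain the latent social relationships between resumes;
    the organization generation module is configured to extract and reconstruct the hierarchy of organizations from the group's shared-unit information, on the basis of the latent social relationships of the group represented by the multiple resumes;
    the resume information visualization module is configured to convert the growth trajectory sequence data of resumes, and the results output by the group latent social relationship mining module and the organization generation module, into information visualization charts.
  12. The system of claim 11, characterized in that the system further comprises a text resume preprocessing module and a personal growth pattern mining module; the text resume preprocessing module is configured to preprocess unstructured text resume data and extract the elements of the resume information to obtain structured resume element XML data; the personal growth pattern mining module is configured to analyze growth trajectory sequence data by type in the time dimension and the space dimension to obtain the spatio-temporal growth pattern of a resume.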
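
The counter-based matching degree of claim 8 can be illustrated with a short sketch. Several details here are assumptions, since the claim does not fix them: resumes are modeled as dictionaries with hypothetical keys `basic` and `rows`, the time of an experience segment is treated as a single field, and rows of the two experience tables are paired by position. The names `match_degree`, `BASIC_FIELDS` and `ROW_FIELDS` are invented for the example.

```python
BASIC_FIELDS = ("name", "gender", "ethnicity", "birthplace")
ROW_FIELDS = ("time", "location", "unit", "position")


def match_degree(mi, mj):
    """Return (mch, its, err) for two resumes following the Ct/Cr counting scheme."""
    ct = 0  # Ct: number of element comparisons performed
    cr = 0  # Cr: number of comparisons where both values were equal
    its = []  # shared elements
    err = []  # differing elements

    # Step 2: compare the basic information elements one by one.
    for f in BASIC_FIELDS:
        ct += 1
        if mi["basic"][f] == mj["basic"][f]:
            cr += 1
            its.append((f, mi["basic"][f]))
        else:
            err.append((f, mi["basic"][f], mj["basic"][f]))

    # Step 3: compare experience rows field by field.
    # Assumption: rows are paired by index; unpaired trailing rows are ignored.
    for k, (ri, rj) in enumerate(zip(mi["rows"], mj["rows"])):
        for e in ROW_FIELDS:
            ct += 1
            if ri[e] == rj[e]:
                cr += 1
                its.append((f"row{k}.{e}", ri[e]))
            else:
                err.append((f"row{k}.{e}", ri[e], rj[e]))

    # Step 4: mch(i,j) = Cr / Ct (guarding against an empty comparison).
    mch = cr / ct if ct else 0.0
    return mch, its, err


# Made-up sample resumes.
resume_a = {
    "basic": {"name": "A", "gender": "M", "ethnicity": "Han", "birthplace": "P"},
    "rows": [{"time": "2000-2003", "location": "City A", "unit": "Unit X", "position": "Clerk"}],
}
resume_b = {
    "basic": {"name": "B", "gender": "M", "ethnicity": "Han", "birthplace": "Q"},
    "rows": [{"time": "2000-2003", "location": "City A", "unit": "Unit X", "position": "Section chief"}],
}
mch, its, err = match_degree(resume_a, resume_b)
print(mch, its)
```

In this sample, the shared unit entry in `its` is the kind of detail claim 7 uses to decide that two people have a latent relationship, such as having been colleagues.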
PCT/CN2014/088601 2014-09-25 2014-10-15 Information visualization method and intelligent visual analysis system based on text curriculum vitae information WO2016045153A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US14/898,897 US20170200125A1 (en) 2014-09-25 2014-10-15 Information visualization method and intelligent visual analysis system based on text curriculum vitae information

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201410496047.1A CN104318340B (zh) 2014-09-25 2014-09-25 Information visualization method and intelligent visual analysis system based on text curriculum vitae information
CN201410496047.1 2014-09-25

Publications (1)

Publication Number Publication Date
WO2016045153A1 true WO2016045153A1 (zh) 2016-03-31

Family

ID=52373568

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2014/088601 WO2016045153A1 (zh) 2014-09-25 2014-10-15 Information visualization method and intelligent visual analysis system based on text curriculum vitae information

Country Status (3)

Country Link
US (1) US20170200125A1 (zh)
CN (1) CN104318340B (zh)
WO (1) WO2016045153A1 (zh)

Also Published As

Publication number Publication date
CN104318340B (zh) 2017-07-07
US20170200125A1 (en) 2017-07-13
CN104318340A (zh) 2015-01-28

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 14898897

Country of ref document: US

121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 14902523

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205N DATED 04.08.2017)

122 Ep: pct application non-entry in european phase

Ref document number: 14902523

Country of ref document: EP

Kind code of ref document: A1