CN115422427A - Employment skill requirement analysis system - Google Patents

Employment skill requirement analysis system

Info

Publication number
CN115422427A
Authority
CN
China
Prior art keywords
data
module
analysis
employment
recruitment
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211007443.4A
Other languages
Chinese (zh)
Inventor
李海波
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Open University of Jiangsu City Vocational College
Original Assignee
Jiangsu Open University of Jiangsu City Vocational College
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Open University of Jiangsu City Vocational College filed Critical Jiangsu Open University of Jiangsu City Vocational College
Priority to CN202211007443.4A
Publication of CN115422427A
Legal status: Pending

Classifications

    • G PHYSICS
      • G06 COMPUTING; CALCULATING OR COUNTING
        • G06F ELECTRIC DIGITAL DATA PROCESSING
          • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
            • G06F 16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
              • G06F 16/21 Design, administration or maintenance of databases
                • G06F 16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
              • G06F 16/26 Visual data mining; Browsing structured data
              • G06F 16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
              • G06F 16/28 Databases characterised by their database models, e.g. relational or object models
                • G06F 16/284 Relational databases
            • G06F 16/90 Details of database functions independent of the retrieved data types
              • G06F 16/95 Retrieval from the web
                • G06F 16/951 Indexing; Web crawling techniques
                • G06F 16/957 Browsing optimisation, e.g. caching or content distillation
        • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
          • G06Q 10/00 Administration; Management
            • G06Q 10/10 Office automation; Time management
              • G06Q 10/105 Human resources
                • G06Q 10/1053 Employment or hiring
                • G06Q 10/1057 Benefits or employee welfare, e.g. insurance, holiday or retirement packages

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Databases & Information Systems (AREA)
  • Human Resources & Organizations (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Strategic Management (AREA)
  • Quality & Reliability (AREA)
  • Economics (AREA)
  • Marketing (AREA)
  • Operations Research (AREA)
  • Tourism & Hospitality (AREA)
  • General Business, Economics & Management (AREA)
  • Computing Systems (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses an employment skill requirement analysis system comprising a data acquisition and storage module, a data preprocessing module, a data analysis module and a data visualization module. The data visualization module visually displays the analysis results stored in the MySQL database. The invention collects and analyzes massive recruitment position information from recruitment websites using big data technologies such as web crawlers, data mining and data visualization, helping job seekers learn in advance the skill requirements of positions related to their major, evaluate enterprises' positions and salary packages, learn the corresponding professional skills in a targeted manner, and prepare better for employment.

Description

Employment skill requirement analysis system
Technical Field
The invention belongs to the field of data analysis, and particularly relates to a college student employment skill demand analysis system based on big data technology.
Background
In recent years, as the number of graduates grows year by year, the employment situation of college graduates has become increasingly severe, and graduate employment has become a hot issue of general social concern. Meanwhile, with the development of Internet technology, more and more recruiting enterprises and job seekers publish information online. Thanks to its wide coverage, strong timeliness and low cost, online recruitment has gradually become the main channel for college graduates to seek jobs. However, Internet job-hunting platforms have proliferated, each carrying massive amounts of job information; many recruiting enterprises describe positions unclearly and lack key descriptions of job responsibilities and skill requirements, so job seekers find it difficult to apply in a targeted manner. Recruitment platforms cannot make reasonable recommendations based on a student job seeker's own situation, so college students can only sift through the massive information as if searching for a needle at the bottom of the sea. Big data technology has greatly enhanced the ability to acquire recruitment data; how to effectively analyze this massive data to help college students learn the skills required for employment in a targeted manner, and how to apply information technology to better serve student employment, are problems that employers, colleges and universities urgently need to solve.
Disclosure of Invention
The invention provides an employment skill requirement analysis system comprising a data acquisition and storage module, a data preprocessing module, a data analysis module and a data visualization module;
the data acquisition and storage module is used for acquiring a large amount of timely recruitment information data: it collects recruitment information from recruitment websites, analyzes the web page structure, runs a distributed crawler program and stores the acquired position data in a Hadoop distributed storage system;
the data preprocessing module determines the data analysis fields, runs a data preprocessing program, and cleans and converts the acquired complex, incomplete, duplicated and erroneous data into structured data, which is stored in the Hadoop distributed storage system;
the data analysis module establishes a Hive data warehouse, loads the preprocessed structured data, analyzes the data with Hive, refines the information valuable for employment and job hunting, and imports the analysis results into a MySQL database;
and the data visualization module visually displays the analysis results stored in the MySQL database.
Furthermore, the data acquisition and storage module adopts the Scrapy distributed crawler framework to crawl the recruitment website data; the specific steps are as follows:
S1, determining the crawling object;
S2, analyzing the web page structure;
and S3, writing the Scrapy crawler program.
Further, in step S1, the crawled content mainly includes the recruitment position, salary, work experience, educational requirements, company name, industry of the company, job responsibilities, job requirements and work address, and the massive position data obtained by crawling is stored in the Hadoop distributed storage system for data processing and analysis.
Further, in S2, the web page is analyzed to find the common structure shared by the information elements; the developer tools built into the browser can be used to conveniently analyze the web page structure, inspect the HTML code and examine the required page elements in preparation for writing the crawler program.
Further, in S3, the basic flow of the crawler is divided into initiating a request, obtaining the response content, parsing the content and saving the data; first, a request is sent to the target site over HTTP (Hypertext Transfer Protocol) and the server response is awaited; if the server responds normally, a response containing the content of the page to be acquired is obtained, and the response type includes one or more of HTML, JSON strings and binary data.
Furthermore, the data acquisition and storage module adopts a Hadoop distributed storage system comprising three nodes: one Master node, Master, and two Slave nodes, Slave1 and Slave2; JDK and Hadoop are installed on each node, and SSH passwordless login is configured between the master node and the slave nodes; the Master node mainly runs the NameNode and DataNode processes, while the Slave1 and Slave2 nodes mainly run the DataNode process.
Further, the data preprocessing module determines the dimensions of the data analysis fields, including one or more of industry, city, skill, salary and welfare.
Further, the data visualization module presents the analysis results as Web visualizations, covering the statistics of popular positions by region, the regional distribution of job-hunting positions, the comparison of regional position salary and welfare data, and the analysis of skills required by different positions.
Further, the data visualization module adopts a visualization display method including one or more of Flex, jQuery and ECharts.
A computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set that is loaded and executed by a processor to implement an employment skill requirement analysis system as described above.
The invention has the beneficial effects that:
the massive recruitment position information of the recruitment website is collected and analyzed by using big data technologies such as web crawlers, data mining, data visualization and the like, so that college students can know the position skill requirements of learned specialties in advance, the positions and salary treatment of enterprises and the like are evaluated, the corresponding professional skills are learned in a targeted manner, and the preparation for employment is better carried out.
Drawings
FIG. 1 is an architecture diagram of the college student employment skill requirement analysis system;
FIG. 2 is a diagram of a cluster structure and IP distribution;
FIG. 3 is a MapReduce data preprocessing process;
FIG. 4 is a data warehouse model;
FIG. 5 is a visualization module architecture diagram;
FIG. 6 is an industry demand distribution diagram of a recruitment position in an embodiment;
FIG. 7 is a distribution of cities where recruitment positions are located in an embodiment;
FIG. 8 is a cloud diagram of Java skill tag words in an embodiment;
FIG. 9 is a cloud diagram of Android skill tag words in an embodiment;
FIG. 10 is a post salary proportion diagram;
fig. 11 is a cloud diagram of welfare label words.
Detailed Description
The system acquires massive recruitment data from enterprise recruitment websites using crawler technology, stores it in a Hadoop cluster, performs data preprocessing, analyzes the relevant dimensions of enterprise recruitment positions using data mining technology, and finally displays the analysis results visually using visualization tools.
As shown in fig. 1, the system mainly covers the acquisition, storage, processing, analysis and visualization of recruitment information, and can be divided into four modules. (1) Data acquisition and storage module. To obtain a large amount of timely recruitment information data, the Scrapy crawler framework is adopted to collect recruitment information from popular recruitment websites; the web page structure is analyzed, a distributed crawler program is written, and the obtained position data is stored in the Hadoop distributed file system (HDFS). (2) Data preprocessing module. The data analysis fields are determined along five dimensions: industry, city, skill, salary and welfare; a data preprocessing program is written to clean and convert the acquired complex, heterogeneous, incomplete, duplicated and erroneous data, and the resulting structured data is stored in HDFS. (3) Data analysis module. A Hive data warehouse is established, the preprocessed structured data is loaded, the data is analyzed along the five dimensions with Hive, information valuable for college students' employment and job hunting is refined, and the analysis results are imported into a MySQL database. (4) Data visualization module. Using technologies such as Flex, jQuery and ECharts, the data in the MySQL database is displayed visually on the Web, covering the statistics of popular positions by region, the regional distribution of positions, the comparison of regional position salary and welfare data, and the analysis of skills required by different positions, thereby supporting more effective decisions in college students' employment analysis.
Data acquisition of position information from recruitment websites is implemented with the Scrapy crawler framework. Because the amount of acquired data is huge and keeps growing, a distributed cluster is needed to store and process it. The system uses an HDFS-based Hadoop cluster for distributed storage, reads and processes the HDFS data with the MapReduce computing framework, and stores the processing results in a Hive data warehouse. The data stored in the distributed file system is analyzed with Hive and exported from HDFS to a MySQL relational database with the Sqoop tool, which facilitates the visual display of the data. Data visualization is implemented with JavaWeb technology and the front-end ECharts visualization library.
As shown in fig. 2, Hadoop makes it convenient to manage the distributed cluster, store massive data across the cluster, and process the data with distributed parallel programs. The distributed cluster environment of the system comprises three nodes: one Master node, Master, and two Slave nodes, Slave1 and Slave2. JDK and Hadoop are installed on each node, and SSH passwordless login is configured between the master node and the slave nodes. The Master node mainly runs the NameNode and DataNode processes, while the Slave1 and Slave2 nodes mainly run the DataNode process.
The data acquisition and storage module:
by using the web crawler technology, a large amount of webpage information can be quickly and accurately acquired, and real-time data updating is realized. Because the amount of data required to be acquired by the system is huge, if a common crawler technology is used, not only is the crawling efficiency low, but also the IP address is possibly sealed off due to a reverse crawling mechanism of a website, so that the acquired data is incomplete. Therefore, the script distributed crawler framework is selected to realize the crawling of the data of the recruitment website. The Scapy framework is an application framework for crawling a Web site and extracting structured data based on Python, can be used for data mining, data monitoring, automatic testing and the like, and has the characteristics of simple structure, strong flexibility, high efficiency and rapidness.
(1) Determining crawled objects
At present, the major recruitment websites host a large amount of recruitment information. The system crawls the job section of a recruitment website; the crawled content mainly includes the recruitment position, salary, work experience, educational requirements, company name, industry of the company, job responsibilities, job requirements, work address and so on, and the massive position data obtained by crawling is stored in HDFS for data processing and analysis.
(2) Analyzing web page structure
In essence, a crawler simulates a person visiting a web page, without operating through the web interface. Therefore, for the crawler, examining the web page structure is the most critical step: the web page must be analyzed to find the common structure shared by the information elements. The developer tools built into the browser can be used to conveniently analyze the web page structure, inspect the HTML code and examine the required page elements in preparation for writing the crawler program.
(3) Writing the Scrapy crawler program
The basic flow of the crawler is divided into initiating a request, obtaining the response content, parsing the content and saving the data. First, a request is sent to the target site over HTTP and the server response is awaited. If the server responds normally, a response containing the content of the page to be acquired is obtained; its type may be HTML, a JSON string, binary data and so on. The data can be saved in many forms: text can be stored as plain files, in a database, or in the Hadoop cluster.
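As a minimal illustration of this request, response, parse and save flow (independent of the Scrapy framework used below), the following sketch uses the Python requests library; the URL, query parameters and output file are placeholders, not the recruitment site actually crawled by the system.

    # Minimal sketch of the request/response/parse/save flow described above;
    # the URL, parameters and output path are illustrative assumptions.
    import json
    import requests

    def fetch_job_list(url, params=None):
        """Initiate an HTTP request and return the parsed response content."""
        resp = requests.get(url, params=params, timeout=10)
        resp.raise_for_status()                      # only proceed on a normal response
        ctype = resp.headers.get("Content-Type", "")
        if "json" in ctype:                          # JSON string response
            return resp.json()
        return resp.text                             # HTML (or other text) response

    if __name__ == "__main__":
        data = fetch_job_list("https://example-recruit-site.com/api/jobs", {"kw": "Java"})
        with open("raw_jobs.json", "w", encoding="utf-8") as f:
            json.dump(data, f, ensure_ascii=False)   # save locally for later processing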
Recruitment information crawling with Scrapy mainly involves the data field file items.py, the configuration file settings.py, the spider main program file job.py and the pipeline file pipelines.py.
The data field file defines the data structure of the crawled information according to the analysis of the web page data to be crawled, and Scrapy establishes the corresponding data fields, such as job_name for the position name and salary for the salary.
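A minimal sketch of such an item definition is given below; apart from job_name and salary, the field names are assumptions inferred from the crawled content listed earlier, not the exact names used in the original project.

    # items.py - sketch of the data field definitions (field names partly assumed)
    import scrapy

    class JobItem(scrapy.Item):
        job_name = scrapy.Field()      # recruitment position name
        company_name = scrapy.Field()  # company name
        city = scrapy.Field()          # city where the position is located
        industry = scrapy.Field()      # industry of the company
        education = scrapy.Field()     # educational requirement
        experience = scrapy.Field()    # work experience requirement
        salary = scrapy.Field()        # salary range
        welfare = scrapy.Field()       # welfare tags
        skill = scrapy.Field()         # skill requirements
        address = scrapy.Field()       # work address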
The configuration file is the global configuration file of the crawler project and the hub of Scrapy; requests, responses and data can only flow between Scrapy's modules after they are configured in settings. Important items such as whether to follow the Robots protocol, the maximum concurrency, the download delay, the request headers, the maximum crawling depth, the user agent and the de-duplication method can all be configured in this module.
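A sketch of such a settings.py follows; only the configuration items mirror those listed above, while the concrete values (and the project name jobspider) are illustrative assumptions.

    # settings.py - sketch of the global crawler configuration (values assumed)
    BOT_NAME = "jobspider"

    ROBOTSTXT_OBEY = False          # whether to follow the Robots protocol
    CONCURRENT_REQUESTS = 16        # maximum concurrency
    DOWNLOAD_DELAY = 1              # download delay (seconds) between requests
    DEPTH_LIMIT = 3                 # maximum crawling depth
    DEFAULT_REQUEST_HEADERS = {     # request headers
        "Accept": "text/html,application/json",
        "Accept-Language": "zh-CN,zh;q=0.9",
    }
    USER_AGENT = "Mozilla/5.0 (Windows NT 10.0; Win64; x64)"  # user agent
    DUPEFILTER_CLASS = "scrapy.dupefilters.RFPDupeFilter"     # de-duplication method
    ITEM_PIPELINES = {"jobspider.pipelines.HdfsPipeline": 300}

A modest download delay together with a realistic user agent also reduces the chance of triggering the anti-crawling mechanism mentioned earlier.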
The spider main program module is the core of data crawling. start_requests sends a request to the given initial URL: the scrapy.http.FormRequest method requests the specific website and designates the function parseLs to be invoked next. parseLs mainly parses the outer listing page data: the core data to be crawled is located with the developer tools, the XPath path of each recruitment position entry is determined, and the position name obtained via XPath is stored into item['job_name']. Some fields are not in the current page but in an inner detail page, so a jump from the current page is needed: meta is added to the request and carried over to the response, and the detail page request is then submitted to the next function, parse. parse both returns items and generates new requests. Scrapy processes the results produced by the parse methods one by one: requests are added to the crawl queue, items are handed to the pipeline, and other types return an error.
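The sketch below illustrates this spider structure only; the site URL, form data and XPath expressions are assumptions for illustration, and response.follow is used here as a convenient way to issue the detail-page request carrying meta.

    # job.py - sketch of the spider flow (start_requests -> parseLs -> parse)
    import scrapy
    from jobspider.items import JobItem   # "jobspider" project name is an assumption

    class JobSpider(scrapy.Spider):
        name = "job"

        def start_requests(self):
            # send a request to the given initial URL
            yield scrapy.http.FormRequest(
                "https://example-recruit-site.com/search",    # placeholder URL
                formdata={"keyword": "Java", "page": "1"},    # assumed form fields
                callback=self.parseLs,
            )

        def parseLs(self, response):
            # parse the outer listing page; XPath expressions are assumptions
            for row in response.xpath('//div[@class="job-list"]/div'):
                item = JobItem()
                item["job_name"] = row.xpath('.//a/text()').get()
                detail_url = row.xpath('.//a/@href').get()
                # carry the partly filled item to the detail page via meta
                yield response.follow(detail_url, callback=self.parse,
                                      meta={"item": item})

        def parse(self, response):
            # the detail page holds salary, skills, etc.; meta carries the item over
            item = response.meta["item"]
            item["salary"] = response.xpath('//span[@class="salary"]/text()').get()
            item["skill"] = response.xpath('//div[@class="skill"]//li/text()').getall()
            yield item   # handed on to the item pipeline for storage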
The pipeline file pipelines.py implements data storage: after the data is crawled, storage and post-processing are handled by the pipelines. The pipeline decides whether an item continues through the remaining pipelines or is discarded without further processing. In the pipeline module, the Python module pyhdfs can be used to access HDFS directly and store the captured data. The HdfsClient class in pyhdfs connects to the NameNode of the Hadoop cluster and can query, read and write files on HDFS.
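A sketch of such a pipeline is given below; the NameNode address (a WebHDFS port is assumed), the HDFS user name and the output path are assumptions rather than the values used in the original project.

    # pipelines.py - sketch of an HDFS storage pipeline using pyhdfs
    import json
    import pyhdfs

    class HdfsPipeline:
        def open_spider(self, spider):
            # connect to the Hadoop NameNode (host and WebHDFS port assumed)
            self.client = pyhdfs.HdfsClient(hosts="master:9870", user_name="hadoop")
            self.path = "/jobdata/raw_jobs.json"
            if not self.client.exists(self.path):
                self.client.create(self.path, b"")   # create an empty target file

        def process_item(self, item, spider):
            line = json.dumps(dict(item), ensure_ascii=False) + "\n"
            self.client.append(self.path, line.encode("utf-8"))  # append one record
            return item   # let the item continue through any later pipelines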
Finally, the crawler file is executed, the output format is specified as csv, and the raw data of the crawled recruitment information is saved to HDFS (for example hdfs://master:9000).
A data preprocessing module:
The web page data acquired by the web crawler comes from a wide range of sources and has many data types, with problems such as missing data, duplicated data, redundancy, inconsistency and disordered structure. Analyzing the raw data directly would seriously reduce the efficiency of data-driven decisions and could even lead to wrong decisions. Therefore, preprocessing the raw data is a key link in big data analysis: the data must be cleaned and converted, and missing values, abnormal values and duplicate values must be deleted or filled, so as to obtain standard, clean and consistent data for Hive to count and analyze.
(1) Designing the preprocessing scheme
As shown in fig. 3, the collected data is first reviewed. The content of the data file is formatted with a JSON formatting tool, and the stored fields of the position information are checked. By examining the data structure, the fields for data analysis are determined according to the analysis dimensions.
The college student employment skill demand analysis system mainly examines the industry distribution, city distribution, salary, welfare and required skills of recruitment positions. Data for these five aspects are extracted from the acquired heterogeneous data and go through deletion, filling, merging and formatting to finally obtain structured data. Since the acquired data is stored on HDFS, MapReduce distributed parallel computation can be used to process the massive data and convert the raw data into structured target data.
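The cleaning and conversion rules themselves can be summarised independently of the execution framework. The fragment below is a plain-Python sketch of those rules only, not the MapReduce program described in the next subsection; the field names and raw formats are assumptions for illustration.

    # Sketch of the cleaning/conversion rules (field names and formats assumed)
    def clean_record(raw: dict):
        """Return a record limited to the five analysis dimensions,
        or None if the record is incomplete and should be dropped."""
        required = ("industry", "city", "salary", "welfare", "skill")
        if any(not raw.get(k) for k in required):
            return None                              # delete records with missing values
        return {
            "industry": raw["industry"].strip(),
            "city": raw["city"].strip(),
            "salary": raw["salary"].replace("K", "k").strip(),   # unify salary format
            "welfare": [w.strip() for w in raw["welfare"].split(",") if w.strip()],
            "skill": [s.strip().lower() for s in raw["skill"].split(",") if s.strip()],
        }

    def deduplicate(records):
        # remove repeated postings crawled more than once
        seen, out = set(), []
        for r in records:
            key = (r["city"], r["industry"], r["salary"], tuple(r["skill"]))
            if key not in seen:
                seen.add(key)
                out.append(r)
        return out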
(2) Writing MapReduce program
First, a Maven project is created; the Hadoop cluster configuration files core-site.xml, hdfs-site.xml and mapred-site.xml are downloaded with a remote connection tool and copied into the src/main/resources directory of the Maven project. An analysis-dimension KPI class is created to encapsulate each part of the recruitment information extracted from HDFS. A MapReduce class is then written to input, preprocess and output the recruitment information, with the address of the cluster Master node used as the target address.
To make it easy to run the recruitment information preprocessing program automatically, the preprocessing project is exported as an executable jar package and uploaded to the Hadoop cluster via ssh for deployment. The preprocessed data file can be viewed on the Hadoop platform with the command hdfs dfs -cat /clearjobs/part-r-00000.
A data analysis module:
data analysis is the most important link in a big data value chain, and the aim of the data analysis is to extract hidden data in the data and provide meaningful suggestions to assist in making correct decisions.
Here the preprocessed recruitment data is analyzed with Hive, which is based on the distributed file system. Hive is a data warehouse built on the Hadoop distributed file system; it provides a series of tools that can extract, transform and load data stored in HDFS, and it can store, query and analyze large-scale data stored in Hadoop.
(1) Design Hive data warehouse
As shown in fig. 4, the position data analysis module of the recruitment website needs to analyze five dimensions: industry, city, salary, welfare and skill. The Hive data warehouse is therefore designed as a star model consisting of one fact table and 5 dimension tables.
Fact table ods_jobdata_origin: mainly stores the data cleaned by the MapReduce computing framework; its table structure is shown in Table 1. Dimension table t_index_detail: mainly stores the data of the industry distribution analysis; its table structure is shown in Table 2. Dimension table t_city_detail: mainly stores the data of the city distribution analysis; its table structure is shown in Table 3. Dimension table t_salary_detail: mainly stores the data of the salary distribution analysis; its table structure is shown in Table 4. Dimension table t_welfare_detail: mainly stores the data of the welfare tag analysis; its table structure is shown in Table 5. Dimension table t_skill_detail: mainly stores the data of the skill tag analysis; its table structure is shown in Table 6.
TABLE 1 Fact table ods_jobdata_origin
Field          Data type   Remarks
job_name       String      Job name
company_name   String      Company name
address        String      Work address
city           String      City of the workplace
industry       String      Industry of the company
education      String      Educational requirement
experience     String      Work experience requirement
salary         String      Post salary
welfare        String      Post welfare
skill          String      Job skills

TABLE 2 Dimension table t_index_detail
Field      Data type   Remarks
industry   String      Industry
count      int         Industry frequency

TABLE 3 Dimension table t_city_detail
Field   Data type   Remarks
city    String      City
count   int         City frequency

TABLE 4 Dimension table t_salary_detail
Field    Data type   Remarks
salary   String      Salary distribution interval
count    int         Salary frequency

TABLE 5 Dimension table t_welfare_detail
Field     Data type   Remarks
welfare   String      Welfare tag
count     int         Welfare tag frequency

TABLE 6 Dimension table t_skill_detail
(Table structure provided as images in the original publication.)
(2) Recruitment data dimensional analysis
The preprocessed data on HDFS is imported into the fact table, each dimension is analyzed with Hive HQL query statements, the analysis results are stored in the corresponding dimension tables, and finally the tables are queried to obtain the analysis results. For example, the position distribution by city is saved into the city dimension table t_city_detail:
hive (jobdata)> insert overwrite table t_city_detail select city, count(1) from ods_jobdata_origin group by city;
The data of the position skill requirements is saved into the table t_skill_detail:
hive (jobdata)> insert overwrite table t_skill_detail select skill, count(1) from (select explode(skill) as skill from ods_jobdata_origin) as t_skill group by skill;
Partition grouping statistics are performed according to the lowest salary, and the position salary analysis data is saved into the salary dimension table t_salary_detail. The analysis data of the company welfare is saved into the welfare dimension table t_welfare_detail.
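The HQL statements above are issued in the Hive CLI; purely as an illustration, the same analysis can also be driven from a Python client. The sketch below assumes a HiveServer2 service on the Master node and the PyHive library, neither of which is specified in the original description.

    # Sketch of running the dimension analysis from Python via PyHive
    # (HiveServer2 host/port and the PyHive dependency are assumptions).
    from pyhive import hive

    conn = hive.Connection(host="master", port=10000, database="jobdata")
    cur = conn.cursor()
    # rebuild the city dimension table, mirroring the HQL shown above
    cur.execute(
        "insert overwrite table t_city_detail "
        "select city, count(1) from ods_jobdata_origin group by city"
    )
    # read back the analysis result
    cur.execute("select * from t_city_detail")
    for city, cnt in cur.fetchall():
        print(city, cnt)
    cur.close()
    conn.close()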
A data visualization module:
As shown in fig. 5, data visualization is a graphical means of communicating information clearly and efficiently. The position analysis visualization module of the recruitment website is built on JavaWeb: the back-end functions are implemented with the SSM (Spring MVC, Spring Boot and MyBatis) framework, the front end uses Flex and ECharts in JSP for the visual display, and the data interaction between the front end and the back end is implemented through Spring MVC and Ajax.
In fig. 5, ECharts is a powerful and highly compatible visualization library that provides rich chart types such as bar charts, line charts, pie charts, scatter charts and word cloud charts showing the frequency of data, and supports displays of tens of millions of data points. The Flex layout makes it simple to implement complete, responsive page layouts.
The service logic for visualizing the analyzed recruitment data is simple: the data only needs to be read from the database, converted into JSON and exposed through an access interface. Ajax is used for asynchronous requests, the data is sent to the front end for display, and the displayed results provide a reference for analyzing college students' employment skills and employment choices.
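Purely as an illustration of this "query MySQL, convert to JSON, expose an interface" flow (the system itself implements it in the JavaWeb/SSM back end), a sketch in Python using pymysql is given below; the connection parameters are assumptions.

    # Illustrative only: the actual back end is JavaWeb/SSM; this sketch shows
    # the same data flow (read the MySQL analysis result, return JSON).
    import json
    import pymysql

    def city_distribution_json():
        conn = pymysql.connect(host="master", user="root", password="secret",
                               database="jobdata", charset="utf8mb4")
        try:
            with conn.cursor() as cur:
                cur.execute("SELECT city, `count` FROM t_city_detail "
                            "ORDER BY `count` DESC")
                rows = [{"city": c, "count": n} for c, n in cur.fetchall()]
        finally:
            conn.close()
        return json.dumps(rows, ensure_ascii=False)  # sent to the front end via Ajax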
From the industry demand distribution of recruitment positions in fig. 6, it can be seen that recruitment demand in the real estate industry and the computer software industry is far higher than in other industries, followed by industries such as the Internet, electronic technology and pharmaceuticals/bioengineering; recruitment demand in the remaining industries is comparable. College students can take this demand into account when choosing an industry for employment.
From the distribution of cities where recruitment positions are located in fig. 7, it can be seen that a large share of recruitment demand is concentrated in large cities: the first-tier cities Shanghai, Shenzhen and Guangzhou rank in the top three, followed by second-tier cities such as Chengdu, Wuhan and Hangzhou, which shows that these cities have a higher demand for talent. It can also be seen that Beijing's recruitment demand is only about one third of that of the other first-tier cities.
From the skill tag word clouds in fig. 8 and fig. 9, it can be seen which skills must be mastered for Java development and Android development, and how important each skill is to the position, which serves as an important reference for college students learning professional skills. Besides technical ability, enterprise recruitment also values development experience and communication and coordination skills, so college students should take part in project practice more than once while at school and improve their communication and organizational coordination abilities.
From the post salary proportions in fig. 10, it can be seen that the monthly salary distribution of all positions is mainly concentrated between 4k and 24k, and the spread across intervals is large. The 8k-12k monthly salary interval is the largest, accounting for 34.53%; the 4k-8k and 12k-24k intervals are comparable, accounting for 28.29% and 27.51% respectively; monthly salaries below 4k and above 40k are rare. The data show that the post salaries offered by enterprises in recruitment are consistent with the actual situation.
From the welfare tag word cloud in fig. 11, it can be seen that the welfare policies offered by enterprises during recruitment mainly include the five social insurances and one housing fund, performance bonuses, year-end bonuses, regular physical examinations, professional training, employee travel and the like. The frequently occurring welfare tags can be regarded as the standard that most companies offer employees, and college students can use them as a reference when choosing an employer.
In conclusion, online recruitment platforms are the main channel for college students' employment and job hunting; collecting and analyzing the massive data generated during recruitment on these platforms with big data technology lets college students understand the recruitment situation more clearly and obtain valuable employment skill information. The employment skill demand analysis system provided by the invention uses the Scrapy crawler framework to collect website information; based on a Hadoop cluster environment, it preprocesses the collected raw data with MapReduce, analyzes five dimensions of the recruitment positions (skill frequency, industry distribution, city distribution, salary and welfare frequency) with Hive, and visualizes the analysis results on the Web with tools such as ECharts.
The above embodiments only illustrate the technical idea of the present invention and do not limit its protection scope; any modification made to the technical scheme on the basis of the technical idea of the present invention falls within the protection scope of the present invention.

Claims (10)

1. An employment skill demand analysis system, characterized by comprising a data acquisition and storage module, a data preprocessing module, a data analysis module and a data visualization module;
the data acquisition and storage module, in order to acquire a large amount of timely recruitment information data, collects recruitment information from recruitment websites, analyzes the web page structure, runs a distributed crawler program and stores the acquired position data in a Hadoop distributed storage system;
the data preprocessing module determines the data analysis fields, runs a data preprocessing program, and cleans and converts the acquired complex, incomplete, duplicated and erroneous data into structured data, which is stored in the Hadoop distributed storage system;
the data analysis module establishes a Hive data warehouse, loads the preprocessed structured data, analyzes the data with Hive, refines the information valuable for employment and job hunting, and imports the analysis results into a MySQL database;
and the data visualization module visually displays the analysis results stored in the MySQL database.
2. The employment skill requirement analysis system of claim 1, wherein the data acquisition and storage module adopts the Scrapy distributed crawler framework to crawl the recruitment website data; the specific steps are as follows:
S1, determining the crawling object;
S2, analyzing the web page structure;
and S3, writing the Scrapy crawler program.
3. The employment skill requirement analysis system of claim 2, wherein in S1, the crawled content mainly comprises the recruitment position, salary, work experience, educational requirements, company name, industry, job responsibilities, job requirements and work address, and the massive position data obtained by crawling is stored in the Hadoop distributed storage system for data processing and analysis.
4. The employment skill requirement analysis system of claim 2, wherein in S2, the web page is analyzed to find the common structure shared by the information elements; the developer tools built into the browser can be used to conveniently analyze the web page structure, inspect the HTML code and examine the required page elements in preparation for writing the crawler program.
5. The employment skill requirement analysis system according to claim 2, wherein in S3, the basic flow of the crawler is divided into initiating a request, obtaining the response content, parsing the content and saving the data; first, a request is sent to the target site over HTTP and the server response is awaited; if the server responds normally, a response containing the content of the page to be acquired is obtained, and the response type includes one or more of HTML, JSON strings and binary data.
6. The employment skill requirement analysis system of claim 1, wherein the data acquisition and storage module adopts a Hadoop distributed storage system comprising three nodes: one Master node, Master, and two Slave nodes, Slave1 and Slave2; JDK and Hadoop are installed on each node, and SSH passwordless login is configured between the master node and the slave nodes; the Master node mainly runs the NameNode and DataNode processes, while the Slave1 and Slave2 nodes mainly run the DataNode process.
7. The employment skill requirement analysis system according to claim 1, wherein the data preprocessing module determines the dimensions of the data analysis fields, including one or more of industry, city, skill, salary and welfare.
8. The employment skill requirement analysis system of claim 1, wherein the data visualization module presents the analysis results as Web visualizations, covering the statistics of popular positions by region, the regional distribution of job-hunting positions, the comparison of regional position salary and welfare data, and the analysis of skills required by different positions.
9. The employment skill requirement analysis system of claim 8, wherein the data visualization module adopts a visualization display method comprising one or more of Flex, jQuery and ECharts.
10. A computer readable storage medium having stored therein at least one instruction, at least one program, code set, or instruction set that is loaded and executed by a processor to implement the employment skill requirement analysis system as claimed in any one of claims 1-9.
CN202211007443.4A 2022-08-22 2022-08-22 Employment skill requirement analysis system Pending CN115422427A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211007443.4A CN115422427A (en) 2022-08-22 2022-08-22 Employment skill requirement analysis system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211007443.4A CN115422427A (en) 2022-08-22 2022-08-22 Employment skill requirement analysis system

Publications (1)

Publication Number Publication Date
CN115422427A true CN115422427A (en) 2022-12-02

Family

ID=84198237

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211007443.4A Pending CN115422427A (en) 2022-08-22 2022-08-22 Employment skill requirement analysis system

Country Status (1)

Country Link
CN (1) CN115422427A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117455434A (en) * 2023-11-08 2024-01-26 广州番禺职业技术学院 Employment situation analysis system based on big data


Similar Documents

Publication Publication Date Title
Brito et al. Migrating to GraphQL: A practical assessment
de Oca et al. A systematic literature review of studies on business process modeling quality
US10282197B2 (en) Open application lifecycle management framework
CN105243159A (en) Visual script editor-based distributed web crawler system
US20130166563A1 (en) Integration of Text Analysis and Search Functionality
CN105095067A (en) User interface element object identification and automatic test method and apparatus
CN108074033A (en) Processing method, system, electronic equipment and the storage medium of achievement data
Fox City data: Big, open and linked
Muslim et al. A modular and extensible framework for open learning analytics
Pérez‐Castillo et al. ArchiRev—Reverse engineering of information systems toward ArchiMate models. An industrial case study
CN115422427A (en) Employment skill requirement analysis system
Werneck et al. A reproducible POI recommendation framework: Works mapping and benchmark evaluation
US20200167393A1 (en) Method for automatically generating a wrapper for extracting web data, and a computer system
US8396847B2 (en) System and method to retrieve and analyze data for decision making
Tizzei et al. On the maintenance of a scientific application based on microservices: an experience report
Correa et al. A deep search method to survey data portals in the whole web: toward a machine learning classification model
Borg et al. Analyzing networks of issue reports
Baquero et al. A Framework to Support Business Process Analytics.
CN113901034A (en) Method for automatically identifying administrative non-complaint execution case source
Chuprina et al. A way how to impart data science skills to computer science students exemplified by obda-systems development
El-Sheikh et al. Towards enhanced program comprehension for service oriented architecture (SOA) systems
Kasegn et al. Spatial locality based identifier name recommendation
Butt et al. A systematic metadata harvesting workflow for analysing scientific networks
Edwards Visualization of Data Flow Graphs for In-Situ Analysis
El Mhouti et al. A Web Scraping Framework for Descriptive Analysis of Meteorological Big Data for Decision-Making Purposes

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination