CN113269249A

CN113269249A - Multi-data-source portrait construction method based on deep learning

Info

Publication number: CN113269249A
Application number: CN202110569379.8A
Authority: CN
Inventors: 何康健; 刘兰; 黄志豪; 王鹏铖; 刘浪洲
Original assignee: Guangdong Polytechnic Normal University
Current assignee: Guangdong Polytechnic Normal University
Priority date: 2021-05-25
Filing date: 2021-05-25
Publication date: 2021-08-17

Abstract

The invention provides a multi-data-source portrait construction method based on deep learning, comprising the following steps: S1: constructing a multi-source data acquisition engine and using it to collect user information from a plurality of data sources; S2: parsing the user information step by step, from shallow semantic parsing to deep association analysis; S3: matching corresponding labels to the parsed user information to obtain person label data; S4: collecting a recruitment post data set from internet recruitment websites and training a post model with a machine learning algorithm to obtain a trained post model; S5: constructing a person portrait by combining the person label data and the post model through a deep learning algorithm, so as to obtain a person portrait that fits the real behavior of the user without losing individual differences. The method solves the problem that existing portrait technology cannot effectively reflect the real situation of a user.

Description

Multi-data-source portrait construction method based on deep learning

Technical Field

The invention relates to the technical field of artificial intelligence, in particular to a multi-data-source portrait construction method based on deep learning.

Background

At present, many online recruitment platforms and professional human resources websites let job seekers publish their personal job-hunting information free of charge, which offers good opportunities to a large number of job seekers. Recruiting enterprises, in turn, can browse the information of thousands of job seekers online at low cost or even at no cost. The advantages of selecting talent from internet talent pools are that the amount of available information is large and the range of choice is wide; moreover, as these networks have developed, job seekers have been systematically classified by industry, position and profession, so that recruiting enterprises can search directly within the matching category for the talent they need. However, the large amount of information also means a heavy workload for recruiters, who need a great deal of time to pick suitable candidates out of thousands of job-seeker records. To find a good job, every job seeker tries to present themselves as perfectly as possible in their online resume, which makes it difficult for the recruiter to gain a complete and accurate understanding of the applicant. Job seekers who all appear excellent on their online resumes sometimes reveal many shortcomings in the interview, so recruitment resources are wasted.

Person portrait generation technology can increase the rate of effective information exchange in job hunting and recruitment by generating portraits, thereby improving the success rate of job hunting. As society develops, differences between individuals keep accumulating, and personalized development has become the prevailing trend. Existing portrait technology, however, focuses only on single-dimension user data, and the generated portraits depend mainly on preset label values. As a result, when portraits are generated in batches, the results are highly similar and uniformly structured, the differences between individuals cannot be finely distinguished, and the real situation of the user cannot be effectively reflected.

In the prior art, for example, Chinese patent CN108804701A, published on 2018-11-13, discloses a method for constructing a person portrait model based on social network big data. By computing the implicit attributes of a person and combining them with the person's social relationship network, it improves the accuracy of the person portrait and reflects the person's social attributes more comprehensively. However, it does not use deep learning to tune the structure of the portrait model toward an optimum, so the generated portraits do not fit the real behavior of users.

Disclosure of Invention

The invention provides a multi-data-source portrait construction method based on deep learning, aiming at overcoming the technical defect that the existing portrait technology cannot effectively reflect the real situation of a user.

In order to solve the technical problems, the technical scheme of the invention is as follows:

A multi-data-source portrait construction method based on deep learning comprises the following steps:

S1: constructing a multi-source data acquisition engine, using it to collect user information from a plurality of data sources, and pre-storing the collected user information in an information cache pool;

S2: parsing the user information in the information cache pool step by step, from shallow semantic parsing to deep association analysis, and persistently storing the parsed user information;

S3: matching corresponding labels to the parsed user information to obtain person label data;

S4: collecting a recruitment post data set from internet recruitment websites, and training a post model with a machine learning algorithm to obtain a trained post model;

S5: constructing a person portrait by combining the person label data and the post model through a deep learning algorithm, so as to obtain a person portrait that fits the real behavior of the user without losing individual differences.

In the above scheme, user information is collected from multiple data sources by a purpose-built multi-source data acquisition engine and parsed step by step; corresponding labels are matched to the user information to obtain person label data; a post model is trained with a machine learning algorithm; and finally a person portrait is constructed by combining the person label data and the post model through a deep learning algorithm, yielding a person portrait that fits the real behavior of the user without losing individual differences.

Preferably, in step S1, the multi-source data acquisition engine performs in-depth collection of user information from a plurality of different internet platforms covering a plurality of different fields.

Preferably, a Redis non-relational database is used as the information cache pool.

In the above scheme, because there are many information collection nodes, most of the data is structured but scattered, and high random-write performance is required, a Redis non-relational database is used as the information cache pool.
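As a non-limiting illustration, the pre-storage of collected records in the information cache pool might be sketched as follows. The key name `portrait:raw`, the record layout and the helper names are hypothetical and do not come from the original disclosure. So that the sketch runs without a server, it uses a small in-memory stand-in; the assumption is that a real deployment would pass a redis-py client (`redis.Redis`), whose `rpush` and `lpop` list commands have the same shape.

```python
import json
from collections import defaultdict, deque


class InMemoryCachePool:
    """Tiny stand-in exposing the two list operations used below.

    Assumption: in production this object would be a redis-py client,
    which offers rpush/lpop with the same call shape.
    """

    def __init__(self):
        self._lists = defaultdict(deque)

    def rpush(self, key, *values):
        # Append to the tail of the list and report its new length.
        self._lists[key].extend(values)
        return len(self._lists[key])

    def lpop(self, key):
        # Pop from the head of the list, or return None when it is empty.
        queue = self._lists.get(key)
        return queue.popleft() if queue else None


RAW_QUEUE = "portrait:raw"  # hypothetical key name


def cache_record(pool, source, fmt, payload):
    """Pre-store one collected record, tagged with its source and format."""
    record = json.dumps({"source": source, "format": fmt, "payload": payload})
    return pool.rpush(RAW_QUEUE, record)


def next_record(pool):
    """Pop the oldest cached record, or None when the pool is empty."""
    raw = pool.lpop(RAW_QUEUE)
    return json.loads(raw) if raw is not None else None
```

Tagging each record with its format lets the later parsing stage choose the matching parser without inspecting the payload.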

Preferably, the parsing of the user information includes HTML parsing, JSON parsing, XML parsing, and YAML parsing.

In the above scheme, shallow semantic parsing followed by deep association analysis addresses the problem that the collected data is large and complex; the parsed user information is ASCII text data.
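As a non-limiting illustration, the shallow parsing pass could dispatch on the payload format and reduce each payload to plain text fragments, for example as sketched below. The function name `parse_to_text` and the format tags are hypothetical. Only the formats covered by the Python standard library are handled here; YAML parsing would require a third-party parser and is deliberately left out of this sketch, which therefore raises an error for it.

```python
import json
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> content."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def _flatten_json(value):
    """Yield every scalar leaf of a decoded JSON value as a string."""
    if isinstance(value, dict):
        for item in value.values():
            yield from _flatten_json(item)
    elif isinstance(value, list):
        for item in value:
            yield from _flatten_json(item)
    elif value is not None:
        yield str(value)


def parse_to_text(payload, fmt):
    """Shallow pass: turn one raw payload into a list of text fragments."""
    if fmt == "json":
        return list(_flatten_json(json.loads(payload)))
    if fmt == "xml":
        root = ET.fromstring(payload)
        return [t.strip() for t in root.itertext() if t.strip()]
    if fmt == "html":
        extractor = _TextExtractor()
        extractor.feed(payload)
        extractor.close()
        return extractor.chunks
    raise ValueError(f"unsupported format: {fmt}")
```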

Preferably, in step S2, the method further includes filtering invalid fields in the user information by using natural language processing technology.

In the scheme, invalid redundant semantics in the user information are removed through a natural language processing technology.

Preferably, step S3 specifically includes the following steps:

S3.1: dividing the parsed user information into a training set and a test set; the training set is a user information data set that has been classified in advance, and the test set is a user information data set that is to be classified;

S3.2: performing word segmentation on each piece of user information in the training set and the test set, splitting long sentences into individual words, placing the resulting words into a word bag, and expanding the word bag into a chain structure to form a bag-of-words model;

S3.3: calculating a TF-IDF weight matrix for each piece of user information in the training set with the TF-IDF algorithm;

S3.4: performing classification training on the training set with a naive Bayes classification method to obtain trained parameters, classifying the test set according to the trained parameters, and matching corresponding labels to the user information in the test set to obtain person label data.
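As a non-limiting illustration, steps S3.2 to S3.4 might be sketched in plain Python as follows. Whitespace splitting stands in for real word segmentation, and the smoothed IDF formula, the multinomial form of naive Bayes and all function names are assumptions of this sketch rather than details given in the disclosure.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Whitespace split; an assumption standing in for real word segmentation."""
    return text.lower().split()


def fit_idf(token_docs):
    """Smoothed inverse document frequency for every training-set term."""
    n_docs = len(token_docs)
    doc_freq = Counter(term for doc in token_docs for term in set(doc))
    return {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}


def tfidf(tokens, idf):
    """TF-IDF weights of one document; terms unseen in training are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    return {t: c * idf[t] for t, c in counts.items()}


def train_naive_bayes(vectors, labels, vocab, alpha=1.0):
    """Multinomial naive Bayes over TF-IDF weights with Laplace smoothing."""
    class_docs = Counter(labels)
    term_mass = defaultdict(Counter)
    for vec, label in zip(vectors, labels):
        term_mass[label].update(vec)
    model = {}
    for label, n in class_docs.items():
        total = sum(term_mass[label].values()) + alpha * len(vocab)
        model[label] = (
            math.log(n / len(labels)),  # log prior of the class
            {t: math.log((term_mass[label][t] + alpha) / total) for t in vocab},
        )
    return model


def predict(model, vec):
    """Return the label with the highest posterior log-score."""
    def score(item):
        log_prior, log_like = item[1]
        return log_prior + sum(w * log_like[t] for t, w in vec.items())
    return max(model.items(), key=score)[0]
```

In use, the IDF table is fitted on the training set only, and each test-set record is vectorized with that same table before `predict` assigns its label.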

Preferably, the labels include fact labels, model labels and advanced labels;

the fact labels are used to define the data objectively, the model labels are used to classify the data according to normative definitions, and the advanced labels are used to mine deep difference information from the data.

Preferably, step S4 specifically includes the following steps:

S4.1: collecting recruitment information from internet recruitment websites to form a recruitment post data set;

S4.2: performing word segmentation on the recruitment information in the recruitment post data set and removing stop words to obtain word documents;

S4.3: extracting features from the word documents with the TF-IDF algorithm to obtain a recruitment information vector matrix;

S4.4: performing classification training on the recruitment information vector matrix with a machine-learning logistic regression algorithm to obtain a trained post model.
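As a non-limiting illustration, steps S4.2 to S4.4 might be sketched in plain Python as follows. The stop-word list, the whitespace segmentation, the one-vs-rest arrangement, the learning rate and the epoch count are all assumptions of this sketch; the disclosure itself specifies only TF-IDF features and logistic regression.

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "and", "of", "for", "with", "in"}  # illustrative list


def to_word_document(text):
    """S4.2: segment (whitespace split here) and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def tfidf_matrix(docs):
    """S4.3: dense TF-IDF rows plus the vocabulary that indexes the columns."""
    vocab = sorted({w for doc in docs for w in doc})
    df = Counter(w for doc in docs for w in set(doc))
    idf = [math.log((1 + len(docs)) / (1 + df[w])) + 1 for w in vocab]
    rows = []
    for doc in docs:
        tf = Counter(doc)
        rows.append([tf[w] * idf[j] for j, w in enumerate(vocab)])
    return rows, vocab


def sigmoid(z):
    z = max(-60.0, min(60.0, z))  # clamp to keep math.exp in range
    return 1.0 / (1.0 + math.exp(-z))


def train_binary(rows, targets, lr=0.1, epochs=300):
    """Batch gradient descent on the logistic loss for one post class."""
    weights, bias = [0.0] * len(rows[0]), 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(weights), 0.0
        for x, y in zip(rows, targets):
            err = sigmoid(sum(w * v for w, v in zip(weights, x)) + bias) - y
            grad_b += err
            for j, v in enumerate(x):
                grad_w[j] += err * v
        weights = [w - lr * g / len(rows) for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b / len(rows)
    return weights, bias


def train_post_model(rows, labels):
    """S4.4: one-vs-rest logistic regression, one binary model per post."""
    return {c: train_binary(rows, [1.0 if l == c else 0.0 for l in labels])
            for c in sorted(set(labels))}


def predict_post(model, x):
    """Pick the post whose binary model gives the highest probability."""
    return max(model, key=lambda c: sigmoid(
        sum(w * v for w, v in zip(model[c][0], x)) + model[c][1]))
```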

Preferably, in the feature engineering of the logistic regression algorithm, the word contribution rate threshold max_df is set to 0.92 and the word occurrence frequency threshold min_df is set to 2.

In the above scheme, words having a high probability of occurrence but a low contribution rate to classification are removed by setting the word contribution rate threshold max_df to 0.92, and words having an occurrence frequency of less than 2 times are removed by setting the word occurrence frequency threshold min_df to 2.
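As a non-limiting illustration, the two thresholds can be read as a document-frequency filter on the vocabulary. The sketch below assumes the convention used by scikit-learn's text vectorizers, where a fractional max_df is a proportion of documents and an integer min_df is an absolute document count; whether the disclosure intends exactly this reading is an assumption, and the function name `prune_vocabulary` is hypothetical.

```python
from collections import Counter


def prune_vocabulary(token_docs, max_df=0.92, min_df=2):
    """Keep terms whose document frequency lies within [min_df, max_df].

    max_df is read as a fraction of documents and min_df as an absolute
    document count (an assumption about how the thresholds are meant).
    """
    n_docs = len(token_docs)
    # Count each term once per document, however often it repeats inside it.
    doc_freq = Counter(term for doc in token_docs for term in set(doc))
    return sorted(
        term for term, df in doc_freq.items()
        if df >= min_df and df / n_docs <= max_df
    )
```

Under this reading, a word that appears in more than 92% of the recruitment postings is dropped as uninformative, and a word that appears in only one posting is dropped as too rare to generalize.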

Preferably, in step S4.4, the end condition of the classification training is set to: the precision of the post model obtained by training is more than 95%.
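As a non-limiting illustration, such an end condition might be implemented as a loop that keeps training until a held-out score exceeds the threshold. The translated term "precision" may equally denote accuracy; this sketch uses macro-averaged precision as one possible reading. The callables `train_step` and `predict_all`, and the safety cap `max_rounds`, are hypothetical additions of the sketch.

```python
def precision(y_true, y_pred, positive):
    """TP / (TP + FP) for one class; 0.0 when nothing is predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if p == positive and t != positive)
    return tp / (tp + fp) if tp + fp else 0.0


def macro_precision(y_true, y_pred):
    """Unweighted mean of the per-class precision values."""
    classes = sorted(set(y_true))
    return sum(precision(y_true, y_pred, c) for c in classes) / len(classes)


def train_until_precise(train_step, predict_all, y_val, target=0.95, max_rounds=1000):
    """Call train_step() until validation precision exceeds target.

    Returns (rounds_used, last_precision). max_rounds is a safety cap
    added for this sketch so the loop terminates on hard data.
    """
    score = 0.0
    for rounds in range(1, max_rounds + 1):
        train_step()
        score = macro_precision(y_val, predict_all())
        if score > target:
            return rounds, score
    return max_rounds, score
```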

Compared with the prior art, the technical scheme of the invention has the following beneficial effects:

The invention provides a multi-data-source portrait construction method based on deep learning, in which user information is collected from multiple data sources by a purpose-built multi-source data acquisition engine and parsed step by step, corresponding labels are matched to the user information to obtain person label data, a post model is trained with a machine learning algorithm, and finally a person portrait is constructed by combining the person label data and the post model through a deep learning algorithm, yielding a person portrait that fits the real behavior of the user without losing individual differences.

Drawings

FIG. 1 is a flow chart of the implementation steps of the technical scheme of the invention.

Detailed Description

The drawings are for illustrative purposes only and are not to be construed as limiting the patent.

For the purpose of better illustrating the embodiments, certain features of the drawings may be omitted, enlarged or reduced, and do not represent the size of an actual product.

It will be understood by those skilled in the art that certain well-known structures in the drawings, and descriptions thereof, may be omitted.

The technical solution of the present invention is further described below with reference to the accompanying drawings and examples.

Example 1

As shown in FIG. 1, a multi-data-source portrait construction method based on deep learning includes the following steps:

In a specific implementation, user information is collected from multiple data sources by a purpose-built multi-source data acquisition engine and parsed step by step; corresponding labels are matched to the user information to obtain person label data; a post model is trained with a machine learning algorithm; and finally a person portrait is constructed by combining the person label data and the post model through a deep learning algorithm, yielding a person portrait that fits the real behavior of the user without losing individual differences.

Example 2

In actual implementation, the multi-source data acquisition engine is built on the Scrapy framework, and a user only needs to enter identity verification information to export the relevant platform data directly, including data such as the user's consumption behavior and browsing history; the multi-source data acquisition engine is also subject to the following restrictions: at most two threads may work at the same time, and the amount of information crawled per second is kept below 1 megabyte; in addition, some of the multi-source data acquisition engines are set to a user-defined mode, so that collection can be personalized according to the user's actual circumstances, ultimately completing all-round, multi-level collection of the user's information;

more specifically, in step S1, the multi-source data acquisition engine performs in-depth collection of user information from 36 different internet platforms covering 7 different fields;

in actual implementation, the data sources involved include major information platforms such as JD.com, Taobao, Meituan, Dianping, Ctrip, Alipay, China Railway 12306, WeChat, Weibo, QQ, Bilibili, Tencent Video, iQIYI, Youku, Maoyan Movies, Douban, NetEase Cloud Music, QQ Music, China Mobile, China Unicom, China Telecom, QQ Mail, NetEase Mail, Gmail, Chunyu Doctor, Lianjia, Toutiao, Jianshu, Cnblogs, CSDN, OSChina blogs, Eastmoney, Lagou, 51job, Zhaopin, CHSI (Xuexin) and the like; the fields covered comprise daily-life shopping, social circles, medical health, travel and tourism, games and entertainment, knowledge and education, and investment and finance;

more specifically, because there are many information collection nodes, most of the data is structured but scattered, and high random-write performance is required, a Redis non-relational database is used as the information cache pool;
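As a non-limiting illustration, the two restrictions placed on the acquisition engine in this example (at most two threads working at once, and less than 1 megabyte crawled per second) might be enforced as sketched below. This is a standard-library sketch and not the engine's actual configuration; `ByteRateLimiter`, `collect` and the caller-supplied `fetch` function are hypothetical names, and 1 megabyte is taken as 1,000,000 bytes.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

MAX_WORKERS = 2                   # at most two threads working at once
MAX_BYTES_PER_SECOND = 1_000_000  # stay under roughly 1 megabyte per second


class ByteRateLimiter:
    """Delays callers so consumed bytes stay within a per-second budget."""

    def __init__(self, bytes_per_second, clock=time.monotonic, sleep=time.sleep):
        self._rate = bytes_per_second
        self._clock = clock
        self._sleep = sleep
        self._lock = threading.Lock()
        self._next_free = clock()

    def consume(self, n_bytes):
        """Reserve n_bytes of budget; wait until the reservation starts."""
        with self._lock:
            now = self._clock()
            start = max(now, self._next_free)
            # The budget is busy for n_bytes / rate seconds after `start`.
            self._next_free = start + n_bytes / self._rate
        delay = start - now
        if delay > 0:
            self._sleep(delay)
        return delay


def collect(sources, fetch, limiter):
    """Fetch every source using a pool of at most MAX_WORKERS threads."""
    def task(source):
        payload = fetch(source)
        limiter.consume(len(payload))  # charge the downloaded size
        return source, payload

    with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
        return dict(pool.map(task, sources))
```

Charging the budget after each download means a single large response can briefly exceed the rate; the following fetches are then delayed until the average falls back under the limit.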

S2: parsing the user information in the information cache pool step by step, from shallow semantic parsing to deep association analysis, wherein the parsed user information is ASCII text data, and storing the parsed user information back into the Redis non-relational database;

more specifically, because different information platforms return data in different formats, the parsing of the user information includes HTML parsing, JSON parsing, XML parsing and YAML parsing;

more specifically, in step S2, the method further includes filtering invalid fields in the user information through a natural language processing technique to remove invalid redundant semantics in the user information, reduce the complexity of subsequent training, and increase the training accuracy;

more specifically, step S3 specifically includes the following steps:

S3.1: dividing the parsed user information into a training set and a test set;

the training set is a user information data set that has been classified in advance, and the test set is a user information data set that is to be classified;

S3.4: performing classification training on the training set with a naive Bayes classification method to obtain trained parameters, classifying the test set according to the trained parameters, and matching corresponding labels to the user information in the test set to obtain person label data;

more specifically, the labels include fact labels, model labels and advanced labels;

the fact labels are used to define the data objectively, the model labels are used to classify the data according to normative definitions, and the advanced labels are used to mine deep difference information from the data;

in actual implementation, fact labels are, for example, educational background, age and gender, and advanced labels are, for example, film preferences, the style of followed bloggers and favorite cities;

more specifically, step S4 specifically includes the following steps:

S4.4: performing classification training on the recruitment information vector matrix with a machine-learning logistic regression algorithm to obtain a trained post model;

more specifically, in the feature engineering of the logistic regression algorithm, words with a high probability of occurrence but a low contribution rate to classification are removed by setting the word contribution rate threshold max_df to 0.92, and words occurring fewer than 2 times are removed by setting the word occurrence frequency threshold min_df to 2; in this embodiment, a corpus of 70000 feature words is established, and the feature words selected consist of 1 to 4 characters;

more specifically, in step S4.4, the end condition of the classification training is set to: the precision of the post model obtained by training is more than 95%.

It should be understood that the above-described embodiments of the present invention are merely examples given to illustrate the invention clearly and are not intended to limit its embodiments. Other variations and modifications in different forms will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here. Any modification, equivalent replacement or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the claims of the present invention.

Claims

1. A multi-data-source portrait construction method based on deep learning, characterized by comprising the following steps:

S2: parsing the user information in the information cache pool step by step, from shallow semantic parsing to deep association analysis;

2. The multi-data-source portrait construction method based on deep learning of claim 1, wherein in step S1, the multi-source data acquisition engine collects user information from a plurality of different internet platforms covering a plurality of different fields.

3. The multi-data-source portrait construction method based on deep learning of claim 1, wherein a Redis non-relational database is used as the information cache pool.

4. The multi-data-source portrait construction method based on deep learning of claim 1, wherein the parsing of the user information comprises HTML parsing, JSON parsing, XML parsing and YAML parsing.

5. The multi-data-source portrait construction method based on deep learning of claim 1, wherein step S2 further comprises filtering invalid fields in the user information by natural language processing.

6. The multi-data-source portrait construction method based on deep learning of claim 1, wherein step S3 specifically comprises the following steps:

7. The multi-data-source portrait construction method based on deep learning of claim 1 or 6, wherein the labels comprise fact labels, model labels and advanced labels;

8. The multi-data-source portrait construction method based on deep learning of claim 1, wherein step S4 specifically comprises the following steps:

9. The multi-data-source portrait construction method based on deep learning of claim 8, wherein a word contribution rate threshold max_df = 0.92 and a word occurrence frequency threshold min_df = 2 are set in the feature engineering of the logistic regression algorithm.

10. The multi-data-source portrait construction method based on deep learning of claim 8, wherein in step S4.4, the end condition of the classification training is set as follows: the precision of the post model obtained by training is more than 95%.