CN117171428B - Method for improving accuracy of search and recommendation results - Google Patents

Info

Publication number
CN117171428B
Authority
CN
China
Prior art keywords
data
keywords
training
job
objective function
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310981457.4A
Other languages
Chinese (zh)
Other versions
CN117171428A (en)
Inventor
时迎超
王杨
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Wangpin Information Technology Co ltd
Original Assignee
Beijing Wangpin Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Wangpin Information Technology Co ltd filed Critical Beijing Wangpin Information Technology Co ltd
Priority to CN202310981457.4A priority Critical patent/CN117171428B/en
Publication of CN117171428A publication Critical patent/CN117171428A/en
Application granted granted Critical
Publication of CN117171428B publication Critical patent/CN117171428B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method for improving the accuracy of search and recommendation results, belonging to the technical field of data processing. The method comprises the following steps: S10, improving the data quality of the knowledge graph through double-chain data, and cleaning the data with a clustering method to improve its accuracy; S20, training with a pre-training model into which knowledge of the job class tree is integrated in advance; S30, using multi-task training to reduce the confusion of the pre-training model; and S40, recommending preferred job classes, including the most probable job class and similar job classes. To improve retrieval and matching performance for job descriptions (JD) and resumes (CV), the invention upgrades the system in terms of data quality improvement, labeling model optimization, vector model optimization, and the like.

Description

Method for improving accuracy of search and recommendation results
Technical Field
The invention belongs to the technical field of data processing, and particularly relates to a method for improving the accuracy of search and recommendation results, in particular to a method for improving the accuracy of search and recommendation results based on a knowledge graph.
Background
The technology and application value of big data are widely recognized, and the knowledge graph, a core technology for the future, has developed rapidly as large internet companies have put it to use. Amazon uses big data to recommend products to customers, forming a comprehensive relationship between people and products; Microsoft developed "Person Cube", which forms a three-dimensional person-to-person relationship and realizes six-degrees-of-separation search between people; Baidu developed Baidu Brain, redefining the search engine in China and providing users with comprehensively expanded search results; Google long ago set out to "take over the world" with big data: it developed one of the earliest internet search engines, opened the internet era, and on that basis developed Google Brain, which helped popularize the concept and technology of the knowledge graph.
The knowledge graph is a knowledge base with a graph structure and belongs to the field of knowledge engineering. Unlike an ordinary knowledge base, a knowledge graph fuses all disciplines: knowledge units of different sources, types, and structures are linked into a graph, and, based on the metadata of each discipline, the user is given a broader and deeper knowledge system that keeps expanding. In essence, a knowledge graph systematizes and relates the knowledge data of a field and visualizes that knowledge as a graph. In short, a knowledge graph can be understood as a knowledge system built on an information system: through data acquisition, data mining, information processing, knowledge measurement, graph drawing, and similar technologies, it presents a complex knowledge domain systematically and reveals the dynamic development patterns of that domain.
Job class is one of the most important pieces of information in the recruitment industry. On the client side, job class information appears throughout the user's usage and office workflow; on the strategy side, it is also an important ranking or recall signal. The job class trees of different recruitment platforms are large and differ in content, so accurately understanding and memorizing them is especially costly for users. According to statistics, only 80%+ of users can understand and remember the target job class for recruitment and select it correctly from the large job class tree. An accurate job classification task is therefore of great significance for improving user efficiency, improving the quality of basic data, and contributing features to the business.
For example, prior-art application No. CN202310528124.6 discloses a knowledge-graph-based method, system, and device for recommending network hotspot information. That invention acquires hotspot events and decision information and builds a network hotspot knowledge graph after keyword extraction and knowledge extraction; acquires a sudden hot event and extracts its keywords and knowledge; evaluates keyword similarity, entity attribute similarity, and relationship similarity between the sudden hot event and the network hotspot knowledge graph on the basis of their keywords, entity attributes, and relationships; and recommends hotspot events and decision information according to the similarities obtained. However, existing techniques of this kind have the following problems: data quality cannot be controlled and is uneven, data distribution cannot be controlled, and the data model often deviates from the actual data; as the number of job classes grows (currently more than 1380 conventional job positions), the growing number of targets lowers model iteration efficiency; from the perspective of training targets, the similarity of some job classes makes it difficult for the model to fit accurately; and from the perspective of training data, most positions can be assigned to multiple job classes and therefore have a multi-label nature. A better data recommendation model is therefore needed to solve these problems.
Disclosure of Invention
Problems to be solved
In view of the problems in the prior art, the invention provides a method for improving the accuracy of search and recommendation results. To improve retrieval and matching performance for JD and CV, the invention upgrades the system in terms of data quality improvement, labeling model optimization, vector model optimization, and the like.
Technical proposal
In order to solve the problems, the invention adopts the following technical scheme.
A method for improving the accuracy of search and recommendation results comprises the following steps:
S10: the data quality of the knowledge graph is improved through double-chain data, and the data are cleaned with a clustering method to improve the data accuracy of the knowledge graph;
S20: training is performed with a pre-training model, into which knowledge of the job class tree is integrated in advance;
S30: multi-task training is used to reduce the confusion of the pre-training model;
S40: preferred job classes are recommended, including the most probable job class and similar job classes.
The method for improving the accuracy of the search and recommendation results,
the double-chain data in the step S10 are grouped by using the occurrence frequency of the keywords in the basic data;
the weight formula for the groups of double-chain data described in step S10 is as follows:
$W_{pf}(i) = pf_i \cdot idf_i / if_i$
where $W_{pf}(i)$ is the weight value of the i-th group of keywords, $pf_i$ is the occurrence frequency of the i-th group of keywords, $idf_i$ is the ratio between the number of groups of the structured double-chain data and the number of groups of the unstructured double-chain data, and $if_i$ is the inverse frequency.
The method for improving the accuracy of the search and recommendation results,
$if_i$ in step S10 is calculated as follows:
[formula not reproduced]
where $N$ is the total number of basic data and $df_i$ is the number of occurrences of the i-th group of keywords in the basic data.
The method for improving the accuracy of the search and recommendation results,
the clustering method in step S20 is as follows:
carrying out characteristic representation processing on the structured double-chain data and the semi-structured double-chain data, wherein the characteristic representation processing needs to carry out the following algorithm processing on the weight value of the i-th group of keywords:
[formula not reproduced]
where $P(S)$ is the distribution probability of the weight values of the keywords of all groups, $S$ is the total sequence of weight values of the keywords of all groups, and $w_i$ ($1 \le i \le n$) is the sequence number of the i-th group of keywords.
The method for improving the accuracy of the search and recommendation results,
the method of pre-training the model described in step S20 is as follows:
the structured double-chain data and the semi-structured double-chain data, after feature representation processing, are sent to an NLP service center;
finally, the NLP service center applies an optimized BP network model to the screened, rule-conforming, feature-represented double-chain data to obtain a first entity relationship;
the optimization algorithm using the BP network model is as follows:
[formula not reproduced]
where $G_i$ is the optimized first entity relationship degree value, $N$ is the sum of the statistics of all groups of keywords, and $P_i^n$ is the distribution probability of the weight values of the keywords of all groups.
The method for improving the accuracy of the search and recommendation results,
the manner of integration in step S20 is as follows:
the scheduling layer is optimized: a model is established using a max-min mathematical algorithm, and each single objective function is determined to obtain the combined scheduling benefit.
The method for improving the accuracy of the search and recommendation results,
the multi-task training method described in step S30 is as follows:
each single objective function is determined, including a first objective function, a second objective function, a third objective function, and a fourth objective function.
The method for improving the accuracy of the search and recommendation results,
the first objective function is $F_1(q_{v1}) = (q_{v11} - q_{v0}) / q_{v12}$, where $q_{v0}$, $q_{v1}$, $q_{v11}$, and $q_{v12}$ are planned task values for different periods;
the second objective function is [formula not reproduced], where $F_i(q_{vi})$ is calculated by a weight-coefficient transformation method that assigns weights to the reference flow values;
the third objective function is $V_{i,j+1} = V_{i,j} + (Q_{i,j} - q_{i,j} - Q_{loss\,i,j})$, where $V_{i,j}$ is the data volume of the i-th cloud computation in the j-th time period, $V_{i,j+1}$ is the data volume of the i-th cloud computation in the (j+1)-th time period, $Q_{i,j}$ is the data inflow of the i-th cloud computation in the j-th time period, $q_{i,j}$ is the data outflow of the i-th cloud computation in the j-th time period, and $Q_{loss\,i,j}$ is the data loss of the i-th cloud computation in the j-th time period;
the fourth objective function is [formula not reproduced], where $E_{sm}$ is the energy value of the cloud computation, $V_{m,T}$ is the effective storage capacity of the scheduling period, $\gamma_{m,T}$ is the data volume of the scheduling period, and $m$ and $T$ are the cloud computation index and the total number of periods, respectively.
The method for improving the accuracy of the search and recommendation results,
the algorithm recommended in step S40 is as follows:
[formula not reproduced]
where $FD_{Qk}$ is the recommended quantized complexity value, $d_{kij}$ is the knowledge-graph data of the k-th recommended component set in the column and row directions, $p_{ki}$ is the complexity value of the knowledge-graph data of the k-th recommended component set in the column direction, and $p_{kj}$ is the complexity value of the knowledge-graph data of the k-th recommended component set in the row direction.
Advantageous effects
Compared with the prior art, the invention has the beneficial effects that:
Double-chain data is used to improve data quality, and a clustering method is used to clean the data and strengthen its accuracy; hierarchical features are integrated into the large pre-training model to improve iteration efficiency, and multi-task joint training is adopted to reduce confusion. The invention also achieves the following: monitoring of the GPU is converted into monitoring of the CPU; point-to-point connection between the API and the model process is implemented, achieving load balancing; a traffic demand of three times the spring-season level can be met; and capacity can be expanded in step with other services, further improving service performance.
Drawings
FIG. 1 is a flow chart of a method for improving accuracy of search and recommendation results according to the present invention;
FIG. 2 is an interface diagram of a post of a method of improving accuracy of search and recommendation results according to the present invention;
FIG. 3 is a diagram of a model calculation, schematically illustrated as a financial accounting position, of one method of improving accuracy of search and recommendation results of the present invention;
FIG. 4 is a diagram illustrating a resume position for a financial accounting position as an example of a method for improving accuracy of search and recommendation results according to the present invention;
FIG. 5 is a JD dimension total feature map of a method of the present invention for improving accuracy of search and recommendation results;
FIG. 6 is a CV dimension overall characteristic diagram of a method for enhancing the accuracy of search and recommendation results according to the present invention;
FIG. 7 is a keyword recognition flowchart of NLP of a method for improving accuracy of search and recommendation results according to the present invention;
FIG. 8 is a keyword sample of a method of improving accuracy of search and recommendation results according to the present invention;
FIG. 9 is a keyword cluster map of a method of improving accuracy of search and recommendation results according to the present invention;
FIG. 10 is a scoring criteria diagram of a method of improving accuracy of search and recommendation results according to the present invention;
FIG. 11 is a diagram showing example JD and CV inputs and the output score obtained under the above scoring criteria in a method for improving accuracy of search and recommendation results according to the present invention;
FIG. 12 is a diagram of one sample example of a method of improving the accuracy of search and recommendation results in accordance with the present invention;
FIG. 13 is a vector model diagram of a method for improving accuracy of search and recommendation results according to the present invention;
FIG. 14 is a vector model result diagram of a method of improving accuracy of search and recommendation results according to the present invention;
FIG. 15 is a diagram of a system architecture employed in one method of the present invention for improving the accuracy of search and recommendation results;
FIG. 16 is a diagram of a deployment framework of a system employed by the method of the present invention for improving accuracy of search and recommendation results;
FIG. 17 is a diagram of a physical architecture of a system employed in a method of enhancing accuracy of search and recommendation results in accordance with the present invention;
FIG. 18 is a content resolution diagram of the outcome of one method of the present invention for improving the accuracy of search and recommendation results;
FIG. 19 is a diagram of a resume (CV) effect in a method for improving accuracy of search and recommendation results according to the present invention;
FIG. 20 is a diagram of a resume (CV) effect in a method for improving accuracy of search and recommendation results according to the present invention;
FIG. 21 is a diagram showing a Job Description (JD) effect in a method for improving accuracy of search and recommendation results according to the present invention;
FIG. 22 is a diagram showing a Job Description (JD) effect in a method for improving accuracy of search and recommendation results according to the present invention.
Detailed Description
The invention is further described below in connection with specific embodiments.
Example 1
As shown in fig. 1, the method for improving the accuracy of the search and recommendation results comprises the following steps:
S10: the data quality of the knowledge graph is improved through double-chain data, and the data are cleaned with a clustering method to improve the data accuracy of the knowledge graph.
It should be noted that the invention handles structured and semi-structured data in a completely different manner from unstructured data.
The prior art often performs iterative training with a BiLSTM (bidirectional long short-term memory network) and CRF (conditional random field) knowledge extraction model in NLP, and both the BiLSTM model and the CRF model have shortcomings. The present method instead relies on a mature NLP service center. Using NLP techniques, it extracts knowledge from passages of scientific documents after word segmentation, part-of-speech tagging, syntactic analysis, semantic analysis, and similar processing; through knowledge representation it then converts sentences written in natural language into a form a computer can understand and stores them in a knowledge base. The knowledge extraction system has two major parts: natural language processing and knowledge extraction. Natural language processing analyzes content from a linguistic perspective and comprises eight modules: sentence segmentation, automatic word segmentation, part-of-speech tagging, word sense tagging, syntactic analysis, sentence meaning analysis, sentence-group analysis, and discourse analysis. The first four modules are the foundation, syntactic analysis and sentence meaning analysis are the core, and sentence-group analysis and discourse analysis are extensions. In operation, the eight modules require the support of six kinds of resources: a keyword library, a probability dictionary, a semantic dictionary, syntax rules, a domain thesaurus, and a domain ontology.
The NLP-based knowledge extraction system adopts MVC as its design pattern and is implemented in Java using object-oriented programming; ObjectStore serves as the object-oriented database and Oracle as the relational database; automatic word segmentation uses a maximum vector matching algorithm, part-of-speech tagging uses a maximum probability algorithm, syntactic analysis uses an LR parsing algorithm, and semantic analysis uses predicate logic; the system interfaces use XML.
The method for improving the accuracy of the search and recommendation results,
the double-chain data in the step S10 are grouped by using the occurrence frequency of the keywords in the basic data;
the weight formula for the groups of double-chain data described in step S10 is as follows:
$W_{pf}(i) = pf_i \cdot idf_i / if_i$
where $W_{pf}(i)$ is the weight value of the i-th group of keywords, $pf_i$ is the occurrence frequency of the i-th group of keywords, $idf_i$ is the ratio between the number of groups of the structured double-chain data and the number of groups of the unstructured double-chain data, and $if_i$ is the inverse frequency.
Further, in the method for improving accuracy of search and recommendation results described above, $if_i$ in step S10 is calculated as follows:
[formula not reproduced]
where $N$ is the total number of basic data and $df_i$ is the number of occurrences of the i-th group of keywords in the basic data.
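The group weight $W_{pf}(i)$ can be illustrated with a minimal Python sketch. Several parts are assumptions rather than the patent's own definitions: the function name `group_weights` is hypothetical, $df_i$ is counted as the number of records containing at least one keyword of the group, and, because the formula for $if_i$ is not reproduced in this text, the sketch assumes the conventional form $if_i = \log(N / df_i)$. The ratio $idf_i$ is supplied by the caller.

```python
import math


def group_weights(documents, keyword_groups, idf_ratio):
    """Compute W_pf(i) = pf_i * idf_i / if_i for each keyword group.

    documents: list of token lists (the "basic data").
    keyword_groups: dict mapping a group id to an iterable of keywords.
    idf_ratio: dict mapping a group id to idf_i (ratio of structured to
        unstructured group counts), supplied by the caller.
    ASSUMPTION: if_i = log(N / df_i); the source text omits this formula.
    ASSUMPTION: df_i = number of records that contain the group.
    """
    n_docs = len(documents)
    weights = {}
    for group_id, keywords in keyword_groups.items():
        keyword_set = set(keywords)
        # pf_i: total occurrences of the group's keywords in all records
        pf = sum(1 for doc in documents for tok in doc if tok in keyword_set)
        # df_i: number of records in which the group occurs at least once
        df = sum(1 for doc in documents if keyword_set & set(doc))
        if df == 0 or df == n_docs:
            # log(N / df) is undefined or zero here, so no weight is produced
            continue
        inverse_frequency = math.log(n_docs / df)
        weights[group_id] = pf * idf_ratio[group_id] / inverse_frequency
    return weights
```

Groups that never occur, or that occur in every record, are skipped because the assumed inverse frequency would be undefined or zero; a production system would need an explicit policy for those cases.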
Further, the structured double-chain data and the semi-structured double-chain data undergo NLP feature representation processing, during which the weight value of the i-th group of keywords is processed by the following algorithm:
[formula not reproduced]
where $P(S)$ is the distribution probability of the weight values of the keywords of all groups, $S$ is the total sequence of weight values of the keywords of all groups, and $w_i$ ($1 \le i \le n$) is the sequence number of the i-th group of keywords.
The method for improving the accuracy of the search and recommendation results, disclosed by the invention, further comprises the following steps:
S20: training is performed with a pre-training model, into which knowledge of the job class tree is integrated in advance.
The method for improving the accuracy of the search and recommendation results,
the method of pre-training the model described in step S20 is as follows:
the double-chain data which is structured after the feature representation processing and the semi-structured double-chain data are sent to an NLP service center;
finally, the NLP service center applies an optimized BP network model to the screened, rule-conforming, feature-represented double-chain data to obtain a first entity relationship;
the optimization algorithm using the BP network model is as follows:
[formula not reproduced]
where $G_i$ is the optimized first entity relationship degree value, $N$ is the sum of the statistics of all groups of keywords, and $P_i^n$ is the distribution probability of the weight values of the keywords of all groups.
This is also one of the inventive points of the application: the BP neural network, which is generally applied to modeling, is used here; the MSE in the projection algorithm is optimized; and a comparison of the relationship-degree values of two entities is added. Compared with conventional iterative training using the BiLSTM+CRF knowledge extraction model in NLP, the processing effect improves by about 12.4%, and a set of data values can be obtained in about 24 hours. The optimization of the projection algorithm is based on the principle of nonlinear diffusion filtering: a nonlinear scale space is constructed with a fast explicit diffusion scheme to obtain the contour structure of the data projection, so that feature extraction is scale-invariant, and the contour corner points of a projection block are extracted from the gray-level difference between the projection under test and the pixels of its neighborhood circle, in both the data projection domain and the scale domain. Finally, feature description vectors are computed with the FREAK algorithm, matching points of the projection image are searched for according to the epipolar constraint criterion, and the contour corner points of the obstacle are accurately extracted and matched.
The method for improving the accuracy of the search and recommendation results,
the manner of integration in step S20 is as follows:
the scheduling layer is optimized: a model is established using a max-min mathematical algorithm, and each single objective function is determined to obtain the combined scheduling benefit.
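The text does not spell out how the max-min model combines the single objectives. One common reading, shown here purely as an illustrative sketch, is to normalize each objective across the candidate schedules and choose the candidate whose worst normalized objective is largest. The function name `max_min_select`, the min-max normalization, and the convention that larger objective values are better are all assumptions.

```python
def max_min_select(candidates, objectives):
    """Return the candidate whose worst normalized objective is largest.

    candidates: list of candidate schedules (any objects).
    objectives: list of functions mapping a candidate to a number,
        where larger is assumed to be better (ASSUMPTION).
    """
    if not candidates or not objectives:
        raise ValueError("need at least one candidate and one objective")
    # Evaluate every objective on every candidate
    scores = [[f(c) for f in objectives] for c in candidates]
    n_obj = len(objectives)
    lows = [min(row[k] for row in scores) for k in range(n_obj)]
    highs = [max(row[k] for row in scores) for k in range(n_obj)]

    def worst_normalized(row):
        # Min-max normalize each objective to [0, 1], then take the minimum
        normalized = [
            (row[k] - lows[k]) / (highs[k] - lows[k]) if highs[k] > lows[k] else 1.0
            for k in range(n_obj)
        ]
        return min(normalized)

    best_index = max(range(len(candidates)), key=lambda i: worst_normalized(scores[i]))
    return candidates[best_index]
```

With two opposing objectives such as `lambda x: x` and `lambda x: 4 - x` over the candidates 0 to 4, the balanced candidate 2 is selected, since it maximizes the smaller of the two normalized scores.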
The method for improving the accuracy of the search and recommendation results, disclosed by the invention, further comprises the following steps:
S30: multi-task training is used to reduce the confusion of the pre-training model;
The method for improving the accuracy of the search and recommendation results,
the multi-task training method described in step S30 is as follows:
each single objective function is determined, including a first objective function, a second objective function, a third objective function, and a fourth objective function.
The method for improving the accuracy of the search and recommendation results,
the first objective function is $F_1(q_{v1}) = (q_{v11} - q_{v0}) / q_{v12}$, where $q_{v0}$, $q_{v1}$, $q_{v11}$, and $q_{v12}$ are planned task values for different periods;
the second objective function is [formula not reproduced], where $F_i(q_{vi})$ is calculated by a weight-coefficient transformation method that assigns weights to the reference flow values;
the third objective function is $V_{i,j+1} = V_{i,j} + (Q_{i,j} - q_{i,j} - Q_{loss\,i,j})$, where $V_{i,j}$ is the data volume of the i-th cloud computation in the j-th time period, $V_{i,j+1}$ is the data volume of the i-th cloud computation in the (j+1)-th time period, $Q_{i,j}$ is the data inflow of the i-th cloud computation in the j-th time period, $q_{i,j}$ is the data outflow of the i-th cloud computation in the j-th time period, and $Q_{loss\,i,j}$ is the data loss of the i-th cloud computation in the j-th time period;
the fourth objective function is [formula not reproduced], where $E_{sm}$ is the energy value of the cloud computation, $V_{m,T}$ is the effective storage capacity of the scheduling period, $\gamma_{m,T}$ is the data volume of the scheduling period, and $m$ and $T$ are the cloud computation index and the total number of periods, respectively.
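The first and third objective functions are stated explicitly and can be evaluated directly. The sketch below is illustrative only: the function names and the list-based inputs are assumptions, and the second and fourth objective functions are left out because their formulas are not reproduced in this text.

```python
def first_objective(q_v11, q_v0, q_v12):
    """F_1(q_v1) = (q_v11 - q_v0) / q_v12, with planned task values as inputs."""
    if q_v12 == 0:
        raise ValueError("q_v12 must be non-zero")
    return (q_v11 - q_v0) / q_v12


def simulate_volume(v_initial, inflow, outflow, loss):
    """Apply V_{i,j+1} = V_{i,j} + (Q_{i,j} - q_{i,j} - Q_loss_{i,j}) per period.

    v_initial: data volume of one cloud computation at the first period.
    inflow, outflow, loss: per-period sequences of equal length.
    Returns the list of volumes, starting with v_initial.
    """
    if not (len(inflow) == len(outflow) == len(loss)):
        raise ValueError("inflow, outflow and loss must have the same length")
    volumes = [v_initial]
    for q_in, q_out, q_loss in zip(inflow, outflow, loss):
        # Balance equation: previous volume plus net change in this period
        volumes.append(volumes[-1] + (q_in - q_out - q_loss))
    return volumes
```

For example, `simulate_volume(100, [10, 20], [5, 5], [1, 2])` yields `[100, 104, 117]`: each period adds inflow and subtracts outflow and loss.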
The database can be optimized by the following method:
a big-data generative adversarial network, cycle D2GAN, is constructed, comprising two big-data generators and four big-data discriminators, namely a small sample generator G, a large sample generator F, a small sample discriminator D1s, a small sample discriminator D2s, a large sample discriminator D1b, and a large sample discriminator D2b;
an optimization objective function of the big-data generative adversarial network is constructed, and the two generators and the four discriminators are each trained iteratively on the basis of this objective function, so as to obtain a small-sample generation parameter model;
the training of the small sample generator G and the training of the small sample discriminators D1s and D2s form one adversarial process, and the training of the large sample generator F and the training of the large sample discriminators D1b and D2b form another adversarial process.
The method for improving the accuracy of the search and recommendation results, disclosed by the invention, further comprises the following steps:
s40: preferred job classes are recommended, including most probable job classes and similar job classes.
In the above method for improving accuracy of search and recommendation results, the algorithm recommended in step S40 is as follows:
[formula not reproduced]
where $FD_{Qk}$ is the recommended quantized complexity value, $d_{kij}$ is the knowledge-graph data of the k-th recommended component set in the column and row directions, $p_{ki}$ is the complexity value of the knowledge-graph data of the k-th recommended component set in the column direction, and $p_{kj}$ is the complexity value of the knowledge-graph data of the k-th recommended component set in the row direction.
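Independently of the quantization formula, the output shape of step S40 (one most probable job class plus a short list of similar job classes) can be sketched as follows. This is a hypothetical illustration, not the patent's algorithm: the function name, the per-class probability input, and the `margin` and `top_similar` parameters are assumptions.

```python
def recommend_job_classes(scores, top_similar=2, margin=0.15):
    """Pick the most probable job class and nearby-scoring similar classes.

    scores: dict mapping job class name to a model probability.
    top_similar: maximum number of similar classes to return (ASSUMPTION).
    margin: a class counts as "similar" if its score is within this
        distance of the best score (ASSUMPTION).
    """
    if not scores:
        raise ValueError("scores must not be empty")
    # Rank classes from highest to lowest probability
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    best_class, best_score = ranked[0]
    similar = [name for name, s in ranked[1:] if best_score - s <= margin]
    return best_class, similar[:top_similar]
```

Given scores of 0.40 for "accountant", 0.30 for "cashier", 0.28 for "auditor", and 0.02 for "driver", the sketch returns "accountant" with "cashier" and "auditor" as similar classes.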
Specifically, to improve the user experience and the intensive management of knowledge graph data, a cloud platform is additionally provided. The cloud platform comprises a user login unit, an identity library, a display unit, a processor, a data grabbing unit, a deviation data analysis unit, a data collection unit, and a data temporary storage unit. The user login unit is used for entering the user's identity information and the corresponding key information; the identity library stores the standard identity information of approved users and their corresponding approved key information. The user login unit transmits the identity information and the corresponding key information to the processor, and the processor, in combination with the identity library, performs device verification on the identity information and key information to generate a pass signal or a device error signal. When the processor generates a device error signal, it drives the display unit to display "the device in use is not trusted, please check"; when the processor generates an initial error signal, it drives the display unit to display "identity key error, please verify". When the pass signal is generated, the processor uses the personal library to perform data grabbing for the identity information. The data collection unit collects access information groups, each formed from multiple pieces of a user's access information, where the access information is the content the user accessed when visiting websites; the data collection unit transmits each access information group, together with the corresponding identity information, to the data temporary storage unit for storage. The deviation data analysis unit analyzes the access information groups stored in the data temporary storage unit and their corresponding identity information to obtain all the sequence access information corresponding to the identity information. The data grabbing unit is connected to the Internet and acquires Internet information in real time. The deviation data analysis unit transmits the sequence access information to the personal library, and the processor makes recommendations for the identity information by combining the sequence access information in the personal library with the data grabbing unit. The above arrangement strengthens individual users' access to the cloud platform after intensive processing.
Example 2
Formal application
FIG. 2 shows the job posting interface of the product, on which the job category, the industry requirement, and the education and experience requirements can easily be set.
FIG. 3 is a model calculation diagram of the present invention, illustrated by way of example as a financial accounting position; fig. 4 is a resume position presentation view of the present invention, taking a financial accounting position as an example.
As shown, taking the accounting position as an example, the job description reads as follows:
1. responsible for collecting the financial statements of the company's products, bookkeeping, filing and tax reporting;
2. able to cooperate well in handling finance- and business-related matters;
3. coordinating with other departments;
4. completing other temporarily assigned work;
5. proficient in financial software such as Yonyou and Kingdee.
Meanwhile, for the financial accounting position, the job classes that the model may confuse are characterized at three levels:
first-level job features
second-level job features
third-level job features.
FIG. 5 is a JD dimension summary feature map of the present invention; FIG. 6 is a CV dimension overall characteristic diagram of the present invention.
Next, as shown in FIG. 5 and FIG. 6, the JD/CV information submitted by users includes cases where the job name or the three-level job class is inconsistent with the job description; such information may result from malicious order-brushing by users. To reduce the successful publication rate of such low-quality JDs/CVs, a low-quality job/resume checking process is run, with the aim of improving the overall quality of JDs/CVs on the platform.
Based on the existing JD and CV libraries, the invention first obtains the thresholds for low-quality JDs and CVs through data statistics and designs a screening flow for low-quality JDs and CVs; it then removes the low-quality data from the sample library by means of data scripts with manual assistance, clearing obstacles for subsequent machine learning. The scope of the check covers JDs and CVs generated online by B-end and C-end users at the submission and publication nodes.
The correlation x is the similarity between the job name and the job description;
x1 = the minimum value at which a JD/CV can be determined not to hit the low-quality tag, i.e., when x > x1, it is directly determined to be a normal JD/CV;
x2 = the maximum value at which a JD/CV can be determined to hit the low-quality tag, i.e., when x < x2, it is directly determined to be a low-quality JD/CV.
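The two-threshold screening can be sketched as follows. This is an illustrative reading, not the patented implementation: it assumes that x2 ≤ x1, that a correlation above x1 means a normal JD/CV, and that values between the two thresholds are routed to manual review (the description mentions manual assistance but does not specify this routing). The function name and the threshold values are hypothetical.

```python
def screen_by_correlation(x: float, x1: float, x2: float) -> str:
    """Classify a JD/CV by the similarity x between its job name and description.

    x1: minimum value at which the item is taken as not hitting the low-quality tag.
    x2: maximum value at which the item is taken as hitting the low-quality tag.
    """
    if x2 > x1:
        raise ValueError("expected x2 <= x1")
    if x > x1:
        return "normal"
    if x < x2:
        return "low_quality"
    # Assumption: undecided items fall back to manual review.
    return "manual_review"


# Hypothetical thresholds obtained from data statistics.
X1, X2 = 0.6, 0.2
for score in (0.9, 0.4, 0.1):
    print(score, screen_by_correlation(score, X1, X2))
```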
As shown in FIG. 5 and FIG. 6, JD & CV content understanding gives priority to supporting the [recommendation B-end vector recall experiment]. For both the offline and the real-time version of the established scheme, there are the following data warehousing requirements:
Scheme I: offline model
Output of content understanding in this period: the results need to be written to Hive tables, following the JD/CV understanding output specification; the higher-priority portion is used for offline training of the B-end JD/CV whole-vector model.
Scheme II: real-time model
As shown in fig. 7, a keyword recognition flowchart of NLP in the marking process is illustrated.
FIG. 8 is a keyword sample of the present invention; FIG. 9 is a keyword cluster map of the present invention; fig. 10 is a scoring standard chart of the present invention.
The application of NLP job description keywords (hereinafter NLP keywords) is expanded on the end side and the strategy side, and the NLP keywords are further optimized on the basis of current capabilities to obtain the expected benefit: NLP keyword accuracy improves by 10%. Other results are as follows: top-3 accuracy is 77.4% and top-10 accuracy is 66%.
The vector model is shown in FIG. 13 and FIG. 14. Vectors are trained with a two-tower model: the two-tower structure is trained on query-title data, and the title tower is used for computation in JD/CV understanding. The words extracted by the word model are given a vectorized representation. Inputs are packed into batches so that the model can compute in parallel, which improves performance. The model structure is as follows:
1. The title and description of the job position or work experience are concatenated and fed into the model.
2. The input is encoded with the pre-trained model BERT to obtain a document-level vector.
3. The document-level vector is passed through softmax multi-class classification and the loss is calculated.
4. The final output is the probability distribution over job classes for the current input; the job class with the highest probability is selected from this distribution and enters a post-processing flow, and the post-processed job class is used as the final output.
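Steps 3 and 4 can be illustrated with a small pure-Python sketch of softmax classification, cross-entropy loss and selection of the most probable job class. The BERT encoder and the post-processing flow are outside its scope; the class labels and the logits are hypothetical values standing in for the output of a classification layer applied to the document-level vector.

```python
import math


def softmax(logits):
    """Turn raw class scores into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, target_index):
    """Loss for a single sample whose true class is target_index."""
    return -math.log(probs[target_index])


def top_k(probs, labels, k):
    """Return the k most probable (label, probability) pairs."""
    ranked = sorted(zip(labels, probs), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]


# Hypothetical job classes and classifier scores for one input.
labels = ["accounting", "cashier", "auditing"]
logits = [2.0, 1.0, 0.1]

probs = softmax(logits)
loss = cross_entropy(probs, labels.index("accounting"))
best_label, best_prob = top_k(probs, labels, 1)[0]
print(best_label, round(best_prob, 3), round(loss, 3))
```

The selected label is what would then be handed to post-processing.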
FIG. 15 is a system architecture diagram of the present invention; fig. 16 is a deployment framework of the present invention.
As shown in FIG. 15, the system architecture is built on the existing microservice system and is divided into three layers: a technical platform layer, a business service layer and an end layer. The mobile end is split by user group into independent apps such as the C end, B end, sales end, management end and operation end. The server side is extended and optimized for the new requirements on the basis of the existing architecture, and provides basic public services such as front-end data services, information encryption, privacy protection and permission authentication. The basic platform builds on the existing data warehouse and machine learning platform: labeled data is imported into the training system, and the model is trained continuously with new samples and attributes. The trained model is released to a verification environment for A/B testing, and is optimized and adjusted according to the test feedback. Third-party general interfaces, such as a face recognition service interface and the Sesame Credit interface, are uniformly encapsulated at the basic platform layer, and the standard service interfaces of other projects and product lines can also be used to serve this project, which improves the scalability and maintainability of the system.
FIG. 17 is a physical architecture diagram of the present invention. As shown in FIG. 17, the system is deployed in the company's unified cloud environment. While reusing the service resources of the earlier system, additional service nodes are added to meet the system's testing and gray release needs. During business peak periods, the internal operation and maintenance management system is required to provide automatic scale-in and scale-out.
FIG. 18 is a content analysis diagram of the present invention. As shown in FIG. 18, content analysis of the core unstructured fields is as follows. Content: the core unstructured fields are deeply understood, and keywords are extracted, weighted and given vector representations; the keywords include three-level job classes, job names, company names, skill keywords and the like. Application: used in downstream scenarios such as search and recommendation, applied at the recall and ranking layers to improve head results and enlarge the recall volume.
Content association
Relationship processing is performed on the analyzed content, supplemented by analysis of non-core unstructured information:
KG capability improvement:
Content: combining the analyzed information, a long-term, sustainable KG information production pipeline is built to improve the KG, for example company aliases, skills, schools, majors and job classes.
Application: KG improvement.
Information check:
Content: the consistency of same-dimension features is judged, for example the consistency between the JD title and the JD three-level job class, between the JD education requirement and the education level extracted from the JD description, and between the job name or job class and the work content in the work history of the user's CV.
Application: (1) ranking optimization in the strategy or model, improving head results; (2) form-filling guidance, error-correction reminders and the like on the service end.
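One of the information checks named above, the consistency between the JD education requirement and the education level extracted from the JD description, can be sketched as below. The keyword vocabulary, the regular-expression matching and the function names are assumptions for illustration; the patent does not specify how the extraction is done.

```python
import re

# Hypothetical vocabulary of education levels, highest first.
EDUCATION_LEVELS = ["doctorate", "master", "bachelor", "associate"]


def extract_education(description: str):
    """Return the first education level mentioned in the text, or None."""
    text = description.lower()
    for level in EDUCATION_LEVELS:
        if re.search(r"\b" + re.escape(level), text):
            return level
    return None


def education_consistent(structured_requirement: str, description: str):
    """Compare the structured field with the level found in the free text.

    Returns None when the description mentions no known level, so the
    caller can skip the check instead of flagging a mismatch.
    """
    extracted = extract_education(description)
    if extracted is None:
        return None
    return extracted == structured_requirement.lower()


print(education_consistent("bachelor", "Bachelor's degree or above in accounting"))
```

A mismatch could then feed the form-filling guidance or error-correction reminder mentioned in the application item.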
Content mining
Based on the first two stages of work, further mining can be done by combining JD/CV historical behavior data with other content such as chats, for example:
Quality evaluation:
Content: the content quality of JDs and CVs is evaluated across dimensions such as compliance of the filled-in content, completeness, update time and frequency, and chat content; e.g., JD job-seeking risk assessment, CV black-market assessment, and CV & JD content quality assessment (such as identifying low-quality CVs).
Application: (1) recall restriction, ranking demotion or traffic control for low-quality, high-risk JDs & CVs in the strategy or model; (2) JD & CV competitiveness assessment, content rewriting guidance and the like on the service end.
Preference mining and prediction:
Content: for example, factory preference, competition preference and stability preference in JD recruitment; distance preference, city preference and factory preference of CVs.
Application: (1) preferences are applied in the strategy or model to promote the positive chain; (2) job-seeking path prediction.
As one of the largest professional recruitment service platforms in China, Zhilian Zhaopin has collected a large amount of JD and CV data, has performed NLP semantic analysis and sample labeling on the text content of JDs and CVs, and, through NLP analysis and machine learning, has applied artificial intelligence technology to fields such as resume search and job recommendation. Judging from current operating results, however, the system's job class prediction rate and its accuracy and recall remain consistently lower than those of competing products.
The present system already performs NLP analysis on JDs, but its semantic analysis is not accurate enough; in particular, the understanding of three-level job classes, job names, company names, skill keywords and the like is still highly ambiguous and error-prone, which lowers accuracy in downstream recall and ranking scenarios.
The knowledge graphs for JDs and CVs are inconsistent, so the accuracy of the front-end search and recommendation algorithms is very low; the problem is more serious for special positions (three-level job classes) in specific industries.
Unstructured data in JD and CV communication scenarios is currently under-mined: key information such as chat frequency, chat content, black-market activity and matching degree is insufficiently mined and analyzed, leaving data assets wasted and idle.
To raise the application value of the data assets, improve NLP analysis accuracy and improve knowledge graph consistency, the invention upgrades and reworks the current NLP and KG (knowledge graph) so as to improve retrieval efficiency and matching accuracy.
Project goal
The following project targets are expected to be realized through means of KG upgrading, labeling and learning of incremental data samples, NLP algorithm optimization and the like:
a) CV job class prediction indices:
the top-1 accuracy-recall rate reaches 90.4% (top-1 accuracy-recall rate = number of times the top-1 job class prediction appears in the labeled job class set / total number of valid samples);
the top-4 accuracy reaches 96.8% (top-4 accuracy = number of times a top-4 job class prediction appears in the labeled job class set / total number of valid samples);
the user-selection hit rate reaches 86.4% (user-selection hit rate = number of times the user-selected secondary job class appears in the labeled job class set / total number of valid samples);
the rate at which the user selection hits the top-1 job class prediction reaches 60.1% (= number of times the user-selected three-level job class appears in the top-1 job class prediction / total number of valid samples);
the rate at which the user selection hits the top-4 job class predictions reaches 80.7% (= number of times the user-selected three-level job class appears in the top-4 job class predictions / total number of valid samples);
b) JD job class prediction indices:
the top-1 accuracy-recall rate reaches 92.7% (top-1 accuracy-recall rate = number of times the top-1 job class prediction appears in the labeled job class set / total number of valid samples);
the top-4 accuracy reaches 98.3% (top-4 accuracy = number of times a top-4 job class prediction appears in the labeled job class set / total number of valid samples);
the user-selection hit rate reaches 89.7% (user-selection hit rate = number of times the user-selected three-level job class appears in the labeled job class set / total number of valid samples);
the rate at which the user selection hits the top-1 job class prediction reaches 68.0% (= number of times the user-selected three-level job class appears in the top-1 job class prediction / total number of valid samples);
the rate at which the user selection hits the top-4 job class predictions reaches 87.6% (= number of times the user-selected three-level job class appears in the top-4 job class predictions / total number of valid samples).
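The top-k indices defined above all share one shape: a count of samples whose top-k predictions intersect a reference set, divided by the number of valid samples. A minimal sketch of that computation follows; the function name and the three-sample data set are hypothetical and are not the evaluation data behind the reported percentages.

```python
def top_k_hit_rate(ranked_predictions, reference_sets, k):
    """Share of samples whose top-k predictions contain a reference job class.

    ranked_predictions: one list per sample, most probable job class first.
    reference_sets: one set per sample (labeled classes or the user's choice).
    """
    if len(ranked_predictions) != len(reference_sets):
        raise ValueError("each sample needs predictions and a reference set")
    if not ranked_predictions:
        raise ValueError("at least one valid sample is required")
    hits = sum(
        1
        for ranked, reference in zip(ranked_predictions, reference_sets)
        if any(job_class in reference for job_class in ranked[:k])
    )
    return hits / len(ranked_predictions)


# Hypothetical evaluation data with three valid samples.
predictions = [
    ["accounting", "cashier", "auditing"],
    ["cashier", "accounting", "auditing"],
    ["auditing", "cashier", "tax"],
]
labeled = [{"accounting"}, {"accounting", "finance manager"}, {"accounting"}]

print(top_k_hit_rate(predictions, labeled, 1))
print(top_k_hit_rate(predictions, labeled, 2))
```

Passing the set of user-selected job classes as `reference_sets` gives the user-selection variants of the index.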
As can be seen from a comparison of the schematic results of fig. 19 and 20, the resume prediction effect is good; meanwhile, as can be seen by combining the schematic comparison results of fig. 21 and 22, the matching effect between the resume and the job position is good.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein.
Furthermore, it should be understood that although this specification is described in terms of embodiments, not every embodiment contains only one independent technical solution. This manner of description is adopted for clarity only; those skilled in the art should treat the specification as a whole, and the technical solutions in the embodiments may be combined as appropriate to form other embodiments that will be apparent to those skilled in the art.

Claims (1)

1. The method for improving the accuracy of the search and recommendation results is characterized by comprising the following steps:
s10: the data quality of the knowledge graph is improved through double-chain data, and the data accuracy of the knowledge graph is cleaned and improved through a clustering method;
s20: training by using a pre-training model, and integrating the training with the knowledge of the job tree in advance;
s30: the multitask training mode is used, so that the confusion degree of the pre-training model is reduced;
s40: recommending preferred job classes, including most probable job classes and similar job classes;
the double-chain data in step S10 are grouped by using the occurrence frequency of the keywords in the basic data;
the weight formula for the grouping of the double-chain data described in step S10 is as follows:
$W_{pf}(i) = pf_i \cdot idf_i / if_i$
where $W_{pf}(i)$ represents the weight value of the i-th group of keywords, $pf_i$ represents the occurrence frequency of the i-th group of keywords, $idf_i$ represents the ratio between the number of groups of structured double-chain data and the number of groups of unstructured double-chain data, and $if_i$ represents the inverse frequency;
wherein $if_i$ in step S10 is calculated as follows:
[formula not reproduced in the source text]
where N represents the total number of basic data, and $df_i$ represents the number of occurrences of the i-th group of keywords in the basic data;
the clustering method in step S10 is as follows:
performing feature representation processing on the structured double-chain data and the semi-structured double-chain data, wherein the feature representation processing applies the following algorithm to the weight value of the i-th group of keywords:
[formula not reproduced in the source text]
where P(S) represents the distribution probability of the weight values of the keywords of all groups, S represents the total sequence of the weight values of the keywords of all groups, and $w_i$ represents the serial number of the i-th group of keywords, with 1 ≤ i ≤ n;
the method of pre-training the model in step S20 is as follows:
the double-chain data which is structured after the feature representation processing and the semi-structured double-chain data are sent to an NLP service center;
finally, the NLP service center applies BP network model optimization to the screened, rule-conforming feature-represented double-chain data to obtain a first entity relationship;
the optimization algorithm using the BP network model is as follows:
[formula not reproduced in the source text]
where $G_i$ represents the optimized first entity relationship degree value, N represents the sum of the statistics of all groups of keywords, and $P_i^n$ represents the distribution probability of the weight values of the keywords of all groups;
the manner of integration in step S20 is as follows:
optimizing a scheduling layer, establishing a model by using a max-min mathematical algorithm, and determining each single objective function to obtain blended scheduling benefit;
the multi-task training method in step S30 is as follows:
determining each single objective function, including a first objective function, a second objective function, a third objective function and a fourth objective function;
wherein the first objective function is $F_1(q_{v1}) = (q_{v11} - q_{v0}) / q_{v12}$, where $q_{v0}$, $q_{v1}$, $q_{v11}$ and $q_{v12}$ are planned task values for different periods;
the second objective function is [formula not reproduced in the source text], where $F_i(q_{vi})$ is the reference flow value, calculated by a weight-coefficient transformation method with assigned weights;
the third objective function is $V_{i,j+1} = V_{i,j} + (Q_{i,j} - q_{i,j} - Q_{loss\,i,j})$, where $V_{i,j}$ is the data volume calculated for the i-th cloud in the j-th time period, $V_{i,j+1}$ is the data volume calculated for the i-th cloud in the (j+1)-th time period, $Q_{i,j}$ is the data entry amount calculated for the i-th cloud in the j-th time period, $q_{i,j}$ is the data leakage amount calculated for the i-th cloud in the j-th time period, and $Q_{loss\,i,j}$ is the data loss amount calculated for the i-th cloud in the j-th time period;
the fourth objective function is [formula not reproduced in the source text], where $E_{sm}$ is the energy value calculated for the cloud, $V_{m,T}$ is the effective storage capacity of the scheduling period, $\gamma_{m,T}$ is the data volume in the scheduling period, m is the number of the cloud computing m in period T, and T is the total number of the cloud computing m in period T;
the algorithm recommended in step S40 is as follows:
[formula not reproduced in the source text]
where $FD_{Qk}$ represents the recommended quantized complexity level value, $d_{kij}$ represents the knowledge-graph data of the recommended k-th component set in the column and row directions, $p_{ki}$ represents the complexity value of the knowledge-graph data in the column direction of the recommended k-th component set, and $p_{kj}$ represents the complexity value of the knowledge-graph data in the row direction of the recommended k-th component set.
CN202310981457.4A 2023-08-04 2023-08-04 Method for improving accuracy of search and recommendation results Active CN117171428B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310981457.4A CN117171428B (en) 2023-08-04 2023-08-04 Method for improving accuracy of search and recommendation results

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310981457.4A CN117171428B (en) 2023-08-04 2023-08-04 Method for improving accuracy of search and recommendation results

Publications (2)

Publication Number Publication Date
CN117171428A CN117171428A (en) 2023-12-05
CN117171428B true CN117171428B (en) 2024-04-05

Family

ID=88943919

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310981457.4A Active CN117171428B (en) 2023-08-04 2023-08-04 Method for improving accuracy of search and recommendation results

Country Status (1)

Country Link
CN (1) CN117171428B (en)

Citations (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104834668A (en) * 2015-03-13 2015-08-12 浙江奇道网络科技有限公司 Position recommendation system based on knowledge base
CN106294568A (en) * 2016-07-27 2017-01-04 北京明朝万达科技股份有限公司 A kind of Chinese Text Categorization rule generating method based on BP network and system
CN106485054A (en) * 2016-09-21 2017-03-08 广东工业大学 Intelligent diagnostics data analysing method based on BP neural network algorithm and system
CN107590133A (en) * 2017-10-24 2018-01-16 武汉理工大学 The method and system that position vacant based on semanteme matches with job seeker resume
CN108920544A (en) * 2018-06-13 2018-11-30 桂林电子科技大学 A kind of personalized position recommended method of knowledge based map
CN111190968A (en) * 2019-12-16 2020-05-22 北京航天智造科技发展有限公司 Data preprocessing and content recommendation method based on knowledge graph
CN111698207A (en) * 2020-05-07 2020-09-22 北京华云安信息技术有限公司 Method, equipment and storage medium for generating knowledge graph of network information security
KR20200141919A (en) * 2019-06-11 2020-12-21 주식회사 에이아이앤잡 Method for machine learning train set and recommendation systems to recommend the scores to match between the recruiter and job seekers, and to give the scores of matching candidates to recruiters and to give the pass scores to job seekers respectively
CN112463980A (en) * 2020-11-25 2021-03-09 南京摄星智能科技有限公司 Intelligent plan recommendation method based on knowledge graph
CN115456584A (en) * 2022-09-16 2022-12-09 深圳今日人才信息科技有限公司 Similar JD recall and recommendation method based on deep learning model and expert system
CN115526590A (en) * 2022-09-16 2022-12-27 深圳今日人才信息科技有限公司 Efficient human-sentry matching and re-pushing method combining expert knowledge and algorithm
CN116127186A (en) * 2022-12-09 2023-05-16 之江实验室 Knowledge-graph-based individual matching recommendation method and system for person sentry

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
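Research progress on intelligent diagnosis and prescription recommendation technology for crop diseases; Zhang Lingxian et al.; Transactions of the Chinese Society for Agricultural Machinery; 2023-06-16; Vol. 54 (No. 06); 1-18 *
Exploration and research on improving university service capabilities based on knowledge graphs; Sun Zhaoqun et al.; East China Science & Technology; 2022-08-05 (No. 08); 84-89 *
Construction of a financial securities knowledge graph and discovery of related stocks from the perspective of knowledge association; Liu Zhenghao et al.; Data Analysis and Knowledge Discovery; 2021-12-11; Vol. 6 (No. Z1); 184-201 *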

Also Published As

Publication number Publication date
CN117171428A (en) 2023-12-05


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant