Disclosure of Invention
The present disclosure solves the technical problem that the prior art cannot meet users' requirements for expert recommendation.
In order to achieve the above technical purpose, the present disclosure provides an expert recommendation method oriented to enterprise needs, which includes:
collecting expert thesis data and enterprise demand data;
preprocessing the collected expert thesis data and the enterprise demand data to obtain expert information and demand information;
extracting keywords from the preprocessed expert information and the preprocessed demand information to obtain expert characteristic information and demand characteristic information;
constructing a feature vector model according to the expert feature information and the demand feature information after the keyword extraction;
and carrying out similarity calculation analysis according to the feature vectors in the feature vector model to obtain an expert recommendation result.
Further, the collecting the expert thesis data and the enterprise requirement data specifically includes:
collecting title, abstract and/or keyword data of expert papers from a paper database, and collecting title, keyword and/or requirement-detail data of enterprise requirements from a selected online website.
Further, the preprocessing the collected expert thesis data and the collected enterprise demand data specifically includes:
respectively performing word segmentation on the expert thesis data and the enterprise demand data by adopting an LTP model to obtain expert word segmentation data and enterprise word segmentation data;
removing stop words from the segmented expert data and the segmented enterprise data;
and merging the repeated information in the data without the stop words to respectively obtain the expert information and the demand information.
Further, the extracting the keywords from the preprocessed expert information and the preprocessed requirement information specifically includes:
extracting keywords from the expert information and the demand information respectively by adopting an LDA (Latent Dirichlet Allocation) model, and obtaining a keyword list for each piece of expert information and each piece of demand information to serve as the expert characteristic information and the demand characteristic information.
Further, the constructing a feature vector model according to the expert feature information and the requirement feature information after the keyword extraction specifically includes:
performing feature extraction on the expert feature information and the demand feature information by using a TF-IDF algorithm to obtain an information subject term;
and performing feature selection on the information subject term, and constructing a feature vector and a feature vector model based on the selected feature subject term.
Further, the obtaining of the expert recommendation result by performing similarity calculation analysis according to the feature vectors in the feature vector model specifically includes:
on the basis of cosine similarity analysis, introducing a calculation factor based on the number of feature words shared by the two texts relative to the total length of their feature vectors, and performing calculation and analysis on the feature vectors in the feature vector model to obtain an expert recommendation result;
wherein the similarity is calculated by the following formula:
Sim(D, E) = sim(D, E) + c × N(D, E) / Min(D, E)
wherein c is a proportional adjustment coefficient, N(D, E) represents the number of feature words shared by the requirement information D and the expert information E, Min(D, E) represents the smaller of the total number of features of D and the total number of features of E, and sim(D, E) represents the cosine similarity of D and E.
In order to achieve the above technical object, the present disclosure can also provide an expert recommendation apparatus for enterprise needs, including:
the data collection module is used for collecting expert thesis data and enterprise demand data;
the preprocessing module is used for preprocessing the collected expert thesis data and the enterprise demand data to obtain expert information and demand information;
the keyword extraction module is used for extracting keywords from the preprocessed expert information and the preprocessed demand information to obtain expert characteristic information and demand characteristic information;
the vector model construction module is used for constructing a feature vector model according to the expert feature information and the requirement feature information after the keyword extraction;
and the similarity analysis module is used for carrying out similarity calculation analysis according to the feature vectors in the feature vector model to obtain an expert recommendation result.
Further, the collecting the expert thesis data and the enterprise requirement data specifically includes:
collecting title, abstract and/or keyword data of expert papers from a paper database, and collecting title, keyword and/or requirement-detail data of enterprise requirements from a selected online website.
To achieve the above technical objects, the present disclosure can also provide a computer storage medium having a computer program stored thereon, the computer program being executed by a processor to implement the steps of the above expert recommendation method for enterprise needs.
In order to achieve the above technical object, the present disclosure further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor implements the steps of the expert recommendation method for enterprise needs when executing the computer program.
The beneficial effects of the present disclosure are:
the present disclosure provides an enterprise expert recommendation method based on the LDA topic model. The topic model is used to extract features from expert information and enterprise requirements; expert-field feature vectors and enterprise-requirement feature vectors are constructed from the feature keywords; and experts in relevant fields are recommended to enterprises based on the similarity between the two kinds of feature vectors, thereby effectively avoiding the semantic-drift problem of mechanical retrieval.
Detailed Description
Hereinafter, embodiments of the present disclosure will be described with reference to the accompanying drawings. It should be understood that the description is illustrative only and is not intended to limit the scope of the present disclosure. Moreover, in the following description, descriptions of well-known structures and techniques are omitted so as to not unnecessarily obscure the concepts of the present disclosure.
Various structural schematics according to embodiments of the present disclosure are shown in the figures. The figures are not drawn to scale, wherein certain details are exaggerated and possibly omitted for clarity of presentation. The shapes of various regions, layers, and relative sizes and positional relationships therebetween shown in the drawings are merely exemplary, and deviations may occur in practice due to manufacturing tolerances or technical limitations, and a person skilled in the art may additionally design regions/layers having different shapes, sizes, relative positions, as actually required.
The present disclosure relates to the interpretation of terms:
TF-IDF: term frequency-inverse document frequency, a weighting technique commonly used in information retrieval and data mining. TF stands for Term Frequency and IDF for Inverse Document Frequency.
LDA: Latent Dirichlet Allocation.
LTP: Language Technology Platform.
Example one:
As shown in fig. 1:
the disclosure provides an expert recommendation method facing enterprise requirements, which comprises the following steps:
S1: collecting expert thesis data and enterprise demand data;
specifically, the collecting expert thesis data and enterprise demand data specifically includes:
collecting title, abstract and/or keyword data of expert papers from a paper database, and collecting title, keyword and/or requirement-detail data of enterprise requirements from a selected online website.
Preferably, platforms such as CNKI, Wanfang, and Aminer are selected as information sources for acquiring expert document information, collecting the titles, abstracts, keywords, and other information of papers published by experts. The "scientist online" website is selected as the data source for enterprise demand information, collecting the demand title, demand keywords, demand details, and other information.
S2: preprocessing the collected expert thesis data and the enterprise demand data to obtain expert information and demand information;
specifically, the preprocessing the collected expert thesis data and the collected enterprise requirement data specifically includes:
respectively performing word segmentation on the expert thesis data and the enterprise demand data by adopting an LTP model to obtain expert word segmentation data and enterprise word segmentation data;
removing stop words from the segmented expert data and the segmented enterprise data;
and merging the repeated information in the data without the stop words to respectively obtain the expert information and the demand information.
The word segmentation and stop-word removal in the preprocessing are methods commonly used in the natural language processing field. Besides the above implementation, a Conditional Random Field (CRF) segmenter can also be used to segment the expert thesis data and the enterprise demand data respectively to obtain the expert word segmentation data and the enterprise word segmentation data.
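The stop-word removal and duplicate merging described above can be sketched as follows. This is an illustrative sketch only: it assumes segmentation has already been performed by an external segmenter (e.g. LTP or a CRF model), and the stop list shown is a placeholder, not part of the disclosure.

```python
# Minimal preprocessing sketch: stop-word removal and duplicate merging.
# `token_lists` is assumed to be already-segmented text (one token list
# per expert-paper or enterprise-demand record).

STOP_WORDS = {"the", "of", "and", "a", "in", "for"}  # placeholder stop list

def preprocess(token_lists):
    """Remove stop words from each tokenized record, then merge duplicates."""
    cleaned = []
    seen = set()
    for tokens in token_lists:
        filtered = tuple(w for w in tokens if w.lower() not in STOP_WORDS)
        if filtered and filtered not in seen:  # merge repeated records
            seen.add(filtered)
            cleaned.append(list(filtered))
    return cleaned
```

The same routine would be applied separately to the expert data and the demand data to yield the expert information and the demand information.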
S3: extracting keywords from the preprocessed expert information and the preprocessed demand information to obtain expert characteristic information and demand characteristic information;
specifically, the extracting keywords from the preprocessed expert information and the preprocessed demand information specifically includes:
extracting keywords from the expert information and the demand information respectively by adopting an LDA (Latent Dirichlet Allocation) model, and obtaining a keyword list for each piece of expert information and each piece of demand information to serve as the expert characteristic information and the demand characteristic information.
LDA (Latent Dirichlet Allocation) is a document topic generation model, also called a three-layer Bayesian probability model, comprising a three-layer structure of words, topics, and documents. "Generative model" means that each word of an article is considered to be obtained through a process of "selecting a topic with a certain probability, then selecting a word from that topic with a certain probability". Document-to-topic follows a multinomial distribution, and topic-to-word also follows a multinomial distribution.
LDA is an unsupervised machine learning technique that can be used to identify underlying topic information in large-scale document collections (document collections) or corpora (corpus). It adopts bag of words (bag of words) method, which treats each document as a word frequency vector, thereby converting text information into digital information easy to model. The bag-of-words approach does not take into account word-to-word ordering, which simplifies the complexity of the problem and also provides opportunities for model improvement. Each document represents a probability distribution of topics, and each topic represents a probability distribution of words.
For each document in the corpus, LDA defines the generation process as follows:
1. extract a topic from the topic distribution of the document;
2. extract a word from the word distribution corresponding to the extracted topic;
3. repeat the above process until every word in the document has been traversed.
Each document in the corpus corresponds to a multinomial distribution over T topics (T is given in advance, e.g. by trial and error), denoted as θ. Each topic in turn corresponds to a multinomial distribution over the V words in the vocabulary, denoted as φ.
First define the meanings of some symbols: the document set D and the topic set T.
Each document d in D is treated as a word sequence ⟨w1, w2, ..., wn⟩, where wi denotes the i-th word and d contains n words. (Internally, LDA uses a bag-of-words representation: the position at which each word appears has no influence on the LDA algorithm.)
All the distinct words appearing in D form a large vocabulary set VOC. LDA takes the document set D as input and produces two kinds of result vectors (assuming k topics in total and m words in VOC):
for each document d in D, the probabilities θd = ⟨pt1, ..., ptk⟩ of d corresponding to the different topics, where pti represents the probability that d corresponds to the i-th topic in T. The calculation is intuitive: pti = nti/n, where nti denotes the number of words in d assigned to the i-th topic and n is the total number of words in d.
for each topic t in T, the probabilities φt = ⟨pw1, ..., pwm⟩ of t generating the different words, where pwi represents the probability that t generates the i-th word in VOC. The calculation is likewise straightforward: pwi = Nwi/N, where Nwi represents the number of occurrences of the i-th word in VOC assigned to topic t, and N represents the total number of words assigned to topic t.
The core formula of LDA is as follows:
p(w|d)=p(w|t)*p(t|d)
Intuitively, the formula uses the topics as an intermediate layer: given the current θd and φt, the probability of the word w appearing in the document d can be computed, where p(t|d) is calculated from θd and p(w|t) is calculated from φt.
In practice, using the current θd and φt, we can calculate p(w|d) for a word in a document under each candidate topic, and then update the topic assigned to that word based on these results. If this update changes the topic assigned to the word, it in turn affects θd and φt.
At the start of the LDA algorithm, θd and φt are randomly initialized (for all d and t). The above process is then repeated, and the final converged result is the output of LDA. The iterative learning process is described in more detail:
1. for the i-th word wi in a particular document ds, assuming the topic currently assigned to the word is tj, the above formula can be rewritten as:
pj(wi|ds)=p(wi|tj)*p(tj|ds)
2. we can now enumerate the topics in T and obtain all pj(wi|ds), where j takes values 1 to k. A topic can then be selected for the i-th word wi in ds based on these probability values. The simplest idea is to take the tj that maximizes pj(wi|ds) (note that only j is a variable here), i.e. argmax[j] pj(wi|ds).
3. then, if the i-th word wi in ds selects a topic different from the original one, this affects θd and φt (as is easily seen from the calculation formulas of the two vectors given earlier), and their change in turn affects the calculation of p(w|d) mentioned above. Calculating p(w|d) once for all words w in all documents d in D and reselecting their topics is regarded as one iteration. After n iterations, the converged result is the desired output of LDA.
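The iterative process above can be sketched in Python. This is a toy batch variant of the described update, with a small smoothing constant added to avoid zero probabilities; production LDA implementations use collapsed Gibbs sampling or variational inference instead, and all names here are illustrative.

```python
import random

def lda_iterate(docs, k, iters=20, seed=0):
    """Toy LDA in the iterative style described above: random topic init,
    then repeatedly reassign each word to argmax_j p(w|t_j) * p(t_j|d)."""
    rng = random.Random(seed)
    vocab = sorted({w for doc in docs for w in doc})
    # z[d][i]: topic currently assigned to the i-th word of document d
    z = [[rng.randrange(k) for _ in doc] for doc in docs]
    for _ in range(iters):
        # rebuild count matrices from the current assignments
        ndt = [[0] * k for _ in docs]                     # doc-topic (theta)
        ntw = [{w: 0 for w in vocab} for _ in range(k)]   # topic-word (phi)
        nt = [0] * k
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                t = z[d][i]
                ndt[d][t] += 1
                ntw[t][w] += 1
                nt[t] += 1
        for d, doc in enumerate(docs):
            n = len(doc)
            for i, w in enumerate(doc):
                # p_j(w|d) = p(w|t_j) * p(t_j|d), with 0.1 smoothing
                scores = [
                    ((ntw[t][w] + 0.1) / (nt[t] + 0.1 * len(vocab)))
                    * ((ndt[d][t] + 0.1) / (n + 0.1 * k))
                    for t in range(k)
                ]
                z[d][i] = max(range(k), key=lambda t: scores[t])
    # theta_d: per-document topic distribution from the final assignments
    theta = []
    for d, doc in enumerate(docs):
        counts = [0] * k
        for t in z[d]:
            counts[t] += 1
        theta.append([c / len(doc) for c in counts])
    return theta, z
```

Each pass over all words corresponds to one iteration in step 3 above; θd is read off from the converged assignments.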
S4: constructing a feature vector model according to the expert feature information and the demand feature information after the keyword extraction;
specifically, the constructing a feature vector model according to the expert feature information and the demand feature information after the keyword extraction specifically includes:
performing feature extraction on the expert feature information and the demand feature information by using the TF-IDF algorithm to obtain information subject terms;
and performing feature selection on the information subject term, and constructing a feature vector and a feature vector model based on the selected feature subject term.
TF-IDF is a statistical method for evaluating the importance of a word to a document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to its frequency in the corpus. Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the relevance between a document and a user query. In addition to TF-IDF, Internet search engines also use ranking methods based on link analysis to determine the order in which documents appear in search results.
The main idea of TF-IDF is: if a word or phrase appears with a high frequency (TF) in one article but rarely appears in other articles, the word or phrase is considered to have good discriminating capability and is suitable for classification. TF-IDF is in fact TF × IDF, where TF is the Term Frequency and IDF is the Inverse Document Frequency. TF represents the frequency with which a term appears in a document d. The main idea of IDF is: the fewer documents contain the term t, i.e. the smaller its document count n, the larger the IDF, and the better the term t distinguishes between categories. However, if the number of documents containing the term t within a certain class C is m, and the number of documents containing t in the other classes is k, then the total number of documents containing t is n = m + k; when m is large, n is also large, and the IDF obtained from the IDF formula is small, suggesting that the term t does not discriminate well between categories. In practice, though, if a term appears frequently in the documents of one class, it indicates that the term represents the characteristics of that class well; such terms should be given higher weight and selected as feature words of that class to distinguish its documents from other classes. In a given document, the term frequency (TF) is the frequency with which a given word appears in the document, i.e. the raw count (term count) normalized by the document length to prevent bias toward long documents.
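A minimal sketch of the TF × IDF computation described above, using plain counts and no smoothing; function and variable names are illustrative only.

```python
import math

def tf_idf(docs):
    """Compute TF-IDF weights for a list of tokenized documents.
    TF = count(term, doc) / len(doc); IDF = log(N / df(term))."""
    n = len(docs)
    # document frequency: number of documents containing each term
    df = {}
    for doc in docs:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in docs:
        w = {}
        for term in set(doc):
            tf = doc.count(term) / len(doc)
            idf = math.log(n / df[term])
            w[term] = tf * idf
        weights.append(w)
    return weights
```

A term appearing in every document gets weight 0 (IDF = log 1), matching the intuition that such terms do not discriminate between documents.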
S5: performing similarity calculation and analysis according to the feature vectors in the feature vector model to obtain an expert recommendation result.
Specifically, the obtaining of the expert recommendation result by performing similarity calculation analysis according to the feature vectors in the feature vector model specifically includes:
on the basis of cosine similarity analysis, introducing a calculation factor based on the number of feature words shared by the two texts relative to the total length of their feature vectors, and performing calculation and analysis on the feature vectors in the feature vector model to obtain an expert recommendation result;
wherein the similarity is calculated by the following formula:
Sim(D, E) = sim(D, E) + c × N(D, E) / Min(D, E)
wherein c is a proportional adjustment coefficient, N(D, E) represents the number of feature words shared by the requirement information D and the expert information E, Min(D, E) represents the smaller of the total number of features of D and the total number of features of E, and sim(D, E) represents the cosine similarity of D and E.
The cosine similarity is calculated as follows:
sim(D, E) = Σt(Vt,A × Vt,B) / (√(Σt Vt,A²) × √(Σt Vt,B²))
wherein Vt,A and Vt,B are the weights of the t-th feature word of the vectors A and B (the feature vectors of D and E), respectively.
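The two similarity measures can be sketched together in Python. The additive combination of the cosine term and the shared-feature-word factor, and the default value of c, are illustrative assumptions here; feature vectors are represented as term-to-weight dicts.

```python
import math

def cosine_sim(va, vb):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    shared = set(va) & set(vb)
    dot = sum(va[t] * vb[t] for t in shared)
    na = math.sqrt(sum(x * x for x in va.values()))
    nb = math.sqrt(sum(x * x for x in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def combined_sim(vd, ve, c=0.5):
    """Sim(D, E) = sim(D, E) + c * N(D, E) / Min(D, E):
    cosine similarity plus a shared-feature-word factor, where N is the
    number of shared feature words and Min the smaller vector length.
    The value c=0.5 is an assumed proportional adjustment coefficient."""
    n_shared = len(set(vd) & set(ve))
    min_len = min(len(vd), len(ve))
    return cosine_sim(vd, ve) + c * n_shared / min_len if min_len else 0.0
```

Experts would then be ranked for a given demand D by combined_sim(D, E) over all expert vectors E, with the top-ranked experts returned as the recommendation result.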
Example two:
As shown in fig. 2:
the present disclosure also provides an expert recommending apparatus for enterprise demand, including:
a data collection module 201, configured to collect expert thesis data and enterprise demand data;
the preprocessing module 202 is configured to preprocess the collected expert thesis data and the collected enterprise demand data to obtain expert information and demand information;
the keyword extraction module 203 is configured to perform keyword extraction on the preprocessed expert information and the preprocessed demand information to obtain expert characteristic information and demand characteristic information;
the vector model construction module 204 is used for constructing a feature vector model according to the expert feature information and the requirement feature information after the keyword extraction;
and the similarity analysis module 205 is configured to perform similarity calculation analysis according to the feature vectors in the feature vector model to obtain an expert recommendation result.
The data collection module 201 of the present disclosure is sequentially connected to the preprocessing module 202, the keyword extraction module 203, the vector model construction module 204, and the similarity analysis module 205.
Further, the collecting the expert thesis data and the enterprise requirement data specifically includes:
collecting title, abstract and/or keyword data of expert papers from a paper database, and collecting title, keyword and/or requirement-detail data of enterprise requirements from a selected online website.
Example three:
the present disclosure can also provide a computer storage medium having stored thereon a computer program for implementing the steps of the above-described enterprise demand oriented expert recommendation method when executed by a processor.
The computer storage medium of the present disclosure may be implemented with a semiconductor memory, a magnetic core memory, a magnetic drum memory, or a magnetic disk memory.
Semiconductor memories are mainly used as the storage elements of computers and come in two types: MOS and bipolar. MOS elements have high integration density and a simple process but lower speed. Bipolar elements have a complex process, high power consumption, and low integration density, but high speed. With the introduction of NMOS and CMOS, MOS memory came to dominate semiconductor memory. NMOS is fast: for example, Intel's 1K-bit static RAM has an access time of 45 ns. CMOS has low power consumption: a 4K-bit CMOS static memory has an access time of 300 ns. The semiconductor memories described above are all random access memories (RAM), i.e. new contents can be read and written randomly during operation. A semiconductor read-only memory (ROM) can be read randomly during operation but not written, and is used to store fixed programs and data. ROM is classified into non-rewritable fuse-type ROM and PROM, and rewritable EPROM.
The magnetic core memory has the characteristics of low cost and high reliability, with more than 20 years of practical experience. Core memories were widely used as main memory before the mid-1970s. The storage capacity can reach more than 10 bits, and the access time is 300 ns at the fastest. A typical international magnetic core memory has a capacity of 4 MB to 8 MB and an access cycle of 1.0 to 1.5 μs. After semiconductor memory developed rapidly and replaced core memory as main memory, core memory can still be applied as large-capacity expansion memory.
Drum memory is an external memory for magnetic recording. Although its information access is fast and its operation stable and reliable, it is being replaced by disk memory; nevertheless, it is still used as external memory for real-time process-control computers and medium and large computers. To meet the needs of small and micro computers, subminiature magnetic drums have emerged, which are small, lightweight, highly reliable, and convenient to use.
Magnetic disk memory, an external memory for magnetic recording. It combines the advantages of drum and tape storage, i.e. its storage capacity is larger than that of drum, its access speed is faster than that of tape storage, and it can be stored off-line, so that the magnetic disk is widely used as large-capacity external storage in various computer systems. Magnetic disks are generally classified into two main categories, hard disks and floppy disk memories.
Hard disk memories come in a wide variety. Structurally, they are divided into replaceable and fixed types: the replaceable disk can be exchanged, while the fixed disk cannot. Both replaceable and fixed magnetic disks exist as multi-platter combinations and single-platter structures, and both are further divided into fixed-head and movable-head types. The fixed-head magnetic disk has a small capacity and low recording density but a high access speed and high cost. The movable-head magnetic disk has a high recording density (up to 1000 to 6250 bits/inch) and thus a large capacity, but a lower access speed than the fixed-head disk. The storage capacity of a magnetic disk product can reach several hundred megabytes, with a bit density of 6250 bits per inch and a track density of 475 tracks per inch. The disk packs of a multi-platter replaceable disk memory can be exchanged, giving it large off-line capacity, large total capacity, and high speed; it can store large volumes of information data and is widely applied in online information retrieval systems and database management systems.
Example four:
the present disclosure also provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the steps of the expert recommendation method for enterprise needs are implemented.
Fig. 3 is a schematic diagram of the internal structure of an electronic device in one embodiment. As shown in fig. 3, the electronic device includes a processor, a storage medium, a memory, and a network interface connected through a system bus. The storage medium of the device stores an operating system, a database, and computer-readable instructions; the database can store control information sequences, and the computer-readable instructions, when executed by the processor, can cause the processor to implement an enterprise-requirement-oriented expert recommendation method. The processor of the electronic device provides computing and control capabilities to support the operation of the entire device. The memory may store computer-readable instructions that, when executed by the processor, cause the processor to perform the enterprise-requirement-oriented expert recommendation method. The network interface of the device is used for connecting and communicating with terminals. Those skilled in the art will appreciate that the architecture shown in fig. 3 is merely a block diagram of some of the structures associated with the disclosed aspects and does not limit the computing devices to which the disclosed aspects apply; a particular computing device may include more or fewer components than shown, combine certain components, or have a different arrangement of components.
The electronic device includes, but is not limited to, a smartphone, a computer, a tablet, a wearable smart device, an artificial-intelligence device, a mobile power source, and the like.
The processor may in some embodiments be composed of an integrated circuit, for example a single packaged integrated circuit, or of a plurality of packaged integrated circuits with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, and combinations of various control chips. The processor is the control unit of the electronic device; it connects the various components of the electronic device through various interfaces and lines, and executes the various functions of the electronic device and processes its data by running or executing programs or modules stored in the memory (for example, executing remote data reading and writing programs) and calling data stored in the memory.
The bus may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus may be divided into an address bus, a data bus, a control bus, etc. The bus is arranged to enable connected communication between the memory and at least one processor or the like.
Fig. 3 shows only an electronic device having components, and those skilled in the art will appreciate that the structure shown in fig. 3 does not constitute a limitation of the electronic device, and may include fewer or more components than those shown, or some components may be combined, or a different arrangement of components.
For example, although not shown, the electronic device may further include a power supply (such as a battery) for supplying power to each component, and preferably, the power supply may be logically connected to the at least one processor through a power management device, so that functions such as charge management, discharge management, and power consumption management are implemented through the power management device. The power supply may also include any component of one or more dc or ac power sources, recharging devices, power failure detection circuitry, power converters or inverters, power status indicators, and the like. The electronic device may further include various sensors, a bluetooth module, a Wi-Fi module, and the like, which are not described herein again.
Further, the electronic device may further include a network interface, and optionally, the network interface may include a wired interface and/or a wireless interface (such as a WI-FI interface, a bluetooth interface, etc.), which are generally used to establish a communication connection between the electronic device and other electronic devices.
Optionally, the electronic device may further comprise a user interface, which may be a Display (Display), an input unit (such as a Keyboard), and optionally a standard wired interface, a wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, is suitable, among other things, for displaying information processed in the electronic device and for displaying a visualized user interface.
Further, the computer-usable storage medium may mainly include a storage program area and a storage data area, wherein the storage program area may store an operating system, an application program required for at least one function, and the like; the storage data area may store data created during use of the device, and the like.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus, device and method can be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the modules is only one logical functional division, and other divisions may be realized in practice.
The modules described as separate parts may or may not be physically separate, and parts displayed as modules may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment.
In addition, functional modules in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, or in a form of hardware plus a software functional module.
The embodiments of the present disclosure have been described above. However, these examples are for illustrative purposes only and are not intended to limit the scope of the present disclosure. The scope of the disclosure is defined by the appended claims and equivalents thereof. Various alternatives and modifications can be devised by those skilled in the art without departing from the scope of the present disclosure, and such alternatives and modifications are intended to be within the scope of the present disclosure.