CN113239179B - Scientific research technology interest field recognition model training method, scientific and technological resource query method and device - Google Patents

Scientific research technology interest field recognition model training method, scientific and technological resource query method and device

Info

Publication number
CN113239179B
Authority
CN
China
Prior art keywords
scientific
technological
text
query
candidate
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110781559.2A
Other languages
Chinese (zh)
Other versions
CN113239179A (en
Inventor
杜军平
郭伟杰
寇菲菲
许明英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN202110781559.2A
Publication of CN113239179A
Application granted
Publication of CN113239179B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/332 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/126 Character encoding
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks

Abstract

The invention provides a scientific research technology interest field recognition model training method, a scientific and technological resource query method, and a corresponding device. The query method re-ranks results in multiple retrieval stages: starting from a similarity-based candidate set, it compares each candidate with the features of the technical field the scholar is interested in and then reorders the candidates using influence factors, which improves the personalization and accuracy of scientific and technological resource queries.

Description

Scientific research technology interest field recognition model training method, scientific and technological resource query method and device
Technical Field
The invention relates to the technical field of data processing, in particular to a scientific research technical interest field recognition model training method, a scientific and technological resource query method and a scientific and technological resource query device.
Background
With the continuous development of science and technology, scientific and technological resources of many kinds keep emerging. Unlike news and social media content, which grows explosively on the Internet, scientific and technological resources have distinctive characteristics. They consist mainly of academic material such as papers and patents, and serve researchers in many fields. However, these resources are large in volume and varied in type, and traditional query methods struggle to discover and exploit the value they contain.
Existing retrieval technology offers fast retrieval and accurate semantic matching, but it is essentially a matching process based on similarity scores against the query. As a result, similar queries from different users return nearly identical results, and retrieval cannot be personalized for each user. The main reason is that the retrieval task ignores differences between users and does not fully mine user profile information to refine the search engine's ranking and filtering algorithms. In scientific and technological resource retrieval in particular, big-data resources such as papers, patents and funding records contain many domain-specific terms, and the same term can have quite different meanings in different fields. Scholars work in different academic fields and usually want query results that are closely related to their own research interests, a personalized requirement that existing technical solutions have difficulty meeting.
With the spread of machine learning, mining latent information about users and resources with big-data analysis, and comprehensively analyzing the relationships between scientific and technological entities, has become a necessary route to efficient scientific and technological resource query. Traditional keyword matching no longer meets researchers' needs, and accurately finding information of interest in an ever-growing body of resources is an urgent requirement.
Disclosure of Invention
Embodiments of the invention provide a scientific research technology interest field recognition model training method, a scientific and technological resource query method, and a device, which eliminate or mitigate one or more defects of the prior art and address the problem that existing search technology cannot provide personalized results matched to a scholar's interests.
The technical scheme of the invention is as follows:
In one aspect, the invention provides a scientific research technology interest field recognition model training method, comprising the following steps:
obtaining a plurality of samples, where each sample comprises a plurality of scientific and technological texts published or browsed by one researcher within a set time window, and the texts in each sample belong to the same technical field; extracting text features of each scientific and technological text with a bidirectional long short-term memory (BiLSTM) network, labeling each sample with the technical field to which it belongs, and generating a training sample set;
obtaining an initial network model in which the set time window is divided into a first set number of time steps; inputting the text features of the scientific and technological texts in a sample to the time steps in the chronological order of publication or browsing to form an input sequence; adding a positional encoding to the text features at each time step using trigonometric-function (sinusoidal) encoding; feeding the encoded sequence into two separate linear transformations and passing the results through an activation function to obtain the key matrix and the query matrix of an attention mechanism; multiplying the query matrix by the transpose of the key matrix, then scaling the dot products and normalizing them to obtain the attention weight matrix; multiplying the weight matrix by the input sequence to obtain an attention matrix, and taking a weighted average of the attention matrix to obtain a technical field feature vector; inputting the technical field feature vector into a classifier and outputting a classification result;
and training the initial network model with the training sample set to obtain the scientific research technology interest field recognition model.
In some embodiments, when the dot products are scaled and normalized after multiplying the query matrix by the transpose of the key matrix, the normalization is performed with a softmax function.
In some embodiments, before the text features of each scientific and technological text are extracted with the bidirectional long short-term memory network, the method further includes: adjusting the parameters of the bidirectional long short-term memory network using a plurality of preset scientific and technological texts.
In some embodiments, training the initial network model with the training sample set includes: adjusting the parameters by back propagation using a cross-entropy loss function.
In another aspect, the present invention provides a scientific and technological resource query method, including:
acquiring a plurality of reference scientific and technological texts published or browsed by a given scholar within a set time window, extracting a first text feature of each reference text with a bidirectional long short-term memory network, inputting the first text features into the scientific research technology interest field recognition model obtained by the above training method, and extracting the technical field feature vector corresponding to the given scholar as an interest vector;
acquiring a query keyword, and having a database return a first query candidate set based on similarity comparison, where the first query candidate set comprises a plurality of candidate scientific and technological texts;
extracting a second text feature of each candidate scientific and technological text with the bidirectional long short-term memory network, inputting each second text feature into the scientific research technology interest field recognition model by repeating it to fill every time step, and extracting the technical field feature vector corresponding to each candidate text as a reference vector;
calculating the cosine similarity between the reference vector of each candidate scientific and technological text in the first query candidate set and the interest vector, sorting the candidate texts in descending order of cosine similarity, and removing candidate texts whose cosine similarity is below a set value to obtain a second query candidate set;
dividing the second query candidate set into a second set number of segments according to the range of cosine similarity values of its candidate scientific and technological texts, and obtaining the influence factor of the candidate texts in each segment;
and reordering the candidate scientific and technological texts within each segment of the second query candidate set in descending order of influence factor to obtain the query result.
In some embodiments, the database includes a data acquisition layer, a data processing layer, and a data storage layer. A plurality of business function modules subscribe to scientific and technological texts from the data processing layer in a publish-subscribe manner; each business function module is configured with its own data processing logic, which the data processing layer executes uniformly before the results are stored.
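The publish-subscribe arrangement described above can be pictured with a small in-memory sketch. The class `DataProcessingLayer`, its `subscribe` and `publish` methods, the module names, and the in-memory `storage` list are all hypothetical stand-ins chosen for illustration; the patent text does not specify an API or a message broker.

```python
from collections import defaultdict


class DataProcessingLayer:
    """Toy stand-in for the data processing layer (hypothetical API)."""

    def __init__(self):
        # text type -> list of (module name, processing logic) pairs
        self._subscriptions = defaultdict(list)
        # stand-in for the data storage layer
        self.storage = []

    def subscribe(self, text_type, module_name, processing_logic):
        """A business function module registers its own processing logic."""
        self._subscriptions[text_type].append((module_name, processing_logic))

    def publish(self, text_type, raw_text):
        """Run every subscriber's logic in one place, then store the results."""
        for module_name, logic in self._subscriptions.get(text_type, []):
            self.storage.append((module_name, logic(raw_text)))


layer = DataProcessingLayer()
layer.subscribe("paper", "keyword_index", lambda text: text.lower().split())
layer.subscribe("paper", "length_stats", len)
layer.publish("paper", "Time Window Attention")
layer.publish("patent", "No subscriber for this type")  # nothing is stored
```

Keeping the processing logic with the subscriber while executing it centrally is one way to read "uniformly executed by the data processing layer"; a real deployment would likely sit on a message queue rather than a Python list.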
In some embodiments, the data acquisition layer is deployed on a plurality of hosts and scheduled through a distributed framework based on the same processing logic;
and/or the data processing layer is deployed on a plurality of hosts and scheduled through a distributed framework based on the same processing logic;
and/or the data storage layer is deployed on a plurality of hosts and scheduled through a distributed framework based on the same processing logic.
In some embodiments, obtaining the influence factor of the candidate scientific and technical text in each segment comprises:
acquiring the citation count of each candidate scientific and technological text and the publication count of its first author, and taking a weighted average of the citation count and the publication count to obtain the influence factor of each candidate text.
In another aspect, the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the steps of the above method are implemented.
In another aspect, the present invention also provides a computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the above-mentioned method.
The invention has the beneficial effects that:
With the scientific research technology interest field recognition model training method, the scientific and technological resource query method and the device of the invention, the scientific and technological texts published or browsed by a scholar within a time window are obtained and the scholar's research interest features are mined with an attention mechanism. The window is divided into time steps and positional encodings are added to the inputs at each time step, so that the way the scholar's research focus concentrates and shifts within the window is captured, which improves the accuracy of interest field recognition. The query method re-ranks results in multiple retrieval stages: starting from similarity-based matching, it compares the candidate set with the features of the technical field the scholar is interested in and reorders the candidates using influence factors, which improves the personalization and accuracy of scientific and technological resource queries.
Furthermore, the database used by the scientific and technological resource query method is based on a publish-subscribe model, which enables flexible data acquisition and unified processing and improves data acquisition efficiency.
Additional advantages, objects, and features of the invention will be set forth in part in the description which follows and in part will become apparent to those having ordinary skill in the art upon examination of the following or may be learned from practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
It will be appreciated by those skilled in the art that the objects and advantages that can be achieved with the present invention are not limited to the specific details set forth above, and that these and other objects that can be achieved with the present invention will be more clearly understood from the detailed description that follows.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the principles of the invention. In the drawings:
Fig. 1 is a structural diagram of the initial network model in the scientific research technology interest field recognition model training method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of the database structure in the scientific and technological resource query method according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of the data acquisition logic in the scientific and technological resource query method according to an embodiment of the present invention;
Fig. 4 is a logic diagram of the scientific and technological resource query method according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be described in further detail with reference to the following embodiments and accompanying drawings. The exemplary embodiments and descriptions of the present invention are provided to explain the present invention, but not to limit the present invention.
It should be noted that, in order to avoid obscuring the present invention with unnecessary details, only the structures and/or processing steps closely related to the scheme according to the present invention are shown in the drawings, and other details not so relevant to the present invention are omitted.
It should be emphasized that the term "comprises/comprising" when used herein, is taken to specify the presence of stated features, elements, steps or components, but does not preclude the presence or addition of one or more other features, elements, steps or components.
Retrieval in scientific research places high demands on existing technology. On the one hand, the stock of scientific and technological documents in every field is large, new documents are published frequently, and the data volume is huge, which challenges both data acquisition and query. On the other hand, for complex scientific and technological texts, traditional search based on keyword similarity returns every text that shares the same technical features regardless of technical field, so different fields cannot be told apart. For example, a query for "computer vision" may return texts on computer vision from fields as different as medicine and robotics at the same time, while the researcher who issued the query may have no interest in, or need for, material from some of those fields. It is therefore necessary to match features against the researcher's field of interest during retrieval and to personalize the query in order to improve accuracy.
In one aspect, the invention provides a scientific research technology interest field recognition model training method. With reference to Fig. 1, the method comprises the following steps S101 to S103:
Step S101: obtain a plurality of samples, where each sample comprises a plurality of scientific and technological texts published or browsed by one researcher within a set time window, and the texts in each sample belong to the same technical field; extract text features of each scientific and technological text with a bidirectional long short-term memory (BiLSTM) network, label each sample with the technical field to which it belongs, and generate a training sample set.
Step S102: obtain an initial network model in which the set time window is divided into a first set number of time steps; input the text features of the scientific and technological texts in a sample to the time steps in the chronological order of publication or browsing to form an input sequence; add a positional encoding to the text features at each time step using trigonometric-function (sinusoidal) encoding; feed the encoded sequence into two separate linear transformations and pass the results through an activation function to obtain the key matrix and the query matrix of an attention mechanism; multiply the query matrix by the transpose of the key matrix, then scale the dot products and normalize them to obtain the attention weight matrix; multiply the weight matrix by the input sequence to obtain an attention matrix, and take a weighted average of the attention matrix to obtain a technical field feature vector; input the technical field feature vector into a classifier and output a classification result.
Step S103: train the initial network model with the training sample set to obtain the scientific research technology interest field recognition model.
This embodiment trains a model that captures and identifies a scholar's scientific research interest field. In step S101, a training sample set is built. A single sample records the scientific and technological texts published or browsed by a single scholar within a window of time, and the window length can be set as required. For texts published by the scholar, the window can be set to 1 month, 3 months, half a year, or one year, to allow for typical publication cycles. Texts browsed by the scholar are usually more concentrated in time, so the window can be set to 1 day or 1 week. Accordingly, each published or browsed text should be marked with its publication or browsing time so that the texts can be ordered chronologically.
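As a rough illustration of how timestamped texts in one sample might be placed onto the time steps of a window, the sketch below splits the window into equal-width steps and assigns each text by its timestamp. The function name `assign_time_steps`, the equal-width split, and the `(timestamp, text_id)` record layout are assumptions made for this example; the patent only states that the window is divided into a set number of time steps and that texts are input in chronological order.

```python
from datetime import datetime, timedelta


def assign_time_steps(records, window_start, window_length, num_steps):
    """Group (timestamp, text_id) records into equal-width time steps.

    Records outside the window are ignored. Equal-width steps are an
    illustrative assumption, not something the source specifies.
    """
    step = window_length / num_steps  # timedelta / int -> timedelta
    buckets = [[] for _ in range(num_steps)]
    for timestamp, text_id in sorted(records):  # chronological order
        offset = timestamp - window_start
        if timedelta(0) <= offset < window_length:
            index = min(int(offset / step), num_steps - 1)
            buckets[index].append(text_id)
    return buckets


records = [
    (datetime(2021, 1, 6), "b"),
    (datetime(2021, 1, 1), "a"),
    (datetime(2021, 1, 7, 12), "c"),
    (datetime(2021, 2, 1), "outside"),
]
steps = assign_time_steps(
    records,
    window_start=datetime(2021, 1, 1),
    window_length=timedelta(days=7),
    num_steps=7,
)
```

With a one-week browsing window and seven steps, each step corresponds to one day; empty steps would still need a padding policy, which the source does not describe.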
Further, in some embodiments, before the text features of each scientific and technological text are extracted with the bidirectional long short-term memory network, the method further includes: adjusting the parameters of the bidirectional long short-term memory network using a plurality of preset scientific and technological texts.
In step S102, to identify the technical field of interest of the scholar more accurately, this embodiment proposes an attention mechanism based on a time window. Like a general attention mechanism, the time-window attention mechanism focuses on the important parts of a large amount of information, selects the key information, and ignores the rest. Its input, however, is a sliding time window, so the range of attention can be continuously updated over time. The scholar's short-term interest representation is extracted efficiently from the most recent research-related records in the scholar's historical behavior sequence, attention is computed by matching the sequence against itself, and the sliding time window is used to mine how the scholar's interests shift over time.
Specifically, the time window is divided into s time steps, which mark the chronological order in which scientific and technological texts were published or browsed: earlier texts come first and later texts come after. For one sample, the text features of each text extracted by the bidirectional long short-term memory network in step S101 are input at the corresponding time steps to form the input sequence $X_S$. Positional encodings are then added using trigonometric-function (sinusoidal) encoding. The positional encoding marks the order of the time steps and introduces the order in which the texts were published or browsed, so that the way the scholar's interests change within the window is captured, yielding the sequence $C_S$. After the positional encoding has been fused in, $C_S$ is passed through two separate linear transformations followed by an activation function to obtain the key matrix $K$ and the query matrix $Q$ of the attention mechanism. The calculation formulas are:
$$K = \sigma(C_S W_K) \tag{1}$$
$$Q = \sigma(C_S W_Q) \tag{2}$$
where $W_K$ is the parameter matrix used to generate the key matrix, $W_Q$ is the parameter matrix used to generate the query matrix, and $\sigma$ is the activation function.
The query matrix is multiplied by the transpose of the key matrix, the dot products are scaled (scaled dot-product), and a softmax operation yields the attention weight matrix:
$$U_S = \mathrm{softmax}\left(\frac{Q K^{\top}}{\sqrt{d}}\right) \tag{3}$$
where $d$ is the scaling factor and $U_S$ is the attention weight matrix.
Further, the original input sequence $X_S$ is used as the value matrix $V$ of the attention mechanism. Multiplying the attention weight matrix by the value matrix gives the attention matrix for the scholar's technical field of interest:
$$A_S = U_S V = U_S X_S \tag{4}$$
where $A_S$ is the attention matrix.
Further, a weighted average of the attention matrix gives the technical field feature vector, which is input into a classifier to obtain the final classification result.
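The forward pass described above (positional encoding, key and query matrices, scaled and normalized attention weights, attention matrix, averaged feature vector) can be sketched in plain Python. Several choices here are assumptions and not taken from the source: the Transformer-style sinusoidal form of the positional encoding, `tanh` as the activation function, division by the square root of the key dimension as the scaling, and a uniform mean standing in for the "weighted average". The helper names are hypothetical.

```python
import math


def positional_encoding(num_steps, dim):
    """Sinusoidal encoding (assumed form): sin on even, cos on odd indices."""
    pe = [[0.0] * dim for _ in range(num_steps)]
    for pos in range(num_steps):
        for i in range(dim):
            angle = pos / (10000 ** (2 * (i // 2) / dim))
            pe[pos][i] = math.sin(angle) if i % 2 == 0 else math.cos(angle)
    return pe


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def softmax_rows(m):
    """Row-wise softmax with max subtraction for numerical stability."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def time_window_attention(x, w_k, w_q, activation=math.tanh):
    """Return (attention weights U_S, technical field feature vector)."""
    s, dim = len(x), len(x[0])
    pe = positional_encoding(s, dim)
    c = [[xv + pv for xv, pv in zip(xr, pr)] for xr, pr in zip(x, pe)]  # C_S
    k = [[activation(v) for v in row] for row in matmul(c, w_k)]  # key matrix
    q = [[activation(v) for v in row] for row in matmul(c, w_q)]  # query matrix
    d = len(k[0])
    k_t = [list(col) for col in zip(*k)]  # transpose of K
    scores = [[v / math.sqrt(d) for v in row] for row in matmul(q, k_t)]
    u = softmax_rows(scores)  # attention weight matrix U_S
    a = matmul(u, x)  # attention matrix A_S, with V = X_S
    feature = [sum(col) / s for col in zip(*a)]  # uniform mean (assumption)
    return u, feature


x_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 time steps, 2-dim text features
identity = [[1.0, 0.0], [0.0, 1.0]]
weights, field_vector = time_window_attention(x_s, identity, identity)
```

In a trained model the parameter matrices would be learned and the averaging weights might be learned as well; the identity matrices here only keep the example small.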
In step S103, the training sample set from step S101 is used to train the initial network model from step S102, which finally yields the scientific research technology interest field recognition model.
In some embodiments, training the initial network model with the training sample set includes: adjusting the parameters by back propagation using a cross-entropy loss function.
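For reference, the cross-entropy loss over a softmax classifier output and its gradient with respect to the logits, which is the quantity back propagation starts from, can be written as below. The function names and the three-class example are illustrative; the source does not describe the classifier's structure or the optimizer.

```python
import math


def softmax(logits):
    """Numerically stable softmax over one list of class scores."""
    peak = max(logits)
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, label):
    """Negative log-probability assigned to the true technical field."""
    return -math.log(softmax(logits)[label])


def cross_entropy_grad(logits, label):
    """Gradient of the loss w.r.t. the logits: softmax(logits) - one_hot(label)."""
    probs = softmax(logits)
    return [p - (1.0 if i == label else 0.0) for i, p in enumerate(probs)]


logits = [2.0, 0.5, -1.0]  # classifier scores for three hypothetical fields
loss = cross_entropy(logits, label=0)
grad = cross_entropy_grad(logits, label=0)
```

The gradient is negative for the true class and positive for the others, so a gradient-descent step raises the score of the labeled field.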
In another aspect, the invention provides a scientific and technological resource query method comprising the following steps S201 to S206.
It should be noted in advance that this embodiment does not restrict the order of steps S201 to S206; in some application scenarios, steps may run in parallel or in a different order. In this embodiment, "first" and "second" in "first text feature" and "second text feature" are not ordinal; they only distinguish the features of the reference scientific and technological texts from those of the candidate scientific and technological texts.
Step S201: acquire a plurality of reference scientific and technological texts published or browsed by a given scholar within a set time window, extract a first text feature of each reference text with a bidirectional long short-term memory network, input the first text features into the scientific research technology interest field recognition model obtained by the above training method, and extract the technical field feature vector corresponding to the given scholar as an interest vector.
Step S202: acquire a query keyword, and have the database return a first query candidate set based on similarity comparison, where the first query candidate set comprises a plurality of candidate scientific and technological texts.
Step S203: extract a second text feature of each candidate scientific and technological text with the bidirectional long short-term memory network, input each second text feature into the scientific research technology interest field recognition model by repeating it to fill every time step, and extract the technical field feature vector corresponding to each candidate text as a reference vector.
Step S204: calculate the cosine similarity between the reference vector of each candidate scientific and technological text in the first query candidate set and the interest vector, sort the candidate texts in descending order of cosine similarity, and remove candidate texts whose cosine similarity is below a set value to obtain a second query candidate set.
Step S205: divide the second query candidate set into a second set number of segments according to the range of cosine similarity values of its candidate scientific and technological texts, and obtain the influence factor of the candidate texts in each segment.
Step S206: reorder the candidate scientific and technological texts within each segment of the second query candidate set in descending order of influence factor to obtain the query result.
In step S201, when a given scholar issues a query with a query keyword, the technical field of interest of that scholar is first analyzed and identified. The reference scientific and technological texts published or browsed by the scholar within a set time window are obtained, and first text features are extracted with a bidirectional long short-term memory network. The first text features of the reference texts are input into the scientific research technology interest field recognition model trained in steps S101 to S103 (see part A of Fig. 1), and the technical field feature vector computed by the model is output as the interest vector. Note that step S201 does not use the final classification result of the recognition model; what the query uses is the technical field feature vector before it enters the classifier.
In step S202, a query keyword is obtained. In this first retrieval stage, the database can obtain the first query candidate set directly by similarity matching or keyword retrieval. For a given query keyword, a large number of scientific and technological documents from different fields in the database contain the keyword and all fall into the first query candidate set, so a candidate set obtained by ordinary similarity matching or keyword retrieval rarely meets a scholar's need for techniques in one specific field. For example, a search for "computer vision" returns many documents from technical fields such as medical image recognition, face recognition, and motion capture.
In step S203, the candidate scientific and technological texts in the first query candidate set are mapped into the same vector space as the given scholar's interest vector for further comparison. The reference vector of each candidate text is extracted as in step S201, using the scientific research technology interest field recognition model trained in steps S101 to S103. To meet the model's input requirement, the feature of a single candidate text can be copied at the input to fill all time steps of the window. The technical field feature vector before the classifier of the recognition model is then output as the reference vector of each candidate text, for comparison with the given scholar's interest vector.
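The "repeat to fill every time step" input for a single candidate text amounts to tiling one feature vector across the window. A minimal sketch follows, with the hypothetical helper `fill_time_steps` and a made-up three-dimensional feature:

```python
def fill_time_steps(text_feature, num_steps):
    """Copy one candidate text's feature into every time step of the window."""
    return [list(text_feature) for _ in range(num_steps)]


# One candidate text's feature repeated over a 4-step window.
sequence = fill_time_steps([0.2, 0.7, 0.1], num_steps=4)
```

The resulting sequence has the same shape as a scholar's input sequence, so the same model can produce a feature vector for it.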
In step S204, the candidate scientific and technological texts in the first query candidate set are ranked by the cosine similarity between their reference vectors and the interest vector. Candidate texts whose cosine similarity is smaller than a set value are removed to ensure query quality. The set value can be chosen according to the requirements of the specific scenario: the higher the set value, the closer the remaining candidate texts are to the technical field in which the given scholar is interested.
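A minimal sketch of this ranking and filtering step is given below, assuming vectors are plain lists of floats. The function names, the document identifiers, and the choice to score a zero vector as 0 are illustrative assumptions rather than details from the patent.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: a zero vector is treated as unrelated
    return dot / (norm_a * norm_b)


def rank_and_filter(
    interest: Sequence[float],
    references: Dict[str, Sequence[float]],
    threshold: float,
) -> List[Tuple[str, float]]:
    """Score every candidate, drop those below the threshold, sort descending."""
    scored = [(doc_id, cosine_similarity(vec, interest)) for doc_id, vec in references.items()]
    kept = [item for item in scored if item[1] >= threshold]
    kept.sort(key=lambda item: item[1], reverse=True)
    return kept


# Toy data: two-dimensional vectors and a threshold of 0.5.
interest_vector = [1.0, 0.0]
reference_vectors = {"doc_a": [1.0, 0.0], "doc_b": [1.0, 1.0], "doc_c": [0.0, 1.0]}
second_candidate_set = rank_and_filter(interest_vector, reference_vectors, threshold=0.5)
```

In this toy run `doc_c` is orthogonal to the interest vector and is removed, while `doc_a` ranks ahead of `doc_b`.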
In steps S205 and S206, the second query candidate set is segmented to further distinguish different technical fields to some extent. The second query candidate set is obtained by ranking the first query candidate set by cosine similarity, so texts ranked near the top are closer to the technical field in which the given scholar is interested, and texts ranked near the bottom are farther from it. Accordingly, the technical documents within any one segment of the second query candidate set belong to similar or identical technical fields. When the documents are further reordered by influence, the texts in each segment are rearranged separately according to their influence factors, so that the overall near-to-far ordering relative to the scholar's field of interest is not disturbed. The segments thus keep their near-to-far order, the texts within each segment are arranged from high to low influence, and the query result is obtained.
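The segment-wise reordering can be sketched as follows. The text does not specify how segment boundaries are chosen, so this sketch assumes equal-width segments over the range of cosine similarities; the tuple layout, the function name, and the tie-breaking rule are likewise assumptions.

```python
from typing import List, Tuple

# (doc_id, cosine_similarity, influence_factor)
Candidate = Tuple[str, float, float]


def segment_and_reorder(candidates: List[Candidate], num_segments: int) -> List[Candidate]:
    """Split by similarity range, then sort each segment by influence, high to low."""
    if num_segments <= 0:
        raise ValueError("num_segments must be positive")
    if not candidates:
        return []
    sims = [c[1] for c in candidates]
    low, high = min(sims), max(sims)
    width = (high - low) / num_segments
    segments: List[List[Candidate]] = [[] for _ in range(num_segments)]
    for cand in candidates:
        if width == 0.0:
            index = 0  # all similarities are equal, so there is one segment
        else:
            # Segment 0 holds the highest similarities (nearest to the interest).
            index = min(int((high - cand[1]) / width), num_segments - 1)
        segments[index].append(cand)
    result: List[Candidate] = []
    for segment in segments:
        # Within a segment, rank by influence; break ties by similarity.
        result.extend(sorted(segment, key=lambda c: (c[2], c[1]), reverse=True))
    return result


ranked = [("a", 0.95, 3.0), ("b", 0.90, 8.0), ("c", 0.70, 1.0), ("d", 0.62, 5.0), ("e", 0.60, 9.0)]
query_result = segment_and_reorder(ranked, num_segments=2)
```

Here `a` and `b` stay ahead of `c`, `d`, and `e` regardless of influence, because reordering never crosses a segment boundary.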
In some embodiments, in step S205, obtaining the influence factor of the candidate scientific and technological texts in each segment includes: acquiring the citation count of each candidate text and the publication count of its first author, and taking a weighted average of the two to obtain the influence factor of each candidate text.
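As a minimal sketch, the influence factor can be written as a convex combination of the two counts. The weight value and the absence of any normalization are assumptions; the text states only that the two quantities are weighted and averaged.

```python
def influence_factor(
    citation_count: int,
    author_publication_count: int,
    citation_weight: float = 0.5,
) -> float:
    """Weighted average of a text's citations and its first author's output."""
    if not 0.0 <= citation_weight <= 1.0:
        raise ValueError("citation_weight must lie in [0, 1]")
    return (
        citation_weight * citation_count
        + (1.0 - citation_weight) * author_publication_count
    )


# Hypothetical counts: 120 citations, 40 publications by the first author.
score = influence_factor(citation_count=120, author_publication_count=40, citation_weight=0.7)
```

Because the two counts can differ greatly in scale, a practical system might rescale them before averaging; that choice is left open here.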
In some embodiments, in step S202, the database includes a data acquisition layer, a data processing layer, and a data storage layer. A plurality of service function modules subscribe to scientific and technological texts from the data processing layer in a publish-subscribe manner; each service function module is configured with its own data processing logic, which is executed uniformly by the data processing layer before the results are stored.
The publish-subscribe pattern is a behavioral design pattern. In software architecture, publish/subscribe is a messaging paradigm in which the sender of a message does not send it directly to a specific receiver but broadcasts it through a message channel, so that subscribers to the message topic can consume it. The defining feature of the publish/subscribe model is loose coupling. In the present invention, the data acquisition and processing pipeline is designed around the publish-subscribe pattern to make the data processing system flexible, highly reliable, and testable.
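The loose coupling described here can be illustrated with a small in-process message channel. This is a sketch under simplifying assumptions (single process, synchronous delivery); the class name, the topic names, and the handlers are hypothetical and do not come from the patent.

```python
from collections import defaultdict
from typing import Any, Callable, DefaultDict, List

Handler = Callable[[Any], Any]


class MessageChannel:
    """Routes each published message to every handler subscribed to its topic."""

    def __init__(self) -> None:
        self._subscribers: DefaultDict[str, List[Handler]] = defaultdict(list)

    def subscribe(self, topic: str, handler: Handler) -> None:
        # A subscriber binds its own processing logic at subscription time.
        self._subscribers[topic].append(handler)

    def publish(self, topic: str, message: Any) -> List[Any]:
        # The publisher never names a receiver; the channel runs all handlers.
        return [handler(message) for handler in self._subscribers.get(topic, [])]


channel = MessageChannel()
stored: List[str] = []
channel.subscribe("text", lambda msg: stored.append(msg.strip()))  # cleaning and storage
channel.subscribe("text", lambda msg: len(msg))                    # a second, independent consumer
results = channel.publish("text", "  attention model  ")
unrouted = channel.publish("image", b"\x89PNG")  # no subscriber on this topic yet
```

New processing logic is added by subscribing another handler, without changing the publisher, which is the property the pipeline design relies on.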
In some embodiments, the data acquisition layer is deployed on a plurality of hosts running the same processing logic and is scheduled by a distributed framework; and/or the data processing layer is deployed on a plurality of hosts running the same processing logic and is scheduled by a distributed framework; and/or the data storage layer is deployed on a plurality of hosts running the same processing logic and is scheduled by a distributed framework.
The scientific and technological resource query method of the present invention is described in detail below with reference to an embodiment:
The scientific and technological resource query method provided by this embodiment combines distributed retrieval and query technology with a deep neural network model. A scientific and technological resource acquisition and processing method based on the publish-subscribe pattern is designed, so that scientific and technological big data resources on the network can be efficiently acquired, cleaned, processed, and stored. Using a neural network combined with an attention mechanism, a time-window-attention-based algorithm for extracting the interest fields of scientific researchers is proposed, which fully accounts for the way a researcher's research fields cluster within a given time period and shift across time periods. Combining the two, intelligent and precise retrieval of scientific and technological resources is achieved through a two-stage retrieval, query, and reordering mechanism.
To achieve the above purpose, as shown in FIG. 2, the technical solution of the present invention is divided into three parts: first, constructing a publish-subscribe model to acquire and process scientific and technological resources; second, constructing a time window attention model to extract the interest fields of scientific researchers; third, personalized two-stage retrieval and query of scientific and technological resources.
(I) Constructing a publish-subscribe model for scientific and technological resource acquisition and processing
To address the high redundancy and poor scalability of traditional data acquisition and processing flows, this embodiment introduces the publish-subscribe pattern. Overall, the data management module is divided into three layers according to their distance from the data source: a data acquisition layer, a data processing layer, and a data storage layer. All data is acquired, processed, and stored step by step from the source, and finally lands in a database suited to its characteristics. The data processing layer provides a high-level functional interface for accepting subscriptions, allowing each service function to subscribe to the data processing layer and to bind its own processing logic when subscribing. This yields a coherent and flexible data processing flow: for any of the data acquisition, data processing, and data storage layers, the same processing logic can be deployed on multiple machines, and scheduling by a distributed framework provides dynamic load balancing and pressure buffering. For the storage layer, the architecture supports transparent heterogeneous storage. Because different kinds of data differ in inherent structure and sparsity, this embodiment provides different underlying stores, such as a graph database (Neo4j), a relational database (MySQL), and a search engine database (Elasticsearch). The scientific and technological resource acquisition and processing model based on the publish-subscribe pattern is shown in FIG. 3.
Specifically, as shown in FIG. 3, the acquisition and processing of scientific and technological resources by the database includes the following steps 1.1 to 1.4:
1.1 The data acquisition layer crawls multi-field scientific and technological big data with distributed data crawlers, acquiring and cleaning the data in real time.
1.2 Data processing logic is bound through the publish-subscribe pattern, and corresponding processing logic is configured as needed for data of different specifications. Pictures are downloaded and stored by a downloader, and text data is saved directly.
1.3 For the retrieval and query problem addressed by the invention, a bidirectional long short-term memory network model is built over the collected scientific and technological achievement resources to produce text feature representations.
1.4 The processing results are input into subsequent algorithm modules or stored in the multi-source heterogeneous database.
(II) Constructing a time window attention model for extracting the interest fields of scientific researchers
To provide personalized search and query services for scholars and users with scientific research experience, their research interests and research fields need to be obtained from their research results. This section proposes a time window attention mechanism and builds an interest extraction model to mine a scholar's fields of interest efficiently.
The scholar interest representation algorithm consists of the following steps 2.1 to 2.5:
2.1 The scientific research interest field recognition model shown in FIG. 1 is established. The input at each time step in the time window is a vectorized scientific and technological resource entity, giving an input sequence $X_S$. For example, the subjects of multiple papers published by a scholar at different times are vectorized with a bidirectional long short-term memory network model and input at successive time steps in chronological order.
2.2 Position codes are added using trigonometric encoding. The position codes mark the order of the time steps and introduce the order in which the scientific and technological texts were published or browsed, so as to capture how the scholar's interests change over time within the window. This gives a sequence $C_S$. The sequence $C_S$, which fuses the position-coding information, is linearly transformed as the input vector group and passed through an activation function to obtain the K and Q matrices, that is, the key matrix and the query matrix of the attention mechanism, as shown in formulas (1) and (2).
$$K = \sigma(C_S W_K) \tag{1}$$
$$Q = \sigma(C_S W_Q) \tag{2}$$
where $W_K$ is the parameter matrix used to generate the key matrix, $W_Q$ is the parameter matrix used to generate the query matrix, and $\sigma$ is the activation function.
2.3 The attention weight matrix is obtained by the scaled dot-product and softmax operations, as shown in formula (3).
$$U_S = \mathrm{softmax}\left(\frac{Q K^{\top}}{d}\right) \tag{3}$$
where $d$ is the scaling factor and $U_S$ is the attention weight matrix.
2.4 The original input sequence $X_S$ serves as the value matrix V of the attention mechanism. Multiplying the attention weight matrix by the value matrix gives the attention matrix of the scholar's technical fields of interest, as shown in formula (4).
$$A_S = U_S V = U_S X_S \tag{4}$$
where $A_S$ is the attention matrix.
2.5 A weighted average of the attention matrix $A_S$ gives the vectorized representation of the researcher's interests.
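Steps 2.1 to 2.5 can be sketched in plain Python as follows. This is an illustrative sketch, not the patented implementation: the sigmoid activation, the sine/cosine position-encoding formula, the uniform weights used for the final average, and all function names are assumptions, since the text does not fix them. The scaling factor is passed in as `scale`.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def sigmoid_all(a: Matrix) -> Matrix:
    return [[1.0 / (1.0 + math.exp(-v)) for v in row] for row in a]


def softmax_rows(a: Matrix) -> Matrix:
    out = []
    for row in a:
        peak = max(row)  # subtract the row maximum for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def position_encoding(num_steps: int, dim: int) -> Matrix:
    """Sine on even feature indices, cosine on odd ones (assumed formula)."""
    table = []
    for pos in range(num_steps):
        row = []
        for i in range(dim):
            angle = pos / (10000 ** ((2 * (i // 2)) / dim))
            row.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
        table.append(row)
    return table


def interest_vector(x_s: Matrix, w_k: Matrix, w_q: Matrix, scale: float) -> List[float]:
    num_steps, dim = len(x_s), len(x_s[0])
    pe = position_encoding(num_steps, dim)
    # Step 2.2: add position codes, then form the key and query matrices.
    c_s = [[x + p for x, p in zip(x_row, p_row)] for x_row, p_row in zip(x_s, pe)]
    k = sigmoid_all(matmul(c_s, w_k))
    q = sigmoid_all(matmul(c_s, w_q))
    # Step 2.3: scaled dot product followed by row-wise softmax.
    scores = [[v / scale for v in row] for row in matmul(q, transpose(k))]
    u_s = softmax_rows(scores)
    # Step 2.4: the original inputs act as the value matrix.
    a_s = matmul(u_s, x_s)
    # Step 2.5: average over time steps (uniform weights are an assumption).
    return [sum(col) / num_steps for col in zip(*a_s)]


# Toy run: three time steps, two features, identity parameter matrices.
identity = [[1.0, 0.0], [0.0, 1.0]]
x_s = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
vector = interest_vector(x_s, identity, identity, scale=math.sqrt(2))
```

In a trained model the parameter matrices would be learned by back propagation, and this vector would be the input to the classifier; the identity matrices here only make the toy run reproducible.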
(III) Personalized two-stage retrieval and query of scientific and technological resources
To query massive scientific and technological resource data efficiently, this embodiment provides a two-stage query method. When the query keywords entered by the user reach the distributed query cluster, the first stage returns a query candidate set according to the similarity between the resources in the database and the query statement; this stage quickly filters irrelevant results out of the massive data to obtain an approximate result set. In the second stage, the candidate set is re-ranked according to the scholar interest representation described above and the influence of scholars, making the query more intelligent and personalized. For a query statement Q entered by the user, a candidate set of query results $W = \{W_i \mid i = 1, 2, 3, \ldots, n\}$ is obtained. For each scientific and technological entity $W_i$ in the candidate set W, so that it lies in the same semantic space as the scholar interest representation, the feature vector representation of the entity is repeated N times to fill N time steps and input into the same scientific research interest field recognition model as in step 2.1, giving the feature representation of the entity in the scholar interest space. The similarity between this representation and the scholar user's interest vector in the interest semantic space is then calculated, with cosine similarity used to measure relevance. A threshold is set, and entities in the candidate set whose similarity is less than the threshold are filtered out. Referring to FIG. 4, the specific steps include 3.1 to 3.5:
3.1 For the title of each paper in the query candidate set, compute its feature representation in the scholar interest space.
3.2 Compute the cosine similarity between each paper and the scholar's interest vector, filter out papers whose similarity is smaller than the threshold, and reorder the remaining papers by similarity to obtain a reordered set.
3.3 Partition the reordered set into segments according to the value range of the similarity and the number of segments.
3.4 Within each segment, calculate the influence as the weighted average of the first author's publication count and the citation count of each article, and perform a secondary reordering of the segment from high to low influence to obtain the final set.
3.5 Return the final set as the query result.
The data processing architecture based on the publish-subscribe pattern is used to collect, process, and store scientific and technological resources, and markedly improves the flexibility of the underlying data processing flow on which resource retrieval depends. To meet scholars' strong demand for personalized retrieval, the invention mines and represents scholar interests with an attention model based on a sliding time window, addressing the inability of traditional interest mining algorithms to capture how a scholar's interests shift over time. Finally, the invention provides a two-stage retrieval, query, and reordering algorithm based on scholar interest and influence, which reorders the result set of the distributed retrieval twice. The algorithm thus retains the high speed and high matching degree of distributed retrieval while being intelligent and personalized.
In another aspect, the present invention further provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and when the processor executes the computer program, the steps of the method are implemented.
In another aspect, the present invention also provides a computer-readable storage medium, on which a computer program is stored, characterized in that the program, when executed by a processor, implements the steps of the above-mentioned method.
In summary, in the scientific research technology interest field recognition model training method, the scientific and technological resource query method, and the device of the present invention, the training method acquires scientific and technological texts published or browsed by a scholar within the window time, mines the scholar's interest features with an attention mechanism, and extracts how the scholar's research fields cluster and shift within the window. Dividing the window into time steps and adding position codes to the inputs of the time steps improves the accuracy of interest field recognition. The query method uses multi-stage retrieval, query, and reordering: on top of similarity judgment, the candidate set is re-ranked by comparing it with the features of the technical field in which the scholar is interested and by incorporating influence factors, which improves the personalization and accuracy of scientific and technological resource queries.
Furthermore, the database used by the scientific and technological resource query method is based on the publish-subscribe pattern, which enables flexible data acquisition and unified processing and improves data acquisition efficiency.
Those of ordinary skill in the art will appreciate that the various illustrative components, systems, and methods described in connection with the embodiments disclosed herein may be implemented as hardware, software, or combinations of both. Whether this is done in hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention. When implemented in hardware, it may be, for example, an electronic circuit, an Application Specific Integrated Circuit (ASIC), suitable firmware, plug-in, function card, or the like. When implemented in software, the elements of the invention are the programs or code segments used to perform the required tasks. The program or code segments may be stored in a machine-readable medium or transmitted by a data signal carried in a carrier wave over a transmission medium or a communication link. A "machine-readable medium" may include any medium that can store or transfer information. Examples of a machine-readable medium include electronic circuits, semiconductor memory devices, ROM, flash memory, Erasable ROM (EROM), floppy disks, CD-ROMs, optical disks, hard disks, fiber optic media, Radio Frequency (RF) links, and so forth. The code segments may be downloaded via computer networks such as the internet, intranet, etc.
It should also be noted that the exemplary embodiments mentioned in this patent describe some methods or systems based on a series of steps or devices. However, the present invention is not limited to the order of the above-described steps, that is, the steps may be performed in the order mentioned in the embodiments, may be performed in an order different from the order in the embodiments, or may be performed simultaneously.
Features that are described and/or illustrated with respect to one embodiment may be used in the same way or in a similar way in one or more other embodiments and/or in combination with or instead of the features of the other embodiments in the present invention.
The above description is only a preferred embodiment of the present invention, and is not intended to limit the present invention, and various modifications and changes may be made to the embodiment of the present invention by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A scientific research technology interest field recognition model training method, characterized by comprising the following steps:
obtaining a plurality of samples, each sample comprising a plurality of scientific and technological texts published or browsed by a scientific researcher within a set time window, the scientific and technological texts in each sample belonging to the same technical field; acquiring text features of each scientific and technological text using a bidirectional long short-term memory network, marking the technical field to which each sample belongs as the label of the corresponding sample, and generating a training sample set;
obtaining an initial network model, dividing the set time window into a first set number of time steps, inputting the text features of the scientific and technological texts in a sample to the time steps in the chronological order of publication or browsing to form an input sequence, adding position codes to the text features input at each time step using trigonometric encoding, inputting the result as two vector groups, applying a linear transformation to each, and obtaining the key matrix and the query matrix of an attention mechanism through an activation function; performing matrix multiplication of the query matrix and the transpose of the key matrix, followed by dot-product scaling and normalization, to obtain the attention weight matrix; multiplying the weight matrix by the input sequence to obtain an attention matrix, and taking a weighted average of the attention matrix to obtain a technical field feature vector; inputting the technical field feature vector into a classifier and outputting a classification result;
and training the initial network model with the training sample set to obtain the scientific research technology interest field recognition model.
2. The scientific research technology interest field recognition model training method according to claim 1, wherein, in the dot-product scaling and normalization performed after the matrix multiplication of the query matrix and the transpose of the key matrix, the normalization uses a softmax function.
3. The scientific research technology interest field recognition model training method according to claim 1, further comprising, before acquiring the text features of each scientific and technological text using the bidirectional long short-term memory network: adjusting the parameters of the bidirectional long short-term memory network using a plurality of preset scientific and technological texts.
4. The scientific research technology interest field recognition model training method according to claim 1, wherein training the initial network model with the training sample set comprises: adjusting parameters by back propagation using a cross-entropy loss function.
5. A scientific and technological resource query method, characterized by comprising the following steps:
acquiring a plurality of reference scientific and technological texts published or browsed by a given scholar within a set time window, acquiring first text features of each reference scientific and technological text using a bidirectional long short-term memory network, inputting the first text features into the scientific research technology interest field recognition model obtained by the training method according to any one of claims 1 to 2, and extracting the technical field feature vector corresponding to the given scholar as an interest vector;
acquiring a query keyword, and returning, by a database, a first query candidate set based on similarity comparison, wherein the first query candidate set comprises a plurality of candidate scientific and technological texts;
acquiring second text features of each candidate scientific and technological text using the bidirectional long short-term memory network, inputting each second text feature, repeated so as to fill every time step, into the scientific research technology interest field recognition model obtained by the training method according to any one of claims 1 to 2, and extracting the technical field feature vector corresponding to each candidate scientific and technological text as a reference vector;
calculating the cosine similarity between the reference vector of each candidate scientific and technological text in the first query candidate set and the interest vector, sorting the candidate scientific and technological texts in descending order of cosine similarity, and removing the candidate scientific and technological texts whose cosine similarity is smaller than a set value to obtain a second query candidate set;
dividing the second query candidate set into a second set number of segments according to the value range of the cosine similarities of its candidate scientific and technological texts, and obtaining the influence factor of the candidate scientific and technological texts in each segment;
and reordering the candidate scientific and technological texts within each segment of the second query candidate set in descending order of influence factor to obtain a query result.
6. The scientific and technological resource query method according to claim 5, wherein the database includes a data acquisition layer, a data processing layer, and a data storage layer; a plurality of service function modules subscribe to scientific and technological texts from the data processing layer in a publish-subscribe manner, and each service function module is configured with its own data processing logic, which is executed uniformly by the data processing layer before the results are stored.
7. The scientific and technological resource query method according to claim 6, characterized in that the data acquisition layer is deployed on a plurality of hosts running the same processing logic and is scheduled by a distributed framework;
and/or the data processing layer is deployed on a plurality of hosts running the same processing logic and is scheduled by a distributed framework;
and/or the data storage layer is deployed on a plurality of hosts running the same processing logic and is scheduled by a distributed framework.
8. The scientific and technological resource query method according to claim 5, wherein obtaining the influence factor of the candidate scientific and technological texts in each segment comprises:
acquiring the citation count of each candidate scientific and technological text and the publication count of its first author, and taking a weighted average of the citation count and the publication count to obtain the influence factor of each candidate scientific and technological text.
9. An electronic device comprising a memory, a processor and a computer program stored on the memory and executable on the processor, characterized in that the steps of the method according to any of claims 1 to 8 are implemented when the processor executes the program.
10. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the steps of the method according to any one of claims 1 to 8.
CN202110781559.2A 2021-07-12 2021-07-12 Scientific research technology interest field recognition model training method, scientific and technological resource query method and device Active CN113239179B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110781559.2A CN113239179B (en) 2021-07-12 2021-07-12 Scientific research technology interest field recognition model training method, scientific and technological resource query method and device

Publications (2)

Publication Number Publication Date
CN113239179A CN113239179A (en) 2021-08-10
CN113239179B true CN113239179B (en) 2021-09-17

Family

ID=77135291

Country Status (1)

Country Link
CN (1) CN113239179B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant