Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a recommendation system for semi-structured, massive nuclear power information. On the one hand, the concept of a knowledge ontology is used to perform professional clustering analysis on the structured metadata of technical information, and a static massive data template in the hypothesis space is obtained through a massive data learning and analysis algorithm, in combination with the technical background and inductive preferences of nuclear power professionals. On the other hand, a text index is formed by text analysis of the unstructured documents in the massive data and is combined with the dynamic requirements of nuclear power professionals, so that index retrieval is carried out within the static massive data template. Static information and dynamic data are thereby combined, and knowledge recommendation for nuclear power professionals is completed.
The invention provides a method for matching the static data (including metadata and text) of massive semi-structured nuclear power technical documents against the massive data describing the requirements of nuclear power professionals (including static knowledge background and dynamic requirements). The method includes: configurable basic-information constraints for nuclear power technical documents and a technique for analyzing and recognizing the background of nuclear power professionals; a method for establishing a structured metadata clustering template and a static massive data template; a combination of dynamic log capture and analysis with text analysis; a weighted sorting algorithm applied to text matching using an inverted index; and a knowledge recommendation scheme for nuclear power professionals that integrates static information and dynamic requirements. The method meets the information propagation and reconstruction requirements of enterprise knowledge management and ensures that professional technicians can obtain accurate, matching information in a timely and effective manner.
Example one:
Fig. 1 shows an implementation process of a recommendation method for mass information data according to an embodiment of the present invention, where the implementation process is detailed as follows:
Step S101, metadata information is acquired from the enterprise content management system ECM.
In the embodiment of the present invention, the ECM may be a nuclear power enterprise content management system, and the ECM includes a large amount of enterprise content, including but not limited to metadata information, unstructured file text content, system access and retrieval related logs, and personnel information.
Step S102, a metadata clustering template is generated according to the metadata set sample space of the metadata information.
Specifically, a complex metadata structure is simplified to generate a metadata clustering template, that is, contents represented by structured metadata are classified by a clustering method to extract a core metadata structure.
Step S103, obtaining the static attribute space of the user according to the relevant information of the user.
Specifically, the static attribute space of the technical personnel is obtained according to the technical personnel background, such as relevant information of profession, department, participation project, stage, position and the like, and the static attribute space of each technical personnel is recorded.
Step S104, a corresponding static mass data template is acquired according to the static attribute space of the user and the metadata clustering template.
Specifically, the static massive data template is obtained by combining nuclear power technology knowledge clustering obtained according to the metadata clustering template in the step S102 and professional background analysis data obtained in the step S103.
Step S105, monitoring the behavior log of the user, and acquiring the attention word of the user within the preset time according to the behavior log of the user.
Specifically, the user's points of attention are analyzed by a time-series-based method for monitoring and recording user behavior logs, and user behaviors and expectations are then mined from the log data.
First, the content that the user searches for, views and follows, as recorded by the system, is collected. Secondly, each retrieval is decomposed into several keywords, and the frequency and count of each keyword in the user's retrieved content are recorded along the time series. Finally, the user's recent popular attention words are formed.
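The keyword counting described for step S105 can be sketched as follows. This is a minimal illustration only: the log format (pairs of timestamp and query text), the whitespace-based keyword decomposition, the seven-day window and the function name `attention_words` are all assumptions made for the example and are not specified by the embodiment.

```python
from collections import Counter
from datetime import datetime, timedelta


def attention_words(log, now, window=timedelta(days=7), top_n=3):
    """Return the most frequent keywords among log entries inside the window.

    `log` is a list of (timestamp, query_text) pairs (assumed format).
    Keywords are obtained by whitespace splitting, a stand-in for the
    keyword decomposition that a real system would perform.
    """
    counts = Counter()
    for timestamp, query in log:
        # Only behaviour inside the preset time window is counted.
        if now - window <= timestamp <= now:
            counts.update(query.lower().split())
    return counts.most_common(top_n)


# Hypothetical behaviour log for one user.
log = [
    (datetime(2024, 1, 1), "reactor coolant pump"),
    (datetime(2024, 1, 9), "coolant pump seal"),
    (datetime(2024, 1, 10), "pump vibration"),
]
recent = attention_words(log, now=datetime(2024, 1, 10))
```

With this log, the entry of 1 January falls outside the window, so only the two later queries contribute and "pump" becomes the top attention word.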
Step S106, a text index is formed according to text analysis of the unstructured documents in the mass data.
Specifically, information is obtained from a text set, the text is analyzed and preprocessed according to a nuclear power dictionary, the vocabulary in the text is screened and recognized, and useless words are removed according to a stop word list. Feature extraction weights and ranks the words according to their word frequency in the text set and the ratio of the number of texts in which a word appears to the total number of texts; in addition, words that are in the dictionary are given a higher weight. A number of the top-ranked feature words are then selected to form a feature vector, the massive texts are indexed by a MapReduce algorithm, and the feature results and abstracts of the documents are output.
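As a rough illustration of step S106, the sketch below builds a small in-memory inverted index. It is not the embodiment's implementation: a TF-IDF-style weight is swapped in for the unspecified weighting rule, the stop word list, the domain dictionary and the boost factor are hypothetical, and a single-process loop replaces the MapReduce indexing mentioned in the text.

```python
import math
from collections import Counter, defaultdict

STOP_WORDS = {"the", "of", "a", "and", "is"}   # hypothetical stop word list
DOMAIN_TERMS = {"reactor", "coolant"}          # hypothetical nuclear power dictionary
DOMAIN_BOOST = 2.0                             # assumed extra weight for dictionary words


def tokenize(text):
    """Lower-case, split on whitespace and drop stop words."""
    return [w for w in text.lower().split() if w not in STOP_WORDS]


def build_index(docs, n_features=3):
    """Return (inverted index, per-document feature list).

    The index maps word -> {doc_id: weight}; each document keeps only its
    `n_features` highest-weighted words as its feature vector.
    """
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    doc_freq = Counter()
    for words in tokens.values():
        doc_freq.update(set(words))            # number of texts containing each word

    index = defaultdict(dict)
    features = {}
    for doc_id, words in tokens.items():
        weights = {}
        for word, count in Counter(words).items():
            idf = math.log(len(docs) / doc_freq[word]) + 1.0
            boost = DOMAIN_BOOST if word in DOMAIN_TERMS else 1.0
            weights[word] = (count / len(words)) * idf * boost
        # Rank by weight (ties broken alphabetically) and keep the top words.
        top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:n_features]
        features[doc_id] = top
        for word, weight in top:
            index[word][doc_id] = weight
    return dict(index), features


docs = {"d1": "the reactor coolant pump", "d2": "pump maintenance schedule"}
index, features = build_index(docs)
```

Looking up a word in `index` returns the documents that contain it together with their weights, which is the form a later ranking step can consume.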
Step S107, the content to be recommended is searched for according to the text index, the attention words of the user within the preset time and the static mass data template.
Specifically, dynamic index retrieval is established on the basis of the sample space under the static data space model algorithm and the index of the unstructured texts, and the knowledge information to be recommended is finally selected through index sorting.
The embodiment of the invention can combine the static information with the dynamic data and quickly finish the data knowledge pushing of nuclear power professionals, thereby ensuring that the professionals can timely and effectively obtain accurate matched effective information.
Example two:
Fig. 2 shows an implementation process of the recommendation method for mass information data according to the second embodiment of the present invention, where the implementation process is detailed as follows:
Step S201, metadata information is acquired from the enterprise content management system ECM.
The step is the same as step S101, and reference may be made to the related description of step S101, which is not repeated herein.
Step S202, a metadata clustering template is generated according to the metadata set sample space of the metadata information.
The step is the same as step S102, and reference may be made to the related description of step S102, which is not repeated herein.
Optionally, the generating a metadata clustering template according to the metadata set sample space of the metadata information includes:
step one, randomly selecting K objects from the metadata set sample space as initial cluster centers, wherein K is an integer greater than zero, and one cluster object corresponds to one class of technical documents;
step two, calculating the similarity between every object in the metadata set sample space and each of the K cluster centers, and assigning each object to the cluster with which it has the highest similarity;
step three, recalculating the cluster center of each cluster according to the objects in that cluster, so as to obtain K recalculated cluster centers;
step four, if any of the K recalculated cluster centers has changed, recalculating the similarity between every object and the K recalculated cluster centers, and assigning each object to the cluster with the highest corresponding similarity to form new cluster objects;
and step five, repeating step three and step four until the K cluster centers no longer change, wherein the K cluster centers form the metadata clustering template.
The metadata attribute set space is composed of a collection of independent attribute sets and may be multi-dimensional. K objects are randomly selected from the metadata set sample space as the initial cluster centers (K may be greater than or equal to the total number of professional technical disciplines). The similarity between each object and the K cluster centers is calculated, each object is assigned to the most similar cluster, and a new mean (center) of the objects in each cluster is calculated. The similarity between each object and the K new cluster centers is then calculated, and each object is reassigned to the most similar cluster to form new cluster objects. The cluster means are updated in this way until they no longer change, and the metadata clustering template is finally formed.
It should be noted that the static massive data template includes a plurality of cluster objects, each cluster object includes knowledge contents with the same technical features, that is, one cluster object is a class of technical documents.
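The clustering procedure of steps one to five corresponds to the familiar K-means iteration and can be sketched as below. The sketch assumes that each metadata object has already been encoded as a numeric tuple and that similarity is measured by negative squared Euclidean distance; neither the encoding nor the similarity measure is fixed by the embodiment, and the function names are illustrative.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance; a smaller value means higher similarity."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def k_means(objects, k, seed=0, max_iter=100):
    """Cluster numeric tuples into k groups and return (centers, clusters)."""
    rng = random.Random(seed)
    centers = rng.sample(objects, k)                 # step one: random initial centers
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for obj in objects:                          # steps two and four: assign to nearest center
            nearest = min(range(k), key=lambda c: squared_distance(obj, centers[c]))
            clusters[nearest].append(obj)
        new_centers = [                              # step three: recompute each center as the mean
            tuple(sum(dim) / len(members) for dim in zip(*members)) if members else centers[c]
            for c, members in enumerate(clusters)
        ]
        if new_centers == centers:                   # step five: stop when no center changes
            break
        centers = new_centers
    return centers, clusters


# Hypothetical two-dimensional metadata encodings forming two obvious groups.
objects = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centers, clusters = k_means(objects, k=2)
```

For this toy data the iteration settles on one center per group regardless of which two objects are drawn initially; on real metadata the result generally depends on the initial selection.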
Step S203, obtaining the static attribute space of the user according to the relevant information of the user.
The step is the same as step S103, and reference may be made to the related description of step S103, which is not described herein again.
Step S204, a corresponding static mass data template is acquired according to the static attribute space of the user and the metadata clustering template.
The static attribute space of the user corresponds to the technical characteristic parameters described by the metadata clustering template. The intersection of the attribute parameters of the user and those of the metadata clustering template is taken, and the attribute weights are then adjusted according to actual business needs to form the static data model template.
Optionally, each user belongs to a category of technology-concerned groups; the obtaining of the corresponding static massive data template according to the static attribute space of the user and the metadata clustering template includes:
calculating the matching relation between each class of technical documents and each class of attention group μ according to the attribute parameters in the static attribute space of the user and the attribute parameters in the metadata clustering template, so as to obtain the static mass data template, wherein att_i is the i-th attribute parameter in the intersection of the attribute parameters in the static attribute space of the user and the attribute parameters in the metadata clustering template, n is the number of attribute parameters in the intersection, Meta(att_i) is the attribute information of att_i in the metadata clustering template, Speciality(att_i) is the attribute information of att_i in the static attribute space of the user, and each att_i has an associated weight.
For any document belonging to the static sample space D of the user group μ, the static support strength V(μ) is, for each attribute parameter att_i, inversely related to the variance between the attribute information of att_i in the metadata clustering template and its attribute information in the static attribute space of the user. This value is further multiplied by the importance indication of att_i, namely its weight. After the information of all attributes is aggregated, the static support strength is formed.
The greater the support strength, the higher the attention degree of the group, so each professional attention matrix can be formed according to the ranking for the use of the subsequent modules.
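The qualitative description of the static support strength can be illustrated with the sketch below. The exact formula is not reproduced in this text, so the expression used here, a weighted sum of 1 / (1 + squared difference) over the shared attributes, is purely an assumption chosen to satisfy the stated properties (inverse relation to the variance, multiplication by the weight, aggregation over all attributes). The numeric attribute encodings and all names are likewise hypothetical.

```python
def static_support(meta, speciality, weights):
    """Illustrative static support strength for one document class and one user group.

    meta:        attribute -> value in the metadata clustering template
    speciality:  attribute -> value in the user's static attribute space
    weights:     attribute -> importance weight (defaults to 1.0)
    """
    shared = sorted(set(meta) & set(speciality))     # intersection of attribute parameters
    return sum(
        # Assumed form: the closer the two values, the larger the contribution.
        weights.get(att, 1.0) / (1.0 + (meta[att] - speciality[att]) ** 2)
        for att in shared
    )


def rank_classes(templates, speciality, weights):
    """Rank document classes for one user group by descending support strength."""
    scored = {name: static_support(meta, speciality, weights) for name, meta in templates.items()}
    return sorted(scored.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical numeric encodings of attributes.
templates = {
    "class_a": {"profession": 1.0, "stage": 2.0, "plant": 5.0},
    "class_b": {"profession": 3.0, "stage": 4.0},
}
user = {"profession": 1.0, "stage": 4.0}
weights = {"profession": 2.0, "stage": 1.0}
ranking = rank_classes(templates, user, weights)
```

Under these assumptions "class_a" ranks first because it agrees with the user on the heavily weighted "profession" attribute; the "plant" attribute is ignored because it is outside the intersection.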
The step is the same as step S104, and reference may be made to the related description of step S104, which is not repeated herein.
Step S205, monitoring the behavior log of the user, and acquiring the attention word of the user within the preset time according to the behavior log of the user.
The step is the same as step S105, and reference may be made to the related description of step S105, which is not repeated herein.
Step S206, a text index is formed according to text analysis of the unstructured documents in the mass data.
The step is the same as step S106, and reference may be made to the related description of step S106, which is not repeated herein.
Step S207, the content to be recommended is searched for according to the text index, the attention words of the user within the preset time and the static mass data template.
Dynamic index retrieval is established on the basis of the sample space under the static data space model algorithm and the index of the unstructured texts, and the knowledge information to be recommended is finally selected through index sorting.
The dynamic index retrieval analysis is divided into two aspects, namely content support strength and time support strength.
The content support strength has two parts. The first covers the sample space in the static mass data template: each piece of data in the sample space has a corresponding support strength, calculated from the metadata of the nuclear power document. The second comes from the text index formed by text analysis of the unstructured documents in the mass data; this part is called the full-text support strength and is the result obtained through the full-text index of the document.
The time support strength can be understood as freshness. From the document perspective, the time factor of document generation is called the document freshness. The knowledge content viewed, retrieved, downloaded and followed by the user, as monitored in step S205, is also related to time; this is called the attention freshness. The content of each attention point and its freshness are obtained by calculation along the time dimension.
And finally, calculating to obtain a final recommended content result according to the latest attention point of the user and the index sequence of the sample space.
Optionally, the searching for the content to be recommended according to the text index, the attention word of the user within a preset time, and the static massive data template includes:
acquiring the frequency with which the attention words of the user within the preset time appear in the text index, wherein j indexes the attention words of the user within the preset time;
calculating the recommendation strength of each class of technical documents according to this frequency and V(μ, ), wherein m is the number of attention words of the user within the preset time, each attention word has an attention temporal freshness weight and an attention frequency weight, and τ() is the update time parameter of the document;
and generating recommendation content in a list form for the technical documents corresponding to the recommendation strength meeting the preset conditions according to the recommendation strength of each type of technical documents.
The preset time may be a period time set by a user, for example, one week, and is not limited herein. The preset condition may be recommendation strength greater than a preset threshold, and the technical documents corresponding to the recommendation strengths may be arranged in a descending order according to the recommendation strength.
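A possible way to combine the quantities named above into a ranked recommendation list is sketched below. The recommendation strength formula itself is not reproduced in this text, so the product form used here (static support × document freshness × weighted sum of attention-word frequencies) is an assumption for illustration only, as are the dictionary layout and the function name `recommend`.

```python
def recommend(docs, attention, threshold=0.0):
    """Return (doc_id, strength) pairs above the threshold, strongest first.

    docs:      list of dicts with keys "id", "static_support", "freshness"
               and "term_freq" (word -> frequency in the text index)
    attention: word -> (temporal freshness weight, frequency weight)
    """
    scored = []
    for doc in docs:
        # Dynamic part: weighted frequency of each attention word in the document.
        dynamic = sum(
            fresh_w * freq_w * doc["term_freq"].get(word, 0)
            for word, (fresh_w, freq_w) in attention.items()
        )
        # Assumed combination of static support, document freshness and dynamic part.
        strength = doc["static_support"] * doc["freshness"] * dynamic
        if strength > threshold:                     # preset condition
            scored.append((doc["id"], strength))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical documents and attention words.
docs = [
    {"id": "A", "static_support": 2.0, "freshness": 1.0, "term_freq": {"pump": 3}},
    {"id": "B", "static_support": 1.0, "freshness": 0.5, "term_freq": {"pump": 1, "seal": 2}},
    {"id": "C", "static_support": 3.0, "freshness": 1.0, "term_freq": {}},
]
attention = {"pump": (1.0, 0.5), "seal": (0.5, 1.0)}
result = recommend(docs, attention)
```

Document "C" is dropped even though its static support is the highest, because it contains none of the user's attention words; this reflects the intended interplay between static information and dynamic requirements.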
Step S208, the searched content to be recommended and the static mass data template are recorded.
The operation process is recorded: on the one hand the static support vector result, and on the other hand the dynamic requirement update process and the dynamic index information.
Example three:
Fig. 3 is a schematic composition diagram of a recommendation apparatus for massive information data according to a third embodiment of the present invention. For convenience of description, only the parts related to this embodiment are shown, which are detailed as follows:
a metadata information acquisition module 31, configured to acquire metadata information from the enterprise content management system ECM;
a metadata clustering template generation module 32, configured to generate a metadata clustering template according to a metadata set sample space of the metadata information;
a static attribute space obtaining module 33, configured to obtain a static attribute space of a user according to relevant information of the user;
a static mass data template obtaining module 34, configured to obtain a corresponding static mass data template according to the static attribute space of the user and the metadata clustering template;
an attention word obtaining module 35, configured to monitor the behavior log of the user, and obtain the attention words of the user within a preset time according to the behavior log of the user;
a text index forming module 36, configured to form a text index according to text analysis of the unstructured documents in the mass data;
and a recommended content searching module 37, configured to search for the content to be recommended according to the text index, the attention words of the user within the preset time, and the static mass data template.
The metadata information acquisition module 31 is the interface module between the recommendation apparatus for mass information data and the enterprise content management platform, and is responsible for data interaction with the ECM (the nuclear power enterprise content management system). The enterprise content mainly includes metadata information, unstructured file text content, system access and retrieval logs, and personnel information. This information is stored centrally in the metadata information acquisition module 31 for the other modules to call; its main consumer is the metadata clustering template generation module 32.
In addition, the metadata information acquisition module 31 is responsible for updating the integrated system data.
The recommendation device for mass information data provided in the embodiment of the present invention may be used in the aforementioned first recommendation method embodiment, and for details, refer to the description of the aforementioned first recommendation method embodiment, which is not described herein again.
Example four:
Fig. 4 is a schematic composition diagram of a recommendation apparatus for mass information data according to a fourth embodiment of the present invention. For convenience of description, only the parts related to this embodiment are shown, which are detailed as follows:
a metadata information acquisition module 41, configured to acquire metadata information from the enterprise content management system ECM;
a metadata clustering template generating module 42, configured to generate a metadata clustering template according to a metadata set sample space of the metadata information;
a static attribute space obtaining module 43, configured to obtain a static attribute space of a user according to relevant information of the user;
a static mass data template obtaining module 44, configured to obtain a corresponding static mass data template according to the static attribute space of the user and the metadata clustering template;
an attention word obtaining module 45, configured to monitor a behavior log of the user, and obtain an attention word of the user within a preset time according to the behavior log of the user;
a text index forming module 46, configured to form a text index according to text analysis of the unstructured documents with mass data;
a recommended content searching module 47, configured to search for a content to be recommended according to the text index, the word of interest of the user within a preset time, and the static mass data template;
and the log recording module 48 is used for recording the searched content to be recommended and the static mass data template.
The metadata clustering template generating module 42 includes:
a selecting unit 421, configured to arbitrarily select K objects from the metadata set sample space as initial cluster centers, where K is an integer greater than zero, and one of the cluster objects corresponds to one class of technical documents;
a first calculating unit 422, configured to calculate similarities of all objects in the metadata set sample space and K cluster centers, and classify each object in all the objects into a cluster with the highest similarity to the object;
a second calculating unit 423 for recalculating the cluster center of each cluster according to the objects in the cluster to recalculate K cluster centers;
a third calculating unit 424, configured to, if any cluster center of the recalculated K cluster centers changes, recalculate the similarity between the all objects and the recalculated K cluster centers, and classify each object of the all objects into a cluster with the highest corresponding similarity, so as to form a new cluster object;
a forming unit 425 configured to repeatedly execute the second calculating unit and the third calculating unit until K cluster centers are no longer changed, the K cluster centers forming the metadata clustering template.
Each user belongs to a technical concern group; the static massive data template obtaining module 44 is specifically configured to:
calculating the matching relation between each class of technical documents and each class of attention group μ according to the attribute parameters in the static attribute space of the user and the attribute parameters in the metadata clustering template, so as to obtain the static mass data template, wherein att_i is the i-th attribute parameter in the intersection of the attribute parameters in the static attribute space of the user and the attribute parameters in the metadata clustering template, n is the number of attribute parameters in the intersection, Meta(att_i) is the value of att_i in the metadata clustering template, Speciality(att_i) is the value of att_i in the static attribute space of the user, and each att_i has an associated weight.
The recommended content search module 47 includes:
a frequency obtaining unit 471, configured to obtain the frequency with which the attention words of the user within the preset time appear in the text index, wherein j indexes the attention words of the user within the preset time;
a recommendation strength calculation unit 472, configured to calculate the recommendation strength of each class of technical documents according to this frequency and V(μ, ), wherein m is the number of attention words of the user within the preset time, each attention word has an attention temporal freshness weight and an attention frequency weight, and τ() is the update time parameter of the document;
a recommended content generating unit 473, configured to generate, according to the recommendation strength of each class of technical documents, the recommended content in list form from the technical documents whose recommendation strength meets the preset condition.
The recommendation device for mass information data provided in the embodiment of the present invention may be used in the aforementioned second corresponding recommendation method embodiment, and for details, reference is made to the description of the aforementioned second embodiment, which is not described herein again.
It can be clearly understood by those skilled in the art that, for convenience and simplicity of description, the foregoing division of the functional modules is merely used as an example, and in practical applications, the foregoing function distribution may be completed by different functional modules as required, that is, the internal structure of the apparatus is divided into different functional modules, and the functional modules may be implemented in a hardware form or a software form. In addition, the specific names of the functional modules are only for convenience of distinguishing from each other, and are not used for limiting the protection scope of the present application.
In conclusion, the embodiments of the invention address the problem of recommending massive nuclear power information. They effectively combine the characteristics of nuclear power technical documents and the professional attributes of professionals with the users' attention information, and can adapt to various nuclear power technical routes. The system dynamically records user attention information and records the related operations in log form. The invention provides an intelligent knowledge extraction and matching method for nuclear power technical data, which improves the efficiency and accuracy with which nuclear power technical knowledge is propagated, improves working efficiency, reduces production cost, and is stable and reliable.
It will be further understood by those skilled in the art that all or part of the steps in the method for implementing the above embodiments may be implemented by relevant hardware instructed by a program stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents and improvements made within the spirit and principle of the present invention are intended to be included within the scope of the present invention.