Disclosure of Invention
In order to solve the problems of the above schemes, the invention provides an information retrieval and analysis system based on big data. The system retrieves the learning resources stored in each storage end in order of retrieval priority value, which avoids simultaneous information interaction across the entire large-capacity original resource database, reduces the data retrieval pressure on the original resource database, improves retrieval efficiency, and avoids wasting retrieval resources.
The purpose of the invention can be realized by the following technical scheme:
an information retrieval and analysis system based on big data comprises a data acquisition module, an information input module, an information retrieval module, an analysis module, an information receiving module, a browsing module and an evaluation module;
the data acquisition module is used for acquiring learning resource information of an education platform to form an original resource database, and the original resource database comprises a plurality of storage terminals;
the information input module is used for logging in by a user, inputting retrieval information and sending the input retrieval information to the information retrieval module; when a user inputs retrieval information through the information input module, the analysis module is used for tracking a login account of the user, performing statistical analysis on a historical retrieval record of the user to obtain a storage end sequence table of the retrieval, and feeding the storage end sequence table back to the information retrieval module;
after the information retrieval module receives the storage end sequence table, the learning resources stored in the corresponding storage end are sequentially retrieved by combining the retrieval information input by the current user;
the information receiving module is used for receiving the retrieval result of the information retrieval module, auditing and filtering the retrieval result, and pushing the corresponding retrieval result to the user terminal; the browsing module is used for the user terminal to select the retrieval result for looking up until the target data is found, and feeding the target data back to the information retrieval module; and when the user logs out, the evaluation module is used for evaluating the retrieval service of the learning resources by the user.
Further, the specific analysis steps of the analysis module are as follows:
when a user inputs retrieval information, tracking a login account of the user, and collecting retrieval records of the user in the last three months; the retrieval records carry corresponding target data;
acquiring a storage end where each target data is located, and counting the occurrence times of the same storage end and the total browsing time of the target data in the same storage end; calculating to obtain a retrieval attraction value Gi of the storage end;
acquiring all retrieval results fed back by historical retrieval information matched with the current retrieval information; counting the distribution proportion of the corresponding retrieval results at each storage end and marking as the storage end occupation ratio Zi;
calculating, by using a formula, a retrieval priority value JSi of each storage end for the current retrieval, and sorting the storage ends by retrieval priority value JSi to obtain the storage end sequence table for the current retrieval.
Further, the specific auditing and filtering steps of the information receiving module are as follows:
s1: acquiring a plurality of retrieval results of the retrieval information; extracting original keywords of the learning resources corresponding to each retrieval result, and performing data cleaning on the original keywords to obtain learning keywords;
s2: storing the learning keywords in a specific data format as key information, and establishing a key information coding table of the learning resources;
s3: performing coverage rate analysis on the key information coding tables corresponding to any two retrieval results, and filtering to obtain representative retrieval results;
s4: performing access value analysis on the representative retrieval results, and feeding the top W1 representative retrieval results ranked by access value back to the user terminal, wherein W1 is a preset value.
Further, the original keywords are keywords whose frequency of occurrence in the text corresponding to the learning resource exceeds a set threshold; the specific process of performing data cleaning on the original keywords comprises: unifying keywords with the same or similar meaning, and removing keywords that have no actual analytical meaning.
Further, the coverage rate is the proportion of identical codes in the two key information coding tables: coverage rate = number of identical codes / code count reference value, where the code count reference value is the smaller of the total numbers of codes in the two coding tables.
Further, the filtering in step S3 to obtain the representative search result specifically includes:
if the coverage rate exceeds gamma%, keeping the retrieval result with the larger number of codes as the representative retrieval result and rejecting the other retrieval result; if the coverage rate does not exceed gamma%, taking both retrieval results as representative retrieval results; then performing coverage rate analysis between the representative retrieval results and the remaining retrieval results, and so on, wherein gamma is a preset value.
Further, the specific working steps of the evaluation module are as follows:
marking the service score of the user as Qs, acquiring the number of representative retrieval results consulted before the user finds the target data, and marking the representative retrieval results as Cs; calculating the time difference between the time when the user inputs the retrieval information and the time when the target data is fed back to obtain a retrieval time GT;
calculating a retrieval satisfaction value QR of the user by using the formula QR = (Qs × r1)/(Cs × r2 + GT × r3), wherein r1, r2 and r3 are coefficient factors; the evaluation module is used for stamping the retrieval satisfaction value QR with a time stamp, storing it in the storage module, and transmitting it to the display module for real-time display.
Furthermore, the original resource database is used for extracting the release time information of each piece of stored learning resource information and classifying the stored learning resource information according to a plurality of time periods; each storage terminal is in one-to-one correspondence with each type of learning resource information and is used for storing the learning resource information of the corresponding type.
Compared with the prior art, the invention has the beneficial effects that:
1. when a user inputs retrieval information through the information input module, the analysis module is used for tracking a login account of the user, performing statistical analysis on historical retrieval records of the user and generating a corresponding storage end sequence table;
2. the information receiving module receives the retrieval results of the information retrieval module and audits and filters them: it first performs coverage rate analysis on the key information coding tables corresponding to any two retrieval results, screens out homologous retrieval results, and keeps the retrieval result with the larger number of codes as the representative retrieval result; this reduces the number of options while still giving the user richer and more comprehensive learning resources, prevents the user from spending time and energy on similar learning resources, and improves retrieval efficiency;
3. when the user logs out, the evaluation module is used for evaluating the retrieval service of the learning resources by the user, calculating the retrieval satisfaction value of the user by combining the service score of the user, the number of representative retrieval results consulted before the user finds the target data and the retrieval duration, and transmitting the retrieval satisfaction value to the display module for real-time display, so that the administrator can conveniently and visually know the retrieval satisfaction value.
Detailed Description
The technical solutions of the present invention will be described clearly and completely with reference to the following embodiments, and it should be understood that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, an information retrieval and analysis system based on big data includes a data acquisition module, an original resource database, an information input module, an information retrieval module, an analysis module, an information receiving module, a browsing module, an evaluation module, a storage module, and a display module;
the data acquisition module is used for acquiring learning resource information of the education platform to form an original resource database, and the original resource database is used for extracting the release time information of each piece of stored learning resource information and classifying the stored learning resource information according to a plurality of time periods;
the original resource database comprises a plurality of storage ends, each storage end corresponds to each type of learning resource information one by one, and each storage end is used for storing the corresponding type of learning resource information;
the information input module is used for logging in by a user, inputting retrieval information and sending the input retrieval information to the information retrieval module; the information retrieval module is used for retrieving the learning resources according to the retrieval information; the retrieval information comprises retrieval keywords or key characters;
the information input module is connected with the analysis module, when a user inputs retrieval information through the information input module, the analysis module is used for tracking a login account of the user, performing statistical analysis on a historical retrieval record of the user to obtain a storage end sequence table of the retrieval, and the specific analysis steps are as follows:
the first step is as follows: when a user inputs retrieval information, tracking a login account of the user, and collecting retrieval records of the user in the last three months; the retrieval record comprises input retrieval information and corresponding target data; one piece of retrieval information corresponds to one or more retrieval results, and a user selects a required retrieval item, namely target data;
the second step is that: acquiring a storage end where each target datum is located, counting the occurrence times of the same storage end according to the storage end and marking the occurrence times as storage end frequency Pi; wherein i represents the ith storage end;
accumulating the browsing time of each target data in the same storage end to form a total storage end duration Ti; normalizing the frequency of the storage end and the total time length of the storage end and taking the numerical values of the frequency and the total time length;
calculating a retrieval attraction value Gi of the storage end by using the formula Gi = Pi × a1 + Ti × a2, wherein a1 and a2 are coefficient factors;
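The second step can be illustrated with a minimal Python sketch. The record layout, the function name `attraction_values`, the divide-by-maximum normalization and the default weights for a1 and a2 are all assumptions made for illustration; the description above does not fix any of them.

```python
from collections import defaultdict


def attraction_values(records, a1=0.6, a2=0.4):
    """Compute Gi = Pi * a1 + Ti * a2 for every storage end.

    records: iterable of (storage_end_id, browsing_seconds), one entry per
    target datum found in the user's retrieval history (assumed layout).
    """
    freq = defaultdict(int)     # Pi: how often each storage end occurs
    total = defaultdict(float)  # Ti: total browsing time per storage end
    for end_id, seconds in records:
        freq[end_id] += 1
        total[end_id] += seconds

    def normalize(values):
        # Assumed normalization: scale by the maximum so values fall in [0, 1].
        peak = max(values.values(), default=0)
        return {k: (v / peak if peak else 0.0) for k, v in values.items()}

    p = normalize(freq)
    t = normalize(total)
    return {end_id: p[end_id] * a1 + t[end_id] * a2 for end_id in freq}


g = attraction_values([("A", 100), ("A", 300), ("B", 200)])
```

In this example storage end A is visited twice for 400 seconds in total and B once for 200 seconds, so A receives the higher attraction value.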
the third step: segmenting the retrieval information and matching the currently input retrieval information against the historically input retrieval information; if the degree of coincidence of the keywords or key characters exceeds mu%, the match succeeds; wherein mu is a preset value, here 95;
acquiring all retrieval results fed back by historical retrieval information matched with the current retrieval information; counting the distribution proportion of the corresponding retrieval result at each storage end according to the storage end to which the retrieval result belongs, and marking as a storage end occupation ratio Zi; wherein Zi is in one-to-one correspondence with Gi;
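A minimal Python sketch of the third step follows. How the degree of coincidence is measured is not specified above; the sketch assumes it is the percentage of the current keywords that also occur in the historical query. The names `keywords_match` and `storage_end_shares` and the history layout are hypothetical.

```python
from collections import Counter


def keywords_match(current, historical, mu=95.0):
    """Return True if the keyword overlap percentage exceeds mu (assumed measure)."""
    cur, hist = set(current), set(historical)
    if not cur:
        return False
    overlap = 100.0 * len(cur & hist) / len(cur)
    return overlap > mu


def storage_end_shares(current, history, mu=95.0):
    """Compute Zi: the share of matched historical results held by each storage end.

    history: list of (keywords, storage_end_ids), where storage_end_ids lists
    the storage end of every result fed back for that historical query.
    """
    counts = Counter()
    for keywords, result_ends in history:
        if keywords_match(current, keywords, mu):
            counts.update(result_ends)
    total = sum(counts.values())
    return {end: n / total for end, n in counts.items()} if total else {}


history = [
    (["python", "loops"], ["A", "A", "B"]),
    (["history"], ["C"]),
]
z = storage_end_shares(["python", "loops"], history)
```

Only the first historical query matches here, so the shares are computed over its three results alone.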
the fourth step: carrying out normalization processing on the retrieval attraction value and the storage end ratio and taking the numerical values;
calculating, by using a formula, the retrieval priority value JSi of the storage end for the current retrieval, wherein f1 and f2 are preset coefficient factors, and eta is a fixed value;
sorting the storage ends according to the size of the retrieval priority value JSi to obtain a storage end sequence table of the retrieval;
the analysis module is used for feeding back the storage end sequence table searched at this time to the information retrieval module, and after the information retrieval module receives the storage end sequence table, the information retrieval module sequentially retrieves the learning resources stored in the corresponding storage end by combining the retrieval information input by the current user;
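The ordering and the sequential retrieval can be sketched in Python as below. The priority values JSi are taken as given inputs, descending order is assumed for the sequence table, and substring matching stands in for whatever matching the information retrieval module actually performs; all names are hypothetical.

```python
def storage_end_sequence(priority_values):
    """Order storage end ids by retrieval priority value, highest first (assumed)."""
    return sorted(priority_values, key=priority_values.get, reverse=True)


def retrieve_in_order(sequence, storage_ends, query_terms):
    """Search the storage ends one after another, following the sequence table.

    storage_ends: dict mapping storage end id to a list of resource titles.
    A resource is treated as a hit if any query term occurs in its title.
    """
    results = []
    for end_id in sequence:
        for resource in storage_ends.get(end_id, []):
            if any(term in resource for term in query_terms):
                results.append((end_id, resource))
    return results


sequence = storage_end_sequence({"A": 0.2, "B": 0.9, "C": 0.5})
storage_ends = {
    "A": ["python basics"],
    "B": ["python loops", "algebra"],
    "C": ["history"],
}
hits = retrieve_in_order(sequence, storage_ends, ["python"])
```

Because storage end B has the highest priority value, its matching resource is returned before the one held in storage end A.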
the method can sequentially retrieve the learning resources stored in the corresponding storage end according to the retrieval priority value JSi, can avoid synchronous interaction of information in the original resource database with large storage capacity, reduce the data retrieval pressure of the original resource database, improve the retrieval efficiency and avoid the waste of retrieval resources;
the information receiving module is used for receiving the retrieval result of the information retrieval module, auditing and filtering the retrieval result, and pushing the corresponding retrieval result to the user terminal; the specific examination and filtration steps are as follows:
s1: acquiring a plurality of retrieval results of the retrieval information; extracting original keywords of the learning resources corresponding to each retrieval result, and performing data cleaning on the original keywords to obtain learning keywords;
the original keywords are keywords whose frequency of occurrence in the text corresponding to the learning resource exceeds a set threshold; the specific process of performing data cleaning on the original keywords comprises: unifying keywords with the same or similar meaning, and removing keywords that have no actual analytical meaning;
s2: then, the learning keywords are stored into a specific data format to be used as key information for storage, a key information coding table of learning resources is established, and each learning keyword in the key information corresponds to one binary code respectively;
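Steps s1 and s2 can be sketched in Python as follows. The synonym map, the stop-word set, the threshold and the 8-bit code width are illustrative assumptions; the sketch also assumes one codebook shared by all retrieval results so that the same learning keyword always receives the same binary code.

```python
from collections import Counter

SYNONYMS = {"ml": "machine learning"}  # hypothetical synonym map
STOPWORDS = {"the", "and", "of"}       # hypothetical words with no analytical meaning


def learning_keywords(tokens, threshold=2):
    """Keep tokens occurring more than `threshold` times, then clean them."""
    counts = Counter(tokens)
    original = [word for word, n in counts.items() if n > threshold]
    cleaned = {SYNONYMS.get(word, word) for word in original if word not in STOPWORDS}
    return sorted(cleaned)


def encode(keywords, codebook):
    """Build a key information coding table: keyword -> binary code string.

    codebook is shared across retrieval results and is extended in place
    whenever a keyword is seen for the first time.
    """
    table = {}
    for keyword in keywords:
        if keyword not in codebook:
            codebook[keyword] = format(len(codebook), "08b")
        table[keyword] = codebook[keyword]
    return table


tokens = ["ml"] * 3 + ["the"] * 5 + ["machine learning"] * 3 + ["tree"] * 3 + ["graph"]
keywords = learning_keywords(tokens)
codebook = {}
table = encode(keywords, codebook)
```

Sharing the codebook is what makes the coding tables of two retrieval results comparable code by code in the coverage rate analysis.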
s3: performing coverage rate analysis on the key information coding tables corresponding to any two retrieval results, and if the coverage rate exceeds gamma%, considering the two retrieval results as homologous retrieval results, wherein gamma is a preset value and takes a value of 97;
for homologous retrieval results, counting the number of codes in the key information coding table corresponding to each retrieval result, keeping the retrieval result with the larger number of codes as the representative retrieval result, and rejecting the other retrieval result; if the coverage rate does not exceed gamma%, taking both retrieval results as representative retrieval results; then performing coverage rate analysis between the representative retrieval results and the remaining retrieval results, and so on;
wherein the coverage rate is the proportion of identical codes in the two key information coding tables: coverage rate = number of identical codes / code count reference value, where the code count reference value is the smaller of the total numbers of codes in the two coding tables;
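The coverage rate analysis of step s3 can be sketched in Python as below. The "and so on" iteration is interpreted here as a single greedy pass that compares each retrieval result with the representatives kept so far; that interpretation, the function names and the sample data are assumptions.

```python
def coverage(codes_a, codes_b):
    """Coverage rate in percent: identical codes / size of the smaller coding table."""
    smaller = min(len(codes_a), len(codes_b))
    if smaller == 0:
        return 0.0
    return 100.0 * len(codes_a & codes_b) / smaller


def filter_representatives(results, gamma=97.0):
    """Return the ids of the representative retrieval results.

    results: dict mapping result id to the set of codes in its coding table.
    Of two homologous results, the one with more codes is kept.
    """
    representatives = []
    for result_id, codes in results.items():
        homologous = False
        for i, rep_id in enumerate(representatives):
            if coverage(codes, results[rep_id]) > gamma:
                if len(codes) > len(results[rep_id]):
                    representatives[i] = result_id  # the richer result replaces it
                homologous = True
                break
        if not homologous:
            representatives.append(result_id)
    return representatives


results = {
    "r1": {"00000001", "00000010"},
    "r2": {"00000001", "00000010", "00000011"},
    "r3": {"00000000"},
}
reps = filter_representatives(results)
```

Here r1 is fully covered by r2, so r2 replaces it, while r3 shares no codes with r2 and is kept as a second representative.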
according to the invention, the information receiving module is used for auditing and filtering the homologous retrieval results, so that a user can obtain more abundant and comprehensive learning resources, options can be reduced, the user is prevented from spending time and energy on similar learning resources, and the retrieval efficiency is improved;
s4: acquiring the representative retrieval results obtained in step S3, and performing access value analysis on them; sorting the representative retrieval results by access value and feeding the top W1 representative retrieval results back to the user terminal, so that the pushed results are more accurate and the retrieval efficiency is improved; wherein W1 is a preset value;
the access value acquisition method comprises the following steps:
s31: acquiring access information of the representative retrieval result within the ten days before the current system time; the access information comprises the access object and the access time;
s32: counting the number of visitors to the representative retrieval result according to the access objects and marking it as R1;
sorting the access times of the representative retrieval result in chronological order, and calculating the time difference between adjacent access times to obtain single access intervals;
summing all the single access intervals and averaging to obtain an access interval average value Gz;
calculating the time difference between the latest access time and the current time of the system to obtain a buffer duration HT; carrying out normalization processing on the number of visitors, the mean value of the visiting intervals and the buffer duration and taking the numerical values of the number of visitors, the mean value of the visiting intervals and the buffer duration;
calculating the access value FW of the representative retrieval result by using the formula FW = (R1 × b1)/(Gz × b2 + HT × b3), wherein b1, b2 and b3 are coefficient factors;
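The access value calculation can be sketched in Python as below. For simplicity the sketch uses raw values measured in hours in place of the normalization step, sets all coefficient factors to 1, and returns infinity when the denominator is zero; these choices and the function name are assumptions.

```python
from datetime import datetime, timedelta


def access_value(accesses, now, b1=1.0, b2=1.0, b3=1.0, window_days=10):
    """Compute FW = (R1 * b1) / (Gz * b2 + HT * b3) for one representative result.

    accesses: list of (visitor_id, access_time) pairs (assumed layout).
    Gz and HT are expressed in hours.
    """
    start = now - timedelta(days=window_days)
    times = sorted(t for _, t in accesses if start <= t <= now)
    if not times:
        return 0.0
    # R1: number of distinct visitors inside the window
    r1 = len({v for v, t in accesses if start <= t <= now})
    # Gz: mean of the intervals between adjacent access times
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]
    gz = sum(gaps) / len(gaps) if gaps else 0.0
    # HT: time elapsed since the latest access
    ht = (now - times[-1]).total_seconds() / 3600
    denominator = gz * b2 + ht * b3
    return (r1 * b1) / denominator if denominator else float("inf")


now = datetime(2024, 1, 11, 12, 0)
accesses = [
    ("u1", datetime(2024, 1, 11, 6, 0)),
    ("u2", datetime(2024, 1, 11, 8, 0)),
    ("u1", datetime(2024, 1, 11, 10, 0)),
    ("u3", datetime(2023, 12, 1, 0, 0)),  # outside the ten-day window
]
fw = access_value(accesses, now)
```

Two distinct visitors, a mean interval of two hours and a two-hour buffer duration give FW = 2 / (2 + 2) = 0.5 in this example.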
the browsing module is used for the user terminal to select the retrieval result for reference until the target data is found; feeding target data back to the information retrieval module;
when the user logs out, the evaluation module is used for evaluating the retrieval service of the learning resource by the user, and the evaluation rule is as follows: scoring the retrieval service, wherein the full score is 100; the specific working steps of the evaluation module are as follows:
marking the service score of the user as Qs, acquiring the number of representative retrieval results consulted before the user finds the target data, and marking the representative retrieval results as Cs;
calculating the time difference between the time when the user inputs the retrieval information and the time when the target data is fed back to obtain a retrieval time GT;
carrying out normalization processing on the service score, the number of representative retrieval results and the retrieval duration and taking their numerical values; calculating a retrieval satisfaction value QR of the user by using the formula QR = (Qs × r1)/(Cs × r2 + GT × r3), wherein r1, r2 and r3 are coefficient factors; the smaller Cs and GT are, the faster the user finds the target data, the higher the retrieval efficiency, and the higher the retrieval satisfaction value of the user;
the evaluation module is used for stamping the retrieval satisfaction value QR with a time stamp, storing it in the storage module, and transmitting the retrieval satisfaction value QR to the display module for real-time display.
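The evaluation steps can be sketched in Python as below. The inputs are assumed to be already normalized, the coefficient factors default to 1, and the record layout used for the time stamp is hypothetical.

```python
from datetime import datetime, timezone


def retrieval_satisfaction(qs, cs, gt, r1=1.0, r2=1.0, r3=1.0):
    """Compute QR = (Qs * r1) / (Cs * r2 + GT * r3) from normalized inputs."""
    denominator = cs * r2 + gt * r3
    if denominator <= 0:
        raise ValueError("Cs * r2 + GT * r3 must be positive")
    return (qs * r1) / denominator


def stamped_record(qr):
    """Attach a UTC time stamp to QR before it is stored (assumed record layout)."""
    return {"qr": qr, "timestamp": datetime.now(timezone.utc).isoformat()}


slow = retrieval_satisfaction(qs=0.9, cs=0.2, gt=0.1)
fast = retrieval_satisfaction(qs=0.9, cs=0.1, gt=0.1)
record = stamped_record(fast)
```

With the same service score, the user who consulted fewer representative retrieval results obtains the higher satisfaction value, which matches the relationship stated above.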
All of the above formulas are calculated on dimensionless numerical values; each formula was obtained by collecting a large amount of data and running software simulations to fit the real situation as closely as possible, and the preset parameters and preset thresholds in the formulas are set by those skilled in the art according to the actual situation or obtained by simulation on a large amount of data.
The working principle of the invention is as follows:
when the information retrieval and analysis system works, when a user inputs retrieval information through an information input module, the analysis module is used for tracking a login account of the user, performing statistical analysis on historical retrieval records of the user, obtaining a retrieval attraction value of a storage end according to the distribution of the storage ends of target data, dividing the retrieval information to obtain historical retrieval results corresponding to the retrieval information, obtaining the distribution proportion of the corresponding retrieval results in each storage end, obtaining a retrieval priority value of the storage end in the retrieval at this time by combining the retrieval attraction value and the storage end occupation ratio to generate a corresponding storage end sequence table, and after the information retrieval module receives the storage end sequence table, sequentially retrieving learning resources stored in the corresponding storage end by combining the retrieval information input by the current user;
the information receiving module is used for receiving the retrieval results of the information retrieval module, auditing and filtering the retrieval results, and pushing the corresponding retrieval results to the user terminal: it first extracts the original keywords of the learning resources corresponding to each retrieval result and performs data cleaning on the original keywords to obtain the learning keywords; it then stores the learning keywords in a specific data format as key information and establishes a key information coding table of the learning resources; it performs coverage rate analysis on the key information coding tables corresponding to any two retrieval results to obtain representative retrieval results; it then performs access value analysis on the representative retrieval results, sorts them by access value, and feeds the top W1 representative retrieval results back to the user terminal;
the browsing module is used for the user terminal to select the retrieval result for reference until the target data is found; feeding target data back to the information retrieval module; when the user logs out, the evaluation module is used for evaluating the retrieval service of the learning resources by the user, calculating the retrieval satisfaction value of the user by combining the service score of the user, the number of representative retrieval results consulted before the user finds the target data and the retrieval time length, and transmitting the retrieval satisfaction value to the display module for real-time display, so that the administrator can conveniently and visually know the retrieval satisfaction value.
In the description herein, references to the description of "one embodiment," "an example," "a specific example" or the like are intended to mean that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, the schematic representations of the terms used above do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
The preferred embodiments of the invention disclosed above are intended to be illustrative only. The preferred embodiments are not intended to be exhaustive or to limit the invention to the precise forms disclosed. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the invention and the practical application, to thereby enable others skilled in the art to best utilize the invention. The invention is limited only by the claims and their full scope and equivalents.