CN112487302B - File resource accurate pushing method based on user behaviors - Google Patents

Info

Publication number
CN112487302B
Authority
CN
China
Prior art keywords
user
archive
label
weight
file
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011219336.9A
Other languages
Chinese (zh)
Other versions
CN112487302A (en)
Inventor
王啸峰
颜庆国
朱进
陈健
王永梅
陈莉
吴建周
张颖
孙平
乔勇
胡文燕
史海文
刘成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
State Grid Jiangsu Electric Power Co Ltd
Original Assignee
State Grid Jiangsu Electric Power Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by State Grid Jiangsu Electric Power Co Ltd
Priority to CN202011219336.9A
Publication of CN112487302A
Application granted
Publication of CN112487302B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9536 Search customisation based on social or collaborative filtering

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides a method for accurately pushing archive resources based on user behavior. The method uses a server comprising a user behavior library and an archive library. The user behavior library holds user behavior information, which includes user labels and user types; the archive library holds archive data, which includes archive labels and archive types. The method proceeds as follows: obtain the user labels and the archive labels; obtain the weight of each user label and of each archive label; score the archives by weighting with the user-label and archive-label weights to obtain a score of archive j relative to user i; and calculate user similarity from the user-label and archive-label weights. The invention improves the efficiency with which archives are retrieved and used.

Description

File resource accurate pushing method based on user behaviors
Technical Field
The invention relates to pushing archive resources on the basis of user behavior, and in particular to a method for accurately pushing archive resources based on user behavior.
Background
The Semantic Web provides tools for the intelligent use of information resources: it makes information easier to discover, supports complex searches, and enables a new way of browsing the network. When users search online, they usually start from a few keywords, but their actual needs are often complex, and the relevant knowledge spans many aspects and angles. A user who enters "World Trade Organization" into a search engine may want to learn about China's accession to the World Trade Organization, yet may have to filter the results of an ordinary search many times and still come away empty-handed. This is because the computer knows nothing about the organization's structure, main functions, agreements, and purpose. Semantic information lets a program distinguish the elements of different web pages more easily, understand a fact such as "the detailed process of China's accession to the World Trade Organization", and combine the related elements. Semantic information not only makes retrieval more accurate but also allows complex processes to be handled automatically. Archive management systems hold huge volumes of data that are poorly utilized, so a method is needed that uses semantic analysis and collection to make accurate, optimized recommendations to clients.
In the prior art, natural language processing (NLP) is not integrated with professional archive data. Current natural language recognition relies on NLP techniques and generally performs matching after training on an existing database. Related techniques such as archive-based text entity extraction, text classification, key-phrase extraction, short-text matching, relation extraction, intelligent voice interaction, character recognition, and text similarity algorithms cannot fully achieve accurate identification and pushing between archives and users, nor can they combine professional archive data and information to accurately push the archives associated with a target archive to users.
Disclosure of Invention
In order to solve the above problems, the present invention provides a method for accurately pushing archive resources based on user behavior. The method uses a server comprising a user behavior library and an archive library; the user behavior library holds user behavior information, which includes user labels and user types, and the archive library holds archive data, which includes archive labels and archive types. The specific operation steps are as follows:
Step S100: obtain the user labels $U = [UX_1, UX_2, \dots, UX_m]$ and the archive labels $D = [DX_1, DX_2, \dots, DX_n]$, where $UX_i$ is the i-th user label of the user and $DX_j$ is the j-th archive label of the archive;
Step S200: obtain the weight $Uw(UX_i)$ of user label $UX_i$ and the weight $Dw(DX_j)$ of archive label $DX_j$;
Step S300: score the archives by weighting with the user-label weights $Uw(UX_i)$ and the archive-label weights $Dw(DX_j)$ to obtain the score of archive j relative to user i, and recommend the high-scoring archives to the user;
Step S400: calculate user similarity from the user-label weights $Uw(UX_i)$ and the archive-label weights $Dw(DX_j)$, and identify the users similar to the user.
The weight $Uw(UX_i)$ of user label $UX_i$ is obtained as follows:
obtain the term frequency TF and the inverse document frequency IDF of user label $UX_i$ in the user behavior library, where the term frequency TF is:
[formula image GDA0003730676410000021, not reproduced in this text]
where $n_i$ is the number of times the word appears throughout and $\sum_p n_{p,i}$ is the sum of the numbers of occurrences of all words;
the inverse document frequency IDF is:
[formula image GDA0003730676410000031, not reproduced in this text]
where $N$ is the total number of users in the current user type, $N'$ is the total number of profiles in other user types, $d_i$ is the total number of users having the label in the current user type, and $d_i'$ is the total number of users having the label in other user types.
The weight of user label $UX_i$ is $Uw(UX_i) = tf_i \times idf_i$.
In step S200, the weight $Dw(DX_j)$ of archive label $DX_j$ is calculated as follows:
obtain the term frequency TF and the inverse document frequency IDF of archive label $DX_j$ in the archive library, where the term frequency TF is:
[formula image GDA0003730676410000032, not reproduced in this text]
where $t_y$ is the number of times the word appears in the archive title, $p_y$ is the number of times the word appears in the first paragraph of the archive, $n_y$ is the number of times the word appears in the archive text, and $\sum_k n_{k,y}$ is the sum of the numbers of occurrences of all words in the archive;
the inverse document frequency IDF is:
[formula image GDA0003730676410000033, not reproduced in this text]
where $N$ is the total number of archives of the current archive type, $N'$ is the total number of archives of other archive types, $d_j$ is the total number of archives of the current archive type that contain the word, and $d_j'$ is the total number of archives of other archive types that contain the word.
The weight of archive label $DX_j$ is $Dw(DX_j) = tf_j \times idf_j$.
In step S300, the score $F$ of archive j relative to user i is
[formula image GDA0003730676410000034, not reproduced in this text]
In step S400, the similarity is
[formula image GDA0003730676410000041, not reproduced in this text]
The invention has the following beneficial effects:
A corresponding database is built from the user's actions in the archive library, such as retrieval and acquisition, and is used for autonomous learning and training. The archive types that the user prefers are grouped and matched on the basis of natural-language semantic recognition, so that archives are pushed to the user accurately. This improves the speed and experience of querying archive data and better helps users obtain the archives relevant to their preferences.
Detailed Description
The technical solutions of the present application are clearly and completely described below in conjunction with the embodiments of the present application, and it is obvious that the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
The invention is applied in a server that pushes archives based on user behavior. The server comprises a user behavior library and an archive library; the user behavior library holds user behavior information, which includes user labels and user types, and the archive library holds archive data, which includes archive labels and archive types;
the server further comprises a processor and a non-transitory computer-readable storage medium storing a computer program which, when executed by the processor, implements the following method for accurately pushing archive resources based on user behavior:
Step S100: obtain the user labels $U = [UX_1, UX_2, \dots, UX_m]$ and the archive labels $D = [DX_1, DX_2, \dots, DX_n]$, where $UX_i$ is the i-th user label of the user and $DX_j$ is the j-th archive label of the archive;
Step S200: obtain the weight $Uw(UX_i)$ of user label $UX_i$ and the weight $Dw(DX_j)$ of archive label $DX_j$;
Step S300: score the archives by weighting with the user-label weights $Uw(UX_i)$ and the archive-label weights $Dw(DX_j)$, and recommend the high-scoring archives to the user;
Step S400: calculate user similarity from the user labels, and identify the users similar to the user.
In some embodiments, the weight $Uw(UX_i)$ of user label $UX_i$ is calculated as follows:
obtain the term frequency TF and the inverse document frequency IDF of user label $UX_i$ in the user behavior library, where the term frequency TF is:
[formula image GDA0003730676410000051, not reproduced in this text]
where $n_i$ is the number of times the word appears throughout and $\sum_p n_{p,i}$ is the sum of the numbers of occurrences of all words;
the inverse document frequency IDF is:
[formula image GDA0003730676410000052, not reproduced in this text]
where $N$ is the total number of users in the current user type, $N'$ is the total number of profiles in other user types, $d_i$ is the total number of users having the label in the current user type, and $d_i'$ is the total number of users having the label in other user types.
The weight of user label $UX_i$ is $Uw(UX_i) = tf_i \times idf_i$.
Example 1:
The user system has two user types, archivist and ordinary user. User A is of type archivist, and the occurrence counts of user A's labels are: [Jiangsu, file, 1, document, 1, drawing, 1].
The numbers of users, and of users having the "Jiangsu" user label, are shown in Table 1:
TABLE 1
[table images GDA0003730676410000053 and GDA0003730676410000061, not reproduced in this text]
The weights of the two user labels "Jiangsu" and "Archives Bureau" are calculated from the formulas as follows.
For "Jiangsu":
[formula images GDA0003730676410000062 and GDA0003730676410000063, not reproduced in this text]
The weight of the "Jiangsu" user label is $uw_{\text{Jiangsu}} = TF \times IDF = 9.5228787 \times 0.6197887 = 5.9021726$;
the weight of the "Archives Bureau" label, calculated in the same way, is $uw_{\text{Archives Bureau}} = 5.7155976$.
It can be seen that the "Jiangsu" label is more important to user A than the "Archives Bureau" label.
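As an illustrative check of Example 1, the following Python sketch multiplies the TF and IDF values quoted above and ranks the two labels by weight. The function name `label_weight` and the dictionary layout are assumptions introduced for illustration; the TF and IDF values are taken as given, because the formulas that produce them are not reproduced in this text.

```python
def label_weight(tf, idf):
    """Weight of a label as the product of its TF and IDF terms."""
    return tf * idf


# Values copied from Example 1. Only the final weight is stated for the
# "Archives Bureau" label, so it is entered directly.
weights = {
    "Jiangsu": label_weight(9.5228787, 0.6197887),
    "Archives Bureau": 5.7155976,
}

# Rank the labels from most to least important for user A.
ranked = sorted(weights, key=weights.get, reverse=True)
print(ranked)
```

The ranking step simply orders the labels by weight, which is how the comparison at the end of Example 1 is read.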
Example 2:
In other embodiments, the weights are set automatically by the algorithm and change as the user reads more archives or changes preferences. Each user has a behavior library that stores the user's archive types, keywords, and archive labels. The weights are calculated from this behavior library; after multiple searches the behavior library changes, and the weights change with it. A separate filter library is used at recommendation time to remove archives that the user dislikes or has already read.
As shown in Table 2, when a new user performs the following operations, the behavior library changes accordingly, and so do the label weights:
TABLE 2
[table images GDA0003730676410000064 and GDA0003730676410000071, not reproduced in this text]
In some embodiments, the weight $Dw(DX_j)$ of archive label $DX_j$ is calculated as follows:
obtain the term frequency TF and the inverse document frequency IDF of archive label $DX_j$ in the archive library, where the term frequency TF is:
[formula image GDA0003730676410000072, not reproduced in this text]
where $t_y$ is the number of times the word appears in the archive title, $p_y$ is the number of times the word appears in the first paragraph of the archive, $n_y$ is the number of times the word appears in the archive text, and $\sum_k n_{k,y}$ is the sum of the numbers of occurrences of all words in the archive;
the inverse document frequency IDF is:
[formula image GDA0003730676410000073, not reproduced in this text]
where $N$ is the total number of archives of the current archive type, $N'$ is the total number of archives of other archive types, $d_j$ is the total number of archives of the current archive type that contain the word, and $d_j'$ is the total number of archives of other archive types that contain the word.
The weight of archive label $DX_j$ is $Dw(DX_j) = tf_j \times idf_j$.
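The counts that enter the archive term frequency can be gathered as in the following Python sketch. It only collects the title count, first-paragraph count, text count, and total word count for one word; it does not compute TF itself, because the formula that combines these counts is not reproduced in this text. The function names, the regular-expression tokenizer, and the choice to take the total over the main text are assumptions made for illustration; Chinese text would need a word segmenter instead.

```python
import re
from collections import Counter


def tokenize(text):
    # Assumption: lowercase word tokens split on non-word characters.
    return re.findall(r"\w+", text.lower())


def position_counts(term, title, first_paragraph, body):
    """Count one term in the title, the first paragraph, and the main text."""
    term = term.lower()
    body_tokens = tokenize(body)
    return {
        "t": Counter(tokenize(title))[term],            # occurrences in the title
        "p": Counter(tokenize(first_paragraph))[term],  # occurrences in the first paragraph
        "n": Counter(body_tokens)[term],                # occurrences in the main text
        "total": len(body_tokens),                      # all word occurrences in the main text
    }


# Hypothetical archive used only to exercise the function.
title = "Research work plan"
first_paragraph = "This research covers archive work."
body = "This research covers archive work. Further research is planned."

counts = position_counts("research", title, first_paragraph, body)
print(counts)
```

Keeping the three positions separate leaves room for a formula that treats title and first-paragraph occurrences differently from occurrences in the main text.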
Example 3:
Suppose the archive types are document archives and accounting archives; archive A is of type document archive, and its word distribution is shown in Table 3:
TABLE 3
[table image GDA0003730676410000074, not reproduced in this text]
The document archive type contains 50 archives in total, of which 10 contain "research" and 30 contain "work"; the accounting archive type contains 60 archives in total, of which 5 contain "research" and 20 contain "work".
The weights of the archive labels "research" and "work" are calculated as follows.
For "research":
[formula images GDA0003730676410000081 and GDA0003730676410000082, not reproduced in this text]
The weight of "research" is $dw_{\text{research}} = TF \times IDF = 16.8628661$;
the weight of "work", calculated in the same way, is $dw_{\text{work}} = 6.6285913$.
These results show directly that the "research" label is more important to archive A than the "work" label.
In some embodiments, once the weights of the archive labels and of the user behavior labels have been obtained, the archives can be scored by weighting and the high-scoring archives recommended to the user. The specific steps are as follows:
the score $F$ of archive j relative to user i is
$$F = \sum_{k} Uw(UX_k) \times Dw(DX_k)$$
where the sum runs over the labels $k$ that user i and archive j have in common (formula image GDA0003730676410000083).
Example 4:
Suppose the label weights of the first archive are [Jiangsu 0.5, Nanjing 0.1, history 0.9];
the label weights of the second archive are [Jiangsu 0.9, Nanjing 0.1, history 0.1];
the user label weights are [Jiangsu 0.11, Nanjing 0.12, history 0.2].
The scores of the first archive and the second archive are:
$F_1 = 0.5 \times 0.11 + 0.1 \times 0.12 + 0.9 \times 0.2 = 0.247$;
$F_2 = 0.9 \times 0.11 + 0.1 \times 0.12 + 0.1 \times 0.2 = 0.131$.
It can be seen that the first archive scores higher because the user leans more toward history-related archives.
After the calculation, the archives are sorted to obtain the high-scoring ones, archives that the user dislikes or has already read are removed, and archives obtained through collaborative filtering are added to form the final archive list, which is recommended to the user.
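The scoring and list-building steps of Example 4 can be sketched in Python as follows. This is a minimal illustration, assuming that the score is the sum of products of matching label weights, as the arithmetic of Example 4 indicates; the function names, the `excluded` set standing in for the filter library, the `top_n` cutoff, and the identifier `archive_3` for a collaboratively filtered archive are all hypothetical.

```python
def score(user_weights, archive_weights):
    """Sum of user weight x archive weight over the labels both sides share."""
    shared = user_weights.keys() & archive_weights.keys()
    return sum(user_weights[k] * archive_weights[k] for k in shared)


def build_recommendations(user_weights, archives, excluded, collaborative, top_n=10):
    """Rank archives by score, drop excluded ones, then append collaborative picks."""
    by_score = sorted(archives, key=lambda a: score(user_weights, archives[a]), reverse=True)
    ranked = [a for a in by_score if a not in excluded][:top_n]
    extras = [a for a in collaborative if a not in excluded and a not in ranked]
    return ranked + extras


# Label weights from Example 4.
user = {"Jiangsu": 0.11, "Nanjing": 0.12, "history": 0.2}
archives = {
    "archive_1": {"Jiangsu": 0.5, "Nanjing": 0.1, "history": 0.9},
    "archive_2": {"Jiangsu": 0.9, "Nanjing": 0.1, "history": 0.1},
}

f1 = score(user, archives["archive_1"])
f2 = score(user, archives["archive_2"])

# Hypothetical case: the user has already read archive_2, and collaborative
# filtering contributes archive_3.
final_list = build_recommendations(user, archives, excluded={"archive_2"},
                                   collaborative=["archive_3"])
print(f1 > f2, final_list)
```

Filtering after sorting keeps the ranking logic independent of the filter library, so the same scores can be reused when the set of disliked or already-read archives changes.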
In other embodiments, a collaborative filtering recommendation algorithm is also included. Collaborative filtering uses a similarity algorithm to calculate the similarity between users and recommends the archives preferred by similar users. The process is as follows: when the current user opens an archive, the system finds the other users who have opened that archive, calculates the similarity between the current user and each of them, selects the one or more users with the highest similarity, and recommends their favorite archives to the current user.
The similarity is calculated as cosine similarity, using the formula:
$$\mathrm{sim}(x, y) = \frac{\sum_k x_k\, y_k}{\sqrt{\sum_k x_k^2}\,\sqrt{\sum_k y_k^2}}$$
where $x_k$ and $y_k$ are the weights of the k-th user label for users x and y (formula image GDA0003730676410000091).
Example 5:
Suppose there are three users x, y, and z, and the similarity between user x and each of the other two users is to be calculated.
The user label weights of x are [Jiangsu 0.147, Archives Bureau 0.095, document archive 0.1];
those of y are [Jiangsu 0.177, Archives Bureau 0.105, document archive 0.155];
those of z are [Jiangsu 0.09, Archives Bureau 0.175, document archive 0.032].
Similarity of x and y:
[formula images GDA0003730676410000092 and GDA0003730676410000093, not reproduced in this text]
Similarity of x and z:
[formula image GDA0003730676410000094, not reproduced in this text]
The similarity of x and y is the higher of the two.
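The comparison in Example 5 can be reproduced with a short Python sketch of cosine similarity over label-weight vectors. The function name, the dictionary representation, the handling of labels missing from one user (treated as weight 0), and the zero-vector guard are assumptions added for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two label-weight dictionaries."""
    labels = a.keys() | b.keys()
    dot = sum(a.get(k, 0.0) * b.get(k, 0.0) for k in labels)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # assumption: a user with no weighted labels matches nobody
    return dot / (norm_a * norm_b)


# User label weights from Example 5.
x = {"Jiangsu": 0.147, "Archives Bureau": 0.095, "document archive": 0.1}
y = {"Jiangsu": 0.177, "Archives Bureau": 0.105, "document archive": 0.155}
z = {"Jiangsu": 0.09, "Archives Bureau": 0.175, "document archive": 0.032}

sim_xy = cosine_similarity(x, y)
sim_xz = cosine_similarity(x, z)

# The user most similar to x is the one whose favorite archives would be recommended.
similarities = {"y": sim_xy, "z": sim_xz}
most_similar = max(similarities, key=similarities.get)
print(round(sim_xy, 4), round(sim_xz, 4), most_similar)
```

Because cosine similarity depends only on the direction of the weight vectors, two users with the same relative preferences are judged similar even if one of them has been far more active.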
With this implementation, user status, user demand, and management can be monitored in real time and perceived comprehensively and faithfully. Using big-data analysis, the method derives deep insight into user needs from massive data, fuses, analyzes, and processes the perception data, integrates with the business system to respond proactively, and can make accurate recommendations to different types of clients and to clients with similar characteristics.
While the preferred embodiments of the present application have been described, additional variations and modifications of those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. Therefore, it is intended that the appended claims be interpreted as including preferred embodiments and all such alterations and modifications as fall within the scope of the invention. It will be apparent to those skilled in the art that various changes and modifications can be made in the present application without departing from the spirit and scope of the application. Thus, if such modifications and variations of the present application fall within the scope of the claims of the present application and their equivalents, the present application is also intended to include such modifications and variations.

Claims (3)

1. A method for accurately pushing archive resources based on user behavior, using a server, wherein the server comprises a user behavior library and an archive library, the user behavior library comprises user labels and user types, and the archive library comprises archive labels and archive types; the specific operation steps are as follows:
Step S100: obtain the user labels $U = [UX_1, UX_2, \dots, UX_m]$ of the user and the archive labels $D = [DX_1, DX_2, \dots, DX_n]$, where $UX_i$ is the i-th user label of the user and $DX_j$ is the j-th archive label of the archive;
Step S200: obtain the weight $Uw(UX_i)$ of user label $UX_i$ and the weight $Dw(DX_j)$ of archive label $DX_j$;
Step S300: score the archives by weighting with the user-label weights $Uw(UX_i)$ and the archive-label weights $Dw(DX_j)$ to obtain the score of archive j relative to user i, and recommend the high-scoring archives to the user;
Step S400: calculate user similarity from the user-label weights $Uw(UX_i)$ and the archive-label weights $Dw(DX_j)$, and identify the users similar to the user;
wherein, in step S200, the weight $Uw(UX_i)$ of user label $UX_i$ is obtained as follows:
obtain the term frequency TF and the inverse document frequency IDF of user label $UX_i$ in the user behavior library, where the term frequency TF is:
[formula image FDA0003730676400000011, not reproduced in this text]
where $n_i$ is the number of times the word appears throughout and $\sum_m n_{m,i}$ is the sum of the numbers of occurrences of all words;
the inverse document frequency IDF is:
[formula image FDA0003730676400000012, not reproduced in this text]
where $N$ is the total number of users in the current user type, $N'$ is the total number of profiles in other user types, $d_i$ is the total number of users having the label in the current user type, and $d_i'$ is the total number of users having the label in other user types;
the weight of user label $UX_i$ is $Uw(UX_i) = tf_i \times idf_i$;
wherein, in step S200, the weight $Dw(DX_j)$ of archive label $DX_j$ is calculated as follows:
obtain the term frequency TF and the inverse document frequency IDF of archive label $DX_j$ in the archive library, where the term frequency TF is:
[formula image FDA0003730676400000021, not reproduced in this text]
where $t_j$ is the number of times the word appears in the archive title, $p_j$ is the number of times the word appears in the first paragraph of the archive, $n_j$ is the number of times the word appears in the archive text, and $\sum_n n_{n,j}$ is the sum of the numbers of occurrences of all words in the archive;
the inverse document frequency IDF is:
[formula image FDA0003730676400000022, not reproduced in this text]
where $N$ is the total number of archives of the current archive type, $N'$ is the total number of archives of other archive types, $d_j$ is the total number of archives of the current archive type that contain the word, and $d_j'$ is the total number of archives of other archive types that contain the word;
the weight of archive label $DX_j$ is $Dw(DX_j) = tf_j \times idf_j$.
2. The method for accurately pushing archive resources based on user behavior as claimed in claim 1, wherein in step S300, the score $F$ of archive j relative to user i is
[formula image FDA0003730676400000023, not reproduced in this text]
3. The method for accurately pushing archive resources based on user behavior as claimed in claim 1, wherein in step S400, the similarity is
[formula image FDA0003730676400000024, not reproduced in this text]
CN202011219336.9A 2020-11-04 2020-11-04 File resource accurate pushing method based on user behaviors Active CN112487302B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011219336.9A CN112487302B (en) 2020-11-04 2020-11-04 File resource accurate pushing method based on user behaviors

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011219336.9A CN112487302B (en) 2020-11-04 2020-11-04 File resource accurate pushing method based on user behaviors

Publications (2)

Publication Number Publication Date
CN112487302A (en) 2021-03-12
CN112487302B (en) 2022-11-11

Family

ID=74928129

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011219336.9A Active CN112487302B (en) 2020-11-04 2020-11-04 File resource accurate pushing method based on user behaviors

Country Status (1)

Country Link
CN (1) CN112487302B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106776695A (en) * 2016-11-11 2017-05-31 上海中信信息发展股份有限公司 The method for realizing the automatic identification of secretarial document value
CN107451168A (en) * 2016-05-30 2017-12-08 中华电信股份有限公司 File Classification System and Method Based on Vocabulary Statistics

Also Published As

Publication number Publication date
CN112487302A (en) 2021-03-12

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant