CN106682129B - Hierarchical concept vectorization increment processing method in personal big data management - Google Patents

Hierarchical concept vectorization increment processing method in personal big data management

Info

Publication number
CN106682129B
Authority
CN
China
Prior art keywords
concept
vector
node
concepts
tree
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611154347.7A
Other languages
Chinese (zh)
Other versions
CN106682129A (en)
Inventor
杨良怀
汪庆顺
庄慧
范玉雷
龚卫华
方文菲
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University of Technology ZJUT
Original Assignee
Zhejiang University of Technology ZJUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University of Technology ZJUT filed Critical Zhejiang University of Technology ZJUT
Priority to CN201611154347.7A priority Critical patent/CN106682129B/en
Publication of CN106682129A publication Critical patent/CN106682129A/en
Application granted granted Critical
Publication of CN106682129B publication Critical patent/CN106682129B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G06F16/334 - Query execution
    • G06F16/3347 - Query execution using vector based model

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computational Linguistics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The hierarchical concept vectorization incremental processing method in personal big data management comprises the following steps: 1) when the system first runs, vectorize all concepts and perform the concept vector merging operation on all branch nodes; 2) execute the following whenever the user operates on the concept tree: 2.1) obtain the concept vector and total word count of the operated node and of its parent node; 2.2) modify the concept vector of the parent node according to a formula; 2.3) take the parent node as the operated node and execute recursively from 2.1) up to the root node; 2.4) update the inverse document frequency vector; 3) when the accumulated error reaches a certain extent: 3.1) acquire the current inverse document frequency vector and the inverse document frequency initial value vector; 3.2) update all vector weights in the vector space in batch; 3.3) update the inverse document frequency initial value vector. The invention realizes a hierarchical concept vectorization incremental calculation method in personal big data management, can quickly adjust concept vectors in the concept space, and improves execution efficiency.

Description

Hierarchical concept vectorization increment processing method in personal big data management
Technical Field
The invention relates to the management, organization, query and retrieval technology of personal big data, in particular to a hierarchical concept vectorization method based on a vector space model and an incremental calculation method thereof.
Background
With the development of information technology, personal data is growing explosively, including personal documents (text, images, voice), mails, health data, personal mobile phone contact information (WeChat, QQ), Internet data and the like; we have entered the age of personal big data. The spread of wearable devices will further accelerate this growth, as people can record what they smell and see and collect physiological health data all day long. How to manage and organize personal big data so that accurate, appropriate, complete and high-quality information can always be obtained in the right place through simple operations is the goal of a personal information management system. However, even for the electronic documents piled up by individuals, current approaches fall short of this goal. For example, as time passes, people's memory of previously stored information gradually becomes fuzzy, while existing retrieval tools rely on keyword matching; they cannot fully exploit the fuzzy and associative query clues in the user's mind, and retrieval efficiency is therefore often low. In addition, information retrieval based on exact matching can hardly help the user discover potentially relevant information.
In the face of mass data, data abstraction helps to grasp and understand the data. The personal big data management system provided by the invention organizes data effectively with a concept space, where a concept refers to a set formed by information resources that are similar or related to one another; such a set can represent a certain class of things, affairs, tasks and the like. The user can establish a series of concepts according to work needs, personal preferences, personal habits and so on; the concepts are associated with each other and with their respective data sources, so that a whole semantic association network is formed and efficient management of personal information is realized. Associations between concepts may be superordinate-subordinate (upper-lower) relations, identity, proper-inclusion relations in both directions, cross relations, aggregation and so on; among these, the upper-lower relation is the one most used for information management. The concept space is composed of the concepts and the semantic relation network that takes the concepts as nodes. In practice, each concept is taken as a node and a multi-level tree structure is organized according to the upper-lower relations among the concepts, which is referred to as a concept tree in the invention and is easy for users to accept and use. How to fully utilize the semantics contained in the concepts to improve query quality is an issue worth considering.
For unstructured text data, document vectorization is a technique that can exploit the semantic information contained in documents and is a basic technique for addressing the above problems. In document vectorization, a document is regarded as a set of feature items (words), processing of document content is reduced to vector operations in a vector space, the semantic similarity of texts is expressed by their similarity in the vector space, and semantically related documents can be provided to users, which broadens information retrieval. The semantic similarity can also serve as a clue for further retrieval, guiding the user and deepening information retrieval. The document vectorization method can be extended to a concept space: a concept can be treated analogously to a document, and concept vectorization can then be carried out.
In general, the concept vectorization process is computationally expensive because a concept usually has thousands of feature items. If the traditional document vectorization method is used for concept vectorization, then whenever the number of concepts changes, for example when a new concept is added or an old concept is deleted, all existing concept vectors deviate; and if the vector space is reconstructed from scratch, the amount of computation is usually large.
In addition, most conventional document vectorization techniques are based on a single-level document classification structure and are not suitable for direct application to a concept tree. In a concept tree, when the concept corresponding to a branch node is vectorized, in order to express the semantic information of the concept more completely, the concepts corresponding to its lower-level nodes should be fused in addition to calculating the concept corresponding to the node itself; concretely, the concept vector of the branch node is merged with the concept vectors of its child nodes.
The invention aims to solve the problem of efficient concept vectorization in personal big data management, and develops a concept vectorization method based on a hierarchical concept structure on top of the vector space model. To address the deviation of the vector space caused by changes of the concept tree structure, a vector increment calculation method is introduced to adjust the vector space efficiently, and the errors generated during incremental calculation are accumulated and then repaired.
Disclosure of Invention
To overcome the defects that the existing document vectorization technology is not suitable for a concept tree structure and that reconstructing the vector space after the concept tree structure changes requires a huge amount of computation, the invention provides a concept vectorization method for massive personal big data based on a hierarchical concept structure.
The hierarchical concept vectorization incremental processing method in personal big data management is applied to the concept space layer of a personal big data management model. The invention can be divided into a vector space initialization stage and a vector increment calculation stage, and the vector space initialization stage can be further subdivided into a preprocessing stage and a concept vector merging stage. In the preprocessing stage, the concept of each node in the concept tree is vectorized and expressed as a concept vector, and the total word count of each node and the inverse document frequency of each feature item are recorded; the weight of each feature item is calculated with the tf-idf method during vectorization, and the total word count of a node refers to the total number of words contained in the concept corresponding to that node. It should be noted that a concept may contain multiple documents, and all documents within the same concept are computed as a whole. In the preprocessing stage, only the concept corresponding to the node itself is calculated for a branch node. The concept vector merging phase comprises running the following steps on a computer:
1) taking a root node of the concept tree as a target node;
2) for the target node, obtain all of its m child nodes C1, C2, …, Cm;
3) obtain the concept vectors VC1, VC2, …, VCm corresponding to C1, C2, …, Cm, and the concept vector V corresponding to the target node;
(3.1) if a child node Ci is a branch node whose corresponding concept vector has not yet been merged, take Ci as the target node and execute from step 2) to merge its concept vector first;
4) calculate the sum L of the total word counts of the target node and all of its child nodes, and create a new concept vector Vnew in the vector space;
5) assume there are n different feature terms T1, T2, …, Tn in the vector space; for a given concept vector V, the weight corresponding to feature term Ti is denoted V.Wi, the total word count of V is denoted LV, and the total word count of VCi is denoted LCi; calculate Vnew.Wi = (V.Wi*LV + VC1.Wi*LC1 + VC2.Wi*LC2 + … + VCm.Wi*LCm)/L, where i = 1, 2, …, n (a small numeric illustration of this formula is given after these steps);
6) change the concept vector corresponding to the target node to Vnew and change its total word count to L.
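By way of an illustrative numeric example (the values are hypothetical, not taken from the invention): suppose the target node has LV = 100 and V.Wi = 0.30 for some feature term Ti, and it has two children with LC1 = 50, VC1.Wi = 0.10 and LC2 = 50, VC2.Wi = 0.50. Then L = 200 and Vnew.Wi = (0.30*100 + 0.10*50 + 0.50*50)/200 = (30 + 5 + 25)/200 = 0.30; the merged weight is simply the word-count-weighted average of the weights of the target node and its children.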
The vector increment calculation stage can be divided into an incremental calculation process and an error completion process. The incremental calculation process is performed immediately after each update operation that the user performs on the concept tree. Update operations on the concept tree comprise adding, deleting or moving concept nodes, wherein moving a concept node is regarded as a two-step operation of deleting and then adding. For adding or deleting a node, the following steps are executed on the computer:
A1. take the node Nc to be added or deleted as the target node;
A2. find the parent node Np of the target node; if Np does not exist, end the incremental calculation process;
A3. obtain the concept vector Vc and total word count Lc corresponding to Nc, and the concept vector Vp and total word count Lp corresponding to Np;
A4. suppose the vector space contains n different feature items in total, denoted T1, T2, …, Tn, with corresponding weight components W1, W2, …, Wn; perform the following operations on Vp:
(A4.1) if it is an add-node operation, Vp.Wi = (Lp*Vp.Wi + Lc*Vc.Wi)/(Lp + Lc), i = 1, 2, …, n, and the total word count of Np is changed to (Lp + Lc);
(A4.2) if it is a delete-node operation, Vp.Wi = (Lp*Vp.Wi - Lc*Vc.Wi)/(Lp - Lc), i = 1, 2, …, n, and the total word count of Np is changed to (Lp - Lc);
A5. take Np as the target node and execute from step A2 (a short check of these two formulas is given after these steps).
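For intuition (this is an observation, not an additional claimed step): applying (A4.1) and then (A4.2) for the same node shows that a delete exactly inverts an add at a given parent. After the add, Vp'.Wi = (Lp*Vp.Wi + Lc*Vc.Wi)/(Lp + Lc) and Lp' = Lp + Lc; the delete then yields (Lp'*Vp'.Wi - Lc*Vc.Wi)/(Lp' - Lc) = (Lp*Vp.Wi + Lc*Vc.Wi - Lc*Vc.Wi)/Lp = Vp.Wi, restoring the parent vector and word count exactly. The residual error that accumulates during incremental calculation therefore stems from the inverse document frequencies, which are not recomputed here and are repaired by the error completion process described below.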
Further, the error completion process can be subdivided into an inverse document frequency error accumulation vector updating part and a feature item weight batch updating part. Note that the inverse document frequency follows the convention of the conventional tf-idf algorithm; in the present invention, the inverse document frequency of a feature item is calculated from the total number of concepts and the number of concepts containing that feature item. Here, a "concept" corresponds to a "document" in the conventional tf-idf algorithm.
There are several global values in the whole concept space, including the inverse document frequency vector Vidf and the inverse document frequency initial value vector Vini. Suppose there are n different feature items in the vector space, denoted T1, T2, …, Tn. Given a concept vector V, the weight corresponding to feature item Ti is denoted V.Wi; for feature item Ti, the total number of concepts containing Ti is denoted Ti.F. The inverse document frequency error accumulation vector updating part is executed immediately after each incremental calculation process ends, and comprises the following steps executed on a computer:
D1. obtain the total number of concepts A in the current concept tree, the inverse document frequency vector Vidf and the inverse document frequency initial value vector Vini;
D2. perform the following operations on Vidf and Vini:
(D2.1) if Vidf.Wi == 0, then Vini.Wi = log((A/(Ti.F + 0.01)) + 0.01), i = 1, 2, …, n;
(D2.2) Vidf.Wi = log((A/(Ti.F + 0.01)) + 0.01), i = 1, 2, …, n.
The feature item weight batch updating part is executed after several incremental calculation processes; it does not need to run immediately after a particular incremental calculation finishes, and its frequency can be adjusted as required. The execution process comprises the following steps executed on a computer:
E1. obtain the current inverse document frequency vector Vidf and the inverse document frequency initial value vector Vini;
E2. for each node N in the concept tree, perform the following operation on its corresponding concept vector V:
V.Wi = V.Wi * Vidf.Wi / Vini.Wi, i = 1, 2, …, n;
E3. Vini.Wi = Vidf.Wi, i = 1, 2, …, n (the rationale for this rescaling is sketched below).
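The rationale (an observation under the stated tf-idf convention, not an additional claimed step): a stored weight has the form V.Wi = tf_i * Vini.Wi, where the term frequency component tf_i is not affected by structural changes of the concept tree; multiplying by Vidf.Wi / Vini.Wi therefore yields tf_i * Vidf.Wi, i.e. the weight re-expressed with the current inverse document frequency, and step E3 resets Vini to Vidf so that later repairs start from this new baseline.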
further, the personal big data management model is used for completing a series of functions of organization, storage, management, processing and the like of personal information. The personal big data management model comprises a resource layer, a concept space layer and an application layer:
F1) the resource layer includes a large amount of personal data stored in the DBMS, file system, and other systems. Wherein the personal information in the file system includes textual data and non-textual data. The text data comprises data such as email, pdf files, office files and html files, and the non-text data comprises data such as video, audio and pictures;
F2) in the concept space layer, a concept refers to a set of information resources that are similar or related to one another; the layer uses concepts to uniformly identify data of different types and formats and establishes associations between them, which facilitates the abstraction and management of information resources by users;
F3) the application layer is responsible for interacting with a user and providing applications including navigation technology, visualization technology, editing tools and the like.
The concept space layer organizes personal information in a concept tree manner. The concept tree is formed by semantic associations between concepts. Thus, the concept tree satisfies the following condition:
G1) the hierarchical relation of all concepts forms a tree structure, the nodes in the tree represent the concepts, and the edges represent the upper and lower relations among the concepts;
G2) the root node is used as a concept complete set identifier, the branch node is a concept with lower child nodes, and the leaf node is a concept without child nodes;
G3) each branch node has no less than one child node.
Still further, the whole stage takes a vector space model as a support. The vector space model comprises four parts of constructing concept vectors, storing the concept vectors, maintaining the concept vectors and calculating the similarity:
H1) the constructed concept vector is a vector formed by representing concepts into feature items and feature weights according to information resource sets contained in the concepts;
H2) the concept vector storage is to store the related information of the concept vector obtained in the concept vector construction process into a database;
H3) the maintenance concept vector is used for reflecting the changes to the concept vector of the related concept after the concept tree structure is changed and accumulated for a certain number of times;
H4) the similarity calculation is to calculate the similarity between the selected concept and other concepts according to the concept vector of the selected concept and other concepts.
Compared with the prior art, the invention extends the document vectorization method of the existing single-level classification structure, realizes concept vectorization for a hierarchical concept structure, and provides the basis for similarity calculation between concepts. For massive personal big data, a vector increment calculation method is provided that can efficiently adjust the vector space according to changes of the concept tree structure; the errors generated in the incremental calculation process are accumulated, and after they reach a certain degree the vector space is updated in batches to repair them. This not only ensures the vectorization effect but also greatly reduces the amount of calculation.
The invention has the advantages that: the concept vector in the concept space in the massive big data processing can be adjusted rapidly, and the execution efficiency is improved.
Drawings
FIG. 1 is a schematic diagram of the personal big data management model and vector space model of the present invention.
FIG. 2 is a schematic diagram of feature vectors in the vector space model of the present invention.
FIG. 3 is a general flow diagram of the method of the present invention.
Fig. 4 is a flow chart of the concept vector merging phase in the present invention.
FIG. 5 is a flow chart of the incremental computation process of the present invention.
FIG. 6 is a flowchart of the updating portion of the inverse document frequency error accumulation vector in the present invention.
Detailed Description
The present invention will be described in further detail with reference to the accompanying drawings.
Referring to fig. 1, a concept vectorization method based on a hierarchical concept structure is applied to a concept space layer of a personal big data management model. The personal big data management model is used for finishing a series of functions of organization, storage, management, processing and the like of personal information, and comprises a resource layer, a concept space layer and an application layer:
F1. the resource layer includes personal information stored in the DBMS, file system, and other systems. The personal information in the file system comprises text data and non-text data, wherein the text data comprises data such as email, pdf files, office files and html files, and the non-text data comprises data such as video, audio and pictures;
F2. in the concept space layer, a concept refers to a set of information resources that are similar or related to one another; the layer uses concepts to uniformly identify data of different types and formats and establishes associations between them, which facilitates the abstraction and management of information resources by users. The concept space layer organizes the personal data space in a concept tree manner. The concept tree is formed by semantic associations between concepts. Thus, the concept tree satisfies the following conditions: the hierarchical relations of all concepts form a tree structure, the nodes in the tree represent concepts, and the edges represent the upper and lower relations among the concepts; the root node serves as the identifier of the complete set of concepts, a branch node is a concept with lower-level child nodes, and a leaf node is a concept without child nodes; each branch node has no less than one child node.
F3. The application layer is responsible for interacting with users and providing applications including navigation technology, visualization technology, editing tools and the like. Visualization techniques present a concept tree of concept space layers and provide view support for navigation techniques, editing tools. Editing tools provide operations to add concepts, render concepts, establish semantic associations, merge concepts, move concepts, and the like.
Concept vectorization based on a hierarchical concept structure includes a vector space initialization stage and a vector increment calculation stage.
The vector space initialization stage may be further subdivided into a pre-processing stage and a conceptual vector merging stage. In the preprocessing stage, a vector space model is used as a support, the concept of each node on the concept tree is vectorized and expressed into a concept vector, and the total number of words of each node is recorded. Referring to fig. 1, the vector space model includes four parts of constructing concept vectors, storing concept vectors, maintaining concept vectors, and calculating similarity:
G1. constructing a concept vector means representing a concept as a vector consisting of feature items and corresponding weights, according to the personal information collection it contains. If the personal information is text data, the following steps can be adopted to construct the concept vector:
G11) perform word segmentation on the personal information text data with a word segmenter to obtain the feature items;
G12) calculate the weight of each feature item with the tf-idf method; the weight of feature t in concept d is tf*idf, where tf represents the frequency of occurrence of feature t in concept d, and idf represents the inverse document frequency, whose value is log((N/(A + 0.01)) + 0.01); N represents the total number of concepts contained in the concept tree, A represents the number of concepts containing feature t, and the +0.01 terms prevent the denominator or the argument of the logarithm from being less than or equal to 0 (a minimal sketch of this weighting is given after this list);
G13) select feature items with the information gain method; information gain is an index commonly used in machine learning to measure the importance of a feature item, and the amount of information carried by a feature item is calculated according to whether the feature appears in a text or not;
G14) according to the personal information file set contained in the concept, assign a weight to each feature item, so that the concept is also represented as a vector consisting of feature items and feature weights; each row in fig. 2 is a feature vector representing one concept, and each component is the weight corresponding to one feature item;
G2. storing the characteristic vector is to store the related information of the concept vector obtained in the process of constructing the concept vector into a database;
G3. maintaining concept vectors is to reflect changes to concept vectors of related concepts after the concept tree structure is changed and accumulated for a certain number of times;
G4. the similarity calculation is to calculate the similarity between the selected concept and other concepts based on the concept vector of the selected concept and other concepts.
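As referenced in G12), the following is a minimal Python sketch of this weighting under the stated convention; the tokenized input, the helper name and the data structures are illustrative assumptions, not part of the invention (feature selection by information gain, G13, is omitted).

```python
import math
from collections import Counter

def build_concept_vectors(concept_terms):
    """concept_terms: dict mapping a concept id to its list of feature items
    (already produced by a word segmenter, step G11).  Returns a dict mapping
    each concept id to {feature item: tf-idf weight}.  Illustrative helper only."""
    n_concepts = len(concept_terms)                       # N: total number of concepts
    containing = Counter()                                # A(t): concepts containing item t
    for terms in concept_terms.values():
        containing.update(set(terms))
    # idf(t) = log((N / (A + 0.01)) + 0.01), the convention stated in G12)
    idf = {t: math.log((n_concepts / (a + 0.01)) + 0.01) for t, a in containing.items()}
    vectors = {}
    for cid, terms in concept_terms.items():
        counts = Counter(terms)
        total = len(terms)                                # total word count of the concept
        # tf(t, d): relative frequency of item t within concept d
        vectors[cid] = {t: (c / total) * idf[t] for t, c in counts.items()}
    return vectors
```

The similarity calculation of G4. can then be performed on such vectors, for example with the cosine measure, which is one common choice for vector space models.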
Referring to fig. 4, the concept vector merging phase comprises running on a computer the following steps:
1) taking a root node of the concept tree as a target node;
2) for the target node, obtain all of its m child nodes C1, C2, …, Cm;
3) obtain the concept vectors VC1, VC2, …, VCm corresponding to C1, C2, …, Cm, and the concept vector V corresponding to the target node;
(3.1) if a child node Ci is a branch node whose corresponding concept vector has not yet been merged, take Ci as the target node and execute from step 2) to merge its concept vector first;
4) calculate the sum L of the total word counts of the target node and all of its child nodes, and create a new concept vector Vnew in the vector space;
5) assume there are n different feature terms T1, T2, …, Tn in the vector space; for a given concept vector V, the weight corresponding to feature term Ti is denoted V.Wi, the total word count of V is denoted LV, and the total word count of VCi is denoted LCi; calculate Vnew.Wi = (V.Wi*LV + VC1.Wi*LC1 + VC2.Wi*LC2 + … + VCm.Wi*LCm)/L, where i = 1, 2, …, n;
6) change the concept vector corresponding to the target node to Vnew and change its total word count to L. A minimal sketch of this merging procedure is given below.
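The following is a minimal Python sketch of steps 1)-6), written as a post-order recursion; the node structure (vector as a term-to-weight dictionary, words as the total word count, children as a list) and the function name are illustrative assumptions, not part of the invention.

```python
def merge_concept_vectors(node):
    """Steps 1)-6): recursively merge the concept vectors of all child nodes into
    each branch node.  node.vector: {feature term: weight}; node.words: total word
    count of the node's own concept; node.children: list of child nodes."""
    if not node.children:                                 # leaf node: nothing to merge
        return
    for child in node.children:                           # step (3.1): merge branch children first
        merge_concept_vectors(child)
    total = node.words + sum(c.words for c in node.children)   # step 4): the sum L
    merged = {}
    parts = [(node.vector, node.words)] + [(c.vector, c.words) for c in node.children]
    for vec, words in parts:                              # step 5): word-count-weighted average
        for term, w in vec.items():
            merged[term] = merged.get(term, 0.0) + w * words
    node.vector = {t: w / total for t, w in merged.items()}    # step 6): replace the vector
    node.words = total                                    # step 6): new total word count
```

Calling merge_concept_vectors on the root node corresponds to taking the root as the initial target node in step 1).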
The vector increment calculation stage can be divided into an increment calculation process and an error completion process. Referring to fig. 5, the incremental computation is performed after each operation (such as adding, deleting, and moving a concept node) performed on the concept tree by the user, and includes the following steps (the following operations are performed after adding or deleting a node) executed on the computer:
A1. take the node Nc to be added or deleted as the target node;
A2. find the parent node Np of the target node; if Np does not exist, end the incremental calculation process;
A3. obtain the concept vector Vc and total word count Lc corresponding to Nc, and the concept vector Vp and total word count Lp corresponding to Np;
A4. suppose the vector space contains n different feature items in total, denoted T1, T2, …, Tn, with corresponding weight components W1, W2, …, Wn; perform the following operations on Vp:
(4.1) if it is an add-node operation, Vp.Wi = (Lp*Vp.Wi + Lc*Vc.Wi)/(Lp + Lc), i = 1, 2, …, n, and the total word count of Np is changed to (Lp + Lc);
(4.2) if it is a delete-node operation, Vp.Wi = (Lp*Vp.Wi - Lc*Vc.Wi)/(Lp - Lc), i = 1, 2, …, n, and the total word count of Np is changed to (Lp - Lc);
A5. take Np as the target node and execute from step A2.
In particular, a move-node operation can be regarded as a delete-node operation followed by an add-node operation, with the same node as the target of both operations. A minimal sketch of the incremental update is given below.
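A minimal Python sketch of steps A1-A5, under the same illustrative node structure as above (with an additional parent link); it follows the reading in which the vector and word count of the added or deleted node Nc are folded into, or removed from, each ancestor in turn, which keeps every ancestor's merged average exact.

```python
def incremental_update(node, added=True):
    """Steps A1-A5: after node Nc is added to (or deleted from) the concept tree,
    fold its vector into, or remove it from, every ancestor's merged vector.
    node.vector, node.words and node.parent are illustrative attributes."""
    vc, lc = node.vector, node.words                      # A1/A3: Vc and Lc of the operated node
    sign = 1 if added else -1
    parent = node.parent
    while parent is not None:                             # A2: stop above the root
        lp = parent.words
        new_words = lp + sign * lc                        # (A4.1)/(A4.2): new total word count
        terms = set(parent.vector) | set(vc)
        parent.vector = {
            t: (lp * parent.vector.get(t, 0.0) + sign * lc * vc.get(t, 0.0)) / new_words
            for t in terms
        }
        parent.words = new_words
        parent = parent.parent                            # A5: continue toward the root
```

A move can then be expressed as incremental_update(node, added=False) at the old position followed by incremental_update(node, added=True) at the new one.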
Further, the error completion process can be subdivided into an inverse document frequency error accumulation vector updating part and a feature item weight batch updating part. Note that the inverse document frequency follows the convention of the conventional tf-idf algorithm; in the present invention, the inverse document frequency of a feature item is calculated from the total number of concepts and the number of concepts containing that feature item. Here, a "concept" corresponds to a "document" in the conventional tf-idf algorithm.
There are several global values in the whole concept space, including the inverse document frequency vector Vidf and the inverse document frequency initial value vector Vini. Suppose there are n different feature items in the vector space, denoted T1, T2, …, Tn. Given a concept vector V, the weight corresponding to feature item Ti is denoted V.Wi; for feature item Ti, the total number of concepts containing Ti is denoted Ti.F. The inverse document frequency error accumulation vector updating part is executed immediately after each incremental calculation process ends; referring to fig. 6, it comprises the following steps executed on a computer:
D3. obtain the total number of concepts A in the current concept tree, the inverse document frequency vector Vidf and the inverse document frequency initial value vector Vini;
D4. perform the following operations on Vidf and Vini:
(D2.1) if Vidf.Wi == 0, then Vini.Wi = log((A/(Ti.F + 0.01)) + 0.01), i = 1, 2, …, n;
(D2.2) Vidf.Wi = log((A/(Ti.F + 0.01)) + 0.01), i = 1, 2, …, n. A minimal sketch of this updating part is given below.
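A minimal Python sketch of these D steps, assuming Vidf and Vini are dictionaries keyed by feature item and that the count of concepts containing each item (the role of Ti.F) is available; all names are illustrative assumptions, not part of the invention.

```python
import math

def update_idf_vectors(v_idf, v_ini, total_concepts, containing_counts):
    """Refresh the inverse document frequency vector after an incremental update.
    v_idf / v_ini play the roles of Vidf and Vini; containing_counts[t] plays the
    role of Ti.F, the number of concepts containing feature item t."""
    for term, count in containing_counts.items():
        new_idf = math.log((total_concepts / (count + 0.01)) + 0.01)
        if v_idf.get(term, 0.0) == 0.0:                   # (D2.1): seed the initial value
            v_ini[term] = new_idf
        v_idf[term] = new_idf                             # (D2.2): record the current idf
```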
The feature item weight batch updating part is executed after several incremental calculation processes; it does not need to run immediately after a particular incremental calculation finishes, and its frequency can be adjusted as required. The execution process comprises the following steps executed on a computer:
E4. obtain the current inverse document frequency vector Vidf and the inverse document frequency initial value vector Vini;
E5. for each node N in the concept tree, perform the following operation on its corresponding concept vector V:
V.Wi = V.Wi * Vidf.Wi / Vini.Wi, i = 1, 2, …, n;
E6. Vini.Wi = Vidf.Wi, i = 1, 2, …, n. A minimal sketch of this batch repair is given below.
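A minimal Python sketch of this batch repair (steps E4-E6), under the same illustrative data structures; the iterable of nodes is an assumed helper, not part of the invention.

```python
def batch_reweight(all_nodes, v_idf, v_ini):
    """Rescale every stored weight by the ratio of the current inverse document
    frequency to the one in force when the weight was computed, then reset
    Vini := Vidf.  all_nodes is an assumed iterable over all concept nodes."""
    for node in all_nodes:                                # E5: every concept vector in the tree
        for term in list(node.vector):
            if v_ini.get(term):                           # guard against a missing/zero initial idf
                node.vector[term] *= v_idf.get(term, 0.0) / v_ini[term]
    v_ini.update(v_idf)                                   # E6: Vini.Wi = Vidf.Wi for every item
```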
The above embodiments are only for illustrating the invention; the steps may be varied, and all equivalent changes and modifications based on the technical scheme of the invention should not be excluded from the protection scope of the invention.

Claims (4)

1. A hierarchical concept vectorization increment processing method in personal big data management, comprising a vector space initialization stage and a vector increment calculation stage, wherein the vector space initialization stage can be further subdivided into a preprocessing stage and a concept vector merging stage, and the vector increment calculation stage can be divided into an incremental calculation process and an error completion process; the preprocessing stage vectorizes the concept of each node in the concept tree so that it is expressed as a concept vector, and records the total word count of each node and the inverse document frequency of each feature item; the concept vector merging phase comprises running the following steps on a computer:
1) taking a root node of the concept tree as a target node;
2) for the target node, obtain all of its m child nodes C1, C2, …, Cm;
3) obtain the concept vectors VC1, VC2, …, VCm corresponding to C1, C2, …, Cm, and the concept vector V corresponding to the target node;
(3.1) if a child node Ci is a branch node whose corresponding concept vector has not yet been merged, take Ci as the target node and execute from step 2) to merge its concept vector first;
4) calculate the sum L of the total word counts of the target node and all of its child nodes; create a new concept vector Vnew in the vector space;
5) assume there are n different feature terms T1, T2, …, Tn in the vector space; for a given concept vector V, the weight corresponding to feature term Ti is denoted V.Wi, the total word count of V is denoted LV, and the total word count of VCi is denoted LCi; calculate Vnew.Wi = (V.Wi*LV + VC1.Wi*LC1 + VC2.Wi*LC2 + … + VCm.Wi*LCm)/L, where i = 1, 2, …, n;
6) change the concept vector corresponding to the target node to Vnew and change its total word count to L;
the incremental calculation process is executed immediately after each update operation performed by a user on the concept tree; update operations on the concept tree comprise adding, deleting or moving concept nodes, wherein moving a concept node is regarded as a two-step operation of deleting and then adding; for adding or deleting a node, the following steps are executed on the computer:
A1. take the node Nc to be added or deleted as the target node;
A2. find the parent node Np of the target node; if Np does not exist, end the incremental calculation process;
A3. obtain the concept vector Vc and total word count Lc corresponding to Nc, and the concept vector Vp and total word count Lp corresponding to Np;
A4. suppose the vector space contains n different feature items in total, denoted T1, T2, …, Tn, with corresponding weight components W1, W2, …, Wn; perform the following operations on Vp:
(A4.1) if it is an add-node operation, Vp.Wi = (Lp*Vp.Wi + Lc*Vc.Wi)/(Lp + Lc), i = 1, 2, …, n, and the total word count of Np is changed to (Lp + Lc);
(A4.2) if it is a delete-node operation, Vp.Wi = (Lp*Vp.Wi - Lc*Vc.Wi)/(Lp - Lc), i = 1, 2, …, n, and the total word count of Np is changed to (Lp - Lc);
A5. take Np as the target node and execute from step A2;
the error completion process can be subdivided into an inverse document frequency error accumulation vector updating part and a feature item weight batch updating part; there are several global values in the whole concept space, including the inverse document frequency vector Vidf and the inverse document frequency initial value vector Vini; suppose there are n different feature items in the vector space, denoted T1, T2, …, Tn; given a concept vector V, the weight corresponding to feature item Ti is denoted V.Wi; for feature item Ti, the total number of concepts containing Ti is denoted Ti.F; the inverse document frequency error accumulation vector updating part is executed immediately after each incremental calculation process ends, and comprises the following steps executed on a computer:
D1. obtain the total number of concepts A in the current concept tree, the inverse document frequency vector Vidf and the inverse document frequency initial value vector Vini;
D2. perform the following operations on Vidf and Vini:
(D2.1) if Vidf.Wi == 0, then Vini.Wi = log((A/(Ti.F + 0.01)) + 0.01), i = 1, 2, …, n;
(D2.2) Vidf.Wi = log((A/(Ti.F + 0.01)) + 0.01), i = 1, 2, …, n;
the feature item weight batch updating part is executed after several incremental calculation processes; it does not need to run immediately after a particular incremental calculation finishes, and its frequency can be adjusted as required; the execution process comprises the following steps executed on a computer:
E1. obtain the current inverse document frequency vector Vidf and the inverse document frequency initial value vector Vini;
E2. for each node N in the concept tree, perform the following operation on its corresponding concept vector V: V.Wi = V.Wi * Vidf.Wi / Vini.Wi, i = 1, 2, …, n;
E3. Vini.Wi = Vidf.Wi, i = 1, 2, …, n.
2. The method for vectorized incremental processing of concept hierarchy in personal big data management as claimed in claim 1, wherein: the personal big data management model is used for finishing the functions of organization, storage, management and processing of personal information; the personal big data management model comprises a resource layer, a concept space layer and an application layer:
F1) the resource layer includes personal information stored in the DBMS, file system, and other systems;
wherein the personal information in the file system includes textual data and non-textual data; the text data comprises email, pdf files, office files and html file data, and the non-text data comprises video, audio and picture data;
F2) in the concept space layer, a concept refers to a set of information resources that are similar or related to one another; the layer uses concepts to uniformly identify data of different types and formats and establishes associations between them, which facilitates the abstraction and management of information resources by users;
F3) the application layer is responsible for interacting with a user and provides applications including navigation technology, visualization technology and editing tools.
3. The method as claimed in claim 2, wherein the concept space layer organizes personal information in a concept tree manner; the concept tree is formed by semantic associations between concepts; thus, the concept tree satisfies the following condition:
G1) the hierarchical relation of all concepts forms a tree structure, the nodes in the tree represent the concepts, and the edges represent the upper and lower relations among the concepts;
G2) the root node is used as a concept complete set identifier, the branch node is a concept with lower child nodes, and the leaf node is a concept without child nodes;
G3) each branch node has no less than one child node.
4. The method for vectorized incremental processing of concept hierarchy in personal big data management as claimed in claim 1, wherein: the whole stage takes a vector space model as a support; the vector space model comprises four parts of constructing concept vectors, storing the concept vectors, maintaining the concept vectors and calculating the similarity:
H1) the constructed concept vector is a vector formed by representing concepts into feature items and feature weights according to information resource sets contained in the concepts;
H2) the concept vector storage is to store the related information of the concept vector obtained in the concept vector construction process into a database;
H3) the maintenance concept vector is used for reflecting the changes to the concept vector of the related concept after the concept tree structure is changed and accumulated for a certain number of times;
H4) the similarity calculation is to calculate the similarity between the selected concept and other concepts according to the concept vector of the selected concept and other concepts.
CN201611154347.7A 2016-12-14 2016-12-14 Hierarchical concept vectorization increment processing method in personal big data management Active CN106682129B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611154347.7A CN106682129B (en) 2016-12-14 2016-12-14 Hierarchical concept vectorization increment processing method in personal big data management

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611154347.7A CN106682129B (en) 2016-12-14 2016-12-14 Hierarchical concept vectorization increment processing method in personal big data management

Publications (2)

Publication Number Publication Date
CN106682129A CN106682129A (en) 2017-05-17
CN106682129B true CN106682129B (en) 2020-02-21

Family

ID=58868490

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611154347.7A Active CN106682129B (en) 2016-12-14 2016-12-14 Hierarchical concept vectorization increment processing method in personal big data management

Country Status (1)

Country Link
CN (1) CN106682129B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112307219B (en) * 2020-10-22 2022-11-04 首都师范大学 Method and system for updating vocabulary database for website search and computer storage medium

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1460947A (en) * 2003-06-13 2003-12-10 北京大学计算机科学技术研究所 Text classification incremental training learning method supporting vector machine by compromising key words
CN104794168A (en) * 2015-03-30 2015-07-22 明博教育科技有限公司 Correlation method and system for knowledge points
CN105868366A (en) * 2016-03-30 2016-08-17 浙江工业大学 Concept space navigation method based on concept association

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Research and Application of Incremental Clustering in Dynamic Multi-Document Summarization; 郭海蓉; China Master's Theses Full-text Database, Information Science and Technology; 2016-02-15 (No. 02, 2016); I138-1968 *
Research on Key Technologies of Multi-Level Text Classification and Incremental Learning; 冯佳; China Master's Theses Full-text Database, Information Science and Technology; 2012-03-15 (No. 03, 2012); I138-2676 *

Also Published As

Publication number Publication date
CN106682129A (en) 2017-05-17

Similar Documents

Publication Publication Date Title
CN109829104B (en) Semantic similarity based pseudo-correlation feedback model information retrieval method and system
CN101404015B (en) Automatically generating a hierarchy of terms
US9110922B2 (en) Joint embedding for item association
US8131684B2 (en) Adaptive archive data management
CN111190997B (en) Question-answering system implementation method using neural network and machine learning ordering algorithm
US7895195B2 (en) Method and apparatus for constructing a link structure between documents
CN111382276B (en) Event development context graph generation method
US20140149429A1 (en) Web search ranking
CN108399213B (en) User-oriented personal file clustering method and system
CN101404016A (en) Determining a document specificity
CN105243083B (en) Document subject matter method for digging and device
CN112115232A (en) Data error correction method and device and server
CN111325030A (en) Text label construction method and device, computer equipment and storage medium
WO2015051481A1 (en) Determining collection membership in a data graph
CN108733745B (en) Query expansion method based on medical knowledge
CN112818121A (en) Text classification method and device, computer equipment and storage medium
Omri et al. Towards an efficient big data indexing approach under an uncertain environment
CN112685452B (en) Enterprise case retrieval method, device, equipment and storage medium
Manne et al. Text categorization with K-nearest neighbor approach
Drakopoulos et al. A semantically annotated JSON metadata structure for open linked cultural data in Neo4j
CN106682129B (en) Hierarchical concept vectorization increment processing method in personal big data management
WO2016206044A1 (en) Extracting enterprise project information
Manne et al. A Query based Text Categorization using K-nearest neighbor Approach
CN112199461A (en) Document retrieval method, device, medium and equipment based on block index structure
CN101493823B (en) Identifying clusters of words according to word affinities

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant