CN117725229A - Knowledge organization system auxiliary updating method

Info

Publication number: CN117725229A
Application number: CN202410028647.9A
Authority: CN
Inventors: 张运良; 王莉军; 李琳娜; 王力; 金辉
Original assignee: Institute Of Scientific And Technical Information Of China
Current assignee: Institute Of Scientific And Technical Information Of China
Priority date: 2024-01-08
Filing date: 2024-01-08
Publication date: 2024-03-19

Abstract

The invention discloses an auxiliary updating method for a knowledge organization system, comprising the following steps: building a domain corpus; screening the corpus; instruction fine-tuning of a large model; clustering the corpus by similarity; indexing keywords using the understanding capability of the large model; calculating keyword weights; comparing with the existing knowledge organization system to further screen new words; identifying the domain of the new words using the understanding capability of the large model; and generating first-type and second-type inter-word relations. The invention provides a richer and more diversified configuration scheme for updating a knowledge organization system efficiently and rapidly. Compared with link prediction schemes aimed at knowledge graphs, the method is more targeted at knowledge organization systems in the field of library and information science: it focuses on new word discovery and inter-word relation construction for representative concepts, and emphasizes combination with, and multiple verification against, a real corpus, so as to improve the quality of auxiliary updating.

Description

Knowledge organization system auxiliary updating method

Technical Field

The invention relates to the technical field of library and information analysis, and in particular to an auxiliary updating method for a knowledge organization system.

Background

In the field of library and information science, a knowledge organization system is a tool for describing and organizing the relationships between terms, concepts or topics, and a knowledge system for classifying, organizing and managing information. It is intended to provide a structured way to describe, access and understand the knowledge of different domains and the literature resources behind that knowledge. Knowledge organization systems are of various types, such as thesauri, subject heading lists, vocabularies, classification schemes and knowledge graphs; the invention mainly concerns the concepts, and the associations between them, in the knowledge organization systems used in library and information applications. Taking the thesaurus as an example, the main knowledge types are as follows: (1) Synonymy: the thesaurus records a set of words related to a particular concept or topic and provides their synonyms or near-synonyms. This helps the user employ a wider vocabulary in searching, improving the accuracy and coverage of search results. (2) Hierarchical relationship: the words in a thesaurus are typically organized in a hierarchical structure that shows their broader and narrower terms or their classification. The structure reflects relationships such as superordinate and subordinate, or whole and part, among different words, and helps the user better understand the relationships among concepts. (3) Associative relationship: the thesaurus may provide associations between words, such as associated words, related words or related topics. Such association information can guide the user to expand the search range and find other content related to the search purpose. (4) Definition and scope notes: the thesaurus typically provides a concise definition or explanation for each concept or word, together with the field or scope in which it applies. This helps the user understand the meaning and contextual use of the term accurately.

At present, the most common way of updating a knowledge organization system is manual, which has the following defects: (1) High time cost: manual updating requires a significant investment of manpower and time. Because of the complexity and scale of a knowledge organization system, manual updates may take a long time and may fail to keep up with changing knowledge and information requirements. (2) Consistency and standardization issues: manual updates are prone to inconsistency. Different editors may differ in term selection, hierarchy, association and so on, resulting in insufficient internal consistency or a low degree of standardization of the knowledge organization system. (3) Difficulty coping with large scale and dynamic change: with the information explosion and the rapid development of knowledge fields, manual updating can hardly handle massive and dynamically changing knowledge.

Disclosure of Invention

The invention aims to provide an auxiliary updating method of a knowledge organization system, so as to solve the problems in the prior art.

In order to achieve the above purpose, the technical scheme adopted by the invention is as follows:

an auxiliary updating method of a knowledge organization system comprises the following steps:

S100, constructing a domain corpus: building a domain corpus and building a multidimensional index of domain dimensions in the corpus;

S200, corpus screening: obtaining an updated corpus set C according to the last update time tk of the knowledge organization system K and a specific time tp of each document in the corpus;

S300, instruction fine-tuning of the large model: selecting a large language model M, constructing a fine-tuning instruction set I, and performing instruction fine-tuning on the selected large language model M to obtain a fine-tuned large language model Mf;

S400, clustering the corpus by similarity: clustering the corpus set C with a clustering algorithm into n classes, with n controlled to be not more than 20, the classes being denoted C1, C2, …, Ci, …, Cn;

S500, keyword indexing using the understanding capability of the large model: indexing each document in the corpus set C with the large language model Mf and the corresponding keyword indexing prompt to obtain no more than 10 keywords;

S600, calculating keyword weights: ranking the keywords extracted from all documents by weight according to the TF-IDF weighting algorithm, interacting with the user, and selecting a user-specified number of candidate keywords;

S700, comparing with the existing knowledge organization system to further screen new words: comparing the selected candidate keywords with the existing knowledge organization system K, denoting the set of all terms of the existing knowledge organization system K as So, and recording the candidate keywords that are not in So into a new word set Sn;

S800, identifying the domain of new words using the understanding capability of the large model: judging each new word in the new word set Sn with the large language model Mf and the corresponding keyword indexing prompt, feeding the result back to a human reviewer, and adjusting the new word set Sn;

S900, generating first-type and second-type inter-word relations: generating the first-type and second-type inter-word relations according to the determined new word set Sn of the knowledge organization system K and the keywords of the corresponding corpus.

In some of the embodiments of the present invention,

the specific steps of S200 are as follows:

S210, first determining the last update time tk of the knowledge organization system K;

S220, selecting from the corpus the documents whose time tp is later than tk and whose classification number falls within the specified domain range, to form the updated corpus set C;

where tp is the publication time.

In some of the embodiments of the present invention,

the fine-tuning instruction set I includes a series of instruction subsets, including but not limited to a keyword extraction subset Ik, a knowledge organization system relation extraction subset Ir, a vocabulary domain classification subset Ic, and other instructions corresponding to tasks necessary for constructing a knowledge organization system.

In some of the embodiments of the present invention,

the clustering algorithm comprises the K-Means algorithm;

the K-Means algorithm is specifically:

first, K center points are randomly selected, and each corpus document is assigned to the cluster of its nearest center point according to its distance from the center points; then, in every iteration, the position of each center point is recalculated from the documents in its current cluster, until the clustering result is stable.

In some of the embodiments of the present invention,

the formula of the TF-IDF weighting algorithm is as follows:

TF-IDF(w,d)＝TF(w,d)*IDF(w,d)

where TF(w,d) represents the frequency of occurrence of the word w in the document d, and IDF(w,d) represents the inverse document frequency of w, typically computed as the logarithm of the ratio of the total number of documents to the number of documents containing the word w, so that words appearing in fewer documents receive higher weights.

In some of the embodiments of the present invention,

the first-type inter-word relation generation is specifically as follows:

inter-word relations are generated within each clustered class Ci: for two words Wa and Wb that co-occur as keywords within the class, if Wa and Wb are both within the scope of So+Sn, the large language model Mf and the corresponding inter-word relation judgment prompt are used to judge the inter-word relation between Wa and Wb, and the resulting relations are sorted by frequency before being fed back to the user.

In some of the embodiments of the present invention,

the second-type inter-word relation generation is specifically as follows:

for each word Wi in Sn, the large language model Mf and the corresponding inter-word relation generation prompt are used to automatically generate the synonymous, hierarchical (broader/narrower) and related relations of Wi; relations that are not in the existing knowledge organization system K are fed back to the user for confirmation, and Wi and its related words are searched for in the corpus to provide a basis for confirmation and/or to filter the relations in advance.

The beneficial effects of the invention are as follows:

the invention discloses an auxiliary updating method of a knowledge organization system, which aims at the updating requirement of the knowledge organization system, provides a richer and diversified configuration scheme, and updates the knowledge organization system efficiently and rapidly.

Drawings

FIG. 1 is a flow chart of an auxiliary updating method of a knowledge organization system.

Detailed Description

The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. Examples of the embodiments are illustrated in the accompanying drawings, wherein like or similar symbols indicate like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention. In the description of the present invention, it should be understood that the terms "top," "bottom," "inner," "outer," "axis," "circumferential," and the like indicate an orientation or a positional relationship based on that shown in the drawings, and are merely for convenience in describing the present invention or simplifying the description, and do not indicate or imply that the devices or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include one or more such feature. In the description of the present invention, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.

In the present invention, unless explicitly specified and limited otherwise, the terms "mounted," "connected," "secured," "engaged," "hinged," and the like are to be construed broadly, and may be, for example, fixedly connected, detachably connected, or integrally formed; can be mechanically or electrically connected; can be directly connected or indirectly connected through an intermediate medium, and can be communicated with the inside of two elements or the interaction relationship of the two elements. The specific meaning of the above terms in the present invention can be understood by those of ordinary skill in the art according to the specific circumstances.

In the description of the present specification, reference to the terms "one embodiment," "some embodiments," "examples," "particular examples," "one particular embodiment," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the invention. In this specification, schematic representations of terms do not necessarily refer to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.

Referring to fig. 1, a knowledge organization system auxiliary updating method includes the following steps:

S100, constructing a domain corpus: a domain corpus is constructed, and a multidimensional index of domain dimensions is built in the corpus.

In this step, a domain corpus is constructed. The metadata of the domain corpus includes, but is not limited to, text fields such as titles, keywords and abstracts, together with the publication time in the format yyyy-MM-dd HH:mm:ss, so that the corpus can be conveniently processed and analyzed with natural language processing techniques. A multidimensional index over the domain dimension can be built in the corpus; this suits cases where some documents cover several domains, so that documents meeting given conditions can be located quickly.
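As an illustration of the step above, the following is a minimal Python sketch of a corpus record and a domain index. The field names (`id`, `title`, `domains`, `published`), the sample records and the function `build_domain_index` are assumptions made for this example and are not specified by the invention.

```python
from collections import defaultdict
from datetime import datetime

# Python equivalent of the yyyy-MM-dd HH:mm:ss time format.
TIME_FORMAT = "%Y-%m-%d %H:%M:%S"


def build_domain_index(corpus):
    """Map each domain label to the ids of the documents that cover it."""
    index = defaultdict(set)
    for doc in corpus:
        # A document covering several domains is indexed under each of them.
        for domain in doc["domains"]:
            index[domain].add(doc["id"])
    return index


# Hypothetical sample records.
corpus = [
    {"id": "d1", "title": "Large language models for indexing",
     "domains": ["information science", "computer science"],
     "published": datetime.strptime("2023-11-02 09:30:00", TIME_FORMAT)},
    {"id": "d2", "title": "Thesaurus maintenance practice",
     "domains": ["information science"],
     "published": datetime.strptime("2022-05-17 14:00:00", TIME_FORMAT)},
]
domain_index = build_domain_index(corpus)
```

Looking up `domain_index["computer science"]` then returns only the documents tagged with that domain, without scanning the whole corpus.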

S200, corpus screening: an updated corpus set C is obtained according to the last update time tk of the knowledge organization system K and a specific time tp of each document in the corpus.

In this step, the results of the multidimensional domain analysis can be presented visually, for example as charts, reports or dynamic animations, so that the user can grasp the characteristics and trends of the corpus more intuitively; the presentation can be adjusted and optimized according to user feedback. For screening, the last update time tk of the knowledge organization system K is first determined; then the documents whose time tp is later than tk and whose classification number falls within the specified domain range are selected from the corpus to form the updated corpus set C. Here tp is typically the publication time, although another uniformly applied time may be used, such as the submission time, application time or disclosure time.
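The screening rule can be sketched in Python as follows. This is a simplified illustration: the record fields `tp` and `class_no`, the sample classification numbers, and the choice to match the domain range by classification-number prefix are all assumptions of this example.

```python
from datetime import datetime


def screen_corpus(corpus, tk, allowed_prefixes):
    """Keep documents with tp later than tk whose classification number is in range."""
    prefixes = tuple(allowed_prefixes)
    return [
        doc for doc in corpus
        if doc["tp"] > tk and doc["class_no"].startswith(prefixes)
    ]


# Hypothetical last update time of the knowledge organization system K.
tk = datetime(2023, 1, 1)
corpus = [
    {"id": "d1", "tp": datetime(2023, 6, 1), "class_no": "G250.7"},
    {"id": "d2", "tp": datetime(2022, 3, 9), "class_no": "G254"},    # too old
    {"id": "d3", "tp": datetime(2023, 9, 15), "class_no": "TP391"},  # out of range
]
C = screen_corpus(corpus, tk, allowed_prefixes=["G25"])
```

Only the first record passes both conditions, so `C` contains a single document.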

S300, instruction fine-tuning of the large model: a large language model M is selected, a fine-tuning instruction set I is constructed, and instruction fine-tuning is performed on the selected large language model M to obtain a fine-tuned large language model Mf.

In this step, a fine-tuning instruction set I is first built. I includes a series of instruction subsets, including but not limited to a keyword extraction subset Ik, a knowledge organization system relation extraction subset Ir, a vocabulary domain classification subset Ic, and other instructions corresponding to tasks necessary for constructing a knowledge organization system. The selected large language model M (such as Llama, ChatGLM or BLOOM) is then instruction fine-tuned to obtain Mf, which strengthens support for the tasks required for updating a knowledge organization system while preserving the generation capability of the large language model. According to task requirements, Mf can be further divided into Mfk, Mfr and Mfc, which better meets the needs of updating the knowledge organization system and improves the efficiency and accuracy of the update.

The instructions in I can be generated by ChatGPT or by other machine learning methods. For keyword extraction, for example, a training set can be obtained by manually annotating 5 to 7 keywords per document; a part-of-speech tagging algorithm can tag the text, and keywords can be distinguished from non-keywords by part of speech. A named entity recognition algorithm can also be used to recognize named entities in the text, such as person, place and organization names, and extract them as keywords; more instruction data can then be generated by supervised learning. For relation extraction, rule-based, machine-learning-based or deep-learning-based approaches can be used to identify relations in text. Word embedding algorithms such as Word2Vec and GloVe can also be used to map the vocabulary of the text into a vector space, after which knowledge organization system relations are identified by computing vector similarity. For vocabulary domain classification, text can be classified by domain with algorithms such as support vector machines (SVM), logistic regression and random forests. Deep learning models, such as convolutional neural networks (CNN), recurrent neural networks (RNN) and variational autoencoders (VAE), can also be used for domain classification of text.

S400, clustering the corpus by similarity: the corpus set C is clustered with a clustering algorithm into n classes, with n controlled to be not more than 20, the classes being denoted C1, C2, …, Ci, …, Cn.

In this step, each document in the corpus set C is represented as a vector using a text feature extraction method such as the bag-of-words model, the TF-IDF algorithm or the Word2Vec algorithm. A mature text clustering algorithm, such as K-Means, DBSCAN, STING or hierarchical clustering, is then applied so that similar documents fall into the same class, yielding several clusters denoted in turn C1, C2, …, Cn.

The number of clusters can be adjusted to specific requirements and is generally controlled to be not more than 20.

S500, keyword indexing using the understanding capability of the large model: each document in the corpus set C is indexed with the large language model Mf and the corresponding keyword indexing prompt to obtain no more than 10 keywords.

In this step, the understanding capability of the Mfk model, the part of the large language model Mf oriented to keyword indexing, is used to index the keywords of each document in the corpus set C. No more than 10 keywords are obtained through the Mfk model and prompt control, the prompts used being, as far as possible, those in the instruction subset Ik.
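The prompt-controlled indexing described above might be wired up as in the following Python sketch. The prompt wording, the semicolon-separated output format and the `generate` callable (a stand-in for a call to the fine-tuned model Mfk) are all hypothetical; `fake_generate` exists only so that the example runs without a model.

```python
def index_keywords(document_text, generate, max_keywords=10):
    """Ask a text-generation callable for keywords and cap the result."""
    prompt = (
        f"Extract at most {max_keywords} keywords from the following text. "
        "Return them on one line separated by semicolons.\n\n" + document_text
    )
    raw = generate(prompt)
    seen, keywords = set(), []
    # Accept both ASCII and full-width semicolons as separators.
    for part in raw.replace("；", ";").split(";"):
        keyword = part.strip()
        if keyword and keyword.lower() not in seen:
            seen.add(keyword.lower())
            keywords.append(keyword)
    # Enforce the cap even if the model returns more than requested.
    return keywords[:max_keywords]


def fake_generate(prompt):
    """Hypothetical stand-in for the model call; returns a canned answer."""
    return "knowledge organization; thesaurus; large language model; Thesaurus; "


keywords = index_keywords("Sample abstract text.", fake_generate)
```

Capping and de-duplicating on the caller's side is a design choice of this sketch: it keeps the limit of 10 keywords enforceable even when the model output does not follow the prompt exactly.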

S600, calculating keyword weights: the keywords extracted from all documents are ranked by weight according to the TF-IDF weighting algorithm, and, in interaction with the user, a user-specified number of candidate keywords is selected.

After keyword indexing is completed, the keywords extracted from all documents can be weighted and ranked by using a TF-IDF weighting algorithm. The formula of the TF-IDF weighting algorithm is as follows:

TF-IDF(w,d)＝TF(w,d)*IDF(w,d)

In calculating keyword weights, each document may be considered as a vector, where each element represents a term, and the weights are TF-IDF values for that term. These weights may then be ranked, and a user-specified number (e.g., 500) of candidate keywords selected.
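The weighting and ranking can be sketched in Python as below. The sketch assumes TF is the relative frequency of a word within a document and IDF is log(N / df), one common variant; taking each word's best weight over all documents as its ranking score is likewise an assumption of this example, since the text does not fix how per-document weights are aggregated.

```python
import math
from collections import Counter


def tf_idf_scores(docs):
    """Return {(word, doc_index): weight} with TF = count / len(doc), IDF = log(N / df)."""
    n_docs = len(docs)
    # Document frequency: number of documents in which each word appears.
    df = Counter(word for doc in docs for word in set(doc))
    scores = {}
    for i, doc in enumerate(docs):
        for word, count in Counter(doc).items():
            tf = count / len(doc)
            idf = math.log(n_docs / df[word])
            scores[(word, i)] = tf * idf
    return scores


def top_candidates(docs, k):
    """Rank words by their best TF-IDF weight over all documents; keep the top k."""
    best = {}
    for (word, _), weight in tf_idf_scores(docs).items():
        best[word] = max(best.get(word, 0.0), weight)
    return sorted(best, key=lambda w: (-best[w], w))[:k]


# Each document is represented by its extracted keywords (hypothetical data).
docs = [
    ["thesaurus", "ontology", "thesaurus"],
    ["ontology", "retrieval"],
    ["ontology", "prompt", "prompt", "prompt"],
]
candidates = top_candidates(docs, k=2)
```

In this toy data "ontology" appears in every document, so its IDF is log(1) = 0 and it never reaches the candidate list, which is the intended effect of the IDF term.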

S700, comparing with the existing knowledge organization system to further screen new words: the selected candidate keywords are compared with the existing knowledge organization system K; the set of all terms of the existing knowledge organization system K is denoted So, and the candidate keywords that are not in So are recorded into a new word set Sn.

In this step, new words are further screened out by comparison with the existing knowledge organization system and interaction with the user, to determine the final new word list. Each candidate keyword is compared with the set So of all terms of the existing knowledge organization system. If a keyword is not in So, its information, such as the keyword itself, its part of speech, whether it is a named entity, and its TF-IDF weight, is displayed to the user, who selects keywords on this basis; the keywords approved by the user are recorded as new words in Sn.
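A minimal Python sketch of this comparison follows. The case-insensitive matching and the `approve` callback, which stands in for the interactive user confirmation, are assumptions of this example.

```python
def screen_new_words(candidates, existing_terms, approve):
    """Return the candidates that are absent from So and approved by the user."""
    normalized = {term.strip().lower() for term in existing_terms}
    # Candidates already present in the existing term set So are dropped.
    unseen = [w for w in candidates if w.strip().lower() not in normalized]
    # The remaining words are shown to the user; only approved ones enter Sn.
    return [w for w in unseen if approve(w)]


# Hypothetical term set and candidate list.
So = {"Thesaurus", "Knowledge graph", "Indexing"}
candidates = ["thesaurus", "prompt engineering", "instruction tuning", "indexing"]
Sn = screen_new_words(candidates, So, approve=lambda w: w != "instruction tuning")
```

In an interactive tool, `approve` would display the part of speech, named-entity flag and TF-IDF weight of each word before asking for a decision.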

S800, identifying the domain of new words using the understanding capability of the large model: each new word in the new word set Sn is judged with the large language model Mf and the corresponding keyword indexing prompt, the result is fed back to a human reviewer, and the new word set Sn is adjusted.

In this step, the domain of each new word is identified using the understanding capability of the large model. After the new words are screened out, each new word in the new word set Sn is judged using the understanding capability of the Mfc model, the part of the large language model Mf oriented to vocabulary domain recognition and classification; whether the new word belongs to the professional domain of the knowledge organization system K is determined through the Mfk model and prompt control, the prompts used being, as far as possible, those in the instruction subset Ik. Words judged to belong to the professional domain of the knowledge organization system K are retained, and the remaining words are temporarily removed.

S900, generating first-type and second-type inter-word relations: the first-type and second-type inter-word relations are generated according to the determined new word set Sn of the knowledge organization system K and the keywords of the corresponding corpus.

In this step, once the new word set Sn of the knowledge organization system K and the keywords of the corresponding corpus have been determined, inter-word relations are generated, beginning with the first type. Inter-word relations are the associations between words, such as co-occurrence, hierarchical and synonymy relations. First, inter-word relations are generated within each clustered class Ci. For two words Wa and Wb that co-occur as keywords within the class, if Wa and Wb both fall within the scope of So+Sn, that is, each is either a word of the existing knowledge organization system K or a newly added word, the large language model Mfr and the corresponding inter-word relation judgment prompt can be used to determine the relation between Wa and Wb, such as a hierarchical, synonymous or related relation. Since a relation may occur with different frequencies in different classes, the relations are counted and summarized, sorted by frequency, and fed back to the user for confirmation.
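The counting of co-occurring word pairs within one class can be sketched in Python as follows. Only the pair counting and frequency ordering are shown; the relation judgment by the model Mfr is left out. The data layout (one keyword list per document of the class) and the function name are assumptions of this example.

```python
from collections import Counter
from itertools import combinations


def cooccurring_pairs(cluster_keywords, vocabulary):
    """Count keyword pairs within So+Sn that co-occur in documents of one class."""
    counts = Counter()
    for keywords in cluster_keywords:
        # Keep only words in scope; sort so each pair has one canonical order.
        in_scope = sorted(set(keywords) & vocabulary)
        counts.update(combinations(in_scope, 2))
    # Most frequent pairs first, ready to be judged and shown to the user.
    return counts.most_common()


# Hypothetical term sets and keyword lists for the documents of one class Ci.
So = {"thesaurus", "ontology"}
Sn = {"prompt engineering"}
cluster = [
    ["thesaurus", "ontology", "deep learning"],
    ["thesaurus", "ontology", "prompt engineering"],
    ["prompt engineering", "thesaurus"],
]
pairs = cooccurring_pairs(cluster, So | Sn)
```

Each pair in `pairs` would then be passed, together with the relation judgment prompt, to the model, and the judged relations aggregated across classes.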

The second-type inter-word relations are generated as follows. For each word Wi in the new word set Sn, the large language model Mf and the corresponding inter-word relation generation prompt are used to automatically generate the synonymous, hierarchical and related relations of Wi. Each generated related word Wj is then searched for together with Wi in the corpus (co-occurrence retrieval), and two processing modes are possible: in the first, relations not in the existing knowledge organization system K are fed back to the user for confirmation, with the co-occurrence frequency as a reference; in the second, a co-occurrence threshold δ is introduced, and only word pairs whose co-occurrence frequency exceeds δ, together with their relations, are fed back to the user.
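The second processing mode, filtering by a co-occurrence threshold, can be sketched in Python as below. Counting co-occurrence as the number of documents containing both words by plain substring match, and representing a relation as a `(Wi, relation, Wj)` triple, are simplifying assumptions of this example.

```python
def filter_by_cooccurrence(relations, corpus_texts, delta):
    """Keep generated relations whose word pair co-occurs in more than delta documents."""
    kept = []
    for wi, relation, wj in relations:
        # Co-occurrence frequency: documents mentioning both words.
        freq = sum(1 for text in corpus_texts if wi in text and wj in text)
        if freq > delta:
            kept.append((wi, relation, wj, freq))
    return kept


# Hypothetical corpus texts and model-generated relations.
corpus_texts = [
    "prompt engineering improves instruction tuning results",
    "instruction tuning and prompt engineering for indexing",
    "prompt engineering in thesaurus maintenance",
]
relations = [
    ("prompt engineering", "related", "instruction tuning"),
    ("prompt engineering", "narrower", "thesaurus"),
]
confirmed = filter_by_cooccurrence(relations, corpus_texts, delta=1)
```

The first processing mode corresponds to calling the same function with `delta=-1`, so that every pair is kept and the frequency serves only as a reference for the user.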

In some of the embodiments of the present invention,

the specific steps of S200 are as follows:

s210, determining the last update time tk of the knowledge organization system K.

S220, selecting the corpus with the time tp larger than tk and the corpus classification number conforming to the specific field range from the corpus to form an updated corpus set C.

Where tp is the publication time.

In some of the embodiments of the present invention,

the fine tuning instruction set I includes a series of instruction subsets including, but not limited to, a keyword extraction subset Ik, a knowledge organization system relation extraction subset Ir, a vocabulary field classification subset Ic, and other instructions corresponding to tasks necessary for knowledge organization system construction.

In some of the embodiments of the present invention,

the clustering algorithm comprises the K-Means algorithm.

The K-Means algorithm is specifically:

First, K center points are randomly selected, and each corpus document is assigned to the cluster of its nearest center point according to its distance from the center points. Then, in every iteration, the position of each center point is recalculated from the documents in its current cluster, until the clustering result is stable.
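The procedure can be illustrated with a small pure-Python sketch. It assumes documents have already been converted to numeric vectors, uses squared Euclidean distance, and treats "stable" as "assignments unchanged between two iterations"; the `seed` and `max_iter` parameters are additions of this example.

```python
import random


def k_means(points, k, max_iter=100, seed=0):
    """Cluster points (tuples of floats) into k groups; return (labels, centers)."""
    rng = random.Random(seed)
    # Randomly select k of the points as initial center points.
    centers = rng.sample(points, k)
    assignment = None
    for _ in range(max_iter):
        # Assign each point to its nearest center (squared Euclidean distance).
        new_assignment = [
            min(range(k), key=lambda c: sum((p - q) ** 2 for p, q in zip(point, centers[c])))
            for point in points
        ]
        if new_assignment == assignment:
            break  # clustering result is stable
        assignment = new_assignment
        # Recompute each center as the mean of the points currently in its cluster.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centers[c] = tuple(sum(dim) / len(members) for dim in zip(*members))
    return assignment, centers


# Two well-separated groups of hypothetical 2-D document vectors.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (9.0, 9.0), (9.0, 10.0), (10.0, 9.0)]
labels, centers = k_means(points, k=2)
```

For real corpora the result depends on the random initialization, so running the algorithm several times with different seeds is a common precaution.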

A typical distance is the edit distance, also known as the Levenshtein distance, which measures the degree of difference between two strings as the minimum number of insert, delete and replace operations required to convert one string into the other. The edit distance formula is as follows:

D(A,B)＝dp(len(A),len(B))

where dp(i,j) represents the minimum number of operations required to convert the first i characters of A into the first j characters of B.
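The table dp can be filled with the standard dynamic-programming recurrence, shown here as a short Python sketch; the recurrence itself is the textbook one and is not spelled out in the text above.

```python
def edit_distance(a, b):
    """Levenshtein distance D(A, B) = dp[len(A)][len(B)]."""
    # dp[i][j]: minimum operations to turn the first i chars of a into the first j of b.
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        dp[i][0] = i  # delete all i characters
    for j in range(len(b) + 1):
        dp[0][j] = j  # insert all j characters
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            dp[i][j] = min(
                dp[i - 1][j] + 1,         # deletion
                dp[i][j - 1] + 1,         # insertion
                dp[i - 1][j - 1] + cost,  # replacement (or match)
            )
    return dp[len(a)][len(b)]


distance = edit_distance("kitten", "sitting")
```

The function works on any Python string, so it applies equally to Chinese terms, where each character counts as one unit.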

Alternatively, the similarity between texts can be calculated and then converted into a distance. Measures such as cosine similarity, Jaccard similarity and Euclidean distance may be used. Cosine similarity is a commonly used text similarity measure suited to texts represented as vectors: it measures the similarity of two text vectors by the cosine of the angle between them. The cosine similarity formula is as follows:

S(A,B)＝(A·B)/(||A||*||B||)

where A and B are two text vectors, A·B represents the dot product (inner product) of the vectors, and ||A|| and ||B|| represent the norms (lengths) of A and B respectively. Similarity can be converted to distance with a monotonically decreasing function, such as D=1-S.
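The formula and the conversion D=1-S translate directly into Python, as in the sketch below. Returning a similarity of 0 for a zero vector is a convention chosen for this example, since the formula is undefined in that case.

```python
import math


def cosine_similarity(a, b):
    """S(A, B) = (A·B) / (||A|| * ||B||) for two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention of this sketch for zero vectors
    return dot / (norm_a * norm_b)


def cosine_distance(a, b):
    """Convert similarity to distance with the decreasing function D = 1 - S."""
    return 1.0 - cosine_similarity(a, b)


# Parallel vectors are at distance 0; orthogonal vectors are at distance 1.
d_parallel = cosine_distance([1, 2, 3], [2, 4, 6])
d_orthogonal = cosine_distance([1, 0], [0, 1])
```

A distance defined this way can be substituted for the Euclidean distance in a clustering step that assigns each document to its nearest center point.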

In some of the embodiments of the present invention,

the formula of the TF-IDF weighting algorithm is as follows:

TF-IDF(w,d)＝TF(w,d)*IDF(w,d)

In some of the embodiments of the present invention,

By adopting the technical scheme disclosed by the invention, the following beneficial effects are obtained: the invention discloses an auxiliary updating method of a knowledge organization system, which aims at the updating requirement of the knowledge organization system, provides a richer and diversified configuration scheme, and updates the knowledge organization system efficiently and rapidly.

The foregoing is merely a preferred embodiment of the present invention and it should be noted that modifications and adaptations to those skilled in the art may be made without departing from the principles of the present invention, which is also intended to be covered by the present invention.

Claims

1. The auxiliary updating method of the knowledge organization system is characterized by comprising the following steps of:

S100, constructing a domain corpus: building a domain corpus and building a multidimensional index of domain dimensions in the corpus;

S200, corpus screening: obtaining an updated corpus set C according to the last update time tk of the knowledge organization system K and a specific time tp of each document in the corpus;

S400, clustering the corpus by similarity: clustering the corpus set C with a clustering algorithm into n classes, with n controlled to be not more than 20, the classes being denoted C1, C2, …, Ci, …, Cn;

S500, keyword indexing using the understanding capability of the large model: indexing each document in the corpus set C with the large language model Mf and the corresponding keyword indexing prompt to obtain no more than 10 keywords;

S600, calculating keyword weights: ranking the keywords extracted from all documents by weight according to the TF-IDF weighting algorithm, interacting with the user, and selecting a user-specified number of candidate keywords;

S700, comparing with the existing knowledge organization system to further screen new words: comparing the selected candidate keywords with the existing knowledge organization system K, denoting the set of all terms of the existing knowledge organization system K as So, and recording the candidate keywords that are not in So into a new word set Sn;

S800, identifying the domain of new words using the understanding capability of the large model: judging each new word in the new word set Sn with the large language model Mf and the corresponding keyword indexing prompt, feeding the result back to a human reviewer, and adjusting the new word set Sn;

S900, generating first-type and second-type inter-word relations: generating the first-type and second-type inter-word relations according to the determined new word set Sn of the knowledge organization system K and the keywords of the corresponding corpus.

2. The method for assisted updating of a knowledge organization system of claim 1, wherein,

the specific steps of the step S200 are as follows:

S220, selecting from the corpus the documents whose time tp is later than tk and whose classification number falls within the specified domain range, to form the updated corpus set C;

where tp is the publication time.

3. The method for assisted updating of a knowledge organization system of claim 2, wherein,

4. The method for auxiliary updating of knowledge organization system according to claim 3,

the clustering algorithm comprises the following steps: K-Means algorithm;

the K-Means algorithm is specifically as follows:

5. The method for assisted updating of a knowledge organization system of claim 4, wherein,

the formula of the TF-IDF weight algorithm is as follows:

TF-IDF(w,d)＝TF(w,d)*IDF(w,d)

6. The method for assisted updating of a knowledge organization system of claim 5, wherein,

inter-word relations are generated within each clustered class Ci: for two words Wa and Wb that co-occur as keywords within the class, if Wa and Wb are both within the scope of So+Sn, the large language model Mf and the corresponding inter-word relation judgment prompt are used to judge the inter-word relation between Wa and Wb, and the resulting relations are sorted by frequency before being fed back to the user.

7. The method for assisted updating of a knowledge organization system of claim 6, wherein,