CN117272073A - Text unit semantic distance pre-calculation method and device, and query method and device - Google Patents


Info

Publication number
CN117272073A
CN117272073A (application CN202311569661.1A; granted publication CN117272073B)
Authority
CN
China
Prior art keywords: text, text unit, query
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202311569661.1A
Other languages
Chinese (zh)
Other versions
CN117272073B (en)
Inventor
张晓东
Current Assignee
Hangzhou Langmuda Information Technology Co ltd
Original Assignee
Hangzhou Langmuda Information Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Hangzhou Langmuda Information Technology Co., Ltd.
Priority to CN202311569661.1A
Publication of CN117272073A
Application granted
Publication of CN117272073B
Legal status: Active

Classifications

    • G06F18/22 — Pattern recognition; analysing; matching criteria, e.g. proximity measures
    • G06F16/335 — Information retrieval of unstructured textual data; querying; filtering based on additional data, e.g. user or group profiles
    • G06F16/338 — Information retrieval of unstructured textual data; querying; presentation of query results
    • G06F40/30 — Handling natural language data; semantic analysis
    • G06N3/042 — Neural networks; knowledge-based neural networks; logical representations of neural networks
    • G06N3/08 — Neural networks; learning methods
    • G06N5/04 — Computing arrangements using knowledge-based models; inference or reasoning models

Abstract

The invention discloses a text unit semantic distance pre-calculation method and device, and a query method and device. The pre-calculation method comprises the following steps: acquiring all text units in a pre-calculation knowledge base, and acquiring an associated text unit set for each text unit by an associated-unit acquisition mode; acquiring knowledge representations of all object attribute text units in the pre-calculation knowledge base from the associated text unit sets by a preset object knowledge representation acquisition mode, and acquiring knowledge representations of all category attribute text units in the pre-calculation knowledge base by a preset category knowledge representation acquisition mode; based on the knowledge representations of the text units, calculating the semantic distances of all text unit pairs by a text unit relation determination mode, and collecting all text unit pairs with calculated semantic distances, together with the corresponding distances, into a semantic distance library of the pre-calculation knowledge base. The semantic distance calculation involves no vector embedding or chunking and is lossless to the original data information.

Description

Text unit semantic distance pre-calculation method and device, and query method and device
Technical Field
The invention relates to the technical field of knowledge base query, and in particular to a text unit semantic distance pre-computing method and device, and a query method and device.
Background
Pre-training is a strategy for training a deep learning model. Its aim is to extract common features from a large amount of data, apply them to a task-specific model, and then fine-tune with a small amount of labeled data from the relevant field, so that the model only needs to learn the task-specific part beyond what is common. However, pre-training requires a large amount of training data; the accuracy of the result depends on the quality of that data and is not necessarily high, and the training cost is high. Moreover, because a large model is trained on historical data, it cannot be updated in real time.
At present, a large model integration framework can combine a large model with external data to realize real-time updating, but the training and inference costs of such frameworks are relatively high. The external data is converted into vectors and stored in a data store that supports vector search; vector similarity must be calculated at search time and an index must be built when the vectors are stored, so the amount of computation is large and the time and compute costs are high. Meanwhile, the data-embedding pipeline in the large model integration framework is long, and dividing the data into chunks damages the original information, so the semantic information and contextual relations of the data are lost and the final query result becomes unrelated to the original data.
Therefore, for the data used by pre-training and large model integration frameworks, a more convenient, rapid, and accurate retrieval method is needed, one that does not require segmenting the data, thereby avoiding damage to the original information and loss of the data's semantic information and contextual relations.
Disclosure of Invention
The purpose of the application is to provide a text unit semantic distance pre-calculation method and device, and a query method and device, which reduce training cost and improve accuracy. The method solves the problems that processing the external data of the existing large model integration framework is computationally heavy, that segmenting the data damages the original information and loses its semantic information and contextual relations, and that the final query result is unrelated to the original data.
In a first aspect, the present application provides a text unit semantic distance pre-computing method, including:
acquiring all text units in a pre-calculation knowledge base, and acquiring an associated text unit set of each text unit based on an associated unit acquisition mode;
acquiring knowledge representations of all object attribute text units in the pre-calculation knowledge base based on the associated text unit set by a preset object knowledge representation acquisition mode, and acquiring knowledge representations of all category attribute text units in the pre-calculation knowledge base by a preset category knowledge representation acquisition mode;
acquiring all text unit pairs that can be formed from all the text units, calculating the semantic distances of all the text unit pairs by a text unit relation determination mode based on the knowledge representations of the text units, and taking all the text unit pairs with calculated semantic distances, together with the corresponding semantic distances, as a semantic distance library of the pre-calculation knowledge base;
the object attribute text unit is an object in the pre-calculation knowledge base, and the category attribute text unit is a category in the pre-calculation knowledge base.
In an embodiment of the present application, acquiring a set of associated text units of a text unit based on an associated unit acquisition manner includes:
acquiring a description page of a conventional text unit from the pre-calculation knowledge base;
taking the text units in the description page as internal text units of the conventional text units, and collecting all types of the internal text units of the conventional text units as an associated text unit set of the conventional text units;
wherein the regular text unit is any text unit in the pre-computed knowledge base.
In an embodiment of the present application, the obtaining, by a preset object knowledge representation obtaining manner, knowledge representations of object attribute text units in the pre-computed knowledge base based on the associated text unit set includes:
screening the associated text unit sets faced by the object attribute text unit, with the object attribute text unit as the screening unit, and collecting the text units corresponding to the associated text unit sets that satisfy the screening condition as the knowledge representation of the object attribute text unit;
wherein the object attribute text unit is any one object in the pre-calculation knowledge base; the associated text unit sets faced by the object attribute text unit are all associated text unit sets in the pre-calculation knowledge base except the one corresponding to the object attribute text unit itself; and the screening condition is that the screening unit is contained in the associated text unit set.
In an embodiment of the present application, the obtaining, by a preset category knowledge representation obtaining manner, a knowledge representation of a single category attribute text unit in the pre-computed knowledge base includes:
acquiring an object attribute text unit belonging to a category attribute text unit as an object text unit, and collecting knowledge representations of all the object text units of the category attribute text unit as knowledge representations of the category attribute text unit;
wherein the category attribute text unit is any category in the pre-calculation knowledge base.
In an embodiment of the present application, calculating the semantic distance of the text unit pair by a text unit relation determining manner includes:
setting one text unit in the text unit pair as a first text unit, and setting the other text unit as a second text unit;
judging whether the knowledge representation of the first text unit and the knowledge representation of the second text unit have an intersection, if so, indicating that the first text unit and the second text unit have a relation, calculating the semantic distance between the first text unit and the second text unit based on the knowledge representation of the first text unit and the knowledge representation of the second text unit, and otherwise, indicating that the first text unit and the second text unit have no relation.
In an embodiment of the present application, the semantic distance between the first text unit and the second text unit is calculated by an Ochiai coefficient calculation method or a Jaccard index calculation method based on the knowledge representation of the first text unit and the knowledge representation of the second text unit.
In a second aspect, the present application provides a text unit semantic distance pre-computing device, including an associated text unit acquisition module, a knowledge representation acquisition module, and a semantic distance library acquisition module:
The associated text unit acquisition module is used for acquiring all text units in the pre-calculation knowledge base and acquiring an associated text unit set of each text unit based on an associated unit acquisition mode;
the knowledge representation acquisition module is used for acquiring knowledge representations of all object attribute text units in the pre-calculation knowledge base based on a preset object knowledge representation acquisition mode and acquiring knowledge representations of all category attribute text units in the pre-calculation knowledge base based on a preset category knowledge representation acquisition mode;
the semantic distance library acquisition module is used for acquiring all text unit pairs which can be formed by all the text units, calculating the semantic distances of all the text unit pairs in a text unit relation determination mode based on knowledge representation of the text units, and collecting all the text unit pairs with the calculated semantic distances and the corresponding semantic distances as a semantic distance library of the pre-calculation knowledge library;
the object attribute text unit is an object in the pre-calculation knowledge base, and the category attribute text unit is a category in the pre-calculation knowledge base.
In a third aspect, the present application provides a knowledge base text unit query method, including:
acquiring a text unit to be queried;
searching a semantic distance library of a knowledge base for all text unit pairs that contain the text unit to be queried, to serve as query text unit pairs of the text unit to be queried, and collecting all or some of the non-queried text units in the query text unit pairs into a query result list based on the corresponding semantic distances;
wherein the non-queried text units in a query text unit pair are the text units in the pair other than the text unit to be queried, and the semantic distance library of the knowledge base is obtained through the text unit semantic distance pre-computing method described above.
In an embodiment of the present application, searching the semantic distance library for all text unit pairs containing the text unit to be queried, and forming a query result list from all or some of those text units based on the semantic distances, includes:
setting the language type of the text unit to be queried as a first type language and the language type of a knowledge base as a second type language;
if the first type language is the same as the second type language, judging whether the identifier corresponding to the text unit to be queried is unique. If it is unique, searching the semantic distance library of the knowledge base for all text unit pairs containing the text unit to be queried, to serve as query text unit pairs, and collecting all or some of the non-queried text units in those pairs into a query result list based on the corresponding semantic distances. Otherwise, acquiring a further limiting condition for the text unit to be queried, determining the identifier of the unit to be queried from all identifiers corresponding to that text unit based on the limiting condition, searching the semantic distance library for all text unit pairs containing that identifier, and collecting all or some of the text units corresponding to the other identifiers in those pairs into a query result list based on the corresponding semantic distances;
if the first type language is different from the second type language, converting the text unit to be queried into an equivalent text unit in the second type language, searching the semantic distance library of the knowledge base for all text unit pairs containing the equivalent text unit, taking all or some of the non-equivalent text units in those pairs as equivalent query units based on the corresponding semantic distances, converting all the equivalent query units into query results in the first type language, and collecting all the query results into a query result list;
wherein each distinct meaning of a text unit in the knowledge base has a unique identifier.
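As an illustrative sketch (function and variable names are hypothetical; the patent does not prescribe an implementation), the core lookup step — finding all precomputed pairs that contain the queried unit and ranking the partner units by semantic distance — can be written as:

```python
def query_distance_library(distance_library, unit, top_k=None):
    """Return the partner units of `unit`, ranked by ascending semantic
    distance, from a precomputed (pair -> distance) map."""
    hits = []
    for (u1, u2), dist in distance_library.items():
        if unit == u1:
            hits.append((dist, u2))
        elif unit == u2:
            hits.append((dist, u1))
    hits.sort()  # smaller distance = semantically closer, listed first
    partners = [u for _, u in hits]
    return partners[:top_k] if top_k is not None else partners

# toy semantic distance library: text unit pairs and their distances
lib = {("A", "B"): 0.2, ("A", "C"): 0.5, ("B", "C"): 0.1}
# query_distance_library(lib, "A") -> ["B", "C"]
```

Because every distance was computed ahead of time, a query is a pure lookup; no embedding or similarity computation happens at query time.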
In a fourth aspect, the present application provides a knowledge base text unit query device, including a query unit acquisition module and a query result acquisition module;
the query unit acquisition module is used for acquiring a text unit to be queried;
the query result acquisition module is used for searching the semantic distance library of a knowledge base for all text unit pairs containing the text unit to be queried, to serve as query text unit pairs of the text unit to be queried, and collecting all or some of the non-queried text units in those pairs into a query result list based on the corresponding semantic distances;
wherein the non-queried text units in a query text unit pair are the text units in the pair other than the text unit to be queried, and the semantic distance library of the knowledge base is obtained through the text unit semantic distance pre-computing method described above.
One or more embodiments of the above-described solution may have the following advantages or benefits compared to the prior art:
By applying the text unit semantic distance pre-calculation method provided by the embodiment of the invention, the text units appearing in each description page are collected, all text units sharing the same internal text units are then aggregated to obtain a knowledge representation for each text unit, and finally a text unit semantic distance library is obtained from the knowledge representations. The semantic distance acquisition process involves no vector embedding or chunking, is lossless to the original data information, and improves accuracy. Obtaining the text unit semantic distance library is a pre-calculation process, and its results can be stored directly in various data stores or databases for query use; compared with existing pre-training and large model integration frameworks, it saves the time of vector embedding and computation, i.e., the time and compute cost of large model training. It also helps achieve faster semantic alignment during training and improves the reasoning capability of a large model.
By applying the knowledge base text unit query method provided by the embodiment of the invention, queries over text units can be realized. Further, the method can quickly query text data in different languages, and can also quickly query text units that carry multiple meanings.
Additional features and advantages of the invention will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention. The objectives and other advantages of the invention will be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the invention, and are incorporated in and constitute a part of this specification, illustrate the invention and together with the embodiments of the invention, serve to explain the invention, without limitation to the invention. In the drawings.
Fig. 1 shows a flowchart of a text unit semantic distance pre-calculation method according to an embodiment of the present application.
Fig. 2 is a schematic structural diagram of a text unit semantic distance pre-calculating device according to an embodiment of the present application.
Fig. 3 is a schematic flow chart of a knowledge base text unit query method according to an embodiment of the present application.
Fig. 4 is a schematic structural diagram of a knowledge base text unit query device according to an embodiment of the present application.
Detailed Description
Embodiments of the present invention are described in detail below with reference to the drawings and examples, so that how the invention applies technical means to solve the technical problems and achieve the technical effects can be fully understood and implemented. It should be noted that, as long as no conflict arises, the embodiments of the present invention and the features of each embodiment may be combined with each other, and the resulting technical solutions all fall within the protection scope of the present invention.
It should also be noted that the illustrations provided in the following embodiments merely illustrate the basic concepts of the application; the drawings show only the components related to the application and are not drawn according to the number, shape, and size of components in an actual implementation. The form, number, and proportion of components in an actual implementation may vary arbitrarily, and the component layout may be more complex.
Data curation refers to the activity of organizing and integrating data collected from different sources to create a trustworthy database of data resources. It comprises the full life cycle of activities by which data is effectively managed: collection, screening, evaluation, storage, maintenance, utilization, and so on. Data curation may be done manually or by machine. Data that has undergone curation is typically of high quality.
A knowledge base is a tool for storing and managing knowledge, which can be used to support knowledge systems, knowledge discovery, knowledge sharing, and other applications. Text units (i.e., tokens) include words, phrases, and sentences. In the knowledge base they refer to objects and categories, i.e., the words, phrases, and sentences that name them. They may be specific token objects such as "Cogito, ergo sum," "Homo homini lupus," "knowledge is power," and "NBA Finals Most Valuable Player Award," or token categories such as "NBA Finals" and "Chinese actors."
Knowledge representation (knowledge representation, KR for short) refers to the abstraction of things, concepts, relationships, etc. in the real world into a form that a computer can process so that the computer can understand and process these information.
Semantic distance refers to a measure of the semantic similarity or difference between two words, phrases, or sentences in semantic space. It can be used for tasks such as word sense disambiguation, text classification, and information retrieval.
Large model hallucination refers to content generated by an artificial intelligence model that is not based on any real-world data but is instead extrapolated by the large model itself. The nature of this hallucination is that the large model lacks perception of the real world; its training data may be small in quantity or low in quality, which can lead to overfitting and quantization errors in training, loss of the prompt context, and so on.
Equivalents refer to a set of tokens that represent the same meaning, including tokens that have the same meaning in different languages; for example, the Chinese "北京大学" and the English "Peking University" are equivalents.
A large model integration framework provides standard modular components for integrating different large models and connecting them to various external data sources and APIs.
The following embodiments of the present application provide a text unit semantic distance pre-computing method and device, and a query method and device, which are used to solve the problems that processing the external data of the existing large model integration framework is computationally expensive and that the data must be segmented, damaging the original information, so that the semantic information and contextual relations of the data are lost and the final query result is unrelated to the original data.
The principle and implementation of the text unit semantic distance pre-calculation method and device and the query method and device of the present embodiment are described in detail below with reference to the accompanying drawings, so that those skilled in the art can understand them without creative labor.
As shown in fig. 1, the present embodiment provides a text unit semantic distance pre-computing method, which includes the following steps.
Step S101, acquiring all text units in a pre-calculation knowledge base, and acquiring an associated text unit set of each text unit based on an associated unit acquisition mode.
The knowledge base in this embodiment may serve as an external data set for a large model integration framework. The knowledge base in the embodiment of the present invention has undergone data curation and includes the most commonly used text units (i.e., tokens). A knowledge base that has undergone prior data curation has higher data quality, and the similarity between text units calculated on it better reflects real semantic relationships. Moreover, results calculated from text units are more accurate and faster than results learned by existing pre-training. Further, text units (i.e., tokens) may be words, phrases, and sentences; in the knowledge base they refer to objects and categories, i.e., the words, phrases, and sentences that name them.
The knowledge base on which the semantic distances between text units are pre-processed is taken as the pre-calculation knowledge base. All text units in the pre-calculation knowledge base are obtained, and then the associated text unit set of each text unit is acquired by the associated-unit acquisition mode.
Further, the process of acquiring the associated text unit set of a single text unit in the pre-calculation knowledge base by the associated-unit acquisition mode comprises the following steps. The text unit may be any text unit in the pre-calculation knowledge base; to distinguish it from other text units it is called a regular text unit. Since text units in this embodiment are categories or objects, each text unit has a corresponding description page. First, the description page of the regular text unit is acquired from the pre-calculation knowledge base; then, all text units appearing in the description page text are acquired as the internal text units of the regular text unit; aggregating all distinct internal text units of the regular text unit yields its associated text unit set. In this way, the associated text unit sets of all text units in the pre-calculation knowledge base can be obtained.
It should be noted that, when acquiring the internal text units of a regular text unit, the number of times an internal unit appears in the description page need not be considered. For example, if the only text units appearing in the description page text of text unit A are text unit B and text unit C, then B and C are the internal text units of A, which indicates that B and C are each associated with A; the associated text unit set of A can then be written as A -> [B, C].
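As a minimal sketch of this step (the knowledge base is modeled as a plain dict of description-page texts, and the naive substring match is a simplifying assumption, not something the patent specifies), the associated text unit sets can be built as follows:

```python
def build_associated_sets(pages, all_units):
    """For each text unit, collect the distinct other text units that
    appear in its description page; occurrence counts are ignored."""
    associated = {}
    for unit in all_units:
        page_text = pages[unit]
        associated[unit] = {u for u in all_units
                            if u != unit and u in page_text}
    return associated

# toy pre-calculation knowledge base: unit -> description page text
pages = {
    "A": "see unit B and unit C, and B again",  # repeats are irrelevant
    "B": "mentions A and C",
    "C": "mentions A only",
}
assoc = build_associated_sets(pages, ["A", "B", "C"])
# assoc["A"] == {"B", "C"}, matching A -> [B, C] above
```

A real system would use the knowledge base's own link or markup structure rather than substring matching to detect which units a description page mentions.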
Step S102, obtaining knowledge representations of all object attribute text units in the pre-calculation knowledge base based on the associated text unit set by a preset object knowledge representation obtaining mode, and obtaining knowledge representations of all category attribute text units in the pre-calculation knowledge base by a preset category knowledge representation obtaining mode.
After the associated text unit sets of all the text units are obtained, the knowledge representation of each text unit can be derived from them. In this embodiment, the text units are the objects and categories in the knowledge base, and the knowledge representation acquisition mode for an object differs from that for a category.
Further, when the text unit is an object, obtaining the knowledge representation of a single object attribute text unit in the pre-calculation knowledge base by the preset object knowledge representation obtaining mode comprises the following steps: taking the object attribute text unit as the screening unit, screening the associated text unit sets faced by the object attribute text unit to pick out all associated text unit sets meeting the screening condition; and finally, collecting the text units corresponding to all the screened associated text unit sets to obtain the knowledge representation of the object attribute text unit.
Here, the object attribute text unit is any one object in the pre-calculation knowledge base. The associated text unit sets faced by the object attribute text unit are all associated text unit sets in the pre-calculation knowledge base other than the one corresponding to the object attribute text unit itself. The screening condition is that an associated text unit set contains the current object attribute text unit.
For example, if the associated text unit set of text unit B is B -> [A, C] and that of text unit C is C -> [A, D], then the knowledge representation of object attribute text unit A is KRa = [C, B], which becomes KRa = [B, C] after sorting.
Taking each object attribute text unit in turn as the screening unit, the knowledge representations of all object attribute text units are acquired in this way.
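The screening step above can be sketched as follows (names are hypothetical); it reproduces the KRa = [B, C] example, since the associated sets of B and C both contain A:

```python
def object_knowledge_representation(associated, obj):
    """KR of an object attribute text unit: every other unit whose
    associated text unit set contains the object, sorted by identifier."""
    return sorted(u for u, s in associated.items() if u != obj and obj in s)

# associated text unit sets, extending the B -> [A, C], C -> [A, D] example
assoc = {"A": {"B", "C"}, "B": {"A", "C"}, "C": {"A", "D"}, "D": set()}
kr_a = object_knowledge_representation(assoc, "A")
# kr_a == ["B", "C"]
```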
Further, when the text unit is a category, obtaining the knowledge representation of a single category attribute text unit in the pre-calculation knowledge base through the preset category knowledge representation obtaining mode comprises the following steps: acquiring all object attribute text units belonging to the category attribute text unit in the pre-calculation knowledge base, taking the acquired object attribute text units as object text units, and collecting the knowledge representations of all object text units of the category attribute text unit to obtain the knowledge representation of the category attribute text unit. The category attribute text unit is any category in the pre-calculation knowledge base. In this way, the knowledge representations of all category attribute text units can be obtained.
The process of obtaining all object attribute text units belonging to a category attribute text unit relies on the set attributes of the pre-calculation knowledge base itself, and is not described in further detail herein.
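Assuming the object-level knowledge representations have already been computed as above, the category knowledge representation is the collection of its member objects' representations. A hedged sketch (names are illustrative):

```python
def category_knowledge_representation(member_objects, object_krs):
    """Collect the knowledge representations of all object text units
    belonging to the category into one sorted, de-duplicated list."""
    kr = set()
    for obj in member_objects:
        kr.update(object_krs.get(obj, []))
    return sorted(kr)

# Hypothetical category with member objects A and D.
object_krs = {"A": ["B", "C"], "D": ["C"]}
print(category_knowledge_representation(["A", "D"], object_krs))  # ['B', 'C']
```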
Since each meaning text unit in the knowledge base has a unique identifier (i.e., an ID), the text units in each knowledge representation can be ordered by identifier value. This arrangement facilitates the subsequent determination of relationships between text units.
Step S103, obtaining all text unit pairs that can be formed by all text units, calculating the semantic distances of all the text unit pairs by a text unit relation determination mode based on the knowledge representations of the text units, and taking the set of all text unit pairs whose semantic distances have been calculated, together with the corresponding semantic distances, as the semantic distance library of the pre-calculation knowledge base.
Specifically, all text unit pairs that can be constructed from all text units are acquired, where a text unit pair comprises any two of the text units. For example, all text unit pairs between text unit A, text unit B, and text unit C are (A, B), (B, C), and (A, C). After the knowledge representations of all text units are obtained, the relation between any two text units can be judged based on those knowledge representations, and the semantic distance library of the pre-calculation knowledge base is then obtained. In this embodiment, the relation between the two text units in each text unit pair is determined by the text unit relation determination mode.
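Enumerating all text unit pairs corresponds to taking all unordered 2-combinations of the text units; a short sketch (not from the patent):

```python
from itertools import combinations

units = ["A", "B", "C"]
# Every unordered pair of distinct text units, exactly once.
pairs = list(combinations(units, 2))
print(pairs)  # [('A', 'B'), ('A', 'C'), ('B', 'C')]
```

For n text units this yields n·(n−1)/2 pairs, which is why pre-computing the distances once and storing them in a library pays off at query time.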
Further, one text unit of the pair is set as the first text unit and the other as the second text unit. The process of determining the relation between the two text units of a text unit pair by the text unit relation determination mode includes: first judging whether the knowledge representation of the first text unit and the knowledge representation of the second text unit have an intersection; if so, the first text unit and the second text unit have a relation, and the semantic distance between the first text unit and the second text unit is then calculated by the Ochiai coefficient calculation mode or the Jaccard index calculation mode. If the knowledge representation of the first text unit and the knowledge representation of the second text unit have no intersection, the first text unit and the second text unit have no relation, and the semantic distance between them is not calculated.
The process of acquiring the semantic distance between the first text unit and the second text unit by the Jaccard index calculation mode is as follows: assuming that the first text unit is A with knowledge representation KRa, and the second text unit is B with knowledge representation KRb, the expression for the semantic distance between the first text unit and the second text unit based on the Jaccard index is:

D = |KRa ∩ KRb| / |KRa ∪ KRb|
The process of acquiring the semantic distance between the first text unit and the second text unit by the Ochiai coefficient calculation mode is as follows: assuming that the first text unit is A with knowledge representation KRa, and the second text unit is B with knowledge representation KRb, the expression for the semantic distance between the first text unit and the second text unit based on the Ochiai coefficient is:

D = |KRa ∩ KRb| / √(|KRa| × |KRb|)
Wherein the larger the semantic distance D, the closer the relation between the two text units. The semantic distance D is interpreted as follows: when D ∈ [0.6, 1], the two text units are extremely similar; when D ∈ [0.4, 0.6), the two text units are similar; when D ∈ [0.2, 0.4), the two text units are not very similar; when D ∈ [0, 0.2), the two text units are dissimilar.
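Both distance modes and the qualitative bands above can be sketched as follows (Python; returning `None` for "no relation, distance not calculated" is an assumption of this sketch, not specified by the patent):

```python
import math

def semantic_distance(kra, krb, mode="jaccard"):
    """Jaccard index or Ochiai coefficient over two knowledge representations.
    Returns None when the representations do not intersect (no relation)."""
    a, b = set(kra), set(krb)
    inter = len(a & b)
    if inter == 0:
        return None  # no intersection: the pair gets no semantic distance
    if mode == "jaccard":
        return inter / len(a | b)
    return inter / math.sqrt(len(a) * len(b))  # Ochiai coefficient

def similarity_band(d):
    """Map a distance to the qualitative bands given in the text."""
    if d >= 0.6: return "extremely similar"
    if d >= 0.4: return "similar"
    if d >= 0.2: return "not very similar"
    return "dissimilar"

d = semantic_distance(["B", "C"], ["B", "D"])  # intersection {B}, union {B, C, D}
print(d, similarity_band(d))
```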
Through the above means, all text unit pairs for which a semantic distance can be calculated are obtained. Finally, the set of all such text unit pairs together with their corresponding semantic distances is taken as the semantic distance library of the pre-calculation knowledge base. When a text unit is queried, the query result for the corresponding text unit can be obtained quickly by querying the semantic distance library of the pre-calculation knowledge base directly.
It should be noted that the semantic distance library obtained in the embodiment of the present invention is in fact a collection of semantic distances among a plurality of text units and is not limited to a hardware database; the semantic distance library obtained in this embodiment may be applied to any storage and retrieval mechanism, and may, for example, be stored in a high-speed storage medium, or in local or remote memory, depending on the requirements of the actual application.
According to the text unit semantic distance pre-calculation method provided by the embodiment of the present invention, the text units in the description page are queried, all text units containing the same internal text unit are then aggregated to obtain the knowledge representation of each text unit, and the semantic distance library of the text units is finally obtained based on the knowledge representations. The semantic distance acquisition process involves no vector embedding and no chunking, is lossless with respect to the original data information, and improves accuracy. The acquisition of the text unit semantic distance library is a pre-calculation process, and its result can be stored directly in various data stores or databases for querying. Compared with existing pre-training and large-model integration frameworks, the vector embedding and calculation time is saved, that is, the time cost and computation cost of large model training are saved; the method also helps to achieve semantic alignment faster during training and to improve the reasoning capability of large models.
As shown in fig. 2, the present embodiment provides a text unit semantic distance pre-computing device, which includes an associated text unit acquisition module, a knowledge representation acquisition module, and a semantic distance library acquisition module.
The associated text unit acquisition module is used for acquiring all text units in the pre-calculation knowledge base and acquiring an associated text unit set of each text unit based on an associated unit acquisition mode.
The knowledge representation acquisition module is used for acquiring knowledge representations of all object attribute text units in the pre-calculation knowledge base based on the associated text unit set through a preset object knowledge representation acquisition mode, and acquiring knowledge representations of all category attribute text units in the pre-calculation knowledge base through a preset category knowledge representation acquisition mode.
The semantic distance library acquisition module is used for acquiring all text unit pairs which can be formed by all text units, calculating the semantic distances of all the text unit pairs in a text unit relation determination mode based on knowledge representation of the text units, and collecting all the text unit pairs with the calculated semantic distances and the corresponding semantic distances as a semantic distance library of a pre-calculation knowledge base.
The object attribute text unit is an object in the pre-calculation knowledge base, and the category attribute text unit is a category in the pre-calculation knowledge base.
According to the text unit semantic distance pre-calculation device provided by the embodiment of the present invention, the text units in the description page are queried, all text units containing the same internal text unit are then aggregated to obtain the knowledge representation of each text unit, and the semantic distance library of the text units is finally obtained based on the knowledge representations. The semantic distance acquisition process involves no vector embedding and no chunking, is lossless with respect to the original data information, and improves accuracy. The acquisition of the text unit semantic distance library is a pre-calculation process, and its result can be stored directly in various data stores or databases for querying. Compared with existing pre-training and large-model integration frameworks, the vector embedding and calculation time is saved, that is, the time cost and computation cost of large model training are saved; the device also helps to achieve semantic alignment faster during training and to improve the reasoning capability of large models.
As shown in fig. 3, the present embodiment provides a knowledge base text unit query method, which includes the following steps.
Step S301, a text unit to be queried is acquired.
The text unit to be queried is obtained through text box input or through a selection interface.
Step S302, searching all text unit pairs containing the text unit to be queried from the semantic distance library of the knowledge base as query text unit pairs of the text unit to be queried, and collecting the non-queried text units in all or part of the query text unit pairs as a query result list based on the corresponding semantic distances.
The query is performed based on the specific conditions of the text unit to be queried. Query types include same-language queries, cross-language queries, and polysemous-word queries. Specifically, the language type of the text unit to be queried is set as the first-class language, and the language type of the knowledge base is set as the second-class language; the semantic distance library of the knowledge base to which the text unit to be queried belongs is the one obtained by the text unit semantic distance pre-calculation method described above.
When the first-class language is the same as the second-class language, the query is a same-language query. In this case it is further judged whether the text unit to be queried is a polysemous word; specifically, it is judged whether the identifier corresponding to the text unit to be queried is unique, and if so, the text unit to be queried is not a polysemous word. The text unit to be queried can then be used directly as the query condition: all text unit pairs containing the text unit to be queried are searched from the semantic distance library of the knowledge base; the retrieved text unit pairs are taken as the query text unit pairs of the current text unit to be queried; the non-queried text units in all query text unit pairs are sorted in descending order of the corresponding semantic distance; and finally all or part of the sorted text units are collected into a query result list based on the query conditions (such as the number of results to display). The non-queried text unit in a query text unit pair is the other text unit of the pair, i.e., the one that is not the text unit to be queried.
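A hedged sketch of the same-language lookup (representing the semantic distance library as a dictionary keyed by pairs is an assumption of this example):

```python
def query_nearest(unit, distance_lib, top_k=None):
    """Find all pairs containing `unit`, sort the partner units by
    semantic distance in descending order, optionally truncate."""
    hits = []
    for (u, v), d in distance_lib.items():
        if u == unit:
            hits.append((v, d))
        elif v == unit:
            hits.append((u, d))
    hits.sort(key=lambda t: t[1], reverse=True)
    if top_k is not None:
        hits = hits[:top_k]
    return [partner for partner, _ in hits]

lib = {("A", "B"): 0.8, ("A", "C"): 0.3, ("B", "C"): 0.5}
print(query_nearest("A", lib))  # ['B', 'C']
```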
In the above query process, if the text unit to be queried is not a text unit in the knowledge base, the query can be completed via the equivalent of the text unit to be queried in the knowledge base. For example, suppose the text unit closest to the text unit "Beijing university" is queried, but that text unit does not exist in the knowledge base; the corresponding text unit "Peking University" in the knowledge base is found through the equivalent mapping, and the query result list closest to "Peking University" is returned.
An equivalent is a group of text units having the same meaning, whether in the same language or in different languages; for example, "bei da" and "beijing university" are equivalents, and the Chinese "beijing university" and the English "Peking University" are equivalents. The process of obtaining equivalents within a same-language knowledge base and the process of obtaining equivalents across knowledge bases in different languages are both conventional and are not described in detail here. Each meaning text unit in the knowledge base has a unique identifier.
If the identifier corresponding to the text unit to be queried is judged to be non-unique, the text unit to be queried is a polysemous word. In this case a supplementary query condition from the user is obtained as a further limiting condition on the text unit to be queried. The identifiers meeting the further limiting condition are then screened out from all identifiers corresponding to the text unit to be queried, and the screened identifiers are taken as the to-be-queried unit identifiers. The corresponding query result list is searched from the semantic distance library of the knowledge base based on the to-be-queried unit identifier, as follows: all text unit pairs containing the to-be-queried unit identifier are searched from the semantic distance library of the knowledge base; the retrieved text unit pairs are taken as the query text unit pairs of the current to-be-queried unit identifier; the text units corresponding to the non-queried identifiers in all query text unit pairs are sorted in descending order of the corresponding semantic distance; and finally all or part of the sorted text units are collected into a query result list based on the query conditions. The text unit corresponding to the non-queried identifier in a query text unit pair is the other text unit of the pair, i.e., the one not corresponding to the to-be-queried unit identifier. Whether all of the sorted text units are collected into the query result list depends on the display conditions set by the knowledge base itself. The further limiting condition can be entered as text or presented as options for the user to select.
In particular, for text units with multiple meanings, unique features can be added for distinction; the further limiting condition set in this embodiment is such a unique feature used to distinguish between ambiguities. For example, a unique feature may refer to "country", "field", "industry", etc., and the distinction may be marked in brackets or in another way. For instance, the text unit "Northwestern University" may refer to "Northwestern University (United States)" or "Northwest University (China)", where a "country" feature has been added and noted in brackets. If the "Northwestern University" located in "China" is queried, the text unit whose "country" feature is "China" is the text unit to be queried.
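The disambiguation step — screening the candidate identifiers of a polysemous word by the user's further limiting condition — might be sketched as follows (the identifier values and the feature-tagging scheme are hypothetical):

```python
def resolve_polysemy(candidates, limiting_condition):
    """Given (identifier, feature) candidates of a polysemous text unit,
    keep the identifiers whose feature matches the further limiting condition."""
    return [ident for ident, feature in candidates if feature == limiting_condition]

# Hypothetical identifiers for "Northwestern University", tagged by country.
candidates = [("id-001", "United States"), ("id-002", "China")]
print(resolve_polysemy(candidates, "China"))  # ['id-002']
```

The surviving identifier is then used in place of the raw text unit when searching the semantic distance library.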
When the first-class language differs from the second-class language, the query is a cross-language query. In this case the text unit to be queried is converted into the equivalent text unit of the second-class language by equivalent replacement or software translation. All text unit pairs containing the equivalent text unit are searched from the semantic distance library of the knowledge base; the retrieved text unit pairs are taken as the query text unit pairs of the current equivalent text unit; the non-equivalent text units in all query text unit pairs are sorted in descending order of the corresponding semantic distance; and all or part of the sorted text units are taken as the equivalent query units of the equivalent text unit based on the query conditions. The set of all equivalent query units obtained at this point is not yet the query result: all equivalent query units are converted into query results in the first-class language by equivalent replacement or software translation, and all query results are then collected into the query result list for the text unit to be queried. Whether all of the sorted text units are collected into the query result list depends on the display conditions set by the knowledge base itself.
For example, suppose the knowledge base is in English but the semantic distance between Chinese text units is to be queried, e.g., which text units are closest to the Chinese text unit "I think, therefore I am". The corresponding English equivalent "Cogito, ergo sum" is found through the equivalent mapping, and the English list of text units closest in semantic distance to this equivalent is queried from the English knowledge base. Each text unit in the English list is then converted into Chinese according to the equivalents, and the Chinese list is finally returned.
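The cross-language flow — translate in, query, translate out — can be sketched as follows (both mapping dictionaries and the shape of the distance library are assumptions of this example, not the patent's data structures):

```python
def cross_language_query(unit, to_kb_lang, to_query_lang, distance_lib):
    """Map the query unit into the knowledge-base language via the equivalent
    mapping, collect the partner units of every pair containing the equivalent,
    sort them by descending semantic distance, and map the results back."""
    equivalent = to_kb_lang[unit]
    hits = []
    for (u, v), d in distance_lib.items():
        if u == equivalent:
            hits.append((v, d))
        elif v == equivalent:
            hits.append((u, d))
    hits.sort(key=lambda t: t[1], reverse=True)
    # Fall back to the knowledge-base form when no equivalent exists.
    return [to_query_lang.get(partner, partner) for partner, _ in hits]

to_en = {"我思故我在": "Cogito, ergo sum"}  # hypothetical equivalent mapping
to_zh = {"Descartes": "笛卡尔"}
lib = {("Cogito, ergo sum", "Descartes"): 0.9}
print(cross_language_query("我思故我在", to_en, to_zh, lib))  # ['笛卡尔']
```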
Cross-language text unit queries are also applicable in the following scenario.
If the corpus quality of a certain language is poor, the quality of the large model training result cannot be guaranteed even if the amount of corpus is increased. In this case, confirmed training results from another language with good effect can be carried over to the required language through equivalent mapping. For example, suppose the corpus quality of the Chinese knowledge base A is poor and the large model training result is likewise unsatisfactory, while the training result of the English knowledge base B is good; the text units in B can be mapped into Chinese according to the equivalents, thereby obtaining the semantic distance relations among the Chinese text units. In this way, the capability of the Chinese language model can be greatly improved.
The knowledge base text unit query method provided by the embodiment of the present invention realizes the query of text units. Further, it can quickly query text data across different languages, and can also quickly query text units with multiple meanings.
As shown in fig. 4, the present embodiment provides a knowledge base text unit query device, which includes a query unit acquisition module and a query result acquisition module.
The query unit acquisition module is used for acquiring text units to be queried.
The query result acquisition module is used for searching all text unit pairs containing the text unit to be queried from the semantic distance library of the knowledge base as query text unit pairs of the text unit to be queried, and collecting the non-queried text units in all or part of the query text unit pairs into a query result list based on the corresponding semantic distances.
The non-to-be-queried text units in the query text unit pair are the other text units except the to-be-queried text units in the query text unit pair, and the semantic distance library of the knowledge base is obtained through the text unit semantic distance pre-computing method.
The knowledge base text unit query device provided by the embodiment of the present invention realizes the query of text units. Further, it can quickly query text data across different languages, and can also quickly query text units with multiple meanings.
Although the embodiments of the present invention are disclosed above, they are provided only to facilitate understanding of the present invention and are not intended to limit it. Any person skilled in the art may make modifications and variations in form and detail without departing from the spirit and scope of the present disclosure, but the scope of protection remains as defined by the appended claims.

Claims (10)

1. A text unit semantic distance pre-computation method, comprising:
acquiring all text units in a pre-calculation knowledge base, and acquiring an associated text unit set of each text unit based on an associated unit acquisition mode;
acquiring knowledge representations of all object attribute text units in the pre-calculation knowledge base based on the associated text unit set by a preset object knowledge representation acquisition mode, and acquiring knowledge representations of all category attribute text units in the pre-calculation knowledge base by a preset category knowledge representation acquisition mode;
acquiring all text unit pairs which can be formed by all the text units, calculating semantic distances of all the text unit pairs in a text unit relation determination mode based on knowledge representation of the text units, and taking all the text unit pairs with the calculated semantic distances and corresponding semantic distance sets as a semantic distance library of the pre-calculation knowledge base;
The object attribute text unit is an object in the pre-calculation knowledge base, and the category attribute text unit is a category in the pre-calculation knowledge base.
2. The pre-computing method of claim 1, wherein obtaining the set of associated text units of the text units based on the associated unit obtaining means comprises:
acquiring a description page of a conventional text unit from the pre-calculation knowledge base;
taking the text units in the description page as internal text units of the conventional text units, and collecting all types of the internal text units of the conventional text units as an associated text unit set of the conventional text units;
wherein the regular text unit is any text unit in the pre-computed knowledge base.
3. The pre-calculation method according to claim 1, wherein obtaining knowledge representation of object property text units in the pre-calculation knowledge base based on the set of associated text units by means of a preset object knowledge representation obtaining means comprises:
screening the related text unit set facing the object attribute text unit by taking the object attribute text unit as a screening unit, and collecting text units corresponding to the related text unit set meeting screening conditions as knowledge representation of the object attribute text unit;
The object attribute text unit is any one object in the pre-calculation knowledge base; the associated text unit sets facing the object attribute text unit are all associated text unit sets in the pre-calculation knowledge base except the associated text unit set corresponding to the object attribute text unit; the screening condition is that the screening unit is contained in the associated text unit set.
4. The pre-calculation method according to claim 1, wherein obtaining knowledge representations of individual category attribute text units in the pre-calculation knowledge base by means of a preset category knowledge representation obtaining means comprises:
acquiring an object attribute text unit belonging to a category attribute text unit as an object text unit, and collecting knowledge representations of all the object text units of the category attribute text unit as knowledge representations of the category attribute text unit;
wherein the category attribute text unit is any category in the pre-calculation knowledge base.
5. The pre-calculation method according to claim 1, wherein calculating the semantic distance of the text unit pairs by text unit relation determination comprises:
Setting one text unit in a text unit pair as a first text unit, and setting the other text unit as a second text unit;
judging whether the knowledge representation of the first text unit and the knowledge representation of the second text unit have an intersection, if so, indicating that the first text unit and the second text unit have a relation, calculating the semantic distance between the first text unit and the second text unit based on the knowledge representation of the first text unit and the knowledge representation of the second text unit, and otherwise, indicating that the first text unit and the second text unit have no relation.
6. The pre-calculation method according to claim 5, wherein the semantic distance between the first text unit and the second text unit is calculated by an Ochiai coefficient calculation mode or a Jaccard index calculation mode based on the knowledge representation of the first text unit and the knowledge representation of the second text unit.
7. The text unit semantic distance pre-computing device is characterized by comprising an associated text unit acquisition module, a knowledge representation acquisition module and a semantic distance library acquisition module:
the associated text unit acquisition module is used for acquiring all text units in the pre-calculation knowledge base and acquiring an associated text unit set of each text unit based on an associated unit acquisition mode;
The knowledge representation acquisition module is used for acquiring knowledge representations of all object attribute text units in the pre-calculation knowledge base based on the associated text unit set in a preset object knowledge representation acquisition mode, and acquiring knowledge representations of all category attribute text units in the pre-calculation knowledge base in a preset category knowledge representation acquisition mode;
the semantic distance library acquisition module is used for acquiring all text unit pairs which can be formed by all the text units, calculating the semantic distances of all the text unit pairs in a text unit relation determination mode based on knowledge representation of the text units, and collecting all the text unit pairs with the calculated semantic distances and the corresponding semantic distances as a semantic distance library of the pre-calculation knowledge library;
the object attribute text unit is an object in the pre-calculation knowledge base, and the category attribute text unit is a category in the pre-calculation knowledge base.
8. A knowledge base text unit query method, comprising:
acquiring a text unit to be queried;
searching all text unit pairs containing the text unit to be queried from a semantic distance library of a knowledge base as query text unit pairs of the text unit to be queried, and collecting the non-queried text units in all or part of the query text unit pairs as a query result list based on the corresponding semantic distances;
The text units which are not to be queried in the query text unit pair are the other text units except the text units to be queried in the query text unit pair, and the semantic distance library of the knowledge base is obtained by the text unit semantic distance pre-computing method according to any one of claims 1-6.
9. The query method according to claim 8, wherein searching all text unit pairs containing the text unit to be queried from the semantic distance library of the knowledge base as query text unit pairs, and collecting all or part of the non-queried text units into a query result list based on the corresponding semantic distances comprises:
setting the language type of the text unit to be queried as a first type language and the language type of a knowledge base as a second type language;
if the first type language is the same as the second type language, judging whether the corresponding identifier of the text unit to be queried is unique, if so, searching all text unit pairs with the text unit to be queried from a semantic distance library of a knowledge base to serve as query text unit pairs of the text unit to be queried, and collecting all or part of non-text units to be queried in the query text unit pairs into a query result list based on the corresponding semantic distance; otherwise, acquiring an advanced limiting condition of the text unit to be queried, determining a unit identifier to be queried from all identifiers corresponding to the text unit to be queried based on the advanced limiting condition, searching all text unit pairs with the unit identifier to be queried from a semantic distance library of a knowledge base to serve as query text unit pairs of the unit identifier to be queried, and gathering all or part of text units corresponding to non-unit identifiers in the query text unit pairs into a query result list based on the corresponding semantic distance;
If the first type language is different from the second type language, converting the text units to be queried into equivalent text units of the second type language, searching all text unit pairs with the equivalent text units from a semantic distance library of a knowledge base to serve as query text unit pairs of the equivalent text units, taking all or part of non-equivalent text units in the query text unit pairs as equivalent query units based on the corresponding semantic distance, converting all the equivalent query units into query results of the first type language, and collecting all the query results into a query result list;
wherein, each meaning text unit in the knowledge base has a unique identification.
10. The knowledge base text unit query device is characterized by comprising a query unit acquisition module and a query result acquisition module;
the query unit acquisition module is used for acquiring a text unit to be queried;
the query result acquisition module is used for searching all text unit pairs with the text units to be queried from a semantic distance library of a knowledge base to serve as query text unit pairs of the text units to be queried, and collecting all or part of non-query text units in the query text unit pairs into a query result list based on the corresponding semantic distance;
The text units which are not to be queried in the query text unit pair are the other text units except the text units to be queried in the query text unit pair, and the semantic distance library of the knowledge base is obtained by the text unit semantic distance pre-computing method according to any one of claims 1-6.
CN202311569661.1A 2023-11-23 2023-11-23 Text unit semantic distance pre-calculation method and device, and query method and device Active CN117272073B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311569661.1A CN117272073B (en) 2023-11-23 2023-11-23 Text unit semantic distance pre-calculation method and device, and query method and device


Publications (2)

Publication Number Publication Date
CN117272073A true CN117272073A (en) 2023-12-22
CN117272073B CN117272073B (en) 2024-03-08

Family

ID=89220074

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311569661.1A Active CN117272073B (en) 2023-11-23 2023-11-23 Text unit semantic distance pre-calculation method and device, and query method and device

Country Status (1)

Country Link
CN (1) CN117272073B (en)

Citations (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070106499A1 (en) * 2005-08-09 2007-05-10 Kathleen Dahlgren Natural language search system
US20070288452A1 (en) * 2006-06-12 2007-12-13 D&S Consultants, Inc. System and Method for Rapidly Searching a Database
CN101599011A (en) * 2008-06-05 2009-12-09 北京书生国际信息技术有限公司 DPS (Document Processing System) and method
US20100049684A1 (en) * 2006-10-13 2010-02-25 Edwin Adriaansen Methods and systems for knowledge discovery
US20150095331A1 (en) * 2012-12-21 2015-04-02 Cloud Computing Center Chinese Academy Of Sciences Establishing and querying methods of knowledge library engine based on emergency management
CN109643308A (en) * 2016-08-23 2019-04-16 伊路米纳有限公司 Determine the semantic distance system and method for relevant ontology data
CN112131883A (en) * 2020-09-30 2020-12-25 腾讯科技(深圳)有限公司 Language model training method and device, computer equipment and storage medium
CN112687397A (en) * 2020-12-31 2021-04-20 四川大学华西医院 Rare disease knowledge base processing method and device and readable storage medium
CN113761208A (en) * 2021-09-17 2021-12-07 福州数据技术研究院有限公司 Scientific and technological innovation information classification method and storage device based on knowledge graph
CN115248839A (en) * 2022-07-28 2022-10-28 中科极限元(杭州)智能科技股份有限公司 Knowledge system-based long text retrieval method and device
CN116578724A (en) * 2023-07-14 2023-08-11 杭州朗目达信息科技有限公司 Knowledge base knowledge structure construction method and device, storage medium and terminal
CN116701431A (en) * 2023-05-25 2023-09-05 东云睿连(武汉)计算技术有限公司 Data retrieval method and system based on large language model
CN116932694A (en) * 2023-07-19 2023-10-24 神思电子技术股份有限公司 Intelligent retrieval method, device and storage medium for knowledge base
CN117033744A (en) * 2023-08-10 2023-11-10 中国工商银行股份有限公司 Data query method and device, storage medium and electronic equipment

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
JUNG, SANGKEUN ET AL.: "Semantic vector learning for natural language understanding", 《COMPUTER SPEECH AND LANGUAGE》, vol. 56, pages 130-145, XP085638202, DOI: 10.1016/j.csl.2018.12.008 *
SILVA, F. ET AL.: "A knowledge-based retrieval model", 《PROCEEDINGS 21ST INTERNATIONAL CONFERENCE ON SOFTWARE ENGINEERING & KNOWLEDGE ENGINEERING (SEKE 2009)》, pages 558-563 *
LIU, XINGLIN ET AL.: "Research on a Framework for Building an Internet-Based Lexical Semantic Knowledge Base", 《COMPUTER AND MODERNIZATION》, no. 10, pages 8-11 *
LIU, XIANDA: "Semantic-Based Text Relevance Analysis", 《CAS INSTITUTIONAL REPOSITORIES GRID》 *
XIE, JINFENG ET AL.: "A Relation Detection Method Based on Multiple Semantic Similarities", 《JOURNAL OF NORTHWESTERN POLYTECHNICAL UNIVERSITY》, vol. 39, no. 6, pages 1387-1394 *

Also Published As

Publication number Publication date
CN117272073B (en) 2024-03-08

Similar Documents

Publication Publication Date Title
CN108804641B (en) Text similarity calculation method, device, equipment and storage medium
CN110298033B (en) Keyword corpus labeling training extraction system
WO2022116537A1 (en) News recommendation method and apparatus, and electronic device and storage medium
CN107180045B (en) Method for extracting geographic entity relation contained in internet text
CN110019732B (en) Intelligent question answering method and related device
CN111680173A (en) CMR model for uniformly retrieving cross-media information
CN109902302B (en) Topic map generation method, device and equipment suitable for text analysis or data mining and computer storage medium
CN101819578A (en) Retrieval method, method and device for establishing index and retrieval system
CN111061828B (en) Digital library knowledge retrieval method and device
CN109522396B (en) Knowledge processing method and system for national defense science and technology field
CN113190687B (en) Knowledge graph determining method and device, computer equipment and storage medium
CN108304382A (en) Mass analysis method based on manufacturing process text data digging and system
CN110263127A (en) Text search method and device is carried out based on user query word
CN116737915B (en) Semantic retrieval method, device, equipment and storage medium based on knowledge graph
CN112784591A (en) Data processing method and device, electronic equipment and storage medium
CN116340530A (en) Intelligent design method based on mechanical knowledge graph
CN113190692B (en) Self-adaptive retrieval method, system and device for knowledge graph
CN117272073B (en) Text unit semantic distance pre-calculation method and device, and query method and device
CN106776590A (en) A kind of method and system for obtaining entry translation
CN108733848B (en) Knowledge searching method and system
CN116737758A (en) Database query statement generation method, device, equipment and storage medium
CN112989811B (en) History book reading auxiliary system based on BiLSTM-CRF and control method thereof
CN115455249A (en) Double-engine driven multi-modal data retrieval method, equipment and system
CN112651244B (en) TopK entity extraction method and system based on paper abstract QA
CN114780700A (en) Intelligent question-answering method, device, equipment and medium based on machine reading understanding

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant