CN117235137B - Professional information query method and device based on vector database

Professional information query method and device based on vector database

Info

Publication number
CN117235137B
CN117235137B (application CN202311495259.3A)
Authority
CN
China
Prior art keywords
vector
target
query
vocabulary
words
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311495259.3A
Other languages
Chinese (zh)
Other versions
CN117235137A (en)
Inventor
张海东
曾辉
黄天宇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shenzhen Elanw Network Co ltd
Original Assignee
Shenzhen Elanw Network Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shenzhen Elanw Network Co ltd filed Critical Shenzhen Elanw Network Co ltd
Priority to CN202311495259.3A
Publication of CN117235137A
Application granted
Publication of CN117235137B
Legal status: Active (current)
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention is applicable to the technical field of artificial intelligence and provides a professional information query method, device and terminal equipment based on a vector database. The query method comprises the following steps: acquiring the trunk vocabulary vector and the topic label in the query sentence, and narrowing the query range through the topic label. Because different topic labels have different vector characteristics, the trunk vocabulary vector is converted on the basis of a first conversion matrix to obtain a feature vector, so that query results can be matched more accurately. To narrow the query range further, a target cluster center is matched through a clustering algorithm, and the feature vector is converted into a query vector on the basis of the second conversion matrix corresponding to the target cluster center, so as to adapt to the vector characteristics of different cluster centers. Finally, an accurate professional information query result is matched according to the distances between the query vector and the plurality of target vector data. The method not only greatly improves query efficiency but also achieves higher query accuracy.

Description

Professional information query method and device based on vector database
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a professional information query method and device based on a vector database.
Background
Vector databases are a type of database system that is dedicated to storing and processing large-scale vector data. Unlike conventional relational databases, vector databases can efficiently store and query vector data and support complex similarity search and distance computation operations.
However, for high-dimensional vector data, the query efficiency of the vector database is low. In a high-dimensional space, the computational complexity of similarity grows exponentially, resulting in longer query times. This is a technical problem that needs to be solved.
Disclosure of Invention
In view of the above, the embodiments of the present invention provide a professional information query method, apparatus, terminal device and computer-readable storage medium based on a vector database, so as to solve the technical problem that, for high-dimensional vector data, the query efficiency of a vector database is low: in a high-dimensional space the computational complexity of similarity grows exponentially, resulting in long query times.
A first aspect of an embodiment of the present invention provides a professional information query method based on a vector database, where the query method includes:
acquiring a query sentence, and extracting a trunk vocabulary vector and a theme tag in the query sentence;
According to the theme label, matching a plurality of target vector data corresponding to the theme label; wherein, a plurality of theme labels in the vector database respectively correspond to different vector data;
acquiring a first conversion matrix corresponding to the theme label, and multiplying the trunk vocabulary vector by the first conversion matrix to obtain a feature vector;
based on a clustering algorithm, calculating the distances between the feature vector and a plurality of clustering centers, and taking the clustering center corresponding to the minimum distance as a target clustering center;
obtaining a second conversion matrix corresponding to the target cluster center, and multiplying the feature vector by the second conversion matrix to obtain a query vector;
and acquiring a plurality of target vector data corresponding to the target clustering center, respectively calculating the distances between the query vector and the plurality of target vector data, and taking the original data corresponding to the target vector data corresponding to the maximum distance as a professional information query result.
Further, the step of obtaining the query sentence, and extracting the stem vocabulary vector and the topic label in the query sentence comprises the following steps:
word segmentation processing is carried out on the query sentences to obtain a plurality of first words;
Encoding the first vocabularies through an encoding model to obtain a plurality of first vector values;
constructing an initial semantic graph according to the context relation, the part of speech and the first vector value among each first vocabulary; the nodes of the semantic graph are the first vector values, and the edges of the semantic graph are correlation parameters among the words;
removing edges, in the initial semantic graph, of which the correlation parameters are lower than a threshold value, and removing isolated nodes to obtain a target semantic graph;
selecting a target path according to the correlation parameter in the target semantic graph;
based on the ordering relation of a plurality of nodes on the target path, combining the first vector values corresponding to the nodes respectively to obtain the trunk vocabulary vector;
matching the plurality of first vocabularies with a topic corpus respectively to obtain the topic labels; wherein each topic tag corresponds to a plurality of topic words.
Further, in the target semantic graph, the step of selecting a target path according to the relevance parameter includes:
and sequentially selecting an edge corresponding to the maximum correlation parameter on the current node as a path by taking a first vector value corresponding to the first word as a starting point, so as to obtain the target path formed by a plurality of nodes and a plurality of edges.
Further, the step of constructing a semantic graph according to the context, the part of speech and the first vector value between each of the first vocabularies includes:
according to the context relation, obtaining word distances between the current first vocabulary and other first vocabularies;
converting a first association value according to the word distance;
acquiring parts of speech corresponding to the current first vocabulary and the other first vocabularies respectively, and calculating a second association value according to the mapping relation between the parts of speech;
combining the current first vocabulary and the other first vocabularies into phrases, and inputting the phrases into a sentence recognition model to obtain corresponding combination probability; the combined probability is used for representing the probability that the current vocabulary and the other vocabularies form a correct phrase;
calculating the correlation parameter according to the first correlation value, the second correlation value and the combination probability;
and constructing the semantic graph by taking vector values corresponding to the vocabularies as nodes and corresponding correlation parameters between the vocabularies as edges.
Further, the step of calculating the correlation parameter according to the first correlation value, the second correlation value and the combined probability includes:
Substituting the first association value, the second association value and the combination probability into the following formula to obtain the correlation parameter;
The formula is given in the original description as an image and is not reproduced here. In the formula, x represents the first association value, P represents the combined probability, y represents the second association value, two further symbols represent constants, and n is a preset coefficient; the result is the correlation parameter.
Further, before the step of obtaining the query statement and extracting the stem vocabulary vector and the topic label in the query statement, the method further comprises:
acquiring text data to be put in storage, and performing word segmentation processing on the text data to be put in storage to obtain a plurality of second words;
encoding the plurality of second vocabularies through an encoding model to obtain a plurality of second vector values;
and carrying out feature extraction in a plurality of second vector values based on a deep learning model to obtain the vector data.
Further, before the step of obtaining the query statement and extracting the stem vocabulary vector and the topic label in the query statement, the method further comprises:
matching key words and first numbers of the key words in the second words based on a preset corpus;
acquiring a plurality of related words corresponding to the key words, and matching a second number of the related words in a plurality of second words;
Constructing a first target parameter according to the codes of the key words and the first quantity;
respectively constructing a second target parameter according to the codes of the related words and the second quantity;
calculating a first covariance matrix between the first target parameter and a plurality of second target parameters;
counting the occurrence frequency of each of the plurality of second vocabularies, and acquiring the target second vocabularies whose occurrence frequencies rank in the top K;
constructing a third target parameter according to the codes of the target second vocabulary and the third number of the target second vocabulary;
calculating a second covariance matrix among a plurality of the third target parameters;
calculating the similarity between the first covariance matrix and the second covariance matrix;
and if the similarity is greater than a threshold value, taking the preset label of the key word as the theme label.
A second aspect of an embodiment of the present invention provides a professional information query apparatus based on a vector database, including:
the first acquisition unit is used for acquiring the query statement, and extracting a trunk vocabulary vector and a theme tag in the query statement;
the matching unit is used for matching a plurality of target vector data corresponding to the theme label according to the theme label; wherein, a plurality of theme labels in the vector database respectively correspond to different vector data;
The second acquisition unit is used for acquiring a first conversion matrix corresponding to the theme label, and multiplying the trunk vocabulary vector by the first conversion matrix to obtain a feature vector;
the first calculation unit is used for calculating the distances between the feature vector and the plurality of clustering centers based on a clustering algorithm, and taking the clustering center corresponding to the minimum distance as a target clustering center;
the third acquisition unit is used for acquiring a second conversion matrix corresponding to the target clustering center, and multiplying the feature vector by the second conversion matrix to obtain a query vector;
and the second calculation unit is used for acquiring a plurality of target vector data corresponding to the target clustering center, respectively calculating the distances between the query vector and the plurality of target vector data, and taking the original data corresponding to the target vector data corresponding to the maximum distance as a professional information query result.
A third aspect of an embodiment of the present invention provides a terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, the processor implementing the steps of the method of the first aspect when executing the computer program.
A fourth aspect of the embodiments of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of the method of the first aspect.
Compared with the prior art, the embodiments of the present invention have the following beneficial effects: the invention acquires a query sentence and extracts the trunk vocabulary vector and the topic label in the query sentence; matches, according to the topic label, a plurality of target vector data corresponding to the topic label, wherein a plurality of topic labels in the vector database respectively correspond to different vector data; acquires a first conversion matrix corresponding to the topic label and multiplies the trunk vocabulary vector by the first conversion matrix to obtain a feature vector; calculates, based on a clustering algorithm, the distances between the feature vector and a plurality of cluster centers and takes the cluster center corresponding to the minimum distance as the target cluster center; acquires a second conversion matrix corresponding to the target cluster center and multiplies the feature vector by the second conversion matrix to obtain a query vector; and acquires a plurality of target vector data corresponding to the target cluster center, respectively calculates the distances between the query vector and the plurality of target vector data, and takes the original data corresponding to the target vector data at the maximum distance as the professional information query result. In this scheme, to improve query efficiency, the trunk vocabulary vector and the topic label in the query sentence are acquired separately, and the query range is first narrowed through the topic label. Because different topic labels have different vector characteristics, the trunk vocabulary vector is converted on the basis of the first conversion matrix to obtain a feature vector, so that query results can be matched more accurately. To narrow the query range further, the target cluster center is matched through a clustering algorithm, and the feature vector is converted into a query vector on the basis of the second conversion matrix corresponding to the target cluster center, so as to adapt to the vector characteristics of different cluster centers. Finally, an accurate professional information query result is matched according to the distances between the query vector and the plurality of target vector data. The scheme not only greatly improves query efficiency but also achieves higher query accuracy.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are required to be used in the embodiments or the related technical descriptions will be briefly described, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to the drawings without inventive effort for those skilled in the art.
Fig. 1 is a schematic diagram of an apparatus architecture for professional information query based on a vector database according to the present invention;
fig. 2 is a schematic diagram of a professional information query device based on a vector database according to an embodiment of the present invention;
fig. 3 shows a schematic diagram of a terminal device according to an embodiment of the present invention.
Detailed Description
In the following description, for purposes of explanation and not limitation, specific details are set forth such as the particular system architecture, techniques, etc., in order to provide a thorough understanding of the embodiments of the present invention. It will be apparent, however, to one skilled in the art that the present invention may be practiced in other embodiments that depart from these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits, and methods are omitted so as not to obscure the description of the present invention with unnecessary detail.
The embodiments of the present invention provide a professional information query method, device, terminal equipment and computer-readable storage medium based on a vector database, which are used to solve the technical problem that, for high-dimensional vector data, the query efficiency of a vector database is low: in a high-dimensional space the computational complexity of similarity grows exponentially, resulting in long query times.
First, the present invention provides a professional information query method based on a vector database. Referring to fig. 1, fig. 1 shows a schematic flow chart of the professional information query method based on a vector database. As shown in fig. 1, the professional information query method based on the vector database may include the following steps:
step 101: acquiring a query sentence, and extracting a trunk vocabulary vector and a theme tag in the query sentence;
in order to improve query efficiency, the present application needs to obtain the trunk vocabulary vector and the topic label in the query sentence. The topic label is preset data used for classifying different data, for example: professional overview, professional research report, professional prospect, development advice, and the like. The trunk vocabulary vector represents the vector corresponding to the key information in the query sentence and is obtained in the following way:
Specifically, step 101 includes steps 1011 to 1017:
step 1011: word segmentation processing is carried out on the query sentences to obtain a plurality of first words;
step 1012: encoding the first vocabularies through an encoding model to obtain a plurality of first vector values;
the encoding process of the first vocabulary through the encoding model is a conventional technology, and will not be described herein.
Step 1013: constructing an initial semantic graph according to the context relation, the part of speech and the first vector value among each first vocabulary; the nodes of the semantic graph are the first vector values, and the edges of the semantic graph are correlation parameters among the words;
in order to fully mine the semantic features in the query sentence, the present application constructs an initial semantic graph based on the contextual relations, the parts of speech and the first vector values among the first vocabularies, and extracts semantic features from the initial semantic graph.
Specifically, step 1013 includes steps A1 to A6:
step A1: according to the context relation, obtaining word distances between the current first vocabulary and other first vocabularies;
The word distance refers to the number of words separating the current first vocabulary from another first vocabulary.
Step A2: converting the word distance into a first association value;
The word distance is multiplied by a conversion parameter to obtain the first association value. The conversion parameter is a preset parameter value used for converting the word distance into the first association value.
Step A3: acquiring parts of speech corresponding to the current first vocabulary and the other first vocabularies respectively, and calculating a second association value according to the mapping relation between the parts of speech;
parts of speech include, but are not limited to, nouns, verbs, adjectives, adverbs, and the like. Different mapping relations exist among different parts of speech, and different mapping relations correspond to different second association values. The second association value is used to characterize the probability that the two parts of speech combine into a correct semantic unit.
Step A4: combining the current first vocabulary and the other first vocabularies into phrases, and inputting the phrases into a sentence recognition model to obtain corresponding combination probability; the combined probability is used for representing the probability that the current vocabulary and the other vocabularies form a correct phrase;
step A5: calculating the correlation parameter according to the first correlation value, the second correlation value and the combination probability;
Specifically, step A5 includes: substituting the first association value, the second association value and the combination probability into a formula to obtain the correlation parameter.
The formula is given in the original description as an image and is not reproduced here. In the formula, x represents the first association value, P represents the combined probability, y represents the second association value, two further symbols represent constants, and n is a preset coefficient; the result is the correlation parameter.
It should be noted that the two constants are used to adjust the smoothness and the convergence speed of the result.
The present application comprehensively considers the influence of multiple factors: because the first association value, the second association value and the combination probability each have a certain influence on the correlation parameter, the correlation parameter is calculated on the basis of all three. The above formula is based on a large amount of experimental data and verification, but the method is not limited to that particular mathematical expression.
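For illustration only, the following Python sketch shows one possible way to compute the three factors and combine them; the conversion parameter, the part-of-speech mapping values and the combination function used here are assumptions made for the sketch and do not reproduce the formula of the patent, which is given only as an image.

```python
def first_association(word_distance, conversion_param=0.5):
    """Step A2 (illustrative): multiply the word distance by a preset conversion parameter."""
    return word_distance * conversion_param

# Step A3 (illustrative): mapping between part-of-speech pairs and second association values.
POS_PAIR_MAP = {("noun", "verb"): 0.8, ("adjective", "noun"): 0.9, ("noun", "noun"): 0.6}

def second_association(pos_a, pos_b, default=0.3):
    return POS_PAIR_MAP.get((pos_a, pos_b), POS_PAIR_MAP.get((pos_b, pos_a), default))

def correlation_parameter(x, y, p, a=1.0, b=1.0, n=1):
    """Assumed combination of the three factors; the patent's actual formula is
    given only as an image and is NOT reproduced by this expression."""
    return (p ** n) * (a * y + b / (x + 1.0))  # closer words (small x) score higher

sigma = correlation_parameter(x=first_association(2), y=second_association("noun", "verb"), p=0.7)
```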
Step A6: and constructing the semantic graph by taking vector values corresponding to the vocabularies as nodes and corresponding correlation parameters between the vocabularies as edges.
In this embodiment, a technology for generating a semantic graph is implemented, where word distances between a current first word and other first words are extracted by analyzing context in sentences. And calculating according to the obtained word spacing to obtain a first association value used for representing the association degree between the current first vocabulary and other first vocabularies. And acquiring the parts of speech of the current first vocabulary and other first vocabularies, and calculating a second association value by using the mapping relation between the parts of speech, wherein the second association value is used for representing the contribution degree of the parts of speech to semantic association. The current first vocabulary and other first vocabularies are combined into phrases, and the phrases are used as input to be transferred to a sentence recognition model, so that corresponding combination probabilities are obtained. The combined probabilities are used to characterize the probability that the current vocabulary and other vocabularies constitute the correct phrase, and are represented by the output of the sentence recognition model. And calculating a relevance parameter according to the first relevance value, the second relevance value and the combined probability, and quantifying the semantic relevance degree among the vocabularies. And constructing a semantic graph by taking vector values corresponding to the words as nodes and correlation parameters corresponding to the words as edges. In this way, semantic association information between words can be represented and processed in the form of a graph. In summary, the technical scheme realizes the process of generating the semantic graph through analysis and calculation of the context relation, the part of speech and the sentence recognition model, so that the semantic association between the vocabularies can be better represented and used in the form of a graph structure.
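As an illustration of this graph-generation process, the following minimal Python sketch assembles a semantic graph from nodes and weighted edges and then prunes it into a target semantic graph (the pruning corresponds to step 1014 below); the dictionary-based representation and the threshold value are assumptions.

```python
def build_target_semantic_graph(nodes, weighted_edges, threshold=0.5):
    """nodes: {word: first_vector_value}; weighted_edges: {(word_a, word_b): correlation}.
    Keep only edges whose correlation parameter reaches the threshold and drop
    nodes left without any edge (isolated nodes)."""
    kept_edges = {pair: c for pair, c in weighted_edges.items() if c >= threshold}
    connected = {w for pair in kept_edges for w in pair}
    kept_nodes = {w: v for w, v in nodes.items() if w in connected}
    return kept_nodes, kept_edges
```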
Step 1014: removing edges, in the initial semantic graph, of which the correlation parameters are lower than a threshold value, and removing isolated nodes to obtain a target semantic graph;
in order to extract the key information in the query sentence, the present application removes, from the initial semantic graph, the edges whose correlation parameters are below a threshold. When a node becomes isolated (i.e., no edge remains attached to it), the node represents non-key information and can therefore be removed, yielding the target semantic graph. The target semantic graph is used to represent the association relations between pieces of key information.
Step 1015: selecting a target path according to the correlation parameter in the target semantic graph;
specifically, step 1015 specifically includes: and sequentially selecting an edge corresponding to the maximum correlation parameter on the current node as a path by taking a first vector value corresponding to the first word as a starting point, so as to obtain the target path formed by a plurality of nodes and a plurality of edges.
Illustratively, starting from the starting point, the correlation parameters corresponding to all edges of the starting point are traversed, and the edge with the maximum correlation parameter is selected as the path to reach the next node. The correlation parameters corresponding to all edges of that next node are then traversed, the edge with the maximum correlation parameter is again selected as the path, and so on, until the target path is obtained.
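A minimal Python sketch of this greedy path selection is given below; the stopping rule (stop when no unvisited neighbour remains) is an assumption.

```python
def select_target_path(edges, start_word):
    """Greedy walk: at each node, follow the edge with the largest correlation
    parameter toward a not-yet-visited node; stop when none remains."""
    path, visited = [start_word], {start_word}
    current = start_word
    while True:
        candidates = {}
        for (a, b), c in edges.items():
            if a == current and b not in visited:
                candidates[b] = max(c, candidates.get(b, c))
            elif b == current and a not in visited:
                candidates[a] = max(c, candidates.get(a, c))
        if not candidates:
            return path
        current = max(candidates, key=candidates.get)
        visited.add(current)
        path.append(current)
```

The trunk vocabulary vector of step 1016 can then be obtained by combining (for example, concatenating) the first vector values of the nodes along the returned path.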
Step 1016: based on the ordering relation of a plurality of nodes on the target path, combining the first vector values corresponding to the nodes respectively to obtain the trunk vocabulary vector;
the correlation parameter characterizes the semantic relation between two nodes (vocabulary), so that the first vector values corresponding to a plurality of nodes on the target path can be combined to obtain a main vocabulary vector.
Step 1017: matching the plurality of first vocabularies with a topic corpus respectively to obtain the topic labels; wherein each topic tag corresponds to a plurality of topic words.
Theme vocabularies corresponding to different theme labels are prestored in a theme corpus. The plurality of first words may be matched based on the topic corpus to obtain topic tags for the query statement.
If the plurality of first vocabularies contain topic words belonging to more than one topic label, the number of matched topic words is counted for each label, and the label with the maximum number of matched topic words is used as the topic label.
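For illustration, the following Python sketch counts topic-word hits per label and returns the label with the most hits; the contents of the topic corpus shown here are assumptions.

```python
from collections import Counter

# Hypothetical topic corpus (an assumption): each topic label maps to its topic words.
TOPIC_CORPUS = {
    "professional prospect": {"prospect", "trend", "salary"},
    "development advice": {"advice", "plan", "skill"},
}

def match_topic_label(first_words):
    """Count topic-word hits per label; return the label with the most hits."""
    hits = Counter({label: sum(w in words for w in first_words)
                    for label, words in TOPIC_CORPUS.items()})
    label, count = hits.most_common(1)[0]
    return label if count > 0 else None
```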
In this embodiment, the query sentence is subjected to word segmentation processing, and is divided into a plurality of words. And coding each first vocabulary by using the coding model to obtain a corresponding first vector value. And constructing an initial semantic graph according to the context relation, the part of speech and the first vector value among the first vocabularies. The nodes of the semantic graph are first vector values and the edges represent correlation parameters between words. And removing edges with correlation parameters lower than the threshold value in the initial semantic graph from the graph according to the set threshold value, and deleting isolated nodes, so that the target semantic graph is obtained. In the target semantic graph, target paths are selected according to the relevance parameters, and the paths represent node sequences with higher relevance. Based on the ordering relation of the nodes on the target path, the corresponding first vector values are combined, so that a trunk vocabulary vector is generated. Through the steps, the main vocabulary vectors with good semantic expression capability can be generated, and the vectors can capture semantic relations among vocabularies, so that a foundation is provided for subsequent text analysis and text processing tasks.
Step 102: according to the theme label, matching a plurality of target vector data corresponding to the theme label; wherein, a plurality of theme labels in the vector database respectively correspond to different vector data;
in order to improve query efficiency, the data of the vector database are classified according to preset categories, and each category corresponds to a topic label.
Step 103: acquiring a first conversion matrix corresponding to the theme label, and multiplying the trunk vocabulary vector by the first conversion matrix to obtain a feature vector;
since there is a certain difference in vector data for each type of data, for example: the major points and expressions of texts such as professional summaries, professional research reports, professional prospects, development suggestions and the like have certain differences, so that corresponding vector data also have differences. Therefore, the method extracts the characteristics of the vector data and obtains the first conversion matrix. And then, the trunk vocabulary vectors are converted into feature vectors based on the first conversion matrix, so that the query precision is improved.
Step 104: based on a clustering algorithm, calculating the distances between the feature vector and a plurality of clustering centers, and taking the clustering center corresponding to the minimum distance as a target clustering center;
In order to further divide the data in the vector database, the vector data corresponding to each topic label is clustered, so that the query precision is further improved.
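The description does not fix a particular clustering algorithm; as a stand-in, the following Python sketch uses a minimal k-means style procedure to obtain the cluster centers for the vector data of one topic label.

```python
import numpy as np

def kmeans_centers(vectors, k, iterations=20, seed=0):
    """Minimal k-means style clustering (illustrative): returns k cluster centers
    for the stored vector data (shape (n, d)) of one topic label."""
    rng = np.random.default_rng(seed)
    centers = vectors[rng.choice(len(vectors), size=k, replace=False)]
    for _ in range(iterations):
        # assign every vector to its nearest center
        labels = np.argmin(np.linalg.norm(vectors[:, None, :] - centers[None, :, :], axis=2), axis=1)
        # move each center to the mean of its assigned vectors (keep the old center if empty)
        centers = np.stack([vectors[labels == i].mean(axis=0) if np.any(labels == i) else centers[i]
                            for i in range(k)])
    return centers
```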
Step 105: obtaining a second conversion matrix corresponding to the target cluster center, and multiplying the feature vector by the second conversion matrix to obtain a query vector;
there is some variance in the vector data for each cluster center. Therefore, the method extracts the characteristics of the vector data of the clustering center and obtains the second transformation matrix. And then the feature vector is converted into the query vector based on the second conversion matrix so as to improve the query precision.
Step 106: and acquiring a plurality of target vector data corresponding to the target clustering center, respectively calculating the distances between the query vector and the plurality of target vector data, and taking the original data corresponding to the target vector data corresponding to the maximum distance as a professional information query result.
In this embodiment, a query sentence is acquired, and the trunk vocabulary vector and the topic label in the query sentence are extracted; a plurality of target vector data corresponding to the topic label are matched according to the topic label, wherein a plurality of topic labels in the vector database respectively correspond to different vector data; a first conversion matrix corresponding to the topic label is acquired, and the trunk vocabulary vector is multiplied by the first conversion matrix to obtain a feature vector; based on a clustering algorithm, the distances between the feature vector and a plurality of cluster centers are calculated, and the cluster center corresponding to the minimum distance is taken as the target cluster center; a second conversion matrix corresponding to the target cluster center is acquired, and the feature vector is multiplied by the second conversion matrix to obtain a query vector; a plurality of target vector data corresponding to the target cluster center are acquired, the distances between the query vector and the plurality of target vector data are respectively calculated, and the original data corresponding to the target vector data at the maximum distance is taken as the professional information query result. In this scheme, to improve query efficiency, the trunk vocabulary vector and the topic label in the query sentence are acquired separately, and the query range is first narrowed through the topic label. Because different topic labels have different vector characteristics, the trunk vocabulary vector is converted on the basis of the first conversion matrix to obtain a feature vector, so that query results can be matched more accurately. To narrow the query range further, the target cluster center is matched through a clustering algorithm, and the feature vector is converted into a query vector on the basis of the second conversion matrix corresponding to the target cluster center, so as to adapt to the vector characteristics of different cluster centers. Finally, an accurate professional information query result is matched according to the distances between the query vector and the plurality of target vector data. The scheme not only greatly improves query efficiency but also achieves higher query accuracy.
Optionally, vector conversion is further required for the text data to be put into storage before step 101, specifically including steps B1 to B3:
step B1: acquiring text data to be put in storage, and performing word segmentation processing on the text data to be put in storage to obtain a plurality of second words;
step B2: encoding the plurality of second vocabularies through an encoding model to obtain a plurality of second vector values;
step B3: and carrying out feature extraction in a plurality of second vector values based on a deep learning model to obtain the vector data.
In this embodiment, the word segmentation process can effectively break down the text into vocabulary units of suitable size, thereby capturing semantic information more accurately. The coding model processes the second vocabulary and maps it to a corresponding vector representation so that each vocabulary can be expressed numerically. By means of a deep learning model we can extract higher level features from these second vector values to further enhance the expressive power and discrimination of the vector data. In summary, based on this technical scheme, after text data to be put in storage is obtained, we can successfully convert the text data into vector data through steps of word segmentation, encoding processing, feature extraction and the like. This technique can effectively help us convert text data into an operable vector form, providing support for subsequent text mining, semantic analysis, and other related tasks. Through the technical scheme, text data can be better converted into a vector form. Vector data may be more conveniently used in subsequent machine learning and deep learning tasks. The vector representation has the characteristics of compactness and high efficiency, and can better express the meaning and the context information of the text.
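A minimal Python sketch of the ingestion side (steps B1 to B3) follows; the word segmenter, encoder and feature-extraction model used here are placeholders, since the description only requires "an encoding model" and "a deep learning model".

```python
import numpy as np

def ingest_text(text, segment, encode, extract_features):
    """segment: text -> list of second vocabularies; encode: word -> second vector value;
    extract_features: array of second vector values -> vector data for storage."""
    second_words = segment(text)                                   # step B1
    second_vectors = np.stack([encode(w) for w in second_words])   # step B2
    return extract_features(second_vectors)                        # step B3

# Illustrative placeholders (assumptions):
vector_data = ingest_text(
    "career prospect report",
    segment=str.split,
    encode=lambda w: np.full(4, float(len(w))),
    extract_features=lambda m: m.mean(axis=0),   # stand-in for the deep learning model
)
```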
Optionally, the process of calculating the topic label before step 101 specifically includes steps C1 to C10:
step C1: matching key words and first numbers of the key words in the second words based on a preset corpus;
A plurality of key words are prestored in the preset corpus, and different topic labels correspond to different key words. Based on the preset corpus, whether key words exist among the second vocabularies, and the number of such key words, are counted.
Step C2: acquiring a plurality of related words corresponding to the key words, and matching a second number of the related words in a plurality of second words;
related vocabulary refers to other vocabulary related to the key vocabulary.
Step C3: constructing a first target parameter according to the codes of the key words and the first quantity;
and combining the codes of the key words and the first quantity based on a preset strategy to obtain a first target parameter.
Step C4: respectively constructing a second target parameter according to the codes of the related words and the second quantity;
and combining the codes of the related words and the second quantity based on a preset strategy to obtain a second target parameter.
Step C5: calculating a first covariance matrix between the first target parameter and a plurality of second target parameters;
Step C6: counting the occurrence frequency of each vocabulary in the plurality of second vocabularies, and acquiring target second vocabularies with the occurrence frequencies arranged in the first K;
step C7: constructing a third target parameter according to the codes of the target second vocabulary and the third number of the target second vocabulary;
step C8: calculating a second covariance matrix among a plurality of the third target parameters;
step C9: calculating the similarity between the first covariance matrix and the second covariance matrix;
in order to verify whether the key words and the related words are semantic keywords of the text data, a first covariance matrix is constructed based on the key words and the related words, and a second covariance matrix is constructed based on the top-K target second vocabularies. Whether the key vocabulary is a semantic keyword is then determined according to the similarity between the first covariance matrix and the second covariance matrix. If the key vocabulary is a semantic keyword, the preset label of the key vocabulary is used as the topic label.
Step C10: and if the similarity is greater than a threshold value, taking the preset label of the key word as the theme label.
In this embodiment, the technical scheme operates on a preset corpus. First, key words, and a first number of the key words, are matched among the plurality of second vocabularies. Then, a plurality of related words corresponding to the key words are obtained, and a second number of the related words is matched among the plurality of second vocabularies. A first target parameter is constructed from the codes of the key words and the first number, and second target parameters are constructed from the codes of the related words and the second number. A first covariance matrix is calculated between the first target parameter and the plurality of second target parameters. Then, the occurrence frequency of each of the plurality of second vocabularies is counted, and the target second vocabularies whose occurrence frequencies rank in the top K are obtained. A third target parameter is constructed from the codes of the target second vocabularies and the third number. Next, a second covariance matrix between the plurality of third target parameters is calculated. Finally, the similarity between the first covariance matrix and the second covariance matrix is calculated; if the similarity is greater than the threshold, the preset label of the key words is used as the topic label. Through this scheme, topic labels can be generated from the preset corpus and screened according to the similarity, thereby improving the accuracy of the topic labels.
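For illustration, the following Python sketch shows one possible realisation of the covariance-based check; how the target parameters are built from the codes and counts, the choice of cosine similarity between covariance matrices, and the threshold value are all assumptions, since the description leaves these details open.

```python
import numpy as np

def build_param(code_vector, count):
    """Assumed construction (steps C3/C4/C7): append the word count to its code vector."""
    return np.append(np.asarray(code_vector, dtype=float), float(count))

def covariance_similarity(params_a, params_b):
    """Steps C5/C8/C9: cosine similarity of the flattened covariance matrices (assumption)."""
    cov_a = np.cov(np.stack(params_a), rowvar=False).ravel()
    cov_b = np.cov(np.stack(params_b), rowvar=False).ravel()
    denom = np.linalg.norm(cov_a) * np.linalg.norm(cov_b) + 1e-12
    return float(cov_a @ cov_b / denom)

def decide_topic_label(key_param, related_params, top_k_params, preset_label, threshold=0.8):
    similarity = covariance_similarity([key_param] + related_params, top_k_params)
    return preset_label if similarity > threshold else None       # step C10
```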
Referring to fig. 2, fig. 2 shows a schematic diagram of a professional information query device based on a vector database according to the present invention. The professional information query device based on a vector database shown in fig. 2 includes:
a first obtaining unit 21, configured to obtain a query sentence, and extract a stem vocabulary vector and a topic label in the query sentence;
a matching unit 22, configured to match, according to the theme label, a plurality of target vector data corresponding to the theme label; wherein, a plurality of theme labels in the vector database respectively correspond to different vector data;
a second obtaining unit 23, configured to obtain a first transformation matrix corresponding to the theme label, and multiply the stem vocabulary vector with the first transformation matrix to obtain a feature vector;
a first calculating unit 24, configured to calculate distances between the feature vector and a plurality of cluster centers based on a clustering algorithm, and take a cluster center corresponding to the minimum distance as a target cluster center;
a third obtaining unit 25, configured to obtain a second transformation matrix corresponding to the target cluster center, and multiply the feature vector with the second transformation matrix to obtain a query vector;
The second calculating unit 26 is configured to obtain a plurality of target vector data corresponding to the target cluster center, calculate distances between the query vector and the plurality of target vector data, and use raw data corresponding to the target vector data corresponding to the maximum distance as a professional information query result.
According to the professional information query device based on the vector database, a query sentence is acquired, and the trunk vocabulary vector and the topic label in the query sentence are extracted; a plurality of target vector data corresponding to the topic label are matched according to the topic label, wherein a plurality of topic labels in the vector database respectively correspond to different vector data; a first conversion matrix corresponding to the topic label is acquired, and the trunk vocabulary vector is multiplied by the first conversion matrix to obtain a feature vector; based on a clustering algorithm, the distances between the feature vector and a plurality of cluster centers are calculated, and the cluster center corresponding to the minimum distance is taken as the target cluster center; a second conversion matrix corresponding to the target cluster center is acquired, and the feature vector is multiplied by the second conversion matrix to obtain a query vector; a plurality of target vector data corresponding to the target cluster center are acquired, the distances between the query vector and the plurality of target vector data are respectively calculated, and the original data corresponding to the target vector data at the maximum distance is taken as the professional information query result. In this scheme, to improve query efficiency, the trunk vocabulary vector and the topic label in the query sentence are acquired separately, and the query range is first narrowed through the topic label. Because different topic labels have different vector characteristics, the trunk vocabulary vector is converted on the basis of the first conversion matrix to obtain a feature vector, so that query results can be matched more accurately. To narrow the query range further, the target cluster center is matched through a clustering algorithm, and the feature vector is converted into a query vector on the basis of the second conversion matrix corresponding to the target cluster center, so as to adapt to the vector characteristics of different cluster centers. Finally, an accurate professional information query result is matched according to the distances between the query vector and the plurality of target vector data. The device not only greatly improves query efficiency but also achieves higher query accuracy.
Fig. 3 is a schematic diagram of a terminal device according to an embodiment of the present invention. As shown in fig. 3, the terminal device 3 of this embodiment includes: a processor 30, a memory 31 and a computer program 32 stored in the memory 31 and executable on the processor 30, for example a program for professional information query based on a vector database. The processor 30, when executing the computer program 32, implements the steps of each of the above-described embodiments of the vector database-based professional information query method, such as steps 101 to 106 shown in fig. 1. Alternatively, the processor 30, when executing the computer program 32, implements the functions of the units in the above-described device embodiments, such as the functions of the units 21 to 26 shown in fig. 2.
By way of example, the computer program 32 may be divided into one or more units, which are stored in the memory 31 and executed by the processor 30 to complete the present invention. The one or more units may be a series of computer program instruction segments capable of performing a specific function describing the execution of the computer program 32 in the one terminal device 3. For example, the computer program 32 may be partitioned into units having the following specific functions:
The first acquisition unit is used for acquiring the query statement, and extracting a trunk vocabulary vector and a theme tag in the query statement;
the matching unit is used for matching a plurality of target vector data corresponding to the theme label according to the theme label; wherein, a plurality of theme labels in the vector database respectively correspond to different vector data;
the second acquisition unit is used for acquiring a first conversion matrix corresponding to the theme label, and multiplying the trunk vocabulary vector by the first conversion matrix to obtain a feature vector;
the first calculation unit is used for calculating the distances between the feature vector and the plurality of clustering centers based on a clustering algorithm, and taking the clustering center corresponding to the minimum distance as a target clustering center;
the third acquisition unit is used for acquiring a second conversion matrix corresponding to the target clustering center, and multiplying the feature vector by the second conversion matrix to obtain a query vector;
and the second calculation unit is used for acquiring a plurality of target vector data corresponding to the target clustering center, respectively calculating the distances between the query vector and the plurality of target vector data, and taking the original data corresponding to the target vector data corresponding to the maximum distance as a professional information query result.
The terminal device 3 may include, but is not limited to, a processor 30 and a memory 31. It will be appreciated by those skilled in the art that fig. 3 is merely an example of the terminal device 3 and does not constitute a limitation on the terminal device 3, which may include more or fewer components than shown, combine certain components, or use different components; for example, the terminal device may also include input and output devices, network access devices, buses, and the like.
The processor 30 may be a central processing unit (Central Processing Unit, CPU), but may also be other general purpose processors, digital signal processors (Digital Signal Processor, DSP), application specific integrated circuits (Application Specific Integrated Circuit, ASIC), off-the-shelf programmable gate arrays (Field-Programmable Gate Array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, or the like. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like.
The memory 31 may be an internal storage unit of the terminal device 3, such as a hard disk or a memory of the terminal device 3. The memory 31 may also be an external storage device of the terminal device 3, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card or a Flash memory card (Flash Card) provided on the terminal device 3. Further, the memory 31 may also include both an internal storage unit and an external storage device of the terminal device 3. The memory 31 is used for storing the computer program as well as other programs and data required by the terminal device 3. The memory 31 may also be used for temporarily storing data that has been output or is to be output.
It should be understood that the sequence numbers of the steps in the foregoing embodiments do not imply an order of execution; the execution order of each process should be determined by its function and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present invention.
It should be noted that, because the content of information interaction and execution process between the above devices/units is based on the same concept as the method embodiment of the present invention, specific functions and technical effects thereof may be referred to in the method embodiment section, and will not be described herein.
It will be apparent to those skilled in the art that, for convenience and brevity of description, only the above-described division of the functional units and modules is illustrated, and in practical application, the above-described functional distribution may be performed by different functional units and modules according to needs, that is, the internal structure of the apparatus is divided into different functional units or modules, so as to perform all or part of the functions described above. The functional units and modules in the embodiment may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit, where the integrated units may be implemented in a form of hardware or a form of a software functional unit. In addition, the specific names of the functional units and modules are only for distinguishing from each other, and are not used for limiting the protection scope of the present invention. The specific working process of the units and modules in the above system may refer to the corresponding process in the foregoing method embodiment, which is not described herein again.
Embodiments of the present invention also provide a computer readable storage medium storing a computer program which, when executed by a processor, implements steps for implementing the various method embodiments described above.
Embodiments of the present invention provide a computer program product which, when run on a mobile terminal, causes the mobile terminal to perform steps that enable the implementation of the method embodiments described above.
The integrated units, if implemented in the form of software functional units and sold or used as stand-alone products, may be stored in a computer readable storage medium. Based on such understanding, the present invention may implement all or part of the flow of the method of the above embodiments, and may be implemented by a computer program to instruct related hardware, where the computer program may be stored in a computer readable storage medium, and when the computer program is executed by a processor, the computer program may implement the steps of each of the method embodiments described above. Wherein the computer program comprises computer program code which may be in source code form, object code form, executable file or some intermediate form etc. The computer readable medium may include at least: any entity or device capable of carrying computer program code to a photographing device/terminal apparatus, recording medium, computer Memory, read-Only Memory (ROM), random access Memory (Random Access Memory, RAM), electrical carrier signals, telecommunications signals, and software distribution media. Such as a U-disk, removable hard disk, magnetic or optical disk, etc. In some jurisdictions, computer readable media may not be electrical carrier signals and telecommunications signals in accordance with legislation and patent practice.
In the foregoing embodiments, each embodiment is described with its own emphasis; for parts that are not detailed or described in a particular embodiment, reference may be made to the related descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.
In the embodiments provided in the present invention, it should be understood that the disclosed apparatus/network device and method may be implemented in other manners. For example, the apparatus/network device embodiments described above are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions in actual implementation, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed may be an indirect coupling or communication connection via interfaces, devices or units, which may be in electrical, mechanical or other forms.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed over a plurality of network units.
It should be understood that the terms "comprises" and/or "comprising," when used in this specification and the appended claims, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof.
It should also be understood that the term "and/or" as used in the present specification and the appended claims refers to any and all possible combinations of one or more of the associated listed items, and includes such combinations.
As used in the present description and the appended claims, the term "if" may be interpreted as "when..once" or "in response to a determination" or "in response to a detection" depending on the context. Similarly, the phrase "if a determination" or "if a [ described condition or event ] is monitored" may be interpreted in the context of meaning "upon determination" or "in response to determination" or "upon monitoring a [ described condition or event ]" or "in response to monitoring a [ described condition or event ]".
Furthermore, the terms "first," "second," "third," and the like in the description of the present specification and in the appended claims, are used for distinguishing between descriptions and not necessarily for indicating or implying a relative importance.
Reference in the specification to "one embodiment" or "some embodiments" or the like means that a particular feature, structure, or characteristic described in connection with the embodiment is included in one or more embodiments of the invention. Thus, appearances of the phrases "in one embodiment," "in some embodiments," "in other embodiments," and the like in the specification are not necessarily all referring to the same embodiment, but mean "one or more but not all embodiments" unless expressly specified otherwise. The terms "comprising," "including," "having," and variations thereof mean "including but not limited to," unless expressly specified otherwise.
The above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and are intended to be included in the scope of the present invention.

Claims (7)

1. A professional information query method based on a vector database, the query method comprising:
performing word segmentation processing on a query sentence to obtain a plurality of first words;
encoding the first vocabularies through an encoding model to obtain a plurality of first vector values;
acquiring word distances between the current first vocabulary and other first vocabularies according to the context;
converting the word distance into a first association value;
acquiring parts of speech corresponding to the current first vocabulary and the other first vocabularies respectively, and calculating a second association value according to the mapping relation between the parts of speech;
combining the current first vocabulary and the other first vocabularies into phrases, and inputting the phrases into a sentence recognition model to obtain corresponding combination probability; the combined probability is used for representing the probability that the current first vocabulary and the other first vocabularies form correct phrases;
substituting the first association value, the second association value and the combination probability into the following formula to obtain a correlation parameter;
the formula is: [the formula appears only as an image in the original publication and is not reproduced here]
wherein one symbol represents the correlation parameter, x represents the first association value, P represents the combined probability, y represents the second association value, two further symbols represent constants, and n is a preset coefficient;
constructing an initial semantic graph by taking vector values corresponding to the words as nodes and corresponding correlation parameters between the words as edges; the nodes of the initial semantic graph are the first vector values, and the edges of the initial semantic graph are the correlation parameters among the first vocabularies;
removing edges, in the initial semantic graph, of which the correlation parameters are lower than a threshold value, and removing isolated nodes to obtain a target semantic graph;
selecting a target path according to the correlation parameter in the target semantic graph;
based on the ordering relation of a plurality of nodes on the target path, combining the first vector values corresponding to the nodes to obtain a trunk vocabulary vector;
matching the first vocabularies with a topic corpus respectively to obtain topic labels; each topic label corresponds to a plurality of topic words;
according to the topic label, matching a plurality of target vector data corresponding to the topic label; wherein a plurality of topic labels in the vector database respectively correspond to different vector data;
acquiring a first conversion matrix corresponding to the topic label, and multiplying the trunk vocabulary vector by the first conversion matrix to obtain a feature vector;
Based on a clustering algorithm, calculating the distances between the feature vector and a plurality of clustering centers, and taking the clustering center corresponding to the minimum distance as a target clustering center;
obtaining a second conversion matrix corresponding to the target cluster center, and multiplying the feature vector by the second conversion matrix to obtain a query vector;
and acquiring a plurality of sub-target vector data corresponding to the target clustering center, respectively calculating the distances between the query vector and the plurality of sub-target vector data, and taking the original data corresponding to the sub-target vector data corresponding to the maximum distance as a professional information query result.
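As an informal illustration of the retrieval stages of claim 1 (from the trunk vocabulary vector onward), the Python sketch below shows one way the first conversion matrix, the target cluster centre, the second conversion matrix and the final distance matching could fit together. Every name in it (retrieve, first_matrices, cluster_buckets and so on), the Euclidean metric and the direction of the matrix-vector products are assumptions of this sketch rather than details of the patent; the selection of the maximum-distance sub-target vector simply follows the wording of the claim.

# Sketch of the retrieval stages of claim 1, after the trunk vocabulary vector
# has been obtained. Data layout and the Euclidean metric are assumptions.
from typing import Dict, List, Tuple
import numpy as np

def retrieve(trunk_vec: np.ndarray,
             topic_label: str,
             first_matrices: Dict[str, np.ndarray],    # per-topic first conversion matrices
             second_matrices: List[np.ndarray],        # per-cluster second conversion matrices
             cluster_centers: List[np.ndarray],
             cluster_buckets: List[List[Tuple[np.ndarray, object]]]) -> object:
    """cluster_buckets[k] holds (sub-target vector, original data) pairs for cluster k."""
    # First conversion: adapt the trunk vocabulary vector to the topic's vector space.
    feature_vec = first_matrices[topic_label] @ trunk_vec

    # Coarse step: the nearest cluster centre becomes the target cluster centre.
    center_dists = [np.linalg.norm(feature_vec - c) for c in cluster_centers]
    target = int(np.argmin(center_dists))

    # Second conversion: adapt the feature vector to the target cluster's space.
    query_vec = second_matrices[target] @ feature_vec

    # Distance matching against the cluster's sub-target vector data; claim 1
    # takes the original data of the sub-target vector at the *maximum* distance.
    scored = [(np.linalg.norm(query_vec - vec), original)
              for vec, original in cluster_buckets[target]]
    _, result = max(scored, key=lambda pair: pair[0])
    return result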
2. The vector database-based professional information query method according to claim 1, wherein the step of selecting a target path according to the relevance parameter in the target semantic graph includes:
and, taking the first vector value corresponding to the first vocabulary as a starting point, sequentially selecting at each current node the edge corresponding to the maximum correlation parameter as part of the path, so as to obtain the target path formed by a plurality of nodes and a plurality of edges.
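Claim 2's greedy path selection, together with the trunk-vector combination step of claim 1, might be rendered as below. The adjacency-dict graph layout, the skipping of already-visited nodes and the use of concatenation to "combine" the node vectors are assumptions of this illustration, not requirements of the claims.

# Greedy target-path selection (claim 2) and trunk-vector assembly (claim 1).
from typing import Dict, List
import numpy as np

def select_target_path(edges: Dict[int, Dict[int, float]], start: int) -> List[int]:
    """edges[i][j] is the correlation parameter of edge i -> j in the target semantic
    graph. From the current node, always follow the unvisited edge with the largest
    correlation parameter; stop when no such edge remains."""
    path, current, visited = [start], start, {start}
    while True:
        candidates = {j: w for j, w in edges.get(current, {}).items() if j not in visited}
        if not candidates:
            return path
        current = max(candidates, key=candidates.get)
        visited.add(current)
        path.append(current)

def trunk_vocabulary_vector(path: List[int], first_vectors: Dict[int, np.ndarray]) -> np.ndarray:
    """Combine the first vector values of the nodes in path order; concatenation is
    one possible reading of "combining" in claim 1."""
    return np.concatenate([first_vectors[i] for i in path])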
3. The professional information query method based on a vector database according to claim 1, further comprising, before the step of performing word segmentation processing on the query sentence to obtain a plurality of first words:
Acquiring text data to be put in storage, and performing word segmentation processing on the text data to be put in storage to obtain a plurality of second words;
encoding the plurality of second vocabularies through an encoding model to obtain a plurality of second vector values;
and performing feature extraction on the plurality of second vector values based on a deep learning model to obtain the vector data.
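For claim 3, one possible shape of the ingestion step that produces the stored vector data is sketched below; segment, encode_words and extract_features are hypothetical stand-ins for the word segmenter, the encoding model and the deep learning model, none of which are specified in the claim.

# Ingestion sketch for claim 3: segment the text to be put in storage, encode the
# resulting second vocabularies, then extract features with a deep model.
from typing import Callable, List
import numpy as np

def build_vector_data(text_to_store: str,
                      segment: Callable[[str], List[str]],
                      encode_words: Callable[[List[str]], List[np.ndarray]],
                      extract_features: Callable[[List[np.ndarray]], np.ndarray]) -> np.ndarray:
    second_words = segment(text_to_store)         # plurality of second vocabularies
    second_vectors = encode_words(second_words)   # plurality of second vector values
    return extract_features(second_vectors)       # vector data stored in the database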
4. The professional information query method based on a vector database according to claim 1, further comprising, before the step of performing word segmentation processing on the query sentence to obtain a plurality of first words:
matching key words, and a first quantity of the key words, among the plurality of second words based on a preset corpus;
acquiring a plurality of related words corresponding to the key words, and matching a second quantity of the related words among the plurality of second words;
constructing a first target parameter according to the codes of the key words and the first quantity;
respectively constructing a second target parameter according to the codes of the related words and the second quantity;
calculating a first covariance matrix between the first target parameter and a plurality of second target parameters;
counting the occurrence frequency of each vocabulary in the plurality of second vocabularies, and acquiring target second vocabularies whose occurrence frequencies rank in the top K;
constructing a third target parameter according to the codes of the target second vocabularies and a third quantity of the target second vocabularies;
calculating a second covariance matrix among a plurality of the third target parameters;
calculating the similarity between the first covariance matrix and the second covariance matrix;
and if the similarity is greater than a threshold value, taking the preset label of the key words as the topic label.
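Claim 4's covariance-based check could look roughly like the sketch below. The claim does not specify how a target parameter is constructed from a code and a quantity, nor how the similarity of two covariance matrices is measured, so the [code..., count] construction and the cosine similarity used here are assumptions (the codes are assumed to be fixed-length vectors).

# Sketch of the covariance-based topic-label check in claim 4; the parameter
# construction and the matrix-similarity measure are illustrative assumptions.
from collections import Counter
from typing import Dict, List, Optional
import numpy as np

def target_parameter(code: np.ndarray, count: int) -> np.ndarray:
    return np.append(code, float(count))  # assumed construction: [code..., count]

def matrix_similarity(a: np.ndarray, b: np.ndarray) -> float:
    """Assumed similarity: cosine similarity of the flattened matrices."""
    fa, fb = a.ravel(), b.ravel()
    return float(fa @ fb / (np.linalg.norm(fa) * np.linalg.norm(fb) + 1e-12))

def topic_label_for(second_words: List[str],
                    codes: Dict[str, np.ndarray],
                    keyword: str,
                    related_words: List[str],
                    preset_label: str,
                    k: int = 5,
                    threshold: float = 0.8) -> Optional[str]:
    counts = Counter(second_words)

    # First target parameter (key word) and second target parameters (related words).
    first_param = target_parameter(codes[keyword], counts[keyword])
    second_params = [target_parameter(codes[w], counts[w]) for w in related_words if w in codes]

    # First covariance matrix between the first and the second target parameters.
    first_cov = np.cov(np.vstack([first_param] + second_params), rowvar=False)

    # Third target parameters for the top-K most frequent second vocabularies.
    top_k = [w for w, _ in counts.most_common(k) if w in codes]
    third_params = [target_parameter(codes[w], counts[w]) for w in top_k]
    second_cov = np.cov(np.vstack(third_params), rowvar=False)

    # Keep the key word's preset label as the topic label only if the two
    # covariance matrices are sufficiently similar.
    return preset_label if matrix_similarity(first_cov, second_cov) > threshold else None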
5. A vector database-based professional information query apparatus, wherein the vector database-based professional information query apparatus comprises:
the first acquisition unit is used for performing word segmentation processing on a query sentence to obtain a plurality of first vocabularies;
encoding the first vocabularies through an encoding model to obtain a plurality of first vector values;
acquiring word distances between the current first vocabulary and other first vocabularies according to the context;
converting the word distance into a first association value;
acquiring parts of speech corresponding to the current first vocabulary and the other first vocabularies respectively, and calculating a second association value according to the mapping relation between the parts of speech;
combining the current first vocabulary and the other first vocabularies into phrases, and inputting the phrases into a sentence recognition model to obtain corresponding combination probability; the combined probability is used for representing the probability that the current first vocabulary and the other first vocabularies form correct phrases;
Substituting the first association value, the second association value and the combination probability into the following formula to obtain a correlation parameter;
the formula is: [the formula appears only as an image in the original publication and is not reproduced here]
wherein one symbol represents the correlation parameter, x represents the first association value, P represents the combined probability, y represents the second association value, two further symbols represent constants, and n is a preset coefficient;
constructing an initial semantic graph by taking vector values corresponding to the words as nodes and corresponding correlation parameters between the words as edges; the nodes of the initial semantic graph are the first vector values, and the edges of the initial semantic graph are the correlation parameters among the first vocabularies;
removing edges, in the initial semantic graph, of which the correlation parameters are lower than a threshold value, and removing isolated nodes to obtain a target semantic graph;
selecting a target path according to the correlation parameter in the target semantic graph;
based on the ordering relation of a plurality of nodes on the target path, combining the first vector values corresponding to the nodes to obtain a trunk vocabulary vector;
matching the first vocabularies with a topic corpus respectively to obtain topic labels; each topic label corresponds to a plurality of topic words;
the matching unit is used for matching a plurality of target vector data corresponding to the topic label according to the topic label; wherein a plurality of topic labels in the vector database respectively correspond to different vector data;
the second acquisition unit is used for acquiring a first conversion matrix corresponding to the topic label, and multiplying the trunk vocabulary vector by the first conversion matrix to obtain a feature vector;
the first calculation unit is used for calculating the distances between the feature vector and the plurality of clustering centers based on a clustering algorithm, and taking the clustering center corresponding to the minimum distance as a target clustering center;
the third acquisition unit is used for acquiring a second conversion matrix corresponding to the target clustering center, and multiplying the feature vector by the second conversion matrix to obtain a query vector;
and the second calculation unit is used for acquiring a plurality of sub-target vector data corresponding to the target clustering center, respectively calculating the distances between the query vector and the plurality of sub-target vector data, and taking the original data corresponding to the sub-target vector data corresponding to the maximum distance as a professional information query result.
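The functional units of the device in claim 5 mirror the steps of claim 1. A skeletal decomposition might look as follows; the unit names are taken from the claim, while the method signatures and everything else are purely illustrative.

# Skeletal decomposition of the query device in claim 5 (method bodies omitted).
class ProfessionalInformationQueryDevice:
    def first_acquisition_unit(self, query_sentence):
        """Segment and encode the query sentence, build the semantic graph, and
        return the trunk vocabulary vector together with the topic label."""
        raise NotImplementedError

    def matching_unit(self, topic_label):
        """Return the target vector data associated with the topic label."""
        raise NotImplementedError

    def second_acquisition_unit(self, trunk_vector, topic_label):
        """Apply the topic's first conversion matrix to obtain the feature vector."""
        raise NotImplementedError

    def first_calculation_unit(self, feature_vector):
        """Pick the nearest cluster centre as the target cluster centre."""
        raise NotImplementedError

    def third_acquisition_unit(self, feature_vector, target_center):
        """Apply the cluster's second conversion matrix to obtain the query vector."""
        raise NotImplementedError

    def second_calculation_unit(self, query_vector, target_center):
        """Match the query vector against the cluster's sub-target vector data and
        return the corresponding original data as the professional information result."""
        raise NotImplementedError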
6. A terminal device comprising a memory, a processor and a computer program stored in the memory and executable on the processor, characterized in that the processor implements the steps of the method according to any of claims 1 to 4 when the computer program is executed.
7. A computer readable storage medium storing a computer program, characterized in that the computer program when executed by a processor implements the steps of the method according to any one of claims 1 to 4.
CN202311495259.3A 2023-11-10 2023-11-10 Professional information query method and device based on vector database Active CN117235137B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311495259.3A CN117235137B (en) 2023-11-10 2023-11-10 Professional information query method and device based on vector database

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311495259.3A CN117235137B (en) 2023-11-10 2023-11-10 Professional information query method and device based on vector database

Publications (2)

Publication Number Publication Date
CN117235137A CN117235137A (en) 2023-12-15
CN117235137B true CN117235137B (en) 2024-04-02

Family

ID=89086416

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311495259.3A Active CN117235137B (en) 2023-11-10 2023-11-10 Professional information query method and device based on vector database

Country Status (1)

Country Link
CN (1) CN117235137B (en)


Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7031909B2 (en) * 2002-03-12 2006-04-18 Verity, Inc. Method and system for naming a cluster of words and phrases

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2023108995A1 (en) * 2021-12-15 2023-06-22 平安科技(深圳)有限公司 Vector similarity calculation method and apparatus, device and storage medium
WO2023134074A1 (en) * 2022-01-12 2023-07-20 平安科技(深圳)有限公司 Text topic generation method and apparatus, and device and storage medium
CN115438166A (en) * 2022-09-29 2022-12-06 招商局金融科技有限公司 Keyword and semantic-based searching method, device, equipment and storage medium
CN116340365A (en) * 2023-05-17 2023-06-27 北京创新乐知网络技术有限公司 Cache data matching method, cache data matching device and terminal equipment
CN116992026A (en) * 2023-07-12 2023-11-03 华南师范大学 Text clustering method and device, electronic equipment and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Search engine result clustering based on topic phrases; 索红光 (Suo Hongguang); 孙珊珊 (Sun Shanshan); 王玉伟 (Wang Yuwei); 梁玉环 (Liang Yuhuan); Computer Systems & Applications (计算机系统应用), Issue 03, pp. 107-110 *

Also Published As

Publication number Publication date
CN117235137A (en) 2023-12-15

Similar Documents

Publication Publication Date Title
CN112035599B (en) Query method and device based on vertical search, computer equipment and storage medium
CN108027814B (en) Stop word recognition method and device
CN110134777B (en) Question duplication eliminating method and device, electronic equipment and computer readable storage medium
WO2020114100A1 (en) Information processing method and apparatus, and computer storage medium
CN112016313B (en) Spoken language element recognition method and device and warning analysis system
CN111428028A (en) Information classification method based on deep learning and related equipment
CN111177375B (en) Electronic document classification method and device
CN110990532A (en) Method and device for processing text
CN111985228A (en) Text keyword extraction method and device, computer equipment and storage medium
CN114416979A (en) Text query method, text query equipment and storage medium
CN110795942B (en) Keyword determination method and device based on semantic recognition and storage medium
CN114398968B (en) Method and device for labeling similar customer-obtaining files based on file similarity
CN114722199A (en) Risk identification method and device based on call recording, computer equipment and medium
CN112395881B (en) Material label construction method and device, readable storage medium and electronic equipment
CN113486664A (en) Text data visualization analysis method, device, equipment and storage medium
CN116628173B (en) Intelligent customer service information generation system and method based on keyword extraction
CN111125329B (en) Text information screening method, device and equipment
CN116910599A (en) Data clustering method, system, electronic equipment and storage medium
CN111325033A (en) Entity identification method, entity identification device, electronic equipment and computer readable storage medium
CN111460114A (en) Retrieval method, device, equipment and computer readable storage medium
CN116049376A (en) Method, device and system for retrieving and replying information and creating knowledge
CN117235137B (en) Professional information query method and device based on vector database
CN115953123A (en) Method, device and equipment for generating robot automation flow and storage medium
CN112541357B (en) Entity identification method and device and intelligent equipment
CN114842982A (en) Knowledge expression method, device and system for medical information system

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant