CN116401336A

CN116401336A - Cognitive intelligent query method and device, computer readable storage medium and terminal

Info

Publication number: CN116401336A
Application number: CN202310341145.7A
Authority: CN
Inventors: 余炯
Original assignee: Huayuan Computing Technology Shanghai Co ltd
Current assignee: Huayuan Computing Technology Shanghai Co ltd
Priority date: 2023-03-31
Filing date: 2023-03-31
Publication date: 2023-07-07
Anticipated expiration: 2043-03-31
Also published as: CN116401336B

Abstract

A cognitive intelligent query method and device, a computer readable storage medium and a terminal, wherein the method comprises the following steps: performing text conversion on the relational database to determine a plurality of word sequences, each word sequence containing one or more words; performing vector conversion on at least a part of words in each word sequence to determine a word vector group corresponding to the word sequence; converting the query condition into a word vector group to be queried, and then determining a first similarity between the word vector group to be queried and at least a part of word vector groups in the word vector groups corresponding to each word sequence; and if the obtained maximum first similarity is greater than or equal to a first similarity threshold value, determining a query result based on a word vector group to which the maximum first similarity belongs. The scheme is beneficial to improving the intelligence of data query and the richness and comprehensiveness of the data of the query result.

Description

Cognitive intelligent query method and device, computer readable storage medium and terminal

Technical Field

The invention relates to the technical field of cognitive query, in particular to a cognitive intelligent query method and device, a computer readable storage medium and a terminal.

Background

Traditional data query technologies are mainly based on structured query language (Structured Query Language, SQL). Specifically, an SQL query retrieves relevant data from a relational database primarily through the SELECT statement.

For example, in an industrial continuous casting application scenario, the relational database contains process and production line information for a continuous casting line, and a typical SQL query works by keyword matching. For example, to query the state of the secondary cooling water (such as water quantity, flux and water pressure) in the continuous casting segment process, the corresponding information is returned only when the key field "segment" is matched in the query. However, an SQL query cannot return other relevant information that is independent of the "segment" attribute, such as the number of secondary cooling nozzles, the secondary cooling water density, the secondary cooling water pressure distribution, and real-time video of secondary cooling at the segment. The intelligence of the query and the data comprehensiveness of the query results therefore need to be improved.

In addition, the prior art also provides a query method based on the knowledge graph. For example, by constructing a knowledge-graph using a relational database or a non-relational database, and then performing a data query based on the knowledge-graph. Although the knowledge graph can improve the data comprehensiveness of the query result to a certain extent, the knowledge graph is essentially only used for converting the storage structure of the database, does not relate to semantic feature matching of the storage content of the database, cannot reflect the semantic relevance or similarity of the storage content of the database, and the query effect still needs to be improved.

Disclosure of Invention

The technical problem solved by the embodiment of the invention is how to improve the intelligence of data query and the data richness and comprehensiveness of query results.

In order to solve the technical problems, the embodiment of the invention provides a cognitive intelligent query method, which comprises the following steps: performing text conversion on the relational database to determine a plurality of word sequences, each word sequence containing one or more words; performing vector conversion on at least a part of words in each word sequence to determine a word vector group corresponding to the word sequence; converting the query condition into a word vector group to be queried, and then determining a first similarity between the word vector group to be queried and at least a part of word vector groups in the word vector groups corresponding to each word sequence; and if the obtained maximum first similarity is greater than or equal to a first similarity threshold value, determining a query result based on a word vector group to which the maximum first similarity belongs.

Optionally, the relational database is a continuous casting quality relational database comprising a plurality of continuous casting process data tables, each continuous casting process data table comprises one or more tuples, and each tuple comprises one or more items of continuous casting process data; performing text conversion on the relational database to determine a plurality of word sequences includes: for each continuous casting process data table, performing text conversion on at least a part of the continuous casting process data of each tuple in the table to determine a word corresponding to each item of continuous casting process data, thereby obtaining the word sequence corresponding to the tuple.

Optionally, the types of the various data contained in the relational database are selected from: words, strings, images, video, audio, and documents.

Optionally, before determining the first similarity between the word vector group to be queried and at least a part of word vector groups in the word vector groups corresponding to the word sequences, the method further includes: clustering the word vector groups corresponding to the determined word sequences to obtain a plurality of word vector group candidate sets; screening the plurality of word vector group candidate sets by adopting a preset second similarity threshold to obtain a plurality of word vector group sets; and selecting one word vector group from each word vector group set as a word vector group to be matched, and taking the selected plurality of word vector groups to be matched as at least one part of word vector groups.

Optionally, the screening the candidate sets of the plurality of word vector groups by using a preset second similarity threshold includes: for each candidate set of word vector sets, determining a second similarity between each two word vector sets in the candidate set of word vector sets; if the average value of the obtained second similarity is smaller than or equal to the second similarity threshold value, discarding the candidate set of word vector groups.

Optionally, selecting a word vector group from each word vector group set as the word vector group to be matched includes: randomly selecting a word vector group from each word vector group set as a word vector group to be matched; or selecting the word vector group serving as the clustering center from each word vector group set as the word vector group to be matched.

Optionally, after obtaining the plurality of word vector sets, the method further includes: determining a pre-added word vector group according to a preset external data source; for each pre-added word vector group, determining a third similarity between the pre-added word vector group and the word vector group to be matched in each word vector group set; and if the maximum third similarity is greater than or equal to the third similarity threshold, adding the pre-added word vector group into a word vector group set to which the word vector group to be matched with the maximum third similarity belongs.

Optionally, the external data source is selected from: image, video, and audio; determining the pre-added word vector group according to the preset external data source includes: determining one or more frames to be processed according to the external data source; determining text labeling information of each frame to be processed; determining candidate words to be converted according to the text labeling information; and performing vector conversion on the candidate words to be converted corresponding to each frame to be processed, so as to obtain the pre-added word vector group corresponding to the external data source.

Optionally, the type of the external data source further includes one or more of: Word document, Excel document, HTML document, JSON string.

Optionally, the determining the query result based on the word vector group to which the maximum first similarity belongs includes: determining a word vector group to be matched to which the maximum first similarity belongs; and determining the query result according to the word vector group set to which the word vector group to be matched belongs.

Optionally, vector conversion is performed on at least a part of words in each word sequence to determine a word vector group corresponding to the word sequence, including: at least one part of words in each word sequence are input into a machine learning model based on a word embedding mode, so that a word vector group corresponding to the word sequence is obtained.

Optionally, the machine learning model based on the word embedding mode is selected from: Word2Vec model, GloVe model.

The embodiment of the invention also provides a cognitive intelligent query device, which comprises: the text conversion module is used for carrying out text conversion on the relational database to determine a plurality of word sequences, wherein each word sequence comprises one or more words; the vector conversion module is used for carrying out vector conversion on at least one part of words in each word sequence so as to determine a word vector group corresponding to the word sequence; the similarity determining module is used for converting the query condition into a word vector group to be queried, and then determining first similarity between the word vector group to be queried and at least one part of word vector groups in the word vector groups corresponding to each word sequence; and the query result determining module is used for determining a query result based on the word vector group to which the maximum first similarity belongs if the obtained maximum first similarity is greater than or equal to a first similarity threshold value.

The embodiment of the invention also provides a computer readable storage medium, wherein a computer program is stored on the computer readable storage medium, and the computer program executes the steps of the cognitive intelligent query method when being run by a processor.

The embodiment of the invention also provides a terminal which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the steps of the cognitive intelligent query method when running the computer program.

Compared with the prior art, the technical scheme of the embodiment of the invention has the following beneficial effects:

the embodiment of the invention provides a cognitive intelligent query method, which carries out text conversion on a relational database to determine a plurality of word sequences, wherein each word sequence comprises one or more words; performing vector conversion on at least a part of words in each word sequence to determine a word vector group corresponding to the word sequence; converting the query condition into a word vector group to be queried, and then determining a first similarity between the word vector group to be queried and at least a part of word vector groups in the word vector groups corresponding to each word sequence; and if the obtained maximum first similarity is greater than or equal to a first similarity threshold value, determining a query result based on a word vector group to which the maximum first similarity belongs.

In the embodiment of the invention, vector conversion is carried out on the content of the relational database to obtain a plurality of word vector groups, and the query result is then determined by calculating the similarity between word vector groups. Prior-art query methods based on relational database SQL or on a knowledge graph return a query result only when a keyword is matched. In contrast, this embodiment does not depend on keyword matching; it queries by similarity matching between the word vector groups corresponding to the word sequences. Because each word vector group contains the semantic features of its word sequence, the similarity between word vector groups reflects the semantic similarity between the word sequences. Therefore, by adopting the embodiment, all associated information having a high semantic similarity to the query condition (a similarity greater than or equal to the first similarity threshold value) can be returned, so that the query intelligence and the data richness and comprehensiveness of the query result are effectively improved.

Further, before determining the first similarity between the word vector group to be queried and at least a part of word vector groups in the word vector groups corresponding to each word sequence, the method further includes: clustering the word vector groups corresponding to the determined word sequences to obtain a plurality of word vector group candidate sets; screening the plurality of word vector group candidate sets by adopting a preset second similarity threshold to obtain a plurality of word vector group sets; and selecting one word vector group from each word vector group set as a word vector group to be matched, and taking the selected plurality of word vector groups to be matched as at least one part of word vector groups.

In the embodiment of the invention, a plurality of word vector group candidate sets are determined by clustering, and the second similarity threshold is then used to screen them further into a plurality of word vector group sets. Because the word vector groups within each word vector group set are highly similar to one another, the query process only needs to determine the similarity between the word vector group to be queried and the one word vector group to be matched that is selected from each set. Therefore, compared with matching the word vector group to be queried against all or most of the word vector groups in each word vector group set, this embodiment can greatly reduce the amount of data to be computed and improve the query efficiency.
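As a non-limiting illustration, matching against one representative per set can be sketched as follows. The sketch assumes that each word vector group has already been reduced to a single vector and that cosine similarity is used as the first similarity; neither choice is specified here, and the names `cosine`, `best_set`, `representatives` and `sets` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def best_set(query_vec, representatives, first_threshold):
    """Compare the query only with one representative per set.

    Returns the id of the best-matching set, or None when the maximum
    similarity is below the first similarity threshold.
    """
    best_id, best_sim = None, -1.0
    for set_id, rep in representatives.items():
        sim = cosine(query_vec, rep)
        if sim > best_sim:
            best_id, best_sim = set_id, sim
    return best_id if best_sim >= first_threshold else None

# Hypothetical data: one representative vector per word vector group set.
representatives = {"crystallizer": [1.0, 0.1], "segment": [0.1, 1.0]}
sets = {"crystallizer": ["A'", "B'"], "segment": ["C'", "D'"]}

hit = best_set([0.2, 0.9], representatives, first_threshold=0.8)
result = sets[hit] if hit is not None else []
```

With two sets of two members each, the query is compared with two representatives rather than four word vector groups; the saving grows with the size of each set.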

Further, the screening the plurality of candidate sets of word vector sets by using a preset second similarity threshold includes: for each candidate set of word vector sets, determining a second similarity between each two word vector sets in the candidate set of word vector sets; if the average value of the obtained second similarity is smaller than or equal to the second similarity threshold value, discarding the candidate set of word vector groups. Therefore, the word vector group candidate sets with smaller similarity among word vector groups in the set can be removed from the plurality of word vector group candidate sets obtained by clustering, so that the subsequent matching operation data volume can be reduced, and more accurate query results can be obtained.
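The screening rule can be sketched as below, as an illustration only. Cosine similarity stands in for the second similarity, each word vector group is assumed to be represented by one vector, and a candidate set with a single member is treated as coherent; all three are assumptions of this sketch, and the function names are hypothetical.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mean_pairwise_similarity(vectors):
    """Average similarity over every pair of vectors in one candidate set."""
    pairs = list(combinations(vectors, 2))
    if not pairs:
        return 1.0  # assumption: a single-member set counts as coherent
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def screen(candidate_sets, second_threshold):
    """Discard candidate sets whose mean pairwise similarity is <= the threshold."""
    return [s for s in candidate_sets
            if mean_pairwise_similarity(s) > second_threshold]

# Hypothetical candidate sets produced by a clustering step.
tight = [[1.0, 0.0], [0.9, 0.1], [0.95, 0.05]]
loose = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
kept = screen([tight, loose], second_threshold=0.5)
```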

Further, after deriving the plurality of sets of word vector groups, the method further comprises: determining a pre-added word vector group according to a preset external data source; for each pre-added word vector group, determining a third similarity between the pre-added word vector group and the word vector group to be matched in each word vector group set; and if the maximum third similarity is greater than or equal to the third similarity threshold, adding the pre-added word vector group into a word vector group set to which the word vector group to be matched with the maximum third similarity belongs. In the embodiment of the invention, the word vector group set obtained after screening is expanded by adopting an external data source, so that the query range can be enhanced, and richer queried objects can be obtained, thereby reducing the operation data quantity, improving the query efficiency and further effectively improving the data richness and comprehensiveness of the query result.
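A minimal sketch of this expansion step follows. It assumes cosine similarity as the third similarity and one vector per word vector group; the item names, vectors and the helper `expand_sets` are hypothetical and serve only to illustrate the rule.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def expand_sets(sets, representatives, pre_added, third_threshold):
    """Add each external item to the set whose representative it matches best,
    provided the maximum similarity reaches the third similarity threshold."""
    for name, vec in pre_added.items():
        sims = {sid: cosine(vec, rep) for sid, rep in representatives.items()}
        best_id = max(sims, key=sims.get)
        if sims[best_id] >= third_threshold:
            sets[best_id].append(name)
    return sets

# Hypothetical data: two existing sets and two items from an external source.
representatives = {"crystallizer": [1.0, 0.1], "segment": [0.1, 1.0]}
sets = {"crystallizer": ["A'", "B'"], "segment": ["C'", "D'"]}
pre_added = {"segment_video_frame": [0.15, 0.95], "unrelated_doc": [-1.0, -1.0]}

expanded = expand_sets(sets, representatives, pre_added, third_threshold=0.8)
```

The unrelated item is not added to any set because its best similarity stays below the threshold.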

Further, the external data source is selected from: image, video, and audio; determining the pre-added word vector group according to the preset external data source includes: determining one or more frames to be processed according to the external data source; determining text labeling information of each frame to be processed; determining candidate words to be converted according to the text labeling information; and performing vector conversion on the candidate words to be converted corresponding to each frame to be processed, so as to obtain the pre-added word vector group corresponding to the external data source.

In the embodiment of the invention, the query is performed based on vector similarity matching, and various types of data can be converted into one or more corresponding word vectors that represent the semantic features contained in the data. For audio, video and image files, for example, word vector conversion can be performed after determining text labeling information that describes their semantic features. Therefore, the word vector group sets obtained from the original relational database can be expanded with different types of external data according to the requirements of the actual application scenario. Further, because audio, video and image files often contain more comprehensive and richer information than text, richer, more complete and more comprehensive query results can be obtained.

Drawings

FIG. 1 is a flow chart of a cognitive intelligent query method in an embodiment of the invention;

FIG. 2 is a schematic diagram of a text conversion and vector similarity matching process based on a continuous casting process data table according to an embodiment of the present invention;

FIG. 3 is a flow chart of a cognitive intelligent query method in an industrial continuous casting scenario in an embodiment of the invention;

FIG. 4 is a schematic structural diagram of a cognitive intelligent query device according to an embodiment of the present invention.

Detailed Description

As described in the background art, the existing data query technology based on the SQL query or the knowledge graph of the relational database needs to improve the intelligence of the query and the data richness and comprehensiveness of the query result.

In order to solve the above technical problems, an embodiment of the present invention provides a cognitive intelligent query method, which specifically includes: performing text conversion on the relational database to determine a plurality of word sequences, each word sequence containing one or more words; performing vector conversion on at least a part of words in each word sequence to determine a word vector group corresponding to the word sequence; converting the query condition into a word vector group to be queried, and then determining a first similarity between the word vector group to be queried and at least a part of word vector groups in the word vector groups corresponding to each word sequence; and if the obtained maximum first similarity is greater than or equal to a first similarity threshold value, determining a query result based on a word vector group to which the maximum first similarity belongs.

From the above, in the embodiment of the present invention, the contents of the relational database are subjected to vector conversion to obtain a plurality of word vector groups, and the query result is then determined by calculating the similarity between word vector groups. Prior-art query methods based on relational database SQL or on a knowledge graph return a query result only when a keyword is matched. In contrast, the embodiment does not depend on keyword matching; it queries by similarity matching between the word vector groups corresponding to the word sequences. Because each word vector group contains the semantic features of its word sequence, the similarity between word vector groups reflects the semantic similarity between the word sequences. With this embodiment, all associated information having a high semantic similarity to the query condition (a similarity greater than or equal to the first similarity threshold) can therefore be returned, thereby effectively improving the query intelligence and the data richness and comprehensiveness of the query results.

In order to make the above objects, features and advantages of the present invention more comprehensible, embodiments accompanied with figures are described in detail below.

In an embodiment of the present invention, the relational database may be a set of a plurality of entities and connections between the entities in a specific application field. The relational database may contain a plurality of data tables, each of which typically stores data in the form of a two-dimensional table, a logical group of related information arranged in rows and columns, similar to an Excel worksheet. A two-dimensional data table may be considered a relationship.

One row in the data table is a "tuple" or "record". Each column in the data table is referred to as a "field" or "attribute". The data table is defined by the fields it contains, each field describing the meaning of the data it holds. When the data table is created, each field is assigned a data type that defines its data length and other attributes. The fields may contain various characters, numbers, or other forms of data.

In the data table, the intersection of a row and a column holds the value of a certain attribute. For example, in the field of industrial continuous casting, for a continuous casting process data table, "crystallizer" and "segment" are different values of the "equipment body" attribute, and "liquid level", "amount of secondary cooling water" and "vibration frequency" are different values of the "process parameter" attribute. In the data table, each tuple has data that uniquely identifies the tuple, referred to as the "primary key" (or "primary code"). The primary key may consist of one field or several fields and is often used as the index field of the data table. A data table A containing the attributes a1, a2, a3 and a4 can be written briefly as A(a1, a2, a3, a4).

Referring to FIG. 1, FIG. 1 is a flowchart of a cognitive intelligent query method according to an embodiment of the present invention. The method may include steps S11 to S14:

step S11: performing text conversion on the relational database to determine a plurality of word sequences, each word sequence containing one or more words;

step S12: performing vector conversion on at least a part of words in each word sequence to determine a word vector group corresponding to the word sequence;

step S13: converting the query condition into a word vector group to be queried, and then determining a first similarity between the word vector group to be queried and at least a part of word vector groups in the word vector groups corresponding to each word sequence;

step S14: and if the obtained maximum first similarity is greater than or equal to a first similarity threshold value, determining a query result based on a word vector group to which the maximum first similarity belongs.
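Steps S11 to S14 can be sketched end to end as follows, purely for illustration. The two-dimensional embedding table `EMBED`, the mean pooling of a word vector group into a single vector, and the use of cosine similarity as the first similarity are all assumptions of this sketch; a real system would obtain the vectors from a trained word embedding model.

```python
import math

# Hypothetical 2-d word vectors standing in for a trained embedding model.
EMBED = {
    "crystallizer": [1.0, 0.0], "liquid_level": [0.9, 0.2],
    "segment": [0.0, 1.0], "secondary_cooling_water": [0.2, 0.9],
}

def to_word_sequence(row):
    """S11: convert every item of a tuple into a word."""
    return [str(item) for item in row]

def to_vector_group(words):
    """S12: convert the words that have an embedding into a word vector group."""
    return [EMBED[w] for w in words if w in EMBED]

def pool(group):
    """Mean-pool a word vector group into one vector (assumption of this sketch)."""
    return [sum(col) / len(group) for col in zip(*group)]

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either is empty or zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def query(rows, condition_words, first_threshold):
    """S13 and S14: return the best-matching tuple, or None below the threshold."""
    q = pool(to_vector_group(condition_words))
    best_row, best_sim = None, -1.0
    for row in rows:
        sim = cosine(q, pool(to_vector_group(to_word_sequence(row))))
        if sim > best_sim:
            best_row, best_sim = row, sim
    return best_row if best_sim >= first_threshold else None

rows = [("crystallizer", "liquid_level"), ("segment", "secondary_cooling_water")]
answer = query(rows, ["segment"], first_threshold=0.8)
```

The query word "segment" retrieves the tuple that also mentions secondary cooling water, even though that phrase does not appear in the query condition.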

In the implementation of step S11, text converting the relational database to determine the plurality of word sequences may specifically include: extracting part or all of the data in each tuple in each data table of the relational database; performing text conversion on the extracted various data to obtain a plurality of words; and then, based on a preset arrangement sequence and each converted word, arranging and forming a word sequence corresponding to the tuple.

Every two adjacent words in the word sequence may be separated by a preset character, for example, a space. The preset arrangement order may be the left-to-right order of the data in the tuple, the right-to-left order, or another suitable order.

As one non-limiting example, the relational database is a continuous casting quality relational database comprising a plurality of continuous casting process data tables, each continuous casting process data table comprising one or more tuples, each tuple comprising one or more items of continuous casting process data. Text conversion of the continuous casting quality relational database may include: for each continuous casting process data table, performing text conversion on at least a part of the continuous casting process data of each tuple in the table to determine a word corresponding to each item of continuous casting process data, thereby obtaining the word sequence corresponding to the tuple.

In particular, the text conversion may be performed on all of the continuous casting process data contained in each tuple. Or, according to the actual application scene, a plurality of continuous casting process data belonging to specific attributes can be selected from each tuple to perform text conversion.

Referring to FIG. 2, FIG. 2 is a schematic diagram of a process for text conversion and vector similarity matching based on a continuous casting process data table according to an embodiment of the present invention.

The continuous casting process data table 201 may be derived from a continuous casting quality relational database. As shown in FIG. 2, the continuous casting process data table 201 includes 4 tuples and 6 columns, and the "attributes" of the columns are "equipment body", "process parameters", "units", "parameter values", "maximum values", and "minimum values", respectively.

The process of determining the respective corresponding word sequences will be described below by taking text conversion of each tuple (denoted tuple A, tuple B, tuple C, tuple D) in the continuous casting process data table 201 as an example.

Text conversion is carried out on each item of continuous casting process data contained in tuple A to obtain the corresponding words: "crystallizer", "liquid level", "mm", "0.5", "5", "0". Then, following the arrangement order of the continuous casting process data in tuple A, a space is inserted between every two adjacent words to form the word sequence A' corresponding to tuple A: crystallizer liquid level mm 0.5 5 0.

The same method as described above can be used for tuples B to D to obtain the corresponding word sequences B' to D', respectively:

B': crystallizer vibration frequency Hz 20 40 0;

C': sector segment secondary cooling water volume MPa 20 50 3;

D': sector segment spray water quantity MPa 30 50 3.
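The tuple-to-word-sequence conversion can be sketched as below, using the values given above for tuples A and B. The function name `tuple_to_word_sequence` and its parameters are hypothetical; the space separator and the optional reversed order follow the description of the preset character and the preset arrangement order.

```python
def tuple_to_word_sequence(row, separator=" ", reverse=False):
    """Convert each item of a tuple to a word and join the words in order.

    `separator` is the preset character placed between adjacent words;
    `reverse=True` uses the right-to-left order of the tuple instead.
    """
    words = [str(item) for item in row]
    if reverse:
        words.reverse()
    return separator.join(words)

# Values of tuples A and B as listed in the description.
tuple_a = ("crystallizer", "liquid level", "mm", 0.5, 5, 0)
tuple_b = ("crystallizer", "vibration frequency", "Hz", 20, 40, 0)

seq_a = tuple_to_word_sequence(tuple_a)
seq_b = tuple_to_word_sequence(tuple_b)
```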

Vector conversion (for example, vector space conversion based on word embedding) is then performed on each of the word sequences A' to D' to obtain the corresponding word vector groups, after which vector similarity matching is performed. Because the word sequences A' and B' share the equipment body "crystallizer" and both have a minimum value of "0", their similarity is high; the word sequences C' and D' share the equipment body "sector segment", the unit "MPa" and the minimum value "3", so their similarity is also high.

As shown in the vector similarity matching result 202 in FIG. 2, in the vector space, the vector groups obtained by converting the word sequences A' and B' (each identified by "crystallizer") have a high similarity to each other, and the vector groups obtained by converting the word sequences C' and D' (each identified by "sector segment") have a high similarity to each other.

It should be noted that, in the implementation, in the process of text converting each tuple to obtain a word sequence, the "attribute" of each piece of continuous casting process data (i.e. the column name of the column) may be added before the word converted by the continuous casting process data according to the needs of a specific application scenario.

For example, for tuple A, text conversion with attribute addition yields the corresponding word sequence A'' (not shown in FIG. 2): continuous casting equipment body crystallizer process parameter liquid level unit mm parameter value 0.5 maximum value 5 minimum value 0;

for another example, for tuple C, text conversion with attribute addition yields the corresponding word sequence C'' (not shown in FIG. 2): continuous casting equipment body sector segment process parameter secondary cooling water volume unit MPa parameter value 20 maximum value 40 minimum value 3.

In other words, all or part of the continuous casting process data contained in a tuple is converted into words, and a space is inserted between every two adjacent words, to form the word sequence corresponding to that tuple.
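Adding the attribute (column name) before each converted word can be sketched as follows. The column names in `COLUMNS` are illustrative renderings of the six attributes of the data table, and the function name is hypothetical.

```python
# Illustrative column names for the six attributes of the data table.
COLUMNS = ("equipment body", "process parameter", "unit",
           "parameter value", "maximum value", "minimum value")

def tuple_to_attributed_sequence(columns, row, separator=" "):
    """Place each column name before the word converted from that column's data."""
    parts = []
    for name, value in zip(columns, row):
        parts.append(name)
        parts.append(str(value))
    return separator.join(parts)

tuple_a = ("crystallizer", "liquid level", "mm", 0.5, 5, 0)
seq_a2 = tuple_to_attributed_sequence(COLUMNS, tuple_a)
```

Prefixing the attribute gives the embedding model explicit context for bare numbers, which would otherwise be indistinguishable across columns.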

The word sequence obtained by text conversion of each tuple may be called a token sequence. It should be noted that, in addition to the items of data included in the corresponding tuple and the attribute to which each item belongs, each word sequence or token sequence may further include other suitable data as required, for example, word sequence identification information (ID), time information (such as a current timestamp), and signature data (such as data obtained by encrypting one or more words in the word sequence).

In the embodiment of the invention, the required proper data can be added in the word sequence obtained by text conversion according to the actual requirement, so that the enrichment and expansion of the original information contained in the tuple corresponding to the word sequence can be realized, and the scene requirements of different cognitive intelligent queries can be met.

It should be noted that, in addition to the text conversion method described in the foregoing embodiment, in a specific implementation the numeric data contained in the tuples of a data table may be converted to text using one or more of the following methods, selected according to actual scene needs.

The first method may be to convert a number into the form "header_number", where the header may be the name of the attribute (column) to which the data belongs. For example, in a "Year" column, the year 2022 becomes "Year_2022".
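A minimal sketch of this first conversion, with a hypothetical helper name:

```python
def header_number(column_name, value):
    """Prefix a numeric value with the name of the column it belongs to."""
    return f"{column_name}_{value}"

token = header_number("Year", 2022)
```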

The second method may be to convert a value into the form "header_range", so that values are classified by the value interval they fall in, where the header may be the name of the attribute (column) to which the data belongs and the value intervals are divided in advance. The value intervals may correspond to a low (small) value interval, a medium value interval and a high (large) value interval, respectively. For example, for the secondary cooling water field in a continuous casting plant: the secondary cooling water quantity value 20 of the continuous casting segment is at about 50% relative to the maximum value 40 and the minimum value 3, and can therefore be converted into "continuous casting segment secondary cooling water quantity_medium".
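The interval-based conversion can be sketched as below. Dividing the range between the minimum and the maximum into three equal intervals labeled "low", "medium" and "high" is an assumption of this sketch; the description only states that the intervals are divided in advance. The function name and the column label are hypothetical.

```python
def header_range(column_name, value, minimum, maximum):
    """Replace a value with its column name plus the label of its value interval.

    Assumption: [minimum, maximum] is split into three equal intervals.
    """
    if maximum <= minimum:
        raise ValueError("maximum must be greater than minimum")
    position = (value - minimum) / (maximum - minimum)
    if position < 1 / 3:
        label = "low"
    elif position < 2 / 3:
        label = "medium"
    else:
        label = "high"
    return f"{column_name}_{label}"

# Value 20 with minimum 3 and maximum 40, as in the example above.
token = header_range("secondary_cooling_water", 20, 3, 40)
```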

The third method is to cluster the numerical values by using a clustering algorithm (for example, a K-nearest neighbor, K-means, or hierarchical clustering algorithm), so as to obtain numeric clusters. Specifically, each value in a column of the relational database can be replaced with the ID of the numeric cluster containing the value. For example, in the continuous casting quality relational database, the cooling water amount may be expressed as "cooling_water_id", and the actual cooling water amount value 20 may be expressed as "cooling_water_id_20", where the trailing 20 is the ID of the numeric cluster containing the value 20. Further, the type of each item of data contained in the relational database is selected from: words, strings, images, video, audio, and documents.
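As a non-limiting illustration, the three numeric conversion methods described above may be sketched as follows. All function names are hypothetical. The equal-width division of the value range and the simple one-dimensional K-means with zero-based cluster IDs are assumptions made for this sketch; the embodiment only requires that the ranges be divided in advance and that some clustering algorithm be used.

```python
def to_header_number(header, value):
    """Method 1: attribute (column) name followed by the raw number."""
    return f"{header}_{value}"


def to_header_range(header, value, lo, hi, labels=("low", "medium", "high")):
    """Method 2: attribute name followed by the label of the value range."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    # Assumption: [lo, hi] is divided into equal-width ranges, one per label.
    position = (value - lo) / (hi - lo)
    index = max(0, min(int(position * len(labels)), len(labels) - 1))
    return f"{header}_{labels[index]}"


def kmeans_1d(values, k, iterations=20):
    """Tiny one-dimensional K-means used to form numeric clusters."""
    ordered = sorted(values)
    # Deterministic start: spread the initial centres over the sorted values.
    centres = [ordered[(i * (len(ordered) - 1)) // max(k - 1, 1)] for i in range(k)]
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for v in values:
            nearest = min(range(k), key=lambda c: abs(v - centres[c]))
            groups[nearest].append(v)
        # An empty cluster keeps its previous centre.
        centres = [sum(g) / len(g) if g else centres[i] for i, g in enumerate(groups)]
    return centres


def to_cluster_token(header, value, centres):
    """Method 3: attribute name followed by the ID of the containing cluster."""
    cluster_id = min(range(len(centres)), key=lambda c: abs(value - centres[c]))
    return f"{header}_id_{cluster_id}"


print(to_header_number("Year", 2022))
print(to_header_range("secondary_cooling_water", 20, lo=3, hi=40))
centres = kmeans_1d([3, 4, 5, 19, 20, 21, 38, 39, 40], k=3)
print(to_cluster_token("cooling_water", 20, centres))
```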

In the specific embodiment of step S12, at least a part of the words in each word sequence are subjected to vector conversion to determine a word vector group corresponding to the word sequence.

Further, step S12 may include: inputting at least a part of the words in each word sequence into a machine learning model based on word embedding, so as to obtain the word vector group corresponding to the word sequence.

The at least a part of the words in each word sequence may be the main words selected from the word sequence in view of the actual application scenario. The embodiment of the invention does not limit which words, or how many words, are selected from each word sequence for vector conversion.

Without limitation, the machine learning model based on word embedding is selected from: a Word2Vec model and a GloVe model.

Word embedding may also be referred to as word vectorization. The conversion between words and vectors can be realized by a machine learning model based on word embedding. The word vector group obtained by vector conversion of each word sequence may carry the semantic information (or semantic features) contained in the word sequence. Specifically, if two words have the same or similar meanings, the similarity between the vectors obtained by converting them is higher; conversely, if the meanings of two words differ greatly, the similarity between their vectors is lower. On this basis, correlations between the data contained in the relational data tables can be discovered.

Currently, there are many methods in natural language processing for converting words or sentences into vectors, such as the Word2Vec and GloVe models. The vectors generated by the Word2Vec model capture the syntactic and semantic attributes of words, sentences, or texts, and the semantic relatedness between different words (or sentences or texts) can be analyzed by calculating the similarity between their corresponding vectors. For example, in the field of continuous casting, the process parameter "pull speed" is semantically similar to "casting speed": a high pull speed value corresponds to a high casting speed, so the two are closely related. Likewise, "crystallizer liquid level" is semantically similar to "liquid level fluctuation": a large change in the crystallizer liquid level corresponds to a large liquid level fluctuation, so these two are also closely related.

In a specific embodiment of step S13, the query condition is converted into a word vector set to be queried, and then a first similarity between the word vector set to be queried and at least a part of word vector sets corresponding to each word sequence is determined.

The query condition may be an SQL query statement, and converting the query condition into the word vector group to be queried may specifically include: extracting keywords from the query condition, and determining the word sequence corresponding to the query condition according to the keywords; and then performing vector conversion on the word sequence to obtain the word vector group to be queried corresponding to the query condition.

Further, determining the word sequence corresponding to the query condition according to the keywords may specifically include: directly arranging the extracted keywords, separated by spaces, to form the word sequence; or adding other appropriate data or parameters according to actual needs, and arranging them together with the extracted keywords, separated by spaces, to form the word sequence.
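By way of a hedged example, the keyword extraction and space-separated arrangement described above may be sketched as follows. The function name `query_to_word_sequence`, the short list of SQL reserved words, and the regular expression used for tokenization are assumptions made for illustration only; an actual implementation may use a full SQL parser.

```python
import re

# Assumption: a small, incomplete list of SQL reserved words to be skipped.
SQL_RESERVED = {
    "select", "from", "where", "and", "or", "not", "in", "like",
    "order", "group", "by", "having", "as", "limit",
}


def query_to_word_sequence(query, extra_tokens=()):
    """Extract keywords from an SQL statement and join them with spaces."""
    # Keep runs of word characters and dots (identifiers, numbers such as 1.2).
    candidates = re.findall(r"[\w.]+", query)
    keywords = [c for c in candidates if c.lower() not in SQL_RESERVED]
    # Optionally append other data or parameters, as the text allows.
    return " ".join(keywords + list(extra_tokens))


sql = "SELECT * FROM casting WHERE segment = 'sector' AND pull_speed > 1.2"
print(query_to_word_sequence(sql))
```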

For a specific method of forming the word sequence, reference may be made to the description of the related content in fig. 2, which is not repeated here.

In a specific implementation, the similarity between vector groups may be determined by a cosine similarity calculation method or a maximum norm algorithm. Specifically, when two vector groups are compared, the selected similarity method outputs the similarity value of that pair of vector groups.
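The two similarity measures named above may be sketched as follows, under stated assumptions. The embodiment does not specify how a group of several vectors is reduced to a single comparison, so this sketch averages the vectors of each group first; the mapping of the maximum norm distance to a similarity value via 1 / (1 + d) is likewise an assumption, and all function names are hypothetical.

```python
import math


def mean_vector(group):
    """Average the word vectors of one group (assumed pooling step)."""
    return [sum(column) / len(group) for column in zip(*group)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def max_norm_similarity(a, b):
    # Maximum norm (Chebyshev) distance, mapped into (0, 1] by an assumed formula.
    distance = max(abs(x - y) for x, y in zip(a, b))
    return 1.0 / (1.0 + distance)


def group_similarity(group_a, group_b, metric=cosine_similarity):
    """Similarity value of a pair of word vector groups."""
    return metric(mean_vector(group_a), mean_vector(group_b))


print(group_similarity([[1.0, 0.0], [1.0, 0.0]], [[2.0, 0.0]]))
print(group_similarity([[0.0, 0.0]], [[1.0, 3.0]], metric=max_norm_similarity))
```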

In a specific implementation, the word vector groups corresponding to the word sequences may be clustered with an existing clustering algorithm, for example a K-nearest neighbor (K-Neighbors) clustering algorithm or a K-means (K-Means) clustering algorithm. By clustering the plurality of word vector groups, word vector groups with the same or similar semantics are grouped together to obtain a plurality of word vector group candidate sets. That is, the purpose of clustering is that the similarity between word vector groups in the same candidate set is higher, while the similarity between word vector groups in different candidate sets is lower.

Further, screening the plurality of word vector group candidate sets by using a preset second similarity threshold includes: for each word vector group candidate set, determining a second similarity between every two word vector groups in the candidate set; and if the average value of the obtained second similarities is less than or equal to the second similarity threshold, discarding the word vector group candidate set.

It can be appreciated that, by screening the plurality of word vector group candidate sets obtained by clustering with the second similarity threshold, candidate sets whose word vector groups are not sufficiently similar to one another (average similarity less than or equal to the second similarity threshold) can be removed. With this scheme, on the one hand, the amount of data processed in the subsequent vector similarity matching is reduced and query efficiency is improved; on the other hand, the accuracy of the query result is improved.

In a specific implementation, the second similarity threshold may be set appropriately according to the actual application requirement. It should be understood that the second similarity threshold should not be too large, otherwise, the number of the word vector group sets obtained by screening is too small, so that the data amount of the subsequent determined query result is not abundant and comprehensive enough; the second similarity threshold should not be too small, otherwise, the similarity between the word vector groups in the screened word vector group set may be too small, thereby reducing the accuracy of the query result.

Without limitation, an appropriate value in the range [0.7, 1.0] may be chosen as the second similarity threshold.
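The screening rule described above may be sketched as follows. In this sketch each word vector group is represented by a single pooled vector and cosine similarity serves as the second similarity; both are assumptions, as are the function names and the treatment of a candidate set containing only one group.

```python
import math
from itertools import combinations


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def mean_pairwise_similarity(candidate_set):
    """Average of the second similarity over every pair in the candidate set."""
    pairs = list(combinations(candidate_set, 2))
    if not pairs:
        return 1.0  # Assumption: a single-member set is treated as coherent.
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)


def screen_candidate_sets(candidate_sets, second_threshold=0.7):
    """Discard sets whose average second similarity is <= the threshold."""
    return [s for s in candidate_sets
            if mean_pairwise_similarity(s) > second_threshold]


tight = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.05]]
loose = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
kept = screen_candidate_sets([tight, loose])
print(len(kept))
```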

Further, selecting a word vector group from each word vector group set as a word vector group to be matched, including: randomly selecting a word vector group from each word vector group set as a word vector group to be matched; or selecting the word vector group serving as the clustering center from each word vector group set as the word vector group to be matched.

The clustering center may be an initial clustering center point selected during the execution of the clustering algorithm, or may also be a center point (also referred to as an ending clustering center point) of each class obtained by ending the execution of the clustering algorithm. Taking K-Means as an example, in the algorithm execution process, K points are required to be selected as initial clustering center points, then each point is associated with the closest clustering center point according to the distance from the K initial clustering center points, and all points associated with the same clustering center point are clustered into one type or one group; then calculating the average value of each group, and adopting the average value as the associated central point of the group; the above steps are repeated until the center point is no longer changed. The center point of each class obtained at this time is the ending cluster center point.
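The K-Means procedure described above may be sketched in a few lines as follows. This is a minimal illustration rather than a production implementation: the function name `kmeans`, the random choice of initial centre points with a fixed seed, and the use of Euclidean distance are assumptions.

```python
import math
import random


def kmeans(points, k, max_iterations=100, seed=0):
    """Return (ending centre points, cluster index of each point)."""
    rng = random.Random(seed)
    # Select K points as the initial clustering centre points.
    centres = [list(p) for p in rng.sample(points, k)]
    assignment = []
    for _ in range(max_iterations):
        # Associate each point with its closest centre point.
        assignment = [
            min(range(k), key=lambda c: math.dist(p, centres[c])) for p in points
        ]
        new_centres = []
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                # The mean of the group becomes its new centre point.
                new_centres.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new_centres.append(centres[c])  # keep the centre of an empty group
        if new_centres == centres:  # centre points no longer change
            break
        centres = new_centres
    return centres, assignment


points = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centres, assignment = kmeans(points, k=2)
print(assignment)
```

The `centres` returned here are the ending clustering centre points; the initial centre points are the members drawn by `rng.sample`.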

In the embodiment of the invention, a plurality of word vector group candidate sets are determined by clustering, and the second similarity threshold is then used for further screening to obtain a plurality of word vector group sets, where the word vector groups in each word vector group set have a higher similarity to one another. Thus, in the query process, it is sufficient to determine the similarity between the word vector group to be queried and the single word vector group to be matched selected from each word vector group set. Compared with matching the word vector group to be queried against all or most of the word vector groups in each word vector group set, this greatly reduces the amount of computation and improves query efficiency.
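The selection of one word vector group to be matched per set, and the matching of the query against only those representatives, may be sketched as follows. The function names are hypothetical, each word vector group is represented by a single pooled vector, the member closest to the centroid is taken as the clustering centre, and the first similarity threshold value 0.8 is chosen purely for illustration; all of these are assumptions.

```python
import math
import random


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def pick_representative(vector_set, strategy="centre", rng=random):
    """Select the word vector group to be matched from one set."""
    if strategy == "random":
        return rng.choice(vector_set)
    centroid = [sum(col) / len(vector_set) for col in zip(*vector_set)]
    # Assumption: the member closest to the centroid acts as the cluster centre.
    return min(vector_set, key=lambda v: math.dist(v, centroid))


def match_query(query_vector, vector_sets, first_threshold=0.8):
    """Return the index of the best-matching set, or None if below threshold."""
    representatives = [pick_representative(s) for s in vector_sets]
    # Only one comparison per set is needed, not one per member.
    scores = [cosine(query_vector, r) for r in representatives]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best if scores[best] >= first_threshold else None


sets = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
print(match_query([0.1, 1.0], sets))
print(match_query([-1.0, -1.0], sets))
```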

Still further, after deriving the plurality of sets of word vector sets, the method further comprises: determining a pre-added word vector group according to a preset external data source; for each pre-added word vector group, determining a third similarity between the pre-added word vector group and the word vector group to be matched in each word vector group set; and if the maximum third similarity is greater than or equal to the third similarity threshold, adding the pre-added word vector group into a word vector group set to which the word vector group to be matched with the maximum third similarity belongs.

In the embodiment of the invention, the word vector group sets obtained after screening are expanded with an external data source, so that the query range is enlarged and richer queried objects are obtained. Thus, while the amount of computation is reduced and query efficiency is improved, the richness and comprehensiveness of the data in the query result are further effectively improved.
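The expansion step based on the third similarity may be sketched as follows. The function name `expand_sets`, the use of cosine similarity as the third similarity, the pooled single-vector representation of each group, and the threshold value 0.8 are assumptions made for this sketch.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def expand_sets(vector_sets, representatives, pre_added, third_threshold=0.8):
    """Add each pre-added group to the set whose representative it matches best."""
    expanded = [list(s) for s in vector_sets]  # leave the input untouched
    for vec in pre_added:
        scores = [cosine(vec, rep) for rep in representatives]
        best = max(range(len(scores)), key=scores.__getitem__)
        # Only join a set if the maximum third similarity reaches the threshold.
        if scores[best] >= third_threshold:
            expanded[best].append(vec)
    return expanded


sets = [[[1.0, 0.0]], [[0.0, 1.0]]]
reps = [[1.0, 0.0], [0.0, 1.0]]
expanded = expand_sets(sets, reps, pre_added=[[0.9, 0.1], [-1.0, -1.0]])
print(expanded)
```

In this example the second pre-added vector is similar to neither representative, so it is not added to any set.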

In some embodiments, the external data source is selected from: images, video, and audio. Determining the pre-added word vector group according to the preset external data source may include: determining one or more frames to be processed according to the external data source; determining text annotation information of each frame to be processed; determining candidate words to be converted according to the text annotation information; and performing vector conversion on the candidate words to be converted corresponding to each frame to be processed, so as to obtain the pre-added word vector group corresponding to the external data source.

Taking video as an example of the external data source, determining one or more frames to be processed according to the external data source may specifically include: splitting the video into a plurality of video frames; and then performing frame extraction on the plurality of video frames to obtain one or more frames to be processed. The frame extraction may extract one frame every preset number of frames, or one frame every preset time length, or may extract frames in other ways according to actual application requirements.
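The two frame extraction rules mentioned above reduce to choosing frame indices, which may be sketched as follows. The function names and the assumption of a constant frame rate are illustrative only; decoding the video into frames is outside the scope of this sketch.

```python
def sample_every_n_frames(frame_count, step):
    """Indices of the frames kept when one frame is taken every `step` frames."""
    if step < 1:
        raise ValueError("step must be at least 1")
    return list(range(0, frame_count, step))


def sample_every_t_seconds(frame_count, fps, interval_seconds):
    """Indices of the frames kept when one frame is taken every time interval."""
    # Assumption: the video has a constant frame rate `fps`.
    step = max(1, round(fps * interval_seconds))
    return list(range(0, frame_count, step))


print(sample_every_n_frames(10, 4))
print(sample_every_t_seconds(100, fps=25, interval_seconds=2))
```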

Further, the method for determining text annotation information of each frame to be processed can comprise the following steps: determining an object to be identified in an image by adopting a manual mode, an image identification mode or a neural network model; and then determining text annotation information according to the object to be identified. For example, for an image acquired for a crystallizer, the liquid level value and/or vibration frequency of the crystallizer in the image can be determined in the above manner, and the liquid level value and/or vibration frequency of the crystallizer in the image can be used as text labeling information of the image.

In other embodiments, the type of external data source may further include one or more of the following: Word documents, Excel documents, HyperText Markup Language (HTML) documents, and JavaScript Object Notation (JSON) strings.

The types of external data sources listed above are text-type data sources. For a text-type data source, a plurality of pre-added keywords can be determined by word segmentation, keyword extraction, and similar means, and vector conversion is then performed on the pre-added keywords to obtain the pre-added word vector group.
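For a JSON string, the determination of pre-added keywords may be sketched as follows. The function name `json_to_keywords` and the rule of keeping every key and every word of every scalar value are assumptions; a practical keyword extractor would typically also filter and rank the candidates.

```python
import json
import re


def json_to_keywords(json_string):
    """Collect candidate keywords from the keys and values of a JSON string."""
    keywords = []

    def walk(node):
        if isinstance(node, dict):
            for key, value in node.items():
                keywords.append(str(key))
                walk(value)
        elif isinstance(node, list):
            for item in node:
                walk(item)
        elif node is not None:
            # Split scalar values into words; dots are kept for numbers like 1.2.
            keywords.extend(re.findall(r"[\w.]+", str(node)))

    walk(json.loads(json_string))
    return keywords


doc = '{"segment": "sector 3", "nozzles": [12, 14], "water_pressure": 1.2}'
print(json_to_keywords(doc))
```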

It should be noted that the above-described method of determining the pre-added word vector set from the pre-set external data source is merely a non-limiting example, and in a specific implementation, other suitable methods may be used to implement the conversion between the external data source and the pre-added word vector set.

It will be appreciated that, because the query is based on vector similarity matching, data of various types can be converted into one or more corresponding word vectors that characterize the semantic features contained in the data. For example, for external data sources such as audio, video, and image files, word vector conversion can be performed after determining their text annotation information; for text-type external data sources such as Word documents, Excel documents, HTML documents, and JSON strings, a plurality of keywords can be determined by keyword extraction or similar means, and word vector conversion can then be performed.

Therefore, in the embodiment of the invention, the word vector group sets obtained from the original relational database can be expanded with different types of external data sources according to the requirements of the actual application scenario, so as to enlarge the query range. Further, because audio, video, image files, and the like often contain more comprehensive and richer information than text, this helps to obtain richer, more complete, and more comprehensive query results.

In a specific embodiment of step S14, the determining the query result based on the word vector group to which the maximum first similarity belongs includes: determining a word vector group to be matched to which the maximum first similarity belongs; and determining the query result according to the word vector group set to which the word vector group to be matched belongs.

Specifically, determining the query result according to the set of word vector groups to which the word vector group to be matched belongs may include: determining a corresponding word sequence for each word vector group in the word vector group set; and then, collecting word sequences corresponding to each word vector group in the word vector group set as the query result.

Further, after the word sequences corresponding to the word vector groups are determined, the data source (for example, audio, video, or a Word document) corresponding to each word sequence may also be determined, and the set of word sequences together with the corresponding data sources is then returned to the user as the query result.
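The assembly of the query result from the matched word vector group set may be sketched as follows. The function name `build_query_result`, the use of integer group IDs, and the two lookup dictionaries are hypothetical bookkeeping structures introduced for this sketch.

```python
def build_query_result(matched_set, sequence_of, source_of):
    """Map each word vector group in the matched set back to its word sequence.

    matched_set: IDs of the word vector groups in the matched set.
    sequence_of: group ID -> word sequence the group was converted from.
    source_of:   group ID -> external data source, where one exists.
    """
    result = []
    for group_id in matched_set:
        result.append({
            "word_sequence": sequence_of[group_id],
            # Assumption: groups without an external source come from the database.
            "data_source": source_of.get(group_id, "relational database"),
        })
    return result


sequence_of = {0: "crystallizer liquid_level_high", 1: "crystallizer vibration_frequency_medium"}
source_of = {1: "crystallizer_camera.mp4"}
result = build_query_result([0, 1], sequence_of, source_of)
print(result)
```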

In the embodiment of the invention, vector conversion is performed on the content of the relational database to obtain a plurality of word vector groups, and the query result is then determined by calculating the similarity between word vector groups. In the prior art, query methods based on relational database SQL or on a knowledge graph return query results only by querying keywords. By contrast, this embodiment does not depend on keyword matching, but queries based on similarity matching of the vector groups corresponding to word sequences. Because the word vector groups contain the semantic features of the corresponding word sequences, the similarity between word vector groups reflects the semantic similarity between word sequences. Therefore, with this embodiment, all associated information with a high semantic similarity to the query condition (a similarity greater than or equal to the first similarity threshold) can be returned, and the associated information can include data of different types such as text, video, audio, and images, so that the richness and comprehensiveness of the data in the query result are effectively improved. Furthermore, compared with the traditional query method based on keyword matching, the query method based on semantic similarity matching can obtain more intelligent query (recommendation) results.

For example, in an industrial continuous casting scenario, for an abnormal event such as an excessively high crystallizer liquid level, the existing query technology can only retrieve the single abnormal event itself. With this embodiment, the information related to the crystallizer can be returned in full, including but not limited to the maximum liquid level value, the minimum liquid level value, and the vibration frequency of the crystallizer, so that the abnormal event can be analyzed on the basis of richer information.

For another example, suppose information such as the amount, flux, and water pressure of the secondary cooling water in the continuous casting sector segment process is to be queried. With the prior art, the corresponding information can be returned only after the keyword "sector segment" is queried, namely the amount, flux, water pressure, and so on of the secondary cooling water in the tuples containing that keyword. With this embodiment, in addition to the above information, other information that is semantically similar or semantically related to the sector segment can be found through similarity matching of word vector groups, such as the number of secondary cooling nozzles, the secondary cooling water density, the secondary cooling water pressure distribution, and real-time secondary cooling video of the sector segment. Since this information does not belong to the tuples containing "sector segment", it would not be returned by existing SQL query methods.

Referring to fig. 3, fig. 3 is a flowchart of a cognitive intelligent query method in an industrial continuous casting scene according to an embodiment of the present invention. The cognitive intelligent query method in the industrial continuous casting scene specifically can comprise steps S301 to S309.

In step S301, a continuous casting process data table is determined.

Determining the continuous casting process data table may specifically include: determining a relational database, for example a continuous casting quality relational database in the field of industrial continuous casting, which may contain one or more continuous casting process data tables; and then selecting some or all of the continuous casting process data tables from the continuous casting quality relational database. In this embodiment, the continuous casting quality relational database may be referred to as an internal data source. An example of a continuous casting process data table contained in the continuous casting quality relational database is given in S301 of fig. 3.

In step S302, text conversion is performed on each tuple contained in the continuous casting process data table, so as to obtain a word sequence corresponding to each tuple. Some of the word sequences are given by way of example in S302 of fig. 3.

In step S303, vector conversion is performed on each word sequence using a machine learning model based on word embedding.

In step S304, a set of word vector groups is determined based on the vector conversion result of step S303. Some of the word vectors are given by way of example in S304 of fig. 3.

Further, after determining the set of word vector groups based on the internal data source, step S305 may also be performed, in which step S305 the set of word vector groups is expanded based on an external data source.

Specifically, a pre-added set of word vectors may be determined based on an external data source and added to the set of word vectors to augment the set of word vectors.

In step S306, a query condition is input.

In step S307, a set of word vectors to be queried is determined.

Specifically, the query condition may be converted into the word vector set to be queried.

In step S308, vector similarity matching is performed.

Specifically, the word vector group to be queried may be subjected to similarity matching with at least a part of the word vector groups in the word vector group set.

In step S309, a cognitive intelligent query result is determined based on the similarity matching result.

For more details on the steps in fig. 3, reference may be made to the foregoing and related descriptions in fig. 2, and details are not repeated here.

Referring to fig. 4, fig. 4 is a schematic structural diagram of a cognitive intelligent query device according to an embodiment of the present invention. The cognitive intelligent query device may include:

a text conversion module 41 for performing text conversion on the relational database to determine a plurality of word sequences, each word sequence containing one or more words;

a vector conversion module 42, configured to perform vector conversion on at least a part of words in each word sequence, so as to determine a word vector group corresponding to the word sequence;

the similarity determining module 43 is configured to convert the query condition into a word vector set to be queried, and then determine a first similarity between the word vector set to be queried and at least a part of word vector sets in the word vector sets corresponding to each word sequence;

the query result determining module 44 is configured to determine a query result based on the word vector group to which the maximum first similarity belongs if the obtained maximum first similarity is equal to or greater than the first similarity threshold.

Regarding the principle, implementation and beneficial effects of the cognitive intelligent query device, please refer to the foregoing and the related descriptions of the cognitive intelligent query method shown in fig. 1 to 3, which are not repeated herein.

The embodiment of the invention also provides a computer readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, performs the steps of the cognitive intelligent query method shown in figs. 1 to 3 above. The computer readable storage medium may include non-volatile memory or non-transitory memory, and may also include optical disks, mechanical hard disks, solid state disks, and the like.

Specifically, in the embodiment of the present invention, the processor may be a central processing unit (CPU), or may be another general purpose processor, a digital signal processor (DSP), an application specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, and so on. A general purpose processor may be a microprocessor or any conventional processor.

It should also be appreciated that the memory in embodiments of the present application may be volatile memory or nonvolatile memory, or may include both volatile and nonvolatile memory. The nonvolatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which acts as an external cache. By way of example but not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchronous link DRAM (SLDRAM), and direct Rambus RAM (DR RAM).

The embodiment of the invention also provides a terminal, which comprises a memory and a processor, wherein the memory stores a computer program capable of running on the processor, and the processor executes the steps of the cognitive intelligent query method shown in the figures 1 to 3 when running the computer program. The terminal can include, but is not limited to, terminal equipment such as a mobile phone, a computer, a tablet computer, a server, a cloud platform, and the like.

It should be understood that the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may mean: A exists alone, both A and B exist, or B exists alone. In addition, the character "/" herein indicates that the associated objects before and after it are in an "or" relationship.

The term "plurality" as used in the embodiments herein refers to two or more.

Descriptions such as "first" and "second" in the embodiments of the present application are used only to illustrate and distinguish the objects described. They imply no order, do not indicate any particular limitation on the number of devices in the embodiments of the present application, and should not be construed as limiting the embodiments of the present application.

It should be noted that the serial numbers of the steps in the present embodiment do not represent a limitation on the execution sequence of the steps.

Although the present invention is disclosed above, the present invention is not limited thereto. Various changes and modifications may be made by one skilled in the art without departing from the spirit and scope of the invention, and the scope of the invention should be assessed accordingly to that of the appended claims.

Claims

1. A cognitive intelligent query method, characterized by comprising the following steps:

performing text conversion on the relational database to determine a plurality of word sequences, each word sequence containing one or more words;

performing vector conversion on at least a part of words in each word sequence to determine a word vector group corresponding to the word sequence;

converting the query condition into a word vector group to be queried, and then determining a first similarity between the word vector group to be queried and at least a part of word vector groups in the word vector groups corresponding to each word sequence;

and if the obtained maximum first similarity is greater than or equal to a first similarity threshold value, determining a query result based on a word vector group to which the maximum first similarity belongs.

2. The method of claim 1, wherein the relational database is a continuous casting quality relational database comprising a plurality of continuous casting process data tables, each continuous casting process data table comprising one or more tuples, each tuple comprising one or more items of continuous casting process data;

the performing text conversion on the relational database to determine a plurality of word sequences comprises:

and for each continuous casting process data table, performing text conversion on at least one part of continuous casting process data of each tuple in the continuous casting process data table to determine a word corresponding to each piece of continuous casting process data, thereby obtaining a word sequence corresponding to the tuple.

3. A method according to claim 1 or 2, wherein the type of data contained in the relational database is selected from:

words, strings, images, video, audio, and documents.

4. The method of claim 1, wherein prior to determining the first similarity between the set of word vectors to be queried and at least a portion of the set of word vectors corresponding to each word sequence, the method further comprises:

clustering the word vector groups corresponding to the determined word sequences to obtain a plurality of word vector group candidate sets;

screening the plurality of word vector group candidate sets by adopting a preset second similarity threshold to obtain a plurality of word vector group sets;

and selecting one word vector group from each word vector group set as a word vector group to be matched, and taking the selected plurality of word vector groups to be matched as at least one part of word vector groups.

5. The method of claim 4, wherein the filtering the plurality of candidate sets of word vector sets using a predetermined second similarity threshold comprises:

for each candidate set of word vector sets, determining a second similarity between each two word vector sets in the candidate set of word vector sets;

if the average value of the obtained second similarity is smaller than or equal to the second similarity threshold value, discarding the candidate set of word vector groups.

6. The method of claim 4, wherein selecting a set of word vectors from each set of word vectors as the set of word vectors to be matched comprises:

randomly selecting a word vector group from each word vector group set as a word vector group to be matched;

or,

and selecting the word vector group serving as a clustering center from each word vector group set as a word vector group to be matched.

7. The method of any one of claims 4 to 6, wherein after deriving the plurality of sets of word vector groups, the method further comprises:

determining a pre-added word vector group according to a preset external data source;

for each pre-added word vector group, determining a third similarity between the pre-added word vector group and the word vector group to be matched in each word vector group set;

and if the maximum third similarity is greater than or equal to the third similarity threshold, adding the pre-added word vector group into a word vector group set to which the word vector group to be matched with the maximum third similarity belongs.

8. The method of claim 7, wherein the external data source is selected from the group consisting of: image, video, and audio;

the determining the pre-added word vector group according to the preset external data source comprises the following steps:

determining one or more frames to be processed according to the external data source;

determining text labeling information of each frame to be processed;

determining candidate words to be converted according to the text annotation information;

and performing vector conversion on the candidate words to be converted corresponding to each frame to be processed, so as to obtain a pre-added word vector group corresponding to the external data source.

9. The method of claim 7, wherein the type of external data source further comprises one or more of:

Word document, Excel document, HTML document, JSON string.

10. The method of claim 4, wherein determining the query result based on the word vector group to which the maximum first similarity belongs comprises:

determining a word vector group to be matched to which the maximum first similarity belongs;

and determining the query result according to the word vector group set to which the word vector group to be matched belongs.
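Claim 10 can be illustrated with the following sketch, which compares the query only against the representative of each set and, on a match, returns the whole set. The similarity measure (cosine between mean vectors) and the name `query_by_representative` are assumptions, not taken from the claims.

```python
import math

def group_similarity(a, b):
    """Assumed measure: cosine similarity between the mean vectors of two groups."""
    if not a or not b:
        return 0.0
    ma = [sum(col) / len(a) for col in zip(*a)]
    mb = [sum(col) / len(b) for col in zip(*b)]
    dot = sum(x * y for x, y in zip(ma, mb))
    norm = math.sqrt(sum(x * x for x in ma)) * math.sqrt(sum(y * y for y in mb))
    return dot / norm if norm else 0.0

def query_by_representative(query_group, group_sets, representatives, threshold):
    """Return the word vector group set whose representative best matches the query.

    Returns None when the maximum first similarity is below the threshold.
    """
    sims = [group_similarity(query_group, rep) for rep in representatives]
    if not sims:
        return None
    best = max(range(len(sims)), key=sims.__getitem__)
    return group_sets[best] if sims[best] >= threshold else None
```

Returning the full set rather than the single best group is what lets the result include related records that do not literally match the query condition.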

11. The method of claim 1, wherein performing vector conversion on at least a part of the words in each word sequence to determine a word vector group corresponding to the word sequence comprises:

inputting at least a part of the words in each word sequence into a word-embedding-based machine learning model, to obtain the word vector group corresponding to the word sequence.

12. The method of claim 11, wherein the word-embedding-based machine learning model is selected from: a Word2Vec model and a GloVe model.

13. A cognitive intelligent query device, comprising:

a text conversion module, configured to perform text conversion on a relational database to determine a plurality of word sequences, wherein each word sequence comprises one or more words;

a vector conversion module, configured to perform vector conversion on at least a part of the words in each word sequence to determine a word vector group corresponding to the word sequence;

a similarity determining module, configured to convert a query condition into a word vector group to be queried, and then determine a first similarity between the word vector group to be queried and at least a part of the word vector groups corresponding to the word sequences;

and a query result determining module, configured to determine a query result based on the word vector group to which the maximum first similarity belongs if the obtained maximum first similarity is greater than or equal to a first similarity threshold.
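The four modules of claim 13 can be sketched as methods of one class. This is an illustrative toy, not the claimed device: rows are plain dictionaries standing in for relational records, `embeddings` is an assumed word-to-vector table in place of a trained embedding model, and similarity is assumed to be the cosine between mean vectors. The class name `CognitiveQueryDevice` is hypothetical.

```python
import math

class CognitiveQueryDevice:
    def __init__(self, embeddings, threshold):
        self.embeddings = embeddings    # assumed dict: word -> vector
        self.threshold = threshold      # first similarity threshold

    def text_conversion(self, rows):
        """Text conversion module: one word sequence per row."""
        return [[str(v).lower() for v in row.values()] for row in rows]

    def vector_conversion(self, words):
        """Vector conversion module: words without a known vector are skipped."""
        return [self.embeddings[w] for w in words if w in self.embeddings]

    def similarity(self, a, b):
        """Similarity determining module (assumed: cosine of mean vectors)."""
        if not a or not b:
            return 0.0
        ma = [sum(col) / len(a) for col in zip(*a)]
        mb = [sum(col) / len(b) for col in zip(*b)]
        dot = sum(x * y for x, y in zip(ma, mb))
        norm = math.sqrt(sum(x * x for x in ma)) * math.sqrt(sum(y * y for y in mb))
        return dot / norm if norm else 0.0

    def query(self, rows, condition):
        """Query result determining module: best row, or None below threshold."""
        groups = [self.vector_conversion(s) for s in self.text_conversion(rows)]
        wanted = self.vector_conversion(condition.lower().split())
        sims = [self.similarity(wanted, g) for g in groups]
        if not sims:
            return None
        best = max(range(len(sims)), key=sims.__getitem__)
        return rows[best] if sims[best] >= self.threshold else None
```

With a table in which "vehicle" and "car" have nearby vectors, a query for "vehicle" can return a row that contains only "car", which is the behavior the similarity threshold is meant to enable.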

14. A computer readable storage medium having stored thereon a computer program, wherein the computer program, when executed by a processor, performs the steps of the cognitive intelligent query method of any one of claims 1 to 12.

15. A terminal comprising a memory and a processor, the memory having stored thereon a computer program executable on the processor, wherein the processor, when running the computer program, performs the steps of the cognitive intelligent query method of any one of claims 1 to 12.