CN116501947B - Construction method, system and equipment of semantic search cloud platform and storage medium - Google Patents

Construction method, system and equipment of semantic search cloud platform and storage medium

Info

Publication number
CN116501947B
CN116501947B (application CN202310735695.7A)
Authority
CN
China
Prior art keywords
service
scoring
preset
data
node
Prior art date
Legal status
Active
Application number
CN202310735695.7A
Other languages
Chinese (zh)
Other versions
CN116501947A (en)
Inventor
王素平
朱立谷
赵虹宁
Current Assignee
Communication University of China
Original Assignee
Communication University of China
Priority date
Filing date
Publication date
Application filed by Communication University of China
Priority to CN202310735695.7A
Publication of CN116501947A
Application granted
Publication of CN116501947B
Legal status: Active
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval of structured data, e.g. relational data
    • G06F16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management


Abstract

The invention provides a construction method of a semantic search cloud platform, belonging to the technical field of deep learning. By constructing a hierarchical architecture for the cloud platform and creating a corpus storage cluster and a distributed vector database in the storage middleware layer, the method ensures efficient storage of massive public opinion data together with its semantic vectorization (embedding) and search, breaks through the performance bottleneck of a single-node database, and realizes hybrid scalar-vector queries, solving the problem that keyword search can overlook public opinion data with latent semantic similarity. The corpus storage cluster and the distributed vector database are designed to form a distributed underlying storage cluster service covering unstructured corpus storage and massive vectorized index storage, which addresses the prior-art need to elastically expand and contract storage space as large-scale data grows in real time. Finally, by designing the architecture of a containerized deployment layer and presetting an embedding pipeline expansion scheduling rule, the collected public opinion data is embedded incrementally in real time, finally realizing an end-to-end semantic search system.

Description

Construction method, system and equipment of semantic search cloud platform and storage medium
Technical Field
The invention belongs to the technical field of deep learning, and particularly relates to a construction method, system, equipment and storage medium of a semantic search cloud platform.
Background
In the era of data intelligence, public opinion data generated on social media is growing explosively. Faced with sudden, random and diversified media public opinion data, how to store massive data reasonably and efficiently while still being able to search out valuable, relevant information quickly and accurately is an urgent problem in constructing a large-scale, multi-modal, incremental information search system.
Traditional keyword-matching search engines rely primarily on matching keywords in queries against keywords in documents, usually using statistical measures such as term frequency-inverse document frequency (TF-IDF) to weigh keyword importance. Such a system focuses mainly on how often and where keywords occur in a document rather than on the document's overall meaning. This traditional retrieval mode is therefore severely limited: it struggles to understand the user's query intent and semantic relations, is easily misled by problems such as word-sense ambiguity, and so cannot meet the user's need to acquire information quickly. Meanwhile, the ranking of the returned news article list is also influenced by factors such as keyword occurrence counts, leading to unreasonable orderings of search results.
Disclosure of Invention
In view of these limitations of traditional keyword-matching search engines, the invention provides a construction method, system, equipment and storage medium for a semantic search cloud platform, to overcome at least one technical problem in the prior art.
In order to achieve the above object, the present invention provides a method for constructing a semantic search cloud platform, including:
creating a hierarchical architecture of a semantic search cloud platform; the hierarchical architecture comprises a business micro-service layer, a storage middleware layer and a containerized deployment layer;
respectively creating, in the business micro-service layer, a data acquisition service for acquiring multi-modal corpus in real time, a vectorization model service for vectorizing the acquired multi-modal corpus, an index service for constructing an index over the vectors of the multi-modal corpus, and a semantic search service for user search input; the semantic search service establishes a connection relationship with the vectorization model service;
respectively creating a corpus storage cluster for storing multi-modal corpuses and a distributed vector database for storing vectors of indexed multi-modal corpuses in the storage middleware layer, and establishing a connection relationship between the distributed vector database and the vectorization model service; establishing a mapping relation between the distributed vector database and the corpus storage cluster;
and, in the containerized deployment layer, deploying the data in the storage middleware layer to a preset container based on a preset embedding pipeline expansion scheduling rule, so as to complete the construction of the semantic search cloud platform.
In order to solve the above problems, the present invention further provides a system for constructing a semantic search cloud platform, comprising:
the architecture creation module is used for creating a hierarchical architecture of the semantic search cloud platform; the hierarchical architecture comprises a business micro-service layer, a storage middleware layer and a containerized deployment layer;
the micro-service creation module is used for respectively creating, in the business micro-service layer, a data acquisition service for acquiring multi-modal corpus in real time, a vectorization model service for vectorizing the acquired multi-modal corpus, an index service for constructing an index over the vectors of the multi-modal corpus, and a semantic search service for user search input; the semantic search service establishes a connection relationship with the vectorization model service;
the storage creation module is used for respectively creating a corpus storage cluster for storing the multi-modal corpus and a distributed vector database for storing the vector of the indexed multi-modal corpus at the storage middleware layer, and establishing a connection relationship between the distributed vector database and the vectorization model service; establishing a mapping relation between the distributed vector database and the corpus storage cluster;
the deployment module is used for deploying, in the containerized deployment layer, the data in the storage middleware layer to a preset container based on a preset embedding pipeline expansion scheduling rule, so as to complete the construction of the semantic search cloud platform.
In order to solve the above problems, the present invention also provides an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform steps in a method of constructing a semantic search cloud platform as previously described.
In order to solve the above problems, the present invention further provides a computer readable storage medium, in which at least one instruction is stored, and when the at least one instruction is executed by a processor in an electronic device, the method for constructing the semantic search cloud platform is implemented.
According to the construction method, system, equipment and storage medium of the semantic search cloud platform provided by the invention, a hierarchical architecture comprising a business micro-service layer, a storage middleware layer and a containerized deployment layer is constructed, and a corpus storage cluster for storing multi-modal corpus and a distributed vector database for storing the vectors of the indexed multi-modal corpus are created in the storage middleware layer. This ensures efficient storage and semantic vectorization of massive public opinion data, breaks through the performance bottleneck of a single-node database, and realizes hybrid scalar-vector queries, solving the problem that keyword search can overlook public opinion data with latent semantic similarity. The corpus storage cluster and the distributed vector database are designed to form a distributed underlying storage cluster service covering unstructured corpus storage and massive vectorized index storage, which addresses the prior-art need to elastically expand and contract storage space as large-scale data grows in real time. By designing the architecture of the containerized deployment layer and presetting an embedding pipeline expansion scheduling rule, the collected public opinion data is embedded incrementally in real time, finally realizing an end-to-end semantic search system.
Drawings
In order to illustrate the embodiments of the invention or the technical solutions in the prior art more clearly, the drawings required by the embodiments or the description of the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the invention; other drawings may be obtained from them by a person skilled in the art without inventive effort.
FIG. 1 is a schematic flow chart of a method for constructing a semantic search cloud platform according to an embodiment of the present invention;
FIG. 2 is a schematic block diagram of a system for constructing a semantic search cloud platform according to an embodiment of the present invention;
FIG. 3 is a schematic diagram of the internal structure of an electronic device for implementing a method for constructing a semantic search cloud platform according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram of the hierarchical architecture of a semantic search cloud platform according to an embodiment of the present invention;
FIG. 5 is a schematic deployment diagram of a corpus storage cluster according to an embodiment of the present invention;
FIG. 6 is a diagram of the overall architecture for vectorization and index construction of a multi-modal corpus according to an embodiment of the present invention;
FIG. 7 is a flowchart of index vector construction according to an embodiment of the present invention;
FIG. 8 is a flow chart of semantic search according to an embodiment of the present invention;
FIG. 9 is a flowchart of the preset embedding pipeline expansion scheduling rule according to an embodiment of the present invention;
FIG. 10 is a flow diagram of the preferred (scoring) phase of FIG. 9 according to an embodiment of the present invention.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
Based on the problems in the prior art, the invention mainly provides a construction method, system, equipment and storage medium for a semantic search cloud platform. It mainly aims to solve the following problems: traditional keyword-matching information search engines are limited, struggling to understand the user's query intent and semantic relations and being easily affected by word-sense ambiguity, so they cannot meet the user's need to acquire information quickly; the traditional feature-vector extraction process for multi-modal data is complex and cumbersome, with environment configuration consuming a great deal of time and effort; and single-node database storage suffers from performance bottlenecks.
Fig. 1 is a flow chart of a method for constructing a semantic search cloud platform according to an embodiment of the present invention. The method may be performed by a system, which may be implemented in software and/or hardware.
FIG. 1 illustrates the construction method of the semantic search cloud platform in its entirety. As shown in FIG. 1, in this embodiment, the method for constructing the semantic search cloud platform includes steps S110 to S140.
S110, creating a hierarchical architecture of a semantic search cloud platform; the hierarchical architecture comprises a business micro-service layer, a storage middleware layer and a containerized deployment layer.
Specifically, aiming at the defects of keyword matching in traditional information search engines and the problem of elastically expanding and contracting storage space as large-scale data grows in real time, the invention designs a hierarchical architecture for a distributed underlying storage cluster service. It comprises a business micro-service layer that acquires corpus in real time as the massive corpus changes, a storage middleware layer providing unstructured corpus storage and massive vectorized index storage for user search input, and a containerized deployment layer for deploying data. To address the complexity of the traditional multi-modal feature-vector extraction (embedding) process, where environment configuration consumes a great deal of time and effort, the hierarchical architecture provides a packaged multi-modal data embedding micro-service by means of the underlying servers' hardware resources and the cluster's container orchestration capability, and externally provides a multi-modal semantic search service based on the semantic information extracted from the multi-modal data. In view of the real-time updating of massive data, a corpus storage cluster, preferably a distributed MongoDB cluster, is constructed in the semantic search cloud platform to realize a stable and reliable underlying corpus storage service and break through the performance bottleneck of single-node MongoDB.
As shown in FIG. 4, the cloud platform adopts a layered architecture mode, with different functions at different layers. The bottom layer is a PaaS platform with GPU resources and NFS storage servers, consisting of Master nodes, Worker nodes and storage servers. The Worker nodes consist of high-performance CPU servers and GPU servers. Each high-performance CPU server is equipped with 40 logical CPU cores and 128 GB of memory. In addition, there are two GPU servers, one with eight NVIDIA RTX 3080 graphics cards and the other with one NVIDIA Tesla V100 32 GB graphics card, which can help meet the GPU computing-power requirements of the semantic search service or accelerate the search process using the GPU. These physical machines are built into a highly available Kubernetes cluster, which provides isolation and scheduling between different user containers and gives the cloud platform the capability of horizontal expansion: when the storage space for large-scale collected public opinion data reaches its current upper limit, new worker nodes and resources can be added elastically without affecting the use of current system functions. Meanwhile, the whole cluster connects to the NAS storage server, which uses the NFS protocol to mount data from different container environments into distributed directory files, achieving persistent data storage.
The functional construction of the platform is carried out on top of this PaaS platform. First, the unstructured vector database Milvus and the MongoDB cluster for the large-scale corpus are built based on Docker container technology and Kubernetes cluster technology; these databases can rapidly and efficiently store, query and index the vector information of unstructured data. Second, the pipeline inference service and the model service are customized through code, allowing the data inference format, the data decoding algorithm and the data feature extraction, i.e. the embedding model invocation, to be set as required. Finally, pipeline images are built through a Dockerfile, specifying the CUDA version and high-performance inference frameworks such as NVIDIA Triton. Unstructured data processing is provided to users in the form of micro-services.
As an alternative embodiment of the invention, the hierarchical architecture further comprises a model distributed training layer;
and respectively creating, in the model distributed training layer, a vectorization model training service, a model iteration service for optimizing and iterating the vectorization model obtained from the training service, and a model updating service for pushing the vectorization model produced by iterative optimization to the vectorization model service.
Specifically, in order to improve the recall rate of vector-index-based semantic search over massive public opinion data, the cloud platform supports distributed training of the multi-modal search model; through model training, model updating, optimization iteration and similar steps, the selected optimal model is connected into the business micro-service layer to be called during search. In addition, during the distributed training of the model, the expansion scheduling strategy of the embedding pipeline is used to accelerate the training process and improve the resource utilization of the whole cloud platform.
As an alternative embodiment of the present invention, the hierarchical architecture further includes a system efficiency support layer;
and respectively creating, in the system efficiency support layer, a code warehouse for storing the development code of each service in the business micro-service layer, a traceable version control service for controlling the service development stages in the business micro-service layer, a continuous integration service for deploying the development code in the code warehouse, a continuous delivery service for delivering the code deployed through continuous integration to the business micro-service layer, and an image warehouse for storing the image files in the semantic search cloud platform.
Specifically, the system efficiency support layer is added to the cloud platform's hierarchical architecture to improve the development efficiency of the whole platform and guarantee automated workflows for each functional module in the business micro-service layer. Team collaborative development is realized through pull and push operations on code in a GitLab code warehouse; version control supports snapshots of each development stage of the system, facilitating quick version rollback when a functional bug appears; continuous integration and continuous delivery, combined with GitLab, complete efficient automated deployment of code; and an image warehouse is built to support and maintain each image file created in the system.
As an alternative embodiment of the present invention, the hierarchical architecture further includes a service monitoring and governance layer.
Specifically, a monitoring and governance system for each service module is realized in the cloud platform's hierarchical architecture, providing multi-dimensional dynamic resource monitoring together with version management and storage functions for different images. The service monitoring and governance system comprises functional modules such as service monitoring, service recovery, service configuration management, service state notification, data mount settings, image management, and image file storage settings. The whole cloud platform implements the DevOps concept: the front end and back end of the semantic search system are automatically deployed into the cloud platform cluster using CI/CD (continuous integration/continuous deployment), reducing tedious deployment work for developers and giving the whole system good version control. The platform's services uniformly use a RESTful API style, accessing back-end resources with methods such as GET and SEARCH, and load balancing and high availability of the whole cluster are realized through HAProxy/Nginx plus Keepalived.
S120, respectively creating, in the business micro-service layer, a data acquisition service for acquiring multi-modal corpus in real time, a vectorization model service for vectorizing the acquired multi-modal corpus, an index service for constructing an index over the vectors of the multi-modal corpus, and a semantic search service for user search input; the semantic search service establishes a connection relationship with the vectorization model service.
Specifically, a data acquisition service is created in the business micro-service layer, for example by setting up a data crawling port that crawls multi-modal corpus data from the network in real time; a vectorization model service is created to vectorize the collected multi-modal corpus data; an index service is created to build an index over the vectorized multi-modal corpus; and a semantic search service is then created, constructed with vector search strategies such as FAISS, Annoy and HNSW, whose core is solving the dense-vector similarity retrieval problem. In addition, for the public opinion data, i.e. the multi-modal corpus, techniques of data sharding and segmentation, data persistence, incremental data ingestion and hybrid scalar-vector queries are adopted to improve retrieval performance. Combining the unstructured and dense characteristics of high-dimensional vectors, the cloud platform introduces the Milvus vector database and vector recall technology, designed specifically for vector query and retrieval, so that indexes can be built over the vector data.
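As an illustration of the dense-vector similarity retrieval at the core of the semantic search service, the following is a minimal sketch using FAISS, one of the strategies named above; the embedding dimension, index parameters and random data are illustrative assumptions rather than values taken from this embodiment.

```python
import numpy as np
import faiss  # one of the vector search libraries named above

dim = 768  # assumed embedding dimension
index = faiss.IndexHNSWFlat(dim, 32)  # HNSW graph index, 32 links per node

# Stand-ins for corpus embeddings produced by the vectorization model service.
corpus_vectors = np.random.rand(10000, dim).astype("float32")
index.add(corpus_vectors)

# Stand-in for an embedded user query; retrieve the top-10 most similar vectors.
query = np.random.rand(1, dim).astype("float32")
distances, ids = index.search(query, 10)
print(ids[0])  # ids map back to documents in the corpus storage cluster
```

In the platform itself this role is played by the Milvus cluster described below; the sketch only shows the similarity-retrieval principle.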
As an alternative embodiment of the present invention, a service method of an index service includes:
establishing a connection with a distributed vector database;
and calculating, according to a preset index algorithm, over the vector set obtained by vectorizing the multi-modal corpus through the vectorization model service, generating a vector index for the vector set, and inserting the vectors and the corresponding index values into the distributed vector database.
Specifically, as shown in FIG. 7, the collected multi-modal corpus, such as text and images, is embedded by a deep learning semantic search model and converted into vectors, each comprising a vector id and vector data, for similarity indexing. The index file is the result of a clustering operation over the vector data, and records information such as the index type, the center vector of each cluster, and which vectors belong to each cluster.
After a client initiates an index-building request, the proxy receives the request and performs some static checks, then forwards it to the core scheduler, which persists the index request into the key-value store; an acknowledgement is returned to the proxy, and the proxy returns it to the client SDK. The actual index-building process is that the core scheduler initiates a request to the index controller to build the index.
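A hedged sketch of this index-building flow from the client side, using the pymilvus client for the Milvus vector database adopted in this embodiment; the collection name, dimension and index parameters are assumptions for illustration only.

```python
from pymilvus import (Collection, CollectionSchema, DataType,
                      FieldSchema, connections)

connections.connect(host="milvus-proxy", port="19530")  # assumed proxy address

fields = [
    FieldSchema(name="vec_id", dtype=DataType.INT64, is_primary=True),
    FieldSchema(name="embedding", dtype=DataType.FLOAT_VECTOR, dim=768),
]
collection = Collection("opinion_corpus", CollectionSchema(fields))

# Insert vectors (vector id + vector data) produced by the embedding step.
collection.insert([[1, 2, 3], [[0.1] * 768, [0.2] * 768, [0.3] * 768]])

# Build a clustering-based index; the resulting index file records the index
# type, the cluster center vectors and the assignment of vectors to clusters.
collection.create_index("embedding", {
    "index_type": "IVF_FLAT",
    "metric_type": "L2",
    "params": {"nlist": 128},
})
```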
S130, respectively creating a corpus storage cluster for storing multi-modal corpuses and a distributed vector database for storing vectors of indexed multi-modal corpuses in a storage middleware layer, and establishing a connection relation between the distributed vector database and a vectorization model service; and establishing a mapping relation between the distributed vector database and the corpus storage cluster.
Specifically, the multi-modal corpus of massive public opinion growing in real time is unstructured, and the underlying storage in the cloud platform is deployed as a MongoDB sharded cluster comprising three parts: mongos routing nodes provide request routing, config nodes store metadata and cluster configuration information, and mongod nodes are the actual storage nodes, maintained as replica sets to keep the cluster stable and to realize failover, switchover and recovery. The overall deployment of the constructed MongoDB cluster, i.e. the corpus storage cluster, is shown in FIG. 5.
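As a minimal sketch of how a client would address this sharded corpus cluster, assuming pymongo and illustrative host, database and collection names: connections go through the mongos routers and sharding is enabled on the corpus collection.

```python
from pymongo import MongoClient

client = MongoClient("mongodb://mongos-1:27017,mongos-2:27017/")  # mongos routers

# Shard the corpus collection by hashed _id so the real-time multi-modal
# corpus spreads across the mongod storage nodes (names are assumptions).
client.admin.command("enableSharding", "corpus_db")
client.admin.command("shardCollection", "corpus_db.documents",
                     key={"_id": "hashed"})

client.corpus_db.documents.insert_one(
    {"source": "weibo", "type": "text", "content": "..."}  # unstructured record
)
```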
For the flow of embedding-based vectorization of multi-modal data, the cloud platform combines a storage-compute separated infrastructure: the corpus storage cluster, preferably a MongoDB cluster, is combined with the distributed vector database, preferably a Milvus cluster, to realize parallel management of the original corpus and the vectorized index and to provide the underlying storage supporting the semantic search system, ensuring that the stored corpus is embedded synchronously and in real time. The vector space holds massive, high-dimensional feature vectors; setting appropriate configuration items for the data set, such as the index type and the vector distance calculation mode, according to the characteristics of the public opinion data can improve vector retrieval speed and accuracy. The overall architecture of the flow for vectorized-index requests, scheduling, messaging and so on is shown in FIG. 6.
In the underlying vector database architecture of the semantic search cloud platform, the design is divided into five layers to support flexible expansion and contraction of the platform and flexible scheduling of resources.
1) The request access layer is made up of multiple proxies. It provides a unified external connection interface and validates client requests. With the massively parallel processing architecture, the proxy component globally aggregates and processes the intermediate results returned by the execution nodes before returning them to the client.
2) The core scheduler module is responsible for issuing the various tasks, including cluster load-balancing management, data management and timer management, and additionally manages index construction over massive data and maintains the index metadata.
3) The index construction and semantic search module is mainly responsible for executing the tasks and data operation requests that the proxy sends down through the core scheduler, completing the efficient expansion/contraction capacity and high availability of the multi-modal data processing cloud platform. The massive-corpus nodes obtain log information from the message queues, process data requests, and package and store the log data on object storage to realize persistent storage of log information; the index nodes execute index-construction tasks, realizing the vectorization of public opinion data by scheduling the embedding pipeline process; and the query nodes obtain log data by subscribing to the message store and provide hybrid scalar + vector query and search functions (a query sketch follows this five-layer list).
4) The storage service layer reasonably plans all data information in the cloud platform, divided into unstructured storage of the massive corpus, storage of vectorized embedding data, index storage, and storage of the platform's log data.
5) Based on these storage service requirements, an underlying storage support cluster is built to persistently store the large-scale corpus and the vectorized index data in the cloud platform. In addition, collection information, node state information and the like for the vector data are snapshotted in combination with metadata storage, achieving extremely high availability and strong consistency. Against possible execution-node failures or shutdown maintenance, message storage is used to trace back historical information, guaranteeing the integrity of data queries, event notifications, result returns and other data.
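The hybrid scalar + vector query served by the query nodes (item 3 above) can be sketched with pymilvus as follows; the filter field, collection name and parameters are illustrative assumptions.

```python
from pymilvus import Collection, connections

connections.connect(host="milvus-proxy", port="19530")  # assumed proxy address
collection = Collection("opinion_corpus")
collection.load()  # query nodes serve the loaded segments

results = collection.search(
    data=[[0.1] * 768],                   # query embedding from the model service
    anns_field="embedding",
    param={"metric_type": "L2", "params": {"nprobe": 16}},
    limit=10,
    expr='publish_date >= "2023-01-01"',  # scalar filter combined with the vector search
)
```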
In the hierarchical architecture of the whole cloud platform, complex system functions are decoupled into micro-services, guaranteeing component-based system construction; the read and write operation records of all functions are stored as logs, recording every operation that changes the state of the database and the data sets. Meanwhile, the continuous integration and delivery development mode ensures automated deployment after code updates and version rollback when the system fails, preserving the availability of system functions as much as possible.
The ingestion flow of vector data: vectors are inserted in batches, but they are not written to disk on every insert. The vector database opens a region of memory for each table as a writable buffer, and data can be quickly written directly into it; when a certain amount has accumulated, the buffer is marked as read-only, and a new writable buffer is opened to await new data.
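A simplified sketch of this write path under the stated behavior: inserts land in an in-memory writable buffer per table and are sealed once a threshold is reached, after which a new writable buffer is opened; the threshold and persistence callback are illustrative assumptions.

```python
class TableWriteBuffer:
    """Per-table in-memory buffer for batched vector inserts (illustrative)."""

    def __init__(self, flush_threshold=100_000):
        self.flush_threshold = flush_threshold
        self.writable = []   # current writable buffer
        self.sealed = []     # buffers marked read-only, awaiting persistence

    def insert(self, vectors):
        self.writable.extend(vectors)          # fast in-memory append
        if len(self.writable) >= self.flush_threshold:
            self.sealed.append(self.writable)  # mark the buffer read-only
            self.writable = []                 # open a new writable buffer

    def flush_sealed(self, persist):
        while self.sealed:
            persist(self.sealed.pop(0))        # write a sealed buffer to disk
```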
As an alternative embodiment of the present invention, a log storage service, recording the data operation information in the corpus storage cluster and the distributed vector database, and a cache message queue service, recording the data operation requests received from the user side, are created in the storage middleware layer.
Specifically, the data meta-information is used to manage the state and information of the files, with MySQL managing the metadata. Two tables are created in the database when the program starts: Tables records the information of all tables, and TableFiles records the information of all data files and index files. The Tables metadata includes the table name, vector dimension, creation date, state, index type, number of clusters, distance calculation mode, and so on; TableFiles records the table name a file belongs to, the file's index type, file name, file type, file size, number of vector rows, and creation date.
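The two metadata tables can be pictured as follows; this is a hedged reconstruction as Python dataclasses, with field names paraphrased from the description rather than taken from an actual schema.

```python
from dataclasses import dataclass

@dataclass
class TableMeta:             # one row per vector table ("Tables")
    table_name: str
    vector_dimension: int
    created_date: str
    state: str
    index_type: str
    nlist: int               # number of clusters for the index
    metric_type: str         # vector distance calculation mode

@dataclass
class TableFileMeta:         # one row per data or index file ("TableFiles")
    table_name: str          # table the file belongs to
    file_index_type: str
    file_name: str
    file_type: str
    file_size: int
    row_count: int           # number of vector rows
    created_date: str
```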
And S140, in the containerized deployment layer, deploying the data in the storage middleware layer to a preset container based on the preset embedding pipeline expansion scheduling rule, so as to complete the construction of the semantic search cloud platform.
Specifically, a distributed vector storage architecture for large-scale embedding data is constructed according to the characteristics of incremental vector generation and real-time index construction in the feature-vector extraction flow of the multi-modal corpus; and for the data processing requirements of the multi-modal embedding pipeline, an expansion scheduling strategy for the embedding pipeline is proposed. The overall strategy optimizes the pre-selection and preferred (scoring) phases of the scheduling process, designing an expansion filtering strategy and an expansion scoring strategy: on top of the default scheduling strategy, the expansion filtering strategy defines new extended resource objects to optimize the evaluation and filtering of GPU and GPU video-memory resources.
As an alternative embodiment of the present invention, a Docker containerized deployment service for container deployment of data in a storage middleware layer, a Kubernetes container orchestration service for simultaneous operation orchestration of multiple nodes on data in the storage middleware layer, and an NFS storage server for storing data in the storage middleware layer are created in the containerized deployment layer, respectively.
Specifically, the cloud platform packages the multi-modal embedding pipeline as images in a micro-service construction mode and deploys the images into the cluster, ensuring load balancing across the whole cloud platform; and the GitLab code warehouse is combined with CI/CD automated builds in the cloud platform to complete automatic code pushing and iterative updating of platform functions. The whole CI/CD build process includes code pushing, compilation, image building, cluster deployment and similar steps.
As an optional embodiment of the present invention, the preset embedding pipeline expansion scheduling rule is:
screening the nodes of the preset container through a filtering function to judge whether each node meets the requirements of the data to be scheduled, and adding the nodes that do into a node queue;
selecting a scoring optimization strategy for the node queue according to whether the data to be scheduled carries a request for extended resources;
when the data to be scheduled does not request extended resources, adopting a preset default scoring strategy, ranking the nodes in the node queue from high to low score according to it, and binding the highest-scoring node with the data to be scheduled; the preset default scoring strategy takes the sum of the CPU score and the memory score as the node's score;
when the data to be scheduled does request extended resources, adopting a preset expansion scoring strategy, ranking the nodes in the node queue from high to low score according to it, and binding the highest-scoring node with the data to be scheduled. The preset expansion scoring strategy takes as the node's score the sum of: the CPU score multiplied by a preset CPU weight, the memory score multiplied by a preset memory weight, the GPU score multiplied by a preset GPU weight, the video-memory remaining score multiplied by a preset video-memory weight, and the score for the matching degree between the resource-type demand of the data to be scheduled and the node's remaining resources of the same types, multiplied by a matching-degree weight. The preset CPU, memory, GPU, video-memory and matching-degree weights sum to 1; the CPU and memory weights are set to x, and the GPU, video-memory and matching-degree weights are all set to 2x, so that 2x + 6x = 8x = 1, i.e. x = 0.125.
Specifically, in order to better solve the resource scheduling problem of multi-modal data processing, the invention optimizes the default kube-scheduler scheduling strategy via the preset embedding pipeline expansion scheduling rule. The scheduling policy is optimized and extended in both the pre-selection and the preferred (scoring) phases. In the pre-selection filtering phase, the priority of the Pods of data to be scheduled is first considered, a reasonable Pod priority queue is set according to the Pods' actual resource requests, two kinds of extended resources are defined, and node filtering is performed against these two new extended resources. In the preferred scoring phase, the node matching degree and the scoring weight of the GPU indicators are added, so that suitable optimal nodes can be selected more specifically for the embedding pipeline micro-services, achieving fast and efficient data processing. The principle of the multi-task parallel scheduling strategy is as follows: nodes are filtered through the default pre-selection strategy, and if the data Pod to be scheduled is detected to contain an extended resource request, the expansion filtering strategy is triggered to filter the nodes further. The filtered nodes then undergo preferred scoring: they are first scored by the default scoring strategy, then the expansion scoring strategy is triggered; finally the scores are summed, the highest-ranked node is selected, and a binding operation is performed with the data Pod to be scheduled. The whole flow is shown in FIG. 9.
The expansion scheduling strategy of the embedding pipeline is implemented by adding a scheduler extender program, i.e. adding new scheduling rules to Kubernetes. Extensions are made in the pre-selection and preferred phases respectively, and extension declarations are added to the default scheduling rules. First, the extension program for the pre-selection phase is written, with two parameters to define: the Name parameter of the expansion filtering strategy, set to Gpu_filter, and the Function implementing it. The function's inputs are the data Pod to be scheduled and a candidate node, and the filtering function judges whether the node meets the requirements of the Pod. If it does, the node is added to the node queue and enters the next, preferred-scoring, flow.
The node queue obtained from the pre-selection phase enters the preferred phase, where a new expansion scoring strategy is adopted. The Kubernetes default scheduling score considers only CPU and memory usage, and only CPU and memory are considered when computing the matching degree between a node and the Pod to be scheduled. To allocate resources better for the embedding pipeline, the preferred phase needs an expanded scoring strategy that emphasizes the scoring weights of the GPU and video memory and achieves a better node-Pod matching degree. The scoring strategy for the resources on a node is defined starting from the CPU and memory scores of the Kubernetes default scoring strategy. The CPU score of a node is defined as

$$S_{cpu} = \frac{total_{cpu} - used_{cpu}}{total_{cpu}}$$

where $total_{cpu}$ is the node's total number of logical CPUs and $used_{cpu}$ is the total amount of CPU being used on the node. The memory score $S_{mem}$, the GPU usage score $S_{gpu}$, and the GPU video-memory usage score $S_{vram}$ on a node are defined in the same way:

$$S_{mem} = \frac{total_{mem} - used_{mem}}{total_{mem}}, \quad S_{gpu} = \frac{total_{gpu} - used_{gpu}}{total_{gpu}}, \quad S_{vram} = \frac{total_{vram} - used_{vram}}{total_{vram}}$$

Next, the overall matching degree between the node's remaining resources and the resource demand of the Pod to be scheduled is considered. Over the same resource types (CPU, memory, GPU, video memory), two vectors are defined to represent the Pod's resource demand and the node's remaining resources:

$$\vec{R} = (req_{cpu},\ req_{mem},\ req_{gpu},\ req_{vram}), \qquad \vec{N} = (rest_{cpu},\ rest_{mem},\ rest_{gpu},\ rest_{vram})$$

where $\vec{R}$ holds the resource-type demands of the Pod to be scheduled, i.e. the CPU, memory, GPU and video-memory demands declared in the YAML resource file describing the Pod, and $\vec{N}$ holds the node's remaining resources of the same types: CPU ($rest_{cpu}$), memory ($rest_{mem}$), GPU ($rest_{gpu}$) and video memory ($rest_{vram}$). Vector similarity is defined via the cosine of the angle between the two vectors, and the matching degree $D$ between the Pod to be scheduled and the node is defined as the cosine distance

$$D(\vec{R}, \vec{N}) = 1 - \frac{\vec{R} \cdot \vec{N}}{\lVert \vec{R} \rVert\, \lVert \vec{N} \rVert}$$

Obviously, the larger $D$ is, the lower the similarity of the two vectors, i.e. the lower the match between the Pod to be scheduled and the node's resources; the smaller $D$ is, the higher the similarity, i.e. the higher the match. Combining the resource scoring formulas defined above, the total score in the expansion scoring strategy is defined as

$$Score = w_{cpu} S_{cpu} + w_{mem} S_{mem} + w_{gpu} S_{gpu} + w_{vram} S_{vram} + w_{match} (1 - D)$$

where $w_{cpu}$, $w_{mem}$, $w_{gpu}$, $w_{vram}$ and $w_{match}$ are the weights of the CPU, memory, GPU, video memory and resource matching degree, summing to 1. The CPU and memory weights are set to x, and the GPU, video-memory and matching-degree weights are set to 2x, so x = 0.125. The overall preferred-phase scoring flow is shown in FIG. 10.
The whole preferred phase takes the node queue produced by the pre-selection phase and performs the scoring operation on the Pod at the head of the queue of data Pods to be scheduled. It detects whether the Pod carries an extended resource request; if so, the node queue is scored with the expansion scoring strategy, otherwise the default scoring strategy is used. Finally the highest-scoring node is bound to the Pod, and the queue of Pods to be scheduled is checked: if it is empty, the whole scheduling process ends; if not, the whole pre-selection/preferred procedure loops again.
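A minimal sketch of this preferred-phase scoring under the formulas above, with x = 0.125 and 2x = 0.25; it illustrates the scoring logic only and is not actual scheduler-extender code, and the node/Pod dictionary fields are assumptions.

```python
import math

def resource_score(total, used):
    return (total - used) / total  # fraction of the resource still free

def cosine_match(request, remaining):
    dot = sum(r * n for r, n in zip(request, remaining))
    norms = (math.sqrt(sum(r * r for r in request))
             * math.sqrt(sum(n * n for n in remaining)))
    return dot / norms             # higher = demand better matches the node

def default_score(node):
    # Kubernetes default strategy: CPU score + memory score only.
    return (resource_score(node["cpu_total"], node["cpu_used"])
            + resource_score(node["mem_total"], node["mem_used"]))

def extended_score(node, request):
    remaining = [node[r + "_total"] - node[r + "_used"]
                 for r in ("cpu", "mem", "gpu", "vram")]
    return (0.125 * resource_score(node["cpu_total"], node["cpu_used"])
            + 0.125 * resource_score(node["mem_total"], node["mem_used"])
            + 0.25 * resource_score(node["gpu_total"], node["gpu_used"])
            + 0.25 * resource_score(node["vram_total"], node["vram_used"])
            + 0.25 * cosine_match(request, remaining))

def bind(pod, nodes):
    # Expansion scoring only when the Pod carries an extended resource request.
    if pod.get("extended_resources"):
        return max(nodes, key=lambda n: extended_score(n, pod["request"]))
    return max(nodes, key=default_score)  # otherwise the default strategy
```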
The expansion strategies of the pre-selection and preferred phases are then packaged into an image. The image is built by writing a Dockerfile: Ubuntu is used as the base image, the written expansion strategy program is packaged into an executable directory of Ubuntu, executable permission is added, and finally the expansion strategy file is run.
The Dockerfile is run to build the image, which is uploaded to a private Harbor image warehouse. Then the configuration file of the extender component is written, the interface is registered in ConfigMap form, and finally a Deployment file is written to schedule and deploy the extended scheduler.
Deploying the extended scheduling component into the cluster through the Deployment configuration file completes the deployment of the embedding pipeline's expansion scheduling strategy.
The expansion scheduling strategy for the multi-modal embedding pipeline fully considers the resource demands of tasks, load balancing, data locality, elastic expansion, fault recovery and other aspects, and can effectively improve the data processing performance and resource utilization of the multi-modal embedding pipeline.
As shown in FIG. 8, the flow of the semantic search service is as follows:
the LRU (Latest Recently Used) policy is used as a permutation policy for the data. The first inquiry is cold inquiry, the data is on the hard disk when the first inquiry is performed, the data is required to be loaded into the memory, and part of the data is also loaded into the video memory; when the second inquiry is performed, part or all of the data is already in the memory, so that the time for reading the hard disk is saved, and the inquiry can be fast.
Vector search has two key parameters: n, the number of target (query) vectors, and k, the number of most-similar vectors to return. For one query, the result set is n groups of key-value pairs, each group containing k key-value pairs. The scheduler first searches to obtain partial result sets, then merges the result sets pairwise, and finally obtains the final result set after two rounds of merging.
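The pairwise merging can be sketched as follows, assuming each partial result set holds, for each of the n queries, a list of (distance, vector id) pairs; keeping the k smallest distances per query at every merge is an assumption consistent with top-k retrieval.

```python
import heapq

def merge_pair(a, b, k):
    # a, b: partial result sets, one list of (distance, vector_id) pairs per query
    return [heapq.nsmallest(k, qa + qb) for qa, qb in zip(a, b)]

def merge_all(partials, k):
    while len(partials) > 1:              # rounds of pairwise merging
        merged = [merge_pair(a, b, k)
                  for a, b in zip(partials[::2], partials[1::2])]
        if len(partials) % 2:
            merged.append(partials[-1])   # odd one out advances as-is
        partials = merged
    return partials[0]                    # the final result set
```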
As shown in FIG. 2, the present invention provides a system 200 for the construction of a semantic search cloud platform, which may be installed in an electronic device. Depending on the functions implemented, the system 200 may include an architecture creation module 210, a micro-service creation module 220, a storage creation module 230 and a deployment module 240. A unit of the invention, which may also be referred to as a module, is a series of computer program segments stored in the memory of the electronic device, executable by the processor of the electronic device and performing a fixed function.
In the present embodiment, the functions concerning the respective modules/units are as follows:
the architecture creation module 210 is configured to create a hierarchical architecture of the semantic search cloud platform; the hierarchical architecture comprises a business micro-service layer, a storage middleware layer and a containerized deployment layer;
the micro-service creation module 220 is configured to create, in the business micro-service layer, a data acquisition service for acquiring multi-modal corpus in real time, a vectorization model service for vectorizing the acquired multi-modal corpus, an index service for constructing an index over the vectors of the multi-modal corpus, and a semantic search service for user search input; the semantic search service establishes a connection relationship with the vectorization model service;
the storage creating module 230 is configured to create, at a storage middleware layer, a corpus storage cluster storing a multimodal corpus and a distributed vector database storing a vector of the indexed multimodal corpus, respectively, and establish a connection relationship between the distributed vector database and the vectorization model service; establishing a mapping relation between a distributed vector database and the corpus storage cluster;
the deployment module 240 is configured to deploy, in the containerized deployment layer, the data in the storage middleware layer to a preset container based on the preset embedding pipeline expansion scheduling rule, so as to complete the construction of the semantic search cloud platform.
According to the system 200 for constructing the semantic search cloud platform, a hierarchical architecture comprising a business micro-service layer, a storage middleware layer and a containerized deployment layer is constructed, and a corpus storage cluster for storing multi-modal corpus and a distributed vector database for storing the vectors of the indexed multi-modal corpus are created in the storage middleware layer. This ensures efficient storage and semantic vectorization of massive public opinion data, breaks through the performance bottleneck of a single-node database, and realizes hybrid scalar-vector queries, solving the problem that keyword search can overlook public opinion data with latent semantic similarity. The corpus storage cluster and the distributed vector database form a distributed underlying storage cluster service covering unstructured corpus storage and massive vectorized index storage, addressing the prior-art need to elastically expand and contract storage space as large-scale data grows in real time. By designing the architecture of the containerized deployment layer and presetting an embedding pipeline expansion scheduling rule, the collected public opinion data is embedded incrementally in real time, finally realizing an end-to-end semantic search system.
As shown in FIG. 3, the present invention provides an electronic device 3 for implementing the method for constructing a semantic search cloud platform.
The electronic device 3 may comprise a processor 30, a memory 31 and a bus, and may further comprise a computer program stored in the memory 31 and executable on the processor 30, such as a construction program 32 of the semantic search cloud platform.
The memory 31 includes at least one type of readable storage medium, including flash memory, a mobile hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), a magnetic memory, a magnetic disk, an optical disk, etc. The memory 31 may in some embodiments be an internal storage unit of the electronic device 3, such as a hard disk of the electronic device 3. The memory 31 may in other embodiments be an external storage device of the electronic device 3, such as a plug-in mobile hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the electronic device 3. Further, the memory 31 may also include both an internal storage unit and an external storage device of the electronic device 3. The memory 31 may be used not only to store application software installed in the electronic device 3 and various types of data, such as the code of the construction program of the semantic search cloud platform, but also to temporarily store data that has been output or is to be output.
The processor 30 may in some embodiments be composed of integrated circuits, for example a single packaged integrated circuit, or multiple integrated circuits packaged with the same or different functions, including one or more central processing units (CPUs), microprocessors, digital processing chips, graphics processors, combinations of various control chips, and the like. The processor 30 is the control unit of the electronic device: it connects the components of the entire electronic device using various interfaces and lines, runs the programs or modules stored in the memory 31 (for example, the construction program of the semantic search cloud platform), and invokes data stored in the memory 31 to perform the various functions of the electronic device 3 and process data.
The bus may be a peripheral component interconnect standard (peripheral component interconnect, PCI) bus or an extended industry standard architecture (extended industry standard architecture, EISA) bus, among others. The bus may be classified as an address bus, a data bus, a control bus, etc. The bus is arranged to enable a connection communication between the memory 31 and at least one processor 30 or the like.
FIG. 3 shows only an electronic device with certain components; it will be understood by a person skilled in the art that the structure shown in FIG. 3 does not limit the electronic device 3, which may comprise fewer or more components than shown, combine certain components, or arrange the components differently.
For example, although not shown, the electronic device 3 may further include a power source (such as a battery) for supplying power to the respective components, and preferably, the power source may be logically connected to the at least one processor 30 through a power management system, so as to implement functions of charge management, discharge management, and power consumption management through the power management system. The power supply may also include one or more of any of a direct current or alternating current power supply, a recharging system, a power failure detection circuit, a power converter or inverter, a power status indicator, and the like. The electronic device 3 may further include various sensors, bluetooth modules, wi-Fi modules, etc., which will not be described herein.
Further, the electronic device 3 may also comprise a network interface, optionally comprising a wired interface and/or a wireless interface (e.g., a Wi-Fi interface, a Bluetooth interface, etc.), typically used to establish a communication connection between the electronic device 3 and other electronic devices.
The electronic device 3 may optionally further comprise a user interface, which may be a display or an input unit such as a keyboard, or a standard wired interface or wireless interface. Alternatively, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch display, or the like. The display may also be referred to as a display screen or display unit, and serves to display the information processed in the electronic device 3 and to present a visual user interface.
It should be understood that the described embodiments are for illustrative purposes only, and that the scope of the patent application is not limited to this configuration.
The construction program 32 of the semantic search cloud platform stored in the memory 31 of the electronic device 3 is a combination of instructions that, when executed by the processor 30, may implement:
S110, creating a hierarchical architecture of a semantic search cloud platform; the hierarchical architecture comprises a business micro-service layer, a storage middleware layer and a containerized deployment layer;
S120, creating, at the business micro-service layer, a data acquisition service for acquiring multi-modal corpora in real time, a vectorization model service for vectorizing the acquired multi-modal corpora, an index service for constructing an index over the vectors of the multi-modal corpora, and a semantic search service for user-input searches; the semantic search service establishes a connection relationship with the vectorization model service;
S130, creating, at the storage middleware layer, a corpus storage cluster for storing the multi-modal corpora and a distributed vector database for storing the vectors of the indexed multi-modal corpora, establishing a connection relationship between the distributed vector database and the vectorization model service, and establishing a mapping relationship between the distributed vector database and the corpus storage cluster;
S140, at the containerized deployment layer, deploying the data in the storage middleware layer to a preset container based on a preset embedded pipeline expansion scheduling rule, so as to complete the construction of the semantic search cloud platform.
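By way of illustration only, the following minimal Python sketch mirrors steps S110 to S140; every class, service and scheduler name in it is a hypothetical stand-in introduced for this example and is not part of the disclosed platform.

```python
# Minimal sketch of steps S110-S140. Every name here (Service, build_platform,
# the scheduler's schedule() method) is a hypothetical stand-in for illustration.

class Service:
    """Stub for any micro-service or storage component."""
    def __init__(self, name, **links):
        self.name = name
        self.links = links          # connection/mapping relationships


def build_platform(scheduler):
    # S110: hierarchical architecture with three layers
    platform = {"business": {}, "storage": {}, "deployment": {}}

    # S120: business micro-service layer
    model = Service("vectorization-model")                  # corpus -> vectors
    platform["business"] = {
        "collector": Service("data-acquisition"),           # real-time corpus crawling
        "model": model,
        "index": Service("index"),                          # index over corpus vectors
        "search": Service("semantic-search", model=model),  # wired to the model service
    }

    # S130: storage middleware layer, connected to the model service and
    # mapped to the corpus cluster (vector -> raw corpus)
    corpus = Service("corpus-storage-cluster")
    platform["storage"] = {
        "corpus": corpus,
        "vectors": Service("distributed-vector-db", model=model, corpus=corpus),
    }

    # S140: containerized deployment driven by the embedded pipeline
    # expansion scheduling rule (filtering plus scoring, see the claims)
    for component in platform["storage"].values():
        scheduler.schedule(component)
    return platform
```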
Specifically, for the implementation of the above instructions by the processor 30, reference may be made to the description of the relevant steps in the embodiment corresponding to Fig. 1, which is not repeated here. It should be emphasized that, in order to further ensure the privacy and security of the construction program of the semantic search cloud platform, the high-availability data processed by the database is stored on the nodes where the server cluster is located.
Further, if the modules/units integrated in the electronic device 3 are implemented in the form of software functional units and sold or used as a stand-alone product, they may be stored in a computer-readable storage medium. The computer-readable medium may include: any entity or system capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, and a read-only memory (ROM).
Embodiments of the present invention also provide a computer-readable storage medium, which may be non-volatile or volatile, storing a computer program which, when executed by a processor, implements:
S110, creating a hierarchical architecture of a semantic search cloud platform; the hierarchical architecture comprises a business micro-service layer, a storage middleware layer and a containerized deployment layer;
S120, creating, at the business micro-service layer, a data acquisition service for acquiring multi-modal corpora in real time, a vectorization model service for vectorizing the acquired multi-modal corpora, an index service for constructing an index over the vectors of the multi-modal corpora, and a semantic search service for user-input searches; the semantic search service establishes a connection relationship with the vectorization model service;
S130, creating, at the storage middleware layer, a corpus storage cluster for storing the multi-modal corpora and a distributed vector database for storing the vectors of the indexed multi-modal corpora, establishing a connection relationship between the distributed vector database and the vectorization model service, and establishing a mapping relationship between the distributed vector database and the corpus storage cluster;
S140, at the containerized deployment layer, deploying the data in the storage middleware layer to a preset container based on a preset embedded pipeline expansion scheduling rule, so as to complete the construction of the semantic search cloud platform.
Specifically, for the implementation of the computer program when executed by the processor, reference may be made to the description of the relevant steps in the embodiment of the construction method of the semantic search cloud platform, which is not repeated here.
In the several embodiments provided by the present invention, it should be understood that the disclosed apparatus, system and method may be implemented in other manners. For example, the system embodiments described above are merely illustrative; for instance, the division into modules is merely a division by logical function, and other manners of division may be adopted in actual implementation.
The modules described as separate components may or may not be physically separate, and the components shown as modules may or may not be physical units; they may be located in one place or distributed over multiple network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional modules in the embodiments of the present invention may be integrated into one processing unit, each unit may exist alone physically, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware, or in the form of hardware plus software functional modules.
It will be evident to those skilled in the art that the invention is not limited to the details of the foregoing illustrative embodiments, and that the present invention may be embodied in other specific forms without departing from the spirit or essential characteristics thereof.
The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, the scope of the invention being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference signs in the claims shall not be construed as limiting the claim concerned.
Furthermore, it is evident that the word "comprising" does not exclude other elements or steps, and that the singular does not exclude the plural. Multiple units or systems set forth in the system claims may also be implemented by one unit or system through software or hardware. Terms such as first and second are used to denote names rather than any particular order.
Finally, it should be noted that the above embodiments are merely intended to illustrate the technical solution of the present invention and not to limit it; although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art should understand that modifications and equivalent substitutions may be made to the technical solution of the present invention without departing from its spirit and scope.

Claims (9)

1. A construction method of a semantic search cloud platform, characterized by comprising the following steps:
creating a hierarchical architecture of a semantic search cloud platform; the hierarchical architecture comprises a business micro-service layer, a storage middleware layer and a containerized deployment layer;
creating, at the business micro-service layer, a data acquisition service for acquiring multi-modal corpora in real time, a vectorization model service for vectorizing the acquired multi-modal corpora, an index service for constructing an index over the vectors of the multi-modal corpora, and a semantic search service for user-input searches; the semantic search service establishes a connection relationship with the vectorization model service; in the hierarchical architecture, a business micro-service mode is adopted to decouple the complex functions of the platform system into a componentized platform system, and the data acquisition service crawls multi-modal corpus data from the network in real time through a preset data crawling port;
creating, at the storage middleware layer, a corpus storage cluster for storing the multi-modal corpora and a distributed vector database for storing the vectors of the indexed multi-modal corpora, and establishing a connection relationship between the distributed vector database and the vectorization model service; establishing a mapping relationship between the distributed vector database and the corpus storage cluster;
at the containerized deployment layer, deploying the data in the storage middleware layer to a preset container based on a preset embedded pipeline expansion scheduling rule, so as to complete the construction of the semantic search cloud platform; the preset embedded pipeline expansion scheduling rule is as follows:
screening the nodes of the preset container through a filtering function to judge whether each node meets the requirements of the data to be scheduled, and adding the nodes that meet the requirements of the data to be scheduled to a node queue;
selecting a scoring optimization strategy for the node queue according to whether the data to be scheduled carries a request for extended resources;
when the data to be scheduled does not request extended resources, a preset default scoring strategy is adopted: the nodes in the node queue are sorted from high score to low score according to the preset default scoring strategy, and the node with the highest score is bound to the data to be scheduled; the default scoring strategy takes the sum of the CPU score and the memory score as the score of a node;
when the data to be scheduled requests extended resources, a preset extended scoring strategy is adopted: the nodes in the node queue are sorted from high score to low score according to the preset extended scoring strategy, and the node with the highest score is bound to the data to be scheduled; the preset extended scoring strategy takes as the score of a node the sum of the CPU score multiplied by a preset CPU weight value, the memory score multiplied by a preset memory weight value, the GPU score multiplied by a preset GPU weight value, the remaining-video-memory score multiplied by a preset remaining-video-memory weight value, and the score of the matching degree between the resource-type demands of the data to be scheduled and the remaining amounts of the same resource types on the node multiplied by a matching-degree weight value; the sum of the preset CPU weight value, the preset memory weight value, the preset GPU weight value, the preset remaining-video-memory weight value and the matching-degree weight value is 1; the preset CPU weight value and the preset memory weight value are both set to x; the preset GPU weight value, the preset remaining-video-memory weight value and the matching-degree weight value are all set to 2x (hence x + x + 2x + 2x + 2x = 8x = 1, so x = 0.125); wherein,
the formula for scoring the CPU is: score_CPU = (totalCPU - usedCPU) / totalCPU; wherein totalCPU is the total number of logical CPUs of the Node node, and usedCPU is the total amount of CPU being used by the Node node;
the formula for scoring the memory is: score_Memory = (totalMemory - usedMemory) / totalMemory; wherein totalMemory is the total amount of memory of the Node node, and usedMemory is the total amount of memory being used by the Node node;
the formula for scoring the GPU is: score_GPU = (totalGPU - usedGPU) / totalGPU; wherein totalGPU is the total amount of GPU of the Node node, and usedGPU is the total amount of GPU being used by the Node node;
the formula for scoring the remaining video memory resources is: score_vram = (totalVram - usedVram) / totalVram; wherein totalVram is the total amount of video memory resources of the Node node, and usedVram is the total amount of video memory resources being used by the Node node;
the matching degree between the resource-type demands of the data to be scheduled and the remaining amounts of the same resource types on the node is scored by comparing, for each of the CPU, memory, GPU and video memory resources, the demand of the data to be scheduled with the remaining amount of that resource on the Node node; wherein the four demand quantities are respectively the CPU, memory, GPU and video memory demands of the data to be scheduled, and the four remaining quantities are respectively the remaining amounts of the CPU, memory, GPU and video memory resources of the Node node.
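By way of illustration only, the following Python sketch implements the filtering step and the two scoring strategies set out above. Because the source renders the exact scoring expressions as images, the free-capacity-ratio form of the per-resource scores and the averaged demand-to-remainder ratio used for the matching degree are assumptions introduced for this sketch, as are all function and field names.

```python
# Hedged sketch of the embedded pipeline expansion scheduling rule.
# ASSUMPTIONS: per-resource score = remaining capacity / total capacity, and
# matching degree = mean of demand/remaining ratios; both are illustrative
# readings of formulas that the source renders as images.

RESOURCES = ("cpu", "memory", "gpu", "vram")
X = 0.125                                   # x + x + 2x + 2x + 2x = 1  =>  x = 1/8
WEIGHTS = {"cpu": X, "memory": X, "gpu": 2 * X, "vram": 2 * X, "match": 2 * X}

def resource_score(node, res):
    total, used = node["total"][res], node["used"][res]
    return (total - used) / total if total else 0.0

def match_degree(node, request):
    # how tightly the demands fit the remaining amounts, averaged over resources
    ratios = []
    for res in RESOURCES:
        free = node["total"][res] - node["used"][res]
        need = request.get(res, 0)
        if need > free:
            return 0.0                      # demand cannot be satisfied at all
        ratios.append(need / free if free else 0.0)
    return sum(ratios) / len(ratios)

def default_score(node, request):
    # default strategy: CPU score plus memory score
    return resource_score(node, "cpu") + resource_score(node, "memory")

def extended_score(node, request):
    # extended strategy: weighted per-resource scores plus matching degree
    s = sum(WEIGHTS[res] * resource_score(node, res) for res in RESOURCES)
    return s + WEIGHTS["match"] * match_degree(node, request)

def bind_node(nodes, request, wants_extended_resources):
    # filtering function: keep only nodes meeting the demands of the data
    queue = [n for n in nodes
             if all(n["total"][r] - n["used"][r] >= request.get(r, 0)
                    for r in RESOURCES)]
    score = extended_score if wants_extended_resources else default_score
    queue.sort(key=lambda n: score(n, request), reverse=True)  # high to low
    return queue[0] if queue else None      # highest-scoring node is bound
```

Under this sketch, a request that asks for extended resources is steered toward nodes with free GPU and video memory, since those terms carry twice the weight of the CPU and memory terms.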
2. The method for constructing a semantic search cloud platform according to claim 1, wherein the hierarchical architecture further comprises a model distributed training layer;
and creating, at the model distributed training layer, a vectorization model training service, a model iteration service for performing optimization iterations on the model obtained by the vectorization model training service, and a model update service for updating the model generated by the iterative optimization of the model iteration service to the vectorization model service.
3. The method for constructing a semantic search cloud platform according to claim 1, wherein the hierarchical architecture further comprises a system efficiency supporting layer;
and creating, at the system efficiency supporting layer, a code warehouse for service development in the business micro-service layer, a traceable version control service for controlling the service development stage in the business micro-service layer, a continuous integration service for deploying the development code in the code warehouse, a continuous delivery service for delivering the development code deployed through the continuous integration service to the business micro-service layer, and a mirror warehouse for storing the image files in the semantic search cloud platform.
4. The method for constructing a semantic search cloud platform according to claim 1, wherein a log storage service for recording data operation information in the corpus storage cluster and the distributed vector database, and a cache message queue service for recording the data operation information received from user requests, are created at the storage middleware layer.
5. The method for constructing a semantic search cloud platform according to claim 1, wherein a Docker containerized deployment service for deploying the data in the storage middleware layer into containers, a Kubernetes container orchestration service for orchestrating the operation of the data in the storage middleware layer across multiple nodes simultaneously, and an NFS storage server for storing the data in the storage middleware layer are respectively created at the containerized deployment layer.
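Purely as an illustrative sketch of how the three services of this claim fit together, the snippet below creates a Kubernetes Deployment whose pods run a Docker image and mount an NFS-backed volume, using the official Kubernetes Python client; the image name, namespace, NFS server address, export path and replica count are hypothetical.

```python
# Sketch of claim 5's trio: Docker image, Kubernetes orchestration, NFS storage.
# Image name, namespace, NFS server/path and replica count are hypothetical.
from kubernetes import client, config

def deploy_vector_db(namespace="semantic-search"):
    config.load_kube_config()                       # or load_incluster_config()
    pod_spec = client.V1PodSpec(
        containers=[client.V1Container(
            name="vector-db",
            image="example/vector-db:latest",       # hypothetical Docker image
            volume_mounts=[client.V1VolumeMount(
                name="nfs-data", mount_path="/var/lib/vectordb")],
        )],
        volumes=[client.V1Volume(
            name="nfs-data",
            nfs=client.V1NFSVolumeSource(           # NFS storage server
                server="10.0.0.10", path="/exports/vectordb"),
        )],
    )
    deployment = client.V1Deployment(
        metadata=client.V1ObjectMeta(name="vector-db"),
        spec=client.V1DeploymentSpec(
            replicas=3,                             # orchestrated across nodes
            selector=client.V1LabelSelector(match_labels={"app": "vector-db"}),
            template=client.V1PodTemplateSpec(
                metadata=client.V1ObjectMeta(labels={"app": "vector-db"}),
                spec=pod_spec),
        ),
    )
    client.AppsV1Api().create_namespaced_deployment(namespace, deployment)
```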
6. The method for constructing a semantic search cloud platform according to claim 1, wherein the service method of the index service comprises:
establishing a connection with the distributed vector database;
and calculating, according to a preset index algorithm, the vector set obtained by vectorizing the multi-modal corpora processed by the vectorization model service, generating a vector index for the vector set, and inserting the vectors and the corresponding index values into the distributed vector database.
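As a minimal sketch of this service method, assuming a hypothetical in-memory VectorDBClient and an assumed index scheme (the claim fixes only that some preset index algorithm is used), the snippet below establishes the connection, generates an index value for each vector, and inserts the vector/index pairs:

```python
# Sketch of the index service in claim 6. VectorDBClient, the connection
# parameters and the rounding-hash index scheme are all hypothetical; the
# claim only requires some preset index algorithm.
import numpy as np

class VectorDBClient:
    """Hypothetical stand-in for the distributed vector database."""
    def __init__(self, host, port):
        self.host, self.port = host, port   # connection to the database
        self.store = {}                     # index value -> vector

    def insert(self, index_value, vector):
        self.store[index_value] = vector


def index_service(vectors, host="vector-db", port=9000):
    db = VectorDBClient(host, port)         # establish the connection
    for vec in vectors:
        # preset index algorithm (assumed): quantize the vector and hash it
        # to obtain the index value generated for this vector
        index_value = hash(np.round(vec, 3).tobytes())
        db.insert(index_value, np.asarray(vec))
    return db
```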
7. A construction system of a semantic search cloud platform, characterized by comprising:
the architecture creation module is used for creating a hierarchical architecture of the semantic search cloud platform; the hierarchical architecture comprises a business micro-service layer, a storage middleware layer and a containerized deployment layer;
the micro-service creation module is used for creating, at the business micro-service layer, a data acquisition service for acquiring multi-modal corpora in real time, a vectorization model service for vectorizing the acquired multi-modal corpora, an index service for constructing an index over the vectors of the multi-modal corpora, and a semantic search service for user-input searches; the semantic search service establishes a connection relationship with the vectorization model service; in the hierarchical architecture, a business micro-service mode is adopted to decouple the complex functions of the platform system into a componentized platform system, and the data acquisition service crawls multi-modal corpus data from the network in real time through a preset data crawling port;
the storage creation module is used for creating, at the storage middleware layer, a corpus storage cluster for storing the multi-modal corpora and a distributed vector database for storing the vectors of the indexed multi-modal corpora, for establishing a connection relationship between the distributed vector database and the vectorization model service, and for establishing a mapping relationship between the distributed vector database and the corpus storage cluster;
the deployment module is used for deploying, at the containerized deployment layer, the data in the storage middleware layer to a preset container based on a preset embedded pipeline expansion scheduling rule, so as to complete the construction of the semantic search cloud platform; the preset embedded pipeline expansion scheduling rule is as follows:
screening the nodes of the preset container through a filtering function to judge whether each node meets the requirements of the data to be scheduled, and adding the nodes that meet the requirements of the data to be scheduled to a node queue;
selecting a scoring optimization strategy for the node queue according to whether the data to be scheduled carries a request for extended resources;
when the data to be scheduled does not request extended resources, a preset default scoring strategy is adopted: the nodes in the node queue are sorted from high score to low score according to the preset default scoring strategy, and the node with the highest score is bound to the data to be scheduled; the default scoring strategy takes the sum of the CPU score and the memory score as the score of a node;
when the data to be scheduled requests extended resources, a preset extended scoring strategy is adopted: the nodes in the node queue are sorted from high score to low score according to the preset extended scoring strategy, and the node with the highest score is bound to the data to be scheduled; the preset extended scoring strategy takes as the score of a node the sum of the CPU score multiplied by a preset CPU weight value, the memory score multiplied by a preset memory weight value, the GPU score multiplied by a preset GPU weight value, the remaining-video-memory score multiplied by a preset remaining-video-memory weight value, and the score of the matching degree between the resource-type demands of the data to be scheduled and the remaining amounts of the same resource types on the node multiplied by a matching-degree weight value; the sum of the preset CPU weight value, the preset memory weight value, the preset GPU weight value, the preset remaining-video-memory weight value and the matching-degree weight value is 1; the preset CPU weight value and the preset memory weight value are both set to x; the preset GPU weight value, the preset remaining-video-memory weight value and the matching-degree weight value are all set to 2x (hence x + x + 2x + 2x + 2x = 8x = 1, so x = 0.125); wherein,
the formula for scoring the CPU is: score_CPU = (totalCPU - usedCPU) / totalCPU; wherein totalCPU is the total number of logical CPUs of the Node node, and usedCPU is the total amount of CPU being used by the Node node;
the formula for scoring the memory is: score_Memory = (totalMemory - usedMemory) / totalMemory; wherein totalMemory is the total amount of memory of the Node node, and usedMemory is the total amount of memory being used by the Node node;
the formula for scoring the GPU is: score_GPU = (totalGPU - usedGPU) / totalGPU; wherein totalGPU is the total amount of GPU of the Node node, and usedGPU is the total amount of GPU being used by the Node node;
the formula for scoring the remaining video memory resources is: score_vram = (totalVram - usedVram) / totalVram; wherein totalVram is the total amount of video memory resources of the Node node, and usedVram is the total amount of video memory resources being used by the Node node;
the matching degree between the resource-type demands of the data to be scheduled and the remaining amounts of the same resource types on the node is scored by comparing, for each of the CPU, memory, GPU and video memory resources, the demand of the data to be scheduled with the remaining amount of that resource on the Node node; wherein the four demand quantities are respectively the CPU, memory, GPU and video memory demands of the data to be scheduled, and the four remaining quantities are respectively the remaining amounts of the CPU, memory, GPU and video memory resources of the Node node.
8. An electronic device, the electronic device comprising:
at least one processor; the method comprises the steps of,
a memory communicatively coupled to the at least one processor; wherein,
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the steps in the method of constructing a semantic search cloud platform according to any one of claims 1 to 6.
9. A computer readable storage medium storing at least one instruction, wherein the at least one instruction, when executed by a processor in an electronic device, implements a method of constructing a semantic search cloud platform according to any one of claims 1 to 6.
CN202310735695.7A 2023-06-21 2023-06-21 Construction method, system and equipment of semantic search cloud platform and storage medium Active CN116501947B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310735695.7A CN116501947B (en) 2023-06-21 2023-06-21 Construction method, system and equipment of semantic search cloud platform and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202310735695.7A CN116501947B (en) 2023-06-21 2023-06-21 Construction method, system and equipment of semantic search cloud platform and storage medium

Publications (2)

Publication Number Publication Date
CN116501947A 2023-07-28
CN116501947B 2023-10-27

Family

ID=87323349

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310735695.7A Active CN116501947B (en) 2023-06-21 2023-06-21 Construction method, system and equipment of semantic search cloud platform and storage medium

Country Status (1)

Country Link
CN (1) CN116501947B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117688220A * 2023-12-12 2024-03-12 Shandong Inspur Scientific Research Institute Co., Ltd. Multi-mode information retrieval method and system based on large language model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106649870A * 2017-01-03 2017-05-10 Shandong Inspur Business System Co., Ltd. Distributed implementation method for search engine
CN111327681A * 2020-01-21 2020-06-23 Beijing University of Technology Cloud computing data platform construction method based on Kubernetes
CN112256860A * 2020-11-25 2021-01-22 Ctrip Computer Technology (Shanghai) Co., Ltd. Semantic retrieval method, system, equipment and storage medium for customer service conversation content
CN113127526A * 2019-12-30 2021-07-16 Zhongke Xingtu Co., Ltd. Distributed data storage and retrieval system based on Kubernetes
CN114840304A * 2022-04-15 2022-08-02 ZTE Corporation Container scheduling method, electronic equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11704035B2 (en) * 2020-03-30 2023-07-18 Pure Storage, Inc. Unified storage on block containers


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
Feng Jun; Li Zongxiang; Tang Zhixian; Jiang Kang. Semantic search method for water conservancy metadata based on Hadoop. Computer and Modernization (计算机与现代化), 2015, (12), full text. *
Lin Jian; Xie Dongming; Yu Bo. Research on the adaptation of deep learning cloud services. Software Guide (软件导刊), 2020, (06), full text. *

Also Published As

Publication number Publication date
CN116501947A 2023-07-28

Similar Documents

Publication Publication Date Title
Sumbaly et al. The big data ecosystem at linkedin
JP6553822B2 (en) Dividing and moving ranges in distributed systems
US9489237B1 (en) Dynamic tree determination for data processing
Borkar et al. Hyracks: A flexible and extensible foundation for data-intensive computing
Băzăr et al. The Transition from RDBMS to NoSQL. A Comparative Analysis of Three Popular Non-Relational Solutions: Cassandra, MongoDB and Couchbase.
US8555018B1 (en) Techniques for storing data
US8738645B1 (en) Parallel processing framework
CN107180113B (en) Big data retrieval platform
Wang et al. Research and implementation on spatial data storage and operation based on Hadoop platform
CN106569896B (en) A kind of data distribution and method for parallel processing and system
CN112148718A (en) Big data support management system for city-level data middling station
CN116501947B (en) Construction method, system and equipment of semantic search cloud platform and storage medium
Bellare et al. Woo: A scalable and multi-tenant platform for continuous knowledge base synthesis
Chrysafis et al. Foundationdb record layer: A multi-tenant structured datastore
EP3470992A1 (en) Efficient storage and utilization of a hierarchical data set
US11745093B2 (en) Developing implicit metadata for data stores
Zhao et al. Toward efficient and flexible metadata indexing of big data systems
JP2013073557A (en) Information search system, search server and program
CN108256019A (en) Database key generation method, device, equipment and its storage medium
CN114116684B (en) Docker containerization-based deep learning large model and large data set version management method
Schek et al. Hyperdatabases: infrastructure for the information space
Semertzidis et al. Historical traversals in native graph databases
Velusamy et al. Inverted indexing in big data using hadoop multiple node cluster
Wu Big data processing with Hadoop
Johnson et al. Big data processing using Hadoop MapReduce programming model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant