WO2022166828A1 - Data indexing method and system, and storage medium - Google Patents

Data indexing method and system, and storage medium

Info

Publication number
WO2022166828A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
index
model data
indexing
formulating
Prior art date
Application number
PCT/CN2022/074712
Inventor
施瑜
王景龙
王泽宇
Original Assignee
易保网络技术(上海)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 易保网络技术(上海)有限公司
Publication of WO2022166828A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/316 Indexing structures
    • G06F16/328 Management therefor

Definitions

  • The present application relates to the field of data indexing, and in particular to a data indexing method, a data indexing system, and a computer-readable storage medium.
  • Elasticsearch is a Lucene-based search service that provides a full-text search engine with distributed multi-user capabilities, based on a RESTful web interface.
  • Elasticsearch is a popular enterprise-grade search engine.
  • Such solutions are not friendly to non-specialists, and their learning cost is high.
  • In addition, configuring them for specific needs is time-consuming and labor-intensive. In view of this, a mechanism is needed that can implement data indexing simply and efficiently.
  • Embodiments of the present application provide a data indexing method, a data indexing system, and a computer-readable storage medium, which are used to efficiently index data and reduce the workload of indexing tasks.
  • a data indexing method including: configuring a structure of model data; formulating an indexing scheme according to the structure; and indexing the model data to form and store index data.
  • the method further includes storing the model data, wherein the index data is stored at the same time as the model data is stored.
  • configuring the structure of the model data includes configuring the structure of the model data according to a data dictionary; and formulating an indexing scheme according to the structure includes using the data dictionary to define fields included in the index.
  • formulating an indexing scheme according to the structure includes establishing an index configuration table, where the index configuration table includes an index, an index field, a dictionary path, and a value expression.
  • formulating an indexing scheme according to the structure further includes using a DSL to define fields included in the index.
  • a system for data indexing, comprising: a processor; and a computer-readable medium coupled to the processor and having instructions stored thereon, the instructions, when executed by the processor, causing the processor to perform the following operations: configuring the structure of model data; formulating an indexing scheme according to the structure; and indexing the model data to form and store index data.
  • the processor further performs an operation of: storing the model data, wherein the index data is stored simultaneously with the model data.
  • configuring the structure of the model data includes configuring the structure of the model data according to a data dictionary; and formulating an indexing scheme according to the structure includes using the data dictionary to define fields included in the index.
  • formulating an indexing scheme according to the structure includes establishing an index configuration table, where the index configuration table includes an index, an index field, a dictionary path, and a value expression.
  • formulating an indexing scheme according to the structure further includes using a DSL to define fields included in the index.
  • a computer-readable storage medium in which instructions are stored, wherein, when the instructions are executed by a processor, the processor is caused to perform any one of the methods described above.
  • FIG. 1 shows a data indexing method according to an embodiment of the present application.
  • FIG. 2 shows a data indexing mechanism according to an embodiment of the present application.
  • FIG. 3 shows a data indexing system according to an embodiment of the present application.
  • the data indexing method 10 includes the following steps: configuring the structure of model data in step S102; formulating an indexing scheme according to the structure in step S104; and indexing the model data in step S106 to form and store index data.
  • the data indexing method 10 configures the structure of the model data in step S102.
  • Model data in the present invention refers to data that can be indexed for searching, for example, can be formatted text, non-formatted text, policy data, contract data, and the like.
  • the structure of the model data can be set as required; in other words, the content to be indexed is set. It should be noted that this setting does not always fully reflect the complete contextual structure of the model data and may be only an excerpt of the part of interest. Once the structure of the model data has been determined, some or all of the content of the model data (for example, the part of interest) will be indexed so that this content can be searched.
  • the structure may be roughly the same.
  • several possible structure attribute alternatives can be pre-specified for selection.
  • one or more options of interest can be selected from several alternative options, and the structure of the model data can be constructed accordingly.
  • the data indexing method 10 formulates an indexing scheme in step S104 according to the structure determined in step S102.
  • in step S102 the structure of the model data is determined according to the content of interest. To make the content of interest searchable, that content must first be indexed. Since the indexing scheme is formulated according to the content of interest, and the content of interest is in turn determined by configuring the structure, the "configuration" operation in step S102 can be regarded as directly triggering the "formulation" of the corresponding indexing scheme. This work requires no manual participation, so the workload of secondary development can be greatly reduced.
  • when processing model data, data processing personnel can index it accurately at the same time, which is more efficient and accurate than traditional solutions.
  • in some examples, the "structure" determined in step S102 will generate one corresponding index field in step S104.
  • in other examples, the "structure" determined in step S102 may generate several corresponding index fields in step S104.
  • in other examples, several "structures" determined in step S102 may jointly generate one index field in step S104.
  • configuring the structure of the model data in step S102 includes configuring the structure of the model data according to a data dictionary, and formulating an indexing scheme according to the structure of the model data in step S104 includes using the data dictionary to define the fields included in the index.
  • as shown in FIG. 2, a data dictionary 202 includes several optional fields. For example, the figure shows field A, field B, and field C checked and field D unchecked. The checked fields can be used to configure the structure of the model data.
  • as shown in FIG. 2, if the content of interest in the model data 201 (also labeled in the figure as data to be indexed (甲)) includes a first part, a second part, and a third part, and these parts are related in content to field A, field B, and field C of the data dictionary 202, then the content of interest in the model data can be configured by checking field A, field B, and field C in, for example, a UI. In this application this is also referred to as configuring the structure of the model data.
  • it should be noted that although FIG. 2 shows the fields and the parts of the data to be indexed in the illustrated order for convenience of description, in practice the fields and the parts may be arranged in other orders, and they may also be interleaved or cross-arranged.
  • the division of portions of the data to be indexed is done by their contextual semantics. Contents with the same semantics may exist in different positions of the data to be indexed, and the contents in these different positions constitute a (semantic) part of the data to be indexed. In this way, multiple different parts will correspond to the fields of the data dictionary respectively.
  • data can be grabbed from the model data according to the preset method of forming index data.
  • for example, field A is the name and field B is the ID document number. The algorithm for grabbing name data (corresponding to field A) and the algorithm for grabbing ID document number data (corresponding to field B) can then traverse at least part of the model data and grab the name data and ID document number data from it in order to form index data.
  • the index entry 203 shown in Figure 2 will automatically include the field just checked.
  • selecting a desired field from the data dictionary will trigger the configuration of the model data 201, and also trigger the formation of the content of the corresponding field and the formation of an index entry.
  • the content of the corresponding field can be captured by the above method.
  • formulating an index scheme according to the structure in step S104 includes establishing an index configuration table, and the index configuration table includes an index, an index field, a dictionary path, and a value expression.
  • formulating an indexing scheme according to the structure in step S104 further includes defining fields included in the index using a DSL.
  • once a field has been defined, as with checking a field, the configuration of the model data 201 is triggered, and at the same time the content of the corresponding field and an index entry are formed.
  • the data indexing method 10 further includes storing model data, wherein the model data is stored and the index data is stored at the same time. It has been described above that the model data can be accurately indexed synchronously as it is being processed. Sometimes it is necessary to modify the model data, or modify the fields included in the index, and the corresponding index data may change adaptively. If the index data cannot change at the same time as the model data, there will be discrepancies between the two, which may affect the accuracy of the search results. In some examples of the present invention, the time when the model data is stored is set as the time when the index data is stored, so that the two will remain synchronized. It has been described above that an indexing scheme can be formulated through a certain structure. Therefore, in some examples of the present invention, index data can also be updated in real time/quasi real time through this automatic indexing mechanism, ensuring the synchronization of the two.
  • index data established in the above way will be available to engines such as Elasticsearch.
  • the indexing scheme described above hides the details of the underlying search engine and allows indexing to be performed automatically. This indexing approach is particularly suitable for scenarios in which the types of the model data are relatively consistent throughout.
  • data indexing system 30 includes a processor 301 and a computer-readable medium 302 coupled to processor 301 and having instructions stored thereon.
  • the instructions when executed by the processor 301, cause the processor 301 to perform the following operations: configure the structure of the model data; formulate an indexing scheme according to the structure; and index the model data to form and store the index data.
  • the data indexing system 30 in FIG. 3 may also include other unit modules, such as: a display device for displaying the execution of instructions; an external input device for inputting instructions; RAM for temporarily storing data; and a graphics card and a sound card for driving the display device and the audio output device.
  • in order to illustrate the principles of the present invention clearly, these unit modules are omitted from FIG. 3.
  • Model data in the present invention refers to data that can be indexed for searching, for example, can be formatted text, non-formatted text, policy data, contract data, and the like.
  • the structure of the model data can be set as required; in other words, the content to be indexed is set. It should be noted that this setting does not always fully reflect the complete contextual structure of the model data and may be only an excerpt of the part of interest.
  • once the structure of the model data has been determined, some or all of the content of the model data (for example, the part of interest) will be indexed so that this content can be searched.
  • the structure may be roughly the same.
  • several possible structure attribute alternatives can be pre-specified for selection.
  • one or more options of interest can be selected from several alternative options, and the structure of the model data can be constructed accordingly.
  • the instructions when executed by the processor 301, cause the processor 301 to perform operations including formulating an indexing scheme according to the structure determined above.
  • the structure of the model data is determined according to the content of interest. To make the content of interest searchable, that content must first be indexed. Since the indexing scheme is formulated according to the content of interest, and the content of interest is in turn determined by configuring the structure, the "configuration" operation above can be regarded as directly triggering the "formulation" of the corresponding indexing scheme. This work requires no manual participation, so the workload of secondary development can be greatly reduced. When processing model data, data processing personnel can index it accurately at the same time, which is more efficient and accurate than traditional solutions.
  • in some examples, the "structure" determined in the above step will generate one corresponding index field in this step.
  • in other examples, the "structure" determined in the above step may generate several corresponding index fields in this step.
  • in other examples, several "structures" determined in the above step may jointly generate one index field in this step.
  • configuring the structure of the model data includes configuring the structure of the model data according to a data dictionary; and formulating an indexing scheme according to the structure includes using the data dictionary to define fields included in the index.
  • as shown in FIG. 2, a data dictionary 202 includes several optional fields. For example, the figure shows field A, field B, and field C checked and field D unchecked. The checked fields can be used to configure the structure of the model data.
  • as shown in FIG. 2, if the content of interest in the model data 201 (also labeled in the figure as data to be indexed (甲)) includes a first part, a second part, and a third part, and these parts are related in content to field A, field B, and field C of the data dictionary 202, then the content of interest in the model data can be configured by checking field A, field B, and field C in, for example, a UI. In this application this is also referred to as configuring the structure of the model data.
  • it should be noted that although FIG. 2 shows the fields and the parts of the data to be indexed in the illustrated order for convenience of description, in practice the fields and the parts may be arranged in other orders, and they may also be interleaved or cross-arranged.
  • the division of portions of the data to be indexed is done by their contextual semantics. Contents with the same semantics may exist in different positions of the data to be indexed, and the contents in these different positions constitute a (semantic) part of the data to be indexed. In this way, multiple different parts will correspond to the fields of the data dictionary respectively.
  • data can be grabbed from the model data according to the preset method of forming index data.
  • for example, field A is the name and field B is the ID document number. The algorithm for grabbing name data (corresponding to field A) and the algorithm for grabbing ID document number data (corresponding to field B) can then traverse at least part of the model data and grab the name data and ID document number data from it in order to form index data.
  • the index entry 203 shown in Figure 2 will automatically include the field just checked.
  • selecting a desired field from the data dictionary will trigger the configuration of the model data 201, and also trigger the formation of the content of the corresponding field and the formation of an index entry.
  • the content of the corresponding field can be captured by the above method.
  • formulating an indexing scheme according to a structure includes establishing an index configuration table, and the index configuration table includes an index, an index field, a dictionary path, and a value expression.
  • formulating an indexing scheme according to the structure further includes using a DSL to define fields included in the index.
  • the instructions when executed by the processor 301, cause the processor 301 to further perform an operation of storing model data, wherein the index data is stored along with the model data.
  • the model data can be accurately indexed synchronously as it is being processed. Sometimes it is necessary to modify the model data, or modify the fields included in the index, and the corresponding index data may change adaptively. If the index data cannot change at the same time as the model data, there will be discrepancies between the two, which may affect the accuracy of search results.
  • the time when the model data is stored is set as the time when the index data is stored, so that the two will remain synchronized. It has been described above that an indexing scheme can be formulated through a certain structure. In some examples of the present invention, index data can be updated in real-time/quasi-real-time through this automatic indexing mechanism, ensuring the synchronization of the two.
  • a computer-readable storage medium in which instructions are stored that, when executed by a processor, cause the processor to perform any one of the data indexing methods described above.
  • Computer-readable media as referred to in this application include various types of computer storage media and can be any available media that can be accessed by a general purpose or special purpose computer.
  • for example, computer-readable media may include RAM, ROM, EPROM, E²PROM, registers, hard disks, removable disks, CD-ROM or other optical disc storage, magnetic disk storage or other magnetic storage devices, or any other transitory or non-transitory medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer or by a general-purpose or special-purpose processor. As used herein, disks typically reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media.
  • An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium.
  • the storage medium may be integrated into the processor.
  • the processor and storage medium may reside in the ASIC.
  • the ASIC may reside in the user terminal.
  • the processor and storage medium may reside in the user terminal as discrete components.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Business, Economics & Management (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

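The present application relates to a data indexing method and system and a storage medium. The method comprises: configuring a structure of model data; formulating an indexing scheme according to the structure; and indexing the model data to form and store index data. With this method, data can be indexed efficiently and the workload of indexing tasks is reduced.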

Description

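Data indexing method and system, and storage medium
Technical Field
The present application relates to the field of data indexing, and in particular to a data indexing method, a data indexing system, and a computer-readable storage medium.
Background
Indexing and search solutions such as Elasticsearch already exist in the prior art. Elasticsearch is a Lucene-based search service that provides a distributed, multi-user full-text search engine with a RESTful web interface. Elasticsearch is developed in Java and released as open source under the terms of the Apache License, and it is a popular enterprise-grade search engine. However, such solutions are not friendly to non-specialists, and their learning cost is high. In addition, configuring them for specific needs is time-consuming and labor-intensive. In view of this, a mechanism is needed that can implement data indexing simply and efficiently.
Summary
Embodiments of the present application provide a data indexing method, a data indexing system, and a computer-readable storage medium for indexing data efficiently and reducing the workload of indexing tasks.
According to one aspect of the present application, a data indexing method is provided, including: configuring a structure of model data; formulating an indexing scheme according to the structure; and indexing the model data to form and store index data.
In some embodiments of the present application, optionally, the method further includes storing the model data, wherein the index data is stored at the same time as the model data.
In some embodiments of the present application, optionally, configuring the structure of the model data includes configuring the structure of the model data according to a data dictionary, and formulating an indexing scheme according to the structure includes using the data dictionary to define the fields included in the index.
In some embodiments of the present application, optionally, formulating an indexing scheme according to the structure includes establishing an index configuration table, the index configuration table including an index, an index field, a dictionary path, and a value expression.
In some embodiments of the present application, optionally, formulating an indexing scheme according to the structure further includes using a DSL to define the fields included in the index.
According to another aspect of the present application, a system for data indexing is provided, including: a processor; and a computer-readable medium coupled to the processor and having instructions stored thereon, the instructions, when executed by the processor, causing the processor to perform the following operations: configuring a structure of model data; formulating an indexing scheme according to the structure; and indexing the model data to form and store index data.
In some embodiments of the present application, optionally, the processor further performs the operation of storing the model data, wherein the index data is stored at the same time as the model data.
In some embodiments of the present application, optionally, configuring the structure of the model data includes configuring the structure of the model data according to a data dictionary, and formulating an indexing scheme according to the structure includes using the data dictionary to define the fields included in the index.
In some embodiments of the present application, optionally, formulating an indexing scheme according to the structure includes establishing an index configuration table, the index configuration table including an index, an index field, a dictionary path, and a value expression.
In some embodiments of the present application, optionally, formulating an indexing scheme according to the structure further includes using a DSL to define the fields included in the index.
According to another aspect of the present application, a computer-readable storage medium is provided in which instructions are stored, characterized in that, when the instructions are executed by a processor, the processor is caused to perform any one of the methods described above.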
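Brief Description of the Drawings
The above and other objects and advantages of the present application will become more complete and clear from the following detailed description taken in conjunction with the accompanying drawings, in which the same or similar elements are denoted by the same reference numerals.
FIG. 1 shows a data indexing method according to an embodiment of the present application.
FIG. 2 shows a data indexing mechanism according to an embodiment of the present application.
FIG. 3 shows a data indexing system according to an embodiment of the present application.
Detailed Description
For brevity and illustrative purposes, the principles of the present application are described herein mainly with reference to exemplary embodiments. However, those skilled in the art will readily recognize that the same principles apply equally to all types of data indexing methods, data indexing systems, and computer-readable storage media, and that these same or similar principles can be implemented in them; no such variation departs from the true spirit and scope of the present application.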
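According to one aspect of the present application, a data indexing method is provided. As shown in FIG. 1, the data indexing method 10 includes the following steps: configuring the structure of model data in step S102; formulating an indexing scheme according to the structure in step S104; and indexing the model data in step S106 to form and store index data.
In step S102, the data indexing method 10 configures the structure of the model data. In the present invention, model data refers to data that can be indexed for searching, for example formatted text, unformatted text, insurance policy data, contract data, and the like. In step S102, the structure of the model data can be set as required; in other words, the content to be indexed is set. It should be noted that this setting does not always fully reflect the complete contextual structure of the model data and may be only an excerpt of the part of interest. Once the structure of the model data has been determined, some or all of the content of the model data (for example, the part of interest) will be indexed so that this content can be searched.
For multiple kinds of model data whose content differs little, the structure may be largely the same. For convenience of structure configuration, a number of possible structure attribute options can be specified in advance for selection. When the structure of the model data needs to be determined, one or more options of interest can be picked from these options, and the structure of the model data is constructed from them.
In step S104, the data indexing method 10 formulates an indexing scheme according to the structure determined in step S102. In step S102 the structure of the model data is determined according to the content of interest; to make the content of interest searchable, that content must first be indexed. Since the indexing scheme is formulated according to the content of interest, and the content of interest is in turn determined by configuring the structure, the "configuration" operation in step S102 can be regarded as directly triggering the "formulation" of the corresponding indexing scheme. This work requires no manual participation, so the workload of secondary development can be greatly reduced. When processing model data, data processing personnel can index it accurately at the same time, which is more efficient and accurate than traditional solutions.
Specifically, in some examples, the "structure" determined in step S102 will generate one corresponding index field in step S104. In other examples, the "structure" determined in step S102 may generate several corresponding index fields in step S104. In other examples, several "structures" determined in step S102 may jointly generate one index field in step S104.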
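The three cases above (one structure to one index field, one structure to several index fields, several structures to one index field) can be sketched as follows. This is only an illustration under stated assumptions, not the applicant's implementation: the record keys (`id_number`, `address`, `family_name`, `given_name`) and all function names are hypothetical.

```python
def one_to_one(record):
    """One structure element yields exactly one index field."""
    return {"id_number": record["id_number"]}


def one_to_many(record):
    """One structure element (an address) yields several index fields."""
    city, street = record["address"].split(", ", 1)
    return {"city": city, "street": street}


def many_to_one(record):
    """Several structure elements jointly yield one index field."""
    return {"full_name": f'{record["family_name"]} {record["given_name"]}'}


def index_fields(record):
    """Merge the index fields produced by all three kinds of mapping."""
    fields = {}
    for mapping in (one_to_one, one_to_many, many_to_one):
        fields.update(mapping(record))
    return fields
```

Keeping each mapping as a separate function makes it easy to see which structure elements feed which index field.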
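In some embodiments of the present application, configuring the structure of the model data in step S102 includes configuring the structure of the model data according to a data dictionary, and formulating an indexing scheme according to the structure of the model data in step S104 includes using the data dictionary to define the fields included in the index. FIG. 2 shows a data dictionary 202 that includes several optional fields. For example, the figure shows field A, field B, and field C checked and field D unchecked; the checked fields can be used to configure the structure of the model data. As shown in FIG. 2, if the content of interest in the model data 201 (also labeled in the figure as data to be indexed (甲)) includes a first part, a second part, and a third part, and these parts are related in content to field A, field B, and field C of the data dictionary 202, then the content of interest in the model data can be configured by checking field A, field B, and field C in, for example, a UI. In this application this is also referred to as configuring the structure of the model data.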
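A minimal sketch of this selection step, assuming the data dictionary can be represented as a mapping from field labels to dictionary paths. The paths, the `DictionaryField` type, and both function names are hypothetical; the source does not specify how the dictionary is stored.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DictionaryField:
    name: str  # label shown to the user, e.g. "A"
    path: str  # hypothetical dictionary path into the model data


# Hypothetical data dictionary with four optional fields, as in FIG. 2.
DATA_DICTIONARY = {
    "A": DictionaryField("A", "holder.name"),
    "B": DictionaryField("B", "holder.idNumber"),
    "C": DictionaryField("C", "policy.number"),
    "D": DictionaryField("D", "policy.remarks"),
}


def configure_structure(checked):
    """Return the checked dictionary fields, in dictionary order."""
    unknown = set(checked) - set(DATA_DICTIONARY)
    if unknown:
        raise KeyError(f"not in data dictionary: {sorted(unknown)}")
    return [field for key, field in DATA_DICTIONARY.items() if key in checked]


def formulate_index_scheme(structure):
    """Derive the index fields directly from the configured structure."""
    return {field.name: field.path for field in structure}
```

Checking A, B, and C and leaving D unchecked, as in the figure, yields an index scheme with exactly those three fields; no separate manual index definition is needed.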
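It should be noted that although FIG. 2 shows the fields and the parts of the data to be indexed in the illustrated order for convenience of description, in practice the fields and the parts may be arranged in other orders, and they may also be interleaved or cross-arranged. For example, in some examples the data to be indexed is divided into parts according to contextual semantics. Content with the same semantics may appear at different positions in the data to be indexed, and the content at these different positions together constitutes one (semantic) part of the data to be indexed. In this way, the different parts each correspond to a field of the data dictionary.
Once the required fields have been selected, data can be grabbed from the model data according to a preset method of forming index data. For example, field A is the name and field B is the ID document number. The algorithm for grabbing name data (corresponding to field A) and the algorithm for grabbing ID document number data (corresponding to field B) can then traverse at least part of the model data and grab the name data and ID document number data from it in order to form index data.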
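The traversal described above can be sketched as a recursive walk over nested model data. The source does not specify the grabbing algorithm; this sketch assumes the model data is a tree of dictionaries and lists and that each dictionary field maps to a key name (`name`, `idNumber`), both of which are assumptions made for illustration.

```python
def grab(model_data, key):
    """Collect every value stored under `key` anywhere in nested dicts/lists."""
    found = []
    if isinstance(model_data, dict):
        for k, value in model_data.items():
            if k == key:
                found.append(value)
            else:
                found.extend(grab(value, key))
    elif isinstance(model_data, list):
        for item in model_data:
            found.extend(grab(item, key))
    return found


def build_index_data(model_data, field_keys):
    """Form index data by grabbing the values for each selected field."""
    return {field: grab(model_data, key) for field, key in field_keys.items()}
```

Because the walk visits every position, values with the same meaning that sit in different places of the model data end up under the same index field.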
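In addition, the index entry 203 shown in FIG. 2 will automatically include the fields just checked. Thus, selecting the required fields from the data dictionary triggers the configuration of the model data 201 and, at the same time, triggers the formation of the content of the corresponding fields and of an index entry. The content of the corresponding fields can be grabbed in the manner described above.
In some embodiments of the present application, formulating an indexing scheme according to the structure in step S104 includes establishing an index configuration table, the index configuration table including an index, an index field, a dictionary path, and a value expression.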
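One way to picture an index configuration table with these four columns is shown below. The source names the columns but not their format, so this sketch makes two assumptions: the dictionary path is a dot-separated key path, and the value expression is modeled as a Python callable applied to the raw value. All names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class IndexConfigRow:
    index: str                               # target index name
    index_field: str                         # field name inside the index
    dictionary_path: str                     # dot-separated path (assumed format)
    value_expression: Callable[[Any], Any]   # transformation of the raw value


def resolve(model, path):
    """Follow a dot-separated dictionary path through nested dicts."""
    node = model
    for part in path.split("."):
        node = node[part]
    return node


def build_documents(model, rows):
    """Apply every configuration row and group the results by index."""
    documents = {}
    for row in rows:
        raw = resolve(model, row.dictionary_path)
        documents.setdefault(row.index, {})[row.index_field] = row.value_expression(raw)
    return documents
```

Each row is self-describing, so adding a field to an index means adding one row rather than writing engine-specific code.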
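In some embodiments of the present application, formulating an indexing scheme according to the structure in step S104 further includes using a DSL to define the fields included in the index. Although directly checking preset fields meets the needs of most cases, in some cases new fields can also be defined in a DSL to extend the field options. Once a field has been defined, as with checking a field, the configuration of the model data 201 is triggered, and at the same time the content of the corresponding field and an index entry are formed.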
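The source does not describe the DSL's syntax. Purely for illustration, the sketch below invents a one-line-per-field syntax, `field <index_field> = <dictionary.path>`, and parses it into field definitions; the syntax, the regular expression, and the function name are all hypothetical.

```python
import re

# Hypothetical syntax: field <index_field> = <dictionary.path>
_FIELD_RE = re.compile(r"^field\s+(\w+)\s*=\s*([\w.]+)$")


def parse_field_definitions(source):
    """Parse DSL text into a mapping of index field name to dictionary path."""
    fields = {}
    for line_no, raw in enumerate(source.splitlines(), start=1):
        line = raw.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        match = _FIELD_RE.match(line)
        if match is None:
            raise ValueError(f"line {line_no}: cannot parse {raw!r}")
        name, path = match.groups()
        fields[name] = path
    return fields
```

The parsed mapping has the same shape as a set of checked dictionary fields, so a DSL-defined field can feed the same indexing step as a preset one.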
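In some embodiments of the present application, the data indexing method 10 further includes storing the model data, wherein the index data is stored at the same time as the model data. It has been described above that model data can be accurately indexed synchronously while it is being processed. Sometimes the model data, or the fields included in the index, need to be modified, in which case the corresponding index data may change accordingly. If the index data cannot change at the same time as the model data, the two will diverge, which may affect the accuracy of search results. In some examples of the present invention, the moment at which the model data is written to the database is set as the moment at which the index data is stored, so that the two remain synchronized. It has also been described above that an indexing scheme can be formulated from the determined structure; therefore, in some examples of the present invention, the index data can also be updated in real time or near real time through this automatic indexing mechanism, which keeps the two synchronized.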
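A minimal sketch of storing the index data at the same moment as the model data, using an in-memory SQLite database from the Python standard library. The source does not name a storage technology; the single-transaction approach, the table names, and the flat `index_keys` mapping are assumptions made for illustration.

```python
import json
import sqlite3


def open_store():
    """Create an in-memory store with one table per kind of data."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE model_data (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
    conn.execute("CREATE TABLE index_data (id TEXT PRIMARY KEY, body TEXT NOT NULL)")
    return conn


def save(conn, doc_id, model, index_keys):
    """Write model data and its derived index data in a single transaction."""
    index_doc = {field: model.get(key) for field, key in index_keys.items()}
    with conn:  # commits both rows together, or rolls both back on error
        conn.execute(
            "INSERT OR REPLACE INTO model_data VALUES (?, ?)",
            (doc_id, json.dumps(model)),
        )
        conn.execute(
            "INSERT OR REPLACE INTO index_data VALUES (?, ?)",
            (doc_id, json.dumps(index_doc)),
        )
    return index_doc
```

Because the index row is derived and written inside the same transaction as the model row, a later modification of the model data cannot leave a stale index entry behind.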
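Index data built in the above manner can be used by engines such as Elasticsearch. The indexing scheme described above hides the details of the underlying search engine and allows indexing to be performed automatically. This indexing approach is particularly suitable for scenarios in which the types of the model data are relatively consistent throughout.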
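The source does not say how the index data reaches the engine. As one possible illustration, the sketch below serializes index documents into the newline-delimited JSON body accepted by Elasticsearch's bulk API (an action line followed by a source line for each document). It only builds the request body and performs no network call; the function name and the example index name are hypothetical.

```python
import json


def to_bulk_body(index_name, documents):
    """Serialize {doc_id: index_document} as a newline-delimited bulk body."""
    lines = []
    for doc_id, doc in documents.items():
        # Action line: index this document under the given index and id.
        lines.append(json.dumps({"index": {"_index": index_name, "_id": doc_id}}))
        # Source line: the index document itself.
        lines.append(json.dumps(doc, ensure_ascii=False))
    return "\n".join(lines) + "\n"  # the bulk body must end with a newline
```

Keeping this serialization in one place is what lets the rest of the indexing flow stay unaware of the underlying engine.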
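According to another aspect of the present application, a system for data indexing is provided. As shown in FIG. 3, the data indexing system 30 includes a processor 301 and a computer-readable medium 302 coupled to the processor 301 and having instructions stored thereon. The instructions, when executed by the processor 301, cause the processor 301 to perform the following operations: configuring the structure of model data; formulating an indexing scheme according to the structure; and indexing the model data to form and store index data.
The data indexing system 30 in FIG. 3 may also include other unit modules, such as: a display device for displaying the execution of instructions; an external input device for inputting instructions; RAM for temporarily storing data; and a graphics card and a sound card for driving the display device and the audio output device. In order to illustrate the principles of the present invention clearly, these unit modules are omitted from FIG. 3.
The operations that the instructions cause the processor 301 to perform when executed include configuring the structure of the model data. In the present invention, model data refers to data that can be indexed for searching, for example formatted text, unformatted text, insurance policy data, contract data, and the like. In this step, the structure of the model data can be set as required; in other words, the content to be indexed is set. It should be noted that this setting does not always fully reflect the complete contextual structure of the model data and may be only an excerpt of the part of interest. Once the structure of the model data has been determined, some or all of the content of the model data (for example, the part of interest) will be indexed so that this content can be searched.
For multiple kinds of model data whose content differs little, the structure may be largely the same. For convenience of structure configuration, a number of possible structure attribute options can be specified in advance for selection. When the structure of the model data needs to be determined, one or more options of interest can be picked from these options, and the structure of the model data is constructed from them.
The operations that the instructions cause the processor 301 to perform when executed include formulating an indexing scheme according to the structure determined above. As described above, the structure of the model data is determined according to the content of interest; to make the content of interest searchable, that content must first be indexed. Since the indexing scheme is formulated according to the content of interest, and the content of interest is in turn determined by configuring the structure, the "configuration" operation above can be regarded as directly triggering the "formulation" of the corresponding indexing scheme. This work requires no manual participation, so the workload of secondary development can be greatly reduced. When processing model data, data processing personnel can index it accurately at the same time, which is more efficient and accurate than traditional solutions.
Specifically, in some examples, the "structure" determined in the above step will generate one corresponding index field in this step. In other examples, the "structure" determined in the above step may generate several corresponding index fields in this step. In other examples, several "structures" determined in the above step may jointly generate one index field in this step.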
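In some embodiments of the present application, configuring the structure of the model data includes configuring the structure of the model data according to a data dictionary, and formulating an indexing scheme according to the structure includes using the data dictionary to define the fields included in the index. FIG. 2 shows a data dictionary 202 that includes several optional fields. For example, the figure shows field A, field B, and field C checked and field D unchecked; the checked fields can be used to configure the structure of the model data. As shown in FIG. 2, if the content of interest in the model data 201 (also labeled in the figure as data to be indexed (甲)) includes a first part, a second part, and a third part, and these parts are related in content to field A, field B, and field C of the data dictionary 202, then the content of interest in the model data can be configured by checking field A, field B, and field C in, for example, a UI. In this application this is also referred to as configuring the structure of the model data.
It should be noted that although FIG. 2 shows the fields and the parts of the data to be indexed in the illustrated order for convenience of description, in practice the fields and the parts may be arranged in other orders, and they may also be interleaved or cross-arranged. For example, in some examples the data to be indexed is divided into parts according to contextual semantics. Content with the same semantics may appear at different positions in the data to be indexed, and the content at these different positions together constitutes one (semantic) part of the data to be indexed. In this way, the different parts each correspond to a field of the data dictionary.
Once the required fields have been selected, data can be grabbed from the model data according to a preset method of forming index data. For example, field A is the name and field B is the ID document number. The algorithm for grabbing name data (corresponding to field A) and the algorithm for grabbing ID document number data (corresponding to field B) can then traverse at least part of the model data and grab the name data and ID document number data from it in order to form index data.
In addition, the index entry 203 shown in FIG. 2 will automatically include the fields just checked. Thus, selecting the required fields from the data dictionary triggers the configuration of the model data 201 and, at the same time, triggers the formation of the content of the corresponding fields and of an index entry. The content of the corresponding fields can be grabbed in the manner described above.
In some embodiments of the present application, formulating an indexing scheme according to the structure includes establishing an index configuration table, the index configuration table including an index, an index field, a dictionary path, and a value expression.
In some embodiments of the present application, formulating an indexing scheme according to the structure further includes using a DSL to define the fields included in the index. Although directly checking preset fields meets the needs of most cases, in some cases new fields can also be defined in a DSL to extend the field options. Once a field has been defined, as with checking a field, the configuration of the model data 201 is triggered, and at the same time the content of the corresponding field and an index entry are formed.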
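In some embodiments of the present application, the instructions, when executed by the processor 301, cause the processor 301 to further perform the operation of storing the model data, wherein the index data is stored at the same time as the model data. It has been described above that model data can be accurately indexed synchronously while it is being processed. Sometimes the model data, or the fields included in the index, need to be modified, in which case the corresponding index data may change accordingly. If the index data cannot change at the same time as the model data, the two will diverge, which may affect the accuracy of search results. In some examples of the present invention, the moment at which the model data is written to the database is set as the moment at which the index data is stored, so that the two remain synchronized. It has also been described above that an indexing scheme can be formulated from the determined structure; in some examples of the present invention, the index data can also be updated in real time or near real time through this automatic indexing mechanism, which keeps the two synchronized.
According to another aspect of the present application, a computer-readable storage medium is provided in which instructions are stored that, when executed by a processor, cause the processor to perform any one of the data indexing methods described above. Computer-readable media as referred to in this application include various types of computer storage media and can be any available media that can be accessed by a general-purpose or special-purpose computer. For example, computer-readable media may include RAM, ROM, EPROM, E²PROM, registers, hard disks, removable disks, CD-ROM or other optical disc storage, magnetic disk storage or other magnetic storage devices, or any other transitory or non-transitory medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a general-purpose or special-purpose computer or by a general-purpose or special-purpose processor. As used herein, disks typically reproduce data magnetically, while discs reproduce data optically with lasers. Combinations of the above should also be included within the scope of computer-readable media. An exemplary storage medium is coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integrated into the processor. The processor and the storage medium may reside in an ASIC. The ASIC may reside in a user terminal. In the alternative, the processor and the storage medium may reside in a user terminal as discrete components.
The above are only specific embodiments of the present application, and the scope of protection of the present application is not limited to them. Those skilled in the art can conceive of other feasible changes or substitutions within the technical scope disclosed in the present application, and all such changes or substitutions fall within the scope of protection of the present application. Where there is no conflict, the embodiments of the present application and the features of the embodiments may also be combined with one another. The scope of protection of the present application is defined by the claims.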

Claims (11)

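  1. A data indexing method, comprising:
    configuring a structure of model data;
    formulating an indexing scheme according to the structure; and
    indexing the model data to form and store index data.
  2. The method according to claim 1, further comprising storing the model data, wherein the index data is stored at the same time as the model data.
  3. The method according to claim 1, wherein:
    configuring the structure of the model data comprises configuring the structure of the model data according to a data dictionary; and
    formulating an indexing scheme according to the structure comprises using the data dictionary to define the fields included in the index.
  4. The method according to claim 3, wherein formulating an indexing scheme according to the structure comprises establishing an index configuration table, the index configuration table comprising an index, an index field, a dictionary path, and a value expression.
  5. The method according to claim 4, wherein formulating an indexing scheme according to the structure further comprises using a DSL to define the fields included in the index.
  6. A computer-readable storage medium in which instructions are stored, characterized in that, when the instructions are executed by a processor, the processor is caused to perform the method according to any one of claims 1 to 5.
  7. A system for data indexing, comprising:
    a processor; and
    a computer-readable medium coupled to the processor and having instructions stored thereon, the instructions, when executed by the processor, causing the processor to perform the following operations:
    configuring a structure of model data;
    formulating an indexing scheme according to the structure; and
    indexing the model data to form and store index data.
  8. The system according to claim 7, wherein the processor further performs the operation of storing the model data, wherein the index data is stored at the same time as the model data.
  9. The system according to claim 7, wherein:
    configuring the structure of the model data comprises configuring the structure of the model data according to a data dictionary; and
    formulating an indexing scheme according to the structure comprises using the data dictionary to define the fields included in the index.
  10. The system according to claim 9, wherein formulating an indexing scheme according to the structure comprises establishing an index configuration table, the index configuration table comprising an index, an index field, a dictionary path, and a value expression.
  11. The system according to claim 10, wherein formulating an indexing scheme according to the structure further comprises using a DSL to define the fields included in the index.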
PCT/CN2022/074712 2021-02-03 2022-01-28 Data indexing method and system, and storage medium WO2022166828A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110146856.X 2021-02-03
CN202110146856.XA CN114860864A (zh) 2021-02-03 2021-02-03 Data indexing method and system, and storage medium

Publications (1)

Publication Number Publication Date
WO2022166828A1 (zh) 2022-08-11

Family

ID=82622814

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/074712 WO2022166828A1 (zh) 2021-02-03 2022-01-28 Data indexing method and system, and storage medium

Country Status (2)

Country Link
CN (1) CN114860864A (zh)
WO (1) WO2022166828A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1143349A1 (en) * 2000-04-07 2001-10-10 IconParc GmbH Method and apparatus for generating index data for search engines
AU2003204729A1 (en) * 2002-06-17 2004-01-15 Canon Kabushiki Kaisha Indexing and Querying Structured Documents
WO2008130501A1 (en) * 2007-04-16 2008-10-30 Retrevo, Inc. Unstructured and semistructured document processing and searching and generation of value-based information
CN106649597A (zh) * 2016-11-22 2017-05-10 浙江大学 Method for automatically constructing a back-of-book index based on book content

Also Published As

Publication number Publication date
CN114860864A (zh) 2022-08-05

Similar Documents

Publication Publication Date Title
US10169471B2 (en) Generating and executing query language statements from natural language
US9703830B2 (en) Translation of a SPARQL query to a SQL query
US10282447B2 (en) Adapting a relational query to accommodate hierarchical data
US20120233153A1 (en) Hierarchical browsing operations on a directory attribute
US20200125619A1 (en) Methods and systems for providing a search service application
WO2015029187A1 (ja) Database management system and computer system
US12086194B2 (en) Methods and systems for building search service application
US9734176B2 (en) Index merge ordering
US10303726B2 (en) Decoupling filter injection and evaluation by forced pushdown of filter attributes in calculation models
JP5597623B2 (ja) Database processing method
US20190147088A1 (en) Reporting and data governance management
WO2018076971A1 (zh) Method and device for operating on XML data
US20170262507A1 (en) Feedback mechanism for query execution
WO2022166828A1 (zh) Data indexing method and system, and storage medium
US10318524B2 (en) Reporting and data governance management
US20130138683A1 (en) Systems and methods of automatic generation and execution of database queries
KR20220092990A (ko) Universal data index for rapid data exploration
US10025818B2 (en) Customize column sequence in projection list of select queries
US20140089207A1 (en) System and method for providing high level view tracking of changes in sca artifacts
WO2015156336A1 (ja) Terminology unification system, terminology unification program, and terminology unification method
JP5359446B2 (ja) Information processing system, access path determination method, and access path determination program
US8788511B2 (en) Enabling siebel database manipulation by use of a direct SQL statement
US20180101622A1 (en) Perform graph traversal with graph query language
JP2001034634A (ja) Database search system and program recording medium therefor
EP1542136A1 (en) Method and computer system for data retrieval

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22749107

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22749107

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 070224)