WO2022057460A1 - Massive file retrieval method, apparatus, and device based on an AI training platform - Google Patents

Massive file retrieval method, apparatus, and device based on an AI training platform

Info

Publication number
WO2022057460A1
Authority
WO
WIPO (PCT)
Prior art keywords
retrieval
training platform
task
traversal
queue
Prior art date
Application number
PCT/CN2021/109217
Other languages
English (en)
French (fr)
Inventor
姬贵阳
Original Assignee
苏州浪潮智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 苏州浪潮智能科技有限公司
Priority to US18/011,514 (patent US11768805B2)
Publication of WO2022057460A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G06F16/134 Distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/156 Query results presentation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound

Definitions

  • The invention belongs to the technical field of file retrieval, and in particular relates to a massive file retrieval method, apparatus, and device based on an AI training platform.
  • An AI training platform is an artificial intelligence training platform that manages and schedules resources such as CPUs and GPUs and handles model training and task management.
  • AI training platforms effectively meet the computing power requirements of enterprises and scientific research institutions.
  • One of the most important and most basic functions of an AI training platform is file operations. Retrieving useful information from a massive number of files is a routine task for algorithm researchers, who search for specific files, or fuzzy-match files, within a dataset or a user directory. The performance of massive file retrieval directly affects the work efficiency of researchers using the platform, so improving retrieval performance is an urgent problem.
  • Existing file retrieval technology is mainly used for file management in various systems.
  • The number of files in those systems differs greatly in magnitude from that of an AI training platform, and the existing approaches amount to the following: recursively traversing all files and fuzzy-matching each one; using the Linux find command for fuzzy-match search; or building distributed file storage for large-scale dedicated file management, whose performance depends on large amounts of high-specification hardware.
  • These technologies are poorly suited to AI training platforms; none of them fits the scenario of retrieving files on an AI training platform.
  • Existing technology does address massive file retrieval, but in a way that is highly specific to one business: it handles file operations only and generally involves no other business functions.
  • Such technology relies on large amounts of hardware and other high-specification resources to build a distributed file management platform for distributed retrieval. It suits only a single business and supports the business of an AI training platform poorly or not at all, which wastes resources.
  • To address these defects, the present invention provides a massive file retrieval method, apparatus, and device based on an AI training platform.
  • In a first aspect, the present invention provides a massive file retrieval method based on an AI training platform, comprising the following steps:
  • S1. The AI training platform obtains the retrieval task issued by the user;
  • S2. The AI training platform generates a retrieval thread flow according to the retrieval task, and controls the business logic of the retrieval process according to the retrieval thread flow;
  • S3. The AI training platform sequentially encodes the files in the database folder by folder to generate ordered queue folders, extracts the retrieval keyword from the retrieval task, and then performs keyword retrieval on each ordered queue folder by combining binary search with depth-first traversal.
  • Further, step S1 is as follows:
  • S11. The AI training platform obtains the user's login token;
  • S12. The AI training platform receives the retrieval task issued by the user according to the token. This guarantees that one token corresponds to one retrieval task.
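  • Further, step S2 is as follows:
  • S21. The AI training platform starts the retrieval thread;
  • S22. Each time a retrieval is completed, determine whether the retrieval count threshold is met; if yes, go to step S23; if not, go to step S24;
  • S23. Determine whether the retrieval task is complete; if yes, go to step S28; if not, go to step S25;
  • S24. Continue retrieval and return to step S22;
  • S25. Return the retrieved content, pause retrieval, and determine whether the total retrieval duration exceeds the retrieval duration threshold; if yes, go to step S26; if not, go to step S27;
  • S26. The retrieval has timed out; retrieval ends;
  • S27. Wake the retrieval and return to step S24;
  • S28. Return the retrieved content; retrieval ends.
  • Further, step S22 is as follows:
  • S221. Each time a retrieval is completed, determine whether the user of the same token has issued a next retrieval task; if yes, go to step S222; if not, go to step S224;
  • S222. Interrupt the current retrieval task thread and act according to the type of the next retrieval task;
  • S223. Determine whether the interruption of the current retrieval task has timed out; if it has, the interruption is judged to have timed out and retrieval ends; if it has not, wait for the next retrieval task to complete and return to step S223;
  • S224. Determine whether the retrieval count threshold is met; if yes, go to step S23; if not, go to step S24.
  • One token can correspond to only one retrieval task. If the user of the same token issues a next retrieval task, the current retrieval task thread must be interrupted and handled according to the type of the next retrieval task. The interruption is time-limited; if it times out, the interrupted retrieval task is stopped.
  • Further, in step S222, the type of the next retrieval task is determined:
  • if the next retrieval task type is overwrite retrieval, the next retrieval task is treated as a new retrieval task, and the process returns to step S21;
  • if the next retrieval task type is queued retrieval, the next retrieval task is placed in the waiting queue, and the process goes to step S24;
  • if the next retrieval task type is pause retrieval, go to step S25;
  • if the next retrieval task type is continue retrieval, go to step S24;
  • if the next retrieval task type is terminate retrieval, go to step S28.
  • The next retrieval task type determines whether the interrupted retrieval task is overwritten.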
  • Further, step S3 is as follows:
  • S31. The AI training platform sequentially encodes the files in the database folder by folder using hashing, and generates ascending or descending queue folders;
  • S32. The AI training platform extracts the retrieval keyword from the retrieval task and locates one queue folder;
  • S33. Before traversal, the AI training platform determines the traversal depth according to the retrieval keyword and the located queue folder, and then determines the traversal path by binary search according to the traversal depth;
  • S34. The AI training platform performs traversal retrieval in the located queue folder along the traversal path and, when it finishes, determines whether all queue folders have been traversed; if yes, go to step S35; if not, locate the next queue folder and return to step S33;
  • S35. Return the retrieved content; retrieval ends.
  • Sequential encoding is the basis for the subsequent binary search. It lets the search proceed by jumps: the ordered queue is halved, and if the target is smaller than the middle element the search range narrows to the first half, otherwise to the second half. A balanced binary tree or B+ tree, although more efficient, is not used, because either requires building and maintaining a large number of indexes, a performance cost unsuited to an AI training platform; this patent therefore uses binary search to determine the traversal path. Encoding is done per folder, over the files within one folder, rather than over all files at once, which keeps the queue from growing too large to maintain.
  • Further, step S33 is as follows:
  • S331. The AI training platform obtains the retrieval keyword type;
  • S332. The AI training platform determines the traversal depth according to the retrieval keyword type and the content of the located queue folder;
  • S333. The AI training platform identifies all ordered file nodes at the traversal depth level in the located queue folder, and determines the head node and the tail node according to the ascending or descending order of the file nodes;
  • S334. The AI training platform computes the intermediate node from the head node and the tail node;
  • S335. According to whether the file nodes are in ascending or descending order, the AI training platform sets a new head node and tail node and computes a new intermediate node, until the traversal path has been set for all file nodes at the same traversal depth.
  • Binary search lets the retrieval jump and greatly narrows the retrieval range.
  • Further, if in step S331 the retrieval keyword type is a folder keyword:
  • in step S332, the AI training platform determines the traversal depth based on the folder keyword attribute and the content of the located queue folder;
  • in step S333, if the content attributes in the located queue folder are inconsistent, the AI training platform builds the traversal path from the folder nodes at the same traversal depth and ignores the file nodes at that depth.
  • If in step S331 the retrieval keyword type is a file keyword:
  • in step S332, the AI training platform determines the traversal depth based on the file keyword attribute and the content of the located queue folder;
  • in step S333, if the content attributes in the located queue folder are inconsistent, the AI training platform builds the traversal path from the file nodes at the same traversal depth and, for the folder nodes at that depth, returns to step S332 to continue determining the traversal depth.
  • In a second aspect, the present invention provides a massive file retrieval apparatus based on an AI training platform, including:
  • a retrieval task acquisition module, configured to have the AI training platform obtain retrieval tasks issued by users;
  • a retrieval thread flow setting module, configured to have the AI training platform generate the retrieval thread flow according to the retrieval task and control the business logic of the retrieval process according to the retrieval thread flow;
  • a traversal retrieval module, configured to have the AI training platform sequentially encode the files in the database folder by folder to generate ordered queue folders, extract the retrieval keyword from the retrieval task, and then perform keyword retrieval on each ordered queue folder by combining binary search with depth-first traversal.
  • In a third aspect, the present invention also provides a device including a processor and a memory, wherein the memory stores a computer program and the processor calls and runs the computer program from the memory so that the device executes the method of the first aspect.
  • The beneficial effects of the present invention are as follows.
  • The massive file retrieval apparatus based on the AI training platform provided by the present invention uses the retrieval thread flow to control the business logic of the retrieval process, which prevents the AI training platform from occupying the server's CPU for a long time, reduces resource usage, and keeps the platform's business running stably.
  • Combining depth-first traversal with binary search improves retrieval efficiency and avoids the defect of depth-first traversal used alone, in which files late in the order take a long time to reach; this shortens the training time of the AI training platform and improves the efficiency of model training.
  • The performance of massive file retrieval on the AI training platform is improved, which strengthens the platform's competitiveness.
  • In addition, the present invention has a reliable design principle and a simple structure, and has very broad application prospects.
  • Fig. 1 is the first schematic flowchart of the method of the present invention;
  • Fig. 2 is the second schematic flowchart of the method of the present invention;
  • Fig. 3 is a schematic diagram of the system of the present invention.
  • In the figures: 1 - retrieval task acquisition module; 1.1 - token acquisition unit; 1.2 - retrieval task acquisition unit; 2 - retrieval thread flow setting module; 2.1 - retrieval thread starting unit; 2.2 - retrieval count threshold judgment unit; 2.3 - retrieval task completion judgment unit; 2.4 - continue retrieval unit; 2.5 - total retrieval duration judgment unit; 2.6 - retrieval timeout determination unit; 2.7 - retrieval wake-up unit; 2.8 - first retrieved-content return unit; 3 - traversal retrieval module; 3.1 - sequence encoding unit; 3.2 - queue folder locating unit; 3.3 - traversal path determination unit; 3.4 - traversal retrieval unit; 3.5 - queue folder relocating unit; 3.6 - second retrieved-content return unit.
  • Embodiment 1: As shown in Fig. 1, the present invention provides a massive file retrieval method based on an AI training platform, comprising the following steps:
  • S1. The AI training platform obtains the retrieval task issued by the user;
  • S2. The AI training platform generates a retrieval thread flow according to the retrieval task, and controls the business logic of the retrieval process according to the retrieval thread flow;
  • S3. The AI training platform sequentially encodes the files in the database folder by folder to generate ordered queue folders, extracts the retrieval keyword from the retrieval task, and then performs keyword retrieval on each ordered queue folder by combining binary search with depth-first traversal.
  • Embodiment 2: As shown in Fig. 2, the present invention provides a massive file retrieval method based on an AI training platform, comprising the following steps:
  • S1. The AI training platform obtains the retrieval task issued by the user; the specific steps are as follows:
  • S11. The AI training platform obtains the user's login token;
  • S12. The AI training platform receives the retrieval task issued by the user according to the token; this ensures that one token corresponds to one retrieval task;
  • S2. The AI training platform generates a retrieval thread flow according to the retrieval task, and controls the business logic of the retrieval process according to the retrieval thread flow; the specific steps are as follows:
  • S21. The AI training platform starts the retrieval thread;
  • S22. Each time a retrieval is completed, determine whether the retrieval count threshold is met; if yes, go to step S23; if not, go to step S24;
  • S23. Determine whether the retrieval task is complete; if yes, go to step S28; if not, go to step S25;
  • S24. Continue retrieval and return to step S22;
  • S25. Return the retrieved content, pause retrieval, and determine whether the total retrieval duration exceeds the retrieval duration threshold; if yes, go to step S26; if not, go to step S27;
  • S26. The retrieval has timed out; retrieval ends;
  • S27. Wake the retrieval and return to step S24;
  • S28. Return the retrieved content; retrieval ends;
  • The retrieval duration threshold bounds the retrieval time and the pause time so that retrieval cannot run uninterrupted; retrieval ends automatically on timeout.
  • S3. The AI training platform sequentially encodes the files in the database folder by folder to generate ordered queue folders, extracts the retrieval keyword from the retrieval task, and then performs keyword retrieval on each ordered queue folder by combining binary search with depth-first traversal; the specific steps are as follows:
  • S31. The AI training platform sequentially encodes the files in the database folder by folder using hashing, and generates ascending or descending queue folders;
  • S32. The AI training platform extracts the retrieval keyword from the retrieval task and locates one queue folder;
  • S33. Before traversal, the AI training platform determines the traversal depth according to the retrieval keyword and the located queue folder, and then determines the traversal path by binary search according to the traversal depth;
  • S34. The AI training platform performs traversal retrieval in the located queue folder along the traversal path and, when it finishes, determines whether all queue folders have been traversed; if yes, go to step S35; if not, locate the next queue folder and return to step S33;
  • S35. Return the retrieved content; retrieval ends.
  • Sequential encoding is the basis for the subsequent binary search. It lets the search proceed by jumps: the ordered queue is halved, and if the target is smaller than the middle element the search range narrows to the first half, otherwise to the second half. A balanced binary tree or B+ tree, although more efficient, is not used, because either requires building and maintaining a large number of indexes, a performance cost unsuited to an AI training platform. This patent therefore uses binary search to locate and retrieve files.
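  • In some embodiments, step S22 is as follows:
  • S221. Each time a retrieval is completed, determine whether the user of the same token has issued a next retrieval task; if yes, go to step S222; if not, go to step S224;
  • S222. Interrupt the current retrieval task thread and act according to the type of the next retrieval task;
  • S223. Determine whether the interruption of the current retrieval task has timed out; if it has, the interruption is judged to have timed out and retrieval ends; if it has not, wait for the next retrieval task to complete and return to step S223;
  • S224. Determine whether the retrieval count threshold is met; if yes, go to step S23; if not, go to step S24.
  • One token can correspond to only one retrieval task. If the user of the same token issues a next retrieval task, the current retrieval task thread must be interrupted and handled according to the type of the next retrieval task. The interruption is time-limited; if it times out, the interrupted retrieval task is stopped.
  • In some embodiments, in step S222, the type of the next retrieval task is determined:
  • if the next retrieval task type is overwrite retrieval, the next retrieval task is treated as a new retrieval task, and the process returns to step S21;
  • if the next retrieval task type is queued retrieval, the next retrieval task is placed in the waiting queue, and the process goes to step S24;
  • if the next retrieval task type is pause retrieval, go to step S25;
  • if the next retrieval task type is continue retrieval, go to step S24;
  • if the next retrieval task type is terminate retrieval, go to step S28.
  • The next retrieval task type determines whether the interrupted retrieval task is overwritten.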
  • In some embodiments, step S33 is as follows:
  • S331. The AI training platform obtains the retrieval keyword type;
  • S332. The AI training platform determines the traversal depth according to the retrieval keyword type and the content of the located queue folder;
  • S333. The AI training platform identifies all ordered file nodes at the traversal depth level in the located queue folder, and determines the head node and the tail node according to the ascending or descending order of the file nodes;
  • S334. The AI training platform computes the intermediate node from the head node and the tail node;
  • S335. According to whether the file nodes are in ascending or descending order, the AI training platform sets a new head node and tail node and computes a new intermediate node, until the traversal path has been set for all file nodes at the same traversal depth.
  • Binary search lets the retrieval jump and greatly narrows the retrieval range.
  • In step S331, the retrieval keyword type may be a folder keyword, for example when searching for folders whose names contain certain keywords;
  • in step S332, the AI training platform determines the traversal depth based on the folder keyword attribute and the content of the located queue folder;
  • in step S333, if the content attributes in the located queue folder are inconsistent, for example when the folder contains both folders and files, the AI training platform builds the traversal path from the folder nodes at the same traversal depth and ignores the file nodes at that depth.
  • In step S331, the retrieval keyword type may instead be a file keyword, for example when searching for files with certain suffixes;
  • in step S332, the AI training platform determines the traversal depth based on the file keyword attribute and the content of the located queue folder;
  • in step S333, if the content attributes in the located queue folder are inconsistent, for example when the folder contains both folders and files, the AI training platform builds the traversal path from the file nodes at the same traversal depth and, for the folder nodes at that depth, returns to step S332 to continue determining the traversal depth.
  • The present invention provides a massive file retrieval apparatus based on an AI training platform, including:
  • a retrieval task acquisition module 1, configured to have the AI training platform obtain retrieval tasks issued by the user; the retrieval task acquisition module 1 includes:
  • a token acquisition unit 1.1, configured to have the AI training platform obtain the user's login token;
  • a retrieval task acquisition unit 1.2, configured to have the AI training platform receive the retrieval task issued by the user according to the token;
  • a retrieval thread flow setting module 2, configured to have the AI training platform generate a retrieval thread flow according to the retrieval task and control the business logic of the retrieval process according to the retrieval thread flow; the retrieval thread flow setting module 2 includes:
  • a retrieval thread starting unit 2.1, configured to have the AI training platform start the retrieval thread;
  • a retrieval count threshold judgment unit 2.2, configured to determine, each time a retrieval is completed, whether the retrieval count threshold is met;
  • a retrieval task completion judgment unit 2.3, configured to determine whether the retrieval task is complete when the retrieval count threshold is not met;
  • a total retrieval duration judgment unit 2.5, configured to return the retrieved content, pause retrieval, and determine whether the total retrieval duration exceeds the retrieval duration threshold when the retrieval count threshold is not met but the retrieval task is not complete;
  • a retrieval timeout determination unit 2.6, configured to determine that the retrieval has timed out and end the retrieval when the total retrieval duration exceeds the retrieval duration threshold;
  • a retrieval wake-up unit 2.7, configured to wake the retrieval when the total retrieval duration does not exceed the retrieval duration threshold;
  • a first retrieved-content return unit 2.8, configured to return the retrieved content and end the retrieval when the retrieval count threshold is not met and the retrieval task is complete;
  • a traversal retrieval module 3, configured to have the AI training platform sequentially encode the files in the database folder by folder to generate ordered queue folders, extract the retrieval keyword from the retrieval task, and then perform keyword retrieval on each ordered queue folder by combining binary search with depth-first traversal; the traversal retrieval module 3 includes:
  • a sequence encoding unit 3.1, configured to have the AI training platform sequentially encode the files in the database folder by folder using hashing to generate ascending or descending queue folders;
  • a queue folder locating unit 3.2, configured to have the AI training platform extract the retrieval keyword from the retrieval task and locate one queue folder;
  • a traversal path determination unit 3.3, configured to have the AI training platform determine, before traversal, the traversal depth according to the retrieval keyword and the located queue folder, and then determine the traversal path by binary search according to the traversal depth;
  • a traversal retrieval unit 3.4, configured to have the AI training platform perform traversal retrieval in the located queue folder along the traversal path and, when it finishes, determine whether all queue folders have been traversed;
  • a queue folder relocating unit 3.5, configured to locate the next queue folder when some queue folder has not yet been traversed;
  • a second retrieved-content return unit 3.6, configured to return the retrieved content and end the retrieval when all queue folders have been traversed.
  • The present invention provides a device including a processor and a memory, wherein the memory stores a computer program and the processor calls and runs the computer program from the memory so that the device executes the method of Embodiment 1 or Embodiment 2.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Software Systems (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Library & Information Science (AREA)
  • Computing Systems (AREA)
  • Computational Linguistics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Medical Informatics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

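A massive file retrieval method, apparatus, and device based on an AI training platform. In the method, the AI training platform obtains a retrieval task issued by a user; the AI training platform generates a retrieval thread flow according to the retrieval task and controls the business logic of the retrieval process according to the retrieval thread flow; the AI training platform sequentially encodes the files in the database folder by folder to generate ordered queue folders, extracts the retrieval keyword from the retrieval task, and then performs keyword retrieval on each ordered queue folder by combining binary search with depth-first traversal. The invention uses the retrieval thread flow to control the business logic of the retrieval process, which prevents the AI training platform from occupying the server's CPU for a long time. Combining depth-first traversal with binary search improves retrieval efficiency, avoids the defect of depth-first traversal used alone, in which files late in the order take a long time to reach, and shortens the training time of the AI training platform.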

Description

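Massive file retrieval method, apparatus, and device based on an AI training platform
This application claims priority to Chinese patent application No. 202010988313.8, filed with the China Patent Office on September 18, 2020 and entitled "Massive file retrieval method, apparatus, and device based on an AI training platform", the entire contents of which are incorporated herein by reference.
Technical Field
The invention belongs to the technical field of file retrieval, and in particular relates to a massive file retrieval method, apparatus, and device based on an AI training platform.
Background
An AI training platform, that is, an artificial intelligence training platform, is a platform that manages and schedules resources such as CPUs and GPUs and handles model training and task management.
With the rapid development of AI-related industries, researchers at more and more research enterprises and universities demand ever more computing power, and AI training platforms effectively meet that demand. One of the most important and most basic functions of an AI training platform is file operations. Retrieving useful information from a massive number of files is a routine task for algorithm researchers, who search for specific files, or fuzzy-match files, within a dataset or a user directory. The performance of massive file retrieval directly affects the work efficiency of researchers using the platform, so improving retrieval performance is an urgent problem.
Existing file retrieval technology is mainly used for file management in various systems, where the number of files differs greatly in magnitude from that of an AI training platform. The existing approaches amount to the following: looping recursively over all files and fuzzy-matching each one; using the Linux find command for fuzzy-match search; or building distributed file storage for large-scale dedicated file management, whose performance depends on large amounts of high-specification hardware. These technologies are poorly suited to AI training platforms; none of them fits the scenario of retrieving files on an AI training platform.
Moreover, looping over all files for fuzzy matching is not only very time-consuming but also occupies a great deal of low-level system resources, including CPU. First, the user experience is very poor because the wait is very long; once the data volume reaches the terabyte level, the wait becomes unbearable for users. Second, it consumes so many system resources that other services on the platform are heavily affected, and in the worst case the system may crash. Some retrieval approaches that ignore system performance, such as fuzzy search with find, are much slower even than looping recursive traversal; they not only fail to release the resources they hold but can even block low-level processes.
The prior art does address massive file retrieval, but in a way that is highly specific to one business: it handles file operations only and generally involves no other business functions. It relies on large amounts of hardware and other high-specification resources to build a distributed file management platform for distributed retrieval. Such technology suits only a single business and supports the business of an AI training platform poorly or not at all, which wastes resources.
These are the shortcomings of the prior art. It is therefore very necessary to provide a massive file retrieval method, apparatus, and device based on an AI training platform that addresses them.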
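Summary of the Invention
The traversal methods, file management methods, and file-operation-only approaches of existing retrieval are unsuited to AI training platforms and waste resources. To address these defects of the prior art, the present invention provides a massive file retrieval method, apparatus, and device based on an AI training platform.
In a first aspect, the present invention provides a massive file retrieval method based on an AI training platform, comprising the following steps:
S1. The AI training platform obtains the retrieval task issued by the user;
S2. The AI training platform generates a retrieval thread flow according to the retrieval task, and controls the business logic of the retrieval process according to the retrieval thread flow;
S3. The AI training platform sequentially encodes the files in the database folder by folder to generate ordered queue folders, extracts the retrieval keyword from the retrieval task, and then performs keyword retrieval on each ordered queue folder by combining binary search with depth-first traversal.
Further, step S1 is as follows:
S11. The AI training platform obtains the user's login token;
S12. The AI training platform receives the retrieval task issued by the user according to the token. This guarantees that one token corresponds to one retrieval task.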
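The one-task-per-token rule in S11 and S12 can be illustrated with a small registry. This is a minimal sketch, not the patent's implementation: `RetrievalTask`, `TaskRegistry`, and `submit` are hypothetical names, and returning the displaced task to the caller is an assumption made so that the caller can decide how to handle it.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class RetrievalTask:
    token: str    # login token of the issuing user (S11)
    keyword: str  # retrieval keyword carried by the task


@dataclass
class TaskRegistry:
    """Hypothetical registry keeping at most one active task per token."""
    active: Dict[str, RetrievalTask] = field(default_factory=dict)

    def submit(self, token: str, keyword: str) -> Optional[RetrievalTask]:
        """Register a task (S12) and return the task it displaced, if any."""
        previous = self.active.get(token)
        self.active[token] = RetrievalTask(token, keyword)
        return previous
```

A second `submit` with the same token returns the earlier task, which is the point at which the interruption handling described for step S22 would take over.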
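Further, step S2 is as follows:
S21. The AI training platform starts the retrieval thread;
S22. Each time a retrieval is completed, determine whether the retrieval count threshold is met;
If yes, go to step S23;
If not, go to step S24;
S23. Determine whether the retrieval task is complete;
If yes, go to step S28;
If not, go to step S25;
S24. Continue retrieval and return to step S22;
S25. Return the retrieved content, pause retrieval, and determine whether the total retrieval duration exceeds the retrieval duration threshold;
If yes, go to step S26;
If not, go to step S27;
S26. The retrieval has timed out; retrieval ends;
S27. Wake the retrieval and return to step S24;
S28. Return the retrieved content; retrieval ends.
A retrieval count is set. When the count is reached, a pause must be enforced, and a task that is not yet complete continues retrieval afterwards; if retrieval finishes before the count is reached, an end flag is returned. A retrieval duration threshold is also set to bound the retrieval time and the pause time so that retrieval cannot run uninterrupted; on timeout, retrieval ends automatically. In practice, for the sake of speed, waits are never long anyway, and an overly long pause would also leave the thread unreleased.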
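The control flow of S21 to S28 can be sketched as a single loop. This is one possible reading under stated assumptions: `run_retrieval`, `on_page`, and `clock` are hypothetical names, the pause in S25 is modeled by the `on_page` callback (which may block until the caller wants more), and the end of the input iterator stands in for "retrieval task complete".

```python
import time
from typing import Callable, Iterable, Iterator, List


def run_retrieval(
    hits: Iterable[str],
    count_threshold: int,
    duration_threshold: float,
    on_page: Callable[[List[str]], None],
    clock: Callable[[], float] = time.monotonic,
) -> str:
    """Drive one retrieval task; returns "finished" or "timeout"."""
    it: Iterator[str] = iter(hits)
    start = clock()                      # S21: the retrieval thread starts
    page: List[str] = []
    while True:
        try:
            page.append(next(it))        # S24: perform one more retrieval
        except StopIteration:
            on_page(page)                # S28: return content, retrieval ends
            return "finished"
        if len(page) < count_threshold:  # S22: count threshold not met yet
            continue
        on_page(page)                    # S25: return content and pause
        page = []
        if clock() - start > duration_threshold:
            return "timeout"             # S26: retrieval timed out
        # S27: woken up, loop back to S24
```

If the last hit lands exactly on a page boundary, the final call to `on_page` receives an empty list, which plays the role of the end flag mentioned in the text.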
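Further, step S22 is as follows:
S221. Each time a retrieval is completed, determine whether the user of the same token has issued a next retrieval task;
If yes, go to step S222;
If not, go to step S224;
S222. Interrupt the current retrieval task thread and act according to the type of the next retrieval task;
S223. Determine whether the interruption of the current retrieval task has timed out;
If it has timed out, the interruption is judged to have timed out and retrieval ends;
If it has not timed out, wait for the next retrieval task to complete and return to step S223;
S224. Determine whether the retrieval count threshold is met;
If yes, go to step S23;
If not, go to step S24.
One token can correspond to only one retrieval task. If the user of the same token issues a next retrieval task, the current retrieval task thread must be interrupted and handled according to the type of the next retrieval task. The interruption is time-limited; if it times out, the interrupted retrieval task is stopped.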
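The time-limited interruption in S223 maps naturally onto a timed wait. The sketch below assumes the next task signals its completion through a `threading.Event`; the function name `wait_for_next_task` and that signalling mechanism are illustrative assumptions, since the text does not say how completion is reported.

```python
import threading


def wait_for_next_task(next_task_done: threading.Event,
                       interrupt_timeout: float) -> bool:
    """S223: wait while interrupted, for at most `interrupt_timeout` seconds.

    Returns True if the next task completed within the limit, and False if
    the interruption timed out, in which case the interrupted task stops.
    """
    return next_task_done.wait(timeout=interrupt_timeout)
```

`Event.wait` blocks without spinning, so the interrupted thread does not consume CPU while it waits, which is in keeping with the stated goal of not occupying the server's CPU for long periods.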
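Further, in step S222, the type of the next retrieval task is determined:
If the next retrieval task type is overwrite retrieval, the next retrieval task is treated as a new retrieval task, and the process returns to step S21;
If the next retrieval task type is queued retrieval, the next retrieval task is placed in the waiting queue, and the process goes to step S24;
If the next retrieval task type is pause retrieval, go to step S25;
If the next retrieval task type is continue retrieval, go to step S24;
If the next retrieval task type is terminate retrieval, go to step S28. The next retrieval task type determines whether the interrupted retrieval task is overwritten.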
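The five task types and the step each one leads to can be written as a lookup table. A minimal sketch: the enum `NextTaskType`, the table `NEXT_STEP`, and the function `dispatch` are hypothetical names, and steps are represented simply by their labels from the text.

```python
from collections import deque
from enum import Enum
from typing import Deque


class NextTaskType(Enum):
    OVERWRITE = "overwrite"
    QUEUE = "queue"
    PAUSE = "pause"
    CONTINUE = "continue"
    TERMINATE = "terminate"


# Step to jump to for each type, as listed in the text.
NEXT_STEP = {
    NextTaskType.OVERWRITE: "S21",  # treat the next task as a new task
    NextTaskType.QUEUE: "S24",      # enqueue it, keep the current task going
    NextTaskType.PAUSE: "S25",
    NextTaskType.CONTINUE: "S24",
    NextTaskType.TERMINATE: "S28",
}


def dispatch(task_type: NextTaskType, waiting: Deque[str], next_task: str) -> str:
    """Return the label of the step to enter; enqueue the task if it is queued."""
    if task_type is NextTaskType.QUEUE:
        waiting.append(next_task)
    return NEXT_STEP[task_type]
```

Only the queued type has a side effect here; the other types change the control flow of the current task without storing the next one.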
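Further, step S3 is as follows:
S31. The AI training platform sequentially encodes the files in the database folder by folder using hashing, and generates ascending or descending queue folders;
S32. The AI training platform extracts the retrieval keyword from the retrieval task and locates one queue folder;
S33. Before traversal, the AI training platform determines the traversal depth according to the retrieval keyword and the located queue folder, and then determines the traversal path by binary search according to the traversal depth;
S34. The AI training platform performs traversal retrieval in the located queue folder along the traversal path and, when it finishes, determines whether all queue folders have been traversed;
If yes, go to step S35;
If not, locate the next queue folder and return to step S33;
S35. Return the retrieved content; retrieval ends.
Sequential encoding is the basis for the subsequent binary search. It lets the search proceed by jumps: the ordered queue is halved, and if the target is smaller than the middle element the search range narrows to the first half, otherwise to the second half. A balanced binary tree or B+ tree, although more efficient, is not used, because either requires building and maintaining a large number of indexes, a performance cost unsuited to an AI training platform; this patent therefore uses binary search to determine the traversal path. Encoding is done per folder, over the files within one folder, rather than over all files at once, which keeps the queue from growing too large to maintain.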
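Per-folder hash encoding followed by a halving search can be sketched as below. The text does not name a hash function or say how fuzzy keywords interact with the codes, so this sketch makes two assumptions: CRC32 from `zlib` stands in for the hash, and only exact-name lookup is shown. The function names are hypothetical.

```python
import bisect
import zlib
from typing import Dict, List, Tuple


def encode_folder(names: List[str]) -> List[Tuple[int, str]]:
    """S31: give each entry of ONE folder a hash code and sort ascending."""
    return sorted((zlib.crc32(n.encode("utf-8")), n) for n in names)


def find_exact(queue: List[Tuple[int, str]], name: str) -> bool:
    """Halving search over the ordered queue for an exact name."""
    key = (zlib.crc32(name.encode("utf-8")), name)
    i = bisect.bisect_left(queue, key)
    return i < len(queue) and queue[i] == key


def search_all(folders: Dict[str, List[str]], name: str) -> List[str]:
    """S32 to S35: visit each queue folder in turn, collect folders holding `name`."""
    return [folder for folder, names in folders.items()
            if find_exact(encode_folder(names), name)]
```

Keeping one short sorted list per folder, rather than one global list, mirrors the stated reason for encoding folder by folder: each queue stays small enough to build and maintain cheaply.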
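Further, step S33 is as follows:
S331. The AI training platform obtains the retrieval keyword type;
S332. The AI training platform determines the traversal depth according to the retrieval keyword type and the content of the located queue folder;
S333. The AI training platform identifies all ordered file nodes at the traversal depth level in the located queue folder, and determines the head node and the tail node according to the ascending or descending order of the file nodes;
S334. The AI training platform computes the intermediate node from the head node and the tail node;
S335. According to whether the file nodes are in ascending or descending order, the AI training platform sets a new head node and tail node and computes a new intermediate node, until the traversal path has been set for all file nodes at the same traversal depth. Binary search lets the retrieval jump and greatly narrows the retrieval range.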
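The text does not spell out how the successive head, tail, and intermediate nodes turn into a traversal path. One possible reading, sketched here as an assumption, is that the ordered nodes of one depth level are visited by repeated halving: the intermediate node first, then the intermediate nodes of each half, and so on. `bisection_path` is a hypothetical name.

```python
from collections import deque
from typing import Deque, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def bisection_path(nodes: Sequence[T]) -> List[T]:
    """Order the sorted nodes of one depth level by repeated halving."""
    path: List[T] = []
    ranges: Deque[Tuple[int, int]] = deque([(0, len(nodes) - 1)])  # S333: head, tail
    while ranges:
        head, tail = ranges.popleft()
        if head > tail:
            continue                     # empty range, nothing to visit
        mid = (head + tail) // 2         # S334: intermediate node
        path.append(nodes[mid])
        ranges.append((head, mid - 1))   # S335: new head/tail for the first half
        ranges.append((mid + 1, tail))   # S335: new head/tail for the second half
    return path
```

For seven nodes numbered 1 to 7 this yields the order 4, 2, 6, 1, 3, 5, 7, so nodes late in the sequence are reached after a few jumps rather than after a full linear scan.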
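Further, if in step S331 the retrieval keyword type is a folder keyword:
In step S332, the AI training platform determines the traversal depth based on the folder keyword attribute and the content of the located queue folder;
In step S333, if the content attributes in the located queue folder are inconsistent, the AI training platform builds the traversal path from the folder nodes at the same traversal depth and ignores the file nodes at that depth;
If in step S331 the retrieval keyword type is a file keyword:
In step S332, the AI training platform determines the traversal depth based on the file keyword attribute and the content of the located queue folder;
In step S333, if the content attributes in the located queue folder are inconsistent, the AI training platform builds the traversal path from the file nodes at the same traversal depth and, for the folder nodes at that depth, returns to step S332 to continue determining the traversal depth.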
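The different handling of mixed content for the two keyword types can be illustrated on a toy tree. This sketch assumes a nested dict in which a dict value marks a folder and `None` marks a file; `candidates` is a hypothetical name. For a folder keyword it stays at the current depth, as the text only says that file nodes are ignored; for a file keyword it descends into folders, which corresponds to returning to S332.

```python
from typing import Any, Dict, List

Tree = Dict[str, Any]  # value None marks a file, a nested dict marks a folder


def candidates(tree: Tree, keyword_type: str, prefix: str = "") -> List[str]:
    """Collect candidate nodes for matching, according to the keyword type."""
    out: List[str] = []
    for name, child in tree.items():
        path = prefix + name
        is_folder = child is not None
        if keyword_type == "folder":
            if is_folder:
                out.append(path)  # folder nodes form the path, files are ignored
        elif is_folder:
            # File keyword: go one level deeper (back to S332) for folder nodes.
            out.extend(candidates(child, "file", path + "/"))
        else:
            out.append(path)      # file nodes form the path
    return out
```

Matching the retrieval keyword against the returned candidates is left out; the sketch only shows which nodes each keyword type would consider.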
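In a second aspect, the present invention provides a massive file retrieval apparatus based on an AI training platform, including:
a retrieval task acquisition module, configured to have the AI training platform obtain retrieval tasks issued by users;
a retrieval thread flow setting module, configured to have the AI training platform generate the retrieval thread flow according to the retrieval task and control the business logic of the retrieval process according to the retrieval thread flow;
a traversal retrieval module, configured to have the AI training platform sequentially encode the files in the database folder by folder to generate ordered queue folders, extract the retrieval keyword from the retrieval task, and then perform keyword retrieval on each ordered queue folder by combining binary search with depth-first traversal.
In a third aspect, the present invention also provides a device including a processor and a memory, wherein the memory stores a computer program and the processor calls and runs the computer program from the memory so that the device executes the method of the first aspect.
The beneficial effects of the present invention are as follows.
The massive file retrieval apparatus based on the AI training platform provided by the present invention uses the retrieval thread flow to control the business logic of the retrieval process, which prevents the AI training platform from occupying the server's CPU for a long time, reduces resource usage, and keeps the platform's business running stably. Combining depth-first traversal with binary search improves retrieval efficiency and avoids the defect of depth-first traversal used alone, in which files late in the order take a long time to reach. This shortens the training time of the AI training platform, improves the efficiency of model training, improves the performance of massive file retrieval on the platform, and strengthens the platform's competitiveness.
In addition, the present invention has a reliable design principle and a simple structure, and has very broad application prospects.
It can be seen that, compared with the prior art, the present invention has prominent substantive features and represents notable progress, and the beneficial effects of its implementation are also evident.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present invention or in the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. A person of ordinary skill in the art can derive other drawings from these drawings without creative effort.
Fig. 1 is the first schematic flowchart of the method of the present invention;
Fig. 2 is the second schematic flowchart of the method of the present invention;
Fig. 3 is a schematic diagram of the system of the present invention;
In the figures: 1 - retrieval task acquisition module; 1.1 - token acquisition unit; 1.2 - retrieval task acquisition unit; 2 - retrieval thread flow setting module; 2.1 - retrieval thread starting unit; 2.2 - retrieval count threshold judgment unit; 2.3 - retrieval task completion judgment unit; 2.4 - continue retrieval unit; 2.5 - total retrieval duration judgment unit; 2.6 - retrieval timeout determination unit; 2.7 - retrieval wake-up unit; 2.8 - first retrieved-content return unit; 3 - traversal retrieval module; 3.1 - sequence encoding unit; 3.2 - queue folder locating unit; 3.3 - traversal path determination unit; 3.4 - traversal retrieval unit; 3.5 - queue folder relocating unit; 3.6 - second retrieved-content return unit.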
具体实施方式
为了使本技术领域的人员更好地理解本发明中的技术方案,下面将结合本发明实施例中的附图,对本发明实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本发明一部分实施例,而不是全部的实施例。基于本发明中的实施例,本领域普通技术人员在没有做出创造性劳动前提下所获得的所有其他实施例,都应当属于本发明保护的范围。
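Embodiment 1:
As shown in FIG. 1, the present invention provides a mass file retrieval method based on an AI training platform, comprising the following steps:
S1. The AI training platform obtains a retrieval task issued by a user;
S2. The AI training platform generates a retrieval thread flow according to the retrieval task and controls the business logic of the retrieval process according to the retrieval thread flow;
S3. The AI training platform sequentially encodes the files in the database per folder to generate ordered queue folders, extracts the retrieval keyword from the retrieval task, and then performs keyword retrieval on each ordered queue folder by combining binary search with depth-first traversal.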
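Embodiment 2:
As shown in FIG. 2, the present invention provides a mass file retrieval method based on an AI training platform, comprising the following steps:
S1. The AI training platform obtains a retrieval task issued by a user; specifically:
S11. The AI training platform obtains the user's login token;
S12. The AI training platform receives the retrieval task issued by the user according to the token, which ensures that one token corresponds to one retrieval task;
S2. The AI training platform generates a retrieval thread flow according to the retrieval task and controls the business logic of the retrieval process according to the retrieval thread flow; specifically:
S21. The AI training platform starts a retrieval thread;
S22. Each time a retrieval is completed, determine whether the retrieval count threshold is met;
if yes, go to step S23;
if no, go to step S24;
S23. Determine whether the retrieval task is complete;
if yes, go to step S28;
if no, go to step S25;
S24. Continue retrieving and return to step S22;
S25. Return the retrieved content, pause the retrieval, and determine whether the total retrieval duration exceeds the retrieval duration threshold;
if yes, go to step S26;
if no, go to step S27;
S26. The retrieval has timed out; retrieval ends;
S27. Wake the retrieval and return to step S24;
S28. Return the retrieved content; retrieval ends;
A retrieval count is set, for example 50 for a requested retrieval page. Once 50 items have been retrieved, the retrieval must be paused once, and a retrieval task that has not finished then continues; if the retrieval ends before 50 items are reached, an end flag is returned. A retrieval duration threshold is set to control the retrieval time and the pause time and to prevent a retrieval that never stops: on timeout, the retrieval ends automatically. In practice, for the sake of speed, a retrieval does not wait long anyway, and an overly long pause would also leave the thread unreleased. Files are encoded sequentially per folder, that is, within the same folder rather than across all files, so that the queue does not grow too large to maintain;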
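The control flow of steps S21 to S28 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: `run_retrieval`, `emit`, and `clock` are hypothetical names, the retrieval itself is modelled as a plain iterable of hits, and the pause between batches is reduced to a single duration check.

```python
import time
from typing import Callable, Iterable, List

_DONE = object()  # sentinel marking that the retrieval task has no more hits


def run_retrieval(hits: Iterable[str],
                  count_threshold: int,
                  duration_threshold: float,
                  emit: Callable[[List[str], bool], None],
                  clock: Callable[[], float] = time.monotonic) -> str:
    """Drive one retrieval task through steps S21-S28 (hypothetical helper)."""
    start = clock()                       # S21: the retrieval thread starts
    iterator = iter(hits)
    batch: List[str] = []
    while True:
        hit = next(iterator, _DONE)       # perform one retrieval
        if hit is _DONE:                  # task complete (possibly below the threshold)
            emit(batch, True)             # S28: return content with an end flag
            return "finished"
        batch.append(hit)
        if len(batch) < count_threshold:  # S22 "no"
            continue                      # S24: keep retrieving
        emit(batch, False)                # S25: return content and pause
        batch = []
        if clock() - start > duration_threshold:
            return "timeout"              # S26: total duration exceeded
        # S27: wake up; the loop continues with S24
```

Passing the clock as a parameter is only a convenience for testing the timeout branch; a real platform would more likely block on a condition variable while paused.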
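S3. The AI training platform sequentially encodes the files in the database per folder to generate ordered queue folders, extracts the retrieval keyword from the retrieval task, and then performs keyword retrieval on each ordered queue folder by combining binary search with depth-first traversal; specifically:
S31. The AI training platform sequentially encodes the files in the database per folder by hashing, generating ascending or descending queue folders;
S32. The AI training platform extracts the retrieval keyword from the retrieval task and locates one queue folder;
S33. Before traversal retrieval, the AI training platform determines the traversal depth according to the retrieval keyword and the located queue folder, and then determines the traversal path by binary search according to the traversal depth;
S34. The AI training platform performs traversal retrieval along the traversal path in the located queue folder and, once the traversal retrieval is complete, determines whether all queue folders have been traversed;
if yes, go to step S35;
if no, locate the next queue folder and return to step S33;
S35. Return the retrieved content; retrieval ends.
Sequential encoding is the basis for the subsequent binary search. It lets the search proceed in jumps, that is, by halving an ordered queue: if the target is smaller than the middle object of the queue, the search range narrows to the first half; otherwise it moves to the second half. More efficient structures such as a balanced binary tree or a B+ tree are not used for retrieval, because they require building and maintaining a large number of indexes, and that performance cost is unsuitable for an AI training platform; this patent therefore uses binary search to locate and retrieve files.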
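As an illustration of how per-folder sequential encoding enables halving search, the sketch below hashes each entry name to an integer code, sorts the entries of one folder into an ascending queue, and looks up an exact name with the standard `bisect` module. The hash function (SHA-1 truncated to 8 bytes), the exact-name lookup, and the helper names `encode`, `build_queue`, and `find` are assumptions made for the example; the patent does not specify them.

```python
import bisect
import hashlib
from typing import List, Optional, Tuple


def encode(name: str) -> int:
    """Map an entry name to an integer code (assumed hash; not from the patent)."""
    digest = hashlib.sha1(name.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def build_queue(names: List[str]) -> List[Tuple[int, str]]:
    """Encode the entries of ONE folder and sort them into an ascending queue."""
    return sorted((encode(n), n) for n in names)


def find(queue: List[Tuple[int, str]], name: str) -> Optional[int]:
    """Halving search for `name` in an ascending queue; returns its index or None."""
    target = (encode(name), name)
    i = bisect.bisect_left(queue, target)
    if i < len(queue) and queue[i] == target:
        return i
    return None


# One queue per folder keeps each queue small, as the description requires.
queues = {
    "/data/train": build_queue(["a.jpg", "b.jpg", "labels.csv"]),
    "/data/val": build_queue(["c.jpg", "labels.csv"]),
}
```

Each lookup costs a number of comparisons logarithmic in the folder size, and no tree index has to be maintained; only the sorted list per folder is kept.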
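In some embodiments, step S22 specifically comprises the following steps:
S221. Each time a retrieval is completed, determine whether the user with the same token has issued a next retrieval task;
if yes, go to step S222;
if no, go to step S224;
S222. Interrupt the current retrieval task thread and act according to the type of the next retrieval task;
S223. Determine whether the interruption of the current retrieval thread task has timed out;
if it has timed out, determine that the retrieval interruption has timed out; retrieval ends;
if it has not timed out, wait for the next retrieval task to complete and return to step S223;
S224. Determine whether the retrieval count threshold is met;
if yes, go to step S23;
if no, go to step S24.
One token can correspond to only one retrieval task. If the user with the same token issues a next retrieval task, the current retrieval task thread must be interrupted and the platform acts according to the type of the next retrieval task. The interruption time of the current retrieval task is limited; on timeout, the interrupted retrieval task stops.
In some embodiments, in step S222, the type of the next retrieval task is determined;
if the next retrieval task is of the overwrite type, the next retrieval task becomes the new retrieval task, and the flow returns to step S21;
if the next retrieval task is of the queued type, the next retrieval task is placed in the waiting queue, and the flow goes to step S24;
if the next retrieval task is of the pause type, the flow goes to step S25;
if the next retrieval task is of the continue type, the flow goes to step S24;
if the next retrieval task is of the terminate type, the flow goes to step S28.
The type of the next retrieval task thus determines whether it overwrites the interrupted retrieval task.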
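The dispatch on the next retrieval task's type in step S222 can be sketched as a small state transition. The names `TaskType`, `Session`, and `dispatch` are hypothetical, and the function only returns the label of the step to enter next; how each step is executed is outside this sketch.

```python
from collections import deque
from enum import Enum
from typing import Deque


class TaskType(Enum):
    OVERWRITE = "overwrite"   # replace the interrupted task
    QUEUED = "queued"         # wait behind the current task
    PAUSE = "pause"
    CONTINUE = "continue"
    TERMINATE = "terminate"


class Session:
    """State kept per login token: one current task plus a waiting queue."""

    def __init__(self, current: str) -> None:
        self.current = current
        self.waiting: Deque[str] = deque()


def dispatch(session: Session, next_task: str, kind: TaskType) -> str:
    """Return the step to enter after S222, updating the session as needed."""
    if kind is TaskType.OVERWRITE:
        session.current = next_task       # the next task becomes the new task
        return "S21"
    if kind is TaskType.QUEUED:
        session.waiting.append(next_task)  # current task resumes; next one waits
        return "S24"
    return {
        TaskType.PAUSE: "S25",
        TaskType.CONTINUE: "S24",
        TaskType.TERMINATE: "S28",
    }[kind]
```

Keeping one `Session` per token is one way to enforce the rule that a token corresponds to a single active retrieval task.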
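In some embodiments, step S33 specifically comprises the following steps:
S331. The AI training platform obtains the type of the retrieval keyword;
S332. The AI training platform determines the traversal depth according to the retrieval keyword type and the contents of the located queue folder;
S333. The AI training platform identifies all sequentially ordered file nodes at the traversal depth level in the located queue folder and, according to whether the file nodes are in ascending or descending order, determines the head node and the tail node;
S334. The AI training platform computes the middle node from the head node and the tail node;
S335. Depending on whether the file nodes are in ascending or descending order, the AI training platform determines a new head node and a new tail node and computes a new middle node, until the traversal path has been set for the file nodes at the same traversal depth. Binary search lets the retrieval jump, which greatly narrows the search range.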
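Steps S333 to S335 amount to a binary search that tracks a head, a tail, and a middle node and that works for either sort direction. The sketch below is one possible reading: it records the sequence of middle nodes visited as the "path", and the function name `locate` and the use of integer codes are assumptions for illustration.

```python
from typing import List, Optional, Tuple


def locate(codes: List[int], target: int,
           ascending: bool = True) -> Tuple[Optional[int], List[int]]:
    """Binary search over sorted node codes.

    Returns (index of target or None, list of middle indices probed).
    """
    head, tail = 0, len(codes) - 1         # S333: head and tail nodes
    path: List[int] = []
    while head <= tail:
        mid = (head + tail) // 2           # S334: middle node
        path.append(mid)
        if codes[mid] == target:
            return mid, path
        # In an ascending queue smaller targets lie before mid;
        # in a descending queue larger targets lie before mid.
        before_mid = target < codes[mid] if ascending else target > codes[mid]
        if before_mid:
            tail = mid - 1                 # S335: new tail node
        else:
            head = mid + 1                 # S335: new head node
    return None, path
```

For example, with the ascending codes `[3, 8, 15, 21, 34, 55, 89]` and target `55`, the search probes index 3 and then index 5, so two comparisons replace a scan of six entries.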
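In some embodiments, if in step S331 the retrieval keyword type is a folder keyword, for example when retrieving folders whose names contain certain keywords:
in step S332, the AI training platform determines the traversal depth from the attributes of the folder keyword and the contents of the located queue folder;
in step S333, if the contents of the located queue folder have inconsistent attributes, for example the located queue folder contains both folders and files, the AI training platform builds the traversal path from the folder nodes at the same traversal depth and ignores the file nodes at that depth;
if in step S331 the retrieval keyword type is a file keyword, for example when retrieving files with certain suffixes:
in step S332, the AI training platform determines the traversal depth from the attributes of the file keyword and the contents of the located queue folder;
in step S333, if the contents of the located queue folder have inconsistent attributes, for example the located queue folder contains both folders and files, the AI training platform builds the traversal path from the file nodes at the same traversal depth, and for the folder nodes at that depth returns to step S332 to continue determining the traversal depth.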
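The different handling of folder keywords and file keywords can be illustrated with a depth-first walk over a nested dictionary, where a folder maps names to children and a file maps to `None`. This representation, the substring match, and the name `search` are assumptions for the sketch; in particular, it assumes that a folder-keyword search still descends into subfolders, which the description implies through depth-first traversal but does not spell out.

```python
from typing import Dict, List, Optional

# A folder is a dict of children; a file is represented by None (assumed model).
Tree = Dict[str, Optional["Tree"]]


def search(tree: Tree, keyword: str, want_folders: bool,
           prefix: str = "") -> List[str]:
    """Depth-first keyword search that filters nodes by keyword type."""
    hits: List[str] = []
    for name in sorted(tree):              # ordered nodes at this depth
        child = tree[name]
        path = f"{prefix}/{name}"
        if child is not None:              # folder node
            if want_folders and keyword in name:
                hits.append(path)
            # Folder nodes are always descended into (deeper traversal depth).
            hits.extend(search(child, keyword, want_folders, path))
        elif not want_folders and keyword in name:
            hits.append(path)              # file node matched by a file keyword
        # File nodes are ignored entirely when a folder keyword is used.
    return hits


tree: Tree = {
    "data": {"train": {"a.jpg": None, "b.txt": None}, "train.log": None},
    "readme.txt": None,
}
```

With this tree, a folder-keyword search for `train` skips the file `train.log`, while a file-keyword search for `.txt` matches files at any depth.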
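Embodiment 3:
As shown in FIG. 3, the present invention provides a mass file retrieval apparatus based on an AI training platform, comprising:
a retrieval task acquisition module 1, configured to have the AI training platform obtain a retrieval task issued by a user; the retrieval task acquisition module 1 comprises:
a token acquisition unit 1.1, configured to have the AI training platform obtain the user's login token;
a retrieval task acquisition unit 1.2, configured to have the AI training platform receive the retrieval task issued by the user according to the token;
a retrieval thread flow setting module 2, configured to have the AI training platform generate a retrieval thread flow according to the retrieval task and control the business logic of the retrieval process according to the retrieval thread flow; the retrieval thread flow setting module 2 comprises:
a retrieval thread starting unit 2.1, configured to have the AI training platform start a retrieval thread;
a retrieval count threshold judging unit 2.2, configured to determine, each time a retrieval is completed, whether the retrieval count threshold is met;
a retrieval task completion judging unit 2.3, configured to determine, when the retrieval count threshold is met, whether the retrieval task is complete;
a continue-retrieval unit 2.4, configured to continue retrieving when the retrieval count threshold is not met;
a total retrieval duration judging unit 2.5, configured to return the retrieved content, pause the retrieval, and determine whether the total retrieval duration exceeds the retrieval duration threshold when the retrieval count threshold is met but the retrieval task is not complete;
a retrieval timeout determination unit 2.6, configured to determine that the retrieval has timed out and end the retrieval when the total retrieval duration exceeds the retrieval duration threshold;
a retrieval wake-up unit 2.7, configured to wake the retrieval when the total retrieval duration does not exceed the retrieval duration threshold;
a first retrieved-content return unit 2.8, configured to return the retrieved content and end the retrieval when the retrieval count threshold is met and the retrieval task is complete;
a traversal retrieval module 3, configured to have the AI training platform sequentially encode the files in the database per folder to generate ordered queue folders, extract the retrieval keyword from the retrieval task, and then perform keyword retrieval on each ordered queue folder by combining binary search with depth-first traversal; the traversal retrieval module 3 comprises:
a sequence encoding unit 3.1, configured to have the AI training platform sequentially encode the files in the database per folder by hashing, generating ascending or descending queue folders;
a queue folder locating unit 3.2, configured to have the AI training platform extract the retrieval keyword from the retrieval task and locate one queue folder;
a traversal path determination unit 3.3, configured to have the AI training platform determine, before traversal retrieval, the traversal depth according to the retrieval keyword and the located queue folder, and then determine the traversal path by binary search according to the traversal depth;
a traversal retrieval unit 3.4, configured to have the AI training platform perform traversal retrieval along the traversal path in the located queue folder and, once the traversal retrieval is complete, determine whether all queue folders have been traversed;
a queue folder relocating unit 3.5, configured to locate the next queue folder when a queue folder has not yet been traversed;
a second retrieved-content return unit 3.6, configured to return the retrieved content and end the retrieval when all queue folders have been traversed.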
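Embodiment 4:
The present invention provides a device comprising a processor and a memory, wherein the memory is configured to store a computer program, and the processor is configured to call the computer program from the memory and run it, so that the device performs the method of Embodiment 1 or Embodiment 2.
Although the present invention has been described in detail with reference to the drawings and in connection with preferred embodiments, it is not limited thereto. Without departing from the spirit and essence of the present invention, a person of ordinary skill in the art may make various equivalent modifications or substitutions to the embodiments, and all such modifications or substitutions fall within the scope of the present invention; likewise, any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed herein shall be covered by the scope of protection of the present invention. The scope of protection of the present invention shall therefore be determined by the scope of the claims.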

Claims (10)

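  1. A mass file retrieval method based on an AI training platform, characterized by comprising the following steps:
    S1. The AI training platform obtains a retrieval task issued by a user;
    S2. The AI training platform generates a retrieval thread flow according to the retrieval task and controls the business logic of the retrieval process according to the retrieval thread flow;
    S3. The AI training platform sequentially encodes the files in the database per folder to generate ordered queue folders, extracts the retrieval keyword from the retrieval task, and then performs keyword retrieval on each ordered queue folder by combining binary search with depth-first traversal.
  2. The mass file retrieval method based on an AI training platform according to claim 1, characterized in that step S1 specifically comprises the following steps:
    S11. The AI training platform obtains the user's login token;
    S12. The AI training platform receives the retrieval task issued by the user according to the token.
  3. The mass file retrieval method based on an AI training platform according to claim 2, characterized in that step S2 specifically comprises the following steps:
    S21. The AI training platform starts a retrieval thread;
    S22. Each time a retrieval is completed, determine whether the retrieval count threshold is met;
    if yes, go to step S23;
    if no, go to step S24;
    S23. Determine whether the retrieval task is complete;
    if yes, go to step S28;
    if no, go to step S25;
    S24. Continue retrieving and return to step S22;
    S25. Return the retrieved content, pause the retrieval, and determine whether the total retrieval duration exceeds the retrieval duration threshold;
    if yes, go to step S26;
    if no, go to step S27;
    S26. The retrieval has timed out; retrieval ends;
    S27. Wake the retrieval and return to step S24;
    S28. Return the retrieved content; retrieval ends.
  4. The mass file retrieval method based on an AI training platform according to claim 3, characterized in that step S22 specifically comprises the following steps:
    S221. Each time a retrieval is completed, determine whether the user with the same token has issued a next retrieval task;
    if yes, go to step S222;
    if no, go to step S224;
    S222. Interrupt the current retrieval task thread and act according to the type of the next retrieval task;
    S223. Determine whether the interruption of the current retrieval thread task has timed out;
    if it has timed out, determine that the retrieval interruption has timed out; retrieval ends;
    if it has not timed out, wait for the next retrieval task to complete and return to step S223;
    S224. Determine whether the retrieval count threshold is met;
    if yes, go to step S23;
    if no, go to step S24.
  5. The mass file retrieval method based on an AI training platform according to claim 4, characterized in that, in step S222, the type of the next retrieval task is determined;
    if the next retrieval task is of the overwrite type, the next retrieval task becomes the new retrieval task, and the flow returns to step S21;
    if the next retrieval task is of the queued type, the next retrieval task is placed in the waiting queue, and the flow goes to step S24;
    if the next retrieval task is of the pause type, the flow goes to step S25;
    if the next retrieval task is of the continue type, the flow goes to step S24;
    if the next retrieval task is of the terminate type, the flow goes to step S28.
  6. The mass file retrieval method based on an AI training platform according to claim 1, characterized in that step S3 specifically comprises the following steps:
    S31. The AI training platform sequentially encodes the files in the database per folder by hashing, generating ascending or descending queue folders;
    S32. The AI training platform extracts the retrieval keyword from the retrieval task and locates one queue folder;
    S33. Before traversal retrieval, the AI training platform determines the traversal depth according to the retrieval keyword and the located queue folder, and then determines the traversal path by binary search according to the traversal depth;
    S34. The AI training platform performs traversal retrieval along the traversal path in the located queue folder and, once the traversal retrieval is complete, determines whether all queue folders have been traversed;
    if yes, go to step S35;
    if no, locate the next queue folder and return to step S33;
    S35. Return the retrieved content; retrieval ends.
  7. The mass file retrieval method based on an AI training platform according to claim 6, characterized in that step S33 specifically comprises the following steps:
    S331. The AI training platform obtains the type of the retrieval keyword;
    S332. The AI training platform determines the traversal depth according to the retrieval keyword type and the contents of the located queue folder;
    S333. The AI training platform identifies all sequentially ordered file nodes at the traversal depth level in the located queue folder and, according to whether the file nodes are in ascending or descending order, determines the head node and the tail node;
    S334. The AI training platform computes the middle node from the head node and the tail node;
    S335. Depending on whether the file nodes are in ascending or descending order, the AI training platform determines a new head node and a new tail node and computes a new middle node, until the traversal path has been set for the file nodes at the same traversal depth.
  8. The mass file retrieval method based on an AI training platform according to claim 7, characterized in that, if in step S331 the retrieval keyword type is a folder keyword:
    in step S332, the AI training platform determines the traversal depth from the attributes of the folder keyword and the contents of the located queue folder;
    in step S333, if the contents of the located queue folder have inconsistent attributes, the AI training platform builds the traversal path from the folder nodes at the same traversal depth and ignores the file nodes at that depth;
    if in step S331 the retrieval keyword type is a file keyword:
    in step S332, the AI training platform determines the traversal depth from the attributes of the file keyword and the contents of the located queue folder;
    in step S333, if the contents of the located queue folder have inconsistent attributes, the AI training platform builds the traversal path from the file nodes at the same traversal depth, and for the folder nodes at that depth returns to step S332 to continue determining the traversal depth.
  9. A mass file retrieval apparatus based on an AI training platform, characterized by comprising:
    a retrieval task acquisition module (1), configured to have the AI training platform obtain a retrieval task issued by a user;
    a retrieval thread flow setting module (2), configured to have the AI training platform generate a retrieval thread flow according to the retrieval task and control the business logic of the retrieval process according to the retrieval thread flow;
    a traversal retrieval module (3), configured to have the AI training platform sequentially encode the files in the database per folder to generate ordered queue folders, extract the retrieval keyword from the retrieval task, and then perform keyword retrieval on each ordered queue folder by combining binary search with depth-first traversal.
  10. A device, characterized by comprising a processor and a memory, wherein the memory is configured to store a computer program, and the processor is configured to call the computer program from the memory and run it, so that the device performs the method according to any one of claims 1 to 8.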
PCT/CN2021/109217 2020-09-18 2021-07-29 Mass file retrieval method and apparatus based on AI training platform, and device WO2022057460A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
US18/011,514 US11768805B2 (en) 2020-09-18 2021-07-29 Mass file retrieval method and apparatus based on AI training platform, and device

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010988313.8 2020-09-18
CN202010988313.8A CN111949610B (zh) 2020-09-18 2020-09-18 Mass file retrieval method and apparatus based on AI training platform, and device

Publications (1)

Publication Number Publication Date
WO2022057460A1 true WO2022057460A1 (zh) 2022-03-24

Family

ID=73357631

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/109217 WO2022057460A1 (zh) 2020-09-18 2021-07-29 Mass file retrieval method and apparatus based on AI training platform, and device

Country Status (3)

Country Link
US (1) US11768805B2 (zh)
CN (1) CN111949610B (zh)
WO (1) WO2022057460A1 (zh)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117251416A (zh) * 2023-11-10 2023-12-19 深圳软牛科技有限公司 File scanning method, apparatus, computer device and storage medium

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111949610B (zh) * 2020-09-18 2022-12-23 苏州浪潮智能科技有限公司 Mass file retrieval method and apparatus based on AI training platform, and device

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102456055A (zh) * 2010-10-28 2012-05-16 腾讯科技(深圳)有限公司 Method and device for retrieving points of interest
CN105117502A (zh) * 2015-10-13 2015-12-02 四川中科腾信科技有限公司 Retrieval method based on big data
US20190156032A1 (en) * 2017-01-06 2019-05-23 Crowdstrike, Inc. Binary Search of Byte Sequences Using Inverted Indices
US20190236178A1 (en) * 2018-01-31 2019-08-01 Salesforce.Com, Inc. Trie-based normalization of field values for matching
CN111459886A (zh) * 2020-03-12 2020-07-28 苏州浪潮智能科技有限公司 Method, apparatus, device and storage medium for log content matching and retrieval
CN111949610A (zh) * 2020-09-18 2020-11-17 苏州浪潮智能科技有限公司 Mass file retrieval method and apparatus based on AI training platform, and device

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10929611B2 (en) * 2017-12-05 2021-02-23 discourse.ai, Inc. Computer-based interlocutor understanding using classifying conversation segments
US11507533B2 (en) * 2018-02-05 2022-11-22 Huawei Technologies Co., Ltd. Data query method and apparatus
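CN108876435A (zh) * 2018-05-25 2018-11-23 北京百度网讯科技有限公司 Artificial intelligence platform implementation method, apparatus, computer device and storage medium
US11048767B2 (en) * 2018-11-16 2021-06-29 Sap Se Combination content search
CN110795141B (zh) * 2019-10-12 2023-10-10 广东浪潮大数据研究有限公司 Training task submission method, apparatus, device and medium
CN111488492B (zh) * 2020-04-08 2023-11-17 北京百度网讯科技有限公司 Method and apparatus for retrieving a graph database
CN111581615A (zh) * 2020-05-08 2020-08-25 南京大创师智能科技有限公司 Method and system for providing an artificial intelligence platform to individuals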

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117251416A (zh) * 2023-11-10 2023-12-19 深圳软牛科技有限公司 File scanning method, apparatus, computer device and storage medium
CN117251416B (zh) * 2023-11-10 2024-03-29 深圳软牛科技集团股份有限公司 File scanning method, apparatus, computer device and storage medium

Also Published As

Publication number Publication date
US20230214356A1 (en) 2023-07-06
CN111949610B (zh) 2022-12-23
US11768805B2 (en) 2023-09-26
CN111949610A (zh) 2020-11-17

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21868289

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 21868289

Country of ref document: EP

Kind code of ref document: A1