WO2021043088A1 - File query method and device, and computer device and storage medium - Google Patents

File query method and device, and computer device and storage medium

Info

Publication number
WO2021043088A1
Authority
WO
WIPO (PCT)
Prior art keywords
query
word
file
collection
words
Prior art date
Application number
PCT/CN2020/112336
Other languages
French (fr)
Chinese (zh)
Inventor
钱克功
沈网中
Original Assignee
平安科技(深圳)有限公司
Priority date
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021043088A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/113 Details of archiving

Definitions

  • This application relates to the field of artificial intelligence technology, and in particular to a file query method, device, computer equipment and storage medium.
  • The computer's file system is responsible for creating files for users and controls file access by storing, reading, modifying, and dumping files.
  • When users no longer use files, they can revoke or delete them, so the file system of the computer can support the storage of massive numbers of files.
  • The inventor realizes that, for users facing a large number of files, retrieving the target file takes a certain amount of time and energy. At present, there is no related technology or product in the industry that can perform fast file query.
  • This application provides a file query method, device, computer equipment and storage medium.
  • a document query method provided by this application includes:
  • this application also provides a file query device, which includes:
  • the service description creation module is used to obtain the collection file set of the client, create the service description of the collection file set in the file system, and store the collection file set after the service description is created in the cloud storage;
  • the keyword extraction module is configured to perform keyword extraction on the business description through a keyword extraction algorithm to obtain keywords of the business description, and convert the keywords into word vectors and then store the word vectors;
  • the similarity calculation module is used to receive the query content input by the user, and calculate the similarity between the query content and the word vector;
  • the query module is configured to select the corresponding business description according to the similarity, query the cloud storage for the favorite files through a multi-strategy retrieval method, and return the query result to the user.
  • the present application also provides a computer device that includes a memory and a processor, the memory stores a file query program that can be run on the processor, and the file query program is executed by the processor.
  • this application also provides a computer-readable storage medium having a file query program stored on the computer-readable storage medium, and the file query program can be executed by one or more processors to implement the following steps:
  • FIG. 1 is a schematic flowchart of a file query method provided by an embodiment of this application
  • FIG. 2 is a schematic diagram of the internal structure of a computer device provided by an embodiment of the application.
  • FIG. 3 is a schematic diagram of modules of a file query device provided by an embodiment of the application.
  • This application provides a file query method.
  • Refer to FIG. 1, which is a schematic flowchart of a file query method provided by an embodiment of this application.
  • the method can be executed by a device, and the device can be implemented by software and/or hardware.
  • the file query method includes:
  • the client, also called the user end, refers to a program that corresponds to a server and provides local services to the user.
  • the collection file set of the client is obtained in one of the following two ways: way one, traversing and searching the client's local disk to obtain the collection file set; way two, searching a search engine with keywords according to the user's needs to obtain the collection file set.
  • the cloud storage refers to a mode of online storage (cloud storage), that is, storing data on multiple virtual servers, usually hosted by a third party, instead of on a dedicated server.
  • the file system in this application is Hadoop Distributed File System (HDFS).
  • the HDFS has high fault tolerance and can be deployed on low-cost hardware.
  • the HDFS relaxes some Portable Operating System Interface (POSIX) requirements so that file data can be accessed as streams, thereby providing high-throughput access to application data, which suits applications with large data sets.
  • the HDFS is composed of a NameNode (master node) and n DataNodes (slave nodes).
  • the NameNode is the master server that mainly manages the file namespace and client access, and the DataNodes are responsible for managing the stored files.
  • the preferred embodiment of the present application creates the service description of the collection file set in the master node of the HDFS file system.
  • the business description refers to a brief summary of the content of the collection file set, and can also be expressed as the name of the collection file set.
  • a plurality of different service descriptions are established in the master node of the Hadoop file system, and several slave nodes are set up under the master node to store the collection files corresponding to each service description, so the collection files corresponding to a service description can be queried by retrieving that service description.
  • the performing keyword extraction on the business description through a keyword extraction algorithm includes:
  • Dep(W_i, W_j) represents the degree of dependency between the words W_i and W_j
  • len(W_i, W_j) represents the length of the dependency path between the words W_i and W_j
  • b is a hyperparameter
  • f_grav(W_i, W_j) represents the gravitational force between the words W_i and W_j
  • tfidf(W_i) represents the TF-IDF value of the word W_i
  • tfidf(W_j) represents the TF-IDF value of the word W_j
  • TF means term frequency
  • IDF means inverse document frequency
  • d is the Euclidean distance between the word vectors of the words W_i and W_j;
  • the correlation strength between the words W_i and W_j is:
  • combining the correlation strength, the importance score of the word W_i is calculated as:
  • W_i is associated with a set of vertices
  • is the damping coefficient
  • the present application selects t words with the highest scores as keywords for the business description according to the importance score of the word.
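The scoring formulas themselves did not survive in this text, so the following is only a minimal sketch of the two pieces that are fully described: a TF-IDF value per word, and the selection of the t highest-scoring words. The TF-IDF variant used here (term frequency multiplied by the logarithm of the inverse document frequency) and the names `tf_idf` and `top_t` are assumptions for illustration. Ranking by TF-IDF alone is a stand-in for the full importance score, which also uses the dependency and gravitational terms.

```python
import math
from collections import Counter
from typing import Dict, List


def tf_idf(documents: List[List[str]]) -> List[Dict[str, float]]:
    """Return one {word: TF-IDF value} mapping per tokenized document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each word appears.
    doc_freq = Counter(word for doc in documents for word in set(doc))
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            # TF = relative frequency in this document, IDF = log(N / df).
            word: (count / len(doc)) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores


def top_t(scores: Dict[str, float], t: int) -> List[str]:
    """Select the t highest-scoring words; ties are broken alphabetically."""
    return sorted(scores, key=lambda w: (-scores[w], w))[:t]


# Hypothetical business descriptions, already split into words.
descriptions = [["loan", "contract", "loan"], ["invoice", "contract"]]
all_scores = tf_idf(descriptions)
keywords = top_t(all_scores[0], t=1)
```

A word that appears in every description, such as "contract" in this toy data, gets an IDF of zero and therefore never ranks as a keyword.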
  • this application uses a one-hot algorithm to convert keywords into word vectors for representation.
  • the one-hot representation algorithm is a basic method of vector representation of words. It is similar to the idea of bag-of-words model.
  • a dictionary is constructed by extracting all the words in the corpus, and each word in the dictionary is represented by a word vector.
  • the dimension of the word vector is equal to the dictionary size, and only the dimension corresponding to the current word has the value 1, while all other dimensions have the value 0. Therefore, this application sets the dimension corresponding to each keyword of the business description to 1 and the dimensions of the remaining words to 0, so that the keywords can be converted into a word vector representation.
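The one-hot representation described above can be sketched as follows. This is an illustration under assumptions, not the application's implementation: the function names and the toy dictionary are hypothetical, and the dictionary order is simply first-seen order.

```python
from typing import Dict, List


def build_dictionary(corpus_words: List[str]) -> Dict[str, int]:
    """Map each distinct word in the corpus to a fixed index (first-seen order)."""
    index: Dict[str, int] = {}
    for word in corpus_words:
        if word not in index:
            index[word] = len(index)
    return index


def one_hot(word: str, dictionary: Dict[str, int]) -> List[int]:
    """Vector as long as the dictionary, with a single 1 at the word's index."""
    vector = [0] * len(dictionary)
    vector[dictionary[word]] = 1
    return vector


def encode_keywords(keywords: List[str], dictionary: Dict[str, int]) -> List[int]:
    """Set the dimension of every keyword to 1 and leave the rest at 0."""
    vector = [0] * len(dictionary)
    for word in keywords:
        if word in dictionary:  # words outside the dictionary are ignored
            vector[dictionary[word]] = 1
    return vector


dictionary = build_dictionary(["loan", "contract", "invoice", "loan"])
single = one_hot("contract", dictionary)
description_vector = encode_keywords(["loan", "invoice"], dictionary)
```

Because the vector length equals the dictionary size, this representation grows with the vocabulary and carries no notion of word meaning; two different words are always orthogonal.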
  • S3. Receive the query content input by the user, and calculate the similarity between the query content and the word vector.
  • the preferred embodiment of the present application calculates the similarity between the query content and the word vector by using the cosine similarity method.
  • cosine similarity uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals.
  • the closer the cosine value is to 1, the closer the angle between the two vectors is to 0 degrees, and the more similar the two vectors are.
  • the calculation formula of the cosine similarity is as follows:
    $$\cos\theta = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}} \, \sqrt{\sum_{i=1}^{n} y_i^{2}}}$$
  • X represents the word vector, with components x_i
  • Y represents the query content, with components y_i
  • the cosine value ranges from -1 to 1. When the cosine value is -1, the query content and the word vector point in exactly opposite directions, indicating that the similarity between them is 0; when the cosine value is 1, they point in exactly the same direction, indicating that the similarity is 100%; when the cosine value is 0, the query content and the word vector are independent, indicating moderate similarity or dissimilarity between them.
  • This application obtains the similarity between the query content and the word vector according to the cosine value.
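As a minimal sketch of the cosine similarity calculation, assuming the query content has already been converted to a vector of the same dimension as the stored word vector (the text does not say how). The function name and the handling of all-zero vectors are assumptions.

```python
import math
from typing import Sequence


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between two vectors of equal dimension."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0.0 or norm_y == 0.0:
        # Assumption: an all-zero vector is treated as unrelated to anything.
        return 0.0
    return dot / (norm_x * norm_y)


same_direction = cosine_similarity([1, 0, 1], [1, 0, 1])
orthogonal = cosine_similarity([1, 0], [0, 1])
opposite = cosine_similarity([1, 0], [-1, 0])
```

With one-hot style vectors, whose components are only 0 or 1, the result never goes below 0; negative values can only occur with vectors that have negative components.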
  • the multi-strategy search method in the preferred embodiment of the present application includes Levenshtein Distance (LD).
  • the similarity calculation method compares the query content input by the user with the business description of the favorite file in the cloud storage to determine whether they match. If they match, the favorite file is returned to the user directly; if they do not match, the similarity between the query content input by the user and the keywords of the business description of the favorite file is calculated, with a preset threshold of 0.8, and the collection file corresponding to each service description whose similarity result is greater than the preset threshold is returned to the user as a query result.
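The two-stage retrieval just described (exact match first, then a similarity threshold of 0.8) can be sketched as below. The data layout, the function name `retrieve`, and the pluggable `similarity` callable are assumptions for illustration; for brevity the sketch compares the query with the description string itself rather than with its extracted keywords.

```python
from typing import Callable, Dict, List

SIMILARITY_THRESHOLD = 0.8  # preset threshold stated in the text


def retrieve(query: str,
             favorites: Dict[str, List[str]],
             similarity: Callable[[str, str], float]) -> List[str]:
    """favorites maps a business description to its favorite files."""
    # Strategy 1: the query matches a business description exactly.
    if query in favorites:
        return list(favorites[query])
    # Strategy 2: keep every description whose similarity exceeds the threshold.
    results: List[str] = []
    for description, files in favorites.items():
        if similarity(query, description) > SIMILARITY_THRESHOLD:
            results.extend(files)
    return results


# Hypothetical data and a toy similarity that only ignores letter case.
favorites = {"Loan contracts": ["a.pdf", "b.pdf"], "Invoices": ["c.pdf"]}


def toy_similarity(a: str, b: str) -> float:
    return 0.9 if a.lower() == b.lower() else 0.1


exact = retrieve("Loan contracts", favorites, toy_similarity)
fuzzy = retrieve("loan contracts", favorites, toy_similarity)
missing = retrieve("payroll", favorites, toy_similarity)
```

Passing the similarity function as a parameter keeps the retrieval logic independent of whether cosine similarity or an edit-distance measure is used.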
  • this application uses the LD to calculate the similarity between the query content input by the user and the character string in the service description of the favorite file.
  • this application presets that the original character string in the query content input by the user is m and the target character string of the service description of the collection file is n, and records the number of edits L (deleting, inserting, and replacing operations) needed to transform the original character string m into the target character string n; the L of the two strings m and n is recorded as lev m,n (
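The edit count L can be computed with the standard dynamic-programming form of the Levenshtein distance, sketched below. How the application turns the distance into a similarity is cut off in this text, so `normalized_similarity` (one minus the distance divided by the longer length) is an assumption, not the application's formula.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of deletions, insertions, and substitutions
    needed to turn source into target."""
    # previous[j] holds the distance between the processed prefix of
    # source and the first j characters of target.
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_similarity(source: str, target: str) -> float:
    """Assumed normalization: 1.0 for identical strings, 0.0 for fully different."""
    longest = max(len(source), len(target))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(source, target) / longest


distance = levenshtein("kitten", "sitting")
similarity = normalized_similarity("abcd", "abcf")
```

Keeping only the previous row of the table limits memory use to the length of the target string, which matters little for short business descriptions but costs nothing.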
  • the invention also provides a computer device.
  • Refer to FIG. 2, which is a schematic diagram of the internal structure of a computer device provided by an embodiment of this application.
  • the computer device 1 may be a PC (Personal Computer, personal computer), or a terminal device such as a smart phone, a tablet computer, or a portable computer, or a server.
  • the computer device 1 at least includes a memory 11, a processor 12, a communication bus 13, and a network interface 14.
  • the memory 11 includes at least one type of readable storage medium, and the readable storage medium includes flash memory, hard disk, multimedia card, card-type memory (for example, SD or DX memory, etc.), magnetic memory, magnetic disk, optical disk, and the like.
  • the memory 11 may be an internal storage unit of the computer device 1 in some embodiments, such as a hard disk of the computer device 1. In other embodiments, the memory 11 may also be an external storage device of the computer device 1, such as a plug-in hard disk, a smart media card (SMC), a secure digital (SD) card, or a flash card equipped on the computer device 1. Further, the memory 11 may also include both an internal storage unit of the computer device 1 and an external storage device.
  • the memory 11 can be used not only to store application software and various data installed in the computer device 1, such as the code of the file query program 01, etc., but also to temporarily store data that has been output or will be output.
  • the processor 12 may be a central processing unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, used to run program code or process data stored in the memory 11, for example, to execute the file query program 01.
  • the communication bus 13 is used to realize the connection and communication between these components.
  • the network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is usually used to establish a communication connection between the computer device 1 and other electronic devices.
  • the computer device 1 may also include a user interface.
  • the user interface may include a display (Display) and an input unit such as a keyboard (Keyboard).
  • the optional user interface may also include a standard wired interface and a wireless interface.
  • the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode, organic light-emitting diode) touch device, etc.
  • the display can also be appropriately called a display screen or a display unit, which is used to display the information processed in the computer device 1 and to display a visualized user interface.
  • Figure 2 only shows the computer device 1 with the components 11-14 and the file query program 01.
  • the illustrated structure does not constitute a limitation on the computer device 1, which may include fewer or more components than shown, a combination of certain components, or a different arrangement of components.
  • the file query program 01 is stored in the memory 11; when the processor 12 executes the file query program 01 stored in the memory 11, the following steps are implemented:
  • Step 1 Obtain the collection of collection files of the client, create a service description of the collection of files in the file system, and store the collection of collection files after the service description is created in cloud storage.
  • the client, also called the user end, refers to a program that corresponds to a server and provides local services to the user.
  • the collection file set of the client is obtained in one of the following two ways: way one, traversing and searching the client's local disk to obtain the collection file set; way two, searching a search engine with keywords according to the user's needs to obtain the collection file set.
  • the cloud storage refers to a mode of online storage (cloud storage), that is, storing data on multiple virtual servers, usually hosted by a third party, instead of on a dedicated server.
  • the file system in this application is Hadoop Distributed File System (HDFS).
  • the HDFS has high fault tolerance and can be deployed on low-cost hardware.
  • the HDFS relaxes some Portable Operating System Interface (POSIX) requirements so that file data can be accessed as streams, thereby providing high-throughput access to application data, which suits applications with large data sets.
  • the HDFS is composed of a NameNode (master node) and n DataNodes (slave nodes).
  • the NameNode is the master server that mainly manages the file namespace and client access, and the DataNodes are responsible for managing the stored files.
  • the preferred embodiment of the present application creates the service description of the collection file set in the master node of the HDFS file system.
  • the business description refers to a brief summary of the content of the collection file set, and can also be expressed as the name of the collection file set.
  • a plurality of different service descriptions are established in the master node of the Hadoop file system, and several slave nodes are set up under the master node to store the collection files corresponding to each service description, so the collection files corresponding to a service description can be queried by retrieving that service description.
  • Step 2 Perform keyword extraction on the business description through a keyword extraction algorithm to obtain keywords of the business description, convert the keywords into word vectors, and store the word vectors.
  • the performing keyword extraction on the business description through a keyword extraction algorithm includes:
  • Dep(W_i, W_j) represents the degree of dependency between the words W_i and W_j
  • len(W_i, W_j) represents the length of the dependency path between the words W_i and W_j
  • b is a hyperparameter
  • f_grav(W_i, W_j) represents the gravitational force between the words W_i and W_j
  • tfidf(W_i) represents the TF-IDF value of the word W_i
  • tfidf(W_j) represents the TF-IDF value of the word W_j
  • TF means term frequency
  • IDF means inverse document frequency
  • d is the Euclidean distance between the word vectors of the words W_i and W_j;
  • the correlation strength between the words W_i and W_j is:
  • combining the correlation strength, the importance score of the word W_i is calculated as:
  • W_i is associated with a set of vertices
  • is the damping coefficient
  • the present application selects t words with the highest scores as keywords for the business description according to the importance score of the word.
  • this application uses a one-hot algorithm to convert keywords into word vectors for representation.
  • the one-hot representation algorithm is a basic method of vector representation of words. It is similar to the idea of bag-of-words model.
  • a dictionary is constructed by extracting all the words in the corpus, and each word in the dictionary is represented by a word vector.
  • the dimension of the word vector is equal to the dictionary size, and only the dimension corresponding to the current word has the value 1, while all other dimensions have the value 0. Therefore, this application sets the dimension corresponding to each keyword of the business description to 1 and the dimensions of the remaining words to 0, so that the keywords can be converted into a word vector representation.
  • Step 3 Receive the query content input by the user, and calculate the similarity between the query content and the word vector.
  • the preferred embodiment of the present application calculates the similarity between the query content and the word vector by using the cosine similarity method.
  • cosine similarity uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals.
  • the closer the cosine value is to 1, the closer the angle between the two vectors is to 0 degrees, and the more similar the two vectors are.
  • the calculation formula of the cosine similarity is as follows:
    $$\cos\theta = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert} = \frac{\sum_{i=1}^{n} x_i y_i}{\sqrt{\sum_{i=1}^{n} x_i^{2}} \, \sqrt{\sum_{i=1}^{n} y_i^{2}}}$$
  • X represents the word vector, with components x_i
  • Y represents the query content, with components y_i
  • the cosine value ranges from -1 to 1. When the cosine value is -1, the query content and the word vector point in exactly opposite directions, indicating that the similarity between them is 0; when the cosine value is 1, they point in exactly the same direction, indicating that the similarity is 100%; when the cosine value is 0, the query content and the word vector are independent, indicating moderate similarity or dissimilarity between them.
  • This application obtains the similarity between the query content and the word vector according to the cosine value.
  • Step 4 Select the corresponding business description according to the similarity, query the cloud storage for the favorite files through a multi-strategy retrieval method, and return the query result to the user.
  • the multi-strategy search method in the preferred embodiment of the present application includes Levenshtein Distance (LD).
  • the similarity calculation method compares the query content input by the user with the service description of the favorite file in the cloud storage to determine whether they match. If they match, the favorite file is returned to the user directly; if they do not match, the similarity between the query content input by the user and the keywords of the business description of the favorite file is calculated, with a preset threshold of 0.8, and the collection file corresponding to each service description whose similarity result is greater than the preset threshold is returned to the user as a query result.
  • this application uses the LD to calculate the similarity between the query content input by the user and the character string in the service description of the favorite file.
  • this application presets that the original character string in the query content input by the user is m and the target character string of the service description of the collection file is n, and records the number of edits L (deleting, inserting, and replacing operations) needed to transform the original character string m into the target character string n; the L of the two strings m and n is recorded as lev m,n (
  • the file query program includes a business description creation module 10, a keyword extraction module 20, a similarity calculation module 30, and a query module 40. Exemplarily:
  • the service description creation module 10 is used to obtain the collection file set of the client, create the service description of the collection file set in the file system, and store the collection file set after the service description is created in cloud storage.
  • the keyword extraction module 20 is configured to perform keyword extraction on the business description through a keyword extraction algorithm to obtain keywords of the business description, convert the keywords into word vectors, and then store the word vectors.
  • the similarity calculation module 30 is configured to receive the query content input by the user, and calculate the similarity between the query content and the word vector.
  • the query module 40 is configured to select a corresponding business description according to the similarity, query the cloud storage for favorite files through a multi-strategy retrieval method, and return the query result to the user.
  • the embodiment of the present application also proposes a computer-readable storage medium.
  • the computer-readable storage medium may be non-volatile or volatile, and a file query program is stored on the computer-readable storage medium.
  • the file query program can be executed by one or more processors to achieve the following operations:

Abstract

A file query method, which comprises: acquiring a collection file set of a client, creating a service description of the collection file set in a file system, and storing the collection file set with the created service description into a cloud storage (S1); performing keyword extraction on the service description through a keyword extraction algorithm to obtain keywords of the service description, converting the keyword into a word vector and then storing the word vector (S2); receiving query content input by a user, and calculating a similarity between the query content and the word vector (S3); and selecting a corresponding service description according to the similarity, querying the collection file in the cloud storage in a multi-policy retrieval mode, and returning a query result to the user (S4). The method realizes accurate file query.

Description

File query method, device, computer equipment and storage medium
This application claims priority to the Chinese patent application filed with the Chinese Patent Office on September 3, 2019, with application number CN201910829794.5 and the invention title "File query method, device and computer readable storage medium", the entire content of which is incorporated into this application by reference.
Technical field
This application relates to the field of artificial intelligence technology, and in particular to a file query method, device, computer equipment and storage medium.
Background
With the development of technology, the amount of information has grown explosively, and more and more files need to be stored on the user's computer. The computer's file system is responsible for creating files for users and controls file access by storing, reading, modifying, and dumping files. When users no longer use files, they can revoke or delete them, so the file system of the computer can support the storage of massive numbers of files. However, the inventor realizes that, for users facing a large number of files, retrieving the target file takes a certain amount of time and energy. At present, there is no related technology or product in the industry that can perform fast file query.
Summary of the invention
This application provides a file query method, device, computer equipment and storage medium.
A file query method provided by this application includes:
acquiring the collection file set of the client, creating a service description of the collection file set in the file system, and storing the collection file set, after the service description is created, in cloud storage;
performing keyword extraction on the business description through a keyword extraction algorithm to obtain keywords of the business description, converting the keywords into word vectors, and storing the word vectors;
receiving the query content input by the user, and calculating the similarity between the query content and the word vector;
selecting the corresponding business description according to the similarity, querying the cloud storage for the favorite files through a multi-strategy retrieval method, and returning the query result to the user.
In addition, this application also provides a file query device, which includes:
a service description creation module, used to obtain the collection file set of the client, create the service description of the collection file set in the file system, and store the collection file set, after the service description is created, in cloud storage;
a keyword extraction module, configured to perform keyword extraction on the business description through a keyword extraction algorithm to obtain keywords of the business description, convert the keywords into word vectors, and then store the word vectors;
a similarity calculation module, used to receive the query content input by the user and calculate the similarity between the query content and the word vector;
a query module, configured to select the corresponding business description according to the similarity, query the cloud storage for the favorite files through a multi-strategy retrieval method, and return the query result to the user.
In addition, the present application also provides a computer device that includes a memory and a processor, the memory storing a file query program that can run on the processor, the file query program implementing the following steps when executed by the processor:
acquiring the collection file set of the client, creating a service description of the collection file set in the file system, and storing the collection file set, after the service description is created, in cloud storage;
performing keyword extraction on the business description through a keyword extraction algorithm to obtain keywords of the business description, converting the keywords into word vectors, and storing the word vectors;
receiving the query content input by the user, and calculating the similarity between the query content and the word vector;
selecting the corresponding business description according to the similarity, querying the cloud storage for the favorite files through a multi-strategy retrieval method, and returning the query result to the user.
In addition, this application also provides a computer-readable storage medium on which a file query program is stored, the file query program being executable by one or more processors to implement the following steps:
acquiring the collection file set of the client, creating a service description of the collection file set in the file system, and storing the collection file set, after the service description is created, in cloud storage;
performing keyword extraction on the business description through a keyword extraction algorithm to obtain keywords of the business description, converting the keywords into word vectors, and storing the word vectors;
receiving the query content input by the user, and calculating the similarity between the query content and the word vector;
selecting the corresponding business description according to the similarity, querying the cloud storage for the favorite files through a multi-strategy retrieval method, and returning the query result to the user.
Description of the drawings
FIG. 1 is a schematic flowchart of a file query method provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of the internal structure of a computer device provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of the modules of a file query device provided by an embodiment of the present application.
The realization of the purpose, the functional characteristics, and the advantages of the present application will be further described in conjunction with the embodiments and with reference to the accompanying drawings.
Detailed description
It should be understood that the specific embodiments described here are only used to explain the present application and are not intended to limit it.
The present application provides a file query method. FIG. 1 is a schematic flowchart of the file query method provided by an embodiment of the present application. The method may be executed by a device, and the device may be implemented in software and/or hardware.
In this embodiment, the file query method includes:
S1. Acquire the collection file set of a client, create a business description of the collection file set in a file system, and store the collection file set, once its business description has been created, in cloud storage.
In a preferred embodiment of the present application, the client, also called the user end, is a program that corresponds to a server and provides local services to the user. The collection file set of the client is obtained in one of two ways: in the first, the collection file set is obtained by traversing and searching the local disk of the client; in the second, the collection file set is obtained by searching a search engine with keywords according to the user's needs.
Cloud storage refers to a mode of online storage in which data is stored on multiple virtual servers, usually hosted by a third party, rather than on dedicated servers.
Preferably, the file system in the present application is the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant and can be deployed on low-cost hardware. HDFS also relaxes some Portable Operating System Interface (POSIX) requirements so that file data can be accessed as a stream, which gives applications high-throughput access to their data and suits applications with large data sets.
In detail, HDFS consists of one NameNode (master node) and n DataNodes (slave nodes). The NameNode is the master server that manages the file namespace and client access, and the DataNodes manage file storage. A preferred embodiment of the present application creates the business description of the collection file set in the master node of HDFS.
Further, the business description is a brief summary of the content of the collection file set, and may also be the name of the collection file set. A preferred embodiment of the present application establishes multiple different business descriptions in the master node of Hadoop and sets up several slave nodes under the master node to store the collection files corresponding to each business description. The collection files corresponding to a business description can therefore be queried by retrieving that business description.
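The relationship between business descriptions and collection files in step S1 can be pictured as a small index. The following is a minimal sketch only, assuming an in-memory dictionary stands in for the description table on the HDFS master node and plain path strings stand in for files held on slave nodes. The class name, method names, and paths are hypothetical and do not come from the application.

```python
from collections import defaultdict
from typing import Dict, List


class DescriptionIndex:
    """Hypothetical in-memory stand-in for the master node's description table."""

    def __init__(self) -> None:
        # business description -> paths of the collection files stored under it
        self._files: Dict[str, List[str]] = defaultdict(list)

    def add(self, description: str, file_path: str) -> None:
        """Register a collection file under a business description."""
        self._files[description].append(file_path)

    def lookup(self, description: str) -> List[str]:
        """Return the files stored under a description (empty list if none)."""
        return list(self._files.get(description, []))


# Made-up example data for illustration.
index = DescriptionIndex()
index.add("quarterly financial reports", "/collections/finance/q1.pdf")
index.add("quarterly financial reports", "/collections/finance/q2.pdf")
index.add("onboarding material", "/collections/hr/welcome.docx")
```

Retrieving a business description then yields every collection file registered under it, which is the lookup that the later query steps rely on.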
S2. Extract keywords from the business description with a keyword extraction algorithm to obtain the keywords of the business description, convert the keywords into word vectors, and store the word vectors.
In a preferred embodiment of the present application, extracting keywords from the business description with the keyword extraction algorithm includes:
performing word segmentation on the business description;
calculating the dependency relevance of any two words W_i and W_j in the business description:
Figure PCTCN2020112336-appb-000001
where Dep(W_i, W_j) denotes the dependency relevance of the words W_i and W_j, len(W_i, W_j) denotes the length of the dependency path between the words W_i and W_j, and b is a hyperparameter;
calculating the gravity between the words W_i and W_j:
Figure PCTCN2020112336-appb-000002
where f_grav(W_i, W_j) denotes the gravity between the words W_i and W_j, tfidf(W_i) denotes the TF-IDF value of the word W_i, tfidf(W_j) denotes the TF-IDF value of the word W_j, TF denotes term frequency, IDF denotes inverse document frequency, and d is the Euclidean distance between the word vectors of W_i and W_j;
obtaining, from the calculated dependency relevance and gravity, the association strength between the words W_i and W_j:
weight(W_i, W_j) = Dep(W_i, W_j) * f_grav(W_i, W_j)
calculating the importance score of the word W_i from the association strength:
Figure PCTCN2020112336-appb-000003
where Figure PCTCN2020112336-appb-000004 is the set of vertices associated with the vertex W_i, and η is the damping coefficient.
Preferably, the present application selects the t words with the highest importance scores as the keywords of the business description.
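The scoring steps above can be sketched in code. Because the formulas appear only as image references in this text, the forms used here are assumptions: Dep is taken to decay as 1 / b^len, the gravity is taken to be tfidf(W_i) * tfidf(W_j) / d^2, and the importance score is taken to be a TextRank-style iteration with damping coefficient η. Only the product weight = Dep * f_grav is stated explicitly in the text. All function names and the toy data are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Pair = Tuple[str, str]


def dep(path_len: int, b: float = 2.0) -> float:
    """Assumed form: dependency relevance decays with dependency path length."""
    return 1.0 / (b ** path_len)


def gravity(tfidf_i: float, tfidf_j: float,
            vec_i: Sequence[float], vec_j: Sequence[float]) -> float:
    """Assumed form: product of TF-IDF values over squared Euclidean distance.

    The two word vectors are assumed to be distinct, so d is non-zero.
    """
    d = math.dist(vec_i, vec_j)
    return tfidf_i * tfidf_j / (d * d)


def rank_words(weights: Dict[Pair, float], eta: float = 0.85,
               iterations: int = 50) -> Dict[str, float]:
    """Assumed TextRank-style scoring over an undirected weighted word graph."""
    neighbors: Dict[str, Dict[str, float]] = {}
    for (wi, wj), w in weights.items():
        neighbors.setdefault(wi, {})[wj] = w
        neighbors.setdefault(wj, {})[wi] = w
    scores = {w: 1.0 for w in neighbors}
    for _ in range(iterations):
        updated = {}
        for wi, adjacent in neighbors.items():
            total = 0.0
            for wj, w in adjacent.items():
                # Share of W_j's total edge weight that points at W_i.
                total += w / sum(neighbors[wj].values()) * scores[wj]
            updated[wi] = (1.0 - eta) + eta * total
        scores = updated
    return scores


def top_keywords(scores: Dict[str, float], t: int) -> List[str]:
    """Select the t words with the highest importance scores."""
    return sorted(scores, key=scores.get, reverse=True)[:t]


# Made-up toy inputs: TF-IDF values, 2-D word vectors, dependency path lengths.
tfidf = {"contract": 0.9, "insurance": 0.8, "template": 0.3}
vectors = {"contract": (1.0, 0.0), "insurance": (0.0, 1.0), "template": (2.0, 2.0)}
path_len = {("contract", "insurance"): 1,
            ("contract", "template"): 2,
            ("insurance", "template"): 3}

# weight(W_i, W_j) = Dep(W_i, W_j) * f_grav(W_i, W_j), as stated in the text.
weights = {
    (wi, wj): dep(n) * gravity(tfidf[wi], tfidf[wj], vectors[wi], vectors[wj])
    for (wi, wj), n in path_len.items()
}
scores = rank_words(weights)
keywords = top_keywords(scores, t=2)
```

With these toy values the weakly connected word "template" receives the lowest score and is dropped when t = 2. A real system would obtain the path lengths from a dependency parser and the vectors from a trained embedding, neither of which is shown here.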
Further, the present application uses the one-hot representation algorithm to convert the keywords into word vectors. One-hot representation is a basic method of representing words as vectors and is similar in idea to the bag-of-words model: a dictionary is built by extracting all the words in the corpus, and each word in the dictionary is represented by a word vector whose dimension equals the size of the dictionary. In that vector only the dimension corresponding to the current word has the value 1, and all other dimensions have the value 0. The present application therefore sets the dimensions corresponding to the keywords of the business description to 1 and the dimensions of all other words to 0, so that the keywords are converted into a word vector representation.
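A minimal sketch of this encoding follows, assuming the dictionary is built in first-seen order from a list of segmented words. The function names and the sample vocabulary are hypothetical and are chosen only for illustration.

```python
from typing import Dict, Iterable, List


def build_vocabulary(words: Iterable[str]) -> Dict[str, int]:
    """Map each distinct word to a fixed dimension, in first-seen order."""
    vocab: Dict[str, int] = {}
    for word in words:
        if word not in vocab:
            vocab[word] = len(vocab)
    return vocab


def one_hot(word: str, vocab: Dict[str, int]) -> List[int]:
    """Vector as long as the dictionary, with a single 1 at the word's dimension."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


def encode_keywords(keywords: Iterable[str], vocab: Dict[str, int]) -> List[int]:
    """Set every keyword dimension to 1 and leave all other dimensions at 0."""
    vec = [0] * len(vocab)
    for word in keywords:
        if word in vocab:  # words outside the dictionary are ignored here
            vec[vocab[word]] = 1
    return vec


vocab = build_vocabulary(["insurance", "contract", "template", "claim"])
single = one_hot("contract", vocab)                         # [0, 1, 0, 0]
combined = encode_keywords(["insurance", "claim"], vocab)   # [1, 0, 0, 1]
```

Encoding a query against the same dictionary gives a vector of the same length, which is what makes the similarity comparison in the next step possible.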
S3. Receive the query content input by the user, and calculate the similarity between the query content and the word vectors.
A preferred embodiment of the present application calculates the similarity between the query content and the word vectors using cosine similarity. Cosine similarity uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals. The closer the cosine value is to 1, the closer the angle between the two vectors is to 0 degrees, and the more similar the two vectors are. The cosine similarity is calculated as follows:
Figure PCTCN2020112336-appb-000005
where X denotes the word vector and Y denotes the query content. The cosine value ranges from -1 to 1. A cosine value of -1 means that the query content and the word vector point in exactly opposite directions, in which case the similarity between them is taken to be 0. A cosine value of 1 means that the query content and the word vector point in exactly the same direction, in which case the similarity between them is 100%. A cosine value of 0 means that the query content and the word vector are independent of each other, in which case the similarity or dissimilarity between them is intermediate. The present application obtains the similarity between the query content and the word vector from the cosine value.
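The cosine calculation can be sketched as follows, using the standard definition cos(X, Y) = X·Y / (|X| |Y|). Returning 0.0 for a zero vector is an assumption made here to avoid division by zero and is not taken from the application. Note also that one-hot style vectors have no negative components, so for them the cosine value lies between 0 and 1.

```python
import math
from typing import Sequence


def cosine_similarity(x: Sequence[float], y: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: a zero vector is treated as not similar
    return dot / (norm_x * norm_y)


same = cosine_similarity([1, 0], [1, 0])          # same direction
orthogonal = cosine_similarity([1, 0], [0, 1])    # independent vectors
opposite = cosine_similarity([1, 0], [-1, 0])     # opposite directions
partial = cosine_similarity([1, 1, 0], [1, 0, 0])  # one shared dimension out of two
```

The last case is the typical one for keyword vectors: the query shares some but not all keyword dimensions with a business description and receives an intermediate value.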
S4. Select the corresponding business description according to the similarity, query the cloud storage for collection files through a multi-strategy retrieval method, and return the query result to the user.
In a preferred embodiment of the present application, the multi-strategy retrieval method includes the Levenshtein distance (LD) method. When the user inputs query content, the query content is compared, using the similarity calculation described above, with the business descriptions of the collection files in the cloud storage to determine whether there is a match. If there is a match, the corresponding collection file is returned directly to the user. If there is no match, the similarity between the query content and the keywords of the business descriptions of the collection files is calculated against a preset threshold of 0.8, and the collection files corresponding to the business descriptions whose similarity exceeds the preset threshold are returned to the user as the query result.
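The first two retrieval strategies can be sketched as a match-then-threshold lookup. This is an illustration under stated assumptions: the catalog is a plain dictionary from business description to file paths, and the similarity function is passed in as a parameter. The word-overlap function below is only a toy stand-in for the cosine step, and all names and paths are hypothetical.

```python
from typing import Callable, Dict, List


def retrieve(query: str,
             files_by_description: Dict[str, List[str]],
             similarity: Callable[[str, str], float],
             threshold: float = 0.8) -> List[str]:
    """Return files for an exact description match, else for descriptions above the threshold."""
    # Strategy 1: the query matches a business description directly.
    if query in files_by_description:
        return list(files_by_description[query])
    # Strategy 2: keep every description whose similarity exceeds the preset threshold.
    results: List[str] = []
    for description, files in files_by_description.items():
        if similarity(query, description) > threshold:
            results.extend(files)
    return results


def word_overlap(a: str, b: str) -> float:
    """Toy Jaccard similarity on word sets, standing in for the cosine calculation."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 0.0


# Made-up catalog for illustration.
catalog = {
    "annual insurance claim form 2020": ["/collections/claims/form-2020.pdf"],
    "loan contract template": ["/collections/loans/template.docx"],
}
exact = retrieve("loan contract template", catalog, word_overlap)
fuzzy = retrieve("annual insurance claim form 2020 final", catalog, word_overlap)
nothing = retrieve("holiday photos", catalog, word_overlap)
```

An empty result from this function corresponds to the case in which no similarity exceeds the threshold, which is where the edit-distance fallback would take over.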
Further, when none of the similarity results exceeds the preset threshold, the present application uses the LD to calculate the similarity between the query content input by the user and the character strings in the business descriptions of the collection files. In detail, let the original character string in the query content be m and the target character string in the business description of a collection file be n. The number of edits L (deletions, insertions, and substitutions) required to transform the original string m into the target string n is recorded, and L for the two strings m and n is denoted lev_{m,n}(|m|, |n|), where |m| and |n| are the lengths of the strings m and n respectively. The larger L is, the lower the similarity of the strings, so the present application selects the collection file with the smallest L value as the query result and returns it to the user.
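The edit-distance fallback can be sketched with the standard dynamic-programming computation of the Levenshtein distance. The helper that picks the closest business description, and the sample strings, are hypothetical. How ties between equally close descriptions are resolved is not specified in the text, and this sketch simply keeps the first one.

```python
from typing import Iterable


def levenshtein(m: str, n: str) -> int:
    """Minimum number of deletions, insertions and substitutions turning m into n."""
    previous = list(range(len(n) + 1))  # distances from the empty prefix of m
    for i, char_m in enumerate(m, start=1):
        current = [i]
        for j, char_n in enumerate(n, start=1):
            cost = 0 if char_m == char_n else 1
            current.append(min(previous[j] + 1,          # delete char_m
                               current[j - 1] + 1,       # insert char_n
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


def closest_description(query: str, descriptions: Iterable[str]) -> str:
    """Pick the description with the smallest edit distance L to the query."""
    return min(descriptions, key=lambda description: levenshtein(query, description))


distance = levenshtein("kitten", "sitting")  # 3
best = closest_description("insurance clam", ["insurance claim", "loan contract"])
```

Here the misspelled query "insurance clam" is one insertion away from "insurance claim", so that description, and therefore its collection file, would be selected.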
The present application also provides a computer device. FIG. 2 is a schematic diagram of the internal structure of the computer device provided by an embodiment of the present application.
In this embodiment, the computer device 1 may be a PC (Personal Computer), a terminal device such as a smartphone, tablet computer, or portable computer, or a server. The computer device 1 includes at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.
The memory 11 includes at least one type of readable storage medium, such as flash memory, a hard disk, a multimedia card, card-type memory (for example, SD or DX memory), magnetic memory, a magnetic disk, or an optical disc. In some embodiments the memory 11 may be an internal storage unit of the computer device 1, such as the hard disk of the computer device 1. In other embodiments the memory 11 may be an external storage device of the computer device 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the computer device 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the computer device 1. The memory 11 can be used not only to store application software installed on the computer device 1 and various kinds of data, such as the code of the file query program 01, but also to temporarily store data that has been output or is about to be output.
In some embodiments the processor 12 may be a Central Processing Unit (CPU), controller, microcontroller, microprocessor, or other data processing chip, and is used to run program code stored in the memory 11 or to process data, for example to execute the file query program 01.
The communication bus 13 is used to realize connection and communication between these components.
The network interface 14 may optionally include a standard wired interface and a wireless interface (such as a WI-FI interface), and is usually used to establish a communication connection between the computer device 1 and other electronic devices.
Optionally, the computer device 1 may also include a user interface. The user interface may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. Optionally, in some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display may also be called a display screen or display unit, and is used to display the information processed in the computer device 1 and to display a visual user interface.
FIG. 2 shows only the computer device 1 with the components 11 to 14 and the file query program 01. Those skilled in the art will understand that the structure shown in FIG. 2 does not limit the computer device 1, which may include fewer or more components than shown, combine certain components, or arrange the components differently.
In the embodiment of the computer device 1 shown in FIG. 2, the file query program 01 is stored in the memory 11, and the processor 12 implements the following steps when it executes the file query program 01 stored in the memory 11:
Step 1. Acquire the collection file set of a client, create a business description of the collection file set in a file system, and store the collection file set, once its business description has been created, in cloud storage.
In a preferred embodiment of the present application, the client, also called the user end, is a program that corresponds to a server and provides local services to the user. The collection file set of the client is obtained in one of two ways: in the first, the collection file set is obtained by traversing and searching the local disk of the client; in the second, the collection file set is obtained by searching a search engine with keywords according to the user's needs.
Cloud storage refers to a mode of online storage in which data is stored on multiple virtual servers, usually hosted by a third party, rather than on dedicated servers.
Preferably, the file system in the present application is the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant and can be deployed on low-cost hardware. HDFS also relaxes some Portable Operating System Interface (POSIX) requirements so that file data can be accessed as a stream, which gives applications high-throughput access to their data and suits applications with large data sets.
In detail, HDFS consists of one NameNode (master node) and n DataNodes (slave nodes). The NameNode is the master server that manages the file namespace and client access, and the DataNodes manage file storage. A preferred embodiment of the present application creates the business description of the collection file set in the master node of HDFS.
Further, the business description is a brief summary of the content of the collection file set, and may also be the name of the collection file set. A preferred embodiment of the present application establishes multiple different business descriptions in the master node of Hadoop and sets up several slave nodes under the master node to store the collection files corresponding to each business description. The collection files corresponding to a business description can therefore be queried by retrieving that business description.
Step 2. Extract keywords from the business description with a keyword extraction algorithm to obtain the keywords of the business description, convert the keywords into word vectors, and store the word vectors.
In a preferred embodiment of the present application, extracting keywords from the business description with the keyword extraction algorithm includes:
performing word segmentation on the business description, and calculating the dependency relevance of any two words W_i and W_j in the business description:
Figure PCTCN2020112336-appb-000006
where Dep(W_i, W_j) denotes the dependency relevance of the words W_i and W_j, len(W_i, W_j) denotes the length of the dependency path between the words W_i and W_j, and b is a hyperparameter;
calculating the gravity between the words W_i and W_j:
Figure PCTCN2020112336-appb-000007
where f_grav(W_i, W_j) denotes the gravity between the words W_i and W_j, tfidf(W_i) denotes the TF-IDF value of the word W_i, tfidf(W_j) denotes the TF-IDF value of the word W_j, TF denotes term frequency, IDF denotes inverse document frequency, and d is the Euclidean distance between the word vectors of W_i and W_j;
obtaining, from the calculated dependency relevance and gravity, the association strength between the words W_i and W_j:
weight(W_i, W_j) = Dep(W_i, W_j) * f_grav(W_i, W_j)
calculating the importance score of the word W_i from the association strength:
Figure PCTCN2020112336-appb-000008
where Figure PCTCN2020112336-appb-000009 is the set of vertices associated with the vertex W_i, and η is the damping coefficient.
Preferably, the present application selects the t words with the highest importance scores as the keywords of the business description.
Further, the present application uses the one-hot representation algorithm to convert the keywords into word vectors. One-hot representation is a basic method of representing words as vectors and is similar in idea to the bag-of-words model: a dictionary is built by extracting all the words in the corpus, and each word in the dictionary is represented by a word vector whose dimension equals the size of the dictionary. In that vector only the dimension corresponding to the current word has the value 1, and all other dimensions have the value 0. The present application therefore sets the dimensions corresponding to the keywords of the business description to 1 and the dimensions of all other words to 0, so that the keywords are converted into a word vector representation.
Step 3. Receive the query content input by the user, and calculate the similarity between the query content and the word vectors.
A preferred embodiment of the present application calculates the similarity between the query content and the word vectors using cosine similarity. Cosine similarity uses the cosine of the angle between two vectors in a vector space as a measure of the difference between two individuals. The closer the cosine value is to 1, the closer the angle between the two vectors is to 0 degrees, and the more similar the two vectors are. The cosine similarity is calculated as follows:
Figure PCTCN2020112336-appb-000010
where X denotes the word vector and Y denotes the query content. The cosine value ranges from -1 to 1. A cosine value of -1 means that the query content and the word vector point in exactly opposite directions, in which case the similarity between them is taken to be 0. A cosine value of 1 means that the query content and the word vector point in exactly the same direction, in which case the similarity between them is 100%. A cosine value of 0 means that the query content and the word vector are independent of each other, in which case the similarity or dissimilarity between them is intermediate. The present application obtains the similarity between the query content and the word vector from the cosine value.
Step 4. Select the corresponding business description according to the similarity, query the cloud storage for collection files through a multi-strategy retrieval method, and return the query result to the user.
In a preferred embodiment of the present application, the multi-strategy retrieval method includes the Levenshtein distance (LD) method. When the user inputs query content, the query content is compared, using the similarity calculation described above, with the business descriptions of the collection files in the cloud storage to determine whether there is a match. If there is a match, the corresponding collection file is returned directly to the user. If there is no match, the similarity between the query content and the keywords of the business descriptions of the collection files is calculated against a preset threshold of 0.8, and the collection files corresponding to the business descriptions whose similarity exceeds the preset threshold are returned to the user as the query result.
Further, when none of the similarity results exceeds the preset threshold, the present application uses the LD to calculate the similarity between the query content input by the user and the character strings in the business descriptions of the collection files. In detail, let the original character string in the query content be m and the target character string in the business description of a collection file be n. The number of edits L (deletions, insertions, and substitutions) required to transform the original string m into the target string n is recorded, and L for the two strings m and n is denoted lev_{m,n}(|m|, |n|), where |m| and |n| are the lengths of the strings m and n respectively. The larger L is, the lower the similarity of the strings, so the present application selects the collection file with the smallest L value as the query result and returns it to the user.
FIG. 3 is a schematic diagram of the modules in an embodiment of the file query device of the present application. In this embodiment, the file query program includes a business description creation module 10, a keyword extraction module 20, a similarity calculation module 30, and a query module 40. Exemplarily:
The business description creation module 10 is used to acquire the collection file set of a client, create a business description of the collection file set in a file system, and store the collection file set, once its business description has been created, in cloud storage.
The keyword extraction module 20 is used to extract keywords from the business description with a keyword extraction algorithm to obtain the keywords of the business description, convert the keywords into word vectors, and store the word vectors.
The similarity calculation module 30 is used to receive the query content input by the user and calculate the similarity between the query content and the word vectors.
The query module 40 is used to select the corresponding business description according to the similarity, query the cloud storage for collection files through a multi-strategy retrieval method, and return the query result to the user.
The functions or operation steps implemented when program modules such as the business description creation module 10, the keyword extraction module 20, the similarity calculation module 30, and the query module 40 are executed are substantially the same as those of the foregoing embodiments and are not repeated here.
In addition, an embodiment of the present application also provides a computer-readable storage medium, which may be non-volatile or volatile. A file query program is stored on the computer-readable storage medium, and the file query program can be executed by one or more processors to implement the following operations:
acquiring the collection file set of a client, creating a business description of the collection file set in a file system, and storing the collection file set, once its business description has been created, in cloud storage;
extracting keywords from the business description with a keyword extraction algorithm to obtain the keywords of the business description, converting the keywords into word vectors, and storing the word vectors;
receiving query content input by a user, and calculating the similarity between the query content and the word vectors;
selecting the corresponding business description according to the similarity, querying the cloud storage for collection files through a multi-strategy retrieval method, and returning the query result to the user.
The specific implementation of the computer-readable storage medium of the present application is substantially the same as the embodiments of the file query device and method described above and is not repeated here.
It should be noted that the serial numbers of the foregoing embodiments of the present application are for description only and do not indicate the relative merits of the embodiments. The terms "comprise", "include", or any other variant thereof herein are intended to cover non-exclusive inclusion, so that a process, device, article, or method that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, device, article, or method. Without further limitation, an element defined by the phrase "including a..." does not exclude the existence of other identical elements in the process, device, article, or method that includes the element.
From the description of the above embodiments, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus the necessary general-purpose hardware platform, or alternatively by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium as described above (such as ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions that cause a terminal device (which may be a mobile phone, a computer, a server, a network device, or the like) to execute the methods described in the embodiments of the present application.
The above are only preferred embodiments of the present application and do not thereby limit its patent scope. Any equivalent structure or equivalent process transformation made using the contents of the description and drawings of the present application, or any direct or indirect application in other related technical fields, is likewise included within the scope of patent protection of the present application.

Claims (20)

  1. A file query method, wherein the method comprises:
    acquiring a collection file set of a client, creating a business description of the collection file set in a file system, and storing the collection file set, after the business description is created, in cloud storage;
    performing keyword extraction on the business description through a keyword extraction algorithm to obtain keywords of the business description, converting the keywords into word vectors, and storing the word vectors;
    receiving query content input by a user, and calculating the similarity between the query content and the word vectors;
    selecting the corresponding business description according to the similarity, querying the cloud storage for collection files through a multi-strategy retrieval method, and returning the query result to the user.
  2. The file query method according to claim 1, wherein the acquiring a collection file set of a client comprises:
    obtaining the collection file set by traversing and searching the local disk of the client; or
    obtaining the collection file set by searching a search engine with keywords according to the needs of the user.
  3. The file query method according to claim 1, wherein the performing keyword extraction on the business description through a keyword extraction algorithm comprises:
    performing a word segmentation operation on the business description;
    calculating the dependency correlation degree of any two words W_i and W_j in the business description:
    Figure PCTCN2020112336-appb-100001
    where Dep(W_i, W_j) denotes the dependency correlation degree of the words W_i and W_j, len(W_i, W_j) denotes the length of the dependency path between the words W_i and W_j, and b is a hyperparameter;
    calculating the gravity of the words W_i and W_j:
    Figure PCTCN2020112336-appb-100002
    where f_grav(W_i, W_j) denotes the gravity of the words W_i and W_j, tfidf(W_i) denotes the TF-IDF value of the word W_i, tfidf(W_j) denotes the TF-IDF value of the word W_j, TF denotes term frequency, IDF denotes inverse document frequency, and d is the Euclidean distance between the word vectors of the words W_i and W_j;
    obtaining, from the calculated dependency correlation degree and gravity, the association strength between the words W_i and W_j as:
    weight(W_i, W_j) = Dep(W_i, W_j) * f_grav(W_i, W_j)
    calculating the importance score of the word W_i from the association strength:
    Figure PCTCN2020112336-appb-100003
    where
    Figure PCTCN2020112336-appb-100004
    is the set associated with the vertex W_i, and η is the damping coefficient;
    selecting, according to the importance scores of the words W_i, the t highest-scoring words as the keywords of the business description.
  4. The file query method according to claim 1, wherein the similarity between the query content and the word vector is calculated by the formula:
    Figure PCTCN2020112336-appb-100005
    where X denotes the word vector and Y denotes the query content.
  5. The file query method according to any one of claims 1 to 4, wherein the querying the cloud storage for collection files through a multi-strategy retrieval method comprises:
    presetting the original character string in the query content input by the user as m, and the target character string of the business description of the collection file as n;
    recording the number of edits L of the deletion, insertion, and replacement operations required to transform the original character string m into the target character string n;
    selecting the collection file with the smallest value of L as the query result, and returning it to the user.
  6. The file query method according to claim 1, wherein the converting the keywords into word vectors comprises:
    converting the keywords into word vectors using a one-hot representation algorithm.
  7. The file query method according to claim 1, wherein the file system is a Hadoop file system.
  8. A computer device, wherein the device comprises a memory and a processor, the memory stores a file query program executable on the processor, and the file query program, when executed by the processor, implements the following steps:
    acquiring a collection file set of a client, creating a business description of the collection file set in a file system, and storing the collection file set, after the business description is created, in cloud storage;
    performing keyword extraction on the business description through a keyword extraction algorithm to obtain keywords of the business description, converting the keywords into word vectors, and storing the word vectors;
    receiving query content input by a user, and calculating the similarity between the query content and the word vectors;
    selecting the corresponding business description according to the similarity, querying the cloud storage for collection files through a multi-strategy retrieval method, and returning the query result to the user.
  9. The computer device according to claim 8, wherein the acquiring a collection file set of a client comprises:
    obtaining the collection file set by traversing and searching the local disk of the client; or
    obtaining the collection file set by searching a search engine with keywords according to the needs of the user.
  10. The computer device according to claim 8, wherein the performing keyword extraction on the business description through a keyword extraction algorithm comprises:
    performing a word segmentation operation on the business description;
    calculating the dependency correlation degree of any two words W_i and W_j in the business description:
    Figure PCTCN2020112336-appb-100006
    where Dep(W_i, W_j) denotes the dependency correlation degree of the words W_i and W_j, len(W_i, W_j) denotes the length of the dependency path between the words W_i and W_j, and b is a hyperparameter;
    calculating the gravity of the words W_i and W_j:
    Figure PCTCN2020112336-appb-100007
    where f_grav(W_i, W_j) denotes the gravity of the words W_i and W_j, tfidf(W_i) denotes the TF-IDF value of the word W_i, tfidf(W_j) denotes the TF-IDF value of the word W_j, TF denotes term frequency, IDF denotes inverse document frequency, and d is the Euclidean distance between the word vectors of the words W_i and W_j;
    obtaining, from the calculated dependency correlation degree and gravity, the association strength between the words W_i and W_j as:
    weight(W_i, W_j) = Dep(W_i, W_j) * f_grav(W_i, W_j)
    calculating the importance score of the word W_i from the association strength:
    Figure PCTCN2020112336-appb-100008
    where
    Figure PCTCN2020112336-appb-100009
    is the set associated with the vertex W_i, and η is the damping coefficient;
    selecting, according to the importance scores of the words W_i, the t highest-scoring words as the keywords of the business description.
  11. The computer device according to claim 8, wherein the similarity between the query content and the word vector is calculated by the formula:
    Figure PCTCN2020112336-appb-100010
    where X denotes the word vector and Y denotes the query content.
  12. The computer device according to any one of claims 8 to 11, wherein the querying the cloud storage for collection files through a multi-strategy retrieval method comprises:
    presetting the original character string in the query content input by the user as m, and the target character string of the business description of the collection file as n;
    recording the number of edits L of the deletion, insertion, and replacement operations required to transform the original character string m into the target character string n;
    selecting the collection file with the smallest value of L as the query result, and returning it to the user.
  13. The computer device according to claim 8, wherein the converting the keywords into word vectors comprises:
    converting the keywords into word vectors using a one-hot representation algorithm.
  14. A file query device, wherein the device comprises:
    a business description creation module, configured to acquire a collection file set of a client, create a business description of the collection file set in a file system, and store the collection file set, after the business description is created, in cloud storage;
    a keyword extraction module, configured to perform keyword extraction on the business description through a keyword extraction algorithm to obtain keywords of the business description, convert the keywords into word vectors, and store the word vectors;
    a similarity calculation module, configured to receive query content input by a user and calculate the similarity between the query content and the word vectors;
    a query module, configured to select the corresponding business description according to the similarity, query the cloud storage for collection files through a multi-strategy retrieval method, and return the query result to the user.
  15. A computer-readable storage medium, wherein a file query program is stored on the computer-readable storage medium, and the file query program can be executed by one or more processors to implement the following steps of a file query method:
    acquiring a collection file set of a client, creating a business description of the collection file set in a file system, and storing the collection file set, after the business description is created, in cloud storage;
    performing keyword extraction on the business description through a keyword extraction algorithm to obtain keywords of the business description, converting the keywords into word vectors, and storing the word vectors;
    receiving query content input by a user, and calculating the similarity between the query content and the word vectors;
    selecting the corresponding business description according to the similarity, querying the cloud storage for collection files through a multi-strategy retrieval method, and returning the query result to the user.
  16. The computer-readable storage medium according to claim 15, wherein the acquiring a collection file set of a client comprises:
    obtaining the collection file set by traversing and searching the local disk of the client; or
    obtaining the collection file set by searching a search engine with keywords according to the needs of the user.
  17. The computer-readable storage medium according to claim 15, wherein the performing keyword extraction on the business description through a keyword extraction algorithm comprises:
    performing a word segmentation operation on the business description;
    calculating the dependency correlation degree of any two words W_i and W_j in the business description:
    Figure PCTCN2020112336-appb-100011
    where Dep(W_i, W_j) denotes the dependency correlation degree of the words W_i and W_j, len(W_i, W_j) denotes the length of the dependency path between the words W_i and W_j, and b is a hyperparameter;
    calculating the gravity of the words W_i and W_j:
    Figure PCTCN2020112336-appb-100012
    where f_grav(W_i, W_j) denotes the gravity of the words W_i and W_j, tfidf(W_i) denotes the TF-IDF value of the word W_i, tfidf(W_j) denotes the TF-IDF value of the word W_j, TF denotes term frequency, IDF denotes inverse document frequency, and d is the Euclidean distance between the word vectors of the words W_i and W_j;
    obtaining, from the calculated dependency correlation degree and gravity, the association strength between the words W_i and W_j as:
    weight(W_i, W_j) = Dep(W_i, W_j) * f_grav(W_i, W_j)
    calculating the importance score of the word W_i from the association strength:
    Figure PCTCN2020112336-appb-100013
    where
    Figure PCTCN2020112336-appb-100014
    is the set associated with the vertex W_i, and η is the damping coefficient;
    selecting, according to the importance scores of the words W_i, the t highest-scoring words as the keywords of the business description.
  18. The computer device according to claim 15, wherein the similarity between the query content and the word vector is calculated by the formula:
    Figure PCTCN2020112336-appb-100015
    where X denotes the word vector and Y denotes the query content.
  19. The computer-readable storage medium according to any one of claims 15 to 18, wherein the querying the cloud storage for collection files through a multi-strategy retrieval method comprises:
    presetting the original character string in the query content input by the user as m, and the target character string of the business description of the collection file as n;
    recording the number of edits L of the deletion, insertion, and replacement operations required to transform the original character string m into the target character string n;
    selecting the collection file with the smallest value of L as the query result, and returning it to the user.
  20. The computer-readable storage medium according to claim 15, wherein the converting the keywords into word vectors comprises:
    converting the keywords into word vectors using a one-hot representation algorithm.
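
The retrieval step recited in claims 5, 12, and 19 counts the deletion, insertion, and replacement operations needed to turn the query string m into a business description string n, then returns the collection file with the smallest count L. A minimal sketch of that idea follows, assuming the count is the standard Levenshtein edit distance computed by dynamic programming. The function names (`edit_distance`, `closest_file`) and the sample file names and descriptions are hypothetical.

```python
def edit_distance(m, n):
    """Minimum number of deletions, insertions and replacements turning m into n."""
    # prev[j] is the distance between the empty prefix of m and n[:j].
    prev = list(range(len(n) + 1))
    for i, cm in enumerate(m, start=1):
        curr = [i]  # turning m[:i] into the empty string takes i deletions
        for j, cn in enumerate(n, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cm
                curr[j - 1] + 1,           # insert cn
                prev[j - 1] + (cm != cn),  # replace cm with cn (free if equal)
            ))
        prev = curr
    return prev[-1]

def closest_file(query, descriptions):
    """Return the file whose business description has the smallest L for the query."""
    return min(descriptions, key=lambda f: edit_distance(query, descriptions[f]))

# Hypothetical mapping from collection file to its business description.
descriptions = {
    "a.pdf": "loan contract",
    "b.pdf": "insurance claim",
}
result = closest_file("loan contrat", descriptions)
```

Here the misspelled query is one insertion away from the description of `a.pdf`, so that file is selected. The table uses two rows at a time, so memory grows with the length of n rather than with the product of both lengths.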
PCT/CN2020/112336 2019-09-03 2020-08-30 File query method and device, and computer device and storage medium WO2021043088A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910829794.5 2019-09-03
CN201910829794.5A CN110674087A (en) 2019-09-03 2019-09-03 File query method and device and computer readable storage medium

Publications (1)

Publication Number Publication Date
WO2021043088A1 true WO2021043088A1 (en) 2021-03-11

Family

ID=69076316

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/112336 WO2021043088A1 (en) 2019-09-03 2020-08-30 File query method and device, and computer device and storage medium

Country Status (2)

Country Link
CN (1) CN110674087A (en)
WO (1) WO2021043088A1 (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110674087A (en) * 2019-09-03 2020-01-10 平安科技(深圳)有限公司 File query method and device and computer readable storage medium
CN111753526A (en) * 2020-06-18 2020-10-09 北京无忧创想信息技术有限公司 Similar competitive product data analysis method and system
CN113806619B (en) * 2021-08-19 2022-09-09 广州云硕科技发展有限公司 Semantic analysis system and semantic analysis method

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102855252A (en) * 2011-06-30 2013-01-02 北京百度网讯科技有限公司 Method and device for data retrieval based on demands
CN105631009A (en) * 2015-12-25 2016-06-01 广州视源电子科技股份有限公司 Word vector similarity based retrieval method and system
CN108804409A (en) * 2017-04-28 2018-11-13 西安科技大市场创新云服务股份有限公司 A kind of semantic retrieving method and device
CN109857841A (en) * 2018-12-05 2019-06-07 厦门快商通信息技术有限公司 A kind of FAQ question sentence Text similarity computing method and system
CN110032632A (en) * 2019-04-04 2019-07-19 平安科技(深圳)有限公司 Intelligent customer service answering method, device and storage medium based on text similarity
CN110674087A (en) * 2019-09-03 2020-01-10 平安科技(深圳)有限公司 File query method and device and computer readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7720870B2 (en) * 2007-12-18 2010-05-18 Yahoo! Inc. Method and system for quantifying the quality of search results based on cohesion
CN103577416B (en) * 2012-07-20 2017-09-22 阿里巴巴集团控股有限公司 Expanding query method and system
CN103198136B (en) * 2013-04-15 2016-01-13 天津理工大学 A kind of PC file polling method based on sequential correlation
CN108536708A (en) * 2017-03-03 2018-09-14 腾讯科技(深圳)有限公司 A kind of automatic question answering processing method and automatically request-answering system
CN108170739A (en) * 2017-12-18 2018-06-15 深圳前海微众银行股份有限公司 Problem matching process, terminal and computer readable storage medium

Also Published As

Publication number Publication date
CN110674087A (en) 2020-01-10

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 20860247; Country of ref document: EP; Kind code of ref document: A1)
NENP Non-entry into the national phase (Ref country code: DE)
122 Ep: pct application non-entry in european phase (Ref document number: 20860247; Country of ref document: EP; Kind code of ref document: A1)