CN110674087A - File query method and device and computer readable storage medium - Google Patents

Publication number
CN110674087A
Authority
CN
China
Prior art keywords
word
service description
file
query
keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910829794.5A
Other languages
Chinese (zh)
Inventor
钱克功
沈网中
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ping An Technology Shenzhen Co Ltd
Original Assignee
Ping An Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Ping An Technology Shenzhen Co Ltd
Priority to CN201910829794.5A (CN110674087A)
Publication of CN110674087A
Priority to PCT/CN2020/112336 (WO2021043088A1)
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/14 Details of searching files based on file metadata
    • G06F16/148 File search processing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/11 File system administration, e.g. details of archiving or snapshots
    • G06F16/113 Details of archiving

Abstract

The invention relates to artificial intelligence technology and discloses a file query method comprising the following steps: acquiring a collection file set of a client, creating a service description of the collection file set in a file system, and storing the collection file set, once its service description has been created, in cloud storage; extracting keywords from the service description with a keyword extraction algorithm, converting the keywords into word vectors, and storing the word vectors; receiving query content input by a user, and calculating the similarity between the query content and the word vectors; and selecting the corresponding service description according to the similarity, querying the cloud storage for the collection files through multi-strategy retrieval, and returning the query result to the user. The invention also provides a file query device and a computer readable storage medium. The invention enables accurate file queries.

Description

File query method and device and computer readable storage medium
Technical Field
The invention relates to the technical field of artificial intelligence, in particular to a file query method and device and a computer readable storage medium.
Background
With the development of technology, the amount of information has grown explosively, and more and more files need to be stored on users' computers. The file system of a computer creates files for users and controls access to them by storing, reading, modifying and dumping files. When a user no longer needs a file, the file can be revoked or deleted, so the file system can support the storage of massive numbers of files. For users, however, retrieving a target file among massive numbers of files takes considerable time and effort, and the industry currently lacks a technology or product for querying files quickly.
Disclosure of Invention
The invention provides a file query method, a file query device and a computer readable storage medium, whose main purpose is to present an accurate file query result to a user when the user queries for a file by text.
In order to achieve the above object, the present invention provides a file query method, including:
acquiring a collection file set of a client, creating a service description of the collection file set in a file system, and storing the collection file set, once its service description has been created, in cloud storage;
extracting keywords from the service description with a keyword extraction algorithm to obtain the keywords of the service description, converting the keywords into word vectors, and storing the word vectors;
receiving query content input by a user, and calculating the similarity between the query content and the word vectors;
and selecting the corresponding service description according to the similarity, querying the cloud storage for the collection files through multi-strategy retrieval, and returning the query result to the user.
Optionally, acquiring the collection file set of the client includes:
traversing and retrieving from a local disk of the client to obtain the collection file set; or
searching for the collection file set with a search engine, using keywords, according to the requirements of the user.
Optionally, extracting keywords from the service description with a keyword extraction algorithm includes:
performing a word segmentation operation on the service description;
calculating the dependency relevance of any two words $W_i$ and $W_j$ in the service description:
[Formula image BDA0002189592880000021]
wherein $Dep(W_i, W_j)$ represents the dependency relevance of the words $W_i$ and $W_j$, $len(W_i, W_j)$ represents the length of the dependency path between the words $W_i$ and $W_j$, and $b$ is a hyper-parameter;
calculating the gravity of the words $W_i$ and $W_j$:
[Formula image BDA0002189592880000022]
wherein $f_{grav}(W_i, W_j)$ represents the gravity of the words $W_i$ and $W_j$, $tfidf(W_i)$ represents the TF-IDF value of the word $W_i$, $tfidf(W_j)$ represents the TF-IDF value of the word $W_j$, TF represents the term frequency, IDF represents the inverse document frequency, and $d$ is the Euclidean distance between the word vectors of the words $W_i$ and $W_j$;
obtaining the association strength between the words $W_i$ and $W_j$ from the calculated dependency relevance and gravity:
$$weight(W_i, W_j) = Dep(W_i, W_j) \cdot f_{grav}(W_i, W_j)$$
calculating the importance score of the word $W_j$ in combination with the association strength:
[Formula image BDA0002189592880000023]
wherein [Formula image BDA0002189592880000024] is the set related to the vertex $W_i$, and $\eta$ is a damping coefficient;
selecting, according to the importance scores of the words, the t words with the highest scores as the keywords of the service description.
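The formulas referenced in the steps above survive only as image references. As a hedged reading, the surrounding definitions are consistent with the forms below. These are assumptions, not the confirmed equations of the publication: an exponentially decaying dependency relevance, a Newtonian-style gravity, and a TextRank-style weighted score. The symbols $WS(\cdot)$ (importance score) and $C(W_i)$ (set of words related to the vertex $W_i$) are names introduced here for illustration only.

```latex
\begin{aligned}
Dep(W_i, W_j) &= \frac{1}{b^{\,len(W_i, W_j)}} \\
f_{grav}(W_i, W_j) &= \frac{tfidf(W_i)\, tfidf(W_j)}{d^{2}} \\
weight(W_i, W_j) &= Dep(W_i, W_j) \cdot f_{grav}(W_i, W_j) \\
WS(W_i) &= (1 - \eta) + \eta \sum_{W_j \in C(W_i)}
  \frac{weight(W_j, W_i)}{\sum_{W_k \in C(W_j)} weight(W_j, W_k)}\, WS(W_j)
\end{aligned}
```

Under this reading the association strength is symmetric, so $weight(W_j, W_i) = weight(W_i, W_j)$, and a word scores highly when it is strongly associated with words that are themselves important.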
Optionally, the similarity between the query content and the word vector is calculated as:
$$\cos\theta = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}$$
wherein X represents the word vector and Y represents the query content.
Optionally, querying the cloud storage for the collection files through multi-strategy retrieval includes:
denoting the original character string in the query content input by the user as m, and the target character string of the service description of a collection file as n;
recording the number of edits L (deletion, insertion and replacement operations) required to convert the original character string m into the target character string n;
and selecting the collection file with the minimum L value as the query result, and returning the query result to the user.
In addition, in order to achieve the above object, the present invention further provides a file query apparatus, which includes a memory and a processor, wherein the memory stores a file query program operable on the processor, and the file query program, when executed by the processor, implements the following steps:
acquiring a collection file set of a client, creating a service description of the collection file set in a file system, and storing the collection file set, once its service description has been created, in cloud storage;
extracting keywords from the service description with a keyword extraction algorithm to obtain the keywords of the service description, converting the keywords into word vectors, and storing the word vectors;
receiving query content input by a user, and calculating the similarity between the query content and the word vectors;
and selecting the corresponding service description according to the similarity, querying the cloud storage for the collection files through multi-strategy retrieval, and returning the query result to the user.
Optionally, acquiring the collection file set of the client includes:
traversing and retrieving from a local disk of the client to obtain the collection file set; or
searching for the collection file set with a search engine, using keywords, according to the requirements of the user.
Optionally, extracting keywords from the service description with a keyword extraction algorithm includes:
performing a word segmentation operation on the service description;
calculating the dependency relevance of any two words $W_i$ and $W_j$ in the service description:
[Formula image BDA0002189592880000031]
wherein $Dep(W_i, W_j)$ represents the dependency relevance of the words $W_i$ and $W_j$, $len(W_i, W_j)$ represents the length of the dependency path between the words $W_i$ and $W_j$, and $b$ is a hyper-parameter;
calculating the gravity of the words $W_i$ and $W_j$:
[Formula image BDA0002189592880000032]
wherein $f_{grav}(W_i, W_j)$ represents the gravity of the words $W_i$ and $W_j$, $tfidf(W_i)$ represents the TF-IDF value of the word $W_i$, $tfidf(W_j)$ represents the TF-IDF value of the word $W_j$, TF represents the term frequency, IDF represents the inverse document frequency, and $d$ is the Euclidean distance between the word vectors of the words $W_i$ and $W_j$;
obtaining the association strength between the words $W_i$ and $W_j$ from the calculated dependency relevance and gravity:
$$weight(W_i, W_j) = Dep(W_i, W_j) \cdot f_{grav}(W_i, W_j)$$
calculating the importance score of the word $W_j$ in combination with the association strength:
[Formula image BDA0002189592880000041]
wherein [Formula image BDA0002189592880000042] is the set related to the vertex $W_i$, and $\eta$ is a damping coefficient;
selecting, according to the importance scores of the words, the t words with the highest scores as the keywords of the service description.
Optionally, the similarity between the query content and the word vector is calculated as:
$$\cos\theta = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}$$
wherein X represents the word vector and Y represents the query content.
In addition, to achieve the above object, the present invention also provides a computer readable storage medium having a file query program stored thereon, where the file query program is executable by one or more processors to implement the steps of the file query method as described above.
According to the file query method, the file query device and the computer readable storage medium of the invention, service descriptions are created for the collection files of the client. When a user queries for a file, the similarity between the query content input by the user and the service descriptions is calculated, the files are queried through multi-strategy retrieval according to the similarity, and the query result is returned to the user, so that an accurate file query result can be presented to the user.
Drawings
Fig. 1 is a schematic flowchart of a file query method according to an embodiment of the present invention;
Fig. 2 is a schematic diagram of an internal structure of a file query apparatus according to an embodiment of the present invention;
Fig. 3 is a block diagram illustrating a file query program in a file query apparatus according to an embodiment of the present invention.
The implementation, functional features and advantages of the objects of the present invention will be further explained with reference to the accompanying drawings.
Detailed Description
It should be understood that the specific embodiments described herein are merely illustrative of the invention and are not intended to limit the invention.
The invention provides a file query method. Fig. 1 is a schematic flow chart of a file query method according to an embodiment of the present invention. The method may be performed by an apparatus, which may be implemented by software and/or hardware.
In this embodiment, the file query method includes:
S1, acquiring a collection file set of the client, creating a service description of the collection file set in the file system, and storing the collection file set, once its service description has been created, in cloud storage.
In the preferred embodiment of the present invention, the client, also called the user side, is a program that corresponds to the server and provides local services to the user. The collection file set of the client is obtained in one of two ways: first, by traversing and retrieving from a local disk of the client; second, by searching with a search engine, using keywords, according to the requirements of the user.
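As a minimal sketch of the first acquisition mode (traversing a local disk), the snippet below walks a directory tree and gathers the files whose extensions are in an allow-list. The function name `collect_file_set` and the extension filter are assumptions made for illustration; the text does not say how files are selected for the collection file set.

```python
import os
import tempfile


def collect_file_set(root, extensions=(".txt", ".pdf", ".docx")):
    """Walk `root` recursively and return the sorted paths of matching files."""
    found = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            # Hypothetical selection rule: keep files by extension only.
            if name.lower().endswith(extensions):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


if __name__ == "__main__":
    # Build a throwaway directory tree so the example is self-contained.
    with tempfile.TemporaryDirectory() as root:
        os.makedirs(os.path.join(root, "reports"))
        for rel in ("a.txt", "b.png", os.path.join("reports", "q1.pdf")):
            with open(os.path.join(root, rel), "w") as fh:
                fh.write("demo")
        print(collect_file_set(root))
```

In a real deployment the selection rule would more likely follow whatever marks a file as collected on the client, which this sketch does not model.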
Cloud storage is a mode of online storage in which data is stored on multiple virtual servers, usually hosted by third parties, rather than on dedicated servers.
Preferably, the file system in the present invention is the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant and can be deployed on low-cost hardware. It also relaxes some Portable Operating System Interface (POSIX) requirements so that file data can be accessed in a streaming manner, which provides high-throughput access to application data and suits applications with large data sets.
In detail, HDFS consists of one NameNode (master node) and n DataNodes (slave nodes). The NameNode is the master server that manages the file namespace and client access to files, and the DataNodes manage file storage. In the preferred embodiment of the present invention, the service description of the collection file set is created in the master node of the HDFS.
Further, the service description refers to a brief summary of the content of the collection file set, and may also be represented as a name of the collection file set.
S2, extracting keywords from the service description through a keyword extraction algorithm to obtain the keywords of the service description, converting the keywords into word vectors, and storing the word vectors.
In a preferred embodiment of the present invention, extracting keywords from the service description with a keyword extraction algorithm includes:
performing a word segmentation operation on the service description;
calculating the dependency relevance of any two words $W_i$ and $W_j$ in the service description:
[Formula image BDA0002189592880000061]
wherein $Dep(W_i, W_j)$ represents the dependency relevance of the words $W_i$ and $W_j$, $len(W_i, W_j)$ represents the length of the dependency path between the words $W_i$ and $W_j$, and $b$ is a hyper-parameter;
calculating the gravity of the words $W_i$ and $W_j$:
[Formula image BDA0002189592880000062]
wherein $f_{grav}(W_i, W_j)$ represents the gravity of the words $W_i$ and $W_j$, $tfidf(W_i)$ represents the TF-IDF value of the word $W_i$, $tfidf(W_j)$ represents the TF-IDF value of the word $W_j$, TF represents the term frequency, IDF represents the inverse document frequency, and $d$ is the Euclidean distance between the word vectors of the words $W_i$ and $W_j$;
obtaining the association strength between the words $W_i$ and $W_j$ from the calculated dependency relevance and gravity:
$$weight(W_i, W_j) = Dep(W_i, W_j) \cdot f_{grav}(W_i, W_j)$$
calculating the importance score of the word $W_j$ in combination with the association strength:
[Formula image BDA0002189592880000063]
wherein the set symbol in the formula denotes the set related to the vertex $W_i$, and $\eta$ is a damping coefficient;
preferably, the invention selects, according to the importance scores of the words, the t words with the highest scores as the keywords of the service description.
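The scoring steps above can be sketched in Python as follows. Because the formula images are not available in the text, the forms used here are assumptions: dependency relevance is taken as `1 / b ** len`, gravity as `tfidf_i * tfidf_j / d ** 2`, and the importance score as a TextRank-style iteration with damping coefficient `eta`. The function names, the `path_len` lookup keyed by word pairs, and the default values of `b`, `eta` and `iterations` are all hypothetical.

```python
import math
from itertools import combinations


def association_weights(words, tfidf, vectors, path_len, b=2.0):
    """Return {(wi, wj): weight}, with weight = Dep * f_grav (assumed forms)."""
    weights = {}
    for wi, wj in combinations(words, 2):
        # Assumed form: relevance decays exponentially with dependency path length.
        dep = 1.0 / (b ** path_len[frozenset((wi, wj))])
        d = math.dist(vectors[wi], vectors[wj])  # Euclidean distance
        if d == 0:
            continue  # identical vectors: gravity is undefined, skip the pair
        # Assumed form: Newtonian-style attraction between TF-IDF "masses".
        grav = tfidf[wi] * tfidf[wj] / (d * d)
        weights[(wi, wj)] = weights[(wj, wi)] = dep * grav
    return weights


def top_keywords(words, weights, t=2, eta=0.85, iterations=50):
    """Iterate a TextRank-style weighted score and return the t best words."""
    score = {w: 1.0 for w in words}
    # Total outgoing weight of each word, used to normalise its contribution.
    out_sum = {w: sum(weights.get((w, k), 0.0) for k in words) for w in words}
    for _ in range(iterations):
        new_score = {}
        for wi in words:
            rank = sum(
                weights.get((wj, wi), 0.0) / out_sum[wj] * score[wj]
                for wj in words
                if out_sum[wj] > 0
            )
            new_score[wi] = (1 - eta) + eta * rank
        score = new_score
    return sorted(words, key=lambda w: score[w], reverse=True)[:t]


if __name__ == "__main__":
    words = ["loan", "contract", "scan"]
    tfidf = {"loan": 0.6, "contract": 0.8, "scan": 0.2}
    vectors = {"loan": (1.0, 0.0), "contract": (0.0, 1.0), "scan": (3.0, 3.0)}
    path_len = {
        frozenset(("loan", "contract")): 1,
        frozenset(("loan", "scan")): 2,
        frozenset(("contract", "scan")): 2,
    }
    weights = association_weights(words, tfidf, vectors, path_len)
    print(top_keywords(words, weights, t=2))
```

A fixed iteration count keeps the sketch short; a convergence check on the score change would be the usual alternative.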
Further, the invention converts the keywords into word vectors with the one-hot representation algorithm. One-hot representation is a basic method of representing words as vectors and is similar in concept to the bag-of-words model: a dictionary is built from all the words in a corpus, and each word in the dictionary is represented by a word vector whose dimension equals the size of the dictionary. In that vector, the dimension corresponding to the current word has the value 1 and all other dimensions have the value 0. Accordingly, the dimensions corresponding to the keywords of the service description are set to 1 and the dimensions of the other words are set to 0, which converts the keywords into a word vector representation.
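A minimal sketch of this encoding is shown below. The helper names (`build_vocabulary`, `one_hot`, `keywords_vector`) are hypothetical, and representing a whole service description as a single vector with a 1 for each of its keywords is one interpretation of the paragraph above, not a confirmed detail.

```python
def build_vocabulary(words):
    """Map each distinct word to a fixed index, in first-seen order."""
    vocab = {}
    for w in words:
        vocab.setdefault(w, len(vocab))
    return vocab


def one_hot(word, vocab):
    """Vector of dictionary size with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[vocab[word]] = 1
    return vec


def keywords_vector(keywords, vocab):
    """Set the dimension of every known keyword to 1 and leave the rest at 0."""
    vec = [0] * len(vocab)
    for w in keywords:
        if w in vocab:  # words outside the dictionary are ignored
            vec[vocab[w]] = 1
    return vec


if __name__ == "__main__":
    vocab = build_vocabulary(["loan", "contract", "scan", "loan"])
    print(one_hot("contract", vocab))
    print(keywords_vector(["loan", "scan"], vocab))
```

Encoding the query content against the same dictionary yields vectors of equal dimension, which is what a vector similarity measure requires.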
S3, receiving query contents input by a user, and calculating the similarity between the query contents and the word vectors.
In the preferred embodiment of the present invention, the similarity between the query content and the word vector is calculated with cosine similarity. Cosine similarity measures the difference between two individuals by the cosine of the angle between their vectors in a vector space: the closer the cosine is to 1, the closer the angle between the two vectors is to 0 degrees, and the more similar the two vectors are. The cosine similarity is calculated as:
$$\cos\theta = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}$$
wherein X represents the word vector and Y represents the query content. The cosine ranges from -1 to 1. A value of -1 means that the query content and the word vector point in exactly opposite directions, and the similarity is taken as 0. A value of 1 means that they point in exactly the same direction, and the similarity is 100%. A value of 0 means that they are independent of each other, and the similarity is taken as moderate. The similarity between the query content and the word vector is obtained from the cosine value.
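The cosine measure described above can be sketched in a few lines of Python. The function name is hypothetical, and returning 0.0 for an all-zero vector is an assumption made so that the function is always defined; the text does not cover that case.

```python
import math


def cosine_similarity(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same dimension")
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    if norm_x == 0 or norm_y == 0:
        return 0.0  # assumption: a zero vector is treated as unrelated
    return dot / (norm_x * norm_y)


if __name__ == "__main__":
    description = [1, 0, 1]  # word vector of a service description
    query = [1, 0, 0]        # query content encoded against the same dictionary
    print(cosine_similarity(description, query))
```

With one-hot style vectors every component is non-negative, so the value stays between 0 and 1 in practice and the -1 case does not arise.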
And S4, selecting corresponding service description according to the similarity, inquiring the collected files from the cloud storage in a multi-strategy retrieval mode, and returning the inquiry result to the user.
In the preferred embodiment of the present invention, the multi-strategy retrieval includes the Levenshtein distance (LD). When matching the query content input by the user, the query content is compared with the service descriptions of the collection files in the cloud storage using the similarity calculation method above. If a collection file matches, it is returned to the user directly. If not, the similarity between the query content and the keywords of the service descriptions of the collection files is calculated, with a preset threshold of 0.8, and the collection files whose service descriptions have a similarity greater than the preset threshold are returned to the user as the query result.
Further, when no similarity result is greater than the preset threshold, the similarity between the query content input by the user and the character strings in the service descriptions of the collection files is calculated with the LD. In detail, the original character string in the query content input by the user is denoted m, and the target character string of the service description of a collection file is denoted n. The number of edits L (deletion, insertion and replacement operations) required to convert m into n is recorded, and the L of the two character strings m and n is written as $lev_{m,n}(|m|, |n|)$, where $|m|$ and $|n|$ are the lengths of m and n respectively. The larger L is, the lower the similarity of the character strings, so the collection file with the minimum L value is selected as the query result and returned to the user.
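The retrieval cascade described above can be sketched as follows: a direct match first, then a similarity threshold of 0.8, then the minimum Levenshtein distance as the fallback. The function names, the use of string equality for the direct match, and passing the similarity measure in as a callable are assumptions for illustration.

```python
def levenshtein(m, n):
    """Number of insertions, deletions and replacements needed to turn m into n."""
    prev = list(range(len(n) + 1))  # distances from the empty prefix of m
    for i, cm in enumerate(m, start=1):
        curr = [i]
        for j, cn in enumerate(n, start=1):
            cost = 0 if cm == cn else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # replacement or match
        prev = curr
    return prev[-1]


def retrieve(query, descriptions, similarity, threshold=0.8):
    """descriptions maps a file id to its service description string."""
    if not descriptions:
        return []
    # Strategy 1: direct match (assumed here to mean string equality).
    exact = [f for f, d in descriptions.items() if d == query]
    if exact:
        return exact
    # Strategy 2: every file whose similarity exceeds the preset threshold.
    similar = [f for f, d in descriptions.items() if similarity(query, d) > threshold]
    if similar:
        return similar
    # Strategy 3: fall back to the file with the smallest edit distance L.
    best = min(descriptions, key=lambda f: levenshtein(query, descriptions[f]))
    return [best]


if __name__ == "__main__":
    files = {"f1": "loan contract 2019", "f2": "travel photos"}
    no_match = lambda q, d: 0.0  # placeholder similarity that never passes
    print(retrieve("loan contrct 2019", files, no_match))
```

Ordering the strategies from cheapest and strictest to most tolerant means the edit-distance pass, which compares the query with every description, only runs when the earlier strategies find nothing.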
The invention also provides a file query apparatus. Fig. 2 is a schematic diagram of the internal structure of a file query apparatus according to an embodiment of the present invention.
In this embodiment, the file query apparatus 1 may be a PC (Personal Computer), a terminal device such as a smart phone, a tablet computer or a mobile computer, or a server. The file query apparatus 1 comprises at least a memory 11, a processor 12, a communication bus 13, and a network interface 14.
The memory 11 includes at least one type of readable storage medium, including flash memory, a hard disk, a multimedia card, card-type memory (e.g., SD or DX memory), magnetic memory, a magnetic disk, an optical disk, and the like. In some embodiments the memory 11 may be an internal storage unit of the file query apparatus 1, such as its hard disk. In other embodiments the memory 11 may be an external storage device of the file query apparatus 1, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card provided on the file query apparatus 1. Further, the memory 11 may include both an internal storage unit and an external storage device of the file query apparatus 1. The memory 11 may be used not only to store application software installed in the file query apparatus 1 and various types of data, such as the code of the file query program 01, but also to temporarily store data that has been output or is to be output.
In some embodiments the processor 12 may be a Central Processing Unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip, and is used to execute program code stored in the memory 11 or to process data, for example to execute the file query program 01.
The communication bus 13 is used to realize connection communication between these components.
The network interface 14 may optionally include a standard wired interface and a wireless interface (e.g., a Wi-Fi interface), and is typically used to establish a communication link between the apparatus 1 and other electronic devices.
Optionally, the apparatus 1 may further comprise a user interface, which may comprise a display and an input unit such as a keyboard, and optionally a standard wired interface and a wireless interface. In some embodiments, the display may be an LED display, a liquid crystal display, a touch-sensitive liquid crystal display, an OLED (Organic Light-Emitting Diode) touch device, or the like. The display, which may also be referred to as a display screen or display unit, displays information processed in the file query apparatus 1 and a visualized user interface.
Fig. 2 shows only the file query apparatus 1 with the components 11 to 14 and the file query program 01. Those skilled in the art will understand that the structure shown in Fig. 2 does not limit the file query apparatus 1, which may comprise fewer or more components than those shown, combine some components, or arrange the components differently.
In the embodiment of the apparatus 1 shown in fig. 2, a file query program 01 is stored in the memory 11; the following steps are implemented when the processor 12 executes the file query program 01 stored in the memory 11:
Step one, acquiring a collection file set of a client, creating a service description of the collection file set in a file system, and storing the collection file set, once its service description has been created, in cloud storage.
In the preferred embodiment of the present invention, the client, also called the user side, is a program that corresponds to the server and provides local services to the user. The collection file set of the client is obtained in one of two ways: first, by traversing and retrieving from a local disk of the client; second, by searching with a search engine, using keywords, according to the requirements of the user.
Cloud storage is a mode of online storage in which data is stored on multiple virtual servers, usually hosted by third parties, rather than on dedicated servers.
Preferably, the file system in the present invention is the Hadoop Distributed File System (HDFS). HDFS is highly fault-tolerant and can be deployed on low-cost hardware. It also relaxes some Portable Operating System Interface (POSIX) requirements so that file data can be accessed in a streaming manner, which provides high-throughput access to application data and suits applications with large data sets.
In detail, HDFS consists of one NameNode (master node) and n DataNodes (slave nodes). The NameNode is the master server that manages the file namespace and client access to files, and the DataNodes manage file storage. In the preferred embodiment of the present invention, the service description of the collection file set is created in the master node of the HDFS.
Further, the service description refers to a brief summary of the content of the collection file set, and may also be represented as a name of the collection file set.
And step two, extracting keywords from the service description through a keyword extraction algorithm to obtain the keywords of the service description, converting the keywords into word vectors, and storing the word vectors.
In a preferred embodiment of the present invention, extracting keywords from the service description with a keyword extraction algorithm includes:
performing a word segmentation operation on the service description;
calculating the dependency relevance of any two words $W_i$ and $W_j$ in the service description:
[Formula image BDA0002189592880000101]
wherein $Dep(W_i, W_j)$ represents the dependency relevance of the words $W_i$ and $W_j$, $len(W_i, W_j)$ represents the length of the dependency path between the words $W_i$ and $W_j$, and $b$ is a hyper-parameter;
calculating the gravity of the words $W_i$ and $W_j$,
wherein $f_{grav}(W_i, W_j)$ represents the gravity of the words $W_i$ and $W_j$, $tfidf(W_i)$ represents the TF-IDF value of the word $W_i$, $tfidf(W_j)$ represents the TF-IDF value of the word $W_j$, TF represents the term frequency, IDF represents the inverse document frequency, and $d$ is the Euclidean distance between the word vectors of the words $W_i$ and $W_j$;
obtaining the association strength between the words $W_i$ and $W_j$ from the calculated dependency relevance and gravity:
$$weight(W_i, W_j) = Dep(W_i, W_j) \cdot f_{grav}(W_i, W_j)$$
calculating the importance score of the word $W_j$ in combination with the association strength:
[Formula image BDA0002189592880000103]
wherein the set symbol in the formula denotes the set related to the vertex $W_i$, and $\eta$ is a damping coefficient;
preferably, the invention selects, according to the importance scores of the words, the t words with the highest scores as the keywords of the service description.
Further, the invention converts the keywords into word vectors with the one-hot representation algorithm. One-hot representation is a basic method of representing words as vectors and is similar in concept to the bag-of-words model: a dictionary is built from all the words in a corpus, and each word in the dictionary is represented by a word vector whose dimension equals the size of the dictionary. In that vector, the dimension corresponding to the current word has the value 1 and all other dimensions have the value 0. Accordingly, the dimensions corresponding to the keywords of the service description are set to 1 and the dimensions of the other words are set to 0, which converts the keywords into a word vector representation.
And step three, receiving query contents input by a user, and calculating the similarity between the query contents and the word vectors.
In the preferred embodiment of the present invention, the similarity between the query content and the word vector is calculated with cosine similarity. Cosine similarity measures the difference between two individuals by the cosine of the angle between their vectors in a vector space: the closer the cosine is to 1, the closer the angle between the two vectors is to 0 degrees, and the more similar the two vectors are. The cosine similarity is calculated as:
$$\cos\theta = \frac{X \cdot Y}{\lVert X \rVert \, \lVert Y \rVert}$$
wherein X represents the word vector and Y represents the query content. The cosine ranges from -1 to 1. A value of -1 means that the query content and the word vector point in exactly opposite directions, and the similarity is taken as 0. A value of 1 means that they point in exactly the same direction, and the similarity is 100%. A value of 0 means that they are independent of each other, and the similarity is taken as moderate. The similarity between the query content and the word vector is obtained from the cosine value.
And step four, selecting corresponding service description according to the similarity, inquiring the collected files from the cloud storage in a multi-strategy retrieval mode, and returning an inquiry result to the user.
In the preferred embodiment of the present invention, the multi-strategy retrieval includes the Levenshtein distance (LD). When matching the query content input by the user, the query content is compared with the service descriptions of the collection files in the cloud storage using the similarity calculation method above. If a collection file matches, it is returned to the user directly. If not, the similarity between the query content and the keywords of the service descriptions of the collection files is calculated, with a preset threshold of 0.8, and the collection files whose service descriptions have a similarity greater than the preset threshold are returned to the user as the query result.
Further, when no similarity result is greater than the preset threshold, the similarity between the query content input by the user and the character strings in the service descriptions of the collection files is calculated with the LD. In detail, the original character string in the query content input by the user is denoted m, and the target character string of the service description of a collection file is denoted n. The number of edits L (deletion, insertion and replacement operations) required to convert m into n is recorded, and the L of the two character strings m and n is written as $lev_{m,n}(|m|, |n|)$, where $|m|$ and $|n|$ are the lengths of m and n respectively. The larger L is, the lower the similarity of the character strings, so the collection file with the minimum L value is selected as the query result and returned to the user.
Alternatively, in other embodiments, the file query program may be divided into one or more modules, and the one or more modules are stored in the memory 11 and executed by one or more processors (in this embodiment, the processor 12) to implement the present invention.
For example, referring to FIG. 3, which is a schematic diagram of the program modules of the file query program in an embodiment of the file query apparatus of the present invention, the file query program may be divided into a service description creation module 10, a keyword extraction module 20, a similarity calculation module 30, and a query module 40. For example:
The service description creation module 10 is configured to: acquire a collection file set of a client, create a service description of the collection file set in a file system, and store the collection file set, after the service description is created, into a cloud storage.
The keyword extraction module 20 is configured to: extract keywords from the service description through a keyword extraction algorithm to obtain the keywords of the service description, convert the keywords into word vectors, and store the word vectors.
The similarity calculation module 30 is configured to: receive query content input by a user, and calculate the similarity between the query content and the word vectors.
The query module 40 is configured to: select the corresponding service description according to the similarity, query the collection files in the cloud storage through a multi-strategy retrieval mode, and return a query result to the user.
The functions or operation steps implemented when the program modules such as the service description creation module 10, the keyword extraction module 20, the similarity calculation module 30, and the query module 40 are executed are substantially the same as those in the above embodiments, and are not described herein again.
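The division into four modules might be wired together as in the following sketch. Everything here is a simplified stand-in, labeled as an assumption: the class and method names are hypothetical, keyword extraction is replaced by a plain word-frequency count, and word vectors are replaced by bag-of-words counts, so the sketch shows only how the modules hand data to one another, not the algorithms of the embodiment.

```python
import math
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


def _cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity of two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a)  # Counter yields 0 for missing keys
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


@dataclass
class FileQueryProgram:
    cloud_storage: Dict[str, str] = field(default_factory=dict)  # file -> description
    vectors: Dict[str, Counter] = field(default_factory=dict)    # file -> keyword vector

    def create_service_descriptions(self, files: Dict[str, str]) -> None:
        """Module 10 (stand-in): store each file with its service description."""
        self.cloud_storage.update(files)

    def extract_keywords(self, top_t: int = 3) -> None:
        """Module 20 (stand-in): keep the top_t most frequent words per description."""
        for name, desc in self.cloud_storage.items():
            counts = Counter(desc.lower().split())
            self.vectors[name] = Counter(dict(counts.most_common(top_t)))

    def similarity(self, query: str) -> Dict[str, float]:
        """Module 30: similarity between the query and every stored vector."""
        q = Counter(query.lower().split())
        return {name: _cosine(q, vec) for name, vec in self.vectors.items()}

    def query(self, text: str, threshold: float = 0.8) -> List[str]:
        """Module 40: files whose similarity exceeds the threshold, best first."""
        scores = self.similarity(text)
        hits = [name for name, score in scores.items() if score > threshold]
        return sorted(hits, key=lambda name: scores[name], reverse=True)
```

A caller would run the methods in module order: create the descriptions, extract the keywords, then issue queries.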
Furthermore, an embodiment of the present invention further provides a computer-readable storage medium, where a file query program is stored on the computer-readable storage medium, where the file query program is executable by one or more processors to implement the following operations:
acquiring a collection file set of a client, creating service description of the collection file set in a file system, and storing the collection file set after the service description is created into a cloud storage;
extracting keywords from the service description through a keyword extraction algorithm to obtain keywords of the service description, converting the keywords into word vectors, and storing the word vectors;
receiving query content input by a user, and calculating the similarity between the query content and the word vector;
and selecting the corresponding service description according to the similarity, querying the collection files in the cloud storage through a multi-strategy retrieval mode, and returning a query result to the user.
The embodiment of the computer-readable storage medium of the present invention is substantially the same as the embodiments of the file querying device and method, and will not be described herein again.
It should be noted that the serial numbers of the above embodiments of the present invention are for description only and do not represent the merits of the embodiments. The terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, apparatus, article, or method that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, apparatus, article, or method. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, apparatus, article, or method that includes the element.
Through the above description of the embodiments, those skilled in the art will clearly understand that the method of the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but in many cases, the former is a better implementation manner. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) as described above and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.
The above description is only a preferred embodiment of the present invention, and not intended to limit the scope of the present invention, and all modifications of equivalent structures and equivalent processes, which are made by using the contents of the present specification and the accompanying drawings, or directly or indirectly applied to other related technical fields, are included in the scope of the present invention.

Claims (10)

1. A file query method, the method comprising:
acquiring a collection file set of a client, creating service description of the collection file set in a file system, and storing the collection file set after the service description is created into a cloud storage;
extracting keywords from the service description through a keyword extraction algorithm to obtain keywords of the service description, converting the keywords into word vectors, and storing the word vectors;
receiving query content input by a user, and calculating the similarity between the query content and the word vector;
and selecting the corresponding service description according to the similarity, querying the collection files in the cloud storage through a multi-strategy retrieval mode, and returning a query result to the user.
2. The file query method of claim 1, wherein the acquiring of the collection file set of the client comprises:
traversing and retrieving from a local disk of the client to obtain the collection file set; or
searching for the collection file set in a search engine by using keywords according to the requirements of the user.
3. The file query method of claim 1, wherein the extracting of keywords from the service description through the keyword extraction algorithm comprises:
performing a word segmentation operation on the service description;
calculating the dependency relevance of any two words $W_i$ and $W_j$ in the service description:
wherein $\mathrm{Dep}(W_i, W_j)$ represents the dependency relevance of the words $W_i$ and $W_j$, $\mathrm{len}(W_i, W_j)$ represents the length of the dependency path between the words $W_i$ and $W_j$, and b is a hyper-parameter;
calculating the gravity between the words $W_i$ and $W_j$:
Figure FDA0002189592870000012
wherein $f_{grav}(W_i, W_j)$ represents the gravity between the words $W_i$ and $W_j$, $\mathrm{tfidf}(W_i)$ represents the TF-IDF value of the word $W_i$, $\mathrm{tfidf}(W_j)$ represents the TF-IDF value of the word $W_j$, TF represents the term frequency, IDF represents the inverse document frequency, and d is the Euclidean distance between the word vectors of the words $W_i$ and $W_j$;
obtaining the association strength between the words $W_i$ and $W_j$ according to the calculated dependency relevance and gravity:
$$\mathrm{weight}(W_i, W_j) = \mathrm{Dep}(W_i, W_j) \cdot f_{grav}(W_i, W_j)$$
calculating the importance score of the word $W_i$ in combination with the association strength:
Figure FDA0002189592870000021
wherein
Figure FDA0002189592870000022
is the set of vertices related to the vertex $W_i$, and η is a damping coefficient;
selecting, according to the importance scores of the words $W_i$, the t words with the highest scores as the keywords of the service description.
4. The file query method of claim 1, wherein the similarity between the query content and the word vector is calculated by the formula:
$$\cos\theta = \frac{X \cdot Y}{\|X\|\,\|Y\|}$$
wherein X represents the word vector and Y represents the query content.
5. The file query method according to any one of claims 1 to 4, wherein the querying of the collection files in the cloud storage through a multi-strategy retrieval mode comprises:
presetting an original character string m in query content input by the user and a service description target character string n of the collection file;
recording the editing times L of deletion, insertion and replacement operations required by the conversion of the original character string m into the target character string n;
and selecting the corresponding collection file with the minimum L value as a query result, and returning the query result to the user.
6. A file query apparatus, comprising a memory and a processor, wherein the memory stores a file query program operable on the processor, and the file query program, when executed by the processor, implements the steps of:
acquiring a collection file set of a client, creating service description of the collection file set in a file system, and storing the collection file set after the service description is created into a cloud storage;
extracting keywords from the service description through a keyword extraction algorithm to obtain keywords of the service description, converting the keywords into word vectors, and storing the word vectors;
receiving query content input by a user, and calculating the similarity between the query content and the word vector;
and selecting the corresponding service description according to the similarity, querying the collection files in the cloud storage through a multi-strategy retrieval mode, and returning a query result to the user.
7. The file query apparatus of claim 6, wherein the acquiring of the collection file set of the client comprises:
traversing and retrieving from a local disk of the client to obtain the collection file set; or
searching for the collection file set in a search engine by using keywords according to the requirements of the user.
8. The file query apparatus of claim 6, wherein the extracting of keywords from the service description through the keyword extraction algorithm comprises:
performing a word segmentation operation on the service description;
calculating the dependency relevance of any two words $W_i$ and $W_j$ in the service description:
Figure FDA0002189592870000031
wherein $\mathrm{Dep}(W_i, W_j)$ represents the dependency relevance of the words $W_i$ and $W_j$, $\mathrm{len}(W_i, W_j)$ represents the length of the dependency path between the words $W_i$ and $W_j$, and b is a hyper-parameter;
calculating the gravity between the words $W_i$ and $W_j$:
Figure FDA0002189592870000032
wherein $f_{grav}(W_i, W_j)$ represents the gravity between the words $W_i$ and $W_j$, $\mathrm{tfidf}(W_i)$ represents the TF-IDF value of the word $W_i$, $\mathrm{tfidf}(W_j)$ represents the TF-IDF value of the word $W_j$, TF represents the term frequency, IDF represents the inverse document frequency, and d is the Euclidean distance between the word vectors of the words $W_i$ and $W_j$;
obtaining the association strength between the words $W_i$ and $W_j$ according to the calculated dependency relevance and gravity:
$$\mathrm{weight}(W_i, W_j) = \mathrm{Dep}(W_i, W_j) \cdot f_{grav}(W_i, W_j)$$
calculating the importance score of the word $W_i$ in combination with the association strength:
Figure FDA0002189592870000033
wherein
Figure FDA0002189592870000034
is the set of vertices related to the vertex $W_i$, and η is a damping coefficient;
selecting, according to the importance scores of the words $W_i$, the t words with the highest scores as the keywords of the service description.
9. The file query apparatus of claim 6, wherein the similarity between the query content and the word vector is calculated by the formula:
$$\cos\theta = \frac{X \cdot Y}{\|X\|\,\|Y\|}$$
wherein X represents the word vector and Y represents the query content.
10. A computer-readable storage medium having stored thereon a file query program executable by one or more processors to perform the steps of the file query method as claimed in any one of claims 1 to 5.
CN201910829794.5A 2019-09-03 2019-09-03 File query method and device and computer readable storage medium Pending CN110674087A (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910829794.5A CN110674087A (en) 2019-09-03 2019-09-03 File query method and device and computer readable storage medium
PCT/CN2020/112336 WO2021043088A1 (en) 2019-09-03 2020-08-30 File query method and device, and computer device and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910829794.5A CN110674087A (en) 2019-09-03 2019-09-03 File query method and device and computer readable storage medium

Publications (1)

Publication Number Publication Date
CN110674087A true CN110674087A (en) 2020-01-10

Family

ID=69076316

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910829794.5A Pending CN110674087A (en) 2019-09-03 2019-09-03 File query method and device and computer readable storage medium

Country Status (2)

Country Link
CN (1) CN110674087A (en)
WO (1) WO2021043088A1 (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111753526A (en) * 2020-06-18 2020-10-09 北京无忧创想信息技术有限公司 Similar competitive product data analysis method and system
WO2021043088A1 (en) * 2019-09-03 2021-03-11 平安科技(深圳)有限公司 File query method and device, and computer device and storage medium
CN113806619A (en) * 2021-08-19 2021-12-17 广州云硕科技发展有限公司 Semantic analysis system and semantic analysis method

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20090157652A1 (en) * 2007-12-18 2009-06-18 Luciano Barbosa Method and system for quantifying the quality of search results based on cohesion
CN102855252A (en) * 2011-06-30 2013-01-02 北京百度网讯科技有限公司 Method and device for data retrieval based on demands
CN103198136A (en) * 2013-04-15 2013-07-10 天津理工大学 Sequence-association-based query method for personal computer files
CN103577416A (en) * 2012-07-20 2014-02-12 阿里巴巴集团控股有限公司 Query expansion method and system
CN108170739A (en) * 2017-12-18 2018-06-15 深圳前海微众银行股份有限公司 Problem matching process, terminal and computer readable storage medium
CN108536708A (en) * 2017-03-03 2018-09-14 腾讯科技(深圳)有限公司 A kind of automatic question answering processing method and automatically request-answering system
CN109857841A (en) * 2018-12-05 2019-06-07 厦门快商通信息技术有限公司 A kind of FAQ question sentence Text similarity computing method and system
CN110032632A (en) * 2019-04-04 2019-07-19 平安科技(深圳)有限公司 Intelligent customer service answering method, device and storage medium based on text similarity

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105631009A (en) * 2015-12-25 2016-06-01 广州视源电子科技股份有限公司 Word vector similarity based retrieval method and system
CN108804409A (en) * 2017-04-28 2018-11-13 西安科技大市场创新云服务股份有限公司 A kind of semantic retrieving method and device
CN110674087A (en) * 2019-09-03 2020-01-10 平安科技(深圳)有限公司 File query method and device and computer readable storage medium

Also Published As

Publication number Publication date
WO2021043088A1 (en) 2021-03-11

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination