CN111309970A - Data retrieval method and device, electronic equipment and storage medium

Info

Publication number: CN111309970A
Application number: CN202010232032.XA
Authority: CN
Inventors: 王雪锋; 袁玮玮
Original assignee: Beijing QIYI Century Science and Technology Co Ltd
Current assignee: Beijing QIYI Century Science and Technology Co Ltd
Priority date: 2020-03-27
Filing date: 2020-03-27
Publication date: 2020-06-19

Abstract

The application provides a data retrieval method and device, electronic equipment and a storage medium, and belongs to the technical field of data processing. The method comprises: receiving a data retrieval request that carries a plurality of keywords to be matched; constructing a keyword dictionary tree from the keywords to be matched; for each piece of video related data to be matched, traversing the keyword dictionary tree with that video related data so as to match it against the keywords to be matched; and generating a retrieval result from the video related data that matched keywords. The technical scheme provided by the application can improve the accuracy of data retrieval.

Description

Data retrieval method and device, electronic equipment and storage medium

Technical Field

The present application relates to the field of computer technologies, and in particular, to a data retrieval method and apparatus, an electronic device, and a storage medium.

Background

In order to purify the network environment, all video related data stored in the database of the video website need to be checked, that is, a server of the video website searches video related data matched with a preset keyword from all video related data stored in the database, and then the searched video related data are manually rechecked, and the video related data which do not pass the manual rechecking are deleted.

In the related art, to facilitate data retrieval, the server may store video related data in word-segmented form. For example, when storing the video name "sponge baby is cooking", the server may perform word segmentation on the name to obtain the segmented words "sponge baby", "is" and "cooking", and then store those segmented words. During a subsequent review and check, the server can, for each piece of video related data, match each segmented word contained in the video related data against each keyword to be matched. If the matching succeeds, the server takes the video related data as video related data requiring manual review.

However, different word segmentation methods produce different segmented words for the same video related data. A keyword contained in a piece of video related data may therefore be split across several segmented words during word segmentation, so that the server cannot find the keyword among the segmented words of that video related data during the review and check, and the accuracy of data retrieval is low.
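The failure mode described above can be illustrated with a small sketch. The text, the two segmentations, and the function name `token_match` are hypothetical and chosen only for illustration; they are not taken from the application.

```python
# Hypothetical illustration: one text, two possible word segmentations.
def token_match(tokens, keyword):
    """Related-art style matching: the keyword must equal a stored token."""
    return keyword in tokens

text = "spongebabyiscooking"
keyword = "spongebaby"

segmentation_a = ["spongebaby", "is", "cooking"]       # keyword kept whole
segmentation_b = ["sponge", "baby", "is", "cooking"]   # keyword split in two

print(token_match(segmentation_a, keyword))  # True
print(token_match(segmentation_b, keyword))  # False: the keyword was split
print(keyword in text)                       # True: character-level match still works
```

Matching on the unsegmented character sequence, as the method of this application does, does not depend on how a segmenter happens to split the text.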

Disclosure of Invention

An embodiment of the present application provides a data retrieval method, an apparatus, an electronic device, and a storage medium, so as to improve accuracy of data retrieval. The specific technical scheme is as follows:

In a first aspect of the present application, there is provided a data retrieval method, the method comprising:

receiving a data retrieval request, wherein the data retrieval request carries a plurality of keywords to be matched;

constructing a keyword dictionary tree according to the keywords to be matched;

for video related data to be matched, traversing the keyword dictionary tree with the video related data so as to match the video related data with the keywords to be matched;

and generating a retrieval result according to the video related data matched with the keywords.

In a second aspect of the present application, there is provided a data retrieval apparatus, the apparatus comprising:

a receiving module, configured to receive a data retrieval request, wherein the data retrieval request carries a plurality of keywords to be matched;

a building module, configured to construct a keyword dictionary tree according to the keywords to be matched;

a matching module, configured to, for video related data to be matched, traverse the keyword dictionary tree with the video related data so as to match the video related data with the keywords to be matched;

and a generating module, configured to generate a retrieval result according to the video related data matched with the keywords.

In a third aspect of the present application, an electronic device is provided, which includes a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with each other through the communication bus;

the memory is configured to store a computer program;

the processor is configured to perform the method steps of any of the first aspect when executing the program stored in the memory.

In a fourth aspect of the present application, there is provided a computer-readable storage medium having a computer program stored thereon, wherein the program is adapted to perform the method steps of any of the first aspects when executed by a processor.

In a fifth aspect of this embodiment, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the method steps of any of the first aspects described above.

With the data retrieval method and device, the electronic equipment and the storage medium provided by the embodiments of the present application, a data retrieval request carrying a plurality of keywords to be matched can be received; a keyword dictionary tree is constructed from the plurality of keywords to be matched; for video related data to be matched, the keyword dictionary tree is traversed with the video related data so as to match the video related data against the plurality of keywords to be matched; and a retrieval result is generated from the video related data that matched keywords.

Because the keyword dictionary tree is traversed with the video related data itself, the keywords contained in the video related data can be determined without performing word segmentation on the video related data. This avoids the problem that a keyword cannot be matched when it has been split into different segmented words, and so the accuracy of data retrieval can be improved.

Of course, not all advantages described above need to be achieved at the same time in the practice of any one product or method of the present application.

Drawings

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below.

Fig. 1 is a flowchart of a data retrieval method according to an embodiment of the present application;

Fig. 2a is an exemplary diagram of a keyword dictionary tree according to an embodiment of the present application;

Fig. 2b is an exemplary diagram of another keyword dictionary tree provided in the embodiments of the present application;

Fig. 3 is a flow chart of another data retrieval method provided by an embodiment of the present application;

Fig. 4 is a flow chart of another data retrieval method provided by an embodiment of the present application;

Fig. 5 is a diagram illustrating a data retrieval method according to an embodiment of the present application;

Fig. 6 is a schematic structural diagram of a data retrieval device according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of another data retrieval device according to an embodiment of the present application;

Fig. 8 is a schematic structural diagram of another data retrieval device according to an embodiment of the present application;

Fig. 9 is a schematic structural diagram of another data retrieval device according to an embodiment of the present application;

Fig. 10 is a schematic structural diagram of another data retrieval device according to an embodiment of the present application;

Fig. 11 is a schematic structural diagram of another data retrieval device according to an embodiment of the present application;

Fig. 12 is a schematic structural diagram of another data retrieval device according to an embodiment of the present application;

Fig. 13 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be described below with reference to the drawings in the embodiments of the present application.

The embodiment of the application provides a data retrieval method which can be applied to a server of a video website.

By adopting the data retrieval method provided by the embodiment of the application, the server can review and check all video related data stored in the database of the video website. That is, the server searches all the video related data stored in the database for video related data matching the keywords; the video related data found is then manually rechecked, and the video related data that does not pass the manual recheck is deleted. The video related data may be a video name, a picture name, a video profile, a bullet screen, a user name, and the like.

The data retrieval method provided in the embodiments of the present application is described in detail below with reference to specific embodiments. As shown in Fig. 1, the specific steps are as follows:

Step 101, receiving a data retrieval request.

The data retrieval request carries a plurality of keywords to be matched.

In implementation, an auditing system can be pre-installed on the control end used by an auditor. When a review and check is required, the auditor performs a preset operation on the control end, so that the control end generates a data retrieval request and sends the data retrieval request to the server. The preset operation may be inputting a plurality of keywords to be matched in a preset input box in the auditing system, or inputting a plurality of keywords to be matched together with attribute information of the video related data to be screened in a preset input box in the auditing system. The attribute information comprises at least one of data uploading time, data source, data format and data classification.

Therefore, the server can receive the data retrieval request carrying a plurality of keywords to be matched.

For example, the data retrieval request may carry 5 keywords to be matched, where the 5 keywords are respectively: sponge baby, do operation, love the sha princess, happy garden, birthday party.

Optionally, the auditor may obtain the keywords in a plurality of ways. For example, the auditor may select words from a pre-stored sensitive word library as the keywords to be matched. The sensitive word library contains words that are not suitable for being displayed online, the word attributes of the words, the sensitivity degree of the words, word attribute association conditions for screening the words, and the like. Alternatively, the auditor may use the search keywords of a current hotspot event as the keywords to be matched.

Step 102, constructing a keyword dictionary tree according to the plurality of keywords to be matched.

In the embodiment of the present application, the specific processing procedure for constructing a keyword dictionary tree according to a plurality of keywords to be matched is described by taking the keywords how, he, and her as the plurality of keywords to be matched.

As shown in Fig. 2a, the server may create a root node "/". For the character h in the keyword how, the server may determine that there is no child node connected to the root node and matching the character, and then create a child node "h" connected to the root node. Similarly, for the character o in the keyword how, the server may determine that there is no child node connected to the node "h" and matching the character, and then create a child node "o" connected to the node "h"; for the character w in the keyword how, the server may determine that there is no child node connected to the node "o" and matching the character, and then create a child node "w" connected to the node "o". The server thereby obtains the tree structure shown in (1) in Fig. 2a.

Similarly, for the character h in the keyword he, the server may determine that there is a child node "h" connected to the root node and matching the character. Then, for the character e, the server may determine that there is no child node connected to the node "h" and matching the character, and create a child node "e" connected to the node "h". The server thereby obtains the tree structure shown in (2) in Fig. 2a.

Similarly, for the character h in the keyword her, the server may determine that there is a child node "h" connected to the root node and matching the character. For the character e, the server may determine that there is a child node "e" connected to the node "h" and matching the character. For the character r, the server may determine that there is no child node connected to the node "e" and matching the character, and create a child node "r" connected to the node "e". The server thereby obtains the tree structure shown in (3) in Fig. 2a, which is the keyword dictionary tree constructed from the keywords how, he, and her.
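The construction procedure above can be sketched as follows. This is a minimal illustration, not the application's implementation; the class name `TrieNode`, the function name `build_trie`, and the `keyword` field used to mark the end of a keyword are assumptions introduced here.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # character -> child TrieNode
        self.keyword = None   # set to the keyword that ends at this node, if any

def build_trie(keywords):
    """Insert each keyword character by character, reusing existing child nodes."""
    root = TrieNode()                      # corresponds to the root node "/"
    for word in keywords:
        node = root
        for ch in word:
            if ch not in node.children:    # no matching child node: create one
                node.children[ch] = TrieNode()
            node = node.children[ch]       # matching child node exists: descend
        node.keyword = word                # mark the end of the keyword
    return root

root = build_trie(["how", "he", "her"])
h = root.children["h"]
print(sorted(h.children))         # ['e', 'o']
print(h.children["e"].keyword)    # he
```

The three keywords share the single child node "h" of the root, which matches the structure described for (3) in Fig. 2a.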

Step 103, for video related data to be matched, traversing the keyword dictionary tree with the video related data so as to match the video related data against the plurality of keywords to be matched.

In an implementation, the server may determine the video related data to be matched among the video related data contained in the database. Then, for each piece of video related data to be matched, the server may use an AC automaton (Aho-Corasick automaton, a multi-pattern matching algorithm) to read characters from the video related data sequentially in their arrangement order and input the read characters into the keyword dictionary tree. Traversing the keyword dictionary tree with the video related data in this way yields a matching result for the video related data, thereby matching the video related data against the plurality of keywords to be matched. The matching result indicates either that no keyword was matched or which keywords were matched.

For example, the video name "sponge baby is in the happy garden" is the video related data to be matched. For this video related data, the server uses the AC automaton to read the characters of the video name into the keyword dictionary tree one by one, in order, and traverses the keyword dictionary tree with the video related data to obtain its matching result, namely that the matched keywords are "sponge baby" and "happy garden".

The embodiment of the application provides a process for matching video related data and a keyword dictionary tree by a server through an AC automaton, which comprises the following steps:

Step 1, constructing fail pointers based on the keyword dictionary tree, wherein the fail pointer of a node indicates the node from which matching continues when the input character matches none of the child nodes of the current node. The nodes pointed to by the fail pointers are shown by the arrows in Fig. 2b.

Step 2, sequentially acquiring the character to be matched according to the arrangement order of the characters in the video related data to be matched.

Step 3, searching the child nodes of the current node for a node matching the character.

If the search succeeds, the matched node is taken as the current node, and step 2 is executed again until all characters contained in the video related data to be matched have been taken out.

At the same time, it may be determined whether the matched node represents the end of a keyword. A node represents the end of a keyword when it corresponds to the last character of that keyword. For example, as shown in Fig. 2b, node "e" represents the end of the keyword he, node "w" represents the end of the keyword how, and node "r" represents the end of the keyword her.

If the matched node does not represent the end of a keyword, no keyword has been matched at that node.

If the matched node represents the end of a keyword, a keyword has been matched; the matched keyword consists of the characters represented by the nodes on the path from the root node to the matched node. The matched keyword may then be taken as a keyword contained in the video related data.

If the search fails, the node pointed to by the fail pointer of the current node is determined and taken as the current node, and step 2 is executed again until all characters contained in the video related data to be matched have been taken out.

For ease of understanding, the process by which the server matches video related data against the keyword dictionary tree through the AC automaton is described below, taking "yaher" as the video related data to be matched and the keyword dictionary tree shown in Fig. 2b as an example. The keyword dictionary tree is constructed from the keywords how, he, and her.

Through the AC automaton, the server sequentially acquires the characters to be matched according to their arrangement order in the video related data "yaher", and first obtains the character y. The current node is the root node, and the server searches the child nodes of the current node for a node matching the character y. Since the only child node of the root node is the node "h", the server determines that the search fails, and then determines the node pointed to by the fail pointer of the current node, which is the root node. The server then takes the root node as the current node, returns to step 2, and obtains the next character to be matched, namely the character a.

Similarly, since there is no node matching the character a in the child nodes of the current node, the server may use the root node as the current node, and return to execute step 2 to obtain the character to be matched, so as to obtain the character h.

For the character h, because the child nodes of the current node include a node matching the character, the server can take the node "h" as the current node. Since node "h" does not represent the end of a keyword, it can be determined that no keyword is matched. The server then returns to step 2 and obtains the next character to be matched, namely the character e.

For the character e, because the child nodes of the current node "h" include the node "e" matching the character, the server can take the node "e" as the current node. Since node "e" represents the end of the keyword he, the server can determine that the keyword he is matched. The server then returns to step 2 and obtains the next character to be matched, namely the character r.

Then, for the character r, since the child nodes of the current node "e" include a node matching the character, and the node "r" represents the end of the keyword her, the server can determine that the keyword her is matched.

Therefore, by reading the characters sequentially from the video related data "yaher" in their arrangement order through the AC automaton and inputting the read characters into the keyword dictionary tree, the server can determine that the matching result of the video related data "yaher" is: keyword he and keyword her.

In the embodiment of the application, the AC automaton is used for keyword matching, so the complexity of matching is independent of the number of keywords and depends only on the number of characters contained in the video related data, and all the keywords matched with the video related data can be obtained in a single matching pass.
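The walkthrough for "yaher" can be reproduced with the sketch below. It is a textbook Aho-Corasick automaton written for illustration under stated assumptions, not the application's code: the names `Node`, `build_automaton`, and `find_keywords` are hypothetical, the fail pointers are built breadth-first, and, as in the standard algorithm, the current character is retried from the node reached through the fail pointer. Each node also inherits the keywords of its fail node so that keywords ending inside a longer match are reported.

```python
from collections import deque

class Node:
    def __init__(self):
        self.children = {}   # character -> child Node
        self.fail = None     # node to fall back to when no child matches
        self.outputs = []    # keywords ending at this node or at its fail chain

def build_automaton(keywords):
    root = Node()
    for word in keywords:                  # build the keyword dictionary tree
        node = root
        for ch in word:
            node = node.children.setdefault(ch, Node())
        node.outputs.append(word)
    root.fail = root
    queue = deque()
    for child in root.children.values():   # depth-1 nodes fall back to the root
        child.fail = root
        queue.append(child)
    while queue:                           # breadth-first fail pointer construction
        node = queue.popleft()
        for ch, child in node.children.items():
            fallback = node.fail
            while fallback is not root and ch not in fallback.children:
                fallback = fallback.fail
            child.fail = fallback.children.get(ch, root)
            child.outputs += child.fail.outputs
            queue.append(child)
    return root

def find_keywords(text, root):
    """Read the text once, character by character, collecting matched keywords."""
    matches = []
    node = root
    for ch in text:
        while node is not root and ch not in node.children:
            node = node.fail               # search failed: follow the fail pointer
        node = node.children.get(ch, root)
        matches.extend(node.outputs)       # non-empty when a keyword ends here
    return matches

root = build_automaton(["how", "he", "her"])
print(find_keywords("yaher", root))   # ['he', 'her']
```

The text is scanned in a single pass regardless of how many keywords the tree holds, which is the property the embodiment relies on.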

In the embodiment of the present application, the server may determine the video related data to be matched in a plurality of ways. In one possible implementation, the server may take every piece of video related data in the database as video related data to be matched. In another possible implementation, the data retrieval request may carry an attribute information screening condition, and the server may take the video related data in the database that meets the attribute information screening condition as the video related data to be matched; the detailed processing procedure is described later.

Step 104, generating a retrieval result according to the video related data matched with the keywords.

In implementation, the manner in which the server generates the search result according to the video-related data matched to the keyword may be various.

In one possible implementation, the server may generate the search result directly based on the video related data matched to the keyword.

For example, the server may use the video related data matched with the keywords as the retrieval result; alternatively, the server may determine the data identifiers of the video related data matched with the keywords and generate a retrieval result containing the determined data identifiers. The retrieval result may be: the data identifiers of the video related data are A1, A3, B1, and B2.

Further, the server may generate the retrieval result based on both the video related data matched with the keywords and the matched keywords. The processing procedure includes:

For each piece of video related data matched with keywords, determining the keywords matched with the video related data to obtain target keywords, and generating a retrieval result containing the video related data matched with the keywords and the target keywords.

Because the server matches the video related data against the keyword dictionary tree through the AC automaton, the matching result obtained indicates either that no keyword was matched or which keywords were matched. Therefore, for each piece of video related data matched with keywords, the server can obtain the matching result of the video related data and take the keywords contained in the matching result as the target keywords. The server may then generate a retrieval result containing the video related data matched with the keywords and the target keywords. A retrieval result containing both the video related data and the matched keywords can thus be provided, which makes the subsequent manual review by the auditors more convenient.
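One way to assemble such a retrieval result is sketched below. The record layout, the field names `data_id` and `target_keywords`, and the function names are hypothetical; `match_keywords` is a naive substring stand-in for the AC automaton traversal so that the sketch stays self-contained.

```python
def match_keywords(text, keywords):
    # Stand-in for the AC automaton traversal: naive substring checks.
    return [kw for kw in keywords if kw in text]

def build_retrieval_result(records, keywords):
    """Return one entry per piece of data that matched at least one keyword."""
    result = []
    for data_id, text in records.items():
        target_keywords = match_keywords(text, keywords)
        if target_keywords:                      # skip data with no matched keyword
            result.append({"data_id": data_id,
                           "target_keywords": target_keywords})
    return result

# Hypothetical data identifiers and contents.
records = {"A1": "yaher", "A2": "zzz", "A3": "somehow"}
result = build_retrieval_result(records, ["how", "he", "her"])
print(result)
```

Keeping the target keywords next to each data identifier lets an auditor see at a glance why a piece of data was selected.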

In another possible implementation manner, the server may further filter the video related data matched with the keyword, generate a search result based on the video related data obtained by the filtering, and the detailed processing procedure will be described later.

In the embodiment of the application, the server can receive a data retrieval request carrying a plurality of keywords to be matched; construct a keyword dictionary tree from the plurality of keywords to be matched; for video related data to be matched, traverse the keyword dictionary tree with the video related data so as to match the video related data against the plurality of keywords to be matched; and generate a retrieval result from the video related data that matched keywords.

In the related art, when matching certain video related data with a plurality of keywords, the video related data is matched with each keyword, and therefore, the keywords matched with the video related data can be obtained only by performing matching for many times. Compared with the related technology, the method has the advantages that the AC automaton and the keyword dictionary tree are adopted for keyword matching, characters contained in video related data can be screened according to matching paths of character strings in the keyword dictionary tree, and the video related data can be matched with a plurality of keywords at the same time. Therefore, the embodiment of the application adopts the AC automaton, the matching of the video related data and the plurality of keywords can be realized in one matching process, and the keywords matched with the video related data are determined, so that the data retrieval efficiency can be improved.

Optionally, the data retrieval request may carry an attribute information screening condition, and the server may obtain the attribute information screening condition carried by the data retrieval request. The server may then screen the video related data in the database based on the attribute information screening condition to obtain the video related data to be matched. As shown in Fig. 3, the specific processing procedure includes:

Step 301, obtaining attribute information of each piece of video related data in the database.

In implementation, the database may store attribute information of each video-related data, where the attribute information includes at least one of data upload time, data source, data format, and data classification.

For a given piece of video related data, the server may obtain all attribute information of the video related data, or only the attribute information that corresponds to the attribute information screening condition.

Step 302, taking the video related data whose attribute information meets the attribute information screening condition as the video related data to be matched.

For example, the attribute information screening condition is: the data classification is movie, and the data source is user upload. The server may obtain the data classification and the data source of each piece of video related data in the database, and then determine the video related data whose data classification is movie and whose data source is user upload. The server may then take the determined video related data as the video related data to be matched.

In the embodiment of the application, the video related data of which the attribute information meets the attribute information screening condition is used as the video related data to be matched, so that the number of the video related data needing keyword matching can be reduced, the data processing load of a server can be reduced, and the data retrieval efficiency is improved.
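Steps 301 and 302 can be sketched as a simple equality filter. The record layout, the attribute names, and the function name `select_data_to_match` are assumptions made for illustration; the application does not specify how the attribute information is stored or compared.

```python
def select_data_to_match(database, screening_condition):
    """Keep the records whose attribute information meets every condition."""
    selected = []
    for record in database:
        attributes = record["attributes"]          # step 301: read attribute info
        if all(attributes.get(name) == value       # step 302: check the condition
               for name, value in screening_condition.items()):
            selected.append(record)
    return selected

# Hypothetical records and screening condition.
database = [
    {"id": "A1", "attributes": {"data_classification": "movie",
                                "data_source": "user upload"}},
    {"id": "A2", "attributes": {"data_classification": "movie",
                                "data_source": "partner"}},
    {"id": "B1", "attributes": {"data_classification": "cartoon",
                                "data_source": "user upload"}},
]
condition = {"data_classification": "movie", "data_source": "user upload"}
to_match = select_data_to_match(database, condition)
print([record["id"] for record in to_match])   # ['A1']
```

Only the selected records are then passed to keyword matching, which is where the reduction in processing load comes from.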

Optionally, the data retrieval request may also carry a keyword screening condition, and the server may also be locally preset with the keyword screening condition. The server can further screen the video related data matched with the keywords based on the keyword screening conditions to obtain a retrieval result, and the specific processing process comprises the following steps:

Step 1, for each piece of video related data matched with keywords, judging whether the keywords matched with the video related data meet the keyword screening condition.

In an implementation, for each video-related data matched to a keyword, the server may determine the keyword matched to the video-related data, and then, the server may determine whether the keyword matched to the video-related data satisfies a keyword screening condition.

If the keywords matched with the video related data meet the keyword screening condition, the server can execute the step 2; if the keyword matched with the video related data does not satisfy the keyword screening condition, the server may perform step 3.

In the embodiment of the present application, the keyword screening condition may be various, and the keyword screening condition may be related to the number of the keywords, or the keyword screening condition may be related to the word attributes of the keywords. For different keyword screening conditions, the processing procedures of the server for judging whether the keywords matched with the video related data meet the keyword screening conditions are different, and the detailed description will be given later on.

Step 2, generating a retrieval result according to the video related data.

In implementation, for the processing procedure in which the server generates the retrieval result according to the determined video related data, reference may be made to the procedure in step 104 in which the server generates the retrieval result directly based on the video related data matched with the keywords; details are not repeated here.

Step 3, performing no subsequent processing.

In the embodiment of the application, for each piece of video related data matched with keywords, the server can judge whether the keywords matched with the video related data meet the keyword screening condition. When the keywords matched with the video related data do not meet the keyword screening condition, no subsequent processing is performed. The server thus further screens the video related data matched with the keywords based on the keyword screening condition, and then generates the retrieval result based on the video related data meeting the keyword screening condition, so that the accuracy of data retrieval can be improved. In addition, personalized customization of data retrieval can be realized by setting different keyword screening conditions.

Optionally, the keyword screening condition may include a preset number threshold of the keywords, where the preset number threshold may be set by the auditor, for example, the preset number threshold may be 5. The embodiment of the present application provides an implementation manner in which a server determines whether a keyword matched with certain video related data satisfies a keyword screening condition, including:

Step one, judging whether the number of the keywords matched with the video related data reaches a preset number threshold.

The server may determine the number of keywords matching a certain video-related data, and then, the server may determine whether the number of keywords matching the video-related data reaches a preset number threshold.

If the number of the keywords matched with the video related data reaches a preset number threshold, the server can execute the step two; if the number of keywords matching the video related data does not reach the preset number threshold, the server may perform step three.

Step two, judging that the keywords matched with the video related data meet the keyword screening condition.

Step three, judging that the keywords matched with the video related data do not meet the keyword screening condition.

In the embodiment of the application, the video related data matched with the keywords is further screened based on a keyword screening condition related to the number of keywords, so that video related data that matches many of the keywords can be selected, and the accuracy of data retrieval can be improved. In addition, personalized customization of data retrieval can be realized by setting different keyword screening conditions.
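Steps one to three amount to a threshold test on the number of matched keywords, as in the sketch below. Counting distinct keywords rather than occurrences is an assumption made here, since the application does not say which is meant; the function names and sample data are also hypothetical.

```python
def meets_keyword_screening_condition(matched_keywords, preset_number_threshold):
    """Step one: compare the number of matched keywords with the threshold.

    Assumption: each distinct keyword is counted once.
    Returns True for step two (condition met), False for step three.
    """
    return len(set(matched_keywords)) >= preset_number_threshold

def screen_by_keyword_count(matching_results, preset_number_threshold):
    """Keep only the data whose matched keywords reach the threshold."""
    return {data_id: keywords
            for data_id, keywords in matching_results.items()
            if meets_keyword_screening_condition(keywords, preset_number_threshold)}

# Hypothetical matching results: data identifier -> matched keywords.
matching_results = {"A1": ["he", "her"], "A3": ["how"]}
screened = screen_by_keyword_count(matching_results, 2)
print(screened)   # {'A1': ['he', 'her']}
```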

Optionally, the keyword screening condition may be related to the word attributes of the keywords. The embodiment of the present application further provides an implementation in which the server judges whether the keywords matched with certain video related data meet the keyword screening condition. As shown in Fig. 4, the implementation includes the following steps:

Step 401, determining whether the number of keywords matched with certain video related data is at least two.

In an implementation, the server may determine the number of keywords that match certain video related data. Then, the server may determine whether the number of keywords matching the video-related data is at least two.

If the number of keywords matching the video related data is at least two, i.e., the video related data matches at least two keywords, the server may perform step 402; if the number of keywords matching the video related data is 1, i.e., the video related data matches 1 keyword, the server may perform step 405.

Step 402, obtaining word attributes of at least two keywords.

The word attributes comprise parts of speech and word classification categories. The part of speech may be a noun, a verb, etc., and the word classification category may be a word belonging to "medical means", a word belonging to "violent action", etc.

In an implementation, the server may query the term attributes of the at least two keywords in the pre-stored correspondence between terms and term attributes after determining that the video related data matches the at least two keywords.

In the embodiment of the application, the server can query the word attributes of the at least two keywords from a sensitive word library. The sensitive word library contains words that are not suitable for being displayed online, the word attributes of the words, the sensitivity degree of the words, the word attribute association conditions used for screening the words, and the like.

In another possible implementation manner, after receiving the data search request, the server may determine the term attributes of the keywords to be matched according to the correspondence between the terms and the term attributes. Then, the server may generate a keyword array including the keywords and word attributes of the keywords, and record a keyword array corresponding to each keyword in a keyword dictionary tree constructed from a plurality of keywords. Then, when the server traverses the keyword dictionary tree through certain video related data and determines the keywords matched with the video related data, the server can obtain a keyword array containing the matched keywords. Therefore, the server can read the word attribute of the keyword from the keyword array.
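The keyword-array variant described above can be sketched as a dictionary tree whose terminal nodes carry the keyword array. This is a hedged illustration only: `TrieNode`, `build_trie`, `match` and the tuple layout of `KeywordArray` are hypothetical names and structures chosen for the example, and the character-by-character scan is one simple way to traverse the tree, not necessarily the one used by the application.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

# Assumed layout of a "keyword array": (keyword, part of speech, classification category).
KeywordArray = Tuple[str, str, str]


@dataclass
class TrieNode:
    children: Dict[str, "TrieNode"] = field(default_factory=dict)
    # Recorded only on the node where a keyword ends.
    keyword_array: Optional[KeywordArray] = None


def build_trie(keyword_arrays: List[KeywordArray]) -> TrieNode:
    """Build the keyword dictionary tree and record each keyword array in it."""
    root = TrieNode()
    for array in keyword_arrays:
        node = root
        for ch in array[0]:
            node = node.children.setdefault(ch, TrieNode())
        node.keyword_array = array
    return root


def match(text: str, root: TrieNode) -> List[KeywordArray]:
    """Traverse the tree from every start position of the text and
    return the keyword arrays of all matched keywords, without duplicates."""
    found: List[KeywordArray] = []
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.children.get(ch)
            if node is None:
                break
            if node.keyword_array is not None and node.keyword_array not in found:
                found.append(node.keyword_array)
    return found
```

Because the word attributes travel with the matched node, no separate lookup in the correspondence between words and word attributes is needed after matching.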

Step 403, judging whether the word attributes of the at least two keywords meet the word attribute association condition.

The server may be preset with a word attribute association condition, where the word attribute association condition may be that a part-of-speech combination of the keyword conforms to a preset part-of-speech combination, and a classification category combination of the word classification categories conforms to the preset classification category combination. For example, the word attribute association condition may be that, of the two keywords, the part of speech of one keyword is a verb, the part of speech of the other keyword is a noun, and the word classification categories of the two keywords are the same. The auditor can change the word attribute association condition according to the audit requirement.

In an implementation, the server may determine whether the word attributes of the at least two keywords satisfy the word attribute association condition. If the word attributes of the at least two keywords satisfy the word attribute association condition, the server may perform step 404; if the word attributes of the at least two keywords do not satisfy the word attribute association condition, the server may perform step 405.

Optionally, an embodiment of the present application provides an implementation manner in which the server determines whether the word attributes of at least two keywords satisfy the word attribute association condition, including: generating a part-of-speech combination containing the parts of speech of the at least two keywords and a classification category combination containing the word classification categories of the at least two keywords.

If the part of speech combination accords with the preset part of speech combination and the classification category combination accords with the preset classification category combination, judging that the word attributes of the at least two keywords meet the word attribute association condition;

and if the part of speech combination does not accord with the preset part of speech combination, or the classification category combination does not accord with the preset classification category combination, judging that the word attributes of the at least two keywords do not meet the word attribute association condition.

For example, for the verb-noun matching strategy, the preset part-of-speech combination may be { noun, verb }, and the preset classification category combination may be { cartoon character, medical means }. When the video related data matches the keywords "sponge baby", "Aisha princess" and "do surgery", the server can obtain the word attributes of the keywords "sponge baby" and "Aisha princess", and find that their parts of speech are nouns and their word classification categories are "cartoon character". Similarly, the server may obtain the word attributes of the keyword "do surgery", and find that its part of speech is a verb and its word classification category is "medical means".

Then, the server can generate a part-of-speech combination containing the parts of speech of the at least two keywords to obtain { noun, verb }, and generate a classification category combination containing the word classification categories of the at least two keywords to obtain { cartoon character, medical means }. The server may determine that the part-of-speech combination { noun, verb } conforms to the preset part-of-speech combination { noun, verb }, and the classification category combination { cartoon character, medical means } conforms to the preset classification category combination { cartoon character, medical means }, i.e., the word attributes of the at least two keywords satisfy the word attribute association condition.

For the word skipping matching strategy, the preset part-of-speech combination can be { noun, noun }, and the preset classification category combination can be { real-time hotspot, real-time hotspot }. When the video related data matches at least two keywords, for example "speed" and "the palace", the server can obtain the word attributes of the keyword "speed", and find that its part of speech is a noun and its word classification category is "real-time hotspot"; similarly, the server may obtain the word attributes of the keyword "the palace", and find that its part of speech is a noun and its word classification category is "real-time hotspot".

Then, the server can generate a part-of-speech combination containing the parts of speech of the at least two keywords to obtain { noun, noun }, and generate a classification category combination containing the word classification categories of the at least two keywords to obtain { real-time hotspot, real-time hotspot }. The server may determine that the part-of-speech combination conforms to the preset part-of-speech combination, and the classification category combination conforms to the preset classification category combination, that is, the word attributes of the at least two keywords satisfy the word attribute association condition.

In the embodiment of the application, under the condition that the part-of-speech combination of the at least two keywords meets the preset part-of-speech combination and the classification category combination meets the preset classification category combination, the at least two keywords are judged to meet the word attribute association condition, so that the video related data containing the at least two keywords meets the keyword screening condition, and the accurate retrieval of the video related data can be realized. Furthermore, through setting a preset part-of-speech combination and a preset classification category combination, personalized retrieval of video related data can be realized.

Step 404, judging that the keywords matched with the video related data meet the keyword screening conditions.

Step 405, determining that the keywords matched with the video related data do not meet the keyword screening conditions.

In the embodiment of the application, the video related data matched with the keywords are further screened based on the keyword screening conditions related to the word attributes of the keywords, so that the video related data containing keywords that have an association relationship can be retrieved, and the accuracy of data retrieval is improved. On the other hand, personalized data retrieval requirements such as the verb-noun matching strategy and the word skipping matching strategy can be realized by setting different keyword screening conditions.
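Steps 401 to 405 can be sketched as below. The function name `meets_attribute_condition` and the tuple layout are hypothetical. The text does not define exactly when a combination "conforms" to a preset combination; the sketch assumes set equality, which reproduces both worked examples above but is only one possible interpretation.

```python
def meets_attribute_condition(matched, preset_pos, preset_categories):
    """matched: list of (keyword, part_of_speech, category) tuples for one
    piece of video related data. Returns True for step 404, False for step 405."""
    # Step 401: the condition can only hold if at least two keywords matched.
    if len(matched) < 2:
        return False
    # Steps 402/403: build the part-of-speech and classification category combinations.
    pos_combination = {pos for _, pos, _ in matched}
    category_combination = {category for _, _, category in matched}
    # Assumption: "conforms to" is modelled as equality of the two sets.
    return (pos_combination == set(preset_pos)
            and category_combination == set(preset_categories))
```

For the verb-noun example, the three matched keywords yield { noun, verb } and { cartoon character, medical means }, so the condition holds; if only the two cartoon-character nouns had matched, it would not.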

Optionally, the server may further send the determined search result to the control end, so that the control end displays the search result.

In the related art, all video related data of a video website is stored in an original database, which is a MongoDB database (a database based on distributed file storage). In this embodiment of the application, the server may use a MapReduce (a map-reduce programming model) manner to back up all video related data of the video website stored in the original database into a target database, where the target database is a database that stores data based on the HBase (a distributed, column-oriented, open source database) framework. Because the video related data is stored in the target database, high-performance data writing and reading requirements can be met and, at the same time, storage cost can be saved.

Optionally, the server may also detect a data modification operation on the original database; the data modification operation may be deleting certain video related data, storing certain video related data. The server may then perform data modification operations on the target database to synchronize the original database with the video-related data stored in the target database.
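The synchronization idea can be illustrated with plain dictionaries standing in for the original database and the target database. This is a sketch under stated assumptions: `Modification`, `apply_modification` and `synchronize` are hypothetical names, and no MongoDB or HBase API is used or implied.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Optional


@dataclass
class Modification:
    kind: str                      # "store" or "delete"
    data_id: str                   # identifier of the video related data
    payload: Optional[dict] = None  # content for "store" operations


def apply_modification(database: Dict[str, dict], op: Modification) -> None:
    """Apply one data modification operation to an in-memory stand-in database."""
    if op.kind == "store":
        database[op.data_id] = dict(op.payload or {})
    elif op.kind == "delete":
        database.pop(op.data_id, None)
    else:
        raise ValueError(f"unsupported modification: {op.kind}")


def synchronize(target: Dict[str, dict], detected_ops: Iterable[Modification]) -> None:
    """Replay, in order, the operations detected on the original database."""
    for op in detected_ops:
        apply_modification(target, op)
```

Replaying the detected operations in their original order keeps the target database consistent with the original database, provided both started from the same backed-up state.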

Optionally, the server may divide the video related data included in the target database in a MapReduce manner to obtain a plurality of groups. Therefore, after receiving the data retrieval request, the server can simultaneously judge, for each group, whether each video related data contained in the group meets the attribute information screening condition, and obtain the video related data to be matched.

Then, the server may construct a keyword dictionary tree from a plurality of keywords to be matched. Then, the server can match the keyword dictionary tree to the video related data to be matched in each group, so that the parallel processing of keyword matching can be realized, the time required by data retrieval can be reduced, and the processing speed of data retrieval can be improved.
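The grouping and parallel matching described above can be sketched as follows. The sketch uses a thread pool from the Python standard library as a stand-in for MapReduce, which is an assumption made purely so the example is runnable on one machine; all function names and the `"$end"` marker are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def build_trie(keywords):
    """Build a nested-dict keyword dictionary tree; "$end" marks a complete keyword."""
    root = {}
    for keyword in keywords:
        node = root
        for ch in keyword:
            node = node.setdefault(ch, {})
        node["$end"] = keyword
    return root


def matched_keywords(text, trie):
    """Traverse the tree from every start position and collect matched keywords."""
    found = set()
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            node = node.get(ch)
            if node is None:
                break
            if "$end" in node:
                found.add(node["$end"])
    return found


def split_into_groups(records, group_count):
    """Divide (data_id, text) records into group_count groups."""
    return [records[i::group_count] for i in range(group_count)]


def match_group(group, trie):
    """Match every record of one group; keep only records with at least one hit."""
    result = {}
    for data_id, text in group:
        hits = matched_keywords(text, trie)
        if hits:
            result[data_id] = hits
    return result


def retrieve(records, keywords, group_count=2):
    """Build the tree once, then match all groups in parallel and merge the results."""
    trie = build_trie(keywords)
    groups = split_into_groups(records, group_count)
    merged = {}
    with ThreadPoolExecutor(max_workers=group_count) as pool:
        for partial in pool.map(lambda group: match_group(group, trie), groups):
            merged.update(partial)
    return merged
```

The tree is built once and only read during matching, so every group can share it without locking; the merged dictionary maps each data identifier to the keywords it matched.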

Optionally, after the video related data is matched with the plurality of keywords and the video related data matched with the keywords is determined, the server may generate a search result including the data identifier of the determined video related data, and store the search result in an HDFS (Hadoop Distributed File System). The server may then send a retrieval complete message to the control end.

The control end can obtain the retrieval result from the HDFS after receiving the retrieval completion message. Then, the control end can acquire the video related data from the target database according to the data identifier of the video related data included in the retrieval result. And then, the control end can display the video related data in a preset display page so as to facilitate the auditor to further audit.

Fig. 5 is a diagram illustrating an example of a data retrieval method provided in an embodiment of the present application, wherein,

an auditing system can be pre-installed in a control end of an auditor, and the auditing system can realize a task management function.

The task management functions may include: a check task creation function, that is, when an auditor wants to perform data retrieval, the auditor may select a plurality of keywords to be matched from the sensitive word library and generate a data retrieval request through the control end, and the control end sends the data retrieval request to the server.

The server may operate based on Hadoop, and after receiving the data retrieval request, the server may provide the following functions:

A query analysis function, that is, parsing the data retrieval request to determine information such as the plurality of keywords to be matched and the attribute information screening conditions of the video related data.

A data filtering function, that is, dividing the video related data stored in the HBase database into a plurality of groups in a MapReduce manner, and then, for each group, using the video related data in the group that meets the attribute information screening condition as the video related data to be matched.

The sensitive word tree function is that a keyword dictionary tree is constructed according to a plurality of keywords to be matched, and for each video related data to be matched, the keyword dictionary tree is traversed through the video related data, so that the video related data is matched with the keywords to be matched. And then, generating a retrieval result according to the data identification of the video related data matched with the keywords in each group.

A data output function, that is, storing the retrieval result into the HDFS.

The task management function in the auditing system may further include: a check data download function, that is, the control end can obtain the retrieval result from the HDFS, and then the control end can obtain the video related data from the HBase database according to the data identifier of the video related data included in the retrieval result.

The task management function in the auditing system may further include: a check data display function, that is, the control end can display the video related data in a preset display page so as to facilitate further audit by the auditor.

Further, after the auditor performs data processing operations such as deletion and update on certain video related data, the control end may send a data synchronization request to the server, where the data synchronization request is used to instruct the same data processing operation to be performed on the same video related data in the HBase database, so as to implement data synchronization.
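The query analysis and data filtering functions of the example in Fig. 5 can be sketched as below. The request layout (a dictionary with `keywords` and `attribute_conditions`), the condition name `uploaded_after` and the record field names are all hypothetical; the application only states that the attribute information includes at least one of data uploading time, data source, data format and data classification.

```python
from datetime import date


def parse_request(request):
    """Query analysis: extract the keywords to be matched and the
    attribute information screening conditions from the request."""
    keywords = list(request["keywords"])
    conditions = dict(request.get("attribute_conditions", {}))
    return keywords, conditions


def meets_attribute_conditions(record, conditions):
    """Check one piece of video related data against the screening conditions.
    A condition that is absent from the request is treated as always satisfied."""
    uploaded_after = conditions.get("uploaded_after")
    if uploaded_after is not None and record["upload_time"] < uploaded_after:
        return False
    for name in ("source", "format", "classification"):
        allowed = conditions.get(name)
        if allowed is not None and record[name] not in allowed:
            return False
    return True


def select_to_be_matched(records, conditions):
    """Data filtering: keep only the video related data to be matched."""
    return [record for record in records if meets_attribute_conditions(record, conditions)]
```

Filtering by attribute information before keyword matching reduces the amount of video related data that has to traverse the keyword dictionary tree.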

An embodiment of the present application further provides a data retrieval device, as shown in fig. 6, the device includes:

a receiving module 610, configured to receive a data retrieval request, where the data retrieval request carries a plurality of keywords to be matched;

a building module 620, configured to build a keyword dictionary tree according to the multiple keywords to be matched;

a matching module 630, configured to traverse the keyword dictionary tree through video related data to match the video related data with the multiple keywords to be matched;

and the generating module 640 is configured to generate a retrieval result according to the video related data matched with the keyword.

Optionally, as shown in fig. 7, the generating module 640 includes:

the determining submodule 641 is configured to determine, for each video-related data matched with the keyword, whether the keyword matched with the video-related data meets a keyword screening condition;

the first generating sub-module 642 is configured to generate a search result according to the video related data when the keyword matched with the video related data meets the keyword screening condition.

Optionally, the determining sub-module 641 is specifically configured to, when the video related data matches at least two keywords, obtain word attributes of the at least two keywords, where the word attributes include part of speech and word classification categories; and when the word attributes of the at least two keywords meet the word attribute association condition, judging that the keywords matched with the video related data meet the keyword screening condition; and when the word attributes of the at least two keywords do not meet the word attribute association condition, judging that the keywords matched with the video related data do not meet the keyword screening condition.

Optionally, the determining sub-module 641 is further configured to generate a part-of-speech combination including the parts-of-speech of the at least two keywords, and a classification category combination including the word classification categories of the at least two keywords; and when the part of speech combination is a preset part of speech combination and the classification category combination is a preset classification category combination, judging that the keywords matched with the video related data meet keyword screening conditions.

Optionally, the determining sub-module 641 is specifically configured to determine whether the number of the keywords matched with the video related data reaches a preset number threshold; if the number reaches the preset number threshold value, judging that the keywords matched with the video related data meet keyword screening conditions; and if the number does not reach the preset number threshold value, judging that the keywords matched with the video related data do not meet the keyword screening condition.

Optionally, as shown in fig. 8, the apparatus further includes:

an obtaining module 650, configured to obtain attribute information of each video related data in the database when the data retrieval request further carries an attribute information screening condition, where the attribute information includes at least one of data uploading time, data source, data format, and data classification;

the first determining module 660 is configured to use the video related data whose attribute information meets the attribute information screening condition as the video related data to be matched.

Optionally, as shown in fig. 9, the generating module 640 includes:

a determining sub-module 643, configured to determine, for each piece of video-related data matched to a keyword, a keyword matched to the video-related data, so as to obtain a target keyword;

a second generating sub-module 644, configured to generate a search result including the video related data matched to the keyword and the target keyword.

Optionally, as shown in fig. 10, the apparatus further includes:

the storage module 670 is configured to store video related data stored in an original database into a target database in a MapReduce manner, so as to obtain video related data to be matched from the target database, where the target database is a database for performing data storage based on an HBase framework.

Optionally, as shown in fig. 11, the apparatus further includes:

a detecting module 680, configured to detect a data modification operation on the original database;

a synchronization module 690, configured to perform the data modification operation on the target database to synchronize the original database with the video related data stored in the target database.

Optionally, as shown in fig. 12, the apparatus further includes:

a dividing module 6100, configured to divide the video related data included in the target database to obtain a plurality of groups;

a second determining module 6120, configured to, for each group, use each video related data included in the group as video related data to be matched.

In the embodiment of the application, a data retrieval request can be received, wherein the data retrieval request carries a plurality of keywords to be matched; constructing a keyword dictionary tree according to a plurality of keywords to be matched; traversing a keyword dictionary tree by the video related data aiming at the video related data to be matched so as to match the video related data with a plurality of keywords to be matched; and generating a retrieval result according to the video related data matched with the keywords.

The embodiment of the present application further provides an electronic device, which can be used as a server of a video website, as shown in fig. 13, and includes a processor 1301, a communication interface 1302, a memory 1303, and a communication bus 1304, where the processor 1301, the communication interface 1302, and the memory 1303 complete communication with each other through the communication bus 1304,

a memory 1303 for storing a computer program;

the processor 1301 is configured to implement the following steps when executing the program stored in the memory 1303:

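receiving a data retrieval request, wherein the data retrieval request carries a plurality of keywords to be matched;

constructing a keyword dictionary tree according to the keywords to be matched;

for video related data to be matched, traversing the keyword dictionary tree through the video related data so as to match the video related data with the keywords to be matched;

and generating a retrieval result according to the video related data matched with the keywords.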

Optionally, the generating a search result according to the video related data matched with the keyword includes:

judging whether the keywords matched with the video related data meet the keyword screening condition or not aiming at the video related data of each matched keyword;

and if the keywords matched with the video related data meet the keyword screening condition, generating a retrieval result according to the video related data.

Optionally, the determining whether the keyword matched with the video related data meets the keyword screening condition includes:

if the video related data is matched with at least two keywords, acquiring word attributes of the at least two keywords, wherein the word attributes comprise part of speech and word classification categories;

if the word attributes of the at least two keywords meet the word attribute association condition, judging that the keywords matched with the video related data meet the keyword screening condition;

and if the word attributes of the at least two keywords do not meet the word attribute association condition, judging that the keywords matched with the video related data do not meet the keyword screening condition.

Optionally, after obtaining the term attributes of the at least two keywords, the method further includes:

generating a part-of-speech combination containing parts of speech of the at least two keywords and a classification category combination containing word classification categories of the at least two keywords;

if the word attributes of the at least two keywords meet the word attribute association condition, judging that the keywords matched with the video related data meet the keyword screening condition, wherein the judgment comprises the following steps:

and if the part of speech combination is a preset part of speech combination and the classification category combination is a preset classification category combination, judging that the keywords matched with the video related data meet keyword screening conditions.

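Optionally, the determining whether the keyword matched with the video related data meets the keyword screening condition includes:

judging whether the number of the keywords matched with the video related data reaches a preset number threshold value or not;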

if the number reaches the preset number threshold value, judging that the keywords matched with the video related data meet keyword screening conditions;

and if the number does not reach the preset number threshold value, judging that the keywords matched with the video related data do not meet the keyword screening condition.

Optionally, the data retrieval request further carries an attribute information screening condition, and the method further includes:

acquiring attribute information of video related data in a database, wherein the attribute information comprises at least one of data uploading time, data source, data format and data classification;

and taking the video related data of which the attribute information meets the attribute information screening condition as the video related data to be matched.

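Optionally, the generating a search result according to the video related data matched with the keyword includes:

determining, for each piece of video related data matched with a keyword, the keywords matched with the video related data to obtain target keywords;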

and generating a retrieval result containing the video related data matched with the keywords and the target keywords.

Optionally, the method further includes:

and storing the video related data stored in the original database into a target database in a MapReduce (a map-reduce programming model) manner, so as to obtain the video related data to be matched from the target database, wherein the target database is a database for storing data based on an HBase framework.

Optionally, the method further includes:

detecting a data modification operation in the original database;

and executing the data modification operation on the target database to synchronize the video related data stored in the original database and the target database.

Optionally, the obtaining method of the video related data to be matched includes:

dividing video related data contained in the target database to obtain a plurality of groups;

and for each group, taking each video related data contained in the group as the video related data to be matched.

The communication bus mentioned in the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in the figure, but this does not mean that there is only one bus or one type of bus.

The communication interface is used for communication between the electronic device and other equipment.

The memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the processor.

The Processor may be a general-purpose Processor, and includes a Central Processing Unit (CPU), a Network Processor (NP), and the like; the device can also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, a discrete Gate or transistor logic device, or a discrete hardware component.

In yet another embodiment provided by the present application, a computer-readable storage medium is further provided, which has instructions stored therein, and when the instructions are executed on a computer, the instructions cause the computer to execute the data retrieval method described in any of the above embodiments.

In yet another embodiment provided by the present application, there is also provided a computer program product containing instructions which, when run on a computer, cause the computer to perform the data retrieval method of any of the above embodiments.

In the above embodiments, the implementation may be wholly or partially realized by software, hardware, firmware, or any combination thereof. When implemented in software, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions described in accordance with the embodiments of the application are produced in whole or in part. The computer may be a general purpose computer, a special purpose computer, a network of computers, or other programmable device. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another, for example, from one website, computer, server, or data center to another website, computer, server, or data center via wired (e.g., coaxial cable, optical fiber, Digital Subscriber Line (DSL)) or wireless (e.g., infrared, radio, microwave) means. The computer-readable storage medium can be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media. The available medium may be a magnetic medium (e.g., floppy disk, hard disk, magnetic tape), an optical medium (e.g., DVD), or a semiconductor medium (e.g., Solid State Disk (SSD)), among others.

It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. In particular, for the system embodiment, since it is substantially similar to the method embodiment, the description is simple, and for the relevant points, reference may be made to the partial description of the method embodiment.

The above description is only for the preferred embodiment of the present application, and is not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application are included in the protection scope of the present application.

Claims

1. A method for data retrieval, the method comprising:

constructing a keyword dictionary tree according to the keywords to be matched;

2. The method of claim 1, wherein generating the search result according to the video related data matched to the keyword comprises:

3. The method of claim 2, wherein the determining whether the keyword matching the video related data satisfies the keyword screening condition comprises:

4. The method of claim 3, wherein after obtaining the word attributes of the at least two keywords, further comprising:

5. The method of claim 2, wherein the determining whether the keyword matching the video related data satisfies the keyword screening condition comprises:

6. The method of claim 1, wherein the data retrieval request further carries an attribute information filtering condition, and the method further comprises:

7. The method of claim 1, wherein generating the search result according to the video related data matched to the keyword comprises:

8. The method of claim 1, further comprising:

9. The method of claim 8, further comprising:

detecting a data modification operation in the original database;

10. The method according to claim 8, wherein the obtaining of the video related data to be matched comprises:

11. A data retrieval device, the device comprising:

12. An electronic device, characterized by comprising a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with each other through the communication bus;

a memory for storing a computer program;

a processor for implementing the method steps of any of claims 1-10 when executing a program stored in the memory.

13. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method steps of any one of claims 1 to 10.