CN114416926A - Keyword matching method and device, computing equipment and computer readable storage medium - Google Patents

Keyword matching method and device, computing equipment and computer readable storage medium

Info

Publication number
CN114416926A
Authority
CN
China
Prior art keywords
keywords
target
target text
candidate
candidate keywords
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210068520.0A
Other languages
Chinese (zh)
Inventor
白金国
李长亮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Kingsoft Digital Entertainment Co Ltd
Original Assignee
Beijing Kingsoft Digital Entertainment Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Kingsoft Digital Entertainment Co Ltd
Publication of CN114416926A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application provides a keyword matching method, a keyword matching device, a computing device and a storage medium. The keyword matching method comprises the following steps: acquiring a target text; performing character-by-character keyword matching on the target text to obtain a plurality of candidate keywords; and matching the plurality of candidate keywords, and determining the candidate keywords having an association relationship as the target keywords of the target text. The scheme can improve the accuracy of keyword matching.

Description

Keyword matching method and device, computing equipment and computer readable storage medium
Technical Field
The application relates to the technical field of artificial intelligence, in particular to a keyword matching method. The application also relates to a keyword matching device, a computing device and a computer readable storage medium.
Background
In scenarios where the content of a text is summarized or audited by means of the keywords in the text, keyword matching technology that automatically acquires the keywords in a text is widely used.
In the related art, the content of a text is generally matched character by character against a preset dictionary tree composed of characters, and one of the words obtained by the matching is used as the target keyword indicating the content of the text. However, a single keyword often cannot accurately indicate the content of the text, so the keyword matching result for the text is inaccurate.
Disclosure of Invention
In view of this, the present application provides a keyword matching method to solve the technical defects in the prior art. The embodiment of the application also provides a keyword matching device, a computing device and a computer readable storage medium.
According to a first aspect of an embodiment of the present application, a keyword matching method is provided, including:
acquiring a target text;
performing character-by-character keyword matching on the target text to obtain a plurality of candidate keywords;
and matching the plurality of candidate keywords, and determining the candidate keywords having an association relationship as the target keywords of the target text.
Optionally, the matching the plurality of candidate keywords and determining the candidate keywords having an association relationship as the target keywords of the target text includes:
determining, from the plurality of candidate keywords, the candidate keywords that match all words belonging to the same association relationship in a word level library, wherein the word level library stores a plurality of words and the association relationships among the words;
and taking the matched candidate keywords as the target keywords of the target text.
Optionally, the matching the plurality of candidate keywords and determining the candidate keywords having an association relationship as the target keywords of the target text includes:
obtaining the association degrees among the plurality of candidate keywords;
and matching the candidate keywords whose association degree is greater than or equal to an association degree threshold, and determining the candidate keywords having an association relationship as the target keywords of the target text.
Optionally, the obtaining of the association degrees between the multiple candidate keywords includes:
acquiring positions of the candidate keywords in the target text;
and determining the spacing distance of each candidate keyword in the target text as the association degree by using the position of each candidate keyword.
Optionally, the performing character-by-character keyword matching on the target text to obtain a plurality of candidate keywords includes:
sequentially matching each character of the target text with a character level library to obtain a plurality of candidate keywords;
and the characters in the character level library belong to preset keywords.
Optionally, after the matching is performed on the multiple candidate keywords and the candidate keywords having the association relationship are determined to be the target keywords of the target text, the method further includes:
inputting the target keywords into a classification model obtained by pre-training to obtain the type of the target text; the classification model is a neural network model obtained by training by utilizing sample keywords and type labels of sample texts corresponding to the sample keywords;
and outputting the type.
Optionally, after the matching is performed on the multiple candidate keywords and the candidate keywords having the association relationship are determined to be the target keywords of the target text, the method further includes:
generating a content abstract of the target text by using the target key words;
and outputting the content summary.
Optionally, the obtaining the target text includes:
acquiring a file to be matched and determining the file type of the file;
and when the file type is a text document, determining the file as a target text, otherwise, converting the file into the text document, and determining the converted text document as the target text.
According to a second aspect of the embodiments of the present application, there is provided a keyword matching apparatus, including:
a text acquisition module configured to acquire a target text;
the single keyword matching module is configured to perform character-by-character keyword matching on the target text to obtain a plurality of candidate keywords;
and a multi-keyword matching module configured to match the plurality of candidate keywords and determine the candidate keywords having an association relationship as the target keywords of the target text.
Optionally, the multi-keyword matching module is further configured to:
determining, from the plurality of candidate keywords, the candidate keywords that match all words belonging to the same association relationship in a word level library, wherein the word level library stores a plurality of words and the association relationships among the words;
and taking the matched candidate keywords as the target keywords of the target text.
Optionally, the multi-keyword matching module is further configured to:
obtaining the association degrees among the plurality of candidate keywords;
and matching the candidate keywords whose association degree is greater than or equal to an association degree threshold, and determining the candidate keywords having an association relationship as the target keywords of the target text.
Optionally, the multi-keyword matching module is further configured to:
acquiring positions of the candidate keywords in the target text;
and determining the spacing distance of each candidate keyword in the target text as the association degree by using the position of each candidate keyword.
Optionally, the single keyword matching module is further configured to:
sequentially matching each character of the target text with a character level library to obtain a plurality of candidate keywords;
and the characters in the character level library belong to preset keywords.
Optionally, the apparatus further comprises an output module configured to:
after the plurality of candidate keywords are matched and the candidate keywords having an association relationship are determined as the target keywords of the target text, inputting the target keywords into a pre-trained classification model to obtain the type of the target text; the classification model is a neural network model trained using sample keywords and the type labels of the sample texts corresponding to the sample keywords;
and outputting the type.
Optionally, the apparatus further comprises an output module configured to:
after the plurality of candidate keywords are matched and the candidate keywords having an association relationship are determined as the target keywords of the target text, generating a content summary of the target text by using the target keywords;
and outputting the content summary.
Optionally, the text obtaining module is further configured to:
acquiring a file to be matched and determining the file type of the file;
and when the file type is a text document, determining the file as a target text, otherwise, converting the file into the text document, and determining the converted text document as the target text.
According to a third aspect of embodiments herein, there is provided a computing device comprising:
a memory and a processor;
the memory is used for storing computer-executable instructions, and the processor realizes the steps of the keyword matching method when executing the computer-executable instructions.
According to a fourth aspect of embodiments herein, there is provided a computer-readable storage medium storing computer-executable instructions that, when executed by a processor, implement the steps of the keyword matching method.
In the scheme provided by the present application, a target text is acquired, character-by-character keyword matching is performed on the target text to obtain a plurality of candidate keywords, the candidate keywords are then matched, and the candidate keywords having an association relationship are determined as the target keywords of the target text. Because the target keywords are those candidate keywords that are semantically associated with one another, they reflect the semantics of the corresponding content in the target text; keyword matching is thereby performed at the semantic level, which can improve the matching accuracy of the target keywords. On this basis, when the target keywords are used to audit whether the target text violates the rules, the case in which no single candidate keyword violates the rules but the semantics reflected by the target keywords do violate them can still be detected, which improves the accuracy of the audit result.
Drawings
FIG. 1 is a schematic structural diagram of a keyword matching system according to an embodiment of the present application;
FIG. 2 is a flowchart of a keyword matching method according to an embodiment of the present application;
FIG. 3 is a flowchart of a keyword matching method according to another embodiment of the present application;
FIG. 4 is a flowchart of a keyword matching method according to another embodiment of the present application;
FIG. 5 is a schematic structural diagram of a text auditing system according to yet another embodiment of the present application;
FIG. 6 is a schematic structural diagram of a keyword matching apparatus according to an embodiment of the present application;
FIG. 7 is a block diagram of a computing device according to an embodiment of the present application.
Detailed Description
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application. However, the present application can be implemented in many ways other than those described herein, and those skilled in the art can make similar extensions without departing from the spirit of the present application; the present application is therefore not limited to the specific implementations disclosed below.
The terminology used in the one or more embodiments of the present application is for the purpose of describing particular embodiments only and is not intended to be limiting of the one or more embodiments of the present application. As used in one or more embodiments of the present application and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used in one or more embodiments of the present application refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It will be understood that, although the terms first, second, etc. may be used herein in one or more embodiments of the present application to describe various information, these information should not be limited by these terms. These terms are only used to distinguish one type of information from another. For example, a first aspect may be termed a second aspect, and, similarly, a second aspect may be termed a first aspect, without departing from the scope of one or more embodiments of the present application.
First, the noun terms to which one or more embodiments of the present application relate are explained.
Character string matching algorithm: an algorithm for locating a second string within a first string, wherein the second string is contained in the first string.
Dictionary tree: also known as a trie or word-lookup tree, a tree structure that can be used to store strings. The root node contains no character, and every node other than the root node contains exactly one character; the characters on the path from the root node to a given node, concatenated in order, form the string corresponding to that node; all children of a node contain different characters.
Automaton algorithm (Aho-Corasick, AC automaton): a string matching algorithm for multi-pattern matching. The algorithm builds a deterministic tree-shaped finite state machine from multiple pattern strings, feeds the main string to the finite state machine as input so that the state machine performs state transitions, and reports a pattern match whenever certain specific states are reached.
Deterministic Finite Automaton (DFA): a string matching algorithm similar to the AC automaton. It is built from a finite set of states and a set of edges leading from one state to another, each edge being labeled with a symbol; one state is the initial state and some states are final states.
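The AC automaton defined above can be illustrated with a minimal Python sketch. It is not the implementation used by the application; the function names build_automaton and find_all, and the list-based state layout, are assumptions made only for this illustration.

```python
from collections import deque


def build_automaton(patterns):
    """Build trie transitions, failure links and output sets (hypothetical helper)."""
    goto = [{}]    # goto[state][char] -> next state; state 0 is the root
    fail = [0]     # fail[state] -> state of the longest proper suffix in the trie
    out = [set()]  # out[state] -> patterns that end when this state is reached
    for pattern in patterns:
        state = 0
        for ch in pattern:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append(set())
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].add(pattern)
    # Breadth-first traversal: depth-1 states keep the root as failure link.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] |= out[fail[nxt]]
    return goto, fail, out


def find_all(text, patterns):
    """Return sorted (start_index, pattern) pairs for every occurrence in text."""
    goto, fail, out = build_automaton(patterns)
    state = 0
    hits = []
    for i, ch in enumerate(text):
        # Follow failure links until a transition on ch exists or the root is hit.
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for pattern in out[state]:
            hits.append((i - len(pattern) + 1, pattern))
    return sorted(hits)
```

The main string is consumed once, one character at a time, and every pattern that ends at the current character is reported, which corresponds to the state transitions and "specific states" mentioned in the definition.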
In the present application, a keyword matching method is provided. The present application also relates to a keyword matching apparatus, a computing device, and a computer-readable storage medium, which are described in detail in the following embodiments one by one.
Referring to FIG. 1, FIG. 1 is a schematic structural diagram of a keyword matching system according to an embodiment of the present application. The execution subject of the keyword matching method provided by the embodiment of the present application may be a server or a terminal, which is not limited in the embodiment of the present application. The terminal may be any electronic product capable of human-computer interaction with a user, such as a personal computer (PC), a mobile terminal, a pocket PC, or a tablet PC. The server may be a single server, a server cluster composed of multiple servers, or a cloud computing service center, which is not limited in the embodiment of the present application.
Taking a terminal as the execution subject as an example, the terminal acquires a target text and performs character-by-character keyword matching on the target text to obtain a plurality of candidate keywords; it then matches the plurality of candidate keywords and determines the candidate keywords having an association relationship as the target keywords of the target text. The target keywords are then used to audit whether the target text violates the rules. When the plurality of candidate keywords are matched and the candidate keywords having an association relationship are determined as the target keywords of the target text, an association degree model and/or an association relationship model may be used. The association degree model and/or the association relationship model may be trained by the server and sent to the terminal. When the target keywords are used to audit whether the target text violates the rules, a classification model may be used. The classification model may be trained by the server and sent to the terminal.
Taking a server as the execution subject as an example, the server acquires a target text and performs character-by-character keyword matching on the target text to obtain a plurality of candidate keywords; it then matches the plurality of candidate keywords and determines the candidate keywords having an association relationship as the target keywords of the target text. The target keywords are then used to audit whether the target text violates the rules. The server may train an association degree model based on first training samples and/or train an association relationship model based on second training samples, and use the association degree model and/or the association relationship model when matching the plurality of candidate keywords and determining the candidate keywords having an association relationship as the target keywords of the target text. The server may train a classification model based on third training samples and use the classification model when auditing whether the target text violates the rules by using the target keywords.
In the embodiment of the present application, because the target keywords are those candidate keywords that are semantically associated with one another, they reflect the semantics of the corresponding content in the target text; keyword matching is thereby performed at the semantic level, which can improve the matching accuracy of the target keywords. On this basis, when the target keywords are used to audit whether the target text violates the rules, the case in which no single candidate keyword violates the rules but the semantics reflected by the target keywords do violate them can still be detected, which improves the accuracy of the audit result.
Those skilled in the art should understand that the above terminal and server are only examples; other existing or future terminals or servers, where applicable to the embodiments of the present application, also fall within the scope of the embodiments of the present application.
Fig. 2 shows a flowchart of a keyword matching method according to an embodiment of the present application, which specifically includes the following steps:
S201, acquiring a target text.
In a specific application, the target text refers to a text from which keywords can be extracted for enriching a keyword library. The target text may be obtained from a specified file. The specified files may include files submitted by a user, files on a public network, and so on. After the target text is obtained, it may be stored in a database for subsequent processing. Furthermore, the target text may be obtained in various ways, which are described below in the form of alternative embodiments.
In an optional implementation manner, the obtaining of the target text may specifically include the following steps:
when a text submission instruction is detected, receiving the text corresponding to the submission instruction as the target text.
In another optional implementation, the obtaining of the target text specifically may include the following steps:
acquiring a file to be matched and determining the file type of the file;
and when the file type is a text document, determining the file as a target text, otherwise, converting the file into the text document, and determining the converted text document as the target text.
In a specific application, the file to be matched may be received or looked up. The file types of the file to be matched may include text documents (such as documents in DOC or DOCX format), Portable Document Format (PDF) documents, pictures, and other file types. By determining the file type of the file to be matched and processing the file according to its file type, the application range of the keyword matching method provided by the embodiment of the present application can be expanded.
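The file type check described above can be sketched as a small dispatch function. This is only an illustration under stated assumptions: the set TEXT_SUFFIXES, the function get_target_text and the two callables read_text and convert_to_text are hypothetical, and the application does not specify how conversion is performed.

```python
from pathlib import Path

# Assumed set of suffixes treated as "text documents" in this sketch.
TEXT_SUFFIXES = {".txt", ".doc", ".docx"}


def get_target_text(path, read_text, convert_to_text):
    """Return the target text for a file, converting non-text files first.

    read_text and convert_to_text are hypothetical callables supplied by the
    caller (for example a document reader, a PDF extractor or an OCR step).
    """
    suffix = Path(path).suffix.lower()
    if suffix in TEXT_SUFFIXES:
        # The file is already a text document: use it directly.
        return read_text(path)
    # Any other file type is converted to a text document first.
    return convert_to_text(path)
```

Passing the readers in as parameters keeps the sketch independent of any particular PDF or image library.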
S202, performing character-by-character keyword matching on the target text to obtain a plurality of candidate keywords.
Performing character-by-character keyword matching on the target text to obtain a plurality of candidate keywords means matching the target text character by character against a preset character library and extracting a plurality of candidate keywords according to the matching result. A candidate keyword is a keyword that is a candidate for the target keywords. One target text may include a plurality of keywords, each composed of letters or Chinese characters. Therefore, character-by-character keyword matching can be performed on the target text to obtain a plurality of candidate keywords. For ease of understanding and reasonable layout, step S202 is described in detail below in the form of alternative embodiments.
S203, matching the candidate keywords, and determining the candidate keywords with the association relation as the target keywords of the target text.
The association relationship may be a relationship in which a plurality of different keywords describe the same event, that is, a semantic association between different keywords. For example, the keywords screen, show, theater and lead actor all describe the same event, "movie"; therefore, with respect to the event of a movie, there is an association relationship among the four words screen, show, theater and lead actor. Illustratively, the occurrence frequency in the target text of the characters in the candidate keywords is counted, adjacent characters are combined, and the frequency of the combined adjacent characters is counted; if the combined adjacent characters appear in a common word list, they are discarded; otherwise, they are used as target keywords. This process is iterated until the maximum number of iterations is reached or the words no longer change. For ease of understanding and reasonable layout, step S203 is described in detail below in the form of alternative embodiments.
In addition, the target text may be a text from which keywords can be extracted for enriching a keyword library, or a text to be audited. In the scenario of enriching a keyword library, after the target keywords of the target text are obtained, they may be stored in a target keyword library; by acquiring keywords that are not in the common word list, the richness of the target keyword library can be improved. In the auditing scenario, after the target keywords of the target text are obtained, they can be used to audit whether the target text violates the rules, so that the case in which no single candidate keyword violates the rules but the semantics reflected by the target keywords do violate them can be detected, which improves the accuracy of the audit result.
In the scheme provided by the present application, a target text is acquired, character-by-character keyword matching is performed on the target text to obtain a plurality of candidate keywords, the candidate keywords are then matched, and the candidate keywords having an association relationship are determined as the target keywords of the target text. Because the target keywords are those candidate keywords that are semantically associated with one another, they reflect the semantics of the corresponding content in the target text; keyword matching is thereby performed at the semantic level, which can improve the matching accuracy of the target keywords. On this basis, when the target keywords are used to audit whether the target text violates the rules, the case in which no single candidate keyword violates the rules but the semantics reflected by the target keywords do violate them can still be detected, which improves the accuracy of the audit result.
In an optional implementation manner, performing character-by-character keyword matching on a target text to obtain a plurality of candidate keywords may specifically include the following steps:
sequentially matching each character of the target text with a character level library to obtain a plurality of candidate keywords;
wherein, the characters in the character level library belong to preset keywords.
In a specific application, the character level library may take various forms. Illustratively, it may be a character level table or a character level dictionary tree. Characters belonging to the same preset keyword may be stored in the same row or the same column of the character level table. The nodes of the character level dictionary tree are characters, and the characters belonging to the same preset keyword may be stored in nodes of the same subtree. For example, the characters of each preset keyword may be extracted so that each character occupies one column of the character level table and characters belonging to the same preset keyword are located in the same row; alternatively, each character is a node in the character level dictionary tree, and the characters belonging to the same preset keyword are assigned to nodes in the order in which they appear in the keyword, forming parent-child relationships.
In addition, the character-level matching described above may be performed using an AC automaton or a DFA. Here, character-level matching refers to the matching process that extracts the candidate keywords. Any string matching algorithm may be used in the present application, which is not limited in this embodiment.
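As an illustration of character-level matching against a character level dictionary tree, the following Python sketch stores each preset keyword as a path of single-character nodes and scans the target text one character at a time. The names END, build_char_trie and match_candidates, and the nested-dictionary layout, are assumptions made for this sketch; it is a plain trie scan rather than an AC automaton or DFA.

```python
END = "$end"  # assumed marker key: a preset keyword ends at this node


def build_char_trie(keywords):
    """Build a character level dictionary tree as nested dicts."""
    root = {}
    for word in keywords:
        node = root
        for ch in word:
            # Characters of one keyword form a parent-child chain.
            node = node.setdefault(ch, {})
        node[END] = word
    return root


def match_candidates(text, trie):
    """Scan the text character by character; return (start, keyword) pairs."""
    found = []
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            node = node.get(ch)
            if node is None:
                break  # no preset keyword continues with this character
            if END in node:
                found.append((start, node[END]))
    return found
```

For example, with the preset keywords "show" and "theater", scanning the text "a show at the theater" yields the candidate keywords "show" and "theater" together with their start positions, which a later step can use to compute spacing distances.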
In an optional implementation manner, the determining, as the target keyword of the target text, the candidate keyword having an association relationship among the plurality of candidate keywords may specifically include the following steps:
determining, from the plurality of candidate keywords, the candidate keywords that match all words belonging to the same association relationship in a word level library, wherein the word level library stores a plurality of words and the association relationships among the words;
and taking the matched candidate keywords as target keywords of the target text.
In a specific application, the association relationships among the words are stored in the word level library, and an association relationship may be identified by parent and child node identifiers, by identifiers located in the same row, or by a specified vocabulary. The word level library may be constructed in advance from a plurality of preset keywords. Specifically, a technician may construct the word level library from a plurality of preset keywords, based on experience, according to the association relationships among the preset keywords. For example, the preset keywords include millet, mobile phone, smart home, food, and planting. The constructed word level library may then include: millet, mobile phone and smart home are associated, and millet, food and planting are associated. (Here "millet" presumably translates 小米, which is also the name of the Xiaomi brand, so the same word belongs to two different association relationships.) In addition, the word level library may take various forms. Illustratively, it may be a word level table or a word level dictionary tree. Words having an association relationship may be stored in the same row or the same column of the word level table. The nodes of the word level dictionary tree are words, and words having an association relationship may be stored in the nodes of the same subtree.
For the word level table: if the words with the association relationship are stored in the same row, determining that the association relationship exists among a plurality of candidate keywords when the candidate keywords are matched with the keywords in the same row in the word level table; and if the words with the association relationship are stored in the same column, determining that the candidate keywords are the candidate keywords with the association relationship when the candidate keywords are matched with the keywords in the same column in the word level table.
For the word level dictionary tree: when a plurality of candidate keywords match all nodes belonging to the same subtree in the word level dictionary tree, the plurality of candidate keywords are determined to be candidate keywords having an association relationship. In addition, the word-level matching may be performed using an AC automaton or a DFA. Any string matching algorithm may be used in the present application, which is not limited in this embodiment.
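The word level table case can be sketched as follows, assuming that each row of the table holds the words of one association relationship and that a row counts as matched only when all of its words appear among the candidate keywords. The function name match_word_groups and the list-of-lists table are hypothetical choices for this illustration.

```python
def match_word_groups(candidates, word_level_table):
    """Return the candidates that cover every word of at least one table row."""
    candidate_set = set(candidates)
    targets = []
    for row in word_level_table:
        # A row matches only if all of its words are candidate keywords.
        if set(row) <= candidate_set:
            for word in row:
                if word not in targets:
                    targets.append(word)
    return targets


# Rows taken from the example in the text above.
table = [
    ["millet", "mobile phone", "smart home"],
    ["millet", "food", "planting"],
]
print(match_word_groups(["millet", "planting", "food", "screen"], table))
```

With the candidates in the usage line, only the second row is fully covered, so "millet", "food" and "planting" become the target keywords, while "screen" is dropped because it belongs to no matched row.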
Fig. 3 is a flowchart illustrating a keyword matching method according to another embodiment of the present application, which specifically includes the following steps:
S301, acquiring a target text.
S302, performing character-by-character keyword matching on the target text to obtain a plurality of candidate keywords.
Steps S301 to S302 are the same as steps S201 to S202 in the embodiment of Fig. 2 and are not repeated here; for details, see the description of the embodiment of Fig. 2.
S303, obtaining the association degree among a plurality of candidate keywords.
In a specific application, the association degree between a plurality of candidate keywords may be obtained in various ways. For example, the candidate keywords may be input into a pre-trained association degree model to obtain the association degree between them. The association degree model is a neural network model trained using a plurality of sample keywords and the association degree labels among those sample keywords. Alternatively, the separation distance between the candidate keywords in the target text may be obtained and used as the association degree between them. For ease of understanding and layout, this second example is described in detail below as an optional implementation.
In an optional implementation manner, the obtaining of the association degrees between the multiple candidate keywords may specifically include the following steps:
acquiring positions of a plurality of candidate keywords in a target text;
and determining, from the position of each candidate keyword in the target text, the separation distance between the candidate keywords in the target text as the association degree among the candidate keywords.
In a specific application, the positions of the candidate keywords in the target text can be obtained with a string matching algorithm. The position of a candidate keyword in the target text may be a position identifier, such as a natural number assigned in sequence: 1, 2, 3, ..., n. The larger the position identifier, the farther the keyword is from the first keyword in the text. Alternatively, the position of a candidate keyword may be given as a row and a column, for example row 1, column 1. The row and column then serve as position coordinates of the keyword, so that the distance between keywords can be calculated from the coordinates; the greater the distance, the weaker the association between the keywords. Determining the separation distance between the candidate keywords from their positions in the target text may specifically include calculating the difference between the positions of the candidate keywords in the target text and using that difference as the separation distance.
In the target text, the closer keywords are to one another, the more likely they are to reflect related semantics. For example, keywords in the same row are more likely to describe the same action or event, and accordingly the semantics they reflect are more closely related than those of keywords located in different paragraphs. Within the same target text, the smaller the separation distance between different keywords, the more likely they are to describe the same event, and the higher the association degree between them. On this basis, in this optional implementation the separation distance between the candidate keywords in the target text can be used as the association degree between them, reflecting how closely the semantics of different keywords are related.
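The position-difference calculation can be sketched as follows. This is only an illustration under stated assumptions: the position identifier is taken to be the character offset of a keyword's first occurrence (found with `str.find`), and the function names `keyword_positions` and `separation_distances` are hypothetical.

```python
from itertools import combinations

def keyword_positions(text: str, keywords: list[str]) -> dict[str, int]:
    """Map each keyword found in the text to the offset of its first occurrence."""
    positions = {}
    for kw in keywords:
        idx = text.find(kw)
        if idx != -1:  # keywords absent from the text are skipped
            positions[kw] = idx
    return positions

def separation_distances(positions: dict[str, int]) -> dict[tuple[str, str], int]:
    """Absolute position difference for every unordered pair of keywords."""
    return {
        (a, b): abs(positions[a] - positions[b])
        for a, b in combinations(sorted(positions), 2)
    }

text = "hero A meets director D in movie F"
pos = keyword_positions(text, ["hero A", "director D", "movie F"])
dist = separation_distances(pos)
```

With row and column coordinates instead of a single offset, the same structure would apply with a two-dimensional distance in place of the absolute difference.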
S304, matching the candidate keywords whose association degree is greater than or equal to an association degree threshold, and determining the candidate keywords having an association relationship as the target keywords of the target text.
In a specific application, matching only the candidate keywords whose association degree is greater than or equal to the association degree threshold avoids redundant matching of candidate keywords that are unlikely to be semantically associated, which improves processing efficiency. Matching these candidate keywords may specifically include: matching them against the word level table and determining, from the matching result, the candidate keywords having an association relationship as the target keywords; or combining them into a plurality of candidate keyword groups, inputting each candidate keyword group into a pre-trained association relationship model to obtain a result indicating whether the group has an association relationship, and determining the candidate keywords in the groups whose result is positive as the target keywords. The association relationship model is trained using second training samples, which may include sample keyword groups and labels indicating whether each sample keyword group has an association relationship.
In this embodiment, the candidate keywords are screened by the association degree between them, which avoids the loss of efficiency that would result from matching candidate keywords whose association degree is below the association degree threshold. The association degree threshold can be set according to specific requirements.
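The screening step can be sketched as below. The text uses the separation distance as the association degree while also stating that a larger distance means a weaker association, so this sketch assumes a score of 1 / (1 + distance) that grows as keywords get closer; that formula, the threshold value, and the names `association_degree` and `filter_by_threshold` are illustrative assumptions, not part of the application.

```python
def association_degree(distance: int) -> float:
    """Assumed score: 1.0 at distance 0, decreasing as the distance grows."""
    return 1.0 / (1.0 + distance)

def filter_by_threshold(distances: dict[tuple[str, str], int],
                        threshold: float) -> set[str]:
    """Keep keywords that belong to at least one pair meeting the threshold."""
    kept: set[str] = set()
    for (a, b), d in distances.items():
        if association_degree(d) >= threshold:
            kept.update((a, b))
    return kept

# Hypothetical pairwise separation distances between candidate keywords.
distances = {
    ("millet", "mobile phone"): 3,
    ("millet", "food"): 99,
    ("food", "mobile phone"): 96,
}
kept = filter_by_threshold(distances, threshold=0.1)
```

Only the keywords that survive this screening would then be passed to the word level table or to the association relationship model, which is where the efficiency gain described above comes from.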
In an optional implementation manner, after the matching is performed on the multiple candidate keywords and the candidate keywords having the association relationship are determined to be the target keywords of the target text, the keyword matching method provided in the embodiment of the present application may further include the following steps:
inputting the target keywords into a classification model obtained by pre-training to obtain the type of the target text; the classification model is a neural network model obtained by training by utilizing sample keywords and type labels of sample texts corresponding to the sample keywords;
and outputting the type of the target text.
In a specific application, the type of a text indicates either the subject matter of the text or whether the content of the text violates rules. For example, types that indicate subject matter may include electronic product introduction, cosmetics introduction, movie review, news, and the like. This optional implementation can output the subject matter of the target text, or whether the target text violates rules, for use in reviewing the target text. The third training samples are the sample keywords and the type labels of the sample texts corresponding to the sample keywords.
In an optional implementation manner, after the matching is performed on the multiple candidate keywords and the candidate keywords having the association relationship are determined to be the target keywords of the target text, the keyword matching method provided in the embodiment of the present application may further include the following steps:
generating a content summary of the target text by using the target keywords;
and outputting the content summary.
In a specific application, the target keywords are non-repeating words, and a content summary is a summary of the content of the target text. One target keyword may correspond to one content summary, so a single target text may have multiple content summaries. For example, the content summaries of a target text that introduces a certain movie may include: ["movie", "movie F", "director D", "hero A carries out event E"]. The content summary of the target text can be generated from the target keywords in various ways. Illustratively, the target keywords may be used directly as the content summary of the target text. The generated content summary can be stored in a database and associated with the target text. In general, if there are many target texts, each target text can serve as a row of a data table, with its content summaries in the columns of that row. The resulting data table then both records the correspondence between content summaries and target texts and stores the different target texts together, which facilitates the management of a large number of target texts. Alternatively, the target keywords may be input into a pre-trained text generation model to obtain the content summary of the target text. The text generation model may be obtained by training a recurrent neural network (RNN) using a plurality of sample keywords and sample texts. In this way, the target keywords can be organized into a content summary that follows the form of expression of the sample texts. For example, the content summary of a target text that introduces a certain movie may read: in movie F by director D, hero A carries out event E.
For ease of understanding, the above embodiments are described below in an integrated manner in the form of an exemplary illustration.
Fig. 4 is a flowchart illustrating a keyword matching method according to another embodiment of the present application, which specifically includes the following steps:
inputting a document; DFA (character level); hit keyword set; DFA (word level); and outputting the keywords.
For ease of understanding, the method shown in Fig. 4 is further described below using the system shown in Fig. 5 as an application example. Fig. 5 is a schematic structural diagram of a text auditing system according to another embodiment of the present application. The system includes a client 502, a server 504, and a database 506.
A user may enter a document containing the target text through the client 502. The client 502 is configured to receive the entered document and send it to the server 504. Alternatively, the client 502 may store the document in the database 506, from which the server 504 retrieves it. The database 506 may be a database of the client 502, a database of the server 504, or a dedicated database separate from both the client 502 and the server 504.
On this basis, the server 504 may obtain the target text from the document; extract candidate keywords from the target text through the DFA (character level) to obtain a hit keyword set; determine semantically associated keywords from the hit keyword set through the DFA (word level); and output the determined keywords. The output keywords are the target keywords of the embodiment of Fig. 2 and its optional implementations. Outputting the keywords may include: outputting the keywords to an auditor, so that the auditor can use them to review whether the target text violates rules; or outputting the keywords to an audit module of the server 504, which can input them into the classification model to obtain an audit result indicating whether the target text violates rules. Each step in this embodiment is implemented in the same way as in the embodiment of Fig. 2 and its optional implementations; for details, see those descriptions, which are not repeated here.
Corresponding to the above method embodiment, the present application further provides an embodiment of a keyword matching apparatus, and fig. 6 shows a schematic structural diagram of a keyword matching apparatus provided in an embodiment of the present application. As shown in fig. 6, the apparatus includes:
a text acquisition module 601 configured to acquire a target text;
a single keyword matching module 602 configured to perform character-by-character keyword matching on the target text to obtain a plurality of candidate keywords;
a multi-keyword matching module 603 configured to match the plurality of candidate keywords, and determine candidate keywords having an association relationship as target keywords of the target text.
According to the scheme provided by the application, the target keywords are those candidate keywords that are semantically associated with one another, so the target keywords reflect the semantics of the corresponding content in the target text. This achieves keyword matching at the semantic level and can improve the accuracy of target keyword matching. On this basis, when the target keywords are used to review whether the target text violates rules, the case in which no individual candidate keyword violates rules but the semantics reflected by the target keywords do can still be detected, which improves the accuracy of the review result.
In an optional implementation manner, the text obtaining module 601 is further configured to:
acquiring a file to be matched and determining the file type of the file;
and when the file type is a text document, determining the file as a target text, otherwise, converting the file into the text document, and determining the converted text document as the target text.
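The file-type branch handled by the text acquisition module can be sketched as follows. The sketch assumes the file type is judged from the file extension and that conversion is delegated to a caller-supplied function; `TEXT_SUFFIXES`, `acquire_target_text`, and the `convert` parameter are hypothetical names, and no particular conversion library is implied.

```python
from pathlib import Path
from typing import Callable

# Assumption: files with these extensions are already text documents.
TEXT_SUFFIXES = {".txt"}

def acquire_target_text(path: Path, convert: Callable[[Path], str]) -> str:
    """Return the target text, converting the file first if it is not text."""
    if path.suffix.lower() in TEXT_SUFFIXES:
        # The file is a text document: use its content directly.
        return path.read_text(encoding="utf-8")
    # Otherwise convert the file to text (e.g. via OCR or a document parser).
    return convert(path)
```

Passing the converter as a parameter keeps the sketch independent of how a PDF, image, or other format would actually be turned into text.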
In an optional implementation, the multi-keyword matching module 603 is further configured to:
determining, from the candidate keywords, the candidate keywords that match all words belonging to the same association relationship in a word level library; the word level library stores a plurality of words and the association relationships among the words;
and taking the matched candidate keywords as target keywords of the target text.
In an optional implementation, the multi-keyword matching module 603 is further configured to:
obtaining the association degree among the candidate keywords;
and matching the candidate keywords whose association degree is greater than or equal to an association degree threshold, and determining the candidate keywords having an association relationship as the target keywords of the target text.
In an optional implementation, the multi-keyword matching module 603 is further configured to:
acquiring positions of the candidate keywords in the target text;
and determining, by using the position of each candidate keyword, the separation distance between the candidate keywords in the target text as the association degree.
In an alternative embodiment, the single keyword matching module 602 is further configured to:
sequentially matching each character of the target text with a character level library to obtain a plurality of candidate keywords;
and the characters in the character level library belong to preset keywords.
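Character-by-character matching against a character level library can be sketched with a small dictionary tree built from the preset keywords. This is a simplified scan that restarts from every text position, not a full AC automaton or DFA; the names `build_trie` and `scan` and the `"$end"` marker are hypothetical.

```python
def build_trie(keywords: list[str]) -> dict:
    """Build a nested-dict trie; the key "$end" marks a complete keyword."""
    root: dict = {}
    for word in keywords:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$end"] = word
    return root

def scan(text: str, trie: dict) -> list[tuple[int, str]]:
    """Walk the text character by character and collect (start, keyword) hits."""
    hits = []
    for start in range(len(text)):
        node = trie
        for ch in text[start:]:
            if ch not in node:
                break  # no preset keyword continues with this character
            node = node[ch]
            if "$end" in node:
                hits.append((start, node["$end"]))
    return hits

trie = build_trie(["millet", "mobile phone", "phone"])
hits = scan("millet mobile phone", trie)
```

The start offsets returned alongside each hit can double as the position identifiers used when computing the separation distance between candidate keywords.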
In an alternative embodiment, the apparatus further comprises an output module configured to:
after matching the candidate keywords and determining the candidate keywords having an association relationship as the target keywords of the target text, inputting the target keywords into a pre-trained classification model to obtain the type of the target text; the classification model is a neural network model trained using sample keywords and the type labels of the sample texts corresponding to the sample keywords;
and outputting the type.
In an alternative embodiment, the apparatus further comprises an output module configured to:
after the candidate keywords are matched and the candidate keywords having an association relationship are determined as the target keywords of the target text, generating a content summary of the target text by using the target keywords;
and outputting the content summary.
The above is a schematic scheme of the keyword matching apparatus of this embodiment. It should be noted that the technical solution of the keyword matching apparatus and that of the keyword matching method belong to the same concept; for details not described in the technical solution of the apparatus, refer to the description of the technical solution of the keyword matching method. Further, the components in the apparatus embodiment should be understood as the functional modules that must be established to implement the steps of the program flow or of the method; the functional modules are not defined by actual physical division or separation. Apparatus claims defined by such a set of functional modules should be understood as a functional module framework that implements the solution mainly through the computer program described in the specification, and not as a physical apparatus that implements the solution mainly through hardware.
Fig. 7 shows a block diagram of a computing device according to an embodiment of the present application. The components of the computing device 700 include, but are not limited to, memory 710 and a processor 720. Processor 720 is coupled to memory 710 via bus 730, and database 750 is configured to store data.
Computing device 700 also includes an access device 740, which enables computing device 700 to communicate via one or more networks 760. Examples of such networks include a Public Switched Telephone Network (PSTN), a Local Area Network (LAN), a Wide Area Network (WAN), a Personal Area Network (PAN), or a combination of communication networks such as the Internet. The access device 740 may include one or more network interfaces of any type, wired or wireless (e.g., a Network Interface Controller (NIC)), such as an IEEE 802.11 Wireless Local Area Network (WLAN) interface, a Worldwide Interoperability for Microwave Access (WiMAX) interface, an Ethernet interface, a Universal Serial Bus (USB) interface, a cellular network interface, a Bluetooth interface, a Near Field Communication (NFC) interface, and so forth.
In one embodiment of the application, the above-described components of the computing device 700 and other components not shown in fig. 7 may also be connected to each other, for example, by a bus. It should be understood that the block diagram of the computing device architecture shown in FIG. 7 is for purposes of example only and is not limiting as to the scope of the present application. Those skilled in the art may add or replace other components as desired.
Computing device 700 may be any type of stationary or mobile computing device, including a mobile computer or mobile computing device (e.g., tablet, personal digital assistant, laptop, notebook, netbook, etc.), mobile phone (e.g., smartphone), wearable computing device (e.g., smartwatch, smartglasses, etc.), or other type of mobile device, or a stationary computing device such as a desktop computer or PC. Computing device 700 may also be a mobile or stationary server.
The processor 720 is configured to execute the computer-executable instructions of the keyword matching method.
The above is an illustrative scheme of a computing device of this embodiment. It should be noted that the technical solution of the computing device and that of the keyword matching method belong to the same concept; for details not described in the technical solution of the computing device, refer to the description of the technical solution of the keyword matching method.
An embodiment of the present application also provides a computer-readable storage medium storing computer instructions that, when executed by a processor, implement the steps of the keyword matching method.
The above is an illustrative scheme of a computer-readable storage medium of this embodiment. It should be noted that the technical solution of the storage medium and that of the keyword matching method belong to the same concept; for details not described in the technical solution of the storage medium, refer to the description of the technical solution of the keyword matching method.
The foregoing description of specific embodiments of the present application has been presented. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The computer instructions comprise computer program code, which may be in the form of source code, object code, an executable file, some intermediate form, or the like. The computer-readable medium may include: any entity or device capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disk, a computer memory, a Read-Only Memory (ROM), a Random Access Memory (RAM), an electrical carrier signal, a telecommunications signal, a software distribution medium, and the like.
It should be noted that, for the sake of simplicity, the above-mentioned method embodiments are described as a series of acts or combinations, but those skilled in the art should understand that the present application is not limited by the described order of acts, as some steps may be performed in other orders or simultaneously according to the present application. Further, those skilled in the art should also appreciate that the embodiments described in the specification are preferred embodiments and that the acts and modules referred to are not necessarily required in this application.
In the above embodiments, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
The preferred embodiments of the present application disclosed above are intended only to aid in the explanation of the application. Alternative embodiments are not exhaustive and do not limit the invention to the precise embodiments described. Obviously, many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles of the application and its practical applications, to thereby enable others skilled in the art to best understand and utilize the application. The application is limited only by the claims and their full scope and equivalents.

Claims (18)

1. A keyword matching method, the method comprising:
acquiring a target text;
performing character-by-character keyword matching on the target text to obtain a plurality of candidate keywords;
and matching the candidate keywords, and determining the candidate keywords having an association relationship as the target keywords of the target text.
2. The method according to claim 1, wherein the matching the candidate keywords and determining the candidate keywords having an association relationship as the target keywords of the target text comprises:
determining, from the candidate keywords, the candidate keywords that match all words belonging to the same association relationship in a word level library; the word level library stores a plurality of words and the association relationships among the words;
and taking the matched candidate keywords as target keywords of the target text.
3. The method according to claim 1, wherein the matching the candidate keywords and determining the candidate keywords having an association relationship as the target keywords of the target text comprises:
obtaining the association degree among the candidate keywords;
and matching the candidate keywords whose association degree is greater than or equal to an association degree threshold, and determining the candidate keywords having an association relationship as the target keywords of the target text.
4. The method according to claim 3, wherein the obtaining the association degree between the candidate keywords comprises:
acquiring positions of the candidate keywords in the target text;
and determining, by using the position of each candidate keyword, the separation distance between the candidate keywords in the target text as the association degree.
5. The method of claim 1, wherein said character-by-character keyword matching said target text to obtain a plurality of candidate keywords comprises:
sequentially matching each character of the target text with a character level library to obtain a plurality of candidate keywords;
and the characters in the character level library belong to preset keywords.
6. The method according to claim 1, wherein after matching the candidate keywords and determining candidate keywords having association relations as target keywords of the target text, the method further comprises:
inputting the target keywords into a classification model obtained by pre-training to obtain the type of the target text; the classification model is a neural network model obtained by training by utilizing sample keywords and type labels of sample texts corresponding to the sample keywords;
and outputting the type.
7. The method according to claim 1, wherein after matching the candidate keywords and determining candidate keywords having association relations as target keywords of the target text, the method further comprises:
generating a content summary of the target text by using the target keywords;
and outputting the content summary.
8. The method of claim 1, wherein obtaining the target text comprises:
acquiring a file to be matched and determining the file type of the file;
and when the file type is a text document, determining the file as a target text, otherwise, converting the file into the text document, and determining the converted text document as the target text.
9. A keyword matching apparatus, the apparatus comprising:
a text acquisition module configured to acquire a target text;
a single keyword matching module configured to perform character-by-character keyword matching on the target text to obtain a plurality of candidate keywords;
and a multi-keyword matching module configured to match the candidate keywords and determine the candidate keywords having an association relationship as the target keywords of the target text.
10. The apparatus of claim 9, wherein the multi-keyword matching module is further configured to:
determining, from the candidate keywords, the candidate keywords that match all words belonging to the same association relationship in a word level library; the word level library stores a plurality of words and the association relationships among the words;
and taking the matched candidate keywords as target keywords of the target text.
11. The apparatus of claim 9, wherein the multi-keyword matching module is further configured to:
obtaining the association degree among the candidate keywords;
and matching the candidate keywords whose association degree is greater than or equal to an association degree threshold, and determining the candidate keywords having an association relationship as the target keywords of the target text.
12. The apparatus of claim 11, wherein the multi-keyword matching module is further configured to:
acquiring positions of the candidate keywords in the target text;
and determining, by using the position of each candidate keyword, the separation distance between the candidate keywords in the target text as the association degree.
13. The apparatus of claim 9, wherein the single keyword matching module is further configured to:
sequentially matching each character of the target text with a character level library to obtain a plurality of candidate keywords;
and the characters in the character level library belong to preset keywords.
14. The apparatus of claim 9, further comprising an output module configured to:
after matching the candidate keywords and determining the candidate keywords having an association relationship as the target keywords of the target text, inputting the target keywords into a pre-trained classification model to obtain the type of the target text; the classification model is a neural network model trained using sample keywords and the type labels of the sample texts corresponding to the sample keywords;
and outputting the type.
15. The apparatus of claim 9, further comprising an output module configured to:
after the candidate keywords are matched and the candidate keywords having an association relationship are determined as the target keywords of the target text, generating a content summary of the target text by using the target keywords;
and outputting the content summary.
16. The apparatus of claim 9, wherein the text acquisition module is further configured to:
acquiring a file to be matched and determining the file type of the file;
and when the file type is a text document, determining the file as a target text, otherwise, converting the file into the text document, and determining the converted text document as the target text.
17. A computing device, comprising:
a memory and a processor;
the memory is configured to store computer-executable instructions, and the processor is configured to execute the computer-executable instructions to implement the steps of the keyword matching method of any one of claims 1 to 8.
18. A computer-readable storage medium storing computer instructions which, when executed by a processor, perform the steps of the keyword matching method of any one of claims 1 to 8.
CN202210068520.0A 2021-07-13 2022-01-20 Keyword matching method and device, computing equipment and computer readable storage medium Pending CN114416926A (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110789955X 2021-07-13
CN202110789955 2021-07-13

Publications (1)

Publication Number Publication Date
CN114416926A true CN114416926A (en) 2022-04-29

Family

ID=81276461

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210068520.0A Pending CN114416926A (en) 2021-07-13 2022-01-20 Keyword matching method and device, computing equipment and computer readable storage medium

Country Status (1)

Country Link
CN (1) CN114416926A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115525611A (en) * 2022-08-16 2022-12-27 北京矩阵分解科技有限公司 Method, device and equipment for inquiring key words in portable document format file
CN116975301A (en) * 2023-09-22 2023-10-31 腾讯科技(深圳)有限公司 Text clustering method, text clustering device, electronic equipment and computer readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination