CN108304424B - Text keyword extraction method and text keyword extraction device - Google Patents


Info

Publication number
CN108304424B
Authority
CN
China
Prior art keywords
text
trained
network model
extracted
keyword
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710203566.8A
Other languages
Chinese (zh)
Other versions
CN108304424A (en
Inventor
包恒耀
苏可
饶孟良
陈益
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN201710203566.8A priority Critical patent/CN108304424B/en
Publication of CN108304424A publication Critical patent/CN108304424A/en
Application granted granted Critical
Publication of CN108304424B publication Critical patent/CN108304424B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3344Query execution using natural language analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33Querying
    • G06F16/3331Query processing
    • G06F16/334Query execution
    • G06F16/3346Query execution using probabilistic model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method and a device for extracting text keywords are provided. In one embodiment, the method comprises the following steps: acquiring a text to be extracted; searching an associated keyword library and matching the keywords in the text to be extracted; determining all text sentence patterns and corresponding keyword combinations according to the text to be extracted and the matched keywords; determining, by analysis with a keyword probability network model, the probability that each text sentence pattern and its corresponding keyword combination holds; and determining the keyword combination with the highest of the determined probabilities as the keyword combination extracted from the text to be extracted. The scheme of the embodiment responds quickly, reduces the difficulty of extracting text keywords, and improves the accuracy of the extracted keywords.

Description

Text keyword extraction method and text keyword extraction device
Technical Field
The invention relates to the field of intelligent interaction, in particular to a text keyword extraction method and a text keyword extraction device.
Background
Take an intelligent interaction device such as a smart speaker or an intelligent assistant as an example. Such a device generally interacts with the user in the form of a conversation: the user's speech is recognized as text, and keywords (also called entity words in some technical applications) are then extracted from that text. In such interaction, however, the text is usually very short, only a few words, which makes it very difficult to extract the keywords it contains (e.g., singer name, song title). Moreover, unlike long texts, short texts of this kind cannot be crawled from the internet in large quantities, and no large body of public labeled data is available. Public corpus data in a given vertical field is therefore scarce, and developers must collect it themselves, which is a serious disadvantage during the cold-start stage of a project. A text keyword extraction method that obtains better results under these conditions is therefore needed.
At present, when no labeled data is available, text keywords are extracted mainly with the maximum matching algorithm or with template matching. The maximum matching algorithm is commonly used in Chinese word segmentation systems and includes forward maximum matching and reverse maximum matching. In forward maximum matching, runs of consecutive characters in the text to be segmented are matched from left to right against the word list of an entity library (also called a keyword library), and when several entries match, the longest one is segmented out. For example, if the short text is "I want to listen to the song of ABC" (where A, B, and C each represent a specific character) and the singer entity library is {"AB", "ABC"}, then by the maximum matching principle the extracted entity (keyword) is "ABC" rather than "AB". The template matching method designs some commonly used templates in advance, such as "I want to listen to [song] of [singer]". If the user's query string is "I want to listen to SX of ABC", the candidate keywords "ABC" and "SX" can be extracted through template matching; each is then looked up in the corresponding entity library, and if it is found there, the result is returned. Both methods have drawbacks. The maximum matching algorithm is fast, but its results are poor, and it cannot distinguish keywords that share the same name; for example, "kissing" might be both a song and an album. With template matching, users phrase their requests in widely varying ways, so achieving good results may require hundreds of thousands of templates per scene; this makes matching slow, and no keywords are extracted at all once the user's query pattern is not covered by a template.
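As a non-limiting illustration of the forward maximum matching described above, the following minimal Python sketch scans a text from left to right and, at each position, takes the longest entry found in an entity library. The function name `forward_max_match` and the return format are assumptions made for illustration; they do not come from the original text.

```python
def forward_max_match(text, lexicon):
    """Scan left to right; at each position take the longest lexicon entry.

    Returns a list of (start_index, matched_entry) pairs.
    """
    max_len = max((len(entry) for entry in lexicon), default=0)
    matches = []
    i = 0
    while i < len(text):
        # Try the longest possible candidate first, then shorter ones.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            if candidate in lexicon:
                matches.append((i, candidate))
                i += size
                break
        else:
            i += 1  # no entry starts at this position; move on one character
    return matches


# Example from the background section: "ABC" wins over the shorter "AB".
singers = {"AB", "ABC"}
print(forward_max_match("I want to listen to the song of ABC", singers))
```

Because the sketch only consults a single library, it illustrates the limitation noted above: it has no way to decide whether a matched name is a song or an album.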
Disclosure of Invention
Therefore, the present embodiment provides a text keyword extraction method and a text keyword extraction apparatus that improve the accuracy of the extracted keywords and respond quickly.
A text keyword extraction method comprises the following steps:
acquiring a text to be extracted;
searching in the associated keyword library, and matching out the keywords in the text to be extracted;
determining all text sentence patterns and corresponding keyword combinations according to the text to be extracted and the matched keywords in the text to be extracted;
determining, by analysis with the keyword probability network model, the probability that each text sentence pattern and its corresponding keyword combination holds;
and determining the keyword combination corresponding to the highest of the probabilities so determined as the keyword combination extracted from the text to be extracted.
A text keyword extraction apparatus comprising:
the text acquisition module is used for acquiring a text to be extracted;
the keyword matching module is used for searching in the associated keyword library and matching out the keywords in the text to be extracted;
the combination determining module is used for determining all text sentence patterns and corresponding keyword combinations according to the texts to be extracted and the matched keywords in the texts to be extracted;
the probability analysis module is used for determining, by analysis with the keyword probability network model, the probability that each text sentence pattern and its corresponding keyword combination holds;
and the extraction determining module is used for determining the keyword combination corresponding to the highest of the probabilities determined by the probability analysis module as the keyword combination extracted from the text to be extracted.
According to the scheme of the embodiment, when keywords need to be extracted from a text, the associated keyword library is searched to match the keywords in the text to be extracted; all text sentence patterns and corresponding keyword combinations are then determined from those keywords; the probability that each text sentence pattern and its corresponding keyword combination holds is determined by analysis with the keyword probability network model; and the keyword combination with the highest of these probabilities is determined as the keyword combination extracted from the text to be extracted. That is, once the keywords in the text to be extracted have been matched, all text sentence patterns and corresponding keyword combinations are determined, and the probability of each is then determined with the keyword probability network model.
Drawings
FIG. 1 is a schematic diagram of an application environment of a scheme in one embodiment;
FIG. 2 is a schematic diagram of a composition structure of a terminal in one embodiment;
FIG. 3 is a schematic diagram of a component architecture of a server in one embodiment;
FIG. 4 is a flow diagram that illustrates a method for text keyword extraction, in one embodiment;
FIG. 5 is a schematic diagram of generating a keyword probabilistic network model in one embodiment;
FIG. 6 is a schematic diagram of extracting text keywords, in one embodiment;
FIG. 7 is a schematic flow chart of generating a keyword probability network model in one specific example;
FIG. 8 is a schematic flow chart of generating a keyword probability network model in another specific example;
FIG. 9 is a schematic flow chart of generating a keyword probability network model in another specific example;
FIG. 10 is a block diagram of a text keyword extraction apparatus according to an embodiment;
FIG. 11 is a block diagram of a text keyword extraction apparatus in another embodiment;
FIG. 12 is a block diagram of a model generation module in one particular example.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here are intended only to explain the invention and do not limit its scope of protection.
FIG. 1 shows a schematic diagram of an operating environment in an embodiment of the present invention. As shown in FIG. 1, the operating environment involves a terminal 101 and may also involve a server 102; the terminal 101 and the server 102 may communicate through a network. The terminal 101 may interact intelligently with a terminal user: it receives text content input by the user, or recognizes the user's speech as text content, and extracts the keywords in the text content so that subsequent services can be performed, such as querying locally or over the network for a song and playing it, querying locally or over the network for a movie, or querying the weather, in each case based on the extracted keywords. The keywords may be extracted at the terminal 101, or at the server 102 after the terminal 101 transmits the text content to the server 102. In this embodiment, a keyword probability network model may be used when extracting the keywords in the text content. The keyword probability network model may be determined by the server 102 and stored locally at the server 102, which then performs the keyword extraction; or the server 102 may send the model to the terminal 101, which then performs the keyword extraction. Alternatively, the keyword probability network model may be determined by a terminal 101, transmitted to the server 102, and distributed by the server 102 to other terminals 101 for use. The embodiments of the invention concern the scheme by which the terminal 101 or the server 102 extracts keywords from text content.
A schematic diagram of the structure of the terminal 101 in one embodiment is shown in fig. 2. The terminal 101 includes a processor, a storage medium, a communication interface, a power interface, and a memory connected by a system bus. A storage medium of the terminal 101 stores a text keyword extraction device, and the device is used for implementing a text keyword extraction method. The communication interface of the terminal 101 is used for connecting and communicating with the server 102 or other servers in the network, and the power interface of the terminal 101 is used for connecting with an external power supply, which supplies power to the terminal 101 through the power interface. The terminal 101 may be any device capable of realizing intelligent input and output, such as a mobile terminal (e.g., a mobile phone, a tablet computer, etc.), a smart speaker, etc.; other intelligent devices with the above structure are also possible.
A schematic diagram of the structure of the server 102 in one embodiment is shown in FIG. 3. The server 102 includes a processor, a power supply module, a storage medium, a memory, and a communication interface connected through a system bus. The storage medium of the server 102 stores an operating system, a database, and a text keyword extraction apparatus, which is used to implement a text keyword extraction method. The communication interface of the server 102 is used for connecting to and communicating with the terminal 101 and other servers in the network.
Fig. 4 is a flowchart illustrating a text keyword extraction method in an embodiment, and as shown in fig. 4, the text keyword extraction method in the embodiment includes:
step S401: acquiring a text to be extracted;
step S402: searching in the associated keyword library, and matching out the keywords in the text to be extracted;
step S403: determining all text sentence patterns and corresponding keyword combinations according to the text to be extracted and the matched keywords in the text to be extracted, wherein each determined text sentence pattern, together with its corresponding keyword combination, constitutes the text to be extracted;
step S404: determining, by analysis with the keyword probability network model, the probability that each text sentence pattern and its corresponding keyword combination holds;
step S405: and determining the keyword combination corresponding to the highest of the probabilities so determined as the keyword combination extracted from the text to be extracted.
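Steps S402 to S405 can be sketched, under stated assumptions, as follows. The sketch matches library entries in the text, enumerates every non-overlapping selection of matches as a (sentence pattern, keyword combination) pair, scores each pair, and keeps the highest-scoring one. All names here (`find_matches`, `enumerate_candidates`, `extract`, `toy_score`) are hypothetical, and `toy_score` is only a lookup table standing in for the keyword probability network model, whose actual form is described later in the embodiment.

```python
from itertools import combinations


def find_matches(text, library):
    """S402: return every (start, end, slot, keyword) occurrence of a library entry."""
    found = []
    for slot, names in library.items():
        for name in names:
            start = text.find(name)
            while start != -1:
                found.append((start, start + len(name), slot, name))
                start = text.find(name, start + 1)
    return sorted(found)


def enumerate_candidates(text, matches):
    """S403: yield (sentence_pattern, keyword_combination) for each non-overlapping subset."""
    for r in range(len(matches) + 1):
        for subset in combinations(matches, r):
            # Matches are sorted by start, so checking neighbours is enough.
            if any(a[1] > b[0] for a, b in zip(subset, subset[1:])):
                continue
            parts, pos = [], 0
            for start, end, slot, _ in subset:
                parts.append(text[pos:start])
                parts.append(f"[{slot}]")
                pos = end
            parts.append(text[pos:])
            yield "".join(parts), tuple((slot, name) for _, _, slot, name in subset)


def extract(text, library, score):
    """S404 and S405: score every candidate and keep the most probable one."""
    candidates = enumerate_candidates(text, find_matches(text, library))
    return max(candidates, key=lambda cand: score(*cand))


# Hypothetical stand-in for the keyword probability network model.
TOY_PATTERN_PROB = {
    "I want to listen to [song] of [singer]": 0.6,
    "I want to listen to [album] of [singer]": 0.3,
}


def toy_score(pattern, combination):
    return TOY_PATTERN_PROB.get(pattern, 1e-6)


library = {"singer": {"AB", "ABC"}, "song": {"SX"}, "album": {"SX"}}
print(extract("I want to listen to SX of ABC", library, toy_score))
```

In this toy setting "SX" is both a song and an album, and the scoring step, rather than the matching step, resolves the ambiguity. Enumerating all subsets grows quickly with the number of matches, which is acceptable only because the texts in question are short.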
According to the scheme of the embodiment, when keywords need to be extracted from a text, the associated keyword library is searched to match the keywords in the text to be extracted; all text sentence patterns and corresponding keyword combinations are then determined from those keywords; the probability that each text sentence pattern and its corresponding keyword combination holds is determined by analysis with the keyword probability network model; and the keyword combination with the highest of these probabilities is determined as the keyword combination extracted from the text to be extracted. That is, once the keywords in the text to be extracted have been matched, all text sentence patterns and corresponding keyword combinations are determined, and the probability of each is then determined with the keyword probability network model.
The scheme in the above embodiment may be executed on a terminal, or may be executed on a server.
In an example of implementation on the terminal, the text to be extracted may be a text input by the terminal user, for example, a text input by the terminal user through a user interactive device such as a keyboard and a touch screen, or a text obtained by recognizing a voice of the terminal user. In this embodiment, the manner of acquiring the text to be extracted may be to receive a text input by a user, or to translate a voice input by the user into a text, and in other embodiments, the text to be extracted may also be acquired in other manners.
On the other hand, in the example of execution on a terminal, the keyword probability network model may be generated in advance by the terminal; in this case, before the text to be extracted is acquired, the method may further include the step of generating the keyword probability network model. Alternatively, the server may generate the keyword probability network model and the terminal may acquire it from the server; in this case, before the text to be extracted is acquired, the method may further include the step of acquiring the keyword probability network model generated by the server.
Taking the implementation on the server as an example, the text to be extracted may be received from the terminal, and the terminal uploads the text to be extracted to the server after obtaining the text to be extracted. The text to be extracted may be a text input by the terminal user, for example, a text input by the terminal user through a user interactive device such as a keyboard and a touch screen, or a text obtained by recognizing a voice of the terminal user, or a text obtained in other manners in other embodiments.
On the other hand, in the example of execution on the server, the keyword probability network model may be generated in advance by the server; in this case, before the text to be extracted is acquired, the method may further include the step of generating the keyword probability network model.
In a specific example, when the terminal or the server generates the keyword probability network model, specific manners may include:
acquiring a text to be trained, wherein the text to be trained comprises sentence pattern rule templates and corpus texts in various fields;
and training according to the text to be trained to obtain the keyword probability network model.
The sentence pattern rule template indicates a specific sentence pattern rule. Since the set sentence pattern rule may not include all sentence patterns, such as some spoken sentence patterns, the text to be trained may further include corpus texts of various fields, which may be some spoken texts. In a specific application implementation manner, the corpus texts in each field can be obtained in a crawler crawling manner.
Because the text to be trained comprises two kinds of text, namely sentence pattern rule templates and corpus texts in various fields, the way these texts are used during training can be chosen according to actual technical requirements.
In a specific example, training need not distinguish whether a text to be trained is a sentence pattern rule template or a corpus text in each field; in each training round, one text to be trained is selected at random. A specific manner may include:
randomly extracting a current text to be trained from the text to be trained, wherein the current text to be trained is a sentence pattern rule template or a corpus text, namely the extracted current text to be trained can be the sentence pattern rule template or the corpus text;
inputting the extracted current text to be trained into a current network model to be trained for training to obtain a trained network model to be trained;
when the sentence pattern rule templates in the text to be trained or the corpus texts in the fields are not extracted completely, updating the current network model to be trained by using the trained network model to be trained, and returning to the step of randomly extracting one current text to be trained from the text to be trained until the sentence pattern rule templates in the text to be trained and the corpus texts in the fields are extracted completely;
and determining the obtained trained network model to be trained as the keyword probability network model.
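The random-order procedure above can be sketched as a simple loop. The sketch assumes that each text is drawn exactly once (the steps end when everything has been extracted) and that a caller-supplied `train_step(model, text)` function, hypothetical here, performs one training update and returns the trained model.

```python
import random


def train_in_random_order(texts, model, train_step, seed=None):
    """Draw each training text once, in random order, updating the model each time."""
    remaining = list(texts)  # templates and corpus texts, not distinguished
    rng = random.Random(seed)
    while remaining:  # stop once every text has been extracted
        current = remaining.pop(rng.randrange(len(remaining)))
        # The trained model replaces the current model to be trained.
        model = train_step(model, current)
    return model


# Usage with a placeholder "model" that merely records what it was trained on.
texts = ["I want to listen to [song] of [singer]", "I want to listen to a song"]
seen = train_in_random_order(texts, [], lambda m, t: m + [t], seed=0)
print(sorted(seen) == sorted(texts))
```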
In the above specific example, the current network model to be trained is updated with the trained network model only after it is determined that the sentence pattern rule templates or the corpus texts in the text to be trained have not all been extracted. In a specific technical application, the order may be reversed: the current network model to be trained is first updated with the trained network model, and it is then determined whether the sentence pattern rule templates and the corpus texts in the text to be trained have all been extracted. In that case, once they have all been extracted, the updated current network model to be trained is determined as the keyword probability network model.
Based on the above example of obtaining the keyword probability network model by training, it can be understood that, since one current text to be trained is randomly extracted from the text to be trained each time, the current text to be trained randomly extracted from the text to be trained in two adjacent training processes may be the same type of text, such as sentence pattern rule templates or corpus texts, or different types of texts, such as sentence pattern rule templates extracted at one time and corpus texts extracted at another time.
In another specific example, the number of sentence pattern rule templates and the number of corpus texts in the text to be trained may be set to be the same. In this case, training on sentence pattern rule templates and training on corpus texts in each field may alternate, and a specific manner may include:
extracting a sentence pattern rule template from each sentence pattern rule template, and inputting the extracted sentence pattern rule template into the current network model to be trained for training to obtain the trained network model to be trained;
after the trained network model to be trained is used for updating the current network model to be trained, a corpus text is extracted from the corpus texts in each field, and the extracted corpus text is input into the updated current network model to be trained for training, so that the trained network model to be trained is obtained;
when the sentence pattern rule templates in the text to be trained or the corpus texts in the fields are not extracted completely, updating the current network model to be trained by using the trained network model to be trained, and returning to the step of extracting one sentence pattern rule template from each sentence pattern rule template until the sentence pattern rule templates in the text to be trained and the corpus texts in the fields are extracted completely;
and determining the obtained trained network model to be trained as the keyword probability network model.
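The alternating procedure can be sketched in the same way. As before, `train_step(model, text)` is a hypothetical caller-supplied function that performs one training update; the equal-count check reflects the condition stated for this example.

```python
def train_alternating(templates, corpus_texts, model, train_step):
    """Alternate one sentence pattern rule template and one corpus text per round."""
    if len(templates) != len(corpus_texts):
        raise ValueError("this variant assumes equal numbers of templates and corpus texts")
    for template, corpus_text in zip(templates, corpus_texts):
        model = train_step(model, template)      # train on a template, then update
        model = train_step(model, corpus_text)   # train on a corpus text, then update
    return model


# Usage with a placeholder "model" that records the training order.
order = train_alternating(["t1", "t2"], ["c1", "c2"], [], lambda m, t: m + [t])
print(order)
```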
In the above description of this specific example, the current network model to be trained is updated with the trained network model, and the process returns to extracting a sentence pattern rule template, only after it is determined that the sentence pattern rule templates or the corpus texts in the text to be trained have not all been extracted. In a specific technical application, the order may be reversed: the current network model to be trained is first updated with the trained network model, and it is then determined whether the sentence pattern rule templates and the corpus texts in the text to be trained have all been extracted. In that case, once they have all been extracted, the updated current network model to be trained is determined as the keyword probability network model.
In the two specific examples above, when the extracted sentence pattern rule template or corpus text is input into the current network model to be trained, it may be input in units of characters, so as to obtain better generalization ability.
Based on the above embodiment and its specific examples, in a specific implementation the scheme of this embodiment may be divided into two processes: offline model training and online text entity extraction. In offline model training, after the training data (the text to be trained) is obtained, the model is trained on it to obtain the final keyword probability network model, as shown in FIG. 5. In the online text entity extraction stage, the obtained keyword probability network model is used to extract keywords, as shown in FIG. 6.
For offline model training, two types of training data may be prepared. The first type is the rule templates of each vertical service field. Taking a music scene as an example, the rule templates may be: "I want to listen to [song] of [singer]", "who sings [song]", "which songs are in [album]", where [singer] represents a singer, [song] represents a song, and [album] represents an album. These rule templates are referred to as sentence pattern rule templates in this embodiment. Different vertical service fields, such as music, movies, and weather, may have different sentence pattern rule templates, so that a separate keyword probability network model is trained for each vertical service field.
The sentence pattern rule templates may be collected as corpora annotated from gathered user data, or may be simple templates written by hand by developers. Typically they are rule templates written by developers when each vertical service field is first set up.
The other type of training data may be corpus text from non-vertical fields, typically spoken-language corpus data. It supplements expressions that may be absent from the sentence pattern rule templates and thereby improves the generalization ability of the trained keyword probability network model; it is referred to as corpus text in each field in this embodiment. Take "I want to listen to a song" as an example: if the sentence pattern "listen to a song" does not appear in the sentence pattern rule templates, the generalization ability of the trained model is poor, and when "listen to a song" appears in a text to be extracted, the keywords may go unrecognized or the wrong keywords may be recognized. Adding some spoken texts (corpus texts in each field) therefore improves the generalization ability of the trained model. So as not to affect keyword extraction in each vertical field, this part of the corpus text may be selected from non-vertical fields, which makes it suitable for training the model of every vertical field. In a specific application, the corpus texts may be obtained by a web crawler, and the number of corpus texts to crawl may be determined according to actual needs.
Once the text to be trained is obtained, training can be performed according to actual needs. As described above, the specific training process may vary depending on how the numbers of sentence pattern rule templates and corpus texts in the text to be trained are set.
Fig. 7 is a schematic flow chart illustrating a process of generating a keyword probability network model in a specific example, which is described by taking an example of not distinguishing whether a text to be trained is a sentence rule template or a corpus text of each field.
As shown in FIG. 7, a text to be trained is first obtained, which includes sentence pattern rule templates and corpus texts in each field. Taking the music field as an example, the sentence pattern rule templates may be "I want to listen to [song] of [singer]", "who sings [song]", "which songs are in [album]", and the like, and the corpus texts in each field may include a text such as "I want to listen to a song". A sentence pattern rule template is a text applicable only to the current field, whereas a corpus text in each field is applicable not only to the current field but also to other fields.
Subsequently, as shown in fig. 7, the specific training process may be:
randomly extracting a sentence pattern rule template or a corpus text from a text to be trained, namely the extracted current text to be trained can be the sentence pattern rule template or the corpus text;
inputting the extracted sentence pattern rule template or corpus text into the current network model to be trained for training to obtain a trained network model to be trained;
judging whether the extraction of all sentence pattern rule templates and the corpus texts in all fields in the text to be trained is finished or not;
if the extraction is not finished, namely at least one text in the sentence pattern rule template in the text to be trained or the corpus text of each field is not extracted, updating the current network model to be trained by using the trained network model to be trained, returning to the step of randomly extracting one sentence pattern rule template or corpus text from the text to be trained, and repeating the process until the extraction of each sentence pattern rule template in the text to be trained and the corpus text of each field is finished;
and if the extraction is finished, namely the sentence pattern rule template in the text to be trained and the corpus text in each field are extracted, determining the obtained trained network model to be the keyword probability network model, and finishing the training process.
In the above specific example, the current network model to be trained is updated with the trained network model to be trained only when it is determined that the extraction of the sentence pattern rule templates or the corpus texts of the fields is not finished. In a specific technical application, the current network model to be trained may instead be updated with the trained network model to be trained first, and only then is it judged whether the extraction of the sentence pattern rule templates and the corpus texts of the fields is finished. In that case, once the extraction is finished, the updated current network model to be trained is determined as the keyword probability network model.
When the extracted sentence pattern rule template or corpus text is input into the current network model to be trained, it may be input in units of characters. Compared with input in units of words, which yields very sparse results and a poor effect when the corpus is small, character-unit input gives better generalization capability and improves the accuracy of keyword extraction on short texts. Different keyword probability network models may be trained for different service fields.
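Character-unit input can be illustrated with a short sketch. The source does not say how slot markers such as "[singer]" are fed to the network. Keeping each marker as one unit is an assumption made here, and `SLOT_PATTERN` and `to_char_units` are hypothetical names.

```python
import re
from typing import List

# Assumed slot syntax: lowercase letters in square brackets, e.g. "[singer]".
SLOT_PATTERN = re.compile(r"\[[a-z]+\]")


def to_char_units(text: str) -> List[str]:
    """Split text into single characters, keeping each slot marker whole."""
    units: List[str] = []
    pos = 0
    for match in SLOT_PATTERN.finditer(text):
        units.extend(text[pos:match.start()])  # plain characters before the slot
        units.append(match.group())            # the slot marker as one unit
        pos = match.end()
    units.extend(text[pos:])                   # characters after the last slot
    return units
```

With this split, a template and a corpus text share the same small vocabulary of characters, which is the property the paragraph above relies on when the corpus is small.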
The current network model to be trained may be any suitable model chosen according to actual needs. In a specific application example, an LSTM (Long Short-Term Memory) network may be used as the model to be trained. As a special recurrent neural network, LSTM can learn long-term dependencies well, and it can be used to approximate the probability that a sentence pattern holds. Since an LSTM network has many unknown parameters, the specific values of these parameters are estimated through the training process, and the trained network is then used when extracting the keywords in the text to be extracted. In the training process, the LSTM network may be trained with the BPTT (Back Propagation Through Time) algorithm.
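The source does not state how the network turns a sequence of character units into a probability. One common reading, given here only as an assumption, is the language-model chain rule, where $c_1, \dots, c_T$ (hypothetical notation) are the units of a sentence pattern and each factor is the network's prediction of the next unit given the preceding ones:

```latex
P(c_1, \dots, c_T) = \prod_{t=1}^{T} P(c_t \mid c_1, \dots, c_{t-1})
```

Under this reading, the unknown parameters estimated by training are those that define each conditional factor.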
Fig. 8 is a schematic flow chart of generating the keyword probability network model in another specific example. In this example, the number of sentence pattern rule templates in the text to be trained is the same as the number of corpus texts, and the sentence pattern rule templates and the corpus texts of the fields are trained alternately.
As shown in fig. 8, the specific training process may be:
extracting a sentence pattern rule template from each sentence pattern rule template;
inputting the extracted sentence pattern rule template into the current network model to be trained for training to obtain a trained network model to be trained;
updating the current network model to be trained by using the trained network model to be trained;
extracting a corpus text from the corpus texts in each field;
inputting the extracted corpus text into the updated current network model to be trained for training to obtain a trained network model to be trained;
judging whether the sentence pattern rule templates in the text to be trained and the corpus texts in the fields are extracted completely;
if not, namely the sentence pattern rule templates in the text to be trained or the corpus texts in the fields are not extracted, updating the current network model to be trained by using the trained network model to be trained, and returning to the step of extracting one sentence pattern rule template from each sentence pattern rule template until the sentence pattern rule templates in the text to be trained and the corpus texts in the fields are extracted;
and if the extraction is finished, determining the obtained trained network model to be trained as the keyword probability network model.
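The alternating flow of Fig. 8 can be sketched as below. The training pass is passed in as `train_step`, a hypothetical callable that takes the current model and one text and returns the trained model. The function and parameter names are assumptions for illustration only.

```python
from typing import Callable, List, TypeVar

Model = TypeVar("Model")


def train_alternating(
    templates: List[str],
    corpus_texts: List[str],
    initial_model: Model,
    train_step: Callable[[Model, str], Model],
) -> Model:
    """Train on one template, then one corpus text, until both lists are used up."""
    if len(templates) != len(corpus_texts):
        raise ValueError("this flow assumes equal numbers of templates and corpus texts")
    current_model = initial_model
    for template, corpus_text in zip(templates, corpus_texts):
        current_model = train_step(current_model, template)     # train on a template, then update
        current_model = train_step(current_model, corpus_text)  # train on a corpus text, then update
    return current_model
```

Swapping the two `train_step` lines gives the variant in which a corpus text is trained on before a sentence pattern rule template.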
In the above specific example, a sentence pattern rule template is extracted and trained on first, followed by a corpus text of a field. In another example, a corpus text of a field may be extracted and trained on first, followed by a sentence pattern rule template.
In addition, in the above specific example, when it is determined that the extraction of the sentence pattern rule templates or the corpus texts of the fields in the text to be trained is not finished, the current network model to be trained is updated with the trained network model to be trained, and the process returns to extracting a sentence pattern rule template. In a specific technical application, as shown in Fig. 9, the current network model to be trained may instead be updated with the trained network model to be trained first, and only then is it judged whether the extraction of the sentence pattern rule templates and the corpus texts of the fields is finished. In that case, once the extraction is finished, the updated current network model to be trained is determined as the keyword probability network model.
Other technical features of generating the keyword probability network model in the examples shown in Figs. 8 and 9 may be the same as those in the example shown in Fig. 7.
After the keyword probability network model is obtained through training, it can be applied to extract the keywords in a text to be extracted. When the server trains the keyword probability network model, the server may send the model to the terminal so that the terminal extracts the text keywords, or the server may extract the text keywords itself after receiving the text to be extracted sent by the terminal. When the terminal trains the keyword probability network model, the terminal may extract text keywords based on the model, or the model may be sent to the server and then distributed by the server to other terminals, so that the server and each terminal can extract text keywords based on the model.
When text key words are extracted specifically, a text to be extracted is obtained first, where the text to be extracted may be a text input by a terminal user through a user interactive device such as a keyboard and a touch screen, may also be a text obtained by recognizing a voice of the terminal user, and may also be a text obtained in other manners.
In this embodiment, after the text to be extracted is obtained, the field to which the text belongs may be determined, and the text keywords may then be extracted by combining the keyword library and the keyword probability network model corresponding to that field. When text keywords are extracted for only one field, for example on an intelligent sound box, the extraction can directly use a default keyword library and keyword probability network model. When text keywords may be extracted for a plurality of fields, for example when the extraction is executed at a server, the field to which the text belongs may be determined first, and the extraction is then performed with the keyword library and the keyword probability network model corresponding to that field. In the following examples, the field to which the text belongs is taken to have been determined already.
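One way to picture the per-field lookup is a registry keyed by field name. The sketch below is an assumption about how such a lookup might be organized. `DomainResources`, `resolve_resources`, and the field names are hypothetical, and the model is represented only as a callable that returns a probability.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Optional, Set


@dataclass
class DomainResources:
    keyword_library: Dict[str, Set[str]]         # slot name -> keywords, e.g. "singer" -> {"ABC"}
    pattern_probability: Callable[[str], float]  # stand-in for the field's probability model


def resolve_resources(
    registry: Dict[str, DomainResources],
    domain: Optional[str] = None,
    default_domain: Optional[str] = None,
) -> DomainResources:
    """Return the library and model for `domain`, or for the default field if none is given."""
    key = domain if domain is not None else default_domain
    if key is None or key not in registry:
        raise KeyError(f"no keyword library and model registered for field {key!r}")
    return registry[key]
```

A single-field device would call `resolve_resources` with only `default_domain` set, while a server handling several fields would pass the field it determined for the incoming text.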
After the text to be extracted is obtained, the keyword library associated with the field to which the text belongs is searched, and the keywords in the text to be extracted are matched, so that the keywords in the text are listed. All text sentence patterns and corresponding keyword combinations are then determined according to the text to be extracted and the matched keywords, where any determined text sentence pattern together with its corresponding keyword combination forms the text to be extracted. It can be understood by those skilled in the art that matching the keywords means finding all words in the text to be extracted that match words in the keyword library, and that determining all text sentence patterns and corresponding keyword combinations means finding all possible sentence patterns of the text together with the keywords in each sentence pattern. Suppose the text to be extracted is "I want to listen to ABC's QLX", the singer entity library is {"AB", "ABC"}, and the song entity library is {"QLX"}, where A, B, C, Q, L, and X each represent a specific word or character. Then, according to the singer entity library and the song entity library, the keywords matched in the text are AB, ABC, and QLX, and the possible text sentence patterns include: "I want to listen to ABC's QLX", "I want to listen to [singer]C's QLX", "I want to listen to [singer]'s QLX", "I want to listen to [singer]C's [song]", and "I want to listen to [singer]'s [song]". The possible text sentence patterns and corresponding keyword combinations are shown in Table 1 below.
TABLE 1
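| Possible combination (text sentence pattern) | [singer] | [song] | Probability |
| --- | --- | --- | --- |
| I want to listen to ABC's QLX | | | 0.001 |
| I want to listen to [singer]C's QLX | AB | | 0.002 |
| I want to listen to [singer]'s QLX | ABC | | 0.009 |
| I want to listen to [singer]C's [song] | AB | QLX | 0.011 |
| I want to listen to [singer]'s [song] | ABC | QLX | 0.051 |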
As can be seen from Table 1 above, the sentence pattern "I want to listen to [singer]C's QLX" and its corresponding keyword combination "[singer]: AB" together form the original text to be extracted, "I want to listen to ABC's QLX"; the sentence pattern "I want to listen to [singer]'s QLX" and its corresponding keyword combination "[singer]: ABC" together form the same original text; and the sentence pattern "I want to listen to [singer]C's [song]" and its corresponding keyword combination "[singer]: AB; [song]: QLX" likewise together form the original text.
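The matching and combination steps can be sketched as follows. The source gives no enumeration algorithm, so this version, which substitutes every non-overlapping subset of the matched keywords, is an assumption, and `find_matches` and `enumerate_candidates` are hypothetical names. For the running example it yields the five combinations of Table 1 plus one more, in which only the song is substituted.

```python
from itertools import combinations
from typing import Dict, List, Set, Tuple

Match = Tuple[int, int, str, str]                # (start, end, slot, keyword)
Candidate = Tuple[str, List[Tuple[str, str]]]    # (sentence pattern, [(slot, keyword), ...])


def find_matches(text: str, libraries: Dict[str, Set[str]]) -> List[Match]:
    """Find every occurrence of every library keyword in the text."""
    matches: List[Match] = []
    for slot, keywords in libraries.items():
        for keyword in keywords:
            start = text.find(keyword)
            while start != -1:
                matches.append((start, start + len(keyword), slot, keyword))
                start = text.find(keyword, start + 1)
    return sorted(matches)


def enumerate_candidates(text: str, libraries: Dict[str, Set[str]]) -> List[Candidate]:
    """Build one sentence pattern per non-overlapping subset of matches."""
    matches = find_matches(text, libraries)
    candidates: List[Candidate] = []
    for size in range(len(matches) + 1):
        for subset in combinations(matches, size):
            # Matches are sorted by start, so checking neighbours detects any overlap.
            if any(a[1] > b[0] for a, b in zip(subset, subset[1:])):
                continue
            pattern, pos, keywords = "", 0, []
            for start, end, slot, keyword in subset:
                pattern += text[pos:start] + f"[{slot}]"  # replace the keyword with its slot marker
                keywords.append((slot, keyword))
                pos = end
            candidates.append((pattern + text[pos:], keywords))
    return candidates
```

Substituting each keyword back into its slot marker reproduces the original text, which is the property the paragraph above describes.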
Then, each text sentence pattern is input into the keyword probability network model to obtain the probability that each text sentence pattern and corresponding keyword combination holds, as shown in the last column of Table 1 above. As can be seen from Table 1, the maximum probability value is 0.051, which belongs to the sentence pattern "I want to listen to [singer]'s [song]". The text sentence pattern and keyword combination corresponding to this maximum probability are therefore selected, and the finally extracted keyword combination is {[singer]: ABC; [song]: QLX}.
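The selection step reduces to taking the candidate with the largest probability. In the sketch below the probabilities are copied from Table 1 and a dictionary lookup stands in for the keyword probability network model. The names `TABLE_1`, `CANDIDATES`, and `pick_keywords` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Candidate = Tuple[str, Dict[str, str]]  # (sentence pattern, {slot: keyword})

# Probabilities from Table 1, keyed by sentence pattern (stand-in for the model output).
TABLE_1: Dict[str, float] = {
    "I want to listen to ABC's QLX": 0.001,
    "I want to listen to [singer]C's QLX": 0.002,
    "I want to listen to [singer]'s QLX": 0.009,
    "I want to listen to [singer]C's [song]": 0.011,
    "I want to listen to [singer]'s [song]": 0.051,
}

CANDIDATES: List[Candidate] = [
    ("I want to listen to ABC's QLX", {}),
    ("I want to listen to [singer]C's QLX", {"singer": "AB"}),
    ("I want to listen to [singer]'s QLX", {"singer": "ABC"}),
    ("I want to listen to [singer]C's [song]", {"singer": "AB", "song": "QLX"}),
    ("I want to listen to [singer]'s [song]", {"singer": "ABC", "song": "QLX"}),
]


def pick_keywords(
    candidates: List[Candidate],
    pattern_probability: Callable[[str], float],
) -> Dict[str, str]:
    """Return the keyword combination whose sentence pattern has the highest probability."""
    _, best_keywords = max(candidates, key=lambda c: pattern_probability(c[0]))
    return best_keywords
```

Calling `pick_keywords(CANDIDATES, TABLE_1.__getitem__)` selects the last row of Table 1, matching the result stated above.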
Based on the same idea as the above method, the present embodiment further provides a text keyword extraction apparatus, and fig. 10 shows a schematic structural diagram of the text keyword extraction apparatus in one embodiment.
As shown in fig. 10, the text keyword extraction apparatus of this embodiment includes:
the text acquisition module 101 is used for acquiring a text to be extracted;
the keyword matching module 102 is configured to search in an associated keyword library, and match keywords in the text to be extracted;
a combination determining module 103, configured to determine all text sentence patterns and corresponding keyword combinations according to the text to be extracted and the matched keywords in the text to be extracted, where any determined text sentence pattern and corresponding keyword combination jointly form the text to be extracted;
a probability analysis module 104, configured to analyze and determine, according to the keyword probability network model, the probability that each text sentence pattern and corresponding keyword combination holds;
and an extraction determining module 105, configured to determine the keyword combination corresponding to the highest of the probabilities determined by the probability analysis module 104 as the keyword combination extracted from the text to be extracted.
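The division into modules can be pictured as a small pipeline whose stages are supplied from outside. This is a structural sketch under assumptions, not the apparatus itself: `TextKeywordExtractor` and its field names are hypothetical, and each module is reduced to a plain callable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

Keywords = Dict[str, str]          # {slot: keyword}
Matched = List[Tuple[str, str]]    # [(slot, keyword), ...]
Candidate = Tuple[str, Keywords]   # (sentence pattern, keyword combination)


@dataclass
class TextKeywordExtractor:
    match_keywords: Callable[[str], Matched]                            # role of module 102
    determine_combinations: Callable[[str, Matched], List[Candidate]]   # role of module 103
    pattern_probability: Callable[[str], float]                         # role of module 104

    def extract(self, text: str) -> Keywords:
        """`text` is what module 101 would supply. The final max() plays the role of module 105."""
        matched = self.match_keywords(text)
        candidates = self.determine_combinations(text, matched)
        scored = [(self.pattern_probability(pattern), kws) for pattern, kws in candidates]
        _, best_keywords = max(scored, key=lambda item: item[0])
        return best_keywords
```

Keeping the probability model behind a callable means the same pipeline can use a model generated locally or one obtained from a server.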
According to the scheme of this embodiment, when the keywords in the text to be extracted need to be extracted, the associated keyword library is searched to match the keywords in the text, all text sentence patterns and corresponding keyword combinations are then determined based on the matched keywords, the probability that each text sentence pattern and corresponding keyword combination holds is determined by analysis according to the keyword probability network model, and the keyword combination corresponding to the highest of the determined probabilities is taken as the keyword combination extracted from the text to be extracted. In other words, on the basis of matching the keywords in the text to be extracted, all text sentence patterns and corresponding keyword combinations are determined, and the probability of each is then determined based on the keyword probability network model.
The scheme in the above embodiment may be executed on a terminal, or may be executed on a server.
In an example implemented on the terminal, the text to be extracted may be a text input by the terminal user, for example, a text input by the terminal user through a user interactive device such as a keyboard and a touch screen, or may be a text obtained by recognizing a voice of the terminal user, or may be a text obtained in other manners in other embodiments.
Taking the implementation on the server as an example, the text to be extracted may be received from the terminal, and the terminal uploads the text to be extracted to the server after obtaining the text to be extracted. The text to be extracted may be a text input by the terminal user, for example, a text input by the terminal user through a user interactive device such as a keyboard and a touch screen, or a text obtained by recognizing a voice of the terminal user, or a text obtained by other methods.
On the other hand, when the apparatus is installed in a terminal or a server, the keyword probability network model may be generated in advance by the terminal or the server. Therefore, in a specific example, as shown in fig. 11, the text keyword extraction apparatus may further include:
and a model generation module 106, configured to generate the keyword probability network model.
When the apparatus is installed in a terminal, the terminal may acquire the keyword probability network model from the server after the server generates the keyword probability network model. Therefore, as shown in fig. 11, in another embodiment, the text keyword extraction apparatus may further include:
and the model obtaining module 107 is configured to obtain the keyword probability network model generated by the server.
Fig. 12 shows a schematic structural diagram of the model generation module 106 in a specific example, and as shown in fig. 12, the model generation module 106 includes:
a training text obtaining module 1061, configured to obtain a text to be trained, where the text to be trained includes sentence pattern rule templates and corpus texts in various fields;
the training module 1062 is configured to train according to the text to be trained to obtain the keyword probability network model.
The sentence pattern rule template indicates a specific sentence pattern rule. Since the set sentence pattern rule may not include all sentence patterns, such as some spoken sentence patterns, the text to be trained may further include corpus texts of various fields, which may be some spoken texts. In a specific application implementation manner, the corpus texts in each field can be obtained in a crawler crawling manner.
As shown in fig. 12, the training module 1062 may specifically include: a training text extraction unit 10621, a training unit 10622, and a model determination unit 10623.
When the keyword probability network model is obtained by training according to the text to be trained, the text to be trained comprises two texts, namely sentence pattern rule templates and corpus texts in various fields, so that the text to be trained can be determined by combining with the actual technical requirements during training.
In a specific example, when training is performed according to a text to be trained, whether the text to be trained is a sentence rule template or a corpus text in each field may not be distinguished, and in the process of each training, the text to be trained is randomly selected once, at this time:
the training text extraction unit 10621 is configured to randomly extract a current text to be trained from the text to be trained, where the current text to be trained is a sentence pattern rule template or a corpus text, and, after the training unit obtains the trained network model to be trained and while the sentence pattern rule templates or the corpus texts of the fields in the text to be trained have not all been extracted, to randomly extract a current text to be trained again, until all the sentence pattern rule templates and the corpus texts of the fields in the text to be trained have been extracted;
the training unit 10622 is configured to input the current text to be trained extracted by the training text extraction unit into the current network model to be trained for training, obtain a trained network model to be trained, and update the current network model to be trained with the trained network model to be trained;
the model determining unit 10623 is configured to determine, when all sentence pattern rule templates in the text to be trained and corpus texts in each field are extracted, the trained network model to be trained obtained by the training unit as the keyword probability network model.
In another specific example, the number of sentence pattern rule templates in the text to be trained and the number of corpus texts may be set to be the same, and during training the sentence pattern rule templates and the corpus texts of the fields may be trained alternately. In this case:
the training text extracting unit 10621 is configured to alternately extract a sentence pattern rule template from each sentence pattern rule template or a corpus text from each corpus text when extraction of each sentence pattern rule template or corpus text in each field in the text to be trained is not completed;
the training unit 10622 is configured to input the sentence pattern rule template or the corpus text extracted by the training text extraction unit into the current network model to be trained for training, obtain a trained network model to be trained, and update the current network model to be trained with the trained network model to be trained;
the model determining unit 10623 is configured to determine, when all sentence pattern rule templates in the text to be trained and corpus texts in each field are extracted, the trained network model to be trained obtained by the training unit as the keyword probability network model.
In the two specific examples, when the extracted sentence pattern rule template or corpus text is input into the current network model to be trained for training, the training unit 10622 may input it in units of characters, so as to obtain better generalization capability.
It will be understood by those skilled in the art that all or part of the processes in the methods of the embodiments described above may be implemented by a computer program, which is stored in a non-volatile computer readable storage medium, and in the embodiments of the present invention, the program may be stored in the storage medium of a computer system and executed by at least one processor in the computer system to implement the processes of the embodiments including the methods described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.
The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.
The above-mentioned embodiments express only several embodiments of the present invention, and their description is relatively specific and detailed, but they should not be construed as limiting the scope of the invention. It should be noted that a person skilled in the art can make several variations and modifications without departing from the inventive concept, all of which fall within the scope of the present invention. Therefore, the protection scope of the present patent shall be subject to the appended claims.

Claims (12)

1. A text keyword extraction method is characterized by comprising the following steps:
acquiring a text to be extracted;
searching in the associated keyword library, and matching out the keywords in the text to be extracted;
determining all text sentence patterns and corresponding keyword combinations according to the text to be extracted and the matched keywords in the text to be extracted;
analyzing and determining the probability of the combination of each text sentence pattern and the corresponding keyword according to the keyword probability network model;
determining the keyword combination corresponding to the maximum of the probabilities determined by the analysis as the keyword combination extracted from the text to be extracted;
the generation mode of the keyword probability network model comprises the following steps:
acquiring a text to be trained, wherein the text to be trained comprises sentence pattern rule templates and corpus texts in various fields;
randomly extracting a current text to be trained from the text to be trained, wherein the current text to be trained is a sentence pattern rule template or a corpus text;
inputting the extracted current text to be trained into a current network model to be trained for training to obtain a trained network model to be trained;
when the sentence pattern rule templates in the text to be trained or the corpus texts in the fields are not extracted completely, updating the current network model to be trained by using the trained network model to be trained, and returning to the step of randomly extracting one current text to be trained from the text to be trained until the sentence pattern rule templates in the text to be trained and the corpus texts in the fields are extracted completely;
and determining the obtained trained network model to be trained as the keyword probability network model.
2. The method for extracting text keywords according to claim 1, wherein before the text to be extracted is obtained, the method further comprises the steps of:
and generating the keyword probability network model.
3. The method for extracting text keywords according to claim 1, wherein the number of sentence pattern rule templates in the text to be trained is the same as the number of corpus texts;
randomly extracting a current text to be trained from the text to be trained, wherein the current text to be trained is a sentence pattern rule template or a corpus text, and the method comprises the following steps: extracting a corpus text from each corpus text;
inputting the extracted current text to be trained into the current network model to be trained for training, and after obtaining the trained network model to be trained, before updating the current network model to be trained by using the trained network model to be trained when the sentence pattern rule templates in the text to be trained or the corpus texts in each field are not extracted, the method further comprises the following steps: and after updating the current network model to be trained by using the trained network model to be trained, extracting a sentence pattern rule template from each sentence pattern rule template, inputting the extracted sentence pattern rule template into the current network model to be trained for training, and obtaining the trained network model to be trained.
4. The method for extracting text keywords according to claim 1, wherein the number of sentence pattern rule templates in the text to be trained is the same as the number of corpus texts;
randomly extracting a current text to be trained from the text to be trained, wherein the current text to be trained is a sentence pattern rule template or a corpus text, and the method comprises the following steps: extracting a sentence pattern rule template from each sentence pattern rule template;
inputting the extracted current text to be trained into the current network model to be trained for training, and after obtaining the trained network model to be trained, before updating the current network model to be trained by using the trained network model to be trained when the sentence pattern rule templates in the text to be trained or the corpus texts in each field are not extracted, the method further comprises the following steps: and after updating the current network model to be trained by using the trained network model to be trained, extracting a corpus text from each corpus text, inputting the extracted corpus text into the current network model to be trained for training, and obtaining the trained network model to be trained.
5. The method for extracting text keywords according to claim 1, 3 or 4, wherein the extracted sentence pattern rule template or corpus text is input into the current network model to be trained for training in units of characters.
6. The method for extracting text keywords according to claim 1, wherein before the text to be extracted is obtained, the method further comprises the steps of:
and acquiring the keyword probability network model generated by the server.
7. A text keyword extraction apparatus, characterized by comprising:
the text acquisition module is used for acquiring a text to be extracted;
the keyword matching module is used for searching in the associated keyword library and matching out the keywords in the text to be extracted;
the combination determining module is used for determining all text sentence patterns and corresponding keyword combinations according to the texts to be extracted and the matched keywords in the texts to be extracted;
the probability analysis module is used for analyzing and determining the probability of the establishment of each text sentence pattern and the corresponding keyword combination according to the keyword probability network model;
the extraction determining module is used for determining the keyword combination corresponding to the maximum of the probabilities analyzed and determined by the probability analysis module as the keyword combination extracted from the text to be extracted;
the model generation module is used for generating the keyword probability network model;
the model generation module includes:
the training text acquisition module is used for acquiring a text to be trained, wherein the text to be trained comprises sentence pattern rule templates and corpus texts in various fields;
the training module is used for training the text to be trained to obtain the keyword probability network model;
the training module comprises:
a training text extraction unit, configured to randomly extract a current text to be trained from the text to be trained, where the current text to be trained is a sentence pattern rule template or a corpus text, and after the training unit obtains the trained network model to be trained and when the sentence pattern rule templates in the text to be trained or the corpus texts in each field are not extracted completely, randomly extract a current text to be trained from the text to be trained again until the sentence pattern rule templates in the text to be trained and the corpus texts in each field are extracted completely;
the training unit is used for inputting the current text to be trained extracted by the training text extraction unit into a current network model to be trained for training to obtain a trained network model to be trained, and updating the current network model to be trained by using the trained network model to be trained;
and the model determining unit is used for determining the trained network model to be trained obtained by the training unit as the keyword probability network model when the sentence pattern rule templates in the text to be trained and the corpus texts in the fields are extracted.
8. The apparatus according to claim 7, wherein the number of sentence pattern rule templates in the text to be trained is the same as the number of corpus texts;
the training text extraction unit is used for alternately extracting a sentence pattern rule template from each sentence pattern rule template or extracting a corpus text from each corpus text when the extraction of each sentence pattern rule template or corpus text in each field in the text to be trained is not finished;
and the training unit is used for inputting the sentence pattern rule template or the corpus text extracted by the training text extraction unit into the current network model to be trained for training to obtain the trained network model to be trained, and updating the current network model to be trained by using the trained network model to be trained.
9. The text keyword extraction apparatus according to claim 7 or 8, wherein the training unit inputs the extracted sentence pattern rule template or corpus text into the current network model to be trained in units of characters for training.
10. The text keyword extraction apparatus according to claim 7, further comprising:
and the model acquisition module is used for acquiring the keyword probability network model generated by the server.
11. A computer device comprising a memory and a processor, the memory storing a computer program, wherein the processor implements the steps of the method of any one of claims 1 to 6 when executing the computer program.
12. A computer-readable storage medium, storing a computer program which, when executed by a processor, causes the processor to carry out the steps of the method according to any one of claims 1 to 6.
CN201710203566.8A 2017-03-30 2017-03-30 Text keyword extraction method and text keyword extraction device Active CN108304424B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710203566.8A CN108304424B (en) 2017-03-30 2017-03-30 Text keyword extraction method and text keyword extraction device

Publications (2)

Publication Number Publication Date
CN108304424A CN108304424A (en) 2018-07-20
CN108304424B CN108304424B (en) 2021-09-07

Family

ID=62872103

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710203566.8A Active CN108304424B (en) 2017-03-30 2017-03-30 Text keyword extraction method and text keyword extraction device

Country Status (1)

Country Link
CN (1) CN108304424B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377916B (en) * 2018-08-17 2022-12-16 腾讯科技(深圳)有限公司 Word prediction method, word prediction device, computer equipment and storage medium
CN109271521B (en) * 2018-11-16 2021-03-30 北京九狐时代智能科技有限公司 Text classification method and device
CN111309878B (en) * 2020-01-19 2023-08-22 支付宝(杭州)信息技术有限公司 Search type question-answering method, model training method, server and storage medium
CN111324722B (en) * 2020-05-15 2020-08-14 支付宝(杭州)信息技术有限公司 Method and system for training word weight model
CN111737979B (en) * 2020-06-18 2021-01-12 龙马智芯(珠海横琴)科技有限公司 Keyword correction method, device, correction equipment and storage medium for voice text
CN113010648A (en) * 2021-04-15 2021-06-22 联仁健康医疗大数据科技股份有限公司 Content search method, content search device, electronic equipment and storage medium

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103186509B (en) * 2011-12-29 2016-03-30 北京百度网讯科技有限公司 Generalization method and apparatus for wildcard-type templates, generalization method and system for common templates
CN104239300B (en) * 2013-06-06 2017-10-20 富士通株式会社 Method and apparatus for mining semantic keywords from text
US9785630B2 (en) * 2014-05-30 2017-10-10 Apple Inc. Text prediction using combined word N-gram and unigram language models
CN105138515B (en) * 2015-09-02 2018-10-19 百度在线网络技术(北京)有限公司 Named entity recognition method and device

Also Published As

Publication number Publication date
CN108304424A (en) 2018-07-20

Similar Documents

Publication Publication Date Title
CN108304424B (en) Text keyword extraction method and text keyword extraction device
CN108287858B (en) Semantic extraction method and device for natural language
CN106776711B (en) Chinese medical knowledge map construction method based on deep learning
CN106776544B (en) Character relation recognition method and device and word segmentation method
CN107797984B (en) Intelligent interaction method, equipment and storage medium
CN109165302B (en) Multimedia file recommendation method and device
CN109783651B (en) Method and device for extracting entity related information, electronic equipment and storage medium
CN110704743B (en) Semantic search method and device based on knowledge graph
CN109325040B (en) FAQ question-answer library generalization method, device and equipment
CN111831911B (en) Query information processing method and device, storage medium and electronic device
CN112989055B (en) Text recognition method and device, computer equipment and storage medium
CN105956053B (en) Search method and device based on network information
CN103886034A (en) Method and equipment for building indexes and matching inquiry input information of user
CN117056471A (en) Knowledge base construction method and question-answer dialogue method and system based on generation type large language model
CN108768824B (en) Information processing method and device
CN108538294B (en) Voice interaction method and device
CN110457672A (en) Keyword determination method, apparatus, electronic equipment and storage medium
CN107665188B (en) Semantic understanding method and device
CN110134780B (en) Method, device, equipment and computer readable storage medium for generating document abstract
CN112115232A (en) Data error correction method and device and server
CN112434533B (en) Entity disambiguation method, entity disambiguation device, electronic device, and computer-readable storage medium
CN112650842A (en) Human-computer interaction based customer service robot intention recognition method and related equipment
CN112527955A (en) Data processing method and device
CN111414735A (en) Text data generation method and device
CN113051384B (en) User portrait extraction method based on dialogue and related device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant