WO2021139466A1 - Method, apparatus, storage medium, and terminal for determining text topic words - Google Patents

Method, apparatus, storage medium, and terminal for determining text topic words

Info

Publication number
WO2021139466A1
WO2021139466A1 PCT/CN2020/134772
Authority
WO
WIPO (PCT)
Prior art keywords
text
topic
word
mapping relationship
target text
Prior art date
Application number
PCT/CN2020/134772
Other languages
English (en)
French (fr)
Inventor
马文康
王鹏
王永会
Original Assignee
北京大米科技有限公司
Priority date
Filing date
Publication date
Application filed by 北京大米科技有限公司 (Beijing Dami Technology Co., Ltd.)
Publication of WO2021139466A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/258 Heading extraction; Automatic titling; Numbering
    • G06F40/279 Recognition of textual entities
    • G06F40/284 Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking

Definitions

  • This application relates to the field of computer technology, and in particular to a method, apparatus, storage medium, and terminal for determining text topic words.
  • The theme is the central idea of an article or work, embodying the main body and core of its content, while topic words concisely summarize the main content of the article or work in a few words.
  • The topic model is a common method for statistical text topic mining; it can discover and summarize the topic content of texts without human involvement. Traditional topic-mining algorithms usually train the topic model on long texts in an unsupervised way; a topic model trained in this manner is not suitable for short texts, so the topic words mined from short texts are not accurate enough.
  • The embodiments of the present application provide a method, apparatus, storage medium, and terminal for determining text topic words that are suitable for short texts and can mine topic words accurately.
  • The technical solution is as follows:
  • In a first aspect, an embodiment of the present application provides a method for determining text topic words, the method including: preprocessing at least one input text to obtain at least one target text; constructing, according to a word set obtained by pre-training, a first mapping relationship between the at least one target text and at least one word in the word set; determining, based on a second mapping relationship between topic types obtained by pre-training and at least one word in the word set, a third mapping relationship between the at least one target text and at least one topic type; and determining, according to the third mapping relationship, at least one topic type corresponding to the at least one target text, and then determining, based on the second mapping relationship, at least one topic word corresponding to the at least one target text.
  • In a second aspect, an embodiment of the present application provides an apparatus for determining text topic words, the apparatus including:
  • a target text obtaining module, configured to preprocess at least one input text to obtain at least one target text;
  • a first mapping relationship construction module, configured to construct, according to a word set obtained by pre-training, a first mapping relationship between the at least one target text and at least one word in the word set;
  • a third mapping relationship determination module, configured to determine, based on a second mapping relationship between topic types obtained by pre-training and at least one word in the word set, a third mapping relationship between the at least one target text and at least one topic type;
  • a topic word determination module, configured to determine, according to the third mapping relationship, at least one topic type corresponding to the at least one target text, and then determine, based on the second mapping relationship, at least one topic word corresponding to the at least one target text.
  • In a third aspect, the embodiments of the present application provide a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of any one of the above methods are implemented.
  • In a fourth aspect, an embodiment of the present application provides a terminal, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; the processor implements the steps of any one of the above methods when executing the program.
  • The technical solutions provided by some embodiments of this application bring at least the following beneficial effects: in one or more embodiments of the present application, the terminal first preprocesses at least one input text to obtain at least one target text; then constructs, according to the pre-trained word set, a first mapping relationship between the at least one target text and at least one word in the word set; next determines, based on the second mapping relationship between pre-trained topic types and at least one word in the word set, a third mapping relationship between the at least one target text and at least one topic type; and finally determines, according to the third mapping relationship, at least one topic type corresponding to the at least one target text, and then determines, based on the second mapping relationship, at least one topic word corresponding to the at least one target text.
  • The embodiments provided in this application train the topic model on short texts in advance, which ensures the model's applicability to short texts; during online use, the word set generated during pre-training and the second mapping relationship are used directly to obtain the topic words of the input text, which ensures accurate mining of short-text topic words online.
  • FIG. 1 is a schematic flowchart of a method for determining text topic words according to an embodiment of the present application;
  • FIG. 2 is a schematic flowchart of a method for determining text topic words according to an embodiment of the present application;
  • FIG. 3 is a schematic flowchart of a method for determining text topic words according to an embodiment of the present application;
  • FIG. 4 is a schematic diagram of the training process of a text topic-word mining model provided by an embodiment of the present application;
  • FIG. 5 is a schematic diagram of the complete offline-training and online-use flow of a method for determining text topic words provided by an embodiment of the present application;
  • FIG. 6 is a schematic structural diagram of an apparatus for determining text topic words provided by an embodiment of the present application;
  • FIG. 7 is a schematic structural diagram of an apparatus for determining text topic words provided by an embodiment of the present application;
  • FIG. 8 is a schematic structural diagram of an apparatus for determining text topic words provided by an embodiment of the present application;
  • FIG. 9 is a structural block diagram of a terminal provided by an embodiment of the present application.
  • FIG. 1 is a schematic flowchart of a method for determining text topic words according to an embodiment of this application.
  • As shown in FIG. 1, the method of the embodiment of the present application may include the following steps:
  • S101: Preprocess at least one input text to obtain at least one target text.
  • Before determining the topic words of the input text, the terminal first preprocesses the input text; preprocessing speeds up the subsequent mining of the input text's topic words. Topic-word mining in the embodiments of this application targets short texts, so the text length of each obtained input text should be less than a preset threshold.
  • The text length is the number of characters contained in the text.
  • The preset threshold can be set to 120, for example; the number of input texts obtained is not limited and can be one or more.
  • The text length of an input text changes somewhat after preprocessing; the preprocessed input text is defined as the target text.
  • The preprocessing includes typo correction, word-order adjustment, emoticon removal, and so on. For example, the typo in text 1 "黄梁一梦" is corrected to "黄粱一梦" ("a pipe dream"); the word order of text 2 "他去图书馆了吧，大概" ("He went to the library, probably") is adjusted to "他大概去图书馆了吧" ("He probably went to the library"); and the emoticon in text 3 "这边风景无限好 [emoji]" is removed to obtain "这边风景无限好" ("the scenery here is infinitely good").
  • Emoticon removal covers removing kaomoji, removing emoji, and removing sticker images.
  • In some feasible embodiments, the preprocessing may also include text merging.
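  • A minimal sketch of this preprocessing step, assuming only the length check and emoticon stripping (typo correction and word-order adjustment would need dedicated rules or models; all names here are illustrative):

```python
import re
from typing import Optional

# Rough emoji/emoticon code-point ranges; real data may need a fuller pattern.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]+")

def preprocess(text: str, max_length: int = 120) -> Optional[str]:
    """Drop texts that are not short, then strip emoji and extra spaces."""
    if len(text) >= max_length:  # topic-word mining here targets short texts
        return None
    return EMOJI_PATTERN.sub("", text).strip()
```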
  • S102: Construct, according to a word set obtained by pre-training, a first mapping relationship between the at least one target text and at least one word in the word set.
  • A text is composed of words.
  • Based on the word set generated by pre-training, a mapping relationship can be constructed between the target text and at least one word in the word set, called the first mapping relationship.
  • Specifically, word-composition analysis may be performed on the generated target text to obtain the words it contains; based on the word set and the words contained in the target text, the words in the word set corresponding to the target text are determined, forming the mapping relationship.
  • The mapping relationship can be one-to-one or one-to-many.
  • The mapping type is not limited; for example, it can be a list type or a dictionary type.
  • The word set is generated from at least one sample text.
  • Specifically, during pre-training, the obtained at least one sample text first undergoes preprocessing such as typo correction, word-order adjustment, and emoticon removal; each sample text is then segmented into words to obtain the words it contains, and the words contained in the at least one sample text constitute the word set.
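  • A sketch of how such a word set could be built, assuming the jieba segmentation library and an illustrative stop-word list:

```python
import jieba  # a common Chinese word-segmentation library

STOP_WORDS = {"的", "地", "在", "了", "a", "an", "the"}

def build_word_set(sample_texts):
    """Segment each sample text and collect the distinct valid words."""
    words = set()
    for text in sample_texts:
        words.update(w for w in jieba.lcut(text)
                     if w.strip() and w not in STOP_WORDS)
    return sorted(words)  # a fixed order gives stable matrix columns later
```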
  • S103: Determine, based on a second mapping relationship between topic types obtained by pre-training and at least one word in the word set, a third mapping relationship between the at least one target text and at least one topic type.
  • During pre-training, each of the at least one topic type set for the multiple sample texts is likewise summarized from words; the mapping relationship formed between a topic type and at least one word in the word set is called the second mapping relationship.
  • Both the first and second mapping relationships involve the word set, so combining the two yields the third mapping relationship, i.e., the correspondence between the target text and the topic types.
  • S104: Determine, according to the third mapping relationship, at least one topic type corresponding to the at least one target text, and then determine, based on the second mapping relationship, at least one topic word corresponding to the at least one target text.
  • The topic type of a target text can be determined from the third mapping relationship; the words corresponding to that topic type are then determined through the second mapping relationship and used as the topic words of the target text.
  • Topic words concisely summarize the gist of a text, and a target text can have one or more topic words.
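  • Treating the mappings as plain dictionaries (one of the list/dictionary types mentioned in step S102), steps S103 and S104 reduce to two lookups; a minimal sketch:

```python
def pick_topic_words(third_map, second_map, target_text, top_k=3):
    """third_map: text -> {topic: probability} (third mapping relationship).
    second_map: topic -> {word: frequency} (second mapping relationship)."""
    topic_probs = third_map[target_text]
    best_topic = max(topic_probs, key=topic_probs.get)  # most likely topic type
    word_freqs = second_map[best_topic]
    # The highest-frequency words under that topic become the topic words.
    return sorted(word_freqs, key=word_freqs.get, reverse=True)[:top_k]
```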
  • In the embodiment of the present application, the terminal first preprocesses at least one input text to obtain at least one target text; then constructs, according to the pre-trained word set, a first mapping relationship between the at least one target text and at least one word in the word set; next determines, based on the second mapping relationship between pre-trained topic types and at least one word in the word set, a third mapping relationship between the at least one target text and at least one topic type; and finally determines, according to the third mapping relationship, at least one topic type corresponding to the at least one target text, and then determines, based on the second mapping relationship, at least one topic word corresponding to the at least one target text.
  • The embodiments provided in this application train the topic model on short texts in advance, which ensures the model's applicability to short texts; during online use, the word set generated during pre-training and the second mapping relationship are used directly to obtain the topic words of the input text, which ensures accurate mining of short-text topic words online.
  • FIG. 2 is a schematic flowchart of a method for determining text topic words according to an embodiment of this application.
  • As shown in FIG. 2, the method of the embodiment of the present application may include the following steps:
  • S201: Perform text merging on at least one first sample text to generate at least one second sample text, wherein the text lengths of the first sample texts are all less than a preset threshold and the text lengths of the second sample texts are all greater than or equal to the preset threshold.
  • For mining the topic words of short texts, this embodiment of the application provides a model training method.
  • When model training is completed, the second mapping relationship generated during training is saved.
  • During online use, the topic words of the input text can then be obtained accurately from the input text and the second mapping relationship saved at training completion.
  • In this embodiment, the model is trained on samples of different types; that is, the training samples include multiple types, for example, both commercial texts and literary texts. The sample texts before text merging are defined as the first sample texts; a first sample text is a short text whose length is less than the preset threshold.
  • When the model is trained only on mutually independent, unprocessed short texts, each short text contains few words, so finding the co-occurrence patterns of topic words (the patterns by which several words appear together) during training is difficult, and the matrix generated by training is sparse; as a result, topic words obtained from this matrix during subsequent online use are not accurate enough.
  • Therefore, after obtaining at least one first sample text whose text length is less than the preset threshold, this embodiment performs text merging on the at least one first sample text to generate at least one second sample text for training the model.
  • A second sample text is a long text whose length is greater than or equal to the preset threshold.
  • After merging, each text contains more characters, so its text length is greater than or equal to the preset threshold and it becomes a long text.
  • The text merging of the at least one first sample text may use existing clustering algorithms (such as K-means clustering or the mean-shift algorithm) to first cluster the at least one first sample text, and then merge the texts in various combinations according to the clustering result to generate at least one second sample text, increasing the number of words in each sample text.
  • Alternatively, existing natural-language-processing techniques may be used to combine/merge the at least one first sample text in different ways to generate at least one second sample text, thereby increasing the number of words in each sample text; for example, several first sample texts with the same grammatical structure are merged into one second sample text (see the sketch below).
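  • A sketch of the cluster-then-merge option, assuming scikit-learn's K-means over character TF-IDF features (the feature choice is illustrative; the embodiment only requires some clustering algorithm):

```python
from sklearn.cluster import KMeans
from sklearn.feature_extraction.text import TfidfVectorizer

def merge_short_texts(first_samples, n_clusters=10):
    """Cluster the short first sample texts, then concatenate each cluster
    into one long second sample text, increasing the words per text."""
    features = TfidfVectorizer(analyzer="char", ngram_range=(1, 2)
                               ).fit_transform(first_samples)
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(features)
    clusters = {}
    for text, label in zip(first_samples, labels):
        clusters.setdefault(label, []).append(text)
    return ["".join(texts) for texts in clusters.values()]
```

  • Character n-grams sidestep the need for a segmenter at this stage; any feature scheme that groups topically similar short texts would serve.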
  • Determining text length means counting the characters contained in the text.
  • Characters include the letters and punctuation of all languages.
  • A Chinese character or Chinese punctuation mark is usually counted as two characters, and an English letter or English punctuation mark as one.
  • For example, "时事热点新闻" ("current hot news") contains six Chinese characters, so its text length is 12; "Hello!" contains six characters, so its text length is 6.
  • A text-length threshold is preset: texts shorter than the preset threshold are classed as short texts, and texts whose length is greater than or equal to the preset threshold as long texts, where the preset threshold can be set to 140 or 150, for example.
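  • A sketch of this length rule; the code-point cutoff below is a rough illustrative heuristic for "CJK character or Chinese punctuation":

```python
def text_length(text: str) -> int:
    """Count a CJK character or full-width punctuation as 2, others as 1."""
    return sum(2 if ord(ch) > 0x2E7F else 1 for ch in text)

assert text_length("时事热点新闻") == 12  # 6 Chinese characters
assert text_length("Hello!") == 6        # 6 ASCII characters
```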
  • It should be noted that in some feasible embodiments, the model may also be trained only on texts of the same type; that is, the obtained at least one first sample text is all of the same type, for example, sample texts a, b, and c all belonging to the sports category.
  • In addition, there is no fixed relationship between the number of first sample texts before merging and the number of second sample texts after merging: the count may increase, decrease, or stay the same.
  • S202: Obtain topic prior information based on the topic types and topic words of the at least one second sample text.
  • When making inferences or decisions about something unknown, its current state matters, but historical experience matters too; prior information is the experience gained from historical data or materials. In this application, topic prior information must be obtained before the model is trained on the samples, so that the training result is better.
  • Directly using subjectively preset topic types as the prior information for model training would make the training result insufficiently accurate. In this embodiment, the topic prior information is obtained by an algorithm on top of the manually preset topic types, combining historical experience with data analysis, so the prior information is more reliable.
  • Specifically, topic types and the topic words corresponding to each topic type are pre-stored in the terminal. When the at least one second sample text is generated, the stored preset topic types and their corresponding topic words are obtained; combined with the generated at least one second sample text, existing language-processing techniques determine the preset topic type to which each second sample text belongs, and this result is used as the topic prior information before step S203 is performed.
  • This embodiment does not limit the number of pre-stored topic types or the number of topic words corresponding to each topic type.
  • When multiple topic types are pre-stored, the preset topic type to which each second sample text belongs can be determined by probability, and the preset-topic-type membership results of the at least one second sample text are used as the topic prior information.
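  • One plausible way to turn the preset topic types and their topic words into the probability distributions used as prior information is keyword-overlap scoring; a sketch under that assumption (the scoring rule is illustrative, not an algorithm prescribed by the embodiment):

```python
def topic_prior(second_samples, topic_keywords):
    """topic_keywords: topic type -> preset topic words for that type.
    Returns one probability distribution over topic types per sample text."""
    priors = []
    for text in second_samples:
        scores = {topic: sum(text.count(w) for w in words) + 1e-9
                  for topic, words in topic_keywords.items()}
        total = sum(scores.values())
        priors.append({topic: s / total for topic, s in scores.items()})
    return priors
```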
  • S203: Train a text topic mining model based on the at least one second sample text and the topic prior information to obtain a topic type-term frequency matrix.
  • Training the model on the at least one second sample text and the topic prior information obtained in the above steps lengthens the sample texts, increases the number of words per sample text, reduces the difficulty of finding word co-occurrence patterns, and solves the sparsity of the matrix output by the model, effectively ensuring the accuracy of topic-word extraction for input texts during subsequent online use and enhancing the interpretability of the labels.
  • The model is a text topic-word mining model; any model capable of topic-word mining can be applied in the embodiments provided by this application.
  • For example, the labeled latent Dirichlet allocation topic-label mining model (Labeled-Latent Dirichlet Allocation, Labeled-LDA) can be used.
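  • As a stand-in for full Labeled-LDA training (which samples a topic for every token), the topic type-term frequency matrix can be illustrated by a smoothed count over labeled texts; a simplification, not the Labeled-LDA algorithm itself:

```python
import numpy as np

def topic_term_matrix(docs_words, doc_topics, word_set, beta=0.01):
    """docs_words: per-text word lists; doc_topics: per-text topic labels.
    Returns (topics, a row-normalized t*w topic type-term frequency matrix)."""
    word_index = {w: j for j, w in enumerate(word_set)}
    topics = sorted({t for labels in doc_topics for t in labels})
    row = {t: i for i, t in enumerate(topics)}
    phi = np.full((len(topics), len(word_set)), beta)  # beta smooths zeros
    for words, labels in zip(docs_words, doc_topics):
        for t in labels:
            for w in words:
                if w in word_index:
                    phi[row[t], word_index[w]] += 1
    return topics, phi / phi.sum(axis=1, keepdims=True)
```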
  • S204: Preprocess at least one input text to obtain at least one target text. For details of this step, refer to step S101, which is not repeated here.
  • S205: Construct a target text-term frequency matrix according to the word set obtained by pre-training. During online use, the frequency with which each word of the word set appears in the target text is determined from the word set generated by the above training process, and the target text-term frequency matrix (i.e., the first mapping relationship in step S102) is constructed.
  • For details of this step not described here, refer to step S102, which is not repeated.
  • S206: Determine a target text-topic type matrix based on the topic type-term frequency matrix. Obtaining the topic words of the target text first requires determining the topic type to which the input text most likely belongs. The target text-topic type matrix is obtained from the target text-term frequency matrix constructed in step S205 and the topic type-term frequency matrix saved when training completed; step S207 is then performed.
  • S207: Use the index corresponding to the maximum probability value in the target text-topic type matrix as the topic type index of the target text, and determine the at least one topic word based on the topic type index and the topic type-term frequency matrix saved at training completion.
  • Specifically, the topic-type index (a, b) with the largest probability value in the target text-topic type matrix determines that the most probable topic type of target text a is b; the topic-word index (b, c) of topic type b in the topic type-term frequency matrix then yields topic word c, which is output for text a. A topic word can be composed of the type of the text and high-frequency words/keywords in the text.
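  • In matrix terms, steps S205 to S207 can be sketched with numpy: the target text-term frequency matrix times the transposed topic type-term frequency matrix gives the target text-topic type matrix, whose row-wise argmax yields the topic type index (shapes as in FIG. 5; the row normalization is an assumption):

```python
import numpy as np

def infer_topic_words(doc_term, topic_term, top_k=3):
    """doc_term: k*w target text-term counts; topic_term: t*w matrix."""
    doc_topic = doc_term @ topic_term.T                  # k*t third mapping
    doc_topic = doc_topic / doc_topic.sum(axis=1, keepdims=True)
    best_topics = doc_topic.argmax(axis=1)               # index (a, b) per text
    # For each text a with topic b, the top-k columns of row b of topic_term
    # give the word-set indices of its topic words.
    return [topic_term[b].argsort()[::-1][:top_k] for b in best_topics]
```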
  • In the embodiment of the present application, the terminal first preprocesses at least one input text to obtain at least one target text; then constructs, according to the pre-trained word set, a first mapping relationship between the at least one target text and at least one word in the word set; next determines, based on the second mapping relationship between pre-trained topic types and at least one word in the word set, a third mapping relationship between the at least one target text and at least one topic type; and finally determines, according to the third mapping relationship, at least one topic type corresponding to the at least one target text, and then determines, based on the second mapping relationship, at least one topic word corresponding to the at least one target text.
  • The embodiments provided in this application train the topic model on short texts in advance, which ensures the model's applicability to short texts; during online use, the word set generated during pre-training and the second mapping relationship are used directly to obtain the topic words of the input text, which ensures accurate mining of short-text topic words online.
  • FIG. 3 is a schematic flowchart of a method for determining text topic words according to an embodiment of this application.
  • As shown in FIG. 3, the method of the embodiment of the present application may include the following steps:
  • S301: Generate a word set from the vocabulary in at least one first sample text.
  • Each first sample text has a different length and contains a different number of words, and some words are meaningless; therefore, each first sample text undergoes word segmentation and stop-word removal to obtain the valid words it contains, and the valid words contained in the at least one first sample text constitute the word set.
  • Word segmentation means splitting the sentences of a text into words. For example, segmenting the text "小明被湖岸上的一朵花吸引住了" ("Xiao Ming was captivated by a flower on the lakeshore") may yield "小明/被/湖岸上/的/一朵/花/吸引/住了".
  • The segmentation method may be the forward maximum matching method, a segmentation method based on an N-gram language model, an HMM-based segmentation method, and so on; a sketch of forward maximum matching follows this paragraph. Stop words are words with no substantive meaning in the text, such as "的, 地, 在" or "a, an, the"; removing them makes the samples more meaningful and model training faster.
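  • A sketch of the forward maximum matching method named above: greedy longest-prefix lookup against a known vocabulary, falling back to single characters:

```python
def forward_max_match(sentence, vocab, max_word_len=4):
    """Segment by repeatedly taking the longest vocabulary prefix."""
    words, i = [], 0
    while i < len(sentence):
        for size in range(min(max_word_len, len(sentence) - i), 0, -1):
            if size == 1 or sentence[i:i + size] in vocab:
                words.append(sentence[i:i + size])  # longest match (or 1 char)
                i += size
                break
    return words
```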
  • S302: Construct a target text-term frequency matrix based on statistics of the frequency with which words appear in the at least one first sample text.
  • Using the word set obtained in the above step, the frequency with which each word of the word set appears in each first sample text is counted, and the target text-term frequency matrix is constructed from the at least one first sample text and the word-frequency statistics of each first sample text.
  • The target text-term frequency matrix is the real text-term frequency matrix computed by manual statistics.
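  • A sketch of the counting behind this matrix (rows are texts, columns follow the word set's fixed order):

```python
import numpy as np

def build_text_term_matrix(docs_words, word_set):
    """Entry (i, j): how often word j of the word set occurs in text i."""
    word_index = {w: j for j, w in enumerate(word_set)}
    matrix = np.zeros((len(docs_words), len(word_set)), dtype=int)
    for i, words in enumerate(docs_words):
        for w in words:
            if w in word_index:
                matrix[i, word_index[w]] += 1
    return matrix
```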
  • S303: Perform text merging on the at least one first sample text to generate at least one second sample text, wherein the text lengths of the first sample texts are all less than a preset threshold and the text lengths of the second sample texts are all greater than or equal to the preset threshold. For details of this step, refer to step S201, which is not repeated here.
  • In some feasible embodiments, before the text merging of the at least one first sample text, processing such as typo correction, word-order adjustment, and emoticon removal may also be performed.
  • S304: Use the probability distributions of the at least one second sample text over different topic types as the topic prior information.
  • When inferring topics and mining topic words for texts, training the model with the texts' own content plus experience obtained from historical data makes the model parameters better and the output closer to real data. Before the text topic-word mining model is trained with the at least one second sample text obtained in the above steps, multiple topic types are manually preset and the probability that each second sample text belongs to each topic type is computed, forming probability distributions; the probability distributions formed by the at least one second sample text constitute the topic prior information used for model training.
  • For details of this step not described here, refer to step S202, which is not repeated.
  • S305: Train the text topic-word mining model with the at least one second sample text and the topic prior information; when the sample text-term frequency matrix generated during training is consistent with the target text-term frequency matrix, model training is complete, and the sample topic type-term frequency matrix generated during training is obtained.
  • A sample text-term frequency matrix is formed during model training.
  • To determine whether training is complete, the sample text-term frequency matrix generated while training the text topic-word mining model is compared with the real target text-term frequency matrix computed by manual statistics.
  • When the two are consistent, model training is complete; at this point the parameters in the model have reached their optimal values, and the sample topic type-term frequency matrix generated during training is obtained and saved for subsequent online use.
  • When the sample text-term frequency matrix is inconsistent with the target text-term frequency matrix, training is not complete; the model must be adjusted and training continued with more second sample texts until the sample text-term frequency matrix is consistent with the target text-term frequency matrix.
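  • The completion test can be sketched as a loop that keeps feeding second sample texts until the two matrices agree; every method on `model` below is hypothetical, since the embodiment does not fix a training API:

```python
import numpy as np

def train_until_consistent(model, target_matrix, sample_batches, atol=1e-3):
    """Stop when the sample text-term matrix matches the target matrix."""
    for batch in sample_batches:                 # more second sample texts
        model.fit(batch)                         # hypothetical training call
        if np.allclose(model.sample_text_term_matrix(), target_matrix,
                       atol=atol):
            return model.topic_term_matrix()     # saved for online use
    raise RuntimeError("training did not converge on the given samples")
```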
  • S306: Preprocess at least one input text to obtain at least one target text. For details of this step, refer to step S101, which is not repeated here.
  • S307: Construct a target text-term frequency matrix from the target text and the word set generated by pre-training. For details of this step, refer to step S205, which is not repeated here.
  • S308: Determine a target text-topic type matrix based on the target text-term frequency matrix and the topic type-term frequency matrix obtained by pre-training. For details of this step, refer to step S206, which is not repeated here.
  • S309: Use the index corresponding to the maximum probability value in the target text-topic type matrix as the topic index of the target text, and determine the topic words of the target text based on the topic index and the topic type-term frequency matrix. For details of this step, refer to step S207, which is not repeated here.
  • In the embodiment of the present application, the terminal first preprocesses at least one input text to obtain at least one target text; then constructs, according to the pre-trained word set, a first mapping relationship between the at least one target text and at least one word in the word set; next determines, based on the second mapping relationship between pre-trained topic types and at least one word in the word set, a third mapping relationship between the at least one target text and at least one topic type; and finally determines, according to the third mapping relationship, at least one topic type corresponding to the at least one target text, and then determines, based on the second mapping relationship, at least one topic word corresponding to the at least one target text.
  • The embodiments provided in this application train the topic model on short texts in advance, which ensures the model's applicability to short texts; during online use, the word set generated during pre-training and the second mapping relationship are used directly to obtain the topic words of the input text, which ensures accurate mining of short-text topic words online.
  • FIG. 4 is a schematic diagram of the training process of a text topic-word mining model provided by an embodiment of this application.
  • As shown in FIG. 4, the Labeled-LDA model is taken as an example to illustrate the model training process in the foregoing embodiments.
  • The training process is as follows: 1) a real text-term frequency matrix s (i.e., the target text-term frequency matrix) is obtained by manual statistical calculation; 2) the latent Dirichlet distribution, using two different parameters α and β, forms two different distributions, named Dirichlet distribution α and Dirichlet distribution β; 3) some subjectively given topic types undergo certain calculations to yield the topic prior information; 4) the sample texts (not shown in the figure), Dirichlet distribution α, and the topic prior information together yield the sample text-topic type matrix θ; 5) the sample texts (not shown in the figure) and Dirichlet distribution β yield the topic type-term frequency matrix φ; 6) the sample text-topic type matrix θ and the topic type-term frequency matrix φ yield the sample text-term frequency matrix w; 7) when the sample text-term frequency matrix w generated during Labeled-LDA training is infinitely close to/consistent with the real text-term frequency matrix s, Labeled-LDA training is complete; at this point the parameter φ in the model is optimal, and the sample topic type-term frequency matrix φ generated during training is saved for subsequent online use.
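  • The relationship among θ, φ, and w in steps 4) to 7) is a matrix factorization: w is obtained from θ and φ, so the completion test amounts to comparing θ·φ with s. A toy numpy check under assumed shapes (d=5 texts, t=3 topics, w=8 words):

```python
import numpy as np

rng = np.random.default_rng(0)
theta = rng.dirichlet(np.ones(3), size=5)  # d*t sample text-topic matrix
phi = rng.dirichlet(np.ones(8), size=3)    # t*w topic type-term matrix
w = theta @ phi                            # d*w sample text-term matrix
s = w                                      # pretend this is the real matrix s
print(np.allclose(w, s))                   # True -> "training complete"
```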
  • FIG. 5 is a schematic diagram of the complete offline-training and online-use flow of a method for determining topic words provided in an embodiment of this application.
  • As shown in FIG. 5, during offline training, a clustering algorithm clusters the input sample texts into n categories, and the texts within these n categories are combined and merged to form the texts shown in the figure as text 11 ... text n4 (long texts, d texts in total). After emoticon removal, word segmentation, stop-word removal, and similar processing, a word set of size w is generated; t topic types are manually preset, and the probability distribution of each sample text over the topic types is computed, the probability distributions formed by the d long texts constituting the topic prior information. The Labeled-LDA model is trained with the topic prior information and the d texts; during training, the model generates a d*w text-term frequency matrix from the word set and the d texts. When the d*w text-term frequency matrix generated during training is consistent with the target text-term frequency matrix (computed by manual statistics), training is complete; at this point the model parameters are optimal, and the t*w topic type-term frequency matrix is output.
  • During online use, after the k input texts are preprocessed, a k*w text-term frequency matrix is constructed from the word set of size w generated during offline training and the k preprocessed input texts. From the k*w text-term frequency matrix and the t*w topic type-term frequency matrix output by offline training, a k*t text-topic type matrix is obtained. The index corresponding to the maximum probability value in the k*t text-topic type matrix is used as the topic type index of each input text; this index determines the topic type of the input text, and the topic words of the input texts (i.e., topic word 1 ... topic word k shown in the figure) are then obtained from the topic type-term frequency matrix output by offline training.
  • The following are apparatus embodiments of this application, which may be used to perform the method embodiments of this application; for details not disclosed in the apparatus embodiments, refer to the method embodiments.
  • FIG. 6 is a schematic structural diagram of an apparatus for determining text topic words provided by an exemplary embodiment of this application.
  • The apparatus can be implemented as all or part of a terminal through software, hardware, or a combination of the two, and can also be integrated on a server as an independent module.
  • The apparatus for determining text topic words in the embodiment of the present application is applied to a terminal.
  • The apparatus 1 includes a target text obtaining module 11, a first mapping relationship construction module 12, a third mapping relationship determination module 13, and a topic word determination module 14, wherein:
  • the target text obtaining module 11 is configured to preprocess at least one input text to obtain at least one target text;
  • the first mapping relationship construction module 12 is configured to construct, according to a word set obtained by pre-training, a first mapping relationship between the at least one target text and at least one word in the word set;
  • the third mapping relationship determination module 13 is configured to determine, based on a second mapping relationship between topic types obtained by pre-training and at least one word in the word set, a third mapping relationship between the at least one target text and at least one topic type;
  • the topic word determination module 14 is configured to determine, according to the third mapping relationship, at least one topic type corresponding to the at least one target text, and then determine, based on the second mapping relationship, at least one topic word corresponding to the at least one target text.
  • Optionally, the first mapping relationship includes a target text-term frequency matrix, the second mapping relationship includes a topic type-term frequency matrix, and the third mapping relationship is a target text-topic type matrix; the topic word determination module 14 is specifically configured to:
  • use the index corresponding to the maximum probability value in the target text-topic type matrix as the topic type index of the target text, and determine the at least one topic word based on the topic type index and the topic type-term frequency matrix.
  • FIG. 7 is a schematic structural diagram of an apparatus for determining text topic words provided by an exemplary embodiment of this application.
  • The apparatus 1 for determining text topic words provided by the embodiment of the present application further includes:
  • a second sample text generation module 15, configured to perform text merging on at least one first sample text to generate at least one second sample text, wherein the text lengths of the first sample texts are all less than a preset threshold and the text lengths of the second sample texts are all greater than or equal to the preset threshold;
  • a topic prior information obtaining module 16, configured to obtain topic prior information based on the topic types and topic words of the at least one second sample text;
  • a topic type-term frequency matrix obtaining module 17, configured to train a text topic mining model based on the at least one second sample text and the topic prior information to obtain a topic type-term frequency matrix.
  • Optionally, in the topic prior information obtaining module 16, the topic prior information includes probability distributions of the at least one second sample text over different topic types.
  • FIG. 8 is a schematic structural diagram of an apparatus for determining text topic words provided by an exemplary embodiment of this application.
  • The apparatus 1 for determining text topic words provided by the embodiment of the present application further includes:
  • a word set generating module 18, configured to generate a word set from the vocabulary in the at least one first sample text;
  • a target text-term frequency matrix construction module 19, configured to construct a target text-term frequency matrix based on statistics of the frequency with which words appear in the at least one first sample text.
  • The topic type-term frequency matrix obtaining module 17 is specifically configured to: train the text topic-word mining model with the at least one second sample text and the topic prior information, model training being complete when the sample text-term frequency matrix generated during training is consistent with the target text-term frequency matrix; and obtain the sample topic type-term frequency matrix generated during training.
  • It should be noted that when the apparatus for determining text topic words provided in the above embodiments performs the method for determining text topic words, the division into the above functional modules is merely an example; in practical applications, the above functions may be assigned to different functional modules as needed, i.e., the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
  • In addition, the apparatus embodiments and the method embodiments provided above belong to the same concept; for the implementation process of the method for determining text topic words, refer to the method embodiments, which are not repeated here.
  • In the embodiment of the present application, the terminal first preprocesses at least one input text to obtain at least one target text; then constructs, according to the pre-trained word set, a first mapping relationship between the at least one target text and at least one word in the word set; next determines, based on the second mapping relationship between pre-trained topic types and at least one word in the word set, a third mapping relationship between the at least one target text and at least one topic type; and finally determines, according to the third mapping relationship, at least one topic type corresponding to the at least one target text, and then determines, based on the second mapping relationship, at least one topic word corresponding to the at least one target text.
  • The embodiments provided in this application train the topic model on short texts in advance, which ensures the model's applicability to short texts; during online use, the word set generated during pre-training and the second mapping relationship are used directly to obtain the topic words of the input text, which ensures accurate mining of short-text topic words online.
  • The embodiments of the present application also provide a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the steps of the method in any of the foregoing embodiments are implemented.
  • The computer-readable storage medium may include, but is not limited to, any type of disk, including floppy disks, optical disks, DVDs, CD-ROMs, microdrives, and magneto-optical disks; ROM, RAM, EPROM, EEPROM, DRAM, VRAM, and flash memory devices; magnetic or optical cards; nanosystems (including molecular memory ICs); or any type of medium or device suitable for storing instructions and/or data.
  • An embodiment of the present application also provides a terminal, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor, and the processor implements the steps of the method in any of the foregoing embodiments when the program is executed.
  • FIG. 9 is a structural block diagram of a terminal provided in an embodiment of this application.
  • As shown in FIG. 9, the terminal 600 includes a processor 601 and a memory 602.
  • In this embodiment of the application, the processor 601 is the control center of the computer system and may be the processor of a physical machine or of a virtual machine.
  • The processor 601 may include one or more processing cores, such as a 4-core or 8-core processor.
  • The processor 601 may be implemented in at least one hardware form of DSP (Digital Signal Processing), FPGA (Field-Programmable Gate Array), or PLA (Programmable Logic Array).
  • The processor 601 may also include a main processor and a coprocessor.
  • The main processor processes data in the awake state and is also called the CPU (Central Processing Unit); the coprocessor is a low-power processor that processes data in the standby state.
  • The memory 602 may include one or more computer-readable storage media, which may be non-transitory.
  • The memory 602 may also include high-speed random access memory and non-volatile memory, such as one or more disk storage devices or flash storage devices.
  • In some embodiments of the present application, the non-transitory computer-readable storage medium in the memory 602 is used to store at least one instruction, which is executed by the processor 601 to implement the methods in the embodiments of the present application.
  • In some embodiments, the terminal 600 further includes a peripheral device interface 603 and at least one peripheral device.
  • The processor 601, the memory 602, and the peripheral device interface 603 may be connected by buses or signal lines.
  • Each peripheral device may be connected to the peripheral device interface 603 through a bus, signal line, or circuit board.
  • Specifically, the peripheral devices include at least one of a display screen 604, a camera 605, and an audio circuit 606.
  • The peripheral device interface 603 may be used to connect at least one I/O (Input/Output) peripheral device to the processor 601 and the memory 602.
  • In some embodiments of the present application, the processor 601, the memory 602, and the peripheral device interface 603 are integrated on the same chip or circuit board; in some other embodiments of the present application, any one or two of the processor 601, the memory 602, and the peripheral device interface 603 may be implemented on a separate chip or circuit board, which is not specifically limited in the embodiments of the present application.
  • The display screen 604 is used to display a UI (User Interface).
  • The UI may include graphics, text, icons, video, and any combination thereof.
  • When the display screen 604 is a touch display screen, it also has the ability to collect touch signals on or above its surface.
  • The touch signal may be input to the processor 601 as a control signal for processing.
  • At this point, the display screen 604 may also be used to provide virtual buttons and/or a virtual keyboard, also called soft buttons and/or a soft keyboard.
  • In some embodiments of the present application, the display screen 604 may be a flexible display screen arranged on a curved or folded surface of the terminal 600; the display screen 604 may even be set to a non-rectangular irregular shape, i.e., a special-shaped screen.
  • The display screen 604 may be made of materials such as LCD (Liquid Crystal Display) or OLED (Organic Light-Emitting Diode).
  • The camera 605 is used to capture images or video.
  • Optionally, the camera 605 includes a front camera and a rear camera.
  • Usually, the front camera is set on the front panel of the terminal, and the rear camera is set on the back of the terminal.
  • In some embodiments, the camera 605 may also include a flash.
  • The flash may be a single-color-temperature flash or a dual-color-temperature flash; a dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash and can be used for light compensation at different color temperatures.
  • The audio circuit 606 may include a microphone and a speaker.
  • The microphone is used to collect sound waves from the user and the environment, convert them into electrical signals, and input them to the processor 601 for processing.
  • The microphone may also be an array microphone or an omnidirectional microphone.
  • The power supply 607 is used to supply power to the various components in the terminal 600.
  • The power supply 607 may be alternating current, direct current, a disposable battery, or a rechargeable battery.
  • When the power supply 607 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery.
  • A wired rechargeable battery is charged through a wired line; a wireless rechargeable battery is charged through a wireless coil.
  • The rechargeable battery may also support fast-charging technology.
  • The terminal structural block diagram shown in the embodiments of the present application does not constitute a limitation on the terminal 600; the terminal 600 may include more or fewer components than shown in the figure, combine certain components, or adopt a different component arrangement.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A method, apparatus, storage medium, and terminal for determining text topic words, including: preprocessing at least one input text to obtain at least one target text (S101); constructing, according to a word set obtained by pre-training, a first mapping relationship between the at least one target text and at least one word in the word set (S102); determining, based on a second mapping relationship between topic types obtained by pre-training and at least one word in the word set, a third mapping relationship between the at least one target text and at least one topic type (S103); and determining, according to the third mapping relationship, at least one topic type corresponding to the at least one target text, and then determining, based on the second mapping relationship, at least one topic word corresponding to the at least one target text (S104). The method can accurately mine the topic words of short texts.

Description

一种文本主题词确定方法、装置、存储介质及终端 技术领域
本申请涉及计算机技术领域,尤其涉及一种文本主题词确定方法、装置、存储介质及终端。
背景技术
主题是文章/作品的中心思想,它体现的是文章/作品内容的主体及核心;而主题词则能通过少量的词语简明扼要地概括出文章/作品的主要内容。
主题模型是统计文本主题挖掘的常用方法,能够在无人工参与的前提下发现和归纳文本的主题内容。
传统的主题挖掘算法通常是利用长文本对主题模型进行无监督的训练,该方法训练出来的主题模型不适用于短文本,从而使得对短文本进行主题挖掘时,挖掘的主题词准确性不够。
发明内容
本申请实施例提供了一种文本主题词确定方法、装置、存储介质及终端,适用于短文本且能够准确地挖掘主题词。所述技术方案如下:
第一方面,本申请实施例提供了一种文本主题词确定方法,所述方法包括:
对至少一个输入文本进行预处理,得到至少一个目标文本;
根据预先训练得到的词语集合,构建所述至少一个目标文本与所述词语集合中至少一个词语之间的第一映射关系;
基于预先训练得到的主题类型与所述词语集合中至少一个词语之间的第二映射关系,确定所述至少一个目标文本与至少一个主题类型之间的第三映射关系;
根据所述第三映射关系确定所述至少一个目标文本对应的至少一个主题类型,进而基于所述第二映射关系,确定所述至少一个目标文本对应的至少一个主题词。
第二方面,本申请实施例提供了一种文本主题词确定装置,所述装置包括:
目标文本获取模块,用于对至少一个输入文本进行预处理,得到至少一个目标文本;
第一映射关系构建模块,用于根据预先训练得到的词语集合,构建所述至少一个目标文本与所述词语集合中至少一个词语之间的第一映射关系;
第三映射关系确定模块,用于基于预先训练得到的主题类型与所述词语集合中至少一个词语之间的第二映射关系,确定所述至少一个目标文本与至少一个主题类型之间的第三映射关系;
主题词确定模块,用于根据所述第三映射关系确定所述至少一个目标文本对应的至少一个主题类型,进而基于所述第二映射关系,确定所述至少一个目标文本对应的至少一个主题词。
第三方面,本申请实施例提供了一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现上述任一项方法的步骤。
第四方面,本申请实施例提供了一种终端,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,所述处理器执行所述程序时实现上述任一项方法的步骤。
本申请一些实施例提供的技术方案带来的有益效果至少包括:
在本申请的一个或多个实施例中,终端首先对至少一个输入文本进行预处理,得到至少一个目标文本;接着根据预先训练得到的词语集合,构建所述至少一个目标文本与所述词语集合中至少一个词语之间的第一映射关系;再基于预先训练得到的主题类型与所述词语集合中至少一个词语之间的第二映射关系,确定所述至少一个目标文本与至少一个主题类型之间的第三映射关系;最后根据所述第三映射关系确定所述至少一个目标文本对应的至少一个主题类型,进而基于所述第二映射关系,确定所述至少一个目标文本对应的至少一个主题词。本申请提供的实施例是预先利用短文本对主题模型进行训练,保证了模型对短文本的适用性;在线使用时直接使用预先训练过程中生成的词语集合以及第二映射关系来得出输入文本的主题词,保证了在线使用时对短文本主题词的准确挖掘。
附图说明
为了更清楚地说明本申请实施例或现有技术中的技术方案,下面将对实施例或现有技术描述中所需要使用的附图作简单地介绍,显而易见地,下面描述中的附图仅仅是本申请的一些实施例,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他的附图。
图1是本申请实施例提供的一种文本主题词确定方法的流程示意图;
图2是本申请实施例提供的一种文本主题词确定方法的流程示意图;
图3是本申请实施例提供的一种文本主题词确定方法的流程示意图;
图4是本申请实施例提供的一种文本主题词挖掘模型的训练过程示意图;
图5是本申请实施例提供的一种文本主题词确定方法的离线训练与在线使用过程的完整流程示意图;
图6是本申请实施例提供的一种文本主题词确定装置的结构示意图;
图7是本申请实施例提供的一种文本主题词确定装置的结构示意图;
图8是本申请实施例提供的一种文本主题词确定装置的结构示意图;
图9是本申请实施例提供的一种终端结构框图。
具体实施方式
为使本申请的目的、技术方案和优点更加清楚,下面将结合附图对本申请实施例方式作进一步地详细描述。
下面的描述涉及附图时,除非另有表示,不同附图中的相同数字表示相同或相似的要素。以下示例性实施例中所描述的实施方式并不代表与本申请相一致的所有实施方式。相反,它们仅是如所附权利要求书中所详述的、本申请的一些方面相一致的装置和方法的例子。
在本申请的描述中,需要理解的是,术语“第一”、“第二”等仅用于描述目的,而不能理解为指示或暗示相对重要性。对于本领域的普通技术人员而言,可以具体情况理解上述术语在本申请中的具体含义。此外,在本申请的描述中,除非另有说明,“多个”是指两个或两个以上。“和/或”,描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。字符“/”一般表示前后关联对象是一种“或”的关系。
下面将结合附图1-附图5,对本申请实施例提供的文本主题词确定方法进行详细介绍。
请参见图1,为本申请实施例提供的一种文本主题词确定方法的流程示意图。
如图1所示,本申请实施例的所述方法可以包括以下步骤:
S101,对至少一个输入文本进行预处理,得到至少一个目标文本;
在确定输入文本的主题词之前,终端要先对输入文本进行预处理,预处理能够加速后续对输入文本主题词的挖掘;本申请实施例对主题词的挖掘针对的是短文本,因此,获取的输入文本其文本长度应小于预设阈值,文本长度即文本中所含有的字符数,预设阈值可以设置为120等;所获取的输入文本在条数上不受限制,可以为一条,也可以为至少一条。
经过预处理的输入文本其文本长度会存在一定的变化,定义经过预处理的输入文本为目标文本,所述预处理包括错别字纠正、文本语序结构调整以及去表情符等等。例如,对文本1“黄梁一梦”进行错别字纠正,修改为“黄粱一梦”;对文本2“他去图书馆了吧,大概”进行语序结构调整,调整为“他大概去图书馆了吧”;对文本3“这边风景无限好
Figure PCTCN2020134772-appb-000001
”进行去表情符处理,变为“这边风景无限好”等等。其中,所述去表情符包含了去颜文字、去Emoji以及去表情包等处理。
一些可行的实施例中,所述预处理还可以包括文本合并处理。
S102,根据预先训练得到的词语集合,构建所述至少一个目标文本与所述词语集合中至少一个词语之间的第一映射关系;
文本由词语组合而成,基于预先训练生成的词语集合,可以在目标文本与词语集合中至少一个词语之间构建一种映射关系,称为第一映射关系。具体地,可以对生成的目标文本进行词语构成分析,获取目标文本所包含的词语,基于所述词语集合以及所获取的目标文本所包含的词语,在所述词语集合中确定与所述目标文本相对应的词语,形成映射关系。映射关系可以是一对一的关系,也可以是一对多。映射类型不受限制,例如可以是列表式,也可以是字典式等。
其中,所述词语集合是基于至少一条样本文本生成的。具体地,预先训练过程中先对获取的至少一条样本文本进行错别字纠正、文本语序结构调整以及去表情符等预处理,再对样本文本进行分词处理,获取样本文本所包含的词语,至少一条样本文本所包含的词语即构成词语集合。
S103,基于预先训练得到的主题类型与所述词语集合中至少一个词语之间的第二映射关系,确定所述至少一个目标文本与至少一个主题类型之间的第三映射关系;
预先训练时,为多个样本文本设置的至少一个主题类型同样也是由词语概括而成的,将主题类型与词语集合中至少一个词语之间形成的映射关系称为第二映射关系。第一映射关系与第二映射关系均与词语集合有关,将两者进行结合便可得出第三映射关系,即目标文本与所述主题类型之间的对应关系。
S104,根据所述第三映射关系确定所述至少一个目标文本对应的至少一个主题类型,进而基于所述第二映射关系,确定所述至少一个目标文本对应的至少一个主题词。
根据第三映射关系可以确定目标文本的主题类型,接着通过第二映射关系可以确定目标文本主题类型对应的词语,将该词语作为目标文本的主题词。主题词能够简练概括文本主旨,目标文本的主题词可以是一个或多个。
在本申请实施例中,终端首先对至少一个输入文本进行预处理,得到至少一个目标文本;接着根据预先训练得到的词语集合,构建所述至少一个目标文本与所述词语集合中至少一个词语之间的第一映射关系;再基于预先训练得到的主题类型与所述词语集合中至少一个词语 之间的第二映射关系,确定所述至少一个目标文本与至少一个主题类型之间的第三映射关系;最后根据所述第三映射关系确定所述至少一个目标文本对应的至少一个主题类型,进而基于所述第二映射关系,确定所述至少一个目标文本对应的至少一个主题词。本申请提供的实施例是预先利用短文本对主题模型进行训练,保证了模型对短文本的适用性;在线使用时直接使用预先训练过程中生成的词语集合以及第二映射关系来得出输入文本的主题词,保证了在线使用时对短文本主题词的准确挖掘。
请参见图2,为本申请实施例提供的一种文本主题词确定方法的流程示意图。
如图2所示,本申请实施例的所述方法可以包括以下步骤:
S201,对至少一条第一样本文本进行文本合并处理,生成至少一条第二样本文本,所述第一样本文本的文本长度均小于预设阈值,所述第二样本文本的文本长度均大于等于所述预设阈值;
针对短文本主题词的挖掘,本申请实施例提供了一种模型训练方法,模型训练完成时会对训练过程中生成的第二映射关系进行保存,在线使用时可以根据输入文本以及训练完成时所保存的第二映射关系来准确获取输入文本的主题词。
本实施例对模型的训练是基于不同类型的样本来进行的,也就是,训练样本包含多个类型,例如既包含商业类型的文本又包含文学类型的文本。定义进行文本合并处理之前的样本文本为第一样本文本,所述第一样本文本即为短文本,文本长度小于预设阈值。单单利用彼此独立且不经过任何处理的至少一条短文本对模型进行训练时,由于每条短文本所含有的词语数量较少,训练过程中找到主题词语共现规律(几个词语连在一起出现的规律)的难度较大,训练生成的矩阵也会稀疏,从而使得后续在线使用时基于此矩阵获取的主题词准确性也不够。因此,本实施例在获取到至少一条文本长度均小于预设阈值的第一样本文本后,要对所述至少一条第一样本文本进行文本合并处理,以生成至少一条第二样本文本来对模型进行训练。所述第二样本文本即为长文本,其文本长度大于等于预设阈值。
其中,第一样本文本经合并处理之后,每条文本所含有的字符数增多,其文本长度均大于等于预设阈值,从而变为长文本。对至少一条第一样本文本的文本合并处理,可以是利用一些现有的聚类算法(例如K均值聚类、均值漂移算法等)先对所述至少一条第一样本文本进行聚类,再根据聚类结果用各种组合方式对文本进行合并,以生成至少一条第二样本文本,增大每条样本文本的词语数量。
或者,可以是利用一些现有的自然语言处理技术来对至少一条第一样本文本进行不同方式的结合/合并,以生成至少一条第二样本文本,从而增大每条样本文本的词语数量。例如,将语法结构相同的几条第一样本文本合并扩大为一条第二样本文本。
文本长度的确定也就是计算文本中所含有的字符数,字符包含各类语言的文字及标点符号,一个汉字或中文标点通常算作两个字符,一个英文字母或英文标点通常算作一个字符。例如,“时事热点新闻”包含12个字符,该文本的文本长度为12;“Hello!”包含6个字符,该文本的文本长度为6。预设一个文本长度阈值,将文本长度小于预设阈值的文本划分为短文本,将文本长度大于等于预设阈值的文本划分为长文本,其中预设阈值可以设置为140或150等等。
需要说明的是,一些可行的实施例中,对模型的训练也可以仅针对同一类型的文本来进行,也就是说,所获取的至少一条第一样本文本均为同一类型的文本,例如作为样本的文本 a、文本b、文本c均属于体育运动类。此外,合并处理之前第一样本文本的条数与合并处理之后第二样本文本的条数不存在固定的大小关系,合并处理之后样本的条数可以增多也可以减少、或者一致。
S202,基于所述至少一条第二样本文本的主题类型、主题词,获取主题先验信息;
在对未知待测事物做出一些推断/决策时,当前未知待测事物本身的状态虽然重要,但历史经验也同样重要,先验信息即指基于历史数据或资料所获得的经验。本申请在利用样本对模型进行训练之前,需要获取主题先验信息,以使训练结果更优。
直接将依赖主观判断而预设的主题类型作为先验信息作用于模型训练时,会导致模型的训练结果准确性不够。本实施例是在人为预设主题类型的基础上,通过算法获得主题先验信息,既有历史经验又存在一定的数据分析,先验信息更可靠。
具体地,于终端预先存储主题类型以及与主题类型对应的主题词,当所述至少一条第二样本文本生成时,获取所存储的预设主题类型以及主题类型对应的主题词,再结合所生成的至少一条第二样本文本,根据现有的语言处理技术确定所述至少一条第二样本文本所属的预设主题类型,并将此结果作为主题先验信息,执行步骤S103。
本实施例对预先存储的主题类型数量以及与各个主题类型对应的关键词的数量不作限定。当预先存储的主题类型为多个时,可以以概率大小来确定每条第二样本文本所属的预设主题类型,并将至少一条第二样本文本的预设主题类型隶属结果作为主题先验信息。
S203,基于所述至少一条第二样本文本以及所述主题先验信息,对文本主题挖掘模型进行训练,获取主题类型-词语频率矩阵。
基于上述步骤获取的至少一条第二样本文本以及主题先验信息来对模型进行训练,可以加长样本文本长度,增大样本文本的词语数量,降低词语共现规律的获取难度,解决模型输出的矩阵稀疏的问题,切实保证后续在线使用时对输入文本主题词提取的准确性,增强标签的解释性。
其中,所述模型为文本主题词挖掘模型,凡是能够进行主题词挖掘的模型均可应用于本申请提供的实施例中。例如可选用隐式狄利克雷主题标签挖掘模型(Labeled-Latent Dirichlet Allocation,Labeled-LDA)等。
S204,对至少一个输入文本进行预处理,得到至少一个目标文本;
该步骤具体可参见步骤S101,此处不再赘述。
S205,根据预先训练得到的词语集合,构建目标文本-词语频率矩阵;
在线使用时,根据上述训练过程生成的词语集合,确定出所述目标文本中出现词语集合内各个词语的频率,构建目标文本-词语频率矩阵(即步骤S102中的第一映射关系)。
本步骤未作详尽说明之处具体可参见步骤S102,此处不再赘述。
S206,基于所述主题类型-词语频率矩阵,确定目标文本-主题类型矩阵;
获取目标文本的主题词需要先确定输入文本最可能隶属的主题类型。通过步骤S205构建的目标文本-词语频率矩阵以及训练完成时所保存的主题类型-词语频率矩阵可以获得目标文本-主题类型矩阵,执行步骤S207。
S207,将所述目标文本-主题类型矩阵中最大概率值对应的索引作为所述目标文本的主题类型索引,基于所述主题类型索引以及所述主题类型-词语频率矩阵,确定所述至少一个主题词。
根据目标文本-主题类型矩阵中最大概率值对应的索引确定目标文本的主题类型索引,也就是由矩阵中的最大值可以确定出该目标文本的主题类型;再根据目标文本的主题类型索引以及上述训练完成时所保存的主题类型-词语频率矩阵可以确定该目标文本的主题词。
该过程具体为,通过目标文本-主题类型矩阵中概率值最大的主题类型索引(a,b),确定出目标文本a最可能的主题类型是b,再通过主题类型-词语频率矩阵确定出该主题类型b的主题词索引(b,c),输出文本a的主题词c,所述主题词可以由文本所属类型以及文本中的高频词汇/关键词等构成。
在本申请实施例中,终端首先对至少一个输入文本进行预处理,得到至少一个目标文本;接着根据预先训练得到的词语集合,构建所述至少一个目标文本与所述词语集合中至少一个词语之间的第一映射关系;再基于预先训练得到的主题类型与所述词语集合中至少一个词语之间的第二映射关系,确定所述至少一个目标文本与至少一个主题类型之间的第三映射关系;最后根据所述第三映射关系确定所述至少一个目标文本对应的至少一个主题类型,进而基于所述第二映射关系,确定所述至少一个目标文本对应的至少一个主题词。本申请提供的实施例是预先利用短文本对主题模型进行训练,保证了模型对短文本的适用性;在线使用时直接使用预先训练过程中生成的词语集合以及第二映射关系来得出输入文本的主题词,保证了在线使用时对短文本主题词的准确挖掘。
请参见图3,为本申请实施例提供的一种文本主题词确定方法的流程示意图。
如图3所示,本申请实施例的所述方法可以包括以下步骤:
S301,根据至少一条第一样本文本中的词汇生成词语集合;
每条第一样本文本的文本长度不同,所包含的词语数量也不等,同时也会存在一些无意义的词语,因此要对所述每条第一样本文本进行分词处理及停用词处理,以获得每条第一样本文本所包含的有效词语,至少一条第一样本文本所包含的有效词语构成词语合集。
其中,分词处理是指对文本中的语句进行词语切分,例如,对文本“小明被湖岸上的一朵花吸引住了”进行分词处理,分词的结果就可能为“小明/被/湖岸上/的/一朵/花/吸引/住了”等等,分词处理方法具体可以选用正向最大匹配法、基于N-gram语言模型的分词方法、基于HMM的分词方法等。停用词是指文本中出现的没有实际含义的词,例如“的、地、在、a、an、the”等,去掉文本中一些没有实际含义的词可以使样本更有意义,模型训练速度更快。
S302,基于所述至少一条第一样本文本中词语出现的频率统计结果,构建目标文本-词语频率矩阵;
根据上述步骤获取的词语集合,统计每条第一样本文本中出现词语集合内各个词语的频率,依据所述至少一条第一样本文本以及每条第一样本文本的词语频率统计结果,构建目标文本-词语频率矩阵。所述目标文本-词语频率矩阵是由人工统计计算出的真实的文本-词语频率矩阵。
S303,对所述至少一条第一样本文本进行文本合并处理,生成至少一条第二样本文本,所述第一样本文本的文本长度均小于预设阈值,所述第二样本文本的文本长度均大于等于所述预设阈值;
该步骤具体可参见步骤S201,此处不再赘述。
一些可行的实施例中,在对所述至少一条第一样本文本进行文本合并处理之前,还可以包括错别字纠正、文本语序结构调整以及去表情符等处理。
S304,将所述至少一条第二样本文本隶属于不同主题类型的概率分布作为主题先验信息;
在对文本进行主题推断、主题词挖掘时,基于文本本身的内容,再利用根据历史数据或资料获得的经验信息对模型进行训练,能够使模型参数更优,输出结果更接近于真实数据。
本申请在利用上述步骤所获取的至少一条第二样本文本对文本主题词挖掘模型进行训练之前,需要人为预设多个主题类型并人工统计出每条第二样本文本隶属于各个主题类型的概率,形成概率分布;所述至少一条第二长度样本文形成的多个概率分布即构成主题先验信息,将所述主题先验信息用于模型训练。
该步骤未作详尽说明之处具体可参见步骤S202,此处不再赘述。
S305,采用所述至少一条第二样本文本以及所述主题先验信息,对文本主题词挖掘模型进行训练,当训练过程中生成的样本文本-词语频率矩阵与所述目标文本-词语频率矩阵一致时,模型训练完成,获取训过程中生成的样本主题类型-词语频率矩阵;
模型训练过程中会形成一个样本文本-词语频率矩阵,在确定模型训练是否完成时,可以将文本主题词挖掘模型训练过程中生成的样本文本-词语频率矩阵与人工统计计算出的真实的目标文本-词语频率矩阵进行比较,当两者一致时表示模型训练完成,此时模型中的各项参数已达最优,获取训练过程中生成的样本主题类型-词语频率矩阵并进行保存,用于后续的在线使用。
当样本文本-词语频率矩阵与目标文本-词语频率矩阵不一致时,表示训练未完成,还需要对模型进行调整并继续使用更多的第二样本文本来训练,直至样本文本-词语频率矩阵与目标文本-词语频率矩阵一致。
S306,对至少一个输入文本进行预处理,得到至少一个目标文本;
该步骤具体可参见步骤S101,此处不再赘述。
S307,根据所述目标文本以及预先训练生成的词语集合,构建目标文本-词语频率矩阵;
该步骤具体可参见步骤S205,此处不再赘述。
S308,基于所述目标文本-词语频率矩阵以及预先训练获取的主题类型-词语频率矩阵,确定目标文本-主题类型矩阵;
该步骤具体可参见步骤S206,此处不再赘述。
S309,将所述目标文本-主题类型矩阵中最大概率值对应的索引作为所述目标文本的主题索引,基于所述主题索引以及所述主题类型-词语频率矩阵,确定所述目标文本的主题词。
该步骤具体可参见步骤S207,此处不再赘述。
在本申请实施例中,终端首先对至少一个输入文本进行预处理,得到至少一个目标文本;接着根据预先训练得到的词语集合,构建所述至少一个目标文本与所述词语集合中至少一个词语之间的第一映射关系;再基于预先训练得到的主题类型与所述词语集合中至少一个词语之间的第二映射关系,确定所述至少一个目标文本与至少一个主题类型之间的第三映射关系;最后根据所述第三映射关系确定所述至少一个目标文本对应的至少一个主题类型,进而基于所述第二映射关系,确定所述至少一个目标文本对应的至少一个主题词。本申请提供的实施例是预先利用短文本对主题模型进行训练,保证了模型对短文本的适用性;在线使用时直接使用预先训练过程中生成的词语集合以及第二映射关系来得出输入文本的主题词,保证了在线使用时对短文本主题词的准确挖掘。
请参见图4,为本申请实施例提供的一种文本主题词挖掘模型的训练过程示意图。
如图4所示,以Labeled-LDA模型为例对上述实施例中的模型训练过程进行说明。
训练过程具体为:1)人工统计计算获得一个真实的文本-词语频率矩阵s(也就是目标文本-词语频率矩阵);2)隐式狄利克雷分布在使用了两个不同的参数α和β后形成了两个不同的分布,分别命名为狄利克雷分布α和狄利克雷分布β;3)基于主观认定给出的一些主题类型再经过一定的计算获得主题先验信息;4)由样本文本(图中未示出)、狄利克雷分布α以及主题先验信息三者可以获得样本文本-主题类型矩阵θ;5)通过样本文本(图中未示出)以及狄利克雷分布β可以获得主题类型-词语频率矩阵φ;6)由样本文本-主题类型矩阵θ和主题类型-词语频率矩阵φ可以获得样本文本-词语频率矩阵w;7)当Labeled-LDA模型训练过程中生成的样本文本-词语频率矩阵w与真实的文本-词语频率矩阵s无限接近/一致时,表明Labeled-LDA模型训练完成,此时模型中的参数φ达到最优,对训练过程中生成的样本主题类型-词语频率矩阵φ进行保存,并用于后续在线使用。
请参见图5,为本申请实施例提供的一种主题词确定方法的离线训练与在线使用过程的完整流程示意图。
如图5所示,离线训练过程中,利用聚类算法对输入的样本文本进行聚类,形成n个类别,对这n个类别中的文本进行组合合并,形成图中所示的文本11…文本n4(长文本,并计条数为d);在对样本文本进行去表情符、分词、去停用词等处理之后,生成大小为w的词语集合;人为预设t个主题类型,并计算获得每条样本文本隶属于各个主题类型的概率分布,d条长文本所形成的概率分布构成主题先验信息;利用所述主题先验信息以及d条文本训练Labeled-LDA模型,训练过程中,模型会根据词语集合以及所述d条文本生成d*w的文本-词语频率矩阵,当训练过程中生成的d*w的文本-词语频率矩阵与目标文本-词语频率矩阵(人工统计计算得出的)一致时,表示训练完成,此时模型中的参数达到最优,输出t*w的主题类型-词语频率矩阵。
在线使用时,对k条输入文本进行预处理之后,根据离线训练过程中生成的大小为w的词语集合以及所述预处理后的k条输入文本构建k*w的文本-词语频率矩阵,由所述k*w的文本-词语频率矩阵以及离线训练输出的t*w的主题类型-词语频率矩阵可以得到k*t的文本-主题类型矩阵,将k*t的文本-主题类型矩阵中最大概率值对应的索引作为输入文本的主题类型索引,根据该主题类型索引确定输入文本的主题类型,再通过离线训练输出的主题类型-词语频率矩阵便可获得输入文本的主题词(即图中所示主题词1…主题词k)。
下述为本申请装置实施例,可以用于执行本申请方法实施例。对于本申请装置实施例中未披露的细节,请参照本申请方法实施例。
请参见图6,为本申请一个示例性实施例提供的文本主题词确定装置的结构示意图。该文本主题词确定装置可以通过软件、硬件或者两者的结合实现成为终端的全部或一部分,还可以作为独立的模块集成于服务器上。本申请实施例中的文本主题词确定装置应用于终端,所述装置1包括目标文本获取模块11、第一映射关系构建模块12、第三映射关系确定模块13和主题词确定模块14,其中:
目标文本获取模块11,用于对至少一个输入文本进行预处理,得到至少一个目标文本;
第一映射关系构建模块12,用于根据预先训练得到的词语集合,构建所述至少一个目标文本与所述词语集合中至少一个词语之间的第一映射关系;
第三映射关系确定模块13,用于基于预先训练得到的主题类型与所述词语集合中至少一个词语之间的第二映射关系,确定所述至少一个目标文本与至少一个主题类型之间的第三映射关系;
主题词确定模块14,用于根据所述第三映射关系确定所述至少一个目标文本对应的至少一个主题类型,进而基于所述第二映射关系,确定所述至少一个目标文本对应的至少一个主题词。
作为可选的,所述一映射关系包括目标文本-词语频率矩阵,第二映射关系包括主题类型-词语频率矩阵,所述第三映射关系为目标文本-主题类型矩阵,所述主题词确定模块14具体用于:
将所述目标文本-主题类型矩阵中最大概率值对应的索引作为所述目标文本的主题类型索引,基于所述主题类型索引以及所述主题类型-词语频率矩阵,确定所述至少一个主题词。
请参见图7,为本申请一个示例性实施例提供的文本主题词确定装置的结构示意图。本申请实施例提供的文本主题词确定装置1还包括:
第二样本文本生成模块15,用于对至少一条第一样本文本进行文本合并处理,生成至少一条第二样本文本,所述第一样本文本的文本长度均小于预设阈值,所述第二样本文本的文本长度均大于等于所述预设阈值;
主题先验信息获取模块16,用于基于所述至少一条第二样本文本的主题类型、主题词,获取主题先验信息;
主题类型-词语频率矩阵获取模块17,用于基于所述至少一条第二样本文本以及所述主题先验信息,对文本主题挖掘模型进行训练,获取主题类型-词语频率矩阵。
作为可选的,所述主题先验信息获取模块16具体用于:
所述主题先验信息包括:所述至少一条第二样本文本隶属于不同主题类型的概率分布。
请参见图8,为本申请一个示例性实施例提供的文本主题词确定装置的结构示意图。本申请实施例提供的文本主题词确定装置1还包括:
词语集合生成模块18,用于根据所述至少一条第一样本文本中的词汇生成词语集合;
目标文本-词语频率矩阵构建模块19,用于基于所述至少一条第一样本文本中词语出现的频率统计结果,构建目标文本-词语频率矩阵;
所述主题类型-词语频率矩阵获取模块17具体用于:
采用所述至少一条第二样本文本以及所述主题先验信息,对文本主题词挖掘模型进行训练,当训练过程中生成的样本文本-词语频率矩阵与所述目标文本-词语频率矩阵一致时,模型训练完成;获取训过程中生成的样本主题类型-词语频率矩阵;
需要说明的是,上述实施例提供的文本主题词确定装置在执行文本主题词确定方法时,仅以上述各功能模块的划分进行举例说明,实际应用中,可以根据需要而将上述功能分配由不同的功能模块完成,即将设备的内部结构划分成不同的功能模块,以完成以上描述的全部或者部分功能。另外,上述实施例提供的文本主题词确定装置与文本主题词确定方法实施例属于同一构思,其体现实现过程详见方法实施例,这里不再赘述。
上述本申请实施例序号仅仅为了描述,不代表实施例的优劣。
在本申请实施例中,终端首先对至少一个输入文本进行预处理,得到至少一个目标文本;接着根据预先训练得到的词语集合,构建所述至少一个目标文本与所述词语集合中至少一个词语之间的第一映射关系;再基于预先训练得到的主题类型与所述词语集合中至少一个词语 之间的第二映射关系,确定所述至少一个目标文本与至少一个主题类型之间的第三映射关系;最后根据所述第三映射关系确定所述至少一个目标文本对应的至少一个主题类型,进而基于所述第二映射关系,确定所述至少一个目标文本对应的至少一个主题词。本申请提供的实施例是预先利用短文本对主题模型进行训练,保证了模型对短文本的适用性;在线使用时直接使用预先训练过程中生成的词语集合以及第二映射关系来得出输入文本的主题词,保证了在线使用时对短文本主题词的准确挖掘。
本申请实施例还提供了一种计算机可读存储介质,其上存储有计算机程序,该程序被处理器执行时实现前述任一实施例方法的步骤。其中,计算机可读存储介质可以包括但不限于任何类型的盘,包括软盘、光盘、DVD、CD-ROM、微型驱动器以及磁光盘、ROM、RAM、EPROM、EEPROM、DRAM、VRAM、闪速存储器设备、磁卡或光卡、纳米系统(包括分子存储器IC),或适合于存储指令和/或数据的任何类型的媒介或设备。
本申请实施例还提供了一种终端,包括存储器、处理器及存储在存储器上并可在处理器上运行的计算机程序,处理器执行程序时实现上述任一实施例方法的步骤。
请参见图9,为本申请实施例提供的一种终端结构框图。
如图9所示,终端600包括有:处理器601和存储器602。
本申请实施例中,处理器601为计算机系统的控制中心,可以是实体机的处理器,也可以是虚拟机的处理器。处理器601可以包括一个或多个处理核心,比如4核心处理器、8核心处理器等。处理器601可以采用DSP(Digital Signal Processing,数字信号处理)、FPGA(Field-Programmable Gate Array,现场可编程门阵列)、PLA(Programmable Logic Array,可编程逻辑阵列)中的至少一种硬件形式来实现。处理器601也可以包括主处理器和协处理器,主处理器是用于对在唤醒状态下的数据进行处理的处理器,也称CPU(Central Processing Unit,中央处理器);协处理器是用于对在待机状态下的数据进行处理的低功耗处理器。
存储器602可以包括一个或多个计算机可读存储介质,该计算机可读存储介质可以是非暂态的。存储器602还可包括高速随机存取存储器,以及非易失性存储器,比如一个或多个磁盘存储设备、闪存存储设备。在本申请的一些实施例中,存储器602中的非暂态的计算机可读存储介质用于存储至少一个指令,该至少一个指令用于被处理器601所执行以实现本申请实施例中的方法。
一些实施例中,终端600还包括有:外围设备接口603和至少一个外围设备。处理器601、存储器602和外围设备接口603之间可以通过总线或信号线相连。各个外围设备可以通过总线、信号线或电路板与外围设备接口603相连。具体地,外围设备包括:显示屏604、摄像头605和音频电路606中的至少一种。
外围设备接口603可被用于将I/O(Input/Output,输入/输出)相关的至少一个外围设备连接到处理器601和存储器602。在本申请的一些实施例中,处理器601、存储器602和外围设备接口603被集成在同一芯片或电路板上;在本申请的一些其他实施例中,处理器601、存储器602和外围设备接口603中的任意一个或两个可以在单独的芯片或电路板上实现。本申请实施例对此不作具体限定。
显示屏604用于显示UI(User Interface,用户界面)。该UI可以包括图形、文本、图标、视频及其它们的任意组合。当显示屏604是触摸显示屏时,显示屏604还具有采集在显示屏604的表面或表面上方的触摸信号的能力。该触摸信号可以作为控制信号输入至处理器601进行处理。此时,显示屏604还可以用于提供虚拟按钮和/或虚拟键盘,也称软按钮和/或软键盘。在本申请的一些实施例中,显示屏604可以为一个,设置终端600的前面板;在本申请的另一些实施例中,显示屏604可以为至少两个,分别设置在终端600的不同表面或呈折叠设计;在本申请的再一些实施例中,显示屏604可以是柔性显示屏,设置在终端600的弯曲表面上或折叠面上。甚至,显示屏604还可以设置成非矩形的不规则图形,也即异形屏。显示屏604可以采用LCD(Liquid Crystal Display,液晶显示屏)、OLED(Organic Light-Emitting Diode,有机发光二极管)等材质制备。
摄像头605用于采集图像或视频。可选地,摄像头605包括前置摄像头和后置摄像头。通常,前置摄像头设置在终端的前面板,后置摄像头设置在终端的背面。在一些实施例中,后置摄像头为至少两个,分别为主摄像头、景深摄像头、广角摄像头、长焦摄像头中的任意一种,以实现主摄像头和景深摄像头融合实现背景虚化功能、主摄像头和广角摄像头融合实现全景拍摄以及VR(Virtual Reality,虚拟现实)拍摄功能或者其它融合拍摄功能。在本申请的一些实施例中,摄像头605还可以包括闪光灯。闪光灯可以是单色温闪光灯,也可以是双色温闪光灯。双色温闪光灯是指暖光闪光灯和冷光闪光灯的组合,可以用于不同色温下的光线补偿。
The audio circuit 606 may include a microphone and a speaker. The microphone is used to collect sound waves from the user and the environment, and to convert the sound waves into electrical signals that are input to the processor 601 for processing. For stereo collection or noise reduction purposes, there may be multiple microphones, respectively arranged at different parts of the terminal 600. The microphone may also be an array microphone or an omnidirectional collection microphone.
The power supply 607 is used to supply power to the components in the terminal 600. The power supply 607 may be alternating current, direct current, a disposable battery, or a rechargeable battery. When the power supply 607 includes a rechargeable battery, the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery. A wired rechargeable battery is a battery charged through a wired line, and a wireless rechargeable battery is a battery charged through a wireless coil. The rechargeable battery may also be used to support fast charging technology.
The structural block diagram of the terminal shown in the embodiment of the present application does not constitute a limitation on the terminal 600. The terminal 600 may include more or fewer components than shown, combine certain components, or adopt a different component arrangement.
In the present application, the terms "first", "second", and the like are used for descriptive purposes only and cannot be understood as indicating or implying relative importance or order; the term "multiple" refers to two or more, unless otherwise clearly defined. The terms "mount", "connect with", "connect", "fix", and the like should be understood in a broad sense; for example, "connect" may be a fixed connection, a detachable connection, or an integral connection; "connect with" may be a direct connection or an indirect connection through an intermediate medium. For those of ordinary skill in the art, the specific meanings of the above terms in the present application can be understood according to specific circumstances.
In the description of the present application, it should be understood that the orientations or positional relationships indicated by the terms "upper", "lower", and the like are based on the orientations or positional relationships shown in the drawings, and are only for the convenience of describing the present application and simplifying the description, rather than indicating or implying that the device or unit referred to must have a specific orientation or be constructed and operated in a specific orientation; therefore, they cannot be understood as limitations on the present application.
The above are only specific implementations of the present application, but the protection scope of the present application is not limited thereto. Any person skilled in the art can easily conceive of changes or substitutions within the technical scope disclosed in the present application, and these should all be covered within the protection scope of the present application. Therefore, equivalent changes made according to the claims of the present application still fall within the scope covered by the present application.

Claims (10)

  1. A method for determining text topic words, characterized in that the method comprises:
    preprocessing at least one input text to obtain at least one target text;
    constructing, according to a word set obtained by pre-training, a first mapping relationship between the at least one target text and at least one word in the word set;
    determining, based on a second mapping relationship between a topic type obtained by pre-training and at least one word in the word set, a third mapping relationship between the at least one target text and at least one topic type; and
    determining at least one topic type corresponding to the at least one target text according to the third mapping relationship, and then determining, based on the second mapping relationship, at least one topic word corresponding to the at least one target text.
  2. The method according to claim 1, characterized in that the first mapping relationship comprises a target text-word frequency matrix, and the second mapping relationship comprises a topic type-word frequency matrix.
  3. The method according to claim 1, characterized in that the third mapping relationship is a target text-topic type matrix; and
    the determining at least one topic type corresponding to the at least one target text according to the third mapping relationship, and then determining, based on the second mapping relationship, at least one topic word corresponding to the at least one target text comprises:
    taking the index corresponding to the maximum probability value in the target text-topic type matrix as the topic type index of the target text, and determining the at least one topic word based on the topic type index and the topic type-word frequency matrix.
  4. The method according to claim 2, characterized in that the training process of the topic type-word frequency matrix comprises:
    performing text merging processing on at least one first sample text to generate at least one second sample text, wherein the text length of each first sample text is less than a preset threshold, and the text length of each second sample text is greater than or equal to the preset threshold;
    obtaining topic prior information based on topic types and topic words of the at least one second sample text; and
    training a text topic mining model based on the at least one second sample text and the topic prior information, to obtain the topic type-word frequency matrix.
  5. The method according to claim 4, characterized in that the topic prior information comprises: a probability distribution of the at least one second sample text belonging to different topic types.
  6. The method according to claim 4, characterized in that the training process of the topic type-word frequency matrix further comprises:
    generating a word set according to the vocabulary in the at least one first sample text;
    constructing a target text-word frequency matrix based on statistical results of word occurrence frequencies in the at least one first sample text;
    training a text topic word mining model by using the at least one second sample text and the topic prior information, wherein the model training is completed when a sample text-word frequency matrix generated during the training process is consistent with the target text-word frequency matrix; and
    obtaining a sample topic type-word frequency matrix generated during the training process.
  7. An apparatus for determining text topic words, characterized in that the apparatus comprises:
    a target text obtaining module, configured to preprocess at least one input text to obtain at least one target text;
    a first mapping relationship construction module, configured to construct, according to a word set obtained by pre-training, a first mapping relationship between the at least one target text and at least one word in the word set;
    a third mapping relationship determination module, configured to determine, based on a second mapping relationship between a topic type obtained by pre-training and at least one word in the word set, a third mapping relationship between the at least one target text and at least one topic type; and
    a topic word determination module, configured to determine at least one topic type corresponding to the at least one target text according to the third mapping relationship, and then determine, based on the second mapping relationship, at least one topic word corresponding to the at least one target text.
  8. The apparatus according to claim 7, characterized in that the first mapping relationship comprises a target text-word frequency matrix, the second mapping relationship comprises a topic type-word frequency matrix, and the third mapping relationship is a target text-topic type matrix; the topic word determination module is specifically configured to:
    take the index corresponding to the maximum probability value in the target text-topic type matrix as the topic type index of the target text, and determine the at least one topic word based on the topic type index and the topic type-word frequency matrix.
  9. A computer storage medium, characterized in that the computer storage medium stores multiple instructions, and the instructions are adapted to be loaded by a processor to execute the method steps of any one of claims 1 to 6.
  10. A terminal, characterized by comprising a processor and a memory, wherein the memory stores a computer program, and the computer program is adapted to be loaded by the processor to execute the method steps of any one of claims 1 to 6.
PCT/CN2020/134772 2020-01-06 2020-12-09 Method and apparatus for determining text topic words, storage medium and terminal WO2021139466A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010010680.0 2020-01-06
CN202010010680.0A CN111274798B (zh) 2020-01-06 2020-01-06 Method and apparatus for determining text topic words, storage medium and terminal

Publications (1)

Publication Number Publication Date
WO2021139466A1 true WO2021139466A1 (zh) 2021-07-15

Family

ID=71000087

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/134772 WO2021139466A1 (zh) 2020-01-06 2020-12-09 Method and apparatus for determining text topic words, storage medium and terminal

Country Status (2)

Country Link
CN (1) CN111274798B (zh)
WO (1) WO2021139466A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111274798B (zh) 2020-01-06 2023-08-18 北京大米科技有限公司 Method and apparatus for determining text topic words, storage medium and terminal
CN111831788A (zh) * 2020-06-16 2020-10-27 国网江苏省电力有限公司信息通信分公司 Method and system for constructing an electric power corpus labeling model
CN112084772A (zh) * 2020-09-25 2020-12-15 北京明略昭辉科技有限公司 Text quality monitoring method and apparatus, electronic device and storage medium
CN115983251B (zh) * 2023-02-16 2023-06-09 江苏联著实业股份有限公司 Text topic extraction system and method based on sentence-usage analysis
CN116431814B (zh) * 2023-06-06 2023-09-05 北京中关村科金技术有限公司 Information extraction method and apparatus, electronic device and readable storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101315624B (zh) * 2007-05-29 2015-11-25 阿里巴巴集团控股有限公司 Method and apparatus for text topic recommendation
CN105045812B (zh) * 2015-06-18 2019-01-29 上海高欣计算机系统有限公司 Text topic classification method and system
CN107797982B (zh) * 2016-08-31 2021-05-07 百度在线网络技术(北京)有限公司 Method, apparatus and device for identifying text types
CN110162771B (zh) * 2018-11-22 2023-08-29 腾讯科技(深圳)有限公司 Event trigger word recognition method and apparatus, and electronic device
CN110032639B (zh) * 2018-12-27 2023-10-31 中国银联股份有限公司 Method and apparatus for matching semantic text data with tags, and storage medium

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103049568A (zh) * 2012-12-31 2013-04-17 武汉传神信息技术有限公司 Method for classifying documents in a massive document library
CN107368489A (zh) * 2016-05-12 2017-11-21 阿里巴巴集团控股有限公司 Information data processing method and apparatus
CN106649422A (zh) * 2016-06-12 2017-05-10 中国移动通信集团湖北有限公司 Keyword extraction method and apparatus
US20180239741A1 (en) * 2017-02-17 2018-08-23 General Electric Company Methods and systems for automatically identifying keywords of very large text datasets
CN108763213A (zh) * 2018-05-25 2018-11-06 西南电子技术研究所(中国电子科技集团公司第十研究所) Topic feature text keyword extraction method
CN111274798A (zh) * 2020-01-06 2020-06-12 北京大米科技有限公司 Method and apparatus for determining text topic words, storage medium and terminal

Also Published As

Publication number Publication date
CN111274798B (zh) 2023-08-18
CN111274798A (zh) 2020-06-12

Similar Documents

Publication Publication Date Title
WO2021139466A1 (zh) Method and apparatus for determining text topic words, storage medium and terminal
US11947911B2 (en) Method for training keyword extraction model, keyword extraction method, and computer device
CN108170749B (zh) Artificial-intelligence-based dialogue method and apparatus, and computer-readable medium
US20200294488A1 (en) Method, device and storage medium for speech recognition
US20180293507A1 (en) Method and apparatus for extracting keywords based on artificial intelligence, device and readable medium
CN108595431B (zh) Voice interaction text error correction method and apparatus, terminal and storage medium
CN107608532B (zh) Associative input method and apparatus, and electronic device
US11127394B2 (en) Method and system of high accuracy keyphrase detection for low resource devices
WO2018165932A1 (en) Generating responses in automated chatting
WO2015171646A1 (en) Method and system for speech input
WO2020151690A1 (zh) Sentence generation method and apparatus, device and storage medium
US11830482B2 (en) Method and apparatus for speech interaction, and computer storage medium
US20220309088A1 (en) Method and apparatus for training dialog model, computer device, and storage medium
CN111414736A (zh) Story generation model training method and apparatus, device and storage medium
CN111368525A (zh) Information search method and apparatus, device and storage medium
EP3734472A1 (en) Method and device for text processing
CN108920649A (zh) Information recommendation method and apparatus, device and medium
EP3790002A1 (en) System and method for modifying speech recognition result
CN111883117A (zh) Voice wake-up method and apparatus
CN116778040B (zh) Mouth-shape-based face image generation method, model training method and device
CN112036174A (zh) Punctuation labeling method and apparatus
CN117454954A (zh) Model training method and apparatus, computer device and storage medium
CN111414737A (zh) Story generation model training method and apparatus, device and storage medium
WO2023173659A1 (zh) Face matching method and apparatus, electronic device, storage medium, computer program product and computer program
US20220245364A1 (en) Electronic device and control method thereof

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20911553

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20911553

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 08.02.2023)
