WO2021139466A1 - Method, apparatus, storage medium and terminal for determining text topic words - Google Patents
- Publication number
- WO2021139466A1 (PCT/CN2020/134772)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- text
- topic
- word
- mapping relationship
- target text
- Prior art date
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/258—Heading extraction; Automatic titling; Numbering
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/284—Lexical analysis, e.g. tokenisation or collocates
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/279—Recognition of textual entities
- G06F40/289—Phrasal analysis, e.g. finite state techniques or chunking
Definitions
- This application relates to the field of computer technology, and in particular to a method, device, storage medium and terminal for determining text subject words.
- The theme is the central idea of an article or work, embodying the main body and core of its content, while topic words concisely summarize the main content of the article or work in a few words.
- The topic model is a common method of statistical text topic mining, which can discover and summarize the topic content of a text without human participation.
- the embodiments of the present application provide a method, device, storage medium, and terminal for determining text topic words, which are suitable for short text and can accurately mine topic words.
- the technical solution is as follows:
- an embodiment of the present application provides a method for determining text subject terms, the method including:
- At least one topic type corresponding to the at least one target text is determined according to the third mapping relationship, and then at least one topic word corresponding to the at least one target text is determined based on the second mapping relationship.
- an embodiment of the present application provides an apparatus for determining text subject words, the apparatus including:
- the target text obtaining module is used to preprocess at least one input text to obtain at least one target text
- the first mapping relationship construction module is configured to construct a first mapping relationship between the at least one target text and at least one word in the word set according to the word set obtained by pre-training;
- the third mapping relationship determination module is configured to determine the third mapping relationship between the at least one target text and at least one topic type based on the second mapping relationship, obtained in advance, between the topic type and at least one word in the word set;
- a topic word determination module configured to determine at least one topic type corresponding to the at least one target text according to the third mapping relationship, and then determine at least one topic word corresponding to the at least one target text based on the second mapping relationship.
- the embodiments of the present application provide a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the steps of any one of the above methods are implemented.
- an embodiment of the present application provides a terminal, including a memory, a processor, and a computer program stored in the memory and runnable on the processor; when the processor executes the program, the steps of any of the above methods are implemented.
- In the embodiment of the present application, the terminal first preprocesses at least one input text to obtain at least one target text; then constructs, according to the pre-trained word set, the first mapping relationship between the at least one target text and at least one word in the word set; then determines, based on the second mapping relationship between the pre-trained topic type and at least one word in the word set, the third mapping relationship between the at least one target text and at least one topic type; finally, at least one topic type corresponding to the at least one target text is determined according to the third mapping relationship, and at least one topic word corresponding to the at least one target text is determined based on the second mapping relationship.
- The embodiment provided in this application trains the topic model on short texts in advance to ensure the model's applicability to short texts; when used online, the word set generated during pre-training and the second mapping relationship are used directly to obtain the topic words of the input text, ensuring accurate mining of short-text topic words online.
- FIG. 1 is a schematic flowchart of a method for determining text subject words according to an embodiment of the present application
- FIG. 2 is a schematic flowchart of a method for determining text subject words according to an embodiment of the present application
- FIG. 3 is a schematic flowchart of a method for determining text subject words according to an embodiment of the present application
- FIG. 4 is a schematic diagram of the training process of a text topic word mining model provided by an embodiment of the present application.
- FIG. 5 is a schematic diagram of a complete process of offline training and online use of a method for determining text subject words provided by an embodiment of the present application;
- Fig. 6 is a schematic structural diagram of a text subject word determination device provided by an embodiment of the present application.
- FIG. 7 is a schematic structural diagram of a text subject word determination device provided by an embodiment of the present application.
- FIG. 8 is a schematic structural diagram of a text subject word determination device provided by an embodiment of the present application.
- FIG. 9 is a structural block diagram of a terminal provided by an embodiment of the present application.
- FIG. 1 is a schematic flowchart of a method for determining text subject words according to an embodiment of this application.
- the method of the embodiment of the present application may include the following steps:
- Before determining the topic words of the input text, the terminal must first preprocess it; preprocessing speeds up the subsequent mining of the input text's topic words. Topic word mining in the embodiment of this application targets short texts, so the text length of the obtained input text should be less than a preset threshold.
- the text length is the number of characters contained in the text.
- The preset threshold can be set to 120, for example; the number of input texts obtained is not limited, and may be one or more.
- Preprocessing may change the text length of the input text to some extent; the preprocessed input text is defined as the target text.
- The preprocessing includes typo correction, adjustment of word order, and removal of emoticons, and so on. For example, the typo in text 1 "黄梁一梦" is corrected to "黄粱一梦"; the word order of text 2 "He went to the library, probably" is adjusted to "He probably went to the library"; emoticons are removed from text 3 "The scenery here is infinitely good [emoticon]", leaving "The scenery here is infinitely good", and so on.
- Emoticon removal includes removing Japanese emoticons (kaomoji), removing Emoji, and removing other emoticon symbols.
- the preprocessing may also include text merging processing.
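As an illustration of the preprocessing step, the following minimal Python sketch strips emoticons and enforces the short-text length check. The helper names, the Emoji code-point ranges, and the crude kaomoji filter are assumptions for illustration only, not the patent's implementation; typo correction and word-order adjustment would need separate NLP components.

```python
import re
from typing import Optional

# Illustrative patterns, not the patent's: common Emoji code-point ranges
# and a crude parenthesized-kaomoji filter.
EMOJI_PATTERN = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")
KAOMOJI_PATTERN = re.compile(r"\([^)\s]{1,10}\)")

def preprocess(text: str, max_length: int = 120) -> Optional[str]:
    """Return the cleaned target text, or None when the input is not short text."""
    if len(text) >= max_length:
        return None  # only short texts (length below the threshold) are processed
    text = EMOJI_PATTERN.sub("", text)
    text = KAOMOJI_PATTERN.sub("", text)
    return text.strip()
```

A text at or above the assumed threshold of 120 is rejected, mirroring the short-text restriction described above.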
- S102 Construct a first mapping relationship between the at least one target text and at least one word in the word set according to the word set obtained by pre-training;
- the text is composed of words.
- a mapping relationship can be constructed between the target text and at least one word in the word set, which is called the first mapping relationship.
- Word composition analysis may be performed on the generated target text to obtain the words contained in the target text, and, based on the word set and the words obtained from the target text, a mapping relationship is formed between the target text and the corresponding words in the word set.
- the mapping relationship can be one-to-one or one-to-many.
- the mapping type is not limited, for example, it can be a list type or a dictionary type.
- the word set is generated based on at least one sample text.
- At least one sample text is first preprocessed (typo correction, word-order adjustment, emoji removal, and so on), then word segmentation is performed on each sample text to obtain the words it contains; the words contained in the at least one sample text constitute the word set.
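The word-set construction described above can be sketched as follows. Whitespace tokenization and the tiny stop-word list stand in for a real Chinese segmenter and stop-word table, so this is an assumption-laden sketch rather than the patent's method.

```python
def build_word_set(sample_texts, stop_words=frozenset({"the", "a", "an", "is"})):
    """Collect the vocabulary of all sample texts after tokenization and
    stop-word removal (whitespace split stands in for real segmentation)."""
    words = set()
    for text in sample_texts:
        for token in text.split():
            if token not in stop_words:
                words.add(token)
    return sorted(words)
```

The sorted list gives each word a stable column index for the matrices built later.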
- S103 Determine a third mapping relationship between the at least one target text and the at least one topic type based on the second mapping relationship between the topic type obtained in advance and at least one word in the word set;
- Each topic type set for the multiple sample texts is likewise summarized by words; the mapping relationship formed between a topic type and at least one word in the word set is called the second mapping relationship.
- Both the first mapping relationship and the second mapping relationship are related to word sets, and the third mapping relationship can be obtained by combining the two, that is, the corresponding relationship between the target text and the topic type.
- S104 Determine at least one topic type corresponding to the at least one target text according to the third mapping relationship, and then determine at least one topic word corresponding to the at least one target text based on the second mapping relationship.
- the subject type of the target text can be determined according to the third mapping relationship, and then the word corresponding to the subject type of the target text can be determined through the second mapping relationship, and the word is used as the subject word of the target text.
- the subject words can concisely summarize the main theme of the text, and the subject words of the target text can be one or more.
- In summary, the terminal first preprocesses at least one input text to obtain at least one target text; then, according to the pre-trained word set, constructs the first mapping relationship between the at least one target text and at least one word in the word set; and then, based on the second mapping relationship, determines the third mapping relationship between the at least one target text and at least one topic type.
- at least one topic type corresponding to the at least one target text is determined according to the third mapping relationship, and then at least one topic word corresponding to the at least one target text is determined based on the second mapping relationship.
- The embodiment provided in this application trains the topic model on short texts in advance to ensure the model's applicability to short texts; when used online, the word set generated during pre-training and the second mapping relationship are used directly to obtain the topic words of the input text, ensuring accurate mining of short-text topic words online.
- FIG. 2 is a schematic flowchart of a method for determining text subject words according to an embodiment of this application.
- the method of the embodiment of the present application may include the following steps:
- S201 Perform text merging processing on at least one first sample text to generate at least one second sample text, where the text length of each first sample text is less than a preset threshold, and the text length of each second sample text is greater than or equal to the preset threshold;
- this embodiment of the application provides a model training method.
- When the model training is completed, the second mapping relationship generated during the training process is saved.
- When used online after training, the topic words of the input text can be accurately obtained based on the input text and the second mapping relationship saved at training completion.
- The training of the model in this embodiment is performed on samples of different types; that is, the training samples include multiple types, for example both commercial-type text and literary-type text. The sample text before the text merging process is defined as the first sample text; the first sample text is short text whose length is less than a preset threshold.
- Because short texts contain few words, learning the co-occurrence rules of topic words (several words appearing together) from them is difficult, and the matrix generated by training will be sparse, so the accuracy of topic words obtained from this matrix during subsequent online use is also insufficient.
- the at least one first sample text needs to be subjected to text merging processing to generate at least one second sample text.
- the second sample text is a long text, and the text length is greater than or equal to a preset threshold.
- the text merging process for at least one first sample text may be to use some existing clustering algorithms (such as K-means clustering, mean shift algorithm, etc.) to first cluster the at least one first sample text, Then, according to the clustering results, the texts are merged in various combinations to generate at least one second sample text, and the number of words in each sample text is increased.
- some existing natural language processing technologies may be used to combine/merge at least one first sample text in different ways to generate at least one second sample text, thereby increasing the number of words in each sample text. For example, several pieces of first sample text with the same grammatical structure are merged and expanded into one piece of second sample text.
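A minimal sketch of the merging idea, assuming the clustering step has already ordered similar texts next to each other; the greedy concatenation below simply packs short texts together until each merged text reaches the length threshold.

```python
def merge_short_texts(texts, threshold=140):
    """Greedily concatenate (already clustered/ordered) short texts until each
    merged text reaches the length threshold. A real implementation would
    first group the texts with K-means or mean-shift clustering, omitted here."""
    merged, buffer = [], ""
    for text in texts:
        buffer = text if not buffer else buffer + " " + text
        if len(buffer) >= threshold:
            merged.append(buffer)
            buffer = ""
    if buffer:  # leftover text shorter than the threshold
        merged.append(buffer)
    return merged
```

With a toy threshold of 5, `["ab", "cd", "ef"]` packs into `["ab cd", "ef"]`, increasing the word count per merged sample as the patent intends.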
- the determination of the text length is to count the number of characters contained in the text.
- the characters include text and punctuation in various languages.
- a Chinese character or Chinese punctuation is usually counted as two characters, and an English letter or English punctuation is usually counted as one character.
- "Current Events Hot News” contains 12 characters, and the text length of the text is 12; "Hello! contains 6 characters, and the text length of the text is 6.
- a text length threshold is preset, and texts with a text length less than the preset threshold are divided into short texts, and texts with a text length greater than or equal to the preset threshold are divided into long texts, where the preset threshold can be set to 140 or 150, and so on.
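The character-counting convention described above might be sketched as follows; treating every code point above U+2E80 as a two-character CJK/full-width symbol is a simplifying assumption, not the patent's exact rule.

```python
def text_length(text: str) -> int:
    """Count length as described: a CJK character or full-width punctuation
    mark counts as two characters, anything else (ASCII letters, digits,
    half-width punctuation) as one."""
    length = 0
    for ch in text:
        # code points above U+2E80 cover CJK ideographs and full-width marks
        length += 2 if ord(ch) > 0x2E80 else 1
    return length
```

Under this rule "Hello!" has length 6 and a two-character Chinese word such as "你好" has length 4, which can then be compared against the preset threshold.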
- The training of the model can also be performed only on texts of the same type, that is, the obtained at least one first sample text is all of the same type; for example, sample text a, text b, and text c all belong to the sports category.
- Compared with the number of first sample texts before the merging process, the number of second sample texts after merging may be larger, smaller, or the same.
- S202 Obtain topic prior information based on the topic type and topic word of the at least one second sample text
- Prior information refers to experience gained from historical data. In this application, before using the samples to train the model, prior information about the topics needs to be obtained in order to make the training result better.
- The prior information of the topics is obtained through an algorithm, combining historical experience with a certain amount of data analysis, which makes the prior information more reliable.
- The topic types and the topic words corresponding to each topic type are pre-stored in the terminal. When the at least one second sample text is generated, the stored preset topic types and their corresponding topic words are obtained; then, using existing language processing technology, the preset topic type to which each second sample text belongs is determined, this result is used as the topic prior information, and step S203 is performed.
- This embodiment does not limit the number of pre-stored topic types and the number of keywords corresponding to each topic type.
- the preset topic type to which each second sample text belongs can be determined by the probability, and the preset topic type membership result of at least one second sample text is used as the topic prior information .
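One plausible way to turn the preset topic keywords into the probability-distribution prior is keyword-overlap scoring. The patent does not specify the "certain calculations", so the function below is an assumed stand-in for illustration only.

```python
def topic_prior(sample_text, topic_keywords):
    """Estimate a probability distribution over preset topic types from
    keyword-overlap counts. topic_keywords maps topic name -> keyword set."""
    tokens = set(sample_text.split())
    scores = {t: len(tokens & kws) for t, kws in topic_keywords.items()}
    total = sum(scores.values())
    if total == 0:  # no keyword overlap at all: fall back to a uniform prior
        n = len(topic_keywords)
        return {t: 1.0 / n for t in topic_keywords}
    return {t: s / total for t, s in scores.items()}
```

Collecting one such distribution per second sample text yields the topic prior information used to train the model.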
- S203 Based on the at least one second sample text and the topic prior information, train a text topic mining model to obtain a topic type-term frequency matrix.
- The model is trained based on the at least one second sample text obtained in the above steps and the topic prior information, which increases the length of the sample texts and the number of words they contain, reduces the difficulty of obtaining word co-occurrence rules, and solves the sparsity problem of the matrix output by the model, effectively guaranteeing the accuracy of topic word extraction for input texts during subsequent online use and enhancing the interpretability of the labels.
- the model is a text topic word mining model, and any model that can perform topic word mining can be applied to the embodiments provided in this application.
- For example, a labeled latent Dirichlet allocation topic mining model (Labeled-Latent Dirichlet Allocation, Labeled-LDA) can be used.
- For details of this step, refer to step S101; they will not be repeated here.
- the frequency of each word in the word set in the target text is determined, and the target text-word frequency matrix (ie, the first mapping relationship in step S102) is constructed.
- For details not described in this step, refer to step S102; they will not be repeated here.
- the target text-topic type matrix can be obtained from the target text-term frequency matrix constructed in step S205 and the topic type-term frequency matrix saved when the training is completed, and step S207 is executed.
- Combined with the topic type-term frequency matrix saved at training completion, the topic words of the target text can then be determined.
- The specific process is as follows: the topic type index (a, b) with the largest probability value in the target text-topic type matrix determines that the most probable topic type of target text a is b; then the topic word index (b, c) in the topic type-term frequency matrix determines topic word c, which is output for text a. The topic word can be composed of the type of the text and high-frequency words/keywords in the text.
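The (a, b) then (b, c) index lookups can be sketched with plain lists, where `doc_topic_row` stands for row a of the text-topic type matrix and `topic_word_matrix` for the topic type-term frequency matrix; the names are illustrative assumptions.

```python
def topic_words_for(doc_topic_row, topic_word_matrix, topics, words, top_n=1):
    """Pick the topic b with the largest probability in the document's row of
    the text-topic matrix, then return the top-n words of row b in the
    topic-word matrix, mirroring the (a, b) then (b, c) index lookups."""
    b = max(range(len(doc_topic_row)), key=doc_topic_row.__getitem__)
    row = topic_word_matrix[b]
    ranked = sorted(range(len(row)), key=row.__getitem__, reverse=True)
    return topics[b], [words[c] for c in ranked[:top_n]]
```

For a document whose topic distribution favors "sports", the function returns that topic together with its highest-probability word.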
- In summary, the terminal first preprocesses at least one input text to obtain at least one target text; then, according to the pre-trained word set, constructs the first mapping relationship between the at least one target text and at least one word in the word set; and then, based on the second mapping relationship, determines the third mapping relationship between the at least one target text and at least one topic type.
- at least one topic type corresponding to the at least one target text is determined according to the third mapping relationship, and then at least one topic word corresponding to the at least one target text is determined based on the second mapping relationship.
- The embodiment provided in this application trains the topic model on short texts in advance to ensure the model's applicability to short texts; when used online, the word set generated during pre-training and the second mapping relationship are used directly to obtain the topic words of the input text, ensuring accurate mining of short-text topic words online.
- FIG. 3 is a schematic flowchart of a method for determining text subject words according to an embodiment of this application.
- the method of the embodiment of the present application may include the following steps:
- S301 Generate a word set according to the vocabulary in the at least one first sample text;
- Each first sample text is different and contains a different number of words, along with some meaningless words; therefore, word segmentation and stop-word removal are performed on each first sample text to obtain the valid words it contains, and the valid words contained in the at least one first sample text constitute the word set.
- Word segmentation refers to segmenting the sentences in a text. For example, segmenting the text "Xiao Ming is attracted by a flower on the lakeshore" may yield "Xiao Ming / is / by / the lakeshore / a / flower / attracted", and so on.
- The word segmentation method may specifically be the forward maximum matching method, a word segmentation method based on an N-gram language model, or a word segmentation method based on an HMM. Stop words are words with no actual meaning in the text, such as "的", "了", "in", "a", "an", "the"; removing such words makes the samples more meaningful and model training faster.
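Of the segmentation methods named above, forward maximum matching is simple enough to sketch directly: at each position, take the longest dictionary word that matches, falling back to a single character. The toy dictionary is an assumption for illustration.

```python
def forward_max_match(text, dictionary, max_len=4):
    """Forward maximum matching segmentation: greedily take the longest
    dictionary word starting at the current position, else one character."""
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)  # single characters always match
                i += size
                break
    return tokens
```

With the dictionary `{"ab", "abc", "d"}`, the string `"abcd"` segments as `["abc", "d"]`, since the longer match `"abc"` is preferred over `"ab"`.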
- S302 Construct a target text-term frequency matrix based on the statistical result of the occurrence frequency of the words in the at least one first sample text;
- The frequency of occurrence of each word in the word set is counted in each first sample text, and the target text-term frequency matrix is constructed from the at least one first sample text and the word-frequency statistics of each first sample text.
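The counting step can be sketched as below; whitespace tokenization is again an assumption standing in for real segmentation, and the entry at (i, j) is the count of word j in text i.

```python
def term_frequency_matrix(texts, word_set):
    """Build a text-term frequency matrix: entry (i, j) is the number of
    times word_set[j] occurs in texts[i]."""
    index = {w: j for j, w in enumerate(word_set)}
    matrix = []
    for text in texts:
        row = [0] * len(word_set)
        for token in text.split():
            if token in index:  # words outside the word set are ignored
                row[index[token]] += 1
        matrix.append(row)
    return matrix
```

For example, `term_frequency_matrix(["cat sat cat"], ["cat", "sat"])` yields the single row `[2, 1]`.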
- The target text-term frequency matrix is the real text-term frequency matrix obtained by manual counting.
- For details of this step, refer to step S201; they will not be repeated here.
- Before the text merging processing is performed on the at least one first sample text, the processing may also include typo correction, word-order adjustment, and emoji removal.
- S304 Use the probability distribution of the at least one second sample text belonging to different topic types as topic prior information
- With this prior information, the model parameters can be made better and the output result closer to the real data.
- For details of this step that are not described, refer to step S202; they will not be repeated here.
- a sample text-word frequency matrix is formed during the model training process.
- The sample text-term frequency matrix generated during the training of the text topic word mining model can be compared with the real target text-term frequency matrix obtained by manual counting; when the two are consistent, the model training is completed, and the parameters in the model have reached their optimal values.
- The sample topic type-term frequency matrix generated during the training process is obtained and saved for subsequent online use.
- If the sample text-term frequency matrix is inconsistent with the target text-term frequency matrix, the training has not been completed; the model needs to be adjusted and training continued with more second sample texts until the sample text-term frequency matrix is consistent with the target text-term frequency matrix.
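The consistency check between the two matrices might look like the following; the element-wise tolerance is an assumption, since the text only says the matrices must be "consistent".

```python
def training_converged(sample_matrix, target_matrix, tol=1e-6):
    """Training stops when the model's sample text-term frequency matrix is
    element-wise consistent with the manually counted target matrix; tol
    allows for numerical noise in the model output."""
    return all(
        abs(a - b) <= tol
        for row_a, row_b in zip(sample_matrix, target_matrix)
        for a, b in zip(row_a, row_b)
    )
```

When this returns False, the training loop would adjust the model and continue with more second sample texts, as described above.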
- For details of this step, refer to step S101; they will not be repeated here.
- S307 Construct a target text-word frequency matrix according to the target text and the word set generated by pre-training;
- For details of this step, refer to step S205; they will not be repeated here.
- S308 Determine the target text-topic type matrix based on the target text-term frequency matrix and the topic type-term frequency matrix obtained through pre-training;
- For details of this step, refer to step S206; they will not be repeated here.
- For details of this step, refer to step S207; they will not be repeated here.
- In summary, the terminal first preprocesses at least one input text to obtain at least one target text; then, according to the pre-trained word set, constructs the first mapping relationship between the at least one target text and at least one word in the word set; and then, based on the second mapping relationship, determines the third mapping relationship between the at least one target text and at least one topic type.
- at least one topic type corresponding to the at least one target text is determined according to the third mapping relationship, and then at least one topic word corresponding to the at least one target text is determined based on the second mapping relationship.
- The embodiment provided in this application trains the topic model on short texts in advance to ensure the model's applicability to short texts; when used online, the word set generated during pre-training and the second mapping relationship are used directly to obtain the topic words of the input text, ensuring accurate mining of short-text topic words online.
- FIG. 4 is a schematic diagram of the training process of a text topic word mining model provided by an embodiment of this application.
- the Labeled-LDA model is taken as an example to illustrate the model training process in the foregoing embodiment.
- The training process is specifically as follows: 1) a real text-term frequency matrix s is obtained by manual counting (that is, the target text-term frequency matrix); 2) the latent Dirichlet distribution uses two different parameters α and β to form two different distributions, named Dirichlet distribution α and Dirichlet distribution β; 3) based on subjective identification, some topic types are given and certain calculations are applied to obtain the topic prior information; 4) from the sample text (not shown in the figure), Dirichlet distribution α, and the topic prior information, the sample text-topic type matrix θ is obtained; 5) from the sample text (not shown in the figure) and Dirichlet distribution β, the sample topic type-term frequency matrix φ is obtained; 6) from θ and φ, the sample text-term frequency matrix w is obtained; 7) when the sample text-term frequency matrix w generated during the training of the Labeled-LDA model is infinitely close to/consistent with the real text-term frequency matrix s, the Labeled-LDA model training is completed; at this time the parameter φ in the model reaches its optimal value, and the sample topic type-term frequency matrix φ generated during the training process is saved for subsequent online use.
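The Dirichlet draws underlying the α- and β-parameterized distributions in the steps above can be sketched with the standard Gamma-normalization construction. This shows only how one symmetric Dirichlet sample (e.g. a document's topic mixture) is produced, not the full Labeled-LDA training loop; the function name and default seed are illustrative.

```python
import random

def sample_dirichlet(alpha, k, rng=None):
    """Draw one sample from a symmetric Dirichlet(alpha) over k components
    by normalizing independent Gamma(alpha, 1) draws, the standard
    construction used for LDA's document-topic and topic-word priors."""
    if rng is None:
        rng = random.Random(0)  # fixed seed for reproducibility of the sketch
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    return [d / total for d in draws]
```

Each returned vector is a valid probability distribution: non-negative entries summing to one, as required for a row of θ or φ.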
- FIG. 5 is a schematic diagram of a complete process of offline training and online use of a method for determining a subject word provided in an embodiment of this application.
- During offline training, the input sample texts are clustered using a clustering algorithm to form n categories, and the texts in these n categories are combined and merged to form the long texts shown in the figure (text 11 ... text n4, counted as d); after removing emoticons, word segmentation, removing stop words, and so on, a word set of size w is generated; t topic types are manually preset, the probability distribution of each sample text over the topic types is calculated, and the probability distributions of the d long texts constitute the topic prior information; the Labeled-LDA model is trained using the topic prior information and the d texts, and during training the model generates a d*w text-term frequency matrix based on the word set and the d pieces of text.
- When the d*w text-term frequency matrix generated during training is consistent with the target text-term frequency matrix (obtained by manual counting), the training is completed; the parameters in the model are then optimal, and a t*w topic type-term frequency matrix is output.
- During online use, a k*w text-word frequency matrix is constructed from the word set of size w generated during offline training and the k pieces of input text after preprocessing.
- Combining the k*w text-term frequency matrix with the t*w topic type-term frequency matrix output by offline training yields a k*t text-topic type matrix; the index corresponding to the largest probability value in the k*t text-topic type matrix is used as the topic type index of the input text. Once the topic type of each input text is determined, the topic words of the input texts are obtained from the topic type-term frequency matrix output by offline training (that is, topic word 1 ... topic word k as shown in the figure).
- FIG. 6 is a schematic structural diagram of a text subject word determination device provided by an exemplary embodiment of this application.
- the text subject word determination device can be implemented as all or a part of the terminal through software, hardware or a combination of the two, and can also be integrated on the server as an independent module.
- the apparatus for determining text subject terms in the embodiment of the present application is applied to a terminal.
- the apparatus 1 includes a target text acquisition module 11, a first mapping relationship building module 12, a third mapping relationship determining module 13, and a subject term determining module 14, wherein:
- the target text obtaining module 11 is configured to preprocess at least one input text to obtain at least one target text;
- the first mapping relationship construction module 12 is configured to construct a first mapping relationship between the at least one target text and at least one word in the word set according to the word set obtained by pre-training;
- the third mapping relationship determining module 13 is configured to determine the third mapping relationship between the at least one target text and at least one topic type based on the second mapping relationship, obtained in advance, between the topic type and at least one word in the word set;
- the topic word determining module 14 is configured to determine at least one topic type corresponding to the at least one target text according to the third mapping relationship, and then determine at least one topic word corresponding to the at least one target text based on the second mapping relationship.
- the first mapping relationship includes a target text-term frequency matrix;
- the second mapping relationship includes a topic type-term frequency matrix
- the third mapping relationship is a target text-topic type matrix
- the topic word determining module 14 is specifically used for:
- the index corresponding to the maximum probability value in the target text-topic type matrix is used as the topic type index of the target text, and the at least one topic word is determined based on the topic type index and the topic type-term frequency matrix.
- FIG. 7 is a schematic structural diagram of a text subject word determination device provided by an exemplary embodiment of this application.
- the apparatus 1 for determining text subject words provided by the embodiment of the present application further includes:
- the second sample text generation module 15 is configured to perform text merging processing on at least one first sample text to generate at least one second sample text, where the text length of each first sample text is less than a preset threshold, and the text length of each second sample text is greater than or equal to the preset threshold;
- the subject prior information acquisition module 16 is configured to acquire subject prior information based on the subject type and subject words of the at least one second sample text;
- the topic type-term frequency matrix obtaining module 17 is configured to train a text topic mining model based on the at least one second sample text and the topic prior information to obtain a topic type-term frequency matrix.
- the subject prior information acquisition module 16 is specifically configured to:
- the topic prior information includes: probability distributions of the at least one second sample text belonging to different topic types.
- FIG. 8 is a schematic structural diagram of a text subject word determination device provided by an exemplary embodiment of this application.
- the apparatus 1 for determining text subject words provided by the embodiment of the present application further includes:
- the word set generating module 18 is configured to generate a word set according to the vocabulary in the at least one first sample text
- the target text-term frequency matrix construction module 19 is configured to construct a target text-term frequency matrix based on the statistical result of the occurrence frequency of words in the at least one first sample text;
- the topic type-term frequency matrix obtaining module 17 is specifically configured to:
- the at least one second sample text and the topic prior information are used to train the text topic word mining model; when the sample text-term frequency matrix generated during training is consistent with the target text-term frequency matrix, model training is complete; the sample topic type-term frequency matrix generated during training is then obtained.
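The training loop described above, which iterates until the reconstructed text-term frequency matrix is consistent with the target one, can be sketched with NMF-style multiplicative updates. This is an assumed stand-in for the patent's unspecified training procedure, offered only as an illustration; the function name and the choice of NMF are the editor's, not the applicant's:

```python
import numpy as np

def train_topic_term_matrix(doc_term, num_topics, iters=1000, seed=0):
    """Factor the target text-term frequency matrix X into a
    document-topic matrix W and a topic type-term frequency matrix H,
    iterating so that the reconstruction W @ H approaches X
    (nonnegative matrix factorization with multiplicative updates)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(doc_term, dtype=float)
    n_docs, n_terms = X.shape
    W = rng.random((n_docs, num_topics)) + 1e-3
    H = rng.random((num_topics, n_terms)) + 1e-3
    for _ in range(iters):
        H *= (W.T @ X) / (W.T @ W @ H + 1e-9)  # update topic-term matrix
        W *= (X @ H.T) / (W @ H @ H.T + 1e-9)  # update doc-topic matrix
    return W, H
```

On a corpus whose documents split cleanly into two word groups, the learned H concentrates each topic's mass on one group, which is the "topic type-term frequency matrix" read off after training.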
- when the device for determining text topic words provided in the above embodiment executes the method for determining text topic words, the division into the above functional modules is used only as an example for illustration. In practical applications, the above functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above.
- the text subject word determination device provided in the above-mentioned embodiment and the text subject word determination method embodiment belong to the same concept. For the implementation process of the text subject word determination method, please refer to the method embodiment, which will not be repeated here.
- the terminal first preprocesses at least one input text to obtain at least one target text; then, according to the pre-trained word set, it constructs the first mapping relationship between the at least one target text and at least one word in the word set; based on the pre-trained second mapping relationship between topic types and words in the word set, it determines the third mapping relationship between the at least one target text and at least one topic type.
- at least one topic type corresponding to the at least one target text is determined according to the third mapping relationship, and then at least one topic word corresponding to the at least one target text is determined based on the second mapping relationship.
- the embodiments provided in this application pre-train the topic model on short texts to ensure the model's applicability to short text; in online use, the word set generated during pre-training and the second mapping relationship are used directly to obtain the subject words of the input text, ensuring accurate mining of short-text subject words online.
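The online stage summarized above can be sketched as inference against a fixed, pre-trained topic type-term frequency matrix H. As before, this is an illustrative sketch under assumptions (the multiplicative-update inference rule and the name `infer_doc_topic` are hypothetical), not the patent's method:

```python
import numpy as np

def infer_doc_topic(x, H, iters=100):
    """Online use: with the pre-trained topic type-term frequency
    matrix H held fixed, estimate the target text-topic type row w
    for a new text's term-frequency vector x."""
    w = np.full(H.shape[0], 1.0 / H.shape[0])
    for _ in range(iters):
        w *= (H @ x) / (H @ H.T @ w + 1e-9)  # multiplicative update, H fixed
    return w / w.sum()
```

The resulting row w plays the role of one row of the target text-topic type matrix: its argmax gives the topic type index used to look up subject words.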
- the embodiments of the present application also provide a computer-readable storage medium on which a computer program is stored, and when the program is executed by a processor, the steps of the method in any of the foregoing embodiments are implemented.
- the computer-readable storage medium may include, but is not limited to, any type of disk (including floppy disks, optical disks, DVDs, CD-ROMs, micro drives, and magneto-optical disks), ROM, RAM, EPROM, EEPROM, DRAM, VRAM, flash memory devices, magnetic or optical cards, nanosystems (including molecular memory ICs), or any type of media or device suitable for storing instructions and/or data.
- An embodiment of the present application also provides a terminal, including a memory, a processor, and a computer program stored on the memory and capable of running on the processor, and the processor implements the steps of the method in any of the foregoing embodiments when the program is executed.
- FIG. 9 is a structural block diagram of a terminal provided in an embodiment of this application.
- the terminal 600 includes a processor 601 and a memory 602.
- the processor 601 is a control center of a computer system, and may be a processor of a physical machine or a processor of a virtual machine.
- the processor 601 may include one or more processing cores, such as a 4-core processor, an 8-core processor, and so on.
- the processor 601 may be implemented in at least one hardware form among DSP (Digital Signal Processor), FPGA (Field-Programmable Gate Array), and PLA (Programmable Logic Array).
- the processor 601 may also include a main processor and a coprocessor.
- the main processor, also called the CPU (Central Processing Unit), is a processor used to process data in the awake state; the coprocessor is a low-power processor used to process data in the standby state.
- the memory 602 may include one or more computer-readable storage media, which may be non-transitory.
- the memory 602 may also include a high-speed random access memory and a non-volatile memory, such as one or more magnetic disk storage devices and flash memory storage devices.
- the non-transitory computer-readable storage medium in the memory 602 is used to store at least one instruction, and the at least one instruction is executed by the processor 601 to implement the method in the embodiments of the present application.
- the terminal 600 further includes: a peripheral device interface 603 and at least one peripheral device.
- the processor 601, the memory 602, and the peripheral device interface 603 may be connected by a bus or a signal line.
- Each peripheral device can be connected to the peripheral device interface 603 through a bus, a signal line, or a circuit board.
- the peripheral device includes: at least one of a display screen 604, a camera 605, and an audio circuit 606.
- the peripheral device interface 603 can be used to connect at least one peripheral device related to I/O (Input/Output) to the processor 601 and the memory 602.
- the processor 601, the memory 602, and the peripheral device interface 603 may be integrated on the same chip or circuit board; in some other embodiments of the present application, any one or two of the processor 601, the memory 602, and the peripheral device interface 603 may be implemented on a separate chip or circuit board. The embodiments of the present application do not specifically limit this.
- the display screen 604 is used to display a UI (User Interface, user interface).
- the UI can include graphics, text, icons, videos, and any combination thereof.
- the display screen 604 also has the ability to collect touch signals on or above the surface of the display screen 604.
- the touch signal can be input to the processor 601 as a control signal for processing.
- the display screen 604 may also be used to provide virtual buttons and/or virtual keyboards, also called soft buttons and/or soft keyboards.
- the display screen 604 may be a flexible display screen, which is arranged on the curved surface or the folding surface of the terminal 600. Furthermore, the display screen 604 can also be set as a non-rectangular irregular figure, that is, a special-shaped screen.
- the display screen 604 may be made of materials such as LCD (Liquid Crystal Display) and OLED (Organic Light-Emitting Diode).
- the camera 605 is used to collect images or videos.
- the camera 605 includes a front camera and a rear camera.
- the front camera is set on the front panel of the terminal, and the rear camera is set on the back of the terminal.
- the camera 605 may also include a flash.
- the flash may be a single-color-temperature flash or a dual-color-temperature flash. A dual-color-temperature flash is a combination of a warm-light flash and a cold-light flash, and can be used for light compensation under different color temperatures.
- the audio circuit 606 may include a microphone and a speaker.
- the microphone is used to collect the sound waves of the user and the environment, and convert the sound waves into electrical signals and input them to the processor 601 for processing.
- the microphone can also be an array microphone or an omnidirectional collection microphone.
- the power supply 607 is used to supply power to various components in the terminal 600.
- the power source 607 may be alternating current, direct current, disposable batteries, or rechargeable batteries.
- the rechargeable battery may be a wired rechargeable battery or a wireless rechargeable battery.
- a wired rechargeable battery is a battery charged through a wired line
- a wireless rechargeable battery is a battery charged through a wireless coil.
- the rechargeable battery can also be used to support fast charging technology.
- the terminal structural block diagram shown in the embodiments of the present application does not constitute a limitation on the terminal 600.
- the terminal 600 may include more or fewer components than shown in the figure, or combine certain components, or adopt different component arrangements.
Claims (10)
- A method for determining text topic words, wherein the method comprises: preprocessing at least one input text to obtain at least one target text; constructing, according to a pre-trained word set, a first mapping relationship between the at least one target text and at least one word in the word set; determining, based on a pre-trained second mapping relationship between a topic type and at least one word in the word set, a third mapping relationship between the at least one target text and at least one topic type; and determining, according to the third mapping relationship, at least one topic type corresponding to the at least one target text, and further determining, based on the second mapping relationship, at least one topic word corresponding to the at least one target text.
- The method according to claim 1, wherein the first mapping relationship comprises a target text-term frequency matrix, and the second mapping relationship comprises a topic type-term frequency matrix.
- The method according to claim 1, wherein the third mapping relationship is a target text-topic type matrix; and the determining, according to the third mapping relationship, the topic type corresponding to the at least one target text, and further determining, based on the second mapping relationship, the at least one topic word corresponding to the at least one target text comprises: using the index corresponding to the maximum probability value in the target text-topic type matrix as the topic type index of the target text, and determining the at least one topic word based on the topic type index and the topic type-term frequency matrix.
- The method according to claim 2, wherein the training process of the topic type-term frequency matrix comprises: performing text merging processing on at least one first sample text to generate at least one second sample text, wherein the text lengths of the first sample texts are each less than a preset threshold and the text lengths of the second sample texts are each greater than or equal to the preset threshold; obtaining topic prior information based on the topic types and topic words of the at least one second sample text; and training a text topic mining model based on the at least one second sample text and the topic prior information to obtain the topic type-term frequency matrix.
- The method according to claim 4, wherein the topic prior information comprises: probability distributions of the at least one second sample text belonging to different topic types.
- The method according to claim 4, wherein the training process of the topic type-term frequency matrix further comprises: generating a word set according to the vocabulary in the at least one first sample text; constructing a target text-term frequency matrix based on statistics of the frequency of occurrence of words in the at least one first sample text; training a text topic word mining model using the at least one second sample text and the topic prior information, wherein model training is complete when the sample text-term frequency matrix generated during training is consistent with the target text-term frequency matrix; and obtaining the sample topic type-term frequency matrix generated during training.
- An apparatus for determining text topic words, wherein the apparatus comprises: a target text obtaining module configured to preprocess at least one input text to obtain at least one target text; a first mapping relationship construction module configured to construct, according to a pre-trained word set, a first mapping relationship between the at least one target text and at least one word in the word set; a third mapping relationship determining module configured to determine, based on a pre-trained second mapping relationship between a topic type and at least one word in the word set, a third mapping relationship between the at least one target text and at least one topic type; and a topic word determining module configured to determine, according to the third mapping relationship, at least one topic type corresponding to the at least one target text, and further determine, based on the second mapping relationship, at least one topic word corresponding to the at least one target text.
- The apparatus according to claim 7, wherein the first mapping relationship comprises a target text-term frequency matrix, the second mapping relationship comprises a topic type-term frequency matrix, and the third mapping relationship is a target text-topic type matrix; and the topic word determining module is specifically configured to: use the index corresponding to the maximum probability value in the target text-topic type matrix as the topic type index of the target text, and determine the at least one topic word based on the topic type index and the topic type-term frequency matrix.
- A computer storage medium, wherein the computer storage medium stores a plurality of instructions adapted to be loaded by a processor to perform the method steps according to any one of claims 1 to 6.
- A terminal, comprising a processor and a memory, wherein the memory stores a computer program adapted to be loaded by the processor to perform the method steps according to any one of claims 1 to 6.
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202010010680.0 | 2020-01-06 | ||
CN202010010680.0A CN111274798B (zh) | 2020-01-06 | 2020-01-06 | Method, device, storage medium and terminal for determining text topic words |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2021139466A1 true WO2021139466A1 (zh) | 2021-07-15 |
Family
ID=71000087
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/CN2020/134772 WO2021139466A1 (zh) | 2020-12-09 | Method, device, storage medium and terminal for determining text topic words |
Country Status (2)
Country | Link |
---|---|
CN (1) | CN111274798B (zh) |
WO (1) | WO2021139466A1 (zh) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN111274798B (zh) * | 2020-01-06 | 2023-08-18 | 北京大米科技有限公司 | Method, device, storage medium and terminal for determining text topic words |
- CN111831788A (zh) * | 2020-06-16 | 2020-10-27 | 国网江苏省电力有限公司信息通信分公司 | Method and system for constructing a power corpus labeling model |
- CN112084772A (zh) * | 2020-09-25 | 2020-12-15 | 北京明略昭辉科技有限公司 | Text quality monitoring method and apparatus, electronic device and storage medium |
- CN115983251B (zh) * | 2023-02-16 | 2023-06-09 | 江苏联著实业股份有限公司 | Text topic extraction system and method based on sentence-usage analysis |
- CN116431814B (zh) * | 2023-06-06 | 2023-09-05 | 北京中关村科金技术有限公司 | Information extraction method and apparatus, electronic device and readable storage medium |
Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103049568A (zh) * | 2012-12-31 | 2013-04-17 | 武汉传神信息技术有限公司 | Method for classifying documents in a massive document library |
CN106649422A (zh) * | 2016-06-12 | 2017-05-10 | 中国移动通信集团湖北有限公司 | Keyword extraction method and apparatus |
CN107368489A (zh) * | 2016-05-12 | 2017-11-21 | 阿里巴巴集团控股有限公司 | Information data processing method and apparatus |
US20180239741A1 (en) * | 2017-02-17 | 2018-08-23 | General Electric Company | Methods and systems for automatically identifying keywords of very large text datasets |
CN108763213A (zh) * | 2018-05-25 | 2018-11-06 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Method for extracting text keywords based on topic features |
CN111274798A (zh) * | 2020-01-06 | 2020-06-12 | 北京大米科技有限公司 | Method, device, storage medium and terminal for determining text topic words |
Family Cites Families (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101315624B (zh) * | 2007-05-29 | 2015-11-25 | 阿里巴巴集团控股有限公司 | Method and apparatus for text topic recommendation |
CN105045812B (zh) * | 2015-06-18 | 2019-01-29 | 上海高欣计算机系统有限公司 | Method and system for classifying text topics |
CN107797982B (zh) * | 2016-08-31 | 2021-05-07 | 百度在线网络技术(北京)有限公司 | Method, apparatus and device for identifying text type |
CN110162771B (zh) * | 2018-11-22 | 2023-08-29 | 腾讯科技(深圳)有限公司 | Method and apparatus for recognizing event trigger words, and electronic device |
CN110032639B (zh) * | 2018-12-27 | 2023-10-31 | 中国银联股份有限公司 | Method and apparatus for matching semantic text data with tags, and storage medium |
- 2020
- 2020-01-06 CN CN202010010680.0A patent/CN111274798B/zh active Active
- 2020-12-09 WO PCT/CN2020/134772 patent/WO2021139466A1/zh active Application Filing
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN103049568A (zh) * | 2012-12-31 | 2013-04-17 | 武汉传神信息技术有限公司 | Method for classifying documents in a massive document library |
CN107368489A (zh) * | 2016-05-12 | 2017-11-21 | 阿里巴巴集团控股有限公司 | Information data processing method and apparatus |
CN106649422A (zh) * | 2016-06-12 | 2017-05-10 | 中国移动通信集团湖北有限公司 | Keyword extraction method and apparatus |
US20180239741A1 (en) * | 2017-02-17 | 2018-08-23 | General Electric Company | Methods and systems for automatically identifying keywords of very large text datasets |
CN108763213A (zh) * | 2018-05-25 | 2018-11-06 | 西南电子技术研究所(中国电子科技集团公司第十研究所) | Method for extracting text keywords based on topic features |
CN111274798A (zh) * | 2020-01-06 | 2020-06-12 | 北京大米科技有限公司 | Method, device, storage medium and terminal for determining text topic words |
Also Published As
Publication number | Publication date |
---|---|
CN111274798B (zh) | 2023-08-18 |
CN111274798A (zh) | 2020-06-12 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
WO2021139466A1 (zh) | Method, device, storage medium and terminal for determining text topic words | |
US11947911B2 (en) | Method for training keyword extraction model, keyword extraction method, and computer device | |
CN108170749B (zh) | 基于人工智能的对话方法、装置及计算机可读介质 | |
US20200294488A1 (en) | Method, device and storage medium for speech recognition | |
US20180293507A1 (en) | Method and apparatus for extracting keywords based on artificial intelligence, device and readable medium | |
CN108595431B (zh) | 语音交互文本纠错方法、装置、终端及存储介质 | |
CN107608532B (zh) | 一种联想输入方法、装置及电子设备 | |
US11127394B2 (en) | Method and system of high accuracy keyphrase detection for low resource devices | |
WO2018165932A1 (en) | Generating responses in automated chatting | |
WO2015171646A1 (en) | Method and system for speech input | |
WO2020151690A1 (zh) | Sentence generation method, apparatus, device and storage medium | |
US11830482B2 (en) | Method and apparatus for speech interaction, and computer storage medium | |
US20220309088A1 (en) | Method and apparatus for training dialog model, computer device, and storage medium | |
CN111414736A (zh) | 故事生成模型训练方法、装置、设备及存储介质 | |
CN111368525A (zh) | 信息搜索方法、装置、设备及存储介质 | |
EP3734472A1 (en) | Method and device for text processing | |
CN108920649A (zh) | 一种信息推荐方法、装置、设备和介质 | |
EP3790002A1 (en) | System and method for modifying speech recognition result | |
CN111883117A (zh) | 语音唤醒方法及装置 | |
CN116778040B (zh) | 基于口型的人脸图像生成方法、模型的训练方法以及设备 | |
CN112036174A (zh) | 一种标点标注方法及装置 | |
CN117454954A (zh) | 模型训练方法、装置、计算机设备及存储介质 | |
CN111414737A (zh) | 故事生成模型训练方法、装置、设备及存储介质 | |
WO2023173659A1 (zh) | Face matching method and apparatus, electronic device, storage medium, computer program product and computer program | |
US20220245364A1 (en) | Electronic device and control method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 20911553 Country of ref document: EP Kind code of ref document: A1 |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 20911553 Country of ref document: EP Kind code of ref document: A1 |
|
32PN | Ep: public notification in the ep bulletin as address of the adressee cannot be established |
Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 08.02.2023) |
|