CN113850076A - Theme extraction method and device, electronic equipment and storage medium - Google Patents


Info

Publication number
CN113850076A
Authority
CN
China
Prior art keywords: text, sentence, sentences, titles, subject
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111101344.8A
Other languages
Chinese (zh)
Inventor
张记袁
黄焱晖
郑烨翰
彭卫华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Baidu Netcom Science and Technology Co Ltd
Original Assignee
Beijing Baidu Netcom Science and Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Baidu Netcom Science and Technology Co Ltd filed Critical Beijing Baidu Netcom Science and Technology Co Ltd
Priority to CN202111101344.8A priority Critical patent/CN113850076A/en
Publication of CN113850076A publication Critical patent/CN113850076A/en
Pending legal-status Critical Current

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/30: Semantic analysis

Abstract

The disclosure provides a topic extraction method and apparatus, an electronic device, and a storage medium, relating to the field of artificial intelligence and, in particular, to the technical fields of natural language processing, computer vision, and deep learning. The specific implementation scheme is as follows: determine a text to be processed and candidate topic sentences in the text; perform redundancy removal on the candidate topic sentences to obtain corresponding simplified sentences; and determine the simplified sentences as the topic of the text. Because the extracted candidate topic sentences undergo redundancy removal, the extracted topic is prevented from being overly redundant, improving both the accuracy of the extracted topic and the efficiency of topic extraction.

Description

Theme extraction method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of artificial intelligence technologies, in particular to natural language processing, computer vision, and deep learning technologies, and more particularly to a topic extraction method and apparatus, an electronic device, and a storage medium.
Background
With the rapid development of the internet, the amount of information on the internet has grown explosively. When text search or text recommendation is performed on the internet, texts whose topics contain the search word are generally provided, or texts whose topics are highly similar to those of texts the user has read are recommended. Therefore, how to accurately determine the topic of a text has become an important research direction.
Disclosure of Invention
The disclosure provides a theme extraction method, a theme extraction device, an electronic device and a storage medium.
According to an aspect of the present disclosure, there is provided a topic extraction method, including: determining a text to be processed and candidate topic sentences in the text; performing redundancy removal on the candidate topic sentences to obtain simplified sentences corresponding to the candidate topic sentences; and determining the simplified sentences as the topic of the text.
According to another aspect of the present disclosure, there is provided a training method of a topic reduction model, including: determining a corpus, where the corpus includes a preset number of sample long titles and corresponding sample short titles, and the text to which each sample long title belongs and the text to which the corresponding sample short title belongs correspond to the same event; and training an initial topic reduction model by taking the sample long titles as input and the corresponding sample short titles as output, to obtain a preset topic reduction model that performs redundancy removal on candidate topic sentences to be processed to obtain simplified sentences.
According to another aspect of the present disclosure, there is provided a topic extraction apparatus, including: a first determining module, configured to determine a text to be processed and candidate topic sentences in the text; a processing module, configured to perform redundancy removal on the candidate topic sentences to obtain simplified sentences corresponding to the candidate topic sentences; and a second determining module, configured to determine the simplified sentences as the topic of the text.
According to another aspect of the present disclosure, there is provided a training apparatus for a topic reduction model, including: a determining module, configured to determine a corpus, where the corpus includes a preset number of sample long titles and corresponding sample short titles, and the text to which each sample long title belongs and the text to which the corresponding sample short title belongs correspond to the same event; and a training module, configured to train an initial topic reduction model by taking the sample long titles as input and the corresponding sample short titles as output, to obtain a preset topic reduction model that performs redundancy removal on candidate topic sentences to be processed to obtain simplified sentences.
According to still another aspect of the present disclosure, there is provided an electronic device including:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the topic extraction method set forth in the above aspect of the disclosure, or the training method of the topic reduction model set forth in the other aspect.
According to yet another aspect of the present disclosure, there is provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to perform the topic extraction method set forth in the above aspect of the present disclosure, or the training method of the topic reduction model set forth in the other aspect.
According to yet another aspect of the present disclosure, there is provided a computer program product comprising a computer program which, when executed by a processor, performs the steps of the topic extraction method set forth in the above aspect of the present disclosure, or the steps of the training method of the topic reduction model set forth in the other aspect.
It should be understood that the statements in this section do not necessarily identify key or critical features of the embodiments of the present disclosure, nor do they limit the scope of the present disclosure. Other features of the present disclosure will become apparent from the following description.
Drawings
The drawings are included to provide a better understanding of the present solution and are not to be construed as limiting the present disclosure. Wherein:
FIG. 1 is a schematic diagram according to a first embodiment of the present disclosure;
FIG. 2 is a schematic diagram according to a second embodiment of the present disclosure;
FIG. 3 is a block diagram of topic extraction;
FIG. 4 is a schematic diagram according to a third embodiment of the present disclosure;
FIG. 5 is a schematic diagram according to a fourth embodiment of the present disclosure;
FIG. 6 is a schematic diagram according to a fifth embodiment of the present disclosure;
FIG. 7 is a block diagram of an electronic device used to implement an embodiment of the disclosure.
Detailed Description
Exemplary embodiments of the present disclosure are described below with reference to the accompanying drawings, in which various details of the embodiments of the disclosure are included to assist understanding, and which are to be considered as merely exemplary. Accordingly, those of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
In the related art, the mainstream topic extraction scheme extracts sentences from the text and uses them directly as the topic of the text. As a result, the extracted topics are overly redundant, topic extraction efficiency is poor, and extraction accuracy is low.
In view of the above problems, the present disclosure provides a theme extraction method, apparatus, electronic device, and storage medium.
Fig. 1 is a schematic diagram of a first embodiment of the present disclosure, and it should be noted that the theme extracting method according to the embodiment of the present disclosure is applicable to a theme extracting apparatus, and the apparatus may be configured in an electronic device, so that the electronic device may perform a theme extracting function.
The electronic device may be any device having computing capability, for example, a personal computer (PC), a mobile terminal, or a server. The mobile terminal may be a hardware device having an operating system, a touch screen, and/or a display screen, such as an in-vehicle device, a mobile phone, a tablet computer, a personal digital assistant, or a wearable device.
As shown in fig. 1, the theme extraction method may include the steps of:
step 101, determining a text to be processed and candidate subject sentences in the text.
In the embodiment of the present disclosure, the text to be processed may be a text that needs to be subject-extracted. The extracted subject can be used for recommending similar texts, recommending materials and the like. For example, for a text read or selected by a user, after determining the subject of the text, a text similar to the text or similar material may be recommended to the user.
In the embodiment of the present disclosure, the text may be, for example, a paper, news, and the like, and may be set according to actual needs, which is not specifically limited herein.
In the disclosed embodiments, one or more sentences may be included in the text. The candidate topic sentence may be a sentence selected from one or more sentences in the text, which may represent a main meaning of the text. For each text, the number of candidate topic sentences may be one or more, and the number of candidate topic sentences may be set according to actual needs.
Step 102, performing redundancy removal on the candidate topic sentences to obtain simplified sentences corresponding to the candidate topic sentences.
In the embodiment of the present disclosure, performing redundancy removal on a candidate topic sentence means removing redundant words or phrases from the candidate topic sentence to generate a simplified sentence with fewer words. For example, for the candidate topic sentence "Riding the Waves 2020 | At the crest of the wave, the listings of 17 property companies hit a record high", the generated simplified sentence may be "the listings of 17 property companies hit a record high".
In the embodiment of the disclosure, in order to further obtain a better simplified sentence and improve the accuracy of generating the simplified sentence, the simplified sentence may be obtained by combining with a model. Correspondingly, the topic extraction device may perform the process of step 102, for example, to input the candidate topic sentences into a preset topic reduction model to obtain reduced sentences corresponding to the candidate topic sentences. The topic reduction model can be a semantic representation model.
In the embodiment of the present disclosure, to further improve the accuracy of the simplified sentences, to ensure that each simplified sentence is consistent in content with its candidate topic sentence, and to ensure that the two correspond to the same event, the topic reduction model may be trained, for example, as follows: determine a corpus for the topic reduction model, where the corpus includes a certain number of sample long titles and corresponding sample short titles, and the text to which each sample long title belongs and the text to which the corresponding sample short title belongs correspond to the same event; then train an initial topic reduction model by taking the sample long titles as input and the corresponding sample short titles as output, to obtain the preset topic reduction model. To improve the accuracy of the trained topic reduction model, the initial topic reduction model may be a pre-trained semantic representation model.
In the embodiment of the present disclosure, a sample long title is a title whose word count is greater than or equal to a preset threshold, and a sample short title is a title whose word count is below the preset threshold.
Step 103, determining the simplified sentences as the topic of the text.
The topic extraction method of the embodiment of the disclosure determines a text to be processed and candidate topic sentences in the text, performs redundancy removal on the candidate topic sentences to obtain corresponding simplified sentences, and determines the simplified sentences as the topic of the text. Because the extracted candidate topic sentences undergo redundancy removal, the extracted topic is prevented from being overly redundant, which improves both the accuracy of the extracted topic and the efficiency of topic extraction.
In order to further improve the topic extraction efficiency, the extraction efficiency of candidate topic sentences may be improved, as shown in fig. 2, fig. 2 is a schematic diagram according to a second embodiment of the present disclosure, and in the embodiment of the present disclosure, whether each sentence in a text is a candidate topic sentence may be determined in combination with the text, so as to improve the correlation between the selected candidate topic sentence and the text. The embodiment shown in fig. 2 may include the following steps:
step 201, determining a text to be processed and at least one sentence in the text.
In the embodiment of the present disclosure, after the text to be processed is obtained, sentence division processing may be performed on the text to obtain at least one sentence in the text. The text can be divided according to the sentence end characters to obtain each sentence. Alternatively, the text may be input into a model for dividing the sentences to obtain each sentence.
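The punctuation-based sentence division described above can be sketched as follows (a minimal sketch assuming splitting on common sentence-end characters; the model-based division mentioned as the alternative is not shown):

```python
import re

def split_sentences(text: str) -> list[str]:
    # Split on common sentence-end characters (Chinese and Western),
    # keeping each terminator attached to the sentence it ends.
    parts = re.split(r"(?<=[。！？.!?])\s*", text)
    return [p for p in parts if p]

print(split_sentences("第一句。第二句！Third sentence?"))
# → ['第一句。', '第二句！', 'Third sentence?']
```

The lookbehind keeps the terminator with its sentence, which matters if the sentences are later shown to the user or fed to a model that expects full punctuation.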
Step 202, for each sentence in at least one sentence, determining whether the sentence is a candidate subject sentence in the text according to the sentence and the text.
In the embodiment of the present disclosure, in one example, to further improve the accuracy of candidate topic sentence extraction, the relevance between each sentence and the text may be determined from the sentence and the text; the sentences are then sorted in descending order of relevance, and a preset number of top-ranked sentences are selected as candidate topic sentences. Alternatively, the sentences whose relevance exceeds a preset relevance threshold are taken as candidate topic sentences.
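Both selection variants (top-k by descending relevance, or a relevance threshold) can be sketched as below. The token-overlap relevance measure is an illustrative stand-in: the disclosure does not fix how relevance is computed, and an embedding similarity could be substituted.

```python
def relevance(sentence: str, text: str) -> float:
    # Illustrative relevance: fraction of the sentence's tokens that also
    # occur in the reference text (an assumption, not the patented measure).
    tokens = set(sentence.lower().split())
    doc_tokens = set(text.lower().split())
    return len(tokens & doc_tokens) / max(len(tokens), 1)

def select_candidates(sentences, text, top_k=None, min_score=None):
    if min_score is not None:  # threshold variant
        return [s for s in sentences if relevance(s, text) > min_score]
    ranked = sorted(sentences, key=lambda s: relevance(s, text), reverse=True)
    return ranked[:top_k]      # top-k variant

picked = select_candidates(
    ["cats and dogs", "stock market hits record"],
    "the stock market hits a new record high",
    top_k=1,
)
# → ["stock market hits record"]
```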
In another example, in order to further improve the accuracy of candidate topic sentence extraction, the topic extraction device executing the step 202 may, for example, input the sentence and reference content of the text into a preset topic extraction model to determine whether the sentence is a candidate topic sentence in the text; the reference content is all the content of the text, or the reference content is the title and abstract content of the text.
The abstract content of the text can be computed from the text using a summarization algorithm, or generated by calling a summary interface provided by another system.
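Where no external summary interface is available, a minimal frequency-based extractive summarizer can stand in for the unspecified summarization algorithm (a sketch under that assumption; the disclosure does not name a particular algorithm):

```python
from collections import Counter

def summarize(sentences: list[str], n: int = 1) -> list[str]:
    # Score each sentence by its average word frequency over the whole
    # document, then keep the n highest-scoring sentences as the abstract.
    freq = Counter(w for s in sentences for w in s.lower().split())

    def score(s: str) -> float:
        words = s.lower().split()
        return sum(freq[w] for w in words) / max(len(words), 1)

    return sorted(sentences, key=score, reverse=True)[:n]
```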
In the embodiment of the present disclosure, to further improve the accuracy of candidate topic sentence extraction, the topic extraction model may be trained, for example, as follows: determine a corpus for the topic extraction model, where the corpus includes the sample reference contents of a certain number of texts and corresponding sample topic sentences; then train an initial topic extraction model by taking the sample reference contents as input and the corresponding sample topic sentences as output, to obtain the preset topic extraction model. To improve the accuracy of the trained topic extraction model, the initial topic extraction model may be a pre-trained semantic representation model.
Step 203, performing redundancy removal on the candidate topic sentences to obtain simplified sentences corresponding to the candidate topic sentences.
Step 204, the simplified sentence is determined as the subject of the text.
It should be noted that, step 203 and step 204 may be implemented by any one of the embodiments of the present disclosure, and the embodiments of the present disclosure are not limited thereto and are not described again.
The topic extraction method of the embodiment of the disclosure determines a text to be processed and at least one sentence in the text; for each of the at least one sentence, determines from the sentence and the text whether the sentence is a candidate topic sentence in the text; performs redundancy removal on the candidate topic sentences to obtain corresponding simplified sentences; and determines the simplified sentences as the topic of the text. Because the extracted candidate topic sentences undergo redundancy removal, the extracted topic is prevented from being overly redundant, which improves both the accuracy of the extracted topic and the efficiency of topic extraction.
In order to more clearly illustrate the above embodiments, the description will now be made by way of example.
For example, as shown in FIG. 3, a framework diagram of topic extraction is shown. In fig. 3, the news is illustrated as text. The method comprises the steps of firstly obtaining a news title and a news abstract corresponding to news, inputting the news title and the news abstract into a topic extraction model, and obtaining candidate topic sentences (topic sentences) in the news title and the news abstract; then, the candidate subject sentences are input into the subject simplification model, and corresponding simplified sentences (short subject sentences) are obtained. The topic reduction model can be a model based on sequence labeling, such as a semantic representation model.
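The two-stage pipeline of Fig. 3 can be sketched as follows. The two lambdas are toy stand-ins for the trained topic extraction and topic reduction models (not the models themselves), and the example headline is hypothetical:

```python
def extract_topic(title, abstract, extraction_model, reduction_model):
    # Stage 1: pick the candidate topic sentence from the reference
    # content (here: the news title and abstract, as in Fig. 3).
    candidate = extraction_model(title, abstract)
    # Stage 2: remove redundancy to obtain the short topic sentence.
    return reduction_model(candidate)

topic = extract_topic(
    "Riding the Waves 2020 | 17 property companies list at record highs",
    "Seventeen property companies went public this year.",
    extraction_model=lambda title, abstract: title,  # toy: pick the title
    reduction_model=lambda s: s.split(" | ")[-1],    # toy: drop the tagline
)
# → "17 property companies list at record highs"
```

In the disclosure, both stages are learned models; the stand-ins only make the data flow concrete.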
Fig. 4 is a schematic diagram of a third embodiment of the present disclosure, and it should be noted that the method for training the topic reduction model according to the embodiment of the present disclosure may be applied to a device for training the topic reduction model, and the device may be configured in an electronic device, so that the electronic device may perform a function of training the topic reduction model.
As shown in fig. 4, the training method of the topic reduction model may include the following steps:
step 401, determining a corpus, wherein the corpus includes: the method comprises the steps that a preset number of sample long titles and corresponding sample short titles are obtained, and texts to which the sample long titles belong and texts to which the corresponding sample short titles belong correspond to the same event.
In the embodiment of the present disclosure, the training apparatus for the topic reduction model, in executing step 401, may, for example, obtain the titles of at least two texts corresponding to the same event; take the titles among them whose word count is greater than or equal to a preset threshold as sample long titles; and take the titles whose word count is below the preset threshold as sample short titles corresponding to the sample long titles.
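The corpus-construction step can be sketched as below, using character count as the "word count" (an assumption that matches Chinese titles; the grouping of titles by event is taken as given):

```python
def build_corpus(event_titles: dict, threshold: int = 15) -> list[tuple]:
    # event_titles maps an event id to the titles of the texts reporting
    # that event. Titles whose length reaches the threshold are sample
    # long titles; shorter ones are the matching sample short titles.
    corpus = []
    for titles in event_titles.values():
        long_titles = [t for t in titles if len(t) >= threshold]
        short_titles = [t for t in titles if len(t) < threshold]
        for long_t in long_titles:
            for short_t in short_titles:
                corpus.append((long_t, short_t))  # (input, target) pair
    return corpus

pairs = build_corpus({"ipo-17": ["17 property companies list at record highs",
                                 "17 IPOs soar"]})
# → [("17 property companies list at record highs", "17 IPOs soar")]
```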
Because the titles of texts reporting the same event are largely consistent in content, determining the corpus from the titles of at least two texts corresponding to the same event ensures that each simplified sentence is consistent in content with its candidate topic sentence and that the two correspond to the same event. It also avoids collecting the corpus manually, which reduces the labor cost of corpus preparation, shortens preparation time, and allows the topic reduction model to be trained promptly.
In the embodiment of the present disclosure, events corresponding to each text may be extracted from the text through an event extraction algorithm, or extracted from the text through an event extraction model.
Step 402, training an initial topic reduction model by taking the sample long titles as input and the corresponding sample short titles as output, to obtain a preset topic reduction model that performs redundancy removal on candidate topic sentences to be processed to obtain simplified sentences.
In the embodiment of the present disclosure, performing redundancy removal on a candidate topic sentence means removing redundant words or phrases from the candidate topic sentence to generate a simplified sentence with fewer words. For example, for the candidate topic sentence "Riding the Waves 2020 | At the crest of the wave, the listings of 17 property companies hit a record high", the generated simplified sentence may be "the listings of 17 property companies hit a record high".
In the embodiment of the present disclosure, in order to improve the accuracy of the trained topic reduction model, the initial topic reduction model may be a pre-trained semantic representation model.
In the embodiment of the disclosure, during training of the topic reduction model, a sample long title is used as input, a loss function is constructed from the predicted short title output by the topic reduction model and the sample short title corresponding to the sample long title, and the coefficients of the topic reduction model are adjusted according to the value of the loss function, thereby training the model.
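The predict → loss → coefficient-update cycle described above can be illustrated with a deliberately tiny toy. The real model is a pre-trained semantic representation model; this stand-in has a single coefficient (how many leading words to keep) and a length-gap loss, and only shows the shape of the loop:

```python
class ToyReducer:
    # Toy stand-in for the topic reduction model, with one trainable
    # coefficient: how many leading words of the long title to keep.
    def __init__(self, cut: int = 8):
        self.cut = cut

    def predict(self, long_title: str) -> str:
        return " ".join(long_title.split()[:self.cut])

def train(model, corpus, epochs=10):
    # For each (long title, short title) pair: predict a short title,
    # build a loss against the sample short title, adjust the coefficient.
    history = []
    for _ in range(epochs):
        total = 0
        for long_title, short_title in corpus:
            pred_len = len(model.predict(long_title).split())
            target_len = len(short_title.split())
            loss = abs(pred_len - target_len)   # toy loss function
            total += loss
            if pred_len > target_len:           # coefficient adjustment
                model.cut -= 1
            elif pred_len < target_len:
                model.cut += 1
        history.append(total)
    return history

history = train(ToyReducer(), [("a b c d e f g h i j", "a b c")])
# history[0] == 5 and history[-1] == 0: the loss falls as training proceeds
```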
The training method of the topic reduction model according to the embodiment of the present disclosure determines a corpus that includes a preset number of sample long titles and corresponding sample short titles, where the text to which each sample long title belongs and the text to which the corresponding sample short title belongs correspond to the same event, and trains an initial topic reduction model by taking the sample long titles as input and the corresponding sample short titles as output, to obtain a preset topic reduction model that performs redundancy removal on candidate topic sentences to obtain simplified sentences. Because the extracted candidate topic sentences undergo redundancy removal, the extracted topic is prevented from being overly redundant, which improves both the accuracy of the extracted topic and the efficiency of topic extraction.
In order to implement the above embodiments, the present disclosure further provides a theme extraction apparatus.
As shown in fig. 5, fig. 5 is a schematic diagram according to a fourth embodiment of the present disclosure. The theme extraction device 500 includes: a first determination module 510, a processing module 520, and a second determination module 530.
The first determining module 510 is configured to determine a text to be processed and candidate subject sentences in the text;
a processing module 520, configured to perform redundancy removal on the candidate topic sentences to obtain simplified sentences corresponding to the candidate topic sentences;
a second determining module 530, configured to determine the simplified sentence as the subject of the text.
As a possible implementation of the embodiment of the present disclosure, the processing module 520 is specifically configured to input the candidate topic sentences into a preset topic reduction model to obtain the simplified sentences corresponding to the candidate topic sentences.
As a possible implementation manner of the embodiment of the present disclosure, the topic reduction model is a semantic representation model.
The topic extraction device of the embodiment of the disclosure determines a text to be processed and candidate topic sentences in the text, performs redundancy removal on the candidate topic sentences to obtain corresponding simplified sentences, and determines the simplified sentences as the topic of the text. Because the extracted candidate topic sentences undergo redundancy removal, the extracted topic is prevented from being overly redundant, which improves both the accuracy of the extracted topic and the efficiency of topic extraction.
In order to implement the above embodiments, the present disclosure further provides a training device for the topic reduction model.
As shown in fig. 6, fig. 6 is a schematic diagram according to a fifth embodiment of the present disclosure. The training apparatus 600 for the topic reduction model includes: a determination module 610 and a training module 620.
The determining module 610 is configured to determine a corpus, where the corpus includes a preset number of sample long titles and corresponding sample short titles, and the text to which each sample long title belongs and the text to which the corresponding sample short title belongs correspond to the same event;
the training module 620 is configured to train an initial topic reduction model by taking the sample long titles as input and the corresponding sample short titles as output, to obtain a preset topic reduction model that performs redundancy removal on candidate topic sentences to be processed to obtain simplified sentences.
As a possible implementation of the embodiment of the present disclosure, the determining module 610 is specifically configured to obtain the titles of at least two texts corresponding to the same event, take the titles among them whose word count is greater than or equal to a preset threshold as sample long titles, and take the titles whose word count is below the preset threshold as sample short titles corresponding to the sample long titles.
As a possible implementation manner of the embodiment of the present disclosure, the initial topic reduction model is a pre-trained semantic representation model.
The training device of the topic reduction model according to the embodiment of the present disclosure determines a corpus that includes a preset number of sample long titles and corresponding sample short titles, where the text to which each sample long title belongs and the text to which the corresponding sample short title belongs correspond to the same event, and trains an initial topic reduction model by taking the sample long titles as input and the corresponding sample short titles as output, to obtain a preset topic reduction model that performs redundancy removal on candidate topic sentences to obtain simplified sentences. Because the extracted candidate topic sentences undergo redundancy removal, the extracted topic is prevented from being overly redundant, which improves both the accuracy of the extracted topic and the efficiency of topic extraction.
In the technical solution of the present disclosure, the collection, storage, use, processing, transmission, provision, and disclosure of users' personal information are all performed with the users' consent, comply with relevant laws and regulations, and do not violate public order and good morals.
The present disclosure also provides an electronic device, a readable storage medium, and a computer program product according to embodiments of the present disclosure.
FIG. 7 illustrates a schematic block diagram of an example electronic device 700 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular phones, smart phones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be examples only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 7, the electronic device 700 includes a computing unit 701, which may perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM)702 or a computer program loaded from a storage unit 708 into a Random Access Memory (RAM) 703. In the RAM 703, various programs and data required for the operation of the device 700 can also be stored. The computing unit 701, the ROM 702, and the RAM 703 are connected to each other by a bus 704. An input/output (I/O) interface 705 is also connected to bus 704.
A number of components in the electronic device 700 are connected to the I/O interface 705, including: an input unit 706 such as a keyboard, a mouse, or the like; an output unit 707 such as various types of displays, speakers, and the like; a storage unit 708 such as a magnetic disk, optical disk, or the like; and a communication unit 709 such as a network card, modem, wireless communication transceiver, etc. The communication unit 709 allows the electronic device 700 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunication networks.
The computing unit 701 may be any of various general purpose and/or special purpose processing components with processing and computing capabilities. Examples of the computing unit 701 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, a Digital Signal Processor (DSP), and any suitable processor, controller, or microcontroller. The computing unit 701 performs the methods and processes described above, such as the topic extraction method or the training method of the topic reduction model. For example, in some embodiments, the topic extraction method or the training method of the topic reduction model may be implemented as a computer software program tangibly embodied in a machine-readable medium, such as the storage unit 708. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 700 via the ROM 702 and/or the communication unit 709. When the computer program is loaded into the RAM 703 and executed by the computing unit 701, one or more steps of the topic extraction method or the training method of the topic reduction model described above may be performed. Alternatively, in other embodiments, the computing unit 701 may be configured by any other suitable means (e.g., by means of firmware) to perform the topic extraction method or the training method of the topic reduction model.
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on a Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be special purpose or general purpose, and which receives data and instructions from, and transmits data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for implementing the methods of the present disclosure may be written in any combination of one or more programming languages. These program codes may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the program codes, when executed by the processor or controller, cause the functions/operations specified in the flowchart and/or block diagram to be performed. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package partly on the machine and partly on a remote machine, or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. A machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user can be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic, speech, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local Area Networks (LANs), Wide Area Networks (WANs), and the Internet.
The computer system may include clients and servers. A client and server are generally remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server combined with a blockchain.
It should be understood that the various forms of flows shown above may be used, with steps reordered, added, or deleted. For example, the steps described in the present disclosure may be executed in parallel, sequentially, or in a different order, as long as the desired results of the technical solutions disclosed in the present disclosure can be achieved; no limitation is imposed herein.
The above detailed description should not be construed as limiting the scope of the disclosure. It should be understood by those skilled in the art that various modifications, combinations, sub-combinations and substitutions may be made in accordance with design requirements and other factors. Any modification, equivalent replacement, and improvement made within the spirit and principle of the present disclosure should be included in the scope of protection of the present disclosure.

Claims (19)

1. A method of topic extraction, comprising:
determining a text to be processed and candidate subject sentences in the text;
carrying out redundancy removal processing on the candidate subject sentence to obtain a simplified sentence corresponding to the candidate subject sentence;
and determining the simplified sentence as the subject of the text.
2. The method of claim 1, wherein the performing redundancy removal processing on the candidate subject sentence to obtain a simplified sentence corresponding to the candidate subject sentence comprises:
and inputting the candidate subject sentences into a preset subject simplification model to obtain simplified sentences corresponding to the candidate subject sentences.
3. The method of claim 2, wherein the topic reduction model is a semantic representation model.
4. The method of claim 1, wherein the determining the text to be processed and the candidate subject sentences in the text comprises:
determining the text to be processed and at least one sentence in the text;
and determining whether the sentence is a candidate subject sentence in the text or not according to the sentence and the text for each sentence in the at least one sentence.
5. The method of claim 4, wherein the determining, for each of the at least one sentence, from the sentence and the text, whether the sentence is a candidate subject sentence in the text comprises:
inputting the sentence and the reference content of the text into a preset theme extraction model to determine whether the sentence is a candidate theme sentence in the text; the reference content is all the content of the text, or the reference content is the title and abstract content of the text.
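The pipeline of claims 1-5 can be sketched as follows. This is an illustrative sketch only: the claimed theme extraction model (claim 5) and subject simplification model (claim 2) are preset, presumably learned, models whose internals the publication does not specify; the word-overlap scoring and leading-clause removal below are hypothetical stand-ins for those models, not the patented implementation.

```python
def split_sentences(text):
    # Naive sentence splitter, sufficient for this sketch.
    for mark in ("!", "?"):
        text = text.replace(mark, ".")
    return [s.strip() for s in text.split(".") if s.strip()]

def is_candidate_subject(sentence, reference):
    # Stand-in for the preset theme extraction model of claim 5: treat a
    # sentence as a candidate subject sentence when it shares enough of its
    # words with the reference content (e.g. the title and abstract of the text).
    sentence_words = set(sentence.lower().replace(",", " ").split())
    reference_words = set(reference.lower().split())
    return len(sentence_words & reference_words) / max(len(sentence_words), 1) >= 0.5

def simplify(sentence):
    # Stand-in for the preset subject simplification model of claim 2:
    # crude redundancy removal that drops a leading clause before the first comma.
    head, _, tail = sentence.partition(",")
    return tail.strip() if tail.strip() else head.strip()

def extract_topics(text, reference):
    # Claims 1 and 4: find candidate subject sentences in the text,
    # remove redundancy, and return the simplified sentences as topics.
    candidates = [s for s in split_sentences(text)
                  if is_candidate_subject(s, reference)]
    return [simplify(s) for s in candidates]
```

For example, extracting from a two-sentence text with its title as reference content keeps only the sentence that overlaps the title, stripped of its leading clause:

```python
extract_topics(
    "According to reports, the new model improves extraction accuracy. "
    "The weather was pleasant.",
    "new model improves extraction accuracy")
# → ["the new model improves extraction accuracy"]
```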
6. A training method of a topic reduction model comprises the following steps:
determining a corpus, wherein the corpus comprises a preset number of sample long titles and corresponding sample short titles, and the text to which each sample long title belongs and the text to which the corresponding sample short title belongs correspond to the same event;
and training an initial subject simplification model by taking the sample long titles as input and the corresponding sample short titles as output, to obtain a preset subject simplification model for performing redundancy removal processing on candidate subject sentences to be processed to obtain simplified sentences.
7. The method of claim 6, wherein the determining the corpus comprises:
acquiring titles of at least two texts corresponding to the same event;
taking, among the at least two titles, titles whose word count is greater than or equal to a preset number threshold as sample long titles;
and taking, among the at least two titles, titles whose word count is smaller than the preset number threshold as sample short titles corresponding to the sample long titles.
8. The method of claim 6, wherein the initial topic reduction model is a pre-trained semantic representation model.
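The corpus construction of claims 6-7 can be sketched as follows. The `threshold` default and the pairing of every long title with every short title within an event are illustrative assumptions; the claims only require splitting an event's titles by word count against a preset threshold.

```python
def build_corpus(events, threshold=10):
    # events: lists of titles, one list per event, where the texts behind
    # the titles in each list all cover the same event (claim 7).
    pairs = []
    for titles in events:
        # Claim 7: word count >= threshold -> sample long title;
        # word count < threshold -> sample short title.
        long_titles = [t for t in titles if len(t.split()) >= threshold]
        short_titles = [t for t in titles if len(t.split()) < threshold]
        # Each (long, short) pair is one training example: the long title is
        # the model input and the short title is the target output (claim 6).
        pairs += [(lt, st) for lt in long_titles for st in short_titles]
    return pairs
```

With a 13-word and a 5-word title for one event, this yields a single (input, output) training pair for the subject simplification model.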
9. A theme extraction apparatus comprising:
the first determination module is used for determining a text to be processed and candidate subject sentences in the text;
the processing module is used for performing redundancy removal processing on the candidate subject sentence to obtain a simplified sentence corresponding to the candidate subject sentence;
and the second determining module is used for determining the simplified sentence as the subject of the text.
10. The apparatus of claim 9, wherein the processing module is specifically configured to,
and inputting the candidate subject sentences into a preset subject simplification model to obtain simplified sentences corresponding to the candidate subject sentences.
11. The apparatus of claim 10, wherein the topic reduction model is a semantic representation model.
12. The apparatus of claim 9, wherein the first determining module is specifically configured to,
determining the text to be processed and at least one sentence in the text;
and determining whether the sentence is a candidate subject sentence in the text or not according to the sentence and the text for each sentence in the at least one sentence.
13. The apparatus of claim 12, wherein the first determining module is specifically configured to,
inputting the sentence and the reference content of the text into a preset theme extraction model to determine whether the sentence is a candidate theme sentence in the text; the reference content is all the content of the text, or the reference content is the title and abstract content of the text.
14. A training apparatus for a topic compaction model, comprising:
a determining module, configured to determine a corpus, wherein the corpus comprises a preset number of sample long titles and corresponding sample short titles, and the text to which each sample long title belongs and the text to which the corresponding sample short title belongs correspond to the same event;
and a training module, configured to train an initial subject simplification model by taking the sample long titles as input and the corresponding sample short titles as output, to obtain a preset subject simplification model for performing redundancy removal processing on candidate subject sentences to be processed to obtain simplified sentences.
15. The apparatus of claim 14, wherein the determining module is specifically configured to,
acquiring titles of at least two texts corresponding to the same event;
taking, among the at least two titles, titles whose word count is greater than or equal to a preset number threshold as sample long titles;
and taking, among the at least two titles, titles whose word count is smaller than the preset number threshold as sample short titles corresponding to the sample long titles.
16. The apparatus of claim 14, wherein the initial topic reduction model is a pre-trained semantic representation model.
17. An electronic device, comprising:
at least one processor; and
a memory communicatively coupled to the at least one processor; wherein
the memory stores instructions executable by the at least one processor to enable the at least one processor to perform the method of any one of claims 1-5 or to perform the method of any one of claims 6-8.
18. A non-transitory computer readable storage medium having stored thereon computer instructions for causing a computer to perform the method of any one of claims 1-5 or the method of any one of claims 6-8.
19. A computer program product comprising a computer program which, when executed by a processor, carries out the steps of the method according to any one of claims 1 to 5 or carries out the steps of the method according to any one of claims 6 to 8.
CN202111101344.8A 2021-09-18 2021-09-18 Theme extraction method and device, electronic equipment and storage medium Pending CN113850076A (en)

Publications (1)

Publication Number Publication Date
CN113850076A true CN113850076A (en) 2021-12-28

Family

ID=78974652



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination