CN111831816A - Core content processing method and device, electronic equipment and readable storage medium - Google Patents

Info

Publication number
CN111831816A
Authority
CN
China
Prior art keywords
information
abstract
translation
character
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010704987.0A
Other languages
Chinese (zh)
Other versions
CN111831816B (en)
Inventor
栾博恒
熊军
谭悦
谭金龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubo Network Technology Shanghai Co ltd
Original Assignee
Hubo Network Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubo Network Technology Shanghai Co ltd
Priority to CN202010704987.0A
Publication of CN111831816A
Application granted
Publication of CN111831816B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34 Browsing; Visualisation therefor
    • G06F16/345 Summarisation for human users
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/42 Data-driven translation
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a core content processing method, a core content processing device, electronic equipment and a readable storage medium. For information to be processed, an information abstract that identifies the core content of the information is first generated according to the content of the information. A preset translation algorithm is then applied to the information abstract to obtain a corresponding translation abstract, which also identifies the core content of the information but differs from the information abstract in expression form. In other words, the method first extracts an information abstract to obtain brief information expressing the core content, and then converts it into a translation abstract that expresses the same core content in a different form. On the basis of simplifying the information, the core content can thus be displayed in different expression forms, flexibly meeting the requirements of different user groups and different platforms.

Description

Core content processing method and device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a core content processing method and apparatus, an electronic device, and a readable storage medium.
Background
With the development of information technology, people receive a great deal of information every day, much of it redundant. How to obtain the latest and most popular information and viewpoints, and so help people quickly grasp the context and trends of the times, is a problem that urgently needs to be solved. However, when news information is displayed at present, on the one hand its content is redundant; on the other hand, the displayed information is uniform in form, so that better-adapted information cannot be provided in a targeted manner for different user groups and different platforms, and it is difficult to flexibly meet their requirements for information content in different expression forms.
Disclosure of Invention
The application aims to provide a core content processing method, a core content processing device, an electronic device and a computer readable storage medium, which can perform text presentation in a targeted content expression mode to meet diversified requirements of users.
The embodiment of the application can be realized as follows:
in a first aspect, an embodiment of the present application provides a core content processing method, where the method includes:
acquiring information to be processed, and generating an information abstract corresponding to the information and used for identifying core content of the information according to content information of the information;
and processing the information abstract by adopting a preset translation algorithm to obtain a translation abstract corresponding to the information abstract, wherein the translation abstract is used for identifying the core content of the information and has a different expression form from the information abstract.
In an optional implementation manner, the step of processing the information abstract by using a preset translation algorithm to obtain a translation abstract corresponding to the information abstract includes:
vectorizing the information abstract to obtain an abstract vector corresponding to the information abstract;
decoding the abstract vector according to the abstract vector and a preset tag to obtain translation characters corresponding to the characters contained in the information abstract;
and obtaining a translation abstract corresponding to the information abstract according to all the obtained translation characters.
In an optional implementation manner, the step of decoding the abstract vector according to the abstract vector and a preset tag to obtain translation characters corresponding to the characters contained in the information abstract includes:
for the first character in the information abstract, decoding the abstract vector according to the abstract vector and a preset start tag to obtain a translation character corresponding to the first character;
for characters other than the first character and the last character in the information abstract, decoding the abstract vector according to the abstract vector and the translation character obtained for the previous character to obtain translation characters corresponding to those characters;
and for the last character in the information abstract, decoding the abstract vector according to the abstract vector and a preset end tag to obtain a translation character corresponding to the last character.
In an optional implementation manner, the step of decoding the abstract vector according to the abstract vector and a preset start tag to obtain a translation character corresponding to the first character in the information abstract includes:
vectorizing the preset start tag;
decoding the abstract vector according to the abstract vector and the vectorized preset start tag to obtain a plurality of translation characters converted from the first vector in the abstract vector, together with the translation score carried by each translation character;
and taking the translation character with the highest translation score among the plurality of translation characters as the translation character corresponding to the first character in the information abstract.
In an optional implementation manner, there are a plurality of pieces of information to be processed, and before the step of generating, according to the content information of the information, an information abstract corresponding to the information and used for identifying core content of the information, the method further includes:
calculating the similarity of any two pieces of information in the plurality of pieces of information to be processed;
clustering the information according to the similarity calculation result;
and filtering the information according to the clustering result.
In an optional implementation manner, before the step of processing the information summary by using a preset translation algorithm, the method further includes:
acquiring a required target translation type, wherein the target translation type is used for indicating the expression form type of the translated information abstract;
and acquiring a preset translation algorithm corresponding to the target translation type.
In an optional implementation manner, the step of generating, according to the content information of the information, an information summary corresponding to the information and used for identifying the core content of the information includes:
obtaining sentence vectors of sentences contained in the information and position vectors corresponding to positions of the sentences in the information;
judging whether the sentence is a sentence for marking the core content of the information or not by utilizing a discrimination model obtained by pre-training according to the sentence vector and the position vector;
and obtaining the information abstract of the information according to all the sentences used for identifying the core content of the information.
In a second aspect, an embodiment of the present application provides a core content processing apparatus, including:
the generating module is used for acquiring information to be processed and generating an information abstract which corresponds to the information and is used for identifying the core content of the information according to the content information of the information;
and the processing module is used for processing the information abstract by adopting a preset translation algorithm to obtain a translation abstract corresponding to the information abstract, and the translation abstract is used for marking the core content of the information and has a different expression form from the information abstract.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating via the bus when the electronic device is operating, the processor executing the machine-readable instructions to perform the steps of any of the methods described above.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having a computer program stored thereon, where the computer program is executed by a processor to perform the steps of any of the methods described above.
The beneficial effects of the embodiment of the application include, for example:
According to the core content processing method, the core content processing device, the electronic equipment and the readable storage medium, for information to be processed, an information abstract that corresponds to the information and identifies its core content is first generated according to the content of the information. The information abstract is then processed with a preset translation algorithm to obtain a corresponding translation abstract, which also identifies the core content of the information but differs from the information abstract in expression form. In other words, the method first extracts an information abstract to obtain brief information expressing the core content, and then converts it into a translation abstract that expresses the same core content in a different form. On the basis of simplifying the information, the core content can thus be displayed in different expression forms, flexibly meeting the requirements of different user groups and different platforms.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered as limiting the scope, and for those skilled in the art, other related drawings can be obtained from the drawings without inventive effort.
Fig. 1 is a block diagram of an electronic device according to an embodiment of the present disclosure;
Fig. 2 is a flowchart of a core content processing method according to an embodiment of the present application;
Fig. 3 is a flowchart of a clustering method in the core content processing method according to the embodiment of the present application;
Fig. 4 is a flowchart of the sub-steps of step S210 in Fig. 2;
Fig. 5 is a flowchart of a translation algorithm obtaining method in the core content processing method according to the embodiment of the present application;
Fig. 6 is a flowchart of the sub-steps of step S220 in Fig. 2;
Fig. 7 is a process diagram of a translation process provided by an embodiment of the present application;
Fig. 8 is a functional block diagram of a core content processing apparatus according to an embodiment of the present application.
Reference numerals: 110-processor; 120-memory; 130-communication module; 800-core content processing apparatus; 810-generation module; 820-processing module.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, presented in the accompanying drawings, is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present application.
It should be noted that: like reference numbers and letters refer to like items in the following figures, and thus, once an item is defined in one figure, it need not be further defined and explained in subsequent figures.
It should be noted that the features of the embodiments of the present application may be combined with each other without conflict.
Referring to fig. 1, a block diagram of an electronic device provided in the embodiment of the present application is shown, where the electronic device may include, but is not limited to, a computer, a server, and other devices. The electronic device may include a memory 120, a processor 110, and a communication module 130. The memory 120, the processor 110 and the communication module 130 are electrically connected to each other directly or indirectly to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines.
The memory 120 is used for storing programs or data. The memory 120 may be, but is not limited to, a Random Access Memory (RAM), a Read-Only Memory (ROM), a Programmable Read-Only Memory (PROM), an Erasable Programmable Read-Only Memory (EPROM), an Electrically Erasable Programmable Read-Only Memory (EEPROM), and the like.
The processor 110 is used to read/write data or programs stored in the memory 120 and execute the core content processing method provided by any embodiment of the present application.
The communication module 130 is used for establishing a communication connection between the electronic device and another communication terminal through a network, and for transceiving data through the network.
It should be understood that the configuration shown in fig. 1 is merely a schematic configuration diagram of an electronic device, which may also include more or fewer components than shown in fig. 1, or have a different configuration than shown in fig. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
Referring to fig. 2, fig. 2 is a flowchart illustrating a core content processing method according to an embodiment of the present disclosure, where the core content processing method can be executed by the electronic device shown in fig. 1. It should be understood that, in other embodiments, the order of some steps in the core content processing method of this embodiment may be interchanged according to actual needs, or some steps may be omitted or deleted. The detailed steps of the core content processing method are described below.
Step S210, obtaining information to be processed, and generating an information summary corresponding to the information and used for identifying core content of the information according to content information of the information.
Step S220, processing the information abstract by adopting a preset translation algorithm to obtain a translation abstract corresponding to the information abstract, wherein the translation abstract is used for marking the core content of the information and has a different expression form from the information abstract.
When users obtain network information in daily life, or when an information platform displays information, they care not only about the most important content of the information but also have different requirements for its expression form. For example, when an information platform displays a piece of information, it can convert the expression form so that it differs from the form used by other platforms displaying the same information; otherwise, users get the same reading experience everywhere, which reduces the reading popularity of the platform's information. As another example, when an information platform displays information to different types of users, it can convert the expression form according to the user type: when the user is a teacher, the information can be converted into a more standard form, and when the user is a student, into a more easily understood form.
Based on the above considerations, the application provides a core content processing scheme. The acquired information to be processed may be, without limitation, blog content, web page content, news and the like. To avoid content redundancy and let the user quickly obtain the core of the information, an information abstract that identifies the core content of the information is first generated according to the content of the information.
In this embodiment, the information abstract of the information is extracted first, so that on one hand, core content can be extracted to avoid redundancy of the information, and on the other hand, the subsequent translation processing is performed based on the extracted information abstract, so that the workload of the translation processing can be simplified.
On this basis, the obtained information abstract can be processed with a preset translation algorithm. The resulting translation abstract differs from the information abstract in expression form but has the same substantive content, and can likewise be used to identify the core content of the original information.
In this embodiment, a translation model may be obtained in advance by training on samples. The training may use a parallel corpus, that is, source-target pairs of sentences with the same meaning but different expressions: the model generates one expression from the other, and the difference between the generated text and the target is minimized. Once training has converged, a translation model meeting the requirements is obtained.
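As a rough illustration of the training objective described above, the following sketch shows a parallel corpus as source-target pairs and one common way to measure the "difference between the generated text and the target". The source does not name a loss function; the use of average negative log-likelihood (cross-entropy), the sample sentence pairs, and the function name `sequence_loss` are all assumptions made for illustration.

```python
import math

# Hypothetical parallel corpus: (source, target) pairs that share a meaning
# but differ in expression form.
parallel_corpus = [
    ("prices went up a lot", "prices rose sharply"),
    ("the firm let staff go", "the company laid off employees"),
]


def sequence_loss(predicted_probs: list[dict[str, float]], target_tokens: list[str]) -> float:
    """Average negative log-likelihood of the target tokens (assumed loss)."""
    eps = 1e-12  # avoids log(0) when the model gives a token zero probability
    total = 0.0
    for probs, token in zip(predicted_probs, target_tokens):
        total += -math.log(probs.get(token, 0.0) + eps)
    return total / len(target_tokens)
```

A lower value means the per-step distributions predicted by the model put more probability on the target expression; training would adjust the model parameters to reduce this value over the whole corpus.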
On the basis, in the step of processing the information abstract by adopting the preset translation algorithm, the information abstract can be processed by using the corresponding translation model obtained by pre-training, so that the conversion (ideographic conversion) from the information abstract to the translation abstract is realized.
In this embodiment, the information is extracted with the core content and then the expression form is converted to obtain the abstract which has different expression forms and is also used for identifying the core content, so that the core content can be displayed in different expression forms according to the requirements of different user groups and different platforms on the basis of simplifying the information.
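The two-stage flow of steps S210 and S220 can be sketched as a small function that takes the summarizer and the translator as parameters. This is only a structural sketch: `process_core_content` and the two callables are hypothetical names, and the toy callables in the usage note stand in for the trained models described in the source.

```python
from typing import Callable


def process_core_content(
    information: str,
    summarize: Callable[[str], str],
    translate: Callable[[str], str],
) -> tuple[str, str]:
    """Run step S210 (generate the abstract), then step S220 (translate it)."""
    abstract = summarize(information)
    translated = translate(abstract)
    return abstract, translated
```

For example, calling it with `summarize=lambda t: t.split(".")[0] + "."` and `translate=lambda s: s.replace("rose", "went up")` on the text "Rates rose. Details follow." would yield the abstract "Rates rose." and the translation abstract "Rates went up.". Because translation runs on the abstract rather than the full text, the translator only has to handle the shortened content.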
In this embodiment, on the basis of the above, in order to further avoid the problem of information redundancy, when the information to be processed includes a plurality of pieces of information, the information to be processed may be first clustered and filtered to filter out some information with too high similarity, so as to avoid the problem that the reading experience of the user is affected by the redundancy of the information display.
Therefore, referring to fig. 3, the core content processing method provided in this embodiment further includes the following steps:
Step S310, calculating the similarity between any two pieces of information in the plurality of pieces of information to be processed.
Step S320, clustering the information according to the similarity calculation result.
Step S330, filtering the information according to the clustering result.
In this embodiment, the similarity between any two pieces of information in the plurality of pieces of information to be processed is calculated. If the similarity between two pieces of information is higher than a preset threshold, the two pieces are assigned to the same cluster set. After every pair of pieces has been processed in this way, a plurality of cluster sets is obtained. Because the pieces of information within one cluster set are highly similar, a cluster set that contains two or more pieces can be filtered, for example by deleting pieces until one remains. Alternatively, a cluster set that contains more than two pieces can be filtered so that two pieces are retained. The specific number retained is not limited in this embodiment, as long as the degree of redundancy is reduced.
If the cluster set only contains one piece of information, the cluster set is not filtered, and the piece of information is reserved.
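Steps S310 to S330 can be sketched as follows. The source does not specify the similarity measure or how pairwise decisions are merged into cluster sets, so this sketch assumes Jaccard overlap of word sets and merges pairs transitively with a union-find structure; the function names, the default threshold of 0.5 and the `keep` parameter are hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Assumed similarity measure: Jaccard overlap of lowercase word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cluster_by_threshold(items: list[str], threshold: float) -> list[list[int]]:
    """Step S320: put any pair whose similarity exceeds the threshold in one cluster."""
    parent = list(range(len(items)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path halving
            i = parent[i]
        return i

    # Step S310: compare every pair of items.
    for i, j in combinations(range(len(items)), 2):
        if jaccard(items[i], items[j]) > threshold:
            parent[find(j)] = find(i)

    clusters: dict[int, list[int]] = {}
    for i in range(len(items)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


def filter_clusters(items: list[str], threshold: float = 0.5, keep: int = 1) -> list[str]:
    """Step S330: keep at most `keep` items per cluster, preserving input order."""
    kept = sorted(i for c in cluster_by_threshold(items, threshold) for i in c[:keep])
    return [items[i] for i in kept]
```

With `keep=1` each cluster set is reduced to its first member, and a cluster set holding a single piece of information is left unchanged, matching the behaviour described above. Comparing every pair is quadratic in the number of pieces, which is acceptable for a sketch but may need indexing for large batches.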
After the filtering process is performed, the information summary of the information is extracted according to the step S210 for each piece of information obtained after the filtering process is performed.
In this embodiment, in the step S210, in order to more accurately extract the information summary that can represent the core content of the information, the position information of the sentence in the information is taken into consideration in addition to the characteristics of each sentence in the information, such as the frequency of occurrence. Because the position information of the sentence in the information can also reflect the importance of the sentence on the information content expression to a certain extent, for example, the first few segments or the last few segments of the information are generally important on the information expression. Therefore, referring to fig. 4, in the present embodiment, the step S210 can be implemented by:
Step S211, obtaining sentence vectors of each sentence contained in the information and position vectors corresponding to the position of the sentence in the information.
Step S212, according to the sentence vector and the position vector, judging whether the sentence is a sentence for identifying the core content of the information by using a discriminant model obtained by pre-training.
Step S213, obtaining the information abstract of the information according to all the sentences for identifying the core content of the information.
In this embodiment, the information can first be split into sentences according to sentence-dividing marks such as periods, commas and semicolons. Based on these marks, the information is divided into a plurality of sentences.
For each sentence, the sentence vector of the sentence itself can be obtained, and the position of the sentence in the information, such as the first sentence in the information, can also be obtained. A position vector of the sentence is obtained based on the obtained position information.
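The sentence splitting and position encoding just described might look like the sketch below. The source does not define the position vector, so the three-element encoding used here (relative offset plus first-sentence and last-sentence flags) is purely an assumption, as are the function names; the delimiter set simply follows the marks listed above, with their full-width CJK equivalents added.

```python
import re


def split_sentences(text: str) -> list[str]:
    """Split on sentence-dividing marks (periods, commas, semicolons, and CJK forms)."""
    parts = re.split(r"[.!?;,。！？；，]+", text)
    return [p.strip() for p in parts if p.strip()]


def position_vector(index: int, total: int) -> list[float]:
    """Hypothetical position encoding: relative offset plus first/last flags."""
    rel = index / (total - 1) if total > 1 else 0.0
    return [rel, 1.0 if index == 0 else 0.0, 1.0 if index == total - 1 else 0.0]
```

Note that splitting on every period also breaks decimals such as "3.5", so a production splitter would need more careful rules.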
The sentence vector and the position vector are then combined and processed by the pre-trained discriminant model. The discriminant model is trained in advance on a plurality of information samples, using the sentence vector and position vector of each sentence in the samples together with a label indicating whether that sentence is core content.
In application, the input of the discriminant model is the sentence vector and position vector of each sentence, and the output is a score reflecting how important the sentence is to the overall expression of the information: the higher the score, the more important the sentence, and the lower the score, the less important.
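To make the input and output of the discriminant model concrete, here is a minimal stand-in that concatenates the two vectors and applies a logistic function. The source does not disclose the model architecture; logistic regression, the function name `score_sentence`, and the `weights` and `bias` parameters are assumptions chosen only because they produce a single score per sentence.

```python
import math


def score_sentence(sentence_vec, position_vec, weights, bias=0.0):
    """Return an importance score in (0, 1) for one sentence (assumed model)."""
    features = list(sentence_vec) + list(position_vec)  # combine both vectors
    if len(features) != len(weights):
        raise ValueError("weights must match the combined feature length")
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

In a trained model the weights would be learned from the labelled information samples; here they are supplied by the caller.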
When determining whether the sentence is a sentence indicating the core content of the information based on the obtained score, the determination may be performed by setting a predetermined threshold, for example, when the score is higher than the predetermined threshold, the sentence may be determined as a sentence identifying the core content of the information, otherwise, the sentence may be determined as a sentence not identifying the core content of the information.
In another implementation, the sentence in which the core content of the information is identified may be determined by comparing scores of all sentences included in the information. For example, the sentences may be sorted in descending order of score, and the top predetermined number of sentences may be used as sentences for identifying the core content of the information.
After the sentences used for identifying the core content of the information are determined, the obtained sentences can be combined according to the positions of the sentences in the information to obtain the information abstract corresponding to the information.
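Both selection strategies described above (a preset threshold, or the top predetermined number of sentences by score) and the final recombination in original order can be sketched together. The function name and parameters are hypothetical, and joining sentences with a space is an assumption about how the abstract is assembled.

```python
def select_core_sentences(sentences, scores, threshold=None, top_k=None):
    """Build the abstract from core sentences, kept in their original order."""
    if len(sentences) != len(scores):
        raise ValueError("one score per sentence is required")
    indices = range(len(sentences))
    if threshold is not None:
        # Strategy 1: keep every sentence scoring above the preset threshold.
        chosen = [i for i in indices if scores[i] > threshold]
    elif top_k is not None:
        # Strategy 2: rank by score, take the top k, then restore document order.
        ranked = sorted(indices, key=lambda i: scores[i], reverse=True)
        chosen = sorted(ranked[:top_k])
    else:
        raise ValueError("provide either threshold or top_k")
    return " ".join(sentences[i] for i in chosen)
```

The threshold variant produces abstracts of varying length, while the top-k variant fixes the length regardless of how the scores are distributed.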
In this embodiment, once the information abstract has been obtained, the translation processing in step S220 is performed according to different requirements, and the preset translation algorithm used corresponds to those requirements. Referring to fig. 5, the preset translation algorithm may be determined as follows:
Step S410, acquiring a required target translation type, wherein the target translation type is used for indicating the expression form type of the translated information abstract.
Step S420, acquiring a preset translation algorithm corresponding to the target translation type.
When information is translated, the translation can be performed according to the required translation need, and the target translation type can be specified by the user. For example, when Chinese is to be translated into English, the target translation type is English, and when a colloquial expression form is to be translated into Chinese, the target translation type is Chinese. In implementation, multiple different translation types can be supported. For the corpus of each translation type, a large amount of information in different expression forms can be collected in advance and organized into sentence pairs, which are then used to train the corresponding translation model. Performing translation with different translation algorithms thus means performing it with the translation models trained on the different corpora.
In this embodiment, a corresponding preset translation algorithm may be determined according to the obtained target translation type, and the determined preset translation algorithm may correspond to the corpus corresponding to the target translation type, and further correspond to the translation model, that is, when the preset translation algorithm is used to translate the information abstract, the translation process may be performed based on the corresponding corpus and the corresponding translation model.
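Steps S410 and S420 amount to a lookup from target translation type to the model trained on the matching corpus. A minimal sketch of such a lookup is shown below; the registry, the function names, the type keys "formal" and "plain", and the string-replacing placeholder translators are all hypothetical and stand in for trained translation models.

```python
from typing import Callable

Translator = Callable[[str], str]

# Hypothetical registry: each target translation type maps to its own model.
_TRANSLATORS: dict[str, Translator] = {}


def register_translator(target_type: str, translator: Translator) -> None:
    """Associate a target translation type with its translation model."""
    _TRANSLATORS[target_type] = translator


def get_translator(target_type: str) -> Translator:
    """Step S420: fetch the preset translation algorithm for the target type."""
    try:
        return _TRANSLATORS[target_type]
    except KeyError:
        raise ValueError(f"no translation model for target type {target_type!r}") from None


# Placeholder "models"; real ones would be trained on the matching corpus.
register_translator("formal", lambda s: s.replace("can't", "cannot"))
register_translator("plain", lambda s: s.replace("utilize", "use"))
```

Keeping one model per target type means a new expression form can be supported by training and registering one more model, without changing the summarization stage.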
In this embodiment, after the preset translation algorithm is determined, step S220 may be performed through the following steps (see Fig. 6):
Step S221, vectorizing the information abstract to obtain an abstract vector corresponding to the information abstract.
Step S222, decoding the abstract vector according to the abstract vector and a preset tag to obtain a translation character corresponding to each character included in the information abstract.
Step S223, obtaining a translation abstract corresponding to the information abstract according to all the obtained translation characters.
In this embodiment, the information abstract as a whole is first vectorized to obtain the corresponding abstract vector. During translation, preset tags are used to trigger and to end the translation process. For example, the preset tags may include a preset start tag and a preset end tag: the preset start tag triggers translation when the first character of the information abstract is translated, and the preset end tag triggers the end of translation when the last character of the information abstract is translated. The preset start tag may be, for example, "\t", and the preset end tag may be, for example, "\n".
In implementation, for the first character in the information abstract, the abstract vector may be decoded according to the abstract vector and the preset start tag to obtain the translation character corresponding to the first character.
For each character other than the first character and the last character in the information abstract, the abstract vector may be decoded according to the abstract vector and the translation character obtained for the previous character to obtain the translation character corresponding to that character.
For the last character in the information abstract, the abstract vector may be decoded according to the abstract vector and the preset end tag to obtain the translation character corresponding to the last character.
In this embodiment, when decoding is performed by combining the abstract vector with a preset tag, the preset tag must first be converted into vector form, and decoding then uses the abstract vector together with the vectorized preset tag.
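One common way to convert characters and tags into vector form is one-hot encoding. The sketch below assumes that encoding purely for illustration, since the text does not say which vectorization is used. The helper names `build_vocab` and `one_hot` are hypothetical; the tag values "\t" and "\n" are the examples given in the text.

```python
from typing import Dict, List

START_TAG = "\t"  # example preset start tag from the text
END_TAG = "\n"    # example preset end tag from the text


def build_vocab(chars: str) -> Dict[str, int]:
    """Map the two preset tags and every distinct character to an index."""
    symbols = [START_TAG, END_TAG] + sorted(set(chars) - {START_TAG, END_TAG})
    return {symbol: index for index, symbol in enumerate(symbols)}


def one_hot(symbol: str, vocab: Dict[str, int]) -> List[float]:
    """Vectorize one character or preset tag as a one-hot vector (an assumed encoding)."""
    vector = [0.0] * len(vocab)
    vector[vocab[symbol]] = 1.0
    return vector


vocab = build_vocab("Go")
start_vector = one_hot(START_TAG, vocab)  # vectorized preset start tag
```

With this vocabulary the start tag occupies index 0 and the end tag index 1, so both tags can be fed to a decoder in the same vector form as ordinary characters.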
In this embodiment, for example, to obtain the translation character corresponding to the first character of the information abstract from the preset start tag and the abstract vector, the preset start tag may first be vectorized. The abstract vector is then decoded according to the abstract vector and the vectorized preset start tag, which yields a plurality of candidate translation characters converted from the first vector in the abstract vector, each carrying a translation score. The candidate with the highest translation score is taken as the translation character corresponding to the first character in the information abstract.
It should be understood that the same word may have several different expression forms, so decoding the abstract vector may yield several candidate translation results. Under different translation algorithms, the scores assigned to these candidates differ, and the translation character carrying the highest translation score is the target character under the corresponding translation algorithm.
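Selecting the candidate with the highest translation score is an argmax over the scored candidates. The sketch below illustrates how the same source character could resolve to different target characters under two algorithms. The score tables are invented purely for illustration, and `pick_best` is a hypothetical helper name.

```python
from typing import Dict


def pick_best(candidates: Dict[str, float]) -> str:
    """Return the candidate translation character with the highest translation score."""
    if not candidates:
        raise ValueError("no candidate translation characters")
    return max(candidates, key=candidates.get)


# Invented scores for the same source character under two different algorithms.
scores_algorithm_a = {"V": 0.7, "W": 0.2, "U": 0.1}
scores_algorithm_b = {"V": 0.2, "W": 0.6, "U": 0.2}

best_a = pick_best(scores_algorithm_a)
best_b = pick_best(scores_algorithm_b)
```

Here the first table selects "V" and the second selects "W", which mirrors the statement that the target character depends on the translation algorithm in use.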
Similarly, when processing a character other than the first character and the last character of the information abstract, such as the second or third character, the translation character corresponding to the previous character may first be vectorized. The abstract vector is then decoded according to the abstract vector and the vectorized previous translation character, which yields a plurality of candidate translation characters converted from the vector corresponding to the current character in the abstract vector, each carrying a translation score. The candidate with the highest translation score is taken as the translation character corresponding to the current character.
Similarly, when the last character in the information abstract is processed, the preset end tag may be vectorized, and the abstract vector is decoded according to the abstract vector and the vectorized preset end tag, which yields a plurality of candidate translation characters converted from the last vector in the abstract vector, each carrying a translation score. The candidate with the highest translation score is taken as the translation character corresponding to the last character.
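The three cases above (first character, intermediate characters, last character) can be sketched as a greedy decoding loop. This is a minimal sketch rather than the embodiment's implementation: `translate_abstract`, `toy_vectorize`, and `toy_decode_step` are hypothetical names, and the toy step function is a stand-in for a trained translation model. For brevity the conditioning symbol is passed as a character, and vectorizing it is left to the step function.

```python
from typing import Callable, Dict, List

START_TAG = "\t"  # example preset start tag from the text
END_TAG = "\n"    # example preset end tag from the text

Vectorize = Callable[[str], List[float]]
# decode_step(abstract_vector, position, conditioning_symbol) -> {candidate: score}
DecodeStep = Callable[[List[float], int, str], Dict[str, float]]


def translate_abstract(abstract: str, vectorize: Vectorize, decode_step: DecodeStep) -> str:
    """Greedy decoding that follows the flow described in the text."""
    if not abstract:
        return ""
    abstract_vector = vectorize(abstract)
    translated: List[str] = []
    last = len(abstract) - 1
    for position in range(len(abstract)):
        if position == 0:
            condition = START_TAG        # first character: preset start tag
        elif position == last:
            condition = END_TAG          # last character: preset end tag
        else:
            condition = translated[-1]   # otherwise: previous translation character
        scores = decode_step(abstract_vector, position, condition)
        # Keep the candidate that carries the highest translation score.
        translated.append(max(scores, key=scores.get))
    return "".join(translated)


def toy_vectorize(abstract: str) -> List[float]:
    # Toy vectorization: one code point per character.
    return [float(ord(ch)) for ch in abstract]


def toy_decode_step(abstract_vector: List[float], position: int, condition: str) -> Dict[str, float]:
    # Toy model: ignores the conditioning symbol and prefers the upper-case form.
    source = chr(int(abstract_vector[position]))
    return {source.upper(): 0.9, source.lower(): 0.1}


result = translate_abstract("Good", toy_vectorize, toy_decode_step)
```

A real model would use the conditioning symbol to shape the scores; the toy step ignores it so that the control flow stays easy to follow.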
The translation process is described below with reference to the schematic diagram in Fig. 7.
For the information to be processed, the corresponding information abstract is obtained. For example, the first character of the information abstract is "G", the second character is "o", and the last character is "" (see the last box on the left side of Fig. 7). First, the information abstract is vectorized to obtain a vector form corresponding to each character (see the second box on the left side of Fig. 7). Combining the obtained vectors gives the abstract vector of the information abstract (see the middle vertical box in Fig. 7).
The preset start tag may be "\t". When the first character of the information abstract is processed, the preset start tag "\t" is first vectorized to obtain the vectorized preset start tag (see the lower box on the right side of Fig. 7). The abstract vector is then decoded according to the abstract vector and the vectorized preset start tag, which yields a plurality of candidate translation characters converted from the first vector in the abstract vector, each carrying a translation score. After comparison, the translation character "V", which has the highest translation score, is determined to be the translation character corresponding to the first character "G" in the information abstract.
For the second character in the information abstract, the translation character "V" corresponding to the first character is first vectorized. The abstract vector is then decoded according to the abstract vector and the vectorized translation character, which yields a plurality of candidate translation characters converted from the second vector in the abstract vector. The translation character "a", which carries the highest translation score, is determined to be the translation character corresponding to the second character in the information abstract.
The third, fourth, and subsequent characters of the information abstract (all characters other than the first and the last) are processed in the same way as the second character.
For the last character in the information abstract, the preset end tag "\n" is vectorized, and the abstract vector is decoded according to the vectorized preset end tag and the abstract vector to obtain the translation character corresponding to the last character.
In this embodiment, the translation abstract is obtained by combining the translation characters obtained as described above. The translation abstract has a different expression form from the previously obtained information abstract but still identifies the core content of the information. This meets the needs of platforms that display information in different expression forms and of user groups that require different expression forms of the information.
Referring to fig. 8, which is a functional block diagram of a core content processing apparatus 800 according to another embodiment of the present application, the core content processing apparatus 800 includes a generating module 810 and a processing module 820.
The generating module 810 is configured to obtain information to be processed, and generate an information summary corresponding to the information and used for identifying core content of the information according to content information of the information.
It is understood that the generating module 810 can be used to execute the step S210, and for the detailed implementation of the generating module 810, reference can be made to the contents related to the step S210.
The processing module 820 is configured to process the information summary by using a preset translation algorithm to obtain a translation summary corresponding to the information summary, where the translation summary is used to identify core content of the information and has a different expression form from the information summary.
It is understood that the processing module 820 can be used to execute the step S220, and for the detailed implementation of the processing module 820, reference can be made to the above-mentioned contents related to the step S220.
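The division of work between the two modules can be sketched as a small composition. The class name `CoreContentProcessor` and the two toy callables are hypothetical and chosen only for illustration; they mirror the roles of the generating module 810 and the processing module 820 without reproducing their actual logic.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class CoreContentProcessor:
    """Hypothetical wrapper mirroring the roles of modules 810 and 820."""

    generate: Callable[[str], str]  # role of the generating module 810 (step S210)
    process: Callable[[str], str]   # role of the processing module 820 (step S220)

    def run(self, information: str) -> str:
        summary = self.generate(information)  # information summary
        return self.process(summary)          # translation summary


def toy_generate(information: str) -> str:
    # Toy summary: keep only the first sentence.
    return information.split(".")[0].strip() + "."


def toy_process(summary: str) -> str:
    # Toy "translation": change the expression form to upper case.
    return summary.upper()


processor = CoreContentProcessor(generate=toy_generate, process=toy_process)
output = processor.run("Rates were cut today. Markets reacted calmly.")
```

Keeping the two roles as separate callables reflects the statement that the modules may exist separately or be integrated into one part.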
In one possible implementation, the processing module 820 may be configured to obtain a translation summary corresponding to the information summary by:
vectorizing the information abstract to obtain an abstract vector corresponding to the information abstract;
decoding the abstract vector according to the abstract vector and a preset tag to obtain translation characters corresponding to all characters contained in the information abstract;
and obtaining a translation abstract corresponding to the information abstract according to all the obtained translation characters.
In one possible implementation, the processing module 820 may be configured to obtain translation characters corresponding to each character included in the information summary by:
for the first character in the information abstract, decoding the abstract vector according to the abstract vector and a preset start tag to obtain a translation character corresponding to the first character in the information abstract;
for each character in the information abstract other than the first character and the last character, decoding the abstract vector according to the abstract vector and the translation character obtained for the previous character to obtain the translation character corresponding to that character;
and for the last character in the information abstract, decoding the abstract vector according to the abstract vector and a preset end tag to obtain a translation character corresponding to the last character.
In one possible implementation, the processing module 820 may be configured to obtain a translation character corresponding to a first character in the information summary by:
vectorizing the preset start tag;
decoding the abstract vector according to the abstract vector and the vectorized preset start tag to obtain a plurality of translation characters converted from the first vector in the abstract vector and the translation score carried by each translation character;
and taking the translation character with the highest translation score among the plurality of translation characters as the translation character corresponding to the first character in the information abstract.
In a possible implementation manner, the information to be processed includes a plurality of pieces, and the core content processing apparatus 800 further includes a clustering module, where the clustering module is configured to:
calculating the similarity of any two pieces of information in the plurality of pieces of information to be processed;
clustering the information according to the similarity calculation result;
and filtering the information according to the clustering result.
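The similarity, clustering, and filtering operations can be sketched as follows. This is a simplified illustration under stated assumptions: Jaccard similarity over word sets, a threshold of 0.5, single-pass clustering against the first member of each cluster (rather than comparing every pair), and a filter that keeps one item per cluster are all choices made for this example, and the function names are hypothetical.

```python
from typing import List


def jaccard(a: str, b: str) -> float:
    """Word-set Jaccard similarity between two pieces of information (assumed measure)."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a and not words_b:
        return 1.0
    return len(words_a & words_b) / len(words_a | words_b)


def cluster(items: List[str], threshold: float = 0.5) -> List[List[str]]:
    """Group each item with the first cluster whose representative is similar enough."""
    clusters: List[List[str]] = []
    for item in items:
        for group in clusters:
            if jaccard(item, group[0]) >= threshold:
                group.append(item)
                break
        else:
            clusters.append([item])  # no similar cluster found: start a new one
    return clusters


def filter_items(items: List[str], threshold: float = 0.5) -> List[str]:
    """Keep one representative per cluster (an assumed filtering rule)."""
    return [group[0] for group in cluster(items, threshold)]


news = [
    "stocks rise on rate cut",
    "stocks rise after rate cut",
    "team wins final",
]
kept = filter_items(news)
```

The first two items share four of six distinct words, so they fall into one cluster and only the first is kept.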
In a possible implementation manner, the core content processing apparatus 800 further includes an algorithm obtaining module, where the algorithm obtaining module is configured to:
acquiring a required target translation type, wherein the target translation type is used for indicating the expression form type of the translated information abstract;
and acquiring a preset translation algorithm corresponding to the target translation type.
In one possible implementation, the generating module 810 may be configured to generate the information summary by:
obtaining sentence vectors of sentences contained in the information and position vectors corresponding to positions of the sentences in the information;
judging whether the sentence is a sentence for marking the core content of the information or not by utilizing a discrimination model obtained by pre-training according to the sentence vector and the position vector;
and obtaining the information abstract of the information according to all the sentences used for identifying the core content of the information.
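Sentence selection from sentence vectors and position vectors can be sketched with a small discriminator. Everything concrete below is an assumption for illustration: the one-feature sentence vector (word count), the relative-position feature, the logistic form of the discriminator, and its weights are invented stand-ins for the pre-trained discrimination model, and the function names are hypothetical.

```python
import math
from typing import Callable, List, Sequence


def position_vector(index: int, total: int) -> List[float]:
    """Encode the sentence position as one relative-position feature in [0, 1]."""
    return [index / max(total - 1, 1)]


def make_logistic_discriminator(weights: Sequence[float], bias: float) -> Callable[[Sequence[float]], bool]:
    """Build a toy discriminator; real weights would come from pre-training."""
    def is_core(features: Sequence[float]) -> bool:
        z = bias + sum(w * x for w, x in zip(weights, features))
        return 1.0 / (1.0 + math.exp(-z)) >= 0.5
    return is_core


def extract_summary(
    sentences: List[str],
    sentence_vector: Callable[[str], List[float]],
    is_core: Callable[[Sequence[float]], bool],
) -> str:
    """Keep the sentences judged to identify the core content and join them."""
    total = len(sentences)
    kept = [
        sentence
        for index, sentence in enumerate(sentences)
        if is_core(sentence_vector(sentence) + position_vector(index, total))
    ]
    return " ".join(kept)


def toy_sentence_vector(sentence: str) -> List[float]:
    # Toy sentence vector: the word count only.
    return [float(len(sentence.split()))]


discriminator = make_logistic_discriminator(weights=[0.5, -3.0], bias=-1.0)
summary = extract_summary(
    ["Central bank cuts rates today.", "Weather was mild.", "Ok."],
    toy_sentence_vector,
    discriminator,
)
```

With these invented weights, longer sentences near the start score higher, so only the first sentence is kept.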
The detailed processes executed by the modules in the core content processing apparatus 800 are not repeated herein, and reference may be made to the foregoing explanation of the core content processing method.
Further, an embodiment of the present application also provides a computer-readable storage medium, where machine-executable instructions are stored in the computer-readable storage medium, and when the machine-executable instructions are executed, the core content processing method provided by the foregoing embodiment is implemented.
The steps executed when the computer program runs are not described in detail herein, and reference may be made to the foregoing explanation of the core content processing method.
In summary, according to the core content processing method and device, the electronic device, and the readable storage medium provided in the embodiments of the present application, for information to be processed, an information abstract that corresponds to the information and identifies its core content is first generated according to the content information of the information. The information abstract is then processed by a preset translation algorithm to obtain a corresponding translation abstract, which also identifies the core content of the information but has a different expression form from the information abstract. In this scheme, extracting the information abstract yields brief information that expresses the core content of the information, and the information abstract is then converted into a translation abstract that expresses the same core content in a different expression form. Therefore, in addition to condensing the information, the core content can be presented in different expression forms, so that the scheme adapts flexibly to the needs of different user groups and different platforms.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method can be implemented in other ways. The apparatus and method embodiments described above are illustrative only, as the flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that, in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, functional modules in the embodiments of the present application may be integrated together to form an independent part, or each module may exist separately, or two or more modules may be integrated to form an independent part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application in essence, or the part of it that contributes over the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, an electronic device, or a network device) to perform all or part of the steps of the method according to the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements includes not only those elements but may also include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other identical elements in the process, method, article, or apparatus that comprises the element.
The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (10)

1. A method for core content processing, the method comprising:
acquiring information to be processed, and generating an information abstract corresponding to the information and used for identifying core content of the information according to content information of the information;
and processing the information abstract by adopting a preset translation algorithm to obtain a translation abstract corresponding to the information abstract, wherein the translation abstract is used for identifying the core content of the information and has a different expression form from the information abstract.
2. The method for processing core content according to claim 1, wherein the step of processing the information abstract by using a preset translation algorithm to obtain a translation abstract corresponding to the information abstract comprises:
vectorizing the information abstract to obtain an abstract vector corresponding to the information abstract;
decoding the abstract vector according to the abstract vector and a preset tag to obtain translation characters corresponding to all characters contained in the information abstract;
and obtaining a translation abstract corresponding to the information abstract according to all the obtained translation characters.
3. The method for processing core content according to claim 2, wherein the step of decoding the abstract vector according to the abstract vector and a preset tag to obtain translation characters corresponding to each character included in the information abstract comprises:
for the first character in the information abstract, decoding the abstract vector according to the abstract vector and a preset start tag to obtain a translation character corresponding to the first character in the information abstract;
for each character in the information abstract other than the first character and the last character, decoding the abstract vector according to the abstract vector and the translation character obtained for the previous character to obtain the translation character corresponding to that character;
and for the last character in the information abstract, decoding the abstract vector according to the abstract vector and a preset end tag to obtain a translation character corresponding to the last character.
4. The method for processing core content according to claim 3, wherein the step of decoding the abstract vector according to the abstract vector and a preset start tag to obtain a translation character corresponding to the first character in the information abstract comprises:
vectorizing the preset start tag;
decoding the abstract vector according to the abstract vector and the vectorized preset start tag to obtain a plurality of translation characters converted from the first vector in the abstract vector and the translation score carried by each translation character;
and taking the translation character with the highest translation score among the plurality of translation characters as the translation character corresponding to the first character in the information abstract.
5. The method for processing core content according to claim 1, wherein the information to be processed includes a plurality of pieces, and before the step of generating the information summary corresponding to the information and used for identifying the core content of the information according to the content information of the information, the method further includes:
calculating the similarity of any two pieces of information in the plurality of pieces of information to be processed;
clustering the information according to the similarity calculation result;
and filtering the information according to the clustering result.
6. The method of claim 1, wherein before the step of processing the information summary using the predetermined translation algorithm, the method further comprises:
acquiring a required target translation type, wherein the target translation type is used for indicating the expression form type of the translated information abstract;
and acquiring a preset translation algorithm corresponding to the target translation type.
7. The method for processing core content according to claim 1, wherein the step of generating the information summary corresponding to the information and used for identifying the core content of the information according to the content information of the information comprises:
obtaining sentence vectors of sentences contained in the information and position vectors corresponding to positions of the sentences in the information;
judging whether the sentence is a sentence for marking the core content of the information or not by utilizing a discrimination model obtained by pre-training according to the sentence vector and the position vector;
and obtaining the information abstract of the information according to all the sentences used for identifying the core content of the information.
8. A core content processing apparatus, the apparatus comprising:
the generating module is used for acquiring information to be processed and generating an information abstract which corresponds to the information and is used for identifying the core content of the information according to the content information of the information;
and the processing module is used for processing the information abstract by adopting a preset translation algorithm to obtain a translation abstract corresponding to the information abstract, and the translation abstract is used for marking the core content of the information and has a different expression form from the information abstract.
9. An electronic device, comprising: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating via the bus when the electronic device is operating, the processor executing the machine-readable instructions to perform the steps of the method according to any one of claims 1 to 7.
10. A computer-readable storage medium, having stored thereon a computer program which, when being executed by a processor, is adapted to carry out the steps of the method according to any one of claims 1 to 7.
CN202010704987.0A 2020-07-21 2020-07-21 Core content processing method, device, electronic equipment and readable storage medium Active CN111831816B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010704987.0A CN111831816B (en) 2020-07-21 2020-07-21 Core content processing method, device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010704987.0A CN111831816B (en) 2020-07-21 2020-07-21 Core content processing method, device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN111831816A true CN111831816A (en) 2020-10-27
CN111831816B CN111831816B (en) 2023-06-27

Family

ID=72924462

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010704987.0A Active CN111831816B (en) 2020-07-21 2020-07-21 Core content processing method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111831816B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9122674B1 (en) * 2006-12-15 2015-09-01 Language Weaver, Inc. Use of annotations in statistical machine translation
CN106570171A (en) * 2016-11-03 2017-04-19 中国电子科技集团公司第二十八研究所 Semantics-based sci-tech information processing method and system
US20170220997A1 (en) * 2016-02-02 2017-08-03 Ricoh Company, Ltd. Conference support system, conference support method, and recording medium
CN109344413A (en) * 2018-10-16 2019-02-15 北京百度网讯科技有限公司 Translation processing method and device
CN110209774A (en) * 2018-02-11 2019-09-06 北京三星通信技术研究有限公司 Handle the method, apparatus and terminal device of session information
CN110209771A (en) * 2019-06-14 2019-09-06 哈尔滨哈银消费金融有限责任公司 User's geographic information analysis and text mining method and apparatus
CN110263350A (en) * 2019-03-08 2019-09-20 腾讯科技(深圳)有限公司 Model training method, device, computer readable storage medium and computer equipment
CN111027331A (en) * 2019-12-05 2020-04-17 百度在线网络技术(北京)有限公司 Method and apparatus for evaluating translation quality
CN111382261A (en) * 2020-03-17 2020-07-07 北京字节跳动网络技术有限公司 Abstract generation method and device, electronic equipment and storage medium
CN111428523A (en) * 2020-03-23 2020-07-17 腾讯科技(深圳)有限公司 Translation corpus generation method and device, computer equipment and storage medium

Patent Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9122674B1 (en) * 2006-12-15 2015-09-01 Language Weaver, Inc. Use of annotations in statistical machine translation
US20170220997A1 (en) * 2016-02-02 2017-08-03 Ricoh Company, Ltd. Conference support system, conference support method, and recording medium
CN106570171A (en) * 2016-11-03 2017-04-19 中国电子科技集团公司第二十八研究所 Semantics-based sci-tech information processing method and system
CN110209774A (en) * 2018-02-11 2019-09-06 北京三星通信技术研究有限公司 Handle the method, apparatus and terminal device of session information
CN109344413A (en) * 2018-10-16 2019-02-15 北京百度网讯科技有限公司 Translation processing method and device
CN110263350A (en) * 2019-03-08 2019-09-20 腾讯科技(深圳)有限公司 Model training method, device, computer readable storage medium and computer equipment
CN110209771A (en) * 2019-06-14 2019-09-06 哈尔滨哈银消费金融有限责任公司 User's geographic information analysis and text mining method and apparatus
CN111027331A (en) * 2019-12-05 2020-04-17 百度在线网络技术(北京)有限公司 Method and apparatus for evaluating translation quality
CN111382261A (en) * 2020-03-17 2020-07-07 北京字节跳动网络技术有限公司 Abstract generation method and device, electronic equipment and storage medium
CN111428523A (en) * 2020-03-23 2020-07-17 腾讯科技(深圳)有限公司 Translation corpus generation method and device, computer equipment and storage medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
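WAN XIAO-JUN et al.: "Cross-language document summarization based on machine translation quality prediction", 《ACL '10: PROCEEDINGS OF THE 48TH ANNUAL MEETING OF THE ASSOCIATION FOR COMPUTATIONAL LINGUISTICS》, pages 917 *
余传明 et al.: "A survey of abstractive text summarization based on sequence-to-sequence models", 《Library and Information Service》, pages 108 - 117 *
殷明明 et al.: "Cross-lingual sentence summarization system based on a contrastive attention mechanism", 《Computer Engineering》, pages 86 - 93 *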

Also Published As

Publication number Publication date
CN111831816B (en) 2023-06-27

Similar Documents

Publication Publication Date Title
Gómez-Adorno et al. Improving feature representation based on a neural network for author profiling in social media texts
CN110705206B (en) Text information processing method and related device
Masmoudi et al. Arabic transliteration of romanized tunisian dialect text: A preliminary investigation
CN109359308B (en) Machine translation method, device and readable storage medium
CN114827752B (en) Video generation method, video generation system, electronic device and storage medium
CN114036300A (en) Language model training method and device, electronic equipment and storage medium
CN112765319B (en) Text processing method and device, electronic equipment and storage medium
CN111858843A (en) Text classification method and device
CN111859940A (en) Keyword extraction method and device, electronic equipment and storage medium
CN115238039A (en) Text generation method, electronic device and computer-readable storage medium
CN113255331B (en) Text error correction method, device and storage medium
CN111159394A (en) Text abstract generation method and device
Park et al. Automatic analysis of thematic structure in written English
CN116562240A (en) Text generation method, computer device and computer storage medium
CN111831816B (en) Core content processing method, device, electronic equipment and readable storage medium
CN107908792B (en) Information pushing method and device
CN115438655A (en) Person gender identification method and device, electronic equipment and storage medium
KR102072708B1 (en) A method and computer program for inferring genre of a text contents
JP2005050156A (en) Method and system for replacing content
JP5398638B2 (en) Symbol input support device, symbol input support method, and program
CN111209724A (en) Text verification method and device, storage medium and processor
EP4336379A1 (en) Tracking concepts within content in content management systems and adaptive learning systems
CN117077664B (en) Method and device for constructing text error correction data and storage medium
JP2013050853A (en) Implication relation determination device and program
CN114997133A (en) Text template generation method and device, computer equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant