CN111831816B - Core content processing method, device, electronic equipment and readable storage medium - Google Patents

Info

Publication number
CN111831816B
Authority
CN
China
Prior art keywords
information
abstract
translation
character
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010704987.0A
Other languages
Chinese (zh)
Other versions
CN111831816A (en)
Inventor
栾博恒
熊军
谭悦
谭金龙
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hubo Network Technology Shanghai Co ltd
Original Assignee
Hubo Network Technology Shanghai Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hubo Network Technology Shanghai Co ltd filed Critical Hubo Network Technology Shanghai Co ltd
Priority to CN202010704987.0A priority Critical patent/CN111831816B/en
Publication of CN111831816A publication Critical patent/CN111831816A/en
Application granted granted Critical
Publication of CN111831816B publication Critical patent/CN111831816B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/34Browsing; Visualisation therefor
    • G06F16/345Summarisation for human users
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/42Data-driven translation
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a core content processing method, a device, electronic equipment and a readable storage medium. For information to be processed, an information abstract identifying the core content of the information is first generated according to the content of the information; a preset translation algorithm is then applied to the information abstract to obtain a corresponding translation abstract, which also identifies the core content of the information but differs from the information abstract in expression form. The scheme thus first extracts an information abstract to obtain brief information representing the core content, and then converts it into a translation abstract that expresses the same core content in a different form. As a result, on top of simplifying the information, the core content can be displayed in different expression forms to suit the requirements of different user groups and different platforms.

Description

Core content processing method, device, electronic equipment and readable storage medium
Technical Field
The present application relates to the field of natural language processing technologies, and in particular, to a core content processing method, a device, an electronic apparatus, and a readable storage medium.
Background
With the development of information technology, people receive a large amount of information every day, much of it redundant. How to obtain the latest and most popular information, so that people can quickly grasp current context and trends, is a problem to be solved urgently. However, when information is displayed at present, on the one hand the content is redundant; on the other hand, the displayed information is uniform: information better suited to different user groups and different platforms cannot be provided in a targeted manner, and it is difficult to flexibly adapt to their requirements for content in different expression forms.
Disclosure of Invention
The object of the present application includes, for example, providing a core content processing method, apparatus, electronic device, and computer readable storage medium, which can perform text presentation in a targeted content expression manner to meet the needs of user diversification.
Embodiments of the present application may be implemented as follows:
In a first aspect, an embodiment of the present application provides a core content processing method, where the method includes:
acquiring information to be processed, and generating an information abstract corresponding to the information and used for identifying the core content of the information according to the content information of the information;
and processing the information abstract by adopting a preset translation algorithm to obtain a translation abstract corresponding to the information abstract, wherein the translation abstract is used for identifying the core content of the information and has different expression forms with the information abstract.
In an alternative embodiment, the step of processing the information abstract by using a preset translation algorithm to obtain a translation abstract corresponding to the information abstract includes:
carrying out vectorization processing on the information abstract to obtain an abstract vector corresponding to the information abstract;
decoding the abstract vector according to the abstract vector and a preset label to obtain translation characters corresponding to each character contained in the information abstract;
and obtaining the translation abstract corresponding to the information abstract according to all the obtained translation characters.
In an alternative embodiment, the step of decoding the abstract vector according to the abstract vector and a preset label to obtain translated characters corresponding to each character included in the information abstract includes:
for the first character in the information abstract, decoding the abstract vector according to the abstract vector and a preset start label to obtain a translation character corresponding to the first character in the information abstract;
for the characters in the information abstract other than the first character and the last character, decoding the abstract vector according to the abstract vector and the translation character obtained for the previous character to obtain translation characters corresponding to those characters;
and for the last character in the information abstract, decoding the abstract vector according to the abstract vector and a preset end label to obtain a translation character corresponding to the last character.
In an alternative embodiment, the step of decoding the summary vector according to the summary vector and a preset start tag to obtain a translated character corresponding to the first character in the information summary includes:
vectorizing a preset starting label;
decoding the abstract vector according to the abstract vector and a preset starting label after vectorization processing to obtain a plurality of translation characters obtained by converting a first vector in the abstract vector and translation scores carried by each translation character;
and obtaining the translation character with the highest translation score among the plurality of translation characters, and taking it as the translation character corresponding to the first character in the information abstract.
In an alternative embodiment, the information to be processed includes a plurality of information pieces, and before the step of generating the information digest corresponding to the information piece and used for identifying the core content of the information piece according to the content information of the information piece, the method further includes:
calculating the similarity of any two pieces of information in the plurality of pieces of information to be processed;
clustering the information according to the calculation result of the similarity;
and filtering the information pieces according to the clustering result.
In an alternative embodiment, before the step of processing the summary of information by using a preset translation algorithm, the method further includes:
acquiring a required target translation type, wherein the target translation type is used for indicating the expression form type of the translated information abstract;
and acquiring a preset translation algorithm corresponding to the target translation type.
In an alternative embodiment, the step of generating the information summary corresponding to the information and used for identifying the core content of the information according to the content information of the information includes:
obtaining sentence vectors of all sentences contained in the information and position vectors corresponding to the positions of the sentences in the information;
judging whether the sentence is a sentence for identifying the core content of the information or not by utilizing a discriminant model obtained by training in advance according to the sentence vector and the position vector;
and obtaining the information abstract of the information according to all sentences which are obtained by discrimination and are used for identifying the core content of the information.
In a second aspect, an embodiment of the present application provides a core content processing apparatus, including:
the generation module is used for acquiring information to be processed, and generating an information abstract which corresponds to the information and is used for identifying the core content of the information according to the content information of the information;
the processing module is used for processing the information abstract by adopting a preset translation algorithm to obtain a translation abstract corresponding to the information abstract, wherein the translation abstract is used for identifying the core content of the information and has different expression forms with the information abstract.
In a third aspect, an embodiment of the present application provides an electronic device, including: a processor, a storage medium storing machine-readable instructions executable by the processor, the processor in communication with the storage medium via the bus when the electronic device is running, the processor executing the machine-readable instructions to perform the steps of any of the methods as described above.
In a fourth aspect, embodiments of the present application provide a computer-readable storage medium having stored thereon a computer program which, when executed by a processor, performs the steps of any of the methods described above.
The beneficial effects of the embodiment of the application include, for example:
According to the core content processing method, device, electronic equipment and readable storage medium provided by the embodiments, an information abstract identifying the core content of the information is first generated according to the content of the information, and a preset translation algorithm is then applied to the information abstract to obtain a corresponding translation abstract, which also identifies the core content of the information but differs from the information abstract in expression form. The scheme thus first extracts an information abstract to obtain brief information representing the core content, and then converts it into a translation abstract that expresses the same core content in a different form. As a result, on top of simplifying the information, the core content can be displayed in different expression forms to suit the requirements of different user groups and different platforms.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered limiting the scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a block diagram of an electronic device according to an embodiment of the present application;
Fig. 2 is a flowchart of a core content processing method according to an embodiment of the present application;
Fig. 3 is a flowchart of a clustering processing method in the core content processing method provided in the embodiment of the present application;
Fig. 4 is a flowchart of the sub-steps of step S210 in Fig. 2;
Fig. 5 is a flowchart of a method for obtaining a translation algorithm in the core content processing method provided in the embodiment of the present application;
Fig. 6 is a flowchart of the sub-steps of step S220 in Fig. 2;
Fig. 7 is a schematic diagram illustrating a translation process according to an embodiment of the present application;
Fig. 8 is a functional block diagram of a core content processing apparatus according to an embodiment of the present application.
Reference numerals: 110-processor; 120-memory; 130-communication module; 800-core content processing apparatus; 810-generation module; 820-processing module.
Detailed Description
For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions of the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations.
Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments herein without making any inventive effort, are intended to be within the scope of the present application.
It should be noted that: like reference numerals and letters denote like items in the following figures, and thus once an item is defined in one figure, no further definition or explanation thereof is necessary in the following figures.
It should be noted that, without conflict, features in embodiments of the present application may be combined with each other.
Referring to Fig. 1, which is a block diagram of an electronic device according to an embodiment of the present application, the electronic device may be, but is not limited to, a computer, a server, or another device. The electronic device may include a memory 120, a processor 110, and a communication module 130. The memory 120, the processor 110, and the communication module 130 are electrically connected to each other, directly or indirectly, to realize data transmission or interaction. For example, the components may be electrically connected to each other via one or more communication buses or signal lines.
The memory 120 is used for storing programs or data. The memory 120 may be, but is not limited to, Random Access Memory (RAM), Read-Only Memory (ROM), Programmable Read-Only Memory (PROM), Erasable Programmable Read-Only Memory (EPROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), etc.
The processor 110 is configured to read/write data or programs stored in the memory 120 and execute the core content processing method provided in any embodiment of the present application.
The communication module 130 is used for establishing communication connection between the electronic device and other communication terminals through a network, and for transceiving data through the network.
It should be understood that the structure shown in fig. 1 is merely a schematic structural diagram of an electronic device that may also include more or fewer components than those shown in fig. 1, or have a different configuration than that shown in fig. 1. The components shown in fig. 1 may be implemented in hardware, software, or a combination thereof.
Referring to fig. 2, fig. 2 is a flowchart illustrating a core content processing method according to an embodiment of the present application, where the core content processing method may be executed by the electronic device shown in fig. 1. It should be understood that, in other embodiments, the order of some steps in the core content processing method of the present embodiment may be interchanged according to actual needs, or some steps may be omitted or deleted. The detailed steps of the core content processing method are described below.
Step S210, information to be processed is obtained, and an information abstract corresponding to the information and used for identifying the core content of the information is generated according to the content information of the information.
Step S220, processing the information abstract by adopting a preset translation algorithm to obtain a translation abstract corresponding to the information abstract, wherein the translation abstract is used for identifying the core content of the information and has different expression forms with the information abstract.
When users obtain information online in daily life, or when an information platform displays information, there are different demands on the expression form of the information in addition to attention to its most important content. For example, when a piece of information is displayed on a certain information platform, its expression form can be converted so that it does not share the same expression form as on other platforms that display the same information, which would otherwise give users a monotonous reading experience and reduce readership of the platform's information. As another example, when the information platform displays information to different types of users, it can convert the expression form according to the user type: when the users are teachers, the information can be converted into a more standard form, and when the users are students, it can be converted into a more easily understood form.
Based on the above considerations, the present application provides a core content processing scheme. The information to be processed may be, for example, blog content, web page content, news, etc., without limitation. To avoid redundant content and let the user quickly obtain the core of the information, an information abstract identifying the core content may first be generated from the content of the information.
In this embodiment, the extraction of the information summary of the information is first performed, so that on one hand, the core content can be extracted to avoid redundancy of information, and on the other hand, the subsequent translation processing is performed based on the extracted information summary, so that the workload of the translation processing can be simplified.
On the basis of the above, the information abstract can be processed by adopting a preset translation algorithm, and the processed translation abstract is different from the information abstract in expression form, but the essential content is the same as the information abstract, and can be used for identifying the core content of the original information.
In this embodiment, a translation model may be trained in advance on samples. The translation model may be trained on a parallel corpus of source-target pairs, i.e. pairs of texts that have the same meaning but different expression forms, so that text in one expression form is used to generate text in another, and the difference between the generated text and the target is minimized. Once training has converged, a translation model meeting the requirements is obtained.
On this basis, in the step of processing the information abstract with the preset translation algorithm, the information abstract can be processed by the corresponding translation model obtained by training in advance, thereby converting the information abstract into the translation abstract (that is, converting its mode of expression).
In this embodiment, the core content is extracted from the information and then converted into the expression form, so as to obtain the abstract which has different expression forms but is also used for identifying the core content, and the core content can be displayed in different expression forms according to the requirements of different user groups and different platforms on the basis of simplifying the information.
In this embodiment, to further reduce information redundancy, when the information to be processed includes a plurality of pieces of information, these pieces may first be clustered and filtered to remove pieces that are too similar to one another, so as to avoid redundant display of information, which would otherwise harm the user's reading experience.
Therefore, referring to fig. 3, the core content processing method provided in the present embodiment further includes the following steps:
step S310, calculating the similarity of any two pieces of information in the plurality of pieces of information to be processed.
Step S320, clustering the information pieces according to the similarity calculation result.
And step S330, filtering the information pieces according to the clustering result.
In this embodiment, the similarity of any two of the plurality of pieces of information to be processed is calculated, and if the similarity of two pieces is higher than a preset threshold, the two pieces may be assigned to the same cluster set. After every pair of pieces has been processed in this way, a plurality of cluster sets is obtained. The pieces of information within one cluster set are highly similar to each other, so if a cluster set contains two or more pieces of information, it may be filtered, for example by deleting pieces until only one remains. Alternatively, if a cluster set contains more than two pieces of information, it may be filtered so that two pieces are retained. The specific number retained is not limited in this embodiment, as long as redundancy is reduced.
If the clustering set only contains one piece of information, the clustering set does not perform filtering processing, and the piece of information is reserved.
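The clustering and filtering of steps S310 to S330 can be sketched as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: the source does not name a similarity measure, so word-set Jaccard overlap is used as a stand-in, and clusters are merged transitively with a union-find structure. The names `jaccard`, `cluster_and_filter`, `threshold`, and `keep` are hypothetical.

```python
from itertools import combinations


def jaccard(a: str, b: str) -> float:
    """Assumed similarity measure: Jaccard overlap of lowercase word sets."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def cluster_and_filter(items, threshold=0.6, keep=1, similarity=jaccard):
    """Group items whose pairwise similarity exceeds `threshold`, then keep
    at most `keep` items per cluster (the earliest ones in input order)."""
    parent = list(range(len(items)))

    def find(i):
        # Follow parent links to the cluster root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Steps S310/S320: compare every pair and merge sufficiently similar items.
    for i, j in combinations(range(len(items)), 2):
        if similarity(items[i], items[j]) > threshold:
            ri, rj = find(i), find(j)
            if ri != rj:
                parent[max(ri, rj)] = min(ri, rj)

    # Step S330: retain the first `keep` members of each cluster.
    kept_per_cluster = {}
    result = []
    for i, item in enumerate(items):
        root = find(i)
        if kept_per_cluster.get(root, 0) < keep:
            kept_per_cluster[root] = kept_per_cluster.get(root, 0) + 1
            result.append(item)
    return result
```

A cluster containing a single item is left untouched, and `keep=2` corresponds to the variant in which two pieces per cluster are retained.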
After the filtering process is performed, the extraction of the information summary of the information is performed according to the step S210 for each information obtained after the filtering.
In this embodiment, in order to extract the information abstract characterizing the core content more accurately, step S210 takes into account not only the features of each sentence itself, such as frequency of occurrence, but also the position of each sentence in the information. This is because the position of a sentence can also indicate, to some extent, its importance to the content; for example, the first few or last few paragraphs of a piece of information are generally important to what it expresses. Therefore, referring to Fig. 4, in this embodiment step S210 may be implemented by the following steps:
step S211, sentence vectors of each sentence contained in the information and position vectors corresponding to the positions of the sentences in the information are obtained.
Step S212, judging whether the sentence is a sentence for identifying the core content of the information or not by utilizing a discrimination model obtained by training in advance according to the sentence vector and the position vector.
Step S213, obtaining the information abstract of the information according to all sentences determined to identify the core content of the information.
In this embodiment, the information may first be split into sentences according to sentence delimiters, which may be a period, a comma, a semicolon, or the like. The information is thereby divided into a plurality of sentences.
For each sentence, a sentence vector of the sentence itself may be obtained, together with the position of the sentence in the information, for example its ordinal position (which sentence it is). A position vector of the sentence is then obtained from this position information.
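As a rough illustration of the sentence splitting and position information described above, the following sketch splits text on periods, commas and semicolons and derives a simple position feature for each sentence. The delimiter set (including full-width variants) and the two-dimensional relative encoding in `position_vector` are assumptions made for the example; the source does not specify how the position vector is constructed, and sentence vectors themselves are not computed here.

```python
import re

# Assumed delimiter set: ASCII and full-width periods, commas and semicolons.
_DELIMS = r"[.,;。，；]"


def split_sentences(text):
    """Split text on the delimiters and drop empty fragments."""
    parts = re.split(_DELIMS, text)
    return [p.strip() for p in parts if p.strip()]


def position_vector(index, total):
    """Hypothetical 2-d position encoding: relative distance from the
    start and from the end of the document, each in [0, 1]."""
    if total <= 1:
        return [0.0, 0.0]
    rel = index / (total - 1)
    return [rel, 1.0 - rel]


sentences = split_sentences("A rises; B falls, C holds.")
positions = [position_vector(i, len(sentences)) for i in range(len(sentences))]
```

Note that this naive split also breaks on decimal points and abbreviations; a production system would need a more careful segmenter.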
The obtained sentence vector and position vector are combined and processed by a discriminant model obtained by training in advance. The discriminant model is trained in advance on a plurality of information samples, using the sentence sample vectors, the sentence sample position vectors, and a label for each sentence indicating whether it is core content.
In application, the input of the discriminant model is the sentence vector and position vector of each sentence, and the output is a score that indicates how important the sentence is to the overall expression of the information: the higher the score, the more important the sentence; the lower the score, the less important.
When determining whether the sentence is a sentence representing the core content of the information based on the obtained score, it may be determined in such a manner that a preset threshold is set, for example, when the score is higher than the preset threshold, it may be determined that the sentence is a sentence identifying the core content of the information, otherwise, it is determined that the sentence is not a sentence identifying the core content of the information.
In another implementation, the sentences identifying the core content may instead be determined by comparing the scores of all sentences contained in the information. For example, the sentences may be sorted by score from high to low, and a preset number of top-ranked sentences taken as the sentences identifying the core content of the information.
After determining the sentences for identifying the core content of the information, the obtained multiple sentences can be combined according to the positions of the sentences in the information to obtain the information abstract corresponding to the information.
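The two selection strategies and the final assembly step can be sketched as follows. The scores are assumed to come from the trained discriminant model, which is not reproduced here; the function names `select_by_threshold`, `select_top_k`, and `build_summary` are hypothetical, as is the choice of joining sentences with a space.

```python
def select_by_threshold(scores, threshold):
    """Indices of sentences whose score is above the preset threshold."""
    return [i for i, s in enumerate(scores) if s > threshold]


def select_top_k(scores, k):
    """Indices of the k highest-scoring sentences, in document order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def build_summary(sentences, selected, joiner=" "):
    """Combine the selected sentences according to their original positions."""
    return joiner.join(sentences[i] for i in sorted(selected))


# Example scores stand in for the discriminant model's output.
sentences = ["a", "b", "c", "d"]
scores = [0.9, 0.2, 0.7, 0.4]
summary = build_summary(sentences, select_top_k(scores, 2))
```

Both selectors return indices rather than sentences so that the summary can be reassembled in the order the sentences appear in the original information.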
In this embodiment, on the basis of obtaining the summary of the information, when executing the step S220, the translation process of the summary of the information is performed according to different requirements, so the adopted preset translation algorithm corresponds to different requirements, and referring to fig. 5, the preset translation algorithm may be determined by:
In step S410, a desired target translation type is obtained, which is used to indicate the expression type of the translated information abstract.
Step S420, obtaining a preset translation algorithm corresponding to the target translation type.
Translation may be performed according to the required translation type, and the target translation type may be specified by the user. For example, when Chinese needs to be translated into English, the target translation type is English; when a popular expression form needs to be converted into the text form, the target translation type is the text form. In implementation, a plurality of different translation types may be supported: a corpus for each translation type can be obtained by collecting, in advance, a large amount of information in the different expression forms and generating sentence pairs, which are then used to train the corresponding translation model. Performing translation with different translation algorithms thus essentially means performing it with the corresponding translation models trained on different corpora.
In this embodiment, a corresponding preset translation algorithm may be determined according to the obtained target translation type, and the determined preset translation algorithm may correspond to a corpus corresponding to the target translation type, and further correspond to a translation model, that is, when the preset translation algorithm is adopted to translate the information abstract, translation processing may be performed based on the corresponding corpus and the corresponding translation model.
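Steps S410 and S420 amount to looking up a translation algorithm by target translation type. One way to sketch this is a small registry; the registry `TRANSLATORS`, the decorator `register`, the lookup `get_translation_algorithm`, and the toy "uppercase" translator are all hypothetical placeholders. In the scheme described here, each entry would instead wrap a translation model trained on the corpus for that type.

```python
from typing import Callable, Dict

# Maps a target translation type to the algorithm that produces it.
TRANSLATORS: Dict[str, Callable[[str], str]] = {}


def register(target_type: str):
    """Decorator that registers a translator under a target translation type."""
    def decorator(fn: Callable[[str], str]) -> Callable[[str], str]:
        TRANSLATORS[target_type] = fn
        return fn
    return decorator


@register("uppercase")
def _to_uppercase(summary: str) -> str:
    # Toy stand-in for a trained translation model.
    return summary.upper()


def get_translation_algorithm(target_type: str) -> Callable[[str], str]:
    """Step S420: return the preset algorithm for the requested type."""
    try:
        return TRANSLATORS[target_type]
    except KeyError:
        raise ValueError(f"no translation algorithm for {target_type!r}") from None


translate = get_translation_algorithm("uppercase")  # step S410 supplies the type
```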
In this embodiment, after the preset translation algorithm is determined, referring to fig. 6, the above step S220 may be implemented through the following steps:
Step S221, vectorizing the information abstract to obtain the abstract vector corresponding to the information abstract.
Step S222, decoding the abstract vector according to the abstract vector and a preset label to obtain translation characters corresponding to each character contained in the information abstract.
Step S223, according to all the obtained translation characters, obtaining the translation abstract corresponding to the information abstract.
In this embodiment, the summary vector is obtained by performing overall vectorization on the summary of information. When the translation process is performed, a preset label is required to trigger the translation process or end the translation process. For example, the preset label may include a preset start label and a preset end label, when the first character included in the abstract is translated, the preset start label may be used to trigger translation, and when the last character included in the abstract is translated, the preset end label may be used to trigger translation end. The preset start tag may be "\t", and the preset end tag may be "\n", for example.
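The role of the two tags can be illustrated with a short sketch. The application does not describe how training sequences are prepared, so the pairing below follows a common character-level sequence-to-sequence convention and is an assumption; `wrap_target` and `decoder_pairs` are hypothetical helper names.

```python
START_TAG = "\t"  # preset start tag given as an example in the description
END_TAG = "\n"    # preset end tag given as an example in the description


def wrap_target(text: str) -> str:
    """Surround a target sequence with the start tag and the end tag."""
    return START_TAG + text + END_TAG


def decoder_pairs(text: str):
    """Return (decoder input, expected output) character pairs.

    Assumed convention: the input at step t is the character at index t of
    the wrapped sequence, and the expected output is the character at t + 1.
    """
    wrapped = wrap_target(text)
    return list(zip(wrapped[:-1], wrapped[1:]))


print(decoder_pairs("Va"))
# [('\t', 'V'), ('V', 'a'), ('a', '\n')]
```

Under this convention the start tag is the first decoder input and the end tag marks where the output sequence stops.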
In practice, for the first character in the abstract, the abstract vector can be decoded according to the abstract vector and a preset start tag to obtain a translation character corresponding to the first character in the abstract.
When the characters other than the first character and the last character in the information abstract are processed, the abstract vector is decoded according to the abstract vector and the translated character obtained for the previous character, to obtain the translated characters corresponding to those other characters.
When the last character in the information abstract is processed, the abstract vector is decoded according to the abstract vector and a preset end label, to obtain the translated character corresponding to the last character.
In this embodiment, when decoding is performed by combining the abstract vector with a preset label, the preset label must first be converted into vector form, and the decoding then operates on the abstract vector together with the vectorized preset label.
For example, when the translated character corresponding to the first character of the information abstract is obtained according to the preset start tag and the abstract vector, the preset start tag may optionally be vectorized first. The abstract vector is then decoded according to the abstract vector and the vectorized preset start tag, to obtain a plurality of translated characters converted from the first vector in the abstract vector, together with the translation score carried by each translated character. The translated character with the highest translation score among them is taken as the translated character corresponding to the first character in the information abstract.
It should be understood that the same word may have several different expressions, so decoding the abstract vector may yield several candidate translation results. Under different translation algorithms, the scores given to these candidate results differ, and the translated character carrying the highest translation score is the target character under the corresponding translation algorithm.
Similarly, when the characters other than the first character and the last character of the information abstract are processed, for example the second character or the third character, the translated character corresponding to the previous character may first be vectorized. The abstract vector is then decoded according to the abstract vector and the vectorized previous translated character, to obtain a plurality of translated characters converted from the vector corresponding to the current character in the abstract vector, together with the translation score carried by each translated character. The translated character carrying the highest translation score is taken as the translated character corresponding to the current character.
Similarly, when the last character in the information abstract is processed, the preset end tag may be vectorized, and the abstract vector is decoded according to the abstract vector and the vectorized preset end tag, to obtain a plurality of translated characters converted from the last vector in the abstract vector, together with the translation score carried by each translated character. The translated character carrying the highest translation score is taken as the translated character corresponding to the last character.
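The per-character procedure described above can be sketched as a loop that picks the conditioning input for each position and keeps the highest-scoring candidate. This is a simplified illustration under stated assumptions: `decode_step` is a hypothetical interface standing in for a trained decoder, the conditioning input is passed as a raw character rather than as a vector, and `toy_decode_step` returns fixed scores for demonstration only.

```python
from typing import Callable, Dict, List, Sequence

START_TAG, END_TAG = "\t", "\n"  # example tags given in the description

# Hypothetical decoder interface: (abstract vector, conditioning token,
# position) -> candidate translated characters mapped to translation scores.
DecodeStep = Callable[[Sequence[float], str, int], Dict[str, float]]


def translate_abstract(abstract_vector: Sequence[float], length: int,
                       decode_step: DecodeStep) -> str:
    """Decode one translated character per position, keeping the top score."""
    translated: List[str] = []
    for position in range(length):
        if position == 0:
            condition = START_TAG        # first character: start tag
        elif position == length - 1:
            condition = END_TAG          # last character: end tag
        else:
            condition = translated[-1]   # otherwise: previous translated character
        candidates = decode_step(abstract_vector, condition, position)
        translated.append(max(candidates, key=candidates.get))
    return "".join(translated)


def toy_decode_step(abstract_vector, condition, position):
    """Stand-in for a trained decoder; it ignores its first two arguments."""
    table = [{"V": 0.7, "W": 0.2}, {"a": 0.6, "e": 0.3}, {"s": 0.8, "z": 0.1}]
    return table[position]


print(translate_abstract([0.0], 3, toy_decode_step))  # prints Vas
```

Taking the single best candidate at every position is a greedy choice; it matches the highest-score selection described above and does not revisit earlier characters.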
The translation process described above is illustrated below with reference to the schematic diagram in fig. 7.
For the information to be processed, the corresponding information abstract is obtained. For example, the first character of the information abstract is "G", the second character is "o", and the last character is "" (see the last box on the left side of fig. 7). First, the information abstract is vectorized to obtain the vector form of each character (see the second box on the left side of fig. 7). The resulting vectors are then combined to obtain an abstract vector for the information abstract as a whole (see the middle vertical box of fig. 7).
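The vectorization stage can be sketched with one-hot character vectors. How the per-character vectors are combined is not specified in the description, so the element-wise mean used below is purely an assumption (a trained encoder would normally produce this vector), and the function names are hypothetical.

```python
from typing import Dict, List


def build_vocab(text: str) -> Dict[str, int]:
    """Assign an index to every distinct character."""
    return {ch: i for i, ch in enumerate(sorted(set(text)))}


def one_hot(ch: str, vocab: Dict[str, int]) -> List[float]:
    """Vector form of a single character."""
    vec = [0.0] * len(vocab)
    vec[vocab[ch]] = 1.0
    return vec


def abstract_vector(abstract: str, vocab: Dict[str, int]) -> List[float]:
    """Combine the character vectors into one vector for the whole abstract.

    Assumption: the combination is an element-wise mean of the one-hot vectors.
    """
    if not abstract:
        raise ValueError("the information abstract must not be empty")
    char_vectors = [one_hot(ch, vocab) for ch in abstract]
    count = len(char_vectors)
    return [sum(column) / count for column in zip(*char_vectors)]


abstract = "Go"  # hypothetical two-character abstract
vocab = build_vocab(abstract)
print(abstract_vector(abstract, vocab))
```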
When the first character of the information abstract is processed, the preset start tag is vectorized to obtain a vectorized preset start tag (see the lower right box of fig. 7). The abstract vector is then decoded according to the abstract vector and the vectorized preset start tag, to obtain a plurality of translated characters converted from the first vector in the abstract vector, together with the translation score carried by each translated character. After comparison, the translated character "V", which carries the highest translation score, is determined to be the translated character corresponding to the first character "G" in the information abstract.
For the second character in the information abstract, the translated character "V" corresponding to the first character is vectorized first. The abstract vector is then decoded according to the abstract vector and the vectorized translated character, to obtain a plurality of translated characters converted from the second vector contained in the abstract vector. Among them, the translated character "a", which carries the highest translation score, is determined to be the translated character corresponding to the second character in the information abstract.
The third character, the fourth character, and so on (that is, the characters other than the first character and the last character) in the information abstract are processed in the same way as the second character.
For the last character in the information abstract, the preset end tag "\n" is vectorized, and the abstract vector is then decoded according to the vectorized preset end tag and the abstract vector, to obtain the translated character corresponding to the last character.
In this embodiment, the translation abstract is obtained by combining the translated characters obtained above. The translation abstract differs from the previously obtained information abstract in expression form, but can likewise identify the core content of the information. This satisfies the need of a platform to display information in different expression forms, as well as the need of different user groups for different expression forms of the information.
Referring to fig. 8, a functional block diagram of a core content processing apparatus 800 according to another embodiment of the present application is provided, where the core content processing apparatus 800 includes a generating module 810 and a processing module 820.
The generating module 810 is configured to obtain information to be processed, and generate, according to content information of the information, an information summary corresponding to the information and used for identifying core content of the information.
It is understood that the generating module 810 may be used to perform the step S210 described above; for implementation details of the generating module 810, reference may be made to the description of step S210 above.
The processing module 820 is configured to process the information summary by using a preset translation algorithm to obtain a translation summary corresponding to the information summary, where the translation summary is used to identify core content of the information and has a different expression form from the information summary.
It is understood that the processing module 820 may be used to perform the step S220 described above; for implementation details of the processing module 820, reference may be made to the description of step S220 above.
In one possible implementation, the processing module 820 may be configured to obtain the translation digest corresponding to the information digest by:
Carrying out vectorization processing on the information abstract to obtain an abstract vector corresponding to the information abstract;
decoding the abstract vector according to the abstract vector and a preset label to obtain translation characters corresponding to each character contained in the information abstract;
and obtaining the translation abstract corresponding to the information abstract according to all the obtained translation characters.
In one possible implementation, the processing module 820 may be configured to obtain the translated characters corresponding to each character included in the summary of the information by:
aiming at the first character in the information abstract, decoding the abstract vector according to the abstract vector and a preset starting label to obtain a translation character corresponding to the first character in the information abstract;
aiming at other characters except the first character and the last character in the information abstract, decoding the abstract vector according to the abstract vector and the translated character obtained by the previous character to obtain translated characters corresponding to the other characters;
and aiming at the last character in the information abstract, decoding the abstract vector according to the abstract vector and a preset end label to obtain a translation character corresponding to the last character.
In one possible implementation, the processing module 820 may be configured to obtain the translated character corresponding to the first character in the summary by:
vectorizing a preset starting label;
decoding the abstract vector according to the abstract vector and a preset starting label after vectorization processing to obtain a plurality of translation characters obtained by converting a first vector in the abstract vector and translation scores carried by each translation character;
and obtaining the translation character with the highest translation score in the plurality of translation characters, and taking the translation character as the translation character corresponding to the first character in the information abstract.
In a possible implementation manner, the information to be processed includes a plurality of information pieces, and the core content processing apparatus 800 further includes a clustering module, where the clustering module may be used to:
calculating the similarity of any two pieces of information in the plurality of pieces of information to be processed;
clustering the information according to the calculation result of the similarity;
and filtering the information pieces according to the clustering result.
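The three operations of the clustering module can be sketched as follows. The similarity measure, the threshold, and the filtering rule are not specified at this point in the text, so this sketch assumes word-level Jaccard similarity, single-link grouping at a threshold of 0.5, and keeping the first piece of each cluster; all function names are hypothetical.

```python
from itertools import combinations
from typing import Dict, List


def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity between two pieces of information."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    union = words_a | words_b
    return len(words_a & words_b) / len(union) if union else 1.0


def cluster(items: List[str], threshold: float = 0.5) -> List[List[int]]:
    """Group item indices whose pairwise similarity reaches the threshold."""
    parent = list(range(len(items)))

    def find(i: int) -> int:
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Similarity is computed for every pair of pieces of information.
    for i, j in combinations(range(len(items)), 2):
        if jaccard(items[i], items[j]) >= threshold:
            parent[find(j)] = find(i)

    groups: Dict[int, List[int]] = {}
    for i in range(len(items)):
        groups.setdefault(find(i), []).append(i)
    return list(groups.values())


def filter_items(items: List[str], threshold: float = 0.5) -> List[str]:
    """Keep the first piece of each cluster and drop its near-duplicates."""
    return [items[group[0]] for group in cluster(items, threshold)]


news = [
    "stocks rise on strong earnings",
    "stocks rise on strong earnings today",
    "heavy rain expected tomorrow",
]
print(filter_items(news))
```

Filtering before abstract generation avoids producing several abstracts for pieces of information that say essentially the same thing.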
In one possible implementation, the core content processing apparatus 800 further includes an algorithm acquisition module, where the algorithm acquisition module may be configured to:
Acquiring a required target translation type, wherein the target translation type is used for indicating the expression form type of the translated information abstract;
and acquiring a preset translation algorithm corresponding to the target translation type.
In one possible implementation, the generation module 810 may be configured to generate the summary of the information by:
obtaining sentence vectors of all sentences contained in the information and position vectors corresponding to the positions of the sentences in the information;
judging whether the sentence is a sentence for identifying the core content of the information or not by utilizing a discriminant model obtained by training in advance according to the sentence vector and the position vector;
and obtaining the information abstract of the information according to all sentences which are obtained by discrimination and are used for identifying the core content of the information.
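The abstract generation steps above can be sketched as a per-sentence binary decision. The features, weights, and bias below are hypothetical placeholders for the learned sentence vectors, position vectors, and pre-trained discriminant model; a logistic scoring function is assumed only for illustration.

```python
import math
from typing import List, Sequence


def sentence_vector(sentence: str) -> List[float]:
    """Hypothetical 2-dimensional sentence features (not learned embeddings)."""
    words = sentence.split()
    has_digit = 1.0 if any(ch.isdigit() for ch in sentence) else 0.0
    return [min(len(words) / 20.0, 1.0), has_digit]


def position_vector(index: int, total: int) -> List[float]:
    """Relative position in the information and a flag for the first sentence."""
    return [index / max(total - 1, 1), 1.0 if index == 0 else 0.0]


def is_core(features: Sequence[float], weights: Sequence[float],
            bias: float) -> bool:
    """Assumed discriminant model: logistic score with a 0.5 decision threshold."""
    z = sum(w * x for w, x in zip(weights, features)) + bias
    return 1.0 / (1.0 + math.exp(-z)) >= 0.5


def extract_abstract(sentences: List[str], weights: Sequence[float],
                     bias: float) -> str:
    """Join every sentence judged to identify the core content."""
    total = len(sentences)
    kept = [
        s for i, s in enumerate(sentences)
        if is_core(sentence_vector(s) + position_vector(i, total), weights, bias)
    ]
    return " ".join(kept)


# Hypothetical parameters; a real model would obtain them through training.
WEIGHTS, BIAS = [0.5, 2.0, -1.0, 2.0], -1.0
doc = [
    "Revenue grew 12 percent in 2020.",
    "The weather was mild.",
    "Analysts were not surprised.",
]
print(extract_abstract(doc, WEIGHTS, BIAS))
```

Concatenating the sentence vector with the position vector lets the model use where a sentence appears in addition to what it says.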
The detailed process executed by each module in the core content processing apparatus 800 is not described in detail herein, and reference is made to the explanation of the core content processing method.
Further, the embodiment of the present application further provides a computer readable storage medium storing machine executable instructions, where the machine executable instructions when executed implement the core content processing method provided in the foregoing embodiment.
The steps executed when the computer program runs are not described in detail herein, and reference may be made to the explanation of the core content processing method.
In summary, according to the core content processing method, the device, the electronic equipment and the readable storage medium provided in the embodiments of the present application, for the information to be processed, firstly, an information abstract corresponding to the information and used for identifying the core content of the information is generated according to the content information of the information, and then a preset translation algorithm is adopted to process the information abstract, so as to obtain a translation abstract corresponding to the information abstract, where the translation abstract is used for identifying the core content of the information and is different from the expression form of the information abstract. The scheme adopts a mode of extracting the information abstract to obtain brief information for representing the information core content, and then converts the information abstract into a translation abstract which also expresses the information core content but is different from the original information abstract expression form. Therefore, on the basis of simplifying information, the core content can be displayed in different expression forms according to the requirements of different user groups and different platforms, and the method is flexibly suitable for the requirements of different user groups and different platforms.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus and method may be implemented in other manners. The apparatus and method embodiments described above are merely illustrative, for example, flow diagrams and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of apparatus, methods and computer program products according to various embodiments of the present application. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). It should also be noted that in some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. It will also be noted that each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.
In addition, the functional modules in the embodiments of the present application may be integrated together to form a single part, or each module may exist alone, or two or more modules may be integrated to form a single part.
The functions, if implemented in the form of software functional modules and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, in essence, or the part thereof that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, an electronic device, a network device, or the like) to perform all or part of the steps of the method described in the embodiments of the present application. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, an optical disk, or any other medium capable of storing program code. It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising one … …" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The foregoing description is only of the preferred embodiments of the present application and is not intended to limit the same, but rather, various modifications and variations may be made by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present application should be included in the protection scope of the present application.

Claims (6)

1. A core content processing method, the method comprising:
acquiring information to be processed, and generating an information abstract corresponding to the information and used for identifying the core content of the information according to the content information of the information;
processing the information abstract by adopting a preset translation algorithm to obtain a translation abstract corresponding to the information abstract, wherein the translation abstract is used for identifying the core content of the information and has a different expression form from the information abstract;
the step of processing the information abstract by adopting a preset translation algorithm to obtain a translation abstract corresponding to the information abstract comprises the following steps:
carrying out vectorization processing on the information abstract to obtain an abstract vector corresponding to the information abstract; decoding the abstract vector according to the abstract vector and a preset label to obtain translation characters corresponding to each character contained in the information abstract; obtaining a translation abstract corresponding to the information abstract according to all the obtained translation characters;
The step of decoding the abstract vector according to the abstract vector and the preset label to obtain the translated characters corresponding to each character contained in the information abstract comprises the following steps:
aiming at the first character in the information abstract, decoding the abstract vector according to the abstract vector and a preset starting label to obtain a translation character corresponding to the first character in the information abstract; aiming at other characters except the first character and the last character in the information abstract, decoding the abstract vector according to the abstract vector and the translated character obtained by the previous character to obtain translated characters corresponding to the other characters; aiming at the last character in the information abstract, decoding the abstract vector according to the abstract vector and a preset end label to obtain a translation character corresponding to the last character;
the step of decoding the abstract vector according to the abstract vector and a preset starting label to obtain a translation character corresponding to the first character in the information abstract comprises the following steps:
vectorizing a preset starting label; decoding the abstract vector according to the abstract vector and a preset starting label after vectorization processing to obtain a plurality of translation characters obtained by converting a first vector in the abstract vector and translation scores carried by each translation character; obtaining a translation character with the highest translation score in a plurality of translation characters, and taking the translation character as the translation character corresponding to the first character in the information abstract;
The step of generating an information abstract corresponding to the information and used for identifying the core content of the information according to the content information of the information comprises the following steps:
obtaining sentence vectors of all sentences contained in the information and position vectors corresponding to the positions of the sentences in the information; judging whether the sentence is a sentence for identifying the core content of the information or not by utilizing a discriminant model obtained by training in advance according to the sentence vector and the position vector; and obtaining the information abstract of the information according to all sentences which are obtained by discrimination and are used for identifying the core content of the information.
2. The core content processing method according to claim 1, wherein the information to be processed includes a plurality of pieces, and before the step of generating the information digest for identifying the core content of the information corresponding to the information based on the content information of the information, the method further comprises:
calculating the similarity of any two pieces of information in the plurality of pieces of information to be processed;
clustering the information according to the calculation result of the similarity;
and filtering the information pieces according to the clustering result.
3. The core content processing method according to claim 1, wherein before the step of processing the information abstract by adopting a preset translation algorithm, the method further comprises:
acquiring a required target translation type, wherein the target translation type is used for indicating the expression form type of the translated information abstract;
and acquiring a preset translation algorithm corresponding to the target translation type.
4. A core content processing apparatus, the apparatus comprising:
the generation module is used for acquiring information to be processed, and generating an information abstract which corresponds to the information and is used for identifying the core content of the information according to the content information of the information;
the processing module is used for processing the information abstract by adopting a preset translation algorithm to obtain a translation abstract corresponding to the information abstract, wherein the translation abstract is used for identifying the core content of the information and has a different expression form from the information abstract;
the generation module is used for obtaining sentence vectors of all sentences contained in the information and position vectors corresponding to the positions of the sentences in the information; judging whether the sentence is a sentence for identifying the core content of the information or not by utilizing a discriminant model obtained by training in advance according to the sentence vector and the position vector; obtaining an information abstract of the information according to all sentences which are obtained by discrimination and are used for identifying the core content of the information;
The processing module is used for carrying out vectorization processing on the information abstract to obtain an abstract vector corresponding to the information abstract; decoding the abstract vector according to the abstract vector and a preset label to obtain translation characters corresponding to each character contained in the information abstract; obtaining a translation abstract corresponding to the information abstract according to all the obtained translation characters;
the processing module is specifically configured to decode, for a first character in the information summary, the summary vector according to the summary vector and a preset start tag, to obtain a translated character corresponding to the first character in the information summary; aiming at other characters except the first character and the last character in the information abstract, decoding the abstract vector according to the abstract vector and the translated character obtained by the previous character to obtain translated characters corresponding to the other characters; aiming at the last character in the information abstract, decoding the abstract vector according to the abstract vector and a preset end label to obtain a translation character corresponding to the last character;
The processing module is specifically used for vectorizing the preset starting label; decoding the abstract vector according to the abstract vector and a preset starting label after vectorization processing to obtain a plurality of translation characters obtained by converting a first vector in the abstract vector and translation scores carried by each translation character; and obtaining the translation character with the highest translation score in the plurality of translation characters, and taking the translation character as the translation character corresponding to the first character in the information abstract.
5. An electronic device, comprising: a processor, a storage medium and a bus, the storage medium storing machine-readable instructions executable by the processor, the processor and the storage medium communicating over the bus when the electronic device is running, the processor executing the machine-readable instructions to perform the steps of the method of any one of claims 1-3.
6. A computer-readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, performs the steps of the method according to any of claims 1-3.
CN202010704987.0A 2020-07-21 2020-07-21 Core content processing method, device, electronic equipment and readable storage medium Active CN111831816B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010704987.0A CN111831816B (en) 2020-07-21 2020-07-21 Core content processing method, device, electronic equipment and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010704987.0A CN111831816B (en) 2020-07-21 2020-07-21 Core content processing method, device, electronic equipment and readable storage medium

Publications (2)

Publication Number Publication Date
CN111831816A CN111831816A (en) 2020-10-27
CN111831816B true CN111831816B (en) 2023-06-27

Family

ID=72924462

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010704987.0A Active CN111831816B (en) 2020-07-21 2020-07-21 Core content processing method, device, electronic equipment and readable storage medium

Country Status (1)

Country Link
CN (1) CN111831816B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9122674B1 (en) * 2006-12-15 2015-09-01 Language Weaver, Inc. Use of annotations in statistical machine translation
CN106570171A (en) * 2016-11-03 2017-04-19 中国电子科技集团公司第二十八研究所 Semantics-based sci-tech information processing method and system
CN109344413A (en) * 2018-10-16 2019-02-15 北京百度网讯科技有限公司 Translation processing method and device
CN110209774A (en) * 2018-02-11 2019-09-06 北京三星通信技术研究有限公司 Handle the method, apparatus and terminal device of session information
CN110209771A (en) * 2019-06-14 2019-09-06 哈尔滨哈银消费金融有限责任公司 User's geographic information analysis and text mining method and apparatus
CN110263350A (en) * 2019-03-08 2019-09-20 腾讯科技(深圳)有限公司 Model training method, device, computer readable storage medium and computer equipment
CN111027331A (en) * 2019-12-05 2020-04-17 百度在线网络技术(北京)有限公司 Method and apparatus for evaluating translation quality
CN111382261A (en) * 2020-03-17 2020-07-07 北京字节跳动网络技术有限公司 Abstract generation method and device, electronic equipment and storage medium
CN111428523A (en) * 2020-03-23 2020-07-17 腾讯科技(深圳)有限公司 Translation corpus generation method and device, computer equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10614418B2 (en) * 2016-02-02 2020-04-07 Ricoh Company, Ltd. Conference support system, conference support method, and recording medium

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Wan Xiao-jun 等.Cross-language document summarization based on machine translation quality prediction.《ACL '10: Proceedings of the 48th Annual Meeting of the Association for Computational Linguistics》.2020,917–926. *
基于对比注意力机制的跨语言句子摘要系统;殷明明 等;《计算机工程》;86-93 *
基于序列到序列模型的生成式文本摘要研究综述;余传明 等;《图书情报工作》;108-117 *

Also Published As

Publication number Publication date
CN111831816A (en) 2020-10-27

CN114997133A (en) Text template generation method and device, computer equipment and storage medium
CN114781370A (en) Text processing method, device, storage medium and processor

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant