Detailed Description
The author may write the paper in a predetermined format, for example first writing "title: XXX", then "author: XXX" and "author unit: XXX". If the author writes identifiers such as "title:", "author:" and "author unit:", these identifiers can be recognized directly in the paper, thereby determining the content corresponding to each paragraph and providing a basis for automatic editing and publishing of the paper. However, some authors may not follow the preset format exactly when writing a paper, e.g., they do not write "title:" but directly write the content of the title, in which case the content of the title cannot be accurately identified in the paper.
Based on this, embodiments of the present invention provide a paper identification method and a training method for a preset recognition model. The preset recognition model is obtained by training on a paper training set, and can identify the identifier corresponding to each paragraph in a paper according to the content of that paragraph. Even if the author does not write an identifier in the paper, the identifier corresponding to each paragraph can be identified based on the model, so as to solve the problem in the prior art that the format of the paper needs to be edited manually.
Fig. 1 is a flowchart illustrating a paper identification method according to an exemplary embodiment of the present invention.
As shown in fig. 1, the paper identification method provided in this embodiment includes:
Step 101, obtaining a paper to be identified.
The method provided by this embodiment can be executed by an electronic device with a computing function, such as a computer. The electronic device is used for identifying the identifier of each paragraph in the paper.
Specifically, the preset recognition model is stored in the electronic device. The recognition model may be trained by another device, or may be trained by the electronic device that executes the method provided in this embodiment, which is not limited in this embodiment.
Further, a database for storing the papers may be provided, and the database may be provided in the electronic device executing the method provided by the embodiment, or may be provided in other devices. For example, the database may be provided in a server connected to an electronic device that performs the method provided by the present embodiment. The user can upload the thesis through the user side, and the server can receive the thesis uploaded by the user and store the thesis in the database. The electronic device can access the database, and the database can also actively push the paper to the electronic device, so that the electronic device can acquire the paper to be identified. When the electronic device obtains the papers to be identified, the first-in first-out principle can be adopted, and the earliest paper uploaded by the user can be processed first.
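The first-in first-out acquisition described above can be sketched as follows. This is a minimal illustration only: the class name PaperQueue and its methods are hypothetical, and an in-memory queue stands in for the database, which in practice may reside on a server.

```python
from collections import deque


class PaperQueue:
    """Hypothetical in-memory stand-in for the database of uploaded papers."""

    def __init__(self):
        self._pending = deque()

    def upload(self, paper_text):
        # Called when a paper uploaded by a user is received.
        self._pending.append(paper_text)

    def next_paper(self):
        # Return the earliest uploaded paper, or None when nothing is pending.
        return self._pending.popleft() if self._pending else None


queue = PaperQueue()
queue.upload("first paper")
queue.upload("second paper")
print(queue.next_paper())  # first paper
```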
In practical application, when a user uploads a paper through a user terminal, the paper can be directly received by the electronic device executing the method provided by the embodiment, so that the electronic device can acquire the paper to be identified and directly identify the uploaded paper.
Step 102, determining paragraph identifiers corresponding to the paper to be identified according to a preset recognition model.
The preset recognition model is obtained by training according to a thesis training set in advance.
Specifically, the electronic device executing the method provided by the embodiment stores a preset recognition model. The paper to be recognized can be input into a preset recognition model, and then the preset recognition model outputs a recognition result.
Furthermore, before training the preset recognition model, a large number of papers may be collected as a paper training set, and these papers may or may not have an identifier labeled by the author during the writing process. The labels corresponding to the paragraphs in these papers can be predetermined, and the contents of the papers and the labels corresponding to the paragraphs in the papers are input into the model, so as to train the weight values inside the model.
In practical application, the trained recognition model can be directly applied to output the identification of the paper to be recognized. Specifically, the identifier corresponding to each paragraph in the thesis to be recognized may be output, for example, the first paragraph is a title, the second paragraph is an author, and the third paragraph is an author unit.
If the author writes specific identifiers in the process of writing, the model can determine identifiers corresponding to the paragraphs according to the existing identifiers, and if the paragraphs have no identifiers, the identifiers corresponding to the paragraphs can be calculated according to weights in the model.
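As an illustration of the two cases above, the following sketch first looks for an identifier written by the author and only falls back to a model prediction when none is found. It is an assumption-laden simplification: the embodiment lets the model itself handle both cases, whereas here a hypothetical wrapper function identify does the dispatch, and model_predict is a placeholder for the preset recognition model.

```python
# Identifiers taken from the example format in this description.
KNOWN_IDENTIFIERS = ("title", "author", "author unit")


def explicit_identifier(paragraph):
    """Return the identifier written at the start of the paragraph, or None."""
    head, sep, _ = paragraph.partition(":")
    candidate = head.strip().lower()
    if sep and candidate in KNOWN_IDENTIFIERS:
        return candidate
    return None


def identify(paragraph, model_predict):
    # Prefer the author's own identifier; otherwise defer to the model.
    return explicit_identifier(paragraph) or model_predict(paragraph)


print(identify("Title: A Study of X", lambda p: "paper content"))  # title
```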
The method provided by this embodiment is used for identifying a paper to be identified, and is executed by a device in which the method is deployed; the device is generally implemented in hardware and/or software.
The paper identification method provided by this embodiment comprises: obtaining a paper to be identified; and determining paragraph identifiers corresponding to the paper to be identified according to a preset recognition model, wherein the preset recognition model is obtained in advance by training on a paper training set. The method provided by this embodiment can identify the identifier corresponding to each paragraph in the paper based on the preset recognition model, so that the problem that the format of the paper needs to be edited manually in the prior art is solved.
Fig. 2 is a flowchart illustrating a paper identification method according to another exemplary embodiment of the present invention.
As shown in fig. 2, the method for identifying a thesis provided in this embodiment includes:
Step 201, obtaining a paper to be identified.
The specific principle and implementation of step 201 are similar to those of step 101, and are not described herein again.
Step 202, performing word segmentation processing on the paragraphs included in the paper to be identified to obtain the segmented words included in each paragraph.
In the method provided by this embodiment, instead of directly inputting the paper to be identified into the preset recognition model, word segmentation processing may first be performed on the paper, and the preset recognition model then determines the identifier corresponding to each paragraph based on the obtained segmented words.
A word segmentation algorithm can be preset, so that each paragraph included in the paper to be identified is subjected to word segmentation processing according to the word segmentation algorithm. Chinese word segmentation refers to segmenting a Chinese character sequence into individual words. A word segmentation algorithm is a process that recombines a continuous character sequence into a word sequence according to a certain specification. A segmentation dictionary can be set, so that the paragraphs in the paper are segmented according to the segmentation dictionary.
Specifically, the paper may include a plurality of paragraphs, and each paragraph may be subjected to word segmentation processing, so as to obtain the segmented words corresponding to each paragraph. In the word segmentation process, modal particles and other words without actual meaning, such as "o", can be removed, so that the paragraphs are identified only according to words with actual meaning, which reduces the amount of calculation in the paragraph recognition process.
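A dictionary-based segmentation step of this kind might look like the sketch below. The embodiment does not name a specific algorithm; forward maximum matching is used here purely as one well-known dictionary-based technique, and the dictionary entries and stop words are made-up examples.

```python
def segment(text, dictionary, stop_words):
    """Forward maximum matching against a segmentation dictionary."""
    max_len = max((len(word) for word in dictionary), default=1)
    words, pos = [], 0
    while pos < len(text):
        # Try the longest candidate first; fall back to a single character.
        for size in range(min(max_len, len(text) - pos), 0, -1):
            candidate = text[pos:pos + size]
            if size == 1 or candidate in dictionary:
                words.append(candidate)
                pos += size
                break
    # Drop words without actual meaning.
    return [word for word in words if word not in stop_words]


# Hypothetical dictionary and stop-word list, for illustration only.
dictionary = {"本文", "提出", "一种", "方法"}
stop_words = {"哦", "的"}
print(segment("本文提出一种方法哦", dictionary, stop_words))
# ['本文', '提出', '一种', '方法']
```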
Step 203, determining the identifier corresponding to each paragraph according to the preset recognition model and the segmented words.
Further, in the method provided by this embodiment, the preset recognition model may determine the identifier corresponding to a paragraph based on the segmented words in the paragraph. For example, the segmented words in each paragraph may be taken as a word combination and input into the preset recognition model, so that the recognition model determines the paragraph identifier corresponding to the word combination, that is, the identifier of the paragraph to which the word combination belongs.
In another embodiment, preset word banks may further be provided. Each preset word bank contains a plurality of words, and the words in the same preset word bank may be considered to belong to the same class. For example, synonyms or near-synonyms can be grouped to form one preset word bank.
The preset word bank to which each segmented word belongs can be determined, and the frequency with which the segmented words included in each paragraph belong to each preset word bank is determined. For a paragraph, a plurality of segmented words can be determined, and each segmented word has a corresponding preset word bank. Therefore, the frequency with which the segmented words included in the paragraph belong to each preset word bank can be counted. For example, if a paragraph includes 5 segmented words that belong to word banks A, A, B, C and C respectively, the preset word bank frequencies corresponding to the paragraph are A: 0.4, B: 0.2 and C: 0.4.
Specifically, the feature vector corresponding to each paragraph is determined according to the frequencies. The frequencies can be taken directly as the feature vector, such as (0.4, 0.2, 0.4). In general, words from every preset word bank are unlikely to all appear in the same paragraph. Therefore, in order to express the meaning of the paragraph more accurately through the feature vector, the frequency of a word bank that does not appear may be set to 0. For example, if the preset word banks also include a word bank D, but none of the segmented words in the paragraph falls into word bank D, the feature vector may be (0.4, 0.2, 0.4, 0).
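The worked example above (word banks A, A, B, C, C, plus an absent word bank D) can be reproduced with a short sketch. The function name, the word-to-bank mapping and the placeholder words w1 to w5 are hypothetical; skipping words that belong to no word bank is also an assumption, since the embodiment states that each segmented word has a corresponding word bank.

```python
from collections import Counter


def paragraph_feature_vector(words, word_to_bank, banks):
    """Frequency with which a paragraph's words fall into each preset word bank."""
    counts = Counter(word_to_bank[w] for w in words if w in word_to_bank)
    total = sum(counts.values())
    if total == 0:
        return tuple(0.0 for _ in banks)
    # Word banks that never appear get a frequency of 0.
    return tuple(counts[bank] / total for bank in banks)


word_to_bank = {"w1": "A", "w2": "A", "w3": "B", "w4": "C", "w5": "C"}
words = ["w1", "w2", "w3", "w4", "w5"]
print(paragraph_feature_vector(words, word_to_bank, ["A", "B", "C", "D"]))
# (0.4, 0.2, 0.4, 0.0)
```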
Further, using the feature vectors in place of the full text of the paragraphs reduces the amount of data to be processed by the recognition model.
In practical application, the identifier corresponding to a paragraph can be determined according to its feature vector based on the preset recognition model. The feature vector can be input into the preset recognition model, a calculation is performed on the feature vector according to the weight values in the preset recognition model, and the paragraph identifier corresponding to the feature vector, that is, the identifier of the paragraph from which the feature vector was derived, is output. For example, the output paragraph identifier may be "paper content".
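The embodiment does not specify the form of the model. Assuming, for illustration only, a linear model with one weight vector per identifier, the calculation on the feature vector could be sketched as a weighted sum followed by selecting the highest-scoring identifier. The weight values below are invented.

```python
def predict_identifier(vector, weights):
    """Return the identifier whose weight vector scores the feature vector highest."""
    def score(label):
        return sum(w * x for w, x in zip(weights[label], vector))
    return max(weights, key=score)


# Hypothetical weights over word banks A, B, C, D.
weights = {
    "title": [0.5, 0.0, 0.0, 0.0],
    "author": [0.0, 1.0, 0.0, 0.0],
    "paper content": [0.0, 0.0, 1.0, 1.0],
}
print(predict_identifier((0.4, 0.2, 0.4, 0.0), weights))  # paper content
```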
The method provided by the embodiment can perform word segmentation on the paper to be recognized, determine the feature vector corresponding to each paragraph based on the obtained word segmentation, and input the feature vector into the preset recognition model, so that the identifier corresponding to the paragraph is determined according to the preset recognition model, and the calculation amount of the recognition model can be reduced.
Fig. 3 is a flowchart illustrating a training method of a preset recognition model according to an exemplary embodiment of the present invention.
As shown in fig. 3, the training method of the preset recognition model provided in this embodiment includes:
Step 301, obtaining a paper training set.
The method provided by the embodiment can be executed by an electronic device with a computing function, such as a computer. The electronic equipment is used for training a preset recognition model.
Specifically, the electronic device may be only used for training a preset recognition model, and may also be used for recognizing a paper to be recognized, that is, the electronic device used for training may be the same as or different from the electronic device used for identifying the paper. The present embodiment does not limit this.
Further, a database for storing the training set of papers may be provided, and the database may be provided in the electronic device executing the method provided by the embodiment, or may be provided in other devices. For example, the database may be provided in a server connected to an electronic device that executes the method provided in the present embodiment. These papers can be obtained in advance, and paragraph identifiers corresponding to respective paragraphs in the papers can be determined.
Step 302, training the model according to the thesis training set to obtain a preset recognition model.
The paper training set comprises a plurality of training papers, and the paragraphs of the training papers are provided with preset identifiers.
Specifically, the model may be trained based on a paper set prepared in advance to obtain the preset recognition model.
Further, the weight values inside the model may be initialized. The papers included in the paper set are then input into the model; the model determines the identifier corresponding to each paragraph of a paper according to the current weight values, the determined result is compared with the preset paragraph identifier, and the weight values in the model are adjusted according to the comparison result.
In practical application, the papers can be input into the model one by one, the weight values in the model can be corrected once each time a paper is input, and an accurate model for identifying papers can be obtained through multiple rounds of adjustment.
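The initialize, predict, compare and adjust cycle described above can be sketched with a multiclass perceptron. This is a substituted, well-known technique chosen for brevity, not the training rule of the embodiment, which is left unspecified. For simplicity the sketch updates the weights per paragraph rather than per paper, and each paragraph is assumed to be already represented as a numeric vector.

```python
def predict(weights, vector):
    # Score each identifier with a weighted sum and return the best one.
    return max(weights, key=lambda label: sum(
        w * x for w, x in zip(weights[label], vector)))


def train(samples, labels, dim, epochs=10):
    """samples: list of (vector, preset_identifier) pairs."""
    weights = {label: [0.0] * dim for label in labels}  # initialize weights
    for _ in range(epochs):
        for vector, expected in samples:
            predicted = predict(weights, vector)
            if predicted != expected:
                # Adjust weights according to the comparison result.
                for i, x in enumerate(vector):
                    weights[expected][i] += x
                    weights[predicted][i] -= x
    return weights


samples = [((1, 0, 0), "title"), ((0, 1, 0), "author"),
           ((0, 0, 1), "paper content")]
weights = train(samples, ["title", "author", "paper content"], dim=3)
print([predict(weights, v) for v, _ in samples])
# ['title', 'author', 'paper content']
```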
A part of the papers in the paper training set can be used as a test set. After the model is trained, the papers in the test set can be input into the model, and the paragraph identifiers corresponding to the paragraphs in those papers are determined and compared with the preset identifiers. If the accuracy is higher than a preset threshold, the model can be considered accurate and can be used for identifying papers provided by users.
Specifically, the papers in the test set may be used for training the model, or may be used for testing only. The preset threshold value can be set according to requirements.
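The acceptance check on the test set can be sketched as below. The function names, the toy prediction function and the threshold of 0.6 are assumptions for illustration; the embodiment only states that the threshold is set according to requirements.

```python
def accuracy(predict_fn, test_set):
    """Fraction of test paragraphs whose predicted identifier matches the preset one."""
    if not test_set:
        raise ValueError("test set must not be empty")
    correct = sum(1 for paragraph, expected in test_set
                  if predict_fn(paragraph) == expected)
    return correct / len(test_set)


def is_model_accurate(predict_fn, test_set, threshold):
    # The model is accepted only if accuracy is higher than the threshold.
    return accuracy(predict_fn, test_set) > threshold


# Toy stand-in for a trained model: short paragraphs are titles.
toy_predict = lambda p: "title" if len(p.split()) <= 5 else "paper content"
test_set = [
    ("A Study of X", "title"),
    ("This paper proposes a method for identifying paragraphs", "paper content"),
    ("Jane Doe", "author"),
    ("We evaluate the method on several collections of papers", "paper content"),
]
print(accuracy(toy_predict, test_set))                 # 0.75
print(is_model_accurate(toy_predict, test_set, 0.6))   # True
```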
The method provided by this embodiment is used for training the preset recognition model, and is executed by a device in which the method is deployed; the device is generally implemented in hardware and/or software.
The training method of the preset recognition model provided by this embodiment comprises: obtaining a paper training set; and training the model according to the paper training set to obtain a preset recognition model, wherein the paper training set comprises a plurality of training papers, and the paragraphs of the training papers are provided with preset identifiers. According to the method provided by this embodiment, a model for identifying papers can be obtained by training the model with a paper training set whose paragraphs are provided with preset identifiers, so that the identifier corresponding to each paragraph in a paper can be identified based on the preset recognition model, and the problem that the format of the paper needs to be edited manually in the prior art is solved.
Fig. 4 is a flowchart illustrating a training method of a predetermined recognition model according to another exemplary embodiment of the present invention.
As shown in fig. 4, the training method of the preset recognition model provided in this embodiment includes:
Step 401, obtaining a paper training set.
The specific principle and implementation of step 401 are similar to those of step 301, and are not described herein again.
Step 402, performing word segmentation processing on the paragraphs included in the training papers to obtain the training segmented words included in each paragraph.
In the method provided by this embodiment, in the process of training the model, the training papers may not be input into the model directly; instead, word segmentation is performed on the training papers, and the model is then trained according to the obtained training segmented words.
A word segmentation algorithm can be preset, so that each paragraph included in a training paper is subjected to word segmentation processing according to the word segmentation algorithm. Chinese word segmentation refers to segmenting a Chinese character sequence into individual words. A word segmentation algorithm is a process that recombines a continuous character sequence into a word sequence according to a certain specification. A segmentation dictionary can be set, so that the paragraphs in the paper are segmented according to the segmentation dictionary.
Specifically, a training paper may include a plurality of paragraphs, and each paragraph may be subjected to word segmentation processing, so as to obtain the training segmented words corresponding to each paragraph. In the word segmentation process, modal particles and other words without actual meaning, such as "o", can be removed, so that the paragraphs are identified only according to words with actual meaning, which reduces the amount of calculation in the paragraph identification process.
Step 403, training the model according to the training segmented words to obtain a preset recognition model.
Further, in the method provided by this embodiment, the model may be trained based on the training segmented words in the paragraphs. For example, the weight values in the model may be initialized, and the training segmented words in each paragraph are then taken as a word combination and input into the model. The model may determine the paragraph identifier corresponding to the word combination according to the current weight values, that is, the identifier of the paragraph to which the word combination belongs. The determined identifier may be compared with the preset identifier of the paragraph, and the weight values in the model may be adjusted according to the comparison result. Through multiple rounds of adjustment, an accurate preset recognition model can be obtained.
In another embodiment, preset word banks may further be provided. Each preset word bank contains a plurality of words, and the words in the same preset word bank may be considered to belong to the same class. For example, synonyms or near-synonyms can be grouped to form one preset word bank.
The preset word bank to which each training segmented word belongs can be determined, and the frequency with which the training segmented words included in each paragraph belong to each preset word bank is determined. For a paragraph, a plurality of training segmented words can be determined, and each segmented word has a corresponding preset word bank. Therefore, the frequency with which the training segmented words included in the paragraph belong to each preset word bank can be counted. For example, if a paragraph includes 5 training segmented words that belong to word banks A, A, B, C and C respectively, the preset word bank frequencies corresponding to the paragraph are A: 0.4, B: 0.2 and C: 0.4.
Specifically, the feature vector corresponding to each paragraph is determined according to the frequencies. The frequencies can be taken directly as the feature vector, such as (0.4, 0.2, 0.4). In general, words from every preset word bank are unlikely to all appear in the same paragraph. Therefore, in order to express the meaning of a paragraph more accurately through the feature vector, the frequency of a word bank that does not appear may be set to 0. For example, if the preset word banks also include a word bank D, but none of the training segmented words in the paragraph falls into word bank D, the feature vector may be (0.4, 0.2, 0.4, 0).
Further, using the feature vectors in place of the full text of the paragraphs reduces the amount of data to be processed by the model.
In practical application, the model can be trained according to the training feature vectors corresponding to the paragraphs and the preset identifiers to obtain the preset recognition model. A feature vector may be input into the model, and a calculation may be performed on the feature vector according to the current weight values inside the model, so as to determine the paragraph identifier corresponding to the feature vector, that is, the identifier of the paragraph from which the feature vector was derived. For example, the determined paragraph identifier may be "paper content". The determined identifier may be compared with the preset identifier, and the weight values in the model may be adjusted according to the comparison result.
The method provided by this embodiment can perform word segmentation on the training paper, determine the feature vector corresponding to each paragraph based on the obtained training word segmentation, and then input the feature vector into the model, thereby determining the identifier corresponding to the paragraph according to the model, and then adjust the weight value in the model based on the preset identifier, so as to reduce the calculation amount in the model training process.
Fig. 5 is a block diagram illustrating a paper identification apparatus according to an exemplary embodiment of the present invention.
As shown in fig. 5, the paper identification apparatus provided in this embodiment includes:
an obtaining module 51, configured to obtain a paper to be identified;
a determining module 52, configured to determine, according to a preset recognition model, a paragraph identifier corresponding to the thesis to be recognized;
wherein, the preset recognition model is obtained by training according to a thesis training set in advance.
The paper identification device provided by this embodiment comprises an obtaining module and a determining module, wherein the obtaining module is used for obtaining a paper to be identified, and the determining module is used for determining paragraph identifiers corresponding to the paper to be identified according to a preset recognition model; the preset recognition model is obtained in advance by training on a paper training set. The device provided by this embodiment can identify the identifier corresponding to each paragraph in the paper based on the preset recognition model, so that the problem that the format of the paper needs to be edited manually in the prior art is solved.
The specific principle and implementation of the paper identification apparatus provided in this embodiment are similar to those of the embodiment shown in fig. 1, and are not described herein again.
Fig. 6 is a block diagram illustrating a paper identification apparatus according to another exemplary embodiment of the present invention.
As shown in fig. 6, on the basis of the embodiment shown in fig. 5, in the paper identification apparatus provided in this embodiment, the determining module 52 includes:
a word segmentation unit 521, configured to perform word segmentation processing on paragraphs included in the thesis to be identified, so as to obtain words included in each paragraph;
a determining unit 522, configured to determine, according to the preset recognition model and the word segmentation, an identifier corresponding to the paragraph.
The determining unit 522 is specifically configured to:
determining the preset word bank to which each segmented word belongs, and determining the frequency with which the segmented words included in each paragraph belong to each preset word bank;
determining a feature vector corresponding to each paragraph according to the frequency;
and determining the identifier corresponding to the paragraph according to the feature vector based on the preset recognition model.
The specific principle and implementation of the paper identification apparatus provided in this embodiment are similar to those of the embodiment shown in fig. 2, and are not described here again.
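One possible software arrangement of the modules and units described above is sketched below. The class and parameter names are hypothetical, a paper is assumed to be a list of paragraph strings, and the segmentation, vectorization and model functions are toy placeholders passed in as callables.

```python
class ObtainingModule:
    """Counterpart of obtaining module 51: supplies papers to be identified."""

    def __init__(self, papers):
        self._papers = list(papers)

    def obtain(self):
        # Earliest paper first.
        return self._papers.pop(0)


class DeterminingModule:
    """Counterpart of determining module 52 with its two units."""

    def __init__(self, segment_fn, vectorize_fn, model_fn):
        self.segment_fn = segment_fn      # role of word segmentation unit 521
        self.vectorize_fn = vectorize_fn  # feature-vector step of unit 522
        self.model_fn = model_fn          # preset recognition model

    def determine(self, paper):
        # One identifier per paragraph of the paper.
        return [self.model_fn(self.vectorize_fn(self.segment_fn(p)))
                for p in paper]


# Toy placeholders: split on spaces, use word count as the "feature".
obtaining = ObtainingModule([["A Study", "This paper proposes a new method"]])
determining = DeterminingModule(
    str.split, len, lambda n: "title" if n <= 3 else "paper content")
print(determining.determine(obtaining.obtain()))
# ['title', 'paper content']
```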
Fig. 7 is a block diagram illustrating a training apparatus for presetting a recognition model according to an exemplary embodiment of the present invention.
As shown in fig. 7, the training apparatus for presetting a recognition model according to this embodiment includes:
an obtaining module 71, configured to obtain a thesis training set;
a training module 72, configured to train a model according to the thesis training set to obtain a preset recognition model;
wherein the paper training set comprises a plurality of training papers, and the paragraphs of the training papers are provided with preset identifiers.
The training device for the preset recognition model provided by this embodiment comprises an obtaining module and a training module, wherein the obtaining module is used for obtaining a paper training set, and the training module is used for training the model according to the paper training set to obtain a preset recognition model; the paper training set comprises a plurality of training papers, and the paragraphs of the training papers are provided with preset identifiers. The apparatus provided in this embodiment can obtain a model for identifying papers by training the model with a paper training set whose paragraphs are provided with preset identifiers, so that the identifier corresponding to each paragraph in a paper can be identified based on the preset recognition model, and the problem that the format of the paper needs to be edited manually in the prior art is solved.
The specific principle and implementation of the training apparatus for presetting a recognition model provided in this embodiment are similar to those of the embodiment shown in fig. 3, and are not described herein again.
Fig. 8 is a block diagram illustrating a training apparatus for presetting a recognition model according to another exemplary embodiment of the present invention.
As shown in fig. 8, on the basis of the embodiment shown in fig. 7, in the training apparatus for presetting a recognition model provided in this embodiment, the training module 72 includes:
a word segmentation unit 721, configured to perform word segmentation processing on the paragraphs included in the training papers to obtain the training segmented words included in each paragraph;
a training unit 722, configured to train a model according to the training segmented words to obtain the preset recognition model.
The training unit 722 is specifically configured to:
determining the preset word bank to which each training segmented word belongs, and determining the frequency with which the training segmented words included in each paragraph belong to each preset word bank;
determining the training feature vector corresponding to each paragraph according to the frequency;
and training the model according to the training feature vectors corresponding to the paragraphs and the preset identifiers to obtain the preset recognition model.
Fig. 9 is a block diagram illustrating a paper identification apparatus according to an exemplary embodiment of the present invention.
As shown in fig. 9, the paper identification device provided in this embodiment includes:
a memory 91;
a processor 92; and
a computer program;
wherein said computer program is stored in said memory 91 and configured to be executed by said processor 92 to implement any of the paper identification methods as described in fig. 1-2.
Fig. 10 is a block diagram illustrating a training apparatus for presetting a recognition model according to an exemplary embodiment of the present invention.
As shown in fig. 10, the training apparatus for presetting a recognition model provided in this embodiment includes:
a memory 1001;
a processor 1002; and
a computer program;
wherein the computer program is stored in the memory 1001 and configured to be executed by the processor 1002 to implement any of the training methods of the preset recognition model as described in fig. 3-4.
The present embodiments also provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program is executed by a processor to implement any of the paper identification methods described in fig. 1-2.
The present embodiments also provide a computer-readable storage medium having a computer program stored thereon, wherein the computer program is executed by a processor to implement any of the training methods of the preset recognition model as described in fig. 3-4.
Those of ordinary skill in the art will understand that all or a portion of the steps for implementing the above-described method embodiments may be performed by hardware under the control of program instructions. The program may be stored in a computer-readable storage medium. When executed, the program performs the steps of the method embodiments described above. The aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks or optical disks.
Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.