WO2023142809A1 - Text classification and text processing method and apparatus, computer device, and storage medium (文本分类、文本处理方法、装置、计算机设备及存储介质) - Google Patents


Info

Publication number
WO2023142809A1
Authority
WO
WIPO (PCT)
Prior art keywords
text
topic
target
label
tag
Prior art date
Application number
PCT/CN2022/141171
Other languages
English (en)
French (fr)
Inventor
黄骏键
潘桂波
李彦辉
Original Assignee
北京字节跳动网络技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 北京字节跳动网络技术有限公司 (Beijing ByteDance Network Technology Co., Ltd.)
Publication of WO2023142809A1 publication Critical patent/WO2023142809A1/zh

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/335 Filtering based on additional data, e.g. user or group profiles
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/12 Use of codes for handling textual entities
    • G06F40/151 Transformation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Definitions

  • The present disclosure relates to the field of computer technology, and in particular to a text classification method, a text processing method, an apparatus, a computer device, and a storage medium.
  • Users can search for books of interest in reading software, but the content recalled by existing search schemes consists of the book-recommendation topics that match the search keywords; the books recommended in those topics may be irrelevant to the books the user actually wants to find, or some relevant book-recommendation topics may be missing from the recalled content. As a result, users cannot search for satisfactory books, which in turn degrades their reading experience with the reading software.
  • Embodiments of the present disclosure provide at least a text classification method, a text processing method, an apparatus, a computer device, a storage medium, a computer program product, and a computer program.
  • an embodiment of the present disclosure provides a text classification method applied to a server, including:
  • the target text feature includes a plurality of sub-text features, each sub-text feature corresponding to a first unit text in the topic text to be classified; the determining of the label correlation between the target text feature and each label description feature includes:
  • performing, based on the correlation coefficient of each first unit text, a weighted summation over the sub-text features of the first unit texts, and determining the label correlation according to the result of the calculation.
  • the determining of the correlation coefficient of each first unit text based on the target text feature and the label description feature includes:
  • determining a first sub-correlation coefficient of each first unit text based on its sub-text feature; determining a second sub-correlation coefficient based on the target text feature and the label description feature; and determining the correlation coefficient from the ratio between the first sub-correlation coefficient and the second sub-correlation coefficient.
  • the determining of the first sub-correlation coefficient of each first unit text based on its sub-text feature includes:
  • the label description feature includes a plurality of second unit texts; the determining of the second sub-correlation coefficient based on the target text feature and the label description feature includes:
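  • The ratio-based coefficient described in the items above can be sketched in code. The following Python fragment is purely illustrative: the disclosure does not reproduce the exact formulas for the first and second sub-correlation coefficients, so a softmax-style form (exponential per-unit score divided by a shared normaliser) is assumed here, and all function and variable names are hypothetical.

```python
import math

def correlation_coefficients(sub_text_feats, label_feat):
    # Assumed form: first sub-correlation coefficient = exp(score of one
    # first unit text against the label description feature); second
    # sub-correlation coefficient = sum of all first sub-coefficients.
    def dot(a, b):
        return sum(x * y for x, y in zip(a, b))
    firsts = [math.exp(dot(f, label_feat)) for f in sub_text_feats]
    second = sum(firsts)
    # Correlation coefficient = ratio of the first to the second.
    return [f / second for f in firsts]
```

Under this assumption the coefficients sum to one, which makes the later weighted summation a convex combination of the sub-text features.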
  • the acquiring of the topic text to be classified and the tag description information of at least one topic tag to be predicted includes:
  • segmenting the original text data to obtain the topic text to be classified and the tag description information.
  • the extracting of the target text features of the topic text to be classified includes:
  • the topic text to be classified includes at least one of the following: topic title text, topic abstract text, and topic tag description text.
  • the extracting of the target text features of the topic text to be classified and of the label description features of the tag description information of each topic tag to be predicted includes: extracting, through the feature extraction layer of a text classification model, the target text features of the topic text to be classified and the label description features of the tag description information of each topic tag to be predicted;
  • the determining of the label correlation between the target text feature and each label description feature to obtain at least one label correlation includes:
  • the method also includes:
  • each training sample contains a topic tag to be predicted and a topic text to be trained, and each training sample carries a matching label, where the matching label is used to indicate whether the topic tag to be predicted matches the topic text to be trained; the text classification model to be trained is trained with the plurality of training samples to obtain the text classification model.
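  • As a rough illustration of how such training samples might be assembled, the hypothetical Python sketch below pairs each topic text with each candidate topic tag and attaches a binary matching label; the actual sampling scheme is not specified in the disclosure, and all names are illustrative.

```python
def build_training_samples(topic_texts, topic_tags, positive_pairs):
    # positive_pairs: set of (text, tag) pairs known to match.
    samples = []
    for text in topic_texts:
        for tag in topic_tags:
            match = 1 if (text, tag) in positive_pairs else 0  # matching label
            samples.append({"text": text, "tag": tag, "match": match})
    return samples
```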
  • the training of the text classification model to be trained with the plurality of training samples to obtain the text classification model includes:
  • the embodiment of the present disclosure also provides a text processing method applied to a terminal device, including:
  • displaying an operation page of topic text; receiving target data input by the user on the operation page, where the target data includes: the topic text to be published, or a topic tag of interest; obtaining a screening result determined by the server based on the target data, where the screening result is a result obtained after the server screens, based on the text classification method described in any one of the above first aspects, the data to be screened determined from the target data; and displaying, on the operation page, the target data and/or the screening result of the target data.
  • the target data includes the topic text to be published; the displaying of the target data and/or the screening result of the target data on the operation page includes:
  • the method also includes:
  • displaying the modified target topic tag on the page, where the modifying operation includes at least one of the following: adding, deleting, and modifying.
  • the target data includes the topic tags of interest; the method further includes:
  • the target data includes topic tags of interest;
  • the displaying of an operation page of topic text includes:
  • the target data includes the topic tags of interest; the displaying the target data and/or the screening results of the target data on the operation page includes:
  • the topic tags of interest are displayed in the title display area of the operation page; the key topic content of the published topic text matching each of the topic tags of interest is displayed in the text display area of the operation page.
  • the method also includes:
  • the embodiment of the present disclosure also provides a text classification device applied to a server, including:
  • the first acquisition unit is configured to acquire the topic text to be classified and the tag description information of at least one topic tag to be predicted;
  • the extraction unit is configured to extract the target text features of the topic text to be classified and to extract the label description features of the tag description information of each topic tag to be predicted;
  • the first determination unit is configured to determine the label correlation between the target text feature and each label description feature, to obtain at least one label correlation;
  • the second determination unit is configured to determine, based on the at least one label correlation, a target topic tag matching the topic text to be classified among the at least one topic tag to be predicted.
  • an embodiment of the present disclosure further provides a text processing device, which is applied to a terminal device, including:
  • the first display unit is configured to display an operation page of topic text; the receiving unit is configured to receive target data input by the user on the operation page, where the target data includes: the topic text to be published, or a topic tag of interest; the second acquisition unit is configured to acquire a screening result determined by the server based on the target data, where the screening result is a result obtained after the server screens, based on the text classification method described in any one of the above first aspects, the data to be screened determined from the target data; and the second display unit is configured to display the target data and/or the screening result of the target data on the operation page.
  • An embodiment of the present disclosure further provides a computer device, including: a processor, a memory, and a bus, where the memory stores machine-readable instructions executable by the processor; when the computer device runs, the processor communicates with the memory through the bus, and when the machine-readable instructions are executed by the processor, the steps in any one of the possible implementations of the first aspect to the second aspect above are executed.
  • Embodiments of the present disclosure further provide a computer-readable storage medium on which a computer program is stored, where the computer program, when run by a processor, executes the steps in any one of the possible implementations of the first aspect to the second aspect above.
  • An embodiment of the present disclosure further provides a computer program product, including a computer program stored in a readable storage medium; at least one processor of an electronic device can read the computer program from the readable storage medium and execute it, so that the electronic device executes the steps in any one of the possible implementations of the first aspect to the second aspect above.
  • An embodiment of the present disclosure further provides a computer program stored in a readable storage medium; at least one processor of an electronic device can read the computer program from the readable storage medium and execute it, so that the electronic device executes the steps in any one of the possible implementations of the first aspect to the second aspect above.
  • The embodiments of the present disclosure provide a text classification method, a text processing method, an apparatus, a computer device, and a storage medium.
  • The topic text to be classified and the tag description information of at least one corresponding topic tag to be predicted can be obtained; the target text features of the topic text to be classified can be extracted, and the tag description features of the tag description information of each topic tag to be predicted can be extracted.
  • After that, the tag correlation between the target text feature and each tag description feature can be determined; finally, a target topic tag matching the topic text to be classified can be determined among the at least one topic tag to be predicted based on the tag correlations.
  • In this way, the topic tag of a book-recommendation topic can be determined more accurately, and the classification accuracy of book-recommendation topics can be improved, so that satisfactory books can be pushed to users more accurately, thereby improving the users' reading experience.
  • FIG. 1 shows a flowchart of a text classification method provided by an embodiment of the present disclosure
  • FIG. 2 shows a schematic diagram of segmentation processing of the original text data based on the data segmentation position provided by an embodiment of the present disclosure
  • FIG. 3 shows a frame structure diagram of a text classification model corresponding to a text classification method provided by an embodiment of the present disclosure
  • FIG. 4 shows a flowchart of a text processing method provided by an embodiment of the present disclosure
  • FIG. 5 shows a schematic diagram of an operation page of a topic text provided by an embodiment of the present disclosure
  • FIG. 6 shows a schematic diagram of a page of a hashtag to be selected provided by an embodiment of the present disclosure
  • FIG. 7 shows a schematic diagram of a display page when displaying target data provided by an embodiment of the present disclosure
  • FIG. 8 shows a schematic diagram of a text classification device provided by an embodiment of the present disclosure
  • FIG. 9 shows a schematic diagram of a text processing device provided by an embodiment of the present disclosure.
  • FIG. 10 shows a schematic diagram of a computer device provided by an embodiment of the present disclosure
  • FIG. 11 shows a schematic diagram of another computer device provided by an embodiment of the present disclosure.
  • Users can search for books of interest in reading software, but the content recalled by existing search schemes consists of the book-recommendation topics that match the search keywords; the books recommended in those topics may be irrelevant to the books the user actually wants to find, or some relevant book-recommendation topics may be missing from the recalled content. As a result, users cannot search for satisfactory books, which in turn degrades their reading experience with the reading software.
  • the present disclosure provides a text classification, text processing method, device, computer equipment and storage medium.
  • The topic text to be classified and the tag description information of at least one corresponding topic tag to be predicted can be obtained; the target text features of the topic text to be classified can be extracted, and the tag description features of the tag description information of each topic tag to be predicted can be extracted.
  • After that, the tag correlation between the target text feature and each tag description feature can be determined; finally, a target topic tag matching the topic text to be classified can be determined among the at least one topic tag to be predicted based on the tag correlations.
  • In this way, the topic tag of a book-recommendation topic can be determined more accurately, and the classification accuracy of book-recommendation topics can be improved, so that satisfactory books can be pushed to users more accurately, thereby improving the users' reading experience.
  • The execution subject of the text classification and text processing methods provided in the embodiments of the present disclosure is generally a computer device with a certain computing capability, such as a terminal device, a server, or other processing equipment.
  • the text classification and text processing methods may be implemented by a processor invoking computer-readable instructions stored in a memory.
  • Referring to FIG. 1, a flowchart of a text classification method provided by an embodiment of the present disclosure is shown.
  • The method is applied to a server and includes steps S101 to S107, wherein:
  • S101 Acquire topic text to be classified and tag description information of at least one topic tag to be predicted.
  • The text classification method provided by the embodiments of the present disclosure can be applied to a server of book- or article-reading software.
  • When using the reading software, users can obtain the books and articles they want to browse by posting, or communicate with other users through their posts.
  • the topic text to be classified may be the text edited by the current user through the reading software, and may also be the text edited by other users through the reading software.
  • the above topic text to be classified may be the post content input by the user through reading software.
  • tag description information corresponding to at least one topic tag to be predicted may be determined for the topic text to be classified.
  • A plurality of topic tags may be preset; all of the preset topic tags may then be determined as the aforementioned at least one topic tag to be predicted.
  • preliminary screening may be performed on preset topic tags to obtain at least one topic tag to be predicted.
  • the specific screening principle may be as follows: among the preset topic tags, the topic tags containing the characteristic information of the topic text to be classified are selected as at least one topic tag to be predicted. At this time, the at least one topic tag to be predicted may contain feature information corresponding to the topic text to be classified.
  • the feature information corresponding to the topic text to be classified can be "romance” and "novel".
  • at least one topic tag to be predicted corresponding to the topic text to be classified may include "romance” and/or "novel”.
  • Each topic tag to be predicted may further have tag description information for annotating that topic tag to be predicted.
  • For example, if the topic tag to be predicted is "sports", the tag description information corresponding to that topic tag may include texts such as exercise, boxing, athletics, basketball, and football.
  • S103 Extract target text features of the topic text to be classified, and extract tag description features of tag description information of each topic tag to be predicted.
  • The feature extraction layer of a text classification model can be used to perform feature extraction on the topic text to be classified to obtain the corresponding target text features, and on each piece of tag description information to obtain the corresponding tag description features.
  • the data format of the extracted target text feature and tag description feature may be a vector, for example, a text representation vector and a tag representation vector.
  • The label correlation can be determined based on the text representation vector and the label representation vector. Determining the label correlation from data in vector form simplifies and facilitates the process of comparing the correlation between the target text feature and the label description feature.
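  • For illustration only, a correlation between a text representation vector and a tag representation vector could be computed as a dot product; the disclosure does not fix the similarity measure, so this particular choice is an assumption.

```python
def label_correlation(text_vec, tag_vec):
    # Dot product as one simple vector-space correlation measure.
    return sum(t * g for t, g in zip(text_vec, tag_vec))
```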
  • the text classification model includes: an input layer, an embedding layer, and a feature extraction layer, wherein the input layer, the embedding layer, and the feature extraction layer are connected in series.
  • After the input layer acquires the topic text to be classified and the tag description information, it can convert the text in them into one-hot encodings.
  • The embedding layer can then convert the one-hot encoding corresponding to the topic text to be classified and the one-hot encoding corresponding to the tag description information into word vectors.
  • After the feature extraction layer obtains the word vectors, it can perform feature extraction on them to obtain the target text features of the topic text to be classified and the label description features of the tag description information.
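  • The input-layer-to-embedding-layer conversion described above can be sketched as follows; the vocabulary size and embedding matrix here are tiny, hypothetical stand-ins for the model's learned parameters.

```python
def one_hot(index, vocab_size):
    # Input layer: token index -> one-hot encoding.
    vec = [0.0] * vocab_size
    vec[index] = 1.0
    return vec

def embed(one_hot_vec, embedding_matrix):
    # Embedding layer: multiplying a one-hot vector by the embedding
    # matrix selects one row, i.e. the token's word vector.
    return embedding_matrix[one_hot_vec.index(1.0)]
```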
  • S105 Determine a label correlation between the target text feature and each of the label description features to obtain at least one label correlation.
  • the tag correlation between the target text feature and each tag description feature can be calculated through correlation calculation.
  • the target text features and label description features can be fused through the fusion layer in the text classification model, so as to determine the label correlation between the target text features and the label description features according to the result of the fusion operation.
  • the input of the fusion layer is connected with the output of the feature extraction layer of the text classification model.
  • the above-mentioned tag correlation can be expressed as a correlation representation vector; wherein, the correlation representation vector is used to represent the tag correlation between the topic text to be classified and the corresponding topic tag to be predicted.
  • the correlation representation vector can be normalized, so that a value within the range of 0 to 1 can be obtained after normalization.
  • the value is used to represent the correlation probability between the topic text to be classified and the corresponding topic label to be predicted.
  • the correlation representation vector can be input to the binary classification layer in the text classification model for mapping processing, so that the correlation representation vector is mapped to a value within the range of 0 to 1.
  • The binary classification layer includes a fully connected layer and a Sigmoid layer, and the fully connected layer and the Sigmoid layer are connected in sequence.
  • The correlation representation vector can be processed sequentially through the fully connected layer and the Sigmoid layer to obtain the normalized correlation probability.
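  • The fully-connected-plus-Sigmoid mapping can be sketched as below; the weights and bias are illustrative placeholders for learned parameters.

```python
import math

def binary_classification_layer(corr_vec, weights, bias):
    # Fully connected layer: weighted sum of the correlation
    # representation vector plus a bias.
    logit = sum(w * x for w, x in zip(weights, corr_vec)) + bias
    # Sigmoid layer: map the logit into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-logit))
```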
  • The input of the binary classification layer in the text classification model is connected to the output of the fusion layer.
  • S107 Based on at least one of the tag correlations, determine a target topic tag matching the topic text to be classified among at least one topic tag to be predicted.
  • A corresponding correlation representation vector can be determined for each topic tag to be predicted.
  • Normalization processing may be performed on each correlation representation vector to obtain at least one correlation probability, where each correlation probability may be a probability value ranging from 0 to 1.
  • Each correlation probability is used to characterize the degree of correlation (or similarity) between the topic text to be classified and the corresponding topic tag to be predicted.
  • The at least one correlation probability can then be screened to determine the correlation probabilities that meet the probability requirement.
  • The probability requirement can be understood as being greater than or equal to a preset probability threshold.
  • That is, a correlation probability greater than or equal to the preset probability threshold may be determined as a correlation probability that meets the probability requirement.
  • The topic tags to be predicted corresponding to the correlation probabilities that meet the probability requirement may then be determined, and those topic tags are determined as the target topic tags.
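  • The screening step above amounts to a simple threshold filter, sketched below with an illustrative preset threshold of 0.5.

```python
def select_target_tags(tags, probabilities, threshold=0.5):
    # Keep each topic tag to be predicted whose correlation probability
    # is greater than or equal to the preset probability threshold.
    return [tag for tag, p in zip(tags, probabilities) if p >= threshold]
```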
  • In this way, the target topic tag is determined from the topic tags to be predicted, so that the corresponding topic tag can be determined more accurately for the topic text to be classified, thereby improving the accuracy of topic classification of the topic text to be classified.
  • When the topic text to be classified is a book-recommendation topic associated with book recommendation, the topic tag of the book-recommendation topic can be determined more accurately and the classification accuracy of book-recommendation topics can be improved, so that satisfactory books can be pushed to users more accurately, thereby improving the users' reading experience.
  • The above step S101 of obtaining the topic text to be classified and the tag description information of at least one topic tag to be predicted specifically includes the following process:
  • the raw text data to be processed may be composed of multiple parts.
  • the raw text data to be processed may include: topic text to be classified, and tag description information of at least one topic tag to be predicted.
  • each part of the original text data may correspond to a different text type identifier.
  • The original text data contains multiple text blocks, and each text block has a corresponding data identification bit (segment id), where the data identification bit is used to indicate the text type identifier of the corresponding text block.
  • The segment id of each text block in the original text data can be identified to obtain the text type identifier it indicates.
  • For example, the value of the text type identifier indicated by the segment id of the text blocks to which the topic text belongs can be set to 0, and the value of the text type identifier indicated by the segment id of the text blocks to which the tag description information belongs can be set to 1.
  • the data segmentation position of the original text data may be determined based on the identification value of the text type identification, and the original text data may be segmented based on the data segmentation position.
  • When the original text data is segmented, it can first be segmented according to the above text type identifiers to obtain the topic text to be classified and the tag description information.
  • the first delimiter [SEP] may be inserted into the original text data according to the identification value of the text type identification, and the original text data may be segmented based on the first delimiter.
  • When it is detected that the identifier values of any two consecutive text type identifiers differ, the first separator [SEP] is inserted between the two corresponding text blocks, and the original text data is then segmented by the first separator [SEP].
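  • The [SEP]-based segmentation described above can be sketched as follows; the token and segment-id values are hypothetical.

```python
def split_by_segment_id(tokens, segment_ids, sep="[SEP]"):
    # Insert the first separator wherever two consecutive tokens carry
    # different text type identifiers (e.g. 0 = topic text, 1 = tag
    # description information), then split on that separator.
    marked = []
    for i, tok in enumerate(tokens):
        if i > 0 and segment_ids[i] != segment_ids[i - 1]:
            marked.append(sep)
        marked.append(tok)
    parts, cur = [], []
    for tok in marked:
        if tok == sep:
            parts.append(cur)
            cur = []
        else:
            cur.append(tok)
    parts.append(cur)
    return parts
```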
  • a second delimiter can also be inserted between different types of text blocks of the topic text to be classified in advance, and then the topic text to be classified can be further divided by the second delimiter.
  • The above-mentioned original text data includes: the topic text to be classified and the tag description information (which may also be denoted as description).
  • The topic text to be classified includes at least one of the following: topic title text (which may also be denoted as title) and topic abstract text (which may also be denoted as abstract); the topic title text may be the title of the topic text to be classified, and the topic abstract text may be an introduction to the content of the topic text to be classified.
  • The different types of text blocks of the topic text to be classified can be understood as: the text blocks belonging to the topic title text, and the text blocks belonging to the topic abstract text.
  • The original text data can be divided into different text blocks (each of which may also be denoted as a token), so that the BERT (Bidirectional Encoder Representations from Transformers) model, that is, the feature extraction layer, can be used to process the original text data.
  • The BERT model can perform feature extraction on the original text data to obtain the target text features corresponding to the topic text to be classified and the tag description features corresponding to the tag description information.
  • The above target text feature may be denoted as the topic vector (text representation vector), and the above label description feature may be denoted as the description vector (label representation vector); as shown in FIG. 2, the target text feature and the label description feature are each composed of sub-vectors.
  • In this way, the target text features of the topic text to be classified and the label description features of the topic tags to be predicted can be separated quickly, thereby improving the efficiency of determining the label correlation between the topic text to be classified and the topic tags to be predicted.
  • step S103 extracting the target text features of the topic text to be classified, specifically includes the following process:
  • the topic text to be classified may be divided to obtain a plurality of first unit texts.
  • the length of the target vector corresponding to each first unit text may be determined by the text length contained in the first unit text, and the text lengths contained in a plurality of first unit texts of the topic text to be classified may be different .
  • the length of the text included in the first unit of text can be divided into four types: character, phrase, sentence, and paragraph.
  • the above-mentioned preset unit of text may be a preset text used to filter the first unit of text, wherein the number of the preset unit of text may be multiple.
  • the target vector corresponding to each first unit text can be determined, and the mapping relationship between the target vectors and each preset unit text can be respectively determined.
  • Based on the mapping relationship, the sub-vectors in the target vector that match the preset unit texts (that is, the sub-vectors of the target text feature in FIG. 2) can be determined as the above key feature vectors, and the target text features can then be determined from the determined key feature vectors.
  • the first unit text corresponding to the sub-vector determined in the target vector that matches the preset unit text may also be “science fiction”.
  • the sub-vectors in the target vector and the preset unit text may not exactly match.
• for example, when the first unit text is "science and technology" and the sub-vector corresponding to the first unit text only partially matches the preset unit text, the text feature corresponding to the first unit text "science and technology" can still be determined as part of the target text feature.
  • the key feature vectors in the target vector can be extracted, and the irrelevant content can be filtered, thereby reducing the amount of computation and improving the efficiency of determining the features of the target text.
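As a rough illustration of the filtering described above, the following Python sketch keeps only the sub-vectors of the target vector that match a preset unit text; the cosine-similarity criterion, the threshold value, and all names are assumptions, since the patent does not fix the matching rule.

```python
import numpy as np

def extract_key_features(sub_vectors, preset_vectors, threshold=0.5):
    """Keep only the sub-vectors that match at least one preset unit
    text vector (hypothetical cosine-similarity matching; the patent
    does not specify the matching criterion)."""
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    key_vectors = []
    for v in sub_vectors:
        # a sub-vector is treated as a key feature vector if it is
        # close enough to any preset unit text vector
        if any(cosine(v, p) >= threshold for p in preset_vectors):
            key_vectors.append(v)
    return key_vectors
```

Filtering irrelevant sub-vectors in this way is what reduces the amount of computation in the later correlation step.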
• the above step S105, determining the label correlation between the target text features and each of the label description features, specifically includes the following process:
• (1) based on the target text features and the label description features, determine the correlation coefficient of each of the first unit texts, wherein the correlation coefficient is used to characterize the degree of correlation between the first unit text and the corresponding topic label to be predicted;
  • a fusion operation may be performed on the target text features and the tag description features, so as to obtain the tag correlation.
• the correlation coefficient of each first unit text in the topic text to be classified can be determined, wherein the correlation coefficient can be used to characterize the degree of label correlation between each first unit text and the corresponding topic label to be predicted.
• the correlation coefficient of the i-th first unit text can be determined, where D is the weight extraction matrix learned during the training process of the text classification model.
  • the weighted summation calculation can be performed on the sub-text features of each first unit text based on the correlation coefficient, so as to obtain the tag correlation.
• the products of the correlation coefficients and sub-text features of all the first unit texts can be summed to obtain the label correlation, wherein the above-mentioned label correlation can be recorded as R.
• denoting the correlation coefficient of the i-th first unit text as a i and its sub-text feature as X i , the process of the weighted sum calculation can be written as: R = Σ i a i · X i .
• the accuracy of the tag correlation can be improved by calculating the correlation coefficient between each first unit text in the target text features and the tag description features, and performing a weighted summation based on the correlation coefficients to obtain the tag correlation.
  • the above step: determining the correlation coefficient of each of the first unit texts based on the target text features and the label description features specifically includes the following process:
• the transposition result of the sub-text feature X i of the i-th first unit text can be determined, where T denotes the transposition of the sub-text feature X i .
• the above-mentioned first sub-correlation coefficient can be determined based on the transposition result, where D is the weight extraction matrix learned during the training process of the text classification model (i.e., the preset weight matrix described below).
• the above-mentioned second sub-correlation coefficient can be determined based on the target text features and the label description features.
• the summation index j in the second sub-correlation coefficient runs over i + k terms, where i represents the number of first unit texts and k represents the number of second unit texts in the tag description information.
• the correlation coefficient of each first unit text can be determined based on the ratio of the first sub-correlation coefficient to the second sub-correlation coefficient.
  • the accuracy of the tag correlation can be improved by determining the above-mentioned correlation coefficient through the first sub-correlation coefficient and the second sub-correlation coefficient.
  • the above step: determining the first sub-correlation coefficient of the first unit text based on the sub-text features of each first unit text specifically includes the following process:
• the first weight w i of the first unit text can be determined, wherein the first weight w i can be used to characterize the fusion weight of the sub-text features of the first unit text in the target text features.
  • the first sub-correlation coefficient can be determined based on the first weight.
• the preset weight matrix D can be obtained, and then the first weight w i of each first unit text can be determined by a calculation formula based on D and the sub-text feature of the first unit text.
• the first sub-correlation coefficient corresponding to the first unit text can be determined based on the first weight.
  • the first sub-correlation coefficient of each first unit text is determined by determining the first weight of each first unit text in the target text feature, thereby improving the accuracy of the correlation coefficient.
• the above step: determining the second sub-correlation coefficient based on the target text features and the tag description features specifically includes the following process:
• the second weight may be determined based on the sub-text features in the target text features and the preset weight matrix D. Afterwards, the third weight can be determined based on the label description features and the preset weight matrix.
• the second sub-correlation coefficient can be determined based on the second weight and the third weight.
• the second sub-correlation coefficient can be expressed as the result of summing the second weights determined based on each first unit text and the third weights determined based on each second unit text.
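The correlation-coefficient computation described above (exponentiated weights normalized over the i first unit texts plus the k second unit texts, followed by a weighted sum) can be sketched as below. The bilinear scoring form against a pooled label vector is an assumption; the patent only states that the weights are derived from the preset weight matrix D.

```python
import numpy as np

def label_correlation(X, Y, D):
    """Attention-style fusion sketch.

    X : (i, d)  sub-text features of the first unit texts
    Y : (k, d)  sub-features of the second unit texts (label description)
    D : (d, d)  learned weight extraction matrix

    Formula details beyond the patent text (bilinear scoring against
    the mean label vector, exponential weighting) are assumptions.
    """
    y = Y.mean(axis=0)                  # pooled label description vector
    first = np.exp(X @ D @ y)           # exponentiated first weights w_i
    third = np.exp(Y @ D @ y)           # exponentiated third weights (second unit texts)
    denom = first.sum() + third.sum()   # second sub-correlation coefficient (i + k terms)
    a = first / denom                   # correlation coefficient of each first unit text
    R = (a[:, None] * X).sum(axis=0)    # weighted sum -> label correlation vector R
    return a, R
```

The ratio `first / denom` is the first sub-correlation coefficient divided by the second, and the final line is the weighted summation R = Σ a_i · X_i described above.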
• step S103, extracting the target text features of the topic text to be classified and extracting the tag description features of the tag description information of each topic tag to be predicted, includes: extracting the target text features of the topic text to be classified through the feature extraction layer in the text classification model, and extracting the tag description features of the tag description information of each topic tag to be predicted.
  • FIG. 3 is a frame structure diagram of a text classification model in the text classification method provided by the embodiment of the present disclosure.
  • the text classification model includes: a feature extraction network, a fusion layer and a classification layer (that is, a binary classification layer); wherein, the feature extraction network includes: an input layer, an embedding layer and a feature extraction layer.
  • the extraction process of the feature extraction network to extract the target text features is as follows:
• Input layer: after the topic text to be classified is obtained, it is input to the input layer for processing. The input layer can then convert the topic text to be classified into a one-hot encoding, so that each unit text in the topic text to be classified is converted into a fixed-dimensional vector composed of 0s and 1s.
• Embedding layer: after the one-hot encoding of the topic text to be classified is obtained, the one-hot encoding can be converted into a word vector corresponding to the topic text to be classified, and the one-hot encoding of the label description information can be converted into a word vector corresponding to the label description information.
  • the one-hot encoding can be converted into a corresponding word vector through the word2vec model.
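A minimal sketch of the input and embedding layers described above: one-hot encoding followed by a word2vec-style lookup, which is equivalent to multiplying the one-hot codes by a learned embedding table. The vocabulary and the embedding table here are illustrative placeholders.

```python
import numpy as np

def one_hot_encode(tokens, vocab):
    """Input layer: map each unit text to a fixed-dimensional
    vector composed of 0s and 1s."""
    enc = np.zeros((len(tokens), len(vocab)), dtype=int)
    for row, tok in enumerate(tokens):
        enc[row, vocab[tok]] = 1
    return enc

def embed(one_hot, embedding_matrix):
    """Embedding layer: a word2vec-style lookup is just the product
    of the one-hot codes with a learned embedding table."""
    return one_hot @ embedding_matrix
```

In practice the embedding table would be the trained word2vec weights rather than the toy matrix used here.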
• Feature extraction layer: after the word vector corresponding to the above-mentioned topic text to be classified and the word vector corresponding to the label description information are obtained, feature extraction can be performed on the word vectors, so as to obtain a text representation vector expressing the content of the topic text to be classified and a tag representation vector corresponding to the tag description information.
• when the feature extraction layer performs feature extraction, it can extract according to the semantics of the word vectors, so that the obtained text representation vector is fluent and accurately expresses the content of the topic text to be classified.
• the feature extraction layer can extract text representation vectors through a CNN (Convolutional Neural Network) model or an RNN (Recurrent Neural Network) model.
• step S105: determining the label correlation between the target text features and each of the label description features to obtain at least one label correlation includes: determining the label correlation between the target text features and each of the label description features through a correlation determination layer in the text classification model, to obtain at least one label correlation.
• the target text feature and the label description feature can be fused through the fusion layer (i.e., the correlation determination layer), so as to obtain the label correlation between the target text feature and the label description feature.
• the above-mentioned target text features can be divided into the sub-text features of each first unit text, and then the correlation between the sub-text features of each first unit text and the label description features is calculated separately, so that the correlation between the target text features and the label description features is determined according to the correlations between the sub-text features of all first unit texts and the label description features.
• the fusion layer can first calculate the first weight w i by the formula described above. Then, based on the first weight w i , the label correlation R between the target text features and the label description features can be calculated.
• the above step S107: determining, based on at least one of the tag correlations, a target topic tag matching the topic text to be classified among at least one of the topic tags to be predicted includes: determining, through a classification layer in the text classification model and based on at least one of the tag correlations, a target topic tag matching the topic text to be classified among at least one topic tag to be predicted.
  • the above classification layer may be composed of a fully connected layer and a normalization layer, wherein the fully connected layer may include a matrix W.
  • the classification layer can use the fully connected layer and the normalization layer to map the vector of the label correlation into a correlation probability, wherein the correlation probability is used to represent the to-be-predicted The degree of correlation between the topic label and the topic text to be classified.
• the expression form of the logit may be a probability value in the form of a percentage, for example, 60%.
  • R is the label correlation between the above-mentioned label description feature and the target text feature.
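The classification layer described above (a fully connected layer followed by a normalization layer that maps the label correlation R to a correlation probability) might look as follows; using a sigmoid as the normalization for the binary match decision is an assumption consistent with the binary classification layer.

```python
import numpy as np

def classification_layer(R, W, b=0.0):
    """Map the label correlation vector R to a correlation probability.

    W and b are the parameters of the fully connected layer; the
    sigmoid normalization to [0, 1] is an assumed concrete choice."""
    logit = float(R @ W + b)             # fully connected layer
    prob = 1.0 / (1.0 + np.exp(-logit))  # normalization layer
    return prob
```

A topic tag to be predicted whose probability exceeds a chosen threshold (e.g. 0.5) would then be selected as a target topic tag.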
• the topic label of the book-tweeting topic can be determined more accurately, and the classification accuracy of the book-tweeting topic can be improved, so that satisfactory books can be pushed to users more accurately, thereby improving the user's reading experience.
  • the method also includes a process of training the text classification model to be trained:
• each training sample contains a topic label to be predicted and a topic text to be trained, and each of the training samples contains a matching label, and the matching label is used to indicate the matching between the topic label to be predicted and the topic text to be trained;
  • the text classification model to be trained is trained by using the plurality of training samples to obtain the text classification model.
• a plurality of training samples containing topic labels to be predicted and topic texts to be trained can be determined, wherein each training sample contains a topic text to be trained and at least one topic label to be predicted, and each topic label to be predicted corresponds to a matching label, and the matching label is used to represent the matching between the topic label to be predicted and the topic text to be trained.
  • the text classification model to be trained is trained by using the plurality of training samples to obtain the text classification model, which specifically includes the following process:
• to train the model, it is first necessary to determine the target loss function loss of the text classification model to be trained. Specifically, the calculation process of the target loss function loss is as follows:
• N tags is the number of first tags of the topic tags to be predicted contained in the plurality of training samples.
• y true is a sign function, i.e., the matching label mentioned above.
• the above-mentioned second label quantity may be determined according to the sign function.
• y pred is the predicted value of the relevant probability output by the text classification model to be trained for the topic label to be predicted (i.e., the prediction result of the text classification model to be trained for the plurality of training samples).
• the loss further involves a hyperparameter, generally the average of the number of first labels contained in each training sample.
• the target loss function value of the text classification model to be trained can be determined based on the first label quantity, the second label quantity, the matching labels and the prediction results of the text classification model to be trained for the plurality of training samples, and the model parameters of the text classification model to be trained can be adjusted according to the target loss function value, thereby improving the prediction accuracy of the text classification model.
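The exact formula of the target loss appears as an image in the original publication. The sketch below shows one plausible reading consistent with the quantities named above (N tags, the sign function y true, the prediction y pred, and a hyperparameter equal to the average number of first labels per sample): a binary cross-entropy over all candidate tags whose positive term is re-weighted by that hyperparameter. The precise weighting scheme is an assumption.

```python
import numpy as np

def weighted_bce_loss(y_true, y_pred, n_tags, avg_pos):
    """Hypothetical target loss: binary cross-entropy over candidate
    tags, with the positive (matching-label) term re-weighted by
    avg_pos, the average number of matching labels per sample."""
    eps = 1e-9
    pos = -avg_pos * y_true * np.log(y_pred + eps)     # matching labels
    neg = -(1 - y_true) * np.log(1 - y_pred + eps)     # non-matching labels
    return float((pos + neg).sum() / n_tags)
```

Gradient descent on this loss would adjust the model parameters, including the weight extraction matrix D, toward predictions that agree with the matching labels.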
• FIG. 4 is a flowchart of a text processing method provided by an embodiment of the present disclosure.
  • the method is applied to a terminal device, and reading software is pre-installed in the terminal device.
  • the method includes steps S401 to S407, in:
  • S401 Display an operation page of topic text.
• the operation page of the above-mentioned topic text is shown in Figure 5, wherein the posting page shown in Figure 5 is the page for the user's posting operation in the above-mentioned reading software, and the user can input the target data on the operation page.
  • S403 Receive target data input by the user on the operation page, wherein the target data includes: topic text to be published, or interesting topic tags.
  • the target data is the topic text to be published.
• the user can input the topic text to be published on the interface shown in Figure 5; after that, the terminal device can send the topic text to be published to the server, and the server can determine, according to the text classification method described in the above-mentioned embodiments, a hashtag matching the topic text to be published, and display the hashtag in the second display position as shown in FIG. 5.
• S405 Obtain the screening result determined by the server based on the target data, wherein the screening result is the result obtained after the server screens, based on the text classification method described in any of the above embodiments, the data to be screened determined based on the target data.
• for different target data, the screening results returned by the server are also different.
  • the server can determine the hashtag matching the topic text to be published according to the text classification method described in the above embodiment. If the target data is a topic tag of interest, then the server can determine the published topic text matching the topic tag of interest according to the text classification method described in the above embodiment.
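The server-side branching just described — returning hashtags for a topic text to be published, or published topic texts for an interest tag — can be sketched as follows. Here `match_fn` stands in for the text classification model, and the data structures and field names are hypothetical.

```python
def screen(target_data, match_fn, all_tags, published_texts):
    """Dispatch on the type of target data and return the screening
    result; match_fn(text, tag) -> bool stands in for the text
    classification model of the earlier embodiments."""
    if target_data["type"] == "topic_text":
        # target data is a topic text to be published:
        # return the hashtags matching it
        return [t for t in all_tags if match_fn(target_data["value"], t)]
    # otherwise the target data is a topic tag of interest:
    # return the published topic texts matching that tag
    return [p for p in published_texts if match_fn(p, target_data["value"])]
```

The terminal device then displays the returned screening result alongside the target data on the operation page.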
  • S407 Displaying the target data and/or the filtering results of the target data on the operation page.
  • the filtering result of the target data may be published topic text matching the tag of interest.
  • the recommended topics displayed on the operation page may be the target data and books or articles related to the interest tag, wherein the recommended topics may be used to recommend books or articles Published topic text for .
  • the hashtags of the book push topics can be more accurately determined, and the classification accuracy of the book push topics can be improved, so that satisfactory books can be pushed to users more accurately, thereby improving the user's reading experience.
  • the above-mentioned target data includes the topic text to be published; the above-mentioned display of the target data and/or the screening results of the target data on the operation page specifically includes the following process:
• the above-mentioned first display position is used to display the topic text to be published input by the user, wherein the first sub-display position in the first display position is used to display the text title of the topic text to be published, and the second sub-display position in the first display position is used to display the text content of the topic text to be published.
  • the second display area includes at least one target hashtag matching the topic text to be published.
  • the method further includes:
  • the modified target hashtag is displayed on the page, wherein the modifying operation includes at least one of the following: adding, deleting, and modifying.
• the user can also modify the target hashtag through the tag modification identifier; for example, after detecting the user's trigger operation on the "+click to add" button (that is, the tag modification identifier), it can be determined that the modification operation matching the "+click to add" button is an addition operation, and in response to the addition operation, a corresponding new topic tag is added at the second display position.
• each target hashtag may also contain a tag modification identifier for deletion; after detecting the user's trigger operation on this identifier, it can be determined that the matching modification operation is a deletion operation, and the corresponding target hashtag is deleted in response to the deletion operation.
• the user can also directly modify the tag content in the target hashtag by triggering the target hashtag at the second display position.
  • the topic tag corresponding to the modification content is determined as the target topic tag.
  • the target hashtag can be modified through the modification operation, so that the user can add the target hashtag more flexibly and conveniently, and the user experience is improved.
  • the method further includes the following process:
  • the candidate hashtag page as shown in FIG. 6 can be displayed on the display interface, wherein the user can select the hashtag Candidate hashtags in the page to identify hashtags of interest.
• when the user selects a tag of interest, it can also be detected whether the number of tags of interest selected by the user exceeds the preset number, and when the number exceeds the preset number, a prompt message is displayed, wherein the prompt information is used to indicate that the number of interest tags has reached the preset number.
  • the above interest tags may correspond to different category dimensions, wherein, as shown in FIG. 6 , the category dimensions corresponding to the interest tags include: topic type, gender preference, and push book type.
  • the above-mentioned preset quantity may be set for tags of interest of all category dimensions, or may be set for tags of interest of at least part of category dimensions.
  • a prompt message is displayed on the display interface: "Up to 3 book push types can be selected”.
  • the number of interest tags selected by the user can be limited by the preset number, thereby reducing the reduction in screening efficiency caused by too many interest tags and improving user experience.
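The preset-number check on interest tags might be implemented as in the following sketch; the function name, the toggle behavior, and the limit of 3 are illustrative assumptions based on the example prompt above.

```python
def toggle_interest_tag(selected, tag, preset_limit=3):
    """Toggle an interest tag selection, enforcing the preset limit.

    Returns (new_selection, prompt): prompt is None unless the limit
    has been reached, in which case it carries the prompt message."""
    if tag in selected:
        # tag already selected: deselect it
        return [t for t in selected if t != tag], None
    if len(selected) >= preset_limit:
        # limit reached: keep the selection and show the prompt
        return selected, "Up to %d book push types can be selected" % preset_limit
    return selected + [tag], None
```

A per-dimension limit (e.g. only on the "push book type" dimension) would call this with the tags of that dimension only.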
  • the above-mentioned operation page for displaying the topic text specifically includes the following process:
  • the target topic category is divided into "topic type", "gender preference” and "tweet book type”.
  • the category display area of each target topic category may be determined on the operation page. For example, determine the category display area of "topic type", the category display area of "gender preference”, and the category display area of "tweet book type”.
  • the corresponding target topic category and the preset topic tags belonging to the target topic category can be displayed in the category display area.
  • the preset hashtags belonging to the “topic type” may include “by plot”, “by role”, and “by category”.
  • the preset hashtags belonging to the “gender preference” may include “male orientation” and “female orientation”.
• the corresponding preset topic tags can be determined according to the target topic category and displayed in the category display area corresponding to each target topic category, thereby improving the efficiency of determining the target topic label, making the interface layout more beautiful, and improving the user's browsing experience.
  • displaying the target data and/or the screening results of the target data on the operation page specifically includes the following process:
  • the display page when displaying the above target data is shown in Figure 7, wherein the display page includes a title display area and a text display area, wherein the title display area is used to display interest topic tags , the text display area is used to display the key topic content of the published topic text that matches the topic tag of interest.
• the key topic content may include the text title of the published topic text and a browsing identifier, wherein the browsing identifier is used to characterize data such as the number of times the published topic text has been browsed and the number of times recommended books have been adopted (the adoption count may be presented in the form of "saving the book shortage of 15.3w people" as shown in Figure 7).
• the topic tags of interest and the key topic content of the published topic text can be displayed through the label display area and the text display area respectively, so that the page layout is more reasonable; moreover, displaying the key topic content of the published topic text refines the published topic text, further improves the rationality of the page layout, and enables the display interface to present more substantive content at the same time, which is convenient for users to view.
  • the method also includes:
  • (1) in response to the selection operation for the interested hashtag, determine the target hashtag selected by the user, and obtain the published topic text matching the target hashtag;
• the user may determine the published topic text corresponding to the target hashtag to be viewed through the selection operation on the above-mentioned interested hashtag. Specifically, after the target hashtag selected by the user is detected, the published topic text displayed on the topic screening page can be screened, so as to determine the published topic text that matches the target hashtag, and the key topic content of the published topic text matching the target hashtag can be displayed in the text display area.
  • the key topic content of the published topic text displayed on the topic screening page can be screened through the topic label of interest, so as to better meet the user's use needs and improve the user's use experience.
• a target topic tag is determined among the topic tags to be predicted, so as to improve the accuracy of topic classification of the topic text to be classified.
• when the topic text to be classified is a book-tweeting topic associated with book recommendation, the topic label of the book-tweeting topic can be determined more accurately, and the classification accuracy of the book-tweeting topic can be improved, so that satisfactory books can be pushed to users more accurately, thereby improving the user's reading experience.
• the writing order of each step does not imply a strict execution order or constitute any limitation on the implementation process; the specific execution order of each step should be determined by its function and possible inner logic.
• the embodiment of the present disclosure also provides a text classification device corresponding to the text classification method. Since the problem-solving principle of the device in the embodiment of the present disclosure is similar to the above-mentioned text classification method of the embodiment of the present disclosure, the implementation of the device may refer to the implementation of the method, and repeated descriptions are omitted.
• FIG. 8 is a schematic diagram of a text classification device provided by an embodiment of the present disclosure.
  • the device includes: a first acquisition unit 81, an extraction unit 82, a first determination unit 83, and a second determination unit 84; wherein,
  • the first obtaining unit 81 is used to obtain the tag description information of the topic text to be classified and at least one topic tag to be predicted;
  • the extraction unit 82 is configured to: extract the target text features of the topic text to be classified, and extract the label description features of the label description information of each of the topic labels to be predicted;
  • the first determining unit 83 is configured to: determine the tag correlation between the target text feature and each of the tag description features, to obtain at least one tag correlation;
  • the second determining unit 84 is configured to: based on at least one of the tag correlations, determine a target topic tag matching the topic text to be classified among at least one topic tag to be predicted.
• the target topic label is determined among the topic labels to be predicted, so that the corresponding topic label can be more accurately determined for the topic text to be classified, thereby improving the accuracy of topic classification of the topic text to be classified.
• when the topic text to be classified is a book-tweeting topic associated with book recommendation, the topic label of the book-tweeting topic can be determined more accurately, and the classification accuracy of the book-tweeting topic can be improved, so that satisfactory books can be pushed to users more accurately, thereby improving the user's reading experience.
  • the target text features include a plurality of sub-text features, each sub-text feature corresponds to each first unit text in the topic text to be classified, and the first determining unit 83 is further configured to:
• based on the correlation coefficient of each of the first unit texts, a weighted summation calculation is performed on the sub-text features of each of the first unit texts, and the label correlation is determined according to the calculation result.
  • the first determination unit 83 is further configured to:
• based on the sub-text features of each of the first unit texts, determine the first sub-correlation coefficient of the first unit text; determine the second sub-correlation coefficient based on the target text features and the label description features; and determine the correlation coefficient based on a ratio between the first sub-correlation coefficient and the second sub-correlation coefficient.
  • the first determination unit 83 is further configured to:
  • the label description features include a plurality of second unit texts; the first determining unit 83 is further configured to:
  • the first acquiring unit 81 is also configured to:
  • the original text data is segmented to obtain the topic text to be classified and the tag description information.
  • the extracting unit 82 is also used to:
  • the topic text to be classified includes at least one of the following: topic title text, topic abstract text, and topic label description text.
  • the device is also used for:
• the extracting of the target text features of the topic text to be classified and the label description features of the label description information of each topic label to be predicted includes: extracting the target text features of the topic text to be classified through the feature extraction layer in the text classification model, and extracting the label description features of the label description information of each of the topic labels to be predicted; the determining of the label correlation between the target text features and each of the label description features to obtain at least one label correlation includes: determining the label correlation between the target text features and each of the label description features through a correlation determination layer in the text classification model, to obtain at least one label correlation;
• the determining, based on at least one of the label correlations, of a target topic label matching the topic text to be classified among at least one of the topic labels to be predicted includes: determining, through a classification layer in the text classification model and based on at least one of the label correlations, a target topic label matching the topic text to be classified among at least one topic label to be predicted.
  • the device is also used for:
• each training sample contains a topic label to be predicted and a topic text to be trained, and each of the training samples contains a matching label, and the matching label is used to indicate the matching between the topic label to be predicted and the topic text to be trained; the text classification model to be trained is trained using the plurality of training samples to obtain the text classification model.
  • the device is also used for:
  • the device includes: a first display unit 91, a receiving unit 92, a second acquisition unit 93, and a second display unit 94; wherein,
  • the first display unit 91 is configured to: display the operation page of the topic text
  • the receiving unit 92 is configured to: receive target data input by the user on the operation page, wherein the target data includes: topic text to be published, or interesting topic tags;
• the second acquisition unit 93 is configured to: acquire the screening result determined by the server based on the target data, wherein the screening result is the result obtained after the server screens, based on the text classification method described in the above embodiments, the data to be screened determined based on the target data;
  • the second display unit 94 is configured to: display the target data and/or the screening results of the target data on the operation page.
  • the hashtags of the book push topics can be more accurately determined, and the classification accuracy of the book push topics can be improved, so that satisfactory books can be pushed to users more accurately, thereby improving the user's reading experience.
  • the target data includes the topic text to be published, and the second display unit 94 is also used for:
  • the second display unit 94 is also used for:
  • the modified target hashtag is displayed on the page, wherein the modifying operation includes at least one of the following: adding, deleting, and modifying.
  • the target data includes the topic tag of interest
  • the device is also used for:
  • the target data includes topic tags of interest
  • the first display unit 91 is also used for:
  • the target data includes the topic tag of interest; the second display unit 94 is further configured to:
  • the topic tags of interest are displayed in the title display area of the operation page; the key topic content of the published topic text matching each of the topic tags of interest is displayed in the text display area of the operation page.
  • the second display unit 94 is also used for:
  • the embodiment of the present disclosure also provides another computer device 1000, as shown in Figure 10, which is a schematic structural diagram of the computer device 1000 provided by the embodiment of the present disclosure, including:
  • Processor 101, memory 102, and bus 103; the memory 102 is used to store execution instructions and includes internal memory 1021 and external memory 1022; the internal memory 1021, also called main memory, temporarily stores operation data for the processor 101 as well as data exchanged with external memory 1022 such as a hard disk; the processor 101 exchanges data with the external memory 1022 through the internal memory 1021; when the computer device 1000 is running, the processor 101 communicates with the memory 102 through the bus 103, so that the processor 101 executes the following instructions:
  • a target topic tag matching the topic text to be classified is determined among at least one topic tag to be predicted.
  • the embodiment of the present disclosure also provides a computer device 1100, as shown in FIG. 11, which is a schematic structural diagram of the computer device 1100 provided by the embodiment of the present disclosure, including:
  • Processor 111, memory 112, and bus 113; the memory 112 is used to store execution instructions and includes internal memory 1121 and external memory 1122; the internal memory 1121, also called main memory, temporarily stores operation data for the processor 111 as well as data exchanged with external memory 1122 such as a hard disk; the processor 111 exchanges data with the external memory 1122 through the internal memory 1121; when the computer device 1100 is running, the processor 111 communicates with the memory 112 through the bus 113, so that the processor 111 executes the following instructions:
  • target data input by the user on the operation page, wherein the target data includes: topic text to be published, or topic tags of interest;
  • the screening result is a result of the server filtering the data to be screened determined based on the target data based on the text classification method described in the above embodiment;
  • the target data and/or the screening results of the target data are displayed on the operation page.
  • Embodiments of the present disclosure also provide a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the text classification and text processing methods described in the above method embodiments are executed.
  • the storage medium may be a volatile or non-volatile computer-readable storage medium.
  • Embodiments of the present disclosure also provide a computer program product carrying program code; the instructions included in the program code can be used to execute the steps of the text classification and text processing methods described in the above method embodiments. For details, refer to the foregoing method embodiments; they are not repeated here.
  • the above-mentioned computer program product may be specifically implemented by means of hardware, software or a combination thereof.
  • the computer program product is embodied as a computer storage medium, and in another optional embodiment, the computer program product is embodied as a software product, such as a Software Development Kit (SDK).
  • An embodiment of the present disclosure also provides a computer program stored in a readable storage medium; at least one processor of an electronic device can read the computer program from the readable storage medium, and the at least one processor executes the computer program so that the electronic device performs the steps of the text classification and text processing methods described in the above method embodiments. For details, refer to the above method embodiments; they are not repeated here.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present disclosure may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • the functions are realized in the form of software function units and sold or used as independent products, they can be stored in a non-volatile computer-readable storage medium executable by a processor.
  • the technical solution of the present disclosure, in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product; the computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods described in the various embodiments of the present disclosure.
  • the aforementioned storage media include: USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, optical disc, and other media that can store program code.


Abstract

A text classification and text processing method and apparatus, a computer device, and a storage medium, the method comprising: acquiring a topic text to be classified and label description information of at least one topic label to be predicted (S101); extracting a target text feature of the topic text to be classified, and extracting a label description feature of the label description information of each topic label to be predicted (S103); determining the label relevance between the target text feature and each label description feature, obtaining at least one label relevance (S105); and, on the basis of the at least one label relevance, determining, among the at least one topic label to be predicted, a target topic label matching the topic text to be classified (S107).

Description

Text classification and text processing method and apparatus, computer device, and storage medium
Cross-Reference to Related Applications
The present disclosure claims priority to Chinese patent application No. 202210102790.9, filed with the Chinese Patent Office on January 27, 2022 and entitled "Text classification and text processing method and apparatus, computer device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical Field
The present disclosure relates to the field of computer technology, and in particular to a text classification and text processing method and apparatus, a computer device, and a storage medium.
Background Art
When a user uses book- or article-reading software, on the one hand the user browses book-recommendation topics of interest in the reading software and looks within those topics for books or articles to read. The user then has to browse each book-recommendation topic one by one, and this one-by-one browsing lowers the efficiency with which the user finds a desired book among the topics. On the other hand, the user can search for a desired book in the reading software, but the content recalled by existing search schemes consists of book-recommendation topics matching the search keywords; the books recommended in those topics may be irrelevant to the books the search keywords were intended to find, or some book-recommendation topics are missing from the recalled content, so the user cannot find a satisfactory book, which degrades the user's reading experience with the reading software.
发明内容
本公开实施例至少提供一种文本分类、文本处理方法、装置、计算机设备及存储介质、计算机程序产品以及计算机程序。
第一方面,本公开实施例提供了一种文本分类方法,应用于服务器,包括:
获取待分类话题文本和至少一个待预测话题标签的标签描述信息;提取所述待分类话题文本的目标文本特征,并提取每个所述待预测话题标签的标签描述信息的标签描述特征;确定所述目标文本特征和每个所述标签描述特征之间的标签相关性,得到至少一个标签相关性;基于至少一个所述标签相关性,在至少一个所述待预测话题标签中确定与所述待分类话题文本相匹配的目标话题标签。
一种可选的实施方式中,所述目标文本特征中包含多个子文本特征,每个子文本特征对应所述待分类话题文本中每个第一单位文本;所述确定所述目标文本特征和每个所述标签描述特征之间的标签相关性,包括:
基于所述目标文本特征和所述标签描述特征,确定每个所述第一单位文本的相关系数,其中,所述相关系数用于表征该第一单位文本与对应待预测话题标签之间的标签相关程度;基于每个所述第一单位文本的相关系数,对各个所述第一单位文本的子文本特征进行加权 求和计算,并根据计算结果确定所述标签相关性。
一种可选的实施方式中,所述基于所述目标文本特征和所述标签描述特征,确定每个所述第一单位文本的相关系数,包括:
基于每个所述第一单位文本的子文本特征,确定该第一单位文本的第一子相关系数;基于所述目标文本特征和所述标签描述特征确定第二子相关系数;基于所述第一子相关系数和所述第二子相关系数之间的比值确定所述相关系数。
一种可选的实施方式中,所述基于每个所述第一单位文本的子文本特征,确定该第一单位文本的第一子相关系数,包括:
基于每个所述第一单位文本的子文本特征和预设权重矩阵,确定该第一单位文本的第一权重;基于所述第一权重确定所述第一子相关系数。
一种可选的实施方式中,所述标签描述特征中包含多个第二单位文本;所述基于所述目标文本特征和所述标签描述特征确定第二子相关系数,包括:
基于所述目标文本特征和预设权重矩阵确定各个第一单位文本的第二权重;基于所述标签描述特征和所述预设权重矩阵确定各个第二单位文本的第三权重;基于所述第二权重和所述第三权重确定所述第二子相关系数。
一种可选的实施方式中,所述获取待分类话题文本和至少一个待预测话题标签的标签描述信息,包括:
获取待处理的原始文本数据,并确定所述原始文本数据中所包含的文本类型标识;基于所述文本类型标识确定所述原始文本数据的数据分割位置,并基于所述数据分割位置对所述原始文本数据进行分割处理,得到所述待分类话题文本和所述标签描述信息。
一种可选的实施方式中,所述提取所述待分类话题文本的目标文本特征,包括:
确定所述待分类话题文本中每个第一单位文本的目标向量,其中,所述目标向量中的元素用于指示该第一单位文本和每个预设单位文本之间的映射关系;在所述待分类话题文本中全部第一单位文本的目标向量中提取所述待分类话题文本的关键特征向量,并将所述关键特征向量确定为所述目标文本特征。
一种可选的实施方式中,所述待分类话题文本包括以下至少之一:话题标题文本、话题摘要文本、话题标签描述文本。
一种可选的实施方式中,所述提取所述待分类话题文本的目标文本特征,并提取每个所述待预测话题标签的标签描述信息的标签描述特征,包括:通过文本分类模型中的特征提取层提取所述待分类话题文本的目标文本特征,并提取每个所述待预测话题标签的标签描述信息的标签描述特征;所述确定所述目标文本特征和每个所述标签描述特征之间的标签相关性,得到至少一个标签相关性,包括:
通过文本分类模型中的相关性确定层确定所述目标文本特征和每个所述标签描述特征之间的标签相关性,得到至少一个标签相关性;所述基于至少一个所述标签相关性,在至少一个所述待预测话题标签中确定与所述待分类话题文本相匹配的目标话题标签,包括:通过文本分类模型中的分类层基于至少一个所述标签相关性,在至少一个所述待预测话题标签中确定与所述待分类话题文本相匹配的目标话题标签。
一种可选的实施方式中,所述方法还包括:
确定多个训练样本;其中,每个训练样本中包含待预测话题标签和待训练话题文本,每个所述训练样本包含匹配标签,所述匹配标签用于指示所述待预测话题标签和待训练话题文本之间的匹配性;通过所述多个训练样本对待训练的文本分类模型进行训练,得到所述文本分类模型。
一种可选的实施方式中,所述通过所述多个训练样本对待训练的文本分类模型进行训练,得到所述文本分类模型,包括:
确定所述多个训练样本中所包含待预测话题标签的第一标签数量,并确定所述待预测话题标签中与所述待训练话题文本相匹配的目标分类标签的第二标签数量;基于所述第一标签数量、所述第二标签数量、所述匹配标签和所述待训练的文本分类模型对所述多个训练样本的预测结果,确定所述待训练的文本分类模型的目标损失函数值;根据所述目标损失函数值,调整所述待训练的文本分类模型的模型参数,得到所述文本分类模型。
第二方面,本公开实施例还提供一种文本处理方法,应用于终端设备,包括:
展示话题文本的操作页面;接收用户在所述操作页面输入的目标数据,其中,所述目标数据包括:待发布话题文本,或者,感兴趣话题标签;获取服务器基于所述目标数据确定的筛选结果,其中,所述筛选结果为所述服务器基于上述第一方面中任一项所述的文本分类方法对基于所述目标数据确定的待筛选数据进行筛选之后的结果;在所述操作页面展示所述目标数据和/或所述目标数据的筛选结果。
一种可选的实施方式中,所述目标数据包含所述待发布话题文本;所述在所述操作页面展示所述目标数据和/或所述目标数据的筛选结果,包括:
在所述操作页面的第一展示位置展示所述待发布话题文本;在所述操作页面的第二展示位置展示所述待发布话题文本的发布类型和/或与所述待发布话题文本相匹配的至少一个目标话题标签。
一种可选的实施方式中,所述方法还包括:
检测用户针对所述操作页面中所展示的所述目标话题标签的标签修改标识的触发操作,对所述目标话题标签执行与用户所触发的标签修改标识相匹配的修改操作,并在所述操作页面中展示修改之后的目标话题标签,其中,所述修改操作包括以下至少之一:新增、删除、修改。
一种可选的实施方式中,所述目标数据包含所述感兴趣话题标签;所述方法还包括:
在接收用户在所述操作页面输入的感兴趣话题标签之后,检测所述感兴趣话题标签的标签数量是否超过预设数量;在所述标签数量超过所述预设数量的情况下,展示提示信息;所述提示信息用于指示所述感兴趣话题标签的数量已达到所述预设数量。
一种可选的实施方式中,所述目标数据包括感兴趣话题标签;所述展示话题文本的操作页面,包括:
响应于用户的话题筛选请求,获取所属于至少一个目标话题类别的预设话题标签;在所述操作页面中确定每个所述目标话题类别的类别展示区域,并在所述类别展示区域中展示对应目标话题类别和所属于该目标话题类别的预设话题标签。
一种可选的实施方式中,所述目标数据包括所述感兴趣话题标签;所述在所述操作页面展示所述目标数据和/或所述目标数据的筛选结果,包括:
在所述操作页面的标题展示区域中展示所述感兴趣话题标签;在所述操作页面的文本展示区域中展示与每个所述感兴趣话题标签相匹配的已发布话题文本的关键话题内容。
一种可选的实施方式中,所述方法还包括:
响应于针对所述感兴趣话题标签的选择操作,确定用户所选择的目标话题标签,并获取与所述目标话题标签相匹配的已发布话题文本;在话题筛选页面的文本展示区域中展示与所述目标话题标签相匹配的已发布话题文本的关键话题内容。
第三方面,本公开实施例还提供一种文本分类装置,应用于服务器,包括:
第一获取单元,用于获取待分类话题文本和至少一个待预测话题标签的标签描述信息;提取单元,用于提取所述待分类话题文本的目标文本特征,并提取每个所述待预测话题标签的标签描述信息的标签描述特征;第一确定单元,用于确定所述目标文本特征和每个所述标签描述特征之间的标签相关性,得到至少一个标签相关性;第二确定单元,用于基于至少一个所述标签相关性,在至少一个所述待预测话题标签中确定与所述待分类话题文本相匹配的目标话题标签。
第四方面,本公开实施例还提供一种文本处理装置,应用于终端设备,包括:
第一展示单元,用于展示话题文本的操作页面;接收单元,用于接收用户在所述操作页面输入的目标数据,其中,所述目标数据包括:待发布话题文本,或者,感兴趣话题标签;第二获取单元,用于获取服务器基于所述目标数据确定的筛选结果,其中,所述筛选结果为所述服务器基于上述第一方面中任一项所述的文本分类方法对基于所述目标数据确定的待筛选数据进行筛选之后的结果;第二展示单元,用于在所述操作页面展示所述目标数据和/或所述目标数据的筛选结果。
第五方面,本公开实施例还提供一种计算机设备,包括:处理器、存储器和总线,所述存储器存储有所述处理器可执行的机器可读指令,当计算机设备运行时,所述处理器与所述存储器之间通过总线通信,所述机器可读指令被所述处理器执行时执行上述第一方面至第二方面中任一种可能的实施方式中的步骤。
第六方面,本公开实施例还提供一种计算机可读存储介质,该计算机可读存储介质上存储有计算机程序,该计算机程序被处理器运行时执行上述第一方面至第二方面中任一种可能的实施方式中的步骤。
第七方面,本公开实施例还提供了一种计算机程序产品,所述计算机程序产品包括:计算机程序,所述计算机程序存储在可读存储介质中,电子设备的至少一个处理器可以从所述可读存储介质读取所述计算机程序,至少一个所述处理器执行所述计算机程序,使得所述电子设备执行上述第一方面至第二方面中任一种可能的实施方式中的步骤。
第八方面,本公开实施例还提供了一种计算机程序,所述计算机程序存储在可读存储介质中,电子设备的至少一个处理器可以从所述可读存储介质中读取上述计算机程序,至少一个所述处理器执行所述计算机程序,使得所述电子设备执行上述第一方面至第二方面中任一种可能的实施方式中的步骤。
本公开实施例提供了一种文本分类、文本处理方法、装置、计算机设备及存储介质。在本公开实施例中,首先可以获取待分类话题文本和对应的至少一个待预测话题标签的标签描述信息,并提取待分类话题文本的目标文本特征,并提取每个待预测话题标签的标签 描述信息的标签描述特征;之后,就可以确定目标文本特征和标签描述特征之间的标签相关性;最后,就可以基于该标签相关性在至少一个待预测话题标签中确定与待分类话题文本相匹配的目标话题标签。
上述实施方式中,通过确定标签描述特征和目标文本特征之间的标签相关性在待预测话题标签中确定目标话题标签的方式,可以更加准确的为待分类话题文本确定对应的话题标签,从而提高待分类话题文本的话题分类的准确度。在待分类话题文本为与书籍推荐相关联的推书话题的情况下,通过上述处理方式,可以更加准确的确定推书话题的话题标签,提高推书话题的分类精度,从而能够更加准确的为用户推送出满意的书籍,进而提高用户的阅读体验。
为使本公开的上述目的、特征和优点能更明显易懂,下文特举较佳实施例,并配合所附附图,作详细说明如下。
附图说明
为了更清楚地说明本公开实施例的技术方案,下面将对实施例中所需要使用的附图作简单地介绍,此处的附图被并入说明书中并构成本说明书中的一部分,这些附图示出了符合本公开的实施例,并与说明书一起用于说明本公开的技术方案。应当理解,以下附图仅示出了本公开的某些实施例,因此不应被看作是对范围的限定,对于本领域普通技术人员来讲,在不付出创造性劳动的前提下,还可以根据这些附图获得其他相关的附图。
图1示出了本公开实施例所提供的一种文本分类方法的流程图;
图2示出了本公开实施例所提供的基于数据分割位置对该原始文本数据进行分割处理的示意图;
图3示出了本公开实施例所提供的文本分类方法所对应的文本分类模型的框架结构图;
图4示出了本公开实施例所提供的一种文本处理方法的流程图;
图5示出了本公开实施例所提供的话题文本的操作页面的示意图;
图6示出了本公开实施例所提供的待选话题标签页面的示意图;
图7示出了本公开实施例所提供的展示目标数据时的展示页面的示意图;
图8示出了本公开实施例所提供的一种文本分类装置的示意图;
图9示出了本公开实施例所提供的一种文本处理装置的示意图;
图10示出了本公开实施例所提供的一种计算机设备的示意图;
图11示出了本公开实施例所提供的另一种计算机设备的示意图。
具体实施方式
为使本公开实施例的目的、技术方案和优点更加清楚,下面将结合本公开实施例中附图,对本公开实施例中的技术方案进行清楚、完整地描述,显然,所描述的实施例仅仅是本公开一部分实施例,而不是全部的实施例。通常在此处附图中描述和示出的本公开实施例的组件可以以各种不同的配置来布置和设计。因此,以下对在附图中提供的本公开的实施例的详细描述并非旨在限制要求保护的本公开的范围,而是仅仅表示本公开的选定实施例。基于本公开的实施例,本领域技术人员在没有做出创造性劳动的前提下所获得的所有 其他实施例,都属于本公开保护的范围。
应注意到:相似的标号和字母在下面的附图中表示类似项,因此,一旦某一项在一个附图中被定义,则在随后的附图中不需要对其进行进一步定义和解释。
本文中术语“和/或”,仅仅是描述一种关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况。另外,本文中术语“至少一种”表示多种中的任意一种或多种中的至少两种的任意组合,例如,包括A、B、C中的至少一种,可以表示包括从A、B和C构成的集合中选择的任意一个或多个元素。
经研究发现,在用户使用书籍或者文章阅读类软件时,一方面会浏览阅读类软件中感兴趣的与书籍推荐相关的推书话题,从而在推书话题中查找喜欢的书籍或者文章进行阅读。此时,用户需要对每个推书话题一一进行浏览,通过该一一浏览的方式降低了用户在各个推书话题中查找喜欢书籍的效率。另一个方面,用户可以在该阅读类软件中搜索喜欢的书籍,但是现有的搜索方案所召回的内容为与搜索关键词相匹配的推书话题;然而,该推书话题中所推荐的书籍可能与搜索关键词所希望搜索的书籍不相关,或者,所召回的内容中漏掉了部分推书话题,从而造成用户无法搜索到满意的书籍,进而降低了用户对该阅读类软件的阅读体验。
基于上述研究,本公开提供了一种文本分类、文本处理方法、装置、计算机设备及存储介质。在本公开实施例中,首先可以获取待分类话题文本和对应的至少一个待预测话题标签的标签描述信息,并提取待分类话题文本的目标文本特征,并提取每个待预测话题标签的标签描述信息的标签描述特征;之后,就可以确定目标文本特征和标签描述特征之间的标签相关性;最后,就可以基于该标签相关性在至少一个待预测话题标签中确定与待分类话题文本相匹配的目标话题标签。
上述实施方式中,通过确定标签描述特征和目标文本特征之间的标签相关性在待预测话题标签中确定目标话题标签的方式,可以更加准确的为待分类话题文本确定对应的话题标签,从而提高待分类话题文本的话题分类的准确度。在待分类话题文本为与书籍推荐相关联的推书话题的情况下,通过上述处理方式,可以更加准确的确定推书话题的话题标签,提高推书话题的分类精度,从而能够更加准确的为用户推送出满意的书籍,进而提高用户的阅读体验。
为便于对本实施例进行理解,首先对本公开实施例所公开的一种文本分类、文本处理方法进行详细介绍,本公开实施例所提供的文本分类、文本处理方法的执行主体一般为具有一定计算能力的计算机设备,该计算机设备例如包括:终端设备或服务器或其它处理设备。在一些可能的实现方式中,该文本分类、文本处理方法可以通过处理器调用存储器中存储的计算机可读指令的方式来实现。
参见图1所示,为本公开实施例提供的一种文本分类方法的流程图,该方法应用于服务器,该方法包括步骤S101~S107,其中:
S101:获取待分类话题文本和至少一个待预测话题标签的标签描述信息。
本公开实施例所提供的文本分类方法可以应用在书籍或者文章阅读类软件的服务器中。例如,用户在使用该阅读类软件时,可以通过发帖的方式获取想要浏览的书籍以及文章等,或者通过发帖和其他用户进行交流。
在本公开实施例中,待分类话题文本可以为当前用户通过阅读类软件编辑的文本,还可以为其他用户通过阅读类软件编辑的文本。举例来说,上述待分类话题文本可以为用户通过阅读类软件输入的帖子内容。
在获取到用户输入的待分类话题文本后,就可以为该待分类话题文本确定对应的至少一个待预测话题标签的标签描述信息。
具体实施时,可以预先设定多个话题标签(即,预设话题标签);然后,可以将全部预设话题标签确定为上述至少一个待预测话题标签。除此之外,还可以对预设话题标签进行初步筛选,得到至少一个待预测话题标签。具体筛选原则可以为:筛选预设话题标签中包含待分类话题文本的特征信息的话题标签为至少一个待预测话题标签。此时,该至少一个待预测话题标签就可以包含该待分类话题文本所对应的特征信息。
举例来说,上述待分类话题文本为:求高质量言情小说,那么,该待分类话题文本所对应的特征信息可以“言情”和“小说”。在此情况下,该待分类话题文本所对应的至少一个待预测话题标签就可以包含“言情”和/或“小说”。
在本公开实施例中,每个待预测话题标签还可以包含用于对该待预测话题标签进行注释的标签描述信息。例如,当上述待预测话题标签为“体育”时,该待预测话题标签所对应的标签描述信息可以包括:体育,运动,拳击,竞技,篮球,足球等文本。
S103:提取所述待分类话题文本的目标文本特征,并提取每个所述待预测话题标签的标签描述信息的标签描述特征。
在本公开实施例中,在获取到上述待分类话题文本和至少一个待预测话题标签的标签描述信息后,就可以通过文本分类模型中的特征提取层对待分类话题文本进行特征提取,得到对应的目标文本特征,以及对每个标签描述信息进行特征提取,得到对应的标签描述特征。其中,提取到的目标文本特征和标签描述特征的数据格式可以为向量,例如,可以为文本表示向量和标签表示向量。在得到文本表示向量和标签表示向量之后,就可以基于文本表示向量和标签表示向量确定标签相关性,通过向量形式的数据确定标签相关性的方式,可以简化便于对目标文本特征和标签描述特征之间的相关性进行对比的过程。
在本公开实施例中,该文本分类模型包括:输入层、嵌入层、特征提取层,其中,输入层、嵌入层、特征提取层串联连接。
具体实施时,输入层在获取到待分类话题文本和标签描述信息之后,可以将上述待分类话题文本和标签描述信息中的文本分别转换为one-hot编码(独热编码)。嵌入层可以将上述待分类话题文本所对应的one-hot编码和标签描述特征所对应的one-hot编码转换成词向量。特征提取层在得到上述词向量后,就可以对词向量进行向量提取,得到该待分类话题文本的目标文本特征和标签描述信息的标签描述特征。
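As a minimal sketch of the input and embedding layers described above, the following converts each character to a one-hot vector and then looks up its word vector. The vocabulary, the embedding table, and the 2-dimensional embedding size are hypothetical, for illustration only:

```python
# Minimal sketch of the input layer (one-hot encoding) and embedding layer
# (word-vector lookup) described above. Vocabulary and table are hypothetical.

vocab = {"言": 0, "情": 1, "小": 2, "说": 3}

def one_hot(token_id, vocab_size):
    """Input layer: map a token id to a fixed-dimension vector of 0s and 1s."""
    return [1.0 if i == token_id else 0.0 for i in range(vocab_size)]

# Hypothetical embedding table; row i is the word vector of vocab id i.
embedding_table = [[0.1, 0.3], [0.2, -0.1], [-0.4, 0.5], [0.0, 0.2]]

def embed(text):
    """Embedding layer: one-hot each character, then multiply by the table
    (a one-hot vector times the table is just a row lookup)."""
    out = []
    for ch in text:
        oh = one_hot(vocab[ch], len(vocab))
        out.append([sum(o * w for o, w in zip(oh, col))
                    for col in zip(*embedding_table)])
    return out

word_vectors = embed("言情小说")  # one word vector per character
```

A real system would use the tokenizer's full vocabulary and a learned (e.g. word2vec-style) embedding table, as the text describes.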
S105:确定所述目标文本特征和每个所述标签描述特征之间的标签相关性,得到至少一个标签相关性。
在本公开实施例中,可以通过相关性计算,分别计算目标文本特征和每个标签描述特征之间的标签相关性。具体实施时,可以通过文本分类模型中的融合层将目标文本特征分别和标签描述特征进行融合运算,从而根据融合运算结果确定目标文本特征和该标签描述特征之间的标签相关性。其中,融合层的输入与文本分类模型的特征提取层的输出相连接。
这里,上述标签相关性可以表示为相关性表示向量;其中,相关性表示向量用于表征待分类话题文本和对应待预测话题标签之间的标签相关性。在得到相关性表示向量之后,就可以对相关性表示向量进行归一化处理,从而归一化后得到0至1范围内的数值。其中,该数值用于表征待分类话题文本和对应待预测话题标签之间的相关概率。
具体实施时,可以将相关性表示向量输入至文本分类模型中的二分类层进行映射处理,从而将相关性表示向量映射为0至1范围内的数值。其中,二分类层包含全连接层和Sigmod层,且全连接层和Sigmod层依次连接。这里,可以通过全连接层和Sigmod层对相关性表示向量依次进行处理,从而得到归一化后的相关概率。这里,文本分类模型中的二分类层的输入与融合层的输出相连接。
S107:基于至少一个所述标签相关性,在至少一个所述待预测话题标签中确定与所述待分类话题文本相匹配的目标话题标签。
通过上述描述可知,针对待预测话题标签,待分类话题文本和每个待预测话题标签均可以确定出对应的相关性表示向量。此时,就可以分别对每个相关性表示向量进行归一化处理,得到至少一个相关概率,其中,该相关概率可以为0到1的概率值。这里,每个相关概率用于表征待分类话题文本与对应待预测话题标签之间的相关程度(或者相似程度)。
这里,在得到至少一个相关概率后,就可以对该至少一个相关概率进行筛选,从而确定出满足概率要求的相关概率。具体的,该概率要求可以理解为大于或者等于预设概率阈值。在此情况下,就可以在该至少一个相关概率中确定大于或者等于预设概率阈值的相关概率作为满足概率要求的相关概率。
在确定出满足概率要求的相关概率后，就可以确定该满足概率要求的相关概率所对应的待预测话题标签，并将确定出的待预测话题标签确定为目标话题标签。
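The selection step above (keep each candidate label whose relevance probability meets the preset probability threshold, and take those as target topic labels) can be sketched as follows; the tag names, probabilities, and the 0.5 threshold are illustrative:

```python
# Sketch of target-label selection: keep candidate labels whose relevance
# probability meets the preset threshold. Values below are illustrative.

def select_target_tags(tag_probs, threshold=0.5):
    """Return candidate tags whose relevance probability >= threshold."""
    return [tag for tag, prob in tag_probs.items() if prob >= threshold]

# Hypothetical relevance probabilities from the binary classification layer.
probs = {"言情": 0.91, "小说": 0.77, "体育": 0.08}
target_tags = select_target_tags(probs)
```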
在本公开实施例中,通过确定标签描述特征和目标文本特征之间的标签相关性在待预测话题标签中确定目标话题标签的方式,可以更加准确的为待分类话题文本确定对应的话题标签,从而提高待分类话题文本的话题分类的准确度。在待分类话题文本为与书籍推荐相关联的推书话题的情况下,通过上述处理方式,可以更加准确的确定推书话题的话题标签,提高推书话题的分类精度,从而能够更加准确的为用户推送出满意的书籍,进而提高用户的阅读体验。
在一个可选的实施方式中,上述步骤S101,获取待分类话题文本和至少一个待预测话题标签的标签描述信息,具体包括如下过程:
(1)、获取待处理的原始文本数据,并确定所述原始文本数据中所包含的文本类型标识;
(2)、基于所述文本类型标识确定所述原始文本数据的数据分割位置,并基于所述数据分割位置对所述原始文本数据进行分割处理,得到所述待分类话题文本和所述标签描述信息。
在本公开实施例中,上述待处理的原始文本数据可以由多个部分组成,例如,该待处理的原始文本数据可以包含:待分类话题文本,至少一个待预测话题标签的标签描述信息。其中,该原始文本数据的每个部分可以对应着不同的文本类型标识。
在原始文本数据中包含多个文本块,每个文本块包含对应的数据标识位segment id,其 中,该数据标识位用于指示对应文本块的文本类型标识。具体实施时,可以分别对原始文本数据中每个文本块的数据标识位segment id进行识别,得到该segment id所指示的文本类型标识。
这里，在上述原始文本数据中，待分类话题文本所属文本块的数据标识位segment id所指示的文本类型标识的标识值可以设置为0，标签描述信息所属文本块的数据标识位segment id所指示的文本类型标识的标识值可以设置为1。
在本公开实施例中,可以基于文本类型标识的标识值确定原始文本数据的数据分割位置,并基于数据分割位置对该原始文本数据进行分割处理。
具体的,如图2所示,在对原始文本数据进行分割时,首先可以根据上述文本类型标识进行分割,得到待分类话题文本和标签描述信息。
这里,可以根据文本类型标识的标识值,在原始文本数据中插入第一分隔符[SEP],并基于第一分隔符对原始文本数据进行分割。具体实施时,在检测到任意两个连续文本类型标识的标识值不相同的情况下,在这两个连续文本类型标识中间插入第一分隔符[SEP],进而通过第一分隔符[SEP]对原始文本数据进行分割。
这里，还可以预先在待分类话题文本的各个不同类型的文本块之间插入第二分隔符，进而通过上述第二分隔符对待分类话题文本进行进一步分割，具体的，上述原始文本数据包括：待分类话题文本和标签描述信息（也可以记为description）。其中，待分类话题文本包括以下至少之一：话题标题文本（也可以记为title）、话题摘要文本（也可以记为abstract），话题标题文本可以为该待分类话题文本的标题，话题摘要文本可以为该待分类话题文本的内容简介。此时，待分类话题文本的各个不同类型的文本块可以理解为：所属于话题标题文本的文本块、所属于话题摘要文本的文本块。
通过上述描述可知，原始文本数据可以划分为不同的文本块（每个文本块也可以记为token），从而便于BERT模型（Bidirectional Encoder Representations from Transformers模型，即，特征提取层）对该原始文本数据进行处理。其中，该BERT模型能够对该原始文本数据进行特征提取，从而分别得到该待分类话题文本所对应的目标文本特征，以及标签描述信息所对应的标签描述特征。
这里,上述目标文本特征可以记为topix vector(文本表示向量),上述标签描述特征可以记为description vector(标签表示向量),其中,如图2所示,目标文本特征和标签描述特征分别由各自的子向量组成。
通过上述描述可知,通过根据文本类型标识对待处理的原始文本数据进行分割,得到待分类话题文本和标签描述信息的方式,能够快速的对待分类话题文本的目标文本特征和待预测话题标签的标签描述特征进行划分,从而提高待分类话题文本和待预测话题标签的标签相关性的确定效率。
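The segment-id based split described above can be sketched as follows, with segment id 0 marking the topic text to be classified and id 1 marking the label description information, as in the text; the token contents are illustrative:

```python
# Sketch of splitting the raw text data by the segment id of each text block:
# id 0 -> topic text to be classified, id 1 -> label description information.
# Token contents below are illustrative.

def split_by_segment_id(tokens, segment_ids):
    """Split tokens into (topic_text, label_description) by segment id."""
    topic, description = [], []
    for token, seg in zip(tokens, segment_ids):
        (topic if seg == 0 else description).append(token)
    return topic, description

tokens = ["求", "高", "质", "量", "言", "情", "小", "说", "言", "情", "爱", "情"]
segment_ids = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
topic_text, label_description = split_by_segment_id(tokens, segment_ids)
```

In the described scheme, a [SEP] separator would be inserted wherever two consecutive segment ids differ; the helper above performs the resulting split directly.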
在一个可选的实施方式中,上述步骤S103,提取所述待分类话题文本的目标文本特征,具体包括如下过程:
(1)、确定所述待分类话题文本中每个第一单位文本的目标向量,其中,所述目标向量中的元素用于指示该第一单位文本和每个预设单位文本之间的映射关系;
(2)、在所述待分类话题文本中全部第一单位文本的目标向量中提取所述待分类话题 文本的关键特征向量,并将所述关键特征向量确定为所述目标文本特征。
在本公开实施例中,首先可以对该待分类话题文本进行划分,得到多个第一单位文本。其中,每个第一单位文本所对应的目标向量的长度可以由该第一单位文本所包含的文本长度决定,该待分类话题文本的多个第一单位文本所包含的文本长度可以是不同的。例如,该第一单位文本中包含的文本长度可以分为:字、词、句、段四种类型。
这里,上述预设单位文本可以为预先设定的用于对第一单位文本进行筛选的文本,其中,该预设单位文本的数量可以为多个。在通过该预设单位文本对上述第一单位文本进行筛选时,首先可以确定各个第一单位文本所对应的目标向量,并分别确定该目标向量和每个预设单位文本之间的映射关系。
在本公开实施例中,在确定出上述映射关系后,就可以基于该映射关系,确定出该目标向量中和预设单位文本相匹配的子向量(即,图2中目标文本特征的子向量)为上述关键特征向量,然后就可以根据确定出的关键特征向量确定目标文本特征。
举例来说,假设上述预设单位文本为“科幻”,那么,在目标向量中确定出的和该预设单位文本相匹配的子向量所对应的第一单位文本也可以为“科幻”。或者,目标向量中的子向量和该预设单位文本也可以是不完全匹配的,例如,当第一单位文本为“科技”时,该第一单位文本所对应的子向量和预设单位文本的匹配度较高,此时,仍可以将该第一单位文本“科技”所对应的文本特征确定为目标文本特征。
通过上述描述可知,可以对目标向量中的关键特征向量进行提取,可以实现对不相关内容的过滤,从而减少运算量,进而提高确定目标文本特征的效率。
在一个可选的实施方式中,在目标文本特征中包含多个子文本特征,每个子文本特征对应所述待分类话题文本中每个第一单位文本的情况下,上述步骤S105:确定所述目标文本特征和每个所述标签描述特征之间的标签相关性,具体包括如下过程:
(1)、基于所述目标文本特征和所述标签描述特征,确定每个所述第一单位文本的相关系数,其中,所述相关系数用于表征该第一单位文本与对应待预测话题标签之间的标签相关程度;
(2)、基于每个所述第一单位文本的相关系数,对各个所述第一单位文本的子文本特征进行加权求和计算,并根据计算结果确定所述标签相关性。
在本公开实施例中,首先可以根据目标文本特征中每个第一单位文本的子文本特征的融合权重,对目标文本特征和标签描述特征进行融合运算,从而得到该标签相关性。
具体实施时,首先可以确定待分类话题文本中的每个第一单位文本的相关系数,其中,该相关系数可以用于表征每个第一单位文本和对应的待预测话题标签之间的标签相关程度。
具体的，以待分类话题文本中的第i个第一单位文本的子文本特征X_i为例，可以确定该第i个第一单位文本的相关系数，例如，该第一单位文本的相关系数可以记为：

$a_i=\frac{\exp(w_i)}{\sum_{j}\exp(w_j)}$

其中，$w_i=DX_i^{T}$，D为文本分类模型的训练过程学习得到的权重抽取矩阵。

在确定出每个第一单位文本的相关系数之后，就可以基于该相关系数对各个第一单位文本的子文本特征进行加权求和计算，从而得到标签相关性。

具体实施时，可以将相关系数和对应的子文本特征进行相乘之后，对全部第一单位文本的乘积进行求和运算，从而得到标签相关性，其中，上述标签相关性可以记为R，基于该相关系数和各个第一单位文本的子文本特征进行加权求和计算的过程可以记为：

$R=\sum_{i}a_i X_i$
通过上述描述可知,通过计算目标文本特征中每个第一单位文本和标签描述特征的相关系数并对该相关系数进行加权求和得到标签相关性的方式,可以提高标签相关性的准确性。
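The relevance computation described above can be sketched as follows: each weight w_i = D·X_i, a softmax over the weights gives the relevance coefficients a_i, and the label relevance R is the coefficient-weighted sum of sub-features. The sub-feature vectors and the single-row weight matrix D are hypothetical small values, for illustration only:

```python
import math

# Sketch of the relevance computation: w_i = D·X_i, coefficients
# a_i = exp(w_i) / sum_j exp(w_j), label relevance R = sum_i a_i * X_i.
# The sub-features X and single-row weight matrix D are hypothetical.

def label_relevance(sub_features, D):
    """Weighted sum of sub-features using softmax-normalized coefficients."""
    w = [sum(d * x for d, x in zip(D, X)) for X in sub_features]  # w_i = D X_i^T
    m = max(w)                                 # subtract max for numerical stability
    exp_w = [math.exp(wi - m) for wi in w]
    total = sum(exp_w)
    a = [e / total for e in exp_w]             # relevance coefficients a_i
    dim = len(sub_features[0])
    return [sum(a[i] * sub_features[i][k] for i in range(len(sub_features)))
            for k in range(dim)]

X = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # sub-features of three first unit texts
D = [0.5, 0.5]                            # hypothetical learned weight matrix
R = label_relevance(X, D)
```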
在一个可选的实施方式中,上述步骤:基于所述目标文本特征和所述标签描述特征,确定每个所述第一单位文本的相关系数,具体包括如下过程:
(1)、基于每个所述第一单位文本的子文本特征,确定该第一单位文本的第一子相关系数;
(2)、基于所述目标文本特征和所述标签描述特征确定第二子相关系数;
(3)、基于所述第一子相关系数和所述第二子相关系数之间的比值确定所述相关系数。
在本公开实施例中，首先可以确定该第i个第一单位文本的子文本特征的转置结果$X_i^{T}$，其中，T表示针对该第一单位文本的子文本特征X_i进行转置。在确定出该第一单位文本的子文本特征的转置结果后，就可以基于该转置结果确定出上述第一子相关系数$\exp(w_i)$，其中，$w_i=DX_i^{T}$，D为文本分类模型的训练过程学习得到的权重抽取矩阵（即，下述预设权重矩阵）。

之后，就可以确定上述第二子相关系数。具体实施时，可以基于目标文本特征和标签描述特征确定第二子相关系数$\sum_{j=1}^{i+k}\exp(w_j)$，其中，i表示第一单位文本的数量，k表示标签描述信息中第二单位文本的数量，$w_j=DX_j^{T}$，X_j表示目标文本特征的子文本特征和标签描述特征的子文本特征。
在本公开实施例中,在确定出上述第一子相关系数以及第二子相关系数之后,就可以基于该第一子相关系数以及第二子相关系数的比值确定出每个第一单位文本的相关系数。
通过上述描述可知,通过第一子相关系数和第二子相关系数确定上述相关系数的方式,可以提高标签相关性的准确性。
在一个可选的实施方式中,上述步骤:基于每个所述第一单位文本的子文本特征,确定该第一单位文本的第一子相关系数,具体包括如下过程:
(1)、基于每个所述第一单位文本的子文本特征和预设权重矩阵,确定该第一单位文本的第一权重;
(2)、基于所述第一权重确定所述第一子相关系数。
在本公开实施例中,首先可以确定上述第一单位文本的第一权重w i,其中,该第一权重w i可以用于表征该第一单位文本的子文本特征在目标文本特征中的融合权重。在计算出上述第一权重w i之后,就可以基于第一权重确定第一子相关系数。
具体实施时，可以获取预设权重矩阵D，之后就可以根据计算公式$w_i=DX_i^{T}$确定每个第一单位文本的第一权重w_i。

在本公开实施例中，在确定出上述第一权重后，就可以基于该第一权重确定出上述第一单位文本所对应的第一子相关系数$\exp(w_i)$。
通过上述描述可知,通过确定目标文本特征中每个第一单位文本的第一权重确定每个第一单位文本的第一子相关系数的方式,从而提高相关系数的准确性。
在一个可选的实施方式中,在上述标签描述特征中包含多个第二单位文本的情况下,上述步骤:基于所述目标文本特征和所述标签描述特征确定第二子相关系数,具体包括如下过程:
(1)、基于所述目标文本特征和预设权重矩阵确定各个第一单位文本的第二权重;
(2)、基于所述标签描述特征和所述预设权重矩阵确定各个第二单位文本的第三权重;
(3)、基于所述第二权重和所述第三权重确定所述第二子相关系数。
在本公开实施例中，首先可以基于目标文本特征中的子文本特征和预设权重矩阵D确定第二权重。具体地，可以通过公式$w_i=DX_i^{T}$确定第二权重。之后，还可以基于标签描述特征和预设权重矩阵确定第三权重，具体的，可以通过公式$w_k=DX_k^{T}$确定第三权重，其中，X_k为标签描述特征中第k个第二单位文本的子文本特征。

在确定第二权重以及第三权重之后，就可以基于第二权重和第三权重确定第二子相关系数$\sum_{j=1}^{i+k}\exp(w_j)$。

具体实施时，若上述第一单位文本的数量为i，第二单位文本的数量为k，且i+k=j，那么，该第二子相关系数可以表示为对基于各个第一单位文本的第二权重确定的$\sum_{m=1}^{i}\exp(w_m)$和基于各个第二单位文本的第三权重确定的$\sum_{n=1}^{k}\exp(w_n)$进行求和运算，从而得到$\sum_{j=1}^{i+k}\exp(w_j)$。
通过上述描述可知,通过确定第二权重以及第三权重,进而根据第二权重以及第三权重确定第二子相关系数的方式,可以提高相关系数的准确性。
在一个可选的实施方式中,在如图1所示实施例的基础上,上述步骤S103:所述提取所述待分类话题文本的目标文本特征,并提取每个所述待预测话题标签的标签描述信息的标签描述特征,包括:通过文本分类模型中的特征提取层提取所述待分类话题文本的目标文本特征,并提取每个所述待预测话题标签的标签描述信息的标签描述特征。
在本公开实施例中,如图3所示为本公开实施例所提供的文本分类方法中文本分类模型的框架结构图。如图3所示,该文本分类模型包括:特征提取网络,融合层和分类层(也即,二分类层);其中,特征提取网络包括:输入层、嵌入层和特征提取层。
在本公开实施例中,首先需要通过特征提取网络来分别提取待分类话题文本的目标文本特征和标签描述信息的标签描述特征。其中,如图3所示,上述特征提取网络包括:输入层、嵌入层以及特征提取层。
以待分类话题文本为例,特征提取网络提取目标文本特征的提取过程如下:
(1)、输入层:在获取到上述待分类话题文本后,将待分类话题文本输入至该输入层进行处理。之后,输入层就可以将该待分类话题文本转换为one-hot编码。在将待分类话题文本转换为one-hot编码后,待分类话题文本中的各个单位文本可以转化为由0,1组成的固定维度的向量。
(2)、嵌入层:在获取到上述待分类话题文本的one-hot编码后,就可以将该one-hot编码转换为该待分类话题文本所对应的词向量,以及将标签描述信息的one-hot编码转换为 该标签描述信息所对应的词向量。这里,可以通过word2vec模型将该one-hot编码转换为对应的词向量。
(3)、特征提取层:在获取到上述待分类话题文本所对应的词向量和标签描述信息所对应的词向量后,就可以对词向量进行特征提取,从而得到用于表征该待分类话题文本的所表达内容的文本表示向量,以及标签描述信息所对应的标签表示向量。
应理解的是,该特征提取层在进行特征提取时,可以根据词向量的语义进行提取,从而使得得到的文本表示向量通顺且能准确表达待分类话题文本的内容。这里,该特征提取层可以通过CNN模型(Convolutional Neural Networks,卷积神经网络),或者RNN模型(Recurrent Neural Networks,循环神经网络)等进行文本表示向量的提取。
需要说明的是,上述标签描述信息的标签表示向量的提取过程和上述文本表示向量的提取过程相同,此处不再进行赘述。
在一个可选的实施方式中,在如图1所示实施例的基础上,上述步骤S105:所述确定所述目标文本特征和每个所述标签描述特征之间的标签相关性,得到至少一个标签相关性,包括:通过文本分类模型中的相关性确定层确定所述目标文本特征和每个所述标签描述特征之间的标签相关性,得到至少一个标签相关性。
在本公开实施例中,如图3所示,可以通过融合层(即相关性确定层)对目标文本特征和标签描述特征进行融合运算,从而得到目标文本特征和标签描述特征之间的标签相关性。
这里,可以将上述目标文本特征分为各个第一单位文本的子文本特征,再分别计算每个第一单位文本的子文本特征和标签描述特征之间的相关性,从而根据全部第一单位文本的子文本特征和标签描述特征之间的相关性,确定出目标文本特征和标签描述特征之间的标签相关性。
具体的，该融合层首先可以通过公式$w_i=DX_i^{T}$来计算第一权重w_i。然后，就可以基于该第一权重w_i，计算目标文本特征和标签描述特征之间的标签相关性R，其中，

$R=\sum_{i}a_i X_i$，$a_i=\frac{\exp(w_i)}{\sum_{j}\exp(w_j)}$
需要说明的是,在待预测话题标签的数量为多个的情况下,待分类话题文本和每个待预测话题标签的标签描述信息之间都对应着一个标签相关性。
在一个可选的实施方式中,在如图1所示实施例的基础上,上述步骤S107:所述基于至少一个所述标签相关性,在至少一个所述待预测话题标签中确定与所述待分类话题文本相匹配的目标话题标签,包括:通过文本分类模型中的分类层基于至少一个所述标签相关性,在至少一个所述待预测话题标签中确定与所述待分类话题文本相匹配的目标话题标签。
在本公开实施例中,上述分类层可以由全连接层和归一化层组成,其中,该全连接层可以包含矩阵W。具体的,该分类层在获取到上述标签相关性后,就可以通过该全连接层和归一化层,将该标签相关性的向量映射为相关概率,其中,该相关概率用于表征待预测话题标签和待分类话题文本之间的相关程度。
这里，具体的映射过程如下：$\mathrm{logit}=\mathrm{sigmoid}(R^{T}W)$。

其中，logit的表达形式可以为百分数形式的概率值，例如，60%，R为上述标签描述特征和目标文本特征之间的标签相关性。上述sigmoid为归一化函数，该sigmoid的计算方式如下：

$\mathrm{sigmoid}(x)=\frac{1}{1+e^{-x}}$
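The mapping performed by the fully connected layer and the normalization layer, logit = sigmoid(R^T W), can be sketched as follows; the R and W values are hypothetical:

```python
import math

# Sketch of the classification layer: a fully connected layer computes R^T·W,
# and sigmoid normalization maps it to a relevance probability in [0, 1].
# The R and W values below are hypothetical.

def sigmoid(x):
    """Normalization function: sigmoid(x) = 1 / (1 + e^(-x))."""
    return 1.0 / (1.0 + math.exp(-x))

def relevance_probability(R, W):
    """logit = sigmoid(R^T W): probability that the label matches the text."""
    return sigmoid(sum(r * w for r, w in zip(R, W)))

p = relevance_probability([0.7, 0.7], [1.0, 1.0])  # hypothetical R and W
```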
通过上述描述可知,通过确定标签描述特征和目标文本特征之间的标签相关性在待预测话题标签中确定目标话题标签的方式,可以更加准确的为待分类话题文本确定对应的话题标签,从而提高待分类话题文本的话题分类的准确度。在待分类话题文本为与书籍推荐相关联的推书话题的情况下,通过上述处理方式,可以更加准确的确定推书话题的话题标签,提高推书话题的分类精度,从而能够更加准确的为用户推送出满意的书籍,进而提高用户的阅读体验。
在一个可选的实施方式中,所述方法还包括针对待训练的文本分类模型进行训练的过程:
(1)、确定多个训练样本;其中,每个训练样本中包含待预测话题标签和待训练话题文本,每个所述训练样本包含匹配标签,所述匹配标签用于指示所述待预测话题标签和待训练话题文本之间的匹配性;
(2)、通过所述多个训练样本对待训练的文本分类模型进行训练,得到所述文本分类模型。
在本公开实施例中,首先可以确定多个包含待预测话题标签和待训练话题文本的训练样本,其中,每个训练样本中包含一个待训练话题文本和至少一个待预测话题标签,每个待预测话题标签对应着一个匹配标签,该匹配标签用于表征该待预测话题标签和待分类话题文本之间的匹配性。
这里,上述匹配标签为“1”时,可以表示待预测话题标签和待训练话题文本之间为匹配的;当匹配标签为“0”时,可以表示待预测话题标签和待训练话题文本之间为不匹配的。
在本公开实施例中,通过所述多个训练样本对待训练的文本分类模型进行训练,得到所述文本分类模型,具体包括如下过程:
(1)、确定所述多个训练样本中所包含待预测话题标签的第一标签数量,并确定所述待预测话题标签中与所述待训练话题文本相匹配的目标分类标签的第二标签数量;
(2)、基于所述第一标签数量、所述第二标签数量、所述匹配标签和所述待训练的文本分类模型对所述多个训练样本的预测结果,确定所述待训练的文本分类模型的目标损失函数值;
(3)、根据所述目标损失函数值,调整所述待训练的文本分类模型的模型参数,得到所述文本分类模型。
在本公开实施例中，首先需要确定该待训练的文本分类模型的目标损失函数loss，具体的，该目标损失函数loss的计算过程如下：

$loss=-\frac{1}{N_{tags}}\sum_{t=1}^{N_{tags}}\left[\sigma\,y_{true}\log(y_{pred})+(1-y_{true})\log(1-y_{pred})\right]$

其中，N_tags为多个训练样本中所包含待预测话题标签的第一标签数量。y_true为符号函数，即上述匹配标签。在待预测话题标签和待训练话题文本匹配时，y_true=1；在待预测话题标签和待训练话题文本不匹配时，y_true=0。其中，可以根据符号函数确定上述第二标签数量。y_pred为该待训练的文本分类模型针对该待预测话题标签输出的相关概率的预测值（即，待训练的文本分类模型对多个训练样本的预测结果）。σ为超参数，一般为每个训练样本中包含的第一标签数量的平均数。
通过上述描述可知,可以基于第一标签数量、第二标签数量、匹配标签和待训练的文本分类模型对多个训练样本的预测结果,确定待训练的文本分类模型的目标损失函数值,并根据该目标损失函数值调整待训练的文本分类模型的模型参数,从而提高文本分类模型的预测精确度。
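A sketch of the target loss described above, implemented as a weighted binary cross-entropy over the N_tags candidate labels. The original formula image is not fully recoverable, so using σ as the weight on the matching (y_true = 1) labels is an assumption consistent with the quantities the text defines; the y_true/y_pred values below are illustrative:

```python
import math

# Sketch of the target loss as a weighted binary cross-entropy over N_tags
# candidate labels. Weighting matching labels by sigma is an assumption;
# the sample values below are illustrative.

def target_loss(y_true, y_pred, sigma):
    """-(1/N_tags) * sum[sigma*y*log(p) + (1-y)*log(1-p)] over candidate labels."""
    n_tags = len(y_true)  # first label count N_tags
    total = 0.0
    for y, p in zip(y_true, y_pred):
        total += sigma * y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / n_tags

# One training sample with three candidate labels; only the first matches.
loss = target_loss([1, 0, 0], [0.9, 0.2, 0.1], sigma=1.0)
```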
参见图4所示,为本公开实施例提供的一种文本处理方法的流程图,该方法应用于终端设备,在该终端设备中预先安装了阅读类软件,所述方法包括步骤S401~S407,其中:
S401:展示话题文本的操作页面。
在本公开实施例中,上述话题文本的操作页面如图5所示,其中,图5中所展示的用户在上述阅读类软件中进行发帖操作的发帖页面,用户可以在该操作页面中输入目标数据。
S403:接收用户在所述操作页面输入的目标数据,其中,所述目标数据包括:待发布话题文本,或者,感兴趣话题标签。
假设,目标数据为待发布话题文本。在此情况下,用户可以如图5所示的界面输入待发布话题文本;之后,终端设备就可以向服务器发送该待发布话题文本,服务器就可以根据上述实施例中所描述的文本分类方法确定与待发布话题文本相匹配的话题标签,并将该话题标签展示在如图5所示的第二展示位置。
S405:获取服务器基于所述目标数据确定的筛选结果,其中,所述筛选结果为所述服务器基于上述任一实施例所述的文本分类方法对基于所述目标数据确定的待筛选数据进行筛选之后的结果。
在本公开实施例中,针对不同类型的目标数据,服务器返回的筛选结果也是不同的。
举例来说,如果目标数据为待发布话题文本,那么服务器就可以根据上述实施例中所描述的文本分类方法确定与待发布话题文本相匹配的话题标签。如果目标数据为感兴趣话题标签,那么服务器就可以根据上述实施例中所描述的文本分类方法确定与该感兴趣话题标签相匹配的已发布话题文本。
S407:在所述操作页面展示所述目标数据和/或所述目标数据的筛选结果。
在本公开实施例中,在上述目标数据为感兴趣话题标签的情况下,目标数据的筛选结果可以为与感兴趣标签相匹配的已发布话题文本。例如,在上述目标数据为“科技”时,在操作页面展示的可以为该目标数据以及和该感兴趣标签相关的书籍或者文章的推荐话题,其中,该推荐话题可以为用于推荐书籍或者文章的已发布话题文本。
通过上述处理方式,可以更加准确的确定推书话题的话题标签,提高推书话题的分类精度,从而能够更加准确的为用户推送出满意的书籍,进而提高用户的阅读体验。
在一个可选的实施方式中,上述目标数据包含所述待发布话题文本;上述在所述操作页面展示所述目标数据和/或所述目标数据的筛选结果,具体包括如下过程:
(1)、在所述操作页面的第一展示位置展示所述待发布话题文本;
(2)、在所述操作页面的第二展示位置展示所述待发布话题文本的发布类型和/或与所述待发布话题文本相匹配的至少一个目标话题标签。
在本公开实施例中,如图5所示,上述第一展示位置用于展示用户输入的待发布话题文本,其中,该第一展示位置中的第一子展示位置用于展示该待发布话题文本的文本标题,该第一展示位置中的第二子展示位置用于展示该待发布话题文本的文本内容。
另外的,如图5所示,第二展示区域包含与待发布话题文本相匹配的至少一个目标话题标签。
通过上述描述可知,可以分别通过第一展示位置以及第二展示位置展示目标数据的不同内容,从而使得操作界面的布局更美观,更合理,提高了用户的操作体验。
在一个可选的实施方式中,在如图4所示实施例的基础上,所述方法还包括:
检测用户针对所述操作页面中所展示的所述目标话题标签的标签修改标识的触发操作,对所述目标话题标签执行与用户所触发的标签修改标识相匹配的修改操作,并在所述操作页面中展示修改之后的目标话题标签,其中,所述修改操作包括以下至少之一:新增、删除、修改。
在本公开实施例中,如图5所示,用户还可以通过标签修改标识对目标话题标签进行修改操作,其中,在检测到用户针对“+点击添加”按钮(即标签修改标识)的触发操作后,就可以确定与“+点击添加”按钮相匹配的修改操作为新增操作,并响应于该新增操作,在第二展示位置增加对应的新增话题标签。
另外的,如图5所示,每个目标话题标签内还可以包含“×”标签修改标识,其中,在检测到用户该“×”标签修改标识的触发操作后,就可以确定和该“×”标签修改标识相匹配的修改操作为删除操作,并响应于该删除操作删除对应的目标话题标签。
另外的,用户还可以通过触发该第二展示位置的目标话题标签,直接修改该目标话题标签中的标签内容,例如,在检测到用户针对“科技”目标话题标签的触发操作后,获取用户针对该“科技”目标话题中的修改内容,在该修改内容命中标签库中的话题标签后,将该修改内容所对应的话题标签确定为目标话题标签。
通过上述描述可知,可以通过修改操作对目标话题标签进行修改操作从而使得用户在添加目标话题标签时更灵活、更便捷,提高了用户的使用体验。
在一个可选的实施方式中,在目标数据包含所述感兴趣话题标签的情况下,所述方法还包括如下过程:
(1)、在接收用户在所述操作页面输入的感兴趣话题标签之后,检测所述感兴趣话题标签的标签数量是否超过预设数量;
(2)、在所述标签数量超过所述预设数量的情况下,展示提示信息;所述提示信息用于指示所述感兴趣话题标签的数量已达到所述预设数量。
在本公开实施例中,在检测到上述目标话题标签的新增操作后,就可以在显示界面上展示如图6所示的待选话题标签页面,其中,用户可以通过选择该待选话题标签页面中的待选话题标签来确定感兴趣话题标签。
另外,在用户选择感兴趣标签时,还可以检测用户选择的感兴趣标签是否超过预设数量,并在标签数量超过预设数量的情况下,展示提示信息,提示信息用于指示所述感兴趣 话题标签的数量已达到所述预设数量。
在本公开实施例中,上述感兴趣标签可以对应着不同的类别维度,其中,如图6所示,该感兴趣标签对应的类别维度包括:话题类型、性别偏好、推书类型。
因此,上述预设数量可以为针对全部类别维度的感兴趣标签设置的,也可以为针对至少部分类别维度的感兴趣标签设置的。这里,以该预设数量是针对“推书类型”的类别维度设置为例,具体的,若该预设数量为3,在检测到用户在“推书类型”的类别维度下选择的感兴趣标签超过3个时,则如图6所示,在显示界面上展示提示信息:“最多可选3个推书类型”。
通过上述描述可知,可以通过预设数量限制用户选择的感兴趣标签的数量,从而减少由于感兴趣标签的数量过多造成的筛选效率降低,提高用户的使用体验。
在一个可选的实施方式中,在目标数据包括感兴趣话题标签的情况下;上述展示话题文本的操作页面,具体包括如下过程:
(1)、响应于用户的话题筛选请求,获取所属于至少一个目标话题类别的预设话题标签;
(2)、在所述操作页面中确定每个所述目标话题类别的类别展示区域,并在所述类别展示区域中展示对应目标话题类别和所属于该目标话题类别的预设话题标签。
在本公开实施例中,如图6所示,目标话题类别分为“话题类型”、“性别偏好”和“推书类型”。
在本公开实施例中,在确定出所属于至少一个目标话题类别的预设话题标签之后,就可以在操作页面中确定每个所述目标话题类别的类别展示区域。例如,确定“话题类型”的类别展示区域,“性别偏好”的类别展示区域,以及“推书类型”的类别展示区域。
在确定出对应的类别展示区域之后,就可以在类别展示区域中展示对应目标话题类别和所属于该目标话题类别的预设话题标签。
例如,针对目标话题类别“话题类型”,所属于该“话题类型”的预设话题标签可以包含“按情节”、“按角色”、“按品类”。例如,针对目标话题类别“性别偏好”,所属于该“性别偏好”的预设话题标签可以包含“男生向”和“女生向”。
通过上述描述可知,可以根据目标话题类别分别确定对应的预设话题标签,并通过每个目标话题类别所对应的类别展示区域进行展示,从而提高了确定目标话题标签的效率,并且使得界面布局更加美观,提高用户的浏览体验。
在一个可选的实施方式中,在目标数据包括所述感兴趣话题标签的情况下,上述在所述操作页面展示所述目标数据和/或所述目标数据的筛选结果,具体包括如下过程:
(1)、在所述操作页面的标题展示区域中展示所述感兴趣话题标签;
(2)、在所述操作页面的文本展示区域中展示与每个所述感兴趣话题标签相匹配的已发布话题文本的关键话题内容。
在本公开实施例中,在展示上述目标数据时的展示页面如图7所示,其中,该展示页面中包含标题展示区域以及文本展示区域,其中,该标题展示区域用于展示感兴趣话题标签,文本展示区域用于展示和感兴趣话题标签相匹配的已发布话题文本的关键话题内容。
具体的,该关键话题内容可以包含已发布话题文本的文本标题以及浏览标识,其中, 该浏览标识用于表征该已发布话题文本的被浏览次数、推荐数书籍被采纳次数(该被采纳次数可以为如图7所示的“拯救了15.3w人的书荒”的形式)等数据。
通过上述描述可知，可以分别通过标题展示区域以及文本展示区域对感兴趣话题标签和已发布话题文本的关键话题内容进行展示，使得页面布局更加合理，并且，通过展示已发布话题文本的关键话题内容的方式，实现了对已发布话题文本的提炼，进一步提高了页面布局的合理性，使得展示界面可以同时展示更多的实质性内容，方便用户观看。
在一个可选的实施方式中,所述方法还包括:
(1)、响应于针对所述感兴趣话题标签的选择操作,确定用户所选择的目标话题标签,并获取与所述目标话题标签相匹配的已发布话题文本;
(2)、在话题筛选页面的文本展示区域中展示与所述目标话题标签相匹配的已发布话题文本的关键话题内容。
在本公开实施例中,用户可以通过针对上述感兴趣话题标签的选择操作,确定想要查看的目标话题标签所对应的已发布话题文本。具体的,在检测到用户选择的目标话题标签后,就可以对话题筛选页面所展示已发布话题文本进行筛选,从而确定出和该目标话题文本相匹配的已发布话题文本,并在文本展示区域展示和该目标话题文本相匹配的已发布话题文本的关键话题内容。
通过上述描述可知,可以通过感兴趣话题标签对话题筛选页面中展示的已发布话题文本的关键话题内容进行筛选,从而更好的适用于用户的使用需求,提高用户的使用体验。
综上,在本公开实施例中,通过确定标签描述特征和目标文本特征之间的标签相关性在待预测话题标签中确定目标话题标签的方式,可以更加准确的为待分类话题文本确定对应的话题标签,从而提高待分类话题文本的话题分类的准确度。在待分类话题文本为与书籍推荐相关联的推书话题的情况下,通过上述处理方式,可以更加准确的确定推书话题的话题标签,提高推书话题的分类精度,从而能够更加准确的为用户推送出满意的书籍,进而提高用户的阅读体验。
本领域技术人员可以理解,在具体实施方式的上述方法中,各步骤的撰写顺序并不意味着严格的执行顺序而对实施过程构成任何限定,各步骤的具体执行顺序应当以其功能和可能的内在逻辑确定。
基于同一发明构思,本公开实施例中还提供了与文本分类方法对应的文本分类装置,由于本公开实施例中的装置解决问题的原理与本公开实施例上述文本分类方法相似,因此装置的实施可以参见方法的实施,重复之处不再赘述。
参照图8所示,为本公开实施例提供的一种文本分类装置的示意图,所述装置包括:第一获取单元81、提取单元82、第一确定单元83、第二确定单元84;其中,
第一获取单元81,用于获取待分类话题文本和至少一个待预测话题标签的标签描述信息;
提取单元82,用于:提取所述待分类话题文本的目标文本特征,并提取每个所述待预测话题标签的标签描述信息的标签描述特征;
第一确定单元83,用于:确定所述目标文本特征和每个所述标签描述特征之间的标签相关性,得到至少一个标签相关性;
第二确定单元84,用于:基于至少一个所述标签相关性,在至少一个所述待预测话题标签中确定与所述待分类话题文本相匹配的目标话题标签。
本公开实施例中,通过确定标签描述特征和目标文本特征之间的标签相关性在待预测话题标签中确定目标话题标签的方式,可以更加准确的为待分类话题文本确定对应的话题标签,从而提高待分类话题文本的话题分类的准确度。在待分类话题文本为与书籍推荐相关联的推书话题的情况下,通过上述处理方式,可以更加准确的确定推书话题的话题标签,提高推书话题的分类精度,从而能够更加准确的为用户推送出满意的书籍,进而提高用户的阅读体验。
一种可能的实施方式中,所述目标文本特征中包含多个子文本特征,每个子文本特征对应所述待分类话题文本中每个第一单位文本,第一确定单元83,还用于:
基于所述目标文本特征和所述标签描述特征,确定每个所述第一单位文本的相关系数,其中,所述相关系数用于表征该第一单位文本与对应待预测话题标签之间的标签相关程度;基于每个所述第一单位文本的相关系数,对各个所述第一单位文本的子文本特征进行加权求和计算,并根据计算结果确定所述标签相关性。
一种可能的实施方式中,第一确定单元83,还用于:
基于每个所述第一单位文本的子文本特征,确定该第一单位文本的第一子相关系数;基于所述目标文本特征和所述标签描述特征确定第二子相关系数;基于所述第一子相关系数和所述第二子相关系数之间的比值确定所述相关系数。
一种可能的实施方式中,第一确定单元83,还用于:
基于每个所述第一单位文本的子文本特征和预设权重矩阵,确定该第一单位文本的第一权重;基于所述第一权重确定所述第一子相关系数。
一种可能的实施方式中,所述标签描述特征中包含多个第二单位文本;第一确定单元83,还用于:
基于所述目标文本特征和预设权重矩阵确定各个第一单位文本的第二权重;基于所述标签描述特征和所述预设权重矩阵确定各个第二单位文本的第三权重;基于所述第二权重和所述第三权重确定所述第二子相关系数。
一种可能的实施方式中,第一获取单元81,还用于:
获取待处理的原始文本数据,并确定所述原始文本数据中所包含的文本类型标识;基于所述文本类型标识确定所述原始文本数据的数据分割位置,并基于所述数据分割位置对所述原始文本数据进行分割处理,得到所述待分类话题文本和所述标签描述信息。
一种可能的实施方式中,提取单元82,还用于:
确定所述待分类话题文本中每个第一单位文本的目标向量,其中,所述目标向量中的元素用于指示该第一单位文本和每个预设单位文本之间的映射关系;在所述待分类话题文本中全部第一单位文本的目标向量中提取所述待分类话题文本的关键特征向量,并将所述关键特征向量确定为所述目标文本特征。
一种可能的实施方式中,所述待分类话题文本包括以下至少之一:话题标题文本、话题摘要文本、话题标签描述文本。
一种可能的实施方式中,该装置还用于:
所述提取所述待分类话题文本的目标文本特征,并提取每个所述待预测话题标签的标签描述信息的标签描述特征,包括:通过文本分类模型中的特征提取层提取所述待分类话题文本的目标文本特征,并提取每个所述待预测话题标签的标签描述信息的标签描述特征;所述确定所述目标文本特征和每个所述标签描述特征之间的标签相关性,得到至少一个标签相关性,包括:通过文本分类模型中的相关性确定层确定所述目标文本特征和每个所述标签描述特征之间的标签相关性,得到至少一个标签相关性;所述基于至少一个所述标签相关性,在至少一个所述待预测话题标签中确定与所述待分类话题文本相匹配的目标话题标签,包括:通过文本分类模型中的分类层基于至少一个所述标签相关性,在至少一个所述待预测话题标签中确定与所述待分类话题文本相匹配的目标话题标签。
一种可能的实施方式中,该装置还用于:
确定多个训练样本;其中,每个训练样本中包含待预测话题标签和待训练话题文本,每个所述训练样本包含匹配标签,所述匹配标签用于指示所述待预测话题标签和待训练话题文本之间的匹配性;通过所述多个训练样本对待训练的文本分类模型进行训练,得到所述文本分类模型。
一种可能的实施方式中,该装置还用于:
确定所述多个训练样本中所包含待预测话题标签的第一标签数量,并确定所述待预测话题标签中与所述待训练话题文本相匹配的目标分类标签的第二标签数量;基于所述第一标签数量、所述第二标签数量、所述匹配标签和所述待训练的文本分类模型对所述多个训练样本的预测结果,确定所述待训练的文本分类模型的目标损失函数值;根据所述目标损失函数值,调整所述待训练的文本分类模型的模型参数,得到所述文本分类模型。
参照图9所示,为本公开实施例提供的一种文本处理装置的示意图,所述装置包括:第一展示单元91、接收单元92、第二获取单元93、第二展示单元94;其中,
第一展示单元91,用于:展示话题文本的操作页面;
接收单元92,用于:接收用户在所述操作页面输入的目标数据,其中,所述目标数据包括:待发布话题文本,或者,感兴趣话题标签;
第二获取单元93,用于:获取服务器基于所述目标数据确定的筛选结果,其中,所述筛选结果为所述服务器基于上述实施例所述的文本分类方法对基于所述目标数据确定的待筛选数据进行筛选之后的结果;
第二展示单元94,用于:在所述操作页面展示所述目标数据和/或所述目标数据的筛选结果。
通过上述处理方式,可以更加准确的确定推书话题的话题标签,提高推书话题的分类精度,从而能够更加准确的为用户推送出满意的书籍,进而提高用户的阅读体验。
In a possible implementation, the target data contains the topic text to be published, and the second display unit 94 is further configured to:
display the topic text to be published at a first display position of the operation page; and display, at a second display position of the operation page, a publication type of the topic text to be published and/or at least one target topic label matching the topic text to be published.
In a possible implementation, the second display unit 94 is further configured to:
detect a trigger operation by the user on a label modification identifier of the target topic label displayed on the operation page, perform on the target topic label a modification operation matching the label modification identifier triggered by the user, and display the modified target topic label on the operation page, where the modification operation includes at least one of the following: adding, deleting, and modifying.
In a possible implementation, the target data contains the topic label of interest, and the apparatus is further configured to:
after receiving the topic label of interest input by the user on the operation page, detect whether the number of topic labels of interest exceeds a preset number; and display prompt information in a case where the number of labels exceeds the preset number, the prompt information indicating that the number of topic labels of interest has reached the preset number.
In a possible implementation, the target data includes the topic label of interest, and the first display unit 91 is further configured to:
in response to a topic filtering request of the user, acquire preset topic labels belonging to at least one target topic category; and determine a category display area for each target topic category on the operation page, and display, in the category display area, the corresponding target topic category and the preset topic labels belonging to that target topic category.
In a possible implementation, the target data includes the topic label of interest, and the second display unit 94 is further configured to:
display the topic label of interest in a title display area of the operation page; and display, in a text display area of the operation page, key topic content of the published topic texts matching each topic label of interest.
In a possible implementation, the second display unit 94 is further configured to:
in response to a selection operation on the topic label of interest, determine a target topic label selected by the user and acquire published topic texts matching the target topic label; and display, in a text display area of a topic filtering page, key topic content of the published topic texts matching the target topic label.
For descriptions of the processing flows of the units in the apparatus and the interaction flows between the units, reference may be made to the relevant descriptions in the above method embodiments, which are not detailed here.
Corresponding to the text classification method in FIG. 1, an embodiment of the present disclosure further provides another computer device 1000. As shown in FIG. 10, which is a schematic structural diagram of the computer device 1000 provided by an embodiment of the present disclosure, the computer device includes:
a processor 101, a memory 102, and a bus 103. The memory 102 is configured to store execution instructions and includes an internal memory 1021 and an external memory 1022. The internal memory 1021, also called internal storage, temporarily stores operation data in the processor 101 as well as data exchanged with the external memory 1022 such as a hard disk; the processor 101 exchanges data with the external memory 1022 through the internal memory 1021. When the computer device 1000 runs, the processor 101 communicates with the memory 102 through the bus 103, causing the processor 101 to execute the following instructions:
acquiring topic text to be classified and label description information of at least one topic label to be predicted;
extracting a target text feature of the topic text to be classified, and extracting a label description feature of the label description information of each topic label to be predicted;
determining a label relevance between the target text feature and each label description feature to obtain at least one label relevance; and
determining, based on the at least one label relevance, a target topic label matching the topic text to be classified among the at least one topic label to be predicted.
Corresponding to the text processing method in FIG. 1, an embodiment of the present disclosure further provides a computer device 1100. As shown in FIG. 11, which is a schematic structural diagram of the computer device 1100 provided by an embodiment of the present disclosure, the computer device includes:
a processor 111, a memory 112, and a bus 113. The memory 112 is configured to store execution instructions and includes an internal memory 1121 and an external memory 1122. The internal memory 1121, also called internal storage, temporarily stores operation data in the processor 111 as well as data exchanged with the external memory 1122 such as a hard disk; the processor 111 exchanges data with the external memory 1122 through the internal memory 1121. When the computer device 1100 runs, the processor 111 communicates with the memory 112 through the bus 113, causing the processor 111 to execute the following instructions:
displaying an operation page for topic text;
receiving target data input by a user on the operation page, where the target data includes a topic text to be published or a topic label of interest;
acquiring a filtering result determined by a server based on the target data, where the filtering result is a result obtained after the server filters, based on the text classification method described in the above embodiments, the data to be filtered determined based on the target data; and
displaying, on the operation page, the target data and/or the filtering result of the target data.
An embodiment of the present disclosure further provides a computer-readable storage medium on which a computer program is stored; when the computer program is run by a processor, the steps of the text classification and text processing methods described in the above method embodiments are executed. The storage medium may be a volatile or non-volatile computer-readable storage medium.
An embodiment of the present disclosure further provides a computer program product carrying program code; the instructions included in the program code can be used to execute the steps of the text classification and text processing methods described in the above method embodiments, for which reference may be made to the above method embodiments; details are not repeated here.
The above computer program product may be implemented by hardware, software, or a combination thereof. In an optional embodiment, the computer program product is embodied as a computer storage medium; in another optional embodiment, the computer program product is embodied as a software product, such as a Software Development Kit (SDK).
An embodiment of the present disclosure further provides a computer program stored in a readable storage medium. At least one processor of an electronic device can read the computer program from the readable storage medium, and the at least one processor executes the computer program to cause the electronic device to execute the steps of the text classification and text processing methods described in the above method embodiments, for which reference may be made to the above method embodiments; details are not repeated here.
Those skilled in the art can clearly understand that, for convenience and brevity of description, for the specific working processes of the systems and apparatuses described above, reference may be made to the corresponding processes in the foregoing method embodiments; details are not repeated here. In the several embodiments provided in the present disclosure, it should be understood that the disclosed systems, apparatuses, and methods may be implemented in other ways. The apparatus embodiments described above are merely illustrative; for example, the division of the units is merely a logical function division, and there may be other divisions in actual implementation; as another example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some communication interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present disclosure may be integrated into one processing unit, each unit may exist physically alone, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as independent products, they may be stored in a processor-executable non-volatile computer-readable storage medium. Based on such an understanding, the technical solution of the present disclosure, in essence, the part contributing to the prior art, or part of the technical solution may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present disclosure. The foregoing storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Finally, it should be noted that the above embodiments are merely specific implementations of the present disclosure, used to illustrate rather than limit the technical solutions of the present disclosure, and the protection scope of the present disclosure is not limited thereto. Although the present disclosure has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that any person skilled in the art may, within the technical scope disclosed in the present disclosure, modify the technical solutions recorded in the foregoing embodiments, readily conceive of changes, or make equivalent replacements of some of the technical features therein; such modifications, changes, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present disclosure and shall all be covered within the protection scope of the present disclosure. Therefore, the protection scope of the present disclosure shall be subject to the protection scope of the claims.

Claims (24)

  1. A text classification method, applied to a server, comprising:
    acquiring topic text to be classified and label description information of at least one topic label to be predicted;
    extracting a target text feature of the topic text to be classified, and extracting a label description feature of the label description information of each topic label to be predicted;
    determining a label relevance between the target text feature and each label description feature to obtain at least one label relevance; and
    determining, based on the at least one label relevance, a target topic label matching the topic text to be classified among the at least one topic label to be predicted.
  2. The method according to claim 1, wherein the target text feature contains a plurality of sub-text features, and each sub-text feature corresponds to each first unit text in the topic text to be classified;
    the determining a label relevance between the target text feature and each label description feature comprises:
    determining, based on the target text feature and the label description feature, a correlation coefficient of each first unit text, wherein the correlation coefficient characterizes a degree of label relevance between the first unit text and the corresponding topic label to be predicted; and
    performing, based on the correlation coefficient of each first unit text, a weighted summation over the sub-text features of the first unit texts, and determining the label relevance according to a result of the computation.
  3. The method according to claim 2, wherein the determining, based on the target text feature and the label description feature, a correlation coefficient of each first unit text comprises:
    determining a first sub-correlation coefficient of the first unit text based on the sub-text feature of each first unit text;
    determining a second sub-correlation coefficient based on the target text feature and the label description feature; and
    determining the correlation coefficient based on a ratio between the first sub-correlation coefficient and the second sub-correlation coefficient.
  4. The method according to claim 3, wherein the determining a first sub-correlation coefficient of the first unit text based on the sub-text feature of each first unit text comprises:
    determining a first weight of the first unit text based on the sub-text feature of each first unit text and a preset weight matrix; and
    determining the first sub-correlation coefficient based on the first weight.
  5. The method according to claim 3, wherein the label description feature contains a plurality of second unit texts;
    the determining a second sub-correlation coefficient based on the target text feature and the label description feature comprises:
    determining a second weight of each first unit text based on the target text feature and a preset weight matrix;
    determining a third weight of each second unit text based on the label description feature and the preset weight matrix; and
    determining the second sub-correlation coefficient based on the second weights and the third weights.
  6. The method according to any one of claims 1 to 5, wherein the acquiring topic text to be classified and label description information of at least one topic label to be predicted comprises:
    acquiring raw text data to be processed, and determining a text type identifier contained in the raw text data; and
    determining a data split position of the raw text data based on the text type identifier, and splitting the raw text data based on the data split position to obtain the topic text to be classified and the label description information.
  7. The method according to any one of claims 1 to 6, wherein the extracting a target text feature of the topic text to be classified comprises:
    determining a target vector of each first unit text in the topic text to be classified, wherein elements of the target vector indicate a mapping relationship between the first unit text and each preset unit text; and
    extracting a key feature vector of the topic text to be classified from the target vectors of all the first unit texts in the topic text to be classified, and determining the key feature vector as the target text feature.
  8. The method according to any one of claims 1 to 7, wherein the topic text to be classified includes at least one of the following: topic title text, topic abstract text, and topic label description text.
  9. The method according to any one of claims 1 to 6, wherein
    the extracting a target text feature of the topic text to be classified, and extracting a label description feature of the label description information of each topic label to be predicted comprises: extracting, by a feature extraction layer in a text classification model, the target text feature of the topic text to be classified, and extracting the label description feature of the label description information of each topic label to be predicted;
    the determining a label relevance between the target text feature and each label description feature to obtain at least one label relevance comprises: determining, by a relevance determination layer in the text classification model, the label relevance between the target text feature and each label description feature to obtain the at least one label relevance; and
    the determining, based on the at least one label relevance, a target topic label matching the topic text to be classified among the at least one topic label to be predicted comprises: determining, by a classification layer in the text classification model based on the at least one label relevance, the target topic label matching the topic text to be classified among the at least one topic label to be predicted.
  10. The method according to claim 9, wherein the method further comprises:
    determining a plurality of training samples, wherein each training sample contains a topic label to be predicted and a topic text to be trained, and each training sample contains a match label indicating a matching relationship between the topic label to be predicted and the topic text to be trained; and
    training a to-be-trained text classification model with the plurality of training samples to obtain the text classification model.
  11. The method according to claim 10, wherein the training a to-be-trained text classification model with the plurality of training samples to obtain the text classification model comprises:
    determining a first label count of the topic labels to be predicted contained in the plurality of training samples, and determining a second label count of target classification labels, among the topic labels to be predicted, that match the topic text to be trained;
    determining a target loss function value of the to-be-trained text classification model based on the first label count, the second label count, the match labels, and prediction results of the to-be-trained text classification model on the plurality of training samples; and
    adjusting model parameters of the to-be-trained text classification model according to the target loss function value to obtain the text classification model.
  12. A text processing method, applied to a terminal device, comprising:
    displaying an operation page for topic text;
    receiving target data input by a user on the operation page, wherein the target data includes a topic text to be published or a topic label of interest;
    acquiring a filtering result determined by a server based on the target data, wherein the filtering result is a result obtained after the server filters, based on the text classification method according to any one of claims 1 to 11, data to be filtered determined based on the target data; and
    displaying, on the operation page, the target data and/or the filtering result of the target data.
  13. The method according to claim 12, wherein the target data contains the topic text to be published;
    the displaying, on the operation page, the target data and/or the filtering result of the target data comprises:
    displaying the topic text to be published at a first display position of the operation page; and
    displaying, at a second display position of the operation page, a publication type of the topic text to be published and/or at least one target topic label matching the topic text to be published.
  14. The method according to claim 13, wherein the method further comprises:
    detecting a trigger operation by the user on a label modification identifier of the target topic label displayed on the operation page, performing on the target topic label a modification operation matching the label modification identifier triggered by the user, and displaying the modified target topic label on the operation page, wherein the modification operation includes at least one of the following: adding, deleting, and modifying.
  15. The method according to claim 12, wherein the target data contains the topic label of interest; the method further comprises:
    after receiving the topic label of interest input by the user on the operation page, detecting whether a label count of the topic labels of interest exceeds a preset number; and
    displaying prompt information in a case where the label count exceeds the preset number, the prompt information indicating that the number of topic labels of interest has reached the preset number.
  16. The method according to claim 12, wherein the target data includes the topic label of interest; the displaying an operation page for topic text comprises:
    in response to a topic filtering request of the user, acquiring preset topic labels belonging to at least one target topic category; and
    determining a category display area for each target topic category on the operation page, and displaying, in the category display area, the corresponding target topic category and the preset topic labels belonging to the target topic category.
  17. The method according to claim 12, wherein the target data includes the topic label of interest;
    the displaying, on the operation page, the target data and/or the filtering result of the target data comprises:
    displaying the topic label of interest in a title display area of the operation page; and
    displaying, in a text display area of the operation page, key topic content of published topic texts matching each topic label of interest.
  18. The method according to claim 17, wherein the method further comprises:
    in response to a selection operation on the topic label of interest, determining a target topic label selected by the user, and acquiring published topic texts matching the target topic label; and
    displaying, in a text display area of a topic filtering page, key topic content of the published topic texts matching the target topic label.
  19. A text classification apparatus, applied to a server, comprising:
    a first acquiring unit configured to acquire topic text to be classified and label description information of at least one topic label to be predicted;
    an extracting unit configured to extract a target text feature of the topic text to be classified, and extract a label description feature of the label description information of each topic label to be predicted;
    a first determining unit configured to determine a label relevance between the target text feature and each label description feature to obtain at least one label relevance; and
    a second determining unit configured to determine, based on the at least one label relevance, a target topic label matching the topic text to be classified among the at least one topic label to be predicted.
  20. A text processing apparatus, applied to a terminal device, comprising:
    a first display unit configured to display an operation page for topic text;
    a receiving unit configured to receive target data input by a user on the operation page, wherein the target data includes a topic text to be published or a topic label of interest;
    a second acquiring unit configured to acquire a filtering result determined by a server based on the target data, wherein the filtering result is a result obtained after the server filters, based on the text classification method according to any one of claims 1 to 11, data to be filtered determined based on the target data; and
    a second display unit configured to display, on the operation page, the target data and/or the filtering result of the target data.
  21. A computer device, comprising: a processor, a memory, and a bus, wherein the memory stores machine-readable instructions executable by the processor; when the computer device runs, the processor communicates with the memory through the bus; and when the machine-readable instructions are executed by the processor, the steps of the text classification method according to any one of claims 1 to 11 or the steps of the text processing method according to any one of claims 12 to 18 are executed.
  22. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and when the computer program is run by a processor, the steps of the text classification method according to any one of claims 1 to 11 or the steps of the text processing method according to any one of claims 12 to 18 are executed.
  23. A computer program product, wherein the computer program product includes computer program instructions that cause a computer to execute the steps of the text classification method according to any one of claims 1 to 11 or the steps of the text processing method according to any one of claims 12 to 18.
  24. A computer program, wherein the computer program causes a computer to execute the steps of the text classification method according to any one of claims 1 to 11 or the steps of the text processing method according to any one of claims 12 to 18.
PCT/CN2022/141171 — priority 2022-01-27, filed 2022-12-22 — Text classification and text processing method and apparatus, computer device, and storage medium — WO2023142809A1 (zh)

Applications Claiming Priority (2)

CN202210102790.9A — filed 2022-01-27 — Text classification and text processing method and apparatus, computer device, and storage medium (CN114443847A)
CN202210102790.9 — priority date 2022-01-27

Publication: WO2023142809A1, published 2023-08-03
Family ID: 81369779
Family application: PCT/CN2022/141171 — priority 2022-01-27, filed 2022-12-22 — WO2023142809A1
Country status: CN — CN114443847A (zh); WO — WO2023142809A1 (zh)

Also published as: CN114443847A, published 2022-05-06


Legal events: 121 — the EPO has been informed by WIPO that EP was designated in this application (ref document number 22923595, country EP, kind code A1).