CN112131350B - Text label determining method, device, terminal and readable storage medium - Google Patents


Info

Publication number
CN112131350B
CN112131350B (application CN202011065821.5A)
Authority
CN
China
Prior art keywords
word segmentation
probability
vocabulary
label
tag
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011065821.5A
Other languages
Chinese (zh)
Other versions
CN112131350A (en)
Inventor
刘刚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Shenzhen Co Ltd
Original Assignee
Tencent Technology Shenzhen Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Shenzhen Co Ltd filed Critical Tencent Technology Shenzhen Co Ltd
Priority to CN202011065821.5A priority Critical patent/CN112131350B/en
Publication of CN112131350A publication Critical patent/CN112131350A/en
Application granted granted Critical
Publication of CN112131350B publication Critical patent/CN112131350B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 - Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 - Querying
    • G06F16/3331 - Query processing
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7834 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using audio features
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/70 - Information retrieval; Database structures therefor; File system structures therefor of video data
    • G06F16/78 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
    • G06F16/783 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content
    • G06F16/7844 - Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using metadata automatically derived from the content using original textual content or text extracted from visual content or transcript of audio data
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 - Details of database functions independent of the retrieved data types
    • G06F16/95 - Retrieval from the web
    • G06F16/953 - Querying, e.g. by the use of web search engines
    • G06F16/9535 - Search customisation based on user profiles and personalisation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/279 - Recognition of textual entities
    • G06F40/289 - Phrasal analysis, e.g. finite state techniques or chunking
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/30 - Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Multimedia (AREA)
  • Data Mining & Analysis (AREA)
  • Library & Information Science (AREA)
  • Computational Linguistics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Machine Translation (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The application relates to a text label determining method, device, terminal, and readable storage medium, and belongs to the field of label mining. The method comprises the following steps: performing word segmentation processing on a target text to obtain a word segmentation set, wherein the word segmentation set comprises the word segmentation vocabulary obtained by segmenting the target text, and the target text is a text whose label is to be determined; determining a first candidate label of the target text according to the contextual relations of the word segmentation vocabulary; determining a second candidate label of the target text according to a first frequency parameter of the word segmentation vocabulary in the target text and a second frequency parameter of the word segmentation vocabulary in a text set; and determining the label of the target text according to the first candidate label and the second candidate label. The method solves the problem of low label-determination accuracy caused by ignoring the contextual semantic environment during label determination, and improves the accuracy of label acquisition.

Description

Text label determining method, device, terminal and readable storage medium
Technical Field
The present application relates to the field of label mining, and in particular, to a text label determining method, a text label determining device, a text label determining terminal, and a readable storage medium.
Background
A label is defined as the most important keyword that can represent the content. In information-flow content distribution, label information is critical: once content carries labels, it can be organized and displayed by label, and matching labels against user portraits enables more accurate content recommendation.
In the related art, tag extraction methods for article content determine the tags of the current content based on TF-IDF statistical features, which tend to filter out common words in an article while preserving important words.
However, such statistics-based methods do not consider the relations between words and documents in the article; the obtained tags deviate from the actual semantics the content expresses, and the accuracy of the obtained tags is not high.
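As a rough, stdlib-only sketch of the TF-IDF scoring the background describes (the function name, corpus, and smoothing convention are illustrative assumptions, not anything specified by the patent):

```python
import math
from collections import Counter

def tfidf_keywords(target_doc, corpus, top_k=2):
    """Rank the words of one tokenized document by TF-IDF against a corpus."""
    tf = Counter(target_doc)
    n_docs = len(corpus)
    scores = {}
    for word, count in tf.items():
        df = sum(1 for doc in corpus if word in doc)   # document frequency
        idf = math.log((n_docs + 1) / (df + 1)) + 1    # one common smoothed IDF
        scores[word] = (count / len(target_doc)) * idf  # TF * IDF
    ranked = sorted(scores, key=lambda w: -scores[w])
    return ranked[:top_k]

corpus = [
    ["the", "house", "price", "rises"],
    ["the", "new", "policy", "limits", "the", "price"],
    ["the", "weather", "is", "mild"],
]
doc = ["the", "house", "price", "policy", "house"]
keywords = tfidf_keywords(doc, corpus)  # common words like "the" rank last
```

Note how the corpus-wide word "the" is suppressed by its low IDF even though it is frequent, which is exactly the filtering behavior (and the semantic blindness) the background paragraph attributes to this family of methods.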
Disclosure of Invention
The application provides a text label determining method, a text label determining device, a text label determining terminal and a readable storage medium, which can improve the accuracy of label determination. The technical scheme is as follows:
In one aspect, a text label determining method is provided, the method including:
performing word segmentation processing on a target text to obtain a word segmentation set, wherein the word segmentation set comprises the word segmentation vocabulary obtained by segmenting the target text, and the target text is a text whose label is to be determined;
determining a first candidate tag of the target text according to the contextual relations of the word segmentation vocabulary;
determining a second candidate tag of the target text according to a first frequency parameter of the word segmentation vocabulary in the target text and a second frequency parameter of the word segmentation vocabulary in a text set;
and determining the tag of the target text according to the first candidate tag and the second candidate tag.
In another aspect, there is provided a text label determining apparatus, the apparatus comprising:
the processing module is used for carrying out word segmentation on a target text to obtain a word segmentation set, wherein the word segmentation set comprises word segmentation vocabularies obtained by word segmentation of the target text, and the target text is a text of a label to be determined;
the determining module is used for determining a first candidate label of the target text according to the context relation of the word segmentation vocabulary;
the determining module is further configured to determine a second candidate tag of the target text according to a first frequency parameter of the word segmentation vocabulary in the target text and a second frequency parameter of the word segmentation vocabulary in the text set;
the determining module is further configured to determine a tag of the target text according to the first candidate tag and the second candidate tag.
In another aspect, a computer device is provided, where the computer device includes a processor and a memory, where the memory stores at least one instruction, at least one program, a set of codes, or a set of instructions, where the at least one instruction, the at least one program, the set of codes, or the set of instructions are loaded and executed by the processor to implement a text label determining method according to any one of the embodiments of the present application.
In another aspect, a computer readable storage medium is provided, where at least one instruction, at least one program, a set of codes, or a set of instructions is stored, where the at least one instruction, the at least one program, the set of codes, or the set of instructions are loaded and executed by a processor to implement a text label determining method according to any one of the embodiments of the present application.
In another aspect, a computer program product or computer program is provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the text label determining method as described in any of the above embodiments.
The technical scheme provided by the embodiment of the application has the beneficial effects that at least:
The label of the target text is determined from the contextual relations of each word segmentation vocabulary of the target text, together with the first frequency parameter of each word segmentation vocabulary in the target text and its second frequency parameter in the text set. That is, the label is determined from both deep semantics and shallow frequency, which improves the accuracy of determining the label of the target text.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic illustration of an implementation environment provided by an exemplary embodiment of the present application;
FIG. 2 is a flow chart of a text label determination method provided by an exemplary embodiment of the present application;
FIG. 3 is a flow chart of a text label determination method provided by another exemplary embodiment of the present application;
FIG. 4 is a text label determination model based on a bidirectional long short-term memory (BiLSTM) neural network provided by an exemplary embodiment of the present application;
FIG. 5 is a schematic diagram of an attention computing mechanism provided by an exemplary embodiment of the present application;
FIG. 6 is a flow chart of attention computation provided by an exemplary embodiment of the present application;
FIG. 7 is a flowchart of a method for conditional random field tag determination provided by an exemplary embodiment of the present application;
FIG. 8 is a system flow diagram of a text label determination method provided by an exemplary embodiment of the present application;
FIG. 9 is a system flow diagram of tag determination provided by an exemplary embodiment of the present application;
FIG. 10 is a block diagram showing a configuration of a text label determining apparatus according to an exemplary embodiment of the present application;
FIG. 11 is a block diagram showing a configuration of a text label determining apparatus according to another exemplary embodiment of the present application;
FIG. 12 is a block diagram of a server according to an exemplary embodiment of the present application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
First, a brief description will be made of terms involved in the embodiments of the present application:
Artificial Intelligence (AI) is the theory, method, technique, and application system that uses a digital computer, or a machine controlled by a digital computer, to simulate, extend, and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain optimal results. In other words, artificial intelligence is a comprehensive technology of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a way similar to human intelligence. Artificial intelligence studies the design principles and implementation methods of various intelligent machines, enabling machines to perceive, reason, and make decisions.
Artificial intelligence technology is a comprehensive subject covering a wide range of fields, including both hardware-level and software-level technologies. Artificial intelligence infrastructure technologies generally include sensors, dedicated artificial intelligence chips, cloud computing, distributed storage, big data processing technologies, operation/interaction systems, and mechatronics. Artificial intelligence software technology mainly comprises computer vision, speech processing, natural language processing, and machine learning/deep learning.
Computer Vision (CV) is the science of studying how to make machines "see"; more specifically, it replaces human eyes with cameras and computers to recognize and measure targets, and further performs graphic processing so that the computer produces images more suitable for human observation or for transmission to instruments for detection. As a scientific discipline, computer vision studies related theories and technologies in an attempt to build artificial intelligence systems that can acquire information from images or multidimensional data. Computer vision techniques typically include image processing, image recognition, image semantic understanding, image retrieval, Optical Character Recognition (OCR), video processing, video semantic understanding, video content/behavior recognition, three-dimensional object reconstruction, 3D techniques, virtual reality, augmented reality, and simultaneous localization and mapping, among others.
The key technologies of Speech Technology are Automatic Speech Recognition (ASR), Text-To-Speech synthesis (TTS), and voiceprint recognition. Enabling computers to listen, see, speak, and feel is the future direction of human-computer interaction, and voice is becoming one of the most promising human-computer interaction modes.
Natural Language Processing (NLP) is an important direction in the fields of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between humans and computers in natural language. Natural language processing is a science that integrates linguistics, computer science, and mathematics. Research in this field involves natural language, i.e., the language people use daily, so it is closely related to research in linguistics. Natural language processing techniques typically include text processing, semantic understanding, machine translation, question answering, and knowledge graph techniques.
Machine Learning (ML) is a multi-domain interdisciplinary subject involving probability theory, statistics, approximation theory, convex analysis, and algorithmic complexity theory, among others. It studies how a computer can simulate or implement human learning behavior to acquire new knowledge or skills and reorganize existing knowledge structures to continuously improve its own performance. Machine learning is the core of artificial intelligence and the fundamental approach to making computers intelligent; it is applied across all areas of artificial intelligence. Machine learning and deep learning typically include techniques such as artificial neural networks, belief networks, reinforcement learning, transfer learning, inductive learning, and learning from instruction.
Word vectors (word embeddings) are the collective name for a set of language modeling and feature learning techniques in natural language processing in which words or phrases from a vocabulary are mapped to vectors of real numbers. Conceptually, this is a mathematical embedding from a space with one dimension per word to a continuous vector space of much lower dimension. Methods for generating such mappings include neural networks, dimensionality reduction of word co-occurrence matrices, probabilistic models, interpretable knowledge-base methods, and explicit representations of the contexts in which words appear. Word and phrase embeddings have been shown to improve the performance of natural language processing tasks, such as syntactic and sentiment analysis, when used as the underlying input representation.
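As a toy illustration of the word-to-vector mapping (the 3-dimensional vectors below are made-up values for demonstration, not output of any trained embedding model), a vocabulary lookup and a cosine-similarity nearest-neighbor query might look like:

```python
import math

# Toy 3-dimensional word vectors (illustrative values, not trained embeddings)
word_vectors = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nearest(word):
    """Vocabulary word whose vector is most similar to `word`'s vector."""
    return max((w for w in word_vectors if w != word),
               key=lambda w: cosine(word_vectors[word], word_vectors[w]))
```

With real embeddings the same query surfaces semantically related words, which is what makes word vectors useful as an input representation for the tag-mining models described later.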
Tags: in a recommendation system, a tag is defined as the most important keyword that can represent the semantics of an article and is suitable for matching user portraits with content items; tags are finer-grained semantics than categories and topics. Tags are used throughout the links of a recommendation system, including the content portrait dimension, the user portrait dimension, recall model features, ranking model features, and diversity scattering.
The scheme provided by the embodiment of the application relates to artificial intelligence natural language processing, machine learning and other technologies, wherein the text label determining method provided by the application can be applied to at least one of the following scenes:
First, the text label determining method can be applied to the recommendation system of an article-reading platform, where articles include news articles, official-account articles, personal original articles, book content, and the like. The server receives article content uploaded by a user or a partner through a terminal, and the recommendation system mines labels from the content and distributes the content to relevant users according to the labels. For example, for a news article reporting recent house-price trends and new related government policies, the server's recommendation system determines the tags as "house price, policy" and pushes the article to the home page or other pages of the corresponding application for users who have recently followed house-price trends, increasing the likelihood that the news will be clicked.
Second, the text label determining method can be applied to the recommendation system of a social platform, where social platforms include blogs, microblogs, and the public interaction modules of social software. The server receives content uploaded by users through terminals, and the recommendation system mines labels from the content, classifies the content by label, and pushes it in a targeted manner. In one example, on the "square" of a social application, a user may post dynamic content in the form of text, pictures, videos, and so on, and the server's recommendation system mines its tags. For example, for the posted content "Traveled to Beijing today, walked around the Forbidden City and the Bird's Nest; Beijing roast duck is really delicious", the recommendation system determines the tags as "Beijing, travel" and recommends the post to users located in Beijing or to users who have also posted "travel"-related updates, enabling recommendation to more precisely targeted vertical users.
Third, the text label determining method can be applied to the recommendation system of a video platform, where the video platform may provide video content such as ordinary videos and short videos. The server receives video content uploaded by users through terminals or provided by partners, performs speech-to-text processing or optical character recognition of subtitles on the video content to obtain the text corresponding to the video, mines labels from the text, and makes recommendations according to user portraits. For example, if the text extracted from a video is "Apple is about to launch the iPhone 12; the phone supports …", the recommendation system determines the tags as "Apple, mobile phone", determines from the relation between "Apple" and "mobile phone" that the video belongs to the field of electronic products, recommends the video to users who have recently followed new phone releases, and does not recommend it to users who mostly follow the field of agriculture.
It should be noted that the above application scenarios are only illustrative examples, and the text label determining method provided by the present application may be applied to other label-determination scenarios.
The implementation environment of the embodiment of the application is described with reference to the description of the noun introduction and the application scenario.
For illustration, referring to fig. 1 and taking the application of the text label determining method to a server as an example, the implementation environment of the method includes: a terminal 110, a server 120, and a communication network 130.
The terminal 110 is installed with a related application program 111, which may be the application corresponding to an article-reading platform, a social platform, or a video platform. A user may upload content to be published through the terminal 110 and may also receive content obtained from the server 120 through the terminal 110; that is, the terminal 110 is both a content production end and a content consumption end. Data is transmitted between the terminal 110 and the server 120 through the communication network 130.
The server 120 includes a content recommendation module 140 and a storage module 150, wherein the content recommendation module 140 includes a machine processing module 141, a content scheduling module 142, a manual auditing module 143, a deduplication module 144, and an interface module 145.
The content recommendation module 140 is configured to extract entity information from the content as tags and match content tags with user portraits, so as to achieve more targeted content distribution and recommendation.
The machine processing module 141 includes a tag determination module and a tag determination model, and is configured to perform tag determination on input content, and send a processing result to the content scheduling module 142.
The content scheduling module 142 is configured to manage the entire scheduling process of content flow: it receives content from the interface module 145 and obtains the meta information of the content from the content database module 153; the content scheduling module 142 is further configured to interact with the content storage module 151, storing content to or reading content from the content storage module 151; the content scheduling module 142 is further configured to schedule the manual auditing module 143 and the machine processing module 141 and control the scheduling order and priority; the content scheduling module 142 is further configured to obtain the content approved by the manual auditing module 143 and distribute it to the terminal 110 through the outlet unit of the interface module 145.
The manual auditing module 143 is the carrier of manual service capability and is used for reviewing content that the machine cannot judge with certainty, such as filtering pornographic or legally prohibited content.
The deduplication module 144 communicates with the content scheduling module 142 and performs title deduplication, cover-image deduplication, body-text deduplication, and video-fingerprint and audio-fingerprint deduplication. In one example, image-text titles and body text are vectorized, picture vectors are deduplicated, and video-fingerprint and audio-fingerprint vectors are constructed for video content; the distance between vectors, such as the Euclidean distance, is then calculated to determine whether content is duplicated.
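A minimal sketch of the distance test described here, assuming toy embedding vectors and an arbitrary threshold (a production system would tune the threshold per fingerprint type):

```python
import math

def euclidean(u, v):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def is_duplicate(vec_a, vec_b, threshold=0.5):
    """Treat two fingerprint/embedding vectors as duplicates when close enough."""
    return euclidean(vec_a, vec_b) < threshold

# Toy 3-dimensional title embeddings (illustrative values only)
title_a = [0.12, 0.80, 0.33]
title_b = [0.13, 0.79, 0.35]  # near-identical content
title_c = [0.90, 0.10, 0.55]  # unrelated content
```

The same comparison applies unchanged whether the vectors come from title text, body text, cover images, or video/audio fingerprints; only the vectorization step differs.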
The interface module 145 includes an inlet unit for receiving a content input of the terminal 110 and an outlet unit for outputting the content to the terminal 110.
The storage module 150 includes a content database module 153, a corpus 152, a content storage module 151, and a crawler module 154.
The content database module 153 is configured to store the meta information of content, where the meta information includes at least one of: file size, cover image link, code rate, file format, title, release time, author, video file size, and video format; the content database module 153 is further configured to store the audit results and audit status produced by the manual auditing module 143; the content database module 153 is also used for storing the processing results of the machine processing module 141 and the deduplication module 144, transmitted through the content scheduling module 142.
The content storage module 151 is configured to store content entity information other than the meta information, for example, video source files and the picture source files of image-text content; when video content tags are determined, the content storage module 151 is further configured to provide the machine processing module 141 with inputs derived from the source video files, including frames extracted from the videos, speech-to-text output of the video files, and optical character recognition results.
The crawler module 154 is configured to obtain external-domain corpora through the Internet and store them as domain corpus information in the corpus 152; the corpus 152 also stores the correspondence between word segmentation sequences and word vectors, where the word vectors are optionally obtained by unsupervised learning over a large-scale corpus through a word-vector model (word2vec).
Fig. 2 is a flowchart of a text label determining method according to an exemplary embodiment of the present application. Taking the application of the method to a server as an example, and referring to fig. 2, the method includes:
step 201, word segmentation processing is performed on the target text, and a word segmentation set is obtained.
In the embodiment of the application, the server acquires the text, picture, or video resources uploaded by the terminal. For text resources, the target text is generated directly. For picture resources, the server obtains the text information in the picture through optical character recognition and generates the target text. For video resources, the server extracts the audio from the video and performs speech-to-text processing to generate the target text; or it extracts video frames and performs optical character recognition on them to generate the target text. The target text is the text whose label is to be determined.
The word segmentation set includes the word segmentation vocabulary obtained by segmenting the target text; that is, the word segmentation set is the set obtained by dividing a piece of text or a sentence into several separate words. For example, segmenting "From Wu Xiaobo to Luo Zhenyu, what weak points does the knowledge-payment IP have?" yields the 11 segmented words "from", "Wu Xiaobo", "to", "Luo Zhenyu", ",", "knowledge payment", "IP", "have", "which", "weak points", and "?", and these 11 segmented words constitute the word segmentation set.
In the word segmentation process, multiple segmentation schemes exist. For the same text, the smaller the total number of segments, the larger the semantic units, the greater the weight of each individual semantic unit, and the higher the accuracy. For example, "knowledge payment" can be segmented either into the two segments "knowledge" and "payment" or into the single segment "knowledge payment"; in the semantic understanding of "From Wu Xiaobo to Luo Zhenyu, what weak points does the knowledge-payment IP have?", extracting "knowledge payment" as a tag gives the tag more accurate directivity.
Optionally, a dictionary-based word segmentation method is used in the word segmentation process. Given a maximum word length, a candidate of that length is cut from the text and compared against the dictionary; if the candidate appears in the dictionary, it is taken as a segmentation result, otherwise the candidate length is shortened and the comparison repeats. The cutting proceeds in two directions, forward and reverse, producing a forward result and a reverse result respectively, and the result with the fewest segments is selected as the word segmentation vocabulary set.
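The forward/reverse maximum-matching procedure above can be sketched as follows (a minimal illustration with toy dictionaries; real segmenters add heuristics beyond segment count for breaking ties):

```python
def max_match(text, dictionary, max_len=4, reverse=False):
    """Greedy longest-first dictionary matching; reverse=True cuts from the end."""
    tokens, s = [], text
    while s:
        for length in range(min(max_len, len(s)), 0, -1):
            piece = s[-length:] if reverse else s[:length]
            if length == 1 or piece in dictionary:  # single chars always accepted
                if reverse:
                    tokens.insert(0, piece)
                    s = s[:-length]
                else:
                    tokens.append(piece)
                    s = s[length:]
                break
    return tokens

def segment(text, dictionary, max_len=4):
    """Run both directions and keep the result with fewer segments
    (fewer segments mean larger semantic units, as argued above)."""
    fwd = max_match(text, dictionary, max_len)
    bwd = max_match(text, dictionary, max_len, reverse=True)
    return fwd if len(fwd) <= len(bwd) else bwd
```

With "知识付费" ("knowledge payment") in the dictionary, the whole four-character word is kept as one segment; without it, the text falls back to the two-segment split, matching the preference discussed above.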
Optionally, in the word segmentation process, a word segmentation method based on a hidden Markov model (Hidden Markov Model, HMM) is used: acquire the transition probability between states, the probability of each state generating each character, and the initial state distribution, that is, the probability of each state at the beginning of the sentence; starting from the initial state, compute the probability that each of the four states in the state set {S, B, M, E} generates the first character, and record the probability under each state; for the second character, consider every transition from a previous-time state to a current-time state, take for each current state the path with the maximum probability of generating the character at this position, and record the preceding state; process each character in turn; finally, segment the text according to the decoded state sequence to obtain the word segmentation vocabulary set.
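The steps above amount to Viterbi decoding over the {S, B, M, E} tag set. A sketch follows; the probability tables are toy values chosen so that 'a' tends to begin a word and 'b' tends to end one, not trained parameters.

```python
STATES = "SBME"  # Single-character word, Begin, Middle, End

def viterbi_segment(text, start_p, trans_p, emit_p):
    """Decode the most probable S/B/M/E state sequence for the text,
    then cut the text after every S or E tag."""
    V = [{s: start_p[s] * emit_p[s].get(text[0], 1e-8) for s in STATES}]
    paths = {s: [s] for s in STATES}
    for ch in text[1:]:
        V.append({})
        new_paths = {}
        for s in STATES:
            # keep, for each current state, the best previous state
            prob, prev = max(
                (V[-2][p] * trans_p[p].get(s, 0.0) * emit_p[s].get(ch, 1e-8), p)
                for p in STATES)
            V[-1][s] = prob
            new_paths[s] = paths[prev] + [s]
        paths = new_paths
    tags = paths[max(STATES, key=lambda s: V[-1][s])]
    words, cur = [], ""
    for ch, tag in zip(text, tags):
        cur += ch
        if tag in "SE":  # a word ends at S or E
            words.append(cur)
            cur = ""
    if cur:
        words.append(cur)
    return words

# toy parameters (assumptions for illustration only)
START_P = {"S": 0.4, "B": 0.6, "M": 0.0, "E": 0.0}
TRANS_P = {"S": {"S": 0.5, "B": 0.5}, "B": {"M": 0.3, "E": 0.7},
           "M": {"M": 0.3, "E": 0.7}, "E": {"S": 0.5, "B": 0.5}}
EMIT_P = {"S": {"a": 0.3, "b": 0.3}, "B": {"a": 0.7, "b": 0.1},
          "M": {"a": 0.2, "b": 0.2}, "E": {"a": 0.1, "b": 0.7}}
```

With these toy tables, "ab" decodes as the single word "ab" (tags B, E) and "aab" as "a" plus "ab" (tags S, B, E).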
Optionally, in the word segmentation process, a word segmentation method based on a binary grammar (bigram) is used: traverse all words ending at the current position, the word length being bounded; for each such word, find the preceding word, which yields a segmentation between the two words; finally, take the maximum probability over all segmentation results, and obtain the word segmentation vocabulary set from the corresponding segmentation.
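This traversal can be written as a dynamic program over (end position, last word) pairs. The bigram probability table below is a toy assumption; a real system estimates it from a corpus.

```python
import math

def bigram_segment(text, bigram_p, max_word_len=4, bos="<s>"):
    """For every word ending at position i, extend the best path of a word
    ending at i - len(word), scored by the bigram probability table."""
    # (end position, last word) -> (log probability, backpointer)
    best = {(0, bos): (0.0, None)}
    for i in range(1, len(text) + 1):
        for length in range(1, min(max_word_len, i) + 1):
            word = text[i - length:i]
            for (j, prev), (lp, _) in list(best.items()):
                p = bigram_p.get((prev, word))
                if j == i - length and p:
                    cand = lp + math.log(p)
                    if cand > best.get((i, word), (float("-inf"), None))[0]:
                        best[(i, word)] = (cand, (j, prev))
    # pick the highest-probability path covering the whole text, walk it back
    end = max((k for k in best if k[0] == len(text)), key=lambda k: best[k][0])
    words, k = [], end
    while k[1] != bos:
        words.append(k[1])
        k = best[k][1]
    return words[::-1]

# toy bigram table (an assumption, not from the patent)
TOY_BIGRAMS = {("<s>", "ab"): 0.5, ("ab", "c"): 0.5,
               ("<s>", "a"): 0.4, ("a", "bc"): 0.9,
               ("a", "b"): 0.1, ("b", "c"): 0.5}
```

On "abc", the path "a" + "bc" scores 0.4 × 0.9 = 0.36 and beats "ab" + "c" at 0.25, so the maximum-probability segmentation is ["a", "bc"].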
Step 202, determining a first candidate label of the target text according to the context relation of the word segmentation vocabulary.
In the embodiment of the application, the word segmentation vocabulary is subjected to preset processing to obtain a representation based on the context relation, that is, the context information both before and after each word is considered at the same time. This better fits how a human reader judges the importance of a word after reading the complete sentence, so the first candidate tag can be determined more effectively.
Optionally, the word segmentation vocabulary is input into a machine learning model, which outputs the first candidate label of the target text. The machine learning model includes a bi-directional long short-term memory (Bi-directional Long Short-Term Memory, Bi-LSTM) neural network, a conditional random field (Conditional Random Field, CRF), and an attention mechanism (Attention Mechanism).
Optionally, the word segmentation vocabulary is extracted based on a word graph model (TextRank, TR). TR is an improvement of the web page ranking method (PageRank, PR), which determines rankings through the hyperlink relations of the Internet. PR is designed on a voting idea: to calculate the PR value of web page A, one must know which web pages link to A, that is, first obtain the in-links of A, and then calculate the PR value of A through the votes those in-links cast for A. Please refer to formula one:

Formula one: S(V_i) = (1 - d) + d × Σ_{V_j ∈ In(V_i)} S(V_j) / |Out(V_j)|
Where V_i denotes a web page, V_j denotes a web page linking to V_i (i.e., an in-link of V_i), S(V_i) denotes the PR value of web page V_i, In(V_i) denotes the set of all in-links of V_i, Out(V_j) denotes the set of pages that V_j links out to, and d denotes the damping coefficient. The damping coefficient overcomes an inherent defect of the summed part of the formula: with the sum alone, pages that have no in-links would receive a PR value of 0, which is not the case in practice; adding the damping coefficient guarantees every page a PR value greater than 0. In one example, with a damping coefficient of 0.85, the PR values converge to stable values after approximately 100 iterations; as the damping coefficient approaches 1, the required number of iterations increases dramatically and the ordering becomes unstable. The factor 1/|Out(V_j)| in front of S(V_j) means that page V_j divides its PR value equally among all pages it links out to, thereby casting its current vote for each page it currently links to.
TR is improved from PR, adding only a weight term W_ji to indicate that the edges between nodes have different importance during the calculation. The method is as follows: (1) split the target text into complete sentences; (2) for each sentence, perform word segmentation and part-of-speech tagging, filter stop words, and keep only words of specified parts of speech, such as nouns, verbs, and adjectives, obtaining a sentence set whose elements are the retained candidate tags; (3) construct a candidate keyword graph G = (V, E), where V is the node set consisting of the candidate keywords generated in step (2); edges between any two nodes are built from the co-occurrence relation: an edge exists between two nodes only when the corresponding words co-occur within a window of length m, where m is the window size, i.e., at most m words co-occur; (4) iteratively propagate and calculate the weight of each node until convergence; (5) sort the node weights in descending order to obtain the T most important words as candidate tags; (6) mark the T words obtained in step (5) in the original target text, and if adjacent ones form a phrase, merge them into a multi-word tag.
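Steps (3) to (5) can be sketched compactly as an undirected, co-occurrence-weighted variant of the PR iteration. The window size, damping coefficient, and iteration count below are illustrative choices, not values fixed by the patent.

```python
from collections import defaultdict

def textrank(words, window=3, d=0.85, iters=50, top_t=2):
    """Build the co-occurrence graph over the retained words, then iterate
    the weighted update S(Vi) = (1 - d) + d * sum_j w_ji / out(Vj) * S(Vj)."""
    weight = defaultdict(float)
    for i, a in enumerate(words):
        for b in words[i + 1:i + window]:  # co-occurrence within the window
            if a != b:
                weight[(a, b)] += 1.0
                weight[(b, a)] += 1.0
    nodes = sorted(set(words))
    out_sum = {v: sum(weight[(v, u)] for u in nodes) for v in nodes}
    score = {v: 1.0 for v in nodes}
    for _ in range(iters):
        score = {v: (1 - d) + d * sum(
            weight[(u, v)] / out_sum[u] * score[u]
            for u in nodes if weight[(u, v)] > 0) for v in nodes}
    return sorted(nodes, key=score.get, reverse=True)[:top_t]
```

On the toy word sequence ["knowledge", "pay", "ip", "pay", "weak", "pay"], "pay" has the largest weighted degree and therefore ranks first.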
Step 203, determining a second candidate label of the target text according to the first frequency parameter of the word segmentation vocabulary in the target text and the second frequency parameter of the word segmentation vocabulary in the text set.
In the embodiment of the present application, the vocabulary frequency tfidf_{i,j} corresponding to the word segmentation vocabulary is determined from the first frequency parameter tf_{i,j} and the second frequency parameter idf_i, where tf_{i,j} is calculated by formula two and idf_i by formula three:

Formula two: tf_{i,j} = n_{i,j} / Σ_k n_{k,j}

Formula three: idf_i = log( |D| / |{ j : t_i ∈ d_j }| )
Where tf_{i,j} represents the first frequency parameter, idf_i represents the second frequency parameter, n_{i,j} represents the number of occurrences of the current word segmentation vocabulary in text d_j, Σ_k n_{k,j} represents the total number of word occurrences in text d_j, |D| represents the number of texts in the corpus, and |{ j : t_i ∈ d_j }| represents the number of texts containing the word segmentation vocabulary.
The product of the determined first frequency parameter and second frequency parameter is then taken as the vocabulary frequency tfidf_{i,j} corresponding to the word segmentation vocabulary; the calculation is shown in formula four:

Formula four: tfidf_{i,j} = tf_{i,j} × idf_i
The word segmentation vocabulary whose vocabulary frequency meets the frequency requirement is determined as the second candidate tag of the target text.
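Formulas two through four can be computed directly from token counts. The sketch below assumes the document itself belongs to the corpus (so the document-frequency denominator is never zero); the threshold value is a stand-in for the patent's unspecified "frequency requirement".

```python
import math
from collections import Counter

def tfidf_scores(doc_tokens, corpus):
    """tf is the in-document relative frequency (formula two), idf the log
    of corpus size over documents containing the term (formula three), and
    the vocabulary frequency their product (formula four)."""
    counts = Counter(doc_tokens)
    total = sum(counts.values())
    scores = {}
    for term, n in counts.items():
        df = sum(1 for doc in corpus if term in doc)  # documents containing term
        scores[term] = (n / total) * math.log(len(corpus) / df)
    return scores

def second_candidate_tags(doc_tokens, corpus, threshold):
    """Keep the words whose vocabulary frequency meets the requirement."""
    return {t for t, s in tfidf_scores(doc_tokens, corpus).items() if s >= threshold}
```

Note that a term occurring in every document (such as "a" in the test corpus below) gets idf = log(1) = 0, so ubiquitous words are never selected as tags.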
Step 204, determining the label of the target text according to the first candidate label and the second candidate label.
In this embodiment of the present application, optionally, in response to the situation that an intersection exists between the first candidate tag and the second candidate tag, the intersection of the first candidate tag and the second candidate tag is taken as the tag of the target text. In response to the situation that no intersection exists between the first candidate tag and the second candidate tag, the tag of the target text is determined from the first candidate tag and the second candidate tag according to a preset selection rule. In one example, the preset selection rule assigns different weights to the first candidate tag and the second candidate tag according to the domain to which the target text belongs: when the weight of the first candidate tag is greater than the weight of the second candidate tag, the first candidate tag is selected as the tag of the target text; when the weight of the first candidate tag is less than the weight of the second candidate tag, the second candidate tag is selected as the tag of the target text.
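The combination rule above can be sketched as follows. The default weights are illustrative; the patent leaves the per-domain weighting unspecified.

```python
def determine_tags(first, second, w_first=0.5, w_second=0.5):
    """Take the intersection of the two candidate sets when one exists;
    otherwise fall back to the set whose domain weight is larger."""
    intersection = set(first) & set(second)
    if intersection:
        return intersection
    return set(first) if w_first > w_second else set(second)
```

For example, candidates ["ip", "pay"] and ["pay", "weak"] intersect in {"pay"}; disjoint candidates are resolved purely by the weights.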
In summary, the text label determining method provided in this embodiment determines the tag of the target text from the context relation of each word segmentation vocabulary of the target text, the first frequency parameter of each word in the target text, and the second frequency parameter in the text set. The tag is thus determined jointly from the two aspects of deep semantics and shallow frequency, which improves the accuracy of tag determination; and by effectively replacing manual feature engineering with a machine learning model, the efficiency of tag determination is also improved.
In connection with the above embodiment, taking as an example the case where the first candidate tag based on the context of the word segmentation vocabulary is obtained through a machine learning model, the text tag determining method provided by the embodiment of the present application is described. Referring to fig. 3, which shows a flowchart of the text tag determining method provided by an exemplary embodiment, the method includes:
Step 301, word segmentation processing is performed on the target text, and a word segmentation set is obtained.
Please refer to step 201 for the relevant content of the word segmentation set acquisition process, which is not described herein.
Step 302, extracting features of the segmented word to obtain a word vector of the segmented word.
In the embodiment of the application, illustratively, a large-scale Internet corpus is obtained in real time through a crawler, and the vocabulary vectors corresponding to the segmented words are learned in an unsupervised manner; a word segmentation sequence is obtained from the word segmentation set produced by decomposing the target text, and the sequence is queried against the corpus to obtain the corresponding vocabulary vectors.
And 303, performing feature analysis on the word vectors and the context word vectors to obtain entity probabilities corresponding to the word segmentation words.
In the embodiment of the application, feature analysis is performed on each vocabulary vector in combination with its context vocabulary vectors to obtain the entity probability corresponding to the word segmentation vocabulary. In one example, the vocabulary vectors are input into a bidirectional long short-term memory neural network. The entity probability comprises a first probability, a second probability, and a third probability: the first probability represents the probability that the word segmentation vocabulary belongs to a tag entity, the second probability represents the probability that the word segmentation vocabulary does not belong to a tag entity, and the third probability represents the probability that the word segmentation vocabulary continues the entity to which the preceding word belongs. For example, if the word segmentation vocabulary is an entity, such as a person name, a place name, or an organization name, the first probability is high; if it is a non-entity, such as a preposition, a verb, or an adjective, the second probability is high; and if the word segmentation vocabulary is an entity and the word before it is also an entity, the third probability is high.
The first probability, the second probability, and the third probability corresponding to each vocabulary vector are thus obtained, and the probability with the highest value among the three is determined as the entity probability of the word segmentation vocabulary.
Step 304, determining a first candidate label of the target text from the word segmentation vocabulary according to the entity probability.
In the embodiment of the application, the word segmentation vocabulary whose entity probability corresponds to the second probability is filtered out; optionally, a word corresponding to the second probability does not belong to a tag entity, that is, it is a word that cannot serve as a tag, such as a word whose part of speech is a preposition. The word segmentation vocabulary whose entity probability corresponds to the first probability or the third probability is then determined, obtaining the first candidate tag.
In one example, please refer to fig. 4: the vocabulary vectors 410 are input into a bidirectional long short-term memory neural network (i.e., BiLSTM in fig. 4) 420, where each word segmentation vocabulary 411 corresponds to a vocabulary vector 410; for each input vocabulary vector 410, three prediction scores are output, representing the first probability B, the second probability O, and the third probability I 430 respectively. Optionally, the prediction scores are input into a conditional random field (i.e., CRF in fig. 4) 440 to obtain prediction labels (i.e., the B, I, or O label corresponding to each word segmentation vocabulary in fig. 4) 450, and the first candidate tag is obtained according to the prediction labels.
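Turning the per-word B/I/O predictions of fig. 4 into candidate tags is a simple decode; a sketch follows (the handling of a stray I with no open entity is an assumption, since the patent does not specify it).

```python
def bio_to_candidates(tokens, tags):
    """Collect candidate tags from a B/I/O label sequence: B opens an
    entity, I extends the open one, O closes it."""
    candidates, current = [], []
    for token, tag in zip(tokens, tags):
        if tag == "B":
            if current:
                candidates.append("".join(current))
            current = [token]
        elif tag == "I" and current:
            current.append(token)
        else:  # O, or a stray I with no open entity, closes any entity
            if current:
                candidates.append("".join(current))
            current = []
    if current:
        candidates.append("".join(current))
    return candidates
```

For instance, tokens "a b c d" tagged B, I, O, B yield the two candidates "ab" and "d".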
Optionally, after the feature analysis of the vocabulary vectors in combination with the context vocabulary vectors, a self-attention calculation over the segmented words is added. The calculation method of self-attention is as follows: in the figure, Value 501 is an output vocabulary vector of the bidirectional long short-term memory neural network, and Key 502 is a parameter matrix distinct from Value 501. The essence of the Attention function can be described as a mapping from a query element (namely Query in the figure, corresponding to the self-attention calculation layer) 503 to a series of (Key 502, Value 501) pairs: by calculating the similarity or relevance between Query 503 and each Key 502, the weight coefficient of the Value 501 corresponding to each Key 502 is obtained, and the Values 501 are then weighted and summed to obtain the final attention value 504. The formula is expressed as formula five:

Formula five: Attention(Query, Source) = Σ_{i=1}^{Lx} Similarity(Query, Key_i) · Value_i
Where Attention(Query, Source) represents the attention value corresponding to the current sentence (Source) and the current element, Source represents the input sentence, Query represents the element, Key represents the parameter matrix, L_x represents the length of the sentence, and for Similarity(Query, Key_i) please refer to formula six.
The computing mechanism, as shown in fig. 6, can be summarized as two processes: the first process calculates the weight coefficients from Query 603 and Key 602, and the second process weights and sums Value 601 according to those coefficients. The first process can be subdivided into two stages: the first stage 610 calculates the similarity or relevance of Query 603 and Key 602 to obtain an intermediate vector; the second stage 620 normalizes the raw scores of the first stage 610. In the first stage 610, different functions and computing mechanisms may be introduced to compute the similarity or relevance of Query 603 and a Key 602; the most common method is to calculate the vector dot product of the two, which yields the first intermediate vector (S1, S2, S3, S4 in fig. 6) 612. The calculation method refers to formula six 611 (S(Q, K) in the figure):
Formula six: similarity (Qurey, key) i)=Qurey·Keyi
Where Similarity represents the vector dot product, Query represents the element, and Key_i represents the i-th parameter matrix.
The value range of the first intermediate vector generated in the first stage 610 varies with the specific generating method; the second stage 620 numerically converts the scores of the first stage by introducing a SoftMax-style normalization 621 to obtain the second intermediate vector (namely a1, a2, a3, a4 in the figure) 622. On the one hand, this normalizes the raw scores into a probability distribution in which all element weights sum to 1; on the other hand, the inherent mechanism of SoftMax further highlights the weights of the important elements. The calculation method refers to formula seven:

Formula seven: a_i = SoftMax(Sim_i) = e^{Sim_i} / Σ_{j=1}^{Lx} e^{Sim_j}
Where a_i is the second intermediate vector, SoftMax() is the SoftMax function, Sim_i is the first intermediate vector obtained in the first stage, and L_x represents the length of the current sentence (Source).
The result of the second stage 620 is the corresponding weight coefficient; the third stage 630 then performs the weighted summation to obtain the attention value 604. Please refer to formula eight:

Formula eight: Attention(Query, Source) = Σ_{i=1}^{Lx} a_i · Value_i
Wherein, attention (Qurey, source) represents the Attention Value corresponding to the current sentence (Source) and the current element, L x represents the length of the current sentence (Source), a i is the second intermediate vector, and Value i is the output vector of the bidirectional long and short time memory neural network.
Step 305, determining a second candidate label of the target text according to the first frequency parameter of the word segmentation vocabulary in the target text and the second frequency parameter of the word segmentation vocabulary in the text set.
Optionally, the second candidate tag may also be derived from lexical or domain features, illustratively including vocabulary encodings of a particular vertical domain; for example, "apple" denotes a brand in the field of electronic devices and a fruit in the field of agriculture.
Please refer to the above step 203 for the relevant content of the second candidate tag determination method, and details are not repeated here.
Step 306, determining the label of the target text according to the first candidate label and the second candidate label.
Optionally, the result of the self-attention calculation on the output of the bidirectional long short-term memory neural network may be input, together with the second candidate tag, into the conditional random field model. For a method of determining tags using a conditional random field, please refer to fig. 7, where the features include: word, part of speech, word segmentation boundary, characteristic words, and word stocks, the word stocks covering person names, place names, organizations, film and television, novels, music, medical terms, network public-opinion hot words, and the like. After the features are collected, feature templates need to be configured, covering the same feature combination at different positions, different feature combinations at the same position, and different feature combinations at different positions. In one example, the feature determination is as follows: (1) if the word 702 at the current position 701 (its part of speech 703 being a noun) satisfies (in word stock 704 = 1) and (tag 705 = 1), let t1 = 1 in the feature template, otherwise t1 = 0; if the weight λ1 is large, the model tends to take words in the word stock as tags. (2) If the part of speech 703 of the word at the previous position is punctuation, the next part of speech 703 is a verb, and the current word is a noun, let t2 = 1 in the feature template; if the weight λ2 is large, the model is more inclined to treat a noun sandwiched between punctuation and a verb as a tag entity.
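The two templates above are indicator feature functions whose weighted sum scores a labelling. A toy sketch follows; the string encodings of parts of speech and the weight values are assumptions for illustration, and a real CRF learns the weights during training.

```python
def t1(cur_pos, in_word_stock, is_tag):
    """Template (1): a noun that appears in the word stock and is labelled."""
    return int(cur_pos == "noun" and in_word_stock and is_tag)

def t2(prev_pos, cur_pos, next_pos):
    """Template (2): a noun sandwiched between punctuation and a verb."""
    return int(prev_pos == "punct" and cur_pos == "noun" and next_pos == "verb")

def crf_score(weights, features):
    """The CRF scores a labelling by the weighted sum of fired templates."""
    return sum(w * f for w, f in zip(weights, features))
```

A large λ1 or λ2 makes the corresponding template dominate the score, which is exactly the "tends to take as tags" behavior described above.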
In summary, the text label determining method provided in this embodiment determines the tag of the target text from the context relation of each word segmentation vocabulary of the target text, the first frequency parameter of each word in the target text, and the second frequency parameter in the text set. The tag is thus determined jointly from the two aspects of deep semantics and shallow frequency, which improves the accuracy of tag determination; and by effectively replacing manual feature engineering with a machine learning model, the efficiency of tag determination is also improved.
In an alternative embodiment, the text label determining method provided in the embodiment of the present application is applied to a short video application, and the system procedure of the method is schematically described. Referring to fig. 8, the system flow 800 includes four main processes: video acquisition 810, content arrangement 820, tag determination 830, and video distribution 840.
In video acquisition 810, the video file uploaded by the user through the client is received; optionally, the application corresponding to the client may be the same as or different from the application that receives the video in video distribution 840.
Content arrangement 820 includes: corpus collection 821, video text conversion 822, word segmentation extraction 823;
During the execution of corpus collection 821, a large-scale Internet corpus is obtained in real time through a crawler, and the vocabulary vectors corresponding to the segmented words are learned in an unsupervised manner. During execution, video text conversion 822 extracts the audio from the video file and converts it into characters to obtain the target text, or extracts video frames and recognizes the characters in the video through optical character recognition to obtain the target text. During execution, word segmentation extraction 823 decomposes the target text obtained by video text conversion 822 into a word segmentation vocabulary set.
In the execution process of the tag determination 830, the word segmentation vocabulary obtained by the content arrangement 820 is obtained, the word segmentation vocabulary is processed to determine the tag of the target text, and the corresponding relationship between the tag and the target text is output to the video distribution 840.
During the execution of video distribution 840, the correspondence between the tag and the video is determined from the correspondence between the tag and the target text, the videos are classified according to their tags, and each video is distributed to the interface of the corresponding video category in the application program, or distributed, according to the user portrait, to the clients of users who may be interested in the video.
Referring to fig. 9, the tag determination 930 includes a word segmentation map vector layer 931, an intermediate vector layer 932, a self-attention computation layer 933, a word segmentation map external feature layer 934, a full connection middle layer 935, a full connection layer 936, a multi-classification layer 937, a constraint layer 938, and a target text word segmentation vocabulary 900.
The word segmentation vocabulary is input into the word segmentation map vector layer 931 and the word segmentation map external feature layer 934 respectively. In the word segmentation map vector layer 931, the segmented words are mapped into word vectors and input into a bidirectional long short-term memory neural network, namely a forward recurrent neural network (Recurrent Neural Network, RNN) layer 9311 and a backward recurrent neural network layer 9312; the result is input into the intermediate vector layer 932, where the above recurrent process is repeated; the intermediate vector layer 932 outputs a plurality of linear mappings 9321, which are input into the self-attention calculation layer 933, and the calculated result is input into the full connection layer 936. In the word segmentation map external feature layer 934, the word segmentation vocabulary is processed and the result is output to the full-connection middle layer 935 for connection; the result is then output to the full connection layer 936, where it is fully connected with the result output from the self-attention calculation layer 933, then passed to the multi-classification layer 937; the classification result is input into the constraint layer 938, which finally outputs the label of the target text.
Fig. 10 is a block diagram illustrating a text label determining apparatus according to an exemplary embodiment of the present application, the apparatus including:
The processing module 1010 is configured to perform word segmentation on a target text to obtain a word segmentation set, where the word segmentation set includes word segmentation vocabulary obtained by word segmentation on the target text, and the target text is a text of a tag to be determined;
A determining module 1020, configured to determine a first candidate tag of the target text according to a context of the word segmentation vocabulary;
the determining module 1020 is further configured to determine a second candidate tag of the target text according to a first frequency parameter of the word segmentation vocabulary in the target text and a second frequency parameter of the word segmentation vocabulary in the text set;
The determining module 1020 is further configured to determine a tag of the target text according to the first candidate tag and the second candidate tag.
In an alternative embodiment, the determining module 1020 is further configured to, in response to an intersection between the first candidate tag and the second candidate tag, obtain a tag of the target text by intersecting the first candidate tag and the second candidate tag.
In an optional embodiment, the determining module is further configured to determine, according to a preset selection rule, a tag of the target text from the first candidate tag and the second candidate tag in response to a situation that there is no intersection between the first candidate tag and the second candidate tag.
In an alternative embodiment, referring to fig. 11, the determining module 1020 further includes:
A first determining unit 1021, configured to determine a vocabulary frequency corresponding to the word segmentation vocabulary according to the first frequency parameter and the second frequency parameter;
The first determining unit 1021 is further configured to determine the word segmentation vocabulary whose vocabulary frequency meets the frequency requirement as a second candidate tag of the target text.
In an alternative embodiment, the determining module 1020 further includes:
An extracting unit 1022, configured to perform feature extraction on the word segmentation vocabulary to obtain a vocabulary vector of the word segmentation vocabulary;
The analysis unit 1023 is configured to perform feature analysis on the vocabulary vector in combination with the context vocabulary vector, so as to obtain an entity probability corresponding to the word segmentation vocabulary;
And a second determining unit 1024, configured to determine, according to the entity probability, a first candidate tag of the target text from the word segmentation vocabulary.
It should be noted that: the text label determining apparatus provided in the above embodiment is only exemplified by the division of the above functional modules, and in practical application, the above functional allocation may be performed by different functional modules according to needs, that is, the internal structure of the device is divided into different functional modules, so as to perform all or part of the functions described above. In addition, the text label determining apparatus provided in the above embodiment and the text label determining method embodiment belong to the same concept, and detailed implementation processes of the text label determining apparatus are detailed in the method embodiment, and are not repeated herein.
The application also provides a server, which includes a processor and a memory, where at least one instruction is stored in the memory, and the at least one instruction is loaded and executed by the processor to implement the text label determining method provided by each of the above method embodiments. It should be noted that the server may be the server provided in fig. 12 below.
Referring to fig. 12, a schematic diagram of a server according to an exemplary embodiment of the present application is shown. Specifically, the server 1200 includes a central processing unit (Central Processing Unit, CPU) 1201, a system memory 1204 including a random access memory (Random Access Memory, RAM) 1202 and a read-only memory (Read Only Memory, ROM) 1203, and a system bus 1205 connecting the system memory 1204 and the central processing unit 1201. The server 1200 also includes a basic input/output system 1206, which helps to transfer information between the various devices within the computer, and a mass storage device 1207 for storing an operating system 1213, application programs 1214, and other program modules 1215.
The basic input/output system 1206 includes a display 1208 for displaying information and an input device 1209, such as a mouse, keyboard, etc., for user input of information. Wherein the display 1208 and the input device 1209 are coupled to the central processing unit 1201 via an input-output controller 1210 coupled to a system bus 1205. The basic input/output system 1206 may also include an input/output controller 1210 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 1210 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 1207 is connected to the central processing unit 1201 through a mass storage controller (not shown) connected to the system bus 1205. The mass storage device 1207 and its associated computer-readable media provide non-volatile storage for the server 1200. That is, the mass storage device 1207 may include a computer readable medium (not shown), such as a hard disk or CD-ROM drive.
The computer readable medium may include computer storage media and communication media without loss of generality. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, erasable programmable read-only memory (Erasable Programmable Read Only Memory, EPROM), electrically erasable programmable read-only memory (Electrically Erasable Programmable Read Only Memory, EEPROM), flash memory or other solid state memory technology, CD-ROM, digital versatile disks (Digital Versatile Disc, DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to those described above. The system memory 1204 and the mass storage device 1207 described above may be collectively referred to as memory.
The memory stores one or more programs configured to be executed by the one or more central processing units 1201, the one or more programs containing instructions for implementing the text label determining method described above; the central processing unit 1201 executes the one or more programs to implement the text label determining method provided by each of the above method embodiments.
The server 1200 may also operate via a network, such as the internet, connected to remote computers on the network, in accordance with various embodiments of the present application. I.e., the server 1200 may be connected to the network 1212 through a network interface unit 1211 coupled to the system bus 1205, or alternatively, the network interface unit 1211 may be used to connect to other types of networks or remote computer systems (not shown).
The memory also includes one or more programs stored in the memory, the one or more programs including instructions for performing the steps, executed by the server, of the text label determining method provided by the embodiments of the present application.
The embodiment of the application also provides a computer device, which comprises a memory and a processor, wherein at least one instruction, at least one program, a code set, or an instruction set is stored in the memory, and the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by the processor to implement the text label determining method in any one of the above embodiments.
The embodiment of the application also provides a computer-readable storage medium, in which at least one instruction, at least one program, a code set, or an instruction set is stored, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the text label determining method according to any one of the above embodiments.
The present application also provides a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer-readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the text label determining method as described in any of the above embodiments.
Those of ordinary skill in the art will appreciate that all or part of the steps in the various methods of the above embodiments may be implemented by a program instructing related hardware, and the program may be stored in a computer-readable storage medium, which may be a computer-readable storage medium included in the memory of the above embodiments, or may be a standalone computer-readable storage medium that is not incorporated into the terminal. The computer-readable storage medium stores at least one instruction, at least one program, a code set, or an instruction set, where the at least one instruction, the at least one program, the code set, or the instruction set is loaded and executed by a processor to implement the text label determining method according to any one of the embodiments of the present application.
Alternatively, the computer-readable storage medium may include: read-only memory (ROM), random access memory (RAM), solid state drive (SSD), optical disc, or the like. The random access memory may include resistive random access memory (ReRAM) and dynamic random access memory (DRAM), among others. The foregoing embodiment numbers of the present application are merely for the purpose of description, and do not represent the advantages or disadvantages of the embodiments.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program for instructing relevant hardware, where the program may be stored in a computer readable storage medium, and the storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing description is merely of preferred embodiments of the present application and is not intended to limit the application; any modification, equivalent substitution, or improvement made within the spirit and principles of the present application shall fall within the protection scope of the present application.

Claims (12)

1. A text label determining method, the method comprising:
Performing word segmentation processing on a target text to obtain a word segmentation set, wherein the word segmentation set comprises word segmentation words obtained by word segmentation of the target text, and the target text is a text of a label to be determined;
Extracting characteristics of the word segmentation vocabulary to obtain a vocabulary vector of the word segmentation vocabulary;
Performing feature analysis on the vocabulary vector in combination with context vocabulary vectors to obtain a first probability, a second probability and a third probability corresponding to the word segmentation vocabulary, wherein the first probability represents the probability that the word segmentation vocabulary belongs to a tag entity, the second probability represents the probability that the word segmentation vocabulary does not belong to the tag entity, and the third probability represents the probability that the word segmentation vocabulary belongs to a corresponding entity in the tag entity; if the word segmentation vocabulary is an entity, the first probability is high; if the word segmentation vocabulary is non-entity, the second probability is high; if the word segmentation vocabulary is an entity and the word segmentation vocabulary before it is an entity, the third probability is high;
Inputting the first probability, the second probability and the third probability to a conditional random field to obtain a prediction label corresponding to the word segmentation vocabulary, wherein the prediction label indicates one of the first probability, the second probability and the third probability;
Filtering out the word segmentation vocabulary of the second probability indicated by the prediction label;
Determining a first candidate tag of the target text according to the word segmentation vocabulary of the first probability indicated by the prediction label and the word segmentation vocabulary of the third probability indicated by the prediction label;
Determining a second candidate tag of the target text according to a first frequency parameter of the word segmentation vocabulary in the target text and a second frequency parameter of the word segmentation vocabulary in a text set;
And determining the label of the target text according to the first candidate label and the second candidate label.
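Taken together, the steps of claim 1 can be sketched in Python. Everything below is one illustrative reading, not the patented implementation: the function names are invented, the three probabilities are mapped onto a B/O/I tagging convention (B = first probability, O = second, I = third), and the conditional random field is approximated by a greedy per-token decode with a single transition constraint; a real system would produce the probability triples with a trained sequence model.

```python
from collections import Counter
from math import log

def decode_labels(prob_triples):
    """Greedy stand-in for the conditional-random-field step: pick the
    label with the highest probability for each token, forcing an I
    that has no preceding entity back to B (a transition constraint a
    trained CRF would enforce)."""
    labels = []
    for b, o, i in prob_triples:
        label = max((("B", b), ("O", o), ("I", i)), key=lambda t: t[1])[0]
        if label == "I" and (not labels or labels[-1] == "O"):
            label = "B"
        labels.append(label)
    return labels

def first_candidates(tokens, labels):
    """Filter the O tokens, then merge each B token with the following
    I tokens into a first-candidate tag (the claimed entity spans)."""
    spans, current = [], []
    for tok, lab in zip(tokens, labels):
        if lab == "B":
            if current:
                spans.append(" ".join(current))
            current = [tok]
        elif lab == "I" and current:
            current.append(tok)
        else:  # O token: filtered out, closing any open span
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))
    return set(spans)

def second_candidates(tokens, corpus, top_k=1):
    """Second-candidate tags from the two frequency parameters: term
    frequency in the target text times an inverse document frequency
    over the text set, keeping the top-scoring words."""
    n = len(corpus)
    scores = {
        w: (c / len(tokens)) * log(n / (1 + sum(w in doc for doc in corpus)))
        for w, c in Counter(tokens).items()
    }
    return set(sorted(scores, key=scores.get, reverse=True)[:top_k])
```

With these pieces, the label of the target text follows from comparing the two candidate sets, as the dependent claims describe.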
2. The method of claim 1, wherein the determining the tag of the target text from the first candidate tag and the second candidate tag comprises:
And under the condition that an intersection exists between the first candidate tag and the second candidate tag, taking the intersection of the first candidate tag and the second candidate tag to obtain the tag of the target text.
3. The method of claim 1, wherein the determining the tag of the target text from the first candidate tag and the second candidate tag comprises:
And determining the label of the target text from the first candidate label and the second candidate label according to a preset selection rule under the condition that no intersection exists between the first candidate label and the second candidate label.
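Claims 2 and 3 together specify how the two candidate sets are combined. A small sketch follows; note that the "preset selection rule" is left unspecified by the claims, so it appears here as a caller-supplied placeholder, with an arbitrary default.

```python
def determine_tags(first_candidates, second_candidates, preset_rule=None):
    """Claims 2 and 3 combined: when the two candidate sets intersect,
    the intersection is the tag set of the target text; otherwise a
    preset selection rule decides. The patent does not spell that rule
    out, so this sketch accepts it as a callable and, absent one, falls
    back to the NER-derived candidates (a placeholder choice)."""
    overlap = set(first_candidates) & set(second_candidates)
    if overlap:
        return overlap
    if preset_rule is not None:
        return preset_rule(first_candidates, second_candidates)
    return set(first_candidates)
```

For example, a caller could pass `preset_rule=lambda f, s: set(f) | set(s)` to take the union whenever the intersection is empty.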
4. A method according to any one of claims 1 to 3, wherein said determining a second candidate tag for said target text based on a first frequency parameter of said word segmentation vocabulary in said target text and a second frequency parameter of said word segmentation vocabulary in a collection of text comprises:
Determining vocabulary frequencies corresponding to the word segmentation vocabularies according to the first frequency parameter and the second frequency parameter;
And determining the word segmentation vocabulary with the vocabulary frequency meeting the frequency requirement as a second candidate tag of the target text.
5. The method of claim 4, wherein determining the vocabulary frequency corresponding to the segmented vocabulary according to the first frequency parameter and the second frequency parameter comprises:
And determining the product of the first frequency parameter and the second frequency parameter as the vocabulary frequency corresponding to the word segmentation vocabulary.
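The product in claim 5 is the classic TF-IDF weighting. A minimal sketch, under the assumption (an interpretation, since the claim only names the two parameters) that the second frequency parameter is an inverse document frequency over the text set:

```python
from collections import Counter
from math import log

def vocabulary_frequency(word, doc_tokens, corpus):
    """Claim 5's vocabulary frequency: the product of the first
    frequency parameter (term frequency of the word in the target
    text) and the second frequency parameter (read here as inverse
    document frequency over the text set)."""
    tf = Counter(doc_tokens)[word] / len(doc_tokens)  # first parameter
    df = sum(word in doc for doc in corpus)           # documents containing the word
    idf = log(len(corpus) / (1 + df))                 # second parameter
    return tf * idf
```

Words whose vocabulary frequency meets the frequency requirement (e.g. exceeds a threshold, or ranks in the top k) would then become the second candidate tags of claim 4.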
6. A text label determining apparatus, the apparatus comprising:
The processing module is used for carrying out word segmentation on a target text to obtain a word segmentation set, wherein the word segmentation set comprises word segmentation vocabularies obtained by word segmentation of the target text, and the target text is a text of a label to be determined;
The determining module is used for extracting the characteristics of the word segmentation vocabulary to obtain the vocabulary vector of the word segmentation vocabulary; performing feature analysis on the vocabulary vector in combination with context vocabulary vectors, and predicting to obtain a first probability, a second probability and a third probability corresponding to the word segmentation vocabulary, wherein the first probability represents the probability that the word segmentation vocabulary belongs to a tag entity, the second probability represents the probability that the word segmentation vocabulary does not belong to the tag entity, and the third probability represents the probability that the word segmentation vocabulary belongs to a corresponding entity in the tag entity; if the word segmentation vocabulary is an entity, the first probability is high; if the word segmentation vocabulary is non-entity, the second probability is high; if the word segmentation vocabulary is an entity and the word segmentation vocabulary before it is an entity, the third probability is high; inputting the first probability, the second probability and the third probability to a conditional random field to obtain a prediction label corresponding to the word segmentation vocabulary, wherein the prediction label indicates one of the first probability, the second probability and the third probability; filtering out the word segmentation vocabulary of the second probability indicated by the prediction label; and determining a first candidate tag of the target text based on the word segmentation vocabulary of the first probability indicated by the prediction label and the word segmentation vocabulary of the third probability indicated by the prediction label;
The determining module is further configured to determine a second candidate tag of the target text according to a first frequency parameter of the word segmentation vocabulary in the target text and a second frequency parameter of the word segmentation vocabulary in the text set;
The determining module is further configured to determine a tag of the target text according to the first candidate tag and the second candidate tag.
7. The apparatus of claim 6, wherein the determining module is further configured to, in the event that there is an intersection between the first candidate tag and the second candidate tag, take the intersection of the first candidate tag and the second candidate tag to obtain the tag of the target text.
8. The apparatus of claim 6, wherein the determining module is further configured to determine a label of the target text from the first candidate label and the second candidate label according to a preset selection rule if there is no intersection between the first candidate label and the second candidate label.
9. The apparatus according to any one of claims 6 to 8, wherein the determining module further comprises:
The first determining unit is used for determining the vocabulary frequency corresponding to the word segmentation vocabulary according to the first frequency parameter and the second frequency parameter;
The first determining unit is further configured to determine the word segmentation vocabulary with the vocabulary frequency meeting the frequency requirement as a second candidate tag of the target text.
10. A computer device comprising a processor and a memory, wherein the memory has stored therein at least one program that is loaded and executed by the processor to implement the text label determining method of any of claims 1 to 5.
11. A computer-readable storage medium, characterized in that at least one section of a program is stored in the storage medium, the at least one section of the program being loaded and executed by a processor to implement the text label determining method according to any one of claims 1 to 5.
12. A computer program product comprising a computer program which when executed by a processor implements the text label determination method of any of claims 1 to 5.
CN202011065821.5A 2020-09-30 2020-09-30 Text label determining method, device, terminal and readable storage medium Active CN112131350B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011065821.5A CN112131350B (en) 2020-09-30 2020-09-30 Text label determining method, device, terminal and readable storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011065821.5A CN112131350B (en) 2020-09-30 2020-09-30 Text label determining method, device, terminal and readable storage medium

Publications (2)

Publication Number Publication Date
CN112131350A (en) 2020-12-25
CN112131350B (en) 2024-04-30

Family

ID=73843641

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011065821.5A Active CN112131350B (en) 2020-09-30 2020-09-30 Text label determining method, device, terminal and readable storage medium

Country Status (1)

Country Link
CN (1) CN112131350B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114626375A (en) * 2021-04-21 2022-06-14 亚信科技(南京)有限公司 Text word segmentation method and device, electronic equipment and computer readable storage medium
CN113761192B (en) * 2021-05-18 2024-05-28 腾讯云计算(北京)有限责任公司 Text processing method, text processing device and text processing equipment
CN115599903B (en) * 2021-07-07 2024-06-04 腾讯科技(深圳)有限公司 Object tag acquisition method and device, electronic equipment and storage medium
CN113779987A (en) * 2021-08-23 2021-12-10 科大国创云网科技有限公司 Event co-reference disambiguation method and system based on self-attention enhanced semantics
CN113948087B (en) * 2021-09-13 2023-01-17 北京数美时代科技有限公司 Voice tag determination method, system, storage medium and electronic equipment
CN113961666B (en) * 2021-09-18 2022-08-23 腾讯科技(深圳)有限公司 Keyword recognition method, apparatus, device, medium, and computer program product
CN113836443A (en) * 2021-09-28 2021-12-24 土巴兔集团股份有限公司 Article auditing method and related equipment thereof
CN114338586B (en) * 2021-12-21 2024-05-28 中国农业银行股份有限公司 Message pushing method and device, electronic equipment and storage medium
CN115374779B (en) * 2022-10-25 2023-01-10 北京海天瑞声科技股份有限公司 Text language identification method, device, equipment and medium
CN115587262B (en) * 2022-12-12 2023-03-21 中国人民解放军国防科技大学 User identity correlation method based on semantic enhancement
CN116028618B (en) * 2022-12-27 2023-10-27 百度国际科技(深圳)有限公司 Text processing method, text searching method, text processing device, text searching device, electronic equipment and storage medium
CN116136839B (en) * 2023-04-17 2023-06-23 湖南正宇软件技术开发有限公司 Method, system and related equipment for generating legal document face manuscript

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106484672A (en) * 2015-08-27 2017-03-08 北大方正集团有限公司 Vocabulary recognition methods and vocabulary identifying system
CN106569998A (en) * 2016-10-27 2017-04-19 浙江大学 Text named entity recognition method based on Bi-LSTM, CNN and CRF
CN107102980A (en) * 2016-02-19 2017-08-29 北京国双科技有限公司 The extracting method and device of emotion information
CN109285023A (en) * 2017-07-21 2019-01-29 Sk普兰尼特有限公司 For dissecting the method and device thereof of user's intention
CN110580292A (en) * 2019-08-28 2019-12-17 腾讯科技(深圳)有限公司 Text label generation method and device and computer readable storage medium
CN111126060A (en) * 2019-12-24 2020-05-08 东软集团股份有限公司 Method, device and equipment for extracting subject term and storage medium

Also Published As

Publication number Publication date
CN112131350A (en) 2020-12-25

Similar Documents

Publication Publication Date Title
CN112131350B (en) Text label determining method, device, terminal and readable storage medium
CN111753060B (en) Information retrieval method, apparatus, device and computer readable storage medium
CN111026861B (en) Text abstract generation method, training device, training equipment and medium
Rani et al. An efficient CNN-LSTM model for sentiment detection in# BlackLivesMatter
CN112163165A (en) Information recommendation method, device, equipment and computer readable storage medium
CN111898369B (en) Article title generation method, model training method and device and electronic equipment
Arumugam et al. Hands-On Natural Language Processing with Python: A practical guide to applying deep learning architectures to your NLP applications
CN112580352B (en) Keyword extraction method, device and equipment and computer storage medium
CN113360646B (en) Text generation method, device and storage medium based on dynamic weight
US10915756B2 (en) Method and apparatus for determining (raw) video materials for news
CN113032552B (en) Text abstract-based policy key point extraction method and system
CN111625715B (en) Information extraction method and device, electronic equipment and storage medium
CN112148831B (en) Image-text mixed retrieval method and device, storage medium and computer equipment
CN113961666B (en) Keyword recognition method, apparatus, device, medium, and computer program product
Tao et al. Log2intent: Towards interpretable user modeling via recurrent semantics memory unit
Rafail et al. Natural language processing
CN113392179A (en) Text labeling method and device, electronic equipment and storage medium
CN116975199A (en) Text prediction method, device, equipment and storage medium
Chaudhuri Visual and text sentiment analysis through hierarchical deep learning networks
CN115878752A (en) Text emotion analysis method, device, equipment, medium and program product
CN114282528A (en) Keyword extraction method, device, equipment and storage medium
Voronov et al. Forecasting popularity of news article by title analyzing with BN-LSTM network
Nguyen et al. A model of convolutional neural network combined with external knowledge to measure the question similarity for community question answering systems
CN117151089A (en) New word discovery method, device, equipment and medium
Wang et al. CA-CD: context-aware clickbait detection using new Chinese clickbait dataset with transfer learning method

Legal Events

Date Code Title Description
PB01 Publication
REG Reference to a national code

Ref country code: HK

Ref legal event code: DE

Ref document number: 40035387

Country of ref document: HK

SE01 Entry into force of request for substantive examination
GR01 Patent grant