WO2022143608A1 - Language labeling method, apparatus, computer device and storage medium - Google Patents
- Publication number
- WO2022143608A1 (application PCT/CN2021/141917)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- information
- language
- target
- incremental
- video
- Prior art date
Links
- labelling: title, abstract, description (16)
- training: claims, description (122)
- method: claims, description (39)
- filtering: claims, description (7)
- computer program: claims, description (6)
- processing: description (8)
- function: description (7)
- cleaning: description (5)
- diagram: description (5)
- effects: description (5)
- averaging: description (4)
- optical character recognition: description (4)
- conversion: description (3)
- convolutional neural network: description (3)
- deletion: description (3)
- detection: description (3)
- fusion: description (3)
- screening: description (3)
- irregular: description (2)
- natural language processing: description (2)
- optical: description (2)
- support vector machine: description (2)
- transformation: description (2)
- calculation: description (1)
- classification model: description (1)
- communication: description (1)
- contradictory: description (1)
- deep learning: description (1)
- defect: description (1)
- engineering: description (1)
- evaluation: description (1)
- machine learning: description (1)
- optical fiber: description (1)
- segmentation: description (1)
- semiconductor: description (1)
- statistical method: description (1)
- verification: description (1)
Images
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/263—Language identification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/75—Clustering; Classification
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/70—Information retrieval; Database structures therefor; File system structures therefor of video data
- G06F16/78—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually
- G06F16/7867—Retrieval characterised by using metadata, e.g. metadata not derived from the content or metadata generated manually using information manually generated, e.g. tags, keywords, comments, title and artist information, manually generated time, location and usage information, user ratings
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/20—Natural language analysis
- G06F40/205—Parsing
- G06F40/216—Parsing using statistical methods
Definitions
- the present application relates to the technical field of natural language processing, for example, to a language tagging method, apparatus, computer equipment and storage medium.
- Some video platforms can receive video data released by users, such as short videos. Such video data usually carries text information. To provide users with language-based services, such as searching for video data in the same language, classifiers are usually used to label the language of the text information.
- Business data may involve hundreds of different languages, and the number of samples per language must reach a certain size to train a classifier with high accuracy.
- For relatively scarce languages (i.e., minority languages), obtaining high-quality samples is relatively time-consuming.
- the present application proposes a language labeling method, apparatus, computer equipment and storage medium, so as to solve the problem of low efficiency in manually labeling language for text information.
- This application provides a language tagging method, including:
- the confidence level of the target language is verified, wherein the target language is the language to which the target information belongs, and the reference languages are the multiple languages to which the reference information belongs.
- The application also provides a language labeling apparatus, including:
- a language classifier determination module, configured to determine a language classifier;
- a video information collection module, configured to collect multiple pieces of information related to the video data and use them as multiple pieces of video information;
- a video information division module, configured to divide the multiple pieces of video information into target information and reference information;
- a video information classification module, configured to input the multiple pieces of video information into the language classifier respectively, to identify the languages to which they belong;
- a confidence verification module, configured to use the reference languages as an aid to verify the confidence of the target language, wherein the target language is the language to which the target information belongs, and the reference languages are the multiple languages to which the reference information belongs.
- the present application also provides a computer device, the computer device comprising:
- one or more processors;
- a memory arranged to store one or more programs;
- when the one or more programs are executed by the one or more processors, the one or more processors implement the above-mentioned language labeling method.
- the present application also provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the above-mentioned language annotation method is implemented.
- FIG. 1 is a flowchart of a language labeling method provided in Embodiment 1 of the present application.
- FIG. 2 is a flowchart of a language labeling method provided in Embodiment 2 of the present application.
- FIG. 3 is an overall flowchart of training a language classifier based on semi-supervised learning provided in Embodiment 2 of the present application;
- FIG. 4 is a partial flowchart of training a language classifier based on semi-supervised learning provided in Embodiment 2 of the present application;
- FIG. 5 is a schematic structural diagram of a language tagging apparatus provided in Embodiment 3 of the present application.
- FIG. 6 is a schematic structural diagram of a computer device according to Embodiment 4 of the present application.
- FIG. 1 is a flowchart of a language labeling method provided in Embodiment 1 of the present application. This embodiment is applicable to the case of labeling the specified text information with the aid of part of the text information for the same video data.
- The method is executed by a language labeling apparatus, which can be implemented by software and/or hardware and configured in a computer device, such as a server, workstation, or personal computer. The method includes the following steps:
- Step 101 Determine a language classifier.
- a language classifier can be set, and the language classifier can be used to classify the languages of the text information.
- The language classifier can be a machine-learning-based classifier, such as a Support Vector Machine (SVM) or a Bayesian model, or a deep-learning-based classifier, such as a FastText classifier or a text convolutional neural network (Text-CNN); this embodiment is not limited in this respect.
- the input of the language classifier can be textual information
- the output can be the language
- the language classifier can be pre-trained in a supervised manner, that is, generating a training set, which is a labeled dataset.
- the text information can be text information related to video data, or text information unrelated to video data.
- Some open-source language annotation training sets crawl text information from web pages and manually annotate the language to which it belongs.
- Alternatively, the language to which text information related to video data belongs may be labeled manually, etc., which is not limited in this embodiment.
- the language classifier is trained by means of cross-entropy loss function and gradient descent.
- Since this language classifier is the initial version, it can be iteratively updated later. Therefore, after i rounds of iterative training (i is a positive integer), the iterative training can be stopped to confirm that training of the language classifier is completed.
- evaluation parameters such as accuracy rate, recall rate, and F1 value may also be used as conditions for stopping iterative training, which is not limited in this embodiment.
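The supervised pre-training described above, a text classifier trained on a labeled set of text information, can be sketched as follows. This is an illustrative stand-in only: the embodiment names FastText, Text-CNN, SVM, and Bayesian models, and the minimal character-trigram Naive Bayes classifier below (the class name, smoothing, and all sample data are assumptions) merely demonstrates the contract of "text information in, languages and probabilities out".

```python
from collections import Counter, defaultdict
import math

class CharNgramLanguageClassifier:
    """Minimal character-trigram Naive Bayes language classifier (a sketch,
    not the classifier the embodiment actually uses)."""

    def __init__(self, n=3):
        self.n = n
        self.counts = defaultdict(Counter)   # language -> n-gram counts
        self.totals = Counter()              # language -> total n-grams
        self.vocab = set()

    def _ngrams(self, text):
        # pad with boundary markers so word edges become features
        text = f"^{text.lower()}$"
        return [text[i:i + self.n] for i in range(len(text) - self.n + 1)]

    def fit(self, samples):
        """samples: iterable of (text, language) pairs, i.e. the labeled training set."""
        for text, lang in samples:
            for g in self._ngrams(text):
                self.counts[lang][g] += 1
                self.totals[lang] += 1
                self.vocab.add(g)

    def predict_proba(self, text):
        """Return {language: probability}, normalized over the known languages."""
        v = len(self.vocab) or 1
        scores = {}
        for lang in self.counts:
            logp = 0.0
            for g in self._ngrams(text):
                # add-one smoothing avoids zero probabilities for unseen n-grams
                logp += math.log((self.counts[lang][g] + 1) / (self.totals[lang] + v))
            scores[lang] = logp
        m = max(scores.values())
        exp = {k: math.exp(s - m) for k, s in scores.items()}
        z = sum(exp.values())
        return {k: e / z for k, e in exp.items()}

    def predict(self, text):
        probs = self.predict_proba(text)
        return max(probs, key=probs.get)
```

As in Step 104, the multi-class output (all languages with their probabilities) is kept, since the reference-language verification below needs the full distribution, not just the top label.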
- Step 102 Collect multiple pieces of information related to the video data, and use the multiple pieces of information as multiple pieces of video information.
- a video pool can be created in advance, and the video pool stores a plurality of video data to be marked with text information.
- The video data can be in the form of short videos, live broadcasts, TV series, movies, micro-movies, etc.
- Appropriate video data can be filtered according to business needs and put into the video pool. For example, when optimizing the effect of pushing video data in a certain region, video data published in that region can be filtered; when optimizing the effect of pushing video data in a certain time period, video data released in that time period can be filtered; and so on, which is not limited in this embodiment.
- For each piece of video data in the video pool, multiple (that is, two or more) pieces of information related to the video data can be collected from the context of the video data and regarded as video information.
- The video information is of the same type as the training samples in the training set: if the training samples are text information, the video information is text information; if the training samples are voice signals, the video information is voice signals.
- the video information is an unlabeled (ie, unlabeled language) dataset.
- the video information includes at least one of the following:
- the description information is usually a text describing the content of the video data input by the user who produces the video data in order to introduce the video data.
- a user who produces video data can select a frame of image data as the cover of the video data, and input copy information for the cover.
- the subtitle information is usually the text typed in the video data by the user who produces the video data using the function of the client.
- the first feature information is usually text information extracted from the cover through optical character recognition (OCR).
- the second feature information is usually text information extracted from multi-frame image data of the video data through OCR.
- the comment information is usually information published by viewers after browsing the video data.
- The above video information is only an example; those skilled in the art can also use other video information, such as titles or voice signals, according to actual needs, which is not limited in this embodiment of the present application.
- Each piece of video information also carries its attribute value and the video identifier (ID), so as to facilitate subsequent lookup of the corresponding video data and video information.
- Step 103 Divide the plurality of video information into target information and reference information.
- One video information can be processed as one sentence, so as to conform to the habit of natural language processing.
- the sentence (ie video information) is split from a continuous sequence into independent words according to a certain specification.
- Emoji and the like do not help to identify the language and can be deleted.
- Sentences (that is, video information) whose number of words is less than the preset word threshold MIN_WORD_COUNT are eliminated.
- The above cleaning and filtering manners are only examples; those skilled in the art can also adopt other cleaning and filtering methods according to actual needs, which are not limited in the embodiments of the present application.
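The cleaning and filtering steps above (word segmentation, emoji deletion, and elimination of sentences shorter than MIN_WORD_COUNT) might look like the following sketch; the threshold value and the emoji character ranges are assumptions for illustration.

```python
import re

MIN_WORD_COUNT = 3  # preset word threshold; the value is assumed for illustration

# Broad emoji/symbol codepoint ranges; illustrative, not exhaustive.
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")

def clean_video_info(sentence):
    """Clean one piece of video information treated as a sentence:
    delete emoji, split into independent words by whitespace, and
    eliminate sentences with fewer than MIN_WORD_COUNT words.
    Returns the word list, or None if the sentence is eliminated."""
    sentence = EMOJI_RE.sub(" ", sentence)
    words = sentence.split()          # simple whitespace segmentation
    if len(words) < MIN_WORD_COUNT:
        return None                   # eliminated: too few words
    return words
```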
- The multiple pieces of video information can be divided into target information and reference information according to business requirements, wherein the target information is the video information whose language is to be labeled and which is used to update the language classifier, and the reference information is the other video information that assists in verifying the confidence of the language of the target information.
- the correlation of multiple video information relative to the video data can be determined.
- the correlation can be determined by the attributes of the video information itself.
- The video information with the highest correlation is set as the target information, and the other video information is set as reference information.
- When the multiple pieces of video information include description information, copy information matched with the cover, subtitle information, first feature information, second feature information, and comment information, the description information is mainly used to introduce the content of the video data and has the highest correlation with it. Therefore, the target information can be set as the description information, and the reference information can be set to include at least one of the following: the copy information matched with the cover, the subtitle information, the first feature information, the second feature information, and the comment information.
- When the multiple pieces of video information include a voice signal, description information, copy information matched with the cover, subtitle information, first feature information, second feature information, and comment information, the voice signal mainly reflects the language content of the video data and has the highest correlation with it. Therefore, the target information can be set as the voice signal, and the reference information can be set to include at least one of the following: the description information, the copy information matched with the cover, the subtitle information, the first feature information, the second feature information, and the comment information.
- Step 104 Input the plurality of video information into the language classifier respectively to identify the languages to which the plurality of video information belong.
- Step 105 Using the reference languages as an aid, verify the confidence level of the target language.
- The language classifier is a multi-classification model, so for each piece of video information it can output multiple languages to which the information may belong, along with the probability of each language.
- For the target information, the main purpose is to label its language, and that language is unique. Therefore, the language with the highest probability among the multiple languages output by the language classifier can be selected as its language, and the other possible languages are ignored. For distinction, this language can be called the target language; that is, the target language is the language to which the target information belongs.
- For the reference information, the language classifier can likewise output multiple languages and their probabilities.
- These languages can be referred to as reference languages; that is, the reference languages are the multiple languages to which the reference information belongs.
- The users who produce the video data are relatively homogeneous, usually individuals or teams, and the video data is mostly expressed in images and sounds, which are related to culture and language; the audience of the video data is also relatively homogeneous, mostly located in the same area as the producer. Therefore, the video data usually involves a single language, and in most cases the video information related to the same video data involves the same language. Thus, the reference languages to which the reference information belongs (i.e., the multiple reference languages and their probabilities) can be used as an aid to verify the confidence that the language of the target information is the target language.
- For example, if the video data is a life scene involving conversations in English, the user who makes the video data typically writes the description information in English and adds English subtitles, and if the audience users understand the content of the video data, the comment information they publish will in most cases also be in English.
- A confidence range biased towards the middle level may be preset, that is, a range in which the degree of confidence is moderate.
- One endpoint of the confidence range is the first probability threshold MIN_PROB_1, and the other endpoint is the second probability threshold MIN_PROB_2, where MIN_PROB_2 is greater than MIN_PROB_1.
- From the output of the language classifier for the target information, the probability that the language of the target information is the target language is queried and taken as the target probability P_S.
- If the target probability P_S is within the confidence range, that is, P_S is greater than or equal to the preset first probability threshold MIN_PROB_1 and less than or equal to the preset second probability threshold MIN_PROB_2, the confidence that the language of the target information is the target language is considered moderate. In reality, the language of the target information may or may not be the target language. At this time, each piece of reference information can be traversed to query the probability of the reference language that is the same as the target language, as a reference probability.
- Based on the reference probabilities, the confidence score that the target information belongs to the target language is calculated, so as to characterize how strongly the reference information corroborates that the language of the target information is the target language.
- Screening appropriate target information through the confidence range before checking the confidence can reduce the quantity of target information, thereby reducing the amount of computation and improving efficiency.
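A sketch of the three-branch confidence check described above (P_S below MIN_PROB_1, within [MIN_PROB_1, MIN_PROB_2], or above MIN_PROB_2). The threshold values and the use of a simple average of the reference probabilities as the confidence score are assumptions; the embodiment does not fix a particular scoring formula.

```python
MIN_PROB_1 = 0.4   # first probability threshold (value assumed)
MIN_PROB_2 = 0.7   # second probability threshold (value assumed)

def check_confidence(target_probs, reference_probs_list):
    """Verify the confidence of the target language with the reference languages as an aid.

    target_probs: {language: probability} output for the target information.
    reference_probs_list: one {language: probability} dict per piece of reference info.
    Returns (target_language, decision, value): 'accept' labels directly,
    'reject' ignores the video data this round, 'check' returns a confidence
    score computed from the reference probabilities."""
    target_language = max(target_probs, key=target_probs.get)
    p_s = target_probs[target_language]
    if p_s > MIN_PROB_2:
        return target_language, "accept", p_s      # high confidence: label directly
    if p_s < MIN_PROB_1:
        return target_language, "reject", p_s      # insufficient confidence: skip this round
    # middle band: query each reference's probability for the same language
    ref_ps = [probs.get(target_language, 0.0) for probs in reference_probs_list]
    score = sum(ref_ps) / len(ref_ps) if ref_ps else 0.0  # averaging is one possible formula
    return target_language, "check", score
```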
- For example, suppose the language with the highest probability for the description information of a piece of video data is English, but the probability of English is moderate (e.g., 0.6).
- On its own, the description information and its target language cannot be used as a training sample to update the language classifier.
- However, if the prediction for the copy information matching the cover in the same video data is also English, with a high probability (e.g., 0.8), this additional information can be used to confirm that the description information is correctly predicted as English.
- The description information, the cover copy information, and the target language can then be used as training samples to update the language classifier, thereby expanding the labeled sample size.
- If the target probability P_S is less than the first probability threshold MIN_PROB_1, the confidence that the language of the target information is the target language is considered insufficient, and it may not be a regular language; the current video data and its video information are ignored when updating the language classifier in this round of iteration, but are not deleted. As the performance of the language classifier improves in later rounds, the target probability P_S may become greater than or equal to the first probability threshold MIN_PROB_1.
- If the target probability P_S is greater than the second probability threshold MIN_PROB_2, the confidence that the language of the target information is the target language is considered high, and the language of the target information is directly identified as the target language, without needing the reference languages of the reference information to verify the confidence.
- In this embodiment, a language classifier is determined; multiple pieces of information related to video data are collected and regarded as multiple pieces of video information; the multiple pieces of video information are divided into target information and reference information; the multiple pieces of video information are input into the language classifier to identify the languages to which they belong; and the reference languages are used as an aid to verify the confidence of the target language, wherein the target language is the language to which the target information belongs and the reference languages are the multiple languages to which the reference information belongs.
- The users who produce the video data are relatively homogeneous, as is the audience; the video data usually involves a single language, and the video information related to the video data usually involves the same language. Therefore, the reference languages to which the reference information belongs can be used as an aid to verify the confidence that the language of the target information is the target language, thereby improving the accuracy of the predicted language.
- FIG. 2 is a flowchart of a language labeling method provided in Embodiment 2 of the present application. Based on the foregoing embodiments, this embodiment illustrates the operation of iteratively updating the language classifier by semi-supervised learning. The method includes the following steps:
- Step 201 Determine a language classifier.
- Step 202 Collect multiple pieces of information related to the video data, and use the multiple pieces of information as multiple pieces of video information.
- Step 203 Divide the multiple pieces of video information into target information and reference information.
- Step 204 Input the plurality of video information into the language classifier respectively to identify the languages to which the plurality of video information belong.
- Step 205 Using the reference languages as an aid, verify the confidence level of the target language.
- the target language is the language to which the target information belongs
- the reference language is the multiple languages to which the reference information belongs.
- Step 206 If the confidence level is greater than or equal to a preset confidence threshold, use the video information as a reference to generate information similar to the video information as incremental information.
- The confidence score is compared with the preset confidence threshold MIN_SCORE. If the confidence score is greater than or equal to MIN_SCORE, the confidence is high and the reference information strongly corroborates the target information. At this time, the video information can be used as a reference to generate similar information; for ease of distinction, this text information can be recorded as incremental information.
- Since the incremental information is generated with reference to the video information, it can also be treated as a sentence.
- some words are deleted from the video information in a random manner to obtain incremental information.
- the quantity condition is that the proportion of the words of the incremental information to the words of the video information exceeds the preset first proportion threshold MIN_PERCENT_1.
- incremental information is obtained by converting the format of some or all of the words in the video information to uppercase.
- the incremental information is obtained by converting the format of some or all words in the video information to lowercase letters.
- some or all of the punctuation marks in the video information are removed to obtain incremental information.
- Alternatively, N (N is a positive integer, N < M) words are deleted within a range of M (M is a positive integer) words to obtain incremental information.
- The above methods of generating incremental information are only examples and can be used alone or in any combination; those skilled in the art may also adopt other methods of generating incremental information according to actual needs, which are not limited in the embodiments of the present application.
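The incremental-information transformations above (random word deletion bounded by MIN_PERCENT_1, case conversion, and punctuation removal) can be sketched as follows; the threshold value and function names are assumptions for illustration.

```python
import random
import string

MIN_PERCENT_1 = 0.7  # first proportion threshold (value assumed)

def generate_incremental_info(words, rng=None):
    """Generate incremental information from one piece of video information
    (given as a word list) using the transformations described above.
    Returns a list of new sentences; a seeded RNG keeps the sketch deterministic."""
    rng = rng or random.Random(0)
    variants = []
    # 1. randomly delete words, preserving order and keeping at least
    #    MIN_PERCENT_1 of the original words
    keep = max(int(len(words) * MIN_PERCENT_1), 1)
    kept_idx = sorted(rng.sample(range(len(words)), keep))
    variants.append([words[i] for i in kept_idx])
    # 2. convert all words to uppercase
    variants.append([w.upper() for w in words])
    # 3. convert all words to lowercase
    variants.append([w.lower() for w in words])
    # 4. remove punctuation marks
    stripped = [w.strip(string.punctuation) for w in words]
    variants.append([w for w in stripped if w])
    return [" ".join(v) for v in variants]
```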
- Step 207 Invoke the language classifier to detect the validity of the incremental information in recognizing the target language.
- The language classifier may be biased towards languages with a large number of training samples in the training set, resulting in erroneous predictions.
- For example, suppose the correct language of a piece of video information is transliterated Hindi: 7 of its 10 words are Hindi words entered via transliteration, and the remaining 3 words are English words. Since English training samples are plentiful while transliterated Hindi training samples are relatively scarce, the language classifier may incorrectly predict that the language of the video information is English because of the strong features of the three English words.
- The video information can be verified by generating new sentences (that is, incremental information); that is, the language classifier is called to verify whether the incremental information is valid for recognizing the target language, thereby improving the accuracy of the predicted language.
- the incremental information may be input into a language classifier for processing to identify the language to which the incremental information belongs.
- this language can be called an incremental language, that is, an incremental language is a language to which incremental information belongs.
- Count the proportion of cases in which the incremental language is the same as the target language: count the first number of incremental languages that are the same as the target language, count the second number of all incremental languages, and calculate the ratio of the first number to the second number as the proportion.
- If the proportion is greater than or equal to the preset second proportion threshold MIN_PERCENT_2 (e.g., 80%), the ambiguity of the incremental language being the target language is low, and the incremental information can be determined to be valid for recognizing the language.
- If the proportion is less than the preset second proportion threshold MIN_PERCENT_2 (e.g., 80%), the ambiguity of the incremental language being the target language is relatively high, and the incremental information can be determined to be invalid for recognizing the language.
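The proportion-based validity check in Step 207 can be sketched as follows, with MIN_PERCENT_2 set to the 80% used as an example above.

```python
MIN_PERCENT_2 = 0.8  # second proportion threshold (80%, per the example above)

def incremental_info_is_valid(incremental_languages, target_language):
    """Detect whether the incremental information is valid for recognizing
    the target language: the proportion of incremental languages equal to
    the target language must reach MIN_PERCENT_2."""
    if not incremental_languages:
        return False
    first = sum(1 for lang in incremental_languages if lang == target_language)
    second = len(incremental_languages)
    return first / second >= MIN_PERCENT_2
```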
- Step 208 If the incremental information is valid for identifying the target language, update the language classifier according to the target language and at least one of the video information and the incremental information.
- When collecting new data to update the language classifier, the collected data usually conforms to the following two rules:
- the new data is not similar to the existing training samples in the current training set, which allows the language classifier to learn new features.
- An indicator for determining whether new data is similar to the existing training samples is the probability with which the current language classifier predicts the language of the new data: if that probability is low, the language classifier has not encountered this type of data in the training set. Therefore, one solution is to add new data with a low probability to the training set.
- The label (language) of the new data must be accurate, to ensure that a language classifier with better performance is trained.
- a common practice to ensure accurate labels is to manually label new data.
- An automated solution is to treat a language predicted with a high probability (e.g., more than 0.95) as the correct label, since a high probability indicates that the language classifier considers the predicted language of the new data to be correct. Therefore, one solution is to add such high-probability new data to the training set.
- This embodiment proposes using the predictions for the reference information of the video data as evidence to assist in judging whether a language predicted with a low probability for the target information is correct. If such a language is judged to be correct, the data conforms to both of the above rules and can be added to the training set, so that features not previously involved, or only rarely involved, enter the training of the language classifier. This improves the performance of the language classifier and thus the accuracy of the predicted and labeled languages, fusing semi-supervised training of the language classifier with automatic labeling.
- If the incremental information is valid for identifying the target language, the languages predicted for the newly generated incremental information are largely consistent with that of the video information. In this case, it can be determined that the language prediction of the video information is unambiguous, and the video information and its target language are used to update the language classifier.
- If the incremental information is invalid for recognizing the target language, the languages predicted for the newly generated incremental information are too inconsistent with that of the video information; the video information may contain words in different languages, or some words may have strong features. In this case, it can be determined that the language prediction of the video information is ambiguous, and the language classifier is not updated with the video information and its target language.
- the training set of the language classifier can be obtained.
- The training set contains multiple pieces of text information (or voice signals), each already labeled with the language it belongs to. The text information (or voice signals) in the training set may be the initially labeled text information (or voice signals), or video information and/or incremental information subsequently labeled by the language classifier; this embodiment does not limit which.
- The video information can be added to the training set as text information (or a voice signal), with the target language labeled as the language to which the video information belongs.
- The language classifier can also be updated with appropriate incremental information and its target language.
- The incremental information that is effective for updating the language classifier can be filtered out and added to the training set as text information (or speech signals), with the target language labeled as the language to which the incremental information belongs.
- A specified ratio MIN_RATIO (0 < MIN_RATIO < 1) may be applied to the probability that the video information belongs to the target language, yielding the third probability threshold MIN_PROB_3 for the incremental information.
- the probability that the incremental information belongs to the target language is compared with the preset first threshold MIN_PROB_1 and third threshold MIN_PROB_3.
- If the probability that the incremental information belongs to the target language is greater than or equal to the first threshold MIN_PROB_1 and less than or equal to the third threshold MIN_PROB_3, the incremental information is valid for updating the language classifier; the probability that the target information belongs to the target language is itself greater than or equal to the first threshold MIN_PROB_1.
- In this case, the probability that the incremental information belongs to the target language is moderate, and smaller than the probability that the video information belongs to the target language.
- This situation indicates that the incremental information differs from the video information only by some transformations (for example, some words are missing), which lowers the predicted probability. The removed parts may be strong features for the classifier's prediction, while the remaining content (for example, the remaining word combinations) is less familiar to the classifier (for example, absent from the current training set); adding such incremental information therefore helps improve the classifier's performance.
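The band described here — at least the floor MIN_PROB_1 but no higher than a fixed fraction of the video information's own probability — can be written directly. The default values below are assumptions; the text only says the thresholds are preset.

```python
def incremental_info_valid(p_video, p_incr, min_prob_1=0.5, min_ratio=0.9):
    """Return True if the incremental information's probability sits in the
    useful band: at least MIN_PROB_1, but no more than
    MIN_PROB_3 = MIN_RATIO * p_video (0 < MIN_RATIO < 1).

    Values below MIN_PROB_1 are too unreliable; values close to p_video
    mean the deleted words were not strong features, so the variant adds
    little new information."""
    min_prob_3 = min_ratio * p_video
    return min_prob_1 <= p_incr <= min_prob_3
```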
- The language classifier is more sensitive to training samples during the first h (h is a positive integer) rounds of iterative updating; mislabeled samples there would degrade the classifier's performance and accumulate errors in subsequent iterations. Therefore, in the first h rounds, video information with pre-labeled languages can be used. Based on the classifier's outputs for the video information and the incremental information, the language pre-labeled on the video information is taken as the actual language (that is, the actual language of the video information) and compared with the target language.
- The video information is allowed to be added to the training set as text information in the training set;
- the target language is allowed to be labeled as the language to which the video information belongs;
- and the incremental information is allowed to be added to the training set as text information, with the target language labeled as the language to which the incremental information belongs.
- the actual language is different from the target language
- At least one of the video information and the incremental information, together with the target language, is ignored. That is, it is forbidden to add the video information to the training set as text information and to label the target language as the language to which the video information belongs;
- and/or it is forbidden to add the incremental information to the training set as text information and to label the target language as the language to which the incremental information belongs.
- The total number of pieces of video information added to the training set since the last update of the language classifier can be counted and compared with a preset number threshold MAX_SENT_COUNT.
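The counting rule can be sketched in one function; the threshold value 1000 is an assumption, since the text only calls for "a preset number threshold MAX_SENT_COUNT".

```python
def should_retrain(added_since_last_update, max_sent_count=1000):
    """Trigger a classifier update once the number of pieces of video
    information added since the last update exceeds MAX_SENT_COUNT."""
    return added_since_last_update > max_sent_count
```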
- The above training conditions are only examples; other training conditions can be set according to the actual situation. For example, if, after the last update of the language classifier, the total number of ignored pieces of video information exceeds another threshold, the language classifier may be defective and its performance should be improved as soon as possible. This embodiment of the present application does not limit this.
- those skilled in the art may also adopt other training conditions according to actual needs, which are not limited in the embodiments of the present application.
- A labeled training set L includes sentences (text information or voice signals) with labeled languages, and an unlabeled data set U includes sentences with unlabeled languages (the video information of video data).
- A language classifier C_i is trained using the sentences in training set L and their labeled languages.
- The language classifier C_i predicts, for each sentence S in data set U, the languages to which S may belong, each carrying a probability.
- the sentence S in the data set is marked with language and added to the training set L.
- S401: some sentences S1 (target information) are taken from data set U to form a subset V, where the highest probability among the languages to which each sentence S1 may belong lies between the first probability threshold MIN_PROB_1 and the second probability threshold MIN_PROB_2.
- a sentence S1 is randomly selected from the subset V, and the video ID of the video data where the sentence S1 is located, the language A with the highest prediction probability, and the probability P_S1 are confirmed.
- Sentences T are obtained by deleting some words from sentence S, where the words of each sentence T account for more than the first proportion threshold MIN_PERCENT_1 of the words of sentence S.
- The language classifier C_i is called to predict the language of each sentence T; when the predicted language is A, its probability is P_T, and the proportion A_P of language A among all predictions is calculated.
- Sentence S (including sentences S1 and S2) is labeled as language A and added to training set L.
- For sentence T, if MIN_PROB_1 ≤ P_T ≤ MIN_PROB_3 for language A, sentence T is labeled as language A and added to training set L.
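The steps above can be condensed into a runnable miniature. This is a simplified illustration, not the patented implementation: `classify` is assumed to return a `(language, probability)` pair, and the concrete thresholds stand in for MIN_PERCENT_1 and the agreement threshold.

```python
import random

def augment(sentence, min_percent_1=0.6):
    """Build a variant T of sentence S by deleting words, keeping more
    than MIN_PERCENT_1 of S's words (the quantity condition)."""
    words = sentence.split()
    keep = min(int(len(words) * min_percent_1) + 1, len(words))
    kept = sorted(random.sample(range(len(words)), keep))
    return " ".join(words[i] for i in kept)

def label_candidate(classify, s1, n_variants=5, agree_threshold=0.8):
    """Predict a language for several deletion variants of S1 and accept
    the label only if a large enough proportion of variants (A_P in the
    text) agrees with the prediction for S1 itself."""
    lang_a, _ = classify(s1)
    variants = [augment(s1) for _ in range(n_variants)]
    agree = sum(1 for t in variants if classify(t)[0] == lang_a)
    return lang_a if agree / n_variants >= agree_threshold else None
```

A classifier that always answers `("en", 0.9)` makes every variant agree, so `label_candidate(lambda s: ("en", 0.9), "a b c d e")` returns `"en"`.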
- FIG. 5 is a structural block diagram of a language tagging apparatus provided in Embodiment 3 of the present application, which may include the following modules:
- The language classifier determination module 501 is configured to determine the language classifier; the video information collection module 502 is configured to collect multiple pieces of information related to the video data as multiple pieces of video information; the video information division module 503 is configured to divide the multiple pieces of video information into target information and reference information; the video information classification module 504 is configured to input the multiple pieces of video information into the language classifier respectively, to identify the languages to which they belong; and the confidence check module 505 is configured to use reference languages as an aid to check the confidence of the target language, where the target language is the language to which the target information belongs and the reference languages are the multiple languages to which the reference information belongs.
- the language classifier determination module 501 includes:
- The training set generation module is configured to generate a training set, where the training set has multiple pieces of text information, each labeled with the language it belongs to; the language classifier training module is configured to train the language classifier using each piece of text information in the training set as a training sample and its labeled language as a training label.
- the video information division module 503 includes:
- A relevance determination module, configured to determine the relevance of the multiple pieces of video information to the video data; a target information setting module, configured to set the video information with the highest relevance as the target information; and a reference information setting module, configured to set the video information other than the target information as the reference information.
- the video information includes at least one of the following:
- Description information, copy information matching the cover, subtitle information, first feature information, second feature information, and comment information; where the first feature information is text information extracted from the cover, and the second feature information is text information extracted from multi-frame image data of the video data.
- the target information is description information
- the reference information includes at least one of the following:
- Copy information matching the cover, subtitle information, first feature information, second feature information, and comment information.
- the confidence check module 505 includes:
- The target probability query module is configured to query the probability of the target language as the target probability;
- the reference probability query module is configured to, if the target probability is greater than or equal to a preset first probability threshold and less than or equal to a preset second probability threshold, query the probability of the reference language that is the same as the target language as the reference probability;
- the probability fusion module is configured to combine the target probability and the reference probability to calculate the confidence that the target information belongs to the target language.
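One plausible reading of the combination step — the passage fixes the thresholds but not the formula, so simple averaging is an illustrative assumption here:

```python
def confidence(target_prob, reference_probs,
               min_prob_1=0.5, min_prob_2=0.95):
    """Combine the target probability with the probabilities of reference
    languages that match the target language. Averaging is one plausible
    combination; the band check mirrors the first/second probability
    thresholds described above."""
    if not (min_prob_1 <= target_prob <= min_prob_2):
        return None  # outside the band, the reference check does not apply
    probs = [target_prob] + list(reference_probs)
    return sum(probs) / len(probs)
```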
- The incremental information generation module is configured to generate information similar to the video information as incremental information if the confidence is greater than or equal to a preset confidence threshold; the validity detection module is configured to call the language classifier to detect the validity of the incremental information in recognizing the target language; and the language classifier update module is configured to, if the incremental information is valid in recognizing the target language, update the language classifier according to at least one of the video information and the incremental information, and the target language.
- the incremental information generation module includes:
- The first word deletion module is configured to delete some words from the video information under the constraint of a quantity condition to obtain incremental information, where the quantity condition is that the words of the incremental information account for more than a preset first proportion threshold of the words of the video information; and/or, the first letter conversion module is configured to convert the words in the video information to uppercase to obtain incremental information; and/or, the second letter conversion module is configured to convert the words in the video information to lowercase to obtain incremental information; and/or, the punctuation deletion module is configured to delete the punctuation in the video information to obtain incremental information; and/or, the second word deletion module is configured to delete N words within a range of M words in the video information to obtain incremental information.
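Most of the transformations listed above can be sketched deterministically (proportional random word deletion is omitted here for reproducibility). The window parameters follow the "delete N words within a range of M words" rule, with M=5 and N=1 as illustrative assumptions.

```python
import re

def make_incremental_info(text, m=5, n=1):
    """Generate incremental information from one piece of video
    information via the transformations listed above."""
    words = text.split()
    return {
        "upper": text.upper(),                     # uppercase variant
        "lower": text.lower(),                     # lowercase variant
        "no_punct": re.sub(r"[^\w\s]", "", text),  # punctuation removed
        # delete the first n words of every window of m words
        "windowed_delete": " ".join(
            w for i, w in enumerate(words) if i % m >= n
        ),
    }
```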
- Invoking the language classifier to detect the validity of the incremental information in recognizing the target language includes:
- An incremental information classification module, configured to input the incremental information into the language classifier to identify the language to which the incremental information belongs as an incremental language; a proportion statistics module, configured to count the proportion of cases in which the incremental language is the same as the target language; and a valid determination module, configured to determine that the incremental information is valid in recognizing the language if the proportion is greater than or equal to a preset second proportion threshold.
- the language classifier update module includes:
- A training set acquisition module, configured to acquire a training set of the language classifier, where the training set has multiple pieces of text information, each labeled with the language it belongs to; a video information adding module, configured to add the video information to the training set as text information in the training set; a video information labeling module, configured to label the target language as the language to which the video information belongs; a training condition detection module, configured to detect whether a preset training condition is met and, if so, call the iterative training module; and an iterative training module, configured to update the language classifier using the text information in the training set as training samples and the labeled languages as training labels.
- the training condition detection module includes:
- A total number statistics module, configured to count the total number of pieces of video information added to the training set since the last update of the language classifier; and a satisfaction determination module, configured to determine that the preset training condition is met if the total number is greater than a preset number threshold.
- the language classifier update module further includes:
- An incremental information screening module, configured to filter the incremental information effective for updating the language classifier; an incremental information adding module, configured to add the filtered incremental information to the training set as text information in the training set; and an incremental information labeling module, configured to label the target language as the language to which the incremental information belongs.
- the incremental information screening module includes:
- A probability threshold setting module, configured to take a specified proportion of the probability that the video information belongs to the target language as the third probability threshold for the incremental information; and a valid determination module, configured to determine that the incremental information is valid for updating the language classifier if the probability that the incremental information belongs to the target language is greater than or equal to a preset first threshold and less than or equal to the third threshold, where the probability that the target information belongs to the target language is greater than or equal to the first threshold.
- the language classifier update module further includes:
- An actual language determination module, configured to determine the language labeled on the video information as the actual language; and a sample ignoring module, configured to ignore at least one of the video information and the incremental information, together with the target language, if the actual language is different from the target language.
- the language tagging apparatus provided in the embodiment of the present application can execute the language tagging method provided in any embodiment of the present application, and has functional modules and effects corresponding to the execution method.
- FIG. 6 is a schematic structural diagram of a computer device according to Embodiment 4 of the present application.
- Figure 6 shows a block diagram of an exemplary computer device 12 suitable for use in implementing embodiments of the present application.
- the computer device 12 shown in FIG. 6 is only an example, and should not impose any limitations on the functions and scope of use of the embodiments of the present application.
- computer device 12 takes the form of a general-purpose computing device.
- Components of computer device 12 may include, but are not limited to, one or more processors or processing units 16 , system memory 28 , and a bus 18 connecting various system components including system memory 28 and processing unit 16 .
- System memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32 .
- storage system 34 may be configured to read and write to non-removable, non-transitory, non-volatile magnetic media (not shown in Figure 6, commonly referred to as "hard disk drives").
- the memory 28 may include at least one program product having a set (eg, at least one) of program modules configured to perform the functions of the embodiments of the present application.
- a program/utility 40 having a set (at least one) of program modules 42 may be stored in memory 28, for example.
- Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
- Computer device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, a display 24). Such communication may take place through an input/output (I/O) interface 22. Computer device 12 may also communicate with one or more networks (e.g., a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet) through a network adapter 20.
- the processing unit 16 executes a variety of functional applications and data processing by running the programs stored in the system memory 28, for example, implementing the language annotation method provided by the embodiments of the present application.
- Embodiment 7 of the present application further provides a computer-readable storage medium, where a computer program is stored on the computer-readable storage medium.
- Computer-readable storage media may include, but are not limited to, electrical, magnetic, optical, electromagnetic, infrared, or semiconductor systems, apparatuses, or devices, or any combination thereof, for example.
- Examples (a non-exhaustive list) of computer-readable storage media include: an electrical connection with one or more wires, a portable computer disk, a hard disk, random access memory (RAM), read-only memory (ROM), erasable programmable read-only memory (EPROM or flash memory), optical fiber, portable compact disc read-only memory (CD-ROM), optical memory devices, magnetic memory devices, or any suitable combination of the foregoing.
- a computer-readable storage medium can be any tangible medium that contains or stores a program that can be used by or in conjunction with an instruction execution system, apparatus, or device.
Abstract
Description
Claims (16)
- 1. A language labeling method, comprising: determining a language classifier; collecting multiple pieces of information related to video data and using them as multiple pieces of video information; dividing the multiple pieces of video information into target information and reference information; inputting the multiple pieces of video information into the language classifier respectively to identify the languages to which they belong; and using reference languages as an aid to check the confidence of a target language, wherein the target language is the language to which the target information belongs and the reference languages are the multiple languages to which the reference information belongs.
- 2. The method according to claim 1, wherein determining the language classifier comprises: generating a training set, wherein the training set contains multiple pieces of text information, each labeled with the language it belongs to; and training the language classifier using each piece of text information in the training set as a training sample and its labeled language as a training label.
- 3. The method according to claim 1, wherein dividing the multiple pieces of video information into target information and reference information comprises: determining the relevance of the multiple pieces of video information to the video data; setting the video information with the highest relevance as the target information; and setting the video information other than the target information as the reference information.
- 4. The method according to claim 3, wherein the video information comprises at least one of the following: description information, copy information matching the cover, subtitle information, first feature information, second feature information, and comment information; wherein the first feature information is text information extracted from the cover, and the second feature information is text information extracted from multi-frame image data of the video data; and when the target information is description information, the reference information comprises at least one of the following: copy information matching the cover, subtitle information, first feature information, second feature information, and comment information.
- 5. The method according to any one of claims 1-4, wherein using reference languages as an aid to check the confidence of the target language comprises: querying the probability of the target language as a target probability; when the target probability is greater than or equal to a preset first probability threshold and less than or equal to a preset second probability threshold, querying the probability of a reference language that is the same as the target language as a reference probability; and combining the target probability and the reference probability to calculate the confidence that the target information belongs to the target language.
- 6. The method according to any one of claims 1-4, further comprising: when the confidence is greater than or equal to a preset confidence threshold, generating information similar to the video information as incremental information; invoking the language classifier to detect the validity of the incremental information in recognizing the target language; and when the incremental information is valid in recognizing the target language, updating the language classifier according to at least one of the video information and the incremental information, and the target language.
- 7. The method according to claim 6, wherein generating information similar to the video information as incremental information comprises at least one of the following: deleting some words from the video information under the constraint of a quantity condition to obtain the incremental information, wherein the quantity condition is that the proportion of the words of the incremental information to the words of the video information exceeds a preset first proportion threshold; converting the words in the video information to uppercase to obtain the incremental information; converting the words in the video information to lowercase to obtain the incremental information; deleting the punctuation in the video information to obtain the incremental information; and deleting N words within a range of M words in the video information to obtain the incremental information, wherein M is greater than N and both M and N are positive integers.
- 8. The method according to claim 6, wherein invoking the language classifier to detect the validity of the incremental information in recognizing the target language comprises: inputting the incremental information into the language classifier to identify the language to which the incremental information belongs as an incremental language; counting the proportion of cases in which the incremental language is the same as the target language; and when the proportion is greater than or equal to a preset second proportion threshold, determining that the incremental information is valid in recognizing the language.
- 9. The method according to claim 6, wherein updating the language classifier according to at least one of the video information and the incremental information, and the target language, comprises: acquiring a training set of the language classifier, wherein the training set contains multiple pieces of text information, each labeled with the language it belongs to; adding the video information to the training set as text information in the training set; labeling the target language as the language to which the video information belongs; detecting whether a preset training condition is met; and in response to the preset training condition being met, updating the language classifier using the text information in the training set as training samples and the labeled languages as training labels.
- 10. The method according to claim 9, wherein detecting whether the training set meets the preset training condition comprises: counting the total number of pieces of video information added to the training set since the last update of the language classifier; and when the total number is greater than a preset number threshold, determining that the preset training condition is met.
- 11. The method according to claim 9, wherein updating the language classifier according to at least one of the video information and the incremental information, and the target language, further comprises: filtering the incremental information that is effective for updating the language classifier; adding the filtered incremental information to the training set as text information in the training set; and labeling the target language as the language to which the incremental information belongs.
- 12. The method according to claim 11, wherein filtering the incremental information that is effective for updating the language classifier comprises: taking a specified proportion of the probability that the video information belongs to the target language as a third probability threshold for the incremental information; and when the probability that the incremental information belongs to the target language is greater than or equal to a preset first threshold and less than or equal to the third threshold, determining that the incremental information is effective for updating the language classifier, wherein the probability that the target information belongs to the target language is greater than or equal to the first threshold.
- 13. The method according to claim 9 or 11, wherein updating the language classifier according to at least one of the video information and the incremental information, and the target language, further comprises: determining the language labeled on the video information as the actual language; and when the actual language is different from the target language, ignoring at least one of the video information and the incremental information, and the target language.
- 14. A language labeling apparatus, comprising: a language classifier determination module, configured to determine a language classifier; a video information collection module, configured to collect multiple pieces of information related to video data as multiple pieces of video information; a video information division module, configured to divide the multiple pieces of video information into target information and reference information; a video information classification module, configured to input the multiple pieces of video information into the language classifier respectively to identify the languages to which they belong; and a confidence check module, configured to use reference languages as an aid to check the confidence of a target language, wherein the target language is the language to which the target information belongs and the reference languages are the multiple languages to which the reference information belongs.
- 15. A computer device, comprising: at least one processor; and a memory configured to store at least one program; wherein, when the at least one program is executed by the at least one processor, the at least one processor implements the language labeling method according to any one of claims 1-13.
- 16. A computer-readable storage medium storing a computer program, wherein, when the computer program is executed by a processor, the language labeling method according to any one of claims 1-13 is implemented.
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US18/258,743 US20240070389A1 (en) | 2020-12-31 | 2021-12-28 | Language labeling method and computer device, and non-volatile storage medium |
EP21914343.5A EP4273737A1 (en) | 2020-12-31 | 2021-12-28 | Language labeling method and apparatus, and computer device and storage medium |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202011630350.8 | 2020-12-31 | ||
- CN202011630350.8A CN112699671B (zh) | 2020-12-31 | 2020-12-31 | Language labeling method and apparatus, computer device and storage medium |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022143608A1 true WO2022143608A1 (zh) | 2022-07-07 |
Family
ID=75513512
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
- PCT/CN2021/141917 WO2022143608A1 (zh) | Language labeling method and apparatus, computer device and storage medium | 2020-12-31 | 2021-12-28 |
Country Status (4)
Country | Link |
---|---|
US (1) | US20240070389A1 (zh) |
EP (1) | EP4273737A1 (zh) |
CN (1) | CN112699671B (zh) |
WO (1) | WO2022143608A1 (zh) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
- CN112699671B (zh) * | 2020-12-31 | 2023-11-17 | 百果园技术(新加坡)有限公司 | Language labeling method and apparatus, computer device and storage medium |
- CN114926847B (zh) * | 2021-12-06 | 2023-04-07 | 百度在线网络技术(北京)有限公司 | Image processing method, apparatus, device and storage medium for minority languages |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20160110340A1 (en) * | 2014-10-17 | 2016-04-21 | Machine Zone, Inc. | Systems and Methods for Language Detection |
US20160239476A1 (en) * | 2015-02-13 | 2016-08-18 | Facebook, Inc. | Machine learning dialect identification |
- CN108475260A (zh) * | 2016-04-14 | 2018-08-31 | 谷歌有限责任公司 | Method, system and medium for language identification of media content items based on comments |
- CN112017630A (zh) * | 2020-08-19 | 2020-12-01 | 北京字节跳动网络技术有限公司 | Language identification method and apparatus, electronic device and storage medium |
- CN112699671A (zh) * | 2020-12-31 | 2021-04-23 | 百果园技术(新加坡)有限公司 | Language labeling method and apparatus, computer device and storage medium |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9848215B1 (en) * | 2016-06-21 | 2017-12-19 | Google Inc. | Methods, systems, and media for identifying and presenting users with multi-lingual media content items |
US10762375B2 (en) * | 2018-01-27 | 2020-09-01 | Microsoft Technology Licensing, Llc | Media management system for video data processing and adaptation data generation |
US11443227B2 (en) * | 2018-03-30 | 2022-09-13 | International Business Machines Corporation | System and method for cognitive multilingual speech training and recognition |
- CN109933688A (zh) * | 2019-02-13 | 2019-06-25 | 北京百度网讯科技有限公司 | Method, apparatus, device and computer storage medium for determining video annotation information |
- 2020-12-31: CN application CN202011630350.8A filed; patent CN112699671B (active)
- 2021-12-28: EP application EP21914343.5A filed (pending)
- 2021-12-28: PCT application PCT/CN2021/141917 filed (application filing)
- 2021-12-28: US application US18/258,743 filed; published as US20240070389A1 (pending)
Also Published As
Publication number | Publication date |
---|---|
CN112699671A (zh) | 2021-04-23 |
EP4273737A1 (en) | 2023-11-08 |
US20240070389A1 (en) | 2024-02-29 |
CN112699671B (zh) | 2023-11-17 |
Legal Events
Date | Code | Title | Description
---|---|---|---
| 121 | EP: the EPO has been informed by WIPO that EP was designated in this application | Ref document number: 21914343; Country of ref document: EP; Kind code of ref document: A1
| WWE | WIPO information: entry into national phase | Ref document number: 18258743; Country of ref document: US
| WWE | WIPO information: entry into national phase | Ref document number: 2023119195; Country of ref document: RU
| NENP | Non-entry into the national phase | Ref country code: DE
| ENP | Entry into the national phase | Ref document number: 2021914343; Country of ref document: EP; Effective date: 20230731