WO2024055603A1 - Minor text recognition method and device

Info

Publication number
WO2024055603A1
Authority
WO
WIPO (PCT)
Prior art keywords
sentence
mark
recognition
minor
text
Prior art date
Application number
PCT/CN2023/092437
Other languages
English (en)
French (fr)
Inventor
邓其春
马金龙
吴文亮
黎子骏
张政统
王伟喆
曾锐鸿
盘子圣
焦南凯
兰翔
徐志坚
谢睿
陈光尧
Original Assignee
广州趣丸网络科技有限公司
Priority date
Filing date
Publication date
Application filed by 广州趣丸网络科技有限公司
Publication of WO2024055603A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis

Definitions

  • the present application relates to the field of text recognition, and more specifically, to a text recognition method and device for minors.
  • at present, minors' texts are generally identified by keyword matching combined with model prediction.
  • this approach is too limited in function and does not analyze the relationship between the text content and the characteristics of minors, so the accuracy of minor text recognition is unreliable.
  • in view of this, the present application provides a minor text recognition method and device to make the accuracy of minor text recognition more reliable.
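  • a minor text recognition method, including:
  • obtaining a text to be recognized that contains several sentences;
  • for each sentence in the text to be recognized, recognizing the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in a pre-established sentence recognition module, to obtain an intermediate marked sentence with one keyword mark; analyzing the keyword mark carried by the intermediate marked sentence and marking the intermediate marked sentence with a minor identification mark to obtain a target sentence with the minor identification mark; the sentence recognition module contains multiple recognition combinations with different priorities, and each recognition combination includes a sentence matching algorithm and a keyword seal;
  • counting the target sentences in the text to be recognized whose minor identification mark is the minor mark, taking the count as a first number, and calculating a first score from the first number;
  • counting the target sentences in the text to be recognized whose minor identification mark is the highly suspected minor mark, taking the count as a second number, and calculating a second score from the second number;
  • if the sum of the first score and the second score is greater than a preset score threshold, determining that the text to be recognized is the text of a minor.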
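  • optionally, obtaining the intermediate marked sentence includes: for each sentence in the text to be recognized, in descending order of the priorities of the recognition combinations in the pre-established sentence recognition module, matching each keyword in the sentence through the sentence matching algorithm of each recognition combination in turn, the sentence matching algorithm being accelerated by a multi-pattern AC algorithm;
  • in each sentence of the text to be recognized, marking each keyword hit by the keyword seal of a recognition combination with a keyword mark;
  • determining the intermediate marked sentence with one keyword mark after the recognition combinations have applied their keyword marks.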
  • optionally, each of several recognition combinations in the sentence recognition module also includes an early ending action;
  • the method also includes:
  • for each sentence in the text to be recognized, while the sentence is being recognized through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the sentence recognition module, when a keyword in the sentence hits the early ending action of the current recognition combination, the recognition combinations with lower priority than the current one are skipped, and the intermediate marked sentence with one keyword mark, for which recognition ended early, is determined.
  • optionally, each of several recognition combinations in the sentence recognition module also includes one or more filter conditions;
  • in this case, for each sentence in the text to be recognized, the sentence is recognized through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the pre-established sentence recognition module, to obtain a condition-filtered intermediate marked sentence with one keyword mark.
  • optionally, each of several recognition combinations in the sentence recognition module also includes an idle placeholder action;
  • the method also includes:
  • for each sentence in the text to be recognized, while the sentence is being recognized through each recognition combination in turn in descending order of priority, when a target keyword in the sentence hits the idle placeholder action of the current recognition combination, a temporary mask layer is generated to cover the target keyword, so that recognition combinations with lower priority than the current one skip the target keyword when recognizing the sentence; once all recognition combinations in the sentence recognition module have finished recognizing the sentence, the temporary mask layer covering the target keyword is removed.
  • optionally, each of several recognition combinations in the sentence recognition module also includes a re-analysis action;
  • the method also includes:
  • for each sentence in the text to be recognized, when the sentence is analyzed through a recognition combination containing a re-analysis action, stop words in the sentence are detected and temporarily removed; after that recognition combination has finished recognizing the sentence, the temporarily removed stop words are restored.
  • optionally, the keyword mark is a model inference mark;
  • in this case, according to the model inference mark carried by the intermediate marked sentence, the intermediate marked sentence is input into an existing minor prediction model, which outputs the minor identification mark of the intermediate marked sentence;
  • the intermediate marked sentence is marked with the output minor identification mark to obtain a target sentence with that mark.
  • optionally, the keyword mark is a blacklist mark;
  • in this case, according to the blacklist mark carried by the intermediate marked sentence, the minor identification mark applied to the intermediate marked sentence is determined to be the minor mark, and the intermediate marked sentence is marked with the minor mark to obtain a target sentence with the minor mark.
  • optionally, the keyword mark is a high-suspicion mark;
  • in this case, according to the high-suspicion mark carried by the intermediate marked sentence, the minor identification mark applied to the intermediate marked sentence is determined to be the highly suspected minor mark, and the intermediate marked sentence is marked with it to obtain a target sentence with the highly suspected minor mark.
  • optionally, after the keyword mark carried by the intermediate marked sentence has been analyzed and the intermediate marked sentence has been marked with a minor identification mark to obtain the target sentence, the method also includes:
  • displaying the target sentence with the minor identification mark.
  • a minor text recognition device, including:
  • a recognition text acquisition unit, configured to obtain a text to be recognized that contains several sentences;
  • a target sentence marking unit, configured to, for each sentence in the text to be recognized, recognize the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the pre-established sentence recognition module, to obtain an intermediate marked sentence with one keyword mark, analyze the keyword mark carried by the intermediate marked sentence, and mark the intermediate marked sentence with a minor identification mark to obtain a target sentence with the minor identification mark; the sentence recognition module includes multiple recognition combinations with different priorities, and each recognition combination includes a sentence matching algorithm and a keyword seal;
  • a first score statistics unit, configured to count the target sentences in the text to be recognized whose minor identification mark is the minor mark, take the count as the first number, and calculate the first score from the first number;
  • a second score statistics unit, configured to count the target sentences in the text to be recognized whose minor identification mark is the highly suspected minor mark, take the count as the second number, and calculate the second score from the second number;
  • a minor text confirmation unit, configured to determine that the text to be recognized is the text of a minor if the sum of the first score and the second score is greater than a preset score threshold.
  • with the above technical solution, this application obtains a text to be recognized that contains several sentences and, for each sentence, recognizes the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the pre-established sentence recognition module, to obtain an intermediate marked sentence with one keyword mark; it then analyzes the keyword mark carried by the intermediate marked sentence and marks the intermediate marked sentence with a minor identification mark to obtain a target sentence with the minor identification mark.
  • the sentence recognition module contains multiple recognition combinations with different priorities, and each recognition combination includes a sentence matching algorithm and a keyword seal.
  • the number of target sentences in the text to be recognized whose minor identification mark is the minor mark is counted as the first number, and the first score is calculated from the first number.
  • the number of target sentences whose minor identification mark is the highly suspected minor mark is counted as the second number, and the second score is calculated from the second number; if the sum of the first score and the second score is greater than the preset score threshold, the text to be recognized is determined to be the text of a minor.
  • thus, by analyzing the content of the text to be recognized layer by layer, assigning a mark to each keyword related to the characteristics of minors, and comprehensively analyzing and labeling how suspicious each sentence is, the method effectively and reliably determines whether the text to be recognized belongs to a minor.
  • Figure 1 is a schematic flow chart of minor text recognition provided by an embodiment of the present application.
  • Figure 2 is a schematic diagram of the priority-based execution order of the recognition combinations of a sentence recognition module provided by an embodiment of the present application.
  • Figure 3 is a schematic flowchart of recognizing and analyzing sentences provided by an embodiment of the present application.
  • Figure 4 is a schematic diagram of an early ending action in the process of recognizing a sentence by a sentence recognition module provided by an embodiment of the present application.
  • Figure 5 is a schematic diagram of filter conditions in the process of recognizing a sentence by a sentence recognition module provided by an embodiment of the present application.
  • Figure 6 is a schematic diagram of an idle placeholder action in the process of recognizing a sentence by a sentence recognition module provided by an embodiment of the present application.
  • Figure 7 is a schematic structural diagram of a minor text recognition device provided by an embodiment of the present application.
  • Figure 8 is a schematic structural diagram of a text recognition device for minors provided by an embodiment of the present application.
  • This application solution can be implemented based on a terminal with data processing capabilities, which can be a computer, server, cloud, etc.
  • the minor text recognition method of this application may include the following steps:
  • Step S110 Obtain the text to be recognized containing several sentences.
  • the text to be recognized can be the text of online real-time chat, or the text of the user's offline history.
  • Step S120 For each sentence in the text to be recognized, recognize the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the pre-established sentence recognition module, to obtain an intermediate marked sentence with one keyword mark; analyze the keyword mark carried by the intermediate marked sentence, and mark the intermediate marked sentence with a minor identification mark to obtain the target sentence with the minor identification mark.
  • the sentence recognition module may include multiple recognition combinations with different priorities, and each recognition combination may include a sentence matching algorithm and a keyword seal.
  • a recognition combination with a higher priority matches the text to be recognized earlier through its sentence matching algorithm, so the results it produces are more representative.
  • the sentence matching algorithm can be an algorithm for sentence recognition and analysis, such as a full-text matching algorithm, a clause matching algorithm, an original text matching algorithm, or an intelligent semantic matching algorithm.
  • a keyword seal can be a label used to identify minors' sentences.
  • the matched text, or part of it, can be given a keyword mark such as a blacklist mark, a whitelist mark, a high-suspicion mark, or a model inference mark.
  • Figure 2 shows six recognition combinations with different priorities, each including a sentence matching algorithm and a keyword seal; these six recognition combinations constitute the sentence recognition module. Each sentence can therefore pass in order through priority 1 (match the sentence by full-text matching, with the blacklist field samples as the matching target), priority 2, and so on up to priority 6 (match the sentence by original text matching, with the model inference field samples as the matching target), yielding an intermediate marked sentence with a keyword mark.
  • Figure 3 shows that the intermediate marked sentence obtained after marking by the six recognition combinations enters comprehensive analysis.
  • in comprehensive analysis, the minor identification mark corresponding to the mark applied by the keyword seal can be determined, in order to decide whether the sentence comes from a minor.
  • Step S130 Count the number of target sentences with a minor identification mark as a minor mark in the text to be recognized, determine it as a first number, and calculate a first score based on the first number.
  • the first score may be the first quantity multiplied by a proportional coefficient corresponding to the minor mark, and the proportional coefficient corresponding to the minor mark may be a preset value, such as a value of 1.
  • Step S140 Count the number of target sentences in the text to be recognized whose minor identification mark is the highly suspected minor mark, determine it as a second number, and calculate a second score based on the second number.
  • the second score may be the second number multiplied by a proportional coefficient corresponding to the highly suspected minor mark.
  • the proportional coefficient corresponding to the highly suspected minor mark may be a preset value, which may be smaller than the proportional coefficient corresponding to the minor mark, for example 0.5.
  • Step S150 If the sum of the first score and the second score is greater than a preset score threshold, determine that the text to be identified is the text of a minor.
  • the preset score threshold represents the minimum level at which a text containing sentences with the highly suspected minor mark and sentences with the minor mark is judged to be a minor's text; the preset score threshold can be customized.
  • if the first score exceeds the first preset sub-score threshold, the text to be recognized may be determined to be the text of a minor; if the second score exceeds the third preset sub-score threshold, the text to be recognized may likewise be determined to be the text of a minor.
  • the third preset sub-score threshold may be lower than the preset score threshold, and the second preset sub-score threshold may be lower than the third preset sub-score threshold.
  • a text to be recognized contains multiple sentences.
  • if only a few sentences in the text to be recognized are marked with the minor mark or the highly suspected minor mark, this is not enough to show that the text comes from a minor, since such hits may be accidental; a preset score threshold is therefore needed.
  • as the number of sentences in the text to be recognized increases, the number of sentences allowed to carry the highly suspected minor mark or the minor mark also increases, so the sum of the first score and the second score increases accordingly.
  • for example, the preset score threshold may be defined as 30, the proportional coefficient corresponding to the minor mark as 1, and the proportional coefficient corresponding to the highly suspected minor mark as 0.5.
  • the minor text recognition method provided in this embodiment obtains a text to be recognized that contains several sentences and, for each sentence, recognizes the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the pre-established sentence recognition module, to obtain an intermediate marked sentence with one keyword mark; the keyword mark carried by the intermediate marked sentence is analyzed, and the intermediate marked sentence is marked with a minor identification mark to obtain the target sentence with the minor identification mark.
  • the sentence recognition module contains multiple recognition combinations with different priorities.
  • each recognition combination includes a sentence matching algorithm and a keyword seal. The number of target sentences in the text to be recognized whose minor identification mark is the minor mark is counted as a first number, and a first score is calculated from it; the number of target sentences whose minor identification mark is the highly suspected minor mark is counted as a second number, and a second score is calculated from it. If the sum of the first score and the second score is greater than the preset score threshold, the text to be recognized is determined to be the text of a minor.
  • in each recognition combination, a sentence matching algorithm accelerated by the multi-pattern AC algorithm matches each keyword in the sentence.
  • sentence matching algorithms such as the full-text matching algorithm, clause matching algorithm, original text matching algorithm, and intelligent semantic matching algorithm can match keywords quickly through the multi-pattern AC algorithm, which improves matching speed.
  • a keyword is hit by a keyword seal when a sentence in the text to be recognized contains a field matching the sample field corresponding to that keyword seal.
  • each sentence may be marked with zero, one, or multiple keyword marks, whereas the intermediate marked sentence finally required carries only one keyword mark.
  • a sentence with multiple keyword marks can therefore keep, as its only keyword mark, the mark corresponding to the keyword hit by the keyword seal of the recognition combination with the higher priority, and the sentence is then used as the intermediate marked sentence.
  • by using sentence matching algorithms accelerated by the multi-pattern AC algorithm, the minor text recognition method provided in this embodiment can effectively reduce the time complexity of the minor text recognition process.
  • the minor text recognition method provided can also include an early ending action, which several recognition combinations in the sentence recognition module may contain.
  • when a sentence meets the condition of an early ending action, it can enter comprehensive analysis directly.
  • the condition for a sentence to hit the early ending action may be that the sentence hits the keyword seal of the recognition combination in which the early ending action is located.
  • by adding the early ending action, the minor text recognition method provided in this embodiment lets a sentence enter comprehensive analysis directly once it obtains the blacklist mark, which optimizes the time complexity of the sentence matching algorithm in the text recognition process.
  • some recognition combinations of the sentence recognition module may also include one or more filter conditions, such as non-first-person filtering, question filtering, past-time filtering, and distance filtering. On this basis, the process mentioned in the above embodiment, in which each sentence in the text to be recognized is recognized through each recognition combination in turn in descending order of priority to obtain an intermediate marked sentence with one keyword mark, is introduced below. This process can include:
  • for each sentence in the text to be recognized, recognizing the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the pre-established sentence recognition module, to obtain a condition-filtered intermediate marked sentence with one keyword mark.
  • both the 5th and 6th recognition combinations include non-first-person filtering, question filtering, and past-time filtering; that is, these three filter conditions are added when the semantic matching of the 5th recognition combination and the original text matching of the 6th recognition combination are performed, in order to optimize the matching algorithm.
  • by adding filter conditions, the minor text recognition method provided in this embodiment can improve the matching accuracy of the sentence matching algorithm and identify minor text more accurately.
  • each of several recognition combinations may also include an idle placeholder action to reduce misrecognition by other recognition combinations with lower priority.
  • on this basis, the minor text recognition method may also include:
  • for each sentence in the text to be recognized, when a target keyword in the sentence hits the idle placeholder action of the current recognition combination, a temporary mask layer is generated to cover the target keyword, so that recognition combinations with lower priority than the current one skip the target keyword when recognizing the sentence; the temporary mask layer is removed once all recognition combinations in the sentence recognition module have finished recognizing the sentence.
  • the third recognition combination includes an idle placeholder action, which can be triggered when the whitelist mark is hit.
  • an example is the sentence "When I was 13 years old", in which "13 years old" is a blacklist field for minors even though the sentence does not indicate a minor.
  • when "When I was 13 years old" triggers the whitelist mark in the third recognition combination, "13 years old" is determined to be the target keyword and a temporary mask layer is generated to cover the "13 years old" field, so that the 4th, 5th, and 6th recognition combinations skip that field when they are executed; the temporary mask layer is removed after the sentence recognition module has finished recognizing the sentence "When I was 13 years old".
  • by adding idle placeholder actions to the recognition combinations, the minor text recognition method provided in this embodiment ensures that recognition combinations with lower priority can match or recognize sentences reasonably.
  • in some cases the sentence matching algorithm cannot accurately identify the actual meaning of a sentence. On this basis, each of several recognition combinations in the sentence recognition module can also include a re-analysis action, and in some embodiments of the present application the minor text recognition method provided may also include:
  • for each sentence in the text to be recognized, when the sentence is analyzed through a recognition combination containing a re-analysis action, stop words in the sentence are detected and temporarily removed; after that recognition combination has finished recognizing the sentence, the temporarily removed stop words are restored.
  • by adding a re-analysis action to the recognition combinations, the minor text recognition method provided in this embodiment can convert some sentences with the high-suspicion mark into sentences with the blacklist mark, making the recognition of minor text stricter.
  • the keyword mark carried by a sentence can be a model inference mark, a blacklist mark, or a high-suspicion mark. On this basis, the process mentioned in the above embodiments, in which the keyword mark carried by the intermediate marked sentence is analyzed and the intermediate marked sentence is marked with a minor identification mark to obtain the target sentence, is introduced below. Depending on the keyword mark, this process can be divided into the following situations:
  • the minor identification mark output by the minor prediction model can be a blacklist mark, a high-suspicion mark, or a qualified mark.
  • the qualified mark indicates that the intermediate marked sentence contains no minor-related content.
  • after each sentence obtains its corresponding target sentence with a minor identification mark, the target sentence can be displayed.
  • in online chat, this makes it possible to analyze in time whether the content of each sentence is minor content; in offline mining and detection of minor accounts, it makes the progress of text recognition easier to monitor.
  • the device for realizing minor text recognition provided by the embodiment of the present application is described below.
  • the device for realizing minor text recognition described below and the method for realizing minor text recognition described above can be mutually referenced.
  • Figure 7 is a schematic structural diagram of a device for realizing text recognition of minors disclosed in an embodiment of the present application.
  • the device may include:
  • a recognition text acquisition unit, configured to obtain a text to be recognized that contains several sentences;
  • a target sentence marking unit, configured to, for each sentence in the text to be recognized, recognize the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the pre-established sentence recognition module, to obtain an intermediate marked sentence with one keyword mark, analyze the keyword mark carried by the intermediate marked sentence, and mark the intermediate marked sentence with a minor identification mark to obtain a target sentence with the minor identification mark; the sentence recognition module includes multiple recognition combinations with different priorities, and each recognition combination includes a sentence matching algorithm and a keyword seal;
  • a first score statistics unit, configured to count the target sentences in the text to be recognized whose minor identification mark is the minor mark, take the count as the first number, and calculate the first score from the first number;
  • a second score statistics unit, configured to count the target sentences in the text to be recognized whose minor identification mark is the highly suspected minor mark, take the count as the second number, and calculate the second score from the second number;
  • a minor text confirmation unit, configured to determine that the text to be recognized is the text of a minor if the sum of the first score and the second score is greater than a preset score threshold.
  • Figure 8 shows a block diagram of the hardware structure of the text recognition device for minors.
  • the hardware structure of the text recognition device for minors may include: at least one processor 1, at least one communication interface 2, at least one memory 3 and at least one communication bus 4;
  • the number of processor 1, communication interface 2, memory 3, and communication bus 4 is at least one, and processor 1, communication interface 2, and memory 3 complete communication with each other through communication bus 4;
  • the processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present application;
  • the memory 3 may include high-speed RAM, and may also include non-volatile memory, such as at least one disk memory;
  • the memory stores a program, and the processor can call the program stored in the memory.
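  • the program is configured to:
  • obtain a text to be recognized that contains several sentences;
  • for each sentence in the text to be recognized, recognize the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the pre-established sentence recognition module, to obtain an intermediate marked sentence with one keyword mark, analyze the keyword mark, and mark the intermediate marked sentence with a minor identification mark to obtain a target sentence; the sentence recognition module contains multiple recognition combinations with different priorities, and each recognition combination includes a sentence matching algorithm and a keyword seal;
  • count the target sentences carrying the minor mark and the target sentences carrying the highly suspected minor mark, and calculate the first score and the second score from the two counts;
  • if the sum of the first score and the second score is greater than the preset score threshold, determine that the text to be recognized is the text of a minor.
  • embodiments of the present application also provide a storage medium that can store a program suitable for execution by a processor, the program being configured to perform the same steps: obtain the text to be recognized, mark each sentence through the recognition combinations in descending order of priority, calculate the first score and the second score, and determine that the text to be recognized is the text of a minor if their sum is greater than the preset score threshold.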

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Probability & Statistics with Applications (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

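A minor text recognition method and device. The method includes: obtaining a text containing several sentences; for each sentence in the text, recognizing the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in a pre-established sentence recognition module, to obtain an intermediate marked sentence with one keyword mark; analyzing the keyword mark to apply a minor identification mark and obtain a marked target sentence; and, if the sum of a first score for sentences with the minor mark and a second score for sentences with the highly suspected minor mark is greater than a preset score threshold, determining that the text to be recognized is a minor's text. Thus, by analyzing the content of the text to be recognized layer by layer, assigning a mark to each keyword related to the characteristics of minors, and comprehensively analyzing and labeling how suspicious each sentence is, the method effectively and reliably determines whether the text to be recognized belongs to a minor.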

Description

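Minor text recognition method and device
This application claims priority to Chinese patent application No. 202211107466.2, entitled "Minor text recognition method and device" and filed with the China Patent Office on September 13, 2022, the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of text recognition and, more specifically, to a minor text recognition method and device.
Background
With the continuous development of science and technology, network technology is increasingly advanced and more and more minors are entering the online world. For minors, however, the internet is a double-edged sword: some minors use it sensibly and grow up healthily, but more of them, lacking mature self-control, become addicted to it and cannot break free. Supervising minors' use of the internet therefore remains a long and demanding task.
Minors now leave a great deal of information behind when they use the internet, such as chat conversations while gaming, conversations on QQ or WeChat, and homework files sent during homework transfer, all of which involve text. Text can therefore be recognized to determine whether it belongs to a minor, so that the minor's online account can be traced and the minor's internet use controlled.
At present, minor text recognition generally uses keyword matching plus model prediction. This is too limited in function and lacks any analysis of the relationship between the text content and the characteristics of minors, so its recognition accuracy is unreliable.
Summary
In view of the above problems, this application is proposed to provide a minor text recognition method and device that make the accuracy of minor text recognition more reliable.
To achieve the above purpose, the following specific solutions are proposed: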
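A minor text recognition method, including:
obtaining a text to be recognized that contains several sentences;
for each sentence in the text to be recognized, recognizing the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in a pre-established sentence recognition module, to obtain an intermediate marked sentence with one keyword mark; analyzing the keyword mark carried by the intermediate marked sentence and marking the intermediate marked sentence with a minor identification mark to obtain a target sentence with the minor identification mark, where the sentence recognition module contains multiple recognition combinations with different priorities and each recognition combination includes a sentence matching algorithm and a keyword seal;
counting the target sentences in the text to be recognized whose minor identification mark is the minor mark, taking the count as a first number, and calculating a first score from the first number;
counting the target sentences in the text to be recognized whose minor identification mark is the highly suspected minor mark, taking the count as a second number, and calculating a second score from the second number;
if the sum of the first score and the second score is greater than a preset score threshold, determining that the text to be recognized is the text of a minor.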
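The scoring steps above can be sketched as follows. This is a minimal illustration rather than the applicant's implementation: the function and constant names are hypothetical, and the coefficients (1 and 0.5) and the threshold (30) are the example values given in the Definitions section of this document.

```python
from collections import Counter

# Hypothetical labels for the minor identification marks.
MINOR = "minor"
HIGH_SUSPECT = "high_suspect_minor"
QUALIFIED = "qualified"


def is_minor_text(marks, minor_coeff=1.0, suspect_coeff=0.5, threshold=30.0):
    """Return True if the weighted count of marked sentences exceeds the threshold.

    marks: one minor identification mark per target sentence.
    """
    counts = Counter(marks)
    first_score = counts[MINOR] * minor_coeff            # first number x coefficient
    second_score = counts[HIGH_SUSPECT] * suspect_coeff  # second number x coefficient
    return first_score + second_score > threshold


# 20 minor marks and 22 highly suspected marks: 20 * 1 + 22 * 0.5 = 31 > 30.
example_marks = [MINOR] * 20 + [HIGH_SUSPECT] * 22 + [QUALIFIED] * 50
print(is_minor_text(example_marks))
```

Because the comparison is strict, a sum exactly equal to the threshold does not classify the text as a minor's text.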
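Optionally, for each sentence in the text to be recognized, recognizing the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the pre-established sentence recognition module, to obtain an intermediate marked sentence with one keyword mark includes:
for each sentence in the text to be recognized, in descending order of the priorities of the recognition combinations in the pre-established sentence recognition module, matching each keyword in the sentence through the sentence matching algorithm of each recognition combination in turn, the sentence matching algorithm being accelerated by a multi-pattern AC algorithm;
in each sentence of the text to be recognized, marking each keyword hit by the keyword seal of a recognition combination with a keyword mark;
determining the intermediate marked sentence with one keyword mark after the recognition combinations have applied their keyword marks.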
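To illustrate multi-pattern matching, here is a compact sketch that assumes "AC" refers to the Aho-Corasick automaton, which finds all occurrences of many keywords in a single pass over the sentence. The class name, method names, and sample keywords are hypothetical; the document does not describe how the matcher is actually built.

```python
from collections import deque


class AhoCorasick:
    """Minimal Aho-Corasick automaton for multi-keyword search (illustrative only)."""

    def __init__(self, keywords):
        self.goto = [{}]   # goto[node][char] -> child node
        self.fail = [0]    # failure link of each node
        self.out = [[]]    # keywords that end at each node
        for word in keywords:
            self._insert(word)
        self._build_failure_links()

    def _insert(self, word):
        node = 0
        for ch in word:
            if ch not in self.goto[node]:
                self.goto.append({})
                self.fail.append(0)
                self.out.append([])
                self.goto[node][ch] = len(self.goto) - 1
            node = self.goto[node][ch]
        self.out[node].append(word)

    def _build_failure_links(self):
        # Breadth-first traversal; depth-1 nodes keep the root as failure link.
        queue = deque(self.goto[0].values())
        while queue:
            node = queue.popleft()
            for ch, child in self.goto[node].items():
                f = self.fail[node]
                while f and ch not in self.goto[f]:
                    f = self.fail[f]
                self.fail[child] = self.goto[f].get(ch, 0)
                # Inherit keywords that end at the failure target.
                self.out[child] = self.out[child] + self.out[self.fail[child]]
                queue.append(child)

    def find_all(self, text):
        """Return (start_index, keyword) for every keyword occurrence in text."""
        node, hits = 0, []
        for i, ch in enumerate(text):
            while node and ch not in self.goto[node]:
                node = self.fail[node]
            node = self.goto[node].get(ch, 0)
            for word in self.out[node]:
                hits.append((i - len(word) + 1, word))
        return hits


matcher = AhoCorasick(["homework", "13 years old", "years"])
print(matcher.find_all("I am 13 years old and have homework"))
```

The scan visits each character of the sentence once regardless of how many keywords are loaded, which is the property that makes this kind of matcher attractive when a keyword seal has a large sample-field list.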
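Optionally, each of several recognition combinations in the sentence recognition module also includes an early ending action;
the method also includes:
for each sentence in the text to be recognized, while the sentence is being recognized through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the sentence recognition module, when a keyword in the sentence hits the early ending action of the current recognition combination, skipping the recognition combinations with lower priority than the current recognition combination and determining the intermediate marked sentence with one keyword mark, for which recognition ended early.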
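The priority loop with an early ending action might look like the sketch below. The `RecognitionCombination` dataclass, its fields, and the plain substring test standing in for the sentence matching algorithm are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class RecognitionCombination:
    priority: int            # 1 is the highest priority (assumed convention)
    keywords: tuple          # sample fields of the keyword seal
    mark: str                # keyword mark applied on a hit
    end_early: bool = False  # whether a hit triggers the early ending action


def recognize(sentence, combinations):
    """Return (keyword_mark, priorities_visited) for one sentence."""
    visited = []
    mark = None
    for combo in sorted(combinations, key=lambda c: c.priority):
        visited.append(combo.priority)
        # Substring search stands in for the real sentence matching algorithm.
        if any(keyword in sentence for keyword in combo.keywords):
            if mark is None:
                mark = combo.mark  # keep only the highest-priority mark
            if combo.end_early:
                break              # skip all lower-priority combinations
    return mark, visited


combos = [
    RecognitionCombination(1, ("primary school",), "blacklist", end_early=True),
    RecognitionCombination(2, ("homework",), "high_suspicion"),
]
print(recognize("I am in primary school and have homework", combos))
print(recognize("so much homework", combos))
```

In the first call the blacklist hit ends recognition after priority 1, so the second combination is never evaluated; in the second call both combinations run.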
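Optionally, each of several recognition combinations in the sentence recognition module also includes one or more filter conditions;
for each sentence in the text to be recognized, recognizing the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the pre-established sentence recognition module, to obtain an intermediate marked sentence with one keyword mark includes:
for each sentence in the text to be recognized, recognizing the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the pre-established sentence recognition module, to obtain a condition-filtered intermediate marked sentence with one keyword mark.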
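Optionally, each of several recognition combinations in the sentence recognition module further comprises an idle placeholder action;
the method further comprises:
for each sentence in the text to be recognized, while the sentence is being recognized through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the sentence recognition module, when a target keyword in the sentence triggers the idle placeholder action of the current recognition combination, generating a temporary mask layer that covers the target keyword, so that recognition combinations whose priority is lower than that of the current recognition combination skip the target keyword when recognizing the sentence, and removing the temporary mask layer covering the target keyword after all recognition combinations in the sentence recognition module have finished recognizing the sentence.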
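Optionally, each of several recognition combinations in the sentence recognition module further comprises a re-analysis action;
the method further comprises:
for each sentence in the text to be recognized, when the sentence is analyzed through a recognition combination that contains a re-analysis action, detecting and temporarily removing the stop words in the sentence, and restoring the temporarily removed stop words after the recognition combination that contains the re-analysis action has finished recognizing the sentence.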
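Optionally, the keyword mark is a model-inference mark;
analyzing the keyword mark carried by the intermediate marked sentence and applying a minor-discrimination mark to the intermediate marked sentence, to obtain a target sentence carrying the minor-discrimination mark, comprises:
according to the model-inference mark carried by the intermediate marked sentence, inputting the intermediate marked sentence into an existing minor prediction model, which outputs the minor-discrimination mark of the intermediate marked sentence;
applying the output minor-discrimination mark to the intermediate marked sentence, to obtain a target sentence carrying the output minor-discrimination mark.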
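Optionally, the keyword mark is a blacklist mark;
analyzing the keyword mark carried by the intermediate marked sentence and applying a minor-discrimination mark to the intermediate marked sentence, to obtain a target sentence carrying the minor-discrimination mark, comprises:
according to the blacklist mark carried by the intermediate marked sentence, determining that the minor-discrimination mark to be applied to the intermediate marked sentence is the minor mark;
applying the minor mark to the intermediate marked sentence, to obtain a target sentence carrying the minor mark.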
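Optionally, the keyword mark is a high-suspicion mark;
analyzing the keyword mark carried by the intermediate marked sentence and applying a minor-discrimination mark to the intermediate marked sentence, to obtain a target sentence carrying the minor-discrimination mark, comprises:
according to the high-suspicion mark carried by the intermediate marked sentence, determining that the minor-discrimination mark to be applied to the intermediate marked sentence is the highly-suspected-minor mark;
applying the highly-suspected-minor mark to the intermediate marked sentence, to obtain a target sentence carrying the highly-suspected-minor mark.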
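Optionally, after analyzing the keyword mark carried by the intermediate marked sentence and applying a minor-discrimination mark to the intermediate marked sentence, to obtain a target sentence carrying the minor-discrimination mark, the method further comprises:
displaying the target sentence carrying the minor-discrimination mark.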
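An apparatus for identifying text written by minors, comprising:
a text acquisition unit, configured to obtain a text to be recognized that contains several sentences;
a target sentence marking unit, configured to, for each sentence in the text to be recognized, recognize the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in a pre-established sentence recognition module, to obtain an intermediate marked sentence carrying one keyword mark, and to analyze the keyword mark carried by the intermediate marked sentence and apply a minor-discrimination mark to the intermediate marked sentence, to obtain a target sentence carrying the minor-discrimination mark, wherein the sentence recognition module comprises a plurality of recognition combinations of different priorities, and each recognition combination comprises a sentence matching algorithm and a keyword stamp;
a first score statistics unit, configured to count the target sentences in the text to be recognized whose minor-discrimination mark is the minor mark, take the result as a first count, and calculate a first score from the first count;
a second score statistics unit, configured to count the target sentences in the text to be recognized whose minor-discrimination mark is the highly-suspected-minor mark, take the result as a second count, and calculate a second score from the second count;
a minor-text confirmation unit, configured to determine that the text to be recognized is text written by a minor if the sum of the first score and the second score is greater than a preset score threshold.
With the above technical solution, this application obtains a text to be recognized that contains several sentences; for each sentence in the text, recognizes the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in a pre-established sentence recognition module, to obtain an intermediate marked sentence carrying one keyword mark; analyzes the keyword mark carried by the intermediate marked sentence and applies a minor-discrimination mark to it, to obtain a target sentence carrying the minor-discrimination mark, where the sentence recognition module comprises a plurality of recognition combinations of different priorities and each recognition combination comprises a sentence matching algorithm and a keyword stamp; counts the target sentences whose minor-discrimination mark is the minor mark, takes the result as a first count, and calculates a first score from it; counts the target sentences whose minor-discrimination mark is the highly-suspected-minor mark, takes the result as a second count, and calculates a second score from it; and, if the sum of the first score and the second score is greater than a preset score threshold, determines that the text to be recognized is text written by a minor. It can thus be seen that, by analyzing the content of the text layer by layer, assigning a mark to each keyword related to minor status, and comprehensively assessing and labeling how strongly each sentence suggests a minor, the application effectively and reliably determines whether the text to be recognized was written by a minor.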
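Brief Description of the Drawings
Various other advantages and benefits will become clear to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of this application. Throughout the drawings, the same reference symbols denote the same components. In the drawings:
FIG. 1 is a schematic flowchart of minor-text recognition provided by an embodiment of this application;
FIG. 2 is a schematic diagram of the priority-based execution order of the recognition combinations of a sentence recognition module provided by an embodiment of this application;
FIG. 3 is a schematic flowchart of recognizing and analyzing a sentence provided by an embodiment of this application;
FIG. 4 is a schematic diagram of the early-termination action during sentence recognition by the sentence recognition module provided by an embodiment of this application;
FIG. 5 is a schematic diagram of the filter conditions during sentence recognition by the sentence recognition module provided by an embodiment of this application;
FIG. 6 is a schematic diagram of the idle placeholder action during sentence recognition by the sentence recognition module provided by an embodiment of this application;
FIG. 7 is a schematic structural diagram of an apparatus for minor-text recognition provided by an embodiment of this application;
FIG. 8 is a schematic structural diagram of a minor-text recognition device provided by an embodiment of this application.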
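Detailed Description
The technical solutions in the embodiments of this application are described clearly and completely below with reference to the drawings in the embodiments. The described embodiments are evidently only some, not all, of the embodiments of this application. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of this application without creative effort fall within the scope of protection of this application.
The solution of this application can be implemented on a terminal with data processing capability, such as a computer, a server, or a cloud platform.
Next, with reference to FIG. 1, the minor-text recognition method of this application may include the following steps:
Step S110: obtain a text to be recognized that contains several sentences.
Specifically, the text to be recognized may be text from a live online chat or text from a user's offline history.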
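Step S120: for each sentence in the text to be recognized, recognize the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in a pre-established sentence recognition module, to obtain an intermediate marked sentence carrying one keyword mark; analyze the keyword mark carried by the intermediate marked sentence and apply a minor-discrimination mark to the intermediate marked sentence, to obtain a target sentence carrying the minor-discrimination mark.
Specifically, the sentence recognition module may contain a plurality of recognition combinations of different priorities, and each recognition combination may include a sentence matching algorithm and a keyword stamp.
A recognition combination with a higher priority means that matching the text to be recognized with that combination's sentence matching algorithm takes precedence, so that the result recognized by that combination is more representative.
The sentence matching algorithm may be any algorithm for recognizing and analyzing sentences, such as a full-text matching algorithm, a clause matching algorithm, an original-text matching algorithm, or an intelligent semantic matching algorithm. The keyword stamp may be a label that identifies minor-related sentences: when the sentence matching algorithm matches the text to be recognized, the matched text, or part of it, can be given a keyword mark such as a blacklist mark, a whitelist mark, a high-suspicion mark, or a model-inference mark.
For example, FIG. 2 shows six recognition combinations of different priorities, each including a sentence matching algorithm and a keyword stamp; together these six combinations form the sentence recognition module. Each sentence can thus pass in turn through priority 1 (matching the sentence by full-text matching, with blacklist field samples as the matching targets), priority 2, and so on down to priority 6 (matching the sentence by original-text matching, with model-inference field samples as the matching targets), yielding an intermediate marked sentence carrying one keyword mark.
For example, FIG. 3 shows the intermediate marked sentence, obtained after marking by the six recognition combinations, entering comprehensive analysis. Comprehensive analysis of the intermediate marked sentence determines the minor-discrimination mark that corresponds to the mark applied by the keyword stamp, and thereby whether the sentence was written by a minor.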
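The priority-ordered pass through the recognition combinations can be sketched roughly as follows. This is a minimal illustration, not the applicant's implementation: the class name `RecognitionCombination`, the two toy matchers, the sample fields, and the mark strings are all assumptions introduced here, and only two combinations are shown instead of the six in FIG. 2.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RecognitionCombination:
    """One recognition combination (all field names are hypothetical)."""
    priority: int                          # 1 is the highest priority
    matcher: Callable[[str, str], bool]    # stand-in for a sentence matching algorithm
    samples: List[str]                     # sample fields the keyword stamp looks for
    mark: str                              # keyword mark applied on a hit


def full_match(sentence: str, sample: str) -> bool:
    # Toy "full-text" matcher: the whole sentence must equal the sample.
    return sentence == sample


def substring_match(sentence: str, sample: str) -> bool:
    # Toy "original-text" matcher: the sample must occur in the sentence.
    return sample in sentence


def recognize(sentence: str, module: List[RecognitionCombination]) -> List[Tuple[int, str, str]]:
    """Run the combinations from highest to lowest priority and collect hits."""
    hits = []
    for combo in sorted(module, key=lambda c: c.priority):
        for sample in combo.samples:
            if combo.matcher(sentence, sample):
                hits.append((combo.priority, sample, combo.mark))
    return hits


# A two-combination module, deliberately listed out of order.
MODULE = [
    RecognitionCombination(2, substring_match, ["15岁"], "high_suspicion"),
    RecognitionCombination(1, full_match, ["我15岁"], "blacklist"),
]
```

Calling `recognize("我15岁", MODULE)` visits the priority-1 combination first and then the priority-2 combination, so a sentence can collect more than one hit before a single keyword mark is chosen.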
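Step S130: count the target sentences in the text to be recognized whose minor-discrimination mark is the minor mark, take the result as a first count, and calculate a first score from the first count.
Specifically, the first score may be the first count multiplied by the proportional coefficient corresponding to the minor mark; this coefficient may be a preset value, for example 1.
Step S140: count the target sentences in the text to be recognized whose minor-discrimination mark is the highly-suspected-minor mark, take the result as a second count, and calculate a second score from the second count.
Specifically, the second score may be the second count multiplied by the proportional coefficient corresponding to the highly-suspected-minor mark; this coefficient may be a preset value and may be smaller than the coefficient corresponding to the minor mark, for example 0.5.
Step S150: if the sum of the first score and the second score is greater than a preset score threshold, determine that the text to be recognized is text written by a minor.
Specifically, the preset score threshold may represent the minimum level at which a text containing sentences with highly-suspected-minor marks and minor marks is judged to be a minor's text; the preset score threshold can be user-defined.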
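Steps S130 to S150 can be summarized in one decision rule. The symbols below are notation introduced here for illustration and do not appear in the source: $N_1$ and $N_2$ are the first and second counts, $k_1$ and $k_2$ are the coefficients for the minor mark and the highly-suspected-minor mark, and $T$ is the preset score threshold.

```latex
S_1 = k_1 N_1, \qquad S_2 = k_2 N_2, \qquad
\text{the text is judged to be a minor's text} \iff S_1 + S_2 > T
```

With the example coefficient values given above, $k_1 = 1$ and $k_2 = 0.5$, a sentence carrying the highly-suspected-minor mark contributes half as much to the total as a sentence carrying the minor mark.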
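In addition, if the first score exceeds a first preset sub-score threshold, the text to be recognized may be determined to be text written by a minor. If the second score exceeds a third preset sub-score threshold, the text to be recognized may be determined to be text written by a minor.
Here, the third preset sub-score threshold may be lower than the preset score threshold, and the second preset sub-score threshold may be lower than the third preset sub-score threshold.
It will be appreciated that a text to be recognized contains multiple sentences. In a text with many sentences, a few isolated sentences recognized and marked with the minor mark or the highly-suspected-minor mark are not enough to show that the text was written by a minor; they may be accidental. A preset score threshold is therefore needed: as the number of sentences in the text grows, the number of sentences allowed to carry highly-suspected-minor marks and minor marks also grows, so the sum of the first score and the second score rises accordingly.
For example, suppose the preset score threshold is defined as 30, the coefficient for the minor mark is 1, and the coefficient for the highly-suspected-minor mark is 0.5. If, in a text to be recognized containing 100 sentences, 20 sentences carry the minor mark and 30 sentences carry the highly-suspected-minor mark, then the first score is 20*1=20 and the second score is 30*0.5=15, so their sum is 20+15=35, which is greater than the preset score threshold. The text can therefore be determined to be text written by a minor.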
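The worked example above can be reproduced with a short sketch. The function name `classify_text` and the mark strings are assumptions made for illustration; the default coefficients and threshold are the example values from the text, not fixed parts of the method.

```python
from typing import Iterable, Tuple


def classify_text(
    marks: Iterable[str],
    minor_coeff: float = 1.0,      # example coefficient for the minor mark
    suspect_coeff: float = 0.5,    # example coefficient for the highly-suspected-minor mark
    threshold: float = 30.0,       # example preset score threshold
) -> Tuple[float, float, bool]:
    """Return (first score, second score, whether the text is judged a minor's text)."""
    marks = list(marks)
    first_count = sum(1 for m in marks if m == "minor")
    second_count = sum(1 for m in marks if m == "high_suspicion_minor")
    first_score = first_count * minor_coeff
    second_score = second_count * suspect_coeff
    # The sum must be strictly greater than the threshold.
    return first_score, second_score, first_score + second_score > threshold


# 100 sentences: 20 with the minor mark, 30 highly suspected, 50 unmarked.
example_marks = ["minor"] * 20 + ["high_suspicion_minor"] * 30 + ["none"] * 50
```

Because the comparison is strict, a text whose two scores add up to exactly the threshold is not classified as a minor's text in this sketch.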
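The minor-text recognition method provided by this embodiment obtains a text to be recognized that contains several sentences; for each sentence in the text, recognizes the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in a pre-established sentence recognition module, to obtain an intermediate marked sentence carrying one keyword mark; analyzes the keyword mark carried by the intermediate marked sentence and applies a minor-discrimination mark to it, to obtain a target sentence carrying the minor-discrimination mark, where the sentence recognition module comprises a plurality of recognition combinations of different priorities and each recognition combination comprises a sentence matching algorithm and a keyword stamp; counts the target sentences whose minor-discrimination mark is the minor mark, takes the result as a first count, and calculates a first score from it; counts the target sentences whose minor-discrimination mark is the highly-suspected-minor mark, takes the result as a second count, and calculates a second score from it; and, if the sum of the first score and the second score is greater than a preset score threshold, determines that the text to be recognized is text written by a minor. It can thus be seen that, by analyzing the content of the text layer by layer, assigning a mark to each keyword related to minor status, and comprehensively assessing and labeling how strongly each sentence suggests a minor, the method effectively and reliably determines whether the text to be recognized was written by a minor.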
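In some embodiments of this application, the process mentioned in the above embodiment, namely, for each sentence in the text to be recognized, recognizing the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the pre-established sentence recognition module, to obtain an intermediate marked sentence carrying one keyword mark, is described. The process may include:
S1: for each sentence in the text to be recognized, match the keywords in the sentence, in descending order of the priorities of the recognition combinations in the pre-established sentence recognition module, through the sentence matching algorithm of each recognition combination in turn, the sentence matching algorithm being accelerated by a multi-pattern AC algorithm.
It will be appreciated that the recognition combinations in the sentence recognition module execute serially, which makes the overall time complexity relatively high. For this reason, sentence matching algorithms such as the full-text matching algorithm, the clause matching algorithm, the original-text matching algorithm, and the intelligent semantic matching algorithm can all use the multi-pattern AC algorithm to match keywords quickly and increase matching speed.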
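The source does not expand "AC"; assuming it refers to the Aho-Corasick multi-pattern string matching algorithm, the keyword lookup could look like the sketch below. The function names and the keyword list are hypothetical. The automaton is built once from all sample fields and then finds every keyword occurrence in a single left-to-right scan of the sentence.

```python
from collections import deque
from typing import Dict, List, Tuple

Automaton = Tuple[List[Dict[str, int]], List[int], List[List[str]]]


def build_automaton(keywords: List[str]) -> Automaton:
    """Build trie transitions, failure links, and output lists."""
    goto: List[Dict[str, int]] = [{}]
    fail: List[int] = [0]
    out: List[List[str]] = [[]]
    for word in keywords:
        state = 0
        for ch in word:
            if ch not in goto[state]:
                goto.append({})
                fail.append(0)
                out.append([])
                goto[state][ch] = len(goto) - 1
            state = goto[state][ch]
        out[state].append(word)
    # Breadth-first traversal: a node's failure link points to the longest
    # proper suffix of its path that is also a path from the root.
    queue = deque(goto[0].values())
    while queue:
        state = queue.popleft()
        for ch, nxt in goto[state].items():
            queue.append(nxt)
            f = fail[state]
            while f and ch not in goto[f]:
                f = fail[f]
            fail[nxt] = goto[f].get(ch, 0)
            out[nxt] = out[nxt] + out[fail[nxt]]
    return goto, fail, out


def find_keywords(text: str, automaton: Automaton) -> List[Tuple[int, str]]:
    """Return (start index, keyword) for every occurrence, in scan order."""
    goto, fail, out = automaton
    state = 0
    hits = []
    for i, ch in enumerate(text):
        while state and ch not in goto[state]:
            state = fail[state]
        state = goto[state].get(ch, 0)
        for word in out[state]:
            hits.append((i - len(word) + 1, word))
    return hits
```

The design point is that the scan cost depends on the sentence length and the number of matches rather than on the number of sample fields, which is why it suits a module that runs several keyword lists one after another.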
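S2: in each sentence of the text to be recognized, apply a keyword mark to each keyword hit by the keyword stamp of a recognition combination.
Specifically, a keyword stamp hits when a sentence in the text to be recognized contains a field that matches a sample field corresponding to the keyword stamp.
S3: determine, after the keyword marks have been applied by the recognition combinations, the intermediate marked sentence carrying one keyword mark.
Specifically, after recognition and marking by the recognition combinations, each sentence may carry zero, one, or several keyword marks; the intermediate marked sentence ultimately required is a sentence carrying exactly one keyword mark.
If a sentence carries no keyword mark, it can be judged not to be a minor's sentence and need not be treated as an intermediate marked sentence. If a sentence carries one keyword mark, it can be treated as an intermediate marked sentence. If a sentence carries several keyword marks, the keyword mark corresponding to the keyword hit by the keyword stamp of the higher-priority recognition combination can be selected as the sentence's only keyword mark, and the sentence is then treated as an intermediate marked sentence.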
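The three cases above (zero, one, or several keyword marks) reduce to one small selection rule. The sketch below assumes each hit is recorded as a `(priority, mark)` pair with 1 as the highest priority; that representation and the function name are illustrative choices, not taken from the source.

```python
from typing import List, Optional, Tuple


def select_keyword_mark(hits: List[Tuple[int, str]]) -> Optional[str]:
    """Pick the single keyword mark for a sentence.

    Returns None when the sentence has no hit, meaning it is not treated
    as an intermediate marked sentence.
    """
    if not hits:
        return None
    # The mark from the highest-priority combination (smallest number) wins.
    return min(hits, key=lambda hit: hit[0])[1]
```

For instance, a sentence hit by both a priority-4 model-inference stamp and a priority-1 blacklist stamp ends up carrying only the blacklist mark.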
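The minor-text recognition method provided by this embodiment accelerates the sentence matching algorithms with the multi-pattern AC algorithm, which effectively reduces the time complexity of the minor-text recognition process.
To optimize the time complexity of the sentence matching algorithms during text recognition, saving time and speeding up recognition, each sentence can go directly to comprehensive analysis once it has obtained a keyword mark. In some embodiments of this application, the provided minor-text recognition method may further include:
for each sentence in the text to be recognized, while the sentence is being recognized through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the sentence recognition module, when a keyword in the sentence triggers the early-termination action of the current recognition combination, skipping the recognition combinations whose priority is lower than that of the current recognition combination, and determining the intermediate marked sentence, carrying one keyword mark, whose recognition ended early.
Specifically, several recognition combinations in the sentence recognition module may further contain an early-termination action; when a sentence meets the condition of the early-termination action, it can go directly to comprehensive analysis.
The condition for a sentence to trigger the early-termination action may be that the sentence is hit by the keyword stamp of the recognition combination that contains the early-termination action.
For example, in FIG. 4, the 1st, 2nd, and 4th recognition combinations from the top (in priority order) each contain an early-termination action; that is, when a sentence hits the blacklist mark, the early-termination action is triggered and the sentence can go directly to comprehensive analysis.
The minor-text recognition method provided by this embodiment, by adding the specific action of early termination, allows each sentence to go directly to comprehensive analysis once it obtains a blacklist mark, which optimizes the time complexity of the sentence matching algorithms during text recognition.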
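A rough sketch of the early-termination action follows. Each combination is modeled as a plain dictionary with hypothetical keys (`priority`, `samples`, `mark`, `early_exit`), and substring containment stands in for whatever sentence matching algorithm a real combination would use.

```python
from typing import Dict, List, Optional, Tuple


def recognize_with_early_exit(
    sentence: str, module: List[Dict]
) -> Tuple[Optional[str], List[int]]:
    """Return the chosen keyword mark and the priorities that were actually run."""
    mark = None
    visited = []
    for combo in sorted(module, key=lambda c: c["priority"]):
        visited.append(combo["priority"])
        hit = any(sample in sentence for sample in combo["samples"])
        if not hit:
            continue
        if mark is None:
            mark = combo["mark"]   # the highest-priority hit provides the mark
        if combo["early_exit"]:
            break                  # skip every lower-priority combination
    return mark, visited


EXAMPLE_MODULE = [
    {"priority": 1, "samples": ["我15岁"], "mark": "blacklist", "early_exit": True},
    {"priority": 2, "samples": ["15岁"], "mark": "high_suspicion", "early_exit": False},
    {"priority": 3, "samples": ["岁"], "mark": "model_inference", "early_exit": False},
]
```

With this module, a sentence that hits the priority-1 blacklist stamp leaves after a single combination, while a sentence that only hits the priority-2 stamp still runs through all three.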
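To optimize the matching accuracy of the sentence matching algorithms and identify minors' text more precisely, in some embodiments of this application some recognition combinations of the sentence recognition module may further include one or more filter conditions, which may include a non-first-person filter, a question filter, a past-time filter, a distance filter, and so on. On this basis, the process mentioned in the above embodiment, namely, for each sentence in the text to be recognized, recognizing the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the pre-established sentence recognition module, to obtain an intermediate marked sentence carrying one keyword mark, is described. The process may include:
for each sentence in the text to be recognized, recognizing the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the pre-established sentence recognition module, to obtain a condition-filtered intermediate marked sentence carrying one keyword mark.
For example, in FIG. 5, the 5th and 6th recognition combinations each contain the non-first-person filter, the question filter, and the past-time filter; that is, these three filter conditions are added to optimize the matching algorithms when performing the semantic matching of the 5th recognition combination and the original-text matching of the 6th recognition combination.
It will be appreciated that filtering on the logic of a sentence, rather than letting the sentence matching algorithm match mechanically, can effectively improve matching accuracy.
For example, when semantically matching "我弟今天上课" ("my younger brother has class today"), the non-first-person filter can be used: although "上课" ("has class") matches a sample field of the blacklist mark, the sentence is not given a blacklist mark because of the non-first-person filter. For "我们放暑假多了好多作业" ("we got a lot more homework over the summer holiday"), the distance filter matches "我们-假-作业" ("we - holiday - homework") as a sample field of the blacklist mark. For "我们记得那时候挺小的,当时放假有好多作业" ("we remember being quite young back then; we had a lot of homework over the holidays at the time"), although "我们-假-作业" matches a sample field of the blacklist mark, the sentence is not given a blacklist mark because of the past-time filter.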
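The filter conditions could be approximated as below. This is only a keyword-list sketch under strong assumptions: the marker tuples, the `max_gap` value, and the function names are invented for illustration, and the source does not say how its filters are actually implemented (its 5th combination uses semantic matching, which a word list does not capture).

```python
import re
from typing import List

# Hypothetical marker lists; a real system would need far richer rules.
NON_FIRST_PERSON_MARKERS = ("我弟", "我妹", "我哥", "我姐", "我儿子", "我女儿")
QUESTION_MARKERS = ("吗", "?", "?")
PAST_TIME_MARKERS = ("那时候", "当时", "以前")


def passes_filters(sentence: str) -> bool:
    """Return False if any filter says the sentence should not be stamped."""
    if any(m in sentence for m in NON_FIRST_PERSON_MARKERS):
        return False   # non-first-person filter
    if any(m in sentence for m in QUESTION_MARKERS):
        return False   # question filter
    if any(m in sentence for m in PAST_TIME_MARKERS):
        return False   # past-time filter
    return True


def distance_match(sentence: str, parts: List[str], max_gap: int = 15) -> bool:
    """Match the parts in order with at most max_gap characters between them."""
    gap = ".{0,%d}" % max_gap
    pattern = gap.join(re.escape(p) for p in parts)
    return re.search(pattern, sentence) is not None


def stamp_allowed(sentence: str, parts: List[str]) -> bool:
    """A blacklist stamp applies only if the pattern matches and no filter objects."""
    return distance_match(sentence, parts) and passes_filters(sentence)
```

Applied to the three example sentences, the first is rejected by the non-first-person filter, the second matches "我们-假-作业" and passes, and the third matches the same pattern but is rejected by the past-time filter.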
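The minor-text recognition method provided by this embodiment, by adding filter conditions, can optimize the matching accuracy of the sentence matching algorithms and identify minors' text more precisely.
Some special fields in a sentence are not meant as a minor's statement yet involve information that would call for a blacklist mark; they need timely special handling so that the lower-priority recognition combinations match reasonably. Each of some recognition combinations in the sentence recognition module may therefore further include an idle placeholder action, which reduces misrecognition by other lower-priority recognition combinations. On this basis, in some embodiments of this application, the minor-text recognition method may further include:
for each sentence in the text to be recognized, while the sentence is being recognized through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the sentence recognition module, when a target keyword in the sentence triggers the idle placeholder action of the current recognition combination, generating a temporary mask layer that covers the target keyword, so that recognition combinations whose priority is lower than that of the current recognition combination skip the target keyword when recognizing the sentence, and removing the temporary mask layer covering the target keyword after all recognition combinations in the sentence recognition module have finished recognizing the sentence.
As shown in FIG. 6, the 3rd recognition combination contains an idle placeholder action, which can be triggered when the whitelist mark is hit. Take the sentence "13岁的时候" ("when I was 13"): "13岁" ("13 years old") is a minor blacklist field, but the sentence does not carry the meaning of being a minor. Therefore, when "13岁的时候" triggers the whitelist mark in the 3rd recognition combination, "13岁的时候" is determined to be the target keyword and a temporary mask layer is generated to cover this field, so that the 4th, 5th, and 6th recognition combinations skip it when they execute; the temporary mask layer is removed after the sentence recognition module has finished recognizing the sentence containing "13岁的时候".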
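One way to picture the temporary mask layer is as a list of character spans that lower-priority combinations cannot see. In the sketch below the dictionary keys, the filler character, and the function names are assumptions; the original sentence string is never modified, so "removing the mask" simply means discarding the span list when the function returns.

```python
from typing import Dict, List, Tuple


def apply_mask(sentence: str, spans: List[Tuple[int, int]], filler: str = "\x00") -> str:
    """Return a view of the sentence with the masked spans blanked out."""
    chars = list(sentence)
    for start, end in spans:
        for i in range(start, end):
            chars[i] = filler
    return "".join(chars)


def recognize_with_mask(sentence: str, module: List[Dict]) -> List[Tuple[int, str]]:
    """Collect (priority, mark) hits, hiding placeholder hits from later combinations."""
    masked_spans: List[Tuple[int, int]] = []
    marks = []
    for combo in sorted(module, key=lambda c: c["priority"]):
        # Each combination sees the sentence through the current mask layer.
        view = apply_mask(sentence, masked_spans)
        for sample in combo["samples"]:
            pos = view.find(sample)
            if pos < 0:
                continue
            marks.append((combo["priority"], combo["mark"]))
            if combo.get("placeholder"):
                masked_spans.append((pos, pos + len(sample)))
    return marks


MASK_MODULE = [
    {"priority": 3, "samples": ["13岁的时候"], "mark": "whitelist", "placeholder": True},
    {"priority": 4, "samples": ["13岁"], "mark": "blacklist", "placeholder": False},
]
```

In the "13岁的时候" example, the priority-3 whitelist hit masks the field, so the priority-4 blacklist sample "13岁" no longer matches; in a sentence without the whitelist phrase the blacklist stamp still fires.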
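The minor-text recognition method provided by this embodiment, by adding an idle placeholder action to recognition combinations, ensures that the lower-priority recognition combinations can match or recognize the sentence reasonably.
A sentence may contain many modal particles that push its substantive content far apart, so that the sentence matching algorithm cannot accurately identify what the sentence actually means. On this basis, each of several recognition combinations in the sentence recognition module may further include a re-analysis action. In some embodiments of this application, the provided minor-text recognition method may further include:
for each sentence in the text to be recognized, when the sentence is analyzed through a recognition combination that contains a re-analysis action, detecting and temporarily removing the stop words in the sentence, and restoring the temporarily removed stop words after the recognition combination that contains the re-analysis action has finished recognizing the sentence.
Take the sentence "我真的也就这么个15岁吧" (roughly, "I'm really only, like, 15 or so"). A reader can tell that it comes from a minor, but original-text matching can only hit "15岁" ("15 years old"); because the subject is missing, the final output is a high-suspicion mark. With the re-analysis action, the stop words and modal particles can be temporarily removed to leave "我15岁" ("I am 15"), which directly hits the blacklist mark.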
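The re-analysis action can be illustrated with a tiny stop-word stripper. The stop-word tuple below is a hypothetical list chosen only to cover the example sentence, and the function names are likewise illustrative; the original sentence is left untouched, which corresponds to restoring the stop words afterwards.

```python
from typing import Iterable, Optional

# Hypothetical stop words / modal particles, just enough for the example.
STOP_WORDS = ("真的", "也就", "这么个", "吧")


def strip_stop_words(sentence: str, stop_words: Iterable[str] = STOP_WORDS) -> str:
    """Return a copy of the sentence with stop words removed, longest first."""
    for word in sorted(stop_words, key=len, reverse=True):
        sentence = sentence.replace(word, "")
    return sentence


def reanalyze(sentence: str, blacklist_samples: Iterable[str]) -> Optional[str]:
    """Match blacklist samples against the temporarily reduced sentence."""
    reduced = strip_stop_words(sentence)
    if any(sample in reduced for sample in blacklist_samples):
        return "blacklist"
    return None
```

Here the sample "我15岁" does not occur in the raw sentence, but it does occur once the filler words are stripped, which is the upgrade from a high-suspicion mark to a blacklist mark described above.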
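The minor-text recognition method provided by this embodiment, by adding a re-analysis action to recognition combinations, can turn some sentences with a high-suspicion mark into sentences with a blacklist mark, making minor-text recognition stricter.
In some embodiments of this application, the keyword mark applied to a sentence may be a model-inference mark, a blacklist mark, or a high-suspicion mark. On this basis, the process mentioned in the above embodiment, namely analyzing the keyword mark carried by the intermediate marked sentence and applying a minor-discrimination mark to the intermediate marked sentence, to obtain a target sentence carrying the minor-discrimination mark, is described. Depending on the keyword mark, the process falls into the following cases:
Case 1: when the keyword mark is a blacklist mark, the following steps may be included:
S11: according to the blacklist mark carried by the intermediate marked sentence, determine that the minor-discrimination mark to be applied to the intermediate marked sentence is the minor mark.
S12: apply the minor mark to the intermediate marked sentence, to obtain a target sentence carrying the minor mark.
Case 2: when the keyword mark is a high-suspicion mark, the following steps may be included:
S21: according to the high-suspicion mark carried by the intermediate marked sentence, determine that the minor-discrimination mark to be applied to the intermediate marked sentence is the highly-suspected-minor mark.
S22: apply the highly-suspected-minor mark to the intermediate marked sentence, to obtain a target sentence carrying the highly-suspected-minor mark.
Case 3: when the keyword mark is a model-inference mark, the following steps may be included:
S31: according to the model-inference mark carried by the intermediate marked sentence, input the intermediate marked sentence into an existing minor prediction model, which outputs the minor-discrimination mark of the intermediate marked sentence.
Specifically, the minor prediction model receives the input intermediate marked sentence, and the minor-discrimination mark it outputs may be a blacklist mark, a high-suspicion mark, or a pass mark.
The pass mark may indicate that the intermediate marked sentence has no minor-related content.
S32: apply the output minor-discrimination mark to the intermediate marked sentence, to obtain a target sentence carrying the output minor-discrimination mark.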
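The three cases can be sketched as a single dispatch function. The mark strings, the `model` callable, and the treatment of any other keyword mark (for example a whitelist mark) as "pass" are assumptions made for illustration; the source does not specify how such marks are handled, and the stub below is not a real prediction model.

```python
from typing import Callable, Optional


def discriminate(
    sentence: str,
    keyword_mark: str,
    model: Optional[Callable[[str], str]] = None,
) -> str:
    """Map a keyword mark to a minor-discrimination mark."""
    if keyword_mark == "blacklist":
        return "minor"                      # case 1
    if keyword_mark == "high_suspicion":
        return "high_suspicion_minor"       # case 2
    if keyword_mark == "model_inference":   # case 3: defer to the prediction model
        if model is None:
            raise ValueError("a prediction model is required for model-inference marks")
        return model(sentence)
    # Assumption: any other keyword mark yields a pass result.
    return "pass"


def stub_model(sentence: str) -> str:
    """Placeholder model: flags sentences mentioning homework as highly suspected."""
    return "high_suspicion_minor" if "作业" in sentence else "pass"
```

Only the model-inference case needs the sentence text itself; the other two cases depend on the keyword mark alone.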
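Further, once the target sentence carrying a minor-discrimination mark has been obtained for a sentence, the target sentence can be displayed. When applied to online chat, this allows each utterance to be analyzed promptly for minor-related content; when applied to offline mining to detect minors' accounts, it makes it easier to monitor the progress of the text recognition work.
The apparatus for minor-text recognition provided by the embodiments of this application is described below. The apparatus described below and the minor-text recognition method described above may be referred to in correspondence with each other.
Referring to FIG. 7, FIG. 7 is a schematic structural diagram of an apparatus for minor-text recognition disclosed in an embodiment of this application.
As shown in FIG. 7, the apparatus may include:
a text acquisition unit, configured to obtain a text to be recognized that contains several sentences;
a target sentence marking unit, configured to, for each sentence in the text to be recognized, recognize the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in a pre-established sentence recognition module, to obtain an intermediate marked sentence carrying one keyword mark, and to analyze the keyword mark carried by the intermediate marked sentence and apply a minor-discrimination mark to the intermediate marked sentence, to obtain a target sentence carrying the minor-discrimination mark, wherein the sentence recognition module comprises a plurality of recognition combinations of different priorities, and each recognition combination comprises a sentence matching algorithm and a keyword stamp;
a first score statistics unit, configured to count the target sentences in the text to be recognized whose minor-discrimination mark is the minor mark, take the result as a first count, and calculate a first score from the first count;
a second score statistics unit, configured to count the target sentences in the text to be recognized whose minor-discrimination mark is the highly-suspected-minor mark, take the result as a second count, and calculate a second score from the second count;
a minor-text confirmation unit, configured to determine that the text to be recognized is text written by a minor if the sum of the first score and the second score is greater than a preset score threshold.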
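The apparatus for minor-text recognition provided by the embodiments of this application can be applied to a minor-text recognition device, such as a terminal: a mobile phone, a computer, and so on. Optionally, FIG. 8 shows a block diagram of the hardware structure of the minor-text recognition device. Referring to FIG. 8, the hardware structure of the minor-text recognition device may include: at least one processor 1, at least one communication interface 2, at least one memory 3, and at least one communication bus 4;
In the embodiments of this application, there is at least one each of the processor 1, the communication interface 2, the memory 3, and the communication bus 4, and the processor 1, the communication interface 2, and the memory 3 communicate with one another through the communication bus 4;
The processor 1 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement the embodiments of this application;
The memory 3 may include high-speed RAM and may also include non-volatile memory, for example at least one disk memory;
The memory stores a program, and the processor can call the program stored in the memory. The program is configured to:
obtain a text to be recognized that contains several sentences;
for each sentence in the text to be recognized, recognize the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in a pre-established sentence recognition module, to obtain an intermediate marked sentence carrying one keyword mark; analyze the keyword mark carried by the intermediate marked sentence and apply a minor-discrimination mark to the intermediate marked sentence, to obtain a target sentence carrying the minor-discrimination mark, wherein the sentence recognition module comprises a plurality of recognition combinations of different priorities, and each recognition combination comprises a sentence matching algorithm and a keyword stamp;
count the target sentences in the text to be recognized whose minor-discrimination mark is the minor mark, take the result as a first count, and calculate a first score from the first count;
count the target sentences in the text to be recognized whose minor-discrimination mark is the highly-suspected-minor mark, take the result as a second count, and calculate a second score from the second count;
if the sum of the first score and the second score is greater than a preset score threshold, determine that the text to be recognized is text written by a minor.
Optionally, for the refined and extended functions of the program, refer to the description above.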
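An embodiment of this application further provides a storage medium, which may store a program suitable for execution by a processor, the program being configured to:
obtain a text to be recognized that contains several sentences;
for each sentence in the text to be recognized, recognize the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in a pre-established sentence recognition module, to obtain an intermediate marked sentence carrying one keyword mark; analyze the keyword mark carried by the intermediate marked sentence and apply a minor-discrimination mark to the intermediate marked sentence, to obtain a target sentence carrying the minor-discrimination mark, wherein the sentence recognition module comprises a plurality of recognition combinations of different priorities, and each recognition combination comprises a sentence matching algorithm and a keyword stamp;
count the target sentences in the text to be recognized whose minor-discrimination mark is the minor mark, take the result as a first count, and calculate a first score from the first count;
count the target sentences in the text to be recognized whose minor-discrimination mark is the highly-suspected-minor mark, take the result as a second count, and calculate a second score from the second count;
if the sum of the first score and the second score is greater than a preset score threshold, determine that the text to be recognized is text written by a minor.
Optionally, for the refined and extended functions of the program, refer to the description above.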
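Finally, it should also be noted that, in this document, relational terms such as first and second are used only to distinguish one entity or operation from another, and do not necessarily require or imply any such actual relationship or order between those entities or operations. Moreover, the terms "comprise", "include", or any other variant thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or device that includes the element.
The embodiments in this specification are described progressively; each embodiment focuses on its differences from the others, the embodiments may be combined as needed, and identical or similar parts may be cross-referenced.
The above description of the disclosed embodiments enables those skilled in the art to implement or use this application. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the general principles defined herein may be implemented in other embodiments without departing from the spirit or scope of this application. Therefore, this application is not to be limited to the embodiments shown herein, but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.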

Claims (11)

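  1. A method for identifying text written by minors, characterized by comprising:
    obtaining a text to be recognized that contains several sentences;
    for each sentence in the text to be recognized, recognizing the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in a pre-established sentence recognition module, to obtain an intermediate marked sentence carrying one keyword mark; analyzing the keyword mark carried by the intermediate marked sentence and applying a minor-discrimination mark to the intermediate marked sentence, to obtain a target sentence carrying the minor-discrimination mark, wherein the sentence recognition module comprises a plurality of recognition combinations of different priorities, and each recognition combination comprises a sentence matching algorithm and a keyword stamp;
    counting the target sentences in the text to be recognized whose minor-discrimination mark is the minor mark, taking the result as a first count, and calculating a first score from the first count;
    counting the target sentences in the text to be recognized whose minor-discrimination mark is the highly-suspected-minor mark, taking the result as a second count, and calculating a second score from the second count;
    if the sum of the first score and the second score is greater than a preset score threshold, determining that the text to be recognized is text written by a minor.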
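  2. The method according to claim 1, characterized in that, for each sentence in the text to be recognized, recognizing the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the pre-established sentence recognition module, to obtain an intermediate marked sentence carrying one keyword mark, comprises:
    for each sentence in the text to be recognized, matching the keywords in the sentence, in descending order of the priorities of the recognition combinations in the pre-established sentence recognition module, through the sentence matching algorithm of each recognition combination in turn, the sentence matching algorithm being accelerated by a multi-pattern AC algorithm;
    in each sentence of the text to be recognized, applying a keyword mark to each keyword hit by the keyword stamp of a recognition combination;
    determining, after the keyword marks have been applied by the recognition combinations, the intermediate marked sentence carrying one keyword mark.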
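  3. The method according to claim 2, characterized in that each of several recognition combinations in the sentence recognition module further comprises an early-termination action;
    the method further comprises:
    for each sentence in the text to be recognized, while the sentence is being recognized through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the sentence recognition module, when a keyword in the sentence triggers the early-termination action of the current recognition combination, skipping the recognition combinations whose priority is lower than that of the current recognition combination, and determining the intermediate marked sentence, carrying one keyword mark, whose recognition ended early.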
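  4. The method according to claim 3, characterized in that each of several recognition combinations in the sentence recognition module further comprises one or more filter conditions;
    for each sentence in the text to be recognized, recognizing the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the pre-established sentence recognition module, to obtain an intermediate marked sentence carrying one keyword mark, comprises:
    for each sentence in the text to be recognized, recognizing the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the pre-established sentence recognition module, to obtain a condition-filtered intermediate marked sentence carrying one keyword mark.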
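  5. The method according to any one of claims 1-4, characterized in that each of several recognition combinations in the sentence recognition module further comprises an idle placeholder action;
    the method further comprises:
    for each sentence in the text to be recognized, while the sentence is being recognized through each recognition combination in turn, in descending order of the priorities of the recognition combinations in the sentence recognition module, when a target keyword in the sentence triggers the idle placeholder action of the current recognition combination, generating a temporary mask layer that covers the target keyword, so that recognition combinations whose priority is lower than that of the current recognition combination skip the target keyword when recognizing the sentence, and removing the temporary mask layer covering the target keyword after all recognition combinations in the sentence recognition module have finished recognizing the sentence.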
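  6. The method according to any one of claims 1-4, characterized in that each of several recognition combinations in the sentence recognition module further comprises a re-analysis action;
    the method further comprises:
    for each sentence in the text to be recognized, when the sentence is analyzed through a recognition combination that contains a re-analysis action, detecting and temporarily removing the stop words in the sentence, and restoring the temporarily removed stop words after the recognition combination that contains the re-analysis action has finished recognizing the sentence.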
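  7. The method according to any one of claims 1-4, characterized in that the keyword mark is a model-inference mark;
    analyzing the keyword mark carried by the intermediate marked sentence and applying a minor-discrimination mark to the intermediate marked sentence, to obtain a target sentence carrying the minor-discrimination mark, comprises:
    according to the model-inference mark carried by the intermediate marked sentence, inputting the intermediate marked sentence into an existing minor prediction model, which outputs the minor-discrimination mark of the intermediate marked sentence;
    applying the output minor-discrimination mark to the intermediate marked sentence, to obtain a target sentence carrying the output minor-discrimination mark.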
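  8. The method according to any one of claims 1-4, characterized in that the keyword mark is a blacklist mark;
    analyzing the keyword mark carried by the intermediate marked sentence and applying a minor-discrimination mark to the intermediate marked sentence, to obtain a target sentence carrying the minor-discrimination mark, comprises:
    according to the blacklist mark carried by the intermediate marked sentence, determining that the minor-discrimination mark to be applied to the intermediate marked sentence is the minor mark;
    applying the minor mark to the intermediate marked sentence, to obtain a target sentence carrying the minor mark.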
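  9. The method according to any one of claims 1-4, characterized in that the keyword mark is a high-suspicion mark;
    analyzing the keyword mark carried by the intermediate marked sentence and applying a minor-discrimination mark to the intermediate marked sentence, to obtain a target sentence carrying the minor-discrimination mark, comprises:
    according to the high-suspicion mark carried by the intermediate marked sentence, determining that the minor-discrimination mark to be applied to the intermediate marked sentence is the highly-suspected-minor mark;
    applying the highly-suspected-minor mark to the intermediate marked sentence, to obtain a target sentence carrying the highly-suspected-minor mark.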
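  10. The method according to any one of claims 1-4, characterized in that, after analyzing the keyword mark carried by the intermediate marked sentence and applying a minor-discrimination mark to the intermediate marked sentence, to obtain a target sentence carrying the minor-discrimination mark, the method further comprises:
    displaying the target sentence carrying the minor-discrimination mark.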
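  11. An apparatus for identifying text written by minors, characterized by comprising:
    a text acquisition unit, configured to obtain a text to be recognized that contains several sentences;
    a target sentence marking unit, configured to, for each sentence in the text to be recognized, recognize the sentence through each recognition combination in turn, in descending order of the priorities of the recognition combinations in a pre-established sentence recognition module, to obtain an intermediate marked sentence carrying one keyword mark, and to analyze the keyword mark carried by the intermediate marked sentence and apply a minor-discrimination mark to the intermediate marked sentence, to obtain a target sentence carrying the minor-discrimination mark, wherein the sentence recognition module comprises a plurality of recognition combinations of different priorities, and each recognition combination comprises a sentence matching algorithm and a keyword stamp;
    a first score statistics unit, configured to count the target sentences in the text to be recognized whose minor-discrimination mark is the minor mark, take the result as a first count, and calculate a first score from the first count;
    a second score statistics unit, configured to count the target sentences in the text to be recognized whose minor-discrimination mark is the highly-suspected-minor mark, take the result as a second count, and calculate a second score from the second count;
    a minor-text confirmation unit, configured to determine that the text to be recognized is text written by a minor if the sum of the first score and the second score is greater than a preset score threshold.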
PCT/CN2023/092437 2022-09-13 2023-05-06 Method and apparatus for identifying text written by minors WO2024055603A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202211107466.2A CN115186095B (zh) 2022-09-13 2022-09-13 Method and apparatus for identifying text written by minors
CN202211107466.2 2022-09-13

Publications (1)

Publication Number Publication Date
WO2024055603A1 true WO2024055603A1 (zh) 2024-03-21

Family

ID=83524563

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2023/092437 WO2024055603A1 (zh) 2022-09-13 2023-05-06 Method and apparatus for identifying text written by minors

Country Status (2)

Country Link
CN (1) CN115186095B (zh)
WO (1) WO2024055603A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115186095B (zh) * 2022-09-13 2022-12-13 广州趣丸网络科技有限公司 Method and apparatus for identifying text written by minors

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
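CN104809236A (zh) * 2015-05-11 2015-07-29 苏州大学 Microblog-based user age classification method and system
CN106354872A (zh) * 2016-09-18 2017-01-25 广州视源电子科技股份有限公司 Text clustering method and system
CN110597988A (zh) * 2019-08-28 2019-12-20 腾讯科技(深圳)有限公司 Text classification method, apparatus, device, and storage medium
CN111651600A (zh) * 2020-06-02 2020-09-11 携程计算机技术(上海)有限公司 Sentence multi-intent recognition method, system, electronic device, and storage medium
WO2021237550A1 (zh) * 2020-05-28 2021-12-02 深圳市欢太科技有限公司 Text processing method, electronic device, and computer-readable storage medium
CN115186095A (zh) * 2022-09-13 2022-10-14 广州趣丸网络科技有限公司 Method and apparatus for identifying text written by minors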

Family Cites Families (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8539359B2 (en) * 2009-02-11 2013-09-17 Jeffrey A. Rapaport Social network driven indexing system for instantly clustering people with concurrent focus on same topic into on-topic chat rooms and/or for generating on-topic search results tailored to user preferences regarding topic
JP4742192B2 (ja) * 2009-04-28 2011-08-10 Necソフト株式会社 Age estimation device, method, and program
US9471944B2 (en) * 2013-10-25 2016-10-18 The Mitre Corporation Decoders for predicting author age, gender, location from short texts
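CN108108354B (zh) * 2017-06-18 2021-04-06 北京理工大学 Deep-learning-based microblog user gender prediction method
CN110196945B (zh) * 2019-05-27 2021-10-01 北京理工大学 Microblog user age prediction method based on LSTM and LeNet fusion
CN113850290B (zh) * 2021-08-18 2022-08-23 北京百度网讯科技有限公司 Text processing and model training method, apparatus, device, and storage medium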

Also Published As

Publication number Publication date
CN115186095B (zh) 2022-12-13
CN115186095A (zh) 2022-10-14

Similar Documents

Publication Publication Date Title
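CN110516067B (zh) Public opinion monitoring method, system, and storage medium based on topic detection
CN109657054B (zh) Summary generation method, apparatus, server, and storage medium
CN108509619B (zh) Voice interaction method and device
CN103729474B (zh) Method and system for identifying sock-puppet accounts of forum users
CN110263248A (zh) Information push method, apparatus, storage medium, and server
CN112686022A (zh) Method, apparatus, computer device, and storage medium for detecting violating corpus
US20210117619A1 (en) Cyberbullying detection method and system
CN112468659B (zh) Quality evaluation method, apparatus, device, and storage medium applied to telephone customer service
CN110287314B (zh) Long-text credibility assessment method and system based on unsupervised clustering
CN111079029B (zh) Sensitive account detection method, storage medium, and computer device
CN111522915A (zh) Chinese event extraction method, apparatus, device, and storage medium
CN108304452B (zh) Article processing method and apparatus, and storage medium
CN107305545A (zh) Method for identifying online opinion leaders based on text orientation analysis
CN115099239B (zh) Resource identification method, apparatus, device, and storage medium
WO2024055603A1 (zh) Method and apparatus for identifying text written by minors
CN111782793A (zh) Intelligent customer service processing method, system, and device
CN112016317A (zh) Artificial-intelligence-based sensitive word recognition method, apparatus, and computer device
CN112579781B (zh) Text categorization method, apparatus, electronic device, and medium
CN112699671B (zh) Language annotation method, apparatus, computer device, and storage medium
CN107688594B (zh) System and method for identifying risk events based on social information
WO2024087754A1 (zh) Multi-dimensional comprehensive text identification method
WO2021174926A1 (zh) Website harmful-information monitoring system and monitoring method thereof
CN110110079B (zh) Social network spam user detection method
US11134045B2 (en) Message sorting system, message sorting method, and program
CN113177164B (zh) Big-data-based multi-platform collaborative new media content monitoring and management system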

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 23864343

Country of ref document: EP

Kind code of ref document: A1