WO2022134779A1 - Method, apparatus and device for extracting character action related data, and storage medium - Google Patents

Info

Publication number
WO2022134779A1
WO2022134779A1 PCT/CN2021/124629 CN2021124629W WO2022134779A1 WO 2022134779 A1 WO2022134779 A1 WO 2022134779A1 CN 2021124629 W CN2021124629 W CN 2021124629W WO 2022134779 A1 WO2022134779 A1 WO 2022134779A1
Authority
WO
WIPO (PCT)
Prior art keywords
text data
analysis
preset
hanlp
actions
Prior art date
Application number
PCT/CN2021/124629
Other languages
French (fr)
Chinese (zh)
Inventor
蔡壮壮
Original Assignee
深圳壹账通智能科技有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 深圳壹账通智能科技有限公司
Publication of WO2022134779A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G06F16/355: Class or cluster creation or modification
    • G06F16/33: Querying
    • G06F16/335: Filtering based on additional data, e.g. user or group profiles
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G06F40/253: Grammatical analysis; Style critique
    • G06F40/279: Recognition of textual entities
    • G06F40/284: Lexical analysis, e.g. tokenisation or collocates
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/30: Semantic analysis

Abstract

A method, apparatus and device for extracting character action related data, and a storage medium, relating to the field of artificial intelligence. Syntactic analysis and part-of-speech tagging are performed on text data by means of the Han Language Processing (HanLP) algorithm, and data related to ongoing behaviors and actions is screened out, thereby improving the accuracy of data extraction and reducing the noise of the extracted data set. The method for extracting character action related data comprises: acquiring preset text data; classifying the preset text data so as to screen out text data containing character information and obtain initial text data (102); performing word segmentation and part-of-speech tagging on the initial text data so as to generate intermediate text data; performing dependency syntactic analysis and semantic dependency analysis on the intermediate text data so as to generate analysis text data; and filtering the analysis text data so as to generate target text data. In addition, the method further relates to blockchain technology, and the target text data can be stored in a blockchain.

Description

Method, apparatus, device and storage medium for extracting data related to character actions
This application claims priority to Chinese patent application No. 202011545182.2, filed with the China Patent Office on December 23, 2020 and entitled "Method, apparatus, device and storage medium for extracting data related to character actions", the entire contents of which are incorporated herein by reference.
Technical Field
The present application relates to the field of natural language processing, and in particular to a method, apparatus, device and storage medium for extracting data related to character actions.
Background
Natural language processing comprises two parts: natural language understanding and natural language generation. Realizing natural language communication between humans and machines means that a computer must both understand the meaning of natural language text and express given intentions and ideas in natural language text; the former is called natural language understanding and the latter natural language generation. Natural language processing is an important direction in computer science and artificial intelligence. Within it, the Chinese natural language processing HanLP algorithm is a text data extraction algorithm that covers word segmentation, part-of-speech tagging, entity recognition and the like.
In recent years, driven by big data and deep learning, natural language processing technology has developed rapidly. Current subject-predicate-object extraction algorithms for text data fall roughly into two types: methods based on deep learning and methods based on language rules. The inventor realized that deep-learning-based methods require a large amount of annotated data and perform poorly when extracting language descriptions related to character actions, while rule-based extraction methods have large errors, do not meet the needs of extracting data related to character behaviors and actions, and produce noisy extracted data.
Summary of the Invention
The present application provides a method, apparatus, device and storage medium for extracting data related to character actions. Syntactic analysis and part-of-speech tagging are performed on text data by the Chinese natural language processing HanLP algorithm, and data related to behaviors and actions that are actually taking place is screened out based on the subject-predicate-object grammatical relationship and on modal verbs, which improves the accuracy of data extraction and reduces the noise of the extracted data set.
A first aspect of the present application provides a method for extracting data related to character actions, comprising: acquiring preset text data, the preset text data being text data from novels that contains character behaviors and actions; classifying the preset text data to screen out text data containing character information and obtain initial text data; performing word segmentation and part-of-speech tagging on the initial text data based on a preset Chinese natural language processing HanLP algorithm to generate intermediate text data; performing dependency syntactic analysis and semantic dependency analysis on the intermediate text data based on the preset Chinese natural language processing HanLP algorithm to generate analysis text data; and filtering the analysis text data to obtain target text data containing a plurality of character behaviors and actions.
A second aspect of the present application provides a device for extracting data related to character actions, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor, when executing the computer-readable instructions, implements the following steps: acquiring preset text data, the preset text data being text data containing character behaviors and actions; classifying the preset text data to screen out text data containing character information and obtain initial text data; performing word segmentation and part-of-speech tagging on the initial text data based on a preset Chinese natural language processing HanLP algorithm to generate intermediate text data; performing dependency syntactic analysis and semantic dependency analysis on the intermediate text data based on the preset Chinese natural language processing HanLP algorithm to generate analysis text data; and filtering the analysis text data to obtain target text data containing a plurality of character behaviors and actions.
A third aspect of the present application provides a computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps: acquiring preset text data, the preset text data being text data containing character behaviors and actions; classifying the preset text data to screen out text data containing character information and obtain initial text data; performing word segmentation and part-of-speech tagging on the initial text data based on a preset Chinese natural language processing HanLP algorithm to generate intermediate text data; performing dependency syntactic analysis and semantic dependency analysis on the intermediate text data based on the preset Chinese natural language processing HanLP algorithm to generate analysis text data; and filtering the analysis text data to obtain target text data containing a plurality of character behaviors and actions.
A fourth aspect of the present application provides an apparatus for extracting data related to character actions, comprising: an acquisition module, configured to acquire preset text data, the preset text data being text data from novels that contains character behaviors and actions; a classification module, configured to classify the preset text data to screen out text data containing character information and obtain initial text data; a word segmentation module, configured to perform word segmentation and part-of-speech tagging on the initial text data based on a preset Chinese natural language processing HanLP algorithm to generate intermediate text data; an analysis module, configured to perform dependency syntactic analysis and semantic dependency analysis on the intermediate text data based on the preset Chinese natural language processing HanLP algorithm to generate analysis text data; and a filtering module, configured to filter the analysis text data to obtain target text data containing a plurality of character behaviors and actions.
In the technical solution provided by the present application, preset text data is acquired, the preset text data being text data from novels that contains character behaviors and actions; the preset text data is classified to screen out text data containing character information and obtain initial text data; word segmentation and part-of-speech tagging are performed on the initial text data based on a preset Chinese natural language processing HanLP algorithm to generate intermediate text data; dependency syntactic analysis and semantic dependency analysis are performed on the intermediate text data based on the preset Chinese natural language processing HanLP algorithm to generate analysis text data; and the analysis text data is filtered to obtain target text data containing a plurality of character behaviors and actions. In the embodiments of the present application, syntactic analysis and part-of-speech tagging are performed on text data by the Chinese natural language processing HanLP algorithm, and data related to behaviors and actions that are actually taking place is screened out based on the subject-predicate-object grammatical relationship and on modal verbs, which improves the accuracy of data extraction and reduces the noise of the extracted data set.
Brief Description of the Drawings
FIG. 1 is a schematic diagram of one embodiment of the method for extracting data related to character actions in the embodiments of the present application;
FIG. 2 is a schematic diagram of another embodiment of the method for extracting data related to character actions in the embodiments of the present application;
FIG. 3 is a schematic diagram of one embodiment of the apparatus for extracting data related to character actions in the embodiments of the present application;
FIG. 4 is a schematic diagram of another embodiment of the apparatus for extracting data related to character actions in the embodiments of the present application;
FIG. 5 is a schematic diagram of one embodiment of the device for extracting data related to character actions in the embodiments of the present application.
Detailed Description
The embodiments of the present application provide a method, apparatus, device and storage medium for extracting data related to character actions. Syntactic analysis and part-of-speech tagging are performed on text data by the Chinese natural language processing HanLP algorithm, and data related to behaviors and actions that are actually taking place is screened out based on the subject-predicate-object grammatical relationship and on modal verbs, which improves the accuracy of data extraction and reduces the noise of the extracted data set.
The terms "first", "second", "third", "fourth" and the like (if any) in the description, the claims and the above drawings of the present application are used to distinguish similar objects and are not necessarily used to describe a specific order or sequence. It should be understood that data so used may be interchanged where appropriate, so that the embodiments described herein can be implemented in an order other than that illustrated or described herein. Furthermore, the terms "comprising" and "having" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product or device comprising a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units that are not expressly listed or that are inherent to the process, method, product or device.
For ease of understanding, the specific flow of the embodiments of the present application is described below. Referring to FIG. 1, one embodiment of the method for extracting data related to character actions in the embodiments of the present application includes:
101. Acquire preset text data, the preset text data being text data from novels that contains character behaviors and actions.
The server acquires preset text data, the preset text data being text data from novels that contains character behaviors and actions. The server uses a crawler to fetch multiple novel texts under specified tags from the network and builds a preset data set from those novel texts.
It can be understood that the execution subject of the present application may be the apparatus for extracting data related to character actions, or may be a terminal or a server, which is not specifically limited here. The embodiments of the present application are described with the server as the execution subject.
102. Classify the preset text data, screen out the text data containing character information, and obtain initial text data.
The server classifies the preset text data, screens out the text data containing character information, and obtains initial text data. Specifically, the server classifies the preset text data according to preset classification rules and screens out text data containing personal pronouns or character names to generate classified text data; the server then filters the classified text data by identifying target punctuation marks and deleting the text data that contains character dialogue, generating the initial text data. The target punctuation marks are used to indicate character dialogue. The server divides the preset text data into two classes according to whether it contains character information and discards text data without character information, for example "小狗在院子里奔跑" ("The puppy is running in the yard"), "小鸟在窗外叽叽喳喳地叫着" ("The birds are chirping outside the window") and "松鼠用它那蓬松的大尾巴当被子盖" ("The squirrel uses its big fluffy tail as a quilt"). It keeps text data that includes personal pronouns or character names, where the personal pronouns include 我(们) (I/we), 你(们) (you), 他(们) (he/they) and 她(们) (she/they). The preset punctuation mark is the combination "colon + double quotation mark", which indicates character dialogue. Although text data with character dialogue contains character information, it is not suitable for the analysis and extraction of data related to character behaviors and actions in this solution, and therefore needs to be removed.
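The classification step described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the function names, the substring-based pronoun check and the caller-supplied name list are hypothetical and are not taken from the application, which does not specify how the preset classification rules are implemented.

```python
import re

# Pronouns listed in the text; plural forms are covered by the singular substrings.
PERSON_PRONOUNS = ("我", "你", "他", "她")

# Dialogue marker described in the text: a colon followed by an opening double quote.
# Accepting both full-width and ASCII forms is an assumption.
DIALOGUE_MARK = re.compile(r'[:：]\s*["“]')


def has_person(sentence, names=()):
    """Return True if the sentence mentions a personal pronoun or a known character name."""
    return any(p in sentence for p in PERSON_PRONOUNS) or any(n in sentence for n in names)


def select_initial_text(sentences, names=()):
    """Keep sentences that mention a person, then drop those that contain dialogue."""
    classified = [s for s in sentences if has_person(s, names)]
    return [s for s in classified if not DIALOGUE_MARK.search(s)]
```

A plain substring check will also match pronoun characters inside longer words (for example 其他), so a real implementation would more likely apply the check after word segmentation.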
103. Perform word segmentation and part-of-speech tagging on the initial text data based on the preset Chinese natural language processing HanLP algorithm to generate intermediate text data.
The server performs word segmentation and part-of-speech tagging on the initial text data based on the preset Chinese natural language processing HanLP algorithm to generate intermediate text data. Specifically, the server splits the initial text data into clauses at punctuation marks to obtain a clause-splitting result; the server segments the clause-splitting result into words based on the preset HanLP algorithm to obtain a word segmentation result; and the server tags the word segmentation result with parts of speech based on the preset HanLP algorithm and a preset HanLP part-of-speech tag set to generate the intermediate text data. The word is the most basic unit of text, and word segmentation is the most basic step in natural language processing. Word segmentation algorithms are divided into dictionary methods and statistical methods: methods based on dictionaries and hand-written rules match the words to be analyzed against dictionary entries according to a given strategy, while statistical methods rely on the frequency with which character strings occur in a corpus. Each punctuation mark has a corresponding regular expression; the initial text data is split at punctuation marks so that long sentences are divided into multiple short clauses, giving first text data. HanLP (Han Language Processing) is a toolkit composed of a series of models and algorithms whose goal is to promote the application of natural language processing in production environments. HanLP is characterized by complete functionality, efficient performance, a clear architecture, up-to-date corpora and customizability. In this solution, HanLP first segments the text data into words; for example, the input "小明正在吃饭" ("Xiao Ming is eating") is segmented into "小明" (Xiao Ming), "正在" (a marker of ongoing action) and "吃饭" (eat a meal). Part-of-speech tagging is the process of labeling each word in the segmentation result with its correct part of speech, that is, determining whether each word is a noun, a verb, an adjective or another part of speech. In this solution the segmentation result is tagged using the preset HanLP part-of-speech tag set: "小明" is tagged as a noun (名词), "正在" as an adverbial adjective (副形词), and "吃饭" as a verb (动词).
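To make the clause splitting and the dictionary-based segmentation mentioned above concrete, here is a small self-contained sketch. It does not call HanLP: it swaps in a toy forward-maximum-matching segmenter over a hand-written lexicon, and the lexicon entries, tag names and function names are all assumptions made for illustration rather than HanLP's own tag set or API.

```python
import re

# Clause delimiters; treating commas as delimiters is an assumption based on
# "long sentences are divided into multiple short clauses".
CLAUSE_DELIMITERS = re.compile(r"[，。！？；,!?;]+")

# Toy lexicon with illustrative tags (not HanLP's part-of-speech tag set).
LEXICON = {"小明": "noun", "正在": "adverb", "吃饭": "verb", "他": "pronoun", "笑": "verb", "了": "particle"}


def split_clauses(text):
    """Split text at punctuation marks and drop empty pieces."""
    return [c.strip() for c in CLAUSE_DELIMITERS.split(text) if c.strip()]


def segment_and_tag(clause, lexicon=LEXICON, max_len=4):
    """Forward maximum matching: take the longest lexicon entry at each position."""
    tokens, i = [], 0
    while i < len(clause):
        for size in range(min(max_len, len(clause) - i), 0, -1):
            word = clause[i:i + size]
            # Fall back to a single character tagged "unknown" when nothing matches.
            if word in lexicon or size == 1:
                tokens.append((word, lexicon.get(word, "unknown")))
                i += size
                break
    return tokens
```

For the example in the text, `segment_and_tag("小明正在吃饭")` yields the three tokens 小明, 正在 and 吃饭 with the toy tags from the lexicon. A production system would rely on the toolkit's trained models instead of a fixed dictionary.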
104. Perform dependency syntactic analysis and semantic dependency analysis on the intermediate text data based on the preset Chinese natural language processing HanLP algorithm to generate analysis text data.
The server performs dependency syntactic analysis and semantic dependency analysis on the intermediate text data based on the preset Chinese natural language processing HanLP algorithm to generate analysis text data. Dependency parsing (DP) reveals the syntactic structure of a language unit by analyzing the dependency relations between its components; that is, it identifies grammatical components of the sentence such as subject, predicate and object, and attributive, adverbial and complement, and analyzes the relations between them. Semantic dependency parsing (SDP) analyzes the semantic associations between the language units of a sentence and presents them as a dependency structure. Semantic dependency analysis is not affected by syntactic structure: language units that have a direct semantic association are connected directly by a dependency arc labeled with the corresponding semantic relation. This is an important difference between semantic dependency analysis and syntactic dependency analysis. For example, "小明吃了苹果" ("Xiao Ming ate the apple"), "小明把苹果吃了" (the same statement in the 把 construction) and "苹果被小明吃了" ("The apple was eaten by Xiao Ming") have different syntactic structures and therefore produce different syntactic analysis results, but the semantic relations between the language units do not change. All three sentences express the same semantic information: Xiao Ming performed the action of eating, and the action was performed on the apple.
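The contrast between syntactic and semantic dependency analysis can be illustrated with hand-written annotations for the three example sentences. The arcs below are not parser output; the relation labels ("subject", "agent", "patient" and so on) and the data layout are assumptions chosen only to show that different syntactic structures can share one semantic triple.

```python
from typing import NamedTuple


class Arc(NamedTuple):
    dependent: str
    relation: str
    head: str


# Hand-written syntactic arcs: each sentence has a different structure.
SYNTACTIC = {
    "小明吃了苹果": [Arc("小明", "subject", "吃"), Arc("苹果", "object", "吃")],
    "小明把苹果吃了": [Arc("小明", "subject", "吃"), Arc("把", "adverbial", "吃"), Arc("苹果", "prep-object", "把")],
    "苹果被小明吃了": [Arc("苹果", "subject", "吃"), Arc("被", "adverbial", "吃"), Arc("小明", "prep-object", "被")],
}

# Hand-written semantic arcs: identical for all three sentences.
SEMANTIC = {
    sentence: [Arc("小明", "agent", "吃"), Arc("苹果", "patient", "吃")]
    for sentence in SYNTACTIC
}


def semantic_triple(arcs):
    """Return (agent, predicate, patient) when both roles attach to the same predicate."""
    agent = next((a for a in arcs if a.relation == "agent"), None)
    if agent is None:
        return None
    patient = next((a for a in arcs if a.relation == "patient" and a.head == agent.head), None)
    if patient is None:
        return None
    return (agent.dependent, agent.head, patient.dependent)
```

Applying `semantic_triple` to each entry of `SEMANTIC` gives the same (小明, 吃, 苹果) triple, while the three syntactic arc lists differ from one another.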
105. Filter the analysis text data to obtain target text data containing a plurality of character behaviors and actions.
The server filters the analysis text data to obtain target text data containing a plurality of character behaviors and actions. Specifically, the server acquires the analysis text data and filters out the text data that contains modal verbs, generating filtered text data; the server then normalizes the filtered text data to generate the target text data, which includes the extracted character behaviors and actions. After the subject-predicate-object character actions have been screened out, a sentence in which a modal verb modifies the predicate verb does not qualify: because of the modal verb, the sentence is in the simple future tense and expresses an action or state at some future moment, so the character action has not yet taken place. For example, in "小明将要出发去荡秋千" ("Xiao Ming is about to set off to play on the swing"), the action of playing on the swing has not yet happened, so the related text data needs to be filtered out and deleted.
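A minimal sketch of this filtering step follows. The record layout, the list of modal and future markers, and the interpretation of "normalization" as de-duplication into plain triples are all assumptions; the application does not define the normalization or enumerate the modal verbs.

```python
# Hypothetical list of modal / future markers; the application gives only 将要 as an example.
MODAL_WORDS = {"将要", "将", "要", "会", "想", "能", "可以", "应该"}


def drop_modal(records):
    """Discard records whose predicate is modified by a modal word (action not yet performed)."""
    return [r for r in records if not MODAL_WORDS & set(r["predicate_modifiers"])]


def normalize(records):
    """Assumed normalization: reduce records to unique (subject, predicate, object) triples."""
    seen, triples = set(), []
    for r in records:
        triple = (r["subject"], r["predicate"], r["object"])
        if triple not in seen:
            seen.add(triple)
            triples.append(triple)
    return triples


def build_target_data(records):
    """Filter first, then normalize, mirroring the order described in the text."""
    return normalize(drop_modal(records))
```

Each record here is a dictionary with the keys `subject`, `predicate`, `object` and `predicate_modifiers`, which stands in for whatever structure the analysis step actually produces.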
In the embodiments of the present application, syntactic analysis and part-of-speech tagging are performed on text data by the Chinese natural language processing HanLP algorithm, and data related to behaviors and actions that are actually taking place is screened out based on the subject-predicate-object grammatical relationship and on modal verbs, which improves the accuracy of data extraction and reduces the noise of the extracted data set.
Referring to FIG. 2, another embodiment of the method for extracting data related to character actions in the embodiments of the present application includes:
201. Acquire preset text data, the preset text data being text data from novels that contains character behaviors and actions.
The server acquires preset text data, the preset text data being text data from novels that contains character behaviors and actions. The server uses a crawler to fetch multiple novel texts under specified tags from the network and builds a preset data set from those novel texts.
It can be understood that the execution subject of the present application may be the apparatus for extracting data related to character actions, or may be a terminal or a server, which is not specifically limited here. The embodiments of the present application are described with the server as the execution subject.
202. Classify the preset text data, screen out the text data containing character information, and obtain initial text data.
The server classifies the preset text data, screens out the text data containing character information, and obtains initial text data. Specifically, the server classifies the preset text data according to preset classification rules and screens out text data containing personal pronouns or character names to generate classified text data; the server then filters the classified text data by identifying target punctuation marks and deleting the text data that contains character dialogue, generating the initial text data. The target punctuation marks are used to indicate character dialogue. The server divides the preset text data into two classes according to whether it contains character information and discards text data without character information, for example "小狗在院子里奔跑" ("The puppy is running in the yard"), "小鸟在窗外叽叽喳喳地叫着" ("The birds are chirping outside the window") and "松鼠用它那蓬松的大尾巴当被子盖" ("The squirrel uses its big fluffy tail as a quilt"). It keeps text data that includes personal pronouns or character names, where the personal pronouns include 我(们) (I/we), 你(们) (you), 他(们) (he/they) and 她(们) (she/they). The preset punctuation mark is the combination "colon + double quotation mark", which indicates character dialogue. Although text data with character dialogue contains character information, it is not suitable for the analysis and extraction of data related to character behaviors and actions in this solution, and therefore needs to be removed.
203. Perform word segmentation and part-of-speech tagging on the initial text data based on the preset Chinese natural language processing (HanLP) algorithm to generate intermediate text data.
The server performs word segmentation and part-of-speech tagging on the initial text data based on the preset Chinese natural language processing (HanLP) algorithm to generate intermediate text data. Specifically, the server splits the initial text data into sentences at punctuation marks to obtain a sentence segmentation result; the server performs word segmentation on the sentence segmentation result based on the preset HanLP algorithm to obtain a word segmentation result; and the server performs part-of-speech tagging on the word segmentation result based on the preset HanLP algorithm and a preset HanLP part-of-speech tag set to generate the intermediate text data. Words are the most basic unit of text, and word segmentation is the most basic step in natural language processing. Word segmentation algorithms fall into dictionary methods and statistical methods: methods based on dictionaries and hand-written rules match the words to be analyzed against dictionary entries according to a given strategy, while statistical methods rely on the frequency with which character strings occur in a corpus. Each punctuation mark has a corresponding regular expression; the initial text data is split at punctuation marks so that long sentences are divided into multiple short sentences, yielding the first text data. HanLP (Han Language Processing) is a toolkit made up of a series of models and algorithms whose goal is to promote the use of natural language processing in production environments; it is characterized by complete functionality, high performance, a clear architecture, up-to-date corpora, and customizability. In this solution, HanLP first segments the text data into words. For example, for the input "小明正在吃饭" ("Xiao Ming is eating"), the segmentation result is "小明" (Xiao Ming), "正在" (currently), and "吃饭" (eat). Part-of-speech tagging is the process of assigning the correct part of speech to each word in the segmentation result, that is, determining whether each word is a noun, a verb, an adjective, or another part of speech. In this solution, the preset HanLP part-of-speech tag set is used to tag the segmented words: "小明" is tagged as a noun, "正在" as an adverbial adjective (副形词), and "吃饭" as a verb.
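The punctuation-based sentence splitting described in this step can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the set of sentence-ending marks and the function name `split_sentences` are hypothetical and are not taken from the application, and the word segmentation and tagging that HanLP performs afterwards are not reproduced here.

```python
import re

# Assumed set of sentence-ending marks (Chinese and ASCII); the application
# does not list which punctuation marks it uses.
SENTENCE_PATTERN = re.compile(r"[^。！？；!?;]+[。！？；!?;]?")

def split_sentences(text: str) -> list[str]:
    """Split a passage into short sentences, keeping the closing mark."""
    return [s.strip() for s in SENTENCE_PATTERN.findall(text) if s.strip()]

sentences = split_sentences("小明正在吃饭。小红在看书！")
print(sentences)
```

Each short sentence produced this way would then be passed to the segmenter and tagger one at a time.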
204. Invoke the preset Chinese natural language processing (HanLP) algorithm to identify and analyze the relationships between grammatical components in the intermediate text data and, when the core relation of the object points to the predicate verb, extract the core subject-verb-object relation to generate first analysis text data.
The server invokes the preset Chinese natural language processing (HanLP) algorithm to identify and analyze the relationships between grammatical components in the intermediate text data and, when the core relation of the object points to the predicate verb, extracts the core subject-verb-object relation to generate the first analysis text data. For example, in "小明正在房间里打游戏" ("Xiao Ming is playing games in the room"), "小明" (Xiao Ming) is the nominal subject, "正" is a nominal adverbial, "在" is a prepositional modifier, "房间" (room) is a prepositional location modifier, "里" (in) is a temporal preposition, "打" (play) is the predicate verb, and "游戏" (games) is the direct object. The predicate verb "打" is the core word, so the sentence can be reduced to "小明打游戏" ("Xiao Ming plays games"), which contains the subject-verb-object relation.
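The extraction rule in this step can be illustrated on a hand-written dependency parse. The sketch below is an assumption-laden example, not the applicant's implementation: the `Arc` record, the relation labels ("subject", "object", and so on), and the parse itself are hypothetical stand-ins for whatever the parser actually returns.

```python
from typing import NamedTuple, Optional

class Arc(NamedTuple):
    idx: int   # 1-based position of the word in the sentence
    word: str
    head: int  # position of the governing word; 0 marks the core word
    rel: str   # relation label (hypothetical names, not HanLP's own)

def extract_svo(arcs: list[Arc]) -> Optional[tuple[str, str, str]]:
    """Return (subject, verb, object) when both attach to the core verb."""
    root = next((a for a in arcs if a.head == 0), None)
    if root is None:
        return None
    subj = next((a for a in arcs if a.head == root.idx and a.rel == "subject"), None)
    obj = next((a for a in arcs if a.head == root.idx and a.rel == "object"), None)
    if subj is None or obj is None:
        return None
    return (subj.word, root.word, obj.word)

# Hand-written parse of "小明正在房间里打游戏"; "打" is the core word.
parse = [
    Arc(1, "小明", 6, "subject"),
    Arc(2, "正", 6, "adverbial"),
    Arc(3, "在", 6, "adverbial"),
    Arc(4, "房间", 5, "modifier"),
    Arc(5, "里", 3, "prep-object"),
    Arc(6, "打", 0, "root"),
    Arc(7, "游戏", 6, "object"),
]
svo = extract_svo(parse)
print(svo)
```

Sentences whose core verb lacks either a subject or an object yield `None` and would simply be skipped.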
205. Invoke the preset Chinese natural language processing (HanLP) algorithm to analyze the semantic associations in the intermediate text data, determine the relation types, and select the text data that contains an agent relation to generate second analysis text data.
The server invokes the preset Chinese natural language processing (HanLP) algorithm to analyze the semantic associations in the intermediate text data, determines the relation types, and selects the text data that contains an agent relation to generate the second analysis text data. The relation types include the agent, experiencer, affection, possessor, patient, content, product, origin, dative, and comparison roles. For example, in "小明送她一束花" ("Xiao Ming gives her a bouquet of flowers"), the semantic relation type is the agent relation, and giving flowers is a concrete action performed by the character, so the sentence satisfies the selection condition of this solution. The sentence "小明在房间里吃饭,一边看电视,一边还在说话" ("Xiao Ming is eating in the room, watching TV and talking at the same time") contains multiple predicate verbs, "吃" (eat), "看" (watch), and "说话" (talk), which stand in a succession relation to one another, so it also satisfies the selection condition.
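Selecting sentences by semantic role can be sketched as a simple filter over role-labeled arcs. Everything in the example is hypothetical: the `(predicate, argument, role)` triples are written by hand, the role names are English glosses chosen for illustration, and the second sentence is an invented counterexample that does not appear in the application.

```python
# (predicate, argument, role) triples; role names are illustrative glosses.
SemArc = tuple[str, str, str]

def select_agent_sentences(analyzed: dict[str, list[SemArc]]) -> list[str]:
    """Keep only sentences in which some argument fills the agent role."""
    return [
        sentence
        for sentence, arcs in analyzed.items()
        if any(role == "agent" for _, _, role in arcs)
    ]

analyzed = {
    "小明送她一束花": [
        ("送", "小明", "agent"),
        ("送", "她", "dative"),
        ("送", "花", "patient"),
    ],
    # Invented counterexample: a state description with no agent.
    "天空很蓝": [("蓝", "天空", "experiencer")],
}
second_analysis = select_agent_sentences(analyzed)
print(second_analysis)
```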
206. Merge the first analysis text data and the second analysis text data to generate analysis text data.
The server merges the first analysis text data and the second analysis text data to generate the analysis text data. In this solution, word segmentation, part-of-speech tagging, syntactic analysis, and semantic analysis are all based on the HanLP algorithm. Each layer produces its own data result, which can be used on its own or passed to the next layer for further analysis.
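One plausible reading of the merge step is an order-preserving union of the two result lists. The application does not say whether duplicates are removed, so the de-duplication below is an assumption, as are the function name `merge_results` and the sample data.

```python
def merge_results(first: list[str], second: list[str]) -> list[str]:
    """Concatenate both result lists and drop repeats, keeping first-seen order."""
    # dict.fromkeys preserves insertion order, so it doubles as an ordered set.
    return list(dict.fromkeys(first + second))

first_analysis = ["小明打游戏"]
second_analysis = ["小明送她一束花", "小明打游戏"]
analysis = merge_results(first_analysis, second_analysis)
print(analysis)
```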
207. Filter the analysis text data to obtain target text data containing multiple character actions.
The server filters the analysis text data to obtain target text data containing multiple character actions. Specifically, the server obtains the analysis text data and filters out the text data that contains modal verbs to generate filtered text data; the server then normalizes the filtered text data to generate the target text data, which includes the extracted character actions. After the subject-verb-object character actions have been selected, a sentence in which a modal verb modifies the predicate verb does not qualify: the modal verb puts the sentence in the simple future tense, expressing an action or state at some future moment, so the character's action has not yet taken place. For example, in "小明将要出发去荡秋千" ("Xiao Ming is about to set off to play on the swing"), the swinging has not yet happened, so the corresponding text data must be filtered out and deleted.
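The modal-verb filter can be sketched as a membership test over segmented sentences. The word list `MODAL_WORDS` is an assumed, deliberately small sample; the application does not enumerate the modal verbs it checks for, and the normalization step is left out because its details are not described.

```python
# Assumed sample of modal or future-marking words; not an exhaustive list.
MODAL_WORDS = {"将要", "将", "会", "能", "可以", "应该"}

def drop_modal_sentences(sentences: list[list[str]]) -> list[list[str]]:
    """Discard any segmented sentence that contains a modal word."""
    return [tokens for tokens in sentences if not MODAL_WORDS.intersection(tokens)]

segmented = [
    ["小明", "将要", "出发", "去", "荡秋千"],  # action has not happened yet
    ["小明", "打", "游戏"],
]
filtered = drop_modal_sentences(segmented)
print(filtered)
```

A production filter would more likely rely on the tagger's part-of-speech labels than on a fixed word list, since many of these words are ambiguous out of context.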
In this embodiment of the present application, the Chinese natural language processing (HanLP) algorithm is used to perform syntactic analysis and part-of-speech tagging on the text data, and data on actions that are currently taking place is selected based on the subject-verb-object grammatical relation and modal verbs, which improves the accuracy of data extraction and reduces noise in the extracted dataset.
The method for extracting data related to character actions in the embodiments of the present application has been described above; the apparatus for extracting data related to character actions is described below. Referring to FIG. 3, one embodiment of the apparatus for extracting data related to character actions in the embodiments of the present application includes:
an obtaining module 301, configured to obtain preset text data, where the preset text data is novel text data containing character actions;
a classification module 302, configured to classify the preset text data and select the text data containing character information to obtain initial text data;
a word segmentation module 303, configured to perform word segmentation and part-of-speech tagging on the initial text data based on the preset Chinese natural language processing (HanLP) algorithm to generate intermediate text data;
an analysis module 304, configured to perform dependency syntax analysis and semantic dependency analysis on the intermediate text data based on the preset Chinese natural language processing (HanLP) algorithm to generate analysis text data;
a filtering module 305, configured to filter the analysis text data to obtain target text data containing multiple character actions.
In this embodiment of the present application, the Chinese natural language processing (HanLP) algorithm is used to perform syntactic analysis and part-of-speech tagging on the text data, and data on actions that are currently taking place is selected based on the subject-verb-object grammatical relation and modal verbs, which improves the accuracy of data extraction and reduces noise in the extracted dataset.
Referring to FIG. 4, another embodiment of the apparatus for extracting data related to character actions in the embodiments of the present application includes:
an obtaining module 301, configured to obtain preset text data, where the preset text data is novel text data containing character actions;
a classification module 302, configured to classify the preset text data and select the text data containing character information to obtain initial text data;
a word segmentation module 303, configured to perform word segmentation and part-of-speech tagging on the initial text data based on the preset Chinese natural language processing (HanLP) algorithm to generate intermediate text data;
an analysis module 304, configured to perform dependency syntax analysis and semantic dependency analysis on the intermediate text data based on the preset Chinese natural language processing (HanLP) algorithm to generate analysis text data;
a filtering module 305, configured to filter the analysis text data to obtain target text data containing multiple character actions.
Optionally, the classification module 302 includes:
a classification unit 3021, configured to classify the preset text data according to preset classification rules and select the text data containing personal pronouns or character names to generate classified text data;
a deletion unit 3022, configured to identify target punctuation marks in the classified text data and delete the text data containing character dialogue according to the target punctuation marks to generate the initial text data, where the target punctuation marks indicate character dialogue.
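The two units of the classification module can be pictured as two successive list filters. The sketch below rests on several assumptions that the application leaves open: that the "target punctuation marks" are quotation marks, that a sentence containing dialogue is dropped whole, and that the pronoun list and the sample sentences (which are invented) are representative.

```python
import re

# Assumed pronoun list and assumed dialogue markers (quotation marks).
PRONOUNS = ("他", "她", "我", "你")
QUOTE_PATTERN = re.compile(r'[“”"「」]')

def select_initial_text(sentences: list[str], names: list[str]) -> list[str]:
    """Keep sentences that mention a character and contain no quoted dialogue."""
    markers = PRONOUNS + tuple(names)
    # Classification unit: keep sentences with a pronoun or a character name.
    with_person = [s for s in sentences if any(m in s for m in markers)]
    # Deletion unit: drop sentences that contain dialogue punctuation.
    return [s for s in with_person if not QUOTE_PATTERN.search(s)]

# Invented sample sentences for illustration only.
candidates = ["小明推开了门。", "“你好,”他说。", "窗外下着雨。"]
initial_text = select_initial_text(candidates, names=["小明"])
print(initial_text)
```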
Optionally, the word segmentation module 303 includes:
a sentence segmentation unit 3031, configured to split the initial text data into sentences at punctuation marks to obtain a sentence segmentation result;
a word segmentation unit 3032, configured to perform word segmentation on the sentence segmentation result based on the preset Chinese natural language processing (HanLP) algorithm to obtain a word segmentation result;
a part-of-speech tagging unit 3033, configured to perform part-of-speech tagging on the word segmentation result based on the preset Chinese natural language processing (HanLP) algorithm and a preset HanLP part-of-speech tag set to generate the intermediate text data.
Optionally, the analysis module 304 includes:
a first analysis unit 3041, configured to invoke the preset Chinese natural language processing (HanLP) algorithm to identify and analyze the relationships between grammatical components in the intermediate text data and, when the core relation of the object points to the predicate verb, extract the core subject-verb-object relation to generate first analysis text data;
a second analysis unit 3042, configured to invoke the preset Chinese natural language processing (HanLP) algorithm to analyze the semantic associations in the intermediate text data, determine the relation types, and select the text data that contains an agent relation to generate second analysis text data;
a merging unit 3043, configured to merge the first analysis text data and the second analysis text data to generate the analysis text data.
Optionally, the filtering module 305 includes:
a filtering unit 3051, configured to filter out the text data containing modal verbs from the analysis text data to generate filtered text data;
a normalization unit 3052, configured to normalize the filtered text data to generate target text data containing multiple character actions.
Optionally, after the analysis module 304 and before the filtering module 305, the apparatus for extracting data related to character actions further includes:
an identification module 306, configured to identify whether the analysis text data contains character actions that occurred in the past; when it does not, the analysis text data is retained, and when it does, the data relating to those past character actions is deleted.
Specifically, in "小明已经吃过饭了" ("Xiao Ming has already eaten"), for example, "吃" (eat) is the predicate verb, but the sentence is in the simple past tense and the semantic relation expresses Xiao Ming's past state rather than an action in progress, so the corresponding text data must be deleted.
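The behavior of the identification module can be sketched as a partition of segmented sentences into retained and deleted groups. The marker set `PAST_MARKERS` is an assumption made for this example; the application does not state how past actions are detected, and a real system would probably use the parser's aspect or tense information rather than a fixed word list.

```python
# Assumed markers of completed or past actions; not taken from the application.
PAST_MARKERS = {"已经", "曾经", "过"}

def split_by_past(sentences: list[list[str]]) -> tuple[list[list[str]], list[list[str]]]:
    """Return (retained, deleted): sentences without and with a past marker."""
    retained, deleted = [], []
    for tokens in sentences:
        target = deleted if PAST_MARKERS & set(tokens) else retained
        target.append(tokens)
    return retained, deleted

segmented = [
    ["小明", "已经", "吃", "过", "饭", "了"],  # past action, to be deleted
    ["小明", "打", "游戏"],
]
retained, deleted = split_by_past(segmented)
print(retained)
```

Returning the deleted group alongside the retained one makes it easy to audit what the filter removed.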
In this embodiment of the present application, the Chinese natural language processing (HanLP) algorithm is used to perform syntactic analysis and part-of-speech tagging on the text data, and data on actions that are currently taking place is selected based on the subject-verb-object grammatical relation and modal verbs, which improves the accuracy of data extraction and reduces noise in the extracted dataset.
FIG. 3 and FIG. 4 above describe the apparatus for extracting data related to character actions in the embodiments of the present application in detail from the perspective of modular functional entities; the device for extracting data related to character actions is described in detail below from the perspective of hardware processing.
FIG. 5 is a schematic structural diagram of a device for extracting data related to character actions provided by an embodiment of the present application. The device 500 may vary considerably depending on configuration or performance, and may include one or more processors (central processing units, CPU) 510 and a memory 520, as well as one or more storage media 530 (for example, one or more mass storage devices) storing application programs 533 or data 532. The memory 520 and the storage medium 530 may provide transient or persistent storage. The program stored in the storage medium 530 may include one or more modules (not shown in the figure), each of which may include a series of instruction operations for the device 500. Further, the processor 510 may be configured to communicate with the storage medium 530 and to execute, on the device 500, the series of instruction operations in the storage medium 530.
The device 500 for extracting data related to character actions may also include one or more power supplies 540, one or more wired or wireless network interfaces 550, one or more input/output interfaces 560, and/or one or more operating systems 531, such as Windows Server, Mac OS X, Unix, Linux, or FreeBSD. Those skilled in the art will understand that the device structure shown in FIG. 5 does not limit the device for extracting data related to character actions, which may include more or fewer components than shown, combine certain components, or arrange the components differently.
The present application further provides a device for extracting data related to character actions. The computer device includes a memory and a processor; the memory stores computer-readable instructions which, when executed by the processor, cause the processor to perform the steps of the method for extracting data related to character actions in the foregoing embodiments.
The present application further provides a computer-readable storage medium, which may be a non-volatile or a volatile computer-readable storage medium. The computer-readable storage medium stores instructions which, when run on a computer, cause the computer to perform the steps of the method for extracting data related to character actions.
Further, the computer-readable storage medium may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function, and the like, and the data storage area may store data created through the use of blockchain nodes, and the like.
The blockchain referred to in the present application is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms, and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods, in which each data block contains information on a batch of network transactions and is used to verify the validity of that information (anti-counterfeiting) and to generate the next block. A blockchain may include an underlying blockchain platform, a platform product service layer, an application service layer, and so on.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the systems, apparatuses, and units described above may be found in the corresponding processes of the foregoing method embodiments and are not repeated here.
If the integrated unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
As stated above, the foregoing embodiments are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that they may still modify the technical solutions described in the foregoing embodiments or make equivalent replacements of some of their technical features, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application.

Claims (20)

  1. A method for extracting data related to character actions, wherein the method comprises:
    obtaining preset text data, wherein the preset text data is text data containing character actions;
    classifying the preset text data and selecting the text data containing character information to obtain initial text data;
    performing word segmentation and part-of-speech tagging on the initial text data based on a preset Chinese natural language processing (HanLP) algorithm to generate intermediate text data;
    performing dependency syntax analysis and semantic dependency analysis on the intermediate text data based on the preset Chinese natural language processing (HanLP) algorithm to generate analysis text data;
    filtering the analysis text data to obtain target text data containing multiple character actions.
  2. The method for extracting data related to character actions according to claim 1, wherein the classifying of the preset text data and selecting of the text data containing character information to obtain initial text data comprises:
    classifying the preset text data according to preset classification rules and selecting the text data containing personal pronouns or character names to generate classified text data;
    identifying target punctuation marks in the classified text data and deleting the text data containing character dialogue according to the target punctuation marks to generate the initial text data, wherein the target punctuation marks indicate character dialogue.
  3. The method for extracting data related to character actions according to claim 1, wherein the performing of word segmentation and part-of-speech tagging on the initial text data based on the preset Chinese natural language processing (HanLP) algorithm to generate intermediate text data comprises:
    splitting the initial text data into sentences at punctuation marks to obtain a sentence segmentation result;
    performing word segmentation on the sentence segmentation result based on the preset Chinese natural language processing (HanLP) algorithm to obtain a word segmentation result;
    performing part-of-speech tagging on the word segmentation result based on the preset Chinese natural language processing (HanLP) algorithm and a preset HanLP part-of-speech tag set to generate the intermediate text data.
  4. The method for extracting data related to character actions according to claim 1, wherein the performing of dependency syntax analysis and semantic dependency analysis on the intermediate text data based on the preset Chinese natural language processing (HanLP) algorithm to generate analysis text data comprises:
    invoking the preset Chinese natural language processing (HanLP) algorithm to identify and analyze the relationships between grammatical components in the intermediate text data and, when the core relation of the object points to the predicate verb, extracting the core subject-verb-object relation to generate first analysis text data;
    invoking the preset Chinese natural language processing (HanLP) algorithm to analyze the semantic associations in the intermediate text data, determining the relation types, and selecting the text data that contains an agent relation to generate second analysis text data;
    merging the first analysis text data and the second analysis text data to generate the analysis text data.
  5. The method for extracting data related to character actions according to claim 1, wherein the filtering of the analysis text data to generate target text data, the target text data including the extracted multiple character actions, comprises:
    filtering out the text data containing modal verbs from the analysis text data to generate filtered text data;
    normalizing the filtered text data to generate target text data containing multiple character actions.
  6. The method for extracting data related to character actions according to claim 5, wherein the filtering out of the text data containing modal verbs from the analysis text data to generate filtered text data comprises:
    identifying the text data containing modal verbs in the analysis text data, wherein the modal verbs indicate character actions that have not yet occurred;
    deleting the text data containing modal verbs to generate the filtered text data.
  7. The method for extracting data related to character actions according to any one of claims 1 to 6, wherein, after the performing of dependency syntax analysis and semantic dependency analysis on the intermediate text data based on the preset Chinese natural language processing (HanLP) algorithm to generate analysis text data, and before the filtering of the analysis text data to generate target text data, the method further comprises:
    identifying whether the analysis text data contains character actions that occurred in the past; when the analysis text data does not contain character actions that occurred in the past, retaining the analysis text data; and when the analysis text data contains character actions that occurred in the past, deleting the data relating to those past character actions.
  8. A device for extracting data related to character actions, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    obtaining preset text data, wherein the preset text data is text data containing character actions;
    classifying the preset text data and selecting the text data containing character information to obtain initial text data;
    performing word segmentation and part-of-speech tagging on the initial text data based on a preset Chinese natural language processing (HanLP) algorithm to generate intermediate text data;
    performing dependency syntax analysis and semantic dependency analysis on the intermediate text data based on the preset Chinese natural language processing (HanLP) algorithm to generate analysis text data;
    filtering the analysis text data to obtain target text data containing multiple character actions.
  9. The device for extracting character-action-related data according to claim 8, wherein the processor further implements the following steps when executing the computer program:
    classifying the preset text data according to preset classification rules and selecting the text data that contains character pronouns or character names, to generate classified text data;
    identifying target punctuation marks in the classified text data and, according to the target punctuation marks, deleting the text data that contains character dialogue, to generate initial text data, the target punctuation marks being used to indicate character dialogue.
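The two steps of claim 9 can be illustrated with a minimal Python sketch. It is not the applicant's implementation: the pronoun list, the choice of quotation marks as the target punctuation, and the function names `mentions_character`, `strip_dialogue`, and `build_initial_text` are all assumptions made for illustration, since the claim only refers to "preset classification rules" and "target punctuation marks".

```python
import re

# Hypothetical rule set: the claim only says "preset classification rules".
PRONOUNS = ("他", "她", "我", "你", "他们", "她们", "我们", "你们")
# Quotation marks are assumed to be the target punctuation that signals dialogue.
DIALOGUE = re.compile(r'“[^”]*”|「[^」]*」|"[^"]*"')


def mentions_character(passage, names=()):
    """Return True if the passage contains a pronoun or a known character name."""
    return any(term in passage for term in PRONOUNS + tuple(names))


def strip_dialogue(passage):
    """Delete every quoted span from the passage."""
    return DIALOGUE.sub("", passage)


def build_initial_text(passages, names=()):
    """Keep passages that mention a character, then remove their dialogue."""
    classified = [p for p in passages if mentions_character(p, names)]
    return [strip_dialogue(p) for p in classified]
```

In this sketch a passage such as `窗外下着雨。` is dropped because it mentions no character, while a passage that names a character is kept with its quoted speech removed.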
  10. The device for extracting character-action-related data according to claim 8, wherein the processor further implements the following steps when executing the computer program:
    splitting the initial text data into sentences at punctuation marks, to obtain a sentence segmentation result;
    performing word segmentation on the sentence segmentation result based on the preset HanLP Chinese natural language processing algorithm, to obtain a word segmentation result;
    performing part-of-speech tagging on the word segmentation result based on the preset HanLP Chinese natural language processing algorithm and a preset HanLP part-of-speech tag set, to generate intermediate text data.
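The sentence-splitting step of claim 10 can be sketched in plain Python as follows. The set of sentence-ending marks and the name `split_sentences` are assumptions, because the claim only says "punctuation marks". Word segmentation and part-of-speech tagging would be delegated to HanLP and are not reproduced here.

```python
import re

# Sentence-ending marks assumed for illustration (full-width and ASCII forms).
SENTENCE = re.compile(r"[^。!?;!?;]+[。!?;!?;]?")


def split_sentences(text):
    """Return the sentences of `text`, each keeping its closing mark."""
    return [s.strip() for s in SENTENCE.findall(text) if s.strip()]
```

Keeping the closing mark on each sentence preserves the information a later step might need, for example to tell questions from statements.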
  11. The device for extracting character-action-related data according to claim 8, wherein the processor further implements the following steps when executing the computer program:
    invoking the preset HanLP Chinese natural language processing algorithm to identify and analyze the relations between grammatical components in the intermediate text data and, when the core relation of the object points to the predicate verb, extracting the core subject-predicate-object relation, to generate first analysis text data;
    invoking the preset HanLP Chinese natural language processing algorithm to analyze the semantic associations in the intermediate text data, determining the relation types, and selecting the text data that contains an agent relation, to generate second analysis text data;
    merging the first analysis text data and the second analysis text data, to generate analysis text data.
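The extraction and merging described in claim 11 can be sketched over parser output that has already been produced. The sketch below does not call HanLP. It assumes each token arrives as a hypothetical `Arc` record, that syntactic arcs carry the Chinese labels `主谓关系` (subject-verb) and `动宾关系` (verb-object), and that semantic arcs label the agent relation as `Agt`. These label names and the merged record layout are assumptions and should be checked against the parser actually used.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arc:
    idx: int   # 1-based token position
    word: str
    head: int  # position of the governing token, 0 for the root
    rel: str   # relation label (assumed naming, see lead-in)


def core_svo(arcs):
    """Return (subject, verb, object) around the root verb, or None."""
    root = next((a for a in arcs if a.head == 0), None)
    if root is None:
        return None
    subj = next((a.word for a in arcs if a.head == root.idx and a.rel == "主谓关系"), None)
    obj = next((a.word for a in arcs if a.head == root.idx and a.rel == "动宾关系"), None)
    if subj is None or obj is None:
        return None
    return (subj, root.word, obj)


def agent_pairs(sem_arcs):
    """Return (agent, predicate) pairs for every arc labelled as an agent relation."""
    by_idx = {a.idx: a.word for a in sem_arcs}
    return [(a.word, by_idx[a.head]) for a in sem_arcs if a.rel == "Agt" and a.head in by_idx]


def merge_analyses(svo, agents):
    """Combine the syntactic and semantic results into one record."""
    return {"svo": svo, "agents": agents}
```

For the sentence 小明打开门, a syntactic parse with 打开 as root would give the triple (小明, 打开, 门), and a semantic parse marking 小明 as agent of 打开 would give the pair (小明, 打开).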
  12. The device for extracting character-action-related data according to claim 8, wherein the processor further implements the following steps when executing the computer program:
    filtering out the text data that contains modal verbs from the analysis text data, to generate filtered text data;
    normalizing the filtered text data, to generate target text data containing a plurality of character actions.
  13. The device for extracting character-action-related data according to claim 12, wherein the processor further implements the following steps when executing the computer program:
    identifying the text data that contains modal verbs in the analysis text data, the modal verbs being used to indicate character actions that have not yet occurred;
    deleting the text data that contains the modal verbs, to generate filtered text data.
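The modal-verb filter of claims 12 and 13 can be sketched as a lexicon lookup over sentences that are already segmented into words. The lexicon below and the names `has_modal` and `drop_modal_sentences` are assumptions, since the claims do not enumerate the modal verbs or say how they are recognized.

```python
# Hypothetical modal-verb lexicon; the claims do not list the words.
MODAL_VERBS = frozenset(
    {"会", "能", "能够", "可以", "应该", "应当", "必须", "要", "想", "愿意", "打算"}
)


def has_modal(tokens):
    """Return True if any segmented word is a modal verb."""
    return any(token in MODAL_VERBS for token in tokens)


def drop_modal_sentences(segmented):
    """Keep only the sentences (lists of words) that contain no modal verb."""
    return [tokens for tokens in segmented if not has_modal(tokens)]
```

Matching on whole segmented words rather than on substrings avoids false hits inside longer words, for example 会 inside 会议.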
  14. The device for extracting character-action-related data according to any one of claims 8 to 13, wherein the processor further implements the following steps when executing the computer program:
    identifying whether the analysis text data contains character actions that occurred in the past; when the analysis text data does not contain character actions that occurred in the past, retaining the analysis text data; and when the analysis text data contains character actions that occurred in the past, deleting the data related to those past character actions.
  15. A computer-readable storage medium storing computer instructions which, when run on a computer, cause the computer to perform the following steps:
    obtaining preset text data, the preset text data being text data that contains character actions;
    classifying the preset text data and selecting the text data that contains character information, to obtain initial text data;
    performing word segmentation and part-of-speech tagging on the initial text data based on a preset HanLP Chinese natural language processing algorithm, to generate intermediate text data;
    performing dependency parsing and semantic dependency analysis on the intermediate text data based on the preset HanLP Chinese natural language processing algorithm, to generate analysis text data;
    filtering the analysis text data to obtain target text data containing a plurality of character actions.
  16. The computer-readable storage medium according to claim 15, wherein the computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to perform the following steps:
    classifying the preset text data according to preset classification rules and selecting the text data that contains character pronouns or character names, to generate classified text data;
    identifying target punctuation marks in the classified text data and, according to the target punctuation marks, deleting the text data that contains character dialogue, to generate initial text data, the target punctuation marks being used to indicate character dialogue.
  17. The computer-readable storage medium according to claim 15, wherein the computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to perform the following steps:
    splitting the initial text data into sentences at punctuation marks, to obtain a sentence segmentation result;
    performing word segmentation on the sentence segmentation result based on the preset HanLP Chinese natural language processing algorithm, to obtain a word segmentation result;
    performing part-of-speech tagging on the word segmentation result based on the preset HanLP Chinese natural language processing algorithm and a preset HanLP part-of-speech tag set, to generate intermediate text data.
  18. The computer-readable storage medium according to claim 15, wherein the computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to perform the following steps:
    invoking the preset HanLP Chinese natural language processing algorithm to identify and analyze the relations between grammatical components in the intermediate text data and, when the core relation of the object points to the predicate verb, extracting the core subject-predicate-object relation, to generate first analysis text data;
    invoking the preset HanLP Chinese natural language processing algorithm to analyze the semantic associations in the intermediate text data, determining the relation types, and selecting the text data that contains an agent relation, to generate second analysis text data;
    merging the first analysis text data and the second analysis text data, to generate analysis text data.
  19. The computer-readable storage medium according to claim 15, wherein the computer-readable storage medium stores computer instructions which, when run on a computer, cause the computer to perform the following steps:
    filtering out the text data that contains modal verbs from the analysis text data, to generate filtered text data;
    normalizing the filtered text data, to generate target text data containing a plurality of character actions.
  20. An apparatus for extracting character-action-related data, wherein the apparatus comprises:
    an acquisition module, configured to obtain preset text data, the preset text data being novel text data that contains character actions;
    a classification module, configured to classify the preset text data and select the text data that contains character information, to obtain initial text data;
    a word segmentation module, configured to perform word segmentation and part-of-speech tagging on the initial text data based on a preset HanLP Chinese natural language processing algorithm, to generate intermediate text data;
    an analysis module, configured to perform dependency parsing and semantic dependency analysis on the intermediate text data based on the preset HanLP Chinese natural language processing algorithm, to generate analysis text data;
    a filtering module, configured to filter the analysis text data to obtain target text data containing a plurality of character actions.
PCT/CN2021/124629 2020-12-23 2021-10-19 Method, apparatus and device for extracting character action related data, and storage medium WO2022134779A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011545182.2A CN112597307A (en) 2020-12-23 2020-12-23 Extraction method, device and equipment of figure action related data and storage medium
CN202011545182.2 2020-12-23

Publications (1)

Publication Number Publication Date
WO2022134779A1 true WO2022134779A1 (en) 2022-06-30

Family

ID=75200609

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/124629 WO2022134779A1 (en) 2020-12-23 2021-10-19 Method, apparatus and device for extracting character action related data, and storage medium

Country Status (2)

Country Link
CN (1) CN112597307A (en)
WO (1) WO2022134779A1 (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112597307A (en) * 2020-12-23 2021-04-02 深圳壹账通智能科技有限公司 Extraction method, device and equipment of figure action related data and storage medium
CN113065332B (en) * 2021-04-22 2023-05-12 深圳壹账通智能科技有限公司 Text processing method, device, equipment and storage medium based on reading model

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150081281A1 (en) * 2013-09-18 2015-03-19 International Business Machines Corporation Using Renaming Directives to Bootstrap Industry-Specific Knowledge and Lexical Resources
CN110309513A (en) * 2019-07-09 2019-10-08 北京金山数字娱乐科技有限公司 A kind of method and apparatus of context dependent analysis
CN110457676A (en) * 2019-06-26 2019-11-15 平安科技(深圳)有限公司 Extracting method and device, storage medium, the computer equipment of evaluation information
CN111177401A (en) * 2019-12-12 2020-05-19 西安交通大学 Power grid free text knowledge extraction method
CN112597307A (en) * 2020-12-23 2021-04-02 深圳壹账通智能科技有限公司 Extraction method, device and equipment of figure action related data and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117609518A (en) * 2024-01-17 2024-02-27 江西科技师范大学 Hierarchical Chinese entity relation extraction method and system for centering structure
CN117609518B (en) * 2024-01-17 2024-04-26 江西科技师范大学 Hierarchical Chinese entity relation extraction method and system for centering structure

Also Published As

Publication number Publication date
CN112597307A (en) 2021-04-02

Similar Documents

Publication Publication Date Title
JP7346609B2 (en) Systems and methods for performing semantic exploration using natural language understanding (NLU) frameworks
KR101130444B1 (en) System for identifying paraphrases using machine translation techniques
WO2022134779A1 (en) Method, apparatus and device for extracting character action related data, and storage medium
US10915577B2 (en) Constructing enterprise-specific knowledge graphs
US9652719B2 (en) Authoring system for bayesian networks automatically extracted from text
Bikel Intricacies of Collins' parsing model
US9373075B2 (en) Applying a genetic algorithm to compositional semantics sentiment analysis to improve performance and accelerate domain adaptation
JP6676109B2 (en) Utterance sentence generation apparatus, method and program
WO2018045646A1 (en) Artificial intelligence-based method and device for human-machine interaction
CN112182252B (en) Intelligent medication question-answering method and device based on medicine knowledge graph
WO2017198031A1 (en) Semantic parsing method and apparatus
JP6729095B2 (en) Information processing device and program
CN114556328A (en) Data processing method and device, electronic equipment and storage medium
JP2007219947A (en) Causal relation knowledge extraction device and program
CN112580331A (en) Method and system for establishing knowledge graph of policy text
US20220245361A1 (en) System and method for managing and optimizing lookup source templates in a natural language understanding (nlu) framework
US20220229994A1 (en) Operational modeling and optimization system for a natural language understanding (nlu) framework
CN111552798A (en) Name information processing method and device based on name prediction model and electronic equipment
US20140303962A1 (en) Ordering a Lexicon Network for Automatic Disambiguation
US20220229990A1 (en) System and method for lookup source segmentation scoring in a natural language understanding (nlu) framework
US20220229998A1 (en) Lookup source framework for a natural language understanding (nlu) framework
CN114841138A (en) Machine reading between rows
JP3691773B2 (en) Sentence analysis method and sentence analysis apparatus capable of using the method
Sevilla et al. Enriched semantic graphs for extractive text summarization
CN113032529B (en) English phrase recognition method, device, medium and electronic equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21908780

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 30.10.2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21908780

Country of ref document: EP

Kind code of ref document: A1