WO2022188585A1 - Labeling method and apparatus for text data, computer device, and storage medium - Google Patents

Labeling method and apparatus for text data, computer device, and storage medium

Info

Publication number
WO2022188585A1
Authority
WO
WIPO (PCT)
Prior art keywords
text data
words
word
label
search
Prior art date
Application number
PCT/CN2022/075659
Other languages
English (en)
French (fr)
Inventor
孙孟哲
刘凯
顾松庠
Original Assignee
京东科技控股股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 京东科技控股股份有限公司
Publication of WO2022188585A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/10 Text processing
    • G06F40/103 Formatting, i.e. changing of presentation of documents
    • G06F40/117 Tagging; Marking up; Designating a block; Setting of attributes
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition

Definitions

  • The present application relates to the technical field of artificial intelligence, and in particular to a labeling method and apparatus for text data, a computer device, and a storage medium.
  • In the related art, text data is labeled mainly by hand, or by machine learning and keyword-matching retrieval based on existing labeled text data.
  • The present application aims to solve, at least to some extent, one of the technical problems in the related art.
  • To this end, the purpose of the present application is to propose a labeling method and apparatus for text data, a computer device, and a storage medium, so that the labeling method adapts automatically to labeling new words in text data, thereby effectively improving the efficiency and accuracy of text data labeling.
  • The labeling method for text data proposed by the embodiment of the first aspect of the present application includes: acquiring text data; processing the text data to obtain corresponding target words and business keywords; selecting a corresponding first label from a preconfigured label library according to the target words; determining a corresponding second label according to the business keywords in combination with a pre-trained label extraction model; and labeling the text data with the first label and the second label.
  • With the labeling method proposed by the embodiment of the first aspect of the present application, text data is acquired and processed to obtain corresponding target words and business keywords; a corresponding first label is selected from the preconfigured label library according to the target words; a corresponding second label is determined according to the business keywords in combination with the pre-trained label extraction model; and the text data is labeled with the first label and the second label. The labeling method can thus adapt automatically to labeling new words in text data, which effectively improves the efficiency and accuracy of text data labeling.
  • The labeling device for text data proposed by the embodiment of the second aspect of the present application includes: an acquisition module for acquiring text data; a processing module for processing the text data to obtain corresponding target words and business keywords; a selection module for selecting a corresponding first label from a preconfigured label library according to the target words; a determination module for determining a corresponding second label according to the business keywords in combination with a pre-trained label extraction model; and a labeling module for labeling the text data with the first label and the second label.
  • With the labeling device proposed by the embodiment of the second aspect of the present application, text data is acquired and processed to obtain corresponding target words and business keywords; a corresponding first label is selected from the preconfigured label library according to the target words; a corresponding second label is determined according to the business keywords in combination with the pre-trained label extraction model; and the text data is labeled with the first label and the second label. The labeling method can thus adapt automatically to labeling new words in text data, which effectively improves the efficiency and accuracy of text data labeling.
  • The embodiment of the third aspect of the present application proposes a computer device, including a memory, a processor, and a computer program stored in the memory and running on the processor.
  • The embodiment of the fourth aspect of the present application provides a non-transitory computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, it implements the labeling method for text data proposed by the embodiment of the first aspect of the present application.
  • The embodiment of the fifth aspect of the present application provides a computer program product; when instructions in the computer program product are executed by a processor, the labeling method for text data proposed by the embodiment of the first aspect of the present application is performed.
  • FIG. 1 is a schematic flowchart of a labeling method for text data proposed by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of an application in an embodiment of the present application.
  • FIG. 3 is a schematic flowchart of a labeling method for text data proposed by another embodiment of the present application.
  • FIG. 4 is a schematic structural diagram of a labeling device for text data proposed by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of a labeling device for text data proposed by another embodiment of the present application.
  • FIG. 6 is a block diagram of an exemplary computer device suitable for implementing embodiments of the present application.
  • FIG. 1 is a schematic flowchart of a method for labeling text data proposed by an embodiment of the present application.
  • The execution body of the labeling method in this embodiment is a labeling device for text data. The device may be implemented in software and/or hardware and may be configured in an electronic device, which may include, but is not limited to, a terminal, a server, and the like.
  • The method includes steps S101 to S105.
  • S101 Acquire text data.
  • The text data is, for example, the content of a piece of text that carries corresponding semantics.
  • For example, a text input interface may be provided by the electronic device to receive a piece of text entered by a user, and the content of that text may be parsed and used as the text data. Alternatively, a piece of speech entered by the user may be received and converted into corresponding text, and the content of that text may be parsed and used as the text data.
  • The above process of acquiring text data may be performed by automatic parsing and acquisition, so as to achieve closed-loop automatic labeling of text data.
  • S102 Process the text data to obtain corresponding target words and business keywords.
  • The target word may be a word not recognized by the human-assisted labeling platform, or another word with certain characteristics determined by business requirements.
  • The human-assisted labeling platform can, in combination with some models, adaptively identify from the text data the words needed for labeling.
  • Because the recognition accuracy of the human-assisted labeling platform is limited, some words may go unrecognized in actual labeling scenarios. The embodiments of the present application therefore recognize, automatically and in a closed loop, the words that the platform fails to recognize. The resulting target words are then used to label the text data, which improves labeling accuracy.
  • The text data can also be processed to obtain corresponding business keywords, which can be used to describe the business type of the piece of text data, for example finance, funds, or education.
  • Processing the text data to obtain corresponding target words and business keywords may consist of: performing word segmentation on the text data to obtain multiple candidate search words; performing named entity recognition on the text data to obtain multiple corresponding entity words; and selecting target words from the candidate search words and identifying business keywords from the entity words. This improves the mining of both target words and business keywords: new-word mining is search-based, which improves the coverage of the new words found, and business keyword extraction is based on named entity recognition, which preserves recognition accuracy while improving recognition efficiency.
  • For example, word segmentation can be performed on the text data to obtain multiple segmented words.
  • Each segmented word can be used as a candidate search word, and a corresponding search can be triggered in a search engine to determine the best-matching target word.
  • Named Entity Recognition (NER) is used to obtain multiple corresponding entity words, so that business keywords can be derived from those entity words.
  • Performing named entity recognition on the text data to obtain multiple corresponding entity words may consist of using the text data as the input of a pre-trained NER model and taking the output of the model as the entity words. Because the NER model is pre-trained on massive data, mining efficiency and convenience can be greatly improved.
  • The above word features can be co-occurrence features, context features, special symbol features (such as whether a candidate entity word in the entity library contains a dash, the proportion of occurrences in which the candidate entity word is enclosed in quotation marks, the proportion enclosed in brackets, and the proportion of English letters and digits in the entity word), Inverse Document Frequency (IDF), completeness features, word vector features, and so on.
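  • As an illustration of the special symbol features and IDF listed above, the following is a minimal sketch only. The function names, the `contexts` and `documents` inputs, and the smoothed IDF form are assumptions made for this example and are not specified by the application.

```python
import math
from typing import Dict, List


def special_symbol_features(word: str, contexts: List[str]) -> Dict[str, float]:
    """Surface features of a candidate entity word.

    `contexts` is a hypothetical list of sentences in which the word occurs.
    """
    total = max(len(contexts), 1)
    quoted = sum(1 for c in contexts if f'"{word}"' in c)
    bracketed = sum(1 for c in contexts if f"({word})" in c)
    ascii_alnum = sum(1 for ch in word if ch.isascii() and ch.isalnum())
    return {
        "has_dash": float("-" in word),
        "quoted_ratio": quoted / total,
        "bracketed_ratio": bracketed / total,
        "ascii_alnum_ratio": ascii_alnum / max(len(word), 1),
    }


def inverse_document_frequency(word: str, documents: List[str]) -> float:
    """Smoothed IDF (an assumed variant): log((N + 1) / (df + 1)) + 1."""
    df = sum(1 for d in documents if word in d)
    return math.log((len(documents) + 1) / (df + 1)) + 1.0
```

  • A rarer word receives a higher IDF value, so it contributes more when candidate entity words are scored.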
  • The above process infers and expands from the target words to obtain extended entity words. For example, the entity words are parsed to obtain their respective word features, and candidate text data is obtained.
  • The candidate text data may come from a text database or from online search. Each candidate text is then segmented, the segmented words with higher word frequency are kept as candidate entity words, and an entity library is built from a large number of such candidate entity words. The word features are matched against the entity library, the matching degree of each candidate entity word is scored, and the candidate entity words with higher scores are selected as extended entity words.
  • Any other feasible approach, such as artificial intelligence or machine learning methods, can also be used to infer and expand from the target words to obtain extended entity words.
  • Business keywords can then be identified from the entity words and the extended entity words, which effectively broadens the coverage of the mined business keywords while preserving recognition accuracy.
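  • The entity-library construction and scoring described above can be sketched roughly as follows. This is a hedged illustration: `segment`, `min_count`, `min_score`, and the character-overlap score are placeholders chosen for this example; the application does not specify the matching function.

```python
from collections import Counter
from typing import Callable, Iterable, List, Set


def build_entity_library(
    candidate_texts: Iterable[str],
    segment: Callable[[str], List[str]],
    min_count: int = 2,
) -> Set[str]:
    """Keep segmented words whose frequency reaches `min_count`."""
    counts = Counter(tok for text in candidate_texts for tok in segment(text))
    return {tok for tok, n in counts.items() if n >= min_count}


def char_overlap(a: str, b: str) -> float:
    """Jaccard overlap of character sets; a stand-in for feature matching."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def expand_entities(
    entity_words: List[str], library: Set[str], min_score: float = 0.5
) -> List[str]:
    """Score each library entry against the entity words and keep the best."""
    scored = {}
    for cand in library:
        if cand in entity_words:
            continue  # already known, not an extension
        scored[cand] = max((char_overlap(cand, e) for e in entity_words), default=0.0)
    kept = [c for c, s in scored.items() if s >= min_score]
    return sorted(kept, key=lambda c: (-scored[c], c))
```

  • In practice the overlap score would be replaced by a score computed from the word features (co-occurrence, context, IDF, word vectors, and so on).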
  • S103 Select a corresponding first tag from a preconfigured tag library according to the target word.
  • After the target words are obtained, the corresponding first label may be selected from the preconfigured label library according to the target words.
  • The label corresponding to the target word may be referred to as the first label, and the first label may be used to label the text data.
  • Selecting the corresponding first label from the preconfigured label library may consist of: processing the target word with a word vector analysis algorithm to obtain a feature representation of the target word; mapping that feature representation into the vector space to obtain the word vector of the target word; and matching this word vector against the label word vector of each label in the label library to determine their similarity. If the similarity is greater than a threshold (for example, 90%), the label corresponding to that label word vector is determined to match the target word and can be used as the first label.
  • The first label corresponding to the target word may also be selected from the preconfigured label library in any other feasible manner, such as model matching or selection by mathematical operation.
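  • The word-vector matching described above can be sketched as follows. This is a minimal sketch that assumes cosine similarity as the similarity measure and a plain dictionary as the label library; neither choice is stated in the application, and the function names are hypothetical.

```python
import math
from typing import Dict, Optional, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_first_label(
    target_vector: Sequence[float],
    label_vectors: Dict[str, Sequence[float]],
    threshold: float = 0.9,
) -> Optional[str]:
    """Return the most similar label whose similarity exceeds `threshold`."""
    best_label, best_sim = None, threshold
    for label, vector in label_vectors.items():
        sim = cosine_similarity(target_vector, vector)
        if sim > best_sim:
            best_label, best_sim = label, sim
    return best_label
```

  • Returning `None` when no label clears the threshold leaves room for the other selection methods mentioned above.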
  • S104 Determine the corresponding second label according to the business keyword and in combination with the pre-trained label extraction model.
  • The corresponding second label can be determined according to the business keywords in combination with the pre-trained label extraction model. The label corresponding to the business keywords may be referred to as the second label, and the second label may be used to label the text data.
  • For example, the business keywords can be input into the pre-trained label extraction model (which can be trained on massive training data), and the label output by the model that matches the business keywords is taken as the second label.
  • S105 Annotate the text data with the first label and the second label.
  • the first label and the second label can be directly used to mark the text data.
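  • Steps S101 to S105 can be summarized in the following sketch. The three callables stand in for the processing step, the label library lookup, and the pre-trained label extraction model; their names and signatures are assumptions for illustration, not interfaces defined by the application.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Annotation:
    text: str
    labels: List[str] = field(default_factory=list)


def annotate_text(
    text: str,  # S101: the acquired text data
    process: Callable[[str], Tuple[List[str], List[str]]],
    select_first_label: Callable[[str], Optional[str]],
    determine_second_label: Callable[[str], Optional[str]],
) -> Annotation:
    # S102: obtain target words and business keywords
    target_words, business_keywords = process(text)
    labels: List[str] = []
    # S103: first labels from the preconfigured label library
    for word in target_words:
        label = select_first_label(word)
        if label is not None and label not in labels:
            labels.append(label)
    # S104: second labels from the label extraction model
    for keyword in business_keywords:
        label = determine_second_label(keyword)
        if label is not None and label not in labels:
            labels.append(label)
    # S105: attach both kinds of label to the text data
    return Annotation(text=text, labels=labels)
```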
  • FIG. 2 is a schematic diagram of an application in an embodiment of the present application. It includes an artificial intelligence (AI) auxiliary classification module and an AI auxiliary labeling module. The AI auxiliary classification module assists the human-assisted labeling platform in recognizing new words and business keywords, and the AI auxiliary labeling module assists the platform in identifying the first label and the second label, so as to achieve closed-loop automatic labeling.
  • In this embodiment, text data is acquired and processed to obtain corresponding target words and business keywords; a corresponding first label is selected from the preconfigured label library according to the target words; a corresponding second label is determined according to the business keywords in combination with the pre-trained label extraction model; and the text data is labeled with the first label and the second label. The labeling method can thus adapt automatically to labeling new words in text data, which effectively improves the efficiency and accuracy of text data labeling.
  • FIG. 3 is a schematic flowchart of a method for labeling text data proposed by another embodiment of the present application.
  • the method includes steps S301 to S309.
  • S302 Perform word segmentation processing on the text data to obtain multiple candidate search words.
  • S303 Perform named entity recognition on the text data to obtain a plurality of corresponding entity words.
  • The search feature may be a feature from the search domain observed when the candidate search word is used to search in a search engine, such as search volume or the page views of the corresponding search result page.
  • The search feature can be used to determine the word frequency of the candidate search word in the search domain, so that word frequency is taken into account when identifying new words. That is, among the words not recognized by the human-assisted labeling platform, target words are selected by combining the search features of each word, which helps ensure the quality of new-word recognition and improves its accuracy and reasonableness.
  • At least one target search result corresponding to the candidate search word may be obtained, the proportion of the target search results among the multiple search results may be counted, and this proportion information may be used as the search feature.
  • The target search result is a search result, among the multiple search results, that is triggered by consecutive clicks. The search results are obtained by a search engine searching with the candidate search word. The target search result includes reference text data, and the candidate search word appears in the text topic of the reference text data.
  • For example, a search result may specifically be a search result page, and the search result page may correspond to a reference text (for example, the page displays a reference text, or a link on the page leads to the reference text; the content of the reference text may be referred to as reference text data). The target search results are a subset of the multiple search results that are triggered by consecutive clicks (for example, their links are clicked consecutively), and the text topic of the reference text data that a target search result links to contains the candidate search word.
  • The above process can be regarded as counting, for each candidate search word (query), the proportion of cases in which the query appears in the text topic (title) of consecutively clicked results; this proportion value may be called the proportion information.
  • After the proportion of the target search results among the multiple search results is counted and used as the search feature, the proportion information can be compared with a set threshold (for example, 90%), and the comparison result determines whether the candidate search word is recognized as a new target word.
  • For example, a candidate search word (query) whose proportion is greater than or equal to 90% can be used as a new target word.
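  • The proportion test described above can be sketched as follows. The `SearchResult` type and its `clicked` flag are hypothetical stand-ins for search-log data; the 90% threshold and the greater-than-or-equal comparison follow the example in the text.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class SearchResult:
    title: str      # text topic of the reference text data
    clicked: bool   # hypothetical flag: reached by consecutive clicks


def target_ratio(query: str, results: List[SearchResult]) -> float:
    """Share of results that were clicked and whose title contains the query."""
    if not results:
        return 0.0
    hits = sum(1 for r in results if r.clicked and query in r.title)
    return hits / len(results)


def is_new_target_word(
    query: str, results: List[SearchResult], threshold: float = 0.9
) -> bool:
    """Treat the candidate search word as a new target word at or above the threshold."""
    return target_ratio(query, results) >= threshold
```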
  • A machine learning method can also be used to adaptively identify a word as a new word and use it as the target word.
  • The identified target word can be, for example, a Chinese word (2 to 4 characters), such as endowment insurance, or a compound word (2 to 8 characters), such as QDII fund.
  • the labeling method can be more adapted to the requirements of the business scenario, so that the identified labels are more in line with the requirements of the business scenario.
  • The processing logic of the NER model can also be fused into Bidirectional Encoder Representations from Transformers (BERT), using the unsupervised pre-training of the BERT language model, so that business keywords can be identified from the entity words with the fused model.
  • The contextual semantic information of the entity mentions can be combined with a correlation analysis between the entity words (for example, word vectors can be used to analyze the contextual semantic information and the relations between entity words, giving a degree of correlation), and the TextRank keyword extraction technique can be used to make the correlation measure more accurate. This improves the keyword extraction of the fused model, so that business keywords can be identified automatically from the entity words.
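  • For reference, plain TextRank keyword ranking can be sketched as below. This is a simplified sketch that uses an unweighted co-occurrence graph; it does not include the BERT/NER fusion or the word-vector correlation mentioned above, and the window size, damping factor, and iteration count are assumed defaults.

```python
from collections import defaultdict
from typing import Dict, List


def textrank(
    tokens: List[str], window: int = 2, damping: float = 0.85, iterations: int = 50
) -> Dict[str, float]:
    """Rank tokens by iterating PageRank-style updates on a co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, tok in enumerate(tokens):
        # Link each token to the distinct tokens within `window` positions after it.
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            if tokens[j] != tok:
                neighbors[tok].add(tokens[j])
                neighbors[tokens[j]].add(tok)
    nodes = sorted(set(tokens))
    scores = {n: 1.0 for n in nodes}
    for _ in range(iterations):
        updated = {}
        for n in nodes:
            rank = sum(scores[m] / len(neighbors[m]) for m in neighbors[n])
            updated[n] = (1.0 - damping) + damping * rank
        scores = updated
    return scores


def top_keywords(tokens: List[str], k: int = 3, window: int = 2) -> List[str]:
    """Return the k highest-scoring tokens, ties broken alphabetically."""
    scores = textrank(tokens, window=window)
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

  • A variant closer to the description would weight each edge by the similarity of the two words' vectors instead of treating all co-occurrences equally.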
  • After the target words are obtained, the corresponding first label can be selected from the preconfigured label library according to the target words.
  • The label corresponding to the target word may be referred to as the first label, and the first label may be used to label the text data.
  • Selecting the corresponding first label from the preconfigured label library may consist of: processing the target word with a word vector analysis algorithm to obtain a feature representation of the target word; mapping that feature representation into the vector space to obtain the word vector of the target word; and matching this word vector against the label word vector of each label in the label library to determine their similarity. If the similarity is greater than a threshold (for example, 90%), the label corresponding to that label word vector is determined to match the target word and can be used as the first label.
  • The first label corresponding to the target word may also be selected from the preconfigured label library in any other feasible manner, such as model matching or selection by mathematical operation.
  • S308 Determine the corresponding second label according to the business keyword and in combination with the pre-trained label extraction model.
  • The corresponding second label can be determined according to the business keywords in combination with the pre-trained label extraction model. The label corresponding to the business keywords may be referred to as the second label, and the second label may be used to label the text data.
  • For example, the business keywords can be input into the pre-trained label extraction model (which can be trained on massive training data), and the label output by the model that matches the business keywords is taken as the second label.
  • the first label and the second label can be directly used to mark the text data.
  • In this embodiment, text data is acquired and processed to obtain corresponding target words and business keywords; a corresponding first label is selected from the preconfigured label library according to the target words; a corresponding second label is determined according to the business keywords in combination with the pre-trained label extraction model; and the text data is labeled with the first label and the second label. The labeling method can thus adapt automatically to labeling new words in text data, which effectively improves the efficiency and accuracy of text data labeling.
  • The search feature can be used to determine the word frequency of the candidate search word in the search domain, so that word frequency is taken into account when identifying new words. That is, among the words not recognized by the human-assisted labeling platform, target words are selected by combining the search features of each word, which helps ensure the quality of new-word recognition and improves its accuracy and reasonableness.
  • the labeling method can be more adapted to the requirements of the business scenario, so that the identified tags are more in line with the requirements of the business scenario.
  • FIG. 4 is a schematic structural diagram of an apparatus for labeling text data according to an embodiment of the present application.
  • the labeling device 40 for text data includes:
  • an acquisition module 401, configured to acquire text data;
  • a processing module 402, configured to process the text data to obtain corresponding target words and business keywords;
  • a selection module 403, configured to select the corresponding first label from the preconfigured label library according to the target word;
  • a determination module 404, configured to determine the corresponding second label according to the business keyword in combination with the pre-trained label extraction model; and
  • a labeling module 405, configured to label the text data with the first label and the second label.
  • FIG. 5 is a schematic structural diagram of an annotation device for text data proposed by another embodiment of the present application.
  • the processing module 402 includes:
  • the word segmentation processing submodule 4021 is used to perform word segmentation processing on the text data to obtain multiple candidate search words
  • the entity identification submodule 4022 is used to perform named entity identification on the text data to obtain a plurality of corresponding entity words
  • the processing sub-module 4023 is configured to select a target word from the plurality of candidate search words, and identify the business keyword from the plurality of entity words.
  • The processing sub-module 4023 is specifically configured to:
  • identify the target word from among the plurality of candidate search words based on the plurality of search features.
  • For the processing sub-module 4023, specifically:
  • the target search result is a search result triggered by consecutive clicks among multiple search results; the search results are obtained by a search engine searching with the candidate search word; and the target search result includes the reference text data, where the candidate search word exists in the text topic of the reference text data.
  • The processing sub-module 4023 is specifically configured to:
  • determine that the candidate search word is the target word if the proportion information is greater than a set threshold.
  • the entity identification sub-module 4022 is specifically used for:
  • the text data is used as the input of the pre-trained named entity recognition model to obtain the corresponding plurality of entity words output by the named entity recognition model.
  • the processing module 402 further includes:
  • the word expansion sub-module 4024 is used to parse the plurality of entity words to obtain a plurality of word features corresponding to the plurality of entity words respectively, and to identify extended entity words from the entity library according to the plurality of word features;
  • processing sub-module 4023 is specifically used for:
  • the business keyword is identified from the plurality of entity words and the extended entity word.
  • Corresponding to the labeling methods for text data provided in the embodiments of FIGS. 1 to 3, the present application also provides a labeling device for text data. Because the labeling device provided by the embodiments of the present application corresponds to the labeling methods provided in the embodiments of FIGS. 1 to 3, the implementations of the labeling method also apply to the labeling device and are not described again in detail here.
  • In this embodiment, text data is acquired and processed to obtain corresponding target words and business keywords; a corresponding first label is selected from the preconfigured label library according to the target words; a corresponding second label is determined according to the business keywords in combination with the pre-trained label extraction model; and the text data is labeled with the first label and the second label. The labeling method can thus adapt automatically to labeling new words in text data, which effectively improves the efficiency and accuracy of text data labeling.
  • The present application also proposes a computer device, including: a memory, a processor, and a computer program stored in the memory and running on the processor.
  • The present application also proposes a non-transitory computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, it implements the labeling method for text data proposed in the foregoing embodiments of the present application.
  • The present application also proposes a computer program product; when instructions in the computer program product are executed by a processor, the labeling method for text data proposed in the foregoing embodiments of the present application is performed.
  • FIG. 6 shows a block diagram of an exemplary computer device suitable for implementing embodiments of the present application.
  • the computer device 12 shown in FIG. 6 is only an example, and should not impose any limitations on the functions and scope of use of the embodiments of the present application.
  • computer device 12 takes the form of a general-purpose computing device.
  • Components of computer device 12 may include, but are not limited to, one or more processors or processing units 16 , system memory 28 , and a bus 18 connecting various system components including system memory 28 and processing unit 16 .
  • Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, a graphics acceleration port, a processor, or a local bus using any of a variety of bus structures.
  • these architectures include, but are not limited to, Industry Standard Architecture (hereinafter referred to as: ISA) bus, Micro Channel Architecture (hereinafter referred to as: MAC) bus, enhanced ISA bus, video electronics Standards Association (Video Electronics Standards Association; hereinafter referred to as: VESA) local bus and Peripheral Component Interconnection (Peripheral Component Interconnection; hereinafter referred to as: PCI) bus.
  • Computer device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by computer device 12, including both volatile and nonvolatile media, removable and non-removable media.
  • the memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (Random Access Memory; hereinafter: RAM) 30 and/or cache memory 32 .
  • Computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media.
  • storage system 34 may be used to read and write to non-removable, non-volatile magnetic media (not shown in FIG. 6, commonly referred to as a "hard drive").
  • although not shown in FIG. 6, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from and writing to a removable non-volatile optical disk (e.g., a Compact Disc Read Only Memory (CD-ROM), a Digital Video Disc Read Only Memory (DVD-ROM), or other optical media) may be provided.
  • each drive may be connected to bus 18 through one or more data media interfaces.
  • Memory 28 may include at least one program product having a set (eg, at least one) of program modules configured to perform the functions of various embodiments of the present application.
  • a program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28; such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data, and each of these examples, or some combination of them, may include an implementation of a network environment.
  • Program modules 42 generally perform the functions and/or methods of the embodiments described herein.
  • Computer device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, display 24), with one or more devices that enable a user to interact with computer device 12, and/or with any device (e.g., a network card or modem) that enables computer device 12 to communicate with one or more other computing devices. Such communication may take place through input/output (I/O) interface 22.
  • the computer device 12 can also communicate, through the network adapter 20, with one or more networks, such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet.
  • network adapter 20 communicates with other modules of computer device 12 via bus 18 .
  • It should be understood that, although not shown, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives and data backup storage systems.
  • the processing unit 16 executes various functional applications and data processing by running the programs stored in the system memory 28 , for example, implementing the annotation method for text data mentioned in the foregoing embodiments.
  • any process or method description in the flowcharts, or otherwise described herein, may be understood as representing a module, segment, or portion of code that comprises one or more executable instructions for implementing a specified logical function or step of the process. The scope of the preferred embodiments of the present application includes alternative implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present application belong.
  • each functional unit in each embodiment of the present application may be integrated into one processing module, or each unit may exist physically alone, or two or more units may be integrated into one module.
  • the above-mentioned integrated modules can be implemented in the form of hardware, and can also be implemented in the form of software function modules. If the integrated modules are implemented in the form of software functional modules and sold or used as independent products, they may also be stored in a computer-readable storage medium.
  • the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, and the like.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

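A method, apparatus, computer device, and storage medium for annotating text data. The method includes: acquiring text data (S101); processing the text data to obtain a corresponding target word and business keyword (S102); selecting a corresponding first label from a preconfigured label library according to the target word (S103); determining a corresponding second label according to the business keyword, in combination with a pre-trained label extraction model (S104); and annotating the text data with the first label and the second label (S105).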

Description

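Method, apparatus, computer device, and storage medium for annotating text data
Cross-reference to related applications
This application is based on, and claims priority to, Chinese patent application No. 202110251799.1, filed on March 8, 2021, the entire contents of which are incorporated herein by reference.
Technical field
The present application relates to the technical field of artificial intelligence, and in particular to a method, apparatus, computer device, and storage medium for annotating text data.
Background
Internet application scenarios generate large amounts of text data, much of which needs to be annotated. The resulting labels (such as semantic labels and category labels) can be used in fields such as recommendation and risk control.
In the related art, text data is annotated mainly by hand, or by machine learning and keyword-matching retrieval based on text data that has already been annotated.
These approaches cannot adapt automatically to annotating new words in text data, which reduces both the efficiency and the accuracy of annotation.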
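Summary
The present application aims to solve, at least to some extent, one of the technical problems in the related art.
To this end, an object of the present application is to provide a method, apparatus, computer device, and storage medium for annotating text data that allow the annotation method to adapt automatically to annotating new words in text data, thereby effectively improving annotation efficiency and accuracy.
To achieve the above object, the method for annotating text data proposed in embodiments of a first aspect of the present application includes: acquiring text data; processing the text data to obtain a corresponding target word and business keyword; selecting a corresponding first label from a preconfigured label library according to the target word; determining a corresponding second label according to the business keyword, in combination with a pre-trained label extraction model; and annotating the text data with the first label and the second label.
The method of the first aspect acquires text data, processes it to obtain a corresponding target word and business keyword, selects a corresponding first label from a preconfigured label library according to the target word, determines a corresponding second label according to the business keyword in combination with a pre-trained label extraction model, and annotates the text data with the first label and the second label. The annotation method can thus adapt automatically to annotating new words in text data, effectively improving annotation efficiency and accuracy.
To achieve the above object, the apparatus for annotating text data proposed in embodiments of a second aspect of the present application includes: an acquisition module configured to acquire text data; a processing module configured to process the text data to obtain a corresponding target word and business keyword; a selection module configured to select a corresponding first label from a preconfigured label library according to the target word; a determination module configured to determine a corresponding second label according to the business keyword, in combination with a pre-trained label extraction model; and an annotation module configured to annotate the text data with the first label and the second label.
The apparatus of the second aspect acquires text data, processes it to obtain a corresponding target word and business keyword, selects a corresponding first label from a preconfigured label library according to the target word, determines a corresponding second label according to the business keyword in combination with a pre-trained label extraction model, and annotates the text data with the first label and the second label. The annotation method can thus adapt automatically to annotating new words in text data, effectively improving annotation efficiency and accuracy.
Embodiments of a third aspect of the present application propose a computer device including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the method for annotating text data proposed in the embodiments of the first aspect is implemented.
Embodiments of a fourth aspect of the present application propose a non-transitory computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the method for annotating text data proposed in the embodiments of the first aspect is implemented.
Embodiments of a fifth aspect of the present application propose a computer program product; when instructions in the computer program product are executed by a processor, the method for annotating text data proposed in the embodiments of the first aspect is performed.
Additional aspects and advantages of the present application will be given in part in the following description, and in part will become apparent from the following description or be learned through practice of the present application.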
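Brief description of the drawings
The above and/or additional aspects and advantages of the present application will become apparent and easy to understand from the following description of the embodiments in conjunction with the accompanying drawings, in which:
FIG. 1 is a schematic flowchart of a method for annotating text data proposed in an embodiment of the present application;
FIG. 2 is a schematic diagram of an application in an embodiment of the present application;
FIG. 3 is a schematic flowchart of a method for annotating text data proposed in another embodiment of the present application;
FIG. 4 is a schematic structural diagram of an apparatus for annotating text data proposed in an embodiment of the present application;
FIG. 5 is a schematic structural diagram of an apparatus for annotating text data proposed in another embodiment of the present application;
FIG. 6 shows a block diagram of an exemplary computer device suitable for implementing embodiments of the present application.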
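Detailed description
Embodiments of the present application are described in detail below, and examples of the embodiments are shown in the accompanying drawings, in which the same or similar reference numerals denote the same or similar elements, or elements having the same or similar functions, throughout. The embodiments described below with reference to the drawings are exemplary; they serve only to explain the present application and are not to be construed as limiting it. On the contrary, the embodiments of the present application include all changes, modifications, and equivalents that fall within the spirit and scope of the appended claims.
FIG. 1 is a schematic flowchart of a method for annotating text data proposed in an embodiment of the present application.
It should be noted that the method of this embodiment is executed by an apparatus for annotating text data. The apparatus may be implemented in software and/or hardware and may be configured in an electronic device, which may include, but is not limited to, a terminal or a server.
As shown in FIG. 1, the method includes steps S101 to S105.
S101: Acquire text data.
The text data is, for example, the content of a passage of text that carries corresponding semantics.
In the embodiments of the present application, the electronic device may provide a text input interface, receive a passage of text entered by a user, and parse the content of that text as the text data. Alternatively, it may take a passage of speech recorded by the user, convert the speech to the corresponding text, and parse the content of that text as the text data.
The acquisition of text data may be an automated parsing process, so that the annotation of text data is automated in a closed loop.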
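S102: Process the text data to obtain a corresponding target word and business keyword.
The target word may be a word that the human-assisted labeling platform has not recognized, or another word with certain characteristics determined by business requirements.
For example, during labeling the human-assisted labeling platform can use certain models to adaptively identify, from the text data, the words needed for annotation. Words the platform can identify are recognized words. Because the platform's recognition accuracy is limited, some words may be missed in real annotation scenarios. The embodiments of the present application therefore provide automated, closed-loop recognition of the words the platform fails to recognize, so that such target words can subsequently be used to annotate the text data, improving annotation accuracy.
To make the annotation method better suited to the needs of the business scenario, and the resulting labels better matched to those needs, the text data can also be processed to obtain a corresponding business keyword, which describes the business type of the text data (for example, finance, funds, or education).
Optionally, in some embodiments, processing the text data to obtain the corresponding target word and business keyword may include: performing word segmentation on the text data to obtain multiple candidate search words; performing named entity recognition on the text data to obtain multiple corresponding entity words; and selecting the target word from the candidate search words and identifying the business keyword from the entity words. This effectively improves the accuracy of mining target words and business keywords. Because the new-word mining is search-based, it broadens the coverage of the new words found; because the business keyword extraction is based on named entity recognition, it improves recognition efficiency while preserving accuracy.
For example, the text data can be segmented into multiple words, each of which is used as a candidate search word to trigger a corresponding search in a search engine, so as to determine the best-matching target word. Named Entity Recognition (NER) can also be performed on the text data to obtain multiple corresponding entity words, from which the business keyword is derived.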
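The text does not say which segmenter produces the candidate search words. As a rough illustration only, the sketch below uses dictionary-based forward maximum matching, a simple classical technique swapped in here as a stand-in; the function name `forward_max_match`, the vocabulary, and the `max_len` limit are all assumptions made for this example and are not taken from the application.

```python
def forward_max_match(text: str, vocab: set[str], max_len: int = 4) -> list[str]:
    """Segment `text` greedily, always taking the longest vocabulary word.

    Hypothetical stand-in for the word segmentation step; characters that
    match no vocabulary entry are emitted as single-character tokens.
    """
    tokens: list[str] = []
    i = 0
    while i < len(text):
        # Try the longest window first and shrink until something matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i : i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


# Each resulting token would then be treated as a candidate search word.
candidates = forward_max_match("我买了指数基金", {"指数基金", "指数", "基金", "买了"})
```

A production system would more plausibly use a trained segmenter; the greedy matcher is only meant to make the "text in, candidate search words out" step concrete.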
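Optionally, in some embodiments, performing named entity recognition on the text data to obtain the corresponding entity words may consist of feeding the text data into a pre-trained NER model and obtaining the entity words that the model outputs. Because the model is trained in advance on massive data, this considerably improves the efficiency and convenience of mining.
In the embodiments of the present application, to effectively extend the coverage of the mined entity words, the method not only recognizes entity words that already appear in the text data but also expands and infers further words from them. After named entity recognition has produced the corresponding entity words, the entity words are parsed to obtain word features corresponding to each of them, and an expanded entity word is then identified from an entity library according to those word features.
The word features may be co-occurrence features, context features, special-symbol features (for example, whether a candidate entity word in the entity library contains a dash, the proportion of occurrences in which the candidate is enclosed in quotation marks, the proportion in which it is enclosed in brackets, and the proportion of English letters and digits in the candidate), the Inverse Document Frequency (IDF), completeness features, word-vector features, and so on.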
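To make a few of these features concrete, here is a minimal sketch that computes the special-symbol features and a smoothed IDF for one candidate entity word. The function names, the `contexts` input (sentences in which the candidate occurs), and the particular IDF smoothing are assumptions chosen for illustration; the application does not specify how the features are calculated.

```python
import math
import re


def symbol_features(candidate: str, contexts: list[str]) -> dict[str, float]:
    """Surface features for one candidate entity word (illustrative only)."""
    n = max(len(contexts), 1)
    word = re.escape(candidate)
    # Count contexts where the candidate is directly wrapped in quotes or brackets.
    quoted = sum(1 for s in contexts if re.search(r'["“]' + word + r'["”]', s))
    bracketed = sum(1 for s in contexts if re.search(r"[(（]" + word + r"[)）]", s))
    ascii_alnum = sum(1 for ch in candidate if ch.isascii() and ch.isalnum())
    return {
        "has_dash": float("-" in candidate or "\u2014" in candidate),
        "quoted_ratio": quoted / n,
        "bracketed_ratio": bracketed / n,
        "ascii_alnum_ratio": ascii_alnum / max(len(candidate), 1),
    }


def idf(term: str, documents: list[str]) -> float:
    """Smoothed inverse document frequency: log((1 + N) / (1 + df)) + 1."""
    df = sum(1 for d in documents if term in d)
    return math.log((1 + len(documents)) / (1 + df)) + 1.0
```

Rarer terms receive a larger IDF, so a candidate that appears in few documents is weighted as more distinctive than one that appears everywhere.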
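The above process uses the entity words already identified to infer and derive expanded entity words. For example, the entity words are parsed to obtain the word features corresponding to each of them, and candidate text data is acquired, either from a text database or through online search. Each candidate text is then segmented, and the segmented words with higher frequency are filtered out as candidate entity words; an entity library is built from this large set of candidate entity words. The word features are then matched against the entity library, the degree of match of each candidate entity word is scored, and the candidates with higher scores are selected as expanded entity words.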
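The procedure just described (build an entity library from frequent segmented words, then score candidates against the known entity words) can be sketched as follows. The scoring rule here uses only a co-occurrence feature, namely the share of texts containing the candidate that also contain a known entity word. That rule, the function names, and the thresholds `min_freq` and `min_score` are assumptions for illustration, not the scoring used in the application.

```python
from collections import Counter


def build_entity_library(segmented_texts: list[list[str]], min_freq: int = 2) -> set[str]:
    """Keep tokens whose corpus frequency reaches `min_freq` as candidate entity words."""
    freq = Counter(tok for text in segmented_texts for tok in text)
    return {tok for tok, n in freq.items() if n >= min_freq}


def expand_entities(known, segmented_texts, library, min_score=0.5):
    """Return library candidates that co-occur often enough with known entity words."""
    known = set(known)
    scores = {}
    for cand in library - known:
        containing = [set(t) for t in segmented_texts if cand in t]
        if not containing:
            continue
        co_occurring = sum(1 for t in containing if t & known)
        scores[cand] = co_occurring / len(containing)
    kept = [c for c, s in scores.items() if s >= min_score]
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(kept, key=lambda c: (-scores[c], c))
```

In practice the score would combine several of the word features listed earlier; a single co-occurrence ratio is used here only to keep the example short.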
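Of course, any other feasible approach may be used to infer and derive expanded entity words from the identified entity words, such as artificial intelligence or machine learning approaches.
After word expansion and inference based on the entity words that have appeared, the business keyword can be identified from among the entity words and the expanded entity words, which effectively extends the coverage of the mined business keywords and safeguards the accuracy of business keyword identification.
S103: Select a corresponding first label from a preconfigured label library according to the target word.
After the text data has been processed to obtain the corresponding target word and business keyword, a corresponding first label can be selected from the preconfigured label library according to the target word.
The label corresponding to the target word is referred to as the first label, and the first label can be used to annotate the text data.
In some embodiments, selecting the first label may proceed as follows. A word-vector analysis algorithm processes the target word to obtain its feature representation, which is then mapped into a vector space to obtain the word vector of the target word. This word vector is matched against the annotation word vector of each label in the label library, and the similarity between them is determined. If the similarity is greater than a threshold (for example, 90%), the label corresponding to that annotation word vector is determined to fit the target word and can be used as the first label.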
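The matching step can be sketched as below, assuming cosine similarity as the similarity measure; the text only says "similarity", so cosine is an assumption here. The function names and the toy two-dimensional vectors are hypothetical, and real word vectors would come from whatever embedding model the system uses.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def select_first_labels(target_vec, label_vectors, threshold=0.9):
    """Return labels whose vector similarity to the target word exceeds `threshold`."""
    scored = [(label, cosine(target_vec, vec)) for label, vec in label_vectors.items()]
    # Best match first, keeping only labels strictly above the threshold.
    return [label for label, s in sorted(scored, key=lambda x: -x[1]) if s > threshold]


# Hypothetical label library with toy vectors.
label_library = {"fund": [1.0, 0.0], "insurance": [0.0, 1.0]}
first_labels = select_first_labels([0.9, 0.1], label_library)
```

Returning every label above the threshold, rather than only the single best one, is a design choice of this sketch; the text leaves open whether one or several labels are selected.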
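Of course, any other feasible approach may be used to select the first label corresponding to the target word from the preconfigured label library, such as model matching or selection by mathematical operation.
S104: Determine a corresponding second label according to the business keyword, in combination with a pre-trained label extraction model.
After the text data has been processed to obtain the corresponding target word and business keyword, a corresponding second label can be determined according to the business keyword in combination with the pre-trained label extraction model. The label corresponding to the business keyword is referred to as the second label, and the second label can be used to annotate the text data.
For example, the business keyword can be input into the pre-trained label extraction model (which may have been trained on massive training data), and the second label matching the business keyword is obtained from the model's output.
S105: Annotate the text data with the first label and the second label.
After the first label corresponding to the target word and the second label corresponding to the business keyword have been identified, the first label and the second label can be used directly to annotate the text data.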
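Steps S101 to S105 can be tied together in a short sketch. The three callables passed in (`process`, `select_first`, `extract_second`) are hypothetical placeholders for the text processing step, the label library lookup, and the pre-trained label extraction model; none of these names or signatures come from the application.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Annotation:
    """Text data together with the labels attached to it."""
    text: str
    first_labels: list[str] = field(default_factory=list)
    second_labels: list[str] = field(default_factory=list)


def annotate(
    text: str,                                                   # S101: acquired text data
    process: Callable[[str], tuple[list[str], list[str]]],
    select_first: Callable[[str], list[str]],
    extract_second: Callable[[str], list[str]],
) -> Annotation:
    target_words, business_keywords = process(text)              # S102
    first = [lab for w in target_words for lab in select_first(w)]           # S103
    second = [lab for k in business_keywords for lab in extract_second(k)]   # S104
    return Annotation(text, first, second)                       # S105
```

Passing the components in as functions keeps the sketch independent of any particular segmenter, label library, or model.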
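FIG. 2 is a schematic diagram of an application in an embodiment of the present application. It includes an artificial intelligence (AI) assisted classification module and an AI-assisted labeling module. The AI-assisted classification module assists the human-assisted labeling platform in identifying new words and business keywords, and the AI-assisted labeling module assists the platform in identifying the first label and the second label, thereby achieving automated, closed-loop annotation.
In this embodiment, text data is acquired and processed to obtain a corresponding target word and business keyword; a corresponding first label is selected from a preconfigured label library according to the target word; a corresponding second label is determined according to the business keyword in combination with a pre-trained label extraction model; and the text data is annotated with the first label and the second label. The annotation method can thus adapt automatically to annotating new words in text data, effectively improving annotation efficiency and accuracy.
FIG. 3 is a schematic flowchart of a method for annotating text data proposed in another embodiment of the present application.
As shown in FIG. 3, the method includes steps S301 to S309.
S301: Acquire text data.
S302: Perform word segmentation on the text data to obtain multiple candidate search words.
S303: Perform named entity recognition on the text data to obtain multiple corresponding entity words.
For a description of steps S301 to S303, refer to the embodiment above; the details are not repeated here.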
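S304: Acquire multiple search features corresponding to the multiple candidate search words.
A search feature is a feature from the search domain that is associated with searching for the candidate search word in a search engine, such as the search volume or the number of views of the corresponding search result pages.
In this embodiment, the search features can be used to determine how frequently the candidate search word occurs in the search domain, and that frequency is used as a search feature, so that word frequency is taken into account when identifying new words. In other words, the target word is filtered out from the words not recognized by the human-assisted labeling platform according to each word's search features, which safeguards the quality of new-word recognition and improves its accuracy and reasonableness.
Optionally, in some embodiments, the search feature of each candidate search word may be calculated by acquiring at least one target search result corresponding to the candidate search word, computing the proportion of the target search results among the multiple search results, and using that proportion information as the search feature.
Here, a target search result is a search result, among the multiple search results, that was triggered by consecutive clicks. A search result is reference text data that a search engine retrieved for the candidate search word. A target search result includes reference text data, and the candidate search word appears in the text title of that reference text data.
In other words, a search result may be a search result page corresponding to a reference text (for example, the page displays a reference text, and the link on the search result display interface leads to that reference text, whose content is referred to as reference text data). The target search results are a subset of the multiple search results; each was triggered by consecutive clicks (for example, its link was clicked consecutively), and the text title of the reference text data it links to contains the candidate search word.
The above process can be viewed as computing, for each candidate search word (query), the proportion of clicked text titles (title) in which the query appears contiguously; this proportion is the proportion information.
S305: If the proportion information is greater than a set threshold, determine that the candidate search word is the target word.
After the proportion of target search results among the multiple search results has been computed and used as the search feature, the proportion information can be compared with the set threshold (for example, 90%), and the comparison result determines whether the candidate search word can be identified as a new target word.
For example, a candidate search word (query) whose contiguous-occurrence proportion is greater than or equal to 90% can be taken as a new target word. Specifically, if the human-assisted labeling platform repeatedly marks "指数基金" (index fund) in multiple articles (which can be regarded as reference text data obtained by search), and the platform's back-end model cannot recognize the type of this word, a machine learning method can be used adaptively to identify it as a new word and take it as the target word.
An identified target word may be, for example, a Chinese word (2 to 4 characters), such as 两全险 (endowment insurance), or a compound word (2 to 8 characters), such as QDII基金 (QDII fund).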
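The proportion test in S304 and S305 can be sketched as follows. The sketch assumes the denominator is the set of clicked titles for a query and uses "greater than or equal to" as in the 90% example above; both are interpretations, because the text describes the comparison as "greater than" in one place and "greater than or equal to" in another. The `click_log` structure and function names are hypothetical.

```python
def title_hit_ratio(query: str, clicked_titles: list[str]) -> float:
    """Share of clicked result titles that contain `query` as a contiguous substring."""
    if not clicked_titles:
        return 0.0
    hits = sum(1 for title in clicked_titles if query in title)
    return hits / len(clicked_titles)


def pick_target_words(candidates, click_log, threshold=0.9):
    """Keep candidate search words whose title hit ratio reaches the threshold."""
    return [
        q for q in candidates
        if title_hit_ratio(q, click_log.get(q, [])) >= threshold
    ]


# Hypothetical click log: query -> titles of results users clicked.
click_log = {
    "指数基金": ["指数基金入门"] * 9 + ["基金指数对比"],
    "买了": ["今天买了什么", "理财日记"],
}
targets = pick_target_words(["指数基金", "买了", "未知词"], click_log)
```

Here "指数基金" appears in 9 of 10 clicked titles and is kept, while "买了" appears in only half of its clicked titles and is dropped.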
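S306: Identify the business keyword from the multiple entity words.
In this embodiment, identifying the business keyword from the multiple entity words makes the annotation method better suited to the needs of the business scenario and the resulting labels better matched to those needs.
For example, the text data can be processed by first identifying multiple entity words in it and then identifying the corresponding business keyword from those entity words. The business keyword describes the business type of the text data (for example, finance, funds, or education).
In other embodiments, the processing logic of the NER model can be merged into a BERT (Bidirectional Encoder Representations from Transformers) pre-trained language model that uses an unsupervised method, and the merged model is used to identify the business keyword from the multiple entity words. For example, the contextual semantic information of an entity can be combined with an analysis of the degree of association between entity words (word vectors can be used to analyze the relationship between the contextual semantic information and the entity words, which serves as the degree of association). TextRank-based keyword extraction can be used alongside this to make the association measure more accurate, improving the keyword extraction of the merged model so that the business keyword is identified from the multiple entity words automatically.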
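For readers unfamiliar with TextRank, the sketch below shows a bare-bones version of the idea: build an undirected co-occurrence graph over tokens and run PageRank-style updates on it. It is a simplified, unweighted variant written for illustration only; the window size, damping factor, and iteration count are assumptions, and the sketch does not reproduce the merged NER and BERT model described above.

```python
from collections import defaultdict


def textrank(tokens, window=2, damping=0.85, iterations=50):
    """Rank tokens by PageRank over an undirected co-occurrence graph."""
    neighbors = defaultdict(set)
    for i, tok in enumerate(tokens):
        # Link each token to the tokens that follow it within the window.
        for other in tokens[i + 1 : i + 1 + window]:
            if other != tok:
                neighbors[tok].add(other)
                neighbors[other].add(tok)
    scores = {tok: 1.0 for tok in neighbors}
    for _ in range(iterations):
        scores = {
            tok: (1 - damping)
            + damping * sum(scores[n] / len(neighbors[n]) for n in neighbors[tok])
            for tok in neighbors
        }
    # Highest score first; ties broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))


ranked = textrank(["基金", "指数", "基金", "定投", "基金", "收益"], window=1)
```

With a window of 1, "基金" is adjacent to every other token in this toy sequence, so it becomes the hub of the graph and ranks first.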
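S307: Select a corresponding first label from a preconfigured label library according to the target word.
After the text data has been processed to obtain the corresponding target word and business keyword, a corresponding first label can be selected from the preconfigured label library according to the target word.
The label corresponding to the target word is referred to as the first label, and the first label can be used to annotate the text data.
In some embodiments, selecting the first label may proceed as follows. A word-vector analysis algorithm processes the target word to obtain its feature representation, which is then mapped into a vector space to obtain the word vector of the target word. This word vector is matched against the annotation word vector of each label in the label library, and the similarity between them is determined. If the similarity is greater than a threshold (for example, 90%), the label corresponding to that annotation word vector is determined to fit the target word and can be used as the first label.
Of course, any other feasible approach may be used to select the first label corresponding to the target word from the preconfigured label library, such as model matching or selection by mathematical operation.
S308: Determine a corresponding second label according to the business keyword, in combination with a pre-trained label extraction model.
After the text data has been processed to obtain the corresponding target word and business keyword, a corresponding second label can be determined according to the business keyword in combination with the pre-trained label extraction model. The label corresponding to the business keyword is referred to as the second label, and the second label can be used to annotate the text data.
For example, the business keyword can be input into the pre-trained label extraction model (which may have been trained on massive training data), and the second label matching the business keyword is obtained from the model's output.
S309: Annotate the text data with the first label and the second label.
After the first label corresponding to the target word and the second label corresponding to the business keyword have been identified, the first label and the second label can be used directly to annotate the text data.
FIG. 2 is a schematic diagram of an application in an embodiment of the present application. It includes an artificial intelligence (AI) assisted classification module and an AI-assisted labeling module. The AI-assisted classification module assists the human-assisted labeling platform in identifying new words and business keywords, and the AI-assisted labeling module assists the platform in identifying the first label and the second label, thereby achieving automated, closed-loop annotation.
In this embodiment, text data is acquired and processed to obtain a corresponding target word and business keyword; a corresponding first label is selected from a preconfigured label library according to the target word; a corresponding second label is determined according to the business keyword in combination with a pre-trained label extraction model; and the text data is annotated with the first label and the second label. The annotation method can thus adapt automatically to annotating new words in text data, effectively improving annotation efficiency and accuracy. The search features can be used to determine how frequently a candidate search word occurs in the search domain, and that frequency is used as a search feature, so that word frequency is taken into account when identifying new words; that is, the target word is filtered out from the words not recognized by the human-assisted labeling platform according to each word's search features, which safeguards the quality of new-word recognition and improves its accuracy and reasonableness. Identifying the business keyword from the multiple entity words makes the annotation method better suited to the needs of the business scenario and the resulting labels better matched to those needs.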
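FIG. 4 is a schematic structural diagram of an apparatus for annotating text data proposed in an embodiment of the present application.
As shown in FIG. 4, the apparatus 40 for annotating text data includes:
an acquisition module 401, configured to acquire text data;
a processing module 402, configured to process the text data to obtain a corresponding target word and business keyword;
a selection module 403, configured to select a corresponding first label from a preconfigured label library according to the target word;
a determination module 404, configured to determine a corresponding second label according to the business keyword, in combination with a pre-trained label extraction model; and
an annotation module 405, configured to annotate the text data with the first label and the second label.
In some embodiments of the present application, as shown in FIG. 5, which is a schematic structural diagram of an apparatus for annotating text data proposed in another embodiment of the present application, the processing module 402 includes:
a word segmentation submodule 4021, configured to perform word segmentation on the text data to obtain multiple candidate search words;
an entity recognition submodule 4022, configured to perform named entity recognition on the text data to obtain multiple corresponding entity words;
a processing submodule 4023, configured to select the target word from the multiple candidate search words and to identify the business keyword from the multiple entity words.
In some embodiments of the present application, the processing submodule 4023 is specifically configured to:
acquire multiple search features corresponding to the multiple candidate search words;
identify the target word from the multiple candidate search words according to the multiple search features.
In some embodiments of the present application, the processing submodule 4023 is specifically configured to:
acquire at least one target search result corresponding to the candidate search word;
compute the proportion information of the target search results among the multiple search results, and use the proportion information as the search feature;
wherein a target search result is a search result, among the multiple search results, that was triggered by consecutive clicks; a search result is reference text data that a search engine retrieved for the candidate search word; and the target search result includes the reference text data, with the candidate search word appearing in the text title of the reference text data.
In some embodiments of the present application, the processing submodule 4023 is specifically configured to:
determine that the candidate search word is the target word if the proportion information is greater than a set threshold.
In some embodiments of the present application, the entity recognition submodule 4022 is specifically configured to:
use the text data as the input of a pre-trained named entity recognition model to obtain the corresponding multiple entity words output by the named entity recognition model.
In some embodiments of the present application, as shown in FIG. 5, the processing module 402 further includes:
a word expansion submodule 4024, configured to parse the multiple entity words to obtain multiple word features corresponding to the multiple entity words, and to identify an expanded entity word from an entity library according to the multiple word features;
in which case the processing submodule 4023 is specifically configured to:
identify the business keyword from the multiple entity words and the expanded entity word.
The apparatus for annotating text data provided by the present application corresponds to the method for annotating text data provided in the embodiments of FIG. 1 to FIG. 3 above. The implementations of the method therefore also apply to the apparatus and are not described again in detail here.
In this embodiment, text data is acquired and processed to obtain a corresponding target word and business keyword; a corresponding first label is selected from a preconfigured label library according to the target word; a corresponding second label is determined according to the business keyword in combination with a pre-trained label extraction model; and the text data is annotated with the first label and the second label. The annotation method can thus adapt automatically to annotating new words in text data, effectively improving annotation efficiency and accuracy.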
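To implement the above embodiments, the present application also proposes a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the program, the method for annotating text data proposed in the foregoing embodiments of the present application is implemented.
To implement the above embodiments, the present application also proposes a non-transitory computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the method for annotating text data proposed in the foregoing embodiments of the present application is implemented.
To implement the above embodiments, the present application also proposes a computer program product; when instructions in the computer program product are executed by a processor, the method for annotating text data proposed in the foregoing embodiments of the present application is performed.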
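FIG. 6 shows a block diagram of an exemplary computer device suitable for implementing embodiments of the present application. The computer device 12 shown in FIG. 6 is only an example and should not impose any limitation on the functions or scope of use of the embodiments of the present application.
As shown in FIG. 6, computer device 12 takes the form of a general-purpose computing device. Its components may include, but are not limited to, one or more processors or processing units 16, a system memory 28, and a bus 18 connecting the various system components (including system memory 28 and processing unit 16).
Bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, a processor, or a local bus using any of a variety of bus structures. For example, these architectures include, but are not limited to, the Industry Standard Architecture (ISA) bus, the Micro Channel Architecture (MCA) bus, the enhanced ISA bus, the Video Electronics Standards Association (VESA) local bus, and the Peripheral Component Interconnect (PCI) bus.
Computer device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by computer device 12, including volatile and non-volatile media, and removable and non-removable media.
Memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. Computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, storage system 34 may be used to read from and write to non-removable, non-volatile magnetic media (not shown in FIG. 6, commonly referred to as a "hard drive"). Although not shown in FIG. 6, a magnetic disk drive for reading from and writing to a removable non-volatile magnetic disk (e.g., a "floppy disk") and an optical disk drive for reading from and writing to a removable non-volatile optical disk (e.g., a Compact Disc Read Only Memory (CD-ROM), a Digital Video Disc Read Only Memory (DVD-ROM), or other optical media) may be provided. In these cases, each drive may be connected to bus 18 through one or more data media interfaces. Memory 28 may include at least one program product having a set (e.g., at least one) of program modules configured to perform the functions of the embodiments of the present application.
A program/utility 40 having a set (at least one) of program modules 42 may be stored, for example, in memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules, and program data; each of these examples, or some combination of them, may include an implementation of a network environment. Program modules 42 generally perform the functions and/or methods of the embodiments described in the present application.
Computer device 12 may also communicate with one or more external devices 14 (e.g., a keyboard, a pointing device, display 24), with one or more devices that enable a user to interact with computer device 12, and/or with any device (e.g., a network card or modem) that enables computer device 12 to communicate with one or more other computing devices. Such communication may take place through input/output (I/O) interface 22. Computer device 12 can also communicate, through network adapter 20, with one or more networks, such as a local area network (LAN), a wide area network (WAN), and/or a public network such as the Internet. As shown, network adapter 20 communicates with the other modules of computer device 12 via bus 18. It should be understood that, although not shown, other hardware and/or software modules may be used in conjunction with computer device 12, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems.
Processing unit 16 executes various functional applications and data processing by running the programs stored in system memory 28, for example implementing the method for annotating text data mentioned in the foregoing embodiments.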
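Other embodiments of the present application will readily occur to those skilled in the art upon consideration of the specification and practice of the invention disclosed herein. The present application is intended to cover any variations, uses, or adaptations of the present application that follow its general principles and include common knowledge or customary technical means in the art not disclosed herein. The specification and embodiments are to be regarded as exemplary only, with the true scope and spirit of the present application indicated by the following claims.
It should be understood that the present application is not limited to the precise structures described above and shown in the drawings, and that various modifications and changes may be made without departing from its scope. The scope of the present application is limited only by the appended claims.
It should be noted that, in the description of the present application, the terms "first", "second", and the like are used for descriptive purposes only and are not to be understood as indicating or implying relative importance. In addition, in the description of the present application, unless otherwise stated, "multiple" means two or more.
Any process or method description in the flowcharts, or otherwise described herein, may be understood as representing a module, segment, or portion of code that comprises one or more executable instructions for implementing a specified logical function or step of the process. The scope of the preferred embodiments of the present application includes alternative implementations in which functions may be performed out of the order shown or discussed, including substantially concurrently or in the reverse order depending on the functions involved, as should be understood by those skilled in the art to which the embodiments of the present application belong.
It should be understood that the various parts of the present application may be implemented in hardware, software, firmware, or a combination thereof. In the above embodiments, multiple steps or methods may be implemented in software or firmware that is stored in a memory and executed by a suitable instruction execution system. For example, if implemented in hardware, as in another embodiment, any one or a combination of the following techniques known in the art may be used: a discrete logic circuit having logic gates for implementing logic functions on data signals, an application-specific integrated circuit having suitable combinational logic gates, a programmable gate array (PGA), a field programmable gate array (FPGA), and so on.
Those of ordinary skill in the art will understand that all or some of the steps of the method embodiments above can be carried out by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs one or a combination of the steps of the method embodiments.
In addition, the functional units in the embodiments of the present application may be integrated into one processing module, each unit may exist physically on its own, or two or more units may be integrated into one module. The integrated module may be implemented in the form of hardware or in the form of a software functional module. If the integrated module is implemented as a software functional module and sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.
In the description of this specification, reference to the terms "one embodiment", "some embodiments", "example", "specific example", or "some examples" means that a specific feature, structure, material, or characteristic described in connection with that embodiment or example is included in at least one embodiment or example of the present application. In this specification, such expressions do not necessarily refer to the same embodiment or example. Moreover, the specific features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples.
Although embodiments of the present application have been shown and described above, it should be understood that these embodiments are exemplary and are not to be construed as limiting the present application; those of ordinary skill in the art may make changes, modifications, substitutions, and variations to the above embodiments within the scope of the present application.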

Claims (17)

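  1. A method for annotating text data, comprising:
    acquiring text data;
    processing the text data to obtain a corresponding target word and business keyword;
    selecting a corresponding first label from a preconfigured label library according to the target word;
    determining a corresponding second label according to the business keyword, in combination with a pre-trained label extraction model; and
    annotating the text data with the first label and the second label.
  2. The method of claim 1, wherein processing the text data to obtain the corresponding target word and business keyword comprises:
    performing word segmentation on the text data to obtain multiple candidate search words;
    performing named entity recognition on the text data to obtain multiple corresponding entity words;
    selecting the target word from the multiple candidate search words, and identifying the business keyword from the multiple entity words.
  3. The method of claim 2, wherein selecting the target word from the multiple candidate search words comprises:
    acquiring multiple search features corresponding to the multiple candidate search words;
    identifying the target word from the multiple candidate search words according to the multiple search features.
  4. The method of claim 3, wherein acquiring the multiple search features corresponding to the multiple candidate search words comprises:
    acquiring at least one target search result corresponding to the candidate search word;
    computing proportion information of the target search results among the multiple search results, and using the proportion information as the search feature;
    wherein a target search result is a search result, among the multiple search results, that was triggered by consecutive clicks; a search result is reference text data that a search engine retrieved for the candidate search word; and the target search result includes the reference text data, with the candidate search word appearing in the text title of the reference text data.
  5. The method of claim 4, wherein identifying the target word from the multiple candidate search words according to the multiple search features comprises:
    determining that the candidate search word is the target word if the proportion information is greater than a set threshold.
  6. The method of claim 2, wherein performing named entity recognition on the text data to obtain the multiple corresponding entity words comprises:
    using the text data as the input of a pre-trained named entity recognition model to obtain the corresponding multiple entity words output by the named entity recognition model.
  7. The method of claim 2, further comprising, after performing named entity recognition on the text data to obtain the multiple corresponding entity words:
    parsing the multiple entity words to obtain multiple word features corresponding to the multiple entity words;
    identifying an expanded entity word from an entity library according to the multiple word features;
    wherein identifying the business keyword from the multiple entity words comprises:
    identifying the business keyword from the multiple entity words and the expanded entity word.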
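  8. An apparatus for annotating text data, comprising:
    an acquisition module, configured to acquire text data;
    a processing module, configured to process the text data to obtain a corresponding target word and business keyword;
    a selection module, configured to select a corresponding first label from a preconfigured label library according to the target word;
    a determination module, configured to determine a corresponding second label according to the business keyword, in combination with a pre-trained label extraction model; and
    an annotation module, configured to annotate the text data with the first label and the second label.
  9. The apparatus of claim 8, wherein the processing module comprises:
    a word segmentation submodule, configured to perform word segmentation on the text data to obtain multiple candidate search words;
    an entity recognition submodule, configured to perform named entity recognition on the text data to obtain multiple corresponding entity words;
    a processing submodule, configured to select the target word from the multiple candidate search words and to identify the business keyword from the multiple entity words.
  10. The apparatus of claim 9, wherein the processing submodule is further configured to:
    acquire multiple search features corresponding to the multiple candidate search words;
    identify the target word from the multiple candidate search words according to the multiple search features.
  11. The apparatus of claim 10, wherein the processing submodule is further configured to:
    acquire at least one target search result corresponding to the candidate search word;
    compute proportion information of the target search results among the multiple search results, and use the proportion information as the search feature;
    wherein a target search result is a search result, among the multiple search results, that was triggered by consecutive clicks; a search result is reference text data that a search engine retrieved for the candidate search word; and the target search result includes the reference text data, with the candidate search word appearing in the text title of the reference text data.
  12. The apparatus of claim 11, wherein the processing submodule is further configured to:
    determine that the candidate search word is the target word if the proportion information is greater than a set threshold.
  13. The apparatus of claim 9, wherein the entity recognition submodule is further configured to:
    use the text data as the input of a pre-trained named entity recognition model to obtain the corresponding multiple entity words output by the named entity recognition model.
  14. The apparatus of claim 9, wherein the processing module further comprises:
    a word expansion submodule, configured to parse the multiple entity words to obtain multiple word features corresponding to the multiple entity words, and to identify an expanded entity word from an entity library according to the multiple word features;
    wherein the processing submodule is further configured to:
    identify the business keyword from the multiple entity words and the expanded entity word.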
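  15. A computer device, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the program, implements the following steps:
    acquiring text data;
    processing the text data to obtain a corresponding target word and business keyword;
    selecting a corresponding first label from a preconfigured label library according to the target word;
    determining a corresponding second label according to the business keyword, in combination with a pre-trained label extraction model; and
    annotating the text data with the first label and the second label.
  16. A storage medium, wherein instructions in the storage medium, when executed by a processor of an electronic device, enable the electronic device to perform the following steps:
    acquiring text data;
    processing the text data to obtain a corresponding target word and business keyword;
    selecting a corresponding first label from a preconfigured label library according to the target word;
    determining a corresponding second label according to the business keyword, in combination with a pre-trained label extraction model; and
    annotating the text data with the first label and the second label.
  17. A computer program product, comprising a computer program, wherein instructions in the computer program product, when executed by a processor, implement the following steps:
    acquiring text data;
    processing the text data to obtain a corresponding target word and business keyword;
    selecting a corresponding first label from a preconfigured label library according to the target word;
    determining a corresponding second label according to the business keyword, in combination with a pre-trained label extraction model; and
    annotating the text data with the first label and the second label.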
PCT/CN2022/075659 2021-03-08 2022-02-09 用于文本数据的标注方法、装置、计算机设备及存储介质 WO2022188585A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202110251799.1A CN113822013B (zh) 2021-03-08 2021-03-08 用于文本数据的标注方法、装置、计算机设备及存储介质
CN202110251799.1 2021-03-08

Publications (1)

Publication Number Publication Date
WO2022188585A1 true WO2022188585A1 (zh) 2022-09-15

Family

ID=78912397

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/075659 WO2022188585A1 (zh) 2021-03-08 2022-02-09 用于文本数据的标注方法、装置、计算机设备及存储介质

Country Status (2)

Country Link
CN (1) CN113822013B (zh)
WO (1) WO2022188585A1 (zh)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113822013B (zh) * 2021-03-08 2024-04-05 京东科技控股股份有限公司 用于文本数据的标注方法、装置、计算机设备及存储介质

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108829893A (zh) * 2018-06-29 2018-11-16 北京百度网讯科技有限公司 确定视频标签的方法、装置、存储介质和终端设备
US20190087490A1 (en) * 2016-05-25 2019-03-21 Huawei Technologies Co., Ltd. Text classification method and apparatus
CN109992646A (zh) * 2019-03-29 2019-07-09 腾讯科技(深圳)有限公司 文本标签的提取方法和装置
CN111324771A (zh) * 2020-02-26 2020-06-23 腾讯科技(深圳)有限公司 视频标签的确定方法、装置、电子设备及存储介质
CN112347778A (zh) * 2020-11-06 2021-02-09 平安科技(深圳)有限公司 关键词抽取方法、装置、终端设备及存储介质
CN113822013A (zh) * 2021-03-08 2021-12-21 京东科技控股股份有限公司 用于文本数据的标注方法、装置、计算机设备及存储介质

Family Cites Families (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6052682A (en) * 1997-05-02 2000-04-18 Bbn Corporation Method of and apparatus for recognizing and labeling instances of name classes in textual environments
CN103838870B (zh) * 2014-03-21 2016-09-28 武汉科技大学 基于信息单元融合的新闻原子事件抽取方法
CN107436922B (zh) * 2017-07-05 2021-06-08 北京百度网讯科技有限公司 文本标签生成方法和装置
CN108280061B (zh) * 2018-01-17 2021-10-26 北京百度网讯科技有限公司 基于歧义实体词的文本处理方法和装置
CN108647194B (zh) * 2018-04-28 2022-04-19 北京神州泰岳软件股份有限公司 信息抽取方法及装置
EP3567605A1 (en) * 2018-05-08 2019-11-13 Siemens Healthcare GmbH Structured report data from a medical text report
US11023982B2 (en) * 2018-07-12 2021-06-01 Adp, Llc Method to efficiently categorize, extract and setup of payroll tax notices
CN109165380B (zh) * 2018-07-26 2022-07-01 咪咕数字传媒有限公司 一种神经网络模型训练方法及装置、文本标签确定方法及装置
CN109918645B (zh) * 2019-01-28 2022-12-02 平安科技(深圳)有限公司 深度分析文本的方法、装置、计算机设备和存储介质
CN111738009B (zh) * 2019-03-19 2023-10-20 百度在线网络技术(北京)有限公司 实体词标签生成方法、装置、计算机设备和可读存储介质
CN110377743B (zh) * 2019-07-25 2022-07-08 北京明略软件系统有限公司 一种文本标注方法及装置

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20190087490A1 (en) * 2016-05-25 2019-03-21 Huawei Technologies Co., Ltd. Text classification method and apparatus
CN108829893A (zh) * 2018-06-29 2018-11-16 北京百度网讯科技有限公司 确定视频标签的方法、装置、存储介质和终端设备
CN109992646A (zh) * 2019-03-29 2019-07-09 腾讯科技(深圳)有限公司 文本标签的提取方法和装置
CN111324771A (zh) * 2020-02-26 2020-06-23 腾讯科技(深圳)有限公司 视频标签的确定方法、装置、电子设备及存储介质
CN112347778A (zh) * 2020-11-06 2021-02-09 平安科技(深圳)有限公司 关键词抽取方法、装置、终端设备及存储介质
CN113822013A (zh) * 2021-03-08 2021-12-21 京东科技控股股份有限公司 用于文本数据的标注方法、装置、计算机设备及存储介质

Also Published As

Publication number Publication date
CN113822013A (zh) 2021-12-21
CN113822013B (zh) 2024-04-05

Similar Documents

Publication Publication Date Title
US11216504B2 (en) Document recommendation method and device based on semantic tag
US10489439B2 (en) System and method for entity extraction from semi-structured text documents
AU2019263758B2 (en) Systems and methods for generating a contextually and conversationally correct response to a query
WO2020244073A1 (zh) 基于语音的用户分类方法、装置、计算机设备及存储介质
US10811125B2 (en) Cognitive framework to identify medical case safety reports in free form text
US10650094B2 (en) Predicting style breaches within textual content
US20120030157A1 (en) Training data generation apparatus, characteristic expression extraction system, training data generation method, and computer-readable storage medium
US11977589B2 (en) Information search method, device, apparatus and computer-readable medium
CN107861948B (zh) 一种标签提取方法、装置、设备和介质
CN112347241A (zh) 一种摘要提取方法、装置、设备及存储介质
US11630869B2 (en) Identification of changes between document versions
CN115017884B (zh) 基于图文多模态门控增强的文本平行句对抽取方法
WO2022188585A1 (zh) 用于文本数据的标注方法、装置、计算机设备及存储介质
WO2022143608A1 (zh) 语言标注方法、装置、计算机设备和存储介质
US9195706B1 (en) Processing of document metadata for use as query suggestions
CN110737770B (zh) 文本数据敏感性识别方法、装置、电子设备及存储介质
CN113806500B (zh) 信息处理方法、装置和计算机设备
CN114356924A (zh) 用于从结构化文档提取数据的方法和设备
CN111552780B (zh) 医用场景的搜索处理方法、装置、存储介质及电子设备
CN115358817A (zh) 基于社交数据的智能产品推荐方法、装置、设备及介质
US11983207B2 (en) Method, electronic device, and computer program product for information processing
CN115017385A (zh) 一种物品搜索方法、装置、设备和存储介质
CN110276001B (zh) 盘点页识别方法、装置、计算设备和介质
KR101126186B1 (ko) 형태적 중의성 동사 분석 장치, 방법 및 그 기록 매체
CN111768215B (zh) 广告投放方法、装置、计算机设备和存储介质

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22766112

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 22766112

Country of ref document: EP

Kind code of ref document: A1