WO2021218027A1 - Method, apparatus, device, and medium for extracting professional terms in smart interviews - Google Patents

Method, apparatus, device, and medium for extracting professional terms in smart interviews

Info

Publication number
WO2021218027A1
Authority
WO
WIPO (PCT)
Prior art keywords
phrase
character
interview
named entity
information
Prior art date
Application number
PCT/CN2020/118919
Other languages
English (en)
French (fr)
Inventor
邓悦
金戈
徐亮
Original Assignee
平安科技(深圳)有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 平安科技(深圳)有限公司
Publication of WO2021218027A1 publication Critical patent/WO2021218027A1/zh

Links

Images

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/216Parsing using statistical methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/289Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295Named entity recognition

Definitions

  • This application relates to the field of artificial intelligence, and in particular to a method, device, equipment, and medium for extracting professional terms in smart interviews.
  • The inventor realized that the prior art has at least the following problems: the extraction of professional terms relies on named entity recognition, which uses natural language processing algorithms to extract named entities of three major categories (entity, time, and number) and seven subcategories (person name, organization name, place name, time, date, currency, and percentage). When segmenting words, existing algorithms directly choose the most likely segment label for each character instead of considering all the possibilities, yet the interview process involves many professional terms, most of which are difficult to cover with the preset named entities, so the recognition accuracy is strongly affected by the training corpus and tends to be insufficient. At the same time, because a large corpus is used, a structurally complex model is required for processing, which makes computation inefficient.
  • the embodiments of the present application provide a method, device, equipment, and storage medium for extracting professional terms in a smart interview, so as to improve the accuracy of extracting professional terms in a smart interview.
  • an embodiment of the present application provides a method for extracting professional terms in a smart interview, including:
  • a preset sequence labeling model is used to determine the named entities contained in the response sentence, and the named entities are used as the professional terms.
  • an embodiment of the present application also provides a device for extracting professional terms in a smart interview, including:
  • a phrase segmentation module, used to scan the response sentences in the smart interview with the historical interview lexicon, and to segment each character in the response sentences along N preset dimensions, obtaining N phrase sets corresponding to each character, where N is a positive integer;
  • the weight determination module is used to count the number of occurrences of each phrase in the historical interview lexicon to obtain the word frequency of each phrase, and to determine the weight information of the phrase set corresponding to each character according to each phrase's word frequency;
  • the vector representation module is used to smooth the weight information to obtain the vector representation of each character in the N phrase sets;
  • the character characterization module is used to extract information from the vector representation of each character to obtain characterization information of each character;
  • the term determination module is used to combine the characterization information of the characters and adopt a preset sequence labeling model to determine the named entities contained in the response sentence, and to use the named entities as the professional terms.
  • an embodiment of the present application also provides a computer device, including a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer-readable instructions:
  • a preset sequence labeling model is used to determine the named entities contained in the response sentence, and the named entities are used as the professional terms.
  • embodiments of the present application also provide a computer-readable storage medium storing computer-readable instructions that implement the following steps when executed by a processor:
  • a preset sequence labeling model is used to determine the named entities contained in the response sentence, and the named entities are used as the professional terms.
  • The method, device, equipment, and medium for extracting professional terms in smart interviews provided by the embodiments of this application scan the response sentences in the smart interview using the historical interview lexicon, and segment each character in the response sentences along N preset dimensions to obtain N phrase sets corresponding to each character, so that all phrases corresponding to a character serve as candidate phrases. This avoids the over-reliance on the training corpus of existing word segmentation models, which filter candidate phrases based only on the highest probability in the training corpus, and helps improve the accuracy of professional term extraction.
  • The number of occurrences of each phrase in the historical interview lexicon is counted to obtain the word frequency of each phrase, and the weight information of the phrase sets corresponding to each character is determined according to each phrase's word frequency. The weight information is smoothed to obtain the vector representation of each character in the N phrase sets; combining the phrase information of historical interviews in this way assigns the candidate phrases weights better suited to the interview scenario, which helps improve the accuracy of professional term extraction. Information is then extracted from the vector representation of each character to obtain the characterization information of each character.
  • Combining the characterization information of the characters, a simple and universal sequence labeling model is used to determine the named entities contained in the response sentence, and the named entities are used as professional terms. This avoids determining named entities through a complex named entity extraction model and improves the efficiency of determining named entities.
  • Figure 1 is an exemplary system architecture diagram to which the present application can be applied;
  • Figure 2 is a flowchart of an embodiment of the method for extracting professional terms in the smart interview of the present application;
  • Figure 3 is a schematic structural diagram of an embodiment of an apparatus for extracting professional terms in a smart interview according to the present application;
  • Figure 4 is a schematic structural diagram of an embodiment of a computer device according to the present application.
  • the system architecture 100 may include terminal devices 101, 102, and 103, a network 104 and a server 105.
  • the network 104 is used to provide a medium for communication links between the terminal devices 101, 102, 103 and the server 105.
  • the network 104 may include various connection types, such as wired, wireless communication links, or fiber optic cables, and so on.
  • the user can use the terminal devices 101, 102, and 103 to interact with the server 105 through the network 104 to receive or send messages and so on.
  • the terminal devices 101, 102, 103 may be various electronic devices with a display screen and support for web browsing, including but not limited to smart phones, tablets, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop computers, desktop computers, and so on.
  • the server 105 may be a server that provides various services, for example, a background server that provides support for pages displayed on the terminal devices 101, 102, and 103.
  • the method for extracting professional terms in the smart interview provided by the embodiments of the present application is executed by the server, and accordingly, the device for extracting professional terms in the smart interview is set in the server.
  • terminal devices, networks, and servers in FIG. 1 are merely illustrative. According to implementation needs, there may be any number of terminal devices, networks, and servers.
  • the terminal devices 101, 102, and 103 in the embodiments of the present application may specifically correspond to application systems in actual production.
  • FIG. 2 shows a method for extracting technical terms in a smart interview provided by an embodiment of the present application.
  • the method is applied to the server in FIG. 1 as an example for description, and the details are as follows:
  • S201 Use the historical interview vocabulary to scan the response sentences in the smart interview, and divide each character in the response sentence according to the preset N dimensions to obtain N phrase sets corresponding to each character, where N Is a positive integer.
  • specifically, the historical interview lexicon is used to scan the response sentences in the smart interview, and each character in a response sentence is segmented along the N preset dimensions; each segmentation yields one phrase set corresponding to the character, giving N phrase sets in total for the character, where N is a positive integer.
  • the historical interview lexicon refers to the word-segmentation lexicon obtained after analyzing and processing the question sentences and response sentences in historical interviews.
  • phrase set is a set of phrases obtained by segmentation in one dimension.
  • in this embodiment, the value of N is 4. Among the four preset dimensions, the first dimension is the phrases in which the character begins a named entity; the second dimension is the phrases in which the character lies in the middle of a named entity; the third dimension is the phrases in which the character ends a named entity; and the fourth dimension is the phrases in which the character itself is a named entity.
  • for example, for a character c in a sentence s, four segmentation labels B(c), M(c), E(c), and S(c) are constructed, where B(c) is the set of all phrases in s in which the character c begins a named entity, M(c) is the set of all phrases in s in which c is in the middle of a named entity, E(c) is the set of all phrases in s in which c ends a named entity, and S(c) is the set in which the character c itself is a named entity.
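The four segmentation labels above can be sketched in a few lines of Python. This is an illustrative reconstruction, not the patent's implementation: the toy sentence and lexicon are invented for the example.

```python
def build_phrase_sets(sentence, lexicon):
    """For each character position i, collect the lexicon phrases in which
    the character is the Begin, Middle, End, or Single (B/M/E/S) part."""
    sets = []
    for i, c in enumerate(sentence):
        B, M, E, S = set(), set(), set(), set()
        for w in lexicon:
            start = sentence.find(w)
            # scan every occurrence of phrase w in the sentence
            while start != -1:
                end = start + len(w) - 1
                if len(w) == 1 and start == i:
                    S.add(w)          # the character itself is the entity
                elif start == i and len(w) > 1:
                    B.add(w)          # character begins the phrase
                elif end == i and len(w) > 1:
                    E.add(w)          # character ends the phrase
                elif start < i < end:
                    M.add(w)          # character is inside the phrase
                start = sentence.find(w, start + 1)
        sets.append({"B": B, "M": M, "E": E, "S": S})
    return sets

# toy example: a 6-character sentence and a 5-phrase lexicon
sentence = "深度学习模型"
lexicon = {"深度", "学习", "深度学习", "模型", "学"}
sets = build_phrase_sets(sentence, lexicon)
```

Each character thus keeps every phrase it participates in as a candidate, instead of committing to one segmentation up front.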
  • S202 Count the number of occurrences of each phrase in the historical interview lexicon to obtain the word frequency of each phrase, and determine the weight information of the phrase set corresponding to each character according to the word frequency of each phrase.
  • specifically, the number of occurrences of each phrase in the historical interview lexicon related to the intelligent interview scenario is counted to determine the word frequency of each phrase, and the weight information of the phrase sets corresponding to each character is then determined according to each phrase's word frequency.
  • the word frequency of a phrase is a static value that can be calculated in advance and stored in a data table; when needed, it is obtained by a table lookup, which improves the efficiency of determining the weight information. The word frequencies can also be updated according to actual needs.
  • the frequency of occurrence of a phrase is used in this embodiment to indicate the weight of the phrase, because the more often a given character sequence appears in the historical interview lexicon, the more likely that sequence is a phrase.
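Counting phrase occurrences to obtain the word frequencies is a straightforward tally; a minimal sketch follows, where the already-segmented historical corpus is a hypothetical stand-in for the historical interview lexicon:

```python
from collections import Counter

# hypothetical historical-interview corpus, already segmented into phrases
historical_phrases = ["机器学习", "模型", "机器学习", "神经网络", "模型", "机器学习"]

# z(w): number of occurrences of each phrase in the historical lexicon,
# used directly as that phrase's (unnormalized) weight; this Counter can
# be precomputed and stored as the lookup table the embodiment describes
z = Counter(historical_phrases)
```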
  • S203 Perform smoothing processing on the weight information to obtain a vector representation of each character in the N phrase sets.
  • specifically, the weights of phrases with smaller weights are optimized to prevent such phrases from being ignored in the subsequent labeling process because of large weight differences, which could cause the named entities corresponding to them to be missed.
  • the weight information of the N phrase sets is updated and vectorized to obtain the vector representation of each character in the N phrase sets.
  • the smoothing processing in this embodiment refers to optimizing, through a mathematical model, the weights that do not meet the requirements, so that the overall weight information falls within a reasonable range, which facilitates subsequent information extraction.
  • S204 Perform information extraction on the vector representation of each character to obtain characterization information of each character.
  • specifically, the key information is extracted from the vector representation to obtain characterization information of a fixed dimension, so that the characterization information can subsequently be input into the sequence labeling model to determine the named entities.
  • the characterization information is a fixed-dimensional vector used to characterize the association between the character and each phrase.
  • S205 Combining the characterization information of the characters, use a preset sequence labeling model to determine the named entities contained in the response sentence, and use the named entities as professional terms.
  • specifically, the preset sequence labeling model is used to perform named entity recognition on the response sentence, so that the named entities contained in the response sentence, that is, the professional terms mentioned in the candidate's response, can be obtained quickly and accurately.
  • named entity recognition refers to the process of recognizing names or symbols of specific types of things in a document collection. It involves three problems: identifying the named entities in the text; determining the types of the entities; and, when multiple entities refer to the same thing, selecting one entity to represent the group.
  • in this embodiment, named entity recognition mainly refers to identifying professional terms in the candidate's response sentences; the candidate's interview performance is subsequently evaluated on the basis of the identified professional terms, for example, by checking whether the experience in the resume is consistent with the responses, that is, verifying the credibility of the candidate's responses, or by testing the candidate's professional ability.
  • sequence labeling models include, but are not limited to, the Conditional Random Field (CRF), the Hidden Markov Model (HMM), and the Maximum Entropy Markov Model (MEMM).
  • the preset sequence labeling model in this embodiment may be any one of the existing sequence labeling models, which is not further limited here.
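As a concrete illustration of what any of these sequence labeling models ultimately does at decoding time, here is a minimal Viterbi decoder over per-character tag scores. The scores and transition table are invented toy values; a real HMM or CRF learns both from data:

```python
def viterbi(obs_scores, trans):
    """Find the best tag path given per-position tag scores and
    tag-to-tag transition scores (additive log-domain scores)."""
    tags = list(obs_scores[0])
    # best[i][t] = (score, prev_tag) of the best path ending in tag t at i
    best = [{t: (obs_scores[0][t], None) for t in tags}]
    for scores in obs_scores[1:]:
        layer = {}
        for t in tags:
            prev, s = max(((p, best[-1][p][0] + trans[(p, t)]) for p in tags),
                          key=lambda x: x[1])
            layer[t] = (s + scores[t], prev)
        best.append(layer)
    # backtrack from the highest-scoring final tag
    t = max(best[-1], key=lambda tag: best[-1][tag][0])
    path = [t]
    for layer in reversed(best[1:]):
        t = layer[t][1]
        path.append(t)
    return path[::-1]

# toy 3-character input: the first two characters form one entity (B, I)
obs = [{"B": 2.0, "I": 0.0, "O": 1.0},
       {"B": 0.0, "I": 2.0, "O": 1.0},
       {"B": 0.0, "I": 0.0, "O": 2.0}]
trans = {(p, t): 0.0 for p in "BIO" for t in "BIO"}
trans[("O", "I")] = -10.0  # forbid an Inside tag right after Outside
path = viterbi(obs, trans)
```

The transition penalty is what distinguishes sequence labeling from tagging each character independently, which is the behavior the characterization information feeds into here.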
  • through step S204, the characterization information of the characters is obtained; the characterization information weights the characters, which helps the sequence labeling model quickly label the key features and improves the accuracy and efficiency of the labeling.
  • in this embodiment, the response sentences in the smart interview are scanned, and each character in the response sentences is segmented along N preset dimensions to obtain N phrase sets corresponding to each character, so that all phrases corresponding to a character serve as candidate phrases. This avoids the over-reliance on the training corpus of existing word segmentation models, which select candidate phrases based only on the highest probability in the training corpus, and helps improve the accuracy of professional term extraction. The number of occurrences of each phrase in the historical interview lexicon is counted to obtain the word frequency of each phrase; the weight information of the phrase sets corresponding to each character is determined according to each phrase's word frequency, and the weight information is smoothed.
  • the extracted technical terms can be stored on the blockchain network, and the data information can be shared between different platforms through the blockchain storage, and the data can also be prevented from being tampered with.
  • Blockchain is a new application mode of computer technology such as distributed data storage, point-to-point transmission, consensus mechanism, and encryption algorithm.
  • a blockchain is essentially a decentralized database: a chain of data blocks linked by cryptographic methods. Each data block contains a batch of network transaction information, which is used to verify the validity of the information (anti-counterfeiting) and to generate the next block.
  • the blockchain can include the underlying platform of the blockchain, the platform product service layer, and the application service layer.
  • further, in step S202, determining the weight information of the phrase set corresponding to each character according to the word frequency of each phrase includes:
  • v_s denotes the weighted representation of a phrase set;
  • L is the total length of the four phrase sets corresponding to the character c;
  • w_c denotes a phrase containing the character c;
  • z(w) denotes the word frequency of w_c in the historical interview lexicon;
  • e^w(w) denotes the embedding vector of w_c; and B, M, E, and S are the phrase sets of the four dimensions.
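The formula itself appears only as an image in the original filing, so it is not reproduced here; the following sketch implements one plausible reading of the legend above, namely a frequency-weighted average of the phrase embeddings in one set, with z(w) as the weight. The per-set normalization and the toy embeddings are assumptions:

```python
def weighted_set_vector(phrases, z, emb):
    """Frequency-weighted average of phrase embeddings for one phrase set:
    v_s = (1/Z) * sum over w of z(w) * e(w), with Z normalizing over the
    set. The patent's exact normalization (shared across all four sets)
    is approximated here per set."""
    Z = sum(z[w] for w in phrases) or 1
    dim = len(next(iter(emb.values())))
    v = [0.0] * dim
    for w in phrases:
        for j, x in enumerate(emb[w]):
            v[j] += z[w] * x / Z
    return v

# toy embeddings and frequencies for a two-phrase set
emb = {"深度": [1.0, 0.0], "深度学习": [0.0, 1.0]}
z = {"深度": 3, "深度学习": 1}
v = weighted_set_vector({"深度", "深度学习"}, z, emb)
```

The more frequent phrase dominates the set vector, which is exactly the effect the embodiment ascribes to the word-frequency weights.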
  • the weight information of each phrase set is calculated, so as to subsequently generate a corresponding vector representation based on the weight information of each phrase set.
  • the embedding vector (word embedding) is an important concept in natural language processing; it converts a word into a fixed-length vector representation so as to facilitate mathematical processing.
  • specifically, the embedding vector can be generated by feeding characters into the open-source TensorFlow model; alternatively, the phrase corresponding to a character can be expressed as a sequence, and the sequence is then converted into an embedding vector through the tf.nn.embedding_lookup function.
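For readers unfamiliar with `tf.nn.embedding_lookup`, it is essentially row indexing into an embedding matrix; the following plain-Python sketch, with a made-up three-word vocabulary, shows the equivalent operation without requiring TensorFlow:

```python
# a hypothetical vocabulary of 3 token ids, embedding dimension 2
embedding_matrix = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
ids = [2, 0]

# equivalent of tf.nn.embedding_lookup(embedding_matrix, ids):
# select the row of the embedding matrix for each token id
looked_up = [embedding_matrix[i] for i in ids]
```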
  • in this embodiment, the weight information of the phrase sets corresponding to a character is calculated by the given formula, so that the character's characterization information can be obtained from the weight information; the resulting weights fit the interview scenario better, which helps improve the accuracy of subsequent named entity extraction.
  • further, in step S203, smoothing the weight information to obtain the vector representation of each character in the N phrase sets includes:
  • the weight information of the updated phrase sets is vectorized to obtain the vector representation of the character in the N phrase sets.
  • in this embodiment, the phrases in the N phrase sets corresponding to a character are sorted by word frequency, the phrases that need smoothing are selected according to preset conditions and smoothed, and the weight information is updated; the updated weight information is then vectorized to obtain the vector representation of the character in the N phrase sets.
  • the preset ratio value can be set according to actual needs; preferably, the preset ratio value in this embodiment is 10%.
  • the value of M is determined by the product of the preset ratio value and the number of phrases in the phrase set.
  • the preset weighting method can be selected according to actual needs.
  • the weighting is performed by the following formula:
  • where a is a constant whose value is determined by the maximum word frequency among the phrases to be processed.
  • for example, if the preset ratio value is 10% and the number of phrases in the phrase set is 180, the corresponding value of M is 18.
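The selection-and-boost procedure described above can be sketched as follows. The boosting rule used here (raising the bottom-M frequencies to a constant a derived from their maximum) is an assumption standing in for the patent's unreproduced formula; only the ascending sort, the 10% ratio, and the role of the constant a follow the text:

```python
def smooth_weights(freqs, ratio=0.10):
    """Sort phrases by word frequency ascending, select the lowest
    M = ratio * len(freqs) phrases, and raise their weight so that
    very rare phrases are not drowned out during labeling."""
    items = sorted(freqs.items(), key=lambda kv: kv[1])
    M = int(ratio * len(items))
    smoothed = dict(freqs)
    if M:
        # constant a derived from the max frequency of the bottom-M
        # phrases (an assumed stand-in for the patent's formula)
        a = max(f for _, f in items[:M]) + 1
        for w, _ in items[:M]:
            smoothed[w] = a
    return smoothed

# 10 toy phrases with frequencies 1..10, so M = 1 at a 10% ratio
freqs = {f"w{i}": i + 1 for i in range(10)}
smoothed = smooth_weights(freqs)
```

Only the rarest phrase is lifted; the rest of the weight information is left untouched, matching the intent of step S203.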
  • further, in step S204, performing information extraction on the vector representation of each character to obtain the characterization information of each character includes:
  • the concatenated vector is dimensionally compressed to obtain the characterization information of the character.
  • in this embodiment, the vector representations of the N phrase sets corresponding to each character are compressed by a preset method to obtain a fixed-dimensional vector.
  • specifically, the preset method adopted is to concatenate the four phrase sets and express them as a whole, obtaining the single-character representation of the character, which is compressed using the following formula:
  • e_s denotes the embedding vector corresponding to the character;
  • v_s denotes the function mapping the corresponding phrase set to a dense vector;
  • x_c is the characterization information corresponding to the character.
  • in this embodiment, the concatenated vector is obtained and compressed to a fixed dimension, so that it can subsequently be input to the preset sequence labeling model to determine the named entities.
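The concatenate-then-compress step can be sketched as follows; the linear projection W standing in for the compression formula is an assumption, as is the 10-to-3 dimensionality of the toy example:

```python
def characterize(e_char, set_vectors, W):
    """Concatenate the character embedding with the four phrase-set
    vectors (B, M, E, S), then compress the concatenation to a fixed
    dimension with a linear map W (a hedged stand-in for the patent's
    compression formula, which is not reproduced here)."""
    v = list(e_char)
    for key in ("B", "M", "E", "S"):
        v.extend(set_vectors[key])
    # dimensional compression: x_c = W @ v
    return [sum(w_ij * x for w_ij, x in zip(row, v)) for row in W]

# toy inputs: 2-dim character embedding, 2-dim set vectors -> 10-dim concat
e_char = [1.0, 0.0]
set_vectors = {k: [0.5, 0.5] for k in ("B", "M", "E", "S")}
# hypothetical 3 x 10 projection matrix (identity-like rows for clarity)
W = [[1.0] + [0.0] * 9,
     [0.0, 1.0] + [0.0] * 8,
     [0.0] * 2 + [1.0] + [0.0] * 7]
x_c = characterize(e_char, set_vectors, W)
```

Whatever the exact compression, the essential point is that x_c has a fixed dimension regardless of how many phrases each set contained.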
  • Fig. 3 shows a schematic block diagram of a device for extracting technical terms in a smart interview in a one-to-one correspondence with the method for extracting technical terms in a smart interview in the foregoing embodiment.
  • the device for extracting professional terms in the smart interview includes a phrase segmentation module 31, a weight determination module 32, a vector representation module 33, a character representation module 34, and a term determination module 35.
  • the detailed description of each functional module is as follows:
  • the phrase segmentation module 31 is used to scan the response sentences in the smart interview with the historical interview lexicon, and to segment each character in the response sentences along the N preset dimensions, obtaining N phrase sets corresponding to each character, where N is a positive integer;
  • the weight determination module 32 is used to count the number of occurrences of each phrase in the historical interview lexicon to obtain the word frequency of each phrase, and to determine the weight information of the phrase set corresponding to each character according to each phrase's word frequency;
  • the vector representation module 33 is used for smoothing the weight information to obtain the vector representation of each character in the set of N phrases;
  • the character characterization module 34 is used to extract information from the vector representation of each character to obtain characterization information of each character;
  • the term determination module 35 is used to combine the characterization information of the characters and adopt a preset sequence labeling model to determine the named entities contained in the response sentence, and to use the named entities as professional terms.
  • the phrase segmentation module 31 includes: the value of N is 4; among the four preset dimensions, the first dimension is the phrases in which the character begins a named entity, and the second dimension is the phrases in which the character lies in the middle of a named entity;
  • the third dimension is the phrases in which the character ends a named entity, and the fourth dimension is the phrases in which the character itself is a named entity.
  • the weight determination module 32 includes:
  • v_s denotes the weighted representation of a phrase set;
  • L is the total length of the four phrase sets corresponding to the character c;
  • w_c denotes a phrase containing the character c;
  • z(w) denotes the word frequency of w_c in the historical interview lexicon;
  • e^w(w) denotes the embedding vector of w_c; and B, M, E, and S are the phrase sets of the four dimensions.
  • the vector representation module 33 includes:
  • the word frequency sorting unit is used to sort, for each character, the phrases in the N phrase sets corresponding to the character in order of word frequency from low to high, to obtain the sorting result;
  • the phrase selection unit is used to obtain the preset ratio value, determine the selected number M according to the preset ratio value, and take the phrases corresponding to the first M word frequencies in the sorting result as the phrases to be processed;
  • the weight update unit is used to increase the weight of the phrase to be processed according to a preset weighting method, and update the weight information of the phrase set corresponding to the character;
  • the weight vectorization unit is used to vectorize the weight information of the updated phrase set to obtain the vector representation of the character in the N phrase set.
  • the character characterization module 34 includes:
  • the vector concatenation unit is used to concatenate the vector representations of the N phrase sets for each character to obtain a concatenated vector;
  • the vector compression unit is used to dimensionally compress the concatenated vector according to a preset fixed dimension to obtain the characterization information of the character.
  • the device for extracting professional terms in the smart interview further includes:
  • the storage module is used to store technical terms in the blockchain network.
  • Each module in the device for extracting professional terms in the smart interview can be implemented in whole or in part by software, hardware, and a combination thereof.
  • the above-mentioned modules may be embedded in the form of hardware or independent of the processor in the computer equipment, or may be stored in the memory of the computer equipment in the form of software, so that the processor can call and execute the operations corresponding to the above-mentioned modules.
  • FIG. 4 is a block diagram of the basic structure of the computer device in this embodiment.
  • the computer device 4 includes a memory 41, a processor 42, and a network interface 43 that communicate with each other via a system bus. It should be pointed out that the figure only shows the computer device 4 with the memory 41, the processor 42, and the network interface 43; it should be understood, however, that not all of the shown components are required, and more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device that can automatically perform numerical calculation and/or information processing in accordance with preset or stored instructions.
  • its hardware includes, but is not limited to, microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices, and so on.
  • the computer device may be a computing device such as a desktop computer, a notebook, a palmtop computer, and a cloud server.
  • the computer device can interact with the user through a keyboard, a mouse, a remote control, a touch panel, or a voice control device.
  • the memory 41 includes at least one type of readable storage medium, including flash memory, hard disks, multimedia cards, card-type memory (for example, SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disks, optical discs, and so on.
  • the memory 41 may be an internal storage unit of the computer device 4, such as a hard disk or memory of the computer device 4.
  • the memory 41 may also be an external storage device of the computer device 4, for example, a plug-in hard disk equipped on the computer device 4, a smart memory card (Smart Media Card, SMC), and a secure digital (Secure Digital, SD) card, Flash Card, etc.
  • the memory 41 may also include both the internal storage unit of the computer device 4 and its external storage device.
  • the memory 41 is generally used to store an operating system and various application software installed in the computer device 4, such as program codes for controlling electronic files.
  • the memory 41 can also be used to temporarily store various types of data that have been output or will be output.
  • the processor 42 may be a central processing unit (Central Processing Unit, CPU), a controller, a microcontroller, a microprocessor, or other data processing chips in some embodiments.
  • the processor 42 is generally used to control the overall operation of the computer device 4.
  • the processor 42 is configured to run program codes or process data stored in the memory 41, for example, run program codes for controlling electronic files.
  • the network interface 43 may include a wireless network interface or a wired network interface, and the network interface 43 is generally used to establish a communication connection between the computer device 4 and other electronic devices.
  • the computer-readable storage medium may be non-volatile or volatile, and stores a program executable by at least one processor, so that the at least one processor executes the steps of the method for extracting professional terms in the smart interview described above.
  • the technical solution of this application, in essence or in the part contributing to the existing technology, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a ROM/RAM, a magnetic disk, or an optical disc) and includes several instructions to make a terminal device (which can be a mobile phone, a computer, a server, an air conditioner, or a network device, etc.) execute the methods described in the various embodiments of the present application.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

A method, apparatus, computer device and storage medium for extracting professional terms in a smart interview. The method comprises: segmenting each character in a response sentence according to N preset dimensions to obtain N phrase sets corresponding to each character, so that all phrases corresponding to the character serve as candidate phrases; determining weight information for the phrase sets from the frequency of each phrase in a historical interview lexicon, so that the candidate phrases are given weights that better fit the interview scenario; determining representation information for each character from the weight information; and then, combining the characters' representation information, using a preset sequence labeling model of simple structure to determine the named entities contained in the response sentence, and storing the named entities, as professional terms, in a blockchain network. This avoids the use of a complex named-entity extraction model and improves the efficiency of determining named entities.

Description

Method, apparatus, device and medium for extracting professional terms in a smart interview
This application claims priority to the Chinese patent application No. 202010356753.1, filed with the China National Intellectual Property Administration on April 29, 2020 and entitled "Method, apparatus, device and medium for extracting professional terms in a smart interview", the entire contents of which are incorporated herein by reference.
Technical Field
This application relates to the field of artificial intelligence, and in particular to a method, apparatus, device and medium for extracting professional terms in a smart interview.
Background
Recruitment interviewing is time-consuming and labor-intensive: candidates are numerous while interviewers are limited, and an interviewer may need to interview a large number of candidates back to back within a single day. In order to ask pertinent questions about a candidate's experience in a timely manner and understand the candidate's mastery of professional skills, it is important to extract the professional terms mentioned in the candidate's account of his or her experience and in the résumé, and to ask further questions about them.
In the course of implementing this application, the inventors realized that the prior art has at least the following problems: the extraction of professional terms relies on named entity recognition, which uses natural language processing algorithms to extract named entities of three major categories (entities, times and numbers) and seven subcategories (person names, organization names, place names, times, dates, currencies and percentages). Existing algorithms directly choose the most likely segmentation label for a character during word segmentation instead of considering all possibilities, while the interview process involves many professional terms, most of which are difficult to cover with preset named entities. As a result, recognition accuracy is heavily affected by the training corpus and tends to be insufficient; at the same time, because a large corpus is used, a model with a complex structure is required for processing, which makes computation inefficient.
Summary
Embodiments of this application provide a method, apparatus, device and storage medium for extracting professional terms in a smart interview, so as to improve the accuracy of professional-term extraction in smart interviews.
To solve the above technical problem, an embodiment of this application provides a method for extracting professional terms in a smart interview, comprising:
using a historical interview lexicon to scan a response sentence in the smart interview, and segmenting each character in the response sentence according to N preset dimensions respectively, to obtain N phrase sets corresponding to each character, wherein N is a positive integer;
counting the number of occurrences of each phrase in the historical interview lexicon to obtain a frequency of each phrase, and determining, according to the frequency of each phrase, weight information of the phrase sets corresponding to each character;
smoothing the weight information to obtain a vector representation of each character in the N phrase sets;
performing information extraction on the vector representation of each character to obtain representation information of each character;
combining the representation information of the characters, using a preset sequence labeling model to determine the named entities contained in the response sentence, and taking the named entities as the professional terms.
To solve the above technical problem, an embodiment of this application further provides an apparatus for extracting professional terms in a smart interview, comprising:
a phrase segmentation module, configured to use a historical interview lexicon to scan a response sentence in the smart interview, and segment each character in the response sentence according to N preset dimensions respectively, to obtain N phrase sets corresponding to each character, wherein N is a positive integer;
a weight determination module, configured to count the number of occurrences of each phrase in the historical interview lexicon to obtain a frequency of each phrase, and determine, according to the frequency of each phrase, weight information of the phrase sets corresponding to each character;
a vector representation module, configured to smooth the weight information to obtain a vector representation of each character in the N phrase sets;
a character representation module, configured to perform information extraction on the vector representation of each character to obtain representation information of each character;
a term determination module, configured to combine the representation information of the characters, use a preset sequence labeling model to determine the named entities contained in the response sentence, and take the named entities as the professional terms.
To solve the above technical problem, an embodiment of this application further provides a computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
using a historical interview lexicon to scan a response sentence in the smart interview, and segmenting each character in the response sentence according to N preset dimensions respectively, to obtain N phrase sets corresponding to each character, wherein N is a positive integer;
counting the number of occurrences of each phrase in the historical interview lexicon to obtain a frequency of each phrase, and determining, according to the frequency of each phrase, weight information of the phrase sets corresponding to each character;
smoothing the weight information to obtain a vector representation of each character in the N phrase sets;
performing information extraction on the vector representation of each character to obtain representation information of each character;
combining the representation information of the characters, using a preset sequence labeling model to determine the named entities contained in the response sentence, and taking the named entities as the professional terms.
To solve the above technical problem, an embodiment of this application further provides a computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, implement the following steps:
using a historical interview lexicon to scan a response sentence in the smart interview, and segmenting each character in the response sentence according to N preset dimensions respectively, to obtain N phrase sets corresponding to each character, wherein N is a positive integer;
counting the number of occurrences of each phrase in the historical interview lexicon to obtain a frequency of each phrase, and determining, according to the frequency of each phrase, weight information of the phrase sets corresponding to each character;
smoothing the weight information to obtain a vector representation of each character in the N phrase sets;
performing information extraction on the vector representation of each character to obtain representation information of each character;
combining the representation information of the characters, using a preset sequence labeling model to determine the named entities contained in the response sentence, and taking the named entities as the professional terms.
According to the method, apparatus, device and medium for extracting professional terms in a smart interview provided by the embodiments of this application, a historical interview lexicon is used to scan the response sentence in the smart interview, and each character in the response sentence is segmented according to N preset dimensions respectively to obtain N phrase sets corresponding to each character, so that all phrases corresponding to the character serve as candidate phrases. This avoids the practice of existing word-segmentation models of screening candidate phrases merely by the highest probability in the training corpus, which relies excessively on the training corpus, and thus helps improve the accuracy of professional-term extraction. At the same time, the occurrences of each phrase in the historical interview lexicon are counted to obtain each phrase's frequency, the weight information of the phrase sets corresponding to each character is determined according to those frequencies, and the weight information is smoothed to obtain the vector representation of each character in the N phrase sets, so that phrase information from historical interviews is used to give the candidate phrases weights that better fit the interview scenario, which further helps improve extraction accuracy. Information extraction is then performed on the vector representation of each character to obtain its representation information; combining the characters' representation information, a simple, general-purpose sequence labeling model is used to determine the named entities contained in the response sentence, and the named entities are taken as professional terms, which avoids determining named entities with a complex named-entity extraction model and improves the efficiency of determining named entities.
Brief Description of the Drawings
To describe the technical solutions of the embodiments of this application more clearly, the drawings needed for describing the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of this application, and a person of ordinary skill in the art may derive other drawings from them without creative effort.
Fig. 1 is a diagram of an exemplary system architecture to which this application can be applied;
Fig. 2 is a flowchart of an embodiment of the method for extracting professional terms in a smart interview of this application;
Fig. 3 is a schematic structural diagram of an embodiment of the apparatus for extracting professional terms in a smart interview according to this application;
Fig. 4 is a schematic structural diagram of an embodiment of a computer device according to this application.
Detailed Description
Unless otherwise defined, all technical and scientific terms used herein have the same meanings as commonly understood by those skilled in the technical field of this application. The terms used in the specification are only for the purpose of describing specific embodiments and are not intended to limit this application. The terms "comprise" and "have" in the specification, claims and above description of the drawings, and any variations thereof, are intended to cover non-exclusive inclusion. The terms "first", "second" and the like in the specification and claims or the above drawings are used to distinguish different objects rather than to describe a particular order.
Reference herein to an "embodiment" means that a particular feature, structure or characteristic described in connection with the embodiment may be included in at least one embodiment of this application. The appearance of the phrase in various places in the specification does not necessarily refer to the same embodiment, nor to an independent or alternative embodiment mutually exclusive of other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
The technical solutions in the embodiments of this application are described below clearly and completely with reference to the drawings in the embodiments of this application. Obviously, the described embodiments are only some rather than all of the embodiments of this application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of this application without creative effort fall within the protection scope of this application.
Referring to Fig. 1, the system architecture 100 may include terminal devices 101, 102 and 103, a network 104 and a server 105. The network 104 is the medium providing communication links between the terminal devices 101, 102, 103 and the server 105, and may include various connection types, such as wired or wireless communication links, or fiber-optic cables.
A user may use the terminal devices 101, 102, 103 to interact with the server 105 through the network 104 to receive or send messages and the like.
The terminal devices 101, 102, 103 may be various electronic devices that have a display screen and support web browsing, including but not limited to smartphones, tablet computers, e-book readers, MP3 (Moving Picture Experts Group Audio Layer III) players, MP4 (Moving Picture Experts Group Audio Layer IV) players, laptop portable computers, desktop computers and so on.
The server 105 may be a server providing various services, for example a background server that supports the pages displayed on the terminal devices 101, 102, 103.
It should be noted that the method for extracting professional terms in a smart interview provided by the embodiments of this application is executed by the server; accordingly, the apparatus for extracting professional terms in a smart interview is arranged in the server.
It should be understood that the numbers of terminal devices, networks and servers in Fig. 1 are merely illustrative. There may be any number of terminal devices, networks and servers according to implementation needs; the terminal devices 101, 102, 103 in the embodiments of this application may specifically correspond to application systems in actual production.
Referring to Fig. 2, Fig. 2 shows a method for extracting professional terms in a smart interview provided by an embodiment of this application, described by taking the application of the method to the server side in Fig. 1 as an example, and detailed as follows:
S201: use a historical interview lexicon to scan a response sentence in the smart interview, and segment each character in the response sentence according to N preset dimensions respectively, to obtain N phrase sets corresponding to each character, where N is a positive integer.
Specifically, the historical interview lexicon is used to scan and recognize the response sentences in the smart interview, near-synonyms and synonyms are recognized for each response sentence, and each character in the response sentence is segmented along the N preset dimensions; each segmentation yields one phrase set corresponding to the character, giving N phrase sets for the character in total, where N is a positive integer.
Here, the historical interview lexicon is a word-segmentation lexicon obtained by analyzing and processing the question sentences and response sentences of historical interviews.
One phrase set is the set of phrases obtained by segmentation along one dimension.
Preferably, N is 4 in this embodiment. Among the four preset dimensions, the first dimension is phrases in which the character is the beginning of a named entity, the second dimension is phrases in which the character is in the middle of a named entity, the third dimension is phrases in which the character is the end of a named entity, and the fourth dimension is the character itself as a named entity.
For example, in a specific embodiment, for a character c in a sentence s, four segmentation labels B(c), M(c), E(c) and S(c) are established respectively, where B(c) is the set of all phrases in sentence s in which character c is the beginning of a named entity, M(c) is the set of all phrases in sentence s in which character c is the middle part of a named entity, E(c) is the set of all phrases in sentence s in which character c is the end of a named entity, and S(c) means character c itself taken as a named entity.
It should be understood that using these four dimensions for phrase segmentation effectively ensures that no named entity is missed, which helps improve the accuracy of subsequent named entity recognition.
It should be noted that if a phrase set is empty, a special word "NONE" is added to it to indicate this situation. In this way, pre-trained word embeddings are introduced, and the corresponding matching result can be accurately recovered from each character's phrase sets.
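As an illustrative, non-limiting sketch of step S201, the following Python fragment builds the four phrase sets B, M, E and S for every character of a sentence by exhaustively matching lexicon phrases; the function name `bmes_sets`, the toy lexicon and the sample sentence are invented for illustration and are not part of the claimed method.

```python
def bmes_sets(sentence, lexicon):
    """For each character position, collect the four phrase sets:
    B: phrases beginning at the character, M: phrases with the character
    strictly inside, E: phrases ending at the character, S: the character
    itself as a single-character entity."""
    sets = [{k: set() for k in "BMES"} for _ in sentence]
    for i in range(len(sentence)):
        for j in range(i + 1, len(sentence) + 1):
            phrase = sentence[i:j]
            if phrase not in lexicon:
                continue
            if j - i == 1:
                sets[i]["S"].add(phrase)          # the character itself is an entity
            else:
                sets[i]["B"].add(phrase)          # begins at character i
                sets[j - 1]["E"].add(phrase)      # ends at character j-1
                for k in range(i + 1, j - 1):     # strictly interior characters
                    sets[k]["M"].add(phrase)
    for per_char in sets:                         # empty sets get the word "NONE"
        for k in "BMES":
            if not per_char[k]:
                per_char[k].add("NONE")
    return sets

# Toy lexicon and sentence, invented for illustration.
lexicon = {"机器", "机器学习", "学习"}
result = bmes_sets("机器学习", lexicon)
```

For the first character, for instance, both "机器" and "机器学习" land in its B set, while its M, E and S sets receive "NONE".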
S202: count the number of occurrences of each phrase in the historical interview lexicon to obtain the frequency of each phrase, and determine, according to the frequency of each phrase, the weight information of the phrase sets corresponding to each character.
Specifically, after the N phrase sets corresponding to each character are obtained, ensuring that no phrase corresponding to the character is missed, the importance of each phrase needs to be differentiated in light of the actual smart-interview scenario, so that fast labeling can be achieved during subsequent named entity recognition and its efficiency improved. In this embodiment, the frequency of each phrase is determined by counting the occurrences of each phrase in the historical interview lexicon, which is relevant to the smart-interview scenario, and the weight information of the phrase sets corresponding to a character is then determined from each phrase's frequency.
The phrase frequency is a static value: it can be computed in advance by counting, stored in a data table, and obtained by table lookup when needed, which improves the efficiency of determining the weight information; the phrase frequencies can also be updated as actually needed.
It should be understood that, to improve computational efficiency, this embodiment uses the frequency with which a phrase occurs to represent the phrase's weight, because the more often a given character sequence appears in the historical interview lexicon, the more likely that sequence is a phrase.
For the specific implementation of determining the weight information of the phrase sets corresponding to each character according to the frequency of each phrase, reference may be made to the description of the subsequent embodiments; to avoid repetition, details are not repeated here.
S203: smooth the weight information to obtain a vector representation of each character in the N phrase sets.
Specifically, the smoothing optimizes the weights of low-weight phrases, avoiding the situation in which too large a weight gap causes low-weight phrases to be ignored in subsequent labeling and the named entities corresponding to them to be missed. After smoothing, the weight information of the N phrase sets is updated and vectorized to obtain the vector representation of each character in the N phrase sets.
It should be noted that unsmoothed weight information affects the data fitting during subsequent information extraction and easily makes the extraction insufficiently accurate, so the weight information needs to be smoothed. Smoothing in this embodiment means optimizing non-conforming weights through certain mathematical models so that the overall weight information falls within a reasonable range, which benefits subsequent information extraction.
For the specific process of smoothing the weight information, reference may be made to the description of the subsequent embodiments; to avoid repetition, details are not repeated here.
S204: perform information extraction on the vector representation of each character to obtain the representation information of each character.
Specifically, after the vector representation of each character is obtained, key information is extracted from the vector representation to obtain representation information of a fixed dimension, so that the representation information can subsequently be input into the sequence labeling model to determine named entities.
The representation information is a fixed-dimension vector used to represent the association between the character and each phrase.
For the specific implementation of performing information extraction on the vector representation of each character to obtain the representation information of each character, reference may be made to the description of the subsequent embodiments; to avoid repetition, details are not repeated here.
S205: combining the representation information of the characters, use a preset sequence labeling model to determine the named entities contained in the response sentence, and take the named entities as the professional terms.
Specifically, after the representation information of each character is obtained, the preset sequence labeling model is used to perform named entity recognition on the response sentence, which quickly and accurately yields the named entities contained in the response sentence, that is, quickly extracts the professional terms mentioned in the candidate's responses.
Named entity recognition is the process of identifying the names or symbols of things of specific types in a document collection. It consists of three problems: identifying the named entities in the text; determining the type of each entity; and, when multiple entities denote the same thing, selecting one of them as the representative of the group.
In this embodiment, named entity recognition mainly means identifying professional terms from the candidate's response sentences, so that the candidate's interview performance can subsequently be evaluated on the basis of the identified terms, for example, by checking whether the experience in the résumé is consistent with the responses, that is, verifying the credibility of the candidate's responses, or by testing the candidate's professional competence.
Sequence labeling models include but are not limited to the Conditional Random Field (CRF), the Hidden Markov Model (HMM) and the Maximum Entropy Markov Model (MEMM). The preset sequence labeling model in this embodiment may specifically be any existing sequence labeling model, which is not further limited here.
It should be noted that the representation information of the characters obtained in step S204 is used in the sequence labeling model to weight the characters, which helps the sequence labeling model label key features quickly and improves the accuracy and efficiency of labeling.
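For illustration only, the following sketch shows how a simple sequence labeling model of the kind mentioned above can decode the best BMES-style tag path once per-character scores have been built from the weighted representations; the emission and transition scores here are invented toy numbers, whereas a trained CRF or HMM would supply them in practice.

```python
import numpy as np

def viterbi(emissions, transitions):
    """Best tag path for CRF-style sequence labeling (tags 0..K-1).
    emissions: (T, K) per-character tag scores; transitions: (K, K) scores
    for moving from one tag to the next."""
    T, K = emissions.shape
    score = emissions[0].copy()
    back = np.zeros((T, K), dtype=int)
    for t in range(1, T):
        # total[i, j]: best score ending in tag j at step t via tag i at t-1
        total = score[:, None] + transitions + emissions[t]
        back[t] = total.argmax(axis=0)
        score = total.max(axis=0)
    path = [int(score.argmax())]
    for t in range(T - 1, 0, -1):      # follow backpointers to recover the path
        path.append(int(back[t][path[-1]]))
    return path[::-1]

# Toy scores for a 3-character sentence with 2 tags, invented for illustration.
emissions = np.array([[2.0, 0.0], [0.0, 2.0], [2.0, 0.0]])
transitions = np.zeros((2, 2))
path = viterbi(emissions, transitions)   # -> [0, 1, 0]
```

With zero transition scores the decoder simply follows the per-character preferences; a trained model's transition matrix is what enforces consistent entity spans.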
In this embodiment, a historical interview lexicon is used to scan the response sentence in the smart interview, and each character in the response sentence is segmented according to N preset dimensions respectively to obtain N phrase sets corresponding to each character, so that all phrases corresponding to the character serve as candidate phrases. This avoids the screening of candidate phrases merely by the highest probability in the training corpus, as in existing word-segmentation models, which relies excessively on the training corpus, and thus helps improve the accuracy of professional-term extraction. At the same time, the occurrences of each phrase in the historical interview lexicon are counted to obtain each phrase's frequency, the weight information of the phrase sets corresponding to each character is determined according to those frequencies, and the weight information is smoothed to obtain the vector representation of each character in the N phrase sets, so that phrase information from historical interviews gives the candidate phrases weights that better fit the interview scenario and extraction accuracy is further improved. Information extraction is then performed on each character's vector representation to obtain its representation information; combining the characters' representation information, a simple, general-purpose sequence labeling model determines the named entities contained in the response sentence, and the named entities are taken as professional terms, avoiding determination of named entities with a complex named-entity extraction model and improving the efficiency of determining named entities.
In an embodiment, the extracted professional terms can be stored on a blockchain network; blockchain storage enables data to be shared between different platforms and also prevents the data from being tampered with.
A blockchain is a new application mode of computer technologies such as distributed data storage, peer-to-peer transmission, consensus mechanisms and encryption algorithms. A blockchain is essentially a decentralized database: a chain of data blocks generated in association with one another using cryptographic methods, where each data block contains information about a batch of network transactions, used to verify the validity (anti-counterfeiting) of its information and to generate the next block. A blockchain can include an underlying blockchain platform, a platform product service layer and an application service layer.
In some optional implementations of this embodiment, in step S202, determining the weight information of the phrase sets corresponding to each character according to the frequency of each phrase includes:
for each character, determining the weight information of the phrase sets corresponding to the character by the following formulas:
v_s(S) = (1/L) · Σ_{ω_c ∈ S} z(ω_c) · e_ω(ω_c)
L = Σ_{ω_c ∈ B ∪ M ∪ E ∪ S} z(ω_c)
where v_s is the weight information of the phrase set, L is the length of the four phrase sets corresponding to the character ω, ω_c denotes a phrase formed with the character ω, z(ω) denotes the frequency with which ω_c appears in the historical interview lexicon, e_ω(ω) denotes the embedding vector of ω_c, and B, M, E and S are the phrase sets of the four dimensions respectively.
Specifically, the weight information of each phrase set is computed through the above formulas, so that the corresponding vector representation can subsequently be generated from the weight information of each phrase set.
A word embedding is an important concept in natural language processing, used to convert a word into a fixed-length vector representation for convenient mathematical processing. Specifically, the embedding vector can be generated by feeding the characters into an open-source tensorflow model, or the phrases corresponding to a character can be expressed as sequences and converted into embedding vectors through the tf.nn.embedding_lookup function.
In this embodiment, the weight information of the phrase sets corresponding to a character is obtained through the preset formulas, so that the character's representation information can subsequently be derived from the weight information; by giving different weights to the different phrases corresponding to a character, a better fit with the interview scenario is achieved, which helps improve the accuracy of subsequent named entity extraction.
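The frequency-based weighting of one phrase set described in this step can be sketched in a few lines of numpy; the normalization by the set's total frequency, the toy frequencies and the two-dimensional embedding table are assumptions made purely for illustration.

```python
import numpy as np

def set_vector(phrases, freq, emb):
    """Frequency-weighted combination of one phrase set's embeddings.
    Assumed form: v_s = (1/Z) * sum(z(w) * e(w)) with Z the total frequency."""
    z = np.array([freq.get(w, 1) for w in phrases], dtype=float)  # frequencies
    e = np.stack([emb[w] for w in phrases])                       # embeddings
    return (z @ e) / z.sum()

# Toy frequencies and 2-d embeddings, invented for illustration.
emb = {"机器": np.array([1.0, 0.0]), "机器学习": np.array([0.0, 1.0])}
freq = {"机器": 3, "机器学习": 1}
v = set_vector(["机器", "机器学习"], freq, emb)   # -> [0.75, 0.25]
```

The more frequent phrase "机器" dominates the set vector, which is the intended effect: sequences seen often in historical interviews are more likely to be real phrases.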
In some optional implementations of this embodiment, in step S203, smoothing the weight information to obtain a vector representation of each character in the N phrase sets includes:
for each character, sorting each phrase in the N phrase sets corresponding to the character in ascending order of frequency to obtain a sorting result;
obtaining a preset ratio, determining a selection quantity M according to the preset ratio, and taking the phrases corresponding to the first M frequencies in the sorting result as phrases to be processed;
raising the weights of the phrases to be processed according to a preset weighting method, and updating the weight information of the phrase sets corresponding to the character;
vectorizing the updated weight information of the phrase sets to obtain the vector representation of the character in the N phrase sets.
Specifically, the phrases in the N phrase sets corresponding to a character are sorted by frequency, the phrases needing smoothing are selected according to the preset condition and smoothed, the weight information is updated accordingly, and the updated weight information is then vectorized to obtain the vector representation of the character in the N phrase sets.
The preset ratio can be set as actually needed; for example, in a specific embodiment, the preset ratio is 10%.
The value of M is determined by the product of the preset ratio and the number of phrases in the phrase sets.
The preset weighting method can be chosen as actually needed. In this embodiment, the weighting is performed through the following formula:
z′(ω_c) = z(ω_c) + a
where a is a constant whose value is determined by the maximum frequency among the phrases to be processed.
For example, in a specific embodiment, the preset ratio is 10% and the number of phrases in the phrase sets is 180, so the corresponding value of M is 18; after sorting by frequency, the phrases corresponding to the 18 lowest frequencies in the sequence are taken as the phrases to be processed, and the maximum frequency among the phrases to be processed is taken as the constant a.
In this embodiment, the weights of low-frequency phrases are appropriately smoothed so that they do not lose effect later, which would cause possible named entities to be missed.
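A minimal sketch of this smoothing step follows, under the stated assumption that the constant a (the largest frequency among the selected low-frequency phrases) is added to each of their frequencies; the function name, ratio and toy table are invented for illustration.

```python
def smooth(freq_table, ratio=0.10):
    """Raise the weights of the lowest-frequency phrases so they are not
    drowned out in subsequent labeling."""
    ordered = sorted(freq_table, key=freq_table.get)   # ascending frequency
    m = int(len(ordered) * ratio)                      # selection quantity M
    to_boost = set(ordered[:m])                        # phrases to be processed
    if not to_boost:
        return dict(freq_table)
    a = max(freq_table[p] for p in to_boost)           # the constant a
    return {p: z + a if p in to_boost else z for p, z in freq_table.items()}

# Toy frequency table: w0 has frequency 1, ..., w9 has frequency 10.
table = {f"w{i}": i + 1 for i in range(10)}
smoothed = smooth(table, ratio=0.2)
```

With a 20% ratio the two rarest phrases (frequencies 1 and 2) are boosted by a = 2, while all other frequencies stay unchanged.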
In some optional implementations of this embodiment, in step S204, performing information extraction on the vector representation of each character to obtain the representation information of each character includes:
for each character, concatenating the vector representations of the N phrase sets to obtain a concatenated vector;
compressing the concatenated vector to a preset fixed dimension to obtain the representation information of the character.
Specifically, after the vector representation of a character in the N phrase sets is obtained, in order to determine more precisely how the phrases corresponding to the character combine, the vector representations of the N phrase sets corresponding to each character are compressed in a preset way to obtain a vector of fixed dimension.
In this embodiment, to retain as much information as possible, the preset way is to concatenate the four phrase sets and express them as a whole, yielding the single-character representation of the character, compressed according to the following formula:
x_c = e_s ⊕ v_s(B) ⊕ v_s(M) ⊕ v_s(E) ⊕ v_s(S)
where e_s denotes the embedding vector corresponding to the character, ⊕ denotes vector concatenation, v_s denotes the function mapping the corresponding phrase set to a dense vector, and x_c is the representation information corresponding to the character.
In this embodiment, the vector representations of the phrase sets are concatenated to obtain a concatenated vector, and the concatenated vector is compressed to a fixed dimension for subsequent input to the preset sequence labeling model to determine named entities.
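The concatenate-then-compress step can be sketched as follows; the random projection matrix stands in for whatever learned or preset linear mapping actually compresses the concatenated vector to the fixed dimension, and the toy set vectors are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def char_representation(set_vectors, out_dim):
    """Concatenate the vector representations of the four phrase sets and
    compress the concatenated vector to a preset fixed dimension."""
    concat = np.concatenate(set_vectors)   # the concatenated vector
    # Random projection as a stand-in for a learned linear compression layer.
    proj = rng.standard_normal((out_dim, concat.size)) / np.sqrt(concat.size)
    return proj @ concat                   # fixed-dimension representation x_c

# Toy 2-d vectors for the B, M, E and S sets of one character.
vecs = [np.ones(2), np.zeros(2), np.ones(2), np.zeros(2)]
x = char_representation(vecs, out_dim=3)
```

Whatever the dimensions of the individual set vectors, every character ends up with a representation of the same preset size, which is what the sequence labeling model consumes.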
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation of the embodiments of this application.
Fig. 3 shows a functional block diagram of an apparatus for extracting professional terms in a smart interview in one-to-one correspondence with the method of the above embodiments. As shown in Fig. 3, the apparatus includes a phrase segmentation module 31, a weight determination module 32, a vector representation module 33, a character representation module 34 and a term determination module 35. The functional modules are described in detail as follows:
the phrase segmentation module 31 is configured to use a historical interview lexicon to scan a response sentence in the smart interview, and segment each character in the response sentence according to N preset dimensions respectively, to obtain N phrase sets corresponding to each character, where N is a positive integer;
the weight determination module 32 is configured to count the number of occurrences of each phrase in the historical interview lexicon to obtain each phrase's frequency, and determine, according to each phrase's frequency, the weight information of the phrase sets corresponding to each character;
the vector representation module 33 is configured to smooth the weight information to obtain a vector representation of each character in the N phrase sets;
the character representation module 34 is configured to perform information extraction on the vector representation of each character to obtain the representation information of each character;
the term determination module 35 is configured to combine the representation information of the characters, use the preset sequence labeling model to determine the named entities contained in the response sentence, and take the named entities as the professional terms.
Optionally, in the phrase segmentation module 31, N is 4; among the four preset dimensions, the first dimension is phrases in which the character is the beginning of a named entity, the second dimension is phrases in which the character is in the middle of a named entity, the third dimension is phrases in which the character is the end of a named entity, and the fourth dimension is the character itself as a named entity.
Optionally, the weight determination module 32 includes:
a weight calculation unit, configured to determine, for each character, the weight information of the phrase sets corresponding to the character by the following formulas:
v_s(S) = (1/L) · Σ_{ω_c ∈ S} z(ω_c) · e_ω(ω_c)
L = Σ_{ω_c ∈ B ∪ M ∪ E ∪ S} z(ω_c)
where v_s is the weight information of the phrase set, L is the length of the four phrase sets corresponding to the character ω, ω_c denotes a phrase formed with the character ω, z(ω) denotes the frequency with which ω_c appears in the historical interview lexicon, e_ω(ω) denotes the embedding vector of ω_c, and B, M, E and S are the phrase sets of the four dimensions respectively.
Optionally, the vector representation module 33 includes:
a frequency sorting unit, configured to sort, for each character, each phrase in the N phrase sets corresponding to the character in ascending order of frequency to obtain a sorting result;
a phrase selection unit, configured to obtain a preset ratio, determine a selection quantity M according to the preset ratio, and take the phrases corresponding to the first M frequencies in the sorting result as phrases to be processed;
a weight update unit, configured to raise the weights of the phrases to be processed according to a preset weighting method, and update the weight information of the phrase sets corresponding to the character;
a weight vectorization unit, configured to vectorize the updated weight information of the phrase sets to obtain the vector representation of the character in the N phrase sets.
Optionally, the character representation module 34 includes:
a vector concatenation unit, configured to concatenate, for each character, the vector representations of the N phrase sets to obtain a concatenated vector;
a vector compression unit, configured to compress the concatenated vector to a preset fixed dimension to obtain the representation information of the character.
Optionally, the apparatus for extracting professional terms in a smart interview further includes:
a storage module, configured to store the professional terms in a blockchain network.
For specific limitations on the apparatus for extracting professional terms in a smart interview, reference may be made to the limitations on the method for extracting professional terms in a smart interview above; details are not repeated here. Each module in the above apparatus may be implemented wholly or partly by software, hardware or a combination thereof. The modules may be embedded in, or independent of, a processor in a computer device in hardware form, or stored in software form in a memory of the computer device, so that the processor can invoke and execute the operations corresponding to each module.
To solve the above technical problem, an embodiment of this application further provides a computer device. Referring to Fig. 4, Fig. 4 is a block diagram of the basic structure of the computer device of this embodiment.
The computer device 4 includes a memory 41, a processor 42 and a network interface 43 communicatively connected to one another through a system bus. It should be pointed out that the figure shows only a computer device 4 having the components memory 41, processor 42 and network interface 43, but it should be understood that implementing all of the shown components is not required; more or fewer components may be implemented instead. Those skilled in the art will understand that the computer device here is a device capable of automatically performing numerical computation and/or information processing according to preset or stored instructions, and its hardware includes but is not limited to microprocessors, application-specific integrated circuits (ASIC), field-programmable gate arrays (FPGA), digital signal processors (DSP), embedded devices and so on.
The computer device may be a desktop computer, a notebook, a palmtop computer, a cloud server or another computing device. The computer device may interact with the user through a keyboard, a mouse, a remote control, a touch pad, a voice-control device or the like.
The memory 41 includes at least one type of readable storage medium, including flash memory, hard disk, multimedia card, card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, magnetic disk, optical disc, etc. In some embodiments, the memory 41 may be an internal storage unit of the computer device 4, such as its hard disk or internal memory. In other embodiments, the memory 41 may also be an external storage device of the computer device 4, such as a plug-in hard disk, smart media card (SMC), secure digital (SD) card or flash card provided on the computer device 4. Of course, the memory 41 may also include both the internal storage unit and the external storage device of the computer device 4. In this embodiment, the memory 41 is generally used to store the operating system and various application software installed on the computer device 4, such as program code for controlling electronic files. In addition, the memory 41 may also be used to temporarily store various data that have been output or will be output.
The processor 42 may in some embodiments be a central processing unit (CPU), a controller, a microcontroller, a microprocessor or another data processing chip. The processor 42 is generally used to control the overall operation of the computer device 4. In this embodiment, the processor 42 is configured to run the program code or process the data stored in the memory 41, for example to run the program code for controlling electronic files.
The network interface 43 may include a wireless network interface or a wired network interface, and is generally used to establish communication connections between the computer device 4 and other electronic devices.
This application further provides another implementation, namely a computer-readable storage medium, which may be non-volatile or volatile; the computer-readable storage medium stores an interface display program executable by at least one processor, so that the at least one processor executes the steps of the method for extracting professional terms in a smart interview as described above.
Through the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, and of course also by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of this application, in essence or in the part contributing to the prior art, can be embodied in the form of a software product; the computer software product is stored in a storage medium (such as ROM/RAM, a magnetic disk or an optical disc) and includes several instructions to cause a terminal device (which may be a mobile phone, a computer, a server, an air conditioner, a network device, etc.) to execute the methods described in the embodiments of this application.
Obviously, the embodiments described above are only some rather than all of the embodiments of this application; the drawings show preferred embodiments of this application but do not limit its patent scope. This application can be implemented in many different forms; rather, these embodiments are provided to make the disclosure of this application more thorough and comprehensive. Although this application has been described in detail with reference to the foregoing embodiments, those skilled in the art can still modify the technical solutions recorded in the foregoing specific implementations or make equivalent substitutions for some of the technical features. Any equivalent structure made using the contents of the specification and drawings of this application, applied directly or indirectly in other related technical fields, likewise falls within the scope of patent protection of this application.

Claims (20)

  1. A method for extracting professional terms in a smart interview, wherein the method for extracting professional terms in a smart interview comprises:
    using a historical interview lexicon to scan a response sentence in the smart interview, and segmenting each character in the response sentence according to N preset dimensions respectively, to obtain N phrase sets corresponding to each character, wherein N is a positive integer;
    counting the number of occurrences of each phrase in the historical interview lexicon to obtain a frequency of each phrase, and determining, according to the frequency of each phrase, weight information of the phrase sets corresponding to each character;
    smoothing the weight information to obtain a vector representation of each character in the N phrase sets;
    performing information extraction on the vector representation of each character to obtain representation information of each character;
    combining the representation information of the characters, using a preset sequence labeling model to determine named entities contained in the response sentence, and taking the named entities as the professional terms.
  2. The method for extracting professional terms in a smart interview according to claim 1, wherein N is 4, and among the 4 preset dimensions, a first dimension is phrases in which the character is the beginning of a named entity, a second dimension is phrases in which the character is in the middle of a named entity, a third dimension is phrases in which the character is the end of a named entity, and a fourth dimension is the character itself as a named entity.
  3. The method for extracting professional terms in a smart interview according to claim 2, wherein determining, according to the frequency of each phrase, the weight information of the phrase sets corresponding to each character comprises:
    for each character, determining the weight information of the phrase sets corresponding to the character by the following formulas:
    v_s(S) = (1/L) · Σ_{ω_c ∈ S} z(ω_c) · e_ω(ω_c)
    L = Σ_{ω_c ∈ B ∪ M ∪ E ∪ S} z(ω_c)
    wherein v_s is the weight information of the phrase set, L is the length of the four phrase sets corresponding to the character ω, ω_c denotes a phrase formed with the character ω, z(ω) denotes the frequency with which ω_c appears in the historical interview lexicon, e_ω(ω) denotes the embedding vector of ω_c, and B, M, E and S are the phrase sets of the four dimensions respectively.
  4. The method for extracting professional terms in a smart interview according to claim 3, wherein smoothing the weight information to obtain a vector representation of each character in the N phrase sets comprises:
    for each character, sorting each phrase in the N phrase sets corresponding to the character in ascending order of frequency to obtain a sorting result;
    obtaining a preset ratio, determining a selection quantity M according to the preset ratio, and taking the phrases corresponding to the first M frequencies in the sorting result as phrases to be processed;
    raising the weights of the phrases to be processed according to a preset weighting method, and updating the weight information of the phrase sets corresponding to the character;
    vectorizing the updated weight information of the phrase sets to obtain the vector representation of the character in the N phrase sets.
  5. The method for extracting professional terms in a smart interview according to claim 1, wherein performing information extraction on the vector representation of each character to obtain the representation information of each character comprises:
    for each character, concatenating the vector representations of the N phrase sets to obtain a concatenated vector;
    compressing the concatenated vector to a preset fixed dimension to obtain the representation information of the character.
  6. The method for extracting professional terms in a smart interview according to claim 1, further comprising, after taking the named entities as the professional terms: storing the professional terms in a blockchain network.
  7. An apparatus for extracting professional terms in a smart interview, wherein the apparatus for extracting professional terms in a smart interview comprises:
    a phrase segmentation module, configured to use a historical interview lexicon to scan a response sentence in the smart interview, and segment each character in the response sentence according to N preset dimensions respectively, to obtain N phrase sets corresponding to each character, wherein N is a positive integer;
    a weight determination module, configured to count the number of occurrences of each phrase in the historical interview lexicon to obtain a frequency of each phrase, and determine, according to the frequency of each phrase, weight information of the phrase sets corresponding to each character;
    a vector representation module, configured to smooth the weight information to obtain a vector representation of each character in the N phrase sets;
    a character representation module, configured to perform information extraction on the vector representation of each character to obtain representation information of each character;
    a term determination module, configured to combine the representation information of the characters, use a preset sequence labeling model to determine named entities contained in the response sentence, and take the named entities as the professional terms.
  8. The apparatus for extracting professional terms in a smart interview according to claim 7, wherein the vector representation module comprises:
    a frequency sorting unit, configured to sort, for each character, each phrase in the N phrase sets corresponding to the character in ascending order of frequency to obtain a sorting result;
    a phrase selection unit, configured to obtain a preset ratio, determine a selection quantity M according to the preset ratio, and take the phrases corresponding to the first M frequencies in the sorting result as phrases to be processed;
    a weight update unit, configured to raise the weights of the phrases to be processed according to a preset weighting method, and update the weight information of the phrase sets corresponding to the character;
    a weight vectorization unit, configured to vectorize the updated weight information of the phrase sets to obtain the vector representation of the character in the N phrase sets.
  9. The apparatus for extracting professional terms in a smart interview according to claim 7, wherein the apparatus further comprises:
    a storage module, configured to store the professional terms in a blockchain network.
  10. The apparatus for extracting professional terms in a smart interview according to claim 7, wherein in the phrase segmentation module, N is 4, and among the 4 preset dimensions, a first dimension is phrases in which the character is the beginning of a named entity, a second dimension is phrases in which the character is in the middle of a named entity, a third dimension is phrases in which the character is the end of a named entity, and a fourth dimension is the character itself as a named entity.
  11. A computer device, comprising a memory, a processor, and computer-readable instructions stored in the memory and executable on the processor, wherein the processor implements the following steps when executing the computer-readable instructions:
    using a historical interview lexicon to scan a response sentence in the smart interview, and segmenting each character in the response sentence according to N preset dimensions respectively, to obtain N phrase sets corresponding to each character, wherein N is a positive integer;
    counting the number of occurrences of each phrase in the historical interview lexicon to obtain a frequency of each phrase, and determining, according to the frequency of each phrase, weight information of the phrase sets corresponding to each character;
    smoothing the weight information to obtain a vector representation of each character in the N phrase sets;
    performing information extraction on the vector representation of each character to obtain representation information of each character;
    combining the representation information of the characters, using a preset sequence labeling model to determine named entities contained in the response sentence, and taking the named entities as the professional terms.
  12. The computer device according to claim 11, wherein N is 4, and among the 4 preset dimensions, a first dimension is phrases in which the character is the beginning of a named entity, a second dimension is phrases in which the character is in the middle of a named entity, a third dimension is phrases in which the character is the end of a named entity, and a fourth dimension is the character itself as a named entity.
  13. The computer device according to claim 12, wherein determining, according to the frequency of each phrase, the weight information of the phrase sets corresponding to each character comprises:
    for each character, determining the weight information of the phrase sets corresponding to the character by the following formulas:
    v_s(S) = (1/L) · Σ_{ω_c ∈ S} z(ω_c) · e_ω(ω_c)
    L = Σ_{ω_c ∈ B ∪ M ∪ E ∪ S} z(ω_c)
    wherein v_s is the weight information of the phrase set, L is the length of the four phrase sets corresponding to the character ω, ω_c denotes a phrase formed with the character ω, z(ω) denotes the frequency with which ω_c appears in the historical interview lexicon, e_ω(ω) denotes the embedding vector of ω_c, and B, M, E and S are the phrase sets of the four dimensions respectively.
  14. The computer device according to claim 13, wherein smoothing the weight information to obtain a vector representation of each character in the N phrase sets comprises:
    for each character, sorting each phrase in the N phrase sets corresponding to the character in ascending order of frequency to obtain a sorting result;
    obtaining a preset ratio, determining a selection quantity M according to the preset ratio, and taking the phrases corresponding to the first M frequencies in the sorting result as phrases to be processed;
    raising the weights of the phrases to be processed according to a preset weighting method, and updating the weight information of the phrase sets corresponding to the character;
    vectorizing the updated weight information of the phrase sets to obtain the vector representation of the character in the N phrase sets.
  15. The computer device according to claim 11, wherein performing information extraction on the vector representation of each character to obtain the representation information of each character comprises:
    for each character, concatenating the vector representations of the N phrase sets to obtain a concatenated vector;
    compressing the concatenated vector to a preset fixed dimension to obtain the representation information of the character.
  16. A computer-readable storage medium storing computer-readable instructions, wherein the computer-readable instructions, when executed by a processor, implement the following steps:
    using a historical interview lexicon to scan a response sentence in the smart interview, and segmenting each character in the response sentence according to N preset dimensions respectively, to obtain N phrase sets corresponding to each character, wherein N is a positive integer;
    counting the number of occurrences of each phrase in the historical interview lexicon to obtain a frequency of each phrase, and determining, according to the frequency of each phrase, weight information of the phrase sets corresponding to each character;
    smoothing the weight information to obtain a vector representation of each character in the N phrase sets;
    performing information extraction on the vector representation of each character to obtain representation information of each character;
    combining the representation information of the characters, using a preset sequence labeling model to determine named entities contained in the response sentence, and taking the named entities as the professional terms.
  17. The computer-readable storage medium according to claim 16, wherein N is 4, and among the 4 preset dimensions, a first dimension is phrases in which the character is the beginning of a named entity, a second dimension is phrases in which the character is in the middle of a named entity, a third dimension is phrases in which the character is the end of a named entity, and a fourth dimension is the character itself as a named entity.
  18. The computer-readable storage medium according to claim 17, wherein determining, according to the frequency of each phrase, the weight information of the phrase sets corresponding to each character comprises:
    for each character, determining the weight information of the phrase sets corresponding to the character by the following formulas:
    v_s(S) = (1/L) · Σ_{ω_c ∈ S} z(ω_c) · e_ω(ω_c)
    L = Σ_{ω_c ∈ B ∪ M ∪ E ∪ S} z(ω_c)
    wherein v_s is the weight information of the phrase set, L is the length of the four phrase sets corresponding to the character ω, ω_c denotes a phrase formed with the character ω, z(ω) denotes the frequency with which ω_c appears in the historical interview lexicon, e_ω(ω) denotes the embedding vector of ω_c, and B, M, E and S are the phrase sets of the four dimensions respectively.
  19. The computer-readable storage medium according to claim 18, wherein smoothing the weight information to obtain a vector representation of each character in the N phrase sets comprises:
    for each character, sorting each phrase in the N phrase sets corresponding to the character in ascending order of frequency to obtain a sorting result;
    obtaining a preset ratio, determining a selection quantity M according to the preset ratio, and taking the phrases corresponding to the first M frequencies in the sorting result as phrases to be processed;
    raising the weights of the phrases to be processed according to a preset weighting method, and updating the weight information of the phrase sets corresponding to the character;
    vectorizing the updated weight information of the phrase sets to obtain the vector representation of the character in the N phrase sets.
  20. The computer-readable storage medium according to claim 16, wherein performing information extraction on the vector representation of each character to obtain the representation information of each character comprises:
    for each character, concatenating the vector representations of the N phrase sets to obtain a concatenated vector;
    compressing the concatenated vector to a preset fixed dimension to obtain the representation information of the character.
PCT/CN2020/118919 2020-04-29 2020-09-29 Method, apparatus, device and medium for extracting professional terms in a smart interview WO2021218027A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202010356753.1A CN111695337A (zh) 2020-04-29 2020-04-29 智能面试中专业术语的提取方法、装置、设备及介质
CN202010356753.1 2020-04-29

Publications (1)

Publication Number Publication Date
WO2021218027A1 true WO2021218027A1 (zh) 2021-11-04

Family

ID=72476870

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/118919 WO2021218027A1 (zh) Method, apparatus, device and medium for extracting professional terms in a smart interview

Country Status (2)

Country Link
CN (1) CN111695337A (zh)
WO (1) WO2021218027A1 (zh)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114124860A (zh) * 2021-11-26 2022-03-01 中国联合网络通信集团有限公司 Session management method, apparatus, device and storage medium
CN115019327A (zh) * 2022-06-28 2022-09-06 珠海金智维信息科技有限公司 Fragmented-bill recognition method and system based on fragmented-bill word segmentation and a Transformer network

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111695337A (zh) * 2020-04-29 2020-09-22 平安科技(深圳)有限公司 Method, apparatus, device and medium for extracting professional terms in a smart interview
CN112906016B (zh) * 2021-01-28 2023-10-27 北京金山云网络技术有限公司 Data processing method, apparatus and electronic device

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101271459A (zh) * 2007-03-22 2008-09-24 北京搜狗科技发展有限公司 Method for generating a lexicon, input method, and input method system
CN102426603A (zh) * 2011-11-11 2012-04-25 任子行网络技术股份有限公司 Method and device for regional recognition of text information
US20180107933A1 (en) * 2016-01-07 2018-04-19 Tencent Technology (Shenzhen) Company Limited Web page training method and device, and search intention identifying method and device
CN110287488A (zh) * 2019-06-18 2019-09-27 上海晏鼠计算机技术股份有限公司 Chinese text word segmentation method based on big data and Chinese features
CN111695337A (zh) * 2020-04-29 2020-09-22 平安科技(深圳)有限公司 Method, apparatus, device and medium for extracting professional terms in a smart interview


Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114124860A (zh) * 2021-11-26 2022-03-01 中国联合网络通信集团有限公司 Session management method, apparatus, device and storage medium
CN115019327A (zh) * 2022-06-28 2022-09-06 珠海金智维信息科技有限公司 Fragmented-bill recognition method and system based on fragmented-bill word segmentation and a Transformer network
CN115019327B (zh) * 2022-06-28 2024-03-08 珠海金智维信息科技有限公司 Fragmented-bill recognition method and system based on fragmented-bill word segmentation and a Transformer network

Also Published As

Publication number Publication date
CN111695337A (zh) 2020-09-22

Similar Documents

Publication Publication Date Title
WO2021218027A1 (zh) Method, apparatus, device and medium for extracting professional terms in a smart interview
CN111581976A (zh) Method, apparatus, computer device and storage medium for standardizing medical terms
WO2021121198A1 (zh) Semantic-similarity-based entity relation extraction method, apparatus, device and medium
CN111797214A (zh) FAQ-database-based question screening method, apparatus, computer device and medium
CN114780727A (zh) Reinforcement-learning-based text classification method, apparatus, computer device and medium
WO2021218028A1 (zh) Artificial-intelligence-based interview content refining method, apparatus, device and medium
CN112287069A (zh) Speech-semantics-based information retrieval method, apparatus and computer device
CN112926308B (zh) Method, apparatus, device, storage medium and program product for matching body text
CN111930792A (zh) Data resource labeling method, apparatus, storage medium and electronic device
CN113722438A (zh) Sentence-vector generation method and apparatus based on a sentence-vector model, and computer device
CN112632278A (zh) Multi-label-classification-based labeling method, apparatus, device and storage medium
CN113987125A (zh) Neural-network-based method for extracting structured information from text, and related device
CN113158656A (zh) Sarcastic content recognition method, apparatus, electronic device and storage medium
CN116402166B (zh) Prediction model training method, apparatus, electronic device and storage medium
CN115248890A (zh) User interest profile generation method, apparatus, electronic device and storage medium
CN112417875A (zh) Configuration information update method, apparatus, computer device and medium
CN113095073B (zh) Corpus label generation method, apparatus, computer device and storage medium
CN113505293B (zh) Information push method, apparatus, electronic device and storage medium
CN112307183B (zh) Search data recognition method, apparatus, electronic device and computer storage medium
CN115048523A (zh) Text classification method, apparatus, device and storage medium
CN114416990A (zh) Method and apparatus for constructing an object relation network, and electronic device
CN114528851A (zh) Reply sentence determination method, apparatus, electronic device and storage medium
CN114528378A (zh) Text classification method, apparatus, electronic device and storage medium
CN112199954A (zh) Speech-semantics-based disease entity matching method, apparatus and computer device
CN112559739A (zh) Method for processing insulation state data of power equipment

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20933957

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 20933957

Country of ref document: EP

Kind code of ref document: A1