WO2023092795A1 - Noise data identification method, device, terminal and storage medium - Google Patents

Noise data identification method, device, terminal and storage medium

Info

Publication number
WO2023092795A1
Authority
WO
WIPO (PCT)
Prior art keywords
noise
entity
target
feature vector
feature
Prior art date
Application number
PCT/CN2021/141769
Other languages
English (en)
French (fr)
Inventor
沈浩
吴优
Original Assignee
上海帜讯信息技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 上海帜讯信息技术股份有限公司
Publication of WO2023092795A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/217 Validation; Performance evaluation; Active pattern learning techniques

Definitions

  • the invention relates to the technical field of data processing, and in particular to a noise data identification method, device, terminal and storage medium.
  • Models are widely used in many fields, but the data available before modeling often contains a large amount of noise and must be denoised. In particular, how to denoise data that has been vectorized into a high-dimensional space has become an urgent problem.
  • at present, standard-deviation denoising, binning denoising, DBSCAN denoising or isolation forest denoising is generally used to denoise high-dimensional vectorized data.
  • the main purpose of the present application is to provide a noise data identification method, device, terminal and storage medium that address the poor denoising of high-dimensional vectorized data in the related art.
  • the present application provides a noise data identification method, including:
  • the entity feature set corresponding to the target entity feature vector and the noise feature set corresponding to the target noise feature vector are obtained;
  • Noise entities are identified based on target weights and noise entity recognition algorithms.
  • vectorization and feature processing are performed sequentially on the initial entity information to obtain target entity feature vectors and target noise feature vectors, including:
  • the loss function is used to perform feature processing on the initial entity feature vector and the initial noise feature vector to obtain the target entity feature vector and the target noise feature vector.
  • the loss function is used to perform feature processing on the initial entity feature vector and the initial noise feature vector to obtain the target entity feature vector and the target noise feature vector, including:
  • the loss weight is used to enlarge the distance between the initial entity feature vector and the initial noise feature vector to obtain the target entity feature vector and the target noise feature vector.
  • a deep learning algorithm is used to classify the target entity feature vector and the target noise feature vector to obtain the entity feature set corresponding to the target entity feature vector and the noise feature set corresponding to the target noise feature vector, including:
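  • if the target entity feature vector is less than or equal to the preset entity feature vector, the target entity feature vectors are aggregated to determine the entity feature set;
  • if the target noise feature vector is greater than the preset noise feature vector, the target noise feature vectors are aggregated to determine the noise feature set.
  • the target weight is determined based on the entity feature set and the noise feature set, including:
  • if the first vector number (the number of vectors in the entity feature set) is greater than or equal to the second vector number (the number of vectors in the noise feature set), the sum of all vectors in the entity feature set is used as the target weight;
  • if the first vector number is smaller than the second vector number, the negative of the sum of all vectors in the noise feature set is used as the target weight.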
  • the noise entity is determined based on the target weight and the noise entity recognition algorithm, including:
  • an embodiment of the present invention provides a noise data identification device, including:
  • the preprocessing module is used to sequentially perform vectorization and feature processing on the initial entity information to obtain the target entity feature vector and the target noise feature vector;
  • the set determination module is used to classify the target entity feature vector and the target noise feature vector using a deep learning algorithm, obtaining the entity feature set corresponding to the target entity feature vector and the noise feature set corresponding to the target noise feature vector;
  • a weight determination module is used to determine the target weight based on the entity feature set and the noise feature set;
  • the noise recognition module is used to determine the noise entity based on the target weight and the noise entity recognition algorithm.
  • the preprocessing module includes:
  • the vectorization sub-module is used to sequentially perform low-dimensional space vectorization and high-dimensional space vectorization on the initial entity information to obtain the initial entity feature vector and the initial noise feature vector;
  • the feature processing sub-module is used to perform feature processing on the initial entity feature vector and the initial noise feature vector by using the loss function to obtain the target entity feature vector and the target noise feature vector.
  • an embodiment of the present invention provides a terminal, including a memory, a processor, and a computer program stored in the memory and executable on the processor; when the processor executes the computer program, the steps of any of the noise data identification methods described above are implemented.
  • an embodiment of the present invention provides a computer-readable storage medium, where a computer program is stored in the computer-readable storage medium, and when the computer program is executed by a processor, the steps of any one of the noise data identification methods above are realized.
  • the embodiment of the present invention provides a noise data identification method, device, terminal and storage medium, including: sequentially performing vectorization and feature processing on the initial entity information to obtain the target entity feature vector and the target noise feature vector; then classifying the target entity feature vector and the target noise feature vector using a deep learning algorithm to obtain the entity feature set corresponding to the target entity feature vector and the noise feature set corresponding to the target noise feature vector; then determining the target weight based on the entity feature set and the noise feature set; and finally determining the noise entity based on the target weight and the noise entity recognition algorithm.
  • the present invention vectorizes the initial entity information into a high-dimensional space, which effectively separates entity feature vectors from noise feature vectors, and then uses a deep learning algorithm to separate them further. Finally, substituting the target weight into the noise entity recognition algorithm effectively distinguishes core entities dominated by real features from noise entities dominated by noise features, thereby improving noise entity recognition and, in turn, the accuracy of subsequent model processing.
  • Fig. 1 is the implementation flowchart of a kind of noise data identification method provided by the embodiment of the present invention
  • Fig. 2 is a schematic diagram of the mapping of entities from low-dimensional space to high-dimensional space provided by the embodiment of the present invention
  • FIG. 3 is a schematic diagram of determining a target feature vector corresponding to an entity based on a loss function provided by an embodiment of the present invention
  • Fig. 4 is a schematic diagram of determining a target weight based on a feature set corresponding to an entity provided by an embodiment of the present invention
  • Fig. 5 is a schematic structural diagram of a noise data identification device provided by an embodiment of the present invention.
  • Fig. 6 is a schematic diagram of a terminal provided by an embodiment of the present invention.
  • a noise data identification method comprising the following steps:
  • Step S101 Perform vectorization and feature processing on the initial entity information in sequence to obtain the target entity feature vector and the target noise feature vector;
  • Step S102 Using a deep learning algorithm to classify the target entity feature vector and the target noise feature vector to obtain the entity feature set corresponding to the target entity feature vector and the noise feature set corresponding to the target noise feature vector;
  • Step S103 Determine the target weight based on the entity feature set and the noise feature set
  • Step S104 Determine the noise entity based on the target weight and the noise entity recognition algorithm.
  • Deep learning (DL) is a newer research direction within machine learning (ML); it was introduced to bring machine learning closer to its original goal, artificial intelligence (AI).
  • Deep learning learns the inherent patterns and levels of representation in sample data. The information obtained during learning greatly helps in interpreting data such as text, images and sound. Its ultimate goal is to give machines a human-like ability to analyze and learn, so that they can recognize data such as text, images and sound.
  • based on a deep learning algorithm, the present invention can learn the characteristics of entity feature vectors and of noise feature vectors and thereby classify the target entity feature vector and the target noise feature vector. This effectively distinguishes the two and improves the accuracy of the entity feature set corresponding to the target entity feature vector and of the noise feature set corresponding to the target noise feature vector.
  • An embodiment of the present invention provides a method for identifying noise data, including: sequentially performing vectorization and feature processing on the initial entity information to obtain the target entity feature vector and the target noise feature vector; then classifying the target entity feature vector and the target noise feature vector using a deep learning algorithm to obtain the entity feature set corresponding to the target entity feature vector and the noise feature set corresponding to the target noise feature vector; then determining the target weight based on the entity feature set and the noise feature set; and finally determining the noise entity based on the target weight and the noise entity recognition algorithm.
  • the present invention vectorizes the initial entity information into a high-dimensional space, which effectively separates entity feature vectors from noise feature vectors, and then uses a deep learning algorithm to separate them further. Finally, substituting the target weight into the noise entity recognition algorithm effectively distinguishes core entities dominated by real features from noise entities dominated by noise features, thereby improving noise entity recognition and, in turn, the accuracy of subsequent model processing.
  • step S101 includes:
  • Step S201 Perform low-dimensional space vectorization and high-dimensional space vectorization on the initial entity information in sequence to obtain an initial entity feature vector and an initial noise feature vector.
  • the vectorization in the present invention includes low-dimensional space vectorization and high-dimensional space vectorization. Low-dimensional space vectorization alone can identify the entity information in the initial entity information but cannot clearly identify the noise information. The initial entity information is therefore vectorized first in a low-dimensional space and then in a high-dimensional space, yielding the initial entity feature vectors (solid arrows in Figure 2) and the initial noise feature vectors (dotted arrows in Figure 2).
  • Step S202 Using a loss function to perform feature processing on the initial entity feature vector and the initial noise feature vector to obtain a target entity feature vector and a target noise feature vector.
  • by introducing a loss function (namely, an entity high-dimensional vector loss function), this patent reduces noise features under unsupervised conditions and improves the spatial representation of effective features, thereby reducing the influence of noise vectors on the final entity classification result.
  • design of the loss function: a loss function is designed for a single feature dimension; in its formula, z represents the value of the original feature vector in a single dimension and e is the natural constant.
  • the role of the loss function is to add a loss weight ω to the initial entity feature vector i (i.e. i1 and i2 in Figure 3) and the initial noise feature vector j (i.e. j1-j5 in Figure 3). The loss weight ω further enlarges the feature distance between the initial entity feature vector i and the initial noise feature vector j in the same space, which makes it easier to determine the target entity feature vector i' (i.e. i'1 and i'2 in Figure 3) and the target noise feature vector j' (i.e. j'1-j'5 in Figure 3).
  • after the target entity feature vector and the target noise feature vector are obtained through the previous embodiment, a deep learning algorithm is used to classify them in order to determine the entity feature set corresponding to the target entity feature vector and the noise feature set corresponding to the target noise feature vector.
  • step S102 includes: if the target entity feature vector is less than or equal to the preset entity feature vector, aggregating the target entity feature vectors to determine the entity feature set; if the target noise feature vector is greater than the preset noise feature vector, aggregating the target noise feature vectors to determine the noise feature set.
  • step S103 includes:
  • Step S301 Determine the first vector number corresponding to the entity feature set and the second vector number corresponding to the noise feature set.
  • the first number of vectors refers to the total number of target entity feature vectors included in the entity feature set
  • the second number of vectors refers to the total number of target noise feature vectors included in the noise feature set
  • Step S302 If the first vector number is greater than or equal to the second vector number, use the sum of all vectors in the entity feature set as the target weight;
  • Step S303 If the first number of vectors is smaller than the second number of vectors, use the negative value of the sum of all the vectors in the noise feature set as the target weight.
  • in the left-hand diagram of Figure 4, for example, the entity feature set includes the target entity feature vectors i'1, i'2 and i'3, which together make up the total number of target entity feature vectors in the entity feature set.
  • step S104 includes: substituting the target weight into the noise entity recognition algorithm to determine the noise entity.
  • substituting the target weight into the noise entity recognition algorithm effectively distinguishes core entities dominated by target entity features from noise entities dominated by target noise features, so that noise entities can be recognized effectively.
  • FIG. 5 shows a schematic structural diagram of a noise data identification device provided by an embodiment of the present invention. For convenience of description, only the parts related to the embodiment of the present invention are shown.
  • a noise data identification device includes a preprocessing module 51, a set determination module 52, a weight determination module 53 and a noise identification module 54, specifically as follows:
  • the preprocessing module 51 is used to sequentially perform vectorization and feature processing on the initial entity information to obtain a target entity feature vector and a target noise feature vector;
  • the set determination module 52 is used to classify the target entity feature vector and the target noise feature vector using a deep learning algorithm, and obtain the entity feature set corresponding to the target entity feature vector and the noise feature set corresponding to the target noise feature vector;
  • the weight determination module 53 is used to determine the target weight based on the entity feature set and the noise feature set;
  • the noise identification module 54 is configured to determine the noise entity based on the target weight and the noise entity identification algorithm.
  • the preprocessing module 51 includes:
  • the vectorization sub-module is used to sequentially perform low-dimensional space vectorization and high-dimensional space vectorization on the initial entity information to obtain the initial entity feature vector and the initial noise feature vector;
  • the feature processing sub-module is used to perform feature processing on the initial entity feature vector and the initial noise feature vector by using the loss function to obtain the target entity feature vector and the target noise feature vector.
  • the feature processing submodule includes:
  • a parameter determination unit is used to determine the loss weight corresponding to the initial entity feature vector and the initial noise feature vector
  • the target vector determination unit is used to expand the distance between the initial entity feature vector and the initial noise feature vector by using the loss weight to obtain the target entity feature vector and the target noise feature vector.
  • the set determination module 52 includes:
  • the first set determination submodule is used to aggregate the target entity feature vectors to determine the entity feature set if the target entity feature vector is less than or equal to the preset entity feature vector;
  • the second set determination sub-module is configured to aggregate the target noise feature vectors to determine a noise feature set if the target noise feature vector is greater than the preset noise feature vector.
  • the weight determination module 53 includes:
  • the vector number determination submodule is used to determine the first vector number corresponding to the entity feature set and the second vector number corresponding to the noise feature set;
  • the first judging submodule is used to use the sum of all vectors in the entity feature set as the target weight if the first vector number is greater than or equal to the second vector number;
  • the second judging sub-module is configured to use the negative value of the sum of all vectors in the noise feature set as the target weight if the number of the first vectors is less than the number of the second vectors.
  • the noise identification module 54 includes:
  • the noise recognition sub-module is used to substitute the target weight into the noise entity recognition algorithm to determine the noise entity.
  • Fig. 6 is a schematic diagram of a terminal provided by an embodiment of the present invention.
  • the terminal 6 of this embodiment includes: a processor 60 , a memory 61 , and a computer program 62 stored in the memory 61 and operable on the processor 60 .
  • when the processor 60 executes the computer program 62, the steps in the above embodiments of the noise data identification method are implemented, for example steps S101 to S104 shown in FIG. 1.
  • alternatively, when the processor 60 executes the computer program 62, the functions of the modules/units in the above device embodiments are realized, for example the functions of the modules/units 51 to 54 shown in FIG. 5.
  • the present invention also provides a readable storage medium, wherein a computer program is stored in the readable storage medium, and when the computer program is executed by a processor, it is used to implement the methods provided by the above-mentioned various embodiments.
  • the readable storage medium may be a computer storage medium, or a communication medium.
  • Communication media includes any medium that facilitates transfer of a computer program from one place to another.
  • Computer storage media can be any available media that can be accessed by a general purpose or special purpose computer.
  • a readable storage medium is coupled to the processor such that the processor can read information from, and write information to, the readable storage medium.
  • the readable storage medium can also be a component of the processor.
  • the processor and the readable storage medium may be located in an application-specific integrated circuit (ASIC). Additionally, the ASIC may be located in the user equipment.
  • the processor and the readable storage medium can also exist in the communication device as discrete components.
  • the readable storage medium may be read only memory (ROM), random access memory (RAM), CD-ROM, magnetic tape, floppy disk, and optical data storage devices, among others.
  • the present invention also provides a program product, which includes execution instructions, and the execution instructions are stored in a readable storage medium.
  • At least one processor of the device may read the execution instruction from the readable storage medium, and the at least one processor executes the execution instruction so that the device implements the methods provided in the foregoing various implementation manners.
  • the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), etc.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like. The steps of the method disclosed in conjunction with the present invention can be directly implemented by a hardware processor, or implemented by a combination of hardware and software modules in the processor.

Abstract

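The present application discloses a noise data identification method, device, terminal and storage medium. The method includes: sequentially performing vectorization and feature processing on initial entity information to obtain a target entity feature vector and a target noise feature vector; classifying the target entity feature vector and the target noise feature vector using a deep learning algorithm to obtain an entity feature set corresponding to the target entity feature vector and a noise feature set corresponding to the target noise feature vector; determining a target weight based on the entity feature set and the noise feature set; and determining a noise entity based on the target weight and a noise entity recognition algorithm. The invention can effectively distinguish core entities dominated by real features from noise entities dominated by noise features, thereby improving noise entity recognition and, in turn, the accuracy of subsequent model processing.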

Description

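Noise data identification method, device, terminal and storage medium
Cross-reference to related applications
This application claims priority to Chinese patent application No. 202111418593X, entitled "Noise data identification method, device, terminal and storage medium", filed with the China Patent Office on November 25, 2021, the entire contents of which are incorporated herein by reference.
Technical field
The present invention relates to the technical field of data processing, and in particular to a noise data identification method, device, terminal and storage medium.
Background
Models are widely used in many fields, but the data available before modeling often contains a large amount of noise and must be denoised. In particular, how to denoise data that has been vectorized into a high-dimensional space has become an urgent problem.
At present, high-dimensional vectorized data is generally denoised using standard-deviation denoising, binning denoising, DBSCAN denoising or isolation forest denoising.
However, these methods denoise high-dimensional vectorized data poorly.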
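Summary of the invention
The main purpose of the present application is to provide a noise data identification method, device, terminal and storage medium that address the poor denoising of high-dimensional vectorized data in the related art.
To achieve the above purpose, in a first aspect, the present application provides a noise data identification method, including:
sequentially performing vectorization and feature processing on initial entity information to obtain a target entity feature vector and a target noise feature vector;
classifying the target entity feature vector and the target noise feature vector using a deep learning algorithm to obtain an entity feature set corresponding to the target entity feature vector and a noise feature set corresponding to the target noise feature vector;
determining a target weight based on the entity feature set and the noise feature set;
determining a noise entity based on the target weight and a noise entity recognition algorithm.
In a possible implementation, sequentially performing vectorization and feature processing on the initial entity information to obtain the target entity feature vector and the target noise feature vector includes:
sequentially performing low-dimensional space vectorization and high-dimensional space vectorization on the initial entity information to obtain an initial entity feature vector and an initial noise feature vector;
performing feature processing on the initial entity feature vector and the initial noise feature vector using a loss function to obtain the target entity feature vector and the target noise feature vector.
In a possible implementation, performing feature processing on the initial entity feature vector and the initial noise feature vector using the loss function to obtain the target entity feature vector and the target noise feature vector includes:
determining a loss weight corresponding to the initial entity feature vector and the initial noise feature vector;
using the loss weight to enlarge the distance between the initial entity feature vector and the initial noise feature vector to obtain the target entity feature vector and the target noise feature vector.
In a possible implementation, classifying the target entity feature vector and the target noise feature vector using the deep learning algorithm to obtain the entity feature set corresponding to the target entity feature vector and the noise feature set corresponding to the target noise feature vector includes:
if the target entity feature vector is less than or equal to a preset entity feature vector, aggregating the target entity feature vectors to determine the entity feature set;
if the target noise feature vector is greater than a preset noise feature vector, aggregating the target noise feature vectors to determine the noise feature set.
In a possible implementation, determining the target weight based on the entity feature set and the noise feature set includes:
determining a first vector number corresponding to the entity feature set and a second vector number corresponding to the noise feature set;
if the first vector number is greater than or equal to the second vector number, using the sum of all vectors in the entity feature set as the target weight;
if the first vector number is smaller than the second vector number, using the negative of the sum of all vectors in the noise feature set as the target weight.
In a possible implementation, determining the noise entity based on the target weight and the noise entity recognition algorithm includes:
substituting the target weight into the noise entity recognition algorithm to determine the noise entity.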
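In a second aspect, an embodiment of the present invention provides a noise data identification device, including:
a preprocessing module, configured to sequentially perform vectorization and feature processing on initial entity information to obtain a target entity feature vector and a target noise feature vector;
a set determination module, configured to classify the target entity feature vector and the target noise feature vector using a deep learning algorithm to obtain an entity feature set corresponding to the target entity feature vector and a noise feature set corresponding to the target noise feature vector;
a weight determination module, configured to determine a target weight based on the entity feature set and the noise feature set;
a noise identification module, configured to determine a noise entity based on the target weight and a noise entity recognition algorithm.
In a possible implementation, the preprocessing module includes:
a vectorization sub-module, configured to sequentially perform low-dimensional space vectorization and high-dimensional space vectorization on the initial entity information to obtain an initial entity feature vector and an initial noise feature vector;
a feature processing sub-module, configured to perform feature processing on the initial entity feature vector and the initial noise feature vector using a loss function to obtain the target entity feature vector and the target noise feature vector.
In a third aspect, an embodiment of the present invention provides a terminal, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the steps of any of the noise data identification methods above.
In a fourth aspect, an embodiment of the present invention provides a computer-readable storage medium storing a computer program which, when executed by a processor, implements the steps of any of the noise data identification methods above.
The embodiments of the present invention provide a noise data identification method, device, terminal and storage medium, including: sequentially performing vectorization and feature processing on initial entity information to obtain a target entity feature vector and a target noise feature vector; then classifying the target entity feature vector and the target noise feature vector using a deep learning algorithm to obtain an entity feature set corresponding to the target entity feature vector and a noise feature set corresponding to the target noise feature vector; then determining a target weight based on the entity feature set and the noise feature set; and finally determining a noise entity based on the target weight and a noise entity recognition algorithm. The present invention vectorizes the initial entity information into a high-dimensional space, which effectively separates entity feature vectors from noise feature vectors, and then uses a deep learning algorithm to separate them further. Finally, substituting the target weight into the noise entity recognition algorithm effectively distinguishes core entities dominated by real features from noise entities dominated by noise features, thereby improving noise entity recognition and, in turn, the accuracy of subsequent model processing.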
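Brief description of the drawings
To explain the specific embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed to describe them are briefly introduced below. The drawings described below show only some embodiments of the present invention; a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of an implementation of a noise data identification method provided by an embodiment of the present invention;
Fig. 2 is a schematic diagram of mapping entities from a low-dimensional space to a high-dimensional space provided by an embodiment of the present invention;
Fig. 3 is a schematic diagram of determining the target feature vectors corresponding to an entity based on a loss function provided by an embodiment of the present invention;
Fig. 4 is a schematic diagram of determining a target weight based on the feature sets corresponding to an entity provided by an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a noise data identification device provided by an embodiment of the present invention;
Fig. 6 is a schematic diagram of a terminal provided by an embodiment of the present invention.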
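Detailed description
To help those skilled in the art better understand the solution of the present invention, the technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on these embodiments without creative effort shall fall within the scope of protection of the present invention.
It should be noted that the terms "first", "second" and the like in the description, the claims and the above drawings are used to distinguish similar objects and do not necessarily describe a specific order or sequence. Data used in this way may be interchanged where appropriate so that the embodiments of the present invention described here can be implemented. In addition, the terms "include" and "have" and any variants of them are intended to cover non-exclusive inclusion. For example, a process, method, system, product or device that includes a series of steps or units is not necessarily limited to the steps or units explicitly listed, and may include other steps or units that are not explicitly listed or that are inherent to the process, method, product or device.
It should be noted that, where no conflict arises, the embodiments of the present invention and the features in those embodiments may be combined with one another. The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.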
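In one embodiment, as shown in Fig. 1, a noise data identification method is provided, including the following steps:
Step S101: sequentially perform vectorization and feature processing on initial entity information to obtain a target entity feature vector and a target noise feature vector;
Step S102: classify the target entity feature vector and the target noise feature vector using a deep learning algorithm to obtain an entity feature set corresponding to the target entity feature vector and a noise feature set corresponding to the target noise feature vector;
Step S103: determine a target weight based on the entity feature set and the noise feature set;
Step S104: determine a noise entity based on the target weight and a noise entity recognition algorithm.
Specifically, deep learning (DL) is a newer research direction within machine learning (ML); it was introduced to bring machine learning closer to its original goal, artificial intelligence (AI). Deep learning learns the inherent patterns and levels of representation in sample data. The information obtained during learning greatly helps in interpreting data such as text, images and sound. Its ultimate goal is to give machines a human-like ability to analyze and learn, so that they can recognize data such as text, images and sound. Based on a deep learning algorithm, the present invention can learn the characteristics of entity feature vectors and of noise feature vectors and thereby classify the target entity feature vector and the target noise feature vector. This effectively distinguishes the two and improves the accuracy of the entity feature set corresponding to the target entity feature vector and of the noise feature set corresponding to the target noise feature vector.
An embodiment of the present invention provides a noise data identification method, including: sequentially performing vectorization and feature processing on initial entity information to obtain a target entity feature vector and a target noise feature vector; then classifying the target entity feature vector and the target noise feature vector using a deep learning algorithm to obtain an entity feature set corresponding to the target entity feature vector and a noise feature set corresponding to the target noise feature vector; then determining a target weight based on the entity feature set and the noise feature set; and finally determining a noise entity based on the target weight and a noise entity recognition algorithm. The present invention vectorizes the initial entity information into a high-dimensional space, which effectively separates entity feature vectors from noise feature vectors, and then uses a deep learning algorithm to separate them further. Finally, substituting the target weight into the noise entity recognition algorithm effectively distinguishes core entities dominated by real features from noise entities dominated by noise features, thereby improving noise entity recognition and, in turn, the accuracy of subsequent model processing.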
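In an embodiment, step S101 includes:
Step S201: sequentially perform low-dimensional space vectorization and high-dimensional space vectorization on the initial entity information to obtain an initial entity feature vector and an initial noise feature vector.
With reference to Fig. 2, the vectorization in the present invention includes low-dimensional space vectorization and high-dimensional space vectorization. Low-dimensional space vectorization alone can identify the entity information in the initial entity information but cannot clearly identify the noise information. The initial entity information is therefore vectorized first in a low-dimensional space and then in a high-dimensional space, yielding the initial entity feature vectors (solid arrows in Fig. 2) and the initial noise feature vectors (dotted arrows in Fig. 2).
Step S202: perform feature processing on the initial entity feature vector and the initial noise feature vector using a loss function to obtain the target entity feature vector and the target noise feature vector.
Specifically, the loss weight corresponding to the initial entity feature vector and the initial noise feature vector is determined first; the loss weight is then used to enlarge the distance between the initial entity feature vector and the initial noise feature vector, giving the target entity feature vector and the target noise feature vector. By introducing a loss function (namely, an entity high-dimensional vector loss function), this patent reduces noise features under unsupervised conditions and improves the spatial representation of effective features, thereby reducing the influence of noise vectors on the final entity classification result.
Further, the process of determining the target entity feature vector i' and the target noise feature vector j' is explained with reference to Fig. 3, as follows:
Design of the loss function: a loss function is designed for a single feature dimension, with the following formula:
Figure PCTCN2021141769-appb-000001
Here z represents the value of the original feature vector in a single dimension, and e is the natural constant. The role of the loss function is to add a loss weight ω to the initial entity feature vector i (i.e. i1 and i2 in Fig. 3) and the initial noise feature vector j (i.e. j1-j5 in Fig. 3). The loss weight ω further enlarges the feature distance between the initial entity feature vector i and the initial noise feature vector j in the same space, which makes it easier to determine the target entity feature vector i' (i.e. i'1 and i'2 in Fig. 3) and the target noise feature vector j' (i.e. j'1-j'5 in Fig. 3).
After the target entity feature vector and the target noise feature vector are obtained through the previous embodiment, a deep learning algorithm is used to classify them in order to determine the entity feature set corresponding to the target entity feature vector and the noise feature set corresponding to the target noise feature vector.
In an embodiment, step S102 includes: if the target entity feature vector is less than or equal to a preset entity feature vector, aggregating the target entity feature vectors to determine the entity feature set; if the target noise feature vector is greater than a preset noise feature vector, aggregating the target noise feature vectors to determine the noise feature set.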
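The threshold comparison in step S102 can be illustrated with a short Python sketch. The text does not state how one feature vector is compared with a preset vector, so the sketch assumes, purely for illustration, that vectors are compared by Euclidean norm. The names `norm` and `build_feature_sets` are hypothetical, and no neural network is involved; only the aggregation rule is shown.

```python
import math
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def norm(v: Vector) -> float:
    """Euclidean norm, used here as an ASSUMED way to compare vectors."""
    return math.sqrt(sum(x * x for x in v))


def build_feature_sets(
    entity_vectors: Sequence[Vector],
    noise_vectors: Sequence[Vector],
    preset_entity: Vector,
    preset_noise: Vector,
) -> Tuple[List[Vector], List[Vector]]:
    """Hypothetical helper: aggregate vectors into the two feature sets."""
    # Entity vectors no larger than the preset entity vector are kept.
    entity_set = [v for v in entity_vectors if norm(v) <= norm(preset_entity)]
    # Noise vectors strictly larger than the preset noise vector are kept.
    noise_set = [v for v in noise_vectors if norm(v) > norm(preset_noise)]
    return entity_set, noise_set


entity_set, noise_set = build_feature_sets(
    entity_vectors=[[3.0, 4.0], [6.0, 8.0]],
    noise_vectors=[[1.0, 0.0], [0.0, 3.0]],
    preset_entity=[0.0, 5.0],
    preset_noise=[2.0, 0.0],
)
print(entity_set, noise_set)
```

Other comparison rules (for example, element-wise comparison) would fit the wording equally well; the norm is chosen only because it yields a single ordering.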
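In an embodiment, step S103 includes:
Step S301: determine a first vector number corresponding to the entity feature set and a second vector number corresponding to the noise feature set.
Here the first vector number is the total number of target entity feature vectors contained in the entity feature set, and the second vector number is the total number of target noise feature vectors contained in the noise feature set.
Step S302: if the first vector number is greater than or equal to the second vector number, use the sum of all vectors in the entity feature set as the target weight;
Step S303: if the first vector number is smaller than the second vector number, use the negative of the sum of all vectors in the noise feature set as the target weight.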
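Steps S301 to S303 can be sketched in Python as follows. This is a minimal illustration rather than the implementation of the application: the function names `vector_sum` and `target_weight`, the explicit `dim` parameter, and the representation of feature vectors as equal-length lists of floats are assumptions made for the example.

```python
from typing import List, Sequence

Vector = Sequence[float]


def vector_sum(vectors: Sequence[Vector], dim: int) -> List[float]:
    """Element-wise sum of vectors; an empty input yields the zero vector."""
    total = [0.0] * dim
    for v in vectors:
        for k, value in enumerate(v):
            total[k] += value
    return total


def target_weight(entity_set: Sequence[Vector],
                  noise_set: Sequence[Vector],
                  dim: int) -> List[float]:
    """Hypothetical helper following steps S301 to S303."""
    first_count = len(entity_set)    # S301: first vector number
    second_count = len(noise_set)    # S301: second vector number
    if first_count >= second_count:  # S302: sum of the entity vectors
        return vector_sum(entity_set, dim)
    # S303: negative of the sum of the noise vectors
    return [-x for x in vector_sum(noise_set, dim)]


entity = [[1.0, 2.0], [0.5, 0.5], [1.5, 0.0]]
noise = [[1.0, 0.0], [0.0, 2.0]]
print(target_weight(entity, noise, dim=2))      # three entity vectors vs two noise vectors
print(target_weight(entity[:1], noise, dim=2))  # one entity vector vs two noise vectors
```

The sign of the result carries the outcome of the count comparison: an entity-dominated input gives the plain sum of its entity vectors, while a noise-dominated input gives a negated sum.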
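Determining the target weight is illustrated below using Fig. 4 as an example:
(1) The left-hand diagram in Fig. 4 shows that the entity feature set includes target entity feature vectors i'1, i'2 and i'3, so the total number of target entity feature vectors in the entity feature set is Σi=3; the noise feature set includes target noise feature vectors j'3 and j'4, so the total number of target noise feature vectors in the noise feature set is Σj=2. The first vector number is therefore greater than the second vector number, and the target weight is
Figure PCTCN2021141769-appb-000002
(2) The right-hand diagram in Fig. 4 shows that the entity feature set includes target entity feature vector i'1, so the total number of target entity feature vectors in the entity feature set is Σi=1; the noise feature set includes target noise feature vectors j'1, j'2, j'3, j'4 and j'5, so the total number of target noise feature vectors in the noise feature set is Σj=5. The first vector number is therefore smaller than the second vector number, and the target weight is
Figure PCTCN2021141769-appb-000003
It should be noted that in this application j' and j’ have the same meaning, as do i' and i’.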
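The formula images for the two Fig. 4 cases are not reproduced in this text. Applying steps S302 and S303 to the vectors listed for each case gives the expressions below. This is a reading of the stated rule, not a transcription of the figures, and the set symbols E (entity feature set) and N (noise feature set) are notation introduced here for convenience.

```latex
\omega =
\begin{cases}
  \displaystyle \sum_{i' \in E} i', & |E| \ge |N| \\[2ex]
  \displaystyle -\sum_{j' \in N} j', & |E| < |N|
\end{cases}
\qquad
\begin{aligned}
  \text{Case (1):}\quad & \omega = i'_1 + i'_2 + i'_3 \\
  \text{Case (2):}\quad & \omega = -\left(j'_1 + j'_2 + j'_3 + j'_4 + j'_5\right)
\end{aligned}
```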
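In an embodiment, step S104 includes: substituting the target weight into the noise entity recognition algorithm to determine the noise entity.
Specifically, substituting the target weight
Figure PCTCN2021141769-appb-000004
into the noise entity recognition algorithm effectively distinguishes core entities dominated by target entity features from noise entities dominated by target noise features, so that noise entities can be recognized effectively.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution. The order in which the processes are executed should be determined by their functions and internal logic, and should not limit the implementation of the embodiments of the present invention in any way.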
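The following are device embodiments of the present invention. For details not described exhaustively here, refer to the corresponding method embodiments above.
Fig. 5 shows a schematic structural diagram of a noise data identification device provided by an embodiment of the present invention. For ease of description, only the parts related to the embodiment of the present invention are shown. The noise data identification device includes a preprocessing module 51, a set determination module 52, a weight determination module 53 and a noise identification module 54, specifically as follows:
the preprocessing module 51 is configured to sequentially perform vectorization and feature processing on initial entity information to obtain a target entity feature vector and a target noise feature vector;
the set determination module 52 is configured to classify the target entity feature vector and the target noise feature vector using a deep learning algorithm to obtain an entity feature set corresponding to the target entity feature vector and a noise feature set corresponding to the target noise feature vector;
the weight determination module 53 is configured to determine a target weight based on the entity feature set and the noise feature set;
the noise identification module 54 is configured to determine a noise entity based on the target weight and a noise entity recognition algorithm.
In a possible implementation, the preprocessing module 51 includes:
a vectorization sub-module, configured to sequentially perform low-dimensional space vectorization and high-dimensional space vectorization on the initial entity information to obtain an initial entity feature vector and an initial noise feature vector;
a feature processing sub-module, configured to perform feature processing on the initial entity feature vector and the initial noise feature vector using a loss function to obtain the target entity feature vector and the target noise feature vector.
In a possible implementation, the feature processing sub-module includes:
a parameter determination unit, configured to determine the loss weight corresponding to the initial entity feature vector and the initial noise feature vector;
a target vector determination unit, configured to use the loss weight to enlarge the distance between the initial entity feature vector and the initial noise feature vector to obtain the target entity feature vector and the target noise feature vector.
In a possible implementation, the set determination module 52 includes:
a first set determination sub-module, configured to aggregate the target entity feature vectors to determine the entity feature set if the target entity feature vector is less than or equal to a preset entity feature vector;
a second set determination sub-module, configured to aggregate the target noise feature vectors to determine the noise feature set if the target noise feature vector is greater than a preset noise feature vector.
In a possible implementation, the weight determination module 53 includes:
a vector number determination sub-module, configured to determine a first vector number corresponding to the entity feature set and a second vector number corresponding to the noise feature set;
a first judging sub-module, configured to use the sum of all vectors in the entity feature set as the target weight if the first vector number is greater than or equal to the second vector number;
a second judging sub-module, configured to use the negative of the sum of all vectors in the noise feature set as the target weight if the first vector number is smaller than the second vector number.
In a possible implementation, the noise identification module 54 includes:
a noise identification sub-module, configured to substitute the target weight into the noise entity recognition algorithm to determine the noise entity.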
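Fig. 6 is a schematic diagram of a terminal provided by an embodiment of the present invention. As shown in Fig. 6, the terminal 6 of this embodiment includes a processor 60, a memory 61, and a computer program 62 stored in the memory 61 and executable on the processor 60. When the processor 60 executes the computer program 62, the steps in the above noise data identification method embodiments are implemented, for example steps S101 to S104 shown in Fig. 1. Alternatively, when the processor 60 executes the computer program 62, the functions of the modules/units in the above device embodiments are realized, for example the functions of modules/units 51 to 54 shown in Fig. 5.
The present invention also provides a readable storage medium storing a computer program which, when executed by a processor, implements the methods provided by the various embodiments above.
The readable storage medium may be a computer storage medium or a communication medium. Communication media include any medium that facilitates transfer of a computer program from one place to another. Computer storage media may be any available media that can be accessed by a general-purpose or special-purpose computer. For example, a readable storage medium is coupled to the processor so that the processor can read information from, and write information to, the readable storage medium. The readable storage medium may also be a component of the processor. The processor and the readable storage medium may be located in an application-specific integrated circuit (ASIC), and the ASIC may be located in user equipment. The processor and the readable storage medium may also exist as discrete components in a communication device. The readable storage medium may be read-only memory (ROM), random access memory (RAM), CD-ROM, magnetic tape, a floppy disk, an optical data storage device, and so on.
The present invention also provides a program product that includes execution instructions stored in a readable storage medium. At least one processor of a device can read the execution instructions from the readable storage medium, and execution of the instructions by the at least one processor causes the device to implement the methods provided by the various embodiments above.
In the above device embodiments, it should be understood that the processor may be a central processing unit (CPU), another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), etc. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor. The steps of the methods disclosed in connection with the present invention may be performed directly by a hardware processor, or by a combination of hardware and software modules in the processor.
Although the embodiments of the present invention have been described with reference to the drawings, those skilled in the art can make various modifications and variations without departing from the spirit and scope of the present invention, and such modifications and variations all fall within the scope defined by the appended claims.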

Claims (14)

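  1. A noise data identification method, characterized by comprising:
    performing vectorization and feature processing in sequence on initial entity information to obtain a target entity feature vector and a target noise feature vector;
    classifying the target entity feature vector and the target noise feature vector by using a deep learning algorithm to obtain an entity feature set corresponding to the target entity feature vector and a noise feature set corresponding to the target noise feature vector;
    determining a target weight based on the entity feature set and the noise feature set;
    determining a noise entity based on the target weight and a noise entity recognition algorithm.
  2. The noise data identification method according to claim 1, characterized in that said performing vectorization and feature processing in sequence on initial entity information to obtain a target entity feature vector and a target noise feature vector comprises:
    performing low-dimensional space vectorization and then high-dimensional space vectorization on the initial entity information to obtain an initial entity feature vector and an initial noise feature vector;
    performing feature processing on the initial entity feature vector and the initial noise feature vector by using a loss function to obtain the target entity feature vector and the target noise feature vector.
  3. The noise data identification method according to claim 2, characterized in that said performing feature processing on the initial entity feature vector and the initial noise feature vector by using a loss function to obtain the target entity feature vector and the target noise feature vector comprises:
    determining a loss weight corresponding to the initial entity feature vector and the initial noise feature vector;
    using the loss weight to enlarge the distance between the initial entity feature vector and the initial noise feature vector to obtain the target entity feature vector and the target noise feature vector.
  4. The noise data identification method according to claim 3, characterized in that said classifying the target entity feature vector and the target noise feature vector by using a deep learning algorithm to obtain an entity feature set corresponding to the target entity feature vector and a noise feature set corresponding to the target noise feature vector comprises:
    if the target entity feature vector is less than or equal to a preset entity feature vector, aggregating the target entity feature vectors to determine the entity feature set;
    if the target noise feature vector is greater than a preset noise feature vector, aggregating the target noise feature vectors to determine the noise feature set.
  5. The noise data identification method according to claim 4, characterized in that said determining a target weight based on the entity feature set and the noise feature set comprises:
    determining a first vector number corresponding to the entity feature set and a second vector number corresponding to the noise feature set;
    if the first vector number is greater than or equal to the second vector number, taking the sum of all vectors in the entity feature set as the target weight;
    if the first vector number is less than the second vector number, taking the negative of the sum of all vectors in the noise feature set as the target weight.
  6. The noise data identification method according to claim 5, characterized in that said determining a noise entity based on the target weight and a noise entity recognition algorithm comprises:
    substituting the target weight into the noise entity recognition algorithm to determine the noise entity.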
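  7. A noise data identification device, characterized by comprising:
    a preprocessing module, configured to perform vectorization and feature processing in sequence on initial entity information to obtain a target entity feature vector and a target noise feature vector;
    a set determination module, configured to classify the target entity feature vector and the target noise feature vector by using a deep learning algorithm to obtain an entity feature set corresponding to the target entity feature vector and a noise feature set corresponding to the target noise feature vector;
    a weight determination module, configured to determine a target weight based on the entity feature set and the noise feature set;
    a noise recognition module, configured to determine a noise entity based on the target weight and a noise entity recognition algorithm.
  8. The noise data identification device according to claim 7, characterized in that the preprocessing module comprises:
    a vectorization submodule, configured to perform low-dimensional space vectorization and then high-dimensional space vectorization on the initial entity information to obtain an initial entity feature vector and an initial noise feature vector;
    a feature processing submodule, configured to perform feature processing on the initial entity feature vector and the initial noise feature vector by using a loss function to obtain the target entity feature vector and the target noise feature vector.
  9. The noise data identification device according to claim 8, characterized in that the feature processing submodule comprises:
    a parameter determination unit, configured to determine a loss weight corresponding to the initial entity feature vector and the initial noise feature vector;
    a target vector determination unit, configured to use the loss weight to enlarge the distance between the initial entity feature vector and the initial noise feature vector to obtain the target entity feature vector and the target noise feature vector.
  10. The noise data identification device according to claim 7, characterized in that the set determination module comprises:
    a first set determination submodule, configured to, if the target entity feature vector is less than or equal to a preset entity feature vector, aggregate the target entity feature vectors to determine the entity feature set;
    a second set determination submodule, configured to, if the target noise feature vector is greater than a preset noise feature vector, aggregate the target noise feature vectors to determine the noise feature set.
  11. The noise data identification device according to claim 7, characterized in that the weight determination module comprises:
    a vector number determination submodule, configured to determine a first vector number corresponding to the entity feature set and a second vector number corresponding to the noise feature set;
    a first judgment submodule, configured to, if the first vector number is greater than or equal to the second vector number, take the sum of all vectors in the entity feature set as the target weight;
    a second judgment submodule, configured to, if the first vector number is less than the second vector number, take the negative of the sum of all vectors in the noise feature set as the target weight.
  12. The noise data identification device according to claim 7, characterized in that the noise recognition module comprises:
    a noise recognition submodule, configured to substitute the target weight into the noise entity recognition algorithm to determine the noise entity.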
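  13. A terminal, comprising a memory, a processor, and a computer program stored in the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the noise data identification method according to any one of claims 1 to 6.
  14. A computer-readable storage medium storing a computer program, characterized in that the computer program, when executed by a processor, implements the steps of the noise data identification method according to any one of claims 1 to 6.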
PCT/CN2021/141769 2021-11-25 2021-12-27 Noise data identification method, device, terminal and storage medium WO2023092795A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111418593.X 2021-11-25
CN202111418593.XA CN114154569B (zh) 2021-11-25 2021-11-25 Noise data identification method, device, terminal and storage medium

Publications (1)

Publication Number Publication Date
WO2023092795A1 (zh) 2023-06-01

Family

ID=80458060

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/141769 WO2023092795A1 (zh) 2021-11-25 2021-12-27 Noise data identification method, device, terminal and storage medium

Country Status (2)

Country Link
CN (1) CN114154569B (zh)
WO (1) WO2023092795A1 (zh)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
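CN108897045A (zh) * 2018-08-28 2018-11-27 中国石油天然气股份有限公司 Deep learning model training method and seismic data denoising method, apparatus and device
CN110858391A (zh) * 2018-08-23 2020-03-03 通用电气公司 Patient-specific deep learning image denoising method and system
CN112330569A (zh) * 2020-11-27 2021-02-05 上海眼控科技股份有限公司 Model training method, text denoising method, apparatus, device and storage medium
CN112801888A (zh) * 2021-01-06 2021-05-14 杭州海康威视数字技术股份有限公司 Image processing method and apparatus, computer device and storage medium
US20210241780A1 (en) * 2020-01-31 2021-08-05 Nuance Communications, Inc. Method And System For Speech Enhancement
CN113412491A (zh) * 2018-12-18 2021-09-17 诺基亚技术有限公司 Machine learning-based data denoising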

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
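KR100455294B1 (ko) * 2002-12-06 2004-11-06 삼성전자주식회사 User detection method, motion detection method and user detection apparatus in a surveillance system
CN102411711B (zh) * 2012-01-04 2013-10-23 山东大学 Finger vein recognition method based on personalized weights
CN102607531B (zh) * 2012-03-19 2013-08-14 中国科学院上海技术物理研究所 Space-borne low-speed, high-precision two-dimensional image motion compensation pointing control system
CN106782504B (zh) * 2016-12-29 2019-01-22 百度在线网络技术(北京)有限公司 Speech recognition method and apparatus
CN107705212B (zh) * 2017-07-07 2021-06-15 江苏开放大学 Role identification method based on particle swarm random walk
CN111737552A (zh) * 2020-06-04 2020-10-02 中国科学院自动化研究所 Method, apparatus and device for training an information extraction model and acquiring a knowledge graph
CN111782826A (zh) * 2020-08-27 2020-10-16 清华大学 Knowledge graph information processing method, apparatus, device and storage medium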

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
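CN110858391A (zh) * 2018-08-23 2020-03-03 通用电气公司 Patient-specific deep learning image denoising method and system
CN108897045A (zh) * 2018-08-28 2018-11-27 中国石油天然气股份有限公司 Deep learning model training method and seismic data denoising method, apparatus and device
CN113412491A (zh) * 2018-12-18 2021-09-17 诺基亚技术有限公司 Machine learning-based data denoising
US20210241780A1 (en) * 2020-01-31 2021-08-05 Nuance Communications, Inc. Method And System For Speech Enhancement
CN112330569A (zh) * 2020-11-27 2021-02-05 上海眼控科技股份有限公司 Model training method, text denoising method, apparatus, device and storage medium
CN112801888A (zh) * 2021-01-06 2021-05-14 杭州海康威视数字技术股份有限公司 Image processing method and apparatus, computer device and storage medium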

Also Published As

Publication number Publication date
CN114154569B (zh) 2024-02-02
CN114154569A (zh) 2022-03-08

Similar Documents

Publication Publication Date Title
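WO2020140372A1 (zh) Intent recognition method based on a recognition model, recognition device and medium
WO2021072885A1 (zh) Method, apparatus, device and storage medium for recognizing text
WO2020073507A1 (zh) Text classification method and terminal
CN109101817B (zh) Method and computing device for identifying malicious file categories
WO2022126810A1 (zh) Text clustering method
WO2022042123A1 (zh) Image recognition model generation method and apparatus, computer device and storage medium
WO2021143267A1 (zh) Fine-grained classification model processing method based on image detection, and related devices
WO2017017682A1 (en) Data fusion and classification with imbalanced datasets background
EP2370932B1 (en) Method, apparatus and computer program product for providing face pose estimation
CN111968625A (zh) Training method and recognition method for a sensitive audio recognition model fusing text information
WO2021068563A1 (zh) Sample data processing method and apparatus, computer device and storage medium
CN112749300B (zh) Method, apparatus, device, storage medium and program product for video classification
CN110059156A (zh) Collaborative retrieval method, apparatus and device based on associated words, and readable storage medium
WO2021135271A1 (zh) Classification model training method and system, electronic device and storage medium
US20220156634A1 (en) Training Data Augmentation for Machine Learning
CN112668482A (zh) Face recognition training method and apparatus, computer device and storage medium
CN113449840A (zh) Neural network training method and apparatus, and image classification method and apparatus
WO2022116444A1 (zh) Text classification method and apparatus, computer device and medium
CN112926592B (zh) Trademark retrieval method and apparatus based on an improved Fast algorithm
CN110020593A (zh) Information processing method and apparatus, medium and computing device
WO2023092795A1 (zh) Noise data identification method, device, terminal and storage medium
WO2020087949A1 (zh) Database update method and apparatus, electronic device and computer storage medium
CN111277433A (zh) Network service anomaly detection method and apparatus based on attributed network representation learning
CN107665443B (zh) Method and apparatus for acquiring target users
CN116451072A (zh) Structured sensitive data identification method and apparatus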